Elastic search

Revision as of 11:36, 10 March 2019

Introduction

Adding Elastic search to your existing TS installation provides free-text search in both data and files.

Files are indexed together with the data in the records, so a record can be found either by its record values (name, phone etc.) or by search hits in files attached to that record. Results are filtered in real time according to the current security model, so no reindexing is needed if security settings change.

Install

To index records and files you need to complete these steps:

Install standalone Elastic search server
Install and configure Tempus Serva file indexing
Configure the Tempus Serva installation

Finally, you may want to install optional components to handle OCR (scanned PDFs and images).

Install Elastic search

Java 8 / Elastic search 6

This is the recommended version but requires Java 8.

Follow this guide:

 https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html

Java 7 / Elastic search 1.7

This is an alternative version for servers that only have Java 7.

Install and unpack files

 sudo wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.6.tar.gz
 tar -xvf elasticsearch-1.7.6.tar.gz
 sudo rm elasticsearch-1.7.6.tar.gz

Run as a daemon

 elasticsearch-1.7.6/bin/elasticsearch -d

Test that the service is running

 curl 'http://localhost:9200/?pretty'
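To confirm which version answered, the version number can be pulled out of the JSON reply. This is a minimal sketch that assumes the reply contains a "number" field inside "version", as Elasticsearch 1.x and 6.x replies do. The helper name es_version is made up for this example.

```shell
# Print the value of the "number" field from the JSON read on stdin.
# Plain sed is used so that no JSON tool has to be installed.
es_version() {
    sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (assumes the server listens on localhost:9200):
#   curl -s 'http://localhost:9200/?pretty' | es_version
```

An empty result means that the service did not answer or that the reply had a different shape.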

Install TS indexing service

Install war file

 cd /usr/share/tomcat6/webapps/
 sudo wget https://www.tempusserva.dk/install/tsFileIndexingService.war

A few seconds later (once the war file has been deployed) you can configure the data connection and the paths to the OCR libraries

 sudo nano /usr/share/tomcat6/conf/Catalina/localhost/tsFileIndexingService.xml

Restart server after changes

 tstomcatrestart

Enable and test indexing in Tempus Serva

Set the following configurations to true

fulltextIndexData
fulltextIndexFile

Also add port 8080 to the following URL

fulltextFileHandlerURL

Update any record in the TS installation

Check that the index is created and that there is a mapping for the solution

 curl 'http://localhost:9200/tempusserva/?pretty'

Next validate that records are found when searched for (replace * with a valid string)

 curl 'http://localhost:9200/tempusserva/_search?pretty&q=*'

Finally validate that the Tempus Serva wrapper also works

 http://<server>/TempusServa/fulltextsearch?subtype=4&term=*
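When the search check is scripted, the number of hits can be read from the response instead of inspecting it by eye. The sketch below assumes a pretty-printed response (the ?pretty option used above) in which "total" inside "hits" is a plain number, as in Elasticsearch 1.x and 6.x. The helper name es_total_hits is hypothetical.

```shell
# Print hits.total from a pretty-printed _search response read on stdin.
# Only lines from the first "hits" : { onwards are considered, so the
# "total" under "_shards" is skipped.
es_total_hits() {
    sed -n '/"hits"[[:space:]]*:[[:space:]]*{/,$ s/.*"total"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage (replace * with a valid string, as above):
#   curl -s 'http://localhost:9200/tempusserva/_search?pretty&q=*' | es_total_hits
```

A result of 0 after updating a record suggests that indexing is not enabled or that the index name differs.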

Optional OCR components

Some libraries must be installed (ghostscript is probably already installed)

 sudo yum install ImageMagick
 sudo yum install ghostscript

Also install tesseract

CentOS/Fedora

 sudo yum install tesseract-ocr

Amazon Linux

 sudo yum --enablerepo=epel --disablerepo=amzn-main install libwebp
 sudo yum --enablerepo=epel --disablerepo=amzn-main install tesseract

Afterwards change the configurations in the file indexer

 sudo nano /usr/share/tomcat6/conf/Catalina/localhost/tsFileIndexingService.xml

The values should be

/usr/bin/tesseract
/usr/bin/convert
/usr/bin/ghostscript

After changing the values restart the server.
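Before restarting, it can help to verify that the configured paths actually point at executables. This is a small sketch; the function name check_executables is invented for the example, and the paths in the usage line are the ones listed above.

```shell
# Print every given path that is not an executable file.
# Returns 1 if at least one path fails the check, otherwise 0.
check_executables() {
    missing=0
    for tool in "$@"; do
        if [ ! -f "$tool" ] || [ ! -x "$tool" ]; then
            echo "missing or not executable: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_executables /usr/bin/tesseract /usr/bin/convert /usr/bin/ghostscript
```

If a tool is reported missing, `command -v tesseract` (or the relevant name) shows where the package placed it.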

Troubleshooting

Status on the file indexing

The file indexer has a status page that displays information about the state of the indexer

 https://<server>/tsFileIndexingService/execute

The page also contains the keyword "HEALTHY", which is displayed as long as the process has not exceeded the specified timeouts.
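The keyword makes the page easy to monitor from a script. The sketch below only looks for the whole word HEALTHY in the page text. What the page shows in the failing case is not documented here, so the check is an assumption, and the name indexer_is_healthy is hypothetical.

```shell
# Succeed only if the text read on stdin contains the whole word HEALTHY.
# -w avoids matching a longer word that merely contains it.
indexer_is_healthy() {
    grep -qw 'HEALTHY'
}

# Usage (replace <server> with the real host name):
#   curl -s 'https://<server>/tsFileIndexingService/execute' | indexer_is_healthy \
#       || echo "file indexer needs attention"
```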

Controlling timeouts

Timeouts are specified in seconds and should be tuned to CPU size and quality of documents

 <Parameter name="TimeoutTesseract" value="600"/>  
 <Parameter name="TimeoutGhostscript" value="60"/>

Poor quality documents on virtualized environments can easily consume about a minute per page.
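As a rough guide for choosing a value, the expected page count can be multiplied by the time per page. This sketch assumes that the timeout covers a whole document (the text above does not say whether it applies per document or per page). The function name and the default of 60 seconds per page are illustrative only.

```shell
# Estimate a timeout in seconds as pages * seconds-per-page.
# The second argument is optional and defaults to 60 (about a minute per page).
estimate_ocr_timeout() {
    pages=$1
    seconds_per_page=${2:-60}
    echo $((pages * seconds_per_page))
}

# Usage: a 10 page scan of poor quality
#   estimate_ocr_timeout 10
```

Under that assumption the TimeoutTesseract value of 600 shown above corresponds to roughly 10 pages of poor quality.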

Debugging the OCR process

By default, output from the external components is written to log files; this can be disabled by adding the following option

 <Parameter name="SuppressCommandOutput" value="0"/>

Note that the configuration file (context.xml) has a switch that disables file deletion on the server

 <Parameter name="DisableFileCleanup" value=""/>

Reindexing

Reindex files

Before reindexing starts you may clean up the index (this is optional)

 DELETE FROM lucenedatastore WHERE FieldID > 0;

To reindex, execute the statement below using the following parameters:

schema of the database (example: "tslive")
file table of the solution (example: "data_solution_file")

 INSERT INTO lucenefilequeue (application,tablename,FileID) 
 SELECT 'tslive', 'data_solution_file', f.ID as FileID FROM data_solution_file as f WHERE f.IsDeleted = 0;

After executing the statement, run the indexing service and wait patiently
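When several solutions need reindexing, the statement can be generated from the two parameters. This is a sketch only: the function name reindex_files_sql is invented, and piping the result into the mysql client assumes a MySQL database whose name equals the schema.

```shell
# Print the queue INSERT statement for a given schema and file table.
# Names are restricted to letters, digits and underscores so that nothing
# unexpected ends up inside the SQL text.
reindex_files_sql() {
    schema=$1
    file_table=$2
    case "$schema$file_table" in
        ''|*[!A-Za-z0-9_]*) echo "invalid schema or table name" >&2; return 1 ;;
    esac
    printf "INSERT INTO lucenefilequeue (application,tablename,FileID)\n"
    printf "SELECT '%s', '%s', f.ID AS FileID FROM %s AS f WHERE f.IsDeleted = 0;\n" \
        "$schema" "$file_table" "$file_table"
}

# Usage (assumption: MySQL client with access to the schema):
#   reindex_files_sql tslive data_solution_file | mysql tslive
```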

Reindex everything

In case your Elastic search index is lost or corrupted, it is quite easy to reindex the whole database

 UPDATE lucenedatastore SET IsProcessed = 0;

Note that this will just add all data to Elastic search once more (so no OCR etc.)