Difference between revisions of "Elastic search and OCR"

Latest revision as of 14:49, 24 August 2023

Understanding integrated search

The integrated full-text search using Elastic search is an internal, active approach to indexing the content. Content is added to an indexing queue every time it is updated, which keeps the index up to date but consumes CPU resources on the indexing server.

Because file indexing is very CPU intensive, the file indexing functionality is separated into a service that can run on a server separate from the main application server. The file indexer works from a database queue, so in most cases separation is not strictly required.

The basic search service requires

  • TS file indexing service (queue handler)
  • Elastic search server (search engine)

For multitenant setups a single TS file indexing service can service multiple instances, as long as they write requests to the same queue (using DB views). The Elastic search server can also handle multiple applications.

If PDF OCR functionality is needed, the following components need to be installed too

  • Ghostscript (PDF to TIFF conversion)
  • Tesseract (OCR library)

The above components for OCR must be installed on the file indexing server.

Behind the scenes

Indexes in Tempus Serva are stored in intermediate tables in the database. ElasticSearch indexes can be dropped and regenerated from the intermediate storage.

Data indexing

Data is mainly indexed in one large text blob.

  Submit data > Stored in lucenedatastore > Transfer to ElasticSearch

For solutions using version history, data will be reused in order to minimize the overhead.

File indexing

File indexes point to the record, not the file itself. Likewise, permission checks rely on read access to the record.

  Upload file > Stored in lucenefilequeue > tsFileIndexingService > Transfer to ElasticSearch

The indexing service handles files through various conversion processes

  • tsFileIndexingService > Apache Tika (most files)
  • tsFileIndexingService > Tesseract (tif images)
  • tsFileIndexingService > GhostScript > Tesseract (PDF images)
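
The conversion routes listed above can be pictured as a simple dispatch on the file type. The following is only a sketch: the function name conversion_route is hypothetical, it decides purely on the file extension, and it cannot tell a scanned PDF from a PDF that already contains text. How tsFileIndexingService actually chooses a route is not documented here.

```shell
#!/bin/sh
# Hypothetical helper: name the conversion route for a file, based only on
# its extension. The real tsFileIndexingService may decide differently.
conversion_route() {
    # Take the text after the last dot and lower-case it.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        tif|tiff) echo "Tesseract" ;;
        pdf)      echo "GhostScript > Tesseract" ;;
        *)        echo "Apache Tika" ;;
    esac
}

conversion_route "scan.TIF"      # prints: Tesseract
conversion_route "invoice.pdf"   # prints: GhostScript > Tesseract
conversion_route "notes.docx"    # prints: Apache Tika
```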

For multi-application installations, lucenefilequeue is designed to be shared between applications.

ElasticSearch structure

Multiple applications can share the same ElasticSearch server

/ APPLICATION / SOLUTION / RECORD ID

Records contain the most general information

  • Title
  • Content (large text blob)
  • SagID
  • DataID
  • FieldID (in case of subrecords)
  • ModifiedAt
  • ModifiedBy (UserID)

Search results are filtered against the Tempus Serva permission engine on record level.

Data and subrecords (such as files) are stored in the same area. Subrecords get a slightly adjusted record ID: DataID + "f" + FileID

/tempusserva/crm/6541
/tempusserva/crm/6541f45
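
The path scheme above can be reproduced with a few lines of shell, which may help when looking up a record manually. This is a sketch: the helper names record_path, file_record_path and data_id_of are made up for illustration and are not part of Tempus Serva.

```shell
#!/bin/sh
# Hypothetical helpers that rebuild the paths shown above.

# usage: record_path APPLICATION SOLUTION DATA_ID
record_path() {
    printf '/%s/%s/%s\n' "$1" "$2" "$3"
}

# usage: file_record_path APPLICATION SOLUTION DATA_ID FILE_ID
# File subrecords use the record ID DataID + "f" + FileID.
file_record_path() {
    printf '/%s/%s/%sf%s\n' "$1" "$2" "$3" "$4"
}

# usage: data_id_of RECORD_ID
# Strips the "f" + FileID suffix, leaving the DataID of the owning record.
data_id_of() {
    printf '%s\n' "${1%%f*}"
}

record_path tempusserva crm 6541           # prints: /tempusserva/crm/6541
file_record_path tempusserva crm 6541 45   # prints: /tempusserva/crm/6541f45
data_id_of 6541f45                         # prints: 6541
```

Because a file subrecord carries the DataID of its owning record, a search hit on a file can be traced back to the record that the permission check applies to.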

Adding OCR capability

OCR components must be installed on the same server as TS file indexing service.

Only GhostScript and Tesseract are required to process PDF files.

TEMPORARY FIX: <tomcat>\catalina\catalina.properties add java.io.tmpdir=c:/Temp


Install: ImageMagick binaries

Download and unpack the "portable" version (recommended location: c:\ImageMagick)

  https://www.imagemagick.org/script/binary-releases.php

Register the location of the convert executable in web.xml

   <context-param>
       <param-name>ExecutableImageMagick</param-name>
       <param-value>c:\ImageMagick\convert</param-value>
   </context-param>

Leaving the entry empty will prevent OCR handling of image files: png, jpg, jpeg

Install: Ghostscript binaries

Download and run installer

  http://www.ghostscript.com/download/gsdnld.html

Note: You are not required to buy a license

Register the location of the gswin64c executable in web.xml

   <context-param>
       <param-name>ExecutableGhostscript</param-name>
       <param-value>c:\Program Files\gs\gs9.20\bin\gswin64c.exe</param-value>
   </context-param>

Leaving the entry empty will prevent OCR handling of PDF files

Install: Tesseract binaries

For Linux, install from the repository using

  sudo yum install tesseract-ocr

If you are using Amazon Linux, use this instead (thanks for the help: https://stackoverflow.com/questions/38065964/fastest-way-to-install-tesseract-on-elastic-beanstalk).

 sudo yum --enablerepo=epel --disablerepo=amzn-main install libwebp
 sudo yum --enablerepo=epel --disablerepo=amzn-main install tesseract

For Windows, download the installer or zip archive

  https://sourceforge.net/projects/tesseract-ocr-alt/files/

Register the location of the tesseract executable in web.xml

   <context-param>
       <param-name>ExecutableTerrasect</param-name>
       <param-value>c:\tesseract\tesseract</param-value>
   </context-param>
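
Once both binaries are installed, the PDF route can be tried by hand before involving the indexing service. The sketch below assumes that gs and tesseract are on the PATH (on Windows the Ghostscript console executable is gswin64c). The function name ocr_pdf, the tiffg4 device and the 300 dpi resolution are illustrative choices, not settings taken from tsFileIndexingService.

```shell
#!/bin/sh
# Hypothetical manual check: PDF -> TIFF pages (Ghostscript) -> text (Tesseract).
# usage: ocr_pdf INPUT.pdf OUTPUT_PREFIX
ocr_pdf() {
    # Render each PDF page to a 300 dpi black-and-white TIFF file named
    # OUTPUT_PREFIX-001.tif, OUTPUT_PREFIX-002.tif, and so on.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r300 \
        "-sOutputFile=$2-%03d.tif" "$1" || return 1

    # OCR every page; tesseract appends .txt to the output base name.
    for page in "$2"-*.tif; do
        [ -e "$page" ] || continue      # no pages were produced
        tesseract "$page" "${page%.tif}" || return 1
    done
}

# Example (not run here): ocr_pdf scan.pdf /tmp/scan
# This would write /tmp/scan-001.txt, /tmp/scan-002.txt, ...
```

If this works from a shell running as the same user as the servlet container, the paths registered in web.xml are the most likely place to look when OCR still fails in the service.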

Setting up basic search service

Note that the Elastic search server can be installed on a separate server (neither the TS file indexing service nor the application server needs to run on it).

Install: Elastic search server

Elastic search server (version 5) runs standalone and requires Java 8 or higher

  1. Download the Elastic search zip archive
  2. Unpack the files to a suitable location
  3. Start elasticsearch.bat in the /bin folder

  https://www.elastic.co/downloads/elasticsearch

For Linux you can follow the official installation guide (https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html#_installation_example_with_tar), or install from the RPM package as shown here

 wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.9.1-x86_64.rpm
 wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.9.1-x86_64.rpm.sha512
 shasum -a 512 -c elasticsearch-8.9.1-x86_64.rpm.sha512
 sudo rpm --install elasticsearch-8.9.1-x86_64.rpm
 # to uninstall the package again later: sudo rpm -e elasticsearch

Alternatively use this script

 sudo rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
 sudo sh -c 'curl https://gist.githubusercontent.com/nl5887/b4a56bfd84501c2b2afb/raw/elasticsearch.repo >> /etc/yum.repos.d/elasticsearch.repo'
 sudo yum install -y elasticsearch  
 sudo chkconfig elasticsearch on
 sudo nano /etc/elasticsearch/jvm.options
 # in jvm.options, set the heap size with these two lines:
 #   -Xms256m
 #   -Xmx256m
 sudo service elasticsearch start
 curl 'http://localhost:9200/app/_count?pretty&q='y
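
The final curl call above doubles as a quick health check. A slightly more reusable version could look like the sketch below. The function names es_count_url and es_is_up are hypothetical, and the index name app and port 9200 are simply the values used in the command above.

```shell
#!/bin/sh
# Hypothetical helper: build the _count URL for an index on an Elastic server.
# usage: es_count_url HOST PORT INDEX
es_count_url() {
    printf 'http://%s:%s/%s/_count?pretty\n' "$1" "$2" "$3"
}

# Returns 0 when the server answers the _count request successfully.
# usage: es_is_up HOST PORT INDEX
es_is_up() {
    curl --silent --fail --max-time 5 "$(es_count_url "$1" "$2" "$3")" > /dev/null
}

# Example (not run here):
#   es_is_up localhost 9200 app && echo "Elastic search is answering"
```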

Install: TS file indexing service (TSFIS)

For TSFIS to run you will need a servlet container (Tomcat, JBoss, Oracle AS).

  1. Download tsFileIndexingService.war
  2. Deploy it to the web application folder on the application server
  3. Change settings in web.xml
    • Database connection strings: If on the same server, just copy the settings from your main application
    • ExecutableGhostscript: Path to Ghostscript (see above)
    • ExecutableTerrasect: Path to Tesseract OCR module (see above)
    • ElasticServerAddress: IP or servername where ElasticSearch is installed (see above)
  4. Restart server (to reload DB credentials)
  5. Test application at: <server>/tsFileIndexingService/execute

Network configuration

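If Elastic search or the file indexer is not on the same server as the application, ensure that the following ports are open

  • Port 3306 from TS file indexing service to MySQL database (normally the application server)
  • Port 2100 from TS file indexing service to ElasticSearch server
  • Port 2100 from Tempus Serva application to ElasticSearch server

Note that Elastic search listens for HTTP requests on port 9200 by default (the port used in the curl test above), so open whichever port your installation is actually configured to use.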
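
Whether the required ports are reachable can be checked from the server that initiates the connection. The sketch below assumes that the nc (netcat) utility is installed. The function names check_port and report_port and the host names in the example are placeholders.

```shell
#!/bin/sh
# usage: check_port HOST PORT
# Returns 0 when a TCP connection to HOST:PORT succeeds within 3 seconds.
check_port() {
    nc -z -w 3 "$1" "$2" 2>/dev/null
}

# usage: report_port HOST PORT
# Prints one status line per checked port.
report_port() {
    if check_port "$1" "$2"; then
        echo "open   $1:$2"
    else
        echo "closed $1:$2"
    fi
}

# Example, run on the file indexing server (host names are placeholders):
#   report_port db.example.local 3306
#   report_port elastic.example.local 2100
```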

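Also remember to update configurations for server names

  • Elastic search: elasticsearch.yml file > network.host (add IP or servername)
    • https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html
  • TS file indexing service: web.xml > ElasticServerAddress (elastic search server)
  • Tempus Serva application: Server policies
    • fulltextFileHandlerURL (file indexing server)
    • fulltextElasticBaseURL (elastic search server)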

Multi application setup

  1. Set up a shared table for lucenefilequeue using views
    • Delete the lucenefilequeue table in all slave databases
    • Create a view of lucenefilequeue pointing to the master database
  2. TS file indexing service must have a user with access to all TS databases
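
Step 1 could be scripted roughly as follows. This is a sketch under several assumptions: all databases live on the same MySQL server, the database names ts_master and ts_crm are placeholders, and the helper name queue_view_sql is made up. The function only prints SQL, so the statements can be reviewed before they are applied. Note that DROP TABLE discards any queue entries still waiting in the slave database.

```shell
#!/bin/sh
# Hypothetical helper: print the SQL that replaces the local queue table in a
# slave database with a view of the queue table in the master database.
# usage: queue_view_sql MASTER_DB SLAVE_DB
queue_view_sql() {
    cat <<EOF
USE \`$2\`;
DROP TABLE IF EXISTS lucenefilequeue;
CREATE VIEW lucenefilequeue AS SELECT * FROM \`$1\`.lucenefilequeue;
EOF
}

# Review the output first, then apply it, for example:
#   queue_view_sql ts_master ts_crm | mysql -u root -p
queue_view_sql ts_master ts_crm
```

The database user of each application needs write access to the master's queue table through the view, in addition to the user requirement for the indexing service in step 2.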

Multiple instances will have a shard each in the Elastic index