Elastic search and OCR

Understanding integrated search

The integrated full-text search using Elastic search is an internal, active approach to indexing content. Content is added to an indexing queue every time it is updated. This keeps the index up to date at all times, but consumes CPU resources on the indexing server.

Because file indexing is very CPU intensive, the file indexing functionality is separated into a service that can run on a server separate from the main application server. The file indexer works from a database queue, however, so in most cases separation is not strictly required.
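The queue-driven flow described above can be illustrated with a minimal sketch. It uses an in-memory queue in place of the real database queue, and the class and method names (IndexQueue, on_content_updated, drain) are hypothetical, not part of the product. Collapsing repeated updates of the same document into one queue entry is an assumption made for the illustration.

```python
from collections import deque


class IndexQueue:
    """In-memory stand-in for the database-backed indexing queue (illustration only)."""

    def __init__(self):
        self._pending = deque()  # document ids waiting to be indexed
        self._queued = set()     # ids currently in the queue, to skip duplicates

    def on_content_updated(self, doc_id):
        # Called by the application every time a piece of content changes.
        if doc_id not in self._queued:
            self._pending.append(doc_id)
            self._queued.add(doc_id)

    def drain(self, index_one):
        # Run by the indexing service; index_one does the CPU-heavy work.
        processed = []
        while self._pending:
            doc_id = self._pending.popleft()
            self._queued.discard(doc_id)
            index_one(doc_id)
            processed.append(doc_id)
        return processed


queue = IndexQueue()
for doc_id in ("doc-1", "doc-2", "doc-1"):  # doc-1 is updated twice
    queue.on_content_updated(doc_id)

indexed = []
processed = queue.drain(indexed.append)
print(processed)  # ['doc-1', 'doc-2']
```

Because the application only writes to the queue and the indexer only reads from it, the two can run on the same server or on different ones without changing either side.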

The basic search service requires:

TS file indexing service (queue handler)
Elastic search server (search engine)

For multi-tenant setups, a single TS file indexing service can serve multiple instances, as long as they write requests to the same queue (using DB views). The Elastic search server can also handle multiple applications.
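One possible way to present several tenants as a single queue is a database view that combines their request tables. The sketch below uses SQLite purely for illustration; the table, view, and column names are hypothetical, and the actual schema and the direction in which the views are used may well differ in a real installation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical per-tenant queue tables.
    CREATE TABLE tenant_a_queue (doc_id TEXT, queued_at TEXT);
    CREATE TABLE tenant_b_queue (doc_id TEXT, queued_at TEXT);

    -- One view presents both tenants' requests as a single queue.
    CREATE VIEW shared_index_queue AS
        SELECT 'tenant_a' AS tenant, doc_id, queued_at FROM tenant_a_queue
        UNION ALL
        SELECT 'tenant_b' AS tenant, doc_id, queued_at FROM tenant_b_queue;
""")
conn.execute("INSERT INTO tenant_a_queue VALUES ('a-17', '2020-01-01T10:00:00')")
conn.execute("INSERT INTO tenant_b_queue VALUES ('b-03', '2020-01-01T10:00:05')")
conn.execute("INSERT INTO tenant_a_queue VALUES ('a-18', '2020-01-01T10:00:09')")

# The single indexing service reads every tenant's requests in arrival order.
pending = conn.execute(
    "SELECT tenant, doc_id FROM shared_index_queue ORDER BY queued_at"
).fetchall()
print(pending)
```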

If PDF OCR functionality is needed, the following components must be installed as well:

Ghostscript (PDF to TIFF conversion)
Tesseract (OCR library)

The above components for OCR must be installed on the file indexing server.

Adding OCR capability

Both OCR components must be installed on the same server as the TS file indexing service.

Install: Ghostscript binaries

Download and run the installer:

  http://www.ghostscript.com/download/gsdnld.html

Note: You are not required to buy a license.

Install: Tesseract binaries

For Linux, install from the package repository (the package name can differ between distributions):

  sudo yum install tesseract-ocr

For Windows, download the installer or zip archive:

  https://sourceforge.net/projects/tesseract-ocr-alt/files/
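Once both tools are installed, the PDF to TIFF to text chain can be tried by hand. The following is a minimal sketch of that chain, not the indexing service's own implementation: the function names are hypothetical, and the chosen Ghostscript device (tiffg4), the 300 dpi resolution, and the "eng" language are assumptions. The Ghostscript executable is typically gs on Linux and gswin64c.exe on Windows.

```python
import subprocess


def ghostscript_command(gs_exe, pdf_path, tiff_path, dpi=300):
    # Render all pages of the PDF into one multi-page black-and-white TIFF.
    return [
        gs_exe, "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiffg4", f"-r{dpi}",
        f"-sOutputFile={tiff_path}", pdf_path,
    ]


def tesseract_command(tesseract_exe, tiff_path, output_base, language="eng"):
    # Tesseract appends ".txt" to output_base itself.
    return [tesseract_exe, tiff_path, output_base, "-l", language]


def ocr_pdf(gs_exe, tesseract_exe, pdf_path, work_base):
    # Run both steps and return the path of the resulting text file.
    tiff_path = work_base + ".tif"
    subprocess.run(ghostscript_command(gs_exe, pdf_path, tiff_path), check=True)
    subprocess.run(tesseract_command(tesseract_exe, tiff_path, work_base), check=True)
    return work_base + ".txt"


# Show the commands without executing them.
print(" ".join(ghostscript_command("gs", "scan.pdf", "scan.tif")))
print(" ".join(tesseract_command("tesseract", "scan.tif", "scan")))
```

Calling ocr_pdf("gs", "tesseract", "scan.pdf", "scan") would run both steps, provided the two executables are on the PATH.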

Setting up basic search service

Note that the Elastic search server can be installed on a separate server (neither the TS file indexing service nor the application server needs to be present on that machine).

Install: Elastic search server

Elastic search server (version 5) runs standalone and requires Java 8 or higher.

Download the Elastic search zip archive
Unpack the files to a suitable location
Start elasticsearch.bat in the /bin folder

  https://www.elastic.co/downloads/elasticsearch
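After starting the server, a quick check that it answers and reports the expected major version can be useful. The sketch below assumes the server listens on the default HTTP port 9200 on the local machine; the function names are hypothetical, and the sample response is trimmed to the fields the check uses.

```python
import json
from urllib.request import urlopen


def major_version(root_response):
    # The root endpoint reports the version as {"version": {"number": "5.x.y"}}.
    return int(root_response["version"]["number"].split(".")[0])


def fetch_root(base_url="http://localhost:9200", timeout=5):
    # GET / on a running Elasticsearch node returns a small JSON document.
    with urlopen(base_url + "/", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline example with a trimmed sample of the root response.
sample = {"name": "node-1", "version": {"number": "5.6.16"}}
print(major_version(sample))  # 5
```

Against a live server, major_version(fetch_root()) should return 5 for the version described here.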

Install: TS file indexing service (TSFIS)

For TSFIS to run you will need a servlet container (Tomcat, JBoss, Oracle AS).

Download tsFileIndexingService.war
Deploy it to the web application folder on the application server
Change the settings in web.xml
- Database connection strings: If on the same server, just copy the settings from your main application
- ExecutableGhostscript: Path to Ghostscript (see above)
- ExecutableTerrasect: Path to the Tesseract OCR executable (see above)
- ElasticServerAddress: IP or server name where Elastic search is installed (see above)
Restart the server (to reload DB credentials)
Test the application at: <server>/tsFileIndexService/execute
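The final test step can also be scripted. This is a minimal sketch that only checks whether the URL answers; the helper names are hypothetical, the server address in the example is a placeholder, and what the endpoint returns on success is not described here, so the sketch reports just the HTTP status code.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def tsfis_test_url(server):
    # server is the base address of the servlet container, e.g. "http://appserver:8080".
    return server.rstrip("/") + "/tsFileIndexService/execute"


def probe(url, timeout=10):
    # Returns the HTTP status code, or None if the server could not be reached.
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code
    except OSError:
        return None


print(tsfis_test_url("http://appserver:8080/"))
```

A result of None from probe suggests the container is not running or not reachable, while a 404 suggests the application was not deployed under the expected name.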
