Elastic search and OCR
Revision as of 19:45, 5 December 2016
Understanding integrated search
The integrated full-text search using Elasticsearch takes an internal, active approach to indexing content. Content is added to an indexing queue every time it is updated. This keeps the index up to date at all times, but consumes CPU resources on the indexing server.
Because file indexing is very CPU intensive, the file indexing functionality is separated into a service that can run on a server separate from the main application server. The file indexer works from a database queue in any case, so in most cases separation is not strictly required.
The basic search service requires:
- TS file indexing service (queue handler)
- Elasticsearch server (search engine)
For multitenant setups, a single TS file indexing service can serve multiple instances, as long as they write requests to the same queue (using DB views). The Elasticsearch server can also handle multiple applications.
If PDF OCR functionality is needed, the following components must also be installed:
- Ghostscript (PDF to TIFF conversion)
- Tesseract (OCR library)
The above components for OCR must be installed on the file indexing server.
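To illustrate how the two OCR components fit together, the following is a minimal sketch in Python that builds the command lines for a PDF-to-TIFF conversion followed by OCR. It is not taken from the TS file indexing service. The function names, the output file pattern, the 300 dpi resolution and the `tiffgray` device are assumptions chosen for illustration, and the Ghostscript executable is typically named `gswin64c` on Windows rather than `gs`.

```python
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, gs_executable="gs", dpi=300):
    """Build a Ghostscript call that renders each PDF page to its own TIFF file.

    The output pattern and resolution are illustrative assumptions.
    """
    out_pattern = str(Path(out_dir) / "page-%03d.tif")  # one file per page
    return [
        gs_executable,
        "-dNOPAUSE", "-dBATCH", "-dSAFER",   # run non-interactively
        "-sDEVICE=tiffgray",                 # 8-bit grayscale TIFF output
        f"-r{dpi}",                          # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def tesseract_command(tiff_path, out_base, tesseract_executable="tesseract", lang="eng"):
    """Build a Tesseract call; the recognized text is written to <out_base>.txt."""
    return [tesseract_executable, str(tiff_path), str(out_base), "-l", lang]
```

Each list can be passed to `subprocess.run(..., check=True)` on a server where both binaries are installed.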
Behind the scenes
Submit data > Stored in lucenedatastore > Transfer to ElasticSearch
Upload file > Stored in lucenefilequeue > tsFileIndexingService > Transfer to ElasticSearch
- tsFileIndexingService > Apache Tika (most files)
- tsFileIndexingService > Tesseract (TIFF images)
- tsFileIndexingService > Ghostscript > Tesseract (PDF images)
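The routing above can be sketched as a small Python function. This only illustrates the three paths listed and is not the actual logic of tsFileIndexingService. In particular, how the service decides that a PDF contains only images is not described here, so the `pdf_is_scanned` flag is a hypothetical stand-in for that decision.

```python
from pathlib import Path

TIKA = "Apache Tika"
GHOSTSCRIPT = "Ghostscript"
TESSERACT = "Tesseract"


def extraction_steps(filename, pdf_is_scanned=False):
    """Return the tools a file would pass through before reaching the index.

    pdf_is_scanned is a hypothetical flag for PDFs that contain page images
    instead of extractable text.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in {".tif", ".tiff"}:
        return [TESSERACT]                   # TIFF images go straight to OCR
    if suffix == ".pdf" and pdf_is_scanned:
        return [GHOSTSCRIPT, TESSERACT]      # convert to TIFF first, then OCR
    return [TIKA]                            # everything else: text extraction
```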
Adding OCR capability
Both OCR components must be installed on the same server as the TS file indexing service.
Install: Ghostscript binaries
Download and run the installer:
http://www.ghostscript.com/download/gsdnld.html
Note: You are not required to buy a license
Install: Tesseract binaries
On Linux, install from the distribution's repository (the package name may vary by distribution), for example:
sudo yum install tesseract-ocr
For Windows, download the installer or zip archive:
https://sourceforge.net/projects/tesseract-ocr-alt/files/
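After installation it can be useful to confirm where the binaries ended up, since the TS file indexing service needs their paths. The following is a minimal sketch in Python using only the standard library. The executable names passed in are assumptions (Ghostscript is typically `gs` on Linux and `gswin64c` on Windows), and a binary that is installed but not on the PATH will be reported as missing.

```python
import shutil


def find_executables(names):
    """Map each executable name to its full path, or None if not on the PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Names are assumptions; adjust them to match your platform.
    for name, path in find_executables(["gs", "tesseract"]).items():
        print(f"{name}: {path or 'not found on PATH'}")
```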
Setting up basic search service
Note that the Elasticsearch server can be installed on a separate server; it does not need to share a machine with the TS file indexing service or the application server.
Install: Elastic search server
Elasticsearch server (version 5) runs standalone and requires Java 8 or higher.
- Download the Elasticsearch zip archive
- Unpack the files to a suitable location
- Start elasticsearch.bat in the /bin folder
https://www.elastic.co/downloads/elasticsearch
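Once the server is started, a quick way to confirm it is responding is to request its root URL, which returns a small JSON document that includes the version number. The sketch below shows this in Python with the standard library only. The function names are illustrative, and the base URL (host and port) is an assumption that must match your own installation.

```python
import json
from urllib.request import urlopen


def version_from_response(body):
    """Extract version.number from the JSON returned by the Elasticsearch root URL."""
    info = json.loads(body)
    return info["version"]["number"]


def elasticsearch_version(base_url, timeout=5):
    """Ask a running Elasticsearch server for its version.

    Raises an exception (URLError or similar) if the server cannot be reached.
    """
    with urlopen(base_url, timeout=timeout) as response:
        return version_from_response(response.read())
```

For example, `elasticsearch_version("http://<server>:<port>/")` should return a string starting with `5.` for the version described here.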
Install: TS file indexing service (TSFIS)
For TSFIS to run, you will need a servlet container (Tomcat, JBoss, Oracle AS).
- Download tsFileIndexingService.war
- Copy it to the web application folder on the application server
- Change settings in web.xml
  - Database connection strings: If on the same server, just copy the settings from your main application
  - ExecutableGhostscript: Path to Ghostscript (see above)
  - ExecutableTerrasect: Path to the Tesseract OCR executable (see above)
  - ElasticServerAddress: IP or server name where Elasticsearch is installed (see above)
- Restart the server (to reload DB credentials)
- Test the application at: <server>/tsFileIndexService/execute
Network configuration
If Elasticsearch or the file indexer is not on the same server as the application, you will need to open the following ports:
- Port 3306 from the TS file indexing service to the MySQL database (normally on the application server)
- Port 2100 from the TS file indexing service to the Elasticsearch server
- Port 2100 from the Tempus Serva application to the Elasticsearch server
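The firewall rules above can be checked from the machine that initiates each connection. The following is a minimal sketch in Python; the host names in the example are placeholders, and the function only tests whether a TCP connection can be opened, not whether the service behind the port works correctly.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    # Placeholder host names; run this on the TS file indexing server.
    checks = [("db-server.example", 3306), ("elastic-server.example", 2100)]
    for host, port in checks:
        state = "open" if port_is_open(host, port) else "blocked or closed"
        print(f"{host}:{port} is {state}")
```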
Also remember to update the configuration for server names:
- Elasticsearch: elasticsearch.yml file > network.host (add IP or server name)
- TS file indexing service: web.xml > ElasticServerAddress (Elasticsearch server)
- Tempus Serva application: Server policies
  - fulltextFileHandlerURL (file indexing server)
  - fulltextElasticBaseURL (Elasticsearch server)
Multi application setup
- Set up a shared table for lucenefilequeue using views
  - Delete the lucenefilequeue table in all slave databases
  - Create a view of lucenefilequeue pointing to the master database
- The TS file indexing service must have a user with access to all TS databases
Each instance will have its own shard in the Elasticsearch index.
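The view setup can be sketched as a small Python helper that generates the MySQL statements for each slave database. This is an illustration under assumptions, not a script shipped with the product: the database names are placeholders, the view simply selects every column from the master table, and the statements should be reviewed (and any pending queue rows handled) before running them, since `DROP TABLE` is destructive.

```python
def shared_queue_statements(master_db, slave_dbs, table="lucenefilequeue"):
    """Generate MySQL statements that replace each slave's queue table with a view.

    Database names are placeholders supplied by the caller.
    """
    statements = []
    for db in slave_dbs:
        # Remove the local queue table in the slave database.
        statements.append(f"DROP TABLE IF EXISTS `{db}`.`{table}`;")
        # Point the same name at the master's queue table instead.
        statements.append(
            f"CREATE VIEW `{db}`.`{table}` AS "
            f"SELECT * FROM `{master_db}`.`{table}`;"
        )
    return statements


if __name__ == "__main__":
    for statement in shared_queue_statements("ts_master", ["ts_app1", "ts_app2"]):
        print(statement)
```

A single-table `SELECT *` view like this is normally insertable in MySQL, so applications writing to the view add rows to the master queue.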