Elastic search and OCR
Revision as of 23:13, 28 November 2016
Understanding integrated search
The integrated full-text search using Elasticsearch takes an internal, active approach to indexing content. Content is added to an indexing queue every time it is updated. This keeps the index up to date at all times, but consumes CPU resources on the indexing server.
Because file indexing is very CPU intensive, the file indexing functionality is separated into a service that can run on a server separate from the main application server. In either case, the file indexer works from a database queue.
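The queue-driven flow described above can be sketched roughly as follows. This is only an illustration, not the actual TS implementation: the table name `index_queue`, the column `doc_id`, and the `index_document` callback are hypothetical, and SQLite stands in for whatever database the application really uses.

```python
import sqlite3


def create_queue(conn):
    # Hypothetical queue table: one row per pending indexing job.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS index_queue ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "doc_id TEXT NOT NULL)"
    )
    conn.commit()


def enqueue(conn, doc_id):
    # Called by the application every time a piece of content is updated.
    conn.execute("INSERT INTO index_queue (doc_id) VALUES (?)", (doc_id,))
    conn.commit()


def process_queue(conn, index_document):
    # Run by the indexing service: take jobs oldest first, index each one,
    # and remove it from the queue only after indexing has succeeded.
    processed = []
    while True:
        row = conn.execute(
            "SELECT id, doc_id FROM index_queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break
        queue_id, doc_id = row
        index_document(doc_id)
        conn.execute("DELETE FROM index_queue WHERE id = ?", (queue_id,))
        conn.commit()
        processed.append(doc_id)
    return processed
```

Because the application only writes rows and the indexer only reads and deletes them, the CPU-heavy work can run on a different server, as long as both can reach the same database.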
The basic search service requires:
- TS file indexing service
- Elasticsearch server
If PDF OCR functionality is needed, the following components must be installed too:
- Ghostscript (PDF to TIFF conversion)
- Tesseract (OCR library)
The above components for OCR must be installed on the file indexing server.
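As a rough illustration of how the two OCR components fit together, the sketch below converts a PDF to a multi-page TIFF with Ghostscript and then runs Tesseract on the result. It is not the file indexing service's own code. The helper names, the 300 dpi resolution, the `tiffg4` output device, and the `eng` language are assumptions chosen for the example, and on Windows the Ghostscript binary is typically named `gswin64c` rather than `gs`.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf_path, tiff_path, dpi=300, gs_binary="gs"):
    # Render every page of the PDF into one black-and-white multi-page TIFF.
    return [
        gs_binary,
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=tiffg4",
        f"-r{dpi}",
        f"-sOutputFile={tiff_path}",
        str(pdf_path),
    ]


def tesseract_cmd(tiff_path, output_base, lang="eng", tesseract_binary="tesseract"):
    # Tesseract appends ".txt" to output_base when writing plain text.
    return [tesseract_binary, str(tiff_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir):
    # Requires both binaries to be installed and on PATH.
    work_dir = Path(work_dir)
    stem = Path(pdf_path).stem
    tiff_path = work_dir / (stem + ".tif")
    output_base = work_dir / stem
    subprocess.run(ghostscript_cmd(pdf_path, tiff_path), check=True)
    subprocess.run(tesseract_cmd(tiff_path, output_base), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_pdf("scan.pdf", "/tmp")` would return the recognized text, which could then be sent to the search index like any other extracted file content.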