Difference between revisions of "Elastic search and OCR"

From TempusServa wiki
Revision as of 23:13, 28 November 2016

Understanding integrated search

The integrated full-text search using Elastic search takes an internal, active approach to indexing content. Content is added to an indexing queue every time it is updated, which keeps the index up to date but consumes CPU resources on the indexing server.

Because file indexing is very CPU intensive, the file indexing functionality is separated into a service that can run on a server separate from the main application server. In either case, the file indexer works from a database queue.

The basic search service requires:

  • TS file indexing service
  • Elastic search server

If PDF OCR functionality is needed, the following components must also be installed:

  • Ghostscript (PDF to TIFF conversion)
  • Tesseract (OCR library)

The above components for OCR must be installed on the file indexing server.
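As a rough illustration of how the two OCR components fit together, the sketch below builds a Ghostscript command that renders a PDF to a multi-page TIFF and a Tesseract command that extracts text from that TIFF. The function names, the `tiffg4` output device, the 300 dpi resolution, the `eng` language and the file names are assumptions chosen for the example. The options the TS file indexing service actually uses are not documented here, and on Windows the Ghostscript executable is typically named `gswin64c` rather than `gs`.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf_path, tiff_path, dpi=300):
    # Render every PDF page into one multi-page black-and-white TIFF.
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiffg4", f"-r{dpi}",
        f"-sOutputFile={tiff_path}", str(pdf_path),
    ]


def tesseract_cmd(tiff_path, output_base, lang="eng"):
    # Tesseract writes the recognized text to <output_base>.txt.
    return ["tesseract", str(tiff_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir):
    # Assumed two-step pipeline: PDF to TIFF, then TIFF to text.
    work_dir = Path(work_dir)
    tiff_path = work_dir / "pages.tif"
    output_base = work_dir / "pages"
    subprocess.run(ghostscript_cmd(pdf_path, tiff_path), check=True)
    subprocess.run(tesseract_cmd(tiff_path, output_base), check=True)
    return (work_dir / "pages.txt").read_text(encoding="utf-8")
```

Calling `ocr_pdf("scan.pdf", "/tmp/ocr")` would require both binaries on the file indexing server; the two command builders can be inspected without running anything.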

Setting up basic search service

Install: TS file indexing service

Install: Elastic search server

Adding OCR capability

Install: Ghostscript binaries

Install: Tesseract binaries
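Once the OCR binaries are installed, a quick check that they can be found on the file indexing server's PATH may save some troubleshooting. The sketch below is a generic check, not part of TempusServa; the executable names `gs` and `tesseract` are assumptions and may differ by platform (for example `gswin64c` on Windows).

```python
import shutil


def missing_tools(names=("gs", "tesseract"), which=shutil.which):
    # Return the executables that cannot be located on PATH.
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Ghostscript and Tesseract were both found.")
```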