Text extraction in DAMS

DAMS has two text extraction engines built into the system: pdftotext and Tesseract OCR.

There are no known size thresholds or page limits for either of these text extraction tools. Note, however, that Tesseract OCR is performed one page at a time, and some non-English Tesseract language packs (most notably Japanese, among those currently enabled) take a long time to process.

Islandora PDFTOTEXT

The pdftotext engine is based on xpdfreader. It extracts the text content stream from a PDF file, so it does not work with a PDF that consists only of images and contains no searchable text. Extraction runs automatically on manual ingest when using the PDF Content Model and generates a FULL_TEXT datastream. pdftotext also runs automatically during a batch ingest of PDFs, except when using batch ingest method 7 (Paged Content).
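The image-only limitation described above can be checked outside DAMS before ingest. The following is a minimal sketch, assuming the pdftotext command-line tool is installed locally; the function names are hypothetical and are not part of DAMS or Islandora.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path: str) -> list[str]:
    """Return a pdftotext command that writes UTF-8 text to stdout."""
    # Passing "-" as the output file makes pdftotext print to stdout.
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"]


def has_extractable_text(text: str) -> bool:
    """True if the output holds anything besides whitespace and form feeds."""
    # An image-only PDF typically yields only page-break form feeds.
    return bool(text.strip())


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on one PDF and return its text (possibly empty)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed or not on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

If has_extractable_text(extract_text("scan.pdf")) returns False, the PDF most likely contains only images, and its FULL_TEXT datastream would be empty after ingest.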

Tesseract OCR

The Tesseract OCR engine uses optical character recognition to extract text from images. You can read further about this technology here: https://github.com/tesseract-ocr/tesseract/wiki. Tesseract OCR is available for use in conjunction with Paged Content:

Manual ingest: While filling out the ingest form at the Book or Publication Issue level, the user may specify whether OCR should be performed and in what language (default is English).
Batch ingest (method 7, Paged Content): One of the supported OCR languages must be specified in the manifest (see documentation here).

After ingesting a Book or Publication Issue, the user may use the "Manage" tab to perform or re-perform OCR at the Book or Publication Issue level. Additionally, the user may use the "Manage" tab at the Page level to perform or re-perform OCR for that particular Page.

Text Extraction Engine	DAMS Content Model	Datastream Created	Searchable in DAMS	Extracts Text From Images	Supported Languages	Technology Used
PDFTOTEXT	PDF Content Model	FULL_TEXT	yes	no	Any language that is character rather than symbol based (e.g., Arabic, Japanese)	text stream extraction from PDF file
TESSERACT	Paged Content	OCR	yes	yes	English (eng), Spanish (spa), Portuguese (por), Hindi (hin), French (fra), German (deu), Italian (ita), Japanese (jpn), Latin (lat)	optical character recognition
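The three-letter codes in the table are the values that Tesseract's -l option accepts. As a rough illustration of page-by-page OCR, the sketch below calls the tesseract command-line tool once per page image. It assumes tesseract and the relevant language packs are installed locally; the function names are hypothetical, and this is not how DAMS itself invokes Tesseract.

```python
import subprocess

# Language codes listed as supported in the table above.
SUPPORTED_LANGUAGES = {
    "eng", "spa", "por", "hin", "fra", "deu", "ita", "jpn", "lat",
}


def build_tesseract_command(image_path: str, language: str = "eng") -> list[str]:
    """Return a tesseract command that prints recognized text to stdout."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported OCR language: {language}")
    # "stdout" as the output base prints the text instead of writing a file.
    return ["tesseract", str(image_path), "stdout", "-l", language]


def ocr_pages(image_paths, language: str = "eng"):
    """OCR page images one at a time, yielding (path, text) pairs."""
    for path in image_paths:
        result = subprocess.run(
            build_tesseract_command(path, language),
            capture_output=True,
            check=True,
        )
        yield path, result.stdout.decode("utf-8", errors="replace")
```

Because each page is a separate run, total processing time grows with the page count, which is most noticeable with slower language packs such as Japanese.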

User-provided full text

You can provide your own (user-provided) full text for an object in addition to the system-generated one by adding a datastream to the object with the label FULL_TEXT_CUSTOM and the ID FULL_TEXT_CUSTOM. The user-provided full text is not changed when the OBJ datastream is updated and the system regenerates its full text.
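For illustration only, the sketch below builds the request URL that a Fedora 3 style REST API would use to add such a datastream. Whether DAMS exposes this API to users is an assumption, as are the base URL, the example PID, the managed control group, and the text/plain MIME type; the function name is hypothetical. In practice the datastream may simply be added through the DAMS user interface.

```python
from urllib.parse import quote, urlencode

# Label and ID required by the documentation above.
DSID = "FULL_TEXT_CUSTOM"


def build_add_datastream_url(base_url: str, pid: str) -> str:
    """Build a Fedora 3 style addDatastream URL for FULL_TEXT_CUSTOM."""
    query = urlencode({
        "controlGroup": "M",        # assumption: managed content
        "dsLabel": DSID,            # label must match the datastream ID
        "mimeType": "text/plain",   # assumption: plain-text full text
    })
    # Keep the colon in the PID (e.g. "namespace:123") unescaped.
    return (
        f"{base_url.rstrip('/')}/objects/{quote(pid, safe=':')}"
        f"/datastreams/{DSID}?{query}"
    )
```

The plain-text file would then be sent as the body of a POST request to this URL with suitable credentials.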

During batch ingest of paged content, it is also possible to add custom OCR text for each page; see the documentation for batch ingesting paged content.