Text extraction in DAMS
DAMS has two text extraction engines built into the system: pdftotext and Tesseract OCR.
There are no known size thresholds or page limits for either of these text extraction tools; note, however, that Tesseract OCR is performed one page at a time. Some non-English Tesseract language packs (among those currently enabled, most notably Japanese) take a long time to process.
Islandora PDFTOTEXT
The pdftotext engine is based on xpdfreader. It extracts the text content stream from a PDF file, so it does not work on a PDF that consists only of images with no searchable text. Extraction runs automatically upon manual ingest with the PDF Content Model and generates a FULL_TEXT datastream. pdftotext also runs automatically during a batch ingest of PDFs, except with batch ingest method 7 (Paged Content).
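As a rough illustration of what text-stream extraction looks like outside DAMS, the sketch below builds a command line for the `pdftotext` command-line tool and treats whitespace-only output as a sign that the PDF contains only page images. The helper names and the choice of `-enc UTF-8` are assumptions made for this example; it does not describe how DAMS itself invokes the tool.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, text_path):
    """Return the argument list for a pdftotext run (hypothetical helper)."""
    # CLI usage: pdftotext [options] PDF-file [text-file]
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(text_path)]


def has_text_layer(extracted_text):
    """True when extraction produced something other than whitespace."""
    # An image-only PDF typically yields nothing but whitespace and form feeds.
    return bool(extracted_text.strip())


def extract_full_text(pdf_path, text_path):
    """Run pdftotext and return the text (needs pdftotext on the PATH)."""
    subprocess.run(build_pdftotext_command(pdf_path, text_path), check=True)
    return Path(text_path).read_text(encoding="utf-8")
```

If `has_text_layer` returns `False` for a file, the PDF probably needs OCR rather than text-stream extraction.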
Tesseract OCR
The Tesseract OCR engine uses optical character recognition to extract text from images. You can read further about this technology here: https://github.com/tesseract-ocr/tesseract/wiki. Tesseract OCR is available for use in conjunction with Paged Content:
- Manual ingest: While filling out the ingest form at the Book or Publication Issue level, the user may specify whether OCR should be performed and in what language (default is English).
- Batch ingest (method 7, Paged Content): One of the supported OCR languages must be specified in the manifest (see documentation here).
After ingesting a Book or Publication Issue, the user may use the "Manage" tab to perform or re-perform OCR at the Book or Publication Issue level. Additionally, the user may use the "Manage" tab at the Page level to perform or re-perform OCR for that particular Page.
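Because Tesseract works one page at a time, OCR for a Book or Publication Issue amounts to one Tesseract run per page image. The sketch below plans those runs with the `tesseract` command-line tool; the function names, the output file naming, and the use of the `eng` and `jpn` language codes are assumptions for illustration, not a description of the DAMS implementation.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Return the argument list for OCR of a single page image."""
    # CLI usage: tesseract imagename outputbase [-l lang]
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def plan_page_commands(page_images, output_dir, language="eng"):
    """Build one command per page, in page order (hypothetical helper)."""
    commands = []
    for index, image in enumerate(page_images, start=1):
        base = Path(output_dir) / f"page-{index:04d}"
        commands.append(build_tesseract_command(image, base, language))
    return commands


def run_page_commands(commands):
    """Execute the planned commands sequentially (needs tesseract on the PATH)."""
    for command in commands:
        subprocess.run(command, check=True)
```

Planning the commands separately from running them makes it straightforward to redo OCR for a single page, which mirrors the Page-level option on the "Manage" tab.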
Text Extraction Engine | DAMS Content Model | Datastream Created | Searchable in DAMS | Extracts Text From Images | Supported Languages | Technology Used |
---|---|---|---|---|---|---|
PDFTOTEXT | PDF Content Model | FULL_TEXT | yes | no | Any language that is character rather than symbol based (e.g., Arabic, Japanese) | text stream extraction from PDF file |
TESSERACT | Paged Content | OCR | yes | yes | | optical character recognition |
User-provided full text
You can provide your own (user-provided) full text for an object in addition to the system-generated one by adding a datastream to the object with the label FULL_TEXT_CUSTOM and the ID FULL_TEXT_CUSTOM. The user-provided full text is not changed when the OBJ datastream is updated and the system regenerates the full text.
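The relationship between the two full-text datastreams can be pictured with a small model. The sketch below represents an object's datastreams as a plain dictionary, which is an assumption made purely for illustration (DAMS does not expose such a function): replacing OBJ regenerates FULL_TEXT, while FULL_TEXT_CUSTOM is carried over unchanged.

```python
def regenerate_full_text(datastreams, new_obj, extract):
    """Return a copy of the datastreams after OBJ has been replaced.

    `datastreams` maps datastream IDs to content; `extract` is any
    callable that turns OBJ content into text (hypothetical stand-in
    for the system's extraction step).
    """
    updated = dict(datastreams)
    updated["OBJ"] = new_obj
    # Only the system-generated text is rebuilt from the new OBJ.
    updated["FULL_TEXT"] = extract(new_obj)
    # FULL_TEXT_CUSTOM, if present, is deliberately left as it was.
    return updated


before = {"OBJ": b"v1", "FULL_TEXT": "old text", "FULL_TEXT_CUSTOM": "curated text"}
after = regenerate_full_text(before, b"v2", lambda obj: "new text")
```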
During batch ingest of paged content it is also possible to add custom OCR text for each page, cf. the documentation for batch ingesting paged content.