OCR Data Pipeline

aramhayr
Jul 25, 2024

Updated: Feb 3

OCR Processor

The OCR Processor implements the Corpus Data Pipeline, the a2-b2-c2-d2 path on the diagram in Թվային հումանիտար գիտություններ (Digital Humanities). The diagram below shows the main modules of the system.

Modules drawn with dotted lines are in progress. The remaining modules include three independent processors:

PDFProcessor - Reads PDF files and converts them into single-page JPEG images. There is an option to slice each page into 2 or 3 columns, which improves OCR quality for dictionaries, since these are frequently formatted as 2- or 3-column pages.
JPGProcessor - A standalone JPEG image processor. Like PDFProcessor, it slices pages into columns. It is intended for manually adjusted JPEGs: for example, when a page was not scanned strictly vertically or was not positioned uniformly horizontally, PDFProcessor might not be able to slice it correctly on its own.
GoogleUpload - Reads single-page JPEG images and converts them into POT (plain old text) format using Google Cloud upload functionality.
Data Cleansing - Removes hyphens and fixes typos, characters misread by OCR, etc.
Data Formatting - Structures the raw data for the target application. For example, dictionary data is sliced into dictionary entries; corpus data is organized into a set of pages under the book title; data for publishing is stitched together page by page, with hyphens added, etc.

Major functionality

The system takes PDF or JPEG files, converts them into text, and fixes (or rather, identifies) typos. It can handle most, if not all, known scripts: Chinese, Hebrew, Brahmic (Indic), Greek, Latin, Armenian, Georgian, Arabic, Cyrillic, etc. It does not do particularly well on mixed-script text. To get good quality for short sequences in another script, you need to separate them manually.

The system is not intended for handwriting.

The system can slice multicolumn pages and put the column text in the correct sequence, which improves the quality of the result. It checks that the columns are cut correctly. If it cannot automatically adjust the vertical cutting line, it reports a possible failure.
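The cut-line adjustment described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the page has already been reduced to a per-pixel-column count of dark pixels, and the names `find_cut`, `split_columns`, `search` and `max_ink` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def find_cut(ink: Sequence[int], nominal: int, search: int, max_ink: int) -> Optional[int]:
    """Return the x position near `nominal` with the least ink, or None if even
    the best candidate still crosses more than `max_ink` dark pixels."""
    lo = max(0, nominal - search)
    hi = min(len(ink) - 1, nominal + search)
    # Prefer the emptiest pixel column; break ties by distance to the nominal cut.
    best = min(range(lo, hi + 1), key=lambda x: (ink[x], abs(x - nominal)))
    return best if ink[best] <= max_ink else None


def split_columns(ink: Sequence[int], n_columns: int,
                  search: int = 5, max_ink: int = 0) -> List[Tuple[int, int]]:
    """Return (left, right) x-ranges of the columns in left-to-right reading order."""
    width = len(ink)
    edges = [0]
    for k in range(1, n_columns):
        nominal = k * width // n_columns  # where the cut would fall on an ideal page
        cut = find_cut(ink, nominal, search, max_ink)
        if cut is None:
            # Assumed failure report: no blank gutter close enough to the nominal cut.
            raise ValueError(f"no clean gutter near x={nominal}; manual adjustment needed")
        edges.append(cut)
    edges.append(width)
    return list(zip(edges[:-1], edges[1:]))


# Toy 20-pixel-wide page: text, a 2-pixel blank gutter at x=8..9, then text again.
profile = [3] * 8 + [0] * 2 + [3] * 10
print(split_columns(profile, 2))
```

On a page with no blank gutter near the expected position the sketch raises an error instead of guessing, which mirrors the "reports a possible failure" behavior in spirit only.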

The system also detects tilt and corrects it by aligning the page vertically. Automatic tilt detection is not an exact science, which is why a "force-rotate" mode is implemented: the system can rotate the page by a specified angle in radians, clockwise or anticlockwise, before slicing it into columns. This increases the success rate of the automatic page split.
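To make the "angle in radians, clockwise or anticlockwise" convention concrete, here is a small sketch of the geometry. The function names are hypothetical and the real configuration properties may differ. It assumes image coordinates with y growing downward, and it also shows the conversion needed for libraries whose rotate call takes counterclockwise degrees (Pillow's `Image.rotate` is one such call).

```python
import math
from typing import Tuple


def signed_ccw_degrees(angle_rad: float, clockwise: bool) -> float:
    """Convert (radians, direction) into counterclockwise-positive degrees."""
    degrees = math.degrees(angle_rad)
    return -degrees if clockwise else degrees


def rotate_point(x: float, y: float, cx: float, cy: float,
                 angle_rad: float, clockwise: bool) -> Tuple[float, float]:
    """Rotate pixel (x, y) about (cx, cy); image coordinates, y grows downward."""
    # With y pointing down, a positive angle in the standard rotation matrix
    # looks clockwise on screen.
    theta = angle_rad if clockwise else -angle_rad
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(theta) - dy * math.sin(theta),
            cy + dx * math.sin(theta) + dy * math.cos(theta))


# A point to the right of the center moves below it after a quarter turn clockwise.
print(rotate_point(1.0, 0.0, 0.0, 0.0, math.pi / 2, clockwise=True))
# A 0.02 rad clockwise correction expressed as counterclockwise degrees.
print(signed_ccw_degrees(0.02, clockwise=True))
```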

To improve OCR quality, there are tools for spectrum investigation. You can request a spectral histogram of a page image and use it to reset the brightness levels of poor-quality images.
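One way to read this, assuming the "spectral histogram" is a histogram of 8-bit gray levels, is the classic levels adjustment sketched below. The helper names and the 2% / 98% clipping points are illustrative assumptions, not the tool's documented defaults.

```python
from typing import Iterable, List


def histogram(pixels: Iterable[int]) -> List[int]:
    """Count how many pixels fall on each of the 256 gray levels."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    return hist


def percentile_level(hist: List[int], fraction: float) -> int:
    """Smallest gray level at which the cumulative count reaches `fraction` of all pixels."""
    target = fraction * sum(hist)
    running = 0
    for level, count in enumerate(hist):
        running += count
        if running > 0 and running >= target:
            return level
    return 255


def stretch(value: int, black: int, white: int) -> int:
    """Map `black` to 0 and `white` to 255, clamping everything outside that range."""
    if white <= black:
        return value  # degenerate histogram: leave the pixel unchanged
    scaled = (value - black) * 255 / (white - black)
    return max(0, min(255, round(scaled)))


# A washed-out scan: "ink" sits at gray level 100 and "paper" at 200.
pixels = [100] * 50 + [200] * 50
hist = histogram(pixels)
black = percentile_level(hist, 0.02)
white = percentile_level(hist, 0.98)
print(black, white, stretch(100, black, white), stretch(200, black, white))
```

Inspecting the histogram first makes it possible to pick the black and white points per book rather than applying one fixed contrast setting to every scan.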

The system automatically applies a list of regular expressions, in particular for hyphen removal and OCR typo correction. You can run these processes on individual or combined (stitched together) pages.
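A rule list of this kind can be sketched with Python's `re` module. The three rules below are illustrative only: the pipeline's actual regex list is not shown here, and the "digit 0 inside a word" fix in particular is a hypothetical example of an OCR typo rule.

```python
import re
from typing import Iterable, List, Tuple

# (pattern, replacement) pairs, applied in order. All three are illustrative.
RULES: List[Tuple[str, str]] = [
    (r"(\w)-\n(\w)", r"\1\2"),             # re-join a word hyphenated across a line break
    (r"(?<=[A-Za-z])0(?=[A-Za-z])", "o"),  # hypothetical OCR fix: digit 0 inside a word
    (r"[ \t]{2,}", " "),                   # collapse runs of spaces or tabs
]


def apply_rules(text: str, rules: List[Tuple[str, str]] = RULES) -> str:
    """Apply every rule to the text, in list order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def stitch(pages: Iterable[str]) -> str:
    """Join page texts so a word split across a page break can be re-joined too."""
    return "\n".join(page.strip("\n") for page in pages)


print(apply_rules("dic-\ntionary  w0rd"))
print(apply_rules(stitch(["the dic-", "tionary entry"])))
```

Running the rules on stitched pages matters for the first rule: a hyphen at the very end of a page only becomes fixable once the next page follows it. A blanket de-hyphenation rule also joins genuine hyphenated compounds that happen to break at a line end, which fits the earlier remark that the system identifies typos rather than fixing all of them.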

The User's Guide to OCR Data Pipeline has more details on functionality and configuration properties selection.

Performance

The performance of the system is quite good (see Appendix - Hardware Overview):

PDF split - typically 1 page/sec (1 sec per page); with slicing, 0.5 page/sec (2 sec per page). For example, it took 17 min to convert Ačaryan-v.1.pdf (43.4 MB, 703 pages, 623 of them with 2 columns) into 703 JPEG files.
Image to text conversion (OCR) - converting one JPEG page (of the size of the Ačaryan-v.1 pages) typically takes 5-7 sec. For example, it took 1 hour 18 min to convert the 703 JPEG files of Ačaryan-v.1 into 703 text files. Since the conversion takes place on Google Drive, the system also spends about 1 sec per file downloading the results to the local file system, which adds another 13 min. In total, converting 703 pages takes about 1.5 hours.
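The per-page rates can be cross-checked against the reported Ačaryan-v.1 run with a few lines of arithmetic. The numbers are the ones quoted above; only the variable and function names are made up for this sketch.

```python
PAGES = 703
# Wall-clock minutes reported for the Ačaryan-v.1 run.
REPORTED_MINUTES = {"pdf_split": 17, "ocr": 78, "download": 13}


def seconds_per_page(minutes: float, pages: int = PAGES) -> float:
    """Average processing time per page for one stage."""
    return minutes * 60 / pages


for stage, minutes in REPORTED_MINUTES.items():
    print(f"{stage}: {seconds_per_page(minutes):.2f} sec/page")

ocr_total = REPORTED_MINUTES["ocr"] + REPORTED_MINUTES["download"]
print(f"OCR + download: {ocr_total} min = {ocr_total / 60:.2f} h")
```

The averages come out to about 1.45 sec/page for the PDF split, 6.66 sec/page for OCR and 1.11 sec/file for the download, and the OCR stage plus download adds up to 91 min, which matches the 1.5 hours quoted.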

For more performance information see the User's Guide to OCR Data Pipeline.

Administering Text Processors

Pricing

This is the typical pricing per "normal" (standard) page (~1800 words):

Image type	Conversion	Formatting	Proofreading
Single column page	֏100 | $0.25	֏200 | $0.50	֏600 | $1.50
Multi-column page	N * (֏100 | $0.25)	N * (֏200 | $0.50)	N * (֏600 | $1.50)

Columns description:

Image type - Describes the input page layout. If one image contains 2 adjacent pages, it is treated as a 2-column page. The letter N in the other columns is the number of columns on a multi-column page.
Conversion - Conversion to text "as is", with no formatting, OCR typo fixing, or cleansing (e.g. removing hyphens and other trivial fixes).
Formatting - Typical formatting includes removal of hyphens and some OCR typos.
Proofreading - Guarantees correction of all OCR typos, including multilingual embeddings. A page is assumed to be roughly 1800 words.

For multilingual texts, dictionaries, poor-quality or non-standard page images, etc., the pricing might differ because of manual adjustments.

Up to 10 pages of the first order are free.
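As a rough aid for reading the table, the N * rate rule can be written as a small estimator. This is only a sketch of the arithmetic: it prices one service at a time, ignores the free first pages and any manual-adjustment surcharge, and the names `RATES` and `estimate` are invented for the example.

```python
from typing import Dict, Tuple

# Per standard page and per column: (Armenian dram, US cents), taken from the table.
RATES: Dict[str, Tuple[int, int]] = {
    "conversion": (100, 25),
    "formatting": (200, 50),
    "proofreading": (600, 150),
}


def estimate(pages: int, columns: int, service: str) -> Tuple[int, float]:
    """Return (AMD, USD) for `pages` images with `columns` columns each."""
    amd_rate, cent_rate = RATES[service]
    units = pages * columns  # a multi-column page counts as N single-column pages
    return units * amd_rate, units * cent_rate / 100


print(estimate(1, 1, "conversion"))
print(estimate(100, 2, "proofreading"))
```

For instance, 100 two-column pages count as 200 units, so proofreading them would come to ֏120,000 or $300 under these assumptions.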
