Overview
The goal of this document is to provide guidelines on tuning and running the OCR Data Pipeline and to share the experience of editing and spell checking volume 1 of Ačaryan's Dictionary (Հ.Աճառյան. ՀԱԲ. ՅԱՌԱՋԱԲԱՆ). The technical description of the environment and configuration files is given in Administering Text Processors, which covers how to install and configure the system. This page is about selecting property values for configuration.
It is written as a walkthrough describing the process of converting volume 1 of Ačaryan's Dictionary from a page-image PDF into a plain old text (POT, nowadays encoded in Unicode) file. This style helps identify best practices for the manual activities during conversion.
Using Applications
We assume that the applications are installed in the ocr-processor/ root directory according to the Directory structure section of Administering Text Processors. The data/ directory contains all input and output data.
Step A. Splitting PDF into JPEG pages
As the title suggests, this step splits the PDF [image] file into pages in JPEG format.
Configuration
Put the PDF file, for example Ačaryan-1.pdf, into the ocr-processor/data/ directory and set these properties for the PDF processor:
source-directory=data
target-root=data
output-folder=jpegs
folder-to-split=split
output-format=jpg
output-extension=.jpg
start-page=1
start-column-page=77 // 2 column pages start in Ačaryan-1.pdf
end-column-page=700 // end of 2 column pages
Note. For all applications the page numbers in the file names are not the actual page numbers in the original. For example, the last marked page in Ačaryan-1.pdf is 698. After that there are 2 more pages: the title (?) page and the page with the publishing info: Պատվեր 356, Տպաքանակ 25000 (print run), Հանձնված է արտադրություն 18/III 1970 թ., etc. [Mystery: the print run of the other 3 volumes is 20000. This can only be explained by an intelligently designed UFO hovering over the Երեւանի համալսարանի տպարան, Երեւան, Աբովյան փող. 52 during the printing of volume 1.] This means that a file name's page number is the original page number + 2.
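For reference, the splitting itself can be sketched in a few lines with the pdf2image wrapper around poppler. This is an illustration under assumptions; the actual PDF processor and its internals may differ:

# Sketch of Step A: render PDF pages to JPEGs and route the 2-column pages
# into a split/ subfolder (an illustration; the real processor may use a
# different library internally).
from pathlib import Path
from pdf2image import convert_from_path

source = Path("data/Ačaryan-1.pdf")
target = Path("data/jpegs") / source.stem            # data/jpegs/Ačaryan-1
split_dir = target / "split"                         # 2-column pages go here
split_dir.mkdir(parents=True, exist_ok=True)

start_column_page = 77                               # from the configuration above

for number, page in enumerate(convert_from_path(str(source), dpi=300), start=1):
    # File page numbers, not the original page numbers (see the Note above).
    folder = target if number < start_column_page else split_dir
    page.save(folder / f"p.{number}.jpg", "JPEG")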
Results
After about 20 minutes of processing the folder content looks like this:
data
├── Ačaryan-1.pdf
└── jpegs
└── Ačaryan-1
├── p.1.jpg
...
├── p.76.jpg
└── split
├── p.77.jpg
...
├── p.700.jpg
├── p.701.jpg
├── p.702.jpg
└── p.703.jpg
In the jpegs/ folder we have the pages that should not be split. The rest of the pages, about 630, went into the split/ folder.
Step B. Preparing JPEG pages for OCR
This is the most complex functionality of all, but since I managed to guarantee that there are no false positives, it is also the most rewarding - it adds the most value to the OCR process. All negatives (pages that fail the check) go into a specified folder, where the end user can review the original and the cut page.
The system takes the page images from the previous step, adjusts vertical alignment and/or splits the image along vertical lines, and stacks the split images one after another. Verticality is necessary for an accurate cut: if the page is tilted too much, it is impossible to draw a vertical cutting line in the gap between columns. A non-vertical cut is too complicated to implement, and it is unknown how well Google OCR handles tilted images.
The system can automatically detect the tilt and fix it.
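One common way to estimate the tilt is to threshold the ink pixels and take the angle of their minimum-area bounding rectangle. Here is a minimal sketch with OpenCV and NumPy; it illustrates the idea and is not necessarily the algorithm Step B uses:

# Illustration only: estimate the page tilt from the ink pixels and rotate the
# page back to vertical when the tilt exceeds the configured threshold.
import cv2
import numpy as np

def estimate_tilt_rad(path):
    """Rough skew estimate of the text block, in radians (illustration only)."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Otsu threshold; ink becomes the foreground.
    _, ink = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    points = np.column_stack(np.where(ink > 0))[:, ::-1].astype(np.float32)  # (x, y)
    angle = cv2.minAreaRect(points)[-1]      # rectangle angle in degrees
    # The angle convention differs between OpenCV versions; fold it into a
    # small skew around zero either way.
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90
    return float(np.deg2rad(angle))

def deskew(path, out_path, threshold=0.005):   # cf. tilting-threshold below
    tilt = estimate_tilt_rad(path)
    image = cv2.imread(path)
    if abs(tilt) > threshold:
        h, w = image.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2, h / 2), -np.degrees(tilt), 1.0)
        image = cv2.warpAffine(image, m, (w, h), borderValue=(255, 255, 255))
    cv2.imwrite(out_path, image)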
Configuration
Step B allows:
1. Splitting the page into 2 or more columns and positioning them one after the other in the image. This functionality is useful not only for dictionaries but also for books that are scanned with 2 adjacent pages together.
2. Rotating images by a specified angle in radians, counterclockwise or clockwise. This functionality is useful not only for fixing the tilt, but also for rotating by 90° (or π/2), for example, double-page scans that are positioned vertically.
3. Automatically detecting the tilt and rotating the page to a vertical position so that the column splitter can split the columns accurately.
4. Checking whether the slice goes through the empty space between page columns.
It is worth noting that #3 and #4 are not "exact science". In both cases false negatives are possible, while for the former false positives are possible as well. Note that automatic tilt detection takes a lot of time (several minutes per page).
It is also very important to note that #4 guarantees the correctness of the split - no false positives.
While debugging and testing I ran autodetect on the whole of Ačaryan's volume 1 - 703 pages. It took about 17 hours (from 2024-09-17 23:02:16 to 2024-09-18 15:53:44). 34 files out of the ~620 slated for splitting were rotated because the tilt was more than 0.05 radians. I did no accurate or precise testing, but it looked to me that only one page really needed rotation. The others were over-rotated clockwise (barely visible to the naked eye). However, the new tilt did not prevent correct splitting.
Configuring and executing Step B is the most challenging part. Here are some recommendations to overcome the challenges:
Set (guess) the offsets for the first run by reviewing the original pages. The offset is defined as a fraction of the overall width of the image: for example, 0.04 is one twenty-fifth, while 0.1 is one tenth. It shows how far you step towards the right side of the page before cutting the remaining width right in the middle (see the sketch after the first-run settings below).
If you guessed correctly and got a manageable number of failed pages - say, about a dozen - then you can process them individually. But if you guessed wrong - which I did 3 times out of 4 volumes - then review a couple of split pages in the review/ folder, adjust the offsets and rerun. You can rerun in a slightly different mode: copy the list of failed pages into the list=[] property in the configuration file and set from-list=yes. Make sure to remove the review/ folder or rename it in the config file. The latter approach gives you a chance to understand the impact of offset changes on the failed files.
The optimal approach is:
Run the column splitting on the original images.
The system will send the images that need review to the review/ folder. For Ačaryan's volume 1, 183 out of ~620 files went into the review/ folder. Along with moving the split and original files into the review/ folder, the system prints out the list of those files in a format ready for copy/paste into the configuration file for reprocessing. You copy that list into the list property in the configuration file, set from-list=yes, and review the review/ folder content. The first run took about 40 min.
These are the settings for the first run:
source-directory=data/jpegs/Ačaryan-1/split
target-directory=data/jpegs/Ačaryan-1
odd-offset=0.04
even-offset=0.1
width=0.44
detect-tilt=no
tilting-threshold=0.005
statistical-error=0.06
# These precision settings actually tell the system to use only the upper left quadrant of a page to determine the page tilt:
row-precision=2
column-precision=2
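To make the offset and the blank-gap check (feature #4) concrete, here is my reading of them as a minimal sketch, assuming Pillow and NumPy; the actual splitter may compute the cut differently:

# Sketch of how I understand the offset and the blank-gap check (feature #4);
# the actual splitter implementation may differ.
import numpy as np
from PIL import Image

def cut_position(width_px, offset):
    """Step `offset` of the width to the right, then cut the rest in the middle."""
    skip = int(width_px * offset)
    return skip + (width_px - skip) // 2

def gap_is_blank(path, offset, white_level=245):
    """True if the vertical cut line runs through empty space between columns."""
    page = np.asarray(Image.open(path).convert("L"))
    x = cut_position(page.shape[1], offset)
    # No false positives: if any pixel on the cut line is darker than near-white,
    # the page goes to the review/ folder instead of being split.
    return bool((page[:, x] >= white_level).all())

# Example with the first-run settings above on a hypothetical 2000 px wide page:
# cut_position(2000, 0.04) == 80 + 960 == 1040, i.e. 40 px right of the exact middle.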
You then move the correctly split pages from the review/ folder into the target folder, remove them from the list property in the configuration file, and set the offsets for the remaining listed pages:
from-list=yes
list=[p.498.jpg, p.583.jpg, p.599.jpg, p.608.jpg]
odd-offset=0.1
even-offset=0.04
After reviewing the 183 failed files from the first run I split them into 2 groups:
Pages 530 to 700 needed about a 240-pixel offset to the right, while the odd pages needed the same to the left. These pages needed an offset value swap. [Since these pages go sequentially with the opposite offset pattern, I suspect they were scanned by different people, on a different day, or on a different scanner - or a combination of all of the above.]
The remaining 13 pages I processed almost individually, because no obvious pattern popped up. Similarly, I processed the 2 files that failed from the second run of 170 pages with the reversed offsets. One file (p.701) needed tilt correction. Another, p.702, was upside down (I discovered this by accident).
Beware of false negatives (see feature #4 above). They happen in rare cases when it is not a regular page: for example, it starts with a title line in the middle of the page, or has "non-standard" markings like horizontal lines, spots of printing ink, coffee stains, etc.
There are files with "borderline splitting" - the cut goes right next to the text. There is not even a white pixel between the text and the image frame. I did not reprocess those, waiting to see how the OCR would react to this situation.
Important. Before starting a new run ensure that the review/ folder is removed or empty.
Important. If you cancel the running process with Ctrl+C, some "bad" files might remain in the "good" files folder. Make sure to handle them accurately: remove and rerun.
Results
For volume 2 I set both offsets to the same value - 0.05 - because visually the offsets looked similar for odd and even pages. 199 files (out of ~690) failed the split. After review, the majority of the failed files needed an almost similar offset, just a bit lower. But there were some even pages that needed a bigger offset - 0.06.
Future
Along with minor improvements, mostly in error handling, I plan on adding contrast enhancement functionality for poor quality scans.
I also have a faster tilt detection algorithm in mind; however, there is not much need for an automated fix at the moment.
Step C. OCR
The OCR Step takes the JPEG pages created by the previous 2 processors - Steps A and B - sends them to Google Drive to be converted to POT, and downloads the resulting files into the txts/ folder underneath data/ (into a subfolder named after the relevant tome, i.e. the PDF file name).
Conversion of about 650 files takes a bit more than 3 hours; the download takes roughly 10 min per volume.
This Step consists of 3 stages: 1) upload to Google Drive, 2) convert to POT in the specified folder, and 3) download. They can run separately or in a single run. For the logic see the Configuration section below.
Note. The system downloads the whole content of Google Drive's [txts] folder.
Configuration
# Folder with the original [pdf, jpg] files. Assumption: the original PDF filename has a dot before the extension. In Step A a directory per original file name is created (for example, Ačaryan-1; see Step A, Results).
-- source-directory=data
-- upload-folder=jpegs
-- mime-type=image/jpeg
-- filter=*.jpg
# Root folder and subfolder to store the POT files on Google Drive. The pages go into a folder named after the original PDF (for example, Ačaryan-1; see Step A, Results) underneath <target-root>/<output-folder> on Google Drive.
-- target-root=ocr
-- output-folder=txts
-- output-extension=txt
# If upload=yes, upload jpegs from the local 'source-directory' into the Google 'upload-folder'. If from-local=yes, then this property is ignored - automatically upload=no
-- upload=no
# If yes, convert the files in the Google 'upload-folder' underneath the 'target-root'. If upload=yes, then this property is ignored - automatically convert=no
-- convert=no
# If yes, select files for conversion from the local 'upload-folder' under the 'source-directory'
-- from-local=yes
# If yes, select files for download from the Google 'output-folder' under the 'target-root'. If upload=yes, then this property is ignored - automatically download=no
-- download=yes
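For reference, Google Drive performs the OCR when an image is imported as a Google Doc. Here is a minimal sketch of the upload-convert-download idea with the Drive API v3 Python client, assuming credentials are already set up; the processor's own code and property handling may differ:

# Minimal sketch of the upload/convert/download idea using the Google Drive
# API v3 Python client; assumes `creds` are already obtained.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def ocr_one_page(creds, jpeg_path, folder_id, out_txt):
    drive = build("drive", "v3", credentials=creds)
    # Importing a JPEG as a Google Doc makes Drive run OCR on it.
    meta = {"name": jpeg_path, "mimeType": "application/vnd.google-apps.document",
            "parents": [folder_id]}
    media = MediaFileUpload(jpeg_path, mimetype="image/jpeg")
    doc = drive.files().create(body=meta, media_body=media,
                               ocrLanguage="hy", fields="id").execute()
    # Export the recognized text as plain old text (POT).
    text = drive.files().export(fileId=doc["id"], mimeType="text/plain").execute()
    with open(out_txt, "wb") as f:
        f.write(text)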
Results
The "verticalization" of page in Step B increases quality of OCR. However, some edge cases still are handled manually. For example, the system does not work well with small amounts of foreign script embeddings, which is the case with Ačaryan's dictionary. When you manually create separate images of the foreign (to Armenian) embedded strings it works out perfectly.
It also cannot handle the [conditionally) 2 column pages, such as p.62 in Ačaryan-1, where abbreviations are given in first narrow column and definitions (spelling) in the wider column that a tab length further from the first. The system converts first column, removing empty strings and then - the second. It does not do this consistently - when an abbreviation is long and gets closer to spelling the whole thing goes into one string. For example, the pages from p.37 (30 in the original) to p.76 needs a lot of manual work.
p.156 has 2-3 letters from the column 1 at the beginning of column 2. I either did not notice at review or left deliberately: OCR added these letters in individual strings - noy much manual overhead to clean it up.
Step D. Cleansing data
Cleansing consists of these major operations: 1) removing hyphenation signs (other regex replacements can be added), 2) stitching the page texts together into a single (or multiple) TXT file, and 3) fixing OCR typos.
There are 3 modes of running the application:
Convert - apply the regex-based string replacements, page by page, and put the results into a predefined folder.
Stitch all pages, or specified intervals of pages, together and
convert and/or
put them into a predefined folder.
You can configure and run the above modes separately or in combination.
The Armenian Parser [run in spell check mode] detects typos that you fix manually. This can be done in this or the next step.
The system can first combine the pages into single or multiple files (partitions) and then perform the requested regex processing.
Configuration
# Folders with the original and resulting texts. Assumption: the original filename has only one dot (before the extension)
-- source-directory=data/txts/Ačaryan-1
-- target-directory=data/processed/Ačaryan-1
# Extension of the resulting files.
-- extension=txt
# file name filter
-- filter=*.txt
# Pipe ('|') separated list of regEx patterns for replacement. regex-file is the name of a text file with a list of lines in the same <regEx> | <replacement> format.
-- regex-list=(\\p{L})([\\-\\–\\—]\\s+(?:[^\\p{Ll}])+?)(\\p{Ll})||$1$3
# Read regEx patterns from the file
-- regex-file=
# Stitch pages. Put several pages into one file per specified intervals
-- stitch=no
-- page-intervals=1-2, 3-6, 8-36, 37-76, 77-701, 702-703
-- mark-pages=no
-- number-format=------p.
-- number-offset=-2
# In verbose mode the system logs a report on matching regex replacements. This property specifies the prefix to be added to the report file.
-- report-prefix=report
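To make the stitching and hyphen removal concrete, here is a simplified sketch. It uses the third-party regex module (which supports \p{L} classes) and only approximates the regex-list pattern and marker format above; the real pipeline reads its patterns and intervals from the configuration, and the paths are examples:

# Simplified sketch of the cleansing idea: stitch a page interval together,
# optionally insert page markers, and remove end-of-line hyphenation.
from pathlib import Path
import regex

# Approximation of the regex-list entry above: letter, hyphen/dash, line break, lowercase letter.
HYPHEN = regex.compile(r"(\p{L})[-\u2013\u2014]\s*\n\s*(\p{Ll})")

def stitch_and_clean(folder, first, last, out_file,
                     mark_pages=False, number_format="------p.", number_offset=-2):
    parts = []
    for n in range(first, last + 1):
        page = Path(folder) / f"p.{n}.txt"
        if not page.exists():
            continue
        if mark_pages:
            # number-offset maps the file page number back to the original page number.
            parts.append(f"{number_format}{n + number_offset}")
        parts.append(page.read_text(encoding="utf-8"))
    text = "\n".join(parts)
    # Join word parts broken across a line (or page) boundary.
    text = HYPHEN.sub(r"\1\2", text)
    Path(out_file).write_text(text, encoding="utf-8")

# e.g. stitch_and_clean("data/txts/Ačaryan-1", 8, 36,
#                       "data/processed/Ačaryan-1/p.8-36.txt", mark_pages=True)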
Results
After removing hyphens and automatically fixing typos, or having the Armenian Parser detect them, you start manual editing and formatting. The formats of the original pages can differ. For example, in Ačaryan-1 the first 2 pages are the title page and information about the book, then come 4 pages of editors' foreword (3-6), the author's introduction (8-36), abbreviations (37-76), 2-column pages (77-701), and publishing information pages at the end. These partitions are defined by the page-intervals configuration property (see the Configuration section above).
Note that inserted page markers can go away during hyphen removal (when a hyphenated word crosses pages). In the report files the matching pattern for a hyphen is shown with the last letter before the hyphen and the first letter after it, to make reviewing and locating the changes easier. The number after the matching pattern is the offset in characters within the file.
After automatic and manual editing and formatting, the POT files can be used for publishing or for storing in a corpus. For Dictionaries that will go into the Thesaurus, the POT file will be cut into Dictionary entry JSONs and stored in the [NoSQL] database (more precisely, in the warehouse and the Search Engine index).
For corpus storage it might be beneficial to restore the original paging and store the texts as a set of pages, so that the Search Engine can respond with the document name and page.
Step E. Manual (Visual) review and correction
At this Step you review the automatically generated POTs and fix typos with the help of the Armenian Parser (spell checker).
As expected, the pages 38-77 of Ačaryan-1 that contain abbreviations were messed up. To correct the formatting and use the data for automatically creating a special Armenian Parser dictionary of abbreviations, separate from the existing ones (for spell checking Ačaryan's dictionary):
Copy the files created in Step D into a separate work/ folder in the data/ directory, and
use a spreadsheet to bring the columns together - at Step D the system at best puts them one after another: you can move the abbreviations and definitions into two adjacent columns, separate them with a unique sign, for example '==', align them, and put the result back into the POT file for further automatic processing.
To understand some of the inner workings of the Google OCR you can use the CharDetector utility.
The utility identifies the character codes used in the text and displays them in \uHHHH format. It processes a given text file, creates a report, and appends it to the end of the file after the original content.
These are the CharDetector utility modes:
-m - reports a map of all characters with the count in the square brackets in front of the code and character name
-p - converts all characters in the text into \uHHHH representation and does not write a report
-w - converts space character to \uHHHH (except for regular space \u0020) in the text and writes a report on the conversions
-r - removes no-break-spaces:\u00A0,\u202F,\u2060,\uFEFF, from the text and writes a report on the removals
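A rough, hypothetical equivalent of the -m and -r modes in a few lines of Python, just to show what the reports contain (unicodedata gives the character name; the Unicode block that the utility also prints is omitted here):

import unicodedata
from collections import Counter

NO_BREAK_SPACES = "\u00A0\u202F\u2060\uFEFF"

def char_map(text):
    """One '[ count ] - \\uHHHH - NAME' line per character, most frequent first."""
    return [f"[ {n} ] - \\u{ord(ch):04X} - {unicodedata.name(ch, 'UNKNOWN')}"
            for ch, n in Counter(text).most_common()]

def remove_no_break_spaces(text):
    return text.translate({ord(ch): None for ch in NO_BREAK_SPACES})

# Hypothetical file name, for illustration only:
with open("data/processed/Ačaryan-1/p.37-76.txt", encoding="utf-8") as f:
    print("\n".join(char_map(f.read())))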
Why is this utility helpful? During the rounds of painful spell checking (even when using the Armenian Parser) I noticed weird behavior of the parser and the dictionary admin. I immediately suspected some invisible characters as the culprits. When I built the map of the characters in pages 38-77 I found 19 \uFEFF (ARABIC PRESENTATION FORMS B | ZERO WIDTH NO-BREAK SPACE) invisible characters in the text. I do not know which application inserted them - most likely the OCR.
I also found this information, which seems useful for cleaning up the text:
[ 588 ] - \u0028 - BASIC LATIN | LEFT PARENTHESIS
[ 596 ] - \u0029 - BASIC LATIN | RIGHT PARENTHESIS
[ 457 ] - \u003A - BASIC LATIN | COLON
[ 2 ] - \u003B - BASIC LATIN | SEMICOLON
[ 4 ] - \u005B - BASIC LATIN | LEFT SQUARE BRACKET
[ 2 ] - \u005D - BASIC LATIN | RIGHT SQUARE BRACKET
[ 1 ] - \u005F - BASIC LATIN | LOW LINE
[ 2 ] - \u007E - BASIC LATIN | TILDE
[ 10 ] - \u0589 - ARMENIAN | ARMENIAN FULL STOP
[ 20 ] - \u2013 - GENERAL PUNCTUATION | EN DASH
[ 528 ] - \u2014 - GENERAL PUNCTUATION | EM DASH
[ 2 ] - \u2019 - GENERAL PUNCTUATION | RIGHT SINGLE QUOTATION MARK
[ 19 ] - \uFEFF - ARABIC PRESENTATION FORMS B | ZERO WIDTH NO-BREAK SPACE
[The utility has a scientific side effect - see the Appendix below.]
The number of right parentheses being bigger than the number of left ones is an indication of enumerations like 1), a) or ա). It turned out that the two extra left square brackets were a misinterpretation of the Latin numeral II. I fixed it on p.60 (these are original pages - I'll address the issue with page assignment below).
I converted the semicolons to colons and then all colons to ARMENIAN FULL STOP.
The LOW LINE was denoting a page interval, էջ 529-800, at the beginning of p.40. I changed it to a dash (minus) even though it looks exactly like a LOW LINE in the original.
One TILDE on p.53 was a hyphen, and the other was a page interval sign on p.38.
All EN DASHes were converted to EM DASH.
I did not pay much attention to cases where there were commas instead of dots, or missing or extra spaces. I was pursuing two major goals: fixing important typos and creating an accurate dictionary of abbreviations for use in spell checking the rest of the volumes.
I removed the ZERO WIDTH NO-BREAK SPACEs from the text: I got 10 fewer words unrecognized by the Armenian Parser in the 'other' category, but got an extra one in 'latin'. It came from one of the fixed 'other' words.
Now there are ~510 unrecognized words. The majority are grabar (Classical Armenian) words that even after conversion to the revised orthography remain unknown to the Armenian Parser.
It turned out that other parts of volume 1 also contained ZERO WIDTH NO-BREAK SPACEs, as did the abbreviations Dictionary I created specifically for the project.
With the help of this utility I detected that these words contain Latin letters (it is very hard to spot the issue by eye):
Յu
ոuկեդարեան
Sոբ
It is very important to identify such issues during data cleansing because later the search engine will not find ոսկեդարեան in the Ačaryan's Dictionary text (try searching in this text). The same issue can occur with the digit 0 and some other characters, in both directions.
In the end I achieved this quality for pp.8-36 of volume 1 - this is the spell checker's report:
armWordsProcessed 7923
linesProcessed 244
misspelledArmenian 235
misspelledLatin 177
misspelledOther 15
misspelledRussian 17
wordsTagged 7594
Keep in mind that the Dictionary is written in Classic orthography and converted heuristically to Revised (see the Armenian Parser), and - in addition - even the intro contains a lot of grabar. I did my best to ensure that the words reported as "misspelled" above are written correctly. I did not include them even in the Ačaryan-specific parser dictionaries, so as not to pollute them with rarely used words. I did add Lagarde, Hübschmann, etc. because they might come up in other parts of the books.
You can check out the result at the Հ.Աճառյան. ՀԱԲ. ՅԱՌԱՋԱԲԱՆ page - I just copied the newly prepared source text into it. The formatting is gone - I do not want to spend time restoring it now.
Just to convey my struggles, here is the Armenian Parser report for pp.77-701 before fixing:
armWordsProcessed 218297
linesProcessed 11522
misspelledArmenian 20982
misspelledLatin 15019
misspelledOther 3165
misspelledRussian 70
wordsTagged 184737
and after fixing
<manual editing is in progress>
Step F. Formatting for target applications
Appendix
The CharDetector utility can be used for calculating the distribution of letters in a text (does the distribution have any linguistic importance?). For example, this is the distribution of Armenian letters in the Բնագրի վերածումը մուտքային տվյալների article:
Small letters
1048 - \u0561 | ARMENIAN SMALL LETTER AYB
642 - \u0576 | ARMENIAN SMALL LETTER NOW
515 - \u0580 | ARMENIAN SMALL LETTER REH
467 - \u0565 | ARMENIAN SMALL LETTER ECH
461 - \u0578 | ARMENIAN SMALL LETTER VO
377 - \u0582 | ARMENIAN SMALL LETTER YIWN
317 - \u056B | ARMENIAN SMALL LETTER INI
247 - \u0574 | ARMENIAN SMALL LETTER MEN
233 - \u057F | ARMENIAN SMALL LETTER TIWN
221 - \u056F | ARMENIAN SMALL LETTER KEN
208 - \u0575 | ARMENIAN SMALL LETTER YI
122 - \u057E | ARMENIAN SMALL LETTER VEW
118 - \u0568 | ARMENIAN SMALL LETTER ET
109 - \u0569 | ARMENIAN SMALL LETTER TO
109 - \u057D | ARMENIAN SMALL LETTER SEH
101 - \u0562 | ARMENIAN SMALL LETTER BEN
93 - \u0563 | ARMENIAN SMALL LETTER GIM
86 - \u057C | ARMENIAN SMALL LETTER RA
82 - \u0584 | ARMENIAN SMALL LETTER KEH
81 - \u0581 | ARMENIAN SMALL LETTER CO
79 - \u056C | ARMENIAN SMALL LETTER LIWN
77 - \u056E | ARMENIAN SMALL LETTER CA
75 - \u0570 | ARMENIAN SMALL LETTER HO
61 - \u0571 | ARMENIAN SMALL LETTER JA
59 - \u0567 | ARMENIAN SMALL LETTER EH
57 - \u057A | ARMENIAN SMALL LETTER PEH
44 - \u0572 | ARMENIAN SMALL LETTER GHAD
42 - \u0564 | ARMENIAN SMALL LETTER DA
41 - \u0577 | ARMENIAN SMALL LETTER SHA
28 - \u0566 | ARMENIAN SMALL LETTER ZA
24 - \u056D | ARMENIAN SMALL LETTER XEH
22 - \u0579 | ARMENIAN SMALL LETTER CHA
15 - \u057B | ARMENIAN SMALL LETTER JHEH
14 - \u0583 | ARMENIAN SMALL LETTER PIWR
9 - \u0585 | ARMENIAN SMALL LETTER OH
9 - \u0587 | ARMENIAN SMALL LIGATURE ECH YIWN
8 - \u056A | ARMENIAN SMALL LETTER ZHE
6 - \u0573 | ARMENIAN SMALL LETTER CHEH
3 - \u0586 | ARMENIAN SMALL LETTER FEH
Capital letters
28 - \u0540 | ARMENIAN CAPITAL LETTER HO
27 - \u0533 | ARMENIAN CAPITAL LETTER GIM
19 - \u0531 | ARMENIAN CAPITAL LETTER AYB
10 - \u0532 | ARMENIAN CAPITAL LETTER BEN
9 - \u0544 | ARMENIAN CAPITAL LETTER MEN
8 - \u053F | ARMENIAN CAPITAL LETTER KEN
8 - \u054F | ARMENIAN CAPITAL LETTER TIWN
7 - \u0546 | ARMENIAN CAPITAL LETTER NOW
7 - \u0555 | ARMENIAN CAPITAL LETTER OH
6 - \u0547 | ARMENIAN CAPITAL LETTER SHA
6 - \u054D | ARMENIAN CAPITAL LETTER SEH
5 - \u053B | ARMENIAN CAPITAL LETTER INI
5 - \u053C | ARMENIAN CAPITAL LETTER LIWN
5 - \u0548 | ARMENIAN CAPITAL LETTER VO
4 - \u0537 | ARMENIAN CAPITAL LETTER EH
4 - \u053E | ARMENIAN CAPITAL LETTER CA
3 - \u054E | ARMENIAN CAPITAL LETTER VEW
3 - \u0554 | ARMENIAN CAPITAL LETTER KEH
2 - \u0538 | ARMENIAN CAPITAL LETTER ET
1 - \u0534 | ARMENIAN CAPITAL LETTER DA
1 - \u0535 | ARMENIAN CAPITAL LETTER ECH
1 - \u053A | ARMENIAN CAPITAL LETTER ZHE
1 - \u0543 | ARMENIAN CAPITAL LETTER CHEH
1 - \u0545 | ARMENIAN CAPITAL LETTER YI
1 - \u054A | ARMENIAN CAPITAL LETTER PEH
1 - \u0550 | ARMENIAN CAPITAL LETTER REH
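A sketch of how such a distribution can be computed (the CharDetector utility itself may do it differently; Armenian capital letters are U+0531-U+0556, small letters U+0561-U+0587):

import unicodedata
from collections import Counter

def armenian_letter_distribution(text):
    # Count only characters in the Armenian letter range.
    counts = Counter(ch for ch in text if "\u0531" <= ch <= "\u0587")
    for title, lo, hi in (("Small letters", "\u0561", "\u0587"),
                          ("Capital letters", "\u0531", "\u0556")):
        print(title)
        for ch, n in counts.most_common():
            if lo <= ch <= hi:
                print(f"{n} - \\u{ord(ch):04X} | {unicodedata.name(ch)}")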