The User's Guide to the OCR Data Pipeline provides useful information for setting up the environment and configuring the system.
System Diagram
Directory structure
This is the OCR Processor directory structure:
ocr-processor
├── bin
│   ├── GoogleUpload-1.0.0-jar-with-dependencies.jar
│   ├── JPGProcessor-1.0.0-jar-with-dependencies.jar
│   ├── PDFProcessor-1.0.0-jar-with-dependencies.jar
│   └── CleansingProcessor-1.0.0-jar-with-dependencies.jar
├── cloud-access
│   ├── credentials.json
│   └── tokens
│       └── StoredCredential
├── config
│   ├── googledocs.properties
│   ├── jpgprocessor.properties
│   └── pdfsplitmain.properties
├── data
│   ├── Ačaryan-1.pdf
│   └── work
│       ├── jpegs
│       │   ├── Ačaryan-1
│       │   │   └── sliced-p.700.jpg
│       │   ├── p.77.jpg
│       │   └── p.78.jpg
│       └── txts
│           └── Ačaryan-1
│               ├── p.76.txt
│               └── sliced-p.700.txt
├── rename-pages.bash (removes prefixes, such as 'sliced', from file names to make them p.1, p.2, etc.)
├── missing-pages.bash (checks whether the directory contains all consecutive pages)
├── clean.bash
├── jpeg.bash
├── ocr.bash
└── split.bash
The 4 JAR files are in the bin/ folder. The relevant configuration files are in the config/ folder.
The cloud-access/ folder contains the credentials.json file that you download after creating a Google Cloud account and activating the Google Drive API. The tokens/ folder contains persistent security tokens, so that authentication is not requested at every run of the GoogleUpload processor.
The bash files clean.bash, jpeg.bash, ocr.bash, and split.bash are command-line scripts that activate the 4 processors; rename-pages.bash and missing-pages.bash are helper scripts.
The data/ folder is for storing the original PDF files and the resulting .jpg and .txt files.
Processor functionality
The processors check for the existence of the target files before writing. If a target file already exists, no processing for that file takes place. If you want to regenerate a file, you need to either remove it or direct the output to a different location (folder).
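For illustration, the check amounts to something like the following sketch (the class and method names are hypothetical; the actual processors may implement it differently):

import java.io.File;

// Hypothetical illustration of the skip-if-exists rule described above.
public final class SkipIfExists {
    // Returns true if the target file still needs to be produced.
    static boolean needsProcessing(File target) {
        if (target.exists()) {
            System.out.println("Skipping existing file: " + target.getPath());
            return false;           // never overwrite an existing result
        }
        return true;                // target is missing - process the source
    }

    public static void main(String[] args) {
        File target = new File("data/work/jpegs/Ačaryan-1/p.77.jpg");
        if (needsProcessing(target)) {
            System.out.println("Would process and write: " + target.getPath());
        }
    }
}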
The processors are configurable: you can set the data/ folder, the sub-folders underneath it, and the modes of operation. However, there are some hard-coded assumptions:
For each original PDF file <name>.pdf, a <name>/ sub-folder is created in the specified target folder. For example, if you configure the PDF Processor to process .pdf files from the data/ folder and specify data/work/jpegs as the folder for the converted JPEG files, then the system will create a <name>/ sub-folder in the data/work/jpegs folder and put the page images there. See, for example, the Ačaryan-1.pdf file and the related files in the data/work/jpegs folder. The original filename must have only one dot, immediately before the extension.
The PDF Processor can process multiple files at once. GoogleUpload can process the results of multiple PDF files at once. JPGProcessor works with multiple files in a single directory and can write the results to the same or a different directory.
The system relies heavily on the file naming convention: the resulting files must follow the p.<dddd>.<extension> format, where dddd is the page number (for example, p.77.jpg). The page number is used to calculate the offset when slicing the page into columns.
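The sketch below illustrates both naming assumptions: deriving the <name>/ sub-folder from the PDF filename (one dot only) and reading the page number from the p.<number>.<extension> pattern. The class and helper names are hypothetical, and the offsets are the odd-offset/even-offset values from the JPGProcessor configuration below; the actual processors may do this differently.

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of the hard-coded naming assumptions (hypothetical helpers).
public final class NamingConvention {
    // "Ačaryan-1.pdf" -> "Ačaryan-1" (assumes exactly one dot, before the extension)
    static String baseName(String pdfFileName) {
        return pdfFileName.substring(0, pdfFileName.indexOf('.'));
    }

    // "p.77.jpg" or "sliced-p.700.jpg" -> 77 / 700
    static int pageNumber(String fileName) {
        Matcher m = Pattern.compile("p\\.(\\d+)\\.").matcher(fileName);
        if (!m.find()) throw new IllegalArgumentException("Not in p.<number>.<ext> format: " + fileName);
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        Path jpegFolder = Paths.get("data/work/jpegs").resolve(baseName("Ačaryan-1.pdf"));
        System.out.println("Page images go to: " + jpegFolder);   // data/work/jpegs/Ačaryan-1

        int page = pageNumber("p.77.jpg");
        double offset = (page % 2 == 1) ? 0.0 : 0.08;             // odd-offset / even-offset
        System.out.println("Page " + page + " slicing offset: " + offset);
    }
}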
Configuring the system
The system allows you to configure the source and target directories, file names, modes, etc. Each processor has its own .properties file.
For all configurations:
no spaces are allowed around the '=' sign.
you can control the printing of the settings (properties) at initialization by setting
-- print-settings=[yes, no]
-- verbose=[yes, no]
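To see why a space around '=' breaks a setting, here is a sketch of a strict key=value reader; it is an assumption for illustration only, not the actual configuration reader:

import java.util.HashMap;
import java.util.Map;

// Hypothetical strict parser: "dpi=600" is accepted, while "dpi = 600" would
// produce the key "dpi " and the value " 600", which match nothing.
public final class StrictSettings {
    static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> settings = new HashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) continue;          // skip comments
            int eq = line.indexOf('=');
            if (eq < 0) continue;
            settings.put(line.substring(0, eq), line.substring(eq + 1));   // no trimming
        }
        return settings;
    }

    public static void main(String[] args) {
        Map<String, String> ok = parse(java.util.List.of("print-settings=yes", "verbose=no"));
        System.out.println(ok.get("print-settings"));    // yes
        Map<String, String> bad = parse(java.util.List.of("verbose = no"));
        System.out.println(bad.containsKey("verbose"));  // false - the key is "verbose " with a trailing space
    }
}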
PDFProcessor
# PDF split and conversion to JPEG
# The application takes PDF files from the 'source-directory'
# according to the 'filter', splits them to files of smaller
# page count per 'splitBy' and stores them in 'output-folder'
# underneath 'target-root' in sub-folders matching original
# file names.
-- version=2024-06-27
# Folder with original PDFs. Assumptions: 1) original filename has only one dot (before the extension)
-- source-directory=data
# File name filter
-- filter=*.pdf
# Root folder for split PDFs. The pages go into folder named as the original
# PDF name underneath target-root + output-folder
-- target-root=data/work
-- output-folder=jpegs
-- folder-to-splt=split
-- output-format=jpg
-- output-extension=.jpg
# For JPEG conversion (default DPI=300)
-- dpi=600
# Splitting start page (1 by default)
-- start-page=1
-- start-column-page=77
-- end-column-page=700
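For orientation, the split-and-convert step corresponds roughly to the sketch below. It assumes the Apache PDFBox library (suggested by the font warning in Known Issues, but not confirmed here), reuses the paths and dpi value from the configuration above, and uses a hypothetical class name; splitBy, start-page, and the column settings are left out.

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

// Sketch only: render each page of a source PDF to a JPEG at the configured DPI
// and store it under <target-root>/<output-folder>/<original name>/p.<page>.jpg.
public final class PdfToJpegSketch {
    public static void main(String[] args) throws Exception {
        File source = new File("data/Ačaryan-1.pdf");
        File targetDir = new File("data/work/jpegs/Ačaryan-1");     // <name>/ sub-folder
        targetDir.mkdirs();

        try (PDDocument document = PDDocument.load(source)) {       // PDFBox 2.x API
            PDFRenderer renderer = new PDFRenderer(document);
            int dpi = 600;                                           // dpi=600 from the configuration
            for (int page = 0; page < document.getNumberOfPages(); page++) {
                File out = new File(targetDir, "p." + (page + 1) + ".jpg");
                if (out.exists()) continue;                          // skip-if-exists rule
                BufferedImage image = renderer.renderImageWithDPI(page, dpi, ImageType.RGB);
                ImageIO.write(image, "jpg", out);
            }
        }
    }
}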
JPGProcessor
# Folder with original JPGs.
-- source-directory=data/work/jpegs
# Folder with processed JPGs.
-- target-directory=data/work/sliced
# file name filter
-- filter=*.jpg
# Column slicing (for dictionaries, double-page scans). Width specifies what part of the whole page the slice width is: width=0.5 means exactly half of the page (see the sketch after this configuration listing).
-- odd-offset=0.0
-- even-offset=0.08
-- width=0.45
# Prefix for the target filename to mark sliced files (default is an empty string - no prefix)
-- prefix-sliced=sliced
-- prefix-tilted=tilted
# Auto-adjust the column offset for slicing. 'top-row' defines the starting height in pixels for the test (to skip page numbers and other markers). 'strip-width' is the part of the whole width checked around the splitting vertical. The system goes from 'top-row' through the pixel rows up to the (image height - top-row) row and calculates the sum of black pixels in each column within a strip of 'strip-width' (as a part of the image width) around the splitting vertical. A correction is then made so that the cut runs along the column in the middle of the biggest interval of columns with zero black pixels.
-- auto-adjust=yes
-- top-row=1700
-- strip-width=20
-- review-folder=review
# Rotate the image by 'rotate' radians when 'force-rotate' is yes: anti-clockwise for positive values, clockwise for negative.
-- force-rotate=yes
-- rotate=-0.04 (-1.57 is 90 degrees)
# Image processing:
# Beam: to calculate the tilt, a number of beams is released for each row between -beam-incline and +beam-incline. The beam-incline defines the maximum angle (for example, with beam-incline=40, tan(angle) = 40/imageWidth, in pixels). If beam-incline=0 (the recommended value), the value is calculated automatically as width * tilting-threshold. An elaborate [recursive] algorithm is used for detecting the tilt. The 'incline-limit' is used to halt a runaway algorithm when tilt detection goes wrong (for example, when the image quality is low).
-- detect-tilt=no
-- beam-incline=0
-- tilting-threshold=0.005
-- statistical-error=0.06
# A modulo 'precision' calculation is used to select rows. precision=1 is the most precise (and the slowest); with precision=4 every 4th row is selected for counting pixels.
-- row-precision=2
-- column-precision=2
# Black and White intensity thresholds
-- brightness=0.95
# Image analysis
# Histogram parameters: color=0 - black, color=1 - white
-- histogram=no
-- regex=row[20-64]-beam-[mp]\\d+
-- scale-up=1
-- scale-down=1
-- color=0
# Collect spectral statistics
-- statistics=no
-- spectrum=no
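The slicing and auto-adjust settings above can be pictured with the following simplified sketch. The class and method names are hypothetical, and for clarity the auto-adjust part scans the full page width instead of only the strip around the splitting vertical; it is not the JPGProcessor code itself.

import java.awt.image.BufferedImage;

// Simplified illustration of column slicing and of the auto-adjust idea:
// offsets and width are fractions of the page width; the auto-adjusted cut
// goes through the middle of the widest run of columns with no black pixels.
public final class ColumnSlicingSketch {
    // Cut one column out of a double-page scan.
    static BufferedImage slice(BufferedImage page, int pageNumber,
                               double oddOffset, double evenOffset, double width) {
        double offset = (pageNumber % 2 == 1) ? oddOffset : evenOffset;
        int x = (int) (offset * page.getWidth());
        int w = (int) (width * page.getWidth());
        return page.getSubimage(x, 0, Math.min(w, page.getWidth() - x), page.getHeight());
    }

    // Auto-adjust: count black pixels per column between topRow and (height - topRow),
    // then return the middle of the widest interval of all-white columns.
    static int autoAdjustCut(BufferedImage page, int topRow, double brightness) {
        int width = page.getWidth(), height = page.getHeight();
        int[] blackCount = new int[width];
        for (int y = topRow; y < height - topRow; y++) {
            for (int x = 0; x < width; x++) {
                if (luminance(page.getRGB(x, y)) < brightness) blackCount[x]++;
            }
        }
        int bestStart = 0, bestLen = 0, runStart = -1;
        for (int x = 0; x <= width; x++) {
            boolean white = x < width && blackCount[x] == 0;
            if (white && runStart < 0) runStart = x;
            if (!white && runStart >= 0) {
                if (x - runStart > bestLen) { bestLen = x - runStart; bestStart = runStart; }
                runStart = -1;
            }
        }
        return bestStart + bestLen / 2;   // cut in the middle of the widest white gap
    }

    // Normalized brightness of an RGB pixel in [0, 1].
    static double luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (0.299 * r + 0.587 * g + 0.114 * b) / 255.0;
    }

    public static void main(String[] args) {
        // Synthetic page: black text blocks left and right, white gap at columns 40..60.
        BufferedImage demo = new BufferedImage(100, 40, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 40; y++)
            for (int x = 0; x < 100; x++)
                demo.setRGB(x, y, (x < 40 || x > 60) ? 0x000000 : 0xFFFFFF);
        System.out.println("Auto-adjusted cut column: " + autoAdjustCut(demo, 0, 0.5));           // 50
        System.out.println("Odd page slice width: " + slice(demo, 77, 0.0, 0.08, 0.45).getWidth()); // 45
    }
}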
GoogleUpload
# Upload files according to 'filter' pattern from the
# local 'source-directory' to specified 'target-root'
# folder on the Google Drive
# Credentials to access Google Drive
-- credentials=cloud-access/credentials.json
-- application-name=pdf-ocr-2024
-- tokens-directory=cloud-access/tokens
-- authorize=user
# Local host callback port
-- port=8888
# Folder with original [pdf, jpg] files.
-- source-directory=data/work
-- upload-folder=jpegs
-- mime-type=image/jpeg
-- filter=*.jpg
# Root folder for split PDFs. The pages go into folder named as the
# original PDF name underneath target-root + output-folder
-- target-root=ocr
-- output-folder=txts
-- output-extension=txt
# If yes, upload from the local 'source-directory' into the Google 'upload-folder'
# underneath the 'target-root'
# if from-local=yes, then automatically upload=no
-- upload=no
# If Yes, convert files in folders in Google 'upload-folder'
# underneath the 'target-root'
# if upload=yes, then automatically convert=no
-- convert=no
# If Yes, select files for conversion from local 'upload-folder'
# under the 'source-directory'
-- from-local=yes
# If Yes, select files for downloading from the Google 'output-folder' under the 'target-root'; if upload=yes, then automatically download=no
-- download=yes
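The cloud-access settings (credentials, tokens-directory, authorize, port, application-name) map onto the standard Google Drive API authorization flow for installed applications, roughly as in the sketch below. It assumes the google-api-client libraries; the actual GoogleUpload code and its upload/convert/download logic are not shown.

import java.io.File;
import java.io.FileReader;
import java.util.List;
import com.google.api.client.auth.oauth2.Credential;
import com.google.api.client.extensions.java6.auth.oauth2.AuthorizationCodeInstalledApp;
import com.google.api.client.extensions.jetty.auth.oauth2.LocalServerReceiver;
import com.google.api.client.googleapis.auth.oauth2.GoogleAuthorizationCodeFlow;
import com.google.api.client.googleapis.auth.oauth2.GoogleClientSecrets;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.client.util.store.FileDataStoreFactory;
import com.google.api.services.drive.Drive;
import com.google.api.services.drive.DriveScopes;

// Sketch of how the credentials/tokens/port/authorize/application-name settings
// are typically used to build a Drive client (standard google-api-client flow).
public final class DriveClientSketch {
    public static void main(String[] args) throws Exception {
        NetHttpTransport transport = GoogleNetHttpTransport.newTrustedTransport();
        GsonFactory jsonFactory = GsonFactory.getDefaultInstance();

        GoogleClientSecrets secrets = GoogleClientSecrets.load(
                jsonFactory, new FileReader("cloud-access/credentials.json"));                  // credentials

        GoogleAuthorizationCodeFlow flow = new GoogleAuthorizationCodeFlow.Builder(
                transport, jsonFactory, secrets, List.of(DriveScopes.DRIVE))
                .setDataStoreFactory(new FileDataStoreFactory(new File("cloud-access/tokens"))) // tokens-directory
                .setAccessType("offline")
                .build();

        LocalServerReceiver receiver = new LocalServerReceiver.Builder().setPort(8888).build(); // port
        Credential credential = new AuthorizationCodeInstalledApp(flow, receiver).authorize("user"); // authorize

        Drive drive = new Drive.Builder(transport, jsonFactory, credential)
                .setApplicationName("pdf-ocr-2024")                                             // application-name
                .build();
        System.out.println("Drive client ready: " + drive.getApplicationName());
    }
}

With this flow, deleting the StoredCredential file under the tokens directory clears the stored token and forces a new manual authorization, which matches the remedy described under Known Issues.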
CleansingProcessor
# Folder with original and resulting texts. Assumptions: 1) original filename has only one dot (before the extension)
-- source-directory=data/txts/Ačaryan-1
-- target-directory=data/processed/Ačaryan-1
# Resulting files extensions.
-- extension=txt
# file name filter
-- filter=*.txt
# Pipe ('|') separated list of <regEx> | <replacement> pairs for pattern replacement. 'regex-file' is the name of a text file with a list of lines in the same <regEx> | <replacement> format.
-- regex-list=(?<=\\p{L})-[^\\p{L}]*\\s[^\\p{L}]*(?=\\p{L})||^$
# Read regEx patterns from the file
-- regex-file=
# Stitch pages. If you want the page splits marked in the combined file, set mark-pages=yes and set the markup. The partitions sub-folder contains the stitched pages: there can be a single file or multiple files (hence the name partitions). The sub-folder goes underneath the target folder.
-- stitch=no
-- page-intervals=1-2,3-637, 638-640
-- mark-pages=yes
-- number-format=------p.
-- number-offset=-2
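As an illustration of the kind of replacement the regex-list drives, the first pattern from the regex-list example above can be applied to raw OCR text as follows (a sketch only; the pipe-separated list parsing and the file handling of the actual processor are not shown):

import java.util.regex.Pattern;

// Sketch: apply the de-hyphenation pattern from the regex-list example above
// to raw OCR text (the actual processor reads its patterns from the configuration).
public final class CleansingSketch {
    public static void main(String[] args) {
        // Joins a word hyphenated across a line break: "diction-\nary" -> "dictionary".
        Pattern deHyphenate = Pattern.compile("(?<=\\p{L})-[^\\p{L}]*\\s[^\\p{L}]*(?=\\p{L})");
        String ocrText = "diction-\nary of Armenian roots";
        System.out.println(deHyphenate.matcher(ocrText).replaceAll(""));  // dictionary of Armenian roots
    }
}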
Known Issues
The PDF library sometimes complains with 'WARNING: Could not load font file: /System/Library/Fonts/LastResort.otf', followed by an exception. Typically it recovers and continues working just fine.
At times the upload (or download) to Google Drive might stall because of a socket timeout. Remedy: restart the process - it is safe to start over, because the system will skip the already processed pages in no time. If this happens at download time, then download the txts/ folder (or the remaining part of it) manually.
If you get:
Folder upload error: from local data/jpegs to ocr com.google.api.client.auth.oauth2.TokenResponseException: 400 Bad Request POST https://oauth2.googleapis.com/token { "error": "invalid_grant", "error_description": "Token has been expired or revoked."}
Delete the StoredCredential file from tokens/ directory, restart the process, and proceed with manual authentication.
After starting ./split.bash the system changes focus to the desktop (main macOS) window, where you can notice a JVM icon appear at the start of processing and disappear at the end. The cause is unknown (this behavior appeared recently, most likely when new settings took effect in one of the OS updates). Remedy: return to the original command-line window.
Cleansing processor does not work correctly with this regex in configuration:
regex-list=(\\p{L})([\\-\\–\\—]\\s+(?:[^\\p{Ll}])+?)(\\p{Ll})||$1$3
However, it works perfectly with this one:
regex-list=(\\p{L})([\\-\\\u2013\\\u2014]\\s+(?:[^\\p{Ll}])*?)(\\p{Ll})||$1$3
The difference is in the representation of the dash characters: a) by the literal symbols, b) by the Unicode escape codes (note that the quantifier also differs: +? versus *?).
Appendix - Hardware Overview
These are the details of the hardware on which the test runs were performed:
Model Identifier: MacBookPro17,1
Model Number: MYD92LL/A
Chip: Apple M1
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 8 GB
System Firmware Version: 10151.121.1