Translation Processor

aramhayr
Feb 3, 2025
Updated: Feb 25, 2025

Overview

Translation Processor can be viewed as one of the terminal steps in the OCR Data Pipeline or the Armenian Speech-to-Text web application. Once you have obtained and edited the text, you can translate it into any of the supported languages (see below).

In most cases it makes sense to stitch the whole text together and remove hyphens, either manually or with the Data Cleansing processor (see Step D. Cleansing data in the User's Guide to OCR Data Pipeline).

The code is available at https://github.com/aramhayr/TranslateProcessor.

Using the Application

We assume that the application is installed in a root directory such as ocr-processor/, following the Directory structure section in Administering Text Processors. The data/ directory contains plain text files in the text/ subdirectory. Files translated into the specified language go into the translated/ folder in data/, with the language code prepended to the original name. A typical configuration might look like this:

Configuration

# Configuration for the Translation processor:
version=2025-02-02
# Folders with original and resulting texts. Assumptions: original filename has only one dot (before the extension)
source-directory=data/text
target-directory=data/translate
# Target language code (hy for Armenian; for Amharic use am)
target-language=hy
# File name filter: select files from source-directory whose names end with this value
filter=.txt

As with the other Processors, you can set the print-settings and verbose modes.
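To make the settings above concrete, here is a minimal Python sketch of how such a configuration could be read and how source files could be mapped to target files. It is an illustration only, not the processor's actual code: the function names `read_config` and `plan_outputs` are hypothetical, and the `_` separator between the language code and the original file name is an assumption, since the exact naming format is not specified here.

```python
from pathlib import Path


def read_config(text: str) -> dict[str, str]:
    """Parse key=value lines, skipping blank lines and lines starting with #."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def plan_outputs(settings, file_names, separator="_"):
    """Map each source file that matches the filter to its target path.

    The separator between language code and file name is an assumption.
    """
    source = Path(settings["source-directory"])
    target = Path(settings["target-directory"])
    lang = settings["target-language"]
    suffix = settings["filter"]
    return {
        source / name: target / f"{lang}{separator}{name}"
        for name in sorted(file_names)
        if name.endswith(suffix)
    }


CONFIG = """\
# Folders with original and resulting texts
source-directory=data/text
target-directory=data/translate
target-language=hy
filter=.txt
"""

settings = read_config(CONFIG)
plan = plan_outputs(settings, ["chapter1.txt", "notes.md"])
for src, dst in plan.items():
    # prints: data/text/chapter1.txt -> data/translate/hy_chapter1.txt
    print(src.as_posix(), "->", dst.as_posix())
```

Only `chapter1.txt` is selected, because `notes.md` does not end with the configured filter.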

Running the App

As with the other Processors, the command-line script translate.bash calls main in TranslateProcessor-1.0.0-1.0.0-jar-with-dependencies.jar. While running, it prints minimal information: the names of the files being translated.

These error messages can appear in exceptional circumstances:

Error: Invalid target language. For list of supported languages set verbose=yes in the config file
Error accessing folders: <source-directory>
Error translating file <sourceFile> <reason>

Pricing

This is the typical pricing per "normal" (standard) page (~1800 words):

Genre	Google Translate	Editing (en, hy, ru)
Non-technical	֏100 | $0.25	֏2000 | $5
Technical	֏200 | $0.5	֏3600 | $9
Literary prose	֏100 | $0.25	֏3000 | $7.5

Columns description:

Genre - specifies the type of text. Technical is considered more complex because tables, graphs, and formulas might need some manual intervention before editing.
Google Translate - the text is returned "as is" after automatic translation.
Editing (en, hy, ru) - manual or AI editing is performed for the three languages listed.

These are just rough estimates. Pricing can be adjusted upon review of the text.

Up to 10 pages of the first order are free.
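As a worked example of the arithmetic, the following Python sketch estimates a price in USD from the per-page rates in the table. It rests on two assumptions that are not stated above: the two columns are treated as separately priced services chosen by name, and the free allowance is simply subtracted from the page count of a first order. The names `RATES_USD`, `FREE_PAGES`, and `estimate_usd` are hypothetical.

```python
# Rates per standard page in USD, copied from the pricing table.
RATES_USD = {
    "non-technical": {"translate": 0.25, "editing": 5.0},
    "technical": {"translate": 0.5, "editing": 9.0},
    "literary prose": {"translate": 0.25, "editing": 7.5},
}
FREE_PAGES = 10  # free allowance on a first order


def estimate_usd(genre: str, pages: int, service: str, first_order: bool = False) -> float:
    """Rough estimate: billable pages times the per-page rate.

    Assumption: on a first order, the first FREE_PAGES pages are not billed.
    """
    billable = max(0, pages - FREE_PAGES) if first_order else pages
    return billable * RATES_USD[genre][service]


# 40 technical pages with editing: 40 * 9 = 360.0
print(estimate_usd("technical", 40, "editing"))
# Same job as a first order: (40 - 10) * 9 = 270.0
print(estimate_usd("technical", 40, "editing", first_order=True))
```

Because these rates are rough estimates, the result is a starting point for a quote rather than a final price.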

Known Issues, Limitations

com.google.cloud.translate.TranslateException: Request payload size exceeds the limit: 204800 bytes. Despite the limit stated in the message, files need to be smaller than 64 KB.
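One way to stay under this limit is to split a large text at line boundaries before translating it. The Python sketch below is not part of the processor. It assumes that 64 KB means 64 × 1024 bytes of UTF-8 text, and the function name `split_by_bytes` is hypothetical. Size is measured in encoded bytes rather than characters, because Armenian letters take two bytes each in UTF-8.

```python
LIMIT_BYTES = 64 * 1024  # assumption: 64 KB measured as UTF-8 bytes


def split_by_bytes(text: str, limit: int = LIMIT_BYTES) -> list[str]:
    """Split text into chunks whose UTF-8 size stays below limit.

    Chunks always end at a line boundary, so no line is cut in half.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        if line_size >= limit:
            raise ValueError("a single line exceeds the limit; split it manually")
        if size + line_size >= limit:
            # Adding this line would reach the limit: close the current chunk.
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("".join(current))
    return chunks


sample = "Բարև աշխարհ\n" * 10000
parts = split_by_bytes(sample)
print(len(parts), "chunks, largest is",
      max(len(p.encode("utf-8")) for p in parts), "bytes")
```

Each chunk can then be saved as its own file in the source directory. When naming the pieces, keep in mind the configuration's assumption that a file name contains only one dot.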

Supported languages

As of 02-02-2025, Google Translate supports the following languages (you can get this list by setting verbose=yes):

Abkhaz, ab	Gujarati, gu	Nuer, nus
Acehnese, ace	Haitian Creole, ht	Occitan, oc
Acholi, ach	Hakha Chin, cnh	Odia (Oriya), or
Afrikaans, af	Hausa, ha	Oromo, om
Albanian, sq	Hawaiian, haw	Pangasinan, pag
Alur, alz	Hebrew, iw	Papiamento, pap
Amharic, am	Hiligaynon, hil	Pashto, ps
Arabic, ar	Hindi, hi	Persian, fa
Armenian, hy	Hmong, hmn	Polish, pl
Assamese, as	Hungarian, hu	Portuguese (Brazil), pt
Awadhi, awa	Hunsrik, hrx	Punjabi (Gurmukhi), pa
Aymara, ay	Icelandic, is	Punjabi (Shahmukhi), pa-Arab
Azerbaijani, az	Igbo, ig	Quechua, qu
Balinese, ban	Ilocano, ilo	Romani, rom
Bambara, bm	Indonesian, id	Romanian, ro
Bashkir, ba	Irish, ga	Rundi, rn
Basque, eu	Italian, it	Russian, ru
Batak Karo, btx	Japanese, ja	Samoan, sm
Batak Simalungun, bts	Javanese, jw	Sango, sg
Batak Toba, bbc	Kannada, kn	Sanskrit, sa
Belarusian, be	Kapampangan, pam	Scots Gaelic, gd
Bemba, bem	Kazakh, kk	Sepedi, nso
Bengali, bn	Khmer, km	Serbian, sr
Betawi, bew	Kiga, cgg	Sesotho, st
Bhojpuri, bho	Kinyarwanda, rw	Seychellois Creole, crs
Bikol, bik	Kituba, ktu	Shan, shn
Bosnian, bs	Konkani, gom	Shona, sn
Breton, br	Korean, ko	Sicilian, scn
Bulgarian, bg	Krio, kri	Silesian, szl
Buryat, bua	Kurdish (Kurmanji), ku	Sindhi, sd
Cantonese, yue	Kurdish (Sorani), ckb	Sinhala, si
Catalan, ca	Kyrgyz, ky	Slovak, sk
Cebuano, ceb	Lao, lo	Slovenian, sl
Chichewa, ny	Latgalian, ltg	Somali, so
Chinese (Simplified), zh	Latin, la	Spanish, es
Chinese (Traditional), zh-TW	Latvian, lv	Sundanese, su
Chuvash, cv	Ligurian, lij	Swahili, sw
Corsican, co	Limburgish, li	Swati, ss
Crimean Tatar, crh	Lingala, ln	Swedish, sv
Croatian, hr	Lithuanian, lt	Tajik, tg
Czech, cs	Lombard, lmo	Tamil, ta
Danish, da	Luganda, lg	Tatar, tt
Dhivehi, dv	Luo, luo	Telugu, te
Dinka, din	Luxembourgish, lb	Tetum, tet
Dogri, doi	Macedonian, mk	Thai, th
Dombe, dov	Maithili, mai	Tigrinya, ti
Dutch, nl	Makassar, mak	Tsonga, ts
Dzongkha, dz	Malagasy, mg	Tswana, tn
English, en	Malay, ms	Turkish, tr
Esperanto, eo	Malay (Jawi), ms-Arab	Turkmen, tk
Estonian, et	Malayalam, ml	Twi, ak
Ewe, ee	Maltese, mt	Ukrainian, uk
Fijian, fj	Maori, mi	Urdu, ur
Filipino, tl	Marathi, mr	Uyghur, ug
Finnish, fi	Meadow Mari, chm	Uzbek, uz
French, fr	Meiteilon (Manipuri), mni-Mtei	Vietnamese, vi
Frisian, fy	Minang, min	Welsh, cy
Fulani, ff	Mizo, lus	Xhosa, xh
Ga, gaa	Mongolian, mn	Yiddish, yi
Galician, gl	Myanmar (Burmese), my	Yoruba, yo
Georgian, ka	Ndebele (South), nr	Yucatec Maya, yua
German, de	Nepalbhasa (Newari), new	Zulu, zu
Greek, el	Nepali, ne	Hebrew, he
Guarani, gn	Norwegian, no	Javanese, jv
	Chinese (Simplified), zh-CN