Overview
Translation Processor can be viewed as one of the terminal steps in the OCR Data Pipeline or the Armenian Speech-to-Text WebApplication. Once you have obtained and edited the text, you can translate it into any of the supported languages (see below).
In most cases it makes sense to stitch the whole text together and remove hyphens first, either manually or with the Data Cleansing processor (see Step D. Cleansing data in the User's Guide to OCR Data Pipeline).
The code is available at https://github.com/aramhayr/TranslateProcessor.
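The stitching and hyphen removal mentioned above can be sketched in a few lines of Python. This is only an illustration and not the Data Cleansing processor's actual logic: the function name stitch_text is hypothetical, and the rule it applies (a hyphen at the end of a line, between two word characters, marks a split word) is an assumption that will also join genuine compounds such as "well-known" if they happen to break across lines.

```python
import re


def stitch_text(raw: str) -> str:
    """Join hard-wrapped lines into paragraphs and undo end-of-line hyphenation.

    Assumption: paragraphs are separated by blank lines, and a hyphen
    (ASCII "-" or the Armenian hyphen U+058A) at the end of a line
    means a word was split across two lines.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # "transla-\ntion" becomes "translation".
    text = re.sub(r"(\w)[-\u058A]\n(\w)", r"\1\2", text)
    # Split on blank lines, then join the lines of each paragraph with spaces.
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            paragraphs.append(" ".join(lines))
    return "\n\n".join(paragraphs)
```

For example, stitch_text("Trans-\nlation of a\nshort text.") returns "Translation of a short text.". Review the result by hand before translating, since real compounds may have been merged.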
Using Application
We assume that the application is installed in a root directory such as ocr-processor/, following the Directory structure section of Administering Text Processors. The data/ directory contains the plain text files in its text/ subdirectory. Files translated into the specified language go into the translated/ folder in data/, with the language code prepended to the original file name. A typical configuration might look like this:
Configuration
# Configuration for Translation processor processing:
version=2025-02-02
# Folders with original and resulting texts. Assumptions: original filename has only one dot (before the extension)
source-directory=data/text
target-directory=data/translated
# Target language code (hy for Armenian; for Amharic use am)
target-language=hy
# File name filter: select files that end with filter from the source-directory
filter=.txt
As with the other Processors, you can also set the print-settings and verbose modes.
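To make the relationship between the settings and the output file names concrete, here is a minimal Python sketch. It is not the processor's own code (the processor itself ships as a jar). The function names parse_config, matches_filter and translated_path are hypothetical. The "_" separator between the language code and the original name is also an assumption, since the text above only says that the code is prepended. The parser tolerates a trailing " #" remark after a value, which the real application may or may not do.

```python
from pathlib import PurePosixPath


def parse_config(text: str) -> dict:
    """Read key=value lines, skipping blank lines and lines starting with '#'."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assumption: anything after " #" on a value line is a remark.
        settings[key.strip()] = value.partition(" #")[0].strip()
    return settings


def matches_filter(file_name: str, settings: dict) -> bool:
    """True if the file name ends with the configured filter (e.g. '.txt')."""
    return file_name.endswith(settings.get("filter", ""))


def translated_path(source_file: str, settings: dict, separator: str = "_") -> str:
    """Build the output path: target directory + language code + original name.

    The separator is hypothetical; check the real output for the actual one.
    """
    name = PurePosixPath(source_file).name
    target = PurePosixPath(settings["target-directory"])
    return str(target / (settings["target-language"] + separator + name))
```

With target-directory=data/translated and target-language=hy, translated_path("data/text/chapter1.txt", settings) yields data/translated/hy_chapter1.txt under these assumptions.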
Running the App
As with the other Processors, the command-line script translate.bash calls main in TranslateProcessor-1.0.0-1.0.0-jar-with-dependencies.jar. While running, it prints only minimal information: the names of the files being translated.
These error messages can appear in exceptional circumstances:
Error: Invalid target language. For list of supported languages set verbose=yes in the config file
Error accessing folders: <source-directory>
Error translating file <sourceFile> <reason>
Pricing
Typical pricing per "normal" (standard) page (~1800 words):
Genre | Google Translate, AMD | Google Translate, USD | Editing (en, hy, ru), AMD | Editing (en, hy, ru), USD |
Non-technical | ֏100 | $0.25 | ֏2000 | $5 |
Technical | ֏200 | $0.5 | ֏3600 | $9 |
Literary prose | ֏100 | $0.25 | ֏3000 | $7.5 |
Columns description:
Genre - specifies type of the text. Technical is considered more complex because of Tables, Graphs, and Formulas that might need some manual intervention before editing.
Google Translate - the text is returned "as is" after automatic translation.
Editing (en, hy, ru) - Manual or AI editing is performed for the 3 mentioned languages
These are just rough estimates. Pricing can be adjusted upon review of the text.
Up to 10 pages of the first order are free.
Known Issues, Limitations
com.google.cloud.translate.TranslateException: Request payload size exceeds the limit: 204800 bytes. Despite the limit quoted in the message, files need to be smaller than 64 KB to be translated.
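One way to stay under this limit is to split a long text into several smaller files before translating. The Python sketch below is an illustration only, not part of the processor. The function name split_by_size is hypothetical, and the default of 60000 bytes is an assumed safety margin below 64 KB. The size is measured in UTF-8 bytes rather than characters, which matters for Armenian text, where each letter takes two bytes.

```python
def split_by_size(text: str, max_bytes: int = 60_000) -> list:
    """Split text into chunks of at most max_bytes UTF-8 bytes.

    Chunks break only at blank lines (paragraph boundaries), so that
    sentences are not cut in the middle. Joining the chunks with a
    blank line restores the original text.
    """
    chunks, current, size = [], [], 0
    for paragraph in text.split("\n\n"):
        # +2 accounts for the blank-line separator between paragraphs.
        paragraph_bytes = len(paragraph.encode("utf-8")) + 2
        if paragraph_bytes > max_bytes:
            raise ValueError("a single paragraph exceeds max_bytes; split it by hand")
        if current and size + paragraph_bytes > max_bytes:
            # Current chunk is full: close it and start a new one.
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(paragraph)
        size += paragraph_bytes
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be saved as its own file in the source directory, for example with a numeric suffix in the name, and the translated parts concatenated afterwards.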
Supported languages
As of 02-02-2025, Google Translate supports the following languages (you can get this list by setting verbose=yes):
Abkhaz, ab | Gujarati, gu | Nuer, nus |
Acehnese, ace | Haitian Creole, ht | Occitan, oc |
Acholi, ach | Hakha Chin, cnh | Odia (Oriya), or |
Afrikaans, af | Hausa, ha | Oromo, om |
Albanian, sq | Hawaiian, haw | Pangasinan, pag |
Alur, alz | Hebrew, iw | Papiamento, pap |
Amharic, am | Hiligaynon, hil | Pashto, ps |
Arabic, ar | Hindi, hi | Persian, fa |
Armenian, hy | Hmong, hmn | Polish, pl |
Assamese, as | Hungarian, hu | Portuguese (Brazil), pt |
Awadhi, awa | Hunsrik, hrx | Punjabi (Gurmukhi), pa |
Aymara, ay | Icelandic, is | Punjabi (Shahmukhi), pa-Arab |
Azerbaijani, az | Igbo, ig | Quechua, qu |
Balinese, ban | Ilocano, ilo | Romani, rom |
Bambara, bm | Indonesian, id | Romanian, ro |
Bashkir, ba | Irish, ga | Rundi, rn |
Basque, eu | Italian, it | Russian, ru |
Batak Karo, btx | Japanese, ja | Samoan, sm |
Batak Simalungun, bts | Javanese, jw | Sango, sg |
Batak Toba, bbc | Kannada, kn | Sanskrit, sa |
Belarusian, be | Kapampangan, pam | Scots Gaelic, gd |
Bemba, bem | Kazakh, kk | Sepedi, nso |
Bengali, bn | Khmer, km | Serbian, sr |
Betawi, bew | Kiga, cgg | Sesotho, st |
Bhojpuri, bho | Kinyarwanda, rw | Seychellois Creole, crs |
Bikol, bik | Kituba, ktu | Shan, shn |
Bosnian, bs | Konkani, gom | Shona, sn |
Breton, br | Korean, ko | Sicilian, scn |
Bulgarian, bg | Krio, kri | Silesian, szl |
Buryat, bua | Kurdish (Kurmanji), ku | Sindhi, sd |
Cantonese, yue | Kurdish (Sorani), ckb | Sinhala, si |
Catalan, ca | Kyrgyz, ky | Slovak, sk |
Cebuano, ceb | Lao, lo | Slovenian, sl |
Chichewa, ny | Latgalian, ltg | Somali, so |
Chinese (Simplified), zh | Latin, la | Spanish, es |
Chinese (Traditional), zh-TW | Latvian, lv | Sundanese, su |
Chuvash, cv | Ligurian, lij | Swahili, sw |
Corsican, co | Limburgish, li | Swati, ss |
Crimean Tatar, crh | Lingala, ln | Swedish, sv |
Croatian, hr | Lithuanian, lt | Tajik, tg |
Czech, cs | Lombard, lmo | Tamil, ta |
Danish, da | Luganda, lg | Tatar, tt |
Dhivehi, dv | Luo, luo | Telugu, te |
Dinka, din | Luxembourgish, lb | Tetum, tet |
Dogri, doi | Macedonian, mk | Thai, th |
Dombe, dov | Maithili, mai | Tigrinya, ti |
Dutch, nl | Makassar, mak | Tsonga, ts |
Dzongkha, dz | Malagasy, mg | Tswana, tn |
English, en | Malay, ms | Turkish, tr |
Esperanto, eo | Malay (Jawi), ms-Arab | Turkmen, tk |
Estonian, et | Malayalam, ml | Twi, ak |
Ewe, ee | Maltese, mt | Ukrainian, uk |
Fijian, fj | Maori, mi | Urdu, ur |
Filipino, tl | Marathi, mr | Uyghur, ug |
Finnish, fi | Meadow Mari, chm | Uzbek, uz |
French, fr | Meiteilon (Manipuri), mni-Mtei | Vietnamese, vi |
Frisian, fy | Minang, min | Welsh, cy |
Fulani, ff | Mizo, lus | Xhosa, xh |
Ga, gaa | Mongolian, mn | Yiddish, yi |
Galician, gl | Myanmar (Burmese), my | Yoruba, yo |
Georgian, ka | Ndebele (South), nr | Yucatec Maya, yua |
German, de | Nepalbhasa (Newari), new | Zulu, zu |
Greek, el | Nepali, ne | Hebrew, he |
Guarani, gn | Norwegian, no | Javanese, jv |
Chinese (Simplified), zh-CN |