
Translation Processor

aramhayr

Updated: 2 days ago





Overview

Translation Processor can be viewed as one of the terminal steps of the OCR Data Pipeline or the Armenian Speech-to-Text WebApplication. Once you have obtained and edited the text, you can translate it into any of the supported languages (see below).

In most cases it makes sense to stitch the whole text together and remove hyphens, either manually or with the Data Cleansing processor (see Step D. Cleansing data in the User's Guide to OCR Data Pipeline).
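As a rough illustration of the stitching step, the following Python sketch joins words hyphenated across line breaks and merges the lines of each paragraph into one. It is not the Data Cleansing processor's actual algorithm; the function name stitch_text and the rule that a blank line separates paragraphs are assumptions made for this example.

```python
import re


def stitch_text(raw: str) -> str:
    """Join hyphenated line breaks and unwrap the lines of each paragraph."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    stitched = []
    for para in paragraphs:
        # "transla-\ntion" -> "translation": drop a hyphen that ends a line
        # when a word character follows on the next line.
        para = re.sub(r"(\w)-\n\s*(\w)", r"\1\2", para)
        # Remaining line breaks inside a paragraph become single spaces.
        para = re.sub(r"\s*\n\s*", " ", para)
        stitched.append(para.strip())
    return "\n\n".join(stitched)
```

Note that a genuine compound word that happens to break at its hyphen (for example "well-" at the end of a line followed by "known") also loses the hyphen, so the result is worth a quick manual check.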

Using Application

We assume that the application is installed in a root directory such as ocr-processor/, following the Directory structure section of Administering Text Processors. The data/ directory contains plain text files in its text/ subdirectory. Files translated into the specified language go into the translated/ folder in data/, with the language code prepended to the original name. A typical configuration might look like this:

Configuration

    # Configuration for the Translation processor
    version=2025-02-02

    # Folders with original and resulting texts. Assumption: the original filename has only one dot (before the extension)
    source-directory=data/text
    target-directory=data/translate

    # Target language code (hy for Armenian; for Amharic use am)
    target-language=hy

    # File name filter: select files from the source-directory whose names end with the filter
    filter=.txt
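To make the meaning of these settings concrete, here is a minimal Python sketch of how a key=value configuration of this shape could be read and how the filter could select files from the source directory. The processor itself is a Java application and its real parsing code is not shown in this guide, so parse_config, select_files, and SAMPLE_CONFIG are hypothetical names used only for illustration.

```python
from pathlib import Path

# Sample settings mirroring the configuration above (comments on their own lines).
SAMPLE_CONFIG = """\
# Configuration for the Translation processor
version=2025-02-02
source-directory=data/text
target-directory=data/translate
target-language=hy
filter=.txt
"""


def parse_config(text: str) -> dict[str, str]:
    """Parse key=value lines, skipping blank lines and '#' comment lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def select_files(source_dir: Path, suffix: str) -> list[Path]:
    """Return the files in source_dir whose names end with the filter suffix."""
    return sorted(p for p in source_dir.iterdir()
                  if p.is_file() and p.name.endswith(suffix))
```

This sketch only recognizes comments that occupy a whole line; a comment placed after a value on the same line would be read as part of the value.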

As with other Processors, you can set the print-settings and verbose modes.

Running the App

As with other Processors, the command-line script translate.bash calls main in TranslateProcessor-1.0.0-1.0.0-jar-with-dependencies.jar. While running, it prints minimal information: the names of the files being translated.

These error messages can appear in exceptional circumstances:

  1. Error: Invalid target language. For list of supported languages set  verbose=yes in the config file

  2. Error accessing folders: <source-directory>

  3. Error translating file <sourceFile> <reason>

Pricing

Typical pricing per "normal" (standard) page (~1800 words):

Genre            Google Translate    Editing (en, hy, ru)
Non-technical    ֏100 | $0.25        ֏2000 | $5
Technical        ֏200 | $0.5         ֏3600 | $9
Literary prose   ֏100 | $0.25        ֏3000 | $7.5

Columns description:

  1. Genre - the type of text. Technical text is considered more complex because tables, graphs, and formulas may need manual intervention before editing.

  2. Google Translate - the text is returned "as is" after automatic translation.

  3. Editing (en, hy, ru) - manual or AI editing is performed for the three listed languages.

These are just rough estimates. Pricing can be adjusted upon review of the text.

Up to 10 pages of the first order are free.
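For a quick budget check, the per-page rates can be turned into a small calculator. The Python sketch below uses the dram rates from the pricing table; rounding partial pages up to a whole page and the function name estimate_amd are assumptions of this example, not stated pricing policy, and the result remains a rough estimate.

```python
import math

WORDS_PER_PAGE = 1800  # size of a "standard" page as quoted in this section

# Rates in Armenian dram per page, copied from the pricing table.
RATES_AMD = {
    "non-technical": {"machine": 100, "edited": 2000},
    "technical": {"machine": 200, "edited": 3600},
    "literary": {"machine": 100, "edited": 3000},
}


def estimate_amd(words: int, genre: str, service: str, free_pages: int = 0) -> int:
    """Estimate the price in dram for a text of the given word count."""
    # Assumption: a partial page is billed as a whole page.
    pages = math.ceil(words / WORDS_PER_PAGE)
    # Free pages (for example, up to 10 on a first order) reduce the billable count.
    billable = max(0, pages - free_pages)
    return billable * RATES_AMD[genre][service]
```

For example, a 45,000-word technical text is 25 standard pages; with editing and 10 free pages on a first order, estimate_amd(45000, "technical", "edited", free_pages=10) gives 15 billable pages at ֏3600 each.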

Known Issues, Limitations

  1. com.google.cloud.translate.TranslateException:  Request payload size exceeds the limit: 204800 bytes. Despite the limit stated in the message, you need to keep files smaller than 64 KB.
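One way to stay under this limit is to split a long text into several files before running the processor. The Python sketch below splits at paragraph boundaries and measures size in UTF-8 bytes, which matters because Armenian letters take two bytes each. It is a helper sketch, not part of the processor; split_by_size and the blank-line paragraph separator are assumptions of this example.

```python
MAX_BYTES = 64 * 1024  # practical file size limit noted in this section


def split_by_size(text: str, max_bytes: int = MAX_BYTES) -> list[str]:
    """Split text at paragraph boundaries into chunks under max_bytes (UTF-8)."""
    chunks, current, current_size = [], [], 0
    for para in text.split("\n\n"):
        size = len(para.encode("utf-8")) + 2  # +2 for the "\n\n" separator
        # Start a new chunk when adding this paragraph would reach the limit.
        if current and current_size + size >= max_bytes:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        # Assumption: a single paragraph never exceeds the limit on its own.
        current.append(para)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be saved as its own file in the source directory. When naming the pieces, keep only one dot in the filename (before the extension), as the configuration assumes.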

Supported languages

As of 02-02-2025, Google Translate supports the following languages (you can get this list by setting verbose=yes):

Abkhaz, ab

Gujarati, gu

Nuer, nus

Acehnese, ace

Haitian Creole, ht

Occitan, oc

Acholi, ach

Hakha Chin, cnh

Odia (Oriya), or

Afrikaans, af

Hausa, ha

Oromo, om

Albanian, sq

Hawaiian, haw

Pangasinan, pag

Alur, alz

Hebrew, iw

Papiamento, pap

Amharic, am

Hiligaynon, hil

Pashto, ps

Arabic, ar

Hindi, hi

Persian, fa

Armenian, hy

Hmong, hmn

Polish, pl

Assamese, as

Hungarian, hu

Portuguese (Brazil), pt

Awadhi, awa

Hunsrik, hrx

Punjabi (Gurmukhi), pa

Aymara, ay

Icelandic, is

Punjabi (Shahmukhi), pa-Arab

Azerbaijani, az

Igbo, ig

Quechua, qu

Balinese, ban

Ilocano, ilo

Romani, rom

Bambara, bm

Indonesian, id

Romanian, ro

Bashkir, ba

Irish, ga

Rundi, rn

Basque, eu

Italian, it

Russian, ru

Batak Karo, btx

Japanese, ja

Samoan, sm

Batak Simalungun, bts

Javanese, jw

Sango, sg

Batak Toba, bbc

Kannada, kn

Sanskrit, sa

Belarusian, be

Kapampangan, pam

Scots Gaelic, gd

Bemba, bem

Kazakh, kk

Sepedi, nso

Bengali, bn

Khmer, km

Serbian, sr

Betawi, bew

Kiga, cgg

Sesotho, st

Bhojpuri, bho

Kinyarwanda, rw

Seychellois Creole, crs

Bikol, bik

Kituba, ktu

Shan, shn

Bosnian, bs

Konkani, gom

Shona, sn

Breton, br

Korean, ko

Sicilian, scn

Bulgarian, bg

Krio, kri

Silesian, szl

Buryat, bua

Kurdish (Kurmanji), ku

Sindhi, sd

Cantonese, yue

Kurdish (Sorani), ckb

Sinhala, si

Catalan, ca

Kyrgyz, ky

Slovak, sk

Cebuano, ceb

Lao, lo

Slovenian, sl

Chichewa, ny

Latgalian, ltg

Somali, so

Chinese (Simplified), zh

Latin, la

Spanish, es

Chinese (Traditional), zh-TW

Latvian, lv

Sundanese, su

Chuvash, cv

Ligurian, lij

Swahili, sw

Corsican, co

Limburgish, li

Swati, ss

Crimean Tatar, crh

Lingala, ln

Swedish, sv

Croatian, hr

Lithuanian, lt

Tajik, tg

Czech, cs

Lombard, lmo

Tamil, ta

Danish, da

Luganda, lg

Tatar, tt

Dhivehi, dv

Luo, luo

Telugu, te

Dinka, din

Luxembourgish, lb

Tetum, tet

Dogri, doi

Macedonian, mk

Thai, th

Dombe, dov

Maithili, mai

Tigrinya, ti

Dutch, nl

Makassar, mak

Tsonga, ts

Dzongkha, dz

Malagasy, mg

Tswana, tn

English, en

Malay, ms

Turkish, tr

Esperanto, eo

Malay (Jawi), ms-Arab

Turkmen, tk

Estonian, et

Malayalam, ml

Twi, ak

Ewe, ee

Maltese, mt

Ukrainian, uk

Fijian, fj

Maori, mi

Urdu, ur

Filipino, tl

Marathi, mr

Uyghur, ug

Finnish, fi

Meadow Mari, chm

Uzbek, uz

French, fr

Meiteilon (Manipuri), mni-Mtei

Vietnamese, vi

Frisian, fy

Minang, min

Welsh, cy

Fulani, ff

Mizo, lus

Xhosa, xh

Ga, gaa

Mongolian, mn

Yiddish, yi

Galician, gl

Myanmar (Burmese), my

Yoruba, yo

Georgian, ka

Ndebele (South), nr

Yucatec Maya, yua

German, de

Nepalbhasa (Newari), new

Zulu, zu

Greek, el

Nepali, ne

Hebrew, he

Guarani, gn

Norwegian, no

Javanese, jv


Chinese (Simplified), zh-CN





 
 
 
