Running the Armenian Parser

aramhayr
Dec 11, 2023
Updated: May 26

Overview

The Parser performs spell-checking, tagging, and lemmatization of Eastern Armenian [plain] text typed in the revised orthography (see the exception in the book). It is a Linux command-line application distributed as an executable Java .jar, a bash script to run it, and morpheme Dictionaries stored as plain text (JSON objects). The language model used for text processing is described in detail in Բնական խոսքի ընդհանրական ներկայացման մի տարբերակի մասին.

To install the application, download and unzip these files:

Version 2.3.3 - Fixed the statistics for multiple-file processing; reshuffled the code to support service mode; added the typo-position property to the system properties for formatting the output in service mode; added the CharDetector command-line application (see Step E. Manual (Visual) review and correction in the User's Guide to OCR Data Pipeline); temporarily turned off the Index creation functionality because of issues with understanding (remembering) its logic.

Renamed JAR to ArmParser.

See the Content of the Application folder section below for more details about the content of these archives.

This archive contains the executables: everything that is not in the core dictionary or the user's directory. See the relevant archives below.

This archive contains the Dictionaries and system configuration files that users typically should not update.

This archive contains the user or optional Dictionaries and the user configuration file.

First, unarchive arm-parser-core.zip in the text-processor (or another) directory. This creates folders with the executables in the text-processor directory and extracts the scripts needed to run the applications. Then copy the other 2 archives into the text-processor directory and unarchive them there.
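The unpacking order described above can be sketched as a small shell function. This is only an illustration under stated assumptions: install_parser is a hypothetical helper, not one of the distributed scripts, and because only arm-parser-core.zip is named in this section, the other two archives are passed as arguments rather than spelled out.

```shell
# Hypothetical helper, not part of the distribution.
# Usage: install_parser TARGET_DIR CORE_ZIP OTHER_ZIP...
install_parser() (
  set -e
  target=$1
  core_zip=$2
  shift 2

  mkdir -p "$target"

  # Step 1: the core archive creates the folders and scripts in TARGET_DIR.
  unzip -q "$core_zip" -d "$target"

  # Step 2: copy each remaining archive into TARGET_DIR and unpack it there.
  for archive in "$@"; do
    cp "$archive" "$target/"
    ( cd "$target" && unzip -q "$(basename "$archive")" )
  done
)

# Example call (the last two paths are placeholders for the downloaded files):
#   install_parser text-processor arm-parser-core.zip "$DICT_ZIP" "$USER_ZIP"
```

The function body runs in a subshell, so the directory change and the variables do not leak into the calling shell.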

Note. Development of the Application is incomplete, and it is far from production level (see the list of major known issues at the bottom of the page). However, it catches some typos that humans sometimes miss. I used earlier versions to edit Հայկական հանրագիտարան (I edited ~100 pages in 2016-2017) and to prepare my book for printing (I also generated its Index using Lemmatization).

Content of the Application folder

1. This is the Parser directory structure:

├── bin
│   ├── ArmParser-2.3.3-jar-with-dependencies.jar
│   └── CharDetector-1.0.0-jar-with-dependencies.jar
├── config
│   ├── descriptionOf.properties
│   ├── dictionaries
│   │   ├── immutables
│   │   │   ├── 0-10-particles.txt
│   │   │   ├── 0-11-pronoun.txt
│   │   │   ├── 0-12-natural-sounds.txt
│   │   │   ├── 0-5-abbreviations.txt
│   │   │   ├── 0-6-auxiliary-verb.txt
│   │   │   ├── 0-7-conjunctions.txt
│   │   │   ├── 0-8-adpositons.txt
│   │   │   └── 0-9-modal.txt
│   │   ├── lexemes
│   │   │   ├── 21-1-modal.txt
│   │   │   ├── 21-2-conjunctions.txt
│   │   │   ├── 21-3-adpositons.txt
│   │   │   └── 21-4-pronoun.txt
│   │   ├── names
│   │   │   └── 0-4-proper-stems.txt
│   │   ├── postfixes
│   │   │   ├── 1-1-articles.txt
│   │   │   ├── 1-2-plural-terminal.txt
│   │   │   ├── 10-1-voice.txt
│   │   │   ├── 11-1-verbal.txt
│   │   │   ├── 11-2-multiplicative.txt
│   │   │   ├── 12-1-adverbal.txt
│   │   │   ├── 13-1-adjective.txt
│   │   │   ├── 13-2-resultive-participle.txt
│   │   │   ├── 14-1-alternation.txt
│   │   │   ├── 15-1-nomenal.txt
│   │   │   ├── 2-1-repetitive-declension.txt
│   │   │   ├── 2-2-declension.txt
│   │   │   ├── 3-1-declension.txt
│   │   │   ├── 4-1-plural.txt
│   │   │   ├── 5-0-declension.txt
│   │   │   ├── 6-1-nomenal8un.txt
│   │   │   ├── 7-1-participle-mood.txt
│   │   │   ├── 7-2-past-perfect.txt
│   │   │   ├── 7-3-subjunctive.txt
│   │   │   ├── 8-1-infinitive.txt
│   │   │   └── 9-1-aspect.txt
│   │   ├── prefixes
│   │   │   ├── 1-0-prefix-terminal.txt
│   │   │   ├── 1-1-prefix-ch.txt
│   │   │   ├── 1-2-prefix-k.txt
│   │   │   ├── 1-3-numeral.txt
│   │   │   ├── 2-1-negative.txt
│   │   │   ├── 3-1-prefix-time.txt
│   │   │   ├── 3-2-prefix-relation.txt
│   │   │   └── 4-1-prefix-space.txt
│   │   ├── stems
│   │   │   ├── 0-1-root-stems.txt
│   │   │   ├── 0-2-verb-stems.txt
│   │   │   ├── 0-3-natural-stems.txt
│   │   │   └── 0-4-complex-stems.txt
│   │   ├── system-dictionaries.json
│   │   └── tags
│   │       ├── constraints.txt
│   │       ├── tag-product.csv
│   │       └── tags.txt
│   ├── docxprocessor.properties
│   ├── spellcheck.properties
│   └── update.properties
├── crdet.bash - see Step E. Manual (Visual) review and correction in the User's Guide to OCR Data Pipeline
├── dadmin.bash
├── morphemes.bash - count # of (other than suffixes) morphemes in the Dictionary
├── spell.bash
├── suffixes.bash - count # of suffixes in the Dictionary
└── work
    ├── appAdmin-zlmuz-com
    │   ├── appAdmin-zlmuz-com.profile
    │   └── appAdmin-zlmuz-com.txt
    ├── batchSpellcheck-zlmuz-com
    │   └── batchSpellcheck-zlmuz-com.profile
    └── testerA-zlmuz-com
        ├── dictionaries
        │   ├── immutables
        │   │   ├── cyrillic.txt
        │   │   ├── latin.txt
        │   │   └── other.txt
        │   ├── names
        │   ├── stems
        │   │   ├── corpus.txt
        │   │   ├── ea.txt
        │   │   └── linguist.txt
        │   └── user-dictionaries.json
        ├── spellcheck
        │   ├── config
        │   │   ├── stopwords.txt
        │   │   └── synonyms.txt
        │   ├── texts
        │   │   └── short.txt
        │   └── typos
        ├── testerA-zlmuz-com.profile
        └── update

23 directories, 73 files

2. There are 2 sets of directories under the root: the 2 system directories, bin/ and config/, and the work/ directory, which contains end-user profiles and dictionaries.

3. Two scripts start text and dictionary processing from the command line: spell.bash and dadmin.bash. The other scripts are described in the tree above.

4. The system directories:

a. The bin/ directory contains the .jar file that implements the text and dictionary processing functionality.

b. The config/ directory contains the multifile dictionary of terms, its .json configuration, the tag descriptions, and 2 system profiles: spellcheck, for stemming, tagging, lemmatization, indexing, and spellcheck; and update, for dictionary updates.

c. The Dictionary of Terms is split into 6 categories: immutables, prefixes, lexemes, postfixes, names, and stems. The last 2 can be updated by the Dictionaries maintenance functionality (update.properties).

5. The work/ directory contains user profiles and dictionaries:

a. The spell.bash and dadmin.bash scripts are configured to run with the testerA-zlmuz-com/ user profile.

b. All user dictionaries can be updated by dadmin.bash. The testerA-zlmuz-com/ user currently has 3 non-Armenian-script dictionaries and 3 Armenian-script dictionaries.

c. The spellcheck/config/ folder contains the configuration for index creation (in this particular case, for my book): stopwords (words to ignore) and synonyms (words that should be replaced by a different word in the index). [Note. These should go under a user project-specific directory, where project-level dictionaries should also be added.]

Running the Application

6. To run processing:

a. Put the POT (plain old text) file(s) into the testerA-zlmuz-com/text/ folder (shown as work/testerA-zlmuz-com/spellcheck/texts/ in the tree above) under any name, but with the .txt extension (files with other extensions are ignored).

b. Run ./spell.bash on the command line.

c. The spell-check results, a list of script-specific files with typos, appear in testerA-zlmuz-com/typos/ (spellcheck/typos/ in the tree above). Review the .update files.
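The three steps above could be wrapped in one shell function, sketched here under assumptions. run_spellcheck is a hypothetical helper, not one of the distributed scripts, and the texts/ and typos/ locations are taken from the directory tree in the previous section (work/testerA-zlmuz-com/spellcheck/); adjust them if your installation differs.

```shell
# Hypothetical helper, not part of the distribution.
# Usage: run_spellcheck PARSER_ROOT FILE...
run_spellcheck() (
  set -e
  root=$1
  shift

  # Assumed locations, read off the directory tree of the Application folder.
  texts="$root/work/testerA-zlmuz-com/spellcheck/texts"
  typos="$root/work/testerA-zlmuz-com/spellcheck/typos"

  # Step a: stage the input; only .txt files are worth copying,
  # because the Parser ignores other extensions.
  for f in "$@"; do
    case "$f" in
      *.txt) cp "$f" "$texts/" ;;
      *)     echo "skipping $f: not a .txt file" >&2 ;;
    esac
  done

  # Step b: run the spell check from the Parser root.
  ( cd "$root" && ./spell.bash )

  # Step c: list the result files to review.
  ls -1 "$typos"
)

# Example call (paths are placeholders):
#   run_spellcheck ~/text-processor chapter1.txt chapter2.txt
```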

7. To update the dictionary:

a. Move (or create) a *-update.txt file into the testerA-zlmuz-com/update/ directory.

b. Edit the dictionary name and/or the list of terms to be ADDed or DELETEd.

c. Run ./dadmin.bash

d. Fix the tags for the words with tag="X" in the relevant dictionaries.

Note. Updating the system dictionaries is not recommended until centralized storage is available.

8. Editing the user profile. If you set:

i. log-found=yes, the tagging results go into the *-short.txt file.

ii. debug=yes, error messages go into the *-short.txt file.
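Switching these options can be scripted. The sketch below assumes the profile is a plain text file with one key=value entry per line, as the option names above suggest; set_profile_option is a hypothetical helper, not one of the distributed scripts.

```shell
# Hypothetical helper, not part of the distribution.
# Usage: set_profile_option PROFILE_FILE KEY VALUE
set_profile_option() (
  set -e
  profile=$1
  key=$2
  value=$3

  if grep -q "^${key}=" "$profile"; then
    # The key exists: rewrite its line through a temporary file.
    sed "s|^${key}=.*|${key}=${value}|" "$profile" > "$profile.tmp"
    mv "$profile.tmp" "$profile"
  else
    # The key is missing: append it.
    printf '%s=%s\n' "$key" "$value" >> "$profile"
  fi
)

# Example call (the profile path follows the directory tree of the Application folder):
#   set_profile_option work/testerA-zlmuz-com/testerA-zlmuz-com.profile log-found yes
#   set_profile_option work/testerA-zlmuz-com/testerA-zlmuz-com.profile debug yes
```

Keeping a backup copy of the profile before editing it this way is a sensible precaution.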

Classic orthography workaround

To reuse the algorithms and dictionaries developed so far for texts in classic orthography, a simple workaround is implemented: each classically spelled word is converted to the revised spelling before it is passed to the spellchecker or tagger.

This is the implemented conversion table:

"ուեա", "վյա"
"ուե", "վե"
"ուա", "վա"
"ուէ", "վե"
"իւ", "յու" // word-finally ւ -> վ, unless it is part of ու or եւ
"եա", "յա" // in իդ-եա-լական this reflects the modern pronunciation
"ոյ", "ույ" // except գոյ
"ուիլ", "վել"
"ուի", "վի"
"աւ", "ավ"
"ւոր", "վոր"
"ւում", "վում"
"րէ", "րե"
"օ", "ո" // word-medially
"է", "ե" // except in the combination էջ
"յ", "" // word-finally
"թյանց", "թյուններ" // [word-finally]

Implementing the above conversion decreased the false negative count for the Հ.Աճառյան. ՀԱԲ. ՅԱՌԱՋԱԲԱՆ page from 1137 to 404. This may not look particularly impressive, but in practice it simplified proofreading a lot.
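To show how such a table can be applied, here is a minimal sed-based sketch. It is not the Parser's Java implementation. It simply applies the pairs in the listed order to one word per line and reads the positional remarks in the table as conditions (word-medial, word-final, the գոյ and էջ exceptions). The extra remark on the իւ row (word-final ւ -> վ) is not implemented. The name classic_to_revised and the placeholders <1> and <2> are inventions for this example, and a plain sequential pass like this can still mis-convert some words.

```shell
# Hypothetical helper, not part of the distribution.
# Reads one word per line on stdin and writes the converted spelling to stdout.
# Rules are applied in table order, so longer patterns run before shorter ones.
# գոյ and էջ are hidden behind placeholders while the ոյ and է rules run.
# The օ rule loops so that every word-medial օ is replaced, even adjacent ones.
classic_to_revised() {
  sed \
    -e 's/ուեա/վյա/g' \
    -e 's/ուե/վե/g' \
    -e 's/ուա/վա/g' \
    -e 's/ուէ/վե/g' \
    -e 's/իւ/յու/g' \
    -e 's/եա/յա/g' \
    -e 's/գոյ/<1>/g' -e 's/ոյ/ույ/g' -e 's/<1>/գոյ/g' \
    -e 's/ուիլ/վել/g' \
    -e 's/ուի/վի/g' \
    -e 's/աւ/ավ/g' \
    -e 's/ւոր/վոր/g' \
    -e 's/ւում/վում/g' \
    -e 's/րէ/րե/g' \
    -e ':o' -e 's/\(.\)օ\(.\)/\1ո\2/' -e 'to' \
    -e 's/էջ/<2>/g' -e 's/է/ե/g' -e 's/<2>/էջ/g' \
    -e 's/յ$//' \
    -e 's/թյանց$/թյուններ/'
}

# Example:
#   printf '%s\n' 'պատմութիւն' 'հայերէն' | classic_to_revised
#   prints պատմություն and հայերեն, one per line
```

Because every rule runs on the output of the previous one, the order of the -e expressions matters as much as the order of the rows in the table.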

Known issues

The details of the known problems are documented in the Qyasibi Jira - Tagger spreadsheet (ask the author for permission to review it). These are some of the major issues:

False positive errors

Misspelled words that the spellchecker does not catch.

Examples: միջանցներից, ամենաընդհանրան, Ժաժանեցի, Կարուցման, հաջորդանություն, հաջորդկանությունները, պաշտինները, արդեեն, ընդհարական, կիրարական, վերձված, արտածաման, փոխանցվել, ընդհանրակնից, պաշտինները, արդեեն, ընդհարական, բավակն, տեսակնորեն, արտահատվում, քերակականական, քերակական, Լեզվալան, Ազգակցան, քողակված, մորանալը, համակարչային, վերկանգնող, ահմար, անուներ, խնդր, կառուցուն, կարճացելու, համակարչի, հոլովաձեւվից, Մուրենը, ուղությամբ, չեզուքացում, վերկրնական, քերակականական, կաշեցի, հառակը, օդնությամբ, արտահատված, համամարդակային, ընթեցումից, etc.

False negative errors

Correctly spelled words that the spellchecker reports as misspelled.

Examples: չանելու, կա, չստանալով, չստանալը, ստորոգելիական, երբվանից, չկան, etc.

Malfunction

Some words are left untagged without any warning or reason
Incorrect lemmatization: աղյուսակում, ամենայ, ընդհատականի, ինքնաբարելավ, խոնարհման/ում, ծն, կապահովի, համաձայնեց, հայերենի, հայրենիք, հայրենի, հատկացվ, հեռավորությունն, հիմնակ, ձեւափոխ, մասն, միջ, մշտ
Suffix lemmatization: ան, անած, անակ
Wrong suffix at lemmatization: զբաղմ-անք
Incorrect lemmatization: թվայինացում (թվայնացում), հաղորդակիցում (հաղորդակցում), մեծության (մեծություն)
STM.DUAL + POX.PSPT.OMOD returns POX.POX rather than PSPT
Finds the lemma բացատրել for բացատրի correctly, but also returns an extra բացատր as a NOML lemma

Processors