aramhayr

The Eastern Armenian National Corpus: State of the Art and Perspectives: questions, comments

Updated: Aug 27

Feb 26, 2023




These are some questions for the authors of "The Eastern Armenian National Corpus: State of the Art and Perspectives". All attempts to get in touch have failed, so any help in passing these questions on to the authors is greatly appreciated.


  1. EANC includes not only all school reading texts in today’s Armenian secondary school program, but the vast majority of MEA classical literature starting from the mid 19th century, a large number of scientific texts (including the 13-volume Armenian Soviet Encyclopaedia 1974-1987).

    1. [AH: Did they use this ՀՍՀ (the Armenian Soviet Encyclopaedia) as raw text? As far as I know, it has a lot of typos; very few pages have been spellchecked, edited, or approved.]


  1. The main challenge has been to work out a formal morphological model of Modern Eastern Armenian inflection that would be comprehensive enough to cover most of the corpus tokens. In other words, each lexeme that inflects had to be provided with information about its paradigmatic type (or types, in case of inflectional variance) that predicts its forms. This challenge may seem unexpected, provided a long tradition of Armenian studies. However, the conventional grammars of Eastern Armenian proved not to be formal enough for an automatic analysis (lemmatization) of EANC electronic library, which is quite justifiable because conventional grammars serve a purpose other than automatic processing (mostly educational).

    1. [AH: Are the authors familiar with Աղա1968, Աբր1974, Ջահ1974, Уру1975, Առա1979? BTW, the first is a "conventional grammar" textbook for high school (for my generation and earlier ones it was the 7th-grade (?) grammar textbook). Section 8 of Հայ2022 gives a summary of the morphological data from those sources.]


  1. Supplemented with part-of-speech and inflectional information, the EANC wordlist became a grammatical e-dictionary, similar to those used by Internet search engines for other morphologically rich languages.  The EANC Parser assigns token markup tags to each word form, provided that the respective lexeme is present in the EANC grammatical wordlist. Overall, 92,5% of all tokens are recognized and annotated with 72,6% analyzed unambiguously, 17% ambiguously, and 7,5% not recognized. Parsing success rate varies depending on a genre. The highest percentage of unrecognized tokens occurs, unsurprisingly, in oral discourse.

    1. [AH: I interpret the second sentence as confirmation that the wordlist contains all text forms. The questions are: 1) Are the forms partially generated automatically, for example the regular declension and conjugation forms? Or 2) is the wordlist completely hand-made? If neither, then 3) how do they find a text form in the wordlist?]

    2. [AH: How is ambiguity, particularly syncretism, handled? For example, how is the tag assigned for a form like կաղապարում? Does it get 3 tags and get counted in the 17%?]
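
To make the two questions above concrete, here is a minimal sketch of option 1 (a wordlist partially generated from paradigms) and of how a syncretic form would end up with several tags. Everything in it is an assumption for illustration: the tag names, the tiny paradigm tables, and the three readings of կաղապարում (locative of the noun կաղապար, imperfective participle of կաղապարել, and the verbal noun կաղապարում). It is not the EANC implementation.

```python
from collections import Counter, defaultdict

# Hypothetical, heavily simplified paradigms: tag -> suffix added to the lemma.
# Only a few singular forms are covered; real paradigms are far richer.
PARADIGMS = {
    "noun_i": {
        "N.nom.sg": "",
        "N.gen_dat.sg": "ի",
        "N.abl.sg": "ից",
        "N.ins.sg": "ով",
        "N.loc.sg": "ում",
    },
    # Used for the verbal noun, whose oblique forms follow another pattern.
    "noun_nom_only": {"N.nom.sg": ""},
}


def generate_forms(lemma, paradigm):
    """Return (form, tag) pairs predicted by the lemma's paradigm."""
    if paradigm == "verb_el":
        stem = lemma[:-2]  # strip the infinitive ending "ել"
        return [(lemma, "V.inf"), (stem + "ում", "V.ptcp.ipfv")]
    return [(lemma + suffix, tag) for tag, suffix in PARADIGMS[paradigm].items()]


# Illustrative lexicon: each lexeme carries its paradigmatic type.
LEXICON = [
    ("քաղաք", "noun_i"),
    ("կաղապար", "noun_i"),
    ("կաղապարել", "verb_el"),
    ("կաղապարում", "noun_nom_only"),
]


def build_index(lexicon):
    """Map every generated text form to all of its (lemma, tag) analyses."""
    index = defaultdict(list)
    for lemma, paradigm in lexicon:
        for form, tag in generate_forms(lemma, paradigm):
            index[form].append((lemma, tag))
    return index


def classify(tokens, index):
    """Count tokens with zero, one, or several analyses."""
    counts = Counter()
    for token in tokens:
        n = len(index.get(token, []))
        label = "unrecognized" if n == 0 else "unambiguous" if n == 1 else "ambiguous"
        counts[label] += 1
    return counts


INDEX = build_index(LEXICON)
```

Under these assumptions `INDEX["կաղապարում"]` holds three analyses, so the token would be counted once as ambiguous, while `INDEX["քաղաքում"]` holds a single locative reading.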


  1. A major problem of the EANC is the presence of numerous mistakes in optical character recognition. Wrong or impossible spellings result in losing hits and/or returning wrong hits. A number of procedures have been implemented to increase the accuracy, including human-assisted proofreading of the most important texts.

    1. [AH: The tagging system is itself a spellchecker: if it cannot tag a word form, then the word form is misspelled. After that there are 2 options for correction: 1) automatically suggest the closest forms (Levenshtein/Damerau distance), and 2) correct manually.]
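
The first correction option can be sketched as follows. This is a minimal illustration, assuming the wordlist is available as a plain collection of forms; the function names, the distance threshold, and the sample wordlist are hypothetical. The distance used is the optimal string alignment variant of Damerau-Levenshtein (insertions, deletions, substitutions, and transpositions of adjacent characters).

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur[j] = min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def suggest(word, wordlist, max_distance=2, limit=3):
    """Return up to `limit` (form, distance) pairs, closest first.

    A word already present in the wordlist needs no correction.
    """
    if word in wordlist:
        return []
    scored = [(form, osa_distance(word, form)) for form in wordlist]
    close = [pair for pair in scored if pair[1] <= max_distance]
    return sorted(close, key=lambda pair: (pair[1], pair[0]))[:limit]


# Hypothetical wordlist; "քաղքա" has the last two letters of "քաղաք" swapped.
WORDLIST = {"քաղաք", "քաղաքի", "քաղաքում", "կաղապար"}
SUGGESTIONS = suggest("քաղքա", WORDLIST)
```

A linear scan over the whole wordlist is only workable for small lists; for a full grammatical dictionary some index (for example by length or by character n-grams) would be needed, which is outside this sketch.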


  1. An important problem is the absence of syntactic and morphosyntactic markup. MEA is rich in periphrastic constructions in verbal morphology which are ignored by the parser. One of the perspectives of the project could be the implementation of basic collocation markup, including markup of auxiliary verb constructions.

    1. [AH: The morphological unit is a word group, that is, a lexeme (իմաստույթ), rather than a single word. See the analysis of [other than verb] lexemes in Հայ2022 (section 9.4). For the verb and nominal lexemes see Հայ2022 (sections 12.6, 12.8, 12.9).]
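
As an illustration of what basic markup of auxiliary verb constructions could look like on top of word-level tags, here is a minimal sketch. It assumes the parser output is a list of (token, set of tags) pairs, uses hypothetical tag names, and only looks at adjacent pairs; the auxiliary list covers just the affirmative present and past forms.

```python
# Affirmative present and past forms of the auxiliary (an illustrative subset;
# negated forms such as "չեմ" are deliberately left out).
AUX_FORMS = {
    "եմ", "ես", "է", "ենք", "եք", "են",
    "էի", "էիր", "էր", "էինք", "էիք", "էին",
}

PARTICIPLE_PREFIX = "V.ptcp"  # hypothetical tag prefix for participles


def has_participle_reading(tags):
    """True if at least one of the token's tags is a participle tag."""
    return any(tag.startswith(PARTICIPLE_PREFIX) for tag in tags)


def mark_periphrastic(tagged):
    """Return (start, end, label) spans for adjacent participle/auxiliary pairs.

    `tagged` is a list of (token, set_of_tags) pairs. The spans are only
    candidates: a form that also has a nominal reading stays ambiguous.
    """
    spans = []
    for i in range(len(tagged) - 1):
        (w1, tags1), (w2, tags2) = tagged[i], tagged[i + 1]
        if has_participle_reading(tags1) and w2 in AUX_FORMS:
            spans.append((i, i + 1, "ptcp+aux"))
        elif w1 in AUX_FORMS and has_participle_reading(tags2):
            spans.append((i, i + 1, "aux+ptcp"))
    return spans


# "Նա գրում է" (he/she writes): participle followed by the auxiliary.
SENTENCE = [("նա", {"PRON"}), ("գրում", {"V.ptcp.ipfv"}), ("է", {"AUX"})]
SPANS = mark_periphrastic(SENTENCE)
```

Treating the pair as one unit is the point of the comment above: the analytic form as a whole, not each word separately, carries the tense and person information.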


  1. Ignoring the context also leads to significant number of ambiguous cases in parsing results, which, for some queries, is a strong ‘noise’ factor. One of the solutions is human-assisted ambiguity removal.

    1. [AH: These and other ambiguities are discussed in detail in Հայ2022 (section 8.4.1). I am not sure what the 'noise' factor is.]


  1. For selected highly frequent cases of an extremely improbable homonymy, second readings have already been eliminated (e.g. the locative asum from the noun as).

    1. [AH: What is the noun as? All known variants are either foreign or dialect words. For a better example, see Հայ2022, p. 111: the form կաղապարում is three-way ambiguous.]
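
The elimination of improbable second readings described in the quoted passage can be sketched as a suppression list applied after lookup. This is a minimal illustration, not the EANC procedure: the Armenian spellings ասում and աս for the transliterated asum and as, the tag names, and the rule of never removing the last remaining reading are all assumptions.

```python
# Hypothetical suppression list of (form, lemma, tag) readings judged
# extremely improbable.
SUPPRESSED = {
    ("ասում", "աս", "N.loc.sg"),
}


def filter_readings(form, analyses):
    """Drop suppressed readings, but never leave a form with no reading."""
    kept = [(lemma, tag) for lemma, tag in analyses
            if (form, lemma, tag) not in SUPPRESSED]
    return kept if kept else list(analyses)


# Two candidate readings: imperfective participle of ասել and the noun locative.
READINGS = [("ասել", "V.ptcp.ipfv"), ("աս", "N.loc.sg")]
FILTERED = filter_readings("ասում", READINGS)
```

A list like this only helps for individual frequent forms; a systematically syncretic form such as կաղապարում cannot be resolved this way, because none of its readings is improbable.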


  1. EANC rule-based annotation model will be accompanied by RNN, transformer-based and hybrid models in order to attune the ambiguity and to provide context-based (hence future syntactic) annotation;

    1. [AH: This assumes a huge number of manually tagged texts. See Հայ2022, p. 180, 3rd paragraph from the top. If such a high-quality corpus already exists (e.g. Arevak), then it looks reasonable to implement the whole tagging process with an RNN, seq2seq in particular. What is the benefit of mixing algorithmic and statistical methods?]


  1. The article mentions a disparity between the dates of the “press” texts in the corpus and those of the poetry, prose, etc. (the press texts being mainly from 2000 onward).

    1. [DN: What is the significance of that, and how does it affect the corpus? Is the issue that pre-2000 press texts just haven’t been entered into the corpus, or is there another reason?]


  1. The article also mentions other corpora that are similar to the Eastern Armenian one: Georgian, Czech, and Slovak.

    1. [DN: What are those like in terms of the percentage of texts analyzed?]


[Уру1975] Р. Л. Урутян. Модель порождения именных форм современного армянского языка. «ԷՀՄ-ների և մաթեմատիկական կաղապարների կիրառությունը լեզվաբանության մեջ» միջազգային գիտաժողովի նյութեր, 6 էջ. Վառնա (Բուլղարիա), 1975։

[Աբր1974] Ս. Գ. Աբրահամյան, Ն. Ս. Պառնասյան, Հ. Ա. Օհանյան. Ժամանակակից հայոց լեզու. Հատոր 2, ՀՍՍՀ գիտությունների ակադեմիայի հրատարակություն, Երևան, 1974։

[Աղա1968] Է. Բ. Աղայան, Հ. Խ. Բարսեղյան. Հայոց լեզու. Դասագիրք V-VII դասարանների համար. 4-րդ հրատարակություն, «Լույս», Երևան, 1968։

[Առա1979] Վ. Դ. Առաքելյան, Ա. Հ. Խաչատրյան, Ս. Ա. Էլոյան. Ժամանակակից հայոց լեզու. Հատոր 1, ՀՍՍՀ գիտությունների ակադեմիայի հրատարակություն, Երևան, 1979։

[Ջահ1974] Գ. Ջահուկյան. Ժամանակակից հայերենի տեսության հիմունքները. ՀՍՍՀ գիտությունների ակադեմիայի հրատարակչություն, 1974։



