Unicode Standard overview
To evaluate the encoding of Armenian characters, we need to understand the principles of the Unicode Standard and determine how well the specified character set covers textual sources. Coverage has 2 aspects: 1) the percentage of sources (units of text) that the set can represent, and 2) the percentage of characters used in all texts that are specified in the set.
I am going to address only the principles that are relevant to the Armenian character set and that affect the rendering and processing of texts.
The Principles
Any computer symbol has three important characteristics: 1) an internal binary code, 2) a glyph (the image associated with the code), and 3) the semantics of the symbol (letter, digit, punctuation, diacritic, etc.).
“The difference between identifying a code point and rendering it on screen or paper is crucial to understanding the Unicode Standard's role in text processing. The character identified by a Unicode code point is an abstract entity, such as "LATIN CHARACTER CAPITAL A" or "BENGALI DIGIT 5." The mark made on screen or paper—called a glyph—is a visual representation of the character [1].”
“The Unicode Standard directly addresses only the encoding and semantics of text. It addresses no other action performed on the text [1].” What is the semantics of text? Simply put, it is the role of the symbol (glyph) in rendering the text. Some roles, for example space and line end, do not have glyphs (they are invisible).
The above sounds simple and reasonable until you realize that, since Unicode is a universal encoding, the same glyph can have different text semantics in different writing systems. The simplest example is the Armenian Middle Dot (mijaket). Its role is similar to that of a colon (or semicolon), but it looks like the full stop of most writing systems. That last statement is debatable: some might say that the mijaket is a rhombus (or rather a square with a corner pointing up), while the full stop is a circle.
That is why the combination of glyph and role makes a symbol unique. Unicode assigns codes to unique symbols.
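A minimal Python sketch of this point, using only the standard `unicodedata` module. The choice of these four dot-like code points is my own illustration; the helper name `describe` is hypothetical.

```python
import unicodedata

# Dot-like characters: similar glyphs, but distinct code points and names.
DOTS = ["\u002E", "\u00B7", "\u2024", "\u0589"]

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in DOTS:
    print(*describe(ch))
```

Each line of output pairs a code point with its formal name and category, which is how a language-agnostic processor tells these look-alike signs apart.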
For each symbol, the Unicode Standard specifies a code, a glyph, and the following categories, which are relevant to our discussion:
General Category (basic partition into letters, numbers, symbols, punctuation, and so on)
Other important general characteristics (whitespace, dash, ideographic, alphabetic, noncharacter, deprecated, and so on)
Casing (upper, lower, title, folding—both simple and full)
The categories are used by language-agnostic text processors for rendering, tokenizing, capitalizing, etc. These are the categories (abbreviated names in parentheses) that might need revisiting in the Armenian-related encoding, along with some more or less well-known examples of signs that belong to each category:
Other Punctuation (Po) - Exclamation mark (!), question mark (?), ampersand, asterisk
Modifier Letter (Lm) - Circumflex, Caron, Acute Accent, Grave Accent [Note: it should have been named Letter Modifier, since it modifies the phoneme represented by the letter]
Other Symbol (So) - Copyright Sign, Registered Sign, Degree Sign
Currency Symbol (Sc) - Dollar, Tugrik, Hryvnia, Manat, Bitcoin, Yen
Dash Punctuation (Pd) - Dash, Hyphen
Lowercase Letter (Ll) - ո, n, o, օ, о
Nonspacing Mark (Mn) - combining marks used to form characters such as Á, ẳ, ů.
Modifier Symbol (Sk) - `, ¯, ¸.
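The categories above can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration: the sample characters are my own picks for each category, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Expected general category -> sample characters (illustrative picks).
EXAMPLES = {
    "Po": "!?&*",
    "Lm": "\u02C6\u02C7",          # modifier letter circumflex accent, caron
    "So": "\u00A9\u00AE\u00B0",    # copyright, registered, degree
    "Sc": "$\u20AE\u20B4\u00A5",   # dollar, tugrik, hryvnia, yen
    "Pd": "-\u2010\u2014",         # hyphen-minus, hyphen, em dash
    "Ll": "\u0578no\u0585\u043E",  # Armenian vo, Latin n and o, Armenian oh, Cyrillic o
    "Mn": "\u0301\u030A",          # combining acute accent, combining ring above
    "Sk": "`\u00AF\u00B8",         # grave accent, macron, cedilla
}

def mismatches(table):
    """List (char, expected, actual) for every sample whose category differs."""
    return [
        (ch, expected, unicodedata.category(ch))
        for expected, chars in table.items()
        for ch in chars
        if unicodedata.category(ch) != expected
    ]

print(mismatches(EXAMPLES))  # an empty list means every sample matched
```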
Armenian characters set
Overview
What is missing today is a scholarly justified Armenian character set. When we worked on the ArmSCII standards (7- and 8-bit) [2], our (at least my) approach was not exactly scientific. Even though we got linguists, historians, and software engineers involved in the project, we failed to organize the process properly. It was an “Armenian” rather than a scientific debate: in an Armenian debate you start by discussing 3 proposals in a group of half a dozen people and end up with 7 proposals (statistically, one of 6 Armenians cannot decide between 2 alternatives that nobody else supports); in a scientific debate the outcome is one. Not much later, in the mid-1990s, when I was involved in reconciling the Unicode Armenian block with (mapping it into) ArmSCII, the situation had not changed much. Emotions beat reason.
Along with these subjective problems there are some objective ones. At the time, the focus was on exchanging messages. Today we deal with a much wider spectrum of tools, communication sources, and audiences. These look like sufficient reasons to revisit the encodings.
As of today, the issues related to և and ու are resolved, with no casualties among the people involved. The majority agrees that the former is a symbol similar to &, or a lowercase shorthand for եւ, while the latter is a letter represented by 2 characters (note: you can still find recent publications listing Եվ/և as the last letter of the Armenian alphabet).
Currently, there are no issues with letters.
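As a side illustration of how և is treated in practice, the sketch below asks Python's standard library what it does with U+0587. It reports the behavior of Python's bundled Unicode data, not a recommendation; the helper name `codepoints` is hypothetical.

```python
import unicodedata

YEV = "\u0587"  # ARMENIAN SMALL LIGATURE ECH YIWN

def codepoints(text):
    """Spell a string out as a list of U+XXXX code points."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(unicodedata.category(YEV))                       # general category of the sign
print(codepoints(unicodedata.normalize("NFKC", YEV)))  # compatibility normalization
print(codepoints(YEV.casefold()))                      # full case folding
print(codepoints(YEV.upper()))                         # full uppercase mapping
```

Compatibility normalization and case folding both expand the sign into the two letters ե (U+0565) and ւ (U+0582), which matches the "shorthand for եւ" reading.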
I listed the non-letter characters for review and discussion. Each item on the list has an ArmSCII code (like xA1) [2], then the Unicode code point [3], the name of the sign in ArmSCII (or in Unicode, if there is no ArmSCII code), the glyph in square brackets, the categories in parentheses, and my comment separated by a dash ‘-’:
1. N/A U+055F Abbreviation Mark (patiw) [ ՟ ] (Po) - importance to be determined
2. N/A U+0559 Modifier Letter Left Half Ring (suin) [ ՙ ] (Lm) - usage unknown, importance to be determined
3. xA1 U+058D Eternity Sign (right facing) [ ֍ ] (So) - importance to be determined
4. xA2 U+0587 Armenian Sign 'Yev'/Armenian Small Ligature Ech Yiwn [ և ] (Ll)
5. xA3 U+0589 Armenian Full Stop (Verjaket) [ ։ ] (Po)
6. xA4 U+0029 Right Parenthesis
7. xA5 U+0028 Left Parenthesis
8. xA6 U+00BB Armenian Right Quote (Aj Tchakert)
9. xA7 U+00AB Armenian Left Quote (Dzakh Tchakert)
10. xA8 U+2011 Armenian Joining Line (Miutyan Gtsik) [ ‑ ] (Pd) - a regular dash could have sufficed; it differs from xAC because it is used to join morphemes, for example, ինչ‑որ
11. xA9 U+2024 Armenian Middle Dot (Mijaket) [ ․ ] (Po) - no major problem; however, having this instead of U+0559 is preferable.
12. xAA U+055D Armenian Separation Sign (Boot) [ ՝ ] (Po)
13. xAB U+002C Comma (Storaket) - the regular comma x2C could have been used in ArmSCII
14. xAC U+2014 Armenian Separating Line (Anjatman Gits) [ — ] (Pd) - a regular dash could have sufficed; it differs from xA8 because it is used for ranges (?), for example, e — 𝝿
15. xAD U+058A Armenian Hyphen (Yentamna) [ ֊ ] (Pd) - a regular hyphen could have sufficed (if the glyph difference is not important)
16. xAE U+2026 Armenian Ellipsis (Kakhman Keter) [ … ] (Po)
17. xAF U+055C Armenian Exclamation Mark (Batsakanchakan) [ ՜ ] (Po) - wrong category; (Sk) looks more appropriate (for (Mn) categorization see Code mappings and classification [2])
18. xB0 U+055B Armenian Emphasis (Shesht) [ ՛ ] (Po) - wrong category; (Sk) looks more appropriate (for (Mn) categorization see Code mappings and classification [2])
19. xB1 U+055E Armenian Question Mark (Paruyk) [ ՞ ] (Po) - wrong category; (Sk) looks more appropriate (for (Mn) categorization see Code mappings and classification [2])
20. xFE U+055A Armenian Apostrophe (Apatarts) [ ՚ ] (Po) - the regular apostrophe (x27, U+0027) could have sufficed
21. N/A U+058E Eternity Sign (left facing) [ ֎ ] (So) - importance to be determined
22. N/A U+058F Armenian Dram Sign [ ֏ ] (Sc) - adding it to ArmSCII looks important
23. N/A U+FB13 Armenian Small Ligature Men Now [ ﬓ ] (Ll) - importance to be determined
24. N/A U+FB14 Armenian Small Ligature Men Ech [ ﬔ ] (Ll) - importance to be determined
25. N/A U+FB15 Armenian Small Ligature Men Ini [ ﬕ ] (Ll) - importance to be determined
26. N/A U+FB16 Armenian Small Ligature Vew Now [ ﬖ ] (Ll) - importance to be determined
27. N/A U+FB17 Armenian Small Ligature Men Xeh [ ﬗ ] (Ll) - importance to be determined
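The ArmSCII column of the list can be written down as a lookup table and cross-checked against the categories that Python's `unicodedata` reports. This is a rough sketch, not a full ArmSCII codec: the table is transcribed from the list as given here, covers only the non-letter positions, and the names `ARMSCII_TO_UNICODE` and `category_report` are hypothetical.

```python
import unicodedata

# ArmSCII byte -> Unicode code point, as given in the list above
# (non-letter positions only; letters are deliberately left out).
ARMSCII_TO_UNICODE = {
    0xA1: 0x058D, 0xA2: 0x0587, 0xA3: 0x0589, 0xA4: 0x0029, 0xA5: 0x0028,
    0xA6: 0x00BB, 0xA7: 0x00AB, 0xA8: 0x2011, 0xA9: 0x2024, 0xAA: 0x055D,
    0xAB: 0x002C, 0xAC: 0x2014, 0xAD: 0x058A, 0xAE: 0x2026, 0xAF: 0x055C,
    0xB0: 0x055B, 0xB1: 0x055E, 0xFE: 0x055A,
}

def category_report(mapping):
    """Map 'xNN' -> (U+XXXX, general category) for every table entry."""
    return {
        f"x{byte:02X}": (f"U+{cp:04X}", unicodedata.category(chr(cp)))
        for byte, cp in sorted(mapping.items())
    }

for armscii, (code_point, category) in category_report(ARMSCII_TO_UNICODE).items():
    print(armscii, code_point, category)
```

The categories come from the Unicode data bundled with the interpreter, so an older Python may not know the newer signs such as the eternity signs.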
Punctuation vs intonation signs
All issues except those of items 17 - 19 (the exclamation mark, the emphasis mark, and the question mark) are superficial (note: this issue may be superficial too, since no one is crying out loud).
At first glance these signs look correctly categorized. However, punctuation signs sit outside words, and text processors treat them as word separators. Punctuation signs typically separate phrases.
In Armenian these 3 signs change tone (more precisely, intonation) and are associated with words rather than phonemes (unlike in Chinese or Vietnamese). More accurately, they put three different kinds of logical stress on words.
My non-professional take on this issue (phonologists should explain this) is as follows: interrogation, exclamation, and emphasis are encoded by intonation in all languages (?). In some cases special words or word order are used in addition. Typically, emphasis is associated with a word by putting a stress mark on the last syllable of the word. The other two are marked by punctuation signs such as '!' and '?' (plus syntax). In cases where the combination of word order, special words, and punctuation signs still leaves room for misinterpretation, the emphasis sign is added to the relevant word. Armenian signs #17 and #19 (the exclamation and question marks) are combinations of the punctuation signs '!' and '?' with the emphasis sign, and they specify the word that undergoes the tone change.
Text processors that take the character categories standardized by Unicode seriously tokenize Armenian words incorrectly. See, for example, how Armenian words are tokenized in Universal Dependencies (UD). They have modified their tokenizer for Armenian, while the whole purpose of Unicode or Universal Dependencies is to minimize script-specific processing (I would guess).
The UD processor treats the word form “ինչպէ՞ս” as “how?”, which becomes two tokens, “ինչպէս” and “՞”; this does not look incorrect. They 'refer to such cases as multiword tokens' (I would refer to them as multi-token words). Telegram’s tokenizer splits the same ինչպէ՞ս into 2 tokens, ինչպէ and ս, in received messages, but treats it as a single token while a message is being composed (try double-clicking a word with a paruyk in received and in composed messages). However, this could be an unrelated bug.
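The splitting effect is easy to reproduce with a generic, category-driven tokenizer. The sketch below uses Python's `re` module, where `\w` matches letters and digits but not punctuation such as (Po). It is not the UD or Telegram tokenizer; both function names are hypothetical, and allowing the three marks inside a word is just one possible workaround.

```python
import re

# "ինչպէ՞ս" spelled by code point so the paruyk (U+055E) is unambiguous.
WORD = "\u056B\u0576\u0579\u057A\u0567\u055E\u057D"

# Shesht, batsakanchakan, paruyk: the three intonation signs discussed above.
INTONATION_MARKS = "\u055B\u055C\u055E"

def naive_tokens(text):
    """Split on anything that is not a word character (letters, digits, _)."""
    return re.findall(r"\w+", text)

def armenian_aware_tokens(text):
    """Same split, but the three intonation marks may occur inside a word."""
    return re.findall(rf"[\w{INTONATION_MARKS}]+", text)

print(naive_tokens(WORD))           # the word breaks at the paruyk
print(armenian_aware_tokens(WORD))  # the word stays in one piece
```

The first call returns the two fragments ինչպէ and ս, the same split reported above for received Telegram messages; the second returns the whole word form.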
Another issue is related to text search: when you search for “ինչպէս” you expect to find “ինչպէ՞ս”.
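One possible way to get the expected search behavior is to strip the three intonation signs from both the query and the text before comparing. This is a sketch under my own assumption that ignoring these marks is acceptable for search; the names `strip_intonation` and `contains` are hypothetical.

```python
# Shesht (U+055B), batsakanchakan (U+055C), paruyk (U+055E).
INTONATION_MARKS = "\u055B\u055C\u055E"
_STRIP_TABLE = str.maketrans("", "", INTONATION_MARKS)

def strip_intonation(text):
    """Remove the three Armenian intonation signs from a string."""
    return text.translate(_STRIP_TABLE)

def contains(haystack, needle):
    """Substring search that ignores intonation signs on both sides."""
    return strip_intonation(needle) in strip_intonation(haystack)

text = "\u056B\u0576\u0579\u057A\u0567\u055E\u057D"   # ինչպէ՞ս
query = "\u056B\u0576\u0579\u057A\u0567\u057D"        # ինչպէս

print(query in text)          # plain substring search
print(contains(text, query))  # search that ignores the marks
```

The plain search misses the word because the paruyk sits between its letters, while the mark-insensitive search finds it.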
Revisiting Armenian characters set
Even if we agree that everyone has been happy with the standards since the beginning of the millennium, the usage of character sets and tools has changed dramatically in 20 years. At the end of the last century it was about writing emails; now it is about OCR, search, tagging, ML, etc. Can these technologies run into problems with the existing set of characters? Is keeping the 7- and 8-bit encodings up to date still important? The need for revisiting is debatable, but if we decide to revisit, I would suggest this process:
1. Create a work group (task force) of scholars who understand the basics of DH (scanning, OCR, corpus creation, etc.):
a. 2-3 historians and source scholars (աղբյուրագետ) from the Matenadaran, Venice, Jerusalem, the British Museum, etc., who have extensive knowledge of Armenian manuscript sources
b. 2-3 linguists and philologists who have extensive knowledge of middle Armenian publications in both Eastern and Western Armenian, and in dialects
c. 2-3 linguists and philologists who specialize in contemporary media: e-publications, social media, etc.
d. 1-2 specialists in Armenian epigraphy
e. 1-2 computer scientists with a basic understanding of localization and text [non-linguistic] processing
f. 1-2 DH specialists
g. 1 standardization specialist
2. Find a home for the group. The Home will conduct official communication with the Armenian standards committee. The latter will communicate with Unicode and other international bodies. The Home could be any academic or technological organization (like AESA, which was [a successful] Home for the Unicode Armenian block design).
3. Work group members 1a - 1d prepare a report with a list of characters, their text semantics, and 3-5 levels of every aspect of textual source coverage.
4. Work group members 1e - 1g prepare a proposal to the standardization authority, based on the above report and focusing on "the majority usage of majority of users" [4].
5. Before starting the work, members of the work group must review, amend, and agree on the above approach, and swear on the Unicode v.15 documentation to follow the above principles while amending the character set.
6. The group can already start reviewing the available analyses and proposals [5 - 7] and looking for other resources.
References
R. Youatt. Unicode and ArmCII. Journal of the Society for Armenian Studies, Vol. 9, pp. 87-97, 1992.
L.V. Baronian. Armenian Phonetic Characters in Unicode. Visiting Scholar, Stanford; Associate Professor, UQAC. January 19, 2017.