Տողադարձ

aramhayr
Jul 23, 2023
2 min read

Պարզեցված կանոններ

Տողադարձի հիմնօրինակ կանոններից [Ղար1981::16-18] գործնական իմաստ ունի ամենապարզի իրացումը` երկու ձայնավորների միջեւ գտնվող բաղաձայներից վերջինը անցնում է մյուս տող:

Հաջորդական երկու ձայնավորները նույնպես կարելի է տարբեր վանկերում ընդգրկել:

Այսինք այս երկու օրենքների համաձայն վանկարկումը բավարար է գերակշռող դեպքերում. C-ով նշանակված են բաղաձայնները (consonant), իսկ V-ով` ձայնավորները (vovel)`

*-CV... - ար-տադ-րա-մաս (VC-CVC-CV-CVC)
...V-V... - ա-մե-նա-ան-տար-բեր (V-CV-CV-VC-CVC-CVC)

Կիսաձայն յ-ով կազմված յու, յա, ևն երկհնչյունները որոշ դեպքերում կարելի ձայնավոր համարել` (VC-, մա-տյան (CV-CVC): Բացառություն արժե անել ան-ով սկսվող բառերի դեպքում` ան-օրեն, ան-որոշ, ինչպես նաեւ դժ, դր, ևն երկհնչյունները` առ-դրու-թյուն

(VC-[CC]V-CVC)

Կանոնների մի այլ իրացում

Տե՛ս ocr-verification` hyphenate.py

Ալգորիթմի նկարագրությունը (Armenian syllabification algorithm)

- Works on a C/V abstraction of a word:

* Vowels are multi-character units from a configurable list

(default: ["յա","յու","յո","ու","ա","ե","է","ը","ի","ո","օ"]).

* All other Armenian letters (U+0531–U+058F) are treated as consonants.

* Armenian accent signs '՛' (U+055B), '՜' (U+055C), '՞' (U+055E)

are attached to the preceding vowel token.

- For each word W:

* Builds a sequence of tokens (kind, original_text, lower_text),

where kind ∈ {C, V}.

* Builds a C/V pattern string and recursively splits syllables

from the end using the ordered pattern set:

{CVCCC, CVCC, CVC, CV, V, VC, VCC, VCCC}

with “more-than-one-vowel remaining” as the stopping rule.

* Uses a fallback that guarantees the last syllable ends in a

vowel token when no pattern matches.

* Returns syllables as original substrings, preserving casing.

Մոտեցումը (Hyphenation strategy)

- Hyphenation is applied per run (python-docx Run object) rather

than per entire paragraph, to avoid disturbing run-level styling

(italics, bold, fonts, etc.).

- Inside a run:

* Armenian words are detected as sequences of characters in

the Armenian Unicode block (U+0530–U+058F).

* Each word is syllabified, existing U+00AD soft hyphens are

stripped, and new soft hyphens are inserted between syllables.

* Internally, syllables are joined using U+00AD; when writing

back to OOXML, each U+00AD becomes a <w:softHyphen/> element.

- Because runs are not merged or split, formatting does not bleed

across boundaries; any word that spans multiple runs is

hyphenated independently within each run segment.

DOCX / OOXML handling

- Uses python-docx for high-level document access (Document,

Paragraph, Run).

- At the OOXML level:

* For each run, only <w:t> (text) and <w:softHyphen/> children

are replaced; <w:rPr> (run properties: italic, bold, fonts,

etc.) and other children are preserved unchanged.

* New text nodes are created as <w:t> elements. When their

content has leading or trailing spaces, the attribute

xml:space="preserve" is set so that Word does not trim

whitespace.

* U+00AD characters are never left in <w:t> text; they are

converted to <w:softHyphen/>, the OOXML-native representation

of an optional hyphen.

- Paragraph and run counts and ordering are preserved exactly.

Only the internal text / soft-hyphen representation is changed.

Ձայնավորների ձեւակցումը (Vowel configuration)

- The default vowel set is suitable for Armenian, but can be

overridden from the command line with a CSV string, e.g.:

--vowels "յա,յու,յո,ու,ա,ե,է,ը,ի,ո,օ"

- The syllabifier always sorts vowels longest-first for matching,

so multi-character vowels are recognized before single-character

ones.

Limitations

-----------

- Because words that span multiple runs are never merged, syllable

boundaries are computed independently within each run. This is a

deliberate choice to guarantee that all original run-level

formatting is preserved.

Running

-----------

./hyphenate.py input.docx --vowels "յա,յու,յո,ու,ա,ե,է,ը,ի,ո,օ"

--

[Ղար1981] Ա.Ս. Ղարիբյան, Գ.Ս. Շեկոյան. Դպրոցական ուղղագրական բառարան. «Լույս», Ե., 1981.

Տողադարձ

Պարզեցված կանոններ

Կանոնների մի այլ իրացում

Ալգորիթմի նկարագրությունը (Armenian syllabification algorithm)

- Works on a C/V abstraction of a word:

Մոտեցումը (Hyphenation strategy)

- Hyphenation is applied per run (python-docx Run object) rather

DOCX / OOXML handling

- Uses python-docx for high-level document access (Document,

Ձայնավորների ձեւակցումը (Vowel configuration)

- The default vowel set is suitable for Armenian, but can be

Limitations

-----------

Running

-----------

--

Recent Posts

Comments