
PolyWordlot Data Preparation

Updated: Apr 5


PolyWordlot Login

For developers: GitHub google-1gram.


Overview

The goal is to build a word frequency dictionary for Eastern Armenian in .csv format:

word, count, definition (meaning)

We will build this dictionary using Wikipedia articles.

The game will use singular nominative-case nouns, excluding proper nouns, gerunds, infinitives, and some pronouns.


Downloading

Two data dumps were downloaded from Wikimedia Downloads:

  1. hywiki-20260301-pages-articles-multistream.xml.bz2 - 583.2 MB, an archived version of the 3.79 GB hywiki-20260301-pages-articles-multistream.xml - the content of the 324 668 Wikipedia articles.

  2. hywiktionary-20260301-pages-articles-multistream.xml.bz2 - 48 MB, an archived version of the 494.4 MB hywiktionary-20260301-pages-articles-multistream.xml - the 304 879 Wiktionary words.

We will refer to these documents as wiki.xml and wiktionary.xml.

The scripts for Wikipedia and Wiktionary data manipulation and cleaning are available in the GitHub google-1gram project.

Before processing the dictionary words, all scripts replace 'և' with 'եւ'.
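This normalization can be sketched in a couple of lines of Python (the function name is illustrative, not the actual script code):

```python
# 'և' (U+0587) is a single code point - a ligature that renders as the two
# letters 'եւ'. Replacing it up front keeps all downstream length checks
# and string comparisons consistent.
def normalize_yev(text: str) -> str:
    return text.replace("\u0587", "եւ")
```
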

Note: The new, recommended location for downloading monthly dumps is MediaWiki Content File Exports/Download MediaWiki Content Current/hywiktionary/.


Building word frequency dictionary

Wiktionary quality investigation

After running

./extract_hywiki_words.py --max 9 hywiki.xml > armenian-9.csv

we get 1,157,632 word forms of 4 to 9 letters. The words that appear only once, which constitute more than half of the total (582,338, or 50.3%), are mostly garbage. There are some words, like ֆօքսթրօթ, that one can understand despite 4 typos (2 for sure) in an 8-letter word, but in the majority of cases it is very hard to guess. The situation is not much better for words with frequencies of 2, 3, and up to a couple of dozen.

The upper limit of 9 was initially chosen because Armenian words tend to be longer. Another reason is that the letter 'ու', which is perceived as a single symbol, is a combination of 2 Unicode characters.
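A quick illustration of the 'ու' point (the example word is mine, not from the scripts):

```python
# 'ու' reads as one letter but is two Unicode code points:
# ո (U+0578) followed by ւ (U+0582). A naive len() therefore
# over-counts words that contain it.
word = "ջուր"  # "water": perceived as three letters
assert len(word) == 4                       # four code points
assert len(word.replace("ու", "u")) == 3    # three after collapsing the digraph
```
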

Using

split-list.bash armenian-9.csv 12

we get 2 files: armenian-9-more-12.csv and armenian-9-less-12.csv. The first contains 150886 words that appear 12 or more times in the Wikipedia articles. This is a reasonable amount to work with. The prediction is that roughly half of them are proper nouns, and every third word of the remaining half is a dictionary form (lemma). The expectation is that 20-25 thousand words can be used in the game.


Wikipedia data extract for the App

Since the current version of PolyWordlot uses 5- to 7-character words, after the initial investigation of the 4 - 9 length range only 5- to 7-character words were extracted for further cleaning:

./extract_hywiki_words.py --min 5 --max 7 ~/Downloads/hywiki-20260301-pages-articles-multistream.xml > armenian-5-7.csv

The armenian-5-7.csv file contains 578562 word forms (47.57% are words that appear only once in the Wikipedia content).

The armenian-7-more-12.csv contains 81309 word forms.


Building word types and definitions dictionary

The base for this dictionary is the .xml dump from the Armenian Wiktionary - #2 in the Downloading section. A Python script was developed to extract 'word', 'PoS', and 'Definition' from wiktionary.xml:


./extract_wiktionary_items.py wiktionary.xml --skip skip.txt --filter հնց,ժղ --desc-trigger տե՛ս > armenian_wikt.csv


This script creates a .csv file of records extracted from Wiktionary (by default 4 to 9 characters long) and filters out the words listed in the skip.txt file as well as words that are archaic (հնց or հին), colloquial (ժղ), abbreviations (հպվ), billiard terms (Բիլիարդային), dialect (բրբռ), or unsophisticated (գվռ). Since PolyWordlot is a family-friendly game, the skip.txt file is used to remove offensive words and adult terminology.
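A hedged sketch of the skip/filter decision; the function names and data shapes here are illustrative, not the actual script internals:

```python
# Sketch: a word survives if it is not in the skip list and none of its
# Wiktionary labels (e.g. հնց = archaic, ժղ = colloquial) match a filter tag.
def load_skip(path: str) -> set[str]:
    # skip.txt is assumed to hold words separated by commas or newlines.
    with open(path, encoding="utf-8") as f:
        return {w.strip() for w in f.read().replace(",", "\n").splitlines() if w.strip()}

def keep_entry(word: str, labels: set[str], skip: set[str], filters: set[str]) -> bool:
    return word not in skip and not (labels & filters)
```
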

The --desc-trigger argument is a list of words that, when found in the first bullet item of a word's description, trigger the removal of that bullet item from the description. This is the actual command used for the Wiktionary extraction:


./extract_wiktionary_items.py --skip skip.txt --filter հնց,ժղ,հպվ,Բիլիարդային,բրբռ,գվռ,հին --desc-trigger տե՛ս,տե´ս hywiktionary-20260301-pages-articles-multistream.xml > armenian_wikt.csv


From the original 304 879 words, only 91694 made the cut. Here is a more detailed report:


Processed titles: 313943

Extracted entries: 91694

Skipped/filtered entries: 122390

Titles with non-Armenian characters: 99774

Titles ignored because of redirects: 85


I have no explanation for the discrepancy with the number of words reported by Wiktionary on 2026-03-01: there are ~9000 extra words processed (Processed titles:). Skipping is due to several factors, but the major one is the inability to find a tag listed in the map below (see Tables Ա1.1 and Ա1.2 in On Syntactic Structure Representation, pp. 183-4, for the Latin abbreviations). The map is kept in a pos-map.txt file separate from the script, to make the script universal and applicable to other languages:

-hy-ած-  ADJ

-hy-մակ-  ADV

-hy-գո-   NOUN

-hy-բայ-  VERB

-hy-դեր-  PRON

-hy-թվ-   NUM

-hy-կապ-  ADP

-hy-շաղ-  CONJ

-hy-ձա-   INTJ

-hy-եղբ-  MOD

-hy-հատ-  PLC

-hy-հպվ-  ABBR
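Loading such a map takes only a few lines of Python; this sketch assumes one whitespace-separated template/tag pair per line, as in the listing above:

```python
# Sketch of loading pos-map.txt into a template -> tag dictionary, so the
# tag table lives outside the script and can be swapped per language.
def load_pos_map(path: str) -> dict[str, str]:
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:           # skip blank or malformed lines
                template, tag = parts
                mapping[template] = tag   # e.g. "-hy-գո-" -> "NOUN"
    return mapping
```
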


The skip.txt file currently contains about 677 words.

The quality of the Armenian Wiktionary content seems much higher than that of the Armenian Wikipedia articles, despite some typos (mostly markup errors and questionable PoS assignments). However, a detailed investigation is needed for both.


Combining word types and counts

The end result of the data processing is a .csv file of this format:

word, count, tag, definition

Run:

./combine-csv.py --augment augment-records.csv word-tags.csv word-count.csv > combined.csv

It merges the data from the previously obtained files for these word types only: NOUN, PRON, VERB, NUM, MOD, ADP. It adds 1 to the original count; if a word is not in wiki.xml, it gets count 1. You can add more words using the optional augment-records.csv, which has the same format as word-tags.csv.
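The merge rule can be sketched as follows (the in-memory data shapes are assumptions; the real script reads and writes .csv files):

```python
# Sketch of the combine step: keep only game-relevant word types and give
# every Wiktionary word its Wikipedia count plus 1, so words absent from
# wiki.xml still receive count 1.
GAME_TAGS = {"NOUN", "PRON", "VERB", "NUM", "MOD", "ADP"}

def combine(tagged: dict[str, tuple[str, str]],
            counts: dict[str, int]) -> dict[str, tuple[int, str, str]]:
    out = {}
    for word, (tag, definition) in tagged.items():
        if tag in GAME_TAGS:
            out[word] = (counts.get(word, 0) + 1, tag, definition)
    return out
```
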

If you run:

./combine-csv.py wiktionary/armenian_wikt.csv wiki/armenian-5-7.csv > hy-combined.csv

you get the resulting file of 49658 words.


Data formatting for application

The final step in data preparation is splitting hy-combined.csv into two types of files per word length: dictionary-N.txt and answers-N.txt, where N in the file name is the word length.

There are 2 scripts for that purpose:

./create_dictionaries.py --count <f> hy-combined.csv

./create_answers.py --skip <%> --keep <%> dictionary-N.txt

The first script splits hy-combined.csv into word-length-specific files - the dictionary-N.txt files - skipping the words with a frequency (count) less than f.

The second extracts the keep percentage of the overall number of words in the dictionary-N.txt file, skipping the skip percentage of the highest-frequency (count) words. Additionally, a file of manually created records in the same format can be supplied via the --ignore argument to prevent the listed words from appearing in the answers.
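The skip/keep selection can be sketched like this, assuming the dictionary words are sorted by count in descending order (a simplification of the real script):

```python
# Sketch: drop the top `skip`% most frequent words, then keep the next
# `keep`% of the full dictionary as candidate answers.
def pick_answers(words_by_count_desc: list[str], skip: int, keep: int) -> list[str]:
    n = len(words_by_count_desc)
    start = n * skip // 100
    return words_by_count_desc[start:start + n * keep // 100]
```
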

The results for this data set are:

dictionary-5.txt: 2784 words, count range 4..265320

dictionary-6.txt: 3251 words, count range 4..276040

dictionary-7.txt: 3584 words, count range 4..192989


./create_answers.py --skip 5 --keep 60  --ignore wiktionary/problematic-answers.txt dictionary-5.txt

answers-5.txt: 1513 entries, count range 18..2938

./create_answers.py --skip 5 --keep 50  --ignore wiktionary/problematic-answers.txt dictionary-6.txt

answers-6.txt: 1461 entries, count range 24..2257

./create_answers.py --skip 5 --keep 50  --ignore wiktionary/problematic-answers.txt dictionary-7.txt

answers-7.txt: 1716 entries, count range 24..1844


The list of answer words to guess is a subset of all dictionary words of a given length. For each length, roughly 50% of the medium- to least-frequent words are selected to make guessing harder.


Helper tools

There are several tools that help manipulate file content and split the huge files into manageable chunks.


split-list.bash

# Usage: split-list.bash <name>.csv n


Splits the <name> file into 2 files: 1) records with count greater than or equal to n, and 2) records with count less than n.
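In Python, the same split logic looks like this (a sketch; the real tool is a bash script operating on .csv lines):

```python
# Sketch of split-list.bash: route each (word, count) record by its count
# relative to the threshold n.
def split_by_count(rows: list[tuple[str, int]], n: int):
    more = [(w, c) for w, c in rows if c >= n]   # goes to <name>-more-<n>.csv
    less = [(w, c) for w, c in rows if c < n]    # goes to <name>-less-<n>.csv
    return more, less
```
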


split-wordforms.py 

Splits word forms into groups by a list of suffixes, for example: ել ում ած եր ով իկ ող.

Takes a text file with one record per line in the format:

    <word><separator><text>

where <separator> is any of: space, tab, comma, period, dash, colon.

It moves records whose <word> ends with one of the given suffixes into separate files named:

    <name>-<suffix>.txt

where <name>.txt is the original input file. Records that do not match any suffix remain in the original file. At the end, it prints a report of the form:

    <filename>.txt - <number of records>

for all files involved, including the updated input file:

./split-wordforms.py wiktionary/armenian_wikt.csv --suffix ել ում ած եր ով իկ ող 

wiktionary/armenian_wikt-ած.csv - 810 records

wiktionary/armenian_wikt-ել.csv - 2092 records

wiktionary/armenian_wikt-եր.csv - 909 records

wiktionary/armenian_wikt-իկ.csv - 718 records

wiktionary/armenian_wikt-ող.csv - 577 records

wiktionary/armenian_wikt-ով.csv - 2981 records

wiktionary/armenian_wikt-ում.csv - 791 records

wiktionary/armenian_wikt.csv - 83164 records


The suffixes above were selected because words with these suffixes (except for -ել) are frequently not dictionary forms.


Refactored toolkit

I refactored the toolkit and added debug logging to ease maintenance and to enable further use of the toolkit for Wikipedia and Wiktionary research, formatting, and content corrections.

These are the major components after the extract_wiktionary_items.py refactoring split:

  1. page_element_processor.py - reads the dump <page> by <page>

  2. text_pre_processors.py - there are 4 pre-processors to fix the most common formatting errors:

    1. pp0: converts Armenian POS headers to wiki markup: {{-hy-<pos>-}}

    2. pp1: normalizes placement of the 'Ստուգաբանություն' (etymology) header.

    3. pp2: normalizes POS description regions to use '# ' bullets.

    4. pp3: structural handling of -hy-բաց- POS sections at the text level

  3. text_processor.py - Standard processing of a normalized <text> element that conforms as much as possible to the structure in text-element-structure.txt.

  4. description_processor.py - Builds final dictionary entry descriptions from the POS description regions extracted from a Wiktionary <text> element.


  5. pos-map.txt - converts Armenian POS template tags to universally accepted tags [Հայ2022::228]

  6. pos-weights.txt - POS weights for selecting a single tag for multi-POS words

  7. name-pos.txt - map for POS name conversion to {{-hy-<pos>-}} template format


  8. text-element-structure.txt - the most common wiki-markup-governed structure of the Wiktionary XML <text> element (most common does not mean recommended or "correct").


Running the refactored script yields these results:

./page_element_processor.py --skip skip.txt --filter հնց,ժղ,հպվ,Բիլիարդային,բրբռ,գվռ --single y --desc-trigger տե՛ս,տե´ս ~/Downloads/hywiktionary-20260301-pages-articles-multistream.xml > armenian_wikt.csv

Processed titles: 313943

Extracted entries: 91947

Skipped/filtered entries: 41524

Titles with non-Armenian characters: 99774

Titles ignored because of redirects: 85


This is a description of all page_element_processor.py command-line arguments (optional arguments are in square brackets):


[-h] - print help

[--min MIN] - minimal word length

[--max MAX]  - maximal word length

[--skip SKIP] - SKIP is a text file with comma separated words to exclude from the result

[--filter FILTER] - FILTER is a csv list of tags that trigger the removal of a definition from the word description. It is used to remove definitions that are marked as dialect (բրբռ), archaic (հնց), etc.

[--desc-trigger DESC_TRIGGER] - DESC_TRIGGER is a csv list of words that trigger ignoring a definition. The difference from --filter is that it removes all matching definitions but one.

[--single SINGLE] - SINGLE is a y/n indicator to leave a single POS description for a word (see #6 above).

[--nopp NOPP] - NOPP is a csv list of pre-processors to skip (not execute)

[--debug DEBUG] - DEBUG is a Wiktionary word for which to print processing info.

xml_path - the qualified name of the Wiktionary dump.


The upgraded version of combine-csv.py now takes --boost, --weights, and --w arguments. The first is for boosting the ranking (count) of some words. The other two are a mechanism for taking the list of POS types out of the script - getting rid of hardcoded values.


./combine-csv.py --boost boost-count.txt --weights wiktionary/pos-weights.txt --w 6 wiktionary/armenian_wikt.csv wiki/armenian-5-7.csv > hy-combined.csv 

ALLOWED_TAGS: ['ADJ', 'ADP', 'ADV', 'NOML', 'NOUN', 'NUM', 'PRON', 'VERB']

Done. Recorded 87488 words.  


This is the description of the arguments and the printout:


--weights - sets the tag weights according to pos-weights.txt file (see bullet-item #6 above)

--w N - the script combines only words whose assigned PoS tag has a weight in pos-weights.txt greater than or equal to N. It prints the set of tags that meet this criterion as ALLOWED_TAGS. After that, the script prints the number of words that were written to the resulting hy-combined.csv file.

[--augment augment.csv]

[--boost boost.txt] - takes the words listed in the csv boost.txt file and appends 0s to each word's count to make a 5-digit count from the original

word-tags.csv - a file with the Wiktionary words with tag and description

word-count.csv - a file with the Wikipedia words and counts (how many times each word appears in Wikipedia articles)
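The two new mechanisms can be sketched as follows, assuming the five-digit zero-padding interpretation of --boost described above (the function names are mine):

```python
# Sketch of --weights/--w: keep only tags whose weight meets the threshold.
def allowed_tags(weights: dict[str, int], w: int) -> set[str]:
    return {tag for tag, weight in weights.items() if weight >= w}

# Sketch of --boost: pad a boosted word's count with trailing zeros until it
# has five digits; counts already at five or more digits are left unchanged.
def boost_count(count: int) -> int:
    s = str(count)
    return int(s + "0" * (5 - len(s))) if len(s) < 5 else count
```
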


Then the dictionaries and answers are created:

./create_dictionaries.py --count 4 hy-combined.csv 

dictionary-5.txt: 3849 words, count range 4..413474

dictionary-6.txt: 4418 words, count range 4..276040

dictionary-7.txt: 5384 words, count range 4..194463


./create_answers.py --skip 5 --keep 60 --ignore wiktionary/problematic-answers.txt dictionary-5.txt

answers-5.txt: 2154 entries, count range 15..3158

./create_answers.py --skip 5 --keep 50 --ignore wiktionary/problematic-answers.txt dictionary-6.txt

answers-6.txt: 2023 entries, count range 22..2614

 ./create_answers.py --skip 5 --keep 50 --ignore wiktionary/problematic-answers.txt dictionary-7.txt

answers-7.txt: 2617 entries, count range 20..2126


These results are still for the March dump. I fixed formatting (mostly normalized PoS tags) in Wiktionary, which will add about two dozen words to the dictionaries. The next Wiktionary dump is on April 1.

