PolyWordlot Data Preparation
- aramhayr
- 4 days ago
Updated: 20 hours ago
PolyWordlot Login
For developers: the GitHub project google-1gram.
Overview
The goal is to build a word frequency dictionary for Eastern Armenian in .csv format:
word, count, definition (meaning)
We will build this dictionary from Wikipedia articles.
In the game we will use nouns in the singular nominative case (excluding proper nouns), gerunds, infinitives, and some pronouns.
Downloading
Two data dumps were downloaded from Wikimedia Downloads:
1. hywiki-20260301-pages-articles-multistream.xml.bz2 - 583.2 MB, the compressed version of the 3.79 GB hywiki-20260301-pages-articles-multistream.xml, which holds the content of the 324 668 Wikipedia articles.
2. hywiktionary-20260301-pages-articles-multistream.xml.bz2 - 48 MB, the compressed version of the 494.4 MB hywiktionary-20260301-pages-articles-multistream.xml, which holds the 304 879 Wiktionary words.
We will refer to these documents as wiki.xml and wiktionary.xml.
Before processing the dictionary words, all scripts replace the ligature 'և' with the two-letter sequence 'եւ'.
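The replacement itself is a one-liner. Here is a minimal sketch of it; the function name normalize_ew is hypothetical and the project scripts may do this differently:

```python
# Hypothetical helper for illustration; not taken from the project scripts.
LIGATURE_EW = "\u0587"        # 'և' ARMENIAN SMALL LIGATURE ECH YIWN
SEQUENCE_EW = "\u0565\u0582"  # 'եւ' = ARMENIAN SMALL LETTER ECH + YIWN


def normalize_ew(text: str) -> str:
    """Replace every 'և' ligature with the two-letter sequence 'եւ'."""
    return text.replace(LIGATURE_EW, SEQUENCE_EW)


# 'արև' (3 code points) becomes 'արեւ' (4 code points).
print(normalize_ew("\u0561\u0580\u0587"))
```

Note that the replacement makes a word one code point longer for each ligature, which matters when words are selected by length.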
Building word frequency dictionary
After running the command below (this and the other scripts are in the GitHub google-1gram project)
./extract_hywiki_words.py --max 9 hywiki.xml > armenian-9.csv
we get 1 157 632 word forms that are 4 to 9 letters long. Words that appear only once make up more than half of the total (582 338, or 50.3%), and most of them are garbage. Some words, like ֆօքսթրօթ, can be understood despite 4 typos in an 8-letter word, but most are very hard even to guess. The situation is not much better for words with frequencies of 2, 3, and up to a couple of dozen.
The upper limit of 9 is chosen because Armenian words tend to be longer. Another reason is that the letter 'ու', which is perceived as a single symbol, is a combination of 2 Unicode characters.
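To illustrate the 'ու' effect, here is a small sketch that compares the number of Unicode characters in a word with the number of perceived letters. The helper letter_count is hypothetical, handles lowercase only, and is not necessarily how the scripts measure length:

```python
# Hypothetical helper: counts the digraph 'ու' (U+0578 U+0582) as one letter.
DIGRAPH_U = "\u0578\u0582"


def letter_count(word: str) -> int:
    """Number of perceived letters when each 'ու' counts as a single letter."""
    return len(word) - word.count(DIGRAPH_U)


word = "\u0578\u0582\u057d\u0578\u0582\u0574"  # 'ուսում'
print(len(word), letter_count(word))  # 6 4
```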
Since the current version of PolyWordlot uses 5- to 7-character words, after the initial cleaning I extracted the words in that length range:
./extract_hywiki_words.py --min 5 --max 7 ~/Downloads/hywiki-20260301-pages-articles-multistream.xml > armenian-5-7.csv
The armenian-5-7.csv file contains 578562 word forms; 47.57% of them appear only once in the Wikipedia content.
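The counting step and the share of words that appear once can be sketched as follows. This is not the code of extract_hywiki_words.py; it assumes that a word form is a run of lowercase Armenian letters (U+0561 to U+0587) and that length is measured in Unicode characters. The function names are hypothetical:

```python
import re
from collections import Counter

# Assumption: a word form is a run of lowercase Armenian letters.
ARMENIAN_WORD = re.compile(r"[\u0561-\u0587]+")


def count_word_forms(text, min_len=5, max_len=7):
    """Count Armenian word forms whose length is within [min_len, max_len]."""
    words = ARMENIAN_WORD.findall(text.lower())
    return Counter(w for w in words if min_len <= len(w) <= max_len)


def singleton_share(counts):
    """Percentage of distinct word forms that occur exactly once."""
    if not counts:
        return 0.0
    singles = sum(1 for c in counts.values() if c == 1)
    return 100.0 * singles / len(counts)


counts = count_word_forms("քաղաք Քաղաք գիրքը")
print(counts["քաղաք"], singleton_share(counts))  # 2 50.0
```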
Using
split-list.bash armenian-9.csv 12
we get 2 files: armenian-9-more-12.csv and armenian-9-less-12.csv. The first contains 150886 words that appear 12 or more times in the Wikipedia articles. This is a reasonable amount to work with. My estimate is that roughly half of them are proper nouns and that every third word of the remaining half is a dictionary form (lemma). The expectation is that 20-25 thousand words can be used for the game.
The armenian-7-more-12.csv file contains 81309 word forms.
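split-list.bash is a shell script; the same threshold split can be sketched in Python. The sketch assumes that each row of the frequency file is a word,count pair, which may not match the real file layout, and the function name split_by_count is hypothetical:

```python
import csv
import io


def split_by_count(rows, threshold):
    """Split (word, count) rows into count >= threshold and count < threshold."""
    more, less = [], []
    for word, count in rows:
        target = more if int(count) >= threshold else less
        target.append((word, int(count)))
    return more, less


# Made-up sample data in the assumed word,count layout.
sample = io.StringIO("քաղաք,120\nգիրքը,12\nֆօքսթրօթ,1\n")
more, less = split_by_count(csv.reader(sample), 12)
print(len(more), len(less))  # 2 1
```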
Building word types and definitions dictionary
The base for this dictionary is the .xml dump from the Armenian Wiktionary (#2 in the Downloading section). A Python script extracts 'word', 'PoS', and 'Definition' from wiktionary.xml:
./extract_wiktionary_items.py wiktionary.xml --skip skip.txt --filter հնց,ժղ,հպվ,Բիլիարդային,բրբռ,գվռ,հին > armenian_wikt.csv
This script creates a .csv file of records extracted from Wiktionary (by default for words 4 to 9 characters long). It filters out the words listed in the skip.txt file and the words labeled as archaic (հնց), colloquial (ժղ), abbreviations (հպվ), billiard terms (Բիլիարդային), dialect (բրբռ), unsophisticated (գվռ), or old (հին). Since PolyWordlot is a family-friendly game, the skip.txt file is used to remove offensive words and adult terminology.
Of the original 304 879 words, only 94942 made the cut. Here is a more detailed report:
./extract_wiktionary_items.py --skip skip.txt --filter հնց,ժղ,հպվ,Բիլիարդային,բրբռ,գվռ,հին hywiktionary-20260301-pages-articles-multistream.xml > armenian_wikt.csv
Processed titles: 313943
Extracted entries: 94942
Skipped/filtered entries: 119142
Titles with non-Armenian characters: 99774
Titles ignored because of redirects: 85
I have no explanation for the discrepancy between the number of words reported by Wiktionary on 2026-03-01 (304 879) and the number of processed titles (313943): about 9000 extra titles were processed. Entries are skipped for several reasons, but the main one is the failure to find a tag listed in the map below (see Tables Ա1.1 and Ա1.2 in On Syntactic Structure Representation, pp. 183-4, for the Latin abbreviations). The map is kept in a file separate from the script, pos-map.txt, to make the script universal and applicable to other languages:
-hy-ած- ADJ
-hy-մակ- ADV
-hy-գո- NOUN
-hy-բայ- VERB
-hy-դեր- PRON
-hy-թվ- NUM
-hy-կապ- ADP
-hy-շաղ- CONJ
-hy-ձա- INTJ
-hy-եղբ- MOD
-hy-հատ- PLC
-hy-հպվ- ABBR
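Reading such a map and using it to tag an entry could look like the sketch below. The way the keys appear in the page text (here as {{-hy-գո-}}) is my assumption for the example, and the function names are hypothetical; the real script may match tags differently:

```python
def load_pos_map(lines):
    """Parse 'key TAG' pairs, one per line, into a dict; skip malformed lines."""
    pos_map = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            key, tag = parts
            pos_map[key] = tag
    return pos_map


def find_tag(wikitext, pos_map):
    """Return the tag of the mapped key that occurs first in wikitext, or None."""
    hits = [(wikitext.find(key), tag)
            for key, tag in pos_map.items() if key in wikitext]
    return min(hits)[1] if hits else None


pos_map = load_pos_map(["-hy-գո- NOUN", "-hy-բայ- VERB"])
print(find_tag("{{-hy-գո-}} some definition", pos_map))  # NOUN
print(find_tag("no known key here", pos_map))            # None
```

An entry for which find_tag returns None corresponds to the main skipping reason described above.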
The skip.txt file currently contains about 176 words.
The quality of the Armenian Wiktionary content is much higher than that of the Armenian Wikipedia articles, despite some errors, mostly in markup and in questionable PoS assignments. However, a detailed investigation is needed for both.
Combining word types and counts
The end result of the data processing is a csv file of this format:
word, count, tag, definition
To generate the file that goes into the application you run:
./combine_csv.py --augment augment-records.csv word-tags.csv word-count.csv > combined.csv
It merges the data from the previously obtained files only for these types of words: NOUN, PRON, VERB, NUM, MOD, ADP. It adds 1 to the original count, so a word that is not in wiki.xml gets a count of 1. You can add more words using the optional augment-records.csv file, which has the same format as word-tags.csv.
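The merge rule can be sketched like this. It is an illustration, not the code of the combine script: the record layout (word, tag, definition) and the handling of repeated words (a later record replaces an earlier one) are my assumptions, and the function name combine is hypothetical:

```python
ALLOWED_TAGS = {"NOUN", "PRON", "VERB", "NUM", "MOD", "ADP"}


def combine(tagged, counts, augment=()):
    """Merge (word, tag, definition) records with a word -> count mapping.

    Returns (word, count, tag, definition) tuples for the allowed tags only.
    Every count is increased by 1, so a word missing from counts gets 1.
    """
    combined = {}
    for word, tag, definition in list(tagged) + list(augment):
        if tag in ALLOWED_TAGS:
            combined[word] = (counts.get(word, 0) + 1, tag, definition)
    return [(w, c, t, d) for w, (c, t, d) in combined.items()]


# Made-up sample records.
tagged = [("տուն", "NOUN", "house"), ("լավ", "ADJ", "good")]
extra = [("երեք", "NUM", "three")]
for record in combine(tagged, {"տուն": 10}, extra):
    print(record)
```

In this example the ADJ record is dropped, the noun gets count 11, and the augmented numeral, which has no Wikipedia count, gets count 1.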
If you run:
./combine_csv.py wiktionary/armenian_wikt.csv wiki/armenian-5-7.csv > hy-combined.csv
you get a resulting file of 51387 words.
Data formatting for application
The final step in data preparation is splitting hy-combined.csv into two types of files per word length: dictionary-N.txt and answers-N.txt, where N in the file names is the word length.
Two scripts serve that purpose:
./create_dictionaries.py --count <f> hy-combined.csv
./create_answers.py --skip <%> --keep <%> dictionary-N.txt
The first script splits hy-combined.csv into word-length-specific dictionary-N.txt files, skipping the words whose frequency (count) is less than f.
The second script extracts the keep percentage of the overall number of words in the dictionary-N.txt file after skipping the skip percentage of the highest-frequency (count) words.
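The first step, grouping by length with a minimum count, can be sketched as follows. The sketch assumes (word, count) pairs as input and measures length in Unicode characters; the function name split_by_length is hypothetical and the real create_dictionaries.py may differ:

```python
from collections import defaultdict


def split_by_length(records, min_count):
    """Group (word, count) records by word length, dropping count < min_count."""
    by_length = defaultdict(list)
    for word, count in records:
        if count >= min_count:
            by_length[len(word)].append((word, count))
    return dict(by_length)


# Made-up sample records.
records = [("քաղաք", 120), ("գիրքը", 3), ("մարդիկ", 40)]
groups = split_by_length(records, 4)
print(sorted(groups))  # [5, 6]
```

Each group would then be written to its own dictionary-N.txt file.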
The results for this data set are:
dictionary-5.txt: 2784 words, count range 4..265320
dictionary-6.txt: 3251 words, count range 4..276040
dictionary-7.txt: 3584 words, count range 4..192989
The answer files are then generated from these dictionaries:
./create_answers.py --skip 5 --keep 60 dictionary-5.txt
answers-5.txt: 1670 entries, count range 18..2938
./create_answers.py --skip 5 --keep 50 dictionary-6.txt
answers-6.txt: 1625 entries, count range 24..2257
./create_answers.py --skip 5 --keep 50 dictionary-7.txt
answers-7.txt: 1792 entries, count range 24..1844
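The list of answer words to be guessed is a subset of all the dictionary words of a given length. For each length, the 5% most frequent words are skipped and the next 50-60% of the dictionary is selected, so the answers come from the medium-frequency range, which makes guessing harder.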
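The selection rule can be sketched as follows. This is not the code of create_answers.py; the function name select_answers is hypothetical, and rounding down is my assumption, although it does reproduce the entry counts listed above (for example, 60% of 2784 gives 1670):

```python
def select_answers(entries, skip_pct, keep_pct):
    """Select answer words from (word, count) entries.

    Sort by count in descending order, skip the top skip_pct percent,
    then keep keep_pct percent of the total number of entries.
    Percentages are whole numbers; results are rounded down.
    """
    ranked = sorted(entries, key=lambda e: e[1], reverse=True)
    n_skip = len(ranked) * skip_pct // 100
    n_keep = len(ranked) * keep_pct // 100
    return ranked[n_skip:n_skip + n_keep]


# Synthetic dictionary of 2784 entries, the size of dictionary-5.txt.
entries = [(f"w{i}", i) for i in range(1, 2785)]
print(len(select_answers(entries, 5, 60)))  # 1670
```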
