PolyWordlot Data Preparation

Updated: 20 hours ago



For developers: GitHub google-1gram.


Overview

The goal is to build a word frequency dictionary for Eastern Armenian in .csv format:

word, count, definition (meaning)

We will build this dictionary using Wikipedia articles.

In the game we will use nouns in the singular nominative case (excluding proper nouns, gerunds, and infinitives) and some pronouns.


Downloading

Two data dumps were downloaded from Wikimedia Downloads:

  1. hywiki-20260301-pages-articles-multistream.xml.bz2 (583.2 MB), the compressed version of the 3.79 GB hywiki-20260301-pages-articles-multistream.xml, which holds the content of the 324 668 Wikipedia articles.

  2. hywiktionary-20260301-pages-articles-multistream.xml.bz2 (48 MB), the compressed version of the 494.4 MB hywiktionary-20260301-pages-articles-multistream.xml, which holds the 304 879 Wiktionary words.

We will refer to these documents as wiki.xml and wiktionary.xml.

Before processing the dictionary words, all scripts replace 'և' with 'եւ'.
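A minimal sketch of this normalization, assuming it is a plain string replacement (the function and constant names are hypothetical, not taken from the project scripts):

```python
# U+0587 is the single-code-point ligature 'և'; the replacement is the
# two-letter sequence 'ե' (U+0565) followed by 'ւ' (U+0582).
LIGATURE_EW = "\u0587"
EXPANDED_EW = "\u0565\u0582"


def normalize_ew(text: str) -> str:
    """Replace every 'և' ligature with the two-letter sequence 'եւ'."""
    return text.replace(LIGATURE_EW, EXPANDED_EW)


word = "արևելք"
normalized = normalize_ew(word)
print(word, len(word))
print(normalized, len(normalized))
```

Each replaced ligature makes the word one code point longer, which matters for any length limit applied afterwards.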


Building word frequency dictionary

After running (this and other scripts are in the GitHub google-1gramr project)

./extract_hywiki_words.py --max 9 hywiki.xml > armenian-9.csv

we get 1 157 632 word forms of 4 to 9 letters. Words that appear only once make up more than half of the total (582338, or 50.3%), and the majority of them are garbage. Some, like ֆօքսթրօթ, can be understood despite 4 typos in an 8-letter word, but most are hard even to guess. The situation is not much better for words with frequencies of 2, 3, and up to a couple of dozen.
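The counting step could be sketched as follows. This is not the project's extract_hywiki_words.py: the function names, the regular expression, and the default length limits are assumptions made for illustration, and parsing the XML dump is left out.

```python
import re
from collections import Counter

# Runs of Armenian capital (U+0531..U+0556) and small (U+0561..U+0586) letters.
ARMENIAN_WORD = re.compile(r"[\u0531-\u0556\u0561-\u0586]+")


def count_words(text, min_len=4, max_len=9):
    """Count lower-cased Armenian word forms whose length is within the limits."""
    text = text.replace("\u0587", "\u0565\u0582")  # 'և' -> 'եւ'
    words = (w.lower() for w in ARMENIAN_WORD.findall(text))
    return Counter(w for w in words if min_len <= len(w) <= max_len)


def singleton_share(counts):
    """Fraction of distinct word forms that occur exactly once."""
    if not counts:
        return 0.0
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)


counts = count_words("Հայերեն բառեր հայերեն բառարան")
for word, count in counts.most_common():
    print(f"{word},{count}")
print(f"share of words seen once: {singleton_share(counts):.1%}")
```

With the figures reported above, the same ratio is 582338 / 1157632, or about 50.3%.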

The upper limit of 9 is chosen because Armenian words tend to be longer. Another reason is that the letter 'ու', which is perceived as a single symbol, is a combination of 2 Unicode characters.
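The effect of the 'ու' digraph on word length can be illustrated with a short sketch. It assumes that lengths are measured in Unicode code points; the helper names are hypothetical.

```python
# 'ու' is written as 'ո' (U+0578) followed by 'ւ' (U+0582).
U_DIGRAPH = "\u0578\u0582"


def code_point_length(word: str) -> int:
    """Length as Python sees it: one unit per Unicode code point."""
    return len(word)


def letter_length(word: str) -> int:
    """Length as a reader sees it: each 'ու' digraph counts as one letter."""
    return len(word) - word.count(U_DIGRAPH)


word = "ուսուցիչ"
print(word, code_point_length(word), letter_length(word))
```

The word ուսուցիչ has 6 letters for a reader but 8 code points, so a limit expressed in code points has to be somewhat higher than the intended number of letters.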

Since the current version of PolyWordlot uses 5- to 7-character words, after the initial cleaning I extracted words in that length range:

./extract_hywiki_words.py --min 5 --max 7 ~/Downloads/hywiki-20260301-pages-articles-multistream.xml > armenian-5-7.csv

The armenian-5-7.csv file contains 578562 word forms (47.57% are words that appear only once in the Wikipedia content).

Using

split-list.bash armenian-9.csv 12

we get 2 files: armenian-9-more-12.csv and armenian-9-less-12.csv. The first contains 150886 words that appear 12 or more times in the Wikipedia articles. This is a reasonable amount to work with. The prediction is that roughly half of them are proper nouns and that every third word of the remaining half is a dictionary form (lemma). The expectation is that 20-25 thousand words can be used for the game.
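The split and the estimate can be sketched in Python. This is an illustration, not a port of split-list.bash; the function names and the (word, count) row layout are assumptions.

```python
def split_by_count(rows, threshold):
    """Split (word, count) rows into those at or above the threshold and the rest."""
    more = [(w, c) for w, c in rows if c >= threshold]
    less = [(w, c) for w, c in rows if c < threshold]
    return more, less


def estimate_usable(total, proper_noun_share=0.5, lemma_share=1 / 3):
    """Rough number of usable lemmas under the stated prediction."""
    return round(total * (1 - proper_noun_share) * lemma_share)


rows = [("տուն", 120), ("գիրք", 12), ("քաղաք", 11)]
more, less = split_by_count(rows, 12)
print(len(more), len(less))
print(estimate_usable(150886))
```

Applied to the 150886 words, the prediction gives roughly 25 thousand candidates, the upper end of the expected range.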

The armenian-7-more-12.csv contains 81309 word forms.


Building word types and definitions dictionary

The base for this dictionary is the .xml dump from the Armenian Wiktionary (#2 in the Downloading section). A Python script was developed to extract 'word', 'PoS', and 'definition' from wiktionary.xml:

./extract_wiktionary_items.py wiktionary.xml --skip skip.txt --filter հնց,ժղ,հպվ,Բիլիարդային,բրբռ,գվռ,հին > armenian_wikt.csv

This script creates a .csv file of records extracted from Wiktionary (by default 4 to 9 characters long). It filters out the words listed in the skip.txt file, as well as words marked as archaic (հնց), colloquial (ժղ), abbreviations (հպվ), billiard terms (Բիլիարդային), dialect (բրբռ), unsophisticated (գվռ), or old (հին). Since PolyWordlot is a family-friendly game, the skip.txt file is used to remove offensive words and adult terminology.

Of the original 304 879 words, only 94942 made the cut. The script's own report gives a more detailed breakdown:


./extract_wiktionary_items.py --skip skip.txt --filter հնց,ժղ,հպվ,Բիլիարդային,բրբռ,գվռ,հին hywiktionary-20260301-pages-articles-multistream.xml > armenian_wikt.csv


Processed titles: 313943

Extracted entries: 94942

Skipped/filtered entries: 119142

Titles with non-Armenian characters: 99774

Titles ignored because of redirects: 85
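As a quick sanity check, the four categories in the report add up exactly to the number of processed titles, and that total exceeds the 304 879 words cited earlier by 9064. The variable names below are illustrative only.

```python
# Figures copied from the report above.
report = {
    "extracted": 94942,
    "skipped_or_filtered": 119142,
    "non_armenian_titles": 99774,
    "redirects": 85,
}
processed_titles = 313943
words_cited_for_wiktionary = 304879

# The categories partition the processed titles.
assert sum(report.values()) == processed_titles

extra_titles = processed_titles - words_cited_for_wiktionary
print(extra_titles)  # 9064
```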


I have no explanation for the discrepancy with the number of words reported by Wiktionary on 2026-03-01: about 9000 extra titles were processed (see Processed titles). Entries are skipped for several reasons, but the main one is the inability to find a tag listed in the map below (for the Latin abbreviations, see Tables Ա1.1 and Ա1.2 in On Syntactic Structure Representation, pp. 183-4). The map is kept in a separate file, pos-map.txt, rather than in the script itself, to make the script universal and applicable to other languages:

-hy-ած-  ADJ

-hy-մակ-  ADV

-hy-գո-   NOUN

-hy-բայ-  VERB

-hy-դեր-  PRON

-hy-թվ-   NUM

-hy-կապ-  ADP

-hy-շաղ-  CONJ

-hy-ձա-   INTJ

-hy-եղբ-  MOD

-hy-հատ-  PLC

-hy-հպվ-  ABBR
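One way such a map could be loaded and applied is sketched below. The two-column layout follows the listing above, but the lookup itself is an assumption: it supposes that the marker appears literally in the page wikitext (for example inside a template such as {{-hy-գո-}}) and that the first marker found decides the tag. The function names are hypothetical.

```python
def load_pos_map(lines):
    """Parse 'marker  TAG' lines into a dict, ignoring blank or malformed lines."""
    pos_map = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            marker, tag = parts
            pos_map[marker] = tag
    return pos_map


def find_pos(wikitext, pos_map):
    """Return the tag of the marker that occurs earliest in the text, or None."""
    hits = [(wikitext.find(marker), tag)
            for marker, tag in pos_map.items() if marker in wikitext]
    return min(hits)[1] if hits else None


pos_map = load_pos_map(["-hy-գո-   NOUN", "", "-hy-բայ-  VERB"])
print(find_pos("{{-hy-գո-}} sample definition text", pos_map))
print(find_pos("a page without any marker", pos_map))
```

Keeping the map in a data file means that supporting another language only requires a different pos-map.txt, not a change to the code.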


The skip.txt file currently contains about 176 words.

The quality of the Armenian Wiktionary content is much higher than that of the Armenian Wikipedia articles, despite some errors, mostly in markup and questionable PoS assignments. However, a detailed investigation is needed for both.


Combining word types and counts

The end result of the data processing is a csv file of this format:

word, count, tag, definition

To generate the file that goes into the application you run:

./combine_csv.py --augment augment-records.csv word-tags.csv word-count.csv > combined.csv

It merges the data from the previously obtained files, keeping only these types of words: NOUN, PRON, VERB, NUM, MOD, ADP. It adds 1 to the original count, so a word that is not in wiki.xml gets a count of 1. You can add more words using the optional augment-records.csv, which has the same format as word-tags.csv.
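The merge rule can be sketched as follows. This is an illustration of the description above, not the project's script: the function name and the in-memory record layout, (word, tag, definition) tuples plus a word-to-count dict, are assumptions.

```python
ALLOWED_TAGS = {"NOUN", "PRON", "VERB", "NUM", "MOD", "ADP"}


def combine(tagged, counts, augment=()):
    """Join tagged words with their counts, keeping only the allowed word types.

    Every count is increased by 1, so a word missing from the
    frequency list ends up with a count of 1.
    """
    combined = []
    for word, tag, definition in list(tagged) + list(augment):
        if tag not in ALLOWED_TAGS:
            continue
        combined.append((word, counts.get(word, 0) + 1, tag, definition))
    return combined


tagged = [("տուն", "NOUN", "house"), ("կարմիր", "ADJ", "red"), ("գնալ", "VERB", "to go")]
counts = {"տուն": 10, "կարմիր": 7}
for record in combine(tagged, counts):
    print(record)
```

In this example the adjective is dropped, տուն gets a count of 11, and գնալ, which is absent from the counts, gets 1.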

If you run:

./combine_csv.py wiktionary/armenian_wikt.csv wiki/armenian-5-7.csv > hy-combined.csv

you get the resulting file of 51387 words.


Data formatting for application

The final step in data preparation is splitting hy-combined.csv into two types of files per word length: dictionary-N.txt and answers-N.txt, where N in the file names is the word length.

There are 2 scripts for that purpose:

./create_dictionaries.py --count <f> hy-combined.csv

./create_answers.py --skip <%> --keep <%> dictionary-N.txt

The first script splits hy-combined.csv into word-length-specific dictionary-N.txt files, skipping the words whose frequency (count) is less than f.

The second extracts the "keep" percentage of the total number of words in the dictionary-N.txt file, after skipping the "skip" percentage of the highest-frequency (count) words.
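Both steps can be sketched in a few lines. The function names, the descending sort by count, and rounding the percentages down are assumptions rather than details taken from the scripts, although rounding down does reproduce the answer counts reported for this data set.

```python
def build_dictionaries(records, min_count):
    """Group (word, count) records by word length, dropping rare words."""
    by_length = {}
    for word, count in records:
        if count >= min_count:
            by_length.setdefault(len(word), []).append((word, count))
    return by_length


def select_answers(entries, skip_pct, keep_pct):
    """Skip the top skip_pct% most frequent entries, then keep the next keep_pct%.

    Both percentages are taken of the full list and rounded down.
    """
    ranked = sorted(entries, key=lambda e: e[1], reverse=True)
    skip = len(ranked) * skip_pct // 100
    keep = len(ranked) * keep_pct // 100
    return ranked[skip:skip + keep]


# Synthetic example: 100 entries with counts 1..100.
entries = [(f"w{i}", i) for i in range(1, 101)]
answers = select_answers(entries, skip_pct=5, keep_pct=60)
print(len(answers), answers[0][1], answers[-1][1])
```

Under these assumptions the answers come from the middle of the frequency range: the most frequent words are skipped, and the rarest ones fall outside the kept share.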

The results for this data set are:

dictionary-5.txt: 2784 words, count range 4..265320

dictionary-6.txt: 3251 words, count range 4..276040

dictionary-7.txt: 3584 words, count range 4..192989


./create_answers.py --skip 5 --keep 60 dictionary-5.txt

answers-5.txt: 1670 entries, count range 18..2938

./create_answers.py --skip 5 --keep 50 dictionary-6.txt

answers-6.txt: 1625 entries, count range 24..2257

./create_answers.py --skip 5 --keep 50 dictionary-7.txt

answers-7.txt: 1792 entries, count range 24..1844


The answer words to be guessed are a subset of all (dictionary) words of a given length. With the settings above, 50-60% of the words of each length are selected, leaving out the most frequent 5%, to make guessing harder.

