
PolyWordlot

Updated: 2 days ago


For developers: GitHub google-1gram.

Overview

This page describes the process of data gathering and cleansing for the Armenian version of the PolyWordlot word game, which is an improved, multilingual version of Wordle.

The Eastern Armenian Wikipedia has been selected as a corpus of Armenian word forms.


Data Preparation

Overview

The goal is to build a word frequency dictionary for Eastern Armenian in .csv format:

word, count, definition (meaning)

We will build this dictionary using Wikipedia articles.

The game will use singular nominative-case nouns (excluding proper nouns), gerunds, infinitives, and some pronouns.


Downloading

Two data dumps were downloaded from Wikimedia Downloads:

  1. hywiki-20260301-pages-articles-multistream.xml.bz2 (583.2 MB), the compressed form of the 3.79 GB hywiki-20260301-pages-articles-multistream.xml, which holds the content of 324 668 Wikipedia articles.

  2. hywiktionary-20260301-pages-articles-multistream.xml.bz2 (48 MB), the compressed form of the 494.4 MB hywiktionary-20260301-pages-articles-multistream.xml, which holds 304 879 Wiktionary words.

We will refer to these documents as wiki.xml and wiktionary.xml.

Before processing the dictionary words, all scripts replace the ligature 'և' with the two-letter sequence 'եւ'.
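The replacement can be sketched as follows. This is a minimal illustration, not the project's actual code; the helper name normalize_yev is hypothetical.

```python
# Hypothetical helper: the name normalize_yev is not taken from the project.
LIGATURE_YEV = "\u0587"        # 'և', ARMENIAN SMALL LIGATURE ECH YIWN
EXPANDED_YEV = "\u0565\u0582"  # 'եւ', the letters ech + yiwn


def normalize_yev(text: str) -> str:
    """Replace every 'և' ligature with the two letters 'եւ'."""
    return text.replace(LIGATURE_YEV, EXPANDED_YEV)


print(normalize_yev("արև"))
```

After this step a word contains only letters that are present on the game keyboard, at the cost of one extra character per ligature.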


Building word frequency dictionary

After running (this and other scripts are in the GitHub google-1gram project)

extract_hywiki_words.py --max 9 wiki.xml > armenian-9.csv

we get 1 157 632 word forms that are 4 to 9 letters long. Words that appear only once account for 582 338 of them (50.3%), and most of these are garbage. Some, such as ֆօքսթրօթ, can be understood despite 4 typos in an 8-letter word, but most are hard even to guess. The situation is not much better for words with frequencies from 2 up to a couple of dozen.

The upper limit of 9 is chosen because Armenian words tend to be long. Another reason is that the letter 'ու', which is perceived as a single symbol, is a combination of 2 Unicode characters.
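The effect of 'ու' on word length can be illustrated with a short sketch. It assumes that the scripts measure length in Unicode code points (the page does not say how they count); perceived_length is a hypothetical helper, not part of the project.

```python
DIGRAPH_U = "\u0578\u0582"  # 'ու' = vo (ո) + yiwn (ւ), two code points


def perceived_length(word: str) -> int:
    """Count letters as a reader sees them: each 'ու' counts once."""
    return len(word) - word.count(DIGRAPH_U)


word = "ուսուցիչ"  # "teacher", contains 'ու' twice
print(len(word), perceived_length(word))
```

A word with two 'ու' digraphs therefore uses two more code points than the number of letters a player sees, which is one reason to allow up to 9 characters.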

Using

split-list.bash armenian-9.csv 12

we get 2 files: armenian-9-more-12.csv and armenian-9-less-12.csv. The first contains 150 886 words that appear 12 or more times in the Wikipedia articles, which is a reasonable amount to work with. The prediction is that roughly half of them are proper nouns and that every third word of the remaining half is a dictionary form (lemma), which gives about 25 thousand lemmas (150 886 / 2 / 3). The expectation is therefore that 20-25 thousand words can be used for the game.
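The split itself is done by split-list.bash; the same idea in Python might look like the sketch below. It assumes that armenian-9.csv holds word,count rows, and split_by_count is a hypothetical name used only for illustration.

```python
import csv
import io


def split_by_count(rows, threshold):
    """Split (word, count) rows into (count >= threshold, count < threshold)."""
    more, less = [], []
    for word, count in rows:
        target = more if int(count) >= threshold else less
        target.append((word, int(count)))
    return more, less


# Tiny in-memory stand-in for armenian-9.csv (made-up counts).
sample = io.StringIO("տուն,120\nգիրք,12\nֆօքսթրօթ,1\n")
more, less = split_by_count(csv.reader(sample), 12)
print(len(more), len(less))
```

A word whose count equals the threshold lands in the "more" list, in line with "12 or more times" above.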


Building word types and definitions dictionary

The base for this dictionary is the .xml dump from the Armenian Wiktionary, #2 in the Downloading section. A Python script was developed to extract 'word', 'PoS', and 'Definition' from wiktionary.xml:

./extract_wiktionary_items.py wiktionary.xml --skip skip.txt --filter հնց,ժղ,հպվ,Բիլիարդային,բրբռ,գվռ > armenian_wikt.csv

This script creates a .csv file of records extracted from Wiktionary (by default, words 4 to 9 characters long). It filters out the words listed in the skip.txt file and words that are archaic (հնց), colloquial (ժղ), abbreviations (հպվ), billiard terms (Բիլիարդային), dialectal (բրբռ), or unsophisticated (գվռ). Since PolyWordlot is a family-friendly game, the skip.txt file is used to remove offensive words and adult terminology.
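The filtering rules can be sketched roughly as below. This is only an illustration under assumptions: it supposes that the usage labels of each entry are already available as a set of strings, which is not how the real script necessarily works, and keep_record is a hypothetical name.

```python
# Labels passed via --filter in the command above.
FILTER_LABELS = {"հնց", "ժղ", "հպվ", "Բիլիարդային", "բրբռ", "գվռ"}


def keep_record(word, labels, skip_words, min_len=4, max_len=9):
    """Return True if the word passes the length, skip-list and label filters.

    labels: set of usage labels found for the entry (assumed representation).
    skip_words: set of words loaded from skip.txt.
    """
    if not (min_len <= len(word) <= max_len):
        return False
    if word in skip_words:
        return False
    # Reject the entry if it carries any of the filtered labels.
    return FILTER_LABELS.isdisjoint(labels)


print(keep_record("տուն", set(), set()))
```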

Of the original 304 879 words, only 68 839 made the cut.

The script uses these tags: ADJ, ADV, NOUN, PRON, VERB, NUM, MOD, ADP, CONJ, and INTJ.

The quality of the Armenian Wiktionary content is much higher than that of the Armenian Wikipedia articles, despite some errors, mostly in markup and questionable PoS assignments. However, a detailed investigation is needed for both.


Combining word types and counts

The end result of the data processing is a csv file of this format:

word,count,tag,definition

To generate the file that goes into the application you run:

./combine_csv.py --augment augment-records.csv word-tags.csv word-count.csv > combined.csv

It merges the data from the previously obtained files, keeping only these types of words: NOUN, PRON, VERB, NUM, MOD, ADP. It adds 1 to the original count, so a word that is not in wiki.xml gets count 1. You can add more words using the optional augment-records.csv, which has the same format as word-tags.csv.
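The merge rule can be illustrated with a minimal sketch. It assumes that word-tags.csv rows are (word, tag, definition) and that the counts are available as a word-to-count mapping; the column order and the function name combine are assumptions, not the script's actual interface.

```python
KEPT_TAGS = {"NOUN", "PRON", "VERB", "NUM", "MOD", "ADP"}


def combine(tag_rows, counts):
    """Yield (word, count + 1, tag, definition) for the kept tags only.

    tag_rows: iterable of (word, tag, definition) tuples (assumed layout).
    counts: dict mapping word -> Wikipedia count; missing words count as 0.
    """
    for word, tag, definition in tag_rows:
        if tag in KEPT_TAGS:
            yield word, counts.get(word, 0) + 1, tag, definition


# Made-up sample data for illustration.
tags = [("տուն", "NOUN", "house"), ("արագ", "ADJ", "fast"), ("գիրք", "NOUN", "book")]
rows = list(combine(tags, {"տուն": 120, "արագ": 40}))
print(rows)
```

Adding 1 to every count keeps Wiktionary-only words in the result with a non-zero frequency.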

If you run:

./combine_csv.py wiktionary/armenian_wikt.csv wiki/armenian-9.csv > hy-combined.csv

you get a resulting file of 45 100 words.


Data formatting for the application

The final step in data preparation is splitting hy-combined.csv into two types of files per word length: dictionary-N.txt and answers-N.txt, where N in the file names is the word length.

Two scripts serve that purpose:

./create_dictionaries.py --count <f> hy-combined.csv

./create_answers.py --skip <%> --keep <%> dictionary-N.txt

The first script splits hy-combined.csv into word-length-specific dictionary-N.txt files, skipping the words whose frequency (count) is less than f.

The second extracts the keep percentage of the overall number of words in the dictionary-N.txt file after skipping the skip percentage of the highest-frequency (count) words.
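The two steps can be sketched as follows. This is a hedged reading of the description above, not the scripts themselves: it assumes that word length is measured in Unicode code points, that both percentages are taken from the full list size, and that results are rounded down. The function names mirror the script names but are otherwise hypothetical.

```python
from collections import defaultdict


def create_dictionaries(rows, min_count):
    """Group (word, count) rows by word length, dropping counts below min_count."""
    by_length = defaultdict(list)
    for word, count in rows:
        if count >= min_count:
            by_length[len(word)].append((word, count))
    return by_length


def create_answers(entries, skip_pct, keep_pct):
    """Sort by count (highest first), skip the top skip_pct percent,
    then keep the next keep_pct percent of the whole list."""
    ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
    start = len(ordered) * skip_pct // 100
    size = len(ordered) * keep_pct // 100
    return ordered[start:start + size]


# Placeholder entries: five words with counts 5..1.
demo = [("a", 5), ("b", 4), ("c", 3), ("d", 2), ("e", 1)]
print(create_answers(demo, skip_pct=20, keep_pct=60))
```

Under these assumptions, keeping 80% of a 1185-word list yields 948 entries, which matches the ratio between the dictionary-4.txt and answers-4.txt sizes listed in the results.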

The results for this data set are:

dictionary-4.txt: 1185 words, count range 4..124360

dictionary-5.txt: 2200 words, count range 4..209637

dictionary-6.txt: 2670 words, count range 4..275357

dictionary-7.txt: 3080 words, count range 4..192989

dictionary-8.txt: 3164 words, count range 4..69354

dictionary-9.txt: 2968 words, count range 4..1182791

The corresponding answer files:

answers-4.txt: 948 entries, count range 8..4328

answers-5.txt: 1760 entries, count range 7..2587

answers-6.txt: 2136 entries, count range 7..1774

answers-7.txt: 2464 entries, count range 6..1537

answers-8.txt: 2531 entries, count range 6..1422

answers-9.txt: 2374 entries, count range 6..1082


Using the resulting data set

The list of answer words to be guessed is a subset of all the dictionary words of a given length. For each length, ~80% of the words, from medium to least frequent, are selected to make guessing harder.

The game is useful not just for pleasure but also for business (learning): when you hover over a complete entered word, the app shows the definition of the word.

Such an application can be used not just for improving linguistic skills but also for science learning, by using thematic dictionaries of terms in geography, chemistry, anatomy, etc.


Keyboard layout

The suggested phonetic keyboard layout is a modification of the modification suggested in the RFC: EN-0004 - see the Armenian Characters set review:


["չ", "ձ", "շ", "է", "ռ", "թ", "ղ", "ը", "օ", "փ"]

["ք", "ժ", "ե", "ր", "տ", "յ", "ւ", "ի", "ո", "պ"]

["ա", "ս", "դ", "ֆ", "գ", "հ", "ջ", "կ", "լ"]<delete>

["զ", "խ", "ց", "վ", "բ", "ն", "մ", "ծ", "ճ"]<return>


Player scores

Players can be ranked by the scores accumulated through successful guessing. For each guessed word the player scores as many points as the word has letters. To encourage language learning, progressive scoring is applied: for every word guessed in a second language the score is doubled, in a third language tripled, and so on. For example, if the player has three 4-letter words in each of 2 languages, the score is 3x4 for the first language and 3x(4x2) for the second, a total of 36 points (6 words in one language would have scored just 24). If the player has three 4-letter words in each of three languages, the score becomes 3x4 + 3x(4x2) + 3x(4x3) = 72.
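The scoring rule can be written as a short sketch. How the game decides which language is first, second, or third is not stated here, so the order of the lists below is an assumption, and total_score is a hypothetical name.

```python
def total_score(words_by_language):
    """Sum the progressive score over all guessed words.

    words_by_language: one list of guessed words per language, ordered as
    first, second, third language (assumed ordering). A word is worth its
    length multiplied by the 1-based rank of its language.
    """
    return sum(
        rank * len(word)
        for rank, words in enumerate(words_by_language, start=1)
        for word in words
    )


four_letter = ["word"] * 3  # placeholder 4-letter words
print(total_score([four_letter, four_letter]))
print(total_score([four_letter, four_letter, four_letter]))
```

The two calls reproduce the worked examples above: 3x4 + 3x(4x2) and 3x4 + 3x(4x2) + 3x(4x3).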

To attract more ambitious players to PolyWordlot, the list of the top 20 (50 or 100) player nicknames is published daily on the PolyWordlot game webpage.

Start playing PolyWordlot.

Important. If you have suggestions or have found inappropriate words in the game, please report them to aram.hayrapetyan@gmail.com.



 
 
 
