Technical Paper: Xosum.am – Armenian Speech-to-Text Web Application

aramhayr
Jan 20

Abstract

Xosum.am is a cloud-based speech-to-text system designed for Armenian language transcription. The platform supports real-time recording and file-based transcription and accepts multiple audio formats, including MP3 and WAV. It is optimized for noise-robust speech recognition, allowing transcription in varied acoustic environments. The system handles long-duration files, up to 8 hours each, and processes them in substantially less time than the audio itself lasts. Xosum.am runs in all modern web browsers on both desktop and mobile devices and requires only an internet connection. A flexible pricing model offers subscription-based access, with additional transcription hours available for purchase. The system also integrates with Telegram through a bot that transcribes and summarizes Armenian voice messages. This paper provides an overview of the system architecture, functionalities, and future development roadmap.

1. Introduction

Speech-to-text technology is increasingly used in accessibility, data processing, and workflow automation. Xosum.am provides an Armenian-language transcription solution built on cloud infrastructure for scalability and accessibility. The system supports multiple input methods and offers a speech recognition model optimized for real-world noise conditions. Because it requires no dedicated hardware or software installation, Xosum.am is usable across devices with minimal user-side requirements. The following sections outline the system architecture, core functionalities, and future development plans.

2. System Architecture

Xosum.am is a cloud-hosted application, accessible via xosum.am and app.xosum.am. The platform consists of the following architectural components:

2.1 Frontend

Web-based user interface supporting real-time recording, file uploads, and transcription management.

Designed for compatibility across all modern desktop and mobile web browsers.

Includes transcription history and user account management functionalities.

2.2 Backend

Cloud-hosted speech processing engine implementing noise-robust speech recognition models.

Supports simultaneous processing of multiple files.

Optimized for high-performance computing with reduced transcription turnaround times.

2.3 Database

Stores transcriptions and associated metadata for retrieval.

Secure authentication and account-based access to transcription history.
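As a rough illustration of what "transcriptions and associated metadata" could look like, the sketch below models one stored record and an account-scoped history lookup. The paper does not publish a schema, so the `TranscriptionRecord` class, its field names, and the `history_for` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TranscriptionRecord:
    """Hypothetical shape of one stored transcription (not the actual schema)."""
    user_id: str              # account that owns the transcript
    source_name: str          # original file name, or a label for a live recording
    duration_seconds: float   # length of the transcribed audio
    text: str                 # the transcript itself
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def history_for(records: list[TranscriptionRecord], user_id: str) -> list[TranscriptionRecord]:
    """Return only the given user's transcripts, newest first."""
    own = [r for r in records if r.user_id == user_id]
    return sorted(own, key=lambda r: r.created_at, reverse=True)
```

Filtering by `user_id` before returning anything mirrors the account-based access described above: a user's history never includes another account's records.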

2.4 Authentication Module

Google-based authentication for streamlined and secure user login.

Enables subscription management and usage tracking.

3. Core Functionalities

3.1 Speech-to-Text Conversion Modes

Xosum.am provides two primary methods for speech input:

1. Live Recording Mode

Speech is recorded directly via the browser and transcribed in real time.

The resulting text is displayed for immediate review and editing.

2. File Upload Mode

Users can upload audio files in multiple formats, including MP3, WAV, and other common codecs.

The system processes files up to 8 hours in duration with reduced processing times.

Supports concurrent file processing without user-side queuing constraints.
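The upload constraints above can be expressed as a small pre-flight check. This is a minimal sketch, not the service's actual validation: the `validate_upload` function is hypothetical, and the allowed-extension set contains only the two formats the paper names explicitly, although the text says other common codecs are accepted too.

```python
from pathlib import Path

MAX_DURATION_SECONDS = 8 * 60 * 60       # the 8-hour limit stated in the text
ALLOWED_EXTENSIONS = {".mp3", ".wav"}    # assumption: only the formats named explicitly


def validate_upload(filename: str, duration_seconds: float) -> list[str]:
    """Return a list of problems; an empty list means the file is acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if duration_seconds <= 0:
        problems.append("duration must be positive")
    elif duration_seconds > MAX_DURATION_SECONDS:
        problems.append("file is longer than 8 hours")
    return problems
```

Returning every problem at once, rather than raising on the first, lets a user interface report all issues with a file in a single message.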

3.2 Noise-Robust Speech Recognition

The transcription engine employs noise-robust speech recognition models, enabling high accuracy even in environments with background noise. This enhances usability in settings such as meetings, lectures, and outdoor recordings.

3.3 Simultaneous File Processing

The system supports parallel processing of multiple files, ensuring that users can submit multiple transcription requests without waiting for sequential completion.
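One common way to get this behavior is a worker pool that handles several files at once. The sketch below uses Python's standard `concurrent.futures` module to show the idea. The paper does not describe how the backend schedules work, so `transcribe` is a placeholder for the real recognizer, and `transcribe_all` and its `max_workers` default are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe(path: str) -> str:
    """Placeholder for the real recognizer, which the paper does not describe."""
    return f"transcript of {path}"


def transcribe_all(paths: list[str], max_workers: int = 4) -> dict[str, str]:
    """Run transcribe() on several files concurrently; results are keyed by path."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs each path
        # with its own transcript even if the jobs finish out of order.
        return dict(zip(paths, pool.map(transcribe, paths)))
```

With a pool like this, a short file submitted after a long one does not have to wait for the long one to finish, which is the user-visible effect the section describes.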

3.4 Cross-Platform Accessibility

Xosum.am is accessible from both desktop and mobile devices without platform-specific constraints. The web application is optimized for all modern browsers, ensuring compatibility across Windows, macOS, Linux, Android, and iOS. No additional software or extensions are required beyond an active internet connection.

3.5 Transcription History Management

Users can access a history of previous transcriptions for reference and download.

Stored transcripts remain available within user accounts, facilitating workflow continuity.

3.6 User Authentication & Account Management

Google authentication is integrated for secure login.

User accounts track transcription usage and available processing hours.

3.7 Telegram Bot for Voice Message Transcription

A Telegram bot is available, allowing users to submit Armenian voice messages for automatic transcription and summarization. This feature is designed for users who rely on voice-based communication within messaging applications.

4. Pricing Model

Xosum.am operates on a hybrid pricing model, offering both subscription-based and on-demand transcription options.

4.1 Subscription Plans

Users can opt for a monthly or yearly subscription, providing a fixed number of transcription hours.

Subscription tiers accommodate varying usage needs, from occasional to high-volume transcription requirements.

4.2 Additional Transcription Hours

Users can purchase additional transcription hours at a discounted rate.

There are no limits on the number of additional hours a user can acquire.
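The hybrid model implies two pools of hours per account: those included in the plan and those bought separately. The sketch below shows one way such a balance might be tracked. The `HourBalance` class is hypothetical, and the rule that plan hours are spent before purchased hours is an assumption made for illustration, not something the paper states.

```python
from dataclasses import dataclass


@dataclass
class HourBalance:
    """Hypothetical per-account balance, in hours."""
    subscription_hours: float   # hours remaining from the current plan
    extra_hours: float = 0.0    # separately purchased hours

    def charge(self, audio_hours: float) -> None:
        """Deduct a transcription, using plan hours before purchased hours."""
        if audio_hours > self.subscription_hours + self.extra_hours:
            raise ValueError("not enough transcription hours")
        # Assumed ordering: take as much as possible from the plan first.
        from_plan = min(audio_hours, self.subscription_hours)
        self.subscription_hours -= from_plan
        self.extra_hours -= audio_hours - from_plan
```

For example, an account with 10 plan hours and 5 purchased hours that transcribes 12 hours of audio would end with 0 plan hours and 3 purchased hours under this rule.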

5. Cloud Infrastructure and Performance

Xosum.am leverages a scalable cloud infrastructure, ensuring:

High availability with minimal downtime.

Optimized speech processing algorithms to minimize transcription turnaround times.

Secure storage with encryption to protect user data.

The system supports unlimited transcription requests, making it suitable for both individual users and large-scale institutional use cases.

6. Future Roadmap

Future development efforts aim to expand the platform’s capabilities beyond standard speech-to-text transcription. Planned enhancements include:

YouTube Link-Based Transcription
Users will be able to input a YouTube link and receive a full automated transcription of the video.

Automated Subtitle Generation for YouTube
The system will generate synchronized subtitles for YouTube videos.

Subtitle Support for TikTok and Instagram Reels
Subtitle processing will be extended to short-form video content, including TikTok and Instagram Reels.

Multilingual Subtitle and Transcription Translation
The system will support subtitle translation for videos, enabling broader accessibility.

General Transcription Translation
Users will have the option to translate full transcriptions into multiple languages.

These enhancements will position Xosum.am as a comprehensive speech and media processing platform, extending beyond speech-to-text conversion into multilingual and multimedia applications.
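To make the planned subtitle generation concrete, the sketch below turns timed transcript segments into the widely used SubRip (SRT) format. The paper does not say which subtitle format the system will emit or how segments will be timed, so the choice of SRT, the `(start, end, text)` segment shape, and the function names are all assumptions.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as SubRip cues numbered from 1."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a blank line, as the format requires.
    return "\n".join(cues)
```

Synchronization depends entirely on the start and end times supplied with each segment. The formatting step only serializes them.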

7. Conclusion

Xosum.am is a scalable and noise-robust Armenian speech-to-text application, designed for both real-time and file-based transcription. The system supports multiple input formats, long-duration files, parallel processing, and cross-platform accessibility. With flexible pricing options and Telegram bot integration, it provides a versatile solution for individuals and organizations requiring Armenian-language transcription services.

Future developments will focus on YouTube and social media subtitle generation, as well as multilingual transcription translation, further expanding the platform’s capabilities in speech and video processing.