Rohingya language and AI: why data is needed

The problem: Rohingya is a low-resource language

In artificial intelligence and natural language processing (NLP), languages are often categorised by how much data exists to train AI systems. High-resource languages such as English, Mandarin, and Spanish have enormous datasets: billions of web pages, books, and transcripts. Rohingya, a low-resource language, has almost none.

This matters because modern AI tools — translation software, speech recognition, text-to-speech, chatbots — depend on large datasets. Without Rohingya data, these tools cannot be built.

What does “language data” mean?

Language data for AI typically includes:

Text corpora — large collections of written text in the language
Parallel corpora — text in Rohingya alongside translations in another language (e.g., Rohingya–English sentence pairs)
Audio recordings — speech recordings with transcriptions, for speech recognition and text-to-speech
Annotations — labelled text for named entity recognition, sentiment, or other NLP tasks
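
To make these categories concrete, here is a minimal Python sketch of how such records might be represented and saved. The class names, field names, and JSON Lines layout are illustrative assumptions, not an established Rohingya dataset format, and the Rohingya text is a placeholder rather than a real sentence.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ParallelPair:
    """One Rohingya sentence with its translation in another language."""
    rohingya: str
    translation: str
    target_lang: str = "en"  # language code of the translation (assumed field)


@dataclass
class AudioClip:
    """One speech recording together with its transcription."""
    audio_path: str
    transcript: str
    duration_seconds: float


@dataclass
class Annotation:
    """A labelled span of text, e.g. for named entity recognition."""
    text: str
    start: int  # character offset where the labelled span begins
    end: int    # character offset where the span ends (exclusive)
    label: str


def to_jsonl(records) -> str:
    """Serialise records as JSON Lines: one JSON object per line."""
    return "\n".join(
        json.dumps({"type": type(r).__name__, **asdict(r)}, ensure_ascii=False)
        for r in records
    )


# Hypothetical example records; the Rohingya text is a placeholder.
examples = [
    ParallelPair(rohingya="<Rohingya sentence>", translation="Where is the clinic?"),
    AudioClip(audio_path="clips/0001.wav", transcript="<Rohingya transcript>",
              duration_seconds=4.2),
    Annotation(text="Cox's Bazar is in Bangladesh", start=0, end=11, label="LOCATION"),
]
print(to_jsonl(examples))
```

A plain text corpus needs no special structure beyond the text itself, so it is left out of the sketch. Setting ensure_ascii=False keeps non-Latin characters readable in the output file instead of turning them into escape codes.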

Why Rohingya AI matters

Access to information

If AI translation and speech tools supported Rohingya, millions of speakers could access:

Healthcare information
Legal rights documentation
Educational content
News and safety information

Reducing dependence on interpreters

While professional interpreters remain essential for complex situations, AI tools could help Rohingya speakers access basic information independently — especially in situations where interpreters are unavailable.

Digital inclusion

Without Rohingya NLP support, Rohingya speakers are effectively excluded from AI-powered services that speakers of major languages take for granted.

Current state of Rohingya NLP

Progress is limited but growing:

Some small parallel corpora exist in academic research settings
The Masakhane project (which focuses on African languages) and similar initiatives have begun including low-resource languages in NLP research
Unicode support for the Hanifi Rohingya script (added in Unicode 11.0, 2018) was a prerequisite for digital text processing in that script
A small number of research papers address Rohingya NLP

What data is needed

Text data

More digitised Rohingya texts — books, newspapers, community publications
Web content in Rohingya across all three scripts
Parallel translated documents (Rohingya + English, Rohingya + Bengali)
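
When text is gathered across scripts, it helps to tag each document with the script it is written in. The Python sketch below guesses the dominant script from Unicode code-point ranges. It assumes the three scripts are Hanifi, an Arabic-based script, and a Latin-based script, which this section does not spell out. The ranges are a rough proxy, and the function names are hypothetical.

```python
from collections import Counter
from typing import Optional

# Unicode ranges used as a rough proxy for each script (an assumption).
# The Hanifi Rohingya block is U+10D00..U+10D3F.
SCRIPT_RANGES = {
    "hanifi": [(0x10D00, 0x10D3F)],
    "arabic": [(0x0600, 0x06FF), (0x0750, 0x077F)],
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}


def script_of(char: str) -> Optional[str]:
    """Return the script bucket for one character, or None if it fits none."""
    code = ord(char)
    for script, ranges in SCRIPT_RANGES.items():
        if any(low <= code <= high for low, high in ranges):
            return script
    return None


def dominant_script(text: str) -> str:
    """Return the script with the most characters in the text."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    if not counts:
        return "unknown"  # only digits, punctuation, or other scripts
    return counts.most_common(1)[0][0]
```

Digits, spaces, and punctuation are ignored, so a document containing only those is reported as "unknown". Mixed-script documents get a single label here; a real pipeline would probably want the per-script counts as well.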

Audio data

Recorded speech in Rohingya with transcriptions
Varied speakers — male/female, different age groups, different dialects
Spontaneous speech as well as read speech
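
Speaker variety can be checked from recording metadata before any model is trained. The Python sketch below flags groups that make up less than a chosen share of a collection. The metadata fields, group labels, and 20% threshold are hypothetical choices for illustration, not a recommended standard.

```python
from collections import Counter


def underrepresented(recordings, field, expected, min_share=0.2):
    """List expected groups whose share of recordings is below min_share.

    recordings: list of dicts holding metadata for each recording
    field:      metadata key to check, e.g. "gender" or "age_group"
    expected:   groups the collection is meant to cover
    """
    total = len(recordings)
    if total == 0:
        return sorted(expected)  # nothing recorded yet, so every group is missing
    counts = Counter(r[field] for r in recordings)
    return sorted(g for g in expected if counts[g] / total < min_share)


# Hypothetical metadata for four recordings.
sample = [
    {"gender": "female", "age_group": "18-30", "speech_type": "read"},
    {"gender": "male", "age_group": "18-30", "speech_type": "read"},
    {"gender": "male", "age_group": "31-50", "speech_type": "read"},
    {"gender": "male", "age_group": "31-50", "speech_type": "spontaneous"},
]
print(underrepresented(sample, "age_group", ["18-30", "31-50", "51+"]))
```

Passing the expected groups explicitly means a group with no recordings at all is still reported, which a simple count of existing values would miss. The same check works for a dialect field, given agreed dialect labels.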

Quality and ethics

Data collection must respect:

Informed consent — speakers must understand and agree to how their data is used
Community ownership — ideally, Rohingya communities should benefit from data collected from them
Privacy — personal information must not be included in training data
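
Two of these requirements can be partly enforced in code. The Python sketch below drops records without recorded consent and masks obvious phone numbers and email addresses. The "consent" and "text" field names are assumptions, and the patterns are deliberately simple: they will miss names and addresses and may mask other long digit sequences, so they cannot replace human review.

```python
import re

# Simple illustrative patterns; real personal data takes many more forms.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def prepare_for_training(records):
    """Keep only consented records and redact their text.

    Each record is a dict with a "text" field and a "consent" flag
    (hypothetical field names). A missing flag is treated as no consent.
    """
    kept = []
    for record in records:
        if record.get("consent") is not True:
            continue
        kept.append({**record, "text": redact(record["text"])})
    return kept
```

Treating a missing consent flag as refusal is the cautious default: a record enters the training set only when consent was explicitly recorded.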

Our AI data services

RohingyaLanguage.org offers AI Data Services for organisations building Rohingya NLP systems — including transcription, translation pairs, annotation, and ethically sourced audio datasets.