The problem: Rohingya is a low-resource language
In the world of artificial intelligence and natural language processing (NLP), languages are often categorised by how much data exists to train AI systems. Languages like English, Mandarin, and Spanish have enormous datasets — billions of web pages, books, and transcripts. Rohingya has almost none.
This matters because modern AI tools — translation software, speech recognition, text-to-speech, chatbots — depend on large datasets. Without Rohingya data, these tools cannot be built.
What does “language data” mean?
Language data for AI typically includes:
- Text corpora — large collections of written text in the language
- Parallel corpora — text in Rohingya alongside translations in another language (e.g., Rohingya–English sentence pairs)
- Audio recordings — speech recordings with transcriptions, for speech recognition and text-to-speech
- Annotations — labelled text for named entity recognition, sentiment, or other NLP tasks
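To make these categories concrete, here is a minimal sketch of how parallel and audio records might be stored. The field names, the script labels, and the choice of JSON Lines as a file format are all illustrative assumptions rather than an established Rohingya data standard, and the sentences are placeholders.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ParallelPair:
    """One sentence pair from a parallel corpus."""
    rohingya: str   # Rohingya sentence
    english: str    # English translation of the same sentence
    script: str     # hypothetical label for the script the Rohingya text uses


@dataclass
class AudioRecord:
    """One speech recording with its transcription."""
    audio_path: str   # location of the audio file
    transcript: str   # what was said in the recording
    speaker_id: str   # pseudonymous ID, not a real name


def to_jsonl(records):
    """Serialise records as JSON Lines: one JSON object per line."""
    # ensure_ascii=False keeps non-Latin scripts readable in the output
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


# Placeholder content only; real entries would hold actual sentences.
pairs = [
    ParallelPair("<Rohingya sentence 1>", "<English translation 1>", "latin"),
    ParallelPair("<Rohingya sentence 2>", "<English translation 2>", "hanifi"),
]
jsonl_text = to_jsonl(pairs)
```

Keeping one record per line means a corpus can grow by appending new lines, and each record can be checked on its own.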
Why Rohingya AI matters
Access to information
If AI translation and speech tools supported Rohingya, millions of speakers could access:
- Healthcare information
- Legal rights documentation
- Educational content
- News and safety information
Reducing dependence on interpreters
Professional interpreters remain essential for complex situations, but AI tools could help Rohingya speakers access basic information independently, especially when no interpreter is available.
Digital inclusion
Without Rohingya NLP support, Rohingya speakers are effectively excluded from AI-powered services that speakers of major languages take for granted.
Current state of Rohingya NLP
Progress is limited but growing:
- Some small parallel corpora exist in academic research settings
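- Community-driven initiatives for low-resource languages, such as the Masakhane project (which focuses on African languages), offer a model for how under-served languages can be brought into NLP research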
- Unicode support for the Hanifi Rohingya script (added in Unicode 11.0, 2018) was a prerequisite for digital text processing in that script
- A small number of research papers address Rohingya NLP
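As an illustration of why Unicode support matters, the sketch below checks how much of a string is written in Hanifi Rohingya by testing code points against the script's Unicode block (U+10D00 to U+10D3F). The function names are hypothetical, and a real pipeline would also need to handle the other scripts used for Rohingya.

```python
# Hanifi Rohingya block in Unicode (added in version 11.0)
HANIFI_START = 0x10D00
HANIFI_END = 0x10D3F


def is_hanifi(ch):
    """Return True if the character lies in the Hanifi Rohingya block."""
    return HANIFI_START <= ord(ch) <= HANIFI_END


def hanifi_ratio(text):
    """Fraction of non-whitespace characters that are Hanifi Rohingya."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_hanifi(ch) for ch in chars) / len(chars)


# Two Hanifi code points followed by two Latin letters
sample = chr(0x10D00) + chr(0x10D01) + " ab"
ratio = hanifi_ratio(sample)
```

A check like this could be used to sort collected web text by script before further processing.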
What data is needed
Text data
- More digitised Rohingya texts — books, newspapers, community publications
- Web content in Rohingya across all three scripts
- Parallel translated documents (Rohingya + English, Rohingya + Bengali)
Audio data
- Recorded speech in Rohingya with transcriptions
- Varied speakers — male/female, different age groups, different dialects
- Spontaneous speech as well as read speech
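One way to track whether an audio collection covers varied speakers is to count recordings per metadata value and flag the gaps. The sketch below assumes a simple manifest of dictionaries. The field names and category labels are hypothetical and would need to be agreed with the community being recorded.

```python
from collections import Counter


def coverage(records, field):
    """Count how many recordings fall under each value of a metadata field."""
    return Counter(r.get(field, "unknown") for r in records)


def missing_values(records, field, expected):
    """List expected categories that have no recordings yet."""
    seen = set(coverage(records, field))
    return sorted(set(expected) - seen)


# Hypothetical manifest entries, one per recording
manifest = [
    {"speaker_id": "spk_01", "gender": "female", "age_group": "18-30", "speech_type": "read"},
    {"speaker_id": "spk_02", "gender": "male", "age_group": "31-50", "speech_type": "read"},
    {"speaker_id": "spk_03", "gender": "female", "age_group": "51+", "speech_type": "spontaneous"},
]

by_gender = coverage(manifest, "gender")
gaps = missing_values(manifest, "age_group", ["under 18", "18-30", "31-50", "51+"])
```

Running the same check on a dialect or speech-type field would show where further recording is needed.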
Quality and ethics
Data collection must respect:
- Informed consent — speakers must understand and agree to how their data is used
- Community ownership — ideally, Rohingya communities should benefit from data collected from them
- Privacy — personal information must not be included in training data
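The consent and privacy principles above can be partly enforced in code. The following is a minimal sketch, assuming each record carries a hypothetical `consent` flag and a `text` field. The regular expressions catch only simple email and phone patterns. Names, addresses, and other personal details would still need human review.

```python
import re

# Deliberately simple patterns; they will miss many real-world formats.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def prepare_for_training(records):
    """Keep only records with explicit consent, with their text redacted."""
    return [
        {**r, "text": redact(r["text"])}
        for r in records
        if r.get("consent") is True
    ]


records = [
    {"text": "Call +880 1234 567890 or mail a.b@example.org", "consent": True},
    {"text": "No consent was recorded for this entry", "consent": False},
]
cleaned = prepare_for_training(records)
```

Records without an explicit `consent` value of `True` are dropped rather than assumed to be usable.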
Our AI data services
RohingyaLanguage.org offers AI Data Services for organisations building Rohingya NLP systems — including transcription, translation pairs, annotation, and ethically sourced audio datasets.