Translation & AI

Rohingya language and AI: why data is needed

Why the Rohingya language is underrepresented in AI systems, what language data means for AI, and how data collection efforts help the community.

RohingyaLanguage.org
The problem: Rohingya is a low-resource language

In artificial intelligence and natural language processing (NLP), languages are often categorised by how much data exists to train AI systems. Languages like English, Mandarin, and Spanish are high-resource: they have enormous datasets drawn from billions of web pages, books, and transcripts. Rohingya is low-resource: almost no such data exists for it.

This matters because modern AI tools — translation software, speech recognition, text-to-speech, chatbots — depend on large datasets. Without Rohingya data, these tools cannot be built.

What does “language data” mean?

Language data for AI typically includes:

  • Text corpora — large collections of written text in the language
  • Parallel corpora — text in Rohingya alongside translations in another language (e.g., Rohingya–English sentence pairs)
  • Audio recordings — speech recordings with transcriptions, for speech recognition and text-to-speech
  • Annotations — labelled text for named entity recognition, sentiment, or other NLP tasks
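As a rough illustration of the data types listed above, the sketch below models one record of each kind and writes it as a line of JSON. The class and field names are assumptions made for this example, not an established Rohingya dataset format, and the sentence texts are placeholders rather than real data.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ParallelPair:
    """One aligned sentence pair from a parallel corpus."""
    source_text: str            # sentence in Rohingya
    target_text: str            # translation of the same sentence
    source_lang: str = "rhg"    # ISO 639-3 code for Rohingya
    target_lang: str = "eng"    # ISO 639-3 code for English


@dataclass
class AudioSample:
    """One speech recording paired with its transcription."""
    audio_path: str             # hypothetical file location
    transcript: str
    duration_seconds: float


@dataclass
class AnnotatedText:
    """Text with labelled character spans, e.g. for named entity recognition."""
    text: str
    # Each span is (start, end, label); end is exclusive.
    spans: list = field(default_factory=list)


def to_jsonl_line(record) -> str:
    """Serialise any of the records above as one line of JSON Lines."""
    return json.dumps(asdict(record), ensure_ascii=False)


# Placeholder content only; a real corpus would hold actual sentences.
pair = ParallelPair(source_text="<Rohingya sentence>",
                    target_text="<English translation>")
line = to_jsonl_line(pair)
print(line)
```

Storing one record per line keeps a corpus easy to append to and to stream, and `ensure_ascii=False` keeps non-Latin text readable in the file instead of escaping it.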

Why Rohingya AI matters

Access to information

If AI translation and speech tools supported Rohingya, millions of speakers could access:

  • Healthcare information
  • Legal rights documentation
  • Educational content
  • News and safety information

Reducing dependence on interpreters

While professional interpreters remain essential for complex situations, AI tools could help Rohingya speakers access basic information independently — especially in situations where interpreters are unavailable.

Digital inclusion

Without Rohingya NLP support, Rohingya speakers are effectively excluded from AI-powered services that speakers of major languages take for granted.

Current state of Rohingya NLP

Progress is limited but growing:

  • Some small parallel corpora exist in academic research settings
  • Community-led initiatives for low-resource languages, such as the Masakhane project for African languages, offer a model for similar work on Rohingya
  • Unicode support for the Hanifi Rohingya script (added in Unicode 11.0, 2018) was a prerequisite for digital text processing in that script
  • A small number of research papers address Rohingya NLP
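To make the Unicode point concrete: Hanifi Rohingya characters occupy the block U+10D00 to U+10D3F, so software can recognise Hanifi text by code point. The following is a minimal sketch; the function names are hypothetical, and a production tool would also need to handle the Arabic-based and Latin-based ways of writing Rohingya.

```python
# Hanifi Rohingya block in the Unicode Standard (added in version 11.0).
HANIFI_START = 0x10D00
HANIFI_END = 0x10D3F


def is_hanifi(ch: str) -> bool:
    """Return True if a single character lies in the Hanifi Rohingya block."""
    return HANIFI_START <= ord(ch) <= HANIFI_END


def hanifi_ratio(text: str) -> float:
    """Share of non-whitespace characters that are Hanifi Rohingya."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_hanifi(ch) for ch in chars) / len(chars)


# Two Hanifi code points followed by two Latin letters.
sample = "\U00010D00\U00010D01 ab"
print(hanifi_ratio(sample))  # 0.5
```

A ratio like this could serve as a first filter when sorting collected text by script, before any human review.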

What data is needed

Text data

  • More digitised Rohingya texts — books, newspapers, community publications
  • Web content in Rohingya across all three scripts
  • Parallel translated documents (Rohingya + English, Rohingya + Bengali)

Audio data

  • Recorded speech in Rohingya with transcriptions
  • Varied speakers — male/female, different age groups, different dialects
  • Spontaneous speech as well as read speech
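One way to check the speaker variety described above is to count recordings per metadata category and flag categories that make up only a small share of the set. The sketch below assumes each recording carries a small metadata dictionary; the field names, category values, and the 20% threshold are all illustrative assumptions.

```python
from collections import Counter


def underrepresented(samples, field_name, min_share=0.2):
    """Return category values whose share of samples is below min_share.

    Categories with no samples at all never appear in the counts, so
    they have to be checked against an expected list separately.
    """
    counts = Counter(sample[field_name] for sample in samples)
    total = sum(counts.values())
    return sorted(value for value, n in counts.items() if n / total < min_share)


# Hypothetical metadata for six recordings.
samples = [
    {"gender": "female", "age_group": "18-30", "style": "read"},
    {"gender": "male", "age_group": "18-30", "style": "read"},
    {"gender": "female", "age_group": "31-50", "style": "read"},
    {"gender": "male", "age_group": "31-50", "style": "read"},
    {"gender": "female", "age_group": "18-30", "style": "read"},
    {"gender": "male", "age_group": "51+", "style": "spontaneous"},
]

for name in ("gender", "age_group", "style"):
    print(name, underrepresented(samples, name))
```

In this toy set, spontaneous speech and the oldest age group each account for one recording in six, so both fall below the threshold and would be priorities for further collection.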

Quality and ethics

Data collection must respect:

  • Informed consent — speakers must understand and agree to how their data is used
  • Community ownership — ideally, Rohingya communities should benefit from data collected from them
  • Privacy — personal information must not be included in training data
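The consent and privacy requirements above can be enforced partly in code, before any record reaches a training set. The sketch below keeps only records whose speaker agreed to the stated purpose and drops metadata fields that identify a person. The field names and purpose labels are hypothetical, and the filter does not detect personal details mentioned inside a transcript, which still requires human review.

```python
# Hypothetical metadata fields treated as personal information.
PERSONAL_FIELDS = {"speaker_name", "phone", "location"}


def prepare_for_training(records, purpose):
    """Keep records consented for `purpose`, minus personal metadata fields."""
    cleaned = []
    for record in records:
        # A missing or empty consent list means the record is excluded.
        if purpose not in record.get("consented_uses", ()):
            continue
        cleaned.append(
            {key: value for key, value in record.items()
             if key not in PERSONAL_FIELDS}
        )
    return cleaned


# Placeholder records; the values are not real data.
records = [
    {"transcript": "<utterance 1>", "speaker_name": "<name>",
     "consented_uses": ["asr_training"]},
    {"transcript": "<utterance 2>", "speaker_name": "<name>",
     "consented_uses": ["research_only"]},
    {"transcript": "<utterance 3>", "consented_uses": []},
]

usable = prepare_for_training(records, "asr_training")
print(len(usable))  # 1
```

Excluding by default, rather than including by default, means a record with incomplete consent information is never used by accident.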

Our AI data services

RohingyaLanguage.org offers AI Data Services for organisations building Rohingya NLP systems — including transcription, translation pairs, annotation, and ethically sourced audio datasets.