Rohingya AI Data Services

High-quality Rohingya language data for AI training, NLP research, and speech technology: produced by native speakers, ethically sourced, and delivered in both scripts (Hanifi and Rohingyalish).

Discuss Your Project →
Native Rohingya speakers
NGO & humanitarian experience
Hanifi, Rohingyalish & English
NDA & confidentiality available

Translation Pairs

Parallel sentence pairs for machine translation: Rohingya–English, with Burmese and Bangla pairings also available.

Transcription

Audio transcription in Rohingya with consistent orthography for ASR training data.

Speech Recordings

Recorded Rohingya speech with speaker diversity for TTS and ASR.

Text Annotation

Named entity, sentiment, intent, and classification labels for NLP tasks.

Data Validation

Native-speaker review and quality scoring of existing Rohingya datasets or model output.

Lexicon & Terminology

Wordlists, spelling normalisation, and script-conversion datasets (Hanifi ↔ Rohingyalish).

Why work with us

  • Native Rohingya speakers produce and review every datapoint, so orthography stays consistent across the whole dataset
  • We maintain the largest free English–Rohingya digital dictionary (6,500+ entries) and a rule-based Hanifi ↔ Rohingyalish converter, so we know first-hand the normalisation problems that break low-resource models
  • Delivery in both scripts with documented conventions
  • Experience with humanitarian-sector data sensitivities; NDA available

Our process

  1. Scoping call or brief — task, volume, format, licensing; response within one business day
  2. Pilot batch — a small sample so you can validate quality and format early
  3. Production — collection/annotation by native speakers with ongoing QA review
  4. Delivery — your schema (JSONL, CSV, platform export) with documentation

Our approach to data ethics

  • All data is collected with informed consent
  • Contributors understand how data will be used and are paid for their work
  • Community members benefit from the work
  • No personally identifiable information in datasets
  • Transparent licensing and usage agreements

Frequently asked questions

Why is Rohingya considered a low-resource language for AI?

Very little digital Rohingya text and speech exists compared with major languages, and what does exist is split across two scripts and spelled inconsistently. That makes high-quality, consistently transcribed data the bottleneck for any Rohingya AI work, and it is exactly what we produce.

What data formats do you deliver?

Whatever your pipeline needs — JSONL, CSV/TSV, plain text, or your annotation platform's schema. Speech data is delivered with aligned transcripts and speaker metadata (no personally identifiable information).
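
As an illustration only, a JSONL delivery of translation pairs might look like the sketch below. The field names (`id`, `source_lang`, `target_lang`, `source_text`, `target_text`, `script`) are hypothetical placeholders, not a fixed schema; the actual layout is agreed per project. The language tag `rhg` is the ISO 639-3 code for Rohingya.

```python
import json

# Hypothetical record layout: every field name here is an assumption,
# and the text values are placeholders rather than real data.
records = [
    {
        "id": "pair-0001",
        "source_lang": "en",
        "target_lang": "rhg",  # ISO 639-3 code for Rohingya
        "source_text": "<English sentence>",
        "target_text": "<Rohingya sentence>",
        "script": "rohingyalish",  # or "hanifi"
    },
]


def to_jsonl(rows):
    """Serialise one JSON object per line, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Passing `ensure_ascii=False` keeps Hanifi and accented Latin characters as readable text in the file instead of `\u` escape sequences.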

Can you produce data in both Rohingya scripts?

Yes. We deliver Rohingyalish (Latin), Hanifi (Unicode block U+10D00–U+10D3F), or parallel versions of both. Orthography is kept consistent across the dataset, because inconsistent spellings fragment a model's vocabulary and hurt quality.
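
For teams that want to sanity-check which script a delivered field uses, the Hanifi block range above is enough for a simple test. The following is a minimal sketch, not part of any delivered tooling: the function names are hypothetical, and the Latin check relies on Unicode character names from Python's standard `unicodedata` module.

```python
import unicodedata

# Hanifi Rohingya Unicode block, as stated above.
HANIFI_START = 0x10D00
HANIFI_END = 0x10D3F


def is_hanifi(ch):
    """True if the character falls inside the Hanifi Rohingya block."""
    return HANIFI_START <= ord(ch) <= HANIFI_END


def is_latin(ch):
    """True for Latin letters, including precomposed accented ones."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def detect_script(text):
    """Return 'hanifi', 'latin', 'mixed', or 'none' for a piece of text."""
    has_hanifi = any(is_hanifi(ch) for ch in text)
    has_latin = any(is_latin(ch) for ch in text)
    if has_hanifi and has_latin:
        return "mixed"
    if has_hanifi:
        return "hanifi"
    if has_latin:
        return "latin"
    return "none"


# Build a Hanifi sample from code points instead of hard-coding a word.
hanifi_sample = chr(0x10D00) + chr(0x10D01)
print(detect_script(hanifi_sample))
print(detect_script("abc"))
```

A check like this can flag rows where a field labelled as one script contains characters from the other, which is a common source of silent noise in two-script datasets.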

How do you ensure data ethics?

All contributors give informed consent, understand how the data will be used, and are paid for their work. Datasets contain no personally identifiable information, and licensing terms are agreed transparently before collection begins. We can sign an NDA covering your project as well.

How does a project start?

Describe the task, target volume, and format through the contact page. We respond within one business day with a proposed approach, timeline, and quote — pilot batches are a common first step.

Discuss your AI data needs

Describe the task and volume — we'll propose an approach within one business day.

Contact Us