Rohingya AI Data Services
High-quality Rohingya language data for AI training, NLP research, and speech technology — produced by native speakers, ethically sourced, delivered in both scripts.
Discuss Your Project →
Translation Pairs
Rohingya–English (also Burmese, Bangla) parallel sentence pairs for machine translation.
Transcription
Audio transcription in Rohingya with consistent orthography for ASR training data.
Speech Recordings
Recorded Rohingya speech with speaker diversity for TTS and ASR.
Text Annotation
Named entity, sentiment, intent, and classification labels for NLP tasks.
Data Validation
Native-speaker review and quality scoring of existing Rohingya datasets or model output.
Lexicon & Terminology
Wordlists, spelling normalisation, and script-conversion datasets (Hanifi ↔ Rohingyalish).
Why work with us
- Native Rohingya speakers produce and review every datapoint — orthography stays consistent across the whole dataset
- We maintain the largest free English–Rohingya digital dictionary (6,500+ entries) and a rule-based Hanifi ↔ Rohingyalish converter — deep familiarity with exactly the normalisation problems that break low-resource models
- Delivery in both scripts with documented conventions
- Experience with humanitarian-sector data sensitivities; NDA available
Our process
- Scoping call or brief — task, volume, format, licensing; response within one business day
- Pilot batch — a small sample so you can validate quality and format early
- Production — collection/annotation by native speakers with ongoing QA review
- Delivery — your schema (JSONL, CSV, platform export) with documentation
Our approach to data ethics
- All data is collected with informed consent
- Contributors understand how data will be used and are paid for their work
- Community members benefit from the work
- No personally identifiable information in datasets
- Transparent licensing and usage agreements
Frequently asked questions
Why is Rohingya considered a low-resource language for AI?
Very little digital Rohingya text and speech exists compared with major languages, and what exists is split across two scripts and inconsistent spellings. That makes high-quality, consistently transcribed data the bottleneck for any Rohingya AI work, and it is exactly what we produce.
What data formats do you deliver?
Whatever your pipeline needs — JSONL, CSV/TSV, plain text, or your annotation platform's schema. Speech data is delivered with aligned transcripts and speaker metadata (no personally identifiable information).
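As a rough illustration of what a JSONL delivery could look like, here is a minimal Python sketch for a parallel-pair dataset. The field names (`id`, `rohingyalish`, `hanifi`, `english`) and the placeholder values are assumptions made for this example only; the actual schema is agreed per project.

```python
import json

# Hypothetical field names for one parallel-pair record; the real schema
# is whatever the client's pipeline specifies.
REQUIRED_FIELDS = ("id", "rohingyalish", "hanifi", "english")


def to_jsonl(records):
    """Serialise records as JSONL: one JSON object per line."""
    lines = []
    for record in records:
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"record {record.get('id')!r} is missing {missing}")
        # ensure_ascii=False keeps Hanifi and accented Latin characters
        # readable in the file instead of writing \u escapes.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Placeholder strings stand in for real sentences.
sample = [
    {
        "id": "pair-0001",
        "rohingyalish": "<Rohingyalish sentence>",
        "hanifi": "<Hanifi sentence>",
        "english": "<English sentence>",
    }
]
```

Calling `from_jsonl(to_jsonl(sample))` returns the original records, which makes for a quick round-trip check before a file is handed over.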
Can you produce data in both Rohingya scripts?
Yes. We deliver Rohingyalish (Latin), Hanifi (Unicode block U+10D00–10D3F), or parallel versions of both — with consistent orthography across the dataset, which matters enormously for model quality.
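To check which script a given line of a dataset is in, one option is to test code points against the Hanifi Rohingya block quoted above. The sketch below is illustrative only: the function name `script_of` and its four labels are assumptions made for this example, not part of any delivered tooling.

```python
import unicodedata

# Hanifi Rohingya Unicode block, as given in the text above.
HANIFI_START, HANIFI_END = 0x10D00, 0x10D3F


def script_of(text: str) -> str:
    """Return 'hanifi', 'latin', 'mixed', or 'other' for a string.

    Only Hanifi-block characters and Latin letters are counted; digits,
    spaces, and punctuation are ignored.
    """
    has_hanifi = any(HANIFI_START <= ord(ch) <= HANIFI_END for ch in text)
    has_latin = any(
        ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")
        for ch in text
    )
    if has_hanifi and has_latin:
        return "mixed"
    if has_hanifi:
        return "hanifi"
    if has_latin:
        return "latin"
    return "other"
```

A check like this can flag records where the two scripts are unintentionally mixed in a single field, one of the inconsistencies that parallel delivery is meant to avoid.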
How do you ensure data ethics?
All contributors give informed consent, understand how the data will be used, and are paid for their work. Datasets contain no personally identifiable information, and licensing terms are agreed transparently before collection begins. We can sign an NDA covering your project as well.
How does a project start?
Describe the task, target volume, and format through the contact page. We respond within one business day with a proposed approach, timeline, and quote — pilot batches are a common first step.
Discuss your AI data needs
Describe the task and volume — we'll propose an approach within one business day.
Contact Us
Read: Rohingya language and AI: why data is needed · Related: Localization