Question 1

Why is Rohingya considered a low-resource language for AI?

Accepted Answer

Far less digital Rohingya text and speech exists than for major languages, and what does exist is split across two scripts and written with inconsistent spellings. High-quality, consistently transcribed data is therefore the bottleneck for any Rohingya AI work, and it is exactly what we produce.

Question 2

What data formats do you deliver?

Accepted Answer

Whatever your pipeline needs: JSONL, CSV/TSV, plain text, or your annotation platform's schema. Speech data is delivered with aligned transcripts and speaker metadata that contains no personally identifiable information.
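
As a rough illustration of the JSONL option, the sketch below writes translation pairs one JSON object per line and checks each record on the way back in. The field names (`id`, `source_lang`, `target_lang`, `source_text`, `target_text`, `script`) are assumptions made for this example, not a fixed delivery schema, and the sentence values are placeholders. `rhg` is the ISO 639-3 code for Rohingya; `Latn` and `Rohg` are the ISO 15924 script codes for Latin and Hanifi Rohingya.

```python
import json

# Hypothetical field names for illustration; the real schema is agreed per project.
REQUIRED_FIELDS = {
    "id", "source_lang", "target_lang", "source_text", "target_text", "script",
}


def to_jsonl(records):
    """Serialize an iterable of dicts as JSONL: one JSON object per line."""
    # ensure_ascii=False keeps non-ASCII text (e.g. Hanifi letters) readable.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def parse_jsonl(text):
    """Parse JSONL text and verify that every record has the required fields."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(record)
    return records


# Placeholder content only; not real dataset text.
sample = [{
    "id": "pair-0001",
    "source_lang": "en",
    "target_lang": "rhg",
    "source_text": "placeholder English sentence",
    "target_text": "placeholder Rohingya sentence",
    "script": "Latn",
}]

round_tripped = parse_jsonl(to_jsonl(sample))
```

A check like `parse_jsonl` is cheap to run on every delivered file, so schema drift shows up before the data reaches training.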

Question 3

Can you produce data in both Rohingya scripts?

Accepted Answer

Yes. We deliver Rohingyalish (Latin script), Hanifi (Unicode block U+10D00–U+10D3F), or parallel versions of both. Orthography is kept consistent across the dataset, which matters enormously for model quality because a model treats each spelling variant as a separate word.
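
To check which script a given line of text uses, one option is to test code points against the Hanifi Rohingya block quoted above. The following is a minimal sketch under that assumption; the function names and the `Rohg`/`Latn`/`mixed`/`unknown` labels are choices made for this example, not part of any delivery format.

```python
import unicodedata

# Hanifi Rohingya Unicode block, U+10D00 to U+10D3F.
HANIFI_START, HANIFI_END = 0x10D00, 0x10D3F


def is_hanifi(ch):
    """True if the character's code point lies in the Hanifi Rohingya block."""
    return HANIFI_START <= ord(ch) <= HANIFI_END


def is_latin_letter(ch):
    """True for Latin letters, including accented ones such as c-cedilla."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def classify_script(text):
    """Return 'Rohg', 'Latn', 'mixed', or 'unknown' for a piece of text."""
    has_hanifi = any(is_hanifi(ch) for ch in text)
    has_latin = any(is_latin_letter(ch) for ch in text)
    if has_hanifi and has_latin:
        return "mixed"
    if has_hanifi:
        return "Rohg"
    if has_latin:
        return "Latn"
    return "unknown"
```

Running `classify_script` over every record is a quick way to flag lines where the two scripts were mixed by accident in a dataset that is meant to be single-script.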

Question 4

How do you ensure data ethics?

Accepted Answer

All contributors give informed consent, understand how the data will be used, and are paid for their work. Datasets contain no personally identifiable information, and licensing terms are agreed transparently before collection begins. We can sign an NDA covering your project as well.

Question 5

How does a project start?

Accepted Answer

Describe the task, target volume, and format through the contact page. We respond within one business day with a proposed approach, timeline, and quote. Pilot batches are a common first step.

Rohingya AI Data Services

Translation Pairs

Transcription

Speech Recordings

Text Annotation

Data Validation

Lexicon & Terminology

Why work with us

Our process

Our approach to data ethics

