About Currasat: Arabic Dialect Corpora

Annotated Corpora Portal

One portal for Arabic corpora with full morphological annotations

Curras كُرَّاس Ver. 2 (Palestinian Dialect Corpus)

The corpus consists of about 56K words/tokens collected from Facebook, Twitter, scripts of "Watan Aa Watar", and other sources. Each word in the corpus was annotated with morphological features: CODA, prefixes, stem, suffixes, MSA lemma, dialect lemma, gloss, part of speech, gender, number, and aspect. The corpus was annotated using the LDC’s SAMA tagsets. The first version of this corpus was released in 2013; this second version is a complete revision of the annotations. This article explains these revisions. Download Curras corpus (CC BY 4.0 License).

Baladi بَلَدي (Lebanese Dialect Corpus)

The corpus consists of about 9.6K words/tokens collected from Facebook, blog posts, and traditional poems. It was annotated as an extension of Curras, following the same annotation methodology, so that the two together form a Levantine corpus. This article explains the corpus. Download Baladi corpus (CC BY 4.0 License).

Lisan لِسان (Iraqi, Yemeni, Sudanese, and Libyan Dialect Corpus)

The four corpora consist of about 1.2 million tokens, which we collected from different social media platforms. The Yemeni corpus was collected automatically from Twitter, while the other three were collected manually from Facebook and YouTube. Each word in the four corpora was annotated with morphological features such as POS, stem, prefixes, suffixes, lemma, and an English gloss. The annotation was carried out by 35 annotators who are native speakers of the target dialects. The annotators were trained on a set of guidelines and on how to use our Arabic Dialect Annotation Toolkit (ADAT), which is open source. This article explains the four corpora. Iraqi corpus (45K tokens) Download corpus, Yemeni corpus (~1.05M tokens) Download corpus, Sudanese corpus (52K tokens) Download corpus, Libyan corpus (51K tokens) Download corpus (CC BY 4.0 License).

Nabra نَبرة (Syrian Dialect Corpus)

The Nabra corpus consists of about 60K words/tokens collected from social media posts, scripts of movies and series, song lyrics, and local proverbs. Each word in the corpus was annotated with morphological features: CODA, prefixes, stem, suffixes, MSA lemma, dialect lemma, gloss, part of speech, gender, number, and aspect. The corpus covers 10 Syrian dialects: Damascus (Shami), Aleppo, Latakia, Raqqa, Deir-Ezzur, Homs, Huran, Suwayda, Hama, and Mardin. This article explains the corpus. Download Nabra corpus (CC BY 4.0 License).

News

Birzeit University releases corpora for six Arabic dialects

Birzeit University released corpora for the Libyan, Palestinian, Lebanese, Iraqi, Sudanese, and Yemeni dialects, totaling 1.3 million words. (Archive)

Related Projects
