The corpus consists of about 56K words/tokens collected from Facebook, Twitter, "Watan Aa Watar" scripts, and other sources. Each word in the corpus was annotated with morphological features: CODA, Prefixes, Stem, Suffixes, MSA Lemma, Dialect Lemma, Gloss, Part-of-Speech, Gender, Number, and Aspect. The corpus was annotated using the LDC's SAMA tagsets. The first version of this corpus was released in 2013; this second version is a complete revision of the annotations. This article explains these revisions. Visit the download page to download the corpus (CC BY-NC-SA 4.0 License).
The corpus consists of about 9.6K words/tokens collected from Facebook, blog posts, and traditional poems. It was annotated as an extension to Curras, following the same annotation methodology, so that the two together form a Levantine corpus. This article explains the corpus. Visit the download page to download the corpus (CC BY-NC-SA 4.0 License).
The four corpora consist of about 1.2 million tokens that we collected from different social media platforms. The Yemeni corpus (~1.05M tokens) was collected automatically from Twitter, while the corpora for the other three dialects (~50K tokens each) were collected manually from Facebook and YouTube. Each word in the four corpora was annotated with morphological features such as POS, stem, prefixes, suffixes, lemma, and an English gloss. The annotation was carried out by 35 annotators who are native speakers of the target dialects. The annotators were trained on a set of guidelines and on how to use our open-source Arabic Dialect Annotation Toolkit (ADAT). This article explains the four corpora. Visit this download page to download the corpora (CC BY-NC-SA 4.0 License).
The Nabra corpora consist of about 60K words/tokens collected from social media posts, scripts of movies and series, song lyrics, and local proverbs. Each word in the corpora was annotated with morphological features: CODA, Prefixes, Stem, Suffixes, MSA Lemma, Dialect Lemma, Gloss, Part-of-Speech, Gender, Number, and Aspect. The corpora cover 10 Syrian dialects: Damascus (Shami), Aleppo, Latakia, Raqqa, Deir-Ezzur, Homs, Huran, Suwayda, Hama, and Mardin. This article explains the corpora. Visit the download page to download the corpora (CC BY-NC-SA 4.0 License).
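The per-word annotation described above can be pictured as one record per token. The following is a minimal Python sketch under stated assumptions, not the released file format: the class name `TokenAnnotation`, its field names, and the helper `missing_features` are hypothetical and simply mirror the eleven features listed in the description. The actual column names and layout should be taken from the downloaded corpus files.

```python
from dataclasses import dataclass, fields


@dataclass
class TokenAnnotation:
    """One annotated token. All names here are hypothetical, not the corpus schema."""
    token: str               # the raw word as it appears in the source text
    coda: str = ""           # CODA (conventional orthography) spelling
    prefixes: str = ""
    stem: str = ""
    suffixes: str = ""
    msa_lemma: str = ""      # Modern Standard Arabic lemma
    dialect_lemma: str = ""
    gloss: str = ""          # English gloss
    pos: str = ""            # part-of-speech tag
    gender: str = ""
    number: str = ""
    aspect: str = ""


def feature_names():
    """Return the annotation feature names, excluding the raw token itself."""
    return [f.name for f in fields(TokenAnnotation) if f.name != "token"]


def missing_features(record):
    """Return the features that are still empty on a record."""
    return [name for name in feature_names() if not getattr(record, name)]


if __name__ == "__main__":
    # Placeholder values for illustration only; not taken from the corpus.
    record = TokenAnnotation(token="TOKEN", pos="NOUN")
    print(len(feature_names()), "features per token")
    print("still unannotated:", missing_features(record))
```

A record type like this makes it easy to check annotation completeness before computing statistics over a corpus, since every token carries the same fixed set of features.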
Birzeit University released corpora for the Libyan, Palestinian, Lebanese, Iraqi, Sudanese, and Yemeni dialects, totaling 1.3 million words. (Archive)