SALMA Corpus
A corpus and model for Arabic Word Sense Disambiguation (WSD).
Version: 1.0 (updated on 22/10/2023)
SALMA consists of about 34K tokens that are manually annotated with 4,151 unique senses (collected from two Arabic lexicons, Modern and Ghani) and 4,389 named entities. It was collected from 33 online media sources written in Modern Standard Arabic (MSA) that cover general topics. An end-to-end baseline WSD model based on Target Sense Verification (TSV) was trained using BERT and achieved state-of-the-art results for Arabic (accuracy: 84.2%).
Corpus size:
34,253 tokens (MSA)
Richness:
8,760 unique tokens and 3,875 unique lemmas, distributed over 2,904 nouns, 677 verbs, 119 function words, and 175 punctuation marks and digits
Domains:
general topics
IAA:
92% (Quadratic Weighted Kappa)
WSD Model:
WSD end-to-end system (84.2% Accuracy)
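The inter-annotator agreement listed above is reported as Quadratic Weighted Kappa. As a rough illustration of how that statistic is computed for two annotators, here is a minimal sketch. It assumes ratings are integers on an ordinal scale from 0 to `num_labels - 1`; the function name, the scale, and the toy ratings are illustrative assumptions and are not taken from the SALMA annotation setup.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """Cohen's kappa with quadratic weights for two raters.

    Ratings are assumed to be integers in 0..num_labels-1 (an assumption
    made for this sketch, not a description of the SALMA scale).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of ratings")
    if num_labels < 2:
        raise ValueError("num_labels must be at least 2")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    max_dist = (num_labels - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / max_dist   # quadratic disagreement weight
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0.0:
        # Both raters used one and the same label throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy ratings (not SALMA data): perfect agreement and perfect reversal.
full_agreement = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
full_reversal = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 3)
```

Because the weights grow with the squared distance between two ratings, near-misses on the scale are penalized far less than ratings at opposite ends.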
Named Entity Classes:
Tag | Description |
PERS | Person names: first, middle, last, nickname ... |
ORG | Organizations: company, team, government ... |
GPE | Geopolitical entities: country, city, state ... |
LOC | Geographical locations: river, sea, mountain ... |
FAC | Facilities: landmark, road, building, airport ... |
CURR | Currency names or symbols. |
SALMA is available for download upon request for academic and commercial use.
Request to download SALMA (sense annotated corpus and model, ~34K tokens)
GitHub (download BERT training source code)
Hugging Face (download fine-tuned BERT model, ready to use)
Request API Token to access SALMA web service online
Actors | Authenticated user. |
URL schema | https://{domain}/sina/v2/api/salma/?apikey={key} |
Pre-conditions | The user has registered and provided their API Token. |
API Parameters |
|
Flow of events |
|
Retrieved Data | Returns the results in JSON format. |
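As an illustration of the URL schema above, the following sketch fills in the `{domain}` and `{key}` placeholders and decodes the JSON response. The request parameters are not listed in this section, so none are sent here; the function names, the use of a plain GET request, and the `example.org` domain are assumptions made for this example only.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def build_salma_url(domain, api_key):
    """Fill the documented URL schema with a domain and an API token."""
    # The token is percent-encoded so that special characters survive
    # inside the query string.
    return f"https://{domain}/sina/v2/api/salma/?apikey={quote(api_key, safe='')}"


def fetch_salma(domain, api_key, timeout=10.0):
    """Send a GET request and decode the JSON body.

    Assumption: the service answers a plain GET; any further parameters
    it requires are not documented in this section and are omitted.
    """
    with urlopen(build_salma_url(domain, api_key), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder values only; substitute the real domain and your own token.
example_url = build_salma_url("example.org", "abc 123")
```

Calling `fetch_salma` requires a registered API token, as stated in the pre-conditions.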
Mustafa Jarrar, Sanad Malaysha, Tymaa Hammouda, Mohammed Khalilia:
SALMA: Arabic Sense-Annotated Corpus and WSD Benchmarks. In Proceedings of the
Arabic Natural Language Processing Conference (ArabicNLP 2023), Singapore. 2023
Sanad Malaysha, Mustafa Jarrar, Mohammed Khalilia:
Context-Gloss Augmentation for Improving Arabic Target Sense Verification.
In Proceedings of the 12th International Global Wordnet Conference (GWC2023), Global Wordnet Association. San Sebastian, Spain, 2023
Moustafa Al-Hajj, Mustafa Jarrar:
ArabGlossBERT: Fine-Tuning BERT on Context-Gloss Pairs for WSD.
In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 40-48. 2021
Moustafa Al-Hajj, Mustafa Jarrar:
LU-BZU at SemEval-2021 Task 2: Word2Vec and Lemma2Vec performance in Arabic Word-in-Context disambiguation.
In Proceedings of the Fifteenth Workshop on Semantic Evaluation (SemEval2021) Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC), pp. 748-755, Association for Computational Linguistics. 2021