SALMA Corpus

A corpus and model for Arabic Word Sense Disambiguation (WSD).
Version: 1.0 (updated on 22/10/2023)

SALMA consists of about 34K tokens that are manually annotated with 4,151 unique senses (collected from two Arabic lexicons, namely Modern and Ghani) and 4,389 named entities. It was collected from 33 online media sources written in Modern Standard Arabic (MSA) and covering general topics. An end-to-end baseline WSD model was trained based on Target Sense Verification (TSV) using BERT, and it achieved state-of-the-art results for Arabic (accuracy: 84.2%).
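To illustrate how a TSV-style model can be used for end-to-end WSD, here is a minimal sketch. It assumes the usual framing in which each candidate gloss is paired with the context and scored, and the highest-scoring gloss is selected. The scorer `toy_score` is a hypothetical word-overlap stand-in for a fine-tuned BERT classifier, and the function names, sense identifiers, and English example are illustrative only; none of them come from the SALMA code base.

```python
from typing import Callable, Dict, Tuple


def toy_score(context: str, gloss: str) -> float:
    """Hypothetical stand-in for a BERT verifier.

    Returns the fraction of gloss words that also occur in the context.
    A real TSV model would return the probability that the gloss fits.
    """
    context_words = set(context.lower().split())
    gloss_words = gloss.lower().split()
    if not gloss_words:
        return 0.0
    return sum(w in context_words for w in gloss_words) / len(gloss_words)


def disambiguate(
    context: str,
    glosses: Dict[str, str],
    score: Callable[[str, str], float] = toy_score,
) -> Tuple[str, float]:
    """Score every (context, gloss) pair and return the best sense id and score."""
    scored = {sense_id: score(context, gloss) for sense_id, gloss in glosses.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Illustrative candidate senses (assumed, not taken from the SALMA lexicons).
candidates = {
    "bank.finance": "an institution that holds money",
    "bank.river": "the land beside a river",
}
best_sense, best_score = disambiguate("he sat on the bank of the river", candidates)
```

Swapping `toy_score` for a function that runs a fine-tuned sequence-pair classifier would leave the selection logic unchanged, which is the point of the sketch.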

  • Corpus size: 34,253 tokens (MSA)
    Richness: 8,760 unique tokens and 3,875 unique lemmas, distributed across 2,904 nouns, 677 verbs, 119 function words, and 175 punctuation marks and digits
    Domains: general topics
    IAA: 92% (Quadratic Weighted Kappa)
    WSD Model: end-to-end WSD system (84.2% accuracy)
    Named Entity Classes:

    Tag   Description
    PERS  Person names: first, middle, last, nickname ...
    ORG   Organizations: company, team, government ...
    GPE   Geopolitical entities: country, city, state ...
    LOC   Geographical locations: river, sea, mountain ...
    FAC   Facilities: landmark, road, building, airport ...
    CURR  Currency names or symbols.

  • SALMA is available to download upon request for academic and commercial use.
    Request to download SALMA (sense annotated corpus and model, ~34K tokens)
    GitHub (download BERT training source code)
    Hugging Face (download fine-tuned BERT model, ready to use)

  • Request API Token to access SALMA web service online

    Actors: Authenticated user.
    URL schema: https://{domain}/sina/v2/api/salma/?apikey={key}
    Pre-conditions: The user has registered and provided their API token.
    API parameters (the sentence is sent in the request body):
      1. sentence: Arabic text.
      2. apikey: a key (provided offline) to access the API.
    Flow of events:
      1. The system checks whether the API key (i.e., token) is authenticated.
      2. If it is not authenticated, the system returns error code -3 in JSON format.
      3. If it is authenticated but the access limit has been exceeded, the system returns error code -1 in JSON format; if the limit has not been exceeded, the system logs the request.
      4. If so, the system extracts the entities from the text.
      5. Otherwise, the system returns error code -4.
      6. The system returns the results in the specified format.
    Retrieved data: The results are returned in JSON format.
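The request described above could be assembled roughly as follows. This is only a sketch under stated assumptions: the description does not specify the HTTP method, the body encoding, or the shape of the JSON response, so the use of POST, a JSON body with a `sentence` field, and the helper names `build_request`, `describe_error`, and `send` are all hypothetical. Only the URL schema, the `apikey` parameter, and the error codes -3, -1, and -4 come from the description.

```python
import json
import urllib.parse
import urllib.request

# Error codes listed in the flow of events. The condition for -4 is not
# spelled out in the description, so its message is deliberately generic.
ERROR_CODES = {
    -3: "API key is not authenticated",
    -1: "access limit exceeded",
    -4: "request could not be processed",
}


def build_request(domain: str, api_key: str, sentence: str) -> urllib.request.Request:
    """Build (but do not send) a request that follows the URL schema above.

    Assumptions: POST method and a JSON body of the form {"sentence": ...}.
    """
    query = urllib.parse.urlencode({"apikey": api_key})
    url = f"https://{domain}/sina/v2/api/salma/?{query}"
    body = json.dumps({"sentence": sentence}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def describe_error(code: int) -> str:
    """Map a documented error code to a readable message."""
    return ERROR_CODES.get(code, "unknown code")


def send(request: urllib.request.Request):
    """Send the request and decode the JSON reply (response shape is not documented)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A caller would replace the domain and key with the values received after registration, for example `send(build_request(domain, key, sentence))`, and then check the decoded reply for one of the codes above.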
  • Mustafa Jarrar, Sanad Malaysha, Tymaa Hammouda, Mohammed Khalilia: SALMA: Arabic Sense-Annotated Corpus and WSD Benchmarks. In Proceedings of the Arabic Natural Language Processing Conference (ArabicNLP 2023), Singapore. 2023

    Sanad Malaysha, Mustafa Jarrar, Mohammed Khalilia: Context-Gloss Augmentation for Improving Arabic Target Sense Verification. The 12th International Global Wordnet Conference (GWC2023), Global Wordnet Association. (pp. ). San Sebastian, Spain, 2023

    Moustafa Al-Hajj, Mustafa Jarrar: ArabGlossBERT: Fine-Tuning BERT on Context-Gloss Pairs for WSD. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021). pp. 40-48, 2021

    Moustafa Al-Hajj, Mustafa Jarrar: LU-BZU at SemEval-2021 Task 2: Word2Vec and Lemma2Vec performance in Arabic Word-in-Context disambiguation. In Proceedings of the Fifteenth Workshop on Semantic Evaluation (SemEval-2021) Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC). pp. 748-755, Association for Computational Linguistics. 2021