Registration is now open until 29 April 2024 (extended from 5 April 2024). To register your team, please use this link.

For further details, you can reach us via Slack.


Natural language understanding (NLU) is a core aspect of natural language processing (NLP), facilitating semantics-based human-machine interactions. One of the key challenges in Arabic is ambiguity: Arabic exhibits morphological richness, with a complex interplay of roots, stems, and affixes that renders words susceptible to multiple interpretations depending on their morphology. Ambiguity can lead to misunderstandings, incorrect interpretations, and errors in NLP applications. A core NLU task is Word Sense Disambiguation (WSD), along with its special case, Location Mention Disambiguation (LMD). WSD aims to determine the correct sense of an ambiguous word in context, while LMD focuses on disambiguating a location mention that could refer to multiple toponyms, i.e., particular places or locations. Both tasks are vital in NLP and information retrieval, as they help systems correctly interpret and extract information from text. In this shared task we introduce two subtasks, WSD and LMD.

Subtask 1: Word Sense Disambiguation (WSD)
Polysemous words, which convey multiple meanings in different contexts, motivate the WSD task. WSD aims to disambiguate a word's semantics. Given a context (i.e., a sentence), a target word in the context, and a set of candidate senses (i.e., glosses or definitions) for the target word, the goal of the WSD task is to determine which of these senses is the intended meaning of the target word. Participants can use any machine learning approach, including deep learning and generative methods.
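To make the input/output contract concrete, here is a minimal knowledge-light sketch of one possible approach: a simplified Lesk gloss-overlap heuristic that picks the candidate gloss sharing the most tokens with the sentence. The function name and toy English glosses are ours, purely for illustration; competitive systems would use lemmatization, embeddings, or prompting of generative models.

```python
# Simplified Lesk: score each candidate gloss by its token overlap with
# the sentence and return the index of the best-scoring gloss.

def disambiguate(sentence: str, glosses: list[str]) -> int:
    """Return the index of the candidate gloss with the largest token overlap."""
    context = set(sentence.split())
    overlaps = [len(context & set(g.split())) for g in glosses]
    return max(range(len(glosses)), key=overlaps.__getitem__)

glosses = [
    "a financial institution that accepts deposits",  # sense 0
    "the sloping land beside a river",                # sense 1
]
print(disambiguate("she made deposits at the bank institution", glosses))  # 0
```

The heuristic trivially generalizes to Arabic glosses, since it only compares token sets.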

WSD dataset: The SALMA corpus is the first sense-annotated corpus for Arabic. SALMA contains 1,440 sentences and 34K tokens (8,760 unique tokens with 3,875 unique lemmas). All tokens are manually sense-annotated, with a total of 4,151 senses. Additional details about the dataset can be found in this article.

Participants will be provided with the development and test datasets. The development set consists of 100 sentences randomly selected from SALMA, along with the set of candidate senses (glosses) and the target/correct sense for each word in each sentence. The rest of the SALMA corpus (1,340 sentences) will be shared as the test set. The test set is similar to the development set, but it will not include the target/correct senses. No training data will be shared with the participants; this is to encourage participants to leverage generative-model techniques such as in-context learning and chain-of-thought prompting. Participants can also utilize external datasets, sense inventories, or lexicons during the development phase.

WSD dataset format: The data is in JSON format and follows the schema below. The data will be shared with the participants after registration.

  {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "sentence_id": {"type": "integer"},
        "sentence": {"type": "string"},
        "words": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "word_id": {"type": "integer"},
              "word": {"type": "string"},
              "senses": {"type": "array", "items": {"type": ["integer", "null"]}},
              "target_sense": {"type": ["integer", "null"]}
            }
          }
        }
      }
    }
  }
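For illustration, data following this schema can be consumed as below. This is only a sketch: the commented-out file name dev.json and the decision to skip words without candidate senses (e.g., punctuation) are our assumptions.

```python
import json

# Iterate WSD instances from records that follow the schema above.
# records = json.load(open("dev.json", encoding="utf-8"))

def iter_instances(records):
    """Yield (sentence, word, candidate senses, gold sense) tuples."""
    for rec in records:
        for w in rec["words"]:
            if w.get("senses"):  # skip words with no candidate senses
                yield rec["sentence"], w["word"], w["senses"], w.get("target_sense")

sample = [{
    "sentence_id": 1,
    "sentence": "ذهب إلى البنك",
    "words": [{"word_id": 3, "word": "البنك", "senses": [101, 102], "target_sense": 101}],
}]
print(list(iter_instances(sample)))
```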

Evaluation: Participants will be given the test set (1,340 sentences) and, for each target word, a set of candidate senses (i.e., glosses). Participants are expected to submit their results following the JSON schema defined above. Accuracy (%) will be used to evaluate the submitted results, calculated as the number of correctly disambiguated instances divided by the total number of instances.
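The metric can be sketched as follows (the function name is illustrative):

```python
# Accuracy as defined above: correctly disambiguated instances divided by
# the total number of instances, reported as a percentage.

def accuracy(predicted: list[int], gold: list[int]) -> float:
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)

print(accuracy([1, 2, 3, 4], [1, 2, 0, 4]))  # 75.0
```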

Baseline: Our baseline is computed using a sliding window of words surrounding the target word. We used the accuracy metric to measure the performance of our model, as shown in the following table:



Context window                               Accuracy (%)
All words in sentence                        82.8
11 words (5 left + target word + 5 right)    84.2
9 words                                      83.5
7 words                                      83.8
5 words                                      84.0
3 words                                      82.8
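The window sizes above can be reproduced by clipping n words on each side of the target (a sketch; whitespace tokenization is assumed):

```python
# Extract a context window of n words on each side of the target token,
# clipped at sentence boundaries (n=5 yields the 11-word setting above).

def context_window(tokens: list[str], target_idx: int, n: int) -> list[str]:
    start = max(0, target_idx - n)
    return tokens[start : target_idx + n + 1]

tokens = "w0 w1 w2 w3 w4 w5 w6".split()
print(context_window(tokens, 3, 1))  # ['w2', 'w3', 'w4']
print(context_window(tokens, 1, 5))  # clipped at the left boundary
```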

Google Colab notebook: To allow you to experiment with the baseline, we authored a Google Colab notebook that demonstrates how to load the data and how to use our WSD system to predict the sense of each target word. The notebook can be found here.

Subtask 2: Location Mention Disambiguation (LMD)
Offering LMD as a separate subtask supports the development of precise models capable of accurately resolving Location Mentions (LMs) within microblogs and linking them to toponyms in geo-positioning databases. LMD poses challenging retrieval and classification problems, such as lack of context, toponymic polysemy, and toponymic homonymy, to name a few. Indeed, the LMD task is understudied for the Arabic language.

Problem Definition:

Given the following inputs: a post \(p\) (a tweet in our dataset); a set of location mentions (LMs) \(L_p =\{l_i; i \in [1,n_p]\}\) in post \(p\), where \(l_i\) is the \(i\)th location mention and \(n_p\) is the total number of location mentions in \(p\); and a geo-positioning database \(G\) (i.e., OSM) that consists of a set of toponyms \(T = \{t_j; j \in [1,k]\}\), where \(t_j\) is the \(j\)th toponym and \(k\) is the number of toponyms in \(G\). The LMD system aims to match every location mention \(l_i\) in post \(p\) to the toponym \(t_j\) in OSM that accurately represents it, if one exists.

In our shared task, we frame the LMD task as a candidate retrieval and ranking problem. For every location mention, the LMD system has to retrieve a ranked list of up to 3 candidate toponyms from OSM, ordered by their probability of being the target toponym. In other words, the LMD problem can be decomposed into two sub-problems: 1) candidate retrieval, which aims to retrieve a list of candidate toponyms from OSM, and 2) candidate ranking, which aims to rank the list of retrieved candidates.
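A minimal sketch of this two-stage pipeline using Nominatim, OSM's public search endpoint (the endpoint and its osm_type/osm_id response fields are part of the real Nominatim API; trusting Nominatim's own result order as the ranking is our simplifying assumption):

```python
import urllib.parse

# Stage 1: build a Nominatim search request for a location mention.
# Stage 2: keep the top-k hits, in returned order, as the ranked candidates.

NOMINATIM = "https://nominatim.openstreetmap.org/search"

def build_query_url(mention: str, limit: int = 3) -> str:
    """Construct the Nominatim search URL for one location mention."""
    params = {"q": mention, "format": "json", "limit": limit}
    return NOMINATIM + "?" + urllib.parse.urlencode(params)

def rank_candidates(hits: list[dict], k: int = 3) -> list[str]:
    """Convert hits to toponym ids: first letter of osm_type + osm_id."""
    return [h["osm_type"][0] + str(h["osm_id"]) for h in hits[:k]]

# A hand-written response stub in Nominatim's result shape:
hits = [{"osm_type": "relation", "osm_id": 12927915},
        {"osm_type": "node", "osm_id": 1139565566}]
print(build_query_url("Beirut"))
print(rank_candidates(hits))  # ['r12927915', 'n1139565566']
```

A stronger candidate-ranking stage could re-score the hits with mention-toponym string similarity or geographic context from the post.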

LMD Dataset: IDRISI-DA is the first Arabic LMD dataset. It was carefully constructed to ensure domain and geographical generalizability, enabling fair evaluation of models across various disaster types (such as floods and bombings) as well as different geographical areas. Notably, it encompasses 2,869 posts in diverse dialects, featuring 3,893 location mentions, of which 763 are unique, across seven countries (more statistics below). We use the standard 70:10:20 splits per event.

             #tweets    #LMs    #Unique LMs
Train          2,170   3,997            530
Validation       333     818            254
Test             791   1,630            308

The training and development sets will be provided to the participants after registration. For each post/tweet in the train and dev sets, we provide annotations of the location mentions; each location mention is accompanied by its correct toponym from the OpenStreetMap (OSM) gazetteer. Each toponym includes several attributes, such as geo-coordinates and address. More details about the dataset (and how one can generate negative examples, if needed) can be found in this paper. The JSON schema of the data is provided below. The test set that will be shared with the participants includes the annotations of the location mentions but does not include the correct toponym from the OSM gazetteer. Participants are expected to query OpenStreetMap to retrieve candidate toponyms and perform disambiguation to return the correct toponym.

OpenStreetMap can be accessed through its public API or through the library/code that we will also provide in this shared task, to make it easy for the participants.

        {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "tweet_id": {"type": "integer"},
              "text": {"type": "string"},
              "location_mentions": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "location_mention": {"type": "string"},
                    "location_mention_id": {"type": "integer"},
                    "toponym": {"type": "string"}
                  }
                }
              }
            }
          }
        }

Evaluation: Participants are expected to submit the test set with disambiguated location mentions. The submission file should be in JSONL format: each line is a JSON object corresponding to one candidate toponym retrieved for a location mention. The properties of the prediction file are described in the table below:

Column name   Description
lm_id         A string that concatenates the tweet_id and location_mention_id.
              For example, if tweet_id = 1290755929869373443 and
              location_mention_id = 1, then lm_id = '1290755929869373443_1'.
toponym_id    A string that concatenates the first character of the osm_type and
              the osm_id properties extracted from OSM. For example, if
              osm_type = way and osm_id = 708764827, then toponym_id = w708764827.
rank          An integer indicating the rank of the toponym in the returned list
              of candidate toponyms.

For example, for the location mention "مرفأ بيروت" (Beirut Port), the prediction file should contain multiple lines, each containing a candidate toponym and its rank. An example is below:

{"lm_id": "1290755929869373443_1", "toponym_id": "r12927915", "rank": 1}
{"lm_id": "1290755929869373443_1", "toponym_id": "n1139565566", "rank": 2}
{"lm_id": "1290755929869373443_1", "toponym_id": "w777711541", "rank": 3}
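Such lines can be generated as follows (a sketch; prediction_lines is an illustrative helper, not part of the provided tooling):

```python
import json

# Build one JSONL line per candidate toponym, following the lm_id,
# toponym_id, and rank conventions described above.

def prediction_lines(tweet_id: int, location_mention_id: int,
                     toponym_ids: list[str]) -> list[str]:
    return [
        json.dumps({"lm_id": f"{tweet_id}_{location_mention_id}",
                    "toponym_id": t, "rank": r})
        for r, t in enumerate(toponym_ids, start=1)
    ]

for line in prediction_lines(1290755929869373443, 1,
                             ["r12927915", "n1139565566", "w777711541"]):
    print(line)
```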

Systems will be evaluated using Mean Reciprocal Rank (MRR@k), where k indicates the cutoff of the retrieved ranked list of candidate toponyms. We will set k to 1, 2, and 3. MRR@1 is the primary measure on which submissions will be ranked on the leaderboard. In our context, MRR@1 is equivalent to the accuracy measure. Ties will be broken by a secondary ranking using MRR@2, then MRR@3.
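The metric can be sketched as follows: for each location mention, take the reciprocal rank of the gold toponym within the top-k predictions (0 if absent), then average over all mentions (the data structures here are illustrative, not the submission format):

```python
# MRR@k over predictions keyed by lm_id; each value is a ranked candidate list.

def mrr_at_k(predictions: dict[str, list[str]], gold: dict[str, str], k: int) -> float:
    total = 0.0
    for lm_id, gold_toponym in gold.items():
        ranked = predictions.get(lm_id, [])[:k]
        if gold_toponym in ranked:
            total += 1.0 / (ranked.index(gold_toponym) + 1)
    return total / len(gold)

preds = {"a_1": ["t1", "t2", "t3"], "b_1": ["t9", "t4"]}
gold = {"a_1": "t1", "b_1": "t4"}
print(mrr_at_k(preds, gold, 1))  # 0.5  (only a_1 is correct at rank 1)
print(mrr_at_k(preds, gold, 3))  # 0.75 (b_1 contributes 1/2 at rank 2)
```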

Baseline: We use the OSM search API as a baseline. For each location mention, we query the API and compute MRR@k of the returned hits against our ground-truth toponyms. The baseline performance is presented below.

Metric   Validation   Test
MRR@1    0.5795       0.6270
MRR@2    0.6204       0.6669
MRR@3    0.6298       0.6701

Google Colab notebook: A notebook was created to demonstrate how to pull the data from Hugging Face, call the OSM API to retrieve candidate toponyms, and evaluate the results using MRR@k. The notebook assumes OSM is the baseline model. The notebook can be found here.


Important Dates

         - March 1, 2024: Shared Task Registration Open
         - March 15, 2024: Data-sharing and Evaluation on Development Set Available
         - April 29, 2024: Shared Task Registration Deadline (extended from April 5, 2024)
         - April 10, 2024: Test Set Published
         - May 03, 2024: Evaluation on Test Set (TEST) Deadline
         - May 15, 2024: Shared Task System Paper Submission Due
         - June 17, 2024: Notification of Acceptance
         - July 1, 2024: Camera-ready Version Due
         - August 16, 2024: ArabicNLP Conference.


Organizers

         - Mohammed Khalilia, Qualtrics/Birzeit University, USA (Contact Person)
         - Imed Zitouni, Google, USA
         - Mustafa Jarrar, Birzeit University, Palestine
         - Tamer Elsayed, Qatar University, Qatar
         - Sanad Malaysha, Birzeit University, Palestine
         - Ala’ Jabari, Birzeit University, Palestine
         - Reem Suwaileh, Hamad Bin Khalifa University, Qatar