Wojood

A corpus and model for nested Arabic Named Entity Recognition
Version: 1.1 (updated on 1/9/2023)

Wojood consists of about 550K tokens (MSA and dialect) that are manually annotated with 21 entity types (e.g., person, organization, location, event, date). It covers multiple domains and is annotated with nested entities. The corpus contains about 75K entities, of which 22.5% are nested. A nested named entity recognition (NER) model based on BERT was trained using ArBERTv2 (flat 89.2% F1-score, nested 91.68% F1-score).

  • Corpus size: 550K tokens (MSA and dialects)
    Richness: 21 entity classes; ~75K entities, 22.5% of which are nested
    Domains: Media, History, Culture, Health, Finance, ICT, Law, Elections, Politics, Migration, Terrorism, Social Media
    IAA: 97.9% (Cohen's Kappa)
    NER Model: ArBERTv2 (flat 89.2% F1-score, nested 91.68% F1-score)
    Entity Classes (21):

    PERS (person)                      EVENT     CARDINAL
    NORP (group of people)             DATE      ORDINAL
    OCC (occupation)                   TIME      PERCENT
    ORG (organization)                 LANGUAGE  QUANTITY
    GPE (geopolitical entity)          WEBSITE   UNIT
    LOC (geographical location)        LAW       MONEY
    FAC (facility: landmarks, places)  PRODUCT   CURR (currency)

    ORG, GPE, LOC, and FAC each have subtypes:
    ORG: GOV, COM, EDU, ENT, NONGOV, MED, REL, SCI, SPO, ORG_FAC
    GPE: COUNTRY, STATE-OR-PROVINCE, TOWN, NEIGHBORHOOD, CAMP, GPE_ORG, SPORT
    LOC: CONTINENT, CLUSTER, ADDRESS, BOUNDARY, CELESTIAL, WATER-BODY, LAND-REGION-NATURAL, REGION-GENERAL, REGION-INTERNATIONAL
    FAC: PLANT, AIRPORT, BUILDING-OR-GROUNDS, SUBAREA-FACILITY, PATH

    Versions: Wojood 1.1 is the same as Wojood-V1, with some annotations corrected. For the baselines of Wojood-V1, please refer to [1]; for Wojood-V1.1, please refer to [2].

    Please email Prof. Jarrar (mjarrar AT birzeit.edu) for the annotation guidelines
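
To make the notion of nested entities concrete, the following is a minimal sketch of how nested mentions could be represented as token spans and how a nested share (such as the 22.5% quoted above) might be computed. The `Span` type, the containment-based definition of "nested", and the example spans are all assumptions made for illustration; they are not the corpus file format or the authors' counting method.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical entity mention over token indices (not the Wojood file format)."""
    start: int   # index of the first token (inclusive)
    end: int     # index after the last token (exclusive)
    label: str   # one of the entity classes, e.g. "ORG" or "GPE"


def is_nested(inner: Span, outer: Span) -> bool:
    """Return True if `inner` lies inside `outer` and covers a different token range."""
    contained = outer.start <= inner.start and inner.end <= outer.end
    same_range = (inner.start, inner.end) == (outer.start, outer.end)
    return contained and not same_range


def nested_fraction(spans: List[Span]) -> float:
    """Share of spans that are contained in at least one other span."""
    if not spans:
        return 0.0
    nested = sum(any(is_nested(s, o) for o in spans) for s in spans)
    return nested / len(spans)


# Illustrative sentence of four tokens: an ORG mention over tokens 0-1
# that contains a GPE mention at token 1, plus an unrelated DATE at token 3.
example = [Span(0, 2, "ORG"), Span(1, 2, "GPE"), Span(3, 4, "DATE")]
```

Here `nested_fraction(example)` counts only the GPE span as nested, since it is the only span contained in another one.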

  • Wojood is available to download upon request for academic and commercial use.
    Request to download Wojood (Flat/Nested NER corpus, or Wojood_Fine (Wojood subtypes))
    GitHub (download BERT training source code + sample data (~35K tokens))
    Hugging Face (download fine-tuned BERT model, ready to use)

  • Request API Token to access Wojood web service online

    Actors: Authenticated user.
    URL schema: https://{domain}/sina/v2/api/wojood/?apikey={key}
    Pre-conditions: The user has registered and provided their API token.
    API parameters (mode and sentence are sent in the request body):
    1. mode: output format: (1) JSON in IOB format, (2) XML format, or (3) entities and their positions in JSON.
    2. sentence: Arabic text.
    3. apikey: a key (provided offline) to access the API.
    Flow of events:
    1. The system checks whether the API key (i.e., token) is authenticated.
    2. If not authenticated, the system returns a (-3) error code in JSON format.
    3. If authenticated, the system checks the access limit. If the limit is exceeded, it returns (-1) in JSON format; otherwise, it logs the request.
    4. If so, the system extracts the entities from the text.
    5. Otherwise, the system returns a (-4) error code.
    6. The system returns the results in the specified format.
    Retrieved data: The results in the specified format.
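
The request described above could be assembled roughly as follows. This is a sketch under stated assumptions, not the service's documented client: the page only says that mode and sentence travel in the body and that apikey appears in the URL, so the use of POST, a JSON-encoded body, an integer `mode`, and the wording of the error messages are all assumptions. The helper names `build_request` and `describe_error` are hypothetical, and `{domain}` is left as a caller-supplied value because the page does not give it.

```python
import json
import urllib.parse
import urllib.request

# Error codes named in the flow of events; the message wording is ours.
ERROR_MESSAGES = {
    -1: "access limit exceeded",
    -3: "API key not authenticated",
    -4: "processing failed (step 5 of the flow)",
}


def build_request(domain: str, apikey: str, sentence: str, mode: int) -> urllib.request.Request:
    """Build (but do not send) a request following the URL schema above.

    Assumptions: HTTP POST and a JSON body; the page does not specify either.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1 (JSON IOB), 2 (XML), or 3 (entities with positions)")
    query = urllib.parse.urlencode({"apikey": apikey})
    url = f"https://{domain}/sina/v2/api/wojood/?{query}"
    body = json.dumps({"mode": mode, "sentence": sentence}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def describe_error(code: int) -> str:
    """Map an error code from the flow of events to a readable message."""
    return ERROR_MESSAGES.get(code, f"unknown error code {code}")
```

Passing the result to `urllib.request.urlopen` would send the request once a real domain and API token are supplied. The shape of the returned JSON or XML is not described on this page, so response parsing is left out.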
  • Mustafa Jarrar, Muhammad Abdul-Mageed, Mohammed Khalilia, Bashar Talafha, AbdelRahim El-madany, Nagham Hamad, Alaa’ Omar: WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task. In Proceedings of the 1st Arabic Natural Language Processing Conference (ArabicNLP), Part of the EMNLP 2023. ACL.
    PDF - Slides - Poster

    Haneen Liqreina, Mustafa Jarrar, Mohammed Khalilia, Ahmed Oumar El-Shangiti, Muhammad Abdul-Mageed: Arabic Fine-Grained Entity Recognition. In Proceedings of the 1st Arabic Natural Language Processing Conference (ArabicNLP), Part of the EMNLP 2023. ACL.
    PDF - Slides

    Mustafa Jarrar, Mohammed Khalilia, Sana Ghanem: Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2022), Marseille, France. 2022
    PDF - Slides - Poster - Video