Wojood

A corpus and model for nested Arabic Named Entity Recognition
Version: 1.0 (updated on 20/1/2022)

Wojood consists of about 550K tokens of Modern Standard Arabic (MSA) and dialectal text, manually annotated with 21 entity types (e.g., person, organization, location, event, date). It covers multiple domains and is annotated with nested entities. The corpus contains about 75K entities, 22.5% of which are nested. A nested named entity recognition (NER) model based on BERT was trained on the corpus and achieves an F1-score of 88.4%.

  • Corpus size: 550K tokens (MSA and dialects)
    Richness: 21 entity classes; ~75K entities, 22.5% of which are nested
    Domains: Media, History, Culture, Health, Finance, ICT, Law, Elections, Politics, Migration, Terrorism, Social Media
    IAA: 97.9% (Cohen's Kappa)
    NER Model: AraBERTv2 (88.4% F1-score)
    Entity Classes (21):

    PERS (person)                      EVENT     CARDINAL
    NORP (group of people)             DATE      ORDINAL
    OCC (occupation)                   TIME      PERCENT
    ORG (organization)                 LANGUAGE  QUANTITY
    GPE (geopolitical entity)          WEBSITE   UNIT
    LOC (geographical location)        LAW       MONEY
    FAC (facility: landmarks, places)  PRODUCT   CURR (currency)

    Please email Prof. Jarrar (mjarrar AT birzeit.edu) for the annotation guidelines

  • Wojood is available to download upon request for academic and commercial use.
    Request to download Wojood (Nested NER corpus, 550K tokens)
    GitHub (download BERT training source code + sample data (~35K tokens))
    Hugging Face (download fine-tuned BERT model, ready to use)

  • Request API Token to access Wojood web service online

    Actors: Authenticated user.
    URL schema: https://{domain}/sina/v2/api/wojood/?apikey={key}
    Pre-conditions: The user has registered and provided their API Token.
    API Parameters:
      mode and sentence are sent in the request body; apikey is passed in the URL.
    1. mode: output format, one of (1) JSON in IOB format, (2) XML format, or (3) entities and their positions in JSON.
    2. sentence: the Arabic text to tag.
    3. apikey: a key (provided offline) to access the API.
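A request that follows the URL schema and parameters above might be assembled as in the sketch below. Several details are assumptions, since the description does not state them: the HTTP method (POST), the body encoding (JSON), and sending mode as an integer. The function name build_wojood_request is hypothetical, and the domain must be supplied by the caller. The sketch only builds the request; it does not send it.

```python
import json
import urllib.parse
import urllib.request


def build_wojood_request(domain: str, api_key: str, sentence: str, mode: int) -> urllib.request.Request:
    """Build (but do not send) a request for the Wojood web service.

    Assumptions not confirmed by the service description:
    POST method, JSON-encoded body, integer-valued mode.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1 (JSON IOB), 2 (XML), or 3 (entities with positions)")

    # The API key goes in the query string, as in the URL schema.
    query = urllib.parse.urlencode({"apikey": api_key})
    url = f"https://{domain}/sina/v2/api/wojood/?{query}"

    # mode and sentence travel in the body; keep Arabic text as UTF-8.
    body = json.dumps({"mode": mode, "sentence": sentence}, ensure_ascii=False).encode("utf-8")

    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; the response format depends on the chosen mode.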
    Flow of events:
    1. The system checks whether the API Key (i.e., Token) is authenticated.
    2. If it is not authenticated, the system returns error code -3 in JSON format.
    3. If it is authenticated, the system checks the access limit. If the limit is exceeded, the system returns error code -1 in JSON format; if not, the system logs the request.
    4. If the request is accepted, the system extracts the entities from the text.
    5. Otherwise, the system returns error code -4.
    6. The system returns the results in the specified format.
    Retrieved Data: the results in the specified format.
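The flow of events above can be read as a sequence of checks. The sketch below models that sequence in memory; it is not the service's implementation. The class name WojoodFlowSketch, the {"error": code} and {"result": ...} shapes, and the per-key request counter are hypothetical. The description does not say exactly what triggers -4, so the sketch assumes it covers a request that passes both checks but cannot be processed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set, Tuple

ERR_LIMIT_EXCEEDED = -1     # access limit exceeded
ERR_NOT_AUTHENTICATED = -3  # API key not authenticated
ERR_NOT_PROCESSED = -4      # assumed meaning: request could not be processed


@dataclass
class WojoodFlowSketch:
    valid_keys: Set[str]
    access_limit: int                     # hypothetical per-key request quota
    extract: Callable[[str, int], Any]    # stand-in for the NER model
    usage: Dict[str, int] = field(default_factory=dict)
    request_log: List[Tuple[str, int, str]] = field(default_factory=list)

    def handle(self, apikey: str, mode: int, sentence: str) -> Dict[str, Any]:
        # Steps 1-2: authenticate the key.
        if apikey not in self.valid_keys:
            return {"error": ERR_NOT_AUTHENTICATED}
        # Step 3: enforce the access limit, then log the request.
        if self.usage.get(apikey, 0) >= self.access_limit:
            return {"error": ERR_LIMIT_EXCEEDED}
        self.usage[apikey] = self.usage.get(apikey, 0) + 1
        self.request_log.append((apikey, mode, sentence))
        # Steps 4-5: extract entities, or report -4 if that fails (assumption).
        try:
            result = self.extract(sentence, mode)
        except Exception:
            return {"error": ERR_NOT_PROCESSED}
        # Step 6: return the results in the requested format.
        return {"result": result}
```

A client could use the same three codes to decide whether to request a new key (-3), wait for the quota to reset (-1), or inspect the input (-4).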
  • Mustafa Jarrar, Mohammed Khalilia, Sana Ghanem: Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2022), Marseille, France. 2022
