Wojood
A corpus and model for nested Arabic Named Entity Recognition
Version: 1.1 (updated on 1/9/2023)
Wojood consists of about 550K tokens (MSA and dialect) that were manually annotated with 21 entity types (e.g., person, organization, location, event, date). It covers multiple domains and was annotated with nested entities. The corpus contains about 75K entities, 22.5% of which are nested. A nested named entity recognition (NER) model based on BERT was trained using ArBERTv2 (flat 89.2% F1-score, nested 91.68% F1-score).
Corpus size: 550K tokens (MSA and dialects)
Richness: 21 entity classes; ~75K entities, 22.5% of which are nested
Domains: Media, History, Culture, Health, Finance, ICT, Law, Elections, Politics, Migration, Terrorism, Social media
IAA: 97.9% (Cohen's Kappa)
NER Model: ArBERTv2 (flat 89.2% F1-score, nested 91.68% F1-score)
Entity Classes (21):
PERS (person) | EVENT | CARDINAL |
NORP (group of people) | DATE | ORDINAL |
OCC (occupation) | TIME | PERCENT |
ORG (organization), with subtypes | LANGUAGE | QUANTITY |
GPE (geopolitical entity), with subtypes | WEBSITE | UNIT |
LOC (geographical location), with subtypes | LAW | MONEY |
FAC (facility: landmarks, places), with subtypes | PRODUCT | CURR (currency) |
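To make the idea of nested entities concrete, here is a minimal sketch of how nested mentions could be represented as token spans and how the nested ones could be picked out. The `Span` type, the `nested_spans` helper, and the two-token example are illustrative assumptions; they are not taken from the Wojood corpus or its file format. Only the class labels (ORG, GPE) come from the table above.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # index of the first token (inclusive)
    end: int     # index after the last token (exclusive)
    label: str   # entity class, e.g. "ORG" or "GPE"


def nested_spans(spans: List[Span]) -> List[Span]:
    """Return the spans that lie entirely inside a strictly longer span."""
    result = []
    for inner in spans:
        for outer in spans:
            if inner is outer:
                continue
            contained = outer.start <= inner.start and inner.end <= outer.end
            longer = (outer.end - outer.start) > (inner.end - inner.start)
            if contained and longer:
                result.append(inner)
                break  # one enclosing span is enough
    return result


# Hypothetical example: a place name nested inside an organization name.
tokens = ["Birzeit", "University"]
spans = [
    Span(0, 2, "ORG"),  # "Birzeit University"
    Span(0, 1, "GPE"),  # "Birzeit"
]

nested = nested_spans(spans)
nested_share = len(nested) / len(spans)
print(nested, nested_share)
```

A flat annotation scheme would keep only the outer ORG span here; a nested scheme keeps both. The share computed at the end is one possible reading of a "percentage of nested entities" figure, not necessarily the definition used for the corpus statistics.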
Wojood is available to download upon request for academic and commercial use.
Request to download Wojood (Flat/Nested NER corpus, or Wojood_Fine (Wojood subtypes))
GitHub (download BERT training source code + sample data (~35K tokens))
Hugging Face (download fine-tuned BERT model, ready to use)
Request API Token to access Wojood web service online
Actors | Authenticated user. |
URL schema | https://{domain}/sina/v2/api/wojood/?apikey={key} |
Pre-conditions | The user has registered and provided their API Token. |
API Parameters | |
Flow of events | |
Retrieved Data | Returns the results in the specified format. |
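As a rough illustration of the URL schema above, the following sketch fills in the `{domain}` and `{key}` placeholders. The helper name `build_wojood_url`, the domain `example.org`, and the token value are placeholders invented for this example. The API parameters and HTTP method are not listed above, so the sketch only builds the URL and does not send a request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_wojood_url(domain: str, api_key: str) -> str:
    """Fill in the schema https://{domain}/sina/v2/api/wojood/?apikey={key}."""
    if not domain or not api_key:
        raise ValueError("both domain and api_key are required")
    # urlencode percent-escapes characters that are unsafe in a query string
    query = urlencode({"apikey": api_key})
    return f"https://{domain}/sina/v2/api/wojood/?{query}"


# Placeholder values: substitute the real service domain and your own token.
url = build_wojood_url("example.org", "MY-TOKEN")
print(url)

# Round-trip check: the token can be read back from the query string.
print(parse_qs(urlsplit(url).query)["apikey"][0])
```

Encoding the token with `urlencode` rather than plain string concatenation keeps the URL valid even if the token contains characters such as `&` or spaces.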
Mustafa Jarrar, Muhammad Abdul-Mageed, Mohammed Khalilia, Bashar Talafha, AbdelRahim El-madany, Nagham Hamad, Alaa’ Omar:
WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task.
In Proceedings of the 1st Arabic Natural Language Processing Conference (ArabicNLP), part of EMNLP 2023. ACL.
PDF - Slides - Poster
Haneen Liqreina, Mustafa Jarrar, Mohammed Khalilia, Ahmed Oumar El-Shangiti, Muhammad Abdul-Mageed:
Arabic Fine-Grained Entity Recognition.
In Proceedings of the 1st Arabic Natural Language Processing Conference (ArabicNLP), part of EMNLP 2023. ACL.
PDF - Slides
Mustafa Jarrar, Mohammed Khalilia, Sana Ghanem:
Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT.
In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2022), Marseille, France. 2022.
PDF - Slides - Poster - Video