Wojood (وجود)
Named Entity Recognition
Nested and flat NER with 21 entity types and 31 entity subtypes (open source).
Performance: Nested (89.42%), Flat (87.33%)
Models: Flat, Nested, and Fine-grained BERT models.
Wojood Corpus:
560K tokens (MSA and dialect), manually annotated with 21
entity types as both nested and flat entities (see
Article). It covers multiple domains (Media, History, Culture,
Health, Finance, ICT, Law, Elections, Politics, Migration,
Terrorism, Social Media).
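To make the difference between nested and flat annotation concrete, here is a minimal sketch in Python. It represents entity mentions as token spans and derives a flat view by keeping only the outermost span of each nested group. The `Span` class, the `flatten` helper, and the English example sentence are hypothetical and used for illustration only; keeping the outermost span is one common convention and is an assumption here, not necessarily how the flat version of Wojood was produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A hypothetical entity mention over a token sequence."""
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)
    label: str  # entity tag, e.g. "ORG" or "GPE"


def flatten(spans):
    """Return only the outermost spans (assumed flattening convention).

    A span is dropped when it lies entirely inside a span already kept.
    """
    outer = []
    # Sort by start position; for equal starts, longer spans come first,
    # so an enclosing span is always seen before the spans it contains.
    for s in sorted(spans, key=lambda s: (s.start, -(s.end - s.start))):
        if outer and s.start >= outer[-1].start and s.end <= outer[-1].end:
            continue  # nested inside the most recently kept span
        outer.append(s)
    return outer


# Illustrative sentence (not taken from the corpus).
tokens = ["Birzeit", "University", "in", "Palestine"]

# Nested annotation: a GPE mention sits inside an ORG mention.
nested = [
    Span(0, 2, "ORG"),  # "Birzeit University"
    Span(0, 1, "GPE"),  # "Birzeit"
    Span(3, 4, "GPE"),  # "Palestine"
]

flat = flatten(nested)
for s in flat:
    print(s.label, " ".join(tokens[s.start:s.end]))
```

In this sketch the nested view keeps all three mentions, while the flat view keeps only the ORG mention and the standalone GPE mention.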
WojoodFine Corpus:
Same as Wojood but extended with entity subtypes (51
tags in total) (see
Article).
WojoodGaza Corpus:
60K tokens related to the Israeli War on Gaza, across multiple domains (see
Article).
Tags and Guidelines:
NORP (group of people) | DATE | ORDINAL |
OCC (occupation) | TIME | PERCENT |
ORG (organization) subtypes | LANGUAGE | QUANTITY |
GPE (geopolitical entity) subtypes | WEBSITE | UNIT |
LOC (geographical location) subtypes | LAW | MONEY |
FAC (facility: landmarks, places) subtypes | PRODUCT | CURR (currency) |
SinaTools:
NER module as a Python library.
GitHub:
training source code + sample data (~35K tokens).
Hugging Face:
BERT model fine-tuned on Wojood.
Wojood Corpus
(Corpus only)
WojoodGaza Corpus
(Corpus only)
WojoodFine Corpus
(Corpus only)
Mustafa Jarrar, Nagham Hamad, Mohammed Khalilia, Bashar Talafha, AbdelRahim Elmadany, Muhammad Abdul-Mageed: WojoodNER 2024: The Second Arabic Named Entity Recognition Shared Task. In Proceedings of the Second Arabic Natural Language Processing Conference (ArabicNLP 2024), Bangkok, Thailand. Association for Computational Linguistics.
Mustafa Jarrar, Muhammad Abdul-Mageed, Mohammed Khalilia, Bashar Talafha, AbdelRahim Elmadany, Nagham Hamad, Alaa’ Omar: WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task. In Proceedings of the 1st Arabic Natural Language Processing Conference (ArabicNLP), part of EMNLP 2023. ACL.
Haneen Liqreina, Mustafa Jarrar, Mohammed Khalilia, Ahmed Oumar El-Shangiti, Muhammad Abdul-Mageed: Arabic Fine-Grained Entity Recognition. In Proceedings of the 1st Arabic Natural Language Processing Conference (ArabicNLP), part of EMNLP 2023. ACL.
Mustafa Jarrar, Mohammed Khalilia, Sana Ghanem: Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2022), Marseille, France. 2022.