SinaLab NLP Resources

Resources

Demo and download our tools and datasests

+ -

Arabic Ontology

الأنطولوجيا العربية

An Arabic Wordnet with ontologically-clean content. Classification of the meanings of the Arabic terms, (see Article, see FAQ).

Web Service

REST

Search Ontology

Online

Ontology Tree

Browse

Download Ontology

Search Arabic Ontology for a term

Actors	Authenticated user.
URL schema	https://{domain}/api/OntologyTermSearch/{term}?page={page-no}&limit={pageSize}&apikey={key}
Pre-conditions	The user has registered and provided their API Key.
API Parameters	page: number of results page. limit: number of results per page. apikey: a key (provided offline) to access the API.
Flow of events	The system checks if the user is authenticated or not. If not authenticated, the system returns (-3) error code in JSON format. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request. The system checks the page size to be between 1 and 1000. If so, the system performs the required search query. Otherwise, the system returns (-4) error code. The system returns the JSON data object.
Data	results JSON object (list of ontology concepts).

Example: virus

https://ontology.birzeit.edu/sina/api/OntologyTermSearch/virus/?page=1&limit=5&apikey=sampleKey

Lookup an Arabic Ontology concept

Actors	Authenticated user.
URLs schema	https://{domain}/api/OntologyConcept/{conceptID}?apikey={key}
Pre-conditions	The user has registered and provided their API Key.
API Parameters	apikey: a key (provided offline) to access the API.
Flow of events	The system checks if the user is authenticated or not. If not authenticated, the system returns (-3) error code in JSON format. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request. The system performs the required search query. The system return the JSON data object.
Retrieved Data	results JSON object (One concept from the Arabic Ontology).

Example ID: 293572

https://ontology.birzeit.edu/sina/api/OntologyConcept/293572?apikey=sampleKey

Lookup subtypes of an Onotlogy concept

Actors	Authenticated user.
URL schema	https://{domain}/api/OntologyConceptSubtypes/{superId}?apikey={key}
Pre-conditions	The user has registered and provided their API Key.
API Parameters	apikey: a key (provided offline) to access the API.
Flow of events	The system checks if the user is authenticated or not. If not authenticated, the system returns (-3) error code in JSON format. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request. The system performs the required search query. The system return the JSON data object.
Retrieved Data	results JSON object (list of ontology concepts).

Example ID: 293572

https://ontology.birzeit.edu/sina/api/OntologyConceptSubtypes/293572?apikey=sampleKey

Lookup concepts part of another concept

Actors	Authenticated user.
URL schema	https://{domain}/api/ConceptParts/{partOfID}?apikey={key}
Pre-conditions	The user has registered and provided their API Key.
API Parameters	apikey: a key (provided offline) to access the API.
Flow of events	The system checks if the user is authenticated or not. If not authenticated, the system returns (-3) error code in JSON format. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request. The system performs the required search query. The system return the JSON data object.
Retrieved Data	results JSON object (list of ontology concepts).

Example ID: 293121

https://ontology.birzeit.edu/sina/api/ConceptParts/293121?apikey=sampleKey

Lookup instances of an Ontology concept

Actors	Authenticated user.
URL schema	https://{domain}/api/ConceptInstances/{instanceOfID}?apikey={key}
Pre-conditions	The user has registered and provided their API Key.
API Parameters	apikey: a key (provided offline) to access the API.
Flow of events	The system checks if the user is authenticated or not. If not authenticated, the system returns (-3) error code in JSON format. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request. The system performs the required search query. The system return the JSON data object.
Retrieved Data	results JSON object (list of ontology concepts).

Example ID: 293121

https://ontology.birzeit.edu/sina/api/ConceptInstances/293121?apikey=sampleKey

Lexicographic Databases (Qabas & 150 lexicons)

حوسبة المعاجم (قبس و150 معجم)

Qabas - Open Source Arabic Lexicographic Database

An Arabic Lexicon (58K lemmas) linked with many NLP resources (110 lexicons and 12 corpora), represented as a lexicographic data graph (See Article, See About)

Online

Statistics and Guidelines

Download Qabas

Lexicographic Search Engine (150 lexicons)

150 Arabic-Multilingual dictionaries were manually digitized, then structured and integrated in one database, including definitions, synonyms, translations, morphological features, etc.(See Article, See About)

Online

Web Services

REST

Search dictionaries for a term

Retrieves lexical concepts from all lexicons that have the SearchTerm in its synset. It allows an authenticated user (application or end-user) to search the dictionaries for a term they provide. They can set the results page size and the search filter to search either for definitions, translations, synonyms or a combination of them Request API Token.

Web Service

REST

Online

Actors	Authenticated user.
URL schema	https://{domain}/api/term/{term}/?type={filter-no}&page={page-no}&limit={pageSize}&apikey={key}
Pre-conditions	The user has registered and provided their API Key.
API Parameters	type: search filter value (1: translations only, 2: synonyms only, 3: definitions only, 4: translations and synonyms , 5: translations and definitions , 6: synonyms and definitions, 7: translation, synonyms and definitions). page: number of results page. limit: number of results per page. apikey: a key (provided offline) to access the API.
Flow of events	The system checks if the user is authenticated or not. If not authenticated, the system returns (-3) error code in JSON format. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request. The system checks the page size to be between 1 and 1000, and the search filter to be between 1 and 7. If so the system performs the required search query. Otherwise, the system returns (-4) error code. The system returns the JSON data object.
Retrieved Data	results JSON object (list of lexical concepts).

Example: virus

Retrieve definitions of the term "virus"

https://ontology.birzeit.edu/sina/api/term/virus/?type=3&page=1&limit=10&apikey=sampleKey

Lookup a lexical concept by ID

Retrieves a certain lexical concept from a lexicon, given its IDff Request API Token.

Web Service

REST

Online

Actors	Authenticated user.
URL schema	https://{domain}/api/lexicalconcept/{id}?apikey={key}
Pre-conditions	The user has registered and provided their API Key.
API Parameters	apikey: a key (provided offline) to access the API.
Flow of events	The system checks if the user is authenticated or not. If not authenticated, the system returns (-3) error code in JSON format. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request. The system performs the required search query. The system return the JSON data object.
Retrieved Data	results JSON object (one lexical concept).

Example ID: 1520039900

https://ontology.birzeit.edu/sina/api/lexicalconcept/1520039900?apikey=sampleKey

+-

Dialect Annotated Corpora (Currasat)

مدونة اللهجات العامية (كراسات)

+-
Curras - Palestinian Dialect Corpus.

Palestinian morphologically-annotated corpus (56K tokens). Each token is annotated with 16 different features. (See Article , See About).

Online

Download Corpus

+-
Baladi - Annotated Lebense Dialect Corpus

Lebanese morphologically-annotated corpus (10K tokens). Each token is annotated with 16 different features. (See Article , see About).

Online

Download Corpus

+-
Nabra: Syrian Arabic Dialect Corpora with Morphological Annotations.

Syrian morphologically-annotated corpus (60K tokens). Each token is annotated with 16 different features.(See Article , see About).

Online

Download Corpus

+-
Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations.

Four corpora consists of about (1.2 million tokens) that we collected from different social media platforms. The Yemeni corpus (~1.05M tokens) was collected automatically from Twitter, while the other three dialects (~50K tokens each) were manually collected from Facebook and YouTube. Each word in the four corpora was annotated with different morphological features. (See Article , see About).

Online

Download Yemeni Corpus Download Iraqi Corpus Download Libyan Corpus Download Sudanese Corpus
+-

SinaTools (NLP toolkit)

أدوات سينا

+-
Open-source Python toolkit for Arabic Natural Understanding,

Python APIs, command lines, colabs, and online demos.
Modules: Morphology Tagging, Named Entity Recognition, Word Sense Disambiguation, Relation Extraction, Semantic Relatedness, Synonyms, Diacritic-Based Matching, Corpora Processing, Utilities (See Article).

Download

Documentation

Benchmark
+-

Morphology Tagger (Alma)

المحلل الصرفي (ألمى)

+-
Arabic morphology taggers (Lemmatizer, POS tagger, and root tagger).

The fastest and most accurate (See Article).

Online Demo

Download Data

Download Code
Part of SinaTools

Word Sense Disambiguation(Salma)

المحلل الدلالي (سلمى)

WSD Pipeline and Datasets

Pipeline: performs several task together. Given a sentence as input it tags all words with: Lemma, single-word sense, multi-word sense, and NER. The sense disambiguation is done using the ArabGlossBER TSV model using our single and multi word sense inventory (see Article). The lemmatization is done using Alma and the NER is done using Wojood.

ArabGlossBERT dataset: 167K context-gloss pairs, labeled with True/False, to train a TSV model (see Article). The dataset was also augmented with more pairs (See Article).

Salma Corpus: manually sense annotated corpus (34K tokens), (See Article).

Online Demo

ArabGlossBERT

Download code and TSV Models

SALMA corpus

Download

Download Code

Part of SinaTools

ArabicNLU 2024

Shared Tasks

Request API Token to access WSD SALMA web service online

Actors	Authenticated user.
URL schema	https://{domain}/v2/api/SALMA/{text}?apikey={key}
Pre-conditions	The user has registered and provided their API Key. The text must be in the http request body.
API Parameters	text: arabic text apikey: a key (provided offline) to access the API.
Flow of events	1.The system checks if the user is authenticated or not. 2.If not authenticated, the system returns (-3) error code in JSON format. 3.If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request. 4.If so the system Semantically Analyse the sentence. 5.Otherwise, the system returns (-4) error code. 7.The system returns the JSON data object.
Retrieved Data	results JSON object.

Named Entity Recognition (Wojood)

استخراج أسماء الاعلام (وجود)

Flat, Nested and Fine-Grained Arabic Named Entity Recognition

Models: Flat and Nested BERT models.

Wojood Corpus: 560K tokens (MSA and dialect), manually annotated with 21 entity types, covers multiple domains and was annotated with nested and flat entities (See Article).

Wojood_FineCorpus: Same as Wojood but extended with subtypes of entities (51 tags in total),(See Article).

Wojood^GazaCorpus: 60K tokens related to Israeli War on Gaza in domains (See Article).

Online Demo

Download Corpora

Wojood

Wojood_Fine

Wojood^Gaza

Download Code and Models

HuggingFace

GitHub

Download

Annotation Guidelines

Download Code

Part of SinaTools

Shared Tasks

WojoodNER 2024

WojoodNER 2023

Request API Token to access Wojood web service online

Actors	Authenticated user.
URL schema	https://{domain}/sina/v2/api/wojood/?apikey={key}
Pre-conditions	The user has registered and provided their API Key.
API Parameters	mode and sentence are received through the body mode: output format (1) JSON IBO format, (2) XML format, or (3) entities and their positions in JSON. text: arabic text apikey: a key (provided offline) to access the API.
Flow of events	1. The system checks if the user is authenticated or not. 2. If not authenticated, the system returns (-3) error code in JSON format. 3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request. 4. If so the system extracts the entities from text. 5. Otherwise, the system returns (-4) error code. 6. The system returns the results in the specified format.
Retrieved Data	returns the results in the specified format.

+-

Relation Extraction

استخراج العلاقات

+-
Event argument relation extraction (Corpus and models)

Extract relations between events and their arguments within a sentence (hasLocation, hasDate, hasAgent).

Corpus (Wojood^Hadath): We extended Wojood NER corpus with relations

Method: Novel method with using BERT with 95% accuracy, implemented as part of SinaTools (See Article)

Online Demo

Download Corpus

Download Code
Part of SinaTools
+-

Social Computing (Fada)

الإنسانيات الحاسوبية والتواصل الاجتماعي (فضا)

+-
Offensive Hebrew Corpus and Detection using BERT

The corpus includes about 16K tweets manually labeled with (abusive, hate, violence, pornographic, or non-offensive) in addition to Target, Topic, and Phrase. We fined-tuned 8 models (using HeBERT and AlphaBERT). (See Article)

Download

Github (Corpus and Code)

Huggingface (8 BERT Models)

Fada Page

+-
Bias and propaganda detection in social media

A corpora of 12,000 Facebook posts in five languages (Arabic, Hebrew, English, French, Hindi), with 2,400 posts in each language, manually annotated with Bias and Propaganda. This dataset was collected during the Israeli War on Gaza from October 7, 2023, to January 31, 2024. (See Article)

Download

Github (Corpus)

Fada Page

Shared Task

FigNews 2025

+-
Benchmark for detecting bias in LLMs

A dataset consisting of 1,800 pairs of ChatGPT responses was created to analyze potential biases related to Palestine and Israel. The dataset encompasses the 30 articles of international human rights law, about 60 pairs for each article. Each pair was manually classified into one of three categories (Biased against Palestine, Biased against Israel, No Bias) by 12 well-trained law master’s students.

(coming soon)

Fada Page

+-
Nakba-NLP Workshop

International Workshop on Nakba Narratives as Language Resources

NakbaNLP 2025
Part of COLING

Synonyms

استخراج المترادفات

Tool for extending and evaluating synonyms

Extend: Given a one or more synonyms the tool extends it with more synonyms.

Evaluate: Given a set of synonyms the tool evaluates how much these synonyms belong to this set. The tools is based on a novel algorithm and datasets, treating synonymy as a fuzzy relation. (See Article ).

Online Demo

Web Service

REST and Javascript

Download Datasets

Download Code

Part of SinaTools

Request API Token to access Synonyms Generator web service online

Actors	Authenticated user.
URL schema	https://{domain}/sina/v2/api/SynonymGenerator/?apikey={key}
Pre-conditions	The user has registered and provided their API Token.
API Parameters	Synset, lexicons, pos and level are received through the body Synset: mono/multilingual synset. Lexicons: Select one or more of these lexicons (AWN, مكنز بيرزيت, Princeton WordNet, ALECSO, Cairo Academy). POS: part of speech (noun, verb). Level: Level3 and Level4. Apikey: A key (provided offline) to access the API.
Flow of events	The system checks if the API Key (i.e., Token) is authenticated or not. If not authenticated, the system returns (-3) error code in JSON format. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request. If so the system extracts the entities from text. Otherwise, the system returns (-4) error code. The system returns the results in the specified format.
Retrieved Data	Return the candidate synonyms with their fuzzy values.

+-

Chatbots and intent detection (AraBanking77)

المساعدات الآلية

+-
ArBanking77 Intent Detection Corpus and Models

The dataset consists of 31,404 (MSA and Palestinian dialect). Each query is classified into one of the 77 classes (intents) including card arrival, card linking, exchange rate, etc. A set of BERT models are fine-tuned on the ArBanking77 dataset (F1-score 92% for MSA, 90% for PAL). (See Article )

Online Demo

Download

Github (Dataset and Code)

Huggingface (Models)

Shared Task

AraFinNLP 2025

API errors messages

Details of error messages returned by the APIs.

Error Code	Error Message
-1	User blocked, exceeded access limit
-3	user is not authenticated
-4	Incorrect API parameter value
-5	No Data Records Found
-6	Incorrect Data Value
login-error	{"error":"invalid_grant","error_description":"Bad credentials"}

+ -
List of Corpora and Datasets

بيانات مفتوحة المصدر

+-
Lexicography & Semantic

Arabic Ontology An Arabic Wordnet with ontologically Clean Content.(CC-BY-4.0)

Qabas Lexicon Lexicographic database, 58K lemmas, linked with 110 lexicons and corpora 2M tokens.(CC-BY-ND-4.0)

Salma WSD Arabic sense-annotated corpus, 34k tokens. Multilevels: single-word, multi-word senses, and NER.(CC-BY-4.0)

Synonyms Synonyms dataset parallelly annotated by 4 linguists and fuzzy values.(CC-BY-4.0)

ArabGlossBERT 167K context-gloss pairs labeled with True/False to train a TSV BERT model for WSD.(CC-BY-4.0)

+-
Classical Arabic

Quran Morphology tagging of the Quran, each word is linked with a lemma in Qabas lexicon.(CC-BY-4.0)

+-
Dialects & Morphology

Curras Palestinian dialect corpus, 56K tokens with morphological annotations.(CC-BY-4.0)

Baladi Lebanese dialect corpus, 10K tokens with morphological annotations.(CC-BY-4.0)

Nabra Syrian dialect corpus, 60K tokens with morphological annotations.(CC-BY-4.0)

Lisan-Iraqi Iraqi dialect corpus, 45K tokens with morphological annotations.(CC-BY-4.0)

Lisan-Libyan Libyan dialect corpus, 51K tokens with morphological annotations.(CC-BY-4.0)

Lisan-Sudanese Sudanese dialect corpus, 52K tokens with morphological annotations.(CC-BY-4.0)

Lisan-Yemeni Yemeni dialect corpus, 1.05M tokens with morphological annotations(CC-BY-4.0)

+-
Information Extractions

Wojood NER Nested NER corpus, 550K tokens, 21 entity types, multidomains, MSA and dialects.(CC-BY-4.0)

WojoodFine Fine-grain NER corpus - extending Wojood with 31 entity subtypes.(CC-BY-4.0)

WojoodGaza NER corpus, 60K tokens about the Israeli War on Gaza, using Wojood guidelines.(CC-BY-4.0)

WojoodHadath Event-relation extraction corpus - extending Wojood with relations.(CC-BY-4.0)

+-
Chatbots & Dialect Translation

ArBanking77 Parallel Corpora: 15K questions in MSA, Palestinian, Morocoo, Suadi, and Tunisian, each is labeled with a banking intent.(CC-BY-SA-4.0)

+-
Social Computing

Offensive Hebrew 16K Tweets labeled with hate, violence, racism, porno.(CC-BY-4.0)

FigNews 12K FB posts annotated with Bias and Propaganda in Arabic, Hebrew, English, French, and Hindi.(CC-BY-4.0)

Search Arabic Ontology for a term

Example: virus

Lookup an Arabic Ontology concept

Example ID: 293572

Lookup subtypes of an Onotlogy concept

Example ID: 293572

Lookup concepts part of another concept

Example ID: 293121

Lookup instances of an Ontology concept

Example ID: 293121

Qabas - Open Source Arabic Lexicographic Database

Lexicographic Search Engine (150 lexicons)

Search dictionaries for a term

Example: virus

Retrieve definitions of the term "virus"

Lookup a lexical concept by ID

Example ID: 1520039900

Curras - Palestinian Dialect Corpus.

Baladi - Annotated Lebense Dialect Corpus

Nabra: Syrian Arabic Dialect Corpora with Morphological Annotations.

Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations.

Open-source Python toolkit for Arabic Natural Understanding,

Arabic morphology taggers (Lemmatizer, POS tagger, and root tagger).

WSD Pipeline and Datasets

Flat, Nested and Fine-Grained Arabic Named Entity Recognition

Event argument relation extraction (Corpus and models)

Offensive Hebrew Corpus and Detection using BERT

Download

Bias and propaganda detection in social media

Download

Benchmark for detecting bias in LLMs

Nakba-NLP Workshop

Tool for extending and evaluating synonyms

ArBanking77 Intent Detection Corpus and Models

Download

Shared Task

Lexicography & Semantic

Classical Arabic

Dialects & Morphology

Information Extractions

Chatbots & Dialect Translation

Social Computing