Resources

Demo and download our tools and datasets

  • An Arabic Wordnet with ontologically-clean content: a classification of the meanings of Arabic terms (see Article, see FAQ).

    Actors Authenticated user.
    URL schema https://{domain}/api/OntologyTermSearch/{term}?page={page-no}&limit={pageSize}&apikey={key}
    Pre-conditions The user has registered and provided their API Key.
    API Parameters
    • page: the results page number.
    • limit: number of results per page.
    • apikey: a key (provided offline) to access the API.
    Flow of events
    1. The system checks if the user is authenticated or not.
    2. If not authenticated, the system returns (-3) error code in JSON format.
    3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request.
    4. The system checks the page size to be between 1 and 1000.
    5. If so, the system performs the required search query.
    6. Otherwise, the system returns (-4) error code.
    7. The system returns the JSON data object.
    Data results JSON object (list of ontology concepts).

    Example: virus
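The OntologyTermSearch call above can be sketched in Python as follows. This is a minimal client-side sketch: "example.org" and "MY_KEY" are placeholders, not the real domain or key, and the function only builds the URL and enforces the 1-1000 page-size rule before any request is made.

```python
from urllib.parse import quote, urlencode

def ontology_term_search_url(domain, term, page=1, limit=20, apikey="MY_KEY"):
    """Build an OntologyTermSearch URL following the schema above.

    The API returns -4 for an out-of-range page size, so we validate
    the limit client-side before building the URL.
    """
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    query = urlencode({"page": page, "limit": limit, "apikey": apikey})
    # The search term goes in the URL path, so it must be percent-encoded.
    return f"https://{domain}/api/OntologyTermSearch/{quote(term)}?{query}"

print(ontology_term_search_url("example.org", "virus", page=1, limit=10))
```

Sending the resulting URL with any HTTP client and parsing the JSON response then follows the flow of events described above.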

    Actors Authenticated user.
    URL schema https://{domain}/api/OntologyConcept/{conceptID}?apikey={key}
    Pre-conditions The user has registered and provided their API Key.
    API Parameters
    • apikey: a key (provided offline) to access the API.
    Flow of events
    1. The system checks if the user is authenticated or not.
    2. If not authenticated, the system returns (-3) error code in JSON format.
    3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request.
    4. The system performs the required search query.
    5. The system returns the JSON data object.
    Retrieved Data results JSON object (One concept from the Arabic Ontology).

    Example ID: 293572

    Actors Authenticated user.
    URL schema https://{domain}/api/OntologyConceptSubtypes/{superId}?apikey={key}
    Pre-conditions The user has registered and provided their API Key.
    API Parameters
    • apikey: a key (provided offline) to access the API.
    Flow of events
    1. The system checks if the user is authenticated or not.
    2. If not authenticated, the system returns (-3) error code in JSON format.
    3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request.
    4. The system performs the required search query.
    5. The system returns the JSON data object.
    Retrieved Data results JSON object (list of ontology concepts).

    Example ID: 293572

    Actors Authenticated user.
    URL schema https://{domain}/api/ConceptParts/{partOfID}?apikey={key}
    Pre-conditions The user has registered and provided their API Key.
    API Parameters
    • apikey: a key (provided offline) to access the API.
    Flow of events
    1. The system checks if the user is authenticated or not.
    2. If not authenticated, the system returns (-3) error code in JSON format.
    3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request.
    4. The system performs the required search query.
    5. The system returns the JSON data object.
    Retrieved Data results JSON object (list of ontology concepts).

    Example ID: 293121

    Actors Authenticated user.
    URL schema https://{domain}/api/ConceptInstances/{instanceOfID}?apikey={key}
    Pre-conditions The user has registered and provided their API Key.
    API Parameters
    • apikey: a key (provided offline) to access the API.
    Flow of events
    1. The system checks if the user is authenticated or not.
    2. If not authenticated, the system returns (-3) error code in JSON format.
    3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request.
    4. The system performs the required search query.
    5. The system returns the JSON data object.
    Retrieved Data results JSON object (list of ontology concepts).

    Example ID: 293121
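The four ID-based lookups above (one concept, its subtypes, its parts, its instances) share one request shape: a GET on /api/{endpoint}/{id} with the API key in the query string. As a sketch, they can be driven by a single helper; "example.org" and "MY_KEY" are placeholders.

```python
# Endpoints taken from the URL schemas above; each takes a numeric ID in the path.
ID_ENDPOINTS = (
    "OntologyConcept",          # one concept by its ID
    "OntologyConceptSubtypes",  # subtypes of a super-concept
    "ConceptParts",             # parts of a concept
    "ConceptInstances",         # instances of a concept
)

def concept_lookup_url(domain, endpoint, concept_id, apikey="MY_KEY"):
    """Build a lookup URL for any of the four ID-based ontology endpoints."""
    if endpoint not in ID_ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    return f"https://{domain}/api/{endpoint}/{concept_id}?apikey={apikey}"

print(concept_lookup_url("example.org", "OntologyConceptSubtypes", 293572))
```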

  • An Arabic Lexicon (58K lemmas) linked with many NLP resources (110 lexicons and 12 corpora), represented as a lexicographic data graph (See Article, See About)

    150 Arabic-Multilingual dictionaries were manually digitized, then structured and integrated in one database, including definitions, synonyms, translations, morphological features, etc. (See Article, See About)

    Retrieves lexical concepts from all lexicons whose synsets contain the search term. It allows an authenticated user (application or end-user) to search the dictionaries for a term they provide. They can set the results page size and the search filter to return definitions, translations, synonyms, or a combination of them. Request an API Token.

    Actors Authenticated user.
    URL schema https://{domain}/api/term/{term}/?type={filter-no}&page={page-no}&limit={pageSize}&apikey={key}
    Pre-conditions The user has registered and provided their API Key.
    API Parameters
    • type: search filter value (1: translations only, 2: synonyms only, 3: definitions only, 4: translations and synonyms, 5: translations and definitions, 6: synonyms and definitions, 7: translations, synonyms, and definitions).
    • page: the results page number.
    • limit: number of results per page.
    • apikey: a key (provided offline) to access the API.
    Flow of events
    1. The system checks if the user is authenticated or not.
    2. If not authenticated, the system returns (-3) error code in JSON format.
    3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request.
    4. The system checks the page size to be between 1 and 1000, and the search filter to be between 1 and 7.
    5. If so, the system performs the required search query.
    6. Otherwise, the system returns (-4) error code.
    7. The system returns the JSON data object.
    Retrieved Data results JSON object (list of lexical concepts).

    Example: virus
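The filter codes for the /api/term search above are an enumeration (1-7), not a bitmask, so a sketch of a client can map the requested result kinds to the right code explicitly. "example.org" and "MY_KEY" are placeholders; the function only builds the URL.

```python
from urllib.parse import quote, urlencode

# Filter codes exactly as listed in the API parameters above.
TYPE_CODES = {
    frozenset({"translations"}): 1,
    frozenset({"synonyms"}): 2,
    frozenset({"definitions"}): 3,
    frozenset({"translations", "synonyms"}): 4,
    frozenset({"translations", "definitions"}): 5,
    frozenset({"synonyms", "definitions"}): 6,
    frozenset({"translations", "synonyms", "definitions"}): 7,
}

def term_search_url(domain, term, kinds, page=1, limit=20, apikey="MY_KEY"):
    """Build a /api/term search URL for the given set of result kinds."""
    type_no = TYPE_CODES[frozenset(kinds)]  # KeyError for an unsupported combination
    query = urlencode({"type": type_no, "page": page, "limit": limit, "apikey": apikey})
    return f"https://{domain}/api/term/{quote(term)}/?{query}"

print(term_search_url("example.org", "virus", {"translations", "definitions"}))
```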

    Retrieves a certain lexical concept from a lexicon, given its ID. Request an API Token.

    Actors Authenticated user.
    URL schema https://{domain}/api/lexicalconcept/{id}?apikey={key}
    Pre-conditions The user has registered and provided their API Key.
    API Parameters
    • apikey: a key (provided offline) to access the API.
    Flow of events
    1. The system checks if the user is authenticated or not.
    2. If not authenticated, the system returns (-3) error code in JSON format.
    3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request.
    4. The system performs the required search query.
    5. The system returns the JSON data object.
    Retrieved Data results JSON object (one lexical concept).

    Example ID: 1520039900

  • Palestinian morphologically-annotated corpus (56K tokens). Each token is annotated with 16 different features. (See Article, See About).

    Lebanese morphologically-annotated corpus (10K tokens). Each token is annotated with 16 different features. (See Article, See About).

    Syrian morphologically-annotated corpus (60K tokens). Each token is annotated with 16 different features. (See Article, See About).

    Four corpora totalling about 1.2 million tokens, collected from different social media platforms. The Yemeni corpus (~1.05M tokens) was collected automatically from Twitter, while the other three dialects (~50K tokens each) were manually collected from Facebook and YouTube. Each word in the four corpora was annotated with different morphological features. (See Article, See About).

  • Python APIs, command lines, colabs, and online demos.
    Modules: Morphology Tagging, Named Entity Recognition, Word Sense Disambiguation, Relation Extraction, Semantic Relatedness, Synonyms, Diacritic-Based Matching, Corpora Processing, Utilities (See Article).

  • The fastest and most accurate (See Article).

  • Pipeline: performs several tasks together. Given a sentence as input, it tags all words with: lemma, single-word sense, multi-word sense, and NER. Sense disambiguation uses the ArabGlossBERT TSV model with our single- and multi-word sense inventory (see Article). Lemmatization uses Alma and NER uses Wojood.

    ArabGlossBERT dataset: 167K context-gloss pairs, labeled with True/False, to train a TSV model (see Article). The dataset was also augmented with more pairs (See Article).

    Salma Corpus: manually sense annotated corpus (34K tokens), (See Article).

    Request an API Token to access the SALMA WSD web service online


    Actors Authenticated user.
    URL schema https://{domain}/v2/api/SALMA/{text}?apikey={key}
    Pre-conditions The user has registered and provided their API Key.
    The text must be in the HTTP request body.
    API Parameters
    1. text: Arabic text.
    2. apikey: a key (provided offline) to access the API.
    Flow of events
    1. The system checks if the user is authenticated or not.
    2. If not authenticated, the system returns (-3) error code in JSON format.
    3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request.
    4. If the parameters are valid, the system semantically analyses the sentence.
    5. Otherwise, the system returns (-4) error code.
    6. The system returns the JSON data object.
    Retrieved Data results JSON object.
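Since the pre-conditions above place the text in the HTTP request body while the key stays in the query string, a SALMA call can be sketched like this. "example.org" and "MY_KEY" are placeholders, a JSON body with a "text" field is an assumption about the exact body shape, and the request is built but never sent.

```python
import json
import urllib.request

def build_salma_request(domain, text, apikey="MY_KEY"):
    """Build (but do not send) a POST request for the SALMA WSD service.

    The Arabic text travels in the request body; the API key in the query
    string, matching the pre-conditions described above.
    """
    url = f"https://{domain}/v2/api/SALMA/?apikey={apikey}"
    # ensure_ascii=False keeps Arabic characters readable in the payload.
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

req = build_salma_request("example.org", "مرحبا بالعالم")
print(req.method, req.full_url)
```

Sending it with urllib.request.urlopen(req) and decoding the JSON reply completes the flow of events above.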
  • Models: Flat and Nested BERT models.

    Wojood Corpus: 560K tokens (MSA and dialect), manually annotated with 21 entity types, covers multiple domains and was annotated with nested and flat entities (See Article).

    WojoodFineCorpus: Same as Wojood but extended with subtypes of entities (51 tags in total) (See Article).

    WojoodGazaCorpus: 60K tokens related to the Israeli War on Gaza, across multiple domains (See Article).

    Request an API Token to access the Wojood web service online


    Actors Authenticated user.
    URL schema https://{domain}/sina/v2/api/wojood/?apikey={key}
    Pre-conditions The user has registered and provided their API Key.
    API Parameters
      mode and text are received through the request body
    1. mode: output format: (1) JSON IOB format, (2) XML format, or (3) entities and their positions in JSON.
    2. text: Arabic text.
    3. apikey: a key (provided offline) to access the API.
    Flow of events
    1. The system checks if the user is authenticated or not.
    2. If not authenticated, the system returns (-3) error code in JSON format.
    3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request.
    4. If the parameters are valid, the system extracts the entities from the text.
    5. Otherwise, the system returns (-4) error code.
    6. The system returns the results in the specified format.
    Retrieved Data returns the results in the specified format.
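A Wojood call mirrors the SALMA one: mode and text go in the request body, the key in the query string. A minimal sketch, with "example.org" and "MY_KEY" as placeholders and a JSON body shape assumed; the request is built but never sent.

```python
import json
import urllib.request

# Output modes as listed in the API parameters above.
MODES = {1: "JSON IOB", 2: "XML", 3: "JSON entities with positions"}

def build_wojood_request(domain, text, mode=1, apikey="MY_KEY"):
    """Build (but do not send) a POST request for the Wojood NER service."""
    if mode not in MODES:
        raise ValueError("mode must be 1, 2, or 3")
    url = f"https://{domain}/sina/v2/api/wojood/?apikey={apikey}"
    body = json.dumps({"mode": mode, "text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

req = build_wojood_request("example.org", "نص عربي", mode=3)
print(req.method, req.full_url)
```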
  • Extracts relations between events and their arguments within a sentence (hasLocation, hasDate, hasAgent).

    Corpus (WojoodHadath): We extended the Wojood NER corpus with relations.

    Method: a novel BERT-based method with 95% accuracy, implemented as part of SinaTools (See Article)

  • The corpus includes about 16K tweets manually labeled as abusive, hate, violence, pornographic, or non-offensive, in addition to Target, Topic, and Phrase. We fine-tuned 8 models (using HeBERT and AlephBERT). (See Article)

    A corpus of 12,000 Facebook posts in five languages (Arabic, Hebrew, English, French, Hindi), with 2,400 posts in each language, manually annotated with Bias and Propaganda. This dataset was collected during the Israeli War on Gaza from October 7, 2023, to January 31, 2024. (See Article)

    A dataset consisting of 1,800 pairs of ChatGPT responses was created to analyze potential biases related to Palestine and Israel. The dataset encompasses the 30 articles of international human rights law, about 60 pairs for each article. Each pair was manually classified into one of three categories (Biased against Palestine, Biased against Israel, No Bias) by 12 well-trained law master’s students.

    International Workshop on Nakba Narratives as Language Resources

  • Extend: Given one or more synonyms, the tool extends the set with more synonyms.

    Evaluate: Given a set of synonyms, the tool evaluates how well each synonym belongs to the set. The tool is based on a novel algorithm and datasets, treating synonymy as a fuzzy relation. (See Article)

    Request an API Token to access the Synonyms Generator web service online


    Actors Authenticated user.
    URL schema https://{domain}/sina/v2/api/SynonymGenerator/?apikey={key}
    Pre-conditions The user has registered and provided their API Token.
    API Parameters
      Synset, lexicons, pos and level are received through the body
    1. Synset: mono/multilingual synset.
    2. Lexicons: Select one or more of these lexicons (AWN, مكنز بيرزيت, Princeton WordNet, ALECSO, Cairo Academy).
    3. POS: part of speech (noun, verb).
    4. Level: Level3 or Level4.
    5. Apikey: A key (provided offline) to access the API.
    Flow of events
    1. The system checks if the API Key (i.e., Token) is authenticated or not.
    2. If not authenticated, the system returns (-3) error code in JSON format.
    3. If authenticated, and the access limit is not exceeded (if exceeded returns -1 in JSON format), then the system logs the request.
    4. If the parameters are valid, the system generates the candidate synonyms.
    5. Otherwise, the system returns (-4) error code.
    6. The system returns the results in the specified format.
    Retrieved Data returns the candidate synonyms with their fuzzy values.
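The SynonymGenerator parameters above (synset, lexicons, pos, level) all travel in the request body, so a call can be sketched the same way as the other POST services. "example.org" and "MY_KEY" are placeholders, the JSON field names are an assumption about the body shape, and nothing is sent.

```python
import json
import urllib.request

def build_synonym_request(domain, synset, lexicons, pos="noun",
                          level="Level3", apikey="MY_KEY"):
    """Build (but do not send) a POST request for the SynonymGenerator service.

    synset is a list of mono/multilingual synonyms; lexicons is a list of
    lexicon names, e.g. "AWN" or "Princeton WordNet".
    """
    if level not in ("Level3", "Level4"):
        raise ValueError("level must be Level3 or Level4")
    url = f"https://{domain}/sina/v2/api/SynonymGenerator/?apikey={apikey}"
    body = json.dumps(
        {"synset": synset, "lexicons": lexicons, "pos": pos, "level": level},
        ensure_ascii=False,
    ).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

req = build_synonym_request("example.org", ["virus"], ["Princeton WordNet"])
print(req.method, req.full_url)
```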
  • The dataset consists of 31,404 queries (MSA and Palestinian dialect). Each query is classified into one of 77 classes (intents), including card arrival, card linking, exchange rate, etc. A set of BERT models was fine-tuned on the ArBanking77 dataset (F1-score: 92% for MSA, 90% for PAL). (See Article)

  • Details of error messages returned by the APIs.


    Error Code Error Message
    -1 User blocked, exceeded access limit
    -3 User is not authenticated
    -4 Incorrect API parameter value
    -5 No Data Records Found
    -6 Incorrect Data Value
    login-error {"error":"invalid_grant","error_description":"Bad credentials"}
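The error table above can be handled uniformly on the client side. A minimal sketch, assuming the error code arrives as a bare negative integer in the JSON payload; the exact JSON wrapping may differ per endpoint.

```python
# Error codes copied from the table above.
ERRORS = {
    -1: "User blocked, exceeded access limit",
    -3: "User is not authenticated",
    -4: "Incorrect API parameter value",
    -5: "No data records found",
    -6: "Incorrect data value",
}

def check_response(payload):
    """Raise on a known API error code; return the payload unchanged otherwise."""
    if isinstance(payload, int) and payload in ERRORS:
        raise RuntimeError(f"API error {payload}: {ERRORS[payload]}")
    return payload
```

Wrapping every decoded response in check_response() turns the sentinel codes into ordinary exceptions.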

  • Corpus name License Description
    Synonyms CC BY 4.0 The dataset is a set of 500 synsets (extracted from the Arabic Wordnet). Each synset is enriched with a list of candidate synonyms (3K candidates in total). Each candidate synonym was then annotated with a fuzzy value by four linguists (in parallel). The dataset is important for understanding how much linguists agree (or disagree) on synonymy (we found RMSE: 32% and MAE: 27%). In addition, we used the dataset as a baseline to evaluate our algorithm. See the scoring guidelines, figures, and details in section 3.
    Curras CC BY 4.0 The corpus consists of about 56K words/tokens collected from Facebook, Twitter, "Watan Aa Water" scripts, and others. Each word in the corpus was annotated with different morphological features, including CODA, prefixes, stem, suffixes, MSA lemma, dialect lemma, gloss, part-of-speech, gender, number, and aspect. The corpus was annotated using the LDC's SAMA tagsets. The first version of this corpus was released in 2013, and the second version, a complete revision of the annotations, was released in 2022.
    Baladi CC BY 4.0 The corpus consists of about 9.6K words/tokens collected from Facebook, blog posts and traditional poems. The corpus was annotated as an extension to Curras and following the same annotation methodology to form a Levantine Corpus.
    Lisan Yemeni CC BY 4.0 The Yemeni corpus (~1.05M tokens) was collected automatically from Twitter. Each word in this corpus was annotated with different morphological features, such as POS, stem, prefixes, suffixes, lemma, and a gloss in English. The annotation process was carried out by 35 annotators who are native speakers of the target dialects. The annotators were trained on a set of guidelines and on how to use our Arabic Dialect Annotation Toolkit (ADAT), which is open source.
    Lisan Iraqi CC BY 4.0 The Iraqi corpus (~50K tokens) was manually collected from traditional Iraqi poems and blog posts. It was annotated with morphological features, lemma, and gloss.
    Lisan Egyptian CC BY 4.0 The Egyptian corpus (~450K tokens) was automatically collected from online news sources and manually annotated for morphological features. The corpus was later used to train a morphological disambiguation tool for Egyptian Arabic.
    Lisan Levantine CC BY 4.0 The Levantine corpus (~350K tokens) consists of transcriptions from TV shows and online news sources. Each token in the corpus was annotated with morphological features, including lemma and gloss.
    Nabra CC BY 4.0 N/A
    Arabic Ontology CC BY 4.0 N/A
    Qabas CC BY-ND 4.0 N/A
    ArabGlossBERT CC BY 4.0 N/A
    Wojood CC BY 4.0 Wojood consists of about 550K tokens (MSA and dialect) that are manually annotated with 21 entity types (e.g., person, organization, location, event, date, etc). It covers multiple domains and was annotated with nested entities. The corpus contains about 75K entities and 22.5% of which are nested. A nested named entity recognition (NER) model based on BERT was trained (F1-score 88.4%).
    WojoodFine CC BY 4.0 N/A
    WojoodGaza CC BY 4.0 N/A
    Wojood Flat model CC BY 4.0 N/A
    Wojood Nested model CC BY 4.0 N/A
    WojoodHadath CC BY 4.0 N/A
    Offensive Hebrew Corpus CC BY 4.0 N/A
    SinaTools MIT N/A
    ArBanking77 CC BY-SA 4.0 N/A