SinaTools: Toolkit for Arabic NLP

SinaTools

Open-source Toolkit for Arabic NLP

Python APIs, command lines, colabs, online demos, and many datasets. Outperformed all related tools in all tasks.

Documentation GitHub Download MIT License Discussion

Modules:

Morphology Tagging

Lemmatizer and POS tagger, outperform all related tools [1].
Performance: Speed (33K tokens/sec), lemmatization(90.5%), POS(93.8%)

DEMO Documentation Benchmarks Article

                                from sinatools.morphology import morph_analyzer
morph_analyzer.analyze('ذهب الولد إلى المدرسة')
[{ 
    "token": "ذهب",
    "lemma": "ذَهَبَ",
    "lemma_id": "202001617",
    "root": "ذ ه ب",
    "pos": "فعل ماضي",
    "frequency": "82202"
  },{ 
    "token": "الولد",
    "lemma": "وَلَدٌ",
    "lemma_id": "202003092",
    "root": "و ل د",
    "pos": "اسم",
    "frequency": "19066"
  },{ 
    "token": "إلى",
    "lemma": "إِلَى",
    "lemma_id": "202000856",
    "root": "إ ل ى",
    "pos": "حرف جر",
    "frequency": "7367507"
  },{ 
    "token": "المدرسة",
    "lemma": "مَدْرَسَةٌ",
    "lemma_id": "202002620",
    "root": "د ر س",
    "pos": "اسم",
    "frequency": "145285"
}]

                              

Word Sense Disambiguation

Performs three tasks together. Given a sentence as input it tags (single-word WSD, multi-word WSD, and NER) in this sentence.
Performance: single-word WSD (81.73%) multi-word WSD (88.92%) - SALMA corpus See [5].

DEMO Documentation Benchmarks Articles 5, 6, 7, 8 Video

                                from sinatools.wsd.disambiguator import disambiguate
disambiguate('تمشيت بين الجداول والأنهار')
[{
    'concept_id': '303051631',
    'word': 'تمشيت',
    'undiac_lemma': 'تمشى',
    'diac_lemma': 'تَمَشَّى'
},{
    'word': 'بين',
    'undiac_lemma': 'بين',
    'diac_lemma': 'بَيْنَ'
},{
    'concept_id': '303007335',
    'word': 'الجداول',
    'undiac_lemma': 'جدول',
    'diac_lemma': 'جَدْوَلٌ'
},{
    'concept_id': '303056588',
    'word': 'والأنهار',
    'undiac_lemma': 'نهر',
    'diac_lemma': 'نَهْرٌ'
}]

                              

Named Entity Recognition

Nested and Flat NER, 21 entity classes, and 31 entity subtypes.
Performance: Nested(89.42%) Flat(87.33%) - Wojood corpus, See [2, 3, 4].

DEMO Documentation Benchmarks Articles 2, 3, 4 Video

                                from sinatools.ner.entity_extractor import extract
extract('ذهب محمد إلى جامعة بيرزيت')
[{
    "word":"ذهب",
    "tags":"O"
  },{
    "word":"محمد",
    "tags":"B-PERS"
  },{
    "word":"إلى",
    "tags":"O"
  },{
    "word":"جامعة",
    "tags":"B-ORG"
  },{
    "word":"بيرزيت",
    "tags":"B-GPE I-ORG"
}]

                              

Relation Extraction

Extract events and their corresponding arguments (agents, locations, and dates).
Performance:
Wojood^Hadath (93.99%)
Wojood^OutOfDoman (74.90%)

Online Demo

Download Corpus

Download Code

from sinatools.relations.relation_extractor import event_argument_relation_extraction
event_argument_relation_extraction('اندلعت انتفاضة الأقصى في 28 سبتمبر 2000')
#the output
[{
     "TripleID":"1",
    "Subject":{"ID": 1, "Type": "EVENT", "Label": "انتفاضة الأقصى"}
    "Relation":"location",
    "Object":{"ID": 2, "Type": "FAC", "Label": "الأقصى"}
},{
    "TripleID":"2",
    "Subject":{"ID": 1, "Type": "EVENT", "Label": "انتفاضة الأقصى"}
    "Relation":"happened at",
    "Object":{"ID": 3, "Type": "DATE", "Label": "28 سبتمبر 2000"}    
}]

Synonyms

Extend: Given one or more synonyms this module extends it with more synonyms.
Performance: 3rd level (98.7%), 4th level (92%) - Algorithm, See [10].

DEMO Documentation Article

                                from sinatools.synonyms.synonyms_generator import extend_synonyms
extend_synonyms('ممر | طريق',2)
[["مَسْلَك","61%"],["سبيل","61%"],["وَجْه","30%"],["نَهْج", "30%"],["نَمَطٌ","30%"],["مِنْهَج","30%"],["مِنهاج", "30%"],["مَوْر","30%"],["مَسَار","30%"],["مَرصَد", "30%"],["مَذْهَبٌ","30%"],["مَدْرَج","30%"],["مَجَاز","30%"]]

                              

Evaluate: Given a set of synonyms this module evaluates how much these synonyms are really synonyms in this set, See [11]

                                from sinatools.synonyms.synonyms_generator import evaluate_synonyms
evaluate_synonyms('ممر | طريق | مَسْلَك | سبيل')
[["مَسْلَك","61%"],["سبيل","60%"],["طريق","40%"],["ممر", "40%"]]

                              

Semantic Relatedness

Computes the degree of association between two sentences across various dimensions, meaning, underlying concepts, domain-specificity, topic overlap, and viewpoint alignment.
Performance: correlation score (49%), See [9].

DEMO Documentation Article

                                from sinatools.semantic_relatedness.compute_relatedness import get_similarity_score
sentence1 = "تبلغ سرعة دوران الأرض حول الشمس حوالي 110 كيلومتر في الساعة."
sentence2 = "تدور الأرض حول محورها بسرعة تصل تقريبا 1670 كيلومتر في الساعة."
get_similarity_score(sentence1, sentence2)
Score = 0.90

                              

+-
Diacritic-Based Matching
Decides whether two Arabic words are the same or not, taking into account diacratization compatibility - Algorithm, See [12].
DEMO Documentation Article
from sinatools.utils.implication import Implication word1 = "قالَ" word2 = "قْال" implication = Implication(word1, word2) result = implication.get_result() print(result) Output: "Same"
+-
Utilites
A set of useful NLP methods for sentence splitting, duplicate word removal, Arabic Jaccard similarity metrics, transliteration, and others.
DEMO Documentation
Jaccard Similarity Function: Calculates and returns the Jaccard similarity values (union, intersection, or Jaccard similarity) between two lists of Arabic words, considering the differences in their diacritization.
jaccard_similarity --list1="WORD1, WORD2" --list2="WORD1,WORD2" --delimiter="DELIMITER" --selection="SELECTION"
Corpus Tokenizer: Receives a directory of files as input, splits the text into sentences and tokens, and assigns an auto-incrementing ID, sentence ID, and global sentence ID to each token.
corpus_tokenizer --dir_path "/path/to/text/directory/of/files" --output_csv "outputFile.csv"
Text Duplication Detector: Processes a CSV file of sentences to identify and remove duplicate sentences based on a specified threshold and cosine similarity. It saves the filtered results and the identified duplicates into separate files.
text_dublication_detector --csv_file "/path/to/text/file" --column_name "name of the column" --final_file_name "Final.csv" --deleted_file_name "deleted.csv" --similarity_threshold 0.8

Publication:
Tymaa Hammouda, Mustafa Jarrar, Mohammed Khalilia: SinaTools: Open Source Toolkit for Arabic Natural Language Understanding. In Proceedings of the 2024 AI in Computational Linguistics (ACLING 2024), Procedia Computer Science, Dubai. ELSEVIER.