Open-source Python toolkit for Arabic Natural Understanding, allowing people to integrate it in their system workflow.

Python APIs, command lines, colabs, and online demos.

Documentation Download MIT License
SinaTools Modules
  • Lemmatizer and POS tagger, outperforming all related tools [1].
    Performance: Speed (33K tokens/sec), lemmatization(95.9%), POS(98.2%)
    from sinatools.morphology import morph_analyzer
    morph_analyzer.analyze('ذهب الولد إلى المدرسة')
        "token": "ذهب",
        "lemma": "ذَهَبَ",
        "lemma_id": "202001617",
        "root": "ذ ه ب",
        "pos": "فعل ماضي",
        "frequency": "82202"
        "token": "الولد",
        "lemma": "وَلَدٌ",
        "lemma_id": "202003092",
        "root": "و ل د",
        "pos": "اسم",
        "frequency": "19066"
        "token": "إلى",
        "lemma": "إِلَى",
        "lemma_id": "202000856",
        "root": "إ ل ى",
        "pos": "حرف جر",
        "frequency": "7367507"
        "token": "المدرسة",
        "lemma": "مَدْرَسَةٌ",
        "lemma_id": "202002620",
        "root": "د ر س",
        "pos": "اسم",
        "frequency": "145285"
  • Nested and Flat NER, 21 enity classes, and 31 entity sytpyes.
    Performance: Nested(88%) Flat(89%) - Wojood corpus, See [2, 3, 4].
    DEMO Documentation Articles 2, 3, 4 Video
    from sinatools.ner.entity_extractor import extract
    extract('ذهب محمد إلى جامعة بيرزيت')
        "tags":"B-GPE I-ORG"
  • Performs three task all together. Given a sentence as input it tags (single-word WSD, multi-word WSD, and NER) in this sentence.
    Performance: single-word WSD (89%) multi-word WSD (96%) - SALMA corpus See [5].
    DEMO Documentation Articles 5, 6, 7, 8 Video
    from sinatools.wsd.disambiguator import disambiguate
    disambiguate('تمشيت بين الجداول والأنهار')
        'concept_id': '303051631',
        'word': 'تمشيت',
        'undiac_lemma': 'تمشى',
        'diac_lemma': 'تَمَشَّى'
        'word': 'بين',
        'undiac_lemma': 'بين',
        'diac_lemma': 'بَيْنَ'
        'concept_id': '303007335',
        'word': 'الجداول',
        'undiac_lemma': 'جدول',
        'diac_lemma': 'جَدْوَلٌ'
        'concept_id': '303056588',
        'word': 'والأنهار',
        'undiac_lemma': 'نهر',
        'diac_lemma': 'نَهْرٌ'
  • Computes the degree of association between two sentences across various dimensions, meaning, underlying concepts, domain-specificity, topic overlap, viewpoint alignment.
    Performance: correlation score (49%), See [9].
    from sinatools.semantic_relatedness.compute_relatedness import get_similarity_score
    sentence1 = "تبلغ سرعة دوران الأرض حول الشمس حوالي 110 كيلومتر في الساعة."
    sentence2 = "تدور الأرض حول محورها بسرعة تصل تقريبا 1670 كيلومتر في الساعة."
    get_similarity_score(sentence1, sentence2)
    Score = 0.90
  • Extend: Given a one or more synonyms this module extend it with more synonyms.
    Performance: 3rd level (98.7%), 4th level (92%) - Algorithm, See [10].
    from sinatools.synonyms.synonyms_generator import extend_synonyms
    extend_synonyms('ممر | طريق',2)
    [["مَسْلَك","61%"],["سبيل","61%"],["وَجْه","30%"],["نَهْج", "30%"],["نَمَطٌ","30%"],["مِنْهَج","30%"],["مِنهاج", "30%"],["مَوْر","30%"],["مَسَار","30%"],["مَرصَد", "30%"],["مَذْهَبٌ","30%"],["مَدْرَج","30%"],["مَجَاز","30%"]]
    Evaluate: Given a set of synonyms this module evaluate how much these synonyms are realy a synonyms in this set, See [11]
    from sinatools.synonyms.synonyms_generator import evaluate_synonyms
    evaluate_synonyms('ممر | طريق | مَسْلَك | سبيل')
    [["مَسْلَك","61%"],["سبيل","60%"],["طريق","40%"],["ممر", "40%"]]
  • Decides whether two Arabic words are the same or not, taking into account diacratization compatibility - Algorithm, See [12].
    from sinatools.utils.implication import Implication
    word1 = "قالَ"
    word2 = "قْال"
    implication = Implication(word1, word2)
    result = implication.get_result()
    Output: "Same"
  • Corpus Tokenizer: Receives a directory of files as input, splits the text into sentences and tokens, and assigns an auto-incrementing ID, sentence ID, and global sentence ID to each token.
    corpus_tokenizer --dir_path "/path/to/text/directory/of/files" --output_csv  "outputFile.csv"
    Text Duplication Detector: Processes a CSV file of sentences to identify and remove duplicate sentences based on a specified threshold and cosine similarity. It saves the filtered results and the identified duplicates to separate files.
    text_dublication_detector --csv_file "/path/to/text/file" --column_name "name of the column" --final_file_name "Final.csv" --deleted_file_name "deleted.csv" --similarity_threshold 0.8  
  • A set of useful NLP methods for sentence splitting, duplicate word removal, Arabic Jaccard similarity metrics, transliteration and others.
Tymaa Hammouda, Mustafa Jarrar, Mohammed Khalilia: SinaTools: Open Source Toolkit for Arabic Natural Language Understanding. In Proceedings of the 2024 AI in Computational Linguistics (ACLING 2024), Dubai.