sinatools.utils.tokenizer

sinatools.utils.tokenizer.sentence_tokenizer(text, dot=True, new_line=True, question_mark=True, exclamation_mark=True)

This method tokenizes a text into a list of sentences based on the selected separators: dot, new line, question mark, and exclamation mark.

Parameters
  • text (str) – Arabic text to be tokenized.

  • dot (bool) – flag to split the text on dots (default is True).

  • new_line (bool) – flag to split the text on new lines (default is True).

  • question_mark (bool) – flag to split the text on question marks (default is True).

  • exclamation_mark (bool) – flag to split the text on exclamation marks (default is True).

Returns

list of sentences.

Return type

list

Example:

from sinatools.utils.tokenizer import sentence_tokenizer
sentences = sentence_tokenizer("مختبر سينا لحوسبة اللغة والذكاء الإصطناعي. في جامعة بيرزيت.", dot=True, new_line=True, question_mark=True, exclamation_mark=True)
print(sentences)

#output
['مختبر سينا لحوسبة اللغة والذكاء الإصطناعي.', 'في جامعة بيرزيت.']
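
The separator flags determine which punctuation marks end a sentence. The sketch below assumes the flag semantics described above (splitting is suppressed for any separator whose flag is False); the exact output has not been verified against the library.

from sinatools.utils.tokenizer import sentence_tokenizer
# Split only on question marks; with dot=False, dots no longer end a sentence (assumed behaviour).
sentences = sentence_tokenizer("هل هذا مختبر سينا؟ نعم، في جامعة بيرزيت.", dot=False, new_line=True, question_mark=True, exclamation_mark=True)
print(sentences)

#output (assumed): the text is split only at the question mark, yielding two sentences
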
sinatools.utils.tokenizer.corpus_tokenizer(dir_path, output_csv, row_id=1, global_sentence_id=1)

This method tokenizes all text files within the given directory, including files in its subdirectories, and stores the results in a CSV file.

Parameters
  • dir_path (str) – The path of the directory containing multiple Arabic txt files.

  • output_csv (str) – The name of the output CSV file, which will be generated in the current directory where this function is used.

  • row_id (int) – the row_id to start from (default is 1).

  • global_sentence_id (int) – the global_sentence_id to start from (default is 1).

Returns

csv file (str): The CSV file contains the following fields:

  • Row_ID (int) - primary key, unique across all records in the output file.

  • Docs_Sentence_Word_ID (str) - composed as DirectoryName_FileName_GlobalSentenceID_SentenceID_WordPosition.

  • GlobalSentenceID (int) - a unique identifier for each sentence across the entire output file.

  • SentenceID (int) - a unique identifier for each sentence within its source file.

  • Sentence (str) - the text of the sentence to which the word belongs.

  • Word Position (int) - the position of the word within the sentence.

  • Word (str) - the word itself; each row contains one word of the sentence.

Return type

str

Example:

from sinatools.utils.tokenizer import corpus_tokenizer
corpus_tokenizer(dir_path="History", output_csv="outputFile.csv", row_id=1, global_sentence_id=1)
  
#output
# CSV file called: outputFile.csv
# For example, if the 'History' directory contains two files named 'h1.txt' and 'h2.txt',
# the output file will contain:
# Row_ID, Docs_Sentence_Word_ID, Global Sentence ID, Sentence ID, Sentence, Word Position, Word
# 1,History_h1_1_1_1,1,1,الطيور الضارة ومكافحتها,1,الطيور
# 2,History_h1_1_1_2,1,1,الطيور الضارة ومكافحتها,2,الضارة
# 3,History_h1_1_1_3,1,1,الطيور الضارة ومكافحتها,3,ومكافحتها
# 4,History_h2_2_1_1,2,1,بشكل عام,1,بشكل
# 5,History_h2_2_1_2,2,1,بشكل عام,2,عام
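
Since the result is a plain CSV file, it can be read back with any CSV reader. The snippet below is a minimal sketch using Python's standard csv module; the column order is assumed to match the example header above, and the file name follows the example call.

import csv

# Read the generated CSV and print one word per row.
# Column order assumed from the example header:
# Row_ID, Docs_Sentence_Word_ID, Global Sentence ID, Sentence ID, Sentence, Word Position, Word
with open("outputFile.csv", encoding="utf-8", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row_id, doc_word_id, global_id, sent_id, sentence, position, word in reader:
        print(global_id, position, word)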