sinatools.utils.tokenizer
sinatools.utils.tokenizer.sentence_tokenizer(text, dot=True, new_line=True, question_mark=True, exclamation_mark=True)

This method tokenizes text into a list of sentences based on the selected separators: dot, new line, question mark, and exclamation mark.
- Parameters
  - text (str) – Arabic text to be tokenized.
  - dot (bool) – flag to split the text on dots (default is True).
  - new_line (bool) – flag to split the text on new lines (default is True).
  - question_mark (bool) – flag to split the text on question marks (default is True).
  - exclamation_mark (bool) – flag to split the text on exclamation marks (default is True).
- Returns
  A list of sentences.
- Return type
  list

Example:
from sinatools.utils.tokenizer import sentence_tokenizer

sentences = sentence_tokenizer("مختبر سينا لحوسبة اللغة والذكاء الإصطناعي. في جامعة بيرزيت.", dot=True, new_line=True, question_mark=True, exclamation_mark=True)
print(sentences)

# output
# ['مختبر سينا لحوسبة اللغة والذكاء الإصطناعي.', 'في جامعة بيرزيت.']
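Each separator flag can be turned off independently. The following is a minimal sketch, not part of the original documentation, showing the assumed effect of disabling question_mark on a hypothetical two-part text: the question mark no longer splits the text, so only the trailing dot acts as a separator.

from sinatools.utils.tokenizer import sentence_tokenizer

# Hypothetical text: a question followed by a one-word answer.
text = "هل تقع جامعة بيرزيت في فلسطين؟ نعم."

# With question_mark=False, only the trailing dot is treated as a
# separator, so the whole text is expected to come back as one sentence.
print(sentence_tokenizer(text, dot=True, new_line=True, question_mark=False, exclamation_mark=True))
# expected: ['هل تقع جامعة بيرزيت في فلسطين؟ نعم.']

# With question_mark=True (the default), the same text splits in two.
print(sentence_tokenizer(text, dot=True, new_line=True, question_mark=True, exclamation_mark=True))
# expected: ['هل تقع جامعة بيرزيت في فلسطين؟', 'نعم.']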
sinatools.utils.tokenizer.corpus_tokenizer(dir_path, output_csv, row_id=1, global_sentence_id=1)

This method tokenizes a corpus into words. It receives a directory and tokenizes all files within it, as well as all files within its subdirectories, storing the results in one CSV file. The data within each file is split into sentences using the sentence_tokenizer module and into words using a word tokenizer. Additionally, it adds a set of IDs (row_id, docs_sentence_word_id, global_sentence_id, sentence_id, word_position).
- Parameters
  - dir_path (str) – The path of the directory containing multiple Arabic txt files.
  - output_csv (str) – The name of the output CSV file, which is generated in the current directory where this function is called.
  - row_id (int) – By default, the row ID is an auto-incrementing number starting from 1. This parameter lets the user change the starting point for this ID.
  - global_sentence_id (int) – By default, the global sentence ID is an auto-incrementing number, assigned per sentence, starting from 1. This parameter lets the user change the starting point for this ID.
- Returns
  csv file (str): The CSV file contains the following fields:
  - Row_ID (int) – An auto-incrementing number; a unique identifier for each row.
  - Docs_Sentence_Word_ID (str) – The ID for each row, composed of the DirectoryName, FileName, GlobalSentenceID, SentenceID, and WordPosition, concatenated and separated by underscores in this format: DirectoryName_FileName_GlobalSentenceID_SentenceID_WordPosition.
  - GlobalSentenceID (int) – A unique identifier for each sentence in the resulting CSV file.
  - SentenceID (int) – A unique identifier for each sentence within each file.
  - Sentence (str) – The split sentence.
  - Word Position (int) – The position of each word within the sentence.
  - Word (str) – Each row contains one word from the sentence.
- Return type
  csv file

Example:
from sinatools.utils.tokenizer import corpus_tokenizer

corpus_tokenizer(dir_path="History", output_csv="outputFile.csv", row_id=1, global_sentence_id=1)

# output
# A CSV file called: outputFile.csv
# For example, if the 'History' directory contains 2 files named 'h1.txt' and 'h2.txt',
# the output file will contain:
# Row_ID, Docs_Sentence_Word_ID, Global Sentence ID, Sentence ID, Sentence, Word Position, Word
# 1,History_h1_1_1_1,1,1,الطيور الضارة ومكافحتها,1,الطيور
# 2,History_h1_1_1_2,1,1,الطيور الضارة ومكافحتها,2,الضارة
# 3,History_h1_1_1_3,1,1,الطيور الضارة ومكافحتها,3,ومكافحتها
# 4,History_h2_2_1_1,2,1,بشكل عام,1,بشكل
# 5,History_h2_2_1_2,2,1,بشكل عام,2,عام
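Because each word occupies its own row keyed by these IDs, the output file is straightforward to post-process. The following is a minimal sketch, not part of the original documentation, that reads the generated CSV back with Python's standard csv module and regroups words into sentences; the column names are assumed to match the example header above.

import csv

# Read the CSV produced by corpus_tokenizer and regroup words into
# sentences by their Global Sentence ID. Header names are assumed from
# the example output above; skipinitialspace tolerates spaces after commas.
sentences = {}
with open("outputFile.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        sentences.setdefault(row["Global Sentence ID"], []).append(row["Word"])

for sid, words in sentences.items():
    print(sid, " ".join(words))
# expected:
# 1 الطيور الضارة ومكافحتها
# 2 بشكل عام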