sinatools.utils.tokenizer¶
-
sinatools.utils.tokenizer.sentence_tokenizer(text, dot=True, new_line=True, question_mark=True, exclamation_mark=True)¶

This method tokenizes a text into a list of sentences based on the selected separators: the dot, the new line, the question mark, and the exclamation mark.
- Parameters
  - text (str) – Arabic text to be tokenized.
  - dot (bool) – flag to split the text on dots (default is True).
  - new_line (bool) – flag to split the text on new lines (default is True).
  - question_mark (bool) – flag to split the text on question marks (default is True).
  - exclamation_mark (bool) – flag to split the text on exclamation marks (default is True).
- Returns
  A list of sentences.
- Return type
  list
Example:
from sinatools.utils.tokenizer import sentence_tokenizer

sentences = sentence_tokenizer("مختبر سينا لحوسبة اللغة والذكاء الإصطناعي. في جامعة بيرزيت.", dot=True, new_line=True, question_mark=True, exclamation_mark=True)
print(sentences)

# output
# ['مختبر سينا لحوسبة اللغة والذكاء الإصطناعي.', 'في جامعة بيرزيت.']
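To illustrate how flag-controlled, separator-based splitting of this kind can work, the following is a minimal sketch using only the standard library. It is not SinaTools' actual implementation; the function name `split_sentences` and the treatment of the Arabic question mark (؟) alongside the ASCII one are assumptions for this illustration.

```python
import re

def split_sentences(text, dot=True, new_line=True,
                    question_mark=True, exclamation_mark=True):
    """Illustrative separator-based sentence splitter (not the SinaTools code)."""
    # Collect the punctuation marks enabled by the flags.
    marks = []
    if dot:
        marks.append(".")
    if question_mark:
        marks.extend(["?", "؟"])  # ASCII and Arabic question marks (assumption)
    if exclamation_mark:
        marks.append("!")
    parts = [text]
    if marks:
        # Split on whitespace that follows any enabled mark,
        # keeping the mark attached to its sentence.
        pattern = r"(?<=[" + re.escape("".join(marks)) + r"])\s+"
        parts = re.split(pattern, text)
    if new_line:
        # Also treat bare newlines as sentence boundaries.
        parts = [piece for chunk in parts for piece in chunk.split("\n")]
    return [p.strip() for p in parts if p.strip()]
```

With all flags at their defaults, this sketch reproduces the documented output for the example above; disabling a flag simply removes that mark from the separator set.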
-
sinatools.utils.tokenizer.corpus_tokenizer(dir_path, output_csv, row_id=1, global_sentence_id=1)¶

This method receives a directory and tokenizes all files within it, including all files in its subdirectories. The results are stored in a CSV file.
- Parameters
  - dir_path (str) – The path of the directory containing multiple Arabic txt files.
  - output_csv (str) – The name of the output CSV file, which will be generated in the current working directory.
  - row_id (int) – The row_id to start with (default is 1).
  - global_sentence_id (int) – The global_sentence_id to start with (default is 1).
- Returns
  csv file (str): The CSV file contains the following fields:
  - Row_ID (int) – primary key, unique for all records in the output file.
  - Docs_Sentence_Word_ID (str) – formatted as DirectoryName_FileName_GlobalSentenceID_SentenceID_WordPosition.
  - GlobalSentenceID (int) – a unique identifier for each sentence across the entire CSV file.
  - SentenceID (int) – a unique identifier for each sentence within its file.
  - Sentence (str) – the generated text that forms a sentence.
  - Word Position (int) – the position of each word within the sentence.
  - Word (str) – each row contains one word from the generated sentence.
- Return type
  csv file
Example:
from sinatools.utils.tokenizer import corpus_tokenizer

corpus_tokenizer(dir_path="History", output_csv="outputFile.csv", row_id=1, global_sentence_id=1)

# output
# A CSV file called outputFile.csv.
# For example, if the 'History' directory contains 2 files named 'h1.txt' and 'h2.txt',
# the output file will contain:
# Row_ID, Docs_Sentence_Word_ID, Global Sentence ID, Sentence ID, Sentence, Word Position, Word
# 1,History_h1_1_1_1,1,1,الطيور الضارة ومكافحتها,1,الطيور
# 2,History_h1_1_1_2,1,1,الطيور الضارة ومكافحتها,2,الضارة
# 3,History_h1_1_1_3,1,1,الطيور الضارة ومكافحتها,3,ومكافحتها
# 4,History_h2_2_1_1,2,1,بشكل عام,1,بشكل
# 5,History_h2_2_1_2,2,1,بشكل عام,2,عام
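To clarify how the output fields above fit together, here is a minimal sketch of a directory-to-CSV tokenizer using only the standard library. It is not SinaTools' actual implementation: the function name `tokenize_corpus` is hypothetical, sentences are naively split on dots, words on whitespace, and the DirectoryName component is taken from the top-level directory only.

```python
import csv
import os

def tokenize_corpus(dir_path, output_csv, row_id=1, global_sentence_id=1):
    """Illustrative sketch: walk dir_path, tokenize .txt files, write one word per CSV row."""
    dir_name = os.path.basename(os.path.normpath(dir_path))
    with open(output_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["Row_ID", "Docs_Sentence_Word_ID", "Global Sentence ID",
                         "Sentence ID", "Sentence", "Word Position", "Word"])
        # os.walk also visits subdirectories, matching the documented behavior.
        for root, _dirs, files in os.walk(dir_path):
            for file_name in sorted(files):
                if not file_name.endswith(".txt"):
                    continue
                stem = os.path.splitext(file_name)[0]
                with open(os.path.join(root, file_name), encoding="utf-8") as src:
                    text = src.read()
                # Naive sentence split on dots (stand-in for sentence_tokenizer).
                sentences = [s.strip() for s in text.split(".") if s.strip()]
                for sentence_id, sentence in enumerate(sentences, start=1):
                    for word_pos, word in enumerate(sentence.split(), start=1):
                        doc_id = (f"{dir_name}_{stem}_{global_sentence_id}"
                                  f"_{sentence_id}_{word_pos}")
                        writer.writerow([row_id, doc_id, global_sentence_id,
                                         sentence_id, sentence, word_pos, word])
                        row_id += 1
                    global_sentence_id += 1
```

Note the design of Docs_Sentence_Word_ID: because it concatenates the directory, file, global sentence ID, sentence ID, and word position, it stays unique across every file in the corpus, which is what allows the CSV rows to be traced back to their source word.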