sinatools.utils.text_dublication_detector

sinatools.utils.text_dublication_detector.removal(csv_file, column_name, final_file_name, deleted_file_name, similarity_threshold)

This method identifies duplicate text in a given corpus. It processes a CSV file of sentences and removes duplicate sentences based on a specified similarity threshold. Cosine similarity is used to measure the similarity between words and sentences. The method saves the filtered results and the identified duplicates to separate files.
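As a rough illustration of threshold-based duplicate detection with cosine similarity, the sketch below compares sentences as whitespace-token count vectors and greedily keeps a sentence only if it is not too similar to one already kept. This is not the sinatools implementation: the helper names `cosine_similarity` and `split_duplicates`, the bag-of-words representation, and the `>=` comparison against the threshold are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the token-count vectors of two strings.

    Assumption: tokens are obtained by whitespace splitting; the library
    may represent text differently.
    """
    va, vb = Counter(a.split()), Counter(b.split())
    # Dot product over the tokens the two strings share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    # An empty string has a zero vector; treat its similarity as 0.
    return dot / norm if norm else 0.0


def split_duplicates(sentences, threshold=0.8):
    """Hypothetical helper: split sentences into (kept, removed).

    A sentence is removed when its similarity to any already-kept
    sentence is at or above the threshold (assumed to be inclusive).
    """
    kept, removed = [], []
    for sentence in sentences:
        if any(cosine_similarity(sentence, k) >= threshold for k in kept):
            removed.append(sentence)
        else:
            kept.append(sentence)
    return kept, removed


if __name__ == "__main__":
    rows = [
        "the cat sat on the mat",
        "the cat sat on the mat",
        "a dog ran away",
    ]
    kept, removed = split_duplicates(rows, threshold=0.8)
    print("kept:", kept)
    print("removed:", removed)
```

In this sketch the two lists play the roles of the filtered file and the deleted-records file. Lowering the threshold removes more near-duplicates, at the cost of also dropping sentences that merely share vocabulary.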

Parameters

csv_file (str) – Path to the CSV file containing the Arabic text that needs to be cleaned.
column_name (str) – Name of the column containing the text to be checked for duplicates.
final_file_name (str) – Name of the CSV file that will contain the data after duplicate removal.
deleted_file_name (str) – Name of the file that will contain all the duplicate records that were deleted.
similarity_threshold (float) – Similarity threshold the function uses when deleting duplicates from the text column, expressed as a fraction rather than a percentage. The default value is 0.8, i.e. 80% similarity.

Returns

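Two CSV files: the data after duplicate removal (final_file_name) and the deleted duplicate records (deleted_file_name).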

Example:

from sinatools.utils.text_dublication_detector import removal
removal("/path/to/csv/file1", "sentences", "/path/to/csv/file2", "/path/to/csv/file3", 0.8)