sinatools.utils.text_dublication_detector
sinatools.utils.text_dublication_detector.removal(csv_file, column_name, final_file_name, deleted_file_name, similarity_threshold)

This method identifies duplicate text in a given corpus. It processes a CSV file of sentences and removes duplicate sentences based on a specified similarity threshold. Cosine similarity is used to measure the similarity between words and sentences. The method saves the filtered results and the identified duplicates to separate files.
- Parameters
csv_file (str) – The CSV file containing the Arabic text that needs to be cleaned.
column_name (str) – The name of the column containing the text to be checked for duplicates.
final_file_name (str) – The name of the CSV file that will contain the data after duplicate removal.
deleted_file_name (str) – The name of the file that will contain all the duplicate records that were deleted.
similarity_threshold (float) – The similarity level that the function uses when deleting duplicates from the text column, given as a floating-point number. The default value is 0.8, which corresponds to 80% similarity.
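- Returns
CSV files: one with the data after duplicate removal (final_file_name) and one with the deleted duplicate records (deleted_file_name).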
Example:
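from sinatools.utils.text_dublication_detector import removal
# "sentences" is the name of the text column; "/path/to/csv/file3" is a placeholder path for deleted_file_name.
removal("/path/to/csv/file1", "sentences", "/path/to/csv/file2", "/path/to/csv/file3", 0.8)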
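
The idea described above (compare sentences with cosine similarity and move any sentence that reaches the threshold into a separate file) can be sketched in plain Python. This is only an illustrative sketch, not the SinaTools implementation: the function names `cosine_similarity`, `split_duplicates`, and `remove_near_duplicates` are hypothetical, and the whitespace tokenization, bag-of-words count vectors, greedy "keep the first occurrence" rule, and `>=` comparison are all assumptions.

```python
import csv
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between whitespace-token count vectors (assumed representation)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    # An empty string has a zero vector; treat it as similar to nothing.
    return dot / norm if norm else 0.0


def split_duplicates(rows, column_name, similarity_threshold=0.8):
    """Greedy pass: keep a row unless it is too similar to a row that was already kept."""
    kept, deleted = [], []
    for row in rows:
        text = row[column_name]
        if any(cosine_similarity(text, k[column_name]) >= similarity_threshold for k in kept):
            deleted.append(row)
        else:
            kept.append(row)
    return kept, deleted


def write_csv(path, fieldnames, rows):
    """Write dict rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def remove_near_duplicates(csv_file, column_name, final_file_name,
                           deleted_file_name, similarity_threshold=0.8):
    """Hypothetical stand-in that mirrors the parameter list documented above."""
    with open(csv_file, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    kept, deleted = split_duplicates(rows, column_name, similarity_threshold)
    write_csv(final_file_name, fieldnames, kept)       # data after duplicate removal
    write_csv(deleted_file_name, fieldnames, deleted)  # records flagged as duplicates
    return kept, deleted
```

Each row is compared against every row kept so far, so the sketch needs a quadratic number of comparisons in the worst case. That is acceptable for small files, but a large corpus would need a vectorized or indexed approach.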