sinatools.utils.text_dublication_detector
sinatools.utils.text_dublication_detector.removal(csv_file, column_name, final_file_name, deleted_file_name, similarity_threshold)
Processes a CSV file of sentences to identify and remove duplicate sentences, using cosine similarity and a specified similarity threshold. The filtered results and the identified duplicates are saved to separate files.
- Parameters
csv_file (str) – The CSV file containing the Arabic text to be cleaned.
column_name (str) – The name of the column containing the text to check for duplicates.
final_file_name (str) – The name of the CSV file that will contain the data after duplicate removal.
deleted_file_name (str) – The name of the file that will contain all the duplicate records that were deleted.
similarity_threshold (float) – The similarity threshold, as a fraction, that the function uses when deleting duplicates from the text column. The default value is 0.8 (80% similarity).
- Returns
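CSV files: one with the data after duplicate removal (final_file_name) and one with the deleted duplicate records (deleted_file_name).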
Example:
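from sinatools.utils.text_dublication_detector import removal
# Arguments: input CSV, text column name, output file for kept rows, output file for removed duplicates, threshold
removal("/path/to/csv/file", "sentences", "/path/to/final/file", "/path/to/deleted/file", 0.8)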
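To make the role of similarity_threshold concrete, the following is a minimal, standard-library sketch of threshold-based duplicate removal with cosine similarity. It is not the SinaTools implementation: the names cosine_similarity, split_duplicates and removal_sketch are hypothetical, and representing each sentence as whitespace-token counts, as well as treating a similarity equal to the threshold as a duplicate, are assumptions made only for illustration.

```python
import csv
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between whitespace-token count vectors (an assumed representation)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def split_duplicates(rows, column_name, similarity_threshold=0.8):
    """Return (kept, deleted); a row is deleted if it is too similar to an already kept row."""
    kept, deleted = [], []
    for row in rows:
        text = row[column_name]
        # Assumption: similarity >= threshold counts as a duplicate.
        if any(cosine_similarity(text, k[column_name]) >= similarity_threshold for k in kept):
            deleted.append(row)
        else:
            kept.append(row)
    return kept, deleted


def removal_sketch(csv_file, column_name, final_file_name, deleted_file_name, similarity_threshold=0.8):
    """Hypothetical stand-in mirroring the documented parameters; writes two CSV files."""
    with open(csv_file, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        rows = list(reader)
    kept, deleted = split_duplicates(rows, column_name, similarity_threshold)
    for path, subset in ((final_file_name, kept), (deleted_file_name, deleted)):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)
    return kept, deleted
```

In this sketch, raising the threshold toward 1.0 removes only near-identical sentences, while lowering it removes looser matches as well. The pairwise comparison is quadratic in the number of rows, so it is suitable only for illustrating the idea on small files.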