CLI.utils.text_dublication_detector

About:

The text_dublication_detector command processes a CSV file of sentences to identify and remove duplicates, using cosine similarity measured against a specified threshold. It saves the filtered results and the removed duplicates to separate files.
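To make the idea of threshold-based duplicate removal concrete, here is a minimal sketch in Python. It is not the command's actual implementation: the word-count vectors, the "keep the first occurrence" rule, the `>=` comparison, and the function names `cosine_similarity` and `split_duplicates` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two sentences, using whitespace-split word counts.

    Assumption: the real tool may vectorize text differently.
    """
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def split_duplicates(sentences, threshold=0.8):
    """Split sentences into (kept, deleted).

    A sentence is moved to `deleted` when its similarity to any
    already-kept sentence is at or above the threshold (hypothetical rule).
    """
    kept, deleted = [], []
    for sentence in sentences:
        if any(cosine_similarity(sentence, k) >= threshold for k in kept):
            deleted.append(sentence)
        else:
            kept.append(sentence)
    return kept, deleted


kept, deleted = split_duplicates(
    ["the cat sat on the mat", "the cat sat on the mat today", "dogs bark loudly"],
    threshold=0.8,
)
```

In this sketch the second sentence scores roughly 0.94 against the first, so it lands in `deleted`, while the third shares no words with the others and is kept. Raising the threshold makes the filter stricter about what counts as a duplicate.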

Usage:

Below is the usage information, which can be displayed by running text_dublication_detector --help.

text_dublication_detector --csv_file "path/to/csv/file" --column_name "text" --final_file_name "path/to/csv/file" --deleted_file_name "path/to/csv/file" --similarity_threshold 0.8

Options:

--csv_file
    Path to the CSV file containing the Arabic text to be cleaned.
--column_name WORD1 WORD2 ...
    The name of the column containing the text to check for duplicates.
--final_file_name
    The name of the CSV file that will contain the data after duplicate removal.
--deleted_file_name
    The name of the file that will contain all the duplicate records that were deleted.
--similarity_threshold
    A floating-point number setting the similarity level used when deleting duplicates from the text column. The default value is 0.8, that is, 80% similarity.

Examples:

text_dublication_detector --csv_file "text.csv" --column_name "A" --final_file_name "Final.csv" --deleted_file_name "deleted.csv" --similarity_threshold 0.8
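After a run like the one above, a quick sanity check on the output files can be useful. The sketch below is a hypothetical helper, not part of the tool. It assumes that both output files are CSV files with a header row, and that every input row ends up in exactly one of them. The names `count_rows` and `check_outputs` are made up for illustration.

```python
import csv


def count_rows(path):
    """Number of data rows in a CSV file, excluding the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.DictReader(f))


def check_outputs(original, final, deleted):
    """Return (kept, removed, balanced) for a finished run.

    `balanced` is True when kept + removed equals the input row count,
    which holds only under the assumption stated above.
    """
    total = count_rows(original)
    kept = count_rows(final)
    removed = count_rows(deleted)
    return kept, removed, kept + removed == total
```

For the example command, this would be called as `check_outputs("text.csv", "Final.csv", "deleted.csv")`. If `balanced` comes back False, either the assumption about the output layout does not hold or rows were lost along the way.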