CLI.utils.text_dublication_detector
About:
The text_dublication_detector command processes a CSV file of sentences and removes near-duplicate sentences, identified by comparing their cosine similarity against a specified threshold. It saves the filtered results and the removed duplicates to separate files.
Usage:
Below is the usage information, which can also be displayed by running text_dublication_detector --help.
text_dublication_detector --csv_file "path/to/csv/file" --column_name "text" --final_file_name "path/to/csv/file" --deleted_file_name "path/to/csv/file" --similarity_threshold 0.8
Options:
--csv_file
The path to the CSV file containing the Arabic text from which duplicates should be removed.
--column_name
This is the name of the column containing the text for duplicate removal.
--final_file_name
This is the name of the CSV file that will contain the data after duplicate removal.
--deleted_file_name
This is the name of the file that will contain all the duplicate records that are deleted.
--similarity_threshold
This is a floating-point number giving the cosine-similarity threshold, expressed as a fraction rather than a percentage, that is used to decide which entries in the text column are deleted as duplicates. The default value is 0.8.
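To make the role of the threshold concrete, the following is a minimal Python sketch of threshold-based duplicate removal with cosine similarity. It is an illustration only, not the command's actual implementation: the whitespace tokenization, the word-count vectors, the `>=` comparison, and the helper names `cosine_similarity` and `split_duplicates` are all assumptions introduced here.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two sentences using whitespace-token counts.

    Assumption: the real tool may vectorize text differently.
    """
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)


def split_duplicates(sentences, threshold=0.8):
    """Return (kept, deleted).

    A sentence goes to `deleted` when its similarity to any already kept
    sentence is at least `threshold`; otherwise it is kept.
    """
    kept, deleted = [], []
    for s in sentences:
        if any(cosine_similarity(s, k) >= threshold for k in kept):
            deleted.append(s)
        else:
            kept.append(s)
    return kept, deleted


sentences = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "dogs bark loudly",
]
kept, deleted = split_duplicates(sentences, threshold=0.8)
print(kept)
print(deleted)
```

In this sketch the first two sentences have a similarity of roughly 0.94, so the second one is treated as a duplicate at a threshold of 0.8 but would be kept at 0.95. Raising the threshold therefore deletes fewer rows, and lowering it deletes more.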
Examples:
text_dublication_detector --csv_file "text.csv" --column_name "A" --final_file_name "Final.csv" --deleted_file_name "deleted.csv" --similarity_threshold 0.8
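When the command has to be called from a script, the same invocation can be assembled programmatically. The sketch below uses only the flags documented above; the helper name `build_command` is hypothetical and is not part of the tool.

```python
import shlex


def build_command(csv_file, column_name, final_file_name, deleted_file_name,
                  similarity_threshold=0.8):
    """Assemble the argument list for text_dublication_detector.

    Hypothetical helper: it only arranges the documented flags and
    does not run or validate anything.
    """
    return [
        "text_dublication_detector",
        "--csv_file", csv_file,
        "--column_name", column_name,
        "--final_file_name", final_file_name,
        "--deleted_file_name", deleted_file_name,
        "--similarity_threshold", str(similarity_threshold),
    ]


cmd = build_command("text.csv", "A", "Final.csv", "deleted.csv", 0.8)
print(shlex.join(cmd))  # shell-safe rendering of the command line

# To execute it (assumes the command is installed and on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when file paths contain spaces.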