CLI.utils.jaccard_similarity
About:
The jaccard_similarity command computes the Jaccard similarity between two sets of strings. The Jaccard similarity is the size of the intersection of the two sets divided by the size of their union. It ranges from 0 (no shared elements) to 1 (identical sets).
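The definition above can be sketched in a few lines of Python. This is an illustration of the formula only, not the command's own implementation; the function name and the choice to return 0.0 when both inputs are empty are assumptions made for this sketch.

```python
def jaccard_similarity(list1, list2):
    """Return |A ∩ B| / |A ∪ B| for two iterables of strings."""
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:
        # Both inputs are empty, so the ratio is undefined.
        # Returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return len(set1 & set2) / len(union)


a = ["word1", "word2", "word3"]
b = ["word2", "word3", "word4"]

# Two shared words out of four distinct words overall.
print(jaccard_similarity(a, b))  # 0.5
```

Duplicates within a list do not affect the result, because both inputs are converted to sets first.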
Usage:
The usage information below can be displayed by running jaccard_similarity --help.
jaccard_similarity --list1="WORD1, WORD2" --list2="WORD1,WORD2" --delimiter="DELIMITER" --selection="SELECTION"
jaccard_similarity --file1=File1 --file2=File2 --delimiter="DELIMITER" --selection="SELECTION"
Options:
--list1 WORD1 WORD2 ...
First list of strings (delimiter-separated).
--list2 WORD1 WORD2 ...
Second list of strings (delimiter-separated).
--file1
Path to the file containing the first set of words.
--file2
Path to the file containing the second set of words.
--delimiter
The character or string that separates individual words in each list or file (for example, ",").
--selection
The Jaccard function to compute. Must be one of: 'jaccardAll', 'intersection', 'union', or 'similarity'.
--ignoreAllDiacriticsButNotShadda
If this option is set, all diacritics except the shadda are removed from both lists before they are compared.
--ignoreShaddaDiacritic
If this option is set, the shadda diacritic is removed from both lists before they are compared.
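To illustrate what the two diacritic options describe, here is a minimal Python sketch of the normalization step. It is not the command's own code. It assumes that "diacritics" means the Arabic marks in the Unicode range U+064B to U+0652 (fathatan through sukun, with the shadda at U+0651); the function names are hypothetical.

```python
import re

SHADDA = "\u0651"

# Assumed diacritic set: U+064B..U+0652, with the shadda (U+0651) left out
# of the pattern so that it survives the substitution.
DIACRITICS_EXCEPT_SHADDA = re.compile("[\u064B-\u0650\u0652]")


def strip_all_but_shadda(word):
    """Remove every diacritic in the assumed range except the shadda."""
    return DIACRITICS_EXCEPT_SHADDA.sub("", word)


def strip_shadda(word):
    """Remove only the shadda."""
    return word.replace(SHADDA, "")


# meem + damma, dal + fatha, reh + shadda + kasra, seen
vocalized = "\u0645\u064F\u062F\u064E\u0631\u0651\u0650\u0633"

kept_shadda = strip_all_but_shadda(vocalized)
bare = strip_shadda(kept_shadda)

print(len(vocalized), len(kept_shadda), len(bare))  # 8 5 4
```

Applying both functions, as when both options are passed together, leaves only the base letters, so differently vocalized spellings of the same word compare as equal.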
Examples:
jaccard_similarity --list1 "word1,word2" --list2 "word1, word2" --delimiter "," --selection "jaccardAll" --ignoreAllDiacriticsButNotShadda --ignoreShaddaDiacritic
jaccard_similarity --file1 "path/to/your/file1.txt" --file2 "path/to/your/file2.txt" --delimiter "," --selection "jaccardAll" --ignoreAllDiacriticsButNotShadda --ignoreShaddaDiacritic
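The first example passes "word1,word2" for one list and "word1, word2" for the other. Whether the command trims the space after the delimiter is not stated here, so the Python sketch below only shows why it matters. The helper name split_words is hypothetical, and trimming whitespace and dropping empty tokens are assumptions of this sketch.

```python
def split_words(text, delimiter=","):
    """Split a delimiter-separated string into a set of trimmed, non-empty words."""
    return {token.strip() for token in text.split(delimiter) if token.strip()}


with_trim_1 = split_words("word1,word2")
with_trim_2 = split_words("word1, word2")

# Without trimming, the second list would contain " word2" (leading space),
# which does not match "word2".
without_trim_2 = set("word1, word2".split(","))

print(with_trim_1 == with_trim_2)     # True
print(with_trim_1 == without_trim_2)  # False
```

If results look lower than expected, checking for stray whitespace around the delimiter in the input is a reasonable first step.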