CLI.utils.jaccard_similarity

About:

The jaccard_similarity command computes the Jaccard similarity between two sets of strings. The Jaccard similarity is the size of the intersection of the two sets divided by the size of their union, so it ranges from 0 (no strings in common) to 1 (identical sets).
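The definition above can be sketched in a few lines of Python. This is an illustrative sketch only, not the command's actual implementation: the function names jaccard and split_words are hypothetical, and returning 1.0 for two empty sets is an assumed convention.

```python
def split_words(text, delimiter=","):
    """Split a delimiter-separated string and trim surrounding whitespace."""
    return [w.strip() for w in text.split(delimiter) if w.strip()]


def jaccard(list1, list2):
    """Return (intersection, union, similarity) for two lists of strings."""
    set1, set2 = set(list1), set(list2)
    intersection = set1 & set2
    union = set1 | set2
    # Assumed convention: two empty sets count as identical (similarity 1.0).
    similarity = len(intersection) / len(union) if union else 1.0
    return intersection, union, similarity


# Two of the four distinct words are shared, so the similarity is 2 / 4.
inter, uni, sim = jaccard(split_words("a, b, c"), split_words("b,c,d"))
print(sorted(inter), sorted(uni), sim)
```

The three returned values loosely mirror the 'intersection', 'union', and 'similarity' selections listed under Options.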

Usage:

Below is the usage information, which can be displayed by running jaccard_similarity --help.

jaccard_similarity --list1="WORD1,WORD2" --list2="WORD1,WORD2" --delimiter="DELIMITER" --selection="SELECTION"
jaccard_similarity --file1=File1 --file2=File2 --delimiter="DELIMITER" --selection="SELECTION"

Options:

--list1 WORD1 WORD2 ...
      First list of strings (delimiter-separated).
--list2 WORD1 WORD2 ...
      Second list of strings (delimiter-separated).
--file1
      File containing the first set of words.
--file2
      File containing the second set of words.
--delimiter
      Delimiter that separates the strings in each list or file (for example, ",").
--selection
      Which Jaccard function to apply; one of 'jaccardAll', 'intersection', 'union', or 'similarity'.
--ignoreAllDiacriticsButNotShadda
      If set, all diacritics except the shadda are removed from both lists before they are compared.
--ignoreShaddaDiacritic
      If set, the shadda diacritic is removed from both lists before they are compared.
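
The two diacritic options can be pictured as normalization steps applied to each word before comparison. The sketch below is an assumption about how such stripping might work, not the command's actual code: the helper names are hypothetical, and it treats the Unicode range U+064B to U+0652 (tanween, fatha, damma, kasra, shadda, sukun) as "all diacritics", which may be narrower or wider than what the tool removes.

```python
import re

SHADDA = "\u0651"
# Assumed diacritic set: U+064B..U+0652, with the shadda (U+0651) left out.
ALL_BUT_SHADDA = re.compile("[\u064B-\u0650\u0652]")


def strip_all_but_shadda(word):
    """Remove every diacritic in the assumed range except the shadda."""
    return ALL_BUT_SHADDA.sub("", word)


def strip_shadda(word):
    """Remove only the shadda."""
    return word.replace(SHADDA, "")


# A fully vocalized word: meem, damma, haa, fatha, meem, shadda, fatha, dal.
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
kept_shadda = strip_all_but_shadda(vocalized)
bare = strip_shadda(kept_shadda)
print(len(vocalized), len(kept_shadda), len(bare))
```

Applying both helpers, as the examples do by passing both flags, leaves only the base letters.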

Examples:

jaccard_similarity --list1 "word1,word2" --list2 "word1, word2" --delimiter "," --selection "jaccardAll" --ignoreAllDiacriticsButNotShadda --ignoreShaddaDiacritic
jaccard_similarity --file1 "path/to/your/file1.txt" --file2 "path/to/your/file2.txt" --delimiter "," --selection "jaccardAll" --ignoreAllDiacriticsButNotShadda --ignoreShaddaDiacritic