sinatools.utils.similarity

sinatools.utils.similarity.get_intersection(list1, list2, ignore_all_diacratics_but_not_shadda=False, ignore_shadda_diacritic=False)

Computes the intersection of two sets of Arabic words, considering the differences in their diacritization. The method provides two options for handling diacritics: (i) ignore all diacritics except for shadda, and (ii) ignore the shadda diacritic as well. You can try the demo online.

Parameters
  • list1 (list) – The first list.

  • list2 (list) – The second list.

  • ignore_all_diacratics_but_not_shadda (bool, optional) – A flag to ignore all diacritics except for the shadda. Defaults to False.

  • ignore_shadda_diacritic (bool, optional) – A flag to ignore the shadda diacritic. Defaults to False.

Returns

The intersection of the two lists, ignoring diacritics as selected by the flags.

Return type

list

Example:

from sinatools.utils.similarity import get_intersection
list1 = ["كتب","فَعل","فَعَلَ"]
list2 = ["كتب","فَعّل"]
print(get_intersection(list1, list2, False, True))
#output: ["كتب" ,"فَعل"]
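
The effect of the two flags can be illustrated with a minimal standalone sketch. It assumes that comparison amounts to stripping the selected diacritics from every word and then intersecting the results; the names normalize and intersection_sketch are hypothetical and this is not the sinatools implementation, which may differ in detail.

```python
# Hypothetical helpers for illustration only; not part of sinatools.
SHADDA = "\u0651"
# Tanween, short vowels and sukun (U+064B..U+0650, U+0652); shadda is handled separately.
OTHER_DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0652"


def normalize(word, ignore_all_but_shadda=False, ignore_shadda=False):
    """Strip the diacritics selected by the two flags from a word."""
    to_remove = ""
    if ignore_all_but_shadda:
        to_remove += OTHER_DIACRITICS
    if ignore_shadda:
        to_remove += SHADDA
    return word.translate({ord(ch): None for ch in to_remove})


def intersection_sketch(list1, list2, ignore_all_but_shadda=False, ignore_shadda=False):
    """Return the normalized words of list1 that also occur in list2, in list1 order."""
    seen2 = {normalize(w, ignore_all_but_shadda, ignore_shadda) for w in list2}
    result = []
    for word in list1:
        key = normalize(word, ignore_all_but_shadda, ignore_shadda)
        if key in seen2 and key not in result:  # skip duplicates
            result.append(key)
    return result


list1 = ["كتب", "فَعل", "فَعَلَ"]
list2 = ["كتب", "فَعّل"]
# Only the shadda is ignored, as in the example above.
print(intersection_sketch(list1, list2, False, True))
```

Under this assumption, stripping the shadda from the single word in list2 makes it equal to the second word of list1, which is why that word appears in the result.
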

sinatools.utils.similarity.get_union(list1, list2, ignore_all_diacratics_but_not_shadda, ignore_shadda_diacritic)

Computes the union of two sets of Arabic words, considering the differences in their diacritization. The method provides two options for handling diacritics: (i) ignore all diacritics except for shadda, and (ii) ignore the shadda diacritic as well. You can try the demo online.

Parameters
  • list1 (list) – The first list.

  • list2 (list) – The second list.

  • ignore_all_diacratics_but_not_shadda (bool) – A flag indicating whether to ignore all diacritics except for the shadda.

  • ignore_shadda_diacritic (bool) – A flag indicating whether to ignore the shadda diacritic.

Returns

The union of the two lists, ignoring diacritics if flags are true.

Return type

list

Example 1:

from sinatools.utils.similarity import get_union
list1 = ["كتب","فَعل","فَعَلَ"]
list2 = ["كتب","فَعّل"]
print(get_union(list1, list2, False, True))
#output: ["كتب" ,"فَعل" ,"فَعَلَ"]

Example 2:

from sinatools.utils.similarity import get_union
list1 = ["كتب","فَعل","فَعَلَ"]
list2 = ["كتب","فَعّل"]
print(get_union(list1, list2, True, True))
#output: ["كتب" ,"فعل"]
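
To see how the four flag combinations change the union, the following sketch strips the selected marks with a regular expression and then removes duplicates while keeping first-seen order. It is an illustration under the assumption that the flags simply select which diacritics are dropped before comparison; diacritics_pattern and union_sketch are hypothetical names, not sinatools functions.

```python
import re


def diacritics_pattern(ignore_all_but_shadda, ignore_shadda):
    """Build a regex matching the marks to drop, or None if nothing is dropped."""
    chars = ""
    if ignore_all_but_shadda:
        chars += "\u064B-\u0650\u0652"  # tanween, short vowels, sukun
    if ignore_shadda:
        chars += "\u0651"  # shadda
    return re.compile(f"[{chars}]") if chars else None


def union_sketch(list1, list2, ignore_all_but_shadda, ignore_shadda):
    """Order-preserving union of both lists after dropping the selected marks."""
    pattern = diacritics_pattern(ignore_all_but_shadda, ignore_shadda)
    words = list1 + list2
    if pattern is not None:
        words = [pattern.sub("", w) for w in words]
    # dict.fromkeys keeps the first occurrence of each word, in order
    return list(dict.fromkeys(words))


list1 = ["كتب", "فَعل", "فَعَلَ"]
list2 = ["كتب", "فَعّل"]
for flags in [(False, False), (True, False), (False, True), (True, True)]:
    print(flags, union_sketch(list1, list2, *flags))
```

The more diacritics are ignored, the more spellings collapse into one entry, so the union shrinks from four words with both flags off to two words with both flags on.
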

sinatools.utils.similarity.get_jaccard_similarity(list1: list, list2: list, ignore_all_diacratics_but_not_shadda: bool, ignore_shadda_diacritic: bool)

Calculates the Jaccard similarity coefficient between two lists of Arabic words, considering the differences in their diacritization. The method provides two options for handling diacritics: (i) ignore all diacritics except for shadda, and (ii) ignore the shadda diacritic as well. You can try the demo online.

Parameters
  • list1 (list) – The first list.

  • list2 (list) – The second list.

  • ignore_all_diacratics_but_not_shadda (bool) – A flag indicating whether to ignore all diacritics except for shadda.

  • ignore_shadda_diacritic (bool) – A flag indicating whether to ignore the shadda diacritic.

Returns

The Jaccard similarity coefficient between the two lists, ignoring diacritics if flags are true.

Return type

float

Example 1:

from sinatools.utils.similarity import get_jaccard_similarity
list1 = ["كتب","فَعل","فَعَلَ"]
list2 = ["كتب","فَعّل"]
print(get_jaccard_similarity(list1, list2, True, True))
#output: 1.0

Example 2:

from sinatools.utils.similarity import get_jaccard_similarity
list1 = ["كتب","فَعل","فَعَلَ"]
list2 = ["كتب","فَعّل"]
print(get_jaccard_similarity(list1, list2, False, False))
#output: 0.25
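
The two outputs above follow from the usual definition of the Jaccard coefficient, the size of the intersection divided by the size of the union. The sketch below recomputes both values on plain Python sets; jaccard_sketch and strip_all are hypothetical names, and returning 0.0 for two empty lists is an assumption of this sketch, not documented library behavior.

```python
def strip_all(word):
    """Drop every Arabic diacritic in U+064B..U+0652, shadda included."""
    return word.translate(dict.fromkeys(range(0x064B, 0x0653)))


def jaccard_sketch(list1, list2, normalize=lambda w: w):
    """Jaccard coefficient |A & B| / |A | B| over the normalized word sets."""
    set1 = {normalize(w) for w in list1}
    set2 = {normalize(w) for w in list2}
    union = set1 | set2
    if not union:
        return 0.0  # assumption: two empty inputs count as no similarity
    return len(set1 & set2) / len(union)


list1 = ["كتب", "فَعل", "فَعَلَ"]
list2 = ["كتب", "فَعّل"]
print(jaccard_sketch(list1, list2))             # exact match: 1 shared / 4 distinct = 0.25
print(jaccard_sketch(list1, list2, strip_all))  # all marks ignored: 2 shared / 2 distinct = 1.0
```
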

sinatools.utils.similarity.get_jaccard(delimiter, str1, str2, selection, ignoreAlldiacraticsButNotShadda=True, ignoreShaddaDiacritic=True)

Computes the intersection, the union, the Jaccard similarity, or all three for two delimiter-separated strings of Arabic words, considering the differences in their diacritization. The method provides two options for handling diacritics: (i) ignore all diacritics except for shadda, and (ii) ignore the shadda diacritic as well. You can try the demo online.

Parameters
  • delimiter (str) – The delimiter used to split the input strings.

  • str1 (str) – The first input string to compare.

  • str2 (str) – The second input string to compare.

  • selection (str) – The desired operation to perform on the two sets of strings. Must be one of intersection, union, jaccardSimilarity, or jaccardAll.

  • ignoreAlldiacraticsButNotShadda (bool) – If True, ignore all diacritics except for the shadda. Defaults to True.

  • ignoreShaddaDiacritic (bool) – If True, ignore the shadda diacritic. Defaults to True.

Returns

The intersection, the union, or the Jaccard similarity of the two word lists, depending on the selection parameter; with jaccardAll, all three are returned together.

Example:

from sinatools.utils.similarity import get_jaccard
str1 = "فَعَلَ | فَعل"
str2 = "فَعّل"
print(get_jaccard("|", str1, str2, "jaccardAll", True, True))
#output: ['intersection:', ['فعل'], 'union:', ['فعل'], 'similarity:', 1.0]
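
The string-based interface can be pictured as a split step followed by a dispatch on selection. The sketch below is a hedged illustration: jaccard_from_strings and strip_all are hypothetical names, trimming whitespace around each word and raising ValueError for an unknown selection are assumptions, and the shape of the jaccardAll result simply mirrors the output shown above. It always ignores all diacritics, which corresponds to both flags being True.

```python
def strip_all(word):
    """Drop every Arabic diacritic in U+064B..U+0652, shadda included."""
    return word.translate(dict.fromkeys(range(0x064B, 0x0653)))


def jaccard_from_strings(delimiter, str1, str2, selection):
    """Split both strings on the delimiter, normalize, then dispatch on selection."""
    # Assumption: surrounding whitespace is trimmed and empty pieces are skipped.
    words1 = [strip_all(w.strip()) for w in str1.split(delimiter) if w.strip()]
    words2 = [strip_all(w.strip()) for w in str2.split(delimiter) if w.strip()]
    set2 = set(words2)
    intersection = [w for w in dict.fromkeys(words1) if w in set2]
    union = list(dict.fromkeys(words1 + words2))
    similarity = len(intersection) / len(union) if union else 0.0
    if selection == "jaccardAll":
        return ["intersection:", intersection, "union:", union, "similarity:", similarity]
    single = {"intersection": intersection, "union": union, "jaccardSimilarity": similarity}
    if selection not in single:
        raise ValueError(f"unknown selection: {selection!r}")  # assumption of this sketch
    return single[selection]


str1 = "فَعَلَ | فَعل"
str2 = "فَعّل"
print(jaccard_from_strings("|", str1, str2, "jaccardAll"))
# ['intersection:', ['فعل'], 'union:', ['فعل'], 'similarity:', 1.0]
```
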