sinatools.utils.jaccard

sinatools.utils.jaccard.get_intersection(list1, list2, ignore_all_diacratics_but_not_shadda=False, ignore_shadda_diacritic=False)

Returns the intersection of two lists after normalizing the words and ignoring diacritics according to the input flags. You can try the demo online.

Parameters
  • list1 (list) – The first list.

  • list2 (list) – The second list.

  • ignore_all_diacratics_but_not_shadda (bool, optional) – A flag to ignore all diacritics except the shadda. Defaults to False.

  • ignore_shadda_diacritic (bool, optional) – A flag to ignore the shadda diacritic. Defaults to False.

Returns

The intersection of the two lists after normalization, with diacritics ignored according to the flags.

Return type

list
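
As a rough illustration of intersecting after normalization, the sketch below strips diacritics from each word and keeps the words of the first list whose normalized form also appears in the second. It is not the sinatools implementation: the names intersection_sketch and strip_all are hypothetical, and both the diacritic range (U+064B to U+0652) and the choice to return normalized forms in first-list order are assumptions.

```python
import re

# Assumption: the diacritics of interest are the code points U+064B..U+0652.
_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_all(word):
    """Hypothetical normalizer: remove every diacritic in the assumed range."""
    return _DIACRITICS.sub("", word)


def intersection_sketch(list1, list2, normalize=strip_all):
    """Return normalized words of list1 whose normalized form occurs in list2."""
    keys2 = {normalize(word) for word in list2}
    result = []
    for word in list1:
        key = normalize(word)
        # Keep each shared word once, in the order it appears in list1.
        if key in keys2 and key not in result:
            result.append(key)
    return result


# Usage: the first word carries two fatha marks; the match is found only
# because both sides are compared without diacritics.
with_marks = ["\u0643\u064E\u062A\u064E\u0628", "\u0642\u0644\u0645"]
without_marks = ["\u0643\u062A\u0628", "\u0628\u0627\u0628"]
shared = intersection_sketch(with_marks, without_marks)
```

Passing an identity function as normalize shows the effect of leaving both flags off: the diacritized and bare spellings no longer match.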

sinatools.utils.jaccard.get_non_preferred_word(word1, word2)

Returns the non-preferred word between the two input words.

Parameters
  • word1 (str) – The first word.

  • word2 (str) – The second word.

Returns

The non-preferred word. If there is no non-preferred word, '#' is returned.

Return type

str

sinatools.utils.jaccard.get_preferred_word(word1, word2)

Returns the preferred word among two given words based on their implication.

Parameters
  • word1 (str) – The first word.

  • word2 (str) – The second word.

Returns

The preferred word among the two given words.

Return type

str

sinatools.utils.jaccard.get_union(list1, list2, ignore_all_diacratics_but_not_shadda, ignore_shadda_diacritic)

Finds the union of two lists by removing duplicates and normalizing words.

Parameters
  • list1 (list) – The first list.

  • list2 (list) – The second list.

  • ignore_all_diacratics_but_not_shadda (bool) – Whether to ignore all diacritics except the shadda.

  • ignore_shadda_diacritic (bool) – Whether to ignore the shadda diacritic.

Returns

The union of the two lists after removing duplicates and normalizing words.

Return type

list
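
A minimal sketch of a duplicate-free union, assuming the words are normalized first and that first-seen order is kept. The function union_sketch and its normalize parameter are hypothetical stand-ins, not part of sinatools.

```python
def union_sketch(list1, list2, normalize=lambda word: word):
    """Merge two lists, normalize each word, and drop duplicates.

    dict.fromkeys keeps the first occurrence of each key, so the result
    preserves the order in which words are first seen.
    """
    merged = list(list1) + list(list2)
    return list(dict.fromkeys(normalize(word) for word in merged))


# Usage with a stand-in normalizer (lower-casing) to show that words which
# normalize to the same form collapse into one entry.
combined = union_sketch(["A", "b"], ["a", "c"], normalize=str.lower)
```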

sinatools.utils.jaccard.jaccard(delimiter, str1, str2, selection, ignoreAlldiacraticsButNotShadda=True, ignoreShaddaDiacritic=True)

Compute the Jaccard similarity, union, or intersection of two sets of strings.

Parameters
  • delimiter (str) – The delimiter used to split the input strings.

  • str1 (str) – The first input string to compare.

  • str2 (str) – The second input string to compare.

  • selection (str) – The desired operation to perform on the two sets of strings. Must be one of intersection, union, jaccardSimilarity, or jaccardAll.

  • ignoreAlldiacraticsButNotShadda (bool) – If True, ignore all diacritics except the shadda. Defaults to True.

  • ignoreShaddaDiacritic (bool) – If True, ignore the shadda diacritic. Defaults to True.

Returns

The Jaccard similarity, union, or intersection of the two sets of strings, depending on the value of the selection argument.
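
The flow described above (split both strings on the delimiter, then dispatch on selection) might look roughly like the sketch below. It is an illustration under assumptions, not the library code: the name jaccard_dispatch_sketch is hypothetical, diacritic handling is left out, results are sorted only to make them deterministic, and since the documentation does not say what jaccardAll returns, the sketch assumes a tuple of all three results. Raising ValueError for an unknown selection is likewise an assumption.

```python
def jaccard_dispatch_sketch(delimiter, str1, str2, selection):
    """Split two delimited strings into sets and return the requested result."""
    # Drop surrounding whitespace and empty tokens before comparing.
    set1 = {token.strip() for token in str1.split(delimiter) if token.strip()}
    set2 = {token.strip() for token in str2.split(delimiter) if token.strip()}

    intersection = sorted(set1 & set2)
    union = sorted(set1 | set2)
    # Assumption: two empty inputs score 0.0 rather than raising an error.
    similarity = len(intersection) / len(union) if union else 0.0

    if selection == "intersection":
        return intersection
    if selection == "union":
        return union
    if selection == "jaccardSimilarity":
        return similarity
    if selection == "jaccardAll":
        # Assumption: bundle all three results together.
        return intersection, union, similarity
    raise ValueError(f"Unsupported selection: {selection!r}")


# Usage: two pipe-delimited strings sharing two of four distinct tokens.
score = jaccard_dispatch_sketch("|", "a|b|c", "b|c|d", "jaccardSimilarity")
```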

sinatools.utils.jaccard.jaccard_similarity(list1: list, list2: list, ignore_all_diacratics_but_not_shadda: bool, ignore_shadda_diacritic: bool)

Calculates the Jaccard similarity coefficient between two lists.

Parameters
  • list1 (list) – The first list.

  • list2 (list) – The second list.

  • ignore_all_diacratics_but_not_shadda (bool) – A flag indicating whether to ignore all diacritics except the shadda.

  • ignore_shadda_diacritic (bool) – A flag indicating whether to ignore the shadda diacritic.

Returns

The Jaccard similarity coefficient between the two lists.

Return type

float
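
The Jaccard similarity coefficient of two sets A and B is the size of their intersection divided by the size of their union, |A ∩ B| / |A ∪ B|. The sketch below computes it for two plain lists. The function name is hypothetical, normalization is omitted, and returning 0.0 for two empty lists is an assumption rather than documented behavior.

```python
def jaccard_coefficient_sketch(list1, list2):
    """Return |A ∩ B| / |A ∪ B| for the sets built from two lists."""
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:
        # Assumption: avoid division by zero by scoring two empty lists as 0.0.
        return 0.0
    return len(set1 & set2) / len(union)


# Usage: two shared items out of four distinct items gives 2 / 4.
coefficient = jaccard_coefficient_sketch(["a", "b", "c"], ["b", "c", "d"])
```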

sinatools.utils.jaccard.normalize_word(word: str, ignore_all_diacratics_but_not_shadda: bool = True, ignore_shadda_diacritic: bool = True)

Normalizes a given Arabic word by removing diacritics and/or the shadda diacritic.

Parameters
  • word (str) – The input word.

  • ignore_all_diacratics_but_not_shadda (bool) – A flag indicating whether to remove all diacritics except the shadda. Defaults to True.

  • ignore_shadda_diacritic (bool) – A flag indicating whether to remove the shadda diacritic. Defaults to True.

Returns

The normalized Arabic word.

Return type

str
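
To make the interaction of the two flags concrete, the sketch below removes characters by Unicode code point. It is a simplified stand-in for the documented behavior, not the sinatools implementation: the function name, its parameter names strip_non_shadda and strip_shadda, and the exact set of code points treated as diacritics are all assumptions.

```python
SHADDA = "\u0651"
# Assumed diacritics other than shadda: fathatan, dammatan, kasratan,
# fatha, damma, kasra, sukun.
OTHER_DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0652"


def strip_diacritics_sketch(word, strip_non_shadda=True, strip_shadda=True):
    """Remove the selected groups of diacritics from an Arabic word."""
    removable = ""
    if strip_non_shadda:
        removable += OTHER_DIACRITICS
    if strip_shadda:
        removable += SHADDA
    return "".join(ch for ch in word if ch not in removable)


# Usage: meem + damma, dal + fatha, ra + shadda + kasra, seen.
sample = "\u0645\u064F\u062F\u064E\u0631\u0651\u0650\u0633"
bare = strip_diacritics_sketch(sample)
shadda_kept = strip_diacritics_sketch(sample, strip_shadda=False)
```

With both flags on, only the four base letters remain; with strip_shadda=False, the shadda on the third letter survives while the short vowels are removed.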