sinatools.utils.word_compare

class sinatools.utils.word_compare.Implication(word1, word2)

Compares two Arabic words to determine whether their diacritization is compatible (i.e., whether there is an implication between their diacritics).

For example, (فَعل) and (فَعَل) are compatible words: the first implies the second because it has fewer diacritics. Based on the implication direction score, the class determines the verdict (Same or Different), the diacritic distance, and the number of diacritic conflicts between the two words. The class also returns the preferredWord, which is the “implied” word that has more diacritics.

You can try the demo online, and see the article for more details.

Parameters:

word1 (str), word2 (str) – The input words.

Note

When comparing words, the implication ignores the diacritics on the last letter, except for Shadda.

Given two words, the Implication class provides five methods: (1) get_implication_score, (2) get_distance, (3) get_conflicts, (4) get_verdict, (5) get_preferred_word.
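The rule in the note above can be pictured with a small standalone sketch. It is not the library's code: the helper name drop_final_marks is hypothetical, and the set of diacritics is assumed here to be the Unicode range U+064B to U+0652 (tanween, short vowels, Shadda, Sukoon).

```python
# Illustrative sketch only; sinatools may normalize words differently.
SHADDA = "\u0651"
# Assumption: treat U+064B..U+0652 as the diacritics of interest.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def drop_final_marks(word):
    """Remove diacritics attached to the last letter, keeping Shadda if present."""
    end = len(word)
    # Walk back over the trailing diacritics to find where the last letter ends.
    while end > 0 and word[end - 1] in DIACRITICS:
        end -= 1
    trailing = word[end:]
    kept = SHADDA if SHADDA in trailing else ""
    return word[:end] + kept


if __name__ == "__main__":
    # Fa + Fatha, Ain + Fatha, Lam + Fatha: the final Fatha is dropped.
    print(drop_final_marks("\u0641\u064E\u0639\u064E\u0644\u064E"))
```

Under this sketch, two words that differ only in the final vowel mark reduce to the same string before comparison, while a final Shadda still counts.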

get_implication_score()

Returns the implication direction score between the two words based on their diacritization. The score is an integer in {0, 1, 2, 3, -1, -2}, as follows: 0 if w1 and w2 imply each other, 1 if w1 implies w2, 2 if w2 implies w1, and 3 if both are equal. It returns -1 if the two words have the same letters but conflicting diacritics, and -2 if they have different letters.
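To make the six score values concrete, here is a deliberately simplified sketch of how such a direction score could be derived. It is an assumption-laden toy, not the algorithm used by sinatools: the names split_marks and toy_implication_score are hypothetical, diacritics are assumed to be U+064B to U+0652, and the last-letter rule and any Hamza handling are left out.

```python
# Toy model of the implication direction score; NOT the sinatools implementation.
# Assumption: treat U+064B..U+0652 as the diacritics of interest.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_marks(word):
    """Split a word into (letter, frozenset_of_diacritics) pairs."""
    units = []
    for ch in word:
        if ch in DIACRITICS and units:
            letter, marks = units[-1]
            units[-1] = (letter, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def toy_implication_score(w1, w2):
    """Return 0, 1, 2, 3, -1 or -2 following the meanings described above."""
    u1, u2 = split_marks(w1), split_marks(w2)
    if [letter for letter, _ in u1] != [letter for letter, _ in u2]:
        return -2  # different letters
    w1_lacks = w2_lacks = False
    for (_, m1), (_, m2) in zip(u1, u2):
        if m1 == m2:
            continue
        if m1 < m2:      # w1 is less specified on this letter
            w1_lacks = True
        elif m2 < m1:    # w2 is less specified on this letter
            w2_lacks = True
        else:
            return -1    # same letter, incompatible diacritics
    if w1_lacks and w2_lacks:
        return 0         # each word fills gaps in the other
    if w1_lacks:
        return 1         # w1 has fewer diacritics, so w1 implies w2
    if w2_lacks:
        return 2         # w2 implies w1
    return 3             # identical diacritization
```

In this toy model, a word with fewer diacritics implies the more fully diacritized one, which mirrors the (فَعل) and (فَعَل) example in the class description.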

get_distance()

Returns an integer value representing the difference between the diacritics in the two words. Each diacritic difference is assigned a value of 1, but certain diacritics (specifically Sukoon, Shadda, and Hamza) are weighted differently (see the article). It returns 101 if the implication score is -1 and 1000 if the implication score is -2.

get_conflicts()

Returns an integer value representing the number of diacritic conflicts between the two words. The value is 0 when there are no conflicts and grows with the number of conflicts detected.

get_verdict()

Returns “Same” or “Different” as the matching verdict. The verdict is based on a combination of distance, conflicts, and implication direction: if the distance is below 15, there are no conflicts, and the implication score is between 0 and 3, the verdict is Same; otherwise, the verdict is Different.
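The decision rule stated above can be written out as a short sketch. The function name toy_verdict and its three arguments are hypothetical stand-ins for the values returned by get_distance, get_conflicts, and get_implication_score; the library may combine them differently internally.

```python
# Sketch of the documented verdict rule; not taken from the sinatools source.
def toy_verdict(distance, conflicts, score):
    """Return "Same" only when all three documented conditions hold."""
    if distance < 15 and conflicts == 0 and 0 <= score <= 3:
        return "Same"
    return "Different"


if __name__ == "__main__":
    print(toy_verdict(distance=1, conflicts=0, score=1))
    # The sentinel distances 101 and 1000 are both above the threshold of 15.
    print(toy_verdict(distance=101, conflicts=1, score=-1))
    print(toy_verdict(distance=1000, conflicts=0, score=-2))
```

Because the distances reported for scores -1 and -2 (101 and 1000) already exceed the threshold, conflicting or differently spelled words fall through to Different under this rule.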

get_preferred_word(word1, word2)

Returns the preferredWord, which is the “implied” word that has more diacritics. If both words imply each other (i.e., the implication direction is 0), the preferredWord is the merge of the diacritics in both words.

Example:

from sinatools.utils.word_compare import Implication
word1 = "فَعَلَ"
word2 = "فَعل"
implication = Implication(word1, word2)
result = implication.get_verdict()
print(result)
Output: "Same"