CLI.utils.jaccard_union
sina_union
About: The sina_union tool generates the union of two lists of words after applying optional diacritic normalization rules. Depending on the options given, it can ignore all diacritics except the shadda, ignore only the shadda, or both.
Usage:
Below is the usage information, which can be displayed by running sina_union --help.
- Usage:
sina_union --list1=WORD1 WORD2 … --list2=WORD1 WORD2 … [options]
- Options:
- --list1 WORD1 WORD2 …
First list of strings (space-separated).
- --list2 WORD1 WORD2 …
Second list of strings (space-separated).
- --ignore_all_diacritics_but_not_shadda
Apply normalization rules to ignore all diacritics but not the shadda.
- --ignore_shadda_diacritic
Apply normalization rules to ignore the shadda diacritic.
Examples
sina_union --list1 word1 word2 word3 --list2 word4 word5 word6 --ignore_all_diacritics_but_not_shadda --ignore_shadda_diacritic
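To illustrate how a command line like the one above could be interpreted, here is a minimal argparse sketch. It is an assumption about how such options might be parsed, not the tool's actual source: the build_parser helper is hypothetical, and only the option names are taken from the usage text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that mirrors the documented options;
    # the real sina_union entry point may be implemented differently.
    parser = argparse.ArgumentParser(prog="sina_union")
    # nargs="+" collects one or more space-separated words into a list.
    parser.add_argument("--list1", nargs="+", required=True,
                        help="First list of strings (space-separated).")
    parser.add_argument("--list2", nargs="+", required=True,
                        help="Second list of strings (space-separated).")
    # Boolean flags: False unless given on the command line.
    parser.add_argument("--ignore_all_diacritics_but_not_shadda",
                        action="store_true")
    parser.add_argument("--ignore_shadda_diacritic", action="store_true")
    return parser


# Parse the same arguments as the example command.
args = build_parser().parse_args([
    "--list1", "word1", "word2", "word3",
    "--list2", "word4", "word5", "word6",
    "--ignore_all_diacritics_but_not_shadda",
    "--ignore_shadda_diacritic",
])
print(args.list1, args.list2)
```

In this sketch each list option consumes every word up to the next option, which is why the two lists can be written space-separated.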
Note:
The two normalization options can be used individually or together. When both are given, both rules are applied, so the shadda is ignored along with all other diacritics. The union keeps only the distinct words that remain after normalization: if two words become identical after normalization, one of them is discarded according to a set of predefined criteria (defined in the get_non_preferred_word function).
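The behaviour described in this note can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: the diacritic set is taken to be the common Arabic marks U+064B to U+0652, the function names normalize and union_after_normalization are hypothetical, and the tie-break (keep the first word seen) merely stands in for the criteria of get_non_preferred_word, which are not described here.

```python
SHADDA = "\u0651"
# Assumption: "diacritics" means the common Arabic marks U+064B..U+0652.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}
OTHER_DIACRITICS = DIACRITICS - {SHADDA}


def normalize(word, ignore_all_but_shadda=False, ignore_shadda=False):
    """Strip the diacritics selected by the two options."""
    drop = set()
    if ignore_all_but_shadda:
        drop |= OTHER_DIACRITICS
    if ignore_shadda:
        drop.add(SHADDA)
    return "".join(ch for ch in word if ch not in drop)


def union_after_normalization(list1, list2, **options):
    """Keep one word per distinct normalized form.

    Assumed tie-break: the first word seen wins. The real tool
    delegates this choice to get_non_preferred_word.
    """
    kept = {}
    for word in list(list1) + list(list2):
        key = normalize(word, **options)
        kept.setdefault(key, word)
    return list(kept.values())


kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"         # fully vowelled, no shadda
ktb = "\u0643\u062A\u0628"                              # no diacritics
kattaba = "\u0643\u064E\u062A\u0651\u064E\u0628\u064E"  # vowelled, with shadda

no_rules = union_after_normalization([kataba], [ktb, kattaba])
keep_shadda = union_after_normalization(
    [kataba], [ktb, kattaba], ignore_all_but_shadda=True)
both_rules = union_after_normalization(
    [kataba], [ktb, kattaba], ignore_all_but_shadda=True, ignore_shadda=True)
print(len(no_rules), len(keep_shadda), len(both_rules))
```

With no options all three words are distinct. Ignoring everything but the shadda merges the first two, and applying both rules collapses all three into a single entry.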