CLI.utils.corpus_tokenizer

About:

The sina_corpus_tokenizer command tokenizes a corpus and writes the results to a CSV file. It recursively searches a specified directory for text files, tokenizes their content, and writes each token, together with its metadata, to the specified CSV file.

Usage:

Below is the usage information that can be generated by running sina_corpus_tokenizer --help.

Usage:
    sina_corpus_tokenizer --dir_path DIR_PATH --output_csv OUTPUT_CSV

Arguments:
  --dir_path
        The path to the directory containing the text files.

  --output_csv
        The path to the output CSV file.

Examples:

sina_corpus_tokenizer --dir_path "/path/to/text/directory/of/files" --output_csv "outputFile.csv"

The tool only processes text files (with a .txt extension).
The output CSV will contain the following columns:
- ‘Row_ID’: a unique identifier for each word token.
- ‘Sentence ID’: the identifier for the sentence in which the word appears.
- ‘Docs_Sentence_Word_ID’: a concatenated identifier made up of the directory name, file name, sentence ID, and word position.
- ‘Word Position’: the position of the word within the sentence.
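
As a minimal sketch of consuming the output, the snippet below reads a CSV with the columns listed above using Python's csv module and groups the row identifiers by sentence. The sample rows, the helper name rows_per_sentence, and the exact format of the Docs_Sentence_Word_ID values are made up for illustration; a real output file may look different.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data. Real values come from the generated CSV file, and
# the Docs_Sentence_Word_ID format shown here is an assumption.
SAMPLE = """Row_ID,Sentence ID,Docs_Sentence_Word_ID,Word Position
1,1,docs_a_1_1,1
2,1,docs_a_1_2,2
3,2,docs_a_2_1,1
"""


def rows_per_sentence(csv_file):
    """Group Row_ID values by their Sentence ID (both kept as strings)."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["Sentence ID"]].append(row["Row_ID"])
    return dict(groups)


groups = rows_per_sentence(io.StringIO(SAMPLE))
print(groups)
```

To read a real file instead of the in-memory sample, pass a file object opened with open("outputFile.csv", newline="", encoding="utf-8").
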
Ensure that the text files are encoded in UTF-8 or a compatible encoding.
The tool uses the nltk library for sentence and word tokenization, so make sure nltk is installed in your environment.
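
The processing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the regular-expression splitters are simple stand-ins for nltk's tokenizers (for example nltk.sent_tokenize and nltk.word_tokenize) so that the sketch runs without extra dependencies, all function names are hypothetical, the underscore-separated Docs_Sentence_Word_ID format is assumed, and only the four columns listed above are written.

```python
import csv
import os
import re
import tempfile

FIELDS = ["Row_ID", "Sentence ID", "Docs_Sentence_Word_ID", "Word Position"]


def iter_txt_files(dir_path):
    """Yield paths of .txt files under dir_path, recursively, in sorted order."""
    for root, dirs, files in os.walk(dir_path):
        dirs.sort()  # makes the traversal order deterministic
        for name in sorted(files):
            if name.endswith(".txt"):
                yield os.path.join(root, name)


def split_sentences(text):
    # Stand-in for an nltk sentence tokenizer: split after ., ! or ?.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    # Stand-in for an nltk word tokenizer: word runs or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def build_rows(dir_path):
    """Build one row per word token for every .txt file under dir_path."""
    rows, row_id = [], 0
    for path in iter_txt_files(dir_path):
        dir_name = os.path.basename(os.path.dirname(path))
        file_name = os.path.splitext(os.path.basename(path))[0]
        with open(path, encoding="utf-8") as f:
            text = f.read()
        for sentence_id, sentence in enumerate(split_sentences(text), start=1):
            for position, _word in enumerate(split_words(sentence), start=1):
                row_id += 1
                rows.append({
                    "Row_ID": row_id,
                    "Sentence ID": sentence_id,
                    # Assumed format; the real separator and order may differ.
                    "Docs_Sentence_Word_ID":
                        f"{dir_name}_{file_name}_{sentence_id}_{position}",
                    "Word Position": position,
                })
    return rows


def write_rows(rows, output_csv):
    """Write the rows to output_csv with a header line."""
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Demo on a throwaway directory: one .txt file and one file that is skipped.
with tempfile.TemporaryDirectory() as tmp:
    corpus = os.path.join(tmp, "corpus")
    os.makedirs(corpus)
    with open(os.path.join(corpus, "a.txt"), "w", encoding="utf-8") as f:
        f.write("Hello world. How are you?")
    with open(os.path.join(corpus, "notes.md"), "w", encoding="utf-8") as f:
        f.write("ignored")
    demo_rows = build_rows(corpus)
    write_rows(demo_rows, os.path.join(tmp, "out.csv"))
```

In this sketch Row_ID runs across the whole corpus, while the sentence ID and word position restart per file and per sentence respectively; whether the real tool numbers them the same way is not stated here.
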