CLI.utils.corpus_tokenizer

About:

The sina_corpus_tokenizer command tokenizes a corpus and writes the results to a CSV file. It recursively searches a specified directory for text files, tokenizes their content, and writes the tokens, together with their metadata, to the specified CSV file.
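
The recursive search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's actual implementation; the function name find_text_files is hypothetical.

```python
from pathlib import Path


def find_text_files(dir_path):
    """Return every .txt file under dir_path, including subdirectories.

    Hypothetical helper for illustration; not part of sina_corpus_tokenizer.
    """
    root = Path(dir_path)
    # rglob descends into subdirectories; the pattern keeps only .txt files
    return sorted(p for p in root.rglob("*.txt") if p.is_file())
```

Sorting the paths keeps the processing order stable between runs, which matters if row identifiers are assigned in file order.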

Usage:

The usage information below can be displayed by running sina_corpus_tokenizer --help.

Usage:
    sina_corpus_tokenizer dir_path output_csv
Positional Arguments:
  dir_path
        The path to the directory containing the text files.

  output_csv
        The path to the output CSV file.

Examples:

sina_corpus_tokenizer --dir_path "/path/to/text/directory/of/files" --output_csv "outputFile.csv"

Notes:

  • The tool only processes text files (with a .txt extension).

  • The output CSV will contain the following columns:
    • ‘Row_ID’: a unique identifier for each word token.

    • ‘Sentence ID’: the identifier for the sentence in which the word appears.

    • ‘Docs_Sentence_Word_ID’: a concatenated identifier comprising the directory name, file name, sentence ID, and word position.

    • ‘Word Position’: the position of the word within the sentence.

  • Ensure that the text files are encoded in UTF-8 or a compatible encoding.

  • The tool uses the nltk library for sentence and word tokenization. Make sure to have the library installed in your environment.
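
To make the column descriptions concrete, here is a minimal Python sketch of how such rows could be built and written. It is not the tool's actual code. The helper names (split_sentences, split_words, build_rows, write_rows), the underscore separator inside ‘Docs_Sentence_Word_ID’, the 1-based numbering, and the extra ‘Word’ column are all assumptions made for illustration. The two regex-based splitters are simple stand-ins so the sketch runs without extra dependencies.

```python
import csv
import re

# 'Word' is an assumed extra column so each row shows its token
FIELDS = ["Row_ID", "Sentence ID", "Docs_Sentence_Word_ID", "Word Position", "Word"]


def split_sentences(text):
    # Stand-in splitter: break after ., ! or ? when followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    # Stand-in tokenizer: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", sentence)


def build_rows(dir_name, file_name, text,
               sent_tokenize=split_sentences, word_tokenize=split_words,
               start_row_id=1):
    """Return one dict per word token (hypothetical layout, 1-based IDs)."""
    rows = []
    row_id = start_row_id
    for sentence_id, sentence in enumerate(sent_tokenize(text), start=1):
        for position, word in enumerate(word_tokenize(sentence), start=1):
            rows.append({
                "Row_ID": row_id,
                "Sentence ID": sentence_id,
                # Assumed format: parts joined with underscores
                "Docs_Sentence_Word_ID": f"{dir_name}_{file_name}_{sentence_id}_{position}",
                "Word Position": position,
                "Word": word,
            })
            row_id += 1
    return rows


def write_rows(rows, output_csv):
    # newline="" prevents blank lines on Windows; UTF-8 matches the input encoding
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

To use nltk as the tool does, nltk.tokenize.sent_tokenize and nltk.tokenize.word_tokenize can be passed as the sent_tokenize and word_tokenize arguments; both require nltk's tokenizer data to be downloaded first.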