The first step in preprocessing is segmenting your running text into words, a process called tokenization.
Tokenization splits each text in your corpus into a list of tokens, which can be words, numbers, or punctuation marks. If you'd like to focus your analysis only on the tokens that are likely to be words, ignoring numbers and punctuation, you can keep just the alphabetic tokens. Using the lists of tokens that word segmentation produces for your corpus, you can calculate word frequencies and lexical diversity, as in the sketch below. When you combine word segmentation with sentence segmentation, more complex analyses become possible.
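As one way to make this concrete, here is a minimal sketch using Python's NLTK library, one of several tools that provide tokenizers. The sample text and variable names are invented for illustration, and the model package names are an assumption that varies across NLTK releases:

```python
import nltk

# Punkt tokenizer models; the package name changed in newer NLTK
# releases, so fetch both the older and newer names (assumption).
for resource in ("punkt", "punkt_tab"):
    nltk.download(resource, quiet=True)

# An invented two-sentence text standing in for one text in your corpus.
text = "In 1813, Jane Austen published Pride and Prejudice. It sold well."

# Word segmentation: split the running text into words, numbers,
# and punctuation marks.
tokens = nltk.word_tokenize(text)

# Keep only the tokens that are likely to be words, ignoring
# numbers and punctuation marks.
words = [token.lower() for token in tokens if token.isalpha()]

# Word frequencies: how often each word occurs in the text.
frequencies = nltk.FreqDist(words)
print(frequencies.most_common(3))

# Lexical diversity: the proportion of distinct words among all words.
print(len(set(words)) / len(words))
```

Lowercasing the alphabetic tokens before counting, as done here, is a common choice so that "The" and "the" count as one word; whether that suits your analysis depends on your research question.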
The second step in preprocessing is segmenting your running text into sentences. Your text analysis tool will use punctuation marks and capitalization to estimate where sentences begin and end, transforming each running text in your corpus into a list of sentences. Combining the lists of sentences with the lists of tokens, you can identify the parts of speech in each sentence, estimate the named people and places mentioned in your corpus, and much more!
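Here is a minimal sketch of that pipeline, again with NLTK. The sample text is invented; sent_tokenize, pos_tag, and ne_chunk are NLTK's sentence splitter, part-of-speech tagger, and named-entity chunker, and the model package names are assumptions that vary across NLTK releases:

```python
import nltk

# Models for sentence splitting, part-of-speech tagging, and named-entity
# chunking; exact package names vary across NLTK releases, so fetch both
# the older and newer names (assumption).
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng",
                 "maxent_ne_chunker", "maxent_ne_chunker_tab", "words"):
    nltk.download(resource, quiet=True)

# An invented sample text standing in for one text in your corpus.
text = ("Mary Shelley wrote Frankenstein in Geneva. "
        "The novel appeared anonymously in 1818.")

# Sentence segmentation: estimate sentence boundaries from punctuation
# and capitalization cues.
for sentence in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sentence)
    # Tag each token with its estimated part of speech.
    tagged = nltk.pos_tag(tokens)
    # Group the tagged tokens into named entities (people, places, ...).
    tree = nltk.ne_chunk(tagged)
    entities = [(" ".join(word for word, _ in subtree.leaves()), subtree.label())
                for subtree in tree.subtrees()
                if subtree.label() != "S"]
    print(sentence)
    print("  parts of speech:", tagged)
    print("  named entities: ", entities)
```

Note that both steps are estimates: the sentence splitter can stumble on abbreviations like "Dr.", and the named-entity chunker will miss or mislabel some names, so it pays to spot-check the output on a few texts from your corpus.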