Preprocess Text


Preprocessing transforms the running text in your corpus into individual words and sentences in a standardized format, preparing the text for computational analysis.

Automated preprocessing tools won’t be perfect, especially if your text has been digitized rather than born-digital, but they’re still a key resource for all text analysis work. Coding libraries (e.g., Python’s NLTK) and graphical user interface tools (e.g., AntConc, Voyant Tools) both provide preprocessing capabilities.

Tokenization: Word and Sentence Segmentation

The first step in preprocessing is segmenting your running text into words, a process called tokenization.

Tokenization splits each text in your corpus into a list of tokens, which can be words, numbers, or punctuation marks. If you’d like to focus your analysis only on tokens that are likely to be words, you can filter out numbers and punctuation to create lists of alphabetic tokens. Using the lists of tokens that word segmentation produces, you can calculate word frequencies and lexical diversity for your corpus; combined with sentence segmentation, more complex analyses become possible.
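
As a minimal sketch, here is how word tokenization and the follow-on frequency counts might look in Python with NLTK (the sample sentence is invented, and newer NLTK releases may ask for the “punkt_tab” model instead of “punkt”):

```python
import nltk
from collections import Counter

nltk.download("punkt", quiet=True)  # Punkt tokenizer models (one-time download)

# A made-up running text standing in for one document in your corpus
text = "Dr. Watson arrived at 221B Baker Street in 1881. He knocked twice."

# Split the running text into tokens: words, numbers, and punctuation
tokens = nltk.word_tokenize(text)

# Keep only alphabetic tokens, dropping numbers and punctuation marks
alphabetic = [t for t in tokens if t.isalpha()]

# Word frequencies and lexical diversity (unique words / total words)
frequencies = Counter(alphabetic)
lexical_diversity = len(set(alphabetic)) / len(alphabetic)
print(frequencies.most_common(5), lexical_diversity)
```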

The second step in preprocessing is segmenting your running text into sentences. Your text analysis tool will use punctuation marks and capitalization to estimate where sentences begin and end, transforming the running texts in your corpus into lists of sentences. Using these lists of sentences together with the lists of tokens, you can identify the parts of speech in each sentence, estimate the named people and places in your corpus, and much more!
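
A sketch of sentence segmentation and part-of-speech tagging with NLTK (the sentence text is our own, and recent NLTK releases may name the tagger model “averaged_perceptron_tagger_eng”):

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "It was a dark night. Mr. Holmes said nothing. We waited by the door."

# Estimate sentence boundaries from punctuation and capitalization
sentences = nltk.sent_tokenize(text)

# Combine with word tokenization to tag parts of speech per sentence
for sentence in sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    print(tagged)  # e.g. [('It', 'PRP'), ('was', 'VBD'), ...]
```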

Further Refine Your Text

During preprocessing, you may wish to remove small, very frequent words such as “a,” “is,” and “the” (called stop words), along with other words or phrases that occur very frequently but add no meaning to your analysis. For example, if the title of a book appears in every text file in your corpus, you may want to remove all instances of the title.
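
A minimal sketch using NLTK’s English stop word list; here “Middlemarch” stands in for a hypothetical book title repeated across your files:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

text = "Middlemarch. The town is quiet, and the river is low."

stop_words = set(stopwords.words("english"))
stop_words.add("middlemarch")  # hypothetical repeated book title to strip

tokens = [t for t in nltk.word_tokenize(text) if t.isalpha()]
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['town', 'quiet', 'river', 'low']
```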

If you’re interested in word frequencies, you may also wish to reduce the words in your texts to their root forms during preprocessing. One option is to stem your lists of tokens, replacing each word with its stem, which may or may not be a true word (for example, “flies” becomes “fli”). Alternatively, you could lemmatize your lists of tokens, reducing each word to a root form that is itself a word (for example, “flew” becomes “fly”).
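
As a sketch, NLTK’s Porter stemmer and WordNet lemmatizer illustrate the difference (the word list is our own, and exact stemmer output can vary between algorithms):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # lemma dictionary for the lemmatizer
nltk.download("omw-1.4", quiet=True)   # needed on some NLTK versions

words = ["flew", "flies", "studies", "studying"]

# Stemming chops off affixes by rule; results need not be real words
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])
# ['flew', 'fli', 'studi', 'studi']

# Lemmatization maps each word to a dictionary form (here treated as verbs)
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in words])
# ['fly', 'fly', 'study', 'study']
```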

When studying word frequencies, you will generally want to casefold the texts in your corpus. Casefolding transforms all capital letters to lowercase, so a word that appears at the start of one sentence and in the middle of another will be interpreted as the same word. If you’re interested in identifying the people and places named in your texts, you’ll also want to keep an un-casefolded version of your texts with the original capitalization, because capitalization helps text analysis tools identify people, places, and other named entities.
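
A short sketch in plain Python (the sample sentence is invented):

```python
text = "The Thames flows through London. THE river is tidal."

# Casefolded copy for frequency counts: "The" and "THE" become the same token
folded = text.lower()  # str.casefold() also handles a few extra Unicode cases

# Keep the original, un-casefolded text too: capitalization cues help
# named entity recognition spot "Thames" and "London"
original = text
print(folded)
```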
