Wishlist: corpus analysis and sociolinguistics

Traditionally, sociolinguistics has examined language variation as a function of independent social variables: gender, class, geography, time, and so on. Texts marked up for the web (or any other digital medium that uses structured metadata) might allow researchers with the right data-mining scripts to extract some of these variables from the text itself, potentially overcoming some of the barriers presented by traditional sociolinguistic fieldwork (especially for longitudinal studies of language variation).

COHA (the Corpus of Historical American English), a part-of-speech-tagged, 400-million-word corpus with texts from 1810 to 2009, may be the best general resource for these kinds of studies. Although it does not capture all of the variables a sociolinguist might want, it still provides strong examples of diachronic variation, that is, change over time.

Unfortunately, this capability can’t be applied to texts other than those collected by COHA without some serious programming chops. The programs I know of that use a friendly graphical user interface (I mentioned a few in my post on Tools for building corpora) don’t support the features a researcher would need to create a COHA-like corpus for a specific research question.

COHA gives us a model of what a diachronic corpus might look like, but (for very good reasons) it won’t support studies that either (a) need to include a close reading of individual texts, not just aggregated data, or (b) require specific texts to be used (say, a collection of diplomatic cables).

What specifications might go into a program that can support a broader range of sociolinguistic projects?


  • A metadata mining system that looks for date entities, associates them with individual texts, and offers them up as a pre-search filter. (Without some sort of metadata mining system, texts will need to be marked up with metadata manually when they’re placed in the corpus.)
  • A date-range interface that restricts a search to a subsection of the corpus.
  • Reports (search results) that incorporate date entities: graphs that present the changing frequency of a keyword, n-gram, or syntactic element over time, preferably with scales and ranges that the user can adjust after the data have been tabulated by a particular search.
  • Data normalization scripts that help account for time periods containing fewer texts or tokens.
  • Support for Corpus Query Processor (CQP) syntax, regular expressions, or both. This is fundamental to most existing corpus analysis systems, but I’m adding it to the list because one way of approaching this could be to incorporate a search engine like Lucene, which wasn’t designed with corpus analysis in mind but can still be configured with “facets” driven by metadata.
  • Ability to access textual data. For copyright reasons, COHA doesn’t allow texts to be accessed in their entirety.
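
To make the first few items concrete, here is a rough Python sketch of how date mining, a date-range filter, and normalization might fit together. Everything in it is an assumption for illustration: the function names (guess_year, frequency_by_decade), the sample texts, and especially the heuristic of treating the first four-digit year in a text as its date, which would misdate any text that mentions an earlier or later year before its own. A real system would need a proper date-entity recognizer.

```python
import re
from collections import Counter

# Assumption: a text's date is the first four-digit year (1800-2099) it contains.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def guess_year(text):
    """Return the first year-like number in the text, or None if there is none."""
    match = YEAR_RE.search(text)
    return int(match.group(1)) if match else None


def tokenize(text):
    """Very crude tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_by_decade(texts, keyword, start=None, end=None):
    """Count keyword hits per decade, normalized to hits per million tokens.

    start and end (inclusive years) act as the pre-search date filter.
    Texts with no detectable year are skipped.
    """
    hits = Counter()
    totals = Counter()
    for text in texts:
        year = guess_year(text)
        if year is None:
            continue
        if start is not None and year < start:
            continue
        if end is not None and year > end:
            continue
        decade = year // 10 * 10
        tokens = tokenize(text)
        totals[decade] += len(tokens)
        hits[decade] += tokens.count(keyword.lower())
    # Dividing by the decade's token total keeps sparse decades comparable.
    return {d: hits[d] / totals[d] * 1_000_000 for d in sorted(totals) if totals[d]}


# Hypothetical miniature corpus, invented for this example.
SAMPLE_TEXTS = [
    "London, 3 May 1851. The telegraph is a marvel.",
    "Boston, 1855. We sent word by telegraph and by post.",
    "Paris, 1923. The telephone rang twice.",
]

if __name__ == "__main__":
    print(frequency_by_decade(SAMPLE_TEXTS, "telegraph"))
    print(frequency_by_decade(SAMPLE_TEXTS, "telegraph", start=1900))
```

The per-million figure is the normalization step: a decade represented by a handful of short texts and a decade represented by thousands of long ones end up on the same scale, which is what a frequency-over-time graph would need as input.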

Currently, however, the applications I’m familiar with don’t provide an easy way to tag corpora with rich meta-textual features and run reports based on these features. As I mentioned with regard to COHA, the existing large corpora that do provide these features don’t give end users ways to access the texts or upload their own.
