Tools for building corpora

As I mentioned in my Cablegate SQL post, I’ve been learning about computational tools for processing large collections of texts, with the aim of extending my usual Discourse Analysis research. In my department (English), these methods are usually called “Corpus Analysis.” Read on for a brief description of the software I’ve used to help isolate language patterns across large bodies of text.

In the 2011 Spring quarter, I began working with web interfaces for analyzing texts that have already been parsed and tagged for linguistic analysis (like the Corpus of Contemporary American English), and with small bodies of non-parsed text that can be imported into desktop software like WordSmith Tools and AntConc.

This quarter, however, I’m looking at building my own large corpora, mostly with Corpus Workbench (CWB). I’m interested in big chunks of data like the Enron Email Dataset, Twitter corpora (preferably analyzed in real time), and WikiLeaks’ Cablegate files. I may talk about my research goals and motivations in a future post.

While Corpus Workbench offers a number of tools for deep linguistic analysis, I’m concerned that it may not be easy to adapt to real-time (or near-real-time) analysis, and that it may not be able to handle some of the massive corpora I’m looking at, at least not without additional programming or server administration that I’m not yet competent to do.

So the other tool I’m looking at is the Natural Language Toolkit (NLTK) for Python, which seems to have significantly more documentation (including several books) than CWB. Python itself is supposed to be a good language for beginners, with an active developer community and lots of modules and libraries that could (theoretically) be assembled into a series of scripts or programs that do what I need.
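As a rough illustration of the kind of small script this could mean, here is a minimal sketch that counts word frequencies and prints keyword-in-context lines for a tiny collection of texts. It uses only Python's standard library (not NLTK or CWB), and the function names, the simple regular-expression tokenizer, and the sample sentences are all made up for illustration; a real corpus would need more careful tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequencies(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def concordance(texts, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    `width` is the number of tokens kept on each side of the keyword.
    """
    keyword = keyword.lower()
    hits = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, tok, right))
    return hits


# Hypothetical sample data standing in for a real corpus.
sample_texts = [
    "Please forward the report to the legal team.",
    "The report is late again.",
]

freqs = word_frequencies(sample_texts)
print(freqs.most_common(3))

for left, kw, right in concordance(sample_texts, "report"):
    print(f"{left:>25} [{kw}] {right}")
```

NLTK ships its own versions of both ideas (a frequency distribution class and a concordance view), so a sketch like this is mainly useful for seeing what those tools do under the hood.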

Have you worked with CWB? What about the Python NLTK? What other tools do you use for building and querying large corpora?
