I originally posted this question to the Systemic Functional Linguistics list at the University of Technology Sydney:
I’m looking for research into the development of automated (or partially automated) parsing and tagging aimed at surfacing patterns related to SFL across large textual aggregates. Are there any particular sources you would recommend?
Anything in the way of conventional academic research, technical documentation, scripts, RegEx patterns, or open source applications would be helpful to me. There seems to be a relative paucity of research and development in this area — due in large part, I suspect, to the conceptual complexity (if not impossibility) of approaching SFL this way. On the other hand, maybe I just haven’t hit on the right search terms.
For those interested in M.A.K. Halliday and SFL: do you have any recommendations? What do you think might be the barriers to formalizing SFL in a way that would make it more conducive to computational linguistics? How would you begin to approach this task?
Traditionally, sociolinguistics has examined language variation as a function of independent social variables: gender, class, geography, time, and so on. Texts marked up for the web (or for any other digital medium that uses structured metadata) might allow researchers with the right data-mining scripts to extract some of these variables from the texts themselves, overcoming some of the barriers presented by traditional sociolinguistic fieldwork (especially for longitudinal studies of language variation).
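To make the idea of pulling social variables out of markup a little more concrete, here is a minimal sketch in Python. It assumes the variables of interest (author, date) happen to be stored in HTML `<meta name="..." content="...">` elements, which many pages do not do consistently. The class name `MetaExtractor`, the function `extract_meta`, and the sample page are all hypothetical illustrations, not part of any existing tool.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Only <meta> elements carry the metadata this sketch looks for.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        # Skip <meta> elements without a name, such as charset declarations.
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of metadata found in the given HTML string."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.meta


# Invented sample page, used only to demonstrate the function.
SAMPLE_PAGE = """<html><head>
<meta name="author" content="Example Author">
<meta name="date" content="2011-01-15">
<meta charset="utf-8">
</head><body><p>Post text.</p></body></html>"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE_PAGE))
```

A real study would still need to check how reliable such metadata is, since authors and publishing platforms fill it in unevenly.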
Messages authored within the US Department of State’s “cable” system all receive abbreviations, known as TAGs, that play a role similar to the “tags” that bloggers often apply to their posts. For researchers (including linguists and reporters) who are interested in reading these cables, TAGs are a valuable way to cluster cables on a similar topic. For instance, a linguist might want to isolate all of the cables bearing the “PTER” tag (“PTER” stands for “Terrorism”). The Department of State has defined some but not all of these TAGs. Read on to learn more about the DoS system, and about a corpus analysis method for determining the meaning of TAGs that were previously undefined.
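As a rough illustration of clustering cables by TAG, the following Python sketch builds an index from each TAG to the cables that carry it. The record layout (an `id` and a list of `tags`) and the sample entries are assumptions invented for this example; they do not reflect the actual structure of the cable database.

```python
from collections import defaultdict

# Invented sample records; real cables would be read from a database or files.
SAMPLE_CABLES = [
    {"id": "cable-001", "tags": ["PTER", "PREL"]},
    {"id": "cable-002", "tags": ["PREL"]},
    {"id": "cable-003", "tags": ["PTER"]},
]


def index_by_tag(cables):
    """Map each TAG to the list of cable ids that bear it (hypothetical helper)."""
    index = defaultdict(list)
    for cable in cables:
        for tag in cable["tags"]:
            index[tag].append(cable["id"])
    return dict(index)


if __name__ == "__main__":
    tag_index = index_by_tag(SAMPLE_CABLES)
    # Isolate every cable bearing the "PTER" tag.
    print(tag_index.get("PTER", []))
```

With an index like this in hand, the cables under a single TAG can be handed to a concordancer or word-frequency tool as their own sub-corpus.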
As I mentioned in my Cablegate SQL post, I’ve been learning lately how to extend my usual Discourse Analysis research with computational tools for processing large collections of texts. In my department (English), these methods are usually called “Corpus Analysis.” Read on for a brief description of the software I’ve used to help isolate language patterns across large bodies of text.
This post is for non-developers who are interested in working with the cable_db_full.sql database of diplomatic cables that WikiLeaks released as part of its “Cablegate” project. It tells you how to download the file, unpack it, and load it into PostgreSQL under XAMPP, so that you can view and query the database in a browser-based graphical interface (phpPgAdmin) on your local machine rather than on a dedicated server.