Defining Cablegate TAGs

Messages authored within the US Department of State’s “cable” system all receive abbreviations that play a role similar to the “tags” that bloggers often apply to their posts. The Department of State has defined some but not all of these TAGs. Read on to learn more about the DoS system and about a corpus analysis method for determining the meaning of TAGs that were previously undefined.

But the DoS version of tagging differs from a run-of-the-mill blog tagging system. First, DoS tags are centrally maintained: cable authors aren’t permitted to make up tags on the spot (there is a class of DoS tags that falls outside the central term store, but I won’t go into these here). Second, DoS tags aren’t usually intuitive enough to be read by a human unless you happen to have them memorized (although some patterns do obtain… sometimes). Examples of DoS tags include PREL, PGOV, CU, and SP. DoS tags are moderately patterned abbreviations, many (but not all) of which are defined in a manual handed out to diplomatic staff involved in transmitting cables. This manual is publicly available on the DoS website.

DoS “tags” also have a special name (or anyway a special typographical convention). Like any government agency worth its discursive salt, the DoS turned “tag” into an acronym, so that’s T.A.G.S., not “tags.” In the DoS manual, TAGS stands for Traffic Analysis by Geography and Subject.

For researchers (including linguists and reporters) who are interested in reading these cables, TAGs are a valuable way to cluster several different cables on a similar topic. For instance, it might be interesting for a linguist to isolate all of the cables bearing the “PTER” tag (“PTER” stands for “Terrorism”).

But why would we need to find cables about terrorism using the PTER tag when we could just as easily find the texts we’re interested in by searching their body content for keywords like “terrorist” and “terrorism”? Following the usual corpus analysis process (if our texts did not have TAGs), we might isolate terrorism-related texts simply by looking for those cables that exhibited a certain density of terrorism-related lexical types (cables that use the words “terrorist” and “terrorism” over and over, in other words). But by using TAGs, you can find all of the cables that the language community itself has defined as relating to the topic of terrorism, even if the key lexical types themselves aren’t present in the text.

For example, one text describing an act that the entire diplomatic corps believes to have been related to terrorism may still not have occasion to use the actual term “terrorism,” especially if the act is being discussed at length in other cables as well. But even if the author of the cable didn’t use the key lexical types we’re looking for (“terrorism” or “terrorist”), the cable author may still have decided that because of the topic, the cable needed to receive the PTER TAG.
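The contrast between the two selection strategies can be sketched in a few lines of Python. This is a toy illustration only: the three cables, their TAG sets, the keyword pattern, and the 5% density threshold are all invented for the example and are not drawn from the Cablegate files.

```python
import re

# Hypothetical toy cables: each has a set of TAGs and a body text.
cables = [
    {"id": "A", "tags": {"PTER", "PREL"},
     "body": "The terrorist cell claimed the attack. Terrorism remains a concern."},
    {"id": "B", "tags": {"PTER"},
     "body": "Following yesterday's bombing, the ministry closed the border crossing."},
    {"id": "C", "tags": {"ECON"},
     "body": "Trade talks resumed after the tariff dispute."},
]

# Key lexical types: terrorism, terrorist, terrorists (case-insensitive).
KEYWORDS = re.compile(r"terroris[mt]s?", re.IGNORECASE)


def keyword_density(text):
    """Share of word tokens in the text that are one of the key lexical types."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if KEYWORDS.fullmatch(token))
    return hits / len(tokens)


def select_by_keywords(cables, threshold=0.05):
    """Keyword approach: keep cables whose keyword density meets the threshold."""
    return [c["id"] for c in cables if keyword_density(c["body"]) >= threshold]


def select_by_tag(cables, tag):
    """TAG approach: keep cables that the authors themselves labeled with the TAG."""
    return [c["id"] for c in cables if tag in c["tags"]]


print(select_by_keywords(cables))     # ['A']
print(select_by_tag(cables, "PTER"))  # ['A', 'B']
```

Cable B never uses the key terms, so the keyword search misses it, while the TAG search still retrieves it.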

So these TAGs provide a semantic bridge that would ordinarily be considered too intuition-driven for a truly pragmatic corpus linguist: a semantic bridge that can still be justified on empirical grounds because it exists within the textual data rather than within the internal intuition of the researcher. In other words, the researcher needn’t resort to arguing for the examination of a terrorism-related text that doesn’t contain the usual terrorism-related keywords, since the language community has already made the link explicit through meta-textual TAGs.

Once the texts related to our TAG of interest (PTER) have been found, we can then run a battery of tests (maybe things like word frequencies, n-grams, semantic associations, POS/word-function analysis) to look for ways in which the meaning of “terrorism” in diplomatic discourse has shifted since, say, the 1970s.
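As a rough sketch of the first two tests in that battery, word frequencies and n-grams need nothing more than the Python standard library. The tokenizer here is deliberately naive (lowercased runs of letters and apostrophes) and the two sample sentences are invented; a real study would likely swap in a proper tokenizer and the actual PTER-tagged cable bodies.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase the text and keep runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Count how often each word type occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def ngrams(tokens, n=2):
    """Return the list of consecutive n-token tuples in a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


def ngram_frequencies(texts, n=2):
    """Count n-grams per text, so no n-gram spans two different cables."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts


# Hypothetical stand-ins for the bodies of PTER-tagged cables.
pter_bodies = [
    "The group claimed responsibility for the attack.",
    "Police linked the attack to the group.",
]

freqs = word_frequencies(pter_bodies)
bigrams = ngram_frequencies(pter_bodies, n=2)
print(freqs.most_common(3))
print(bigrams.most_common(3))
```

Running the same counts on subsets of cables grouped by year would be the natural next step for a diachronic comparison, provided each subset is large enough to support it.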

One might expect to find evidence of the ways in which shifting political alliances have exerted an influence on the very language we use to describe the concept of terrorism. Or perhaps we might find diachronic language variations along a semantic axis defined by domestic vs. international events and perceived threats.

Or at any rate trends like these would be interesting to find…. But as it turns out, the Cablegate files are heavily weighted towards cables transmitted in 2008 and 2009. Only a handful of files are available from the 1970s. Data normalization doesn’t help either, since normalizing frequencies cannot make up for a sample that small. In the Cablegate files, the earlier decades simply don’t have enough texts to draw any reliable conclusions about what diplomatic discourse generally had to say about terrorism in the 1970s, or how the language about terrorism has changed since then.

Still, though, these TAGs are attractive objects to analyze: they seem like they could be very useful, even if the shiny example I offered here doesn’t work out in practice. TAGs provide one more way to help cut the corpus down to a slightly more manageable size depending on a predefined research interest, which is pretty important considering the size of the corpus and the potential complexity (to me and my pitiful computer system, anyway) of the parsing, tagging, chunking, and querying I’d like to undertake.

So onward with the TAGs: the State Department has defined quite a few of them already. The Guardian UK has defined even more. (You’ll have to read my paper for links to their glossaries.) However, there are still a handful of relatively frequent TAGs that remain undefined.

This is where my working paper (the one I alluded to in the title of this post) comes in. Last year I completed a seminar paper about my process for defining TAGs in the Cablegate files recently released by WikiLeaks. You can download it here:

Expanding the TAG Lexicon in WikiLeaks’ Cablegate Files: A Pilot Study of Corpus Analysis Methods

The project in general was an exercise in computational lexicography, and it was interesting to see how the data showcase the fact that corpus analysis really can work to isolate meanings, even when the types being analyzed are arbitrary strings (abbreviations) rather than complete lexical items with a recognizable morphology (or graphology in this case, I suppose, since the cables are exclusively written texts) as one would find in a natural language.

I also used this paper as an excuse to document the regular expressions I employed to clean WikiLeaks’ HTML files and to mine the State Department TAGs, so even if stepping through the academic exercise in corpus analysis isn’t interesting for you, you may still find some use in adapting my RegEx.
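For readers who want a feel for this kind of cleanup before opening the paper, here is a minimal Python sketch. It is not the RegEx documented in the paper. The sample markup is invented, and the layout it assumes (a single header line beginning with "TAGS:", with each TAG being two to five capital letters) is an assumption that real cable files may not always satisfy.

```python
import html
import re

# Drop <script> and <style> elements together with their contents.
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Drop any remaining markup such as <a href="...">, </pre>, <br/>.
MARKUP = re.compile(r"<[^>]+>")
# Assumed layout: one header line that starts with "TAGS:".
TAGS_LINE = re.compile(r"^TAGS:[ \t]*(.+)$", re.MULTILINE)
# Assumed TAG shape: two to five capital letters.
TAG_TOKEN = re.compile(r"\b[A-Z]{2,5}\b")


def clean_html(raw):
    """Strip markup and decode entities, leaving the plain cable text."""
    text = SCRIPT_STYLE.sub(" ", raw)
    text = MARKUP.sub("", text)
    return html.unescape(text)


def extract_tags(text):
    """Return the TAGs found on the first TAGS: line, or [] if there is none."""
    match = TAGS_LINE.search(text)
    if match is None:
        return []
    return TAG_TOKEN.findall(match.group(1))


# Hypothetical sample page, not copied from a real cable.
sample = """<html><body><pre>
TAGS: <a href="#">PREL</a>, <a href="#">PTER</a>, <a href="#">SP</a>
SUBJECT: MINISTER DISCUSSES SECURITY
1. (C) Summary &amp; comment follow.
</pre></body></html>"""

cleaned = clean_html(sample)
print(extract_tags(cleaned))  # ['PREL', 'PTER', 'SP']
```

Regular expressions are a blunt instrument for HTML, so a sketch like this suits consistently formatted pages better than arbitrary web markup.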

I do plan to revise this paper to submit for publication (someday), although it’s likely that I will start working with the Cablegate PostgreSQL file (read my prior post about using the Cablegate SQL file) to enable me to query the entire database before I do so. (At the time that I wrote my paper, only a small fraction of the Cablegate files were available to the public.) However, even the database version of the Cablegate files still does not isolate TAGs into their own attribute/column, so I’m going to have to go back to my RegEx again to scrub the data and create new tables for the TAGs and their definitions. I guess I have my work cut out for me. If you’re working on something similar, let me know. I’d love to hear about it.
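One possible shape for those new tables is sketched below in Python. To keep the example self-contained it uses an in-memory SQLite database as a stand-in for PostgreSQL. The table and column names (`cable`, `content`, `tag`, `cable_tag`) and the "TAGS:" header pattern are assumptions made for illustration, not the actual schema of the Cablegate SQL file.

```python
import re
import sqlite3

# Assumed layout: one "TAGS:" header line, TAGs of two to five capital letters.
TAGS_LINE = re.compile(r"^TAGS:[ \t]*(.+)$", re.MULTILINE)
TAG_TOKEN = re.compile(r"\b[A-Z]{2,5}\b")


def build_tag_tables(conn):
    """Create a TAG lookup table and a cable-to-TAG link table, then fill them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tag (code TEXT PRIMARY KEY, definition TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cable_tag ("
        "cable_id INTEGER, tag_code TEXT, PRIMARY KEY (cable_id, tag_code))"
    )
    rows = conn.execute("SELECT id, content FROM cable").fetchall()
    for cable_id, content in rows:
        match = TAGS_LINE.search(content or "")
        if match is None:
            continue  # no TAGS: line found in this cable
        for code in set(TAG_TOKEN.findall(match.group(1))):
            # Definitions start out NULL and can be filled in later.
            conn.execute("INSERT OR IGNORE INTO tag (code) VALUES (?)", (code,))
            conn.execute(
                "INSERT OR IGNORE INTO cable_tag (cable_id, tag_code) VALUES (?, ?)",
                (cable_id, code),
            )
    conn.commit()


# Hypothetical stand-in for the cable table, with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cable (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO cable (id, content) VALUES (?, ?)",
    [
        (1, "TAGS: PREL, PTER, SP\nSUBJECT: EXAMPLE ONE"),
        (2, "TAGS: PGOV, PREL, CU\nSUBJECT: EXAMPLE TWO"),
    ],
)
build_tag_tables(conn)

pter_ids = [
    row[0]
    for row in conn.execute(
        "SELECT cable_id FROM cable_tag WHERE tag_code = 'PTER' ORDER BY cable_id"
    )
]
print(pter_ids)  # [1]
```

Keeping the definitions in their own `tag` table means a newly defined TAG only has to be recorded once, however many cables carry it.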
