This post is for non-developers who are interested in working with the cable_db_full.sql database of diplomatic cables that Wikileaks released as a part of their “Cablegate” project. It tells you how to download the file, unpack it, and install it in PostgreSQL under XAMPP, allowing you to view and query the database in a browser-based graphical interface (phpPgAdmin) from your local drive rather than a dedicated server.
I’ve been working with Wikileaks’ Cablgate corpus for almost a year now — looking for language patterns by running the HTML version of some of these cables through desktop software designed for linguistic analysis: WordSmith Tools, AntConc, and TextCrawler. (Actually TextCrawler is more like a sophisticated tool for doing search-and-replace operations: it was handy for processing files before I analyzed them in WordSmith and AntConc.)
In September 2011, however, Wikileaks released the entire 250k + files. Unpacked, the HTML files are a whopping 60 GB now, which is far more than any of my desktop programs can handle. Wikileaks also released an *.sql file containing the same information at a fraction of the weight: 1.7 GB.
As an aside, this last data dump came as a result of some drama between Wikileaks and a reporter who had been working with the organization to redact sensitive information before the cables were released. The latest data are completely un-redacted, and I have persisted in my studies of this corpus despite feeling some non-trivial ethical queasiness. My doubts on this score may become the subject of a future post.
Not a developer
In order to continue working with this corpus, I realized that I was going to need to learn about relational databases (or rather to expand my knowledge from MS Access out to larger, server-oriented solutions), so I spent this weekend repeatedly running my head against a brick wall in an attempt to get at the data in cable_db_full.sql, a PostgreSQL dump that you can find by going to the Cableviewer page on Wikileaks and following the BitTorrent link at the bottom of the page titled, “Cable database in searchable format – for developers (updated!).”
If you even know what PostgreSQL is, then you already know more than I did when I started hammering away at this file Saturday afternoon. The steps that follow are probably painfully obvious to the developers for whom this file was intended. I’m not a developer. I’m just getting started. Also, I don’t have access to a server that can handle this much data yet, so I had to find a way to access it on my local PC. I may expand on my environment constraints and the steps below in a future post. For now, here’s what I did, for what it’s worth.
Steps to open cable_db_full.sql
- Install Xampp
- Realize that this is a PostgreSQL database, which is very different from a MySQL database (if you remember only this step, you will already be 5 hours ahead of me)
- Add PostgreSQL to Xampp (here’s a tutorial).
- Make sure Xampp is running, then go to http://localhost/phppgadmin (the browser-based graphical manager for PostgreSQL).
- Create new database, cable_db_full
- Add PostgreSQL’s bin folder (should be c:/xampp/pgsql/bin) to PATH environment variable in Windows, or do whatever you need to do to send PostgreSQL commands through the command line in another OS
- Download cable_db_full.7z.torrent (requires BitTorrent)
- Extract cable_db_full.7z (requires 7zip)
- Open command line and go to the directory containing the extracted file, cable_db_full.sql
- Run this command (assumes that PostgreSQL username is the default ‘postgres’ created in step 3, and that the environment variable for PostgreSQL has been added as in step 6):
psql -U postgres cable_db_full < cable_db_full.sql
- Your command line should start chewing away at the import, showing a bunch of INSERT 0 1 messages. If you get through all of these and end with ALTER TABLE, then you’re done. You can start browsing and querying your database in http://localhost/phpgadmin, which you installed in step 3.
I ran into some problems with errors in the *.sql file, so I had to edit it with a trial version of 010 Editor (none of my usual text editors could handle a file this large). I also ran into a problem with some missing *.dll libraries after step 3. I did a lot of random Googling for this one, and ended up copying the two “missing” files from one folder to another. If you run into this problem, mention it in a comment and I’ll try to be more specific about my resolution.