Using the Cablegate SQL file

This post is for non-developers who are interested in working with the cable_db_full.sql database of diplomatic cables that Wikileaks released as a part of their “Cablegate” project. It tells you how to download the file, unpack it, and install it in PostgreSQL under XAMPP, allowing you to view and query the database in a browser-based graphical interface (phpPgAdmin) from your local drive rather than a dedicated server.

Research background
I’ve been working with Wikileaks’ Cablgate corpus for almost a year now — looking for language patterns by running the HTML version of some of these cables through desktop software designed for linguistic analysis: WordSmith Tools, AntConc, and TextCrawler. (Actually TextCrawler is more like a sophisticated tool for doing search-and-replace operations: it was handy for processing files before I analyzed them in WordSmith and AntConc.)

In September 2011, however, Wikileaks released the entire 250k + files. Unpacked, the HTML files are a whopping 60 GB now, which is far more than any of my desktop programs can handle. Wikileaks also released an *.sql file containing the same information at a fraction of the weight: 1.7 GB.

As an aside, this last data dump came as a result of some drama between Wikileaks and a reporter who had been working with the organization to redact sensitive information before the cables were released. The latest data are completely un-redacted, and I have persisted in my studies of this corpus despite feeling some non-trivial ethical queasiness. My doubts on this score may become the subject of a future post.

Not a developer
In order to continue working with this corpus, I realized that I was going to need to learn about relational databases (or rather to expand my knowledge from MS Access out to larger, server-oriented solutions), so I spent this weekend repeatedly running my head against a brick wall in an attempt to get at the data in cable_db_full.sql, a PostgreSQL dump that you can find by going to the Cableviewer page on Wikileaks and following the BitTorrent link at the bottom of the page titled, “Cable database in searchable format – for developers (updated!).”

If you even know what PostgreSQL is, then you already know more than I did when I started hammering away at this file Saturday afternoon. The steps that follow are probably painfully obvious to the developers for whom this file was intended. I’m not a developer. I’m just getting started. Also, I don’t have access to a server that can handle this much data yet, so I had to find a way to access it on my local PC. I may expand on my environment constraints and the steps below in a future post. For now, here’s what I did, for what it’s worth.

Steps to open cable_db_full.sql

  1. Install Xampp
  2. Realize that this is a PostgreSQL database, which is very different from a MySQL database (if you remember only this step, you will already be 5 hours ahead of me)
  3. Add PostgreSQL to Xampp (here’s a tutorial).
  4. Make sure Xampp is running, then go to http://localhost/phppgadmin (the browser-based graphical manager for PostgreSQL).
  5. Create new database, cable_db_full
  6. Add PostgreSQL’s bin folder (should be c:/xampp/pgsql/bin) to PATH environment variable in Windows, or do whatever you need to do to send PostgreSQL commands through the command line in another OS
  7. Download cable_db_full.7z.torrent (requires BitTorrent)
  8. Extract cable_db_full.7z (requires 7zip)
  9. Open command line and go to the directory containing the extracted file, cable_db_full.sql
  10. Run this command (assumes that PostgreSQL username is the default ‘postgres’ created in step 3, and that the environment variable for PostgreSQL has been added as in step 6):
         psql -U postgres cable_db_full < cable_db_full.sql
  11. Your command line should start chewing away at the import, showing a bunch of INSERT 0 1 messages. If you get through all of these and end with ALTER TABLE, then you’re done. You can start browsing and querying your database in http://localhost/phpgadmin, which you installed in step 3.

I ran into some problems with errors in the *.sql file, so I had to edit it with a trial version of 010 Editor (none of my usual text editors could handle a file this large). I also ran into a problem with some missing *.dll libraries after step 3. I did a lot of random Googling for this one, and ended up copying the two “missing” files from one folder to another. If you run into this problem, mention it in a comment and I’ll try to be more specific about my resolution.

5 thoughts on “Using the Cablegate SQL file

  1. Hi,

    I am getting this error:
    “ERROR: unterminated quoted string at or near “‘A?A?A?A!A±”
    ?s 1: …IPOLI358’, ‘CONFIDENTIAL’, ‘Embassy Tripoli’, ”, ‘A?A?A?A!…”

    Is there any way to fix this problem?

    Thanks in advance.

    • I ran into this too. I’d hoped Wikileaks had fixed this bad data in the updated version of the file, but I guess they haven’t.

      In this file, different columns in a row of data are delimited with a comma, and strings are enclosed in single quotation marks. In order to keep the data lined up correctly, each row needs to have the same number of columns… which means each row needs to have the same number of strings enclosed in single quotes and delimited with a comma. For example, here is a row with three columns:

      ‘column 1′,’column 2′,’column 3’

      But if another row had no data in column 2, it would look like this:

      ‘column 1′,”,’column 3’

      (Note that the empty column is two single quotes side-by-side, not one double quote).

      So what’s happening with that junked up data (“‘A?A?A?A!A±”?s 1:) and (‘A?A?A?A!…”) is probably that the junk contains an extra single-quote mark that’s throwing off the data structure.

      Rather than trying to hunt down that bad single quote mark, I went in and deleted all of the junk data. Just remember when you do this to leave behind the proper number of empty columns if the junk is occupying a column that really needs to be in there. (Take a look at the structure of a well-formed row to make sure you have the correct number of columns).

      The junk data seems to have a pretty unique pattern: not consistent enough for a global search-and-replace, but consistent enough so that you should be able to copy some small part of the junk string and use this as a search term to locate all of the offending bits. There aren’t many of these junky columns, but there are definitely more than one.

      In order to make the changes I’ve just described you’ll need to open the file in a text editor (before trying the import again). The text editor needs to be able to handle large files. None of my usual applications were up to the job. I used the trial version of 010 Editor, and this worked.

      Good luck!

  2. I almost going to delete this cable_db_full file downloaded about half year ago since I failed to import it to MySQL.

    Adding PostgreSQL to Xampp, I tried on several pcs, not working. Independently installed PostgreSQL, worked. Import encountered (“‘A?A?A?A!A±”?s 1:) and (‘A?A?A?A!…”) problems, about 18 times. Each has to stop, edit and redo. So it takes 5 hours to do so. But if without these above posts, I’ll never can. Thanks a lot!!!

  3. Hi Jeff:

    I am using the pgadmin gui (with the elephant) but can’t download the cable_db_full.sql file into the postgres database. Is it because I am missing a dll file? I see that libiconv-2.dll is missing. Do you have a copy of libiconv that you could email me? I don’t feel it is safe to download online. Any other suggestions?

    Thanks.

    • It’s been awhile since I worked with the DB, but I will probably be picking it up again in a couple of weeks. I’ll try to reproduce your issue then and let you know if I make any progress.

      The rest of this may not be so helpful since I’m working off the top of my head: The issues you ran into sound somewhat similar to mine. In your error message, does it give you the full path of the location where the program expects the dll to be? When I ran into this error, it turned out that I had all the dll files I needed — they were just in the wrong place. So once I knew where they were supposed to be, I located them in the program’s folders, then copied them over to the location where the program wanted to find them when it threw the error.

      The other common error I saw had to do with junk data in the DB itself. The comment above from “J” on March 5 goes into this a bit more. Basically you have to find another text editor besides your DB browser that can handle a file of this size, then search for the junk strings and remove them, making sure that you keep commas in place so that empty columns that formerly contained the bad data will still be delimited as columns when you import this into your database browser.

Leave a Reply

Your email address will not be published. Required fields are marked *