Joining Citizen Audit records with Detroit Ledger

Detroit Ledger is rising from a brief few months of hibernation: we received some funding to improve our database (more details soon), and all of a sudden, Jessica, Matt, and I are buzzing with ideas.

One of the ideas: relate all of our organization entities with IRS 990 and entity records found in Citizen Audit's dataset. With this JOIN, we can pull in EINs (a very nice unique key, right?) as well as Citizen Audit's yearly budget/salary figures, among other goodies.

Citizen Audit has an API that allows one-at-a-time row requests, as well as some data dumps (see link above) in the form of PostgreSQL tables. I want to look around this data and eventually import it into MySQL tables to live inside Detroit Ledger.

Most of this methodology can apply to any situation where you want to investigate postgres data and then pull it into mysql.

This is what I do:

Have a Debian Wheezy server

I'm on a 512M digitalocean instance.

Set up postgres

cat>>sources.list.d/pdgd.list<<EOF  
deb http://apt.postgresql.org/pub/repos/apt/ wheezy-pgdg main  
EOF  
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -  
sudo apt-get update  
sudo apt-get install postgresql-9.3 pgadmin3  

For our datamunging tasks, add a postgres user with no password who can create databases.

sudo -u postgres createuser -d -w citizenaudit  

Get the psql tables

wget http://s3.citizenaudit.org/irs/bulk/masterfile.sql.gz  
wget http://s3.citizenaudit.org/irs/bulk/extracts.sql.gz  
wget http://s3.citizenaudit.org/irs/bulk/manifest.csv.gz  

Look around data...

Looks useful. We can link on names and stuff. Now I need to import this biz into MySQL.

Install PHP and MySQL

# add repos from dotdeb.org first, and then...
sudo apt-get install php5-cli mysql-client mysql-server  

Convert and load

git clone git@github.com:swrobel/pg2mysql.git  
cd pg2mysql  
# increase memory limit
git apply<<EOF  
diff --git a/pg2mysql.inc.php b/pg2mysql.inc.php  
index a60bdda..09fca3f 100644  
--- a/pg2mysql.inc.php
+++ b/pg2mysql.inc.php
@@ -21,7 +21,7 @@ the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 Boston, MA 02111-1307, USA.
 */

-ini_set("memory_limit","512M");
+ini_set("memory_limit","2048M");
 error_reporting(E_ALL & ~E_DEPRECATED);
 define ('PRODUCT',"pg2mysql");
 define ('VERSION',"1.9");
EOF  
php pg2mysql_cli.php ../extracts.sql ../extracts-my.sql  
php pg2mysql_cli.php ../masterfile.sql ../masterfile-my.sql  
# as root in mysql:
CREATE DATABASE ca;  
CREATE USER 'bc'@'localhost' IDENTIFIED BY 'smelly';  
GRANT ALL PRIVILEGES ON ca . * TO 'bc'@'localhost';  

and load the data...

mysql -ubc -psmelly ca < extracts-my.sql  
mysql -ubc -psmelly ca < masterfile-my.sql  

Great. Look at those tables.

Next steps take place in Drupal: create Data schema definitions for our tables and join everything up!