Friday, January 20, 2012

GraphLab input via Greenplum SQL



I got some great tips from Brian Dolan, Founder and VP product of Discovix Inc, a cool California area startup specializing in building machine learning and mathematical products.

Brian asked me a few days ago if we thought about interfacing GraphLab to some commonly used databases, since usually there is where the client data is found. Since I have no green clue in databses, Brian volunteered to teach me (and you my blog reader!) how to do it.

And here is the tutorial as I got it from Brian. First we create table with fake data
to be used in the example:
DROP SCHEMA IF EXISTS glab;
CREATE SCHEMA glab;

-- Create a table to hold some bogus data
DROP TABLE IF EXISTS glab.bs_data;
CREATE TABLE glab.bs_data
(
  rw int DEFAULT NULL
, cl int DEFAULT NULL
, vl numeric DEFAULT NULL
) DISTRIBUTED BY (rw, cl)
;

-- Now create some bogus data.  I use a few postgres tricks.
-- Ultimately, this just creates a relatively sparse "matrix" in tall format
-- I'm trying to use best practices in formatting and stuff
INSERT INTO glab.bs_data (rw, cl, vl)
SELECT
  x
, y
, random() AS vl
FROM 
  (
  SELECT
    CASE WHEN random() < 0.1 THEN x ELSE NULL END AS x
  , CASE WHEN random() < 0.1 THEN y ELSE NULL END AS y
  FROM
    generate_series(1, 1000) AS x
  , generate_series(1, 1000) AS y
  ) AS A
WHERE 1=1
  AND a.x IS NOT NULL 
  AND a.y IS NOT NULL 
;
After the table is ready we can push it into a file:
-- So we have some data.  Let's create an external table
-- to push it OUT of the database
-- You need to run gpfdist to write to an external table.
-- In this example, from the directory /Users/gpadmin/glab/tables the command was
-- > gpfdist 
DROP EXTERNAL TABLE IF EXISTS glab.ext_out;
CREATE WRITABLE EXTERNAL TABLE glab.ext_out (
  i bigint
, j bigint
, v numeric
) 
LOCATION ('gpfdist://macbuddha.local/glab_out.txt')
FORMAT 'TEXT' (DELIMITER ',' NULL ' ')
DISTRIBUTED BY (i, j)
;

-- Right, now let's put the data in it.  This is two queries unioned so
-- we can create the first row.  Best to do them separately.
INSERT INTO glab.ext_out
SELECT
  max(rw) AS i
, max(cl) AS j
, count(*) AS v
FROM glab.bs_data;

INSERT INTO glab.ext_out
SELECT 
  rw AS i
, cl AS j
, vl AS v
FROM  glab.bs_data
ORDER BY 1,2
;
And here is the example output glab_out.txt:
1000,1000,10002
1,69,0.71019060537219
1,577,0.747299919836223
1,627,0.252593372948468
1,753,0.120338548440486
1,768,0.34520703041926
1,826,0.756854422856122
1,838,0.827316934708506
1,936,0.342057122848928
2,323,0.090937509201467
...
Now it is very easy to read it in GraphLab. In Matlab/Octave do:
load glab_out.txt; % load data to memory
A = sparse(glab_out(2:end,1), glab_out(2:end,2), glab_out(2:end,3), glab_out(1,1), glab_out(1,2), glab_out(1,3)); % create a sparse matrix
mmwrite('A', A);  % save matrix to file. You may need to donload the script mmwrite.m from http://graphlab.org/mmwrite.m

Since this is potentially useful, I am going to have support for this data format soon in GraphLab, without the matlab conversion.

Next, Brian promised to teach me how to export data from GraphLab back to SQL.

No comments:

Post a Comment