Thursday, November 29, 2012

Hadoop Mortar

I got this from Carlos Guestrin. Hadoop Mortar is an interesting software-as-a-service framework that allows running Python and Pig scripts on top of Hadoop.
It has a very nice "Illustrate" feature which helps debug the code even before it is actually executed. It is a little reminiscent of Data Wrangler.

Some cons: you have to know SQL/Pig, so in that sense the UI does not simplify Pig programming. Some query previews may take as long as executing the full query, so you do not always gain much from the preview.

Monday, November 26, 2012

Misc updates

I got this from my collaborator Joey Gonzalez. Slides from Facebook about their database system.
This is the ultimate resource for anyone writing grant proposals on big data. For example, did you know that Facebook users send 6 billion messages per day?

An additional Facebook talk, this time about recommendations, was given by Ralf Herbrich at the machine learning summer school: Video part one. Part two. Ralf previously worked on the MatchBox project at Microsoft, which uses Bayesian message passing on factor graphs to compute recommendations. Now it seems he is promoting similar techniques at Facebook.

My colleague Josh Patterson from Cloudera sent me a link to his project called Knitting Boar, an SGD implementation on top of Hadoop and YARN. (Why boar?)

Now to some theory. I got this from Martin Takac:
I would like to share with you our new paper. We extended your Shotgun paper for wider class of functions and also showed that slight modification of step-size can guarantee convergence for any number of updates.

Thursday, November 22, 2012

GraphChi visual toolkit - or understanding your data

A few weeks ago I wrote about the Orange D4D data on cellular user behavior in Africa.
The phone call data is given as a text file. Here is a sample:
20000 20003
20000 20005
20000 20008
20000 20011
20000 20012
1052 20000
20001 20006
20002 20009
20002 20010
1052 20002 
 
Each line has the following format:
[calling user] [receiving user]\n
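
To make the format concrete, here is a minimal Python sketch (not part of GraphChi) that reads such lines and counts how many calls each user made and received. The function names read_call_edges and call_counts are hypothetical, and the input is just the ten sample lines shown above.

```python
from collections import Counter
from io import StringIO


def read_call_edges(lines):
    """Parse '[calling user] [receiving user]' lines into (caller, receiver) pairs."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        edges.append((int(parts[0]), int(parts[1])))
    return edges


def call_counts(edges):
    """Count the calls made and the calls received by every user."""
    out_calls = Counter(caller for caller, _ in edges)
    in_calls = Counter(receiver for _, receiver in edges)
    return out_calls, in_calls


# The sample lines from the post; a real run would pass an open file instead.
sample = StringIO(
    "20000 20003\n20000 20005\n20000 20008\n20000 20011\n20000 20012\n"
    "1052 20000\n20001 20006\n20002 20009\n20002 20010\n1052 20002\n"
)
edges = read_call_edges(sample)
out_calls, in_calls = call_counts(edges)
print(len(edges), "calls in the sample")
print("user 20000 made", out_calls[20000], "calls and received", in_calls[20000])
```

Counts like these already hint at who the heavy callers are, but they say nothing about how the neighbors connect to each other, which is where a picture helps.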

Since there are hundreds of thousands of phone calls, it is very hard to understand what the network structure actually is. I decided to write a quick visual tool that helps users examine their graphs and better understand their structure.

Here is how you can try it out:
1. Check out GraphChi from Mercurial using the instructions here.
2. # cd graphchi; bash install.sh; make parsers; make ga
3. # cd toolkits/visual
4. Run the visual toolkit to create a subgraph representation. You need to provide the input graph file name and the number of edges to extract. It is recommended to display fewer than 1000 edges, or else the plot may be slow.
 # bash make_data.csv.sh -f [input graph name] -n [number of lines]
 For example, you can use the sample graph provided:
   # bash make_data.csv.sh -f `pwd`/sample_graph -n 1000
5. # firefox index.html

Here are some examples of the images I got when playing with the Orange data:



As you can see, different kinds of users emerge very clearly. The red nodes are the "seed" users from which the graph was traversed. Each edge is a phone call connection. We can see different users:
1) unsocial - rarely makes phone calls
2) small network - few calls to neighbors
3) nagging - often calls call centers (highly connected neighbors)
4) social - connected to many friends who are themselves interconnected
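
The four types above were spotted by eye in the plots. As a rough illustration only, the following sketch computes three simple per-user numbers that loosely correspond to them: the number of distinct contacts, the average number of contacts of those contacts (high when someone calls a call center), and the fraction of contact pairs that also talk to each other. The helper names and the toy graph are made up for this example, edges are treated as undirected, and no thresholds are suggested.

```python
from collections import defaultdict
from itertools import combinations


def build_neighbors(edges):
    """Undirected adjacency sets built from (caller, receiver) pairs."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def user_features(neighbors, user):
    """Degree, mean degree of contacts, and local clustering for one user."""
    friends = neighbors.get(user, set())
    degree = len(friends)
    mean_friend_degree = (
        sum(len(neighbors[f]) for f in friends) / degree if degree else 0.0
    )
    possible_pairs = degree * (degree - 1) // 2
    linked_pairs = sum(1 for a, b in combinations(friends, 2) if b in neighbors[a])
    clustering = linked_pairs / possible_pairs if possible_pairs else 0.0
    return {
        "degree": degree,
        "mean_friend_degree": mean_friend_degree,
        "clustering": clustering,
    }


# Toy data (invented): user 1 sits in a tight group of friends,
# user 10 only calls the hub 11, which talks to many others.
toy_edges = [
    (1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (2, 4),
    (10, 11), (11, 12), (11, 13), (11, 14), (11, 15),
]
toy_neighbors = build_neighbors(toy_edges)
social = user_features(toy_neighbors, 1)
nagging = user_features(toy_neighbors, 10)
print("user 1:", social)
print("user 10:", nagging)
```

In this toy graph user 1 looks "social" (all contacts know each other), while user 10 looks "nagging" (a single contact that is itself highly connected).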

Next I tried the same visualization on some Twitter data I have. Each link is a tweet or retweet directed to a certain user.

Next I looked at some phone call data from a large European country. The graph captures a time span of only several minutes. It is interesting to see that from the gray node in the middle there is a 6-hop chain of someone who called someone who called someone, all in a very short time.



And here is a sample webpage which shows the output of the visualization.

Advanced features:
1) It is possible to traverse a graph starting from a set of seed nodes.
Use the command-line flag -s XXX, for example: -s 12
or -s 192,31990,2312

2) When selecting seed nodes, specify the number of hops to traverse using the -h XX flag. For example, -h 3 will traverse 3 hops around the set of seed nodes.

3) If your input file is not in sparse matrix market format but in [from] [to] format, you need to specify an upper limit on the number of graph nodes using the -o XX flag.
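
To illustrate what seed nodes and a hop limit mean, here is a small Python sketch of a breadth-first traversal that keeps only the edges walked within a given number of hops from the seeds. This is not the toolkit's code: the function name k_hop_subgraph is hypothetical, and treating calls as undirected during traversal is an assumption made for the example.

```python
from collections import defaultdict, deque


def k_hop_subgraph(edges, seeds, hops):
    """Return (kept_edges, distance) for nodes within `hops` of any seed."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    # Keep an edge only if one endpoint is strictly inside the hop limit,
    # so that walking the edge stays within `hops` steps of a seed.
    kept = [
        (a, b)
        for a, b in edges
        if a in distance and b in distance and min(distance[a], distance[b]) < hops
    ]
    return kept, distance


# The sample edges from the beginning of the post.
sample_edges = [
    (20000, 20003), (20000, 20005), (20000, 20008), (20000, 20011),
    (20000, 20012), (1052, 20000), (20001, 20006), (20002, 20009),
    (20002, 20010), (1052, 20002),
]
for hops in (1, 2, 3):
    kept, distance = k_hop_subgraph(sample_edges, [20000], hops)
    print(hops, "hop(s) from 20000:", len(kept), "edges")
```

A larger hop count quickly pulls in more of the graph, which is why it helps to combine a small number of hops with the edge limit when plotting.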

What does your data look like? I would love any feedback from people who are trying to visualize their own graphs. Let me know if you have any questions about the setup.


Credits: I am using the great d3.js package for performing the visualization. Thanks to Tyler Johnson, Shingo Takamatsu and Ali Bagheri Garakani from UW for teaching me how to deploy d3.js!

Thursday, November 8, 2012

Misc Updates

I got this from Tom Mitchell, CMU Machine Learning Department Head:
Here's a nice New York Times article on the role that datamining played in Obama's election campaign.  The head of the effort:  Rayid Ghani, one of the first graduates of our Machine Learning graduate program!
http://www.nytimes.com/2012/03/08/us/politics/obama-campaigns-vast-effort-to-re-enlist-08-supporters.html?pagewanted=all

I got this from my collaborator Joey Gonzalez:

Of course [my favorite method] can be used to solve this problem!
More hilarious academic jokes can be found here.

Now to more important stuff. Joey just gave a superb lecture about GraphLab at the Twitter course at Berkeley:
http://blogs.ischool.berkeley.edu/i290-abdt-s12/2012/10/02/video-lecture-video-learning-with-graphs-by-joey-gonzalez/

There are several rather interesting lectures available here.