Large Scale Machine Learning and Other Animals: June 2012

Friday, June 29, 2012

Alpine Data Labs

A couple of days ago I had an interesting introductory talk with Steven Hillion, VP Product in Alpine Data Labs. Alpine Data Labs is a Greenplum spin-off, based on some cool data analytics research initiated by my postdoc advisor Joe Hellerstein from Berkeley. You can learn some more about this research here.

So what is interesting about Alpine? This is one of the only companies I am aware that worries about the feature engineering BEFORE the data is actually fed into the analytics engine. They have a product which is called data miner which allows faster mining of the relevant data from the database. Here is a video with a demo:

Second, the MadLIB idea is to run the analytics on the actual database where the data is stored for faster processing. My guess this is useful for the rather simple algorithms.

Anyone who is interested in learning more about Alpine and its products is welcome to attend our first GraphLab workshop.

Thursday, June 28, 2012

Misc Updates

I got the following from both JustinYan, and Nimrod Priell in parallel. Xavier Amatriain has published some information about Netflix recommendation engine in his blog post: part one and part two.
According to Xavier, the best performing algorithms in the Netflix context where SVD and RBM (both of them implemented in our GraphLab collaborative filtering library):

We looked at the two underlying algorithms with the best performance in the ensemble: Matrix Factorization (which the community generally called SVD,Singular Value Decomposition) and Restricted Boltzmann Machines (RBM). SVD by itself provided a 0.8914 RMSE, while RBM alone provided a competitive but slightly worse 0.8990 RMSE. A linear blend of these two reduced the error to 0.88... we put the two algorithms into production, where they are still used as part of our recommendation engine.

...

Now it is clear that the Netflix Prize objective, accurate prediction of a movie's rating, is just one of the many components of an effective recommendation system that optimizes our members enjoyment. We also need to take into account factors such as context, title popularity, interest, evidence, novelty, diversity, and freshness. Supporting all the different contexts in which we want to make recommendations requires a range of algorithms that are tuned to the needs of those contexts.

...

So, what about the models? One thing we have found at Netflix is that with the great availability of data, both in quantity and types, a thoughtful approach is required to model selection, training, and testing. We use all sorts of machine learning approaches: From unsupervised methods such as clustering algorithms to a number of supervised classifiers that have shown optimal results in various contexts. This is an incomplete list of methods you should probably know about if you are working in machine learning for personalization:

Linear regression

Logistic regression

Elastic nets

Singular Value Decomposition

Restricted Boltzmann Machines

Markov Chains

Latent Dirichlet Allocation

Association Rules

Gradient Boosted Decision Trees

Random Forests

Clustering techniques from the simple k-means to novel graphical approaches such as Affinity Propagation

Matrix factorization

If you are interested in learning more, you are invited to our GraphLab workshop where Xavier will give a talk about Netflix recommendation engine. I must say I am a bit disappointed since I would by happy to learn in some more details about the algorithms deployed and which one Xavier finds more useful - but I perfectly understand why he does not want to reveal their secret recipes..

I got the following from Nimrod Priell as well. A lecture in FourSquare describing their Hadoop stack.
If I got it correctly, it seems they are using item-item similarity matrix for recommendations. A second lecture about large scale machine learning in twitter. If you like to stay on top of things, I recommend subscribing to Nimrod's weekly digest.

Last, I got from Sebastian Schelter, a key Mahout contributor the following:

We submitted a paper to this years ACM RecSys which details some of Mahout's collaborative filtering algorithms, in which we also cite GraphLab as a potential solution for the distributed execution ofiterative algorithms. The title is 'Scalable Similarity-Based Neighborhood Methods withMapReduce'.

Once Sebastian paper is published online I will link to it.

Wednesday, June 27, 2012

Triangle Counting in GraphLab

My collaborator Yucheng Low have implemented a cool and simple triangle counting algorithm. The algorithm is based on the following PhD thesis: T. Schank. Algorithmic Aspects of Triangle-Based Network Analysis. Phd in computer science, University Karlsruhe, 2007.

Triangle counting algorithm counts for each graph node, in how many graph triangles this node participates in. In my mind, triangle counting relates to statistical mechanics and is completely useless. However, as we will show below, there are actual some interesting intuitions when you apply it to twitter social network.

I got the below text from Yucheng which simply summarizes the algorithm:
*

* The procedure is quite straightforward:

* - each vertex maintains a list of all of its neighbors in a hash table.

* - For each edge (u,v) in the graph, count the number of intersections

* of the neighbor set on u and the neighbor set on v.

* - We store the size of the intersection on the edge.

* This will count every triangle exactly 3 times. Summing across all the

* edges and dividing by 3 gives the desired result.

* The preprocessing stage take O(|E|) time, and it has been shown that this

* algorithm takes $O(|E|^(3/2))$ time.

* If we only require total counts, we can introduce a optimization that is

* similar to the "forward" algorithm

* described in thesis above. Instead of maintaining a complete list of all

* neighbors, each vertex only maintains a list of all neighbors with

* ID greater than itself. This implicitly generates a topological sort

* of the graph.

* Then you can see that each triangle

A----->C

| ^

| /

v /

* Must be counted only once. (Only when processing edge AB, can one

* observe that A and B have intersecting out-neighbor sets).

And here are some results mined from the Twitter graph and made visual by Joey Gonzalez.
The graph contains a twitter capture of 41M users from 2009 of the type user->follower.
Degree - the number of people the user is following.
Followers - the number of users following this user.
Cycle triangles - number of triangles of the form A->B->C->A for this user.
Through triangles - number of triangles of the type A->B, B->C, A<=>C for user B.
Triangles - number of triangles going out this user (A->B, A->C for user A).
In triangles - number of triangles entering this user (B->A, C->A, B<=>C for user A).

Monday, June 25, 2012

GraphChi - out of core graph computation on mac mini

Here is a nice image I got from my collaborator Aapo Kyrola:

Aapo's GraphChi system enables to run graph algorithms on graph with billions of edges on a mac mini computer. It is an out of core computation which makes online passes on the data from disk (without reading it fully to memory). If you are interested in learning more - come to see Aapo's demo at our GraphLab workshop.

Read more on GraphChi

Wednesday, June 20, 2012

Hadoop Fatigue

Here is a very interesting blog post by Ryan Rosario about some viable alternatives to Hadoop.
Thanks to Mohit Singh, our man in One Kings Lane for the link.

Sunday, June 17, 2012

Titan: new graph database

I got a link to this lecture from Matthias Broecheler. It looks like a very interesting new graph database. Anyone who is interested to learn more about Titan is welcome to our GraphLab workshop where Matthias will present Titan.

Titan: The Rise of Big Graph Data

View more presentations from Marko Rodriguez

Friday, June 8, 2012

YarcData - new graph analytics contest

I just heard from Andy Krasnow - the ultimate head hunter about this contest. Here is some interesting information from their press release (text marked in bold by me)

YarcData Inc., a Cray company (Nasdaq: CRAY), today announced the planned launch of a "Big Data" contest featuring $100,000 in prizes. The YarcData Graph Analytics Challenge will recognize the best submissions for solutions of un-partitionable, Big Data graph problems.

YarcData is holding the contest to showcase the increasing applicability and adoption of graph analytics in solving Big Data problems. The contest is also intended to promote the use and development of RDF and SPARQL (both standards developed by the World Wide Web Consortium) as the industry standard for graph analytics.

YarcData's uRiKA system is a Big Data appliance for graph analytics. uRiKA helps enterprises reveal unknown, unexpected or hidden relationships in Big Data by creating a highly-scalable, real-time graph analytics warehouse that supports ad hoc queries, pattern-based searches, inferencing and deduction. The uRiKA system is a purpose-built appliance for graph analytics featuring graph-optimized hardware that provides up to 512 terabytes of global shared memory, massively-multithreaded graph processors supporting 128 threads/processor, and highly scalable I/O with data ingest rates of up to 350 terabytes per hour - and an RDF/SPARQL database optimized for the underlying hardware enabling applications to interact with the appliance using industry standard interfaces. uRiKA complements an existing data warehouse or Hadoop cluster by offloading graph workloads and interoperating within the existing analytics workflow.

Well what is the best software for doing graph analytics?? :-)

An update- I have contacted YarcData, and they will be presenting a poster about their contest in our upcoming GraphLab workshop. Thanks to Venkat Krishnamurthy for setting up the connection.

Wednesday, June 6, 2012

AUC Explanation

I got this email from David Lexer:

I'm currently learning Mahout (e.g. - 'Mahout in Action') and discovered your Blog. Cool!
Question (if you don't mind too much):

I just finished the lectures from 'Learning from Data' @Caltech and am in week 8 of 'Machine Learning' @Stanford/Coursera.
http://work.caltech.edu/telecourse.html
https://class.coursera.org/ml/class/index
Neither course discussed the 'AUC' metric for classification.
Do you use it? What's the mathematical basis?
Thanks in Advance,
-Dave

My answer:
I am not using AUC so often, but I find it useful for cases the prediction is binary (0/1) and you need to estimate your prediction accuracy.
Another example where AUC is useful is in this year ACM KDD CUP 2012, track 2 they used AUC to evaluate accuracy of predicting CTR (click through rate) of ads. They recommend the paper ‘ROC graphs: Notes and practical considerations for researchers‘ by Tom Fawcett as a good resource for learning about AUC.

Some more detailed explanation of AUC is found here, along with a few more references.

By the way, if we brought up KDD CUP track2 this year, in a few days I will have some interesting news to report. As for now I can only hint by showing this image:

Stay tuned!

Large Scale Machine Learning and Other Animals