Large Scale Machine Learning and Other Animals: Misc Updates

Thursday, June 28, 2012

Misc Updates

I got the following from both Justin Yan and Nimrod Priell in parallel. Xavier Amatriain has published some information about the Netflix recommendation engine in his blog post: part one and part two.
According to Xavier, the best performing algorithms in the Netflix context were SVD and RBM (both of them are implemented in our GraphLab collaborative filtering library):

We looked at the two underlying algorithms with the best performance in the ensemble: Matrix Factorization (which the community generally called SVD, Singular Value Decomposition) and Restricted Boltzmann Machines (RBM). SVD by itself provided a 0.8914 RMSE, while RBM alone provided a competitive but slightly worse 0.8990 RMSE. A linear blend of these two reduced the error to 0.88... we put the two algorithms into production, where they are still used as part of our recommendation engine.
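To make the idea of a linear blend concrete, here is a minimal sketch in plain Python. It is not the Netflix implementation: the data is synthetic, the two "models" are just noisy copies of the true ratings, and the helper names (rmse, fit_blend_weight, blend) are made up for illustration. It fits a single weight w by least squares so that w * pred_a + (1 - w) * pred_b is as close as possible to the true ratings.

```python
import math
import random


def rmse(pred, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def fit_blend_weight(pred_a, pred_b, truth):
    """Least-squares weight w for the blend w * pred_a + (1 - w) * pred_b."""
    num = sum((a - b) * (t - b) for a, b, t in zip(pred_a, pred_b, truth))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    return num / den if den else 0.5  # identical predictors: any weight works


def blend(w, pred_a, pred_b):
    """Apply the fitted weight to two prediction lists."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]


# Synthetic toy data (assumption): ratings in [1, 5] and two noisy predictors
# standing in for an SVD-style and an RBM-style model.
random.seed(0)
truth = [random.uniform(1, 5) for _ in range(1000)]
pred_a = [t + random.gauss(0, 0.90) for t in truth]
pred_b = [t + random.gauss(0, 0.95) for t in truth]

w = fit_blend_weight(pred_a, pred_b, truth)
blended = blend(w, pred_a, pred_b)

print("RMSE model A:", rmse(pred_a, truth))
print("RMSE model B:", rmse(pred_b, truth))
print("RMSE blend  :", rmse(blended, truth))
```

On the data used for fitting, the blend can never be worse than the better of the two inputs, because w = 1 and w = 0 reproduce each model on its own. In practice the weight would be fitted on a held-out set to avoid overfitting.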

...

Now it is clear that the Netflix Prize objective, accurate prediction of a movie's rating, is just one of the many components of an effective recommendation system that optimizes our members' enjoyment. We also need to take into account factors such as context, title popularity, interest, evidence, novelty, diversity, and freshness. Supporting all the different contexts in which we want to make recommendations requires a range of algorithms that are tuned to the needs of those contexts.

...

So, what about the models? One thing we have found at Netflix is that with the great availability of data, both in quantity and types, a thoughtful approach is required to model selection, training, and testing. We use all sorts of machine learning approaches: From unsupervised methods such as clustering algorithms to a number of supervised classifiers that have shown optimal results in various contexts. This is an incomplete list of methods you should probably know about if you are working in machine learning for personalization:

Linear regression

Logistic regression

Elastic nets

Singular Value Decomposition

Restricted Boltzmann Machines

Markov Chains

Latent Dirichlet Allocation

Association Rules

Gradient Boosted Decision Trees

Random Forests

Clustering techniques from the simple k-means to novel graphical approaches such as Affinity Propagation

Matrix factorization

If you are interested in learning more, you are invited to our GraphLab workshop, where Xavier will give a talk about the Netflix recommendation engine. I must say I am a bit disappointed, since I would be happy to learn more details about the algorithms deployed and which ones Xavier finds most useful, but I perfectly understand why he does not want to reveal their secret recipes.

I got the following from Nimrod Priell as well: a lecture from Foursquare describing their Hadoop stack.
If I got it correctly, it seems they are using an item-item similarity matrix for recommendations. He also sent a second lecture, about large scale machine learning at Twitter. If you like to stay on top of things, I recommend subscribing to Nimrod's weekly digest.
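For readers who have not met item-item similarity before, here is a minimal sketch of the general technique in plain Python. It is not Foursquare's code: the toy ratings matrix, the choice of cosine similarity, and the function names (item_similarity, recommend) are all assumptions made for illustration. Items are compared by the cosine of their rating columns, and a user's unseen items are scored by their similarity to the items that user already rated.

```python
import math


def item_similarity(ratings):
    """Cosine similarity between item columns of a user x item matrix (0 = unrated)."""
    n_items = len(ratings[0])
    cols = [[row[j] for row in ratings] for j in range(n_items)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    sim = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(n_items):
            # Leave the diagonal at zero so an item never recommends itself.
            if i != j and norms[i] and norms[j]:
                dot = sum(x * y for x, y in zip(cols[i], cols[j]))
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def recommend(ratings, sim, user, k=2):
    """Return up to k unrated item indices for `user`, best score first."""
    rated = ratings[user]
    scores = {}
    for i in range(len(rated)):
        if rated[i] == 0:  # only score items the user has not rated yet
            scores[i] = sum(sim[i][j] * rated[j] for j in range(len(rated)))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: 4 users x 4 items, 0 means "not rated".
ratings = [
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [5, 0, 0, 0],  # user 3 has only rated item 0
]

sim = item_similarity(ratings)
print(recommend(ratings, sim, user=3, k=2))
```

The similarity matrix depends only on the items, so it can be computed offline in a batch job (for example on Hadoop) and then looked up cheaply at serving time, which is one reason this approach is popular at scale.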

Last, I got the following from Sebastian Schelter, a key Mahout contributor:

We submitted a paper to this year's ACM RecSys which details some of Mahout's collaborative filtering algorithms, in which we also cite GraphLab as a potential solution for the distributed execution of iterative algorithms. The title is 'Scalable Similarity-Based Neighborhood Methods with MapReduce'.

Once Sebastian's paper is published online, I will link to it.
