Large Scale Machine Learning and Other Animals: May 2013

Tuesday, May 28, 2013

Nice Scala Tutorial

My collaborator Sagie Davidovich sent his Scala tutorial talk. Very recommended.
To me, which I am not that familiar with Scala there are two very impressive facts about the language. First, the length of resulting code vs. Java is super short. Second, the support for parallel execution of code is very impressive.

Sunday, May 26, 2013

Cascading and Scalding

I got the following interesting links from Shaul Dar, Director of Risk Data Science at Paypal. A brief overview of Cascading and Scalding, two higher level abstractions for data processing on top of Hadoop's map reduce. Cascading and Scalding are an interesting alternative to Pig.

1. Cascading - http://www.cascading.org/

In particular take the time to look at the “Cascading for the Impatient series”, start with Part 1. This will really give you the feel for it.

2. (Cascading vs Pig) http://stackoverflow.com/questions/3681494/does-anyone-find-cascading-for-hadoop-map-reduce-useful

E.g.: “The biggest single advantage I see in Cascading is that it allows you to think about your data processing workflow in terms of operations on fields, and to (mostly) avoid worrying about how to transpose this view of the world onto the key/value model that's intrinsically part of any map-reduce implementation.

The biggest challenge with Cascading is that it is a different way of thinking about data processing workflows, and there's a corresponding conceptual "hump" you need to get over before it all starts making sense.”

3. Scalding - https://github.com/twitter/scalding

4. (Scalding examples) http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/

“Scalding is an in-house MapReduce framework that Twitter recently open-sourced. Like Pig, it provides an abstraction on top of MapReduce that makes it easy to write big data jobs in a syntax that’s simple and concise. Unlike Pig, Scalding is written in pure Scala – which means all the power of Scala and the JVM is already built-in. No more UDFs, folks!”

Sunday, May 19, 2013

An Overview of Graph Processing Frameworks

A nicely written overview of Graph processing frameworks by Ben Lorica from O'Reilly media. Will help you master the essential buzzwords on big data analytics and graph processing.. And explains quite clearly why you should attend our 2nd GraphLab workshop!

Kaggle Titanic Contest

I got this from Sagie Davidovich a link to Kaggle's Titanic Contest. The task is to predict who survived the Titanic based on some features like age, cabin class, name, and presence of relatives on the Ship.
What I like about this task, is that you can actually explain to your gradma what you are predicting (unlike many other ML tasks which are hard to explain for the non expert..). The dataset is quite tiny in our standards. But still enables the usage of different ML methods. The best prediction on the leader-board has about 99% accuracy.

Tuesday, May 14, 2013

Funding for the next generation of GraphLab

The GraphLab journey began with the desire:

to rethink the way we approach Machine Learning and Graph analytics,
to demonstrate that with the right abstractions and system design we can achieve unprecedented levels of performance, and
to build a community around large-scale graph computation.

We have been blown away by the excitement and growth of the GraphLab community and have been unable to keep up with the incredible interest from our amazing users.

Therefore, we are proud to announce GraphLab Inc, a company devoted to accelerating the development of the open-source GraphLab project.

Why a company?

Put simply, we need a full-time dedicated effort to take GraphLab from where it is today to where we would like to see it go in the future. With a dedicated team, you will see exciting new features, more integration with Cloud infrastructure, easier installation and deployment, along with a revamped support effort, with additional commercial-grade options and tools.

What happens to the open-source project?

The open source project will remain the flagship technology where we push the limits of graph computation and develop new ideas. With a dedicated team focused on the code, documentation, deployment process and most importantly support, you can expect a regular release schedule, faster turnaround on bug fixes and improvements, and a higher quality bar for the code going forward. We will continue to rely on our community to push us to tackle bigger, harder problems, and to contribute back to the open-source effort.

Are you going to charge for GraphLab now?

We believe the only way the best ideas in machine learning are developed is with the help of a vibrant community (it takes a village). That is why GraphLab will remain an open-source project; now with a company behind it, we can double our efforts to keeping the project healthy.

What is the roadmap for the open-source project?

GraphLab 2.2 is just around the corner, see here for more details as to what is in it. Beyond that, we are exploring a new computation engine and further enhancements to the communication layer, as well as simpler integration with existing Cloud technologies, easier installation procedures, and an exciting new graph storage system. And of course, we look forward to working with you to develop the roadmap and build the next generation of the GraphLab system.

The next 12 months will be an exciting time for us, and we hope that everyone will come along for the ride

Learn more at graphlab.com

Sunday, May 12, 2013

Bond Percolation in GraphLab

I was asked by Prof. Scott Kirkpatrick to help and implement bond percolation in GraphLab. It is an oldie but goldie problem which is closely related to the connected components problem.

Here is an explanation about bond percolation from Wikipedia:
A representative question (and the source of the name) is as follows. Assume that some liquid is poured on top of some porous material. Will the liquid be able to make its way from hole to hole and reach the bottom? This physical question is modelled mathematically as a three-dimensional network of n × n × n vertices, usually called "sites", in which the edgeor "bonds" between each two neighbors may be open (allowing the liquid through) with probability p, or closed with probability 1 – p, and they are assumed to be independent. Therefore, for a given p, what is the probability that an open path exists from the top to the bottom? The behavior for large n is of primary interest. This problem, called now bond percolation, was introduced in the mathematics literature by Broadbent & Hammersley (1957), and has been studied intensively by mathematicians and physicists since.

2-d bond percolation problem. Will the water find their way

along the edges of this maze from top to bottom?

The algorithm for finding the connected edges is very simple:

In parallel

- for each edge, record the minimum id of the connected edges

- if there are no more changes in the network, break

This algorithm is frequently used in social networks to find groups of friends. And it is of course distributed and scales to very large problems. The output is the min component found for each edge. From this output we can find the edge clusters.

I have utilized the newest experimental version of GraphLab (v. 2.2) for quickly implementing the above algorithm. The resulting code is surprisingly simple. The main loop of the program:

//create a GraphLab engine
engine_type engine(dc, graph, clopts);
//register the map and combine operations that will be used in the update function

engine.register_map_reduce(BOND_PERCOLATION_MAP_REDUCE, bond_percolation_map, bond_percolation_combine);

  //in a loop
  for (int i=0; i< max_iter; i++){
     //perform update function on each node

     engine.parfor_all_local_vertices(bond_percolation_function);
     //wait until all nodes are done

     engine.wait();
     //count the number of unsatisfied links

     size_t diff = graph.map_reduce_edges<size_t>(count_component);

     //if no more links to explore we are done

     if (diff == 0)

       break;

}

bond_perculation_function which is executed in parallel across all node. In this function, we traverse all connected edges, and compare their minimum edges id using the bond_perculation_combine. As it is easy to verify, each of those implemented function has a single line of code.

//return the min component id found across this edges and its two connecting nodes

unsigned int bond_percolation_map(const graph_type::vertex_type& center,

                         graph_type::edge_type& edge,

                         const graph_type::vertex_type& other) {

   return edge.data().comp_id =  std::min(std::min(center.data().comp_id, edge.data().id), other.data().comp_id);

}

//find min component of two connected edges

void bond_percolation_combine(unsigned int& v1, const unsigned int& v2) {

    v1 = std::min(v1, v2);

}

//the main update function, go over all nodes and selects their min edge id

void bond_percolation_function(engine_type::context_type& context,

                  graph_type::vertex_type& vertex) {

     vertex.data().comp_id =  context.map_reduce<unsigned int>(BOND_PERCOLATION_MAP_REDUCE, graphlab::ALL_EDGES);

}

//count the number of components that are still not satisfied. When the number of components are zero, we are done

size_t count_components(const graph_type::edge_type & edge) {

  return = (edge.source().data().comp_id != edge.target().data().comp_id);

}

The full code is here. It is now part of the graph_analytics toolkit in v2.2. Needless to say it is working.. :-)

Saturday, May 11, 2013

Hadoop Mortar

Just got from my boss Prof. Carlos Guestrin this link: http://blog.mortardata.com/post/49934459499/recommender-systems-for-free

It looks like a nice publicity stunt - Hadoop Mortar is proposing to build recommender systems for free for selected companies. I wonder if this will actually work?

Thursday, May 9, 2013

Fun Readings about Startups

I got the following from my collaborator Jay Gu:

Here's a fun reading about startup: a well written lecture notes on Peter Thiel's course at Stanford, which covers topics like culture, hiring, strategy, lessons and mindset etc. I've read first 5 chapters and found it interesting and rewarding so I decided to share with all of you.

Html version

http://blakemasters.com/peter-thiels-cs183-startup

Epub version if you like to read on your iPad

http://stanfy.com/blog/epub-version-of-peter-thiel-lectures-at-stanford-university-cs183-startup/

Below are some fun quotes just give you a quick taste.

About hiring good engineers from "Google":

"So the way to compete against the giants is not with money. Google will outbid you. They have oil derrick that spits out $30bn in search revenue every year. To win, you need to tell a story about cogs. At Google, you’re a cog. Whereas with me, you’re an instrumental piece of this great thing that we’ll build together. Articulate the vision. Don’t even try to pay well. Meet people’s cash flow needs. Pay them so they can cover their rent and go out every once in awhile. It’s not about cash. It’s about breaking through the wall of cynicism. It’s about making 1% of this new thing way more exciting than a couple hundred grand and a cubicle at Google."

"We tend to massively underestimate the compounding returns of intelligence. As humans, we need to solve big problems. If you graduate Stanford at 22 and Google recruits you, you’ll work a 9-to-5. It’s probably more like an 11-to-3 in terms of hard work. They’ll pay well. It’s relaxing. But what they are actually doing is paying you to accept a much lower intellectual growth rate. When you recognize that intelligence is compounding, the cost of that missing long-term compounding is enormous. They’re not giving you the best opportunity of your life. Then a scary thing can happen: You might realize one day that you’ve lost your competitive edge. You won’t be the best anymore. You won’t be able to fall in love with new stuff. Things are cushy where you are. You get complacent and stall. So, run your prospective engineering hires through that narrative. Then show them the alternative: working at your startup."

The fundamental question

"The path from 0 to 1 might start with asking and answering three questions: First, what is valuable? Second, what can I do? And third, what is nobody else doing?

The intellectual rephrasing of these questions is: What important truth do very few people agree with you on?

The business version is: What valuable company is nobody building?"

About escaping competition:

"Intense competition makes things hard because you just beat heads with other people. The intensity of competition becomes a proxy for value. But value is a different question entirely."

About how to own a market:

"For a company to own its market, it must have some combination of brand, scale cost advantages, network effects, or proprietary technology."

About hiring nerds vs athletes:

"In thinking about building good company culture, it may be helpful to dichotomize two extreme personality types: nerds and athletes. Engineers and STEM people tend to be highly intelligent, good at problem solving, and naturally non zero-sum. Athletes tend to be highly motivated fighters; you only win if the other guy loses."

"The optimal spot on the matrix is monopoly capitalism with some tailored combination of zero-sum and non zero-sum oriented people. You want to pick an environment where you don’t have to fight. But you should bring along some good fighters to protect your non zero-sum people and mission, just in case."

New collaborative filtering functionality in GraphLab and GraphChi

A few days ago I wrote in this blog about parallel coordinate descent for speeding up ALS.
I have just implemented parallel ALS using coordinate descent a.k.a.

CCD++ algorithm. The algorithm is described in the following two papers:

H.-F. Yu, C.-J. Hsieh, S. Si, I. S. Dhillon, Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems. IEEE International Conference on Data Mining(ICDM), December 2012.

Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 635-644.

In a nutshell, it speeds up ALS by avoiding the need for costly least square computation, each dimension (coordinate) is handled separately in parallel.

Documentation of the method for GraphChi is here:

http://bickson.blogspot.co.il/2012/12/collaborative-filtering-with-graphchi.html

Documentation of the method for GraphLab is here:

http://docs.graphlab.org/collaborative_filtering.html

Note: For GraphLab the algorithm is implemented in version 2.2 which will be release this summer. It is still possible to checkout this version and try it out using the mercurial command "hg up v2.2".

Let me know if you try it out!

Monday, May 6, 2013

Speeding up parallel ALS

My collaborator Joey Gonzalez sent me the following paper:
H.-F. Yu, C.-J. Hsieh, S. Si, I. S. Dhillon, Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems. IEEE International Conference on Data Mining(ICDM), December 2012.

Which got the best paper award in ICDM 2012. It is a well written a paper proposing a technique for speeding up ALS (alternating least squares) when executed in parallel, by using coordinate descent. It also has a good review of the two highly repeating building blocks in matrix factorization algorithms: alternating least squares and stochastic gradient descent and a discussion on how to parallelize them.

When I read the paper, it reminded me of previous work of Steffen Rendle, which I shared with my blog readers here:
Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2011. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 635-644.

Basically, it is the same construction. (Unfortunately it seems that ICDM guys where not aware to the previous SIGIR paper).

Here is the algorithm update rule from Steffen's paper:

And here is the update rule from ICDM paper:

Anyway aside from slightly different notations it is basically the same algorithm. And the conclusions of both papers are identical: this constructions improves performance of parallel ALS, and have favorable performance relative to SGD.

I asked Prof. Inderjit Dillon from The University of Texas at Austin about the relation between the papers and I got the following reply:

Solving the one-variable problem is simple, and was not intended to count as a "contribution" (we had similar 1-variable solutions in an earlier paper on coordinate descent for NMF, which appeared in KDD 2011). The main contribution of our ICDM paper is to present a parallel coordinate descent algorithm (on multi-core and distributed memory machines) for matrix factorization for missing value estimation. The results show it outperforms ALS and SGD, which is what most people seem to use in industry (and which is what prompted me to investigate co-ordinate descent for the problem).

To anyone who is interested in reading more, my recommendation is to read first the ICDM paper for getting a general overview, and then read the SIGIR paper for getting more specific.

To those of you who are interested in additional parallel coordinate descent method for L1 loss function, you are welcome to take a look at our ICML paper described here.

Thursday, May 2, 2013

GraphLab Challenge @ SC13

Just learned from my boss Prof. Carlos Guestrin about student cluster competition which is part of SC13 conference. The interesting part is the GraphLab programming is one of the challenges:

• GraphLab(rador)http://graphlab.orgThe GraphLab project started in 2009 to develop a new parallel computation abstraction tailored to machine learning. GraphLab scales to graphs with billions of vertices and edges easily, performing orders of magnitude faster than competing systems. GraphLab combines advances in machine learning algorithms, asynchronous distributed graph computation, prioritized scheduling, and graph placement with optimized low-level system design and efficient data-structures to achieve unmatched performance and scalability in challenging machine learning tasks.
The GraphLab project consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API. The API is built on top of standard cluster and cloud technologies: interprocess communication is accomplished over TCP-IP and MPI is used to launch and manage GraphLab programs. Each GraphLab process is multithreaded to fully utilize the multicore resources available on modern cluster nodes. GraphLab supports reading and writing to both Posix and HDFS filesystems.

We will keep an eye to hear about the outcome of this contest...

Large Scale Machine Learning and Other Animals