Wednesday, July 29, 2015

Scala training in SF

My Israeli colleague Tomer Gabel is giving two full days of Scala training in SF on Aug 11. My blog readers are welcome to use the discount code BOLD200 to get $200 off.

A new graph partitioning algorithm at CIKM

We got the following email from Fabio Petroni, a graduate student at Sapienza University of Rome:

I'm Fabio Petroni, a Ph.D. student in Engineering in Computer Science at Sapienza University of Rome.

Together with other researchers, we recently developed HDRF, a novel stream-based graph partitioning algorithm that provides important improvements in partitioning quality over all existing solutions we are aware of.
In particular, HDRF provides the smallest average replication factor with close to optimal load balance. Together, these two characteristics allow HDRF to significantly reduce the time needed to perform computation on graphs and make it the best choice for partitioning graph data.

A paper describing the HDRF algorithm will be presented at the upcoming CIKM conference (http://www.cikm-2015.org) and is available at the following address (this is the final submitted version): http://www.dis.uniroma1.it/~midlab/articoli/PQDKI15CIKM.pdf

We will work with Fabio to include a version of his algorithm in our latest code base, GraphLab Create.
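
For intuition, here is a minimal Python sketch of HDRF's greedy scoring rule as I understand it from the paper: when streaming an edge (u, v), partitions that already host u or v are rewarded, the reward is discounted for the higher-degree endpoint (so high-degree vertices are the ones that get replicated), and a balance term steers edges toward lightly loaded partitions. The partition count and the lambda/epsilon constants below are assumptions for illustration, not the paper's reference code.

from collections import defaultdict

# Minimal sketch of HDRF's greedy edge placement; the constants and the
# 4-partition setup are assumptions for illustration.
LAMBDA = 1.0    # balance weight
EPSILON = 1.0   # avoids division by zero in the balance term

partial_degree = defaultdict(int)   # vertex -> degree observed so far
replicas = defaultdict(set)         # vertex -> partitions holding a replica
load = [0] * 4                      # edges assigned to each partition

def hdrf_score(u, v, p):
    du, dv = partial_degree[u], partial_degree[v]
    theta_u = du / float(du + dv)   # normalized partial degree of u
    theta_v = 1.0 - theta_u
    # Replication term: reward partitions already hosting u or v, with a
    # smaller reward for the higher-degree (more replicable) endpoint.
    c_rep = 0.0
    if p in replicas[u]:
        c_rep += 1.0 + (1.0 - theta_u)
    if p in replicas[v]:
        c_rep += 1.0 + (1.0 - theta_v)
    # Balance term: favor lightly loaded partitions.
    c_bal = LAMBDA * (max(load) - load[p]) / (EPSILON + max(load) - min(load))
    return c_rep + c_bal

def place_edge(u, v):
    partial_degree[u] += 1
    partial_degree[v] += 1
    best = max(range(len(load)), key=lambda p: hdrf_score(u, v, p))
    replicas[u].add(best)
    replicas[v].add(best)
    load[best] += 1
    return best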

Tuesday, July 28, 2015

Some exciting developments at Dato

You may have missed our latest Dato blog post, so I wanted to shed light on two of the coolest recently released features:

It's particularly exciting to mention that GraphLab Create's integration with Numpy will effectively scale scikit-learn. Now with GraphLab Create and Dato Predictive Services, you can deploy existing scikit-learn models at scale as a RESTful predictive service by changing only a few lines of code. Very cool.
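
As a flavor of the pattern (not the exact API; the deploy calls and paths below are assumptions for illustration), training stays plain scikit-learn and deployment adds only a couple of calls:

# Sketch: train an ordinary scikit-learn model, then register it with a
# predictive service. The gl.deploy.* calls and the S3 path are assumed.
import graphlab as gl
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train locally with plain scikit-learn.
iris = load_iris()
clf = LogisticRegression().fit(iris.data, iris.target)

# Hypothetical deployment: register the trained model under an endpoint
# name and push the change to the running service.
ps = gl.deploy.predictive_service.load('s3://my-bucket/my-service')  # assumed path
ps.add('iris-classifier', clf)
ps.apply_changes()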



Dato Distributed now with distributed machine learning

import graphlab as gl

# job distribution environments
# s = gl.deploy.spark_cluster.load('hdfs://…')
# h = gl.deploy.hadoop_cluster.load('hdfs://…')
e = gl.deploy.ec2_cluster.load('s3://…')

# set distribution environment to my AWS cluster
gl.set_distributed_execution_environment(e)
Dato Distributed enables GraphLab Create users to execute parallel computation of Python code tasks on EC2, Spark, or Hadoop clusters. The snippet above shows how GraphLab Create can switch between these environments by changing one line of code. In GraphLab Create 1.5.1, Dato Distributed on Hadoop now seamlessly supports distributed execution of machine learning models including logistic regression, linear regression, SVM classifier, label propagation, and PageRank. Distributed machine learning on EC2 and Spark is in the works.
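
For example, once the EC2 environment from the snippet above is set, a model training call is written exactly like local code (the S3 paths and target column below are assumptions for illustration):

# Sketch: with the distributed execution environment set, training looks
# the same as local code. Paths and the 'label' column are assumed.
import graphlab as gl

e = gl.deploy.ec2_cluster.load('s3://my-bucket/my-cluster')   # assumed path
gl.set_distributed_execution_environment(e)

# Load training data and fit a logistic regression model; execution is
# distributed across the configured cluster.
data = gl.SFrame('s3://my-bucket/training-data')              # assumed path
model = gl.logistic_classifier.create(data, target='label')
predictions = model.predict(data)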


Sunday, July 5, 2015