Thursday, December 29, 2011

Multicore parser - part 2 - parallel perl tutorial

In the first part of this post, I described how to program a multicore parser, where the task is to translate string IDs into consecutive integers that will be used for formatting the input of many machine learning algorithms. The output of part 1 is a map between strings and unsigned ints. The map is built in a single pass over the entire dataset.

Now an additional task remains: translating the records (in my case, phone call records) into a graph to be used in GraphLab. This is an embarrassingly parallel task - since the map is read-only, multiple threads can read it in parallel and translate the record names into graph edges. For example, the following records:
YXVaVQJfYZp BqFnHyiRwam 050803 235959 28
YXVaVQJfYZp BZurhasRwat 050803 235959 6
BqFnHyiRwam jdJBsGbXUwu 050803 235959 242
are translated into undirected edges:
1 2 
1 3
2 4
etc. etc.
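As an illustration only (the real translation is done in C++ inside GraphLab), the mapping can be sketched in Python; the function name `records_to_edges` is mine:

```python
def records_to_edges(records, id_map=None):
    """Translate caller/receiver string IDs into consecutive integer
    node IDs (starting at 1) and emit one edge per record."""
    if id_map is None:
        id_map = {}
    edges = []
    for line in records:
        caller, receiver = line.split()[:2]
        for name in (caller, receiver):
            if name not in id_map:
                id_map[name] = len(id_map) + 1  # consecutive IDs, 1-based
        edges.append((id_map[caller], id_map[receiver]))
    return edges
```

Feeding it the three sample records above yields the edge list (1, 2), (1, 3), (2, 4).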
The code is part of GraphLab v2 and can be downloaded from our download page.

In the current post, I will quickly explain how to continue setting up the parser.
The task now is to merge multiple phone calls between the same pair into a single edge. It is also useful to sort the edges by their node IDs. I chose to program this part in Perl since, as you are about to see, it makes the task very easy.

INPUT: Gzipped files with phone call records, where each row has two columns: the caller and the receiver, each of them an unsigned integer. The same row may repeat multiple times in a file (in case multiple phone calls between the same pair of people were logged at different times).
OUTPUT: A gzipped output file with sorted unique phone call records. Each unique caller receiver pair will appear only once.

Tutorial - how to execute a parallel task in Perl.
1) Download and extract Parallel Fork Manager
wget http://search.cpan.org/CPAN/authors/id/D/DL/DLUX/Parallel-ForkManager-0.7.5.tar.gz
tar xvzf Parallel-ForkManager-0.7.5.tar.gz
mv Parallel-ForkManager-0.7.5 Parallel

2) Create a file named parser.pl with the following lines in it:
#!/usr/bin/perl -w
use Parallel::ForkManager;

my $THREADS_NUM = 8;
my $pm = new Parallel::ForkManager($THREADS_NUM);

# note: run this script from inside the data directory,
# since readdir() returns bare file names without the path
opendir(DIR, "your/path/to/files");
my @FILES = grep { /\.gz$/ } readdir(DIR); # skip ".", ".." and non-gzipped files
closedir(DIR);

foreach my $file (@FILES) {

  # Forks and returns the pid for the child:
  my $pid = $pm->start and next;

  print "working on file " . $file . "\n";
  system("gunzip -c $file | sort -u -T . -n -k 1,1 -k 2,2 -S 4G | gzip > sorted_$file");

  $pm->finish; # Terminates the child process
}
$pm->wait_all_children; # wait for all the sorts to complete
Explanation
1) ForkManager($THREADS_NUM) sets the number of parallel worker processes (forked children) - in this case 8.
2) For each file, the file is unzipped using "gunzip -c" and sorted uniquely (the -u flag). The -T flag is an optional argument in case your temp drive does not have enough space. -k 1,1 sets the sorting key to be the first column, and -k 2,2 sets column 2 as a secondary key in case of a tie on the first column. The -S flag sets the buffer size so that the full input file fits into memory; 4G is 4 gigabytes of memory.
3) The system() command runs any shell command line, so you can change the parallel loop execution to perform your own task easily.
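To make the intended output concrete, here is a rough Python equivalent of what each sort invocation produces (the function name is mine; this is a sketch, not what the script actually runs):

```python
def unique_sorted_pairs(lines):
    """Deduplicate caller/receiver pairs and sort numerically by the
    first column, breaking ties on the second - roughly the semantics
    of 'sort -u -n -k 1,1 -k 2,2' applied to the unzipped records."""
    pairs = {tuple(map(int, line.split()[:2])) for line in lines}
    return sorted(pairs)
```

Note that the numeric sort places "10 3" after "2 1", which a plain lexicographic sort would not.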

Overall, in a few lines of code we got a parallel execution environment which would be much harder to set up otherwise.

Wednesday, December 28, 2011

NSF grant awarded to GraphLab

I am glad to announce that, with the great help of Joel Welling from the Pittsburgh Supercomputing Center, the GraphLab project was awarded an additional NSF grant. We have obtained an additional 200,000 CPU hours on both the BlackLight and Kraken NSF supercomputers. The grant was obtained via XSEDE (Xtreme Science and Engineering Discovery Environment). We are going to use those hours for scaling up GraphLab to systems with thousands of cores.

Sunday, December 25, 2011

How to write a multicore parser

When dealing with machine learning, one usually ignores the (usually boring!) task of preparing the data to be used in any of the machine learning algorithms. Most of the algorithms have either linear algebra or statistical foundation and thus the data has to be converted to a numeric form.

For the last couple of weeks I have been working on the efficient design of a multicore parser that converts raw string data into a format usable by many machine learning algorithms. Specifically, I am using CDRs (call data records) from a large European country. However, the dataset has several typical properties, so I believe my experience is useful for other domains as well.

The raw CDR data I am using looks like this:
YXVaVQJfYZp BqFnHyiRwam 050803 235959 28
xtGBWvpYgYK jdJBsGbXUwu 050803 235959 242
ZpYQLFFKyTa atslZokWZRL 050803 235959 504
WMjbYygLglR BqFnCfuNgio 050803 235959 51
hcLiEoumskU RcNSJSEmidT 050803 235959 7
qBvUQlMABPv atslBvPNusB 050803 235959 3609
jdSqVjxPlBn BqFnHyiRwam 050803 235959 23
VCWyivVLzRr atslSbOOWXz 050803 235959 8411
PnLaFqLJrEV atslZokWZRL 050803 235959 8806
PnLaCsEnqei atslBvPNusB 050803 235959 590
The first column is an anonymized caller ID, the second column is an anonymized receiver ID, the third column is the date, the fourth is the time, and the last column is the duration of the call.

Now to the data magnitude. If your dataset is small, there is no need for any fancy parsing - you can write a python/perl/matlab script to convert it to numeric form and stop reading here... However, this dataset is rather big: every day there are about 300M unique phone calls. So depending on how many days you aggregate together, you can reach quite a large magnitude. For a month, about 9 billion phone calls are logged.

To make the CDR useful, we need to convert the hashed string ID into a number, hopefully a consecutive increasing number. That way we can express the phone call information as a matrix.
Then we can use any of the fundamental machine learning algorithms like: SVM, Lasso, Sparse logistic regression, matrix factorization, etc. etc.

One possible approach for converting strings to integers is taken in Vowpal Wabbit, where strings are hashed into numeric IDs on the fly. However, there is a risk that two different string IDs will be mapped to the same integer; depending on the application, this may or may not be acceptable. I have chosen a different approach - simply assigning a growing consecutive ID to each string.
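To illustrate the trade-off, here is a small sketch (not VW's actual code; the hash function, the bit width, and both function names are my choices):

```python
import zlib

def hashed_id(name, bits=18):
    """VW-style: hash the string directly into a fixed-size ID space.
    Stateless and fast, but two different strings can collide."""
    return zlib.crc32(name.encode()) & ((1 << bits) - 1)

def consecutive_id(name, id_map={}):
    """My approach: assign a growing consecutive ID per unique string.
    Collision-free, but the whole map must be kept in memory.
    (The mutable default dict keeps state across calls - a sketch-only trick.)"""
    if name not in id_map:
        id_map[name] = len(id_map) + 1
    return id_map[name]
```

With more distinct strings than hash buckets, a collision in `hashed_id` is guaranteed by the pigeonhole principle, whereas `consecutive_id` never collides.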

I have implemented the code in GraphLab, although GraphLab is not the most intuitive tool for this task (it was convenient to use, though). In a multicore machine, several GraphLab threads run in parallel and parse different portions of the input files concurrently. We have to be careful that node IDs remain consecutive across the different files. Since stl/boost data structures are typically not thread safe, I had to use a mutex to guard against concurrent insertions to the map. (Concurrent reads from an stl/boost map are perfectly fine.)

//globals (sketch): boost::unordered_map<std::string,uint> hash2nodeid;
//                  mutex mymutex; uint conseq_id = 0;
void assign_id(uint & outval, const string &name){

  //find if the string is already in the map.
  //this part is thread safe since concurrent find() calls are safe
  boost::unordered_map<string,uint>::const_iterator it = hash2nodeid.find(name);
  if (it != hash2nodeid.end()){
     outval = it->second;
     return;
  }

  //if not, we need to insert it into the map.
  //now we must lock, since operator[] is not thread safe
  //(it inserts a default value of 0 when the key is missing)
  mymutex.lock();
  outval = hash2nodeid[name];
  if (outval == 0){ //re-check: another thread may have inserted it meanwhile
      hash2nodeid[name] = ++conseq_id; //IDs start at 1; 0 means "unassigned"
      outval = conseq_id;
  }
  mymutex.unlock();
}


One should be careful here: as I verified using the gprof profiler, about 95% of the running time is spent in this critical section of assigning strings to ints.

Initially I used std::map<string,int>, but I found it to be rather slow. std::map is implemented using an underlying balanced tree, so insertions cost O(log N). I switched to boost::unordered_map, which is a hash table implementation with O(1) expected insertions. This gave a 2x speedup in runtime.

Second, since each day of input amounts to about 5GB of gzipped files, I used a boost gzipped stream to avoid the intermediate extraction of the input files. Here is an example:
char linebuf[128];
    std::ifstream in_file(filename.c_str(), std::ios::binary);
    boost::iostreams::filtering_stream<boost::iostreams::input> fin;
    fin.push(boost::iostreams::gzip_decompressor());
    fin.push(in_file);

    while(true){
      fin.getline(linebuf, 128);
      if (fin.eof())
        break;
      //parse the line
    }
    fin.pop();
    fin.pop();
    in_file.close();
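The same pattern can be sketched in Python for readers who prefer a scripting version; the stdlib gzip module stands in for boost::iostreams, and the function name is mine:

```python
import gzip

def read_gzipped_lines(filename):
    """Stream a gzipped log line by line, splitting each row into
    fields, without extracting the file to disk first."""
    with gzip.open(filename, "rt") as fin:
        for line in fin:
            yield line.rstrip("\n").split()
```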
Overall, I believe the result is quite efficient: parsing 115GB of compressed CDR data (a total of 6.4 billion phone calls) takes 75 minutes on a 4-core machine (Quad core AMD Opteron 2.6Ghz). There were about 182M unique IDs assigned, and a total of 12.8 billion map lookups (about 3M lookups a second).

Some performance graph:


Summary of lessons learned:

  1. C parsing is way more efficient than perl/python/matlab.
  2. Opening gzipped files is a waste of time and space - better to work directly on the gzipped version.
  3. Parallel parsing has a good speedup up to a few (3) cores. More cores do not improve performance (due to heavy IO).
  4. Better to use a hash table than a sorted tree: boost::unordered_map is twice as fast as std::map.

Tuesday, December 20, 2011

PETSc and SLEPc parallel scientific computing packages

A few days ago I asked a physicist friend of mine, Doron Naveh from the Nano Dept. at CMU, which is the best MPI-based package for solving eigenvalue problems. He recommended PETSc.

Here is a quick explanation on how to setup PETSc, SLEPc and run a simple SVD solver.

From PETSc webpage:

PETSc, pronounced PET-see (the S is silent), is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It supports MPI, shared memory pthreads, and NVIDIA GPUs, as well as hybrid MPI-shared memory pthreads or MPI-GPU parallelism.

From Slepc homepage:
SLEPc is a software library for the solution of large scale sparse eigenvalue problems on parallel computers. It is an extension of PETSc and can be used for either standard or generalized eigenproblems, with real or complex arithmetic. It can also be used for computing a partial SVD of a large, sparse, rectangular matrix, and to solve quadratic eigenvalue problems.


Here are quick installation instructions:

Download the latest version
For example:
wget http://ftp.mcs.anl.gov/pub/petsc/release-snapshots/petsc-3.2-p5.tar.gz
tar xvzf petsc-3.2-p5.tar.gz
cd petsc-3.2-p5
./configure --with-cc=gcc --with-fc=gfortran --download-f-blas-lapack=1 --download-mpich=1
make all test

Tip: you may want to add --with-precision=single to work with floats instead of doubles (to save memory).
Tip: you may want to use --download-openmpi=1 instead.

Download SLEPc
wget http://www.grycap.upv.es/slepc/download/distrib/slepc-3.2-p3.tar.gz
tar xvzf slepc-3.2-p3.tar.gz
setenv PETSC_DIR /mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5
setenv SLEPC_DIR ./slepc-3.2-p3
setenv PETSC_ARCH arch-linux2-c-debug

cd $SLEPC_DIR
./configure
make SLEPC_DIR=$PWD PETSC_DIR=/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5 PETSC_ARCH=arch-linux2-c-debug
make test

Example for running SVD.
create a directory named svd inside petsc-3.2-p5:
mkdir svd
cd svd

Get the file ex14.c from: http://www.grycap.upv.es/slepc/handson/handson4.html
wget http://www.grycap.upv.es/slepc/documentation/current/src/svd/examples/tutorials/ex14.c.html

Create a file named Makefile inside the svd directory with the following content:
SLEPC_LIB=../arch-linux2-c-debug/lib/
SLEPC_INC=../include/
SLEPC_ARCH_INC=../arch-linux2-c-debug/include/

PETSC_INC=../../petsc-3.2-p5/include/
PETSC_ARCH_INC=../../petsc-3.2-p5/arch-linux2-c-debug/include/
PETSC_LIB=../../petsc-3.2-p5/arch-linux2-c-debug/lib/

MPI_LIB=../../petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/lib

ex14:
        mpicc -o ex14 ex14.c -L${SLEPC_LIB} -L${PETSC_LIB} -L${MPI_LIB} -I${PETSC_INC} -I${SLEPC_INC} -I${SLEPC_ARCH_INC} -I${PETSC_ARCH_INC} -lslepc -lpetsc -lmpich -lm -lflapack -lfblas -lfmpich -lX11 -lmpl -lpthread -lrt -lgfortran
Note: before mpicc it should be a tab (and not spaces).

In Matlab/Octave, prepare a PETSc binary input file as follows:

>> addpath ../../petsc-3.2-p5/bin/matlab/
>> A=rand(10,10); 
>> PetscBinaryWrite('A',sparse(A));

Now run the example:
<25|0>bickson@biggerbro:~/petsc/slepc-3.2-p3/svd_example$ ./ex14 -file A -svd_nsv 10

Singular value problem stored in file.

 Reading REAL matrix from a binary file...
 Number of iterations of the method: 1
 Solution method: cross

 Number of requested singular values: 1
 Stopping condition: tol=1e-08, maxit=100
 Number of converged approximate singular triplets: 10

          sigma            relative error
   --------------------- ------------------
        5.389506           6.40913e-16
        1.680646            6.9377e-16
        1.364558             1.005e-15
        1.102023           9.88382e-16
        0.825067           2.23038e-15
        0.756176           1.99212e-15
        0.498011            2.9545e-15
        0.335541           1.51681e-15
        0.187449             4.207e-14
        0.124114           3.88213e-14

Useful command line options:
-help             #display help
-info             #verbose mode
-svd_type         #select type of SVD method. Legal types are: cross,cyclic,lanczos,trlanczos,lapack
-svd_view         #display more info about the solver
-svd_nsv          #set the number of singular values requested
-svd_max_it       # maximal number of iterations

Running in parallel:
mpirun -np 8 slepc_program [command line arguments]


Troubleshooting
Problem:
../arch-linux2-c-debug/lib//libslepc.a(veccomp.c.o): In function `SlepcSumNorm2_Local':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/veccomp0.h:173: undefined reference to `MPI_Abort'
../arch-linux2-c-debug/lib//libslepc.a(veccomp.c.o): In function `VecNormCompEnd':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/veccomp0.h:185: undefined reference to `MPI_Type_free'
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/veccomp0.h:186: undefined reference to `MPI_Type_free'
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/veccomp0.h:187: undefined reference to `MPI_Op_free'

Solution: You should link against MPI; for example, add -lmpich to the Makefile command.

Problem:
../arch-linux2-c-debug/lib//libslepc.a(veccomp.c.o): In function `GetNorm2':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/veccomp0.h:137: undefined reference to `sqrt'
Solution: You should link to math lib, namely add -lm to the Makefile command line.

Problem:
../arch-linux2-c-debug/lib//libslepc.a(contiguous.c.o): In function `SlepcUpdateVectors_Noncontiguous_Inplace':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/contiguous.c:186: undefined reference to `dgemm_'
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/contiguous.c:199: undefined reference to `dgemm_'
../arch-linux2-c-debug/lib//libslepc.a(contiguous.c.o): In function `SlepcUpdateStrideVectors':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/contiguous.c:391: undefined reference to `dgemm_'
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/contiguous.c:401: undefined reference to `dgemm_'
../arch-linux2-c-debug/lib//libslepc.a(contiguous.c.o): In function `SlepcVecMAXPBY':
/mnt/bigbrofs/usr6/bickson/petsc/slepc-3.2-p3/src/vec/contiguous.c:475: undefined reference to `dgemv_'
Solution: You should link against blas and lapack, namely add -lfblas -lflapack to your Makefile command

Problem:
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libpetsc.a(xops.c.o): In function `PetscDrawLine_X':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:29: undefined reference to `XSetForeground'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:33: undefined reference to `XDrawLine'
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libpetsc.a(xops.c.o): In function `PetscDrawArrow_X':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:45: undefined reference to `XSetForeground'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:49: undefined reference to `XDrawLine'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:52: undefined reference to `XDrawLine'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:53: undefined reference to `XDrawLine'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:55: undefined reference to `XDrawLine'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/src/sys/draw/impls/x/xops.c:56: undefined reference to `XDrawLine'
Solution: Add -lX11 to the makefile compilation command line

Problem:
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libmpich.a(init.o): In function `PMPI_Init':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/mpi/init/init.c:106: undefined reference to `MPL_env2str'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/mpi/init/init.c:132: undefined reference to `MPL_env2bool'

Solution: Add -lmpl to compilation command line

Problem:
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libmpich.a(mpiu_thread.o): In function `MPIU_Thread_create':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/util/thread/mpiu_thread.c:66: undefined reference to `pthread_create'

Solution: add -lpthread to compilation command line

Problem:
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libmpich.a(ad_iwrite.o): In function `ADIOI_GEN_aio_wait_fn':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/mpi/romio/adio/common/ad_iwrite.c:266: undefined reference to `aio_suspend64'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/mpi/romio/adio/common/ad_iwrite.c:276: undefined reference to `aio_error64'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/mpich2-1.4.1p1/src/mpi/romio/adio/common/ad_iwrite.c:278: undefined reference to `aio_return64'
Solution: use mpicc as the compiler instead of gcc

Problem:
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libflapack.a(dgesvd.o): In function `dgesvd':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dgesvd.f:215: undefined reference to `_gfortran_concat_string'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dgesvd.f:379: undefined reference to `_gfortran_concat_string'
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libflapack.a(dhseqr.o): In function `dhseqr':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dhseqr.f:345: undefined reference to `_gfortran_concat_string'
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libflapack.a(dlasd0.o): In function `dlasd0':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dlasd0.f:200: undefined reference to `_gfortran_pow_i4_i4'
../../petsc-3.2-p5/arch-linux2-c-debug/lib//libflapack.a(dlasda.o): In function `dlasda':
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dlasda.f:330: undefined reference to `_gfortran_pow_i4_i4'
/mnt/bigbrofs/usr6/bickson/petsc/petsc-3.2-p5/externalpackages/fblaslapack-3.1.1/lapack/dlasda.f:341: undefined reference to `_gfortran_pow_i4_i4'
Solution: Add -lgfortran to the compiler command line

Problem:
[0]PETSC ERROR: --------------------- Error Message ------------------------------------
[0]PETSC ERROR: Must indicate a file name with the -file option.!
[0]PETSC ERROR: ------------------------------------------------------------------------
Solution: You did not give an input file using the -file command line option.

Problem:
13|0>bickson@bigbro6:~/petsc/slepc-3.2-p3$ ./configure
Checking environment...
ERROR: PETSc is not configured for architecture arch-installed-petsc
Solution: You forgot to define PETSC_ARCH. Define it using: setenv PETSC_ARCH arch-linux2-c-debug

Friday, December 16, 2011

News from NIPS - Divide and Conquer Matrix Factorization and efficient K-means

Here are some interesting papers that caught my eyes at NIPS. While being very simple I believe they can be very effective in practice.

"Divide and Conquer Matrix Factorization" is a joint work by Lester Mackey, Ameet Talwalkar and Michael Jordan from Berkeley. The idea is pretty simple. Assume you have a large matrix you can not fit into memory you would like to factorize to two low rank matrices. There are 3 steps in the construction:

  1. The large matrix is first partitioned into several small matrices by slicing the original matrix to column groups. 
  2. Each smaller matrix is solved in parallel using a cluster of computers, using any desired matrix factorization algorithm. 
  3. Now it remains how to aggregate the result back for getting the original factorized solutions of the larger matrix. This can be done by column projection, namely computing the least square solution of the smaller matrix onto the larger one. (See details in the paper!) and averaging the different solutions together.
It seems this construction has a pretty decent speedup. I am considering implementing this method in GraphLab as well. Ping me if you are interested in using this algorithm.
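A rough numpy sketch of the three steps, under my own simplifications (truncated SVD as the per-slice factorizer and projection onto the first slice's column space; see the paper for the actual DFC estimators and the averaging variant):

```python
import numpy as np

def dfc_proj(A, k, n_slices):
    """Divide-and-conquer rank-k factorization sketch."""
    # 1) partition the columns into slices
    slices = np.array_split(A, n_slices, axis=1)
    # 2) factor each slice independently (in the paper this runs in
    #    parallel on a cluster; any base factorizer works)
    factored = []
    for S in slices:
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        factored.append((U[:, :k] * s[:k], Vt[:k]))
    # 3) column projection: least-squares-project every slice's estimate
    #    onto the column space of the first slice's estimate, then
    #    concatenate the slices back together
    Q, _ = np.linalg.qr(factored[0][0])
    return np.hstack([Q @ (Q.T @ (U @ Vt)) for U, Vt in factored])
```

On an exactly rank-k matrix this sketch recovers the input, since each slice's truncated SVD is exact and the slices share one column space.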


The second paper, Fast and Accurate K-means for Large Datasets, is a method for speeding up k-means computation by Michael Shindler, Alex Wong, and Adam Meyerson. It got the best student paper award. The idea is pretty simple and actually reminds me of power iteration clustering.

The observation at the basis of this construction is that one of the heavy computational burdens in k-means is computing distances from each data point to all K cluster centers. A very simple trick is proposed: a random vector is selected, and the data point as well as all cluster centers are projected onto it. Now we have one scalar per cluster, and it remains to find which is closest to the projected data point. This significantly saves work, since instead of computing D-dimensional distances we compute only 1-dimensional distances. It seems that in practice the accuracy of the method is not significantly worse.
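A toy sketch of the projection trick (the function name, the Gaussian random direction, and the fixed seed are my choices; the result is approximate by design):

```python
import random

def nearest_center_1d(point, centers, rng=random.Random(42)):
    """Approximate nearest-center search: project the point and all
    cluster centers onto one random direction, then compare cheap 1-D
    distances instead of full D-dimensional ones."""
    direction = [rng.gauss(0, 1) for _ in point]
    proj = lambda v: sum(a * b for a, b in zip(v, direction))
    p = proj(point)
    return min(range(len(centers)), key=lambda j: abs(proj(centers[j]) - p))
```

Each call costs O(D) for the projections plus O(K) scalar comparisons, instead of O(K*D) for exact distances.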

Monday, December 12, 2011

MPI vs. Hadoop

Here is an interesting blog post from John Langford (Yahoo! Research NY) about some of the pros and cons of MPI vs. Hadoop. Different approaches to ML are discussed and summarized by John as follows:
The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within it’s scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:
  • For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
  • For smaller datasets with computationally intense models, GPU approaches seem very compelling.
  • For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug. The model needs to fit into memory. If that’s not the case, then other approaches are required.

Anyone who reads this blog and is attending the NIPS big ML workshop is encouraged to contact me, since I plan to be there! Note that the asynchronous approach reference (bullet 3) actually refers to our GraphLab project. I liked the phrasing: "it's scary in practice..." But we also support BSP execution, so it is always possible to debug that way.

Wednesday, December 7, 2011

Preview of GraphLab v2 new features!

GraphLab v2's new features are going to be presented at Intel's ISTC-CC retreat. The new features are the result of the hard work of Joey Gonzalez, who spent the summer in Alex Smola's lab as part of our collaboration with Yahoo! Research.

The main challenge identified by Joey in GraphLab v1 is the handling of natural (power-law) graphs.

The problem with iterative graph algorithms on such graphs is that when computing the update of a high-degree node, one has to touch a large portion of the graph, and thus it is hard to distribute the computation. Furthermore, the computation of a high-degree node is done sequentially, so we lose some of the parallelism. Last, whenever an algorithm uses locking of neighbors' state (to avoid concurrent writes to their data structures), a large number of locks has to be acquired, which significantly degrades performance.


The first solution is factorized update functions (shown above). The work on a high-degree node is divided into smaller pieces of work which can be done in parallel.
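As a toy sketch of the idea (my own simplification: the "update" is just a sum over neighbor values, and the function name is mine):

```python
from concurrent.futures import ThreadPoolExecutor

def factorized_sum(neighbor_values, n_pieces=4):
    """Split the aggregation over a high-degree node's neighbors into
    independent pieces, aggregate each piece in parallel, then combine
    the partial results."""
    chunks = [neighbor_values[i::n_pieces] for i in range(n_pieces)]
    with ThreadPoolExecutor(max_workers=n_pieces) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)
```

The combine step works because the aggregation is associative; the same pattern applies to max, products, or any other reduce-style update.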

The second solution is associative/commutative update functions, a.k.a. delta updates (shown below). Whenever sums over a large number of nodes have to be computed, and the state of most of them is constant, there is not much point in recomputing the sum again and again. In this case, only the change in the result is tracked.
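A minimal sketch of a delta update (the class name is mine):

```python
class DeltaSum:
    """Cache the sum over a node's neighbors; when one neighbor changes,
    apply only the delta instead of re-summing all of them."""
    def __init__(self, values):
        self.values = list(values)
        self.total = sum(self.values)  # full sum computed once
    def update(self, i, new_value):
        self.total += new_value - self.values[i]  # O(1) delta, not O(degree)
        self.values[i] = new_value
        return self.total
```

Each neighbor change now costs O(1) instead of a full O(degree) re-aggregation, which is exactly where high-degree nodes hurt the most.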


Some performance results on Yahoo! webgraph (25.5M nodes, 355M edges):

The GraphLab v2 experimental version can be downloaded from our mercurial code base (using the command hg update -v2). We plan a full release in a couple of months.

Tuesday, December 6, 2011

GraphLab debugging with Eclipse

Here are fresh instructions I got from Young Cha, a graduate student at UCLA, on
how to set up GraphLab to work with Eclipse.

1. Installing CDT for Eclipse
(The instructions below use Eclipse's plug-in install function. There are other ways of installing CDT for Eclipse. Please refer to http://eclipse.org/cdt/ for more information)
  • In the menu bar, select 'Help' -> 'Install New Software..'
  • Click 'Add'
  • Type 'CDT' in the Name field and http://download.eclipse.org/tools/cdt/releases/galileo in the Location field (if you use a different version of Eclipse, please replace 'galileo' with your version: 'helios' or 'indigo')
  • Check 'CDT Main Features' and 'CDT Optional Features' (You may choose only the modules you need. If you don't know about it, please check all)
  • Click 'Next' and install CDT
2. Using CMake's CDT generator to convert GraphLab into an Eclipse CDT project
  • Let's assume that your graphlabapi directory is located in ~/graphlab/graphlabapi and you want to create CDT project at ~/graphlab/cdt_build
    • cd ~/graphlab
    • mkdir cdt_build
    • cd cdt_build
    • cmake -G"Eclipse CDT4 - Unix Makefiles" -D CMAKE_BUILD_TYPE=Debug ../graphlabapi
  • Please refer to http://www.cmake.org/Wiki/Eclipse_CDT4_Generator for more information
3. Importing GraphLab to Eclipse CDT
  • In the menu bar, select 'File' -> 'Import..'
  • Select 'General' -> 'Existing Projects into Workspace'
  • Click 'Next'
  • Click 'Browse' and select the 'cdt_build' directory you created previously
  • Please refer to http://www.cmake.org/Wiki/Eclipse_CDT4_Generator for more information
Thanks so much Young! Your instructions are going to be very helpful for our other users!

Friday, December 2, 2011

SpotLight: Michael Ekstrand and the LensKit Project

As part of our new collaboration with LensKit, where the GraphLab collaborative filtering library will be used as one of LensKit's engines, here is an overview of this interesting project from Michael Ekstrand, a PhD student at GroupLens Research, Univ. of Minnesota.

- What is the goal of the LensKit project?

The goal of LensKit is to provide a flexible, extensible platform for researching, studying, and deploying recommender systems. We are primarily concerned with providing a framework for building and integrating recommenders, high-quality implementations of canonical algorithms which perform well on research-scale data sets, and tools for running reproducible offline evaluations of algorithms.

- What other relevant projects exist in the GroupLens lab?

We have been building recommenders internally for quite some time, starting with the GroupLens recommender system, followed by MovieLens and later recommendation efforts (such as various research paper recommenders which went under the name TechLens, and our new book recommender service BookLens). Several years ago, the MultiLens project made some of our recommender code available for use outside the lab.
LensKit is a brand-new, from-scratch implementation of core recommender algorithms that we will be using internally going forward. BookLens is currently built on top of it, and we plan to move MovieLens from its current internal recommender code, related to the MultiLens code, to LensKit sometime in the coming months. A number of LensKit's design decisions have been driven by the needs of the BookLens project, as we have been making it suitable for both offline runs and integration into web applications. Future web-based recommender systems, both within GroupLens and externally, will be able to pick up LensKit and integrate it very easily with the complex needs of web server environments.

- Who is working on the project and for how long?

I started the project about 2 years ago, working on it off and on. Development picked up substantially in late 2010-early 2011, and we had our first public release early this year. Michael Ludwig has been involved with the project for most of that time, helping particularly with design and requirements work as he integrates it with the BookLens code, and also contributing code directly. Jack Kolb is an undergraduate who has been working with me since late spring, and we have had other students helping from time to time as well.

- What is the status of development?

Right now we are in late beta, with stable APIs for common recommender tasks, a robust infrastructure, and good implementations of several classic algorithms. We are currently working on documentation and a refactoring of our recommender configuration infrastructure; once that is completed and tested, we should be ready to declare a stable 1.0 release to build on going forward. It is pretty safe to build against LensKit at this point, though; the main interfaces are stable, and the APIs for configuration shouldn't change too much. Some code might need to be updated for 1.0, but it should be limited to the code to configure the recommender.

- Which open source license are you using?

LensKit is licensed under the GNU GPL version 2 or later with a link exception to allow linking between it and modules under other licenses (whether proprietary or GPL-incompatible open source). LensKit can be used and deployed internally without restriction; projects distributing it externally must make its source code available to their users. The link exception is the same as that used by the GNU Classpath project.
Many of the libraries we depend on are licensed under the Apache license (APLv2).

- What are the interesting algorithms currently implemented?

We provide implementations of user-user and item-item collaborative filtering (with similarity matrix pre-computation and truncation in item-item), Funk's regularized gradient descent SVD (see also Paterek paper in KDD 2007), and Slope-One recommenders.

- What is the level of parallelism you allow (is there support for parallel execution?)

Currently, none of the algorithms are parallelized. The evaluator is capable of training and evaluating multiple algorithms in parallel, even sharing some of the major data structures like the rating snapshot between runs, to achieve good throughput on comparative evaluations. Parallelizing some of them is on our radar, though.

- Do you have some performance numbers on Netflix/KDD data or similar datasets?

We provide accuracy data from the MovieLens data sets and Yahoo! Music data sets in our RecSys 2011 paper (http://grouplens.org/node/479). Efficiency numbers are a bit rough, since we've been running parallel builds, but we can train an item-item model on the MovieLens 10M data set in about 15 minutes, and on one shard of the Y! Music data set (from WebScope, each shard has ~80M ratings) in about 20-26 hours. FunkSVD takes 30-50 minutes, depending on the model size and parameters, on ML10M and 14 hours on Y!M.

- Regarding the potential collaboration with GraphLab. What may be the benefits of this collaboration?

We are looking to integrate GraphLab to allow LensKit users to have easy access to high-quality implementations of a wider variety of matrix factorization methods in particular, and to leverage your existing work rather than re-building everything ourselves. It will also make it easy to integrate GraphLab-based algorithms with interesting recommender environments, such as web applications or comparative evaluation setups.

Overall, here at the GraphLab project we are very excited about this collaboration. We believe that LensKit has a few properties that are currently missing in GraphLab: namely output processing for finding the best recommendations, as well as improved user interaction and a better UI.

Thursday, December 1, 2011

Linear stable models

I just heard from Alex Ihler, UC Irvine, that he participated in early November in a workshop called "Counting, Inference, and Optimization on Graphs" at Princeton University, NJ. Yongyi Mao from the University of Ottawa gave a nice talk titled "Normal Factor Graphs, Linear Algebra and Probabilistic Modeling".

In my NIPS 2010 paper, I (and my advisor Carlos Guestrin) were the first to show how to compute inference in linear models involving heavy-tailed stable distributions. This was the first closed-form solution to this problem. The trick was to use duality and compute inference in the Fourier (characteristic function) domain. My work is heavily based on Mao's work, since he was the first to formulate this duality in a general linear model. Here are his related papers:
Y. Mao and F. R. Kschischang. On factor graphs and the Fourier transform. In IEEE Trans. Inform. Theory, volume 51, pages 1635–1649, August 2005.
Y. Mao, F. R. Kschischang, and B. J. Frey. Convolutional factor graphs as probabilistic models. In UAI ’04: Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 374–381, Arlington, Virginia, United States, 2004. AUAI Press.

It seems that Mao was quite frustrated that his work seemed merely theoretical at the time, since it was hard for him to find an application for his construction. So he was delighted to see that I have actually used his construction in a very useful application. That is why he titled my work "a missed opportunity": he is a bit upset that he did not think of it at the time... But anyway, this is science - small progress at each step.

Power iteration clustering

I attended today a very interesting talk given by Prof. William Cohen from CMU (given at HUJI). The talk describes a very simple spectral clustering method called power iteration clustering. A paper of the same name by Frank Lin and William Cohen appeared at ICML 2010.

What I like about the method is its simplicity and applicability to really large datasets. While some of the HUJI audience regarded the presentation quite skeptically, since no theoretical results were presented to justify the great empirical performance, it seems that empirically the power iteration clustering method works just like spectral clustering. I guess that theoretical results will soon follow. And let me tell you a secret: some of the most successful methods deployed today in practice are simple methods like linear and (sparse) logistic regression, matrix factorization, K-means clustering, etc.

In traditional spectral clustering, a square matrix describing the problem is constructed (typically the matrix is normalized s.t. the sum of each row equals one). For example, if the data is a social network then this matrix is the adjacency matrix. Then the K top eigenvectors are computed. When the data is big, we can view this eigendecomposition as a dimensionality reduction step from sparse data in high dimension to a lower-dimensional representation which is typically dense. The second step in spectral clustering is the application of some clustering method like K-means to the eigenvectors. It was shown that data points which are close to each other typically have similar values in their eigenvectors, which is why it makes sense to cluster those points together.

Note: I am cheating a bit by not providing all the details. A great tutorial for spectral clustering is found here: http://arxiv.org/abs/0711.0189.

In contrast, the proposed power iteration clustering (PIC) computes power iterations using the data matrix. Since the sum of each row equals one, each iteration can be thought of as a weighted average calculation, where each node computes the average of its neighbors. Of course, this process eventually leads to the same global average at all graph nodes. The main observation in the PIC algorithm is that the process should be stopped early, before every node in the graph has converged to that common value. When stopped at the right moment, each node stores a cluster-distinctive value. Intuitively, this is because nodes which are highly connected within a cluster push their intermediate results to be close to each other. Then the second step of K-means clustering is computed. This step is easier (relative to traditional spectral clustering), since we cluster just scalars in a single vector and not K-dimensional vectors.

In terms of work, instead of computing the K largest eigenvectors as in traditional spectral clustering methods, PIC computes only a single eigenvector (which is actually a linear combination of several eigenvectors). So it is K times faster than traditional spectral clustering.

Here is the algorithm:





The image above shows a demonstration of the power iteration method (taken from their ICML 2010 paper). (a) Shows a problem that K-means clustering has difficulty operating on. (b) - (d) show the evolution of the power iteration method. The y-axis is the scalar value of the output vector; the x-axis represents the different points. The colors are the output of the K-means clustering of scalars. While (b)-(c) are still not accurate, eventually the values of the vector converge very nicely to three clusters.
In principle, had we continued to a step (e) where all nodes have converged, we would get the same value at all nodes, which would not have been useful.

Anyway, since the conjugate gradient, Lanczos and SVD methods are implemented in GraphLab (as well as K-means), it is going to be very easy to implement this algorithm in GraphLab. I am planning to do it soon.
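To make the two-step recipe concrete, here is a toy pure-Python sketch of PIC: a few row-normalized power iterations that are stopped early, followed by 1-D K-means on the resulting scalars. The data layout (`adj` as a dict of neighbor weights) and the function names are my own simplification, not the authors' code or the GraphLab implementation:

```python
import random

def kmeans_1d(xs, k, iters=20):
    """Cluster scalars into k groups; centers start at evenly spaced quantiles."""
    if k == 1:
        return [0] * len(xs)
    srt = sorted(xs)
    centers = [srt[i * (len(srt) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        for c in range(k):
            members = [x for x, l in zip(xs, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def pic(adj, k, iters=30):
    """Toy PIC: adj maps node -> {neighbor: weight}; returns one label per node.
    Each iteration replaces a node's value with the weighted average of its
    neighbors; we deliberately stop well before full convergence."""
    nodes = sorted(adj)
    v = {u: random.random() for u in nodes}
    for _ in range(iters):
        new_v = {u: sum(w * v[nb] for nb, w in adj[u].items()) / sum(adj[u].values())
                 for u in nodes}
        norm = sum(abs(x) for x in new_v.values())  # rescale to avoid drift to zero
        v = {u: x / norm for u, x in new_v.items()}
    return kmeans_1d([v[u] for u in nodes], k)
```

On a graph with two disconnected cliques, for example, the early-stopped vector holds one near-constant value per clique, so the scalar K-means recovers the two clusters.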

Tuesday, November 29, 2011

Oracle Labs Report on GraphLab


Recently we had some interesting discussions with Eric Sedlar, Technical Director, Oracle Labs. Oracle examined GraphLab as a potential framework for implementing graph algorithms and assigned Sungpack Hong, a researcher at Oracle Labs, to write a tech report on this subject.

A few days ago I got a copy of this seven-page tech report, which is called "A brief report on Graphlab and Green-Marl". I asked Eric and Sungpack for permission to discuss their main findings here. Their full tech report may be published by them at a later date.

According to Sungpack, there are four characteristic aspects of the algorithms which GraphLab supports:

  1. Algorithms can be described in a node-centric way; same computation is repeatedly performed on every node.
  2. Significant amounts of computations are performed on each node.
  3. The underlying graphs have large diameter.
  4. Determinism can be neglected.
Here is my reaction to those four points:
  1. Agreed.
  2. Agreed. 
  3. This is not necessarily correct - you can use GraphLab for any graph as long as it is sparse.
  4. Actually, we support round-robin scheduling, so it is quite easy to have full sweeps over all nodes. Many of the matrix factorization algorithms we implemented fall into this category - they are completely deterministic (aside from perhaps some random initialization of the starting state).
Next, some issues are discussed:
  1. Programmability: user must restructure his algorithm in a node centric way.
  2. There is an overhead of runtime system when the amount of computation performed at each node is small.
  3. Small world graphs: GraphLab lock scheme may suffer from frequent conflicts for such graphs.
  4. Determinism: the results of algorithms computed in GraphLab may become non-deterministic.
  5. Concerns about distributed execution.
Here are my thoughts on those points:
  1. This is a correct concern, not every algorithm is suitable for programming in GraphLab.
  2. Agreed.
  3. Agreed. In GraphLab v2 Joey is heading a significant improvement which will allow factoring the update function into several different computations to deal with this issue.
  4. Objection! As shown, for example, by our participation in the ACM KDD CUP contest this year, we got very high quality prediction results that were completely deterministic. It is possible to implement a wide variety of deterministic algorithms in GraphLab.
  5. Agreed. Yucheng is heading the effort of completely rewriting the distributed engine and we believe significant performance improvements are possible in the distributed settings. 
Finally, the following conclusion is given:
... I believe GraphLab still is a valuable back-end for Green-Marl (DB: their programming language) .. it is a very natural to use GraphLab as one of Green-Marl's back-ends....  GraphLab and Green-Marl can be helpful to each other... Finally it will be interesting to wait .. for GraphLab2. A fair comparison of distributed graph processing ... would be meaningful then.

Overall, I want to thank Eric and Sungpack for the great and serious effort they put into evaluating GraphLab! While I am of course biased, it is important to get unbiased evaluations from third parties.

Spotlight: Dmitriy Golovashkin - Oracle and the R project


A few days ago I got the following note from Dmitriy, a principal staff member at Oracle:

Hi Danny,

I found out about GraphLab just two days ago.

I was working on a MapReduce based QR factorization and, whilst searching the web for references, found your blog & GraphLab. No question, I am planning to learn more about the project. Looks very exciting!

In general, our group is focusing on in-database and Hadoop based
data mining and statistical algorithms development.
R http://www.r-project.org/ is a big part of it.

Kind regards,
DG

As always, I am absolutely thrilled to get my readers' feedback! I asked Dmitriy if he could share some more insight about the R project, and here is what he wrote:

R is huge in data mining and statistical camps.
The number of contributed packages is staggering, it is amongst the most complete and feature rich environments for statistical and data mining computing.
Another very important observation concerns the quality of some of the contributed packages: outstanding work & implementation.

The biggest problem with R has to do with its inherent data storage model: everything must be stored in memory and most algorithms are sequential.
For instance the notion of a matrix in R is captured in the following C one liner:
double* a = (double*) malloc(sizeof(double) * nElements);

It is possible to build R with vendor-supplied matrix packages (BLAS and LAPACK) and thus have multithreaded matrix computations in R (which helps a lot).
However if the input does not fit into memory, then it is somewhat problematic to run even the simplest algorithms.

We enable R folks to carry out computations directly on the database data (no need to move the data out). The in-memory limitation has been lifted for some algorithms (not all of course).

More is here
http://www.oracle.com/us/corporate/features/features-oracle-r-enterprise-498732.html

Kind regards,
DG
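As a rough illustration of the in-memory limitation Dmitriy describes: since R stores a numeric matrix as one contiguous block of 8-byte doubles (exactly what the C one-liner above allocates), it is easy to estimate when a dataset stops fitting in RAM. A small back-of-the-envelope sketch (the function name is mine):

```python
def r_matrix_bytes(nrow, ncol):
    # R's numeric matrix is a single contiguous block of 8-byte doubles
    return nrow * ncol * 8

# Example: a dense 100,000 x 10,000 matrix needs ~7.45 GB of RAM,
# before counting any working copies R makes during computation.
gb = r_matrix_bytes(100_000, 10_000) / 2**30
```

This is why even "simple" algorithms become problematic in stock R once the input outgrows memory, and why pushing the computation to the database or to disk-backed storage helps.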

We definitely agree that R is a very useful statistical package. In fact, one of our users, Steve Lianoglou from Weill Cornell Medical College, ported our Shotgun solver package to R. And here is an excerpt from Steve's homepage which summarizes his ambivalent relationship with R:

I have a love/hate relationship with R. I appreciate its power, but at times I feel like I'm "all thumbs" trying to use it to bend the computer to my will (update: I actually don't feel like this anymore. In fact I'm quite comfortable in R now, but try not to get too frustrated if you're not yet ... it only took me about 6 months or so!).
If that feeling sounds familiar to you, these references might be useful.
  • The R Inferno [PDF]. "If you are using R and you think you're in hell, this is a map for you." I stumbled on this document after using R for about 8 months or so, and I could still sympathize with that statement.
  • An R & Bioconductor Manual by Thomas Girke




Anyway if you are a reader of this blog or a Graphlab user - send me a note!

Tuesday, November 22, 2011

Installing GraphLab on Fedora Core 16

FOR LAZY USERS: You can use Amazon EC2 image: ami-1510d87c
As instructed here.

0) Log into your FC 16 machine.
On amazon EC2 - you can launch instance ami-0316d86a

1) Installing libboost
yum install boost boost-devel

2) Installing itpp
Note: For FC 15 and earlier it is easy to set up itpp; see the packages here: http://koji.fedoraproject.org/koji/packageinfo?packageID=2142 and the explanation at http://sourceforge.net/apps/wordpress/itpp/download/
Using the command:
yum install itpp itpp-devel itpp-doc

Here I will explain how to install itpp manually, for FC 16

yum install lapack autoconf autogen libtool
ln -s /usr/lib64/liblapack.so.3 /usr/lib64/liblapack.so
ln -s /usr/lib64/libblas.so.3 /usr/lib64/libblas.so

yum groupinstall "Development Tools"
wget http://sourceforge.net/projects/itpp/files/itpp/4.2.0/itpp-4.2.tar.gz
tar xvzf itpp-4.2.tar.gz
cd itpp-4.2
./autogen.sh 
./configure --enable-debug --without-fft --with-blas=/usr/lib64/libblas.so.3  --with-lapack=/usr/lib64/liblapack.so.3 CFLAGS=-fPIC CXXFLAGS=-fPIC  CPPFLAGS=-fPIC
make && make install
Note: future versions of lapack may have a different library name; check your /usr/lib64
dir for the correct lapack/blas lib names.

3) Install Mercurial
yum install mercurial

4) Install cmake
yum install cmake

5a) Install graphlab from mercurial
Go to graphlab download page, and follow the download link to the mercurial repository.
copy the command string "hg clone..." and execute it in your shell.


or 5b) Install graphlab from tgz file
Go to graphlab download page, and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

6) configure and compile
cd graphlabapi
./configure --bootstrap --itpp_include_dir=/usr/local/include/  --itpp_dynamic_link_dir=/usr/local/lib/libitpp.so  --itpp_lapack_dir=/usr/lib64/
cd release/
make -j4

7) Test Graphlab
cd tests
export LD_LIBRARY_PATH=/usr/local/lib/
./runtests.sh

8) Optional (but recommended): install Octave.
yum install octave

Sunday, November 20, 2011

Installing GraphLab on RedHat Enterprise Server 6.1

0) Log into your Red Hat Enterprise 6.1 machine (64 bit).
On amazon EC2 - you can launch instance ami-31d41658

1) Installing libboost
yum install boost boost-devel
yum groupinstall "Development Tools"

2) Install Mercurial
yum install mercurial

3) Install cmake
yum install cmake

4a) Install graphlab from mercurial
Go to graphlab download page, and follow the download link to the mercurial repository.
copy the command string "hg clone..." and execute it in your shell.


or 4b) Install graphlab from tgz file
Go to graphlab download page, and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

5) configure and compile
cd graphlabapi
./configure 
cd release/
make -j4


Saturday, November 19, 2011

Installing GraphLab on Gentoo Linux



For Lazy Users: simply use the ami-d9e820b0 Amazon EC2 image. (More detailed instructions)


Detailed instructions.
Log into your Gentoo Linux machine (I used the preinstalled Amazon EC2 image AMI-e234c78b).

1) Install itpp using:
emerge lapack
USE="blas lapack debug doc" emerge itpp

2) Install boost using:
emerge boost

3) Install mercurial using:
emerge mercurial

4) Install cmake from sources:
Note: I was not able to execute "emerge cmake" because of some versioning problem in
the Gentoo repository, so I had to install it manually from sources as follows:
wget http://www.cmake.org/files/v2.8/cmake-2.8.6.tar.gz
tar xvzf cmake-2.8.6.tar.gz
cd cmake-2.8.6
./configure
make
make install

5a) Install graphlab from mercurial
Go to graphlab download page, and follow the download link to the mercurial repository.

copy the command string: "hg clone..." and execute it in your Gentoo shell.

or 5b) Install graphlab from tgz file
Go to graphlab download page, and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

6) configure and compile
cd graphlabapi
./configure --bootstrap --yes
cd release/
make -j4

7) Test Graphlab
cd tests
./runtests.sh

8) Optional: install Octave.
If you don't have access to Matlab, Octave is an open source replacement. Octave is useful for preparing input formats to GraphLab's collaborative filtering library and reading the output.
You can install Octave using the command:
emerge octave

Let me know how it went!

Thursday, November 17, 2011

SpotLight: Tianqi Chen - SVDFeature collaborative filtering project


Here is an interview with Tianqi Chen, the 3rd place winner in ACM KDD CUP track 1 this year. Since he finished two places ahead of us (the Lebusishu team), I definitely appreciate his achievement! Tianqi is one of the main collaborators on the SVDFeature project. In a nutshell, SVDFeature is an efficient C++ implementation of stochastic gradient descent that allows for additional feature tuning by the user. It can scale to large models since it keeps only part of the problem in memory. However, it is not fully parallel yet.

1) First, I wanted to congratulate you on your great achievement in the KDD CUP this year. In retrospect, can you explain what the crucial features of the SVDFeature project are that made it so successful?

One of the key factors in winning is of course the ensemble predictor; our teammates at HKUST did a great job on that. On the other hand, SVDFeature creates the chance to build a powerful single predictor that takes much information into account; we reported the best single method (RMSE=22.12 without blending) in track1 using SVDFeature. The most interesting thing about SVDFeature is that the user can build variants of the matrix factorization model by defining features. In our final predictor for track1, we incorporated user temporal information, implicit feedback, item neighborhood information, taxonomy information, etc. Many famous models such as time-SVD++ and SVD with neighborhood information can also be implemented by defining user temporal features and neighborhood features. Another important aspect of SVDFeature is that we don't try to load all the training data into memory, which suits large-scale CF data such as the KDDCup data (of course we still need the memory for storing the model). For example, we can use less than 2GB of memory to get comparable performance on KDDCup track1 and track2.

2) What is the project focus?

We really want to make the project a generic tool for developing informative collaborative filtering and ranking algorithms. By informative, we mean building matrix factorization algorithms that take available information into account (taxonomy information, user temporal information, neighborhood, social network, users' previous click or browsing history, etc.). A similar context has long existed in traditional classification tools such as LIBSVM, SVMLight, and SVMRank: these tools take in feature vectors and make predictions using the information given in the feature vector. We want SVDFeature to be the counterpart of feature-based predictors for collaborative filtering and ranking. Users can define features to describe the user (e.g. the user's click and browsing history), the item (e.g. taxonomy information), and global effects (e.g. neighborhood effects) to build genuinely new variants of matrix factorization. In that sense, SVDFeature is suitable for those who want to develop new recommendation algorithms via feature definition. It's also suitable for building context-aware recommendation algorithms.

3) Who are the contributors?

There are four authors on the project page; I am the major author of the project. We welcome comments and feedback on the project.

4) What is the open source model?

The project is released under the Apache License, Version 2.0. The toolkit doesn't depend on extra libraries and compiles under both Linux (g++) and Windows (MSVC).

5) What are the advantages of using the project?

(1) Implement existing variants of matrix factorization (time-SVD++, neighborhood SVD) and composites of them.

(2) Build new CF algorithms (for ranking or for rating prediction) that make use of extra information by defining new features.

(3) Work with a team building feature components independently and merge them together to see the result. Since the input only involves different kinds of features, we can have one person working on user temporal features using Python, and another generating item neighborhood features using C++. The generated features can be used to build independent models or merged to build a hybrid predictor.

(4) Large-scale data handling: the toolkit works on existing large-scale datasets such as KDDCup and Netflix with frugal memory usage and good speed. If we build features for basic matrix factorization on KDDCup track1, here are some statistics for one pass over the dataset on an Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz: 274 sec (64 factors), 357 sec (128 factors); the memory cost is within 2GB.

6) Are there some algorithms which will be hard to express using your framework?

Yes, to be honest, SVDFeature currently supports the feature-based matrix factorization model, which involves user, item and global features and does cover a large range of existing algorithms. It currently doesn't cover models that involve multiple independent factorization parts: for example, tensor time bias, which has a time vs. user/item factor in addition to the user vs. item factor. It's not hard to add additional tweaks to the code to support some specific features, and we welcome requests for specific features. On the other hand, we want to think about a more abstract model that involves configuration of multiple factorization parts; that's a future direction of the toolkit.

7) Can you point to some tutorials for people who want to try it out? Conceptually, what is needed from the users to be defined?

We have a detailed user reference manual and a technical report about the model that come with the project archive. The user needs to define a feature file similar to the common SVM format and provide a configuration file that specifies parameters such as the number of latent factors, the learning rate, etc. We have a demo folder in the toolkit archive with simple example scripts for the MovieLens data. We also have a non-trivial example that shows how to run an experiment on the KDDCup track2 dataset. We also welcome questions from users.

8) Is the implementation parallel?

The training algorithm is not parallel, but we do create an independent thread to fetch the input data to ensure speed, and we use SSE2 to optimize some vector operations.

9) Can you give some performance numbers for kdd cup data?

For KDDCup track1, SVDFeature can build a model with RMSE 22.1x; we also did a subsequent experiment on KDDCup track2 (see all the scripts and codes) with error 3.1x. The performance of SVDFeature really depends on the user's skill in feature definition; KDDCup track2 is a good example showing the performance of different feature compositions.

10) What is the difference between SVDFeature and libFM?

(1) Yes, it's very related to FM; we do respect the work of libFM, and we pointed this out in the related work of our tech report.
(2) The difference in the model: feature-based matrix factorization distinguishes different types of features; it has factorized interactions between user features and item features, and linear-only effects for global features. FM is more like a tensor-factorization-style model, which enforces factorized interactions between all features. Distinguishing feature types allows linear-only features such as neighborhood information, and ensures that user temporal features and user implicit feedback only have factorized interactions with item features.
(3) Differences in the toolkit: SVDFeature supports both rating prediction and feature-based collaborative ranking; it also has a fast algorithm for training with implicit feedback, large-scale data handling (very important if you define many features), and other minor differences, e.g. detailed options for parameter tuning and various loss functions (hinge loss, logistic loss, square loss).
(4) Some examples that may illustrate these differences: try to implement the SVD++ model or SVD with neighborhood information, or see our non-trivial experiment on KDDCup track2 (http://apex.sjtu.edu.cn/apex_wiki/kddtrack2).
We call our model feature-based matrix factorization because the idea descends more naturally from matrix factorization using features. We do not claim to have proposed a freshly new feature-based model; the credit shall go to FM. But we do try to make a model and a toolkit that are useful and can help with informative collaborative filtering and ranking, and SVDFeature at least works better in some scenarios :)
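To illustrate the model structure Tianqi describes, here is a toy sketch of a feature-based matrix factorization prediction rule: a linear term over global features plus an inner product between aggregated user-feature factors and aggregated item-feature factors. All names and data layouts below are my own assumptions for illustration, not SVDFeature's actual API:

```python
def fmf_predict(user_feats, item_feats, global_feats, w, p, q, d):
    """Toy feature-based MF prediction (names are mine, not SVDFeature's).
    *_feats: {feature_id: value}; w: linear weights for global features;
    p/q: {feature_id: list of d latent factors} for user/item features."""
    # global features enter linearly only
    linear = sum(val * w.get(f, 0.0) for f, val in global_feats.items())
    # user and item features are aggregated into two d-dimensional vectors...
    u_vec = [0.0] * d
    for f, val in user_feats.items():
        for t in range(d):
            u_vec[t] += val * p[f][t]
    i_vec = [0.0] * d
    for f, val in item_feats.items():
        for t in range(d):
            i_vec[t] += val * q[f][t]
    # ...which interact only through their inner product
    return linear + sum(a * b for a, b in zip(u_vec, i_vec))
```

Plain matrix factorization is the special case with one indicator feature per user and per item; adding, say, implicit-feedback features to `user_feats` or neighborhood features to `global_feats` yields the model variants discussed above.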

News from 29 Nov, 2011:
Tianqi published a ready-to-run experiment on KDDCup track1 at http://apex.sjtu.edu.cn/apex_wiki/kddtrack1

Wednesday, November 16, 2011

GAMP: Generalized Approximate Message Passing in GraphLab

About a week ago I was invited to visit Ohio State University by Phil Schniter. I heard about a lot of interesting research activities, which I will probably report on in this blog in the future. Specifically, Phil is working on inference in linear models closely related to my work.

I heard from Phil about the GAMP method, an extension of Montanari's AMP (approximate message passing) method. GAMP is supposed to overcome some of the difficulties of AMP, namely the Gaussian matrix assumption. Here is an excerpt from the GAMP wiki:
Generalized Approximate Message Passing (GAMP) is an approximate, but computationally efficient method for estimation problems with linear mixing. In the linear mixing problem an unknown vector, x, with independent components, is first passed through a linear transform z=Ax and then observed through a general probabilistic, componentwise measurement channel to yield a measurement vector y. The problem is to estimate x and z from y and A. This problem arises in a range of applications including compressed sensing. Optimal solutions to linear mixing estimation problems are, in general, computationally intractable, as the complexity of most brute force algorithms grows exponentially in the dimension of the vector x. GAMP approximately performs the estimation through a Gaussian approximation of loopy belief propagation that reduces the vector-valued estimation problem to a sequence of scalar estimation problems on the components of the vectors x and z.
This project is intended to develop the theory and applications of GAMP. We also provide open source MATLAB software for others who would like to experiment with the algorithm.
Below is a simplified version of the GAMP matlab code I got from Phil. It is a highly simplified instance of the GAMPmatlab code, which includes a library of signal priors and matrix constructions, as well as stepsize control (to improve stability when the matrix has highly correlated columns). Automatic tuning of GAMPmatlab parameters is provided by the EM-BG-GAMP code.

% this m-file uses GAMP to estimate the matrix X (of dimension NxT) from the noisy matrix observation Y=A*X+W of dimension MxT.  here, A has independent Gaussian entries with matrix mean and variance (A_mean,A_var), W has iid Gaussian entries with scalar mean and variance (0,psi), and X has iid Bernoulli-Gaussian entries with scalar activity rate "lam" and with non-zero elements having scalar mean and variance (theta,phi).

 % declare signal parameters
 N = 1000;      % # weights     [dflt=1000]
 M = 400;   % # observations    [dflt=400]
 T = 10;   % # dimensions per observation   [dflt=10]
 lam = 0.01*ones(N,T);  % BG: prior probability of a non-zero weight [dflt=0.01]
 theta = 0*ones(N,T);  % BG: prior mean of non-zero weights  [dflt=0]
 phi = 1*ones(N,T);   % BG: prior variance of non-zero weights [dflt=1]
 SNRdB = 15;   % SNR in dB      [dflt=15]

 % declare algorithmic parameters
 iterMax = 50;   % maximum number of AMP iterations  [dflt=20]

 % generate true Bernoulli-Gaussian signal (i.e., "weights")
 B_true = (rand(N,T)>1-lam);   % sparsity pattern
 X_true = B_true.*(theta+sqrt(phi).*randn(N,T));% sparse weights

 % generate measurement (i.e., "feature") matrix
 A_mean = randn(M,N)/sqrt(M);       % element-wise mean 
 A_mean2 = A_mean.^2;
 A_var = abs(A_mean).^2;   % element-wise variance
 A_true = A_mean + sqrt(A_var).*randn(M,N); % realization

 % generate noisy observations
 psi = norm(A_true*X_true,'fro')^2/(M*T)*10^(-SNRdB/10);   % additive noise variance 
 Y = A_true*X_true + sqrt(psi).*randn(M,T);  % AWGN-corrupted observation

 % initialize GAMP
 X_mean = lam.*theta;  % initialize posterior means
 X_var = lam.*phi;  % initialize posterior variances
 S_mean = zeros(M,T);  % initialized scaled filtered gradient
 NMSEdB_ = NaN*ones(1,iterMax); % for debugging: initialize NMSE-versus-iteration

 % iterate GAMP
 for iter=1:iterMax,

   % computations at observation nodes
   Z_var = A_mean2*X_var;
   Z_mean = A_mean*X_mean;
   P_var = Z_var + A_var*(X_var + X_mean.^2);
   P_mean = Z_mean - Z_var.*S_mean;
   S_var = 1./(P_var+psi);
   S_mean = S_var.*(Y-P_mean);
% computations at variable nodes
R_var = 1./(A_mean2'*S_var + eps);
   R_mean = X_mean + R_var.*(A_mean'*S_mean);
   gam = (R_var.*theta + phi.*R_mean)./(R_var+phi);
   nu = phi.*R_var./(R_var+phi);
   pi = 1./(1+(1-lam)./lam .*sqrt((R_var+phi)./R_var) .* exp( ...
    -(R_mean.^2)./(2*R_var) + ((R_mean-theta).^2)./(2*(phi+R_var)) ));
   X_mean = pi.*gam;
   X_var = pi.*(nu+gam.^2) - (pi.*gam).^2;

   % for debugging: record normalized mean-squared error 
   NMSEdB_(iter) = 20*log10(norm(X_true-X_mean,'fro')/norm(X_true,'fro'));
 end;% iter

 % for debugging: plot posterior means, variances, and NMSE-versus-iteration
 subplot(311)
   errorbar(X_mean(:,1),sqrt(X_var(:,1)),'.');
   hold on; plot(X_true(:,1),'k.'); hold off;
   axis([0,size(X_true,1),-1.2*max(abs(X_true(:,1))),1.2*max(abs(X_true(:,1)))])
   ylabel('mean')
   grid on;
   title('GAMP estimates')
 subplot(312)
   plot(10*log10(X_var(:,1)),'.');
   axe=axis; axis([0,N,min(axe(3),-1),0]);
   ylabel('variance [dB]');
   grid on;
 subplot(313)
   plot(NMSEdB_,'.-')
   ylabel('NMSE [dB]')
   grid on;
   title(['NMSE=',num2str(NMSEdB_(iter),3),'dB '])
 drawnow;

Today I implemented GAMP in GraphLab. The parallel implementation is pretty straightforward. The main overhead is the matrix product with A (marked in red) and with A' (marked in green). Each of the variable and observation nodes is a GraphLab node. There are two update functions, one for variable nodes and one for observation nodes. The code can be found in the GraphLab linear solvers library.

Here is a snapshot from the implementation of the variable node update function:
void gamp__update_function(gl_types_gamp::iscope &scope,
         gl_types_gamp::icallback &scheduler) {

  /* GET current vertex data */
  vertex_data_gamp& vdata = scope.vertex_data();
  unsigned int id = scope.vertex();
  gamp_struct var(vdata, id);

  
  /*
   * computations at variable nodes
   */
  memset(var.R_var, 0, config.D*sizeof(sdouble));
  memset(var.R_mean, 0, config.D*sizeof(sdouble));
  for (unsigned int k=ps.m; k< ps.m+ps.n; k++){
    vertex_data_gamp& other = g_gamp->vertex_data(k);
    gamp_struct obs(other,k);

    for (int i=0; i< config.D; i++){
      var.R_var[i] += pow(var.AT_i[k-ps.m],2) * obs.S_var[i];
      var.R_mean[i] += var.AT_i[k-ps.m] * obs.S_mean[i];
    }
  }

  for (int i=0; i< config.D; i++){
     var.R_var[i] = 1.0 / (var.R_var[i] + ps.gamp_eps);
     var.R_mean[i] = var.X_mean[i] + var.R_var[i] * var.R_mean[i];
     var.gam[i] = (var.R_var[i]*ps.gamp_theta + ps.gamp_phi*var.R_mean[i]) / (var.R_var[i] + ps.gamp_phi);
     var.nu[i] = ps.gamp_phi*var.R_var[i] / (var.R_var[i] + ps.gamp_phi);
     var.pi[i] = 1.0 / (1.0 + (1.0 - ps.gamp_lam) / ps.gamp_lam * sqrt((var.R_var[i]+ps.gamp_phi)/var.R_var[i]) * exp( \
      -(powf(var.R_mean[i],2)) / (2*var.R_var[i]) + (powf(var.R_mean[i] - ps.gamp_theta,2)) / (2*(ps.gamp_phi+var.R_var[i])) ));
     var.X_mean[i] = var.pi[i] * var.gam[i];
     var.X_var[i] = var.pi[i] * (var.nu[i]+ pow(var.gam[i],2)) - pow(var.pi[i] * var.gam[i],2);
  }
}
The code is slightly longer than the Matlab version, and looks somewhat uglier since the vectorized loop operations have to be explicitly expanded. I am not yet happy with the performance, since Matlab is highly optimized for dense matrix products. I still need to think about how to improve it, maybe by blocking and using LAPACK...
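For reference, "blocking" here means tiling the product so each working set fits in cache; in the C++ code the tiles would then be handed to an optimized BLAS gemm rather than looped over by hand. A toy pure-Python sketch of the traversal order (illustrative only, not the GraphLab code):

```python
def blocked_matmul(A, B, bs=2):
    """C = A*B computed tile by tile: each (bs x bs) tile of C is accumulated
    from a strip of A-tiles and B-tiles, the same traversal a cache-blocked
    (or BLAS-backed) dense kernel would use."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):            # tile rows of C
        for j0 in range(0, m, bs):        # tile columns of C
            for k0 in range(0, k, bs):    # inner-dimension tiles
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, m)):
                        for kk in range(k0, min(k0 + bs, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
C = blocked_matmul(A, B)  # same result as the naive triple loop
```

The payoff of blocking comes from memory locality, not from changing the arithmetic: the result is identical to the naive product.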

Monday, November 14, 2011

SpotLight: Zeno Gantner - MyMediaLite


Here are some more details I got from Zeno Gantner, about his collaborative filtering project: MyMediaLite.
> 1) What is the focus of your library?
The focus of MyMediaLite is to provide
  1. a collection of useful recommender system algorithms,
  2. a toolkit for experimentation with such methods, and
  3. practical tools for using and deploying the methods in the library.

There is no single specific target audience - we would like to cater to both researchers and practitioners, as well as people who teach or learn about recommender systems.

Our long-term vision is to be a kind of Weka for recommender systems. We are not there, yet. For example, we only have command-line tools that expose most of the library's functionality, but no fancy GUI program.

While scalability and performance are important to us, we do not sacrifice usability for it. Thus, the library offers many useful additions beyond the raw algorithm implementations:
  • incremental updates for many models
  • storing and reloading of learned models
  • support for different text-based file formats, database support

What MyMediaLite is not: an off-the-shelf package that you can plug into your online shop. But using MyMediaLite for your online shop relieves you of having to implement a recommendation algorithm. You can use an efficient, well-tested implementation from the library instead, which can save you a lot of work.

So if you are a .NET developer and are looking for a recommender system/collaborative filtering solution, MyMediaLite may be worth a look ...

> 2) Who are the contributors?
Originally, there were 3 authors in our lab who worked on MyMediaLite. After the project "MyMedia", which was responsible for MyMediaLite's birth, ended, I am the main author. A guy from BBC R&D (hello, Chris!) has ported parts of the library to Java (available on our download page).

I occasionally get bug reports and patches from users, but there has not been a major outside contribution, e.g. a new recommender algorithm, yet. I would be very happy to accept such contributions. If you (the readers of Danny's blog) want to contribute, there is a list of interesting methods that could be implemented for MyMediaLite in our issue tracker.

The development process is really open. I keep the source code on github and gitorious. There is also a Google Group, a public issue tracker, and plenty of documentation, so it should be rather easy to start working on MyMediaLite.

My current goal is to turn MyMediaLite into a community project, that's why I go to conferences to present MyMediaLite, write about it on Twitter, do interviews, etc.

> 3) What are the main implemented algorithms?
The library addresses two main tasks: rating prediction (very popular in the public eye because of the Netflix Prize) and item recommendation from implicit/positive-only feedback. The latter task is particularly important in practice, because you always have that kind of data (clicks, purchase actions), while you have to ask your users for ratings, and they will not always give them to you.

So for both tasks we have simple baseline algorithms (average ratings, most popular items, etc.) that you can use to check whether you screwed up: a rating algorithm should always have more accurate predictions than the item average, and an item recommendation algorithm should always come up with better suggestions than the globally most popular items.

Beyond that, we have several variants of k-nearest-neighborhood (kNN) algorithms for both tasks, both based on interaction data (collaborative filtering) and on item attribute data (content-based filtering).

Most importantly, we have matrix factorization techniques for both tasks. If you have enough data, those are the models to use. For rating prediction, we have a straightforward matrix factorization with SGD training that also models user and item biases. For item recommendation, we have weighted regularized matrix factorization (WR-MF), which is called weighted alternating least squares in GraphLab, and BPR-MF, which optimizes for a ranking loss (AUC).
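The rating-prediction model Zeno describes (global mean plus user and item biases plus a latent inner product, trained by SGD) can be sketched generically as follows. This is an illustrative toy, not MyMediaLite's actual code; the variable names and hyperparameters are made up:

```python
import random

def train_biased_mf(ratings, n_users, n_items, k=4, lr=0.05, reg=0.05, epochs=200):
    """SGD on (user, item, rating) triples for the model
    r_ui ~ mu + b_u + b_i + p_u . q_i  (biases plus latent factors)."""
    random.seed(0)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    bu = [0.0] * n_users                               # user biases
    bi = [0.0] * n_items                               # item biases
    P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(P[u][f] * Q[i][f] for f in range(k))
            e = r - pred
            bu[u] += lr * (e - reg * bu[u])
            bi[i] += lr * (e - reg * bi[i])
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (e * qi - reg * pu)
                Q[i][f] += lr * (e * pu - reg * qi)
    return mu, bu, bi, P, Q

# tiny made-up example: (user, item, rating)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
mu, bu, bi, P, Q = train_biased_mf(ratings, n_users=3, n_items=3)

def predict(u, i):
    return mu + bu[u] + bi[i] + sum(P[u][f] * Q[i][f] for f in range(4))

rmse = (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5
```

Modeling the biases explicitly matters in practice: much of the signal in rating data is "this user rates high" or "this item is popular", and the latent factors only need to explain what is left.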

> 4) Out of the implemented algorithms, which 2-3 perform best (on datasets like netflix, kddcup?)
I would go for the matrix factorization techniques mentioned above.

> 5) Do you have some performance results (for example on netflix data) in terms of speed and accuracy?
We have some results on our website:
http://www.ismll.uni-hildesheim.de/mymedialite/examples/datasets.html
http://www.ismll.uni-hildesheim.de/mymedialite/examples/recsys2011.html

For single-core matrix factorization, on an Intel(R) Xeon(R) CPU E5410 with 2.33 GHz, one iteration over the Netflix data takes, depending on the number of factors, 2 (10 factors) to 10 minutes (120 factors).

> 6) Which parts of the library are serial and which are parallel?
Most parts of the library are serial. We have parallelized some parts that are really easy to parallelize, e.g. cross-validation, where the experiment on each fold is entirely independent of the other folds.

We have one parallel SGD method for rating prediction MF. It is based on work presented at this year's KDD by Rainer Gemulla, which basically separates the training examples into independent chunks and then performs parallel training on those chunks.
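The idea behind the chunking can be illustrated with the block partitioning alone: split the rating matrix into a grid of blocks, then note that the blocks along each shifted diagonal touch disjoint user rows and item columns, so SGD can process them concurrently without conflicting updates. A toy sketch of that partitioning (my own illustration, not MyMediaLite's implementation):

```python
def diagonal_strata(n_blocks):
    """Yield strata of (row_block, col_block) pairs. Within one stratum no two
    blocks share a row block or a column block, so the parameter updates they
    trigger (user factors per row, item factors per column) never collide and
    the blocks can be trained by independent workers."""
    for shift in range(n_blocks):
        yield [(r, (r + shift) % n_blocks) for r in range(n_blocks)]

# with a 3x3 grid we get 3 strata of 3 conflict-free blocks each
strata = list(diagonal_strata(3))
```

Training then proceeds stratum by stratum: within a stratum all blocks run in parallel, and after a sweep over all strata every block of the matrix has been visited.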

I implemented the method because I saw the talk and liked the paper, and I wanted to see how parallelization works in C# - it is very, very simple.

I will try to add more parallelized algorithms, but it is not the main development focus. Most feature requests from users concern evaluation protocols and measures, and features of the surrounding framework.

> 7) What do I need to install if I want to use the library?
The byte code binaries are the same on all platforms. You need a .NET 4.0 runtime. On Windows, Microsoft .NET is already installed, so all you need to do is to download the library and tools and run them. On Mac OS X and Linux (and other Unix variants), you need Mono 2.8 or later, which you may have to install. The latest Ubuntu comes with Mono 2.10.5 installed by default, so there you can also just run it.

We have a tutorial on our website on how to run the command-line programs.

If you want to build the library, you also need MonoDevelop, which runs on all major platforms. You can download it for free. On Windows, you can use Visual Studio 2010 instead.

> 8) Some companies are limited by the GPL. Did you consider moving to a more industry-friendly license, or do you target mainly academic usage?
While I do not rule out a license change in the future, I am quite happy with the current license. MyMediaLite was derived from a codebase with a much more restrictive license (educational/research use only).

The GPL is actually quite industry-friendly, although not as permissive as the Apache or BSD licenses. The Linux kernel is also licensed under the GNU GPL. Is there a business that does not run Linux because of its license?

I am not a lawyer, but the GPL allows companies and all other users to do everything they need with MyMediaLite:
  • run, modify, and distribute the code
  • create derived works
  • sell it
  • use it in their online shop to make more money
  • set it up as a web service, and sell recommendations as a software
    service

If you modify the library, add new features, etc., you are not required to release its source code as long as you do not distribute the software itself.

Basically there are only two things that you cannot do with MyMediaLite under the GPL:
  • redistributing (e.g. selling) the code or derived works and
    denying your users the rights that we gave to you
  • suing us over some dubious software patents you are holding

time-SVD++ is now implemented in Graphlab

One of the best-performing algorithms in this year's KDD CUP contest is Koren's time-SVD++ algorithm. You can find more details in a previous post.

To summarize, according to our testing, we got the following validation RMSE on the KDD CUP data (the lower the prediction error, the better):

 #   Algorithm                                  Validation RMSE
 1   Neighborhood model, adjusted cosine (AC)   23.34
 2   ALS                                        22.01
 3   Weighted ALS                               18.87 **
 4   BPTF                                       21.84
 5   SGD                                        21.88
 6   SVD++                                      21.59
 7   Time-aware neighborhood                    22.7
 8   time-SVD                                   21.41
 9   time-SVD++                                 20.90
10   MFITR                                      21.30
11   time-MFITR                                 21.10
12   Random forest                              26.0


(** Weighted ALS did not perform well on the test data.)
As can be seen from the table above, time-SVD++ was the best-performing single algorithm on the KDD CUP data. I also got the same impression when talking to Yehuda Koren.

time-SVD++ is now implemented as part of GraphLab's collaborative filtering library.
You are all welcome to try it out!

Saturday, November 12, 2011

GraphLab feedback this week

I got this email from Zeno Gantner:

"Hi Danny,

how are things? Any new enhancements for GraphLab planned? Its CF
library really looks impressive. Are there any plans for .NET/Mono bindings?

We had a poster about MyMediaLite at ACM RecSys in Chicago, and
mention GraphLab in the "related work" section:
http://ismll.de/mymedialite/download/mymedialite-recsys2011.pdf

Best regards from Germany,
Zeno"


I looked at the poster above and the system actually looks pretty interesting. One thing they have covered well is different cost functions and evaluation metrics like AUC and precision/recall. I will probably follow up on this blog once I learn more.

Last week I visited Ohio State University and had some very interesting meetings. One of the recent GraphLab users there is Josh Ash, a research programmer who is working on hyperspectral imaging. Here is the feedback I got from him:

"Hi Danny,

It was great meeting you as well, and an especially unusual coincidence that you visited the very week after I discovered and started using GraphLab! For the user's page, you can note that I am using GraphLab for "Bayesian analysis of hyperspectral imagery." I am still building out the model (mathematically and in code), but so far I am very happy with the usability and results from GraphLab. Currently I'm getting a 160x speedup using Java-GraphLab on an 8-core machine versus Matlab single-core. In addition to 8 cores, part of the speedup is due to better implementation (2nd time around is always better) and part due to Java being more efficient on looping over discrete random variables.

I will send more detailed notes about my GL experiences (installation, usage, suggested features, etc.) at a later date. Shortly, I'll also likely create a small web presence for this project so I can pass along that address as well. In the meantime, if you want a link for the 'users' page, you can use my home page: http://www2.ece.ohio-state.edu/~ashj/

Cheers!
Josh"


Justin Gardin from Stony Brook University, NY sent me the following description of his project. He finds GraphLab "amazingly fast."
"I am working on bioinformatic sequence motif analysis. The concept is pretty simple actually. Break up a bunch of sequences which are known to contain a binding site(binding site sequence unknown) in to a matrix of word counts. Each rows equals to a single permutation of ACGT in K slots. Each column equals distance along the sequence. A entry equals how many times the specific "word" occurs in that window of the sequences. Apply non-negative matrix factorization with dimension reduction. Set the reduction equal to the number of motifs you expect to come out of the data (usually trial and error). Then the U and V matrices represent the contribution of each kmer, at each position (respectively), to the motifs in the dataset. If there are big peaks in the U and V matrices at positions and similar words, then there exists a conserved motif in the dataset.

I follow up this analysis with a special case of Gibbs sampling, which is initialized with the expected sequence motif from the U matrix and weights the sampling by the V matrix. It works really well.

Justin Gardin"


Mike Davies from Cambridge University is documenting his GraphLab experience on his blog.