## Monday, April 30, 2012

### Graph500 benchmark

Today I learned from Prof. Garth Gibson from CMU, about  a graph algorithm benchmark called http://graph500.org . It is a benchmarking suite design for comparing performance of graph algorithms.

Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.
Backed by a steering committee of over 50 international HPC experts from academia, industry, and national laboratories, Graph 500 will establish a set of large-scale benchmarks for these applications. The Graph 500 steering committee is in the process of developing comprehensive benchmarks to address three application kernels: concurrent search, optimization (single source shortest path), and edge-oriented (maximal independent set).

They have very interesting and challenging data sizes:
There are five problem classes defined by their input size:
toy 17GB or around 1010 bytes, which we also call level 10,
mini 140GB (1011 bytes, level 11),
small 1TB (1012 bytes, level 12),
medium 17TB (1013 bytes, level 13),
large 140TB (1014 bytes, level 14), and
huge 1.1PB (1015 bytes, level 15).

The benchmark performs the following steps:
1. Generate the edge list.
2. Construct a graph from the edge list (timed, kernel 1).
3. Randomly sample 64 unique search keys with degree at least one, not counting self-loops.
4. For each search key:
1. Compute the parent array (timed, kernel 2).
2. Validate that the parent array is a correct BFS search tree for the given search tree.
5. Compute and output performance information.

In comparison to our machine learning techniques, the Graph500 uses a completely useless and synthetic data:
The graph generator is a Kronecker generator similar to the Recursive MATrix (R-MAT) scale-free graph generation algorithm [Chakrabarti, et al., 2004].
Overall, it is a great initiative in the right direction, although we as ML researchers would love to see real data and real applications instead synthetic data. From the other hand, the magnitudes of data the parallel computing guys are handling are several order of magnitude larger than the ones we can fit in state of the art systems like GraphLab.

### Ensemble methods - a notable book by Elder and Seni

A couple of days ago a stumbled upon predictive analytics world website. It is an organization of professional conferences for Fortune 500 type companies that use predictive analytics.

One of the interesting researchers presenting regularly at predictive analytics is John Elder from Elder Research. I contacted John to invite him to participate in our GraphLab workshop. John was very nice to promptly respond and following my query he introduced me to Giovanni Seni. The pleasant surprise was to find our that John and Giovanni are the co-authors of the book:   Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. This got me quite excited, since last year in the ACM KDD CUP 2011 we use an ensemble method to combine several predictions together for getting a higher accuracy solution. We used very simple form of linear regression. It seems that the ensemble methods book contains much fancier methods that allow a better quality prediction. I am looking forward to reading this book very soon.

## Monday, April 23, 2012

### KDD CUP Update

As the contest is heating up I am getting more and more interesting feedback. As you can see in the above image, thanks to JustinYan, we are now at place 5 in track 2.

I have significantly extended the parser library in GraphLab version 2 to create useful tools for anyone who wants to compete. Here are some of the currently available tools:

kdd_train_data_parser - clean the training data from opposite signals, and split it into meaningful validation and training sets.
kdd_test_splitter - split the test data into two test files, as instructed by the contest
kddcup_output_builder - take two vector prediction files (output of graphlab or any other software like Vowpal Wabbit), find the highest prediction files, and merge them back into a zip submission file containing the solution.
kdd_usr_itm_feature_parsers - parse user and item features and translate them to GraphLab or VW input formats.
kdd_linear_model_builder - build a linear feature model for the training/validation that can be used for classification.

Since we don't have much time to perfect everything before the code release, I would love to get any help/feedback/suggestions from you!!

Some more detailed instructions:

GraphLab v2 setup
2) After checking out, you should do:
hg pull
hg update v2
./configure
cd release
make

Example runs:
More details are found here.

Let me know how it went!

## Friday, April 20, 2012

### KDD CUP 2012 Track 1 using GraphLab..

As some of you may recall, last year we had some great excitement with KDD CUP 2011, where we won the 5th place in track1. Additionally, the group who won the contest was using GraphLab as well. This year I really wanted to take a look again at this interesting and high visibility contest, but I did not find the time for it up to now. I got some interesting questions from students at Georgia Tech ML course taught by Le Song, were the course task is to compete in the KDD CUP 2012.

In this blog post I will share some of my approaches of dealing with KDD CUP 2012 track 1. I must warn my readers that it is very preliminary results, but I will be glad to get any inputs along the way.
Acknowledgement: my close collaborator, JustinYan from the Chinese National Academy of Science
is to credit most of the below results. By the way he is looking for summer internship in the US. Contact him if you are looking for intern. He is based in China - he will need a US intern visa.

## First try - matrix factorization using alternating least squares.

Goal: build a linear model for the user/item matrix and predict the missing values (the test data).

### Preparing the training data:

Some explanations and motivation. Track1 does not have a separate validation dataset, as was for example in Netflix and last year KDD CUP. It is very useful to have a validation dataset, or else one may easily overfit over the training dataset. Following JustinYan advice, I have separated the data based on its time properties. According to the unix timestamps of the data,
ratings are distributed from day 283 to day 314. (int date is in [131834878,1321027199] , 131834878 is 2011.10.12 00:00:00 (unix time +8) and 1321027199 is 2011.11.11(unix time +8). ). We take the last 5 days (310-314) as the validation and the first days (283-309) as the training set.

And here is what I did to convert the data for Graphlab v1 matrix factorization format:
2) I am using graphlab matrix factorization version which assumes there is a single rating between each user and data item. In rec_log_train.txt there are many items which are ranked several times. (Later we can add support to multiple ratings of the same item). Currently I am also avoiding the temporal aspect
we can later add that. Filter out duplicate ratings using the command:
sort -n -k 1,1 -k 2,2 rec_log_train.txt > rec_log_train.sorted.txt
3) Use graphlab version 2 kdd train parser to filter out duplicate ratings:
<468|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$./kdd_train_data_parser rec_log_train.sorted.txt WARNING: kdd_train_data_parser.cpp(main:224): Eigen detected. (This is actually good news!) INFO: kdd_train_data_parser.cpp(main:226): GraphLab parsers library code by Danny Bickson, CMU Send comments and bug reports to danny.bickson@gmail.com Currently implemented parsers are: Call data records, document tokens INFO: sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices. INFO: io.hpp(gzip_in_file:800): Opening input file: rec_log_train.sorted.txt INFO: io.hpp(gzip_out_file:831): Opening output file rec_log_train.sorted.txt.out INFO: kdd_train_data_parser.cpp(operator():181): Finished parsing total of 73209277 lines in file rec_log_train.sorted.txt INFO: kdd_train_data_parser.cpp(operator():197): Finished exporting a total of 39752418 ratings Finished in 103.088 Explanation: whenever there are multiple instances of the same user-item pair, with the same ratings, they are pruned and only one is left. Whenever there are two or more instances of the same user-item pair with conflicting ratings, namely both -1 and 1, they are ignored and removed from the training data. ### Preparing the test data: The test data contains the following format: [user ] [ item ] [ empty rating ] [ time ] It has unique 34910937 user/item pairs. The test data is explained here: The rec_log_test.txt file contains data for both the public leaderboard set and final evaluation set. The file is sorted temporally. Any data with a timestamp < 1321891200 is used for the public leaderboard set, and >= 1321891200 is used for the final evaluation set. The goal is to predict which items that were recommended to the user were followed by the user, within each set. In other words, the test data can be divided into two. In each set we should recommend the highest ranked items for this set. testA has 19,349,608 unique users and testB has 15,561,329 unique users. Here are some more detailed instructions: 5) Download the test data file: rec_log_test.txt.zip 6) Unzip the file using # unzip rec_log_test.txt.zip Use GraphLab v2 kdd_test_splitter to prepare the test: ./kdd_test_splitter rec_log_test.txt --ncpus=1 WARNING: kdd_test_splitter.cpp(main:159): Eigen detected. (This is actually good news!) INFO: kdd_test_splitter.cpp(main:161): GraphLab Parsers library code by Danny Bickson, CMU Send comments and bug reports to danny.bickson@gmail.com Schedule all vertices INFO: sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices. INFO: io.hpp(load_matrixmarket_graph:326): Reading matrix market file: rec_log_test.txt Rows: 2421057 Cols: 2421057 Nonzeros: 34910937 Constructing all vertices. Adding edges. Graph size: 34910937 Start finalize.. INFO: kdd_test_splitter.cpp(operator():132): Finished going over 19349608 rating for 1196411 unique users INFO: kdd_test_splitter.cpp(operator():133): Successfully created output file: testA.sorted INFO: io.hpp(load_matrixmarket_graph:326): Reading matrix market file: rec_log_test.txt Rows: 2421057 Cols: 2421057 Nonzeros: 34910937 Constructing all vertices. Adding edges. Graph size: 34910937 Start finalize.. INFO: kdd_test_splitter.cpp(operator():132): Finished going over 15561329 rating for 1196411 unique users INFO: kdd_test_splitter.cpp(operator():133): Successfully created output file: testB.sorted Finished in 164.793 INFO: kdd_test_splitter.cpp(main:184): Finished exporting test data! The output of this procedure are split test files, testA.sorted holding earlier ratings and testB.sorted holding later ratings. Ratings are sorted first by the user and then by the item numbers. ### Understanding the cost function Track one using AP@3 (average precision for 3 items) as its error measure. Here is pseudo code I got from JustinYan which computes it: private static double ProcessList(List<Sample> labelList1) { //labelList is an ordered list of the top 3 computed predictions List<Sample> labelList = labelList1.OrderByDescending(x => x.Prediction).Take(3).ToList(); //realList is a list of ratings that where equal to one List<Sample> realList = labelList1.Where(x => x.RealValue > 0).ToList(); //realClickCount is the number of items a user clicked on double realClickCount =realList.Count(); double ap = 0; int count = 0; //if user clicked on nothing, we get AP of zero. if (realClickCount == 0) return 0; //for each of the recorded ratings for (int i = 0; i < labelList.Count; i++) //if one of the top 3 items we recommended, had actually been //selected increase our precision. if (labelList[i].RealValue > 0) { ap += ++count * 1.0 / (i + 1); } } return ap / (realClickCount); } GraphLab now supports AP@3 cost function calculation using the command line switch --calc_ap=true ### Running GraphLab SGD 8) Rename the test file to be recognized in graphlab using: ln -s testA.sorted rec_log_train.sorted.out.txtt 6) Run bias-SGD using the command: ./pmf rec_log_train.sorted.txt.out 15 --D=50 '--scheduler=round_robin(max_iterations=8,block_size=1)' --matrixmarket=true --minval=-1 --maxval=1 --zero=true --matrix_market_tokens_per_row=4 --ncpus=8 --sgd_step_dec=0.98 --sgd_gamma=0.005 --sgd_lambda=0.005 --exporttest=true --calc_ap=true --reduce_mem_consumption=true --exportlinearmodel=false INFO: pmf.cpp(do_main:441): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU Send bug reports and comments to danny.bickson@gmail.com WARNING: pmf.cpp(do_main:448): Program compiled with it++ Support Setting run mode Bias-SGD INFO: pmf.cpp(start:291): Bias-SGD starting loading data file rec_log_train.sorted.txt.out Loading Matrix Market file rec_log_train.sorted.txt.out TRAINING Loading rec_log_train.sorted.txt.out TRAINING Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1 INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 33589556 loading data file rec_log_train.sorted.txt.oute Loading Matrix Market file rec_log_train.sorted.txt.oute VALIDATION Loading rec_log_train.sorted.txt.oute VALIDATION Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1 INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 6162863 loading data file rec_log_train.sorted.txt.outt Loading Matrix Market file rec_log_train.sorted.txt.outt TEST Loading rec_log_train.sorted.txt.outt TEST Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1 INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 19349608 INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 0 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 1 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 2 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 3 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 5 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 4 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 6 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 7 started. Bias-SGD for matrix (2421057, 2421057, 1):33589556. D=50 SVD++ 50 factors complete. Objective=0, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000. INFO: pmf.cpp(run_graphlab:238): starting with scheduler: round_robin max iterations = 8 step = 1 max_iterations = 8 INFO: asynchronous_engine.hpp(run:111): Worker 0 started. INFO: asynchronous_engine.hpp(run:111): Worker 1 started. INFO: asynchronous_engine.hpp(run:111): Worker 2 started. INFO: asynchronous_engine.hpp(run:111): Worker 3 started. INFO: asynchronous_engine.hpp(run:111): Worker 4 started. INFO: asynchronous_engine.hpp(run:111): Worker 5 started. INFO: asynchronous_engine.hpp(run:111): Worker 6 started. INFO: asynchronous_engine.hpp(run:111): Worker 7 started. Entering last iter with 1 21.4589) Iter SVD 1, TRAIN RMSE=0.5666 VALIDATION RMSE=0.5769. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.328273 AP@3 for validation: 0.292205 Entering last iter with 2 68.1817) Iter SVD 2, TRAIN RMSE=0.5393 VALIDATION RMSE=0.5700. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332562 AP@3 for validation: 0.308924 Entering last iter with 3 114.985) Iter SVD 3, TRAIN RMSE=0.5301 VALIDATION RMSE=0.5680. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332799 AP@3 for validation: 0.312879 Entering last iter with 4 162.032) Iter SVD 4, TRAIN RMSE=0.5249 VALIDATION RMSE=0.5673. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.333644 AP@3 for validation: 0.314124 Entering last iter with 5 219.16) Iter SVD 5, TRAIN RMSE=0.5213 VALIDATION RMSE=0.5671. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.336898 AP@3 for validation: 0.319216 Entering last iter with 6 289.758) Iter SVD 6, TRAIN RMSE=0.5180 VALIDATION RMSE=0.5670. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.340856 AP@3 for validation: 0.321952 Entering last iter with 7 360.691) Iter SVD 7, TRAIN RMSE=0.5146 VALIDATION RMSE=0.5670. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.344672 AP@3 for validation: 0.322379 Entering last iter with 8 431.754) Iter SVD 8, TRAIN RMSE=0.5113 VALIDATION RMSE=0.5671. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.352534 AP@3 for validation: 0.325194 INFO: asynchronous_engine.hpp(run:119): Worker 5 finished. INFO: asynchronous_engine.hpp(run:119): Worker 2 finished. INFO: asynchronous_engine.hpp(run:119): Worker 1 finished. INFO: asynchronous_engine.hpp(run:119): Worker 6 finished. INFO: asynchronous_engine.hpp(run:119): Worker 7 finished. INFO: asynchronous_engine.hpp(run:119): Worker 4 finished. INFO: asynchronous_engine.hpp(run:119): Worker 3 finished. INFO: asynchronous_engine.hpp(run:119): Worker 0 finished. Final result. Obj=0, TRAIN RMSE= 0.0000 VALIDATION RMSE= 0.0000. Finished in 466.943851 seconds INFO: io.hpp(export_test_file:418): **Completed successfully (mean prediction: -0.740713 INFO: read_matrix_market.hpp(save_matrix_market_vector:250): Saved output vector to file: rec_log_train.sorted.txt.out.test.predictions vector size: 19349608 INFO: read_matrix_market.hpp(save_matrix_market_vector:251): You can read it with Matlab/Octave using the script mmread.m found on http://graphlab.org/mmread.m Performance counters are: 0) EDGE_TRAVERSAL, 1421.59 Performance counters are: 2) CALC_RMSE_Q, 0.296477 Performance counters are: 6) CALC_OBJ, 1.17573 As you can see, each iteration of bias-SGD takes around 60 seconds using 8 cores machine. So it is enough to have a single good multicore machine. The output is the file: rec_log_train2.txt.test.predictions Let's look at the output: head rec_log_train2.txt.test.predictions %%MatrixMarket matrix array real general %%output predictions for test data 19349608 1 -1 -0.7164881229401 0.07807982712984 -0.9745061397552 -0.7039368152618 -0.6436884999275 -1 -0.6042827963829 For each user/item pair in the testA.sorted data, we get a prediction. Now we store the results for testA: mv rec_log_train2.txt.test.predictions testA.predictions And again we run graphlab, this time with testB: ./pmf rec_log_train.sorted.txt.out2 15 --D=50 '--scheduler=round_robin(max_iterations=8,block_size=1)' --matrixmarket=true --minval=-1 --maxval=1 --zero=true --matrix_market_tokens_per_row=4 --ncpus=8 --lambda=0.01 --sgd_gamma=0.005 --sgd_lambda=0.005 --sgd_step_dec=0.98 --exporttest=true --calc_ap=true INFO: pmf.cpp(do_main:441): PMF/BPTF/ALS/SVD++/time-SVD++/SGD/Lanczos/SVD Code written By Danny Bickson, CMU Send bug reports and comments to danny.bickson@gmail.com WARNING: pmf.cpp(do_main:448): Program compiled with it++ Support Setting run mode Bias-SGD INFO: pmf.cpp(start:291): Bias-SGD starting loading data file rec_log_train.sorted.txt.out2 Loading Matrix Market file rec_log_train.sorted.txt.out2 TRAINING Loading rec_log_train.sorted.txt.out2 TRAINING Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1 INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 33589556 loading data file rec_log_train.sorted.txt.out2e Loading Matrix Market file rec_log_train.sorted.txt.out2e VALIDATION Loading rec_log_train.sorted.txt.out2e VALIDATION Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1 INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 6162863 loading data file rec_log_train.sorted.txt.out2t Loading Matrix Market file rec_log_train.sorted.txt.out2t TEST Loading rec_log_train.sorted.txt.out2t TEST Matrix size is: USERS 2421057 MOVIES 2421057 TIME BINS 1 INFO: read_matrix_market.hpp(load_matrix_market:133): Loaded total edges: 15561329 INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 0 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 1 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 2 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 3 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 4 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 5 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 6 started. INFO: asynchronous_engine.hpp(run:137): Worker (Sync) 7 started. Bias-SGD for matrix (2421057, 2421057, 1):33589556. D=50 SVD++ 50 factors complete. Objective=0, TRAIN RMSE=0.0000 VALIDATION RMSE=0.0000. INFO: pmf.cpp(run_graphlab:238): starting with scheduler: round_robin max iterations = 8 step = 1 max_iterations = 8 INFO: asynchronous_engine.hpp(run:111): Worker 1 started. INFO: asynchronous_engine.hpp(run:111): Worker 0 started. INFO: asynchronous_engine.hpp(run:111): Worker 2 started. INFO: asynchronous_engine.hpp(run:111): Worker 3 started. INFO: asynchronous_engine.hpp(run:111): Worker 4 started. INFO: asynchronous_engine.hpp(run:111): Worker 5 started. INFO: asynchronous_engine.hpp(run:111): Worker 6 started. INFO: asynchronous_engine.hpp(run:111): Worker 7 started. Entering last iter with 1 42.9776) Iter SVD 1, TRAIN RMSE=0.5666 VALIDATION RMSE=0.5769. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.327499 AP@3 for validation: 0.291468 Entering last iter with 2 113.61) Iter SVD 2, TRAIN RMSE=0.5393 VALIDATION RMSE=0.5700. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.331413 AP@3 for validation: 0.308851 Entering last iter with 3 184.123) Iter SVD 3, TRAIN RMSE=0.5301 VALIDATION RMSE=0.5680. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.332523 AP@3 for validation: 0.312983 Entering last iter with 4 251.239) Iter SVD 4, TRAIN RMSE=0.5249 VALIDATION RMSE=0.5673. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.334125 AP@3 for validation: 0.314801 Entering last iter with 5 307.7) Iter SVD 5, TRAIN RMSE=0.5213 VALIDATION RMSE=0.5671. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.336783 AP@3 for validation: 0.318807 Entering last iter with 6 377.682) Iter SVD 6, TRAIN RMSE=0.5180 VALIDATION RMSE=0.5670. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.340606 AP@3 for validation: 0.320896 Entering last iter with 7 447.13) Iter SVD 7, TRAIN RMSE=0.5146 VALIDATION RMSE=0.5670. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.344897 AP@3 for validation: 0.322647 Entering last iter with 8 517.358) Iter SVD 8, TRAIN RMSE=0.5113 VALIDATION RMSE=0.5671. INFO: biassgd.hpp(bias_sgd_post_iter:64): AP@3 for training: 0.351736 AP@3 for validation: 0.324784 INFO: asynchronous_engine.hpp(run:119): Worker 2 finished. INFO: asynchronous_engine.hpp(run:119): Worker 0 finished. INFO: asynchronous_engine.hpp(run:119): Worker 5 finished. INFO: asynchronous_engine.hpp(run:119): Worker 6 finished. INFO: asynchronous_engine.hpp(run:119): Worker 1 finished. INFO: asynchronous_engine.hpp(run:119): Worker 4 finished. INFO: asynchronous_engine.hpp(run:119): Worker 7 finished. INFO: asynchronous_engine.hpp(run:119): Worker 3 finished. Final result. Obj=0, TRAIN RMSE= 0.0000 VALIDATION RMSE= 0.0000. Finished in 552.339237 seconds INFO: io.hpp(export_test_file:418): **Completed successfully (mean prediction: -0.733542 INFO: read_matrix_market.hpp(save_matrix_market_vector:250): Saved output vector to file: rec_log_train.sorted.txt.out2.test.predictions vector size: 15561329 INFO: read_matrix_market.hpp(save_matrix_market_vector:251): You can read it with Matlab/Octave using the script mmread.m found on http://graphlab.org/mmread.m Performance counters are: 0) EDGE_TRAVERSAL, 2109.31 Performance counters are: 2) CALC_RMSE_Q, 0.264258 Performance counters are: 6) CALC_OBJ, 1.1756 ### Final parsing and compilation of submission Use GraphLab v2, for creating the results in KDD cup format. This code go over the recommendation and builds a list of the top 3 recommendations out of the test items. Submission file contains exactly 1,340,127 users (713,611 from testA, and 626,516 from testB). There are four inputs to the output creation. The first one is the sorted testA we created previously using matlab. The second is the prediction file which is output from GraphLab. <309|0>bickson@bigbro6:~/ygraphlab/graphlabapi/debug/toolkits/parsers$ ./kddcup_output_builder testA.sorted testA.predictions testB.sorted testB.predictions --ncpus=1
WARNING:  kddcup_output_builder.cpp(main:165): Eigen detected. (This is actually good news!)
INFO:     kddcup_output_builder.cpp(main:167): GraphLab Parsers library code by Danny Bickson, CMU
Send comments and bug reports to danny.bickson@gmail.com
Schedule all vertices
INFO:     sweep_scheduler.hpp(sweep_scheduler:124): Using a random ordering of the vertices.
Rows:      2421057
Cols:      2421057
Nonzeros:  15561329
Constructing all vertices.
Graph size:    15561329
INFO:     io.hpp(load_matrix_market_vector:544): Going to read matrix market vector from input file: testB.predictions
INFO:     kddcup_output_builder.cpp(operator():131): Finished going over 15561329 rating  for 626516 unique users
INFO:     kddcup_output_builder.cpp(operator():132): Successfully created output file: testB.sorted.out
Rows:      2421057
Cols:      2421057
Nonzeros:  19349608
Constructing all vertices.
Graph size:    19349608
INFO:     io.hpp(load_matrix_market_vector:544): Going to read matrix market vector from input file: testA.predictions
INFO:     kddcup_output_builder.cpp(operator():131): Finished going over 19349608 rating  for 713611 unique users
INFO:     kddcup_output_builder.cpp(operator():132): Successfully created output file: testA.sorted.out
Finished in 82.1905
INFO:     kddcup_output_builder.cpp(main:189): Merging the two files together
updating: testA.sorted.out (deflated 72%)
INFO:     kddcup_output_builder.cpp(main:208): Successfully created zip file: submission.zip
INFO:     kddcup_output_builder.cpp(main:214): removed temporary files
INFO:     kddcup_output_builder.cpp(main:216): Created successfully zip file: submission.zip

### Preliminary results

Here is a very preliminary report on the results I was able to obtain so far:
ALS - 0.254 on the leaderboard
SVD++ - 0.29 on the leaderboard (Thanks for my collaborators in Hungarian National Academy of Science, specifically Tamas Kiss)
bias-SVD 0.330 on the leaderboard (After some optimization JustinYan got 0.34 with it).
logistic regression - 0.336

### Disclaimer

As the code is evolving, there may be some bugs/inaccuracies in the code. However, I did not want to wait for the last minute to get a perfect solution that no one will be able to use. Please ping me if you find any bugs or have any suggestions.

### SpotLight: Prof. Michael Mahoney, Stanford

A couple of days ago I had a delightful conversation with Prof. Mahoney from Stanford. Michael is the organizer of the successful MMDS workshop which every couple of year attracts 250 researchers interested in algorithms on big data. The coming workshop will be held July 10-13, 2012, at Stanford. For more information, including registration, how to submit poster presentations, etc., see the MMDS web page.

Michael did a lot of work in the field of random projection and I asked him some questions to learn some more about this field:

1) Can you provide some resources for novice people (like me) who likes to learn more on the technique of random projections?

Here are the links to the two reviews on randomized matrix algorithms that we discussed.

Randomized Algorithms for Matrices and Data, M. W. Mahoney,
NOW Publishers, Foundations and Trends in Machine Learning, Volume 3, Issue 2, 2011,
Technical Report, Preprint: arXiv:1104.5557 (2011).

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp
SIAM Review, Volume 53, Issue 2, pp. 217-288, 2011,
Technical Report, Preprint: arXiv:0909.4061 (2009).

The first of these reviews (mine) adopts a more data/TCS/ML approach for the least-squares and low-rank approximation problems and tries to highlight "why" the algorithms work with an eye to developing future methods more generally; while the latter review (HMT) focuses on coupling the novel randomized algorithms for low-rank matrix approximation with classical numerical methods, the goal being to get high-quality numerical code, typically for "scientific" applications. Thus, e.g., some of the assumptions that are made are more appropriate for scientific applications that "data analytics" applications---e.g., for them sparse matrices are typically structured, e.g., from pdes, and thus can be applied quickly as an operator to unstructured Gaussian random projections.

2) Are you familiar with the divide and conquer paper from last NIPS? ("Divide and Conquer Matrix Factorization" by Lester MackeyAmeet Talwalkar and Michael Jordan)

I'm familiar with that paper on "divide and conquer," and I've talked a lot with Ameet Talwalkar about these issues. Their splitting of the data makes sense; but from the perspective of what we were discussing about analytics, this and other work of Ameet (and many others in ML) makes assumptions that the matrix has small "coherence," which is basically the largest "statistical leverage score." See my review monograph above, or see our PNAS paper on "CUR matrix decompositions for improved data analysis," or the paper below, for details. That assumption is fine in some cases, and Ameet has shown that it is true for at least a few ML data sets. But it is a very false assumption in many many other data sets, which the PNAS paper, the review monograph, and several other papers I can point you to discuss. More importantly, to develop robust large-scale analytic tools, you simply can't assume that, since it will often be false. A good discussion of these issues can also be found in our paper:

Fast approximation of matrix coherence and statistical leverage,
P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff,
Technical Report, Preprint: arXiv:1109.3843 (2011)

which also describes how to compute the coherence and statistical leverage scores "fast." Coupled with LSRN, the algorithm in this paper yields fast code to compute the statistical leverage scores and matrix coherence---very fast in the case of "rectangular" matrices, and "medium fast" in the case of large sparse matrices with a low-rank parameter, partly due to the issue mentioned above, partly due to numerical issues with iterative methods like Lanczos-based methods and partly due to the densification issue. The latter is going to be a problem in general, and there is no simple solution---we should talk in more detail about this if you like.

3) I heard about LRSN, your efficient parallel iterative least squares. Can you give us the essence of this construction?

LSRN is a parallel iterative least squares solver developed by Meng,
Saunders, and Mahoney that is based on random projections and that takes
advantage of modern computer architectures is a novel way. The solver
computes the unique min-length solution to strongly over-determined or
under-determined least-squares systems of the form
$\min_{x \in \mathbb{R}^n} \|A x - b\|_2$. Thus, LSRN is similar to
recent implementations of randomized projection algorithms for the
least-squares problem; but it is more general and more robust, and it is
designed for larger-scale parallel or distributed environments. In
particular, all existing prior work assumes that $A$ has full rank; and,
for the iterative solvers, none of it comes with guarantees on the number
of iterations, independent of the spectrum of $A$. For LSRN, the matrix
$A \in \mathbb{R}^{m \times n}$ can be either full-rank or rank-deficient;
since $A$ is only involved in matrix-matrix and matrix-vector
multiplications, $A$ can be a dense matrix, a sparse matrix, or a linear
operator; and LSRN automatically speeds up on sparse matrices and with
fast linear operators. The preconditioning phase of LSRN consists of a
random normal projection, which is embarrassingly parallel, and a singular
value decomposition of a "small" matrix. Importantly, the preconditioned
system is well-conditioned, with a strong concentration result on the
extreme singular values, and hence that the number of iterations is fully
predictable (just as with direct solvers) when LSQR or the Chebyshev
semi-iterative method is applied to the preconditioned system. The latter
method is particularly efficient for solving large-scale problems on
distributed clusters with high communication cost. In addition, LSRN is
easy to implement using threads or with MPI, and it scales well in
distributed clusters and parallel environments. Numerical results
demonstrate that, on a local shared memory machine, LSRN outperforms
LAPACK's DGELSD on large-scale dense problems and MATLAB's backslash
(SuiteSparseQR) on large-scale sparse problems; and empirical results also
demonstrate that LSRN scales well on Amazon Elastic Compute Cloud (EC2)

## Wednesday, April 18, 2012

### KDD CUP 2011 reflection

I just learned from my collaborator JustinYan from the chinese national academy of science,about the paper A Linear Ensemble of Individual and Blended Models for Music Rating Prediction" was published by the Taiwanese team from the National Taiwan University, which won KDD CUP 2011 both tracks one and two.

What is specially interesting about this paper IMHO, is that GraphLab matrix factorization library was used for computing two models out of their ensemble.

That raises the number of teams who utilized GraphLab and won high places to two in this contest.

## Monday, April 16, 2012

### Expert Opportunity?

I am getting so many weird inquires from head hunters, recruiters and other consultants that I started getting suspicious about it. My theory is that most of those emails are completely junk and chain emails sent to a zillion people without any human being to back them up.

Yesterday I got am email with this novel title:  "Expert Opportunity" from someone in a company that hires consultants for legal IP stuff (I will not mention the name of the company for obvious reasons). What was nice about it, is the job description:
Candidates need to have experience with the Hadoop file system, ideally the Apache and MapReduce implementations. Candidates need to understand the hardware components of distributed file systems such as Hadoop along with the underlying software.

The following questions arise to my mind:
1) Is map reduce an implementation of hadoop???
2) Is Hadoop a hardware for file system???
3) Can software be underlying to hardware???

I took the effort and emailed this guy saying something is wrong with his job description. A couple of days passed and I got no reply... I will keep you posted.

### LDA explained

LDA (latent Dirichlet allocation) is a popular method used frequently in natural language processing. Originally it was proposed by David Blei in his 2003 paper.
A common LDA application is to have a document/word input matrix as the LDA input, and to cluster the documents and words into a set of topics.

The output of Blei's algorithm are two matrices: Beta and Gammma. Beta is a matrix of size
words x topics, and Gamma is a matrix of size documents x topics.

I got the following question about LDA from http://twitter.com/#!/trowind, a graduate student working on NLP from Seoul.

I'm trying to recognize the topics of each given document.
To do this I built topic distributions of word by using LDA algorithm with "glcluster" of GraphLab.

The first output is a "Word X Topic" matrix which contained the probabilities of P(topic_k|word_i). Anyway, I want to know the probabilities of topic for a given document : P(topic_k | doc_j)

How can I get these probabilities?

I contacted Khaled El-Arini, a CMU graduate student and the best LDA expert I could get hold on, and asked him to shed some light on this rather confusing dilemma. And this is what I got from Khaled:

P(word|topic) is Beta (by definition).
Gamma are the posterior dirichlet parameters for a particular document [see page 1004: here]
the gamma file should contain a T-dimensional vector for each document (where T is the number of topics), and that represents the variational posterior Dirichlet parameters for a particular document. In other words, if T=3, and gamma = [1 2 2], it means that the posterior distribution over theta for this document is distributed Dirichlet(1, 2, 2), and thus if you just want to take the posterior mean topic distribution for this particular document, it would be [0.2 0.4 0.4], based on the properties of a Dirichlet distribution.

Additional insignts I got from Liang Xiong, another CMU graduate student and my LDA expert number 2:
Gamma gives the distribution of p(topic | doc). So if you normalize the
Gamma so that it sums to one, you get the posterior mean of p(topic | doc). Unfortunately I don't see a directly way to convert this to p(doc | topic), since lda is doing a different decomposition as plsa. But personally I think that p(topic | doc) is more useful than p(doc|topic).

Anyway, I hope now it is more clear how to use Blei's LDA output.

## Sunday, April 15, 2012

### More on large scale SVM

About a month ago I posted here on large scale SVM. The conclusion of my post was that linear SVM is solved problem, mainly due to Pegasos stochastic gradient descent algorithm.

Today I had the pleasure of meeting Shai Shalev-Shwartz, the author of Pegasos. I asked Shai if he can explain to me (namely for dummies.. ) why pegasos is working so well. So this is what I heard. Pegasos is a stochastic gradient descent method. Instead of computing the costly operation of the exact gradient, a random data point is selected, and an approximated gradient is computed, based solely on this data point. The solution method is advancing in random directions, however on expectation, those random directions will lead to the exact global solution.

I asked Shai if he can provide me a simple matlab code that demonstrates the essence of Pegasos. And here is the code I got from him:

% w=pegasos(X,Y,lambda,nepochs)
% Solve the SVM optimization problem without kernels:
%  w = argmin lambda w'*w + 1/m * sum(max(0,1-Y.*X*w))
% Input:
%  X - matrix of instances (each row is an instance)
%  Y - column vector of labels over {+1,-1}
%  lambda - scalar
%  nepochs - how many times to go over the training set
% Output:
%  w - column vector of weights
%  Written by Shai Shalev-Shwartz, HUJI
function w=pegasos(X,Y,lambda,nepochs)

[m,d] = size(X);
w = zeros(d,1);
t = 1;
for (i=1:nepochs)      % iterations over the full data
for (tau=1:m)      % pick a single data point
if (Y(tau)*X(tau,:)*w < 1)   % distance of data point
% from  separator is to small
% or data point is at the other side of the separator.
% take a step towards the gradient
w = (1-1/t)*w + 1/(lambda*t)*Y(tau)*X(tau,:)';
else
w = (1-1/t)*w;
end
t=t+1;         % increment counter
end
end

You must agree with me that this is a very simple and elegant piece of code.
And here are my two stupid questions:
1) Why do we update the gradient with the magic number < 1?
2) Why do we update w even when gradient update is not needed?
1) It's either the data point is close to the separator, or that it's far away from the separator but on the wrong side (i.e., if y*w*x is a large negative number).
2) The actual distance from a point x to the separator is |w*x| / (||w|| * ||x||). So, to increase this number we want that both |w*x| will be large and that ||w|| will be small. So, even if we don't update, we always want to shrink ||w||.

## Friday, April 13, 2012

### MadLINQ: New large scale matrix computation paper from MSR asia

I got this from Aapo Kyrola:

"MadLINQ: Large-Scale Distributed Matrix Computation for the Cloud"http://dl.acm.org/citation.cfm?doid=2168836.2168857
It won the Eurosys best paper award, and really is an excellent paper from
MSR Asia. There are two reasons to read it:
1) It has similar application domain and motivation to GraphLab: they argue that many data mining
and ML algos can be represented as matrix algorithms. This work is focused
on dense/ish matrices and compares against ScalaPACK.  It is not direct competitor
to GraphLab other than maybe in the matrix factorization domain.
2) It is excellent Systems work and paper. It got very high praise from the
chairs of being a top-to-bottom systems work, from abstraction to verification
to detailed evaluation.

## Thursday, April 12, 2012

### LSI vs. SVD

Here is a question about LSI I got from Ashish Dhamelia from capitalnovus.com:

I am new to LSI.
I have read few papers about LSI and they all suggest using SVD.
Our data are mostly emails and few office documents as well.
We do get lots of data and we index using Lucene/Solr.

I started with SemanticVectors. It basically performs SVD on lucene index ( We can view lucene index as Term X Document matrix with element being TF-IDF score) So our input matrix is huge and sparse (lots of zeros). I do not have exact count about non-zeros element. But I will try to get more details. So given TFIDF matrix (Lucene index), Semantic vectors performs SVD and created two output matrix. Left singular matrix is term vector matrix while right singular matrix is document vector matrix.

So from term vector matrix, we can query a term and find semantically similar terms based on cosine similarity. Issue here is scalability. Semantic vector does everything in memory so it’s not scalable.

Right now we are evaluating mahout and graphlab to address this issue.

I forwarded this question to our LSI expert, Andrew Olney from University of Memphis. And this is what I got:

For either LSI/LSA
• Basically you construct a term/doc matrix (each cell is frequency count of word (row) in document (col)
• Normalize if you want e.g. log-entropy or tfidf
• Take the SVD
LSA is primarily interested in word/word;doc/doc similarity and uses the U matrix (left singular vectors) LSI is primarily interested in IR and also uses the V matrix (right singular vectors) For LSA the edited book "Handbook of LSA" is a good reference. For LSI I suggest this paper. What is meant in the email below by 'concept search' is not clear to me, but it seems similar to a collaborative filtering approach.

Stochastic SVDs (e.g. Gorrell) have been used for that though they are approximate. See also perhaps Lingo.

And finally, maybe an online topic model would be better if the user is selecting 'concepts' or concepts are otherwise reflected in the UI. Topics seem more comprehensible to people than SVD dimensions. See Blei’s paper.

Since we have now a very efficient SVD solver in GraphLab v2, I would love to help anyone who is interested in implementing LSA/LSI on top of it. Let me know if you want to try it out!

## Saturday, April 7, 2012

### The first GraphLab workshop is coming up!

I am absolutely excited to report that we are organizing the first Graphlab workshop. The workshop has two goals:

• Expose distributed/multicore GraphLab v2  to a wider ML community using demos and tutorials. Meet other GraphLab users and hear what they are working on!
• A meeting place for technology companies working on large scale machine learning. Companies are invited to present their related research and discuss future challenges.

Here is the current program committee:

Additional companies who confirmed their participation include:
Universities / research lab participating at the workshop:

• Carnegie Mellon University
• Johns Hopkins University
• University of California, Berkeley
• University of California, Santa Cruz
• University of Pennsylvania
• University of San Francisco
• Lawrence Berkeley National Lab
• Georgia Tech
• Stanford University
Date
Workshop date is July 9th in San Francisco.

Registration
Online registration is now open.

• 50$for student • 100$ for early bird
• 150\$ for regular registration

## Tuesday, April 3, 2012

### Machine learning contest from Amazon

Here is what I got from Ken Montanez, from Amazon, via Mahout mailing list:

Amazon's Information Security Organization partnered with IEEE to put on a Machine Learning Competition for their upcoming MLSP 2012 International Workshop in Spain.
Amazon provided real industry data to give competitors the opportunity to work with real world data.The competition deadline is May 14, 2012.Link Website: http://mlsp2012.conwiz.dk, under the 'MLSP Competition' menu option.
Have fun!Ken-- Ken Montanez Software Engineering Manager Security Platform

It seems the goal of the competition is binary classification of user access to resources, to either allow or deny access. The data is explained here. Anyway for us it is always interesting to obtain real data!