Saturday, December 29, 2012

SIGMOD 2013 GRADES Graph Workshop


I got the following interesting graph workshop invitation from Peter Boncz:


At SIGMOD 2013 a new workshop will be held: Graph Data Management Experiences & Systems (GRADES). 

In the database research community, graph data management is nowadays a popular topic, though the published papers tend to be highly technically focused (e.g. on graph reachability indexing or subgraph search), and not always targeted at what we now call Big Data.
In the GRADES workshop we strive for an emphasis on Experiences and Systems, that is, we value input from practitioners on graph data management application areas and systems, and the challenges these bring when handling large-scale problems, as found e.g. in life science analytics, social network marketing, digital forensics, telecommunication network analysis and digital publishing. 
Given your interesting work on the Distributed GraphLab system, we would like to invite you to consider submitting a short paper, with this focus on experiences and systems. Note that graph data management benchmarks and RDF stores are also in scope.
Please see http://event.cwi.nl/grades2013 for more details.
The workshop will be held Sunday June 23 2013 in New York, on the day preceding SIGMOD/PODS 2013. 
The audience will be a mix of graph data management researchers from SIGMOD/PODS and industrial data management practitioners. One of the goals of the workshop is to confront the research audience with "real" open problems and challenges.
The GRADES workshop is well poised to bring industrial practitioners and owners of graph data management problems to the workshop, as it is sponsored and co-organized by the Linked Data Benchmark Council (LDBC, http://ldbc.eu), aimed at industrial benchmarking of graph and RDF database systems. We expect to see industrial participation from graph database and RDF data management vendors inside and outside LDBC and some of their more involved customers. Do not hesitate to contact us if you wish additional information on LDBC. 
You are of course encouraged to share this call with colleagues or other people who might be interested in attending the workshop and/or submitting a paper.
best regards, thanks in advance for your participation, and wishing you a good and productive 2013, 
Peter Boncz & Thomas Neumann GRADES Co-Chairs

Wednesday, December 19, 2012

3rd Generation Collaborative Filtering - Sparse Case [or] Can Matrix Factorization be Used for Classification?

Following my gensgd implementation, which supports dense feature matrices, I was asked by my mega collaborator Justin Yan to implement a sparse version that supports the libsvm format.
sparse_gensgd is exactly the same algorithm (high dimensional matrix factorization), but for the sparse case. Perhaps a bit surprisingly, I will show below that the sparse_gensgd algorithm can be used for classification with very nice performance. As a case study I will discuss KDD CUP 2010.

Case study: ACM KDD CUP 2010

In this case study I will show you how to get state-of-the-art performance from the GraphChi CF toolkit on a recent KDD CUP 2010 task. Here is the text from the contest website describing the task:
This year's challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems. This task presents interesting technical challenges, has practical importance, and is scientifically interesting. 
I have used the libsvm data repository version, which converts the task into binary classification. The problem is moderate in size: around 20M samples for training, 750K samples for testing, and 29M sparse features.

The winning team was NTU, and here is their winning paper. Here is a graph depicting their single algorithm improvement:
As you can see, a prediction RMSE of around 0.2815 is the best result obtained by a single model.

The data is sparse in the sense that any number of features can appear in one sample. For example, here are the first 3 lines of the data in libsvm format:

1 2:1 103:1 104:1 105:1 106:0.301 107:0.301 32913:1 32917:1 2990385:1 2990386:1 2990387:1 2990388:1 2990389:0.301 2990390:1 2990391:1 2990392:1 2990393:1 2990394:1 2990395:1 2990396:1 2990397:1 2990398:1 2990399:1 2990400:1 2990401:1
0 2:1 92:1 115:1 116:1 117:1 32913:1 32917:1 2990387:1 2990388:1 2990389:0.477 2990390:1 2990391:1 2990393:1 2990394:1 2990396:1 2990398:1 2990399:1 2990402:1 2990403:1 2990404:1 2990405:1 2990406:1
0 2:1 100:1 143:1 144:1 145:1 12235:1 32913:1 32917:1 2990387:1 2990388:1 2990389:0.477 2990390:1 2990391:1 2990393:1 2990394:1 2990396:1 2990398:1 2990399:1 2990407:1 2990408:1 2990409:1 2990410:1 2990411:1

The target is either 1 or 0; this is the value we would like to predict as part of the matrix factorization procedure. The rest of the features are integer ids; most of their values are binary (1), but some are doubles (0.477). Given the validation dataset, where only the features are given, we would like to predict the target, either 0 or 1.
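To make the mechanics concrete, here is a minimal Python sketch of how a libFM-style model (which gensgd follows in spirit) scores one sparse sample and thresholds it into a 0/1 prediction. The names fm_predict, w0, w, V and classify are my own illustration, not the actual gensgd internals:

import numpy as np

def fm_predict(sample, w0, w, V):
    # sample: list of (feature_id, value) pairs, e.g. [(2, 1.0), (103, 1.0), ...]
    # w0: global bias, w: linear weights (1-D array), V: latent vectors (num_features x D)
    ids = [i for i, _ in sample]
    vals = np.array([v for _, v in sample])
    y = w0 + w[ids].dot(vals)
    # pairwise interactions via the O(D * nnz) factorization machine identity
    vx = V[ids] * vals[:, None]
    s = vx.sum(axis=0)
    y += 0.5 * (s.dot(s) - (vx ** 2).sum())
    return y

def classify(sample, w0, w, V, cutoff=0.5):
    # binary decision, matching the --cutoff=0.5 flag used in the run below
    return 1 if fm_predict(sample, w0, w, V) >= cutoff else 0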


Now we run sparse_gensgd and we get:

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/sparse_gensgd --training=kddb --cutoff=0.5 --calc_error=1 --quiet=1 --gensgd_mult_dec=0.99999 --max_iter=100 --validation=kddb.t --gensgd_rate3=1e-4 --D=20 --gensgd_regw=1e-4 --gensgd_regv=1e-4 --gensgd_rate1=1e-4 --gensgd_rate2=1e-4 --gensgd_reg0=1e-3
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
   253.313) Iteration:   0 Training RMSE:   0.316999 Train err:   0.134567  Validation RMSE:   0.301573 Validation Err:    0.11255
   302.651) Iteration:   1 Training RMSE:   0.311927 Train err:   0.131657  Validation RMSE:   0.299295 Validation Err:   0.111737
   350.719) Iteration:   2 Training RMSE:   0.309526 Train err:   0.130117  Validation RMSE:   0.298312 Validation Err:   0.111191
   399.433) Iteration:   3 Training RMSE:   0.307839 Train err:   0.128916  Validation RMSE:   0.297752 Validation Err:   0.111072
...

   1598.31) Iteration:  27 Training RMSE:   0.293562 Train err:   0.117877  Validation RMSE:   0.288989 Validation Err:   0.109334
   1647.51) Iteration:  28 Training RMSE:   0.293217 Train err:   0.117608  Validation RMSE:   0.288871 Validation Err:   0.109252
   1696.24) Iteration:  29 Training RMSE:   0.292884 Train err:   0.117357  Validation RMSE:   0.288908 Validation Err:   0.109398
   1745.18) Iteration:  30 Training RMSE:   0.292555 Train err:   0.117106  Validation RMSE:   0.289125 Validation Err:   0.109435


As you can see, we got a nice validation RMSE of 0.289, while the best customized solution for this task gets an RMSE of 0.281. So for a general purpose solver the obtained performance is quite nice.

Some explanation about the runtime parameters (a schematic SGD update illustrating the rate and regularization flags follows this list):
--training - training input file name
--validation - validation input file name
--D - width of the latent feature vectors
--calc_error - calculates the classification error.
--cutoff - threshold value used for binary classification. Since the data contains 0/1 labels, I set the cutoff threshold to 0.5
--max_iter - number of iterations
--gensgd_rate1/2/3 - learning rates (gradient step sizes)
--gensgd_regw/v/0 - regularization rates
--quiet - runs in less verbose mode
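As promised, here is a schematic SGD update illustrating what the rate and regularization flags control, reusing the fm_predict() sketch from above. Again, this is my own illustration and not the gensgd source:

def sgd_step(sample, target, w0, w, V, rate=1e-4, reg=1e-4):
    # one stochastic gradient step on the squared error of a single sample;
    # rate and reg play the role of the --gensgd_rate* and --gensgd_reg* flags
    err = fm_predict(sample, w0, w, V) - target
    s = sum(V[i] * x for i, x in sample)  # shared factor sum, length D
    w0 -= rate * (err + reg * w0)
    for i, x in sample:
        w[i] -= rate * (err * x + reg * w[i])
        V[i] -= rate * (err * x * (s - V[i] * x) + reg * V[i])
    return w0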

Instructions:
1) Install GraphChi as instructed here (steps 1-3).
2) Download the datasets kddb (training) and kddb.t (validation) and put them in the root GraphChi folder. (Tip: use bunzip2 to extract those files.)
3) Create a file named kddb:info with the following two lines (a helper script for generating these headers appears after this list):
%%MatrixMarket matrix coordinate real general
1000 1000 19264097
4) Create a file named kddb.t:info with the following two lines:
%%MatrixMarket matrix coordinate real general
1000 1000 748400
5) Run as instructed.
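As mentioned in step 3, if you prefer not to create the :info headers by hand, a small helper like the following (my own convenience script, not part of GraphChi) can generate them by counting the lines of the data file:

def write_info(filename, rows=1000, cols=1000):
    # write the MatrixMarket header file '<filename>:info' expected by gensgd;
    # the third header number is the sample (line) count of the data file
    with open(filename) as f:
        num_samples = sum(1 for _ in f)
    with open(filename + ':info', 'w') as out:
        out.write('%%MatrixMarket matrix coordinate real general\n')
        out.write('%d %d %d\n' % (rows, cols, num_samples))

write_info('kddb')    # creates kddb:info
write_info('kddb.t')  # creates kddb.t:info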






Monday, December 17, 2012

Some interesting feedback for the 3rd generation cf discussed in this blog

You can find below two emails I got from industry regarding my 3rd generation solver.
I can definitely understand some of the frustration: I can claim anything in my blog,
while for the industry guys it is harder to prove their claims, since they charge money for them... My solution costs nothing, so if it is worth more than nothing then everyone is happy about it.

The first feedback is from Dinesh Vadhia, founder of Xyggy.com:

Hi Danny
A solution to the scalable CF problem with additional information has been available for a while with a Bayesian machine learning method developed by Professor Zoubin Ghahramani and Dr Katherine Heller (see reference below). Together with Zoubin, we at Xyggy are focused on making it simpler and faster for developers and data scientists to deploy intelligent services online with scalable Bayesian machine learning.  Briefly, the key aspects for machine learning practitioners are:
i) Feature engineering of any data type
Feature vectors can be created from any data type. Features can also be combined, for example bag-of-words + votes (events) + preference ratings + custom features, and so on. If data is remotely of value then include it as a feature. Feature vector lengths can range from the tens to tens of millions. The method is equally suitable for items x items, items x users and users x items problem types.
ii) No training
No training is required. Once the feature engineering is completed, the sparse binarized data is fed to the Xyggy engine.  The method automatically learns and generalizes.
iii) Multiple items per query
A query consists of one or more items.  Typically, the more items per query the better the results.
iv) Dynamic predictions
Predictions are calculated dynamically in near real-time.  If the query items change, the predictions are re-calculated.  Think of it as dynamic clustering. 
v) Scalability and parallelism
The method can be viewed as IR+RecSys and will scale to any required size.  The inherent parallelism offers scalability and performance routes to deliver services at web-scale.
vi) Relevance feedback, engineered serendipity and novelty detection.
The Android showcase app demonstrates personalization with autonomous discovery utilizing both positive and negative relevance feedback, as well as engineered serendipity.  The source code for the Android app will be released soon to show how these capabilities can be built into applications with the Xyggy api.  If there is interest, the feature engineering process for this app can be explained as it is instructive.
More information can be found here.  We welcome the opportunity to work with organizations who want a simpler and faster way to deploy scalable intelligent services. 
I'll be happy to continue the discussion.
Best ...
Dinesh Vadhia
founder/engineer
dinesh@xyggy.com
www.xyggy.com

Reference:
Though written a while back, Information Retrieval using a Bayesian Model of Learning and Generalization provides a good overview. However, the paper doesn't cover relevance feedback, engineered serendipity or RecSys in any detail. Also, note that the demo has been replaced with the Android showcase app.

The second feedback is from Nick Vasiloglou of ISMION:

I don't think you expected so much feedback for your blog post, but I guess this is good. Let me add my comments to your post
  1. You will be surprised how many things happen in the industry that nobody ever bothers to publish. I can easily believe that somebody else has tried this approach. The truth is that I have been trying to convince some of my clients to use tensor factorization (PARAFAC) instead of linear regression, but they are not convinced. One of the reasons is that traditional industry prefers linear regression because of the confidence intervals and statistical significance of the factors. In your approach you aggregate several factors into the time variable. You could have instead used a multidimensional tensor x[i,j,k,l]=a[i]*b[j]*c[k]*d[l], used L1 regularization, and that would have given you some measure of the importance of the variables. I am not sure how you can match the linear regression metrics, but in the worst case you use the bootstrap.
My answer: yes, I have been there, done that... Typically in the industry you write patents on ideas that would hardly be accepted to a decent conference. But I know there is not enough time to pursue publications. By the way, in my approach I am not aggregating several factors into time variables (I only noted this can be done using the traditional matrix factorization approach); I am using separate factors for each variable. The tensor case can hardly be scaled beyond the 3rd dimension because of all the interactions between the latent feature vectors. 
  2. Another problem that I see with your approach is the handling of continuous variables. Say you are trying to predict delays and you have continuous variables like weather temperature or humidity. I believe you implicitly quantize them through hashing. This might be suboptimal, since values that are very close might fall into separate bins. Even worse, if, say, your original dataset didn't have any categorical/ordinal variables, but only d continuous factors, then by quantizing them with hashing there is a high probability that true nearest neighbors will be assigned to different bins. A different approach would be to build m random trees with l leaves each. Now your original d-dimensional space has been transformed into an m*l one, and each point is mapped onto m different leaves. I think this is a better approach than quantizing each variable separately, since it preserves the Euclidean distances up to a small error.
My answer: I agree there is a delicate point here. For the target variable (like flight delay) I support continuous values, but the other features I quantize, so the natural ordering between increasing values is lost. Still, for most problems I tried it works well in practice. Regarding the lost relations, whenever a single feature appears in two different samples the samples are connected through it, and the gradient is computed with respect to both. Thus data dependencies are maintained quite well.
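To make Nick's point concrete, here is a toy illustration of why hashing continuous values loses locality (purely illustrative; this is not the actual gensgd hashing code):

def hash_bin(value, num_bins=100000):
    # hashing treats each distinct value as an unrelated token
    return hash(str(value)) % num_bins

print(hash_bin(36.6), hash_bin(36.7))  # near neighbors, almost surely different bins
print(hash_bin(36.6), hash_bin(98.6))  # distant values look no more different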
  3. There is also one detail that is not mentioned in any matrix factorization work and is very critical in practice. If the matrix has block diagonal components, in other words the corresponding graph has disconnected components, you can get very bad results in your recommender system when you are looking for similar items. So a good piece of advice is to use the GraphLab method for identifying them first and then use your favorite factorization.
My answer: most of the graphs I work on are connected. So you may be right, but I am not concerned with this issue.
  4. One of the reasons why companies don't broadcast their solutions is that they are afraid of patents. This is also my advice to my clients, especially if they are in the medical/healthcare industry: never disclose your exact methods. Patent trolls are watching.
My answer: good life in academia!! Earn a low salary and broadcast as much as you like!!
  5. Lastly, I would like to say that I am very glad you posted this approach. I happen to teach an introductory course for practitioners with the title "Learning Machine Learning by Example" http://www.meetup.com/Learning-Machine-Learning-by-Example/ and we use the airline dataset. Your post is giving me an opportunity to introduce the students to matrix factorization coming straight from linear regression!
My answer: very interesting. I hope you will give the students an opportunity to actually play with my code; I think it would be beneficial for their understanding of how different features contribute to the overall solution quality. 

Nick has kindly added GraphChi CF toolkit to his machine learning meetup course.



Friday, December 14, 2012

Collaborative filtering - 3rd generation - part 2

NOTE: This blog post is two years old. We have reimplemented this code as part of Graphlab Create. The implementation in GraphLab Create is preferred since:
1) No input format conversions are needed (like matrix market header setup)
2) No parameter tuning like step sizes and regularization are needed
3) No complicated command line arguments
4) The new implementation is more accurate, especially regarding the validation dataset. 
Anyone who wants to try it out should email me and I will send you the exact same code in Python.

**********************************************************************************
A couple of days ago I wrote about new experimental software I am writing, which is what I call 3rd generation collaborative filtering software. I got a lot of interesting feedback from my readers, which helped improve the software. Previously I examined its performance on the KDD CUP 2012 dataset. Now I have tried it on a completely different dataset and I am quite pleased with the results.

First dataset: Airline on time


Below I will explain how to deploy it on a different problem domain: airline on-time performance. It is a completely different dataset from a different domain, but the gensgd software can still handle it without any modification. I hope that these results, which show how flexible the software is, will encourage additional data scientists to try it out!

The airline on-time dataset has information about 10 years of flights in the US. The data for each year is a csv file with the following format:
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay

The fields are rather self explanatory. Each line represents a single flight, with information about the date, carrier, airports etc., and the interesting fields are those describing the varying flight durations.

And here are the first few lines:

2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA

Note: you can get the dataset using the commands:
curl http://stat-computing.org/dataexpo/2009/2008.csv.bz2 -o 2008.csv.bz2
bunzip2 2008.csv.bz2
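If you want to peek at the data before running gensgd, a quick look with pandas (assuming you have it installed) is enough to verify the columns we will use below:

import pandas as pd

df = pd.read_csv('2008.csv')
print(df[['UniqueCarrier', 'FlightNum', 'ActualElapsedTime']].head())
print(df['ActualElapsedTime'].mean())  # roughly 127 minutes, as noted below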


First task: can we predict the total time the flight was in the air?


Well, for a matrix factorization method, it is not clear what the actual matrix is here. That is why it is useful to have flexible software. In my experiments I have chosen "UniqueCarrier" and "FlightNum" as the two fields which form the matrix, because together they characterize each flight rather uniquely. Next we need to decide which field we want to predict; I have chosen ActualElapsedTime as the prediction target. Note that those fields are chosen on the fly, so you are more than welcome to choose others and see how good the prediction is in that case.
(Additional information about the meaning of each field is found here.)

First let's use traditional matrix factorization.
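Concretely, in this first run each (carrier, flight number) pair is a cell of the matrix, and the model is roughly the classic biased factorization (my notation, not the gensgd source):

# prediction for a (carrier u, flight number i) cell, classic biased MF:
#   time(u, i) ~ mu + b_u + b_i + <p_u, q_i>
def mf_predict(mu, b_u, b_i, p_u, q_i):
    # p_u, q_i: numpy latent vectors of length D
    return mu + b_u + b_i + p_u.dot(q_i)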

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1  --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --max_iter=20 --file_columns=28 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 0 : 

INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum

   7.58561) Iteration:   0 Training RMSE:    67.1094
   11.7177) Iteration:   1 Training RMSE:    64.6665
   15.8441) Iteration:   2 Training RMSE:    63.2155
   19.9971) Iteration:   3 Training RMSE:    59.0044
   24.0989) Iteration:   4 Training RMSE:    53.9083
   28.1962) Iteration:   5 Training RMSE:    50.2416
...
   77.6041) Iteration:  17 Training RMSE:    35.6409
   81.7165) Iteration:  18 Training RMSE:     35.505
   85.8197) Iteration:  19 Training RMSE:    35.4046
   89.9266) Iteration:  20 Training RMSE:    35.3288


We got an RMSE of 35.3 minutes on the predicted flight time, taking into account only the carrier and flight number. That is rather bad... we are half an hour off track.

Next let's throw some temporal features into the computation: Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime. How do we do that? It is very easy! Just add the command line flag --features=1,2,3,4,5,6,7, namely the positions of the features in the input file (position 0, Year, is constant in this single-year file). This is what we call temporal matrix factorization, or tensor factorization. But to utilize it in one of the traditional methods, you would need to merge all the temporal fields into one integer which encodes the time, which is of course a tedious task.
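For contrast, here is the kind of manual encoding a fixed 3-dimensional tensor method would force on you (a hypothetical helper, just to show the tedium gensgd avoids):

# collapse the temporal columns into a single time index by hand (illustrative only)
def time_index(month, day_of_month, day_of_week, dep_time):
    return ((int(month) - 1) * 31 + int(day_of_month) - 1) * 2400 + int(dep_time)

# gensgd avoids this entirely: --features=1,2,3,4,5,6,7 keeps each temporal
# column as its own feature, each with its own bias and latent vector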



bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --file_columns=28 --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --max_iter=100  --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --features=1,2,3,4,5,6,7 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 7 : 

INFO:     gensgd.cpp(main:1211): Selected feature:   1 : Month
INFO:     gensgd.cpp(main:1211): Selected feature:   2 : DayofMonth
INFO:     gensgd.cpp(main:1211): Selected feature:   3 : DayOfWeek
INFO:     gensgd.cpp(main:1211): Selected feature:   4 : DepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   5 : CRSDepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   6 : ArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:   7 : CRSArrTime

INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum


   21.8356) Iteration:   0 Training RMSE:    50.3144
   36.6782) Iteration:   1 Training RMSE:    40.4813
    51.425) Iteration:   2 Training RMSE:    36.0579
   66.4348) Iteration:   3 Training RMSE:    33.4226
...
   272.188) Iteration:  17 Training RMSE:    20.0103
   286.887) Iteration:  18 Training RMSE:    19.7198
   301.602) Iteration:  19 Training RMSE:    19.4597
   316.305) Iteration:  20 Training RMSE:    19.2147


 With temporal information we now get an RMSE of 19.2 minutes, which is again not that good.

Now let's utilize the full power of gensgd: when the going gets tough, throw in some more features! Without even understanding what the features mean, I have thrown in almost everything...

./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=11 --rehash=1 --features=1,2,3,4,5,6,7,12,13,14,15,16,17,18 --gensgd_rate3=1e-5  --gensgd_mult_dec=0.9999 --file_columns=28 --max_iter=20 --gensgd_rate1=1e-5 --gensgd_rate2=1e-5 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1155): Total selected features: 14 : 
INFO:     gensgd.cpp(main:1211): Selected feature:   1 : Month
INFO:     gensgd.cpp(main:1211): Selected feature:   2 : DayofMonth
INFO:     gensgd.cpp(main:1211): Selected feature:   3 : DayOfWeek
INFO:     gensgd.cpp(main:1211): Selected feature:   4 : DepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   5 : CRSDepTime
INFO:     gensgd.cpp(main:1211): Selected feature:   6 : ArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:   7 : CRSArrTime
INFO:     gensgd.cpp(main:1211): Selected feature:  12 : CRSElapsedTime
INFO:     gensgd.cpp(main:1211): Selected feature:  13 : AirTime
INFO:     gensgd.cpp(main:1211): Selected feature:  14 : ArrDelay
INFO:     gensgd.cpp(main:1211): Selected feature:  15 : DepDelay
INFO:     gensgd.cpp(main:1211): Selected feature:  16 : Origin
INFO:     gensgd.cpp(main:1211): Selected feature:  17 : Dest
INFO:     gensgd.cpp(main:1211): Selected feature:  18 : Distance
INFO:     gensgd.cpp(main:1212): Target variable    11 : ActualElapsedTime
INFO:     gensgd.cpp(main:1213): From                8 : UniqueCarrier
INFO:     gensgd.cpp(main:1214): To                  9 : FlightNum
   36.2089) Iteration:   0 Training RMSE:    21.1476
   61.2802) Iteration:   1 Training RMSE:    10.1963
   86.3032) Iteration:   2 Training RMSE:    8.64215
   111.236) Iteration:   3 Training RMSE:    7.76054
   136.246) Iteration:   4 Training RMSE:    7.14308
   161.221) Iteration:   5 Training RMSE:     6.6629
...
   461.528) Iteration:  17 Training RMSE:    4.26991
    486.61) Iteration:  18 Training RMSE:    4.17239
   511.737) Iteration:  19 Training RMSE:    4.08084
   536.775) Iteration:  20 Training RMSE:    3.99414

Now we got down to a 4 minute average error. But we can continue the computation (run more iterations) and get even below a 2 minute error. Isn't that neat? The average flight time in 2008 is 127 minutes, so a 2 minute prediction error is not that bad.

Conclusion: traditional matrix/tensor factorization has some severe limitations when dealing with real-world complex data. Additional techniques are needed to improve accuracy!

Second task: let's predict TaxiIn (the time the plane spends on the ground after landing)

This task is slightly harder to predict since, as you may imagine, there is much larger relative variation in taxi-in time than in flight time. But is setting it up more difficult? No... we simply change to --val_pos=19, namely pointing the target at the TaxiIn field.

bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=2008.csv --from_pos=8 --to_pos=9 --val_pos=19 --rehash=1  --file_columns=28 --gensgd_rate3=1e-3  --gensgd_mult_dec=0.9999 --max_iter=20  --file_columns=28 --gensgd_rate1=1e-3 --gensgd_rate2=1e-3 --features=1,2,3,4,5,6,7,10,11,12,13,14,15,16,17,18 --quiet=1 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[quiet] => [1]
INFO:     gensgd.cpp(main:1155): Total selected features: 16 : 
INFO:     gensgd.cpp(main:1158): Selected feature: 1
INFO:     gensgd.cpp(main:1158): Selected feature: 2
INFO:     gensgd.cpp(main:1158): Selected feature: 3
INFO:     gensgd.cpp(main:1158): Selected feature: 4
INFO:     gensgd.cpp(main:1158): Selected feature: 5
INFO:     gensgd.cpp(main:1158): Selected feature: 6
INFO:     gensgd.cpp(main:1158): Selected feature: 7
INFO:     gensgd.cpp(main:1158): Selected feature: 10
INFO:     gensgd.cpp(main:1158): Selected feature: 11
INFO:     gensgd.cpp(main:1158): Selected feature: 12
INFO:     gensgd.cpp(main:1158): Selected feature: 13
INFO:     gensgd.cpp(main:1158): Selected feature: 14
INFO:     gensgd.cpp(main:1158): Selected feature: 15
INFO:     gensgd.cpp(main:1158): Selected feature: 16
INFO:     gensgd.cpp(main:1158): Selected feature: 17
INFO:     gensgd.cpp(main:1158): Selected feature: 18
   1.56777) Iteration:   0 Training RMSE:    3.89207
   3.01777) Iteration:   1 Training RMSE:    3.64978
    4.5159) Iteration:   2 Training RMSE:    3.46472
    5.8659) Iteration:   3 Training RMSE:    3.30712
   7.26778) Iteration:   4 Training RMSE:    3.17225
    8.7159) Iteration:   5 Training RMSE:    3.06696
...
   23.6072) Iteration:  16 Training RMSE:    2.60147
   24.9789) Iteration:  17 Training RMSE:    2.57697
   26.3267) Iteration:  18 Training RMSE:    2.55768
   27.6967) Iteration:  19 Training RMSE:    2.54186
   29.0773) Iteration:  20 Training RMSE:    2.53113
We get an average RMSE of 2.5 minutes; since taxi-in times are only a few minutes long, this means the task is actually more difficult than predicting air time.


Instructions:
0) Install GraphChi from mercurial using the instructions here.
1) Download the year 2008 from here.
2) Extract the file using:
bunzip2 2008.csv.bz2
3) Create a matrix market format file named 2008.csv:info with the following two lines:
%%MatrixMarket matrix coordinate real general
20 7130 1000000
4) Run the commands as instructed above.


Second dataset: Hearst machine learning challenge

A while ago Hearst provided data about email campaigns, and the task was to predict user reaction to emails (clicked / not clicked). The data has several million records about emails sent, with around 273 user features for each email. Here are some of the available fields:
CLICK_FLG,OPEN_FLG,ADDR_VER_CD,AQI,ASIAN_CD,AUTO_IN_MARKET,BIRD_QTY,BUYER_DM_BOOKS,BUYER_DM_COLLECT_SPC_FOOD,BUYER_DM_CRAFTS_HOBBI,BUYER_DM_FEMALE_ORIEN,BUYER_DM_GARDEN_FARM,BUYER_DM_GENERAL,BUYER_DM_GIFT_GADGET,BUYER_DM_MALE_ORIEN,BUYER_DM_UPSCALE,BUYER_MAG_CULINARY_INTERS,BUYER_MAG_FAMILY_GENERAL,BUYER_MAG_FEMALE_ORIENTED,BUYER_MAG_GARDEN_FARMING,BUYER_MAG_HEALTH_FITNESS,BUYER_MAG_MALE_SPORT_ORIENTED,BUYER_MAG_RELIGIOUS,CATS_QTY,CEN_2000_MATCH_LEVEL,CLUB_MEMBER_CD,COUNTRY_OF_ORIGIN,DECEASED_INDICATOR,DM_RESPONDER_HH,DM_RESPONDER_INDIV,DMR_CONTRIB_CAT_GENERAL,DMR_CONTRIB_CAT_HEALTH_INST,DMR_CONTRIB_CAT_POLITICAL,DMR_CONTRIB_CAT_RELIGIOUS,DMR_DO_IT_YOURSELFERS,DMR_MISCELLANEOUS,DMR_NEWS_FINANCIAL,DMR_ODD_ENDS,DMR_PHOTOGRAPHY,DMR_SWEEPSTAKES,DOG_QTY,DWELLING_TYPE,DWELLING_UNIT_SIZE,EST_LOAN_VALUE_RATIO,ETECH_GROUP,ETHNIC_GROUP_CODE,ETHNIC_INSIGHT_MTCH_FLG,ETHNICITY_DETAIL,EXPERIAN_INCOME_CD,EXPERIAN_INCOME_CD_V4,GNDR_OF_CHLDRN_0_3,GNDR_OF_CHLDRN_10_12,GNDR_OF_CHLDRN_13_18,GNDR_OF_CHLDRN_4_6,GNDR_OF_CHLDRN_7_9,HH_INCOME,HHLD_DM_PURC_CD,HOME_BUSINESS_IND,I1_BUSINESS_OWNER_FLG,I1_EXACT_AGE,I1_GNDR_CODE,I1_INDIV_HHLD_STATUS_CODE,INDIV_EDUCATION,INDIV_EDUCATION_CONF_LVL,INDIV_MARITAL_STATUS,INDIV_MARITAL_STATUS_CONF_LVL,INS_MATCH_TYPE,LANGUAGE,LENGTH_OF_RESIDENCE,MEDIAN_HOUSING_VALUE,MEDIAN_LEN_OF_RESIDENCE,MM_INCOME_CD,MOSAIC_HH,MULTI_BUYER_INDIV,NEW_CAR_MODEL,NUM_OF_ADULTS_IN_HHLD,NUMBER_OF_CHLDRN_18_OR_LESS,OCCUP_DETAIL,OCCUP_MIX_PCT,PCT_CHLDRN,PCT_DEROG_TRADES,PCT_HOUSEHOLDS_BLACK,PCT_OWNER_OCCUPIED,PCT_RENTER_OCCUPIED,PCT_TRADES_NOT_DEROG,PCT_WHITE,PHONE_TYPE_CD,PRES_OF_CHLDRN_0_3,PRES_OF_CHLDRN_10_12,PRES_OF_CHLDRN_13_18,PRES_OF_CHLDRN_4_6,PRES_OF_CHLDRN_7_9,PRESENCE_OF_CHLDRN,PRIM_FEM_EDUC_CD,PRIM_FEM_OCC_CD,PRIM_MALE_EDUC_CD,PRIM_MALE_OCC_CD,RECIPIENT_RELIABILITY_CD,RELIGION,SCS_MATCH_TYPE,TRW_INCOME_CD,TRW_INCOME_CD_V4,USED_CAR_CD,Y_OWNS_HOME,Y_PROBABLE_HOMEOWNER,Y_PROBABLE_RENTER,Y_RENTER,YRS_SCHOOLING_CD,Z_CREDIT_CARD

The meaning and coding of the fields are described in detail here. You will need to register at the website to get access to the data.

And this is the first entry:
N,N,,G,,8,0,1,0,0,0,0,1,0,0,0,0,4,0,0,1,0,0,0,B,U,0,,M,Y,0,0,0,0,0,1,1,1,0,2,0,A,C,0,J,18,Y,66,,A,U,U,U,U,U,34,,U,U,84,M,H,1,1,M,5,I,01,00,67,3,,E06,Y,7,3,0,05,0,37,78.09,30,63,36,13.27,59,,N,N,N,N,N,N,U,UU,U,07,6,J,4,,J,4,U,,Y,U,0,Y,,24,,,,,,,F,F,,,,,,,U,Y,,,,,,,17,69,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NORTH LAUDERDALE,330685141,FL,190815,,,,,,1036,Third Party - Merch,"Mon, 09/20/10 01:04 PM"


For this demo I used the file Modeling_1.csv, which is the first of 5 files, with 400K entries.

We would like to predict the zeroth column (the click flag). I have taken columns 9 and 10 as the matrix from/to entries. The rest of the columns up to column 40 are features. (While there are more features available, the obtained solution is so accurate that the first 40 are enough.)

After about an hour of playing I arrived at the following formulation:

./toolkits/collaborative_filtering/gensgd --training=Modeling_1.csv --val_pos=0 --from_pos=9 --to_pos=10 --features=3,4,5,6,7,8,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40 --has_header_titles=1 --rehash=1 --file_columns=200 --rehash_value=1 --calc_error=1 --cutoff=0.5 --has_header_titles=1
WARNING:  common.hpp(print_copyright:104): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1255): Total selected features: 36 : 
INFO:     gensgd.cpp(main:1258): Selected feature:   3 : AQI
INFO:     gensgd.cpp(main:1258): Selected feature:   4 : ASIAN_CD
INFO:     gensgd.cpp(main:1258): Selected feature:   5 : AUTO_IN_MARKET
INFO:     gensgd.cpp(main:1258): Selected feature:   6 : BIRD_QTY
INFO:     gensgd.cpp(main:1258): Selected feature:   7 : BUYER_DM_BOOKS
INFO:     gensgd.cpp(main:1258): Selected feature:   8 : BUYER_DM_COLLECT_SPC_FOOD
INFO:     gensgd.cpp(main:1258): Selected feature:  11 : BUYER_DM_GARDEN_FARM
INFO:     gensgd.cpp(main:1258): Selected feature:  12 : BUYER_DM_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  13 : BUYER_DM_GIFT_GADGET
INFO:     gensgd.cpp(main:1258): Selected feature:  14 : BUYER_DM_MALE_ORIEN
INFO:     gensgd.cpp(main:1258): Selected feature:  15 : BUYER_DM_UPSCALE
INFO:     gensgd.cpp(main:1258): Selected feature:  16 : BUYER_MAG_CULINARY_INTERS
INFO:     gensgd.cpp(main:1258): Selected feature:  17 : BUYER_MAG_FAMILY_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  18 : BUYER_MAG_FEMALE_ORIENTED
INFO:     gensgd.cpp(main:1258): Selected feature:  19 : BUYER_MAG_GARDEN_FARMING
INFO:     gensgd.cpp(main:1258): Selected feature:  20 : BUYER_MAG_HEALTH_FITNESS
INFO:     gensgd.cpp(main:1258): Selected feature:  21 : BUYER_MAG_MALE_SPORT_ORIENTED
INFO:     gensgd.cpp(main:1258): Selected feature:  22 : BUYER_MAG_RELIGIOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  23 : CATS_QTY
INFO:     gensgd.cpp(main:1258): Selected feature:  24 : CEN_2000_MATCH_LEVEL
INFO:     gensgd.cpp(main:1258): Selected feature:  25 : CLUB_MEMBER_CD
INFO:     gensgd.cpp(main:1258): Selected feature:  26 : COUNTRY_OF_ORIGIN
INFO:     gensgd.cpp(main:1258): Selected feature:  27 : DECEASED_INDICATOR
INFO:     gensgd.cpp(main:1258): Selected feature:  28 : DM_RESPONDER_HH
INFO:     gensgd.cpp(main:1258): Selected feature:  29 : DM_RESPONDER_INDIV
INFO:     gensgd.cpp(main:1258): Selected feature:  30 : DMR_CONTRIB_CAT_GENERAL
INFO:     gensgd.cpp(main:1258): Selected feature:  31 : DMR_CONTRIB_CAT_HEALTH_INST
INFO:     gensgd.cpp(main:1258): Selected feature:  32 : DMR_CONTRIB_CAT_POLITICAL
INFO:     gensgd.cpp(main:1258): Selected feature:  33 : DMR_CONTRIB_CAT_RELIGIOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  34 : DMR_DO_IT_YOURSELFERS
INFO:     gensgd.cpp(main:1258): Selected feature:  35 : DMR_MISCELLANEOUS
INFO:     gensgd.cpp(main:1258): Selected feature:  36 : DMR_NEWS_FINANCIAL
INFO:     gensgd.cpp(main:1258): Selected feature:  37 : DMR_ODD_ENDS
INFO:     gensgd.cpp(main:1258): Selected feature:  38 : DMR_PHOTOGRAPHY
INFO:     gensgd.cpp(main:1258): Selected feature:  39 : DMR_SWEEPSTAKES
INFO:     gensgd.cpp(main:1258): Selected feature:  40 : DOG_QTY
INFO:     gensgd.cpp(main:1259): Target variable   0 : CLICK_FLG
INFO:     gensgd.cpp(main:1260): From              9 : BUYER_DM_CRAFTS_HOBBI
INFO:     gensgd.cpp(main:1261): To               10 : BUYER_DM_FEMALE_ORIEN
   54.8829) Iteration:   0 Training RMSE: 0.00927502  Train err:      8e-05
   99.4742) Iteration:   1 Training RMSE: 0.00120904  Train err:          0
   143.852) Iteration:   2 Training RMSE: 0.000793143 Train err:          0
   188.523) Iteration:   3 Training RMSE: 0.000604034 Train err:          0
   233.188) Iteration:   4 Training RMSE: 0.000500067 Train err:          0


We got a very good classifier - starting from the second iteration there are no classification errors.

Some explanation of additional runtime flags not used in the previous examples (a small sketch of the value rehashing and cutoff logic follows this list):
1) --rehash_value=1 - since the target value is not numeric, I used rehash_value to translate Y/N into two numeric integer bins.
2) --cutoff=0.5 - after hashing the target Y/N we get two integers, 0 and 1, so I use 0.5 as the prediction threshold to decide between Y and N.
3) --file_columns=200 - I am looking only at the first 40 columns, so there is no need to parse all 273 columns. (You can play with this parameter at runtime.)
4) --has_header_titles=1 - the first line of the input file contains the column titles.
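As promised, roughly speaking flags 1 and 2 amount to the following (my own sketch of the idea, not the gensgd code):

label_map = {}

def rehash_value(token):
    # --rehash_value=1 in spirit: the first distinct target token maps to 0,
    # the second to 1, so Y/N become two integer bins
    if token not in label_map:
        label_map[token] = len(label_map)
    return label_map[token]

def decide(score, cutoff=0.5):
    # --cutoff=0.5 in spirit: threshold the real-valued prediction back to a 0/1 label
    return 1 if score >= cutoff else 0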

Instructions
1) Register to the hearst website.
2) Download the first data file Modeling_1.csv and put it in the main GraphChi folder.
3) Create a file named Modeling_1.csv:info and put the following two lines in it:
%%MatrixMarket matrix coordinate real general
11 13 400000
4) Run as instructed.

Tuesday, December 11, 2012

Collaborative Filtering - 3rd Generation [or] Winning the KDD CUP in 5 Minutes!

NOTE: This blog post is two years old. We have reimplemented this code as part of Graphlab Create. The implementation in GraphLab Create is preferred since:
1) No input format conversions are needed (like matrix market header setup)
2) No parameter tuning like step sizes and regularization are needed
3) No complicated command line arguments
4) The new implementation is more accurate, especially regarding the validation dataset. 
Anyone who wants to try it out should email me and I will send you the exact same code in Python.


After spending a few years writing collaborative filtering software with thousands of installations, talking to tens of companies, and participating in the KDD CUP twice, I have started to develop some next generation collaborative filtering software. The software is very experimental at this point and I am looking for the help of my readers: universities and companies who would like to try it out.
[NOTE: I HAVE ADDED SOME UPDATES BELOW ON THURSDAY DEC 13]

The problem:

Most collaborative filtering methods (like ALS, SGD, bias-SGD, NMF etc.) use only the rating values for computing the matrix factorization. A few "fancier" methods (like tensor-ALS, time-SVD++ etc.) also utilize temporal information to improve the quality of predictions. So basically we are limited to 2- or 3-dimensional factorization. Typically the utilized data is of the type:
[ user ] [ item ] [ rating ] 
or
[ user ] [ item ] [ time ] [ rating ] 

I am often asked, how to approach problems when you have data of the type:

[ user ] [ item ] [ item category] [ purchase amount ] [ quantity ] [ user age ] [ zip code ] [ time ] [ date ] ... [ user rating ]

In other words, how do we utilize additional information we have about user features, item features, or even fancier features like user friends etc.? This problem is often encountered in practice, and in many cases papers are written about it using problem-specific constructions; see for example Koenigstein's paper. However, in practice most users do not like to break their heads and invent novel algorithms; they want a readily accessible method that can take more features into account without much fine tuning.

The solution:

Following the great success of libFM, I thought about implementing a more general SGD method in GraphChi that can take a list of features into account.

A new SGD based algorithm was developed with the following capabilities:
1) Support for string features ("John Smith bought the Matrix"); see the sketch after this list.
2) Support for dynamic selection of features at runtime.
3) Support for multiple file formats with column permutations.
4) Support for an unlimited number of features.
5) Support for multiple ratings of the same item.
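The string support in item 1 boils down to mapping each string to an integer feature id on the fly; conceptually something like this (my illustration, not the gensgd code):

feature_ids = {}

def feature_id(column, token):
    # map a (column, string value) pair to a dense integer id on first sight;
    # this is what lets 'John Smith' or 'the Matrix' enter the factorization
    # without any manual preprocessing
    key = (column, token)
    if key not in feature_ids:
        feature_ids[key] = len(feature_ids)
    return feature_ids[key]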

Working example - KDD CUP 2012 - track1

To give a concrete example, I will use the KDD CUP 2012 track 1 data, which will demonstrate how easy it is to set up and try the new method.

Preliminaries:
0) Download track 1 data from here. Extract the zip file.
1) Download and install GraphChi using steps 1-3.

2a) In the root graphchi folder, create a file named rec_log_train.txt:info with the following two lines:

%%MatrixMarket matrix coordinate real general
2500000 2500000 73209277


2b) link the file track1/rec_log_train.txt into the root graphchi folder:
cd graphchi
ln -s ../track1/rec_log_train.txt .

Let's look at the input file format:

<49|0>bickson@bigbro6:~/graphchi$ head rec_log_train.txt
2088948 1760350 -1 1318348785
2088948 1774722 -1 1318348785
2088948 786313 -1 1318348785
601635 1775029 -1 1318348785
601635 1902321 -1 1318348785
The input is of the format
[ user ] [ ad ] [ click ] [ timestamp ]

where click is either -1 (not clicked) or 1 (clicked).

First step: regular matrix factorization

Now let's run a quick matrix factorization using user, item and rating:
 ./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt  --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1  --calc_error=1 --file_columns=4 --gensgd_rate0=0.1 --gensgd_rate1=0.1 --gensgd_rate2=0.1 --gensgd_regw=0.1 --gensgd_reg0=0.1 --from_pos=0 --to_pos=1 --val_pos=2

Explanation: --training is the input file. --val_pos=2 means that the rating is in column 2, --rehash=1 means we treat all fields as strings (and thus support string values), --limit_rating means we handle only the first million ratings (to speed up the demo), --max_iter is the number of SGD iterations, --minval and --maxval are the allowed rating range, and --quiet gives less verbose output. --calc_error displays the classification error (how many predictions were wrong). --file_columns=4 says that there are 4 columns in the input file.

And here is the output we get:



WARNING:  common.hpp(print_copyright:204): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com
[training] => [rec_log_train.txt]
[limit_rating] => [1000000]
[max_iter] => [100]
[gensgd_mult_dec] => [0.999999]
[minval] => [-1]
[maxval] => [1]
[quiet] => [1]
[calc_error] => [1]
[file_columns] => [4]
[gensgd_rate0] => [0.1]
[gensgd_rate1] => [0.1]
[gensgd_rate2] => [0.1]
[gensgd_regw] => [0.1]
[gensgd_reg0] => [0.1]
[from_pos] => [0]
[to_pos] => [1]
[val_pos] => [2]

 === REPORT FOR sharder() ===
[Timings]
edata_flush: 1.00698s (count: 265, min: 0.000625s, max: 0.005065, avg: 0.00379992s)
execute_sharding: 2.3855 s
finish_shard.sort: 0.648602s (count: 4, min: 0.156317s, max: 0.166634, avg: 0.162151s)
preprocessing: 1.72782 s
shard_final: 1.78102s (count: 4, min: 0.432858s, max: 0.454368, avg: 0.445255s)
[Other]
app: sharder
   31.7185) Iteration:   0 Training RMSE:   0.526537 Train err:  0.0010427
Step size 1 0.000387365  Step size 2 0.000780633  Step size 3 8.82609e-06
...
   295.691) Iteration:  99 Training RMSE:   0.206428 Train err: 0.000239218

We got a training RMSE of 0.206, and a training classification error of only 0.02% (namely 99.98% of the predictions are correct).

Second step: temporal matrix factorization

Now let's add the time bins (column 3) into the computation as a feature and run again. This is done using the --features=3 command line flag:


./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt  --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1  --calc_error=1 --file_columns=4 --gensgd_rate0=0.1 --gensgd_rate1=0.1 --gensgd_rate2=0.1 --gensgd_regw=0.1 --gensgd_reg0=0.1 --from_pos=0 --to_pos=1 --val_pos=2 --features=3 --gensgd_rate3=0.1 --gensgd_rate4=0.1
WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   3.17175) Iteration:   0 Training RMSE:   0.522901 Train err:   0.033788
...
    284.147) Iteration:  99 Training RMSE:  0.0275943 Train err:          0

By taking the time bins into consideration, we get an improvement from an RMSE of 0.206 to 0.027!!
Furthermore, the classification error is down to zero. 

Third step: let's throw in some user features!

Besides the ad rating data, we have some additional information about the users. The file user_profile.txt holds some properties of each user. Here are a few sample lines:
100044 1899 1 5 831;55;198;8;450;7;39;5;111
100054 1987 2 6 0
100065 1989 1 57 0
100080 1986 1 31 113;41;44;48;91;96;42;79;92;35
100086 1986 1 129 0
100097 1981 1 75 0
100100 1984 1 47 71;51

The file has the following format:
[user ] [ year of birth ] [ gender ] [ number of tweets ] [ tag ids (area of interest) ]
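For reference, the file is easy to parse yourself if you want to inspect it (gensgd reads this file itself, as shown next; this snippet is just for exploring the data):

def parse_user_profile(path):
    users = {}
    for line in open(path):
        user, birth_year, gender, num_tweets, tags = line.split()
        users[user] = (birth_year, gender, num_tweets, tags.split(';'))
    return users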

Adding user features is simply done by the flag --user_file=user_profile.txt

./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1  --calc_error=1 --file_columns=4 --features=3 --last_item=1 --user_file=user_profile.txt
WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   2.02809) Iteration:   0 Training RMSE:          0 Train err:          0
   2.90718) Iteration:   1 Training RMSE:   0.511614 Train err:   0.022662
   3.74655) Iteration:   2 Training RMSE:    0.49371 Train err:   0.017136
   4.55983) Iteration:   3 Training RMSE:   0.479225 Train err:   0.015074
   5.40781) Iteration:   4 Training RMSE:   0.465404 Train err:   0.016538
   6.27764) Iteration:   5 Training RMSE:   0.451063 Train err:   0.015657
...
   77.5867) Iteration:  96 Training RMSE:  0.0177382 Train err:          0
   78.3384) Iteration:  97 Training RMSE:  0.0176325 Train err:          0
   79.0683) Iteration:  98 Training RMSE:  0.0174947 Train err:          0
   79.7872) Iteration:  99 Training RMSE:  0.0174152 Train err:          0

Overall we got another improvement, from an RMSE of 0.0276 to 0.0174.

Step four: throw in some item features

In the KDD CUP data we are also given some item features, in the file item.txt:

2335869 8.1.4.2 412042;974;85658;174033;974;9525;72246;39928;8895;30066;2245;1670;85658;174033;6977;6183;974;85658;174033;974;9525;72246;39928;8895;30066;2245;1670;85658;174033;6977;6183;974
1774844 1.8.3.6 31449;517124;45008;2796;79868;45008;202761;2796;101376;144894;31449;327552;133996;17409;2796;4986;2887;31449;6183;2796;79868;45008;13157;16541;2796;17027;2796;2896;4109;501517;2487;2184;9089;17979;9268;2796;79868;45008;202761;2796;101376;144894;31449;327552;133996;17409;2796;4986;2887;31449;6183;2796;79868;45008;13157;16541;2796;17027;2796;2896;4109;501517;2487;2184;9089;17979;9268

The format is:
[ ad id ] [ category ] [ list of keywords ]


Let's throw some item information into the algorithm. This is done using the --item_file parameter.



bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=0 --features=3 --last_item=1   --quiet=1 --user_file=user_profile.txt --item_file=item.txt --gensgd_rate5=1e-5 --calc_error=1 --file_columns=4

WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   2.23951) Iteration:   0 Training RMSE:          0 Train err:          0
   4.95858) Iteration:   1 Training RMSE:   0.527203 Train err:   0.022205
   7.54827) Iteration:   2 Training RMSE:   0.499881 Train err:   0.022271
    10.026) Iteration:   3 Training RMSE:   0.476596 Train err:   0.024138
   12.4976) Iteration:   4 Training RMSE:   0.454496 Train err:   0.016523
   14.9459) Iteration:   5 Training RMSE:   0.431336 Train err:   0.016406
...
    217.96) Iteration:  96 Training RMSE:  0.0127242 Train err:          0
   220.116) Iteration:  97 Training RMSE:  0.0126185 Train err:          0
   222.317) Iteration:  98 Training RMSE:  0.0125111 Train err:          0
   224.559) Iteration:  99 Training RMSE:  0.0123526 Train err:          0



We got a significant RMSE improvement: from 0.0174 to 0.0124.

Thursday, Dec 13 - An update

I am getting a lot of reader input about this blog post, which is excellent!

One question I got from Xavier Amatriain, manager of recommendations @ Netflix, is why I compute training error and not test error. Xavier is absolutely right; I was quite excited about the results, so I wanted to share them before I even had time to compute the test error. Anyway, I promise to do so in a couple of days. But I am quite sure that the model is quite accurate!

I got some interesting inputs from Tianqi Chen, author of SVDFeature software:
I think one important thing we may want to add is support for classification loss (which is extremely easy for SGD). Nowadays RMSE optimization seems to have gone a bit out of fashion; most data are click-through data and the optimization target is ranking instead of RMSE. I think the feature selection part is quite interesting, since adding junk features to these feature-based factorization models will almost always hamper performance. However, directly placing an L1 constraint on the weights will work worse than L2 regularization, so I am curious what trick you used :-)

I also got comments from my golden collaborator Justin Yan:
1. For SGD-FM, it is hard to tune the parameters like the learning rate, and the MCMC based method is slow.
2. Recently I found another great model, online Bayesian probit regression (adPredictor), which Bing has used in their CTR prediction. This model is an online learning model; it is very fast, and the results are better than logistic regression, so I am thinking about borrowing some ideas from this model to turn libFM into an online learning model.

The last kind of feedback I am getting is from companies who claim to have already solved this problem... I think that if the problem were already completely solved, I would not be getting so much feedback about it.
What do you think?

Next: 3rd generation CF - part 2 - trying the software on the airline on-time data.