Sunday, February 10, 2013

Setting up Java GraphChi development environment - and running sample ALS

As you may know, our GraphChi collaborative filtering toolkit in C++ is becoming more and more popular. Recently, Aapo Kyrola made a great effort to port GraphChi from C++ to Java and to implement more methods on top of it.

In this blog post I explain how to set up the GraphChi Java development environment in Eclipse and run the alternating least squares (ALS) algorithm on a small subset of the Netflix data.
Depending on how much feedback I receive on this post, we will consider porting more methods to Java. So email me if you are interested in trying it out.

Preliminaries - setting up Maven

Download the Maven binary from:
http://maven.apache.org/download.cgi

Extract the tgz file into /usr/local/apache-maven-3.0.4/

Set up the Maven environment:
export M2_HOME=/usr/local/apache-maven-3.0.4
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

Optional:

export MAVEN_OPTS="-Xms256m -Xmx512m"

Note: you must have a Java JDK installed.

Download and install Mercurial from:

http://mercurial.selenic.com/downloads/


Check out GraphChi-Java from:


http://code.google.com/p/graphchi-java/source/checkout


Download Eclipse Classic (Juno) from:

http://www.eclipse.org/downloads/index-developer.php?release=juno



Install the m2e Eclipse plugin from:

http://eclipse.org/m2e/download/

Do this by adding a new software site, as explained here: http://help.eclipse.org/juno/index.jsp?topic=//org.eclipse.platform.doc.user/tasks/tasks-127.htm

Eclipse -> Help -> Install New Software -> Work with: http://download.eclipse.org/technology/m2e/releases
Software name: m2e
Then restart Eclipse.



Import GraphChi Java project into Eclipse

Eclipse -> File -> Import -> Existing Maven Projects ->
Next -> Browse to the graphchi-java project (the path you checked out using Mercurial)


Project -> Build (uncheck Build Automatically first if it is checked).
On the first compilation, Maven will download some plugins.

Verify that the project compiler is set to Java 1.6: right-click the GraphChi Java project root -> Properties -> Java Compiler -> 1.6 (see picture).

The project should now compile without errors.


Run ALS on a subset of the Netflix data

Download the file smallnetflix_mm and put it in your project folder.

Right-click ALSMatrixFactorization.java -> Run As -> Run Configurations and add the command line arguments:
the full path to the downloaded file and the number of shards (1 in this case).

Also set the VM arguments to increase memory. GraphChi requires at least 256 MB (-Xmx256m); a higher value gives better performance.

Press the "Run" button.
A correct run should produce output like this:

9:54:25 AM ALS main - INFO:   Found shards -- no need to preprocess
9:54:25 AM ALS main - INFO:   Set latent factor dimension to: 5
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
9:54:26 AM engine run - INFO:   :::::::: Using 4 execution threads :::::::::
9:54:26 AM ALS beginIteration - INFO:   Initializing latent factors for 96576 vertices
Creating 1 blocks
9:54:26 AM engine run - INFO:   0.672s: iteration: 0, interval: 0 -- 96575
Tried to read past file: 0 --- 772608
9:54:26 AM engine run - INFO:   Subinterval:: 0 -- 96575 (iteration 0)
9:54:26 AM engine run - INFO:   Init vertices...
9:54:27 AM engine run - INFO:   Loading...
9:54:27 AM engine run - INFO:   Loading memshard started. pool-2-thread-1 id=11
9:54:27 AM engine run - INFO:   Memshard: 0 -- 96575
9:54:27 AM engine run - INFO:   Vertices length: 96576
9:54:27 AM memoryshard loadVertices - INFO:   Load memory shard: 0 --- 96575
9:54:27 AM engine run - INFO:   Loading memory-shard finished.pool-2-thread-1
9:54:27 AM engine run - INFO:   Load took: 274ms
9:54:27 AM engine run - INFO:   Update exec: 610 ms.
9:54:27 AM engine run - INFO:   1.793s: iteration: 1, interval: 0 -- 96575
9:54:27 AM engine run - INFO:   Subinterval:: 0 -- 96575 (iteration 1)
9:54:27 AM engine run - INFO:   Init vertices...
9:54:27 AM engine run - INFO:   Loading...
9:54:27 AM engine run - INFO:   Loading memshard started. pool-2-thread-2 id=16
9:54:27 AM engine run - INFO:   Memshard: 0 -- 96575
9:54:27 AM engine run - INFO:   Vertices length: 96576
9:54:27 AM memoryshard loadVertices - INFO:   Load memory shard: 0 --- 96575
9:54:28 AM engine run - INFO:   Loading memory-shard finished.pool-2-thread-2
9:54:28 AM engine run - INFO:   Load took: 163ms
9:54:28 AM engine run - INFO:   Update exec: 391 ms.
9:54:28 AM engine run - INFO:   2.422s: iteration: 2, interval: 0 -- 96575
9:54:28 AM engine run - INFO:   Subinterval:: 0 -- 96575 (iteration 2)
9:54:28 AM engine run - INFO:   Init vertices...
9:54:28 AM engine run - INFO:   Loading...
9:54:28 AM engine run - INFO:   Loading memshard started. pool-2-thread-3 id=17
9:54:28 AM engine run - INFO:   Memshard: 0 -- 96575
9:54:28 AM engine run - INFO:   Vertices length: 96576
9:54:28 AM memoryshard loadVertices - INFO:   Load memory shard: 0 --- 96575
9:54:28 AM engine run - INFO:   Loading memory-shard finished.pool-2-thread-3
9:54:28 AM engine run - INFO:   Load took: 134ms
9:54:29 AM engine run - INFO:   Update exec: 374 ms.
9:54:29 AM engine run - INFO:   2.997s: iteration: 3, interval: 0 -- 96575
9:54:29 AM engine run - INFO:   Subinterval:: 0 -- 96575 (iteration 3)
9:54:29 AM engine run - INFO:   Init vertices...
9:54:29 AM engine run - INFO:   Loading...
9:54:29 AM engine run - INFO:   Loading memshard started. pool-2-thread-4 id=18
9:54:29 AM engine run - INFO:   Memshard: 0 -- 96575
9:54:29 AM engine run - INFO:   Vertices length: 96576
9:54:29 AM memoryshard loadVertices - INFO:   Load memory shard: 0 --- 96575
9:54:29 AM engine run - INFO:   Loading memory-shard finished.pool-2-thread-4
9:54:29 AM engine run - INFO:   Load took: 170ms
9:54:29 AM engine run - INFO:   Update exec: 398 ms.
9:54:29 AM engine run - INFO:   3.5820000000000003s: iteration: 4, interval: 0 -- 96575
9:54:29 AM engine run - INFO:   Subinterval:: 0 -- 96575 (iteration 4)
9:54:29 AM engine run - INFO:   Init vertices...
9:54:29 AM engine run - INFO:   Loading...
9:54:29 AM engine run - INFO:   Loading memshard started. pool-2-thread-1 id=11
9:54:29 AM engine run - INFO:   Memshard: 0 -- 96575
9:54:29 AM engine run - INFO:   Vertices length: 96576
9:54:29 AM memoryshard loadVertices - INFO:   Load memory shard: 0 --- 96575
9:54:29 AM engine run - INFO:   Loading memory-shard finished.pool-2-thread-1
9:54:29 AM engine run - INFO:   Load took: 117ms
9:54:30 AM engine run - INFO:   Update exec: 505 ms.
9:54:30 AM engine run - INFO:   Engine finished in: 4.2620000000000005 secs.
9:54:30 AM engine run - INFO:   Updates: 482880
9:54:30 AM ALS main - INFO:   Train RMSE: 0.7323246277805968, total edges:900817
9:54:31 AM ALS writeOutputMatrices - INFO:   Latent factor matrices saved: /Users/bickson/Downloads/smallnetflix_mm_U.mm, /Users/bickson/Downloads/smallnetflix_mm_V.mm
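
The "Train RMSE" line in the log is the root mean squared error over the training ratings. As a rough illustration of what that number measures, here is a minimal sketch, assuming each rating is predicted as the dot product of a user factor vector and an item factor vector (the usual matrix factorization setup). The class and method names are hypothetical and are not taken from the GraphChi code.

```java
// Illustrative sketch only: names are hypothetical, not GraphChi's API.
class RmseSketch {

    // Predicted rating: dot product of a user's and an item's latent factors.
    static double predict(double[] userFactors, double[] itemFactors) {
        double sum = 0.0;
        for (int k = 0; k < userFactors.length; k++) {
            sum += userFactors[k] * itemFactors[k];
        }
        return sum;
    }

    // Root mean squared error between predicted and observed ratings.
    static double rmse(double[] predicted, double[] observed) {
        double squaredError = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - observed[i];
            squaredError += diff * diff;
        }
        return Math.sqrt(squaredError / predicted.length);
    }

    public static void main(String[] args) {
        // Toy factors and a single toy rating, made up for this example.
        double[] user = {1.0, 2.0};
        double[] item = {0.5, 1.5};
        double[] predicted = {predict(user, item)};
        double[] observed = {4.0};
        System.out.println("Train RMSE: " + rmse(predicted, observed));
    }
}
```

A lower value means the factors reproduce the training ratings more closely; it says nothing about accuracy on held-out ratings.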



Known errors:

Exception in thread "main" java.io.FileNotFoundException: ~/Downloads/smallnetflix_mm.shovel.0 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
at edu.cmu.graphchi.preprocessing.FastSharder.<init>(FastSharder.java:113)
at edu.cmu.graphchi.apps.ALSMatrixFactorization.createSharder(ALSMatrixFactorization.java:176)
at edu.cmu.graphchi.apps.ALSMatrixFactorization.main(ALSMatrixFactorization.java:198)
Solution: give a full absolute path to your file, for example /home/bickson/Downloads/smallnetflix_mm. The ~ shorthand is expanded by the shell, not by Java, so it does not work in an Eclipse run configuration.
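
If you prefer to guard against this in your own launcher code, the following is a minimal sketch of a helper that turns a path starting with ~ or a relative path into an absolute one before it is passed on. The class InputPath and the method toAbsolute are hypothetical names made up for this example; they are not part of GraphChi.

```java
import java.io.File;

// Hypothetical helper, not part of GraphChi.
class InputPath {

    // Expands a leading "~" to the user's home directory and makes the path absolute.
    static String toAbsolute(String path) {
        String home = System.getProperty("user.home");
        if (path.equals("~")) {
            return new File(home).getAbsolutePath();
        }
        if (path.startsWith("~/")) {
            return new File(home, path.substring(2)).getAbsolutePath();
        }
        // Relative paths are resolved against the current working directory.
        return new File(path).getAbsolutePath();
    }

    public static void main(String[] args) {
        String input = args.length > 0 ? args[0] : "~/Downloads/smallnetflix_mm";
        System.out.println(toAbsolute(input));
    }
}
```

The sketch only uses java.io.File, so it also compiles under Java 1.6.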

Error:
Exception in thread "main" java.lang.IllegalArgumentException: Java Virtual Machine has only 32489472bytes maximum memory. Please run the JVM with at least 256 megabytes of memory using -Xmx256m. For better performance, use higher value
at edu.cmu.graphchi.engine.GraphChiEngine.<init>(GraphChiEngine.java:120)
at edu.cmu.graphchi.apps.ALSMatrixFactorization.main(ALSMatrixFactorization.java:215)
Solution: increase the JVM memory limit in the VM arguments of the run configuration, as explained above (for example -Xmx256m or higher).
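
To check what limit your run configuration actually gives the JVM, you can print it with the standard Runtime.maxMemory() call. This is a minimal sketch; the 256 MB threshold is taken from the error message above, while the class name MemoryCheck and the exact comparison are assumptions and may differ from what the engine does internally.

```java
// Hypothetical check, not GraphChi code; the threshold comes from the error message.
class MemoryCheck {

    static final long MIN_BYTES = 256L * 1024 * 1024;

    // True if the given maximum heap size meets the 256 MB minimum.
    static boolean hasEnoughMemory(long maxBytes) {
        return maxBytes >= MIN_BYTES;
    }

    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("JVM maximum memory: " + maxBytes + " bytes");
        if (!hasEnoughMemory(maxBytes)) {
            System.out.println("Too low: add a larger -Xmx value to the VM arguments.");
        }
    }
}
```

Some JVMs report slightly less than the -Xmx value, so it is safer to set the limit comfortably above the minimum.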

12 comments:

  1. Hi Danny,
    I was able to set up and run the Java version on Windows 7. It ran perfectly, although I have a slow 32-bit x86 machine with 2 GB of RAM.
    Here is the final output:
    12:29:15 AM engine run - INFO: Engine finished in: 70.42 secs.
    12:29:15 AM engine run - INFO: Updates: 495445
    12:29:15 AM ALS main - INFO: Train RMSE: 0.80984909266858, total edges:3298163
    12:29:20 AM ALS writeOutputMatrices - INFO: Latent factor matrices saved: C:\ARVista01\data2012\MySoftwareProjects\DataScience\MachineLearning\CMU\GraphChi\Data\smallnetflix_mm.txt_U.mm, C:\ARVista01\data2012\MySoftwareProjects\DataScience\MachineLearning\CMU\GraphChi\Data\smallnetflix_mm.txt_V.mm

    It would be nice if all the algorithms in the C++ version were ported to Java. Java has a much bigger audience.
    I am willing to help with porting to Java and/or testing.
    thanks
    Al

    Replies
    1. Thanks Al for your kind note. We would love to get any help we can. Let me try to port some algorithm and have you help us test it.

      Best,

  2. Sure. I will be glad to help in any way I can. I can also participate in code review (maybe design and implantation). I have a strong software background (Java, C/C++, Python) on Windows (and to a lesser degree on Linux).

  3. Hi,
    Which Java SDK for Ubuntu do you recommend:
    Oracle Java 6 (or 7) latest
    OpenJDK 6 (or 7)
    I am using Oracle Java 7 on Win 7
    P.S.: In my last post "implantation" is a typo! I meant implementation!

  4. Why does it take much more time on my PC?
    5:13:57 PM engine run - INFO: Engine finished in: 67.946 secs.
    5:13:57 PM engine run - INFO: Updates: 495445
    5:13:57 PM ALS main - INFO: Train RMSE: 0.804810381972627, total edges:3298163

    as compared to Danny's:
    9:54:30 AM engine run - INFO: Engine finished in: 4.2620000000000005 secs.
    I am using Win 7 with 2 AMD cores at 1.67 GHz each and 2 GB of memory.
    I guess RAM is the determining factor?

    Replies
    1. Don't worry about it. When I ran it, I did not notice that my input file was truncated to about 1/4 of the full size, so you should multiply my runtime by about 4 to get a comparable figure.

    2. Actually, on a faster PC (Win 7 with 8 cores), using the whole file, it ran in 7.115 seconds, which is very close to your results!


      3:18:21 PM memoryshard loadVertices - INFO: Load memory shard: 0 --- 98350
      3:18:22 PM engine run - INFO: Loading memory-shard finished.pool-2-thread-1
      3:18:22 PM engine run - INFO: Load took: 275ms
      3:18:22 PM engine run - INFO: Update exec: 737 ms.
      3:18:22 PM engine run - INFO: Subinterval:: 98351 -- 99088 (iteration 4)
      3:18:22 PM engine run - INFO: Init vertices...
      3:18:22 PM engine run - INFO: Loading...
      3:18:22 PM engine run - INFO: Loading memshard started. pool-2-thread-2 id=19
      3:18:22 PM engine run - INFO: Memshard: 98351 -- 99088
      3:18:22 PM engine run - INFO: Vertices length: 738
      3:18:22 PM memoryshard loadVertices - INFO: Load memory shard: 98351 --- 99088
      3:18:22 PM engine run - INFO: Loading memory-shard finished.pool-2-thread-2
      3:18:22 PM engine run - INFO: Load took: 85ms
      3:18:22 PM engine run - INFO: Update exec: 74 ms.
      3:18:22 PM engine run - INFO: Engine finished in: 7.115 secs.
      3:18:22 PM engine run - INFO: Updates: 495445
      3:18:22 PM ALS main - INFO: Train RMSE: 0.8100755498999634, total edges:3298163
      3:18:23 PM ALS writeOutputMatrices - INFO: Latent factor matrices saved: C:\AARW701\data\AR\TOSH2013_01\SW\EWS\GraphChiJava\DataSets\smallnetflix_mm.txt_U.mm, C:\AARW701\data\AR\TOSH2013_01\SW\EWS\GraphChiJava\DataSets\smallnetflix_mm.txt_V.mm

  5. You're welcome. Is there any performance model as a function of the number of CPUs, the power of each, and RAM?

    More specifically I use a PC with:
    Operating System: Windows 7 Professional 64-bit (6.1, Build
    Processor: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz (8 CPUs), ~2.2GHz
    Memory: 8192MB RAM
    Available OS Memory: 8100MB RAM
    Page File: 2943MB used, 13254MB available

  6. Thanks for the tutorial. I followed the steps but get an error when I run it.

    Exception in thread "main" java.io.FileNotFoundException: /home/Data/smallnetflix_mm.txt.shovel.0 (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:104)
    at edu.cmu.graphchi.preprocessing.FastSharder.<init>(FastSharder.java:115)
    at edu.cmu.graphchi.apps.SmokeTest.createSharder(SmokeTest.java:101)
    at edu.cmu.graphchi.apps.SmokeTest.main(SmokeTest.java:125)

    In FastSharder, it tries to retrieve a file that has not been created. Do you know what may be causing this error?

    Replies
    1. Can you please specify the full command line arguments you are using?

      Thanks
