Monday, February 28, 2011

The GraphLab large scale machine learning framework - installation on MAC OSX 10.6

GraphLab is an open source large scale parallel machine learning framework.

I was asked by Dan, an avid reader of this blog, to supply installation instructions for MAC OSX Snow Leopard 10.6. Installation is quite simple.

1) Install cmake using the following link: cmake 2.8.4
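If you prefer a package manager over the dmg installer, cmake can also be installed via MacPorts or Homebrew (just an alternative suggestion, assuming one of them is already set up). Either way, verify the installation afterwards:
sudo port install cmake      # MacPorts alternative (or: brew install cmake)
cmake --version              # verify the installed version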

2) Download and build the GraphLab code, where XXXX is the latest version number you can find here:
wget http://graphlab.org/release/graphlabapi_v1_XXXX.tar.gz
tar xvzf graphlabapi_v1_XXXX.tar.gz
cd graphlabapi
./configure --bootstrap 
cd debug #or equivalently cd release
make -j 4
cd tests
./runtests.sh

Anyone who tries it out - update me if it went smoothly!

NOTE: We now support the Eigen linear algebra package. It can be installed using the command:
./configure --bootstrap --eigen
However, on Mac OS, gcc 4.5 is required. See the explanation here.

NOTE2: The current GraphLab Mac setup defaults to gcc-4.2. If you would like to use your default compiler instead, comment out the first few lines of the file CMakeLists.txt in the root graphlab folder, namely the lines:
if(APPLE)
  set(CMAKE_C_COMPILER "gcc-4.2")
  set(CMAKE_CXX_COMPILER "c++-4.2")
endif(APPLE)

Friday, February 25, 2011

Mahout on Amazon EC2 - part 5 - installing Hadoop/Mahout on high performance instance (CentOS/RedHat)

This post explains how to install the Mahout ML framework on top of Amazon EC2 (a CentOS/RedHat based machine).
The notes are based on the older Mahout notes https://cwiki.apache.org/MAHOUT/mahout-on-amazon-ec2.html, which are unfortunately outdated.

Note: part 1 of this series explains how to perform the same installation on top of an Ubuntu based machine.

The full procedure should take around 2-3 hours... :-(

1) Start a high-performance instance from the Amazon AWS console:
CentOS AMI ID: ami-7ea24a17 (x86_64)
Name:  Basic Cluster Instances HVM CentOS 5.4
Description:  Minimal CentOS 5.4, 64-bit architecture, and HVM-based virtualization for use with Amazon EC2 Cluster Instances.

2) Log into the instance (right-click on the running instance in the AWS console).

3) Install some required packages:
sudo yum update
sudo yum upgrade
sudo yum install python-setuptools
sudo easy_install "simplejson"

4) Install boto (unfortunately I was not able to install it using easy_install directly)
wget http://boto.googlecode.com/files/boto-1.8d.tar.gz
tar xvzf boto-1.8d.tar.gz
cd boto-1.8d
sudo easy_install .

5) Install maven2 (unfortunately I was not able to install it using yum)
wget http://www.trieuvan.com/apache/maven/binaries/apache-maven-2.2.1-bin.tar.gz
tar xvzf apache-maven-2.2.1-bin.tar.gz
cp -R apache-maven-2.2.1 /usr/local/
ln -s /usr/local/apache-maven-2.2.1/bin/mvn /usr/local/bin/

6) Download and install Hadoop
wget http://apache.cyberuse.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz   
tar vxzf hadoop-0.20.2.tar.gz  
sudo  mv hadoop-0.20.2 /usr/local/

add the following to $HADOOP_HOME/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jre-openjdk/  
# The maximum amount of heap to use, in MB. Default is 1000  
export HADOOP_HEAPSIZE=2000  

add the following to $HADOOP_HOME/conf/core-site.xml and also $HADOOP_HOME/conf/mapred-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>   <property>
<name>mapred.job.tracker</name>  
<value>localhost:9001</value>
</property>
 <property>  
<name>dfs.replication</name>  
 <value>1</value>      
  </property>
</configuration>


Edit the file hdfs-site.xml

<configuration>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/home/data/tmp/</value>
 </property>
<property>
 <name>dfs.data.dir</name>
 <value>/home/data/tmp2/</value>
</property>
<property>
 <name>dfs.name.dir</name>
 <value>/home/data/tmp3/</value>
</property>
</configuration>

Note: the directory /home/data does not exist, and you will have to create it
when starting the instance using the commands:
# mkdir -p /home/data  
# mount -t ext3 /dev/sdb /home/data/ 
The reason for this setup is that the root dir has only 10GB, while /dev/sdb
has 800GB.

7) Set up authorized keys for localhost login without passwords (the name node is formatted in step 10):
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
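Before moving on, it is worth confirming that the passwordless login actually works (a generic sanity check, not part of the original notes):
# ssh localhost     (should connect without asking for a password; type exit to return to the original shell)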

    8) Check out and build Mahout from trunk. Alternatively, you can upload a Mahout release tarball and install it as we did with the Hadoop tarball (don't forget to update your .profile accordingly).

    # svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
    # cd mahout
    # mvn clean install
    # cd ..
    # sudo mv mahout /usr/local/mahout-0.4
    


    9) Add the following to your .profile:
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
    export HADOOP_HOME=/usr/local/hadoop-0.20.2
    export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
    export MAHOUT_HOME=/usr/local/mahout-0.4/
    export MAHOUT_VERSION=0.4-SNAPSHOT
    export MAVEN_OPTS=-Xmx1024m
    

    Verify that the paths in .profile point to the exact versions you downloaded.
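    After editing .profile, reload it in the current shell and sanity-check the variables (a generic shell check):
    source ~/.profile
    echo $JAVA_HOME $HADOOP_HOME $MAHOUT_HOME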

    10) Run Hadoop, just to prove you can, and test Mahout by building the Reuters dataset on it. Finally, delete the files and shut it down.

    # $HADOOP_HOME/bin/hadoop namenode -format
    $HADOOP_HOME/bin/start-all.sh
    jps     // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
    cd $MAHOUT_HOME
    ./examples/bin/build-reuters.sh
    $HADOOP_HOME/bin/stop-all.sh
    rm -rf /tmp/*   // delete the Hadoop files


    11) Remove the single-host settings you added to $HADOOP_HOME/conf/core-site.xml and $HADOOP_HOME/conf/mapred-site.xml in step 6 and verify you are happy with the other conf file settings. The Hadoop startup scripts will not make any changes to them. In particular, increasing the Java heap size is required for many of the Mahout jobs.
    // edit $HADOOP_HOME/conf/mapred-site.xml to include the following:
    <property>
       <name>mapred.child.java.opts</name>
       <value>-Xmx2000m</value>
    </property>

    12) Allow Hadoop to run even if you will later work from a different EC2 machine:
    echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
    

    13) Now bundle the image.
    Using the Amazon AWS console, select the running instance, right-click and choose bundle EBS image. Enter an image name and description. The machine will reboot and the image will be created.
    Thursday, February 24, 2011

    Some thoughts about accuracy of Mahout's SVD

    I was testing Mahout's SVD code and I encountered some subtleties.
    I wonder if I am missing anything or is there a bug in the code?

    1) The ordering of the eigenvalues was the opposite of the eigenvectors. But this has hopefully been fixed by now in patch-369.
    2) When requesting a rank of 4, we get 3 eigenvalues... So it seems that the rank is always lower by one.
    3) There are two transformations which make comparison of the results with Matlab (or pen and paper) harder:

    a) The scaleFactor. Defined in: ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosState.java
    I quote a documentation remark in ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java:48:
    /** To avoid floating point overflow problems which arise in power-methods like Lanczos, an initial pass is made
    through the input matrix to generate a good starting seed vector by summing all the rows of the input matrix, and
    compute the trace(inputMatrix^t * matrix). This latter value, being the sum of all of the singular values, is used
    to rescale the entire matrix, effectively forcing the largest singular value to be strictly less than one, and
    transforming floating point overflow problems into floating point underflow (ie, very small singular values will
    become invisible, as they will appear to be zero and the algorithm will terminate). */
    

    b) The second transformation is orthogonalization of the resulting vector. This step is optional (IMHO).
    see: ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java:118
    The function call is: orthoganalizeAgainstAllButLast(nextVector, state);
    Again I quote from the documentation:
    /** This implementation uses {@link org.apache.mahout.math.matrix.linalg.EigenvalueDecomposition} to do the
    eigenvalue extraction from the small (desiredRank x desiredRank) tridiagonal matrix. Numerical stability is
    achieved via brute-force: re-orthogonalization against all previous eigenvectors is computed after every pass.
    This can be made smarter if (when!) this proves to be a major bottleneck. Of course, this step can be
    parallelized as well. */

    If anyone wants to reproduce my test, you can add the function testLanczosSolver2() to TestLanczosSolver.java (code below).
    1) To run it, you need first to comment the line:
    //nextVector.assign(new Scale(1 / scaleFactor));
    in LanczosSolver.java, so it is easier to compare the results to Matlab, without the scaling.
    2) You need to also comment the line:
    //orthoganalizeAgainstAllButLast(nextVector, basis);
    in LanczosSolver.java

    The factorized matrix is:
    >> A
    
    3.1200 -3.1212 -3.0000
    -3.1110 1.5000 2.1212
    -7.0000 -8.0000 -4.0000
    
    The eigenvalues are:
    >> [a,b]=eig(A'*A)
    
    a =
    
    0.2132 -0.8010 -0.5593
    -0.5785 0.3578 -0.7330
    0.7873 0.4799 -0.3871
    
    b =
    
    0.0314 0 0
    0 42.6176 0
    0 0 131.2553
    

    Now I run the unit test testLanczosSolver2 and I get:
    INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
    Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
    INFO: Eigenvector 0 found with eigenvalue 131.25526355941963
    Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
    INFO: Eigenvector 1 found with eigenvalue 42.61761063477249
    Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
    INFO: Eigenvector 2 found with eigenvalue 0.03137295830779152
    Feb 9, 2011 1:25:36 PM org.slf4j.impl.JCLLoggerAdapter info
    INFO: LanczosSolver finished.
    

    As you can see the eigenvalues are correct.

    @Test
    public void testLanczosSolver2() throws Exception {
    int numRows = 3;
    int numColumns = 3;
    SparseRowMatrix m = new SparseRowMatrix(new int[]{numRows, numColumns});
    /**
    
        * 3.1200 -3.1212 -3.0000
          -3.1110 1.5000 2.1212
          -7.0000 -8.0000 -4.0000
    
    */
    m.set(0,0,3.12);
    m.set(0,1,-3.12121);
    m.set(0,2,-3);
    m.set(1,0,-3.111);
    m.set(1,1,1.5);
    m.set(1,2,2.12122);
    m.set(2,0,-7);
    m.set(2,1,-8);
    m.set(2,2,-4);
    
    int rank = 4;
    Matrix eigens = new DenseMatrix(rank, numColumns);
    long time = timeLanczos(m, eigens, rank, false);
    assertTrue("Lanczos taking too long! Are you in the debugger? ", time < 10000);
    }
    
    Update, June 1st, 2011: GraphLab now also has an efficient SVD Lanczos solver. Some performance benchmarks are found here:

    The GraphLab machine learning framework on Amazon EC2 - part 2 - testing

    This page has moved.

    For linear solver example applications see: http://graphlab.org/gabp.html

    For matrix factorization example applications see http://graphlab.org/pmf.html

    For clustering example applications see http://graphlab.org/clustering.html

    The GraphLab large scale machine learning framework - part 1 - installation on Amazon EC2

    GraphLab is an open source large scale parallel machine learning framework.

    This post explains how to install GraphLab on Amazon EC2 and how to load GraphLab from a preinstalled EC2 image.

    Loading Graphlab from one of our publicly available images:
    1) Follow the directions on http://graphlab.org/download.html to check out the latest available AMI. Launch the selected image using Amazon AWS Console.

    NOTE: The images are available in the US-EAST region.
    TIP: Don't forget to allow SSH (TCP port 22) in the default security group.
    (In the AWS console go to EC2 -> Network & Security -> Security Groups and verify that SSH is allowed in the default security group, or in the security group you are using.)

    2) After launching the AMI instance, it is always desirable to get the latest
    GraphLab using the commands:
    cd graphlabapi
    hg pull
    hg update
    ./configure
    cd release
    make -j4
    

    3) It is also useful to run the unit tests to verify that the update went fine:
    cd tests/
    ./runtests.sh
    

    GraphLab installation instructions

    NOTE: The instructions below are intended for advanced users, in case you did not find the required AMI among our public AMI images.

    Installation instructions have moved to the download page.
    Select the icon matching your operating system for detailed instructions.

    Monday, February 21, 2011

    Large scale matrix factorization using alternating least squares: which is better - GraphLab or Mahout?

    For the last couple of weeks I have been comparing the performance of GraphLab vs. Mahout on alternating least squares using the Netflix data. As a reminder, GraphLab is the parallel machine learning system we are building at CMU.

    Initial results are encouraging. Mahout's alternating least squares implementation by Sebastian Schelter was tested on Amazon EC2, using two m2.2xlarge nodes (13x2 virtual cores).

    For 10 iterations with number of features=20 and lambda=0.065, it takes 39272 seconds, while the GraphLab implementation in C++ takes only 714 seconds (on a machine with 8 cores).

    The running times should be taken with a grain of salt, since I was not using the exact same machines, but the order-of-magnitude difference will certainly hold even if I run GraphLab on EC2 (which I plan to do soon).

    Regarding accuracy, Mahout ALS reached a test RMSE of 0.9310, while GraphLab obtained a slightly better 0.9279.

    Here is Mahout ALS final output: (of the RMSE computation)
    ubuntu@ip-10-115-27-222:/mnt$ /usr/local/mahout-0.4/bin/mahout evaluateALS --probes /user/ubuntu/myout/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/ | grep RMSE
    11/02/17 12:31:42 WARN driver.MahoutDriver: No evaluateALS.props found on classpath, will use command-line arguments only
    11/02/17 12:31:42 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --itemFeatures=/tmp/als/out/M/, --probes=/user/ubuntu/myout/probeSet/, --startPhase=0, --tempDir=temp, --userFeatures=/tmp/als/out/U/}
    RMSE: 0.9310729597725026, MAE: 0.7298745910296568
    11/02/17 12:31:55 INFO driver.MahoutDriver: Program took 12437 ms
    

    Here is the GraphLab output:
    bickson@biggerbro:~/newgraphlab/graphlabapi/debug/apps/pmf$ ./PMF netflix-r 10 0 --D=20 --max_iter=10 --lambda=0.065 --ncpus=8
    setting run mode 0
    INFO   :pmf.cpp(main:1121): PMF starting
    
    loading data file netflix-r
    Loading netflix-r train
    Creating 99072112 edges...
    ................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................loading data file netflix-re
    Loading netflix-re test
    Creating 1408395 edges...
    ........setting regularization weight to 0.065
    PTF_ALS for matrix (480189, 17770, 27):99072112.  D=20
    pU=0.065, pV=0.065, pT=1, muT=1, D=20
    nuAlpha=1, Walpha=1, mu=0, muT=1, nu=20, beta=1, W=1, WT=1 BURN_IN=10
    complete. Obj=6.83664e+08, TEST RMSE=3.7946.
    INFO   :asynchronous_engine.hpp(run:56): Worker 0 started.
    
    ...
    
    INFO   :asynchronous_engine.hpp(run:56): Worker 7 started.
    
    Entering last iter with 1
    228.524) Iter ALS 1  Obj=2.60675e+08, TRAIN RMSE=2.2904 TEST RMSE=0.9948.
    Entering last iter with 2
    289.594) Iter ALS 2  Obj=6.48921e+07, TRAIN RMSE=1.1400 TEST RMSE=0.9573.
    Entering last iter with 3
    350.487) Iter ALS 3  Obj=4.75073e+07, TRAIN RMSE=0.9754 TEST RMSE=0.9444.
    Entering last iter with 4
    411.551) Iter ALS 4  Obj=4.09914e+07, TRAIN RMSE=0.9063 TEST RMSE=0.9381.
    Entering last iter with 5
    472.615) Iter ALS 5  Obj=3.79096e+07, TRAIN RMSE=0.8718 TEST RMSE=0.9348.
    Entering last iter with 6
    533.039) Iter ALS 6  Obj=3.61298e+07, TRAIN RMSE=0.8513 TEST RMSE=0.9324.
    Entering last iter with 7
    594.177) Iter ALS 7  Obj=3.50076e+07, TRAIN RMSE=0.8382 TEST RMSE=0.9305.
    Entering last iter with 8
    654.41) Iter ALS 8  Obj=3.42655e+07, TRAIN RMSE=0.8294 TEST RMSE=0.9290.
    Entering last iter with 9
    714.095) Iter ALS 9  Obj=3.37535e+07, TRAIN RMSE=0.8234 TEST RMSE=0.9279.
    INFO   :asynchronous_engine.hpp(run:66): Worker 6 finished.
    
    ...
    
    INFO   :asynchronous_engine.hpp(run:66): Worker 2 finished.
    

    Sunday, February 20, 2011

    Installing BLAS/Lapack/ITPP on Amazon EC2/Ubuntu Linux

    BLAS/LAPACK are efficient matrix math libraries. The following instructions explain how to install them on Amazon EC2 (the Ubuntu Maverick version and Amazon Linux). IT++ (itpp) is a popular C++ wrapper for BLAS/LAPACK.

    DISCLAIMER: The instructions below are for 64-bit machines. For 32-bit machines follow these instructions instead: http://bickson.blogspot.com/2011/06/graphlab-pmf-on-32-bit-linux.html

    FOR LAZY READERS:
    Just use Amazon EC2 public image ami-c21eedab (Ubuntu)

    INSTALLATION VIA YUM/APT-GET
    Try to install itpp using the following command:
    sudo yum install libitpp-dev
    
    Or
    sudo apt-get install libitpp-dev
    
    TIP: You may also want to install libitpp7-dbg using the yum/apt-get command.
    It is not mandatory, but it helps debugging when you link against libitpp_debug.so
    (instead of libitpp.so).
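    For example, on an Ubuntu/apt-get based machine (the package name may differ on yum-based distributions):
    sudo apt-get install libitpp7-dbg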

    If the above worked then we are done. If not, you will need to follow the
    instructions below. Thanks to Udi Weinsberg for this tip.

    FOR ADVANCED USERS:


    0) Start with an Ubuntu image such as ami-641eed0d, or an Amazon Linux AMI image.

    1) Install required packages. For Ubuntu:
    sudo apt-get install --yes --force-yes automake autoconf libtool* gfortran
    For Amazon Linux:
    sudo yum install -y automake autoconf libtool* gcc-gfortran

    2) Install LAPACK.
    Here again there are two options:

    The easy way is to simply install it from the package manager (on Ubuntu):
    sudo apt-get install --yes --force-yes liblapack-dev
    On Amazon Linux:
    sudo yum install -y lapack-devel blas-devel

    Thanks Akshay Bhat from Cornell for this tip!
    If the liblapack setup was successful, go to step 3.

    If the above command DOES NOT work for you (depends on your OS and setup) you will need to install lapack manually. The procedure is explained in steps a-c below.

    a) Download and prepare the code
    wget http://www.netlib.org/lapack/lapack.tgz
    tar xvzf lapack.tgz
    cd lapack-3.3.0  # if the version number changes, adjust the directory name accordingly
    mv make.inc.example make.inc
    

    b) Edit make.inc and add the -m64 -fPIC flags to the Fortran compiler options
    (the FORTRAN, OPTS, NOOPT and LOADER lines):
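    As a sketch (variable names taken from the stock make.inc.example shipped with LAPACK; your version may differ slightly), the edited lines would look roughly like:
    FORTRAN  = gfortran -m64 -fPIC
    OPTS     = -O2 -m64 -fPIC
    NOOPT    = -O0 -m64 -fPIC
    LOADER   = gfortran -m64 -fPIC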

    c) compile
    make blaslib
    make
    
    If everything went OK, tests will run for a couple of minutes
    and the files blas_LINUX.a and lapack_LINUX.a will be created in the main directory.

    3) setup LDFLAGS
    export LDFLAGS="-L/usr/lib -lgfortran"

    4) Download and install itpp:
    wget http://sourceforge.net/projects/itpp/files/itpp/4.2.0/itpp-4.2.tar.gz
    tar xvzf itpp-4.2.tar.gz
    cd itpp-4.2
    ./autogen.sh 
    If you installed Lapack from yum/apt-get, you should use the following command:
    ./configure --without-fft --with-blas=/usr/lib64/libblas.so --with-lapack=/usr/lib64/liblapack.so --enable-debug CFLAGS=-fPIC CXXFLAGS=-fPIC CPPFLAGS=-fPIC
    Where /usr/lib64/ is the place where lapack was installed.

    If you installed lapack from source, use the following command
    ./configure --without-fft --with-blas=/home/ubuntu/lapack-3.3.0/blas_LINUX.a --with-lapack=/home/ubuntu/lapack-3.3.0/lapack_LINUX.a CFLAGS=-fPIC CXXFLAGS=-fPIC CPPFLAGS=-fPIC

    make
    sudo make install
    
    Note: If you installed LAPACK from yum/apt-get, don't forget to add the -lblas -llapack linker flags when you compile against LAPACK/BLAS.
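    For example, a link line for a program using itpp might look roughly like the following sketch (myprog.cpp is a placeholder; the exact set of libraries depends on how itpp and LAPACK were built):
    g++ myprog.cpp -o myprog `itpp-config --cflags` `itpp-config --libs` -llapack -lblas -lgfortran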

    Verifying installation
    To verify that the installation went OK, run the following commands:
    itpp-config --cflags
    itpp-config --libs
    
    1) The itpp-config command should be available from the shell.
    2) The right installation paths should appear in the output.

    Known issues you may encounter:
    Problem:
    *** Warning: Linking the shared library libitpp.la against the
    *** static library /usr/lib64/libblas.a is not portable!
    libtool: link: g++ -shared -nostdlib /usr/lib/gcc/x86_64-amazon-linux/4.4.4/../../../../lib64/crti.o /usr/lib/gcc/x86_64-amazon-linux/4.4.4/crtbeginS.o  -Wl,--whole-archive ../itpp/base/.libs/libbase.a ../itpp/stat/.libs/libstat.a ../itpp/comm/.libs/libcomm.a ../itpp/fixed/.libs/libfixed.a ../itpp/optim/.libs/liboptim.a ../itpp/protocol/.libs/libprotocol.a ../itpp/signal/.libs/libsignal.a ../itpp/srccode/.libs/libsrccode.a -Wl,--no-whole-archive  -L/usr/lib64/ /usr/lib64/liblapack.a /usr/lib64/libblas.a -lgfortranbegin -lgfortran -L/usr/lib/gcc/x86_64-amazon-linux/4.4.4 -L/usr/lib/gcc/x86_64-amazon-linux/4.4.4/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/usr/lib/gcc/x86_64-amazon-linux/4.4.4/../../.. -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/x86_64-amazon-linux/4.4.4/crtendS.o /usr/lib/gcc/x86_64-amazon-linux/4.4.4/../../../../lib64/crtn.o    -Wl,-soname -Wl,libitpp.so.7 -o .libs/libitpp.so.7.0.0
    /usr/bin/ld: /usr/lib64/liblapack.a(dgees.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
    /usr/lib64/liblapack.a: could not read symbols: Bad value
    collect2: ld returned 1 exit status
    make[2]: *** [libitpp.la] Error 1
    
    Solution: it seems that LAPACK was statically compiled without the -fPIC option and thus itpp refuses to link against it. Follow steps 2a-2c to install LAPACK manually with the -fPIC option.

    Problem:
    make[1]: *** Waiting for unfinished jobs....
    [ 83%] Building CXX object src/graphlab/CMakeFiles/
    
    graphlab_pic.dir/distributed2/distributed_scheduler_list.o /usr/local/lib/libitpp.so: undefined reference to `zgesv_' /usr/local/lib/libitpp.so: undefined reference to `dorgqr_' /usr/local/lib/libitpp.so: undefined reference to `dswap_' /usr/local/lib/libitpp.so: undefined reference to `dgeqp3_' /usr/local/lib/libitpp.so: undefined reference to `dpotrf_' /usr/local/lib/libitpp.so: undefined reference to `dgemm_' /usr/local/lib/libitpp.so: undefined reference to `zungqr_' /usr/local/lib/libitpp.so: undefined reference to `zscal_' /usr/local/lib/libitpp.so: undefined reference to `dscal_' /usr/local/lib/libitpp.so: undefined reference to `dgesv_' /usr/local/lib/libitpp.so: undefined reference to `dgetri_' /usr/local/lib/libitpp.so: undefined reference to `zgemm_' /usr/local/lib/libitpp.so: undefined reference to `zposv_' /usr/local/lib/libitpp.so: undefined reference to `zgetri_' /usr/local/lib/libitpp.so: undefined reference to `dgeev_' /usr/local/lib/libitpp.so: undefined reference to `zgemv_' /usr/local/lib/libitpp.so: undefined reference to `zgeqrf_' /usr/local/lib/libitpp.so: undefined reference to `zgerc_' /usr/local/lib/libitpp.so: undefined reference to `zswap_' /usr/local/lib/libitpp.so: undefined reference to `zgeev_' /usr/local/lib/libitpp.so: undefined reference to `daxpy_' /usr/local/lib/libitpp.so: undefined reference to `dgetrf_' /usr/local/lib/libitpp.so: undefined reference to `zgels_' /usr/local/lib/libitpp.so: undefined reference to `zgetrf_' /usr/local/lib/libitpp.so: undefined reference to `dgees_' /usr/local/lib/libitpp.so: undefined reference to `dcopy_' /usr/local/lib/libitpp.so: undefined reference to `dger_' /usr/local/lib/libitpp.so: undefined reference to `dgels_' /usr/local/lib/libitpp.so: undefined reference to `dgeqrf_' /usr/local/lib/libitpp.so: undefined reference to `zpotrf_' /usr/local/lib/libitpp.so: undefined reference to `zgees_' /usr/local/lib/libitpp.so: undefined reference to `dgesvd_' /usr/local/lib/libitpp.so: undefined reference to `zgeru_' /usr/local/lib/libitpp.so: undefined reference to `dsyev_' /usr/local/lib/libitpp.so: undefined reference to `zaxpy_' /usr/local/lib/libitpp.so: undefined reference to `ddot_' /usr/local/lib/libitpp.so: undefined reference to `zgesvd_' /usr/local/lib/libitpp.so: undefined reference to `zgeqp3_' /usr/local/lib/libitpp.so: undefined reference to `zcopy_' /usr/local/lib/libitpp.so: undefined reference to `dgemv_' /usr/local/lib/libitpp.so: undefined reference to `dposv_' /usr/local/lib/libitpp.so: undefined reference to `zheev_' collect2: ld returned 1 exit status make[2]: *** [tests/anytests] Error 1 make[1]: *** [tests/CMakeFiles/anytests.dir/all] Error 2
    Solution: itpp was compiled using dynamic libraries, but your application did not include the -lblas and -llapack link flags.

    Problem:
    *** Error: You must have "autoconf" installed to compile IT++ SVN sources
    *** Error: You must have "automake" installed to compile IT++ SVN sources
    *** Error: You must have "libtoolize" installed to compile IT++ SVN sources
    
    Solution:
    You need to install the packages autoconf, automake and libtool (which provides libtoolize). See the yum/apt-get documentation.

    Problem:
    /usr/bin/ld: /home/bickson/lapack-3.3.1/lapack_LINUX.a(dgees.o): relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
    /home/bickson/lapack-3.3.1/lapack_LINUX.a: could not read symbols: Bad value
    collect2: ld returned 1 exit status
    
    Solution:
    It seems you forgot to follow step 2b.

    TIP: It is also useful to enable the itpp_debug library, which is very helpful when debugging your code. This is done by adding the --enable-debug flag to the configure script.

    Problem:
    *** Error in ../../../itpp/base/algebra/ls_solve.cpp on line 271:
    LAPACK library is needed to use ls_solve() function
    
    Solution:
    It seems that itpp is not installed properly - it did not link against LAPACK.

    Wednesday, February 9, 2011

    Hadoop/Mahout - setting up a development environment

    This post explains how to set up a development environment for Hadoop and Mahout.

    Prerequisites: you need the Mahout and Hadoop sources (see previous posts).

    On a development machine
    1) Download the Helios version of Eclipse, e.g. eclipse-java-helios-SR1-linux-gtk-x86_64.tar.gz,
    and save it locally. Open the archive using:
     tar xvzf *.gz

     2) Install the Map-reduce eclipse plugin
    cd eclipse/plugins/
    wget https://issues.apache.org/jira/secure/attachment/12460491/hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar

     3) Follow the directions in
    http://m2eclipse.sonatype.org/installing-m2eclipse.html
    to install maven plugin in eclipse.

    4) In Eclipse: File -> Import -> Maven project -> select the Mahout root dir -> Finish.
    You will see a list of all subprojects. Press OK and wait for the compilation to finish.
    If everything went smoothly, the project should compile.

    5) Select the Map/Reduce view -> Map/Reduce Locations tab -> Edit Hadoop locations.
    In the General tab, add a location name (just a name to identify this configuration), the host and
    port of the Map/Reduce master (default port 50030 using the EC2 configuration described in previous posts) and the DFS master (default port 50070) -> Finish.

    CMU Pegasus on Hadoop

    Pegasus is a peta-scale graph mining library.
    This post explains how to install it on Amazon EC2.

    1) Start with the Amazon EC2 image you created in http://bickson.blogspot.com/2011/01/how-to-install-mahout-on-amazon-ec2.html
    2) Run Hadoop on a single node as explained in http://bickson.blogspot.com/2011/01/mahout-on-amazon-ec2-part-2-testing.html
    3) Login into the EC2 machine
    4) wget http://www.cs.cmu.edu/%7Epegasus/PEGASUSH-2.0.tar.gz
    5) tar xvzf PEGASUSH-2.0.tar.gz
    6) cd PEGASUS
    7) export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/hadoop-0.20.2/bin/
    8)  sudo apt-get install gnuplot
    9) ./pegasus.sh
    PEGASUS> demo
    put: Target pegasus/graphs/catstar/edge/catepillar_star.edge already exists
    Graph catstar added.
    rmr: cannot remove dd_node_deg: No such file or directory.
    rmr: cannot remove dd_deg_count: No such file or directory.

    -----===[PEGASUS: A Peta-Scale Graph Mining System]===-----

    [PEGASUS] Computing degree distribution. Degree type = InOut

    11/02/09 14:47:36 INFO mapred.FileInputFormat: Total input paths to process : 1
    11/02/09 14:47:36 INFO mapred.JobClient: Running job: job_201102091432_0003
    11/02/09 14:47:37 INFO mapred.JobClient:  map 0% reduce 0%
    11/02/09 14:47:45 INFO mapred.JobClient:  map 18% reduce 0%
    11/02/09 14:47:48 INFO mapred.JobClient:  map 36% reduce 0%
    11/02/09 14:47:51 INFO mapred.JobClient:  map 54% reduce 0%
    11/02/09 14:47:54 INFO mapred.JobClient:  map 72% reduce 18%
    11/02/09 14:47:57 INFO mapred.JobClient:  map 90% reduce 18%
    11/02/09 14:48:00 INFO mapred.JobClient:  map 100% reduce 18%
    11/02/09 14:48:03 INFO mapred.JobClient:  map 100% reduce 24%
    11/02/09 14:48:09 INFO mapred.JobClient:  map 100% reduce 100%
    11/02/09 14:48:11 INFO mapred.JobClient: Job complete: job_201102091432_0003
    11/02/09 14:48:11 INFO mapred.JobClient: Counters: 18
    11/02/09 14:48:11 INFO mapred.JobClient:   Job Counters
    11/02/09 14:48:11 INFO mapred.JobClient:     Launched reduce tasks=1
    11/02/09 14:48:11 INFO mapred.JobClient:     Launched map tasks=11
    11/02/09 14:48:11 INFO mapred.JobClient:     Data-local map tasks=11


    An image named catstar_deg_inout.eps will be created.

    Tuesday, February 8, 2011

    Hadoop on Amazon EC2 - Part 4 - Running on a cluster

    1) Edit the file conf/hdfs-site.xml.
    Set the number of replicas as the number of nodes you plan to use. In this example, 4.

    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/mnt/tmp/</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/mnt/tmp2/</value>
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>/mnt/tmp3/</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>4</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
      </property>
    </configuration>

    2) Edit the file conf/slaves and list the DNS names of all of the machines you are going to use. For example:
     ec2-67-202-45-10.compute-1.amazonaws.com
     ec2-67-202-45-11.compute-1.amazonaws.com
     ec2-67-202-45-12.compute-1.amazonaws.com
     ec2-67-202-45-13.compute-1.amazonaws.com 

    3) Edit the file conf/master and enter the DNS name of the master node. For example
    ec2-67-202-45-10.compute-1.amazonaws.com

    Note that the master node can also appear in the slaves list.

    4) Edit the file conf/core-site.xml to include the master name:
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://ec2-67-202-45-10.compute-1.amazonaws.com:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>ec2-67-202-45-10.compute-1.amazonaws.com:9001</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/mnt/tmp/</value>
      </property>
    </configuration>
    
    5) Edit the file conf/mapred-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
      <property>
        <name>fs.default.name</name> 
        <value>hdfs://ec2-67-202-45-10.compute-1.amazonaws.com:9000</value>
      </property>
    
      <property>
        <name>mapred.job.tracker</name>
      <value>ec2-67-202-45-10.compute-1.amazonaws.com:9001</value>
      </property>
      <property>
      <name>hadoop.tmp.dir</name>
       <value>/mnt/tmp/</value>
      </property>
    
      <property>
        <name>mapred.map.tasks</name>
        <value>10</value> <!-- about the number of cores -->
      </property>

      <property>
        <name>mapred.reduce.tasks</name>
        <value>10</value> <!-- about the number of cores -->
      </property>

      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>12</value> <!-- slightly more than cores -->
      </property>

      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>12</value> <!-- slightly more than cores -->
      </property>
       
    </configuration>
    
    6) Log into the master node. For each of the 3 slave machines, copy the DSA key from the master node:
    ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-11.compute-1.amazonaws.com
    ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-12.compute-1.amazonaws.com
    ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-13.compute-1.amazonaws.com
    

    7) To start Hadoop. On the master machine
    /usr/local/hadoop-0.20.2/bin/hadoop namenode -format
    /usr/local/hadoop-0.20.2/bin/start-dfs.sh
    /usr/local/hadoop-0.20.2/bin/start-mapred.sh
    

    8) To stop Hadoop
    /usr/local/hadoop-0.20.2/bin/stop-mapred.sh
    /usr/local/hadoop-0.20.2/bin/stop-dfs.sh
    

    Friday, February 4, 2011

    Mahout - SVD matrix factorization - reading output

    Converting Mahout's SVD Distributed Matrix Factorization Solver Output Format into CSV format

    Purpose
    The code below shows how to convert a matrix from Mahout's SVD output format
    into CSV format.


    This code is based on code by Danny Leshem, ContextIn.

    Command line arguments:
    args[0] - path to svd output file
    args[1] - path to output csv file

    Compilation:
    Copy the Java code below into a Java file named SVD2CSV.java.
    Add to the project path both Mahout and Hadoop jars.


    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.util.Iterator;
    
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    
    
    public class SVD2CSV {
       
       
        /**
         * @param args[0] - path to the SVD output (Hadoop sequence file)
         * @param args[1] - path to the output CSV file
         */
        public static void main(String[] args){
       
    try {
        final Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(conf);
        final SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
        BufferedWriter br = new BufferedWriter(new FileWriter(args[1]));
        IntWritable key = new IntWritable();
        VectorWritable vec = new VectorWritable();
             
              while (reader.next(key, vec)) {
                  //System.out.println("key " + key);
                  SequentialAccessSparseVector vect = (SequentialAccessSparseVector)vec.get();
                   System.out.println("key " + key + " value: " + vect);
                  Iterator<Vector.Element> iter = vect.iterateNonZero();
    
                   while(iter.hasNext()){
                      Vector.Element element = iter.next();
                      br.write(key + "," + element.index() + "," + vect.getQuick(element.index())+"\n");
                }
              }
         
              reader.close();
              br.close();
    
           
        } catch(Exception ex){
            ex.printStackTrace();
        }
        }
    }


    When parsing the output, please look here for a discussion regarding the validity of the computed results.

    Further reading: Yahoo! KDD Cup 2011 - large scale matrix factorization.

    Mahout - SVD matrix factorization - formatting input matrix

    Converting Input Format into Mahout's SVD Distributed Matrix Factorization Solver

    Purpose
    The code below converts a matrix from CSV format:
    <from row>,<to col>,<value>\n
    into Mahout's SVD solver format.


    For example, 
    The 3x3 matrix:
    0    1.0 2.1
    3.0  4.0 5.0
    -5.0 6.2 0


    Will be given as input in a csv file as:
    1,0,3.0
    2,0,-5.0
    0,1,1.0
    1,1,4.0
    2,1,6.2
    0,2,2.1
    1,2,5.0

    NOTE: I ASSUME THE MATRIX IS SORTED BY THE COLUMNS ORDER (the second field). Also note that the code below treats the row and column indices as 1-based and subtracts 1 from each, so adjust your input (or the code) if your indices are already 0-based.
    This code is based on code by Danny Leshem, ContextIn.
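    If your CSV file is not already sorted by the column (second) field, one way to sort it before running the converter is with the standard sort utility (input.csv and input_sorted.csv are placeholder names):
    sort -t, -k2,2n -k1,1n input.csv > input_sorted.csv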

    Command line arguments:
    args[0] - path to csv input file
    args[1] - cardinality of the matrix (number of columns)
    args[2] - path the resulting Mahout's SVD input file

    Method:
    The code below, goes over the csv file, and for each matrix column, creates a SequentialAccessSparseVector which contains all the non-zero row entries for this column.
    Then it appends the column vector to file.

    Compilation:
    Copy the Java code below into a Java file named Convert2SVD.java.
    Add to your IDE project path both Mahout and Hadoop jars. Alternatively, a command line option for compilation is given below.


    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.StringTokenizer;
    
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    
    /**
     * Code for converting CSV format to Mahout's SVD format
     * @author Danny Bickson, CMU
     * Note: I ASSUME THE CSV FILE IS SORTED BY THE COLUMN (NAMELY THE SECOND FIELD).
     *
     */
    
    public class Convert2SVD {
    
    
            public static int Cardinality;
    
            /**
             * 
             * @param args[0] - input csv file
             * @param args[1] - cardinality (length of vector)
             * @param args[2] - output file for svd
             */
            public static void main(String[] args){
    
    try {
            Cardinality = Integer.parseInt(args[1]);
            final Configuration conf = new Configuration();
            final FileSystem fs = FileSystem.get(conf);
            final SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, new Path(args[2]), IntWritable.class, VectorWritable.class, CompressionType.BLOCK);
    
              final IntWritable key = new IntWritable();
              final VectorWritable value = new VectorWritable();
    
       
               String thisLine;
            
               BufferedReader br = new BufferedReader(new FileReader(args[0]));
               Vector vector = null;
               int from = -1,to  =-1;
               int last_to = -1;
               float val = 0;
               int total = 0;
               int nnz = 0;
               int e = 0;
               int max_to =0;
               int max_from = 0;
    
               while ((thisLine = br.readLine()) != null) { // while loop begins here
                
                     StringTokenizer st = new StringTokenizer(thisLine, ",");
                     while(st.hasMoreTokens()) {
                         from = Integer.parseInt(st.nextToken())-1; //convert from 1 based to zero based
                         to = Integer.parseInt(st.nextToken())-1; //convert from 1 based to zero basd
                         val = Float.parseFloat(st.nextToken());
                         if (max_from < from) max_from = from;
                         if (max_to < to) max_to = to;
                         if (from < 0 || to < 0 || from > Cardinality || val == 0.0)
                             throw new NumberFormatException("wrong data" + from + " to: " + to + " val: " + val);
                     }
                  
                     //we are working on an existing column, set non-zero rows in it
                     if (last_to != to && last_to != -1){
                         value.set(vector);
                         
                         writer.append(key, value); //write the older vector
                         e+= vector.getNumNondefaultElements();
                     }
                     //a new column is observed, open a new vector for it
                     if (last_to != to){
                         vector = new SequentialAccessSparseVector(Cardinality); 
                         key.set(to); // open a new vector
                         total++;
                     }
    
                     vector.set(from, val);
                     nnz++;
    
                     if (nnz % 1000000 == 0){
                       System.out.println("Col" + total + " nnz: " + nnz);
                     }
                     last_to = to;
    
              } // end while 
    
               value.set(vector);
               writer.append(key,value);//write last row
               e+= vector.getNumNondefaultElements();
               total++;
               
               writer.close();
               System.out.println("Wrote a total of " + total + " cols " + " nnz: " + nnz);
               if (e != nnz)
                    System.err.println("Bug:missing edges! we only got" + e);
              
               System.out.println("Highest column: " + max_to + " highest row: " + max_from );
            } catch(Exception ex){
                    ex.printStackTrace();
            }
        }
    }
    


    A second option to compile this file is to create a Makefile with the following in it:
    all:
            javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar *.java
    
    Note that you will have to change the location of the jars to point to where your jars are stored.

    Example for running this conversion for netflix data:
    java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar Convert2SVD ../../netflixe.csv 17770 netflixe.seq

    Aug 23, 2011 1:16:06 PM org.apache.hadoop.util.NativeCodeLoader
    WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Aug 23, 2011 1:16:06 PM org.apache.hadoop.io.compress.CodecPool getCompressor
    INFO: Got brand-new compressor
    Wrote a total of 241 rows, nnz: 1000000
    Wrote a total of 381 rows, nnz: 2000000
    Wrote a total of 571 rows, nnz: 3000000
    Wrote a total of 789 rows, nnz: 4000000
    Wrote a total of 1046 rows, nnz: 5000000
    Wrote a total of 1216 rows, nnz: 6000000
    Wrote a total of 1441 rows, nnz: 7000000
    
    ...
    

    NOTE: You may also want to check out GraphLab's collaborative filtering library: here. GraphLab has an SVD solver that is 100% compatible with Mahout's, with performance gains of up to 50x. I have created Java code to convert Mahout sequence files into GraphLab's format and back. Email me and I will send you the code.

    Tuesday, February 1, 2011

    Mahout on Amazon EC2 - part 3 - Debugging

    Connecting to the management web interface of a Hadoop node


    1) Log into the AWS management console.
    Select the default security group, and add TCP ports 50010-50090 (with source IP 0.0.0.0/0).
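    If you prefer the command line over the console, the same rule can probably be added with the ec2-api-tools (a sketch; assumes the tools and your AWS credentials are already configured):
    ec2-authorize default -P tcp -p 50010-50090 -s 0.0.0.0/0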

    2) You can view hadoop node status (after starting Hadoop) by opening a web browser and
    entering the following address:

    http://ec2-50-16-155-136.compute-1.amazonaws.com:50070/
    
    where ec2-XXX-XXXXXXXX is the node name, 50070 is the default port of the namenode server,
    50030 is the default port of the job tracker, and 50060 is the default port of the task tracker.



    Common errors and their solutions:

    * When starting hadoop, the following message is presented:
    <32|0>bickson@biggerbro:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2$ ./bin/start-all.sh
    localhost: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    localhost: @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
    localhost: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    localhost: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
    localhost: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    localhost: It is also possible that the RSA host key has just been changed.
    localhost: The fingerprint for the RSA key sent by the remote host is
    localhost: 06:95:7b:c8:0e:85:e7:ba:aa:b1:31:6e:fc:0e:ae:4d.
    localhost: Please contact your system administrator.
    localhost: Add correct host key in /mnt/bigbrofs/usr6/bickson/.ssh/known_hosts to get rid of this message.
    localhost: Offending key in /mnt/bigbrofs/usr6/bickson/.ssh/known_hosts:1
    localhost: RSA host key for localhost has changed and you have requested strict checking.
    localhost: Host key verification failed.
    
    Solution:

    echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
    
    Note:

    If the file ~/.ssh/config does not exist, change the command to:
    echo "NoHostAuthenticationForLocalhost yes" >~/.ssh/config 
    

    The following exception is received:
    org.apache.hadoop.ipc.RemoteException: java.io.IOException:
    File /user/ubuntu/temp/markedPreferences/_temporary/_attempt_local_0001_r_000000_0/part-r-00000 could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock
    (FSNamesystem.java:1271)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke
    (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) 
    


    Solution:
    1) I saw this error when the system is out of disk space. Increase the number of nodes or the instance type.
    2) Another cause is that the datanode did not finish booting. Wait at least 200 seconds after starting Hadoop before actually starting to run jobs.
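    Two quick checks that can help distinguish between these causes (generic commands, nothing specific to this error):
    df -h                                       # check free disk space on the node
    $HADOOP_HOME/bin/hadoop dfsadmin -report    # check that the datanode registered and reports capacity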

    * Job tracker fails to run with the following error:
    2011-02-02 16:02:47,097 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG: 
    /************************************************************
    STARTUP_MSG: Starting JobTracker
    STARTUP_MSG:   host = ip-10-114-75-91/10.114.75.91
    STARTUP_MSG:   args = []
    STARTUP_MSG:   version = 0.20.2
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/
    
    hadoop/common/branches/branch-0.20 -r 911707; compiled by 
    
    'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
    ************************************************************/
    2011-02-02 16:02:47,200 INFO org.apache.hadoop.mapred.JobTracker: 
    
    Scheduler  configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, 
    
    limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
    2011-02-02 16:02:47,220 FATAL org.apache.hadoop.mapred.JobTracker:
    
    java.lang.RuntimeException: 
    
    Not a host:port pair: local
     at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:136)
     at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
     at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1807)
     at org.apache.hadoop.mapred.JobTracker.(JobTracker.java:1579)
     at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
     at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
     at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)
    


    Solution: edit the file /path/to/hadoop/conf/mapred-site.xml to include:
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>



    * When connecting to EC2 host you get the following error:
    ssh -i ./graphlabkey.pem -o "StrictHostKeyChecking no" 
    ubuntu@ec2-50-16-101-232.compute-1.amazonaws.com "/home/ubuntu/ec2-metadata -i"
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    Permissions 0644 for './graphlabkey.pem' are too open.
    It is recommended that your private key files are NOT accessible by others.
    This private key will be ignored.
    
    Solution:
    chmod 400 graphlabkey.pem





    * Exception : can not lock storage
    ************************************************************/
    2011-02-03 23:47:48,623 INFO org.apache.hadoop.hdfs.server.common.Storage: 
    
    Cannot lock storage /mnt. The directory is already locked.
    2011-02-03 23:47:48,736 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: 
    
    Cannot lock storage /mnt. The directory is already locked.
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:510)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:363)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:112)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:216)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) 
    
    Solution: search for and remove the file in_use.lock.
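    For example, assuming your Hadoop data directories live under /mnt (adjust the path to your dfs.data.dir/dfs.name.dir settings):
    find /mnt -name in_use.lock                  # locate the stale lock files
    find /mnt -name in_use.lock -exec rm {} \;   # remove them after reviewing the list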

    * Exception:
    tasktracker running as process XXX. Stop it first.

    Solution:
    1) Hadoop is already running - kill it first using stop-all.sh (on a single machine) or stop-mapred.sh and stop-dfs.sh (on a cluster).
    2) If you killed Hadoop and you are still getting this error, check under /tmp for files named *.pid - if found, remove them.
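    For example, assuming the default pid directory /tmp (the exact file names depend on your setup, so review the list before deleting):
    ls /tmp/*.pid
    rm /tmp/hadoop-*.pid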

    2010-08-06 12:12:06,900 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /Users/jchen/Data/Hadoop/dfs/data: namenode namespaceID = 773619367; datanode namespaceID = 2049079249
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:216)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
    
    Solution: Remove all files named VERSION from all tmp directories (you need to search carefully - Hadoop has at least 3 working directories) and reformat the namenode file system.
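    A sketch of the cleanup, assuming the working directories live under /tmp (adjust to your hadoop.tmp.dir, dfs.data.dir and dfs.name.dir settings):
    find /tmp -name VERSION                      # locate the stale VERSION files
    find /tmp -name VERSION -exec rm {} \;       # remove them
    $HADOOP_HOME/bin/hadoop namenode -format     # reformat the namenode file system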

    Error:
    bash-3.2$ ./bin/start-all.sh 
    starting namenode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-namenode-biggerbro.ml.cmu.edu.out
    localhost: starting datanode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-datanode-biggerbro.ml.cmu.edu.out
    localhost: starting secondarynamenode, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-secondarynamenode-biggerbro.ml.cmu.edu.out
    localhost: Exception in thread "main" java.net.BindException: Address already in use
    localhost: 	at sun.nio.ch.Net.bind(Native Method)
    localhost: 	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
    localhost: 	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
    localhost: 	at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
    localhost: 	at org.apache.hadoop.http.HttpServer.start(HttpServer.java:425)
    localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:165)
    localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:115)
    localhost: 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:469)
    starting jobtracker, logging to /mnt/bigbrofs/usr6/bickson/usr7/hadoop-0.20.2/bin/../logs/hadoop-bickson-jobtracker-biggerbro.ml.cmu.edu.out
    
    Solution: kill every process using ./bin/stop-all.sh, wait a few minutes, and retry. If this does not help, you may need to change the port numbers in the config files.

    hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/name is in an inconsistent state: storage directory does not exist or is not accessible.
    hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 04:33:10,572 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/name is in an inconsistent state: storage directory does not exist or is not accessible.
    
    Solution: it seems you did not properly format HDFS using the command:
    ./bin/hadoop namenode -format

    Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector
    	at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    	at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    	at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
    	at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
    	at java.lang.Class.forName0(Native Method)
    	at java.lang.Class.forName(Class.java:247)
    	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
    	at org.apache.hadoop.io.WritableName.getClass(WritableName.java:71)
    	at org.apache.hadoop.io.SequenceFile$Reader.getValueClass(SequenceFile.java:1613)
    	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1555)
    	at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1428)
    	at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1417)
    	at org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1412)
    	at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
    	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
    	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
    	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    	at org.apache.hadoop.mapred.Child.main(Child.java:170)
    
    
    Solution: verify that MAHOUT_HOME is properly defined.
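    A quick check and fix (the path below is the one used earlier in this blog; adjust it to your installation):
    echo $MAHOUT_HOME
    export MAHOUT_HOME=/usr/local/mahout-0.4/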

    hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 05:31:35,693 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9000, call delete(/tmp/hadoop/mapred/system, true) from 127.0.0.1:51103: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop/mapred/system. Name node is in safe mode.
    hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop/mapred/system. Name node is in safe mode.
    hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:2011-09-05 05:33:39,712 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9000, call addBlock(/user/bickson/small_netflix_mahout_transpose/part-r-00000, DFSClient_-1810781150) from 127.0.0.1:48972: error: java.io.IOException: File /user/bickson/small_netflix_mahout_transpose/part-r-00000 could only be replicated to 0 nodes, instead of 1
    hadoop-bickson-namenode-biggerbro.ml.cmu.edu.log:java.io.IOException: File /user/bickson/small_netflix_mahout_transpose/part-r-00000 could only be replicated to 0 nodes, instead of 1
    
    Solution: This error may happen if you try to access the HDFS file system before Hadoop has finished starting up. Wait a few minutes and try again.
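    You can check whether the namenode is still in safe mode with dfsadmin, and, if you are sure the cluster is healthy, leave safe mode manually (use with care):
    $HADOOP_HOME/bin/hadoop dfsadmin -safemode get
    $HADOOP_HOME/bin/hadoop dfsadmin -safemode leave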

    Mahout on CMU OpenCloud

    This post explains how to run Mahout on top of CMU OpenCloud.

    1) Log into the cloud login node:
    ssh -L 8888:proxy.opencloud:8888 login.cloud.pdl.cmu.local.

    2) Copy the Mahout directory tree into your home folder.

    3) Run a Mahout example:
    cd mahout-0.4/
    export JAVA_HOME=/usr/lib/jvm/java-6-sun/
     ./examples/bin/build-reuters.sh

    You should see:
    sh -x ./examples/bin/build-reuters.sh
    11/02/01 15:13:27 INFO driver.MahoutDriver: Program took 225915 ms
    + ./bin/mahout seqdirectory -i ./examples/bin/work/reuters-out/ -o ./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5
    Running on hadoop, using HADOOP_HOME=/usr/local/sw/hadoop
    HADOOP_CONF_DIR=/etc/hadoop/conf/global
    11/02/01 15:13:38 INFO driver.MahoutDriver: Program took 10087 ms
    + ./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse
    Running on hadoop, using HADOOP_HOME=/usr/local/sw/hadoop
    HADOOP_CONF_DIR=/etc/hadoop/conf/global
    11/02/01 15:13:40 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
    11/02/01 15:13:40 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
    11/02/01 15:13:40 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
    11/02/01 15:13:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    11/02/01 15:13:42 INFO input.FileInputFormat: Total input paths to process : 3
    11/02/01 15:13:47 INFO mapred.JobClient: Running job: job_201101170028_1733
    11/02/01 15:13:48 INFO mapred.JobClient:  map 0% reduce 0%
    11/02/01 15:17:49 INFO mapred.JobClient:  map 33% reduce 0%
    11/02/01 15:17:55 INFO mapred.JobClient:  map 66% reduce 0%
    11/02/01 15:18:01 INFO mapred.JobClient:  map 100% reduce 0%
    11/02/01 15:18:08 INFO mapred.JobClient: Job complete: job_201101170028_1733
    11/02/01 15:18:08 INFO mapred.JobClient: Counters: 6
    11/02/01 15:18:08 INFO mapred.JobClient:   Job Counters
    11/02/01 15:18:08 INFO mapred.JobClient:     Rack-local map tasks=5
    11/02/01 15:18:08 INFO mapred.JobClient:     Launched map tasks=5
    11/02/01 15:18:08 INFO mapred.JobClient:   FileSystemCounters
    11/02/01 15:18:08 INFO mapred.JobClient:     HDFS_BYTES_READ=13537042
    11/02/01 15:18:08 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=11047110
    11/02/01 15:18:08 INFO mapred.JobClient:   Map-Reduce Framework
    11/02/01 15:18:08 INFO mapred.JobClient:     Map input records=16115
    11/02/01 15:18:08 INFO mapred.JobClient:     Spilled Records=0
    11/02/01 15:18:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    11/02/01 15:18:09 INFO input.FileInputFormat: Total input paths to process : 3
    11/02/01 15:18:15 INFO mapred.JobClient: Running job: job_201101170028_1736
    11/02/01 15:18:16 INFO mapred.JobClient:  map 0% reduce 0%
    ...