Tuesday, January 25, 2011

Mahout on Amazon EC2 - part 2 - Running Hadoop on a single node

This post follows part 1, which explained how to install Mahout and Hadoop on Amazon EC2.


We start by testing logistic regression


1) Launch the Amazon AMI image you constructed using the instructions in part 1 of this post.

2) Run Hadoop using
# $HADOOP_HOME/bin/hadoop namenode -format
# $HADOOP_HOME/bin/start-all.sh
# jps     // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)

3) Run logistic regression example

cd /usr/local/mahout-0.4/
./bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 100 --rate 50 --lambda 0.001 --input examples/src/main/resources/donut.csv --features 21 --output donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n

You should see the following output:

11/01/25 14:42:45 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.TrainLogistic.props found on classpath, will use command-line arguments only
21
color ~ 0.353*Intercept Term + 5.450*x + -1.671*y + -4.740*xx + 0.353*xy + 0.353*yy + 5.450*a + 2.765*b + -24.161*c
      Intercept Term 0.35319
                   a 5.45000
                   b 2.76534
                   c -24.16091
                   x 5.45000
                  xx -4.73958
                  xy 0.35319
                   y -1.67092
                  yy 0.35319

    2.765337737     0.000000000    -1.670917299     0.000000000     0.000000000     0.000000000     5.449999190     0.000000000   -24.160908591    -4.739579336     0.353190637     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000

11/01/25 14:42:46 INFO driver.MahoutDriver: Program took 1016 ms
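
To check how well the trained model fits the donut data, you can optionally score it with the companion RunLogistic class. This is just a sketch; the exact flags (--auc, --confusion) may differ slightly between Mahout versions:

./bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input examples/src/main/resources/donut.csv --model donut.model --auc --confusion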

Now we run alternating least squares (ALS) matrix factorization, based on instructions by Sebastian Schelter (see https://issues.apache.org/jira/browse/MAHOUT-542).
A related GraphLab implementation is also available.

0) Download the patch MAHOUT-542-5.patch from the JIRA page above and apply it using the following commands:
cd /usr/local/mahout-0.4/src/
wget https://issues.apache.org/jira/secure/attachment/12469671/MAHOUT-542-5.patch
patch -p0 < MAHOUT-542-5.patch
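Note: after applying the patch you will most likely need to rebuild Mahout so that the new ALS jobs (splitDataset, parallelALS, evaluateALS) are picked up by bin/mahout. A sketch, assuming you built Mahout from source as described in part 1:
cd /usr/local/mahout-0.4/
mvn clean install -DskipTests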
1) Get the MovieLens 1M dataset:
cd /usr/local/mahout-0.4/
wget http://www.grouplens.org/system/files/million-ml-data.tar__0.gz
tar xvzf million-ml-data.tar__0.gz
2) Convert the dataset to CSV format (user,item,rating; the timestamp column is dropped):
cat ratings.dat |sed -e s/::/,/g| cut -d, -f1,2,3 > ratings.csv
cd /usr/local/hadoop-0.20.2/
./bin/hadoop fs -copyFromLocal /path/to/ratings.csv ratings.csv
./bin/hadoop fs -ls

You should see something like:
/user/ubuntu/ratings.csv
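
Optionally, you can sanity-check that the conversion dropped the timestamp column and kept one line per rating (not required):
./bin/hadoop fs -cat /user/ubuntu/ratings.csv | head -3     // each line should look like userID,movieID,rating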


3) Create a 90% training set and a 10% probe set:
/usr/local/mahout-0.4$ ./bin/mahout splitDataset  --input /user/ubuntu/ratings.csv --output /user/ubuntu/myout --trainingPercentage 0.9 --probePercentage 0.1
The output should look like:
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2/
HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
11/01/27 01:09:39 WARN driver.MahoutDriver: No splitDataset.props found on classpath, will use command-line arguments only
11/01/27 01:09:39 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --input=/user/ubuntu/ratings.csv, --output=/user/ubuntu/myout, --probePercentage=0.1, --startPhase=0, --tempDir=temp, --trainingPercentage=0.9}
11/01/27 01:09:40 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/01/27 01:09:40 INFO input.FileInputFormat: Total input paths to process : 1
11/01/27 01:09:40 INFO mapred.JobClient: Running job: job_local_0001
11/01/27 01:09:40 INFO input.FileInputFormat: Total input paths to process : 1
11/01/27 01:09:41 INFO mapred.MapTask: io.sort.mb = 100
11/01/27 01:09:41 INFO mapred.MapTask: data buffer = 79691776/99614720
11/01/27 01:09:41 INFO mapred.MapTask: record buffer = 262144/327680
11/01/27 01:09:42 INFO mapred.JobClient:  map 0% reduce 0%
11/01/27 01:09:42 INFO mapred.MapTask: Spilling map output: record full = true
11/01/27 01:09:42 INFO mapred.MapTask: bufstart = 0; bufend = 5970616; bufvoid = 99614720
11/01/27 01:09:42 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
11/01/27 01:09:42 INFO util.NativeCodeLoader: Loaded the native-hadoop library
4) Run distributed ALS-WR to factorize the rating matrix based on the training set:
bin/mahout parallelALS --input /user/ubuntu/myout/trainingSet/ --output /tmp/als/out --tempDir /tmp/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065
...
11/01/27 02:40:28 INFO mapred.JobClient:     Spilled Records=7398
11/01/27 02:40:28 INFO mapred.JobClient:     Map output bytes=691713
11/01/27 02:40:28 INFO mapred.JobClient:     Combine input records=0
11/01/27 02:40:28 INFO mapred.JobClient:     Map output records=3699
11/01/27 02:40:28 INFO mapred.JobClient:     Reduce input records=3699
11/01/27 02:40:28 INFO driver.MahoutDriver: Program took 1998612 ms
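Before moving on, you can verify that the factorization wrote the user and item feature matrices that the next step expects:
cd /usr/local/hadoop-0.20.2/
./bin/hadoop fs -ls /tmp/als/out     // should list the U/ (user features) and M/ (item features) directories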
5) Measure the error of the predictions against the probe set:
/usr/local/mahout-0.4$ bin/mahout evaluateALS --probes /user/ubuntu/myout/probeSet/ --userFeatures /tmp/als/out/U/ --itemFeatures /tmp/als/out/M/
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2/
HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
11/01/27 02:42:37 WARN driver.MahoutDriver: No evaluateALS.props found on classpath, will use command-line arguments only
11/01/27 02:42:37 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --itemFeatures=/tmp/als/out/M/, --probes=/user/ubuntu/myout/probeSet/, --startPhase=0, --tempDir=temp, --userFeatures=/tmp/als/out/U/}

...

Probe [99507], rating of user [4510] towards item [2560], [1.0] estimated [1.574626183998361]
Probe [99508], rating of user [4682] towards item [171], [4.0] estimated [4.073943928686575]
Probe [99509], rating of user [3333] towards item [1215], [5.0] estimated [4.098295242062813]
Probe [99510], rating of user [4682] towards item [173], [2.0] estimated [1.9625234269143972]
RMSE: 0.8546120366924382, MAE: 0.6798083002225481
11/01/27 02:42:50 INFO driver.MahoutDriver: Program took 13127 ms
Useful HDFS commands

* View the current state of the file system
ubuntu@domU-12-31-39-00-18-51:/usr/local/hadoop-0.20.2$ ./bin/hadoop dfsadmin -report
Configured Capacity: 10568916992 (9.84 GB)
Present Capacity: 3698495488 (3.44 GB)
DFS Remaining: 40173568 (38.31 MB)
DFS Used: 3658321920 (3.41 GB)
DFS Used%: 98.91%
Under replicated blocks: 56
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 10568916992 (9.84 GB)
DFS Used: 3658321920 (3.41 GB)
Non DFS Used: 6870421504 (6.4 GB)
DFS Remaining: 40173568(38.31 MB)
DFS Used%: 34.61%
DFS Remaining%: 0.38%
Last contact: Tue Feb 01 21:10:15 UTC 2011
* Delete a directory
ubuntu@domU-12-31-39-00-18-51:/usr/local/hadoop-0.20.2$ ./bin/hadoop fs -rmr temp/markedPreferences
Deleted hdfs://localhost:9000/user/ubuntu/temp/markedPreferences
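* Copy results back from HDFS to the local file system, e.g. to archive the factorized matrices before shutting the instance down (a sketch - adjust the paths to your own output locations):
./bin/hadoop fs -copyToLocal /tmp/als/out /mnt/als-out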

Monday, January 24, 2011

Mahout/Hadoop on Amazon EC2 - part 1 - Installation

This post explains how to install the Mahout ML framework on top of Amazon EC2 (an Ubuntu-based machine).
The notes are based on the older Mahout wiki page https://cwiki.apache.org/MAHOUT/mahout-on-amazon-ec2.html, which is unfortunately outdated.

The next part of this post (part 2) explains how to run two Mahout applications:
logistic regression and alternating least squares.

Note: part 5 of this post explains how to do the same installation on top of an
EC2 high-compute node (a CentOS/RedHat based machine). Unfortunately, several steps
are different.

Part 6 of this post explains how to fine-tune performance on a large cluster.

The full procedure should take around 2-3 hours.. :-(

To confuse its users, Amazon has 5 types of IDs:
- Your email and password for getting into the AWS console
- Your AWS string name and private key string
- Your public/private key pair
- Your X.509 certificate (another private/public key pair)
- Your Amazon ID (a 12-digit number), which is quite hard to find on their website
Make sure you have all your IDs ready; if you have not generated the keys yet, do so using the AWS console.

1) Select and launch instance ami-08f40561 from the Amazon AWS console. Alternatively, you can select any other Ubuntu-based 64-bit image.
TIP: It is recommended to use an EBS-backed image, since saving your work at the end will be much easier.

2) Verify Java is installed correctly - some libraries are missing in the AMI:
sudo apt-get install openjdk-6-jdk
sudo apt-get install openjdk-6-jre-headless
sudo apt-get install openjdk-6-jre-lib

3) In the root home directory run:
# sudo apt-get update
# sudo apt-get upgrade
# sudo apt-get install python-setuptools
# sudo easy_install "simplejson==2.0.9"
# sudo easy_install "boto==1.8d"
# sudo apt-get install ant
# sudo apt-get install subversion
# sudo apt-get install maven2

4) Get the Hadoop 0.20.2 release:
# wget http://apache.cyberuse.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz 
# tar vxzf hadoop-0.20.2.tar.gz
# sudo  mv hadoop-0.20.2 /usr/local/

A comment: I once managed to install Hadoop 0.21.0, but after the EC2 node was killed and restarted,
Mahout refused to work anymore, so I reverted to Hadoop 0.20.2.

Add the following to $HADOOP_HOME/conf/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
# The maximum amount of heap to use, in MB. Default is 1000
export HADOOP_HEAPSIZE=2000

Add the following to $HADOOP_HOME/conf/core-site.xml and also to $HADOOP_HOME/conf/mapred-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
</configuration>
  
Edit the file $HADOOP_HOME/conf/hdfs-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/tmp2/</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/tmp3/</value>
  </property>
</configuration>
 

Note: the directories are pointed at /mnt because regular Amazon EC2 instances have about 400GB of free space there (vs. only 10GB on the root partition). You may
need to change the permissions of /mnt so this file system is writable by Hadoop.
To do so, execute the following command:
sudo chmod 777 /mnt
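
You can also pre-create the directories referenced in the configuration files above (optional - Hadoop will create them on its own as long as /mnt is writable):
sudo mkdir -p /mnt/tmp /mnt/tmp2 /mnt/tmp3
sudo chmod 777 /mnt/tmp /mnt/tmp2 /mnt/tmp3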


Set up authorized keys for password-less login to localhost (the name node itself is formatted later, in step 8):
# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
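
You can verify that the password-less login works before starting Hadoop (start-all.sh connects to localhost over ssh):
# ssh localhost "echo password-less ssh to localhost works"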


5) Add the following to your .profile:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4/
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m
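
Reload the profile and check that the variables point to the directories you actually installed into:
source ~/.profile
echo $HADOOP_HOME $MAHOUT_HOME
$HADOOP_HOME/bin/hadoop version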





  • 6) Check out and build Mahout from trunk. Verify that the paths in .profile point to the exact version you downloaded:

    svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
    cd mahout
    mvn clean install
    cd ..
    sudo mv mahout /usr/local/mahout-0.4
    

    Note: I am getting a lot of questions about the mvn compilation.
    a) On Windows-based machines, it seems that running a Linux VM makes some
    of the tests fail. Try to compile with the flag -DskipTests (see the example below).
    b) If compilation fails, you can try to download precompiled jars from
    http://mirror.its.uidaho.edu/pub/apache//mahout/0.4/ (the compiled jars are
    in the files without "src" in the filename). Just open the tgz and place its
    contents under /usr/local/mahout-0.4/ instead of performing the compilation step above.
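
    For reference, a build that skips the tests looks like this (same layout as in step 6):
    cd mahout
    mvn clean install -DskipTests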


    7) Install other required stuff (optional: in the Amazon EC2 image I am using
    those libraries are preinstalled).
    sudo apt-get install wget alien ruby libopenssl-ruby1.8 rsync curl
    

    8) Run Hadoop, just to prove you can, and test Mahout by building the Reuters dataset on it. Finally, delete the files and shut it down.

    $HADOOP_HOME/bin/hadoop namenode -format
    $HADOOP_HOME/bin/start-all.sh
    jps     // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
    cd $MAHOUT_HOME
    ./examples/bin/build-reuters.sh
    $HADOOP_HOME/bin/stop-all.sh
    rm -rf /tmp/*   // delete the Hadoop files







  • Remove the single-host settings you added to $HADOOP_HOME/conf/core-site.xml and $HADOOP_HOME/conf/mapred-site.xml in step 4 above and verify you are happy with the other conf file settings. The Hadoop startup scripts will not make any changes to them. In particular, increasing the Java heap size is required for many of the Mahout jobs.
    // edit $HADOOP_HOME/conf/mapred-site.xml to include the following:
    <property>
       <name>mapred.child.java.opts</name>
       <value>-Xmx2000m</value>
    </property>


    9) Allow Hadoop to start via ssh even if you later run this image on a different EC2 machine:
    echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
    


    If everything went well, you may want to bundle the output into an AMI image, so next time you will not need to install everything from scratch:
    10) Install Amazon AMI tools
    a) Edit the file /etc/apt/sources.list
    and uncomment all the lines with multiverse (note: you need to call the editor as root!)
    b) update the repositories
    sudo apt-get update
    c) Install ami and api tools
    sudo apt-get install ec2-ami-tools ec2-api-tools
    
    Thanks Kevin for this fix!

    11) In order to save your work, you need to bundle and save the image.
    There are two alternatives here. If you started an EBS-backed image, you can simply use the Amazon AWS user interface: right-click on the running instance and select "save instance".
    If the image is not EBS-backed, you will need to do it manually:

    - Note: you need to use the private key of the X.509 certificate and not the private key of the public/private key pair!

    [Each of the following commands should be entered on a single shell line.]

    First you need to create a bucket named mahoutbucket using the Amazon AWS console, under the S3 tab.

    sudo ec2-bundle-vol -k /mnt/pk-<your private X.509 key>.pem -c /mnt/cert-<your public x.509 key>.pem -u <Your AWS ID (12 digit number)> -d /mnt -p mahout
    sudo ec2-upload-bundle -b mahoutbucket -m /mnt/mahout.manifest.xml -a <Your AWS String> -s <Your AWS string password> 
    sudo ec2-register -K /mnt/pk-<Your X.509 private key>.pem -C /mnt/cert-<Your X.509 public certificate>.pem --name mahoutbucket/  mahoutbucket/mahout.manifest.xml
    
    If you are lucky, you will get a result of the form: IMAGE   ami-XXXXXXX
    where XXXXXXX is the generated image number.
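
    Once the image is registered you can launch fresh instances from it with the EC2 API tools; a sketch (substitute your own key pair name and preferred instance type):
    ec2-describe-images -o self        // should list the newly registered mahoutbucket image
    ec2-run-instances ami-XXXXXXX -k <your key pair name> -t m1.large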

    More detailed explanations about this procedure, along with many potential pitfalls, can be found
    in my blog post here.
    Thanks to Kevin and Selwyn!