How to build and execute examples in Mahout in Action - hadoop

I am learning Mahout in Action now and writing to ask how to build and execute examples in the book. I can find instructions with eclipse, but my environment doesn't include UI. So I copied the first example (RecommenderIntro) to RecommenderIntro.java and compile it through javac.
I got an error because the package was not imported. So I am looking for :
Approaches to import missing packages.
I guess, even it compiles successfully, .class file will be generated,
how can I execute it? through "java RecommnderIntro"? I can execute
mahout examples through sudo -u hdfs hadoop jar
mahout-examples-0.7-cdh4.2.0-job.jar
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job, how can I
do something similar for my own example?
All my data is saved in HBase tables, but in the book (and even
google), I cannot find a way to integrate it with HBase, any
suggestions?

q1 and q2, you need a java build tool like maven.
You build the hadoop-jar with : 'mvn clean install' This creates your hadoop job in target/mia-job.jar
You then execute your job with:
hadoop jar target/mia-job.jar RecommenderIntro inputDirIgnored outputDirIgnored
(The RecommenderIntro ignores parameters, but hadoop forces you to specify at least 2 parameters usually the input and output dir )
q3: You can't out-of-the-box.
Option1: export your hbase data to a text file 'intro.csv' with content like: "%userId%, %ItemId%, %score%" as described in the book. Because that's the file the RecommenderIntro is looking for.
Option2: Modify the example code to read data from hbase...
ps1. for developing such an application I'd really advise using an IDE. Because it allows you to use code-completion, execute, build, etc. A simple way to get started is to download a virtual image with hadoop like Cloudera or HortonWorks and install an IDE like eclipse. You can also configure these images to use your hadoop cluster, but you dont need to for small data sets.
ps2. The RecommenderIntro code isn't a distributed implementation and thus can't run on large datasets. It also runs locally instead of on a hadoop cluster.

Related

Running Spark from a local IDE

I've been spending some time banging my head over trying to run a complex spark application locally in order to test quicker (without having to package and deploy to a cluster).
Some context:
This spark application interfaces with Datastax Enterprise version of Cassandra and their distributed file system, so it needs some explicit jars to be provided (not available in Maven)
These jars are available on my local machine, and to "cheese" this, I tried placing them in SPARK_HOME/jars so they would be automatically added to the classpath
I tried to do something similar with the required configuration settings by putting them in spark-defaults.conf under SPARK_HOME/conf
When building this application, we do not build an uber jar, but rather do a spark-submit on the server using --jars
The problem I'm facing, is when I run the Spark Application through my IDE, it seems like it doesn't pick up any of these additional items from the SPARK_HOME director (config or jars). I spent a few hours trying to get the config items to work and ended up setting them as System.property values in my test case before starting the spark session in order for Spark to pick them up, so the configuration settings can be ignored.
However, I do not know how to reproduce this for the vendor specific jar files. Is there an easy way I can emulate the --jars behavior that spark-submit does and some home set up my spark session with this jar value? Note: I am using in my code the following command to start a spark session:
SparkSession.builder().config(conf).getOrCreate()
Additional information, in case it helps:
The Spark version I have locally in SPARK_HOME is the same version that my code is compiling with using Maven.
I asked another question similar to this related to configs: Loading Spark Config for testing Spark Applications
When I print the SPARK_HOME environment variable in my application, I am getting the correct SPARK_HOME value, so I'm not sure why neither the configs or jar files are being picked up from here. Is it possible that when running the application from my IDE, it's not picking up the SPARK_HOME environment variable and using all defaults?
You can make use of .config(key, value) while building the SparkSession by passing "spark.jars" as the key and a comma separated list of paths to the jar like so:
SparkSession.builder().config("spark.jars", "/path/jar1.jar, /path/jar2.jar").config(conf).getOrCreate()

How to run mapreduce samples using jar file?

I am trying to run mapreduce samples such as Bellard ,LongLong, Montgomery,Summation,DistributedPentomino ,OneSidedPentomino usi such as Bellard ,LongLong, Montgomery,Summation,DistributedPentomino ,OneSidedPentomino using hadoop mapresuce example jar. Can anyone please tell me the command to run these samples using jar file?
Here is exact chapter from HortonWorks about Hadoop MapReduce out of the box examples. It is part of their guide.
Running MapReduce examples on Hadoop YARN.
Another. more straightforward short article but without too much details:
Run Sample MapReduce Examples
In fact, having Hadoop version XXX you need:
export YARN_EXAMPLES=$YARN_HOME/share/hadoop/mapreduce
yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-XXX.jar
Please check your actual path.

Setting Spark Classpath on Amazon EMR

I am trying to run some simple jobs on EMR (AMI 3.6) with Hadoop 2.4 and Spark 1.3.1. I have installed Spark manually without a bootstrap script. Currently I am trying to read and process data from S3 but it seems like I am missing an endless number of jars on my classpath.
Running commands on spark-shell. Starting shell using:
spark-shell --jars jar1.jar,jar2.jar...
Commands run on the shell:
val lines = sc.textFile("s3://folder/file.gz")
lines.collect()
The errors always look something like: "Class xyz not found". After I find the needed jar and add it to the classpath, I will get this error again but with a different class name in the error message.
Is there a set of jars that are needed for working with (compressed and uncompressed) S3 files?
I was able to figure out the jars needed for my classpath by following the logic in the AWS GitHub repo https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark.
The install-spark and install-spark-script.py files contain logic for copying jars into a new 'classpath' directory used by the SPARK_CLASSPATH variable (spark-env.sh).
The jars I was personally missing were located in /usr/share/aws/emr/emrfs/lib/ and /usr/share/aws/emr/lib/
It seems that you have not imported the proper libraries from with-in the spark-shell.
To do so :
import path.to.Class
or more likely if you want to import the RDD class, per say:
import org.apache.spark.rdd.RDD

How to run GIS codes through hadoop's prompt?

I am running a GIS code through hadoop's prompt in following manner:
Wrote the GIS code in Eclipse including all the GIS jars (relevant).
Went into the dir. where my eclipse workspace is.
Compiled the code by adding all the relevant jars in the classpath. *(The compilation was successful).
Built the jar.
Now running the same jar using hadoop: bin/hadoop jar my_jar_file_name.jar my_pkg_structure.Main_class_file
Now, inspite of the code being error free, when i try to execute through hadoop's propmpt, it gives me multiple issues.
Is there a workable alternative way to do the same without any hassles?
Also note, the gid code runs beautifully in eclipse. Since, I have to do Geo processing over hadoop, I need to run it through hadoop's prompt.

Hadoop can't find mahout-examples-$MAHOUT_VERSION-job.jar

I am trying to do a simple clustering job by using mahout on top of hadoop (following this tutorial).
So far, I have hadoop running in a single node mode, I have downloaded and built the mahout core and examples mvn projects, but when I try to run the job, I get a FileNotFound Exception. Here is a screen-shot.
Note that I have checked that the mahout-examples-0.5-job.jar is where it is supposed to be (in D:\mahout\mahout-distribution-0.5\examples\target).

Resources