Can Mapper and Reducer be on separate jars - hadoop

In a MapReduce job, I understand it is possible for the Job runner class itself to reside in a separate jar from the mapper and reducer (check this answer),
and that setJarByClass is the place in the job where I'd pass that separate jar's info.
Is there a way, however, to have the mapper and reducer each in its own separate jar?
Thanks!

Yes, it is possible to have Mapper and Reducer in separate JARs.
What I have done in the past to enable this is:
Place the required JARs that contain the Mapper and Reducer on the HADOOP_CLASSPATH environment variable
Provide the JARs that contain the Mapper and Reducer to the Hadoop Distributed Cache via the -libjars option using Hadoop ToolRunner, if the Mapper/Reducer are not included in the Driver JAR.
Manually load the Mapper and Reducer classes onto the runtime classpath with the appropriate Java ClassLoader
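A minimal sketch of the first two steps, under some assumptions: the class names com.example.MyMapper and com.example.MyReducer and the jar names mapper.jar, reducer.jar and driver.jar are all hypothetical, and the invocation in the comments is illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Illustrative invocation (jar and class names are assumptions):
//   export HADOOP_CLASSPATH=mapper.jar:reducer.jar
//   hadoop jar driver.jar com.example.DriverTool -libjars mapper.jar,reducer.jar <input> <output>
public class DriverTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "mapper-and-reducer-in-separate-jars");
    job.setJarByClass(DriverTool.class); // points at the driver jar only

    // The Mapper and Reducer classes live in the jars passed via -libjars; they resolve
    // here because HADOOP_CLASSPATH put those jars on the client-side classpath.
    job.setMapperClass(Class.forName("com.example.MyMapper").asSubclass(Mapper.class));
    job.setReducerClass(Class.forName("com.example.MyReducer").asSubclass(Reducer.class));

    // Assumed output types for the hypothetical mapper/reducer.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new DriverTool(), args));
  }
}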

Related

How does HBase add its dependency jars and use HADOOP_CLASSPATH?

48. HBase, MapReduce, and the CLASSPATH
By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under $HBASE_CONF_DIR or the HBase classes.
To give the MapReduce jobs the access they need, you could add hbase-site.xml to $HADOOP_HOME/conf and add HBase jars to the $HADOOP_HOME/lib directory. You would then need to copy these changes across your cluster. Or you could edit $HADOOP_HOME/conf/hadoop-env.sh and add HBase dependencies to the HADOOP_CLASSPATH variable. Neither of these approaches is recommended because it will pollute your Hadoop install with HBase references. It also requires you to restart the Hadoop cluster before Hadoop can use the HBase data.
The recommended approach is to let HBase add its dependency jars and use HADOOP_CLASSPATH or -libjars.
I'm learning how HBase interacts with MapReduce.
I know what the two ways above mean, but I don't know how to configure the recommended way.
Could anyone tell me how to configure it in the recommended way?
As the docs show, prior to running hadoop jar, you can export HADOOP_CLASSPATH=$(hbase classpath) and you can use hadoop jar ... -libjars [...]
The true recommended way would be to bundle your HBase dependencies as an uber JAR in your MapReduce application.
The only caveat is that you need to ensure that your project uses the same (or a compatible) hbase-mapreduce client version as the server.
That way, you don't need any extra configuration, except maybe specifying the hbase-site.xml.
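For reference, the "let HBase add its dependency jars" route from the docs usually looks like this in the driver. This is a sketch with a hypothetical table name ("mytable") and mapper; as I understand it, TableMapReduceUtil.initTableMapperJob also calls addDependencyJars on the job, so the HBase client jars travel with the job without touching HADOOP_CLASSPATH or -libjars.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseMrDriver {

  // Hypothetical mapper: emits (row key, 1) for every row it scans.
  public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, IntWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(row, new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
    Job job = Job.getInstance(conf, "hbase-mapreduce-sketch");
    job.setJarByClass(HBaseMrDriver.class);

    TableMapReduceUtil.initTableMapperJob(
        "mytable",                     // input table name (assumption)
        new Scan(),                    // full-table scan with default settings
        RowCountMapper.class,
        ImmutableBytesWritable.class,  // mapper output key
        IntWritable.class,             // mapper output value
        job);                          // also adds the HBase dependency jars to the job

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class); // map-only sketch, discard output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}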

Adding dependent jars for a UDF in Pig

I have a UDF which I use to do custom processing on the records. In the eval function I am using a third-party jar for processing. I looked at the job jar file, but it does not include this dependency. Is there any way to include a dependent jar in the job jar?
(For testing I am running the cluster in local mode.)
Or can I use the distributed cache to make the dependent jar available to the UDF?
I've tried registering the dependent jars in Pig. For the first registered jar (all UDFs are bundled in this jar) I do not face any issues. But for the second jar, I am facing issues when the UDF tries to access a class from it.
REGISTER '/home/user/pig/udfrepository/projectUDF.jar';
REGISTER '/home/user/thridpartyjars/xyz.jar';
The logs I get on the console are like this:
2013-08-11 10:35:02,485 [Thread-14] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
java.lang.NoSuchMethodError: org.xyz.abc.convertToOtherFormat(Lorg/DateTimeZone;)Lorg/DateTime;
at com.myproject.MyUDF.exec(MyUDF.java:70)
Any help on this is highly appreciated.
Thanks in advance.
I found the same issue resolved and documented here:
http://hadooptips.wordpress.com/2013/08/13/nosuchmethoderror-while-using-joda-time-2-2-jar-in-pig/
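For context, the kind of UDF described above might look roughly like this. It is a hypothetical reconstruction, using joda-time (the library the linked post is about) as the third-party dependency; the class and field names are not from the question.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

public class MyUDF extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0) {
      return null;
    }
    String raw = (String) input.get(0);
    // This call resolves against whichever copy of the third-party jar is on the task
    // classpath at runtime; a NoSuchMethodError like the one in the log usually means an
    // older, conflicting copy of the same library is being picked up ahead of the
    // REGISTERed jar.
    return new DateTime(raw, DateTimeZone.UTC).toString();
  }
}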

Hadoop - submit a job with lots of dependencies (jar files)

I want to write some sort of "bootstrap" class which will watch MQ for incoming messages and submit map/reduce jobs to Hadoop. These jobs use some external libraries heavily. For the moment I have the implementation of these jobs packaged as a ZIP file with bin, lib and log folders (I'm using maven-assembly-plugin to tie things together).
Now I want to provide small wrappers for Mapper and Reducer, which will use parts of the existing application.
As far as I have learned, when a job is submitted, Hadoop tries to find the JAR file which has the mapper/reducer classes and copies this jar over the network to the data nodes that will process the data. But it's not clear how I tell Hadoop to copy all the dependencies.
I could use maven-shade-plugin to create an uber-jar with the job and its dependencies, and another jar for the bootstrap (which would be executed with the hadoop shell script).
Please advise.
One way could be to put the required jars in the distributed cache. Another alternative would be to install all the required jars on the Hadoop nodes and tell the TaskTrackers about their location. I would suggest you go through this post once; it talks about the same issue.
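A sketch of the distributed-cache route, assuming the dependency jars have already been copied to an HDFS directory; the /apps/myjob/lib path used in the comments and the class name are illustrative.

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DependencySetup {
  // Adds every jar found under an HDFS lib directory (e.g. /apps/myjob/lib) to the
  // distributed cache and to the classpath of the map and reduce tasks.
  public static void addLibJars(Job job, String hdfsLibDir) throws IOException {
    FileSystem fs = FileSystem.get(job.getConfiguration());
    for (FileStatus status : fs.listStatus(new Path(hdfsLibDir))) {
      if (status.getPath().getName().endsWith(".jar")) {
        job.addFileToClassPath(status.getPath());
      }
    }
  }
}

In the driver you would call something like DependencySetup.addLibJars(job, "/apps/myjob/lib"); before submitting the job.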
Use Maven to manage the dependencies and ensure the correct versions are used during builds and deployment. Popular IDEs have Maven support that means you don't have to worry about building classpaths for edit and build. Finally, you can instruct Maven to build a single jar (a "jar-with-dependencies") containing your app and all dependencies, making deployment very easy.
As for dependencies like Hadoop, which are guaranteed to be on the runtime classpath, you can define them with a scope of "provided" so they're not included in the uber jar.
Use the -libjars option of the hadoop launcher script to specify dependencies for jobs running on remote JVMs;
use the $HADOOP_CLASSPATH variable to set dependencies for the JobClient running on the local JVM.
A detailed discussion is here: http://grepalex.com/2013/02/25/hadoop-libjars/

How could I run a MapReduce program in Java with TestNG?

I have a MapReduce program based on Hadoop. I know how to run it with the ${hadoop_home}/bin/hadoop jar command.
However, I want to run this program with TestNG. What should I do to start a TestNG case with Hadoop?
If you're looking for a unit testing framework, look at MRUnit (Apache). You can wrap the driver code in either JUnit or, I'm assuming, TestNG (disclaimer: I've only ever used JUnit).
This allows you to push values through a mapper or reducer and assert that particular output / counters came out.
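A minimal sketch of an MRUnit test driven by TestNG; the WordCountMapper here is a hypothetical mapper under test, not something from the question.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.testng.annotations.Test;

public class WordCountMapperTest {

  // Hypothetical mapper under test: emits (word, 1) for every whitespace-separated token.
  public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String word : value.toString().split("\\s+")) {
        context.write(new Text(word), ONE);
      }
    }
  }

  @Test
  public void emitsOneCountPerWord() throws IOException {
    MapDriver.<LongWritable, Text, Text, IntWritable>newMapDriver(new WordCountMapper())
        .withInput(new LongWritable(0), new Text("hello hello"))
        .withOutput(new Text("hello"), new IntWritable(1))
        .withOutput(new Text("hello"), new IntWritable(1))
        .runTest();
  }
}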
Finally I figured it out; the key point is the classpath setting.
Besides hadoop-core-.jar, the following must all be on the runtime classpath: all jar files under ${hadoop_home}/share/hadoop/lib/, the jar files under ${hadoop_home}/share/hadoop/lib/jsp-/, any other jar files the program code depends on, "${hadoop.home}/conf", and "${hadoop.home}/share/hadoop". The MR program itself must also be compiled, packed as a jar file, and added to the runtime classpath.
Once this setup is finished, TestNG test cases invoking the MR method can be run in the regular way.

Hadoop Distributed Cache doesn't work

I'm new to Hadoop. I'm using Hadoop 0.22.
In the Driver I used this code:
Job job = Job.getInstance(configuration);
...
job.addArchiveToClassPath(new Path(JAR_DIR));
...
In the Map class, what code do I have to use to add the jar to the local classpath?
More details
I have a job that needs htmlunit.jar in the map and reduce phases. I add this jar to the classpath with the code above, but when I submit the job, I get a ClassNotFoundException at the line where I use htmlunit references. If the code above is OK, and the DistributedCache automatically adds the jar to the tasktrackers' classpath, what could be the problem?
Do I also have to use the option -libjars htmlunit.jar when I submit the job? Do I have to use another Hadoop component?
You don't need to do anything.
Once you add a jar to the job classpath, what you're saying is:
"include this in the class path of the map and reduce jobs"
So long as your mappers and reducers extend the Mapper and Reducer base classes, it will 'just work'.
Worth noting: you should probably use addFileToClassPath for each individual jar you need instead.
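Concretely, in place of the addArchiveToClassPath(new Path(JAR_DIR)) call from the question, that suggestion would look something like the fragment below (keeping the question's JAR_DIR, with htmlunit.jar as the jar that is needed; the exact file name is an assumption).

// Ship each required jar individually via the distributed cache and put it on the task classpath.
job.addFileToClassPath(new Path(JAR_DIR + "/htmlunit.jar"));
// Repeat for any other jars the mapper needs (htmlunit itself has transitive dependencies).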
An alternative (we do this) is to create one single jar containing your source and your dependencies.
Build your code jar as usual, then create a subdirectory in the jar called 'lib' and add all of your dependency jars in there. Then your entire job is self-contained and you don't need to worry about adding other jars to the distributed cache.
So for example you'd have a jar with the following contents:
/com/example/Something.class
/com/example/SomethingElse.class
/lib/dependency.jar
/lib/dependency2.jar
(a jar is just a zip file, so you can use regular zip creation utilities to build it)
For various reasons this also performs better than adding the .class files of your dependencies to the jar directly.
