Running Mahout Job on Hadoop: Got ClassNotFoundException

I am trying to run a Mahout KMeans example on the Cloudera Quickstart VM for Hadoop. I read here (a Cloudera blog post) and here (a Stack Overflow post) that I can use the -libjars option to attach the Mahout jars.
I put the jar files KMeansHadoop.jar, mahout-core-0.9.jar, and mahout-math-0.9.jar in the same folder and ran:
hadoop jar KMeansHadoop.jar SimpleKMeansClustering -libjars mahout-core-0.9.jar mahout-math-0.9.jar
But I still get the error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mahout/math/Vector
What am I doing wrong? Thank you!

Firstly, I believe that the -libjars values need to be comma-separated. But that only makes your third-party jars available to the cluster. You may also need to use HADOOP_CLASSPATH to make those jars available on the client side (e.g., on the edge node from which you're kicking off your job).
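For example, here is a minimal sketch of the corrected invocation, assuming the Mahout jars sit in the working directory as described in the question:
# Make the Mahout jars visible to the client-side JVM as well
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:mahout-core-0.9.jar:mahout-math-0.9.jar
# -libjars takes one comma-separated list; it is only honored if the driver uses ToolRunner/GenericOptionsParser
hadoop jar KMeansHadoop.jar SimpleKMeansClustering -libjars mahout-core-0.9.jar,mahout-math-0.9.jar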
Check out this post. It helped me a lot when I was working my way through this exact issue with getting Driven to work with Cascading.

Related

What is 'NoClassDefFoundError: clojure.core.protocols$seq_reduce' while running Storm topology?

When I try to run the word counting topology as storm jar wordcount.jar words.txt, I get the below error:
My topology is as below:
Why might I be getting this error? I'm using storm-0.8.2 as an external jar and have Storm 0.8.2 installed.
Read the stacktrace again please:
Exception in thread "main" java.lang.NoClassDefFoundError: streaming.words.txt
Does this ring a bell? Check your setup again.
Holy smokes, gcj? Clojure requires an actual Java runtime environment, 1.5+. I have never found gcj to be good for much of anything, but I'm especially certain it can't run Clojure.
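A quick way to confirm which runtime is actually on your path (just a sanity check, not a fix in itself):
# Should report a real JVM (e.g. HotSpot/OpenJDK) at version 1.5 or later, not gcj
java -version
which java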

Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses - submitting a job to a remote cluster

I recently upgraded my cluster from Apache Hadoop 1.0 to CDH 4.4.0. I have a WebLogic server on another machine from which I submit jobs to this remote cluster via the MapReduce client. I still want to use MR1, not YARN. I compiled my client code against the client jars in the CDH installation (/usr/lib/hadoop/client/*).
I get the error below when creating a JobClient instance. There are many posts about the same issue, but all the solutions cover submitting a job to a local cluster rather than a remote one, and none of them cover submitting from a WebLogic container as in my case.
JobClient jc = new JobClient(conf);
Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
But running from the command prompt on the cluster works perfectly fine.
Appreciate your timely help!
I had a similar error and added the following jars to classpath and it worked for me:
hadoop-mapreduce-client-jobclient-2.2.0.2.0.6.0-76:hadoop-mapreduce-client-shuffle-2.3.0.jar:hadoop-mapreduce-client-common-2.3.0.jar
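In a plain client JVM that might look something like the following; the paths are placeholders, and the exact versions should match your cluster:
# Add the missing MapReduce client jars to the classpath of the JVM that creates the JobClient
export CLASSPATH=$CLASSPATH:/path/to/hadoop-mapreduce-client-jobclient.jar:/path/to/hadoop-mapreduce-client-shuffle.jar:/path/to/hadoop-mapreduce-client-common.jar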
It's likely that your app is looking at your old Hadoop 1.x configuration files. Maybe your app hard-codes some config? This error tends to indicate you are using the new client libraries but that they are not seeing new-style configuration.
The new-style configuration must exist, since the command-line tools see it fine. Check your HADOOP_HOME and HADOOP_CONF_DIR environment variables too, although those are what the command-line tools tend to pick up, and they work.
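A quick way to sanity-check what the client is actually picking up (the conf path below is an assumption; use whatever HADOOP_CONF_DIR points at):
echo $HADOOP_HOME $HADOOP_CONF_DIR
# The new client libraries expect to find this property in the active configuration
grep -A1 "mapreduce.framework.name" /etc/hadoop/conf/mapred-site.xml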
Note that you need to install the 'mapreduce' service and not 'yarn' in CDH 4.4 to make it compatible with MR1 clients. See also the '...-mr1-...' artifacts in Maven.
In my case, this error was due to the version of the jars; make sure that you are using the same versions as on the server.
export HADOOP_MAPRED_HOME=/cloudera/parcels/CDH-4.1.3-1.cdh4.1.3.p0.23/lib/hadoop-0.20-mapreduce
In my case, I was running Sqoop 1.4.5 and pointing it at the latest hadoop 2.0.0-cdh4.4.0, which also contains the YARN libraries; that's why it was complaining.
When I pointed Sqoop at hadoop-0.20/2.0.0-cdh4.4.0 (MR1, I think), it worked.
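Roughly, that amounts to something like the following before invoking Sqoop (the parcel path is an assumption modelled on the export shown above):
# Point Sqoop at the MR1 libraries instead of the YARN-based ones
export HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce
sqoop version   # subsequent Sqoop commands in this shell should pick up the MR1 client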
As with Akshay (see the comment by Setob_b), all I needed to fix was to get hadoop-mapreduce-client-shuffle-*.jar on my classpath.
As follows for Maven:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-shuffle</artifactId>
  <version>${hadoop.version}</version>
</dependency>
In my case, strangely, this error was because in my core-site.xml file I had used the IP address rather than the hostname.
The moment I put the hostname in place of the IP address in core-site.xml and mapred-site.xml and re-installed the MapReduce lib files, the error was resolved.
In my case, I resolved this by using hadoop jar instead of java -jar.
It's useful because hadoop jar provides the configuration context from hdfs-site.xml, core-site.xml, and so on.
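In other words (the jar and driver class names here are placeholders):
# Runs without the Hadoop configuration directory on the classpath, so the client cannot find the cluster:
java -jar MyJob.jar
# Runs with the client configuration (core-site.xml, hdfs-site.xml, mapred-site.xml) on the classpath:
hadoop jar MyJob.jar com.example.MyDriver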

Cloudera Manager: Where do I put Java ClassPath for MapReduce jobs?

I've got Hadoop-Lzo working happily on my local pseudo-cluster but the second I try the same jar file in production, I get:
java.lang.RuntimeException: native-lzo library not available
The libraries are verified to be on the DataNodes, so my question is:
In what screen / setting do I specify the location of the native-lzo library?
For MapReduce, you need to add the entries to the MapReduce Client Environment Safety Valve. You can find it on the View and Edit tab under Configuration. Then add these lines there:
HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/*
JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
Also add the LZO codecs to the io.compression.codecs property under the MapReduce service. To do that, go to io.compression under the View and Edit tab under Configuration and add these lines:
com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec
Do not forget to restart your MR daemons after making the changes. Once they are restarted, redeploy your MR client configuration.
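A rough way to double-check the result afterwards (the parcel and client-config paths are assumptions for a parcel-based install):
# The native LZO libraries the tasks need should be present on each node
ls /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
# The codecs should appear in the redeployed client configuration
grep -c "com.hadoop.compression.lzo" /etc/hadoop/conf/*.xml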
For detailed help on how to use LZO, you can visit this link:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Installation-Guide/cmig_install_LZO_Compression.html
HTH
Try sudo apt-get install lzop on your TaskTracker nodes.

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a Hadoop cluster with CDH 4.2.0 installed and ran into this error. It's been fixed in later versions of Hadoop, but I don't have access to update the cluster. Is there a way to tell Hadoop to use this jar when running my job, through command-line arguments like:
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket. Or must Hadoop be completely recompiled with the new jar? I'm new to Hadoop, so I don't know all the command-line options that are available.
I'm not sure how that would work, since when you execute the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The ticket says the issue only occurs when you're using MR2, so unless you really need YARN, you're probably better off using the MR1 library to run your map/reduce job.
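If you do want to experiment with the patched jar anyway, one thing to try, with no guarantee that the task side will also pick it up, is to prepend it to the client classpath (the jar path is a placeholder):
export HADOOP_CLASSPATH=/path/to/patched/hadoop-mapreduce-client-core.jar
# Ask the hadoop launcher script to put user classpath entries before its own
export HADOOP_USER_CLASSPATH_FIRST=true
hadoop jar MyJob.jar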

Hadoop streaming with python on Windows

I'm using Hortonworks HDP for Windows and have it successfully configured with a master and 2 slaves.
I'm using the following command:
bin\hadoop jar contrib\streaming\hadoop-streaming-1.1.0-SNAPSHOT.jar -files file:///d:/dev/python/mapper.py,file:///d:/dev/python/reducer.py -mapper "python mapper.py" -reducer "python reduce.py" -input /flume/0424/userlog.MDAC-HD1.MDAC.local..20130424.1366789040945 -output /flume/o%1 -cmdenv PYTHONPATH=c:\python27
The mapper runs through fine, but the log reports that the reduce.py file wasn't found. From the exception, it looks like the Hadoop task runner is creating the symlink for the reducer to the mapper.py file.
When I checked the job configuration file, I noticed that mapred.cache.files is set to:
hdfs://MDAC-HD1:8020/mapred/staging/administrator/.staging/job_201304251054_0021/files/mapper.py#mapper.py
It looks like although the reduce.py file is being added to the jar file, it's not being included in the configuration correctly and can't be found when the reducer tries to run.
I think my command is correct. I've tried using -file parameters instead, but then neither file is found.
Can anyone see or know of an obvious reason?
Please note, this is on Windows.
EDIT: I've just run it locally and it worked, so it looks like my problem may be with the copying of the files around the cluster.
Still welcome input!
Well, that's embarrassing... my first question and I answer it myself.
I found the problem by renaming the Hadoop conf file to force default settings, which meant running with the local job tracker.
The job ran properly, and that gave me room to work out what the problem was; it looks like communication around the cluster isn't as complete as it needs to be.
Looking at your command, it shows "file:///d:/dev/python/reducer.py" for the -files option, but you specify reduce.py for -reducer. Could this be causing the problem? Sorry, I am not sure.
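If that mismatch is the culprit, the fix would just be to keep the names consistent, assuming the script on disk really is reducer.py:
bin\hadoop jar contrib\streaming\hadoop-streaming-1.1.0-SNAPSHOT.jar -files file:///d:/dev/python/mapper.py,file:///d:/dev/python/reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /flume/0424/userlog.MDAC-HD1.MDAC.local..20130424.1366789040945 -output /flume/o%1 -cmdenv PYTHONPATH=c:\python27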
