I have written a Hadoop 1.0.4 application that runs fine locally in pseudo-distributed mode. I have also installed Cloudera Hadoop 4 on my cluster. I thought that CDH4 ran Hadoop 1.0.4, since that release is listed as stable on the Hadoop site, but that seems not to be the case. When I run the application on my cluster I get the following errors:
12/11/27 16:14:38 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/11/27 16:14:38 INFO input.FileInputFormat: Total input paths to process : 16
12/11/27 16:14:39 INFO mapred.JobClient: Running job: job_201211271520_0004
12/11/27 16:14:40 INFO mapred.JobClient: map 0% reduce 0%
12/11/27 16:14:50 INFO mapred.JobClient: Task Id : attempt_201211271520_0004_m_000013_0, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
12/11/27 16:14:50 INFO mapred.JobClient: Task Id : attempt_201211271520_0004_m_000000_0, Status : FAILED
... and so on...
Am I right in my assumption that this is because CDH4 is not compatible with Hadoop 1.0.4? And if so, does anyone know which version is compatible with Hadoop 1.0.4? I'd rather switch Cloudera software than rewrite my application.
You are correct; CDH3 uses version 0.20.2 and CDH4 uses version 2.0.0. The nomenclature for Hadoop versions is a mess, and I don't pretend to understand it. But it looks like you may be able to use CDH3, based on the following statement in this blog post by Cloudera:
"The CDH3 distribution incorporated the 0.20.2 Apache Hadoop release plus the features of the 0.20.append and 0.20.security branches that collectively are now known as “1.0.” The Apache Hadoop in CDH3 has been the equivalent of the recently announced Apache Hadoop 1.0 for approximately a year now."
If this is the case, I would give CDH3 a try. If it doesn't work, you may just have to look for something besides Cloudera's installation.
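Either way, it is worth confirming what the cluster is actually running before rebuilding or reinstalling anything. A minimal check, run from any node with the Hadoop client installed (on CDH4 this should report a 2.0.0-cdh4.x line rather than 1.0.4):

```
# Print the Hadoop version the installed client libraries were built from
hadoop version
```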
How can I know whether my cluster has been set up using Hortonworks, Cloudera, or a plain installation of Hadoop components?
Also, how can I find out the port numbers of the various services?
It is difficult to identify the Hadoop distribution from port numbers, since the Apache, Hortonworks, and Cloudera distributions use different port numbers.
Another option is to check for cluster-management agents: Cloudera Manager's agent start-up script is /etc/init.d/cloudera-scm-agent, and Hortonworks' Ambari agent start-up script is /etc/init.d/ambari-agent. Vanilla Apache Hadoop will not have any such agents on the server.
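A quick way to run that check (just a sketch; it only looks for the default init-script names mentioned above):

```
# Look for distribution-specific management agents; no output suggests a vanilla Apache install
ls /etc/init.d/ | grep -E 'cloudera-scm-agent|ambari-agent'
```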
Another option is to check the Hadoop classpath; the command below prints it.
`hadoop classpath`
Most Hadoop distributions include the distro name in the classpath. If the classpath doesn't contain either of the keywords below, the setup is a plain Apache installation.
hdp - (Hortonworks)
cdh - (Cloudera)
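For example, a one-liner along those lines (only a sketch) would be:

```
# Print the first distro keyword found in the classpath; no output suggests plain Apache Hadoop
hadoop classpath | grep -i -o -E 'cdh|hdp' | head -1
```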
The simplest way is to run the hadoop version command; the output tells you which version of Hadoop you have and also which distribution, and distribution version, you are running. If you find words like cdh or hdp in it, then cdh stands for Cloudera and hdp for Hortonworks.
For example, I am on Cloudera, and the hadoop version command gives output like the one below.
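An illustrative example of what that looks like on a Cloudera cluster (the version and build details are only placeholders and will differ on your machine):

```
$ hadoop version
Hadoop 2.6.0-cdh5.4.2
...   # build, compile and checksum details follow
```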
The first line gives the Hadoop version, followed by the Hadoop distribution and its version.
Hope this will help.
The hdfs version command will also give you the version of Hadoop and its distribution.
Is there any option to make the Hadoop job history server highly available? I am using Hadoop 2.7.
Also, ResourceManager high availability is not as mature as the NameNode's; even start-yarn.sh does not start the standby RM. Is there any out-of-the-box solution for both of those?
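Assuming ResourceManager HA has already been configured in yarn-site.xml, one thing worth knowing is that in Hadoop 2.7 start-yarn.sh only starts the ResourceManager on the node where it is run, so the standby RM has to be started by hand. A minimal sketch (rm1 and rm2 stand in for whatever IDs you set in yarn.resourcemanager.ha.rm-ids):

```
# On the standby ResourceManager host: start the RM explicitly (script lives in $HADOOP_HOME/sbin)
yarn-daemon.sh start resourcemanager

# From any node: check which ResourceManager is active and which is standby
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
```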
I am new to Hadoop 2.5.1. As I had already installed Hadoop 1.0.4 previously, I thought the installation process would be the same, so I followed this tutorial:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Everything was fine; I have even given these settings in core-site.xml:
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
But I have seen on several sites this value given as 9000.
I also made changes in yarn-site.xml.
Still, everything works fine when I run a MapReduce job. But my question is:
when I run the jps command, it gives me this output:
hduser#secondmaster:~$ jps
5178 ResourceManager
5038 SecondaryNameNode
4863 DataNode
5301 NodeManager
4719 NameNode
6683 Jps
I don't see a TaskTracker or JobTracker in the jps output. Where are these daemons running?
And without these daemons, how am I able to run a MapReduce job?
Thanks,
Sreelatha K.
From Hadoop 2.0 onwards, the default processing framework has changed from classic MapReduce to YARN. You are using YARN, so you will not see a JobTracker or TaskTracker there: the JobTracker and TaskTracker are replaced by the ResourceManager and NodeManager respectively in YARN.
But you still have the option to use the classic MapReduce framework instead of YARN.
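If you want to confirm which framework the job client will use, one quick check (a sketch, assuming $HADOOP_CONF_DIR points at your configuration directory) is the mapreduce.framework.name property in mapred-site.xml:

```
# Shows the execution framework MapReduce jobs will use: "yarn", "classic", or "local"
grep -A 1 'mapreduce.framework.name' "$HADOOP_CONF_DIR"/mapred-site.xml
```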
In Hadoop 2 there is an alternative way to run MapReduce jobs, called YARN. Since you have made changes in yarn-site.xml, MapReduce processing happens using YARN, not the traditional MapReduce framework. That is probably the reason you don't see a TaskTracker and JobTracker listed after executing the jps command. Note that the ResourceManager and NodeManager are the YARN daemons.
YARN is the next generation of the resource manager; it can integrate with Apache Spark, Storm, and many more tools beyond MapReduce jobs.
I'd like to know your input as to why this error is happening. On the production environment onshore, we're using CDH4. On our local testing environment, we're just using Apache Hadoop v2.2.0. When I run the jar compiled on CDH4, the MR jobs execute fine. But when I run the same jar on Hadoop v2.2.0 (YARN enabled), I get this error:
INFO mapreduce.Job: Task Id : attempt_1391062333435_0001_m_000000_0, Status : FAILED
Error: java.lang.UnsupportedOperationException: Not implemented by the KosmosFileSystem FileSystem implementation
The log showed the map tasks ran successfully, but the reduce tasks all failed with the above error. There aren't many hits on Google for this error, so I have nowhere to turn but here.
Any thoughts guys? Thanks.
Sorry for the lateness of this reply.
This problem was solved when we synched our environment with the one onshore. That is, instead of using plain Apache Hadoop, we used the Cloudera distribution.
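For anyone who hits the same error and cannot simply switch distributions, one thing worth checking (only a guess, and the jar name myjob.jar below is hypothetical) is whether the job jar bundles Hadoop FileSystem classes from a different Hadoop version than the cluster, since a fat jar can shadow the cluster's own classes:

```
# List any Hadoop FileSystem classes packaged inside the job jar (myjob.jar is a placeholder)
jar tf myjob.jar | grep 'org/apache/hadoop/fs/'
```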
I'm taking my first steps with Hadoop. I've set up CDH4.5 in distributed mode (on two virtual machines). I'm having problems running MapReduce jobs with YARN. I could successfully launch a DistributedShell application (from the CDH examples), but once I run a MapReduce job, it just hangs there forever.
This is what I'm trying to launch:
sudo -uhdfs yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 1 1
These are the last lines of the job submission log:
13/12/10 23:30:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1386714123362_0001
13/12/10 23:30:02 INFO client.YarnClientImpl: Submitted application application_1386714123362_0001 to ResourceManager at master/192.168.122.175:8032
13/12/10 23:30:02 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1386714123362_0001/
13/12/10 23:30:02 INFO mapreduce.Job: Running job: job_1386714123362_0001
The node manager's log doesn't get any new messages once I run the job.
This is what I see on the resource manager's web page regarding the job:
State - ACCEPTED
FinalStatus - UNDEFINED
Progress - (progress bar in 0%)
Tracking UI - UNASSIGNED
Apps Submitted - 1
Apps Pending - 1
Apps Running - 0
I found this at http://hadoop.apache.org/docs/r2.0.6-alpha/hadoop-project-dist/hadoop-common/releasenotes.html:
YARN-300. Major bug reported by shenhong and fixed by Sandy Ryza (resourcemanager , scheduler)
After YARN-271, fair scheduler can infinite loop and not schedule any application.
After yarn-271, when yarn.scheduler.fair.max.assign<=0, when a node was been reserved, fairScheduler will infinite loop and not schedule any application.
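If you want to check whether that is what you are hitting, a minimal sketch (assuming the configuration lives under /etc/hadoop/conf, the usual CDH location) is to see how the property is set; if it is zero or negative, remove or correct it:

```
# Check whether yarn.scheduler.fair.max.assign is set to a non-positive value
grep -B 1 -A 2 'yarn.scheduler.fair.max.assign' /etc/hadoop/conf/yarn-site.xml
```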
Try with a newer version, i.e. one above 2.0.
This was probably caused by a system resource issue; I fixed it by restarting my system.