Does map-reduce work without YARN? - hadoop

I'm studying Hadoop map-reduce on CentOS 6.5 with Hadoop 2.7.2. I learned that HDFS is just a distributed file system and that YARN administers map-reduce work, so I thought that if I don't start YARN (ResourceManager, NodeManager), map-reduce won't work.
Therefore, I thought wordcount should not be able to perform the map-reduce process on a system running only HDFS, without YARN
(in pseudo-distributed mode).
But when I start HDFS without YARN, as you can see below, and execute the wordcount example, it still shows 'Map-Reduce Framework' in the output. What does that mean? Is it possible for HDFS alone to process map-reduce without YARN? Since YARN manages resources and jobs, isn't it right that map-reduce shouldn't work without YARN?
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/input /user/output

It is true that with Hadoop 2.0, YARN takes responsibility for resource management. But even without YARN, MapReduce applications can run using the older flavor of the framework.
mapred-site.xml has a configuration property for this, mapreduce.framework.name:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
  </property>
</configuration>
This property controls whether or not YARN is used. Its possible values are local, classic, and yarn.
The default value is "local". Set it to yarn if you want to use YARN, as shown below.
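For example, a minimal mapred-site.xml that routes jobs through YARN (assuming the ResourceManager and NodeManager daemons are running):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>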

Related

How to submit an MR job to a YARN cluster with ResourceManager HA wrt Hortonworks' HDP?

I am trying to understand how to submit an MR job to a YARN-based Hadoop cluster.
Case 1:
When there is only one ResourceManager (that is, no HA), we can submit the job like this (which I actually used and believe is correct):
hadoop jar word-count.jar com.example.driver.MainDriver -fs hdfs://master.hadoop.cluster:54310 -jt master.hadoop.cluster:8032 /first/dir/IP_from_hdfs.txt /result/dir
As can be seen, the RM is running on port 8032 and the NN on 54310, and I am specifying the hostnames because there is only one master.
Case 2:
Now, for the case when there is HA for both the NN and the RM, how do I submit the job? I am not able to understand this, because now we have two RMs and two NNs (active/standby), and I understand that ZooKeeper keeps track of failures. So, from the perspective of a client trying to submit a job, do I need to know the exact NN and RM, or is there some logical name we have to use when submitting the job?
Can anyone please help me understand this?
With or without HA, the command to submit the job remains the same:
hadoop jar <jar> <mainClass> <inputpath> <outputpath> [args]
Using -fs and -jt is optional; they are not needed unless you want to specify a NameNode and JobTracker different from the ones in the configuration files.
If the fs.defaultFS property in core-site.xml, and the properties defining the nameservice (dfs.nameservices) and its namenodes, are configured properly in the client's hdfs-site.xml, the active master will be chosen whenever a client operation is performed.
By default, the DFS client uses this Java class to determine which NameNode is currently active:
<property>
  <name>dfs.client.failover.proxy.provider.<nameserviceID></name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
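For context, here is a sketch of the client-side HA configuration that makes this work; the nameservice ID mycluster, the namenode IDs nn1/nn2, and the hostnames are illustrative placeholders, not values taken from the question.
core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>
hdfs-site.xml:
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>master1.example.com:54310</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>master2.example.com:54310</value>
</property>
With this in place, the job is submitted against the logical nameservice, and no -fs or -jt flags are needed:
hadoop jar word-count.jar com.example.driver.MainDriver /first/dir/IP_from_hdfs.txt /result/dir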

How to separately specify a set of nodes for HDFS and others for MapReduce jobs?

While deploying Hadoop, I want some set of nodes to run the HDFS server but not run any MapReduce tasks.
For example, suppose there are two nodes, A and B, that run HDFS.
I want to exclude node A from running any map/reduce task.
How can I achieve this? Thanks.
If you do not want to run any MapReduce jobs on a particular node or set of nodes:
Stopping the nodemanager daemon is the simplest option if the nodes are already running.
Run this command on the nodes where MR tasks should not be attempted:
yarn-daemon.sh stop nodemanager
Alternatively, exclude the hosts using the property yarn.resourcemanager.nodes.exclude-path in yarn-site.xml:
<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/path/to/excludes.txt</value>
  <description>Path of the file containing the hosts to exclude. Should be readable by the YARN user.</description>
</property>
After adding this property, refresh the ResourceManager:
yarn rmadmin -refreshNodes
The nodes listed in the file will be excluded from running MapReduce tasks; an example excludes file is shown below.
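For illustration, the excludes file is a plain list of hostnames, one per line (nodeA.example.com stands in for the actual hostname of node A):
/path/to/excludes.txt:
nodeA.example.com
After editing the file, run yarn rmadmin -refreshNodes again so the ResourceManager picks up the change.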
To answer my own question:
If you use YARN for resource management,
check franklinsijo's answer.
If you use the classic (standalone) MapReduce framework,
make a list of the nodes that will run MR tasks and specify its path as the mapred.hosts property (see https://hadoop.apache.org/docs/r1.2.1/mapred-default.html); a sketch follows.
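For example, a sketch of the corresponding mapred-site.xml entry; the includes file path is a placeholder, and the file should list only the hosts allowed to run MR tasks (node B but not node A):
<property>
  <name>mapred.hosts</name>
  <value>/path/to/includes.txt</value>
</property>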

Why does mapred-site.xml in Hadoop 2 have a mapreduce.jobtracker.address property?

YARN is the second generation of Hadoop's processing layer; it no longer uses the jobtracker daemon and replaces it with the ResourceManager. But why, then, does mapred-site.xml in Hadoop 2 still have a mapreduce.jobtracker.address property?
Also, in order to run a Hadoop MapReduce application from Eclipse, is there an Eclipse plugin for YARN? All the plugins I can find are specific to the jobtracker.
Thanks in advance.
Hadoop 2 can run on a YARN cluster as well as on a classic MR (MRv1) cluster.
In case one needs to run Hadoop 2 compiled code on a non-YARN cluster, one has to specify the JobTracker url:port, which is what this property is for.
Hadoop 2 is binary compatible with both MR and YARN clusters.
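For example, a sketch of the mapred-site.xml entries for the non-YARN case; the JobTracker hostname and port below are placeholders:
<property>
  <name>mapreduce.framework.name</name>
  <value>classic</value>
</property>
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>jobtracker.example.com:8021</value>
</property>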

Spark with custom Hadoop FileSystem

I already have a cluster with YARN, configured to use a custom Hadoop FileSystem in core-site.xml:
<property>
  <name>fs.custom.impl</name>
  <value>package.of.custom.class.CustomFileSystem</value>
</property>
I want to run a Spark job on this YARN cluster, which reads an input RDD from this CustomFileSystem:
final JavaPairRDD<String, String> files =
    sparkContext.wholeTextFiles("custom://path/to/directory");
Is there some way I can do this without re-configuring Spark? i.e. Can I point Spark to the existing core-site.xml, and what would be the best way to do that?
Set HADOOP_CONF_DIR to the directory that contains core-site.xml. (This is documented in Running Spark on YARN.)
You will still need to make sure package.of.custom.class.CustomFileSystem is on the classpath.
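A minimal sketch of a submission, assuming the client configs live in /etc/hadoop/conf and the custom FileSystem classes are packaged as custom-fs.jar (both names, and the driver class below, are assumptions):
# Point Spark at the existing Hadoop client configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf
# Ship the jar containing package.of.custom.class.CustomFileSystem with the job
spark-submit \
  --master yarn \
  --jars /path/to/custom-fs.jar \
  --class com.example.SparkDriver \
  my-spark-app.jar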

Unable to see TaskTracker and JobTracker after Hadoop 2.5.1 single-node installation

I am new to Hadoop 2.5.1. Since I had already installed Hadoop 1.0.4 previously, I thought the installation process would be the same, so I followed this tutorial:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Everything was fine. I have even given these settings in core-site.xml:
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
But I have seen this value given as 9000 on several sites.
I also made changes in yarn-site.xml.
Still, everything works fine when I run a MapReduce job. But my question is:
when I run the command jps it gives me this output:
hduser@secondmaster:~$ jps
5178 ResourceManager
5038 SecondaryNameNode
4863 DataNode
5301 NodeManager
4719 NameNode
6683 Jps
I don't see a TaskTracker or a JobTracker in jps. Where are these daemons running?
And without these daemons, how am I able to run a MapReduce job?
Thanks,
Sreelatha K.
From Hadoop 2.0 onwards, the default processing framework has been changed from classic MapReduce to YARN. You are using YARN, where you cannot see the JobTracker or TaskTracker: in YARN, the JobTracker and TaskTracker are replaced by the ResourceManager and NodeManager respectively.
But you still have the option to use the classic MapReduce framework instead of YARN.
In Hadoop 2 there is an alternative way to run MapReduce jobs, called YARN. Since you have made changes in yarn-site.xml, MapReduce processing happens using YARN, not the traditional MapReduce framework. That is probably the reason why you don't see TaskTracker and JobTracker listed after executing the jps command. Note that ResourceManager and NodeManager are the daemons for YARN.
YARN is the next generation of the resource manager; it can also integrate with Apache Spark, Storm, and many other tools beyond classic map-reduce jobs.
