How to separate Hadoop MapReduce from HDFS? - hadoop

I'm curious if you could essentially separate the HDFS filesystem from the MapReduce framework. I know that the main point of Hadoop is to run the maps and reduces on the machines with the data in question, but I was wondering if you could just change the *.xml files to change the configuration of what machine the jobtracker, namenode and datanodes are running on.
Currently, my configuration is a two-VM setup: one (the master) with NameNode, DataNode, JobTracker, TaskTracker (and the SecondaryNameNode), the other (the slave) with DataNode and TaskTracker. Essentially, what I want to change is to have the master with NameNode, DataNode(s), and JobTracker, and the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each). The bottleneck will be the data transfer between the two VMs for the map and reduce computations, but since the data at this stage is so small I'm not primarily concerned with it. I would just like to know if this configuration is possible, and how to do it. Any tips?
Thanks!

You don't specify this kind of option in the configuration files.
What you have to do is take care of which daemons you start on each machine (you call them VMs, but I think you mean machines).
I suppose you usually start everything using the start-all.sh script, which you can find in the bin directory under the Hadoop installation dir.
If you take a look at this script you will see that it simply calls a number of sub-scripts that start the datanodes, tasktrackers, namenode and jobtracker.
In order to achieve what you've described, I would do the following:
For MapReduce, modify the masters and slaves files like this:
The masters file should contain the name of machine1
The slaves file should contain the name of machine2
Run start-mapred.sh
For HDFS, modify the masters and slaves files like this:
The masters file should contain the name of machine1
The slaves file should contain the name of machine1
Run start-dfs.sh
I have to tell you that I've never tried such a configuration, so I'm not sure it's going to work, but you can give it a try. Anyway, the solution is in this direction!
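A rough sketch of that sequence, assuming a Hadoop 1.x-style install with the masters/slaves files under conf/ and machine1/machine2 as placeholder hostnames:

# MapReduce daemons: JobTracker on machine1, TaskTracker on machine2 only
echo machine1 > $HADOOP_HOME/conf/masters
echo machine2 > $HADOOP_HOME/conf/slaves
$HADOOP_HOME/bin/start-mapred.sh

# HDFS daemons: NameNode and DataNode on machine1 only
echo machine1 > $HADOOP_HOME/conf/masters
echo machine1 > $HADOOP_HOME/conf/slaves
$HADOOP_HOME/bin/start-dfs.sh

The start-*.sh scripts start the master daemon on the machine where they are run and ssh into each host listed in the slaves file to start the worker daemons.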

Essentially, what I want to change is have the master with NameNode, DataNode(s), JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each).
First, I am not sure why you would separate the computation from the storage. The whole purpose of MR data locality is lost, though you might still be able to run the job successfully.
Use the dfs.hosts and dfs.hosts.exclude parameters to control which datanodes can connect to the namenode, and the mapreduce.jobtracker.hosts.filename and mapreduce.jobtracker.hosts.exclude.filename parameters to control which tasktrackers can connect to the jobtracker. One disadvantage of this approach is that the datanodes and tasktrackers still get started on the excluded nodes, even though they aren't part of the Hadoop cluster.
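As a hedged illustration of those parameters (property names as above; the /etc/hadoop/conf/... paths are placeholders):

<!-- hdfs-site.xml: only hosts listed in this file may register as datanodes -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/datanodes.include</value>
</property>

<!-- mapred-site.xml: only hosts listed in this file may register as tasktrackers -->
<property>
  <name>mapreduce.jobtracker.hosts.filename</name>
  <value>/etc/hadoop/conf/tasktrackers.include</value>
</property>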
Another approach is to modify the code to have a separate slaves file for the tasktrackers and the datanodes. Currently this is not supported in Hadoop and would require a code change.

Related

How does Hadoop distribute the data/tasks for MapReduce jobs?

I've set up a Hadoop cluster with 4 nodes, one of which serves as the NameNode for HDFS as well as the YARN master. This node is also the most powerful.
Now, I've distributed 2 text files, one on node01 (the namenode) and one on node03 (a datanode). When running the basic WordCount MapReduce job, I can see in the logs that only node01 was doing any calculations.
My question is why Hadoop didn't decide to do the MapReduce on node03 and transfer the result, instead of transferring the entire book to node01. I also checked: replication is disabled and the book is only available on node03.
So, how does Hadoop decide between transferring the data and setting up the jobs, and in this decision, does it check which machine has more compute power (e.g. did it decide to transfer to node01 because node01 is a 4-core, 4 GB RAM machine vs. 2 cores and 1 GB on node03)?
I couldn't find anything on this topic, so any guidance would be appreciated.
Thank you!
Some more clarifications:
node01 is running a NameNode as well as a DataNode and a ResourceManager as well as a NodeManager. Thus, it serves as "main node" as well as a "compute node".
I made sure to put one file on node01 and one file on node03 by running:
hdfs dfs -put sample1.txt samples on node01 and hdfs dfs -put sample02.txt samples on node03. As replication is disabled, this leads to the data - which was available locally on node01 and node03 respectively - only being stored there.
I verified this using the HDFS web interface. For sample1.txt, it says the blocks are only available on node01; for sample2.txt, it says the blocks are only available on node03.
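If you want to double-check block placement outside the web interface, hdfs fsck can print it as well (the /user/... path below is just a placeholder for wherever the samples directory lives):

hdfs fsck /user/<you>/samples/sample2.txt -files -blocks -locations

The -locations flag lists the datanode(s) holding each block of the file.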
Regarding @cricket_007:
My concern is that sample2.txt is only available on node03. The YARN web interface tells me that for the application attempt, only one container was allocated, on node01. If the map task for sample2.txt had run where its data is, there would have been a container on node03 as well.
Thus, node01 needs to have fetched the sample2.txt file from node03.
Yes, I know Hadoop does not run well on 1 GB of RAM, but I am working with a Raspberry Pi cluster just to fiddle around and learn a little. This is not for production usage.
The YARN application master picks a node at random to run the calculation, based on information available from the NameNode about where files are stored. DataNodes and NodeManagers should run on the same machines.
If your file isn't larger than the HDFS block size, there is no reason to fetch the data from other nodes.
Note: Hadoop services don't run that well on only 1 GB of RAM, and you need to adjust the YARN settings differently for differently sized nodes.
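For example (just a sketch; the values are illustrative for a ~1 GB node, not recommendations), these yarn-site.xml properties control how much memory a NodeManager offers to containers and how small a container may be:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>768</value>  <!-- memory this NodeManager offers to containers -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>  <!-- smallest container the scheduler will allocate -->
</property>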
For anyone else wondering:
At least for me, the HistoryServer UI (which needs to be started manually) correctly shows that node03 and node01 were running map jobs. Thus, my earlier statement was incorrect. I still wonder why the application attempt UI speaks of only one container, but I guess that doesn't matter.
Thank you guys!

Datanode decommissioning vs Write

What happens when we decommission a datanode while write is happening to HDFS on that node?
Will it stop writing the data to HDFS on that node and decommission it, or will it finish writing and then decommission it?
I found a solution for this in the Cloudera documentation.
Decommissioning applies only to HDFS DataNode, MapReduce TaskTracker, YARN NodeManager, and HBase RegionServer roles. If the host has other roles running on it, those roles are stopped/killed.
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_decomm_host.html
As stated in the Hadoop wiki:
Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be included into the exclude file, and the exclude file name should be specified as a configuration parameter dfs.hosts.exclude. This file should have been specified during namenode startup.
The exclude file property names a file that contains a list of hosts that are not permitted to connect to the namenode.
When that occurs, I think the process writing to the decommissioned node gets an IOException.
After you add the entries for the nodes that you want to decommission, you need to execute hadoop dfsadmin -refreshNodes to start the actual decommissioning process. The actual decommissioning process is very slow. All current reads and writes will continue as usual, but all future writes are not allowed, while reads are still allowed with the lowest priority. Meanwhile, all the blocks are replicated to other nodes.
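Pulling those pieces together, a hedged sketch of the procedure (the exclude-file path and hostname are placeholders):

<!-- hdfs-site.xml: must already point at the exclude file when the NameNode starts -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>

# add the node to retire, then ask the NameNode to re-read the file
echo datanode03.example.com >> /etc/hadoop/conf/dfs.exclude
hadoop dfsadmin -refreshNodes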

Hadoop Namenode without HDFS storage

I have installed a Hadoop cluster with 3 machines in total, with 2 nodes acting as datanodes and 1 node acting as the NameNode as well as a datanode.
I wanted to clear certain doubts regarding hadoop cluster installation and architecture.
Here is a list of questions I am looking for answers to:
I uploaded a data file around 500 MB in size to the cluster and then checked the HDFS report.
I noticed that the namenode I set up is also occupying 500 MB in HDFS, along with the datanodes, with a replication factor of 2.
The problem here is that I want the namenode not to store any data on it; in short, I don't want it to work as a datanode, as it is also storing the file I am uploading. So what is the way of making it act only as a master node and not as a datanode?
I tried running hadoop-daemon.sh stop datanode on the NameNode to stop the datanode service on it, but it wasn't of any help.
How much metadata does a NameNode generate for a file size of typically 1 GB? Any approximations?
Go to the conf directory inside your $HADOOP_HOME directory on your master. Edit the file named slaves and remove the entry corresponding to your name node from it. This way you are only asking the other two nodes to act as slaves, and the name node acts only as the master.
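Something along these lines, assuming a classic $HADOOP_HOME/conf layout (sketch only):

# on the master: remove the master's own hostname from conf/slaves,
# so no DataNode is started there on the next start-dfs.sh
vi $HADOOP_HOME/conf/slaves

# stop the DataNode daemon that is currently running on the master
$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode

Note that blocks already written to the master's DataNode stay there until they are re-replicated elsewhere, e.g. by decommissioning that DataNode first.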

How to submit a job to specifc nodes in Hadoop?

I have a Hadoop cluster with 1 master and 5 slaves. Is there any way of submitting jobs to a specific set of slaves? Basically what I am trying to do is benchmark my application with many possibilities. So after testing with 5 slaves, I would like to run my application with 4 slaves, then 3 slaves, and so on.
Currently the only way I know of is decommissioning a slave and removing it from the Hadoop cluster. But that seems to be a tedious task. I was wondering if there is an easier approach, so as to avoid removing a node from the cluster.
Thanks.
In hadoop/conf there is a file called 'slaves'; here you can simply add or remove nodes, and then restart your DFS and MapReduce daemons.
There is a setting you can put in mapred-site.xml that points to a file with a list of excluded hosts. Though also a bit cumbersome, changing a single configuration value might be preferable to physically decommissioning and recommissioning multiple nodes. You could prepare multiple host-exclusion files in advance, change the setting and restart the MapReduce service. Restarting the MapReduce service is pretty quick.
In 0.23 this setting is named mapreduce.jobtracker.hosts.exclude.filename. This feature was introduced in 0.21, though I believe the setting was named mapred.hosts.exclude then. Check what this setting is called for the version of Hadoop you are using.
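A sketch of that approach, assuming the 0.23-era property name mentioned above (file path, hostname and script names are placeholders for a 1.x-style install):

<!-- mapred-site.xml -->
<property>
  <name>mapreduce.jobtracker.hosts.exclude.filename</name>
  <value>/etc/hadoop/conf/mapred.exclude</value>
</property>

# exclude one more slave per benchmark run, then bounce the MapReduce daemons
echo slave5 >> /etc/hadoop/conf/mapred.exclude
stop-mapred.sh && start-mapred.sh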
For those who encounter this problem, the comments from Alex and the Stack Overflow question will help in successfully decommissioning a node from a Hadoop cluster.
EDIT: Just editing the hdfs-site.xml and mapred-site.xml files and executing hadoop dfsadmin -refreshNodes might leave your datanode in the decommissioning state for a long time, so it is also necessary to change dfs.replication to an appropriate value.

Hadoop Configuration Physical Location

I want to ask a basic question that I couldn't find in online tutorials.
Do the Hadoop config files need to be on all nodes (NameNode, DataNode, JobTracker, etc.)?
Or do they only need to reside on the user@machine where the NameNode resides?
In other words, to properly set up a fully distributed cluster, do I need to replicate the config files to every single node?
Thanks!
Yes, you are right: the config files need to be on every slave.
I say just the slaves, because a master usually has other configurations you may want to use, which makes the configuration on the slaves a bit more verbose.
Two things that make life easier:
Use an NFS mount for the configuration of the slaves
Or use a tool that does this for you, like Chef
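If neither of those fits, a plain shell loop over the slaves file does the job too; a minimal sketch assuming passwordless SSH and the same $HADOOP_HOME on every node:

for host in $(cat $HADOOP_HOME/conf/slaves); do
  rsync -av $HADOOP_HOME/conf/ "$host:$HADOOP_HOME/conf/"
done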
