Hadoop use only master node for processing data - hadoop

I've setup a Hadoop 2.5 cluster with 1 master node(namenode and secondary namenode and datanode) and 2 slave nodes(datanode).All of the machines use Linux CentOS 7 - 64bit. When I run my MapReduce program (wordcount), I can only see that master node is using extra CPU and RAM. Slave nodes are not doing a thing.
I've checked the logs from all of the namenode and there is nothing wrong on slave nodes. Resource Manager is running and all of the slave nodes can see the Resource Manager.
Datanodes are working in terms of distributed data storing but I can't see any indication of distributed data processing. Do I have to configure the xml configuration files in some other way so all of the machines will process data while I'm running my MapReduce Job?
Thank you

Make sure you are mentioaning the IP's Addresses of the daanodes on the Masternode networking files. Also each node in the cluster is supposed to contain IP address of the other machines.
Besides that check the includes file if it contains the relevant datanodes entry onto it or not.

Related

How does Hadoop distribute the data/tasks for MapReduce jobs?

I've setup a Hadoop cluster with 4 nodes, one of which serves as the NameNode for HDFS as well as the Yarn master. This node is also the most powerful.
Now, I've distributed 2 text files, one on the node01 (namenode) and one on node03 (datanode). When running the basic WordCount MapReduce job, I can see in the logs that only node01 was doing any calculations.
My question is why Hadoop didn't decide to do MapReduce on node03 and transfer the result instead of transferring the entire book to node01. I also checked, duplication is disabled and the book is only available on node03.
So, how does Hadoop decide between transferring the data and setting up the jobs and in this decision, does it check which machine has more compute power (e.g. did it decide to transfer to node01 because node01 is a 4 core 4gig ram machine vs 2core 1 gig on node03)?
I couldn't find anything on this topic, so any guidance would be appreciated.
Thank you!
Some more clarifications:
node01 is running a NameNode as well as a DataNode and a ResourceManager as well as a NodeManager. Thus, it serves as "main node" as well as a "compute node".
I made sure to put one file on node01 and one file on node03 by running:
hdfs dfs -put sample1.txt samples on node01 and hdfs dfs -put sample02.txt samples on node03. As replication is disabled, this leads to the data - that was available locally on node01 respective node03 - only being stored there.
I verified this using the HDFS Webinterface. For sample1.txt, it says the blocks are only available on node01; for sample2.txt, it says the blocks are only available on node03.
Regarding #cricket_007:
My concern is that sample2.txt is only available on node03. The YARN Webinterface tells me that that for the Application Attempt, only one container was allocated on node01. If the map task for file sample2.txt, there would have been a container on node03 as well.
Thus, node01 needs to have fetched the sample2.txt file from node03.
Yes, I know Hadoop is not running well on 1gig of RAM, but I am working with a Raspberry Pi cluster just to fiddle around and learn a little. This is not for production usage.
The YARN application master picks a node at random to run the calculation based on information available from the Namenode where files are stored. DataNodes and NodeManagers should run on the same machines.
If your file isn't larger than the HDFS block size, there is no reason to fetch the data from other nodes.
Note: Hadoop services don't run that well on only 1G of RAM, and you need to adjust the YARN settings differently for different sized nodes.
For anyone else wondering:
At least for me, the HistoryServer UI (which needs to be started manually) shows correctly that node03 and node01 were running map jobs. Thus, my statement was incorrect. I still wonder why the application attempt UI speaks of one container, but I guess that doesn't matter.
Thank you guys!

Determin whether slave nodes in hadoop cluster has been assigned tasks

I'm new to Hadoop and MapReduce. I just deployed a Hadoop cluster with one master machine and 32 slave machines. However when I start to run an example program, it seems that it just runs to slow. How can I determine whether a map/reduce task has really been assigned to a slave node for execution?
The example program is executed like that:
hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 32 100
okay lots of possibilities there. Hadoop comes out to help in distributed task.
So if your code is written in way that everything is dependent then there is no use of 32 slaves. rather it will take overhead time to manage connection.
check your hadoopMasterIp:50070 if if all the datanodes(slave) is running or not. obviously if you did not change dfs.http.address in your core-site.xml.
The easiest way to take a look at Yarn Web UI. By default it uses port 8088 on your master node (change master in the URI by your own IP address):
http://master:8088/cluster
There you can see total resources of your cluster and list of all applications. For every application you can find out how many mappers/reducers were used and where (on what machine) they were executed.

Why is the Secondary Name Node included on the slaves file as well?

I am reading blogs and tutorials and noticed that the node that is configured on the masters file is also included in the name node, why are they including the secondary name node on the slaves file as well? A technical or conceptual explanation is very much appreciated
It is possible to configure both master processes and slave processes on the same node in a cluster. But it is not recommended in production. In production, you will not see the overlap between masters and slaves. However on single node Hadoop set up as well as small clusters for development, it is common to have overlap between master and slave processes.
Master processes: namenode, secondary namenode, resource manager, jobhistory server etc.
Slave processes: datanode, nodemanager etc.
If you have node as part of masters and based up on the ip address in core-site.xml and yarn-site.xml it will start namenode and resourcemanager respectively.
For slaves configuration files typically have 0.0.0.0 as ip address, so it will start both datanode as well as nodemanager on all the nodes that are defined as slaves (unless you exclude it).

Hadoop Client Node Configuration

Assume that there is a Hadoop Cluster that has 20 machines. Out of those 20 machines 18 machines are slaves and machine 19 is for NameNode and machine 20 is for JobTracker.
Now i know that hadoop software has to be installed in all those 20 machines.
but my question is which machine is involved to load a file xyz.txt in to Hadoop Cluster. Is that client machine a separate machine . Do we need to install Hadoop software in that clinet machine as well. How does the client machine identifes Hadoop cluster?
I am new to hadoop, so from what I understood:
If your data upload is not an actual service of the cluster, which should be running on an edge node of the cluster, then you can configure your own computer to work as an edge node.
An edge node doesn't need to be known by the cluster (but for security stuff) as it does not store data nor compute job. This is basically what it means to be an edge-node: it is connected to the hadoop cluster but does not participate.
In case it can help someone, here is what I have done to connect to a cluster that I don't administer:
get an account on the cluster, say myaccount
create an account on you computer with the same name: myaccount
configure your computer to access the cluster machines (ssh w\out passphrase, registered ip, ...)
get the hadoop configuration files from an edge-node of the cluster
get a hadoop distrib (eg. from here)
uncompress it where you want, say /home/myaccount/hadoop-x.x
add the following environment variables: JAVA_HOME, HADOOP_HOME (/home/me/hadoop-x.x)
(if you'd like) add hadoop bin to your path: export PATH=$HADOOP_HOME/bin:$PATH
replace your hadoop configuration files by those you got from the edge node. With hadoop 2.5.2, it is the folder $HADOOP_HOME/etc/hadoop
also, I had to change the value of a couple $JAVA_HOME defined in conf files. To find them use: grep -r "export.*JAVA_HOME"
Then do hadoop fs -ls / which should list the root directory of the cluster hdfs.
Typically in case you have a multi tenant cluster (which most hadoop clusters are bound to be) then ideally no one other than administrators have access to the machines that are the part of the cluster.
Developers setup their own "edge-nodes". Edge Nodes basically have hadoop libraries and have the client configuration deployed to them (various xml files which tell the local installation where namenode, job tracker, zookeeper etc are core-site, mapred-site, hdfs-site.xml). But the edge node does not have any role as such in the cluster i.e. no persistent hadoop services are running on this node.
Now in case of a small development environment kind of setup you can use any one of the participating nodes of the cluster to run jobs or run shell commands.
So based on your requirement the definition and placement of client varies.
I recommend this article.
"Client machines have Hadoop installed with all the cluster settings, but are neither a Master or a Slave. Instead, the role of the Client machine is to load data into the cluster, submit Map Reduce jobs describing how that data should be processed, and then retrieve or view the results of the job when its finished."

How to separate Hadoop MapReduce from HDFS?

I'm curious if you could essentially separate the HDFS filesystem from the MapReduce framework. I know that the main point of Hadoop is to run the maps and reduces on the machines with the data in question, but I was wondering if you could just change the *.xml files to change the configuration of what machine the jobtracker, namenode and datanodes are running on.
Currently, my configuration is a 2 VMs setup: one (the master) with Namenode, Datanode, JobTracker, Tasktracker (and the SecondaryNameNode), the other (the slave) with DataNode, Tasktraker. Essentially, what I want to change is have the master with NameNode DataNode(s), JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each). The bottleneck will be the data transfer between the two VMs for the computations of maps and reduces, but since the data at this stage is so small I'm not primarily concerned with it. I would just like to know if this configuration is possible, and how to do it. Any tips?
Thanks!
You don't specify this kind of options in the configuration files.
What you have to do is to take care of what kind of deamons you start on each machine(you call them VMs but I think you mean machines).
I suppose you usually start everything using the start-all.sh script which you can find in the bin directory under the hadoop installation dir.
If you take a look at this script you will see that what it does is to call a number of sub-scripts corresponding to starting the datanodes, tasktrackers and namenode, jobtracker.
In order to achive what you've said, I would do like this:
Modify the masters and slaves files as this:
Master file should contain the name of machine1
Slaves should contain the name of machine2
Run start-mapred.sh
Modify the masters and slaves files as this:
Master file should contain the machine1
Slaves file should contain machine1
Run start-dfs.sh
I have to tell you that I've never tried such a configuration so I'm not sure this is going to work but you can give it a try. Anyway the solution is in this direction!
Essentially, what I want to change is have the master with NameNode DataNode(s), JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each).
First, I am not sure why to separate the computation from the storage. The whole purpose of MR locality is lost, thought you might be able to run the job successfully.
Use the dfs.hosts, dfs.hosts.exclude parameters to control which datanodes can connect to the namenode and the mapreduce.jobtracker.hosts.filename, mapreduce.jobtracker.hosts.exclude.filename parameters to control which tasktrackers can connect to the jobtracker. One disadvantage of this approach is that the datanodes and tasktrackers are started on the nodes which are excluded and aren't part of the Hadoop cluster.
Another approach is to modify the code to have a separate slave file for the tasktracker and the datanode. Currently, this is not supported in Hadoop and would require a code change.

Resources