SecondaryNamenode and MapReduce jobs - hadoop

Maybe this is a silly question, but anyway...
How can I tell that the secondary namenode actually does something (I mean, that it works)? Do I have to configure it to do something?
Also, do jobs in MapReduce run in parallel by default? I mean, does whatever you program in MR always run in parallel?
I am asking these questions because I have to prove (for a project I have to do) that jobs on Hadoop run in parallel.
Thank you in advance.
P.S.: Sorry for my bad English; I hope I was understandable.

Yon, when you configure Hadoop you put the hostname of some machine into conf/masters. That is where your SNN will run. You can go to the terminal of that machine and issue jps. This will show you all the Java processes currently running, and you should see SecondaryNameNode among them. Something like this:
apache#hadoop:~$ jps
21615 TaskTracker
21268 SecondaryNameNode
21014 DataNode
27656 HRegionServer
21362 JobTracker
19908 org.eclipse.equinox.launcher_1.3.0.v20120522-1813.jar
17643 Jps
27364 HMaster
28451 Main
27194 HQuorumPeer
29811 RunJar
20744 NameNode
To cross-check, you could change this to some other machine and see the effect. Alternatively, you could check it via the SNN web port, which is 50090 by default. Does that make sense?
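A quick shell check might look like the sketch below (just a sketch; it assumes you run it on the machine listed in conf/masters and that the SNN web port is the default 50090):
jps | grep SecondaryNameNode      # the daemon should show up in the list
curl -I http://localhost:50090/   # any HTTP response at all means the SNN web server is listening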
And when you run an MR job, you can open the MapReduce web UI by pointing your web browser to jobtracker_machine:50030. Here you can see a list of all the jobs you are running (or have run previously) along with the total number of mappers/reducers created for a particular job. You can click on a job and it will show you all the mappers and reducers currently running on your cluster, together with the progress of each one. All these mappers/reducers run in parallel on different machines. To verify that, you can click on each mapper/reducer and it will show you the machine where that particular task is running along with its % completion.
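For example (a sketch only; /input and /output are placeholder HDFS paths, and the examples jar name depends on your Hadoop version), you could launch one of the bundled example jobs and watch the UI while it runs:
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /input /output
# while this runs, http://jobtracker_machine:50030/jobtracker.jsp lists the job;
# clicking through to its map/reduce tasks shows which TaskTracker (machine) runs each one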
HTH

Related

Understanding mapreduce.framework.name wrt Hadoop

I am learning Hadoop and have come to know that there are two versions of the framework, viz. Hadoop1 and Hadoop2.
If my understanding is correct, in Hadoop1 the execution environment is based on two daemons, viz. TaskTracker and JobTracker, whereas in Hadoop2 (aka YARN) the execution environment is based on "new" daemons, viz. ResourceManager, NodeManager, and ApplicationMaster.
Please correct me if this is wrong.
I came to know of the following configuration parameter:
mapreduce.framework.name
The possible values it can take are: local, classic, yarn
I don't understand what they actually mean; for example, if I install Hadoop 2, how can it have the old execution environment (which has TaskTracker and JobTracker)?
Can anyone help me understand what these values mean?
yarn stands for MR version 2.
classic is for MR version 1.
local is for local runs of MR jobs.
MR V1 and MR V2 differ only in how resources are managed and how a job is executed. The current Hadoop release is capable of both (and even of a lightweight local mode). When you set the value to yarn, you are simply instructing the framework to use the YARN way of executing the job. Similarly, when you set it to local, you are just telling the framework that there is no cluster for execution and everything runs within a single JVM. It is not a different infrastructure for the MR V1 and MR V2 frameworks; it is just the way of executing the job that changes.
JobTracker, TaskTracker, etc. are all just daemon processes, which are spawned when needed and killed.
MRv1 uses the JobTracker to create and assign tasks to data nodes. This was found to be too inefficient when dealing with large clusters, leading to YARN.
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager for each cluster, and each data node runs a Node Manager. For each job, one slave node will act as the Application Master, monitoring resources/tasks, etc.
Local mode is provided to simulate and debug MR applications within a single machine/JVM.
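To make that concrete, a minimal sketch (the file location assumes a stock Hadoop 2.x layout; the hdfs getconf call simply reads back whatever value is configured):
$ cat etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>   <!-- other accepted values: classic, local -->
  </property>
</configuration>
$ hdfs getconf -confKey mapreduce.framework.name
yarn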
EDIT: Based on comments
jps (Java Virtual Machine Process Status) is a JVM tool which, according to the official page:
The jps tool lists the instrumented HotSpot Java Virtual Machines
(JVMs) on the target system. The tool is limited to reporting
information on JVMs for which it has the access permissions.
So,
jps is not a big data tool but rather a Java tool that reports on JVMs; it does not divulge any information about the processes running within a JVM.
It only lists the JVMs it has access to, which means there may still be certain JVMs that remain undetected.
Keeping the above points in mind, you will observe that the jps command emits different results depending on the Hadoop deployment mode:
Local (or Standalone) mode: There are no daemons and everything runs in a single JVM.
Pseudo-Distributed mode: Each daemon (Namenode, Datanode, etc.) runs in its own JVM on a single host.
Distributed mode: Each daemon runs in its own JVM across a cluster of hosts.
Hence each of the processes may or may not run in the same JVM, and hence the jps output will differ.
Now, in distributed mode the MR v2 framework works in its default mode, i.e. yarn; hence you see the YARN-specific daemons running:
Namenode
Datanode
ResourceManager
NodeManager
Apache Hadoop 1.x (MRv1) consists of the following daemons:
Namenode
Datanode
Jobtracker
Tasktracker
Note that NameNode and DataNode are common to both, because they are HDFS-specific daemons, while the other two in each list are MRv1- and YARN-specific respectively.
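A hedged sketch of how you might tell the two apart from the shell, using jps (process IDs omitted; the exact output depends on the deployment mode described above):
jps | egrep 'ResourceManager|NodeManager'   # present on a Hadoop 2.x / YARN deployment
jps | egrep 'JobTracker|TaskTracker'        # present on a Hadoop 1.x / MRv1 deployment
jps | egrep 'NameNode|DataNode'             # HDFS daemons, common to both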

What happens to hadoop job when the NameNode is down?

In Hadoop 1.2.1, I would like to get a basic understanding of the questions below.
Who receives the Hadoop job? Is it the NameNode or the JobTracker?
What will happen if somebody submits a Hadoop job when the NameNode is down? Does the job fail, or does it get put on hold?
What will happen if somebody submits a Hadoop job when the JobTracker is down? Does the job fail, or does it get put on hold?
By Hadoop job, you probably mean a MapReduce job. If your NN is down and you don't have a spare one (in an HA setup), your HDFS will not be working, and every component dependent on that HDFS namespace will either be stuck or crash.
1) The JobTracker (the YARN ResourceManager with Hadoop 2.x)
2) I am not completely sure, but the job will probably get submitted and then fail afterwards
3) You cannot submit a job to a stopped JobTracker.
The client contacts the NameNode for the input data; the NameNode looks up the data requested by the client and returns the block information.
The JobTracker is responsible for the job being completed and for the allocation of resources to the job.
In cases 2 & 3, the job fails.
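A quick way to observe case 2 for yourself (a sketch; the jar name and HDFS paths are placeholders for whatever you normally run):
hadoop fs -ls /
# with the NameNode down this typically fails with a ConnectException to the NameNode RPC port,
# once the client exhausts its connection retries
hadoop jar hadoop-examples-1.2.1.jar wordcount /input /output
# job submission fails for the same reason: the client cannot stage the job resources
# (job jar, splits, configuration) into HDFS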

How do I safely remove a Hadoop datanode for maintenance?

I want to take a single machine out of a Hadoop cluster temporarily.
Most documentation says to take it out by adding it to the yarn and dfs .exclude files. I don't want to add it to the dfs.exclude and yarn.exclude files and decommission it with hdfs dfsadmin -refreshNodes, though, because I want to take it out, make some changes to the machine, and bring it back online as soon as possible. I don't want to copy hundreds of gigabytes of data over to avoid under-replicated blocks!
Instead, I'd like to be able to power off the machine quickly while making sure:
The cluster as a whole is still operational.
No data is lost by the journalnode or nodemanager processes.
No Yarn jobs fail or go AWOL when the process dies.
My best guess at how to do this is by issuing:
./hadoop-daemon.sh --hosts hostname stop datanode
./hadoop-daemon.sh --hosts hostname stop journalnode
./yarn-daemon.sh --hosts hostname stop nodemanager
And then starting each of these processes individually again when the machine comes back online.
Is that safe? And is there a more efficient way to do this?
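For reference, the matching start sequence once the machine comes back would presumably just be the mirror image (a sketch; since hadoop-daemon.sh and yarn-daemon.sh act on the host they are run on, this assumes the commands are run on the node itself):
./hadoop-daemon.sh start datanode
./hadoop-daemon.sh start journalnode    # only if this node is one of the JournalNodes
./yarn-daemon.sh start nodemanager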

How to start multiple datanode processes on standalone hadoop setup (pseudo-distributed)

I am new to Hadoop. I have configured a standalone Hadoop setup on a single VM running Ubuntu 13.03. After starting the Hadoop processes using start-all.sh, the jps command shows:
775 DataNode
1053 JobTracker
962 SecondaryNameNode
1365 Jps
1246 TaskTracker
590 NameNode
As per my understanding, Hadoop has started with 1 namenode and 1 datanode. I want to create multiple datanode processes, i.e. multiple instances of the datanode. Is there any way I can do that?
There are multiple possibilities for how to install and configure Hadoop.
Local (standalone) Mode - all Hadoop components run in a single Java process.
Pseudo-Distributed Mode - Hadoop runs all its components (datanode, tasktracker, jobtracker, namenode, ...) as separate Java processes. It serves as a simulation of a fully distributed installation, but it runs on a local machine only.
Distributed Mode - a fully distributed installation. Briefly, without any details: some machines play the 'slave' role and contain the Datanode + Tasktracker components, and there is a server playing the 'master' role which contains the Namenode + JobTracker.
Back to your question: if you would like to run Hadoop on a single machine, you have the first two options. It is impossible to run it in fully distributed mode on a single node. Maybe you can do a workaround (a sketch follows below), but it is nonsense from a basic point of view. Hadoop was designed as a distributed system; the possibility to run it on a single machine serves, IMHO, for debug/trial purposes only.
For more details, follow the Hadoop documentation. I hope I answered your question.
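For completeness, the kind of workaround alluded to above usually amounts to giving a second datanode its own configuration directory, with a distinct storage directory and ports (a rough sketch for Hadoop 1.x; all paths and port numbers are placeholders you would adjust):
cp -r $HADOOP_HOME/conf $HADOOP_HOME/conf.dn2
# in conf.dn2/hdfs-site.xml override at least:
#   dfs.data.dir              -> a different local directory
#   dfs.datanode.address      -> e.g. 0.0.0.0:50011  (default 50010)
#   dfs.datanode.http.address -> e.g. 0.0.0.0:50076  (default 50075)
#   dfs.datanode.ipc.address  -> e.g. 0.0.0.0:50021  (default 50020)
bin/hadoop-daemon.sh --config $HADOOP_HOME/conf.dn2 start datanode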

How to separate Hadoop MapReduce from HDFS?

I'm curious if you could essentially separate the HDFS filesystem from the MapReduce framework. I know that the main point of Hadoop is to run the maps and reduces on the machines with the data in question, but I was wondering if you could just change the *.xml files to change the configuration of what machine the jobtracker, namenode and datanodes are running on.
Currently, my configuration is a 2-VM setup: one (the master) with NameNode, DataNode, JobTracker, TaskTracker (and the SecondaryNameNode), the other (the slave) with DataNode and TaskTracker. Essentially, what I want to change is to have the master with NameNode, DataNode(s), and JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each). The bottleneck will be the data transfer between the two VMs for the computations of maps and reduces, but since the data at this stage is so small I'm not primarily concerned with it. I would just like to know if this configuration is possible, and how to do it. Any tips?
Thanks!
You don't specify this kind of option in the configuration files.
What you have to do is take care of which daemons you start on each machine (you call them VMs, but I think you mean machines).
I suppose you usually start everything using the start-all.sh script, which you can find in the bin directory under the Hadoop installation dir.
If you take a look at this script, you will see that what it does is call a number of sub-scripts that start the datanodes, tasktrackers, namenode and jobtracker.
In order to achieve what you've described, I would do the following:
Modify the masters and slaves files like this:
The masters file should contain the name of machine1
The slaves file should contain the name of machine2
Run start-mapred.sh
Then modify the masters and slaves files like this:
The masters file should contain machine1
The slaves file should contain machine1
Run start-dfs.sh
I have to tell you that I've never tried such a configuration, so I'm not sure it is going to work, but you can give it a try. Anyway, the solution is in this direction! (The steps above are spelled out as commands in the sketch below.)
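Spelled out as commands (run from machine1), the steps above would look roughly like this sketch; machine1/machine2 stand for the hostnames in the question, and the scripts live in $HADOOP_HOME/bin on Hadoop 1.x:
# 1. JobTracker on machine1, TaskTracker only on machine2
echo machine1 > conf/masters
echo machine2 > conf/slaves
bin/start-mapred.sh
# 2. NameNode and DataNode on machine1 only
echo machine1 > conf/slaves
bin/start-dfs.sh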
Essentially, what I want to change is have the master with NameNode DataNode(s), JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each).
First, I am not sure why you would separate the computation from the storage. The whole purpose of MR data locality is lost, though you might still be able to run the job successfully.
Use the dfs.hosts and dfs.hosts.exclude parameters to control which datanodes can connect to the namenode, and the mapreduce.jobtracker.hosts.filename and mapreduce.jobtracker.hosts.exclude.filename parameters to control which tasktrackers can connect to the jobtracker. One disadvantage of this approach is that datanode and tasktracker daemons are still started on the excluded nodes, even though they aren't part of the Hadoop cluster.
Another approach is to modify the code to have a separate slave file for the tasktracker and the datanode. Currently, this is not supported in Hadoop and would require a code change.
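A sketch of the first approach (the property names are the ones mentioned above; the include-file paths are placeholders):
# hdfs-site.xml:   dfs.hosts = /path/to/dfs.include
#                  (dfs.include lists only the master, so only it registers as a DataNode)
# mapred-site.xml: mapreduce.jobtracker.hosts.filename = /path/to/mapred.include
#                  (mapred.include lists only the slaves, so only they register as TaskTrackers)
# after editing the include files, refresh without restarting the daemons:
hadoop dfsadmin -refreshNodes
hadoop mradmin -refreshNodes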
