I want to ask a basic question that I couldn't find in online tutorials.
Do the Hadoop config files need to be on all nodes (NameNode, DataNode, JobTracker, etc.)?
Or do they only need to reside on the machine where the NameNode resides?
In other words, to properly set up a fully distributed cluster, do I need to replicate the config files to every single node?
Thanks!
Yes, you are right: the config files need to be on every slave.
I say just the slaves, because the master usually has additional configuration you may want to use, which would make the configuration on the slaves a bit more verbose than it needs to be.
Two things that make life easier:
Use an NFS mount for the configuration of the slaves
Or use a tool that does this for you, like Chef
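If you don't go the NFS or Chef route, even a small rsync loop over the hosts in your slaves file does the job; a minimal sketch, assuming a /usr/local/hadoop install path:

    # Push the local Hadoop config to every host listed in conf/slaves
    for host in $(cat /usr/local/hadoop/conf/slaves); do
      rsync -az /usr/local/hadoop/conf/ "$host:/usr/local/hadoop/conf/"
    done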
This question may seem very trivial, but I'm new to Hadoop and currently confused about one thing.
When starting the daemons, how can the appropriate files be located on the slave nodes?
I know you specify the masters and the slaves in the appropriate files, but how does it know the location on the file system of those nodes (the path to where Hadoop is installed)?
Should I perhaps setup something like HADOOP_HOME (or HADOOP_PREFIX) for that?
Also, I read that in the masters file you specify only the Secondary NameNode, while the NameNode and JobTracker are assumed to be on the node from which you are calling start-all.sh. But what happens if you're not logged on to that node, but on another, client node? Maybe I didn't understand that part well...
The Hadoop installation on your different nodes should be almost identical, and for that reason you should set HADOOP_HOME (I also set HADOOP_PREFIX to the same location) pointing to your Hadoop installation on every node of your cluster.
Every one of your nodes should be able to connect to the others via passwordless SSH, so I believe the last part of your question doesn't make much sense ;)
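A minimal sketch of what that might look like in each node's shell profile (or conf/hadoop-env.sh); the install path is an assumption:

    # Same values on every node, since the installation layout is identical
    export HADOOP_HOME=/usr/local/hadoop      # hypothetical path; use your install dir
    export HADOOP_PREFIX=$HADOOP_HOME
    export PATH=$PATH:$HADOOP_HOME/bin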
I have a Hadoop cluster with 1 master and 5 slaves. Is there any way of submitting jobs to a specific set of slaves? Basically, what I am trying to do is benchmark my application with many possibilities. So after testing with 5 slaves, I would like to run my application with 4 slaves, then 3 slaves, and so on.
Currently the only way I know of is decommissioning a slave and removing it from the Hadoop cluster. But that seems to be a tedious task. I was wondering if there is an easier approach, so as to avoid removing a node from the cluster.
Thanks.
In hadoop/conf there is a file called 'slaves'; here you can simply add or remove nodes, and then restart your DFS and MapReduce daemons.
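For example, to go from 5 slaves to 4, something along these lines (hostnames and paths are assumptions):

    # Remove the host you want to drop from conf/slaves (or just edit the file by hand)
    sed -i '/^slave5$/d' $HADOOP_HOME/conf/slaves

    # Restart the daemons so the change takes effect
    $HADOOP_HOME/bin/stop-mapred.sh && $HADOOP_HOME/bin/stop-dfs.sh
    $HADOOP_HOME/bin/start-dfs.sh && $HADOOP_HOME/bin/start-mapred.sh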
There is a setting you can put in mapred-site.xml that points to a file with a list of excluded hosts. Though also a bit cumbersome, changing a single configuration value might be preferable to physically decommissioning and recommissioning multiple nodes. You could prepare multiple host-exclusion files in advance, change the setting, and restart the MapReduce service. Restarting the MapReduce service is pretty quick.
In 0.23 this setting is named mapreduce.jobtracker.hosts.exclude.filename. This is a feature introduced in 0.21, though I believe the setting was named mapred.hosts.exclude then. Check what this setting is called for the version of Hadoop you are using.
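As a rough sketch of that approach on a 1.x-style cluster (the property name and paths here are assumptions; check the name for your version as noted above):

    # List the hosts to exclude from MapReduce, one hostname per line (hostnames are hypothetical)
    echo "slave5" > $HADOOP_HOME/conf/mapred.exclude

    # In mapred-site.xml, point the JobTracker at that file:
    #   <property>
    #     <name>mapred.hosts.exclude</name>
    #     <value>/usr/local/hadoop/conf/mapred.exclude</value>
    #   </property>

    # Then restart the MapReduce service, or ask the JobTracker to re-read the hosts files:
    $HADOOP_HOME/bin/hadoop mradmin -refreshNodes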
For those who encounter this problem, the comments from Alex and the linked Stack Overflow question will help in successfully decommissioning a node from the Hadoop cluster.
EDIT: Just editing the files hdfs-site.xml and mapred-site.xml and executing hadoop dfsadmin -refreshNodes might leave your datanode in the decommissioning-in-progress state for a long time. So it is also necessary to change dfs.replication to an appropriate value.
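For completeness, the HDFS side of the same idea (paths and values are assumptions):

    # List the hosts to exclude from HDFS
    echo "slave5" > $HADOOP_HOME/conf/dfs.exclude

    # In hdfs-site.xml:
    #   <property>
    #     <name>dfs.hosts.exclude</name>
    #     <value>/usr/local/hadoop/conf/dfs.exclude</value>
    #   </property>
    # Keep dfs.replication no higher than the number of remaining datanodes,
    # otherwise decommissioning can wait a long time for enough replicas:
    #   <property>
    #     <name>dfs.replication</name>
    #     <value>2</value>
    #   </property>

    # Ask the NameNode to re-read the hosts files
    $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes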
I'm curious if you could essentially separate the HDFS filesystem from the MapReduce framework. I know that the main point of Hadoop is to run the maps and reduces on the machines with the data in question, but I was wondering if you could just change the *.xml files to change which machines the JobTracker, NameNode and DataNodes run on.
Currently, my configuration is a two-VM setup: one (the master) with NameNode, DataNode, JobTracker, TaskTracker (and the SecondaryNameNode), the other (the slave) with DataNode and TaskTracker. Essentially, what I want to change is to have the master run the NameNode, DataNode(s) and JobTracker, and have the slave run only a TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them, one on each). The bottleneck will be the data transfer between the two VMs for the map and reduce computations, but since the data at this stage is so small I'm not primarily concerned with it. I would just like to know if this configuration is possible, and how to do it. Any tips?
Thanks!
You don't specify this kind of option in the configuration files.
What you have to do is take care of which daemons you start on each machine (you call them VMs, but I think you mean machines).
I suppose you usually start everything using the start-all.sh script, which you can find in the bin directory under the Hadoop installation dir.
If you take a look at this script you will see that what it does is call a number of sub-scripts that start the datanodes and tasktrackers, and the namenode and jobtracker.
In order to achieve what you've described, I would do the following (a rough shell sketch follows the steps):
Modify the masters and slaves files like this:
The masters file should contain machine1
The slaves file should contain machine2
Run start-mapred.sh
Then modify the masters and slaves files like this:
The masters file should contain machine1
The slaves file should contain machine1
Run start-dfs.sh
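Put together, a rough shell sketch of those steps (the install path is an assumption; machine1/machine2 are the hosts from the question):

    cd /usr/local/hadoop                 # hypothetical install dir; run this on machine1

    # MapReduce: JobTracker on machine1, TaskTracker only on machine2
    echo "machine1" > conf/masters
    echo "machine2" > conf/slaves
    bin/start-mapred.sh

    # HDFS: NameNode and the only DataNode both on machine1
    # (note: the masters file actually controls where the SecondaryNameNode starts)
    echo "machine1" > conf/masters
    echo "machine1" > conf/slaves
    bin/start-dfs.sh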
I have to tell you that I've never tried such a configuration, so I'm not sure it's going to work, but you can give it a try. Anyway, the solution is in this direction!
Essentially, what I want to change is have the master with NameNode DataNode(s), JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each).
First, I am not sure why you want to separate the computation from the storage. The whole purpose of MR locality is lost, though you might be able to run the job successfully.
Use the dfs.hosts and dfs.hosts.exclude parameters to control which datanodes can connect to the namenode, and the mapreduce.jobtracker.hosts.filename and mapreduce.jobtracker.hosts.exclude.filename parameters to control which tasktrackers can connect to the jobtracker. One disadvantage of this approach is that the datanode and tasktracker daemons are still started on the excluded nodes, even though they aren't part of the Hadoop cluster.
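A rough sketch of those include files (property names as above; paths and hostnames are assumptions, with machine1 as the only DataNode and machine2 as the only TaskTracker):

    # hdfs-site.xml -- only hosts listed in this file may register as datanodes:
    #   <property>
    #     <name>dfs.hosts</name>
    #     <value>/usr/local/hadoop/conf/dfs.include</value>
    #   </property>
    echo "machine1" > /usr/local/hadoop/conf/dfs.include

    # mapred-site.xml -- only hosts listed here may register as tasktrackers:
    #   <property>
    #     <name>mapreduce.jobtracker.hosts.filename</name>
    #     <value>/usr/local/hadoop/conf/mapred.include</value>
    #   </property>
    echo "machine2" > /usr/local/hadoop/conf/mapred.include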
Another approach is to modify the code to have a separate slave file for the tasktracker and the datanode. Currently, this is not supported in Hadoop and would require a code change.
Suppose I change the port numbers for tasktrackers or change the maximum number of map tasks through the conf files in Hadoop; do I need to stop and restart the servers/daemons?
It depends on which options you change, but for the two examples you give I would say yes, restart the MapReduce services (you don't need to restart the DFS services for these options).
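For example, after changing something like mapred.tasktracker.map.tasks.maximum in mapred-site.xml, restarting just the MapReduce daemons should do (paths are assumptions):

    # Restart only the MapReduce daemons; HDFS can keep running
    $HADOOP_HOME/bin/stop-mapred.sh
    $HADOOP_HOME/bin/start-mapred.sh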
I don't think there is an exhaustive list anywhere of what you need to restart when you amend a specific option.
I'm trying to set up a Hadoop cluster on 5 machines on the same LAN with NFS. The problem I'm facing is that the copy of Hadoop on one machine is replicated on all the machines, so I can't provide exclusive properties for each slave. Because of this, I get "Cannot create lock" kinds of errors. The FAQ suggests that NFS should not be used, but I have no other option.
Is there a way I can specify properties such that the master picks up its conf files from location1, slave1 picks up its conf files from location2, and so on?
Just to be clear, there's a difference between configurations for compute nodes and HDFS storage. Your issue appears to be solely the storage for configurations. This can and should be done locally, or at least let each machine map to a symlink based on some locally identified configuration (e.g. Mach01 -> /etc/config/mach01, ...).
(Revision 1) Regarding the comment/question below about symlinks: first, I'll admit that this is not something I can immediately solve. There are two approaches I see:
Have a script (e.g. at startup, or as a wrapper for starting Hadoop) on the machine determine the hostname (e.g. 'hostname -a'), which then identifies a local symlink (e.g. '/usr/local/hadoopConfig') to the correct directory in the NFS directory structure.
Set an environment variable, a la HADOOP_HOME, based on the local machine's hostname, and let various scripts work with this.
Although #1 should work, it is a method relayed to me, not one that I set up, and I'd be a little concerned about symlinks in the event that the hostname is misconfigured (this can happen). Method #2 is one that seems more robust.
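A minimal sketch of approach #1, combined with the HADOOP_CONF_DIR environment variable in the spirit of #2 (the NFS layout, symlink path and daemon list are assumptions):

    #!/bin/sh
    # Wrapper: pick a per-host config directory on the shared NFS mount, then start the daemons
    HOST=$(hostname -s)                                   # e.g. mach01
    export HADOOP_CONF_DIR=/nfs/hadoop/conf/$HOST         # hypothetical per-host layout on NFS
    ln -sfn "$HADOOP_CONF_DIR" /usr/local/hadoopConfig    # optional local symlink, as in approach #1
    "$HADOOP_HOME"/bin/hadoop-daemon.sh --config "$HADOOP_CONF_DIR" start datanode
    "$HADOOP_HOME"/bin/hadoop-daemon.sh --config "$HADOOP_CONF_DIR" start tasktracker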