Please explain the purpose of the dfs.include file and how to define it.
I've added a new node to the Hadoop cluster, but it is not recognized by the namenode. In one of the posts I found that dfs.include can resolve this issue.
Thank you in advance,
Vladi
Just including the node name in dfs.include and mapred.include is not sufficient. The slaves file has to be updated on the namenode/jobtracker, the tasktracker and the datanode have to be started on the new node, and the refreshNodes command has to be run on the NameNode and the JobTracker to make them aware of the new node.
Here are the instructions on how to do this.
According to 'Hadoop: The Definitive Guide':
The file (or files) specified by the dfs.hosts and mapred.hosts properties is different from the slaves file. The former is used by the namenode and jobtracker to determine which worker nodes may connect. The slaves file is used by the Hadoop control scripts to perform cluster-wide operations, such as cluster restarts. It is never used by the Hadoop daemons.
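As a rough sketch of those steps (the hostname new-node.example.com and the config file locations are assumptions; adjust them to your cluster):

# On the namenode/jobtracker host: add the new node to the include and slaves files
echo "new-node.example.com" >> /etc/hadoop/conf/dfs.include      # the file referenced by dfs.hosts
echo "new-node.example.com" >> /etc/hadoop/conf/mapred.include   # the file referenced by mapred.hosts
echo "new-node.example.com" >> $HADOOP_HOME/conf/slaves

# On the new node: start the worker daemons
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker

# Back on the namenode and jobtracker: re-read the include files
hadoop dfsadmin -refreshNodes
hadoop mradmin -refreshNodes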
Related
How should I add a new datanode to an existing hadoop cluster?
Do I just stop everything, set up the new datanode server the same way as the existing datanodes, add the new server's IP to the namenode, and change the number of slaves to the correct number?
Another question is: after I add a new datanode to the cluster, do I need to do anything to balance all the datanodes or "re-distribute" the existing files and directories across the different datanodes?
For Apache Hadoop you can select one of two options:
1.- Prepare the datanode configuration (JDK, binaries, HADOOP_HOME env var, xml config files pointing to the master, adding its IP to the slaves file on the master, etc.) and execute the following command inside this new slave:
hadoop-daemon.sh start datanode
2.- Prepare the datanode just like in step 1 and restart the entire cluster.
3.- To redistribute the existing data you need to enable dfs.disk.balancer.enabled in hdfs-site.xml. This enables the HDFS Disk Balancer, and you need to configure a plan (see the sketch below).
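For that last step, a rough sketch of creating and running a Disk Balancer plan on Hadoop 3.x (the datanode hostname and the plan-file path are placeholders):

# Assumes dfs.disk.balancer.enabled is already set to true in hdfs-site.xml
hdfs diskbalancer -plan new-datanode.example.com

# Execute the plan file reported by the previous command (path is a placeholder)
hdfs diskbalancer -execute /system/diskbalancer/<date>/new-datanode.example.com.plan.json

# Check progress
hdfs diskbalancer -query new-datanode.example.com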
You don't need to stop anything to add datanodes, and datanodes should register themselves with the Namenode on their own; I don't recall manually adding any information or needing to restart the namenode to detect datanodes (I typically use Ambari to provision new machines).
You will need to manually run the HDFS balancer in order to spread the data over to the new servers.
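For example, a minimal balancer invocation (the 10% threshold is just a common starting point; tune it for your cluster):

# Move blocks until every datanode's utilization is within 10% of the cluster average
hdfs balancer -threshold 10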
What happens when we decommission a datanode while a write is happening to HDFS on that node?
Will it stop writing the data to HDFS on that node and decommission it, or will it finish the write and then decommission the node?
I found a solution for this in cloudera.
Decommissioning applies only to the HDFS DataNode, MapReduce TaskTracker, YARN NodeManager, and HBase RegionServer roles. If the host has other roles running on it, those roles are stopped/killed.
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_mc_decomm_host.html
As stated in the Hadoop wiki:
Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be included into the exclude file, and the exclude file name should be specified as a configuration parameter dfs.hosts.exclude. This file should have been specified during namenode startup.
The exclude file property names a file that contains a list of hosts that are not permitted to connect to the namenode.
When that occurs, I think the process writing to the decommissioned node gets an IOException.
After you add the entries for the nodes that you want to decommission, you need to execute the dfsadmin -refreshNodes command to start the actual decommissioning process. The actual process of decommissioning is very slow. All the current read/write tasks will continue as usual, but no future writes are allowed on the node, while reads are still allowed with the lowest priority. Meanwhile all its blocks are replicated to other nodes.
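A minimal sketch of that sequence (the exclude-file path and hostname are assumptions):

# Assumes hdfs-site.xml already points dfs.hosts.exclude at this file,
# and that it was set before the namenode was started
echo "dn3.example.com" >> /etc/hadoop/conf/dfs.exclude

# Tell the namenode to re-read the exclude file and begin decommissioning
hadoop dfsadmin -refreshNodes

# Watch the node move to "Decommission in progress" and then "Decommissioned"
hadoop dfsadmin -report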
I have installed a Hadoop cluster with a total of 3 machines, with 2 nodes acting as datanodes and 1 node acting as the Namenode as well as a Datanode.
I wanted to clear certain doubts regarding hadoop cluster installation and architecture.
Here is a list of questions I am looking for answers to:
I uploaded a data file of around 500 MB to the cluster and then checked the hdfs report.
I noticed that the namenode I set up is also occupying 500 MB in HDFS, along with the datanodes, with a replication factor of 2.
The problem here is that I want the namenode not to store any data on it; in short, I don't want it to work as a datanode, since it is also storing the file I am uploading. So what is the way to make it act only as a master node and not as a datanode?
I tried running the command hadoop-daemon.sh stop on the Namenode to stop the datanode services on it, but it wasn't of any help.
How much metadata does a Namenode generate for a typical file size of 1 GB? Any approximations?
Go to the conf directory inside your $HADOOP_HOME directory on the master. Edit the file named slaves and remove the entry corresponding to your name node from it. This way you are asking only the other two nodes to act as slaves, and the name node to act only as the master.
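A minimal sketch of those steps, assuming the namenode's hostname is master.example.com:

# On the master: drop the namenode host from the slaves file
sed -i '/master.example.com/d' $HADOOP_HOME/conf/slaves

# Stop the datanode daemon that is currently running on the namenode host
hadoop-daemon.sh stop datanode

If blocks are already stored on that datanode, consider decommissioning it first (exclude file + refreshNodes) so its data is re-replicated to the other two nodes before you stop it.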
On a distributed Hadoop cluster, can I copy the same hdfs-site.xml file to the namenodes and datanodes?
Some of the set-up instructions I've seen (e.g. Cloudera) say to have the dfs.data.dir property in this file on the datanodes and the dfs.name.dir property in this file on the namenode. Meaning I should have two copies of hdfs-site.xml, one for the namenode and one for the datanodes.
But if it's all the same I'd rather just own/maintain one copy of the file and push it to ALL nodes anytime I change it.
Is there any harm/risk in having both dfs.name.dir and dfs.data.dir properties in the same file? What issues might happen if a data node sees the property for "dfs.name.dir" ?
And if there are issues, what other properties should be in the hdfs-site.xml file on the namenode but not on datanode? and vice versa.
And finally, what properties need to be included in the hdfs-site.xml file that I copy to a client machine (who isn't a tasktracker or datanode, but just talks to the Hadoop cluster) ?
I've searched around, including the O'Reilly operations book, but can't find any good article describing how the config file needs to differ across different nodes.
Thanks!
The namenode is picked up from the masters file, so essentially the FSImage and edit logs will be written only on the namenode and not on the datanodes, even if you copy the same hdfs-site.xml to all of them.
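As an illustration, a minimal shared hdfs-site.xml carrying both properties (the paths are placeholders); each daemon simply ignores the property it doesn't use:

# One hdfs-site.xml pushed to every node
cat > /etc/hadoop/conf/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/name</value>  <!-- read only by the namenode -->
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/data</value>  <!-- read only by the datanodes -->
  </property>
</configuration>
EOF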
For the second question: you can't necessarily communicate with HDFS without being on the cluster directly. If you want to have a remote client, you might try WebHDFS and create web services through which you can write to or access files in HDFS.
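For example, a quick WebHDFS call from a remote machine (the namenode hostname is a placeholder; 50070 is the classic default HTTP port):

# List an HDFS directory over the WebHDFS REST API
curl -i "http://namenode.example.com:50070/webhdfs/v1/tmp/?op=LISTSTATUS"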
I'm curious if you could essentially separate the HDFS filesystem from the MapReduce framework. I know that the main point of Hadoop is to run the maps and reduces on the machines with the data in question, but I was wondering if you could just edit the *.xml files to change which machines the jobtracker, namenode and datanodes are running on.
Currently, my configuration is a 2-VM setup: one (the master) with Namenode, Datanode, JobTracker, Tasktracker (and the SecondaryNameNode), the other (the slave) with DataNode and Tasktracker. Essentially, what I want to change is to have the master with NameNode, DataNode(s), JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each). The bottleneck will be the data transfer between the two VMs for the computations of maps and reduces, but since the data at this stage is so small I'm not primarily concerned with it. I would just like to know if this configuration is possible, and how to do it. Any tips?
Thanks!
You don't specify this kind of options in the configuration files.
What you have to do is take care of which daemons you start on each machine (you call them VMs, but I think you mean machines).
I suppose you usually start everything using the start-all.sh script, which you can find in the bin directory under the Hadoop installation dir.
If you take a look at this script, you will see that what it does is call a number of sub-scripts that start the datanodes, tasktrackers, namenode and jobtracker.
In order to achieve what you've said, I would do it like this:
Modify the masters and slaves files as follows:
The masters file should contain the name of machine1
The slaves file should contain the name of machine2
Run start-mapred.sh
Then modify the masters and slaves files again:
The masters file should contain machine1
The slaves file should contain machine1
Run start-dfs.sh
I have to tell you that I've never tried such a configuration, so I'm not sure it is going to work, but you can give it a try. Anyway, the solution is in this direction!
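A sketch of that sequence as shell commands, run on machine1 (machine1 and machine2 are assumed to be resolvable hostnames, and $HADOOP_HOME/conf is assumed to hold the config; the same masters/slaves files are reused, so the order matters):

cd $HADOOP_HOME

# Phase 1: MapReduce - JobTracker starts locally, TaskTrackers on the hosts in conf/slaves
echo "machine1" > conf/masters
echo "machine2" > conf/slaves
bin/start-mapred.sh

# Phase 2: HDFS - NameNode starts locally, DataNodes on the hosts in conf/slaves
echo "machine1" > conf/masters
echo "machine1" > conf/slaves
bin/start-dfs.sh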
Essentially, what I want to change is have the master with NameNode DataNode(s), JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each).
First, I am not sure why you would separate the computation from the storage. The whole purpose of MR data locality is lost, though you might still be able to run the job successfully.
Use the dfs.hosts and dfs.hosts.exclude parameters to control which datanodes can connect to the namenode, and the mapreduce.jobtracker.hosts.filename and mapreduce.jobtracker.hosts.exclude.filename parameters to control which tasktrackers can connect to the jobtracker. One disadvantage of this approach is that the datanode and tasktracker daemons are still started on the excluded nodes, even though those nodes aren't part of the Hadoop cluster.
Another approach is to modify the code to have a separate slaves file for the tasktracker and the datanode. Currently this is not supported in Hadoop and would require a code change.
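A minimal sketch of the first approach (the host-list file locations are assumptions; here machine1 is the only allowed datanode and machine2 the only allowed tasktracker):

# Host lists naming which workers may connect
echo "machine1" > /etc/hadoop/conf/datanode.hosts
echo "machine2" > /etc/hadoop/conf/tasktracker.hosts

# Point the daemons at those lists:
#   dfs.hosts                            (hdfs-site.xml)   = /etc/hadoop/conf/datanode.hosts
#   mapreduce.jobtracker.hosts.filename  (mapred-site.xml) = /etc/hadoop/conf/tasktracker.hosts
# Then restart the namenode/jobtracker, or refresh them:
hadoop dfsadmin -refreshNodes
hadoop mradmin -refreshNodes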