Do configuration properties in hdfs-site.xml apply to the NameNode in Hadoop?

I recently set up a test Hadoop cluster: one master and two slaves.
The master is NOT a DataNode (although some use the master node as both master and slave).
So basically I have 2 DataNodes. The default replication factor is 3.
Initially, I did not change any configuration in conf/hdfs-site.xml, and I was getting the error "could only be replicated to 0 nodes, instead of 1".
I then changed the configuration in conf/hdfs-site.xml on both my master and slaves as follows:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
and lo! everything worked fine.
My question is: does this configuration apply to the NameNode or to the DataNodes? I changed hdfs-site.xml on all my DataNodes as well as on the NameNode.
If my understanding is correct, the NameNode allocates blocks to the DataNodes, so the replication setting on the master/NameNode is what matters and is probably not needed on the DataNodes. Is this correct?
I am also confused about the actual purpose of the different XML files in the Hadoop framework. From my limited understanding:
1) core-site.xml - configuration parameters for the entire framework, such as where the log files should go, what the default name of the filesystem is, etc.
2) hdfs-site.xml - applies to individual DataNodes: the replication factor, the data directory on the DataNode's local filesystem, the block size, etc.
3) mapred-site.xml - applies to the DataNodes and provides the configuration for the TaskTracker.
Please correct me if this is wrong. These configuration files are not well explained in the tutorials I followed, so this comes from my own look at the default values of these files.

This is my understanding and I may be wrong.
hdfs-site.xml - is for the properties of HDFS (Hadoop Distributed File System)
mapred-site.xml - is for the properties of MapReduce
core-site.xml - is for other properties that touch both HDFS and MapReduce
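To make that split concrete, here is a rough sketch of where a typical property lives in each file (the values are placeholders, and mapreduce.framework.name assumes a YARN-era Hadoop 2.x setup; on older releases the mapred-site.xml property would be mapred.job.tracker instead):
<!-- core-site.xml -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<!-- hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- mapred-site.xml -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>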

This is usually caused by insufficient space.
Please check the total, used, and remaining capacity of your cluster using
hdfs dfsadmin -report
Also check dfs.datanode.du.reserved in hdfs-site.xml; if this value is larger than your remaining capacity, the DataNodes have no usable space left for HDFS blocks.
Look for other possible causes explained here.
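For reference, dfs.datanode.du.reserved lives in hdfs-site.xml and is given in bytes; a sketch reserving roughly 1 GB per disk for non-DFS use (pick a value that fits your disks) would be:
<property>
<name>dfs.datanode.du.reserved</name>
<value>1073741824</value>
</property>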

Related

Which processes need access to core-site.xml and hdfs-site.xml

The core-site.xml file informs Hadoop daemons where the NameNode runs in
the cluster. It contains the configuration settings for Hadoop Core,
such as I/O settings that are common to HDFS and MapReduce.
The hdfs-site.xml file contains the configuration settings for the HDFS
daemons: the NameNode, the Secondary NameNode, and the DataNodes.
Here, we can configure hdfs-site.xml to specify the default block
replication and permission checking on HDFS. The actual number of
replicas can also be specified when a file is created; the default is
used if replication is not specified at create time.
I'm looking to understand which processes [Namenode, Datanode, HDFS client] need access to which of those configuration files?
Namenode: I presume it only needs hdfs-site.xml because it doesn't need to know its own location.
Datanode: I presume it needs access to both core-site.xml (to locate the namenode) and hdfs-site.xml (for various settings)?
HDFS client: I presume it needs access to both core-site.xml (to locate the namenode) and hdfs-site.xml (for various settings)?
Is that accurate?
The clients and server processes need access to both files.
If you use HDFS nameservices with highly available NameNodes, then the two NameNodes need to find each other.
Some comments:
core-site.xml and hdfs-site.xml are the two files used by external programs (such as NiFi) to access the cluster / WebHDFS API.
Edge nodes require both for cluster access.
Ambari will manage both of these along with all the others.
The three processes you listed all need access to both files in order to run the cluster and, at a bare minimum, to pick up basic settings such as proxy settings and cluster access.
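As an illustration of the "locate the NameNode" part, the one property every client-side core-site.xml needs is fs.defaultFS (fs.default.name in older releases); the hostname and port below are placeholders for your NameNode's RPC address:
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode.example.com:8020</value>
</property>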

How to add a Secondary NameNode in an HBase cluster setup?

I have an HBase cluster set up with 3 nodes: a NameNode and 2 DataNodes.
The NameNode is a server with 4GB memory and a 20GB hard disk, while each DataNode has 8GB memory and a 100GB hard disk.
I'm using
Apache Hadoop version 2.7.2 and
Apache HBase version 1.2.4.
I've seen some people mention a Secondary NameNode.
My questions are,
What is the impact of not having a Secondary NameNode in my setup?
Is it possible to use one of the DataNodes as the Secondary NameNode?
If possible how can I do it? (I inserted only the NameNode in /etc/hadoop/masters file.)
What is the impact of not having a Secondary NameNode in my setup?
The Secondary NameNode periodically merges the namespace image with the edit log (this is called checkpointing). Your setup is not a High-Availability setup, so not having one will cause the edit log to grow large, which eventually adds overhead to the NameNode during startup.
Is it possible to use one of the DataNodes as the Secondary NameNode?
Running the SNN on a DataNode host is not recommended; a separate host is preferred for the Secondary NameNode process. The host chosen for the SNN should have as much memory as the NN, since it loads the same namespace image.
If possible how can I do it? (I inserted only the NameNode in /etc/hadoop/masters file.)
The masters file is not used anymore. Add this property to hdfs-site.xml:
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>SNN_host:50090</value>
</property>
Also note that the SecondaryNameNode process is started by default on the node where start-dfs.sh is executed.
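If you also want to control how often the SNN checkpoints, the relevant hdfs-site.xml properties are the ones below (the values shown are the usual Hadoop 2.x defaults; treat them as a starting point):
<!-- checkpoint at least once per hour -->
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>3600</value>
</property>
<!-- or earlier, once this many uncheckpointed transactions accumulate -->
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>1000000</value>
</property>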

Hadoop client and cluster separation

I am a newbie to Hadoop, and to Linux as well. My professor asked us to separate the Hadoop client and cluster using port mapping or a VPN. I don't understand the point of such a separation. Can anybody give me a hint?
Now I get the idea of client/cluster separation. I think Hadoop also needs to be installed on the client machine. When the client submits a Hadoop job, it is submitted to the masters of the cluster.
And I have some naive ideas:
1. Create a client machine and install Hadoop on it.
2. Set fs.default.name to hdfs://master:9000.
3. Set dfs.namenode.name.dir to file://master/home/hduser/hadoop_tmp/hdfs/namenode.
Is this correct?
4. Then I don't know how to set dfs.namenode.name.dir and the other configurations.
5. I think the main idea is to set the configuration files so that the job runs on the Hadoop cluster, but I don't know how to do it exactly.
First of all, this link has detailed information on how the client communicates with the NameNode:
http://www.informit.com/articles/article.aspx?p=2460260&seqNum=2
To my understanding, your professor wants a separate node to act as a client, from which you can run Hadoop jobs, but that node should not be part of the Hadoop cluster.
Consider a scenario where you have to submit a Hadoop job from a client machine that is not part of the existing Hadoop cluster, and the job is expected to be executed on the Hadoop cluster.
The NameNode and DataNodes form the Hadoop cluster; the client submits jobs to the NameNode.
To achieve this, the client should have the same copy of the Hadoop distribution and configuration that is present on the NameNode.
Only then will the client know on which node the JobTracker is running, and the IP of the NameNode, in order to access HDFS data.
Go through the configuration on the NameNode:
core-site.xml will have this property:
<property>
<name>fs.default.name</name>
<value>192.168.0.1:9000</value>
</property>
mapred-site.xml will have this property:
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.1:8021</value>
</property>
These two important properties must be copied to the client machine's Hadoop configuration.
You also need to set one additional property in the mapred-site.xml file to avoid a Privileged Action Exception:
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
You also need to update /etc/hosts on the client machine with the IP addresses and hostnames of the NameNode and DataNodes.
Now you can submit a job from the client machine with the hadoop jar command, and the job will be executed on the Hadoop cluster. Note that you shouldn't start any Hadoop services on the client machine.
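As a sketch of that last step (the jar name, main class, and HDFS paths below are placeholders for your own job):
hadoop jar my-job.jar com.example.MyJob /input /output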
Users shouldn't be able to disrupt the functionality of the cluster. That's the meaning. Imagine there is a whole bunch of data scientists that launch their jobs from one of the cluster's masters. In case someone launches a memory-intensive operation, the master processes that are running on the same machine could end up with no memory and crash. That would leave the whole cluster in a failed state.
If you separate client node from master/slave nodes, users could still crash the client, but the cluster would stay up.

Hadoop config - hdfs-site.xml: should I use the same file on the NameNode and DataNodes?

On a distributed Hadoop cluster, can I copy the same hdfs-site.xml file to the NameNodes and DataNodes?
Some of the set-up instructions I've seen (e.g. Cloudera's) say to have the dfs.data.dir property in this file on the DataNodes and the dfs.name.dir property in this file on the NameNode. Meaning I should have two copies of hdfs-site.xml, one for the NameNode and one for the DataNodes.
But if it's all the same, I'd rather just own/maintain one copy of the file and push it to ALL nodes anytime I change it.
Is there any harm/risk in having both the dfs.name.dir and dfs.data.dir properties in the same file? What issues might happen if a DataNode sees the dfs.name.dir property?
And if there are issues, what other properties should be in the hdfs-site.xml file on the NameNode but not on the DataNodes, and vice versa?
And finally, what properties need to be included in the hdfs-site.xml file that I copy to a client machine (which isn't a TaskTracker or DataNode, but just talks to the Hadoop cluster)?
I've searched around, including the O'Reilly operations book, but can't find any good article describing how the config file needs to differ across different nodes.
Thanks!
The NameNode is picked up from the masters file, so essentially the FSImage and edit logs will be written only on the NameNode and not on the DataNodes, even if you copy the same hdfs-site.xml to all of them.
For the second question: you can't necessarily communicate with HDFS without being on the cluster directly. If you want a remote client, you might try WebHDFS and build web services with which you can write to or read files in HDFS.
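For what it's worth, a single shared hdfs-site.xml carrying both directory properties is common practice: each daemon reads only the properties relevant to it, so a DataNode simply ignores dfs.name.dir and the NameNode ignores dfs.data.dir. A sketch with placeholder paths (these are the old Hadoop 1.x names used in the question, later renamed dfs.namenode.name.dir and dfs.datanode.data.dir):
<property>
<name>dfs.name.dir</name>
<value>/data/hadoop/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data/hadoop/data</value>
</property>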

Simulating Map-reduce using Cloudera

I want to use Cloudera to simulate a Hadoop job on a single machine (of course with many VMs). I have 2 questions:
1) Can I change the replication policy of HDFS in Cloudera?
2) Can I see the CPU usage of each VM?
You can use hadoop fs -setrep to change the replication factor of any file. You can also change the default replication factor by modifying hdfs-site.xml and adding the following:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
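For the per-file route, -setrep takes the new replication factor and a path (the path below is a placeholder; -w waits until the change has been applied):
hadoop fs -setrep -w 2 /user/example/data.txt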
You'll have to log into each box and use top to see the CPU usage of each VM. There is nothing out of the box in Hadoop that lets you see this.
I found out that I can change the data replication policy by modifying ReplicationTargetChooser.java.
