How to allocate memory to datanode in hadoop configuration - hadoop

We have the following requirement.
We have a total of 5 servers which will be used to build a big data Hadoop data warehouse system (we are not going to use any distribution such as Cloudera or Hortonworks).
Each server has 512 GB RAM, 30 TB storage and 16 cores, and runs Ubuntu Linux 14.04 LTS Server.
We will install Hadoop on all the servers. Servers 3, 4 and 5 will be used purely as DataNodes (slave machines), whereas server 1 will run the active NameNode and a DataNode, and server 2 will run the standby NameNode and a DataNode.
We want to configure 300 GB RAM for the NameNode and 212 GB RAM for the DataNode when configuring Hadoop.
Could anyone help me with how to do that? Which Hadoop configuration file needs to be changed, and which parameters do we need to set?
Thanks and Regards,
Suresh Pitchaipillai

You can set these properties from Cloudera Manager (if you are using CDH) or from Ambari (if you use Hortonworks).
Also, you do not need 300 GB for the NameNode, as the NameNode only stores metadata. Roughly speaking, 1 GB of NameNode heap can hold the metadata for 1 million blocks (block size = 128 MB).
More details here: https://issues.apache.org/jira/browse/HADOOP-1687
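To put the rule of thumb above into numbers for the cluster in the question, here is a rough sketch in shell arithmetic. It assumes a replication factor of 3 (the HDFS default, not stated in the question), that every block is a full 128 MB, and that all 5 x 30 TB of disk is given to HDFS. The variable names are illustrative only.

```shell
#!/bin/sh
# Rough NameNode heap estimate: ~1 GB of heap per 1 million blocks (128 MB each).
# Assumptions (not from the question): replication factor 3, every block is full,
# and all raw disk is available to HDFS.
servers=5
disk_tb_per_server=30
replication=3
block_mb=128

raw_mb=$(( servers * disk_tb_per_server * 1024 * 1024 ))
# Unique data that fits when each block is stored `replication` times.
data_mb=$(( raw_mb / replication ))
blocks=$(( data_mb / block_mb ))
# Round up to whole GB at 1 GB per million blocks.
heap_gb=$(( (blocks + 999999) / 1000000 ))

echo "blocks=$blocks estimated_heap_gb=$heap_gb"
```

Under these assumptions the script prints blocks=409600 estimated_heap_gb=1, far below 300 GB. Many small files would raise the block count, so a real estimate should start from the expected number of files and blocks rather than from disk size alone.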

Assuming that you are going to use the latest Hadoop distribution with YARN:
Read this article - Reference. It explains every parameter in detail and does so very well.
There is one more article from Hortonworks, which applies to all Apache-based Hadoop distributions.
Finally, keep this handy - Yarn-configuration. It is self-explanatory.
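For a plain Apache install without Cloudera Manager or Ambari, daemon heap sizes are usually set in etc/hadoop/hadoop-env.sh, while the memory YARN may hand out to containers is configured separately (for example yarn.nodemanager.resource.memory-mb in yarn-site.xml). A minimal sketch for Hadoop 2.x follows. The heap values are placeholders rather than recommendations, and Hadoop 3.x renames these variables to HDFS_NAMENODE_OPTS and HDFS_DATANODE_OPTS.

```shell
# etc/hadoop/hadoop-env.sh (Hadoop 2.x variable names)
# Heap sizes below are illustrative placeholders, not recommendations.

# NameNode JVM heap. Appended after any existing options so that these
# heap flags are the last ones the JVM sees.
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS:-} -Xms8g -Xmx8g"

# DataNode JVM heap. The DataNode daemon's own heap is typically modest;
# memory for job containers is governed by the YARN settings instead.
export HADOOP_DATANODE_OPTS="${HADOOP_DATANODE_OPTS:-} -Xms4g -Xmx4g"
```

The file has to be changed on every node that runs the daemon in question, and the daemons restarted for the new heap size to take effect.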

Related

How does Hadoop distribute the data/tasks for MapReduce jobs?

I've set up a Hadoop cluster with 4 nodes, one of which serves as the NameNode for HDFS as well as the YARN master. This node is also the most powerful.
Now, I've distributed 2 text files, one on node01 (NameNode) and one on node03 (DataNode). When running the basic WordCount MapReduce job, I can see in the logs that only node01 was doing any calculations.
My question is why Hadoop didn't decide to run MapReduce on node03 and transfer the result, instead of transferring the entire book to node01. I also checked: replication is disabled and the book is only available on node03.
So, how does Hadoop decide between transferring the data and setting up the tasks where the data is? And in this decision, does it check which machine has more compute power (e.g. did it decide to transfer to node01 because node01 is a 4-core, 4 GB RAM machine versus 2 cores and 1 GB on node03)?
I couldn't find anything on this topic, so any guidance would be appreciated.
Thank you!
Some more clarifications:
node01 is running a NameNode as well as a DataNode and a ResourceManager as well as a NodeManager. Thus, it serves as "main node" as well as a "compute node".
I made sure to put one file on node01 and one file on node03 by running:
hdfs dfs -put sample1.txt samples on node01 and hdfs dfs -put sample2.txt samples on node03. As replication is disabled, the data that was available locally on node01 and node03 respectively is stored only there.
I verified this using the HDFS web interface. For sample1.txt, it says the blocks are only available on node01; for sample2.txt, it says the blocks are only available on node03.
Regarding @cricket_007:
My concern is that sample2.txt is only available on node03. The YARN web interface tells me that for the application attempt, only one container was allocated on node01. If the map task for sample2.txt had run on node03, there would have been a container on node03 as well.
Thus, node01 needs to have fetched the sample2.txt file from node03.
Yes, I know Hadoop is not running well on 1gig of RAM, but I am working with a Raspberry Pi cluster just to fiddle around and learn a little. This is not for production usage.
The YARN ApplicationMaster picks a node at random to run the computation, using information from the NameNode about where the files are stored. DataNodes and NodeManagers should run on the same machines.
If your file isn't larger than the HDFS block size, there is no reason to fetch the data from other nodes.
Note: Hadoop services don't run that well on only 1G of RAM, and you need to adjust the YARN settings differently for different sized nodes.
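The point about block size can be made concrete with a little arithmetic. As a simplified sketch, the number of map tasks for a plain text file is roughly the file size divided by the block size, rounded up. This ignores details of the real split calculation in FileInputFormat, and the file size below is a made-up example.

```shell
#!/bin/sh
# Simplified estimate of map tasks for one file: ceil(file_size / block_size).
# file_mb is a hypothetical example; 128 MB is the default HDFS block size
# in Hadoop 2.x and later.
file_mb=5
block_mb=128

splits=$(( (file_mb + block_mb - 1) / block_mb ))
echo "file of ${file_mb} MB -> ${splits} map task(s)"
```

A book-sized text file therefore yields a single split, and the one map task reading it is ideally scheduled on the node that holds that block.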
For anyone else wondering:
At least for me, the HistoryServer UI (which needs to be started manually) shows correctly that node03 and node01 were running map jobs. Thus, my statement was incorrect. I still wonder why the application attempt UI speaks of one container, but I guess that doesn't matter.
Thank you guys!

Migrating Hadoop Clusters from Big Insights to Cloudera

What are the best approaches to migrate a cluster of size 1 TB from Big Insights to Cloudera?
The Cloudera cluster is Kerberized.
The current approach which we are following is through batches:
a. Take the cluster and move it to Unix filesystem
b. SCP to Cloudera filesystem
c. Dump from cloudera file system to cloudera HDFS
This is not an effective approach.
DistCp does work with a Kerberized cluster.
However, it's not clear whether you have 333 GB x 3 replicas = 1 TB, or 1 TB of raw data.
In either case, you're more than welcome to purchase an external 4TB (or more) drive and copyToLocal every file on your cluster, then upload it anywhere else.
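As a sketch of the DistCp route mentioned above: one common arrangement is to run the command on the Kerberized (Cloudera) side after obtaining a ticket with kinit. The host names, port, paths and principal below are hypothetical placeholders. The script only prints the command unless RUN=1 is set, so it can be reviewed first. If the two clusters run incompatible Hadoop versions, a webhdfs:// source URI may be needed instead of hdfs://.

```shell
#!/bin/sh
# Dry-run sketch of an HDFS-to-HDFS copy with DistCp.
# All names below are hypothetical placeholders.
SRC="hdfs://biginsights-nn.example.com:8020/data"
DST="hdfs://cloudera-nn.example.com:8020/data"
PRINCIPAL="etl_user@EXAMPLE.COM"

# -update copies only files that are missing or differ at the destination,
# so the job can be re-run after a partial failure.
distcp_cmd="hadoop distcp -update $SRC $DST"

if [ "${RUN:-0}" = "1" ]; then
  kinit "$PRINCIPAL"      # obtain a Kerberos ticket first
  $distcp_cmd
else
  echo "$distcp_cmd"      # dry run: just show what would be executed
fi
```

This avoids the intermediate copy to a local Unix filesystem, since the data moves directly between the two HDFS instances as a MapReduce job.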

How to add a Secondary NameNode in a HBase cluster setup?

I have an HBase cluster set up with 3 nodes: a NameNode and 2 DataNodes.
The NameNode is a server with 4GB memory and 20GB hard disk while each DataNode has 8GB memory and 100GB hard disk.
I'm using
Apache Hadoop version: 2.7.2 and
Apache Hbase version: 1.2.4
I've seen some people mention a Secondary NameNode.
My questions are,
What is the impact of not having a Secondary NameNode in my setup?
Is it possible to use one of the DataNodes as the Secondary NameNode?
If possible how can I do it? (I inserted only the NameNode in /etc/hadoop/masters file.)
What is the impact of not having a Secondary NameNode in my setup?
The Secondary NameNode periodically merges the namespace image with the edit log (this is called checkpointing). Your setup is not a high-availability setup, so without one the edit log will grow large, which eventually adds overhead to the NameNode during startup.
Is it possible to use one of the DataNodes as the Secondary NameNode?
Running the SNN on a DataNode host is not recommended. A separate host is preferred for the Secondary NameNode process. The host chosen for the SNN must have the same amount of memory as the NN.
If possible how can I do it? (I inserted only the NameNode in /etc/hadoop/masters file.)
The masters file is no longer used. Add this property to hdfs-site.xml:
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>SNN_host:50090</value>
</property>
Also note that the Secondary NameNode process is started by default on the node where start-dfs.sh is executed.
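Once the property is in place, the configuration can be checked from the command line before restarting HDFS. The sketch below uses hdfs getconf, which reads the local configuration files, and falls back to a message when the hdfs command is not on the PATH. The helper name check_snn_config is made up for this example.

```shell
#!/bin/sh
# Show which address and hosts HDFS resolves for the Secondary NameNode.
# check_snn_config is a hypothetical helper name.
check_snn_config() {
  if command -v hdfs >/dev/null 2>&1; then
    # Value of the property set in hdfs-site.xml
    hdfs getconf -confKey dfs.namenode.secondary.http-address
    # Hosts configured to run a Secondary NameNode
    hdfs getconf -secondaryNameNodes
  else
    echo "hdfs not found on PATH; run this on a cluster node"
  fi
}

check_snn_config
```

If the output still shows the default address rather than SNN_host, the edited hdfs-site.xml is probably not the one on the active configuration path.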

differences between HDFS and ZooKeeper?

While reading ZooKeeper's documentation, it seems to me that HDFS relies on pretty much the same mechanisms of distribution/replication (broadly speaking) as ZooKeeper. I hear some echo from one to another, but I still can't distinguish things clearly and strictly.
I understand ZooKeeper is a Cluster Management / Sync tool, while HDFS is a Distributed File Management System, but could ZK be needed on an HDFS cluster for example?
Yes. Distributed processing and high availability on a Hadoop cluster rely on a ZooKeeper quorum.
For example, consider the Hadoop NameNode failover process.
Hadoop high availability is designed around an active NameNode and a standby NameNode for failover. At no point in time should you have two masters (active NameNodes) at once.
ZooKeeper resolves the cluster address to the active NameNode.

When will HDFS be unavailable?

Name node is the single point of failure for HDFS. Is this correct?
Then what about Jobtracker? If Jobtracker fails, is HDFS available?
HDFS is completely independent of the Jobtracker. As long as at least the NN is up, HDFS is nominally usable, with overall degradation dependent on the number of Datanodes that are down.
As Ambar mentioned, HDFS (the file system) does not depend on the JobTracker. The current released version of Hadoop does not support NameNode high availability out of the box, but you can work around it (e.g. deploy the NameNode using a traditional active/passive clustering solution with shared storage).
The next release (2.0/0.23) does fix the NameNode availability issue.
You can read more about it in a blog post by Aaron Myers, "High Availability for the Hadoop Distributed File System (HDFS)".
If the JobTracker is not available, you cannot execute map/reduce jobs.

Resources