HDP cluster with RAID?

What is your experience with RAID 1 on an HDP cluster?
I have two options in mind:
1. Set up RAID 1 for the master and ZooKeeper nodes, and no RAID at all on slave nodes such as Kafka brokers, HBase RegionServers and YARN NodeManagers. Even if I lose one slave node, I still have two other replicas. In my opinion, RAID would only slow my cluster down.
2. Despite everything, set up everything with RAID 1.
What do you think about it? What is your experience with HDP and RAID?
What do you think about using RAID 0 for slave nodes?

I'd recommend no RAID at all on Hadoop hosts. There is one caveat: if you are running services like Oozie or the Hive metastore that use a relational DB behind the scenes, RAID may well make sense on the DB host.
On a master node, assuming you have the NameNode, ZooKeeper etc., the redundancy is generally built into the service. With NameNode HA, all the metadata is stored on both NameNodes. For ZooKeeper, if you lose one node, the other two nodes still have all the information.
ZooKeeper likes fast disks - ideally dedicate a full disk to ZooKeeper. If you have NameNode HA, give the NameNode edits directory and each JournalNode a dedicated disk too.
For the slave nodes, the DataNode will write across all disks, effectively striping the data anyway. Each write is at most the HDFS block size, so if you were writing a large file, you could get 128 MB on disk 1, then the next 128 MB on disk 2, and so on.
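A minimal hdfs-site.xml sketch of that layout, assuming NameNode HA and JBOD data disks; the mount points (/nn, /jn, /disk1 ... /disk3) are placeholders for your own hardware:

<!-- hdfs-site.xml (sketch; mount points are placeholders) -->
<property>
  <!-- NameNode metadata/edits on its own dedicated disk -->
  <name>dfs.namenode.name.dir</name>
  <value>/nn/hdfs/namenode</value>
</property>
<property>
  <!-- each JournalNode writes its edits to a dedicated disk -->
  <name>dfs.journalnode.edits.dir</name>
  <value>/jn/hdfs/journalnode</value>
</property>
<property>
  <!-- DataNode spreads block writes across all listed disks (JBOD, no RAID) -->
  <name>dfs.datanode.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>

ZooKeeper's dataDir in zoo.cfg would similarly point at its own dedicated disk.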

Related

Memory for Namenode(s) in Hadoop

Environment: the production cluster has two NameNodes (active and standby) and those nodes use SAS drives in a RAID 1 configuration. They run nothing but the master services (NN and standby NN) and have 256 GB of RAM, while the data nodes (where most of the processing happens) have only 128 GB.
My question: why do Hadoop's master nodes have so much RAM and not the data nodes, when most of the processing is done where the data lives?
P.S. As per the Hadoop rule of thumb, we only require 1 GB for every 1 million files.
The NameNode keeps the references to all files and blocks across all the DataNodes in memory.
The DataNode process doesn't need much memory; it's the YARN NodeManagers that do.
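A rough worked example of that rule of thumb (the numbers are illustrative, not taken from the cluster above): at roughly 1 GB of NameNode heap per million files, a namespace of about 200 million files needs on the order of 200 GB of heap, which is why 256 GB of RAM on a master is not unreasonable. One way the heap might be set in hadoop-env.sh:

# hadoop-env.sh (sketch; the heap size is purely illustrative)
# ~1 GB of NameNode heap per million files (rule of thumb above),
# so ~200 million files -> roughly 200 GB of heap on the master.
export HADOOP_NAMENODE_OPTS="-Xms200g -Xmx200g ${HADOOP_NAMENODE_OPTS}"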

Does hbase really scales linearly?

I have started to learn HBase and I don't understand how it scales linearly.
The problem is that before you install HBase you have to have an HDFS cluster. The HDFS cluster has a master node, and only one can be active in the whole cluster, so it is a bottleneck. Of course we can run one more master node (it is possible to run only one more), but it will be in standby state.
As I understand it, HBase uses the HDFS cluster to store its data. So to me it seems pointless to run more than one HMaster, because all requests will go to the active HDFS master, whose performance can suffer if there are too many requests.
Also, I don't properly understand whether we need to install HBase on the same nodes as HDFS or separately. What are the benefits of running HBase separately from HDFS?
To me it seems logical to install the HBase cluster on the same nodes as HDFS, as in the following example:
HDFS active master - HMaster
HDFS standby master - HMaster backup
HDFS Data node - HRegion server
To me this is the most logical structure, because if we separate the HDFS master from the HMaster, the probability of losing the HBase cluster becomes twice as big.
I would be very happy if someone could share information about all of this, because I really don't understand how HBase can scale linearly and how it works with HDFS.
First of all, you can install HBase over any supported file system; it is not mandatory to run it over HDFS. But running it on HDFS gives it advantages such as fault tolerance, data replication, checksums, etc.
That's why it is recommended to run HBase over HDFS.
Moreover, although the NameNode is a bottleneck in HDFS, it does not hurt HBase's efficiency, because not every operation depends internally on the HDFS NameNode. For instance, RegionServers serve the data for reads and writes: when accessing data, clients communicate with HBase RegionServers directly, while region assignment and DDL operations (creating and deleting tables) are handled by the HBase Master process. This means that reading and writing data is independent of creating and deleting tables.
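A small illustration in the hbase shell (the table name 'usertable' and column family 'cf' are made up): only the DDL statement goes through the HBase Master; the put and get are served directly by the RegionServer holding the row.

create 'usertable', 'cf'                       # DDL: handled by the HBase Master
put 'usertable', 'row1', 'cf:name', 'alice'    # write: goes straight to a RegionServer
get 'usertable', 'row1'                        # read: served by a RegionServer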
You can refer to https://www.mapr.com/blog/in-depth-look-hbase-architecture for more details about the HBase architecture.
Also see this webinar on HBase by Lars George: https://m.youtube.com/watch?v=_HLoH_PgrLk
Hope this clears your doubts.

Hadoop multi node cluster

I am a newbie to Hadoop. Please correct me if I am asking nonsense, and help me to solve this problem :).
I installed and configured a two-node Hadoop cluster (YARN).
Master node: 2 TB HDD, 4 GB RAM
Slave node: 500 GB HDD, 4 GB RAM
Datanode: master node only (not keeping replicated data on the slave node)
Map/Reduce: master node & slave node
Out of 10 TB of data, I uploaded 2 TB to the master node (the data node). I am using the slave node for Map/Reduce only (to use 100% of the slave node's CPU for running queries).
My questions:
1. If I add a new 2 TB HDD to the master node and want to upload 2 TB more, how can I use both HDDs (the data on the old HDD and the new HDD on the master)? Is there any way to give multiple HDD paths in hdfs-site.xml?
2. Do I need to add a 4 TB HDD to the slave node (holding all the data that is on the master) to use 100% of the slave's CPU? Or can the slave access the data from the master and run Map/Reduce jobs?
3. If I add 4 TB to the slave and upload data to Hadoop, will that create any replication on the master (duplicates)? Can I access all the data on the primary HDD of the master and the primary HDD of the slave? Will queries use 100% of the CPU of both nodes if I do this?
4. As a whole, if I have 10 TB of data, what is the proper way to configure a two-node Hadoop cluster? What specification (for master and data node) should I use to run Hive queries fast?
I got stuck. I really need your suggestions and help.
Thanks a ton in advance.
Please find the answers below:
1. Provide a comma-separated list of directories in hdfs-site.xml (see the sketch after this list). Source: https://www.safaribooksonline.com/library/view/hadoop-mapreduce-cookbook/9781849517287/ch02s05.html
2. No, you don't need to add an HDD to the slave to use 100% of its CPU. Under the current configuration, the NodeManager running on the slave will read data from the DataNode running on the master (over the network). This is not efficient in terms of data locality, but it doesn't affect the processing throughput; it just adds latency due to the network transfer.
3. No. The replication factor (the number of copies to be stored) is independent of the number of data nodes. The default replication factor can be changed in hdfs-site.xml using the property dfs.replication. You can also configure it on a per-file basis.
4. You will need at least 10 TB of storage across your cluster (all the data nodes combined, with replication factor 1). For a production system I would recommend a replication factor of 3 (to handle node failure), that is 10 * 3 = 30 TB of storage across at least 3 nodes. Since 10 TB is still fairly small in Hadoop terms, have 3 nodes, each with a 2 or 4 core processor and 4 to 8 GB of memory. Configure this as: node1: NameNode + DataNode + NodeManager; node2: ResourceManager + DataNode + NodeManager; node3: DataNode + NodeManager.
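A minimal hdfs-site.xml sketch covering points 1 and 3; the mount points /disk1 and /disk2 are placeholders for the old and the new HDD:

<!-- hdfs-site.xml on the master (sketch; mount points are placeholders) -->
<property>
  <!-- old HDD and new HDD, comma-separated; the DataNode uses both -->
  <name>dfs.datanode.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
</property>
<property>
  <!-- default number of copies kept per block -->
  <name>dfs.replication</name>
  <value>1</value>
</property>

To raise the replication of a single existing file afterwards (the path is made up): hdfs dfs -setrep -w 2 /data/somefile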

DataNode and TaskTracker on separate machines?

I am fairly new to Hadoop and I have the following questions on the Hadoop framework. Can anybody please guide me on this?
Are the DataNode and TaskTracker located on physically separate machines in a production environment?
When does Hadoop split a file into blocks? Does this happen when you copy a file from the local filesystem into HDFS?
Short answer
1. Most of the time, but not necessarily.
2. Yes.
Long Answer
1)
An installation of Hadoop on a cluster will have 2 main types of nodes:
Master Nodes
Data Nodes
Master Nodes typically run at least:
CLDB
Zookeeper
JobTracker
Data Nodes typically run at least:
TaskTracker
The DataNode service can run on a different node than the TaskTracker service. However, the Hadoop docs for the DataNode service recommend running the DataNode and TaskTracker on the same nodes so that MapReduce operations are performed close to the data.
For the MapR distribution of Hadoop, the two server roles typically run:
MapR Control Node
ZooKeeper *
CLDB *
JobTracker *
HBaseMaster
NFS Gateway
Webserver
MapR Data Node
TaskTracker *
RegionServer (sometimes)
Zookeeper (sometimes)
2)
While most filesystems store data in blocks, HDFS distributes & replicates the blocks across DataNodes. When you first store data in HDFS, it will break it into blocks and store it across different nodes according to the specified replication factor. However, if you add new DataNodes to the cluster, it will not automatically rebalance old blocks across them unless the replication factor is not met.
(Thanks to #javadba for clarifying this!)
Given that TrinitronX has already answered #1 - though the short answer should really be NO: the DataNode and TaskTracker MAY be on different physical machines, but it is uncommon. You are best off starting with the "slave" machines running both the DataNode and the TaskTracker.
So this is an answer to the second part of the question
2) When does Hadoop split a file into blocks? Does this happen when you copy a file from the local filesystem into HDFS?
Yes. The file is broken into blocks upon loading into HDFS.
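A quick way to see this (the file and path are made up): load a file and then ask the NameNode for its block layout with fsck.

hadoop fs -put bigfile.dat /data/bigfile.dat               # the file is split into blocks on ingest
hadoop fsck /data/bigfile.dat -files -blocks -locations    # lists each block and the DataNodes holding it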
The DataNode and JobTracker can run on different machines.
Hadoop always stores files as blocks during all operations on Hadoop.
Refer to:
1. Hadoop Job tracker and task tracker
2. Hadoop block size and Replication

HADOOP HDFS imbalance issue

I have a Hadoop cluster that have 8 machines and all the 8 machines are data nodes.
There's a program running on one machine (say machine A) that continuously creates sequence files (each about 1 GB) in HDFS.
Here's the problem: all 8 machines have the same hardware and the same capacity, yet while the other machines still have about 50% free disk space for HDFS, machine A has only 5% left.
I checked the block info and found that almost every block has one replica on machine A.
Is there any way to balance the replicas?
Thanks.
This is the default placement policy: when the writer runs on a DataNode, the first replica of every block is placed on that local node, which is why machine A ends up holding a copy of almost every block. It works well for the typical M/R pattern, where each HDFS node is also a compute node and the writer machines are uniformly distributed.
If you don't like it, there is HDFS-385, a pluggable interface to place replicas of blocks in HDFS. You need to write a class that implements the BlockPlacementPolicy interface, and then set that class as dfs.block.replicator.classname in hdfs-site.xml.
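A sketch of the wiring, assuming com.example.MyBlockPlacementPolicy is a hypothetical class you have written and put on the NameNode's classpath:

<!-- hdfs-site.xml (sketch; the class name is hypothetical) -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>com.example.MyBlockPlacementPolicy</value>
</property>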
There is a way: you can use the Hadoop command-line balancer tool.
HDFS data might not always be placed uniformly across the DataNodes. To spread HDFS data uniformly across the DataNodes in the cluster, the balancer can be used:
hadoop balancer [-threshold <threshold>]
where threshold is a percentage of disk capacity.
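For example (the threshold value is only an illustration), the following keeps each DataNode's utilization within 5% of the cluster-wide average:

hadoop balancer -threshold 5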
see the following links for details:
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html#Rebalancer
