Hadoop Cluster Failover

I have some questions about Hadoop cluster datanode failover:
1: What happens if the link between the namenode and a datanode
(or between two datanodes) goes down while the Hadoop cluster is processing data?
Does the Hadoop cluster have any out-of-the-box (OOTB) way to recover from this problem?
2: What happens if a datanode goes down while the Hadoop cluster is processing
data?
Also, another question is about the Hadoop cluster hardware configuration. Let's say we will use our Hadoop cluster to process 100GB of log files each day; how many datanodes do we need to set up, and what hardware configuration (e.g. CPU, RAM, hard disk) for each datanode?

1: What happens if the link between the namenode and a datanode
(or between two datanodes) goes down while the Hadoop cluster is processing data?
Does the Hadoop cluster have any out-of-the-box (OOTB) way to recover from this problem?
The NN will stop receiving heartbeats from that node and will therefore consider it dead. In that case, the tasks running on that node will be rescheduled on some other node that holds a replica of the data.
2: What happens if a datanode goes down while the Hadoop cluster is processing
data?
Same as above.
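For reference, dead-node detection is driven by heartbeat settings in hdfs-site.xml. A minimal sketch, assuming typical Hadoop 2.x property names and default values (check your version's hdfs-default.xml):
<!-- hdfs-site.xml: settings that control how quickly a datanode is declared dead -->
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value> <!-- datanodes heartbeat to the NN every 3 seconds -->
</property>
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value> <!-- NN recheck interval in milliseconds (5 minutes) -->
</property>
With these defaults the NN marks a datanode dead after roughly 2 * recheck-interval + 10 * heartbeat interval (about 10 minutes 30 seconds), after which its blocks are re-replicated and its tasks rescheduled elsewhere.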
For the second part of your question :
It totally depends on your data, the kind of processing you are going to perform, and a few other things. 100GB is not a suitable candidate for MR processing in the first place. But if you still need it, any decent machine would be sufficient to process 100GB of data.
As a thumb rule you can consider (a rough worked example follows the list):
RAM : 1GB of RAM for every 1 million HDFS blocks, plus some additional for other things.
CPU : Based totally on your needs.
Disk : 3 times your data size (if replication factor = 3), plus some additional space for stuff like temporary files, other apps etc. JBOD is preferable.
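As a rough illustration only, applying the disk thumb rule to the 100GB/day figure from the question (the retention period and overhead percentage are assumptions made for the example):
100GB/day * 3 replicas            = 300GB of raw HDFS capacity per day
300GB/day * 30 days of retention  = ~9TB
~9TB + ~25% for temp/shuffle/etc. = ~11-12TB of usable disk across the datanodes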
Frankly speaking, the process is a lot more involved. I would strongly suggest going through this link in order to get a proper idea.
I would start with a cluster with 5 machines:
1 * Master (NN+JT) -
Disk : 3 * 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image)
CPU : 2 quad core CPUs, running at least 2-2.5GHz
RAM : 32GB of RAM
3 * Slaves (DN+TT) -
Disk : 3 * 2TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
CPU : 2 quad core CPUs, running at least 2-2.5GHz
RAM : 16GB of RAM
1 * SNN -
I would keep it the same as the master machine.
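To make the namenode actually use the two non-OS disks for the FS image, the metadata directory can be given as a comma-separated list in hdfs-site.xml. A minimal sketch, where the mount paths are assumptions (the property is the Hadoop 2.x name; 1.x releases call it dfs.name.dir):
<!-- hdfs-site.xml on the master: keep a redundant copy of the FS image on each dedicated disk -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/disk1/hdfs/name,/disk2/hdfs/name</value>
</property>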

Depending on whether the namenode or a datanode is down, the work will be rerouted to different machines. HDFS was specifically designed for this. Yes, it's definitely out of the box.
If there are other datanodes available, the job is transferred to them.
100GB is not large enough to justify using Hadoop. Don't use Hadoop unless you absolutely need to.

Related

Memory for Namenode(s) in Hadoop

Environment: The production cluster has 2 namenodes (active and standby) and those nodes use SAS drives in a RAID-1 configuration. These nodes run nothing but the master services (NN and Standby NN). They have 256GB of RAM, while the datanodes (where most of the processing happens) are set up with only 128GB.
My Question: Why do Hadoop's master nodes have such high RAM, and not the datanodes, when most of the processing is done where the data is located?
P.S. As per the Hadoop thumb rule, we only require 1GB for every 1 million files.
The NameNode stores the references to every file and block on all the datanodes in memory.
The DataNode process doesn't need much memory; it's the YARN NodeManagers that do.
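As a rough illustration of that thumb rule (the object count here is an assumption for the example, not a figure from the question):
1 million files/blocks   ≈ 1GB of NameNode heap (the thumb rule above)
100 million files/blocks ≈ 100GB of NameNode heap
A 256GB master therefore leaves comfortable headroom for a large namespace, while the datanodes never hold this metadata; their RAM mainly needs to cover the YARN containers they run.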

Spark streaming application configuration with YARN

I'm trying to squeeze every single bit from my cluster when configuring the Spark application, but it seems I'm not understanding everything completely right. I'm running the application on an AWS EMR cluster with 1 master and 2 core nodes of type m3.xlarge (15GB RAM and 4 vCPUs per node). This means that by default 11.25GB are reserved on every node for applications scheduled by YARN. The master node is used only by the resource manager (YARN), which means the remaining 2 core nodes will be used to schedule applications (so we have 22.5GB for that purpose). So far so good. But here comes the part which I don't get. I'm starting the Spark application with the following parameters:
--driver-memory 4G --num-executors 4 --executor-cores 7 --executor-memory 4G
My understanding (from what I have found as information) is that 4GB will be allocated for the driver and 4 executors will be launched with 4GB each. A rough estimate makes it 5*4 = 20GB (let's make it 21GB with the expected memory overhead), which should be fine as we have 22.5GB for applications. Here's a screenshot from the Hadoop YARN UI after the launch:
What we can see is that 17.63GB are used by the application, which is a little less than the expected ~21GB, and this triggers the first question: what happened here?
Then I go to the Spark UI's executors page. Here comes the bigger question:
There are 3 executors (not 4), and the memory allocated for each of them and the driver is 2.1GB (not the specified 4GB). So Hadoop YARN says 17.63GB are used, but Spark says 8.4GB are allocated. What is happening here? Is this related to the Capacity Scheduler (from the documentation I couldn't come to this conclusion)?
Can you check whether spark.dynamicAllocation.enabled is turned on? If it is, your application may give resources back to the cluster when they are no longer used. The minimum number of executors launched at startup is decided by spark.executor.instances.
If that is not the case, what is the source for your Spark application and what partition size is set for it? Spark will literally map the partition count to the Spark cores: if your source has only 10 partitions and you try to allocate 15 cores, it will only use 10 cores because that is all that is needed. I guess this might be why Spark launched 3 executors instead of 4. Regarding memory, I would recommend revisiting the numbers: you are asking for 4 executors and 1 driver with 4GB each, which is roughly 5*4GB + 5*384MB ≈ 22GB. You are trying to use up everything, and not much is left for the OS and the NodeManager to run, which is not ideal.
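If you want to rule dynamic allocation out, a minimal sketch of the launch, reusing the values from the question (the application jar name is hypothetical):
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --driver-memory 4G \
  --num-executors 4 \
  --executor-cores 7 \
  --executor-memory 4G \
  my-streaming-app.jar
With dynamic allocation disabled, --num-executors (i.e. spark.executor.instances) is a fixed request rather than a minimum, so if YARN still brings up fewer executors the limit is coming from the cluster's capacity rather than from Spark giving resources back.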

HDP cluster with RAID?

What is your experience with RAID1 on HDP cluster?
I have in my mind two options:
Set up RAID 1 for master and ZooKeeper nodes, and don't use RAID at all on slave nodes like Kafka brokers, HBase regionservers and YARN NodeManagers.
Even if I lose one slave node, I will still have two other replicas.
In my opinion, RAID will only slow down my cluster.
Despite everything, set up everything using RAID 1.
What do you think about it? What is your experience with HDP and RAID?
What do you think about using RAID 0 for slave nodes?
I'd recommend no RAID at all on Hadoop hosts. There is one caveat: if you are running services like Oozie and the Hive metastore that use a relational DB behind the scenes, RAID may well make sense on the DB host.
On a master node, assuming you have Namenode, zookeeper etc - generally the redundancy is built into the service. For namenodes, all the data is stored on both namenodes. For Zookeeper, if you lose one node, then the other two nodes have all the information.
Zookeeper likes fast disks - ideally dedicate a full disk to zookeeper. If you have namenode HA, give the namenode edits directory and each journal node a dedicated disk too.
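A minimal sketch of what those dedicated disks might look like in configuration, assuming hypothetical mount points (/zk-disk, /nn-disk, /jn-disk):
# zoo.cfg: put the ZooKeeper snapshots and transaction logs on their own disk
dataDir=/zk-disk/zookeeper

<!-- hdfs-site.xml: dedicated disks for the NameNode edits and the JournalNode data -->
<property>
  <name>dfs.namenode.edits.dir</name>
  <value>/nn-disk/hdfs/edits</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/jn-disk/hdfs/journal</value>
</property>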
For the slave nodes, the datanode will write across all disks, effectively striping the data anyway. Each 'write' is at most the HDFS block size, so if you were writing a large file, you could get 128MB on disk 1, then the next 128MB on disk 2 etc.

Hadoop multi node cluster

I am a newbie to Hadoop. Please correct me if I am asking nonsense and help me to solve this problem :).
I installed and configured a two-node Hadoop cluster (YARN).
Master node : 2TB HDD, 4GB RAM
Slave node : 500GB HDD, 4GB RAM
Datanode:
Master node only (not keeping replicated data on the slave node)
Map/Reduce :
Master node & Slave node.
Out of 10TB of data, I uploaded 2TB to the master node (datanode). I am using the slave node for Map/Reduce only (to use 100% of the slave node's CPU for running queries).
My questions:
If I add a new 2TB HDD to the master node and I want to upload 2TB more to it, how can I use both HDDs (the data on the old HDD and the new HDD on the master)? Is there any way to give multiple HDD paths in hdfs-site.xml?
Do I need to add a 4TB HDD to the slave node (with all the data on the master) to use 100% of the slave's CPU? Or can the slave access data from the master and run Map/Reduce jobs?
If I add 4TB to the slave and upload data to Hadoop, will that create any replication on the master (duplicates)? Can I access all the data on the primary HDD of the master and the primary HDD of the slave? Do the queries use 100% of the CPU of both nodes if I do this?
As a whole, if I have 10TB of data, what is the proper way to configure a two-node Hadoop cluster? What specification (for the master and the datanode) should I use to run the Hive queries fast?
I got stuck. I really need your suggestions and help.
Thanks a ton in advance.
Please find the answers below:
Provide a comma-separated list of directories in hdfs-site.xml (see the sketch after this list). Source: https://www.safaribooksonline.com/library/view/hadoop-mapreduce-cookbook/9781849517287/ch02s05.html
No, you don't need to add an HDD on the slave to use 100% of its CPU. Under the current configuration the node manager running on the slave will read data from the datanode running on the master (over the network). This is not efficient in terms of data locality, but it doesn't affect the processing throughput; it will add additional latency due to the network transfer.
No. The replication factor (the number of copies to be stored) is independent of the number of datanodes. The default replication factor can be changed in hdfs-site.xml using the property dfs.replication (also shown in the sketch below). You can also configure this on a per-file basis.
You will need at least 10TB of storage across your cluster (all the datanodes combined, with replication factor 1). For a production system I would recommend a replication factor of 3 (to handle node failure), that is 10*3 = 30TB of storage across at least 3 nodes. Since 10TB is still fairly small in Hadoop terms, have 3 nodes each with a 2 or 4 core processor and 4 to 8GB of memory. Configure this as - node1: name node + data node + node manager, node2: resource manager + data node + node manager, node3: data node + node manager.
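A minimal hdfs-site.xml sketch covering points 1 and 3 above, with hypothetical mount points for the old and new HDDs (the property names are the Hadoop 2.x ones; 1.x releases use dfs.data.dir):
<!-- hdfs-site.xml on the master: list every HDD the datanode should use, comma-separated -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/old-hdd/hdfs/data,/new-hdd/hdfs/data</value>
</property>
<!-- default number of copies kept for each block; 1 matches the current single-copy setup, 3 is the production recommendation above -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
After adding the new directory, restart the datanode so it starts writing to the new disk.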

HADOOP HDFS imbalance issue

I have a Hadoop cluster with 8 machines, and all 8 machines are datanodes.
There's a program running on one machine (say machine A) that continuously creates sequence files (each file is about 1GB) in HDFS.
Here's the problem: all 8 machines have the same hardware and the same capacity, yet while the other machines still have about 50% free space on their HDFS disks, machine A has only 5% left.
I checked the block info and found that almost every block has one replica on machine A.
Is there any way to balance the replicas?
Thanks.
This is the default placement policy: the first replica of every block is written to the local datanode of the machine doing the writing, which is why machine A holds a copy of everything. It works well for the typical M/R pattern, where each HDFS node is also a compute node and the writer machines are uniformly distributed.
If you don't like it, there is HDFS-385 (Design a pluggable interface to place replicas of blocks in HDFS). You need to write a class that implements the BlockPlacementPolicy interface, and then set this class as dfs.block.replicator.classname in hdfs-site.xml.
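A minimal sketch of that wiring, where the class name is a hypothetical custom implementation:
<!-- hdfs-site.xml: plug in a custom block placement policy -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>com.example.MyBlockPlacementPolicy</value>
</property>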
There is a way: you can use the Hadoop command-line balancer tool.
HDFS data might not always be placed uniformly across the DataNodes. To spread HDFS data uniformly across the DataNodes in the cluster, this tool can be used:
hadoop balancer [-threshold <threshold>]
where threshold is a percentage of disk capacity.
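For example (the threshold value here is just an illustrative choice):
hadoop balancer -threshold 10
This moves blocks between datanodes until each datanode's disk utilization is within 10 percentage points of the cluster-wide average.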
see the following links for details:
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html#Rebalancer
