Why is RAID not recommended for Hadoop HDFS setups? - hadoop

Various websites (like Hortonworks) recommend against configuring RAID for HDFS setups, mainly for two reasons:
Speed is limited to that of the slowest disk (JBOD performs better).
Reliability
It is recommended to use RAID on the NameNode.
But what about implementing RAID on each DataNode storage disk?

RAID is used for two purposes. Depending on the RAID configuration you can get:
Better performance: reading a file can be spread over multiple disks or different disks can be transparently used to read multiple files from the same file system.
Fault-tolerance: Data is replicated or stored using parity bits on multiple disks. If a disk fails, it can be recovered from another replica or recomputed using the parity bits.
HDFS has similar mechanisms built in software. HDFS splits files into chunks (so-called file blocks) which are replicated across multiple datanodes and stored on their local filesystems. Usually, datanodes have multiple disks which are individually mounted (JBOD). A datanode should distribute its file blocks across all its disks / local filesystems.
This ensures:
Fault-tolerance: If a disk or node goes down, other replicas are available on different data nodes and disks.
High sequential read/write performance: By splitting a file into multiple chunks and storing them on different nodes (and different disks), a file can be read in parallel by concurrently accessing multiple disks (on different nodes). Each disk can read data with its full bandwidth and its read operations do not interfere with other disks. If the cluster is well utilized all disks will be spinning at full speed delivering the maximum sequential read performance.
Since HDFS is taking care of fault-tolerance and "striped" reading, there is no need to use RAID underneath an HDFS. Using RAID will only be more expensive, offer less storage, and also be slower (depending on the concrete RAID config).
Since the namenode is a single-point-of-failure in HDFS, it requires a more reliable hardware setup. Therefore, the use of RAID is recommended on namenodes.
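The "speed limited to the slowest disk" point from the question can be illustrated with a deliberately simplified model. The sketch below assumes a striped RAID0 array runs every member at the pace of its slowest disk, while independently mounted JBOD disks each deliver their own bandwidth when all of them are busy. The per-disk speeds are made-up illustrative numbers, not measurements.

```python
def raid0_throughput(disk_mb_s):
    """Striped array: every disk is held to the pace of the slowest member."""
    return len(disk_mb_s) * min(disk_mb_s)


def jbod_throughput(disk_mb_s):
    """Independent disks: each one contributes its own full bandwidth."""
    return sum(disk_mb_s)


# Hypothetical sequential speeds in MB/s; one disk is ageing and slow.
disks = [120, 110, 100, 60]

print("RAID0:", raid0_throughput(disks), "MB/s")
print("JBOD: ", jbod_throughput(disks), "MB/s")
```

The model ignores controller caches, stripe sizes and seek behaviour, so it only shows the direction of the effect, not its real magnitude.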

RAID0 on an enterprise server is a huge mistake. I sure would like to meet the person that designed this. This makes no sense to an IT operations manager. If you configure any of your local server disks with RAID0, you risk a long and painful RAID0 recovery. If a single disk in a RAID0 fails, that RAID partition is destroyed, and it doesn't magically recover when the disk is replaced. Someone has to log on to the server, delete the old RAID partition and create a new one. This creates a lot of overhead at a time when demands on man hours and work cycles are at an all-time high. An IT operations manager is either going to delay doing this because of higher-priority workload or refuse to do it because they don't have enough cycles to pull people away from more important work. Then it's going to get pushed off to another team. Then the politics begin and, wham, it gets pushed back to the server owner/customer. If you made a RAID1 or SAN drive available instead, you could avoid that entire scenario.

Related

Impact of reducing HDFS replication factor to 2 (or just one) on HBase map/reduce performance

What is the impact of reducing the HDFS replication factor to 2 (or just one) on HBase map/reduce performance? I have an HBase cluster hosted on Azure VMs with data stored in Azure managed disks. An Azure managed disk itself keeps 3 copies of the data for fault tolerance, so I am thinking of reducing the HDFS replication factor to save on storage overhead. Given that map reduce jobs make use of local availability of the data to avoid data transfer over the network, I am wondering whether anyone has any information on the impact on map reduce performance if there is just one replica of the data available?
This is a difficult question to answer as it depends greatly on what workloads you run.
By decreasing the replication factor, you can speed up the performance of write operations, since the data is written to fewer DataNodes. However, as you noted, you may have decreased locality since it can be more difficult to find a node which has a replica and has free space to execute a task.
Keeping only a single replica can have strong implications on the impact of a single node failure. If a single node dies, all of its data will be unavailable until you restart a new node with the same Azure managed disks. If there are multiple HDFS replicas, data availability is maintained throughout.
Running HDFS DataNodes on top of Azure managed disks sounds like a bit of a bad idea. In addition to breaking some of the core HDFS assumptions ("my disk might fail at any time"), it seems unlikely that you have true data locality if your data is stored in three replicas. I wonder if you have considered:
Using a non-managed disk service. Does Azure provide a way to use a disk which is not replicated? This is much closer to how HDFS is intended to be used.
Storing data in Azure storage (WASB or ADLS) instead of HDFS. This is a more "cloud native" way of running things. If you find that performance is lacking, you can use HDFS for intermediate data and only store final data in Azure. HDFS also provides a way to cache data from external storage systems by using Provided Storage.
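The storage overhead that motivates the question is easy to put into numbers. The sketch below assumes, as the question states, that each managed disk keeps 3 copies of whatever is written to it, and simply multiplies that by the HDFS replication factor. The function and parameter names are illustrative only.

```python
def physical_copies(hdfs_replication, copies_per_disk=3):
    """Copies of each block that exist once both layers have replicated it.

    copies_per_disk=3 is the figure quoted in the question for Azure
    managed disks; adjust it for a different storage layer.
    """
    return hdfs_replication * copies_per_disk


for replication in (3, 2, 1):
    print(f"HDFS replication {replication}: "
          f"{physical_copies(replication)} physical copies per block")
```

This covers storage cost only. It says nothing about availability: with a single HDFS replica, the data is still reachable through only one DataNode at a time.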

hadoop (HDFS) with diskless compute nodes

I have a small cluster with one node that has RAID storage, and several powerful diskless compute nodes that boot over PXE. All nodes are connected by InfiniBand (and 1G Ethernet for booting).
I need to deploy Hadoop on this cluster.
Please suggest optimal configuration
As I understand it, the default configuration assumes that each compute node has its own small local storage, but in my situation (if I use an NFS share) it will make too many copies over the network. I have found resources about using Hadoop with Lustre, but I do not understand how to configure it.
What you describe is probably possible but - instead of making use of Hadoop features - you are trying to find a way around them.
Moving computation is cheaper than moving data - data locality is one of the cornerstones of Hadoop and that's why all the worker nodes in the cluster are also storage nodes. Hadoop attempts to do as much computation as possible on the nodes where the processed blocks are located to avoid network congestion.
https://developer.yahoo.com/hadoop/tutorial/module1.html
The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system. Since files are spread across the distributed file system as chunks, each compute process running on a node operates on a subset of the data. Which data operated on by a node is chosen based on its locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving computation to the data, instead of moving the data to the computation allows Hadoop to achieve high data locality which in turn results in high performance.
MapReduce tends to generate large volumes of temporary files, so 15 GB per node is simply not enough storage.

How Hadoop writes on hard drives of each data node?

For each data node, is it better to have four hard drives of 500GB each or a single 2TB drive? In other words, does a data node write to its hard drives in parallel or not?
It does not read/write the same single block in parallel. However, it does read/write several blocks in parallel. That is, if you are just writing one file, you won't see any difference... but if you are running a MapReduce job with several tasks per node (typical), you will benefit from the additional throughput.
There are other considerations than 500GB v. 2TB. Physical space in the nodes, cost, heat/cooling, etc. For example, if you fill a box with four times as many drives, do your nodes need to be 2U instead of 1U with 2TB? But if you are just talking about performance I'd take 4x 500GB over 1x 2 TB any day.
Keeping cooling/power and other aspects out of consideration, multiple HDDs provide better R/W throughput than a single HDD of the same total capacity. Since we are talking about Big Data, this makes much more sense. Also, multiple HDDs provide better fault tolerance than a single larger HDD.
Check this blog about the general h/w recommendations.
If you have 4 disks mounted as /disk1, /disk2, /disk3 and /disk4 for a datanode, it usually uses round robin to write to those disks. It's usually a better approach to have multiple disks, since when Hadoop reads distinct blocks from separate disks concurrently, it is not limited by the I/O capability of a single disk.
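The round-robin behaviour described above can be sketched as a toy model. This is not the DataNode's actual code: the class name and the block IDs are hypothetical, and a real volume chooser also has to account for things like free space on each disk.

```python
from itertools import cycle


class RoundRobinVolumeChooser:
    """Toy model: hand out data directories in a fixed rotating order."""

    def __init__(self, volumes):
        if not volumes:
            raise ValueError("at least one volume is required")
        self._next_volume = cycle(volumes)

    def choose(self):
        # Each call returns the next directory, wrapping around at the end.
        return next(self._next_volume)


chooser = RoundRobinVolumeChooser(["/disk1", "/disk2", "/disk3", "/disk4"])

# Hypothetical block IDs; each new block lands on the next disk in turn.
placement = {f"blk_{i}": chooser.choose() for i in range(6)}
for block, volume in placement.items():
    print(block, "->", volume)
```

Because consecutive blocks end up on different disks, tasks reading different blocks at the same time tend to hit different spindles.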

Hadoop put performance - large file (20gb)

I'm using hdfs -put to load a large 20GB file into hdfs. Currently the process runs ~4mins. I'm trying to improve the write time of loading data into hdfs. I tried utilizing different block sizes to improve write speed but got the below results:
512M blocksize = 4mins;
256M blocksize = 4mins;
128M blocksize = 4mins;
64M blocksize = 4mins;
Does anyone know what the bottleneck could be and other options I could explore to improve performance of the -put cmd?
20GB / 4minute comes out to about 85MB/sec. That's pretty reasonable throughput to expect from a single drive with all the overhead of HDFS protocol and network. I'm betting that is your bottleneck. Without changing your ingest process, you're not going to be able to make this magically faster.
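The arithmetic behind that figure, and how it compares to the nominal capacity of common network links, can be checked with a few lines. This is a back-of-the-envelope sketch: it counts 1GB as 1024MB, and the link ceilings are raw wire rates that ignore protocol overhead.

```python
def throughput_mb_s(size_gb, seconds, mb_per_gb=1024):
    """Average transfer rate in MB/s for size_gb moved in the given time."""
    return size_gb * mb_per_gb / seconds


def link_capacity_mb_s(link_mbit_s):
    """Raw wire rate of a link in MB/s (8 bits per byte, no overhead)."""
    return link_mbit_s / 8


rate = throughput_mb_s(20, 4 * 60)
print(f"observed rate: {rate:.1f} MB/s")

for name, mbit in [("1GigE", 1000), ("10GigE", 10000)]:
    ceiling = link_capacity_mb_s(mbit)
    print(f"{name}: ceiling {ceiling:.0f} MB/s, "
          f"observed rate uses {rate / ceiling:.0%}")
```

With decimal units (1GB = 1000MB) the same transfer works out to roughly 83 MB/s, so the conclusion does not depend on the convention.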
The core problem is that 20GB is a decent amount of data and that data is getting pushed into HDFS as a single stream. You are limited by disk I/O, which is pretty lame given you have a large number of disks in a Hadoop cluster. You've got a while to go to saturate a 10GigE network (and probably a 1GigE, too).
Changing block size shouldn't change this behavior, as you saw. It's still the same amount of data off disk into HDFS.
I suggest you split the file up into 1GB files and spread them over multiple disks, then push them up with -put in parallel. You might even want to consider splitting these files over multiple nodes if network becomes a bottleneck. Can you change the way you receive your data to make this faster? Obviously, splitting the file and moving it around will take time, too.
It depends a lot on the details of your setup. First, know that 20GB in 4 mins is about 83MBps (about 85MBps if you count 1GB as 1024MB).
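A minimal sketch of the parallel upload could look like the following. It assumes the file has already been split into parts (for example with the standard split tool), that the hdfs command is on the PATH, and that the part names and target directory are placeholders. The runner parameter is an addition made here so the logic can be exercised without a cluster.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_put_command(local_part, hdfs_dir):
    """Command line that uploads one local part into an HDFS directory."""
    return ["hdfs", "dfs", "-put", local_part, hdfs_dir]


def put_in_parallel(local_parts, hdfs_dir, workers=4, runner=subprocess.run):
    """Upload all parts concurrently, one 'hdfs dfs -put' process per part."""
    commands = [build_put_command(part, hdfs_dir) for part in local_parts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every upload and re-raises the first failure;
        # check=True turns a non-zero exit code into an exception.
        list(pool.map(lambda cmd: runner(cmd, check=True), commands))
    return commands
```

A call such as put_in_parallel(["part-00", "part-01"], "/data/in") would then start two uploads at once. Whether this helps depends on the parts sitting on different local disks; reading them all from one drive just moves the contention.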
The bottleneck is most likely your local machine's hardware or its ethernet connection. I doubt playing with block size will improve your throughput by much.
If your local machine has a typical 7200rpm hard drive, its disk to buffer transfer rate is about 128MBps, meaning that it could load that 20GB file into memory in about 2:35, assuming you have 20GB to spare. However, you're not just copying it to memory, you're streaming it from memory to network packets, so it's understandable that you incur an additional overhead for processing these tasks.
Also see the wikipedia entry on wire speed, which puts a fast ethernet setup at 100Mbit/s (~12MB/s). Note that in this case fast ethernet is a term for a particular group of ethernet standards. You are clearly getting a faster rate than this. Wire speed is a good measure, because it accounts for all the factors on your local machine.
So let's break down the different steps in the streaming process on your local machine:
Read a chunk from file and load it into memory. Components: hard drive, memory
Split and translate that chunk into packets. Last I heard Hadoop doesn't use DMA features out of the box, so these operations will be performed by your CPU rather than the NIC. Components: Memory, CPU
Transmit packets to hadoop file servers. Components: NIC, Network
Without knowing more about your local machine, it is hard to specify which of these components is the bottleneck. However, these are the places to start investigating bitrate.
You may want to use distcp to perform a parallel copy:
hadoop distcp -Ddfs.block.size=$((256*1024*1024)) /path/to/inputdata /path/to/outputdata

Would HBase/HDFS deployment make sense with 100mbit/s network interfaces?

I guess that a 100Mbit/s network interface will be a bottleneck for HDFS and slow down HBase on top of it (max compaction speed about 10MB/s, etc.). Would this deployment make sense?
I am thinking that now that SSDs have come into play, even 1Gbit/s network interfaces can still be a bottleneck, so maybe building a cluster with 100Mbit/s should never be considered (even for HDDs)?
To keep it short:
You should never use an SSD in HDFS; flash memory has a limited number of writes. HDFS has many writes, mainly because of the replication. If you are using HBase as a NoSQL DB, this will result in even more writes.
The bottlenecks are, as you said, the hard disk and the network. The network is an even bigger bottleneck because you are distributing the data, so it has to be replicated, and when you are running jobs, data may have to be copied if it is not locally available (reducers have to copy a lot).
So you should definitely go for a better network than 10Mbit or 100Mbit. That applies to your switch and the NICs on the nodes.
An HDD RAID will not result in higher write bandwidth; there were several benchmarks that prove that. Have a look at the HDFS Wiki, it must be described there.
A 100Mbit network is not likely to be a good setup for a Hadoop cluster; you can see Cisco's presentation from Hadoop World for some analysis of network usage. That said, depending on your actual load and cluster size it might be workable, though you might want to make sure you actually need Hadoop if that is the case.
Regarding SSDs, they cost more per MB, and depending on your write load you may have to replace them sooner than HDDs, but they will save you electricity. I guess it wouldn't be cost effective to use them in a large cluster (I don't know of anyone who did).
You can use SSDs for some of the disks e.g. for the temporary space on the cluster (such as map/reduce intermediate results) to get the IO benefits
Whether or not your network will be the bottleneck depends on the kinds of jobs you are running. If you do text processing (e.g. running Stanford NER or coreference suite), then a 100Mbit/s network will be the least of your concerns. However, if you are doing a lot of I/O intensive processing (most jobs with big reduce steps), then it will be. As always, it depends on your workload. But, I think it is safe to say that a 100Mb network is the most likely culprit for a bottleneck given recent processors and nodes with several disks.
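To get a feel for the numbers, the sketch below estimates how long it takes to move one HDFS block across links of different speeds. It assumes a 128 MB block and uses the raw wire rate, so real transfers, with protocol overhead and competing traffic, will be slower.

```python
def transfer_seconds(size_mb, link_mbit_s):
    """Best-case time to move size_mb megabytes over a link (8 bits/byte)."""
    return size_mb * 8 / link_mbit_s


BLOCK_MB = 128  # assumed block size; adjust to match the cluster

for name, mbit in [("100Mbit/s", 100), ("1Gbit/s", 1000), ("10Gbit/s", 10000)]:
    seconds = transfer_seconds(BLOCK_MB, mbit)
    print(f"{name}: {seconds:.2f} s per {BLOCK_MB} MB block")
```

Every non-local read and every extra replica written pays this cost per block, which is why shuffle-heavy jobs feel a slow network first.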
