I have a small cluster with one node that has RAID storage, and several powerful diskless compute nodes that boot over PXE. All nodes are connected by InfiniBand (and 1G Ethernet for booting).
I need to deploy Hadoop on this cluster.
Please suggest an optimal configuration.
As I understand it, the default configuration assumes that every compute node has its own small local storage, but in my situation (if I use an NFS share) that would generate far too much copying over the network. I have found resources about using Hadoop with Lustre, but I do not understand how to configure it.
What you describe is probably possible, but instead of making use of Hadoop's features you are trying to find a way around them.
Moving computation is cheaper than moving data - data locality is one of the cornerstones of Hadoop and that's why all the worker nodes in the cluster are also storage nodes. Hadoop attempts to do as much computation as possible on the nodes where the processed blocks are located to avoid network congestion.
https://developer.yahoo.com/hadoop/tutorial/module1.html
The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system. Since files are spread across the distributed file system as chunks, each compute process running on a node operates on a subset of the data. Which data operated on by a node is chosen based on its locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving computation to the data, instead of moving the data to the computation allows Hadoop to achieve high data locality which in turn results in high performance.
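To make this concrete, here is a minimal sketch (the input path is a made-up example) that asks HDFS which hosts store each block of a file; this is the same placement information the scheduler consults when deciding where to run map tasks:

```java
// Minimal sketch: list the DataNodes that hold each block of a file.
// The path below is a hypothetical example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/input/file.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports the hosts that store a replica of it.
            System.out.println(block.getOffset() + " -> " + String.join(", ", block.getHosts()));
        }
    }
}
```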
MapReduce tends to generate large volumes of temporary files, so 15 GB per node is simply not enough storage.
Related
Data locality as defined by many Hadoop tutorial sites (e.g. https://techvidvan.com/tutorials/data-locality-in-hadoop-mapreduce/) states that: "Data locality in Hadoop is the process of moving the computation close to where the actual data resides instead of moving large data to computation. This minimizes overall network congestion."
I can understand having the node where the data resides process the computation for those data, instead of moving data around, would be efficient. However, what does it mean by "moving the computation close to where the actual data resides"? Does this mean that if the data sits in a server in Germany, it is better to use the server in France to do the computation on those data instead of using the server in Singapore to do the computation since France is closer to Germany than Singapore?
Typically people talk about this on a quite different scale, especially within a Hadoop context.
Suppose you have a cluster of 5 nodes, store a file there, and need to do a calculation on it.
With data locality you try to make the calculation happen on the node(s) where the data is stored (rather than for example the first node that has compute resources available).
This reduces network load.
It is good to realize that in many new infrastructures the network is not the bottleneck, so you will keep hearing more about the decoupling of compute and storage.
I +1 Dennis Jaheruddin's answer, and just wanted to add: you can actually see the different locality levels in MR when you check the job counters, for example in the Job History UI.
HDFS and YARN are rack-aware, so it is not just a binary same-node-or-other-node distinction: in those counters, Data-local means the task ran on the machine that contained the actual data; Rack-local means the data was not local to the node running the task and had to be copied, but was still on the same rack; and the Other-local case means the data was neither on the node nor on the rack, so it had to be copied over two switches to reach the node that ran the computation.
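The same counters can also be read programmatically rather than in the Job History UI; a minimal sketch, assuming you still hold the completed Job handle in your driver code:

```java
// Hedged sketch: print the locality counters of a completed MapReduce job.
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class LocalityCounters {
    static void printLocality(Job job) throws Exception {
        Counters counters = job.getCounters();
        // Tasks that ran on the node holding the data.
        System.out.println("Data-local maps:  "
                + counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue());
        // Tasks whose data was on another node in the same rack.
        System.out.println("Rack-local maps:  "
                + counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue());
        // Tasks whose data had to cross racks.
        System.out.println("Other-local maps: "
                + counters.findCounter(JobCounter.OTHER_LOCAL_MAPS).getValue());
    }
}
```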
What is the impact of reducing the HDFS replication factor to 2 (or even 1) on HBase map/reduce performance? I have an HBase cluster hosted on Azure VMs, with data stored on Azure managed disks. An Azure managed disk itself keeps 3 copies of the data for fault tolerance, so I am thinking of reducing the HDFS replication factor to save on storage overhead. Given that MapReduce jobs make use of local availability of the data to avoid transferring it over the network, I am wondering whether anyone has information on the impact on MapReduce performance if there is just one replica of the data available?
This is a difficult question to answer as it depends greatly on what workloads you run.
By decreasing the replication factor, you can speed up the performance of write operations, since the data is written to fewer DataNodes. However, as you noted, you may have decreased locality since it can be more difficult to find a node which has a replica and has free space to execute a task.
Keeping only a single replica also amplifies the impact of a single node failure: if a node dies, all of its data is unavailable until you start a new node with the same Azure managed disks attached. With multiple HDFS replicas, data availability is maintained throughout.
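For reference, a minimal sketch of how the replication factor is lowered (the paths are hypothetical); note that dfs.replication only applies to files written after the change, so existing files have to be updated explicitly:

```java
// Hedged sketch: lowering the HDFS replication factor.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "2");   // default for newly written files
        FileSystem fs = FileSystem.get(conf);
        // setReplication applies per file; to change a whole tree from the shell,
        // "hdfs dfs -setrep -R 2 /path" does it recursively.
        fs.setReplication(new Path("/hbase/data/somefile"), (short) 2);
    }
}
```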
Running HDFS DataNodes on top of Azure managed disks sounds like a bit of a bad idea. In addition to breaking some of the core HDFS assumptions ("my disk might fail at any time"), it seems unlikely that you have true data locality if your data is stored in three replicas. I wonder if you have considered:
Using a non-managed disk service. Does Azure provide a way to use a disk which is not replicated? This is much closer to how HDFS is intended to be used.
Storing data in Azure storage (WASB or ADLS) instead of HDFS. This is a more "cloud native" way of running things. If you find that performance is lacking, you can use HDFS for intermediate data and only store final data in Azure. HDFS also provides a way to cache data from external storage systems by using Provided Storage.
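As a rough illustration of the WASB option (the account, container, and key below are placeholders, and the hadoop-azure module must be on the classpath), final output can be written to Azure Blob storage while HDFS keeps only intermediate data:

```java
// Hedged sketch: accessing Azure Blob storage through the wasb:// scheme.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WasbExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Storage-account credentials; normally kept in core-site.xml or a credential provider.
        conf.set("fs.azure.account.key.myaccount.blob.core.windows.net", "<storage-account-key>");
        FileSystem wasb = FileSystem.get(
                URI.create("wasb://mycontainer@myaccount.blob.core.windows.net/"), conf);
        // Final job output can live here, while HDFS keeps only intermediate data.
        System.out.println(wasb.exists(new Path("/final-output")));
    }
}
```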
If I have 6 data nodes, is it faster to set the replication factor to 6 so all the data is replicated across all my nodes, allowing the cluster to split up queries (say, in Hive) without having to move data around? I believe that with a replication factor of 3, a 300 GB file put into HDFS is split across only 3 of the data nodes, and when all 6 nodes are needed for a query, data has to be moved to the other 3 nodes that don't hold it, causing slower responses. Is that accurate?
I understand what you mean; you are talking about data locality. Generally speaking, data locality can reduce run time because it saves the time spent transmitting blocks over the network. But in fact, if you do not enable "HDFS Short-Circuit Local Reads" (it is off by default, please visit here), a MapTask will still read its block over the TCP protocol, that is, over the network, even if the block and the MapTask are on the same node.
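For reference, these are the two properties involved in short-circuit reads; a minimal sketch, shown on a client Configuration here although they normally live in hdfs-site.xml on both the DataNodes and the clients:

```java
// Hedged sketch: the properties that enable HDFS short-circuit local reads.
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConfig {
    public static Configuration withShortCircuit() {
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        // UNIX domain socket shared by the DataNode and local readers.
        conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
        return conf;
    }
}
```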
Recently I worked on optimizing Hadoop and HDFS: we replaced the HDD disks with SSDs, but found that the effect was not good and the run time was not shorter, because the disk was not the bottleneck and the network load was not heavy. From the results we concluded that the CPU was the heavily loaded resource. If you want a clear picture of your Hadoop cluster, I advise you to use Ganglia to monitor it; it can help you analyse your cluster's bottleneck. Please see here.
Finally, Hadoop is a very large and complicated system: disk performance, CPU performance, network bandwidth, and parameter values are only some of the factors to consider. If you want to save time, there is much more work to do than just changing the replication factor.
Various websites (like Hortonworks) recommend not configuring RAID for HDFS setups, mainly for two reasons:
Speed is limited by the slowest disk (JBOD performs better).
Reliability
It is recommended to use RAID on NameNode.
But what about implementing RAID on each DataNode storage disk?
RAID is used for two purposes. Depending on the RAID configuration you can get:
Better performance: reading a file can be spread over multiple disks or different disks can be transparently used to read multiple files from the same file system.
Fault-tolerance: Data is replicated or stored using parity bits on multiple disks. If a disk fails, it can be recovered from another replica or recomputed using the parity bits.
HDFS has similar mechanisms built in software. HDFS splits files into chunks (so-called file blocks) which are replicated across multiple datanodes and stored on their local filesystems. Usually, datanodes have multiple disks which are individually mounted (JBOD). A datanode should distribute its file blocks across all its disks / local filesystems.
This ensures:
Fault-tolerance: If a disk or node goes down, other replicas are available on different data nodes and disks.
High sequential read/write performance: By splitting a file into multiple chunks and storing them on different nodes (and different disks), a file can be read in parallel by concurrently accessing multiple disks (on different nodes). Each disk can read data with its full bandwidth and its read operations do not interfere with other disks. If the cluster is well utilized all disks will be spinning at full speed delivering the maximum sequential read performance.
Since HDFS takes care of fault-tolerance and "striped" reading, there is no need to use RAID underneath HDFS. Using RAID will only be more expensive, offer less storage, and, depending on the concrete RAID configuration, also be slower.
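As a sketch of the JBOD layout described above (the mount points are made-up examples, and in practice this belongs in hdfs-site.xml), each individually mounted disk is simply listed in dfs.datanode.data.dir:

```java
// Hedged sketch: a DataNode spreads its blocks across every directory listed
// in dfs.datanode.data.dir, so each JBOD mount point is listed individually.
import org.apache.hadoop.conf.Configuration;

public class JbodDataDirs {
    public static Configuration dataDirs() {
        Configuration conf = new Configuration();
        conf.set("dfs.datanode.data.dir",
                "/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data");
        return conf;
    }
}
```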
Since the namenode is a single point of failure in HDFS, it warrants a more reliable hardware setup. That is why the use of RAID is recommended on namenodes.
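One common alternative (or complement) to RAID on the namenode is to let it write its metadata to several directories, for example a local disk plus an NFS mount; a minimal sketch with hypothetical paths, again normally placed in hdfs-site.xml:

```java
// Hedged sketch: the NameNode writes its metadata to every directory listed
// in dfs.namenode.name.dir, giving redundancy without RAID.
import org.apache.hadoop.conf.Configuration;

public class NameNodeDirs {
    public static Configuration nameDirs() {
        Configuration conf = new Configuration();
        conf.set("dfs.namenode.name.dir",
                "/mnt/local/hdfs/name,/mnt/nfs/hdfs/name");
        return conf;
    }
}
```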
RAID 0 on an enterprise server is a huge mistake. I sure would like to meet the person who designed this, because it makes no sense to an IT operations manager. If you configure any of your local server disks as RAID 0, you risk a long and painful recovery: if a single disk in a RAID 0 array fails, that RAID partition is destroyed, and it does not magically recover when the disk is replaced. Someone has to log on to the server, delete the old RAID partition and create a new one. That creates a lot of overhead at times when man-hours and work cycles are already stretched thin. An IT operations manager will either delay doing it because of higher-priority workloads or refuse to do it because they cannot spare people from more important work. Then it gets pushed off to another team, the politics begin, and wham, it gets pushed back to the server owner/customer. If you made a RAID 1 or SAN drive available instead, you could avoid that entire scenario.
Can anyone explain the major differences between HDFS and grid computing?
I think you have to replace HDFS with Hadoop in your question.
Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model: the MapReduce framework running on YARN (Yet Another Resource Negotiator).
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Grid Computing approach is based on distributing the work across a cluster of machines, which access a shared file system, hosted by a storage area network (SAN). This works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes.
HDFS is just a file system. Since you are comparing processing of data, you have to compare Grid Computing with Hadoop Map Reduce (YARN) instead of HDFS.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance.
You can refer to Hadoop: The Definitive Guide (4th edition) to understand the concepts better.
How is Hadoop different from other distributed systems?
Scales out
Proven technology
Low cost
Used by big players
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
but....
Grid computing is the collection of computer resources from multiple locations to reach a common goal. The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. Grid computing is distinguished from conventional high performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed (thus not physically coupled) than cluster computers. Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries.
I think HDFS is not relevant to grid computing, or perhaps it could be used within the large virtual supercomputer that a grid forms.