Would HBase/HDFS deployment make sense with 100mbit/s network interfaces? - hadoop

I guess that 100Mbit/s network interface will be bottle neck for HDFS and slow down HBase on top of it (max compactions speed about 10MB/s, etc.). Would this deployment make sense?
I am thinking that "now" when when SSD comes in to game even 1Gbit/s network interfeces still can be bottleneck, so maybe building a cluster with 100Mbit/s should never be taken into account (even for HDD)?

To keep it short:
You should never use a SSD in HDFS, these flash memorys have a limited number of writes. HDFS has many writes, that's mainly because of the replication. If you are using HBase as a NoSQL DB this will result in even more writes.
The bottlenecks are as you said the harddisk and the network. Network is an even higher bottleneck because you are distributing the data, so it has to be replicated and if you are running jobs, they could be copied if the data is not locally available (Reducers have to copy much stuff).
So you should definitely for a better network than 10Mbit or 100Mbit. That implies your switch and the NICs on the nodes.
A hdd raid will not result in a higher bandwidth in writing, there were several benchmarks that proof that. Have a look at the HDFS Wiki, it must be described there.

100MB network is not likely to be a good setup for an hadoop cluster you can see cisco's presentation from Hadoop World for some analysis of network usage. That said depending on your actual load and cluster size it might be workable - though you might want to make sure you actually need Hadoop if that is the case.
regarding SSDs they cost more per MB and depending on your write load you may have to replace them sooner than HDDs but they will save you electricity - I guess it wouldn't be cost effective to use them in a large cluster (I don't know of anyone who did)
You can use SSDs for some of the disks e.g. for the temporary space on the cluster (such as map/reduce intermediate results) to get the IO benefits

Whether or not your network will be the bottleneck depends on the kinds of jobs you are running. If you do text processing (e.g. running Stanford NER or coreference suite), then a 100Mbit/s network will be the least of your concerns. However, if you are doing a lot of I/O intensive processing (most jobs with big reduce steps), then it will be. As always, it depends on your workload. But, I think it is safe to say that a 100Mb network is the most likely culprit for a bottleneck given recent processors and nodes with several disks.

Related

Is it faster to replicate your data in hdfs for all your nodes?

If I have 6 data nodes, is it faster to turn replication to 6 so all the data is replicated across all my nodes so the cluster can split up queries (say in hive) without having to move data around? I believe that if you have a replication of 3 and you put a 300GB file into HDFS, it splits it just across 3 of the data nodes and then when the 6 nodes need to be used for a query it has to move data around to the other 3 nodes that the data doesn't exist on, causing slower responses.. is that accurate?
I understand your means, you are talking about the data-locality. Generally speaking, the data-locality can reduce the run time, because it can save the time that block transmission by network. But in fact, if you don't open the "HDFS Short-Circuit Local Reads"(default it is off, please visit here), the MapTask will also read the block by the TCP protocol, it means by network, even if block and MapTask both on the same node.
Recently, I optimize hadoop and HDFS, we use SSD to instead the HDD disk, but we found the effect is not good and time is not shorter.Because the disk is not the bottleneck and network load is not heavy. According to the result, we conclude the cpu is very heavy. If you want you know the hadoop cluster situation clearly, I advise you to use ganglia to monitoring the cluster, it can help you to analysis your cluster bottleneck.please see here.
At last, hadoop is a very large and complicated system, the disk performance, cpu performance, network bandwidth, parameters values and also, there are many factor to consider. If you want to save time, you have much work to do, not just the replication factor.

Why is RAID not recommended for Hadoop HDFS setups?

Various websites (like Hortonworks) recommend to not configure RAID for HDFS setups mainly because of two reasons:
Speed limited to slower disk (JBOD performs better).
Reliability
It is recommended to use RAID on NameNode.
But what about implementing RAID on each DataNode storage disk?
RAID is used for two purposes. Depending on the RAID configuration you can get:
Better performance: reading a file can be spread over multiple disks or different disks can be transparently used to read multiple files from the same file system.
Fault-tolerance: Data is replicated or stored using parity bits on multiple disks. If a disk fails, it can be recovered from another replica or recomputed using the parity bits.
HDFS has similar mechanisms built in software. HDFS splits files into chunks (so-called file blocks) which are replicated across multiple datanodes and stored on their local filesystems. Usually, datanodes have multiple disks which are individually mounted (JBOD). A datanode should distribute its file blocks across all its disks / local filesystems.
This ensures:
Fault-tolerance: If a disk or node goes down, other replicas are available on different data nodes and disks.
High sequential read/write performance: By splitting a file into multiple chunks and storing them on different nodes (and different disks), a file can be read in parallel by concurrently accessing multiple disks (on different nodes). Each disk can read data with its full bandwidth and its read operations do not interfere with other disks. If the cluster is well utilized all disks will be spinning at full speed delivering the maximum sequential read performance.
Since HDFS is taking care of fault-tolerance and "striped" reading, there is no need to use RAID underneath an HDFS. Using RAID will only be more expensive, offer less storage, and also be slower (depending on the concrete RAID config).
Since the namenode is a single-point-of-failure in HDFS, it requires a more reliable hardware setup. Therefore, the use of RAID is recommended on namenodes.
RAID0 on and enterprise server is a huge mistake. I sure would like to meet the person that designed this. This makes no common sense to an IT operations manager. If you configure any of your local server disk with a RAID0 you risk a long and painful RAID0 recovery. If a single disk in a RAID0 fails that RAID partition becomes destroyed and it doesn't magically recover when the disk is replaced. Someone has to logon to the server and delete the old RAID partition and create a new one. This creates a lot of overhead in times when man hours and work cycles are at an all time high. An IT operations manager is either going to delay doing this due to more priority workload or refuse to do it because they don't have enough cycles to take people resources away for more important work. Then its going to get pushed off to another team. Then the politics begin and wham then it gets pushed back to the server owner/customer. If you wanted to make a RAID1 or SAN drive available then you could avoid that entire scenario.

Hadoop put performance - large file (20gb)

I'm using hdfs -put to load a large 20GB file into hdfs. Currently the process runs # 4mins. I'm trying to improve the write time of loading data into hdfs. I tried utilizing different block sizes to improve write speed but got the below results:
512M blocksize = 4mins;
256M blocksize = 4mins;
128M blocksize = 4mins;
64M blocksize = 4mins;
Does anyone know what the bottleneck could be and other options I could explore to improve performance of the -put cmd?
20GB / 4minute comes out to about 85MB/sec. That's pretty reasonable throughput to expect from a single drive with all the overhead of HDFS protocol and network. I'm betting that is your bottleneck. Without changing your ingest process, you're not going to be able to make this magically faster.
The core problem is that 20GB is a decent amount of data and that data getting pushed into HDFS as a single stream. You are limited by disk I/O which is pretty lame given you have a large number of disks in a Hadoop cluster.. You've got a while to go to saturate a 10GigE network (and probably a 1GigE, too).
Changing block size shouldn't change this behavior, as you saw. It's still the same amount of data off disk into HDFS.
I suggest you split the file up into 1GB files and spread them over multiple disks, then push them up with -put in parallel. You might want even want to consider splitting these files over multiple nodes if network becomes a bottleneck. Can you change the way you receive your data to make this faster? Obvious splitting the file and moving it around will take time, too.
It depends a lot on the details of your setup. First, know that 20GB in 4 mins is 80MBps.
The bottleneck is most likely your local machine's hardware or its ethernet connection. I doubt playing with block size will improve your throughput by much.
If your local machine has a typical 7200rpm hard drive, its disk to buffer transfer rate is about 128MBps, meaning that it could load that 20BG file into memory in about 2:35, assuming you have 20GB to spare. However, you're not just copying it to memory, you're streaming it from memory to network packets, so it's understandable that you incur an additional overhead for processing these tasks.
Also see the wikipedia entry on wire speed, which puts a fast ethernet setup at 100Mbit/s (~12MB/s). Note that in this case fast ethernet is a term for a particular group of ethernet standards. You are clearly getting a faster rate than this. Wire speed is a good measure, because it accounts for all the factors on your local machine.
So let's break down the different steps in the streaming process on your local machine:
Read a chunk from file and load it into memory. Components: hard drive, memory
Split and translate that chunk into packets. Last I heard Hadoop doesn't use DMA features out of the box, so these operations will be performed by your CPU rather than the NIC. Components: Memory, CPU
Transmit packets to hadoop file servers. Components: NIC, Network
Without knowing more about your local machine, it is hard to specify which of these components is the bottleneck. However, these are the places to start investigating bitrate.
you may want to use distcp
hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/outputdata
to perform parallel copy

what does " local caching of data" mean in the context of this article?

From the following paragraphs of Text——
(http://developer.yahoo.com/hadoop/tutorial/module2.html),It mentions that sequential readable large files are not suitable for local caching. but I don't understand what does local here mean...
There are two assumptions in my opinion: one is Client caches data from HDFS and the other is datanode caches hdfs data in its local filesystem or Memory for Clients to access quickly. is there anyone who can explain more? Thanks a lot.
But while HDFS is very scalable, its high performance design also restricts it to a
particular class of applications; it is not as general-purpose as NFS. There are a large
number of additional decisions and trade-offs that were made with HDFS. In particular:
Applications that use HDFS are assumed to perform long sequential streaming reads from
files. HDFS is optimized to provide streaming read performance; this comes at the expense of
random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files
after they have already been closed are not supported. (An extension to Hadoop will provide
support for appending new data to the ends of files; it is scheduled to be included in
Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system does
not provide a mechanism for local caching of data. The overhead of caching is great enough
that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently and
intermittently. The cluster must be able to withstand the complete failure of several
machines, possibly many happening at the same time (e.g., if a rack fails all together).
While performance may degrade proportional to the number of machines lost, the system as a
whole should not become overly slow, nor should information be lost. Data replication
strategies combat this problem.
Any real Mapreduce job is probably going to process GB's (10/100/1000s) of data from HDFS.
Therefore any one mapper instance is most probably going to be processing a fair amount of data (typical block size is 64/128/256 MB depending on your configuration) in a sequential nature (it will read the file / block in its entirety from start to end.
It is also unlikely that another mapper instance running on the same machine will want to process that data block again any time in the immediate future, more so that multiple mapper instances will also be processing data alongside this mapper in any one TaskTracker (hopefully with a fair few being 'local' to actually physical location of the data, i.e. a replica of the data block also exists on the same machine the mapper instance is running).
With all this in mind, caching the data read from HDFS is probably not going to gain you much - you'll most probably not get a cache hit on that data before another block is queried and will ultimately replace it in the cache.

How to use HBase and Hadoop to serve live traffic AND perform analytics? (Single cluster vs separate clusters?)

Our primary purpose is to use Hadoop for doing analytics. In this use case, we do batch processing, so throughput is more important than latency, meaning that HBase is not necessarily a good fit (although getting closer to real-time analytics does sound appealing). We are playing around with Hive and we like it so far.
Although analytics is the main thing we want to do in the immediate future with Hadoop, we are also looking to potentially migrate parts of our operations to HBase and to serve live traffic out of it. The data that would be stored there is the same data that we use in our analytics, and I wonder if we could just have one system for both live traffic and analytics.
I have read a lot of reports and it seems that most organizations choose to have separate clusters for serving traffic and for analytics. This seems like a reasonable choice for stability purposes, since we plan to have many people writing Hive queries, and badly written queries could potentially compromise the live operations.
Now my question is: how are those two different use cases reconciled (serving live traffic and doing batch analytics)? Do organizations use systems to write all data in two otherwise independent clusters? Or is it possible to do this out of the box with a single cluster in which some of the nodes serve live traffic and others do only analytics?
What I'm thinking is that we could perhaps have all data coming into the nodes that are used for serving live traffic, and let the HDFS replication mechanisms manage the copying of data into nodes that are used for analytics (increasing the replication higher than the default 3 probably makes sense in such scenario). Hadoop can be made aware of special network topologies, and it has functionality to always replicate at least one copy to different racks, so this seems to mesh well with what I'm describing.
The nodes dedicated to live traffic could be set to have zero (or few) map and reduce slots, so that all Hive queries end up being processed by the nodes dedicated to analytics.
The nodes dedicated to analytics would always be a little behind those dedicated to serving live traffic, but that does not seem to be a problem.
Does that kind of solution make sense? I am thinking it could be more simple to have one cluster than two, but would this be significantly riskier? Are there known cases of companies using a HBase cluster to serve live traffic while also running batch analytics jobs on it?
I'd love to get your opinions on this :) !
Thanks.
EDIT: What about Brisk? It's based on Cassandra instead of HBase, but it seems to be made exactly for what I'm describing (hybrid clusters). Has anyone worked with it before? Is it mature?
--
Felix
Your approach has a few problems... even in rack aware mode, if you have more than a few racks I don't see how you can be guaranteed your nodes will be replicated on those nodes. If you lose one of your "live" nodes, then you will be under-replicated for a while and won't have access to that data.
HBase is greedy in terms of resources and I've found it doesn't play well with others (in terms of memory and CPU) in high load situations. You mention, too, that heavy analytics can impact live performance, which is also true.
In my cluster, we use Hadoop quite a bit to preprocess data for ingest into HBase. We do things like enrichment, filtering out records we don't want, transforming, summarization, etc. If you are thinking you want to do something like this, I suggest sending your data to HDFS on your Hadoop cluster first, then offloading it to your HBase cluster.
There is nothing stopping you from having your HBase cluster and Hadoop cluster on the same network backplane. I suggest instead of having hybrid nodes, just dedicate some nodes to your Hadoop cluster and some nodes to your Hbase cluster. The network transfer between the two will be quite snappy.
Just my personal experience so I'm not sure how much of it is relevant. I hope you find it useful and best of luck!
I think this kind of solution might have sense, since MR is mostly CPU intensive and HBASE is a memory hungry beast. What we do need - is to properly arrange resource management. I think it is possible in the following way:
a) CPU. We can define maximum number of MR mappers/reducers per slot and assuming that each mapper is single threaded we can limit CPU consumption of the MR. The rest will go to HBASE.
b) Memory.We can limit memory for mappers and reducers and the rest give to HBASE.
c) I think we can not properly manage HDFS bandwidth sharing, but I do not think it should be a problem for HBASE -since for it disk operations are not on the critical path.

Resources