New to hadoop, only setup a 3 debian server cluster for practice.
I was researching best practices on hadoop and came across:
JBOD no RAID
Filesystem: ext3, ext4, xfs - none of that fancy COW stuff you see with zfs and btrfs
So I raise these questions...
Everywhere I read JBOD is better then RAID in hadoop, and that the best filesystems are xfs and ext3 and ext4. Aside from the filesystem stuff which totally makes sense why those are the best... how do you implement this JBOD? You will see my confusion if you do the google search your self, JBOD alludes to a linear appendage or combination of just a bunch of disks kind of like a logical volume well at least thats how some people explain it, but hadoop seems to want a JBOD that doesnt combine. No body expands on that...
Question 1) What does everyone in the hadoop world mean by JBOD and how do you implement that?
Question 2) Is it as simple as mounting each disk to a different directory is all?
Question 3) Does that mean that hadoop runs best on a JBOD where each disk is simply mounted to a different directory?
Question 4) And then you just point hadoop at those data.dirs?
Question5)
I see JBODS going 2 ways, either each disk going to a seperate mount, or a linear concat of disks, which can be done mdadm --linear mode, or lvm i bet can do it too, so I dont see the big deal with that... And if thats the case, where mdadm --linear or lvm can be used because the JBOD people are refering to is this concat of disks, then which is the best way to "JBOD" or linearly concat disks for hadoop?
This is off topic, but can someone verify if this is correct as well? Filesystems that use cow, copy on write, like zfs and btrfs just slow down hadoop but not only that the cow implementation is a waste with hadoop.
Question 6) Why is COW and things like RAID a waste on hadoop?
I see it as if your system crashes and you use the cowness of if to restore it, by the time you restored your system, there have been so many changes to hdfs it will probably just consider that machine as faulty and it would be better to rejoin it from scratch (bring it up as a fresh new datanode)... Or how will the hadoop system see the older datanode? My guess is it wont think its old or new or even a datanode, it will just see it as garbage... Idk...
Question 7) What happens if hadoop sees a datanode that fell off the cluster and then the datanode comes back online with data slightly older? Is there an extent to how old the data has to be ??? how does this topic?
REASKING QUESTION 1 THRU 4
I just realized my question is so simple yet it's so hard for me to explain it that I had to split it up into 4 questions, and i still didn't get the answer I'm looking for, from what sounds like very smart individuals, so i must re-ask differently..
On paper I could easily or with a drawing... I'll attempt with words again..
If confused on what I am asking in the JBOD question...
** just wondering what kind of JBOD everyone keeps referring to in the hadoop world is all **
JBODs are defined differently with hadoop then in normal world and I want to know how the best way to implement hadoop is it on a concat of jbods(sda+sdb+sdc+sdd) or just leave the disks alone(sda,sdb,sdc,sdd)
I think the graphical representation below explain what I am asking best
(JBOD METHOD 1)
normal world: jbod is a concatination of disks - then if you were to use hadoop you would overlay the data.dir (Where hdfs virtualy sites) onto a directory inside this concat of disks, ALSO all of the disks would appear as 1... so if you had sda and sdb and sdc as your data disks in your node, you would make em appear as some entity1 (either with the hardware of the motherboard or mdadm or lvm) which is a linear concat of sda and sdb and sdc. you would then mount this entity1 to a folder in the Unix namespace like /mnt/jbod/ and then setup hadoop to run with in it.
TEXT SUMMARY: if disk 1 and disk2 and disk 3 were each 100gb and 200gb and 300gb big respectively then this jbod would be 600gb big, and hadoop from this node would gain 600gb in capacity
* TEXTO-GRAPHICAL OF LINEAR CONCAT OF DISKS BEING A JBOD:
* disk1 2 and 3 used for datanode for hadoop
* disk1 is sda 100gb
* disk2 is sdb 200gb
* disk3 is sdc 300gb
* sda + sdb + sdc = jbod of name entity1
* JBOD MADE ANYWAY - WHO CARES - THATS NOT MY QUESTION: maybe we made the jbod of entity1 with lvm, or mdadm using linear concat, or hardware jbod drivers which combine disks and show them to the operating system as entity1, it doesn't matter, either way its still a jbod
* This is the type of JBOD I am used to and I keep coming across when I google search JBOD
* cat /proc/partitions would show sda,sdb,sdc and entity1 OR if we used hardware jbod maybe sda and sdb and sdc would not show and only entity1 would show, again who cares how it shows
* mount entity1 to /mnt/entity1
* running "df" would show that entity1 is 100+200+300=600gb big
* we then setup hadoop to run its datanodes on /mnt/entity1 so that datadir property points at /mnt/entity1 and the cluster just gained 600gb of capacity
..the other perspective is this..
(JBOD METHOD 2)
in hadoop it seems to me they want every disk seperate. So I would mount disk sda and sdb and sdc in the unix namespace to /mnt/a and /mnt/b and /mnt/c... it seems from reading across the web lots of hadoop experts classify jbods as just that just a bunch of disks so to unix they would look like disks not a concat of the disks... and then of course i can combine then to become one entity either with logical volume manager (lvm) or mdadm (in a raid or linear fashion, linear prefered for jbod) ...... but...... nah lets not combine them because it seems in the hadoop world jbod is just a bunch of disks sitting by them selves...
if disk 1 and disk2 and disk 3 were each 100gb and 200gb and 300gb big respectively then each mount disk1->/mnt/a and disk2->/mnt/b and disk3->/mnt/c would each be 100gb and 200gb and 300gb big respectively, and hadoop from this node would gain 600gb in capacity
TEXTO-GRAPHICAL OF LINEAR CONCAT OF DISKS BEING A JBOD
* disk1 2 and 3 used for datanode for hadoop
* disk1 is sda 100gb
* disk2 is sdb 200gb
* disk3 is sdc 300gb
* WE DO NOT COMBINE THEM TO APPEAR AS ONE
* sda mounted to /mnt/a
* sdb mounted to /mnt/b
* sdc mounted to /mnt/c
* running a "df" would show that sda and sdb and sdc have the following sizes: 100,200,300 gb respectively
* we then setup hadoop via its config files to lay its hdfs on this node on the following "datadirs": /mnt/a and /mnt/b and /mnt/c.. gaining 100gb to the cluster from a, 200gb from b and 300gb from c... for a total gain of 600gb from this node... nobody using the cluster would tell the difference..
SUMMARY OF QUESTION
** Which method is everyone referring to is BEST PRACTICE for hadoop this combination jbod or a seperation of disks - which is still also a jbod according to online documentation? **
Both cases would gain hadoop 600gb... its just 1. looks like a concat or one entity that is a combo of all the disks, which is what I always thought was a jbod... Or it will be like 2 where each disk in the system is mounted to different directory, end result is all the same to hadoop capacity wise... just wondering if this is the best way for performance
I can try to answer few of the questions - tell me wherever you disagree.
1.JBOD: Just a bunch of disks; an array of drives, each of which is accessed directly as an independent drive.
From Hadoop Definitive Guide, topic Why not use RAID?, says that RAID Read and Write performance is limited by the slowest disk in the Array.
In addition, in case of HDFS, replication of data occurs across different machines residing in different racks. This handle potential loss of data even if a rack fails. So, RAID isn't that necessary. Namenode can though use RAID as mentioned in the link.
2.Yes That means independent disks (JBODs) mounted in each of the machines (e.g. /disk1, /disk2, /disk3 etc.) but not partitioned.
3, 4 & 5 Read Addendum
6 & 7. Check this link to see how replication of blocks occurs
Addendum after the comment:
Q1. Which method is everyone refering to is BEST PRACTICE for hadoop this combination jbod or a seperation of disks - which is still also a jbod according to online documentation?
Possible answer:
From Hadoop Definitive Guide -
You should also set the dfs.data.dir property, which specifies a list
of directories for a datanode to store its blocks. Unlike the
namenode, which uses multiple directories for redundancy, a datanode
round-robins writes between its storage directories, so for
performance you should specify a storage directory for each local
disk. Read performance also benefits from having multiple disks for
storage, because blocks will be spread across them, and concurrent
reads for distinct blocks will be correspondingly spread across disks.
For maximum performance, you should mount storage disks with the
noatime option. This setting means that last accessed time information
is not written on file reads, which gives significant performance gains.
Q2. Why LVM is not a good idea?
Avoid RAID and LVM on TaskTracker and DataNode machines – it generally
reduces performance.
This is because LVM creates logical layer over the individual mounted disks in a machine.
Check this link for TIP 1 more details. There are use cases where using LVM performed slow when running Hadoop jobs.
I'm late to the party but maybe I can chime in:
JBOD
Question 1) What does everyone in the hadoop world mean by JBOD and how do you implement that?
Just a bunch of disks... you just format the whole disk and include it in the ´hdfs-site.xmlandmapred-site.xmloryarn-site-xml` on the datanodes. Hadoop takes care about distributing blocks across disks.
Question 2) Is it as simple as mounting each disk to a different directory is all?
Yes.
Question 3) Does that mean that hadoop runs best on a JBOD where each disk is simply mounted to a different directory?
Yes. Hadoop does checksumming of the data and periodically verifies these checksums.
Question 4) And then you just point hadoop at those data.dirs?
Exactly. But there are directories for data storage (HDFS) and computation (MapReduce, YARN, ..) you can configure different directories and disks for certain tasks.
Question 5) I see JBODS going 2 ways, either each disk going to a seperate mount, or a linear concat of disks, which can be done mdadm --linear mode, or lvm i bet can do it too, so I dont see the big deal with that... And if thats the case, where mdadm --linear or lvm can be used because the JBOD people are refering to is this concat of disks, then which is the best way to "JBOD" or linearly concat disks for hadoop?
The problem is faulty disks. If you keep it simple and just mount each disks at a time you just have to replace this disk. If you use mdadm or LVM in ja JBOD configuration you have are prone to loosing more data in case a disk dies as the striped or concat configuration may not survive a disk failure. As data for more blocks is spread across multiple disks.
Question 6) Why is COW and things like RAID a waste on hadoop? I see it as if your system crashes and you use the cowness of if to restore it, by the time you restored your system, there have been so many changes to hdfs it will probably just consider that machine as faulty and it would be better to rejoin it from scratch (bring it up as a fresh new datanode)... Or how will the hadoop system see the older datanode? My guess is it wont think its old or new or even a datanode, it will just see it as garbage... Idk...
HDFS is a competently seperate layer atop of your native filesystem. Disk failures are expected and that's why all data blocks are replicated at least 3 times across several machines. HDFS also does it's own checksumming so if the checksum of a block mismatches a replica of this block is used and the broken block will be deleted by HDFS.
So in theory it makes no sense to use RAID or COW for Hadoop drives.
It can make sense through if you have to deal with faulty disks that can't be replaced instantly.
Question 7) What happens if hadoop sees a datanode that fell off the cluster and then the datanode comes back online with data slightly older? Is there an extent to how old the data has to be ??? how does this topic?
The NameNode has a list of blocks and their locations on the datanodes. Each block has a checksum and locations. If a datanode goes down in a cluster the namenode replicates the blocks of this datanode to other datanodes.
If an older datanode comes online it is sending it's block list to the NameNode and depending on how many of the blocks are already replicated or not it will delete unneeded blocks on this datanode.
The age of data is not important it's only about blocks. If the NameNode still maintains the blocks and the datanode has them, they will be used again.
ZFS/btrfs/COW
In theory they additional features these filesystems provide are not required for Hadoop. However as you usally use cheap and huge 4TB+ desktop drives for datanodes you can run into problems if these disks start to fail.
ext4 remounts itself read-only on failure and at this point you'll see the disk dropping out of the HDFS on the datanode it it's configured to loose drives or you'll see the datanode die if disk failures that are not allowed. This can be a problem because modern drives often exhibit some bad sectors while still functioning fine for the most part and it's work intensive to fsck this disks and restart the datanode.
Another problem are computations through YARN/MapReduce.. these write also intermediate data on the disks and if this data gets corrupted or can't be written you'll run into errors. I'm not sure if YARN/MapReduce also checksum their temporary files - I think it's implemented through.
ZFS and btrfs provide some resilience against this errors on modern drives as they are able to deal better with corrupted metadata and avoid lengthy fsck checks due to internal checksumming.
I'm running a Hadoop cluster on ZFS (just JBOD with LZ4) with lots of disks that exhibitit some bad sectors and that are out of warranty but still perform well and it works fine despite these errors.
If you can replace the faulty disks instantly it does not matter much. If you need to live with partly broken disks ZFS/btrfs will buy you some time before having to replace the disks.
COW is not needed because Hadoop takes care of replication and security. Compression can be usefull if you store your data uncompressed on the cluster. LZ4 in ZFS should not provide a performance penalty and can speed up sequential reads (like HDFS and MapReduce do them).
Performance
The case against RAID is that at least MapReduce is implementing something similiar.. HDFS can read and write concurrently to all of the disks and usally multiple map and reduce jobs are running that can use a whole disk for writing and reading their data.
If you put RAID or striping below Hadoop these jobs have all to enqueue their reads and write to the single RAID controller and overall it's likely slower.
Depending on your jobs it can make sense to use something like RAID-0 for pairs of disks but be sure to first verify that sequential read or write is really the bottleneck for your job (and not the network, HDFS replication, CPU, ..) but make sure first that what you are doing is worth the work and the hassle.
Related
If I have 6 data nodes, is it faster to turn replication to 6 so all the data is replicated across all my nodes so the cluster can split up queries (say in hive) without having to move data around? I believe that if you have a replication of 3 and you put a 300GB file into HDFS, it splits it just across 3 of the data nodes and then when the 6 nodes need to be used for a query it has to move data around to the other 3 nodes that the data doesn't exist on, causing slower responses.. is that accurate?
I understand your means, you are talking about the data-locality. Generally speaking, the data-locality can reduce the run time, because it can save the time that block transmission by network. But in fact, if you don't open the "HDFS Short-Circuit Local Reads"(default it is off, please visit here), the MapTask will also read the block by the TCP protocol, it means by network, even if block and MapTask both on the same node.
Recently, I optimize hadoop and HDFS, we use SSD to instead the HDD disk, but we found the effect is not good and time is not shorter.Because the disk is not the bottleneck and network load is not heavy. According to the result, we conclude the cpu is very heavy. If you want you know the hadoop cluster situation clearly, I advise you to use ganglia to monitoring the cluster, it can help you to analysis your cluster bottleneck.please see here.
At last, hadoop is a very large and complicated system, the disk performance, cpu performance, network bandwidth, parameters values and also, there are many factor to consider. If you want to save time, you have much work to do, not just the replication factor.
Various websites (like Hortonworks) recommend to not configure RAID for HDFS setups mainly because of two reasons:
Speed limited to slower disk (JBOD performs better).
Reliability
It is recommended to use RAID on NameNode.
But what about implementing RAID on each DataNode storage disk?
RAID is used for two purposes. Depending on the RAID configuration you can get:
Better performance: reading a file can be spread over multiple disks or different disks can be transparently used to read multiple files from the same file system.
Fault-tolerance: Data is replicated or stored using parity bits on multiple disks. If a disk fails, it can be recovered from another replica or recomputed using the parity bits.
HDFS has similar mechanisms built in software. HDFS splits files into chunks (so-called file blocks) which are replicated across multiple datanodes and stored on their local filesystems. Usually, datanodes have multiple disks which are individually mounted (JBOD). A datanode should distribute its file blocks across all its disks / local filesystems.
This ensures:
Fault-tolerance: If a disk or node goes down, other replicas are available on different data nodes and disks.
High sequential read/write performance: By splitting a file into multiple chunks and storing them on different nodes (and different disks), a file can be read in parallel by concurrently accessing multiple disks (on different nodes). Each disk can read data with its full bandwidth and its read operations do not interfere with other disks. If the cluster is well utilized all disks will be spinning at full speed delivering the maximum sequential read performance.
Since HDFS is taking care of fault-tolerance and "striped" reading, there is no need to use RAID underneath an HDFS. Using RAID will only be more expensive, offer less storage, and also be slower (depending on the concrete RAID config).
Since the namenode is a single-point-of-failure in HDFS, it requires a more reliable hardware setup. Therefore, the use of RAID is recommended on namenodes.
RAID0 on and enterprise server is a huge mistake. I sure would like to meet the person that designed this. This makes no common sense to an IT operations manager. If you configure any of your local server disk with a RAID0 you risk a long and painful RAID0 recovery. If a single disk in a RAID0 fails that RAID partition becomes destroyed and it doesn't magically recover when the disk is replaced. Someone has to logon to the server and delete the old RAID partition and create a new one. This creates a lot of overhead in times when man hours and work cycles are at an all time high. An IT operations manager is either going to delay doing this due to more priority workload or refuse to do it because they don't have enough cycles to take people resources away for more important work. Then its going to get pushed off to another team. Then the politics begin and wham then it gets pushed back to the server owner/customer. If you wanted to make a RAID1 or SAN drive available then you could avoid that entire scenario.
Are there any suggestions about size of HDD on namenode physical machine? Sure, it does not store any data from HDFS like datanode but what should I depend on while creating cluster?
Physical disk space on the NameNode does not really matter unless you run a Datanode on the same node. However, it is very important to have good memory (RAM) space allocated to the NameNode. This is because the NameNode stores all the metadata of the HDFS (block allocations, block locations etc.), in memory. If sufficient memory is not allocated, the NameNode might run out of memory and fail.
You might need some space to actually store the the NameNode's FSImage, edit file and other relevant files.
It's actually recommended to configure NameNode to use multiple directories (one local and other NFS mount), so that multiple copies of File System metadata will be stored. That way, as long as the directories are on separate disks, a single disk failure will not corrupt the meta-data.
Please see this link for more details.
We're hearing from Cloudera that they recommend name nodes have faster disks - combination of SSD and 10kRPM SAS drives over typical 2TB, 7200K SAS drives. Does this sound reasonable or overkill since everything else I've read suggests that you don't really need expensive high speed storage for Hadoop.
We are thinking about going with Hadoop in my company. From looking at the docs in the Internet I got the impression the idea of HDFS is to keep it in RAM to speed up things. Now our architects says that the main idea of HDFS is scalability. I'm fine with that. But then he also claims the main idea is to keep it on the on the harddisk. HDFS is basically a scalable harddisk. My opinion is that backing HDFS by the harddisk is an option. The main idea, however, is to keep it in RAM. Who is right now? I'm really confused now and the point is crucial for the understanding of Hadoop, I would say.
Thanks, Oliver
Oliver, your architect is correct. Horizontal scalability is one of the biggest advantages of HDFS(Hadoop in general). When you say Hadoop it implies that you are dealing with very huge amounts of data, right? How are you going to put so much data in-memory?(I am assuming that by the idea of HDFS is to keep it in RAM to speed up things you mean to keep the data stored in HDFS in RAM).
But, the HDFS's metadata is kept in-memory so that you can quickly access the data stored in your HDFS. Remember, HDFS is not something physical. It is rather a virtual filesystem that lies on top of your native filesystem. So, when you say you are storing data into HDFS, it eventually gets stored in your native/local filesystem on your machine's disk and not RAM.
Having said that, there are certain major differences in the way HDFS and native FS behave. Like the block size which is very large when compared to local FS block size. Similarly the replicated manner in which data is stored in HDFS(think of RAID but at the software level).
So how does HDFS make things faster?
Hadoop is a distributed platform and HDFS a distributed store. When you put a file into HDFS it gets split into n small blocks(of size 64MB default, but configurable). Then all the blocks
of a file get stored across all the machines of your Hadoop cluster. This allows us to read all of the block together in parallel thus reducing the total reading time.
I would suggest you to go through this link in order to get a proper understanding of HDFS :
http://hadoop.apache.org/docs/stable/hdfs_design.html
HTH
Hi I am trying to setup the hadoop environment. In short the problem which I am trying to solve involves billions of XML files of size few MB, extract relevant information from them using HIVE and do some analytic work with the information. I know this is a trivial problem in hadoop world but if Hadoop solution works well for me than size and number of files I will be dealing will increase in geometric progession form.
I did research by referring various books like "Hadoop - the definite guide", "Hadoop in action". Resources like documents by yahoo and hortonworks. I am not able to figure out the hardware /software specifications for establishing the hadoop environment. In the resources which I had referred so far I had kind of found standard solutions like
Namenode/JobTracker (2 x 1Gb/s Ethernet, 16 GB of RAM, 4xCPU, 100 GB disk)
Datanode (2 x 1Gb/s Ethernet, 8 GB of RAM, 4xCPU, Multiple disks with total amount
of 500+ GB)
but if anyone can give some suggestions that will be great. Thanks
First I would suggest you to consider: what do you need more processing + some storage or opposite, and from this view select hardware. Your case sounds more processing then storage.
I would specify a bit differently standard hardware for hadoop
NameNode: High quality disk in mirror, 16 GB HDD.
Data Nodes: 16-24 GB RAM, Dual Quad or Dual six cores CPU, 4 to 6 1-2-3 SATA TB Drives.
I would also consider 10 GBit option. I think if it does not add more then 15% of cluster price - it makes sense. 15% came from rough estimation that data shipping from mappers to reducers takes about 15% of job time.
In your case I would be more willing to sacrifice disc sizes to save money, but not CPU/Memory/number of drives.
"extract relevant information from them using HIVE"
That is going to be a bit tricky since hive doesn't really do well with xml files.
you are going to want to build a parsing script in another language (ruby, python, perl, etc) that can parse the xml files and produce columnar output that you will load into hive. You can then use hive to call that external parsing script with a transform, or just use hadoopstreaming to prepare the data for hive.
Then it is just a matter of how fast you need the work done and how much space you need to hold the amount of data you are going to have.
you could build the process with a handful of files on a single system to test it. But you really need to have a better handle on your overall planned workload to properly scale your cluster. Minimum production cluster size would be 3 or 4 machines at a minimum, just for data redundancy. Beyond that, add nodes as necessary to meet your workload needs.