Hadoop namenode disk size - hadoop

Are there any suggestions about the size of the HDD on a namenode physical machine? Sure, it does not store any HDFS data the way a datanode does, but what should I base the sizing on when creating a cluster?

Physical disk space on the NameNode does not really matter unless you run a Datanode on the same node. However, it is very important to have good memory (RAM) space allocated to the NameNode. This is because the NameNode stores all the metadata of the HDFS (block allocations, block locations etc.), in memory. If sufficient memory is not allocated, the NameNode might run out of memory and fail.

You will still need some disk space to store the NameNode's FSImage, edit log and other related files.
It's actually recommended to configure NameNode to use multiple directories (one local and other NFS mount), so that multiple copies of File System metadata will be stored. That way, as long as the directories are on separate disks, a single disk failure will not corrupt the meta-data.
Please see this link for more details.
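As a rough illustration of the multiple-directory setup described above, here is a minimal Python sketch that builds the corresponding hdfs-site.xml property. The two paths are hypothetical placeholders (one local disk, one NFS mount). The property is called dfs.namenode.name.dir in Hadoop 2 and later; Hadoop 1.x calls it dfs.name.dir.

```python
import xml.etree.ElementTree as ET


def name_dir_property(dirs, key="dfs.namenode.name.dir"):
    """Build a <property> element listing every metadata directory.

    The NameNode writes a full copy of its metadata to each directory
    in the comma-separated list.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = key
    ET.SubElement(prop, "value").text = ",".join(dirs)
    return prop


# Hypothetical paths: one on a local disk, one on an NFS mount.
prop = name_dir_property(["/data/1/dfs/nn", "/mnt/nfs/dfs/nn"])
fragment = ET.tostring(prop, encoding="unicode")
print(fragment)
```

The printed element would go inside the configuration element of hdfs-site.xml on the NameNode.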

We're hearing from Cloudera that they recommend name nodes have faster disks: a combination of SSD and 10k RPM SAS drives rather than typical 2TB, 7,200 RPM SAS drives. Does this sound reasonable, or is it overkill, since everything else I've read suggests that you don't really need expensive high-speed storage for Hadoop?

Related

Memory required on NameNode for replicas in Hadoop

In this Cloudera blog post, in the Replication section, it has been explained that replication does not consume memory on the NameNode. However, I am skeptical about this because I understand that the NameNode stores information about each file, as well as its replicas, in main memory. How, then, is the memory requirement the same with or without replication?
Well, memory consumption depends on what you mean, because there is physical memory and virtual memory (I am talking about the Namenode only here).
In terms of physical memory, the Cloudera blog is correct, as it is the responsibility of the Datanode to tell the Namenode (when it connects after a restart, for example) which blocks it holds. The Namenode stores solely the file-system structure on disk (fsimage and edits files).
The situation is slightly different when you are talking about virtual memory. You can see from the source code that FSNamesystem (the component responsible for maintaining the FS structure in RAM) has a reference to BlockManager. BlockManager in turn holds a reference to BlocksMap, which according to its documentation does maintain the list of datanodes for each block:
This class maintains the map from a block to its metadata. block's
metadata currently includes blockCollection it belongs to and the
datanodes that store the block.
If you go through the source code of BlockManager you can see where and how BlocksMap is used.
My guess, since the Cloudera folks have experience with large-scale computation and have probably measured the impact, is that the size of this structure is not significant compared to the rest of the metadata the Namenode has to manage.

Where does the name node reside, in RAM or on hard disk?

Where does the name node reside, in RAM or on hard disk (Hadoop 1.2.1)?
Is the name node daemon placed in RAM or in secondary memory? Can anyone please help me understand this?
The Namenode is a Java process running in a Hadoop cluster. It is responsible for managing the metadata associated with the filesystem, so it is also called the master node or the core node of Hadoop's file system, the Hadoop Distributed File System (HDFS). The Namenode stores the metadata in memory as well as on disk. For frequent access, RAM is faster, but when the machine fails or the power goes off, the data in RAM is lost. So it also keeps a copy of the metadata on disk. The data on disk is stored as two files: one is the FSImage and the other is the editlog.
The complete metadata up to the last checkpoint is stored in the FSImage and the recent transactions are stored in the editlog. As the size of the editlog increases, or after a certain predefined time, or after a particular number of operations, the editlog is merged into the FSImage and a new FSImage is created. In this way the editlog always remains a small file, and operations on the editlog stay fast.
The process of merging the FSImage and editlog to create a new FSImage is known as checkpointing.
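The merge described above can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions: the namespace is modelled as a set of paths, and the edit log as a list of (operation, path) tuples. The real FSImage and editlog are binary files with many more operation types, and the function names here are made up for the example.

```python
def apply_edit(namespace, edit):
    """Apply one logged transaction to an in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, editlog):
    """Replay the edit log on top of the last image.

    Returns the new image and a fresh, empty edit log.
    """
    merged = set(fsimage)
    for edit in editlog:
        apply_edit(merged, edit)
    return frozenset(merged), []


fsimage = frozenset({"/", "/user"})
editlog = [("create", "/user/a.txt"), ("create", "/tmp"), ("delete", "/tmp")]

new_fsimage, new_editlog = checkpoint(fsimage, editlog)
print(sorted(new_fsimage), new_editlog)
```

After the checkpoint the edit log is empty again, which is why it stays small between checkpoints.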

Intention of Hadoop FS is to keep in RAM or disk?

We are thinking about going with Hadoop in my company. From looking at the docs on the Internet I got the impression that the idea of HDFS is to keep it in RAM to speed things up. Now our architect says that the main idea of HDFS is scalability. I'm fine with that. But then he also claims the main idea is to keep it on the hard disk: HDFS is basically a scalable hard disk. My opinion is that backing HDFS by the hard disk is an option. The main idea, however, is to keep it in RAM. Who is right? I'm really confused now, and the point is crucial for the understanding of Hadoop, I would say.
Thanks, Oliver
Oliver, your architect is correct. Horizontal scalability is one of the biggest advantages of HDFS (and Hadoop in general). When you say Hadoop it implies that you are dealing with very large amounts of data, right? How are you going to put so much data in memory? (I am assuming that by "the idea of HDFS is to keep it in RAM to speed things up" you mean keeping the data stored in HDFS in RAM.)
But, the HDFS's metadata is kept in-memory so that you can quickly access the data stored in your HDFS. Remember, HDFS is not something physical. It is rather a virtual filesystem that lies on top of your native filesystem. So, when you say you are storing data into HDFS, it eventually gets stored in your native/local filesystem on your machine's disk and not RAM.
Having said that, there are certain major differences in the way HDFS and a native FS behave. One is the block size, which is very large compared to the block size of a local FS. Another is the replicated manner in which data is stored in HDFS (think of RAID, but at the software level).
So how does HDFS make things faster?
Hadoop is a distributed platform and HDFS a distributed store. When you put a file into HDFS it gets split into n blocks (64MB each by default in Hadoop 1.x, 128MB in 2.x and later, and configurable). The blocks of a file are then spread across the machines of your Hadoop cluster. This allows us to read the blocks in parallel, reducing the total reading time.
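To make the splitting concrete, here is a small Python sketch of how a file size maps to block sizes. It only does the arithmetic and assumes the 64MB default mentioned above; the function name is made up for the example. Note that the last block only occupies as much space as the remaining data, not a full block.

```python
MB = 1024 ** 2


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block holds only the leftover bytes
    return sizes


sizes = split_into_blocks(200 * MB)
print(len(sizes), [s // MB for s in sizes])
```

A 200MB file therefore becomes three full 64MB blocks plus one 8MB block, each of which can be read from a different machine.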
I would suggest you to go through this link in order to get a proper understanding of HDFS :
http://hadoop.apache.org/docs/stable/hdfs_design.html
HTH

What kind of JBOD in hadoop? and COW with hadoop?

New to hadoop, only setup a 3 debian server cluster for practice.
I was researching best practices on hadoop and came across:
JBOD no RAID
Filesystem: ext3, ext4, xfs - none of that fancy COW stuff you see with zfs and btrfs
So I raise these questions...
Everywhere I read that JBOD is better than RAID in hadoop, and that the best filesystems are xfs, ext3 and ext4. Aside from the filesystem stuff, where it totally makes sense why those are the best... how do you implement this JBOD? You will see my confusion if you do the google search yourself: JBOD alludes to a linear appendage or combination of just a bunch of disks, kind of like a logical volume, well at least that's how some people explain it, but hadoop seems to want a JBOD that doesn't combine. Nobody expands on that...
Question 1) What does everyone in the hadoop world mean by JBOD and how do you implement that?
Question 2) Is it as simple as mounting each disk to a different directory is all?
Question 3) Does that mean that hadoop runs best on a JBOD where each disk is simply mounted to a different directory?
Question 4) And then you just point hadoop at those data.dirs?
Question5)
I see JBODs going 2 ways: either each disk going to a separate mount, or a linear concat of disks, which can be done with mdadm --linear mode, or lvm I bet can do it too, so I don't see the big deal with that... And if that's the case, where mdadm --linear or lvm can be used because the JBOD people are referring to is this concat of disks, then which is the best way to "JBOD" or linearly concat disks for hadoop?
This is off topic, but can someone verify if this is correct as well? Filesystems that use cow, copy on write, like zfs and btrfs just slow down hadoop but not only that the cow implementation is a waste with hadoop.
Question 6) Why is COW and things like RAID a waste on hadoop?
I see it as if your system crashes and you use the cowness of if to restore it, by the time you restored your system, there have been so many changes to hdfs it will probably just consider that machine as faulty and it would be better to rejoin it from scratch (bring it up as a fresh new datanode)... Or how will the hadoop system see the older datanode? My guess is it wont think its old or new or even a datanode, it will just see it as garbage... Idk...
Question 7) What happens if hadoop sees a datanode that fell off the cluster and then the datanode comes back online with data slightly older? Is there an extent to how old the data has to be ??? how does this topic?
REASKING QUESTION 1 THRU 4
I just realized my question is so simple yet it's so hard for me to explain it that I had to split it up into 4 questions, and i still didn't get the answer I'm looking for, from what sounds like very smart individuals, so i must re-ask differently..
On paper I could easily or with a drawing... I'll attempt with words again..
If confused on what I am asking in the JBOD question...
** just wondering what kind of JBOD everyone keeps referring to in the hadoop world is all **
JBODs are defined differently with hadoop than in the normal world, and I want to know the best way to implement hadoop: on a concat of disks (sda+sdb+sdc+sdd), or just leave the disks alone (sda,sdb,sdc,sdd)?
I think the graphical representation below explain what I am asking best
(JBOD METHOD 1)
normal world: jbod is a concatenation of disks - then if you were to use hadoop you would overlay the data.dir (where hdfs virtually sits) onto a directory inside this concat of disks, ALSO all of the disks would appear as 1... so if you had sda and sdb and sdc as your data disks in your node, you would make them appear as some entity1 (either with the hardware of the motherboard or mdadm or lvm) which is a linear concat of sda and sdb and sdc. You would then mount this entity1 to a folder in the Unix namespace like /mnt/jbod/ and then set up hadoop to run within it.
TEXT SUMMARY: if disk 1 and disk2 and disk 3 were each 100gb and 200gb and 300gb big respectively then this jbod would be 600gb big, and hadoop from this node would gain 600gb in capacity
* TEXTO-GRAPHICAL OF LINEAR CONCAT OF DISKS BEING A JBOD:
* disk1 2 and 3 used for datanode for hadoop
* disk1 is sda 100gb
* disk2 is sdb 200gb
* disk3 is sdc 300gb
* sda + sdb + sdc = jbod of name entity1
* JBOD MADE ANYWAY - WHO CARES - THATS NOT MY QUESTION: maybe we made the jbod of entity1 with lvm, or mdadm using linear concat, or hardware jbod drivers which combine disks and show them to the operating system as entity1, it doesn't matter, either way its still a jbod
* This is the type of JBOD I am used to and I keep coming across when I google search JBOD
* cat /proc/partitions would show sda,sdb,sdc and entity1 OR if we used hardware jbod maybe sda and sdb and sdc would not show and only entity1 would show, again who cares how it shows
* mount entity1 to /mnt/entity1
* running "df" would show that entity1 is 100+200+300=600gb big
* we then setup hadoop to run its datanodes on /mnt/entity1 so that datadir property points at /mnt/entity1 and the cluster just gained 600gb of capacity
..the other perspective is this..
(JBOD METHOD 2)
in hadoop it seems to me they want every disk separate. So I would mount disk sda and sdb and sdc in the unix namespace to /mnt/a and /mnt/b and /mnt/c... it seems from reading across the web that lots of hadoop experts classify jbods as just that, just a bunch of disks, so to unix they would look like disks, not a concat of the disks... and then of course I can combine them to become one entity either with logical volume manager (lvm) or mdadm (in a raid or linear fashion, linear preferred for jbod) ...... but...... nah, let's not combine them, because it seems in the hadoop world jbod is just a bunch of disks sitting by themselves...
if disk 1 and disk2 and disk 3 were each 100gb and 200gb and 300gb big respectively then each mount disk1->/mnt/a and disk2->/mnt/b and disk3->/mnt/c would each be 100gb and 200gb and 300gb big respectively, and hadoop from this node would gain 600gb in capacity
TEXTO-GRAPHICAL OF SEPARATE DISKS BEING A JBOD
* disk1 2 and 3 used for datanode for hadoop
* disk1 is sda 100gb
* disk2 is sdb 200gb
* disk3 is sdc 300gb
* WE DO NOT COMBINE THEM TO APPEAR AS ONE
* sda mounted to /mnt/a
* sdb mounted to /mnt/b
* sdc mounted to /mnt/c
* running a "df" would show that sda and sdb and sdc have the following sizes: 100,200,300 gb respectively
* we then setup hadoop via its config files to lay its hdfs on this node on the following "datadirs": /mnt/a and /mnt/b and /mnt/c.. gaining 100gb to the cluster from a, 200gb from b and 300gb from c... for a total gain of 600gb from this node... nobody using the cluster would tell the difference..
SUMMARY OF QUESTION
** Which method is everyone referring to as BEST PRACTICE for hadoop: this combination jbod, or a separation of disks - which is still also a jbod according to online documentation? **
Both cases would gain hadoop 600gb... its just 1. looks like a concat or one entity that is a combo of all the disks, which is what I always thought was a jbod... Or it will be like 2 where each disk in the system is mounted to different directory, end result is all the same to hadoop capacity wise... just wondering if this is the best way for performance
I can try to answer few of the questions - tell me wherever you disagree.
1.JBOD: Just a bunch of disks; an array of drives, each of which is accessed directly as an independent drive.
The Hadoop Definitive Guide, under the topic "Why not use RAID?", says that RAID read and write performance is limited by the slowest disk in the array.
In addition, in the case of HDFS, replication of data occurs across different machines residing in different racks. This handles potential loss of data even if a rack fails. So RAID isn't that necessary. The Namenode can use RAID, though, as mentioned in the link.
2. Yes. That means independent disks (JBODs) mounted in each of the machines (e.g. /disk1, /disk2, /disk3 etc.) but not partitioned.
3, 4 & 5 Read Addendum
6 & 7. Check this link to see how replication of blocks occurs
Addendum after the comment:
Q1. Which method is everyone refering to is BEST PRACTICE for hadoop this combination jbod or a seperation of disks - which is still also a jbod according to online documentation?
Possible answer:
From Hadoop Definitive Guide -
You should also set the dfs.data.dir property, which specifies a list
of directories for a datanode to store its blocks. Unlike the
namenode, which uses multiple directories for redundancy, a datanode
round-robins writes between its storage directories, so for
performance you should specify a storage directory for each local
disk. Read performance also benefits from having multiple disks for
storage, because blocks will be spread across them, and concurrent
reads for distinct blocks will be correspondingly spread across disks.
For maximum performance, you should mount storage disks with the
noatime option. This setting means that last accessed time information
is not written on file reads, which gives significant performance gains.
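The round-robin behaviour described in the quote can be pictured with a short Python sketch. This is a simplified model, not the DataNode's actual code: the mount points /mnt/a, /mnt/b and /mnt/c are taken from the question, the function name is made up, and the real DataNode also takes free space on each volume into account.

```python
from collections import Counter
from itertools import cycle


def place_blocks(block_ids, data_dirs):
    """Assign each new block to the next storage directory in turn."""
    next_dir = cycle(data_dirs)
    return {block: next(next_dir) for block in block_ids}


# One storage directory per physical disk, as the quote recommends.
data_dirs = ["/mnt/a", "/mnt/b", "/mnt/c"]
placement = place_blocks(range(9), data_dirs)
per_disk = Counter(placement.values())
print(per_disk)
```

With one directory per disk, writes (and later reads of distinct blocks) end up spread evenly across the spindles, which is the point of mounting each disk separately.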
Q2. Why LVM is not a good idea?
Avoid RAID and LVM on TaskTracker and DataNode machines – it generally
reduces performance.
This is because LVM creates a logical layer over the individual mounted disks in a machine.
Check this link (TIP 1) for more details. There are use cases where LVM performed slowly when running Hadoop jobs.
I'm late to the party but maybe I can chime in:
JBOD
Question 1) What does everyone in the hadoop world mean by JBOD and how do you implement that?
Just a bunch of disks... you just format the whole disk and include it in hdfs-site.xml and mapred-site.xml or yarn-site.xml on the datanodes. Hadoop takes care of distributing blocks across disks.
Question 2) Is it as simple as mounting each disk to a different directory is all?
Yes.
Question 3) Does that mean that hadoop runs best on a JBOD where each disk is simply mounted to a different directory?
Yes. Hadoop does checksumming of the data and periodically verifies these checksums.
Question 4) And then you just point hadoop at those data.dirs?
Exactly. But there are separate directories for data storage (HDFS) and computation (MapReduce, YARN, ..), so you can configure different directories and disks for certain tasks.
Question 5) I see JBODS going 2 ways, either each disk going to a seperate mount, or a linear concat of disks, which can be done mdadm --linear mode, or lvm i bet can do it too, so I dont see the big deal with that... And if thats the case, where mdadm --linear or lvm can be used because the JBOD people are refering to is this concat of disks, then which is the best way to "JBOD" or linearly concat disks for hadoop?
The problem is faulty disks. If you keep it simple and just mount each disk on its own, you only have to replace the one disk that failed. If you use mdadm or LVM in a JBOD configuration you are prone to losing more data when a disk dies, as the striped or concat configuration may not survive a disk failure, because data for more blocks is spread across multiple disks.
Question 6) Why is COW and things like RAID a waste on hadoop? I see it as if your system crashes and you use the cowness of if to restore it, by the time you restored your system, there have been so many changes to hdfs it will probably just consider that machine as faulty and it would be better to rejoin it from scratch (bring it up as a fresh new datanode)... Or how will the hadoop system see the older datanode? My guess is it wont think its old or new or even a datanode, it will just see it as garbage... Idk...
HDFS is a completely separate layer on top of your native filesystem. Disk failures are expected, and that's why all data blocks are replicated (3 times by default) across several machines. HDFS also does its own checksumming, so if the checksum of a block mismatches, a replica of this block is used and the broken block is deleted by HDFS.
So in theory it makes no sense to use RAID or COW for Hadoop drives.
It can make sense, though, if you have to deal with faulty disks that can't be replaced instantly.
Question 7) What happens if hadoop sees a datanode that fell off the cluster and then the datanode comes back online with data slightly older? Is there an extent to how old the data has to be ??? how does this topic?
The NameNode has a list of blocks and their locations on the datanodes. Each block has a checksum and locations. If a datanode goes down in a cluster the namenode replicates the blocks of this datanode to other datanodes.
If an older datanode comes back online, it sends its block list to the NameNode, and depending on how many of the blocks have already been re-replicated, unneeded blocks on this datanode are deleted.
The age of data is not important it's only about blocks. If the NameNode still maintains the blocks and the datanode has them, they will be used again.
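The reasoning above can be summarised in a toy Python model. It is a deliberately simplified sketch of the idea, not HDFS's actual algorithm: block IDs are plain strings, the function name is made up, and the replica counts are assumed to be known up front. In a real cluster the NameNode chooses which excess replica to remove and does not necessarily pick the returning node's copy.

```python
def reconcile_block_report(reported, live_replicas, target=3):
    """Decide what happens to each block a returning datanode reports.

    reported      -- block IDs found on the returning datanode
    live_replicas -- block ID -> replicas currently alive elsewhere,
                     for every block the NameNode still tracks
    """
    keep, delete = [], []
    for block in reported:
        if block not in live_replicas:
            delete.append(block)   # file was deleted while the node was away
        elif live_replicas[block] >= target:
            delete.append(block)   # already fully re-replicated elsewhere
        else:
            keep.append(block)     # still under-replicated, so it is reused
    return keep, delete


live_replicas = {"blk_1": 3, "blk_2": 2}
keep, delete = reconcile_block_report(["blk_1", "blk_2", "blk_3"], live_replicas)
print(keep, delete)
```

Nothing in the decision depends on how old the data is, only on whether the NameNode still tracks the block and how many replicas it already has.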
ZFS/btrfs/COW
In theory the additional features these filesystems provide are not required for Hadoop. However, as you usually use cheap and huge 4TB+ desktop drives for datanodes, you can run into problems if these disks start to fail.
ext4 remounts itself read-only on failure, and at this point you'll see the disk drop out of HDFS on the datanode if it's configured to tolerate losing drives, or you'll see the datanode die if disk failures are not allowed. This can be a problem because modern drives often exhibit some bad sectors while still functioning fine for the most part, and it's work-intensive to fsck these disks and restart the datanode.
Another problem is computations through YARN/MapReduce. These also write intermediate data to the disks, and if this data gets corrupted or can't be written you'll run into errors. I'm not sure if YARN/MapReduce also checksum their temporary files - I think it's implemented, though.
ZFS and btrfs provide some resilience against these errors on modern drives, as they are able to deal better with corrupted metadata and avoid lengthy fsck checks due to internal checksumming.
I'm running a Hadoop cluster on ZFS (just JBOD with LZ4) with lots of disks that exhibit some bad sectors and that are out of warranty but still perform well, and it works fine despite these errors.
If you can replace the faulty disks instantly it does not matter much. If you need to live with partly broken disks ZFS/btrfs will buy you some time before having to replace the disks.
COW is not needed because Hadoop takes care of replication and security. Compression can be useful if you store your data uncompressed on the cluster. LZ4 in ZFS should not incur a performance penalty and can speed up sequential reads (like the ones HDFS and MapReduce do).
Performance
The case against RAID is that at least MapReduce implements something similar. HDFS can read and write concurrently to all of the disks, and usually multiple map and reduce jobs are running that can each use a whole disk for writing and reading their data.
If you put RAID or striping below Hadoop, these jobs all have to enqueue their reads and writes to the single RAID controller, and overall it's likely slower.
Depending on your jobs it can make sense to use something like RAID-0 for pairs of disks, but first verify that sequential read or write is really the bottleneck for your job (and not the network, HDFS replication, CPU, ..) so that what you are doing is worth the work and the hassle.

The memory consumption of hadoop's namenode?

Can anyone give a detailed analysis of the memory consumption of the namenode? Or is there some reference material? I cannot find any material online. Thank you!
I suppose the memory consumption depends on your HDFS setup: on the overall size of the HDFS, relative to the block size.
From the Hadoop NameNode wiki:
Use a good server with lots of RAM. The more RAM you have, the bigger the file system, or the smaller the block size.
From https://twiki.opensciencegrid.org/bin/view/Documentation/HadoopUnderstanding:
Namenode: The core metadata server of Hadoop. This is the most critical piece of the system, and there can only be one of these. This stores both the file system image and the file system journal. The namenode keeps all of the filesystem layout information (files, blocks, directories, permissions, etc) and the block locations. The filesystem layout is persisted on disk and the block locations are kept solely in memory. When a client opens a file, the namenode tells the client the locations of all the blocks in the file; the client then no longer needs to communicate with the namenode for data transfer.
the same site recommends the following:
Namenode: We recommend at least 8GB of RAM (minimum is 2GB RAM), preferably 16GB or more. A rough rule of thumb is 1GB per 100TB of raw disk space; the actual requirements is around 1GB per million objects (files, directories, and blocks). The CPU requirements are any modern multi-core server CPU. Typically, the namenode will only use 2-5% of your CPU.
As this is a single point of failure, the most important requirement is reliable hardware rather than high performance hardware. We suggest a node with redundant power supplies and at least 2 hard drives.
For a more detailed analysis of memory usage, check this link out:
https://issues.apache.org/jira/browse/HADOOP-1687
You also might find this question interesting: Hadoop namenode memory usage
There are several technical limits to the NameNode (NN), and facing any of them will limit your scalability.
Memory. The NN consumes about 150 bytes per block. From that you can calculate how much RAM you need for your data. There is a good discussion here: Namenode file quantity limit.
IO. The NN does 1 IO for each change to the filesystem (create, delete block, etc.), so your local IO must be able to keep up. It is harder to estimate how much you need. Given that the number of blocks is already limited by memory, you will not hit this limit unless your cluster is very big. If it is, consider SSD.
CPU. The Namenode has considerable load keeping track of the health of all blocks on all datanodes. Each datanode periodically reports the state of all its blocks. Again, unless the cluster is very big, it should not be a problem.
Example calculation
200 node cluster
24TB/node
128MB block size
Replication factor = 3
How much NameNode memory is required?
# blocks = 200*24*2^20/(128*3)
~13 million blocks
~13,000 MB memory (at roughly 1GB per million blocks).
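The arithmetic in this example is easy to check in Python. Evaluating the formula exactly as written (with 2^20 MB per TB) gives about 13.1 million blocks, and the rule of thumb of roughly 1GB of heap per million blocks then gives about 13GB. The variable names are only for illustration.

```python
nodes = 200
tb_per_node = 24
block_size_mb = 128
replication = 3

# Raw cluster capacity in MB (1 TB = 2**20 MB).
raw_capacity_mb = nodes * tb_per_node * 2**20

# Each logical block occupies block_size * replication of raw capacity.
blocks = raw_capacity_mb // (block_size_mb * replication)

# Rule of thumb from the discussion: about 1 GB of heap per million blocks.
heap_gb = blocks / 1_000_000

print(f"{blocks:,} blocks, about {heap_gb:.1f} GB of NameNode heap")
```

This assumes every disk is completely full of full-size blocks, so it is an upper bound on the block count for data of that size. Many small files would raise the object count well beyond it.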
I guess we should make the distinction between how namenode memory is consumed by each namenode object and general recommendations for sizing the namenode heap.
For the first case (consumption), AFAIK, each namenode object takes about 150 bytes of memory on average. Namenode objects are files, blocks (not counting the replicated copies) and directories. So for a file taking 3 blocks this is 4 (1 file and 3 blocks) x 150 bytes = 600 bytes.
For the second case, the recommended heap size for a namenode, it is generally recommended that you reserve 1GB per 1 million blocks. If you calculate this at 150 bytes per block, 1 million blocks come to about 150MB of memory. That is much less than the 1GB per 1 million blocks, but you should also take into account the number of files and directories.
I guess it is a safe side recommendation. Check the following two links for a more general discussion and examples:
Sizing NameNode Heap Memory - Cloudera
Configuring NameNode Heap Size - Hortonworks
Namenode Memory Structure Internals
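The two figures from this answer (about 150 bytes per namenode object, versus the 1GB per million blocks heap recommendation) can be compared with a few lines of Python. The 150-byte figure is the approximate average quoted above, not an exact constant, and the function name is made up for the example.

```python
BYTES_PER_OBJECT = 150  # approximate average quoted in the answer


def namenode_object_bytes(files, blocks, directories=0):
    """Estimate the memory held by namenode objects (replicas are not counted)."""
    return (files + blocks + directories) * BYTES_PER_OBJECT


one_file = namenode_object_bytes(files=1, blocks=3)
million_blocks = namenode_object_bytes(files=0, blocks=1_000_000)

print(one_file, "bytes for one 3-block file")
print(million_blocks / 10**6, "MB for a million blocks")
```

The gap between roughly 150MB of object data and the recommended 1GB of heap is the safety margin the answer refers to. It leaves room for file and directory objects and for the JVM's own overhead.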

Resources