Memory required on NameNode for replicas in Hadoop - hadoop

In this Cloudera blog post, in the Replication section, it has been explained that replication does not consume memory on the NameNode. However, I am skeptical about this because I understand that the NameNode stores information about each file, as well as its replicas, in main memory. How, then, is the memory requirement the same with or without replication?

Well, memory consumption depends on what you mean, because there is physical memory and virtual memory (I am talking about the Namenode only here).
In terms of physical memory, the Cloudera blog is correct, as it is the responsibility of the Datanode to tell the Namenode (when it connects after a restart, for example) which blocks it holds. The Namenode persists only the file-system structure to disk (the fsimage and edits files).
The situation is slightly different when you are talking about virtual memory. You can clearly see from the source code that FSNamesystem (the component responsible for maintaining the FS structure in RAM) has a reference to BlockManager. BlockManager in turn holds a reference to BlocksMap, which according to its documentation does maintain the list of datanodes for each block:
This class maintains the map from a block to its metadata. block's
metadata currently includes blockCollection it belongs to and the
datanodes that store the block.
If you go through the source code of the BlockManager you can clearly see what and where the BlocksMap is being used.
My guess, since the Cloudera guys have experience with large-scale computation and have probably measured the impact, is that the size of this structure is not significant in comparison to the rest of the metadata the Namenode has to take care of.

Related

Impact of reducing HDFS replication factor to 2 (or just one) on HBase map/reduce performance

What is the impact of reducing the HDFS replication factor to 2 (or just one) on HBase map/reduce performance? I have an HBase cluster hosted on Azure VMs with data stored on Azure managed disks. An Azure managed disk itself keeps 3 copies of the data for fault tolerance, so I am thinking of reducing the HDFS replication factor to save on storage overhead. Given that map/reduce jobs make use of local availability of the data to avoid data transfer over the network, I am wondering whether anyone has any information on the impact on map/reduce performance if there is just one replica of the data available.
This is a difficult question to answer as it depends greatly on what workloads you run.
By decreasing the replication factor, you can speed up the performance of write operations, since the data is written to fewer DataNodes. However, as you noted, you may have decreased locality since it can be more difficult to find a node which has a replica and has free space to execute a task.
Keeping only a single replica can have strong implications on the impact of a single node failure. If a single node dies, all of its data will be unavailable until you restart a new node with the same Azure managed disks. If there are multiple HDFS replicas, data availability is maintained throughout.
Running HDFS DataNodes on top of Azure managed disks sounds like a bit of a bad idea. In addition to breaking some of the core HDFS assumptions ("my disk might fail at any time"), it seems unlikely that you have true data locality if your data is stored in three replicas. I wonder if you have considered:
Using a non-managed disk service. Does Azure provide a way to use a disk which is not replicated? This is much closer to how HDFS is intended to be used.
Storing data in Azure storage (WASB or ADLS) instead of HDFS. This is a more "cloud native" way of running things. If you find that performance is lacking, you can use HDFS for intermediate data and only store final data in Azure. HDFS also provides a way to cache data from external storage systems by using Provided Storage.

Distributed Cache Concept in Hadoop

My question is about the concept of distributed cache specifically for Hadoop and whether it should be called distributed Cache. A conventional definition of distributed cache is "A distributed cache spans multiple servers so that it can grow in size and in transactional capacity".
This is not true in Hadoop, as the Distributed Cache is distributed to all the nodes that run the tasks, i.e. every node gets the same file mentioned in the driver code.
Shouldn't this be called a replicative cache? The intersection of the cache contents on all nodes should be null (or close to it) if we go by the conventional distributed cache definition. But for Hadoop the result of the intersection is the same file, which is present on all nodes.
Is my understanding correct, or am I missing something? Please guide.
Thanks
The general idea of any cache is to make data available in memory and avoid hitting the disk to read it, because reading data from disk is a costlier operation than reading from memory.
Now let's apply the same analogy to the Hadoop ecosystem. Here the disk is your HDFS and the memory is the local file system of the node where the actual tasks run. During the life cycle of an application, multiple tasks may run on the same node. When the first task is launched on a node, it fetches the data from HDFS and puts it on the local file system. Subsequent tasks on the same node will not fetch the same data again. That saves the cost of getting the data from HDFS compared with getting it from the local file system. This is the concept of the Distributed Cache in the MapReduce framework.
The size of the data is usually small enough that it can be loaded in the Mapper memory, usually in few MBs.
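The "fetch once per node, reuse for later tasks" behaviour described above can be sketched as a toy model. This is only an illustration under stated assumptions: `ToyHdfs`, `Node`, `run_task` and the file path are hypothetical names invented here, not part of any Hadoop API.

```python
class ToyHdfs:
    """Stand-in for HDFS that counts how often a file is read remotely."""

    def __init__(self, files):
        self.files = dict(files)
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self.files[path]


class Node:
    """Hypothetical worker node; local_files models its local file system."""

    def __init__(self, name):
        self.name = name
        self.local_files = {}


def run_task(node, hdfs, cache_path):
    # Only the first task on a node pays for the HDFS read;
    # later tasks on the same node find the local copy.
    if cache_path not in node.local_files:
        node.local_files[cache_path] = hdfs.read(cache_path)
    return node.local_files[cache_path]


hdfs = ToyHdfs({"/lookup/countries.txt": "IN,India\nUS,United States\n"})
node_a, node_b = Node("a"), Node("b")

# Three tasks land on node a, one on node b.
for node in (node_a, node_a, node_a, node_b):
    run_task(node, hdfs, "/lookup/countries.txt")
```

Four tasks ran, but the file was read from the toy HDFS once per node. Both nodes end up holding the same file, which is the "replicated" behaviour the question points out.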
I too agree that it's not really a "distributed cache". But I am convinced by YoungHobbit's comments on the efficiency of not hitting the disk for IO operations.
The only merit I have seen in this mechanism as per Apache documentation:
The framework will copy the necessary files on to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.
Please note that DistributedCache has been deprecated since the 2.6.0 release. You have to use the new APIs in the Job class to achieve the same functionality.

Need of maintaining replication factor on datanodes

Please pardon if this question has come up earlier as I'm not able to find any related question for this.
1) I want to know the reason why it is important to maintain the same replication factor(or for that matter any configuration) across the datanodes and namenodes in the cluster?
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
3) Wouldn't maintaining the configuration only on the namenodes suffice?
4) What are the implications of having the configuration different across namenode and datanodes?
Any Help is much appreciated. Thank you! :)
I will try to answer your question taking replication as an example.
A few things to keep in mind:
Data always resides on the datanodes. The Namenode never deals with or stores data; it only keeps metadata about the data.
The replication factor is configurable, and you can change it for every file copy. For example, file1 may have a replication factor of 2 while file2 has a replication factor of 3. In a similar way, some other properties can also be configured at execution time.
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
I am not sure what exactly you mean by the namenode managing the storage. Here is how a file upload to HDFS gets executed:
1) The client sends a request to the Namenode for a file upload to HDFS.
2) The Namenode, based on the configuration (if not explicitly specified by the client application), calculates the number of blocks the data will be broken into.
3) The Namenode also decides which Datanodes will store the blocks, based on the replication factor specified in the configuration (if not explicitly specified by the client application).
4) The Namenode sends the information calculated in steps #2 and #3 to the client.
5) The client application breaks the file into blocks and writes each block to one Datanode, say DN1.
6) DN1 is then responsible for replicating the received blocks to the other Datanodes chosen by the Namenode in #3; it initiates replication when the Namenode instructs it.
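The six steps above can be followed in a small toy simulation. It is a sketch of the flow as this answer describes it, not of the real HDFS protocol: `plan_upload`, `write_file`, the round-robin placement and the Datanode names are assumptions made for illustration (real HDFS placement is rack-aware), and the sketch assumes the replication factor does not exceed the number of Datanodes.

```python
import math


def plan_upload(file_size_mb, block_size_mb, replication, datanodes):
    """Steps 2-4: number of blocks and the target Datanodes for each block."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    plan = []
    for i in range(num_blocks):
        # Round-robin choice of targets, purely illustrative.
        targets = [datanodes[(i + k) % len(datanodes)] for k in range(replication)]
        plan.append(targets)
    return plan


def write_file(plan):
    """Steps 5-6: the client writes to the first target, which passes the block on."""
    stored = {}
    for block_id, targets in enumerate(plan):
        first, rest = targets[0], targets[1:]
        stored.setdefault(first, []).append(block_id)   # client -> first Datanode
        for dn in rest:                                  # first Datanode -> replicas
            stored.setdefault(dn, []).append(block_id)
    return stored


# A 300 MB file with 128 MB blocks and replication factor 3 on four Datanodes.
plan = plan_upload(300, 128, 3, ["DN1", "DN2", "DN3", "DN4"])
stored = write_file(plan)
```

Here the file is split into three blocks, and nine block replicas in total are spread over the four Datanodes. The client only ever talks to the first Datanode of each block.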
For your questions #3 and #4, it is important to understand that any distributed application requires a set of configurations to be available on each node, so that the nodes can interact with each other and perform their designated tasks as expected. If every node chose to have its own configuration, what would be the basis of coordination? If DN1 has a replication factor of 5 while DN2 has 2, how would the data actually be replicated?
Update start
hdfs-site.xml contains lots of other config specifications as well for namenode, datanode and secondary namenode, some client and hdfs specific settings and not just the replication factor.
Now imagine having a 50 node cluster, would you like to go and configure on each node or simply copy a pre-configured file?
Update end
If you keep all configurations at one location, each node will need to connect to that shared resource to load the configuration every time it has to perform an action. This would add latency, apart from the consistency/synchronization issues that arise whenever a config property is changed.
Hope this helps.
Hadoop is designed to deal with large datasets. It's not a good idea to store a large dataset on a single machine because if your storage system or hard disk crashes, you may lose all of your data.
Before Hadoop, people were using traditional systems to store large amounts of data, but these systems were very costly. There were also challenges in analyzing large datasets, as reading data from a traditional system was a time-consuming process. With these things in mind, the Hadoop framework was designed.
In the Hadoop framework, when you load large amounts of data, it splits the data into small chunks, known as blocks. These blocks are used to place the data onto datanodes in a distributed cluster, and they are also used during the analysis of the data.
The reason behind splitting the data is parallel processing and distributed storage (i.e. you can store your data on multiple machines, and when you want to analyze it you can do so in parallel).
Now Coming to your questions:
Reason: Hadoop is a framework which allows distributed storage and computing. In other words, you can store the data on multiple machines. It also provides replication, which means you keep multiple copies (based on the replication factor) of the same data.
Ans1: Hadoop is designed to run on commodity hardware, and failures are common on commodity hardware. If you store the data on a single machine and that machine crashes, you will lose all of your data. In a Hadoop cluster you can recover the data from another replica (if you have a replication factor greater than 1), as Hadoop doesn't store a replicated copy of the data on the same machine where the original copy resides. These things are handled by Hadoop itself.
Ans2: When you upload a file to HDFS, your actual data goes to the datanodes and the NameNode keeps the metadata about your data. The NameNode metadata contains things like the block name, block location, filename and the directory location of the file.
Ans3: You need to maintain the entire configuration related to your Hadoop cluster. Maintaining one configuration file is not sufficient, and you may face other problems later.
Ans4: NameNode configuration properties are related to NameNode functionality, like the namespace services, the metadata location, and the RPC address that handles all client requests. Datanode configuration properties are related to the services performed by the DataNode, like storage balancing among the DataNode's volumes, available disk space, and the DataNode server address and port for data transfer.
Please check this link to understand more about the different configuration property.
Please clarify questions 3 and 4 further if there is something more you want to know.

Hadoop namenode disk size

Are there any suggestions about size of HDD on namenode physical machine? Sure, it does not store any data from HDFS like datanode but what should I depend on while creating cluster?
Physical disk space on the NameNode does not really matter unless you run a Datanode on the same node. However, it is very important to have good memory (RAM) space allocated to the NameNode. This is because the NameNode stores all the metadata of the HDFS (block allocations, block locations etc.), in memory. If sufficient memory is not allocated, the NameNode might run out of memory and fail.
You might need some space to actually store the NameNode's FSImage, edits file and other relevant files.
It's actually recommended to configure NameNode to use multiple directories (one local and other NFS mount), so that multiple copies of File System metadata will be stored. That way, as long as the directories are on separate disks, a single disk failure will not corrupt the meta-data.
Please see this link for more details.
We're hearing from Cloudera that they recommend faster disks for name nodes: a combination of SSD and 10K RPM SAS drives rather than typical 2TB, 7,200 RPM SAS drives. Does this sound reasonable, or is it overkill, since everything else I've read suggests that you don't really need expensive high-speed storage for Hadoop?

The memory consumption of hadoop's namenode?

Can anyone give a detailed analysis of the memory consumption of the namenode? Or is there some reference material? I cannot find any material online. Thank you!
I suppose the memory consumption depends on your HDFS setup: it depends on the overall size of the HDFS and is relative to the block size.
From the Hadoop NameNode wiki:
Use a good server with lots of RAM. The more RAM you have, the bigger the file system, or the smaller the block size.
From https://twiki.opensciencegrid.org/bin/view/Documentation/HadoopUnderstanding:
Namenode: The core metadata server of Hadoop. This is the most critical piece of the system, and there can only be one of these. This stores both the file system image and the file system journal. The namenode keeps all of the filesystem layout information (files, blocks, directories, permissions, etc) and the block locations. The filesystem layout is persisted on disk and the block locations are kept solely in memory. When a client opens a file, the namenode tells the client the locations of all the blocks in the file; the client then no longer needs to communicate with the namenode for data transfer.
The same site recommends the following:
Namenode: We recommend at least 8GB of RAM (minimum is 2GB RAM), preferably 16GB or more. A rough rule of thumb is 1GB per 100TB of raw disk space; the actual requirements is around 1GB per million objects (files, directories, and blocks). The CPU requirements are any modern multi-core server CPU. Typically, the namenode will only use 2-5% of your CPU.
As this is a single point of failure, the most important requirement is reliable hardware rather than high performance hardware. We suggest a node with redundant power supplies and at least 2 hard drives.
For a more detailed analysis of memory usage, check this link out:
https://issues.apache.org/jira/browse/HADOOP-1687
You also might find this question interesting: Hadoop namenode memory usage
There are several technical limits to the NameNode (NN), and hitting any of them will limit your scalability.
Memory. The NN consumes about 150 bytes per block. From this you can calculate how much RAM you need for your data. There is a good discussion: Namenode file quantity limit.
IO. The NN does 1 IO for each change to the filesystem (create, delete block, etc.), so your local IO should allow for that. It is harder to estimate how much you need. Since the number of blocks is already limited by memory, you will not hit this limit unless your cluster is very big. If it is, consider SSD.
CPU. The Namenode has a considerable load keeping track of the health of all blocks on all datanodes. Each datanode periodically reports the state of all its blocks. Again, unless the cluster is very big, it should not be a problem.
Example calculation
200 node cluster
24TB/node
128MB block size
Replication factor = 3
How much memory is required?
# blocks = 200*24*2^20/(128*3)
~13 million blocks
~13,000 MB memory (at roughly 1 GB per million blocks).
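The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: `estimate_blocks` is a hypothetical helper, and the final step assumes the rule of thumb of roughly 1 GB of heap per million blocks. Taken literally with 2^20 MB per TB, the formula gives about 13.1 million blocks; rounding a TB to 10^6 MB gives 12.5 million.

```python
def estimate_blocks(nodes, tb_per_node, block_size_mb, replication, mb_per_tb=2**20):
    """Number of unique blocks if the whole raw capacity is filled."""
    raw_mb = nodes * tb_per_node * mb_per_tb
    # Each unique block occupies block_size_mb * replication of raw disk.
    return raw_mb // (block_size_mb * replication)


blocks = estimate_blocks(nodes=200, tb_per_node=24, block_size_mb=128, replication=3)
blocks_decimal = estimate_blocks(200, 24, 128, 3, mb_per_tb=10**6)

# Assumed rule of thumb: about 1 GB of NameNode heap per million blocks.
heap_gb = blocks / 1_000_000

print(blocks, blocks_decimal, round(heap_gb, 1))
```

Note that the replication factor divides the block count: replicas consume raw disk, but only unique blocks are counted as namenode objects here.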
I guess we should make the distinction between how namenode memory is consumed by each namenode object and the general recommendations for sizing the namenode heap.
For the first case (consumption), AFAIK, each namenode object takes an average of 150 bytes of memory. Namenode objects are files, blocks (not counting the replicated copies) and directories. So for a file taking 3 blocks this is 4 (1 file and 3 blocks) x 150 bytes = 600 bytes.
For the second case, the recommended heap size for a namenode, it is generally recommended that you reserve 1GB per 1 million blocks. If you calculate this (150 bytes per block) you get 150MB of memory consumption. You can see this is much less than the 1GB per 1 million blocks, but you should also take into account the number of files and directories.
I guess it is a safe-side recommendation. Check the following links for a more general discussion and examples:
Sizing NameNode Heap Memory - Cloudera
Configuring NameNode Heap Size - Hortonworks
Namenode Memory Structure Internals
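The per-object arithmetic from this answer can be written down as a short sketch. `namespace_bytes` is a hypothetical helper, and the 150-byte figure is the average quoted above, not a measured value.

```python
BYTES_PER_OBJECT = 150  # average per namenode object, as quoted above


def namespace_bytes(files, blocks, directories=0):
    """Rough memory estimate: files, blocks and directories each count as one object."""
    return (files + blocks + directories) * BYTES_PER_OBJECT


# One file made of 3 blocks: 4 objects.
one_file = namespace_bytes(files=1, blocks=3)

# One million blocks on their own, expressed in MB (10^6 bytes).
million_blocks_mb = namespace_bytes(files=0, blocks=1_000_000) / 10**6

print(one_file, million_blocks_mb)
```

Comparing the 150 MB estimate with the recommended 1 GB per million blocks shows how much headroom the heap-sizing rule leaves for files, directories and everything else the namenode keeps in memory.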

Resources