Size of the NAMENODE and DATANODE in Hadoop

What is the size of the NameNode and DataNode in Hadoop, and are a block and a datanode different things or not?
If an input file is 200 MB in size, how many datanodes will be involved and how many blocks will be created?

NameNode, SecondaryNameNode, JobTracker, and DataNode are components of a Hadoop cluster.
Blocks are how data is stored on HDFS.
A datanode can be simply seen as a hard disk (lots of storage, TBs) with RAM and a processor, sitting on the network.
Just like you can store many files on a hard disk, you can do that here as well.
Now if you want to store a 200 MB file on a Hadoop
v1.x system or lower, it will be split into 4 blocks at the default 64 MB block size (the first three of 64 MB, the last of 8 MB);
v2.x system or higher, it will be split into 2 blocks at the default 128 MB block size (one of 128 MB, one of 72 MB).
After this small explanation, I'd refer you to #charles_Babbage's suggestion to go and start with a book or tutorials on YouTube.
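The split arithmetic above can be sanity-checked with a few lines of Python (block sizes are the Hadoop defaults; real clusters may override dfs.blocksize):

```python
import math

def num_blocks(file_size_mb, block_size_mb):
    """Number of HDFS blocks a file occupies; the last block may be smaller."""
    return math.ceil(file_size_mb / block_size_mb)

# 200 MB file on Hadoop v1.x (64 MB default) and v2.x (128 MB default)
print(num_blocks(200, 64))   # 4 blocks: 64 + 64 + 64 + 8 MB
print(num_blocks(200, 128))  # 2 blocks: 128 + 72 MB
```

Note that a block smaller than the block size only occupies its actual size on disk; the last 8 MB block does not waste 64 MB.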

Related

Memory for Namenode(s) in Hadoop

Environment: The production cluster has 2 name-nodes (active and standby), and those nodes use SAS drives in a RAID-1 configuration. These nodes run nothing but the master services (NN and Standby NN). They have 256 GB of RAM, while the data nodes (where most of the processing happens) are set up with only 128 GB.
My Question: Why do Hadoop's master nodes have such high RAM, and not the datanodes, when most of the processing is done where the data is available?
P.S. As per the Hadoop rule of thumb, we only require 1 GB for every 1 million files.
The Namenode stores all file references from all the datanodes in memory.
The datanode process doesn't need much memory, it's the YARN NodeManagers that do
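The 1 GB per million objects rule of thumb from the question can be turned into a rough heap estimate. This is a back-of-the-envelope sketch (the helper name and numbers are illustrative; real sizing also depends on block counts and directory structure):

```python
def namenode_heap_gb(num_objects, gb_per_million=1.0):
    """Rough namenode heap estimate using the common rule of thumb:
    ~1 GB of heap per million filesystem objects (files, directories, blocks)."""
    return num_objects / 1_000_000 * gb_per_million

# A cluster tracking 150 million objects needs roughly 150 GB of heap,
# which is why master nodes are often provisioned with 256 GB of RAM.
print(namenode_heap_gb(150_000_000))  # 150.0
```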

hdfs filesystem difference between namenode and datanode

Please help me with one small query of mine.
When we issue an hdfs dfs command, does it show the filesystem of the namenode or of a datanode?
How can we see the filesystems of the namenode and the datanodes separately?
In my project, when I issue the hdfs dfs -ls command, it shows me files and directories. If I create a file, will it be created by default on a datanode of HDFS's choosing, or somewhere else?
TIA
dfs commands communicate with both the namenode and the datanodes.
The namenode has no "filesystem" for listing content - it's just metadata held in memory. There is, of course, a local disk directory for the namenode, used for checkpointing and backups, but the primary operations run against the in-memory store for quick lookups.
A single datanode holds a subset of file pieces called blocks. The namenode maintains the block locations and how files are collectively built from those blocks throughout the cluster.
Persistence of File System Metadata
Creation of HDFS files cannot target specific datanodes; again, the file is split across many datanodes, and their locations are recorded in the namenode's memory.
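A toy sketch of the namenode's role (purely illustrative - these dicts are not Hadoop's actual data structures): a file maps to an ordered list of block IDs, and each block ID maps to the datanodes holding its replicas.

```python
# Hypothetical in-memory metadata, loosely modeling what the namenode tracks.
file_to_blocks = {
    "/user/data/input.txt": ["blk_1", "blk_2"],
}
block_locations = {
    "blk_1": ["datanode-3", "datanode-7", "datanode-1"],
    "blk_2": ["datanode-2", "datanode-5", "datanode-3"],
}

def locate(path):
    """Answer a client's 'where is this file?' question the way a namenode would:
    return (block_id, replica_locations) pairs in block order."""
    return [(blk, block_locations[blk]) for blk in file_to_blocks[path]]

for blk, nodes in locate("/user/data/input.txt"):
    print(blk, nodes)
```

This is why an hdfs dfs -ls answers instantly from the namenode alone, while reading file contents requires contacting the datanodes that hold the blocks.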

How is a huge amount of data input into Hadoop?

I am new to big data and Hadoop. I would like to know: are the name node, data node, secondary name node, job tracker, and task tracker different systems? If I want to process 1000 PB of data, how is the data divided, who does that task, and where should I input the 1000 PB of data?
Yes, the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker are all separate JVM processes. You can start them all on one physical machine (pseudo/local mode) or on different physical machines (distributed mode). These are all in Hadoop 1.
Hadoop 2 introduced containers with YARN, in which the JobTracker and TaskTracker are removed in favor of the more efficient ResourceManager, ApplicationMaster, NodeManager, etc. You can find more info at hadoop-yarn-site.
Data is stored in HDFS (Hadoop Distributed File System) in blocks, which default to 64 MB. When data is loaded into HDFS, Hadoop distributes it across the cluster using the configured block size. When a job runs, the code is distributed to the nodes in the cluster so that each task processes data where it resides, except in the shuffle and sort phases.
I hope you now have a general idea of how Hadoop and HDFS work. The following are some links for you to start with:
Map Reduce programming
cluster setup
hadoop commands
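For scale, the same block arithmetic applied to the 1000 PB figure in the question, assuming the 64 MB default block size mentioned above (in practice, a cluster this large would use a much bigger block size precisely to keep the namenode's object count down):

```python
PB_IN_MB = 1024 ** 3  # 1 PB = 1024 TB = 1024^2 GB = 1024^3 MB

def total_blocks(data_pb, block_size_mb=64):
    """How many HDFS blocks a dataset of the given size occupies."""
    return data_pb * PB_IN_MB // block_size_mb

# 1000 PB at 64 MB per block: roughly 16.8 billion blocks
print(total_blocks(1000))  # 16777216000
```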

Can we specify the size of DATANODE in hdfs file system

While formatting a DATANODE with the following command:
hdfs dfs datanode -format
is it possible to specify the size of HDFS? I understand horizontal scalability will be impacted.
HDFS is as large as the sum of the datanodes attached to it... so by adding more hardware, you are specifying the size.
It's not like a disk that you can partition (at least, not in the general sense of allocating specific amounts of disk for specific tasks).
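"HDFS is as large as the datanodes attached to it" can be made concrete: usable capacity is roughly the raw disk summed over datanodes, divided by the replication factor. This is a back-of-the-envelope sketch; real capacity is further reduced by non-DFS reserved space and overhead.

```python
def usable_hdfs_tb(datanode_disks_tb, replication=3):
    """Approximate usable HDFS capacity in TB: raw disk summed over all
    datanodes, divided by the replication factor."""
    return sum(datanode_disks_tb) / replication

# 8 datanodes with 12 TB each, at the default replication factor of 3:
# 96 TB raw yields about 32 TB of usable space.
print(usable_hdfs_tb([12] * 8))  # 32.0
```

Growing HDFS therefore means attaching more or larger datanodes, not resizing a fixed volume.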

HADOOP HDFS imbalance issue

I have a Hadoop cluster that has 8 machines, and all 8 machines are data nodes.
A program running on one machine (say machine A) continuously creates sequence files (each about 1 GB) in HDFS.
Here's the problem: all 8 machines have the same hardware and the same capacity. While the other machines still have about 50% free disk space for HDFS, machine A has only 5% left.
I checked the block info and found that almost every block has one replica on machine A.
Is there any way to balance the replicas?
Thanks.
This is the default placement policy. It works well for the typical M/R pattern, where each HDFS node is also a compute node and the writer machines are uniformly distributed.
If you don't like it, there is HDFS-385, "Design a pluggable interface to place replicas of blocks in HDFS." You need to write a class that implements the BlockPlacementPolicy interface, and then set that class as dfs.block.replicator.classname in hdfs-site.xml.
There is a way: you can use the Hadoop command-line balancer tool.
HDFS data might not always be placed uniformly across the datanodes. To spread HDFS data uniformly across the datanodes in the cluster, you can run:
hadoop balancer [-threshold <threshold>]
where threshold is a percentage of disk capacity.
see the following links for details:
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html#Rebalancer
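What the threshold argument means can be sketched as follows: the balancer moves blocks until every datanode's utilization is within the threshold (in percentage points) of the cluster average. This is an illustrative model of the selection criterion, not the actual balancer code:

```python
def needs_balancing(node_used_pct, threshold=10.0):
    """Return the nodes whose disk utilization deviates from the cluster
    average by more than `threshold` percentage points -- these are the
    balancer's candidates for moving blocks off (or onto)."""
    avg = sum(node_used_pct.values()) / len(node_used_pct)
    return {node: pct for node, pct in node_used_pct.items()
            if abs(pct - avg) > threshold}

# Machine A at 95% used while the rest sit near 50%: only A is flagged.
cluster = {"A": 95, "B": 50, "C": 52, "D": 48,
           "E": 50, "F": 51, "G": 49, "H": 50}
print(needs_balancing(cluster, threshold=10))  # {'A': 95}
```

Note that the balancer only redistributes existing blocks; it does not change the placement policy, so new writes from machine A will still land a replica locally until the policy itself is changed.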
