Where does input data get stored initially? - hadoop

The first step in MapReduce is to copy the input file(s) to HDFS.
I want to know where exactly this gets stored: on the name node, on a data node, or somewhere else?
When we say "copy to HDFS", where exactly do we store the input files initially?
(I know they are later split and stored on data nodes.)
Or do we copy chunks directly from the source/input machine to the data nodes? (I am sure that is not the case.)

Putting files into HDFS is a coordinated effort between the client, the Name Node, and the Data Nodes. At a very high level, the client asks the Name Node to identify the Data Nodes where the file's blocks should be stored. The client then streams each block to the first Data Node in that block's pipeline, and the subsequent transfers that replicate the block happen from that Data Node onwards.
Read the detailed protocol from here.
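As a minimal sketch of the client side of this (the paths are placeholders, not from the original question), the Hadoop FileSystem API hides that pipeline entirely: the client just copies a local file to HDFS, and the Name Node / Data Node coordination described above happens underneath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // including fs.defaultFS (the Name Node address).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The client asks the Name Node for target Data Nodes, then streams
        // each block to the first Data Node of its pipeline; replication of
        // the block is handled Data Node to Data Node.
        fs.copyFromLocalFile(new Path("/local/input.txt"),      // placeholder local path
                             new Path("/user/hadoop/input/"));  // placeholder HDFS path
        fs.close();
    }
}
```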

How does a Spark task access HDFS?

Suppose that
input to a Spark application is a 1GB text file on HDFS,
HDFS block size is 16MB,
Spark cluster has 4 worker nodes.
In the first stage of the application, we read the file from HDFS by sc.textFile("hdfs://..."). Since the block size is 16MB, this stage will have 64 tasks (one task per partition/block). These tasks will be dispatched to the cluster nodes. My questions are:
Does each individual task fetch its own block from HDFS, or does the driver fetch the data for all tasks before dispatching them and then send the data to the nodes?
If each task fetches its own block from HDFS by itself, does it ask HDFS for a specific block, or does it fetch the whole file and then process its own block?
Suppose that HDFS doesn't have a copy of the text file on one of the nodes, say node one. Does HDFS make a copy of the file on node one the first time a task from node one asks for a block of the file? If not, does it mean that each time a task asks for a block of the file from node one, it has to wait for HDFS to fetch data from other nodes?
Thanks!
In general, Spark's access to HDFS is probably as efficient as you think it should be. Spark uses Hadoop's FileSystem object to access data in HDFS.
Does each individual task fetch its own block from HDFS, or does the driver fetch the data for all tasks before dispatching them and then send the data to the nodes?
Each task fetches its own block from HDFS.
If each task fetches its own block from HDFS by itself, does it ask HDFS for a specific block, or does it fetch the whole file and then process its own block?
It pulls a specific block. It does not scan the entire file to get to the block.
Suppose that HDFS doesn't have a copy of the text file on one of the nodes, say node one. Does HDFS make a copy of the file on node one the first time a task from node one asks for a block of the file? If not, does it mean that each time a task asks for a block of the file from node one, it has to wait for HDFS to fetch data from other nodes?
Spark will attempt to assign tasks based on the location preferences of the partitions in the RDD. In the case of a HadoopRDD (which you get from sc.textFile), the location preference for each partition is the set of datanodes that hold the block locally. If a task cannot be run local to its data, it will run on a different node, and the block will be streamed from a datanode that has it to the task that is processing it.
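As a small illustrative sketch (the HDFS path is a placeholder), the one-partition-per-block behaviour is visible directly from the RDD's partition count; with a 1GB file and 16MB blocks you would expect 64 partitions, and hence 64 tasks:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class BlockPartitions {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("block-partitions");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder path: a 1GB file with a 16MB block size
        // has 64 HDFS blocks, hence 64 partitions / tasks.
        JavaRDD<String> lines = sc.textFile("hdfs:///user/hadoop/input/big.txt");
        System.out.println("Partitions: " + lines.getNumPartitions());

        // Each task reads only its own block; the scheduler prefers to run
        // the task on a datanode that holds that block locally.
        System.out.println("Lines: " + lines.count());

        sc.stop();
    }
}
```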

Why does the datanode send block location information to the namenode?

On https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html it says:
the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
But why is this information sent to the namenode and its standby counterpart? I thought this information was already contained in the namenode's fsimage. The namenode should know where it put the blocks.
The Name Node holds the metadata of the entire cluster: the details of each folder and file, the replication factor, the block names, etc. The Name Node also keeps the location of the blocks for each file in memory; this information is not persisted in the fsimage but is reconstructed from the Block Reports sent by the Data Nodes.
Data Nodes store the following information for each block:
Actual data stored in the block
Metadata for the data stored in the block, mainly checksums for the block's data
They periodically send heartbeats and block reports to the Name Node.
Heartbeat:
The heartbeat interval is determined by the configuration parameter dfs.heartbeat.interval (in hdfs-site.xml). By default it is set to 3 seconds.
Some of the information contained in the heartbeat is:
Registration: Data node registration information
Capacity: Total storage capacity available at Data Node
dfsUsed: Storage used by HDFS
remaining: Remaining storage available for HDFS
blockPoolUsed: Storage used by the block pool
xmitsInProgress: Number of transfers from this Data Node to others
xceiverCount: Number of active transceiver threads
cacheCapacity: Total cache capacity available at Data Node
cacheUsed: Amount of cache used
This information is used by the Name Node in the following ways:
Health of the Data Node: Should this data node be marked as dead or alive?
Registration of new Data Node: If this is a newly added Data Node, its information is registered
Update the metrics of the Data Node: The information sent in the heart beat is used for updating the metrics of the node
Issue commands to the Data Node: The Name Node can issue following commands to the Data Node, based on the information received in the heart beat: BlockRecoveryCommand (to recover specified blocks), BlockCommand (for transferring blocks to another Data Node, for invalidating certain blocks), Cache/Uncache (commands for caching / uncaching the blocks)
Block Reports:
The block report interval is determined by the configuration parameter dfs.blockreport.intervalMsec (in hdfs-site.xml). By default it is set to 21600000 milliseconds (6 hours).
Some of the information contained in the block report is:
Registration: Data node registration information
blocks: Information about the blocks, containing: block ID, block length, block generation timestamp, state of the block replica (e.g. whether the replica is finalized or waiting to be recovered)
This information is used by the Name Node for:
Processing the first block report: If it is the first report from a newly registered Data Node, the Name Node simply adds all the valid replicas and ignores the invalid blocks until the next block report.
Updating the information about blocks: The (Data Node -> Blocks) map is updated in the Name Node. The new block report is compared with the old one, and information about successful blocks, corrupted blocks, invalidated blocks, etc. is updated.
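As a small hedged sketch (the defaults shown are the values quoted above), both intervals are ordinary HDFS settings, so they can be read or overridden through Hadoop's Configuration object like any other hdfs-site.xml property:

```java
import org.apache.hadoop.conf.Configuration;

public class ReportIntervals {
    public static void main(String[] args) {
        // Loads core-site.xml / hdfs-site.xml from the classpath, if present.
        Configuration conf = new Configuration();

        // Heartbeat interval in seconds (default 3, as noted above).
        long heartbeatSecs = conf.getLong("dfs.heartbeat.interval", 3L);

        // Full block report interval in milliseconds (default 21600000 = 6 hours).
        long blockReportMs = conf.getLong("dfs.blockreport.intervalMsec", 21600000L);

        System.out.println("Heartbeat interval (s): " + heartbeatSecs);
        System.out.println("Block report interval (ms): " + blockReportMs);
    }
}
```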
The Data Nodes are not directly accessible from outside the cluster; they sit on a private network. A Hadoop cluster is prone to node failures, and the Name Node keeps track of all the data on the different Data Nodes. So any query to the cluster is addressed to the NN, and it provides the block addresses on the DNs.

Can mapper output keys be routed to a specific node in Hadoop MR

I need to process some data in MR and load it into an external system that sits on the same physical machines as my MR nodes. Right now I run the job and read the output from HDFS and re-route individual records back out onto the desired nodes.
Is it possible to define some mapping such that records with key X always go straight to the desired node Y? Simply put, I want to control where hadoop routes post-sorted partitioned groups.
Not easily. The only way I know of to affect the physical location of a block of data on the fly is to implement a custom BlockPlacementPolicy. I'll just throw out some ideas for your use case.
A custom BlockPlacementPolicy can route blocks based on the file name
The file name of a partition can be modified using MultipleOutputs in MapReduce
Keys can be routed to specific partitions using a custom Partitioner
It seems like you can get the result you're looking for, but it won't be pretty.
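As a hedged sketch of the Partitioner idea (the route-by-first-letter rule is made up for illustration), a custom Partitioner only controls which reducer, and therefore which output partition/file, a key lands in; pinning that partition's blocks to a physical node would still need the custom BlockPlacementPolicy mentioned above:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative only: all keys starting with the same letter go to the same
// reducer, and therefore into the same output file (part-r-0000N).
public class FirstLetterPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        char first = Character.toUpperCase(key.toString().charAt(0));
        return first % numPartitions;
    }
}
```

It would be wired into the job with job.setPartitionerClass(FirstLetterPartitioner.class), typically together with MultipleOutputs so that the resulting file names carry the routing information a custom BlockPlacementPolicy could then act on.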

How is data written to HDFS?

I'm trying to understand how data writing is managed in HDFS by reading the hadoop-2.4.1 documentation.
According to the following schema:
whenever a client writes something to HDFS, it has no contact with the namenode and is in charge of chunking and replication. I assume that in this case the client is a machine running an HDFS shell (or equivalent).
However, I don't understand how this is managed.
Indeed, according to the same documentation :
The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Is the schema presented above correct? If so,
is the namenode only informed of new files when it receives a Blockreport (which can take time, I suppose)?
why does the client write to multiple nodes?
If this schema is not correct, how does file creation work with HDFS?
As you said, DataNodes are responsible for serving read/write requests and for block creation, deletion, and replication.
They regularly send "Heartbeats" (a state-of-health report) and "Block Reports" (the list of blocks on the DataNode) to the NameNode.
According to this article:
Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP handshake, ... Every tenth heartbeat is a Block Report, where the Data Node tells the Name Node about all the blocks it has.
So block reports are done every 30 seconds. I don't think this affects Hadoop jobs, because in general they are independent jobs.
For your question:
why does the client write to multiple nodes ?
Actually, the client writes to just one datanode and tells it to forward the data to the other datanodes (see the picture at this link: CLIENT START WRITING DATA), but this is transparent. That's why your schema makes it look as if the client is the one writing to multiple nodes.
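As a minimal hedged sketch (the path is a placeholder), the client-side API reflects this: the client writes a single output stream, and the datanode-to-datanode replication pipeline is handled transparently:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The client streams each block to the first datanode of that block's
        // pipeline; that datanode forwards the data to the next replica, and
        // so on. The client itself does not write to multiple nodes.
        Path out = new Path("/user/hadoop/demo/output.txt");  // placeholder path
        try (FSDataOutputStream stream = fs.create(out, true)) {
            stream.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```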

HDFS: How to distribute small files across data nodes?

I have a very large number of small files to be stored in HDFS. Based on the file name, I want to store them on different data nodes, so that file names starting with certain letters go to specific data nodes. How can I do this in Hadoop?
Not a very good choice. Reasons:
Hadoop is not very good at handling a very large number of small files.
Storing one complete file in a single node is against one of the fundamental principles of HDFS, distributed storage.
I would like to know what benefit you will get from this approach.
In response to your comment:
HDFS doesn't do any kind of sorting like HBase does. When you put a file into HDFS, it first gets split into blocks and then gets stored (each block on a different node). So there is no such thing as sending a whole file to a single node; your file's blocks reside on multiple nodes.
What you could do is create a directory hierarchy as per your needs and store the files in those directories (in case your intention is to fetch the files directly based on their location). For example,
/dirA
/dirA/A.txt
/dirA/B.txt
/dirB
/dirB/P.txt
/dirB/Q.txt
/dirC
/dirC/Y.txt
/dirC/Z.txt
But if you really want to send the blocks of a particular file to specific nodes, then you need to implement your own block placement policy, which is not very easy. See this for more details.
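As a small hedged sketch of the directory-hierarchy approach (the bucketing rule and paths are made up for illustration), each small file could be routed into a per-letter directory at upload time; note that this only groups files by name for retrieval, while the physical placement of their blocks is still decided by HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BucketByFirstLetter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical local files to upload.
        String[] localFiles = {"/tmp/A.txt", "/tmp/B.txt", "/tmp/P.txt", "/tmp/Y.txt"};

        for (String local : localFiles) {
            Path src = new Path(local);
            char first = Character.toUpperCase(src.getName().charAt(0));

            // e.g. A.txt -> /dirA/A.txt, P.txt -> /dirP/P.txt
            Path destDir = new Path("/dir" + first);
            fs.mkdirs(destDir);  // no-op if the directory already exists
            fs.copyFromLocalFile(src, new Path(destDir, src.getName()));
        }
        fs.close();
    }
}
```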
