How to put the reduce partitions onto designated machines in a Hadoop cluster?

For example:
reduce results: part-00000, part-00001 ... part-00008,
the cluster has 3 datanodes and I want to
put the part-00000, part-00001 and part-00002 to the slave0
put the part-00003, part-00004 and part-00005 to the slave1
put the part-00006, part-00007 and part-00008 to the slave2
How can I do that?

It doesn't work like that. A file in HDFS is not stored on any specific datanode. Each file is composed of blocks, and each block is replicated to multiple nodes (3 by default). So each file is actually stored across different nodes, since the blocks that compose it are stored on different nodes.
Quoting the official documentation, which I advise you to read:
HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Seeing the partition tag in your question, it may be worth stating that the Partitioner defines in which partition (not datanode) each key will end up. For example, knowing that you have 9 reduce tasks (9 partitions), you may wish to split the workload of those tasks evenly. To do that, you can define, e.g., that keys starting with the letter "s" should be sent to partition 0, keys starting with "a" or "b" to partition 1, and so on (just a contrived example to illustrate what a partitioner does).
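To make the idea concrete, here is a small Python sketch of what a partitioner computes (purely illustrative; Hadoop's real Partitioner is a Java class, and its default HashPartitioner does roughly `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`):

```python
def hash_partition(key: str, num_partitions: int) -> int:
    # Default-style behavior: spread keys over partitions by hash.
    return (hash(key) & 0x7FFFFFFF) % num_partitions

def letter_partition(key: str, num_partitions: int) -> int:
    # Custom behavior like the letter-based example above:
    # "s..." keys go to partition 0, "a..."/"b..." keys to partition 1,
    # everything else falls back to hashing.
    if key.startswith("s"):
        return 0
    if key[0] in ("a", "b"):
        return 1
    return hash_partition(key, num_partitions)
```

Either way, the result is a partition number in [0, numReduceTasks), which selects a reduce task, not a machine.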

Related

Do blocks in HDFS have byte-offset information stored in Hadoop?

Consider I have a single File which is 300MB. The block size is 128MB.
So the input file is divided into the following chunks and placed in HDFS.
Block1: 128MB
Block2: 128MB
Block3: 64MB.
Now, does each block's data have byte-offset information contained in it?
That is, do the blocks have the following offset information?
Block1: 0-128MB of file
Block2: 128-256MB of file
Block3: 256-300MB of file
If so, how can I get the byte-offset information for Block2 (that is, that it starts at 128MB) in Hadoop?
This is for understanding purposes only. Any hadoop command-line tools to get this kind of meta data about the blocks?
EDIT
If the byte-offset info is not present, a mapper performing its map task on a block will start consuming lines from the beginning. If the offset information is present, the mapper will skip ahead until it finds the next EOL and then start processing the records.
So I guess byte offset information is present inside the blocks.
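The skip-to-next-EOL behavior described in this EDIT can be sketched in a few lines of Python (a simplified model, not Hadoop's actual LineRecordReader): a reader whose split does not start at offset 0 skips the partial first line, and it finishes the line it has started even past the split boundary, so every line is read by exactly one mapper.

```python
def read_records(data: bytes, split_start: int, split_end: int) -> list:
    """Simplified model of how a mapper reads lines from its split."""
    pos = split_start
    if split_start != 0:
        # Skip the (possibly partial) first line: the previous
        # split's reader is responsible for consuming it.
        nl = data.find(b"\n", split_start)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    while pos < split_end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])  # last line without trailing EOL
            break
        records.append(data[pos:nl])    # may run past split_end on purpose
        pos = nl + 1
    return records
```

For example, with data `b"aaa\nbbbb\ncc\n"` and a split boundary at byte 5 (in the middle of "bbbb"), the first split yields "aaa" and "bbbb", and the second yields only "cc": no line is duplicated or lost.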
Disclaimer: I might be wrong on this one; I have not read much of the HDFS source code.
Basically, datanodes manage blocks, which are just large blobs to them. They know the block id, but that's it. The namenode knows everything, especially the mapping between a file path and all the block ids of that file, and where each block is stored. Each block can be stored in one or more locations, depending on its replication settings.
I don't think you will find a public API to get the information you want from a block id, because HDFS does not need to do the mapping in that direction. Conversely, you can easily find the blocks of a file and their locations. You can try exploring the source code, especially the blockmanager package.
If you want to learn more, this article about the HDFS architecture could be a good start.
You can run hdfs fsck /path/to/file -files -blocks to get the list of blocks.
A Block does not contain offset info, only its length. But you can use LocatedBlocks to get all blocks of a file, and from this you can easily reconstruct the offset at which each block starts.
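The reconstruction is just a running sum over the ordered block lengths. A small Python sketch (the lengths here are illustrative, not fetched from a real HDFS API):

```python
def block_offsets(block_lengths):
    """Given a file's block lengths in order, return (offset, length) pairs."""
    offsets = []
    pos = 0
    for length in block_lengths:
        offsets.append((pos, length))
        pos += length
    return offsets

# The 300 MB file from the question, with a 128 MB block size:
MB = 1024 * 1024
print(block_offsets([128 * MB, 128 * MB, 64 * MB]))
# Block2 starts at byte 134217728 (= 128 MB), Block3 at 268435456 (= 256 MB)
```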

Write Path HDFS

Introduction
Follow-up question to this question.
A File has been provided to HDFS and has been subsequently replicated to three DataNodes.
If the same file is going to be provided again, HDFS indicates that the file already exists.
Based on this answer, a file will be split into blocks of 64MB (depending on the configuration settings). A mapping of the filename to the blocks will be created in the NameNode. The NameNode knows in which DataNodes the blocks of a certain file reside. If the same file is provided again, the NameNode knows that the blocks of this file exist on HDFS and will indicate that the file already exists.
If the content of a file is changed and provided again, does the NameNode update the existing file, or is the check restricted to the mapping of filename to blocks, and in particular the filename? Which process is responsible for this?
Which process is responsible for splitting a file into blocks?
Example Write path:
According to this documentation the Write Path of HBase is as follows:
Possible Write Path HDFS:
file provided to HDFS e.g. hadoop fs -copyFromLocal ubuntu-14.04-desktop-amd64.iso /
FileName checked in FSImage whether it already exists; if so, the message "file already exists" is displayed
file split into blocks of 64MB (depending on configuration setting). Question: name of the process which is responsible for block splitting?
blocks replicated on DataNodes (replication factor can be configured)
Mapping of FileName to blocks (MetaData) stored in EditLog located in NameNode
Question
How does the HDFS' Write Path look like?
If the content of a file is changed and provided again, does the NameNode update the existing file, or is the check restricted to the mapping of filename to blocks, and in particular the filename?
No, it does not update the file. The name node only checks if the path (file name) already exists.
How does the HDFS' Write Path look like?
This is explained in detail in this paper: "The Hadoop Distributed File System" by Shvachko et al. In particular, read Section 2.C (and check Figure 1):
"When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Choice of DataNodes for each block is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in Fig. 1."
NOTE: A book chapter based on this paper is available online too. And a direct link to the corresponding figure (Fig. 1 on the paper and 8.1 on the book) is here.

Reading contents of blocks directly in a datanode

In HDFS, the blocks are distributed among the active nodes/slaves. The contents of the blocks are simple text, so is there any way to read or access the blocks present on each datanode?
As an entire file or to read a single block (say block number 3) out of sequence?
You can read the file via various mechanisms including the Java API but you cannot start reading in the middle of the file (for example at the start of block 3).
Hadoop reads a block of data and feeds each line to the mapper for further processing. Also, the Hadoop client gets the blocks of a file from different DataNodes before concatenating them. So, it should be possible to get the data from a particular block.
The Hadoop client might be a good place to start looking at the code. But HDFS provides a file system abstraction, so it is not clear what the requirement would be for reading the data from a particular block.
Assuming you have ssh access (and appropriate permissions) to the datanodes, you can cd to the path where the blocks are stored and read the blocks stored on that node (e.g., do a cat BLOCK_XXXX). The configuration parameter that tells you where the blocks are stored is dfs.datanode.data.dir, which defaults to file://${hadoop.tmp.dir}/dfs/data. More details here.
Caveat: the block names are coded by HDFS depending on their internal block ID. Just by looking at their names, you cannot know to which file a block belongs.
Finally, I assume you want to do this for debugging purposes or just to satisfy your curiosity. Normally, there is no reason to do this and you should just use the HDFS web-UI or command-line tools to look at the contents of your files.

Hadoop dfs replicate

Sorry guys, just a simple question, but I cannot find the exact question on Google.
What does dfs.replication mean? If I create one file named filmdata.txt in HDFS and set dfs.replication=1, is there just one file in total (one filmdata.txt), or will Hadoop create another replica besides the main file (filmdata.txt)?
Put shortly: if dfs.replication=1, is there one filmdata.txt in total, or two?
Thanks in Advance
The total number of copies of each block in the file system will be what's specified by the dfs.replication factor. So, if you set dfs.replication=1, there will be only one copy of the file in the file system.
Check the Apache Documentation for the other configuration parameters.
To ensure high availability of data, Hadoop replicates the data.
When we store files into HDFS, the Hadoop framework splits each file into a set of blocks (64 MB or 128 MB), and these blocks are replicated across the cluster nodes. The configuration dfs.replication specifies how many replicas are required.
The default value for dfs.replication is 3, but this is configurable depending on your cluster setup.
Hope this helps.
The link provided by Praveen is now broken.
Here is the updated link describing the parameter dfs.replication.
Refer to Hadoop Cluster Setup for more information on configuration parameters.
You may want to note that files can span multiple blocks, and each block will be replicated the number of times specified in dfs.replication (default value is 3). The size of such blocks is specified by the parameter dfs.block.size.
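The storage arithmetic that follows from these two parameters can be sketched in Python (illustrative only; HDFS does this accounting internally):

```python
import math

def hdfs_storage(file_size, block_size=128 * 1024 * 1024, replication=3):
    """Return (num_blocks, total_replicas, total_bytes_stored)
    for a file of file_size bytes under the given settings."""
    num_blocks = math.ceil(file_size / block_size)
    return num_blocks, num_blocks * replication, file_size * replication

MB = 1024 * 1024
# A 300 MB file with 128 MB blocks and dfs.replication=3:
print(hdfs_storage(300 * MB))  # 3 blocks, 9 block replicas, 900 MB stored
# With dfs.replication=1 there is exactly one copy of each block:
print(hdfs_storage(300 * MB, replication=1))
```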
The HDFS framework uses commodity machines to store the data. These commodity machines are not high-end machines like servers with large amounts of RAM, so there is a chance of losing a datanode (d1, d2, d3) or a block (b1, b2, b3). As a result, HDFS replicates each block of data (64 MB or 128 MB) three times by default and stores each replica on a separate datanode (d1, d2, d3). Now, if block b1 gets corrupted on datanode d1, a copy of b1 is still available on datanodes d2 and d3, so the client can request d2 to process the b1 data and provide the result; likewise, if d2 fails, the client can request d3 to process b1. This is what dfs.replication means.
Hope you got some clarity.
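The failover described above can be modeled in a few lines of Python (the data structures here are hypothetical; in reality the NameNode tracks the block-to-datanode mapping):

```python
def pick_replica(block_locations, dead_nodes):
    """Return the first live datanode holding the block, or None if all
    replicas are on failed nodes."""
    for node in block_locations:
        if node not in dead_nodes:
            return node
    return None

# Block b1 is replicated on d1, d2, and d3; d1 has failed:
locations = {"b1": ["d1", "d2", "d3"]}
print(pick_replica(locations["b1"], dead_nodes={"d1"}))  # d2
```

With dfs.replication=1 there is no fallback: losing the single node holding a block loses the data.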

How does CopyFromLocal command for Hadoop DFS work?

I'm a little confused on how the Hadoop Distributed File System is set up and how my particular setup affects it. I used this guide to set it up http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ using two Virtual Machines on Virtual Box and have run the example (just a simple word count with txt file input). So far, I know that the datanode manages and retrieves the files on its node, while the tasktracker analyzes the data.
1) When you use the command -copyFromLocal, are you copying files/input to HDFS? Does Hadoop know how to divide the information between the slaves/master, and how does it do it?
2) In the configuration outlined in the guide linked above, are there technically two slaves (the master acts as both the master and a slave)? Is this common or is the master machine usually only given jobtracker/namenode tasks?
There are a lot of questions asked here.
Question 2)
There are two machines
These machines are configured for HDFS and Map-Reduce.
HDFS configuration requires Namenode (master) and Datanodes (Slave)
Map-reduce requires Jobtracker (master) and Tasktracker (Slave)
Only one Namenode and one Jobtracker are configured, but you can have Datanode and Tasktracker services on both machines. It is not the machine that acts as master or slave; it is just the services. You can have slave services installed on machines that also contain master services, which is fine for a simple development setup. In large-scale deployments, you dedicate master services to separate machines.
Question 1 Part 2)
It is HDFS's job to create the file chunks and store them on multiple data nodes in a replicated manner. You don't have to worry about it.
Question 1 Part 1)
Hadoop file operations are patterned after typical Unix file operations - ls, put, etc.
hadoop fs -put localfile /data/somefile --> will copy a local file to HDFS at path /data/somefile
With the put option you can also read from standard input and write to an HDFS file
copyFromLocal is similar to the put option, except that its behavior is restricted to copying from the local file system to HDFS
See: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#copyFromLocal
1)
The client connects to the name node to register a new file in HDFS.
The name node creates some metadata about the file (either using the default block size, or a configured value for the file)
For each block of data to be written, the client queries the name node for a block ID and list of destination datanodes to write the data to. Data is then written to each of the datanodes.
There is some more information in the Javadoc for org.apache.hadoop.hdfs.DFSClient.DFSOutputStream
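The per-block loop in step 3 can be sketched as a toy simulation (Python, purely illustrative; the real client logic lives in org.apache.hadoop.hdfs.DFSClient, and the datanode-selection function here is hypothetical):

```python
import math

def write_file(file_size, block_size, namenode_pick, replication=3):
    """Toy model of the HDFS write path: for each block, ask the 'name node'
    (namenode_pick) for target datanodes, then record the placement.
    In real HDFS the data is streamed through a pipeline of those nodes."""
    placements = []
    for block_id in range(math.ceil(file_size / block_size)):
        targets = namenode_pick(block_id, replication)  # query per block
        placements.append((block_id, targets))
    return placements

# A hypothetical name node that assigns datanodes round-robin:
nodes = ["d1", "d2", "d3", "d4"]
pick = lambda block_id, r: [nodes[(block_id + i) % len(nodes)] for i in range(r)]
print(write_file(file_size=300, block_size=128, namenode_pick=pick))
# -> [(0, ['d1', 'd2', 'd3']), (1, ['d2', 'd3', 'd4']), (2, ['d3', 'd4', 'd1'])]
```

Note how the choice of datanodes differs per block, matching the paper quoted above.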
2) Some production systems will be configured to make the master its own dedicated node (allowing the maximum possible memory allocation and avoiding CPU contention), but if you have a smaller cluster, then a node which contains both a name node and a data node is acceptable.
