Let's assume one is using the default block size (128 MB), and there is a file using 130 MB; so it uses one full-size block and one block with 2 MB. Then 20 MB needs to be appended to the file (the total should now be 150 MB). What happens?
Does HDFS actually resize the last block from 2 MB to 22 MB, or create a new block?
How does appending to a file in HDFS deal with concurrency?
Is there a risk of data loss?
Does HDFS create a third block, put the 20+2 MB in it, and delete the block with 2 MB? If yes, how does this work concurrently?
According to the latest design document in the Jira issue mentioned below, we find the following answers to your questions:
HDFS will append to the last block, not create a new block and copy the data from the old last block. This is not difficult because HDFS just uses a normal filesystem to write these block-files as normal files. Normal file systems have mechanisms for appending new data. Of course, if you fill up the last block, you will create a new block.
Only a single write or append to any file is allowed at a time in HDFS, so there is no concurrency to handle. This is managed by the namenode. You need to close a file if you want someone else to be able to begin writing to it.
If the last block in a file is not replicated, the append will fail. The append is written to a single replica, which pipelines it to the other replicas, similar to a normal write. It seems to me like there is no extra risk of data loss compared to a normal write.
Here is a very comprehensive design document about append, and it covers the concurrency issues.
The current HDFS docs link to that document, so we can assume it is the most recent one. (The document is dated 2009.)
And the related issue.
Hadoop Distributed File System supports appends to files, and in this case it should add the 20 MB to the 2nd block in your example (the one with 2 MB in it initially). That way you will end up with two blocks, one with 128 MB and one with 22 MB.
This is the reference to the append java docs for HDFS.
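To make that concrete, here is a minimal sketch of appending through the Java FileSystem API. The path is hypothetical, and the cluster must have append support enabled; this is only an illustration of the call, not code from the linked docs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
  public static void main(String[] args) throws IOException {
    // Assumes a reachable HDFS cluster configured as the default filesystem.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/existing-130mb-file"); // hypothetical existing file
    try (FSDataOutputStream out = fs.append(file)) {
      // The appended bytes land in the existing last block until it fills up,
      // after which HDFS allocates a new block.
      out.write("appended data\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```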
In Hadoop, consider a scenario where a big file has already been loaded into the HDFS filesystem using either the hdfs dfs -put or hdfs dfs -copyFromLocal command; the big file will be split into blocks (64 MB).
In this case, when a custom RecordReader has to be created to read the big file, please explain the reason for using FileSplit, when the big file has already been split during the loading process and is available in the form of split blocks.
I think you might be confused about what a FileSplit actually is. Let's say your bigfile is 128MB and your block size is 64MB. bigfile will take up two blocks. You know this already. You will also (usually) get two FileSplits when the file is being processed in MapReduce. Each FileSplit maps to a block as it was previously loaded.
Keep in mind that the FileSplit class does not contain any of the file's actual data. It is simply a pointer to data within the file.
HDFS splits files into blocks for storage purposes and may spread the data across multiple blocks based on the actual file size.
So if you are writing a custom RecordReader, you will have to tell your record reader where to start and stop reading the block so that you can process the data. Reading from the very beginning of each block, or stopping your read exactly at the end of each block, may give your mapper incomplete records.
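As an illustrative sketch (the helper name is made up, not part of the answer above), this is the usual trick a custom RecordReader borrows from Hadoop's LineRecordReader: seek to the split's start and, unless this is the first split, discard the partial first line, because the reader of the previous split reads past its own boundary to pick up that record.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class SplitBoundaryHelper {

  // Open the split's file, seek to the split's start offset and, unless this
  // is the first split, skip the (possibly partial) first line: the reader of
  // the previous split reads past its own boundary to pick that record up.
  public static LineReader openAtRecordBoundary(FileSplit split, Configuration conf)
      throws IOException {
    Path file = split.getPath();
    long start = split.getStart();          // byte offset of this split within the file

    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = fs.open(file);
    in.seek(start);

    LineReader reader = new LineReader(in, conf);
    if (start != 0) {
      // Discard everything up to the next newline; the previous split's
      // reader has already consumed this partial record.
      reader.readLine(new Text());
    }
    return reader;                          // caller reads until start + split.getLength()
  }
}
```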
You are comparing apples and oranges. The full name of FileSplit is org.apache.hadoop.mapred.FileSplit, emphasis on mapred. It is a MapReduce concept, not a file system one. FileSplit is simply a specialization of an InputSplit:
InputSplit represents the data to be processed by an individual Mapper.
You are unnecessarily adding HDFS concepts like blocks to the discussion. MapReduce is unrelated to HDFS (they have synergy together, true). MapReduce can run on many other file systems, like the local filesystem, S3, Azure blobs, etc.
Whether a FileSplit happens to coincide with an HDFS block is, from your point of view, pure coincidence (it is not really a coincidence, since the job is split to take advantage of HDFS block locality, but that is a detail).
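If you want to see what a FileSplit actually carries, here is a small sketch (assuming the new mapreduce API and an input path passed as an argument) that prints each split's offset, length, and preferred hosts. On HDFS input these usually line up with the block boundaries, but that alignment is the InputFormat's doing, not the file system's.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PrintSplits {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    FileInputFormat.addInputPath(job, new Path(args[0]));  // any input file

    // A FileSplit is a MapReduce concept: path + offset + length + preferred hosts.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    for (InputSplit s : splits) {
      FileSplit split = (FileSplit) s;
      System.out.printf("%s offset=%d length=%d hosts=%s%n",
          split.getPath(), split.getStart(), split.getLength(),
          String.join(",", split.getLocations()));
    }
  }
}
```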
You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?
a.) They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.
b.) They would see the current state of the file, up to the last bit written by the command.
c.) They would see the current state of the file through the last completed block.
d.) They would see no content until the whole file is written and closed.
From what I understand about the hadoop fs -put command the answer is D, however some say it is C.
Could anyone provide a constructive explanation for either of the options?
Thanks xx
The reason why the file will not be accessible until the whole file is written and closed (option D) is that, in order to access a file, the request is first sent to the NameNode to obtain metadata relating to the different blocks that compose the file. This metadata is written by the NameNode only after it receives confirmation that all blocks of the file were written successfully.
Therefore, even though the blocks are available, the user can't see the file until the metadata is updated, which is done after all blocks are written.
As soon as a file is created, it is visible in the filesystem namespace. Any content written to the file is not guaranteed to be visible, however:
Once more than a block's worth of data has been written, the first block will be visible to new readers. This is true of subsequent blocks, too: it is always the current block being written that is not visible to other readers. (From Hadoop Definitive Guide, Coherency Model).
So, I would go with Option C.
Also, take a look at this related question.
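To see the coherency model for yourself, here is a rough sketch (the path is hypothetical; the behaviour is as described in the Definitive Guide) showing that a file is visible in the namespace immediately, while its content and reported length lag behind until a block completes, hflush() is called, or the file is closed.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoherencyDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/coherency-demo.txt");   // hypothetical path

    try (FSDataOutputStream out = fs.create(p)) {
      out.write("content\n".getBytes(StandardCharsets.UTF_8));
      out.flush();
      // The file already exists in the namespace, but another reader will
      // typically still see a length of 0: the current block is not finished.
      System.out.println("length seen by others: " + fs.getFileStatus(p).getLen());

      out.hflush();
      // After hflush() the written bytes are guaranteed to be visible to new
      // readers, although the reported length may still lag behind.
    }
    // After close() the full content and length are visible to everyone.
    System.out.println("length after close: " + fs.getFileStatus(p).getLen());
  }
}
```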
It seems both D and C are true, as detailed by Chaos and Ashrith, respectively. I documented their results at https://martin.atlassian.net/wiki/spaces/lestermartin/blog/2019/03/21/1172373509/are+partially-written+hdfs+files+accessible+not+exactly+but+much+more+yes+than+I+previously+thought when playing with a 7.5 GB file.
In a nutshell, yes, the exact file name is NOT present until the copy is completed... AND... yes, you can actually read the file up to the last block written if you realize the filename is temporarily suffixed with ._COPYING_.
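As a rough illustration of that behaviour (the directory and file names are hypothetical), a small sketch that looks for the temporary ._COPYING_ file while a -put is still running and reads what is already visible could look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadInProgressCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // While `hadoop fs -put bigfile /data/` is still running, the target shows
    // up as /data/bigfile._COPYING_ rather than /data/bigfile.
    for (FileStatus st : fs.listStatus(new Path("/data"))) {   // hypothetical directory
      if (st.getPath().getName().endsWith("._COPYING_")) {
        System.out.printf("in-progress: %s, length so far: %d bytes%n",
            st.getPath(), st.getLen());
        try (FSDataInputStream in = fs.open(st.getPath())) {
          byte[] buf = new byte[4096];
          int n = in.read(buf);   // reads data from the blocks written so far
          System.out.println("read " + n + " bytes from the partially written file");
        }
      }
    }
  }
}
```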
Consider I have a single File which is 300MB. The block size is 128MB.
So the input file is divided into the following chunks and placed in HDFS.
Block1: 128MB
Block2: 128MB
Block3: 44 MB.
Now, does each block's data have byte-offset information contained in it?
That is, do the blocks have the following offset information?
Block1: 0-128 MB of the file
Block2: 129-256 MB of the file
Block3: 257-300 MB of the file
If so, how can I get the byte-offset information for Block2 (that is, that it starts at 129 MB) in Hadoop?
This is for understanding purposes only. Are there any Hadoop command-line tools to get this kind of metadata about the blocks?
EDIT
If the byte-offset info is not present, a mapper performing its map job on a block will start consuming lines from the beginning. If the offset information is present, the mapper will skip until it finds the next EOL and then start processing the records.
So I guess byte offset information is present inside the blocks.
Disclaimer: I might be wrong on this one; I have not read that much of the HDFS source code.
Basically, datanodes manage blocks, which are just large blobs to them. They know the block id, but that's it. The namenode knows everything, especially the mapping between a file path and all the block ids of that file, and where each block is stored. Each block id can be stored in one or more locations, depending on its replication settings.
I don't think you will find a public API to get the information you want from a block id, because HDFS does not need to do the mapping in this direction. On the other hand, you can easily find the blocks of a file and their locations. You can try exploring the source code, especially the blockmanager package.
If you want to learn more, this article about the HDFS architecture could be a good start.
You can run hdfs fsck /path/to/file -files -blocks to get the list of blocks.
A Block does not contain offset info, only the length. But you can use LocatedBlocks to get all blocks of a file, and from this you can easily reconstruct the offset at which each block starts.
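For example, here is a minimal sketch using the public getFileBlockLocations() call (rather than the internal LocatedBlocks type) to reconstruct each block's starting offset for the 300 MB file from the question; the file path is passed as an argument.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockOffsets {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]);          // e.g. the 300 MB file from the question

    FileSystem fs = file.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    // With a 128 MB block size and a 300 MB file this should print offsets
    // 0, 134217728 and 268435456, i.e. Block2 starts 128 MB into the file.
    for (int i = 0; i < blocks.length; i++) {
      System.out.printf("Block%d: offset=%d length=%d hosts=%s%n",
          i + 1, blocks[i].getOffset(), blocks[i].getLength(),
          String.join(",", blocks[i].getHosts()));
    }
  }
}
```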
In HDFS, the blocks are distributed among the active nodes/slaves. The content of the blocks is simple text, so is there any way to see, read, or access the blocks present on each data node?
As an entire file or to read a single block (say block number 3) out of sequence?
You can read the file via various mechanisms including the Java API but you cannot start reading in the middle of the file (for example at the start of block 3).
Hadoop reads a block of data and feeds each line to the mapper for further processing. Also, the Hadoop client gets the blocks related to a file from different DataNodes before concatenating them. So it should be possible to get the data from a particular block.
The Hadoop client might be a good place to start looking at the code. But HDFS provides a file system abstraction, and I am not sure what the requirement would be for reading the data from a particular block.
Assuming you have ssh access (and appropriate permissions) to the datanodes, you can cd to the path where the blocks are stored and read the blocks stored on that node (e.g., do a cat BLOCK_XXXX). The configuration parameter that tells you where the blocks are stored is dfs.datanode.data.dir, which defaults to file://${hadoop.tmp.dir}/dfs/data. More details here.
Caveat: the block names are coded by HDFS depending on their internal block ID. Just by looking at their names, you cannot know to which file a block belongs.
Finally, I assume you want to do this for debugging purposes or just to satisfy your curiosity. Normally, there is no reason to do this and you should just use the HDFS web-UI or command-line tools to look at the contents of your files.
Gurus!
For a long time I could not find the answer to the following question: how does Hadoop split a big file during writing?
Example:
1) Block size: 64 MB
2) File size: 128 MB (a flat file containing text).
When I write the file, it will be split into 2 parts (file size / block size).
But... could the following happen?
Block1 ends with:
...
word300 word301 wo
and Block2 starts with:
rd302 word303
...
Or will the write look like this:
Block1 ends with:
...
word300 word301
and Block2 starts with:
word302 word303
...
Or can you link to the place where the Hadoop splitting algorithm is described?
Thank you in advance!
Look at this wiki page: Hadoop's InputFormat will read the last line of the FileSplit past the split boundary and, when reading any FileSplit other than the first, it ignores the content up to the first newline.
The file will be split arbitrarily based on bytes. So it will likely split it into something like wo and rd302.
This is not a problem you typically have to worry about; it is how the system is designed. The InputFormat and RecordReader parts of a MapReduce job deal with records that are split across block boundaries.