Splitting a file during writing - Hadoop

Gurus!
For a long time I haven't been able to find the answer to the following question: how does Hadoop split a big file during writing?
Example:
1) Block size 64 MB
2) File size 128 MB (a flat file containing text).
When I write the file, it will be split into 2 parts (file size / block size).
But... could the following occur?
Block 1 ends with
...
word300 word301 wo
and Block 2 starts with
rd302 word303
...
Or will it be written so that
Block 1 ends with
...
word300 word301
and Block 2 starts with
word302 word303
...
Or can you link to the place where Hadoop's splitting algorithm is described?
Thank you in advance!

Look at this wiki page: Hadoop's InputFormat will read the last line of a FileSplit past the split boundary and, when reading any FileSplit other than the first, it ignores the content up to the first newline.

The file will be split arbitrarily based on bytes, so it will likely be split into something like wo and rd302.
This is not a problem you typically have to worry about; it is how the system is designed. The InputFormat and RecordReader parts of a MapReduce job deal with records that are split across block boundaries.
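To make that concrete, here is a simplified, hedged sketch of the rule a line-oriented reader (like the standard LineRecordReader) follows around split boundaries. This is not the actual Hadoop source; the class and method names and the println stand-in for record handling are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    public class SplitBoundaryDemo {

        // Reads the lines belonging to one split of a plain-text file.
        static void readSplit(FileSplit split, Configuration conf) throws IOException {
            long start = split.getStart();             // first byte of this split
            long end = start + split.getLength();      // first byte after this split

            FileSystem fs = split.getPath().getFileSystem(conf);
            try (FSDataInputStream in = fs.open(split.getPath())) {
                in.seek(start);
                LineReader reader = new LineReader(in, conf);
                Text line = new Text();
                long pos = start;

                if (start != 0) {
                    // Not the first split: the previous split's reader already consumed
                    // the line straddling the boundary (e.g. "word301 wo|rd302"), so
                    // discard everything up to and including the first newline.
                    pos += reader.readLine(line);
                }
                while (pos <= end) {
                    // Keep reading until we pass 'end'; the last line is read to its
                    // newline even if that means reading beyond the split boundary.
                    int bytes = reader.readLine(line);
                    if (bytes == 0) break;             // end of file
                    pos += bytes;
                    System.out.println(line);          // stand-in for real record handling
                }
            }
        }
    }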

Related

Hadoop: why use FileSplit in the RecordReader implementation

In Hadoop, consider a scenario where a bigfile has already been loaded into HDFS using either the hdfs dfs -put or hdfs dfs -copyFromLocal command; the bigfile will have been split into blocks (64 MB).
In this case, when a custom RecordReader has to be created to read the bigfile, please explain the reason for using FileSplit, when the bigfile has already been split during loading and is available in the form of blocks.
Please explain the reason for using FileSplit, when the bigfile has already been split during loading and is available in the form of blocks.
I think you might be confused about what a FileSplit actually is. Let's say your bigfile is 128MB and your block size is 64MB. bigfile will take up two blocks. You know this already. You will also (usually) get two FileSplits when the file is being processed in MapReduce. Each FileSplit maps to a block as it was previously loaded.
Keep in mind that the FileSplit class does not contain any of the file's actual data. It is simply a pointer to data within the file.
HDFS splits files into blocks for storage purposes and may spread the data across multiple blocks depending on the actual file size.
So if you are writing a custom RecordReader, you will have to tell your record reader where to start and stop reading the block so that you can process the data. Reading from the very beginning of each block, or stopping your read exactly at the end of each block, may give your mapper incomplete records.
You are comparing apples and oranges. The full name of FileSplit is org.apache.hadoop.mapred.FileSplit, emphasis on mapred. It is a MapReduce concept, not a file system one. FileSplit is simply a specialization of an InputSplit:
InputSplit represents the data to be processed by an individual Mapper.
You are unnecessarily bringing HDFS concepts like blocks into the discussion. MapReduce is not tied to HDFS (the two have synergy, true); it can run on many other file systems, such as the local file system, S3, Azure blobs, etc.
Whether a FileSplit happens to coincide with an HDFS block is, from your point of view, pure coincidence (it is not really coincidence, since the job is split to take advantage of HDFS block locality, but that is a detail).

Reading of the broken line by Record Reader

I went through the Cloudera blog and found an article (link below). Refer to the third point.
http://blog.cloudera.com/blog/2011/01/lessons-learned-from-clouderas-hadoop-developer-training-course/
As per my understanding, if there are 2 input splits, then the broken line will be read by the record reader of the first input split.
If I am getting it correct, can you tell me how it does that, i.e. how the record reader of the first split reads the broken line past the end of the input split?
As per my understanding, if there are 2 input splits, then the broken line will be read by the record reader of the first input split.
Yes, this is correct.
can you tell me how it does that, i.e. how the record reader of the first split reads the broken line past the end of the input split
An InputSplit doesn't contain the raw data, but rather the information needed to extract the data. A FileSplit (which is what you're referring to) contains a path to the file as well as the byte offset and length to read within the file. It is then up to the RecordReader to go out and read that data. This means that it can read past the end byte offset defined by the split.

Is it possible to know the serial number of the block of input data on which map function is currently working?

I am a novice in Hadoop and here I have the following questions:
(1) As I can understand, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map functions executing on data in a single block?
(2) Is there any way that it can be learned, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block starting from the first block of the input text.
(3) Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
Any help would be appreciated.
As I can understand, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map functions executing on data in a single block?
No. A block (a split, to be precise) gets processed by only one mapper.
Is there any way that it can be learned, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block starting from the first block of the input text.
You can get some valuable info with the help of the FileSplit class, such as the file containing the split's data and the position within the file of the first byte to process. You might find it helpful.
Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
You can do that by extending the FileInputFormat class. To begin with, you could do this:
In your getSplits() method, maintain a counter. As you read the file line by line, keep tokenizing the lines and increase the counter by 1 for each token. Once the counter reaches the desired value, emit the data read up to that point as one split, then reset the counter and start on the next split. A rough sketch of this idea follows.
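A rough, hedged sketch of that idea using the new MapReduce API; the class name and the 10,000-word target are assumptions, and compression and error handling are ignored. (As the next answer notes, the job client ends up reading every input file just to compute the splits.)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.util.LineReader;

    // Hypothetical input format: each split covers roughly WORDS_PER_SPLIT words
    // and always ends on a line boundary.
    public class FixedWordCountInputFormat extends FileInputFormat<LongWritable, Text> {

        private static final int WORDS_PER_SPLIT = 10_000;   // assumed target

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> splits = new ArrayList<>();
            for (FileStatus status : listStatus(job)) {
                Path path = status.getPath();
                FileSystem fs = path.getFileSystem(job.getConfiguration());
                try (FSDataInputStream in = fs.open(path)) {
                    LineReader reader = new LineReader(in, job.getConfiguration());
                    Text line = new Text();
                    long splitStart = 0, pos = 0;
                    int words = 0, bytes;
                    while ((bytes = reader.readLine(line)) > 0) {
                        pos += bytes;
                        words += new StringTokenizer(line.toString()).countTokens();
                        if (words >= WORDS_PER_SPLIT) {   // close the current split here
                            splits.add(new FileSplit(path, splitStart, pos - splitStart, new String[0]));
                            splitStart = pos;
                            words = 0;
                        }
                    }
                    if (pos > splitStart) {               // leftover lines at the end of the file
                        splits.add(new FileSplit(path, splitStart, pos - splitStart, new String[0]));
                    }
                }
            }
            return splits;
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
            return new LineRecordReader();   // standard line-by-line reader over each split
        }
    }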
HTH
If you define a small max split size, you can actually have multiple mappers processing a single HDFS block (say a 32 MB max split for a 128 MB block size - you'll get 4 mappers working on the same HDFS block); a driver fragment is shown below. With the standard input formats, you'll typically never see two or more mappers processing the same part of a block (the same records).
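For example, a driver-side fragment (new API; the job name is a placeholder):

    // Cap the split size at 32 MB so a 128 MB block yields roughly 4 splits/mappers.
    Job job = Job.getInstance(new Configuration(), "small-splits");
    FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
    // Equivalent Hadoop 2.x property:
    // job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", 32L * 1024 * 1024);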
MapContext.getInputSplit() can usually be cast to a FileSplit, and then you have the Path, offset, and length of the part of the file / block being processed.
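A hedged sketch of doing that inside a mapper (new API; the class name and logging are illustrative):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void setup(Context context) {
            // Safe for plain file-based input formats; other input formats may hand
            // back a different InputSplit subclass, so a real job should check the type.
            FileSplit split = (FileSplit) context.getInputSplit();
            System.out.println("Processing " + split.getPath()
                    + " from byte " + split.getStart()
                    + " for " + split.getLength() + " bytes");
        }
        // map() omitted; it would process (offset, line) pairs as usual.
    }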
If your input files are true text files, then you can use the method suggested by Tariq, but note that this is highly inefficient for larger data sources, as the job client has to process each input file to discover the split locations (so you end up reading each file twice). If you really only want each mapper to process a set number of words, you could run a job to re-format the text files into sequence files (or another format) and write the records to disk with a fixed number of words per file (using MultipleOutputs to get one file per word count, but this again is inefficient). Maybe if you shared the use case for why you want a fixed number of words, we could better understand your needs and come up with alternatives.

How does HDFS append work

Let's assume one is using the default block size (128 MB), and there is a file using 130 MB; that is, one full-size block and one block with 2 MB. Then 20 MB needs to be appended to the file (the total should now be 150 MB). What happens?
Does HDFS actually resize the last block from 2 MB to 22 MB, or create a new block?
How does appending to a file in HDFS deal with concurrency?
Is there a risk of data loss?
Does HDFS create a third block, put the 20+2 MB in it, and delete the block with 2 MB? If so, how does this work concurrently?
According to the latest design document in the Jira issue mentioned before, we find the following answers to your question:
HDFS will append to the last block, not create a new block and copy the data from the old last block. This is not difficult because HDFS just uses a normal filesystem to write these block-files as normal files. Normal file systems have mechanisms for appending new data. Of course, if you fill up the last block, you will create a new block.
Only one single write or append to any file is allowed at the same time in HDFS, so there is no concurrency to handle. This is managed by the namenode. You need to close a file if you want someone else to begin writing to it.
If the last block in a file is not replicated, the append will fail. The append is written to a single replica, which pipelines it to the other replicas, similar to a normal write. It seems to me like there is no extra risk of data loss compared to a normal write.
Here is a very comprehensive design document about append and it contains concurrency issues.
The current HDFS docs give a link to that document, so we can assume it is the most recent one. (The document is dated 2009.)
And the related issue.
Hadoop Distributed File System supports appends to files, and in this case it should add the 20 MB to the 2nd block in your example (the one with 2 MB in it initially). That way you will end up with two blocks, one with 128 MB and one with 22 MB.
This is the reference to the append java docs for HDFS.
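For illustration, a minimal sketch of appending through the Java API; the namenode URI, file path, and buffer contents are placeholders, and on some older releases append support must be enabled first:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            Path file = new Path("/data/bigfile");          // the 130 MB file from the example
            byte[] moreData = new byte[20 * 1024 * 1024];   // the 20 MB to append

            // append() fails if another client currently holds the lease on the file.
            try (FSDataOutputStream out = fs.append(file)) {
                out.write(moreData);    // bytes go to the partially filled last block first
            }                           // close() releases the lease for the next writer
        }
    }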

Hadoop - how do map-reduce tasks know which part of a file to handle?

I've started learning Hadoop, and currently I'm trying to process log files that are not too well structured, in that the value I normally use for the M/R key is typically found at the top of the file (once). So basically my mapping function takes that value as the key and then scans the rest of the file to aggregate the values that need to be reduced. A [fake] log might look like this:
## log.1
SOME-KEY
2012-01-01 10:00:01 100
2012-01-02 08:48:56 250
2012-01-03 11:01:56 212
.... many more rows
## log.2
A-DIFFERENT-KEY
2012-01-01 10:05:01 111
2012-01-02 16:46:20 241
2012-01-03 11:01:56 287
.... many more rows
## log.3
SOME-KEY
2012-02-01 09:54:01 16
2012-02-02 05:53:56 333
2012-02-03 16:53:40 208
.... many more rows
I want to accumulate the 3rd column for each key. I have a cluster of several nodes running this job, and so I was bothered by several issues:
1. File Distribution
Given that Hadoop's HDFS works in 64 MB blocks (by default), and every file is distributed over the cluster, can I be sure that the correct key will be matched against the proper numbers? That is, if the block containing the key is in one node, and a block containing data for that same key (a different part of the same log) is on a different machine - how does the M/R framework match the two (if at all)?
2. Block Assignment
For text logs such as the ones described, how is each block's cutoff point decided? Is it after a row ends, or exactly at 64 MB (binary)? Does it even matter? This relates to my #1, where my concern is that the proper values are matched with the correct keys over the entire cluster.
3. File structure
What is the optimal file structure (if any) for M/R processing? I'd probably be far less worried if a typical log looked like this:
A-DIFFERENT-KEY 2012-01-01 10:05:01 111
SOME-KEY 2012-01-02 16:46:20 241
SOME-KEY 2012-01-03 11:01:56 287
A-DIFFERENT-KEY 2012-02-01 09:54:01 16
A-DIFFERENT-KEY 2012-02-02 05:53:56 333
A-DIFFERENT-KEY 2012-02-03 16:53:40 208
...
However, the logs are huge and it would be very costly (time) to convert them to the above format. Should I be concerned?
4. Job Distribution
Are the jobs assigned such that only a single JobClient handles an entire file? Or rather, how are the keys/values coordinated between all the JobClients? Again, I'm trying to guarantee that my shady log structure still yields correct results.
Given that Hadoop's HDFS works in 64 MB blocks (by default), and every file is distributed over the cluster, can I be sure that the correct key will be matched against the proper numbers? That is, if the block containing the key is in one node, and a block containing data for that same key (a different part of the same log) is on a different machine - how does the M/R framework match the two (if at all)?
How the keys and the values are mapped depends on the InputFormat class. Hadoop has a couple of InputFormat classes and custom InputFormat classes can also be defined.
If FileInputFormat is used, the key to the mapper is the file offset of the line and the value is the line in the input file. In most cases the offset is ignored and the value, which is a line of the input file, is what the mapper processes. So, by default, each line in the log file will be a value passed to the mapper.
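As a small illustrative sketch of that default contract (the input key/value types come from TextInputFormat; the output key is a placeholder, since relating each line to the key at the top of its file is exactly the problem discussed in this question):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // The mapper receives the byte offset of each line as the key (usually ignored)
    // and the line itself as the value.
    public class LogLineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().trim().split("\\s+");
            if (fields.length == 3) {                  // e.g. "2012-01-01 10:00:01 100"
                long amount = Long.parseLong(fields[2]);
                context.write(new Text("UNKNOWN-KEY"), new LongWritable(amount));
            }
        }
    }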
There might be cases where related data in a log file, as in the OP, is split across blocks; each block will then be processed by a different mapper and Hadoop cannot relate them. One way around this is to let a single mapper process the complete file, by overriding the FileInputFormat#isSplitable method (see the sketch below). This is not an efficient approach if the file size is too large.
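A minimal sketch of that approach; the class name is made up and TextInputFormat's behavior is otherwise inherited unchanged:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Forces one split (and therefore one mapper) per input file, so lines that
    // belong together in a single log file are never handed to different mappers.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;   // never split, no matter how many HDFS blocks the file spans
        }
    }

The driver would then register it with job.setInputFormatClass(WholeFileTextInputFormat.class).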
For text logs such as the ones described, how is each block's cutoff point decided? Is it after a row ends, or exactly at 64 MB (binary)? Does it even matter? This relates to my #1, where my concern is that the proper values are matched with the correct keys over the entire cluster.
Each block in HDFS is by default exactly 64 MB, unless the file is smaller than 64 MB or the default block size has been modified; record boundaries are not considered. Part of a line in the input can be in one block and the rest in another. Hadoop's input handling understands record boundaries, so even if a record (line) is split across blocks, it will still be processed by a single mapper only. For this, some data transfer from the next block might be required.
Are the jobs assigned such that only a single JobClient handles an entire file? Or rather, how are the keys/values coordinated between all the JobClients? Again, I'm trying to guarantee that my shady log structure still yields correct results.
It's not exactly clear what the question is here. I would suggest going through some tutorials and coming back with specific questions.
