Read this on apache documentation:
InputSplit represents the data to be processed by an individual Mapper.
Typically, it presents a byte-oriented view on the input and is the responsibility of RecordReader of the job to process this and present a record-oriented view.
Link - https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/mapred/InputSplit.html
Can somebody explain the difference between byte-oriented view and record-oriented view?
HDFS splits a file into blocks (the byte-oriented view) so that each block is less than or equal to the configured block size. This is not a logical split: part of the last record in a block may reside in that block while the rest of it sits in the next one. That is fine for storage, but at processing time the partial record in a block cannot be processed as it is. This is where the record-oriented view comes in: it makes sure the remaining part of the last record is fetched from the other block so that the mapper sees only complete records. This logical unit is the input split (the record-oriented view).
I am learning Hadoop, and to begin with I started with HDFS and MapReduce. I understood the basics of HDFS and MapReduce.
There is one particular point that I am not able to understand, which I explain below:
Large data set --> Stored in HDFS as Blocks, say for example B1, B2, B3.
Now, when we run a MR Job, each mapper works on a single block (assuming 1 mapper processes a block of data for simplicity)
1 Mapper ==> processes 1 block
I also read that a block is divided into records, and for a given block the same mapper is called for each record within that block (of data).
But what exactly is a Record?
For a given block, since it has to be "broken" down into records, how does that block get broken into records, and what constitutes a record?
In most of the examples, I have seen a record being a full line delimited by new line.
My doubt is: what decides the "conditions" on the basis of which something can be treated as a record?
I know there are many InputFormats in Hadoop, but my question is: what are the conditions that decide whether something is considered a record?
Can anyone help me understand this in simple words?
You need to understand the concept of RecordReader.
A block is a hard-bounded number of bytes of data stored on disk. So a block of 256 MB means exactly a 256 MB piece of data on the disk.
The mapper gets one record from the block, processes it, and gets the next one; the onus of defining a record is on the RecordReader.
Now, what is a record? If I use the analogy of a block being a table, a record is a row in that table.
Now think about this: how do you process a block of data in the mapper? After all, you cannot write logic against a random byte of data. From the mapper's perspective, you can only write logic if the input data "makes some sense", i.e. has a structure or forms a logical chunk of data (from the mapper logic's perspective).
That logical chunk is called a record. In the default implementation, one line of data is the logical chunk. But sometimes it does not make sense for one line of data to be the logical unit. Sometimes there is no line at all (say it is MP4-type data and the mapper needs one song as input)!
Let's say your mapper needs to work on 5 consecutive lines together. In that case you need to provide a RecordReader implementation in which 5 lines form one record and are passed together to the mapper, as sketched below.
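A minimal sketch of what that could look like with the newer mapreduce API, wrapping the stock LineRecordReader. The class names (FiveLineInputFormat, FiveLineRecordReader) are invented for this example, and splitting is disabled so the 5-line groups are counted consistently from the start of the file:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input format: every record handed to the mapper is a bundle of
// (up to) 5 consecutive lines, joined with '\n'.
public class FiveLineInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Keep the whole file in one split so the 5-line groups are counted
        // from the start of the file (sacrifices parallelism, keeps the sketch simple).
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new FiveLineRecordReader();
    }
}

class FiveLineRecordReader extends RecordReader<LongWritable, Text> {
    private static final int LINES_PER_RECORD = 5;

    private final LineRecordReader lines = new LineRecordReader(); // reads one line at a time
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lines.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        StringBuilder record = new StringBuilder();
        int read = 0;
        while (read < LINES_PER_RECORD && lines.nextKeyValue()) {
            if (read == 0) {
                key.set(lines.getCurrentKey().get()); // byte offset of the first line
            } else {
                record.append('\n');
            }
            record.append(lines.getCurrentValue().toString());
            read++;
        }
        if (read == 0) {
            return false; // no more lines in the split
        }
        value.set(record.toString());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException {
        return lines.getProgress();
    }

    @Override
    public void close() throws IOException {
        lines.close();
    }
}

You would plug it in with job.setInputFormatClass(FiveLineInputFormat.class).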
EDIT 1
Your understanding is on the right path:
InputFormat: opens the data source and splits the data into chunks
RecordReader: actually parses the chunks into Key/Value pairs.
From the JavaDoc of InputFormat:
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to extract input records from the logical InputSplit for processing by the Mapper.
From the 1st point: one block is not exactly the input to a mapper; the input is rather an InputSplit. For example, think about a ZIP file. A ZIP file is a collection of ZipEntrys (each one a compressed file), and it is non-splittable from a processing perspective. That means the InputSplit for a ZIP file spans several blocks (in fact, all the blocks used to store that particular ZIP file). This happens at the expense of data locality, i.e. even though the ZIP file is broken up and stored in HDFS on different nodes, the whole file is moved to the node running the mapper.
The ZipFileInputFormat provides a default record reader implementation, ZipFileRecordReader, which has the logic to read one ZipEntry (one compressed file) per mapper key-value pair.
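To make the "relies on the InputFormat of the job" part concrete, here is a minimal driver sketch (class name and paths are placeholders; the mapper and reducer are left at their identity defaults) showing the single hook where an InputFormat - TextInputFormat here, or a ZipFileInputFormat-style class if you have one on the classpath - is handed to the framework:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver sketch: the framework asks the configured InputFormat for the
// InputSplits and for the RecordReader that turns each split into records.
public class InputFormatDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        job.setJarByClass(InputFormatDriver.class);

        // This is the hook: splitting and record reading are delegated to this class.
        job.setInputFormatClass(TextInputFormat.class);

        // Identity mapper/reducer (the defaults), so the job just copies records through.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}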
You've already basically answered this for yourself, so hopefully my explanation can help.
A record is a MapReduce-specific term for a key-value pair. A single MapReduce job can have several different types of records - in the wordcount example, the mapper input record type is <Object, Text>, the mapper output / reducer input record type is <Text, IntWritable>, and the reducer output record type is also <Text, IntWritable>.
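For reference, these are the map and reduce signatures from the classic word-count example, with the record types at each stage called out in comments (a condensed sketch, not the full job driver):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper input record:  <Object, Text>       -> (byte offset, one line of text)
// Mapper output record: <Text, IntWritable>  -> (word, 1)
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit one <word, 1> record per token
        }
    }
}

// Reducer input record:  <Text, IntWritable> -> (word, list of counts)
// Reducer output record: <Text, IntWritable> -> (word, total count)
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}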
The InputFormat is responsible for defining how the block is split into individual records. As you identified, there are many InputFormats, and each is responsible for implementing code that manages how it splits the data into records.
The block itself has no concept of records, as the records aren't created until the data is read by the mapper. You could have two separate MapReduce jobs that read the same block but use different InputFormats. As far as HDFS is concerned, it's just storing a single big blob of data.
There's no "condition" for defining how the data is split - you can make your own InputFormat and split the data however you want.
I am trying to learn MapReduce in some detail, in particular the following query.
As we know, data in HDFS is broken into blocks, and typically a Mapper works on one block at a time; so we can have a situation in which a record spills over into another block. For example:
Dataset: "hello, how are \nyou doing"; this data might get split across two different blocks:
Block1:
hello, how a
Block2:
re
you doing
Now, if the Mapper works on Block1, how does the mapper get the "full" record from Block1 when part of it has spilled into Block2?
Could anyone help me understand this?
The mapper works on files, which may be stored in HDFS as more than one block. However, as far as the mapper is concerned it is working on a file; the blocks, and where they split, are irrelevant. It just sees the file and its complete contents.
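To make that concrete, here is a toy, self-contained illustration (plain Java, not the actual Hadoop classes) of the rule a line-oriented record reader follows at split boundaries, applied to the exact data from the question with an artificial boundary at byte 12:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy illustration (NOT the real Hadoop classes) of the boundary rule: every
// split except the first skips the partial line it starts in (if any), and
// every split reads past its own end until the newline (or end of data) that
// finishes its last record. Applied to every split, each line is read exactly
// once, by exactly one reader.
public class SplitBoundaryDemo {

    static List<String> readRecords(byte[] data, int splitStart, int splitEnd) {
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        if (splitStart != 0) {
            // Skip the tail of a record that belongs to the previous split.
            while (pos < data.length && data[pos - 1] != '\n') {
                pos++;
            }
        }
        while (pos < splitEnd && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++; // note: this may run past splitEnd to finish the record
            }
            records.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step over the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "hello, how are \nyou doing".getBytes(StandardCharsets.UTF_8);
        // Pretend the block/split boundary falls at byte 12, as in the question.
        System.out.println(readRecords(data, 0, 12));           // [hello, how are ]
        System.out.println(readRecords(data, 12, data.length)); // [you doing]
    }
}

Each line ends up being read exactly once by exactly one reader; fetching the bytes of a line that starts in one block and ends in the next is what shows up as a "remote read" in a real cluster.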
A block is a physical division of data, and an InputSplit is a logical division of data.
An InputSplit is how the RecordReader presents the data to the mapper.
When data is stored, there is a chance of a record being split across two blocks. An InputSplit doesn't contain the actual data, only a reference to the data. An InputSplit represents the data to be processed by an individual Mapper; typically it presents a byte-oriented view of the input, and it is the responsibility of the job's RecordReader to process this and present a record-oriented view. The RecordReader converts the byte-oriented view of the input provided by the InputSplit into a record-oriented view for the Mapper and Reducer tasks to process. It thus assumes the responsibility of handling record boundaries and presenting the tasks with keys and values.
How the data is split depends on the InputFormat. The default InputFormat is TextInputFormat (a FileInputFormat), whose record reader uses the line feed (newline) character to delimit records.
See also: InputSplit
and RecordReader
I want to understand the definition of Record in MapReduce Hadoop, for data types other than Text.
Typically, for text data a record is a full line terminated by a newline.
Now, if we want to process XML data, how does that data get processed? That is, what would the definition of a record, on which the mapper works, look like?
I have read that there are the concepts of InputFormat and RecordReader, but I didn't understand them well.
Can anyone help me understand the relationship between InputFormat and RecordReader for various types of data sets (other than text), and how the data gets converted into records that the mapper works on?
Let's start with some basic concepts.
From the perspective of a file:
1. A file is a collection of rows.
2. A row is a collection of one or more columns, separated by a delimiter.
3. A file can be of any format: text file, Parquet file, ORC file.
Different file formats store rows (columns) in different ways, and the choice of delimiter also differs.
From the perspective of HDFS:
1. A file is a sequence of bytes.
2. It has no idea of the logical structure of the file, i.e. rows and columns.
3. HDFS doesn't guarantee that a row will be contained within one HDFS block; a row can span two blocks.
InputFormat: the code which knows how to read the file chunks from splits and, at the same time, ensures that if a row extends into another split, it is still considered part of the first split.
RecordReader: as you read a split, some code (the RecordReader) must understand how to interpret a row from the bytes read from HDFS.
For more info:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/
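Since the question asks specifically about XML: one common approach, when each XML document is small enough to fit in memory, is to make the whole file a single record and parse the XML inside the mapper. Below is a minimal sketch using the mapreduce API; the class names are illustrative, and for large XML files an XML-aware reader that scans for configurable start/end tags (such as the XmlInputFormat that has circulated via the Mahout project) is the usual alternative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Treats each (small) XML file as a single record: key = nothing, value = the file's bytes.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one file == one record
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }
}

class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.split = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) return false;
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length); // slurp the whole file
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
}

The mapper then receives one <NullWritable, BytesWritable> record per file and can run whatever XML parser it likes over the bytes.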
An example to explain the question -
I have a file of size 500 MB (input.csv).
The file contains only one line (record) in it.
So how will the file be stored in HDFS blocks, and how will the input splits be computed?
You will probably have to check this link: How does Hadoop process records split across block boundaries? Pay attention to the "remote read" it mentions.
The single record mentioned in your question will be stored across many blocks. But if you use TextInputFormat to read it, the mapper would have to perform remote reads across those blocks to process the record.
Imagine you have a big file stored in HDFS which contains structured data. Now the goal is to process only a portion of the data in the file, for example all the lines in the file where the second column value is between so and so. Is it possible to launch the MR job such that HDFS only streams the relevant portion of the file, versus streaming everything to the mappers?
The reason is that I want to expedite the job by only working on the portion that I need. One approach would probably be to run an MR job to create a new file, but I am wondering if one can avoid that?
Please note that the goal is to keep the data in HDFS and I do not want to read and write from database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading 3% of the total file, or about 30GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
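A minimal sketch of such a filtering mapper; the comma delimiter, the column index, and the [10, 20] bounds are assumptions made for the example:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical filtering mapper: keeps only the lines whose second column
// falls inside a [LOW, HIGH] range and drops everything else.
public class RangeFilterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private static final double LOW = 10.0;
    private static final double HIGH = 20.0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] cols = value.toString().split(",");
        if (cols.length < 2) {
            return; // malformed row: skip
        }
        try {
            double v = Double.parseDouble(cols[1].trim());
            if (v >= LOW && v <= HIGH) {
                context.write(NullWritable.get(), value); // emit only matching rows
            }
        } catch (NumberFormatException e) {
            // non-numeric second column: skip the row
        }
    }
}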
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is to look at the way Hive solves this problem. The data is in "tables", which are really just metadata about files on disk. Hive allows you to specify columns on which a table is partitioned. This creates a separate folder for each partition, so if you were partitioning a table by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside each date directory would be your actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
The bottom line is that if you want functionality like this, you either want to move to a higher-level language (Hive/Pig) or you need to roll your own solution.
A big part of the processing cost is parsing the data to produce the key-value pairs passed to the Mapper. We (usually) create one Java object per value, plus some container. That is costly both in terms of CPU and garbage-collector pressure.
I would suggest a solution "in the middle": you can write an input format whose reader skips non-relevant data at an early stage (for example, by looking at only the first few bytes of each line). As a result you still read all the data, but you actually parse, and pass to the Mapper, only a portion of it.
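A sketch of that idea for line-oriented input, where "relevant" means the line starts with a given prefix; the class names and the hard-coded prefix are invented for the example (in a real job the prefix would come from the Configuration). The bytes are still read from HDFS, but non-matching lines never reach the mapper:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical "early skip" input format: its record reader drops lines whose
// first bytes don't match a prefix before they ever reach the mapper.
public class PrefixFilterInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new PrefixFilterRecordReader("2011-12-01,"); // hard-coded for the sketch
    }
}

class PrefixFilterRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private final byte[] prefix;

    PrefixFilterRecordReader(String prefix) {
        this.prefix = prefix.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        while (delegate.nextKeyValue()) {
            if (startsWithPrefix(delegate.getCurrentValue())) {
                return true; // relevant line: hand it to the mapper
            }
            // irrelevant line: skipped here, never passed to the mapper
        }
        return false;
    }

    private boolean startsWithPrefix(Text line) {
        if (line.getLength() < prefix.length) return false;
        byte[] bytes = line.getBytes(); // only the first few bytes are inspected
        for (int i = 0; i < prefix.length; i++) {
            if (bytes[i] != prefix[i]) return false;
        }
        return true;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}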
Another approach I would consider is to use the RCFile format (or another columnar format) and make sure that the relevant and non-relevant data sit in different columns.
If the files that you want to process have some unique attribute in their filename (like an extension or a partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and "_xxx" files/dirs, but you can extend this with setInputPathFilter.
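For example, a hypothetical PathFilter that keeps only the .csv files for one date, registered through FileInputFormat.setInputPathFilter (this assumes a flat input directory, since the filter is applied to the listed children of the input path):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: of everything under the input path, only .csv files
// whose name contains "2011-12-01" are handed to the job.
public class DateCsvPathFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        String name = path.getName();
        return name.endsWith(".csv") && name.contains("2011-12-01");
    }
}

// Registered in the driver with:
//   FileInputFormat.addInputPath(job, new Path("/mytable"));
//   FileInputFormat.setInputPathFilter(job, DateCsvPathFilter.class);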
As others have noted above, you will likely get sub-optimal performance out of your cluster by doing something like this, since it breaks the "one block per mapper" paradigm, but sometimes that is acceptable. It can sometimes take more effort to "do it right", especially if you're dealing with a small amount of data and the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.