Read text file as bytes, split on a character - hadoop

I'm trying to process large records in Hadoop that span multiple lines. Each record consists of this:
>record_id|record_name // Should be the key
JAKSJDUVUAKKSKJJDJBUUBLAKSJDJUBKAKSJSDJB // Should be the value
KSJGFJJASPAKJWNMFKASKLSKJHUUBNFNNAKSLKJD
JDKSKSKALSLDKSDKPBPBPKASJDKSALLKSADJSAKD
I want to read the file containing these records as bytes because reading it as a String is just too memory intensive, as a single record can be well over 100MB. I cannot split these records on anything but the > character that defines a new record in the file.
I've been looking for a default RecordReader and InputFormat that can do these steps for me, but I haven't been able to find one. I'm trying to write my own, but I have no examples or tutorials to follow on this subject.
How should I approach this?

Related

How do Blocks get converted into Records, and what exactly is the definition of a Record in Hadoop?

I am learning Hadoop and have started with HDFS and MapReduce; I understand the basics of both.
There is one particular point I am not able to understand, which I explain below:
Large data set --> Stored in HDFS as Blocks, say for example B1, B2, B3.
Now, when we run a MR Job, each mapper works on a single block (assuming 1 mapper processes a block of data for simplicity)
1 Mapper ==> processes 1 block
I also read that a block is divided into records and, for a given block, the same mapper is called for each record within that block.
But what exactly is a Record?
For a given block, since it has to be "broken" down into records, how does that block get broken into records, and what constitutes a record?
In most of the examples I have seen, a record is a full line delimited by a newline.
My doubt is: what decides the "conditions" on the basis of which something can be treated as a record?
I know there are many InputFormats in Hadoop, but my question is: what are the conditions that decide whether something is considered a record?
Can anyone help me understand this in simple words?
You need to understand the concept of RecordReader.
A block is a hard-bounded number of bytes of data stored on disk. So a block of 256 MB means exactly a 256 MB piece of data on the disk.
The mapper gets one record from the block, processes it, and gets the next one - the onus of defining a record is on the RecordReader.
Now, what is a record? If I use the analogy of a block being a table, a record is a row in that table.
Now think about this - how do you process a block of data in the mapper? After all, you cannot write logic against a random byte of data. From the mapper's perspective, you can only write logic if the input data "makes some sense", i.e. has a structure or forms a logical chunk of data (from the mapper logic's perspective).
That logical chunk is called a record. In the default implementation, one line of data is the logical chunk. But sometimes it does not make sense for one line of data to be the logical unit, and sometimes there is no line at all (say it's MP4-type data and the mapper needs one song as input)!
Let's say your mapper needs to work on 5 consecutive lines together. In that case you need to provide a RecordReader implementation in which 5 lines are one record and are passed together to the mapper.
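A minimal sketch of such a reader (my own illustration, not stock Hadoop code), assuming the new org.apache.hadoop.mapreduce API and that a record never needs to span an input split; it simply wraps the stock LineRecordReader and hands the mapper five lines at a time:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class FiveLineRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader = new LineRecordReader();
    private LongWritable key;
    private Text value;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        StringBuilder record = new StringBuilder();
        int lines = 0;
        while (lines < 5 && lineReader.nextKeyValue()) {
            if (lines == 0) {
                // Key of the record = byte offset of its first line.
                key = new LongWritable(lineReader.getCurrentKey().get());
            }
            record.append(lineReader.getCurrentValue().toString()).append('\n');
            lines++;
        }
        if (lines == 0) {
            return false;                        // nothing left in this split
        }
        value = new Text(record.toString());     // 5 lines (or fewer at EOF) = 1 record
        return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
    @Override public void close() throws IOException { lineReader.close(); }
}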
EDIT 1
Your understanding is on the right path:
InputFormat: opens the data source and splits the data into chunks
RecordReader: actually parses the chunks into Key/Value pairs.
From the JavaDoc of InputFormat:
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to extract input records from the logical InputSplit for processing by the Mapper.
From the 1st point: one block is not exactly the input to a mapper; the input is rather an InputSplit. For example, think about a zip file (compressed with GZIP). A zip file is a collection of ZipEntry objects (one per compressed file), and a zip file is non-splittable from a processing perspective. That means the InputSplit for a zip file will consist of several blocks (in fact all the blocks used to store that particular zip file). This happens at the expense of data locality, i.e. even though the zip file is broken up and stored in HDFS on different nodes, the whole file is moved to the node running the mapper.
The ZipFileInputFormat provides the default record reader implementation, ZipFileRecordReader, which has the logic to read one ZipEntry (one compressed file) per mapper key-value pair.
You've already basically answered this for yourself, so hopefully my explanation can help.
A record is a MapReduce-specific term for a key-value pair. A single MapReduce job can have several different types of records - in the wordcount example, the mapper input record type is <Object, Text>, the mapper output/reducer input record type is <Text, IntWritable>, and the reducer output record type is also <Text, IntWritable>.
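For concreteness, here is roughly how those record types show up as the generic type parameters of the word-count mapper and reducer (a trimmed-down sketch modeled on the stock WordCount example):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Input record: <Object, Text> (byte offset, one line of text)
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);                 // output record: <Text, IntWritable>
        }
    }
}

// Input and output records are both <Text, IntWritable>
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));     // output record: <Text, IntWritable>
    }
}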
The InputFormat is responsible for defining how the block is split into individual records. As you identified, there are many InputFormats, and each is responsible for implementing code that manages how it splits the data into records.
The block itself has no concept of records, as the records aren't created until the data is read by the mapper. You could have two separate MapReduce jobs that read the same block but use different InputFormats. As far as HDFS is concerned, it's just storing a single big blob of data.
There's no "condition" for defining how the data is split - you can make your own InputFormat and split the data however you want.

Record definition in MapReduce on different types of data-sets in Hadoop?

I want to understand the definition of a record in Hadoop MapReduce for data types other than Text.
Typically, for Text data a record is a full line terminated by a newline.
Now, if we want to process XML data, how does this data get processed? That is, what would the record definition be that the mapper works on?
I have read that there are the concepts of InputFormat and RecordReader, but I didn't get them well.
Can anyone help me understand the relationship between InputFormat and RecordReader for various types of data sets (other than text), and how the data gets converted into the records the mapper works on?
Let's start with some basic concepts.
From the perspective of a file:
1. A file is a collection of rows.
2. A row is a collection of one or more columns, separated by a delimiter.
3. A file can be of any format: text file, Parquet file, ORC file.
Different file formats store rows (columns) in different ways, and the choice of delimiter also differs.
From the perspective of HDFS:
1. A file is a sequence of bytes.
2. HDFS has no idea of the logical structure of the file, i.e. rows and columns.
3. HDFS doesn't guarantee that a row will be contained within one HDFS block; a row can span two blocks.
InputFormat: the code that knows how to read the file chunks from splits and, at the same time, ensures that if a row extends into the next split, it is considered part of the first split.
RecordReader: as you read a split, some code (the RecordReader) must know how to interpret a row from the bytes read from HDFS.
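To make the relationship concrete, the InputFormat is simply the class you hand to the job; it in turn supplies the RecordReader that turns split bytes into key/value records. A minimal driver sketch of my own (identity mapper, map-only, stock classes only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class RecordDemoDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "record-demo");
        job.setJarByClass(RecordDemoDriver.class);

        // The InputFormat is the piece you choose here; it supplies the
        // RecordReader, and the RecordReader decides what one record is.
        // TextInputFormat -> LineRecordReader -> one line = one record.
        // An XML- or Avro-aware InputFormat would be swapped in instead.
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(Mapper.class);        // identity mapper, demo only
        job.setNumReduceTasks(0);                // map-only
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}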
For more info:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/

Control the split size with Avro Input Format in Hadoop

I must read Avro records serialized in Avro files in HDFS. To do that I use AvroKeyInputFormat, so my mapper is able to work with the records it reads as keys.
My question is: how can I control the split size? With the text input format this consists of defining the size in bytes. Here I need to define how many records every split will consist of.
I would like to handle every file in my input directory as one big file. Do I have to use CombineFileInputFormat? Is it possible to use it with Avro?
Splits honor logical record boundaries, and the min and max split sizes are in bytes - the text input format won't break lines in a text file even though the split boundaries are defined in bytes.
To have each file in its own split, you can either set the minimum split size to a value larger than your largest file (e.g. Long.MAX_VALUE) or override the isSplitable method in your input format and return false.
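Both options sketched, assuming the new mapreduce API and avro-mapred's org.apache.avro.mapreduce.AvroKeyInputFormat (the class name WholeFileAvroInputFormat is mine):

import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;

// Option 1: refuse to split Avro files at all - one file, one split, one mapper.
public class WholeFileAvroInputFormat<T> extends AvroKeyInputFormat<T> {
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}

// Option 2 (driver side): raise the minimum split size above any file size,
// so computeSplitSize() = max(minSize, min(maxSize, blockSize)) never falls
// below a whole file:
//
//   org.apache.hadoop.mapreduce.lib.input.FileInputFormat
//       .setMinInputSplitSize(job, Long.MAX_VALUE);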

How to handle multiline records for an InputSplit?

I have a 100 TB text file with multiline records, and we are not told how many lines each record takes. One record can be 5 lines, another may be 6 lines, another may be 4 lines; the number of lines may vary for each record.
So I cannot use the default TextInputFormat. I have written my own InputFormat and a custom RecordReader, but my confusion is: when splits happen, I am not sure each split will contain a full record. Part of a record can go in split 1 and another part in split 2. But this is wrong.
So, can you suggest how to handle this scenario so that I can guarantee that a full record goes into a single InputSplit?
Thanks in advance
-JE
You need to know if the records are actually delimited by some known sequence of characters.
If you know this you can set the textinputformat.record.delimiter config parameter to separate the records.
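For example, assuming a Hadoop version whose LineRecordReader honors textinputformat.record.delimiter, a driver-side sketch might look like this (the blank-line delimiter is purely illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MultilineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Purely illustrative delimiter: records separated by a blank line.
        // Everything up to the next blank line becomes one "line", i.e. one
        // record, as far as TextInputFormat's LineRecordReader is concerned.
        conf.set("textinputformat.record.delimiter", "\n\n");
        Job job = Job.getInstance(conf, "multiline-records");
        // ... set mapper, input/output paths, etc. as usual ...
    }
}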
If the records aren't character-delimited, you'll need some extra logic that, for example, counts a known number of fields (if there is a known number of fields) and presents that as a record. This usually makes things more complex, error-prone, and slow, as there's a lot more text processing going on.
Try determining if the records are delimited. Perhaps posting a short example of a few records would help.
In your record reader you need to define an algorithm by which you can:
Determine whether you're in the middle of a record
Scan over that partial record and read the next full record
This is similar to what TextInputFormat's LineRecordReader already does - when the input split has a non-zero offset, the line record reader scans forward from that offset for the first newline it finds and then treats what follows that newline as the first record it will emit. Tied to this, if the block ends before the end of the file, the line record reader will read up to and past the end of the block to find the line-terminating character for the current record.
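A tiny standalone illustration of that boundary rule (no Hadoop involved; the '>' record marker is purely an example): a reader that lands mid-record skips ahead to the next record start, while the previous split's reader reads past its nominal end to finish the record it began, so every record is emitted exactly once.

public class SplitBoundaryDemo {
    public static void main(String[] args) {
        // Toy "file": three records, each starting with a '>' header line.
        String file = ">r1|a\nAAAA\nBBBB\n>r2|b\nCCCC\n>r3|c\nDDDD\n";
        int splitStart = 9;   // pretend the second split begins mid-record

        // Rule 1: a reader that starts mid-record skips forward to the next
        // record boundary. (splitStart - 1 so a record beginning exactly at
        // the split start is not skipped.)
        int firstRecord = file.indexOf("\n>", splitStart - 1) + 1;
        System.out.println("split 2 emits records starting at offset " + firstRecord + ":");
        System.out.println(file.substring(firstRecord));

        // Rule 2: the previous split's reader keeps reading past its nominal
        // end until that same boundary, so record r1 is emitted exactly once.
        System.out.println("split 1 emits:");
        System.out.println(file.substring(0, firstRecord));
    }
}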

how to perform ETL in map/reduce

How do we design the mapper/reducer if I have to transform a text file line by line into another text file?
I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details:
The file is usually structured like this - the first row contains a comma-separated list of column names; the second and remaining rows specify values against those columns.
In some rows the trailing column values might be missing, e.g. if there are 15 columns then values might be specified only for the first 10 columns.
I have about 5 input files which I need to transform and aggregate into one file. The transformations are specific to each of the 5 input files.
How do I pass contextual information like file name to the mapper/reducer program?
Transformations are specific to columns, so how do I remember the columns mentioned in the first row and then correlate and transform the values in the other rows?
Split file into lines, transform (map) each line in parallel, join (reduce) the resulting lines into one file?
You cannot rely on the column info in the first row. If your file is larger than an HDFS block, it will be broken into multiple splits and each split handed to a different mapper. In that case, only the mapper receiving the first split will receive the first row with the column info; the rest won't.
I would suggest putting the file-specific metadata in a separate file and distributing it as side data. Your mapper or reducer tasks can then read the metadata file.
Through the Hadoop Context object you can get hold of the name of the file being processed by a mapper. Between all of these, I think you have all the context information you are referring to, and you can do file-specific transformations. Even though the transformation logic differs for different files, the mapper output needs to have the same format.
If you are using a reducer, you can set the number of reducers to one to force all output to aggregate into one file.
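A sketch of the pieces mentioned above (new mapreduce API; the metadata file name and path are hypothetical): the mapper picks up the name of the file it is processing from its InputSplit, while the driver ships the per-file metadata as side data and forces a single reducer so the aggregated output lands in one file.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class EtlMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String fileName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Name of the input file this map task is working on.
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        // Files shipped with job.addCacheFile(...) appear in the task's
        // working directory, e.g. new java.io.File("transforms.properties")
        // (hypothetical file name).
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Apply the transformation specific to fileName here, then emit in
        // the common output format shared by all five inputs.
        context.write(new Text(fileName), value);
    }
}

// Driver side:
//   job.addCacheFile(new java.net.URI("/meta/transforms.properties")); // side data (hypothetical path)
//   job.setNumReduceTasks(1);                                          // single aggregated output file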
