How to perform ETL in map/reduce - Hadoop

How do we design the mapper/reducer if I have to transform a text file line by line into another text file?
I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details:
The file is usually structured like this: the first row contains a comma-separated list of column names, and the second and subsequent rows specify values against those columns.
In some rows the trailing column values might be missing, e.g. if there are 15 columns then values might be specified only for the first 10 columns.
I have about 5 input files which I need to transform and aggregate into one file. The transformations are specific to each of the 5 input files.
How do I pass contextual information like file name to the mapper/reducer program?
Transformations are specific to columns, so how do I remember the columns mentioned in the first row and then correlate and transform the values in the following rows?

Split the file into lines, transform (map) each line in parallel, and join (reduce) the resulting lines into one file?

You cannot rely on the column info in the first row. If your file is larger than an HDFS block, it will be broken into multiple splits and each split handed to a different mapper. In that case, only the mapper receiving the first split will see the row with the column info; the rest won't.
I would suggest putting the file-specific metadata in a separate file and distributing it as side data. Your mapper or reducer tasks could then read the metadata file.
Through the Hadoop Context object, you can get hold of the name of the file being processed by a mapper. Between all of these, I think you have all the context information you are referring to, and you can do file-specific transformations. Even though the transformation logic differs between files, the mapper output needs to have the same format.
If you are using a reducer, you could set the number of reducers to one to force all output to aggregate into one file.
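A minimal mapper sketch along these lines (the class name and the transform method are illustrative, not from the question): the file name comes from the InputSplit, and the file-specific metadata could be loaded in setup().

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TransformMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private String fileName;

    @Override
    protected void setup(Context context) {
        // name of the file this mapper's split belongs to
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        // file-specific metadata (e.g. the column list) could be loaded here,
        // for example from a side-data file shipped via job.addCacheFile(...) in the driver
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String transformed = transform(fileName, line.toString());   // per-file logic
        context.write(NullWritable.get(), new Text(transformed));
    }

    private String transform(String file, String line) {
        // placeholder: whatever the per-file logic is, the output format must be
        // identical for all input files so a single reducer can merge them
        return line;
    }
}

In the driver, job.setNumReduceTasks(1) then forces everything into one output file.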

Related

How Blocks get converted into Records and what exactly is the definition of a Record in Hadoop

I am learning Hadoop and started with HDFS and MapReduce. I understand the basics of HDFS and MapReduce.
There is one particular point which I am not able to understand, explained below:
Large data set --> Stored in HDFS as Blocks, say for example B1, B2, B3.
Now, when we run a MR Job, each mapper works on a single block (assuming 1 mapper processes a block of data for simplicity)
1 Mapper ==> processes 1 block
I also read that the block is divided into records, and for a given block the same mapper is called for each record within that block (of data).
But what exactly is a Record?
Since a given block has to be "broken" down into records, how does that block get broken into records, and what constitutes a record?
In most of the examples, I have seen a record being a full line delimited by a newline.
My doubt is: what decides the "conditions" on the basis of which something can be treated as a record?
I know there are many InputFormats in Hadoop, but my question is: what are the conditions that decide whether something is considered a record?
Can anyone help me understand this in simple words?
You need to understand the concept of RecordReader.
A block is a hard-bounded number of bytes of data stored on disk. So a block of 256 MB means exactly a 256 MB piece of data on the disk.
The mapper gets one record from the block, processes it, and gets the next one; the onus of defining a record is on the RecordReader.
Now, what is a record? If I use the analogy of a block being a table, a record is a row in that table.
Now think about this: how do you process a block of data in the mapper? After all, you cannot write logic against random bytes of data. From the mapper's perspective, you can only write logic if the input data "makes some sense", i.e. has a structure or forms a logical chunk of data (from the mapper logic's perspective).
That logical chunk is called a record. In the default implementation, one line of data is that logical chunk. But sometimes it does not make sense for one line of data to be the logical unit. Sometimes there is no line at all (say it's MP4-type data and the mapper needs one song as input)!
Let's say your mapper needs to work on 5 consecutive lines together. In that case, you need to provide a RecordReader implementation in which 5 lines form one record and are passed together to the mapper.
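A minimal, illustrative sketch of such a RecordReader (the class names are made up for this example, and it glosses over aligning the 5-line groups with split boundaries): it wraps the standard LineRecordReader and concatenates five lines into one value.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class FiveLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new FiveLineRecordReader();
    }

    public static class FiveLineRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
            lineReader.initialize(split, ctx);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder block = new StringBuilder();
            int lines = 0;
            while (lines < 5 && lineReader.nextKeyValue()) {
                if (lines == 0) key.set(lineReader.getCurrentKey().get());   // offset of the first line
                block.append(lineReader.getCurrentValue().toString()).append('\n');
                lines++;
            }
            if (lines == 0) return false;        // nothing left in this split
            value.set(block.toString());         // five lines become one record
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
        @Override public void close() throws IOException { lineReader.close(); }
    }
}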
EDIT 1
Your understanding is on the right path.
InputFormat: opens the data source and splits the data into chunks
RecordReader: actually parses the chunks into Key/Value pairs.
From the JavaDoc of InputFormat:
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to extract input records from the logical InputSplit for processing by the Mapper.
From the first point: one block is not exactly the input to the mapper; it is rather an InputSplit. For example, think about a ZIP file. A ZIP file is a collection of ZipEntry items (one per compressed file). A ZIP file is non-splittable from a processing perspective, which means the InputSplit for a ZIP file will span several blocks (in fact, all the blocks used to store that particular ZIP file). This happens at the expense of data locality, i.e. even though the ZIP file is broken up and stored in HDFS on different nodes, the whole file will be moved to the node running the mapper.
The ZipFileInputFormat provides a default RecordReader implementation, ZipFileRecordReader, which has the logic to read one ZipEntry (one compressed file) per mapper key-value pair.
You've already basically answered this for yourself, so hopefully my explanation can help.
A record is a MapReduce-specific term for a key-value pair. A single MapReduce job can have several different types of records: in the wordcount example, the mapper input record type is <Object, Text>, the mapper output/reducer input record type is <Text, IntWritable>, and the reducer output record type is also <Text, IntWritable>.
The InputFormat is responsible for defining how the block is split into individual records. As you identified, there are many InputFormats, and each is responsible for implementing code that manages how it splits the data into records.
The block itself has no concept of records, as the records aren't created until the data is read by the mapper. You could have two separate MapReduce jobs that read the same block but use different InputFormats. As far as HDFS is concerned, it's just storing a single big blob of data.
There's no "condition" for defining how the data is split - you can make your own InputFormat and split the data however you want.
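To tie the record types above to code, here is a sketch mirroring the standard wordcount mapper: each call to map() receives one input record (offset key, line value) as produced by the InputFormat's RecordReader, and each context.write() emits one <Text, IntWritable> output record.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // "value" is one input record, here a line of text from TextInputFormat
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // one <Text, IntWritable> output record
        }
    }
}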

Record definition in MapReduce on different types of data-sets in Hadoop?

I want to understand the definition of Record in MapReduce Hadoop, for data types other than Text.
Typically, for text data a record is a full line terminated by a newline.
Now, if we want to process XML data, how does this data get processed? That is, what would the record definition be on which a mapper would work?
I have read that there are the concepts of InputFormat and RecordReader, but I didn't understand them well.
Can anyone help me understand the relationship between InputFormat and RecordReader for various types of data sets (other than text), and how the data gets converted into records that the mapper works on?
Let's start with some basic concepts.
From the perspective of a file:
1. File -> collection of rows.
2. Row -> collection of one or more columns, separated by a delimiter.
3. A file can be of any format => text file, Parquet file, ORC file.
Different file formats store rows (columns) in different ways, and the choice of delimiter also differs.
From the perspective of HDFS:
1. A file is a sequence of bytes.
2. It has no idea of the logical structure of the file, i.e. rows and columns.
3. HDFS doesn't guarantee that a row will be contained within one HDFS block; a row can span two blocks.
InputFormat: the code which knows how to read the file chunks from splits, and at the same time ensures that if a row extends into another split, it is considered part of the first split.
RecordReader: as you read a split, some code (the RecordReader) must understand how to interpret a row from the bytes read from HDFS.
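For the XML case specifically, one common approach (sketched below; the </record> closing tag is an assumption about the data, and dedicated XML input formats exist as well) is to keep TextInputFormat but change its record delimiter, so each record handed to the mapper is one XML element rather than one line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class XmlRecordJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the line record reader splits on this string instead of '\n', so each
        // map() call sees the text leading up to one closing </record> tag
        conf.set("textinputformat.record.delimiter", "</record>");

        Job job = Job.getInstance(conf, "xml-records");
        job.setInputFormatClass(TextInputFormat.class);
        // mapper, reducer and input/output paths would be configured as usual
    }
}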
For more info:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/

How to output multiple s3 files in Parquet

Writing Parquet data can be done with something like the following. But if I'm trying to write to more than just one file, and moreover want to output to multiple S3 files so that reading a single column does not read all the S3 data, how can this be done?
AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);
GenericData.Record record = new GenericRecordBuilder(schema)
    .set("name", "myname")
    .set("favorite_number", i)
    .set("favorite_color", "mystring")
    .build();
writer.write(record);
For example, what if I want to partition by a column value, so that all the data with a favorite_color of red goes into one file and the data with blue into another, to minimize the cost of certain queries? There should be something similar in a Hadoop context. All I can find are things that mention Spark using something like
df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])
But I can find no equivalent to partitionBy in plain Java with Hadoop.
In a typical Map-Reduce application, the number of output files will be the same as the number of reducers in your job. So if you want multiple output files, set the number of reducers accordingly:
job.setNumReduceTasks(N);
or alternatively via the system property:
-Dmapreduce.job.reduces=N
I don't think it is possible to have one column per file with the Parquet format. The internal structure of a Parquet file is first split into row groups, and it is only these row groups that are then split by column.
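If the goal is simply one output file per favorite_color value, one sketch under that constraint (the class name and the color list are assumptions for illustration): key the map output by favorite_color, run one reducer per expected color, and supply a Partitioner so each color lands in its own reducer and therefore its own file.

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ColorPartitioner extends Partitioner<Text, Text> {
    // the expected colors; one reducer (and one output file) per entry
    private static final List<String> COLORS = Arrays.asList("red", "blue", "green");

    @Override
    public int getPartition(Text color, Text record, int numReduceTasks) {
        int idx = COLORS.indexOf(color.toString());
        // unknown colors fall back to hashing so the job still completes
        return idx >= 0 ? idx % numReduceTasks
                        : (color.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

This would be registered with job.setPartitionerClass(ColorPartitioner.class) and job.setNumReduceTasks(3), and the values would still have to go through a Parquet-capable output format (parquet-avro ships AvroParquetOutputFormat for this).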

Pass parameter from spark to input format

We have files with a specific format in HDFS. We want to process data extracted from these files within Spark. We have started to write an InputFormat in order to create the RDD; this way, we hope to be able to create an RDD from the whole file.
But each job only has to process a small subset of the data contained in the file, and I know how to extract this subset very efficiently, much more efficiently than filtering a huge RDD.
How can I pass a query filter in the form of a String from my driver to my InputFormat (the same way the Hive context does)?
Edit:
My file format is NetCDF, which stores huge matrices efficiently for multidimensional data, for example x, y, z and time. A first approach would be to extract all values from the matrix and produce an RDD line for each value. I'd like my InputFormat to extract only a small subset of the matrix (maybe 0.01%) and build a small RDD to work with. The subset could be z = 0 and a small time period. I need to pass the time period to the InputFormat, which will retrieve only the values I'm interested in.
I guess the Hive context does this when you pass an SQL query to it: only values matching the SQL query are present in the RDD, not all lines of the files.
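One common pattern for this kind of driver-to-InputFormat parameter passing (a sketch; the property name "netcdf.filter", the path, and the filter syntax are illustrative assumptions, and TextInputFormat stands in for the custom NetCDF InputFormat): set the filter string on the Hadoop Configuration handed to newAPIHadoopFile, then read it back from the Configuration inside getSplits()/createRecordReader().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SubsetDriver {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("netcdf-subset"));

        Configuration conf = new Configuration();
        // the filter travels from the driver to every task via the job Configuration
        conf.set("netcdf.filter", "z=0;time=2016-01-01..2016-01-07");

        JavaPairRDD<LongWritable, Text> rdd = sc.newAPIHadoopFile(
                "hdfs:///data/measurements.nc",   // illustrative path
                TextInputFormat.class,            // stand-in: replace with the custom NetCDF InputFormat
                LongWritable.class, Text.class, conf);

        // inside the custom InputFormat / RecordReader the same value is visible:
        //   String filter = context.getConfiguration().get("netcdf.filter");
        // so getSplits()/createRecordReader() can skip everything outside the subset
        System.out.println(rdd.count());
        sc.stop();
    }
}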

How to output multiple values with the same key in reducer?

I have a bunch of text files which are categorized, and I would like to create a sequence file for each category in which the key is the category name and the value consists of all the textual content of all the files for that category.
I have a NoSQL database with only two columns. Each row represents a file: the first column is the category name and the second is the absolute address of the text file stored on HDFS. My mapper reads the database and outputs pairs in which the key is the category and the value is the absolute address. On the reducer side, I have the addresses of all the files for each category, and I would like to create one sequence file per category in which the key is the category name and the value consists of all the textual content of all the files belonging to that category.
A simple solution is to iterate through the pairs (in the reducer), open the files one by one, append their content to a String variable, and at the end create a sequence file using MultipleOutputs. However, as the file sizes may be large, appending the content to a single String may not be possible. Is there any way to do this without using a String variable?
Since you have all the file addresses in the reducer, you can get the content of those files, append it using a StringBuilder to save memory, and then discard the StringBuilder reference. If avoiding String is your concern, a StringBuilder is a quick fix. The I/O operations involved in opening and reading the files are resource intensive; however, the data itself should be fine, given the architecture of reducers in Hadoop.
You can also think of using a combiner; however, that is mainly used to reduce the traffic between map and reduce. You could take advantage of it by preparing part of the sequence file content at the combiner and the remainder at the reducer level. Of course, this is valid only if the content can be appended as it arrives and does not depend on a specific order.
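A minimal reducer sketch along those lines (the class name and the per-category output path are illustrative; it assumes the job's output format is SequenceFileOutputFormat): the content is accumulated in a StringBuilder per category and written once through MultipleOutputs.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategoryReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text category, Iterable<Text> fileAddresses, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        StringBuilder content = new StringBuilder();
        for (Text address : fileAddresses) {
            // read one HDFS file at a time and append its text
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path(address.toString())), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    content.append(line).append('\n');
                }
            }
        }
        // one record per category, written under a per-category base path, e.g. sports/part-r-00000
        mos.write(category, new Text(content.toString()), category.toString() + "/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

To avoid empty default part-r-* files alongside the per-category outputs, the driver can wrap the output format with LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class).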

Resources