Hadoop Multi Output maintaining record order emitted from map - hadoop

I am trying to achieve Multiple Output from the reducer in hadoop. The files are created properly, the problem is the header and footer of the file does not comes in proper places(i..e, the ordering of the records that are emitted from the map are changed).I am having one mapper and multiple reducers.
I tried to add an index(like an integer) to each map records and remove it from the reducer keys, but it was giving as file already exists exception. I was using a custom comparator to sort the keys based on the index values.
Any ideas on what am i missing.

Related

How to map the value in Mapper

I have a file with data like
City,Quarter,Classification,Index
Bordeux,Q1,R,3
Krakow,Q1,U,2
Halifax,Q1,U,4
I need to find out the highest Index in each Classification and write them to two separate files. The output should be
Bordeux,Q1,R,3
Halifax,Q1,U,4
How to load the data in Mapper as it requires a key/value pair. In mapper it seems programmer should not do any modification to data. So, how to load it in Context object.
I think the data type of key or value is not changed in Reducer. If so, I'm going to infuse my logic to find the top records, then how to organize into a context object there.
I don't have clue on how to proceed.
Necessary pointers will help me to proceed further.
In your case when you read the file in Mapper the input key is the ObjectId of the line and value is the line itself. So in other words, in each line of the file will be received in Mapper as value field. Now, the output (key,value) of Mapper should be (Classification, Index).
The output of Mapper will become input (key,value) of reducer. So reducer will receive (Classification,Iterable) as input. So for each classification, you can iterate over Index List to get the Max and output of the reducer will be (Classification,Max)
In this case, output key and value type will be same in for Mapper and Reducer.
However, regarding writing it to separate lines: Separate files will be generated only if every key is routed to different reducer instance. So in your case, the total number of reducers should be equal to the total number of unique classifications(Not in good terms of resource utilization though). So, you have to write a custom partitioner to make it happen

Custom Partitioner vs MultipleOutputFormat

I am new to map reduce, I would like to know what is the difference between creating multiple outputs based on certain conditions using a custom partitioner and MultipleOutputs concept in Map reduce.
Using a custom partitions, you will sent the data to a different reducer and each reducer will write one file with all the data processed by it.
part-r-00001, part-r-00002 . . .
With MiltipleOutputs each reducer will be available to write different files (Multiple outputs) with a custom name.
Tag1-r-00001, Tag2-r-00001, Tag1-r-00002, Tag2-r-00002 . . .
Customer partition is used to group related data together before the processing, and multiple outputs is to split the data in the output after the processing.
Using MultipleOutputs you will be able to identify the data without need to keep the track of the reducer number and in the future if you need to increment or reduce the number of reduers (as the data change), you still will be able to identify the old data by the prefix.

Filtering AVRO Data from 2 datasets

Use-case:
I had 2 dataset/fileset Machine (Parent) and Alerts (Child).
Their data is also stored in two avro files viz machine.avro and alert.avro.
Alert schema had machineId : column type int.
How can I filter data from machine if there is a dependency on alert too? (one-to-many).
e.g. get all machines where alert time is between 2 time-stamp.
Any e.g. with source will be great help...
Thanks in advance...
Got answer in another thread....
Mapping through two data sets with Hadoop
Posting comments from that thread...
According to the documentation, the MapReduce framework includes the following steps:
Map
Sort/Partition
Combine (optional)
Reduce
You've described one way to perform your join: loading all of Set A into memory in each Mapper. You're correct that this is inefficient.
Instead, observe that a large join can be partitioned into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. Sorted Map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.
To finish your join, the Reducer needs only to output the key and either the updated value from Set B, if it exists; otherwise, output the key and the original value from Set A. To distinguish between values from Set A and Set B, try setting a flag on the output value from the Mapper.

Increasing mapper in pig

I am using pig to load data from Cassandra using CqlStorage. i have 4 data nodes each can have 7 mappers, there is ~30 million data in Cassandra. When i run like this
LOAD 'cql://keyspace/columnfamily' using CqlStorage it takes 27 mappers to run .
But if i give where clause in the load function like
LOAD 'cql://keyspace/columnfamily?where_clause=id%3D100' using CqlStorage it always takes one mapper.
Can any one help me in increasing mapper
It looks from your WHERE clause like your map input will only be a single key, which would be the reason why you only get one mapper. Hadoop will allocate mappers based on the number of input keys. If you have only one input key, additional mappers will do nothing.
The bottom line is that if you specify your partition key in the where clause, you will get one mapper (since that's the way it gets distributed). Based on the comments I presume you are doing analysis for more than just one student, so there's no reason you'd be specifying the partition key. You also don't seem to have any columns that make sense for a secondary index. So I'm not sure why you even have a where clause.
It looks from your data model like you'll have to map over all your data to get aggregate marks for a combination of student and time range. It's possible you could change to a time-series data model and successfully filter in the where clause, but your current model doesn't support this.

What happens when identical keys are passed to the Mapper in Hadoop

What is the significance of data being passed as key/value pairs to the mapper also in the Hadoop Map Reduce framework. I understand that key/value pairs hold significance when they are passed to the reducers as they cater to the partitioning of data coming from the mappers. Values belonging to the same key go as a list from the mapper to the reducer stage. But how are the keys used before the mapper stage itself? What happens to values belonging to the same key? If we don't define a custom input format, I presume Hadoop takes in the record number from the input file as the key and the text line as the value in the mapper function. But in case we decide to implement a custom input format there is a custom selection of the keys and there could be a possibility where we have values corresponding to the same key.
How does phenomenon get handled in the mapper stage? Does the mapper ignore duplicate records and treats them as separate records or does it only choose one record per key?
An input split is a chunk of the input that is processed by a
single map. Each map processes a single split. Each split is divided into records, and
the map processes each record—a key-value pair—in turn.
So mapper treats records with same key as separate records.

Resources