MapReduce: keep input ordering - Hadoop

I tried to implement an application using Hadoop which processes text files. The problem is that I cannot keep the ordering of the input text. Is there any way to choose the hash function? This problem could easily be solved by assigning a partition of the input to each mapper and then sending the partitions to the reducers. Is this possible with Hadoop?

The base idea of MapReduce is that the order in which things are done is irrelevant.
So you cannot (and do not need to) control the order in which:
the input records go through the mappers.
the key and related values go through the reducers.
The only thing you can control is the order in which the values are placed in the iterator that is made available in the reducer.
This is done using a construct called "secondary sort".
A quick Google search for this term turns up several places where you can continue reading.
I like this blog post: link
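For reference, here is a minimal sketch of the secondary-sort wiring, under the assumption that the mapper emits a composite Text key of the form "naturalKey<TAB>sortField": Hadoop sorts on the full composite key, while the partitioner and the grouping comparator below look only at the natural part, so each reduce() call sees its values in the order of the sort field. Class and field names are illustrative, not from the blog post.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortSketch {

    // Send every composite key with the same natural key to the same reducer.
    public static class NaturalKeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String naturalKey = key.toString().split("\t", 2)[0];
            return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group composite keys by their natural key only, so one reduce() call sees
    // all values for that key, already ordered by the secondary field.
    public static class NaturalKeyGroupingComparator extends WritableComparator {
        public NaturalKeyGroupingComparator() {
            super(Text.class, true);
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String left = ((Text) a).toString().split("\t", 2)[0];
            String right = ((Text) b).toString().split("\t", 2)[0];
            return left.compareTo(right);
        }
    }

    // Driver wiring:
    // job.setPartitionerClass(SecondarySortSketch.NaturalKeyPartitioner.class);
    // job.setGroupingComparatorClass(SecondarySortSketch.NaturalKeyGroupingComparator.class);
}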

Related

Handling incremental data - Hadoop

We have 5 years of data in the cluster and we load data every day. The data that gets added each day might contain duplicate data, partially modified data, etc.
1. How to handle duplicate data - should that be handled through the high-level programming interfaces (Pig, Hive, etc.) or through some other alternative?
2. Say there is a use case to find out what has changed between two records, given the key that identifies the row.
What is the best way to model the data, and using which Hadoop ecosystem components?
How to handle duplicate data
It's very hard to remove duplicates from raw HDFS data,
so I guess your approach is right: remove them using Pig or Hive while loading the data.
Say there is a use case to find out what has changed between two records, given the key that identifies the row.
For this case, do you mean that the two records have the same key?
And what kind of changes do you want to capture?
If you need to remove duplicates and also compute the delta between two records when you know the key, you should have some criteria for which data to discard in the case of partially changed data.
In both scenarios you can get a handle on the key and write logic to remove duplicates. MapReduce seems to be a good choice, given its parallelism, performance, and ability to group data by key. Most of your requirements could be handled in the reducer.
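As a rough illustration of that reducer-based approach (not the poster's actual pipeline), here is a sketch that deduplicates by a business key, assuming CSV records whose first field is that key; the "keep the last one seen" rule is only a placeholder for whatever criterion you choose, e.g. the latest timestamp.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupByKey {

    // Emit (businessKey, fullRecord) so all versions of a record meet in one reduce call.
    public static class KeyExtractorMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",", 2);  // assumption: CSV, key first
            ctx.write(new Text(fields[0]), line);
        }
    }

    // Keep exactly one record per key; replace the placeholder rule with your own criterion.
    public static class PickOneReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> records, Context ctx)
                throws IOException, InterruptedException {
            Text chosen = null;
            for (Text record : records) {
                chosen = new Text(record);  // copy, because Hadoop reuses the Text instance
            }
            ctx.write(chosen, NullWritable.get());
        }
    }
}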
See if Sqoop-merge fits your use case.
From the doc:
The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.

Can mapper output keys be routed to a specific node in Hadoop MR

I need to process some data in MR and load it into an external system that sits on the same physical machines as my MR nodes. Right now I run the job and read the output from HDFS and re-route individual records back out onto the desired nodes.
Is it possible to define some mapping such that records with key X always go straight to the desired node Y? Simply put, I want to control where Hadoop routes the post-sort, partitioned groups.
Not easily. The only way I know of to affect the physical location of a block of data on the fly is to implement a custom BlockPlacementPolicy. I'll just throw out some ideas for your use case.
A custom BlockPlacementPolicy can route blocks based on the file name
The file name of a partition can be modified using MultipleOutputs in MapReduce
Keys can be routed to specific partitions using a custom Partitioner (a minimal sketch follows below)
It seems like you can get the result you're looking for, but it won't be pretty.
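Of those pieces, the custom Partitioner is the simplest to sketch. The example below is only illustrative: it assumes keys of the form "routeTag|payload" and pins each tag to a reducer slot; actually landing the output on a specific physical node would still require the BlockPlacementPolicy work described above.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FixedRoutePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Assumption: keys look like "routeTag|payload"; all keys with the same
        // tag end up in the same partition, and therefore the same reducer.
        String routeTag = key.toString().split("\\|", 2)[0];
        return (routeTag.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// Driver: job.setPartitionerClass(FixedRoutePartitioner.class);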

Filtering AVRO Data from 2 datasets

Use-case:
I have 2 datasets/filesets: Machine (parent) and Alert (child).
Their data is stored in two Avro files, viz. machine.avro and alert.avro.
The Alert schema has a machineId column of type int.
How can I filter data from Machine when there is a dependency on Alert too (one-to-many)?
E.g. get all machines where the alert time is between two timestamps.
Any example with source code would be a great help...
Thanks in advance...
I got the answer in another thread:
Mapping through two data sets with Hadoop
Posting the comments from that thread...
According to the documentation, the MapReduce framework includes the following steps:
Map
Sort/Partition
Combine (optional)
Reduce
You've described one way to perform your join: loading all of Set A into memory in each Mapper. You're correct that this is inefficient.
Instead, observe that a large join can be partitioned into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. Sorted Map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.
To finish your join, the Reducer only needs to output the key with the updated value from Set B if it exists, or otherwise the key with the original value from Set A. To distinguish between values from Set A and Set B, set a flag on the output value in the Mapper.
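A minimal sketch of that reduce side, assuming each mapper has already flagged its output value with "A<TAB>" or "B<TAB>" to mark the source set (the flag convention and class name are made up for illustration):
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlaggedJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        String fromA = null;
        String fromB = null;
        for (Text value : values) {
            String[] flagged = value.toString().split("\t", 2);  // [flag, payload]
            if ("B".equals(flagged[0])) {
                fromB = flagged[1];
            } else {
                fromA = flagged[1];
            }
        }
        // Prefer the updated value from Set B; otherwise fall back to Set A's original.
        if (fromB != null) {
            ctx.write(key, new Text(fromB));
        } else if (fromA != null) {
            ctx.write(key, new Text(fromA));
        }
    }
}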

Using Hadoop to process data from multiple datasources

Do MapReduce and the other Hadoop technologies (HBase, Hive, Pig, etc.) lend themselves well to situations where you have multiple input files and data needs to be compared between the different data sources?
In the past I've written a few MapReduce jobs using Hadoop and Pig. However, those tasks were quite simple since they involved manipulating only a single dataset. Our current requirements dictate that we read data from multiple sources, compare various data elements against another data source, and then report on the differences. The datasets we are working with are in the region of 10 million - 60 million records, and so far we haven't managed to make these jobs fast enough.
Is there a case for using MapReduce to solve such issues, or am I going down the wrong route?
Any suggestions are much appreciated.
I guess I'd preprocess the different datasets into a common format (being sure to include a "data source" id column with a single unique value for each row coming from the same dataset). Then move the files into the same directory, load the whole dir and treat it as a single data source in which you compare the properties of rows based on their dataset id.
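One hedged way to add that "data source" id without preprocessing outside Hadoop is to let a mapper prepend the name of the file each record came from; the CSV layout and class name below are assumptions, not part of the original answer.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TagWithSourceMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // The file each split came from serves as the "data source" id column.
        String source = ((FileSplit) ctx.getInputSplit()).getPath().getName();
        ctx.write(new Text(source + "," + line.toString()), NullWritable.get());
    }
}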
Yes, you can join multiple datasets in a mapreduce job. I would recommend getting a copy of the book/ebook Hadoop In Action which addresses joining data from multiple sources.
When you have multiple input files you can use the MapReduce API method FileInputFormat.addInputPaths(), which can take a comma-separated list of multiple files, as below:
FileInputFormat.addInputPaths(job, "dir1/file1,dir2/file2,dir3/file3");
You can also pass multiple inputs into a mapper in Hadoop using the distributed cache; more info is described here: multiple input into a Mapper in hadoop
If I am not misunderstanding, you are trying to normalize structured records coming in from several inputs and then process them. Based on that, I think you really need to look at this article, which helped me in the past. It covers how to normalize data using Hadoop/MapReduce in the following steps:
Step 1: Extract the column-value pairs from the original data (a sketch of this step follows the list).
Step 2: Extract the column-value pairs not in the master ID file.
Step 3: Calculate the maximum ID for each column in the master file.
Step 4: Calculate a new ID for the unmatched values.
Step 5: Merge the new IDs with the existing master IDs.
Step 6: Replace the values in the original data with IDs.
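As a rough sketch of step 1 only (column names, delimiter, and class name are assumptions), a mapper can explode each record into (columnName, value) pairs:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ColumnValueMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Assumption: CSV input with a fixed, known column order.
    private static final String[] COLUMNS = {"name", "address", "city"};

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        for (int i = 0; i < COLUMNS.length && i < fields.length; i++) {
            ctx.write(new Text(COLUMNS[i]), new Text(fields[i]));
        }
    }
}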
Using MultipleInputs we can do this.
MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, path2, TextInputFormat.class, Mapper2.class);
job.setReducerClass(Reducer1.class);
// FileOutputFormat.setOutputPath(job, outputPath); // set the output path here
If both datasets share a common key, they can be joined in the reducer, which is where the necessary logic goes.

Want to compare two consecutive jobs on Hadoop

I want to know whether I can compare two consecutive jobs in Hadoop; if not, I would appreciate it if anyone could tell me how to proceed. To be precise, I want to compare jobs in terms of what exactly the two jobs did. The reason for doing this is to build statistics about how many jobs executed on Hadoop were similar in behavior, for example how many times the same sorting function was executed on the same input.
For example, suppose the first job did something like SortList(A) and some other job did SortList(A) + Group(result(SortList(A))). Now I am wondering whether Hadoop stores some mapping somewhere like JobID X -> SortList(A).
So far, I have approached this problem by looking for the entry point in Hadoop and trying to understand how a job is created and what information is kept with a job ID, and in what form (code or some description), but I was not able to figure it out.
Hadoop's Counters might be a good place to start. You can define your own counter names (like each counter name is a data set you are working on) and increment that counter each time you perform a sort on it. Finding which data set you are working on, however, may be the more difficult task.
Here's a tutorial I found:
http://philippeadjiman.com/blog/2010/01/07/hadoop-tutorial-series-issue-3-counters-in-action/
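A small sketch of that Counters idea, with a hypothetical enum describing the operations a job performs (the enum values and mapper are illustrative, not a built-in Hadoop feature):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OperationCountingMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Hypothetical counter group describing what the job did.
    public enum JobBehavior { SORT_LIST_A, GROUP_RESULT }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // ... do the actual sort/group work here, then record that it happened ...
        ctx.getCounter(JobBehavior.SORT_LIST_A).increment(1);
        ctx.write(line, new Text(""));  // placeholder output
    }
}
// After the job completes, job.getCounters() exposes these totals, so separate
// jobs can be compared by the operations they reported.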
No. Hadoop jobs are just programs, and they can have any side effects: they can write ordinary files, HDFS files, or to a database. Nothing in Hadoop records all of their activities. All Hadoop manages is the scheduling and the flow of data.
