Pass the maximum key encountered across all mappers as parameter to the next job - hadoop

I have a chain of Map/Reduce jobs:
Job1 takes data with a time stamp as a key and some data as value and transforms it.
For Job2 I need to pass the maximum time stamp that appears across all mappers in Job1 as a parameter. (I know how to pass parameters to Mappers/Reducers)
I can keep track of the maximum time stamp in each mapper of Job1, but how can I get the maximum across all mappers and pass it as a parameter to Job2?
I want to avoid running a Map/Reduce Job just to determine the maximum time stamp, since the size of my data set is in the terabyte+ scale.
Is there a way to accomplish this using Hadoop or maybe Zookeeper?

There is no way for two mappers to talk to each other, so a map-only job (Job1) cannot give you the global maximum timestamp by itself. However, I can think of two approaches, described below.
I assume your Job1 is currently a map-only job and you are writing the output from the map itself.
A. Change your mapper to write its main output using MultipleOutputs rather than Context or OutputCollector, and emit an additional (key, value) pair of the form (constant, timestamp) using context.write(). This way only the (constant, timestamp) pairs are shuffled to the reducer. Add a reducer that calculates the maximum among the values it receives, and run the job with the number of reducers set to 1. The output written from the mappers gives you your original output, while the output written from the reducer gives you the global maximum timestamp.
B. In Job1, write the maximum timestamp seen by each mapper as output. You can do this in cleanup(), using MultipleOutputs to write to a folder other than that of your original output.
Once Job1 is done, you have 'x' part files in that folder, assuming you had 'x' mappers in Job1. You can do a getmerge on the folder to pull all the part files into a single local file. This file will have 'x' lines, each containing a timestamp. You can read it with a stand-alone Java program, find the global maximum timestamp and save it in a local file. Share this file with Job2 using the distributed cache, or pass the global maximum as a parameter. (A sketch of approach A is given below.)
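A rough sketch of approach A. The class names, the "data" named output and the timestamp parsing are made up for illustration, and the named output must also be registered in the driver with MultipleOutputs.addNamedOutput():

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class Job1Mapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final Text CONST_KEY = new Text("MAX");
    private MultipleOutputs<Text, LongWritable> mos;

    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, LongWritable>(context);
    }

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // placeholder parsing: assume the timestamp is the first tab-separated field
        long ts = Long.parseLong(value.toString().split("\t")[0]);
        mos.write("data", value, new LongWritable(ts));   // the original output, bypasses the shuffle
        context.write(CONST_KEY, new LongWritable(ts));   // only these pairs go to the reducer
        // (tracking a running max and emitting it once in cleanup() would shuffle even less)
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

// in a separate file, used with job.setNumReduceTasks(1):
public class MaxTimestampReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long max = Long.MIN_VALUE;
        for (LongWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new LongWritable(max));        // the global maximum timestamp
    }
}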

I would suggest the following: create a directory where each mapper writes its own maximum to a file named after the mapper name + task ID. The idea is to have a second output directory, and to avoid concurrency issues you just make sure that each mapper writes to a unique file. Keep the maximum in a variable and write it to the file in each mapper's cleanup() method.
Once the job completes, it's trivial to iterate over the secondary output directory to find the maximum. A sketch of the cleanup() step is shown below.
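A sketch of that cleanup() step. The "max.output.dir" property name and the timestamp parsing are placeholders; it needs org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.{FileSystem, Path, FSDataOutputStream}:

// inside the Job1 mapper
private long maxTs = Long.MIN_VALUE;

protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    long ts = Long.parseLong(value.toString().split("\t")[0]);   // placeholder parsing
    maxTs = Math.max(maxTs, ts);
    context.write(key, value);                                   // normal output, unchanged
}

protected void cleanup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // the task attempt ID makes the file name unique per mapper, so writers never clash
    Path file = new Path(conf.get("max.output.dir"),             // side directory, set in the driver
            "max-" + context.getTaskAttemptID().toString());
    FileSystem fs = file.getFileSystem(conf);
    try (FSDataOutputStream out = fs.create(file, false)) {
        out.writeBytes(maxTs + "\n");
    }
}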

Related

Chaining Map Reduce Program

During a POC I want to create a nested MapReduce within one job: Map M1's output goes to Reducer R1, R1's output then goes to M2, and the final output comes either from M2 or from an R2 run on M2's output.
Single job ID: M1 -> R1 -> M2 -> R2 ... with the final output in a single output file.
Can we do it without Oozie?
You can chain multiple jobs in your driver class. First, create a job for the first MapReduce by defining all the required configuration, then start it as usual by calling:
job1.waitForCompletion(true);
This waits until the job is finished. Then check the final status of the first job, whether it failed or succeeded, to decide on the appropriate next action.
If the first job completed successfully, launch the next MapReduce the same way: define the required parameters and launch the job with:
job2.waitForCompletion(true);
The important thing is that the output path of the first job becomes the input path of the second job. This is serial (sequential) job chaining, because the jobs run one after the other. A minimal driver sketch is shown below.
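A minimal sketch of such a driver (class names and argument indices are placeholders):

Configuration conf = new Configuration();

Job job1 = Job.getInstance(conf, "first job");
job1.setJarByClass(ChainDriver.class);                  // placeholder driver class
// set mapper, reducer, key/value classes for job1 here
FileInputFormat.addInputPath(job1, new Path(args[0]));
FileOutputFormat.setOutputPath(job1, new Path(args[1]));

if (!job1.waitForCompletion(true)) {
    System.exit(1);                                     // stop the chain if job1 failed
}

Job job2 = Job.getInstance(conf, "second job");
job2.setJarByClass(ChainDriver.class);
// set mapper, reducer, key/value classes for job2 here
FileInputFormat.addInputPath(job2, new Path(args[1])); // job1's output is job2's input
FileOutputFormat.setOutputPath(job2, new Path(args[2]));

System.exit(job2.waitForCompletion(true) ? 0 : 1);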
You can also make use of JobControl, which lets you execute a number of MapReduce jobs in a sequence. In your case there are two mappers and one or two reducers. You can have two MapReduce jobs, and for the second job you can set the number of reducers to zero if you don't need a reduce phase. For example:
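A minimal JobControl sketch, assuming job1 and job2 are already configured Job objects as in the driver sketch above and the driver's main declares throws Exception:

// imports: org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob and JobControl
ControlledJob cJob1 = new ControlledJob(job1.getConfiguration());
cJob1.setJob(job1);
ControlledJob cJob2 = new ControlledJob(job2.getConfiguration());
cJob2.setJob(job2);
cJob2.addDependingJob(cJob1);             // job2 runs only after job1 succeeds

JobControl control = new JobControl("chain");
control.addJob(cJob1);
control.addJob(cJob2);

Thread t = new Thread(control);           // JobControl implements Runnable
t.start();
while (!control.allFinished()) {
    Thread.sleep(500);
}
control.stop();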

How to implement multiple reducers in a single MapReduce Job

I have a huge data set and I need to perform different functions on the same data.
I would like to have four output files. Since the four operations are different, can I use four partitioners and four reducers to implement this? Is that possible, or do I need to write four jobs? Please help me!
First Approach
I think you should implement the code in a single reducer and emit n keys depending on the process performed. For example, say you implement techniques A, B, C and D; then in your mapper's map() method you could do something like this (processA() through processD() stand for your own logic):
Text dataA = processA(key, value);
context.write(new Text("A"), dataA);
Text dataB = processB(key, value);
context.write(new Text("B"), dataB);
Text dataC = processC(key, value);
context.write(new Text("C"), dataC);
Text dataD = processD(key, value);
context.write(new Text("D"), dataD);
Be careful about the output data types. Also, the output key could be more complex.
Second Approach
You could create N MapReduce applications in the same Java project, re-use the Map, and develop N reducers.
In each main class you set its Reducer with job.setReducerClass; the Map stays the same. For example:
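A sketch with a shared mapper and per-job reducers (all class names here are placeholders):

// driver of application 1
Job jobA = Job.getInstance(new Configuration(), "function A");
jobA.setJarByClass(DriverA.class);
jobA.setMapperClass(SharedMapper.class);   // the same mapper in every application
jobA.setReducerClass(ReducerA.class);      // reducer specific to this application

// driver of application 2
Job jobB = Job.getInstance(new Configuration(), "function B");
jobB.setJarByClass(DriverB.class);
jobB.setMapperClass(SharedMapper.class);   // same mapper again
jobB.setReducerClass(ReducerB.class);      // a different reducer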
You just need to specify the number of reducers in your MapReduce job config. The default partitioner distributes data to reducers based on the hash of the key modulo the number of reducers.
To override the default partitioner's behavior, you can implement your own custom partitioner that specifies how your data should be distributed across the reducers, for example:
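A minimal custom partitioner sketch; it assumes the mapper tags each key with "A"/"B"/"C"/"D" as in the first approach and that the job runs four reducers:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FunctionPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int partition;
        switch (key.toString()) {
            case "A": partition = 0; break;
            case "B": partition = 1; break;
            case "C": partition = 2; break;
            default:  partition = 3; break;
        }
        return partition % numPartitions;   // stay within the configured reducer count
    }
}
// in the driver:
// job.setPartitionerClass(FunctionPartitioner.class);
// job.setNumReduceTasks(4);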
---Edit to answer questions in the comments section---
How can I specify more than one reducer class in the MapReduce driver?
To set the number of reducers, in the job conf you can set it like below:
int numReducers = /*number of reducers you want*/;
job.setNumReduceTasks(numReducers);
Should I write four different jobs for this, or can I do it with a single job?
Hadoop MR jobs are I/O intensive; in your MR job design you should work on minimizing the I/O and parallelizing processing as much as possible.
If your reducers need the same input to generate all 4 outputs, it is better to keep a single job, but another consideration is the skewness of the data for each output.
For example, output1 may take more processing time, and most of the incoming data may need to be processed for output1.
If the time taken to process output1 is much higher than the total time taken to process output2 + output3 + output4, you should consider splitting the processing of output1 into multiple steps.
However, if all 4 outputs have more or less equal processing times and consume the same data throughout,
it is better to have some conditional processing logic in the reducer and let your custom partitioner decide which data goes to which reducer.
Your custom partitioner can have a check like: this incoming record contributes to "GC content", so send it to reducer 3.
But if your incoming data needs to be processed for more than one output/distribution, use conditional processing, and to write multiple output files from the same reducer use MultipleOutputs.
You can look up usage examples; it lets you write output to multiple folders/files at the same time from within a Mapper or Reducer. A sketch follows below.
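A sketch of MultipleOutputs inside a reducer. The "output1"/"output2" named outputs are hypothetical and must be registered in the driver with MultipleOutputs.addNamedOutput(), and qualifiesForOutput1() stands in for your own routing condition:

public class MultiOutReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            if (qualifiesForOutput1(key, v)) {               // your routing condition
                mos.write("output1", key, v, "out1/part");   // lands under out1/ in the output dir
            } else {
                mos.write("output2", key, v, "out2/part");
            }
        }
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }

    private boolean qualifiesForOutput1(Text key, Text value) {
        return key.toString().startsWith("A");               // placeholder condition
    }
}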
Hadoop lets you specify the number of reducer tasks from the job driver with job.setNumReduceTasks(num_reducers);. Since you want four outputs, you would set int num_reducers = 4;. Here's an example driver class.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class run {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Run NB Count");
        job.setJarByClass(NB_train_hadoop.class);
        // set mappers, reducers, other stuff
        int num_reducers = 4;
        job.setNumReduceTasks(num_reducers);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
While this is handy, you have to understand that there is an optimal number of reducers, and it depends on the number of nodes (and cores) in your cluster.
For example, running on 4 Amazon m3.xlarge instances (1 master, 3 slaves, 4 cores per instance), I measured the relationship between wall time and the number of reducer tasks used in the MapReduce job. More isn't necessarily better, and if you use too many, well, you might as well crunch your data with your mother's hair curler because it would be faster that way.
Hope this is helpful!!

Set result from previous Reducer as configuration parameter

As part of the calculation logic in a MapReduce workflow, I need to take the result from one reducer as a parameter for the next reducer in the chain.
Path plc = new Path(args[1] + "/3");       // output path from the previous reducer
Configuration c4 = new Configuration();
c4.set("denom", GetLineC.extCount(plc));   // GetLineC.extCount is a function that returns a value
ControlledJob cJob4 = new ControlledJob(c4);
I'm using JobControl to create the dependency between the jobs and all the configuration. When the program is executed it gives "No such file or directory". In the flow, by the time control reaches this part the file will be present at this location, but since the configuration is instantiated at the beginning, this error shows up.
Is there a way to set the single line output from the previous reducer as a parameter directly?
Well, I think you mean the previous job instead of the previous reducer. If you're executing the two jobs from the same driver class, you already know the output of the last job, which is a directory. Clearly you're using only one reducer, and it will write its output to a part-r-00000 file inside the output path. To set it as a configuration parameter for the next job, you will have to read this file manually, after the previous job has actually finished.
Are you considering that in GetLineC.extCount(Path path)?
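A sketch of that manual read, reusing the paths and the "denom" key from the question. The crucial point is that it must run only after the previous job has finished (e.g. after its waitForCompletion() returns), not while all the JobControl configurations are being built up front:

// imports: java.io.*, org.apache.hadoop.conf.Configuration,
//          org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path
Configuration conf = new Configuration();
Path resultFile = new Path(args[1] + "/3/part-r-00000");   // the single reducer's output file
FileSystem fs = resultFile.getFileSystem(conf);

String denom;
try (BufferedReader reader =
         new BufferedReader(new InputStreamReader(fs.open(resultFile)))) {
    denom = reader.readLine().trim();                      // the one line the reducer wrote
}

Configuration c4 = new Configuration();
c4.set("denom", denom);
ControlledJob cJob4 = new ControlledJob(c4);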

Hadoop mapper task detailed execution time

For a certain Hadoop MapReduce mapper task, I already have the mapper task's complete execution time. In general, a mapper has three steps: (1) read input from HDFS or another source like Amazon S3; (2) process the input data; (3) write the intermediate result to local disk. Now I am wondering whether it's possible to know the time spent in each step.
My purpose is to find out (1) how long it takes for mappers to read input from HDFS or S3; this just indicates how fast a mapper can read, more like the I/O performance of a mapper; and (2) how long it takes for the mapper to process the data, more like the computing capability of the task.
Does anyone have an idea of how to acquire these results?
Thanks.
Just implement a read-only mapper that does not emit anything. This will then give an indication of how long it takes for each split to be read (but not processed).
As a further step you can define a variable passed to the job at runtime (via the job properties) which lets you do just one of the following (e.g. by parsing the variable against an enum and then switching on its values):
just read
just read and process (but not write/emit anything)
do it all
This of course assumes that you have access to the mapper code.
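A rough sketch of such a mode-switched mapper; the "bench.mode" property name, the BenchMapper class and the process() step are all made up for illustration:

enum Mode { READ_ONLY, READ_AND_PROCESS, FULL }

public class BenchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Mode mode;

    protected void setup(Context context) {
        mode = Mode.valueOf(context.getConfiguration().get("bench.mode", "FULL"));
    }

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (mode == Mode.READ_ONLY) {
            return;                                     // measures pure read time of the split
        }
        String processed = process(value.toString());   // placeholder for your real processing
        if (mode == Mode.FULL) {
            context.write(new Text(key.toString()), new Text(processed));  // also pay the emit cost
        }
    }

    private String process(String record) {
        return record.toUpperCase();                    // placeholder work
    }
}
// run three times, e.g. with -D bench.mode=READ_ONLY / READ_AND_PROCESS / FULL,
// and compare the wall times to separate the read, process and write costs.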

Hadoop DistributedCache failed to report status

In a Hadoop job I am mapping several XML files and extracting an ID for every element (from <id> tags). Since I want to restrict the job to a certain set of IDs, I read in a large file (about 250 million lines, 2.7 GB, every line with just an integer as an ID). So I use a DistributedCache, parse the file in the setup() method of the Mapper with a BufferedReader and save the IDs to a HashSet.
Now when I start the job, I get countless
Task attempt_201201112322_0110_m_000000_1 failed to report status. Killing!
before any map task is executed.
The cluster consists of 40 nodes, and since the files in a DistributedCache are copied to the slave nodes before any tasks for the job are executed, I assume the failure is caused by the large HashSet. I have already increased mapred.task.timeout to 2000s. Of course I could raise the time even more, but this period really should suffice, shouldn't it?
Since a DistributedCache is supposed to be a way to "distribute large, read-only files efficiently", I wondered what causes the failure here, and whether there is another way to pass the relevant IDs to every map task.
Can you add some debug printlns to your setup method to check that it is timing out in this method (log the entry and exit times)?
You may also want to look into using a BloomFilter to hold the IDs. You can probably store these values in a Bloom filter of a few tens to a few hundred MB, depending on the false positive rate you can tolerate, and then run a secondary job to perform a partitioned check against the actual reference file. A rough sketch is below.
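A rough sketch using Hadoop's built-in Bloom filter classes (org.apache.hadoop.util.bloom.BloomFilter and Key, org.apache.hadoop.util.hash.Hash). The sizing below is illustrative only and must be tuned to the number of IDs and the false positive rate you can accept; the file names are placeholders:

import java.io.*;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BuildIdFilter {
    public static void main(String[] args) throws IOException {
        int vectorSize = 8 * 100 * 1024 * 1024;     // ~100 MB of bits; tune this
        int nbHash = 3;                             // number of hash functions; tune this
        BloomFilter filter = new BloomFilter(vectorSize, nbHash, Hash.MURMUR_HASH);

        // build the filter once from the ID file (one integer per line)
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                filter.add(new Key(line.trim().getBytes(StandardCharsets.UTF_8)));
            }
        }
        // serialize it; BloomFilter implements Writable
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(args[1]))) {
            filter.write(out);
        }
        // ship the serialized filter via the DistributedCache; in the mapper's setup()
        // call filter.readFields(in), and in map() use
        // filter.membershipTest(new Key(idBytes)) to decide whether the ID is
        // (probably) in the set; false positives are weeded out by the secondary job.
    }
}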
