Set result from previous Reducer as configuration parameter - hadoop

As part of the calculation logic in a MapReduce workflow, I need to take the result from one reducer and use it as a parameter for the next reducer in the chain.
Path plc = new Path(args[1] + "/3");        // output path of the previous reducer
Configuration c4 = new Configuration();
c4.set("denom", GetLineC.extCount(plc));    // GetLineC.extCount is a function that returns a value
ControlledJob cJob4 = new ControlledJob(c4);
I'm using JobControl to create the dependency between the jobs and to set up all the configuration. When the program is executed it fails with "No such file or directory". By the time control reaches this part of the flow the file would be present at that location, but because the configuration is instantiated at the beginning, the error shows up.
Is there a way to set the single-line output of the previous reducer as a parameter directly?

Well, I think you mean previous job instead of previous reducer. If you're executing the two jobs using the same driver class, you already know the output of the last job, which is a directory. Clearly you're using only one reducer, so it will write its output to a part-r-00000 file inside that output path. To set it as a configuration parameter for the next job, you will have to read this file manually.
Is that what you are doing in GetLineC.extCount(Path path)?
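For illustration, here is a minimal sketch of what such a helper could look like; the class and method names mirror the GetLineC.extCount(Path) from the question, but the body is an assumption on my part. It presumes a single reducer and the default part-r-00000 file name, and it must only be called after the previous job has actually finished, not when the configurations are first built.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetLineC {
    // Reads the single line written by a one-reducer job into part-r-00000.
    public static String extCount(Path outputDir) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = outputDir.getFileSystem(conf);
        Path partFile = new Path(outputDir, "part-r-00000");   // single-reducer output file
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(partFile)))) {
            return reader.readLine();                           // the single output line
        }
    }
}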

Related

WebSphere Liberty Java Batch: can we pass a Batchlet step's output to a Chunk step as an input parameter at runtime?

In WebSphere Liberty Java Batch, is it possible to pass the output of the first step to the next step as an input parameter?
For example, the first step is a batchlet and the second step is a chunk step. Once the first step completes its execution, its output should be passed to the second step at runtime.
I'm guessing you are thinking of this in z/OS JCL terms, where a step would write output to a temporary dataset that gets passed to a subsequent step. JSR-352 doesn't get into dataset (or file) allocation; that's up to the application code. So you could certainly have a step that wrote output into a file (or dataset), and a later step could certainly read from that same file (or dataset) if it knew the name. You could make the name a job property that is provided as a property to the batchlet and the reader. You could even externalize the value of the job property as a job parameter.
But nothing is going to delete the file for you at the end of the job (like a temporary dataset would get deleted). You'll need to clean up the file yourself.
Is that what you were asking?
You can use the JobContext user data: JobContext.set/getTransientUserData().
This does not however allow you to populate a batch property (via @Inject @BatchProperty) in a parallel way to the manner in which you can supply values from XML via substitution with job parameters.
We have raised an issue to consider an enhancement for the next revision of the Batch specification to allow a property value to be set dynamically from an earlier portion of the execution.
In the meantime there is also the possibility to use CDI bean scopes to share information across steps, but this also is not integrated with batch property injection.
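To illustrate the transient user data route mentioned above, here is a minimal, non-authoritative sketch; the class names are made up, and the JobContext calls are the only point being shown.
import java.io.Serializable;
import javax.batch.api.AbstractBatchlet;
import javax.batch.api.chunk.AbstractItemReader;
import javax.batch.runtime.context.JobContext;
import javax.inject.Inject;
import javax.inject.Named;

@Named
public class ProducingBatchlet extends AbstractBatchlet {
    @Inject
    JobContext jobCtx;

    @Override
    public String process() throws Exception {
        String result = "value-computed-in-step-1";   // whatever the batchlet produces
        jobCtx.setTransientUserData(result);          // stash it for later steps in this execution
        return "COMPLETED";
    }
}

@Named
public class ConsumingReader extends AbstractItemReader {
    @Inject
    JobContext jobCtx;
    private String stepOneResult;

    @Override
    public void open(Serializable checkpoint) throws Exception {
        // Pick up what the batchlet stored earlier in the same job execution.
        stepOneResult = (String) jobCtx.getTransientUserData();
    }

    @Override
    public Object readItem() throws Exception {
        return null;   // real reading logic would use stepOneResult here
    }
}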

Chaining Map Reduce Program

I have a situation: during a POC I want to create a nested MapReduce flow within one job. For example, map M1's output goes to reducer R1, R1's output then goes to M2, and the final output comes either from M2 or from an R2 run on M2's output.
Single job ID: M1 -> R1 -> M2 -> R2... The final output should be in a single output file.
Can we do it without Oozie?
You can chain multiple jobs in your driver class. First, create a job for the first MapReduce by defining all the required configuration, then start the job as usual by calling:
job1.waitForCompletion(true);
This waits until the job is finished. Now check the final status of the first job, whether it failed or succeeded, to decide on the appropriate next action.
If the first job completed successfully, launch the next MapReduce in the same way: define the required parameters and launch the job with:
job2.waitForCompletion(true);
The important point is that the output path of the first job will be the input path of the second job. This is serial (sequential) job chaining, because the jobs run one after another.
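A bare-bones driver along these lines might look as follows; the class names (ChainDriver, Map1, Reduce1, Map2, Reduce2) and the argument paths are placeholders, and the usual output key/value class settings are omitted for brevity.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
        Path output = new Path(args[2]);

        Job job1 = Job.getInstance(conf, "M1-R1");
        job1.setJarByClass(ChainDriver.class);
        job1.setMapperClass(Map1.class);
        job1.setReducerClass(Reduce1.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);

        if (!job1.waitForCompletion(true)) {
            System.exit(1);                      // stop the chain if job 1 failed
        }

        Job job2 = Job.getInstance(conf, "M2-R2");
        job2.setJarByClass(ChainDriver.class);
        job2.setMapperClass(Map2.class);
        job2.setReducerClass(Reduce2.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);

        System.exit(job2.waitForCompletion(true) ? 0 : 2);
    }
}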
You can also make use of JobControl, which lets you execute a number of MapReduce jobs in a sequence. In your case there are two mappers and one or two reducers. You can have two MapReduce jobs, and for the second job you can set the number of reducers to zero if you don't need a reduce phase.
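Sketched roughly, the JobControl variant looks like this; job1 and job2 are assumed to be already configured Job instances, and the group name is arbitrary.
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

ControlledJob cj1 = new ControlledJob(job1, null);
ControlledJob cj2 = new ControlledJob(job2, null);
cj2.addDependingJob(cj1);                 // job2 starts only after job1 succeeds

JobControl control = new JobControl("chain");
control.addJob(cj1);
control.addJob(cj2);

Thread runner = new Thread(control);      // JobControl implements Runnable
runner.start();
while (!control.allFinished()) {
    Thread.sleep(5000);                   // poll until both jobs are done
}
control.stop();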

Pass the maximum key encountered across all mappers as parameter to the next job

I have a chain of Map/Reduce jobs:
Job1 takes data with a time stamp as a key and some data as value and transforms it.
For Job2 I need to pass the maximum time stamp that appears across all mappers in Job1 as a parameter. (I know how to pass parameters to Mappers/Reducers)
I can keep track of the maximum time stamp in each mapper of Job1, but how can I get the maximum across all mappers and pass it as a parameter to Job2?
I want to avoid running a Map/Reduce Job just to determine the maximum time stamp, since the size of my data set is in the terabyte+ scale.
Is there a way to accomplish this using Hadoop or maybe Zookeeper?
There is no way two maps can talk to each other, so a map-only job (job1) cannot get you the global maximum timestamp. However, I can think of two approaches, described below.
I assume your job1 is currently a map-only job and you are writing output from the map itself.
A. Change your mapper to write the main output using MultipleOutputs rather than Context or OutputCollector, and emit an additional (key, value) pair of the form (constant, timestamp) using context.write(). This way, you shuffle only the (constant, timestamp) pairs to the reducer. Add a reducer that calculates the maximum among the values it receives, and run the job with the number of reducers set to 1. The output written from the mapper gives you your original output, while the output written from the reducer gives you the global maximum timestamp. (A sketch follows below.)
B. In job1, write the maximum timestamp seen in each mapper as output. You can do this in cleanup(). Use MultipleOutputs to write to a folder other than that of your original output.
Once job1 is done, you have 'x' part files in that folder, assuming you have 'x' mappers in job1. You can do a getmerge on the folder to pull all the part files into a single local file; this file will have 'x' lines, each containing a timestamp. You can read it with a stand-alone Java program, find the global maximum timestamp, and save it in a local file. Share this file with job2 using the distributed cache, or pass the global maximum as a parameter.
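Here is a rough, non-authoritative sketch of approach A; the class names, the named output "data", and the extractTimestamp() helper are placeholders, and one (constant, local max) pair is emitted per mapper from cleanup() so that very little data is shuffled. Each class would go in its own file.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MaxAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private MultipleOutputs<Text, LongWritable> mos;
    private long localMax = Long.MIN_VALUE;

    @Override
    protected void setup(Context ctx) {
        mos = new MultipleOutputs<>(ctx);
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        long ts = extractTimestamp(value);                          // placeholder: parse your record
        localMax = Math.max(localMax, ts);
        mos.write("data", new Text(value), new LongWritable(ts));   // the job's real output
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        ctx.write(new Text("max"), new LongWritable(localMax));     // shuffled to the single reducer
        mos.close();
    }

    private long extractTimestamp(Text value) { return 0L; }        // placeholder
}

public class GlobalMaxReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
            throws IOException, InterruptedException {
        long max = Long.MIN_VALUE;
        for (LongWritable v : values) {
            max = Math.max(max, v.get());
        }
        ctx.write(key, new LongWritable(max));                      // the global maximum timestamp
    }
}

// In the driver: job.setNumReduceTasks(1); and
// MultipleOutputs.addNamedOutput(job, "data", TextOutputFormat.class, Text.class, LongWritable.class);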
I would suggest the following: create a directory where you can put the maximum of each mapper, in a file named after the mapper name + id. The idea is to have a second output directory, and to avoid concurrency issues you just make sure that each mapper writes to a unique file. Keep the maximum in a variable and write it to the file in each mapper's cleanup() method.
Once the job completes, it's trivial to iterate over the secondary output directory to find the maximum.
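As a hedged sketch of that idea (the "max.timestamp.dir" property, the localMax field, and the driver variables are assumptions), the mapper's cleanup() could write one small file per task attempt, and the driver could scan the directory afterwards; taking the maximum over all files also keeps speculative or retried attempts harmless.
// Inside the job1 mapper, assuming a long field localMax updated in map():
@Override
protected void cleanup(Context ctx) throws IOException, InterruptedException {
    Configuration conf = ctx.getConfiguration();
    Path side = new Path(conf.get("max.timestamp.dir"),       // assumed job property
                         ctx.getTaskAttemptID().toString());  // unique file per attempt
    FileSystem fs = side.getFileSystem(conf);
    try (FSDataOutputStream out = fs.create(side, true)) {
        out.writeLong(localMax);
    }
}

// In the driver, after job1 completes:
long globalMax = Long.MIN_VALUE;
FileSystem fs = FileSystem.get(conf);
for (FileStatus status : fs.listStatus(new Path(maxTimestampDir))) {
    try (FSDataInputStream in = fs.open(status.getPath())) {
        globalMax = Math.max(globalMax, in.readLong());
    }
}
job2.getConfiguration().setLong("global.max.timestamp", globalMax);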

Multiple mappers in hadoop

I am trying to run two independent mappers on the same input file in a Hadoop program using one job, and I want the output of both mappers to go into a single reducer. I'm facing an issue running the multiple mappers. I was using the MultipleInputs class; it was working fine with both mappers, but yesterday I noticed that only one map function runs: the second MultipleInputs statement seems to overwrite the first one. I can't find any change to the code that would explain this sudden difference in behavior. Please help me with this. The main function is:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "mapper accepting whole file at once");
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);
    job.setJarByClass(TestMultipleInputs.class);
    job.setMapperClass(Map2.class);
    job.setMapperClass(Map1.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(NLinesInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setMapOutputKeyClass(IntWritable.class);
    MultipleInputs.addInputPath(job, new Path("Rec"), NLinesInputFormat.class, Map1.class);
    MultipleInputs.addInputPath(job, new Path("Rec"), NLinesInputFormat.class, Map2.class);
    FileOutputFormat.setOutputPath(job, new Path("testMulinput"));
    job.waitForCompletion(true);
}
Whichever Map class is used in the last MultipleInputs statement gets executed; here, Map2.class gets executed.
You won't be able to read from the same file at the same time with two separate Mappers (at least not without some devilishly hackish trickery which you should probably avoid).
In any case, you can't have two Mapper classes set for the same job; the latter call to setMapperClass(class) will always overwrite the former. If you need two Mappers to run simultaneously, you'll need to make two separate jobs and ensure that there are enough mappers available on your cluster to run them both simultaneously (if there aren't enough available after the first job starts, the second job will have to wait for it to finish, running sequentially rather than simultaneously).
However, due to the lack of a guarantee that the Mappers will run concurrently, ensure that the functionality of your MapReduce jobs is not reliant on their concurrent execution.
Both mappers can't read the same file at the same time.
Solution (Workaround):
Create a duplicate of the input file (in this case, let the duplicate of the rec file be rec1). Then feed mapper1 with rec and mapper2 with rec1.
Both mappers are executed in parallel, so you don't need to worry about the reducer output: the output of both mappers will be shuffled so that equal keys from both files go to the same reducer, and the output is what you want.
Hope this helps others who are facing a similar issue.
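Concretely, the only driver change for this workaround would be along these lines; the path names follow the question, and "Rec1" is the hypothetical copy of the input.
MultipleInputs.addInputPath(job, new Path("Rec"),  NLinesInputFormat.class, Map1.class);
MultipleInputs.addInputPath(job, new Path("Rec1"), NLinesInputFormat.class, Map2.class);
FileOutputFormat.setOutputPath(job, new Path("testMulinput"));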

Setting parameter in MapReduce Job configuration

Is there any way to set a parameter in the job configuration from the Mapper so that it is accessible from the Reducer?
I tried the code below:
In Mapper: map(..) : context.getConfiguration().set("Sum","100");
In reducer: reduce(..) : context.getConfiguration().get("Sum");
But in the reducer the value is returned as null.
Is there any way to implement this, or is there anything I missed?
As far as I know, this is not possible. The job configuration is serialized to XML at run-time by the jobtracker, and is copied out to all task nodes. Any changes to the Configuration object will only affect that object, which is local to the specific task JVM; it will not change the XML at every node.
In general, you should try to avoid any "global" state. It is against the MapReduce paradigm and will generally prevent parallelism. If you absolutely must pass information between the Map and Reduce phase, and you cannot do it via the usual Shuffle/Sort step, then you could try writing to the Distributed Cache, or directly to HDFS.
If you are using the new API, your code should ideally work. Have you created this "Sum" property at the start of job creation? For example, like this:
Configuration conf = new Configuration();
conf.set("Sum", "0");
Job job = new Job(conf);
If not, you had better use
context.getConfiguration().setIfUnset("Sum", "100");
in your mapper class to fix the issue. This is the only thing I can see.
