In Hadoop MapReduce the intermediate output (map output) is saved on the local disk. I would like to know whether it is possible to start a job with only the reduce phase, which reads the map output from the local disk, partitions the data and executes the reduce tasks?
There is a basic implementation of Mapper called IdentityMapper, which essentially passes all the key-value pairs straight through to the Reducer.
The Reducer reads the pairs produced by the different mappers and emits key-value pairs of its own.
The Reducer's job is to process the data that comes from the mapper.
If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default.
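For illustration, a minimal old-API driver sketch along those lines. MyReducer and the argument paths are placeholders, and KeyValueTextInputFormat is assumed so the identity map emits Text/Text pairs:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class ReduceOnlyDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ReduceOnlyDriver.class);
    conf.setJobName("identity-map-example");

    // No setMapperClass(...) call: IdentityMapper is used by default and
    // forwards every (key, value) pair unchanged to the reducers.
    conf.setReducerClass(MyReducer.class);          // MyReducer is a placeholder

    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}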
You can't run just reducers without any mappers.
MapReduce works on data that is in HDFS, so I don't think you can write a reducer-only MapReduce job that reads from the local disk.
If you use Hadoop Streaming, you can just add:
-mapper "/bin/sh -c \"cat\""
I am running a map-only (number of reducers = 0) MapReduce streaming job. At the end of the mapper code, I write the map output explicitly. However, at times some of my mapper tasks fail.
I was expecting to see the output of the completed mappers in HDFS. My reasoning was that since there is no reducer, the output should be written directly to HDFS. However, I couldn't see any data in the output HDFS folder when even a single mapper failed. Why does this happen? Is there a flaw in my understanding?
For a given MR job, I need to produce two output files.
One file should be the output of Mapper
Another file should be the output of Reducer (which is just an aggregation of above Mapper)
Can I have both the mapper and reducer output written in a single job?
EDIT:
In Job 1 (mapper-only phase), the output contains 20 fields in a single row, which has to be written to HDFS (file1).
In Job 2 (mapper and reducer), the mapper takes Job 1's output as input, deletes a few fields to bring it into a standard format (only 10 fields) and passes it to the reducer, which writes file2.
I need both file1 and file2 in HDFS. My doubt is: in Job 1's mapper, can I write the data to HDFS as file1, then modify the same data and pass it on to the reducer?
PS: As of now I am using two jobs with a chaining mechanism. The first job contains only a mapper; the second job contains a mapper and a reducer.
You could perhaps use the MultipleOutputs class to define one output for the mapper and (optionally) one for the reducer. For the mapper, you will have to write things twice: once for the output file (using MultipleOutputs) and once for emitting pairs to the reducer (as usual).
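As a rough sketch of the "write twice" idea with the new API (the named output "mapperOut", the Text types and the key derivation are all assumptions):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class TwoOutputMapper extends Mapper<LongWritable, Text, Text, Text> {
  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    Text outKey = new Text(record.toString().split("\t")[0]);  // assumed key field
    Text outValue = record;

    // 1) write the 20-field record to the mapper's own file (file1)
    mos.write("mapperOut", outKey, outValue);

    // 2) emit the same pair into the shuffle so the reducer still receives it
    context.write(outKey, outValue);
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();    // flush the extra output files
  }
}

The named output has to be registered once in the driver, e.g. MultipleOutputs.addNamedOutput(job, "mapperOut", TextOutputFormat.class, Text.class, Text.class).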
Then, you could also take advantage of the ChainMapper class to define the following workflow in a single job:
Mapper 1 (file 1) -> Mapper 2 -> Reducer (file 2)
To be honest, I've never used this logic, but you can give it a try. Good luck!
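If you want to try the ChainMapper route, a driver sketch might look roughly like this (Mapper1, Mapper2, MyReducer and the key/value types are placeholders, not tested code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "mapper-and-reducer-output");
    job.setJarByClass(ChainDriver.class);

    // Mapper1: writes the 20-field rows (file1) via MultipleOutputs and
    // forwards them unchanged
    ChainMapper.addMapper(job, Mapper1.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        new Configuration(false));

    // Mapper2: trims each row down to the 10-field standard format
    ChainMapper.addMapper(job, Mapper2.class,
        Text.class, Text.class, Text.class, Text.class,
        new Configuration(false));

    // Reducer: aggregates and writes file2
    ChainReducer.setReducer(job, MyReducer.class,
        Text.class, Text.class, Text.class, Text.class,
        new Configuration(false));

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}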
I want to configure a MapReduce action in an Oozie workflow for an existing MapReduce jar (with a mapper, a reducer and sometimes a combiner as well), such that only the reducer/combiner is run on the input files.
All MapReduce jobs must run the map phase; however, you can have the mappers pass the data straight through by either:
In the old MR API using the IdentityMapper
In the new MR API by not specifying the mapper class at all, which will default to the base Mapper class that acts as an identity mapper
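For instance, with the new API the driver might look roughly like this (MyReducer and KeyValueTextInputFormat are assumptions; leaving out setMapperClass, or pointing it at the base Mapper class as shown, gives identity behaviour):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IdentityMapDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "identity-map");
    job.setJarByClass(IdentityMapDriver.class);

    // Either omit this line entirely or set the base Mapper explicitly:
    // both give identity behaviour, so records flow straight to the reducer.
    job.setMapperClass(Mapper.class);

    job.setReducerClass(MyReducer.class);   // MyReducer is a placeholder
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}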
A map operation generally takes a key-value pair as input and returns a key-value pair as output. If the map returns output that is not a key-value pair, how will the Reducer process that output?
Any assistance with this would be appreciated.
I am not sure about Java MapReduce, but in Hadoop Streaming, if the mappers do not produce any output, the reducers will not be run.
You can test it by creating two small Python scripts:
A mapper that simply consumes the input without producing anything
#!/usr/bin/python
import sys
for line in sys.stdin:
    pass
A reducer that crashes as soon as it is started
#!/usr/bin/python
import sys
sys.exit("some error message")
If you launch the job, it will complete without any error, which shows that the reducer was never started.
I have a chain of Map/Reduce jobs:
Job1 takes data with a time stamp as a key and some data as value and transforms it.
For Job2 I need to pass the maximum time stamp that appears across all mappers in Job1 as a parameter. (I know how to pass parameters to Mappers/Reducers)
I can keep track of the maximum time stamp in each mapper of Job1, but how can I get the maximum across all mappers and pass it as a parameter to Job2?
I want to avoid running a Map/Reduce Job just to determine the maximum time stamp, since the size of my data set is in the terabyte+ scale.
Is there a way to accomplish this using Hadoop or maybe Zookeeper?
There is no way two maps can talk to each other, so a map-only job (Job 1) cannot get you the global maximum timestamp. However, I can think of two approaches, described below.
I assume your Job 1 is currently a map-only job and that you are writing the output from the map itself.
A. Change your mapper to write the main output using MultipleOutputs instead of Context or OutputCollector. Emit an additional (key, value) pair as (constant, timestamp) using context.write(). This way, you shuffle only the (constant, timestamp) pairs to the reducer. Add a reducer that calculates the maximum among the values it receives. Run the job with the number of reducers set to 1. The output written from the mapper gives you your original output, while the output written from the reducer gives you the global maximum timestamp.
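A rough sketch of approach A (the tab-separated timestamp field, the "mainOut" named output and the class names are assumptions, not tested code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Mapper: main records go to the named output "mainOut"; only the tiny
// (constant, timestamp) pairs are shuffled to the single reducer.
class Job1Mapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final Text CONSTANT_KEY = new Text("max");
  private MultipleOutputs<Text, LongWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, LongWritable>(context);
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // assumes the timestamp is the first tab-separated field of the record
    long timestamp = Long.parseLong(record.toString().split("\t")[0]);

    // your transformed Job 1 record would go here; this just echoes the input
    mos.write("mainOut", record, new LongWritable(timestamp));

    context.write(CONSTANT_KEY, new LongWritable(timestamp));   // shuffled to reducer
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}

// Reducer: run with job.setNumReduceTasks(1); it receives every timestamp
// under the single constant key and writes the global maximum.
class MaxTimestampReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long max = Long.MIN_VALUE;
    for (LongWritable v : values) {
      max = Math.max(max, v.get());
    }
    context.write(key, new LongWritable(max));
  }
}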
B. In Job 1, write the maximum timestamp seen in each mapper as output. You can do this in cleanup(). Use MultipleOutputs to write to a folder other than that of your original output.
Once Job 1 is done, you have 'x' part files in that output folder, assuming you have 'x' mappers in Job 1. You can do a getmerge on this folder to pull all the part files into a single local file. This file will have 'x' lines, each containing a timestamp. You can read it with a stand-alone Java program, find the global maximum timestamp and save it to a local file. Share this file with Job 2 using the distributed cache, or pass the global maximum as a parameter.
I would suggest the following: create a directory where each mapper can write its maximum into a file named with the mapper name + id. The idea is to have a second output directory, and to avoid concurrency issues just make sure that each mapper writes to a unique file. Keep the maximum in a variable and write it to the file in each mapper's cleanup() method.
Once the job completes, it's trivial to iterate over the secondary output directory to find the maximum.
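A rough sketch of the mapper side of this idea (the "max.output.dir" property, the tab-separated timestamp field and the pass-through output are assumptions):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTrackingMapper extends Mapper<LongWritable, Text, Text, Text> {
  private long maxTimestamp = Long.MIN_VALUE;

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // assumes the timestamp is the first tab-separated field
    long ts = Long.parseLong(value.toString().split("\t")[0]);
    maxTimestamp = Math.max(maxTimestamp, ts);
    context.write(value, new Text(""));   // normal job output (placeholder)
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // one file per mapper: the task attempt id makes the name unique,
    // so there are no concurrent writers to the same path
    Path dir = new Path(context.getConfiguration().get("max.output.dir"));
    Path file = new Path(dir, "max-" + context.getTaskAttemptID());
    FileSystem fs = file.getFileSystem(context.getConfiguration());
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes(Long.toString(maxTimestamp) + "\n");
    }
  }
}

After waitForCompletion(), the driver (or a small stand-alone program) can list the files under max.output.dir with FileSystem.listStatus, read each one and take the largest value before submitting Job 2.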