MapReduce: writing from both mapper and reducer in a single job - Hadoop

I need to send only selected records from the mapper to the reducer, and write the rest of the filtered records to HDFS from the mapper itself. The reducer will write the records that are sent to it. My job processes huge data, around 20 TB, and uses 30K mappers, so I believe I cannot write from the mapper's cleanup method either, because loading that data from the 30K mappers' output files (30K files) will be another problem for the next job. I am using CDH4. Has anyone implemented a similar scenario, perhaps with a different approach?

When you want to write the data to HDFS, is it through the Java client? If so, you can put conditional logic in the mapper: records that meet the condition are written directly to HDFS, while the rest go to the normal map output location, from where the reducer picks them up.
By default the job output location is also an HDFS location, but you have to decide which layout you want the data to have in HDFS for your case.
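One common way to implement that kind of conditional split in a single job, without writing to HDFS by hand, is MultipleOutputs. The sketch below is only illustrative: the Text/Text types, the isSelected() predicate, the extractKey() helper and the "filtered" output name are placeholders, not anything from the question.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch: selected records go to the reducer; the rest are written from the
// mapper straight into the job output directory as filtered-m-* files.
public class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        if (isSelected(record)) {
            context.write(extractKey(record), record);          // goes through shuffle to the reducer
        } else {
            mos.write("filtered", extractKey(record), record);  // written directly by the mapper
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();    // flush the mapper-side output files
    }

    private boolean isSelected(Text record) {                   // placeholder predicate
        return record.toString().startsWith("KEEP");
    }

    private Text extractKey(Text record) {                      // placeholder key extraction
        return new Text(record.toString().split("\t", 2)[0]);
    }
}

The driver then declares the side output once with MultipleOutputs.addNamedOutput(job, "filtered", TextOutputFormat.class, Text.class, Text.class). Note that with 30K mappers this still produces up to 30K filtered-m-* files, so it does not solve the small-files concern by itself; it only avoids a second pass over the data.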

Related

Map task results in case of no reducer

While a MapReduce job runs, the map task results are stored in the local file system, and the final results from the reducer are stored in HDFS. The questions are:
What is the reason that map task results are stored in the local file system?
In the case of a MapReduce job where there is no reduce phase (only a map phase), where is the final result stored?
1) Mapper output is stored on the local file system because, in most scenarios, we are interested only in the output of the Reducer phase (also known as the final output). The mapper's <K,V> pairs are intermediate output that matters little once passed to the reducer. Storing mapper output in HDFS would waste storage: HDFS has a replication factor (3 by default), so the data, which is not required for any further processing, would take three times the space.
2) In the case of a map-only job, the final output is stored in HDFS.
1) After the TaskTracker (TT) finishes the mapper logic, and before sending the output to the sort and shuffle phase, the TT stores the output in temporary files on the local file system (LFS).
This avoids restarting the entire MR job in case of a network glitch: once stored on the LFS, the mapper output can be picked up directly from there. This data is called intermediate data.
The intermediate data is deleted once the job completes; otherwise, the LFS would grow over time with intermediate data from different jobs.
This local staging applies only to the mapper phase, not to the sort & shuffle or reducer phases.
2) When there is no reducer phase, the map output is written directly to HDFS.
What is the reason that map task results are stored in the local file system?
Mapper output is temporary and is relevant only to the reducer. Storing temporary output in HDFS (with its replication factor) would be overkill, so the Hadoop framework stores mapper output on the local file system instead of in HDFS. This saves a lot of disk space.
One more important point from the Apache tutorial page:
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
The Mapper outputs are sorted and then partitioned per Reducer
In the case of a MapReduce job where there is no reduce phase (only a map phase), where is the final result stored?
You can find more details about this on the Apache tutorial page:
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
If the number of reducers is greater than 0, mapper outputs are stored on the local file system and sorted before being sent to the reducers. If the number of reducers is 0, mapper outputs are stored in HDFS without sorting.
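As a minimal sketch of that "zero reducers" case (MyMapper and the argument paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only job: with zero reduce tasks the map output goes straight to the
// FileOutputFormat path on HDFS, unsorted, as part-m-* files.
public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "map-only-example");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(MyMapper.class);     // placeholder mapper class
        job.setNumReduceTasks(0);               // "Reducer NONE"
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}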

MapReduce program which caches results and computes automatically when changes affect the input dataset

I have a set of input files that go through changes. Is there any way to run a MapReduce program that caches results? Also, whenever there is any change to the input files, can the MapReduce program automatically run again so that the result set is altered according to the changes to the input files? Can we use MR to approach this dynamically?
Let me give you a fair idea of what can be done, since I cannot give code over here.
You can use Flume to pick up the changes to the file and use your MapReduce job as the Flume sink.
So whenever the content of the file changes, the Flume agent will be triggered and your MapReduce job, as the sink of Flume, will be executed.
This way you can achieve your goal.
Cheers
MapReduce is in the realm of batch processing and is not real time. Also, HDFS is an append-only file system: if one record out of a billion has changed, the whole dataset, or at least the affected part file, has to be rewritten. That is not good for near-real-time processing, and it can get very compute intensive if the changes cannot be cached in the mapper and you need to use a reduce-side join.
For the problem you have described, it would be better to use a combination of Kafka, Storm and HBase, or just HBase, depending on how the changes to the file are generated.

What if we only have one reducer

As we know, Hadoop tends to launch reducers on the machines where the corresponding mappers ran. What if we have 100 mappers and 1 reducer? We know that mappers store their data on local disk; will all the mapped data be transferred to the single reducer?
Yes, if there is only one reducer, all the data will be transferred to that reducer.
Each mapper initially stores its output in a local buffer (100 MB by default), and when the buffer fills to the percentage defined by io.sort.spill.percent, the contents are spilled to the disk location(s) defined by mapred.local.dir.
These files are copied to the reducer during the copy phase, in which the output of each mapper is fetched by mapred.reduce.parallel.copies parallel threads (5 by default).
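Those are the classic MRv1 property names. As an illustrative sketch only (the values shown are just the defaults quoted above), the job-level ones can be overridden in the driver's Configuration, while mapred.local.dir is a per-node setting in mapred-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative only: MRv1-era property names as used in the answer above.
public class ShuffleTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 100);                   // in-memory map output buffer, in MB
        conf.setFloat("io.sort.spill.percent", 0.80f);    // spill to disk when the buffer is 80% full
        conf.setInt("mapred.reduce.parallel.copies", 5);  // copier threads per reducer in the copy phase
        // mapred.local.dir (where the spills land) is configured per node in mapred-site.xml,
        // not per job.
        Job job = Job.getInstance(conf, "shuffle-tuning-sketch");
        // ... set mapper, reducer, input and output paths here ...
    }
}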
If you fix the number of reducers to one (with job.setNumReduceTasks(1) or -Dmapred.reduce.tasks=1), then all data from the mappers will be transferred to that one reducer, which will process all keys.
If you have only 1 reducer, all the data gets transferred to that reducer and all the output is stored in HDFS as a single file.
If you do not specify the number of reducers, the default is one.
You can set the number of reducers using job.setNumReduceTasks(__), and if you are using ToolRunner you can set the number of reducers from the command line itself:
-Dmapred.reduce.tasks=4
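For instance, a minimal ToolRunner driver (MyMapper and MyReducer are placeholder classes) that lets the flag be picked up from the command line could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// ToolRunner runs GenericOptionsParser first, so -Dmapred.reduce.tasks=4 passed
// on the command line is already applied to getConf() by the time run() executes.
public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "reducer-count-example");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMapper.class);     // placeholder
        job.setReducerClass(MyReducer.class);   // placeholder
        // Or hard-code it instead: job.setNumReduceTasks(1);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}

Invoked as, for example: hadoop jar myjob.jar MyDriver -Dmapred.reduce.tasks=4 /input /output (the jar and path names are placeholders).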

Mapper writing to HDFS and Reducer writing to HBase table

I have a MapReduce job (say Job1) in which the mapper extends
Mapper<Object, Object, KeySet, ValueSet>
Let's say I want to sum all the values in a ValueSet in the reduce step.
After reducing (key, Iterable), I want to write the final reduced values to an HBase table instead of HDFS, in the reducer of Job1. The table in HBase will be used by future jobs.
I know I can write a mapper-only Job2 which reads the reduced file in HDFS (written by Job1) and imports the data into the HBase table, but I want to avoid two redundant I/O operations.
I don't want to change the Mapper class of Job1 to write to HBase, because there are only specific values that I want to write to the HBase table; the others I want to keep writing to HDFS.
Has anyone tried something similar and can provide pointers?
I've looked at "HBase mapreduce: write into HBase in Reducer", but my question is different since I don't want to write anything to HBase in the mapper.
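For what it's worth, one way to keep the Job1 mapper untouched is to configure only the reduce side for HBase with TableMapReduceUtil.initTableReducerJob and extend TableReducer. This is only a sketch, assuming the summation example, Text/IntWritable intermediate types, and a pre-existing table with column family cf:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Sketch: the mapper stays exactly as it is; only the reducer output goes to HBase.
// The table name, column family/qualifier and key/value types are assumptions.
public class SumToHBaseReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        Put put = new Put(Bytes.toBytes(key.toString()));
        // HBase 0.94-era API (CDH4); newer HBase versions use put.addColumn(...)
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
        context.write(null, put);   // TableOutputFormat ignores the key
    }
}

In the driver, TableMapReduceUtil.initTableReducerJob("my_table", SumToHBaseReducer.class, job) (table name is a placeholder) takes the place of setting a file output path, so only the reducer talks to HBase. If some reducer output must still land in HDFS as well, a side channel such as MultipleOutputs, sketched for the first question above, is one option to investigate.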

Write Reducer output of a Mapreduce job to a single File

I have written a MapReduce job for data in HBase. It contains multiple mappers and just a single reducer. The reducer takes the data supplied by the mappers and does some analysis on it. After processing is complete for all the data in HBase, I want to write the result back to a file in HDFS through the single reducer. Presently I am able to write data to HDFS every time I get new data, but I am unable to figure out how to write only the final conclusion to HDFS at the end.
So, if you are trying to write a final result from a single reducer to HDFS, you can try any one of the approaches below -
Use the Hadoop FileSystem API's create() function to write to HDFS from the reducer.
Emit a single key and value from the reducer after the final calculation.
Override the Reducer's cleanup() function and do point (1) there.
Details on 3:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Reducer.html#cleanup-org.apache.hadoop.mapreduce.Reducer.Context-
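A sketch combining points (1) and (3), assuming a single reducer; the Text/IntWritable types, the running total, and the output path are placeholders:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: accumulate across all keys in the single reducer, then write one
// final summary file from cleanup(). Types, path and aggregation logic are placeholders.
public class FinalResultReducer extends Reducer<Text, IntWritable, NullWritable, NullWritable> {

    private long total = 0;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        for (IntWritable v : values) {
            total += v.get();       // whatever per-key analytic you need
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path out = new Path("/user/hadoop/final-result/summary.txt");   // placeholder path
        FSDataOutputStream os = fs.create(out, true);                   // overwrite if present
        try {
            os.writeBytes("total=" + total + "\n");
        } finally {
            os.close();
        }
    }
}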
Hope this helps.
