Is there a way to collect output from a reducer in Hadoop?

Is there a way to collect the output from a reducer and prevent it from writing to file? I'd like to sort the reduced output before writing to file.

No, there is no way to do that. A MapReduce job finishes only after its reducers have written their results to files.
If I understand correctly, you want the reducer output sorted in some particular way rather than by the default order of the keys passed to the reducer.
There are two possible approaches:
1. Change the output key in the Map phase to the key by which your data should be sorted in the Reduce phase.
2. If the first approach is not possible, sort the reducer output with another MapReduce job or a different tool. You can start the sorting job right after the main job from the same driver, specifying the output directory of the main job as the input directory of the sorting job.
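The chained-job idea can be sketched in plain Python, without Hadoop. This is only a single-process simulation: `run_job` and the map/reduce functions are hypothetical names made up for illustration, and word count stands in for whatever the main job computes. The first pass produces output ordered by its reduce key; the second pass swaps key and value so that the framework's own sort orders the records by count.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Simulate one MapReduce pass: map, group by key, sort keys, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # reducers receive their keys in sorted order
        output.extend(reduce_fn(key, groups[key]))
    return output

# Job 1 (stand-in for the main job): word count, output ordered by word.
def count_map(line):
    return [(word, 1) for word in line.split()]

def count_reduce(word, ones):
    return [(word, sum(ones))]

# Job 2 (the sorting job): re-key each record by its count.
def swap_map(pair):
    word, count = pair
    return [(count, word)]

def identity_reduce(count, words):
    return [(count, word) for word in sorted(words)]

lines = ["b a c", "a c", "c"]
counted = run_job(lines, count_map, count_reduce)         # output of the main job
by_count = run_job(counted, swap_map, identity_reduce)    # main job output fed to the sorting job
print(by_count)
```

The simulation behaves like a job with a single reducer. With several reducers, each output file is sorted only within itself, so a globally sorted result needs either one reducer or a partitioner that assigns key ranges to reducers in order.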

Related

Map task results in case of no reducer

While a MapReduce job runs, the map task results are stored in the local file system, and the final results from the reducers are then stored in HDFS. My questions are:
Why are the map task results stored in the local file system?
In a MapReduce job with no reduce phase (map phase only), where is the final result stored?
1) Mapper output is stored in the local file system because in most scenarios we are interested in the output of the Reducer phase (also known as the final output). The Mapper <K,V> pairs are intermediate output, which is of little importance once passed to the Reducer. Storing Mapper output in HDFS would waste storage: HDFS has a replication factor (3 by default), so three times the space would be taken by data that is not required in further processing.
2) In a map-only job, the final output is stored in HDFS.
1) After the TaskTracker (TT) finishes the mapper logic, and before the output goes to the sort and shuffle phase, the TT stores the output in temporary files on the local file system (LFS).
This avoids restarting the entire MR job in case of a network glitch. Once stored in the LFS, the mapper output can be picked up directly from there. This data is called intermediate data, and the concept is called Data Localization.
The intermediate data is deleted once the job completes. Otherwise, the LFS would keep growing with intermediate data from different jobs over time.
Data Localization applies only to the Mapper phase, not to the sort and shuffle or Reducer phases.
2) When there is no reducer phase, the intermediate data is eventually pushed to HDFS.
Why are the map task results stored in the local file system?
Mapper output is temporary and relevant only to the Reducer. Storing temporary output in HDFS (with its replication factor) is overkill. For this reason, the Hadoop framework stores Mapper output in the local file system instead of HDFS, which saves a lot of disk space.
One more important point from the Apache tutorial page:
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
The Mapper outputs are sorted and then partitioned per Reducer.
In a MapReduce job with no reduce phase (map phase only), where is the final result stored?
You can find more details about this on the Apache tutorial page.
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
If the number of Reducers is greater than 0, mapper outputs are stored in the local file system and sorted before being sent to the Reducers. If the number of Reducers is 0, mapper outputs are stored in HDFS without sorting.

hadoop job output files

I currently have one Hadoop Oozie job running. The output files are generated automatically. I expect just ONE output file; however, there are two, called part-r-00000 and part-r-00001. Sometimes the first one (part-r-00000) has data and the second one (part-r-00001) doesn't; sometimes it is the other way around. Can anyone tell me why? Also, how can I make the job write only part-r-00000?
In Hadoop, the output files are a product of the Reducers (or Mappers if it's a map-side only job, in which case it will be a part-m-xxxxx file). If your job uses two reducers, that means that after each has finished with its portion, it will write to the output directory in the form of part-r-xxxxx, where the numbers denote which reducer wrote it out.
That said, you cannot specify a single output file, but only the directory. To get all of the files from the output directory into a single file, use:
hdfs dfs -getmerge <src> <localdst> [addnl]
Or if you're using an older version of hadoop:
hadoop fs -getmerge <src> <localdst> [addnl]
See the shell guide for more info.
As to why one of your output files is empty: data is routed from Mappers to Reducers by the partitioner, which assigns each key to exactly one reducer. If you specify two reducers but all of your records share a single key, every record goes to the same reducer and the other one writes no data. Alternatively, if some logic within the reducer prevents a write, that is another reason data may not be written from one reducer.
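To see how one reducer can end up with nothing, here is a small Python sketch of hash partitioning. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch mimics that formula but uses Java's `String.hashCode()` on ASCII keys as a stand-in, which is an assumption (a `Text` key hashes its bytes differently). The function names are made up for illustration.

```python
def java_string_hash(s):
    """Java's String.hashCode() as a signed 32-bit value (ASCII input assumed)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key, num_reducers):
    """Same formula as the default hash partitioner: non-negative hash modulo reducer count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

def route(records, num_reducers):
    """Distribute (key, value) records into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

# All records share one key, so only one of the two reducers receives data.
buckets = route([("k", 1), ("k", 2), ("k", 3)], num_reducers=2)
for index, bucket in enumerate(buckets):
    print(f"reducer {index}: {len(bucket)} records")
```

Which reducer ends up empty depends only on the key's hash, so a different key in a later run can flip which part file contains the data.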
The output files are by default named part-x-yyyyy where:
x is either 'm' or 'r', depending on whether this file was generated by a map or reduce task
yyyyy is the mapper or reducer task number (zero based)
The number of tasks has nothing to do with the number of physical nodes in the cluster. For map tasks, the number of tasks is given by the number of input splits. The number of reduce tasks is usually set with job.setNumReduceTasks() or passed as an input parameter.
A job which has 100 reducers will have files named part-r-00000 to part-r-00099, one for each reducer task.
A map-only job with 100 input splits will have files named part-m-00000 to part-m-00099, one for each map task.
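The naming scheme can be sketched in a few lines of Python. `part_file_names` is a hypothetical helper written for illustration, not part of Hadoop; it only reproduces the default part-x-yyyyy pattern with zero-based, five-digit task numbers.

```python
def part_file_names(task_type, num_tasks):
    """Return the default output file names for a job with num_tasks tasks.

    task_type is 'm' for a map-only job or 'r' for a job with reducers.
    """
    if task_type not in ("m", "r"):
        raise ValueError("task_type must be 'm' or 'r'")
    # Task numbers are zero-based, so the last file is num_tasks - 1.
    return [f"part-{task_type}-{i:05d}" for i in range(num_tasks)]

names = part_file_names("r", 100)
print(names[0], names[-1])
```

Because numbering starts at zero, 100 tasks end at 00099, and two reducers produce exactly part-r-00000 and part-r-00001.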
The number of files output is dependent on the number of mappers and reducers. In your case, the number of files and names of files indicates that your output came from 2 reducers.
How to limit the number of mappers or reducers depends on your language (Hive, Java, etc.), but each has a property you can set to limit them. See here for Java MapReduce jobs.
Files can be empty if that particular mapper or reducer task had no resulting data on the given data node.
Finally, I don't think you want to limit your mappers and reducers. This will defeat the point of using Hadoop. If you're aiming to read all files as one, make sure they are consolidated in a given directory and pass the directory as the file name. The files will be treated as one.

Apache Sqoop-1 reducer phase

I have gone through the Sqoop documentation and did not find an explanation of why Sqoop-1 has no reducer phase. Can someone please explain this?
The purpose of the Reducer is to aggregate the input values and return a single output value.
Look at the simple example of WordCount in MapReduce. The Reducer is used to aggregate the number of occurrences of a single word.
Since the nature of a Sqoop job is to fetch the input records from the given RDBMS and put the records into the given output directory in HDFS or into a Hive table, the job does not require any aggregation and therefore no Reduce phase is needed.
Reduce phase is not needed when all tasks can be executed in parallel.
Sqoop does not need a reducer because it imports/exports data between an RDBMS and HDFS (or Hive tables).
Since an RDBMS holds structured data, there is no need for shuffle or sort, and any aggregation can be done in the mapper itself.

Mapreduce: writing from both mapper and reducer in single job

I need to send only selected records from the mapper to the reducer and write the remaining, filtered-out records to HDFS from the mapper itself. The reducer will write the records sent to it. My job processes a huge amount of data, around 20 TB, and uses 30K mappers, so I believe I cannot write from the mapper's cleanup method either, because loading the data from 30K mapper output files will be another problem for the next job. I am using CDH4. Has anyone implemented a similar scenario with a different approach?
When you write the data to HDFS, is it through the Java client? If so, you can add conditional logic in the mapper: records that meet the condition are written directly to HDFS, and records that do not meet it are written to the mapper's output location, from where the reducer picks them up.
By default the output location is also an HDFS location, but you have to decide how you want the data laid out in HDFS for your case.

Map Reduce program which Caches results and computes automatically when changes affect input dataset

I have a set of input files that keep changing. Is there any way to run a MapReduce program that caches its results, so that whenever the input files change, the program automatically runs again and the result set is updated accordingly? Can we use MR to approach this dynamically?
I can only give you the general idea, as I cannot post code here.
You can use Flume to watch for changes in the file and use the MapReduce job as the Flume sink.
Whenever the content of the file changes, the Flume agent is triggered and your MapReduce job, as the sink, is executed.
This way you can achieve your goal.
Cheers
MapReduce is in the realm of batch processing and is not real time. Also, HDFS is an append-only file system: if one record out of a billion changes, the whole dataset or part file has to be rewritten. That is not good for near-real-time processing, and it can get very compute-intensive if the changes cannot be cached in the Mapper and you need a reduce-side join.
For the problem you have described it will be better to use a combination of Kafka, Storm and HBase or just HBase depending on how the changes to the file are generated.

Resources