when does data to output file is written in a map reduce architecture?How can i use processed reducer output data in reducer? - hadoop

I am using hadoop version:1.0.0
After processing each reducer input key i am collecting the output.But it is not written to actual output file. I am trying to use processed intermediate output for processing further input keys.How can i do this?
Could you please suggest me how to use that intermediate data.When does mapreduce write data to output file?.

What you ask is something against the MR paradigm. And, as any deviation from the concept has consiquences.
Technically data is passed to the OutputFormat and it is his discretion to push it to output. I think it is written during the job, but You might have some delay before seeing it.
I think it would be easier for you to exolicitely accumulate processed data in reducer and use it, although this solution has inhenrent problem. You can face out of memory if there is enought keys.
I would suggest using two MR jobs, or some other techinques to make reducer stateless or at least limit amount of data it can accumulate.

Related

How to go through the OutputFormat.RecordWriter write(key,value) twice in Hadoop

I have a situation where I need to go through the key/value pairs of my OutputFormat twice. In essence:
OutputFormat.getRecordWriter() // returns RecordWriteType1
... and when all those are complete across all machines
OutputFormat.getRecordWriter() // return RecordWriterType2
The typing of both RecordWriterType1/2 are the same. Is there a way to do this?
Thank you,
Marko.
Unfortunately you cannot simply run over the reducer data twice.
You do have some options to possibly work around:
Use an identity reducer to output the sorted data to HDFS, then run two jobs over the data with identity mappers - wasteful but simple if you don't have that much data
As above, but you could use map only jobs and the key comparator to emulate the reducer function as you know the input is already sorted (you'll need to make sure the split size is set sufficiently large to ensure all data from the first reducer output file is processed in a single mapper and not split over 2+ mapper instances
You could write the reducer key/values to local disk in your reducer, and then in the clean up method of the reducer, opening the local file up and process as detailed in the second option (using the group comparator to detemine key boundary).
If you dig through the source for ReduceTask, you may even be able to 'abuse' the merged sorted segments on local disk and run over the data again, but this option is pure unadulterated hackery...

hadoop - How can i use data in memory as input format?

I'm writing a mapreduce job, and I have the input that I want to pass to the mappers in the memory.
The usual method to pass input to the mappers is via the Hdfs - sequencefileinputformat or Textfileinputformat. These inputformats need to have files in the fdfs which will be loaded and splitted to the mappers
I cant find a simple method to pass, lets say List of elemnts to the mappers.
I find myself having to wrtite these elements to disk and then use fileinputformat.
any solution?
I'm writing the code in java offcourse.
thanks.
Input format is not have to load data from the disk or file system.
There are also input formats reading data from other systems like HBase or (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html) where data is not implied to sit on the disk. It only is implied to be available via some API on all nodes of the cluster.
So you need to implement input format which splits data in your own logic (as soon as there is no files it is your own task) and to chop the data into records .
Please note that your in memory data source should be distributed and run on all nodes of the cluster. You will also need some efficient IPC mechanism to pass data from your process to the Mapper process.
I would be glad also to know what is your case which leads to this unusual requirement.

how to output to HDFS from mapper directly?

In certain criteria we want the mapper do all the work and output to HDFS, we don't want the data transmitted to reducer(will use extra bandwidth, please correct me if there is case its wrong).
a pseudo code would be:
def mapper(k,v_list):
for v in v_list:
if criteria:
write to HDFS
else:
emit
I found it hard because the only thing we can play with is OutputCollector.
One thing I think of is to exend OutputCollector, override OutputCollector.collect and do the stuff.
Is there any better ways?
You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.
From the Map-Reduce manual: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem,
into the output path set by setOutputPath(Path). The framework does not sort
the map-outputs before writing them out to the FileSystem.
I'm assuming that you're using streaming, in which case there is no standard way of doing this.
It's certainly possible in a java Mapper. For streaming you'd need amend the PipeMapper java file, or like you say write your own output collector - but if you're going to that much trouble, you might has well just write a java mapper.
Not sending something to the Reducer may not actually save bandwidth if you are still going to write it to the HDFS. The HDFS is still replicated to other nodes and the replication is going to happen.
There are other good reasons to write output from the mapper though. There is a FAQ about this, but it is a little short on details except to say that you can do it.
I found another question which is potentially a duplicate of yours here. That question has answers that are more help if you are writing a Mapper in Java. If you are trying to do this in a streaming way, you can just use the hadoop fs commands in scripts to do it.
We can in fact write output to HDFS and pass it on to Reducer also at the same time. I understand that you are using Hadoop Streaming, I've implemented something similar using Java MapReduce.
We can generate named output files from a Mapper or Reducer using MultipleOutputs. So, in your Mapper implementation after all the business logic for processing input data, you can write the output to MultipleOutputs using multipleOutputs.write("NamedOutputFileName", Outputkey, OutputValue) and for the data you want to pass on to reducer you can write to context using context.write(OutputKey, OutputValue)
I think if you can find something to write the data from mapper to a named output file in the language you are using (Eg: Python) - this will definitely work.
I hope this helps.

Writing to single file from mappers

I am working on mapreduce that is generating CSV file out of some data that is read from HBase. Is there a way to write to single file from mappers without reduce phase (or to merge multiple files generated by mappers at the end of job)? I know that I can set output format to write in file on Job level, is it possible to do similar thing for mappers?
Thanks
It is possible (and not uncommon) to have a Map/Reduce-Job without a reduce phase (example). For that you just use job.setNumReduceTasks(0).
However I am not sure how Job-Output is handled in this case. Ususally you get one result file per reducer. Without reducers I could imagine that you either get one file per mapper or that you cannot produce job output. You will have to try/research that.
If the above does not work for you, you could still use the default Reducer implementation, that just forwards the mapper output (identity function).
Seriously, this is not how MapReduce works.
Why do you even need a Job for that? Write a simple Java application that does the same for you. There are also command line utils that does the same for you.

Is the input file to a MapReduce program mandatory?

I am working on a use case where I generate random data using a map reduce program and I do not require any input file in HDFS. If I don't give input path MR program doesn't work. So, currently I have a dummy input file. Is there any way to avoid this?
Usually MR programs have some sort of data for processing. But, there might be scenarios like Random Generation where is there is no data to be processed. Checkout the TeraGen program for the random number generation which takes number of rows and the output directory as input. Also, I haven't tried the DataGenerator, but it seems interesting.

Resources