Hadoop streaming and AMAZON EMR - hadoop

I have been attempting to use Hadoop streaming in Amazon EMR to do a simple word count for a bunch of text files. In order to get a handle on hadoop streaming and on Amazon's EMR I took a very simplified data set too. Each text file had only one line of text in it (the line could contain arbitrarily large number of words).
The mapper is an R script, that splits the line into words and spits it back to the stream.
cat(wordList[i],"\t1\n")
I decided to use the LongValueSum Aggregate reducer for adding the counts together, so I had to prefix my mapper output by LongValueSum
cat("LongValueSum:",wordList[i],"\t1\n")
and specify the reducer to be "aggregate"
The questions I have now are the following:
The intermediate stage between mapper and reducer, just sorts the stream. It does not really combine by the keys. Am I right? I ask this because If I do not use "LongValueSum" as a prefix to the words output by the mapper, at the reducer I just receive the streams sorted by the keys, but not aggregated. That is I just receive ordered by K, as opposed to (K, list(Values)) at the reducer. Do I need to specify a combiner in my command?
How are other aggregate reducers used. I see, a lot of other reducers/aggregates/combiners available on http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html
How are these combiners and reducer specified in an AMAZON EMR set up?
I believe an issue of this kind has been filed and fixed in Hadoop streaming for a combiner, but I am not sure what version AMAZON EMR is hosting, and the version in which this fix is available.
How about custom input formats and record readers and writers. There are bunch of libraries written in Java. Is it sufficient to specify the java class name for each of these options?

The intermediate stage between mapper and reducer, just sorts the stream. It does not really combine by the keys. Am I right?
The aggregate reducer in streaming does implement the relevant combiner interfaces so Hadoop will use it if it sees fit [1]
That is I just receive ordered by K, as opposed to (K, list(Values)) at the reducer.
With the streaming interface you always receive K,V value pairs; you'll never receive (K,list(values))
How are other aggregate reducers used.
Which of them are you unsure about? The link you specified has a quick summary of the behaviour of each
I believe an issue of this kind has been filed and fixed
What issue are you thinking of?
not sure what version AMAZON EMR is hosting
EMR is based on Hadoop 0.20.2
Is it sufficient to specify the java class name for each of these options?
Do you mean in the context of streaming? or the aggregate framework?
[1] http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html

Related

How to go through the OutputFormat.RecordWriter write(key,value) twice in Hadoop

I have a situation where I need to go through the key/value pairs of my OutputFormat twice. In essence:
OutputFormat.getRecordWriter() // returns RecordWriteType1
... and when all those are complete across all machines
OutputFormat.getRecordWriter() // return RecordWriterType2
The typing of both RecordWriterType1/2 are the same. Is there a way to do this?
Thank you,
Marko.
Unfortunately you cannot simply run over the reducer data twice.
You do have some options to possibly work around:
Use an identity reducer to output the sorted data to HDFS, then run two jobs over the data with identity mappers - wasteful but simple if you don't have that much data
As above, but you could use map only jobs and the key comparator to emulate the reducer function as you know the input is already sorted (you'll need to make sure the split size is set sufficiently large to ensure all data from the first reducer output file is processed in a single mapper and not split over 2+ mapper instances
You could write the reducer key/values to local disk in your reducer, and then in the clean up method of the reducer, opening the local file up and process as detailed in the second option (using the group comparator to detemine key boundary).
If you dig through the source for ReduceTask, you may even be able to 'abuse' the merged sorted segments on local disk and run over the data again, but this option is pure unadulterated hackery...

hadoop - How can i use data in memory as input format?

I'm writing a mapreduce job, and I have the input that I want to pass to the mappers in the memory.
The usual method to pass input to the mappers is via the Hdfs - sequencefileinputformat or Textfileinputformat. These inputformats need to have files in the fdfs which will be loaded and splitted to the mappers
I cant find a simple method to pass, lets say List of elemnts to the mappers.
I find myself having to wrtite these elements to disk and then use fileinputformat.
any solution?
I'm writing the code in java offcourse.
thanks.
Input format is not have to load data from the disk or file system.
There are also input formats reading data from other systems like HBase or (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html) where data is not implied to sit on the disk. It only is implied to be available via some API on all nodes of the cluster.
So you need to implement input format which splits data in your own logic (as soon as there is no files it is your own task) and to chop the data into records .
Please note that your in memory data source should be distributed and run on all nodes of the cluster. You will also need some efficient IPC mechanism to pass data from your process to the Mapper process.
I would be glad also to know what is your case which leads to this unusual requirement.

Control the Reducer Result Output File/Bucket

I have an application where I would like to make my reducers (I have several of them for a map/reduce job) to record their outputs into different files on the HDFS depending on the key coming to them for processing. So if the reducer sees a key of say type A, apply the reduce the logic but tell Hadoop to put the result into the hdfs file belonging to type A result and so on. Obviously multiple reducers can will be outputting different portion of the type A result and each reducer can end up working on any type like A or B but tell hadoop to write the result into the type A bucket or something
Is this possible?
MultipleOutputs is almost what you are looking for (assuming you are at least at version 0.21). In my own work I have used a clone of this class modified to be more flexible about naming conventions to send output to different folders/files based on anything I want, including aspects of the input records (keys or values). As is, the class has some draconian restrictions on what names you can give to the outputs.

how to output to HDFS from mapper directly?

In certain criteria we want the mapper do all the work and output to HDFS, we don't want the data transmitted to reducer(will use extra bandwidth, please correct me if there is case its wrong).
a pseudo code would be:
def mapper(k,v_list):
for v in v_list:
if criteria:
write to HDFS
else:
emit
I found it hard because the only thing we can play with is OutputCollector.
One thing I think of is to exend OutputCollector, override OutputCollector.collect and do the stuff.
Is there any better ways?
You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.
From the Map-Reduce manual: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem,
into the output path set by setOutputPath(Path). The framework does not sort
the map-outputs before writing them out to the FileSystem.
I'm assuming that you're using streaming, in which case there is no standard way of doing this.
It's certainly possible in a java Mapper. For streaming you'd need amend the PipeMapper java file, or like you say write your own output collector - but if you're going to that much trouble, you might has well just write a java mapper.
Not sending something to the Reducer may not actually save bandwidth if you are still going to write it to the HDFS. The HDFS is still replicated to other nodes and the replication is going to happen.
There are other good reasons to write output from the mapper though. There is a FAQ about this, but it is a little short on details except to say that you can do it.
I found another question which is potentially a duplicate of yours here. That question has answers that are more help if you are writing a Mapper in Java. If you are trying to do this in a streaming way, you can just use the hadoop fs commands in scripts to do it.
We can in fact write output to HDFS and pass it on to Reducer also at the same time. I understand that you are using Hadoop Streaming, I've implemented something similar using Java MapReduce.
We can generate named output files from a Mapper or Reducer using MultipleOutputs. So, in your Mapper implementation after all the business logic for processing input data, you can write the output to MultipleOutputs using multipleOutputs.write("NamedOutputFileName", Outputkey, OutputValue) and for the data you want to pass on to reducer you can write to context using context.write(OutputKey, OutputValue)
I think if you can find something to write the data from mapper to a named output file in the language you are using (Eg: Python) - this will definitely work.
I hope this helps.

hadoop streaming ensuring one key per reducer

I have a mapper that, while processing data, classifies output into 3 different types (type is the output key). My goal is to create 3 different csv files via the reducers, each with all of the data for one key with a header row.
The key values can change and are text strings.
Now, ideally, i would like to have 3 different reducers and each reducer would get only one key with it's entire list of values.
Except, this doesn't seem to work because the keys don't get mapped to specific reducers.
The answer to this in other places has been to write a custom partitioner class that would map each desired key value to a specific reducer. This would be great except that I need to use streaming with python and i am not able to include a custom streaming jar in my job so that seems not an option.
I see in the hadoop docs that there is an alternate partitioner class available that can enable secondary sorts, but it isn't immediately obvious to me that it is possible, using either the default or key field based partitioner, to ensure that each key ends up on it's own reducer without writing a java class and using a custom streaming jar.
Any suggestions would be much appreciated.
Examples:
mapper output:
csv2\tfieldA,fieldB,fieldC
csv1\tfield1,field2,field3,field4
csv3\tfieldRed,fieldGreen
...
the problem is that if i have 3 reducers i end up with key distribution like this:
reducer1 reducer2 recuder3
csv1 csv2
csv3
one reducer gets two different key types and one reducer gets no data sent to it at all. this is because the hash(key csv1) mod 3 and hash(key csv2) mod 3 result in the same value.
I'm pretty sure MultipleOutputFormat [1] can be used under streaming. That'll solve most of your problems.
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html
If you are stuck with streaming, and can't include any external jars for a custom partitioner, then this is probably not going to work the way you want it to without some hacks.
If these are absolute requirements, you can get around this, but it's messy.
Here's what you can do:
Hadoop, by default, uses a hashing partitioner, like this:
key.hashCode() % numReducers
So you can pick keys such that they hash to 1, 2, and 3 (or three numbers such that x % 3 = 1, 2, 3). This is a nasty hack, and I wouldn't suggest it unless you have no other options.
If you want custom output to different csv files you can direct write (with API) to hdfs. As you know hadoop passes with key and associated value list to single reduce task. In reduce code, check , while key is same write to same file. If another key comes, create new file manually and write into it. It does not matter how many reducers you have

Resources