Hadoop input format for Hadoop Streaming: the WikiHadoop InputFormat

I wonder whether there are any differences between the InputFormats for Hadoop and Hadoop Streaming. Do the InputFormats written for Hadoop Streaming also work for plain Hadoop MapReduce jobs, and vice versa?
I am asking because I found a special InputFormat for the Wikipedia dump files, the WikiHadoop InputFormat, and it is described as an InputFormat for Hadoop Streaming. Why only for Hadoop Streaming, and not for Hadoop in general?

As far as I know, there is no difference in how inputs are processed between Hadoop streaming jobs and regular MapReduce jobs written in Java.
The inheritance tree for StreamWikiDumpInputFormat is...
* InputFormat
* FileInputFormat
* KeyValueTextInputFormat
* StreamWikiDumpInputFormat
Since it ultimately implements InputFormat, it can be used in regular MapReduce jobs as well.
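As a rough illustration, a driver for a plain Java job can plug it in like any other InputFormat. In the sketch below, the package name org.wikimedia.wikihadoop and the use of the old mapred API are my assumptions based on the inheritance tree above; adjust them to match the actual jar.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

// Hypothetical driver: the wikihadoop jar must be on the classpath, and the
// package name org.wikimedia.wikihadoop is my guess at where the class lives.
public class WikiDumpDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WikiDumpDriver.class);
        conf.setJobName("wikidump-demo");
        // The custom format is set exactly like any built-in InputFormat.
        conf.setInputFormat(org.wikimedia.wikihadoop.StreamWikiDumpInputFormat.class);
        conf.setMapperClass(IdentityMapper.class); // pass-through mapper for the demo
        conf.setNumReduceTasks(0);                 // map-only, just to show the plumbing
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}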

No. The type of MR job (streaming or Java) is not the criterion for using (or developing) an InputFormat. An InputFormat is just an InputFormat and will work for both streaming and Java MR jobs. It is the type of data you are going to process that determines which InputFormat you use (or develop). Hadoop natively provides different types of InputFormats which are normally sufficient to handle your needs, but sometimes your data is in such a state that none of these InputFormats can handle it.
Having said that, it is still possible to process that data using MR, and this is where you end up writing your own custom InputFormat, like the one you have mentioned above.
I don't know why they have emphasized Hadoop Streaming so much. It's just a Java class which does everything an InputFormat should do and implements everything that makes it eligible to do so. #climbage has made a very valid point about this. So it can be used with any MR job, streaming or Java.

There is no difference between the usual input formats and the ones developed for Hadoop Streaming.
When the author says that the format was developed for Hadoop Streaming, the only thing she means is that her input format produces objects with meaningful toString() methods. That's it.
For example, when I develop an input format for use with Hadoop Streaming, I try to avoid binary Writables and use Text instead.
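To make that concrete, here is a tiny stand-alone illustration (my own sketch, not taken from any particular input format) of why Text plays nicer with streaming scripts than a binary Writable: streaming writes roughly key.toString() + "\t" + value.toString() to your script's stdin, and the two toString() results look very different.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

// Prints what a streaming script would roughly see on stdin for the same
// record wrapped as Text vs. as a binary Writable (BytesWritable renders as
// space-separated hex, which is useless to most scripts).
public class ToStringDemo {
    public static void main(String[] args) {
        Text asText = new Text("Some_Page\tsome wiki markup");
        BytesWritable asBytes = new BytesWritable("Some_Page\tsome wiki markup".getBytes(StandardCharsets.UTF_8));
        System.out.println("Text value   -> " + asText.toString());
        System.out.println("Binary value -> " + asBytes.toString());
    }
}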

Related

StreamInputFormat for mapreduce job

I have an application that connects to a remote system and transfers data from it using the SFTP protocol. I want to use a MapReduce job to do the same, so I need an input format that reads from an input stream. I have been going through the docs for HStreamInputFormat and StreamInputFormat, but my hadoop-2.0 doesn't seem to support these classes. How do I proceed? Any links to tutorials or examples of reading from input streams using input formats?
If those StreamInputFormats don't support your needs, then you're better off writing your own InputFormat tailored to them. Please read this tutorial to learn how to write your own custom InputFormat and RecordReader.
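To give a feel for what that involves, below is a rough skeleton against the new mapreduce API. The SFTP side is deliberately left as a placeholder: openRemoteStream(), RemoteStreamSplit and the single-split layout are my own illustrative choices (so the read is not parallelized), not the API of any existing library.

import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Skeleton of an InputFormat that reads line records from an arbitrary
// InputStream. openRemoteStream() is where you would open the SFTP
// connection with the client of your choice (e.g. JSch).
public class RemoteStreamInputFormat extends InputFormat<LongWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        // One logical split for the whole remote stream, so a single map task reads it.
        return Collections.<InputSplit>singletonList(new RemoteStreamSplit());
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new RemoteStreamRecordReader();
    }

    // The split carries no state here; it only has to be Writable so the framework can ship it.
    public static class RemoteStreamSplit extends InputSplit implements Writable {
        @Override public long getLength() { return 0; }
        @Override public String[] getLocations() { return new String[0]; }
        @Override public void write(DataOutput out) { }
        @Override public void readFields(DataInput in) { }
    }

    public static class RemoteStreamRecordReader extends RecordReader<LongWritable, Text> {
        private BufferedReader reader;
        private long lineNo = -1;
        private final Text currentLine = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            // Placeholder: replace with the InputStream returned by your SFTP client.
            InputStream in = openRemoteStream(context);
            reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            String line = reader.readLine();
            if (line == null) return false;
            lineNo++;
            currentLine.set(line);
            return true;
        }

        @Override public LongWritable getCurrentKey() { return new LongWritable(lineNo); }
        @Override public Text getCurrentValue() { return currentLine; }
        @Override public float getProgress() { return 0f; } // stream length unknown
        @Override public void close() throws IOException { if (reader != null) reader.close(); }

        private InputStream openRemoteStream(TaskAttemptContext context) {
            throw new UnsupportedOperationException("wire up your SFTP client here");
        }
    }
}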

Hadoop Streaming with SequenceFile (on AWS)

I have a large number of Hadoop SequenceFiles which I would like to process using Hadoop on AWS. Most of my existing code is written in Ruby, and so I would like to use Hadoop Streaming along with my custom Ruby Mapper and Reducer scripts on Amazon EMR.
I cannot find any documentation on how to integrate Sequence Files with Hadoop Streaming, and how the input will be provided to my Ruby scripts. I'd appreciate some instructions on how to launch jobs (either directly on EMR, or just a normal Hadoop command line) to make use of SequenceFiles and some information on how to expect the data to be provided to my script.
Edit: I had previously referred to StreamFiles rather than SequenceFiles by mistake; I think the documentation for my data was incorrect. Apologies. With that change the answer is straightforward.
The answer to this is to specify the input format as a command line argument to Hadoop.
-inputformat SequenceFileAsTextInputFormat
Chances are that you want the SequenceFile as text, but there is also SequenceFileAsBinaryInputFormat if that's more appropriate.
Not sure if this is what you're asking for, but the command to use ruby map reduce scripts with the hadoop command line would look something like this:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb
You can (and should) use a combiner with big data sets. Add it with the -combiner option. The combiner runs on your mapper's output before it is sent to the reducer, but there is no guarantee how many times it will be called, if at all. Otherwise your input is split (according to the standard Hadoop protocol) and fed directly to your mapper. The example is from O'Reilly's Hadoop: The Definitive Guide, 3rd Edition, which has some very good information on streaming, including a section dedicated to streaming with Ruby.

how to output to HDFS from mapper directly?

Under certain criteria we want the mapper to do all the work and write its output directly to HDFS; we don't want that data transmitted to the reducer (it would use extra bandwidth; please correct me if there is a case where that's wrong).
a pseudo code would be:
def mapper(k, v_list):
    for v in v_list:
        if criteria:
            write to HDFS
        else:
            emit
I find this hard because the only thing we can play with is the OutputCollector.
One thing I can think of is to extend OutputCollector, override OutputCollector.collect, and do the work there.
Is there any better way?
You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.
From the Map-Reduce manual: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem,
into the output path set by setOutputPath(Path). The framework does not sort
the map-outputs before writing them out to the FileSystem.
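If you are writing the job in Java, a minimal map-only sketch using the newer mapreduce API (my own example; JobConf.setNumReduceTasks(0) mentioned above is the old-API equivalent) looks like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only job: with zero reducers, whatever the mapper writes via
// context.write() lands directly in the output path on HDFS, unsorted.
public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(Mapper.class); // the base Mapper is an identity mapper
        job.setNumReduceTasks(0);         // no reduce phase, no shuffle
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

For a streaming job, the same effect is usually achieved by passing -D mapred.reduce.tasks=0 on the command line.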
I'm assuming that you're using streaming, in which case there is no standard way of doing this.
It's certainly possible in a Java Mapper. For streaming you'd need to amend the PipeMapper Java file or, as you say, write your own output collector; but if you're going to that much trouble, you might as well just write a Java mapper.
Not sending something to the reducer may not actually save bandwidth if you are still going to write it to HDFS: data written to HDFS is replicated to other nodes anyway, so that traffic happens regardless.
There are other good reasons to write output from the mapper, though. There is a FAQ about this, but it is a little short on details except to say that you can do it.
I found another question which is potentially a duplicate of yours here. That question has answers that are more helpful if you are writing a Mapper in Java. If you are trying to do this in a streaming job, you can just use the hadoop fs commands in scripts to do it.
We can in fact write output to HDFS from the mapper and pass data on to the reducer at the same time. I understand that you are using Hadoop Streaming; I've implemented something similar using Java MapReduce.
We can generate named output files from a Mapper or Reducer using MultipleOutputs. So, in your Mapper implementation, after all the business logic for processing the input data, you can write to MultipleOutputs using multipleOutputs.write("NamedOutputFileName", outputKey, outputValue), and for the data you want to pass on to the reducer you can write to the context using context.write(outputKey, outputValue), as sketched below.
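Here is a rough Java sketch of that pattern. The class name, the "sideOutput" name and the criterion are placeholders of mine; the named output also has to be registered in the driver with MultipleOutputs.addNamedOutput(job, "sideOutput", TextOutputFormat.class, Text.class, Text.class).

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Records matching the criterion go straight to a named output file;
// everything else is passed on to the reducer as usual.
public class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (meetsCriteria(value)) {
            // Written directly to files named sideOutput-m-*, bypassing the reducer.
            mos.write("sideOutput", new Text("done"), value);
        } else {
            // Normal path: shuffled and sorted, then handed to the reducer.
            context.write(new Text("pending"), value);
        }
    }

    private boolean meetsCriteria(Text value) {
        return value.toString().startsWith("#"); // placeholder criterion
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush the side output files
    }
}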
I think that if you can find a way to write data from the mapper to a named output file in the language you are using (e.g. Python), this will definitely work.
I hope this helps.

Can Apache Pig load data from STDIN instead of a file?

I want to use Apache Pig to transform/join data in two files, but I want to implement it step by step, which means testing it on real data, but with a small size (10 lines, for example). Is it possible to have Pig read from STDIN and output to STDOUT?
Basically, Hadoop supports streaming in various ways, but Pig originally lacked support for loading data through streaming. However, there are some solutions.
You can check out HStreaming:
A = LOAD 'http://myurl.com:1234/index.html' USING HStream('\n') AS (f1, f2);
The answer is no. The data needs to be out on the cluster's data nodes before any MR job can even run over it.
However, if you are using a small sample of data and just want to do something simple, you could use Pig in local mode: write stdin to a local file and run your script over it.
But the bigger question is why you want to use MR/Pig on a stream of data at all; it was not, and is not, intended for that type of use.

Hadoop streaming and AMAZON EMR

I have been attempting to use Hadoop Streaming in Amazon EMR to do a simple word count over a bunch of text files. To get a handle on Hadoop Streaming and on Amazon's EMR, I chose a very simplified data set. Each text file had only one line of text in it (the line could contain an arbitrarily large number of words).
The mapper is an R script that splits the line into words and spits them back to the stream.
cat(wordList[i],"\t1\n")
I decided to use the LongValueSum aggregate reducer to add the counts together, so I had to prefix my mapper output with LongValueSum:
cat("LongValueSum:",wordList[i],"\t1\n")
and specify the reducer to be "aggregate".
The questions I have now are the following:
The intermediate stage between mapper and reducer just sorts the stream. It does not really combine by the keys. Am I right? I ask this because if I do not use "LongValueSum" as a prefix to the words output by the mapper, at the reducer I just receive the streams sorted by the keys, but not aggregated. That is, I just receive values ordered by K, as opposed to (K, list(values)), at the reducer. Do I need to specify a combiner in my command?
How are other aggregate reducers used? I see a lot of other reducers/aggregates/combiners available at http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html
How are these combiners and reducers specified in an Amazon EMR setup?
I believe an issue of this kind has been filed and fixed in Hadoop Streaming for a combiner, but I am not sure what version Amazon EMR is hosting, or in which version the fix is available.
What about custom input formats and record readers and writers? There are a bunch of libraries written in Java. Is it sufficient to specify the Java class name for each of these options?
The intermediate stage between mapper and reducer just sorts the stream. It does not really combine by the keys. Am I right?
The aggregate reducer in streaming does implement the relevant combiner interfaces, so Hadoop will use it if it sees fit [1].
That is, I just receive values ordered by K, as opposed to (K, list(values)), at the reducer.
With the streaming interface you always receive (K, V) pairs; you'll never receive (K, list(values)).
How are other aggregate reducers used?
Which of them are you unsure about? The link you specified has a quick summary of the behaviour of each.
I believe an issue of this kind has been filed and fixed
What issue are you thinking of?
not sure what version AMAZON EMR is hosting
EMR is based on Hadoop 0.20.2
Is it sufficient to specify the java class name for each of these options?
Do you mean in the context of streaming, or in the aggregate framework?
[1] http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html
