StreamInputFormat for mapreduce job - hadoop

I have an application that connects to a remote system and transfers data from it using the sftp protocol. I want to use a mapreduce job to do the same. I would need an input format that reads from an input stream. I have been going through the docs for HStreamInputFormat and StreamInputFormat, but my hadoop-2.0 doesn't seem to support these classes. How do I proceed? Any links to tutorials or examples of reading from input streams using input formats?

If those StreamInputFormats don't support your need, then you are better off writing your own InputFormat tailored to your requirements. Please read this tutorial to learn how to write your own custom InputFormat and RecordReader.
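To give an idea of the shape of such a class, here is a minimal sketch in the new mapreduce API. Everything sftp-specific is deliberately left out: openSftpStream() is a hypothetical helper you would implement yourself (for example with JSch), and the single-split layout is only to keep the sketch short.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class SftpStreamInputFormat extends InputFormat<LongWritable, Text> {

        @Override
        public List<InputSplit> getSplits(JobContext context) {
            // A single split keeps the sketch short; one split per remote file is more realistic.
            return Collections.<InputSplit>singletonList(new SftpSplit());
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                TaskAttemptContext context) {
            return new SftpRecordReader();
        }

        /** Trivial split: host, path and credentials come from the job configuration. */
        public static class SftpSplit extends InputSplit implements org.apache.hadoop.io.Writable {
            @Override public long getLength() { return 0; }
            @Override public String[] getLocations() { return new String[0]; }
            @Override public void write(java.io.DataOutput out) {}
            @Override public void readFields(java.io.DataInput in) {}
        }

        public static class SftpRecordReader extends RecordReader<LongWritable, Text> {
            private BufferedReader reader;
            private long lineNo = 0;
            private final LongWritable key = new LongWritable();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
                // Hypothetical helper: open an InputStream over sftp using settings taken from
                // context.getConfiguration(); implement it with JSch or a similar client.
                InputStream in = openSftpStream(context);
                reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                String line = reader.readLine();
                if (line == null) {
                    return false;
                }
                key.set(lineNo++);
                value.set(line);
                return true;
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() { return 0; } // stream length unknown
            @Override public void close() throws IOException { if (reader != null) reader.close(); }

            private InputStream openSftpStream(TaskAttemptContext context) throws IOException {
                throw new UnsupportedOperationException("plug in your sftp client (e.g. JSch) here");
            }
        }
    }

The driver would then use it with job.setInputFormatClass(SftpStreamInputFormat.class).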

Related

hadoop input format for hadoop streaming: the Wikihadoop InputFormat

I wonder whether there are any differences between the InputFormats for Hadoop and Hadoop streaming. Do the InputFormats for Hadoop streaming also work for plain Hadoop, and vice versa?
I am asking because I found a special InputFormat for the Wikipedia dump files, the WikiHadoop InputFormat, and it is described as an InputFormat for Hadoop streaming. Why only for Hadoop streaming, and not for Hadoop?
Best
As far as I know, there is no difference in how inputs are processed between Hadoop streaming jobs and regular MapReduce jobs written in Java.
The inheritance tree for StreamWikiDumpInputFormat is...
* InputFormat
* FileInputFormat
* KeyValueTextInputFormat
* StreamWikiDumpInputFormat
And since it ultimately implements InputFormat, it can be used in regular MapReduce jobs.
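For illustration, a plain Java driver can plug it in like any other InputFormat. The sketch below is hedged: it assumes the WikiHadoop jar is on the classpath, that the class lives in the org.wikimedia.wikihadoop package, that it emits Text keys and values, and that it targets the old org.apache.hadoop.mapred API (the one Hadoop streaming uses); adjust to whatever the jar actually exposes.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.wikimedia.wikihadoop.StreamWikiDumpInputFormat; // assumed package name

    public class WikiDumpJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WikiDumpJob.class);
            conf.setJobName("wiki-dump");

            // The "streaming" input format is just an InputFormat; set it like any other.
            conf.setInputFormat(StreamWikiDumpInputFormat.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);

            conf.setMapperClass(IdentityMapper.class); // stand-in for your own mapper
            conf.setNumReduceTasks(0);                 // map-only, just for the example

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }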
No. The type of MR job (streaming or Java) is not the criterion for using (or developing) an InputFormat. An InputFormat is just an InputFormat and will work for both streaming and Java MR jobs. It is the type of data you are going to process that determines which InputFormat you use (or develop). Hadoop natively provides different types of InputFormats, which are normally sufficient to handle your needs, but sometimes your data is in such a state that none of them can handle it.
Having said that, it is still possible to process that data with MR, and this is where you end up writing your own custom InputFormat, like the one you mentioned above.
I don't know why they have emphasized Hadoop streaming so much. It's just a Java class that does everything an InputFormat should do and implements everything required to be one. #climbage makes a very valid point about this: it can be used with any MR job, streaming or Java.
There is no difference between the usual input formats and the ones that were developed for Hadoop streaming.
When the author says that the format was developed for Hadoop streaming, the only thing she means is that her input format produces objects with meaningful toString methods. That's it.
For example, when I develop an input format for use in Hadoop streaming, I try to avoid binary Writables and use Text instead.
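To see why that matters: roughly speaking, Hadoop streaming pipes key.toString() and value.toString(), separated by a tab, to your script's stdin, so a type with a human-readable toString() is what actually survives the pipe. A tiny comparison, using BytesWritable as a stand-in for a binary type:

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;

    public class ToStringDemo {
        public static void main(String[] args) {
            Text text = new Text("id=42\tname=foo");
            BytesWritable bytes = new BytesWritable("id=42\tname=foo".getBytes());

            System.out.println(text);   // id=42  name=foo   -> easy for a streaming script to parse
            System.out.println(bytes);  // space-separated hex bytes -> hard to use downstream
        }
    }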

working with big scientific data on Hadoop

I am currently starting a project titled "Cloud computing for time series mining algorithms using Hadoop".
The data I have is HDF files with a total size of over a terabyte. As far as I know, Hadoop expects text files as input for further processing (map-reduce tasks). So one option is to convert all my .hdf files to text files, which is going to take a lot of time.
The other is to find a way to use the raw HDF files in map-reduce programs.
So far I have not been successful in finding any Java code that reads HDF files and extracts data from them.
If somebody has a better idea of how to work with HDF files, I would really appreciate the help.
Thanks
Ayush
Here are some resources:
SciHadoop (uses NetCDF but might already be extended to HDF5).
You can either use JHDF5 or the lower level official Java HDF5 interface to read out data from any HDF5 file in the map-reduce task.
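As a rough illustration of that route, a mapper could copy one HDF5 file to the task's local disk and read a dataset with JHDF5. This is only a sketch under assumptions: the JHDF5 calls (HDF5Factory.openForReading, readDoubleArray) should be checked against the version you use, the dataset path /data/series is a placeholder, and the input is assumed to be a list of HDFS paths, one per record.

    import java.io.File;
    import java.io.IOException;

    import ch.systemsx.cisd.hdf5.HDF5Factory;
    import ch.systemsx.cisd.hdf5.IHDF5Reader;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class Hdf5Mapper extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input record is assumed to be the HDFS path of one HDF5 file.
            Path hdfsPath = new Path(value.toString());
            File local = File.createTempFile("chunk", ".h5");
            FileSystem.get(context.getConfiguration())
                      .copyToLocalFile(hdfsPath, new Path(local.getAbsolutePath()));

            IHDF5Reader reader = HDF5Factory.openForReading(local);
            try {
                double[] series = reader.readDoubleArray("/data/series"); // placeholder dataset
                for (int i = 0; i < series.length; i++) {
                    context.write(new LongWritable(i), new DoubleWritable(series[i]));
                }
            } finally {
                reader.close();
                local.delete();
            }
        }
    }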
For your first option, you could use a conversion tool like HDF dump to dump the HDF files to text format. Otherwise, you can write a program using a Java library for reading HDF files and write the data out to text files.
For your second option, SciHadoop is a good example of how to read scientific datasets from Hadoop. It uses the NetCDF-Java library to read NetCDF files. Hadoop does not support the POSIX API for file I/O, so SciHadoop uses an extra software layer to translate the POSIX calls of the NetCDF-Java library into HDFS (Hadoop) API calls. If SciHadoop does not already support HDF files, you may have to go down a slightly harder path and develop a similar solution yourself.
If you cannot find any Java code and can do it in another language, then you can use Hadoop streaming.
SciMATE (http://www.cse.ohio-state.edu/~wayi/papers/SciMATE.pdf) is a good option. It is built on a variant of MapReduce, which has been shown to run many scientific applications much more efficiently than Hadoop.

how to output to HDFS from mapper directly?

For records matching certain criteria we want the mapper to do all the work and output to HDFS directly; we don't want that data transmitted to the reducer (it would use extra bandwidth; please correct me if there is a case where that's wrong).
Pseudo code would be:
def mapper(k, v_list):
    for v in v_list:
        if criteria:
            write to HDFS
        else:
            emit
I find this hard because the only thing we can play with is the OutputCollector. One thing I can think of is to extend OutputCollector, override OutputCollector.collect, and do the work there. Are there any better ways?
You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.
From the Map-Reduce manual: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired. In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
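A minimal map-only driver, using the new-API equivalent of that setting, might look like the sketch below; the identity Mapper is just a stand-in for your own mapper class.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(MapOnlyJob.class);
            job.setMapperClass(Mapper.class);  // identity pass-through; swap in your own mapper
            job.setNumReduceTasks(0);          // no reduce phase: map output becomes the job output
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }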
I'm assuming that you're using streaming, in which case there is no standard way of doing this.
It's certainly possible in a Java Mapper. For streaming you'd need to amend the PipeMapper Java file or, as you say, write your own output collector - but if you're going to that much trouble, you might as well just write a Java mapper.
Not sending something to the reducer may not actually save bandwidth if you are still going to write it to HDFS: HDFS blocks are replicated to other nodes, and that replication traffic happens either way.
There are other good reasons to write output from the mapper though. There is a FAQ about this, but it is a little short on details except to say that you can do it.
I found another question that is potentially a duplicate of yours here. Its answers are more helpful if you are writing a Mapper in Java. If you are trying to do this in a streaming way, you can just use the hadoop fs commands in your scripts to do it.
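For the Java route, one common pattern is to write to HDFS directly through the FileSystem API, independently of the normal map output. A hedged sketch: the output directory /tmp/side-output and the criteria are placeholders, and each task writes its own file to avoid clashes.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DirectHdfsMapper extends Mapper<LongWritable, Text, Text, Text> {
        private FSDataOutputStream sideFile;

        @Override
        protected void setup(Context context) throws IOException {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            // One file per task attempt so concurrent mappers never write to the same path.
            Path out = new Path("/tmp/side-output/" + context.getTaskAttemptID()); // placeholder dir
            sideFile = fs.create(out);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().startsWith("#")) {           // placeholder criteria
                sideFile.writeBytes(value.toString() + "\n"); // written straight to HDFS
            } else {
                context.write(new Text(""), value);           // normal map output
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            sideFile.close();
        }
    }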
We can in fact write output to HDFS and pass it on to the reducer at the same time. I understand that you are using Hadoop streaming; I've implemented something similar using Java MapReduce.
We can generate named output files from a Mapper or Reducer using MultipleOutputs. So, in your Mapper implementation, after all the business logic for processing the input data, you can write the output that should skip the reducer with multipleOutputs.write("NamedOutputFileName", OutputKey, OutputValue), and write the data you want to pass on to the reducer to the context with context.write(OutputKey, OutputValue).
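A condensed sketch of that idea in the new API is below. The named output "filtered" and the criteria are placeholders, and the named output also has to be declared in the driver, as the trailing comment shows.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().startsWith("#")) {          // placeholder criteria
                // Goes to files named filtered-m-* under the job output dir, bypassing the reducer.
                mos.write("filtered", new Text(""), value);
            } else {
                context.write(new Text(""), value);          // normal path: shuffled to the reducer
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();  // important, otherwise the named output files may end up empty
        }
    }
    // In the driver, declare the named output once:
    // MultipleOutputs.addNamedOutput(job, "filtered", TextOutputFormat.class, Text.class, Text.class);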
I think if you can find something that writes data from the mapper to a named output file in the language you are using (e.g. Python), this will definitely work.
I hope this helps.

Is it possible to use Pig streaming (StreamToPig) in a way that handles multiple lines as a single input tuple?

I'm streaming data in a Pig script through an executable that returns an XML fragment for each line of input I stream to it. That XML fragment happens to span multiple lines, and I have no control whatsoever over the output of the executable I stream to.
In relation to 'Use Hadoop Pig to load data from text file w/ each record on multiple lines?', the answer suggested writing a custom record reader. The problem is that this works fine if you want to implement a LoadFunc that reads from a file, but to be able to use streaming it has to implement StreamToPig, and StreamToPig only lets you read one line at a time, as far as I understood.
Does anyone know how to handle such a situation?
If you are absolutely sure about that, then one option is to manage it internally in the streaming solution. That is to say, you build up the tuple yourself, and when you hit whatever your desired size is, you do the processing and return a value. In general, eval funcs in Pig have this issue.
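One way to read that suggestion is to do the grouping inside the streamed pipeline itself, so that each complete XML fragment reaches Pig as a single line. The sketch below is a simple stdin-to-stdout filter along those lines; the closing tag </record> is a placeholder for whatever root element your executable actually emits.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class JoinXmlFragments {
        public static void main(String[] args) throws IOException {
            String closingTag = args.length > 0 ? args[0] : "</record>"; // placeholder root element
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
            StringBuilder fragment = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                fragment.append(line.trim()).append(' ');
                if (line.contains(closingTag)) {
                    // Emit the whole fragment as one line so a line-oriented reader sees one record.
                    System.out.println(fragment.toString().trim());
                    fragment.setLength(0);
                }
            }
        }
    }

The executable's output can then be piped through a filter like this before it reaches Pig.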

Can Apache Pig load data from STDIN instead of a file?

I want to use Apache Pig to transform/join data in two files, but I want to implement it step by step, which means testing it on real data but with a small size (10 lines, for example). Is it possible to have Pig read from STDIN and output to STDOUT?
Basically, Hadoop supports streaming in various ways, but Pig originally lacked support for loading data through streaming. However, there are some solutions.
You can check out HStreaming:
A = LOAD 'http://myurl.com:1234/index.html' USING HStream('\n') AS (f1, f2);
The answer is no. The data needs to be out in the cluster on data nodes before any MR job can even run over the data.
However, if you are using a small sample of data and just want to do something simple, you could use Pig in local mode: write stdin to a local file and run it through your script.
But the bigger question is why you want to use MR/Pig on a stream of data; it was not, and is not, intended for that type of use.
