Hadoop Streaming Job with no input file

Is it possible to execute a Hadoop Streaming job that has no input file?
In my use case, I'm able to generate the necessary records for the reducer with a single mapper and execution parameters. Currently, I'm using a stub input file with a single line; I'd like to remove this requirement.
We have two use cases in mind.
1) I want to distribute the loading of files into HDFS from a network location available to all nodes. Basically, I'm going to run ls in the mapper and send the output to a small set of reducers.
2) We are going to be running fits over several different parameter ranges against several models. The model names do not change and will go to the reducer as keys, while the list of tests to run is generated in the mapper.

According to the docs, this is not possible. The following parameters are required for execution:
input directoryname or filename
output directoryname
mapper executable or JavaClassName
reducer executable or JavaClassName
It looks like providing a dummy input file is the way to go currently.
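For reference, a minimal sketch of that workaround (the streaming jar location varies by installation, and generate_records.sh, run_fit.sh, and all paths below are hypothetical placeholders for your own mapper, reducer, and HDFS locations):

# one-line stub file to satisfy the required -input option
echo "dummy" > stub.txt
hadoop fs -put stub.txt /user/me/stub.txt

# the mapper ignores its stdin and generates the records itself
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/me/stub.txt \
    -output /user/me/job-output \
    -mapper generate_records.sh \
    -reducer run_fit.sh \
    -file generate_records.sh \
    -file run_fit.sh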

Related

Hadoop streaming with multiple input files

I want to build an inverted index from a set of files with Hadoop, using the Streaming API. The documentation always refers to an input file whose lines are fed to the mapper as entries. But in this case I have multiple input files, and I need the mappers to process only one file at a time. Is there a way to accomplish that? For preprocessing reasons I need the input to be like this, and I cannot have the input in the classic line = key, value format that the documentation refers to.
By default a mapper only processes one file, unless you use an input class that allows combining inputs, such as CombineFileInputFormat.
So if you have 10 files, you will end up with 10 mappers, and each of them will process only one file. If you are only using mappers (no reducers), that will end up in 10 output files (one for each mapper).
On the other hand, if you have big enough splittable files, it is possible for one file to be processed by several mappers at the same time.
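If you also need to guarantee one mapper per file when the files are larger than a block, one common trick (a sketch only; the property below is the pre-2.x name mapred.min.split.size, later renamed mapreduce.input.fileinputformat.split.minsize, and the jar path and scripts are placeholders) is to raise the minimum split size above any file size so files are never split:

# Long.MAX_VALUE as the minimum split size keeps each file in a single split,
# so each file is handled by exactly one mapper
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.min.split.size=9223372036854775807 \
    -input /user/me/docs \
    -output /user/me/inverted-index \
    -mapper "python index_mapper.py" \
    -reducer "python index_reducer.py" \
    -file index_mapper.py \
    -file index_reducer.py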

How to control the number of hadoop streaming output files

Here are the details:
The input files are in the HDFS path /user/rd/input, and the HDFS output path is /user/rd/output.
In the input path there are 20,000 files, from part-00000 to part-19999, each about 64 MB.
What I want to do is write a hadoop streaming job to merge these 20,000 files into 10,000 files.
Is there a way to merge these 20,000 files into 10,000 files using a hadoop streaming job? Or, in other words, is there a way to control the number of hadoop streaming output files?
Thanks in advance!
It looks like right now you have a map-only streaming job. The behavior with a map-only job is to have one output file per map task. There isn't much you can do about changing this behavior.
You can exploit the way MapReduce works by adding a reduce phase with 10,000 reducers. Each reducer will then output one file, so you are left with 10,000 files. Note that your data records will be "scattered" across the 10,000 files; each output file won't simply be two input files concatenated. To do this, use the -D mapred.reduce.tasks=10000 flag in your command line args.
This is probably the default behavior, but you can also specify the identity reducer as your reducer. This doesn't do anything other than pass on the record, which is what I think you want here. Use this flag to do this: -reducer org.apache.hadoop.mapred.lib.IdentityReducer
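Putting the two suggestions together, the command would look roughly like this (the streaming jar path is installation-specific; /bin/cat as the mapper just passes every record through unchanged):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=10000 \
    -input /user/rd/input \
    -output /user/rd/output \
    -mapper /bin/cat \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer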

Processing logs in Amazon EMR with or without using Hive

I have a lot of log files in my EMR cluster at the path 'hdfs:///logs'. Each log entry spans multiple lines, but there are starting and ending markers to demarcate between two entries.
Now,
a. not all entries in a log file are useful, and
b. the useful entries need to be transformed, and the output needs to be stored in an output file so that I can efficiently query (using Hive) the output logs later.
I have a python script which can take a log file and do parts a. and b. mentioned above, but I have not written any mappers or reducers.
Hive takes care of mappers and reducers for its queries. Please tell me if and how it is possible to use the python script to run over all the logs and save the output in 'hdfs:///outputlogs'?
I am new to MapReduce and have seen some examples of word count, but all of them have a single input file. Where can I find examples that have multiple input files?
I see that you have a two-fold issue here:
Having more than one file as input
The same word count example will work if you pass in more than one file as input. In fact, you can very easily pass a folder name as input instead of a file name, in your case hdfs:///logs.
You may even pass a comma-separated list of paths as input. For this, instead of using the following:
FileInputFormat.setInputPaths(conf, new Path(args[0]));
you may use the following:
FileInputFormat.setInputPaths(conf, args[0]);
Note that passing a comma-separated list of paths as args[0] will be sufficient.
How to convert your logic to mapreduce
This does have a steep learning curve, as you will need to think in terms of keys and values. But I feel that you can just have all the logic in the mapper itself and use an IdentityReducer, like this:
conf.setReducerClass(IdentityReducer.class);
If you spend some time reading examples from the following locations, you should be in a better position to make these decisions:
hadoop-map-reduce-examples ( http://hadoop-map-reduce-examples.googlecode.com/svn/trunk/hadoop-examples/src/ )
http://developer.yahoo.com/hadoop/tutorial/module4.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
The long-term correct way to do this, as Amar stated, is to write a MapReduce job for it.
However, if this is a one-time thing, and the data isn't too enormous, it might be simplest/easiest to do this with a simple bash script since you already have the python script:
hadoop fs -text /logs/* > input.log
python myscript.py input.log output.log
hadoop fs -copyFromLocal output.log /outputlogs
rm -f input.log output.log
If this is a repeated process - something you want to be reliable and efficient - or if you just want to learn to use MapReduce better, then stick with Amar's answer.
If you have the logic already written, and you want to do parallel processing using EMR and/or vanilla Hadoop, you can use Hadoop streaming: http://hadoop.apache.org/docs/r0.15.2/streaming.html. In a nutshell, your script, which takes data on stdin and writes output to stdout, can become a mapper.
That way you run the processing of the data in HDFS on the cluster, without needing to repackage your code.
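As a rough sketch of what that could look like here (process_logs.py stands in for your existing script, the jar path is installation-specific, and on EMR you would configure the same options as a streaming step), a map-only streaming job over the whole log directory might be launched like this. Note that with the default TextInputFormat the script receives individual lines on stdin, so it must reassemble multi-line entries using your start/end markers:

# mapred.reduce.tasks=0 makes it a map-only job, so the transformed
# records land directly in hdfs:///outputlogs
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=0 \
    -input hdfs:///logs \
    -output hdfs:///outputlogs \
    -mapper "python process_logs.py" \
    -file process_logs.py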

Writing to single file from mappers

I am working on a mapreduce job that generates a CSV file out of some data read from HBase. Is there a way to write to a single file from the mappers without a reduce phase (or to merge the multiple files generated by the mappers at the end of the job)? I know that I can set the output format to write to a file at the Job level; is it possible to do a similar thing for mappers?
Thanks
It is possible (and not uncommon) to have a Map/Reduce job without a reduce phase. For that you just use job.setNumReduceTasks(0).
However, I am not sure how job output is handled in this case. Usually you get one result file per reducer. Without reducers, I could imagine that you either get one file per mapper or that you cannot produce job output. You will have to try/research that.
If the above does not work for you, you could still use the default Reducer implementation, which just forwards the mapper output (identity function).
Seriously, this is not how MapReduce works.
Why do you even need a Job for that? Write a simple Java application that does the same for you. There are also command-line utilities that do the same.
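For instance (a sketch, not from the original answers; paths are illustrative), hadoop fs -getmerge concatenates every part file under an HDFS directory into one local file, which covers the "merge the mapper outputs at the end of the job" case:

# pull all part-* files from the map-only job's output directory into one local CSV
hadoop fs -getmerge /user/me/csv-output merged.csv
# optionally push the merged file back into HDFS
hadoop fs -copyFromLocal merged.csv /user/me/merged.csv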

Correlating input files to output files

I have an MR streaming job. My code is in C++. It's a mapper-only job, with no reducer. The input to the job is a directory containing three files. The job creates 3 mappers. Each mapper processes one input file and produces one output file in a different format.
Input files are like:
MyDir/file1
MyDir/file2
MyDir/file3
Output files are like:
MyDir/Output/part-00000
MyDir/Output/part-00001
MyDir/Output/part-00002
I want to correlate input files to output files. For example, input file MyDir/file1 may correspond to output file MyDir/Output/part-00002, i.e., the mapper that processed input file MyDir/file1 may have produced output file MyDir/Output/part-00002.
I want to know this relationship, i.e., which input file corresponds to which output file. Is there a simple way to know this?
One way I can think of is to have the i/p and o/p file names of the job be the same. Get the input file name (the map.input.file configuration property) which the mapper is processing, and then use it in the MultipleOutputFormat#generateFileNameForKeyValue method.
With how Hadoop is designed, the only relationship that you can rely on, without you expressly naming the output files as per the other answer, is that the number of output files corresponds to the number of final tasks being run, usually reducers (mappers in your case, since you're not running any reducers).
If Hadoop later decides to run more mappers/reducers instead of just 3 (larger input files, more nodes available), you'll get a different number of output files.
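Since this is a streaming job, another option is a sketch like the following, assuming an older streaming release: streaming exports job configuration properties to the task environment with dots replaced by underscores, so map.input.file is visible as the environment variable map_input_file (newer releases expose it as mapreduce_map_input_file). A hypothetical wrapper script shipped as the mapper can emit the input file name into its own output, which lets you match each part-NNNNN file back to the input it came from:

#!/bin/bash
# wrapper_mapper.sh -- hypothetical wrapper around the real C++ mapper binary
# emit the input file this map task is reading, then run the real mapper unchanged
echo "INPUT_FILE=${map_input_file}"
exec ./my_cpp_mapper

You would ship both files with -file and pass -mapper wrapper_mapper.sh; since the job is map-only, the marker line ends up in the corresponding output file.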
