Hadoop - sharing files between multiple jobs in a chain

I have written a map-reduce application that consists of two map-reduce phases.
binary input file -> m1-> r1 -> m2 -> r2 -> text output
The input file to my application contains a small chunk of data (<1k) that is needed by the second reducer (r2). I have written a custom record reader that extracts this data, but then how do I pass this along to the next job? It seems like this is a job for DistributedCache, but it appears that DistributedCache cache files are scoped to a single job's scratch space. What is the best way to share small data between different jobs in the same chain?

Try hadoop with the -files option.
I had a similar issue in the past and the -files option worked for me.
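For example, roughly like this, just as a sketch: assume the first job writes the extracted chunk out to a small HDFS file (the path, file, and class names below are made up, and this uses the newer org.apache.hadoop.mapreduce API). The second job's driver then ships that file to every task, which is what -files does on the command line, and r2 reads it in setup():

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class SecondJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "phase-2");
        // Same effect as -files on the command line: the file is copied to each
        // task and exposed under the name after the '#'.
        job.addCacheFile(new URI("/user/me/chain/side-data.txt#side-data.txt"));
        // ... set mapper m2, reducer R2, input/output paths, key/value classes ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

class R2 extends Reducer<Text, Text, Text, Text> {
    private String sideData;

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file shows up in the task's working directory under its link name.
        try (BufferedReader in = new BufferedReader(new FileReader("side-data.txt"))) {
            sideData = in.readLine();
        }
    }
    // reduce() can then use sideData alongside the shuffled records.
}

With the older API the same thing is done through DistributedCache.addCacheFile on the JobConf; either way the small file outlives the first job because it sits in a normal HDFS path rather than in that job's scratch space.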

Related

Single or multiple files per mapper in hadoop?

Does a mapper process multiple files at the same time, or can a mapper only process a single file at a time? I want to know the default behaviour.
Typical MapReduce jobs use one input split per mapper by default.
If the file size is larger than the split size (i.e., the file has more than one input split), then there are multiple mappers per file.
It is one file per mapper if the file is not splittable, like a gzip file, or if the process is DistCp, where a file is the finest level of granularity.
If you look at the definition of FileInputFormat, you will see that it has three methods at the top:
addInputPath(JobConf conf, Path path) - adds a Path to the list of inputs for the map-reduce job. It will pick up all the files in the directory, not just a single one.
addInputPathRecursively(List result, FileSystem fs, Path path, PathFilter inputFilter) - adds files in the input path recursively into the results.
addInputPaths(JobConf conf, String commaSeparatedPaths) - adds the given comma-separated paths to the list of inputs for the map-reduce job.
Using these three methods you can easily set up any combination of inputs you want. The InputSplits of your InputFormat then divide this data among the mapper tasks. The Map-Reduce framework relies on the InputFormat of the job to:
Validate the input-specification of the job.
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
So technically a single mapper will process only its own part, which can contain data from several files. But for each particular format you should look at its InputSplit implementation to understand how the data will be distributed across the mappers.
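Put together, a driver that uses these methods could look roughly like this (only a sketch, written against the old mapred API that the signatures above belong to; the paths are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class MultiInputDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultiInputDriver.class);
        conf.setJobName("multi-input-example");

        // One directory: every file under it becomes input.
        FileInputFormat.addInputPath(conf, new Path("/data/logs/2023-01-01"));
        // Several paths at once, comma separated.
        FileInputFormat.addInputPaths(conf, "/data/logs/2023-01-02,/data/logs/2023-01-03");

        conf.setInputFormat(TextInputFormat.class);   // decides how the files become InputSplits
        FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        // conf.setMapperClass(...); conf.setReducerClass(...);

        JobClient.runJob(conf);
    }
}

The number of map tasks then depends on the total number of InputSplits computed over everything that was added, not on how many times addInputPath was called.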

How does MapReduce process multiple input files?

So I'm writing an MR job to read hundreds of files from an input folder. Since all the files are compressed, instead of using the default TextInputFormat I am using the WholeFileReadFormat from an online code source.
So my question is: does the Mapper process multiple input files in sequence? I mean, if I have three files A, B, and C, and since I'm reading the whole file content as the map input value, will MapReduce process the files in the order of, say, A -> B -> C, meaning only after finishing with A will the Mapper start to process B?
Actually, I'm kind of confused about the concepts of map job and map task. In my understanding a map job is the same thing as a Mapper, and a map job contains several map tasks; in my case, each map task will read in a single file. But what I don't understand is that I think map tasks are executed in parallel, so all the input files should be processed in parallel, which seems to be a paradox...
Can anyone please explain it to me?

Hadoop Streaming Job with no input file

Is it possible to execute a Hadoop Streaming job that has no input file?
In my use case, I'm able to generate the necessary records for the reducer with a single mapper and execution parameters. Currently, I'm using a stub input file with a single line; I'd like to remove this requirement.
We have two use cases in mind.
1) I want to distribute the loading of files into HDFS from a network location available to all nodes. Basically, I'm going to run ls in the mapper and send the output to a small set of reducers.
2) We are going to be running fits leveraging several different parameter ranges against several models. The model names do not change and will go to the reducer as keys, while the list of tests to run is generated in the mapper.
According to the docs this is not possible. The following are required parameters for execution:
input directoryname or filename
output directoryname
mapper executable or JavaClassName
reducer executable or JavaClassName
It looks like providing a dummy input file is the way to go currently.

How to control the number of hadoop streaming output files

Here is the detail:
The input files are in the hdfs path /user/rd/input, and the hdfs output path is /user/rd/output.
In the input path, there are 20,000 files from part-00000 to part-19999, each file is about 64MB.
What I want to do is to write a hadoop streaming job to merge these 20,000 files into 10,000 files.
Is there a way to merge these 20,000 files into 10,000 files using a hadoop streaming job? Or, in other words, is there a way to control the number of hadoop streaming output files?
Thanks in advance!
It looks like right now you have a map-only streaming job. The behavior with a map-only job is to have one output file per map task. There isn't much you can do about changing this behavior.
You can exploit the way MapReduce works by adding the reduce phase so that it has 10,000 reducers. Then, each reducer will output one file, so you are left with 10,000 files. Note that your data records will be "scattered" across the 10,000... it won't be just two files concatenated. To do this, use the -D mapred.reduce.tasks=10000 flag in your command line args.
This is probably the default behavior, but you can also specify the identity reducer as your reducer. This doesn't do anything other than pass on the record, which is what I think you want here. Use this flag to do this: -reducer org.apache.hadoop.mapred.lib.IdentityReducer

Variable/looping sequence of jobs

I'm considering using hadoop/mapreduce to tackle a project and haven't quite figured out how to set up a job flow consisting of a variable number of levels that should be processed in sequence.
E.g.:
Job 1: Map source data into X levels.
Job 2: MapReduce Level1 -> appends to Level2
Job 3: MapReduce Level2 -> appends to Level3
Job N: MapReduce LevelN -> appends to LevelN+1
And so on until the final level. The key is that each level must include its own specific source data as well as the results of the previous level.
I've looked at Pig, Hive, hamake, and Cascading, but have yet to see clear support for something like this.
Does anyone know an efficient way of accomplishing this? Right now I'm leaning towards writing a wrapper for hamake that will generate the hamake file based on parameters (the number of levels is known at runtime but could change with each run).
Thanks!
Oozie (http://yahoo.github.com/oozie/) is an open-source server that Yahoo released to manage Hadoop & Pig workflows like the one you are describing.
Cloudera includes it in their latest distro, with very good documentation: https://wiki.cloudera.com/display/DOC/Oozie+Installation
Here is a video from Yahoo: http://sg.video.yahoo.com/watch/5936767/15449686
You should be able to generate the Pig code for this pretty easily using Piglet, the Ruby Pig DSL:
http://github.com/iconara/piglet
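If you do end up writing your own driver instead, the flow described in the question can also be expressed as a plain Java loop that chains one Job per level. The following is only a sketch, with a placeholder level count, placeholder paths, and the level-specific mapper/reducer classes left as comments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LevelChainDriver {
    public static void main(String[] args) throws Exception {
        int levels = Integer.parseInt(args[0]);   // known at runtime, may change per run
        Configuration conf = new Configuration();

        for (int n = 1; n < levels; n++) {
            Job job = Job.getInstance(conf, "level-" + n);
            job.setJarByClass(LevelChainDriver.class);
            // Each job reads that level's own source data...
            FileInputFormat.addInputPath(job, new Path("/source/level" + n));
            // ...plus the results written by the previous level's job.
            if (n > 1) {
                FileInputFormat.addInputPath(job, new Path("/levels/level" + n));
            }
            FileOutputFormat.setOutputPath(job, new Path("/levels/level" + (n + 1)));
            // job.setMapperClass(...); job.setReducerClass(...);  // level-specific classes
            job.setOutputKeyClass(Text.class);    // assuming the level jobs emit Text pairs
            job.setOutputValueClass(Text.class);
            if (!job.waitForCompletion(true)) {
                System.exit(1);                   // stop the chain if a level fails
            }
        }
    }
}

Each waitForCompletion call blocks until that level has finished, so the jobs run strictly in sequence and the previous level's output directory exists before the next job adds it as an input.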
