How can I avoid unnecessarily repeating the map step in chained Hadoop jobs?

I have two chained mapreduce steps (within a much larger branched workflow). The first groups by id and in a very small number of cases produces a new object with a different id (maybe a few thousand out of hundreds of millions of input objects). The second again groups everything, including the new objects, by id and produces a bunch of stuff I care about.
It seems really wasteful to read/shuffle all the data again when everything except the new objects is already grouped on the same server and ordered by id. Is there a way to shuffle just the new stuff to the current reducers and have them start the list again?
I'm using Hadoop streaming so any answer that works with that would be ideal, but I'm also interested in general answers.

If the new objects are produced by reducers, then you can't do this with MapReduce in a single pass. Consider using Spark instead; it is better suited to iterative tasks.
If the new objects are produced by mappers, AND the first-stage reducers are just a pass-through, you should be able to do this in one step: the mappers in the first stage should emit both the original and the new records (there's no rule that mappers have to be 1:1; a mapper can produce more or fewer records than it receives).
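If it helps, here is a minimal sketch of that one-pass idea in the Java mapper API (with Hadoop Streaming the equivalent is simply printing an extra output line). parseId, needsDerivedRecord and deriveNewRecord are hypothetical placeholders for your own logic:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EmitOriginalAndDerivedMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String id = parseId(record);                 // placeholder: extract the group id
        context.write(new Text(id), record);         // always pass the original record through

        if (needsDerivedRecord(record)) {            // the rare case: a few thousand in hundreds of millions
            Text derived = deriveNewRecord(record);  // placeholder: build the new object
            context.write(new Text(parseId(derived)), derived);
        }
    }

    // Hypothetical helpers -- replace with your own parsing/derivation logic.
    private String parseId(Text record) { return record.toString().split("\t", 2)[0]; }
    private boolean needsDerivedRecord(Text record) { return false; }
    private Text deriveNewRecord(Text record) { return new Text(record); }
}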

Related

Hadoop: Run two M/R jobs on same data or ChainMap with a barrier of synchronization

I have a problem which requires me to filter a large amount of data, tens of terabytes, in an iterative process. Due to the size, I would like to do the computation in 2 consecutive map phases so that the data doesn't need to be retransferred across the network.
So the steps in the algorithm are 1) analyze all data and make a decision, 2) rerun on the same data and do a filtering process based on the decision from 1.
I figure there are two ways to solve this, but each seems to have large issues.
1) Solution, ChainMapper. Problem: The first mapper needs to complete entirely before the second starts.
2) Solution, two jobs. Problem: the data gets retransferred across the network, as data is deleted between jobs.
I'm sure there is something I'm missing, but I could really use some help!
Thanks
Given your clarifications: you can't use ChainMapper, precisely because it does not operate by applying mapper 1 to all keys, waiting, and then applying mapper 2. It applies a chain of mappers to each input record, so some records will have finished phases 1 and 2 before others have even started. You are right, though, that it doesn't cause more data to go across the network; here it isn't even written to disk!
Since you need phase 1 to finish completely before phase 2 begins, you really need to finish the Map phase before doing anything else. Do phase 1 in the Mapper and phase 2 in the Reducer. That's simplest.
Oddly, it might be faster to run two MapReduce jobs, each without a real Reducer: the Reducer can be the no-op Reducer.class, and you call setNumReduceTasks(0). That avoids the shuffle phase; map output isn't copied around to reducers but is just dumped to HDFS.
Your next mappers will, in general, be spawned on top of that HDFS data. No extra transfer there.
I don't think you're going to avoid some data transfer to reorganize and remarshal the data, but it's unlikely to dominate your calculation.
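For reference, a map-only driver along those lines might look like the following sketch (class names and paths are placeholders, not anything from the question):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Phase1Driver {

    // Placeholder phase-1 mapper: passes records through unchanged.
    public static class AnalyzeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(""), line);  // replace with the real phase-1 analysis
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "phase-1-analyze");
        job.setJarByClass(Phase1Driver.class);
        job.setMapperClass(AnalyzeMapper.class);
        job.setNumReduceTasks(0);                 // map-only: no shuffle, map output goes straight to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}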

Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce?

Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce? Each mapper class works on a different set of inputs, but they would all emit key-value pairs consumed by the same reducer. Note that I'm not talking about chaining mappers here; I'm talking about running different mappers in parallel, not sequentially.
This is called a join.
You want to use the mappers and reducers in the mapred.* packages (older, but still supported). The newer packages (mapreduce.*) only allow for one mapper input. With the mapred packages, you use the MultipleInputs class to define the join:
MultipleInputs.addInputPath(jobConf,
    new Path(countsSource),
    SequenceFileInputFormat.class,
    CountMapper.class);
MultipleInputs.addInputPath(jobConf,
    new Path(dictionarySource),
    SomeOtherInputFormat.class,
    TranslateMapper.class);

jobConf.setJarByClass(ReportJob.class);
jobConf.setReducerClass(WriteTextReducer.class);

jobConf.setMapOutputKeyClass(Text.class);
jobConf.setMapOutputValueClass(WordInfo.class);

jobConf.setOutputKeyClass(Text.class);
jobConf.setOutputValueClass(Text.class);
I will answer your question with a question, 2 answers, and an anti-recommendation.
The question is: what benefit do you see in running the heterogeneous map jobs in parallel, as opposed to running them in series, outputting homogeneous results that can be properly shuffled? Is the idea to avoid passing over the same records twice, once with an identity map?
The first answer is to schedule both mapper-only jobs simultaneously, each on half your fleet (or whatever ratio best matches the input data size), outputting homogeneous results, followed by a reducer-only job that performs the join.
The second answer is to create a custom InputFormat that is able to recognize and transform both flavors of the heterogeneous input. This is extremely ugly, but it will allow you to avoid the unnecessary identity map of the first suggestion.
The anti-recommendation is to not use the deprecated Hadoop APIs from Chris' answer. Hadoop is very young, but the APIs are stabilizing around the "new" flavor. You will arrive at version lock-in eventually.

Query related to Hadoop's map-reduce

Scenario:
I have a subset of a database and a data warehouse. I have brought both of these onto HDFS.
I want to analyse the results based on the subset and the data warehouse.
(In short, for each record in the subset I have to scan every record in the data warehouse.)
Question:
I want to do this task using a MapReduce algorithm. I don't understand how to take both files as input to the mapper, or how to handle both files in the map phase.
Please suggest an approach so that I can get this done.
Check Section 3.5 (Relational Joins) in Data-Intensive Text Processing with MapReduce for map-side joins, reduce-side joins and memory-backed joins. In any case, the MultipleInputs class is used to have multiple mappers process different files in a single job.
FYI, you could use Apache Sqoop to import DB into HDFS.
Some time ago I wrote a Hadoop map-reduce for one of my classes. I was scanning several IMDb databases and producing merged information about actors (basically the name, biography and films they acted in were in different databases). I think you can use the same approach I used for my homework:
I wrote a separate map-reduce that turned every database file into the same format, placing a two-letter prefix in front of every row it produced so I could tell them apart: 'BI' (biography), 'MV' (movies) and so on. Then I used all these produced files as input for my last map-reduce, which processed and grouped them in the desired way.
I am not even sure that you need so much work if you are really going to scan every line of the data warehouse. Maybe in that case you can just do the scan either in the map or the reduce phase (depending on what additional processing you want to do), but my suggestion assumes that you actually need to filter the data warehouse based on the subset. If it's the latter, my suggestion might work for you.
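A hedged sketch of that tagging approach (the tab-separated layout, field positions and class names are assumptions, not the original homework code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ActorMergeJob {

    // Mapper for the biography source: key = actor name, value tagged with "BI".
    // A movies mapper would do the same with an "MV" prefix.
    public static class BiographyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);  // assumed layout: name <tab> biography
            ctx.write(new Text(fields[0]), new Text("BI" + fields[1]));
        }
    }

    // The reducer sees every tagged record for one actor and merges them.
    public static class MergeReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text name, Iterable<Text> tagged, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder bio = new StringBuilder();
            StringBuilder movies = new StringBuilder();
            for (Text t : tagged) {
                String v = t.toString();
                if (v.startsWith("BI")) bio.append(v.substring(2));
                else if (v.startsWith("MV")) movies.append(v.substring(2)).append(';');
            }
            ctx.write(name, new Text(bio + "\t" + movies));
        }
    }
}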

How to ensure that MapReduce tasks are independent of each other?

I'm curious, but how does MapReduce, Hadoop, etc., break a chunk of data into independently operated tasks? I'm having a hard time imagining how that can be, considering it is common to have data that is quite interrelated, with state conditions between tasks, etc.
If the data IS related it is your job to ensure that the information is passed along. MapReduce breaks up the data and processes it regardless of any (not implemented) relations:
Map just reads data in blocks from the input files and passes them to the map function one "record" at a time. The default record is a line (but this can be modified).
You can annotate the data in Map with its origin, but what you basically do in Map is categorize the data. You emit a new key and new values, and MapReduce groups by the new key. So if there are relations between different records: choose the same (or a similar *1) key when emitting them, so they are grouped together.
For Reduce the data is partitioned/sorted (that is where the grouping takes place) and afterwards the reduce function receives all data from one group: one key and all its associated values. Now you can aggregate over the values. That's it.
So you have an overall group-by implemented by MapReduce. Everything else is your responsibility. You want a cross product from two sources? Implement it, for example, by introducing artificial keys and multi-emitting (fragment-and-replicate join). Your imagination is the limit. And: you can always pass the data through another job.
*1: similar, because you can influence the choice of grouping later on. Normally grouping is by the identity function, but you can change this.
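As a tiny illustration of "choose the same key so related records meet in the same reduce() call" (the comma-separated layout and field index are made-up assumptions):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GroupByCustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        String customerId = fields[0].trim().toLowerCase();  // normalize, so "similar" keys become identical
        ctx.write(new Text(customerId), line);                // all records for one customer reach one reduce() call
    }
}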

Hadoop one Map and multiple Reduce

We have a large dataset to analyze with multiple reduce functions.
All the reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset costs too much to do it every time; it would be better to read it only once and pass the mapped data to multiple reduce functions.
Can I do this with Hadoop? I've searched the examples and the intarweb but I could not find any solutions.
Maybe a simple solution would be to write a job that doesn't have a reduce function. So you would pass all the mapped data directly to the output of the job. You just set the number of reducers to zero for the job.
Then you would write a job for each different reduce function that works on that data. This would mean storing all the mapped data on the HDFS though.
Another alternative might be to combine all your reduce functions into a single Reducer which outputs to multiple files, using a different output for each different function. Multiple outputs are mentioned in this article for hadoop 0.19. I'm pretty sure that this feature is broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.
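For what it's worth, a rough sketch of that multiple-outputs idea with the older mapred API might look like this (the named outputs "sum" and "max" and the aggregation logic are invented examples):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class MultiFunctionReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs mos;

    @Override
    public void configure(JobConf conf) {
        // The driver (not shown) would register the named outputs beforehand, e.g.:
        //   MultipleOutputs.addNamedOutput(conf, "sum", TextOutputFormat.class, Text.class, IntWritable.class);
        //   MultipleOutputs.addNamedOutput(conf, "max", TextOutputFormat.class, Text.class, IntWritable.class);
        mos = new MultipleOutputs(conf);
    }

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0, max = Integer.MIN_VALUE;
        while (values.hasNext()) {
            int v = values.next().get();
            sum += v;
            max = Math.max(max, v);
        }
        // Each "reduce function" writes to its own named output / file set.
        mos.getCollector("sum", reporter).collect(key, new IntWritable(sum));
        mos.getCollector("max", reporter).collect(key, new IntWritable(max));
    }

    @Override
    public void close() throws IOException {
        mos.close();
    }
}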
Are you expecting every reducer to work on exactly the same mapped data? At the very least the "key" should differ, since it decides which reducer the data goes to.
You can write the output multiple times in the mapper, emitting ($i, $key) as the key (where $i identifies the i-th reducer and $key is your original key). You then need a Partitioner to make sure these n records are distributed across reducers based on $i, and a GroupingComparator to group records by the original $key.
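For instance, the partitioner piece could be a small sketch like this (it assumes keys formatted as "$i|$key"):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class IndexPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text compositeKey, Text value, int numReducers) {
        // Route by the leading $i component so each of the n copies lands on a different reducer.
        int i = Integer.parseInt(compositeKey.toString().split("\\|", 2)[0]);
        return i % numReducers;
    }
}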
It's possible to do that, but not in a trivial way in one MR job.
You may use composite keys. Let's say you need two kinds of reducers, 'R1' and 'R2'. Add their ids as a prefix to your output keys in the mapper, so in the mapper a key 'K' becomes 'R1:K' or 'R2:K'.
Then, in the reducer, pass values to implementations of R1 or R2 based on the prefix.
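A hedged sketch of this prefix approach (extractKey and the commented dispatch points stand in for your own logic):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PrefixDispatch {

    // The mapper emits each record twice, once per logical reducer.
    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String k = extractKey(line);            // placeholder: your original key K
            ctx.write(new Text("R1:" + k), line);
            ctx.write(new Text("R2:" + k), line);
        }
        private String extractKey(Text line) { return line.toString().split("\t", 2)[0]; }
    }

    // One physical Reducer class dispatches on the prefix.
    public static class DispatchReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text taggedKey, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String k = taggedKey.toString();
            if (k.startsWith("R1:")) {
                // run the R1 logic on (k.substring(3), values) and ctx.write(...) its results
            } else if (k.startsWith("R2:")) {
                // run the R2 logic on (k.substring(3), values) and ctx.write(...) its results
            }
        }
    }
}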
I guess you want to run different reducers in a chain. In Hadoop, 'multiple reducers' means running multiple instances of the same reducer. I would propose you run one reducer at a time, providing a trivial map function for all of them except the first one. To minimize data transfer time, you can use compression.
Of course you can define multiple reducers. For the Job (Hadoop 0.20) just add:
job.setNumReduceTasks(<number>);
But: your infrastructure has to support multiple reducers, meaning that you have to
have more than one cpu available
adjust mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml accordingly
And of course your job has to match some specifications. Without knowing what you exactly want to do, I only can give broad tips:
the map-output keys either have to partition reasonably with the default partitioning (% numreducers) OR you have to define your own partitioner:
job.setPartitionerClass(...)
for example with a random-partitioner ...
the data must be reduce-able in the partitioned format ... (references needed?)
You'll get multiple output files, one for each reducer. If you want a sorted output, you have to add another job reading all files (multiple map-tasks this time ...) and writing them sorted with only one reducer ...
Have a look, too, at the Combiner class, which is a local Reducer. It means that you can already aggregate (reduce) in memory over partial data emitted by the map.
A very nice example is the WordCount example. Map emits each word as key with a count of 1: (word, 1). The Combiner gets partial data from the map and emits (word, partial count) locally. The Reducer does exactly the same, but now some of the (combined) word counts are already > 1. Saves bandwidth.
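For completeness, here is that WordCount shape with the combiner wired in, reconstructed as a sketch rather than taken from any answer above:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);               // (word, 1)
                }
            }
        }
    }

    // Used both as Combiner (partial, in-memory sums) and as Reducer (final sums).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();  // combined counts may already be > 1
            ctx.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // local aggregation before the shuffle: saves bandwidth
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}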
I still don't get your problem, but you can use the following sequence:
database --> map --> reduce (use cat or none, depending on the requirement)
then store the data representation you have extracted.
If you are saying that it is small enough to fit in memory, then storing it on disk shouldn't be an issue.
Also, your use of the MapReduce paradigm for the given problem is incorrect: using a single map function and multiple "different" reduce functions makes no sense. It shows that you are just using map to pass out data to different machines to do different things. You don't require Hadoop or any other special architecture for that.
