Hadoop MapReduce with already sorted files

I'm working with Hadoop MapReduce. I've got data in HDFS, and the data in each file is already sorted. Is it possible to force MapReduce not to re-sort the data after the map phase? I've tried changing map.sort.class to a no-op, but it didn't work (i.e. the data wasn't sorted as I'd expected). Has anyone tried something similar and managed to achieve it?
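Roughly what I tried (simplified; the no-op sorter class name here is a stand-in):

    // Simplified sketch of the attempted override. map.sort.class
    // (default org.apache.hadoop.util.QuickSort) names the IndexedSorter
    // used on the map output buffer; my.pkg.NoOpSorter is hypothetical.
    Configuration conf = new Configuration();
    conf.set("map.sort.class", "my.pkg.NoOpSorter");
    Job job = Job.getInstance(conf, "presorted-input");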

I think it depends on what kind of result you want: sorted or unsorted.
If you need the result to be sorted, I don't think Hadoop is suited to the job. There are two reasons:
The input data is stored in different chunks (if it is big enough) and partitioned into multiple splits. Each split is mapped to one map task, and the output of all the map tasks is gathered (after being partitioned, sorted, combined, copied, and merged) as the reducer's input. It is hard to keep keys in order across these stages.
Sorting happens not only after the map phase of a map task; during the merge phase of the reduce task there is a sort as well.
If you do not need the result to be sorted, this patch may be what you want:
Support no sort dataflow in map output and reduce merge phase: https://issues.apache.org/jira/browse/MAPREDUCE-3397

Related

How do you determine how to split keys among reducer tasks in a MapReduce implementation?

So I read the MapReduce paper, and am attempting to implement a simplified MapReduce based on it.
I understand that the number of mappers is determined by the input. But how do you dynamically determine the keys the reducer will operate over if you don't know the intermediate output in advance?
For example, let's say we have a MapReduce task that counts the number of a's and b's, along these lines:
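A minimal sketch of such a job; class and field names are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;

    // Emit (letter, 1) for every "a" or "b" token; the reducer sums them.
    class LetterCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text letter = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.equals("a") || token.equals("b")) {
                    letter.set(token);
                    ctx.write(letter, ONE);
                }
            }
        }
    }

    class LetterCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(key, new IntWritable(sum));
        }
    }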
It's natural that there's one Reducer to count a's and one Reducer to count b's.
Mapper1 would produce something like this:
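For an input split containing, say, "a b a", the intermediate pairs might look like:

    (a, 1)
    (b, 1)
    (a, 1)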
But for this to work I need to know in advance that the data is split alphabetically. What if all the keys are numeric? Heck, what if they're emojis? How do I dynamically determine the number of reducers and the subset of the data each one operates on? I feel like I'm missing something super obvious.
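For what it's worth, Hadoop itself sidesteps this problem by hashing rather than by understanding the key space: the user fixes the number of reducers up front, and the default HashPartitioner maps any key type (letters, numbers, emoji) to one of them:

    // Hadoop's default partitioning strategy: hash the key into one of
    // numPartitions buckets. No advance knowledge of the keys is needed;
    // it only guarantees that equal keys land on the same reducer.
    public class HashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }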

Why do we need the "map" part in MapReduce?

The MapReduce programming model consists of two procedures, map and reduce. Why do we need the map part when we could simply do the mapping inside the reduce function?
Consider the following pseudocode:
result = my_list.map(my_mapper).reduce(my_reducer);
This could be shortened to
result = my_list.reduce(lambda x : my_reducer(my_mapper(x)));
Why would the first approach be preferred over the second, when the first requires one more pass over the data? Is my code example oversimplifying?
Well, if you are referring to Hadoop-style MapReduce, it is actually map-shuffle-reduce, and the shuffle is the reason map and reduce are separate. At a slightly higher level you can think about data locality. Each key-value pair passed through map can generate zero or more key-value pairs. To be able to reduce these, you have to ensure that all values for a given key are available on a single reducer, hence the shuffle. Importantly, pairs emitted from a single input pair can be processed by different reducers.
It is possible to use patterns like map-side aggregation or combiners, but at the end of the day it is still (map)-reduce-shuffle-reduce.
Assuming data locality is not an issue, higher-order functions like map and reduce provide an elegant abstraction layer. Finally, it is a declarative API: a simple expression like xs.map(f1).reduce(f2) describes only the what, not the how. Depending on the language or context it can be evaluated eagerly or lazily, and operations can be fused, or in more complex scenarios reordered and optimized in many different ways.
Regarding your code: even if the signatures were correct, it wouldn't really reduce the number of passes over the data. Moreover, if you push map into the aggregation, the arguments passed to the aggregation function are no longer of the same type. That means either a sequential fold or much more complex merging logic.
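A small sketch of that type mismatch in Java streams terms; the word-length example is illustrative:

    import java.util.List;

    List<String> words = List.of("map", "shuffle", "reduce");

    // map then reduce: the reducer is a BinaryOperator<Integer>, so both
    // arguments share one type and partial results merge in any order.
    int total = words.stream().map(String::length).reduce(0, Integer::sum);

    // map pushed into the aggregation: the accumulator is now
    // (Integer, String) -> Integer, i.e. a fold; Java requires an extra
    // combiner before it will evaluate this in parallel.
    int fused = words.stream().reduce(0, (acc, w) -> acc + w.length(), Integer::sum);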
At a high level, MapReduce is all about processing in parallel. Even though the reducers work on the map output, in practical terms each reducer gets only part of the data, and that is possible only in the first approach.
In your second approach, your reducer needs the entire output of the mapper, which defeats the idea of parallelism.

Hadoop map only job

My situation is like the following:
I have two MapReduce jobs.
The first is a MapReduce job that produces output sorted by key.
The second is a map-only job that extracts some part of the data and just collects it.
I have no reducer in the second job.
The problem is that I am not sure whether the output of the map-only job will still be sorted, or whether it will be shuffled by the map phase.
First of all: if your second job only contains a filter to include/exclude specific records, then you are better off simply adding this filter to the end of the reducer of the first job.
A rather important fact about MapReduce is that the reducer will sort the records in "some way" that you do not control. When writing a job, you should assume the records are output in a random order.
If you really need all records to be output in a specific order, then the SecondarySort mechanism combined with a single reducer is the "easy" solution, but it doesn't scale well.
The "hard" solution is what the TeraSort benchmark uses.
Read this SO question for more insight into how that works:
How does the MapReduce sort algorithm work?
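The Hadoop building block behind that approach is TotalOrderPartitioner: sample the input to compute key-range boundaries, then give every reducer a disjoint, ordered range. A sketch, with the path and sampler parameters as assumptions:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    Job job = Job.getInstance(conf, "total-order-sort");
    job.setNumReduceTasks(8);  // output is globally sorted across all 8 parts
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
            new Path("/tmp/partitions.lst"));  // assumed path
    // Sample ~1% of records, up to 1000 samples from at most 10 splits,
    // to pick the 7 boundary keys separating the 8 ranges.
    InputSampler.writePartitionFile(job,
            new InputSampler.RandomSampler<Text, Text>(0.01, 1000, 10));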
No. As zsxwing said, there won't be any such processing unless you specify a reducer; only then is partitioning performed on the map side and sorting and grouping done on the reduce side.
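A minimal driver sketch for the map-only job (class names assumed); with zero reducers there is no partitioning, sorting, or grouping, and each map task writes its output file directly:

    Job job = Job.getInstance(new Configuration(), "extract");
    job.setJarByClass(ExtractDriver.class);   // hypothetical driver class
    job.setMapperClass(ExtractMapper.class);  // hypothetical mapper
    job.setNumReduceTasks(0);  // map-only: output order is whatever the
                               // mapper emits, per output file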

Hadoop and Cassandra processing rows in sorted order

I want to fill a Cassandra database with a list of strings that I then process using Hadoop. What I want to do is run through all the strings in order on a Hadoop cluster and record how much overlap there is between the strings, in order to find the Longest Common Substring.
My question is: will the InputFormat object allow me to read the data out in sorted order, or will my strings be read out "randomly" (according to how Cassandra decides to distribute them) across every machine in the cluster? Is the MapReduce process designed to handle each row on its own, without any notion of looking at two rows consecutively, as I'm asking for?
First of all, the Mappers will read the data in whatever order they get it from the InputFormat. I'm not a Cassandra expert, but I don't expect that will be in sorted order.
If you want sorted order, you should use an identity mapper (one that does nothing) whose output key is the string itself. The records will then be sorted before being passed to the reduce step. But it gets a little more complicated, since you can have more than one reducer. With only one reducer, everything is globally sorted. With more than one, each reducer's input is sorted, but the input across reducers might not be. That is, adjacent strings might not go to the same reducer. You would need a custom partitioner to handle that.
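A sketch of such a mapper; the class name is illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emit each string as the key (with no payload) so the shuffle sorts
    // the records before they reach the reducer.
    public class IdentityStringMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(value, NullWritable.get());
        }
    }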
Lastly, you mentioned that you're doing longest common substring: are you looking for the longest substring among each pair of strings? Among consecutive pairs of strings? Among all strings? Each of these possibilities will affect how you need to structure your MapReduce job.

Parallel reducing with Hadoop mapreduce

I'm using Hadoop's MapReduce. I have a file as input to the map function; the map function does something (not relevant to the question). I'd like my reducer to take the map's output and write it to two different files.
The way I see it (I want an efficient solution), there are two options:
A single reducer that knows how to identify the different cases and writes to two different contexts.
Two parallel reducers, each of which identifies its relevant input, ignores the other's, and writes to its own file (so each reducer writes to a different file).
I'd prefer the first solution, because it means I'll go over the map's output only once instead of twice in parallel. But if the first isn't supported in some way, I'll be glad to hear a solution for the second suggestion.
*Note: these two final files are supposed to be kept separate; there is no need to join them at this point.
The Hadoop API has a feature for creating multiple outputs, called MultipleOutputs, which makes your preferred solution possible.
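A sketch of a reducer using it; the named outputs and the routing predicate are assumptions:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class SplitReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> out;

        @Override
        protected void setup(Context ctx) {
            out = new MultipleOutputs<>(ctx);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            for (Text v : values) {
                // isCaseA is a placeholder for whatever distinguishes the
                // two record types; "fileA"/"fileB" must be registered in
                // the driver via MultipleOutputs.addNamedOutput(...).
                out.write(isCaseA(v) ? "fileA" : "fileB", key, v);
            }
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            out.close();
        }

        // Placeholder predicate for illustration only.
        private boolean isCaseA(Text v) { return v.toString().startsWith("A"); }
    }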
If you know at the map stage which file a record must go to, you can tag your map output with a special key specifying which file it should go to. For example, if record R1 must go to file 1, you would output <1, R1> (1 is the key, a symbolic representation of file 1, and R1 is the value). If record R2 must go to file 2, your map output would be <2, R2>.
Then, if you configure the MapReduce job to use only two reducers, all records tagged <1, _> will be sent to one reducer and all records tagged <2, _> to the other (given a partitioner that separates the two keys, as in the sketch below).
This would be better than your preferred solution, since you still go through your map output only once, and at the same time the two files are written in parallel.
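A sketch of that routing; names are illustrative, and the driver must also set exactly two reducers:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Tag 1 -> reducer 0, tag 2 -> reducer 1 (with numPartitions == 2).
    public class TagPartitioner extends Partitioner<IntWritable, Text> {
        @Override
        public int getPartition(IntWritable tag, Text record, int numPartitions) {
            return (tag.get() - 1) % numPartitions;
        }
    }

    // In the driver:
    job.setNumReduceTasks(2);
    job.setPartitionerClass(TagPartitioner.class);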
