Hadoop map-only job

My situation is like the following:
I have two MapReduce jobs.
The first one is a MapReduce job which produces output sorted by key.
The second job is map only: it extracts some part of the data and just collects it.
I have no reducer in the second job.
The problem is that I am not sure whether the output of the map-only job will stay sorted, or whether it will be shuffled as it comes out of the map function.

First of all: if your second job only contains a filter to include/exclude specific records, then you are better off simply adding this filter to the end of the reducer of the first job.
A rather important fact of MapReduce is that the reducer will sort the records in "some way" that you do not control. When writing a job you should assume the records are output in a random order.
If you really need all records to be output in a specific order, then using the SecondarySort mechanism in combination with a single reducer is the "easy" solution that doesn't scale well.
The "hard" solution is what the "Tera sort" benchmark uses.
Read this SO question for more insight into how that works:
How does the MapReduce sort algorithm work?

No. As zsxwing said, no sorting or shuffling will be done unless you specify a reducer; only then will partitioning be performed on the map side and sorting and grouping on the reduce side.
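For reference, a map-only job is what you get when you set the number of reduce tasks to zero: the map output then goes straight to the output format with no partitioning, sorting or shuffling, so records are written in the order each mapper emitted them. A minimal sketch of such a driver (the class names and the filter condition are placeholders, not from the question):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyFilterJob {

    // Hypothetical filter mapper: passes through only the records we want to keep.
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (line.toString().contains("INTERESTING")) {   // placeholder condition
                context.write(new Text(Long.toString(offset.get())), line);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only filter");
        job.setJarByClass(MapOnlyFilterJob.class);
        job.setMapperClass(FilterMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Zero reducers => no partition/sort/shuffle; map output is written directly,
        // in the order the mapper emitted it.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}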

Related

How does map-reduce work? Did I get it right?

I'm trying to understand how map-reduce actually works. Please read what I have written below and tell me if there are any missing parts or incorrect things here.
Thank you.
The data is first split into what are called input splits (a logical kind of grouping whose size we define according to our record-processing needs).
Then, there is a mapper for every input split, which takes the input split and sorts it by key and value.
Then, there is the shuffling process, which takes all of the data from the mappers (key-value pairs) and merges all the same keys with their values (the output is every key with its list of values). The shuffling process occurs in order to give the reducer, for each kind of key, a single key with its summed values.
Then, the reducer merges all the key-values into one place (page maybe?), which is the final result of the MapReduce process.
We only have to make sure to define the Map step (which always outputs key-value pairs) and the Reduce step (the final result: take the input key-values and compute a count, sum, avg, etc.).
Your understanding is slightly wrong, especially regarding how the mapper works.
I have a very nice pictorial image to explain it in simple terms.
It is similar to the wordcount program, where
Each bundle of chocolates is an InputSplit, which is handled by a mapper. So we have 3 bundles.
Each chocolate is a word. One or more words (making a sentence) is a record that is input to a single mapper. So, within one input split there may be multiple records, and each record is input to a single mapper.
The mapper counts the occurrences of each word (chocolate) and emits the count. Note that each mapper works on only one line (record) at a time. As soon as it is done, it picks the next record from the input split. (2nd phase in the image)
Once the map phase is finished, sorting and shuffling take place to make a bucket for each kind of chocolate with its counts. (3rd phase in the image)
One reducer gets one bucket with the key being the name of the chocolate (or the word) and a list of counts. So, there are as many reduce groups as there are distinct words in the whole input file.
The reducer iterates through the counts and sums them up to produce the final count, which it emits against the word.
The diagram below shows how one single input split of the wordcount program works:
Similar QA - Simple explanation of MapReduce?
Also, this post explains Hadoop - HDFS & MapReduce in a very simple way: https://content.pivotal.io/blog/demystifying-apache-hadoop-in-5-pictures
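For concreteness, here is a trimmed-down sketch of the wordcount mapper and reducer that the phases above describe (essentially the standard Hadoop example; class names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The mapper is called once per record (one line of the split) and emits <word, 1> pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // 2nd phase: emit a count of 1 per word
                }
            }
        }
    }

    // After shuffle & sort, reduce() is called once per distinct word with all its counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));  // final count for this word
        }
    }
}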

In Hadoop, what is meant by the ability to preserve state across multiple inputs in a mapper/reducer?

The heading of the question explains what my question is.
I have been reading through multiple texts and answers, where I came across this line:
Through use of the combiner and by taking advantage of the ability to preserve state across multiple inputs, it is often possible to substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.
I am not able to understand this concept. An elaborate answer and explanation with an example would be really helpful. How to develop an intuition to understand such concepts?
If you already feel comfortable with the "reducer" concept, the combiner concept will be easy. A combiner can be seen as a mini-reducer on the map phase. What do I mean by that? Let's see an example: suppose you are doing the classic wordcount problem. You know that for every word a key-value pair is emitted by the mapper; the reducer then takes these key-value pairs as input and summarizes them.
Suppose that a mapper collects some key-value pairs like:
<key1,1>,
<key2,1>,
<key1,1>,
<key3,1>,
<key1,1>
If you are not using a combiner, these 5 key-value pairs will be sent to the reducer. But using a combiner, we could perform a pre-reduce in the mapper, so the output of the mapper will be:
<key1,3>,
<key2,1>,
<key3,1>
In this simple example, by using a combiner you reduced the total number of key-value pairs from 5 to 3, which gives you less network traffic and better performance in the shuffle phase.
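The "preserve state across multiple inputs" part of the quote refers to what is often called in-mapper combining: instead of emitting <word, 1> for every token, the mapper keeps a running tally in memory across all the map() calls it receives and emits the aggregated pairs once, in cleanup(). A hedged sketch for a wordcount-style job (the class name is illustrative, not from the original answer):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: the state (the counts map) is preserved across all map() calls
// handled by this mapper instance, and flushed once in cleanup().
public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);  // aggregate instead of emitting <word, 1>
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit one pair per distinct word seen by this mapper - far fewer pairs to shuffle.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

The simpler alternative is to register an ordinary combiner in the driver, e.g. job.setCombinerClass(IntSumReducer.class), which performs the same kind of pre-aggregation on the map output before it is shuffled.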

hadoop mapreduce two sorts

I'm new to Hadoop MapReduce. I went through some of the tutorials and noticed that the output of the mapper is sorted, while on the reducer side we also have a shuffle & sort phase.
So why do we have two sorts there? What is the purpose of each?
Thanks!
Mapper: it arranges the input data from a source into key-value pairs for further processing.
Reducer: the aggregation logic is written here.
The role of the shuffle is to shuffle and sort, and it passes the output from the mapper to the reducer. This is done internally by the MR framework, but we can customize how it partitions, sorts and groups using the MR API and Java.
Refer to this WordCount example:
http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
Also refer to this:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
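To be precise, you cannot swap out the whole shuffle, but the framework does expose the hooks that control it. A sketch of a custom partitioner, which decides which reducer each map-output key is sent to (the class and the a-m split rule are purely illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Words starting with a-m go to reducer 0, the rest to reducer 1
// (assuming the job is configured with 2 reduce tasks).
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        if (word.isEmpty() || numPartitions < 2) {
            return 0;                 // single reducer: everything goes to it
        }
        char first = Character.toLowerCase(word.charAt(0));
        return (first <= 'm') ? 0 : 1;
    }
}

// In the driver:
//   job.setNumReduceTasks(2);
//   job.setPartitionerClass(AlphabetPartitioner.class);
// Similar hooks exist for the sort and grouping steps:
//   job.setSortComparatorClass(...) and job.setGroupingComparatorClass(...)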

Why do we need the "map" part in MapReduce?

The programming model MapReduce consists of 2 procedures, map and reduce. Why do we need the map part, when we can simply do the mapping inside the reduce function?
Consider the following pseudocode:
result = my_list.map(my_mapper).reduce(my_reducer);
This could be shortened to
result = my_list.reduce(lambda x : my_reducer(my_mapper(x)));
How can the 1st approach be preferred over the 2nd one, when the 1st approach requires one more pass through the data? Is my code example oversimplifying?
Well, if you refer to Hadoop-style MapReduce, it is actually map-shuffle-reduce, where the shuffle is the reason map and reduce are separated. At a slightly higher level, you can think about data locality. Each key-value pair passed through map can generate zero or more key-value pairs. To be able to reduce these, you have to ensure that all values for a given key are available on a single reducer, hence the shuffle. What is important is that pairs emitted from a single input pair can be processed by different reducers.
It is possible to use patterns like map-side aggregations or combiners but at the end of the day it is still (map)-reduce-shuffle-reduce.
Assuming data locality is not an issue, higher-order functions like map and reduce provide an elegant abstraction layer. Finally, it is a declarative API. A simple expression like xs.map(f1).reduce(f2) describes only the what, not the how. Depending on the language or context, these can be evaluated eagerly or lazily, operations can be squashed, and in more complex scenarios reordered and optimized in many different ways.
Regarding your code: even if the signatures were correct, it wouldn't really reduce the number of times you pass over the data. Moreover, if you push map into the aggregation, then the arguments passed to the aggregation function are no longer of the same type. It means either a sequential fold or much more complex merging logic.
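To make the signature point concrete, here is the same comparison in plain Java streams (myMapper and myReducer are stand-ins): map is A -> B and reduce is (B, B) -> B, so pushing the mapper inside the reduction forces a fold whose accumulator and element types differ.

import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class MapThenReduce {
    public static void main(String[] args) {
        List<String> xs = List.of("a", "bb", "ccc");

        Function<String, Integer> myMapper = String::length;   // A -> B
        BinaryOperator<Integer> myReducer = Integer::sum;      // (B, B) -> B

        // map then reduce: still a single pass, because map is applied lazily per element.
        int result = xs.stream().map(myMapper).reduce(0, myReducer);

        // Pushing map "inside" the reduction means a fold whose accumulator (Integer)
        // and element (String) types differ - sequential by nature, or it needs an
        // extra combiner to merge partial results in parallel.
        int folded = xs.stream()
                .reduce(0, (acc, x) -> myReducer.apply(acc, myMapper.apply(x)), Integer::sum);

        System.out.println(result + " " + folded);  // both print 6
    }
}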
At a high level, MapReduce is all about processing in parallel. Even though the reducer works on map output, in practical terms each reducer gets only part of the data, and that is possible only in the first approach.
In your second approach, your reducer actually needs the entire output of the mapper, which defeats the idea of parallelism.

Hadoop MapReduce with already sorted files

I'm working with Hadoop MapReduce. I've got data in HDFS, and the data in each file is already sorted. Is it possible to force MapReduce not to re-sort the data after the map phase? I've tried changing map.sort.class to a no-op, but it didn't work (i.e. the data wasn't sorted as I'd expected). Has anyone tried doing something similar and managed to achieve it?
I think it depends on what kind of result you want: sorted or unsorted.
If you need the result to be sorted, I think Hadoop is not suitable for this work. There are two reasons:
The input data will be stored in different chunks (if big enough) and partitioned into multiple splits. Each split will be mapped to one map task, and all the output of the map tasks will be gathered (after being partitioned/sorted/combined/copied/merged) as the reducer's input. It is hard to keep keys in order across these stages.
The sort function exists not only after the map process in the map task; there is a sort option during the merge process in the reduce task, too.
If you do not need the result to be sorted, I think this patch may be what you want:
Support no sort dataflow in map output and reduce merge phrase : https://issues.apache.org/jira/browse/MAPREDUCE-3397
