Filtering output in Hadoop

I'm new with Hadoop and playing around with the WordCount example.
I ran into an issue that is confusing me. If I take the word count from a text file and I want to, for example, filter it in such a way that only words longer than 5 letters are in the output, do I have to run 2 jobs to do this?
The first job to do the word count and the second job to filter out the words shorter than 5 letters?
Or can I just write logic into the reducer that does not write the word into the result file if there are less than 5 occurrences? Would this result in invalid output if there are multiple instances of the reducer running?

The simple answer is that you don't need two jobs.
You can achieve this with a single job. The logic you describe at the end of your question is correct.
In the MapReduce framework, all the values associated with a single key are always passed to the same Reducer. So running multiple reducers for your job will not affect the output.
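As a minimal sketch of that reducer-side logic (class and method names here are illustrative, and a standard WordCount-style mapper emitting (word, 1) is assumed):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a reducer-side filter: sum the counts for each word and
// only emit words that occur at least 5 times.
public class FilteringReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        // All values for one word arrive at the same reducer,
        // so this check is safe even when many reducers are running.
        if (sum >= 5) {
            context.write(word, new IntWritable(sum));
        }
    }
}
```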
PS:
only words longer than 5 letters are in the output
This is from the second paragraph of your question. I am assuming that you mean 5 occurrences of a word, not the length of the word.
But if you really do want only words longer than 5 letters, you can filter in the Mapper itself. Then there is less data in the sort-shuffle phase (data sorting and transfer over the network) and less data for the Reducer to process.
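A rough sketch of that mapper-side filter (illustrative names, not the exact WordCount code):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emit (word, 1) only for words longer than 5 letters, so shorter words
// never reach the sort/shuffle phase or the reducer at all.
public class LengthFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (token.length() > 5) {   // the filter depends only on the input record
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```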

One MapReduce job should be enough.
Best practice says that you should filter and project data in the mapper whenever possible.
In your case, the filter condition depends only on the input data (the characters in the input word), so you can filter on the mapper side and send to the reducer only words with more than 5 letters, improving the performance of your job. It doesn't make sense to send data to the reducer just to drop it there, although that would work too.

Related

How does map-reduce work? Did I get it right?

I'm trying to understand how map-reduce actually works. Please read what I have written below and tell me if there are any missing parts or incorrect things here.
Thank you.
The data is first split into what are called input splits (a logical kind of grouping whose size we define according to our record-processing needs).
Then, there is a Mapper for every input split, which takes the input split and sorts it by key and value.
Then there is the shuffling process, which takes all the data from the mappers (key-value pairs) and merges all the same keys with their values (the output is all the keys, each with its list of values). The shuffling process occurs in order to give the reducer, for each distinct key, one input key with its combined values.
Then, the Reducer merges all the key-values into one place (a page maybe?), which is the final result of the MapReduce process.
We only have to make sure to define the code for the Map step (which always outputs key-values) and the Reduce step (the final result: it gets the input key-values and can count, sum, average, etc.).
Your understanding is slightly wrong, especially regarding how the mapper works.
I have a very nice pictorial image to explain it in simple terms.
It is similar to the wordcount program, where
Each bundle of chocolates is an InputSplit, which is handled by a mapper. So we have 3 bundles.
Each chocolate is a word. One or more words (making a sentence) is a record, the input to a single mapper. So, within one input split there may be multiple records, and each record is input to a single mapper.
The mapper counts the occurrences of each word (chocolate) and emits the count. Note that each mapper works on only one line (record) at a time. As soon as it is done, it picks the next record from the input split. (2nd phase in the image)
Once the map phase is finished, sorting and shuffling take place to make a bucket of counts for each kind of chocolate. (3rd phase in the image)
One reducer gets one bucket, whose key is the name of the chocolate (or the word) and whose value is the list of counts. So, there are as many reduce groups as there are distinct words in the whole input file.
The reducer iterates through the counts and sums them up to produce the final count, which it emits against the word.
The diagram below shows how one single input split of the wordcount program is processed:
Similar QA - Simple explanation of MapReduce?
Also, this post explains Hadoop (HDFS & MapReduce) in a very simple way: https://content.pivotal.io/blog/demystifying-apache-hadoop-in-5-pictures
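To make the division of labour concrete, below is a minimal, illustrative driver sketch (WordCountMapper and WordCountReducer are hypothetical class names): the user supplies only the map and reduce code, while the framework takes care of input splits, the sort/shuffle, and grouping by key.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // user-defined map step (hypothetical class)
        job.setReducerClass(WordCountReducer.class);   // user-defined reduce step (hypothetical class)

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Splitting the input, calling map() per record, sort/shuffle and
        // grouping by key all happen inside the framework once the job is submitted.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```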

Why does MapReduce bother mapping every value to 1 in the map step?

I'm trying to figure out MapReduce and so far I think I'm gaining an okay understanding.
However, one thing confuses me. In every example and explanation of MapReduce I can find, the map step maps all values to 1. For instance, in the most common example (counting occurrences of words in a string), the Map section splits up each word and then maps it to the value 1.
The Reduce section then combines/reduces like words, adding up the amount of times they occur so that they map to N instead of 1 (N being how many times the word appears).
What I don't understand is: why even bother mapping them to 1 in the first place? It seems like they will ALWAYS map to 1. Why not just split them apart, and then in the Reduce step, do the mapping there, and sum everything up at the same time?
I'm sure there must be a good reason that I just can't think of. Thanks!
(this question is about MapReduce as a concept in general, not necessarily about Hadoop or any other specific technology or implementation)
The output of the mapper is decided based on the use case you want to support. In word count, we want the mapper to separate the individual words and output the number of occurrences for each word. The mapper is called once for every key-value pair in the input split; here that is once per line, where the key is the byte offset and the value is the entire line. Grouping is performed before the reducer is invoked, so all identical words are grouped together and each occurrence (the 1 here) is counted. It is not a hard rule to emit 1 as the mapper output. If you look at the data set example in Hadoop: The Definitive Guide, the mapper emits the year and the temperature, because the use case there is to group by year and find the max/min temperature. You can think of the mapper output key as the grouping parameter, for a basic understanding. Happy learning!
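A rough sketch of that second use case (the input record format and the class names here are assumptions, not the book's exact code):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed input: lines of the form "year<TAB>temperature".
// The mapper emits (year, temperature) rather than (word, 1), because the
// use case is grouping by year, not counting occurrences.
class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t");
        context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
    }
}

// All temperatures for one year are grouped before reduce() is called;
// the reducer just keeps the maximum.
class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable temp : temps) {
            max = Math.max(max, temp.get());
        }
        context.write(year, new IntWritable(max));
    }
}
```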

Hadoop map only job

My situation is like the following:
I have two MapReduce jobs.
The first one is a MapReduce job that produces output sorted by key.
Then a second, map-only job will extract some part of the data and just collect it.
I have no reducer in the second job.
The problem is that I am not sure whether the output from the map-only job will be sorted, or whether it will be shuffled from the map function.
First of all: if your second job only contains a filter to include/exclude specific records, then you are better off simply adding this filter to the end of the reducer of the first job.
A rather important fact about MapReduce is that the reducer will sort the records in "some way" that you do not control. When writing a job you should assume the records are output in a random order.
If you really need all records to be output in a specific order, then using the SecondarySort mechanism in combination with a single reducer is the "easy" solution, but it doesn't scale well.
The "hard" solution is what the "Tera sort" benchmark uses.
Read this SO question for more insight into how that works:
How does the MapReduce sort algorithm work?
No. As zsxwing said, there won't be any such processing done unless you specify a reducer; only then is partitioning performed on the map side and sorting and grouping done on the reduce side.
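For illustration, a map-only job is what you get by setting the number of reduce tasks to zero; the mapper class name below is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExtractDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "extract");
        job.setJarByClass(ExtractDriver.class);
        job.setMapperClass(ExtractMapper.class);  // hypothetical filtering mapper

        // Zero reducers => map-only job: mapper output is written straight to HDFS
        // in the order the mapper emits it, with no partitioning, sorting or shuffling.
        job.setNumReduceTasks(0);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```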

Hadoop and Cassandra processing rows in sorted order

I want to fill a Cassandra database with a list of strings that I then process using Hadoop. What I want to do is run through all the strings in order using a Hadoop cluster and record how much overlap there is between the strings, in order to find the Longest Common Substring.
My question is: will the InputFormat object allow me to read out the data in sorted order, or will my strings be read out "randomly" (according to how Cassandra decides to distribute them) across every machine in the cluster? Is the MapReduce process designed to process each row by itself, without the intent of looking at two rows consecutively like I'm asking for?
First of all, the Mappers will read the data in whatever order they get it from the InputFormat. I'm not a Cassandra expert, but I don't expect that will be in sorted order.
If you want sorted order, you should use an identity mapper (one that does nothing) whose output key is the string itself. The strings will then be sorted before being passed to the reduce step. But it gets a little more complicated once you have more than one reducer. With only one reducer, everything is globally sorted. With more than one, each reducer's input is sorted, but the input across reducers might not be sorted; that is, adjacent strings might not go to the same reducer. You would need a custom partitioner to handle that.
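A minimal sketch of such an identity-style mapper (illustrative names), which simply re-emits each string as the key so that the framework sorts the strings before the reduce step:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emit the string itself as the key so the framework sorts the strings.
// With a single reducer this gives a globally sorted order; with several
// reducers each reducer's input is only locally sorted unless a custom
// partitioner keeps adjacent strings together.
public class SortByStringMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, NullWritable.get());
    }
}
```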
Lastly, you mentioned that you're doing longest common substring: are you looking for the longest substring among each pair of strings? Among consecutive pairs of strings? Among all strings? Each of these possibilities will affect how you need to structure your MapReduce job.

Parallel reducing with Hadoop mapreduce

I'm using Hadoop's MapReduce. I have a file as input to the map function; the map function does something (not relevant to the question). I'd like my reducer to take the map's output and write to two different files.
The way I see it (I want an efficient solution), there are two options:
One reducer which knows how to identify the different cases and writes to 2 different contexts.
Two parallel reducers, each of which knows how to identify its relevant input and ignore the other's; this way each one writes to a different file.
I'd prefer the first solution, because it means I'll go over the map output only once instead of twice in parallel, but if the first isn't supported in some way, I'll be glad to hear a solution for the second suggestion.
Note: these two final files are supposed to be separate; there is no need to join them at this point.
The Hadoop API has a feature for creating multiple outputs, called MultipleOutputs, which makes your preferred solution possible.
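A rough sketch of that first approach, assuming Text keys and values (the named outputs "caseA"/"caseB" and the routing test are placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// A single reducer inspects each record and routes it to one of two named outputs.
public class SplittingReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> outputs;

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            if (isCaseA(value)) {                    // application-specific test
                outputs.write("caseA", key, value);  // goes to the first file
            } else {
                outputs.write("caseB", key, value);  // goes to the second file
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        outputs.close();                             // flush both output files
    }

    private boolean isCaseA(Text value) {
        return value.toString().startsWith("A");     // placeholder condition
    }
}
```

The named outputs also have to be registered in the driver, e.g. MultipleOutputs.addNamedOutput(job, "caseA", TextOutputFormat.class, Text.class, Text.class), and likewise for "caseB".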
If you know at the map stage which file a record must go to, you can tag your map output with a special key specifying which file it should go to. For example, if a record R1 must go to file 1, you would output <1, R1> (1 is the key, a symbolic representation for file 1, and R1 is the value). If a record R2 must go to file 2, your map output would be <2, R2>.
Then, if you configure the MapReduce job to use only 2 reducers, all records tagged with <1, _> will be sent to one reducer and all records tagged with <2, _> will be sent to the other.
This would be better than your preferred solution since you still go through your map output only once, and at the same time the two output files are written in parallel.
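A small sketch of this tagging approach (the routing rule is a placeholder):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tag each record with key 1 or 2 to mark which output file it belongs to;
// the job is then configured with two reducers, each writing one file.
public class TaggingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private final IntWritable tag = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        tag.set(belongsToFileOne(record) ? 1 : 2);   // application-specific rule
        context.write(tag, record);
    }

    private boolean belongsToFileOne(Text record) {
        return record.toString().contains("file1");  // placeholder condition
    }
}
// In the driver: job.setNumReduceTasks(2);
```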
