I am new to Hadoop.
Can you please explain the (key/value) pair? Is the value always one? Is the output of the reduce step always a (key/value) pair? If yes, how is that (key/value) data used further?
Please help me.
I guess you are asking about the 'one' value for the (key,values) pair due to the wordcount example in the Hadoop tutorials. So, the answer is no, it is not always 'one'.
Hadoop's implementation of MapReduce passes (key, value) pairs through the entire workflow, from the input to the output:
Map step: Generally speaking (there are special cases, depending on the input format), the mappers process the data in their assigned splits line by line. Each line is passed to the map method as a (key, value) pair: the key is the byte offset of the line within the file, and the value is the line itself. The mapper then emits zero or more (key, value) pairs whose meaning depends on the mapping function you implement. Sometimes it is a variable key and a fixed value (e.g. in wordcount, the key is the word and the value is always 'one'); other times the value is the length of the line, or the sum of all the words starting with a prefix, or whatever you can imagine; the key may be a word, a fixed custom key, and so on.
Reduce step: Typically, each call to the reducer receives one key together with the list of all the values the mappers produced for that key (a combiner, if you use one, may already have pre-aggregated some of those values, but this is the general picture). The reducer then produces another (key, value) pair at the output; again, its meaning depends on the logic of your application. Typically, the reducer is used to aggregate all the values belonging to the same key.
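To make the two steps above concrete, here is a minimal sketch in plain Java that imitates map, shuffle and reduce for wordcount in memory. It deliberately uses no Hadoop classes; the class name `WordCountSketch` and its methods are made up for illustration, and the offset calculation assumes single-byte characters.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of map -> shuffle -> reduce for wordcount (hypothetical names, no Hadoop).
final class WordCountSketch {

    // Map step: one (offset, line) record in, zero or more (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle step: collect the values of identical keys into one list, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: one key plus all of its values in, one total out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1; // +1 for the newline, assuming single-byte characters
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or", "not to be")));
    }
}
```

Running `main` prints `{be=2, not=1, or=1, to=2}`. The value emitted by `map` is 'one' only because that is what wordcount needs; replacing `1` with, say, `word.length()` changes the meaning of the pairs without changing the workflow.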
This is a very rough, quick and undetailed explanation; I encourage you to read the official documentation about it, or specialized literature such as this.
I assume you have started learning MapReduce with the wordcount example.
A key/value pair is the record entity that MapReduce accepts for execution. The InputFormat classes that read records from the source and the OutputFormat classes that commit the results operate only on records in key/value format.
The key/value format is the best-suited representation for records passing through the different stages of the map-partition-sort-combine-shuffle-merge-sort-reduce lifecycle of MapReduce. Please refer to:
http://www.thecloudavenue.com/2012/09/why-does-hadoop-uses-kv-keyvalue-pairs.html
The key/value data types can be anything. The Text/IntWritable key/value pair you used is a good fit for wordcount, but it can be anything, according to your requirements.
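As an illustration of "can be anything", here is a rough sketch of a custom value type. Hadoop value types implement the `Writable` interface, whose two methods are `write(DataOutput)` and `readFields(DataInput)`; the class below mirrors those two methods but does not actually implement the interface, so that it compiles without Hadoop on the classpath. The class name `CountAndLength`, its fields and the `roundTrip` helper are assumptions made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value type carrying two numbers instead of a single 'one'.
// It has the same two methods as Hadoop's Writable, but is plain JDK code.
final class CountAndLength {
    int count;
    long totalLength;

    CountAndLength() {
    }

    CountAndLength(int count, long totalLength) {
        this.count = count;
        this.totalLength = totalLength;
    }

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeLong(totalLength);
    }

    // Read the fields back in the same order they were written.
    void readFields(DataInput in) throws IOException {
        count = in.readInt();
        totalLength = in.readLong();
    }

    // Helper for the demo: serialize to bytes and read into a fresh object.
    static CountAndLength roundTrip(CountAndLength value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            value.write(new DataOutputStream(bytes));
            CountAndLength copy = new CountAndLength();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CountAndLength copy = roundTrip(new CountAndLength(3, 42L));
        System.out.println(copy.count + " " + copy.totalLength);
    }
}
```

In a real job the class would additionally declare `implements Writable` (and a key type would need to be comparable as well), but the serialization logic would look much the same.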
Kindly spend some time reading Hadoop: The Definitive Guide and the Yahoo tutorials to get a better understanding. Happy learning.
Related
I'm trying to understand how MapReduce actually works. Please read what I have written below and tell me if there are any missing parts or incorrect things in it.
Thank you.
The data is first split into what are called input splits (a logical grouping whose size we define according to our record-processing needs).
Then, there is a Mapper for every input split, which takes its input split and sorts it by key and value.
Then, there is the shuffling process, which takes all of the data from the mappers (key-values) and merges all identical keys with their values (its output is every key with its list of values). The shuffling process occurs in order to give the reducer one input per distinct key, together with its summed values.
Then, the Reducer merges all the key-values into one place (a page, maybe?), which is the final result of the MapReduce process.
We only have to make sure to define the Map (which always outputs key-values) and Reduce (final result: takes the key-value input and can count, sum, average, etc.) step code.
Your understanding is slightly wrong, especially about how the mapper works.
I found a very nice picture that explains it in simple terms.
It is similar to the wordcount program, where:
Each bundle of chocolates is an InputSplit, which is handled by a mapper. So we have 3 bundles.
Each chocolate is a word. One or more words (making a sentence) form a record, which is the input to a single map call. So, within one InputSplit there may be multiple records, and each record is passed to the mapper in turn.
The mapper counts the occurrences of each word (chocolate) and emits the count. Note that the mapper works on only one line (record) at a time. As soon as it is done, it picks the next record from the InputSplit. (2nd phase in the image)
Once the map phase is finished, sorting and shuffling take place to make a bucket of counts for each kind of chocolate. (3rd phase in the image)
One reduce call gets one bucket, with the name of the chocolate (or the word) as the key and a list of counts as the values. So there are as many reduce calls as there are distinct words in the whole input file; the number of reducer tasks is configured separately, and one task may handle many keys.
The reducer iterates through the counts and sums them up to produce the final count, which it emits against the word.
The diagram below shows how a single InputSplit of the wordcount program is processed:
Similar QA - Simple explanation of MapReduce?
Also, this post explains Hadoop (HDFS and MapReduce) in a very simple way: https://content.pivotal.io/blog/demystifying-apache-hadoop-in-5-pictures
I'm looking at solutions to a problem that involves reading keyed data from more than one file. In a single map step I need all the values for a particular key in the same place at the same time. I see in White's book the discussion about "the shuffle" and am tempted to wonder if when you come out of merging and the input to a reducer is sorted by key, if all the data for a key is there....if you can count on that.
The bigger picture is that I want to do a poor-man's triple-store federation, and the triples I want to load into an in-memory store don't all come from the same file. It's a vertical (?) partition where the values for a particular key are in different files. Said another way, the columns for a complete record each come from different files. Does Hadoop re-assemble that? ...at least for a single key at a time.
In short: yes. In a Hadoop job, the partitioner chooses which reducer receives which (key, value) pairs. Quote from the Yahoo tutorial section on partitioning: "It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same". This is also necessary for many of the types of algorithms typically solved with MapReduce (such as distributed sorting, which is what you're describing).
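A minimal in-memory sketch of what that guarantee buys you for the "columns in different files" case: if each mapper tags the value with the column (file) it came from, the reduce side sees all columns of a key together. This is plain Java, not Hadoop code; `ReassembleSketch`, the `{key, column, value}` array layout and the sample data are assumptions for illustration, and the map lookup stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Imitates how values for one key, emitted from different input files, meet in one place.
final class ReassembleSketch {

    // Each element is a mapper output {key, column, value}; the column names which file it came from.
    static Map<String, Map<String, String>> assemble(List<String[]> taggedPairs) {
        Map<String, Map<String, String>> records = new TreeMap<>();
        for (String[] pair : taggedPairs) {
            String key = pair[0];
            String column = pair[1];
            String value = pair[2];
            // Grouping by key plays the role of the shuffle: same key, same destination.
            records.computeIfAbsent(key, k -> new TreeMap<>()).put(column, value);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String[]> mapperOutput = new ArrayList<>(Arrays.asList(
                new String[] {"s1", "name", "Alice"},   // from a hypothetical names file
                new String[] {"s2", "name", "Bob"},
                new String[] {"s1", "age", "30"}));     // from a hypothetical ages file
        System.out.println(assemble(mapperOutput));
    }
}
```

In an actual job the inner map would be built inside one reduce call from the values iterator for that key, so only one key's columns need to fit in memory at a time.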
I am working on the parallelization of an algorithm, which roughly does the following:
1. Read several text documents with a total of 10k words.
2. Create an object for every word in the text corpus.
3. Create a pair between all word-objects (yes, O(n^2)), and return the most frequent pairs.
I would like to parallelize step 3 by creating the pairs between the first 1000 word-objects and the rest on the first machine, the second 1000 word-objects on the next machine, etc.
My question is how to pass the objects created in step 2 to the Mapper. As far as I am aware, I would require input files for this and hence would need to serialize the objects (though I haven't worked with this before). Is there a direct way to pass the objects to the Mapper?
Thanks in advance for the help
Evgeni
UPDATE
Thank you for reading my question. Serialization seems to be the best way to solve this (see java.io.Serializable). Furthermore, I have found this tutorial useful for reading data from serialized objects into Hadoop: http://www.cs.brown.edu/~pavlo/hadoop/
How about parallelizing all the steps? Use the text documents from step 1 as input to your Mapper. Create the object for every word in the Mapper. In the Mapper, your key-value pair will be the word-object pair (or object-word, depending on what you are doing). The Reducer can then count the unique pairs.
Hadoop will take care of bringing all the same keys together into the same Reducer.
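Roughly, the idea could look like the sketch below. It is plain Java rather than a Hadoop job, it treats each line as one document, and it uses the pair itself as the key; `PairCountSketch` and both method names are invented for the example, and your real pairing rule may well differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Imitates "mapper emits pairs, reducer counts them" in memory.
final class PairCountSketch {

    // Map step: emit one key per pair of word positions in the line.
    static List<String> mapPairs(String line) {
        String[] words = line.trim().split("\\s+");
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            for (int j = i + 1; j < words.length; j++) {
                // Order the two words so that (a,b) and (b,a) end up under the same key.
                boolean inOrder = words[i].compareTo(words[j]) <= 0;
                keys.add(inOrder ? words[i] + "," + words[j] : words[j] + "," + words[i]);
            }
        }
        return keys;
    }

    // Reduce step: sum a 1 for every occurrence of a pair key.
    static Map<String, Integer> countPairs(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String key : mapPairs(line)) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countPairs(Arrays.asList("a b c", "b a")));
    }
}
```

Picking the most frequent pairs is then a matter of sorting the reducer output by count, which can be done in a second, much smaller pass.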
Use Twitter's protobuf support (elephant-bird). Convert each word into a protobuf object and process it however you want. Protobufs are also much faster and lighter than default Java serialization. See Kevin Weil's presentation on this: http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
Is there a way in Hadoop to ensure that every reducer gets only one key that is output by the mapper ?
This question is a bit unclear to me, but I think I have a pretty good idea of what you want.
First of all, if you do nothing special, every time reduce is called it gets only one single key with a set of one or more values (via an iterator).
My guess is that you want to ensure that every reducer gets exactly one 'key-value pair'.
There are essentially two ways of doing that:
Ensure in the mapper that all keys that are output are unique, so that for each key there is only one value.
Force the reducer to do this by setting a group comparator that simply classifies all keys as different.
So, if I understand your question correctly, you should implement a GroupComparator that simply states that all keys are different and should therefore be sent to a different reducer call.
Because of other answers in this question I'm adding a bit more detail:
There are 3 methods used for comparing keys (I pulled these code samples from a project I did using the 0.18.3 API):
Partitioner
conf.setPartitionerClass(KeyPartitioner.class);
The partitioner is only there to ensure that "things that must be the same end up on the same partition". If you have only one reduce task there is only one partition, so this won't help much.
Key Comparator
conf.setOutputKeyComparatorClass(KeyComparator.class);
The key comparator is used to SORT the "key-value pairs" in a group by looking at the key ... which must be different somehow.
Group Comparator
conf.setOutputValueGroupingComparator(GroupComparator.class);
The group comparator is used to group keys that are different, yet must be sent to the same reduce call.
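To show what the group comparator changes, here is a small plain-Java imitation of the grouping step: walking over keys that are already sorted, a new reduce call starts whenever the comparator says the current key differs from the previous one. This is my own simplified model, not Hadoop code; `GroupingSketch`, `naturalKey` and the `word#n` key format are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Models how a grouping comparator decides where one reduce call ends and the next begins.
final class GroupingSketch {

    // Keys look like "natural#secondary" in this sketch; this returns the part before '#'.
    static String naturalKey(String compositeKey) {
        return compositeKey.substring(0, compositeKey.indexOf('#'));
    }

    // Consecutive keys that compare as equal share a group, i.e. one reduce call.
    static List<List<String>> group(List<String> sortedKeys, Comparator<String> grouping) {
        List<List<String>> groups = new ArrayList<>();
        String previous = null;
        for (String key : sortedKeys) {
            if (groups.isEmpty() || grouping.compare(previous, key) != 0) {
                groups.add(new ArrayList<>()); // comparator says "different": start a new call
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a#1", "a#2", "b#1");
        // Group on the natural key only: different keys, same reduce call.
        System.out.println(group(keys, Comparator.comparing((String k) -> naturalKey(k))));
        // "All keys are different": every pair gets its own reduce call.
        System.out.println(group(keys, (x, y) -> -1));
    }
}
```

The first call yields two groups (`a#1` and `a#2` together, `b#1` alone), the second yields three. Note that a comparator that never returns 0 does not satisfy the usual `Comparator` contract; it is used here only because that is the trick suggested above.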
HTH
You can get some control over which keys get sent to which reducers by implementing the Partitioner interface.
From the Hadoop API docs:
"Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction."
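For a feel of what "typically by a hash function" means, here is a tiny standalone sketch. To the best of my knowledge the formula is the same one Hadoop's default hash partitioner applies, but the class below is plain Java written for this example (`HashPartitionSketch` and `partitionFor` are invented names), not the Hadoop class itself.

```java
// Derives a partition number from a key's hash code, in the style of a hash partitioner.
final class HashPartitionSketch {

    static int partitionFor(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        for (String key : new String[] {"apple", "banana", "apple", "cherry"}) {
            // The same key always prints the same partition, whichever "mapper" emitted it.
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Because the result depends only on the key and the number of reduce tasks, several distinct keys will usually share a partition; the partitioner does not give each key its own reducer unless you write it to do so.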
The following book does a great job of describing partitioning, key sorting strategies and tradeoffs along with other issues in map reduce algorithm design: http://www.umiacs.umd.edu/~jimmylin/book.html
Are you sure you want to do this? Can you elaborate on your problem, so that I can understand why you want to do this?
You have to do two things, as mentioned in earlier answers
Write a partitioner such that each key gets associated with a unique reducer.
Ensure that the number of reducer slots in your cluster is greater than or equal to the number of unique keys you will have.
Pranab
My guess is the same as above: you can sort the keys if possible and try to assign them to reducers based on your partitioning criteria. Refer to the YouTube MapReduce lecture from UCB 61A (lecture 34); they talk about this stuff.
I'm thinking about building a small testing application in Hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want to have "The 10 worst values for each key" from my reducer function (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10) or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Does anyone have a code fragment on how to use this in the Hadoop 0.20 API?
This definitely sounds like a secondary sort problem. Take a look at "Hadoop: The Definitive Guide" from O'Reilly, if you like. You can also access it online. It describes a pretty good implementation.
I implemented it by myself too. Basically it works this way:
The partitioner makes sure that all the key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the GroupingComparator, which forms groupings. One group is passed as an iterator to one reduce() call, so a partition can contain multiple groupings, while the number of partitions equals the number of reducers. The grouping also ties in with sorting, as it is implemented through a compare method on the keys.
With this method, you can make sure that the 10 best/worst/highest/lowest (or whatever) keys reach the reducer first. So after you have read these 10 keys, you can leave the reduce method without any further iterations.
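Here is a compact in-memory sketch of that idea, under my own assumptions: the composite key is (key, score), sorting is by key first and score second, grouping is by key only, and "worst" means lowest score. It is plain Java, not the Hadoop API; `SecondarySortSketch`, `Scored` and `worstN` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Imitates secondary sort: values reach the "reducer" already ordered by score within each key.
final class SecondarySortSketch {

    static final class Scored {
        final String key;
        final int score;
        final String data;

        Scored(String key, int score, String data) {
            this.key = key;
            this.score = score;
            this.data = data;
        }
    }

    static Map<String, List<String>> worstN(List<Scored> records, int n) {
        List<Scored> sorted = new ArrayList<>(records);
        // Sort comparator over the composite key: natural key first, then score ascending.
        sorted.sort(Comparator.comparing((Scored s) -> s.key).thenComparingInt(s -> s.score));

        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Scored s : sorted) {
            // Grouping on the natural key only: one list per key, like one reduce call.
            List<String> kept = result.computeIfAbsent(s.key, k -> new ArrayList<>());
            if (kept.size() < n) {
                kept.add(s.data); // a real reducer could return once it has n values
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Scored> records = Arrays.asList(
                new Scored("k", 5, "e"), new Scored("k", 1, "a"),
                new Scored("k", 3, "c"), new Scored("j", 2, "x"));
        System.out.println(worstN(records, 2));
    }
}
```

The point of doing the ordering in the sort phase is that the reducer never has to hold all values of a key in memory; it only reads the first n and stops.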
Hope that was helpful :-)
It sounds like you want to use a Combiner, which defines what to do with the values you create on the map side before they are sent to the Reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs, <key, [set of {score, data}]>, does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
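The three steps above can be sketched for a single key as follows. This is a simplified plain-Java model, not Hadoop code: it keeps only the scores (the attached data is left out), treats the lowest scores as the "top" ones, and the names `TopKCombinerSketch`, `localTopK` and `mergeTopK` are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Combiner keeps a sorted local top-k per mapper; reducer merges those lists and stops after k.
final class TopKCombinerSketch {

    // Combiner side: sort locally and keep at most k scores.
    static List<Integer> localTopK(List<Integer> scores, int k) {
        List<Integer> sorted = new ArrayList<>(scores);
        Collections.sort(sorted);
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // Reducer side: merge step of sort-merge over already sorted lists, no re-sorting.
    static List<Integer> mergeTopK(List<List<Integer>> sortedLists, int k) {
        // Heap entries are {value, index of source list, position in that list}.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < sortedLists.size(); i++) {
            if (!sortedLists.get(i).isEmpty()) {
                heap.add(new int[] {sortedLists.get(i).get(0), i, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (merged.size() < k && !heap.isEmpty()) {
            int[] head = heap.poll();
            merged.add(head[0]);
            List<Integer> source = sortedLists.get(head[1]);
            int next = head[2] + 1;
            if (next < source.size()) {
                heap.add(new int[] {source.get(next), head[1], next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Integer> fromMapper1 = localTopK(Arrays.asList(9, 1, 5), 2);
        List<Integer> fromMapper2 = localTopK(Arrays.asList(4, 2, 8), 2);
        System.out.println(mergeTopK(Arrays.asList(fromMapper1, fromMapper2), 2));
    }
}
```

This only works when each value's rank can be judged locally; if the rank depends on all values of a key taken together, the early filtering in `localTopK` would throw away data that is still needed.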
update 2
So, now that we know that the rank is cumulative and that, as a result, you can't filter the data early by using combiners, the only thing to do is what you suggested: get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545 )
If I understand the question properly, you'll need to use a TotalOrderPartitioner.
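As I understand it, the idea behind a total-order partitioner is to assign key ranges, rather than hash buckets, to reducers, so that reducer 0 gets the smallest keys, reducer 1 the next range, and so on. The sketch below shows only that range lookup in plain Java; it is not the Hadoop class, and `RangePartitionSketch`, `partitionFor` and the split points are assumptions for illustration.

```java
import java.util.Arrays;

// Range partitioning: n sorted split points define n + 1 ordered partitions.
final class RangePartitionSketch {

    static int partitionFor(String key, String[] sortedSplitPoints) {
        int pos = Arrays.binarySearch(sortedSplitPoints, key);
        // Found: the key equals a split point and opens the next range.
        // Not found: binarySearch returns -(insertionPoint) - 1, so recover the insertion point.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"h", "p"}; // keys < "h", "h" <= keys < "p", keys >= "p"
        for (String key : new String[] {"apple", "h", "kiwi", "zebra"}) {
            System.out.println(key + " -> partition " + partitionFor(key, splitPoints));
        }
    }
}
```

With ranges like these, concatenating the reducer outputs in partition order gives a globally sorted result, which is the property a hash-based partitioner cannot provide.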