Hadoop sort input order

If the input to my job is the fileset [a, b, c, d], is the input to the sort strictly [map(a.0), map(a.1), map(b.0), map(b.1), map(c.0), map(c.1), map(d.0), map(d.1)]?
My motivation is having a series of files (which will of course be broken up into blocks) whose rows are [key, value], where key and value are each a simple string. I wish to concatenate these values together in the reducer per key, in the order they are present in the input, despite there being no explicit order-defining field.
Any advice much appreciated; this is proving to be a difficult query to Google for.
Example
Input format
A First
A Another
A Third
B First
C First
C Another
Desired output
A First,Another,Third
B First
C First,Another
To reiterate, I'm uncertain whether I can rely on getting First-Third in the correct order, given that the files are stored in separate blocks.

No, you have no guarantee that the values will be in that order using the standard data flow in Hadoop (i.e. the standard sorter, partitioner, and grouper). The only thing that is guaranteed is the order of the keys (A, B, C).
In order to achieve what you want, you have to write your own sorter and include the values (First, Second, Third) in the key, so the new keys will be:
"A First"
"A Second"
...
But the problem in this case is that these keys will end up in different partitions (it's very likely that the standard hash partitioner will distribute "A First" to one partition and "A Second" to another). To avoid this problem, you should also plug in your own partitioner, which uses only the first part of the key (i.e. A) to do the partitioning.
You should also define the grouper; otherwise "A First" and "A Second" will not be passed together to the same reduce call.
So the output of your map function should be:
"A First" First
"A Second" Second
...
In other words, the values output by the mapper should be left as they are; otherwise you won't be able to get the values in the reducer.
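A minimal sketch of the partitioner half, assuming the composite key is a Text like "A First" where the natural key is the first token (the class name is hypothetical):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitions on the natural key only (the first token of the composite
// key), so "A First" and "A Second" land in the same partition.
public class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String naturalKey = key.toString().split(" ", 2)[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

The grouping comparator follows the same pattern, returning 0 whenever two composite keys share the same natural key, so that all "A ..." keys reach a single reduce call together.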

One solution to this issue is to make use of TextInputFormat's byte offset in the file as part of a composite key, and use a secondary sort to make sure the values are sent to the reducer in order. That way you can make sure the reducer sees input partitioned by the key you want, in the order it came in the file. If you have multiple input files, this approach will not work, as each new file resets the byte counter.
With the streaming API you'll need to pass -inputformat TextInputFormat -D stream.map.input.ignoreKey=false to the job so that you actually get the byte offsets as the key (by default the PipeMapper won't give you keys if the input format is TextInputFormat, even if you explicitly set the TextInputFormat flag, so you need to set the additional ignoreKey flag).
If you're emitting multiple keys from a mapper, be sure to set the following flags so your output is partitioned on the first key and sorted on the first and second in the reducer:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-D stream.num.map.output.key.fields=2
-D mapred.text.key.partitioner.options="-k1,1"
-D mapred.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator"
-D mapreduce.partition.keycomparator.options="-k1 -k2n"
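Assembled into a full invocation, the submission might look like the following (the streaming jar path, input/output paths, and script names are hypothetical and vary by distribution; note that the -D generic options must precede the other streaming options):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -D stream.map.input.ignoreKey=false \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.text.key.partitioner.options="-k1,1" \
    -D mapred.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator" \
    -D mapreduce.partition.keycomparator.options="-k1 -k2n" \
    -inputformat TextInputFormat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /path/to/input \
    -output /path/to/output \
    -mapper my_mapper.py \
    -reducer my_reducer.py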

Related

Hadoop strange behaviour: reduce function doesn't get all values for a key

In my Hadoop project, I am reading lines of a text file, each line containing a number of names. The first name is my username, and the rest are a list of friends.
Then, in the map function, I am creating (username, friend) pairs; each pair has a key "Key[name1][name2]", where name1 and name2 are the username and the friend name, ordered alphabetically.
Normally, after reading the line of userA and the line of userB, if they both have each other in their friends lists, I would get two identical keys with different values, in this case: KeyUserAUserB : "UserA,UserB" and KeyUserAUserB : "UserB,UserA".
What I expect in the reduce function is to get, at some point, KeyUserAUserB as a key and the pair "UserA,UserB", "UserB,UserA" as values, so the values iterator would have two elements.
However, in the reducer function, I get KeyUserAUserB twice, each time with a single value. This is not what I am expecting from Hadoop...
I also noticed in my userlogs that I have four "m" folders, and in the first two of them I have the logs which helped me identify the above. In both "m" logs the output (System.out) of the map function is intertwined with the output of the reduce function. I don't know if that has anything to do with my anomaly, but I expected the reduce output to stay in the "r" folder.
Also, for the above example, one log entry for KeyUserAUserB is printed in one "m" log file, and the other KeyUserAUserB in the other. Although in some cases a KeyUserAUserB does reach the reducer with both values, I found at least one case where it never comes with both values (and those two key-value pairs with identical keys reside in different "m" log files).
Another thing I noticed: the output collected from the reduce function isn't sent directly to the output file, but is passed again as input to the same reduce function...
What do you think about this behavior, what can be the possible causes?
Finally: the whole unexpected behavior was because I was using the reducer class as the combiner class. After commenting out that line, everything worked as expected.
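For reference, the fix amounts to removing one line in the job setup; a minimal sketch with hypothetical class names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FriendPairsJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "friend pairs");
        job.setJarByClass(FriendPairsJob.class);
        job.setMapperClass(PairMapper.class);    // hypothetical mapper
        // The culprit: a combiner runs on the map side, so using the
        // reducer here merges values before the shuffle, which explains
        // the reduce-style output showing up in the "m" task logs.
        // job.setCombinerClass(PairReducer.class);
        job.setReducerClass(PairReducer.class);  // hypothetical reducer
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}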

Hadoop GroupingComparator class purpose

I'm implementing a join between two datasets A and B by a String key, which is the name attribute. I need to match similar names in this join.
My first thought, given that I was implementing a secondary sort to get the values extracted from database A before the values from database B, was to create a grouping comparator class and, instead of using the compareTo method to group values by the natural key, use a string similarity algorithm. But it has not worked as expected. See my code below.
public class StringSimilarityGroupingComparator extends WritableComparator {

    protected StringSimilarityGroupingComparator() {
        super(JoinKeyTagPairWritable.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        JoinKeyTagPairWritable k1 = (JoinKeyTagPairWritable) w1;
        JoinKeyTagPairWritable k2 = (JoinKeyTagPairWritable) w2;
        StringSimilarityMatcher nameMatcher = new StringSimilarityMatcher(
                StringSimilarityMatcher.NAME_MATCH);
        return nameMatcher.match(k1.getJoinKey(), k2.getJoinKey()) ? 0
                : k1.getJoinKey().compareTo(k2.getJoinKey());
    }
}
This approach makes total sense to me. Where was I mistaken? Isn't this the purpose of overriding the grouping comparator class?
EDIT:
I'm aware that I need to write a custom partitioner to guarantee that similar keys are sent to the same reducer, but as I'm dealing with a relatively small database the job can run fine with only one reducer.
To clarify the problem I'm facing, I ran the job with an identity reducer to expose which keys are being grouped together; I'm emitting the key and the dataset tag. Here is a sample of the output:
Ricardo 0
Ricardo 1
Ricardo 1
Ricardo Beguer 1
END OF REDUCE METHOD
Ricardo Castro 1
END OF REDUCE METHOD
Ricardo S.(Gueguel) 1
Ricardo Silva 1
END OF REDUCE METHOD
Ricardo tsubasa 1
Ricardo! 1
RicardoRoale 1
END OF REDUCE METHOD
All these names match using my algorithm, but they are not being grouped together. I don't understand why this is happening, since I don't know how MapReduce uses my grouping comparator class to group keys.
The dataset tagged with 0 is the left database of the join; hence, I expect all the similar names from dataset 1 to be grouped with a name from dataset 0.
Can you explain how MapReduce does this grouping? Is it done after the sort, and iteratively?
I've seen many people talk about set-similarity (e.g. this paper) when dealing with the problem of matching similar names, but this approach seems simpler and also efficient, since names are not large strings, the matching is done by the grouping comparator class, and only one job is needed.
Thanks in advance!
You didn't describe the way your solution wasn't working correctly, but from what you've shown I can make a few suggestions.
The first issue I see is that you don't guarantee that similar names get sent to the same reducer. For example, I'd hope that "Chris" and "Christopher" would compare as being the same in your name matcher, but you don't guarantee that keys "Chris" get sent to the same reducer as keys "Christopher". If you use the default partitioner then it's quite possible that "Chris" with hashcode 65087095 gets assigned a different reducer than does "Christopher" with hashcode 1731528407.
I suggest, for both correctness and performance, that you normalize each name in the mapper. So where your mappers might write:
"Christopher" -> value
you'd instead have them write:
"Chris" -> "Christopher" + original value
Where "Chris" is the normalized form of all similar names ("Chris", "Christopher", "Christophe", etc.). In this way, the default partitioners and groupers would work correctly and you'd get the grouping you want with the passed key/value data you want.
You may also be dealing with a more difficult issue: a name like "Chris" might actually be similar to two names that aren't themselves similar, like "Christopher" and "Christine". If this is the case, things get really gnarly. A solution is still possible, but you may need more information (e.g. gender) to normalize the name, or you may have to accept missed matches. I can elaborate if this is the situation you've got.
--EDIT--
To address your clarification... There are two comparisons applied to the key/value pairs before they're passed to the reducer. The first sorts the keys; if no grouper is specified, the reducer is called once per unique key value. If a grouper is specified, it is used only to compare adjacent keys (per the first sort) to see if they should be passed into the same reduce call.
For example, say A1 and A2 are to be considered the same key (e.g. similar names) but B is not similar to A1 or A2. If the sorter were to sort the keys as
A1 A2 B
then the reducer would be called twice, once for A1 and A2 and again for B. If, however, the sort managed to produce the key sequence:
A1 B A2
then the reducer would be called three times, because the grouping comparator only compares A1 with B and then B with A2.
This means that the grouping comparator really should compare strings the same way the sorter does, just treating more keys as equal.
To use your example above, the grouper seems to be comparing "Ricardo Beguer" and "Ricardo Castro" and finding them not similar. Even though "Ricardo" might be similar to "Ricardo Castro", those two are never compared.
Can you test all names against each other to see if any pair is not similar?
I think Chris is right. The main rules you are likely violating are:
If A < B (via sort) then A <= B (via grouper)
If A = B (via sort) then A = B (via grouper)
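Sketched below is one way to keep a sorter/grouper pair consistent with those rules, assuming Text keys and a hypothetical normalize() that maps similar names to one canonical form; the grouper uses the same comparison as the sorter, minus the tiebreak:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sort comparator: primary order by normalized form, full key as the
// tiebreak, so keys that should group together are always adjacent.
class NormalizedSortComparator extends WritableComparator {
    protected NormalizedSortComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String na = normalize(((Text) a).toString());
        String nb = normalize(((Text) b).toString());
        int cmp = na.compareTo(nb);
        return cmp != 0 ? cmp : ((Text) a).compareTo((Text) b);
    }
    static String normalize(String s) {
        // Hypothetical canonicalization, e.g. first token, lowercased.
        return s.toLowerCase().split(" ")[0];
    }
}

// Grouping comparator: the same comparison without the tiebreak, so
// adjacent keys with the same normalized form fall into one reduce call.
class NormalizedGroupingComparator extends WritableComparator {
    protected NormalizedGroupingComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return NormalizedSortComparator.normalize(((Text) a).toString())
                .compareTo(NormalizedSortComparator.normalize(((Text) b).toString()));
    }
}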

"Map" and "Reduce" functions in Hadoop's MapReduce

I've been looking at this word count example by hadoop:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code
And I'm a little confused about the map function. In the map function shown, it takes in a "key" of type LongWritable, but this parameter is never used in the body of the map function. What does the application programmer expect Hadoop to pass in for this key? Why does a map function require a key if it simply parses values from a line of text or something? Can someone give me an example where both a key and a value are required as input? I only see map as V1 -> (K2, V2).
Another question: in the real implementation of Hadoop, are there multiple reduction steps? If so, how does Hadoop apply the same reduction function multiple times if the function is (K2, V2) -> (K3, V3)? If another reduction is performed, it needs to take in type (K3, V3)...
Thank you!
There's a key there because the map() method is always passed a key and a value (and a context). It's up to you whether you actually use the key and/or value. In this case, the key is the byte offset of the line within the file being read (not, strictly speaking, a line number). The word count logic doesn't need that. The map() method just uses the value, which in the case of a text file is a line of the file.
As to your second question (which really should be its own Stack Overflow question), you may have any number of map/reduce jobs in a Hadoop workflow. Some of those jobs read pre-existing files as input, and others read the output of other jobs. Each job has one or more mappers and a single reduce phase (which may itself run as multiple reducer tasks).
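For illustration, the word-count mapper has roughly this shape (a new-API sketch of the tutorial's code):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // 'offset' (the byte offset of this line in the file) is ignored.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}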

Getting output files which contain the value of one key only?

I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some python streaming code:
import sys

# Streaming reducer: drop the key, emit only the value.
for line in sys.stdin:
    data = line.rstrip("\n").split("\t")
    print data[1]
This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.
However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order, and I could simply do a sequential scan and append to files. But it seems that this shouldn't be necessary, since Hadoop already sorts and partitions the keys for you.
Question may not be the clearest, so I'll clarify if anyone has any comments. Thanks
OK, then create a custom jar implementation of your MapReduce solution and use MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the filename (in your case, the key) as the key in your reducer and the entire payload as the value, and your data will be written to a file named after your key; see the sketch below.
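A minimal sketch with the old mapred API, where MultipleTextOutputFormat lives (the subclass name is hypothetical):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to an output file named after its key, so every
// file holds the values of exactly one key.
public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString();
    }
}

You would then register it on the JobConf with conf.setOutputFormat(KeyBasedOutputFormat.class).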

Two file comparison in HDFS

I want to write a MapReduce job to compare two large files in HDFS. Any thoughts on how to achieve that? Or is there any other way to do the comparison? The file sizes are very large, so I thought MapReduce would be an ideal approach.
Thanks for your help.
You may do this in two steps.
First, make the line number part of the text files:
Say the initial file looks like:
I am awesome
He is my best friend
Now, convert it to something like this:
1,I am awesome
2,He is my best friend
This may well be done by a MapReduce job itself or by some other tool.
Second, write a MapReduce step in which the mapper emits the line number as the key and the rest of the actual sentence as the value. Then in the reducer just compare the values; whenever they don't match, emit the line number (the key) and the payloads, whatever you want here. Also, if the count of values for a key is just 1, that is a mismatch too.
EDIT: A better approach
Better still, just emit each complete line read in the mapper as the key, and make the value a number, say 1. So taking my example above, your mapper output would be as follows:
< I am awesome,1 >
< He is my best friend,1 >
And in the reducer just check the count of values: if it isn't 2, you have a mismatch.
But there is one catch in this approach: if the same line can occur at two different places within a file, then instead of checking that the number of values for a given key is exactly 2 in the reducer, you should check that it is a multiple of 2.
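A sketch of that better approach, assuming both files are supplied as the job's input (class names are hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: the whole line is the key, the value is simply 1.
class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, ONE);
    }
}

// Reduce: a line present once in each file arrives with a count of 2;
// anything else is reported as a mismatch.
class DiffReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text line, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        if (total != 2) {
            context.write(line, new IntWritable(total));
        }
    }
}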
One possible solution could be to emit the line number as the value in the map job.
Suppose the two files look like below:
File 1:
I am here -- Line 1
I am awesome -- Line 2
You are my best friend -- Line 3
File 2 is of a similar kind.
Now your map job output should be like <I am here, 1>, <I am awesome, 2>, and so on. Once you are done with the map job for both files, you have two (key, value) records with the same key arriving at the same reduce call.
At reduce time, you can either compare the line numbers or generate output of the form <line, lineNumber>, and so on. If the line also exists at a different location in the other file, the output could be <line, lineNumber1, lineNumber2>, which indicates that this line is a mismatch.
I have a solution for comparing files with keys. In your case, if you know that your IDs are unique, you could emit the IDs as keys in the map and the entire record as the value. Let's say your file rows look like ID,Line1; then emit ID as the key and Line1 as the value from the mapper.
In the shuffle and sort phase, the IDs will be sorted and you will get an iterator with data from both files; i.e., the records from both files with the same ID will end up in the same iterator.
Then in the reducer, compare both values from the iterator, and if they match, move on to the next record. Else, if they do not match, write them to the output.
I have done this and it worked like a charm.
Scenario 1 - No matching key
If there is no matching ID between the two files, the iterator will have only one value.
Scenario 2 - Duplicate keys
If the files have duplicate keys, the iterator will have more than 2 values.
Note: You should compare the values only when the iterator has only 2 values.
Tip: The iterator will not always have its values in order. To identify which value came from which file, in the mapper add a small indicator at the end of the line, like Line1;file1 and Line1;file2.
Then in the reducer you will be able to identify which value belongs to which file.
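A sketch of this keyed comparison, using the input split's file name as the indicator instead of appending it by hand (the ID,rest record layout and class names are assumptions):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: emit the ID as the key; tag the rest of the record with the name
// of the file it came from.
class IdTagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        String[] parts = line.toString().split(",", 2);
        context.write(new Text(parts[0]), new Text(parts[1] + ";" + file));
    }
}

// Reduce: compare only when exactly two values arrived (one per file);
// otherwise report the key as unmatched or duplicated.
class IdCompareReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text id, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> records = new ArrayList<String>();
        for (Text v : values) {
            records.add(v.toString());
        }
        if (records.size() == 2) {
            String a = records.get(0).split(";")[0];
            String b = records.get(1).split(";")[0];
            if (!a.equals(b)) {
                context.write(id, new Text(a + " <> " + b));
            }
        } else {
            context.write(id, new Text("unmatched or duplicate key"));
        }
    }
}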
