Hadoop GroupingComparator class purpose

I'm implementing a join between two datasets A and B by a String key, which is the name attribute. I need to match similar names in this join.
Since I was already implementing a secondary sort to get the values extracted from database A before the values from database B, my first thought was to create a grouping comparator class and, instead of using the compareTo method to group values by the natural key, use a string similarity algorithm. It has not worked as expected. See my code below.
public class StringSimilarityGroupingComparator extends WritableComparator {

    protected StringSimilarityGroupingComparator() {
        super(JoinKeyTagPairWritable.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        JoinKeyTagPairWritable k1 = (JoinKeyTagPairWritable) w1;
        JoinKeyTagPairWritable k2 = (JoinKeyTagPairWritable) w2;
        StringSimilarityMatcher nameMatcher = new StringSimilarityMatcher(
                StringSimilarityMatcher.NAME_MATCH);
        // Similar names compare as equal; otherwise fall back to natural ordering
        return nameMatcher.match(k1.getJoinKey(), k2.getJoinKey()) ? 0
                : k1.getJoinKey().compareTo(k2.getJoinKey());
    }
}
This approach makes total sense to me. Where was I mistaken? Isn't this the purpose of overriding the grouping comparator class?
EDIT:
I'm aware that I need to write a custom partitioner to guarantee that similar keys are sent to the same reducer, but as I'm dealing with a relatively small database the job can run fine with only one reducer.
To clarify the problem I'm facing, I ran the job with an identity reducer to expose which keys are being grouped together; I'm emitting the key and the dataset tag. Here is a sample of the output:
Ricardo 0
Ricardo 1
Ricardo 1
Ricardo Beguer 1
END OF REDUCE METHOD
Ricardo Castro 1
END OF REDUCE METHOD
Ricardo S.(Gueguel) 1
Ricardo Silva 1
END OF REDUCE METHOD
Ricardo tsubasa 1
Ricardo! 1
RicardoRoale 1
END OF REDUCE METHOD
All these names match using my algorithm, but they are not being grouped together. I don't understand why this is happening, since I don't know how MapReduce uses my grouping comparator class to group keys.
The dataset tagged with 0 is the left side of the join; hence, I expect all the similar names from dataset 1 to be grouped with a name from dataset 0.
Can you explain how MapReduce does this grouping? Does it happen after the sort, comparing keys iteratively?
I've seen many people talk about set-similarity (e.g. this paper) when dealing with the problem of matching similar names, but this approach seems simpler and also efficient, since names are not large strings and the matching is done by the grouping comparator class and only one job is needed.
Thanks in advance!

You didn't describe exactly how your solution is failing, but from what you've shown I can make a few suggestions.
The first issue I see is that you don't guarantee that similar names get sent to the same reducer. For example, I'd hope that "Chris" and "Christopher" would compare as being the same in your name matcher, but you don't guarantee that keys "Chris" get sent to the same reducer as keys "Christopher". If you use the default partitioner then it's quite possible that "Chris" with hashcode 65087095 gets assigned a different reducer than does "Christopher" with hashcode 1731528407.
I suggest, for correctness and performance both, that you try to normalize each name in the mapper, so where your mappers might write:
"Christopher" -> value
you'd instead have them write:
"Chris" -> "Christopher" + original value
Where "Chris" is the normalized form of all similar names ("Chris", "Christopher", "Christophe", etc.). In this way, the default partitioners and groupers would work correctly and you'd get the grouping you want with the passed key/value data you want.
You may also be dealing with a more difficult issue and that is that a name like "Chris" might actually be similar to two names that aren't themselves similar, like "Christopher" and "Christine". If this is the case then things get really gnarly. A solution is still possible, but you may need more information (e.g. gender) to normalize the name or you may have to accept missed matches. I can elaborate if this is the situation you've got.
--EDIT--
To address your clarification... There are two comparisons applied to the key/value pairs before they're passed to the reducer. The first sorts the keys; if no grouper is specified, the reducer is called once per unique key value. If a grouper is specified, then the grouper is only used to compare "adjacent keys" (per the first sort) to see if they should be passed into the same reducer call.
For example, say A1 and A2 are to be considered the same key (e.g. similar names) but B is not similar to A1 or A2. If the sorter were to sort the keys as
A1 A2 B
then the reducer would be called twice, once for A1 and A2 and again for B. If, however, the sort managed to produce the key sequence:
A1 B A2
then the reducer would be called three times. The grouping comparator only compares A1 with B and then B with A2.
This means that the grouping comparator really should compare keys the same way the sort comparator does, only more coarsely, treating more keys as equal.
To use your example above, the grouper seems to be comparing "Ricardo Beguer" and "Ricardo Castro" and finding them not similar. Even though "Ricardo" might be similar to "Ricardo Castro", those two are never compared.
Can you test all names against each other to see if any pair is not similar?

I think Chris is right. The main rules you're likely violating are:
If A < B (via sort) then A <= B (via grouper)
If A = B (via sort) then A = B (via grouper)
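A sketch of a pairing that satisfies both rules, assuming JoinKeyTagPairWritable's natural sort order compares the join key first and then the tag (the classic secondary-sort setup; class and method names follow the question's code):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class JoinKeyGroupingComparator extends WritableComparator {

    protected JoinKeyGroupingComparator() {
        super(JoinKeyTagPairWritable.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        JoinKeyTagPairWritable k1 = (JoinKeyTagPairWritable) w1;
        JoinKeyTagPairWritable k2 = (JoinKeyTagPairWritable) w2;
        // Coarser than the sort: equal whenever the join keys are equal,
        // regardless of the dataset tag. Because the sort already orders by
        // (joinKey, tag), every group is a contiguous run of the sort output.
        return k1.getJoinKey().compareTo(k2.getJoinKey());
    }
}

Fuzzy matching breaks rule 1 because similarity is not transitive: the sort can place a non-matching key between two matching ones, as the Ricardo output above shows.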

Related

Can I use mapreduce with a pair of Keys and a pair of values?

My question is theoretical: I'm trying to design a MapReduce example for big data processing.
The case I have requires a pair of keys to be mapped to a pair of values.
For example, if we have the text below:
"Bachelors in Engineering has experience of 5 years"
I am trying to count the words Engineering & Experience in a way where I would have a value for each word separately.
So my key would be (Engineering,Experience) and my value would be (1,1) for the text example above.
Note that there is a relationship between the two keys in my homework, therefore I want them both in one key-value set, to determine whether both keys are mentioned in one text file, only one key is mentioned, or none is mentioned.
Please let me know whether the above case is possible to do in MapReduce.
Having a string key of "(Engineering,Experience)" is no different than just having a String of one of those words.
If you want some more custom type, then you will want to implement the Writable and maybe the WritableComparable interfaces.
Similarly, for the value, you could put the entire tuple in as Text and parse it later, or you can create your own Writable implementation that can store two integers.
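A minimal sketch of such a value type, holding the two counts (the class name is mine):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class IntPairWritable implements Writable {
    private int first;   // e.g. count for "Engineering"
    private int second;  // e.g. count for "Experience"

    public IntPairWritable() { }  // no-arg constructor required by Hadoop

    public IntPairWritable(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    @Override
    public String toString() {
        return first + "," + second;
    }
}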
Thanks for the answer, but I figured I could use "Engineering Experience" as a string for the key.

Hashing table design in C

I have a design issue regarding a hash function.
In my program I am using a hash table of size 2^13, where the slot is calculated based on the value of the node (the hash key) which I want to insert.
Now, say each of my nodes has two values |A|B|, but I am inserting into the hash table using A.
Later on, I want to search for a particular node by B, not A.
Is it possible to do it that way? If yes, could you highlight some design approaches?
The constraint is that I have to use A as the hash key.
Sorry, I can't share the code. Small example:

value[] = {part1, part2, part3};

insert(value)
    check_for_index(value.part1)  // value.part1 is used to calculate the slot index
    // once the slot is found, insert the whole "value"

Later on:

search_in_hash(part2)
    check_for_index(...)  // but here I need value.part1 to find the slot index

So, how can I relate part1, part2 and part3 such that later on I can find the slot by either part2 or part3?
If the problem statement is vague, kindly let me know.
Unless you intend to do an element-by-element search (in which case you don't need a hash, just a plain list), what you are basically asking for is a hash such that hash(X) == hash(Y) but X != Y, so that you could map to a slot using part1 and then map to the same slot using part2 or part3. That goes completely against what hashing stands for.
What you should do instead (as viraptor also suggested) is create 3 structures, each hashed using a different part of the value, and push the full value into all 3. Then, when you need to search, use the proper hash for the part you want to search by.
For example:
value[] = {part1, part2, part3};
hash1.insert(part1, value)
hash2.insert(part2, value)
hash3.insert(part3, value)
then
hash2.search_in_hash(part2)
or
hash3.search_in_hash(part3)
The above 2 should produce the exact same values.
Also make sure that all data manipulations (removing values, changing them) are done on all 3 structures simultaneously. For example:
value = hash2.search_in_hash(part2)
hash1.remove(value.part1)
hash2.remove(part2) // you can assert that part2 == value.part2
hash3.remove(value.part3)
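In sketch form (Java here for brevity; in C the equivalent is three hash tables whose slots store pointers to the same struct), the design looks like this:

import java.util.HashMap;
import java.util.Map;

class Record {
    final String part1, part2, part3;
    Record(String p1, String p2, String p3) { part1 = p1; part2 = p2; part3 = p3; }
}

class MultiIndex {
    // One index per searchable part; all three point at the same Record.
    private final Map<String, Record> byPart1 = new HashMap<>();
    private final Map<String, Record> byPart2 = new HashMap<>();
    private final Map<String, Record> byPart3 = new HashMap<>();

    void insert(Record r) {
        byPart1.put(r.part1, r);
        byPart2.put(r.part2, r);
        byPart3.put(r.part3, r);
    }

    Record searchByPart2(String part2) { return byPart2.get(part2); }
    Record searchByPart3(String part3) { return byPart3.get(part3); }

    void remove(Record r) {
        // Keep all three indexes in sync, as noted above.
        byPart1.remove(r.part1);
        byPart2.remove(r.part2);
        byPart3.remove(r.part3);
    }
}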

two file comparison in hdfs

I want to write a MapReduce job to compare two large files in HDFS. Any thoughts on how to achieve that, or any other way to do the comparison? Because the file size is very large, I thought MapReduce would be an ideal approach.
Thanks for your help.
You may do this in 2 steps.
First, make the line number part of the text files:
Say initial file looks like:
I am awesome
He is my best friend
Now, convert this to something like this:
1,I am awesome
2,He is my best friend
This may well be done by a MapReduce job itself or some other tool.
Second, write a MapReduce step in which the mapper emits the line number as the key and the rest of the sentence as the value. Then in the reducer just compare the values; whenever they don't match, emit the line number (the key) and whatever payload you want. Also, if there is only one value for a key, that is a mismatch as well.
EDIT: Better approach
Better still, just emit each complete line read in the mapper as the key, and make the value a number, say 1. Taking my example above, your mapper output would be as follows:
< I am awesome,1 >
< He is my best friend,1 >
And in the reducer just check the count of the values: if it isn't 2, you have a mismatch.
But there is one catch in this approach: if exactly the same line can occur at two different places within a file, then instead of checking that the number of values for a given key is 2, you should check that it is a multiple of 2.
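A minimal sketch of this approach (class names are mine; both files are fed to the same job):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: the whole line becomes the key, a count of 1 the value.
class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, ONE);
    }
}

// Reducer: a line present in both files contributes exactly two values;
// any other count means a mismatch, so emit the line.
class MismatchReducer extends Reducer<Text, IntWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text line, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        // Per the catch above: use (total % 2 != 0) instead if duplicate
        // lines can occur within a single file.
        if (total != 2) {
            context.write(line, NullWritable.get());
        }
    }
}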
One possible solution could be to put the line number in as a count in the map job.
There are two files, like below:
File 1:
I am here --Line 1
I am awesome -- Line 2
You are my best friend -- Line 3
File 2 is of a similar kind.
Now your map job output should be like <I am here, 1>, <I am awesome, 2>, ...
Once you are done with the map job for both files, you have two (key, value) records with the same key arriving at the same reduce call.
At reduce time, you can either compare the counters or generate whatever output you need; if the same line exists at different locations in the two files, the output could indicate that the line is a mismatch.
I have a solution for comparing files with keys. In your case, if you know that your IDs are unique, you could emit the IDs as keys in the map and the entire record as the value. Let's say your file has records like ID,Line1; then emit the ID as the key and the whole record as the value from the mapper.
In the shuffle and sort phase, the IDs will be sorted and you will get an iterator with data from both files; i.e., the records from both files with the same ID will end up in the same iterator.
Then in the reducer, compare both values from the iterator; if they match, move on to the next record, else write them to an output.
I have done this and it worked like a charm.
Scenario 1 - No matching key
If an ID has no match between the two files, its iterator will have only one value.
Scenario 2 - Duplicate keys
If the files have duplicate keys, the iterator will have more than 2 values.
Note: You should compare the values only when the iterator has only 2 values.
Tip: The iterator will not always have its values in order. To identify the value from a particular file, in the mapper add a small indicator at the end of the line, like Line1;file1
Line1;file2
Then in the reducer you will be able to identify which value came from which file.
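A hedged sketch of the reduce side of this, assuming the mapper emitted the ID as the key and the record with a trailing ";file1" or ";file2" tag as the value, per the tip above (the class name is mine):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class CompareByIdReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text id, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> records = new ArrayList<>();
        for (Text v : values) {
            records.add(v.toString());
        }
        // Scenario 1 (no matching key) or scenario 2 (duplicate keys):
        // only compare when the iterator holds exactly two values.
        if (records.size() != 2) {
            context.write(id, new Text("UNMATCHED, count=" + records.size()));
            return;
        }
        // Strip the ";fileN" tag before comparing the payloads.
        String a = records.get(0).substring(0, records.get(0).lastIndexOf(';'));
        String b = records.get(1).substring(0, records.get(1).lastIndexOf(';'));
        if (!a.equals(b)) {
            context.write(id, new Text(records.get(0) + " <> " + records.get(1)));
        }
    }
}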

Hadoop sort input order

If the input to my job is the fileset [a, b, c, d], is the input to the sort strictly [map(a.0), map(a.1), map(b.0), map(b.1), map(c.0), map(c.1), map(d.0), map(d.1)]?
My motivation is having a series of files (which will of course be broken up into blocks) whose rows are [key, value], where the key and value are each a simple string. I wish to concatenate these values together in the reducer per key, in the order they appear in the input, despite there being no explicit order-defining field.
Any advice much appreciated; this is proving to be a difficult query to Google for.
Example
Input format
A First
A Another
A Third
B First
C First
C Another
Desired output
A First,Another,Third
B First
C First,Another
To reiterate, I'm uncertain whether I can rely on getting First-Third in the correct order, given that the files are stored in separate blocks.
No, you have no guarantee that the values will be in that order using the standard data flow in Hadoop (i.e the standard sorter, partitioner, grouper). The only thing which is guaranteed is the order of the keys (A, B, C).
To achieve what you want, you have to write your own sorter and include the values (First, Second, Third) in the key, so the new keys will be:
"A First"
"A Second"
...
But the problem in this case is that these keys will end up in different partitions (it's very likely that the standard hash partitioner will distribute "A first" to one partition and "A second" to another one), so to avoid this problem you should also plug in your own partitioner, which will use only the first part of the key (i.e. A) to do the partitioning.
You should also define the grouper, otherwise "A first" and "A second" will not be passed together to the same reduce call.
So the output of your map function should be :
"A First" First
"A Second" Second
...
In other words, the values output by the mapper should be left as they are; otherwise you won't be able to get the values in the reducer.
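A hedged sketch of the partitioner and grouper (class names are mine; assumes Text keys of the form "A First" with the natural key as the first token):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the first token only, so "A First" and "A Third"
// land in the same partition.
class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String naturalKey = key.toString().split(" ", 2)[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the first token only, so all "A ..." keys reach one reduce call.
class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String ka = a.toString().split(" ", 2)[0];
        String kb = b.toString().split(" ", 2)[0];
        return ka.compareTo(kb);
    }
}

Note that with plain Text keys the composite part sorts alphabetically ("A Another" before "A First"), not in file order; the byte-offset approach in the next answer is what recovers the original input order.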
One solution to this issue is to make use of the TextInputFormat's byte offset in the file as part of a composite key, and use a secondary sort to make sure the values are sent to the reducer in order. That way you can make sure the reducer sees input partitioned by the key you want, in the order it came in the file. If you have multiple input files, then this approach will not work, as each new file will reset the byte counter.
With the streaming API you'll need to pass -inputformat TextInputFormat -D stream.map.input.ignoreKey=false to the job so that you actually get the byte offsets as the key (by default the PipeMapper won't give you keys if the input format is TextInputFormat, even if you explicitly set the TextInputFormat flag, so you need to set the additional ignoreKey flag).
If you're emitting multiple keys from a mapper, be sure to set the following flags so your output is partitioned on the first key and sorted on the first and second in the reducer:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-D stream.num.map.output.key.fields=2
-D mapred.text.key.partitioner.options="-k1,1"
-D mapred.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator"
-D mapreduce.partition.keycomparator.options="-k1 -k2n"

MapReduce sorted iterator

I am reading the source code of MapReduce to gain more understanding of its internal mechanism, and I have a problem trying to understand how the data produced in the map phase is merged and sent to the reduce function for further processing. The source code looks too complicated and I just want to understand the concepts.
What I want to know is how the values (the Iterator parameter) are sorted before being passed to the reduce() function. Within ReduceTask.runOldReducer() it creates a ReduceValuesIterator by passing a RawKeyValueIterator, where Merger.merge() gets called and lots of actions are performed (e.g. collecting segments). After reading the code, it seems to me that it only sorts by key, and the values accompanying that key are aggregated/collected without any being removed. For instance, map() may produce
Key Value
http://www.abcfood.com/aLink object A
http://www.abcfood.com/bLink object B
http://www.abcfood.com/cLink object C
Then in reduce(),
Key will be http://www.abcfood.com/ and Values will contain object A, object B, and object C.
So it is sorted by the key http://www.abcfood.com/? Is this correct? Or what is sorted and then passed to the reduce function?
Many thanks.
Assuming this is your input:
Key Value
http://www.example.com/asd object A
http://www.abcfood.com/aLink object A
http://www.abcfood.com/bLink object B
http://www.abcfood.com/cLink object C
http://www.example.com/t1 object X
the reducer will get this (there is no guarantee on the order of the values):
Key Values
http://www.abcfood.com/ [ "object A", "object C", "object B" ]
http://www.example.com/ [ "object X", "object A" ]
So is there any way to get ordered values in the reducer?
I need to work with sorted values (to calculate the difference between values passed with a key). I've hit this problem :)
http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
I understand that it's bad to COPY the values in the reducer and then sort them; I could get a memory overflow. It would be better to sort the values in some way BEFORE passing the key + Iterable to the reducer.
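That is exactly what a secondary sort does: move the value (or the part of it you want ordered) into a composite key, sort on the whole key, but partition and group on the natural key only. A hedged sketch of the job wiring, where the three plug-in class names are hypothetical stand-ins for the pattern shown in the answers above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SortedValuesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sorted-values");
        // Composite key = (natural key, value part to order by).
        job.setPartitionerClass(NaturalKeyPartitioner.class);       // partition on the natural key
        job.setSortComparatorClass(CompositeKeyComparator.class);   // sort on (natural key, value)
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // group on the natural key
        // ... set mapper/reducer/input/output classes and paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With that in place, the values arrive at each reduce call already ordered, without copying them into memory first.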
