I am reading the source code of MapReduce to gain a better understanding of MapReduce's internal mechanisms, and I am having trouble understanding how the data produced in the map phase is merged and sent to the reduce function for further processing. The source code looks too complicated, and I just want to understand its concepts.
What I want to know is how the values (the Iterator parameter) are sorted before being passed to the reduce() function. Within ReduceTask.runOldReducer() a ReduceValuesIterator is created by passing a RawKeyValueIterator, in which Merger.merge() gets called and lots of actions are performed (e.g. collecting segments). After reading the code, it seems to me it only tries to sort by key, and the values accompanying that key are aggregated/collected without being removed. For instance, map() may produce
Key Value
http://www.abcfood.com/aLink object A
http://www.abcfood.com/bLink object B
http://www.abcfood.com/cLink object C
Then in reduce(),
Key will be http://www.abcfood.com/ and Values will contain object A, object B, and object C.
So is it sorted by the key http://www.abcfood.com/? Is this correct? Or by what is it sorted before being passed to the reduce function?
Many thanks.
Assuming this is your input:
Key Value
http://www.example.com/asd object A
http://www.abcfood.com/aLink object A
http://www.abcfood.com/bLink object B
http://www.abcfood.com/cLink object C
http://www.example.com/t1 object X
the reducer will get this (there is no guarantee on the order of the values):
Key Values
http://www.abcfood.com/ [ "object A", "object C", "object B" ]
http://www.example.com/ [ "object X", "object A" ]
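To make that concrete, here is a minimal old-API reducer sketch (the class name and the comma-joining are placeholders for illustration, not code from the question) showing how each key arrives once, together with an iterator over all of its values:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer: for each key (e.g. "http://www.abcfood.com/"), the
// framework makes ONE reduce call with an iterator over ALL of that key's values.
public class SiteReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        StringBuilder joined = new StringBuilder();
        while (values.hasNext()) {
            // Keys arrive sorted; the values for a key arrive in no guaranteed order.
            if (joined.length() > 0) joined.append(", ");
            joined.append(values.next().toString());
        }
        output.collect(key, new Text(joined.toString()));
    }
}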
So is there any way to get ordered values in the reducer?
I need to work with sorted values (to calculate differences between the values passed with a key). I've run into this problem :)
http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
I understand that it's bad to COPY the values in the reducer and then order them, as I could get a memory overflow. It would be better to sort the values in some way BEFORE the KEY + Iterable is passed to the reducer.
I have a lot of javascript objects like:
var obj1 = {"key1" : value1, "key2" : value2, ...}
var obj2 = {"key3" : value3, "key4" : value4, ...}
and so on...
Following are the two approaches :
Store each object as Redis Hash i.e. one-to-one mapping.
Have one Redis Hash (bucketing can be done for better performance) and store each object as a stringified object in one key of the hash, i.e. one key-value pair in the Redis Hash per object. Parse the object when we need to use it.
1) -> Takes more space than 2) but has better performance than 2)
2) -> Takes less space than 1) but has worse performance than 1)
Is there a way to determine which approach would be better in the long run?
Update: This data is used on the client side (AngularJS), so all parsing of stringified JSON is done in the frontend.
This would probably be solved by deciding which method minimises the number of steps required to extract the required data from Redis.
Case 1: Lots of nested objects
If your objects have a lot of nesting, i.e. objects within objects, like this,
obj = {key1: {key2: value1, key3: {key4: value2}}}
You should probably stringify and store them.
This is because Redis does not allow nesting of data structures: you can't store a hash within another hash.
Storing the name of hash2 as a key within hash1, then querying hash2 after getting hash1, and so on, is unnecessarily complex and requires a lot of queries. In this case all you have to do is get the entire string from Redis and JSON.parse it, and you can get whatever data you want from the object.
Case 2: No nested objects.
But on the other hand, if there is no nesting of objects and you store them as strings, you have to JSON.parse() every time you get the data from Redis, and parsing JSON is blocking and CPU intensive. See Node.js: does JSON.parse block the event loop?
The Redis documentation also says that hashes are encoded in a very small space, so you should try representing your data using hashes whenever possible. http://redis.io/topics/memory-optimization
So, in this case, you could probably go ahead and store them all as individual hashes as querying a particular value will be a lot easier.
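To make the two approaches concrete, here is a rough sketch in Java (matching the rest of this thread's code) using the Jedis client, with Gson standing in for the JSON (de)serialization; the key names "obj:1" and "objects" are assumptions for illustration only:

import java.util.HashMap;
import java.util.Map;

import com.google.gson.Gson;
import redis.clients.jedis.Jedis;

public class RedisStorageSketch {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost"); // assumed local Redis instance
        Map<String, String> obj1 = new HashMap<>();
        obj1.put("key1", "value1");
        obj1.put("key2", "value2");

        // Approach 1: one Redis hash per object (field-level access, more keys).
        for (Map.Entry<String, String> e : obj1.entrySet()) {
            jedis.hset("obj:1", e.getKey(), e.getValue());
        }
        String oneField = jedis.hget("obj:1", "key1"); // read a single field directly

        // Approach 2: one big hash, each object stored as a JSON string
        // (fewer keys, but the whole object must be parsed on every read).
        Gson gson = new Gson();
        jedis.hset("objects", "1", gson.toJson(obj1));
        Map<?, ?> parsed = gson.fromJson(jedis.hget("objects", "1"), Map.class);

        jedis.close();
    }
}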
Update:
Even if the JSON parsing is done on the client, try not to do extra computation needlessly :)
But nested objects are easier to store and query as a string; otherwise, you'll have to query more than one hash table. In that case storing them as stringified objects might just be better for performance.
Redis stores small hashes very efficiently, so much so that storing multiple small hashmaps can be more memory efficient than one big hashmap.
The number of fields below which a hash uses the compact encoding is configured in redis.conf:
hash-max-zipmap-entries 512
and the maximum size of each value for that encoding is also configurable:
hash-max-zipmap-value 64
So you can now decide based on the nesting of your objects, the number of hash fields below which Redis is more memory efficient, and the size of the values assigned to your keys.
Do go through http://redis.io/topics/memory-optimization
I'm implementing a join between two datasets A and B by a String key, which is the name attribute. I need to match similar names in this join.
My first thought, given that I was implementing a secondary sort to get the values extracted from database A before the values from database B, was to create a grouping comparator class and, instead of using the compareTo method to group values by the natural key, use a string similarity algorithm. But it has not worked as expected. See my code below.
public class StringSimilarityGroupingComparator extends WritableComparator {

    protected StringSimilarityGroupingComparator() {
        super(JoinKeyTagPairWritable.class, true);
    }

    public int compare(WritableComparable w1, WritableComparable w2) {
        JoinKeyTagPairWritable k1 = (JoinKeyTagPairWritable) w1;
        JoinKeyTagPairWritable k2 = (JoinKeyTagPairWritable) w2;

        StringSimilarityMatcher nameMatcher = new StringSimilarityMatcher(
                StringSimilarityMatcher.NAME_MATCH);

        return nameMatcher.match(k1.getJoinKey(), k2.getJoinKey()) ? 0
                : k1.getJoinKey().compareTo(k2.getJoinKey());
    }
}
This approach makes total sense to me. Where was I mistaken? Isn't this the purpose of overriding the grouping comparator class?
EDIT:
I'm aware that I need to write a custom partitioner to guarantee that similar keys are sent to the same reducer, but as I'm dealing with a relatively small database the job can run fine with only one reducer.
To clarify the problem I'm facing, I ran the job with an identity reducer to expose which keys are being grouped together; I'm emitting the key and the dataset tag. Here is a sample of the output:
Ricardo 0
Ricardo 1
Ricardo 1
Ricardo Beguer 1
END OF REDUCE METHOD
Ricardo Castro 1
END OF REDUCE METHOD
Ricardo S.(Gueguel) 1
Ricardo Silva 1
END OF REDUCE METHOD
Ricardo tsubasa 1
Ricardo! 1
RicardoRoale 1
END OF REDUCE METHOD
All these names match using my algorithm, but they are not being grouped together. I don't understand why this is happening, since I don't know how MapReduce uses my grouping comparator class to group keys.
The dataset tagged with 0 is the left database of the join; hence, I expect all the similar names from dataset 1 to be grouped with a name from dataset 0.
Can you explain how MapReduce does this grouping? Is it done after the sort, iterating over the sorted keys?
I've seen many people talk about set-similarity (e.g. this paper) when dealing with the problem of matching similar names, but this approach seems simpler and also efficient, since names are not large strings and the matching is done by the grouping comparator class and only one job is needed.
Thanks in advance!
You didn't describe the way your solution wasn't working correctly, but from what you've shown I can make a few suggestions.
The first issue I see is that you don't guarantee that similar names get sent to the same reducer. For example, I'd hope that "Chris" and "Christopher" would compare as being the same in your name matcher, but you don't guarantee that keys "Chris" get sent to the same reducer as keys "Christopher". If you use the default partitioner then it's quite possible that "Chris" with hashcode 65087095 gets assigned a different reducer than does "Christopher" with hashcode 1731528407.
I suggest, for both correctness and performance, that you try to normalize each name in the mapper, so that where your mappers might write:
"Christopher" -> value
you'd instead have them write:
"Chris" -> "Christopher" + original value
Where "Chris" is the normalized form of all similar names ("Chris", "Christopher", "Christophe", etc.). In this way, the default partitioners and groupers would work correctly and you'd get the grouping you want with the passed key/value data you want.
You may also be dealing with a more difficult issue and that is that a name like "Chris" might actually be similar to two names that aren't themselves similar, like "Christopher" and "Christine". If this is the case then things get really gnarly. A solution is still possible, but you may need more information (e.g. gender) to normalize the name or you may have to accept missed matches. I can elaborate if this is the situation you've got.
--EDIT--
To address your clarification... there are two comparisons applied to the key/value pairs before they're passed to the reducer. The first sorts the keys; if no grouping comparator is specified, the reducer is called once per unique key value. If a grouping comparator is specified, it is used only to compare adjacent keys (per the first sort) to decide whether they should be passed into the same reducer call.
For example, say A1 and A2 are to be considered the same key (e.g. similar names) but B is not similar to A1 or A2. If the sorter were to sort the keys as
A1 A2 B
then the reducer would be called twice, once for A1 and A2 and again for B. If, however, the sort managed to produce the key sequence:
A1 B A2
then the reducer would be called three times; the grouping comparator only compares A1 with B and then B with A2.
This means that the grouping comparator really should compare keys the same way the sorting comparator does, only treating more of them as equal.
To use your example above, the grouper seems to be comparing "Ricardo Beguer" and "Ricardo Castro" and finding them not similar. Even though "Ricardo" might be similar to "Ricardo Castro", those two are never compared.
Can you test all names against each other to see if any pair is not similar?
I think Chris is right. The main rules that you are likely violating are:
If A < B (via sort) then A <= B (via grouper)
If A = B (via sort) then A = B (via grouper)
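To illustrate those rules, here is a minimal sketch of a consistent pair of comparators. It does not use your JoinKeyTagPairWritable; it assumes a plain Text composite key of the form naturalKey + tab + secondaryPart, where the sort compares the full key and the grouper compares only the natural-key prefix:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sort comparator: orders by the full "natural\tsecondary" key.
class FullKeyComparator extends WritableComparator {
    protected FullKeyComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return a.toString().compareTo(b.toString());
    }
}

// Grouping comparator: compares ONLY the part before the tab. Any two keys
// equal here are adjacent after the sort above (assuming natural keys contain
// no tabs), so both rules hold: the grouper agrees with the sort but treats
// more keys as equal.
class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true);
    }

    private static String naturalPart(WritableComparable key) {
        String s = key.toString();
        int tab = s.indexOf('\t');
        return tab < 0 ? s : s.substring(0, tab);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return naturalPart(a).compareTo(naturalPart(b));
    }
}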
I have a list of objects that are returned as a sequence, and I would like to retrieve the keys of each object so as to be able to display the object correctly. At the moment I try data?first?keys, which seems to get something like the queries that return the objects (not sure how to explain that last sentence either, but the image below shows what I'm trying to explain).
The number of objects returned is correct (7), but displaying the keys for each object is my aim. The macro that attempts this is here (from the Apache OFBiz development book, chapter 8).
It seems my sequence is a list of hashes, and as explained by Daniel Dekany in this post:
The original problem is that, someHash[key] expects a
string as key. Because, the hash type of FTL, by definition, maps
string keys to arbitrary values. It's not the same as Java's Map.
(Note that to further complicate the matters, in FTL
someSequenceOrString[index] expects an integer index. So, the [] thing
is used for that too.) Now someBeanWrappedMap(key) has technically
nothing to do with all the []-s, it's just a method call, so it
accepts all kind of keys. If you have a Map with non-string keys, you
must use that.
Thanks D Dekany, if you're on Stack Overflow; this ended my half-day of frustration with the FTL template.
I am currently sorting values by key the following way
thrust::sort_by_key(thrust::device_ptr<int>(keys),
                    thrust::device_ptr<int>(keys + numKeys),
                    thrust::device_ptr<int>(values));
which sorts the "values" array according to "keys".
Is there a way to leave the "values" array untouched and instead store the result of sorting "values" in a separate array?
Thanks in advance.
There isn't a direct way to do what you are asking. You have two options to functionally achieve the same thing.
The first is to make a copy of the values array before the call, leaving you with sorted and unsorted versions of the original data. So your example becomes
thrust::device_vector<int> values_sorted(thrust::device_ptr<int>(values),
                                         thrust::device_ptr<int>(values + numKeys));

thrust::sort_by_key(thrust::device_ptr<int>(keys),
                    thrust::device_ptr<int>(keys + numKeys),
                    values_sorted.begin());
The second alternative is not to pass the values array to the sort at all. Thrust has a very useful permutation iterator which allows for seamless permuted access to an array without modifying the order in which that array is stored (an iterator-based gather operation, if you will). To do this, create an index vector and sort that by key instead, then instantiate a permutation iterator with that sorted index, something like
typedef thrust::device_vector<int>::iterator iit;

thrust::device_vector<int> index(thrust::make_counting_iterator(int(0)),
                                 thrust::make_counting_iterator(int(numKeys)));
thrust::sort_by_key(thrust::device_ptr<int>(keys),
                    thrust::device_ptr<int>(keys + numKeys),
                    index.begin());

thrust::permutation_iterator<thrust::device_ptr<int>, iit>
    perm(thrust::device_ptr<int>(values), index.begin());
Now perm will return the values in the key-sorted order held by index, without ever changing the order of the original data.
[standard disclaimer: all code written in browser, never compiled or tested. Use at own risk]
If the input to my job is the fileset [a, b, c, d], is the input to the sort strictly [map(a.0), map(a.1), map(b.0), map(b.1), map(c.0), map(c.1), map(d.0), map(d.1)]?
My motivation is having a series of files (which will of course be broken up into blocks) whose rows are [key, value]; where each of key and value are a simple string. I wish to concatenate these values together in the reducer per key in the order they are present in the input, despite there not being an explicit order-defining field.
Any advice much appreciated; this is proving to be a difficult query to Google for.
Example
Input
A First
A Another
A Third
B First
C First
C Another
Desired output
A First,Another,Third
B First
C First,Another
To reiterate, I'm uncertain whether I can rely on getting First through Third in the correct order, given that the files are stored in separate blocks.
No, you have no guarantee that the values will be in that order using the standard data flow in Hadoop (i.e. the standard sorter, partitioner, and grouper). The only thing that is guaranteed is the order of the keys (A, B, C).
In order to achieve what you want, you have to write your own sorter and include the values (First, Second, Third) in the key, so that the new keys will be:
"A First"
"A Second"
...
But the problem in this case is that these keys will end up in different partitions (it's very likely that the standard hash partitioner will send "A First" to one partition and "A Second" to another one), so to avoid this problem you should also plug in your own partitioner, which uses only the first part of the key (i.e. A) to do the partitioning.
You should also define the grouper, otherwise "A First" and "A Second" will not be passed together to the same reduce call.
So the output of your map function should be:
"A First" First
"A Second" Second
...
In other words, the values output by the mapper should be left as they are; otherwise you won't be able to access the values in the reducer.
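As a sketch of such a partitioner for the old API (assuming a plain Text composite key like "A First", where everything before the first space is the natural key):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes "A First" and "A Second" to the same reducer by hashing only the
// natural key ("A"), not the appended value part.
public class NaturalKeyPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) {
    }

    public int getPartition(Text key, Text value, int numPartitions) {
        String composite = key.toString();
        int space = composite.indexOf(' ');
        String natural = space < 0 ? composite : composite.substring(0, space);
        return (natural.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}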
One solution to this issue is to use the TextInputFormat's byte offset in the file as part of a composite key, and use a secondary sort to make sure the values are sent to the reducer in order. That way you can make sure the reducer sees input partitioned by the key you want, in the order it came in the file. If you have multiple input files, then this approach will not work, as each new file will reset the byte counter.
With the streaming API you'll need to pass -inputformat TextInputFormat -D stream.map.input.ignoreKey=false to the job so that you actually get the byte offsets as the key (by default the PipeMapper won't give you keys if the input format is TextInputFormat, even if you explicitly set the TextInputFormat flag, so you need to set the additional ignoreKey flag).
If you're emitting multiple keys from a mapper, be sure to set the following flags so your output is partitioned on the first key and sorted on the first and second in the reducer:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-D stream.num.map.output.key.fields=2
-D mapred.text.key.partitioner.options="-k1,1"
-D mapred.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator"
-D mapreduce.partition.keycomparator.options="-k1 -k2n"