Creating composite key class for Secondary Sort - hadoop

I am trying to create a composite key class of a String uniqueCarrier and int month for Secondary Sort. Can anyone tell me, what are the steps for the same.

Looks like you have an equality problem since you're not using uniqueCarrier in your compareTo method. You need to use uniqueCarrier in your compareTo and equals methods (also define an equals method). From the java lang reference
The natural ordering for a class C is said to be consistent with equals if and only if e1.compareTo(e2) == 0 has the same boolean value as e1.equals(e2) for every e1 and e2 of class C. Note that null is not an instance of any class, and e.compareTo(null) should throw a NullPointerException even though e.equals(null) returns false.
You can also implement a RawComparator so that you can compare them without deserializing for some faster performance.
However, I recommend (as I always do) to not write things like Secondary Sort yourself. These have been implemented (as well as dozens of other optimizations) in projects like Pig and Hive. E.g. if you were using Hive, all you need to write is:
SELECT ...
FROM my_table
ORDER BY month, carrier;
The above is a lot simpler to write than trying to figure out how to write Secondary Sorts (and eventually when you need to use it again, how to do it in a generic fashion). MapReduce should be considered a low level programming paradigm and should only be used (IMHO) when you need high performance optimizations that you don't get from higher level projects like Pig or Hive.
EDIT: Forgot to mention about Grouping comparators, see Matt's answer

Your compareTo() implementation is incorrect. You need to sort first on uniqueCarrier, then on month to break equality:
#Override
public int compareTo(CompositeKey other) {
if (this.getUniqueCarrier().equals(other.getUniqueCarrier())) {
return this.getMonth().compareTo(other.getMonth());
} else {
return this.getUniqueCarrier().compareTo(other.getUniqueCarrier());
}
}
One suggestion though: I typically choose to implement my attributes directly as Writable types if possible (for example, IntWriteable month and Text uniqueCarrier). This allows me to call write and readFields directly on them, and also use their compareTo. Less code to write is always good...
Speaking of less code, you don't have to call the parent constructor for your composite key.
Now for what is left to be done:
My guess is you are still missing a hashCode() method, which should only return the hash of the attribute you want to group on, in this case uniqueCarrier. This method is called by the default Hadoop partitionner to distribute work across reducers.
I would also write custom GroupingComparator and SortingComparator to make sure grouping happens only on uniqueCarrier, and that sorting behaves according to CompositeKey compareTo():
public class CompositeGroupingComparator extends WritableComparator {
public CompositeGroupingComparator() {
super(CompositeKey.class, true);
}
#Override
public int compare(WritableComparable a, WritableComparable b) {
CompositeKey first = (CompositeKey) a;
CompositeKey second = (CompositeKey) b;
return first.getUniqueCarrier().compareTo(second.getUniqueCarrier());
}
}
public class CompositeSortingComparator extends WritableComparator {
public CompositeSortingComparator()
{
super (CompositeKey.class, true);
}
#Override
public int compare (WritableComparable a, WritableComparable b){
CompositeKey first = (CompositeKey) a;
CompositeKey second = (CompositeKey) b;
return first.compareTo(second);
}
}
Then, tell your Driver to use those two:
job.setSortComparatorClass(CompositeSortingComparator.class);
job.setGroupingComparatorClass(CompositeGroupingComparator.class);
Edit: Also see Pradeep's suggestion of implementing RawComparator to prevent having to unmarshall to an Object each time, if you want to optimize further.

Related

How to work inline with custom IEqualityComparer<T> parameters

Several time I needed to call linq distincts from different IEnumerables.
These distincts often need criteria that I use just once through the software.
I found really annoying the constraint to create a class that implements the IEqualityComparer with the codebase to perform the distinct, so I thought to cover the gap creating a generic class that allows to point to a lambda expression passed as a parameter of the distinct.
In order to pass a custom IEqualityComparer parameters I developed the following class:
public class InlineComparer<T>
{
private class LambdaBasedComparer : IEqualityComparer<T>
{
public LambdaBasedComparer(Func<T, int> getHashCode)
{
fGetHashCode = getHashCode;
}
public bool Equals(T x, T y)
{
return x?.GetHashCode() == y?.GetHashCode();
}
private Func<T, int> fGetHashCode { get; set; }
public int GetHashCode(T obj)
{
return fGetHashCode(obj);
}
}
public static IEqualityComparer<T> GetComparer(Func<T, int> getHashCode)
{
return new LambdaBasedComparer(getHashCode);
}
}
What do you think about it? I hope it may be helpful!
Of course, a complete implementation of this helper makes use of this class into an extension method similar to "IEnumerable.Distinct(Func getHashCode)", but I wanted to highlight the possibility to work with lambdas to pass the distinct code.
Your equality comparer will declare two objects to be equal if they return the same value for GetHashCode(). Of course, when defining your own Equality comparer you are free to define the concept of equality any way you want, as long as your equality is reflexive, symmetric and transitive (x==x; if x==y then y==x; if x==y and y==x, then x==z).
Your equality comparer fits these rules, so you can use it.
However! Will it be a useful comparer?
You want to use a special equality comparer instead of the default equality comparer because you want some special definition of (un)equality of two objects.
Normally, during your design process you should first define equality of your objects. If you've done that and you want to use your LambdaBasedComparer you'll have to create a Hash Function that will return different values for different objects.
Normally hash functions have one requirement: two equal objects should return the same hash value. There is no requirement upon two different objects.
There are only Int32.MaxValue different Hash values, so if you've designed a class with more than this value possible unequal instances you can't use your comparer. An easy example: try to create a LambdaBasedComparer<long> that uses normal equality.
But even if your class can only create half of the Int32.MaxValue instances, it will be very difficult to create a proper hash function that will generate unique hash values for different objects.
Finally your equality will not be very intuitive if you use it to compare derived classes. Consider class Person and derived classes Employee and Customer.
IEqualityComparer<Person> personComparer = new LambdBasedComparer<Person>(...);
Person p = new Person(...);
Person e = new Employee(...);
Person c = new Customer(...);
Now I can say that a certain Person who isn't an Employee can equal one of your Employees. But would you ever say that Employees will equal Customers?
Summarized: you think that you have a simple solution for your comparers, but it will be very difficult to create a proper hash function for your definition of equality, and it will probably be even more difficult to test this hash function

JAVA 8 Extract predicates as fields or methods?

What is the cleaner way of extracting predicates which will have multiple uses. Methods or Class fields?
The two examples:
1.Class Field
void someMethod() {
IntStream.range(1, 100)
.filter(isOverFifty)
.forEach(System.out::println);
}
private IntPredicate isOverFifty = number -> number > 50;
2.Method
void someMethod() {
IntStream.range(1, 100)
.filter(isOverFifty())
.forEach(System.out::println);
}
private IntPredicate isOverFifty() {
return number -> number > 50;
}
For me, the field way looks a little bit nicer, but is this the right way? I have my doubts.
Generally you cache things that are expensive to create and these stateless lambdas are not. A stateless lambda will have a single instance created for the entire pipeline (under the current implementation). The first invocation is the most expensive one - the underlying Predicate implementation class will be created and linked; but this happens only once for both stateless and stateful lambdas.
A stateful lambda will use a different instance for each element and it might make sense to cache those, but your example is stateless, so I would not.
If you still want that (for reading purposes I assume), I would do it in a class Predicates let's assume. It would be re-usable across different classes as well, something like this:
public final class Predicates {
private Predicates(){
}
public static IntPredicate isOverFifty() {
return number -> number > 50;
}
}
You should also notice that the usage of Predicates.isOverFifty inside a Stream and x -> x > 50 while semantically the same, will have different memory usages.
In the first case, only a single instance (and class) will be created and served to all clients; while the second (x -> x > 50) will create not only a different instance, but also a different class for each of it's clients (think the same expression used in different places inside your application). This happens because the linkage happens per CallSite - and in the second case the CallSite is always different.
But that is something you should not rely on (and probably even consider) - these Objects and classes are fast to build and fast to remove by the GC - whatever fits your needs - use that.
To answer, it's better If you expand those lambda expressions for old fashioned Java. You can see now, these are two ways we used in our codes. So, the answer is, it all depends how you write a particular code segment.
private IntPredicate isOverFifty = new IntPredicate<Integer>(){
public void test(number){
return number > 50;
}
};
private IntPredicate isOverFifty() {
return new IntPredicate<Integer>(){
public void test(number){
return number > 50;
}
};
}
1) For field case you will have always allocated predicate for each new your object. Not a big deal if you have a few instances, likes, service. But if this is a value object which can be N, this is not good solution. Also keep in mind that someMethod() may not be called at all. One of possible solution is to make predicate as static field.
2) For method case you will create the predicate once every time for someMethod() call. After GC will discard it.

Avoiding duplicate code when performing operation on different object properties

I have recently run into a problem which has had me thinking in circles. Assume that I have an object of type O with properties O.A and O.B. Also assume that I have a collection of instances of type O, where O.A and O.B are defined for each instance.
Now assume that I need to perform some operation (like sorting) on a collection of O instances using either O.A or O.B, but not both at any given time. My original solution is as follows.
Example -- just for demonstration, not production code:
public class O {
int A;
int B;
}
public static class Utils {
public static void SortByA (O[] collection) {
// Sort the objects in the collection using O.A as the key. Note: this is custom sorting logic, so it is not simply a one-line call to a built-in sort method.
}
public static void SortByB (O[] collection) {
// Sort the objects in the collection using O.B as the key. Same logic as above.
}
}
What I would love to do is this...
public static void SortAgnostic (O[] collection, FieldRepresentation x /* some non-bool, non-int variable representing whether to chose O.A or O.B as the sorting key */) {
// Sort by whatever "x" represents...
}
... but creating a new, highly-specific type that I will have to maintain just to avoid duplicating a few lines of code seems unnecessary to me. Perhaps I am incorrect on that (and I am sure someone will correct me if that statement is wrong :D), but that is my current thought nonetheless.
Question: What is the best way to implement this method? The logic that I have to implement is difficult to break down into smaller methods, as it is already fairly optimized. At the root of the issue is the fact that I need to perform the same operation using different properties of an object. I would like to stay away from using codes/flags/etc. in the method signature if possible so that the solution can be as robust as possible.
Note: When answering this question, please approach it from an algorithmic point of view. I am aware that some language-specific features may be suitable alternatives, but I have encountered this problem before and would like to understand it from a relatively language-agnostic viewpoint. Also, please do not constrain responses to sorting solutions only, as I have only chosen it as an example. The real question is how to avoid code duplication when performing an identical operation on two different properties of an object.
"The real question is how to avoid code duplication when performing an identical operation on two different properties of an object."
This is a very good question as this situation arises all the time. I think, one of the best ways to deal with this situation is to use the following pattern.
public class O {
int A;
int B;
}
public doOperationX1() {
doOperationX(something to indicate which property to use);
}
public doOperationX2() {
doOperationX(something to indicate which property to use);
}
private doOperationX(input ) {
// actual work is done here
}
In this pattern, the actual implementation is performed in a private method, which is called by public methods, with some extra information. For example, in this case, it can be
doOperationX(A), or doOperationX(B), or something like that.
My Reasoning: In my opinion this pattern is optimal as it achieves two main requirements:
It keeps the public interface descriptive and clear, as it keeps operations separate, and avoids flags etc that you also mentioned in your post. This is good for the client.
From the implementation perspective, it prevents duplication, as it is in one place. This is good for the development.
A simple way to approach this I think is to internalize the behavior of choosing the sort field to the class O itself. This way the solution can be language-agnostic.
The implementation in Java could be using an Abstract class for O, where the purpose of the abstract method getSortField() would be to return the field to sort by. All that the invocation logic would need to do is to implement the abstract method to return the desired field.
O o = new O() {
public int getSortField() {
return A;
}
};
The problem might be reduced to obtaining the value of the specified field from the given object so it can be use for sorting purposes, or,
TField getValue(TEntity entity, string fieldName)
{
// Return value of field "A" from entity,
// implementation depends on language of choice, possibly with
// some sort of reflection support
}
This method can be used to substitute comparisons within the sorting algorithm,
if (getValue(o[i], "A")) > getValue(o[j], "A"))
{
swap(i, j);
}
The field name can then be parametrized, as,
public static void SortAgnostic (O[] collection, string fieldName)
{
if (getValue(collection[i], fieldName)) > getValue(collection[j], fieldName))
{
swap(i, j);
}
...
}
which you can use like SortAgnostic(collection, "A").
Some languages allow you to express the field in a more elegant way,
public static void SortAgnostic (O[] collection, Expression fieldExpression)
{
if (getValue(collection[i], fieldExpression)) >
getValue(collection[j], fieldExpression))
{
swap(i, j);
}
...
}
which you can use like SortAgnostic(collection, entity => entity.A).
And yet another option can be passing a pointer to a function which will return the value of the field needed,
public static void SortAgnostic (O[] collection, Function getValue)
{
if (getValue(collection[i])) > getValue(collection[j]))
{
swap(i, j);
}
...
}
which given a function,
TField getValueOfA(TEntity entity)
{
return entity.A;
}
and passing it like SortAgnostic(collection, getValueOfA).
"... but creating a new, highly-specific type that I will have to maintain just to avoid duplicating a few lines of code seems unnecessary to me"
That is why you should use available tools like frameworks or other typo of code libraries that provide you requested solution.
When some mechanism is common that mean it can be moved to higher level of abstraction. When you can not find proper solution try to create own one. Think about the result of operation as not part of class functionality. The sorting is only a feature, that why it should not be part of your class from the beginning. Try to keep class as simple as possible.
Do not worry premature about the sense of having something small just because it is small. Focus on the final usage of it. If you use very often one type of sorting just create a definition of it to reuse it. You do not have to necessary create a utill class and then call it. Sometimes the base functionality enclosed in utill class is fair enough.
I assume that you use Java:
In your case the wheal was already implemented in person of Collection#sort(List, Comparator).
To full fill it you could create a Enum type that implement Comparator interface with predefined sorting types.

Can we sort an IList partially?

IList<A_Desc,A_premium,B_Desc,B_Premium>
Can I sort two columns A_Desc,A_premium...based on A_Desc ?
And let B_Desc,B_Premium be remain in same order before sorting
First off, a list can only be of one type, and only has one "column" of data, so you actually want two lists and a data type that holds "desc" and "premium". "desc" sounds like a String to me; I don't know what Premium is, but I'll pretend it's a double for lack of better ideas. I don't know what this data is supposed to represent, so to me, it's just some thingie.
public class Thingie{
public String desc;
public double premium;
}
That is, of course, a terrible way to define the class- I should instead have desc and premium be private, and Desc and Premium as public Properties with Get and Set methods. But this is the fastest way for me to get the point across.
It's more canonical to make Thingie implement IComparable, and compare itself to other Thingie objects. But I'm editing an answer I wrote before I knew you needed to write a custom type, and had the freedom to just make it implement IComparable. So here's the IComparer approach, which lets you sort objects that don't sort themselves by telling C# how to sort them.
Implement an IComparer that operates over your custom type.
public class ThingieSorter: IComparer<Thingie>{
public int Compare(Thingie t1, Thingie t2){
int r = t1.desc.CompareTo(t2);
if(r != 0){return r;}
return t1.premium.CompareTo(t2);
}
}
C# doesn't require IList to implement Sort- it might be inefficient if it's a LinkedList. So let's make a new list, based on arrays, which does sort efficiently, and sort it:
public List<Thingie> sortedOf(IList<Thingie> list){
List<Thingie> ret = new List<Thingie>(list);
ret.sort(new ThingieSorter());
return ret;
}
List<Thingie> implements the interface IList<Thingie>, so replacing your original list with this one shouldn't break anything, as long as you have nothing holding onto the original list and magically expecting it to be sorted. If that's happening, refactor your code so it doesn't grab the reference until after your list has been sorted, since it can't be sorted in place.

Why isn't .Except (LINQ) comparing things properly? (using IEquatable)

I have two collections of my own reference-type objects that I wrote my own IEquatable.Equals method for, and I want to be able to use LINQ methods on them.
So,
List<CandyType> candy = dataSource.GetListOfCandy();
List<CandyType> lollyPops = dataSource.GetListOfLollyPops();
var candyOtherThanLollyPops = candy.Except( lollyPops );
According to the documentation of .Except, not passing an IEqualityComparer should result in EqualityComparer.Default being used to compare objects. And the documentation for the Default comparer is this:
"The Default property checks whether type T implements the System.IEquatable generic interface and if so returns an EqualityComparer that uses that implementation. Otherwise it returns an EqualityComparer that uses the overrides of Object.Equals and Object.GetHashCode provided by T."
So, because I implement IEquatable for my object, it should use that and work. But, it doesn't. It doesn't work until I override GetHashCode. In fact, if I set a break point, my IEquatable.Equals method never gets executed. This makes me think that it's going with plan B according to its documentation. I understand that overriding GetHashCode is a good idea, anyway, and I can get this working, but I am upset that it is behaving in a way that isn't in line with what its own documentation stated.
Why isn't it doing what it said it would? Thank you.
After investigation, it turns out things aren't quite as bad as I thought. Basically, when everything is implemented properly (GetHashCode, etc.) the documentation is correct, and the behavior is correct. But, if you try to do something like implement IEquatable all by itself, then your Equals method will never get called (this seems to be due to GetHashCode not being implemented properly). So, while the documentation is technically wrong, it's only wrong in a fringe situation that you'd never ever want to do (if this investigation has taught me anything, it's that IEquatable is part of a whole set of methods you should implement atomically (by convention, not by rule, unfortunately)).
Good sources on this are:
Is there a complete IEquatable implementation reference?
MSDN Documentation: IEquatable<T>.Equals(T) Method
SYSK 158: IComparable<T> vs. IEquatable<T>
The interface IEqualityComparer<T> has these two methods:
bool Equals(T x, T y);
int GetHashCode(T obj);
A good implementation of this interface would thus implement both. The Linq extension method Except relies on the hash code in order to use a dictionary or set lookup internally to figure out which objects to skip, and thus requires that proper GetHashCode implementation.
Unfortunately, when you use EqualityComparer<T>.Default, that class does not provide a good GetHashCode implementation by itself, and relies on the object in question, the type T, to provide that part, when it detects that the object implements IEquatable<T>.
The problem here is that IEquatable<T> does not in fact declare GetHashCode so it's much easier to forget to implement that method properly, contrasted with the Equals method that it does declare.
So you have two choices:
Provide a proper IEqualityComparer<T> implementation that implements both Equals and GetHashCode
Make sure that in addition to implementing IEquatable<T> on your object, implement a proper GetHashCode as well
Hazarding a guess, are these different classes? I think by default IEquatable only works with the same class. So it might by falling back to the Object.Equal method.
I wrote a GenericEqualityComparer to be used on the fly for these types of methods: Distinct, Except, Intersect, etc.
Use as follows:
var results = list1.Except(list2, new GenericEqualityComparer<MYTYPE>((a, b) => a.Id == b.Id // OR SOME OTHER COMPARISON RESOLVING TO BOOLEAN));
Here's the class:
public class GenericEqualityComparer<T> : EqualityComparer<T>
{
public Func<T, int> HashCodeFunc { get; set; }
public Func<T, T, Boolean> EqualityFunc { get; set; }
public GenericEqualityComparer(Func<T, T, Boolean> equalityFunc)
{
EqualityFunc = equalityFunc;
HashCodeFunc = null;
}
public GenericEqualityComparer(Func<T, T, Boolean> equalityFunc, Func<T, int> hashCodeFunc) : this(equalityFunc)
{
HashCodeFunc = hashCodeFunc;
}
public override bool Equals(T x, T y)
{
return EqualityFunc(x, y);
}
public override int GetHashCode(T obj)
{
if (HashCodeFunc == null)
{
return 1;
}
else
{
return HashCodeFunc(obj);
}
}
}
I ran into this same problem, and debugging led me to a different answer than most. Most people point out that GetHashCode() must be implemented.
However, in my case - which was LINQ's SequenceEqual() - GetHashCode() was never called. And, despite the fact that every object involved was typed to a specific type T, the underlying problem was that SequenceEqual() called T.Equals(object other), which I had forgotten to implement, rather than calling the expected T.Equals(T other).

Resources