How to work inline with custom IEqualityComparer<T> parameters - linq

Several times I have needed to call LINQ's Distinct on different IEnumerables.
These Distinct calls often need criteria that I use just once in the whole codebase.
I find it really annoying to have to create a class that implements IEqualityComparer<T> just to perform one Distinct, so I thought I would close the gap with a generic class that delegates to a lambda expression passed as a parameter.
To pass a custom IEqualityComparer<T> inline, I developed the following class:
public class InlineComparer<T>
{
private class LambdaBasedComparer : IEqualityComparer<T>
{
public LambdaBasedComparer(Func<T, int> getHashCode)
{
fGetHashCode = getHashCode;
}
public bool Equals(T x, T y)
{
return x?.GetHashCode() == y?.GetHashCode();
}
private Func<T, int> fGetHashCode { get; set; }
public int GetHashCode(T obj)
{
return fGetHashCode(obj);
}
}
public static IEqualityComparer<T> GetComparer(Func<T, int> getHashCode)
{
return new LambdaBasedComparer(getHashCode);
}
}
What do you think about it? I hope it may be helpful!
Of course, a complete implementation of this helper would wrap this class in an extension method similar to "IEnumerable.Distinct(Func getHashCode)", but I wanted to highlight the possibility of using lambdas to supply the Distinct logic.

Your equality comparer will declare two objects to be equal if they return the same value for GetHashCode(). Of course, when defining your own equality comparer you are free to define the concept of equality any way you want, as long as your equality is reflexive, symmetric and transitive (x==x; if x==y then y==x; if x==y and y==z, then x==z).
Your equality comparer fits these rules, so you can use it.
However! Will it be a useful comparer?
You want to use a special equality comparer instead of the default equality comparer because you want some special definition of (un)equality of two objects.
Normally, during your design process you should first define equality of your objects. If you've done that and you want to use your LambdaBasedComparer you'll have to create a Hash Function that will return different values for different objects.
Normally hash functions have one requirement: two equal objects should return the same hash value. There is no requirement upon two different objects.
An Int32 hash code can take only 2^32 different values, so if you've designed a class with more possible unequal instances than that, you can't use your comparer. An easy example: try to create a comparer with InlineComparer<long> that reproduces normal equality.
But even if your class can only create half of the Int32.MaxValue instances, it will be very difficult to create a proper hash function that will generate unique hash values for different objects.
Finally your equality will not be very intuitive if you use it to compare derived classes. Consider class Person and derived classes Employee and Customer.
IEqualityComparer<Person> personComparer = InlineComparer<Person>.GetComparer(...);
Person p = new Person(...);
Person e = new Employee(...);
Person c = new Customer(...);
Now I can say that a certain Person who isn't an Employee can equal one of your Employees. But would you ever say that Employees will equal Customers?
In summary: you think that you have a simple solution for your comparers, but it will be very difficult to create a proper hash function for your definition of equality, and it will probably be even more difficult to test this hash function.
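The collision problem is easy to reproduce. The sketch below is written in Java rather than C#, purely for illustration, since the pigeonhole argument does not depend on the language. `HashOnlyEquality` and `distinctByHash` are hypothetical names standing in for a comparer that treats equal hash codes as equal objects.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

// Hypothetical helper: two values count as "equal" when their hash codes match,
// mirroring the idea behind the hash-only comparer discussed above.
class HashOnlyEquality {
    // Keeps the first element seen for every distinct hash code.
    static <T> List<T> distinctByHash(List<T> items, ToIntFunction<T> hash) {
        Set<Integer> seen = new HashSet<>();
        List<T> result = new ArrayList<>();
        for (T item : items) {
            if (seen.add(hash.applyAsInt(item))) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> values = new ArrayList<>();
        values.add(0L);
        values.add(-1L);
        values.add(1L);
        // Long.hashCode folds 64 bits into 32, so 0L and -1L both hash to 0
        // and -1L is silently dropped as a "duplicate".
        System.out.println(distinctByHash(values, v -> Long.hashCode(v))); // [0, 1]
    }
}
```

Any 64-bit type run through a 32-bit hash must produce such pairs, so a hash-only definition of equality cannot reproduce normal equality for it.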

Related

Creating composite key class for Secondary Sort

I am trying to create a composite key class of a String uniqueCarrier and int month for Secondary Sort. Can anyone tell me what steps are involved?
Looks like you have an equality problem since you're not using uniqueCarrier in your compareTo method. You need to use uniqueCarrier in your compareTo and equals methods (also define an equals method). From the java lang reference
The natural ordering for a class C is said to be consistent with equals if and only if e1.compareTo(e2) == 0 has the same boolean value as e1.equals(e2) for every e1 and e2 of class C. Note that null is not an instance of any class, and e.compareTo(null) should throw a NullPointerException even though e.equals(null) returns false.
You can also implement a RawComparator so that you can compare them without deserializing for some faster performance.
However, I recommend (as I always do) to not write things like Secondary Sort yourself. These have been implemented (as well as dozens of other optimizations) in projects like Pig and Hive. E.g. if you were using Hive, all you need to write is:
SELECT ...
FROM my_table
ORDER BY month, carrier;
The above is a lot simpler to write than trying to figure out how to write Secondary Sorts (and eventually when you need to use it again, how to do it in a generic fashion). MapReduce should be considered a low level programming paradigm and should only be used (IMHO) when you need high performance optimizations that you don't get from higher level projects like Pig or Hive.
EDIT: Forgot to mention about Grouping comparators, see Matt's answer
Your compareTo() implementation is incorrect. You need to sort first on uniqueCarrier, then on month to break equality:
@Override
public int compareTo(CompositeKey other) {
if (this.getUniqueCarrier().equals(other.getUniqueCarrier())) {
return this.getMonth().compareTo(other.getMonth());
} else {
return this.getUniqueCarrier().compareTo(other.getUniqueCarrier());
}
}
One suggestion though: I typically choose to implement my attributes directly as Writable types if possible (for example, IntWritable month and Text uniqueCarrier). This allows me to call write and readFields directly on them, and also use their compareTo. Less code to write is always good...
Speaking of less code, you don't have to call the parent constructor for your composite key.
Now for what is left to be done:
My guess is you are still missing a hashCode() method, which should only return the hash of the attribute you want to group on, in this case uniqueCarrier. This method is called by the default Hadoop partitioner to distribute work across reducers.
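As a rough illustration of how compareTo, equals and hashCode fit together, here is a plain-Java sketch. `CarrierMonthKey` is a hypothetical stand-in with no Hadoop dependency; a real key would also implement WritableComparable with its write and readFields methods.

```java
// Plain-Java stand-in for the composite key (hypothetical name, no Hadoop types).
class CarrierMonthKey implements Comparable<CarrierMonthKey> {
    private final String uniqueCarrier;
    private final int month;

    CarrierMonthKey(String uniqueCarrier, int month) {
        this.uniqueCarrier = uniqueCarrier;
        this.month = month;
    }

    String getUniqueCarrier() { return uniqueCarrier; }
    int getMonth() { return month; }

    // Order by carrier first, then by month to break ties.
    @Override
    public int compareTo(CarrierMonthKey other) {
        int byCarrier = uniqueCarrier.compareTo(other.uniqueCarrier);
        return byCarrier != 0 ? byCarrier : Integer.compare(month, other.month);
    }

    // Consistent with compareTo: equal exactly when both fields match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CarrierMonthKey)) return false;
        CarrierMonthKey other = (CarrierMonthKey) o;
        return month == other.month && uniqueCarrier.equals(other.uniqueCarrier);
    }

    // Hash only the grouping attribute, so all keys of one carrier share a hash.
    @Override
    public int hashCode() {
        return uniqueCarrier.hashCode();
    }
}
```

Hashing on fewer fields than equals uses is still valid, because equal keys necessarily share the same carrier and therefore the same hash.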
I would also write custom GroupingComparator and SortingComparator to make sure grouping happens only on uniqueCarrier, and that sorting behaves according to CompositeKey compareTo():
public class CompositeGroupingComparator extends WritableComparator {
public CompositeGroupingComparator() {
super(CompositeKey.class, true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
CompositeKey first = (CompositeKey) a;
CompositeKey second = (CompositeKey) b;
return first.getUniqueCarrier().compareTo(second.getUniqueCarrier());
}
}
public class CompositeSortingComparator extends WritableComparator {
public CompositeSortingComparator()
{
super (CompositeKey.class, true);
}
@Override
public int compare (WritableComparable a, WritableComparable b){
CompositeKey first = (CompositeKey) a;
CompositeKey second = (CompositeKey) b;
return first.compareTo(second);
}
}
Then, tell your Driver to use those two:
job.setSortComparatorClass(CompositeSortingComparator.class);
job.setGroupingComparatorClass(CompositeGroupingComparator.class);
Edit: Also see Pradeep's suggestion of implementing RawComparator to prevent having to unmarshall to an Object each time, if you want to optimize further.
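To see how the two comparators cooperate, the following plain-Java simulation may help. It is only a sketch of the idea, not Hadoop's actual shuffle: `SecondarySortSketch`, `Key`, `SORT` and `GROUP` are hypothetical names, and everything runs in memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {
    // Minimal key holder for the simulation (hypothetical, not a Writable).
    static final class Key {
        final String carrier;
        final int month;
        Key(String carrier, int month) { this.carrier = carrier; this.month = month; }
    }

    // Plays the role of the sort comparator: carrier first, then month.
    static final Comparator<Key> SORT =
        Comparator.comparing((Key k) -> k.carrier).thenComparingInt(k -> k.month);

    // Plays the role of the grouping comparator: carrier only.
    static final Comparator<Key> GROUP = Comparator.comparing((Key k) -> k.carrier);

    // Sorts all keys, then starts a new group whenever GROUP reports a change.
    static Map<String, List<Integer>> groupSorted(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        Key previous = null;
        for (Key key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.put(key.carrier, new ArrayList<>());
            }
            groups.get(key.carrier).add(key.month);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = new ArrayList<>();
        keys.add(new Key("DL", 5));
        keys.add(new Key("AA", 9));
        keys.add(new Key("DL", 2));
        keys.add(new Key("AA", 1));
        System.out.println(groupSorted(keys)); // {AA=[1, 9], DL=[2, 5]}
    }
}
```

Each group receives its months already in order, which is the effect a secondary sort is meant to achieve.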

Avoiding duplicate code when performing operation on different object properties

I have recently run into a problem which has had me thinking in circles. Assume that I have an object of type O with properties O.A and O.B. Also assume that I have a collection of instances of type O, where O.A and O.B are defined for each instance.
Now assume that I need to perform some operation (like sorting) on a collection of O instances using either O.A or O.B, but not both at any given time. My original solution is as follows.
Example -- just for demonstration, not production code:
public class O {
int A;
int B;
}
public static class Utils {
public static void SortByA (O[] collection) {
// Sort the objects in the collection using O.A as the key. Note: this is custom sorting logic, so it is not simply a one-line call to a built-in sort method.
}
public static void SortByB (O[] collection) {
// Sort the objects in the collection using O.B as the key. Same logic as above.
}
}
What I would love to do is this...
public static void SortAgnostic (O[] collection, FieldRepresentation x /* some non-bool, non-int variable representing whether to choose O.A or O.B as the sorting key */) {
// Sort by whatever "x" represents...
}
... but creating a new, highly-specific type that I will have to maintain just to avoid duplicating a few lines of code seems unnecessary to me. Perhaps I am incorrect on that (and I am sure someone will correct me if that statement is wrong :D), but that is my current thought nonetheless.
Question: What is the best way to implement this method? The logic that I have to implement is difficult to break down into smaller methods, as it is already fairly optimized. At the root of the issue is the fact that I need to perform the same operation using different properties of an object. I would like to stay away from using codes/flags/etc. in the method signature if possible so that the solution can be as robust as possible.
Note: When answering this question, please approach it from an algorithmic point of view. I am aware that some language-specific features may be suitable alternatives, but I have encountered this problem before and would like to understand it from a relatively language-agnostic viewpoint. Also, please do not constrain responses to sorting solutions only, as I have only chosen it as an example. The real question is how to avoid code duplication when performing an identical operation on two different properties of an object.
"The real question is how to avoid code duplication when performing an identical operation on two different properties of an object."
This is a very good question as this situation arises all the time. I think, one of the best ways to deal with this situation is to use the following pattern.
public class O {
int A;
int B;
}
public doOperationX1() {
doOperationX(something to indicate which property to use);
}
public doOperationX2() {
doOperationX(something to indicate which property to use);
}
private doOperationX(input ) {
// actual work is done here
}
In this pattern, the actual implementation is performed in a private method, which is called by public methods, with some extra information. For example, in this case, it can be
doOperationX(A), or doOperationX(B), or something like that.
My Reasoning: In my opinion this pattern is optimal as it achieves two main requirements:
It keeps the public interface descriptive and clear, as it keeps operations separate, and avoids flags etc that you also mentioned in your post. This is good for the client.
From the implementation perspective, it prevents duplication, as it is in one place. This is good for the development.
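One possible Java rendering of this pattern is sketched below. It deliberately uses a non-sorting operation (finding a maximum) to show that the idea is not tied to sorting. The names `Operations`, `maxByA`, `maxByB` and `maxBy` are hypothetical, and a key function is assumed as the "something to indicate which property to use".

```java
import java.util.function.ToIntFunction;

// Hypothetical version of the question's type O with two int properties.
class O {
    final int a;
    final int b;
    O(int a, int b) { this.a = a; this.b = b; }
}

class Operations {
    // Public, descriptive entry points: one per property, no flags.
    static O maxByA(O[] items) { return maxBy(items, o -> o.a); }
    static O maxByB(O[] items) { return maxBy(items, o -> o.b); }

    // The single shared implementation; the key function selects the property.
    private static O maxBy(O[] items, ToIntFunction<O> key) {
        O best = null;
        for (O item : items) {
            if (best == null || key.applyAsInt(item) > key.applyAsInt(best)) {
                best = item;
            }
        }
        return best; // null for an empty array
    }
}
```

Callers only ever see maxByA and maxByB, while the loop itself exists once.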
A simple way to approach this I think is to internalize the behavior of choosing the sort field to the class O itself. This way the solution can be language-agnostic.
The implementation in Java could be using an Abstract class for O, where the purpose of the abstract method getSortField() would be to return the field to sort by. All that the invocation logic would need to do is to implement the abstract method to return the desired field.
O o = new O() {
public int getSortField() {
return A;
}
};
The problem might be reduced to obtaining the value of the specified field from the given object so it can be used for sorting purposes, or,
TField getValue(TEntity entity, string fieldName)
{
// Return value of field "A" from entity,
// implementation depends on language of choice, possibly with
// some sort of reflection support
}
This method can be used to substitute comparisons within the sorting algorithm,
if (getValue(o[i], "A") > getValue(o[j], "A"))
{
swap(i, j);
}
The field name can then be parametrized, as,
public static void SortAgnostic (O[] collection, string fieldName)
{
if (getValue(collection[i], fieldName) > getValue(collection[j], fieldName))
{
swap(i, j);
}
...
}
which you can use like SortAgnostic(collection, "A").
Some languages allow you to express the field in a more elegant way,
public static void SortAgnostic (O[] collection, Expression fieldExpression)
{
if (getValue(collection[i], fieldExpression) >
getValue(collection[j], fieldExpression))
{
swap(i, j);
}
...
}
which you can use like SortAgnostic(collection, entity => entity.A).
And yet another option can be passing a pointer to a function which will return the value of the field needed,
public static void SortAgnostic (O[] collection, Function getValue)
{
if (getValue(collection[i]) > getValue(collection[j]))
{
swap(i, j);
}
...
}
which given a function,
TField getValueOfA(TEntity entity)
{
return entity.A;
}
and passing it like SortAgnostic(collection, getValueOfA).
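In Java, the function-passing variant could look roughly like the sketch below. An insertion sort stands in for the custom sorting logic, and `SortAgnosticSketch`, `Item` and the two key functions are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.function.Function;

class SortAgnosticSketch {
    // Hypothetical element type with two properties.
    static final class Item {
        final int a;
        final int b;
        Item(int a, int b) { this.a = a; this.b = b; }
        @Override public String toString() { return "(" + a + "," + b + ")"; }
    }

    // Insertion sort standing in for the custom logic; the key function
    // decides which property drives every comparison.
    static <T, K extends Comparable<K>> void sortAgnostic(T[] items, Function<T, K> getValue) {
        for (int i = 1; i < items.length; i++) {
            T current = items[i];
            int j = i - 1;
            while (j >= 0 && getValue.apply(items[j]).compareTo(getValue.apply(current)) > 0) {
                items[j + 1] = items[j];
                j--;
            }
            items[j + 1] = current;
        }
    }

    public static void main(String[] args) {
        Item[] items = { new Item(3, 1), new Item(1, 2), new Item(2, 0) };
        Function<Item, Integer> getValueOfA = item -> item.a;
        Function<Item, Integer> getValueOfB = item -> item.b;
        sortAgnostic(items, getValueOfA);
        System.out.println(Arrays.toString(items)); // [(1,2), (2,0), (3,1)]
        sortAgnostic(items, getValueOfB);
        System.out.println(Arrays.toString(items)); // [(2,0), (3,1), (1,2)]
    }
}
```

The sorting loop never names a field, so adding a third property requires only another one-line key function.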
"... but creating a new, highly-specific type that I will have to maintain just to avoid duplicating a few lines of code seems unnecessary to me"
That is why you should use available tools such as frameworks or other kinds of code libraries that already provide the solution you need.
When a mechanism is common, it can be moved to a higher level of abstraction. When you cannot find a suitable solution, try to create your own. Think of the result of the operation as something that is not part of the class's functionality. Sorting is only a feature, which is why it should not be part of your class in the first place. Try to keep the class as simple as possible.
Do not worry prematurely about whether something small deserves to exist just because it is small. Focus on how it will finally be used. If you use one kind of sorting very often, just create a definition of it so you can reuse it. You do not necessarily have to create a utility class and then call it. Sometimes the base functionality enclosed in a utility class is good enough.
I assume that you use Java:
In your case the wheel has already been invented in the form of Collections#sort(List, Comparator).
To complete it, you could create an enum type that implements the Comparator interface with predefined sort orders.
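A minimal sketch of such an enum, assuming a hypothetical element type `Row` with two int properties (`RowOrder` and `RowOrderDemo` are likewise made-up names):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical element type with two sortable properties.
class Row {
    final int a;
    final int b;
    Row(int a, int b) { this.a = a; this.b = b; }
}

// Each constant is a ready-made Comparator for one property.
enum RowOrder implements Comparator<Row> {
    BY_A {
        @Override
        public int compare(Row x, Row y) { return Integer.compare(x.a, y.a); }
    },
    BY_B {
        @Override
        public int compare(Row x, Row y) { return Integer.compare(x.b, y.b); }
    }
}

class RowOrderDemo {
    // Returns a sorted copy so the caller's list is left untouched.
    static List<Row> sorted(List<Row> rows, RowOrder order) {
        List<Row> copy = new ArrayList<>(rows);
        Collections.sort(copy, order);
        return copy;
    }
}
```

A call such as RowOrderDemo.sorted(rows, RowOrder.BY_B) then reads almost like the SortAgnostic signature the question asked for, with the enum acting as the field selector.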

Class: Immutability vs Not Extensible

I was reading in SO threads, and also in an article, that there are many reasons for making a class final.
Two of them were:
1. To remove extensibility
2. To make the class immutable.
Does making a class immutable imply that it (or its methods) is final as well? I don't see the difference between the two.
An immutable object does not allow its state to be changed. A final class does not allow other classes to inherit from it. For example, class Foo (see below) is immutable (its state, i.e. _name, is never changed) and class Bar is mutable (the rename method allows the state to be changed):
final class Foo
{
private String _name;
public Foo(String name)
{
_name = name;
}
public String getName()
{
return _name;
}
}
final class Bar
{
private String _name;
public Bar(String name)
{
_name = name;
}
public String getName()
{
return _name;
}
public void rename(String newName)
{
_name = newName;
}
}
It can sometimes be useful to recognize types as "verifiably deeply immutable", meaning that static analysis can demonstrate that (1) once an instance is constructed, none of its properties will ever change, and (2) every object instance to which it holds a reference is verifiably deeply immutable. Classes which are open to extension cannot be verifiably deeply immutable, because a static analyzer would have no way of knowing whether a mutable subclass might be created, and a reference to that mutable subclass stored within what's supposed to be a verifiably deeply immutable object.
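The following sketch illustrates why an extensible class defeats that kind of analysis. All three class names (`Name`, `MutableName`, `Badge`) are hypothetical and exist only for this example.

```java
// Looks immutable, but is open to extension because it is not final.
class Name {
    private final String value;
    Name(String value) { this.value = value; }
    String get() { return value; }
}

// A subclass can reintroduce mutable state behind the same interface.
class MutableName extends Name {
    private String current;
    MutableName(String value) { super(value); this.current = value; }
    void rename(String newValue) { this.current = newValue; }
    @Override String get() { return current; }
}

// Final, with a single final field, yet its observable state can still
// change if it was handed a mutable subclass of Name.
final class Badge {
    private final Name name;
    Badge(Name name) { this.name = name; }
    String label() { return "Hello, " + name.get(); }
}
```

Constructing a Badge from a MutableName and then calling rename changes what label() returns, even though Badge itself never assigns a field after construction.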
On the other hand, it can sometimes be useful to have abstract (and thus extensible) classes which are specified to be deeply immutable. The abstract class would have no way of forcing derived classes to be immutable, but any mutable derived classes should be considered "broken". The situation would be somewhat analogous to the requirement that two object instances which report themselves as "equal" to each other should report the same hash code. It's possible to design classes which violate that requirement, but any errant hash-table behavior that results is the fault of the broken hash-code function, rather than the hash table.
For example, one might have an abstract ImmutableMatrix class with a method to read the element at a given (row,column) location. One possible implementation would be to back an NxM ImmutableMatrix with an array of N*M elements. On the other hand, it may also be useful to define some subclasses like ImmutableDiagonalMatrix, with an array of N elements, where Value(R,C) would yield 0 for R!=C, and Arr[R] for R==C. If a significant fraction of the matrices one is using will be diagonal, one could save a lot of memory for each such instance. While leaving the class extensible would leave open the possibility that someone might extend it in a fashion which is open to mutation, it would also leave open the possibility that a programmer who knew that many of the matrices a program used would fit some particular form could design a class to optimally store that form.
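A minimal Java sketch of that matrix hierarchy might look as follows. The method names and the `DenseMatrix` class are assumptions made for the example; only ImmutableMatrix and ImmutableDiagonalMatrix are named in the text above.

```java
// Abstract base specified as immutable; subclasses choose their own storage.
abstract class ImmutableMatrix {
    abstract int rows();
    abstract int columns();
    abstract double value(int row, int column);
}

// Dense storage: one array of rows * columns elements (hypothetical name).
final class DenseMatrix extends ImmutableMatrix {
    private final int rows;
    private final int columns;
    private final double[] data;

    DenseMatrix(int rows, int columns, double[] data) {
        if (data.length != rows * columns) {
            throw new IllegalArgumentException("expected rows * columns elements");
        }
        this.rows = rows;
        this.columns = columns;
        this.data = data.clone(); // defensive copy keeps the instance immutable
    }

    @Override int rows() { return rows; }
    @Override int columns() { return columns; }
    @Override double value(int row, int column) { return data[row * columns + column]; }
}

// Diagonal storage: N elements instead of N * N.
final class ImmutableDiagonalMatrix extends ImmutableMatrix {
    private final double[] diagonal;

    ImmutableDiagonalMatrix(double[] diagonal) {
        this.diagonal = diagonal.clone(); // defensive copy
    }

    @Override int rows() { return diagonal.length; }
    @Override int columns() { return diagonal.length; }
    @Override double value(int row, int column) {
        return row == column ? diagonal[row] : 0.0;
    }
}
```

Both subclasses are final and copy their input arrays, so each concrete type stays immutable even though the abstract base cannot enforce it.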

Can we sort an IList partially?

IList<A_Desc,A_premium,B_Desc,B_Premium>
Can I sort the two columns A_Desc and A_premium, based on A_Desc?
And can B_Desc and B_Premium remain in the same order as before sorting?
First off, a list can only be of one type, and only has one "column" of data, so you actually want two lists and a data type that holds "desc" and "premium". "desc" sounds like a String to me; I don't know what Premium is, but I'll pretend it's a double for lack of better ideas. I don't know what this data is supposed to represent, so to me, it's just some thingie.
public class Thingie{
public String desc;
public double premium;
}
That is, of course, a terrible way to define the class- I should instead have desc and premium be private, and Desc and Premium as public Properties with Get and Set methods. But this is the fastest way for me to get the point across.
It's more canonical to make Thingie implement IComparable, and compare itself to other Thingie objects. But I'm editing an answer I wrote before I knew you needed to write a custom type, and had the freedom to just make it implement IComparable. So here's the IComparer approach, which lets you sort objects that don't sort themselves by telling C# how to sort them.
Implement an IComparer that operates over your custom type.
public class ThingieSorter: IComparer<Thingie>{
public int Compare(Thingie t1, Thingie t2){
int r = t1.desc.CompareTo(t2.desc);
if(r != 0){return r;}
return t1.premium.CompareTo(t2.premium);
}
}
C# doesn't require IList to implement Sort- it might be inefficient if it's a LinkedList. So let's make a new list, based on arrays, which does sort efficiently, and sort it:
public List<Thingie> sortedOf(IList<Thingie> list){
List<Thingie> ret = new List<Thingie>(list);
ret.Sort(new ThingieSorter());
return ret;
}
List<Thingie> implements the interface IList<Thingie>, so replacing your original list with this one shouldn't break anything, as long as you have nothing holding onto the original list and magically expecting it to be sorted. If that's happening, refactor your code so it doesn't grab the reference until after your list has been sorted, since it can't be sorted in place.

Why isn't .Except (LINQ) comparing things properly? (using IEquatable)

I have two collections of my own reference-type objects that I wrote my own IEquatable.Equals method for, and I want to be able to use LINQ methods on them.
So,
List<CandyType> candy = dataSource.GetListOfCandy();
List<CandyType> lollyPops = dataSource.GetListOfLollyPops();
var candyOtherThanLollyPops = candy.Except( lollyPops );
According to the documentation of .Except, not passing an IEqualityComparer<T> should result in EqualityComparer<T>.Default being used to compare objects. And the documentation for the Default comparer is this:
"The Default property checks whether type T implements the System.IEquatable<T> generic interface and if so returns an EqualityComparer<T> that uses that implementation. Otherwise it returns an EqualityComparer<T> that uses the overrides of Object.Equals and Object.GetHashCode provided by T."
So, because I implement IEquatable for my object, it should use that and work. But, it doesn't. It doesn't work until I override GetHashCode. In fact, if I set a break point, my IEquatable.Equals method never gets executed. This makes me think that it's going with plan B according to its documentation. I understand that overriding GetHashCode is a good idea, anyway, and I can get this working, but I am upset that it is behaving in a way that isn't in line with what its own documentation stated.
Why isn't it doing what it said it would? Thank you.
After investigation, it turns out things aren't quite as bad as I thought. Basically, when everything is implemented properly (GetHashCode, etc.) the documentation is correct, and the behavior is correct. But, if you try to do something like implement IEquatable all by itself, then your Equals method will never get called (this seems to be due to GetHashCode not being implemented properly). So, while the documentation is technically wrong, it's only wrong in a fringe situation that you'd never ever want to do (if this investigation has taught me anything, it's that IEquatable is part of a whole set of methods you should implement atomically (by convention, not by rule, unfortunately)).
Good sources on this are:
Is there a complete IEquatable implementation reference?
MSDN Documentation: IEquatable<T>.Equals(T) Method
SYSK 158: IComparable<T> vs. IEquatable<T>
The interface IEqualityComparer<T> has these two methods:
bool Equals(T x, T y);
int GetHashCode(T obj);
A good implementation of this interface would thus implement both. The Linq extension method Except relies on the hash code in order to use a dictionary or set lookup internally to figure out which objects to skip, and thus requires that proper GetHashCode implementation.
Unfortunately, when you use EqualityComparer<T>.Default, that class does not provide a good GetHashCode implementation by itself, and relies on the object in question, the type T, to provide that part, when it detects that the object implements IEquatable<T>.
The problem here is that IEquatable<T> does not in fact declare GetHashCode so it's much easier to forget to implement that method properly, contrasted with the Equals method that it does declare.
So you have two choices:
Provide a proper IEqualityComparer<T> implementation that implements both Equals and GetHashCode
Make sure that in addition to implementing IEquatable<T> on your object, implement a proper GetHashCode as well
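The same contract applies to hash-based collections in other languages. As a Java analogue (not the LINQ implementation), the sketch below shows an equality method that is never consulted because the hash codes differ. All names are hypothetical, and the per-instance hash in `EqualsOnlyCandy` is an assumption used to simulate a default identity-based hash code deterministically.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ExceptSketch {
    // Defines equality by name, but hashes per instance, which deliberately
    // breaks the equals/hashCode contract to simulate a missing override.
    static class EqualsOnlyCandy {
        private static int nextId = 0;
        private final int id = nextId++;
        final String name;
        EqualsOnlyCandy(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof EqualsOnlyCandy && ((EqualsOnlyCandy) o).name.equals(name);
        }
        @Override public int hashCode() { return id; }
    }

    // Overrides equals and hashCode consistently.
    static class Candy {
        final String name;
        Candy(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof Candy && ((Candy) o).name.equals(name);
        }
        @Override public int hashCode() { return name.hashCode(); }
    }

    // Set-based difference, similar in spirit to LINQ's Except.
    static <T> List<T> except(List<T> first, List<T> second) {
        Set<T> excluded = new HashSet<>(second);
        List<T> result = new ArrayList<>();
        for (T item : first) {
            if (excluded.add(item)) { // true only if no equal element was present
                result.add(item);
            }
        }
        return result;
    }
}
```

With EqualsOnlyCandy, an equal-by-name element in the second list is not excluded, because the set compares hash codes before it ever calls equals. With Candy, the difference behaves as expected.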
Hazarding a guess, are these different classes? I think by default IEquatable only works with the same class. So it might be falling back to the Object.Equals method.
I wrote a GenericEqualityComparer to be used on the fly for these types of methods: Distinct, Except, Intersect, etc.
Use as follows:
var results = list1.Except(list2, new GenericEqualityComparer<MYTYPE>((a, b) => a.Id == b.Id)); // or some other comparison resolving to a Boolean
Here's the class:
public class GenericEqualityComparer<T> : EqualityComparer<T>
{
public Func<T, int> HashCodeFunc { get; set; }
public Func<T, T, Boolean> EqualityFunc { get; set; }
public GenericEqualityComparer(Func<T, T, Boolean> equalityFunc)
{
EqualityFunc = equalityFunc;
HashCodeFunc = null;
}
public GenericEqualityComparer(Func<T, T, Boolean> equalityFunc, Func<T, int> hashCodeFunc) : this(equalityFunc)
{
HashCodeFunc = hashCodeFunc;
}
public override bool Equals(T x, T y)
{
return EqualityFunc(x, y);
}
public override int GetHashCode(T obj)
{
if (HashCodeFunc == null)
{
return 1;
}
else
{
return HashCodeFunc(obj);
}
}
}
I ran into this same problem, and debugging led me to a different answer than most. Most people point out that GetHashCode() must be implemented.
However, in my case - which was LINQ's SequenceEqual() - GetHashCode() was never called. And, despite the fact that every object involved was typed to a specific type T, the underlying problem was that SequenceEqual() called T.Equals(object other), which I had forgotten to implement, rather than calling the expected T.Equals(T other).
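Java has a closely related pitfall, shown here as an analogue rather than as a description of SequenceEqual itself: declaring equals with a typed parameter creates an overload, and the collection classes only call equals(Object). The class names below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SequenceEqualSketch {
    // Declares equals(TypedOnly) only: an overload, not an override.
    static class TypedOnly {
        final int id;
        TypedOnly(int id) { this.id = id; }
        public boolean equals(TypedOnly other) {
            return other != null && other.id == id;
        }
    }

    // Overrides equals(Object), plus hashCode to keep the contract intact.
    static class Proper {
        final int id;
        Proper(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof Proper && ((Proper) o).id == id;
        }
        @Override public int hashCode() { return Integer.hashCode(id); }
    }

    public static void main(String[] args) {
        List<TypedOnly> a = Arrays.asList(new TypedOnly(1));
        List<TypedOnly> b = Arrays.asList(new TypedOnly(1));
        // List.equals compares elements through equals(Object), which
        // TypedOnly inherits unchanged from Object (reference equality).
        System.out.println(a.equals(b)); // false

        List<Proper> c = Arrays.asList(new Proper(1));
        List<Proper> d = Arrays.asList(new Proper(1));
        System.out.println(c.equals(d)); // true
    }
}
```

In both languages the lesson is the same: implement the object-typed equality method alongside the strongly typed one, since library code may call either.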
