Avoiding duplicate code when performing operation on different object properties - algorithm

I have recently run into a problem which has had me thinking in circles. Assume that I have an object of type O with properties O.A and O.B. Also assume that I have a collection of instances of type O, where O.A and O.B are defined for each instance.
Now assume that I need to perform some operation (like sorting) on a collection of O instances using either O.A or O.B, but not both at any given time. My original solution is as follows.
Example -- just for demonstration, not production code:
public class O {
int A;
int B;
}
public static class Utils {
public static void SortByA (O[] collection) {
// Sort the objects in the collection using O.A as the key. Note: this is custom sorting logic, so it is not simply a one-line call to a built-in sort method.
}
public static void SortByB (O[] collection) {
// Sort the objects in the collection using O.B as the key. Same logic as above.
}
}
What I would love to do is this...
public static void SortAgnostic (O[] collection, FieldRepresentation x /* some non-bool, non-int variable representing whether to choose O.A or O.B as the sorting key */) {
// Sort by whatever "x" represents...
}
... but creating a new, highly-specific type that I will have to maintain just to avoid duplicating a few lines of code seems unnecessary to me. Perhaps I am incorrect on that (and I am sure someone will correct me if that statement is wrong :D), but that is my current thought nonetheless.
Question: What is the best way to implement this method? The logic that I have to implement is difficult to break down into smaller methods, as it is already fairly optimized. At the root of the issue is the fact that I need to perform the same operation using different properties of an object. I would like to stay away from using codes/flags/etc. in the method signature if possible so that the solution can be as robust as possible.
Note: When answering this question, please approach it from an algorithmic point of view. I am aware that some language-specific features may be suitable alternatives, but I have encountered this problem before and would like to understand it from a relatively language-agnostic viewpoint. Also, please do not constrain responses to sorting solutions only, as I have only chosen it as an example. The real question is how to avoid code duplication when performing an identical operation on two different properties of an object.

"The real question is how to avoid code duplication when performing an identical operation on two different properties of an object."
This is a very good question as this situation arises all the time. I think, one of the best ways to deal with this situation is to use the following pattern.
public class O {
int A;
int B;
}
public doOperationX1() {
doOperationX(something to indicate which property to use);
}
public doOperationX2() {
doOperationX(something to indicate which property to use);
}
private doOperationX(input ) {
// actual work is done here
}
In this pattern, the actual implementation is performed in a private method, which is called by public methods, with some extra information. For example, in this case, it can be
doOperationX(A), or doOperationX(B), or something like that.
My Reasoning: In my opinion this pattern is optimal as it achieves two main requirements:
It keeps the public interface descriptive and clear, as it keeps operations separate, and avoids flags etc that you also mentioned in your post. This is good for the client.
From the implementation perspective, it prevents duplication, as it is in one place. This is good for the development.
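As a rough sketch of this pattern in Java: the "something to indicate which property to use" is modelled here as a key-extracting function, which is only one possible choice. The class shapes, the lower-case field names and the insertion sort are assumptions standing in for the asker's real types and custom logic.

```java
import java.util.function.ToIntFunction;

// Hypothetical stand-in for the question's type O.
class O {
    final int a;
    final int b;

    O(int a, int b) {
        this.a = a;
        this.b = b;
    }
}

class Utils {
    // Public entry points stay descriptive; each one only picks the key.
    static void sortByA(O[] items) {
        sortBy(items, o -> o.a);
    }

    static void sortByB(O[] items) {
        sortBy(items, o -> o.b);
    }

    // The shared logic lives in one private method. Insertion sort is only
    // a placeholder for the custom sorting logic mentioned in the question.
    private static void sortBy(O[] items, ToIntFunction<O> key) {
        for (int i = 1; i < items.length; i++) {
            O current = items[i];
            int j = i - 1;
            while (j >= 0 && key.applyAsInt(items[j]) > key.applyAsInt(current)) {
                items[j + 1] = items[j];
                j--;
            }
            items[j + 1] = current;
        }
    }
}
```

Callers only ever see sortByA and sortByB, so no flag or code leaks into the public signatures.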

A simple way to approach this I think is to internalize the behavior of choosing the sort field to the class O itself. This way the solution can be language-agnostic.
The implementation in Java could be using an Abstract class for O, where the purpose of the abstract method getSortField() would be to return the field to sort by. All that the invocation logic would need to do is to implement the abstract method to return the desired field.
O o = new O() {
public int getSortField() {
return A;
}
};

The problem might be reduced to obtaining the value of the specified field from the given object so it can be used for sorting purposes, or,
TField getValue(TEntity entity, string fieldName)
{
// Return value of field "A" from entity,
// implementation depends on language of choice, possibly with
// some sort of reflection support
}
This method can be used to substitute comparisons within the sorting algorithm,
if (getValue(o[i], "A") > getValue(o[j], "A"))
{
swap(i, j);
}
The field name can then be parametrized, as,
public static void SortAgnostic (O[] collection, string fieldName)
{
if (getValue(collection[i], fieldName) > getValue(collection[j], fieldName))
{
swap(i, j);
}
...
}
which you can use like SortAgnostic(collection, "A").
Some languages allow you to express the field in a more elegant way,
public static void SortAgnostic (O[] collection, Expression fieldExpression)
{
if (getValue(collection[i], fieldExpression) >
getValue(collection[j], fieldExpression))
{
swap(i, j);
}
...
}
which you can use like SortAgnostic(collection, entity => entity.A).
And yet another option can be passing a pointer to a function which will return the value of the field needed,
public static void SortAgnostic (O[] collection, Function getValue)
{
if (getValue(collection[i]) > getValue(collection[j]))
{
swap(i, j);
}
...
}
which, given a function such as
TField getValueOfA(TEntity entity)
{
return entity.A;
}
you can call like SortAgnostic(collection, getValueOfA).
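The function-passing variant can be sketched in Java roughly as follows. The names KeySort and Entity are made up for illustration, and the insertion sort again only stands in for whatever custom logic is actually needed; the point is that every comparison goes through the key function.

```java
import java.util.function.Function;

// Hypothetical entity with two properties of different types.
class Entity {
    final int a;
    final String b;

    Entity(int a, String b) {
        this.a = a;
        this.b = b;
    }
}

class KeySort {
    // Sorts in place by whatever key the caller extracts from each element.
    static <T, K extends Comparable<K>> void sortAgnostic(T[] items, Function<T, K> key) {
        for (int i = 1; i < items.length; i++) {
            T current = items[i];
            K currentKey = key.apply(current);
            int j = i - 1;
            // Shift larger elements right until the slot for current is found.
            while (j >= 0 && key.apply(items[j]).compareTo(currentKey) > 0) {
                items[j + 1] = items[j];
                j--;
            }
            items[j + 1] = current;
        }
    }
}
```

With this shape, KeySort.sortAgnostic(items, e -> e.a) and KeySort.sortAgnostic(items, e -> e.b) share the same implementation, and adding a third property later needs no change to the sorting code.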

"... but creating a new, highly-specific type that I will have to maintain just to avoid duplicating a few lines of code seems unnecessary to me"
That is why you should use available tools such as frameworks or other types of code libraries that already provide the solution you need.
When a mechanism is common, it can be moved to a higher level of abstraction. When you cannot find a suitable existing solution, try to create your own. Think of the result of the operation as something that is not part of the class's functionality. Sorting is only a feature, which is why it should not be part of your class in the first place. Try to keep the class as simple as possible.
Do not worry prematurely about whether something is worth having just because it is small. Focus on how it will finally be used. If you use one type of sorting very often, create a definition of it so that you can reuse it. You do not necessarily have to create a util class and then call it. Sometimes the basic functionality enclosed in a util class is good enough.
I assume that you use Java:
In your case the wheel has already been invented in the form of Collections.sort(List, Comparator).
To make use of it, you could create an enum type that implements the Comparator interface with predefined sorting types.
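A minimal sketch of that enum idea, assuming a class with two int fields (the names Item, ItemOrder and Sorting are invented for this example):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical class with the two properties to sort on.
class Item {
    final int a;
    final int b;

    Item(int a, int b) {
        this.a = a;
        this.b = b;
    }
}

// Each constant is a ready-made Comparator for one property.
enum ItemOrder implements Comparator<Item> {
    BY_A {
        @Override
        public int compare(Item x, Item y) {
            return Integer.compare(x.a, y.a);
        }
    },
    BY_B {
        @Override
        public int compare(Item x, Item y) {
            return Integer.compare(x.b, y.b);
        }
    }
}

class Sorting {
    // Returns a sorted copy and leaves the input list untouched.
    static List<Item> sorted(List<Item> items, ItemOrder order) {
        List<Item> copy = new ArrayList<>(items);
        Collections.sort(copy, order);
        return copy;
    }
}
```

The enum is the small, specific type the question was reluctant to add, but it stays trivial to maintain: one constant per sortable property.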

Related

Byte vs boolean for OrderBy

Is there any performance benefit to using a byte over a bool in ordering?
For example, given some code:
var foo = items.OrderByDescending(item => item.SomeProperty);
The existing code to get the value of SomeProperty is:
public byte SomeProperty
{
get
{
if (a == b)
return 1;
else
return 0;
}
}
I wanted to refactor this to:
public bool SomeProperty
{
get
{
return a == b;
}
}
I was told the first is more efficient. Is this true? Are there any downsides to using a bool over a byte?
Any gain will hardly be in processing efficiency. It will be more in the efficiency of developing the code: is the code easy to understand? Easy to reuse for similar items? Easy to change if the internal structure changes without changing the interface? Easy to test?
When designing a property your first question should be: what does my property stand for? What does it mean? Does it have an identifier and type that users will expect, or will they have to look it up in the documentation because they have no idea what it means?
For instance, if you have a class that represents something persistable, like a file, and you invent a property, which one will be easier to understand:
class Persistable
{
public int IsPersisted {get;}
public bool IsPersisted {get;}
...
Which one will readers immediately know what it means?
So for now your idea about persisted can have two values meaning "not persisted yet" and "persisted". A boolean will be enough. But suppose you foresee that in the near future the idea about persistence will change, for instance, that the persistable can be "not persisted yet", "persisted", "changed after it has been persisted", or "deleted". If you foresee that, you have to decide whether it is best to return a bool. Maybe you should return an enum:
public PersistencyState State {get;}
Conclusion Design the identifiers and types of your properties and methods such that the learning curve for your users is low, and that foreseeable changes don't have a great impact. Make sure that the properties are easy to test and maintain. In rare occasions portability is an issue.
Those items have bigger influence on your efficiency than the two code changes.
Back to your question
If you think about what SomeProperty represents, and you think: it represents the equality of a and b, then you should use:
public bool EqualAB => a == b;
If your question is about whether you should use "get" or =>, the first one will call something sub-routine like, while the 2nd method will insert the code. If the part after the => is fairly big, and you use it on hundreds of locations, then your code will become bigger.
But then again: if your get is really big, should you make it a property?
public string ElderName
{
get
{
myDataBase.Open();
var allCustomers = myDataBase.FetchAllCustomers().ToList();
var eldestCustomer = this.FindEldestCustomer(allCustomers);
return eldestCustomer.Name;
}
}
Well this will have a fair impact on code size if you use the => notation on 1000 locations. But honestly, designers that put this in a property instead of a method don't deserve efficient code.
Finally, I asked here on Stack Overflow whether there is a difference between:
string Name {get => this.name;}
string Name => this.name;
The answer was that both translate into the same assembly code.
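To illustrate that the key type does not change the resulting order, here is a rough Java analogue of the OrderByDescending call. The names Row and OrderingDemo are made up, and the code says nothing about .NET performance; it only shows that a boolean key and a 0/1 byte key sort the same way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical item whose sort key is "a equals b".
class Row {
    final String name;
    final int a;
    final int b;

    Row(String name, int a, int b) {
        this.name = name;
        this.a = a;
        this.b = b;
    }

    // Boolean-valued key: states directly what it means.
    boolean matches() {
        return a == b;
    }

    // Byte-valued key: the same information encoded as 1 or 0.
    byte matchesAsByte() {
        return (byte) (a == b ? 1 : 0);
    }
}

class OrderingDemo {
    static List<Row> byBooleanDescending(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(Comparator.comparing(Row::matches).reversed());
        return copy;
    }

    static List<Row> byByteDescending(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(Comparator.comparing(Row::matchesAsByte).reversed());
        return copy;
    }
}
```

Both methods put the rows where a equals b first and keep the original relative order otherwise, so the choice between them comes down to readability.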

JAVA 8 Extract predicates as fields or methods?

What is the cleaner way of extracting predicates which will have multiple uses. Methods or Class fields?
The two examples:
1.Class Field
void someMethod() {
IntStream.range(1, 100)
.filter(isOverFifty)
.forEach(System.out::println);
}
private IntPredicate isOverFifty = number -> number > 50;
2.Method
void someMethod() {
IntStream.range(1, 100)
.filter(isOverFifty())
.forEach(System.out::println);
}
private IntPredicate isOverFifty() {
return number -> number > 50;
}
For me, the field way looks a little bit nicer, but is this the right way? I have my doubts.
Generally you cache things that are expensive to create and these stateless lambdas are not. A stateless lambda will have a single instance created for the entire pipeline (under the current implementation). The first invocation is the most expensive one - the underlying Predicate implementation class will be created and linked; but this happens only once for both stateless and stateful lambdas.
A stateful lambda will use a different instance for each element and it might make sense to cache those, but your example is stateless, so I would not.
If you still want that (for reading purposes I assume), I would do it in a class Predicates let's assume. It would be re-usable across different classes as well, something like this:
public final class Predicates {
private Predicates(){
}
public static IntPredicate isOverFifty() {
return number -> number > 50;
}
}
You should also notice that the usage of Predicates.isOverFifty inside a Stream and x -> x > 50 while semantically the same, will have different memory usages.
In the first case, only a single instance (and class) will be created and served to all clients; while the second (x -> x > 50) will create not only a different instance, but also a different class for each of its clients (think the same expression used in different places inside your application). This happens because the linkage happens per CallSite - and in the second case the CallSite is always different.
But that is something you should not rely on (and probably even consider) - these Objects and classes are fast to build and fast to remove by the GC - whatever fits your needs - use that.
To answer, it helps to expand those lambda expressions into old-fashioned Java with anonymous classes. These are the two forms we are comparing, so the answer depends on how a particular piece of code is written and used.
private IntPredicate isOverFifty = new IntPredicate() {
@Override
public boolean test(int number) {
return number > 50;
}
};
private IntPredicate isOverFifty() {
return new IntPredicate() {
@Override
public boolean test(int number) {
return number > 50;
}
};
}
1) In the field case, a predicate is allocated for every new instance of your class. That is not a big deal if there are only a few instances, as with a service. But if this is a value object that can have N instances, it is not a good solution. Also keep in mind that someMethod() may never be called at all. One possible solution is to make the predicate a static field.
2) In the method case, a predicate is created on every call to someMethod(). Afterwards the GC will discard it.
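The static-field option mentioned in point 1 might look like the sketch below. The class name Numbers and the method countOverFifty are made up for the example; the stream pipeline mirrors the one in the question but counts instead of printing.

```java
import java.util.function.IntPredicate;
import java.util.stream.IntStream;

class Numbers {
    // One shared predicate for the whole class, created when the class is
    // initialized rather than once per instance or once per call.
    private static final IntPredicate IS_OVER_FIFTY = number -> number > 50;

    // Counts the values in [1, 100) that pass the shared predicate.
    static long countOverFifty() {
        return IntStream.range(1, 100)
                .filter(IS_OVER_FIFTY)
                .count();
    }
}
```

Because the field is static, creating many instances of the class allocates no additional predicates.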

Creating composite key class for Secondary Sort

I am trying to create a composite key class of a String uniqueCarrier and int month for Secondary Sort. Can anyone tell me, what are the steps for the same.
Looks like you have an equality problem since you're not using uniqueCarrier in your compareTo method. You need to use uniqueCarrier in your compareTo and equals methods (also define an equals method). From the java lang reference
The natural ordering for a class C is said to be consistent with equals if and only if e1.compareTo(e2) == 0 has the same boolean value as e1.equals(e2) for every e1 and e2 of class C. Note that null is not an instance of any class, and e.compareTo(null) should throw a NullPointerException even though e.equals(null) returns false.
You can also implement a RawComparator so that you can compare them without deserializing for some faster performance.
However, I recommend (as I always do) to not write things like Secondary Sort yourself. These have been implemented (as well as dozens of other optimizations) in projects like Pig and Hive. E.g. if you were using Hive, all you need to write is:
SELECT ...
FROM my_table
ORDER BY month, carrier;
The above is a lot simpler to write than trying to figure out how to write Secondary Sorts (and eventually when you need to use it again, how to do it in a generic fashion). MapReduce should be considered a low level programming paradigm and should only be used (IMHO) when you need high performance optimizations that you don't get from higher level projects like Pig or Hive.
EDIT: Forgot to mention about Grouping comparators, see Matt's answer
Your compareTo() implementation is incorrect. You need to sort first on uniqueCarrier, then on month to break equality:
@Override
public int compareTo(CompositeKey other) {
if (this.getUniqueCarrier().equals(other.getUniqueCarrier())) {
return this.getMonth().compareTo(other.getMonth());
} else {
return this.getUniqueCarrier().compareTo(other.getUniqueCarrier());
}
}
One suggestion though: I typically choose to implement my attributes directly as Writable types if possible (for example, IntWritable month and Text uniqueCarrier). This allows me to call write and readFields directly on them, and also use their compareTo. Less code to write is always good...
Speaking of less code, you don't have to call the parent constructor for your composite key.
Now for what is left to be done:
My guess is you are still missing a hashCode() method, which should only return the hash of the attribute you want to group on, in this case uniqueCarrier. This method is called by the default Hadoop partitioner to distribute work across reducers.
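Putting compareTo, equals and hashCode together, a plain-Java sketch of such a key could look like this. It deliberately leaves out Hadoop's Writable machinery (write and readFields), so it is a stand-in for the real class rather than something to drop into a job; the field types are taken from the question.

```java
import java.util.Objects;

// Plain-Java stand-in for the composite key; no Hadoop types involved.
class CompositeKey implements Comparable<CompositeKey> {
    private final String uniqueCarrier;
    private final int month;

    CompositeKey(String uniqueCarrier, int month) {
        this.uniqueCarrier = Objects.requireNonNull(uniqueCarrier);
        this.month = month;
    }

    // Sort on the carrier first, then use the month to break ties.
    @Override
    public int compareTo(CompositeKey other) {
        int byCarrier = uniqueCarrier.compareTo(other.uniqueCarrier);
        return byCarrier != 0 ? byCarrier : Integer.compare(month, other.month);
    }

    // Uses both attributes, so equals agrees with compareTo == 0.
    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof CompositeKey)) {
            return false;
        }
        CompositeKey other = (CompositeKey) obj;
        return month == other.month && uniqueCarrier.equals(other.uniqueCarrier);
    }

    // Hashes only the grouping attribute, so all keys of one carrier share
    // a hash. Equal keys still hash equally, which keeps the contract.
    @Override
    public int hashCode() {
        return uniqueCarrier.hashCode();
    }
}
```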
I would also write custom GroupingComparator and SortingComparator to make sure grouping happens only on uniqueCarrier, and that sorting behaves according to CompositeKey compareTo():
public class CompositeGroupingComparator extends WritableComparator {
public CompositeGroupingComparator() {
super(CompositeKey.class, true);
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
CompositeKey first = (CompositeKey) a;
CompositeKey second = (CompositeKey) b;
return first.getUniqueCarrier().compareTo(second.getUniqueCarrier());
}
}
public class CompositeSortingComparator extends WritableComparator {
public CompositeSortingComparator()
{
super (CompositeKey.class, true);
}
@Override
public int compare (WritableComparable a, WritableComparable b){
CompositeKey first = (CompositeKey) a;
CompositeKey second = (CompositeKey) b;
return first.compareTo(second);
}
}
Then, tell your Driver to use those two:
job.setSortComparatorClass(CompositeSortingComparator.class);
job.setGroupingComparatorClass(CompositeGroupingComparator.class);
Edit: Also see Pradeep's suggestion of implementing RawComparator to prevent having to unmarshall to an Object each time, if you want to optimize further.

Can we sort an IList partially?

IList<A_Desc,A_premium,B_Desc,B_Premium>
Can I sort two columns A_Desc,A_premium...based on A_Desc ?
And let B_Desc,B_Premium remain in the same order as before sorting?
First off, a list can only be of one type, and only has one "column" of data, so you actually want two lists and a data type that holds "desc" and "premium". "desc" sounds like a String to me; I don't know what Premium is, but I'll pretend it's a double for lack of better ideas. I don't know what this data is supposed to represent, so to me, it's just some thingie.
public class Thingie{
public String desc;
public double premium;
}
That is, of course, a terrible way to define the class- I should instead have desc and premium be private, and Desc and Premium as public Properties with Get and Set methods. But this is the fastest way for me to get the point across.
It's more canonical to make Thingie implement IComparable, and compare itself to other Thingie objects. But I'm editing an answer I wrote before I knew you needed to write a custom type, and had the freedom to just make it implement IComparable. So here's the IComparer approach, which lets you sort objects that don't sort themselves by telling C# how to sort them.
Implement an IComparer that operates over your custom type.
public class ThingieSorter: IComparer<Thingie>{
public int Compare(Thingie t1, Thingie t2){
int r = t1.desc.CompareTo(t2.desc);
if(r != 0){return r;}
return t1.premium.CompareTo(t2.premium);
}
}
C# doesn't require IList to implement Sort- it might be inefficient if it's a LinkedList. So let's make a new list, based on arrays, which does sort efficiently, and sort it:
public List<Thingie> sortedOf(IList<Thingie> list){
List<Thingie> ret = new List<Thingie>(list);
ret.Sort(new ThingieSorter());
return ret;
}
List<Thingie> implements the interface IList<Thingie>, so replacing your original list with this one shouldn't break anything, as long as you have nothing holding onto the original list and magically expecting it to be sorted. If that's happening, refactor your code so it doesn't grab the reference until after your list has been sorted, since it can't be sorted in place.
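For comparison, the same copy-then-sort idea can be sketched in Java, where the two-level comparison is built from key extractors instead of a hand-written comparer class. This is an analogue of the C# answer, not a translation of any library it uses; the name ThingieSorting is invented here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class Thingie {
    final String desc;
    final double premium;

    Thingie(String desc, double premium) {
        this.desc = desc;
        this.premium = premium;
    }
}

class ThingieSorting {
    // Compare by desc first; premium only breaks ties.
    static final Comparator<Thingie> BY_DESC_THEN_PREMIUM =
            Comparator.comparing((Thingie t) -> t.desc)
                    .thenComparingDouble(t -> t.premium);

    // Copies the input into an array-backed list and sorts the copy,
    // so the caller's list keeps its original order.
    static List<Thingie> sortedOf(List<Thingie> list) {
        List<Thingie> ret = new ArrayList<>(list);
        ret.sort(BY_DESC_THEN_PREMIUM);
        return ret;
    }
}
```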

Why isn't .Except (LINQ) comparing things properly? (using IEquatable)

I have two collections of my own reference-type objects that I wrote my own IEquatable.Equals method for, and I want to be able to use LINQ methods on them.
So,
List<CandyType> candy = dataSource.GetListOfCandy();
List<CandyType> lollyPops = dataSource.GetListOfLollyPops();
var candyOtherThanLollyPops = candy.Except( lollyPops );
According to the documentation of .Except, not passing an IEqualityComparer should result in EqualityComparer.Default being used to compare objects. And the documentation for the Default comparer is this:
"The Default property checks whether type T implements the System.IEquatable generic interface and if so returns an EqualityComparer that uses that implementation. Otherwise it returns an EqualityComparer that uses the overrides of Object.Equals and Object.GetHashCode provided by T."
So, because I implement IEquatable for my object, it should use that and work. But, it doesn't. It doesn't work until I override GetHashCode. In fact, if I set a break point, my IEquatable.Equals method never gets executed. This makes me think that it's going with plan B according to its documentation. I understand that overriding GetHashCode is a good idea, anyway, and I can get this working, but I am upset that it is behaving in a way that isn't in line with what its own documentation stated.
Why isn't it doing what it said it would? Thank you.
After investigation, it turns out things aren't quite as bad as I thought. Basically, when everything is implemented properly (GetHashCode, etc.) the documentation is correct, and the behavior is correct. But, if you try to do something like implement IEquatable all by itself, then your Equals method will never get called (this seems to be due to GetHashCode not being implemented properly). So, while the documentation is technically wrong, it's only wrong in a fringe situation that you'd never ever want to do (if this investigation has taught me anything, it's that IEquatable is part of a whole set of methods you should implement atomically (by convention, not by rule, unfortunately)).
Good sources on this are:
Is there a complete IEquatable implementation reference?
MSDN Documentation: IEquatable<T>.Equals(T) Method
SYSK 158: IComparable<T> vs. IEquatable<T>
The interface IEqualityComparer<T> has these two methods:
bool Equals(T x, T y);
int GetHashCode(T obj);
A good implementation of this interface would thus implement both. The Linq extension method Except relies on the hash code in order to use a dictionary or set lookup internally to figure out which objects to skip, and thus requires that proper GetHashCode implementation.
Unfortunately, when you use EqualityComparer<T>.Default, that class does not provide a good GetHashCode implementation by itself, and relies on the object in question, the type T, to provide that part, when it detects that the object implements IEquatable<T>.
The problem here is that IEquatable<T> does not in fact declare GetHashCode so it's much easier to forget to implement that method properly, contrasted with the Equals method that it does declare.
So you have two choices:
Provide a proper IEqualityComparer<T> implementation that implements both Equals and GetHashCode
Make sure that in addition to implementing IEquatable<T> on your object, implement a proper GetHashCode as well
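The same dependency on hash codes exists in Java's hash-based collections, which makes for a compact illustration of the second choice. The sketch below is an analogue, not LINQ itself: the helper except is hand-written for this example and, unlike Except, it does not remove duplicates from the first list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Candy {
    final String name;

    Candy(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof Candy && name.equals(((Candy) obj).name);
    }

    // Without this override, two equal Candy objects would usually get
    // different hash codes, and the set lookup below would miss them
    // before equals is ever called.
    @Override
    public int hashCode() {
        return name.hashCode();
    }
}

class CandyFilter {
    // Hash-based difference: everything in all that is not in excluded.
    static List<Candy> except(List<Candy> all, List<Candy> excluded) {
        Set<Candy> skip = new HashSet<>(excluded);
        List<Candy> result = new ArrayList<>();
        for (Candy c : all) {
            if (!skip.contains(c)) {
                result.add(c);
            }
        }
        return result;
    }
}
```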
Hazarding a guess, are these different classes? I think by default IEquatable only works with the same class. So it might be falling back to the Object.Equals method.
I wrote a GenericEqualityComparer to be used on the fly for these types of methods: Distinct, Except, Intersect, etc.
Use as follows:
var results = list1.Except(list2, new GenericEqualityComparer<MYTYPE>(
(a, b) => a.Id == b.Id)); // or some other comparison resolving to a Boolean
Here's the class:
public class GenericEqualityComparer<T> : EqualityComparer<T>
{
public Func<T, int> HashCodeFunc { get; set; }
public Func<T, T, Boolean> EqualityFunc { get; set; }
public GenericEqualityComparer(Func<T, T, Boolean> equalityFunc)
{
EqualityFunc = equalityFunc;
HashCodeFunc = null;
}
public GenericEqualityComparer(Func<T, T, Boolean> equalityFunc, Func<T, int> hashCodeFunc) : this(equalityFunc)
{
HashCodeFunc = hashCodeFunc;
}
public override bool Equals(T x, T y)
{
return EqualityFunc(x, y);
}
public override int GetHashCode(T obj)
{
if (HashCodeFunc == null)
{
return 1;
}
else
{
return HashCodeFunc(obj);
}
}
}
I ran into this same problem, and debugging led me to a different answer than most. Most people point out that GetHashCode() must be implemented.
However, in my case - which was LINQ's SequenceEqual() - GetHashCode() was never called. And, despite the fact that every object involved was typed to a specific type T, the underlying problem was that SequenceEqual() called T.Equals(object other), which I had forgotten to implement, rather than calling the expected T.Equals(T other).
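Java has a closely related pitfall, sketched below with made-up class names: declaring equals with the specific type as its parameter creates an overload, and collection code that calls equals(Object) never reaches it.

```java
import java.util.ArrayList;
import java.util.List;

class OverloadedId {
    final int id;

    OverloadedId(int id) {
        this.id = id;
    }

    // Overload, not an override: the parameter type is OverloadedId, so
    // callers that hold an Object reference still get Object.equals.
    public boolean equals(OverloadedId other) {
        return other != null && id == other.id;
    }
}

class OverriddenId {
    final int id;

    OverriddenId(int id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof OverriddenId && id == ((OverriddenId) obj).id;
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(id);
    }
}

class EqualsDemo {
    // contains() compares via equals(Object), which here is identity.
    static boolean overloadedFound() {
        List<OverloadedId> list = new ArrayList<>();
        list.add(new OverloadedId(1));
        return list.contains(new OverloadedId(1));
    }

    // Here equals(Object) is overridden, so the value comparison is used.
    static boolean overriddenFound() {
        List<OverriddenId> list = new ArrayList<>();
        list.add(new OverriddenId(1));
        return list.contains(new OverriddenId(1));
    }
}
```

The first lookup fails and the second succeeds, even though both classes appear to define equality on id.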

Resources