Comparator expalantion in Kotlin

Comparator expalantion in Kotlin - sorting

val list = listOf(7,3,5,9,1,3)
list.sortedWith(Comparator<Int>{ a, b ->
when {
a > b -> 1
a < b -> -1
else -> 0
}
})
Answer - [1, 3, 3, 5, 7, 9]
Above code sort the given list in Kotlin.Can someone explain how this comparator works and do the sorting.I'm struggling to understand how this list gets sorted. Thanks in advance,

Short answer: a Comparator tells sorting &c methods how to compare two objects, i.e. what order they should be in.  The code in the question doesn't actually need one, because Ints already have a natural ordering.  But it's perfectly good code, and illustrates what to do for objects that don't.
This is explained pretty well in the docs for Comparator… (In general, the docs are pretty good for all the Java standard library, so that's a good place to start.)
Some objects have a ‘natural ordering’; for example, there's a very obvious ordering for numbers (numerical order) and strings (dictionary order).  In Java and Kotlin, that's indicated by the objects implementing the Comparable interface.  All the standard sorting, ordering, &c functions know from that what order to use.
But what about objects that don't have a natural order?  In that case, if you want to sort them or do anything that involves ordering, you'll have to explain what order to use.  You do that by providing an object implementing Comparator.
This has a single method int compare(T o1, T o2): the implementation has to decide whether o1 is less than, greater than, or equal to o2.  It does this by returning a negative number if o1 < o2, zero if o1 = o2, or a positive number if o1 > o2.
That's what the code in the question does.
The List.sortedWith() method uses the comparator to create a sorted version of the list.  The sorting algorithm it uses will determine exactly which objects it compares, but for example it might start by comparing the first two items in the list: 7 and 3.  It would call the comparator with compare(7, 3), and the comparator will return a positive number (in this case 1) to indicate that 7 should come after 3 in the list.  It would then continue processing the list and making further comparisons until it can produce a list in order.
As I said, in this case there's no need to write a comparator, because Ints already have a natural ordering!  So you could simply use list.sorted().  (That would then use Int's own compareTo() method, which works in a similar way.)
But there's no harm in using your own comparator, and the code in the question shows how it could be done.

Related

Given 2 arrays, returns elements that are not included in both arrays

I had an interview, and did one of the questions described below:
Given two arrays, please calculate the result: get the union and then remove the intersection from the union. e.g.
int a[] = {1, 3, 4, 5, 7};
int b[] = {5, 3, 8, 10}; // didn't mention if has the same value.
result = {1,4,7,8,10}
This is my idea:
Sort a, b.
Check each item of b using 'dichotomy search' in a. If not found, pass. Otherwise, remove this item from both a, b
result = elements left in a + elements left in b
I know it is a lousy algorithm, but nonetheless it's better than nothing. Is there a better approach than this one?

There are many approaches to this problem. one approach is:
1. construct hash-map using distinct array elements of array a with elements as keys and 1 is a value.
2. for every element,e in array b
if e in hash-map
set value of that key to 0
else
add e to result array.
3.add all keys from hash-map whose values 1 to result array.

another approach may be:
join both lists
sort the joined list
walk through the joined list and completely remove any elements that occurs multiple times
this have one drawback: it does not work if input lists already have doublets. But since we are talking about sets and set theory i would also expect the inputs to be sets in the mathematical sense.

Another (in my opinion the best) approach:
you do not need a search through your both lists. you can just sequentially iterate through them:
sort a and b
declare an empty result set
take iterators to both lists and repeat the following steps:
if the iterators values are unequal: add the smaller number to the result set and increment the belonging iterator
if the iterators values are equal: increment both iterators without adding something to the result set
if one iterator reaches end: add all remaining elements of the other set to the result

How does the sort method work? Need to understand the source code

I tried to understand the sort method used like:
(1..10).sort {|a,b| b <=> a} #=> [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
by looking at its source code:
static VALUE
enum_sort(VALUE obj)
{
return rb_ary_sort(enum_to_a(0, 0, obj));
}
but, I cannot understand how sort works. Please help me to understand how it does the sorting.

Any time I have questions like this I find that it is helpful to look at the same method definition in Rubinius. Rubinius is a port of ruby written entirely in ruby which makes it much easier to read(for me, at least).
In this case, Enumberable#sort just transforms the Enumerable object to an Array object then calls Array#sort.
In Rubinius, Array#sort just calls Array#sortinplace which in turn calls either Array#isort or Array#mergesort, depending on the array size.
Both isort(insertion sort) and mergesort are common sorting algorithims that will be nearly identical regardless of which language they are written in and can be easily googled to get a better understanding.

1..10 is short for an array with the numbers from 1 to 10 (step size 1).
|a,b| picks two numbers from the array and b <=> a compares it. It delivers +1 when b is bigger than a, 0 when equal or -1 when smaller, e.g. for 10 <=> 9 the result is +1.
Based on that, an ordinary array sort is performed. Honestly, i don't know what sorting algorithm Ruby uses.

Ruby uses quicksort according to the latest source code

Lists Hash function

I'm trying to make a hash function so I can tell if too lists with same sizes contain the same elements.
For exemple this is what I want:
f((1 2 3))=f((1 3 2))=f((2 1 3))=f((2 3 1))=f((3 1 2))=f((3 2 1)).
Any ideea how can I approch this problem ? I've tried doing the sum of squares of all elements but it turned out that there are collisions,for exemple f((2 2 5))=33=f((1 4 4)) which is wrong as the lists are not the same.
I'm looking for a simple approach if there is any.

Sort the list and then:
list.each do |current_element|
hash = (37 * hash + current_element) % MAX_HASH_VALUE
end

You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.
Possibly you have N, k such that you could still make this work. Good luck.
If you allow occasional collisions, you have lots of options. From simple things like your suggestion (add squares of elements) and computing xor the elements, to complicated things like sort them, print them to a string, and compute MD5 on them. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).

So you are looking something provides these properties,
1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1
2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.
Knuth's Multiplicative Method
In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write hash function. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.
hash(i)=i*2654435761 mod 2^32
Since 2654435761 and 2^32 has no common factors in common, the multiplication produces a complete mapping of the key to hash result with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary in the upper bits. As is true in all multiplications, variations of upper digits do not influence the lower digits of the multiplication result.
Robert Jenkins' 96 bit Mix Function
Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.
All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.
The following source is the mixing part of the hash function.
int mix(int a, int b, int c)
{
a=a-b; a=a-c; a=a^(c >>> 13);
b=b-c; b=b-a; b=b^(a << 8);
c=c-a; c=c-b; c=c^(b >>> 13);
a=a-b; a=a-c; a=a^(c >>> 12);
b=b-c; b=b-a; b=b^(a << 16);
c=c-a; c=c-b; c=c^(b >>> 5);
a=a-b; a=a-c; a=a^(c >>> 3);
b=b-c; b=b-a; b=b^(a << 10);
c=c-a; c=c-b; c=c^(b >>> 15);
return c;
}
You can read details from here

If all the elements are numbers and they have a maximum, this is not too complicated, you sort those elements and then you put them together one after the other in the base of your maximum+1.
Hard to describe in words...
For example, if your maximum is 9 (that makes it easy to understand), you'd have :
f(2 3 9 8) = f(3 8 9 2) = 2389
If you maximum was 99, you'd have :
f(16 2 76 8) = (0)2081676
In your example with 2,2 and 5, if you know you would never get anything higher than 5, you could "compose" the result in base 6, so that would be :
f(2 2 5) = 2*6^2 + 2*6 + 5 = 89
f(1 4 4) = 1*6^2 + 4*6 + 4 = 64

Combining hash values is hard, I've found this way (no explanation, though perhaps someone would recognize it) within Boost:
template <class T>
void hash_combine(size_t& seed, T const& v)
{
seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
It should be fast since there is only shifting, additions and xor taking place (apart from the actual hashing).
However the requirement than the order of the list does not influence the end-result would mean that you first have to sort it which is an O(N log N) operation, so it may not fit.
Also, since it's impossible without more stringent boundaries to provide a collision free hash function, you'll still have to actually compare the sorted lists if ever the hash are equals...

I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.
[...] but it turned out that there are collisions
These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem appropriate to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the general usual case, it cannot, because there are more potential inputs than potential outputs.
Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.
As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as if they were the prime powers in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is a hash value.
For example:
2 1 3
sorted becomes
1 2 3
Treating this as prime powers gives
2^1.3^2.5^3
which construct
2.9.125 = 2250
giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that do not overflow the maximum hash value when computed.

A naïve approach to solving your essential problem (comparing lists in an order-insensitive manner) is to convert all lists being compared to a set (set in Python or HashSet in Java). This is more effective than making a hash function since a perfect hash seems essential to your problem. For almost any other approach collisions are inevitable depending on input.

Good hash function for permutations?

I have got numbers in a specific range (usually from 0 to about 1000). An algorithm selects some numbers from this range (about 3 to 10 numbers). This selection is done quite often, and I need to check if a permutation of the chosen numbers has already been selected.
e.g one step selects [1, 10, 3, 18] and another one [10, 18, 3, 1] then the second selection can be discarded because it is a permutation.
I need to do this check very fast. Right now I put all arrays in a hashmap, and use a custom hash function: just sums up all the elements, so 1+10+3+18=32, and also 10+18+3+1=32. For equals I use a bitset to quickly check if elements are in both sets (I do not need sorting when using the bitset, but it only works when the range of numbers is known and not too big).
This works ok, but can generate lots of collisions, so the equals() method is called quite often. I was wondering if there is a faster way to check for permutations?
Are there any good hash functions for permutations?
UPDATE
I have done a little benchmark: generate all combinations of numbers in the range 0 to 6, and array length 1 to 9. There are 3003 possible permutations, and a good hash should generated close to this many different hashes (I use 32 bit numbers for the hash):
41 different hashes for just adding (so there are lots of collisions)
8 different hashes for XOR'ing values together
286 different hashes for multiplying
3003 different hashes for (R + 2e) and multiplying as abc has suggested (using 1779033703 for R)
So abc's hash can be calculated very fast and is a lot better than all the rest. Thanks!
PS: I do not want to sort the values when I do not have to, because this would get too slow.

One potential candidate might be this.
Fix a odd integer R.
For each element e you want to hash compute the factor (R + 2*e).
Then compute the product of all these factors.
Finally divide the product by 2 to get the hash.
The factor 2 in (R + 2e) guarantees that all factors are odd, hence avoiding
that the product will ever become 0. The division by 2 at the end is because
the product will always be odd, hence the division just removes a constant bit.
E.g. I choose R = 1779033703. This is an arbitrary choice, doing some experiments should show if a given R is good or bad. Assume your values are [1, 10, 3, 18].
The product (computed using 32-bit ints) is
(R + 2) * (R + 20) * (R + 6) * (R + 36) = 3376724311
Hence the hash would be
3376724311/2 = 1688362155.

Summing the elements is already one of the simplest things you could do. But I don't think it's a particularly good hash function w.r.t. pseudo randomness.
If you sort your arrays before storing them or computing hashes, every good hash function will do.
If it's about speed: Have you measured where the bottleneck is? If your hash function is giving you a lot of collisions and you have to spend most of the time comparing the arrays bit-by-bit the hash function is obviously not good at what it's supposed to do. Sorting + Better Hash might be the solution.

If I understand your question correctly you want to test equality between sets where the items are not ordered. This is precisely what a Bloom filter will do for you. At the expense of a small number of false positives (in which case you'll need to make a call to a brute-force set comparison) you'll be able to compare such sets by checking whether their Bloom filter hash is equal.
The algebraic reason why this holds is that the OR operation is commutative. This holds for other semirings, too.

depending if you have a lot of collisions (so the same hash but not a permutation), you might presort the arrays while hashing them. In that case you can do a more aggressive kind of hashing where you don't only add up the numbers but add some bitmagick to it as well to get quite different hashes.
This is only beneficial if you get loads of unwanted collisions because the hash you are doing now is too poor. If you hardly get any collisions, the method you are using seems fine

I would suggest this:
1. Check if the lengths of permutations are the same (if not - they are not equal)
Sort only 1 array. Instead of sorting another array iterate through the elements of the 1st array and search for the presence of each of them in the 2nd array (compare only while the elements in the 2nd array are smaller - do not iterate through the whole array).
note: if you can have the same numbers in your permutaions (e.g. [1,2,2,10]) then you will need to remove elements from the 2nd array when it matches a member from the 1st one.
pseudo-code:
if length(arr1) <> length(arr2) return false;
sort(arr2);
for i=1 to length(arr1) {
elem=arr1[i];
j=1;
while (j<=length(arr2) and elem<arr2[j]) j=j+1;
if elem <> arr2[j] return false;
}
return true;
the idea is that instead of sorting another array we can just try to match all of its elements in the sorted array.

You can probably reduce the collisions a lot by using the product as well as the sum of the terms.
1*10*3*18=540 and 10*18*3*1=540
so the sum-product hash would be [32,540]
you still need to do something about collisions when they do happen though

I like using string's default hash code (Java, C# not sure about other languages), it generates pretty unique hash codes.
so if you first sort the array, and then generates a unique string using some delimiter.
so you can do the following (Java):
int[] arr = selectRandomNumbers();
Arrays.sort(arr);
int hash = (arr[0] + "," + arr[1] + "," + arr[2] + "," + arr[3]).hashCode();
if performance is an issue, you can change the suggested inefficient string concatenation to use StringBuilder or String.format
String.format("{0},{1},{2},{3}", arr[0],arr[1],arr[2],arr[3]);
String hash code of course doesn't guarantee that two distinct strings have different hash, but considering this suggested formatting, collisions should be extremely rare

sorting algorithm where pairwise-comparison can return more information than -1, 0, +1

Most sort algorithms rely on a pairwise-comparison the determines whether A < B, A = B or A > B.
I'm looking for algorithms (and for bonus points, code in Python) that take advantage of a pairwise-comparison function that can distinguish a lot less from a little less or a lot more from a little more. So perhaps instead of returning {-1, 0, 1} the comparison function returns {-2, -1, 0, 1, 2} or {-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5} or even a real number on the interval (-1, 1).
For some applications (such as near sorting or approximate sorting) this would enable a reasonable sort to be determined with less comparisons.

The extra information can indeed be used to minimize the total number of comparisons. Calls to the super_comparison function can be used to make deductions equivalent to a great number of calls to a regular comparsion function. For example, a much-less-than b and c little-less-than b implies a < c < b.
The deductions cans be organized into bins or partitions which can each be sorted separately. Effectively, this is equivalent to QuickSort with n-way partition. Here's an implementation in Python:
from collections import defaultdict
from random import choice
def quicksort(seq, compare):
'Stable in-place sort using a 3-or-more-way comparison function'
# Make an n-way partition on a random pivot value
segments = defaultdict(list)
pivot = choice(seq)
for x in seq:
ranking = 0 if x is pivot else compare(x, pivot)
segments[ranking].append(x)
seq.clear()
# Recursively sort each segment and store it in the sequence
for ranking, segment in sorted(segments.items()):
if ranking and len(segment) > 1:
quicksort(segment, compare)
seq += segment
if __name__ == '__main__':
from random import randrange
from math import log10
def super_compare(a, b):
'Compare with extra logarithmic near/far information'
c = -1 if a < b else 1 if a > b else 0
return c * (int(log10(max(abs(a - b), 1.0))) + 1)
n = 10000
data = [randrange(4*n) for i in range(n)]
goal = sorted(data)
quicksort(data, super_compare)
print(data == goal)
By instrumenting this code with the trace module, it is possible to measure the performance gain. In the above code, a regular three-way compare uses 133,000 comparisons while a super comparison function reduces the number of calls to 85,000.
The code also makes it easy to experiment with a variety comparison functions. This will show that naïve n-way comparison functions do very little to help the sort. For example, if the comparison function returns +/-2 for differences greater than four and +/-1 for differences four or less, there is only a modest 5% reduction in the number of comparisons. The root cause is that the course grained partitions used in the beginning only have a handful of "near matches" and everything else falls in "far matches".
An improvement to the super comparison is to covers logarithmic ranges (i.e. +/-1 if within ten, +/-2 if within a hundred, +/- if within a thousand.
An ideal comparison function would be adaptive. For any given sequence size, the comparison function should strive to subdivide the sequence into partitions of roughly equal size. Information theory tells us that this will maximize the number of bits of information per comparison.
The adaptive approach makes good intuitive sense as well. People should first be partitioned into love vs like before making more refined distinctions such as love-a-lot vs love-a-little. Further partitioning passes should each make finer and finer distinctions.

You can use a modified quick sort. Let me explain on an example when you comparison function returns [-2, -1, 0, 1, 2]. Say, you have an array A to sort.
Create 5 empty arrays - Aminus2, Aminus1, A0, Aplus1, Aplus2.
Pick an arbitrary element of A, X.
For each element of the array, compare it with X.
Depending on the result, place the element in one of the Aminus2, Aminus1, A0, Aplus1, Aplus2 arrays.
Apply the same sort recursively to Aminus2, Aminus1, Aplus1, Aplus2 (note: you don't need to sort A0, as all he elements there are equal X).
Concatenate the arrays to get the final result: A = Aminus2 + Aminus1 + A0 + Aplus1 + Aplus2.

It seems like using raindog's modified quicksort would let you stream out results sooner and perhaps page into them faster.
Maybe those features are already available from a carefully-controlled qsort operation? I haven't thought much about it.
This also sounds kind of like radix sort except instead of looking at each digit (or other kind of bucket rule), you're making up buckets from the rich comparisons. I have a hard time thinking of a case where rich comparisons are available but digits (or something like them) aren't.

I can't think of any situation in which this would be really useful. Even if I could, I suspect the added CPU cycles needed to sort fuzzy values would be more than those "extra comparisons" you allude to. But I'll still offer a suggestion.
Consider this possibility (all strings use the 27 characters a-z and _):
11111111112
12345678901234567890
1/ now_is_the_time
2/ now_is_never
3/ now_we_have_to_go
4/ aaa
5/ ___
Obviously strings 1 and 2 are more similar that 1 and 3 and much more similar than 1 and 4.
One approach is to scale the difference value for each identical character position and use the first different character to set the last position.
Putting aside signs for the moment, comparing string 1 with 2, the differ in position 8 by 'n' - 't'. That's a difference of 6. In order to turn that into a single digit 1-9, we use the formula:
digit = ceiling(9 * abs(diff) / 27)
since the maximum difference is 26. The minimum difference of 1 becomes the digit 1. The maximum difference of 26 becomes the digit 9. Our difference of 6 becomes 3.
And because the difference is in position 8, out comparison function will return 3x10-8 (actually it will return the negative of that since string 1 comes after string 2.
Using a similar process for strings 1 and 4, the comparison function returns -5x10-1. The highest possible return (strings 4 and 5) has a difference in position 1 of '-' - 'a' (26) which generates the digit 9 and hence gives us 9x10-1.
Take these suggestions and use them as you see fit. I'd be interested in knowing how your fuzzy comparison code ends up working out.

Considering you are looking to order a number of items based on human comparison you might want to approach this problem like a sports tournament. You might allow each human vote to increase the score of the winner by 3 and decrease the looser by 3, +2 and -2, +1 and -1 or just 0 0 for a draw.
Then you just do a regular sort based on the scores.
Another alternative would be a single or double elimination tournament structure.

You can use two comparisons, to achieve this. Multiply the more important comparison by 2, and add them together.
Here is a example of what I mean in Perl.
It compares two array references by the first element, then by the second element.
use strict;
use warnings;
use 5.010;
my #array = (
[a => 2],
[b => 1],
[a => 1],
[c => 0]
);
say "$_->[0] => $_->[1]" for sort {
($a->[0] cmp $b->[0]) * 2 +
($a->[1] <=> $b->[1]);
} #array;
a => 1
a => 2
b => 1
c => 0
You could extend this to any number of comparisons very easily.

Perhaps there's a good reason to do this but I don't think it beats the alternatives for any given situation and certainly isn't good for general cases. The reason? Unless you know something about the domain of the input data and about the distribution of values you can't really improve over, say, quicksort. And if you do know those things, there are often ways that would be much more effective.
Anti-example: suppose your comparison returns a value of "huge difference" for numbers differing by more than 1000, and that the input is {0, 10000, 20000, 30000, ...}
Anti-example: same as above but with input {0, 10000, 10001, 10002, 20000, 20001, ...}
But, you say, I know my inputs don't look like that! Well, in that case tell us what your inputs really look like, in detail. Then someone might be able to really help.
For instance, once I needed to sort historical data. The data was kept sorted. When new data were added it was appended, then the list was run again. I did not have the information of where the new data was appended. I designed a hybrid sort for this situation that handily beat qsort and others by picking a sort that was quick on already sorted data and tweaking it to be fast (essentially switching to qsort) when it encountered unsorted data.
The only way you're going to improve over the general purpose sorts is to know your data. And if you want answers you're going to have to communicate that here very well.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio