Is there a way I can get the dense rank for the elements in an (unsorted) array?
For example, if I have the array [100, 200, 50], I need the relative rank of these elements from highest to lowest, e.g. output -> [2, 1, 3].
I tried to work out how to use arrayEnumerateDense, but to no avail.
You are right, the function arrayEnumerateDense can be used. Apply it to the reverse-sorted array to get the required ranks, then map them back onto the original array.
SELECT
    [100, 200, 50, 200, 50] AS arr,
    arrayReverseSort(arr) AS sorted_arr,
    arrayEnumerateDense(sorted_arr) AS sorted_arr_dense,
    arrayMap(x -> (sorted_arr_dense[indexOf(sorted_arr, x)]), arr) AS arr_dense
FORMAT Vertical
/* Result
Row 1:
──────
arr: [100,200,50,200,50]
sorted_arr: [200,200,100,50,50]
sorted_arr_dense: [1,1,2,3,3]
arr_dense: [2,1,3,1,3]
*/
The same result can be obtained without using arrayEnumerateDense:
SELECT
    [100, 200, 50, 200, 50] AS arr,
    arrayReverseSort(arrayDistinct(arr)) AS sorted_dist_arr,
    arrayMap(x -> indexOf(sorted_dist_arr, x), arr) AS arr_dense
FORMAT Vertical
/* Result
Row 1:
──────
arr: [100,200,50,200,50]
sorted_dist_arr: [200,100,50]
arr_dense: [2,1,3,1,3]
*/
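For reference, the descending dense rank is easy to sanity-check outside ClickHouse. Here is a short Python sketch of the second approach (my own illustration, not part of the queries above):

def dense_rank_desc(arr):
    # rank 1 = largest distinct value, matching the queries above
    order = sorted(set(arr), reverse=True)
    rank = {v: i + 1 for i, v in enumerate(order)}
    return [rank[v] for v in arr]

print(dense_rank_desc([100, 200, 50, 200, 50]))  # [2, 1, 3, 1, 3]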
LightGBM's predict method with pred_contrib=True returns an array of shape (n_samples, (n_features + 1) * n_classes).
What is the order of the data in the second dimension of this array?
In other words, there are two questions:
What is the correct way to reshape this array to use it: shape = (n_samples, n_features + 1, n_classes) or shape = (n_samples, n_classes, n_features + 1)?
In the feature dimension, there are n_features entries, one for each feature, and a (useless) entry for the contribution not related to any feature. What is the order of these entries: feature contributions in the entries 1,..., n_features in the same order they appear in the dataset, with the remaining (useless) entry at index 0, or some other way?
The answers are as follows:
The correct shape is (n_samples, n_classes, n_features + 1).
The feature contributions are in the entries 0, ..., n_features - 1, in the same order the features appear in the dataset, with the remaining (expected-value) entry last, at index n_features.
The following code shows it convincingly:
import lightgbm, pandas, numpy
params = {'objective': 'multiclass', 'num_classes': 4, 'num_iterations': 10000,
          'metric': 'multiclass', 'early_stopping_rounds': 10}
train_df = pandas.DataFrame({'f0': [0, 1, 2, 3] * 50, 'f1': [0, 0, 1] * 66 + [1, 2]}, dtype=float)
val_df = train_df.copy()
train_target = pandas.Series([0, 1, 2, 3] * 50)
val_target = pandas.Series([0, 1, 2, 3] * 50)
train_set = lightgbm.Dataset(train_df, train_target)
val_set = lightgbm.Dataset(val_df, val_target)
model = lightgbm.train(params=params, train_set=train_set, valid_sets=[val_set, train_set])
feature_contribs = model.predict(val_df, pred_contrib=True)
print('Shape of SHAP:', feature_contribs.shape)
# Shape of SHAP: (200, 12)
print('Averages over samples:', numpy.mean(feature_contribs, axis=0))
# Averages over samples: [ 3.99942301e-13 -4.02281771e-13 -4.30029167e+00 -1.90606677e-05
# 1.90606677e-05 -4.04157656e+00 2.24205077e-05 -2.24205077e-05
# -4.04265615e+00 -3.70370401e-15 5.20335728e-18 -4.30029167e+00]
feature_contribs.shape = (200, 4, 3)
print('Mean feature contribs:', numpy.mean(feature_contribs, axis=(0, 1)))
# Mean feature contribs: [ 8.39960111e-07 -8.39960113e-07 -4.17120401e+00]
(Each output appears as a comment on the lines following the corresponding print call.)
The explanation is as follows.
I have created a dataset with two features, with labels identical to the first of these features (f0), so the model learns everything from f0 and nothing from f1.
The key observation is that SHAP contributions of any feature average out to approximately zero over the training data, while the expected-value entry is a roughly constant nonzero number for each class.
After averaging the SHAP output over the samples, we get an array of shape (12,) whose only large entries sit at positions 2, 5, 8, 11 (zero-based), i.e. at every third position.
This shows that the correct shape of this array is (4, 3): n_features + 1 consecutive entries per class.
After reshaping this way and averaging over the samples and the classes, we get an array of shape (3,) whose only large entry is the last one.
This shows that the last entry does not correspond to any feature but to the expected value, and the preceding entries hold the feature contributions in the order the features appear in the dataset.
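As an extra consistency check, here is a sketch reusing model, val_df, and the reshaped feature_contribs from the code above: SHAP contributions plus the expected value must reconstruct the raw, pre-softmax score for every sample and class, which only works out if the reshape grouped the entries correctly.

# For each sample and class, the n_features contributions plus the
# expected-value entry should sum to the raw (pre-softmax) score.
raw_scores = model.predict(val_df, raw_score=True)   # shape (200, 4)
print(numpy.allclose(raw_scores, feature_contribs.sum(axis=2)))
# True -- consistent with the (n_samples, n_classes, n_features + 1) layout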
I have a list of numbers
List<Integer> tensOfMinutes = Arrays.asList(10, 20, 30, 40, 50, 60);
I'm trying to determine which two members of the list above an input Integer minutes falls between.
Example: for an input Integer minutes = 23; I expect to get 20 as an answer.
Any ideas for how to accomplish this while iterating over a stream of tensOfMinutes?
You can do it this way:
List<Integer> l = Arrays.asList(10, 20, 30, 40, 50, 60)
        .stream()
        .filter(i -> i < 23)
        .collect(Collectors.toList());
System.out.println(l.get(l.size() - 1));
You keep only the elements smaller than 23 and print the last of the remaining elements.
It would have been easier with the dropWhile function that we have in Java 9, Scala and Haskell:
https://docs.oracle.com/javase/9/docs/api/java/util/stream/Stream.html#dropWhile-java.util.function.Predicate-
Improved version by Holger
Stream<Integer> stream = Arrays.asList(10, 20, 30, 40, 50, 60).stream();
stream.filter(i -> i < 23)
        .reduce((a, b) -> b)
        .ifPresent(System.out::println);
You can see the (a, b) -> b lambda used to get the last element, and the ifPresent method used to make it safe when no element matches.
List<Integer> list = IntStream.of(10, 20, 6, 7, 81).boxed().collect(Collectors.toList());
You will want the number before the first larger number so try this:
int previous = 0;
for (Integer number : tensOfMinutes)
    if (number <= numberToFind)
        previous = number;
You cycle through all the numbers and remember, in the variable previous, the last one that was smaller than or equal to the number you're searching for. This expects the list to be sorted in ascending order; that's the general approach.
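For what it's worth, since the list is sorted, a binary search finds the floor value directly. Here is a sketch in Python using the standard bisect module (my own illustration; the same idea ports to Java's Collections.binarySearch):

import bisect

tens_of_minutes = [10, 20, 30, 40, 50, 60]

def floor_value(sorted_list, minutes):
    # index of the rightmost element <= minutes
    i = bisect.bisect_right(sorted_list, minutes)
    return sorted_list[i - 1] if i > 0 else None

print(floor_value(tens_of_minutes, 23))  # 20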
I have a square matrix of indeterminate row & column length (assume rows and columns are equal as befits a square).
I've plotted out an example matrix as follows:
matrix = [
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9]
]
My goal is to get a sum from top-left to bottom-right of the diagonal values.
Obviously in this example, this is all I'll need:
diagsum = matrix[0][0]+matrix[1][1]+matrix[2][2]
#=> 15
I see the pattern: the row and column indexes increment by 1 together along the diagonal. For a matrix of indeterminate length (supplied as the argument to my method diagsum), the code would presumably need some sort of row_count method on the matrix argument.
If
arr = [[1,2,3],
[4,5,6],
[7,8,9]]
then:
require 'matrix'
Matrix[*arr].trace
#=> 15
This will sum the diagonal values.
matrix = []
matrix[0] = [1,2,3]
matrix[1] = [4,5,6]
matrix[2] = [7,8,9]
def diagsum(mat)
  sum = 0
  mat.each_with_index { |row, i| sum += row[i] }
  sum
end
puts diagsum(matrix) # 15
It is not clear what x is.
But assuming that it is the number of columns/rows, you have 0..x, while the index only goes up to x - 1. You should change it to 0...x.
You are assigning to the variable i, whose scope is limited to the block.
You are only using i once, perhaps intending it to correspond to either the row or the column, but not both.
You are adding the indices instead of the corresponding elements.
each returns its receiver regardless of what you compute in the block.
puts returns nil regardless of what you pass it.
I know that this question has been asked, and there is a very nice, elegant solution using a min heap.
My question is how one would do this using the merge function of merge sort.
You already have an array of sorted arrays, so you should be able to merge all of them into one array in O(n log k) time, correct?
I just can't figure out how to do this!
Say I have
[ [5,6], [3,4], [1,2], [0] ]
Step 1: [ [3,4,5,6], [0,1,2] ]
Step 2: [ [0,1,2,3,4,5,6] ]
Is there a simple way to do this? Is O(n log k) theoretically achievable with merge sort?
As others have said, using the min heap to hold the next items is the optimal way. It's called an N-way merge. Its complexity is O(n log k).
You can use a 2-way merge algorithm to sort k arrays. Perhaps the easiest way is to modify the standard merge sort so that it uses non-constant partition sizes. For example, imagine that you have 4 arrays with lengths 10, 8, 12, and 33. Each array is sorted. If you concatenated the arrays into one, you would have these partitions (the numbers are indexes into the array, not values):
[0-9][10-17][18-29][30-62]
The first pass of your merge sort would have starting indexes of 0 and 10. You would merge that into a new array, just as you would with the standard merge sort. The next pass would start at positions 18 and 30 in the second array. When you're done with the second pass, your output array contains:
[0-17][18-62]
Now your partitions start at 0 and 18. You merge those two into a single array and you're done.
The only real difference is that rather than starting with a partition size of 2 and doubling, you have non-constant partition sizes. As you make each pass, the new partition size is the sum of the sizes of the two partitions you used in the previous pass. This really is just a slight modification of the standard merge sort.
It will take log(k) passes to do the sort, and at each pass you look at all n items. The algorithm is O(n log k), but with a much higher constant than the N-way merge.
For implementation, build an array of integers that contains the starting indexes of each of your sub arrays. So in the example above you would have:
int[] partitions = [0, 10, 18, 30];
int numPartitions = 4;
Now you do your standard merge sort. But you select your partitions from the partitions array. So your merge would start with:
merge(inputArray, outputArray, part1Index, part2Index, outputStart)
{
    part1Start = partitions[part1Index];
    part2Start = partitions[part2Index];
    part1Length = part2Start - part1Start;
    part2Length = partitions[part2Index + 1] - part2Start;
    // (for the final partition, use the array length as the end instead)
    // now merge part1 and part2 into the output array,
    // starting at outputStart
}
And your main loop would look something like:
while (numPartitions > 1)
{
    for (int p = 0; p < numPartitions; p += 2)
    {
        outputStart = partitions[p];
        merge(inputArray, outputArray, p, p+1, outputStart);
        // update partitions table: the merged run starts
        // where the first of the two partitions started
        partitions[p/2] = partitions[p];
    }
    numPartitions /= 2;
}
That's the basic idea. You'll have to do some work to handle the dangling partition when the number is odd, but in general that's how it's done.
You can also do it by maintaining an array of arrays, and merging each two arrays into a new array, adding that to an output array of arrays. Lather, rinse, repeat.
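A runnable sketch of that last variant in Python (my own illustration, not the pseudocode above): keep a list of sorted runs and merge adjacent pairs until a single run remains, which reproduces the Step 1 / Step 2 picture from the question and is O(n log k) overall.

def merge_two(left, right):
    # standard 2-way merge of two sorted lists
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def merge_k_sorted(arrays):
    # bottom-up 2-way merging of k sorted lists
    runs = [list(a) for a in arrays if a]  # drop empty inputs
    while len(runs) > 1:
        merged = []
        # merge adjacent pairs; an odd run at the end is carried over
        for i in range(0, len(runs) - 1, 2):
            merged.append(merge_two(runs[i], runs[i + 1]))
        if len(runs) % 2 == 1:
            merged.append(runs[-1])
        runs = merged
    return runs[0] if runs else []

print(merge_k_sorted([[5, 6], [3, 4], [1, 2], [0]]))  # [0, 1, 2, 3, 4, 5, 6]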
You should note that when we say the complexity is O(n log k), we assume that n means the TOTAL number of elements in ALL k arrays, i.e. the number of elements in the final merged array.
For example, if you want to merge k arrays that contain n elements each, the total number of elements in the final array will be nk, so the complexity will be O(nk log k).
There are different ways to merge arrays. To accomplish that task in O(N log K) time you can use a structure called a heap (a good structure for implementing a priority queue). I suppose you already have one; if you don't, pick any available implementation: http://en.wikipedia.org/wiki/Heap_(data_structure)
Then you can do that like this:
1. We have A[1..K], the array of arrays to merge, Head[1..K], the current pointer into every array, and Count[1..K], the number of items in every array.
2. We have a heap of pairs (Value: int; NumberOfArray: int), empty at the start.
3. We put the first item of every array onto the heap: the initialization phase.
4. Then we run a cycle:
5. Get the pair (Value, NumberOfArray) from the heap.
6. Value is the next value to output.
7. NumberOfArray is the array from which we need to take the next item (if any) and place it on the heap.
8. If the heap is not empty, repeat from step 5.
So for every item we operate on a heap built from at most K items, which means we get the O(N log K) complexity you asked for.
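A compact Python sketch of these steps using the standard heapq module (my own illustration of the algorithm above, with (value, array number, position) tuples standing in for the pairs):

import heapq

def kway_merge(arrays):
    # N-way merge of k sorted arrays via a min-heap: O(n log k)
    heap, out = [], []
    # initialization: push the first item of every non-empty array,
    # tagged with (array number, position) so we know where to refill from
    for k, arr in enumerate(arrays):
        if arr:
            heapq.heappush(heap, (arr[0], k, 0))
    while heap:
        value, k, i = heapq.heappop(heap)   # smallest pending value
        out.append(value)
        if i + 1 < len(arrays[k]):          # refill from the same array
            heapq.heappush(heap, (arrays[k][i + 1], k, i + 1))
    return out

print(kway_merge([[5, 6], [3, 4], [1, 2], [0]]))  # [0, 1, 2, 3, 4, 5, 6]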
I implemented it in Python. The main idea is similar to merge sort: there are k arrays in lists, and in the function mainMergeK we just divide lists (k of them) into left (k/2) and right (k/2) halves. The recursion therefore has log(k) levels of partitioning. Regarding the function merge, it is easy to see that its runtime is O(n). Putting it together, we get O(n log k).
By the way, it can also be implemented with a min heap; there is a link: Merging K Sorted Lists using Priority Queue
def mainMergeK(*lists):
    # implemented by k-way partition
    k = len(lists)
    if k > 1:
        mid = int(k / 2)
        B = mainMergeK(*lists[0: mid])
        C = mainMergeK(*lists[mid:])
        A = merge(B, C)
        print(B, '+', C, '=', A)
        return A
    return lists[0]

def merge(B, C):
    A = []
    p = len(B)
    q = len(C)
    i = 0
    j = 0
    while i < p and j < q:
        if B[i] <= C[j]:
            A.append(B[i])
            i += 1
        else:
            A.append(C[j])
            j += 1
    if i == p:
        for c in C[j:]:
            A.append(c)
    else:
        for b in B[i:]:
            A.append(b)
    return A

if __name__ == '__main__':
    x = mainMergeK([1, 3, 5], [2, 4, 6], [7, 8, 10], [9])
    print(x)
The output looks like this:
[1, 3, 5] + [2, 4, 6] = [1, 2, 3, 4, 5, 6]
[7, 8, 10] + [9] = [7, 8, 9, 10]
[1, 2, 3, 4, 5, 6] + [7, 8, 9, 10] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Just do it like a 2-way merge, except with K items. That will result in O(NK). If you want O(N log K), you will need to use a min-heap to keep track of the K pointers (with the source array as metadata) in the algorithm below:
Keep an array of K elements, i.e. K pointers showing the position in each array.
Mark all K pointers as valid.
loop:
Compare the values at the K valid pointers and select the minimum (on ties, take the pointer with the lowest index); advance that pointer to the next value in its array. If the advanced pointer has crossed the end of its array, mark it invalid.
Add the least value to the result.
Repeat till all K pointers are invalid.
For example:
Positions Arrays
p1:0 Array 1: 0 5 10
p2:3 Array 2: 3 6 9
p3:2 Array 3: 2 4 6
Output (min of 0,3,2)=> 0. So output is {0}
Array
p1:5 0 5 10
p2:3 3 6 9
p3:2 2 4 6
Output (min of 5,3,2)=> 2. So {0,2}
Array
p1:5 0 5 10
p2:3 3 6 9
p3:4 2 4 6
Output (min of 5,3,4)=>3. So {0,2,3}
..and so on..until you come to a state where output is {0,2,3,4,5,6}
Array
p1:5 0 5 10
p2:9 3 6 9
p3:6 2 4 6
Output (min of 5,9,6) => 6. So {0,2,3,4,5,6} + {6}, and you mark p3 as "invalid", as you have exhausted that array. (Or, if you are using a min-heap, you simply remove the min item, get its source array from the metadata, in this case array 3, see that it's done, and add nothing new to the min-heap.)
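For what it's worth, Python's standard library ships this exact min-heap scheme as heapq.merge, so the whole trace above collapses to:

from heapq import merge

print(list(merge([0, 5, 10], [3, 6, 9], [2, 4, 6])))
# [0, 2, 3, 4, 5, 6, 6, 9, 10]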
I have an array of non-negative values. I want to build an array of values whose sum is 20, such that they are proportional to the first array.
This would be an easy problem, except that I want the proportional array to sum to exactly 20, compensating for any rounding error.
For example, the array
input = [400, 400, 0, 0, 100, 50, 50]
would yield
output = [8, 8, 0, 0, 2, 1, 1]
sum(output) = 20
However, most cases are going to have a lot of rounding errors, like
input = [3, 3, 3, 3, 3, 3, 18]
naively yields
output = [1, 1, 1, 1, 1, 1, 10]
sum(output) = 16 (ouch)
Is there a good way to apportion the output array so that it adds up to 20 every time?
There's a very simple answer to this question: I've done it many times. After each assignment into the new array, you reduce the values you're working with as follows:
Call the first array A, and the new, proportional array B (which starts out empty).
Call the sum of A elements T
Call the desired sum S.
For each element of the array (i) do the following:
a. B[i] = round(A[i] / T * S). (rounding to nearest integer, penny or whatever is required)
b. T = T - A[i]
c. S = S - B[i]
That's it! Easy to implement in any programming language or in a spreadsheet.
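A minimal Python sketch of steps a-c (my own illustration; it uses explicit round-half-up so ties break the way the walkthrough below assumes, rather than Python's round-half-to-even):

import math

def apportion(a, s):
    # scale the non-negative list `a` to integers summing to exactly `s`
    t = sum(a)                                # T: remaining input weight
    b = []
    for x in a:
        v = int(math.floor(x / t * s + 0.5))  # round half up
        b.append(v)
        t -= x                                # shrink remaining weight...
        s -= v                                # ...and remaining target together
    return b

print(apportion([3, 3, 3, 3, 3, 3, 18], 20))  # [2, 2, 2, 2, 2, 1, 9]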
The solution is optimal in that the resulting array's elements will never be more than 1 away from their ideal, non-rounded values. Let's demonstrate with your example:
T = 36, S = 20. B[1] = round(A[1] / T * S) = 2. (ideally, 1.666....)
T = 33, S = 18. B[2] = round(A[2] / T * S) = 2. (ideally, 1.666....)
T = 30, S = 16. B[3] = round(A[3] / T * S) = 2. (ideally, 1.666....)
T = 27, S = 14. B[4] = round(A[4] / T * S) = 2. (ideally, 1.666....)
T = 24, S = 12. B[5] = round(A[5] / T * S) = 2. (ideally, 1.666....)
T = 21, S = 10. B[6] = round(A[6] / T * S) = 1. (ideally, 1.666....)
T = 18, S = 9. B[7] = round(A[7] / T * S) = 9. (ideally, 10)
Notice that, comparing every value in B with its ideal value in parentheses, the difference is never more than 1.
It's also interesting to note that rearranging the elements in the array can result in different corresponding values in the resulting array. I've found that arranging the elements in ascending order is best, because it results in the smallest average percentage difference between actual and ideal.
Your problem is similar to proportional representation, where you want to share N seats (in your case 20) among parties proportionally to the votes they obtained, in your case [3, 3, 3, 3, 3, 3, 18].
There are several methods used in different countries to handle the rounding problem. My code below uses the Hagenbach-Bischoff quota method used in Switzerland, which basically allocates the seats remaining after an integer division by (N+1) to the parties with the highest remainders:
def proportional(nseats, votes):
    """assign n seats proportionally to votes using the Hagenbach-Bischoff quota
    :param nseats: int number of seats to assign
    :param votes: iterable of int or float weighting each party
    :result: list of ints, seats allocated to each party
    """
    quota = sum(votes) / (1. + nseats)  # force float
    frac = [vote / quota for vote in votes]
    res = [int(f) for f in frac]
    n = nseats - sum(res)  # number of seats remaining to allocate
    if n == 0: return res  # done
    if n < 0: return [min(x, nseats) for x in res]  # see siamii's comment
    # give the remaining seats to the n parties with the largest remainder
    remainders = [ai - bi for ai, bi in zip(frac, res)]
    limit = sorted(remainders, reverse=True)[n - 1]
    # the n parties with a remainder larger than limit get an extra seat
    for i, r in enumerate(remainders):
        if r >= limit:
            res[i] += 1
            n -= 1  # attempt to handle perfect equality
            if n == 0: return res  # done
    raise RuntimeError('should never happen')
However, this method doesn't always give the same number of seats to parties with perfect equality, as in your case:
proportional(20, [3, 3, 3, 3, 3, 3, 18])
[2, 2, 2, 2, 1, 1, 10]
You have set 3 incompatible requirements. An integer-valued array proportional to [1,1,1] cannot be made to sum to exactly 20. You must choose to break one of the "sum to exactly 20", "proportional to input", and "integer values" requirements.
If you choose to break the requirement for integer values, then use floating point or rational numbers. If you choose to break the exact sum requirement, then you've already solved the problem. Choosing to break proportionality is a little trickier. One approach you might take is to figure out how far off your sum is, and then distribute corrections randomly through the output array. For example, if your input is:
[1, 1, 1]
then you could first make it sum as well as possible while still being proportional:
[7, 7, 7]
and since 20 - (7+7+7) = -1, choose one element to decrement at random:
[7, 6, 7]
If the error was 4, you would choose four elements to increment.
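A rough Python sketch of that random-correction idea (my own illustration; a real version would need to guard against decrementing a zero entry below zero):

import random

def proportional_sum(values, total=20):
    scale = total / sum(values)
    out = [round(v * scale) for v in values]
    error = total - sum(out)        # how many units the rounded sum is off by
    sign = 1 if error > 0 else -1
    # spread the correction over randomly chosen positions
    for i in random.sample(range(len(out)), abs(error)):
        out[i] += sign
    return out

print(proportional_sum([1, 1, 1]))  # e.g. [7, 6, 7]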
A naïve solution that doesn't perform well, but will provide the right result...
Write a function that, given a candidate array and the input array, outputs the index of the element that is farthest from being proportional to the others (pseudocode):
function next_index(candidate, input)
    // Calculate weights (zero-valued inputs would need special-casing)
    for i in 1 .. length(input)
        w[i] = candidate[i] / input[i]
    end for
    // find the smallest weight
    min = infinity
    min_index = 1
    for i in 1 .. length(input)
        if w[i] < min then
            min = w[i]
            min_index = i
        end if
    end for
    return min_index
end function
Then just do this:
result = array of zeros, same length as input
repeat 20 times: result[next_index(result, input)]++
If there is no optimal solution, it'll skew towards the beginning of the array.
Using the approach above, you can reduce the number of iterations by rounding down (as you did in your example) and then just use the approach above to add what has been left out due to rounding errors:
result = <<approach using rounding down>>
while sum(result) < 20
    result[next_index(result, input)]++
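Here is a runnable Python sketch of the whole scheme (my own illustration; next_index is the pseudocode above translated directly, skipping zero-weight entries):

import math

def next_index(candidate, input_arr):
    # index whose current share is furthest below proportional
    best, best_w = 0, math.inf
    for i, (c, v) in enumerate(zip(candidate, input_arr)):
        if v == 0:
            continue                 # zero-weight entries never get units
        w = c / v
        if w < best_w:
            best, best_w = i, w
    return best

def proportion_topup(input_arr, total=20):
    scale = total / sum(input_arr)
    result = [int(v * scale) for v in input_arr]   # round down first
    while sum(result) < total:                     # top up what rounding lost
        result[next_index(result, input_arr)] += 1
    return result

print(proportion_topup([3, 3, 3, 3, 3, 3, 18]))  # [2, 2, 2, 2, 1, 1, 10]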
So the answers and comments above were helpful, particularly the decreasing-sum comment from @Frederik.
The solution I came up with takes advantage of the fact that, for an input array v, the total of all the v_i * 20 terms is 20 * sum(v), which is exactly divisible by sum(v). So for each value in v, I multiply by 20 and divide by the sum. I keep the quotient and accumulate the remainder. Whenever the accumulator reaches sum(v), I add one to the value. That way I'm guaranteed that all the remainders get rolled into the results.
Is that legible? Here's the implementation in Python:
def proportion(values, total):
    # set up by getting the sum of the values and starting
    # with an empty result list and accumulator
    sum_values = sum(values)
    new_values = []
    acc = 0
    for v in values:
        # for each value, find quotient and remainder
        q, r = divmod(v * total, sum_values)
        if acc + r < sum_values:
            # if the accumulator plus remainder is too small, just add and move on
            acc += r
        else:
            # we've accumulated enough to go over sum(values), so add 1 to result
            if acc > r:
                # add to previous
                new_values[-1] += 1
            else:
                # add to current
                q += 1
            acc -= sum_values - r
        # save the new value
        new_values.append(q)
    # accumulator is guaranteed to be zero at the end
    print(new_values, sum_values, acc)
    return new_values
(I added an enhancement: if the accumulator > remainder, I increment the previous value instead of the current one.)
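For example, a quick check of the sketch above against the two arrays from the question (my own run; note the function also prints its internal state before returning):

print(proportion([400, 400, 0, 0, 100, 50, 50], 20))  # [8, 8, 0, 0, 2, 1, 1]
print(proportion([3, 3, 3, 3, 3, 3, 18], 20))         # [1, 2, 2, 1, 2, 2, 10]

Both results sum to exactly 20.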