LightGBM predict with pred_contrib=True for multiclass: order of SHAP values in the returned array

LightGBM's predict method with pred_contrib=True returns an array of shape (n_samples, (n_features + 1) * n_classes).
What is the order of data in the second dimension of this array?
In other words, there are two questions:
What is the correct way to reshape this array to use it: shape = (n_samples, n_features + 1, n_classes) or shape = (n_samples, n_classes, n_features + 1)?
In the feature dimension, there are n_features entries, one for each feature, and a (useless) entry for the contribution not related to any feature. What is the order of these entries: feature contributions in the entries 1,..., n_features in the same order they appear in the dataset, with the remaining (useless) entry at index 0, or some other way?

The answers are as follows:
The correct shape is (n_samples, n_classes, n_features + 1).
The feature contributions are in the entries 0,..., n_features - 1 in the same order they appear in the dataset, with the remaining entry, which is not related to any feature (it holds the expected value for that class), last, at index n_features.
The following code shows it convincingly:
import lightgbm, pandas, numpy
params = {'objective': 'multiclass', 'num_classes': 4, 'num_iterations': 10000,
'metric': 'multiclass', 'early_stopping_rounds': 10}
train_df = pandas.DataFrame({'f0': [0, 1, 2, 3] * 50, 'f1': [0, 0, 1] * 66 + [1, 2]}, dtype=float)
val_df = train_df.copy()
train_target = pandas.Series([0, 1, 2, 3] * 50)
val_target = pandas.Series([0, 1, 2, 3] * 50)
train_set = lightgbm.Dataset(train_df, train_target)
val_set = lightgbm.Dataset(val_df, val_target)
model = lightgbm.train(params=params, train_set=train_set, valid_sets=[val_set, train_set])
feature_contribs = model.predict(val_df, pred_contrib=True)
print('Shape of SHAP:', feature_contribs.shape)
# Shape of SHAP: (200, 12)
print('Averages over samples:', numpy.mean(feature_contribs, axis=0))
# Averages over samples: [ 3.99942301e-13 -4.02281771e-13 -4.30029167e+00 -1.90606677e-05
# 1.90606677e-05 -4.04157656e+00 2.24205077e-05 -2.24205077e-05
# -4.04265615e+00 -3.70370401e-15 5.20335728e-18 -4.30029167e+00]
feature_contribs.shape = (200, 4, 3)
print('Mean feature contribs:', numpy.mean(feature_contribs, axis=(0, 1)))
# Mean feature contribs: [ 8.39960111e-07 -8.39960113e-07 -4.17120401e+00]
(Each output appears as a comment in the following line.)
The explanation is as follows.
I have created a dataset with two features and with labels identical to the first of these features (f0).
SHAP contributions are measured relative to the expected value, and in the printed averages the per-feature contributions average out to nearly zero over the samples. Only the entry that is not related to any feature keeps a large average.
After averaging the SHAP output over the samples, we get an array of the shape (12,) with values far from zero at the positions 2, 5, 8, 11 (zero-based), one per class.
These positions are 3 apart, so each class occupies a contiguous block of n_features + 1 = 3 entries. This shows that the correct shape of this array is (4, 3), i.e. (n_classes, n_features + 1).
After reshaping this way and averaging over the samples and the classes, we get an array of the shape (3,) whose only entry far from zero is the last one.
This shows that the last entry of each block (index n_features) does not correspond to any feature, and the preceding entries correspond to the features.
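The reshaping can be sketched on a synthetic array. This is only an illustration: split_contribs is a hypothetical helper name, the input is a made-up stand-in for the output of predict(..., pred_contrib=True), and the layout assumed is the one the printed averages indicate (one block of n_features + 1 entries per class, with the non-feature entry last in each block).

```python
import numpy as np

def split_contribs(flat, n_classes):
    """Split a flat (n_samples, (n_features + 1) * n_classes) array.

    Returns (feature_contribs, base_values) with shapes
    (n_samples, n_classes, n_features) and (n_samples, n_classes).
    Assumes the non-feature entry is the last one in each class block.
    """
    n_samples, width = flat.shape
    per_class = width // n_classes  # n_features + 1
    cube = flat.reshape(n_samples, n_classes, per_class)
    return cube[:, :, :-1], cube[:, :, -1]

# Synthetic stand-in: 2 samples, 4 classes, 2 features -> 12 columns.
flat = np.arange(24, dtype=float).reshape(2, 12)
shap_values, base_values = split_contribs(flat, n_classes=4)
print(shap_values.shape)  # (2, 4, 2)
print(base_values[0])     # the values from flat positions 2, 5, 8, 11 of sample 0
```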


Rearrange list to satisfy a condition

I was asked this during a coding interview but wasn't able to solve this. Any pointers would be very helpful.
I was given an integer list (think of it as points on a number line) which needs to be rearranged so that the difference between consecutive elements is equal to M (a given integer). The rearrangement must minimize the maximum absolute difference between the elements' new positions and their original positions. This minimized value, multiplied by 2, is returned.
Test cases:
//1.
original_list = [1, 2, 3, 4]
M = 2
rearranged_list = [-0.5, 1.5, 3.5, 5.5]
// difference in values of original and rearranged lists
diff = [1.5, 0.5, 0.5, 1.5]
max_of_diff = 1.5 // list is rearranged in such a way so that this value is minimized
return_val = 1.5 * 2 = 3
//2.
original_list = [1, 2, 4, 3]
M = 2
rearranged_list = [-1, 1, 3, 5]
// difference in values of original and rearranged lists
diff = [2, 1, 1, 2]
max_of_diff = 2 // list is rearranged in such a way so that this value is minimized
return_val = 2 * 2 = 4
Constraints:
1 <= list_length <= 10^5
1 <= M <= 10^4
-10^9 <= list[i] <= 10^9
There's a question on leetcode which is very similar to this: https://leetcode.com/problems/minimize-deviation-in-array/ but there, the operations that are performed on the array are mentioned while that's not been mentioned here. I'm really stumped.
Here is how you can think of it:
The "rearranged" list is like a straight line whose slope is M.
Here is a visualisation for the first example:
The black dots are the input values [1, 2, 3, 4] where the index of the array is the X-coordinate, and the actual value at that index, the Y-coordinate.
The green line is determined by M. Initially this line runs through the origin at (0, 0). The red line segments represent the differences that must be taken into account.
Now the green line has to move vertically to its optimal position. We can see that we only need to look at the difference it makes with the first and with the last point. The other two inputs will never contribute to an extreme. This is generally true: there are only two input elements that need to be taken into account. They are the points that make the greatest (signed -- not absolute) difference and the least difference.
We can see that we need to move the green line in such a way that the signed differences with these two extremes are each other's opposite: i.e. their absolute values become the same, but their signs are opposite.
Twice this absolute difference is what we need to return, and it is actually the difference between the greatest (signed) difference and the least (signed) difference.
So, in conclusion, we must generate the values on the green line, find the least and greatest (signed) difference with the data points (Y-coordinates) and return the difference between those two.
Here is an implementation in JavaScript running the two examples you provided:
function solve(y, slope) {
    let low = Infinity;
    let high = -Infinity;
    for (let x = 0; x < y.length; x++) {
        // signed difference between the data point and the line through the origin
        let dy = y[x] - x * slope;
        low = Math.min(low, dy);
        high = Math.max(high, dy);
    }
    return high - low;
}
console.log(solve([1, 2, 3, 4], 2)); // 3
console.log(solve([1, 2, 4, 3], 2)); // 4
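The same idea can be sketched in Python, this time also recovering the rearranged list by placing the line halfway between the two extreme signed differences. The function names are illustrative only and are not part of the original problem statement.

```python
def signed_offsets(values, m):
    """Signed difference between each value and the line y = i * m."""
    return [y - i * m for i, y in enumerate(values)]

def doubled_max_shift(values, m):
    """Twice the minimal possible maximum shift: the spread of the offsets."""
    offsets = signed_offsets(values, m)
    return max(offsets) - min(offsets)

def rearranged(values, m):
    """Arithmetic sequence with step m whose worst-case shift is minimal."""
    offsets = signed_offsets(values, m)
    start = (min(offsets) + max(offsets)) / 2  # balance the two extremes
    return [start + i * m for i in range(len(values))]

print(doubled_max_shift([1, 2, 3, 4], 2))  # 3
print(rearranged([1, 2, 3, 4], 2))         # [-0.5, 1.5, 3.5, 5.5]
print(doubled_max_shift([1, 2, 4, 3], 2))  # 4
print(rearranged([1, 2, 4, 3], 2))         # [-1.0, 1.0, 3.0, 5.0]
```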

SPSS: select a subset of columns or rows from a matrix

How can I select a subset of columns or rows from a matrix in SPSS?
Given the following example, I want to compute a matrix X2 containing the first two columns of X.
MATRIX.
COMPUTE
X = {1, 2, 2;
0, -1, 1;
1, 1, -2}.
* Compute new matrix X2 that contains the first two columns of X
MAGIC CODE ;)
END MATRIX.
What is the syntax for matrix subsetting operations in SPSS?
You can subset a matrix, so it would be simply COMPUTE XSub = X(:,1:2). Full example below.
MATRIX.
COMPUTE X = {1, 2, 2;
0, -1, 1;
1, 1, -2}.
COMPUTE XSub = X(:,1:2).
PRINT XSub.
END MATRIX.
To the add-on question in the comments: SPSS interprets 1:n as the row vector 1 2 3 ... n. You can also create your own vector to subset the matrix, such as {1,3}, {2,2} or {3,1}. The last example returns the 3rd column first and the 1st column second in the subsetted matrix. Example below:
MATRIX.
COMPUTE X = {1, 2, 2;
0, -1, 1;
1, 1, -2}.
COMPUTE XSub = X(:,{3,1}).
PRINT XSub.
END MATRIX.
Which prints out
Run MATRIX procedure:
XSUB
2 1
1 0
-2 1
------ END MATRIX -----
MATRIX.
COMPUTE X = {1, 2, 3; 4, 5, 6; 7, 8, 9}.
COMPUTE Y=MAKE(NROW(X),2,0).
LOOP i=1 to NROW(Y).
LOOP j=1 to NCOL(Y).
COMPUTE Y(i,j)=X(i,j).
END LOOP.
END LOOP.
PRINT X.
PRINT Y.
END MATRIX.

optimization of pairwise L2 distance computations

I need help optimizing this loop. Matrix_1 is an (n x 2) int matrix and Matrix_2 is an (m x 2) matrix, where m and n vary.
index_j = 1;
for index_k = 1:size(Matrix_1,1)
    for index_l = 1:size(Matrix_2,1)
        M2_Index_Dist(index_j,:) = [index_l, sqrt(bsxfun(@plus,sum(Matrix_1(index_k,:).^2,2),sum(Matrix_2(index_l,:).^2,2)')-2*(Matrix_1(index_k,:)*Matrix_2(index_l,:)'))];
        index_j = index_j + 1;
    end
end
I need M2_Index_Dist to provide a ((n*m) x 2) matrix with the index of matrix_2 in the first column and the distance in the second column.
Output example:
M2_Index_Dist = [ 1, 5.465
2, 56.52
3, 6.21
1, 35.3
2, 56.52
3, 0
1, 43.5
2, 9.3
3, 236.1
1, 8.2
2, 56.52
3, 5.582]
Here's how to apply bsxfun with your formula (||A-B|| = sqrt(||A||^2 + ||B||^2 - 2*A*B)):
d = real(sqrt(bsxfun(@plus, dot(Matrix_1,Matrix_1,2), ...
    bsxfun(@minus, dot(Matrix_2,Matrix_2,2).', 2 * Matrix_1*Matrix_2.')))).';
You can avoid the final transpose if you change your interpretation of the matrix.
Note: There shouldn't be any complex values for real to handle, but it's there in case rounding leaves tiny negative numbers under the square root.
Edit: It may be faster without dot:
d = sqrt(bsxfun(@plus, sum(Matrix_1.*Matrix_1,2), ...
    bsxfun(@minus, sum(Matrix_2.*Matrix_2,2)', 2 * Matrix_1*Matrix_2.'))).';
Or with just one call to bsxfun:
d = sqrt(bsxfun(@plus, sum(Matrix_1.*Matrix_1,2), sum(Matrix_2.*Matrix_2,2)') ...
    - 2 * Matrix_1*Matrix_2.').';
Note: This last order of operations gives results identical to yours, rather than differing by an error of ~1e-14.
Edit 2: To replicate M2_Index_Dist:
II = ndgrid(1:size(Matrix_2,1),1:size(Matrix_1,1));
M2_Index_Dist = [II(:) d(:)];
If I understand correctly, this does what you want:
ind = repmat((1:size(Matrix_2,1)).',size(Matrix_1,1),1); %'// first column: index
d = pdist2(Matrix_2,Matrix_1); %// compute distance between each pair of rows
d = d(:); %// second column: distance
result = [ind d]; %// build result from first column and second column
As you see, this code calls pdist2 to compute the distance between every pair of rows of your matrices. By default this function uses Euclidean distance.
If you don't have pdist2 (which is part of the Statistics Toolbox), you can replace line 2 above with bsxfun:
d = squeeze(sqrt(sum(bsxfun(@minus,Matrix_2,permute(Matrix_1, [3 2 1])).^2,2)));
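For readers working outside MATLAB, the expansion ||A-B||^2 = ||A||^2 + ||B||^2 - 2*A*B can be sketched with NumPy broadcasting. This is an illustrative translation, not code from the answers above; index_dist is a hypothetical name, and the 1-based index column is kept only to mirror the M2_Index_Dist layout from the question.

```python
import numpy as np

def index_dist(matrix_1, matrix_2):
    """Return an (n*m, 2) array: index into matrix_2 (1-based), distance."""
    a = np.asarray(matrix_1, dtype=float)
    b = np.asarray(matrix_2, dtype=float)
    # Squared distances for every pair of rows, shape (n, m).
    sq = (a * a).sum(axis=1)[:, None] + (b * b).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by rounding before the square root.
    d = np.sqrt(np.maximum(sq, 0.0))
    # Row-major flattening: all rows of matrix_2 for the first row of matrix_1, and so on.
    idx = np.tile(np.arange(1, b.shape[0] + 1), a.shape[0])
    return np.column_stack([idx, d.ravel()])

result = index_dist([[0, 0], [3, 4]], [[0, 0], [3, 4], [6, 8]])
print(result)
```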

Allocate an array of integers proportionally compensating for rounding errors

I have an array of non-negative values. I want to build an array of values whose sum is 20 so that they are proportional to the first array.
This would be an easy problem, except that I want the proportional array to sum to exactly 20, compensating for any rounding error.
For example, the array
input = [400, 400, 0, 0, 100, 50, 50]
would yield
output = [8, 8, 0, 0, 2, 1, 1]
sum(output) = 20
However, most cases are going to have a lot of rounding errors, like
input = [3, 3, 3, 3, 3, 3, 18]
naively yields
output = [1, 1, 1, 1, 1, 1, 10]
sum(output) = 16 (ouch)
Is there a good way to apportion the output array so that it adds up to 20 every time?
There's a very simple answer to this question: I've done it many times. After each assignment into the new array, you reduce the values you're working with as follows:
Call the first array A, and the new, proportional array B (which starts out empty).
Call the sum of A elements T
Call the desired sum S.
For each element of the array (i) do the following:
a. B[i] = round(A[i] / T * S). (rounding to nearest integer, penny or whatever is required)
b. T = T - A[i]
c. S = S - B[i]
That's it! Easy to implement in any programming language or in a spreadsheet.
The solution is optimal in that the resulting array's elements will never be more than 1 away from their ideal, non-rounded values. Let's demonstrate with your example:
T = 36, S = 20. B[1] = round(A[1] / T * S) = 2. (ideally, 1.666....)
T = 33, S = 18. B[2] = round(A[2] / T * S) = 2. (ideally, 1.666....)
T = 30, S = 16. B[3] = round(A[3] / T * S) = 2. (ideally, 1.666....)
T = 27, S = 14. B[4] = round(A[4] / T * S) = 2. (ideally, 1.666....)
T = 24, S = 12. B[5] = round(A[5] / T * S) = 2. (ideally, 1.666....)
T = 21, S = 10. B[6] = round(A[6] / T * S) = 1. (ideally, 1.666....)
T = 18, S = 9. B[7] = round(A[7] / T * S) = 9. (ideally, 10)
Notice that when comparing every value in B with its ideal value in parentheses, the difference is never more than 1.
It's also interesting to note that rearranging the elements in the array can result in different corresponding values in the resulting array. I've found that arranging the elements in ascending order is best, because it results in the smallest average percentage difference between actual and ideal.
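The running-total steps above (T, S and B) can be sketched in Python as follows. This is a minimal sketch that assumes the inputs are non-negative integers. The name apportion is illustrative, and exact fractions with round-half-up are a choice made here to avoid floating-point surprises, not something the answer prescribes.

```python
from fractions import Fraction
import math

def apportion(values, total):
    """Distribute `total` proportionally to `values`, correcting as we go."""
    remaining_sum = sum(values)   # T in the description above
    remaining_total = total       # S in the description above
    result = []                   # B in the description above
    for v in values:
        if remaining_sum == 0:
            share = 0             # nothing left to weigh against
        else:
            ideal = Fraction(v * remaining_total, remaining_sum)
            share = math.floor(ideal + Fraction(1, 2))  # round half up
        result.append(share)
        remaining_sum -= v
        remaining_total -= share
    return result

print(apportion([3, 3, 3, 3, 3, 3, 18], 20))         # [2, 2, 2, 2, 2, 1, 9]
print(apportion([400, 400, 0, 0, 100, 50, 50], 20))  # [8, 8, 0, 0, 2, 1, 1]
```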
Your problem is similar to proportional representation, where you want to share N seats (in your case 20) among parties proportionally to the votes they obtain, in your case [3, 3, 3, 3, 3, 3, 18].
There are several methods used in different countries to handle the rounding problem. My code below uses the Hagenbach-Bischoff quota method used in Switzerland, which basically allocates the seats remaining after an integer division by (N+1) to the parties which have the highest remainder:
def proportional(nseats,votes):
    """assign n seats proportionally to votes using Hagenbach-Bischoff quota
    :param nseats: int number of seats to assign
    :param votes: iterable of int or float weighting each party
    :result: list of ints seats allocated to each party
    """
    quota=sum(votes)/(1.+nseats) #force float
    frac=[vote/quota for vote in votes]
    res=[int(f) for f in frac]
    n=nseats-sum(res) #number of seats remaining to allocate
    if n==0: return res #done
    if n<0: return [min(x,nseats) for x in res] # see siamii's comment
    #give the remaining seats to the n parties with the largest remainder
    remainders=[ai-bi for ai,bi in zip(frac,res)]
    limit=sorted(remainders,reverse=True)[n-1]
    #n parties with remainder larger than limit get an extra seat
    for i,r in enumerate(remainders):
        if r>=limit:
            res[i]+=1
            n-=1 # attempt to handle perfect equality
            if n==0: return res #done
    raise RuntimeError("seats left unallocated") #should never happen
However this method doesn't always give the same number of seats to parties with perfect equality as in your case:
proportional(20,[3, 3, 3, 3, 3, 3, 18])
[2,2,2,2,1,1,10]
You have set 3 incompatible requirements. An integer-valued array proportional to [1,1,1] cannot be made to sum to exactly 20. You must choose to break one of the "sum to exactly 20", "proportional to input", and "integer values" requirements.
If you choose to break the requirement for integer values, then use floating point or rational numbers. If you choose to break the exact sum requirement, then you've already solved the problem. Choosing to break proportionality is a little trickier. One approach you might take is to figure out how far off your sum is, and then distribute corrections randomly through the output array. For example, if your input is:
[1, 1, 1]
then you could first make it sum as well as possible while still being proportional:
[7, 7, 7]
and since 20 - (7+7+7) = -1, choose one element to decrement at random:
[7, 6, 7]
If the error was 4, you would choose four elements to increment.
A naïve solution that doesn't perform well, but will provide the right result...
Write an iterator that given an array with eight integers (candidate) and the input array, output the index of the element that is farthest away from being proportional to the others (pseudocode):
function next_index(candidate, input)
    // Calculate weights; entries whose input is 0 must never be chosen
    for i in 1 .. 8
        if input[i] > 0 then w[i] = candidate[i] / input[i] else w[i] = infinity
    end for
    // find the smallest weight
    min = infinity
    min_index = 0
    for i in 1 .. 8
        if w[i] < min then
            min = w[i]
            min_index = i
        end if
    end for
    return min_index
end function
Then just do this
result = [0, 0, 0, 0, 0, 0, 0, 0]
result[next_index(result, input)]++ for 1 .. 20
If there is no optimal solution, it'll skew towards the beginning of the array.
Using the approach above, you can reduce the number of iterations by rounding down (as you did in your example) and then just use the approach above to add what has been left out due to rounding errors:
result = <<approach using rounding down>>
while sum(result) < 20
    result[next_index(result, input)]++
So the answers and comments above were helpful... particularly the decreasing sum comment from @Frederik.
The solution I came up with takes advantage of the fact that for an input array v, sum(v_i * 20) is divisible by sum(v). So for each value in v, I multiply by 20 and divide by the sum. I keep the quotient, and accumulate the remainder. Whenever the accumulator reaches sum(v), I add one to the value. That way I'm guaranteed that all the remainders get rolled into the results.
Is that legible? Here's the implementation in Python:
def proportion(values, total):
    # set up by getting the sum of the values and starting
    # with an empty result list and accumulator
    sum_values = sum(values)
    new_values = []
    acc = 0
    for v in values:
        # for each value, find quotient and remainder
        q, r = divmod(v * total, sum_values)
        if acc + r < sum_values:
            # if the accumulator plus remainder is too small, just add and move on
            acc += r
        else:
            # we've accumulated enough to reach sum(values), so add 1 to result
            if acc > r:
                # add to previous
                new_values[-1] += 1
            else:
                # add to current
                q += 1
            acc -= sum_values - r
        # save the new value
        new_values.append(q)
    # accumulator is guaranteed to be zero at the end
    print(new_values, sum_values, acc)
    return new_values
(I added an enhancement that if the accumulator > remainder, I increment the previous value instead of the current value)

Recursive interlacing permutation

I have a program (a fractal) that draws lines in an interlaced order. Originally, given H lines to draw, it determines the number of frames N, and draws every Nth frame, then every N+1'th frame, etc.
For example, if H = 10 and N = 3, it draws them in order:
0, 3, 6, 9,
1, 4, 7,
2, 5, 8.
However, I didn't like the way bands would gradually thicken, leaving large areas between them undrawn for a long time. So the method was enhanced to recursively draw midpoint lines in each group instead of the immediately subsequent lines, for example:
0, (32) # S (step size) = 32
8, (24) # S = 16
4, (12) # S = 8
2, 6, (10) # S = 4
1, 3, 5, 7, 9. # S = 2
(The numbers in parentheses are out of range and not drawn.) The algorithm's pretty simple:
Set S to a power of 2 greater than N*2, set F = 0.
While S > 1:
    Draw frame F.
    Set F = F + S.
    If F >= H, then set S = S / 2; set F = S / 2.
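The loop above can be sketched in Python to list the drawing order without drawing anything. Two assumptions are made here that the description does not state explicitly: the starting step is taken as the smallest power of two above 2*H (which gives S = 32 for H = 10, as in the example), and out-of-range frames are skipped rather than drawn. The name interlaced_order is illustrative.

```python
def interlaced_order(h):
    """Return the order in which frames 0..h-1 are drawn by midpoint interlacing."""
    step = 1
    while step <= 2 * h:          # assumption: smallest power of two above 2*h
        step *= 2
    order = []
    frame = 0
    while step > 1:
        if frame < h:             # frames outside the range are not drawn
            order.append(frame)
        frame += step
        if frame >= h:            # end of this pass: halve the step, start at its midpoint
            step //= 2
            frame = step // 2
    return order

print(interlaced_order(10))  # [0, 8, 4, 2, 6, 1, 3, 5, 7, 9]
```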
When the odd-numbered frames are drawn on the last step size, they are drawn in simple order, just as in the initial (annoying) method. The same goes for every fourth frame, etc. It's not as bad because some intermediate frames have already been drawn.
But the same permutation could recursively be applied to the elements for each step size. In the example above, the last line would change to:
1, # the 0th element, S' = 16
9, # 4th, S' = 8
5, # 2nd, S' = 4
3, 7. # 1st and 3rd, S' = 2
The previous lines have too few elements for the recursion to take effect. But if N was large enough, some lines might require multiple levels of recursion. Any step size with 3 or more corresponding elements can be recursively permutated.
Question 1. Is there a common name for this permutation on N elements, that I could use to find additional material on it? I am also interested in any similar examples that may exist. I would be surprised if I'm the first person to want to do this.
Question 2. Are there some techniques I could use to compute it? I'm working in C but I'm more interested in the algorithm level at this stage; I'm happy to read code in other languages (within reason).
I have not yet tackled its implementation. I expect I will precompute the permutation first (contrary to the algorithm for the previous method, above). But I'm also interested in whether there is a simple way to get the next frame to draw without having to precompute it, similar in complexity to the previous method.
It sounds as though you're trying to construct one-dimensional low-discrepancy sequences. Your permutation can be computed by reversing the binary representation of the index.
def rev(num_bits, i):
    # reverse the lowest num_bits bits of i
    j = 0
    for k in range(num_bits):
        j = (j << 1) | (i & 1)
        i >>= 1
    return j
Example usage:
>>> [rev(4,i) for i in range(16)]
[0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
A variant that works on general n:
def rev(n, i):
    j = 0
    while n >= 2:
        m = i & 1
        if m:
            # odd indices go into the upper half of the remaining range
            j += (n + 1) >> 1
        n = (n + 1 - m) >> 1
        i >>= 1
    return j
>>> [rev(10,i) for i in range(10)]
[0, 5, 3, 8, 2, 7, 4, 9, 1, 6]
