Calculate marginal effects by hand (without using packages, Stata, or R) with logit and dummy variables

I have the following dilemma:
I understand-ish what marginal effects are, how to calculate them, the derivation of the sigmoid function, and how to interpret them (as the change in probability from increasing the variable of interest by "a little bit", this little bit being 1 for discrete variables or something like std(x)/1000 for continuous ones). The part I find tricky is corroborating the marginal effects by hand: recalculating the probabilities for x = 0 and then x = 1 (for example) and getting a difference in probability equal to the marginal effect I computed earlier. I am particularly stuck with dummy variables, since if I increase one I have to decrease the other, so I am not sure how to work around that or how to interpret it. (This question also applies to highly correlated variables.)
To make it more clear, let's say I have the following dataset:
#Python
[1. , 0. , 0. , 4.6, 3.1, 1.5, 0.2],
[1. , 0. , 1. , 5. , 3.6, 1.4, 0.2],
[1. , 1. , 0. , 5.4, 3.9, 1.7, 0.4],
[1. , 0. , 1. , 4.6, 3.4, 1.4, 0.3],
[1. , 1. , 0. , 5. , 3.4, 1.5, 0.2],
[1. , 0. , 0. , 4.4, 2.9, 1.4, 0.2],
[1. , 0. , 1. , 4.9, 3.1, 1.5, 0.1],
[1. , 1. , 0. , 5.4, 3.7, 1.5, 0.2],
...
Var_0 = the intercept (the column of 1s).
Var_1, Var_2 = one-hot encoded variables (2 of 3 dummies, one dropped to avoid collinearity).
Var_3+ = normal continuous variables.
Coefficients:
[ 7.56986405, 0.75703164, 0.27158741, -0.37447474, -2.79926022, 1.43890492, -2.95286947]
logit
[-3.34739217,
-2.27001103,
-1.49517926,
-0.77178644,
-0.808111,
-2.48474722,
-1.76183804,
-0.90621541
...]
Probabilities
[0.03398066,
0.09363728,
0.18314562,
0.31609279,
0.30829318,
0.0769344 ,
0.14656029,
0.28777491,
...]
Marginal effect = p*(1-p) * B_j
Now let's say that I am interested in the marginal effect of var_1 (one of the dummies), I will simply do: p*(1-p) * 0.7570
Which will result in an array of length n (# of obs) with different marginal effects (which is fine, because I understand that the effects are non-constant and non-linear). Let's say this array ranges from 0.0008 to 0.0495.
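For reference, this is roughly how I compute those per-observation effects (a minimal numpy sketch using only the eight rows shown above; the exact numbers depend on the full dataset):
import numpy as np

X = np.array([
    [1. , 0. , 0. , 4.6, 3.1, 1.5, 0.2],
    [1. , 0. , 1. , 5. , 3.6, 1.4, 0.2],
    [1. , 1. , 0. , 5.4, 3.9, 1.7, 0.4],
    [1. , 0. , 1. , 4.6, 3.4, 1.4, 0.3],
    [1. , 1. , 0. , 5. , 3.4, 1.5, 0.2],
    [1. , 0. , 0. , 4.4, 2.9, 1.4, 0.2],
    [1. , 0. , 1. , 4.9, 3.1, 1.5, 0.1],
    [1. , 1. , 0. , 5.4, 3.7, 1.5, 0.2],
])
beta = np.array([7.56986405, 0.75703164, 0.27158741, -0.37447474,
                 -2.79926022, 1.43890492, -2.95286947])

logit = X @ beta                 # linear predictor
p = 1 / (1 + np.exp(-logit))     # predicted probabilities
me_var1 = p * (1 - p) * beta[1]  # derivative-based marginal effect of var_1, one value per observation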
Now the problem is, how can you verify these results? How can I measure the marginal effect when the dummy goes from 0 to 1?
You could argue that I could do two things, the MEM and AME methods:
MEM: Leave all the variables at their means and then recalculate everything for var_1 = 0 and then for var_1 = 1 (MEM method).
(You can't really do this, because you would be assuming that you can have some observations where var_1 and var_2 are equal to 1 at the same time, which is incorrect, since the mean of a dummy is just the proportion of 1s in that column.)
AME: Leave the data as observed, but change all the values of var_1 to 0 (making all the values of var_2 = 1) and then do the opposite (var_1 = 1, var_2 = 0; you have to do this since an observation can't belong to two categories at the same time), and then take the average of the results (AME method), roughly as in the sketch below. (Side comment: one thing I am not sure about is whether it is the average of the differences in marginal effects when var_1 = 0 and then 1, or the average of the differences in probabilities when var_1 = 0 and then 1. I used both, but the probability version makes more sense to me.)
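(For concreteness, the probability-difference version of this 2nd approach is what the following sketch computes, continuing from the X and beta arrays defined above; the variable names are mine.)
# AME as a discrete change: force everyone into var_2's category, then into var_1's
X0 = X.copy(); X0[:, 1] = 0; X0[:, 2] = 1
X1 = X.copy(); X1[:, 1] = 1; X1[:, 2] = 0

p0 = 1 / (1 + np.exp(-(X0 @ beta)))   # probabilities with var_1 = 0, var_2 = 1
p1 = 1 / (1 + np.exp(-(X1 @ beta)))   # probabilities with var_1 = 1, var_2 = 0

ame_var1 = np.mean(p1 - p0)           # average change in probability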
Now, if I try the 2nd approach I get very different results from what I originally got (which were values between 0.0008 and 0.0495); it gives me values between 0.0022 and 0.1207, which is a massive difference.
To summarise:
How can I do a mathematical corroboration to get the same values I got initially (0.0008 to 0.0495)?
How can I interpret these original values in the first place? Because if I take 0.0495, I am basically saying that if I increase var_1 by 1 unit (from 0 to 1), the probability of my event happening increases by 4.95 percentage points. The problem is that this doesn't consider that, to make the 1-unit increase, I need by default to decrease the other dummy variable (var_2), so I would be doing something of a double change in the variables, or a double marginal effect, at the same time.

Related

Better than brute force algorithms for a coin-flipping game

I have a problem and I feel like there should be a well-known algorithm for solving it that's better than just brute force, but I can't think of one, so I'm asking here.
The problem is as follows: given n sorted (from low to high) lists containing m probabilities, choose one index for each list such that the sum of the chosen indexes is less than m. Then, for each list, we flip a coin, where the chance of it landing heads is equal to the probability at the chosen index for that list. Maximize the chance of the coin landing heads at least once.
Are there any algorithms for solving this problem that are better than just brute force?
This problem seems most similar to the knapsack problem, except the value of the items in the knapsack isn't merely a sum of the items in the knapsack. (Written in Python, instead of sum(p for p in chosen_probabilities) it's 1 - math.prod([1 - p for p in chosen_probabilities])) And there are restrictions on what items you can add given what items are already in the knapsack. For example, if the index = 3 item for a particular list is already in the knapsack, then adding in the item with index = 2 for that same list isn't allowed (since you can only pick one index for each list). So there are certain items that can and can't be added to the knapsack based on what items are already in it.
Linear optimization won't work because the values in the lists don't increase linearly, the final coin probability isn't linear with respect to the chosen probabilities, and our constraint is on the sum of the indexes, rather than the values in the lists themselves. As David has pointed out, linear optimization will work if you use binary variables to pick out the indexes and a logarithm to deal with the non-linearity.
EDIT:
I've found that explaining the motivation behind this problem can be helpful for understanding it. Imagine you have 10 seconds to solve a problem, and three different ways to solve it. You have models of how likely it is that each method will solve the problem, given how many seconds you try that method for, but if you switch methods, you lose all progress on the one you were previously trying. What methods should you try and for how long?
Maximizing 1 - math.prod([1 - p for p in chosen_probabilities]) is equivalent to minimizing math.prod([1 - p for p in chosen_probabilities]), which is equivalent to minimizing the log of this objective, which is a linear function of 0-1 indicator variables, so you could do an integer programming formulation this way.
I can't promise that this will be much better than brute force. The problem is that math.log(1 - p) is well approximated by -p when p is close to zero. My intuition is that for nontrivial instances it will be qualitatively similar to using integer programming to solve subset sum, which doesn't go particularly well.
If you're willing to settle for a bicriteria approximation scheme (get an answer such that the sum of the chosen indexes is less than m, that is at least as good as the best answer summing to less than (1 − ε) m) then you can round up the probability to multiples of ε and use dynamic programming to get an algorithm that runs in time polynomial in n, m, 1/ε.
Here is working code for David Eisenstat's solution.
To understand the implementation, I think it helps to go through the math first.
As a reminder, there are n lists, each with m options. (In the motivating example at the bottom of the question, each list represents a method for solving the problem, and you are given m-1 seconds to solve the problem. Each list is such that list[index] gives the chance of solving the problem with that method if the method is run for index seconds.)
We let the lists be stored in a matrix called d (named data in the code), where each row in the matrix is a list. (And thus each column represents an index, or, if following the motivating example, an amount of time.)
The probability of the coin landing heads at least once, given that we chose index j*_i for each list i, is computed as
1 - (1 - d[1][j*_1]) * (1 - d[2][j*_2]) * ... * (1 - d[n][j*_n])
We would like to maximize this.
(To explain the stats behind this equation: we're computing 1 minus the probability that the coin never lands on heads. The probability that the coin never lands on heads is the probability that each flip doesn't land on heads. The probability that a single flip doesn't land on heads is just 1 minus the probability that it does land on heads, and the probability that it does land on heads is the number we've chosen, d[i][j*_i]. Thus, the total probability that all the flips land on tails is just the product of the probabilities that each one lands on tails. And then the probability that the coin lands on heads at least once is just 1 minus the probability that all the flips land on tails.)
Which, as David pointed out, is the same as minimizing:
(1 - d[1][j*_1]) * (1 - d[2][j*_2]) * ... * (1 - d[n][j*_n])
Which is the same as minimizing:
log((1 - d[1][j*_1]) * (1 - d[2][j*_2]) * ... * (1 - d[n][j*_n]))
Which is equivalent to:
log(1 - d[1][j*_1]) + log(1 - d[2][j*_2]) + ... + log(1 - d[n][j*_n])
Then, since this is a linear sum, we can turn it into an integer program.
We'll be minimizing:
sum over all i and j of x[i][j] * log(1 - d[i][j])
This lets the computer choose the indexes by allowing it to create an n by m matrix of 1s and 0s called x where the 1s pick out particular indexes. We'll then define rules so that it doesn't pick out invalid sets of indexes.
The first rule is that you have to pick out exactly one index for each list:
sum over j of x[i][j] = 1, for every list i
The second rule is that you have to respect the constraint that the chosen indexes must sum to less than m:
sum over all i and j of j * x[i][j] <= m - 1
And that's it! Then we can just tell the computer to minimize that sum according to those rules. It will spit out an x matrix with a single 1 on each row to tell us which index it has picked for the list on that row.
In code (using the motivating example), this is implemented as:
'''
Requirements:
cvxopt==1.2.6
cvxpy==1.1.10
ecos==2.0.7.post1
numpy==1.20.1
osqp==0.6.2.post0
qdldl==0.1.5.post0
scipy==1.6.1
scs==2.1.2
'''
import math
import cvxpy as cp
import numpy as np
# number of methods
n = 3
# if you have 10 seconds, there are 11 options for each method (0 seconds, 1 second, ..., 10 seconds)
m = 11
# method A has 30% chance of working if run for at least 3 seconds
# equivalent to [0, 0, 0, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
A_list = [0, 0, 0] + [0.3] * (m - 3)
# method B has 30% chance of working if run for at least 3 seconds
# equivalent to [0, 0, 0, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
B_list = [0, 0, 0] + [0.3] * (m - 3)
# method C has 40% chance of working if run for 4 seconds, 30% otherwise
# equivalent to [0.3, 0.3, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
C_list = [0.3, 0.3, 0.3, 0.3] + [0.4] * (m - 4)
data = [A_list, B_list, C_list]
# do the logarithm
log_data = []
for row in data:
    log_row = []
    for col in row:
        # deal with domain exception
        if col == 1:
            new_col = float('-inf')
        else:
            new_col = math.log(1 - col)
        log_row.append(new_col)
    log_data.append(log_row)
log_data = np.array(log_data)
x = cp.Variable((n, m), boolean=True)
objective = cp.Minimize(cp.sum(cp.multiply(log_data, x)))
# the current solver doesn't work with equalities, so each equality must be split into two inequalities.
# see https://github.com/cvxgrp/cvxpy/issues/1112
one_choice_per_method_constraint = [cp.sum(x[i]) <= 1 for i in range(n)] + [cp.sum(x[i]) >= 1 for i in range(n)]
# constrain the solution to not use more time than is allowed
# note that the time allowed is (m - 1), not m, because the m options per method are 0, 1, ..., m - 1 seconds
js = np.tile(np.array(list(range(m))), (n, 1))
time_constraint = [cp.sum(cp.multiply(js, x)) <= m - 1, cp.sum(cp.multiply(js, x)) >= m - 1]
constraints = one_choice_per_method_constraint + time_constraint
prob = cp.Problem(objective, constraints)
result = prob.solve()
def compute_probability(data, choices):
    # compute 1 - ((1 - p1) * (1 - p2) * ...)
    return 1 - np.prod(np.add(1, -np.multiply(data, choices)))
print("Choices:")
print(x.value)
'''
Choices:
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]
'''
print("Chance of success:")
print(compute_probability(data, x.value))
'''
Chance of success:
0.7060000000000001
'''
And there we have it! The computer has correctly determined that running method A for 3 seconds, method B for 3 seconds, and method C for 4 seconds is optimal. (Remember that column index j of the x matrix corresponds to running that method for j seconds.)
Thank you, David, for the suggestion!

Algorithm for tracking values through time

There is likely a known algorithm for doing this, but I wasn't able to find it using my Google skills, so I will try to describe what I have to do and what I did so far.
I have a source of characteristic values of a system which I would like to plot as a trend. The values are being returned from an algorithm in real time, and each value has a set of properties (magnitude, phase, quality).
However, these values can appear and disappear in time, and I can also get some intermittent values which I will disregard if they don't repeat during a longer period (several samples).
For example, I might be getting these values:
Time (Mag, Phase, Quality)
t = 1 (10.10, 0.90, 0.90); (17.00, 0.02, 0.12)
t = 2 (10.15, 0.91, 0.89); (17.10, 0.12, 0.12)
t = 3 (17.10, 0.12, 0.12)
t = 4 (10.25, 0.91, 0.89); (17.12, 0.12, 0.12)
t = 5 ( 6.15, 0.41, 0.39); (10.35, 0.91, 0.89); (17.12, 0.12, 0.12)
t = 6 (10.20, 0.90, 0.85); (17.02, 0.13, 0.11)
t = 7 ( 9.20, 0.90, 0.85); (11.20, 0.90, 0.85); (17.02, 0.13, 0.11)
t = 8 ( 9.80, 0.90, 0.85); (11.80, 0.90, 0.85); (17.02, 0.13, 0.11)
I'd like to track these sets of values through time according to the similarity with previous values. I.e. in the example above, I have two main trends (Mag 10 and Mag 17), with several specific situations:
moments where I will shortly lose one of the values (Mag 10 is lost in t = 3),
moments where I shortly get a new temporary/invalid reading (Mag 6 in t = 5) for a single sample,
moments where it's not completely clear which set corresponds to the previous sample (Mag 9.2 and Mag 11.2 could both be a continuation of Mag 10.2 from the previous sample), and in t = 8 it becomes apparent that there are now two different sets (Mag 9.8 and Mag 11.8).
If I just grouped the values in the order they arrive from the system, I would not get the correct trends: without tracking, readings from different trends end up mixed together. Properly matching each new value against the old magnitudes, however, should keep every reading assigned to its correct trend.
I've written an algorithm which tracks the values through time by effectively trying all permutations of the new sets against the previous "active" sets. It calculates the differences between all new values and the previously known values, which is basically an N^2 step, and then checks all permutations to find the smallest total distance (something like N! complexity):
for each X in new_sets:
    for each Y in existing_sets:
        distance(X, Y) = calculate_distance(X, Y)

for each P in permutations(new_sets):
    total_distance(P) = sum(distance(X, Y)) for all pairs (X, Y) in P

the permutation P with the minimum total_distance is the best match
As I go through time, I also remove measurements from existing_sets if they are not matched within several samples.
This works reasonably well as long as I don't have too many values, but the time complexity becomes problematic once I begin tracking more than 10 items. It also feels like reinventing the wheel.
Is there a known/better (in terms of time complexity) algorithm for doing this?
Without constraints on the behaviour of the sources, there is obviously no solution. If we can assume that the magnitudes from different sources are reasonably separated, and that changes are reasonably small, the solution is to keep the trends in sorted order. Then binary search them to find the trend closest to the new reading.
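A minimal sketch of that idea (the function name and the tolerance are illustrative, not part of the answer):
import bisect

def nearest_trend(trend_mags, new_mag, tolerance=1.0):
    # trend_mags is the sorted list of current trend magnitudes; returns the
    # index of the closest trend, or None if nothing is within tolerance
    # (meaning a new trend should be started)
    i = bisect.bisect_left(trend_mags, new_mag)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(trend_mags)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(trend_mags[j] - new_mag))
    return best if abs(trend_mags[best] - new_mag) <= tolerance else None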

statsmodels SARIMAX predictions

I'm trying to understand how to verify an ARIMAX model for > 1 step ahead using statsmodels.
My understanding is that the results.get_prediction(start=, dynamic=) API does this, but I'm having trouble getting my head around how it works. My training data is indexed by a localised DatetimeIndex (tz='Australia/Sydney') at 15T freq. I want to predict a full day for '2019-02-04 00:00:00+11:00' using one-step-ahead prediction up to '2019-02-04 06:00:00+11:00' and the previously predicted endogenous values for the rest of the day.
Is the code below correct? It seems statsmodels converts the start to a TimeStamp and treats dynamic as a multiple of the freq, so this should run the simulation using one-step-ahead predictions until 06:00 and then use the previously predicted endogenous values. The results don't look great, so I want to confirm it's a model issue rather than an incorrect diagnosis on my part.
dt = '2019-02-04'
predict = res.get_prediction(start='2019-02-04 00:00:00+11:00')
predict_dy = res.get_prediction(start='2019-02-04 00:00:00+11:00', dynamic=4*6)
fig = plt.figure(figsize=(10, 10))
ax = fig.gca()
y_train[dt].plot(ax=ax, style='o', label='Observed')
predict.predicted_mean[dt].plot(ax=ax, style='r--', label='One-step-ahead forecast')
predict_dy.predicted_mean[dt].plot(ax=ax, style='g', label='Dynamic forecast')
It seems statsmodels converts the start to a TimeStamp
Yes, if you give it a string value, then it will attempt to map it to an index in your dataset (like a timestamp).
and treats dynamic as a multiple of the freq
But this is not correct. dynamic is an integer offset to start. So if dynamic=0, that means that dynamic prediction begins at start, whereas if dynamic=1, that means that dynamic prediction begins at start+1.
It's not quite clear to me what's going on in your example (or what you think is not great about the predictions you generated), so here is a description of how dynamic works that may help:
Here's an example that may help explain how things work. A couple of key points for this exercise will be:
I set all elements of endog to be equal to 1
This is an AR(1) model with parameter 0.5. That means that if we know y_t, then the prediction of y_t+1 is equal to 0.5 * y_t.
Now, the example code is:
import numpy as np
import pandas as pd
import statsmodels.api as sm

ix = pd.date_range(start='2018-12-01', end='2019-01-31', freq='D')
endog = pd.Series(np.ones(len(ix)), index=ix)
mod = sm.tsa.SARIMAX(endog, order=(1, 0, 0), concentrate_scale=True)
res = mod.smooth([0.5])
p1 = res.predict(start='January 1, 2019', end='January 5, 2019').rename('d=False')
p2 = res.predict(start='January 1, 2019', end='January 5, 2019', dynamic=0).rename('d=0')
p3 = res.predict(start='January 1, 2019', end='January 5, 2019', dynamic=1).rename('d=1')
print(pd.concat([p1, p2, p3], axis=1))
this gives:
            d=False      d=0     d=1
2019-01-01      0.5  0.50000  0.5000
2019-01-02      0.5  0.25000  0.5000
2019-01-03      0.5  0.12500  0.2500
2019-01-04      0.5  0.06250  0.1250
2019-01-05      0.5  0.03125  0.0625
The first column (d=False) is the default case, where dynamic=False. Here, all predictions are one-step-ahead predictions. Since I set every element of endog to 1 and we have an AR(1) model with parameter 0.5, all one-step-ahead predictions will be equal to 0.5 * 1 = 0.5.
In the second column (d=0), we specify that dynamic=0 so that dynamic prediction begins at the first prediction. This means that we do not use any endog data past start - 1 in forming our predictions, which in this case means we do not use any data past December 31, 2018 in making predictions. The first prediction will be equal to 0.5 times the observation on December 31, 2018, i.e. 0.5 * 1 = 0.5. Each subsequent prediction will be equal to 0.5 * the previous prediction, so the second prediction is 0.5 * 0.5 = 0.25, etc.
The third column (d=1) is like the second column, except that here dynamic=1 so that dynamic prediction begins at the second prediction. This means we do not use any endog data past start (i.e. past January 1, 2019).
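Applying the same logic to the 15-minute data in the question (a sketch, not part of the original answer): dynamic is the number of 15T steps after start, so with start at midnight, dynamic prediction beginning at 06:00 corresponds to an offset of 6 * 4 = 24 steps.
# one-step-ahead predictions up to (but not including) 06:00, then dynamic
# predictions (using previously predicted values) for the rest of the day
predict_dy = res.get_prediction(start='2019-02-04 00:00:00+11:00', dynamic=6 * 4)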

Randomly select N unique elements from a list, given a probability for each

I've run into a problem: I have a list or array (IList) of elements that have a field (float Fitness). I need to efficiently choose N random unique elements depending on this variable: the bigger it is, the more likely the element is to be chosen.
I searched on the internet, but the algorithms I found were rather unreliable.
The answer stated here seems to give a bigger probability to elements at the beginning, which I need to make sure to avoid.
-Edit-
For example I need to choose from objects with the values [-5, -3, 0, 1, 2.5] (negative values included).
The basic algorithm is to sum the values, then draw a point from 0 to sum(values), fix an order for the items, and see which one the point "intersects".
For the values [0.1, 0.2, 0.3] the "windows" [0-0.1, 0.1-0.3, 0.3-0.6] will look like this:
1 23 456
|-|--|---|
|-*--*---|
And you draw a point in [0, 0.6] and see which window it hits on the axis.
Pseudo-python for this:
original_values = {val1, val2, ... valn}
# list is to order them, order doesn't matter outside this context.
values = list(original_values)
# limit
limit = sum(values)
draw = random() * limit
while true:
candidate = values.pop()
if candidate > draw:
return candidate
draw -= candidate
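A runnable version of the same idea, extended to draw N unique elements (this is just a sketch; it assumes non-negative values, so negative fitnesses would have to be shifted into a non-negative range first):
import random

def weighted_sample_without_replacement(items, weights, n):
    # repeat the roulette-wheel draw n times, removing each chosen item
    items, weights = list(items), list(weights)
    chosen = []
    for _ in range(n):
        limit = sum(weights)
        draw = random.random() * limit
        for i, w in enumerate(weights):
            if w > draw:
                chosen.append(items.pop(i))
                weights.pop(i)
                break
            draw -= w
    return chosen

# e.g. weighted_sample_without_replacement(['a', 'b', 'c', 'd'], [0.1, 0.2, 0.3, 0.4], 2)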
So what shall those numbers represent?
Does 2.5 mean that the probability of being chosen is twice as high as for 1.25? Well, the negative values don't fit into that scheme.
I guess fitness means something like -5: very ill, 2.5: very fit. We have a range of 7.5 and could randomly pick an element if we know how many candidates there are and have access by index.
Then take a random number between -5 and 2.5 and see if it is lower than or equal to the candidate's fitness. If so, the candidate is picked; else we repeat from step 1. I would say that we should then generate a new threshold to survive, because if we drew 2.5 but no candidate with that fitness remains, we would search infinitely.
The range of fitnesses has to be known for this, too.
fitness      -5   -3    0    1   2.5
rand -5       x    x    x    x    x
rand -2.5     -    -    x    x    x
rand  0       -    -    x    x    x
rand  2.5     -    -    -    -    x
If every candidate is to be tested every round, and the -5 candidate is to have a chance to survive, you have to stretch the interval of random numbers a bit to give it a chance, for instance from -6 to 3.
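A small sketch of this rejection approach (the function and the stretched bounds are illustrative):
import random

def pick_index_by_rejection(fitnesses, low=-6.0, high=3.0):
    # pick a random candidate by index, draw a fresh threshold each round,
    # and accept the candidate if the threshold does not exceed its fitness
    while True:
        i = random.randrange(len(fitnesses))
        threshold = random.uniform(low, high)
        if threshold <= fitnesses[i]:
            return i

# e.g. pick_index_by_rejection([-5, -3, 0, 1, 2.5])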

algorithm to map values to unit scale with logarithmic spacing

I'm looking for an algorithm to map values to a unit scale with logarithmic spacing. The scale ranges from 0 to 1. Incoming values would be in the range of 0 to 10000.
0 maps to 0, 1 maps to .2, 10 maps to .4, 100 maps to .6
1000 maps to .8, 10000 maps to 1.0
Help/pointers would be appreciated.
If you are literally looking "to map values to a unit scale with logarithmic spacing", and with f(0)=0, then your example values are wrong.
However, you can do this with f(x) = log(1+x)/log(1+max)
So with max = 10000, we have:
f(0)=0
f(1)=0.0753
f(2)=0.1193
f(10)=0.2603
f(100)=0.5010
f(1000)=0.7501
f(10000)=1
which on a log scale makes sense: if 1 is near 0 and 10000 is 1, then 100, which has the average number of zeroes of the previous two numbers, should be around 0.5. You really don't want to start considering log(0) as an option.
However, as soon as your minimum value is not 0 anymore (even if the min value is very small, as long as it's non-zero), you can do a more reasonable interpolation:
f(x) = (log(x) - log(min)) / (log(max) - log(min))
which is the same as user3246191's comment under his answer:
f(x) = log(x/min) / log(max/min)
Since all values returned by f in this post are ratios of logarithms, you can take the logarithm in any base you please. I would recommend the native one for your programming language (i.e. if log10(x) is defined as ln(x)/ln(10), take ln(x) instead).
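In code, the first mapping is a one-liner (a small Python sketch of f(x) = log(1+x)/log(1+max)):
import math

def to_unit_log_scale(x, max_value=10000):
    # f(x) = log(1 + x) / log(1 + max), so f(0) = 0 and f(max_value) = 1
    return math.log1p(x) / math.log1p(max_value)

# to_unit_log_scale(0) -> 0.0, to_unit_log_scale(100) -> ~0.50, to_unit_log_scale(10000) -> 1.0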
It is not really clear what transform you are trying to apply. From what you describe, it seems that a potential function would be
f(x) = 0.2 * (1 + log(x)/log(10))
which satisfies f(1) = 0.2, f(10) = 0.4, f(100) = 0.6, f(1000) = 0.8, f(10000) = 1
but on the other hand f(0.1) = 0 and f(0) = -infty.
Of course it is possible to modify f so that f(0) = 0 but this will be somewhat arbitrary and your question is not really well formulated then.
