Use numeric features for contextual bandit with vowpal wabbit - vowpalwabbit

I am currently trying to get familiar with contextual bandits using Vowpal Wabbit, but I am having trouble using numerical features.
Basically, my bandit should decide between two actions (action 1 = send data, action 2 = idle) based on two numerical features (estimated data rate and age of information). This is my simple code so far:
import pandas as pd
from vowpalwabbit import pyvw
train_data = [{'action': 2, 'cost': 0.1, 'probability': 0.5, 'feature1': 3, 'feature2': 10},
              {'action': 1, 'cost': 9.99, 'probability': 0.6, 'feature1': 3, 'feature2': 10},
              {'action': 1, 'cost': 0.1, 'probability': 0.2, 'feature1': 29, 'feature2': 90},
              {'action': 2, 'cost': 9.99, 'probability': 0.3, 'feature1': 29, 'feature2': 90}]
train_df = pd.DataFrame(train_data)
train_df['index'] = range(1, len(train_df) + 1)
train_df = train_df.set_index("index")

test_data = [{'feature1': 29, 'feature2': 90},
             {'feature1': 3, 'feature2': 10}]
test_df = pd.DataFrame(test_data)
test_df['index'] = range(1, len(test_df) + 1)
test_df = test_df.set_index("index")

vw = pyvw.vw("--cb 2")

for i in train_df.index:
    action = int(train_df.loc[i, "action"])
    cost = train_df.loc[i, "cost"]
    probability = train_df.loc[i, "probability"]
    feature1 = train_df.loc[i, "feature1"]
    feature2 = train_df.loc[i, "feature2"]
    # Construct the example in the required vw format.
    learn_example = str(action) + ":" + str(cost) + ":" + str(probability) + " | rate:" + str(feature1) + " aoi:" + str(feature2)
    vw.learn(learn_example)

for j in test_df.index:
    feature1 = test_df.loc[j, "feature1"]
    feature2 = test_df.loc[j, "feature2"]
    #test_example = "| " + str(feature1) + " " + str(feature2)
    test_example = "| rate:" + str(feature1) + " aoi:" + str(feature2)
    choice = vw.predict(test_example)
    print(j, choice)
The output is:
1 1
2 1
Normally I would expect the output to be action 1 for the first prediction and action 2 for the second prediction, following the cost structure of my training data. When I change the feature values to characters (e.g. "a" for high data rate, "b" for low data rate, as in the official VW tutorial) and adapt the training/testing strings, I get the right predictions, so I think the issue is a wrong implementation of the feature:value pairs.
Does anybody see the mistake in my code?

Thank you for raising this issue.
In VW, when you use features with numeric values (name:value), the hash is calculated from the namespace and the feature name only. So no matter what numeric value is used, VW has only one weight in the model for that feature.
Instead, if you were to use categorical values for your features (e.g. rate=low, rate=high), each categorical value gets its own weight in the model.
Because there is only one weight per feature when using numerical values, VW has fewer degrees of freedom than in the categorical case to map between features and the right action, and so it needs more data. In fact, if you were to copy your training dataset 75 times and train with a total of 300 examples, then your predictions would be correct. Alternatively, using categorical features, your original set of 4 examples is enough for VW to correctly predict your actions. This is how you could format it:
2:0.1:0.5 | rate=3 aoi=10
1:9.99:0.6 | rate=3 aoi=10
1:0.1:0.2 | rate=29 aoi=90
2:9.99:0.3 | rate=29 aoi=90
| rate=29 aoi=90
| rate=3 aoi=10
Note: In the VW text format, = is not actually a special character the way : is. The = is just part of the feature name and makes it easier for a human reader to identify it as a categorical feature with that name. If there were a feature rate with two possible values, low or high, it could be passed as rate=low or rate=high. Replacing = with something else would change the hash, but the learning would not change, since the values are still interpreted as distinct features. So you could also use rate-high, rate_high, etc.
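As a minimal sketch, the training loop from the question could feed VW this categorical format using the same pyvw calls (the expected predictions follow from the cost structure described above):

from vowpalwabbit import pyvw

vw = pyvw.vw("--cb 2")

# The value is folded into the feature name, so rate=3 and rate=29 hash to
# two distinct features with separate weights.
train_examples = ["2:0.1:0.5 | rate=3 aoi=10",
                  "1:9.99:0.6 | rate=3 aoi=10",
                  "1:0.1:0.2 | rate=29 aoi=90",
                  "2:9.99:0.3 | rate=29 aoi=90"]
for example in train_examples:
    vw.learn(example)

print(vw.predict("| rate=29 aoi=90"))  # expected action: 1
print(vw.predict("| rate=3 aoi=10"))   # expected action: 2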

Related

Is there an algorithm to redistribute an array equally?

I have a one-dimensional array of non-negative integers. When plotted as a histogram, the values are heavily concentrated at the smaller end of the range.
Is there an algorithm to redistribute the values more equally while maintaining the original order? By order I mean the minimum and maximum values would retain their original places in the array and everything else in between would scale up or down. When plotted afterwards the histogram would be more or less flat.
Searching the web I came across a "probability integral transform" in statistics, but that required sorting the data.
EDIT - Apologies for omitting why I don't want to sort it. The array is a plot and each integer represents a pixel. If I sort it that would destroy the plot. I'm dividing each integer by the maximum value and using that as an index into a palette. Because there's so much bias towards smaller values, only a small amount of the palette is visible in the final image. I thought if I was able to redistribute the values somehow, it'd use the full range of the palette.
You could apply this algorithm:
Let c be a freely chosen coefficient between 0 and 1: the closer it is to 1, the closer the resulting values will be to each other. If it is exactly 1, all values will be equal; if it is 0, the result is the original data set. A candidate value for c could for instance be 0.9.
Let avg be the average of the input values
Apply the following transformation to each value in the input set:
new value := value * (1 − c) + avg * c
Here is an interactive implementation in a JavaScript snippet:
let a = [150, 100, 40, 33, 9, 3, 5, 13, 8, 1, 3, 2, 1, 1, 0, 0];
let avg = a.reduce((acc, val) => acc + val) / a.length;
function refresh(c) {
  // Apply transformation from a to b using c:
  display(a.map(val => val * (1 - c) + avg * c));
}

// I/O management
var display = (function (b) {
  this.clearRect(0, 0, this.canvas.width, this.canvas.height); // Clear display
  for (let i = 0; i < b.length; i++) {
    this.beginPath();
    this.rect(20 + i * 20, 150 - b[i], 19, b[i]);
    this.fill();
  }
}).bind(document.querySelector("canvas").getContext("2d"));
document.querySelector("input").oninput = e => refresh(+e.target.value);
refresh(0);
Coefficient: <input type="number" min="0" max="1" step="0.01" value="0"><br>
<canvas height="150" width="400"></canvas>
Use the Coefficient input box to experiment with different values for it on a sample data set.
In Python the transformation, for a given list a and coefficient c, could look like this:
avg = sum(a) / len(a)
b = [value * (1 - c) + avg * c for value in a]
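For instance, applied to the sample data from the snippet above with c = 0.9 (just an illustrative choice of coefficient):

a = [150, 100, 40, 33, 9, 3, 5, 13, 8, 1, 3, 2, 1, 1, 0, 0]
c = 0.9  # pull values strongly toward the average

avg = sum(a) / len(a)  # 23.0625 for this sample
b = [value * (1 - c) + avg * c for value in a]

print(b[:3])  # [35.75625, 30.75625, 24.75625] -- the large values are pulled down toward avg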

LightGBM predict with pred_contrib=True for multiclass: order of SHAP values in the returned array

LightGBM's predict method with pred_contrib=True returns an array of shape (n_samples, (n_features + 1) * n_classes).
What is the order of data in the second dimension of this array?
In other words, there are two questions:
What is the correct way to reshape this array to use it: shape = (n_samples, n_features + 1, n_classes) or shape = (n_samples, n_classes, n_features + 1)?
In the feature dimension, there are n_features entries, one for each feature, and a (useless) entry for the contribution not related to any feature. What is the order of these entries: feature contributions in the entries 1,..., n_features in the same order they appear in the dataset, with the remaining (useless) entry at index 0, or some other way?
The answers are as follows:
The correct shape is (n_samples, n_classes, n_features + 1).
The feature contributions are in the entries 1,..., n_features in the same order they appear in the dataset, with the remaining (useless) entry at index 0.
The following code shows it convincingly:
import lightgbm, pandas, numpy
params = {'objective': 'multiclass', 'num_classes': 4, 'num_iterations': 10000,
          'metric': 'multiclass', 'early_stopping_rounds': 10}
train_df = pandas.DataFrame({'f0': [0, 1, 2, 3] * 50, 'f1': [0, 0, 1] * 66 + [1, 2]}, dtype=float)
val_df = train_df.copy()
train_target = pandas.Series([0, 1, 2, 3] * 50)
val_target = pandas.Series([0, 1, 2, 3] * 50)
train_set = lightgbm.Dataset(train_df, train_target)
val_set = lightgbm.Dataset(val_df, val_target)
model = lightgbm.train(params=params, train_set=train_set, valid_sets=[val_set, train_set])
feature_contribs = model.predict(val_df, pred_contrib=True)
print('Shape of SHAP:', feature_contribs.shape)
# Shape of SHAP: (200, 12)
print('Averages over samples:', numpy.mean(feature_contribs, axis=0))
# Averages over samples: [ 3.99942301e-13 -4.02281771e-13 -4.30029167e+00 -1.90606677e-05
# 1.90606677e-05 -4.04157656e+00 2.24205077e-05 -2.24205077e-05
# -4.04265615e+00 -3.70370401e-15 5.20335728e-18 -4.30029167e+00]
feature_contribs.shape = (200, 4, 3)
print('Mean feature contribs:', numpy.mean(feature_contribs, axis=(0, 1)))
# Mean feature contribs: [ 8.39960111e-07 -8.39960113e-07 -4.17120401e+00]
(Each output appears as a comment on the line(s) following the corresponding print statement.)
The explanation is as follows.
I have created a dataset with two features and with labels identical to the second of these features.
I would expect significant contribution from the second feature only.
After averaging the SHAP output over the samples, we get an array of the shape (12,) with nonzero values at the positions 2, 5, 8, 11 (zero-based).
This shows that the correct shape of this array is (4, 3).
After reshaping this way and averaging over the samples and the classes, we get an array of the shape (3,) with the nonzero entry at the end.
This shows that the last entry of this array corresponds to the last feature. This means that the entry at the position 0 does not correspond to any feature and the following entries correspond to features.
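In other words, to use the contributions you can reshape the flat predict output directly, along these lines (a small sketch reusing the names from the snippet above, before any in-place reshaping):

n_classes = 4
n_features = val_df.shape[1]

flat = model.predict(val_df, pred_contrib=True)  # shape (n_samples, (n_features + 1) * n_classes)
contribs = flat.reshape(flat.shape[0], n_classes, n_features + 1)

# contribs[i, k, :] now holds the contributions for sample i and class k:
# one entry per feature plus the extra entry not tied to any feature.
print(contribs[0, 0, :])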

Mixture model of a normal and constant

I'd like to model a distribution which is a mixture of a Normal and the constant 0.
I couldn't find a solution because in all the mixture examples I've found, the same class of distribution is used for every category.
Here is some code to illustrate what I'm looking for:
with pm.Model() as model:
    x_non_zero = pm.Normal(...)
    zero_rate = pm.Uniform('zero_rate', lower=0.0, upper=1.0, testval=0.5)
    fr = pm.Bernoulli('fr', p=zero_rate)
    x = pm.???('x', pm.switch(pm.eq(fr, 0), x_non_zero, 0), observed=data['x'])
I'm interested in the rate at which the data is exactly zero, and in the parameters of the normal when it is non-zero.
Here is roughly what the data I'm modelling looks like:
One option would be to use a Gaussian mixture model, since we can think of a Gaussian with sd=0 as a constant value. Another option is to use a model like the following:
with pm.Model() as model:
    mean = pm.Normal('mean', mu=100, sd=10)
    sd = pm.HalfNormal('sd', sd=10)
    category = pm.Categorical('category', p=[0.5, 0.5], shape=len(x))
    mu = pm.switch(pm.eq(category, 0), 0, mean)
    eps = pm.switch(pm.eq(category, 0), 0.1, sd)
    obs = pm.Normal('obs', mu=mu, sd=eps, observed=x)
    step0 = pm.ElemwiseCategorical(vars=[category], values=[0, 1])
    step1 = pm.Metropolis(vars=[mean, sd])
    trace = pm.sample(10000, step=[step0, step1])
To find out the rate you can compute:
burnin = 100
np.mean(trace[burnin:]['category'])
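The parameters of the normal component can be read from the same trace in the same way; a small sketch (assuming numpy is imported as np and reusing the variable names from the model above):

burnin = 100

# Posterior means of the normal component's parameters, discarding burn-in samples.
post_mean = np.mean(trace[burnin:]['mean'])
post_sd = np.mean(trace[burnin:]['sd'])
print(post_mean, post_sd)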

Calculate cash flows given a target IRR

I apologize if the answer for this is somewhere already, I've been searching for a couple of hours now and I can't find what I'm looking for.
I'm building a simple financial calculator to calculate the cash flows given the target IRR. For example:
I have an asset worth $18,000,000 (which depreciates at $1,000,000/year)
I have a target IRR of 10% after 5 years
This means that the initial investment is $18,000,000, and in year 5, I will sell this asset for $13,000,000
To reach my target IRR of 10%, the annual cash flows have to be $2,618,875. Right now, I calculate this by hand in an Excel sheet through guess-and-check.
There are other variables and functionality, but they're not important for what I'm trying to do here. I've found plenty of libraries and functions that can calculate the IRR for a given set of cash flows, but nothing comes up when I try to find the cash flows for a given IRR.
At this point, I think the only solution is to basically run a loop to plug in the values, check to see if the IRR is higher or lower than the target IRR, and keep calculating the IRR until I get the cash flow that I want.
Is this the best way to approach this particular problem? Or is there a better way to tackle it that I'm missing? Help greatly appreciated!
Also, as an FYI, I'm building this in Ruby on Rails.
EDIT:
IRR Function:
NPV = -(I) + CF[1]/(1 + R)^1 + CF[2]/(1 + R)^2 + ... + CF[n]/(1 + R)^n
NPV = the Net Present Value (this value needs to be as close to 0 as possible)
I = Initial investment (in this example, $18,000,000)
CF = Cash Flow (this is the value I'm trying to calculate - it would end up being $2,618,875 if I calculated it by hand. In my financial calculator, all of the cash flows would be the same since I'm solving for them.)
R = Target rate of return (10%)
n = the year (so this example would end at 5)
I'm trying to calculate the Cash Flows to within a .005% margin of error, since the numbers we're working with are in the hundreds of millions.
Let
v0 = initial value
vn = value after n periods
n = number of periods
r = annual rate of return
y = required annual net income
The one period discount factor is:
j = 1/(1+r)
The present value of the investment is:
pv = - v0 + j*y + j^2*y + j^3*y +..+ j^n*y + j^n*vn
= - v0 + y*(j + j^2 + j^3 +..+ j^n) + j^n*vn
= - v0 + y*sn + j^n*vn
where
sn = j + j^2 + j^3 + j^4 +..+ j^n
We can calculate sn as follows:
sn   = j + j^2 + j^3 + j^4 +..+ j^n
j*sn =     j^2 + j^3 + j^4 +..+ j^n + j^(n+1)
sn - j*sn = j*(1 - j^n)
sn = j*(1 - j^n)/(1-j)
   = (1 - j^n)/[(1+r)*(r/(1+r))]
   = (1 - j^n)/r
Set pv = 0 and solve for y:
y*sn = v0 - vn * j^n
y = (v0 - vn * j^n)/sn
= r * (v0 - vn * j^n)/(1 - j^n)
Our Ruby method:
def ann_ret(v0, vn, n, r)
  j = 1/(1+r)
  (r * (v0 - vn * j**n)/(1 - j**n)).round(2)
end
With annual compounding:
ann_ret(18000000, 13000000, 5, 0.1) # => 2618987.4
With semi-annual compounding:
2 * ann_ret(18000000, 13000000, 10, 0.05) # => 2595045.75
With daily compounding:
365 * ann_ret(18000000, 13000000, 5*365, 0.10/365) # => 2570881.20
These values differ slightly from the required annual return you calculate. You should be able to explain the difference by comparing present value formulae.
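For reference, the same closed form transcribed into Python (a direct translation of the ann_ret method above, not a separate derivation):

def ann_ret(v0, vn, n, r):
    # Constant periodic cash flow that yields rate r over n periods,
    # given initial value v0 and terminal value vn.
    j = 1 / (1 + r)
    return round(r * (v0 - vn * j**n) / (1 - j**n), 2)

print(ann_ret(18000000, 13000000, 5, 0.1))  # => 2618987.4, matching the Ruby example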
There's a module called Newton in Ruby; it uses the Newton-Raphson method.
I've been using this module to implement the IRR function in this library:
https://github.com/Noverde/exonio
If you need the IRR, you can use it like this:
Exonio.irr([-100, 39, 59, 55, 20]) # ==> 0.28095
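If you prefer the numerical search described in the question (plug in a cash flow, check, adjust), a simple bisection on the NPV formula from the question's edit also works. This is only a sketch; the helper names npv and solve_cashflow and the bracketing bounds are illustrative choices:

def npv(initial, cashflow, terminal, rate, years):
    # -I + sum of CF/(1+R)^t for t = 1..n, plus the sale price discounted from year n.
    flows = sum(cashflow / (1 + rate) ** t for t in range(1, years + 1))
    return -initial + flows + terminal / (1 + rate) ** years

def solve_cashflow(initial, terminal, rate, years, lo=0.0, hi=None, tol=1e-6):
    # NPV increases monotonically with the cash flow, so bisection converges.
    hi = hi if hi is not None else initial  # generous upper bracket
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(initial, mid, terminal, rate, years) < 0:
            lo = mid  # NPV too low: the cash flow must be larger
        else:
            hi = mid
    return (lo + hi) / 2

print(round(solve_cashflow(18000000, 13000000, 0.10, 5), 2))  # about 2618987.4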

What are the ways of deciding probabilities in hidden markov models?

I am starting to learn hidden Markov models, and on the wiki page, as well as on GitHub, there are a lot of examples, but most of the probabilities are already given (70% chance of rain, 30% chance of changing state, etc.). The spell-checking and sentence examples seem to study books and then rank the probabilities of words.
So does the Markov model include a way of figuring out the probabilities, or are we supposed to use some other model to pre-calculate them?
Sorry if this question is off. I think it's straightforward how the hidden Markov model selects probable sequences, but the probability part is a bit grey to me (because it's often just provided). Examples or any info would be great.
For those not familiar with Markov models, here's an example (from Wikipedia): http://en.wikipedia.org/wiki/Viterbi_algorithm and http://en.wikipedia.org/wiki/Hidden_Markov_model
#!/usr/bin/env python
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy' : {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny' : {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}

# application code
# Helps visualize the steps of Viterbi.
def print_dptable(V):
    print " ",
    for i in range(len(V)): print "%7s" % ("%d" % i),
    print
    for y in V[0].keys():
        print "%.5s: " % y,
        for t in range(len(V)):
            print "%.7s" % ("%f" % V[t][y]),
        print

def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}]
    path = {}
    # Initialize base cases (t == 0)
    for y in states:
        V[0][y] = start_p[y] * emit_p[y][obs[0]]
        path[y] = [y]
    # Run Viterbi for t > 0
    for t in range(1, len(obs)):
        V.append({})
        newpath = {}
        for y in states:
            (prob, state) = max([(V[t-1][y0] * trans_p[y0][y] * emit_p[y][obs[t]], y0) for y0 in states])
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        # Don't need to remember the old paths
        path = newpath
    print_dptable(V)
    (prob, state) = max([(V[len(obs) - 1][y], y) for y in states])
    return (prob, path[state])

# start trigger
def example():
    return viterbi(observations,
                   states,
                   start_probability,
                   transition_probability,
                   emission_probability)
print example()
You're looking for an EM (expectation maximization) algorithm to compute the unknown parameters from sets of observed sequences. Probably the most commonly used is the Baum-Welch algorithm, which uses the forward-backward algorithm.
For reference, here is a set of slides I've used previously to review HMMs. It has a nice overview of Forward-Backward, Viterbi, and Baum-Welch
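To make the distinction concrete: if the hidden states are actually visible in your training data, the probabilities are just maximum-likelihood counts; it is only when the states are hidden that you need Baum-Welch to iterate over expected counts. A small counting sketch (the labelled (state, observation) sequences and function name are purely illustrative):

from collections import defaultdict

def estimate_probabilities(labelled_sequences):
    # Maximum-likelihood estimates from sequences of (state, observation) pairs.
    # This only works when the states are visible; with hidden states use Baum-Welch.
    start = defaultdict(float)
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    for seq in labelled_sequences:
        start[seq[0][0]] += 1
        for state, obs in seq:
            emit[state][obs] += 1
        for (s1, _), (s2, _) in zip(seq, seq[1:]):
            trans[s1][s2] += 1
    norm = lambda d: {k: v / sum(d.values()) for k, v in d.items()}
    return norm(start), {s: norm(d) for s, d in trans.items()}, {s: norm(d) for s, d in emit.items()}

seqs = [[('Rainy', 'shop'), ('Rainy', 'clean'), ('Sunny', 'walk')],
        [('Sunny', 'walk'), ('Sunny', 'shop'), ('Rainy', 'clean')]]
print(estimate_probabilities(seqs))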

Resources