Vowpal Wabbit readable model weight interpretation

I have recently been using Vowpal Wabbit for classification, and I have a question about the readable_model output.
Here is my command: vw --quiet --save_resume --compressed aeroplane.txt.gzip --loss_function=hinge --readable_model aeroplane.txt
Here is the readable model file:
Version 7.7.0
Min label:-1.000000
Max label:1.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:1
initial_t 0.000000
norm normalizer 116869.664062
t 3984.000051
sum_loss 2400.032932
sum_loss_since_last_dump 2400.032932
dump_interval 1.000000
min_label -1.000000
max_label 1.000000
weighted_examples 3984.000051
weighted_labels 0.000051
weighted_unlabeled_examples 0.000000
example_number 2111
total_features 1917412
0:4.879771 0.004405 0.007933
1:5.268138 0.017729 0.020223
2:0.464031 0.001313 0.007443
3:3.158707 0.083495 0.029674
4:-22.006199 0.000721 0.004386
5:7.686290 0.018617 0.011562
......
1023:0.363004 0.022025 0.020973
116060:0.059659 2122.647461 1.000000
I have 1024 features for each example, and I use i-1 as the feature name for the i-th feature.
My question is: why do I get 3 weights for each feature? Isn't there supposed to be only 1 weight? I'm new to ML and quite confused.

Simple SGD (stochastic gradient descent) needs just one parameter per feature: the weight. However, VW uses the --adaptive --normalized --invariant update rule by default.
So in addition to the weight, it stores two other parameters per feature:
the sum of squared gradients for --adaptive (an AdaGrad-style update),
a per-feature normalization term for --normalized.
These two additional parameters are stored in the model (readable or binary) only if --save_resume is provided.
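If you want to pull those per-feature triples out of the readable model programmatically, here is a minimal Python sketch. It assumes the --save_resume layout shown above (hash:weight followed by two extra state values on each feature line); the function name is made up for illustration.

def parse_feature_lines(path):
    """Return {feature_hash: (weight, extra1, extra2)} from a readable model."""
    weights = {}
    with open(path) as fh:
        for line in fh:
            head, sep, rest = line.partition(":")
            # Feature lines start with a bare integer hash; header lines do not.
            if not sep or not head.strip().isdigit():
                continue
            parts = rest.split()
            if len(parts) == 3:
                # extra1/extra2 are the additional --adaptive / --normalized
                # state, stored only when --save_resume is used.
                weights[int(head)] = tuple(map(float, parts))
    return weights

print(parse_feature_lines("aeroplane.txt")[0])   # (4.879771, 0.004405, 0.007933)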
What is strange is that if you run vw --sgd --save_resume, the readable model still contains three parameters per feature, although --sgd effectively means not-adaptive, not-normalized and not-invariant. I think this is a bug in the implementation of --readable_model.
EDIT: This last strange behavior was a bug indeed and it is fixed now.

Related

Dynamic number system in Qlik Sense

My data consists of large numbers. I have a column, say 'amount'; when I use it in charts (sum of amount on the Y axis) it shows something like 1.4G. I want to show billions as e.g. 2.8B, millions as 80M, and thousands (14,000) simply as 14k.
I have used if(sum(amount)/1000000000 > 1, Num(sum(amount)/1000000000, '#,###B'), Num(sum(amount)/1000000, '#,###M')) but it does not show the M or B at the end of the figure, and I also don't know how to include thousands in the same expression.
EDIT: Updated to include the dual() function.
This worked for me:
=dual(
if(sum(amount) < 1, Num(sum(amount), '#,##0.00'),
if(sum(amount) < 1000, Num(sum(amount), '#,##0'),
if(sum(amount) < 1000000, Num(sum(amount)/1000, '#,##0k'),
if(sum(amount) < 1000000000, Num(sum(amount)/1000000, '#,##0M'),
Num(sum(amount)/1000000000, '#,##0B')
))))
, sum(amount)
)
Here are some example outputs using this script to format it:
=sum(amount)     Formatted
2,526,163,764    3B
79,342,364       79M
5,589,255        5M
947,470          947k
583              583
0.6434           0.64
To get more decimals for any of those, like 2.53B instead of 3B, you can format them like '#,##0.00B' by adding more zeroes at the end.
Also make sure that the Number Formatting property is set to Auto or Measure expression.
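For readers who want the same bucketing outside Qlik, here is a rough Python sketch of the threshold logic used in the expression above (the function name is made up; it only mirrors the breakpoints and the '#,##0'-style rounding):

def human_format(amount):
    # Mirror the breakpoints of the Qlik dual() expression above.
    if amount < 1:
        return f"{amount:,.2f}"
    if amount < 1_000:
        return f"{amount:,.0f}"
    if amount < 1_000_000:
        return f"{amount / 1_000:,.0f}k"
    if amount < 1_000_000_000:
        return f"{amount / 1_000_000:,.0f}M"
    return f"{amount / 1_000_000_000:,.0f}B"

for value in (2_526_163_764, 947_470, 0.6434):
    print(human_format(value))   # 3B, 947k, 0.64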

Keras predict gives different error than evaluate, loss different from metrics

I have the following problem:
I have an autoencoder in Keras, and train it for a few epochs. The training overview shows a validation MAE of 0.0422 and an MSE of 0.0024.
However, if I then call network.predict and manually calculate the validation errors, I get 0.035 and 0.0024.
One would assume that my manual calculation of the MAE is simply incorrect, but the weird thing is that if I use an identity model (simply outputs what you input) and use that to evaluate the predicted values, the same error value is returned as for my manual calculation. The code looks as follows:
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers
from keras.callbacks import EarlyStopping

input = Input(shape=(X_train.shape[1], ))
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(input)
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
decoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
decoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(decoded)
decoded = Dense(X_train.shape[1], activation='sigmoid')(decoded)
network = Model(input, decoded)
# sgd = SGD(lr=8, decay=1e-6)
# network.compile(loss='mean_squared_error', optimizer='adam')
network.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mse'])
# Fitting the data
network.fit(X_train, X_train, epochs=2, batch_size=1, shuffle=True, validation_data=(X_valid, X_valid),
callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=20, verbose=0, mode='auto')])
# Results
recon_valid = network.predict(X_valid, batch_size=1)
score2 = network.evaluate(X_valid, X_valid, batch_size=1, verbose=0)
print('Network evaluate result: mae={}, mse={}'.format(*score2))
x = Input((X_train.shape[1],))
m = Model(x, x)
m.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mse'])
score1 = m.evaluate(recon_valid, X_valid, batch_size=1, verbose=0)
print('Identity evaluate result: mae={}, mse={}'.format(*score1))
errors_test = np.absolute(X_valid - recon_valid)
print("Manual MAE: {}".format(np.average(errors_test)))
errors_test = np.square(X_valid - recon_valid)
print("Manual MSE: {}".format(np.average(errors_test)))
Which outputs the following:
Train on 282 samples, validate on 94 samples
Epoch 1/2
2018-04-18 17:24:01.464947: I C:\tf_jenkins\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
282/282 [==============================] - 0s - loss: 0.0861 - mean_squared_error: 0.0187 - val_loss: 0.0451 - val_mean_squared_error: 0.0025
Epoch 2/2
282/282 [==============================] - 0s - loss: 0.0440 - mean_squared_error: 0.0025 - val_loss: 0.0422 - val_mean_squared_error: 0.0024
Network evaluate result: mae=0.04216482736011769, mse=0.0024067993242382767
Identity evaluate result: mae=0.03506102238563781, mse=0.0024067993242382767
Manual MAE: 0.03506102412939072
Manual MSE: 0.002406799467280507
I know that my manual calculation is correct, since the identity model (m) returns the same value. The only possible explanation for the difference in MAE values would then be if network.evaluate(X_valid, X_valid) somehow uses different values than those returned by network.predict(X_valid), but then the MSE would also be different.
This leaves me completely confused, thinking there might be a bug in the Keras MAE calculation. Has anyone had this issue before or have any ideas how it might be fixed? I am using the Tensorflow backend.
Any help would be much appreciated!
EDIT: I'm almost certain this is a bug. If I keep loss='mae' but also add metrics=['mse', 'mae'], the MAE returned by the metrics is the same as my manual computation and the identity model. The same is true for MSE: if I set loss='mse', the MSE returned by the metric is different from the loss.
It turns out that the loss is supposed to be different from the metric, because of the regularization. With an L1 activity regularizer, the loss includes an extra penalty proportional to the layer activations, so the reported loss is higher (in my case) than the plain reconstruction error. The metrics don't include this penalty, and therefore return a different value, which equals what one would get when manually computing the error.
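Here is a minimal sketch of that effect, assuming a recent tensorflow.keras install; the toy data, layer sizes and regularization strength are made up for illustration:

import numpy as np
from tensorflow.keras import Input, Model, regularizers
from tensorflow.keras.layers import Dense

x = np.random.rand(64, 10).astype("float32")

inp = Input(shape=(10,))
hidden = Dense(32, activation="relu",
               activity_regularizer=regularizers.l1(1e-4))(inp)
out = Dense(10, activation="sigmoid")(hidden)
model = Model(inp, out)
model.compile(loss="mae", optimizer="adam", metrics=["mae"])
model.fit(x, x, epochs=1, batch_size=8, verbose=0)

loss, mae_metric = model.evaluate(x, x, verbose=0)
pred = model.predict(x, verbose=0)
print(loss)                       # MAE plus the L1 activity penalty
print(mae_metric)                 # plain MAE, no penalty included
print(np.mean(np.abs(x - pred)))  # manual MAE, matches the metric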
The metrics during training and validation differ for several reasons:
The datasets are different.
During training the weights change at every step, so the metrics change too.
The metric shown during training is for the current batch of data, or a running average over the batches seen so far; for evaluation, the metric is computed over the whole dataset with the final weights (see the toy illustration below).
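A toy illustration of that last point, with made-up numbers:

import numpy as np

# Hypothetical per-batch MAEs logged while the weights are still improving.
batch_mae = np.array([0.09, 0.06, 0.05, 0.04])

# What the progress bar reports at the end of the epoch: an average over the
# batches, each computed with different (older) weights.
print(batch_mae.mean())   # 0.06

# What evaluate() reports: one pass over the whole set with the final weights
# only, so it is usually lower once the model has improved during the epoch.
print(batch_mae[-1])      # 0.04, standing in for the final-weights score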

Is it possible to vectorize annotation for matplotlib?

As a part of a large QC benchmark I am creating a large number (approx 100K) of scatter plots in a single PDF using PdfPages backend. (See further down for the code)
The issue I am having is that the plotting takes too much time, see output from a custom profiling/debugging effort:
Checkpoint1: Predictions done in 1.110076904296875 millis
Checkpoint2: df created and correlations calculated in 3.108978271484375 millis
Checkpoint3: plotting and accumulating done in 231.31990432739258 millis
Cycle completed in 0.23553895950317383 secs
----------------------
Checkpoint1: Predictions done in 3.718852996826172 millis
Checkpoint2: df created and correlations calculated in 2.353191375732422 millis
Checkpoint3: plotting and accumulating done in 155.93385696411133 millis
Cycle completed in 0.16200590133666992 secs
----------------------
Checkpoint1: Predictions done in 2.920866012573242 millis
Checkpoint2: df created and correlations calculated in 1.995086669921875 millis
Checkpoint3: plotting and accumulating done in 161.8819236755371 millis
Cycle completed in 0.16679787635803223 secs
The plotting time increases 2-3x if I annotate the points, which is necessary for the use case. As you can see below, I have tried both itertuples() and apply(); switching to apply() did not give a significant change in the times as far as I can see.
import matplotlib.pyplot as plt

def annotate(row, ax):
ax.annotate(row.name, (row.exp, row.model),
xytext=(10, 20), textcoords='offset points',
arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
family='sans-serif', fontsize=8, color='darkslategrey')
def plot2File(df, file, seq, z, p, s):
""" Plot predictions vs experimental """
plttitle = f"Correlations for {seq}+{z} \n pearson={p} \n spearman={s}"
ax = df.plot(x='exp', y='model', kind='scatter', title=plttitle, s=40)
df.apply(annotate, ax=ax, axis=1)
# for row in df.itertuples():
# ax.annotate(row.Index, (row.exp, row.model),
# xytext=(10, 20), textcoords='offset points',
# arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
# family='sans-serif', fontsize=8, color='darkslategrey')
plt.savefig(file, bbox_inches='tight', format='pdf')
plt.close()
Given the nice explanation by Jeff on a question regarding iterrows() I was wondering if it would be possible to vectorize the annotation process? Or should I ditch using a data frame altogether?

WinBUGS: adding time and product fixed effects in hierarchical data

I am working on hierarchical panel data using WinBUGS. Assume data on school performance: the dependent variable is logs, with independent variables logp and rank. All schools are divided into three categories (cat) and I need a beta coefficient for each category (thus HLM). I want to account for time-specific and school-specific effects in the model. One way would be to add dummy variables to the list of variables under mu[i], but that would get messy because the number of schools runs up to 60. I am sure there must be a better way to handle that.
My data looks like the following:
school time logs logp cat rank
1 1 4.2 8.9 1 1
1 2 4.2 8.1 1 2
1 3 3.5 9.2 1 1
2 1 4.1 7.5 1 2
2 2 4.5 6.5 1 2
3 1 5.1 6.6 2 4
3 2 6.2 6.8 3 7
#logs = log(score)
#logp = log(average hours of inputs)
#rank - rank of school
#cat = section red, section blue, section white in school (hierarchies)
My WinBUGS code is given below.
model {
  # N observations
  for (i in 1:n) {
    logs[i] ~ dnorm(mu[i], tau)
    mu[i] <- bcons + bprice*(logp[i]) + brank[cat[i]]*(rank[i])
  }
  # C categories
  for (c in 1:C) {
    brank[c] ~ dnorm(beta, taub)
  }
  # priors
  bcons ~ dnorm(0,1.0E-6)
  bprice ~ dnorm(0,1.0E-6)
  bad ~ dnorm(0,1.0E-6)
  beta ~ dnorm(0,1.0E-6)
  tau ~ dgamma(0.001,0.001)
  taub ~ dgamma(0.001,0.001)
}
As you can see in the data sample above, I have multiple observations per school over time. How can I modify the code to account for time and school specific fixed effects? I have used Stata in the past, where the fe and be options and i.time factor variables take care of fixed effects in panel data. But here I am lost.

The output of the Stanford NLP classifier

We are learning to use the stanford-nlp classifier. As its wiki page says, it can be used to build a model for classifying numerical data like Iris:
http://www-nlp.stanford.edu/wiki/Software/Classifier#Iris_data_set
But we have difficulty interpreting some of the output: there are 4 columns for input attributes (1-Value, 2-Value, 3-Value, 4-Value) and one column for the output label (Iris-setosa, Iris-versicolor, Iris-virginica). But what is CLASS here? Is it the output column overall?
Built this classifier: Linear classifier with the following weights
            Iris-setosa  Iris-versicolor  Iris-virginica
3-Value        -2.27          0.03             2.26
CLASS           0.34          0.65            -1.01
4-Value        -1.07         -0.91             1.99
2-Value         1.60         -0.13            -1.43
1-Value         0.69          0.42            -1.23
Total:         -0.72          0.05             0.57
Prob:           0.15          0.32             0.54
CLASS is like the intercept term in a simple linear regression - it represents the relative frequency of different classes. It is a feature of every instance.
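As a quick check, here is a small Python sketch showing that the Prob row is just the softmax of the Total row (each Total being the sum of the active feature weights for that class, including the always-on CLASS feature):

import numpy as np

totals = np.array([-0.72, 0.05, 0.57])        # Iris-setosa, Iris-versicolor, Iris-virginica
probs = np.exp(totals) / np.exp(totals).sum()
print(probs)   # ~[0.147 0.318 0.535], matching the Prob row up to the rounding
               # already applied to the printed Total values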
