Null linear regression model in vowpal wabbit - vowpalwabbit

I would like to run a linear regression in vowpal wabbit using the null model (intercept only, for comparison purposes). Which optimizer should I use for this? Also, is the best constant loss reported that of the simple average?

A1: For linear regression, if you care about averages, you should use --loss_function squared (which is the default). If you care more about the median than the average (e.g. if you have some outliers that may greatly mess up the average), use --loss_function quantile. BTW: these are not optimizers, just loss functions. I would leave the optimizer (enhanced SGD) as is (the default), since it works very well.
A2: The best constant is the constant prediction that would give the lowest error, and the best constant loss is the average error for always predicting that best constant number. For squared loss, the best constant is the weighted average of all your target variables. This is not the same as the intercept B in the linear-regression formula y = sum_i A_i*x_i + B: B is the free term, independent of the inputs, and it is not necessarily the average of the ys.
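For intuition, here is a minimal Python sketch (not part of the original answers) of how the best constant and its loss work out under squared loss, assuming unit example weights:

import numpy as np

y = np.array([1.0, 2.0, 2.0, 7.0])                       # toy target values
best_constant = np.average(y)                            # weighted average of the targets (all weights 1 here)
best_constant_loss = np.mean((y - best_constant) ** 2)   # average squared error of always predicting that constant
print(best_constant, best_constant_loss)                 # 3.0 5.5

With non-unit example weights, both the average and the loss would be weighted accordingly.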
A3: If you want to find the intercept of your model, look for the weight named Constant in your model. This would require two short steps:
# 1) Train your model from the dataset
# and save the model in human-readable (aka "inverted hash") format
vw --invert_hash model.ih your_dataset
# 2) Search for the free/intercept term in the readable model
grep '^Constant:' model.ih
The output of the grep step should be something like:
Constant:116060:-1.085126
where 116060 is the hash slot (location in the model) and -1.085126 is the value of the intercept (assuming no hash collisions and a purely linear model).

Related

How to choose a custom seasonality in fbprophet without overfitting

I was looking at custom prophet seasonality parameters from this link: https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html
And they say: "The default values are often appropriate, but they can be increased when the seasonality needs to fit higher-frequency changes, and generally be less smooth. Increasing the number of Fourier terms allows the seasonality to fit faster changing cycles, but can also lead to overfitting: N Fourier terms corresponds to 2N variables used for modelling the cycle"
What does the last part, "N Fourier terms corresponds to 2N variables used for modelling the cycle", mean, and what method should I follow to select a good value for this? For example, with ARIMA I can use auto_arima and/or autocorrelation to determine the values of p, d, q.
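As a point of reference (not from the original question): the N in that sentence corresponds to the fourier_order argument of Prophet's add_seasonality, and each Fourier term contributes one sine and one cosine coefficient, which is where the 2N fitted variables come from. A minimal sketch, assuming a dataframe with the usual 'ds'/'y' columns and an illustrative file name:

import pandas as pd
from fbprophet import Prophet   # newer releases use: from prophet import Prophet

df = pd.read_csv('example_timeseries.csv')   # hypothetical file with 'ds' and 'y' columns

m = Prophet(weekly_seasonality=False)
# fourier_order=5 means 5 Fourier terms, i.e. 2*5 = 10 coefficients for this cycle;
# larger values fit faster-changing cycles but risk overfitting.
m.add_seasonality(name='monthly', period=30.5, fourier_order=5)
m.fit(df)

One reasonable approach is to treat fourier_order like any other hyperparameter and compare a few values with out-of-sample (rolling-origin) validation.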

How to calculate the standard errors of ERGM predicted probabilities?

I'm having trouble estimating the standard errors of the predicted probabilities from an ERGM model in order to calculate a confidence interval. Getting the predicted probabilities is not a problem, but I want to get a sense of the uncertainty surrounding the predictions.
Below is a reproducible example based on the dataset of marriage and business ties among Renaissance Florentine families.
library(statnet)
data(flo)
flomarriage <- network(flo,directed=FALSE)
flomarriage
flomarriage %v% "wealth" <- c(10,36,27,146,55,44,20,8,42,103,48,49,10,48,32,3)
flomarriage
gest <- ergm(flomarriage ~ edges +
absdiff("wealth"))
summary(gest)
plogis(coef(gest)[['edges']] + coef(gest)[['absdiff.wealth']]*10)
Based on the model, it is estimated that a wealth difference of 10 corresponds to a 0.182511 probability of a tie. My first question is, is this a correct interpretation? And my second question is, how does one calculate the standard error of this probability?
This is the correct interpretation for a dyad-independent model such as this one. For a dyad-dependent model, it would be the conditional probability given the rest of the network.
You can obtain the standard error of the prediction on the logit scale by rewriting your last line as a dot product of a weight vector and the coefficient vector:
eta <- sum(c(1,10)*coef(gest))
plogis(eta)
then, since vcov(gest) is the covariance matrix of parameter estimates, and using the variance formulas (e.g., http://www.stat.cmu.edu/~larry/=stat401/lecture-13.pdf),
(var.eta <- c(1,10)%*%vcov(gest)%*%c(1,10))
You can then get the variance (and the standard error) of the predicted probability using the Delta Method (e.g., https://blog.methodsconsultants.com/posts/delta-method-standard-errors/). However, unless you require the standard error as such, as opposed to a confidence interval, my advice would be to first calculate the interval for eta (i.e., eta + qnorm(c(0.025,0.0975))*sqrt(var.eta)), then call plogis() on the endpoints.
Really very helpful answer! Many thanks for it.
Here's just a small, almost cosmetic note about the suggested formula: 1) the upper bound in qnorm should be 0.975, not 0.0975, and 2) use matrix multiplication before sqrt --> eta + qnorm(c(0.025,0.975))%*%sqrt(var.eta).

What is a bad, decent, good, and excellent F1-measure range?

I understand the F1-measure is a harmonic mean of precision and recall. But what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.
Consider sklearn.dummy.DummyClassifier(strategy='uniform'), a classifier that makes random guesses (i.e. a bad classifier). We can view DummyClassifier as a benchmark to beat; now let's look at its f1-score.
In a binary classification problem with a balanced dataset (6198 samples in total: 3099 labelled 0 and 3099 labelled 1), the f1-score is 0.5 for both classes, and the weighted average is 0.5.
Second example, using DummyClassifier(strategy='constant'), i.e. guessing the same label every time (label 1 in this case): the average of the f1-scores is 0.33, while the f1 for label 0 is 0.00.
I consider these to be bad f1-scores, given the balanced dataset.
PS: summaries generated using sklearn.metrics.classification_report.
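For readers who want to reproduce this kind of baseline, here is a minimal sketch (the synthetic data and the zero_division argument are my own choices, not from the original post):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
y = np.array([0] * 3099 + [1] * 3099)        # balanced binary labels, 6198 samples in total
X = rng.normal(size=(len(y), 5))             # features are ignored by a dummy classifier anyway

for strategy, kwargs in [("uniform", {}), ("constant", {"constant": 1})]:
    clf = DummyClassifier(strategy=strategy, random_state=0, **kwargs)
    clf.fit(X, y)
    print("--- strategy =", strategy, "---")
    print(classification_report(y, clf.predict(X), zero_division=0))

Evaluating on the training data is fine here because a dummy classifier ignores the features entirely.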
You did not find any reference for an f1-measure range because there is no such range. The F1 measure is a combined metric of precision and recall.
Let's say you have two algorithms: one has higher precision, the other higher recall. From that observation alone you cannot tell which algorithm is better, unless your goal is simply to maximize precision.
So, given this ambiguity about how to choose the superior algorithm of the two (one with higher recall, the other with higher precision), we use the f1-measure to choose between them.
The f1-measure is a relative measure, which is why there is no absolute range defining how good your algorithm is.

Machine Learning Algorithm for Completing Sparse Matrix Data

I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate in running competitions of 10 km and 20 km over hilly courses, i.e. every competition has its own difficulty.
The finishing times from users are approximately inverse-normally distributed for every competition.
One can write this problem as a matrix:
        Comp1   Comp2   Comp3
User1   20min   ??      10min
User2   25min   20min   12min
User3   30min   25min   ??
User4   30min   ??      ??
I would like to complete the matrix above, which has size 1000x20 and a sparseness of 8% (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of the distributions). Moreover, the correlations between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets you most of the way to the solution. I'll codify your intuition that the combination of ability of a user and difficulty of the course yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users, so that u_i is user i's speed. Let the vector v denote the difficulty of the courses, so that v_j is course j's difficulty. Also, when available, let t_ij be user i's time on course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse-Gaussian distributed, a sensible model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the prediction error among the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian algorithm. It minimizes the above cost function by gradient descent. The gradients of f with respect to u and v are
df/du_i = sum_j (u_i * v_j - y_ij) * v_j
df/dv_j = sum_i (u_i * v_j - y_ij) * u_i
Plug these gradients into a Conjugate Gradient solver or BFGS optimizer, like MATLAB's fminunc or scipy's optimize.fmin_ncg or optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
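For concreteness, here is a minimal Python sketch of this approach (my own illustration, not from the original answer): it minimizes the masked squared error with scipy.optimize.minimize (L-BFGS-B) using the gradients above; fit_rank1, the toy matrix, and the random initialization are assumptions.

import numpy as np
from scipy.optimize import minimize

def fit_rank1(Y, mask, seed=0):
    # Fit y_ij ~ u_i * v_j on the observed entries (mask == True) of the speed matrix Y.
    n_users, n_courses = Y.shape
    rng = np.random.default_rng(seed)

    def unpack(x):
        return x[:n_users], x[n_users:]

    def loss_and_grad(x):
        u, v = unpack(x)
        resid = (np.outer(u, v) - Y) * mask      # (u_i*v_j - y_ij) on observed entries only
        loss = np.sum(resid ** 2)
        grad_u = 2.0 * resid @ v                 # d loss / du_i (the factor 2 does not change the minimizer)
        grad_v = 2.0 * resid.T @ u               # d loss / dv_j
        return loss, np.concatenate([grad_u, grad_v])

    x0 = rng.uniform(0.5, 1.5, size=n_users + n_courses)
    res = minimize(loss_and_grad, x0, jac=True, method="L-BFGS-B")
    u, v = unpack(res.x)
    return np.outer(u, v)                        # completed speed matrix; times are 1 / speed

# Example with the toy data above: speeds y_ij = 1 / t_ij, NaN where unknown.
times = np.array([[20., np.nan, 10.],
                  [25., 20., 12.],
                  [30., 25., np.nan],
                  [30., np.nan, np.nan]])
Y = 1.0 / times
mask = ~np.isnan(Y)
Y = np.nan_to_num(Y)
print(1.0 / fit_rank1(Y, mask))                  # predicted finishing times

For higher-rank variants, u and v become matrices with several columns, but the structure of the objective and gradients stays the same.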
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been proposed. The resulting algorithms are just as simple to code up and seem to work very well. Check out, for example, Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will end up again computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this, perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step, normalize your data to zero mean and unit standard deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the standard deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is, your loss function is:
    total_error = 0
    forall i, j:
        if (Participant i participated in Race j):
            actual = ActualRaceResult(i, j)
            predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
            total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term in the loss function, for example the squared lengths of the feature vectors.)
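A compact sketch of that alternating update with regularization might look like the following (my own illustration; the function name, learning rate, and the (i, j, result) triple format are assumptions):

import numpy as np

def sgd_factorize(results, n_participants, n_races, n=8, lr=0.01, reg=0.1, epochs=200, seed=0):
    # results: list of (participant i, race j, normalized result) triples for known races
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_participants, n))   # participant feature vectors
    R = rng.normal(scale=0.1, size=(n_races, n))          # race feature vectors
    for _ in range(epochs):
        for i, j, actual in results:
            predicted = P[i] @ R[j]                       # scalar-product prediction
            err = actual - predicted
            # gradient step on (actual - predicted)^2 + reg * (|P_i|^2 + |R_j|^2)
            grad_p = err * R[j] - reg * P[i]
            grad_r = err * P[i] - reg * R[j]
            P[i] += lr * grad_p
            R[j] += lr * grad_r
    return P, R

Predictions for missing participant/race pairs are then P[i] @ R[j], mapped back through the inverse of the preprocessing step.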
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classical missing-data-recovery task. There are several different methods; one I can suggest is based on the Self-Organizing Feature Map (Kohonen map).
Below, it is assumed that every athlete's record is a pattern and every competition result is a feature.
Basically, you should divide your data into two sets: the first with fully defined patterns, and the second with patterns that have partially missing features. I assume this is workable because the sparsity is 8%, i.e. you have enough data (92%) to train the net on undamaged records.
Then you feed the first set to the SOM and train it on this data; during this process all features are used. I won't copy the algorithm here, because it can be found in many public sources, and several implementations are available.
After the net is trained, you can feed the patterns from the second set to it. For each pattern, the net should calculate the best matching unit (BMU) based only on the features present in that pattern. Then you can take from the BMU the weights corresponding to the missing features.
As an alternative, you could skip dividing the data into two sets and train the net on all patterns, including the ones with missing features. For such patterns, the learning process should be altered in a similar way: the BMU should be calculated only on the features present in each pattern.
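Assuming you already have a trained SOM codebook of shape (grid_x, grid_y, n_features), the BMU-on-observed-features imputation step could be sketched like this (the function and shapes are my own assumptions; training the map itself is left to whichever SOM implementation you use):

import numpy as np

def impute_with_som(weights, pattern):
    # weights: (grid_x, grid_y, n_features) trained SOM codebook
    # pattern: 1-D array of features with np.nan where a competition result is missing
    observed = ~np.isnan(pattern)
    # distance to every unit, computed only over the observed features
    dists = np.sum((weights[..., observed] - pattern[observed]) ** 2, axis=-1)
    bx, by = np.unravel_index(np.argmin(dists), dists.shape)
    filled = pattern.copy()
    filled[~observed] = weights[bx, by, ~observed]        # take missing values from the BMU's weights
    return filled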
I think you can have a look at the recent low rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This formulation minimizes the rank of your matrix M. P in the constraint is an operator that takes the known entries of your matrix M' and constrains the corresponding entries of M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex envelope, the nuclear norm ||M||_*. You then trade off the two terms by tuning the parameter lambda.
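One common way to implement this relaxation is iterative soft-thresholding of singular values (a SoftImpute-style scheme). Here is a minimal sketch under my own assumptions about the interface; this variant keeps the known entries fixed and shrinks singular values by lam, so the penalty placement differs slightly from the formula above, but the trade-off controlled by lambda is the same idea:

import numpy as np

def soft_impute(M_obs, mask, lam=0.1, n_iters=200):
    # M_obs: matrix with arbitrary values at unobserved entries; mask: True where observed
    M = np.where(mask, M_obs, 0.0)          # initialize missing entries with 0
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s = np.maximum(s - lam, 0.0)        # soft-threshold the singular values
        M_low = (U * s) @ Vt                # low-rank estimate
        M = np.where(mask, M_obs, M_low)    # keep known entries, fill the rest
    return M

# Toy usage with the race-time matrix from the question (NaN = unknown):
times = np.array([[20., np.nan, 10.],
                  [25., 20., 12.],
                  [30., 25., np.nan],
                  [30., np.nan, np.nan]])
mask = ~np.isnan(times)
print(soft_impute(np.nan_to_num(times), mask, lam=1.0))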

What does a Bayesian Classifier score represent?

I'm using the ruby classifier gem whose classifications method returns the scores for a given string classified against the trained model.
Is the score a percentage? If so, is the maximum difference 100 points?
It's the logarithm of a probability. With a large training set, the actual probabilities are very small numbers, so the logarithms are easier to compare. Theoretically, scores will range from infinitesimally close to zero down to negative infinity. 10**score * 100.0 will give you the actual probability as a percentage, which indeed has a maximum difference of 100.
Actually, to recover a probability from a typical naive Bayes classifier whose scores are logarithms with base b, compute b^score / (1 + b^score). This is the inverse logit (http://en.wikipedia.org/wiki/Logit). However, given the independence assumptions of the NBC, these scores tend to be too high or too low, and probabilities calculated this way will accumulate at the boundaries. It is better to compute the scores on a holdout set and do a logistic regression of correctness (1 or 0) on score to get a better feel for the relationship between score and probability.
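A small sketch of that conversion (my own illustration; score_to_probability and the base are assumptions about how the gem reports scores):

import numpy as np

def score_to_probability(score, base=10.0):
    # inverse logit: assumes the score is log_base(p / (1 - p))
    z = base ** np.asarray(score, dtype=float)
    return z / (1.0 + z)

print(score_to_probability([-2.0, -1.0, 0.0, 1.0]))   # ~[0.0099, 0.0909, 0.5, 0.909]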
From a Jason Rennie paper:
2.7 Naive Bayes Outputs Are Often Overconfident
Text databases frequently have 10,000 to 100,000 distinct vocabulary words; documents often contain 100 or more terms. Hence, there is great opportunity for duplication.
To get a sense of how much duplication there is, we trained a MAP Naive Bayes model with 80% of the 20 Newsgroups documents. We produced p(c|d;D) (posterior) values on the remaining 20% of the data and show statistics on max_c p(c|d;D) in table 2.3. The values are highly overconfident. 60% of the test documents are assigned a posterior of 1 when rounded to 9 decimal digits. Unlike logistic regression, Naive Bayes is not optimized to produce reasonable probability values. Logistic regression performs joint optimization of the linear coefficients, converging to the appropriate probability values with sufficient training data. Naive Bayes optimizes the coefficients one-by-one. It produces realistic outputs only when the independence assumption holds true. When the features include significant duplicate information (as is usually the case with text), the posteriors provided by Naive Bayes are highly overconfident.
