When passing the init_score parameter to lightgbm for binary targets, do I pass my initial predictions as probabilities or as logit values?

My understanding of GBMs is that the trees in the model predict the logit (log odds), which is then passed through the logistic function to get the final probability prediction. Given this, it is unclear whether I should pass my initial predictions as probabilities or as logit values.
Also, when I make predictions, do I sum my initial probability prediction with the GBM's probability prediction, or do I convert both to logits, sum them, and then apply the logistic function to the summed logits?

Maybe this blog will give you some idea.
It says that init_score for binary classification is calculated this way:
import numpy as np

def logloss_init_score(y):
    p = y.mean()                      # base rate of the positive class
    p = np.clip(p, 1e-15, 1 - 1e-15)  # never hurts
    log_odds = np.log(p / (1 - p))
    return log_odds
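In other words, init_score is on the logit (log odds) scale, so pass log odds rather than probabilities. Below is a minimal sketch of how that is typically wired up, assuming the usual LightGBM behaviour that init_score acts only as a per-row training offset and is not added back by predict(), so you re-add the offset on the logit scale yourself; X, y, base_prob, X_new and new_base_prob are placeholder names for your data and initial predictions:

import numpy as np
import lightgbm as lgb

# X, y: training features and binary labels; base_prob: your initial probability predictions
base_margin = np.log(base_prob / (1 - base_prob))        # convert probabilities to log odds
train_set = lgb.Dataset(X, label=y, init_score=base_margin)
booster = lgb.train({"objective": "binary"}, train_set)

# init_score is not added back automatically at prediction time,
# so combine everything on the logit scale and apply the sigmoid yourself.
raw = booster.predict(X_new, raw_score=True)              # the trees' logit contribution
new_margin = np.log(new_base_prob / (1 - new_base_prob))  # offset for the new rows
final_prob = 1.0 / (1.0 + np.exp(-(raw + new_margin)))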

Related

Using RMSE function with 5-fold cross validation to choose the best model out of 3

I have defined three different models obtained from the dataset diabetes in the lars library. The first model (M1) is the one that minimizes the BIC value over all possible regression models obtained by combining the explanatory variables (there are p = 10, so 2^10 possible models). The other two are obtained through glmnet and are Lasso regressions with lambda.min (M2) and lambda.1se (M3) respectively, where lambda.min and lambda.1se come from cv.glmnet. Now I should perform 5-fold cross-validation using the RMSE (Root Mean Square Error) to check which of the three models M1, M2 and M3 has the best predictive performance. To compute the errors of the models obtained from the Lasso, I have to use the ordinary least squares estimates.
This is my code so far:
library(lars)
library(glmnet)

data(diabetes)
y <- diabetes$y
x <- diabetes$x
x2 <- diabetes$x2
X <- as.data.frame(cbind(x))
Y <- as.data.frame(y)
p <- 10
n <- 442

# M1: exhaustive search over all 2^p - 1 non-empty subsets, keeping the lowest BIC
best_score <- Inf
M1 <- NA
for (i in 1:(2^p - 1)) {
  model <- lm(y ~ ., data = subset(X, select = which(as.integer(intToBits(i)) == 1)))
  if (BIC(model) < best_score) {
    M1 <- model
    best_score <- BIC(model)
  }
}

# Lasso path and cross-validated choice of lambda
W <- as.matrix(X)
Y <- as.matrix(Y)
lasso <- glmnet(W, Y)
x11()
plot(lasso, label = TRUE)
x11()
plot(lasso, xvar = "lambda", label = TRUE)
lasso$df
lasso$lambda

cvfit <- cv.glmnet(W, Y)
cvfit
coef(cvfit, s = "lambda.min")
coef(cvfit, s = "lambda.1se")

# M2, M3: Lasso fits at lambda.min and lambda.1se
M2 <- glmnet(W, Y, lambda = cvfit$lambda.min)
M3 <- glmnet(W, Y, lambda = cvfit$lambda.1se)
I really don't know how to proceed from here. Should I first split the original dataset into 5 folds and then refit the models on the different training sets and evaluate them on the held-out sets? How do I compute the final RMSE for each model? And what does it mean that I should use the ordinary least squares estimates for the models obtained through the Lasso?

Implement density function

I am going through my book, and it states: "Write a sampling algorithm for this density function"
y = x^2 + (2/3)x + 1/3, 0 < x < 1
Or can I use Monte Carlo?
Any help would be appreciated!
I'm assuming you mean you want to generate random x values that have the distribution specified by density y(x).
It's often desirable to derive the cumulative distribution function by integrating the density and to use inverse transform sampling to generate x values. In your case the CDF is a third-order polynomial which doesn't factor to yield a simple cube-root solution, so you would have to use a numerical solver to find the inverse. Time to consider alternatives.
Another option is to use the acceptance/rejection method. After checking the second derivative, it's clear that your density is convex, so it's easy to create a bounding function b(x) by drawing a straight line from f(0) to f(1). This yields b(x) = 1/3 + 5x/3. The bounding function has area 7/6, while your f(x) has an area of 1, since it is a valid density. Consequently, 6/7 of the points generated uniformly under b(x) will also fall under f(x), and only 1 out of 7 attempts will fail in the rejection scheme. Here's a plot of f(x) and b(x):
Since b(x) is linear, it is easy to generate x values using it as a distribution after scaling by 6/7 to make it a valid distribution function. The algorithm, expressed in pseudocode, then becomes:
function generate():
    while TRUE:
        x <- (sqrt(1 + 35 * U(0,1)) - 1) / 5   # inverse CDF transform of b(x)
        if U(0, b(x)) <= f(x):
            return x
    end while
end function
where U(a,b) means generate a value uniformly distributed between a and b, f(x) is your density, and b(x) is the bounding function described above.
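The inverse-CDF expression in the first line of the loop comes from normalizing b(x) by its area 7/6 and inverting the resulting CDF; written out in LaTeX:

\tilde{b}(x) = \frac{6}{7}\left(\frac{1}{3} + \frac{5x}{3}\right) = \frac{2 + 10x}{7},
\qquad
B(x) = \int_0^x \tilde{b}(t)\, dt = \frac{2x + 5x^2}{7}.

Setting B(x) = u with u \sim U(0,1) and solving 5x^2 + 2x - 7u = 0 for the root in (0, 1) gives

x = \frac{\sqrt{1 + 35u} - 1}{5},

which is exactly the expression used above.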
I implemented the algorithm described above to generate 100,000 candidate values, of which 14,199 (~1/7) were rejected, as expected. The end results are presented in the following histogram, which you can compare to f(x) in the plot above.
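For reference, here is a direct Python translation of the pseudocode above, with f(x), b(x) and the 6/7 acceptance rate exactly as described (numpy only):

import numpy as np

def f(x):
    return x**2 + (2.0 / 3.0) * x + 1.0 / 3.0   # target density on (0, 1)

def b(x):
    return 1.0 / 3.0 + 5.0 * x / 3.0            # linear bounding function

def generate(rng=np.random.default_rng()):
    while True:
        # inverse-CDF sample from b(x) rescaled to a valid density
        x = (np.sqrt(1.0 + 35.0 * rng.random()) - 1.0) / 5.0
        # accept x with probability f(x) / b(x)
        if rng.uniform(0.0, b(x)) <= f(x):
            return x

samples = [generate() for _ in range(100_000)]   # e.g. for a histogram check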
I'm assuming that you have a function y(x), which takes a value in [0, 1] and returns the value of y. You just need to provide a random value of x and return the corresponding value of y.
import numpy

def getSample():
    # get a uniform random number
    x = numpy.random.random()
    # sample my custom function
    return y(x)

The interpretation of cross validation scores

I am trying to understand this part of code that I found on the Internet:
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=7, random_state=seed)
results = cross_val_score(estimator, x, y, cv=kfold)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
What does cross_val_score do?
I know that it calculates scores, but I want to understand what these scores mean and how they are evaluated.
Here's how cross_val_score works:

1. As seen in the source code of cross_val_score, the x you supplied is split into X_train and X_test using cv=kfold. The same goes for y.
2. X_test is held back, and X_train and y_train are passed to the estimator's fit().
3. After fitting, the estimator is scored using X_test and y_test.

Steps 1 to 3 are repeated for each fold specified in kfold, and the array of per-fold scores is returned from cross_val_score.
Explanation of step 3: scoring depends on the estimator and the scoring param of cross_val_score. In your code you have not passed anything to scoring, so the default estimator.score() is used.
If the estimator is a classifier, estimator.score(X_test, y_test) returns accuracy; if it is a regressor, it returns R-squared.
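Since the print statement above labels the result "MSE" while the default score of a regressor is R-squared, here is a small self-contained sketch (using a toy regressor and names of my own choosing) showing how to request MSE explicitly via the scoring parameter:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
estimator = LinearRegression()
kfold = KFold(n_splits=7, shuffle=True, random_state=0)

# default: estimator.score(), i.e. R-squared for a regressor
r2_scores = cross_val_score(estimator, X, y, cv=kfold)

# explicit scorer: negative MSE (sklearn negates losses so that higher is always better)
neg_mse = cross_val_score(estimator, X, y, cv=kfold, scoring="neg_mean_squared_error")

print("R^2: %.2f (%.2f)" % (r2_scores.mean(), r2_scores.std()))
print("MSE: %.2f (%.2f)" % (-neg_mse.mean(), neg_mse.std()))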

Latent factor recovery with probabilistic matrix factorization using Edward

I implemented a probabilistic matrix factorization model (R = U'V) following the example in Edward's repo:
import numpy as np
import tensorflow as tf
from edward.models import Normal

N, M, D = 50, 60, 3  # example sizes; not in the original snippet

# data
U_true = np.random.randn(D, N)
V_true = np.random.randn(D, M)
R_true = np.dot(np.transpose(U_true), V_true) + np.random.normal(0, 0.1, size=(N, M))

# model
I = tf.placeholder(tf.float32, [N, M])
U = Normal(loc=tf.zeros([D, N]), scale=tf.ones([D, N]))
V = Normal(loc=tf.zeros([D, M]), scale=tf.ones([D, M]))
R = Normal(loc=tf.matmul(tf.transpose(U), V), scale=tf.ones([N, M]))
I get good performance when predicting the entries of R. However, when I evaluate the inferred latent traits in U and V, the error varies a lot and can get very high.
I tried a latent space of small dimension (e.g. 2) and checked whether the latent traits were simply permuted. They sometimes are, but even after realigning them the error is still significant.
To throw some numbers: for a synthetic R matrix generated from U and V both normally distributed (mean 0 and variance 1), I can achieve a mean absolute error of 0.003 on R, but on U and V it's usually around 0.5.
I know this model is symmetric, but I am not sure about the implications. I would like to ask:
Is it actually possible to guarantee the recovery of the original latent traits in some way?
If so, how could it be achieved, preferably using Edward?
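To make the symmetry concrete, here is a small numpy check (independent of Edward, with arbitrary example sizes) showing that U and V are only identified up to an invertible D x D transform, which is why elementwise errors on U and V can stay large even when R is reconstructed almost perfectly:

import numpy as np

rng = np.random.default_rng(0)
D, N, M = 2, 5, 6
U = rng.standard_normal((D, N))
V = rng.standard_normal((D, M))

A = rng.standard_normal((D, D))          # any invertible D x D matrix
U2 = A @ U                               # transformed factors
V2 = np.linalg.inv(A).T @ V

# Same product R = U'V, yet the factors themselves differ arbitrarily.
print(np.allclose(U.T @ V, U2.T @ V2))   # True
print(np.abs(U - U2).mean())             # typically far from zero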

Analysing the result of LSTM Theano Sentiment Analysis

I'm trying the code from this link http://deeplearning.net/tutorial/lstm.html but replacing the imdb data with my own. This is a screenshot of my result.
I want to determine the overall accuracy of running LSTM for sentiment analysis, but I cannot understand the output. The train, valid and test values are printed multiple times, but they are usually the same values.
Any help would be much appreciated.
The value it prints is computed by the following function:
def pred_error(f_pred, prepare_data, data, iterator, verbose=False):
    """
    Just compute the error
    f_pred: Theano fct computing the prediction
    prepare_data: usual prepare_data for that dataset.
    """
    valid_err = 0
    for _, valid_index in iterator:
        x, mask, y = prepare_data([data[0][t] for t in valid_index],
                                  numpy.array(data[1])[valid_index],
                                  maxlen=None)
        preds = f_pred(x, mask)
        targets = numpy.array(data[1])[valid_index]
        valid_err += (preds == targets).sum()
    valid_err = 1. - numpy_floatX(valid_err) / len(data[0])
    return valid_err
It is easy to follow, and what it computes is 1 - accuracy, where accuracy is the percentage of samples labeled correctly. In other words, you get around 72% accuracy on the training set, almost 95% accuracy on the validation set, and 50% accuracy on the test set.
The fact that your validation accuracy is so much higher than the training accuracy is a little suspicious; I would trace the predictions and check whether your validation set is somehow unrepresentative or too small.
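As a quick sanity check, the printed error values can be turned into accuracies like this (the three numbers below are placeholders for whatever your run prints):

# placeholder error values as printed by the tutorial script
train_err, valid_err, test_err = 0.28, 0.05, 0.50

for name, err in [("train", train_err), ("valid", valid_err), ("test", test_err)]:
    print("%s accuracy: %.1f%%" % (name, 100.0 * (1.0 - err)))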
