ggpredict: confidence intervals for negative binomial models

I used the following code to model count data:
ModActi <- glmmTMB(Median ~ H_veg + D_veg + Landscape + JulianDay +
                     H_veg:D_veg + (1 | Site),
                   data = MyDataActi, family = nbinom2)
I then used the ggpredict function of the ggeffects package to plot the predicted values of my model for the categorical variable "Landscape":
pr1 <- ggpredict(ModActi, "Landscape")
plot(pr1)
I obtain this graph.
As you can see, the lower confidence limits are negative, as if the function calculated them for a normal distribution.
From the help page of ggpredict it is not clear to me whether there is a way to calculate confidence intervals for a negative binomial distribution (as specified in the model).
EDIT: if I use glmer with a Poisson family, the confidence intervals are correct.

My supervisor found a nice solution by recalculating the standard errors in the prediction table:
pr1 <- ggpredict(ModActi, "Landscape")
Ynontransform <- log(pr1$predicted)                   # predictions back on the log (link) scale
SEnontransform <- log(pr1$conf.high) - Ynontransform  # approximate SE on the link scale
ConfLow <- exp(Ynontransform - SEnontransform)        # back-transformed lower limit (always positive)
pr1$conf.low <- ConfLow
plot(pr1)

This was because glmmTMB only returned predictions on the response scale, and these were not back-transformed. glmmTMB has since been updated on CRAN and I have also revised ggeffects. You can try out the current development version at https://github.com/strengejacke/ggeffects, which now properly computes the CIs (after updating glmmTMB to version 0.2.1).

Related

How do I add noise/variability to a dataset in Python, given the CV?

Given a dataset of blood results, say cholesterol level, and knowing that the instrument that produced those results is subject to a known degree of variability, how would I add that variability back into the dataset? i.e. I want to assume the result in the original dataset is the true/mean value, and then produce new results that are subject to the known variability of the instrument.
In Excel you use =NORM.INV(RAND(), mean, std_dev), where RAND() provides a random value between 0 and 1, "mean" is the original value, and since I have the CV I can calculate the SD. NORM.INV then returns the inverse of the cumulative normal distribution function.
I've done the following to create a new column with my new values, but would like to know if it is valid (i.e., will each row get a different random number between 0 and 1 as the probability, and is this formula equivalent to NORM.INV?):
df8000['HDL_1'] = norm.ppf(random(), loc = df8000['HDL_0'], scale = TAE_df.loc[0,'HDL'])
Thanks in advance!
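One point to check: if random() returns a single float, every row shares the same quantile, so all rows are shifted in the same direction. Below is a minimal sketch of a per-row equivalent of the Excel formula; the column names, CV value, and per-row SD are illustrative assumptions, not taken from the actual data.

import numpy as np
import pandas as pd
from scipy.stats import norm

# Hypothetical data standing in for df8000 / TAE_df
df8000 = pd.DataFrame({'HDL_0': [1.2, 1.5, 0.9, 1.1]})
cv = 0.04                          # assumed coefficient of variation of the instrument
sd = df8000['HDL_0'] * cv          # per-row SD derived from the CV (a scalar SD works the same way)

# One independent random quantile per row -- the equivalent of applying
# =NORM.INV(RAND(), mean, sd) to each row in Excel:
u = np.random.random(len(df8000))
df8000['HDL_1'] = norm.ppf(u, loc=df8000['HDL_0'], scale=sd)

# Or sample from the normal distribution directly:
df8000['HDL_2'] = norm.rvs(loc=df8000['HDL_0'], scale=sd)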

Curve fit does not return expected result

I need a little help with my code for curve fitting some data.
I have the following data:
'''
x_data=[0.0, 0.006702200711821348, 0.012673613376102217, 0.01805805116486128, 0.02296065262674275, 0.027460615301376282,
0.03161908492177514, 0.03548425629114566, 0.03909479074665314, 0.06168416627459879, 0.06395092768264225,
0.0952415360565632, 0.0964823380829502, 0.11590819258911032, 0.11676250975220677, 0.18973251809768016,
0.1899603458289615, 0.2585011532435637, 0.2586068948029052, 0.40046782450999047, 0.40067753715444315]
y_data=[0.005278154532534359, 0.004670803439961002, 0.004188802888597246, 0.003796976494876385, 0.003472183813732432,
0.0031985782141146, 0.002964943046115825, 0.0027631157936632137, 0.0025870148284089897, 0.001713418196416643,
0.0016440241050665323, 0.0009291243501697267, 0.0009083385934116964, 0.0006374601714823219, 0.0006276132323039056,
0.00016900738921547616, 0.00016834735819595378, 7.829234957755694e-05, 7.828353274888779e-05, 0.00015519569743801753,
0.00015533437619227267]
'''
I know that the data can be fitted using the following mathematical model:
'''
def model(x, a, b, c):
    return (a*b) / (b*x + 1) + 3*c*x**2
'''
I am trying to obtain the calibrated a, b, c coefficients of the model, so that I get the following result (the calibrated model in red and the data sample in blue):
My code to achieve the result shown in the previous picture is:
'''
import numpy as np
from scipy.optimize import curve_fit
popt, _pcov = curve_fit(model, x_data, y_data,maxfev = 100000)
x_sample=np.linspace(0,0.5,1000)
y_sample=model(x_sample,*popt)
'''
If I plot the predicted data based on the fitted coefficients (in green) I get this result:
For some reason I get coefficients that produce a result I know is wrong. Does anyone know how to solve this issue?
Your model y = (a*b)/(b*x+1) + 3*c*x**2 does not appear very satisfying. Given the shape of the data, an exponential term seems better suited than the hyperbolic term. That is why the proposed model is:
y = A * exp(B * x) + C * x**2
The method used to compute approximate values of the parameters A, B, C, together with the details of the numerical calculus, is shown in the accompanying figures.
Note:
The parabolic term appears under-represented. This is because there are not enough points at large x compared with the many points at small x.
The method used above is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales. The method is not iterative and does not need initial "guessed" values. The accuracy is not good when there are only a few points, because of the numerical integration (the calculation of the Sk).
If necessary, this can be improved by a post-treatment with non-linear regression, starting from the above approximate values of the parameters, as sketched below.
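As a rough sketch of that post-treatment step with scipy.optimize.curve_fit, reusing x_data and y_data from the question; the starting values A0, B0, C0 below are placeholders standing in for the approximations from the integral method, not actual results:

import numpy as np
from scipy.optimize import curve_fit

def model_exp(x, A, B, C):
    # proposed model: y = A*exp(B*x) + C*x**2
    return A * np.exp(B * x) + C * x**2

# Placeholder starting values (substitute the approximations from the integral method)
A0, B0, C0 = 5e-3, -16.0, 1e-3

popt, pcov = curve_fit(model_exp, np.array(x_data), np.array(y_data), p0=[A0, B0, C0])
A, B, C = popt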
An even better model is made of two exponentials:

Tuning max_depth in Random Forest using CARET

I'm building a random forest with the caret package in R with method = "rf". Every random forest method in caret seems to tune only mtry, the number of features randomly selected for each tree. I do not understand why the max_depth of each tree is not a tunable parameter (as it is for CART). In my mind, it is a parameter that can limit over-fitting.
For example, my random forest performs much better on the training data than on the test data:
model <- train(
  group ~ ., data = train.data, method = "rf",
  trControl = trainControl("repeatedcv", number = 5, repeats = 10),
  tuneLength = 5
)
> postResample(fitted(model),train.data$group)
Accuracy Kappa
0.9574592 0.9745841
> postResample(predict(model,test.data),test.data$group)
Accuracy Kappa
0.7333333 0.5428571
As you can see, my model is clearly over-fitted. I have tried a lot of different things to handle this, but nothing worked; I always get around 0.7 accuracy on the test data and 0.95 on the training data. This is why I want to tune other parameters.
I cannot share my data to reproduce this.

Using tf.metrics.mean_iou during training

I want to train a model using the TensorFlow Estimator API and track multiple metrics during training and evaluation. The metrics I want to track are accuracy and mean intersection-over-union (and my loss).
I managed to figure out how to track the accuracy during training:
if mode == tf.estimator.ModeKeys.TRAIN:
    ...
    accuracy = tf.metrics.accuracy(labels=indices_ground_truth, predictions=indices_prediction, name='acc_op')
    tf.summary.scalar('accuracy', accuracy[1])
and evaluation:
if mode == tf.estimator.ModeKeys.EVAL:
    ...
    accuracy = tf.metrics.accuracy(labels=indices_ground_truth, predictions=indices_prediction)
    eval_metric_ops = {'accuracy': accuracy}
    return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=eval_metric_ops)
For evaluation, the mean intersection-over-union works the same way, so it is actually:
if mode == tf.estimator.ModeKeys.EVAL:
    ...
    miou = tf.metrics.mean_iou(labels=indices_ground_truth, predictions=indices_prediction, num_classes=13)
    accuracy = tf.metrics.accuracy(labels=indices_ground_truth, predictions=indices_prediction)
    eval_metric_ops = {'miou': miou,
                       'accuracy': accuracy}
    return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=eval_metric_ops)
As far as I know, I have to track the update operation (the second return value) rather than the value itself during training; otherwise it returns 0 every time. For a single value like the accuracy, that works.
But for the mIoU, the second return value is the update operation of the confusion matrix used to calculate the mIoU. That is a [numClass, numClass] tensor. If I try to track it like the accuracy with tf.summary.scalar('miou', miou[1]), it crashes because a [numClass, numClass] tensor is not a scalar.
tf.summary.scalar('miou', miou[0]) gives me 0s every time.
So how can I get the mIoU into the summary?
Here is how I calculate the IoU while training:
# Graph construction:
mIoU, update_op = tf.contrib.metrics.streaming_mean_iou(predict, raw_gt, num_classes=2, weights=None)
tf.summary.scalar('meanIoU', mIoU)

# Inside the training loop: run the update op alongside the train op,
# then evaluate the running mean IoU.
confusion_matrix, _ = sess.run([update_op, train_op], feed_dict=feed_dict)
iou = sess.run(mIoU)
print('iou score = {:.3f}, ({:.3f} sec/step)'.format(iou, duration))
You don't need to track the confusion-matrix output to see the IoU on TensorBoard. The above works fine for me. I think what you are missing is running the tensors in your session: you need to run the update op, e.g. sess.run(update_op), and then evaluate the metric with sess.run(mIoU).
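If you are inside an Estimator model_fn rather than a hand-rolled session loop, a minimal sketch of the same idea (TF 1.x, reusing the tensor names from the question; the optimizer is just an assumption) is to force the confusion-matrix update op to run with every training step and log the scalar part of the metric:

if mode == tf.estimator.ModeKeys.TRAIN:
    miou, miou_update_op = tf.metrics.mean_iou(
        labels=indices_ground_truth, predictions=indices_prediction,
        num_classes=13, name='miou_op')

    optimizer = tf.train.AdamOptimizer()  # assumed optimizer
    # Running the update op as a dependency of the train op keeps the
    # confusion matrix accumulating; otherwise miou stays at 0.
    with tf.control_dependencies([miou_update_op]):
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

    # miou (the first return value) is a scalar, so it can go to the summary.
    tf.summary.scalar('miou', miou)
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)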

EasyPredictModelWrapper giving wrong prediction

public BinomialModelPrediction predictBinomial(RowData data) throws PredictException {
    double[] preds = this.preamble(ModelCategory.Binomial, data);
    BinomialModelPrediction p = new BinomialModelPrediction();
    double d = preds[0];
    p.labelIndex = (int) d;
    String[] domainValues = this.m.getDomainValues(this.m.getResponseIdx());
    p.label = domainValues[p.labelIndex];
    p.classProbabilities = new double[this.m.getNumResponseClasses()];
    System.arraycopy(preds, 1, p.classProbabilities, 0, p.classProbabilities.length);
    if (this.m.calibrateClassProbabilities(preds)) {
        p.calibratedClassProbabilities = new double[this.m.getNumResponseClasses()];
        System.arraycopy(preds, 1, p.calibratedClassProbabilities, 0, p.calibratedClassProbabilities.length);
    }
    return p;
}
E.g.: classProbabilities = [0.82333, 0.276666]
labelIndex = 1
label = true
domainValues = [false, true]
What does this labelIndex signify, and is the order of classProbabilities the same as the order of domainValues? If the order is the same, then the probability of false is 0.82333 and the probability of true is 0.276666, so why is labelIndex 1 and the label true?
Please help me figure out this issue.
Like Tom commented, the prediction is not "wrong". You can infer from this that the threshold H2O has chosen is less than 0.27666. You probably have imbalanced training data; otherwise H2O would not have picked such a low threshold for classifying a predicted value of 0.27666 as a 1. Does your training set include fewer examples of the positive class than of the negative class?
If you don't like that threshold for whatever reason, then you can manually create your own. Just make sure you know how to properly evaluate the effect of using different thresholds on the performance of your model, otherwise I'd recommend just using the default threshold.
The name, "classProbabilities" is a misnomer. These are not actual probabilities, they are predicted values, though people often use the terms interchangeably. Binary classification algorithms produce "predicted values" that look like probabilities when they're between 0 and 1, but unless a calibration process is performed, they are not going to represent the probabilities. Calibration is not necessarily a straight-forward process and there are many techniques. Here's some more info about calibration methods for imbalanced data. In H2O, you can perform calibration using Platt scaling using the calibrate_model option. But this is probably not really necessary to what you're trying to do.
The proper way to use the raw output from a binary classification model is to only look at the predicted value for the positive class (you can simply ignore the predicted value for the negative class). Then you choose a threshold which suits your needs, or you can use the default threshold in H2O, which is chosen to maximize the F1 score. Some other software will use a hardcoded threshold of 0.5, but that will be a terrible choice if you don't have an even number of positive and negative examples in your training data. If you have only a few positive examples in your training data, then the best threshold will be something much lower than 0.5.
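To make the thresholding idea concrete, here is a tiny illustration (the logic is language-agnostic, shown here in Python; the 0.5 threshold is just an assumed example, and the scores are the ones from the question):

# Apply your own threshold to the positive-class predicted value
# instead of relying on the model's default F1-maximizing threshold.
class_probabilities = [0.82333, 0.27666]   # values from the question
positive_score = class_probabilities[1]    # score for domainValues[1] == "true"
my_threshold = 0.5                         # assumed custom threshold
label = "true" if positive_score >= my_threshold else "false"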
