h2o - Tweedie response link function

I'm using h2o.gbm and specifying a Tweedie distribution. The response should be logged, and I think that by specifying a Tweedie distribution, h2o will log the response, given the following from the documentation:
When specifying the distribution, the loss function is automatically
selected as well. For exponential families (such as Poisson, Gamma, and
Tweedie), the canonical logarithmic link function is used.
However, Tweedie distributions have a point mass at 0. So if h2o is logging the response, is it actually taking the log of response values that are 0, or is there some other transformation, such as:
data[,"new_response"] <- h2o.if_else(data$response == 0, 0, log(data$response))

When the response is 0, the value is set to 0.1 to prevent taking the log of 0. You can find the line of code where this takes place here.
double y1 = yr == 0?.1:yr;
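For illustration only, here is a minimal numpy sketch (not H2O code; the response values are made up) of the same substitution applied outside of H2O, mirroring the source line above:
import numpy as np
# hypothetical response vector; mirrors H2O's internal: double y1 = yr == 0 ? .1 : yr;
y = np.array([0.0, 2.5, 0.0, 7.1])
y_adjusted = np.where(y == 0, 0.1, y)  # replace exact zeros with 0.1
log_y = np.log(y_adjusted)             # now safe to take the log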

Related

Copying embeddings for gensim word2vec

I wanted to see if I can simply set new weights for gensim's Word2Vec without training. I got the 20 Newsgroups data set from scikit-learn (from sklearn.datasets import fetch_20newsgroups) and trained an instance of Word2Vec on it:
model_w2v = models.Word2Vec(sg = 1, size=300)
model_w2v.build_vocab(all_tokens)
model_w2v.train(all_tokens, total_examples=model_w2v.corpus_count, epochs = 30)
Here all_tokens is the tokenized data set.
Then I created a new instance of Word2Vec without training
model_w2v_new = models.Word2Vec(sg = 1, size=300)
model_w2v_new.build_vocab(all_tokens)
and set the embeddings of the new Word2Vec equal to the first one
model_w2v_new.wv.vectors = model_w2v.wv.vectors
Most of the functions work as expected, e.g.
model_w2v.wv.similarity( w1='religion', w2 = 'religions')
> 0.4796233
model_w2v_new.wv.similarity( w1='religion', w2 = 'religions')
> 0.4796233
and
model_w2v.wv.words_closer_than(w1='religion', w2 = 'judaism')
> ['religions']
model_w2v_new.wv.words_closer_than(w1='religion', w2 = 'judaism')
> ['religions']
and
entities_list = list(model_w2v.wv.vocab.keys()).remove('religion')
model_w2v.wv.most_similar_to_given(entity1='religion',entities_list = entities_list)
> 'religions'
model_w2v_new.wv.most_similar_to_given(entity1='religion',entities_list = entities_list)
> 'religions'
However, most_similar doesn't work:
model_w2v.wv.most_similar(positive=['religion'], topn=3)
> [('religions', 0.4796232581138611),
> ('judaism', 0.4426296651363373),
> ('theists', 0.43141329288482666)]
model_w2v_new.wv.most_similar(positive=['religion'], topn=3)
> [('roderick', 0.22643062472343445),
> ('nci', 0.21744996309280396),
> ('soviet', 0.20012077689170837)]
What am I missing?
Disclaimer: I posted this question on datascience.stackexchange but got no response, so I'm hoping to have better luck here.
Generally, your approach should work.
The specific problem you're encountering was likely caused by an extra probing step that isn't shown in your code because you had no reason to think it significant: some sort of most_similar()-like operation on model_w2v_new after its build_vocab() call but before the later, malfunctioning operations.
Traditionally, most_similar() calculations operate on a version of the vectors that has been normalized to unit length. The first time these unit-normed vectors are needed, they're calculated and then cached inside the model. So if you then replace the raw vectors with other values but don't discard those cached values, you'll see results like you're reporting: essentially random, reflecting the randomly-initialized-but-never-trained starting vectors.
If this is what happened, just discarding the cached values should cause the next most_similar() to refresh them properly, and then you should get the results you expect:
model_w2v_new.wv.vectors_norm = None
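Putting the pieces together, a minimal sketch (assuming gensim 3.x, where wv.vectors and vectors_norm exist, and reusing all_tokens and model_w2v from the question) might look like this:
from gensim import models
# build a fresh, untrained model over the same corpus
model_w2v_new = models.Word2Vec(sg=1, size=300)
model_w2v_new.build_vocab(all_tokens)
# copy the trained vectors across
model_w2v_new.wv.vectors = model_w2v.wv.vectors
# discard any cached unit-normed vectors so most_similar() recomputes them
model_w2v_new.wv.vectors_norm = None
print(model_w2v_new.wv.most_similar(positive=['religion'], topn=3))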

Using tf.metrics.mean_iou during training

I want to train a model using the TensorFlow Estimator API and want to track multiple metrics during training and evaluation. The metrics I want to track are accuracy and mean intersection-over-union (and my loss).
I managed to figure out how to track the accuracy during training:
if mode == tf.estimator.ModeKeys.TRAIN:
    ...
    accuracy = tf.metrics.accuracy(labels=indices_ground_truth, predictions=indices_prediction, name='acc_op')
    tf.summary.scalar('accuracy', accuracy[1])
and evaluation:
if mode == tf.estimator.ModeKeys.EVAL:
    ...
    accuracy = tf.metrics.accuracy(labels=indices_ground_truth, predictions=indices_prediction)
    eval_metric_ops = {'accuracy': accuracy}
    return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=eval_metric_ops)
For evaluation, the mean intersection-over-union works the same way, so it's actually:
if mode == tf.estimator.ModeKeys.EVAL:
    ...
    miou = tf.metrics.mean_iou(labels=indices_ground_truth, predictions=indices_prediction, num_classes=13)
    accuracy = tf.metrics.accuracy(labels=indices_ground_truth, predictions=indices_prediction)
    eval_metric_ops = {'miou': miou,
                       'accuracy': accuracy}
    return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=eval_metric_ops)
As far as I know, I have to track the update operation (the second return value) during training; otherwise the metric returns 0 every time. For a scalar value like the accuracy, that works.
But for the mIoU, the second return value is the update operation of the confusion matrix used to calculate the mIoU. That's a [numClass, numClass] tensor. If I try to track it like the accuracy with tf.summary.scalar('miou', miou[1]), it crashes because a [numClass, numClass] tensor is not a scalar.
tf.summary.scalar('miou', miou[0]) gives me 0s every time.
So how can I get the mIoU into the summary?
Here is how I calculate the IoU while training:
mIoU, update_op = tf.contrib.metrics.streaming_mean_iou(predict, raw_gt, num_classes=2, weights=None)
tf.summary.scalar('meanIoU', mIoU)
confusion_matrix, _ = sess.run([update_op, train_op], feed_dict=feed_dict)
iou = sess.run(mIoU)
print('iou score = {:.3f}, ({:.3f} sec/step)'.format(iou, duration))
You don't need to track the confusion matrix output to track the IoU on TensorBoard; the above works fine for me. I think what you are missing is running the tensors in your session: you need to run update_op (e.g. sess.run(update_op)) before evaluating the metric with sess.run(mIoU).
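For the Estimator setup in the question, one possible way to wire this up (a sketch under TF 1.x assumptions; the AdamOptimizer and the metric names are placeholders, while mode, indices_ground_truth, indices_prediction and loss come from the question's model_fn) is to attach the metric update ops to the train op and summarize the scalar values:
import tensorflow as tf

if mode == tf.estimator.ModeKeys.TRAIN:
    # first return value = current metric value, second = update op
    accuracy, acc_update = tf.metrics.accuracy(
        labels=indices_ground_truth, predictions=indices_prediction, name='acc_op')
    miou, miou_update = tf.metrics.mean_iou(
        labels=indices_ground_truth, predictions=indices_prediction,
        num_classes=13, name='miou_op')
    # summarize the scalar metric values, not the update ops
    tf.summary.scalar('accuracy', accuracy)
    tf.summary.scalar('miou', miou)
    optimizer = tf.train.AdamOptimizer()  # placeholder optimizer
    # run the metric updates alongside every training step so the summarized
    # values are refreshed instead of staying at 0
    with tf.control_dependencies([acc_update, miou_update]):
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)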

ggpredict : confidence interval for negative binomial models

I used the following code to model count data:
ModActi<-glmmTMB(Median ~ H_veg + D_veg + Landscape + JulianDay +
H_veg:D_veg + (1 | Site),
data=MyDataActi, family=nbinom2)
I then used the ggpredict function of the ggeffects package to plot the predicted values of my model for the categorical variable "Landscape":
pr1 <- ggpredict(ModActi, "Landscape")
plot(pr1)
I obtain this graph.
As you can see, the lower confidence intervals are negative, as if the function calculated them for a normal distribution.
From the help for ggpredict, it is not clear to me whether there is a way to calculate confidence intervals for a negative binomial distribution (as specified in the model)?
EDIT: if I use glmer with a Poisson family, the confidence intervals are correct.
My supervisor found a nice solution by recalculating the standard errors in the prediction table:
pr1 <- ggpredict(ModActi, "Landscape")
# move the predictions to the link (log) scale
Ynontransform = log(pr1$predicted)
# recover the standard error on the link scale from the upper CI bound
SEnontransform = log(pr1$conf.high) - Ynontransform
# recompute the lower bound on the link scale and back-transform
ConfLow = exp(Ynontransform - SEnontransform)
pr1$conf.low = ConfLow
plot(pr1)
This was because glmmTMB only returned predictions on the response scale, and these were not back-transformed. glmmTMB has since been updated on CRAN, and I have also revised ggeffects. You can try out the current dev-version at https://github.com/strengejacke/ggeffects, which now properly computes the CI (after updating glmmTMB to version 0.2.1).

EasyPredictModelWrapper giving wrong prediction

public BinomialModelPrediction predictBinomial(RowData data) throws PredictException {
    double[] preds = this.preamble(ModelCategory.Binomial, data);
    BinomialModelPrediction p = new BinomialModelPrediction();
    double d = preds[0];
    p.labelIndex = (int) d;
    String[] domainValues = this.m.getDomainValues(this.m.getResponseIdx());
    p.label = domainValues[p.labelIndex];
    p.classProbabilities = new double[this.m.getNumResponseClasses()];
    System.arraycopy(preds, 1, p.classProbabilities, 0, p.classProbabilities.length);
    if (this.m.calibrateClassProbabilities(preds)) {
        p.calibratedClassProbabilities = new double[this.m.getNumResponseClasses()];
        System.arraycopy(preds, 1, p.calibratedClassProbabilities, 0, p.calibratedClassProbabilities.length);
    }
    return p;
}
E.g.: classProbabilities = [0.82333, 0.276666]
labelIndex = 1
label = true
domainValues = [false,true]
What does this labelIndex signify, and is the order of classProbabilities the same as the order of domainValues? If the order is the same, then the probability of false is 0.82333 and the probability of true is 0.27666, so why is labelIndex showing as 1 and label as true?
Please help me figure out this issue.
Like Tom commented, the prediction is not "wrong". You can infer from this that the threshold H2O has chosen is less than 0.27666. You probably have imbalanced training data; otherwise H2O would not have picked such a low threshold for classifying a predicted value of 0.27666 as a 1. Does your training set include fewer examples of the positive class than the negative class?
If you don't like that threshold for whatever reason, then you can manually create your own. Just make sure you know how to properly evaluate the effect of using different thresholds on the performance of your model, otherwise I'd recommend just using the default threshold.
The name, "classProbabilities" is a misnomer. These are not actual probabilities, they are predicted values, though people often use the terms interchangeably. Binary classification algorithms produce "predicted values" that look like probabilities when they're between 0 and 1, but unless a calibration process is performed, they are not going to represent the probabilities. Calibration is not necessarily a straight-forward process and there are many techniques. Here's some more info about calibration methods for imbalanced data. In H2O, you can perform calibration using Platt scaling using the calibrate_model option. But this is probably not really necessary to what you're trying to do.
The proper way to use the raw output from a binary classification model is to only look at the predicted value for the positive class (you can simply ignore the predicted value for the negative class). Then you choose a threshold which suits your needs, or you can use the default threshold in H2O, which is chosen to maximize the F1 score. Some other software will use a hardcoded threshold of 0.5, but that will be a terrible choice if you don't have an even number of positive and negative examples in your training data. If you have only a few positive examples in your training data, then the best threshold will be something much lower than 0.5.
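For illustration only (plain Python, not the H2O wrapper API; the numbers and the threshold are placeholders), applying your own threshold to the positive-class value might look like:
# illustrative values, in the same order as domainValues = [false, true]
domain_values = ["false", "true"]
class_probabilities = [0.72334, 0.27666]  # placeholder predicted values
threshold = 0.5                           # your chosen cutoff (placeholder)
p_positive = class_probabilities[1]       # only the positive-class value matters
label = domain_values[1] if p_positive >= threshold else domain_values[0]
print(label)                              # prints 'false' with this threshold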

Matlab: image region analyzer. Alternative for 'bwpropfilt'?

I'm running basic edge detection to detect window regions, based on this: http://www.mathworks.com/videos/edge-detection-with-matlab-119353.html
The edge detection works successfully:
final_edge = edge(gray_I,'sobel');
BW_out = bwareaopen(imfill(final_edge,'holes'),20);
figure;
imshow(BW_out);
Now, when I come to the following code to filter the image based on region properties, it seems like my MATLAB R2013a can't find the bwpropfilt function.
% imageRegionAnalyzer(BW);
% Filter image based on image properties
BW_out = bwpropfilt(BW_out,'Area', [400, 467]);
BW_out = bwpropfilt(BW_out,'Solidity',[0.5, 1]);
It says:
Undefined function 'bwpropfilt' for input arguments of type 'char'.
So what alternative can I use in place of bwpropfilt?
bwpropfilt simply looks at the corresponding attribute output by regionprops and keeps the objects that fall within a certain range, filtering out those outside of it. You can rewrite the algorithm by explicitly calling regionprops and creating a logical array to index into the structure, retaining only the regions whose values fall within the desired range (the third input of bwpropfilt) for the property you want to examine (the second input of bwpropfilt). If you want to reconstruct the image after filtering, take the column-major linear indices found in the PixelIdxList attribute, stack them all into a single vector, and write to a new output image by setting all of those locations to true.
Specifically, you can use the following code to reproduce the last two lines of code you have shown:
% Run regionprops and get all properties
s = regionprops(BW_out, 'all');
%%% For the first line of code
values = [s.Area];
s = s(values > 400 & values < 467);
%%% For the second line of code
values = [s.Solidity];
s = s(values > 0.5 & values < 1);
% Stack column major indices
ind = vertcat(s.PixelIdxList);
% Create output image
final_out = false(size(BW_out));
final_out(ind) = true;
final_out contains the filtered image only retaining the values within the range specified by the desired property.
Caution
The above logic only works for attributes returned from regionprops that contain only a single scalar value per unique region. If you examine the supported properties found in bwpropfilt, you will see that this list is a subset of the full list found in regionprops. This makes sense as certain regionprops properties return a vector or a matrix depending on what you choose so using a range to filter out properties becomes ambiguous if you have multiple values that characterize a particular unique region returned by regionprops.
Minor Note
Being curious, I opened up bwpropfilt to see how it is implemented as I currently have MATLAB R2016a. The above logic, with the exception of some exception handling, is essentially how bwpropfilt has been implemented so the code that I wrote is in line with the logic of the function.
