H2O isolation forest -- scores greater than 1 - h2o

Isolation Forest in H2O (3.30.0.1, R 3.6.1) computed scores greater than 1 when model applied to the test set. Here is the code to reproduce scores greater than 1. Looks like h2o is not using the normalization used in the original paper [https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest] which is score=2^(-mean length/c(n)), c(n) is alway positive for n>0, so the scores should be always less than 1.
Other implementations of isolation forest produce scores less than 1 for the same data set.
Download train and test data files.
library(data.table)
library(h2o)
h2o.init()
#import data
train<-h2o.importFile('train.csv')
test<-h2o.importFile('test.csv')
#Train model
model <- h2o.isolationForest(training_frame = train)
# Calculate score
scores <- h2o.predict(model,test)
max(scores[,1])

If your question is why the predicted value is greater than 1 for some rows in the test set, then this is due to the test values having shorter mean_lengths. I.e. fewer than average splits are needed to isolate them compared to training data. Remember that isolation forest uses an ensemble of trees (not just one). So if you have records more unique than training (anomalies), then your predicted value can be greater than 1 (or mean_length is shorter than normal).
You can see this by looking at the rows in your test data that have predict larger than 1:
scores[scores[,1] > 1, ]
predict mean_length
1 1.232558 3.82
2 1.023256 4.36
3 1.069767 4.24
4 1.286822 3.68
Also, you can see that your training data had an mean_length averaged for all rows was 4.42 (which is larger than ones from your test data, mean_lengths above)
scores_train <- h2o.predict(model,train)
mean(scores_train[,'mean_length'])
4.42
Check out this post to learn more about interpreting isolation forests.

Neema pointed out that h2o is using min/max path length to normalize which makes scores great than 1 possible.
[https://support.h2o.ai/support/tickets/97280]

Related

Distribution of the Training Data vs Distribution of the Test/Prediction

Does the Distribution represented by the training data need to reflect the distribution of the test data and the data that you predict on? Can I measure the quality of the training data by looking at the distribution of each feature and compare that distribution to the data I am predicting or testing with? Ideally the training data should be sufficiently representative of the real world distribution.
Short answer: similar ranges would be a good idea.
Long answer: sometimes it won't be an issue (rarely) but let's examine when.
In an ideal situation, your model will capture the true phenomenon perfectly. Imagine the simplest case: the linear model y = x. If the training data are noiseless (or have tolerable noise). Your linear regression will naturally land on a model approximately equal to y = x. The generalization of the model will work nearly perfect even outside of the training range. If your train data were {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10}. The test point 500, will nicely map onto the function, returning 500.
In most modeling scenarios, this will almost certainly not be the case. If the training data are ample and the model is appropriately complex (and no more), you're golden.
The trouble is that few functions (and corresponding natural phenomena) -- especially when we consider nonlinear functions -- extend to data outside of the training range so cleanly. Imagine sampling office temperature against employee comfort. If you only look at temperatures from 40 deg to 60 deg. A linear function will behave brilliantly in the training data. Oddly enough, if you test on 60 to 80, the mapping will break down. Here, the issue is confidence in your claim that the data are sufficiently representative.
Now let's consider noise. Imagine that you know EXACTLY what the real world function is: a sine wave. Better still, you are told its amplitude and phase. What you don't know is its frequency. You have a really solid sampling between 1 and 100, the function you fit maps against the training data really well. Now if there is just enough noise, you might estimate the frequency incorrectly by a hair. When you test near the training range, the results aren't so bad. Outside of the training range, things start to get wonky. As you move further and further from the training range, the real function and the function diverge and converge based on their relative frequencies. Sometimes, the residuals are seemingly fine; sometimes they are dreadful.
There is an issue with your idea of examining the variable distributions: interaction between variables. Even if each variable is appropriately balanced in train and test, it is possible that the relationships between variables will differ (joint distributions). For a purely contrived example, consider you were predicting an individual's likelihood of being pregnant at any given time. In your training set, you had women aged 20 to 30 and men aged 30 to 40. In testing, you had the same percentage of men and women, but the age ranges were flipped. Independently, the variables look very nicely matched! But in your training set, you could very easily conclude, "only people under 30 get pregnant." Oddly enough, your testing set would demonstrate the exact opposite! The trouble is that your predictions are being made from a multivariate space, but the distributions you are thinking about are univariate. Considering the joint distributions of continuous variables against one another (and considering categorical variables appropriately) is, however, a good idea. Ideally, your fit model should have access to a similar range to your testing data.
Fundamentally, the question is about extrapolation from a limited training space. If the model fit in the training space generalizes, you can generalize; ultimately, it is usually safest to have a really well distributed training set to maximize the likelihood that you have captured the complexity of the underlying function.
Really interesting question! I hope the answer was somewhat insightful; I'll continue to build on it as resources come to mind! Let me know if any questions remain!
EDIT: a point made in the comments that I think should be read by future readers.
Ideally, training data should NEVER influence testing data in ANY way. That includes examining of the distributions, joint distributions etc. With sufficient data, distributions in the training data should converge on distributions in the testing data (think the mean, law of large nums). Manipulation to match distributions (like z-scoring before train/test split) fundamentally skews performance metrics in your favor. An appropriate technique for splitting train and test data would be something like stratified k fold for cross validation.
Sorry for the delayed response. After going through a few months of iterating, I implemented and pushed the following solution to production and it is working quite well.
The issue here boils down to how can one reduce the training/test score variance when performing cross validation. This is important as if your variance is high, the confidence in picking the best model goes down. The more representative the test data is to the train data, the less variance you get in your test scores across the cross validation set. Stratified cross validation tackles this issue especially when there is significant class imbalance, by ensuring that the label class proportions are preserved across all test/train sets. However, this doesnt address the issue with the feature distribution.
In my case, I had a few features that were very strong predictors but also very skewed in their distribution. This caused significant variance in my test scores which made it harder to pick a model with any confidence. Essentially, the solution is to ensure that the joint distribution of the label with the feature set is maintained across test/train sets. Many ways of doing this but a very simple approach is to simply take each column bucket range (if continuous) or label (if categorical) one by one and sample from these buckets when generating the test and train sets. Note that the buckets quickly gets very sparse especially when you have a lot of categorical variables. Also, the column order in which you bucket affects the sampling output greatly. Below is a solution where I bucket the label first (same like stratified CV) and then sample 1 other feature (most important feature (called score_percentage) that is known upfront).
def train_test_folds(self, label_column="label"):
# train_test is an array of tuples where each tuple is a test numpy array and train numpy array pair.
# The final iterator would return these individual elements separately.
n_folds = self.n_folds
label_classes = np.unique(self.label)
train_test = []
fmpd_copy = self.fm.copy()
fmpd_copy[label_column] = self.label
fmpd_copy = fmpd_copy.reset_index(drop=True).reset_index()
fmpd_copy = fmpd_copy.sort_values("score_percentage")
for lbl in label_classes:
fmpd_label = fmpd_copy[fmpd_copy[label_column] == lbl]
# Calculate the fold # using the label specific dataset
if (fmpd_label.shape[0] < n_folds):
raise ValueError("n_folds=%d cannot be greater than the"
" number of rows in each class."
% (fmpd_label.shape[0]))
# let's get some variance -- shuffle within each buck
# let's go through the data set, shuffling items in buckets of size nFolds
s = 0
shuffle_array = fmpd_label["index"].values
maxS = len(shuffle_array)
while s < maxS:
max = min(maxS, s + n_folds) - 1
for i in range(s, max):
j = random.randint(i, max)
if i < j:
tempI = shuffle_array[i]
shuffle_array[i] = shuffle_array[j]
shuffle_array[j] = tempI
s = s + n_folds
# print("shuffle s =",s," max =",max, " maxS=",maxS)
fmpd_label["index"] = shuffle_array
fmpd_label = fmpd_label.reset_index(drop=True).reset_index()
fmpd_label["test_set_number"] = fmpd_label.iloc[:, 0].apply(
lambda x: x % n_folds)
print("label ", lbl)
for n in range(0, n_folds):
test_set = fmpd_label[fmpd_label["test_set_number"]
== n]["index"].values
train_set = fmpd_label[fmpd_label["test_set_number"]
!= n]["index"].values
print("for label ", lbl, " test size is ",
test_set.shape, " train size is ", train_set.shape)
print("len of total size", len(train_test))
if (len(train_test) != n_folds):
# Split doesnt exist. Add it in.
train_test.append([train_set, test_set])
else:
temp_arr = train_test[n]
temp_arr[0] = np.append(temp_arr[0], train_set)
temp_arr[1] = np.append(temp_arr[1], test_set)
train_test[n] = [temp_arr[0], temp_arr[1]]
return train_test
Over time, I realized that this whole issue falls under the umbrella of covariate shift which is a well studied area within machine learning. Link below or just search google for covariate shift. The concept is how to detect and ensure that your prediction data is of similar distribution with your training data. THis is in the feature space but in theory you could have label drift as well.
https://www.analyticsvidhya.com/blog/2017/07/covariate-shift-the-hidden-problem-of-real-world-data-science/

pair-wise aggregation of input lines in Hadoop

A bunch of driving cars produce traces (sequences of ordered positions)
car_id order_id position
car1 0 (x0,y0)
car1 1 (x1,y1)
car1 2 (x2,y2)
car2 0 (x0,y0)
car2 1 (x1,y1)
car2 2 (x2,y2)
car2 3 (x3,y3)
car2 4 (x4,y4)
car3 0 (x0,y0)
I would like to compute the distance (path length) driven by the cars.
At the core, I need to process all records line by line pair-wise. If the
car_id of the previous line is the same as the current one then I need to
compute the distance to the previous position and add it up to the aggregated
value. If the car_id of the previous line is different from the current line
then I need to output the aggregate for the previous car_id, and initialize the
aggregate of the current car_id with zero.
How should the architecture of the hadoop program look like? Is it possible to
archieve the following:
Solution (1):
(a) Every mapper computes the aggregated distance of the trace (per physical
block)
(b) Every mapper aggregates the distances further in case the trace was split
among multiple blocks and nodes
Comment: this solution requires to know whether I am on the last record (line)
of the block. Is this information available at all?
Solution (2)
(a) The mappers read the data line by line (do no computations) and send the
data to the reducer based on the car_id.
(b) The reducers sort the data for individual car_ids based on order_id,
computes the distances, and aggregates them
Comment: high network load due to laziness of mappers
Solution (3)
(a) implement a custom reader to read define a logical record to be the whole
trace of one car
(b) each mapper computes the distances and the aggregate
(c) reducer is not really needed as everything is done by the mapper
Comment: high main memory costs as the whole trace needs to be loaded into main
memory (although only two lines are used at a time).
I would go with Solution (2), since it is the cleanest to implement and reuse.
You certainly want to sort based on car_id AND order_id, so you can compute the distances on the fly without loading them all up into memory.
Your concern about high network usage is valid, however, you can pre-aggregate the distances in a combiner.
How would that look like, let's take some pseudo-code:
Mapper:
foreach record:
emit((car_id, order_id), (x,y))
Combiner:
if(prev_order_id + 1 == order_id): // subsequent measures
// compute distance and emit that as the last possible order
emit ((car_id, MAX_VALUE), distance(prev, cur))
else:
// send to the reducer, since it is probably crossing block boundaries
emit((car_id, order_id), (x,y))
The reducer then has two main parts:
compute the sum over subsequent measures, like the combiner did
sum over all existing sums, tagged with order_id = MAX_VALUE
That's already best-effort what you can get from a network usage POV.
From a software POV, better use Spark- your logic will be five lines instead of 100 across three class files.
For your other question:
this solution requires to know whether I am on the last record (line)
of the block. Is this information available at all?
Hadoop only guarantees that it is not splitting through records when reading, it may very well be that your record is already touching two different blocks underneath. The way to find that out is basically to rewrite your input format to make this information available to your mappers, or even better- take your logic into account when splitting blocks.

WEKA difference between output of J48 and ID3 algorithm

I have a data set which I am classifying in WEKA using J48 and ID3 algorithm. The output of J48 algorithm is:
Correctly Classified Instances 73 92.4051 %
Incorrectly Classified Instances 6 7.5949 %
Kappa statistic 0.8958
Mean absolute error 0.061
Root mean squared error 0.1746
Relative absolute error 16.7504 %
Root relative squared error 40.9571 %
Total Number of Instances 79
and the output using ID3 is:
Correctly Classified Instances 79 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 79
My question is, if J48 is an extension of ID3 and is newer compared to it, how come ID3 is giving a better result than J48?
The J48 model is more accurate in the quality in the process, based in C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on. The result in this case is only reflect of the kind of your data set you used. The ID3 could be implemented when you need more faster/simpler result without taking into account all those additional factors in the J48 consider. Take a Look into pruning decision tree and deriving rule sets HERE
In the web are a lot of resource in the theme about these comparatives results its more important to learn to identified in which case we apply the different classifier once we know how each one work(1)
Decision trees are more likely to face problem of Data over-fitting , In your case ID3 algorithm is facing the issue of data over-fitting. This is the problem of Decision trees ,that it splits the data until it make pure sets. This Problem of Data over-fitting is fixed in it's extension that is J48 by using Pruning.
Another point to cover : You should use K-fold Cross validation for Validating your Model.

Cross Validation in Weka

I've always thought from what I read that cross validation is performed like this:
In k-fold cross-validation, the original sample is randomly
partitioned into k subsamples. Of the k subsamples, a single subsample
is retained as the validation data for testing the model, and the
remaining k − 1 subsamples are used as training data. The
cross-validation process is then repeated k times (the folds), with
each of the k subsamples used exactly once as the validation data. The
k results from the folds then can be averaged (or otherwise combined)
to produce a single estimation
So k models are built and the final one is the average of those.
In Weka guide is written that each model is always built using ALL the data set. So how does cross validation in Weka work ? Is the model built from all data and the "cross-validation" means that k fold are created then each fold is evaluated on it and the final output results is simply the averaged result from folds?
So, here is the scenario again: you have 100 labeled data
Use training set
weka will take 100 labeled data
it will apply an algorithm to build a classifier from these 100 data
it applies that classifier AGAIN on
these 100 data
it provides you with the performance of the
classifier (applied to the same 100 data from which it was
developed)
Use 10 fold CV
Weka takes 100 labeled data
it produces 10 equal sized sets. Each set is divided into two groups: 90 labeled data are used for training and 10 labeled data are used for testing.
it produces a classifier with an algorithm from 90 labeled data and applies that on the 10 testing data for set 1.
It does the same thing for set 2 to 10 and produces 9 more classifiers
it averages the performance of the 10 classifiers produced from 10 equal sized (90 training and 10 testing) sets
Let me know if that answers your question.
I would have answered in a comment but my reputation still doesn't allow me to:
In addition to Rushdi's accepted answer, I want to emphasize that the models which are created for the cross-validation fold sets are all discarded after the performance measurements have been carried out and averaged.
The resulting model is always based on the full training set, regardless of your test options. Since M-T-A was asking for an update to the quoted link, here it is: https://web.archive.org/web/20170519110106/http://list.waikato.ac.nz/pipermail/wekalist/2009-December/046633.html/. It's an answer from one of the WEKA maintainers, pointing out just what I wrote.
I think I figured it out. Take (for example) weka.classifiers.rules.OneR -x 10 -d outmodel.xxx. This does two things:
It creates a model based on the full dataset. This is the model that is written to outmodel.xxx. This model is not used as part of cross-validation.
Then cross-validation is run. cross-validation involves creating (in this case) 10 new models with the training and testing on segments of the data as has been described. The key is the models used in cross-validation are temporary and only used to generate statistics. They are not equivalent to, or used for the model that is given to the user.
Weka follows the conventional k-fold cross validation you mentioned here. You have the full data set, then divide it into k nos of equal sets (k1, k2, ... , k10 for example for 10 fold CV) without overlaps. Then at the first run, take k1 to k9 as training set and develop a model. Use that model on k10 to get the performance. Next comes k1 to k8 and k10 as training set. Develop a model from them and apply it to k9 to get the performance. In this way, use all the folds where each fold at most 1 time is used as test set.
Then Weka averages the performances and presents that on the output pane.
once we've done the 10-cross-validation by dividing data in 10 segments & create Decision tree and evaluate, what Weka does is run the algorithm an eleventh time on the whole dataset. That will then produce a classifier that we might deploy in practice. We use 10-fold cross-validation in order to get an evaluation result and estimate of the error, and then finally we do classification one more time to get an actual classifier to use in practice.
During kth cross validation, we will going to have different Decision tree but final one is created on whole datasets. CV is used to see if we have overfitting or large variance issue.
According to "Data Mining with Weka" at The University of Waikato:
Cross-validation is a way of improving upon repeated holdout.
Cross-validation is a systematic way of doing repeated holdout that actually improves upon it by reducing the variance of the estimate.
We take a training set and we create a classifier
Then we’re looking to evaluate the performance of that classifier, and there’s a certain amount of variance in that evaluation, because it’s all statistical underneath.
We want to keep the variance in the estimate as low as possible.
Cross-validation is a way of reducing the variance, and a variant on cross-validation called “stratified cross-validation” reduces it even further.
(In contrast to the the “repeated holdout” method in which we hold out 10% for the testing and we repeat that 10 times.)
So how does cross validation in Weka work ?:
With cross-validation, we divide our dataset just once, but we divide into k pieces, for example , 10 pieces. Then we take 9 of the pieces and use them for training and the last piece we use for testing. Then with the same division, we take another 9 pieces and use them for training and the held-out piece for testing. We do the whole thing 10 times, using a different segment for testing each time. In other words, we divide the dataset into 10 pieces, and then we hold out each of these pieces in turn for testing, train on the rest, do the testing and average the 10 results.
That would be 10-fold cross-validation. Divide the dataset into 10 parts (these are called “folds”);
hold out each part in turn;
and average the results.
So each data point in the dataset is used once for testing and 9 times for training.
That’s 10-fold cross-validation.

Clustering algorithm to cluster objects based on their relation weight

I have n words and their relatedness weight that gives me a n*n matrix. I'm going to use this for a search algorithm but the problem is I need to cluster the entered keywords based on their pairwise relation. So let's say if the keywords are {tennis,federer,wimbledon,london,police} and we have the following data from our weight matrix:
tennis federer wimbledon london police
tennis 1 0.8 0.6 0.4 0.0
federer 0.8 1 0.65 0.4 0.02
wimbledon 0.6 0.65 1 0.08 0.09
london 0.4 0.4 0.08 1 0.71
police 0.0 0.02 0.09 0.71 1
I need an algorithm to to cluster them into 2 clusters : {tennis,federer,wimbledon} {london,police}. Is there any know clustering algorithm than can deal with such thing ? I did some research, it appears that K-means algorithm is the most well known algorithm being used for clustering but apparently K-means doesn't suit this case.
I would greatly appreciate any help.
You can treat it as a network clustering problem. With a recent version of mcl software (http://micans.org/mcl), you can do this (I've called your example fe.data).
mcxarray -data fe.data -skipr 1 -skipc 1 -write-tab fe.tab -write-data fe.mci -co 0 -tf 'gq(0)' -o fe.cor
# the above computes correlations (put in data file fe.cor) and a network (put in data file fe.mci).
# below proceeds with the network.
mcl fe.mci -I 3 -o - -use-tab fe.tab
# this outputs the clustering you expect. -I is the 'inflation parameter'. The latter affects
# cluster granularity. With the default parameter 2, everything ends up in a single cluster.
Disclaimer: I wrote mcl and a slew of associated network loading/conversion and analysis programs recently rebranded as 'mcl-edge'. They all come together in a single software package. Seeing your example made me curious whether it would be doable with mcl-edge, so I quickly tested it.
Consider DBSCAN. If it suits your needs, you might wish to take a closer look at an optimised version, TI-DBSCAN, which uses triangle inequality for reducing spatial query cost.
DBSCAN's advantages and disadvantages are discussed on Wikipedia. It splits input data to a set of clusters whose cardinality isn't known a priori. You'd have to transform your similarity matrix into a distance matrix, for example by taking 1 - similarity as a distance.
Check this book on Information retrieval
http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html
it explains very well what you want to do
Your weights are higher for more similar words and lower for more different words. A clustering algorithm requires similar points/words to be closer spatially and different words to be distant. You should change the matrix M into 1-M and then use any clustering method you want, including k-means.
If you've got a distance matrix, it seems a shame not to try http://en.wikipedia.org/wiki/Single_linkage_clustering. By hand, I think you get the following clustering:
((federer, tennis), wimbledon) (london, police)
The similarity for the link that joins the two main groups (either tennis-london or federer-london) is smaller than any of the similarities that build the two groups: london-police, tennis-federer, and federer-wimbledon: this characteristic is guaranteed by single linkage clustering, since it binds together closest clusters at each stage, and the two main groups are linked by the last binding found.
DBSCAN (see other answers) and successors such as OPTICS are clearly an option.
While the examples are on vector data, all that the algorithms need is a distance function. If you have a similarity matrix, that can trivially be used as distance function.
The example data set probably is a bit too small for them to produce meaningful results. If you just have this little of data, any "hierarchical clustering" should be feasible and do the job for you. You then just need to decide on the best number of clusters.

Resources