Making predictions with a LightGBM model that was trained with weights - lightgbm

I have trained a LightGBM model with weights; however, these weights are still part of my feature set. Meanwhile, the data I would like to make predictions on does not have a weight column. How can I use the model I trained with weights to make predictions on non-weighted data?
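For reference, LightGBM accepts sample weights separately from the features (the weight argument of lgb.Dataset, or sample_weight in the scikit-learn API), so they never need to appear as a column in the feature matrix and the prediction data needs no weight column. A minimal sketch on synthetic data, assuming the scikit-learn interface:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))           # features only; no weight column
y_train = (X_train[:, 0] > 0).astype(int)     # toy binary target
w_train = rng.uniform(0.5, 2.0, size=200)     # per-row training weights

# The weights are passed separately; they only scale the loss during training
# and are not a feature the model sees.
model = lgb.LGBMClassifier(n_estimators=50)
model.fit(X_train, y_train, sample_weight=w_train)

# New data has the same 5 feature columns and no weight column at all.
X_new = rng.normal(size=(10, 5))
print(model.predict(X_new))
```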

Related

Recall Precision Curve for clustering algorithms

I would like to know whether the precision-recall curve is relevant for clustering algorithms, for example unsupervised learning techniques such as Mean Shift or DBSCAN (or is it relevant only for classification algorithms?). If yes, how do I get the plot points for low recall values? Is it allowed to change the model parameters to get low recall rates for a model?
PR curves (and ROC curves) require a ranking.
E.g. a classifier score that can be used to rank objects by how likely they belong to class A, or not.
In clustering, you usually do not have such a ranking.
Without a ranking, you don't get a curve. Also, what is precision and recall in clustering? Use ARI and NMI for evaluation.
But there are unsupervised methods such as outlier detection where, e.g., the ROC curve is a fairly common evaluation method. The PR curve is more problematic, because it is not defined at recall 0, and you shouldn't linearly interpolate. Thus, the popular "area under the curve" is not well defined for PR curves. Since there are a dozen other measures, I'd avoid PR-AUC because of this.
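If ground-truth labels are available for evaluation, the ARI and NMI suggested above can be computed directly with scikit-learn; a small illustrative sketch on synthetic blobs (the parameter choices are arbitrary):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Synthetic data with known ground-truth labels, clustered without them.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

print("ARI:", adjusted_rand_score(y_true, labels))
print("NMI:", normalized_mutual_info_score(y_true, labels))
```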

How to reshape pre-trained weights to input them to a 3D convolutional neural network?

I have pre-trained weights for a 3D convolutional layer from Matlab. The weights are a 5D tensor with shape (512,4,4,4,160), i.e. [out_channels, filter_depth, filter_height, filter_width, in_channels].
Now I want to use it as the initial weights for fine-tuning with TensorFlow's tf.nn.conv3d. I see that the expected filter shape for 3D convolutions is (4,4,4,160,512), i.e. [filter_depth, filter_height, filter_width, in_channels, out_channels]. Can I just use tf.Variable().reshape(4,4,4,160,512)? I feel the weights would not be correct if I simply reshape them.
The tf.transpose operation can reorder axes: https://www.tensorflow.org/versions/r0.11/api_docs/python/array_ops.html#transpose
Provided the initial shape of input is (512,4,4,4,160), the output of tf.transpose(input, perm=[1,2,3,4,0]) will have shape (4,4,4,160,512), which is the layout tf.nn.conv3d expects.
Also, you may need to reverse (flip) your weights along some axis or axes: in TensorFlow, convolutions are implemented as cross-correlations: https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#convolution
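A small sketch of the reordering in NumPy (tf.transpose with the same perm does the equivalent on a tensor); the spatial flip at the end is only needed if the Matlab weights were produced by a true convolution:

```python
import numpy as np

# Matlab layout: (out_channels, depth, height, width, in_channels)
w_matlab = np.random.randn(512, 4, 4, 4, 160).astype(np.float32)

# tf.nn.conv3d filter layout: (depth, height, width, in_channels, out_channels)
w_tf = np.transpose(w_matlab, (1, 2, 3, 4, 0))   # shape (4, 4, 4, 160, 512)

# If the original weights came from a true convolution, also flip the three
# spatial axes, because tf.nn.conv3d performs cross-correlation.
w_tf = w_tf[::-1, ::-1, ::-1, :, :]
print(w_tf.shape)
```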

How to use the Spark MLlib Multilayer Perceptron weights array

I have a requirement where I need to find the relative importance of the attributes used in an ANN implementation. I use the Spark MLlib MultilayerPerceptron for the implementation. The model gives me a vector which is an array of the weights. I know there are algorithms to derive the relative importance from weights, but the MLlib implementation gives out one big one-dimensional array and does not say which weights correspond to which input. Does anyone know how to get the weights corresponding to each input node?
The model flattens the weight matrices with the Breeze operation toDenseVector (notice the line val brzWeights: BV[Double] = weightsOld.asBreeze.toDenseVector).
This operation acts like NumPy's flatten().
Therefore, to retrieve the weights matrices, you have to do two things:
Split the weights vector into parts, according to your layers. You have to take (layerSize + 1) * nextLayerSize weights for each non-final layer (+1 because of the bias).
For each flattened weight matrix, apply numpy's reshape with parameters (layerSize + 1, nextLayerSize).
When you derive the relative importance from your weights, notice that in the PySpark implementation the bias is represented as the last feature.
Therefore the last row in each weight matrix represents the bias value.
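A sketch of that unflattening in NumPy, following the layout described above, with a hypothetical layer specification layers = [4, 5, 3] (the same list passed as the layers parameter of the classifier):

```python
import numpy as np

def unflatten_weights(flat_weights, layers):
    """Split a flat MLP weight vector into per-layer (layerSize + 1, nextLayerSize)
    matrices, following the layout described above; the extra row is the bias."""
    flat = np.asarray(flat_weights, dtype=float)
    matrices, offset = [], 0
    for size, next_size in zip(layers[:-1], layers[1:]):
        n = (size + 1) * next_size
        matrices.append(flat[offset:offset + n].reshape(size + 1, next_size))
        offset += n
    return matrices

# Hypothetical network: 4 inputs, one hidden layer of 5 units, 3 outputs.
layers = [4, 5, 3]
flat = np.arange((4 + 1) * 5 + (5 + 1) * 3, dtype=float)  # stand-in for model.weights
for i, W in enumerate(unflatten_weights(flat, layers)):
    print("layer", i, "shape", W.shape)  # the last row of each matrix is the bias
```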

Relationship (overlap, inclusion, or exclusion) between two multivariate normal distributions

Is there a way to compare two multivariate normal distributions, given their means and covariance matrices, to see what relationship they have, if any? I am looking at topological relationships between the two distributions, such as one overlapping the other or one including the other.
I have checked various statistical measures (OVL, Hellinger distance, Mahalanobis distance), but most of what I found treats only the univariate normal case; I didn't find anything for multivariate distributions. If possible, an implementation of such an algorithm would be very helpful.
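For what it is worth, the Hellinger distance does have a closed form for multivariate normals, via the Bhattacharyya coefficient; a sketch using that standard formula, with purely illustrative means and covariances:

```python
import numpy as np

def hellinger_mvn(mu1, cov1, mu2, cov2):
    """Squared Hellinger distance between two multivariate normals,
    H^2 = 1 - BC, where BC = exp(-Bhattacharyya distance)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    db = diff @ np.linalg.solve(cov, diff) / 8.0 \
         + 0.5 * np.log(np.linalg.det(cov)
                        / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return 1.0 - np.exp(-db)  # in [0, 1]; 0 means identical distributions

# Two illustrative 2-D Gaussians.
print(hellinger_mvn([0, 0], np.eye(2), [1, 1], 2 * np.eye(2)))
```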

Can k-means clustering do classification?

I want to know whether the k-means clustering algorithm can be used for classification.
Suppose I have done a simple k-means clustering: I have a lot of data, I run k-means, and I get 2 clusters, A and B, with Euclidean distance used for the centroid assignment.
Cluster A is on the left side.
Cluster B is on the right side.
Now, if I get a new data point, what should I do?
1. Run the k-means clustering algorithm again and see which cluster the new point ends up in?
2. Record the final centroids and use the Euclidean distance to them to decide which cluster the new point belongs to?
3. Some other method?
The simplest method, of course, is 2.: assign each object to the closest centroid (technically, use the sum of squared differences, not Euclidean distance; this is more correct for k-means and saves you a sqrt computation).
Method 1. is fragile, as k-means may give you a completely different solution, in particular if it didn't fit your data well in the first place (e.g. data that is too high-dimensional, clusters of very different sizes, too many clusters, ...).
However, the following method may be even more reasonable:
3. Train an actual classifier.
Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions are reasonable classes (you really should validate this at some point though), and then continue as you would if the data had been user-labeled.
I.e. run k-means, train an SVM on the resulting clusters, then use the SVM for classification.
k-NN classification, or even assigning each object to the nearest cluster center (option 1) can be seen as very simple classifiers. The latter is a 1NN classifier, "trained" on the cluster centroids only.
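A short sketch of option 3 with scikit-learn, on synthetic two-blob data (all parameter choices here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, size=(100, 2)),    # "left" blob (cluster A)
               rng.normal(+3, 1, size=(100, 2))])   # "right" blob (cluster B)

# 1. k-means produces pseudo-labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# 2. Train a classifier on those pseudo-labels.
clf = SVC().fit(X, kmeans.labels_)

# 3. Classify new points with the SVM instead of re-running k-means.
X_new = rng.normal(0, 3, size=(5, 2))
print(clf.predict(X_new))
```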
Yes, we can do classification.
I wouldn't say the algorithm itself (like #1) is particularly well-suited to classifying points, as incorporating data to be classified into your training data tends to be frowned upon (unless you have a real-time system, but I think elaborating on this would get a bit far from the point).
To classify a new point, simply calculate the Euclidean distance to each cluster centroid to determine the closest one, then classify it under that cluster.
There are data structures that allow you to determine the closest centroid more efficiently (such as a k-d tree), but the above is the basic idea.
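The nearest-centroid assignment itself is only a few lines of NumPy; a sketch with made-up centroids for the two clusters A (left) and B (right):

```python
import numpy as np

def nearest_centroid(X_new, centroids):
    """Index of the closest centroid for each row of X_new
    (squared Euclidean distance; the sqrt is unnecessary for an argmin)."""
    d2 = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Made-up centroids for cluster A (left) and cluster B (right).
centroids = np.array([[-2.0, 0.0], [2.0, 0.0]])
print(nearest_centroid(np.array([[1.5, 0.3], [-3.0, 1.0]]), centroids))  # -> [1 0]
```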
If you've already done k-means clustering on your data to get two clusters, then you could use k Nearest Neighbors on the new data point to find out which class it belongs to.
Here is another method:
I saw it in "The Elements of Statistical Learning". I'll change the notation a little bit. Let C be the number of classes and K the number of clusters per class. Now, follow these steps:
1. Apply k-means clustering to the training data in each class separately, using K clusters per class.
2. Assign a class label to each of the C*K clusters.
3. Classify observation x to the class of the closest cluster.
It seems like a nice approach to classification that reduces the training observations to a set of labeled cluster centroids.
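A sketch of those three steps with scikit-learn, using synthetic blobs as stand-in labeled data and an arbitrary choice of K clusters per class:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic labeled data: 3 classes (C = 3), 5 features.
X, y = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
K = 4  # clusters per class (arbitrary choice)

# Steps 1-2: cluster each class separately and label its centroids with that class.
centroids, centroid_labels = [], []
for c in np.unique(y):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X[y == c])
    centroids.append(km.cluster_centers_)
    centroid_labels.extend([c] * K)
centroids = np.vstack(centroids)
centroid_labels = np.array(centroid_labels)

# Step 3: classify new points by the class of the closest of the C*K centroids.
x_new = X[:3]
d2 = ((x_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
print(centroid_labels[d2.argmin(axis=1)])
```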
If you are doing real-time analysis where you want to recognize new conditions during use (or adapt to a changing system), then you can choose some radius around the centroids to decide whether a new point starts a new cluster or should be included in an existing one. (That's a common need in monitoring of plant data, for instance, where it may take years after installation before some operating conditions occur.) If real-time monitoring is your case, check RTEFC or RTMAC, which are efficient, simple real-time variants of k-means; RTEFC in particular is non-iterative. See http://gregstanleyandassociates.com/whitepapers/BDAC/Clustering/clustering.htm
Yes, you can use that for classification. If you've decided you have collected enough data for all possible cases, you can stop updating the clusters and just classify new points based on the nearest centroid. As in any real-time method, there will be sensitivity to outliers, e.g. those caused by sensor error or failure when using sensor data. If you create new clusters, outliers could be considered legitimate if one purpose of the clustering is to identify faults in the sensors, although that is most useful when you can do some labeling of the clusters.
You are confusing the concepts of clustering and classification. When you have labeled data, you already know how the data is grouped according to the labels, and there is no point in clustering the data unless you want to find out how well your features can discriminate the classes.
If you run the k-means algorithm to find the centroid of each class and then use the distances to those centroids to classify a new data point, you are in fact implementing a form of linear discriminant analysis, assuming the same multiple-of-identity covariance matrix for all classes.
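To make that connection explicit: with a shared covariance matrix that is a multiple of the identity (sigma^2 I) and class centroids mu_k, the LDA discriminant is

```latex
\delta_k(x) = x^\top \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log\pi_k
            = \frac{1}{\sigma^2}\left(x^\top \mu_k - \tfrac{1}{2}\lVert \mu_k\rVert^2\right) + \log\pi_k ,
```

so with equal class priors pi_k, maximizing delta_k(x) over k is the same as minimizing ||x - mu_k||^2, i.e. assigning x to the nearest centroid.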
After the k-means clustering algorithm converges, it can be used for classification, with only a few labeled exemplars/training data points.
It is a very common approach when the number of labeled training instances is very limited due to the high cost of labeling.
