Python OLS with categorical label - statsmodels

I have a dataset where I am trying to predict the type of car based on a number of features. I would like to run an OLS regression:
import numpy as np
import statsmodels.api as sm
X = features
# where 0 = sedan, 1 = minivan , etc
y = [0,0,1,0,2,....]
X2 = sm.add_constant(np.array(X))
est = sm.OLS(np.array(y), X2)
est2 = est.fit()
I don't think this is correct, because I am not specifying that the target is categorical, and I feel the functional form should change. Does anyone have any insight on this?

Ordinary least squares regression assumes a numerical dependent variable, so you cannot use it to predict categorical outcomes. Coding the car types as 0, 1, 2 would also impose an ordering and equal spacing that the categories do not have.
To predict categorical outcomes with a regression model, use multinomial logistic regression, for example with sklearn.
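As a rough illustration of that suggestion, here is a minimal sketch using scikit-learn's LogisticRegression. The data is synthetic and purely hypothetical (three made-up car types, two numeric features) and only stands in for the asker's features and labels. It also assumes a recent scikit-learn version, where the default solver fits a multinomial model when the target has more than two classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 3 car types (0 = sedan, 1 = minivan, 2 = other),
# 2 numeric features, 60 rows per class drawn around different centres.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(60, 2)) for c in centers])
y = np.repeat([0, 1, 2], 60)

# The labels are treated as unordered categories, not as numbers on a scale.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

probabilities = clf.predict_proba(X[:5])  # one column per car type
predicted = clf.predict(X[:5])            # most probable car type per row
accuracy = clf.score(X, y)
```

Each row of probabilities sums to 1, which is the behaviour OLS on integer-coded labels cannot give you.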

How to output XGBoost output in log odds form in Python

I have a simple XGBClassifier
model = XGBClassifier()
which I use to fit a model (X are the predictive features, Y is the binary target):
model.fit(X, Y)
If I want to calculate the probabilities from the XGBClassifier model that I have just trained, then I use this code:
y_pred_proba = []
for i in range(len(X)):
    y_pred_proba.append(0)
    y_pred_proba[i] = model.predict_proba(X.iloc[[i]]).ravel()[1]
But how do I get the log(odds)?
If I applied the following formula:
ln(odds) = ln(probability / (1-probability))
I'd get the odds ratio. I guess you cannot convert the probabilities to odds as simply as that. I guess you need a sigmoid function, right?
I understand that the default XGBClassifier objective function is a logistic regression. Is there a command to output the log(odds) of the XGBClassifier?
If I had fit a logistic regression like this:
import sklearn
model_adult = sklearn.linear_model.LogisticRegression(max_iter=10000)
model_adult.fit(X, Y)
Then I could have generated the log(odds) output through this code:
print(model_adult.predict_log_proba(X))
Is there anything similar with XGBClassifier?
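For what it's worth, ln(p / (1 - p)) is the log-odds itself (not an odds ratio), and the sigmoid is simply its inverse, so converting the predicted probabilities should be enough. A minimal sketch, using hypothetical probability values as a stand-in for model.predict_proba(X)[:, 1]:

```python
import numpy as np

def log_odds(p):
    """Logit: map probabilities in (0, 1) to log-odds."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: map log-odds back to probabilities."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values standing in for model.predict_proba(X)[:, 1]
probabilities = np.array([0.1, 0.5, 0.9])

margins = log_odds(probabilities)  # log-odds of the positive class
recovered = sigmoid(margins)       # round trip back to probabilities
```

With xgboost itself, my understanding is that model.predict(X, output_margin=True) returns the raw, untransformed margin, which for the default binary logistic objective should match these log-odds. Treat that as an assumption to check against your installed version.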

If my model is trained using sigmoid at the final layer and binary_crossentropy, can I still output the probability of classes rather than 0/1?

I have trained a CNN model with a dense layer at the end using a sigmoid function:
model.add(layers.Dense(1, activation='sigmoid'))
I have also compiled using binary cross entropy:
model.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), 'accuracy'])
The F1 score of the binary image classification comes out low, and my model predicts one class over the other. So I decided to add a threshold based on the output probability of the sigmoid function at the final layer:
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
c = load_img('/home/kenan/Desktop/COV19D/validation/covid/ct_scan_19/120.jpg',
             color_mode='grayscale',
             target_size=(512, 512))
c = img_to_array(c)
c = np.expand_dims(c, axis=0)
pred = model.predict_proba(c)
pred
y_classes = ((model.predict(c) > 0.99) + 0).ravel()
y_classes
I want to use 'pred' in my code as a probability of the class but it is always either 0 or 1 as shown below:
Out[113]: array([[1.]], dtype=float32)
Why doesn't it give a probability between 0 and 1 for the predicted class, instead of exactly 1? Is there a way to get the class probability in my case rather than 0 or 1?
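Not as a separate value per class. A sigmoid activation in the final layer outputs ONE value in the range 0 to 1, which is the model's probability for the positive class (the other class gets 1 minus that value). If you want to obtain class probabilities for the different labels separately, you'll have to change the final layer activation to softmax.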
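To make the sigmoid versus softmax point concrete, here is a small NumPy sketch. The logits are hypothetical pre-activation values, not outputs of the asker's model. It shows that a single sigmoid output already determines both class probabilities, that a two-unit softmax over [0, z] gives the same numbers, and that a large logit evaluated in float32 rounds to exactly 1.0, which is one possible (unconfirmed) reason for seeing array([[1.]]).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical pre-activation outputs of the final Dense(1) layer
logits = np.array([-2.0, 0.3, 6.0])

p_positive = sigmoid(logits)                                # P(class 1)
p_both = np.stack([1.0 - p_positive, p_positive], axis=-1)  # [P(class 0), P(class 1)]

# A two-unit softmax over [0, z] reproduces the same probabilities.
two_unit = softmax(np.stack([np.zeros_like(logits), logits], axis=-1))

# Custom decision threshold, as in the question
threshold = 0.99
y_classes = (p_positive > threshold).astype(int)

# In float32, a large logit saturates to exactly 1.0
saturated = sigmoid(np.array([20.0], dtype=np.float32))
```

If saturation is what is happening, the threshold has nothing to work with, so it may be worth checking the input scaling (for example, whether the training images were rescaled but this one was not).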

What is the name of this image similarity/ distance based metric?

I used the following code to calculate the similarity between images 1 and 2 (i1 and i2), where 1 = identical and 0 = very different. I'd like to know what method this algorithm is using (i.e. Euclidean distance or ...?). Thank you.
import numpy as np
i1=all_images_saved[0][1]
i2=all_images_saved[0][2]
i1_norm = i1/np.sqrt(np.sum(i1**2))
i2_norm = i2/np.sqrt(np.sum(i2**2))
np.sum(i1_norm*i2_norm)
Looks like cosine similarity. You can check it gives the same results as:
from scipy import spatial
cosine_distance = spatial.distance.cosine(i1.flatten(), i2.flatten())
cosine_similarity = 1 - cosine_distance
I don't believe it's a distance, otherwise 0 would mean identical. This looks like the dot product of two normalized vectors, in which case the result ranges from -1 to 1 and says the following about the original vectors (values in between correspond to intermediate angles):
1 = co-directional
0 = orthogonal
-1 = opposite direction
And given the geometric definition of the dot product, if you have the dot product and the magnitude of your vectors you can derive the angle between the 2:
a . b = ||a|| ||b|| cos θ
Or have I completely missed something here?
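A quick way to check both answers with NumPy alone: the sketch below uses small random arrays as hypothetical stand-ins for the two images, compares the question's formula with a direct cosine computation, and recovers the angle from the dot-product identity above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two arrays, treated as flat vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
i1 = rng.random((8, 8))  # hypothetical stand-ins for the two images
i2 = rng.random((8, 8))

# The computation from the question
i1_norm = i1 / np.sqrt(np.sum(i1**2))
i2_norm = i2 / np.sqrt(np.sum(i2**2))
from_question = float(np.sum(i1_norm * i2_norm))

# The same quantity computed directly, and the corresponding angle
direct = cosine_similarity(i1, i2)
angle_deg = float(np.degrees(np.arccos(np.clip(direct, -1.0, 1.0))))
```

For images with non-negative pixel values the dot product cannot be negative, so the similarity stays between 0 and 1, which matches the "0 = very different" reading in the question.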

The point that minimizes the sum of euclidean distances to a set of n points

I have a set of points W={(x1, y1), (x2, y2),..., (xn, yn)} on the 2D plane. Can you find an algorithm that takes these points as the input and returns a point (x, y) on the 2D plane which has the minimum sum of distances from the points in W? In other words, if
di = Euclidean_distance((x, y), (xi, yi))
I want to minimize:
d1 + d2 + ... + dn
The Problem
You're looking for the geometric median.
An Easy Solution
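There is no closed-form solution to this problem, so iterative or probabilistic methods are used. The easiest way to find it is probably with Weiszfeld's algorithm, which repeatedly replaces the current estimate y with a distance-weighted average of the data points p(i):
y_new = sum(i, p(i)/||p(i)-y||) / sum(i, 1/||p(i)-y||)
We can implement this in Python as follows: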
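import numpy as np
from numpy.linalg import norm as npnorm
# Random test data (same setup as in the SOCP example below)
POINT_NUM = 100
pts = np.random.rand(POINT_NUM, 2)
c_pt_old = np.random.rand(2)  # previous estimate (random start)
c_pt_new = np.zeros(2)        # current estimate
# Iterate until the estimate stops moving
while npnorm(c_pt_old - c_pt_new) > 1e-6:
    num = 0
    denom = 0
    for i in range(POINT_NUM):
        dist = npnorm(c_pt_new - pts[i, :])
        num += pts[i, :] / dist
        denom += 1 / dist
    c_pt_old = c_pt_new
    c_pt_new = num / denom
print(c_pt_new)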
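There's a chance that Weiszfeld's algorithm won't converge: if an iterate lands exactly on one of the data points, the corresponding distance is zero and the update divides by zero. It might therefore be best to run it several times from different starting points.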
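One common way to sidestep that failure mode is to clamp very small distances before dividing. The function below is a minimal vectorized sketch of that idea and not a definitive implementation; the name geometric_median, the mean as starting point, and the eps, tol and max_iter parameters are all my own choices.

```python
import numpy as np

def geometric_median(points, tol=1e-9, max_iter=1000, eps=1e-12):
    """Weiszfeld iteration with a guard against zero distances."""
    points = np.asarray(points, dtype=float)
    estimate = points.mean(axis=0)  # the mean is a convenient starting point
    for _ in range(max_iter):
        dists = np.linalg.norm(points - estimate, axis=1)
        dists = np.maximum(dists, eps)  # avoid dividing by zero at a data point
        weights = 1.0 / dists
        new_estimate = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_estimate - estimate) < tol:
            return new_estimate
        estimate = new_estimate
    return estimate

# Four corners of the unit square plus one far-away outlier
example_points = [[0, 0], [1, 0], [0, 1], [1, 1], [10, 10]]
median = geometric_median(example_points)
mean = np.mean(example_points, axis=0)
```

In this example the outlier pulls the mean out to (2.4, 2.4), while the geometric median stays inside the square.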
A General Solution
You can also find this using second-order cone programming (SOCP). In addition to solving your specific problem, this general formulation then allows you to easily add constraints and weightings, such as variable uncertainty in the location of each data point.
To do so, you create one auxiliary variable per data point, each an upper bound on the distance between the proposed center point and that data point.
You then minimize the sum of these variables. The result follows:
import cvxpy as cp
import numpy as np
import matplotlib.pyplot as plt
#Generate random test data
POINT_NUM = 100
pts = np.random.rand(POINT_NUM,2)
c_pt = cp.Variable(2) #The center point we wish to locate
distances = cp.Variable(POINT_NUM) #Distance from the center point to each data point
#Generate constraints. These are used to hold distances.
constraints = []
for i in range(POINT_NUM):
    constraints.append(cp.norm(c_pt - pts[i, :]) <= distances[i])
objective = cp.Minimize(cp.sum(distances))
problem = cp.Problem(objective,constraints)
optimal_value = problem.solve()
print("Optimal value = {0}".format(optimal_value))
print("Optimal location = {0}".format(c_pt.value))
plt.scatter(x=pts[:,0], y=pts[:,1], s=1)
plt.scatter(c_pt.value[0], c_pt.value[1], s=10)
plt.show()
SOCPs are available in a number of solvers including CPLEX, Elemental, ECOS, ECOS_BB, GUROBI, MOSEK, CVXOPT, and SCS.
I've tested and the two approaches give the same answers to within tolerance.
Weiszfeld, E. (1937). "Sur le point pour lequel la somme des distances de n points donnes est minimum". Tohoku Mathematical Journal. 43: 355–386.
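If that point does not need to be from your sample, note that the mean minimises the sum of squared Euclidean distances. It does not in general minimise the sum of the distances themselves, which is what the geometric median does.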
A third method would be to use a compact nonlinear programming formulation. An unconstrained NLP model would be:
min sum(i, ||x-p(i)|| )
This has just 2 variables (the coordinates of x).
There is a very good initial point available. Let p(i,c) be the coordinates of the data points. Then the mean is
m(c) = sum(i, p(i,c)) / n
where n is the number of data points. This point is often very close to the optimal value of x. So we can use m as an excellent initial point for x.
Some limited experiments indicate that this approach is considerably faster than a cone programming formulation for large n.
For details see Yet Another Math Programming Consultant - Finding the Central Point in a Point Cloud blog post.
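A minimal sketch of this unconstrained formulation, assuming SciPy is available. The solver choice is mine and not taken from the blog post: Nelder-Mead is used because it needs no derivatives, and the objective is not differentiable at the data points. The sample points are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def total_distance(x, points):
    """Objective: sum of Euclidean distances from x to every data point."""
    return np.linalg.norm(points - x, axis=1).sum()

# Hypothetical data: four corners of the unit square plus one outlier
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]])

# Use the mean m as the initial point, as suggested above
start = points.mean(axis=0)

result = minimize(
    total_distance,
    start,
    args=(points,),
    method="Nelder-Mead",
    options={"xatol": 1e-8, "fatol": 1e-8},
)
center = result.x
```

Only the two coordinates of x are optimized, regardless of how many data points there are.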

integration of multidimensional data (matlab)

I have a (somewhat complicated expression) in three dimensions, x,y,z. I'm interested in the cumulative integral over one of them. My best solution thus far is to create a 3D grid, evaluate the expression at every point, then integrate over the third dimension with cumtrapz. This is just a scaled down example of what I'm trying to achieve:
%integration
xvec = linspace(-pi,pi,40);
yvec = linspace(-pi,pi,40);
zvec = 1:160;
[x,y,z] = meshgrid(xvec,yvec,zvec);
f = @(x,y,z) sin(x).*cos(y).*exp(z/80).*cos((x-z/20));
output = cumtrapz(f(x,y,z),3);
%(plotting)
for j = 1:size(output,3)
    surf(output(:,:,j));
    zlim([-120,120]);
    shading interp
    pause(.05);
    drawnow;
end
Given the sizes of vectors (x,y~100, z~5000), is this a computationally sensible way to do this?
If this is the function you want to integrate, @(x,y,z) sin(x).*cos(y).*exp(z/80).*cos((x-z/20)), then x, y and z can each be integrated analytically using complex exponentials, by replacing sin(x) = (exp(ix) - exp(-ix))/(2i) and cos(x) = (exp(ix) + exp(-ix))/2. This will greatly reduce the time cost of your calculation.
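To illustrate the analytic route for the z integral, here is a sketch in Python/NumPy (chosen to match the other examples on this page, not the MATLAB of the question). It uses the fact that exp(z/80)*cos(x - z/20) is the real part of exp(ix)*exp((a - ib)z) with a = 1/80 and b = 1/20, and compares the closed form with a cumulative trapezoid sum. The names a and b and the finer z grid (1601 points) are my own assumptions.

```python
import numpy as np

a, b = 1 / 80, 1 / 20
xvec = np.linspace(-np.pi, np.pi, 40)
yvec = np.linspace(-np.pi, np.pi, 40)
zvec = np.linspace(1, 160, 1601)

# Broadcastable axes laid out as (y, x, z), like MATLAB's meshgrid output
x = xvec[None, :, None]
y = yvec[:, None, None]
z = zvec[None, None, :]

def antiderivative_z(x, z):
    # Its z-derivative is exp(a*z) * cos(x - b*z)
    return np.real(np.exp(1j * x) * np.exp((a - 1j * b) * z) / (a - 1j * b))

# Closed-form cumulative integral over z, starting at the first grid point
analytic = np.sin(x) * np.cos(y) * (
    antiderivative_z(x, z) - antiderivative_z(x, z[..., :1])
)

# Numerical counterpart: cumulative trapezoid rule along the z axis
f = np.sin(x) * np.cos(y) * np.exp(a * z) * np.cos(x - b * z)
steps = 0.5 * (f[..., 1:] + f[..., :-1]) * np.diff(zvec)
numeric = np.concatenate(
    [np.zeros(f.shape[:2] + (1,)), np.cumsum(steps, axis=-1)], axis=-1
)

max_abs_error = float(np.max(np.abs(numeric - analytic)))
```

The closed form needs no cumulative sum over z at all, so each z slice can be evaluated independently of the others.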
