How to output XGBoost output in log odds form in Python

I have a simple XGBClassifier
model = XGBClassifier()
which I use to fit a model (X are the predictive features, Y is the binary target):
model.fit(X, Y)
If I want to calculate the probabilities from the XGBClassifier model that I have just trained, then I use this code:
y_pred_proba = []
for i in range(len(X)):
    y_pred_proba.append(0)
    y_pred_proba[i] = model.predict_proba(X.iloc[[i]]).ravel()[1]
But how do I get the log(odds)?
If I applied the following formula:
ln(odds) = ln(probability / (1-probability))
I'd get the odds ratio. I guess you cannot convert the probabilities to odds as simply as that. I guess you need a sigmoid function, right?
I understand that the default XGBClassifier objective function is a logistic regression. Is there a command to output the log(odds) of the XGBClassifier?
If I had fit a logistic regression like this:
import sklearn
model_adult = sklearn.linear_model.LogisticRegression(max_iter=10000)
model_adult.fit(X, Y)
Then I could have generated the log(odds) output through this code:
print(model_adult.predict_log_proba(X))
Is there anything similar with XGBClassifier?
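For a binary logistic objective, the formula quoted in the question, ln(p / (1 - p)), is the log-odds (the logit) of the predicted probability, and the sigmoid is its inverse. The sketch below illustrates that round trip with NumPy only. The `proba` values and the `eps` clipping guard are assumptions made for illustration, standing in for the positive-class column of `predict_proba`. XGBoost's scikit-learn wrapper also documents an `output_margin` argument to `predict` for returning the untransformed margin; check that against your installed version. Note that scikit-learn's `predict_log_proba` returns log-probabilities, not log-odds.

```python
import numpy as np

def sigmoid(z):
    """Map log-odds (logits) to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p, eps=1e-12):
    """Map probabilities to log-odds; eps is an assumed guard against p = 0 or 1."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

# Hypothetical positive-class probabilities, e.g. model.predict_proba(X)[:, 1]
proba = np.array([0.1, 0.5, 0.9])

margins = log_odds(proba)
print(margins)

# Applying the sigmoid to the log-odds recovers the probabilities
print(np.allclose(sigmoid(margins), proba))
```

A probability of 0.5 maps to a log-odds of 0, and probabilities above 0.5 map to positive values.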

Related

If my model is trained using sigmoid at the final layer and binary_crossentropy, can I still output the probability of classes rather than 0/1?

I have trained a CNN model with a dense layer at the end using a sigmoid function:
model.add(layers.Dense(1, activation='sigmoid'))
I have also compiled using binary cross entropy:
model.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), 'accuracy'])
The F1 score of the binary image classification comes out low and my model predicts one class over the other. So I decided to add a threshold based on the output probability of my sigmoid function at the final layer:
c = load_img('/home/kenan/Desktop/COV19D/validation/covid/ct_scan_19/120.jpg',
             color_mode='grayscale',
             target_size=(512, 512))
c = img_to_array(c)
c = np.expand_dims(c, axis=0)
pred = model.predict_proba(c)
pred
y_classes = ((model.predict(c)> 0.99)+0).ravel()
y_classes
I want to use 'pred' in my code as a probability of the class but it is always either 0 or 1 as shown below:
Out[113]: array([[1.]], dtype=float32)
Why doesn't it give a class probability within [0, 1] instead of exactly 1? Is there a way to get the class probability in my case rather than 0 or 1?
A sigmoid activation in the final layer outputs ONE value in the range 0 to 1: the predicted probability of the positive class, so the other class has probability 1 minus that value. If you want a separate probability output for each label, you'll have to change the final layer to a softmax with one unit per class.
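For a two-class problem, the single sigmoid value can be read as the positive-class score p, with 1 - p for the other class. A minimal sketch of expanding that value into two columns and applying a custom threshold follows. The helper names and the `pred` values are hypothetical stand-ins for the output of `model.predict(c)`, not part of Keras.

```python
import numpy as np

def binary_class_probabilities(sigmoid_output):
    """Expand (n, 1) sigmoid outputs into (n, 2) columns [P(class 0), P(class 1)]."""
    p = np.asarray(sigmoid_output, dtype=float).reshape(-1, 1)
    return np.hstack([1.0 - p, p])

def apply_threshold(sigmoid_output, threshold=0.5):
    """Return 1 where the sigmoid output exceeds the threshold, else 0."""
    p = np.asarray(sigmoid_output, dtype=float).ravel()
    return (p > threshold).astype(int)

# Hypothetical values standing in for model.predict(batch)
pred = np.array([[0.97], [0.42], [0.995]])

probs = binary_class_probabilities(pred)
labels = apply_threshold(pred, threshold=0.99)
print(probs)
print(labels)
```

If the raw output is always exactly 0 or 1, the sigmoid is likely saturating, which is worth checking separately from the thresholding step.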

Python OLS with categorical label

I have a dataset where I am trying to predict the type of car based on a number of features. I would like to run an OLS regression to see
import numpy as np
import statsmodels.api as sm
X = features
# where 0 = sedan, 1 = minivan , etc
y = [0,0,1,0,2,....]
X2 = sm.add_constant(np.array(X))
est = sm.OLS(np.array(y), X2)
est2 = est.fit()
^ I don't feel like doing this is correct because I am not specifying that it is categorical, I feel like the functional form should change. Was wondering if anyone had any insight on this.
Ordinary least squares regression assumes a numerical dependent variable, so you cannot use it to predict categorical outcomes.
To predict categorical outcomes with a regression model, you want to use multinomial logistic regression, for example using sklearn.
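As a rough sketch of that suggestion, scikit-learn's `LogisticRegression` accepts integer class labels directly and returns one probability per class. The data below is synthetic and purely illustrative (three hypothetical car types, two made-up features); it stands in for the `features` and `y` of the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 classes (0 = sedan, 1 = minivan, 2 = other),
# each clustered around a different center in a 2-feature space
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
y = np.repeat([0, 1, 2], 50)
X = centers[y] + rng.normal(scale=0.5, size=(150, 2))

# The labels are treated as categories, not as points on a numeric scale
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])  # one column per class, rows sum to 1
pred = clf.predict(X[:5])         # most probable class per row
print(proba)
print(pred)
```

Unlike OLS on the codes 0, 1, 2, this does not assume that "minivan" lies numerically between the other two categories.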

Optimal parameters not found: Number of calls to function has reached maxfev = 1000

I'm new to Python. I am trying to fit a curve to the data, but when I plot the graph, only the original data appears, along with the message "Optimal parameters not found: Number of calls to function has reached maxfev = 1000." Could you help me find my mistake?
%matplotlib inline
import matplotlib.pylab as m
from scipy.optimize import curve_fit
import numpy as num
import scipy.optimize as optimize
xData=num.array([0,0,100,200,250,300,400], dtype="float")
yData=num.array([0,0,0,0,75,100,100], dtype="float")
m.plot(xData, yData, 'ro', label='Datos originales')
def fun(x, a, b):
    return a + b * num.log(x)
popt,pcov=optimize.curve_fit(fun, xData, yData,p0=[1,1], maxfev=1000)
print(popt)
x=num.linspace(1,400,7)
m.plot(x,fun(x, *popt), label='Función ajustada')
m.xlabel('concentración')
m.ylabel('% mortalidad')
m.legend()
m.grid()
The model in your code is "a + b * num.log(x)". Because your data contains an x value of 0.0, the evaluation of log(0.0) gives errors and will not allow the fitting software to function. Sometimes these x values of 0.0 can be replaced with very small numbers, as log(small number) will not fail - but in this case the equation and data do not appear to match and so using that technique alone would not be sufficient here.
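A quick way to see the problem is to evaluate the model on the data with arbitrary parameters. In the sketch below, the parameter values 1.0 and 1.0 are just the initial guesses from the question; NumPy returns negative infinity for log(0.0), which leaves the least-squares residuals undefined.

```python
import numpy as np

x_data = np.array([0, 0, 100, 200, 250, 300, 400], dtype=float)

def fun(x, a, b):
    return a + b * np.log(x)

# Silence the divide-by-zero warning so the result can be inspected directly
with np.errstate(divide="ignore"):
    values = fun(x_data, 1.0, 1.0)

print(values)

# Count the model values that are not finite (the two x = 0 points)
n_bad = np.count_nonzero(~np.isfinite(values))
print(n_bad)
```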
My thought is that a different equation would be a better model for this data. I performed an equation search using your data, and found that several different sigmoidal type equations gave suspiciously good fits to this data set - which is not surprising because of the small number of data points.
The sigmoidal equations I tried were all extremely sensitive to the initial parameter estimates. Here is a graphical Python fitter using scipy's Differential Evolution genetic algorithm module to determine the initial parameter estimates for curve_fit's non-linear solver. That scipy module uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, requiring bounds within which to search. Here those bounds are taken from the data maximum and minimum values.
I personally would not use this fit precisely because the small number of data points is giving such suspiciously good fits, and strongly recommend taking additional data points if at all possible. I could however not find any equations with fewer than three parameters that would fit the data.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings
xData=numpy.array([0,0,100,200,250,300,400], dtype="float")
yData=numpy.array([0,0,0,0,75,100,100], dtype="float")
def func(x, a, b, c): # Sigmoid B equation from zunzun.com
    return a / (1.0 + numpy.exp(-1.0 * (x - b) / c))
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)
def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)
    parameterBounds = []
    parameterBounds.append([minX, maxX]) # search bounds for a
    parameterBounds.append([minX, maxX]) # search bounds for b
    parameterBounds.append([0.0, 2.0]) # search bounds for c
    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x
# by default, differential_evolution finishes by polishing its best result
# with a local L-BFGS-B minimization inside the parameter bounds
geneticParameters = generate_Initial_Parameters()
# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best fit parameters are outside those bounds
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')
    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData), 100)
    yModel = func(xModel, *fittedParameters)
    # now the model as a line plot
    axes.plot(xModel, yModel)
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label
    plt.show()
    plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

How do i analyse scientific plots programmatically

I have a graph that contains a series of curves. I want to know which curve is closest to the origin.
Given that I have the CSV file, how can I process it efficiently to find the curve closest to the origin? For the plot above, AV1 is the expected output.
Those look like exponential decay curves, of the form y = a * c^x
This means the logarithm is a linear function: log y = log a + x * log c
Perhaps use a linear regression model on the logarithm of "ssimulacra score" to get a slope and intercept for each line, and choose the line having the smallest intercept?
scikit learn has an easy-to-use linear regressor:
http://scikit-learn.org/stable/modules/linear_model.html
You can use pandas to read the csv file easily.
pip3 install scikit-learn pandas numpy
python:
import pandas as pd
import numpy as np
from sklearn import linear_model
df = pd.read_csv('filename.csv')
# scikit-learn expects a 2D feature array, hence the double brackets
X = df[['bpp']].values
y = np.log(df['ssimulacra score'].values)
reg = linear_model.LinearRegression()
reg.fit(X, y)
intercept = reg.intercept_
Get the intercept for each line and return the line having the smallest intercept.
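The per-curve comparison can be sketched with NumPy alone, fitting a straight line to log(y) for each curve and picking the smallest intercept. The curve data here is synthetic, and the names `codec_b` and `codec_c` are hypothetical placeholders; with a real CSV, each curve's points would be selected from the DataFrame first.

```python
import numpy as np

def log_linear_fit(x, y):
    """Fit log(y) = log(a) + x * log(c); return (intercept, slope)."""
    slope, intercept = np.polyfit(x, np.log(y), 1)
    return intercept, slope

# Synthetic exponential-decay curves y = a * c**x (illustrative values only)
x = np.linspace(0.1, 2.0, 20)
curves = {
    "AV1": 2.0 * 0.30 ** x,
    "codec_b": 3.0 * 0.35 ** x,
    "codec_c": 5.0 * 0.40 ** x,
}

# Intercept of the log-linear fit for each curve, i.e. log(a)
intercepts = {name: log_linear_fit(x, y)[0] for name, y in curves.items()}

# The curve with the smallest intercept is taken as closest to the origin
closest = min(intercepts, key=intercepts.get)
print(intercepts)
print(closest)
```

Whether the intercept alone is the right notion of "closest to the origin" depends on the data; comparing the fitted curves over the whole x range is a reasonable alternative.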

integration of multidimensional data (matlab)

I have a (somewhat complicated expression) in three dimensions, x,y,z. I'm interested in the cumulative integral over one of them. My best solution thus far is to create a 3D grid, evaluate the expression at every point, then integrate over the third dimension with cumtrapz. This is just a scaled down example of what I'm trying to achieve:
%integration
xvec = linspace(-pi,pi,40);
yvec = linspace(-pi,pi,40);
zvec = 1:160;
[x,y,z] = meshgrid(xvec,yvec,zvec);
f = @(x,y,z) sin(x).*cos(y).*exp(z/80).*cos((x-z/20));
output = cumtrapz(f(x,y,z),3);
%(plotting)
for j = 1:length(output(1,1,:));
    surf(output(:,:,j));
    zlim([-120,120]);
    shading interp
    pause(.05);
    drawnow;
end
Given the sizes of vectors (x,y~100, z~5000), is this a computationally sensible way to do this?
If this is the functional form you want to integrate, @(x,y,z) sin(x).*cos(y).*exp(z/80).*cos((x-z/20)), then x, y and z can be integrated separately and the integral can be solved analytically using complex numbers, by replacing sin(x) = (exp(ix) - exp(-ix))/(2i) and cos(x) = (exp(ix) + exp(-ix))/2. This will greatly reduce the time cost of your calculation.
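For this particular integrand the z-integral has a closed form, because an antiderivative of exp(a z) cos(b z - x) with respect to z is exp(a z) (a cos(b z - x) + b sin(b z - x)) / (a^2 + b^2). The sketch below is written in Python with NumPy rather than MATLAB, purely as an illustration, and checks the closed form against a fine cumulative trapezoid at one arbitrarily chosen (x, y) point. The helper names are hypothetical.

```python
import numpy as np

A, B = 1.0 / 80.0, 1.0 / 20.0  # rates from exp(z/80) and cos(x - z/20)

def f(x, y, z):
    return np.sin(x) * np.cos(y) * np.exp(A * z) * np.cos(x - B * z)

def antiderivative_z(x, y, z):
    """Antiderivative of f with respect to z, with x and y held fixed."""
    theta = B * z - x  # cos(x - B*z) equals cos(B*z - x)
    inner = np.exp(A * z) * (A * np.cos(theta) + B * np.sin(theta)) / (A**2 + B**2)
    return np.sin(x) * np.cos(y) * inner

def cumulative_integral(x, y, z, z0):
    """Integral of f over z from z0 to z; broadcasts over array inputs."""
    return antiderivative_z(x, y, z) - antiderivative_z(x, y, z0)

# Check against a fine cumulative trapezoid at one arbitrary (x, y) point
x0, y0 = 0.7, -0.3
z = np.linspace(1.0, 160.0, 200001)
vals = f(x0, y0, z)
dz = z[1] - z[0]
numeric = np.concatenate([[0.0], np.cumsum(0.5 * (vals[1:] + vals[:-1]) * dz)])
exact = cumulative_integral(x0, y0, z, z[0])

max_err = np.max(np.abs(numeric - exact))
print(max_err)
```

Because `cumulative_integral` broadcasts, it can be evaluated on a full meshgrid in one call, with no numerical integration along z.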
