How do I analyse scientific plots programmatically - bash

I have a graph that contains a series of curves. I want to know which curve is closest to the origin.
Given that I have the CSV file, how can I process it efficiently to find the curve closest to the origin? For the plot above, AV1 is the expected output.

Those look like exponential decay curves, of the form y = a * c^x.
This means the logarithm is a linear function: log y = log a + x * log c
Perhaps use a linear regression model on the logarithm of the "ssimulacra score" to get a slope and intercept for each line, and choose the line with the smallest intercept?
scikit-learn has an easy-to-use linear regressor:
http://scikit-learn.org/stable/modules/linear_model.html
You can use pandas to read the csv file easily.
pip3 install scikit-learn pandas numpy
Python:
import pandas as pd
import numpy as np
from sklearn import linear_model

df = pd.read_csv('filename.csv')
X = df['bpp'].values.reshape(-1, 1)          # sklearn expects a 2-D feature matrix
y = np.log(df['ssimulacra score'].values)    # fit the log of the score
reg = linear_model.LinearRegression()
reg.fit(X, y)
intercept = reg.intercept_
Get the intercept for each line and return the line having the smallest intercept.
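For example, if the CSV has one column naming each curve alongside 'bpp' and 'ssimulacra score', the whole procedure could look like the sketch below ('codec' is an assumed column name; adjust it to whatever your file actually uses):
import pandas as pd
import numpy as np
from sklearn import linear_model

df = pd.read_csv('filename.csv')
intercepts = {}
for name, group in df.groupby('codec'):          # 'codec' is an assumed column name
    X = group['bpp'].values.reshape(-1, 1)       # 2-D feature matrix for sklearn
    y = np.log(group['ssimulacra score'].values) # fit the log of the score
    reg = linear_model.LinearRegression().fit(X, y)
    intercepts[name] = reg.intercept_

print(min(intercepts, key=intercepts.get))       # curve whose fit has the smallest intercept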

Related

How to output XGBoost output in log odds form in Python

I have a simple XGBClassifier
model = XGBClassifier()
which I use to fit a model (X are the predictive features, Y is the binary target):
model.fit(X, Y)
If I want to calculate the probabilities from the XGBClassifier model that I have just trained, then I use this code:
y_pred_proba = []
for i in range(len(X)):
    y_pred_proba.append(model.predict_proba(X.iloc[[i]]).ravel()[1])
But how do I get the log(odds)?
If I applied the following formula:
ln(odds) = ln(probability / (1-probability))
I'd get the odds ratio. But I guess you cannot convert the probabilities to odds as simply as that; you need a sigmoid function, right?
I understand that the default XGBClassifier objective function is a logistic regression. Is there a command to output the log(odds) of the XGBClassifier?
If I had fit a logistic regression like this:
from sklearn.linear_model import LogisticRegression
model_adult = LogisticRegression(max_iter=10000)
model_adult.fit(X, Y)
Then I could have generated the log(odds) output through this code:
print(model_adult.predict_log_proba(X))
Is there anything similar with XGBClassifier?
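One note: the formula you quoted is already the log(odds) (the logit, i.e. the inverse of the sigmoid), so applying it to the predicted probabilities is a valid way to recover it. A minimal sketch, assuming model and X from your code:
import numpy as np

p = model.predict_proba(X)[:, 1]    # P(class 1) for every row, vectorized
log_odds = np.log(p / (1.0 - p))    # logit = ln(p / (1 - p))
Depending on your xgboost version, model.predict(X, output_margin=True) may also return the raw pre-sigmoid margin directly.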

Initializing the derivative of a state variable during process simulation

When simulating a process using GEKKO (for example, as in Example 15 here), how would I set the initial value of the derivative of a state variable? I am using IMODE=4, but I could also use IMODE=7.
[Edit] I have fitted the parameters of an ODE model to measured input and output data using IMODE=5, and I would like to predict the model output beyond the measured time points.
Here is a modification of Problem 8 from that same link as a simple example. To initialize the derivative, create a new variable such as dydt and define a new equation that is equal to the derivative.
from gekko import GEKKO
import numpy as np
import matplotlib.pyplot as plt

m = GEKKO()
k = 10
m.time = np.linspace(0, 20, 100)
y = m.Var(value=5)        # state variable with initial condition y(0) = 5
dydt = m.Var(value=0)     # derivative as its own variable, initialized to 0
t = m.Param(value=m.time)
m.Equation(k*dydt == -t*y)   # original ODE: k dy/dt = -t y
m.Equation(dydt == y.dt())   # ties the new variable to the derivative
m.options.IMODE = 4          # dynamic simulation
m.solve(disp=False)

plt.plot(m.time, y.value, label='y')
plt.plot(m.time, dydt.value, label='dy/dt')
plt.xlabel('time'); plt.ylabel('y')
plt.legend(); plt.grid(); plt.show()
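With this pattern, the initial slope is set through the value argument of dydt; for example, dydt = m.Var(value=1) would start the simulation with dy/dt = 1 instead of 0.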
Unlike other differential algebraic equation (DAE) solvers, Gekko does not require consistent initial conditions for the states and derivatives. Gekko can also solve higher-index DAEs where the index is the number of times that constraints must be differentiated to return to ODE form.

Access Z coordinate in a LINESTRING Z in geopandas?

I have a GeoDataFrame with a LINESTRING Z geometry where Z is my altitude for the lat/long. (There are other columns in the dataframe that I deleted for ease of sharing but are relevant when displaying the resulting track)
TimeUTC Latitude Longitude AGL geometry
0 2021-06-16 00:34:04+00:00 42.835413 -70.919610 82.2 LINESTRING Z (-70.91961 42.83541 82.20000, -70...
I would like to find the maximum Z value in that linestring but I am unable to find a way to access it or extract the x,y,z values in a way that I can determine the maximum value outside of the linestring.
line.geometry.bounds only returns the x,y min/max.
The best solution I could come up with was to turn all the points into a list of tuples:
points = line.apply(lambda x: [y for y in x['geometry'].coords], axis=1)
And then find the maximum value of the third element:
from operator import itemgetter
max(points.iloc[0], key=itemgetter(2))[2]
I hope there is a better solution available.
Thank you.
You can take your lambda function approach and just take it one step further:
import numpy as np
line['geometry'].apply(lambda geom: np.max([coord[2] for coord in geom.coords]))
Here's a fully reproducible example from start to finish:
import shapely
import numpy as np
import geopandas as gpd
linestring = shapely.geometry.LineString([[0, 0, 0],
                                          [1, 1, 1],
                                          [2, 2, 2]])
gdf = gpd.GeoDataFrame({'id': [1, 2, 3],
                        'geometry': [linestring,
                                     linestring,
                                     linestring]})
gdf['max_z'] = (gdf['geometry']
                .apply(lambda geom:
                       np.max([coord[2] for coord in geom.coords])))
In the example above, I create a new column called "max_z" that stores the maximum Z value for each row.
Important note
This solution will only work if you exclusively have LineStrings in your geometries. If, for example, you have MultiLineStrings, you'll have to adapt the function I wrote to take care of that.
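For instance, a sketch of such an adaptation (assuming every part of the geometry carries Z coordinates) could be:
def max_z(geom):
    # MultiLineStrings expose their parts through .geoms;
    # plain LineStrings expose .coords directly
    if geom.geom_type == 'MultiLineString':
        return max(coord[2] for part in geom.geoms for coord in part.coords)
    return max(coord[2] for coord in geom.coords)

gdf['max_z'] = gdf['geometry'].apply(max_z)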

How come some of the lines get ignored by the Hough line function?

I'm struggling a bit to figure out how to make sure all lines get recognized by the straight-line Hough transform from the scikit-image library:
https://scikit-image.org/docs/dev/auto_examples/edges/plot_line_hough_transform.html#id3
Here below, all lines got recognized:
But if I apply the same script on a similar image, one line gets ignored after applying the Hough transform.
I have read the documentation, which says:
The Hough transform constructs a histogram array representing the parameter space (i.e., an M × N matrix, for M different values of the radius and N different values of θ). For each parameter combination, r and θ, we then find the number of non-zero pixels in the input image that would fall close to the corresponding line, and increment the array at position (r, θ) appropriately.
We can think of each non-zero pixel "voting" for potential line candidates. The local maxima in the resulting histogram indicate the parameters of the most probable lines.
So my conclusion is that the line got removed because it didn't get enough "votes" (I have tested it with different precisions of 0.05, 0.5, and 0.1 degrees, but I still got the same issue).
Here is the code:
import numpy as np
from skimage.transform import hough_line, hough_line_peaks
from skimage.feature import canny
from skimage import data, io
import matplotlib.pyplot as plt
from matplotlib import cm

# Constructing test image
image = io.imread("my_image.png")

# Classic straight-line Hough transform
# Set a precision of 0.05 degree.
tested_angles = np.linspace(-np.pi / 2, np.pi / 2, 3600)
h, theta, d = hough_line(image, theta=tested_angles)

# Generating figure 1
fig, axes = plt.subplots(1, 3, figsize=(15, 6))
ax = axes.ravel()

ax[0].imshow(image, cmap=cm.gray)
ax[0].set_title('Input image')
ax[0].set_axis_off()

ax[1].imshow(np.log(1 + h),
             extent=[np.rad2deg(theta[-1]), np.rad2deg(theta[0]), d[-1], d[0]],
             cmap=cm.gray, aspect=1/1.5)
ax[1].set_title('Hough transform')
ax[1].set_xlabel('Angles (degrees)')
ax[1].set_ylabel('Distance (pixels)')
ax[1].axis('image')

ax[2].imshow(image, cmap=cm.gray)
origin = np.array((0, image.shape[1]))
for _, angle, dist in zip(*hough_line_peaks(h, theta, d)):
    y0, y1 = (dist - origin * np.cos(angle)) / np.sin(angle)
    ax[2].plot(origin, (y0, y1), '-r')
ax[2].set_xlim(origin)
ax[2].set_ylim((image.shape[0], 0))
ax[2].set_axis_off()
ax[2].set_title('Detected lines')

plt.tight_layout()
plt.show()
How should I "catch" this line too? Any suggestions?
Shorter lines have lower accumulator values in the Hough transform, so you have to adjust the threshold appropriately. If you know how many line segments you are looking for, you can set the threshold fairly low and then limit the number of peaks detected.
Here's a condensed version of the code above, with modified threshold, for reference:
import numpy as np
from skimage.transform import hough_line, hough_line_peaks
from skimage import io, color
import matplotlib.pyplot as plt
from matplotlib import cm

# Constructing test image
image = color.rgb2gray(io.imread("my_image.png"))

# Classic straight-line Hough transform
# Set a precision of 0.05 degree.
tested_angles = np.linspace(-np.pi / 2, np.pi / 2, 3600)
h, theta, d = hough_line(image, theta=tested_angles)
hpeaks = hough_line_peaks(h, theta, d, threshold=0.2 * h.max())

fig, ax = plt.subplots()
ax.imshow(image, cmap=cm.gray)
for _, angle, dist in zip(*hpeaks):
    (x0, y0) = dist * np.array([np.cos(angle), np.sin(angle)])
    ax.axline((x0, y0), slope=np.tan(angle + np.pi/2))
plt.show()
(Note: axline requires matplotlib 3.3.)
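If you know how many segments to expect, hough_line_peaks also accepts a num_peaks argument, so a low threshold can be combined with a hard cap on the number of detections (4 here is just a placeholder):
hpeaks = hough_line_peaks(h, theta, d, threshold=0.1 * h.max(), num_peaks=4)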

Optimal parameters not found: Number of calls to function has reached maxfev = 100

I'm new to Python. I'm trying to fit the data, but the graph shows only the original data, along with the message "Optimal parameters not found: Number of calls to function has reached maxfev = 1000." Could you help me find my mistake?
%matplotlib inline
import matplotlib.pylab as m
from scipy.optimize import curve_fit
import numpy as num
import scipy.optimize as optimize

xData = num.array([0, 0, 100, 200, 250, 300, 400], dtype="float")
yData = num.array([0, 0, 0, 0, 75, 100, 100], dtype="float")
m.plot(xData, yData, 'ro', label='Datos originales')

def fun(x, a, b):
    return a + b * num.log(x)

popt, pcov = optimize.curve_fit(fun, xData, yData, p0=[1, 1], maxfev=1000)
print(popt)
x = num.linspace(1, 400, 7)
m.plot(x, fun(x, *popt), label='Función ajustada')
m.xlabel('concentración')
m.ylabel('% mortalidad')
m.legend()
m.grid()
The model in your code is "a + b * num.log(x)". Because your data contains an x value of 0.0, the evaluation of log(0.0) gives errors and will not allow the fitting software to function. Sometimes these x values of 0.0 can be replaced with very small numbers, as log(small number) will not fail - but in this case the equation and data do not appear to match and so using that technique alone would not be sufficient here.
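(For completeness, the zero-replacement trick mentioned above would be a one-liner, with 1e-6 as an arbitrary small value:)
import numpy
xSafe = numpy.where(xData == 0.0, 1e-6, xData)   # log(1e-6) is defined, log(0) is not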
My thought is that a different equation would be a better model for this data. I performed an equation search using your data and found that several different sigmoidal-type equations gave suspiciously good fits to this data set, which is not surprising given the small number of data points.
The sigmoidal equations I tried were all extremely sensitive to the initial parameter estimates. Here is a graphical Python fitter that uses scipy's Differential Evolution genetic algorithm module to determine the initial parameter estimates for curve_fit's non-linear solver. That scipy module uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, and requires bounds within which to search; here those bounds are taken from the data maximum and minimum values.
I personally would not use this fit, precisely because the small number of data points is giving such suspiciously good fits, and I strongly recommend taking additional data points if at all possible. I could not, however, find any equation with fewer than three parameters that would fit the data.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings

xData = numpy.array([0, 0, 100, 200, 250, 300, 400], dtype="float")
yData = numpy.array([0, 0, 0, 0, 75, 100, 100], dtype="float")

def func(x, a, b, c):  # Sigmoid B equation from zunzun.com
    return a / (1.0 + numpy.exp(-1.0 * (x - b) / c))

# function for the genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore")  # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)

def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)

    parameterBounds = []
    parameterBounds.append([minX, maxX])  # search bounds for a
    parameterBounds.append([minX, maxX])  # search bounds for b
    parameterBounds.append([0.0, 2.0])    # search bounds for c

    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x

# by default, differential_evolution polishes its best solution with a local minimizer
geneticParameters = generate_Initial_Parameters()

# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best-fit parameters are outside those bounds
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()

modelPredictions = func(xData, *fittedParameters)

absError = modelPredictions - yData
SE = numpy.square(absError)  # squared errors
MSE = numpy.mean(SE)         # mean squared errors
RMSE = numpy.sqrt(MSE)       # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))

print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()

##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData), 100)
    yModel = func(xModel, *fittedParameters)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    axes.set_xlabel('X Data')  # X axis data label
    axes.set_ylabel('Y Data')  # Y axis data label

    plt.show()
    plt.close('all')  # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
