Subtracting a best fit line with numpy.polyfit()? - curve-fitting

So I'm working on a project and I have a set of data that I loaded in as a csv. The data has a spot that that I need to flatten out. I used the numpy.polyfit() function to find a line of best fit, but what I can't seem to figure out is how to subtract off the best fit line. Any advice?
Here is the code I'm using so far:
μ = pd.read_csv("C:\\Users\\ander\\Documents\\Data\\plots and code\\dataframe2.csv")
yvalue = "average"
xvalue = "xvalue"
X = μ[xvalue][173:852]
Y = μ[yvalue][173:852]
fit = np.polyfit(X, Y, 1)
μ = μ.subtract(fit, μ)

The polyfit function finds the linear coefficient of the best fit. In order to subtract the line from your data, you first need to create the linear function itself. For example, you can use the numpy.poly1d function.
I'll show you an example. Since we don't have access to the .csv file I made up X and Y:
import matplotlib.pyplot as plt
import numpy as np
DATA_SIZE = 500
μ_X = np.sort(np.random.uniform(0,10,DATA_SIZE))
μ_Y = 3*np.exp(-(μ_X-7)**2) + np.random.normal(0,0.08,DATA_SIZE) + 0.5*μ_X
X = μ_X[50:200]
Y = μ_Y[50:200]
plt.scatter(μ_X, μ_Y, label='Full data')
plt.scatter(X, Y, label='Selected region')
plt.legend()
plt.show()
Now we can fit the baseline from the orange data and subtract the linear function from all the data (blue).
fit = np.polyfit(X, Y, 1)
linear_baseline = np.poly1d(fit) # create the linear baseline function
μ_Y = μ_Y - linear_baseline(μ_X) # subtract the baseline from μ_Y
plt.scatter(μ_X, μ_Y, label='Linear baseline removed')
plt.legend()
plt.show()

Related

SciPy: von Mises distribution on a half circle?

I'm trying to figure out the best way to define a von-Mises distribution wrapped on a half-circle (I'm using it to draw directionless lines at different concentrations). I'm currently using SciPy's vonmises.rvs(). Essentially, I want to be able to put in, say, a mean orientation of pi/2 and have the distribution truncated to no more than pi/2 either side.
I could use a truncated normal distribution, but I will lose the wrapping of the von-mises (say if I want a mean orientation of 0)
I've seen this done in research papers looking at mapping fibre orientations, but I can't figure out how to implement it (in python). I'm a bit stuck on where to start.
If my von Mesis is defined as (from numpy.vonmises):
np.exp(kappa*np.cos(x-mu))/(2*np.pi*i0(kappa))
with:
mu, kappa = 0, 4.0
x = np.linspace(-np.pi, np.pi, num=51)
How would I alter it to use a wrap around a half-circle instead?
Could anyone with some experience with this offer some guidance?
Is is useful to have direct numerical inverse CDF sampling, it should work great for distribution with bounded domain. Here is code sample, building PDF and CDF tables and sampling using inverse CDF method. Could be optimized and vectorized, of course
Code, Python 3.8, x64 Windows 10
import numpy as np
import matplotlib.pyplot as plt
import scipy.integrate as integrate
def PDF(x, μ, κ):
return np.exp(κ*np.cos(x - μ))
N = 201
μ = np.pi/2.0
κ = 4.0
xlo = μ - np.pi/2.0
xhi = μ + np.pi/2.0
# PDF normaliztion
I = integrate.quad(lambda x: PDF(x, μ, κ), xlo, xhi)
print(I)
I = I[0]
x = np.linspace(xlo, xhi, N, dtype=np.float64)
step = (xhi-xlo)/(N-1)
p = PDF(x, μ, κ)/I # PDF table
# making CDF table
c = np.zeros(N, dtype=np.float64)
for k in range(1, N):
c[k] = integrate.quad(lambda x: PDF(x, μ, κ), xlo, x[k])[0] / I
c[N-1] = 1.0 # so random() in [0...1) range would work right
#%%
# sampling from tabular CDF via insverse CDF method
def InvCDFsample(c, x, gen):
r = gen.random()
i = np.searchsorted(c, r, side='right')
q = (r - c[i-1]) / (c[i] - c[i-1])
return (1.0 - q) * x[i-1] + q * x[i]
# sampling test
RNG = np.random.default_rng()
s = np.empty(20000)
for k in range(0, len(s)):
s[k] = InvCDFsample(c, x, RNG)
# plotting PDF, CDF and sampling density
plt.plot(x, p, 'b^') # PDF
plt.plot(x, c, 'r.') # CDF
n, bins, patches = plt.hist(s, x, density = True, color ='green', alpha = 0.7)
plt.show()
and graph with PDF, CDF and sampling histogram
You could discard the values outside the desired range via numpy's filtering (theta=theta[(theta>=0)&(theta<=np.pi)], shortening the array of samples). So, you could first increment the number of generated samples, then filter and then take a subarray of the desired size.
Or you could add/subtract pi to put them all into that range (via theta = np.where(theta < 0, theta + np.pi, np.where(theta > np.pi, theta - np.pi, theta))). As noted by #SeverinPappadeux such changes the distribution and is probably not desired.
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
import numpy as np
from scipy.stats import vonmises
mu = np.pi / 2
kappa = 4
orig_theta = vonmises.rvs(kappa, loc=mu, size=(10000))
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(12, 4))
for ax in axes:
theta = orig_theta.copy()
if ax == axes[0]:
ax.set_title(f"$Von Mises, \\mu={mu:.2f}, \\kappa={kappa}$")
else:
theta = theta[(theta >= 0) & (theta <= np.pi)]
print(len(theta))
ax.set_title(f"$Von Mises, angles\\ filtered\\ ({100 * len(theta) / (len(orig_theta)):.2f}\\ \\%)$")
segs = np.zeros((len(theta), 2, 2))
segs[:, 1, 0] = np.cos(theta)
segs[:, 1, 1] = np.sin(theta)
line_segments = LineCollection(segs, linewidths=.1, colors='blue', alpha=0.5)
ax.add_collection(line_segments)
ax.autoscale()
ax.set_aspect('equal')
plt.show()

Optimal parameters not found: Number of calls to function has reached maxfev = 100

I'm new to python, I try to give some adjustment to the data, but when I get the graph, only the original data appears and with the message "Optimal parameters not found: Number of calls to function has reached maxfev = 1000." Could you help me find my mistake?
%matplotlib inline
import matplotlib.pylab as m
from scipy.optimize import curve_fit
import numpy as num
import scipy.optimize as optimize
xData=num.array([0,0,100,200,250,300,400], dtype="float")
yData=num.array([0,0,0,0,75,100,100], dtype="float")
m.plot(xData, yData, 'ro', label='Datos originales')
def fun(x, a, b):
return a + b * num.log(x)
popt,pcov=optimize.curve_fit(fun, xData, yData,p0=[1,1], maxfev=1000)
print=popt
x=num.linspace(1,400,7)
m.plot(x,fun(x, *popt), label='Función ajustada')
m.xlabel('concentración')
m.ylabel('% mortalidad')
m.legend()
m.grid()
The model in your code is "a + b * num.log(x)". Because your data contains an x value of 0.0, the evaluation of log(0.0) gives errors and will not allow the fitting software to function. Sometimes these x values of 0.0 can be replaced with very small numbers, as log(small number) will not fail - but in this case the equation and data do not appear to match and so using that technique alone would not be sufficient here.
My thought is that a different equation would be a better model for this data. I performed an equation search using your data, and found that several different sigmoidal type equations gave suspiciously good fits to this data set - which is not surprising because of the small number of data points.
The sigmoidal equations I tried were all extremely sensitive to the initial parameter estimates. Here is a graphical Python fitter using scipy's Differential Evolution genetic algorithm module to determine the initial parameter estimates for curve_fit's non-linear solver. That scipy module uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, requiring bounds within which to search. Here those bounds are taken from the data maximum and minimun values.
I personally would not use this fit precisely because the small number of data points is giving such suspiciously good fits, and strongly recommend taking additional data points if at all possible. I could however not find any equations with less than three parameters that would fit the data.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings
xData=numpy.array([0,0,100,200,250,300,400], dtype="float")
yData=numpy.array([0,0,0,0,75,100,100], dtype="float")
def func(x, a, b, c): # Sigmoid B equation from zunzun.com
return a / (1.0 + numpy.exp(-1.0 * (x - b) / c))
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
val = func(xData, *parameterTuple)
return numpy.sum((yData - val) ** 2.0)
def generate_Initial_Parameters():
# min and max used for bounds
maxX = max(xData)
minX = min(xData)
parameterBounds = []
parameterBounds.append([minX, maxX]) # search bounds for a
parameterBounds.append([minX, maxX]) # search bounds for b
parameterBounds.append([0.0, 2.0]) # search bounds for c
# "seed" the numpy random number generator for repeatable results
result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
return result.x
# by default, differential_evolution completes by calling curve_fit() using parameter bounds
geneticParameters = generate_Initial_Parameters()
# now call curve_fit without passing bounds from the genetic algorithm,
# just in case the best fit parameters are aoutside those bounds
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Fitted parameters:', fittedParameters)
print()
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData), 100)
yModel = func(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

How do I perform a curve fit with an array of points and touching a specific point in that array

I need help with curve fitting a given set of points. The points form a parabola and I ought to find the peak point of the result. Issue is when I do a curve fit, it sometimes doesn't touch the max y-coordinate even if the actual point is given in the input array.
Following is the code snippet. Here 1.88 is the actual peak y-coordinate (13.05,1.88). But the graph generated by the code does not touch the point due to curve fitting. So is there a way to fit the curve making sure that it touches the max point given in the input array?
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit, minimize_scalar
fig = plt.gcf()
#fig.set_size_inches(18.5, 10.5)
x = [4.59,9.02,13.05,18.47,20.3]
y = [1.7,1.84,1.88,1.7,1.64]
def f(x, p1, p2, p3):
return p3*(p1/((x-p2)**2 + (p1/2)**2))
plt.plot(x,y,"ro")
popt, pcov = curve_fit(f, x, y)
# find the peak
fm = lambda x: -f(x, *popt)
r = minimize_scalar(fm, bounds=(1, 5))
print( "maximum:", r["x"], f(r["x"], *popt) ) #maximum: 2.99846874275 18.3928199902
plt.text(1,1.9,'maximum '+str(round(r["x"],2))+'( #'+str(round(f(r["x"], *popt),2)) + ' )')
x_curve = np.linspace(min(x), max(x), 50)
plt.plot(x_curve, f(x_curve, *popt))
plt.plot(r['x'], f(r['x'], *popt), 'ko')
plt.show()
Here is a graphical code example using your equation with weighted fitting, where I have made the max point larger to more easily see the effect of the weighting. In non-weighted curve fitting, all weights are implicitly 1.0 as all data points have equal weight. Scipy's curve_fit routine uses weights in the form of uncertainties, so that giving a point a very small uncertainty (which I have done) is like giving the point a very large weight. This technique can be used to make a fit pass arbitrarily close to any single data point by any software that can perform weghted fitting.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = [4.59,9.02,13.05,18.47,20.3]
y = [1.7,1.84,2.0,1.7,1.64]
# note the single very small uncertainty - try making this value 1.0
uncertainties = numpy.array([1.0, 1.0, 1.0E-6, 1.0, 1.0])
# rename data to use previous example
xData = numpy.array(x)
yData = numpy.array(y)
def func(x, p1, p2, p3):
return p3*(p1/((x-p2)**2 + (p1/2)**2))
# these are the same as the scipy defaults
initialParameters = numpy.array([1.0, 1.0, 1.0])
# curve fit the test data, first without uncertainties to
# get us closer to initial starting parameters
ssqParameters, pcov = curve_fit(func, xData, yData, p0 = initialParameters)
# now that we have better starting parameters, use uncertainties
fittedParameters, pcov = curve_fit(func, xData, yData, p0 = ssqParameters, sigma=uncertainties, absolute_sigma=True)
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('Parameters:', fittedParameters)
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = func(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

Results from my thin plate spline interpolation implementation are dependant of the independent variables

I implemented the thin plate spline algorithm (see also this description) in order to interpolate scattered data using Python.
My algorithm seems to work correctly when the bounding box of the initial scattered data has an aspect ratio close to 1. However, scaling one of the data points coordinates changes the interpolation result. I created a minimal working example that is representative of what I am trying to accomplish. Below are two plots showing the results of the interpolation of 50 random points.
First, the interpolation of z = x^2 on the domain x = [0, 3], y = [0, 120]:
As you can see, the interpolation fails. Now, executing the same process but after scaling the x values by a factor of 40, I get:
This time, the result looks better. Choosing a slightly different scaling factor would have resulted in a slightly different interpolation. This shows that something is wrong in my algorithm but I can't find what exactly. Here is the algorithm:
import numpy as np
import numba as nb
# pts1 = Mx2 matrix (original coordinates)
# z1 = Mx1 column vector (original values)
# pts2 = Nx2 matrix (interpolation coordinates)
def gen_K(n, pts1):
K = np.zeros((n,n))
for i in range(0,n):
for j in range(0,n):
if i != j:
r = ( (pts1[i,0] - pts1[j,0])**2.0 + (pts1[i,1] - pts1[j,1])**2.0 )**0.5
K[i,j] = r**2.0*np.log(r)
return K
def compute_z2(m, n, pts1, pts2, coeffs):
z2 = np.zeros((m,1))
x_min = np.min(pts1[:,0])
x_max = np.max(pts1[:,0])
y_min = np.min(pts1[:,1])
y_max = np.max(pts1[:,1])
for k in range(0,m):
pt = pts2[k,:]
# If point is located inside bounding box of pts1
if (pt[0] >= x_min and pt[0] <= x_max and pt[1] >= y_min and pt[1] <= y_max):
z2[k,0] = coeffs[-3,0] + coeffs[-2,0]*pts2[k,0] + coeffs[-1,0]*pts2[k,1]
for i in range(0,n):
r2 = ( (pts1[i,0] - pts2[k,0])**2.0 + (pts1[i,1] - pts2[k,1])**2.0 )**0.5
if r2 != 0:
z2[k,0] += coeffs[i,0]*( r2**2.0*np.log(r2) )
else:
z2[k,0] = np.nan
return z2
gen_K_nb = nb.jit(nb.float64[:,:](nb.int64, nb.float64[:,:]), nopython = True)(gen_K)
compute_z2_nb = nb.jit(nb.float64[:,:](nb.int64, nb.int64, nb.float64[:,:], nb.float64[:,:], nb.float64[:,:]), nopython = True)(compute_z2)
def TPS(pts1, z1, pts2, factor):
n, m = pts1.shape[0], pts2.shape[0]
P = np.hstack((np.ones((n,1)),pts1))
Y = np.vstack((z1, np.zeros((3,1))))
K = gen_K_nb(n, pts1)
K += factor*np.identity(n)
L = np.zeros((n+3,n+3))
L[0:n, 0:n] = K
L[0:n, n:n+3] = P
L[n:n+3, 0:n] = P.T
L_inv = np.linalg.inv(L)
coeffs = L_inv.dot(Y)
return compute_z2_nb(m, n, pts1, pts2, coeffs)
Finally, here is the code snippet I used to create the two plots:
import matplotlib.pyplot as plt
import numpy as np
N = 50 # Number of random points
pts = np.random.rand(N,2)
pts[:,0] *= 3.0 # initial x values
pts[:,1] *= 120.0 # initial y values
z1 = (pts[:,0])**2.0
for scale in [1.0, 40.0]:
pts1 = pts.copy()
pts1[:,0] *= scale
x2 = np.linspace(np.min(pts1[:,0]), np.max(pts1[:,0]), 40)
y2 = np.linspace(np.min(pts1[:,1]), np.max(pts1[:,1]), 40)
x2, y2 = np.meshgrid(x2, y2)
pts2 = np.vstack((x2.flatten(), y2.flatten())).T
z2 = TPS(pts1, z1.reshape(z1.shape[0], 1), pts2, 0.0)
# Display
fig = plt.figure(figsize=(4,3))
ax = fig.add_subplot(111)
C = ax.contourf(x2, y2, z2.reshape(x2.shape), np.linspace(0,9,10), extend='both')
ax.plot(pts1[:,0], pts1[:,1], 'ok')
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.colorbar(C, extendfrac=0)
plt.tight_layout()
plt.show()
Thin Plate Spline is scalar invariant, which means if you scale x and y by the same factor, the result should be the same. However, if you scale x and y differently, then the result will be different. This is common characteristics among radial basis functions. Some radial basis functions are not even scalar invariant.
When you say it "fails", what do you mean? The big question is, does it still exactly interpolate at the construction points? Assuming your code is correct and you do not have ill-conditioning, it should in which case it does not fail.
What I think is happening is that the addition of the scale is making the behavior in the x direction more dominant so you do not see the wiggles that come naturally from the interpolation.
As an aside, you can greatly speed up your code without using Numba by vectorizing.
import scipy.spatial.distance
import scipy.special
def gen_K(n,pts1):
# No need for n but kept to maintain compatability
pts1 = np.atleast_2d(pts1)
r = scipy.spatial.distance.cdist(pts1,pts1)
return scipy.special.xlogy(r**2,r)
It means you will get horrible ridges running through the surface. Resulting in a sub-optimal model fit. Read the caption below the images. Your model is experiencing the same effect, although plotted in 2D.

Can I do something like imsave() with text overlay?

I am using imsave() sequentially to make many PNGs that I will combine as an AVI and I would like to add moving text annotations. I use ImageJ to make AVIs or GIFs.
I don't want the axes, numbers, borders or anything, just the color image (as imsave() provides for example) with text (and maybe arrows) inside. These will change frame by frame. Pardon the use of jet.
I could use savefig() with ticks off and then do cropping as post processing, but is there a more convenient, direct, or "matplotlibithic" way to do this that wouldn't be so hard on my hard drive? (final thing will be pretty big).
A code snippet, added by request:
import numpy as np
import matplotlib.pyplot as plt
nx, ny = 101, 101
phi = np.zeros((ny, nx), dtype = 'float')
do_me = np.ones_like(phi, dtype='bool')
x0, y0, r0 = 40, 65, 12
x = np.arange(nx, dtype = 'float')[None,:]
y = np.arange(ny, dtype = 'float')[:,None]
rsq = (x-x0)**2 + (y-y0)**2
circle = rsq <= r0**2
phi[circle] = 1.0
do_me[circle] = False
do_me[0,:], do_me[-1,:], do_me[:,0], do_me[:,-1] = False, False, False, False
n, nper = 100, 100
phi_hold = np.zeros((n+1, ny, nx))
phi_hold[0] = phi
for i in range(n):
for j in range(nper):
phi2 = 0.25*(np.roll(phi, 1, axis=0) +
np.roll(phi, -1, axis=0) +
np.roll(phi, 1, axis=1) +
np.roll(phi, -1, axis=1) )
phi[do_me] = phi2[do_me]
phi_hold[i+1] = phi
change = phi_hold[1:] - phi_hold[:-1]
places = [(32, 20), (54,25), (11,32), (3, 12)]
plt.figure()
plt.imshow(change[50])
for (x, y) in places:
plt.text(x, y, "WOW", fontsize=16)
plt.text(5, 95, "Don't use Jet!", color="white", fontsize=20)
plt.show()
Method 1
Using an excellent answer to another question as a reference, I came up with the following simplified variant which seems to work nicely - just make sure the figsize (which is given in inches) aspect ratio matches the size ratio of the plot data:
import numpy as np
import matplotlib.pyplot as plt
test_image = np.eye(100)
fig = plt.figure(figsize=(4,4))
ax = plt.axes(frameon=False, xticks=[],yticks=[])
ax.imshow(test_image)
plt.savefig('test.png', bbox_inches='tight', pad_inches=0)
Note that I am using imshow with a test_image, which might behave differently from other plotting functions... please let me know in a comment in case you'd like to do something else.
Also note that the image will be (re-) sampled, so the figsize will influence the resolution of the written image.
As pointed out in the comments, the figsize setting doesn't match the size of the output image (or the size on screen, for that matter). To overcome this, use...
Method 2
Reading the FAQ entry Move the edge of an axes to make room for tick labels, I found a way to make the figsize parameter set the output image size directly, by moving the axes' ticks out of the visible area:
import numpy as np
import matplotlib.pyplot as plt
test_image = np.eye(100)
fig = plt.figure(figsize=(4,4))
ax = fig.add_axes([0,0,1,1])
ax.imshow(test_image)
plt.savefig('test.png')
Note that savefig has a default DPI setting (100 in my case) which - in combination with figsize - determines the number of pixels in x and y directions of the saved image. You can override this with the dpi keyword argument to savefig.
If you want to display the image on screen rather than saving it (by using plt.show() instead of the plt.savefig line in the code above), the size of the figure is dependent on (apart from the already familiar figsize parameter) the figure's DPI setting, which also has a default (80 on my system). This value can be overridden by passing the dpi keyword argument to the plt.figure() call.

Resources