How to calculate the variance using the least squares method if the regression line is perpendicular to the x-axis - curve-fitting

I have some data in a 2D plane. I want to calculate their variance, but the regression line could be perpendicular to the x-axis (i.e., vertical). What's the proper way to obtain the variance?

If you are fitting a line to data but are okay with it being vertical, this looks less like linear regression (which assumes y is a function of x) and more like principal component analysis (PCA), which can be done with scikit-learn as follows, computing the variance along the way.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# simulated 2D data with much more variance along y than along x
data = np.random.multivariate_normal([1, 1], [[0.2, 0], [0, 4]], size=100)
pca = PCA(n_components=2)
pca.fit(data)
print('Residual variance', pca.explained_variance_[1])
I fit two components here, which together explain all the variance since the data is 2D. The first component is the line that takes the place of the regression line in this model. The second is the direction of the residuals, so the residual variance comes from there. Visualization:
# direction of the first principal component, which plays the role of the regression line
line_direction = pca.components_[0]
M = np.abs(data).max()
t = np.linspace(-M, M)
center = data.mean(axis=0)
line = line_direction*t[:, None] + center
plt.plot(line[:, 0], line[:, 1], 'r')
plt.plot(data[:, 0], data[:, 1], '.')
plt.gca().set_aspect('equal', 'datalim')
plt.show()
The simulated data is random, but a typical run prints something like this (along with the corresponding plot):
Residual variance 0.23184791439896069
This is the variance in the direction perpendicular to the chosen line, not in the vertical direction (which wouldn't be appropriate here).
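As a sanity check, the same residual variance can be computed directly by projecting the centered data onto the second principal component; a minimal sketch reusing the data and pca objects from above:
# project the centered data onto the second (residual) component;
# its sample variance matches pca.explained_variance_[1]
residuals = (data - data.mean(axis=0)) @ pca.components_[1]
print('Residual variance (by projection):', residuals.var(ddof=1))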
Related: Total least squares

Related

ValueError: Contour levels must be increasing - Contour plot in Python

I am trying to plot density estimates using a contour plot and getting the following error.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

A = np.random.uniform(size=(100, 2))
#mean = np.mean(x)
#cov = np.cov(x)
mean = np.array([0.5, 0.1])
cov = np.array([[0.1, 0.0], [0.0, 1.5]])
B = multivariate_normal.pdf(A, mean=mean, cov=cov)
# visualize
contours = plt.contour(A, B, linewidths=2)
plt.clabel(contours, inline=True, fontsize=12)
#plt.plot(x, y)
plt.colorbar();
For anyone who hits this problem in seaborn in the future: I discovered that my data had some extreme outliers, meaning there was effectively no density to plot, as 99% of the samples were around the origin. Using the 'clip' functionality in kdeplot reduced the axis range and thus plotted the actual levels.
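A minimal sketch of that idea, assuming a 2D sample xs, ys where almost all points sit near the origin and a few are extreme outliers; the clip bounds below are illustrative:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
xs = np.concatenate([rng.normal(0, 0.1, 990), rng.uniform(50, 100, 10)])
ys = np.concatenate([rng.normal(0, 0.1, 990), rng.uniform(50, 100, 10)])

# restrict the KDE evaluation to the region that actually contains density
sns.kdeplot(x=xs, y=ys, clip=((-1, 1), (-1, 1)), fill=True)
plt.show()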

How do I perform a curve fit with an array of points and touching a specific point in that array

I need help with curve fitting a given set of points. The points form a parabola-like shape, and I need to find the peak point of the result. The issue is that when I do a curve fit, the fitted curve sometimes doesn't touch the maximum y-coordinate, even though that point is in the input array.
The code snippet follows. Here 1.88 is the actual peak y-coordinate, at (13.05, 1.88), but the curve generated by the fit does not pass through that point. Is there a way to fit the curve so that it touches the maximum point given in the input array?
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit, minimize_scalar
fig = plt.gcf()
#fig.set_size_inches(18.5, 10.5)
x = [4.59,9.02,13.05,18.47,20.3]
y = [1.7,1.84,1.88,1.7,1.64]
def f(x, p1, p2, p3):
    return p3*(p1/((x-p2)**2 + (p1/2)**2))
plt.plot(x,y,"ro")
popt, pcov = curve_fit(f, x, y)
# find the peak
fm = lambda x: -f(x, *popt)
r = minimize_scalar(fm, bounds=(1, 5))
print( "maximum:", r["x"], f(r["x"], *popt) ) #maximum: 2.99846874275 18.3928199902
plt.text(1,1.9,'maximum '+str(round(r["x"],2))+'( #'+str(round(f(r["x"], *popt),2)) + ' )')
x_curve = np.linspace(min(x), max(x), 50)
plt.plot(x_curve, f(x_curve, *popt))
plt.plot(r['x'], f(r['x'], *popt), 'ko')
plt.show()
Here is a graphical code example using your equation with weighted fitting, where I have made the max point larger to more easily see the effect of the weighting. In non-weighted curve fitting, all weights are implicitly 1.0, as all data points have equal weight. Scipy's curve_fit routine takes weights in the form of uncertainties, so giving a point a very small uncertainty (which I have done) is like giving the point a very large weight. This technique can be used to make a fit pass arbitrarily close to any single data point with any software that can perform weighted fitting.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = [4.59,9.02,13.05,18.47,20.3]
y = [1.7,1.84,2.0,1.7,1.64]
# note the single very small uncertainty - try making this value 1.0
uncertainties = numpy.array([1.0, 1.0, 1.0E-6, 1.0, 1.0])
# rename data to use previous example
xData = numpy.array(x)
yData = numpy.array(y)
def func(x, p1, p2, p3):
    return p3*(p1/((x-p2)**2 + (p1/2)**2))
# these are the same as the scipy defaults
initialParameters = numpy.array([1.0, 1.0, 1.0])
# curve fit the test data, first without uncertainties to
# get us closer to initial starting parameters
ssqParameters, pcov = curve_fit(func, xData, yData, p0 = initialParameters)
# now that we have better starting parameters, use uncertainties
fittedParameters, pcov = curve_fit(func, xData, yData, p0 = ssqParameters, sigma=uncertainties, absolute_sigma=True)
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('Parameters:', fittedParameters)
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label

    plt.show()
    plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
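To recover the peak that the original question asked about, the asker's minimize_scalar idea can be reused on the weighted-fit parameters; a minimal sketch, assumed to run after the code above, with the search bounded to the range of the data:
from scipy.optimize import minimize_scalar

# maximize the fitted curve by minimizing its negative within the data range
fm = lambda x: -func(x, *fittedParameters)
r = minimize_scalar(fm, bounds=(min(xData), max(xData)), method='bounded')
print('peak at x =', r.x, ' y =', func(r.x, *fittedParameters))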

Getting the coordinates for the "hottest areas" on a numpy heatmap

I've got a heat map numpy array with shape (600,400). The heatmap represents probabilities of detection. In my case, the probability of face detections in an image. My goal is to take this heatmap and get the coordinates (X and Y) where the highest probability occurs.
I've solved this for the case of a single face. The code for that is the following:
face_location = np.unravel_index(heatmap.argmax(), heatmap.shape)
print("Face location: " + str(face_location))
But in some cases there are multiple faces. I don't know how to adjust the algorithm to return multiple "hottest areas". The issue is that any one hot area will be surrounded by gradually less hot areas, so it's possible that after the hottest point, the next top 10 values will all be right beside the initial point.
How can I adjust the algorithm to look for multiple hot areas? It's ok to assume that they won't be right beside each other.
heatmap = [[ 2.00299415e-04 2.03753079e-04 8.17560707e-04 ..., 2.23556344e-04
1.98958180e-04 9.92935777e-01]
[ 2.00642273e-04 2.04473894e-04 8.19963054e-04 ..., 2.24148811e-04
1.99438742e-04 9.92921114e-01]
[ 2.01056406e-04 2.05344462e-04 8.22864589e-04 ..., 2.24864416e-04
2.00019145e-04 9.92903233e-01]
...,
[ 7.28193991e-05 -2.73474743e-05 2.95096161e-05 ..., 5.96550672e-05
1.98282614e-05 9.99637246e-01]
[ 7.34055429e-05 -2.72389279e-05 3.02382941e-05 ..., 5.98490733e-05
2.04356711e-05 9.99619305e-01]
[ 7.37556256e-05 -2.71740992e-05 3.06735128e-05 ..., 5.99649393e-05
2.07984649e-05 9.99608397e-01]]
Perhaps consider using a mask array with a threshold probability defining the hot areas?
In [29]: threshold_probability = 0.8
In [30]: prng = np.random.RandomState(42)
In [31]: heatmap = prng.rand(600, 400)
In [32]: heatmap
Out[32]:
array([[ 0.37454012, 0.95071431, 0.73199394, ..., 0.42899403,
0.75087107, 0.75454287],
[ 0.10312387, 0.90255291, 0.50525237, ..., 0.56513318,
0.69665082, 0.92249938],
[ 0.70723863, 0.15253904, 0.57628836, ..., 0.96887786,
0.74965183, 0.13008624],
...,
[ 0.77669933, 0.98757844, 0.72686576, ..., 0.149866 ,
0.6685433 , 0.90248875],
[ 0.116007 , 0.96352904, 0.33109138, ..., 0.85776718,
0.88838363, 0.00901272],
[ 0.30810176, 0.43190563, 0.60935151, ..., 0.07498895,
0.60716006, 0.31712892]])
In [33]: hottest_areas = np.ma.MaskedArray(heatmap, heatmap < threshold_probability)
In [34]: X, Y = hottest_areas.nonzero()
In [35]: X
Out[35]: array([ 0, 0, 0, ..., 599, 599, 599])
In [36]: Y
Out[36]: array([ 1, 7, 11, ..., 376, 388, 394])
The result is a pair of arrays containing the indices of the values for which the boolean condition defining the mask is False (i.e., areas for which the probability of a face is higher than the threshold). Note that nonzero returns (row, column) indices, so X here indexes rows and Y indexes columns.
If you want to go with a threshold as davidrpugh proposed, I have a different approach to suggest.
Instead of finding the non-zero elements, find the connected components of your binary image.
import numpy as np
from scipy.ndimage import label  # scipy.ndimage.measurements is the older, deprecated path
from skimage.measure import regionprops

heatmap = np.random.rand(100, 25)
thresh = 0.9

# zero out everything below the threshold to get a binary-like image
bw = np.array(heatmap)
bw[bw < thresh] = 0

# label connected components and take the centroid of each one
img_cc, nb_cc = label(bw)
cc = regionprops(img_cc)
face_location = np.array([c.centroid for c in cc])
import matplotlib.pyplot as plt
plt.figure()
plt.imshow(heatmap)
plt.plot(face_location[:, 1], face_location[:, 0], 'r*')
plt.figure()
plt.imshow(img_cc)
plt.plot(face_location[:, 1], face_location[:, 0], 'r*')
plt.show()
The face locations here are defined by the centroids of the connected components, but you can look for the maximum of each region in the image instead.
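A minimal sketch of that variant, reusing heatmap, img_cc and nb_cc from the code above; scipy's maximum_position returns the (row, column) of the largest value inside each labelled region:
from scipy.ndimage import maximum_position

# (row, col) of the hottest pixel inside each connected component
peaks = maximum_position(heatmap, labels=img_cc, index=range(1, nb_cc + 1))
print(peaks)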

Visualize distance matrix as a graph

I am doing a clustering task and I have a distance matrix. I wish to visualize this distance matrix as a 2D graph. Please let me know if there is any way to do it online or in programming languages like R or python.
My distance matrix is as follows,
I used the classical multidimensional scaling functionality (in R) and obtained a 2D plot. But what I am looking for is a graph with nodes and weighted edges running between them.
Possibility 1
I assume that you want a 2-dimensional graph, where the distances between node positions are the same as provided by your table.
In Python, you can use networkx for such applications. In general there are many methods of doing so; remember that all of them are just approximations (in general it is not possible to create an exact 2-dimensional representation of points given only their pairwise distances). They are some kind of stress-minimization (or energy-minimization) approximation, trying to find a "reasonable" representation with distances similar to those provided.
As an example, consider four points with the discrete metric applied (every pair at distance 1):
   p1 p2 p3 p4
   -----------
p1  0  1  1  1
p2  1  0  1  1
p3  1  1  0  1
p4  1  1  1  0
In general, drawing an actual "graph" is redundant, as you have a fully connected one (each pair of nodes is connected), so it should be sufficient to draw just the points.
Python example
import networkx as nx
import numpy as np
import string

# store each distance as an edge attribute named 'len', which neato honours
dt = [('len', float)]
A = np.array([(0, 0.3, 0.4, 0.7),
              (0.3, 0, 0.9, 0.2),
              (0.4, 0.9, 0, 0.1),
              (0.7, 0.2, 0.1, 0)
              ])*10
A = A.view(dt)

# in older networkx these calls were nx.from_numpy_matrix and nx.to_agraph
G = nx.from_numpy_array(A)
G = nx.relabel_nodes(G, dict(zip(range(len(G.nodes())), string.ascii_uppercase)))
G = nx.nx_agraph.to_agraph(G)

G.node_attr.update(color="red", style="filled")
G.edge_attr.update(color="blue", width="2.0")
G.draw('distances.png', format='png', prog='neato')
In R you can try multidimensional scaling
# Classical MDS
# N rows (objects) x p columns (variables)
# each row identified by a unique row name
d <- dist(mydata) # euclidean distances between the rows
fit <- cmdscale(d,eig=TRUE, k=2) # k is the number of dim
fit # view results
# plot solution
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
main="Metric MDS", type="n")
text(x, y, labels = row.names(mydata), cex=.7)
Possibility 2
You just want to draw a graph with labeled edges
Again, networkx can help:
import networkx as nx
import matplotlib.pyplot as plt

# Create a graph
G = nx.Graph()

# distances
D = [[0, 1], [1, 0]]

labels = {}
for n in range(len(D)):
    for m in range(len(D)-(n+1)):
        G.add_edge(n, n+m+1)
        labels[(n, n+m+1)] = str(D[n][n+m+1])

pos = nx.spring_layout(G)
nx.draw(G, pos)
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, font_size=30)
plt.show()
Multidimensional scaling (MDS) is exactly what you want. See here and here for more.
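A minimal Python sketch of the MDS route, assuming a small symmetric distance matrix D as input (scikit-learn's MDS accepts precomputed dissimilarities):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

D = np.array([[0.0, 0.3, 0.4, 0.7],
              [0.3, 0.0, 0.9, 0.2],
              [0.4, 0.9, 0.0, 0.1],
              [0.7, 0.2, 0.1, 0.0]])

# embed the points in 2D so that pairwise distances approximate D
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)

plt.scatter(coords[:, 0], coords[:, 1])
for i, (x, y) in enumerate(coords):
    plt.annotate('p%d' % (i + 1), (x, y))
plt.show()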
You did not mention whether you want a 2-dimensional graph or not. I suppose that you want to build the graph in 2 dimensions because you need it for visualization. Be aware that for most graphs an exact 2-dimensional embedding is simply not possible.
What can probably be done is to approximate the values from the distance matrix, so that small values produce relatively short edges and big values produce relatively long ones.
With all the previous considerations in mind, one option would be graphviz; see the neato program.
In general, what you are interested in is force-directed graph drawing. See Wikipedia for further reference.
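A minimal sketch of a distance-respecting force-directed layout in Python, reusing the 4x4 matrix from the earlier example; networkx's kamada_kawai_layout accepts per-pair target distances through its dist argument:
import networkx as nx
import matplotlib.pyplot as plt

A = [[0.0, 0.3, 0.4, 0.7],
     [0.3, 0.0, 0.9, 0.2],
     [0.4, 0.9, 0.0, 0.1],
     [0.7, 0.2, 0.1, 0.0]]

G = nx.complete_graph(len(A))
# two-level dict of target distances between every pair of nodes
dist = {i: {j: A[i][j] for j in range(len(A))} for i in range(len(A))}

pos = nx.kamada_kawai_layout(G, dist=dist)
nx.draw(G, pos, with_labels=True, node_color='lightblue')
plt.show()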
You can use a d3.js force-directed graph and configure the distance between nodes. The d3.js force layout has some clustering capability to separate nodes with similar distances. Here's an example with the values used as distances between nodes:
http://vida.io/documents/SyT7DREdQmGSpsBkK
Another way to visualize it is to use the same distance between nodes but a different line thickness. In that case, you would calculate the stroke-width based on the values:
.style("stroke-width", function(d) { return Math.sqrt(d.value / 50); });

Evaluating/Fitting an ellipse from scattered points

Here is the deal: I have multiple points (X, Y) that form an 'ellipse-like' shape.
I would like to evaluate/fit the 'best' ellipse possible and get its properties (a,b,F1,F2), or just the center of the ellipse.
Any ideas/leads would be appreciated.
Gilad.
There's a Matlab function fit_ellipse that can do the job. There's also this paper on methods for orthogonal distance fitting of ellipses. A web search for orthogonal ellipse fit will probably turn up a lot of other resources as well.
The ellipse fitting method proposed by:
Z. L. Szpak, W. Chojnacki, and A. van den Hengel.
Guaranteed ellipse fitting with a confidence region and an uncertainty measure for centre, axes, and orientation.
J. Math. Imaging Vision, 2015.
may be of interest to you. They provide estimates of both algebraic and geometric ellipse
parameters, together with covariance matrices that express the uncertainty of the parameter estimates.
They also provide a means of computing a planar 95% confidence region associated with the estimate
that allows one to visualise the uncertainty in the ellipse fit.
A pre-print version of the paper is available on the authors' websites (http://cs.adelaide.edu.au/~wojtek/publicationsWC.html).
A MATLAB implementation of the method is also available for download:
https://sites.google.com/site/szpakz/source-code/guaranteed-ellipse-fitting-with-a-confidence-region-and-an-uncertainty-measure-for-centre-axes-and-orientation
I will explain how I would approach the problem. I would suggest a hill-climbing approach. First compute the center of gravity of the points as a starting point and choose two values for a and b in some way (arbitrary positive values will probably do). You need a fit function, and I would suggest it return the number of points lying (close enough to) on a given ellipse:
int fit(x, y, a, b)
    int res := 0
    for point in points
        if point_almost_on_ellipse(x, y, a, b, point)
            res := res + 1
        end_if
    end_for
    return res
Now start with some step size. I would choose a value big enough to be sure the best center of the ellipse will never be more than one step away from the starting point. Choosing such a big value is not strictly necessary, but the slowest part of the algorithm is the time it takes to get close to the best center, so a bigger value is better, I think.
So now we have some initial point (x, y), some initial values of a and b, and an initial step. The algorithm iteratively moves to the best of the neighbours of the current position if any neighbour is better than it, or halves the step otherwise. Here 'best' is measured with the fit function. A position is defined by four values (x, y, a, b), and it has 8 neighbours: (x+-step, y, a, b), (x, y+-step, a, b), (x, y, a+-step, b), (x, y, a, b+-step) (if the results are not good enough you can add more neighbours by also going diagonally, for instance (x+-step, y+-step, a, b) and so on). Here is how you do that:
neighbours = [[-1, 0, 0, 0], [1, 0, 0, 0], [0, -1, 0, 0], [0, 1, 0, 0],
              [0, 0, -1, 0], [0, 0, 1, 0], [0, 0, 0, -1], [0, 0, 0, 1]]

iterate (cx, cy, ca, cb, step)
    current_fit = fit(cx, cy, ca, cb)
    best_neighbour = []
    best_fit = current_fit
    for neighbour in neighbours
        tx = cx + neighbour[0]*step
        ty = cy + neighbour[1]*step
        ta = ca + neighbour[2]*step
        tb = cb + neighbour[3]*step
        tfit = fit(tx, ty, ta, tb)
        if (tfit > best_fit)
            best_fit = tfit
            best_neighbour = [tx, ty, ta, tb]
        end_if
    end_for
    if best_neighbour.size == 4
        cx := best_neighbour[0]
        cy := best_neighbour[1]
        ca := best_neighbour[2]
        cb := best_neighbour[3]
    else
        step := step * 0.5
    end_if
You continue iterating until the value of step is smaller than a given threshold (for instance 1e-6). I have written everything in pseudocode as I am not sure which language you want to use.
It is not guaranteed that the answer found this way will be optimal, but I am pretty sure it will be a good enough approximation.
Here is an article about hill climbing.
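For concreteness, here is a minimal Python sketch of the same idea, restricted to an axis-aligned ellipse; the names fit_score and hill_climb and the tolerance tol are illustrative choices, not part of the original answer:
import numpy as np

def fit_score(points, x, y, a, b, tol=0.05):
    # number of points lying close enough to the axis-aligned ellipse
    # centred at (x, y) with semi-axes a and b
    vals = ((points[:, 0] - x) / a)**2 + ((points[:, 1] - y) / b)**2
    return int(np.sum(np.abs(vals - 1.0) < tol))

def hill_climb(points, step=10.0, min_step=1e-6):
    # start at the centre of gravity with arbitrary positive semi-axes
    x, y = points.mean(axis=0)
    a = b = 1.0
    moves = []
    for i in range(4):
        for s in (-1.0, 1.0):
            d = [0.0, 0.0, 0.0, 0.0]
            d[i] = s
            moves.append(d)
    while step > min_step:
        best = fit_score(points, x, y, a, b)
        best_pos = None
        for dx, dy, da, db in moves:
            cand = (x + dx*step, y + dy*step,
                    max(a + da*step, 1e-9), max(b + db*step, 1e-9))
            score = fit_score(points, *cand)
            if score > best:
                best, best_pos = score, cand
        if best_pos is None:
            step *= 0.5          # no better neighbour: refine the step
        else:
            x, y, a, b = best_pos
    return x, y, a, b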
I think the Wild Magic library contains a function for ellipse fitting. There is an article with a description of the method.
The problem is to define "best". What is best in your case? The ellipse with the smallest area that contains n% of the points?
If you define "best" in terms of probability, you can simply use the covariance matrix of your points, and compute the error ellipse.
An error ellipse for this "multivariate Gaussian distribution" would then contain the points corresponding to whatever confidence interval you decide.
Many computing packages can compute the covariance, with its corresponding eigenvalues and eigenvectors. The angle of the ellipse is the angle between the x axis and the eigenvector corresponding to the largest eigenvalue. The semi-axis lengths are proportional to the square roots of the eigenvalues.
If your routine returns everything normalized (which it should), then you can decide by what factor to scale everything to obtain the confidence region for your chosen alpha.
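A minimal sketch of that computation in Python, assuming an (N, 2) array of points; the chi-square quantile scales the ellipse to the chosen confidence level:
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
points = rng.multivariate_normal([0, 0], [[3, 1], [1, 2]], size=500)

center = points.mean(axis=0)
cov = np.cov(points, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order

# orientation: angle of the eigenvector belonging to the largest eigenvalue
angle = np.degrees(np.arctan2(eigvecs[1, -1], eigvecs[0, -1]))

# 95% confidence region: semi-axes scale with sqrt(eigenvalue * chi2 quantile)
scale = chi2.ppf(0.95, df=2)
semi_axes = np.sqrt(eigvals[::-1] * scale)  # major axis first, then minor

print('center:', center, 'angle (deg):', angle, 'semi-axes:', semi_axes)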
