Percentage scaling in my Y axis seaborn lineplot is messy [duplicate] - seaborn

This question already has an answer here:
Difference in plotting with different matplotlib versions
(1 answer)
Closed 5 years ago.
I am trying to plot using matplotlib. The plot showed a problem that the Y axis is not ordered.
Here is the code.
# -*- coding: UTF-8 -*-
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime
import numpy as np
I020 = [ line.strip('\n').split(",") for line in
open(r'D:\Users\a0476\Anaconda3\TickData\PV5sdata1.csv')][1:]
Time = [ datetime.datetime.strptime(line[0],"%H%M%S%f") for line in I020 ]
Time1 = [ mdates.date2num(line) for line in Time ]
Solar = [ line[1] for line in I020 ]
order = np.argsort(Time1)
xs = np.array(Time1)[order]
ys = np.array(Solar)[order]
plt.title('Solar data')
plt.xlabel('Time')
plt.ylabel('Solar')
ax.plot_date(xs, ys, 'k-')
hfmt = mdates.DateFormatter('%H:%M:%S')
ax.xaxis.set_major_formatter(hfmt)
plt.show()
The CSV data
time solar
7000000 50.35
8000000 41.01
9000000 69.16
10000000 94.5
11000000 111.9
12000000 103
13000000 98.6
14000000 36.45
15000000 34.74
16000000 34.17
17000000 34.6

The reason this happens is because your data is being plotted as strings.
The solution is to convert your y axis data to floats. This can be done by simply casting to a float in your list comprehension:
Solar = [float(line[1]) for line in I020]
I would also suggest to use matplotlib's auto formatting of the x -axis when using dates/times. This will rotate the labels etc to make the graph look better:
plt.gcf().autofmt_xdate()
Your example becomes:
I020 = [ line.strip('\n').split(",") for line in open('PV5sdata1.csv')][1:]
Time = [datetime.datetime.strptime(line[0],"%H%M%S%f") for line in I020]
Time1 = [mdates.date2num(line) for line in Time]
Solar = [float(line[1]) for line in I020]
xs = np.array(Time1) # You don't really need to do this but I've left it in
ys = np.array(Solar)
fig, ax = plt.subplots() # using matplotlib's Object Oriented API
ax.set_title('Solar data')
ax.set_xlabel('Time')
ax.set_ylabel('Solar')
ax.plot_date(xs, ys, 'k-')
hfmt = mdates.DateFormatter('%H:%M:%S')
ax.xaxis.set_major_formatter(hfmt)
plt.gcf().autofmt_xdate()
plt.show()
Which gives:

Related

Subtracting a best fit line with numpy.polyfit()?

So I'm working on a project and I have a set of data that I loaded in as a csv. The data has a spot that that I need to flatten out. I used the numpy.polyfit() function to find a line of best fit, but what I can't seem to figure out is how to subtract off the best fit line. Any advice?
Here is the code I'm using so far:
μ = pd.read_csv("C:\\Users\\ander\\Documents\\Data\\plots and code\\dataframe2.csv")
yvalue = "average"
xvalue = "xvalue"
X = μ[xvalue][173:852]
Y = μ[yvalue][173:852]
fit = np.polyfit(X, Y, 1)
μ = μ.subtract(fit, μ)
The polyfit function finds the linear coefficient of the best fit. In order to subtract the line from your data, you first need to create the linear function itself. For example, you can use the numpy.poly1d function.
I'll show you an example. Since we don't have access to the .csv file I made up X and Y:
import matplotlib.pyplot as plt
import numpy as np
DATA_SIZE = 500
μ_X = np.sort(np.random.uniform(0,10,DATA_SIZE))
μ_Y = 3*np.exp(-(μ_X-7)**2) + np.random.normal(0,0.08,DATA_SIZE) + 0.5*μ_X
X = μ_X[50:200]
Y = μ_Y[50:200]
plt.scatter(μ_X, μ_Y, label='Full data')
plt.scatter(X, Y, label='Selected region')
plt.legend()
plt.show()
Now we can fit the baseline from the orange data and subtract the linear function from all the data (blue).
fit = np.polyfit(X, Y, 1)
linear_baseline = np.poly1d(fit) # create the linear baseline function
μ_Y = μ_Y - linear_baseline(μ_X) # subtract the baseline from μ_Y
plt.scatter(μ_X, μ_Y, label='Linear baseline removed')
plt.legend()
plt.show()

How do I perform a curve fit with an array of points and touching a specific point in that array

I need help with curve fitting a given set of points. The points form a parabola and I ought to find the peak point of the result. Issue is when I do a curve fit, it sometimes doesn't touch the max y-coordinate even if the actual point is given in the input array.
Following is the code snippet. Here 1.88 is the actual peak y-coordinate (13.05,1.88). But the graph generated by the code does not touch the point due to curve fitting. So is there a way to fit the curve making sure that it touches the max point given in the input array?
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit, minimize_scalar
fig = plt.gcf()
#fig.set_size_inches(18.5, 10.5)
x = [4.59,9.02,13.05,18.47,20.3]
y = [1.7,1.84,1.88,1.7,1.64]
def f(x, p1, p2, p3):
return p3*(p1/((x-p2)**2 + (p1/2)**2))
plt.plot(x,y,"ro")
popt, pcov = curve_fit(f, x, y)
# find the peak
fm = lambda x: -f(x, *popt)
r = minimize_scalar(fm, bounds=(1, 5))
print( "maximum:", r["x"], f(r["x"], *popt) ) #maximum: 2.99846874275 18.3928199902
plt.text(1,1.9,'maximum '+str(round(r["x"],2))+'( #'+str(round(f(r["x"], *popt),2)) + ' )')
x_curve = np.linspace(min(x), max(x), 50)
plt.plot(x_curve, f(x_curve, *popt))
plt.plot(r['x'], f(r['x'], *popt), 'ko')
plt.show()
Here is a graphical code example using your equation with weighted fitting, where I have made the max point larger to more easily see the effect of the weighting. In non-weighted curve fitting, all weights are implicitly 1.0 as all data points have equal weight. Scipy's curve_fit routine uses weights in the form of uncertainties, so that giving a point a very small uncertainty (which I have done) is like giving the point a very large weight. This technique can be used to make a fit pass arbitrarily close to any single data point by any software that can perform weghted fitting.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = [4.59,9.02,13.05,18.47,20.3]
y = [1.7,1.84,2.0,1.7,1.64]
# note the single very small uncertainty - try making this value 1.0
uncertainties = numpy.array([1.0, 1.0, 1.0E-6, 1.0, 1.0])
# rename data to use previous example
xData = numpy.array(x)
yData = numpy.array(y)
def func(x, p1, p2, p3):
return p3*(p1/((x-p2)**2 + (p1/2)**2))
# these are the same as the scipy defaults
initialParameters = numpy.array([1.0, 1.0, 1.0])
# curve fit the test data, first without uncertainties to
# get us closer to initial starting parameters
ssqParameters, pcov = curve_fit(func, xData, yData, p0 = initialParameters)
# now that we have better starting parameters, use uncertainties
fittedParameters, pcov = curve_fit(func, xData, yData, p0 = ssqParameters, sigma=uncertainties, absolute_sigma=True)
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('Parameters:', fittedParameters)
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
# first the raw data as a scatter plot
axes.plot(xData, yData, 'D')
# create data for the fitted equation plot
xModel = numpy.linspace(min(xData), max(xData))
yModel = func(xModel, *fittedParameters)
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

Is there a way to plot confusion matrix from H2O?

I know H2O can use
model_perf = model.model_performance(input)
model_perf.confusion_matrix
to output the confusion matrix. But is there a way to get the confusion matrix table to create plot?
You have the function you need as indicated here. So you just need to convert the output of your H2OFrames to a Pandas Dataframe. Example is shown below:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
%matplotlib inline
h2o.init()
h2o.cluster().show_status()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# print(cars["economy_20mpg"].isna().sum())
cars[~cars["economy_20mpg"].isna()]["economy_20mpg"].isna().sum()
cars = cars[~cars["economy_20mpg"].isna()]
# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split into train and validation sets
train, valid = cars.split_frame(ratios = [.8], seed = 1234)
# try using the `y` parameter:
# first initialize your estimator
cars_gbm = H2OGradientBoostingEstimator(seed = 1234, sample_rate=.5)
# then train your model, where you specify your 'x' predictors, your 'y' the response column
# training_frame and validation_frame
cars_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
function from sklearn:
def plot_confusion_matrix(y_true, y_pred, classes,
normalize=False,
title=None,
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if not title:
if normalize:
title = 'Normalized confusion matrix'
else:
title = 'Confusion matrix, without normalization'
# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Only use the labels that appear in the data
classes = classes[unique_labels(y_true, y_pred)]
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
ax.figure.colorbar(im, ax=ax)
# We want to show all ticks...
ax.set(xticks=np.arange(cm.shape[1]),
yticks=np.arange(cm.shape[0]),
# ... and label them with the respective list entries
xticklabels=classes, yticklabels=classes,
title=title,
ylabel='True label',
xlabel='Predicted label')
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
for j in range(cm.shape[1]):
ax.text(j, i, format(cm[i, j], fmt),
ha="center", va="center",
color="white" if cm[i, j] > thresh else "black")
fig.tight_layout()
return ax
extract values
# specify the threshold you want to use to create integer labels
maxf1_threshold = cars_gbm.find_threshold_by_max_metric('f1')
# specify your tru and prediciton labels
y_true = cars["economy_20mpg"].as_data_frame()
y_pred = cars_gbm.predict(cars)
# convert prediction labels (original uncalibrated probabilities into integer labels)
y_pred = (y_pred['p1'] >= maxf1_threshold).ifelse(1,0)
y_pred = y_pred.as_data_frame()
y_pred.columns = ['p1']
y_true1 = y_true.economy_20mpg
y_pred1 = y_pred.p1
class_names = np.array(cars["economy_20mpg"].levels()[0])
# Plot non-normalized confusion matrix
plot_confusion_matrix(y_true1, y_pred1, classes=class_names,
title='Confusion matrix')
image result:
Please note that there is a bug in the H2O-3 confusion matrix that has been noted here

How to break y-axis using seaborn? [duplicate]

I'm trying to create a plot using pyplot that has a discontinuous x-axis. The usual way this is drawn is that the axis will have something like this:
(values)----//----(later values)
where the // indicates that you're skipping everything between (values) and (later values).
I haven't been able to find any examples of this, so I'm wondering if it's even possible. I know you can join data over a discontinuity for, eg, financial data, but I'd like to make the jump in the axis more explicit. At the moment I'm just using subplots but I'd really like to have everything end up on the same graph in the end.
Paul's answer is a perfectly fine method of doing this.
However, if you don't want to make a custom transform, you can just use two subplots to create the same effect.
Rather than put together an example from scratch, there's an excellent example of this written by Paul Ivanov in the matplotlib examples (It's only in the current git tip, as it was only committed a few months ago. It's not on the webpage yet.).
This is just a simple modification of this example to have a discontinuous x-axis instead of the y-axis. (Which is why I'm making this post a CW)
Basically, you just do something like this:
import matplotlib.pylab as plt
import numpy as np
# If you're not familiar with np.r_, don't worry too much about this. It's just
# a series with points from 0 to 1 spaced at 0.1, and 9 to 10 with the same spacing.
x = np.r_[0:1:0.1, 9:10:0.1]
y = np.sin(x)
fig,(ax,ax2) = plt.subplots(1, 2, sharey=True)
# plot the same data on both axes
ax.plot(x, y, 'bo')
ax2.plot(x, y, 'bo')
# zoom-in / limit the view to different portions of the data
ax.set_xlim(0,1) # most of the data
ax2.set_xlim(9,10) # outliers only
# hide the spines between ax and ax2
ax.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
ax.yaxis.tick_left()
ax.tick_params(labeltop='off') # don't put tick labels at the top
ax2.yaxis.tick_right()
# Make the spacing between the two axes a bit smaller
plt.subplots_adjust(wspace=0.15)
plt.show()
To add the broken axis lines // effect, we can do this (again, modified from Paul Ivanov's example):
import matplotlib.pylab as plt
import numpy as np
# If you're not familiar with np.r_, don't worry too much about this. It's just
# a series with points from 0 to 1 spaced at 0.1, and 9 to 10 with the same spacing.
x = np.r_[0:1:0.1, 9:10:0.1]
y = np.sin(x)
fig,(ax,ax2) = plt.subplots(1, 2, sharey=True)
# plot the same data on both axes
ax.plot(x, y, 'bo')
ax2.plot(x, y, 'bo')
# zoom-in / limit the view to different portions of the data
ax.set_xlim(0,1) # most of the data
ax2.set_xlim(9,10) # outliers only
# hide the spines between ax and ax2
ax.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
ax.yaxis.tick_left()
ax.tick_params(labeltop='off') # don't put tick labels at the top
ax2.yaxis.tick_right()
# Make the spacing between the two axes a bit smaller
plt.subplots_adjust(wspace=0.15)
# This looks pretty good, and was fairly painless, but you can get that
# cut-out diagonal lines look with just a bit more work. The important
# thing to know here is that in axes coordinates, which are always
# between 0-1, spine endpoints are at these locations (0,0), (0,1),
# (1,0), and (1,1). Thus, we just need to put the diagonals in the
# appropriate corners of each of our axes, and so long as we use the
# right transform and disable clipping.
d = .015 # how big to make the diagonal lines in axes coordinates
# arguments to pass plot, just so we don't keep repeating them
kwargs = dict(transform=ax.transAxes, color='k', clip_on=False)
ax.plot((1-d,1+d),(-d,+d), **kwargs) # top-left diagonal
ax.plot((1-d,1+d),(1-d,1+d), **kwargs) # bottom-left diagonal
kwargs.update(transform=ax2.transAxes) # switch to the bottom axes
ax2.plot((-d,d),(-d,+d), **kwargs) # top-right diagonal
ax2.plot((-d,d),(1-d,1+d), **kwargs) # bottom-right diagonal
# What's cool about this is that now if we vary the distance between
# ax and ax2 via f.subplots_adjust(hspace=...) or plt.subplot_tool(),
# the diagonal lines will move accordingly, and stay right at the tips
# of the spines they are 'breaking'
plt.show()
I see many suggestions for this feature but no indication that it's been implemented. Here is a workable solution for the time-being. It applies a step-function transform to the x-axis. It's a lot of code, but it's fairly simple since most of it is boilerplate custom scale stuff. I have not added any graphics to indicate the location of the break, since that is a matter of style. Good luck finishing the job.
from matplotlib import pyplot as plt
from matplotlib import scale as mscale
from matplotlib import transforms as mtransforms
import numpy as np
def CustomScaleFactory(l, u):
class CustomScale(mscale.ScaleBase):
name = 'custom'
def __init__(self, axis, **kwargs):
mscale.ScaleBase.__init__(self)
self.thresh = None #thresh
def get_transform(self):
return self.CustomTransform(self.thresh)
def set_default_locators_and_formatters(self, axis):
pass
class CustomTransform(mtransforms.Transform):
input_dims = 1
output_dims = 1
is_separable = True
lower = l
upper = u
def __init__(self, thresh):
mtransforms.Transform.__init__(self)
self.thresh = thresh
def transform(self, a):
aa = a.copy()
aa[a>self.lower] = a[a>self.lower]-(self.upper-self.lower)
aa[(a>self.lower)&(a<self.upper)] = self.lower
return aa
def inverted(self):
return CustomScale.InvertedCustomTransform(self.thresh)
class InvertedCustomTransform(mtransforms.Transform):
input_dims = 1
output_dims = 1
is_separable = True
lower = l
upper = u
def __init__(self, thresh):
mtransforms.Transform.__init__(self)
self.thresh = thresh
def transform(self, a):
aa = a.copy()
aa[a>self.lower] = a[a>self.lower]+(self.upper-self.lower)
return aa
def inverted(self):
return CustomScale.CustomTransform(self.thresh)
return CustomScale
mscale.register_scale(CustomScaleFactory(1.12, 8.88))
x = np.concatenate((np.linspace(0,1,10), np.linspace(9,10,10)))
xticks = np.concatenate((np.linspace(0,1,6), np.linspace(9,10,6)))
y = np.sin(x)
plt.plot(x, y, '.')
ax = plt.gca()
ax.set_xscale('custom')
ax.set_xticks(xticks)
plt.show()
Check the brokenaxes package:
import matplotlib.pyplot as plt
from brokenaxes import brokenaxes
import numpy as np
fig = plt.figure(figsize=(5,2))
bax = brokenaxes(
xlims=((0, .1), (.4, .7)),
ylims=((-1, .7), (.79, 1)),
hspace=.05
)
x = np.linspace(0, 1, 100)
bax.plot(x, np.sin(10 * x), label='sin')
bax.plot(x, np.cos(10 * x), label='cos')
bax.legend(loc=3)
bax.set_xlabel('time')
bax.set_ylabel('value')
A very simple hack is to
scatter plot rectangles over the axes' spines and
draw the "//" as text at that position.
Worked like a charm for me:
# FAKE BROKEN AXES
# plot a white rectangle on the x-axis-spine to "break" it
xpos = 10 # x position of the "break"
ypos = plt.gca().get_ylim()[0] # y position of the "break"
plt.scatter(xpos, ypos, color='white', marker='s', s=80, clip_on=False, zorder=100)
# draw "//" on the same place as text
plt.text(xpos, ymin-0.125, r'//', fontsize=label_size, zorder=101, horizontalalignment='center', verticalalignment='center')
Example Plot:
For those interested, I've expanded upon #Paul's answer and added it to the matplotlib wrapper proplot. It can do axis "jumps", "speedups", and "slowdowns".
There is no way currently to add "crosses" that indicate the discrete jump like in Joe's answer, but I plan to add this in the future. I also plan to add a default "tick locator" that sets sensible default tick locations depending on the CutoffScale arguments.
Adressing Frederick Nord's question how to enable parallel orientation of the diagonal "breaking" lines when using a gridspec with ratios unequal 1:1, the following changes based on the proposals of Paul Ivanov and Joe Kingtons may be helpful. Width ratio can be varied using variables n and m.
import matplotlib.pylab as plt
import numpy as np
import matplotlib.gridspec as gridspec
x = np.r_[0:1:0.1, 9:10:0.1]
y = np.sin(x)
n = 5; m = 1;
gs = gridspec.GridSpec(1,2, width_ratios = [n,m])
plt.figure(figsize=(10,8))
ax = plt.subplot(gs[0,0])
ax2 = plt.subplot(gs[0,1], sharey = ax)
plt.setp(ax2.get_yticklabels(), visible=False)
plt.subplots_adjust(wspace = 0.1)
ax.plot(x, y, 'bo')
ax2.plot(x, y, 'bo')
ax.set_xlim(0,1)
ax2.set_xlim(10,8)
# hide the spines between ax and ax2
ax.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
ax.yaxis.tick_left()
ax.tick_params(labeltop='off') # don't put tick labels at the top
ax2.yaxis.tick_right()
d = .015 # how big to make the diagonal lines in axes coordinates
# arguments to pass plot, just so we don't keep repeating them
kwargs = dict(transform=ax.transAxes, color='k', clip_on=False)
on = (n+m)/n; om = (n+m)/m;
ax.plot((1-d*on,1+d*on),(-d,d), **kwargs) # bottom-left diagonal
ax.plot((1-d*on,1+d*on),(1-d,1+d), **kwargs) # top-left diagonal
kwargs.update(transform=ax2.transAxes) # switch to the bottom axes
ax2.plot((-d*om,d*om),(-d,d), **kwargs) # bottom-right diagonal
ax2.plot((-d*om,d*om),(1-d,1+d), **kwargs) # top-right diagonal
plt.show()
This is a hacky but pretty solution for x-axis breaks.
The solution is based on https://matplotlib.org/stable/gallery/subplots_axes_and_figures/broken_axis.html, which gets rid of the problem with positioning the break above the spine, solved by How can I plot points so they appear over top of the spines with matplotlib?
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
def axis_break(axis, xpos=[0.1, 0.125], slant=1.5):
d = slant # proportion of vertical to horizontal extent of the slanted line
anchor = (xpos[0], -1)
w = xpos[1] - xpos[0]
h = 1
kwargs = dict(marker=[(-1, -d), (1, d)], markersize=12, zorder=3,
linestyle="none", color='k', mec='k', mew=1, clip_on=False)
axis.add_patch(Rectangle(
anchor, w, h, fill=True, color="white",
transform=axis.transAxes, clip_on=False, zorder=3)
)
axis.plot(xpos, [0, 0], transform=axis.transAxes, **kwargs)
fig, ax = plt.subplots(1,1)
plt.plot(np.arange(10))
axis_break(ax, xpos=[0.1, 0.12], slant=1.5)
axis_break(ax, xpos=[0.3, 0.31], slant=-10)
if you want to replace an axis label, this would do the trick:
from matplotlib import ticker
def replace_pos_with_label(fig, pos, label, axis):
fig.canvas.draw() # this is needed to set up the x-ticks
labs = axis.get_xticklabels()
labels = []
locs = []
for text in labs:
x = text._x
lab = text._text
if x == pos:
lab = label
labels.append(lab)
locs.append(x)
axis.xaxis.set_major_locator(ticker.FixedLocator(locs))
axis.set_xticklabels(labels)
fig, ax = plt.subplots(1,1)
plt.plot(np.arange(10))
replace_pos_with_label(fig, 0, "-10", axis=ax)
replace_pos_with_label(fig, 6, "$10^{4}$", axis=ax)
axis_break(ax, xpos=[0.1, 0.12], slant=2)

Best learning algorithms concentric and not linearly separable data

Below are two scatter plots. The first one is for data points that have values of x and y, and I would like to know if there is a clustering algorithm that will automatically recognize that there are two clusters. They are concentric and not linearly separable. K-means is not right for several reasons. The other plot is similar but it has x, y and color values, and I would like to know what learning algorithm would be best at classifying or predicting the correct color from the values of x and y.
I got good classifier results for this problem using the sklearn MLPClassifier algorithm. Here is the scatter and contour plots:
Detailed code at: https://www.linkedin.com/pulse/couple-scikit-learn-classifiers-peter-thorsteinson. The simplified code below shows how it works:
import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
# Generate the artificial data set and display the resulting scatter plot
x = []
y = []
z = []
for i in range(500):
rand = np.random.uniform(0.0, 2*math.pi)
randx = np.random.normal(0.0, 30.0)
randy = np.random.normal(0.0, 30.0)
if np.random.random() > 0.5:
z.append(0)
x.append(100*math.cos(rand) + randx)
y.append(100*math.sin(rand) + randy)
else:
z.append(1)
x.append(300*math.cos(rand) + randx)
y.append(300*math.sin(rand) + randy)
plt.axis('equal')
plt.axis([-500, 500, -500, 500])
plt.scatter(x, y, c=z)
plt.show()
# Run the MLPClassifier algorithm on the training data
XY = pd.DataFrame({'x': x, 'y': y})
print(XY.head())
Z = pd.DataFrame({'z': z})
print(Z.head())
XY_train, XY_test, Z_train, Z_test = train_test_split(XY, Z, test_size = 0.20)
mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000)
mlp.fit(XY_train, Z_train.values.ravel())
# Make predictions on the test data and display resulting scatter plot
predictions = mlp.predict(XY_test)
print(confusion_matrix(Z_test,predictions))
print(classification_report(Z_test,predictions))
plt.axis('equal')
plt.axis([-500, 500, -500, 500])
plt.scatter(XY_test.x, XY_test.y, c=predictions)
plt.show()

Resources