Problem understanding a seaborn.boxplot() graph - seaborn

I don't understand the seaborn.boxplot() graph below.
data source for the CSV file
The code is:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('train.csv')
df.head()
plt.figure(figsize = (8,8))
sns.color_palette("Paired")
sns.boxplot(x="Gender",y="Purchase", hue="Age", data=df, palette="Paired")
plt.legend(bbox_to_anchor=(1.05,1),loc=2, borderaxespad=0)
plt.grid(True)
plt.draw()
That produces:
df[(df.Gender == 'F') & (df.Age =='55+')].Purchase.describe()
That produces:
count 5083.000000
mean 9007.036199
std 4801.556874
min 12.000000
25% 6039.500000
50% 8084.000000
75% 10067.000000
max 23899.000000
Name: Purchase, dtype: float64
I can find some of these values on the graph, but not all of them. For example, I do not see the maximum.
But above all, I don't understand the clusters of black dots that
I circled in red on the graph. I don't know what they correspond to.
Do you have any idea what they represent?

As Johann C has indicated, the whiskers extend to 1.5 times the interquartile range (the IQR spans the values from the 25th to the 75th percentile, i.e. it covers the middle 50% of the values). Values outside the whiskers are known as outliers, and that is what the black dots you circled in red represent. In theory the whiskers would be of equal length above and below the box, but since the minimum value is 12 the lower whisker is cut off there. From the looks of it, you have a right-skewed distribution.
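If you want to reproduce the whisker rule numerically, here is a minimal sketch, assuming the same df as in the question:
# Same subgroup as the describe() call above
s = df[(df.Gender == 'F') & (df.Age == '55+')].Purchase
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # the lower whisker stops at the smallest value >= this bound
upper = q3 + 1.5 * iqr   # the upper whisker stops at the largest value <= this bound
outliers = s[(s < lower) | (s > upper)]
print(len(outliers), outliers.max())  # the black dots; the maximum is drawn as one of them, not as a whisker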

From the looks of it, these are outliers that are so numerous they overlap. You might thus want to check whether you are actually dealing with two separate populations whose samples have been thrown together, or with a bimodal distribution. Both deserve investigation IMO. However, that would be better discussed in a statistics channel (it's not specific to seaborn).
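One quick way to check for that, again assuming the df from the question (sns.histplot needs seaborn >= 0.11):
import seaborn as sns
import matplotlib.pyplot as plt
subset = df[(df.Gender == 'F') & (df.Age == '55+')]
sns.histplot(subset.Purchase, bins=50, kde=True)  # several separated humps would hint at mixed populations
plt.show()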

Related

How can we divide a long time-domain signal into equal segments and then apply a wavelet transform?

I have a time-domain signal with 80000 samples. I want to divide these samples into segments of equal size and apply a wavelet transform to each segment.
How can I do this? Please guide me.
Thank you.
One way to segment your original data is simply to use numpy's reshape function.
Assuming that you want to reshape your data into segments that are 2000 samples long:
import numpy as np
original_time_series = np.random.random(80000)
window_size = 2000
# Each row of the reshaped array is one 2000-sample segment (40 rows in total)
reshaped_time_series = original_time_series.reshape((-1, window_size))
Of course, you will need to ensure that the total number of samples in your time series is a multiple of the window_size. Otherwise, you can trim your input time series to match this requirement.
You can then apply your wavelet transform to each and every segment in your reshaped array.
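For the wavelet step itself, here is a minimal sketch using the PyWavelets package (pywt); the library and the 'db4' wavelet are my assumptions, since the question does not name either:
import numpy as np
import pywt  # PyWavelets (assumption: any wavelet library would do)
original_time_series = np.random.random(80000)
segments = original_time_series.reshape((-1, 2000))
# Single-level discrete wavelet transform of every segment along its last axis
approx, detail = pywt.dwt(segments, 'db4', axis=-1)
print(approx.shape, detail.shape)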
The previous answer assumes that you want non-overlapping segments. Depending on what you are trying to achieve, you may prefer using a striding -or sliding- window (e.g. with a 50% overlap). This question is already covered in detail here.
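For example, a sketch of a 50%-overlap segmentation using numpy.lib.stride_tricks.sliding_window_view (requires NumPy >= 1.20; window length and step are illustrative):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
original_time_series = np.random.random(80000)
window_size = 2000
step = window_size // 2  # 50% overlap
# All windows of length 2000, then keep one window every `step` samples
windows = sliding_window_view(original_time_series, window_size)[::step]
print(windows.shape)  # (79, 2000)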

ImageMagick: custom rank filter? Like erode or dilate or median but a different rank, e.g. a 'percentile' filter?

When applying a rank filter with ImageMagick such as erode or dilate or median, it takes the minimum (erode), maximum (dilate) or middle (median) value of all pixels within a certain radius or custom shape around each source pixel.
Is it also possible to take a custom rank of the surrounding pixel values? For example, when using a 5x5 square with the erode, dilate or median filters, respectively the lowest, highest, or middle value of the 25 pixels surrounding each pixel is taken. What I'm looking for is a way to take the 8th or 23rd value, for example.
Perhaps this could be expressed as a percentile, e.g. take the 20th percentile value of an arbitrary area surrounding each pixel, which in the case of a 7x7 square would be the 0.2*7*7 ≈ 10th value out of the total 49 in order.
The erode, dilate and median filters would correspond to the 0%, 100%, and 50% percentiles, respectively.
I can't think of a way to do that easily in ImageMagick - other than compiling in your own "process module". The best introduction to that is by snibgo and is actually for Windows - see the sortpixels.c example here.
One tool that can do this simply from the command-line is libvips with its im_rank() function.
So, if you want index 8 from a sorted list of 5x5 neighbours, you could do:
vips im_rank input.png result.png 5 5 8
I did a quick test by generating a random image with ImageMagick and choosing successively greater values for index and the output images got successively brighter - but that was the sum-total of my testing. I have no reason to believe it will not work - it is an excellent library and very fast and frugal.
If you can live with Python, you can do this:
#!/usr/bin/env python3
import numpy as np
from PIL import Image
from scipy.ndimage import generic_filter
from scipy import stats
# Rank filter: sort the neighbourhood and take the N'th smallest value
def modal(P):
    """
    We receive P[0]..P[8] with the pixels of the 3x3 surrounding window.
    Sort the neighbours and take the N'th one in the list.
    """
    N = 3
    P.sort()
    return P[N]
# Open image and make into Numpy array
im = Image.open('image.png').convert('L')
im = np.array(im)
# Run modal filter, change filter size here
result = generic_filter(im, modal, (3, 3))
# Save result
Image.fromarray(result).save('result.png')
You can change the filter size/shape to, say 5x5 by changing this line:
result = generic_filter(im, modal, (5, 5))
And as it is, it will take the third smallest neighbour, as counting starts from 0. So, use N=0 if you want the minimum in the 3x3 neighbourhood, or N=8 if you want the maximum in a 3x3 neighbourhood.
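If you prefer to think in percentiles, as in the question, a small helper (purely illustrative, not part of scipy) maps a percentile to the index N used above:
def rank_from_percentile(percentile, width, height):
    """Return the 0-based sorted index for a given percentile in a width x height window."""
    n = width * height
    return int(round(percentile * (n - 1)))
N = rank_from_percentile(0.2, 5, 5)   # -> 5, i.e. the 6th smallest of the 25 neighbours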
Another option might be to use CImg which is a simple-to-use, header-only C++ library (no DLLs/libXXX.a/libXXX.so files) that can actually read and write PGM files natively without needing anything external. So you could run an ImageMagick command and cause it to write a PGM file on stdout and read that in using CImg, process it and write it out filtered again. That would probably be easier than writing an ImageMagick "process module" and you wouldn't need to build ImageMagick from source anew for every release.
Keywords: ImageMagick, vips, libvips, filter, rank, ranking, median, dilate, erode, window, 3x3, 5x5, NxN, percentile, Python, PIL, Pillow, image, image processing.

Does there exist a way to directly figure out the "smoothness" of a digital image?

There exist several ways to evaluate an image: brightness, saturation, hue, intensity, contrast, etc. And we always hear about the operations of smoothing or sharpening an image. So there must be a way to evaluate the overall smoothness of an image, and an exact way to compute this value in one formula, probably based on wavelets. Or, even better, could anyone provide the MATLAB function, or combination of functions, to directly calculate this value?
Thanks in advance!
Smoothness is a vague term: what is considered smooth for one application might not be considered smooth for another.
In the common case, smoothness is a function of the colour gradients. Take the 2-D gradient of each of the 3 colour channels, compute its magnitude, sqrt(dx^2 + dy^2), and then average, sum or apply some other function over the 3 channels. That gives you local smoothness, which you can then sum/average/least-square over the whole image.
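A minimal sketch of that first-derivative measure (the array name, shape and the use of a plain average are illustrative choices):
import numpy as np
# img: an (H, W, 3) float image scaled to [0, 1]; random data is just a stand-in
img = np.random.random((64, 64, 3))
dy, dx = np.gradient(img, axis=(0, 1))   # 2-D gradient of every colour channel
magnitude = np.sqrt(dx**2 + dy**2)       # sqrt(dx^2 + dy^2) per pixel and channel
local = magnitude.mean(axis=2)           # combine the 3 channels (here: average)
print(local.mean())                      # lower value -> smoother image overall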
In the more general case, however, a linear change in colour is also smooth (think of two-colour gradients, or of how light is reflected from an object). For that, a second derivative is more suitable; a Laplacian does exactly that.
I've had much luck using the Laplacian operator for calculating smoothness in Python with the scipy/numpy libraries. Similar utilities exist for MATLAB and other tools.
Note that the resulting value isn't some absolute quantity from the math books; you should only use it relative to itself and with constants you deem fit.
Specific how to:
First get scipy. If you are on Linux, it is available on PyPI. For Windows you'll have to use a precompiled version (here). You should open the image using scipy.ndimage.imread and then apply scipy.ndimage.filters.laplace to the image you read. You don't actually have to mix the channels; you can simply call numpy.average and it should be close enough.
import numpy as np
import scipy.ndimage as ndi
print(np.average(np.absolute(ndi.filters.laplace(ndi.imread(path).astype(float) / 255.0))))
This gives the average smoothness (for some meaning of smoothness) of the image. I use np.absolute since the values can be positive or negative and we don't want them to cancel out when averaging. I convert to float and divide by 255 to get values between 0.0 and 1.0 instead of 0 to 255, since that is easier to work with.
If you want to see what the Laplacian found, you can use matplotlib:
import matplotlib.pyplot as plt
v = np.absolute(ndi.filters.laplace(ndi.imread(path).astype(float) / 255.0))
v2 = np.average(v, axis=2) # Mixing the channels down
plt.imshow(v2);
plt.figure();
plt.imshow(v2 > 0.05);
plt.show()

How to plot a geo-referenced image so that it "fits" the plot coordinate system in Matplotlib

A very similar question, solved the same way: how to use 'extent' in matplotlib.pyplot.imshow
I have a list of geographical coordinates (a "tracklog") that describe a geographical trajectory. Also, I have the means of obtaining an image spanning the tracklog coverage, where I know the "geographical coordinates" of the corners of the image.
My plot currently looks like this (notice the ticks - x=longitudes, y=latitudes, in UTM, WGS84):
Then suppose I know the corner coordinates of the following image (or a version of it without the blue track), and would like to plot it SO THAT IT FITS THE COORDINATE SYSTEM of the plot.
How would I do it?
(as a side note, in case that matters, I plan to use tiles)
As per the comment of Joe Kington (waiting for his actual answer so that I can accept it), the following code works as expected, giving a pannable and zoomable fixed-aspect "georeferenced" tile over which I am able to plot tracklogs:
import matplotlib.pyplot as plt
from PIL import Image
import numpy
imarray = numpy.asarray(Image.open('map.jpg'))
plt.plot([0,1], [0,1], 'o', c='red', ms=20) ## some reference circles for debugging
plt.imshow(imarray, extent=[0,1,0,1]) ## some random map whose corners have known coordinates
plt.axis('equal')
plt.show()
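The same idea with made-up UTM corner coordinates instead of the unit square (the numbers below are purely illustrative, not from my data):
import numpy
import matplotlib.pyplot as plt
from PIL import Image
imarray = numpy.asarray(Image.open('map.jpg'))
# Hypothetical UTM corners of the tile: (xmin, xmax, ymin, ymax)
corners = (500000, 510000, 4640000, 4650000)
plt.imshow(imarray, extent=corners, origin='upper')
plt.plot([502000, 505000, 508000], [4642000, 4645000, 4648000], 'r-')  # tracklog in the same UTM coordinates
plt.axis('equal')
plt.show()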
There is really not much of an answer here, but if you are using matplotlib and you are doing geo-stuff, take a look at matplotlib.basemap.
By default all operations are done on UTM maps, but you can choose your own projection.
Also take a look at the list of good tutorials at http://www.geophysique.be, for example.

Set autoscale limits on plot to have buffer around all points

I would like to plot a set of points using pyplot in matplotlib, but have none of the points lie on the edge of my axes. The autoscale (or something) sets xlim and ylim such that the first and last points often lie at x = xmin or xmax, making them difficult to read in some situations.
This is most often a problem with loglog() or semilogx()/semilogy() plots, because the autoscale would like xmin and xmax to be exact powers of ten, but if my data contains only three points, e.g. at xdata = [10**2,10**3,10**4], then the first and last points will lie on the border of the plot.
Attempted Workaround
This is my solution to add a 10% buffer to either side of the graph. But is there a way to do this more elegantly or automatically?
from numpy import array, log10
from matplotlib.pyplot import *
xdata = array([10**2,10**3,10**4])
ydata = xdata**2
figure()
loglog(xdata,ydata,'.')
xmin,xmax = xlim()
xbuff = 0.1*log10(xmax/xmin)
xlim(xmin*10**(-xbuff),xmax*10**(xbuff))
I am hoping for a one- or two-line solution that I can easily use whenever I make a plot like this.
Linear Plot
To make clear what I'm doing in my workaround, I should add an example in linear space (instead of log space):
plot(xdata,ydata)
xmin,xmax = xlim()
xbuff = 0.1*(xmax-xmin)
xlim(xmin-xbuff,xmax+xbuff)
which is identical to the previous example but for a linear axis.
Limits too large
A related problem is that sometimes the limits are too large. Say my data is something like ydata = xdata**0.25, so that the variation in the range is much less than a decade but ends at exactly 10**1. Then the autoscaled ylim runs from 10**0 to 10**1, even though the data are only in the top portion of the plot. Using my workaround above, I can increase ymax so that the third point is fully within the limits, but I don't know how to increase ymin so that there is less whitespace at the lower portion of my plot. In other words, I don't always want to spread my limits apart; I would just like some constant (or proportional) buffer around all my points.
@askewchan I just successfully managed to change the matplotlib settings by editing the matplotlibrc configuration file and running python directly from the terminal. I don't know the reason yet, but matplotlibrc is not picked up when I run python from spyder3 (my IDE). Just follow the steps here: matplotlib.org/users/customizing.html.
1) Solution one (default for all plots)
Try putting this in your matplotlibrc and you will see the buffer increase:
axes.xmargin : 0.1 # x margin. See `axes.Axes.margins`
axes.ymargin : 0.1 # y margin See `axes.Axes.margins`
Values must be between 0 and 1.
Note: due to bugs, the scale is not working correctly yet. This will be fixed in matplotlib 1.5 (I am still on 1.4.3...). More info:
axes.xmargin/ymargin rcParam behaves differently than pyplot.margins() #2298
Better auto-selection of axis limits #4891
2) Solution two (individually for each plot inside the code)
There is also the margins function (to put directly in the code). Example:
import numpy as np
from matplotlib import pyplot as plt
t = np.linspace(-6,6,1000)
plt.plot(t,np.sin(t))
plt.margins(x=0.1, y=0.1)
plt.savefig('plot.png')
Note: here the scale works correctly (0.1 adds a 10% buffer before and after the x-range and the y-range).
A similar question was posed to the matplotlib-users list earlier this year. The most promising solution involves implementing a Locator (based on MaxNLocator in this case) to override MaxNLocator.view_limits.
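For completeness, a rough sketch of that Locator idea for a linear axis; the class name and the 10% padding are my own illustrative choices, not from the mailing-list thread, and depending on the matplotlib version view_limits is only consulted when the axes are not 'tight':
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

class PaddedLocator(MaxNLocator):
    """MaxNLocator that pads the autoscaled view limits by a fixed fraction."""
    def __init__(self, pad=0.1, **kwargs):
        super().__init__(**kwargs)
        self.pad = pad
    def view_limits(self, dmin, dmax):
        # Widen the data range before letting MaxNLocator pick the limits
        buff = self.pad * (dmax - dmin)
        return super().view_limits(dmin - buff, dmax + buff)

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9], 'o')
ax.xaxis.set_major_locator(PaddedLocator(pad=0.1))
ax.yaxis.set_major_locator(PaddedLocator(pad=0.1))
ax.autoscale_view()
plt.show()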
