K-nearest-neighbour density estimation using the same data set, k=5

This is about non-parametric density estimation.
We have two data sets: 220 values of "good" data and 30 values of "broken" data.
We should use k-nearest-neighbour density estimation to estimate p(x | c = "good").
With k = 5 we have p(x | c = good) = (5/220) * (1/V).
If I have understood correctly, with k-nearest-neighbour we should determine V and then obtain p(x | c = good).
So if we can find the volume V that contains the 5 nearest points, we can compute p(x | c = good).
My problem is how to plot and calculate this probability.
Here is a picture from the book: http://content.foto.mail.ru/mail/zurix/_mypagephoto/h-67.jpg
What does the blue curve in the plot of the k-nearest-neighbour density estimate mean (see the attached image)? Does this curve show the boundaries of the different volumes V? If so, where exactly is the boundary between the regions, each of which contains 5 points?
Thank you in advance!

It's difficult to guess what the two curves mean without additional information such as the figure caption or the book title.
My best guess is that the green curve is the true (one-dimensional) density from which a sample of data points was drawn. The blue curves appear to be the resulting density estimates for three different values of k.
This should illustrate the importance of choosing k properly: k = 1 overfits the data (high variance of the resulting density estimate), while k = 30 oversmooths it (high bias of the resulting density estimate), since it does not reproduce the bump around 0.3.
In fact, looking at the k = 1 example, it does not seem to use a pure 1/V estimate but rather some weighting function. For a pure 1/V estimate per point, I would expect a piecewise constant function (only pieces of horizontal lines).
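To the question of how to actually calculate and plot this: below is a minimal 1-D sketch of the estimate p(x | c = good) = (5/220) * (1/V), where V is taken as twice the distance from x to its 5th nearest "good" data point. The vector goodData and its random contents are stand-ins for the real 220 values.
% k-NN density estimate in 1-D: p(x) = k / (N * V), with V = 2 * r_k,
% where r_k is the distance from x to its k-th nearest data point.
k = 5;
goodData = randn(220, 1);                    % stand-in for the 220 "good" values
N = numel(goodData);
xs = linspace(min(goodData), max(goodData), 500);
p = zeros(size(xs));
for m = 1:numel(xs)
    r = sort(abs(goodData - xs(m)));         % distances to all data points
    V = 2 * r(k);                            % length of the interval holding the k nearest points
    p(m) = k / (N * V);
end
plot(xs, p);
xlabel('x'); ylabel('estimated p(x | c = good)');
The result is exactly the kind of spiky estimate the answer describes for small k.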

Related

Parametric Scoring Function or Algorithm

I'm trying to come up with a way to arrive at a "score" based on an integer number of "points" that is adjustable using a small number (3-5?) of parameters. Preferably it would be simple enough to enter as a function/calculation in a spreadsheet, so the parameters can be tuned by the "designer" (not a programmer or mathematician). The first point has the most value, and eventually additional points have a fixed or nearly fixed value. The transition from the initial slope of point value to the final slope should be smooth. See the example shapes below.
Point values are always positive integers (0 pts = 0 score)
At some point the curve is linear (or nearly so); all additional points have a fixed value
Preferably, the parameters are understandable to a lay person, e.g.: "smoothness of the curve", "value of the first point", "point at which the additional value of points becomes fixed", etc.
For the parameters, an ideal example would be:
Value of first point: 10
Value of point #3 is: 5
Minimum value of additional points: 0.75
The exact shape of the curve is not too important, as long as the corner can be made more smooth or more sharp.
This is not for a game; it's more of a rating system with multiple components (several of which might use this kind of scale) that will be combined.
This seems like a non-traditional kind of question for SO/SE. I've done mostly financial software in my career, and I'm hoping there's some domain wisdom for this kind of thing I can tap into.
Implementation of Prune's Solution:
Google Sheet
Parameters:
Initial value (a)
Second value (b)
Minimum value (z)
Your decay ratio is b/a. It's simple from here: iterate through your values, applying the decay at each step, until you "peg" at the minimum:
x[n] = max( z, a * (b/a)^n )
// Take the larger of the computed "decayed" value,
// and the specified minimum.
The sequence x is your values list.
You can also truncate intermediate results if you want integers up to a certain point. Just apply the floor function to each computed value, but still allow z to override that if it gets too small.
Is that good enough? I know there's a discontinuity in the derivative, which will be noticeable if the minimum and decay aren't pleasantly aligned. You can smooth this out with a relative decay, translating the exponential decay curve so its asymptote is z instead of 0.
base = z
diff = a-z
ratio = (b-z) / diff
x[n] = z + diff * ratio^n
In this case, you don't need the max function, since the decaying term has a natural asymptote of 0, so x[n] approaches z on its own.
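For concreteness, here is a minimal MATLAB sketch of both variants above (the capped geometric decay and the translated decay). The parameter values match the example in the question, diff is renamed diffr to avoid shadowing MATLAB's built-in diff, and reading the total score as the running sum of the per-point values is my assumption.
a = 10; b = 5; z = 0.75;            % value of 1st point, 2nd point, and the minimum
n = 0:9;                            % point indices (the first point is n = 0)
% Variant 1: geometric decay capped at the minimum z
x1 = max(z, a * (b / a) .^ n);
% Variant 2: decay translated so it approaches z asymptotically (no max needed)
diffr = a - z;
ratio = (b - z) / diffr;
x2 = z + diffr * ratio .^ n;
% Optional integer truncation of the early values, still floored at z
% x1 = max(z, floor(x1));
score1 = cumsum(x1)                 % running total = score after 1, 2, 3, ... points
score2 = cumsum(x2)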

Savitzky–Golay filter for 2D images

I would like to ask about the Savitzky–Golay filter on 2-D images.
What are the best coefficients and polynomial order to choose for finding local details in an image?
Moreover, if someone has an explanation of the coefficients and the orders for 2-D images, that would be perfect.
Thanks in advance.
Please check out this website:
https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter#Two-dimensional_convolution_coefficients
UPDATE: (Thank you for the suggestion, #Rasclatt)
Which has been reproduced here:
Two-dimensional smoothing and differentiation can also be applied to tables of data values, such as intensity values in a photographic image which is composed of a rectangular grid of pixels.[16][17] The trick is to transform part of the table into a row by a simple ordering of the indices of the pixels. Whereas the one-dimensional filter coefficients are found by fitting a polynomial in the subsidiary variable z to a set of m data points, the two-dimensional coefficients are found by fitting a polynomial in subsidiary variables v and w to a set of m × m data points. The following example, for a bicubic polynomial and m = 5, illustrates the process, which parallels the process for the one-dimensional case, above.[18]
The square of 25 data values, d1 − d25, becomes a vector d when the rows are placed one after another.
The Jacobian J has 10 columns, one for each of the parameters a00 − a03, and 25 rows, one for each pair of v and w values. Each row contains the ten monomials v^i w^j with i + j ≤ 3.
The convolution coefficients are calculated as C = (J^T J)^−1 J^T.
The first row of C contains 25 convolution coefficients, which can be multiplied with the 25 data values to provide a smoothed value for the central data point (13) of the 25.
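A minimal MATLAB sketch of the construction in the quoted passage (m = 5 window, bicubic polynomial). The image variable im is a hypothetical grayscale array, and using conv2 with zero padding at the borders is just one possible choice.
% Build the 2-D Savitzky-Golay smoothing kernel (bicubic fit on a 5x5 window).
m = 5; half = (m - 1) / 2;
[v, w] = meshgrid(-half:half, -half:half);   % subsidiary variables over the window
v = v(:); w = w(:);
J = [];
for i = 0:3
    for j = 0:(3 - i)
        J = [J, v.^i .* w.^j];               % columns = monomials v^i * w^j, i + j <= 3
    end
end
C = (J' * J) \ J';                           % C = (J^T J)^-1 J^T, size 10 x 25
kernel = reshape(C(1, :), m, m);             % first row -> weights for the centre pixel
smoothed = conv2(double(im), kernel, 'same');% apply to a grayscale image 'im' (hypothetical)
The trade-off mirrors the 1-D case: a higher polynomial order relative to the window size keeps more local detail but smooths less.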
Check out the links below, which use SURE (Stein's unbiased risk estimator) to minimize the mean squared error between your estimate and the image. This method is useful for denoising and data smoothing.
This link is about optimizing the parameters of the 1-D Savitzky–Golay filter (helpful for understanding the 2-D part):
https://ieeexplore.ieee.org/abstract/document/6331560/?part=1
This link is about optimizing the parameters of the 2-D Savitzky–Golay filter:
https://ieeexplore.ieee.org/document/6738095/

Given data range, need clever algorithm to calculate granularity of graph axis scales

Scenario:
Drawing a graph. I have data points which range from A to B, and want to decide on a granularity for drawing the axis scales. E.g., for 134 to 151 the scale might run from 130 to 155, to start and end on "round" numbers in the decimal system. But the numbers might run from 134.31 to 134.35, in which case a scale from 130 to 135 would (visually) compress out the "significance" in the data; it would be better to draw the scale from 134 to 135, or maybe even from 134.3 to 134.4. And the data values might instead run from 0.013431 to 0.013435, or from 1343100 to 1343500.
So I'm trying to figure out an elegant way to calculate the "granularity" to round the low bound down to and the upper bound up to, to produce a "pleasing" chart. One could just "hack" it somehow, but that produces little confidence that "odd" cases will be handled well.
Any ideas?
Just an idea (a rough MATLAB sketch follows the list below):
Add about 10% to your range, tune this figure empirically
Divide size of range by number of tick marks you want to have
Take the base 10 logarithm of that number
Multiply the result by three, then round to the nearest integer
The remainder modulo 3 will tell you whether you want the least significant decimal to change in steps of 1, 2, or 5
The result of an integer division by 3 will tell you the power of ten to use
Take the (extended) range and compute the extremal tick points it contains, according to the tick frequency just computed
Ensure that all data points actually lie within that range, add ticks if not
If needed, add minor ticks by decreasing the integer above by one
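A minimal MATLAB sketch of the idea above; the function name niceTicks, the fixed 10% slack, and the tick-range construction at the end are illustrative choices.
function [ticks, step] = niceTicks(lo, hi, nticks)
    % Pick a "nice" tick step (1, 2, or 5 times a power of ten) for [lo, hi].
    span = (hi - lo) * 1.10;                     % add ~10% slack (tune empirically)
    raw  = span / nticks;                        % raw step for the requested tick count
    k    = round(3 * log10(raw));                % 3 * log10, rounded to the nearest integer
    mantissa = [1 2 5];
    step  = mantissa(mod(k, 3) + 1) * 10^floor(k / 3);  % remainder mod 3 -> 1/2/5, integer part -> power of ten
    ticks = floor(lo / step) * step : step : ceil(hi / step) * step;  % cover all the data
    % Minor ticks: repeat the computation with k - 1 instead of k.
end
For example, niceTicks(134, 151, 5) gives a step of 5 and ticks 130:5:155, which matches the 130-to-155 scale mentioned in the question.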
I found a very helpful calculation which is very similar to the axis scaling of Excel graphs:
It is written for Excel, but I used it and transformed it into Objective-C code for setting up my graph axes.

finding peaks and troughs, Part II (with corresponding definition)

This is an update to the previous question that I had about locating peaks and troughs. The previous question was this:
peaks and troughs in MATLAB (but with corresponding definition of a peak and trough)
This time around, I implemented the suggested answer, but I think there is still something wrong with the final algorithm. Can you please tell me what I did wrong in my code? Thanks.
function [vectpeak, vecttrough]=peaktroughmodified(x,cutoff)
% This function is a modified version of the algorithm used to identify
% peaks and troughs in a series of prices. This will be used to identify
% the head-and-shoulders pattern. The function gives you two vectors:
% PEAKS - an indicator vector that identifies the peaks in the function,
% and TROUGHS - an indicator vector that identifies the troughs of the
% function. The input is the vector of exchange rate series, and the cutoff
% used for refining possible peaks and troughs.
% Finding all possible peaks and troughs of our vector.
[posspeak,possploc]=findpeaks(x);
[posstrough,posstloc]=findpeaks(-x);
posspeak=posspeak';
posstrough=posstrough';
% Initialize vector of peaks and troughs.
numobs=length(x);
prelimpeaks=zeros(numobs,1);
prelimtroughs=zeros(numobs,1);
numpeaks=numel(possploc);
numtroughs=numel(posstloc);
% Indicator for possible peaks and troughs.
for i = 1:numobs
    for j = 1:numpeaks
        if i == possploc(j)
            prelimpeaks(i) = 1;
        end
    end
end
for i = 1:numobs
    for j = 1:numtroughs
        if i == posstloc(j)
            prelimtroughs(i) = 1;
        end
    end
end
% Vector that gives location.
location=1:1:numobs;
location=location';
% From the list of possible peaks and troughs, find the peaks and troughs
% that fit Chang and Osler [1999] definition.
% "A peak is a local minimum at least x percent higher than the preceding
% trough, and a trough is a local minimum at least x percent lower than the
% preceding peak." [Chang and Osler, p.640]
% cutoffs
peakcutoff=1.0+cutoff; % cutoff for peaks
troughcutoff=1.0-cutoff; % cutoff for troughs
% First peak and first trough are initialized as previous peaks/troughs.
prevpeakloc=possploc(1);
prevtroughloc=posstloc(1);
% Initialize vectors of final peaks and troughs.
vectpeak=zeros(numobs,1);
vecttrough=zeros(numobs,1);
% We first check whether we start looking for peaks and troughs.
for i = 1:numobs
    if prelimpeaks(i) == 1
        if i > prevtroughloc
            ratio = x(i) / x(prevtroughloc);
            if ratio > peakcutoff
                vectpeak(i) = 1;
                prevpeakloc = location(i);
            else
                vectpeak(i) = 0;
            end
        end
    elseif prelimtroughs(i) == 1
        if i > prevpeakloc
            ratio = x(i) / x(prevpeakloc);
            if ratio < troughcutoff
                vecttrough(i) = 1;
                prevtroughloc = location(i);
            else
                vecttrough(i) = 0;
            end
        end
    else
        vectpeak(i) = 0;
        vecttrough(i) = 0;
    end
end
end
I just ran it, and it seems to work if you make this change:
peakcutoff= 1/cutoff; % cutoff for peaks
troughcutoff= cutoff; % cutoff for troughs
I tested it with the following code, with a cutoff of 0.1 (so a peak must be 10 times larger than the preceding trough), and it looks reasonable:
x = randn(1,100).^2;
[vectpeak,vecttrough] = peaktroughmodified(x,0.1);
peaks = find(vectpeak);
troughs = find(vecttrough);
plot(1:100,x,peaks,x(peaks),'o',troughs,x(troughs),'o')
I strongly urge you to read up on vectorization in MATLAB. There are many wasted lines in your program, which makes it difficult to read and will also make it very slow with big data sets. For instance, prelimpeaks and prelimtroughs can be defined completely without loops, in a single line each:
prelimpeaks(possploc) = 1;
prelimtroughs(posstloc) = 1;
I think there are better techniques for finding peaks and troughs than the percentage-threshold technique given above. Fit a least-squares parabola to the data set; a technique for doing this is in the 1946 Frank Peters paper, "Parabolic Correlation, a New Descriptive Statistic." The fitted parabola will have an index of curvature, as Peters defines it. Find peaks and troughs by testing which points, when eliminated, minimize the absolute value of the index of curvature of the parabola. Once these points are discovered, test which are peaks and which are troughs by considering how the index of curvature changes when the point is excluded, which will depend on whether the original parabola had a positive or negative index of curvature. If you become concerned about contiguous points whose elimination achieves the minimum absolute curvature, constrain the search by setting a minimum distance that the identified points must be from each other. Another constraint would have to be the number of points identified; without it, this algorithm would remove all but two points, leaving a straight line without curvature.
Sometimes there are steep changes between contiguous points and both should be included among the extreme points. Perhaps a percentage-threshold test for contiguous points that overrides the minimum-distance constraint would be useful.
Another solution might be to compute the fast Fourier transform of the series and remove points that minimize the lower spectra. FFT functions are more readily available than code that finds a least-squares-fit parabola. There is a matrix-manipulation technique for determining the least-squares-fit parabola that is easier to manage than Peters' approach. I saw it documented on the web someplace, but lost the link. Advice from anybody able to arrive at a least-squares-fit parabola using matrix-vector notation would be appreciated.

Implementing the intelligent recursive algorithm in matlab

Well, I am referring to the following paper and trying to implement the algorithm given in it in MATLAB.
The only problem is: how do I find a noisy pixel, i.e. a pixel with impulse noise?
X seems to mark the impulse pixels in the image, and it is what I have to compute.
________________________________________________
Input: noisy image h
________________________________________________
Step 1: Compute X
        For every pixel, repeat steps 2 to 7
Step 2: Initialize w = 3
Step 3: If X(i,j) ≠ impulse pixel, go to step 7
Step 4: ∆(i,j) = { h(i1,j1) | i-(w-1)/2 ≤ i1 ≤ i+(w-1)/2,
                              j-(w-1)/2 ≤ j1 ≤ j+(w-1)/2 }
        b = number of black pixels in the window
        w = number of white pixels in the window
Step 5: If ∆(i,j) ≠ NULL
            p(i,j) = mean(∆(i,j))
            d(i,j) = | h(i,j) − p(i,j) |
        else if (w < wmax)
            w = w + 2
            go to step 4
        else
            if (b > w)
                h(i,j) = 0
            else
                h(i,j) = 255
Step 7: Go to the next pixel
Step 8: Calculate threshold t from the detail coefficient matrix d
        For every pixel:
Step 9: If (d(i,j) > t)
            h(i,j) = p(i,j)
________________________________________________
Edit: To implement the PSM or the median filter method we need to set some parameters and a threshold value. This threshold value is dependent on the image and the noise density. So, to restore different images we need to check a range of threshold values and find out the best one. So, in our proposed algorithm we removed the need to define a threshold value. The algorithm is intelligent and determines the threshold automatically.
The article you are trying to implement is obviously badly written...
For instance, in the algorithm w means two things: the size of the window, and the number of white pixels!
Both step 1 and step 7 refer to the same loop.
Anyway, to me, the "impulse pixels" are all the pixels which are either equal to 0 or 255.
Basically, they are the pixels which are part of the "salt and pepper" noise.
So basically, you can find them by doing:
[impulsePixelsY, impulsePixelsX] = find((im == 0) | (im == 255));
From the paper it seems that the "impulse pixels" are just the noisy pixels, in the case of salt & pepper noise. Furthermore, it also seems that the algorithm provides an "intelligent" mechanism to calculate the denoised value of a noisy pixel if its value is above a threshold (which it calculates adaptively).
So, what about "If X(i,j) ≠ impulse pixel"? Well, apparently, the authors assume they know (!) which pixels are noisy (!!), which makes the whole thing rather ridiculous, since this information is almost impossible to know.
I might also add that the rather stunning results presented in the paper are most probably due to this fact.
P.S. Regarding the argument that the "impulse pixels" are all the pixels which are either equal to 0 or 255: that is wrong. The set of pixels with an intensity value of either 0 or 255 includes the noisy pixels as well as proper pixels that just happen to have such a value. In that case, the algorithm will most probably collapse, since it will denoise healthy pixels as well.
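For what it's worth, here is a minimal MATLAB sketch of one reading of steps 2 to 7, using the interpretation discussed above that impulse pixels are those equal to 0 or 255. The file name, the wmax value, and the fall-back at the maximum window size are placeholders/assumptions; steps 8 and 9 are left as a comment because the excerpt does not say how t is computed.
im = double(imread('noisy.png'));            % hypothetical grayscale noisy image
wmax = 9;                                    % placeholder maximum window size
impulse = (im == 0) | (im == 255);           % candidate impulse pixels (salt & pepper)
p = im;                                      % estimates p(i,j), initialised to the image
d = zeros(size(im));                         % detail coefficients d(i,j)
[rows, cols] = size(im);
for i = 1:rows
    for j = 1:cols
        if ~impulse(i, j), continue; end     % step 3: only process impulse pixels
        w = 3;                               % step 2
        while true
            r = (w - 1) / 2;                 % step 4: window of size w around (i,j)
            win = im(max(i-r,1):min(i+r,rows), max(j-r,1):min(j+r,cols));
            good = win(~((win == 0) | (win == 255)));   % non-impulse values in the window
            if ~isempty(good)                % step 5: window contains usable pixels
                p(i, j) = mean(good);
                d(i, j) = abs(im(i, j) - p(i, j));
                break;
            elseif w < wmax
                w = w + 2;                   % grow the window and repeat step 4
            else                             % only impulses left: majority colour wins
                if sum(win(:) == 0) > sum(win(:) == 255)
                    p(i, j) = 0;
                else
                    p(i, j) = 255;
                end
                break;
            end
        end
    end
end
% Steps 8-9: wherever d(i,j) > t, replace h(i,j) by p(i,j); the excerpt does not
% specify how the threshold t is determined "automatically".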

Resources