Weighted random number (without predefined values!)

Currently I need a function which returns a weighted random number.
It should choose a random number between two doubles/integers (for example 4 and 8), such that the value in the middle (6) occurs, on average, about twice as often as the limit values 4 and 8.
If this were only about integers, I could predefine the values with variables and custom probabilities, but I need the function to return a double with at least 2 decimal places (meaning thousands of different numbers)!
The environment I use is Game Maker, which provides all sorts of basic random generators, but no weighted ones.
Could anyone point me in the right direction on how to achieve this?
Thanks in advance!

The sum of two independent continuous uniform(0,1)'s, U1 and U2, has a continuous symmetrical triangle distribution between 0 and 2. The distribution has its peak at 1 and tapers to zero at either end. We can easily translate that to a range of (4,8) via scaling by 2 and adding 4, i.e., 4 + 2*(U1 + U2).
However, you don't want a height of zero at the endpoints, you want half the peak's height. In other words, you want a triangle sitting on a rectangular base (i.e., uniform), with height h at the endpoints and height 2h in the middle. That makes life easy, because the triangle must have a peak of height h above the rectangular base, and a triangle with height h has half the area of a rectangle with the same base and height h. It follows that 2/3 of your probability is in the base, 1/3 is in the triangle.
Combining the elements above leads to the following pseudocode algorithm. If rnd() is a function call that returns continuous uniform(0,1) random numbers:
define makeValue()
    if rnd() <= 2/3   # Caution, may want to use 2.0/3.0 in many languages
        return 4 + (4 * rnd())
    else
        return 4 + (2 * (rnd() + rnd()))
I cranked out a million values using that and plotted a histogram:
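If you want to reproduce that experiment, a minimal Python sketch of the same algorithm might look like this (the text-histogram bucketing and bucket width are my own choices, just for a quick visual check):

import random

def make_value():
    # 2/3 of the probability mass is the uniform base, 1/3 the triangular peak
    if random.random() <= 2.0 / 3.0:
        return 4 + 4 * random.random()
    return 4 + 2 * (random.random() + random.random())

samples = [make_value() for _ in range(1_000_000)]
counts = [0] * 8                          # eight half-unit buckets over [4, 8)
for v in samples:
    counts[min(int((v - 4) * 2), 7)] += 1
for i, c in enumerate(counts):
    print(f"{4 + i / 2:.1f}: {'#' * (c // 20000)}")

The endpoint buckets should come out with roughly half the count of the middle ones.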

In case someone needs this in Game Maker (or a different language) as a universal function:
if random(1) <= argument0
    return argument1 + ((argument2 - argument1) * random(1))
else
    return argument1 + (((argument2 - argument1) / 2) * (random(1) + random(1)))
Called as follows (similar to the standard random_range function):
val = weight_random_range(FACTOR, FROM, TO)
"FACTOR" determines how much of the whole probability figure is the "base" for constant probability. E.g. 2/3 for the figure above.
0 will provide a perfect triangle and 1 a rectangle (no weightning).
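For completeness, here is a direct Python translation of the universal function above (a sketch; the name and arguments mirror the GML version):

import random

def weight_random_range(factor, lo, hi):
    # factor: share of the probability mass in the uniform "base" (0..1)
    if random.random() <= factor:
        return lo + (hi - lo) * random.random()
    return lo + (hi - lo) / 2 * (random.random() + random.random())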

Understanding Support Vector Regression (SVR)

I'm working with SVR, using this resource. Everything is super clear with the epsilon-insensitive loss function (from the figure): the prediction comes with a tube, to cover most of the training samples and generalize the bounds, using support vectors.
Then we have this explanation: "This can be described by introducing (non-negative) slack variables ξ_i, ξ_i*, to measure the deviation of training samples outside the ε-insensitive zone." I understand this error outside the tube, but I don't know how we can use it in the optimization. Could somebody explain this?
In my local source, I'm trying to implement a very simple optimization solution, without libraries. This is what I have for the loss function:
import numpy as np

# Kernel func, linear by default
def hypothesis(x, weight, k=None):
    k = k if k else lambda z: z
    k_x = np.vectorize(k)(x)
    return np.dot(k_x, np.transpose(weight))

.......

import math

def boundary_loss(x, y, weight, epsilon):
    prediction = hypothesis(x, weight)
    scatter = np.absolute(np.transpose(y) - prediction)
    bound = lambda z: z if z >= epsilon else 0
    return np.sum(np.vectorize(bound)(scatter))
First, let's look at the objective function. The first term, 1/2 * w^2 (I wish this site had LaTeX support, but this will suffice), correlates with the margin of the SVM. The article you linked doesn't, in my opinion, explain this very well; it calls this term a description of "the model's complexity", but perhaps this is not the best way of explaining it. Minimizing this term maximizes the margin (while still representing the data well), which is the predominant goal of using SVMs for regression.
Warning, Math Heavy Explanation: The reason this is the case is that when maximizing the margin, you want to find the "farthest" non-outlier points right on the margin and minimize their distance. Let this farthest point be x_n. We want to find its Euclidean distance d from the plane f(w, x) = 0, which I will rewrite as w^T * x + b = 0 (where w^T is just the transpose of the weights matrix so that we can multiply the two). To find the distance, let us first normalize the plane such that |w^T * x_n + b| = epsilon, which we can do WLOG as w is still able to form all possible planes of the form w^T * x + b = 0.

Then, let's note that w is perpendicular to the plane. This is obvious if you have dealt a lot with planes (particularly in vector calculus), but can be proven by choosing two points on the plane x_1 and x_2, then noticing that w^T * x_1 + b = 0, and w^T * x_2 + b = 0. Subtracting the two equations we get w^T * (x_1 - x_2) = 0. Since x_1 - x_2 is just any vector strictly on the plane, and its dot product with w is 0, we know that w is perpendicular to the plane.

Finally, to actually calculate the distance between x_n and the plane, we take the vector formed by x_n and some point on the plane x' (that vector is x_n - x') and project it onto the vector w. Doing this, we get d = |w * (x_n - x') / |w||, which we can rewrite as d = (1 / |w|) * |w^T * x_n - w^T * x'|, and then add and subtract b inside the absolute value to get d = (1 / |w|) * |w^T * x_n + b - w^T * x' - b|. Notice that w^T * x_n + b is epsilon (from our normalization above), and that w^T * x' + b is 0, as this is just a point on our plane. Thus, d = epsilon / |w|.

Notice that maximizing this distance subject to the constraint of finding the x_n and having |w^T * x_n + b| = epsilon is a difficult optimization problem. What we can do is restructure it as minimizing 1/2 * w^T * w subject to the first two constraints in the picture you attached, that is, |y_i - f(x_i, w)| <= epsilon. You may think that I have forgotten the slack variables, and this is true; while focusing on this term we ignore them for now, and I will bring them back later. The reason these two optimizations are equivalent is not obvious, but the underlying reason lies in discrimination boundaries, which you are free to read more about (it's a lot more math that frankly I don't think this answer needs). Then, note that minimizing 1/2 * w^T * w is the same as minimizing 1/2 * |w|^2, which is the desired result we were hoping for. End of the Heavy Math
Now, notice that we want to make the margin big, but not so big that it includes noisy outliers like the one in the picture you provided.
Thus, we introduce the second term. To keep the margin at a reasonable size, the slack variables are introduced (I will call them p and p* because I don't want to type out "xi" every time). These slack variables ignore everything in the margin, i.e. those are the points that do not harm the objective, the ones that are "correct" in terms of their regression status. However, the points outside the margin are outliers; they do not reflect well on the regression, so we penalize them simply for existing. The slack error function given there is relatively easy to understand: it just adds up the slack error of every point (p_i + p*_i) for i = 1,...,N, and then multiplies by a modulating constant C which determines the relative importance of the two terms. A low value of C means that we are okay with having outliers, so the margin will be thinned and more outliers will be produced. A high value of C indicates that we care a lot about not having slack, so the margin will be made bigger to accommodate these outliers at the expense of representing the overall data less well.
A few things to note about p and p*. First, note that they are both always >= 0. The constraint in your picture shows this, but it also intuitively makes sense as slack should always add to the error, so it is positive. Second, notice that if p > 0, then p* = 0 and vice versa as an outlier can only be on one side of the margin. Last, all points inside the margin will have p and p* be 0, since they are fine where they are and thus do not contribute to the loss.
Notice that with the introduction of the slack variables, if you have any outliers then you won't want the condition from the first term, that is, |w^T * x_n + b| = epsilon as the x_n would be this outlier, and your whole model would be screwed up. What we allow for, then, is to change the constraint to be |w^T * x_n + b| = epsilon + (p + p*). When translated to the new optimization's constraint, we get the full constraint from the picture you attached, that is, |y_i - f(x_i, w)| <= epsilon + p + p*. (I combined the two equations into one here, but you could rewrite them as the picture is and that would be the same thing).
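To recap in the notation used so far (this is just the objective and constraints from your picture, restated): minimize 1/2 * w^T * w + C * Sum(p_i + p*_i, i = 1..N) subject to y_i - f(x_i, w) <= epsilon + p_i, f(x_i, w) - y_i <= epsilon + p*_i, and p_i, p*_i >= 0 for all i.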
Hopefully, having covered all of this, the motivation for the objective function and the corresponding slack variables makes sense to you.
If I understand the question correctly, you also want code to calculate this objective/loss function, which I think isn't too bad. I have not tested this (yet), but I think this should be what you want.
# Function for calculating the error/loss for a SVM. I assume that:
# - 'x' is a 2d array representing the vectors of the data points
# - 'y' is an array representing the values each vector actually gives
# - 'weights' is an array of weights that we tune for the regression
# - 'epsilon' is a scalar representing the breadth of our margin.
def optimization_objective(x, y, weights, epsilon):
    # Calculates first term of objective (note that norm^2 = dot product)
    margin_term = np.dot(weights, weights) / 2
    # Now calculate second term of objective. First get the sum of slacks.
    slack_sum = 0
    for i in range(len(x)):  # For each observation
        # First find the absolute distance between expected and observed.
        diff = abs(hypothesis(x[i], weights) - y[i])
        # Now subtract epsilon
        diff -= epsilon
        # If diff is still more than 0, then it is an 'outlier' and will have slack.
        slack = max(0, diff)
        # Add it to the slack sum
        slack_sum += slack
    # Now we have the slack_sum, so multiply by C (I picked 1 arbitrarily)
    C = 1
    slack_term = C * slack_sum
    # Now, simply return the sum of the two terms, and we are done.
    return margin_term + slack_term
I got this function working on my computer with small data, and you may have to change it a little to work with your data if, for example, the arrays are structured differently, but the idea is there. Also, I am not the most proficient with python, so this may not be the most efficient implementation, but my intent was to make it understandable.
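As a quick sanity check, calling it on toy data might look like this (the shapes here are my assumption; adapt them to however your x, y and weights are actually laid out):

x = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])  # three observations
y = np.array([2.9, 5.1, 7.0])
weights = np.array([1.0, 1.0])
print(optimization_objective(x, y, weights, epsilon=0.5))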
Now, note that this just calculates the error/loss (whatever you want to call it). Actually minimizing it requires going into Lagrangians and intense quadratic programming, which is a much more daunting task. There are libraries available for doing this, but if you want to stay library-free as you are doing here, I wish you good luck, because that is not a walk in the park.
Finally, I would like to note that most of this information comes from notes I took in an ML class last year, and the professor (Dr. Abu-Mostafa) was a great help in learning the material. The lectures for this class are online (by the same prof), and the pertinent ones for this topic are here and here (although in my very biased opinion you should watch all the lectures; they were a great help). Leave a comment/question if you need anything cleared up or if you think I made a mistake somewhere. If you still don't understand, I can try to edit my answer to make more sense. Hope this helps!

Number of ways N circles of different radii can be arranged in a line

Question: Given M points on a line separated by 1 unit, find the number of ways N circles of different radii can be drawn so that they don't intersect, overlap, or lie one inside another, provided that the centers of the circles are those M points.
Example 1: N=3, M=6, r1=1, r2=1, r3=1. Answer: 24 ways.
Example 2: N=2, M=5, r1=1, r2=2. Answer: 6 ways.
Example 3: N=1, M=10, r=50. Answer: 10 ways.
I found this question online and have not been able to solve it so far. All I have worked out is that any circle can occupy positions from n−r to n−2r. But among other issues, how can I adjust for edge cases in which a circle with radius 3 takes the n−4th point? The last point is then left untouched, but I cannot place any circle with a radius greater than 1. I am not able to see any generalized mathematical solution to this.
If the centers of the circles can be placed at non-integer coordinates, then there are either no arrangements (the length is too short) or infinitely many (there is enough length, and then infinitely many translations exist).
So, since you have to compute a result, I will assume that the coordinates of the M points are integers.
If there is a single circle, then the solution is the number of points the circle can be legally placed.
If there are at least two circles, then you need to calculate the sum of the diameters. If that happens to be larger than the total length of the line we are speaking about, then you have no solution. If that is not the case, then you subtract the sum of the diameters from the total length, getting Complementer. You also have N! permutations for the order of the circles. And you will have Complementer - 1 possible locations where you can distribute the gaps between the circles. The lengths of the gaps are G1, ..., Gn-1.
We know that G1 + ... + Gn-1 = Complementer
The number of possible distributions of G1, ..., Gn-1 is D. The formula therefore would be
N! * D
The remaining question is: how can we compute D?
Solution:
function distr(depth, maxDepth, amount)
    if (depth = maxDepth) then
        return 1 // we need to put the remaining elements on the last slot
    end if
    sum = 1 // if we put the whole amount here, that is a trivial case
    for i = amount - 1 to 0 do
        sum = sum + distr(depth + 1, maxDepth, amount - i)
    end for
    return sum
end distr
You need to call distr with depth = 1, maxDepth = N-1, amount = Complementer.
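Here is a direct Python translation of that pseudocode as I read it, with memoization added so repeated subproblems aren't recomputed (and, if I'm reading the recursion right, it counts non-negative integer gaps summing to amount, which matches the stars-and-bars value C(amount + slots - 1, slots - 1) for slots = N - 1):

from functools import lru_cache

@lru_cache(maxsize=None)
def distr(depth, max_depth, amount):
    if depth == max_depth:
        return 1  # the remaining amount goes on the last slot
    # i units placed on this slot, recurse on the rest
    return sum(distr(depth + 1, max_depth, amount - i)
               for i in range(amount + 1))

# called as in the answer: distr(1, N - 1, Complementer)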

Surface Area Heuristic (SAH) kd-tree of triangles - flat cells

I've implemented a SAH kd-tree based upon the paper On building fast kd-Trees for Ray Tracing, and on doing that in O(N log N) by Wald and Havran. Note I haven't done their splicing and merging suggestion right at the end to speed up the building of the tree, just the SAH part.
I'm testing the algorithm with an axis-aligned cube where each face is split into two triangles, so I have N = 12 triangles in total. The bottom-left vertex (i.e. the one nearest the axes) is actually at the origin.
Face Triangle
----------------
Front: 0, 1
Left: 6, 7
Right: 2, 3
Top: 4, 5
Bottom: 10, 11
Back: 8, 9
Assuming node traversal cost Ct = 1.0 and intersection cost Ci = 10.0. We first find the cost of not splitting which is Cns = N * Ci = 12 * 10.0 = 120.0.
Now we go through each axis in turn and do an incremental sweep through the candidate split planes to see if the cost of splitting is cheaper. The first split plane is p = <x,0>. We have Nl = 0, Np = 2 and Nr = 10 (that is, the number of triangles on the left, in the plane, and to the right of the plane). The two triangles in the plane are number 6 and 7 (the left face). All others are to the right.
The SAH(p,V,Nl,Nr,Np) function is now executed. This takes the split plane, the voxel V to be split, and the numbers of left/right/plane triangles. It computes the probability of hitting the left (flat) voxel as Pl = SA(Vl)/SA(V) = 50/150 = 1/3, the right probability as Pr = SA(Vr)/SA(V) = 150/150 = 1. Now it runs the cost function twice; first with the planar triangles on the left, then with the planar triangles on the right to get Cl and Cr respectively.
The cost function C(Pl,Pr,Nl,Nr) returns bias * (Ct + Ci(Pl * Nl + Pr * Nr))
Cl: cost with planar triangles on the left (Nl = 2, Nr = 10)
bias = 1: we aren't biasing, as neither the left nor the right voxel is empty.
Cl = 1 * (1 + 10(1/3 * 2 + 1 * 10)) = 107.666
Cr: cost with planar triangles on the right (Nl = 0, Nr = 12)
bias = 0.8: the empty-cell bias comes into play.
Cr = 0.8 * (1 + 10(1/3 * 0 + 1 * 12)) = 96.8
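For concreteness, a tiny Python sketch of the cost function as used in this walkthrough (with Ct = 1.0 and Ci = 10.0 as assumed above; it reproduces the two candidate costs):

def sah_cost(p_left, p_right, n_left, n_right, bias=1.0):
    Ct, Ci = 1.0, 10.0
    return bias * (Ct + Ci * (p_left * n_left + p_right * n_right))

print(sah_cost(1/3, 1.0, 2, 10))            # 107.666... (planar tris left)
print(sah_cost(1/3, 1.0, 0, 12, bias=0.8))  # 96.8       (planar tris right)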
The algorithm determines that Cr = 96.8 is better than splitting the two triangles off into a flat cell Cl = 107.666 and also better than not splitting the voxel at all Cns = 120.0. No other candidate splits are found to be cheaper. We therefore split into an empty left child, and a right child containing all the triangles. When we recurse into the right child to continue the tree building, we perform exactly the same steps as above. It's only because of a max depth termination criterion that this doesn't cause a stack overflow. The resultant tree is very deep.
The paper claims to have thought about this sort of thing:
During clipping, special care has to be taken to correctly handle special cases like "flat" (i.e., zero-volume) cells, or cases where numerical inaccuracies may occur (e.g., for cells that are very thin compared to the size of the triangle). For example, we must make sure not to "clip away" triangles lying in a flat cell. Note that such cases are not rare exceptions, but are in fact encouraged by the SAH, as they often produce minimal expected cost.

This local approximation can easily get stuck in a local minimum: As the local greedy SAH overestimates CV(p), it might stop subdivision even if the correct cost would have indicated further subdivision. In particular, the local approximation can lead to premature termination for voxels that require splitting off flat cells on the sides: many scenes (in particular, architectural ones) contain geometry in the form of axis-aligned boxes (a light fixture, a table leg or table top, ...), in which case the sides have to be "shaved off" until the empty interior is exposed. For wrongly chosen parameters, or when using cost functions different from the ones we use (in particular, ones in which a constant cost is added to the leaf estimate), the recursion can terminate prematurely. Though this premature exit could also be avoided in a hardcoded way—e.g., only performing the automatic termination test for non-flat cells—we propose to follow our formulas, in which case no premature exit will happen.
Am I doing anything wrong? How can I correct this?
Probably a bit late by now :-/, but just in case: I think the thing that's "wrong" here is in the concept of "empty cell" bias: what you want to "encourage" (by giving it a less-than-one bias) is cutting away empty space, but that implies that the cell being cut away actually does have a volume. Apparently, in your implementation you're only checking if one side has zero triangles, and are thus applying the bias even if no space is cut away at all.
Fix: apply the bias only if one side has count==0 and non-zero width.
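In code, that check might look something like this (a sketch; the volume bookkeeping depends on your implementation, and 0.8 is the bias value from the question):

def split_bias(n_left, n_right, vol_left, vol_right):
    # Reward cutting away empty *space*, not merely an empty triangle list:
    # the empty side must also have non-zero volume (i.e. non-zero width).
    EMPTY_BIAS = 0.8
    if (n_left == 0 and vol_left > 0.0) or (n_right == 0 and vol_right > 0.0):
        return EMPTY_BIAS
    return 1.0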

Geometry matrix collinearity proof

Let's say I have this matrix with n=4 and m=5
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
Let's say I have a diagonal from the (1,2) point to the (4,5) point. And I have a point P(3,4). How can I check in my algorithm that P is on the diagonal?
TL;DR
Instead of an n-by-m matrix, think about it like an x-y grid. You can get the equation of a line on that grid, and once you have that equation, you plug the x coordinate of the point you are interested in checking into the equation. If the y value you calculate from the equation matches the y coordinate of the point you are checking, the point lies on the line.
But How Do I Maths?
First some quick terminology. We have 3 points of interest in this case - the two points that define the line (or "diagonal", as the OP calls it), and the one point that we want to check. I'm going to designate the coordinates of the "diagonal" points with the numbers 1 and 2, and the point we want to check with the letter i. Additionally, for the math we need to do later, I need to treat the horizontal and vertical coordinates of the points separately, and I'll use your n-by-m convention to do so. So when I write n1 in an equation below, that is the n coordinate of the first point used to define the diagonal (so the 1 part of the point (1,2) that you give in your example).
What we are looking for is the equation of a line on our grid. This equation will have the form n = (slope) * m + (intercept).
Okay, now that we have the definitions taken care of, we can write the equations. The first step to solving the problem is finding the slope of your line. This will be the change in the vertical coordinate divided by the change in the horizontal component between the two points that define the line (so (n2 - n1) / (m2 - m1)). Using the values from your example, this will be (4 - 1) / (5 - 2) = 3 / 3 = 1. Note that since you are doing a division here, it is possible that your answer will not be a whole number, so make sure you keep that in mind when declaring your variables in whatever programming language you end up using - unintentional rounding in this step can really mess things up later.
Once we have our slope, the next step is calculating our intercept. We can do this by plugging our slope and the m and n coordinates into the equation for the line we are trying to get. So we start with the equation n1 = (slope) * m1 + (intercept). We can rearrange this equation to (intercept) = n1 - (slope) * m1. Plugging in the values from our example, we get (intercept) = 1 - (1 * 2) = -1.
So now we have the general equation of our line, which for our example is n = (1) * m + (-1).
Now that we have the (slope) and (intercept), we can plug in the coordinates of any point we want to check and see if the numbers match up. Our example point has an m coordinate of 4, so we can plug that into our equation.
n = (1) * (4) + (-1) = 3
Since the n coordinate we calculated using our equation matches the n coordinate of our point in our example, we can say that the sample point DOES fall on the line.
Suppose we also wanted to check whether the point (2,5) is on the line. When we plug that point's m coordinate into our equation, we get...
n = (1) * (5) + (-1) = 4
Since the n coordinate we calculated with our equation (4) doesn't match the n coordinate of the point we were checking (2), we know this point DOES NOT fall on the line.
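Putting the whole procedure into a short Python function (the names are mine; points are (n, m) pairs per the convention above):

def on_diagonal(p1, p2, pt):
    n1, m1 = p1
    n2, m2 = p2
    slope = (n2 - n1) / (m2 - m1)      # may be fractional, so use floats
    intercept = n1 - slope * m1
    return pt[0] == slope * pt[1] + intercept

print(on_diagonal((1, 2), (4, 5), (3, 4)))  # True
print(on_diagonal((1, 2), (4, 5), (2, 5)))  # False

If you want to avoid floating-point equality entirely, the same test can be cross-multiplied: (n_i - n1) * (m2 - m1) == (n2 - n1) * (m_i - m1).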

Algorithm to smooth a curve while keeping the area under it constant

Consider a discrete curve defined by the points (x1,y1), (x2,y2), (x3,y3), ... ,(xn,yn)
Define a constant SUM = y1+y2+y3+...+yn. Say we change the values of some k of the y points (increasing or decreasing them) such that the total sum of these changed points is less than or equal to the constant SUM.
What would be the best possible manner to adjust the other y points given the following two conditions:
The total sum of the y points (y1'+y2'+...+yn') should remain constant, i.e., SUM.
The curve should retain as much of its original shape as possible.
A simple solution would be to define some delta as follows:
delta = (ym1' + ym2' + ym3' + ... + ymk') - (ym1 + ym2 + ym3 + ... + ymk)
and to distribute this delta over the rest of the points equally. Here ym1' is the value of a modified point after modification and ym1 its value before modification, so delta is the total difference introduced by the modifications.
However, this would not ensure a totally smooth curve, as the area near the changed points would appear ragged. Does a better solution/algorithm exist for this problem?
I've used the following approach, though it is a bit OTT.
Consider adding d[i] to y[i], to get s[i], the smoothed value.
We seek to minimise
S = Sum{ 1<=i<N-1 | sqr( s[i+1] - 2*s[i] + s[i-1]) } + f*Sum{ 0<=i<N | sqr( d[i]) }
The first term is a sum of the squares of (an approximate) second derivative of the curve, and the second term penalises moving away from the original. f is a (positive) constant. A little algebra recasts this as
S = sqr( ||A*d - b||)
where the matrix A has a nice structure, and indeed A'*A is penta-diagonal, which means that the normal equations (ie d = Inv(A'*A)*A'*b) can be solved efficiently. Note that d is computed directly, there is no need to initialise it.
Given the solution d to this problem we can compute the solution d^ to the same problem but with the constraint One'*d = 0 (where One is the vector of all ones) like this
d^ = d - (One'*d/Q) * e
e = Inv(A'*A)*One
Q = One'*e
What value should you use for f? A simple approach is to try out this procedure on sample curves for various values of f and pick one that looks good. Another approach is to pick an estimate of smoothness, for example the rms of the second derivative, and a target value it should attain, and then search for an f that gives that value. As a general rule, the bigger f is, the less smooth the smoothed curve will be.
Some motivation for all this. The aim is to find a 'smooth' curve 'close' to a given one. For this we need a measure of smoothness (the first term in S) and a measure of closeness (the second term). Why these measures? Well, each is easy to compute, and each is quadratic in the variables (the d[]); this means the problem becomes an instance of linear least squares, for which efficient algorithms are available. Moreover, each term in each sum depends only on nearby values of the variables, which in turn means that the 'inverse covariance' (A'*A) will have a banded structure, so the least squares problem can be solved efficiently. Why introduce f? Well, if we didn't have f (or set it to 0) we could minimise S by setting d[i] = -y[i], getting a perfectly smooth curve s[] = 0, which has nothing to do with the y curve. On the other hand, if f is gigantic, then to minimise S we should concentrate on the second term and set d[i] = 0, and our 'smoothed' curve is just the original. So it's reasonable to suppose that as we vary f, the corresponding solutions vary between being very smooth but far from y (small f) and being close to y but a bit rough (large f).
It's often said that the normal equations, whose use I advocate here, are a bad way to solve least squares problems, and this is generally true. However, with 'nice' banded systems -- like the one here -- the loss of stability from using the normal equations is not so great, while the gain in speed is considerable. I've used this approach to smooth curves with many thousands of points in a reasonable time.
To see what A is, consider the case where we had 4 points. Then our expression for S comes down to:
sqr( s[2] - 2*s[1] + s[0]) + sqr( s[3] - 2*s[2] + s[1]) + f*(d[0]*d[0] + .. + d[3]*d[3]).
If we substitute s[i] = y[i] + d[i] in this we get, for example,
s[2] - 2*s[1] + s[0] = d[2]-2*d[1]+d[0] + y[2]-2*y[1]+y[0]
and so we see that for this to be sqr( ||A*d - b||) we should take (writing g = sqrt(f), so that squaring the last rows yields the f*sqr(d[i]) penalty)
A = ( 1 -2  1  0)
    ( 0  1 -2  1)
    ( g  0  0  0)
    ( 0  g  0  0)
    ( 0  0  g  0)
    ( 0  0  0  g)
and
b = ( -(y[2]-2*y[1]+y[0]))
( -(y[3]-2*y[2]+y[1]))
( 0 )
( 0 )
( 0 )
( 0 )
In an implementation, though, you probably wouldn't want to form A and b, as they are only going to be used to form the normal equation terms, A'*A and A'*b. It would be simpler to accumulate these directly.
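To make the recipe concrete, here is a small numpy sketch of the whole scheme (dense and unoptimized, forming A and b explicitly for clarity; a real implementation would accumulate A'*A and A'*b directly and exploit the pentadiagonal structure, as noted above):

import numpy as np

def smooth_keep_sum(y, f):
    # Minimise ||A*d - b||^2 subject to One'*d = 0, then return s = y + d.
    n = len(y)
    D = np.zeros((n - 2, n))                 # second-difference operator
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    A = np.vstack([D, np.sqrt(f) * np.eye(n)])
    b = np.concatenate([-(D @ y), np.zeros(n)])
    AtA, Atb = A.T @ A, A.T @ b
    d = np.linalg.solve(AtA, Atb)            # unconstrained solution
    e = np.linalg.solve(AtA, np.ones(n))     # e = Inv(A'*A)*One
    d = d - (d.sum() / e.sum()) * e          # enforce One'*d = 0
    return y + d

y = np.array([0.0, 5.0, 1.0, 4.0, 2.0, 3.0])
s = smooth_keep_sum(y, 1.0)
print(y.sum(), s.sum())                      # equal: the sum is preserved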
This is a constrained optimization problem. The functional to be minimized is the integrated difference between the original curve and the modified curve. The constraints are the area under the curve and the new locations of the modified points. It is not easy to write such code on your own; it is better to use some open-source optimization code, like this one: ool.
What about keeping the same dynamic range?
1. compute original min0, max0 y-values
2. smooth y-values
3. compute new min1, max1 y-values
4. linearly interpolate all values to match the original min/max y (note the leading min0, so the range is moved back as well as rescaled):
y = min0 + (y - min1) * (max0 - min0) / (max1 - min1)
that is it
I'm not sure about the area, but this should keep the shape much closer to the original one. I got this idea just now while reading your question, and as I face a similar problem I will try to code it right away. Anyway, +1 for getting me this idea :)
You can adapt this and combine it with the area: before all this, compute the area, then apply #1..#4, and after that compute the new area. Then multiply all values by the old_area/new_area ratio. If you also have negative values and are not computing the absolute area, then you have to handle positive and negative areas separately and find a multiplication ratio that best fits the original area for both at once.
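A quick numpy sketch of steps #1..#4 plus the area-ratio correction, assuming all values are positive (negative values would need the separate handling described above, and the 3-tap filter is just a stand-in for whatever smoothing you use):

import numpy as np

def smooth_keep_range_and_area(y, passes=5):
    min0, max0, area0 = y.min(), y.max(), y.sum()
    s = y.astype(float)
    for _ in range(passes):                  # simple 3-tap FIR smoothing
        s[1:-1] = 0.25 * s[:-2] + 0.5 * s[1:-1] + 0.25 * s[2:]
    min1, max1 = s.min(), s.max()
    s = min0 + (s - min1) * (max0 - min0) / (max1 - min1)   # restore range
    return s * (area0 / s.sum())             # restore area by ratio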
[edit1] some results for constant dynamic range
As you can see, the shape shifts slightly to the left. Each image is after applying a few hundred smoothing operations. I am thinking of subdividing into local min/max intervals to improve this...
[edit2] I have finished the filter for my own purposes
void advanced_smooth(double *p, int n)
{
    int i, j, i0, i1;
    double a0, a1, b0, b1, dp, w0, w1;
    double *p0, *p1, *w; int *q;
    if (n < 3) return;
    p0 = new double[n << 2]; if (p0 == NULL) return;
    p1 = p0 + n;
    w  = p1 + n;
    q  = (int*)((double*)(w + n));
    // compute original min,max
    for (a0 = p[0], i = 0; i < n; i++) if (a0 > p[i]) a0 = p[i];
    for (a1 = p[0], i = 0; i < n; i++) if (a1 < p[i]) a1 = p[i];
    for (i = 0; i < n; i++) p0[i] = p[i]; // store original values for range restoration
    // compute local min,max positions to p1[]
    dp = 0.01 * (a1 - a0); // min delta threshold
    // compute first derivative
    p1[0] = 0.0; for (i = 1; i < n; i++) p1[i] = p[i] - p[i-1];
    for (i = 1; i < n - 1; i++) // eliminate glitches
        if (p1[i] * p1[i-1] < 0.0)
            if (p1[i] * p1[i+1] < 0.0)
                if (fabs(p1[i]) <= dp)
                    p1[i] = 0.5 * (p1[i-1] + p1[i+1]);
    for (i0 = 1; i0; ) // remove zeros from derivative
        for (i0 = 0, i = 0; i < n; i++)
            if (fabs(p1[i]) < dp)
            {
                if ((i > 0) && (fabs(p1[i-1]) >= dp)) { i0 = 1; p1[i] = p1[i-1]; }
                else if ((i < n-1) && (fabs(p1[i+1]) >= dp)) { i0 = 1; p1[i] = p1[i+1]; }
            }
    // find local min,max to q[]
    q[n-2] = 0; q[n-1] = 0; for (i = 1; i < n - 1; i++) if (p1[i] * p1[i-1] < 0.0) q[i-1] = 1; else q[i-1] = 0;
    for (i = 0; i < n; i++) // set sign as +max,-min
        if ((q[i]) && (p1[i] < -dp)) q[i] = -q[i]; // this shifts the smoothed curve to the left !!!
    // compute weights
    for (i0 = 0, i1 = 1; i1 < n; i0 = i1, i1++) // loop through all local min,max intervals
    {
        for (; (!q[i1]) && (i1 < n - 1); i1++); // <i0,i1>
        b0 = 0.5 * (p[i0] + p[i1]);
        b1 = fabs(p[i1] - p[i0]);
        if (b1 >= 1e-6)
            for (b1 = 0.35 / b1, i = i0; i <= i1; i++) // compute weights, bigger near local min,max
                w[i] = 0.8 + (fabs(p[i] - b0) * b1);
    }
    // smooth a few times
    for (j = 0; j < 5; j++)
    {
        for (i = 0; i < n; i++) p1[i] = p[i]; // store data to avoid shifting caused by using half-filtered data
        for (i = 1; i < n - 1; i++) // FIR smooth filter
        {
            w0 = w[i];
            w1 = (1.0 - w0) * 0.5;
            p[i] = (w1 * p1[i-1]) + (w0 * p1[i]) + (w1 * p1[i+1]);
        }
        for (i = 1; i < n - 1; i++) // avoid local min,max shifting too much
        {
            if (q[i] > 0) // local max
            {
                if (p[i] < p[i-1]) p[i] = p[i-1]; // cannot be lower than neighbours
                if (p[i] < p[i+1]) p[i] = p[i+1];
            }
            if (q[i] < 0) // local min
            {
                if (p[i] > p[i-1]) p[i] = p[i-1]; // cannot be higher than neighbours
                if (p[i] > p[i+1]) p[i] = p[i+1];
            }
        }
    }
    for (i0 = 0, i1 = 1; i1 < n; i0 = i1, i1++) // loop through all local min,max intervals
    {
        for (; (!q[i1]) && (i1 < n - 1); i1++); // <i0,i1>
        // restore original local min,max
        a0 = p0[i0]; b0 = p[i0];
        a1 = p0[i1]; b1 = p[i1];
        if (a0 > a1)
        {
            dp = a0; a0 = a1; a1 = dp;
            dp = b0; b0 = b1; b1 = dp;
        }
        b1 -= b0;
        if (b1 >= 1e-6)
            for (dp = (a1 - a0) / b1, i = i0; i <= i1; i++)
                p[i] = a0 + ((p[i] - b0) * dp);
    }
    delete[] p0;
}
So p[n] is the input/output data. There are a few things that can be tweaked, like:
the weights computation (the constants 0.8 and 0.35 mean the weights are in <0.8, 0.8 + 0.35/2>)
the number of smoothing passes (now 5 in the for loop)
the bigger the weight, the less the filtering (1.0 means no change)
The main idea behind it is:
find local extremes
compute weights for the smoothing, so that near local extremes there is almost no change in the output
smooth
repair the dynamic range per each interval between all local extremes
[Notes]
I also tried to restore the area, but that is incompatible with my task because it distorts the shape a lot. So if you really need the area, then focus on that and not on the shape. The smoothing mostly causes the signal to shrink, so after area restoration the shape rises in magnitude.
The current filter state has no noticeable side-shifting of the shape (which was the main goal for me). Some images for a more bumpy signal (the original filter was extremely poor on this):
As you can see, there is no visible shifting of the signal shape. The local extremes have a tendency to create sharp spikes after very heavy smoothing, but that was expected.
Hope it helps ...
