What is a "good" R value when comparing 2 signals using cross correlation? - algorithm

I apologize in advance for being a bit verbose: if you want to skip all the background mumbo jumbo, you can see my question down below.
This is pretty much a follow up to a question I previously posted on how to compare two 1D (time dependent) signals. One of the answers I got was to use the cross-correlation function (xcorr in MATLAB), which I did.
Background information
Perhaps a little background information will be useful: I'm trying to implement an Independent Component Analysis algorithm. One of my informal tests is to (1) create the test case by (a) generating 2 random vectors (1x1000), (b) combining the vectors into a 2x1000 matrix (called "S"), and (c) multiplying this by a 2x2 mixing matrix (called "A") to give me a new matrix (let's call it "T").
In summary: T = A * S
(2) I then run the ICA algorithm to generate the inverse of the mixing matrix (called "W"), (3) multiply "T" by "W" to (hopefully) give me a reconstruction of the original signal matrix (called "X")
In summary: X = W * T
(4) I now want to compare "S" and "X". Although "S" and "X" are 2x1000, I simply compare S(1,:) to X(1,:) and S(2,:) to X(2,:), each of which is 1x1000, making them 1D signals. (I have another step which makes sure that these vectors are the proper vectors to compare to each other, and I also normalize the signals.)
So my current quandary is how to 'grade' how close S(1,:) matches to X(1,:), and likewise with S(2,:) to X(2,:).
So far I have used something like: r1 = max(abs(xcorr(S(1,:), X(1,:))));
My question
Assuming that using the cross correlation function is a valid way to go about comparing the similarity of two signals, what would be considered a good R value to grade the similarity of the signals? Wikipedia states that this is a very subjective area, and so I defer to the better judgment of those who might have experience in this field.
As you might realize, I'm not coming from an EE/DSP/statistics background at all (I'm a medical student), so I'm going through a sort of "baptism through fire" right now, and I appreciate all the help I can get. Thanks!

(edit: as far as directly answering your question about R values, see below)
One way to approach this would be to use cross-correlation. Bear in mind that you have to normalize amplitudes and correct for delays: if you have signal S1, and signal S2 is identical in shape, but half the amplitude and delayed by 3 samples, they're still perfectly correlated.
For example:
>> t = 0:0.001:1;
>> y = @(t) sin(10*t).*exp(-10*t).*(t > 0);
>> S1 = y(t);
>> S2 = 0.4*y(t-0.1);
>> plot(t,S1,t,S2);
These should have a perfect correlation coefficient. A way to compute this is to use maximum cross-correlation:
>> f = @(S1,S2) max(xcorr(S1,S2));
f =
@(S1,S2) max(xcorr(S1,S2))
>> disp(f(S1,S1)); disp(f(S2,S2)); disp(f(S1,S2));
12.5000
2.0000
5.0000
The maximum value of xcorr() takes care of the time-delay between signals. As far as correcting for amplitude goes, you can normalize the signals so that their self-cross-correlation is 1.0, or you can fold that equivalent step into the following:
ρ^2 = f(S1,S2)^2 / (f(S1,S1)*f(S2,S2))
In this case ρ^2 = 5 * 5 / (12.5 * 2) = 1.0
You can solve for ρ itself, i.e. ρ = f(S1,S2)/sqrt(f(S1,S1)*f(S2,S2)), just bear in mind that both 1.0 and -1.0 are perfectly correlated (-1.0 has opposite sign)
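For illustration, here is a rough Python/NumPy analogue of the above (Python is my own assumption; the thread itself uses MATLAB, and np.correlate in "full" mode plays the role of xcorr):
import numpy as np

def peak_xcorr(s1, s2):
    # maximum of the full cross-correlation; insensitive to time shift
    return np.max(np.correlate(s1, s2, mode="full"))

def rho(s1, s2):
    # normalized peak correlation: f(S1,S2)/sqrt(f(S1,S1)*f(S2,S2))
    return peak_xcorr(s1, s2) / np.sqrt(peak_xcorr(s1, s1) * peak_xcorr(s2, s2))

t = np.linspace(0, 1, 1001)
y = lambda t: np.sin(10*t) * np.exp(-10*t) * (t > 0)
S1 = y(t)
S2 = 0.4 * y(t - 0.1)
print(rho(S1, S2))   # close to 1.0 for a scaled, delayed copy of the same signal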
Try it on your signals!
With respect to what threshold to use for acceptance/rejection, that really depends on what kind of signals you have. 0.9 and above is fairly good but can be misleading. I would consider looking at the residual signal you get after you subtract out the correlated version. You could do this by looking at the time index of the maximum value of xcorr():
>> t = 0:0.001:1;
>> y = @(a,t) sin(a*t).*exp(-a*t).*(t > 0);
>> S1=y(10,t);
>> S2=0.4*y(9,t-0.1);
>> f(S1,S2)/sqrt(f(S1,S1)*f(S2,S2))
ans =
0.9959
This looks pretty darn good for a correlation. But let's try fitting S2 with a scaled/shifted multiple of S1:
>> [A,i]=max(xcorr(S1,S2)); tshift = i-length(S1);
>> S2fit = zeros(size(S2)); S2fit(1-tshift:end) = A/f(S1,S1)*S1(1:end+tshift);
>> plot(t,[S2; S2fit]); % fit S2 using S1 as a basis
>> plot(t,[S2-S2fit]); % residual
The residual has some energy in it; to get a feel for how much, you can use this:
>> S2res=S2-S2fit;
>> dot(S2res,S2res)/dot(S2,S2)
ans =
0.0081
>> sqrt(dot(S2res,S2res)/dot(S2,S2))
ans =
0.0900
This says that the residual has about 0.81% of the energy (9% of the root-mean-square amplitude) of the original signal S2. (the dot product of a 1D signal with itself will always be equal to the maximum value of cross-correlation of that signal with itself.)
I don't think there's a silver bullet for answering how similar two signals are with each other, but hopefully I've given you some ideas that might be applicable to your circumstances.

A good starting point is to get a sense of what a perfect match will look like by calculating the auto-correlations for each signal (i.e. do the "cross-correlation" of each signal with itself).

THIS IS A COMPLETE GUESS - but I'm guessing max(abs(xcorr(S(1,:),X(1,:)))) > 0.8 implies success. Just out of curiosity, what kind of values do you get for max(abs(xcorr(S(1,:),X(2,:))))?
Another approach to validate your algorithm might be to compare A and W. If W is calculated correctly, it should be A^-1, so can you calculate a measure like |A*W - I|? Maybe you have to normalize by the trace of A*W.
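A minimal sketch of that check (Python/NumPy here, which is my own assumption rather than anything from the thread; A is a made-up mixing matrix and W is just a stand-in for the ICA estimate):
import numpy as np

A = np.array([[1.0, 0.5],
              [0.3, 2.0]])      # hypothetical mixing matrix
W = np.linalg.inv(A)            # stand-in for the estimated unmixing matrix
err = np.linalg.norm(A @ W - np.eye(2))
print(err)                      # near 0 when W really is A^-1
Keep in mind that ICA typically recovers the sources only up to permutation and scaling, so in practice you may need the reordering/normalization step the questioner already mentions before this comparison is meaningful.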
Getting back to your original question, I come from a DSP background, so I get to deal with fairly noise-free signals. I understand that's not a luxury you get in biology :) so my 0.8 guess might be very optimistic. Perhaps looking at some literature in your field, even if they aren't using cross-correlation exactly, might be useful.

Usually in such cases people talk about "false acceptance rate" and "false rejection rate".
The first one describes how many times the algorithm says "similar" for non-similar signals; the second one is the opposite.
Selecting a threshold thus becomes a trade-off between these criteria. To make FAR=0, threshold should be 1, to make FRR=0 threshold should be -1.
So probably, you will need to decide which trade-off between FAR and FRR is acceptable in your situation and this will give the right value for threshold.
Mathematically this can be expressed in different ways; a rough sketch of estimating these rates for a given threshold follows the list. Just a couple of examples:
1. fix one of the rates at an acceptable value and minimize the other one
2. minimize max(FRR, FAR)
3. minimize a*FRR + b*FAR
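To make the trade-off concrete, here is a rough sketch (Python/NumPy, my own assumption; the two score distributions are made up purely for illustration) of estimating FAR and FRR over a range of thresholds:
import numpy as np

def far_frr(scores_nonmatch, scores_match, threshold):
    far = np.mean(scores_nonmatch >= threshold)   # non-similar pairs accepted
    frr = np.mean(scores_match < threshold)       # similar pairs rejected
    return far, frr

rng = np.random.default_rng(0)
scores_match = rng.normal(0.9, 0.05, 1000)     # similarity scores for truly similar pairs
scores_nonmatch = rng.normal(0.3, 0.2, 1000)   # similarity scores for dissimilar pairs

for threshold in np.linspace(-1, 1, 9):
    far, frr = far_frr(scores_nonmatch, scores_match, threshold)
    print(f"threshold={threshold:+.2f}  FAR={far:.3f}  FRR={frr:.3f}")
Picking the threshold then amounts to choosing the point in that table (or on the corresponding ROC curve) that matches whichever of the criteria above you care about.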

Since they should be equal, the correlation coefficient should be high, between .99 and 1. I would take the max and abs functions out of your calculation, too.
EDIT:
I spoke too soon. I confused cross-correlation with correlation coefficient, which is completely different. My answer might not be worth much.

I would agree that the result would be subjective. Something that would involve the sum of the squares of the differences, element by element, would have some value. Two identical arrays would give a value of 0 in that form. You would have to decide what value then becomes "bad". Make up 2 different vectors that "aren't too bad" and find their cross-correlation coefficient to be used as a guide.
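A small sketch of that metric (Python/NumPy, my own assumption; the signals are made up):
import numpy as np

def ssd(a, b):
    # sum of squared differences, element by element
    d = np.asarray(a) - np.asarray(b)
    return float(np.dot(d, d))

s = np.sin(np.linspace(0, 10, 1000))
print(ssd(s, s))                                   # 0.0 for identical signals
print(ssd(s, s + 0.05*np.random.randn(1000)))      # grows as the signals drift apart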
(parenthetically: if you were doing a correlation coefficient where 1 or -1 would be great and 0 would be awful, I've been told by bio-statisticians that a real-life value of 0.7 is extremely good. I understand that this is not exactly what you are doing but the comment on correlation coefficient came up earlier.)

Related

How to set a convergence tolerance to an specific variable using Dymola?

So, I have a model of a tube with pressure loss, where the unknown is the mass flow rate. Normally, and in most models of this problem, the conservation equations are used to calculate the mass flow rate, but such models have lots of convergence issues (because of the blocked flow at the end of the tube, which results in an infinite pressure derivative at the end). See the figure below for a representation of the problem on the left and, on the right, a graph showing the infinite pressure derivative.
Because of that I'm using a model which is more robust, though it outputs not the mass flow rate but the tube length, which is already known. Therefore an iterative loop is needed to determine the mass flow rate. OK then, I coded a function length that, given the tube geometry, mass flow rate, and boundary conditions, outputs the calculated tube length, and wrote the equations like so:
parameter Real L;
Real m_flow;
...
equation
L = length(geometry, boundary, m_flow);
It simulates fine, but it takes ages... And it shouldn't because the mass flow rate is rather insensitive to the tube length, e.g. if L=3 I could say that m_flow has converged if the output of length is within L ± 0.1. On the other hand the default convergence tolerance of DASSL in Dymola is 0.0001, which is fine for all other variables, but a major setback to my model here...
That being said, I'd like to know if there's a (hacky) way of setting a specific tolerance for L (from annotations or something). I was unable to find any solution online or in Dymola's user manual... So far I've managed a workaround by making a second function which uses a Newton-Raphson method to determine the mass flow rate, something like:
function massflowrate
  input geometry, boundary, m_flow_start, tolerance;
  output m_flow;
protected
  Real error, L, dL, dLdm_flow, Delta_m_flow;
algorithm
  error := geometry.L;
  m_flow := m_flow_start;
  while error > tolerance loop
    L := length(geometry, boundary, m_flow);
    error := abs(geometry.L - L);
    // finite-difference derivative of the length with respect to m_flow
    dL := length(geometry, boundary, m_flow*1.001);
    dLdm_flow := (dL - L)/(0.001*m_flow);
    // Newton-Raphson update of the mass flow rate
    Delta_m_flow := (geometry.L - L)/dLdm_flow;
    m_flow := m_flow + Delta_m_flow;
  end while;
end massflowrate;
And then I use it in the equations section:
parameter Real L;
Real m_flow;
...
equation
m_flow = massflowrate(geometry, boundary, delay(m_flow,10), tolerance);
Nevertheless, this solution is not without its problems: the real equations are very non-linear, and depending on the boundary conditions the solver gets stuck in a never-ending loop... =/
PS: I'm sorry for the long post and the lack of a MWE, the real equations are very long and with loads of thermodynamics which I believe not to be of any help, be that as it may, if necessary, I'm able to provide the real model.
Is the length function smooth? To me, it being non-smooth seems like a likely cause of problems, and the suggestions by @Phil might also be good ideas.
However, it should also be possible to do what you want as follows:
Real m_flow(nominal=1e9);
Explanation: The equations are normally solved to a certain tolerance in unknowns - in this case m_flow.
The tolerance for each variable is a relative/absolute tolerance taking the nominal value into account, and Dymola does not allow you to set different tolerances for different variables.
Thus the simple way to compute m_flow less accurately is by setting a high nominal value for it, since the error tolerance will be tol*(abs(m_flow)+abs(nominal(m_flow))) or something like that.
The downside is that it may be too inaccurate, e.g. causing additional events, or that the error is so random that the solver is still slowed down.
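A toy illustration of that effect (Python, just arithmetic; the error formula is the answer's own approximation of what the solver does, not an exact Dymola specification):
tol = 1e-4
m_flow = 2.5                  # hypothetical current value of the variable
for nominal in (1.0, 1e9):
    allowed_error = tol * (abs(m_flow) + abs(nominal))
    print(f"nominal={nominal:g}  allowed error ~ {allowed_error:g}")
# nominal=1     -> allowed error ~ 0.00035
# nominal=1e+09 -> allowed error ~ 100000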

How does one do Algebra in Lua?

I've looked and tried but I can't find anything really helpful, so thank you in advance.
My problem is that I have a changing variable, "balance"; for the moment I have it represented as 200. I need to use this equation to find how much money I should withdraw in a game, but I don't know how to write a Lua script that solves algebra.
The equation is: 200/(x+x^2+x^3+x^4+x^5) = 0.00001001. How would I set about solving for x?
I have tried incrementing x by 0.0000001 whenever 200/(x+x^2+x^3+x^4+x^5) doesn't equal 0.00001001, but it is very impractical and I haven't gotten it to work. This is the only way I could come up with at the moment. Any help would be appreciated.
This solution finds a zero of any continuous function (not only algebraic and not necessarily differentiable) and requires knowing a range that brackets the root.
local function find_zero(f, x_left, x_right, eps)
  eps = eps or 0.0000000001 -- precision
  local f_left, f_right = f(x_left), f(x_right)
  assert(x_left <= x_right and f_left * f_right <= 0, "Wrong range")
  while x_right - x_left > eps do
    local x_middle = (x_left + x_right) / 2
    local f_middle = f(x_middle)
    if f_middle * f_left > 0 then
      x_left, f_left = x_middle, f_middle
    else
      x_right, f_right = x_middle, f_middle
    end
  end
  return (x_left + x_right) / 2
end
local function my_func(x)
  return 200/(x+x^2+x^3+x^4+x^5) - 0.00001001
end
-- Assuming that the root is between 1 and 1000
local x = find_zero(my_func, 1.0, 1000.0)
print(x) --> 28.643931367544
200/(x+x^2+x^3+x^4+x^5)=0.00001001 is equivalent to 200 = 0.00001001 * (x+x^2+x^3+x^4+x^5), so you have a polynomial equation to solve, and traditionally it is this form of the equation that people like to deal with.
If you want to stay in Lua, then if the form of the equation is predictable enough that you can find a place where the right side is always less than the left (e.g. x = 0) and a place where the right side is always greater than the left (e.g. very large values of x), then you can use binary search - not terribly efficient, but certain and easy to code.
For general polynomial equations, one well known method is https://en.wikipedia.org/wiki/Newton's_method. Given f(x) = 0 and a guess for x, a better guess might be x - f(x) / f'(x), where f'(x) is the derivative of f(x). There are a few pathological cases where this fails for various reasons, though, so again you probably want to know that your equation is reliably tractable.
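A sketch of Newton's method applied to this particular equation (Python rather than Lua, purely as an illustration of the idea):
# f(x) = 0.00001001*(x + x^2 + x^3 + x^4 + x^5) - 200, the rearranged polynomial form
def f(x):
    return 0.00001001 * (x + x**2 + x**3 + x**4 + x**5) - 200.0

def fprime(x):
    return 0.00001001 * (1 + 2*x + 3*x**2 + 4*x**3 + 5*x**4)

x = 100.0                      # rough starting guess
for _ in range(50):
    step = f(x) / fprime(x)
    x -= step
    if abs(step) < 1e-12:
        break
print(x)                       # about 28.64, matching the bisection result above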
Since you have Lua, you may be able to bring in C code that calls out to a maths library such as http://commons.apache.org/proper/commons-math/. They have a routine called LaguerreSolver() which will reasonably reliably solve polynomial equations for you, defending itself against all of the pathological cases. Most math libraries contain a lot more work than any single person is likely to put in for an individual problem, and are of correspondingly higher quality than a do-it-yourself approach such as I describe above.

Trouble implementing Perceptron in Scala

I'm taking the CalTech online course Learning From Data, and I'm stumped with creating a Perceptron in Scala. I chose Scala because I'm learning it and wanted to challenge myself. I understand the theory, and I also understand others' solutions in Python and Ruby. But I can't figure out why my own Scala code doesn't work.
For a background in the Perceptron code: Learning_algorithm
I'm running Scala 2.11 on OSX 10.10.
Per the algorithm, I start off with weights (0.0, 0.0, 0.0), where weight[2] is a learned bias component. I've already generated a test set in the space [-1, 1],[-1,1] on the X-Y plane. I do this by a) picking two random points and drawing a line through them, then b) generating some other random points and calculating if they are on one side of the line or the other. As far as I can tell by plotting it in Python, this generates linearly separable data.
My next step is to take my initialized weights and check against every point to find misclassified points, i.e. points that don't generate the right +1 or -1 result. Here is the code that simply calculates the dot product of the weight and the vector x:
def h(weight:List[Double], p:Point ): Double = if ( (weight(0)*p.x + weight(1)*p.y + weight(2)) > 0) 1 else -1
It's the initial weights, so they are all misclassified. I then update the weights, like so:
def newH(weight:List[Double], p:Point, y:Double): List[Double] = {
val newWt = scala.collection.mutable.ArrayBuffer[Double](0.0, 0.0, 0.0)
newWt(0) = weight(0) + p.x*y
newWt(1) = weight(1) + p.y*y
newWt(2) = weight(2) + 1*y
return newWt.toList
}
Then I identify misclassified points again by checking the test set against the value output by h() above, and continue iterating.
This follows the algorithm (or is supposed to, at least) that Prof Yaser shows here: Library
The problem is that the algorithm never converges. My weights -- the third component of which is the bias -- keep getting more negative or more positive. My weight vector after every adjustment resembles this:
Weights: List(16.43341624736786, 11627.122008800507, -34130.0)
Weights: List(15.533397436141968, 11626.464265227318, -34131.0)
Weights: List(14.726969361305237, 11626.837346673012, -34132.0)
Weights: List(14.224745154380798, 11627.646470665932, -34133.0)
Weights: List(14.075232982635498, 11628.026384592056, -34134.0)
I'm a Scala newbie so my code is probably atrocious. But am I missing something in Scala, e.g. reassignment, that could be causing my weight to be messed up? Or have I completely misunderstood how the Perceptron even operates? Is my weight update just wrong?
Thanks for any help you can give me on this!
Thanks Till. I've discovered the two problems with my code and I'll share them, but to address your point: someone else asked about this on the class's forum, and it looks like what the Wiki formula does is simply change the learning rate. Alpha can be picked arbitrarily, and y - h(weight, p) would give you values like
-1 - 1 = -2
in the case that y = -1 and h() = 1, or
1 - (-1) = 2
in the case that y = 1 and h() = -1.
My/the class formula uses 1*p.x instead of alpha*2*p.x, which seems to be just a matter of different learning rates. Hope that makes sense.
My two problems were as follows:
The y value passed into the recalculation formula newH needs to be the target value of y, that is, the "correct y" that was discovered while generating the test points. I was passing in the y that was generated through h(), which is the guessed-at function. This obviously makes sense, since we are looking to correct the weight by using the target y, not the incorrect y.
I was comparing the target y with the output of h() in Scala, but was comparing against an element obtained from a map through .get(). My Scala map looks like Map[Point, Double], where the Double value refers to the y value generated during test set creation. But doing a .get() gives you Option[Double] and not a Double value at all. This is explained in "Scala Map#get and the return of Some()" and makes a lot of sense now. I did map.get(<some Point>).get for now, since I was focusing on debugging and not code perfection, and then I was able to accurately compare the two Double values.
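For reference, here is a compact sketch of the whole perceptron learning algorithm being discussed (Python/NumPy rather than Scala, purely as an illustration; the data generation follows the poster's description, and the update uses the target label, per the fix above):
import numpy as np

rng = np.random.default_rng(1)

# two random points define the target line; label points by which side they fall on
a, b = rng.uniform(-1, 1, (2, 2))
def target(p):
    return 1.0 if (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0]) > 0 else -1.0

X = rng.uniform(-1, 1, (100, 2))          # training points in [-1, 1] x [-1, 1]
y = np.array([target(p) for p in X])

w = np.zeros(3)                           # [w_x, w_y, bias]
def h(w, p):
    return 1.0 if w[0]*p[0] + w[1]*p[1] + w[2] > 0 else -1.0

for _ in range(10000):                    # PLA: keep fixing misclassified points
    wrong = [i for i in range(len(X)) if h(w, X[i]) != y[i]]
    if not wrong:
        break
    i = rng.choice(wrong)
    w += y[i] * np.array([X[i][0], X[i][1], 1.0])   # update with the *target* y

print(w, "misclassified:", sum(h(w, p) != t for p, t in zip(X, y)))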

Lua: Code optimization vector length calculation

I have a script in a game with a function that gets called every second. Distances between player objects and other game objects are calculated there every second. The problem is that there can theoretically be 800 function calls in 1 second (max 40 players * 2 main objects (1 up to 10 sub-objects)). I have to optimize this function for less processing. This is my current function:
local square = math.sqrt;
local getDistance = function(a, b)
local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
return square(x*x+y*y+z*z);
end;
-- for example followed by: for i = 800, 1 do getDistance(posA, posB); end
I found out that localizing the math.sqrt function via
local square = math.sqrt;
is a big speed optimization, and the code
x*x+y*y+z*z
is faster than this code:
x^2+y^2+z^2
I don't know if localizing x, y and z is better than indexing the fields with "." twice, so maybe
square((a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y) + (a.z-b.z)*(a.z-b.z))
is better than the code
local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
square(x*x+y*y+z*z);
Is there a better way in maths to calculate the vector length or are there more performance tips in Lua?
You should read Roberto Ierusalimschy's Lua Performance Tips (Roberto is the chief architect of Lua). It touches on some of the small optimizations you're asking about (such as localizing library functions and replacing exponents with their multiplicative equivalents). Most importantly, it conveys one of the most important and overlooked ideas in engineering: sometimes the best solution involves changing your problem. You're not going to fix a 30-million-calculation leak by reducing the number of CPU cycles each calculation takes.
In your specific case of distance calculation, you'll find it's best to make your primitive calculation return the intermediate sum representing squared distance and allow the use case to call the final Pythagorean step only if they need it, which they often don't (for instance, you don't need to perform the square root to compare which of two squared lengths is longer).
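A sketch of that idea (Python rather than Lua, just to illustrate): have the primitive return the squared distance, and only take the square root when the caller actually needs a real length:
import math

def dist_sq(a, b):
    dx, dy, dz = a[0]-b[0], a[1]-b[1], a[2]-b[2]
    return dx*dx + dy*dy + dz*dz

def within_range(a, b, radius):
    return dist_sq(a, b) <= radius*radius   # comparison needs no sqrt at all

p1, p2 = (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)
print(within_range(p1, p2, 6.0))    # True
print(math.sqrt(dist_sq(p1, p2)))   # 5.0, computed only when the real distance matters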
This really should come before any discussion of optimization, though: don't worry about problems that aren't the problem. Rather than scouring your code for any possible issues, jump directly to fixing the biggest one; even if performance has overtaken missing functionality, bugs and/or UX shortcomings as your most glaring issue, it's nigh-impossible for micro-inefficiencies to have piled up to the point of outweighing a single bottleneck statement.
Or, as the opening of the cited article states:
In Lua, as in any other programming language, we should always follow the two maxims of program optimization:
Rule #1: Don’t do it.
Rule #2: Don’t do it yet. (for experts only)
I honestly doubt these kinds of micro-optimizations really help any.
You should be focusing on your algorithms instead: for example, get rid of some distance calculations through pruning, stop calculating square roots when you only need a comparison (tip: if a^2 < b^2 and a > 0 and b > 0, then a < b), etc.
Your "brute force" approach doesn't scale well.
What I mean by that is that every new object/player included in the system increases the number of operations significantly:
+---------+--------------+
| objects | calculations |
+---------+--------------+
| 40 | 1600 |
| 45 | 2025 |
| 50 | 2500 |
| 55 | 3025 |
| 60 | 3600 |
... ... ...
| 100 | 10000 |
+---------+--------------+
If you keep comparing "everything with everything", your algorithm will start taking more and more CPU cycles, in a quadratic way.
The best option you have for optimizing your code isn't in "fine tuning" the math operations or using local variables instead of references.
What will really boost your algorithm will be eliminating calculations that you don't need.
The most obvious example would be not calculating the distance between Player1 and Player2 if you already have calculated the distance between Player2 and Player1. This simple optimization should reduce your time by a half.
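A sketch of that (Python, for illustration only): start the inner loop at i + 1 so each pair is computed exactly once:
def all_pair_distances_sq(points):
    out = {}
    for i in range(len(points)):
        ax, ay, az = points[i]
        for j in range(i + 1, len(points)):     # j > i, so (i, j) is never redone as (j, i)
            bx, by, bz = points[j]
            dx, dy, dz = ax - bx, ay - by, az - bz
            out[(i, j)] = dx*dx + dy*dy + dz*dz
    return out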
Another very common implementation consists in dividing the space into "zones". When two objects are in the same zone, you calculate the distance between them normally. When they are in different zones, you use an approximation. The ideal way of dividing the space will depend on your context; an example would be dividing the space into a grid, and for players on different squares, using the distance between the centers of their squares, which you have computed in advance.
There's a whole branch of programming dealing with this issue; it's called Space Partitioning. Give this a look:
http://en.wikipedia.org/wiki/Space_partitioning
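A rough sketch of the grid variant (Python, for illustration only; cell_size is a tuning parameter you would choose for your game): bucket objects by cell, and only compute exact distances for objects in the same or neighbouring cells:
from collections import defaultdict
from itertools import product

def build_grid(points, cell_size):
    grid = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        cell = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        grid[cell].append(idx)
    return grid

def nearby_pairs(points, cell_size):
    grid = build_grid(points, cell_size)
    pairs = set()
    for (cx, cy, cz), members in grid.items():
        # check this cell and its 26 neighbours
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for i in members:
                for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                    if i < j:
                        pairs.add((i, j))
    return pairs   # exact distances are then needed only for these candidate pairs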
Seriously?
Running 800 of those calculations should not take more than 0.001 second - even in Lua on a phone.
Did you do some profiling to see if it's really slowing you down? Did you replace that function with "return (0)" to verify that performance improves (yes, the functionality will be lost)?
Are you sure it's run every second and not every millisecond?
I haven't seen an issue running 800 of anything simple in 1 second since, like, 1987.
If you want to compute sqrt for a positive number a, use the recursive sequence
x_0 = a
x_(n+1) = (x_n + a / x_n) / 2
x_n goes to sqrt(a) as n -> infinity; the first several iterations should already get you close enough.
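A sketch of that iteration (Python, for illustration; it is the Babylonian/Newton method for square roots and converges quadratically once the guess is in the right ballpark):
def my_sqrt(a, iterations=20):
    x = a                        # x_0 = a
    for _ in range(iterations):
        x = 0.5 * (x + a / x)    # x_(n+1) = (x_n + a/x_n) / 2
    return x

print(my_sqrt(2.0))   # about 1.41421356..., i.e. sqrt(2)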
BTW, maybe you can try the following formula for the length of a vector instead of the standard one.
local getDistance = function(a, b)
  local x, y, z = a.x-b.x, a.y-b.y, a.z-b.z;
  return math.abs(x) + math.abs(y) + math.abs(z); -- Manhattan (taxicab) distance
end;
It's much easier to compute, and in some cases (e.g. if the distance is only needed to know whether two objects are close) it may be adequate.

How to implement this iteration/convergence step by guessing a value in Matlab?

I have two parameters fL and fV, both functions of T and P. If I make a function called func(T), which takes only T as input, then how do I go about implementing this step in Matlab:
Guess P
if |(fL/fV)-1|<0.0001 % where fL and fV are both functions of T and P
then print P
else P=P*(fL/fV)
Initially it is advised to guess P at the beginning of the algorithm. All other steps before this involve formula calculation and don't involve any convergence, so I didn't write all those formulas. The important thing to note is that even though we take only T as input for our function, the pressure is guessed at the beginning of the code and is not part of any input by the user.
Thanks!
In order to "guess" P, you can either proceed using a) an educated guess or b) a random guess. So, for example if you were dealing with pressure in the day to day surroundings, 100kPa would be a reasonable guess. A random guess would mean initializing P to a random variable generated over a meaningful domain. So in my example, it could be a random variable uniformly distributed between 90kPa and 110kPa. Which of these approaches you choose depends on your specific problem.
You can code your requirements as follows
minP = 90; maxP = 110;
P = minP + (maxP-minP)*rand; %# a random guess between 90 & 110
%# ... some code here where you calculate fL and fV ...
if abs(fL/fV-1) < 0.0001
    fprintf('%f', P)
else
    P = P*fL/fV;
end
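If you need the full procedure rather than a single check, the guess-and-update step is normally wrapped in a loop that repeats until |fL/fV - 1| drops below the tolerance. A sketch of that loop (Python rather than MATLAB, for illustration only; f_ratio is a hypothetical stand-in for whatever code computes fL/fV from T and the current P):
def solve_pressure(T, P_guess, f_ratio, tol=1e-4, max_iter=200):
    P = P_guess
    for _ in range(max_iter):
        r = f_ratio(T, P)        # r = fL/fV at the current T and P
        if abs(r - 1.0) < tol:
            return P             # converged: fL and fV agree
        P = P * r                # otherwise update P and try again
    raise RuntimeError("pressure iteration did not converge")

# usage with a made-up ratio function, just to show the call shape
P = solve_pressure(T=350.0, P_guess=100e3, f_ratio=lambda T, P: (P / 101325.0) ** -0.1)
print(P)   # close to 101325, the fixed point of this toy ratio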
