Calculating spam probability

I am building a website in Python/Django and want to predict whether a user submission is valid or whether it is spam.
Users have an accept rate on their submissions, like this website has.
Users can moderate other users' submissions; and these moderations are later metamoderated by an admin.
Given this:
the registered user A, with a submission accept rate of 60%, submits something.
user B moderates A's post as a valid submission. However, user B is wrong 70% of the time.
user C moderates A's post as spam. User C is usually right. If user C says something is spam/ no spam, this will be correct 80% of the time.
How can I predict the chance of A's post being spam?
Edit: I made a python script simulating this scenario:
#!/usr/bin/env python
import random

def submit(p):
    """Return 'ham' with (p*100)% probability"""
    return 'ham' if random.random() < p else 'spam'

def moderate(p, ham_or_spam):
    """Moderate ham as ham and spam as spam with (p*100)% probability"""
    if ham_or_spam == 'spam':
        return 'spam' if random.random() < p else 'ham'
    if ham_or_spam == 'ham':
        return 'ham' if random.random() < p else 'spam'

NUMBER_OF_SUBMISSIONS = 100000
USER_A_HAM_RATIO = 0.6   # Will submit 60% ham
USER_B_PRECISION = 0.3   # Will moderate a submission correctly 30% of the time
USER_C_PRECISION = 0.8   # Will moderate a submission correctly 80% of the time

user_a_submissions = [submit(USER_A_HAM_RATIO)
                      for i in xrange(NUMBER_OF_SUBMISSIONS)]
print "User A has made %d submissions. %d of them are 'ham'." \
    % (len(user_a_submissions), user_a_submissions.count('ham'))

user_b_moderations = [moderate(USER_B_PRECISION, ham_or_spam)
                      for ham_or_spam in user_a_submissions]
user_b_moderations_which_are_correct = \
    [i for i, j in zip(user_a_submissions, user_b_moderations) if i == j]
print "User B has correctly moderated %d submissions." % \
    len(user_b_moderations_which_are_correct)

user_c_moderations = [moderate(USER_C_PRECISION, ham_or_spam)
                      for ham_or_spam in user_a_submissions]
user_c_moderations_which_are_correct = \
    [i for i, j in zip(user_a_submissions, user_c_moderations) if i == j]
print "User C has correctly moderated %d submissions." % \
    len(user_c_moderations_which_are_correct)

i = 0
j = 0
k = 0
for a, b, c in zip(user_a_submissions, user_b_moderations, user_c_moderations):
    if b == 'spam' and c == 'ham':
        i += 1
        if a == 'spam':
            j += 1
        elif a == "ham":
            k += 1

print "'spam' was identified as 'spam' by user B and 'ham' by user C %d times." % j
print "'ham' was identified as 'spam' by user B and 'ham' by user C %d times." % k
print "If user B says it's spam and user C says it's ham, it will be spam \
%.2f percent of the time, and ham %.2f percent of the time." % \
    (float(j)/i*100, float(k)/i*100)
Running the script gives me this output:
User A has made 100000 submissions. 60194 of them are 'ham'.
User B has correctly moderated 29864 submissions.
User C has correctly moderated 79990 submissions.
'spam' was identified as 'spam' by user B and 'ham' by user C 2346 times.
'ham' was identified as 'spam' by user B and 'ham' by user C 33634 times.
If user B says it's spam and user C says it's ham, it will be spam 6.52 percent of the time, and ham 93.48 percent of the time.
Is the probability here reasonable? Would this be the correct way to simulate the scenario?

Bayes' Theorem tells us:
Let's change the letters for the events A and B to X and Y resp. because you're using A, B and C to stand for people, and it would make things confusing:
P(X|Y) = P(Y|X) P(X) / P(Y)
Edit: the following is slightly wrong because X should be this post _by A_ is spam, not just "this post is spam" (and thus Y should just be "B accepts A's post, C rejects it"). I'm not redoing the math right here because the numbers are changed anyway -- see other edit below for the right number and correct arithmetic.
You want X to mean "this post is spam", Y to stand for the combination of circumstances A has posted it, B approved it, C rejected it (and let's assume conditional independence of the circumstances in question).
We need P(X), the a priori probability that any post (no matter who makes it or approves it) is spam; P(Y), the a priori probability that a post would be made by A, approved by B, rejected by C (whether it's spam or not); and P(Y | X), same as the latter but with the post being spam.
As you may be noticing, you haven't really given us all the bits and pieces we need for the computation. Your three points tell us: a given post by A is spam with a probability of 0.4 (that seems to be how the first point reads); B's acceptance probability is 0.3, but we have no idea how that differs for spam and non-spam, except that there should be "little" difference (low accuracy); C's is 0.8, and again we don't know how that's influenced by spam vs non-spam, except that there should be "large" difference (high accuracy).
So we need some more numbers! The fact that C has high accuracy while accepting 80% of the posts tells us that overall spam must be surprisingly low -- if spam overall was the same 40% as for A, then C would have to accept half of it (even if he was perfect on always accepting non-spam) to get the overall 80% accept rate, and that would be hardly "high accuracy". So say spam overall is just 20% and C only accepts 1/4 of it (and rejects 1/16 of non-spam), pretty good accuracy indeed and overall matching the numbers you're giving.
Guessing for B, who accepts at 30% overall, and now "knowing" that spam overall is 20%, we could guess that B accepts 1/4 of the spam and only 5/16 of non-spam.
So: P(X)=0.2; P(Y)=0.3*0.2=0.06 (B's overall acceptance times C's rejection prob); P(Y|X)=0.4*0.25*0.75=0.075 (A's prob of spamming times B's prob of accepting spam times C's prob of rejecting spam).
So P(X|Y)=0.075*0.2/0.06=0.25 -- unless I've made some arithmetic error (quite possible, the point is mostly to show you how one can reason in such cases;-), the probability of this particular post being spam is 0.25 -- a bit higher than the probability of any random post being spam, lower than the probability of a random post by A being spam.
But of course (even under the simplifying hypothesis of conditional independence all over the place ;-) this computation is highly sensitive to my guesses/hypotheses about the ratios of false positives vs false negatives for B and C, and the overall spam ratio. There are five numbers of this kind involved (overall spam prob, conditional prob for each of B and C for spam and non-spam) and you only give us two relevant (linear) constraints (unconditional prob of acceptance for B and C) and two vague "handwaving" statements (about low and high accuracy), so there are plenty of degrees of freedom there.
If you can better estimate the five key numbers, the computation can be made more precise.
And, BTW, Python (and a fortiori Django) have absolutely nothing to do with the case -- I recommend you remove those irrelevant tags to get a broader range of responses!
Edit: the user clarifies (in a comment -- should really edit his Q!):
When I said "B's moderations' accept rate is a mere 30%" I mean that for every ten times B moderates something spam/no spam he makes the wrong decision 7 times. So there is a 70% chance he will tag something spam/no spam when it is not. For user C, "His moderations' accept rate is 80%" means that if C says something is spam or no spam, he is right 80% of the time. Overall chance of a registered user spamming is 20%.
...and asks me to redo the math (I'm assuming false positives and negatives are equally likely for each of B and C). Note that B's an excellent "contrarian indicator", since he's wrong 70% of the time!-).
Anyway: B's overall acceptance rate of A's posts must then be 0.6*0.3 (for when he accepts A's non-spam) + 0.4*0.7 (for when he accepts A's spam) = 0.18 + 0.28 = 0.46; C's must be 0.6*0.8 (accepting A's non-spam) + 0.4*0.2 (accepting A's spam) = 0.48 + 0.08 = 0.56, i.e. C rejects A's posts 44% of the time. So we have...:
P(X)=0.4 (I had it wrong at 0.2 before since I was ignoring the fact that A's probability of spamming is 0.4 -- the overall prob of spam is not relevant, since we know that the post is A's!); P(Y|X)=0.7*0.8=0.56 (B's prob of accepting spam times C's prob of rejecting spam); P(Y|not X)=0.3*0.2=0.06 (B's prob of accepting non-spam times C's prob of rejecting non-spam); and so, by total probability, P(Y) = 0.56*0.4 + 0.06*0.6 = 0.224 + 0.036 = 0.26 (B's and C's verdicts are independent only once we condition on whether the post is spam, so we can't simply multiply their overall acceptance/rejection rates).
So P(X|Y)=0.56*0.4/0.26=0.86 (rounding). IOW: while a priori the probability that a post of A's is spam is 0.4, both B's acceptance and C's rejection heighten it, so this specific post of A's has about an 86% chance of being spam.
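To make the arithmetic easy to re-check (and to re-run with different estimates), here is a small Python sketch of the same Bayes computation; the helper name is made up, and the probabilities are the ones from the clarified question:
def p_spam_given(p_spam, p_evidence_given_spam, p_evidence_given_ham):
    """Bayes' theorem: P(post is spam | observed moderation verdicts)."""
    p_evidence = (p_evidence_given_spam * p_spam +
                  p_evidence_given_ham * (1 - p_spam))
    return p_evidence_given_spam * p_spam / p_evidence

# A's posts are spam 40% of the time; B is right 30% of the time,
# C is right 80% of the time (same error rate on spam and non-spam).

# B accepts, C rejects -- the case discussed above:
print p_spam_given(0.4, 0.7 * 0.8, 0.3 * 0.2)   # ~0.86

# B rejects, C accepts -- the case the simulation script measures:
print p_spam_given(0.4, 0.3 * 0.2, 0.7 * 0.8)   # ~0.067, i.e. the ~6.5% the script prints
The second number also addresses the "is 6.52% reasonable?" part of the question: the simulation and the closed-form computation agree.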

One could use Bayesian classification to detect spam, selecting the training sets for spam and ham based on the results of moderation. Results could possibly also be weighted by the user's rate of accepted posts.
Results with a high probability of being spam could be pushed into a moderation workflow (if you have dedicated moderators). Similarly, a sample of results of former moderations could be submitted into a meta-moderation workflow to get a view on the quality of the classification (i.e. have we got an unacceptably high rate of false positives and negatives).
Finally, an 'appeal' where a user could complain about postings being unfairly classified could also push postings into a meta-moderation workflow. If the user has a history of rejection on appeal or excessively high rates of submitting appeals (maybe an attempt at a DOS attack) their postings might get assigned progressively lower priority in the appeals workflow.
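As a rough illustration of the classification idea (this is just a toy word-count naive Bayes in Python, not tied to Django or to any particular library; all names here are made up):
from collections import defaultdict
import math

class ToySpamClassifier(object):
    """Word-count naive Bayes for 'spam' vs 'ham'.
    Assumes it has been trained on at least one example of each class."""

    def __init__(self):
        self.word_counts = {'spam': defaultdict(int), 'ham': defaultdict(int)}
        self.label_counts = {'spam': 0, 'ham': 0}

    def train(self, text, label):
        self.label_counts[label] += 1
        for word in text.lower().split():
            self.word_counts[label][word] += 1

    def p_spam(self, text):
        words = text.lower().split()
        vocab = len(set(self.word_counts['spam']) | set(self.word_counts['ham']))
        total = sum(self.label_counts.values())
        scores = {}
        for label in ('spam', 'ham'):
            n = sum(self.word_counts[label].values())
            # log prior + log likelihoods, with add-one smoothing
            score = math.log(float(self.label_counts[label]) / total)
            for word in words:
                score += math.log((self.word_counts[label].get(word, 0) + 1.0) / (n + vocab))
            scores[label] = score
        # turn the two log scores back into P(spam)
        m = max(scores.values())
        odds = dict((k, math.exp(v - m)) for k, v in scores.items())
        return odds['spam'] / (odds['spam'] + odds['ham'])
Posts whose p_spam() comes out above some threshold would then be routed into the moderation workflow described above.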

We go at it more empirically.
We've found one of the best indicators of spam is the number of external links in a post/comment, since the whole point of the spam is to get you to go somewhere and buy something and/or make the friendly googlebot think the linked-to page is more interesting.
Our general rules for unregistered users are: 1 link might be OK, 2 is 80%+ probably spam, 3 or more and they're toast. We keep a list of the main domain names that appear in rejected posts and those become triggers even if in a 1 or 2 linker. You can also use a RBL, but be careful since they can be really draconian.
Something this simple might not work for you, but it drastically reduced the load on our moderators and we have had no complaints from real humans.
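For what it's worth, the rule is easy to prototype. A hedged Python sketch (the thresholds mirror the ones above; the regex and the blacklist contents are placeholders you would tune yourself):
import re

LINK_RE = re.compile(r'https?://([^/\s]+)', re.IGNORECASE)
BANNED_DOMAINS = set(['known-bad-domain.example'])  # built up from rejected posts

def link_spam_probability(post_text):
    """Rough spam probability for an unregistered user's post,
    based only on external link count and a domain blacklist."""
    domains = [d.lower() for d in LINK_RE.findall(post_text)]
    if any(d in BANNED_DOMAINS for d in domains):
        return 1.0    # a blacklisted domain is a trigger even in a 1- or 2-linker
    if len(domains) >= 3:
        return 1.0    # "3 or more and they're toast"
    if len(domains) == 2:
        return 0.8    # "2 is 80%+ probably spam"
    return 0.1 if domains else 0.0   # 1 link might be OK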

Related

Q-Learning values get too high

I've recently made an attempt to implement a basic Q-Learning algorithm in Golang. Note that I'm new to Reinforcement Learning and AI in general, so the error may very well be mine.
Here's how I implemented the solution to an m,n,k-game environment:
At each given time t, the agent holds the last state-action (s, a) and the acquired reward for it; the agent selects a move a' based on an Epsilon-greedy policy and calculates the reward r, then proceeds to update the value of Q(s, a) for time t-1
func (agent *RLAgent) learn(reward float64) {
    var mState = marshallState(agent.prevState, agent.id)
    var oldVal = agent.values[mState]

    agent.values[mState] = oldVal + (agent.LearningRate *
        (agent.prevScore + (agent.DiscountFactor * reward) - oldVal))
}
Note:
agent.prevState holds the previous state right after taking the action and before the environment responds (i.e. after the agent makes its move and before the other player makes a move). I use that in place of the state-action tuple, but I'm not quite sure if that's the right approach.
agent.prevScore holds the reward to previous state-action
The reward argument represents the reward for current step's state-action (Qmax)
With agent.LearningRate = 0.2 and agent.DiscountFactor = 0.8 the agent fails to reach 100K episodes because of state-action value overflow.
I'm using Go's float64 (standard IEEE 754-1985 double-precision floating point), which overflows at around ±1.80×10^308 and yields ±Infinity. That's too big a value, I'd say!
Here's the state of a model trained with a learning rate of 0.02 and a discount factor of 0.08 which got through 2M episodes (1M games with itself):
Reinforcement learning model report
Iterations: 2000000
Learned states: 4973
Maximum value: 88781786878142287058992045692178302709335321375413536179603017129368394119653322992958428880260210391115335655910912645569618040471973513955473468092393367618971462560382976.000000
Minimum value: 0.000000
The reward function returns:
Agent won: 1
Agent lost: -1
Draw: 0
Game continues: 0.5
But you can see that the minimum value is zero, and the maximum value is too high.
It may be worth mentioning that a simpler learning method I found in a Python script works perfectly fine and actually feels more intelligent! When I play against it, most of the time the result is a draw (it even wins if I play carelessly), whereas with the standard Q-Learning method I can't even let it win!
agent.values[mState] = oldVal + (agent.LearningRate * (reward - agent.prevScore))
Any ideas on how to fix this?
Is that kind of state-action value normal in Q-Learning?!
Update:
After reading Pablo's answer and the slight but important edit that Nick provided to this question, I realized the problem was prevScore containing the Q-value of previous step (equal to oldVal) instead of the reward of the previous step (in this example, -1, 0, 0.5 or 1).
After that change, the agent now behaves normally and after 2M episodes, the state of the model is as follows:
Reinforcement learning model report
Iterations: 2000000
Learned states: 5477
Maximum value: 1.090465
Minimum value: -0.554718
and out of 5 games with the agent, there were 2 wins for me (the agent did not recognize that I had two stones in a row) and 3 draws.
The reward function is likely the problem. Reinforcement learning methods try to maximize the expected total reward; it gets a positive reward for every time step in the game, so the optimal policy is to play as long as possible! The q-values, which define the value function (expected total reward of taking an action in a state then behaving optimally) are growing because the correct expectation is unbounded. To incentivize winning, you should have a negative reward every time step (kind of like telling the agent to hurry up and win).
See 3.2 Goals and Rewards in Reinforcement Learning: An Introduction for more insight into the purpose and definition of reward signals. The problem you are facing is actually exercise 3.5 in the book.
If I've understood well, in your Q-learning update rule you are using the current reward and the previous reward. However, the Q-learning rule only uses one reward (x are states and u are actions):
Q(x, u) <- Q(x, u) + alpha * (r + gamma * max over u' of Q(x', u') - Q(x, u))
On the other hand, you are assuming that the current reward is the same as the Qmax value, which is not true. So probably you are misunderstanding the Q-learning algorithm.
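For reference, here is the same update as a minimal tabular Q-learning sketch in Python (not the poster's Go code; the state/action encoding is left abstract and the names are mine):
from collections import defaultdict

Q = defaultdict(float)    # Q[(state, action)] -> estimated value
ALPHA = 0.2               # learning rate
GAMMA = 0.8               # discount factor

def q_update(state, action, reward, next_state, next_actions):
    """One Q-learning step: a single observed reward for (state, action),
    plus the discounted best estimate from the next state."""
    if next_actions:
        best_next = max(Q[(next_state, a)] for a in next_actions)
    else:
        best_next = 0.0   # terminal state
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
Note that the reward appears exactly once; bootstrapping happens through Q[(next_state, a)], never through a second reward term, which is exactly the confusion in the question's update.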

Ruby factorial code runs too slow

I am working on this hackerrank challenge.
I wanted to try the problem with ruby, and this is the code that I wrote:
gets.to_i.times do
  num_people = gets.to_i
  num_people_factorial = (1..num_people).inject(:*) || 1
  numerator = num_people_factorial
  if num_people > 2
    denominator = 2 * ((1..num_people - 2).inject(:*) || 1)
  elsif num_people == 2
    denominator = 2
  else
    denominator = 10000
  end
  puts numerator / denominator
end
The problem is, when I run it on my computer, I get the right answer, but when I run it through Hackerrank's system, the test cases time out - they don't execute quickly enough to be graded.
How can I optimize this code?
EDIT: When I submit this code, Hackerrank reports that the test cases timed out.
You must be vigilant to avoid the dreaded "XY" question! From the title you've assumed you need to compute a factorial. Why? You are over-thinking. The question is if there are N people in the room, how many handshakes will take place. Each of N people shakes hands with (N-1) people, but that counts each handshake twice, so the answer is:
N*(N-1)/2
I can't be sure, but I expect this will pass the benchmark test.
As you can quickly and easily compute this number, even without a computer, calculator or even paper and pencil, you now have another way to impress other guests at dinner parties.
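To see the whole thing spelled out, the per-test-case work collapses to one multiplication and one division. (The original is Ruby; this quick sketch of the same idea is in Python, to match the rest of this page.)
# Handshakes among n people: each unordered pair shakes hands exactly once.
t = int(raw_input())
for _ in xrange(t):
    n = int(raw_input())
    print n * (n - 1) // 2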

What is a simple and memory efficient algorithm for determining a link metric?

I want to obtain a metric for the quality of a wireless link between two nodes.
The problem is, those nodes are not exchanging messages very often, but each message contains the time when the next message is scheduled to be sent.
Currently, I'm using something like this:
if (message arrived in time)
    link_quality = link_quality/2 + 0.5
else
    link_quality = link_quality/2
as suggested in rfc3626
Now link quality obviously changes a lot; a single lost packet will cut it in half. It is only used for hysteresis.
Assume there are two nodes, A and B. A's link_quality for B means how well it currently receives messages from B. It then announces 1 + link_quality * METRIC_MAX (0 is invalid) to B, so B knows how well it can send messages to A.
Now the value A announces is subject to abrupt changes, so I thought I'd do something like this:
link_metric = (3 * link_metric + new_link_metric) / 4
Now this is slightly better, but it's still subject to a lot of fluctuation.
If I increase the 'weight' of the old value further, it will take quite a while before link_metric has a realistic value.
What would you suggest?
If I understood correctly what you meant, your definition of quality is based on packet loss only.
You can try to give more weight to the old quality you measured:
n = <try some different weights to get the behavior of it>

if (message arrived in time)
    link_quality = ((n-1)*link_quality + 1) / n
else
    link_quality = ((n-1)*link_quality + 0) / n
Or you can try a moving average (the average of the last N tries, equally weighted); to do this you need to store more data:
static int index
static double link_metrics[N]    /* floating point, so the divisions below aren't truncated */

link_quality -= link_metrics[index] / N
if (message arrived in time)
    link_metrics[index] = 1
else
    link_metrics[index] = 0
link_quality += link_metrics[index] / N
index = (index + 1) % N
And again play with N to see the behavior of this method.
This is a simple moving average; there are more types of MA you can try.
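If it helps, here are both suggestions as a small Python sketch (the weight and window size are the knobs to play with, as said above; the class names are just illustrative):
class EwmaLinkQuality(object):
    """Exponentially weighted: the influence of old samples decays geometrically."""
    def __init__(self, weight=8):
        self.weight = weight
        self.quality = 1.0            # start optimistic

    def update(self, arrived_in_time):
        sample = 1.0 if arrived_in_time else 0.0
        self.quality = ((self.weight - 1) * self.quality + sample) / self.weight

class WindowLinkQuality(object):
    """Simple moving average over the last n messages."""
    def __init__(self, n=16):
        self.samples = [1.0] * n      # start optimistic
        self.index = 0

    def update(self, arrived_in_time):
        self.samples[self.index] = 1.0 if arrived_in_time else 0.0
        self.index = (self.index + 1) % len(self.samples)

    def quality(self):
        return sum(self.samples) / len(self.samples)
A larger weight or window smooths out single lost packets but reacts more slowly to a link that has genuinely degraded.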

Are Haskell List Comprehensions Inefficient?

I started doing Project Euler and got to problem number 9. Since I was using Project Euler to learn Haskell, I decided to use list comprehensions (as shown in Learn You A Haskell). I did that, and GHCi took a while to figure out the triplet, which I figured was normal because of the calculations involved. Now, at work yesterday (I don't work as a programmer professionally, yet), I was talking to a friend who knows VBA and he wanted to try to find the answer in VBA. I thought it would be a fun challenge as well, and I churned out some basic for loops and if statements, but what got me was that the VBA version was much faster than the Haskell one.
My question is: are Haskell's list comprehensions incredibly inefficient? At first I thought it was just because I was in GHC's interactive mode, but then I realized VBA is interpreted too.
Please note, I didn't post my code because of it being an answer to project euler. If it will answer my question (as in I'm doing something wrong) then I will gladly post the code.
[edit]
Here is my Haskell list comprehension:
[(a,b,c) | c <- [1..1000], b <- [1..c], a <- [1..b], a+b+c==1000, a^2+b^2==c^2]
I guess I could've lowered the range on c but is that what is really slowing it down?
There are two things you could be doing with this problem that could make your code slow. One is how you are trying values for a, b and c. If you loop through all possible values for a, b, c from 1 to 1000, you'll be spending a long time. To give a hint, you can make use of a+b+c=1000 if you rearrange it for c. The other is that if you only use a list comprehension, it will process every possible value for a, b and c. The problem tells you that there is only one unique set of numbers that satisfies the problem, so if you change your answer from this:
[ a * b * c | .... ]
to:
head [ a * b * c | ... ]
then Haskell's lazy evaluation means that it will stop after finding the first answer. This is the Haskell equivalent of breaking out of your VBA loop when you find the first answer. When I used both these tips, I had an answer that completed very quickly (under a second) in ghci.
Addendum: I missed at first the condition a < b < c. You can also make use of this in your list comprehensions; it is valid to say things along the lines of:
[(a, b) | b <- [1..100], a <- [1..b-1]]
Consider this simplified version of your list comprehension:
[(a,b,c) | a <- [1..1000], b <- [1..1000], c <- [1..1000]]
This will give all possible combinations of a, b, and c. It's kind of like saying, "how many ways can three one-thousand-sided dice land?" The answer is 1000*1000*1000 = 1,000,000,000 different combinations. If it took 0.001 seconds to generate each combination, it would therefore take 1,000,000 seconds (~11.5 days) to finish all combinations. (OK, 0.001 seconds is actually pretty slow for a computer, but you get the idea)
When you add predicates to your list comprehension, it still takes the same amount of time to compute; in fact, it takes longer since it needs to check the predicate for each of the 1 billion combinations it computes.
Now consider your comprehension. It looks like it should be much faster, right?
[(a,b,c) | c <- [1..1000], b <- [1..c], a <- [1..b], a+b+c==1000, a^2+b^2==c^2]
There are 1000 choices for c. How many are there for b and a? Well, the average choice for c is 500. For all choices of c, then, there are an average of 500 choices for b (since b can range from 1 to c). Likewise, for all choices of c and b, there are an average of 250 choices for a. That's very hand-wavy, but I'm fairly sure it's accurate. So 1000 choices for c * 1000/2 choices for b * 1000/4 choices for a = 1 billion / 8 = 125 million. It's 8x faster, but if you paid attention, you'll notice it's actually the same big-Oh complexity as the simplified version above. If we compared "simplified" vs "improved" versions of the same problem, but from [1..100000] instead of [1..1000], the "improved" would still only be 8x faster than the "simplified".
Don't get me wrong, 8x is a wonderful constant-factor speedup. But unless you want to wait a couple hours to get the solution, you'll need to get a better big-Oh.
As Neil noted, the way to reduce the complexity of this problem is, for a given b and c, to choose the a that satisfies a+b+c=1000. That way, you're not trying a bunch of values for a that will fail. This drops the big-Oh complexity; you'll only be considering approximately 1000 * 500 * 1 = 500,000 combinations, instead of ~125,000,000.
Once you get the solution to the problem you can check out other people's Haskell solutions on the Project Euler site to get an idea of how others have solved it. Incidentally, here is a link to the referenced problem: http://projecteuler.net/index.php?section=problems&id=9
In addition to what everyone else has said about generating fewer elements in the generators, you can also switch to using Int instead of Integer as the type of the numbers. The default is Integer, but your numbers are small enough to fit in an Int.
(Also, to nitpick, Haskell list comprehensions have no speed. Haskell is a language definition with very little operational semantics. A particular Haskell implementation might have slow list comprehensions, though.)

What is a "good" R value when comparing 2 signals using cross correlation?

I apologize for being a bit verbose in advance: if you want to skip all the background mumbo jumbo you can see my question down below.
This is pretty much a follow up to a question I previously posted on how to compare two 1D (time dependent) signals. One of the answers I got was to use the cross-correlation function (xcorr in MATLAB), which I did.
Background information
Perhaps a little background information will be useful: I'm trying to implement an Independent Component Analysis algorithm. One of my informal tests is to (1) create the test case by (a) generating 2 random vectors (1x1000), (b) combining the vectors into a 2x1000 matrix (called "S"), and (c) multiplying this by a 2x2 mixing matrix (called "A") to give me a new matrix (let's call it "T").
In summary: T = A * S
(2) I then run the ICA algorithm to generate the inverse of the mixing matrix (called "W"), (3) multiply "T" by "W" to (hopefully) give me a reconstruction of the original signal matrix (called "X")
In summary: X = W * T
(4) I now want to compare "S" and "X". Although "S" and "X" are 2x1000, I simply compare S(1,:) to X(1,:) and S(2,:) to X(2,:), each of which is 1x1000, making them 1D signals. (I have another step which makes sure that these vectors are the proper vectors to compare to each other, and I also normalize the signals.)
So my current quandary is how to 'grade' how close S(1,:) matches to X(1,:), and likewise with S(2,:) to X(2,:).
So far I have used something like: r1 = max(abs(xcorr(S(1,:), X(1,:))))
My question
Assuming that using the cross correlation function is a valid way to go about comparing the similarity of two signals, what would be considered a good R value to grade the similarity of the signals? Wikipedia states that this is a very subjective area, and so I defer to the better judgment of those who might have experience in this field.
As you might realize, I'm not coming from a EE/DSP/statistical background at all (I'm a medical student) so I'm going through a sort of "baptism through fire" right now, and I appreciate all the help I can get. Thanks!
(edit: as far as directly answering your question about R values, see below)
One way to approach this would be to use cross-correlation. Bear in mind that you have to normalize amplitudes and correct for delays: if you have signal S1, and signal S2 is identical in shape, but half the amplitude and delayed by 3 samples, they're still perfectly correlated.
For example:
>> t = 0:0.001:1;
>> y = @(t) sin(10*t).*exp(-10*t).*(t > 0);
>> S1 = y(t);
>> S2 = 0.4*y(t-0.1);
>> plot(t,S1,t,S2);
These should have a perfect correlation coefficient. A way to compute this is to use maximum cross-correlation:
>> f = @(S1,S2) max(xcorr(S1,S2));
f =
@(S1,S2) max(xcorr(S1,S2))
>> disp(f(S1,S1)); disp(f(S2,S2)); disp(f(S1,S2));
12.5000
2.0000
5.0000
The maximum value of xcorr() takes care of the time-delay between signals. As far as correcting for amplitude goes, you can normalize the signals so that their self-cross-correlation is 1.0, or you can fold that equivalent step into the following:
ρ² = f(S1,S2)² / (f(S1,S1)*f(S2,S2))
In this case ρ² = 5 * 5 / (12.5 * 2) = 1.0
You can solve for ρ itself, i.e. ρ = f(S1,S2)/sqrt(f(S1,S1)*f(S2,S2)), just bear in mind that both 1.0 and -1.0 are perfectly correlated (-1.0 has opposite sign)
Try it on your signals!
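If you end up doing this outside MATLAB, the same normalized peak cross-correlation is easy to write with NumPy (a sketch under the same assumptions; np.correlate with mode='full' plays the role of xcorr here, and the function names are mine):
import numpy as np

def max_xcorr(s1, s2):
    """Peak of the full cross-correlation of two 1-D signals."""
    return np.max(np.correlate(s1, s2, mode='full'))

def xcorr_coefficient(s1, s2):
    """Delay-insensitive correlation coefficient in [-1, 1];
    |rho| = 1 means one signal is a scaled, shifted copy of the other."""
    return max_xcorr(s1, s2) / np.sqrt(max_xcorr(s1, s1) * max_xcorr(s2, s2))

# Roughly the same example as the MATLAB session above:
t = np.arange(0.0, 1.0, 0.001)
y = lambda t: np.sin(10 * t) * np.exp(-10 * t) * (t > 0)
S1 = y(t)
S2 = 0.4 * y(t - 0.1)
print xcorr_coefficient(S1, S2)   # very close to 1.0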
With respect to what threshold to use for acceptance/rejection, that really depends on what kind of signals you have. 0.9 and above is fairly good, but can be misleading. I would consider looking at the residual signal you get after you subtract out the correlated version. You could do this by looking at the time index of the maximum value of xcorr():
>> t = 0:0.001:1;
>> y = @(a,t) sin(a*t).*exp(-a*t).*(t > 0);
>> S1=y(10,t);
>> S2=0.4*y(9,t-0.1);
>> f(S1,S2)/sqrt(f(S1,S1)*f(S2,S2))
ans =
0.9959
This looks pretty darn good for a correlation. But let's try fitting S2 with a scaled/shifted multiple of S1:
>> [A,i]=max(xcorr(S1,S2)); tshift = i-length(S1);
>> S2fit = zeros(size(S2)); S2fit(1-tshift:end) = A/f(S1,S1)*S1(1:end+tshift);
>> plot(t,[S2; S2fit]); % fit S2 using S1 as a basis
>> plot(t,[S2-S2fit]); % residual
Residual has some energy in it; to get a feel for how much, you can use this:
>> S2res=S2-S2fit;
>> dot(S2res,S2res)/dot(S2,S2)
ans =
0.0081
>> sqrt(dot(S2res,S2res)/dot(S2,S2))
ans =
0.0900
This says that the residual has about 0.81% of the energy (9% of the root-mean-square amplitude) of the original signal S2. (the dot product of a 1D signal with itself will always be equal to the maximum value of cross-correlation of that signal with itself.)
I don't think there's a silver bullet for answering how similar two signals are with each other, but hopefully I've given you some ideas that might be applicable to your circumstances.
A good starting point is to get a sense of what a perfect match will look like by calculating the auto-correlations for each signal (i.e. do the "cross-correlation" of each signal with itself).
THIS IS A COMPLETE GUESS - but I'm guessing max(abs(xcorr(S(1,:),X(1,:)))) > 0.8 implies success. Just out of curiosity, what kind of values do you get for max(abs(xcorr(S(1,:),X(2,:))))?
Another approach to validate your algorithm might be to compare A and W. If W is calculated correctly, it should be A^-1, so can you calculate a measure like |A*W - I|? Maybe you have to normalize by the trace of A*W.
Getting back to your original question, I come from a DSP background, so I get to deal with fairly noise-free signals. I understand that's not a luxury you get in biology :) so my 0.8 guess might be very optimistic. Perhaps looking at some literature in your field, even if they aren't using cross-correlation exactly, might be useful.
Usually in such cases people talk about the "false acceptance rate" and the "false rejection rate".
The first one describes how often the algorithm says "similar" for non-similar signals; the second one is the opposite.
Selecting a threshold thus becomes a trade-off between these criteria. To make FAR=0, threshold should be 1, to make FRR=0 threshold should be -1.
So probably, you will need to decide which trade-off between FAR and FRR is acceptable in your situation and this will give the right value for threshold.
Mathematically this can be expressed in different ways. Just a couple of examples:
1. fix one of the rates at an acceptable value and minimize the other
2. minimize max(FRR, FAR)
3. minimize a*FRR + b*FAR
Since they should be equal, the correlation coefficient should be high, between .99 and 1. I would take the max and abs functions out of your calculation, too.
EDIT:
I spoke too soon. I confused cross-correlation with correlation coefficient, which is completely different. My answer might not be worth much.
I would agree that the result would be subjective. Something that would involve the sum of the squares of the differences, element by element, would have some value. Two identical arrays would give a value of 0 in that form. You would have to decide what value then becomes "bad". Make up 2 different vectors that "aren't too bad" and find their cross-correlation coefficient to be used as a guide.
(parenthetically: if you were doing a correlation coefficient where 1 or -1 would be great and 0 would be awful, I've been told by bio-statisticians that a real-life value of 0.7 is extremely good. I understand that this is not exactly what you are doing but the comment on correlation coefficient came up earlier.)
