Assessing the significance of a BLASTn score? - bioinformatics

I am running standalone command line blast to align many query sequences against a large database sequence of nucleotides. I can modify the command line parameters of the blastn program to change various parameters such as the match/mismatch scores.
I am wondering - for the 'bit score' that blastn outputs, does it make sense to compare the bit scores for alignments with identical query and database sequences but different match/mismatch parameters? I am trying to assess how well blast is performing with various parameter values, but I want to make sure that everything is being compared on even grounds. Thanks.

It's not clear to me why you think that comparing bit scores will give you an insight as to how well BLAST is performing. The usual method for doing
Unfortunately, much of the work on BLAST and other alignment programs is based on the theory of local, ungapped alignments, with that theory extended empirically to gapped alignments. In particular, the bit scores are calculated like this:
S' = ( lambda * S - ln(K) ) / ln(2)
In the formula above, K and lambda are constants for your substitution matrix, S is the score (sum of substitution and gap scores), and S' is the bit score. This means that your bit scores will certainly change as a result of varying the gap open/gap extend parameters, which means that your comparison is invalid. This is an unfortunate result of the fact that there is little theory about gapped alignments, so the optimal gap scores for a given system have to be measured empirically.
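As a rough illustration of the formula above, the conversion from raw score to bit score can be sketched in Python. The function name `bit_score` and the numeric values of lambda and K used below are assumptions for illustration only; the real constants come from the BLAST output for your particular scoring system.

```python
import math

def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit score S'.

    Implements S' = (lambda * S - ln(K)) / ln(2).
    """
    return (lam * raw_score - math.log(k)) / math.log(2)

# Hypothetical constants for two different scoring systems (illustration only).
params_a = {"lam": 1.28, "k": 0.46}
params_b = {"lam": 0.63, "k": 0.11}

# The same raw score maps to different bit scores under different constants.
score_a = bit_score(50, **params_a)
score_b = bit_score(50, **params_b)
print(round(score_a, 2), round(score_b, 2))
```

The point of the sketch is only that the bit score depends on the raw score and on the constants together, so changing the scoring parameters changes both at once.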
Because bit scores aren't comparable, I suggest you do your assessment based on an alternate set of data that doesn't involve the alignment scores. For example, if I'm interested in the optimal gap opening/gap extension parameters for comparing protein sequences, I can look at proteins of known structure and assess each parameter set based on its ability to make alignments that make structural sense. This avoids comparing the alignment scores entirely, which is good because comparing bit scores on their own isn't obviously useful.

I'm not sure you can do that.
Do you really need to vary the match/mismatch parameters? What is your aim?

It's not necessarily true that bit scores aren't comparable. From the BLAST documentation on NCBI's web site:
"Bit scores are normalized, which means that the bit scores from different alignments can be compared, even if different scoring matrices have been used."
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch16

Related

How many simulations do I need to run?

Hello, my problem is more related to the validation of a model. I have written a program in NetLogo that I am going to use in a report for my thesis, but now the question is: how many repetitions (simulations) do I need to run to justify my results? I have already read about some methods using a statistical approach, and my colleagues have suggested some nice mathematical operations, but I also want to know from people who work with computational models what kind of statistical test or mathematical method they use to decide that.
There are two aspects to this (1) How many parameter combinations (2) How many runs for each parameter combination.
(1) Generally you would do experiments, where you vary some of your input parameter values and see how some model output changes. Take the well known Schelling segregation model as an example, you would vary the tolerance value and see how the segregation index is affected. In this case you might vary the tolerance from 0 to 1 by 0.01 (if you want discrete) or you could just take 100 different random values in the range [0,1]. This is a matter of experimental design and is entirely affected by how fine you wish to examine your parameter space.
(2) For each experimental value, you also need to run multiple simulations so that you can calculate the average and reduce the impact of randomness in the simulation run. For example, say you ran the model with a value of 3 for your input parameter (whatever it means) and got a result of 125. How do you know whether the 'real' answer is 125 or something else? If you ran it 10 times and got 10 different numbers in the range 124.8 to 125.2, then 125 is not an unreasonable estimate. If you ran it 10 times and got numbers ranging from 50 to 500, then 125 is not a useful result to report.
The number of runs for each experiment set depends on the variability of the output and your tolerance. Even the 124.8 to 125.2 is not useful if you want to be able to estimate to 1 decimal place. Look up 'standard error of the mean' in any statistics text book. Basically, if you do N runs, then a 95% confidence interval for the result is the average of the results for your N runs plus/minus 1.96 x standard deviation of the results / sqrt(N). If you want a narrower confidence interval, you need more runs.
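The confidence-interval rule above can be sketched in a few lines of Python. The helper names `mean_confidence_interval` and `runs_needed` are made up for this example, and `runs_needed` assumes that the standard deviation seen in a small pilot batch is a fair estimate of the true one.

```python
import math
import statistics

def mean_confidence_interval(results, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% confidence interval."""
    n = len(results)
    mean = statistics.fmean(results)
    # Standard error of the mean: sample standard deviation / sqrt(N).
    sem = statistics.stdev(results) / math.sqrt(n)
    return mean, mean - z * sem, mean + z * sem

def runs_needed(pilot_results, half_width, z=1.96):
    """Estimate N such that z * s / sqrt(N) <= half_width, using pilot-run s."""
    s = statistics.stdev(pilot_results)
    return math.ceil((z * s / half_width) ** 2)

pilot = [124.8, 125.0, 125.2, 125.1, 124.9]
mean, low, high = mean_confidence_interval(pilot)
print(mean, low, high)
print(runs_needed(pilot, half_width=0.05))
```

Running a small pilot batch first and then calling something like `runs_needed` is one way to turn "depends on the variability and your tolerance" into a concrete number.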
The other thing to consider is that if you are looking for a relationship over the parameter space, then you need fewer runs at each point than if you are trying to do a point estimate of the result.
Not sure exactly what you mean, but maybe you can check the books of Hastie and Tibshirani
http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
especially the sections on resampling methods (cross-validation and bootstrap).
They also have a shorter book that covers the methods possibly relevant to your case, along with the R commands to run them. However, this book, as far as I know, is not free.
http://www.springer.com/statistics/statistical+theory+and+methods/book/978-1-4614-7137-0
Also, you could perturb the initial conditions to check that the outcome doesn't change after small perturbations of the initial conditions or parameters. On a larger scale, you can sometimes partition the parameter space according to the final state of the system.
1) The number of simulations for each parameter setting can be decided by studying the coefficient of variation Cv = s / u, where s and u are the standard deviation and mean of the result, respectively. It is explained in detail in the paper Coefficient of variance.
2) The simulations where parameters are changed can be analyzed using several methods illustrated in the paper Testing methods.
These papers provide careful analysis methods and refer to other papers which may be relevant to your question and your research.

Test the randomness of a black box that outputs random 64-bit floats

I got this interview question and need to write a function for it. I failed.
Because it is a phone interview question, I don't think what I am supposed to code really needs to be a perfect randomness tester.
Any ideas?
How to write some code to be a reasonable randomness tester within like 30 minutes during an interview?
Edit:
The numbers in this question are uniformly distributed.
As this is an interview question, I think the interviewers are looking to assess in two ways:
Ability to understand what the requirements of the problem really are.
Ability to think of some code that would address those requirements.
This could be a really good interview question in certain settings, especially if the interviewer were willing to prompt the candidate with questions as and when necessary.
In terms of understanding the requirements of the question, it helps if you know that this is a really difficult problem, witness the Diehard tests mentioned in pjs's answer. Fundamentally I think a candidate would need to demonstrate appreciation of two things:
(a) The overall distribution of the numbers should match the desired distribution (I'm assuming it is uniform in this case, but as @pjs points out in comments this assumption should be made explicit).
(b) Each number drawn should be independent from the previous numbers drawn.
With half an hour to code something up in a phone interview, you can't go very far. If I were answering this question I would try to suggest something like:
(a) To test the distribution, come up with a set of equal-sized bins for the floating point numbers, and count the numbers that fall into each bin. Plot a histogram and eyeball it (plotting the data is always a good idea). To extend this, you could use a chi-squared test, as described in amit's answer.
However, as discussed in the comments, and here:
"The main problem with chi squared test is the choice of number and size of the intervals. Although rules of thumb can help produce good results, there is no panacea for all kinds of applications."
To this end, the Kolmogorov-Smirnov test can be used. The idea behind this test is that a plot of the ordered data should be a good fit against the ideal ordered data (given by the cumulative distribution function). For a uniform distribution the ideal ordered data is a straight line: you expect the 10th percentile of the data to be 10% of the way through the range, the 20th to be 20% of the way through the range, and so on. So, programmatically, you could sort the data, plot it against the ideal values, and you should get a straight line. There is also a formal, quantitative statistical test you can apply, which is based on the differences between the actual and ideal values.
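A minimal sketch of that sorted-data comparison, assuming the black box is supposed to produce uniform values on [low, high]. The function names are invented for this example, and the 1.36 / sqrt(n) cutoff is the usual large-sample approximation for the 5% level rather than an exact critical value.

```python
import math

def ks_statistic_uniform(samples, low=0.0, high=1.0):
    """Largest gap between the empirical CDF and the uniform CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        ideal = (x - low) / (high - low)  # uniform CDF at x
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - ideal, ideal - i / n)
    return d

def looks_uniform(samples, low=0.0, high=1.0):
    """Rough 5%-level check using the asymptotic critical value."""
    critical = 1.36 / math.sqrt(len(samples))
    return ks_statistic_uniform(samples, low, high) <= critical
```

For example, `looks_uniform([random.random() for _ in range(10000)])` would usually return True for a decent generator, while a constant stream fails immediately.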
(b) To test independence, there are multiple approaches. Autocorrelation at various time lags is one fairly obvious one: to what extent is the value at time t similar to the value at time t+1, for example. The runs test is another nice one: you convert all the numbers into 1 or 0 depending on whether they fall above or below the median, and then the distribution of the length of runs can be used to construct a statistical test. The runs test can also be used to test for runs in one direction or another, as described here and here (this might be more useful in your case). Both of these have fairly straightforward implementations so long as you have the formulas to hand!
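Both independence checks can be sketched briefly. The version of the runs test below counts the number of runs above and below the median (the Wald-Wolfowitz form) rather than modelling run lengths, which is a simplification chosen for this example; the function names are hypothetical, and the code assumes the input is not constant.

```python
import math
import statistics

def autocorrelation(xs, lag=1):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = statistics.fmean(xs)
    total = sum((x - mean) ** 2 for x in xs)
    cross = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cross / total

def runs_test_z(xs):
    """Z-score for the number of runs above/below the median."""
    median = statistics.median(xs)
    signs = [x > median for x in xs if x != median]  # drop ties with the median
    n1 = sum(signs)
    n2 = len(signs) - n1
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / (
        (n1 + n2) ** 2 * (n1 + n2 - 1)
    )
    return (runs - expected) / math.sqrt(variance)
```

An autocorrelation far from 0, or a runs z-score beyond roughly plus or minus 1.96, would be a reason to doubt independence.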
Apart from the diehard tests, other good sources discussing random number generators include here and here.
To check whether a random number generator (or any other random process, for that matter) matches a desired model (in your case, a uniform distribution), you should use a statistical test such as Pearson's chi-squared test.
The test is based on collecting observations and comparing them with the frequencies expected under the theoretical model you assume the numbers come from.
At the end, the test gives you a p-value: the probability of seeing a sample at least this far from the expected counts if it really did come from the given model.
A simple example:
Given a die and the draws [5,3,5,5,1,1], is the die balanced (p=1/6 for each of {1,...,6})?
Given the above observations we create the Expected vector: E = [1,1,1,1,1,1] (each entry is N/6, where 6 is the number of outcomes and N is the number of draws, also 6 in the above example), and the Observed vector: O = [2,0,1,0,3,0].
From this we compute the statistic:
chi^2 = sum((O_i - E_i)^2 / E_i) = 1/1 + 1/1 + 0/1 + 1/1 + 4/1 + 1/1 = 8
Now we need to find the probability P(chi^2 >= 8) according to the chi^2 distribution. With six possible outcomes the test has five degrees of freedom, and this probability is about 0.16, so six draws are not enough to reject the hypothesis that the sample comes from an unbiased die. (Expected counts this small also strain the chi^2 approximation; in practice you would collect many more draws.)
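The statistic in this worked example is easy to compute in code. The sketch below reproduces the value 8 for the six draws, and adds a hypothetical `bin_counts` helper showing how floats assumed to lie in [0, 1) could be turned into observed counts for the same statistic.

```python
from collections import Counter

def chi_squared_statistic(observed, expected):
    """Pearson's statistic: sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def die_counts(draws, sides=6):
    """Observed count for each face 1..sides."""
    counts = Counter(draws)
    return [counts.get(face, 0) for face in range(1, sides + 1)]

def bin_counts(samples, bins=10):
    """Observed counts of floats in [0, 1) over equal-width bins."""
    counts = [0] * bins
    for x in samples:
        counts[min(int(x * bins), bins - 1)] += 1
    return counts

draws = [5, 3, 5, 5, 1, 1]
observed = die_counts(draws)
expected = [len(draws) / 6] * 6
print(observed, chi_squared_statistic(observed, expected))
```

Turning the statistic into a p-value needs the chi-squared distribution, which a statistics library or a printed table can supply.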
You're saying that they wanted you to recreate/reinvent the "diehard" battery of tests that it took Marsaglia many years to develop? I'd call them on unreasonable expectations.
Whatever distribution the random floats are supposed to have, say a uniform distribution over the interval [0,1], you can use the Kolmogorov-Smirnov test http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test to check whether a sample fails to follow the desired distribution. This can have advantages over the chi-squared test if you have many possible values (because if you have more possible values than samples, you have to define buckets for the chi-squared test, which makes the test less powerful than a general distribution check like Kolmogorov-Smirnov).

Classifying english words into rare and common

I'm trying to devise a method that will be able to classify a given number of english words into 2 sets - "rare" and "common" - the reference being to how much they are used in the language.
The number of words I would like to classify is bounded - currently at around 10,000, and include everything from articles, to proper nouns that could be borrowed from other languages (and would thus be classified as "rare"). I've done some frequency analysis from within the corpus, and I have a distribution of these words (ranging from 1 use, to tops about 100).
My intuition for such a system was to use word lists (such as the BNC word frequency corpus, wordnet, internal corpus frequency), and assign weights to its occurrence in one of them.
For instance, a word that has a mid-level frequency in the corpus (say 50) but appears in a word list W can be regarded as common, since it's one of the most frequent in the entire language. My question was: what's the best way to create a weighted score for something like this? Should I go discrete or continuous? In either case, what kind of classification system would work best for this?
Or do you recommend an alternative method?
Thanks!
EDIT:
To answer Vinko's question on the intended use of the classification -
These words are tokenized from a phrase (eg: book title) - and the intent is to figure out a strategy to generate a search query string for the phrase, searching a text corpus. The query string can support multiple parameters such as proximity, etc - so if a word is common, these params can be tweaked.
To answer Igor's question -
(1) how big is your corpus?
Currently, the list is limited to 10k tokens, but this is just a training set. It could go up to a few 100k once I start testing it on the test set.
2) do you have some kind of expected proportion of common/rare words in the corpus?
Hmm, I do not.
Assuming you have a way to evaluate the classification, you can use the "boosting" approach to machine learning. Boosting classifiers use a set of weak classifiers combined to a strong classifier.
Say, you have your corpus and K external wordlists you can use.
Pick N frequency thresholds. For example, you may have 10 thresholds: 0.1%, 0.2%, ..., 1.0%.
For your corpus and each of the external word lists, create N "experts", one expert per threshold per wordlist/corpus, total of N*(K+1) experts. Each expert is a weak classifier, with a very simple rule: if the frequency of the word is higher than its threshold, they consider the word to be "common". Each expert has a weight.
The learning process is as follows: assign the weight 1 to each expert. For each word in your corpus, make the experts vote. Sum their votes: 1 * weight(i) for "common" votes and (-1) * weight(i) for "rare" votes. If the result is positive, mark the word as common.
Now, the overall idea is to evaluate the classification and increase the weight of experts that were right and decrease the weight of the experts that were wrong. Then repeat the process again and again, until your evaluation is good enough.
The specifics of the weight adjustment depend on how you evaluate the classification. For example, if you don't have per-word evaluation, you may still evaluate the classification as "too many common" or "too many rare" words. In the first case, promote all the pro-"rare" experts and demote all pro-"common" experts; in the second case, do the opposite.
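The voting scheme described above can be sketched as follows. The data layout (a list of per-source frequencies for each word), the function names, and the multiplicative update with a fixed `rate` are all assumptions made for this example; the answer leaves the exact weight adjustment open, and this sketch assumes per-word labels are available.

```python
def make_experts(num_sources, thresholds):
    """One expert per (source, threshold) pair, each starting with weight 1."""
    return [
        {"source": s, "threshold": t, "weight": 1.0}
        for s in range(num_sources)
        for t in thresholds
    ]

def vote(expert, freqs):
    """+1 ("common") if the word's frequency in the expert's source beats its threshold."""
    return 1 if freqs[expert["source"]] > expert["threshold"] else -1

def classify(experts, freqs):
    """Weighted vote: positive total means common, otherwise rare."""
    total = sum(e["weight"] * vote(e, freqs) for e in experts)
    return "common" if total > 0 else "rare"

def update_weights(experts, freqs, true_label, rate=0.1):
    """Promote experts that voted correctly, demote those that did not."""
    target = 1 if true_label == "common" else -1
    for e in experts:
        e["weight"] *= (1 + rate) if vote(e, freqs) == target else (1 - rate)
```

Here `freqs[0]` would be the word's relative frequency in your own corpus and `freqs[1:]` its frequency in each external word list; repeating `classify` and `update_weights` over labelled words is the "again and again" loop.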
Your distribution is most likely a Pareto distribution (a superset of Zipf's law as mentioned above). I am shocked that the most common word is used only 100 times. Is this including "a" and "the" and words like that? You must have a small corpus if that is the case.
Anyways, you will have to choose a cutoff for "rare" and "common". One potential choice is the mean expected number of appearances (see the linked wiki article above to calculate the mean). Because of the "fat tail" of the distribution, a fairly small number of words will have appearances above the mean -- these are the "common". The rest are "rare". This will have the effect that many more words are rare than common. Not sure if that is what you are going for but you can just move the cutoff up and down to get your desired distribution (say, all words with > 50% of expected value are "common").
While this is not an answer to your question, you should know that you are reinventing the wheel here.
Information Retrieval experts have devised ways to weight search words according to their frequency. A very popular weight is TF-IDF, which uses a word's frequency in a document and its frequency in a corpus. TF-IDF is also explained here.
An alternative score is the Okapi BM25, which uses similar factors.
See also the Lucene Similarity documentation for how TF-IDF is implemented in a popular search library.
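For orientation, a bare-bones TF-IDF score might look like the sketch below. It uses the plain log(N / df) form of the inverse document frequency, which is only one of several variants in use (Lucene and BM25 weight things differently), and it assumes documents are already tokenized into lists of words.

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """Term frequency in one document times inverse document frequency."""
    tf = Counter(document)[term] / len(document)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

corpus = [["the", "cat"], ["the", "dog"], ["the", "quark"]]
print(tf_idf("the", corpus[0], corpus))    # a word found in every document
print(tf_idf("quark", corpus[2], corpus))  # a word found in one document
```

A word that appears in every document gets a weight of zero, which is the behaviour wanted for "common" words when building a search query.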

Algorithm to score similarness of sets of numbers

What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and putting the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable to results using different data sets. For example, using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing temperature cannot be compared to numbers generated with other data such as wind speed or precipitation, because the scale of the data is different. Some of the non-weather data is very large, so the mean square error algorithm generates numbers in the hundreds of thousands, compared to the tens or hundreds generated when using temperature.
I think the mean square error metric might work for applications such as weather compares. It's easy to calculate and gives numbers that do make sense.
Since you want to compare measurements over time, you can just leave out missing values from the calculation.
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
Use the pearson correlation coefficient. I figured out how to calculate it in an SQL query which can be found here: http://vanheusden.com/misc/pearson.php
In finance they use Beta to measure the correlation of 2 series of numbers. EG, Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the 2 series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
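A small sketch of the Beta calculation, using sample covariance and variance of percentage changes. The function names and the toy numbers are assumptions for illustration; real use would feed in daily returns for the two series being compared.

```python
def percent_changes(values):
    """Fractional change between consecutive observations."""
    return [(b - a) / a for a, b in zip(values, values[1:])]

def beta(series, benchmark):
    """Covariance(series, benchmark) / Variance(benchmark)."""
    n = len(series)
    mean_s = sum(series) / n
    mean_b = sum(benchmark) / n
    cov = sum((s - mean_s) * (b - mean_b) for s, b in zip(series, benchmark)) / (n - 1)
    var = sum((b - mean_b) ** 2 for b in benchmark) / (n - 1)
    return cov / var

benchmark_returns = [0.01, -0.02, 0.03, 0.00]
series_returns = [2 * r for r in benchmark_returns]  # moves twice as much
print(beta(series_returns, benchmark_returns))
```

Because both inputs are percentage moves, the result does not depend on the absolute scale of either series.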
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
for each historicalpoint
    distance = sqrt(
        pow(currentpoint.temp - historicalpoint.temp, 2) +
        pow(currentpoint.wind - historicalpoint.wind, 2) +
        pow(currentpoint.precip - historicalpoint.precip, 2))
    if distance is smaller than the largest distance in our match collection
        add historicalpoint to our match collection
        remove the match with the largest distance from our match collection
next
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
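The brute-force search can be written compactly in Python. This sketch assumes each observation is a dict with `temp`, `wind`, and `precip` keys and applies the 1.25 and 1.5 scale factors from the normalization step; `heapq.nsmallest` takes care of keeping only the closest matches.

```python
import heapq
import math

# Scale factors from the normalization step: temp is the reference scale.
SCALE = {"temp": 1.0, "wind": 1.25, "precip": 1.5}

def distance(a, b):
    """Euclidean distance between two observations after scaling each feature."""
    return math.sqrt(sum((SCALE[k] * (a[k] - b[k])) ** 2 for k in SCALE))

def closest_matches(current, history, count=3):
    """Return the `count` historical observations nearest to `current`."""
    return heapq.nsmallest(count, history, key=lambda day: distance(current, day))

today = {"temp": 70, "wind": 10, "precip": 0}
history = [
    {"temp": 71, "wind": 10, "precip": 0},
    {"temp": 30, "wind": 10, "precip": 0},
    {"temp": 70, "wind": 12, "precip": 0},
]
print(closest_matches(today, history, count=2))
```

Adding another feature only requires another entry in `SCALE`; the distance calculation itself does not change.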
Cheers.
Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all-- it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situations where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
{
    // How closely the two series rise and fall together (-1 to 1).
    double c = correlation(historical_set, forecast_set);
    double avg_history = average(historical_set);
    double avg_forecast = average(forecast_set);
    // Relative difference between the averages; assumes avg_forecast is not 0.
    double penalty = abs(avg_history - avg_forecast) / avg_forecast;
    return c - penalty;
}
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference between the averages of the two sets of numbers.
A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degree F, with 2000km/hr winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As an example, let's say that it's a windy, hot day: that might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3 km/h away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
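One way to sketch the quantile idea: convert each measurement to its empirical quantile in the historical record, then compare days by squared quantile differences. The function names and the dict-of-sorted-lists layout for the history are assumptions made for this example.

```python
from bisect import bisect_right

def empirical_quantile(value, history_sorted):
    """Fraction of historical values that are <= value."""
    return bisect_right(history_sorted, value) / len(history_sorted)

def quantile_distance(day_a, day_b, history):
    """Sum of squared quantile differences over all measures.

    `history` maps each measure name to a sorted list of past values.
    """
    return sum(
        (empirical_quantile(day_a[m], history[m])
         - empirical_quantile(day_b[m], history[m])) ** 2
        for m in history
    )

history = {"temp": [50, 60, 70, 80], "wind": [0, 5, 10, 20]}
hot_windy = {"temp": 70, "wind": 10}
mild_gusty = {"temp": 60, "wind": 20}
print(quantile_distance(hot_windy, mild_gusty, history))
```

Every measure ends up on the same 0 to 1 scale, so temperature, wind, and any much larger non-weather values contribute comparably.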
Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (temperatures measured on the same days but at different locations, for example), you can regress the first data set against the second,
and then test that the slope is equal to 1, and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions, of the y-values against their indices. http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
Maybe you can see your set of numbers as a vector (each number of the set being a component of the vector).
Then you can simply use dot product to compute the similarity of 2 given vectors (i.e. set of numbers).
You might need to normalize your vectors.
More : Cosine similarity
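A short sketch of the normalized dot product (cosine similarity), assuming both sets have the same length, the same ordering, and are not all zeros:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal
```

Note that this measures direction only: a day that is uniformly twice as warm scores as identical, so it may need to be combined with a magnitude check.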

Approximate string matching algorithms

Here at work, we often need to find a string from the list of strings that is the closest match to some other input string. Currently, we are using Needleman-Wunsch algorithm. The algorithm often returns a lot of false-positives (if we set the minimum-score too low), sometimes it doesn't find a match when it should (when the minimum-score is too high) and, most of the times, we need to check the results by hand. We thought we should try other alternatives.
Do you have any experiences with the algorithms?
Do you know how the algorithms compare to one another?
I'd really appreciate some advice.
PS: We're coding in C#, but you shouldn't care about it - I'm asking about the algorithms in general.
Oh, I'm sorry I forgot to mention that.
No, we're not using it to match duplicate data. We have a list of strings that we are looking for - we call it search-list. And then we need to process texts from various sources (like RSS feeds, web-sites, forums, etc.) - we extract parts of those texts (there are entire sets of rules for that, but that's irrelevant) and we need to match those against the search-list. If the string matches one of the strings in search-list - we need to do some further processing of the thing (which is also irrelevant).
We can not perform the normal comparison, because the strings extracted from the outside sources, most of the times, include some extra words etc.
Anyway, it's not for duplicate detection.
OK, Needleman-Wunsch (NW) is a classic end-to-end ("global") aligner from the bioinformatics literature. It was long ago available as "align" and "align0" in the FASTA package. The difference was that the "0" version wasn't as biased about avoiding end-gapping, which often made it easier to favor high-quality internal matches. Smith-Waterman, I suspect you're aware, is a local aligner and is the original basis of BLAST. FASTA had its own local aligner as well that was slightly different. All of these are essentially heuristic methods for estimating Levenshtein distance relevant to a scoring metric for individual character pairs (in bioinformatics, often given by Dayhoff/"PAM", Henikoff & Henikoff, or other matrices and usually replaced with something simpler and more reasonably reflective of replacements in linguistic word morphology when applied to natural language).
Let's not be precious about labels: Levenshtein distance, as referenced in practice at least, is basically edit distance and you have to estimate it because it's not feasible to compute it generally, and it's expensive to compute exactly even in interesting special cases: the water gets deep quick there, and thus we have heuristic methods of long and good repute.
Now as to your own problem: several years ago, I had to check the accuracy of short DNA reads against reference sequence known to be correct and I came up with something I called "anchored alignments".
The idea is to take your reference string set and "digest" it by finding all locations where a given N-character substring occurs. Choose N so that the table you build is not too big but also so that substrings of length N are not too common. For small alphabets like DNA bases, it's possible to come up with a perfect hash on strings of N characters and make a table and chain the matches in a linked list from each bin. The list entries must identify the sequence and start position of the substring that maps to the bin in whose list they occur. These are "anchors" in the list of strings to be searched at which an NW alignment is likely to be useful.
When processing a query string, you take the N characters starting at some offset K in the query string, hash them, look up their bin, and if the list for that bin is nonempty then you go through all the list records and perform alignments between the query string and the search string referenced in the record. When doing these alignments, you line up the query string and the search string at the anchor and extract a substring of the search string that is the same length as the query string and which contains that anchor at the same offset, K.
If you choose a long enough anchor length N, and a reasonable set of values of offset K (they can be spread across the query string or be restricted to low offsets) you should get a subset of possible alignments and often will get clearer winners. Typically you will want to use the less end-biased align0-like NW aligner.
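The anchoring step might be sketched like this. A Python dict stands in for the perfect hash and linked lists described above, the function names are invented for this example, and the actual NW alignment of each candidate window is left out.

```python
from collections import defaultdict

def build_anchor_index(references, n):
    """Map every N-character substring to the (reference id, start) pairs where it occurs."""
    index = defaultdict(list)
    for ref_id, ref in enumerate(references):
        for start in range(len(ref) - n + 1):
            index[ref[start:start + n]].append((ref_id, start))
    return index

def candidate_windows(query, references, index, n, offsets):
    """Yield (ref_id, window) pairs worth passing to an aligner.

    Each window is the reference substring that lines up with the query
    when the anchor at query offset K is matched to the reference.
    """
    for k in offsets:
        anchor = query[k:k + n]
        if len(anchor) < n:
            continue
        for ref_id, start in index.get(anchor, []):
            begin = start - k
            if begin < 0:
                continue  # the query would hang off the front of the reference
            yield ref_id, references[ref_id][begin:begin + len(query)]

references = ["GGACGTTTGACCGG"]
index = build_anchor_index(references, n=4)
print(list(candidate_windows("CGTTTGTC", references, index, n=4, offsets=[0, 4])))
```

Only the windows yielded here would be handed to the aligner, which is where the reduction in work comes from. A window may be shorter than the query if the match sits near the end of the reference.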
This method tries to boost NW a bit by restricting its input, and this has a performance gain because you do fewer alignments and they are more often between similar sequences. Another good thing to do with your NW aligner is to allow it to give up after some amount or length of gapping occurs to cut costs, especially if you know you're not going to see or be interested in middling-quality matches.
Finally, this method was used on a system with small alphabets, with K restricted to the first 100 or so positions in the query string and with search strings much larger than the queries (the DNA reads were around 1000 bases and the search strings were on the order of 10000, so I was looking for approximate substring matches justified by an estimate of edit distance specifically). Adapting this methodology to natural language will require some careful thought: you lose on alphabet size but you gain if your query strings and search strings are of similar length.
Either way, allowing more than one anchor from different ends of the query string to be used simultaneously might be helpful in further filtering data fed to NW. If you do this, be prepared to possibly send overlapping strings each containing one of the two anchors to the aligner and then reconcile the alignments... or possibly further modify NW to emphasize keeping your anchors mostly intact during an alignment using penalty modification during the algorithm's execution.
Hope this is helpful or at least interesting.
Related to the Levenshtein distance: you might wish to normalize it by dividing the result by the length of the longer string, so that you always get a number between 0 and 1 and can compare the distances of pairs of strings in a meaningful way (the expression L(A, B) > L(A, C), for example, is meaningless unless you normalize the distance).
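A compact sketch of that normalization, using the standard dynamic-programming edit distance with unit costs for insertion, deletion, and substitution. The function names are chosen for this example.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b with unit costs."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

def normalized_distance(a, b):
    """Edit distance divided by the length of the longer string (0 to 1)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

print(levenshtein("kitten", "sitting"), normalized_distance("kitten", "sitting"))
```

With the normalized value, a single threshold (say, accept anything below 0.2) behaves consistently for short and long strings alike.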
We are using the Levenshtein distance method to check for duplicate customers in our database. It works quite well.
Alternative algorithms to look at are agrep (Wikipedia entry on agrep), and the FASTA and BLAST biological sequence matching algorithms. These are special cases of approximate string matching, also in the Stony Brook algorithm repository. If you can specify the ways the strings differ from each other, you could probably focus on a tailored algorithm. For example, aspell uses some variant of "soundslike" (soundex-metaphone) distance in combination with a "keyboard" distance to accommodate bad spellers and bad typists alike.
Use FM Index with Backtracking, similar to the one in Bowtie fuzzy aligner
In order to minimize mismatches due to slight variations or errors in spelling, I've used the Metaphone algorithm, then Levenshtein distance (scaled to 0-100 as a percentage match) on the Metaphone encodings for a measure of closeness. That seems to have worked fairly well.
To expand on Cd-MaN's answer, it sounds like you're facing a normalization problem. It isn't obvious how to handle scores between alignments with varying lengths.
Given what you are interested in, you may want to obtain p-values for your alignment. If you are using Needleman-Wunsch, you can obtain these p-values using Karlin-Altschul statistics http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
BLAST can perform local alignments and evaluate them using these statistics. If you are concerned about speed, this would be a good tool to use.
Another option is to use HMMER. HMMER uses Profile Hidden Markov Models to align sequences. Personally, I think this is a more powerful approach since it also provides positional information. http://hmmer.janelia.org/

Resources