Find similar records in dataset - algorithm

I have a dataset of 25 integer fields and 40k records, e.g.
1:
field1: 0
field2: 3
field3: 1
field4: 2
[...]
field25: 1
2:
field1: 2
field2: 1
field3: 4
field4: 0
[...]
field25: 2
etc.
I'm testing with MySQL but am not tied to it.
Given a single record, I need to retrieve the records most similar to it; something like the lowest average difference of the fields. I started looking at the following, but I don't know how to map this onto the problem of searching for similarities in a large dataset.
https://en.wikipedia.org/wiki/Euclidean_distance
https://en.wikipedia.org/wiki/S%C3%B8rensen_similarity_index
https://en.wikipedia.org/wiki/Similarity_matrix

I know it's an old post, but for anyone who comes by seeking similar algorithms: one that works particularly well is cosine similarity. Find a way to vectorize your records, then look for the vectors with the minimum angle between them. If vectorizing a record is not trivial, you can instead vectorize the similarity between records via some known algorithm, and then look at the cosine similarity of those similarity vectors to the perfect-match vector (assuming perfect matches aren't the goal, since they're easy to find anyway).
I get tremendous results with this kind of matching, even comparing things like lists of people in various countries working on a particular project with various contributions to it. Vectorization then means looking at the number of country matches, country mismatches, the ratio of people in a matching country between the two datasets, etc. I use string edit-distance functions like Levenshtein distance to get a numeric value from string dissimilarities, but one could use phonetic matching, etc.
Just make sure the target vector is not all zeros: the zero vector [0 0 ... 0] has an undefined angle with any other vector. Sometimes, to get away from that problem, as in the edit-distance case, I give a perfect match (edit distance 0) a negative weight, so that perfect matches are really emphasized: -1 and 1 are farther apart than 1 and 2, which makes a lot of sense, since a perfect match is better than anything with even one misspelling.
Cos(theta) = (A dot B) / (Norm(A)*Norm(B)), where dot is the dot product and Norm is the Euclidean magnitude of the vector.
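As a minimal sketch of that formula in plain Python (the sample vectors are only illustrations):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # ≈ 1.0: same direction
print(cosine_similarity([1, 0], [0, 1]))        # 0.0: perpendicular
```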
Good luck!

Here's a possibility with straight average distance between each of the fields (the value after each minus is from the given record needing a match):
SELECT id,
(
ABS(field1-2)
+ ABS(field2-2)
+ ABS(field3-3)
+ ABS(field4-1)
+ ABS(field5-0)
+ ABS(field6-3)
+ ABS(field7-2)
+ ABS(field8-0)
+ ABS(field9-1)
+ ABS(field10-0)
+ ABS(field11-2)
+ ABS(field12-2)
+ ABS(field13-3)
+ ABS(field14-2)
+ ABS(field15-0)
+ ABS(field16-1)
+ ABS(field17-0)
+ ABS(field18-2)
+ ABS(field19-3)
+ ABS(field20-1)
+ ABS(field21-0)
+ ABS(field22-1)
+ ABS(field23-3)
+ ABS(field24-2)
+ ABS(field25-2)
)/25
AS distance
FROM mytable
ORDER BY distance ASC
LIMIT 20;
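Since you're not tied to MySQL: the same measure can be computed in one shot outside the database. A sketch with NumPy, where random data stands in for the real 40k-record table:

```python
import numpy as np

# hypothetical data: 40,000 records x 25 small integer fields
rng = np.random.default_rng(0)
data = rng.integers(0, 5, size=(40_000, 25))
query = data[0]                      # the record we want matches for

# mean absolute difference per record -- same measure as the SQL above
dist = np.abs(data - query).mean(axis=1)

# indices of the 20 most similar records (smallest distance first)
nearest = np.argsort(dist)[:20]
print(nearest[0], dist[nearest[0]])  # the best match is the query itself, distance 0.0
```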

Related

Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attribute

I am having a hard time with the question below. I am not sure if I got it correct, but either way, I need some help further understanding it. If anyone has time to explain, please do.
Design L1 and L2 distance functions to assess the similarity of bank customers. Each customer is characterized by the following attributes:
− Age (customer's age, a real number with maximum 90 years and minimum 15 years)
− Cr ("credit rating"), an ordinal attribute with values 'very good', 'good', 'medium', 'poor', and 'very poor'
− Av_bal (average account balance, a real number with mean 7000 and standard deviation 4000)
Using the L1 distance function, compute the distance between the following 2 customers: c1 = (55, good, 7000) and c2 = (25, poor, 1000). [15 points]
Using the L2 distance function, compute the distance between the above-mentioned 2 customers.
Answer with L1
d(c1,c2) = (c1.cr - c2.cr)/4 + (c1.avg.bal - c2.avg.bal/4000) * (c1.age - mean.age/std.age) - (c2.age - mean.age/std.age)
The question, as is, leaves some room for interpretation, mainly because "similarity" is not specified exactly. I will try to explain what the standard approach would be.
Usually, before you start, you want to normalize values so that they are roughly in the same range. Otherwise, your similarity will be dominated by the feature with the largest variance.
If you have no information about the distribution, just the range of the values, you want to normalize them to [0,1]. For your example this means
norm_age = (age-15)/(90-15)
For nominal values you want to find a mapping to ordinal values if you want to use Lp-norms. Note: this is not always possible (e.g., colors cannot intuitively be mapped to ordinal values). In your case, you can transform the credit rating like this:
cr = {0 if 'very good', 1 if 'good', 2 if 'medium', 3 if 'poor', 4 if 'very poor'}
Afterwards you can apply the same normalization as for age:
norm_cr = cr/4
Lastly, for normally distributed values you usually perform standardization by subtracting the mean and dividing by the standard deviation.
norm_av_bal = (av_bal-7000)/4000
Now that you have normalized your values, you can go ahead and define the distance functions:
L1(c1, c2) = |c1.norm_age - c2.norm_age| + |c1.norm_cr - c2.norm_cr| + |c1.norm_av_bal - c2.norm_av_bal|
and
L2(c1, c2) = sqrt((c1.norm_age - c2.norm_age)^2 + (c1.norm_cr - c2.norm_cr)^2 + (c1.norm_av_bal - c2.norm_av_bal)^2)
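Putting the normalization and both distance functions together, a quick sketch in Python (the helper names are my own):

```python
import math

# ordinal mapping for the credit rating, as described above
CR = {'very good': 0, 'good': 1, 'medium': 2, 'poor': 3, 'very poor': 4}

def normalize(age, cr, av_bal):
    return ((age - 15) / (90 - 15),    # range-normalize age to [0, 1]
            CR[cr] / 4,                # range-normalize credit rating to [0, 1]
            (av_bal - 7000) / 4000)    # standardize balance (mean 7000, sd 4000)

def l1(c1, c2):
    return sum(abs(a - b) for a, b in zip(normalize(*c1), normalize(*c2)))

def l2(c1, c2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(normalize(*c1), normalize(*c2))))

c1 = (55, 'good', 7000)
c2 = (25, 'poor', 1000)
print(l1(c1, c2))  # 0.4 + 0.5 + 1.5 = 2.4
print(l2(c1, c2))  # sqrt(0.16 + 0.25 + 2.25) ≈ 1.63
```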

Math function with three variables (correlation)

I want to analyse some data in order to program a pricing algorithm.
The following data is available:
I need a function/correlation factor of the three variables/dimensions which shows how the median (price) changes while the three dimensions (pers_capacity, number of bedrooms, number of bathrooms) grow.
e.g. Y(#pers_capacity,bedroom,bathroom) = ..
note:
- the screenshot below shows only part of the available data
- median => price per night
- yellow => #bathroom
e.g. For 2 persons, 2 bedrooms and 1 bathroom is the median price 187$ per night
Do you have some ideas how I can calculate the correlation/equation (f(..)=...) in order to get a reliable factor?
Kind regards
One typical approach would be formulating this as a linear model. Given three variables x, y and z which explain your observed values v, you assume v ≈ ax + by + cz + d and try to find a, b, c and d which match this as closely as possible, minimizing the squared error. This is called a linear least squares approximation. You can also refer to this Math SE post for one example of a specific linear least squares approximation.
If your dataset is sufficiently large, you may consider more complicated formulas. Things like
v ≈ a1*x^2 + a2*y^2 + a3*z^2 + a4*x*y + a5*x*z + a6*y*z + a7*x + a8*y + a9*z + a10
The above is non-linear in the variables but still linear in the coefficients ai so it's still a linear least squares problem.
Or you could apply transformations to your variables, e.g.
v ≈ a1*x + a2*y + a3*z + a4*exp(x) + a5*exp(y) + a6*exp(z) + a7
Looking at the residual errors (i.e. difference between predicted and observed values) in any of these may indicate terms worth adding.
Personally I'd try all this in R, since computing linear models is just one line in that language, and visualizing data is fairly easy as well.
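If R isn't at hand, a sketch of the basic linear least squares fit in Python with NumPy (the listing data here is made up purely for illustration):

```python
import numpy as np

# hypothetical listings: columns are pers_capacity, bedrooms, bathrooms
X = np.array([[2.0, 1.0, 1.0],
              [2.0, 2.0, 1.0],
              [4.0, 2.0, 1.0],
              [4.0, 2.0, 2.0],
              [6.0, 3.0, 2.0]])
v = np.array([120.0, 187.0, 210.0, 240.0, 300.0])  # median price per night

# fit v ≈ a*x + b*y + c*z + d, minimizing the squared error
A = np.column_stack([X, np.ones(len(X))])          # constant column for d
coeffs, residuals, rank, _ = np.linalg.lstsq(A, v, rcond=None)
a, b, c, d = coeffs
print(f"v ≈ {a:.1f}*x + {b:.1f}*y + {c:.1f}*z + {d:.1f}")

# residual errors: differences between predicted and observed values
print(v - A @ coeffs)
```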

Non-linear comparison sorting / scoring

I have an array I want to sort based on assigning a score to each element in the array.
Let's say the possible score range is 0-100. And to get that score we are going to use 2 comparison data points, one with a weighting of 75 and one with a weighting of 25. Let's call them valueA and valueB. And we will transpose each value into a score. So:
valueA (range = 0-10,000)
valueB (range = 0-70)
scoreA (range = 0 - 75)
scoreB (range = 0 - 25)
scoreTotal = scoreA + scoreB (0 - 100)
Now the question is how to transpose valueA to scoreA in a non-linear way, with heavier weighting for being close to the min value. What I mean by that is that for valueA, 0 would be a perfect score (75), but a value of say 20 would give a mid-point score of 37.5, a value of say 100 would give a very low score of say 5, and everything greater would trend towards 0 (e.g. a value of 5,000 would be essentially 0).
Ideally I could set up a curve with a few data points (say 4 quartile points) and then the algorithm would fit to that curve. Or maybe the simplest solution is to create a bunch of points on the curve (say 10) and do a linear transposition between each of those points? But I'm hoping there is a much simpler algorithm to accomplish this without figuring out all the points on the curve myself and then having to tweak 10+ variables. I'd rather have 1 or 2 inputs that define how steep the curve is. Possible?
I don't need something super complex or accurate, just a simple algorithm so there is greater weighting for being close to the min of the range, and way less weighting for being close to the max of the range. Hopefully this makes sense.
My stats math is so rusty I'm not even sure what this is called for searching for a solution. All those years of calculus and statistics for naught.
I'm implementing this in Objective C, but any c-ish/java-ish pseudo code would be fine.
A function you may want to try is
max / [(log(x+2)/log(2))^N]
where max is either 75 or 25 in your case. The log(x+2)/log(2) part ensures that f(0) == max (you can substitute log(x+C)/log(C) for any C > 1; a higher C will slow the curve's descent); the ^N determines how quickly the function drops towards 0 (plot it for a few values of N to get a picture of what's going on).
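A sketch of that function in Python (the parameters C and N are as above; the defaults are just a starting point to tweak):

```python
import math

def score(x, max_score, C=2.0, N=1.0):
    """max_score at x == 0, decaying toward 0 as x grows.
    C (> 1) slows the initial descent; N sharpens the drop."""
    return max_score / (math.log(x + C, C) ** N)

# sample values for max == 75; tune C and N until the curve fits your taste
for x in (0, 20, 100, 5000):
    print(x, round(score(x, 75), 2))
```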

Compare two arrays of points [closed]

I'm trying to find a way to find similarities between two arrays of different points. I drew circles around points that have similar patterns, and I would like to do some kind of automatic comparison in intervals of, let's say, 100 points, and tell what the coefficient of similarity is for each interval. As you can see, the arrays might not be perfectly aligned, so a point-to-point comparison would not be a good solution either (I suppose). Patterns that are slightly misaligned could still count as matching the pattern (but obviously with a smaller coefficient).
What similarity could mean (1 coefficient is a perfect match, 0 or less - is not a match at all):
Points 640 to 660 - Very similar (coefficient is ~0.8)
Points 670 to 690 - Quite similar (coefficient is ~0.5-~0.6)
Points 720 to 780 - Let's say quite similar (coefficient is ~0.5-~0.6)
Points 790 to 810 - Perfectly similar (coefficient is 1)
Coefficient is just my thoughts of how a final calculated result of comparing function could look like with given data.
I read many posts on SO, but none seemed to solve my problem. I would appreciate your help a lot. Thank you.
P.S. The perfect answer would be one that provides pseudo code for a function that accepts two data arrays (intervals of data) as arguments and returns a coefficient of similarity.
I also think High Performance Mark has basically given you the answer (cross-correlation). In my opinion, most of the other answers are only giving you half of what you need (i.e., dot product plus compare against some threshold). However, this won't consider a signal to be similar to a shifted version of itself. You'll want to compute this dot product N + M - 1 times, where N, M are the sizes of the arrays. For each iteration, compute the dot product between array 1 and a shifted version of array 2. The amount you shift array 2 increases by one each iteration. You can think of array 2 as a window you are passing over array 1. You'll want to start the loop with the last element of array 2 only overlapping the first element in array 1.
This loop will generate numbers for different amounts of shift, and what you do with that number is up to you. Maybe you compare it (or the absolute value of it) against a threshold that you define to consider two signals "similar".
Lastly, in many contexts, a signal is considered similar to a scaled (in the amplitude sense, not time-scaling) version of itself, so there must be a normalization step prior to computing the cross-correlation. This is usually done by scaling the elements of the array so that the dot product with itself equals 1. Just be careful to ensure this makes sense for your application numerically, i.e., integers don't scale very well to values between 0 and 1 :-)
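A sketch of that normalization plus cross-correlation in Python with NumPy (the sample signal is made up; np.correlate in 'full' mode produces exactly the N + M - 1 shifted dot products described above):

```python
import numpy as np

def norm_xcorr(a, b):
    """Cross-correlation after scaling each signal to unit energy,
    so a signal compares as identical to an amplitude-scaled copy of itself."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.sqrt(np.dot(a, a))   # now a . a == 1
    b = b / np.sqrt(np.dot(b, b))
    # 'full' mode yields len(a) + len(b) - 1 dot products, one per shift
    return np.correlate(a, b, mode='full')

sig = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
shifted = np.array([0.0, 0.0, 1.0, 2.0, 1.0])   # the same bump, shifted by one
xc = norm_xcorr(sig, shifted)
print(xc.max())   # close to 1.0: the signals match up to a shift
```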
I think High Performance Mark's suggestion is the standard way of doing the job.
A computationally lightweight alternative measure might be a dot product:
split both arrays into the same predefined index intervals.
consider the array elements in each interval as vector coordinates in a high-dimensional space.
compute the dot product of both vectors.
For non-negative data the dot product will not be negative. If the two vectors are perpendicular in their vector space, the dot product will be 0 (in fact that's how 'perpendicular' is usually defined in higher dimensions), and, for vectors of fixed magnitude, it is largest when the vectors are identical.
If you accept the geometric notion of perpendicularity as a (dis)similarity measure, here you go.
Caveat: this is an ad hoc heuristic chosen for computational efficiency. I cannot tell you about the mathematical/statistical properties of the process and its separation properties; if you need rigorous analysis, you'll probably fare better with correlation theory anyway, and should perhaps forward your question to math.stackexchange.com.
My attempt:
Total_sum = 0
For each index i in the range (m, n):
    k = Array1[i] * Array2[i]
    t1 = magnitude(Array1[i]); t2 = magnitude(Array2[i])
    k = k / (t1 * t2)   // +1 if the signs agree, -1 if they differ
    Total_sum = Total_sum + k
Coefficient = Total_sum / (n - m)
If all values agree, each term is 1 and Total_sum is (n - m) * 1; dividing by (n - m) gives a coefficient of 1. If the graphs are exact opposites, we get -1, and for other variations a value between -1 and 1 is returned.
This is not very efficient when the y range or the x range is huge, but I just wanted to give you an idea.
Another option would be to perform an element-wise XNOR:
sum = 1
For each index i in the range (m, n):
    k = Array1[i] xnor Array2[i]
    k = k / (pow(2, number_of_bits) - 1)   // scale k down to a value between 0 and 1
    sum = (sum + k) / 2
Coefficient = sum
Is this helpful?
You can define a distance metric for two vectors A and B of length N containing numbers in the interval [-1, 1], e.g. as
sum = 0
for i in 0 to N - 1:
    sum = sum + (A[i] - B[i])^2   // each term is in range 0 .. 4
sum = (sum / 4) / N               // now in range 0 .. 1
This now returns distance 1 for vectors that are completely opposite (one is all 1, another all -1), and 0 for identical vectors.
You can translate this into your coefficient by
coeff = 1 - sum
However, this is a crude approach because it does not take into account the fact that there could be horizontal distortion or shift between the signals you want to compare, so let's look at some approaches for coping with that.
You can sort both your arrays (e.g. in ascending order) and then calculate the distance / coefficient. This returns more similarity than the original metric, and is agnostic towards permutations / shifts of the signal.
You can also calculate the differentials and calculate distance / coefficient for those, and then you can do that sorted also. Using differentials has the benefit that it eliminates vertical shifts. Sorted differentials eliminate horizontal shift but still recognize different shapes better than sorted original data points.
You can then, e.g., average the different coefficients. Here is more complete code: the routine below calculates the coefficient for arrays A and B of a given size, taking d differentials (recursively) first. If sorted is true, the final (differentiated) arrays are sorted.
procedure calc(A, B, size, d, sorted):
    if d > 0:
        A' = new array[size - 1]
        B' = new array[size - 1]
        for i in 0 to size - 2:
            A'[i] = (A[i + 1] - A[i]) / 2   // keep in range -1..1 by dividing by 2
            B'[i] = (B[i + 1] - B[i]) / 2
        return calc(A', B', size - 1, d - 1, sorted)
    else:
        if sorted:
            A = sort(A)
            B = sort(B)
        sum = 0
        for i in 0 to size - 1:
            sum = sum + (A[i] - B[i]) * (A[i] - B[i])
        sum = (sum / 4) / size
        return 1 - sum   // return the coefficient
procedure similarity(A, B, size):
    a = 0
    a = a + calc(A, B, size, 0, false)
    a = a + calc(A, B, size, 0, true)
    a = a + calc(A, B, size, 1, false)
    a = a + calc(A, B, size, 1, true)
    return a / 4   // take the average
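The pseudo code translates to Python roughly as follows (inputs are assumed to be lists of values in [-1, 1], as above):

```python
def calc(A, B, d, sorted_=False):
    # take d differentials first, halving to stay in the -1..1 range
    for _ in range(d):
        A = [(A[i + 1] - A[i]) / 2 for i in range(len(A) - 1)]
        B = [(B[i + 1] - B[i]) / 2 for i in range(len(B) - 1)]
    if sorted_:
        A, B = sorted(A), sorted(B)
    # squared distance scaled into 0..1, then flipped into a coefficient
    s = sum((a - b) ** 2 for a, b in zip(A, B))
    return 1 - (s / 4) / len(A)

def similarity(A, B):
    # average the plain, sorted, differentiated and sorted-differentiated variants
    return (calc(A, B, 0, False) + calc(A, B, 0, True) +
            calc(A, B, 1, False) + calc(A, B, 1, True)) / 4

print(similarity([0.1, 0.5, 0.9, 0.5], [0.1, 0.5, 0.9, 0.5]))  # 1.0 for identical arrays
```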
For something completely different, you could also run Fourier transform using FFT and then take a distance metric on the returning spectra.

Algorithm to give more weight to the first word

Right now, I'm trying to create an algorithm that gives a score to a user, depending on his input in a text field.
This score is supposed to encourage the user to add more text to his personal profile.
The way the algorithm should work is that it should assign a certain weight to the first word, and a little less weight to the second word. The third word will receive a little less weight than the second word, and so on.
The goal is to encourage users to expand their texts, but to avoid spam in general as well. For instance, the added value of the 500th word shouldn't be much at all.
The difference between a text of 100 words and a text of 500 words should be substantial.
Am I making any sense so far?
Right now, I wouldn't know where to begin with this question. I've tried multiple Google queries, but didn't seem to find anything of the sort. I suppose such an algorithm must already exist somewhere (or at least the general idea probably does), but I can't seem to find any help on the subject. Can anyone point me in the right direction? I'd really appreciate any help you can give me.
Thanks a lot.
// word count in user description
double word_count = ...;
// word limit over which words do not improve score
double word_limit = ...;
// use it to change score progression curve
// if factor = 1, progression is linear
// if factor < 1, progression is steeper at the beginning
// if factor > 1, progression is steeper at the end
double factor = ...;
double score = pow(min(word_count, word_limit) / word_limit, factor);
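For illustration, the same formula in Python (the word_limit and factor values are arbitrary examples):

```python
def score(word_count, word_limit=500, factor=0.5):
    # factor < 1: progression is steeper at the beginning;
    # factor > 1: steeper at the end; factor == 1: linear
    return (min(word_count, word_limit) / word_limit) ** factor

# the curve rises quickly for the first words and flattens toward the limit
for wc in (10, 100, 250, 500, 1000):
    print(wc, round(score(wc), 3))
```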
It depends how complex you want/need it to be, and whether or not you want a constant reduction in the weight applied to a particular word.
The simplest would possibly be to apply a relatively high weight (say 1000) to the first word, and then each subsequent word has a weight one less than the weight of the previous word; so the second word has a weight of 999, the third word has a weight of 998, etc. That has the "drawback" that the sum of the weights doesn't increase past the 1000 word mark - you'll have to decide for yourself whether or not that's bad for your particular situation. That may not do exactly what you need to do, though.
If you don't want a linear reduction, it could be something simple such as the first word has a weight of X, the second word has a weight equal to Y% of X, the third word has a weight equal to Y% of Y% of X, etc. The difference between the first and second word is going to be larger than the difference between the second and third word, and by the time you reach the 500th word, the difference is going to be far smaller. It's also not difficult to implement, since it's not a complex formula.
Or, if you really need to, you could use a more complex mathematical function to calculate the weight - try googling 'exponential decay' and see if that's of any use to you.
It is not very difficult to implement a custom scoring function. Here is one in pseudo code:
function GetScore(word_count)
    // no points for the lazy user
    if word_count == 0
        return 0
    // 20 points for the first word, then up to 90 points linearly:
    else if word_count >= 1 and word_count <= 100
        return 20 + 70 * (word_count - 1) / 99
    // 90 points for the first 100 words, then up to 100 points linearly:
    else if word_count >= 101 and word_count <= 1000
        return 90 + 10 * (word_count - 100) / 900
    // 100 points is the maximum, for 1000 words or more:
    else
        return 100
end function
I would go with something like result = 2*sqrt(words_count); in any case, you can use any function whose derivative is less than 1, e.g. log.
