Need a weighted scoring algorithm to combine scores with different scales

I am working on a scoring system for tickets, and each ticket can have up to 4 different kinds of scores. What I would like to do is combine these four scores into a final score and prioritize the tickets. I'd also like to assign a weight to each of the 4 scores. The details of the 4 scores are listed below:
Score A: 1-5 scale, desired relative weight: 2
Score B: 1-4 scale, desired relative weight: 3
Score C: 1-10 scale, desired relative weight: 2
Score D: 1-5 scale, desired relative weight: 1
Some requirements:
(1) Each ticket may come with an arbitrary number of scores, so sometimes we have all 4 and sometimes we have none (a default final score is needed).
(2) If the ticket gets a high score from multiple sources, the final score should be even higher, and vice versa.
(3) A score with a higher weight plays a bigger role in deciding the final score.
(4) The final score should be on a 1-4 scale.
I wonder if there are any existing algorithms for solving this kind of problem? Thanks in advance.
Desired input and output example:
(1) Input: {A:N/A, B:4, C:9, D:N/A}
Output: {Final: 4}
Since both of the available scores are high.
(2) Input: {A:3, B:N/A, C:8, D:1}
Output: {Final:3}
Although score D is small, it has a small weight, so we still get a relatively big final score.
(3) Input: {A:N/A, B:N/A, C:N/A, D:N/A}
Output: {Final:2}
An arguable default score.
The overall idea is to rank the tickets according to the four scores.

Define an initial relative weight W for every score.
Convert every initial score S from its initial scale A into a universal score S' on a universal scale B from minB to maxB.
If a score is missing, give it a default value, for example the middle of the universal scale.
Calculate the final score from the converted scores S' and the weights W.
Here a and b are tuning exponents applied to the scores and to the weights, respectively.
If you make a large, then only the biggest score really matters; if you make b large, then only the score with the biggest weight really matters.
Keeping a and b within [1, 2] shouldn't be too extreme. With a and b equal to 1 you have a normal weighting system that doesn't weight bigger scores more.
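As a minimal sketch of that idea, assuming a weighted power mean (which matches the behavior described: a = b = 1 gives an ordinary weighted average, a large a lets the biggest score dominate, a large b lets the biggest weight dominate), with the scales and weights taken from the question:

# A sketch only: the exact combining formula is an assumption (weighted power mean).
def rescale(s, lo, hi, new_lo=1.0, new_hi=4.0):
    """Map a score s from [lo, hi] onto the universal 1-4 scale."""
    return new_lo + (s - lo) * (new_hi - new_lo) / (hi - lo)

def final_score(scores, scales, weights, a=1.5, b=1.5, default=2.0):
    """scores: name -> value or None; scales: name -> (lo, hi); weights: name -> weight."""
    converted = {name: default if scores.get(name) is None
                 else rescale(scores[name], lo, hi)
                 for name, (lo, hi) in scales.items()}
    num = sum(weights[n] ** b * converted[n] ** a for n in scales)
    den = sum(weights[n] ** b for n in scales)
    return (num / den) ** (1.0 / a)   # stays on the universal 1-4 scale

scales = {'A': (1, 5), 'B': (1, 4), 'C': (1, 10), 'D': (1, 5)}
weights = {'A': 2, 'B': 3, 'C': 2, 'D': 1}
print(final_score({'A': None, 'B': 4, 'C': 9, 'D': None}, scales, weights))

Whether missing scores should be defaulted (as above) or simply excluded from the sums is a design choice; excluding them pushes example (1) closer to the desired output of 4.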

Related

Applying weights to KNN dimensions

When doing kNN searches in ES/OS, it seems to be recommended to normalize the data in the kNN vectors to prevent single dimensions from overpowering the final scoring.
In my current example I have a 3-dimensional vector where all values are normalized to values between 0 and 1:
[0.2, 0.3, 0.2]
From the perspective of Euclidean-distance-based scoring this seems to give equal weight to all dimensions.
In my particular example I am using an l2 vector:
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "nmslib",
}
However, if I want to give more weight to one of my dimensions (say by a factor of 2), would it be acceptable to single out that dimension and normalize between 0-2 instead of the base range of 0-1?
Example:
[0.2, 0.3, 1.2] // Third dimension is now scaled to 0-2
The distance computation for this term would now be (2 * (xi - yi))^2 and lead to bigger diffs compared to the rest. As a result the overall score would be more sensitive to differences in this particular term.
In OS the score is calculated as 1 / (1 + Distance Function) so the higher the value returned from the distance function, the lower the score will be.
Is there a method for deciding what the weighting range should be? Setting the range too high would likely make the dimension too dominant.
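A quick sketch of the arithmetic in question (not an OpenSearch API example): pre-scaling one dimension by a factor w multiplies that dimension's contribution to the squared L2 distance by w^2, which then feeds a 1 / (1 + distance) style score. Whether the engine applies a square root before the 1 / (1 + d) step doesn't change the qualitative effect.

import numpy as np

def l2_score(x, y):
    # squared L2 distance, then the 1 / (1 + distance) scoring shape described above
    d = float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))
    return 1.0 / (1.0 + d)

x = [0.2, 0.3, 0.2]
y = [0.2, 0.3, 0.6]
w = np.array([1.0, 1.0, 2.0])      # weight the third dimension by a factor of 2

print(l2_score(x, y))              # unweighted
print(l2_score(w * x, w * y))      # the third dimension's differences now count 4x

Note that the same scaling has to be applied to both the indexed vectors and the query vector for the weighting to behave as intended.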

Dataset for meaningless “Nearest Neighbor”?

In the paper "When Is 'Nearest Neighbor' Meaningful?" we read that, "We show that under certain broad conditions (in terms of data and query distributions, or workload), as dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor. In other words, the contrast in distances to different data points becomes nonexistent. The conditions we have identified in which this happens are much broader than the independent and identically distributed (IID) dimensions assumption that other work assumes."
My question is how I should generate a dataset that resembles this effect? I have created three points, each with 1000 dimensions, with random numbers ranging from 0-255 for each dimension, but the points produce different distances and do not reproduce what is mentioned above. It seems that changing the number of dimensions (e.g. 10 or 100 or 1000) and the ranges (e.g. [0,1]) does not change anything. I still get different distances, which should not be a problem for e.g. clustering algorithms!
I hadn't heard of this before either, so I am a little defensive, since I have seen that real and synthetic datasets in high dimensions really do not support the claim of the paper in question.
As a result, what I would suggest, as a dirty, clumsy and maybe not very good first attempt, is to generate a sphere in a dimension of your choice (I do it like this) and then place a query at the center of the sphere.
In that case, every point lies at the same distance from the query point, thus the nearest neighbor has a distance equal to that of the farthest neighbor.
This, of course, is independent of the dimension, but it's what came to mind after looking at the figures of the paper. It should be enough to get you started, but surely better datasets can be generated, if any.
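A minimal sketch of that construction, with the details assumed (standard Gaussian samples normalized onto the unit sphere, query at the center):

import numpy as np

def sphere_dataset(n_points, dim, seed=0):
    # Gaussian samples normalized to unit length are uniform on the unit sphere.
    g = np.random.default_rng(seed).standard_normal((n_points, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

points = sphere_dataset(1000, 100)
query = np.zeros(100)                             # the center of the sphere
dists = np.linalg.norm(points - query, axis=1)
print(dists.min(), dists.max())                   # both 1.0 up to floating point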
Edit about:
distances for each point got bigger with more dimensions!!!!
this is expected, since the higher-dimensional the space, the sparser it is, and thus the greater the distances are. Moreover, this is expected if you think of, for example, the Euclidean distance, which grows as the dimensions grow.
I think the paper is right. First, your test: one problem with your test may be that you are using too few points. I used 10000 points and below are my results (evenly distributed points in [0.0 ... 1.0] in all dimensions). For DIM=2, min and max differ by a factor of almost 100,000; for DIM=1000, they only differ by a factor of 1.6; for DIM=10000, by 1.248. So I'd say these results confirm the paper's hypothesis.
DIM/N = 2 / 10000
min/avg/max= 1.0150906548224441E-5 / 0.019347838262624064 / 0.9993862941797146
DIM/N = 10 / 10000.0
min/avg/max= 0.011363500131326938 / 0.9806472676701363 / 1.628460468042207
DIM/N = 100 / 10000
min/avg/max= 0.7701271349716637 / 1.3380320375218808 / 2.1878136533925328
DIM/N = 1000 / 10000
min/avg/max= 2.581913326565635 / 3.2871335447262178 / 4.177669393187736
DIM/N = 10000 / 10000
min/avg/max= 8.704666143050158 / 9.70540814778645 / 10.85760200249862
DIM/N = 100000 / 1000 (N=1000!)
min/avg/max= 30.448610133282717 / 31.14936583713578 / 31.99082677476165
I guess the explanation is as follows: let's take three randomly generated vectors, A, B and C. The total distance is based on the sum of the distances of each individual element of these vectors. The more dimensions the vectors have, the more the total sum of differences will approach a common average. In other words, it is highly unlikely that a vector C has, in every element, a larger distance to A than another vector B has to A. With increasing dimensions, C and B will have increasingly similar distances to A (and to each other).
My test dataset was created as follows. The dataset is essentially a cube ranging from 0.0 to 1.0 in every dimension. The coordinates were created with a uniform distribution between 0.0 and 1.0 in every dimension. Example code (N=10000, DIM=[2..10000]):
private final java.util.Random R = new java.util.Random();

// Returns N points with DIM uniform coordinates in [0, 1), flattened row by row.
public double[] generate(int N, int DIM) {
    double[] data = new double[N * DIM];
    for (int i = 0; i < N; i++) {
        int pos = DIM * i;
        for (int d = 0; d < DIM; d++) {
            data[pos + d] = R.nextDouble();
        }
    }
    return data;
}
Following the equation given at the bottom of the accepted answer here, we get:
d=2 -> 98460
d=10 -> 142.3
d=100 -> 1.84
d=1,000 -> 0.618
d=10,000 -> 0.247
d=100,000 -> 0.0506 (using N=1000)
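Those figures are consistent with the relative contrast (d_max - d_min) / d_min applied to the min/max values listed above; redoing that arithmetic (the last row uses N=1000):

# Relative contrast computed from the reported min/max distances per dimension.
reported = {
    2:      (1.0150906548224441e-5, 0.9993862941797146),
    10:     (0.011363500131326938,  1.628460468042207),
    100:    (0.7701271349716637,    2.1878136533925328),
    1000:   (2.581913326565635,     4.177669393187736),
    10000:  (8.704666143050158,     10.85760200249862),
    100000: (30.448610133282717,    31.99082677476165),
}
for dim, (d_min, d_max) in reported.items():
    print(dim, (d_max - d_min) / d_min)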

How do I create a population sample that follows specified demographics?

I have the following class:
class Person
{
GenderEnum Gender;
RaceEnum Race;
double Salary;
...
}
I want to create 1000 instances of this class such that the collection of 1000 Persons follow these 5 demographic statistics:
50% male; 50% female
55% white; 20% black; 15% Hispanic; 5% Asian; 2% Native American; 3% Other;
10% < $10K; 15% $10K-$25K; 35% $25K-$50K; 20% $50K-$100K; 15% $100K-$200K; 5% over $200K
Mean salary for females is 77% of mean salary for males
Mean Salary as a percentage of mean white salary:
white - 100%.
black - 75%.
Hispanic - 83%.
Asian - 115%.
Native American - 94%.
Other - 100%.
The categories above are exactly what I want but the percentages given are just examples. The actual percentages will be inputs to my application and will be based on what district my application is looking at.
How can I accomplish this?
What I've tried:
I can pretty easily create 1000 instances of my Person class and assign the Gender and Race to match my demographics. (For my project I'm assuming the male/female ratio is independent of race.) I can also randomly create a list of salaries based on the specified percentage brackets. Where I run into trouble is figuring out how to assign those salaries to my Person instances in such a way that the mean salaries across gender and across race match the specified conditions.
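For reference, a quick sketch of that first step, drawing Gender and Race from the example proportions (independent draws only match the percentages approximately; for exact counts, build a list with the exact numbers per category and shuffle it):

import random

genders = ['Male', 'Female']
gender_weights = [0.50, 0.50]
races = ['White', 'Black', 'Hispanic', 'Asian', 'Native American', 'Other']
race_weights = [0.55, 0.20, 0.15, 0.05, 0.02, 0.03]

# Independent draws per person; gender is assumed independent of race here.
people = [(random.choices(genders, gender_weights)[0],
           random.choices(races, race_weights)[0])
          for _ in range(1000)]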
I think you can solve this by assuming that the distribution of income for all categories is the same shape as the one you gave, but scaled by a factor which makes all the values larger or smaller. That is, the income distribution has the same number of bars and the same mass proportion in each bar, but the bars are shifted towards smaller values or towards larger values, and all bars are shifted by the same factor.
If that's reasonable, then this has an easy solution. Note that the mean value of the income distribution over all people is sum(p[i]*c[i], i, 1, #bars), which I'll call M, where p[i] = mass proportion of bar i and c[i] = center of bar i. For each group j, you have the mean sum(s[j]*p[i]*c[i], i, 1, #bars) = s[j]*M where s[j] is the scale factor for group j. Furthermore you know that the overall mean is equal to the sum of the means of the groups, weighting each by the proportion of people in that category, i.e. M = sum(s[j]*M*q[j], j, 1, #groups) where q[j] is the proportion of people in the group. Finally you are given specific values for the mean of each group relative to the mean for white people, i.e. you know (s[j]*M)/(s[k]*M) = s[j]/s[k] = some fraction, where k is the index for the white group. From this much you can solve these equations for s[k] (the scaling factor for the white group) and then s[j] from that.
I've spelled this out for the racial groups only. You can repeat the process for men versus women, starting with the distribution you found for each racial group and finding an additional scaling factor. I would guess that if you did it the other way, gender first and then race, you would get the same results, but although it seems obvious I wouldn't be sure unless I worked out a proof of it.
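A small sketch of that calculation for the racial groups, using the example numbers from the question: writing q[j] for the group proportions and r[j] for the given mean ratios relative to white, s[j] = r[j] * s_white together with sum over j of q[j] * s[j] = 1 gives s_white = 1 / sum(q[j] * r[j]).

# Solve for the per-group scale factors from the example proportions and ratios.
q = {'white': 0.55, 'black': 0.20, 'hispanic': 0.15, 'asian': 0.05,
     'native american': 0.02, 'other': 0.03}
r = {'white': 1.00, 'black': 0.75, 'hispanic': 0.83, 'asian': 1.15,
     'native american': 0.94, 'other': 1.00}

s_white = 1.0 / sum(q[g] * r[g] for g in q)
s = {g: r[g] * s_white for g in q}
print(s)

Each group's salaries are then drawn from the base brackets and multiplied by s[g]; the same trick can be repeated for gender within each group.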

algorithm for randomly distributing objects with constraints

Help me find a good algorithm?
I have a bag full of n balls. (Let's say 28 balls for an example.)
The balls in this bag each have 1 color. There are <= 4 different colors of balls in the bag. (Let's say red, green, blue, and purple are possibilities.)
I have three buckets, each with a number of how many balls they need to end up with. These numbers total n. (For example, let's say bucket A needs to end up with 7 balls, bucket B needs to end up with 11 balls, bucket C needs to end up with 10 balls.)
The buckets also may or may not have color restrictions - colors that they will not accept. (Bucket A doesn't accept purple balls or green balls. Bucket B doesn't accept red balls. Bucket C doesn't accept purple balls or blue balls.)
I need to distribute the balls efficiently and randomly (equal probability of all possibilities).
I can't just randomly put balls in buckets that have space to accept them, because that could bring me to a situation where the only bucket that has space left in it does not accept the only color that is left in the bag.
It is given that there is always at least 1 possibility for distributing the balls. (I will not have a bag of only red balls and some bucket with number > 0 doesn't accept red balls.)
All of the balls are considered distinct, even if they are the same color. (One of the possibilities where bucket C gets red ball 1 and not red ball 2 is different from the possibility where everything is the same except bucket C gets red ball 2 and not red ball 1.)
Edit to add my idea:
I don't know if this has equal probability of all possibilities, as I would like. I haven't figured out the efficiency - It doesn't seem too bad.
And it contains an assertion that I'm not sure is always true.
Please comment on any of these things if you know.
Choose a ball from the bag at random. (Call it "this ball".)
If this ball fits and is allowed in a number of buckets > 0:
    Choose one of those buckets at random and put this ball in that bucket.
else (this ball is not allowed in any bucket that it fits in):
    Make a list of colors that can go in buckets that are not full.
    Make a list of balls of those colors that are in full buckets that this ball is allowed in.
    If that 2nd list is length 0 (there are no balls of colors from the 1st list in the bucket that allows the color of this ball):
        ASSERT: (Please show me an example situation where this might not be the case.)
        There is a 3rd bucket that is not involved in the previously used buckets in this algorithm.
        (One bucket is full and is the only one that allows this ball.
        A second bucket is the only one not full and doesn't allow this ball or any ball in the first bucket.
        The 3rd bucket is full, must allow some color that is in the first bucket, and must have some ball that is allowed in the second bucket.)
        Choose, at random, a ball from the 3rd bucket's balls of colors that fit in the 2nd bucket, and move that ball to the 2nd bucket.
        Choose, at random, a ball from the 1st bucket's balls of colors that fit in the 3rd bucket, and move that ball to the 3rd bucket.
        Put "this ball" (finally) in the 1st bucket.
    else:
        Choose a ball randomly from that list, and move it to a random bucket that is not full.
        Put "this ball" in a bucket that allows it.
Next ball.
Here's an O(n^3)-time algorithm. (The 3 comes from the number of buckets.)
We start by sketching a brute-force enumeration algorithm, then extract an efficient counting algorithm, then show how to sample.
We enumerate with an algorithm that has two nested loops. The outer loop iterates through the balls. The color of each ball does not matter; only that it can be placed in certain buckets but not others. At the beginning of each outer iteration, we have a list of partial solutions (assignments of the balls considered so far to buckets). The inner loop is over partial solutions; we add several partial solutions to a new list by extending the assignment in all valid ways. (The initial list has one element, the empty assignment.)
To count solutions more efficiently, we apply a technique called dynamic programming or run-length encoding depending on how you look at it. If two partial solutions have the same counts in each bucket (O(n^3) possibilities over the life of the algorithm), then all valid extensions of one are valid extensions of the other and vice versa. We can annotate the list elements with a count and discard all but one representative of each "equivalence class" of partial solutions.
Finally, to get a random sample, instead of choosing the representative arbitrarily, when we are combining two list entries, we sample the representative from each side proportionally to that side's count.
Working Python code (O(n^4) for simplicity; there are data structural improvements possible).
#!/usr/bin/env python3
import collections
import random

def make_key(buckets, bucket_sizes):
    return tuple(bucket_sizes[bucket] for bucket in buckets)

def sample(balls, final_bucket_sizes):
    buckets = list(final_bucket_sizes)
    # key: per-bucket counts; value: (number of partial assignments with those
    # counts, one representative sampled uniformly among them)
    partials = {(0,) * len(buckets): (1, [])}
    for ball in balls:
        next_partials = {}
        for count, partial in partials.values():
            for bucket in ball:   # each ball is the set of buckets that accept it
                next_partial = partial + [bucket]
                key = make_key(buckets, collections.Counter(next_partial))
                if key in next_partials:
                    existing_count, existing_partial = next_partials[key]
                    total_count = existing_count + count
                    # keep the existing representative with probability
                    # existing_count / total_count so the survivor stays uniform
                    next_partials[key] = (total_count,
                                          existing_partial
                                          if random.randrange(total_count) < existing_count
                                          else next_partial)
                else:
                    next_partials[key] = (count, next_partial)
        partials = next_partials
    return partials[make_key(buckets, final_bucket_sizes)][1]

def test():
    red = {'A', 'C'}
    green = {'B', 'C'}
    blue = {'A', 'B'}
    purple = {'B'}
    balls = [red] * 8 + [green] * 8 + [blue] * 8 + [purple] * 4
    final_bucket_sizes = {'A': 7, 'B': 11, 'C': 10}
    return sample(balls, final_bucket_sizes)

if __name__ == '__main__':
    print(test())
I am not really sure what trade-off you want between a random, a correct, and an efficient distribution.
If you want a completely random distribution, just pick a ball and put it randomly in a bucket it can go in. It would be pretty efficient, but you may easily make a bucket overflow.
If you want to be sure to be correct and random, you could generate all possible correct distributions and pick one of them randomly, but it could be very inefficient, since the basic brute-force algorithm to create all distribution possibilities has a complexity of nearly NumberOfBuckets^NumberOfBalls.
A better algorithm to create all correct cases would be to try to build all cases verifying your two rules (a bucket B1 can only hold N1 balls & a bucket only accepts certain colors), color by color. For instance:
//let a distribution D be a tuple N1,...,Nx of the current number of balls each bucket can accept
void DistributeColor(Distribution D, Color C) {
    DistributeBucket(D, B1, C);
}

void DistributeBucket(Distribution D, Bucket B, Color C) {
    if (B.canAccept(C)) {
        for (int i = 0; i <= min(D[B], C.N); i++) {
            //we put i balls of the color C in the bucket B
            C.N -= i;
            D[B] -= i;
            if (C.N == 0) {
                //we have no more balls of this color
                if (isLastColor(C)) {
                    //this was the last color, so it is a valid solution
                    save(D);
                } else {
                    //this was not the last color, try the next color
                    DistributeColor(D, nextColor(C));
                }
            } else {
                //we still have balls of this color
                if (isNotLastBucket(B)) {
                    //this was not the last bucket, let's try to fill the next one
                    DistributeBucket(D, nextBucket(B), C);
                } else {
                    //this was the last bucket, so this distribution is not a solution; do nothing
                }
            }
            //reset the balls for the next try
            C.N += i;
            D[B] += i;
        }
    } else {
        //B cannot accept this color at all (it feels like déjà vu)
        if (isNotLastBucket(B)) {
            //this was not the last bucket, let's try to fill the next one
            DistributeBucket(D, nextBucket(B), C);
        } else {
            //this was the last bucket, so this distribution is not a solution
        }
    }
}
(This code is pseudo-C++ and is not meant to be runnable.)
1. First you choose 7 out of 28: you have C(28,7) = 1184040 possibilities.
2. Second, you choose 11 out of the remaining 21: you have C(21,11) = 352716 possibilities.
3. The remaining 10 elements go into bucket C.
At each step, if your choice doesn't fit the rules, you stop and start again.
Altogether that makes 417629852640 possibilities (without restrictions).
It is not very efficient, but for one choice it doesn't matter a lot. If the restrictions are not too restrictive, you do not lose too much time.
If there are very few solutions, you must restrict the combinations (only good colors).
In some cases at least, this problem can be solved quite quickly by
first using the constraints to reduce the problem to a more manageable
size, then searching the solution space.
First, note that we can ignore the distinctness of the balls for the
main part of the algorithm. Having found a solution only considering
color, it’s trivial to randomly assign distinct ball numbers per color
by shuffling within each color.
To restate the problem and clarify the notion of equal probability, here
is a naive algorithm that is simple and correct but potentially very
inefficient:
Sort the balls into some random order with uniform probability. Each
of the n! permutations is equally likely. This can be done with
well-known shuffling algorithms.
Assign the balls to buckets in sequence according to capacity. In
other words, using the example buckets, first 7 to A, next 11 to B,
last 10 to C.
Check if the color constraints have been met. If they have not been
met, go back to the beginning; else stop.
This samples from the space of all permutations with equal probability
and filters out the ones that don’t meet the constraints, so uniform
probability is satisfied. However, given even moderately severe
constraints, it might loop many millions of times before finding a
solution. On the other hand, if the problem is not very constrained, it
will find a solution quickly.
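As a rough sketch, the naive algorithm looks like the following. It is run here on the fully reduced two-balls-per-bucket instance derived further down, where valid permutations are plentiful; on the tightly constrained example below it would spin for a very long time, which is exactly the point.

import random

def naive_sample(balls, capacities, allowed, max_tries=1_000_000):
    order = list(capacities)
    for _ in range(max_tries):
        shuffled = random.sample(balls, len(balls))       # uniform random permutation
        assignment, start = {}, 0
        for bucket in order:
            assignment[bucket] = shuffled[start:start + capacities[bucket]]
            start += capacities[bucket]
        if all(set(assignment[b]) <= allowed[b] for b in order):
            return assignment                             # all color constraints met
    return None                                           # gave up

capacities = {'A': 2, 'B': 2, 'C': 2}
allowed = {'A': {'red', 'blue'}, 'B': {'green', 'blue'}, 'C': {'red', 'green'}}
balls = ['red', 'red', 'green', 'green', 'blue', 'blue']
print(naive_sample(balls, capacities, allowed))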
We can exploit both of these facts by first examining the constraints
and the number of balls of each color. For example, consider the
following:
A: 7 balls; allowed colors (red, blue)
B: 11 balls; allowed colors (green, blue, purple)
C: 10 balls; allowed colors (red, green)
Balls: 6 red, 6 green, 10 blue, 6 purple
In a trial run with these parameters, the naive algorithm failed to find
a valid solution within 20 million iterations. But now let us reduce the
problem.
Note that all 6 purple balls must go in B, because it’s the only bucket
that can accept them. So the problem reduces to:
Preassigned: 6 purple in B
A: 7 balls; allowed colors (red, blue)
B: 5 balls; allowed colors (green, blue)
C: 10 balls; allowed colors (red, green)
Balls: 6 red, 6 green, 10 blue
C needs 10 balls, and can only take red and green. There are 6 of each.
The possible counts are 4+6, 5+5, 6+4. So we must put at least 4 red and
4 green in C.
Preassigned: 6 purple in B, 4 red in C, 4 green in C
A: 7 balls; allowed colors (red, blue)
B: 5 balls; allowed colors (green, blue)
C: 2 balls; allowed colors (red, green)
Balls: 2 red, 2 green, 10 blue
We have to put 10 blue balls somewhere. C won’t take any. B can take 5
at most; the other 5 must go in A. A can take 7 at most; the other 3
must go in B. Thus, A must take at least 5 blue, and B must take at
least 3 blue.
Preassigned: 6 purple in B, 4 red in C, 4 green in C, 5 blue in A, 3 blue in B
A: 2 balls; allowed colors (red, blue)
B: 2 balls; allowed colors (green, blue)
C: 2 balls; allowed colors (red, green)
Balls: 2 red, 2 green, 2 blue
At this point, the problem is trivial: checking random solutions to the
reduced problem will find a valid solution within a few tries.
For the fully-reduced problem, 80 out of 720 permutations are valid, so
a valid solution will be found with probability 1/9. For the original
problem, out of 28! permutations there are 7! * 11! * 10! * 80 valid
solutions, and the probability of finding one is less than one in five
billion.
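Those counts are easy to double-check:

from math import factorial as f

print(80 / f(6))                            # 0.111..., i.e. 1 in 9
print(f(7) * f(11) * f(10) * 80 / f(28))    # about 1.9e-10, less than 1 in 5 billion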
Turning the human reasoning used above into a reducing algorithm is more
difficult, and I will only consider it briefly. Generalizing from the
specific cases above:
Are there any balls that will only go into one bucket?
Is there a bucket that must take some minimum number of balls of one
or more colors?
Is there a color that can only go into certain buckets?
If these don’t reduce a specific problem sufficiently, examination of
the problem may yield other reductions that can then be coded.
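As a sketch of the first of those reductions (a color accepted by exactly one bucket gets preassigned there), assuming capacities, allowed colors, and per-color counts are kept in plain dicts:

def preassign_forced_colors(capacities, allowed, counts):
    # capacities: bucket -> remaining slots; allowed: bucket -> set of colors;
    # counts: color -> number of unassigned balls of that color.
    preassigned = {}
    for color, n in list(counts.items()):
        takers = [b for b, colors in allowed.items() if color in colors]
        if n > 0 and len(takers) == 1:
            bucket = takers[0]
            preassigned[(color, bucket)] = n
            capacities[bucket] -= n
            counts[color] = 0
    return preassigned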
Finally: will this always work well? It’s hard to be sure, but I suspect
it will, in most cases, because the constraints are what cause the naive
algorithm to fail. If we can use the constraints to reduce the problem
to one where the constraints don’t matter much, the naive algorithm
should find a solution without too much trouble; the number of valid
solutions should be a reasonably large fraction of all the
possibilities.
Afterthought: the same reduction technique would improve the performance
of the other answers here, too, assuming they’re correct.

Clarity in Procedural Texture Algorithms?

In the Big Picture section of this page here, a table is given comparing different combinations of 3 different functions. Let the function on the left be y = f(x); then what are the functions Average, Difference, Weighted Sum, and 4% Threshold? I need the mathematical equations in terms of y.
Everything is explained on that page:
Here are some simple, boring functions which, when repeatedly combined with smaller and smaller versions of themselves, create very interesting patterns. The table below shows you the basic source pattern (left), and combinations of that pattern with smaller versions of itself using various combination methods.
Average (1/n) - This is simply the average of all of the scales being used, 'n' is the total number of scales. So if there are 6 scales, each scale contributes about 16% (1/6th) of the final value.
Difference - This uses the difference between the color values of each scale as the final texture color.
Weighted Sum (1/2^n) - The weighted sum is very similar to the average, except the larger scales have more weight. As 'n' increases, the contribution of that scale is lessened. The smallest scales (highest value of n) have the least effect. This method is the most common and typically the most visually pleasing.
4% Threshold - This is a version of the Weighted Sum where anything below 48% gray is turned black, and anything above 52% gray is turned white.
Let us take the Average and the checker function. You are averaging a number of different repeated images, 6 in their example, but 3 in the following example:
So each pixel of the output image is the average of the pixel values from the other images. You can have as many of these images as you want, and they are always built the same way: the image at level n is made of 4 tiles, each of which is the image at level n-1 scaled to a fourth of its size. Then one of the above functions is applied to all these pictures to get a single one.
Is it clearer now? It is, however, generally hard to give a function f that defines each image. The "compounding" functions can still be defined, though, even though there are n inputs (x1, ..., xn) for 1 output (y = f(x1, x2, ..., xn)). In pseudocode and math:
Average (1/n) - For n levels, final_pixel[x][y] = sum for i from 1 to n of image_i[x][y]/n
Difference - For n levels, final_pixel[x][y] = sum for i from 2 to n of image_i[x][y] - image_i-1[x][y] -- Not entirely sure about this one.
Weighted Sum (1/2^n) - For n levels, final_pixel[x][y] = sum for i from 1 to n of image_i[x][y]/(2**i)
4% Threshold - For n levels,
value = sum for i from 1 to n of image_i[x][y]/(2**i)
if value/max_value > .52 then final_pixel[x][y] = white
else if value/max_value < .48 then final_pixel[x][y] = black
else final_pixel[x][y] = value
Where 2**i is 2 to the power of i, so each smaller scale contributes half as much as the previous one.
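For concreteness, a small sketch of those four rules, treating each image_i as a 2-D numpy array of gray values in [0, max_value] (the arrays themselves are assumed inputs):

import numpy as np

def average(images):
    return sum(images) / len(images)

def difference(images):
    # follows the (admittedly uncertain) formula above; it telescopes to image_n - image_1
    return sum(images[i] - images[i - 1] for i in range(1, len(images)))

def weighted_sum(images):
    # level i contributes with weight 1/2**i, so the larger (earlier) scales dominate
    return sum(img / 2.0 ** i for i, img in enumerate(images, start=1))

def threshold_4pct(images, max_value=255.0):
    value = weighted_sum(images)
    out = value.copy()
    out[value / max_value > 0.52] = max_value   # white
    out[value / max_value < 0.48] = 0.0         # black
    return out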
