Set membership query for "fuzzy" set - algorithm

I have a set Sx of integers over which I want to answer set membership queries in as little space as possible. (edited to make more clear in response to Niklas' comment)
However, I am allowed some "fuzziness" in the numbers, i.e., it is ok if instead of a number x in Sx, I store any other number y in the range [x-k, x+k]. Here k is a given "fuzziness" constant.
This modifies the set Sx into another set Sy formed by all the y values, the fuzzed versions of the x values. I would now like a data structure that can answer set membership queries over Sy, possibly with some error probability e (note that the elements of Sx are no longer relevant to the membership query; it is ok even if all the elements of Sx are changed to different values).
A simple answer would be to create a Bloom filter consisting of all elements in Sx, which will consume O(|Sx| log(1/e)) bits. I would like to know whether this bound can be improved upon in my specific scenario.
In practice, the number k is around 3, while the numbers in the set Sx are spaced about 30 apart.
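For concreteness, here is a minimal Python sketch of that Bloom filter baseline (the class and helper names are illustrative, not from any library). One way to choose the fuzzed values is to quantize each x to the nearest multiple of 2k+1, which is always within k of x; this shrinks the universe by a factor of 2k+1, though by itself it does not reduce the filter's O(|Sx| log(1/e)) size, which is independent of the universe:

import hashlib

def quantize(x, k):
    # the nearest multiple of 2k+1 is within k of any integer x,
    # so it is a legal fuzzed replacement y for x
    m = 2 * k + 1
    return ((x + k) // m) * m

class BloomFilter:
    def __init__(self, n_bits, n_hashes):
        self.n, self.h = n_bits, n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.h):
            digest = hashlib.blake2b(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))

Calling bf.add(quantize(x, k)) for each x in Sx stores a valid Sy; a query is then simply y in bf.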

Related

Difference between Rand and Jaccard similarity index?

What is the theoretical difference between Rand and Jaccard similarity/validation index?
I'm not interested in equations, but in the interpretation of their difference.
I know Jaccard index neglects true negatives, but why? And what kind of impact does this have?
Thanks
I worked with these in my Master's thesis in computational biology, so hopefully I can answer this in a way that helps you.
The shorter version:
J = TP/(TP+FP+FN), while R = (TP+TN)/(TP+TN+FP+FN)
Naturally, TN are neglected by Jaccard by definition. For very large datasets the number of TN can be huge, which was the case in my thesis, so that term was driving all the analysis. When I shifted from the Rand index to the Jaccard index, I neglected the contribution of TN and was able to understand things better.
The longer version:
Rand and Jaccard indices are more often used to compare partitionings/clusterings than the usual response-characteristic statistics like sensitivity/specificity, but they can in some sense be extended to the idea of a true positive or a true negative. Let's go over this in greater detail.
For a set of elements S = {a1, a2, ..., an}, we can define two different clustering algorithms X and Y which divide it into r clusters each: X1, X2, ..., Xr and Y1, Y2, ..., Yr. Combine all the X clusters (or all the Y clusters) and you get your complete set S again.
Now, we define:
A = the number of pairs of elements in S that are in the same set in X and in the same set in Y
B = the number of pairs of elements in S that are in different sets in X and in different sets in Y
C = the number of pairs of elements in S that are in the same set in X and in different sets in Y
D = the number of pairs of elements in S that are in different sets in X and in the same set in Y
The Rand index is defined as R = (A+B)/(A+B+C+D).
Now look at it this way: let X be your results from a diagnostic test, while Y are the actual labels on the data points. Then A, B, C, D reduce to TP, TN, FP, FN (in that order), and R reduces to the definition I gave above.
Now, the Jaccard index:
The Jaccard index disregards pairs of elements that are in different sets under both clusterings X and Y, i.e., it neglects B, the true negatives:
J = A/(A+C+D), which reduces to J = TP/(TP+FP+FN).
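To make the mapping concrete, here is a small Python sketch (the function name is mine) that computes both indices directly from the pair counts of two clusterings:

from itertools import combinations

def rand_and_jaccard(labels_x, labels_y):
    # labels_x[i] and labels_y[i] are the cluster labels of item i under X and Y
    A = B = C = D = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            A += 1  # "true positive" pair
        elif not same_x and not same_y:
            B += 1  # "true negative" pair
        elif same_x:
            C += 1  # "false positive" pair
        else:
            D += 1  # "false negative" pair
    rand = (A + B) / (A + B + C + D)
    jaccard = A / (A + C + D)
    return rand, jaccard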
And that's how the two statistics are fundamentally different. If you want more info, here is a pretty good paper and a website which might be of use to you:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.6189&rep=rep1&type=pdf
http://clusteval.sdu.dk/313/clustering_quality_measures/542
Hope this helps!

Biasing random number generator to some integer n with deviation d

Given an integer range R = [a, b] (where a >= 0 and b <= 100), a bias integer n in R, and some deviation d, what formula can I use to skew a random number generator towards n?
So for example if I had the numbers 1 through 10 inclusive and I don't specify a bias number, then I should in theory have equal chances of randomly drawing any one of them.
But if I do give a specific bias number (say, 3), then the number generator should be drawing 3 more frequently than the other numbers.
And if I specify a deviation of, say, 2 in addition to the bias number, then the number generator should be drawing 1 through 5 more frequently than 6 through 10.
What algorithm can I use to achieve this?
I'm using Ruby if it makes it any easier/harder.
I think the simplest route is to sample from a normal (aka Gaussian) distribution with the properties you want, and then transform the result:
generate a normal value with the given mean and sd
round to the nearest integer
if outside the given range (a normal can generate values anywhere from -infinity to +infinity), discard and repeat
If you need to generate a normal from a uniform, the simplest transform is Box-Muller.
There are some details you may need to worry about. In particular, Box-Muller is limited in range (it never generates extremely unlikely values), so if you give a very narrow range you will not get the full range of values. Other transforms are not as limited; I'd suggest using whatever Ruby provides (look for "normal" or "gaussian").
Also, be careful to round the value: 2.6 to 3.4 should all become 3, for example. If you simply discard the decimal (so 3.0 to 3.999 become 3) you will be biased.
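The question mentions Ruby, but as a language-neutral sketch here is that sample-round-reject loop in Python (random.gauss is Python's built-in normal sampler; in Ruby, substitute whatever normal/Gaussian routine your library provides):

import random

def biased_randint(lo, hi, bias, dev):
    # sample a normal centred on the bias value, round to the
    # nearest integer, and reject anything outside [lo, hi]
    while True:
        x = round(random.gauss(bias, dev))
        if lo <= x <= hi:
            return x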
If you're really concerned with efficiency and don't want to discard values, you can simply invent something. One way to cheat is to mix a uniform variate with the bias value (so 9/10 of the time generate the uniform, 1/10 of the time return 3, say). In some cases, where you only care about the average of the sample, that can be sufficient.
For the first part, "But if I do give a specific bias number (say, 3), then the number generator should be drawing 3 more frequently than the other numbers", a very easy solution:
import random

def randBias(a, b, biasedNum=None, bias=0):
    # draw from an extended range; the extra `bias` slots all map to biasedNum
    x = random.randint(a, b + bias)
    if x <= b:
        return x
    else:
        return biasedNum
For the second part, I would say it depends on the task. In a case where you need to generate a billion random numbers from the same distribution, I would calculate the probabilities of the numbers explicitly and use a weighted random number generator (see Random weighted choice).
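In Python, for instance, the standard library already ships such a weighted generator in random.choices; the weights below are illustrative, concentrating mass around 3:

import random

values = list(range(1, 11))               # the numbers 1..10
weights = [1, 3, 6, 3, 1, 1, 1, 1, 1, 1]  # arbitrary example weights peaking at 3
draw = random.choices(values, weights=weights, k=1)[0]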
If you want a unimodal distribution (where the bias is concentrated on one particular value of your range of numbers, for example 3 as you state), then the answer provided by andrew cooke is good---mostly because it allows you to fine-tune the deviation very accurately.
If however you wish to have several biases---for instance you want a trimodal distribution, with the numbers a, (a+b)/2 and b occurring more frequently than the others---then you would do well to implement weighted random selection.
A simple algorithm for this was given in a recent question on StackOverflow; its complexity is linear. Using such an algorithm, you would simply maintain a list, initially containing {a, a+1, a+2, ..., b-1, b} (so of size b-a+1), and when you want to add a bias towards X, you would add several copies of X to the list, depending on how much you want to bias. Then you pick a random item from the list.
If you want something more efficient, the most efficient method is called the "alias method", which was implemented very clearly in Python by Denis Bzowy; once your array has been preprocessed, it runs in constant time (but that means you can't update the biases anymore once you've done the preprocessing---or you would have to reprocess the table).
The downside of both techniques is that, unlike with the Gaussian distribution, biasing towards X will not also bias somewhat towards X-1 and X+1. To simulate this effect you would have to do something such as
def addBias(x, L):
    # add weighted copies of x and its neighbours to the candidate list
    L.extend([x] * 5)
    L.extend([x + 2])
    L.extend([x + 1] * 2)
    L.extend([x - 1] * 3)
    L.extend([x - 2])
    return L
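Picking from the weighted list is then a single uniform draw, e.g. (with illustrative values for a and b):

import random

a, b = 1, 10                 # example range from the question
L = list(range(a, b + 1))    # initially uniform over [a, b]
addBias(3, L)                # skew the list towards 3 and its neighbours
print(random.choice(L))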

Generating a set of pseudorandom numbers satisfying the following XOR property?

Given a pseudorandom number generator int64 rand64(), I would like to build a set of pseudorandom numbers. This set should have the property that no XOR combination of any non-empty subset results in the value 0.
I'm thinking of the following algorithm:
count = 0
set = {}
while (count < desiredSetSize)
    set[count] = rand64()
    if propertyIsNotFullfilled(set[0] to set[count])
        continue
    count = count + 1
The question is: How can propertyIsNotFullfilled be implemented?
Notes: The reason I want to generate such a set is the following: I have a hash table whose hash values are generated via Zobrist hashing. Instead of keeping a boolean value with each hash table entry indicating whether the entry is filled, I thought the hash value stored with each entry would be sufficient for this information (0 ... empty, != 0 ... set). There is another reason to carry this information as a sentinel value inside the hash-key table: I'm trying to switch from an AoS (Array of Structures) to an SoA (Structure of Arrays) memory layout, to avoid padding and to test whether there are fewer cache misses. I hope that in most cases the access to the hash-key table alone is enough (provided the hash value carries the information of whether the entry is empty or not).
I also thought about reserving the most significant bit of the hash values for this information, but this would reduce the range of possible hash values more than necessary. Theoretically, the range would be reduced from 2^64 (minus the sentinel 0-value) to 2^63.
One can also read the question the other way around: given a set of 84 pseudorandom numbers, is there any number which can't be generated by XORing any subset of this set, and how do you find it? That number could then be used as the sentinel value.
Now, for what I need it: I have developed a connect-four game engine. There are 6 x 7 moves possible for player A and likewise for player B, so there are 84 possible moves (hence the 84 random values). The hash value of a board state is generated from the precalculated random values in the following manner: hash(board) = randomset[move1] XOR randomset[move2] XOR randomset[move3] ...
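In code, this scheme is just a fold with XOR (a sketch; randomset is the table of 84 precomputed values and moves is the list of moves played):

from functools import reduce

def board_hash(moves, randomset):
    # Zobrist hashing: XOR together the random value of every move played
    return reduce(lambda h, m: h ^ randomset[m], moves, 0)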
This set should have the property that no XOR combination of any non-empty subset results in the value 0.
IMHO this restricts the maximum size of the set to 64 (pigeonhole principle): with more than 64 elements, there will always be a (non-empty) subset that XORs to zero. For smaller sets, the property can be fulfilled.
To further illustrate my point: consider a system of 64 equations over 64 unknown variables. Then, add one extra equation. The fact that the equations and variables are booleans does not make the problem different.
--EDIT/UPDATE--: Since the application appears to be the game connect-four, you could instead enumerate all possible configurations. Not being able to code the impossible board configurations will save enough coding space to fit any valid board position into 64 bits:
Encoding the colored stones as {A,B} and the irrelevant ones as {X}, the configuration of a (height=6) column can be one of:
                   X
                X  X
             X  X  X
          X  X  X  X
       X  X  X  X  X
 _  A  A  A  A  A  A   <<-- possible configurations for one pile
--+--+--+--+--+--+--+
 1  1  2  4  8 16 32   <<-- number of combinations of the Xs
               -2 -5   <<-- number of impossible Xs
(and similarly for B instead of A). The numbers below the piles are the numbers of possibilities for the Xs on top; the negative numbers are the numbers of forbidden/impossible configurations. For the column with one A and 4 Xs, every value for the Xs is valid except 3×A (the game would already have ended). The same goes for the rightmost pile: the bottom 3 Xs cannot all be A, and the Xs cannot all be B.
This leads to a total of 1 + 2 * (63-7) := 113.
(1 is for the empty board, 2 is the number of colors.) So: 113 is the number of configurations for one column, fitting well within 7 bits. For 7 columns we'll need 7*7 := 49 bits. (We might save one bit for the L/R mirror symmetry, maybe even one for the color symmetry, but that would only complicate things, IMHO.)
There will still be a lot of coding space wasted (the columns are not independent; the number of As on the board is equal to the number of Bs, or one more, etc.), but I don't think it would be easy to avoid that. Fortunately, it will not be necessary.
To amplify wildplasser: any hash function that can be used to distinguish every n-bit string from every other n-bit string cannot have output shorter than n bits. Shorter hash functions are usable because we only have to avoid collisions among the strings that actually arrive, but we cannot hope to make an intelligent choice offline. Just use a cryptographically secure RNG and one of two things will happen: (i) your code will work as though the RNG were truly random, or (ii, unlikely) your code will break and (if it's not bugged) act as a distinguisher between the crypto RNG and true randomness, bringing you fame and notoriety.
Amplifying the answer by wildplasser a little bit more, here is an idea for how to implement propertyIsNotFullfilled.
Represent the set of pseudorandom numbers as a {0,1}-matrix and perform Gaussian elimination over it (using XOR instead of the usual multiply/subtract operations). If you end up with a matrix whose last row is zero, return true; otherwise return false.
This function will certainly return true very frequently when the size of the set is close to 64, so the algorithm in the OP is efficient only for relatively small sizes.
To optimize this algorithm, you can keep the result of the last Gaussian elimination.
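Here is a Python sketch of that idea, phrased incrementally so that the result of the previous elimination is reused, as suggested above. It keeps a reduced basis keyed by pivot bit; a candidate that reduces to zero is an XOR combination of already-chosen values and is rejected. The names are mine, not from the OP's pseudocode:

import random

def add_to_basis(basis, v):
    # basis maps a pivot-bit position to the reduced vector with that leading bit
    while v:
        pivot = v.bit_length() - 1
        if pivot not in basis:
            basis[pivot] = v
            return True          # v is independent of everything chosen so far
        v ^= basis[pivot]        # eliminate the leading bit and keep reducing
    return False                 # v reduced to zero: dependent, reject it

def make_xor_free_set(size, rand64=lambda: random.getrandbits(64)):
    assert size <= 64            # wildplasser's bound: >64 values are always dependent
    basis, chosen = {}, []
    while len(chosen) < size:
        v = rand64()
        if add_to_basis(basis, v):
            chosen.append(v)
    return chosen

Note the assert: the 84 values the OP asks for cannot all be independent, which is exactly the pigeonhole argument above.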

Which algorithm will be required to do this?

I have data of this form:
for x=1, y is one of {1,4,6,7,9,18,16,19}
for x=2, y is one of {1,5,7,4}
for x=3, y is one of {2,6,4,8,2}
....
for x=100, y is one of {2,7,89,4,5}
Only one of the values in each set is the correct value; the rest are random noise.
I know that the correct values describe a sinusoid function whose parameters are unknown. How can I find the correct combination of values, one from each set?
I am looking for something like a "travelling salesman" combinatorial optimization algorithm.
You're trying to do curve fitting, for which there are several algorithms depending on the type of curve you want to fit (linear, polynomial, etc.). I have no idea whether there is a specific algorithm for sinusoidal curves (Fourier approximations), but my first idea would be to use a polynomial fitting algorithm with a polynomial approximation of the sine.
I wonder whether you need to do this in the course of another, larger program, or whether you are trying to do this task on its own. If the latter, you'd be much better off using a statistical package; my preferred one is R. It lets you import your data, fit curves, and draw graphs in just a few lines, and you can also run R in batch mode to call it from a script or even a program (this is what I tend to do).
It depends on what you mean by "exactly", and what you know beforehand. If you know the frequency w, and that the sinusoid is unbiased, you have an equation
a cos(w * x) + b sin(w * x)
with two (x,y) points at different x values you can find a and b, and then check the generated curve against all the other points. Choose the two x values with the smallest number of y observations and try it for all the y's. If there is a bias, i.e. your equation is
a cos(w * x) + b sin(w * x) + c
You need to look at three x values.
If you do not know the frequency, you can try the same technique; unfortunately the solutions may not be unique, and there may be more than one w that fits.
Edit: As I understand your problem, you have one real y value for each x and a bunch of incorrect ones, and you want to find the real values. The best way to do this is to fit curves through a small number of points and check whether the curve passes close to some y value in each of the other sets.
If not all the x values have valid y values, then the same technique applies, but you need to look at a much larger set of pairs, triples, or quadruples (essentially every pair, triple, or quad of points with different y values).
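A Python sketch of this approach for the known-frequency, unbiased case (the tolerance tol and the data layout are my assumptions: sets maps each x to its list of candidate y values):

import math
from itertools import product

def fit_ab(w, p1, p2):
    # solve a*cos(w*x) + b*sin(w*x) = y at two points: a 2x2 linear system
    (x1, y1), (x2, y2) = p1, p2
    c1, s1 = math.cos(w * x1), math.sin(w * x1)
    c2, s2 = math.cos(w * x2), math.sin(w * x2)
    det = c1 * s2 - c2 * s1
    if abs(det) < 1e-9:          # degenerate: w*(x2-x1) near a multiple of pi
        return None
    return (y1 * s2 - y2 * s1) / det, (c1 * y2 - c2 * y1) / det

def best_fit(w, sets, tol=1.0):
    # try every y-pair from the two x's with the fewest candidates,
    # and score each curve by how many sets it passes close to
    x1, x2 = sorted(sets, key=lambda x: len(sets[x]))[:2]
    best, best_score = None, -1
    for y1, y2 in product(sets[x1], sets[x2]):
        ab = fit_ab(w, (x1, y1), (x2, y2))
        if ab is None:
            continue
        a, b = ab
        score = sum(any(abs(a * math.cos(w * x) + b * math.sin(w * x) - y) <= tol
                        for y in ys)
                    for x, ys in sets.items())
        if score > best_score:
            best, best_score = (a, b), score
    return best, best_score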
If your problem is something else, and I suspect it is, please specify it.
1. Define sinusoid. Most people take that to mean a function of the form a cos(w * x) + b sin(w * x) + c. If you mean something different, specify it.
2. Specify exactly what success looks like. An example with, say, 10 points instead of 100 would be nice.
It is extremely unclear what this has to do with combinatorial optimization.
Sinusoidal functions are so general that if you take any random choice of y's, those values can be fitted by some sinusoid, unless you impose conditions, e.g. frequency < 100, or all parameters are integers. It is not possible to differentiate noise from data theoretically otherwise, so work on finding such conditions from your data source/experiment first.
By sinusoidal, do you mean a function that is increasing for n steps, then decreasing for n steps, etc.? If so, you can model your data as a sequence of nodes connected by up-links and down-links. For each node (possible value of y), record the length and end value of chains of only ascending or only descending links (there will be multiple chains per node). Then scan for consecutive runs of equal length and opposite direction, modulo some initial offset.

Algorithm to pick values from set to match target value?

I have a fixed array of constant integer values about 300 items long (Set A). The goal of the algorithm is to pick two numbers (X and Y) from this array that fit several criteria based on input R.
Formal requirement:
Pick values X and Y from set A such that the expression X*Y/(X+Y) is as close as possible to R.
That's all there is to it. I need a simple algorithm that will do that.
Additional info:
The set A can be ordered or stored in any way; it will be hard-coded eventually. Also, with a little bit of math, it can be shown that the best Y for a given X is the value in set A closest to the expression X*R/(X-R). Also, X and Y will always be greater than R.
From this, I get a simple iterative algorithm that works ok:
bestX = -1
bestY = -1
bestErr = infinity
foreach X in A
    if X <= R
        continue
    Y = X*R/(X-R)
    Y = FindNearestIn(A, Y)  // search for the closest usable Y value in A
    err = abs(X*Y/(X+Y) - R)
    if err < bestErr         // track the pair whose value is closest to R
        bestX = X
        bestY = Y
        bestErr = err
    end
end
I'm looking for a slightly more elegant approach than this brute force method. Suggestions?
For a possibly 'more elegant' solution see Solution 2.
Solution 1)
Why don't you create all of the roughly 300*300/2 (i.e., 300*299/2 pairs, plus the pairs with X = Y) possible exact values of X*Y/(X+Y), sort them into an array B, and then, given an R, find the value closest to R in B using binary search and pick the corresponding X and Y?
I presume that having array B (with the X & Y info) won't be a big memory hog and can easily be hard-coded (using code to write code! :-)).
This will be reasonably fast: worst case ~17 comparisons.
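A Python sketch of Solution 1 (bisect is the standard binary-search module; storing (value, X, Y) tuples keeps the X & Y info alongside each exact value):

import bisect

def precompute(A):
    # every unordered pair (including X = Y), sorted by its exact value
    return sorted((x * y / (x + y), x, y)
                  for i, x in enumerate(A) for y in A[i:])

def query(table, R):
    i = bisect.bisect_left(table, (R,))
    # the closest value sits at index i or i-1
    candidates = [table[j] for j in (i - 1, i) if 0 <= j < len(table)]
    _, X, Y = min(candidates, key=lambda t: abs(t[0] - R))
    return X, Y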
Solution 2)
You can possibly also do the following (I didn't try proving it, but it seems correct):
Maintain a sorted array of the 1/X values.
Now, given an R, try to find the two numbers in the array of 1/Xs whose sum is closest to 1/R (this works because X*Y/(X+Y) = 1/(1/X + 1/Y)).
For this, maintain two pointers into the 1/X array, one at the smallest and one at the largest value, and keep incrementing one and decrementing the other to find the sum closest to 1/R. (This is a classic interview question: find whether a sorted array has two numbers which sum to X.)
This will be O(n) comparisons and additions in the worst case. It is also prone to precision issues; you could avoid some of them by maintaining a reverse-sorted array of the X's instead, though.
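A Python sketch of that two-pointer scan. It works on the sorted X's directly rather than on stored 1/X values, which sidesteps some of the precision concern: since X*Y/(X+Y) = 1/(1/X + 1/Y) is increasing in both arguments, the classic move rules still apply (this assumes the values in A are positive):

def closest_pair(A, R):
    xs = sorted(A)
    lo, hi = 0, len(xs) - 1
    bestX, bestY = xs[lo], xs[hi]
    while lo <= hi:                   # lo == hi allows X == Y
        X, Y = xs[lo], xs[hi]
        if abs(X * Y / (X + Y) - R) < abs(bestX * bestY / (bestX + bestY) - R):
            bestX, bestY = X, Y
        if X * Y / (X + Y) < R:
            lo += 1                   # value too small: raise the small end
        else:
            hi -= 1                   # value too large: lower the large end
    return bestX, bestY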
Two ideas come to my mind:
1) Since the set A is constant, some preprocessing can be helpful. Assuming the value span of A is not too large, you can create an array of size N = max(A); for each index i you store the value in A closest to i. This way you can find the closest value in constant time instead of using a binary search (see the sketch after these two ideas).
2) I see that you omit X <= R, and this is correct. If you also require that X <= Y, you can restrict the search range even further, since X > 2R will yield no solutions either. So the range to be scanned is R < X <= 2R, which guarantees no symmetric solutions and that X <= Y.
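A sketch of the constant-time lookup table from idea 1 (assuming non-negative integers in A):

def nearest_table(A):
    # table[i] = the value in A closest to i, for every i in 0..max(A)
    s = sorted(set(A))
    table, j = [], 0
    for i in range(s[-1] + 1):
        while j + 1 < len(s) and abs(s[j + 1] - i) <= abs(s[j] - i):
            j += 1
        table.append(s[j])
    return table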
When the size of the input is (roughly) constant, an O(n*log(n)) solution might run faster than a particular O(n) solution.
I would start with the solution that you understand the best, and optimize from there if needed.
