finding closest hamming distance - algorithm

I have N < 2^n randomly generated n-bit numbers stored in a file, and each lookup in that file is expensive. Given a number Y, I have to find a number in the file that is at most Hamming distance k from Y. Done naively, this takes C(n,1) + C(n,2) + C(n,3) + ... + C(n,k) lookups in the worst case, which is not feasible in my case. I tried storing the distribution of 1s and 0s at each bit position in memory and using it to prioritize my lookups. So I stored the probability of bit i being 0 or 1:
Pr(bi=0), Pr(bi=1) for all i from 0 to n-1.
But it didn't help much, since N is too large and every bit position has an almost equal distribution of 1s and 0s. Is there a way to do this more efficiently? For now, you can assume n=32, N = 2^24.

Google gives a solution to this problem for k=3, n=64, N=2^34 (much larger corpus, fewer bit flips, larger fingerprints) in this paper. The basic idea is that for small k, n/k is quite large, and hence you expect nearby fingerprints to share relatively long common prefixes if you form a few tables with permuted bit orders. I am not sure it will work for you, however, since your n/k is quite a bit smaller.

If by "lookup" you mean searching your entire file for a specified number, and then repeating the "lookup" for each possible match, then it should be faster to just read through the whole file once, checking each entry's Hamming distance to the specified number as you go. That way you only read through the file once instead of C(n,1) + C(n,2) + C(n,3) + ... + C(n,k) times.
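A minimal sketch of that single pass, assuming the numbers are available as Python integers (the function names `hamming_distance` and `scan_for_neighbors` are made up for illustration). XOR leaves a 1 in every position where the two numbers differ, so counting the 1 bits of the XOR gives the Hamming distance.

```python
def hamming_distance(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")


def scan_for_neighbors(numbers, y, k):
    """Single pass over `numbers`: keep every entry within distance k of y."""
    return [x for x in numbers if hamming_distance(x, y) <= k]


if __name__ == "__main__":
    stored = [0b0000, 0b0111, 0b1111]
    print(scan_for_neighbors(stored, 0b0001, 1))
```

With n=32 and N=2^24 this is one XOR and one bit count per entry, so the cost is dominated by reading the file once.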

You can use quantum computation to speed up your search and reduce the number of steps required. I think Grover's search algorithm will be helpful to you, as it provides a quadratic speedup for search problems.

Perhaps you could store it as a graph, with links to the next closest numbers in the set, by hamming distance, then all you need to do is follow one of the links to another number to find the next closest one. Then use an index to keep track of where the numbers are by file offset, so you don't have to search the graph for Y when you need to find its nearby neighbors.
You also say you have 2^24 numbers, which according to Wolfram Alpha (http://www.wolframalpha.com/input/?i=2^24+*+32+bits) is only 64MB. Could you just put it all in RAM to make the accesses faster? Perhaps that would happen automatically with caching on your machine?

If your application can afford to do some extensive preprocessing, you could, as you're generating the n-bit numbers, compute all the other numbers which are at most k distant from each number and store them in a lookup table. It'd be something like a Map from each number to the list of stored numbers within distance k of it. riri claims you can fit it in memory, so hash tables might work well, but otherwise you'd probably need a B+ tree for the Map. Of course, this is expensive, as you mentioned before, but if you can do it beforehand, you'd have fast lookups later, either O(1) or O(log(N) + log(2^k)).
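As a rough sketch of that preprocessing, assuming the numbers are Python integers and the table fits in a plain dict (the names `neighbors_within` and `build_lookup` are hypothetical), each stored number is expanded into every value within k bit flips of it:

```python
from itertools import combinations


def neighbors_within(x, n_bits, k):
    """All n_bits-wide values at Hamming distance <= k from x (including x)."""
    result = []
    for r in range(k + 1):
        for positions in combinations(range(n_bits), r):
            flipped = x
            for p in positions:
                flipped ^= 1 << p  # flip one chosen bit
            result.append(flipped)
    return result


def build_lookup(numbers, n_bits, k):
    """Map every reachable value to the stored numbers within distance k of it."""
    table = {}
    for x in numbers:
        for y in neighbors_within(x, n_bits, k):
            table.setdefault(y, []).append(x)
    return table


if __name__ == "__main__":
    table = build_lookup([0b0000, 0b1111], n_bits=4, k=1)
    print(table.get(0b0111))
```

Each stored number contributes C(n,0) + C(n,1) + ... + C(n,k) entries, which is 5489 for n=32 and k=3, so the table is several thousand times larger than the input. That is the price paid for the fast lookups.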

Related

Is the linear formation the best sorting production?

A sorting method usually produces a linearly ordered result (such as "1, 7, 8, 13, 109, ..."), which takes O(N) to query.
Why not sort into a non-linear order, so that an element can be found in O(log N) or so by iteration, Newton's method, etc.? Is it expensive to build such a higher-order sorted structure?
Concisely, is it feasible to sort results so that they can be accessed by finding the roots of ax^2 + bx + c = 0? (By contrast, the usual case amounts to finding the root of ax + c = 0.) For example, we have x1 = 1, x2 = 2 as roots of a quadratic equation and just insert the following xi. Then it becomes possible to query in smarter ways.
I suppose difficulties can arise in these respects:
Predicting the data can be rather hard, so we cannot construct a general formula that describes the following numbers well (they may be hash values).
Because of the first difficulty, numbers outside a certain range can diverge (example graphed by Google: the graph). The values derived outside [-1,3] are really large, and evaluating the original formula rapidly becomes harder.
This is actually equivalent to a hash, which creates a table that contains the values, with a formula as the production rule.
Executing a "smarter" query may be expensive because of the complexity of the algorithm itself.
Smarter schemes which take advantage of a known statistical distribution are typically faster by some constant. However, that still keeps them at O(log N), which is the same as a trivial binary search. The reason is that in each step, they typically narrow down the range of elements to search by a factor R > 2 , for simple binary search that's just R=2. But you need log(N)/log(R) steps to narrow it down to exactly one element.
Now whether this is a net win depends on log(R) versus the work needed at each step. A simple comparison (for binary search) takes a few cycles. As soon as you need anything more complex than +-*/ (say exp or log) to predict the location of the next element, the profit of needing fewer steps is gone.
So, in summary: binary search is used because each step is efficient, for many real-world distributions.
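Interpolation search is one concrete example of such a "smarter" scheme. The sketch below assumes a sorted list of integers that are roughly uniformly distributed; the function name is made up for illustration. Each step predicts the position with one multiplication and one division instead of simply halving the range.

```python
def interpolation_search(a, target):
    """Return the index of target in the sorted integer list a, or -1."""
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= target <= a[hi]:
        if a[hi] == a[lo]:
            # All remaining keys are equal, and target lies between them.
            return lo
        # Linear prediction of where target should sit between lo and hi.
        pos = lo + (target - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        if a[pos] == target:
            return pos
        if a[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1


if __name__ == "__main__":
    data = list(range(0, 1000, 3))
    print(interpolation_search(data, 300))
```

On perfectly uniform keys the first guess is often exact, but on skewed data the prediction narrows the range slowly and the extra arithmetic per step is wasted, which is the trade-off described above.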

Constant time search

Suppose I have a rod which I cut to pieces. Given a point on the original rod, is there a way to find out which piece it belongs to, in constant time?
For example:
|------------------|---------|---------------|
0.0                4.5       7.8532          9.123
Given a position:
                               ^
                               |
                             8.005
I would like to get the 3rd piece.
It is possible to easily get such answer in O(log n) time with binary search but is it possible to do it in O(1)? If I pre-process the "cut" positions somehow?
If you assume the point you want to query is chosen uniformly at random along the rod, then you can have an EXPECTED constant-time solution, without a crazy memory explosion, as follows. Break the rod into N equally spaced pieces, where N is the number of original irregularly spaced segments in your rod, and record for each of the N equal-sized pieces which of the original irregular segment(s) it overlaps. To do a query, first round off the query point to find out which equally spaced piece it lies in, then use that index to look up which of your original segments intersect that piece, and then check each intersecting original segment to see whether it contains your point (you can use binary search here if you want to make sure the worst-case performance is still logarithmic). The expected running time of this approach is constant if the query point is randomly chosen along your rod, and the amount of memory is O(N) if your rod was originally cut into N irregular pieces, so there are no crazy memory requirements.
PROOF OF EXPECTED O(1) RUNNING TIME:
When you count the total number of intersection pairs between your original N irregular segments and the N equally-spaced pieces I propose constructing, the total number is no more than 2*(N+1) (because if you sort all the end-points of all the regular and irregular segments, a new intersection pair can always be charged to one of the end-points defining either a regular or irregular segment). So you have a multi-set of at most 2(N+1) of your irregular segments, distributed out in some fashion among the N regular segments that they intersect. The actual distribution of intersections among the regular segments doesn't matter. When you have a uniform query point and compute the expected number of irregular segments that intersect the regular segment that contains the query point, each regular segment has probability 1/N of being chosen by the query point, so the expected number of intersected irregular segments that need to be checked is 2*(N+1)/N = O(1).
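The equal-spacing scheme can be sketched as follows. This is only an illustration under stated assumptions: `cuts` is a sorted list of positions that includes both ends of the rod, pieces are numbered from 0, and instead of storing the full list of overlapping segments per cell it stores only the first one and walks forward from there. The names `build_grid` and `find_piece` are hypothetical.

```python
from bisect import bisect_right


def build_grid(cuts):
    """For each of N equal cells, record the piece containing its left edge."""
    n = len(cuts) - 1                      # number of irregular pieces
    cell = (cuts[-1] - cuts[0]) / n        # width of one equal cell
    first_piece = [
        min(bisect_right(cuts, cuts[0] + j * cell) - 1, n - 1)
        for j in range(n)
    ]
    return first_piece, cell


def find_piece(cuts, first_piece, cell, x):
    """0-based index i of the piece with cuts[i] <= x < cuts[i + 1]."""
    n = len(cuts) - 1
    j = min(int((x - cuts[0]) / cell), n - 1)   # which equal cell x falls in
    i = first_piece[j]
    while i > 0 and cuts[i] > x:                # guard against float rounding
        i -= 1
    while i + 1 < n and cuts[i + 1] <= x:       # walk pieces inside this cell
        i += 1
    return i


if __name__ == "__main__":
    cuts = [0.0, 4.5, 7.8532, 9.123]
    first_piece, cell = build_grid(cuts)
    print(find_piece(cuts, first_piece, cell, 8.005) + 1)
```

The forward walk visits only pieces that overlap the chosen cell, so its expected length is bounded by the same counting argument as in the proof.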
For arbitrary cuts and precisions, not really, you have to compare the position with the various start or end points.
But, if you're only talking a small number of cuts, performance shouldn't really be an issue.
For example, even with ten segments, you only have nine comparisons, not a huge amount of computation.
Of course, you can always turn the situation into a polynomial formula (such as ax^4 + bx^3 + cx^2 + dx + e), generated using simultaneous equations, which will give you a segment, but the highest power tends to rise with the segment count, so it's not necessarily as efficient as simple checks.
You're not going to do better than lg n with a comparison-based algorithm. Reinterpreting the 31 non-sign bits of a positive IEEE float as a 31-bit integer is an order-preserving transformation, so tries and van Emde Boas trees both are options. I would steer you first toward a three-level trie.
You could assign an integral number to every position and then use that as index into a lookup table, which would give you constant-time lookup. This is pretty easy if your stick is short and you don't cut it into pieces that are fractions of a millimeter long. If you can get by with such an approximation, that would be my way to go.
There is one enhanced way which generalizes this even further. In each element of a lookup table, you store the middle position and the segment ID to the left and right. This makes one lookup (O(1)) plus one comparison (O(1)). The downside is that the lookup table has to be so large that you never have more than two different segments in the same table element's range. Again, it depends on your requirements and input data whether this works or not.

How to find the closest pairs (Hamming Distance) of a string of binary bins in Ruby without O(n^2) issues?

I've got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256 bit bin of 1s and 0s, like:
0110101010101010110101010101
Ideally, I'd like to query for near binary matches. This means, if the two documents have the following numbers. Yes, this is Hamming Distance.
This is NOT currently supported in Mongo. So, I'm forced to do it in the application layer.
So, given this, I am trying to find a way to avoid having to do individual Hamming distance comparisons between all the documents, since that makes the time to do this basically impossible.
I have a LOT of RAM. And, in ruby, there seems to be a great gem (algorithms) that can create a number of trees, none of which I can seem to make work (yet) that would reduce the number of queries I'd need to make.
Ideally, I'd like to make 1 million queries, find the near duplicate strings, and be able to update them to reflect that.
Anyone's thoughts would be appreciated.
I ended up retrieving all the documents into memory (a subset with just the id and the string).
Then, I used a BK Tree to compare the strings.
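For readers unfamiliar with BK-trees, here is a minimal sketch in Python rather than Ruby. It is not the code used above; the class and function names are made up. Each child hangs off its parent on an edge labeled with their distance, and the triangle inequality lets a search skip every edge whose label is further than the radius from the query's distance to the parent.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is [value, {edge_distance: child_node}]

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d == 0:
                return  # exact duplicate, already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [value, {}]
                return
            node = child

    def search(self, query, radius):
        """All (distance, value) pairs with distance <= radius."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            if d <= radius:
                results.append((d, value))
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:  # triangle inequality
                    stack.append(child)
        return results


if __name__ == "__main__":
    tree = BKTree(hamming)
    for s in ["0110", "0111", "1001", "0000"]:
        tree.add(s)
    print(sorted(tree.search("0110", 1)))
```

Note that this sketch drops exact duplicates. To find duplicate documents, each node would need to hold a list of ids instead of a single value.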
The Hamming distance defines a metric space, so you could use the O(n log n) algorithm to find the closest pair of points, which is of the typical divide-and-conquer nature.
You can then apply this repeatedly until you have "enough" pairs.
Edit: I see now that Wikipedia doesn't actually give the algorithm, so here is one description.
Edit 2: The algorithm can be modified to give up if there are no pairs at distance less than n. For the case of the Hamming distance: simply count the level of recursion you are in. If you haven't found something at level n in any branch, then give up (in other words, never enter n + 1). If you are using a metric where splitting on one dimension doesn't always yield a distance of 1, you need to adjust the level of recursion where you give up.
As far as I could understand, you have an input string X and you want to query the database for a document containing string field b such that Hamming distance between X and document.b is less than some small number d.
You can do this in linear time, just by scanning all of your N=1M documents and calculating the distance (which takes small fixed time per document). Since you only want documents with distance smaller than d, you can give up comparison after d unmatched characters; you only need to compare all 256 characters if most of them match.
You can try to scan fewer than N documents, that is, to get better than linear time.
Let ones(s) be the number of 1s in string s. For each document, store ones(document.b) as a new indexed field ones_count. Then you can only query documents where number of ones is close enough to ones(X), specifically, ones(X) - d <= document.ones_count <= ones(X) + d. The Mongo index should kick in here.
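The filter works because flipping one bit changes the number of 1s by exactly one, so two strings at distance at most d differ in their ones count by at most d. Below is an in-memory sketch of both ideas, with a plain dict standing in for the Mongo index on ones_count; the function names are hypothetical.

```python
from collections import defaultdict


def within_distance(a, b, d):
    """True if equal-length strings a and b differ in at most d positions.

    Gives up as soon as more than d mismatches have been seen.
    """
    mismatches = 0
    for c1, c2 in zip(a, b):
        if c1 != c2:
            mismatches += 1
            if mismatches > d:
                return False
    return True


def build_ones_index(strings):
    """Group strings by their number of 1s (stand-in for the indexed field)."""
    index = defaultdict(list)
    for s in strings:
        index[s.count("1")].append(s)
    return index


def query(index, x, d):
    """Strings within distance d of x, checking only plausible ones counts."""
    target = x.count("1")
    hits = []
    for count in range(target - d, target + d + 1):
        for s in index.get(count, ()):
            if within_distance(s, x, d):
                hits.append(s)
    return hits


if __name__ == "__main__":
    index = build_ones_index(["0110", "0111", "1001", "0000"])
    print(query(index, "0110", 1))
```

For random 256-bit strings most ones counts cluster around 128, so the filter prunes less than one might hope. It helps most when d is small.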
If you want to find all close enough pairs in the set, see @Philippe's answer.
This sounds like an algorithmic problem of some sort. You could try comparing those with a similar number of 1 or 0 bits first, then work down through the list from there. Those that are identical will, of course, come out on top. I don't think having tons of RAM will help here.
You could also try and work with smaller chunks. Instead of dealing with 256 bit sequences, could you treat that as 32 8-bit sequences? 16 16-bit sequences? At that point you can compute differences in a lookup table and use that as a sort of index.
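One way to make the chunk idea concrete is the pigeonhole argument: if two values differ in at most d bits and each is cut into d+1 chunks, at least one chunk must be identical in both. The sketch below is an illustration under that assumption, not something from the original answer. It works on Python integers (a bit string s can be converted with int(s, 2)), and the function names are made up.

```python
from collections import defaultdict


def build_chunk_index(numbers, n_bits, d):
    """One hash table per chunk, keyed by that chunk's exact bit pattern."""
    chunks = d + 1
    width = -(-n_bits // chunks)          # ceiling division
    mask = (1 << width) - 1
    tables = [defaultdict(list) for _ in range(chunks)]
    for x in numbers:
        for c in range(chunks):
            tables[c][(x >> (c * width)) & mask].append(x)
    return tables, width


def query_chunk_index(tables, width, y, d):
    """All indexed numbers within Hamming distance d of y."""
    mask = (1 << width) - 1
    found = set()
    for c, table in enumerate(tables):
        # Candidates share chunk c with y exactly; verify the full distance.
        for x in table.get((y >> (c * width)) & mask, ()):
            if bin(x ^ y).count("1") <= d:
                found.add(x)
    return found


if __name__ == "__main__":
    tables, width = build_chunk_index([0b10110000, 0b10110011, 0b01001111], 8, 1)
    print(sorted(query_chunk_index(tables, width, 0b10110001, 1)))
```

The index stores every value d+1 times, and the candidate lists only stay short when the chunks are wide enough that random collisions are rare.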
Depending on how "different" you care to match on, you could just permute changes on the source binary value and do a keyed search to find the others that match.

Finding median of large set of numbers too big to fit into memory

I was asked this question in an interview recently.
There are N numbers, too many to fit into memory. They are split across k database tables (unsorted), each of which can fit into memory. Find the median of all the numbers.
Wasn't quite sure about the answer to this one.
There's a few potential solutions:
External merge sort - O(n log n)
You basically sort the numbers on the first pass, then find the median on the second.
Order statistics distributed selection algorithm - O(n)
Simplify the problem to the original problem of finding the kth number in an unsorted array.
Counting sort histogram - O(n)
You have to assume some properties about the range of the numbers: can the range fit in memory?
If anything is known about the distribution of the numbers, other algorithms can be produced.
For more details and implementation see:
http://www.fusu.us/2013/07/median-in-large-set-across-1000-servers.html
This answer on quora explains the whole process clearly step by step http://qr.ae/dMkGc. Simply copying it down for non Quorans
Suppose you have a master node (or are able to use a consensus protocol to elect a master from among your servers). The master first queries the servers for the sizes of their data sets and adds them up, call the total n, so that it knows to look for the k-th largest element with k = n/2.
The master then selects a random server and queries it for a random element from the elements on that server. The master broadcasts this element to each server, and each server partitions its elements into those larger than or equal to the broadcasted element and those smaller than the broadcasted element.
Each server returns to the master the size of the larger-than partition, call this m. If the sum of these sizes is greater than k, the master indicates to each server to disregard the less-than set for the remainder of the algorithm. If it is less than k, then the master indicates to disregard the larger-than sets and updates k = k - m. If it is exactly k, the algorithm terminates and the value returned is the pivot selected at the beginning of the iteration.
If the algorithm does not terminate, recurse beginning with selecting a new random pivot from the remaining elements.
Analysis:
Let n be the total number of elements and s be the number of servers. Assume that the elements are roughly randomly and evenly distributed among servers (each server has O(n/s) elements). In iteration i, we expect to do about O(n/(s*2^i)) work on each server, as the size of each server's element set will be approximately cut in half (remember, we assumed roughly random distribution of elements), and O(s) work on the master (for broadcasting/receiving messages and adding the sizes together). We expect O(log(n/s)) iterations. Adding these up over all iterations gives an expected runtime of O(n/s + s log(n/s)), and assuming s << sqrt(n), which is normally the case, this becomes simply O(n/s), which is the best you could possibly hope for.
Note also that this works not just for finding the median but also for finding the kth largest value for any value of k.
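The protocol can be simulated in a single process to see how the counts drive the recursion. In this sketch each "server" is just a Python list and the function name is made up. It also deviates from the description in one respect: elements equal to the pivot are counted separately, which is an added assumption so that the loop still terminates when the data contains many duplicates.

```python
import random


def distributed_kth_largest(servers, k, rng=random):
    """Return the k-th largest value (1-based) across all server lists."""
    parts = [list(s) for s in servers]
    while True:
        # Master: pick a random non-empty server, then a random element on it.
        pivot = rng.choice(rng.choice([p for p in parts if p]))
        # Servers: partition locally and report only the sizes to the master.
        larger = [[x for x in p if x > pivot] for p in parts]
        smaller = [[x for x in p if x < pivot] for p in parts]
        m = sum(len(p) for p in larger)
        equal = sum(p.count(pivot) for p in parts)
        if k <= m:
            parts = larger              # answer is above the pivot
        elif k <= m + equal:
            return pivot                # answer is the pivot itself
        else:
            k -= m + equal              # answer is below the pivot
            parts = smaller


if __name__ == "__main__":
    servers = [[9, 1, 5], [7, 3], [8, 2, 6, 4]]
    print(distributed_kth_largest(servers, 5))
```

Only the pivot and the partition sizes cross the network in each round, never the elements themselves.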
Have a look at the "Median of Medians" algorithm in this Wikipedia article.
Related question: Median-of-medians in Java.
Explanation: http://www.ics.uci.edu/~eppstein/161/960130.html
Another way to look at this is to go back to the definition of "median." Authors vary in their language, but basically the median is the value which splits a probability distribution into two equal parts.
So instead of spending a lot of effort sorting enormous data sets, estimate the distribution and find the middle. As noted above for some distributions the median equals the mean, which is quick and easy to compute. Also, if an exact answer isn't necessary you can use the empirical relationship: mean - mode = 3 * (mean - median).
Here is what I would do:
1. Sample the data to get a general idea about the distribution.
2. Using the information about the distribution, choose a "bucket" (a range) large enough to contain the median and small enough to fit into memory.
3. With one pass (O(N)), count the numbers before the bucket (L1_size) and after the bucket (L3_size), and put the numbers within the range into the bucket (L2). You will see whether the chosen bucket contains the median. If not, go to step 2.
4. Use quickselect or another method to find the k-th smallest element in the bucket, where k = N/2 - L1_size and N = L1_size + L2_size + L3_size.
Requires O(N) + O(L2_size) steps.
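A small sketch of the counting pass, under a few assumptions: `stream_factory` is a hypothetical callable that returns a fresh iterator over the data each time (standing in for re-reading the tables), the bucket is the closed range [lo, hi], and the bucket is simply sorted instead of using quickselect to keep the example short. For an even count it returns the lower of the two middle values.

```python
def median_via_bucket(stream_factory, lo, hi):
    """Return the median if it falls inside [lo, hi], otherwise None."""
    below = above = 0
    bucket = []
    for x in stream_factory():
        if x < lo:
            below += 1              # L1_size
        elif x > hi:
            above += 1              # L3_size
        else:
            bucket.append(x)        # L2, kept in memory
    n = below + len(bucket) + above
    target = (n - 1) // 2           # 0-based overall rank of the median
    if not (below <= target < below + len(bucket)):
        return None                 # median is outside; pick another bucket
    bucket.sort()
    return bucket[target - below]   # rank inside the bucket


if __name__ == "__main__":
    data = [12, 3, 44, 7, 25, 18, 31, 9, 50]
    print(median_via_bucket(lambda: iter(data), 10, 30))
```

When the function returns None, the counts `below` and `above` already tell you on which side of the bucket the median lies, so the next guess can be moved in that direction.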
I was also asked the same question and I couldn't give an exact answer, so after the interview I went through some books on interviews, and here is what I found.
Example: Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But, be careful: if there's an even number of elements, the median is actually the average of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of max and mins. This is actually interesting: if you had two heaps, you could keep track of the bigger half and the smaller half of the elements. The bigger half is kept in a min heap, such that the smallest element in the bigger half is at the root. The smaller half is kept in a max heap, such that the biggest element of the smaller half is at the root. Now, with these data structures, you have the potential median elements at the roots. If the heaps are no longer the same size, you can quickly "rebalance" the heaps by popping an element off the one heap and pushing it onto the other.
Note that the more problems you do, the more developed your instinct on which data structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.
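The two-heap idea can be sketched with Python's heapq module, which only provides a min heap, so the smaller half is stored negated to act as a max heap. The class name is made up for illustration.

```python
import heapq


class RunningMedian:
    def __init__(self):
        self.low = []    # smaller half, stored negated (acts as a max heap)
        self.high = []   # bigger half, a plain min heap

    def add(self, x):
        if self.low and x > -self.low[0]:
            heapq.heappush(self.high, x)
        else:
            heapq.heappush(self.low, -x)
        # Rebalance so that low has the same size as high, or one more.
        if len(self.low) > len(self.high) + 1:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        elif len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        if len(self.low) > len(self.high):
            return -self.low[0]
        return (-self.low[0] + self.high[0]) / 2


if __name__ == "__main__":
    rm = RunningMedian()
    for x in [5, 1, 3]:
        rm.add(x)
        print(rm.median())
```

Each insertion costs O(log n) and reading the median is O(1). Note that this keeps every element in memory, so it answers the running-median example from the book rather than the too-big-for-memory question.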
If an approximate answer is sufficient, a method similar to @piccolbo's works well. I'll assume all the points are integers, but if not you can multiply by ten or a hundred or whatever to normalize the data to integers. Make one pass over the data calculating an average (arithmetic mean). Call that number the provisional median. Then make a second pass over the data. If the data point is less than the provisional median, reduce the provisional median by one. If the data point is greater than the provisional median, increase the provisional median by one. If the data point is the same as the provisional median, leave the provisional median unchanged. After the end of the data, return the provisional median. What will happen is that the provisional median will initially change from time to time, but eventually it will stabilize over a very small range, which will be very close to the actual median.
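The two passes are short enough to write out directly. This sketch assumes integer data that can be iterated twice, and the function name is made up. The result depends on the order of the data, so it should be read as an estimate only.

```python
def approximate_median(data):
    """Estimate the median with a mean pass followed by a +/-1 nudging pass."""
    # First pass: the arithmetic mean is the provisional median.
    total = count = 0
    for x in data:
        total += x
        count += 1
    estimate = round(total / count)
    # Second pass: move one step toward each data point.
    for x in data:
        if x < estimate:
            estimate -= 1
        elif x > estimate:
            estimate += 1
    return estimate


if __name__ == "__main__":
    print(approximate_median([1, 3, 2]))
```

Because each point moves the estimate by exactly one, the method needs many points relative to the spread of the values, and sorted or strongly ordered input can drag the estimate away from the true median.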

Splitting a set of object into several subsets according to certain evaluation

Suppose I have a set of objects, S. There is an algorithm f that, given a set S builds certain data structure D on it: f(S) = D. If S is large and/or contains vastly different objects, D becomes large, to the point of being unusable (i.e. not fitting in allotted memory). To overcome this, I split S into several non-intersecting subsets: S = S1 + S2 + ... + Sn and build Di for each subset. Using n structures is less efficient than using one, but at least this way I can fit into memory constraints. Since size of f(S) grows faster than S itself, combined size of Di is much less than size of D.
However, it is still desirable to reduce n, i.e. the number of subsets; or reduce the combined size of Di. For this, I need to split S in such a way that each Si contains "similar" objects, because then f will produce a smaller output structure if input objects are "similar enough" to each other.
The problem is that while "similarity" of objects in S and size of f(S) do correlate, there is no way to compute the latter other than just evaluating f(S), and f is not quite fast.
Algorithm I have currently is to iteratively add each next object from S into one of Si, so that this results in the least possible (at this stage) increase in combined Di size:
for x in S:
    i = such i that
        size(f(Si + {x})) - size(f(Si))
        is min
    Si = Si + {x}
This gives practically useful results, but certainly pretty far from optimum (i.e. the minimal possible combined size). Also, this is slow. To speed up somewhat, I compute size(f(Si + {x})) - size(f(Si)) only for those i where x is "similar enough" to objects already in Si.
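For reference, the greedy loop can be written as runnable code once f and size are supplied as functions. In the sketch below `build` and `size` are placeholders for the real f and its size measure, and the toy versions in the example (distinct letters, squared count) are invented purely so that the code runs. The function name is also hypothetical.

```python
def greedy_partition(objects, n_subsets, build, size):
    """Assign each object to the subset whose structure grows the least."""
    subsets = [[] for _ in range(n_subsets)]
    sizes = [size(build(s)) for s in subsets]
    for x in objects:
        best_i = best_delta = best_size = None
        for i, s in enumerate(subsets):
            new_size = size(build(s + [x]))     # one evaluation of f per i
            delta = new_size - sizes[i]
            if best_delta is None or delta < best_delta:
                best_i, best_delta, best_size = i, delta, new_size
        subsets[best_i].append(x)
        sizes[best_i] = best_size               # cache size(f(Si))
    return subsets


if __name__ == "__main__":
    # Toy stand-ins: D is the set of distinct letters; its cost grows
    # faster than linearly, mimicking "f(S) grows faster than S".
    toy_build = lambda words: set().union(*[set(w) for w in words])
    toy_size = lambda d: len(d) ** 2
    print(greedy_partition(["abc", "xyz", "abd", "xyw"], 2, toy_build, toy_size))
```

Caching size(f(Si)) halves the number of evaluations of f, and the "similar enough" filter mentioned above would go in the inner loop as a cheap test before calling build.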
Is there any standard approach to such kinds of problems?
I know of branch and bounds algorithm family, but it cannot be applied here because it would be prohibitively slow. My guess is that it is simply not possible to compute optimal distribution of S into Si in reasonable time. But is there some common iteratively improving algorithm?
EDIT:
As comments noted, I never defined "similarity". In fact, all I want is to split in such subsets Si that combined size of Di = f(Si) is minimal or at least small enough. "Similarity" is defined only as this and unfortunately simply cannot be computed easily. I do have a simple approximation, but it is only that — an approximation.
So, what I need is a (likely heuristic) algorithm that minimizes sum f(Si) given that there is no simple way to compute the latter, only approximations I use to throw away cases that are very unlikely to give good results.
About the slowness: I found that in similar problems a good-enough solution is to compute the match by picking just a fixed number of random candidates.
True, the result will not be the best one (often worse than the full "greedy" solution you implemented), but in my experience it is not too bad, and you can decide the speed. It can even be implemented to run in a prescribed amount of time (that is, you keep searching until the allocated time expires).
Another option I use is to keep searching until I see no improvement for a while.
To get past the greedy logic, you could keep a queue of N "x" elements and try to pack them simultaneously in groups of "k" (with k < N).
In this case I found that it is important to also keep the "age" of an element in the queue and to use it as a "prize" for the result, to avoid keeping "bad" elements in the queue forever because others will always match better (this would make the queue search useless, and the results would be basically the same as with the greedy approach).

Resources