Non-trivial usage of the count-min sketch data structure

I have a large array with increasing values - like this:
array = [0, 1, 6, 6, 12, 13, 22, ..., 92939, 92940]
and I want to use the interpolation-search algorithm on it. The size of the array is variable; new elements are added to the end of the array.
I need to find the index of some element, let's call it X.
Y = find(X in array)
Y must be an index of an element of the array such that array[Y] >= X.
find can be implemented with binary search, but for some complicated reasons I want to implement it using interpolation search. Interpolation search tries to guess the correct position of X by looking at the bounds of the array: if the first array value is 0, the last is 100, and the array length is 1000, then to find the value 25 I would look at index 250 first. This works like a charm if the values of the array are evenly distributed. But if they are not evenly distributed, interpolation search can be slower than binary search (some optimizations are possible).
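For reference, a rough Python sketch of the interpolation-search flavour of find described above might look like this (returning the first index with array[i] >= X; the clamping of the guess and the equal-endpoints check are defensive additions, not requirements from the question):

def interpolation_find(array, x):
    """Return the first index i with array[i] >= x (or len(array) if none)."""
    lo, hi = 0, len(array) - 1
    if not array or x > array[hi]:
        return len(array)
    while lo < hi:
        if array[lo] == array[hi]:          # all values on [lo, hi] are equal and x <= array[lo]
            return lo
        # proportional guess of where x sits between array[lo] and array[hi]
        guess = lo + (x - array[lo]) * (hi - lo) // (array[hi] - array[lo])
        guess = max(lo, min(guess, hi - 1)) # keep the guess inside [lo, hi-1] so we always make progress
        if array[guess] < x:
            lo = guess + 1
        else:
            hi = guess
    return lo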
I'm trying to speed up the search in such cases using a Count-Min Sketch data structure. When I append a new element to the array, I also add some data to the count-min sketch:
Z = 1005000
elements_before_Z = len(array)
array.append(Z)
count_min_sketch.add(Z, elements_before_Z)
# Z is the key and elements_before_Z is the count
Using this approach I can guess the position of the searched element X approximately. This can speed up the search if the guess is correct, but I've run into some problems.
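For concreteness, the lookup side might look like this minimal Python sketch (the CountMinSketch class, its width/depth parameters, the hashing scheme and the bisect fallback are all illustrative assumptions, not a fixed design):

import bisect

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for row in range(self.depth):
            yield row, hash((row, key)) % self.width   # one salted hash per row

    def add(self, key, count):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._cells(key))

def find(array, x, cms):
    guess = min(cms.estimate(x), len(array) - 1)       # may be 0 or an overestimate
    # accept the guess only if it is exactly the first index with array[guess] >= x,
    # otherwise fall back to a binary search
    if array[guess] >= x and (guess == 0 or array[guess - 1] < x):
        return guess
    return bisect.bisect_left(array, x)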
First, I don't know whether X is in the array, i.e. whether my count_min_sketch has seen this value. If it has, I can get the correct value out of the count-min sketch. If it hasn't, I will get 0 or some other value (worst-case scenario).
Collisions. If the value X has been seen by my count_min_sketch object, then I get back the correct value or a larger one. When a count-min sketch is used for something like counting word occurrences in a document, this is not a problem, because collisions are rare and the error is less than or equal to the number of collisions (it is usually used like count_min_sketch.add(Z, 1)). In my case, every collision can result in a large error, because I usually add large numbers for every key.
Is it possible to use a count-min sketch in such a way (adding a large count for every key)?

Related

Searching in vector of pairs

I have a vector of pairs (datatype = double), where each pair is (a, b) with a < b. For a number x, I want to find the number of pairs in the vector where a <= x <= b.
Consider a vector size of about 10^6.
My Approach
Sort the vector of pairs and perform a lower_bound operation for x over "a" in the pairs, then iterate from the start up to the lower-bound position and check which values of "b" satisfy x <= b.
Time Complexity
O(N log N), where N is the vector size.
Issue
I have to perform this over a large number of queries, where this approach becomes inefficient. So is there any better solution to decrease the time complexity?
Sorry for my poor English and question formatting.
In addition to the previous answer, here's a suggestion for how to prepare the ranges to optimize the subsequent lookup. The idea boils down to precomputing the result for all significantly different input values, while being smart about which values don't differ significantly.
To illustrate what I mean, let's consider this sequence of ranges:
1, 3
1, 8
2, 4
2, 6
The prepared output structure then looks like this:
1, 2 -> 2
2, 3 -> 4
3, 4 -> 3
4, 6 -> 2
6, 8 -> 1
For any number in the range 1, 2, there are two matching ranges in the initial sequence. For any number in the range 2, 3, there are four matches, etc. Note that there are five ranges here now, because some of the input ranges partially overlapped. Since for every range here the end value is also the start value of the next range, the end value can be optimized out. The result then looks like a simple map:
1 -> 2
2 -> 4
3 -> 3
4 -> 2
6 -> 1
8 -> 0
Note here that the last range didn't have one following, so the explicit zero becomes necessary. For the values before the first, that is implied. In order to find the result for a value, just find the key that is less than or equal to that value. This is a simple O(log n) lookup.
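A minimal Python sketch of this preparation and lookup (a sorted list of boundary keys plus bisect; the function names are illustrative, and the handling of queries exactly on a boundary follows the map above):

import bisect
from collections import defaultdict

def prepare(ranges):
    delta = defaultdict(int)
    for a, b in ranges:
        delta[a] += 1              # a range opens at a
        delta[b] -= 1              # and closes at b
    keys = sorted(delta)
    counts, running = [], 0
    for k in keys:
        running += delta[k]
        counts.append(running)     # ranges covering [k, next key)
    return keys, counts

def count_matches(keys, counts, x):
    i = bisect.bisect_right(keys, x) - 1    # greatest key <= x
    return counts[i] if i >= 0 else 0

keys, counts = prepare([(1, 3), (1, 8), (2, 4), (2, 6)])
print(count_matches(keys, counts, 2.5))     # 4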
Firstly, if you just did a simple scan over the pairs, you would have O(n) complexity! The O(n log n) comes from sorting, and for a one-off operation this is just overhead. This might even be the best way to do it if you don't reuse the results; even if you perform just a few queries, it might still be better than sorting. Make sure you allow yourself to switch out the algorithm.
Anyhow, let's assume that you need to make many queries. Then one relatively obvious improvement is not to iterate step by step after sorting. Instead, you can do a binary search for the lower bound: simply partition the sequence into halves; the lower bound can be found in either half, which you can determine by looking at the middle element between the partitions. Recurse until you have found the first element that cannot possibly contain the value you are searching for, because its start value is already greater.
Concerning the other direction, things are not that easy. Just because you sorted the ranges by the start value doesn't imply that the end values are sorted, too. Also, ranges that match and ranges that don't can be mixed in the sequence, so here you will have to perform a linear scan.
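A short sketch of this sort-once, binary-search-then-scan approach (in Python rather than C++ for brevity; count_containing is an illustrative name):

import bisect

pairs = sorted([(1, 3), (1, 8), (2, 4), (2, 6)])    # sort once by start value
starts = [a for a, _ in pairs]

def count_containing(x):
    hi = bisect.bisect_right(starts, x)             # first pair whose start is > x
    return sum(1 for _, b in pairs[:hi] if x <= b)  # linear scan over the end values

print(count_containing(2.5))                        # 4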
Lastly, some notes:
You could parallelize this algorithm using multithreading.
Depending on the number of searches M in your outer loop, you could also swap the outer loop with the inner one. That means that for every pair of the input vector, you check whether each of the M search values falls within the range. This might be better, in particular when the M searches fit into the CPU cache.
This is a very typical problem for segment trees, binary indexed trees, and interval trees.
There are two operations that you have to carry out on an array arr:
1. Range update: Add(a, b): for(int i = a; i <= b; ++i) arr[i]++
2. Point query : Query(x): return arr[x]
Alternatively, you can formulate the problem slightly more cleverly:
1. Point Update: Add(a, b): arr[a]++; arr[b+1]--;
2. Range Query: Query(x): return sum(arr[0], arr[1] ..... arr[x]);
In each of the cases above, you have one O(n) operation and one O(1) operation.
For the second case, the query is essentially a prefix sum calculation. Binary Indexed Trees are especially efficient at this task.
Tutorial for Binary Indexed Trees
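Here is a brief Python sketch of the second formulation (point updates arr[a] += 1, arr[b+1] -= 1 and a prefix-sum query) backed by a binary indexed (Fenwick) tree. The class name and interface are illustrative, and the interval endpoints are assumed to already be compressed to small positive integers as described below:

class FenwickTree:
    def __init__(self, n):
        self.n = n
        self.bit = [0] * (n + 1)              # 1-based

    def add(self, i, delta):                  # point update, O(log n)
        while i <= self.n:
            self.bit[i] += delta
            i += i & -i

    def prefix_sum(self, i):                  # sum of arr[1..i], O(log n)
        s = 0
        while i > 0:
            s += self.bit[i]
            i -= i & -i
        return s

n = 10
ft = FenwickTree(n + 1)                       # +1 so that index b+1 stays in range
for a, b in [(1, 3), (1, 8), (2, 4), (2, 6)]:
    ft.add(a, 1)                              # arr[a] += 1
    ft.add(b + 1, -1)                         # arr[b+1] -= 1
print(ft.prefix_sum(2))                       # ranges covering x = 2  ->  4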
IMPORTANT IDEA: ARRAY COMPRESSION
You did mention that the vector size is about 10^6, so there is a chance that you may not be able to create an array that big. If you are able to create a set that consists of all the a's, b's and x's beforehand, then you can translate them into numbers from 1 to the size of the set.
SUPER CLEVER IDEA: MO's ALGORITHM
This is only applicable if you are allowed to solve the problem offline. What that means is that you can take all the query points x as input, solve them in any order you like, store the solutions, and then print them in the correct order.
Please mention if this is your situation, and only then will I elaborate further on this. But Binary Indexed Trees are going to be more efficient than Mo's algorithm.
EDIT:
Because your interval values are of type double, you must convert them to integers before you use my solution. Let me give an example,
Intervals = (1.1 to 1.9), (1.4 to 2.1)
Query Points = 1.5, 2.0
Here all the points that are of interest are not all the possible doubles, but just the above numbers = {1.1, 1.4, 1.5, 1.9, 2.0, 2.1}
If we map them into positive integers:
1.1 --> 1
1.4 --> 2
1.5 --> 3
1.9 --> 4
2.0 --> 5
2.1 --> 6
Then you could use segment trees/binary indexed trees.
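A small illustrative sketch of this coordinate compression in Python (the helper name compress is mine):

def compress(values):
    """Map each distinct value to a positive integer, preserving order."""
    return {v: i + 1 for i, v in enumerate(sorted(set(values)))}

intervals = [(1.1, 1.9), (1.4, 2.1)]
queries = [1.5, 2.0]
mapping = compress([v for ab in intervals for v in ab] + queries)
print(mapping)   # {1.1: 1, 1.4: 2, 1.5: 3, 1.9: 4, 2.0: 5, 2.1: 6}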
For each pair (a, b) you can decompose it into the events a -> +1 and b -> -1 for the number of ranges valid at a particular value. Then it becomes a simple O(log n) lookup to see how many ranges encompass the search value.

Constant time binning of values

Say I have a vector of values that represent the upper boundaries of classes to classify (bin) values in. So e.g. vector { 1, 3, 5, 10 } represents bins [0, 1[, [1, 3[, [3, 5[ and [5,10[. How do I implement classification of a random value V in one of these classes (0,1,2,3) in constant time? It's trivial to walk the list of boundaries and stop once V surpasses the bin's upper boundary; but that's O(n) wrt the number of bins; I'm looking to do this in constant time.
I thought it was trivial before I actually started typing the code: set up a lookup table, divide each V by a certain value depending on the class bounds, and then use the (rounded) result of the division to find the bin number in the lookup table. But I'm finding it a lot harder than I thought to make this generic in a way that minimizes the size of the lookup table while still being accurate, regardless of the proportional distance between bin boundaries, and in a way that works for all real values. With Googling I only find algorithms that determine the boundaries of the bins, at least using the terms I did.
I doubt there's a way to do this in strictly constant time (and not requiring infinite space) without taking advantage of some property of the given numbers.
A lookup table is a decent idea, but floating-point values make this difficult. If the number of digits is finite, you can consider representing the lookup table as essentially a trie (a tree where each level represents a digit).
So for {1, 2.5, 5, 9}, your tree would look something like this:
root
/ / / / | \ \ \ \ \
0 1 2 3 4 5 6 7 8 9
/ | \
2.0 ... 2.5 ... 2.9
Each leaf node would contain a value indicating which interval it belongs to, so
0 will be set to 0,
1, 2.0 - 2.4 will all be set to 1,
2.5 - 2.9, 3 - 4 will be set to 2,
5 - 9 will be set to 3
A query would just involve starting from the root and repeatedly going to the child node corresponding to the next digit in the number we're looking up (if you look up 2.65 in the above tree, you first go to 2, then 2.6, then, since it's a leaf, you stop and return its value, which is 2).
The time complexity for a query would be O(d), where d is the number of significant digits in your vector, and the space complexity is O(nd).
That might not sound particularly efficient, but keep in mind that d is the number of digits - for example, that would be d = log m with m being the maximum possible value if we're talking about positive integers.
O(log n) is fairly trivial if you just set up a binary search tree (BST) containing all the values in the vector mapped to their original indices.
A lookup would look very similar to how you'd search a BST: start from the root and go either left or right until you find the value, except in this case you note every node you visit and return the mapped index of the closest value that's not bigger. Some APIs have methods that basically do this for you (such as std::map in C++).
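In Python, for instance, a sorted boundary list plus bisect gives the same O(log n) lookup (a sketch; the bin indices follow the [lower, upper[ convention from the question):

import bisect

bounds = [1, 3, 5, 10]          # upper boundaries of bins 0..3

def classify(v):
    """Return the bin index of v, i.e. the first i with v < bounds[i]."""
    return bisect.bisect_right(bounds, v)

print(classify(0.5), classify(1), classify(4.2), classify(9.9))   # 0 1 2 3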
I think the only way to get O(1) is to create a lookup table so that you can look up all the values directly.
This is only feasible if the boundaries behave nicely:
The expected numbers are integers or the boundaries are integers or have limited precision. This allows you to round down (floor) the number before checking against the lookup table and drastically reduces the required entries for the table.
The difference between the max and min boundary cannot be too big. Let's say we know that the precision of the boundaries is 0.5 and the min is 1 and the max is 10, then the lookup table requires (10-1)/0.5 = 18 entries.
The checks for the first and last group (smaller than min and greater than max) are done with simple if statements, which doesn't affect the complexity.
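A sketch of such a lookup table under those assumptions (boundaries {1, 3, 5, 10} from the question, precision 1, so one table entry per unit between min and max; values outside [min, max) are handled by the if checks):

import bisect

bounds = [1, 3, 5, 10]
lo, hi, step = bounds[0], bounds[-1], 1                 # precision of 1 assumed

# table[k] = bin index for any value in [lo + k*step, lo + (k+1)*step)
table = [bisect.bisect_right(bounds, lo + k * step)
         for k in range(int((hi - lo) / step))]

def classify(v):                                        # O(1) per query
    if v < lo:
        return 0                                        # everything below min falls in bin 0
    if v >= hi:
        return len(bounds) - 1                          # clamp; strictly speaking out of range
    return table[int((v - lo) / step)]

print(classify(0.2), classify(2), classify(7))          # 0 1 3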

Looking for a limited shuffle algorithm

I have a shuffling problem. There are lots of pages and discussions about shuffling an array of values completely, like a stack of cards.
What I need is a shuffle that will uniformly displace the array elements at most N places away from their starting positions.
That is, if N is 2 then element i will be shuffled at most to a position from i-2 to i+2 (within the bounds of the array).
This has proven to be tricky, with some simple solutions resulting in a directional bias in the element movement, or movement by a non-uniform amount.
You're right, this is tricky! First, we need to establish some more rules, to ensure we don't create artificially non-random results:
Elements can be left in the position they started in. This is a necessary part of any fair shuffle, and also ensures our shuffle will work for N=0.
When N is larger than an element's distance from the start or end of the array, it's allowed to be moved to the other side. We could tweak the algorithm to forbid this, but it would violate the "uniformly" requirement - elements near either end would be more likely to stay put than elements near the middle.
Now we can actually solve the problem.
Generate an array of random values in the range i + [-N, N], where i is the current index in the array. Normalize values outside the array bounds (e.g. -1 should become length-1 and length should become 0).
Look for pairs of duplicate values (collisions) in the array, and recompute them. You have a few options:
Recompute both values until they don't collide with each other; they could both still collide with other values.
Recompute just one until it doesn't collide with the other; the first value could still collide, but the second should now be unique, which might mean fewer calls to the RNG.
Identify the set of available indices for each collision (e.g. in [3, 1, 1, 0] index 2 is available), pick a random value from that set, and set one of the colliding array values to the selected result. This avoids needing to loop until the collision is resolved, but is more complex to code and risks running into a case where the set is empty.
However you address individual collisions, repeat the process until every value in the array is unique.
Now move each element in the original array to the index specified in the array we generated.
I'm not sure how best to implement step 2; I'd suggest you benchmark it. If you don't want to take the time to benchmark, I'd go with the first option. The others are optimizations that might be faster, but might actually end up being slower.
This solution has an unbounded runtime in theory, but should terminate reasonably quickly in practice. Again, benchmark and test it before using it anywhere critical.
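A compact Python sketch of this scheme (wrap-around at the ends as described above, and the simple "recompute one of each colliding pair" strategy; limited_shuffle is an illustrative name):

import random

def limited_shuffle(items, n):
    """Shuffle so that each element moves at most n positions (with wrap-around)."""
    length = len(items)
    targets = [(i + random.randint(-n, n)) % length for i in range(length)]
    # resolve collisions: recompute one of each colliding pair until all targets are unique
    while len(set(targets)) != length:
        seen = set()
        for i, t in enumerate(targets):
            if t in seen:
                targets[i] = (i + random.randint(-n, n)) % length
            else:
                seen.add(t)
    result = [None] * length
    for i, t in enumerate(targets):
        result[t] = items[i]
    return result

print(limited_shuffle(list(range(10)), 2))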
One possible solution I have come up with, though I am not certain how 'naive' it is, especially at the edges (the far edge in particular):
Create an array of flags (booleans) N long, representing elements that have been swapped.
At each index, check whether it has already been swapped (according to the first element in the flags array); if so, move on to the next index (see below).
Rotate the flags array: delete the first element (representing this element) and add a new 'not swapped' element to the end. Aside: this may be done using a modulus array lookup, to avoid having to actually move the array contents, especially for large N.
Loop:
Pick a number from 0 to N (or less than N, if N plus the current index is larger than the array being shuffled).
If 0, the element swaps with itself; move on to the next index.
Otherwise, if that element is marked as swapped, loop and try again.
Note there are always at least 2 elements in the flags array that can be picked: the element itself and the last element (unless close to the end of the array being shuffled).
Swap the current element with the selected unswapped element, mark the selected element as swapped in the flags array, and loop to the next element.

Search in ordered list

Assume we have a list 0, 10, 30, 45, 60, 70 sorted in ascending order. Given a number X, how do I find the number in the list immediately below it (or equal to it)?
I am looking for the most efficient (fastest) algorithm to do this, without, of course, having to iterate through the whole list.
Ex: [0, 10, 30, 45, 60, 70]
Given the number 34, I want to return 30.
Given the number 30, I want to return 30.
Given the number 29, I want to return 10.
And so forth.
If your list is indeed that small, the most efficient way would be to create an array of size 71, initialize it once with arr[i] = answer for every i, and then answer each query in constant time by just reading the array. The idea is that since your possible set of queries is so limited, there is no reason not to pre-calculate the answers and read the result from the pre-calculated data.
If you cannot pre-process, and the array is that small, a linear scan will be the most efficient; the overhead of a more complex algorithm is not worth it for such a small array. The extra work per iteration of more complex algorithms (like binary search) is not amortized on such small inputs. Note that log_2(6) < 3, which is also the expected number of elements examined in a linear search (assuming a uniform distribution), but linear search is so much simpler that each iteration is much faster than in binary search.
Pseudo code (X is the value being searched for):
prev = -infinity
for x in arr:
    if x > X:
        return prev
    prev = x
return prev
If the array is getting larger, use binary search. This algorithm is designed to find a value (or the first value closest to it) in a sorted array, runs in O(log n) time, and needs to traverse significantly fewer elements than the entire list. It will achieve much better results (in terms of time performance) than the naive linear scan, assuming a uniform distribution of queries.
Is the list always sorted? Fast to get written or fast in execution time?
Look at this: http://epaperpress.com/sortsearch/download/sortsearch.pdf
Implement the Binary Search Algorithm where, in case the element is not found, you return the element in the last visited position (if it's smaller than or equal to the given number) or the element in the last visited position - 1 (in case the element in the last visited position is greater than the given number).
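A small Python sketch of this floor-style binary search (returning the largest element <= X; bisect does the index bookkeeping described above):

import bisect

def floor_element(sorted_list, x):
    """Return the largest element <= x, or None if every element is greater."""
    i = bisect.bisect_right(sorted_list, x)    # first index with a value > x
    return sorted_list[i - 1] if i > 0 else None

lst = [0, 10, 30, 45, 60, 70]
print(floor_element(lst, 34))   # 30
print(floor_element(lst, 30))   # 30
print(floor_element(lst, 29))   # 10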

Find random numbers in a given range with certain possible numbers excluded

Suppose you are given a range and a few numbers in the range (exceptions). Now you need to generate a random number in the range except the given exceptions.
For example, if range = [1..5] and exceptions = {1, 3, 5} you should generate either 2 or 4 with equal probability.
What logic should I use to solve this problem?
If you have no constraints at all, I guess this is the easiest way: create an array containing the valid values, a[0]...a[m]. Return a[rand(0, ..., m)].
If you don't want to create an auxiliary array, but you can count the number of exceptions e and of elements n in the original range, you can simply generate a random number r=rand(0 ... n-e), and then find the valid element with a counter that doesn't tick on exceptions, and stops when it's equal to r.
Depends on the specifics of the case. For your specific example, I'd return a 2 if a Uniform(0,1) was below 1/2, 4 otherwise. Similarly, if I saw a pattern such as "the exceptions are odd numbers", I'd generate values for half the range and double. In general, though, I'd generate numbers in the range, check if they're in the exception set, and reject and re-try if they were - a technique known as acceptance/rejection for obvious reasons. There are a variety of techniques to make the exception-list check efficient, depending on how big it is and what patterns it may have.
Let's assume, to keep things simple, that arrays are indexed starting at 1, and your range runs from 1 to k. Of course, you can always shift the result by a constant if this is not the case. We'll call the array of exceptions ex_array, and let's say we have c exceptions. These need to be sorted, which shall turn out to be pretty important in a while.
Now, you only have k-c useful numbers to work with, so it'll be meaningful to find a random number in the range 1 to k-c. Say we end up with the number r. Now, we just need to find the r-th valid number in your array. Simple? Not so much. Remember, you can never simply walk over any of your arrays in a linear fashion, because that can really slow down your implementation when you have a lot of numbers. You have to do some sort of binary search, say, to come up with a fast enough algorithm.
So let's try something better. The r-th number would nominally have been at index r in your original array had you had no exceptions. The number at index r is r, of course, since your range and your array indices start from 1. But you have a bunch of invalid numbers between 1 and r, and you want to somehow get to the r-th valid number. So, let's do a binary search on the array of exceptions, ex_array, to find how many invalid numbers are less than or equal to r, because that is how many invalid numbers lie between 1 and r. If this number is 0, we're all done, but if it isn't, we have a bit more work to do.
Assume you found there were n invalid numbers between 1 and r after the binary search. Let's advance n indices in your array to the index r+n, and find the number of invalid numbers lying between 1 and r+n, using a binary search to find how many elements in ex_array are less than or equal to r+n. If this number is exactly n, no more invalid numbers were encountered, and you've hit upon your r-th valid number. Otherwise, repeat again, this time for the index r+n', where n' is the number of invalid numbers that lie between 1 and r+n.
Repeat till you get to a stage where no excess exceptions are found. The important thing here is that you never once have to walk over any of the arrays in a linear fashion. You should also optimize the binary searches so they don't always start at index 0: say you know there are n invalid numbers between 1 and r; instead of starting your next binary search from 1, you could start it from one index after the index corresponding to n in ex_array.
In the worst case, you'll be doing binary searches for each element in ex_array, which means you'll do c binary searches, the first starting from index 1, the next from index 2, and so on, which gives you a time complexity of O(log(c!)). Now, Stirling's approximation tells us that O(ln(x!)) = O(x ln x), so using the algorithm above only makes sense if c is small enough that O(c ln c) < O(k), since you can achieve O(k) complexity using the trivial method of extracting the valid elements from your array first.
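A minimal Python sketch of this rank-based lookup (the exceptions are kept sorted; the loop repeats the "advance past the invalid numbers you just discovered" step until the count stabilizes; random_excluding is an illustrative name):

import bisect, random

def random_excluding(k, exceptions):
    """Uniform random number in 1..k, excluding the values in `exceptions`."""
    ex = sorted(exceptions)
    r = random.randint(1, k - len(ex))        # rank among the valid numbers
    candidate, skipped = r, 0
    while True:
        new_skipped = bisect.bisect_right(ex, candidate)   # exceptions <= candidate
        if new_skipped == skipped:
            return candidate                  # candidate is the r-th valid number
        candidate = r + new_skipped
        skipped = new_skipped

print(random_excluding(5, [1, 3, 5]))         # 2 or 4, each with probability 1/2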
In Python the solution is very simple (given your example):
import random
rng = set(range(1, 6))
ex = {1, 3, 5}
random.choice(list(rng-ex))
To optimize the solution, one needs to know how long the range is and how many exceptions there are. If the number of exceptions is very low, it's possible to generate a number from the range and just check that it's not an exception. If exceptions dominate, it probably makes sense to gather the remaining numbers into an array and generate a random index for fetching a non-exception.
In this answer I assume that it is known how to get an integer random number from a range.
Here's another approach...just keep on generating random numbers until you get one that isn't excluded.
Suppose your desired range was [0,100) excluding 25,50, and 75.
Put the excluded values in a hashtable or bitarray for fast lookup.
int randNum = rand(0,100);
while( excludedValues.contains(randNum) )
{
    randNum = rand(0,100);
}
The complexity analysis is more difficult, since potentially rand(0,100) could return 25, 50, or 75 every time. However that is quite unlikely (assuming a random number generator), even if half of the range is excluded.
In the above case, we re-generate a random value for only 3/100 of the original values.
So 3% of the time you regenerate once. Of those 3%, only 3% will need to be regenerated, etc.
Suppose the initial range is [1, n] and the exclusion set's size is x. First generate a map from [1, n-x] to the numbers in [1, n] excluding the numbers in the exclusion set. This mapping is 1-1, since there are equal counts on both sides. In the example given in the question the mapping will be as follows: {1->2, 2->4}.
Another example: suppose the range is [1, 10] and the exclusion list is [2, 5, 8, 9]; then the mapping is {1->1, 2->3, 3->4, 4->6, 5->7, 6->10}. This map can be created with a worst-case time complexity of O(n log n).
Now generate a random number in [1, n-x] and map it to the corresponding number using the mapping. Map lookups can be done in O(log n).
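A short Python sketch of this mapping approach (for simplicity the map is stored as a plain list, so the lookup here is O(1) rather than O(log n); build_mapping is an illustrative name):

import random

def build_mapping(n, exclusions):
    """List whose i-th entry is the (i+1)-th valid number in 1..n."""
    excluded = set(exclusions)
    return [v for v in range(1, n + 1) if v not in excluded]

mapping = build_mapping(10, [2, 5, 8, 9])
print(mapping)                                  # [1, 3, 4, 6, 7, 10], i.e. 1->1, 2->3, ..., 6->10
print(mapping[random.randrange(len(mapping))])  # a uniform random non-excluded value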
You can do it in a versatile way if you have enumerators or set operations. For example using Linq:
void Main()
{
    var exceptions = new[] { 1, 3, 5 };
    RandomSequence(1, 5).Where(n => !exceptions.Contains(n))
                        .Take(10)
                        .ToList()
                        .ForEach(Console.WriteLine);
}
static Random r = new Random();
IEnumerable<int> RandomSequence(int min, int max)
{
    while (true)
        yield return r.Next(min, max + 1);
}
I would like to acknowledge some comments that are now deleted:
It's possible that this program never ends (only theoretically), because there could be a sequence that never contains valid values. Fair point. I think this is something that could be explained to the interviewer; however, I believe my example is good enough for the context.
The distribution is fair because each of the elements has the same chance of coming up.
The advantage of answering this way is that you show understanding of modern "functional-style" programming, which may be interesting to the interviewer.
The other answers are also correct. This is a different take on the problem.
