A special sample method in Map-Reduce implementation - algorithm

I have a table with roughly 4*10^8 records, and I want to draw a sample of exactly 4*10^6 records from it.
But the way I get the sample is somewhat special:
1. Select 1 record from the 4*10^8 records at random (every record has the same probability of being selected).
2. Repeat step 1 4*10^6 times (it does not matter if a record gets selected multiple times).
I came up with a method to solve this:
Generate a table A(num int), where each record of table A holds a single random integer drawn from 1 to n (n is the size of my original table, roughly 4*10^8 as mentioned above).
Load table A as a resource file into every mapper, and if the ordinal number of the record currently being processed appears in table A, output that record; otherwise discard it.
I don't think my method is very good, because if I want to sample more records from the original table, table A becomes very large and can no longer be loaded as a resource file.
So, could anyone please suggest a more elegant algorithm?

I'm not sure what "elegant" means, but perhaps you're interested in something analogous to reservoir sampling. Let k be the size of the sample and initialize a k-element array with nulls. The elements from which we are sampling arrive one by one. When the jth (counting from 1) element arrives, we iterate through the array and, for each cell, replace its contents by the current element independently with probability 1/j.
Naively, the running time is pretty bad -- to sample k elements from n with replacement costs O(k n). The number of writes into the array, however, is O(k log n) in expectation, because later elements in the stream rarely result in writes. Here's an efficient method based on the exponential distribution (warning: lightly tested Python ahead). The running time is O(n + k log n).
import math
import random

def sample_from(population, k):
    for i, x in enumerate(population):
        if i == 0:
            # the first element fills every slot of the sample
            sample = [x] * k
        else:
            # simulate how many of the k slots element i+1 overwrites
            # using exponential spacings instead of k coin flips
            t = float(k) * math.log(1.0 - 1.0 / float(i + 1))
            while True:
                t -= math.log(1.0 - random.random())
                if t >= 0.0:
                    break
                sample[random.randrange(k)] = x
    return sample

Related

fastest algorithm for sum queries in a range

Assume we have the following data, which consists of consecutive 0's and 1's (the nature of the data is that there are very, very few 1s).
data =
[0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0]
So: a huge number of zeros, and then possibly some ones (which indicate that some sort of event is happening).
You want to query this data many times. The query is: given two indices i and j, what is sum(data[i:j])? For example, sum_query(i=12, j=25) = 2 for the data above.
Note that you have all these queries in advance.
What sort of a data structure can help me evaluate all the queries as fast as possible?
My initial thoughts:
Preprocess the data and obtain two shorter arrays: data_change and data_cumsum. data_change will hold the indices at which a run of 1s starts and at which the next run of 0s starts, and so on. data_cumsum will contain the corresponding cumulative sums up to the indices stored in data_change, i.e. data_cumsum[k] = sum(data[0:data_change[k]]).
In the above example, the preprocessing results in: data_change=[8,11,18,20,31,35] and data_cumsum=[0,3,3,5,5,9]
Then if a query comes for i=12 and j=25, I will do a binary search in the sorted data_change array to find the corresponding indices for 12 and for 25, which gives the 0-based indices: bin_search(data_change, 12)=2 and bin_search(data_change, 25)=4.
Then I simply output the corresponding difference from the cumsum array: data_cumsum[4] - data_cumsum[2]. (I won't go into the detail of handling the situation where an endpoint of the query range falls in the middle of a run of 1s, but those cases can be handled with an if-statement.)
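A minimal sketch of this preprocessing idea (function names are illustrative, and the partial-run correction at the endpoints is left out, as noted above):

import bisect

def preprocess(data):
    # indices where a run of 1s starts or where the following run of 0s starts
    data_change = [i for i in range(1, len(data)) if data[i] != data[i - 1]]
    data_change.append(len(data))            # sentinel for queries past the last run
    prefix = [0]
    for x in data:
        prefix.append(prefix[-1] + x)
    # cumulative sums up to (but not including) each change index
    data_cumsum = [prefix[i] for i in data_change]
    return data_change, data_cumsum

def sum_query(data_change, data_cumsum, i, j):
    lo = bisect.bisect_left(data_change, i)
    hi = bisect.bisect_left(data_change, j)
    return data_cumsum[hi] - data_cumsum[lo]

With the example data above, preprocess gives data_change=[8,11,18,20,31,35,37], and sum_query(data_change, data_cumsum, 12, 25) returns 2.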
With linear space, linear preprocessing, constant query time, you can store an array of sums. The i'th position gets the sum of the first i elements. To get query(i,j) you take the difference of the sums (sums[j] - sums[i-1]).
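A minimal sketch of that prefix-sum idea, written with Python's half-open slice convention so that query(i, j) equals sum(data[i:j]):

from itertools import accumulate

def build_sums(data):
    # sums[k] holds the sum of the first k elements, so sums[0] == 0
    return [0] + list(accumulate(data))

def query(sums, i, j):
    # sum(data[i:j]) in O(1)
    return sums[j] - sums[i]

With the example data, query(build_sums(data), 12, 25) returns 2.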
I already gave an O(1) time, O(n) space answer. Here are some alternates that trade time for space.
1. Assuming that the number of 1s is O(log n) or better (say O(log n) for argument):
Store an array of ints representing the positions of the ones in the original array. So if the input is [1,0,0,0,1,0,1,1] then A = [0,4,6,7].
Given a query, use binary search on A for the start and end of the query in O(log(|A|)) = O(log(log(n))). If the element you're looking for isn't in A, find the smallest bigger index and the largest smaller index. E.g., for query (2,6) you'd return the indices for the 4 and the 6, which are (1,2). Then the answer is one more than the difference.
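A minimal sketch of alternative 1 with Python's bisect module (endpoints treated as inclusive, matching the example; it counts directly instead of using the "one more than the difference" bookkeeping):

import bisect

def count_ones(A, i, j):
    # A holds the sorted positions of the 1s; count those with i <= position <= j
    lo = bisect.bisect_left(A, i)    # first stored position >= i
    hi = bisect.bisect_right(A, j)   # one past the last stored position <= j
    return hi - lo

For example, count_ones([0, 4, 6, 7], 2, 6) returns 2, as in the query (2,6) above.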
2. Take advantage of knowing all the queries up front (as mentioned by the OP in a comment to my other answer). Say Q = (Q1, Q2, ..., Qm) is the set of queries.
Process the queries, storing a map of start and end indices to the query. E.g., if Q1 = (12,92) then our map would include {92 => Q1, 12 => Q1}. This takes O(m) time and O(m) space. Take note of the smallest start index and the largest end index.
Process the input data, starting with the smallest start index. Keep track of the running sum. For each index, check your map of queries. If the index is in the map, associate the current running sum with the appropriate query.
At the end, each query will have two sums associated with it. Add one to the difference to get the answer.
Worst case analysis:
O(n) + O(m) time, O(m) space. However, this is across all queries. The amortized time cost per query is O(n/m). This is the same as my constant time solution (which required O(n) preprocessing).
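A minimal sketch of alternative 2, the offline sweep (names are illustrative; queries are inclusive (i, j) pairs, and the sum is computed directly rather than by adding one to a difference):

def answer_offline(data, queries):
    # queries are inclusive (i, j) pairs, all known in advance
    starts, ends = {}, {}
    for q, (i, j) in enumerate(queries):
        starts.setdefault(i, []).append(q)
        ends.setdefault(j, []).append(q)

    results = [0] * len(queries)
    before_start = {}          # query id -> running sum just before data[i]
    running = 0
    for idx, value in enumerate(data):
        for q in starts.get(idx, ()):
            before_start[q] = running
        running += value
        for q in ends.get(idx, ()):
            results[q] = running - before_start[q]
    return results

With the example data, answer_offline(data, [(12, 24)]) returns [2].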
I would probably go with something like this:
# boilerplate testdata
from itertools import chain, permutations
data = [0,0,0,0,0,0,0,1,1,1]
chained = list(chain(*permutations(data,5))) # increase 5 to 10 if you dare
Preprocessing:
frSet = frozenset([i for i in range(len(chained)) if chained[i]==1])
"Counting":
# O(min(len(frSet), len(frozenset(range(200,500)))))
summa = frSet.intersection(frozenset(range(200,500))) # use two sets for faster intersect
counted=len(summa)
"Sanity-Check"
print(sum([1 for x in frSet if x >= 200 and x<500]))
print(summa)
print(len(summa))
No edge cases needed; the intersection will do all you need. Memory is slightly higher because you store each index rather than ranges of ones. Performance depends on the intersection implementation.
This might be helpful: https://wiki.python.org/moin/TimeComplexity#set

Hashing function to distribute over n values (with a twist)

I was wondering if there are any hashing functions to distribute input over n values. The distribution should of course be fairly uniform. But there is a twist: with small changes of n, few elements should get a new hash. Optimally it would split all k values uniformly over the n buckets, and if n increases to n+1 only k/n - k/(n+1) values from each bucket would have to move to distribute uniformly over the new range. Obviously a hash that simply creates uniform values, taken mod n, would work, but it would move a lot of hashes to fill the new node. The goal here is that as few values as possible fall into a new bucket.
Suppose 2^{n-1} < N <= 2^n. Then there is a standard trick for turning a hash function H that produces (at least) n bits into one that produces a number in the range [0, N).
Compute H(v).
Keep just the first n bits.
If that's smaller than N, stop and output it. Otherwise, start from the top with H(v) instead of v.
Some properties of this technique:
You might worry that you have to repeat the loop many times in some cases. But actually the expected number of loops is at most 2.
If you bump up N and n doesn't have to change, very few things get a new hash: only those ones that had exactly N somewhere in their chain of hashes. (Of course, identifying which elements have this property is kind of hard -- in general it may require rehashing every element!)
If you bump up N and n does have to change, about half of the elements have to be rebucketed. But this happens more and more rarely the bigger N is -- it is an amortized O(1) cost on each bump.
Edit to add an additional comment about the "have to rehash everything" requirement: One might consider modifying step 3 above to "start from the top with the first n bits of H(v)" instead. This reduces the problem with identifying which elements need to be rehashed -- since they'll be in the bucket for the hash of N -- though I'm not confident the resulting hash will have quite as good collision avoidance properties. It certainly makes the process a bit more fragile -- one would want to prove something special about the choice of H (that the bottom few bits aren't "critical" to its collision avoidance properties somehow).
Here is a simple example implementation in Python, together with a short main that shows that most strings do not move when bumping normally, and about half of strings get moved when bumping across a 2^n boundary. Forgive me for any idiosyncrasies of my code -- Python is a foreign language.
import math

def ilog2(m): return int(math.ceil(math.log(m, 2)))

def hash_into(obj, N):
    cur_hash = hash(obj)
    mask = pow(2, ilog2(N)) - 1          # keep just the low ilog2(N) bits
    while (cur_hash & mask) >= N:
        # seems Python uses the identity for its hash on integers, which
        # doesn't iterate well; let's use literally any other hash at all
        cur_hash = hash(str(cur_hash))
    return cur_hash & mask

def same_hash(obj, N, N2):
    return hash_into(obj, N) == hash_into(obj, N2)

def bump_stat(objs, N):
    # how many objects keep their bucket when N is bumped to N+1
    return len([obj for obj in objs if same_hash(obj, N, N+1)])

alphabet = [chr(x) for x in range(ord('a'), ord('z')+1)]
ascending = alphabet + [c1 + c2 for c1 in alphabet for c2 in alphabet]

def main():
    print len(ascending)
    print bump_stat(ascending, 10)
    print float(bump_stat(ascending, 16))/len(ascending)

main()
# prints:
# 702
# 639
# 0.555555555556
Well, when you add a node, you will want it to fill up, so you will actually want k/(n+1) elements to move from their old nodes to the new one.
That is easily accomplished:
Just generate a hash value for each key as you normally would. Then, to assign key k to a node in [0,N):
Let H(k) be the hash of k.
int hash = H(k);
int n;
for (n = N - 1; n > 0; --n) {
    if ((mix(hash, n) % (n + 1)) == 0) {   // key k "wins" node n with probability 1/(n+1)
        break;
    }
}
// put it in node n
So, when you add node 1, it steals half the items from node 0.
When you add node 2, it steals 1/3 of the items from the previous 2 nodes.
And so on...
EDIT: added the mix() function, to mix up the hash differently for every n -- otherwise you get non-uniformities when n is not prime.
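A minimal Python sketch of this scheme, with a hypothetical mix() standing in for the mixing function mentioned in the edit above:

def mix(h, n):
    # hypothetical mixing step: combine the key's hash with n so that each
    # level of the loop sees an effectively independent value
    return hash((h, n))

def node_for(key, N):
    # assign key to a node in [0, N)
    h = hash(key)
    for n in range(N - 1, 0, -1):
        if mix(h, n) % (n + 1) == 0:
            return n
    return 0

With this rule, bumping N to N+1 moves each key to the new node with probability 1/(N+1), which is exactly the fraction needed to fill it; every other key keeps its old node.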

How do I get an unbiased random sample from a really huge data set?

For an application I'm working on, I need to sample a small set of values from a very large data set, on the order of a few hundred taken from about 60 trillion (and growing).
Usually I use the technique of seeing if a uniform random number r (0..1) is less than S/T, where S is the number of sample items I still need, and T is the number of items in the set that I haven't considered yet.
However, with this new data, I don't have time to roll the die for each value; there are too many. Instead, I want to generate a random number of entries to "skip", pick the value at the next position, and repeat. That way I can just roll the die and access the list S times. (S is the size of the sample I want.)
I'm hoping there's a straightforward way to do that and create an unbiased sample, along the lines of the S/T test.
To be honest, approximately unbiased would be OK.
This is related (more or less a follow-on) to this person's question:
https://math.stackexchange.com/questions/350041/simple-random-sample-without-replacement
One more side question... the person who first showed this to me called it the "mailman's algorithm", but I'm not sure if he was pulling my leg. Is that right?
How about this:
precompute S random numbers from 0 to the size of your dataset.
order your numbers, low to high
store the difference between consecutive numbers as the skip size
iterate through the large dataset using the skip size above.
...The assumption being that the order in which you collect the samples doesn't matter.
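A minimal sketch of that approach (helper names are illustrative; the S positions are drawn without replacement via random.sample, then consumed in one sequential pass):

import random

def skip_sample(stream, total, s):
    # choose s distinct positions up front, sorted low to high
    positions = sorted(random.sample(range(total), s))
    it = iter(stream)
    current = -1
    for pos in positions:
        for _ in range(pos - current):   # advance by the gap to the next chosen position
            item = next(it)
        current = pos
        yield item

Against a database you would replace the inner loop with a single skip of pos - current - 1 records followed by one read.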
So I thought about it, and got some help from http://math.stackexchange.com
It boils down to this:
If I picked n items randomly all at once, where would the first one land? That is, min({r_1 ... r_n}). A helpful fellow at math.stackexchange boiled it down to this equation:
x = 1 - (1 - r) ** (1 / n)
that is, the CDF of the minimum is 1 minus (1 - x) to the nth power; set that equal to a uniform random r and solve for x. Pretty easy.
If I generate a uniform random number and plug it in for r, this is distributed the same as min({r_1 ... r_n}) -- the same way that the lowest item would fall. Voila! I've just simulated picking the first item as if I had randomly selected all n.
So I skip over that many items in the list, pick that one, and then....
Repeat until n is 0
That way, if I have a big database (like Mongo), I can skip, find_one, skip, find_one, etc. Until I have all the items I need.
The only problem I'm having is that my implementation favors the first and last element in the list. But I can live with that.
In Python 2.7, my implementation looks like:
import numpy
import pprint

def skip(n):
    """
    Produce a random number with the same distribution as
    min({r_0, ... r_n}) to see where the next smallest one is
    """
    r = numpy.random.uniform()
    return 1.0 - (1.0 - r) ** (1.0 / n)

def sample(T, n):
    """
    Take n items from a list of size T
    """
    t = T
    i = 0
    while t > 0 and n > 0:
        s = skip(n) * (t - n + 1)
        i += s
        yield int(i) % T
        i += 1
        t -= s + 1
        n -= 1

if __name__ == '__main__':
    t = [0] * 100
    for c in xrange(10000):
        for i in sample(len(t), 10):
            t[i] += 1  # this is where we would read value i
    pprint.pprint(t)

Limit for quadratic probing a hash table

I was writing a program to compare the average and maximum accesses required for linear probing, quadratic probing and separate chaining in a hash table.
I have done the element insertion part for the 3 cases. When searching for an element in the hash table, I need a limit at which to stop the search.
In the case of separate chaining, I can stop when the next pointer is null.
For linear probing, I can stop once I have probed the whole table (i.e. after size-of-table probes).
What should I use as the limit in quadratic probing? Will the table size do?
My quadratic probing function is like this
newKey = (key + i*i) % size;
where i varies from 0 to infinity. Please help me..
For such problems analyse the growth of i in two pieces:
First Interval : i goes from 0 to size-1
In this case, I don't have a solution for now. Hopefully I will update.
Second Interval : i goes from size to infinity
In this case i can be expressed as i = size + k, then
newKey = (key + i*i) % size
= (key + (size+k)*(size+k)) % size
= (key + size*size + 2*k*size + k*k) % size
= (key + k*k) % size
So it is certain that we will start probing previously probed cells once i reaches size. Therefore you only need to consider the situation where i goes from 0 to size-1, because the rest is just the same story over and over again.
What the story tells us so far: a simple analysis shows that you need to probe at most size times, because beyond size probes you start hitting the same cells again.
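A quick way to check this empirically, and to see how much of the table i = 0..size-1 actually reaches, is to enumerate the distinct offsets i*i % size; a minimal sketch:

def quadratic_coverage(size):
    # cells reachable from a given home slot by (key + i*i) % size for i in 0..size-1;
    # larger i only revisits these offsets, as shown above
    return len({(i * i) % size for i in range(size)})

For example, quadratic_coverage(11) returns 6: with a prime table size only about half the cells are reachable, which is why a probe limit of size is safe even though not every empty slot is guaranteed to be found.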
See this link. If your table size is power of 2 and you are using a reprobe function f(i)=i*(i+1)/2, you are guaranteed to traverse the entire table. If your table size is a prime number, you are guaranteed to traverse at least half of the table. In general, you can check if at some point you are back to the original point. If that happens, you need to rehash.
After doing some simulations in Excel, it appears that iterating up to i = size / 2 would be all that needs to be tested. This is when using the standard method of adding sequential perfect squares to the single-hashed position.
The answer that you can quit if a position is revisited would not allow testing of all possible positions that could be reached by the quadratic-probe method, at least not for all array sizes. (I tested array size 21 and found that i=5 revisits the same position as i=2, but i=6 yields a previously not-calculated position.)

How to design a data structure that allows one to search, insert and delete an integer X in O(1) time

Here is an exercise (3-15) in the book "Algorithm Design Manual".
Design a data structure that allows one to search, insert, and delete an integer X in O(1) time (i.e. , constant time, independent of the total number of integers stored). Assume that 1 ≤ X ≤ n and that there are m + n units of space available, where m is the maximum number of integers that can be in the table at any one time. (Hint: use two arrays A[1..n] and B[1..m].) You are not allowed to initialize either A or B, as that would take O(m) or O(n) operations. This means the arrays are full of random garbage to begin with, so you must be very careful.
I am not really seeking the answer, because I don't even understand what this exercise is asking.
From the first sentence:
Design a data structure that allows one to search, insert, and delete an integer X in O(1) time
I can easily design a data structure like that. For example:
Because 1 <= X <= n, I can just have a bit vector of n slots and let X be the index into the array: to insert, e.g., 5, set a[5] = 1; to delete 5, set a[5] = 0; to search for 5, simply return a[5], right?
I know this exercise is harder than I imagine, but what's the key point of this question?
You are basically implementing a multiset with bounded size, both in number of elements (#elements <= m), and valid range for elements (1 <= elementValue <= n).
Search: myCollection.search(x) --> return True if x inside, else False
Insert: myCollection.insert(x) --> add exactly one x to collection
Delete: myCollection.delete(x) --> remove exactly one x from collection
Consider what happens if you try to store 5 twice, e.g.
myCollection.insert(5)
myCollection.insert(5)
That is why you cannot use a bit vector. But it says "units" of space, so the elaboration of your method would be to keep a tally of each element. For example you might have [_,_,_,_,1,_,...] then [_,_,_,_,2,_,...].
Why doesn't this work however? It seems to work just fine for example if you insert 5 then delete 5... but what happens if you do .search(5) on an uninitialized array? You are specifically told you cannot initialize it, so you have no way to tell if the value you'll find in that piece of memory e.g. 24753 actually means "there are 24753 instances of 5" or if it's garbage.
NOTE: You must allow yourself O(1) initialization space, or the problem cannot be solved. (Otherwise a .search() would not be able to distinguish the random garbage in your memory from actual data, because you could always come up with random garbage which looked like actual data.) For example you might consider having a boolean which means "I have begun using my memory" which you initialize to False, and set to True the moment you start writing to your m words of memory.
If you'd like a full solution, here is the one I came up with. It's only a few lines of code, but the proofs are a bit longer:
Full solution:
Setup:
Use N words as a dispatch table: locationOfCounts is an array of size N, with values in the range [0, M]. locationOfCounts[i] is the location where the count of i would be stored, but we can only trust this value if we can prove it is not garbage.
(sidenote: This is equivalent to an array of pointers, but an array of pointers exposes you being able to look up garbage, so you'd have to code that implementation with pointer-range checks.)
To find out how many i's there are in the collection, you can look up the value counts[loc] from above. We use M words as the counts themselves: counts is an array of size M, with two values per element. The first value is the number this entry represents, and the second value is the count of that number (in the range [1,m]). For example a value of (5,2) would mean that there are 2 instances of the number 5 stored in the collection.
(M words is enough space for all the counts. Proof: We know there can never be more than M elements, therefore the worst-case is we have M counts of value=1. QED)
(We also choose to only keep track of counts >= 1, otherwise we would not have enough memory.)
Use a number called numberOfCountsStored that IS initialized to 0 but is updated whenever the number of item types changes. For example, this number would be 0 for {}, 1 for {5:[1 times]}, 1 for {5:[2 times]}, and 2 for {5:[2 times],6:[4 times]}.
value:                   1  2  3  4  5  6  7  8 ...
locationOfCounts[<N]:   [☠, ☠, ☠, ☠, 0, 1, ☠, ☠, ...]
counts[<M]:             [(5,⨯2), (6,⨯4), ☠, ☠, ☠, ☠, ..., ☠]
numberOfCountsStored:    2
Below we flush out the details of each operation and prove why it's correct:
Algorithm:
There are two main ideas: 1) we can never allow ourselves to read memory without verifying that it is not garbage first, or if we do we must be able to prove that it was garbage, 2) we need to be able to prove in O(1) time that the piece of counter memory has been initialized, with only O(1) space. To go about this, the O(1) space we use is numberOfCountsStored. Each time we do an operation, we will go back to this number to prove that everything was correct (e.g. see ★ below). The representation invariant is that we will always store counts in counts going from left to right, so numberOfCountsStored - 1 will always be the largest valid index of that array.
.search(e) -- Check loc = locationOfCounts[e]. We assume for now that the value is properly initialized and can be trusted. We proceed to check counts[loc], but first we check whether counts[loc] has been initialized: it's initialized if 0 <= loc < numberOfCountsStored (if not, the data is nonsensical so we return False). After checking that, we look up counts[loc], which gives us a (number, count) pair. If number != e, we got here by following randomized garbage (nonsensical), so we return False (again as above)... but if indeed number == e, this proves that the count is correct (★proof: numberOfCountsStored is a witness that this particular counts[loc] is valid, and counts[loc].number is a witness that locationOfCounts[number] is valid, and thus our original lookup was not garbage), so we return True.
.insert(e) -- Perform the steps in .search(e). If it already exists, we only need to increment the count by 1. However if it doesn't exist, we must tack on a new entry to the right of the counts subarray. First we increment numberOfCountsStored to reflect the fact that this new count is valid: loc = numberOfCountsStored++. Then we tack on the new entry: counts[loc] = (e,⨯1). Finally we add a reference back to it in our dispatch table so we can look it up quickly locationOfCounts[e] = loc.
.delete(e) -- Perform the steps in .search(e). If it doesn't exist, throw an error. If the count is >= 2, all we need to do is decrement the count by 1. Otherwise the count is 1, and the trick here to maintain the whole numberOfCountsStored-counts[...] invariant (i.e. everything remains stored in the left part of counts) is to perform swaps. Deleting the last copy of e loses a counts pair, leaving a hole in our array: [countPair0, countPair1, _hole_, countPair2, ..., countPair{numberOfCountsStored-1}, ☠, ☠, ☠..., ☠]. We swap this hole with the last countPair, decrement numberOfCountsStored to invalidate the hole, and update locationOfCounts[the_count_record_we_swapped.number] so it now points to the new location of that count record.
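A minimal Python sketch of that multiset (names are illustrative; the arrays are filled with None here only so the example runs, since the point of the algorithm is that it never reads a cell it cannot vouch for via numberOfCountsStored, called num_counts below):

class LazyMultiset:
    def __init__(self, n, m):
        self.location_of_counts = [None] * (n + 1)   # indexed by value 1..n
        self.counts = [None] * m                     # (value, count) pairs, left-packed
        self.num_counts = 0                          # the only initialized word

    def _valid_loc(self, x):
        loc = self.location_of_counts[x]
        if isinstance(loc, int) and 0 <= loc < self.num_counts and self.counts[loc][0] == x:
            return loc                               # the cross-check defeats garbage values
        return None

    def search(self, x):
        return self._valid_loc(x) is not None

    def insert(self, x):
        loc = self._valid_loc(x)
        if loc is not None:
            value, count = self.counts[loc]
            self.counts[loc] = (value, count + 1)
        else:
            loc = self.num_counts                    # extend the valid prefix of counts
            self.num_counts += 1
            self.counts[loc] = (x, 1)
            self.location_of_counts[x] = loc

    def delete(self, x):
        loc = self._valid_loc(x)
        if loc is None:
            raise KeyError(x)
        value, count = self.counts[loc]
        if count > 1:
            self.counts[loc] = (value, count - 1)
        else:
            last = self.num_counts - 1               # fill the hole with the last entry
            self.counts[loc] = self.counts[last]
            self.location_of_counts[self.counts[loc][0]] = loc
            self.num_counts -= 1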
Here is an idea:
treat the array B[1..m] as a stack, and make a pointer p to point to the top of the stack (let p = 0 to indicate that no elements have been inserted into the data structure). Now, to insert an integer X, use the following procedure:
p++;
A[X] = p;
B[p] = X;
Searching should be pretty easy to see here (let X' be the integer you want to search for, then just check that 1 <= A[X'] <= p, and that B[A[X']] == X'). Deleting is trickier, but still constant time. The idea is to search for the element to confirm that it is there, then move something into its spot in B (a good choice is B[p]). Then update A to reflect the pointer value of the replacement element and pop off the top of the stack (e.g. set B[p] = -1 and decrement p).
It's easier to understand the question once you know the answer: an integer is in the set if A[X]<total_integers_stored && B[A[X]]==X.
The question is really asking if you can figure out how to create a data structure that is usable with a minimum of initialization.
I first saw the idea (the one in Cameron's answer) in Jon Bentley's Programming Pearls.
The idea is pretty simple, but it's not straightforward to see why the initial random values that may be in the uninitialized arrays do not matter. This link explains the insertion and search operations pretty well. Deletion is left as an exercise, but is answered by one of the commenters:
remove-member(i):
    if not is-member(i): return
    j = dense[n-1];
    dense[sparse[i]] = j;
    sparse[j] = sparse[i];
    n = n - 1
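Putting the pieces together, here is a minimal Python sketch of the two-array set (sparse plays the role of A, dense the role of B; the arrays are zero-filled only so the example runs, and the code never relies on their initial contents):

class SparseSet:
    def __init__(self, n, m):
        self.sparse = [0] * (n + 1)   # A[1..n]: value -> slot in dense
        self.dense = [0] * m          # B[1..m]: slot -> value
        self.count = 0                # number of members, the only trusted word

    def is_member(self, x):
        slot = self.sparse[x]
        # a garbage slot either falls outside the valid prefix or points at
        # a real member that is not x, so both cases are rejected
        return 0 <= slot < self.count and self.dense[slot] == x

    def insert(self, x):
        if not self.is_member(x):
            self.dense[self.count] = x
            self.sparse[x] = self.count
            self.count += 1

    def remove(self, x):
        if not self.is_member(x):
            return
        last = self.dense[self.count - 1]   # move the last member into x's slot
        self.dense[self.sparse[x]] = last
        self.sparse[last] = self.sparse[x]
        self.count -= 1

For example: s = SparseSet(n=1000, m=10); s.insert(5); s.is_member(5) is True; s.remove(5); s.is_member(5) is False.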

Resources