4 values in an array with sum = x - algorithm

I don't need the code for this. I was wondering if someone could show how to find 4 integers in an array of size n that could equal a value x in time complexity n^2logn. I tried looking for videos but found them hard to follow. From my understanding you create a helper array of all the possible pairs? Sort them and then find a sum of x from the smaller arrays?
Eg. Array = { 1,3,4,5,6,9,4}
Can someone explain, step by step, how to check for a sum of, say, 8? I just need help visualizing the process.

# your input array 'df' and the sought sum 'x'
df = [1,3,4,5,6,9,4]
x = 12
def findSumOfFour(df, x):
    # create the dictionary of all pairs key (as sum of a pair) value (pair of indices of 'df')
    myDict = { (df[i]+df[j]):[i,j] for i in range(0,len(df)) for j in range(i,len(df)) if not i == j}
    # verify if 'x' - a key 'k' is again a key in the dictionary 'myDict',
    # if so we only need to verify that the indices differ
    for k in myDict.keys():
        if x-k in myDict.keys() and not set(myDict[k]) & set(myDict[x-k]):
            return [df[myDict[k][0]],df[myDict[k][1]],df[myDict[x-k][0]],df[myDict[x-k][1]]]
    return []
solution = findSumOfFour(df,x)
print(solution)
Note, this approach follows the description of גלעד-ברקן (comments) explained with more detail and on your example.
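Note 2, with a tree-based map the insert and lookup operations are O(log n) each, so the runtime is in O(n^2 log n). Python's dict is a hash table with O(1) average-case insert and lookup, so the code above runs in expected O(n^2) time. Because the dictionary keeps only the last pair found for each sum, it can miss a solution when the stored pairs happen to share an index.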
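A minimal sketch of the same pair-sum idea that keeps every index pair for each sum, so a valid combination is not lost when several pairs share a sum. The name find_four_with_sum is made up for this illustration. It relies on Python's hash-based dict, so the bounds are expected rather than worst-case, and with many duplicate values the inner scan can take longer than O(n^2 log n).

```python
from collections import defaultdict
from itertools import combinations


def find_four_with_sum(values, target):
    """Return four values at distinct indices that sum to target, or []."""
    # Map each pair sum to every index pair (i, j), i < j, producing it.
    pairs_by_sum = defaultdict(list)
    for i, j in combinations(range(len(values)), 2):
        pairs_by_sum[values[i] + values[j]].append((i, j))

    for pair_sum, pairs in pairs_by_sum.items():
        # .get() avoids inserting new keys while we iterate over the dict.
        partners = pairs_by_sum.get(target - pair_sum, ())
        for i, j in pairs:
            for k, m in partners:
                # The two pairs must not reuse an array position.
                if len({i, j, k, m}) == 4:
                    return [values[i], values[j], values[k], values[m]]
    return []


print(find_four_with_sum([1, 3, 4, 5, 6, 9, 4], 12))
```

For the array in the question, no four entries sum to 8 (the four smallest, 1 + 3 + 4 + 4, already give 12), which is why the answer above searches for 12 instead.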

Related

Approximation-tolerant map

I'm working with arrays of integers, all of the same size l.
I have a static set of them and I need to build a function to efficiently look them up.
The tricky part is that the elements in the array I need to search might be off by 1.
Given the arrays {A_1, A_2, ..., A_n}, and an array S, I need a function search such that:
search(S)=x iff ∀i: A_x[i] ∈ {S[i]-1, S[i], S[i]+1}.
A possible solution is treating each vector as a point in an l-dimensional space and looking for the closest point, but it'd cost something like O(l*n) in space and O(l*log(n)) in time.
Would there be a solution with a better space complexity (and/or time, of course)?
My arrays are pretty different from each other, and good heuristics might be enough.
Consider a search array S with the values:
S = [s1, s2, s3, ... , sl]
and the average value:
s̅ = (s1 + s2 + s3 + ... + sl) / l
and two matching arrays, one where every value is one greater than the corresponding value in S, and one where every value is one smaller:
A1 = [s1+1, s2+1, s3+1, ... , sl+1]
A2 = [s1−1, s2−1, s3−1, ... , sl−1]
These two arrays would have the average values:
a̅1 = (s1 + 1 + s2 + 1 + s3 + 1 + ... + sl + 1) / l = s̅ + 1
a̅2 = (s1 − 1 + s2 − 1 + s3 − 1 + ... + sl − 1) / l = s̅ − 1
So every matching array, whose values are at most 1 away from the corresponding values in the search array, has an average value that is at most 1 away from the average value of the search array.
If you calculate and store the average value of each array, and then sort the arrays based on their average value (or use an extra data structure that enables you to find all arrays with a certain average value), you can quickly identify which arrays have an average value within 1 of the search array's average value. Depending on the data, this could drastically reduce the number of arrays you have to check for similarity.
After having pre-processed the arrays and stored their average values, performing a search would mean iterating over the search array to calculate the average value, looking up which arrays have a similar average value, and then iterating over those arrays to check every value.
If you expect many arrays to have a similar average value, you could use several averages to detect arrays that are locally very different but similar on average. You could e.g. calculate these four averages:
the first half of the array
the second half of the array
the odd-numbered elements
the even-numbered elements
Analysis of the actual data should give you more information about how to divide the array and combine different averages to be most effective.
If the total sum of an array cannot exceed the integer size, you could store the total sum of each array, and check whether it is within l of the total sum of the search array, instead of using averages. This would avoid having to use floats and divisions.
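The sum-based variant could be sketched roughly as follows. The class name SumIndex and its methods are hypothetical, chosen only for this illustration; the sketch assumes all arrays have the same length l and uses a sorted list of sums with binary search as the "extra data structure".

```python
from bisect import bisect_left, bisect_right


class SumIndex:
    """Pre-filter candidate arrays by total sum, then verify element-wise."""

    def __init__(self, arrays):
        self.arrays = arrays
        # (sum, original index) pairs sorted by sum.
        self.entries = sorted((sum(a), i) for i, a in enumerate(arrays))
        self.sums = [s for s, _ in self.entries]

    def search(self, query):
        # A matching array differs by at most 1 per element, so its total
        # differs from the query's total by at most l = len(query).
        total, l = sum(query), len(query)
        lo = bisect_left(self.sums, total - l)
        hi = bisect_right(self.sums, total + l)
        return sorted(
            i for _, i in self.entries[lo:hi]
            if all(abs(a - q) <= 1 for a, q in zip(self.arrays[i], query))
        )


index = SumIndex([[1, 2, 3], [2, 3, 4], [10, 10, 10]])
print(index.search([2, 2, 3]))
```

Only the arrays whose sum falls inside the window are compared element by element, so the benefit depends on how spread out the sums are in the actual data.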
(You could expand this idea by also storing other properties which are easily calculated and don't take up much space to store, such as the highest and lowest value, the biggest jump, ... They could help create a fingerprint of each array that is near-unique, depending on the data.)
If the number of dimensions is not very small, then probably the best solution will be to build a decision tree that recursively partitions the set along different dimensions.
Each node, including the root, would be a hash table from the possible values for some dimension to either:
The list of points that match that value within tolerance, if it's small enough; or
Those same points in a similar tree partitioning on the remaining dimensions.
Since each level completely eliminates one dimension, the depth of the tree is at most L, and search takes O(L) time.
The order in which the dimensions are chosen along each path is important, of course -- the wrong choice could explode the size of the data structure, with each point appearing many times.
Since your points are "pretty different", though, it should be possible to build a tree with minimal duplication. I would try the ID3 algorithm to choose the dimensions: https://en.wikipedia.org/wiki/ID3_algorithm. That basically means you greedily choose the dimension that maximizes the overall reduction in set size, using an entropy metric.
I would personally create something like a Trie for the lookup. I said "something like" because we have up to 3 values per index that might match. So we aren't creating a decision tree, but a DAG. Where sometimes we have choices.
That is straightforward and will run (with backtracking) in maximum time O(k*l).
But here is the trick. Whenever we see a choice of matching states that we can go into next, we can create a merged state which tries all of them. We can create a few or a lot of these merged states. Each one will defer a choice by 1 step. And if we're careful to keep track of which merged states we've created, we can reuse the same one over and over again.
In theory we can be generating partial matches for somewhat arbitrary subsets of our arrays, which can grow exponentially in the number of arrays. In practice we are likely to only wind up with a few of these merged states. But still we can guarantee a tradeoff - more states up front runs faster later. So we optimize until we are done or have hit the limit of how much data we want to have.
Here is some proof of concept code for this in Python. It will likely build the matcher in time O(n*l) and match in time O(l). However it is only guaranteed to build the matcher in time O(n^2 * l^2) and match in time O(n * l).
import pprint
class Matcher:
    def __init__ (self, arrays, optimize_limit=None):
        # These are the partial states we could be in during a match.
        self.states = [{}]
        # By state, this is what we would be trying to match.
        self.state_for = ['start']
        # By combination we could try to match for, which state it is.
        self.comb_state = {'start': 0}
        for i in range(len(arrays)):
            arr = arrays[i]
            # Set up "matched the end".
            state_index = len(self.states)
            this_state = {'matched': [i]}
            self.comb_state[(i, len(arr))] = state_index
            self.states.append(this_state)
            self.state_for.append((i, len(arr)))
            for j in reversed(range(len(arr))):
                this_for = (i, j)
                prev_state = {}
                if 0 == j:
                    prev_state = self.states[0]
                # A searched value matches position j if it is within 1 of arr[j].
                matching_values = set(range(arr[j]-1, arr[j]+2))
                for v in matching_values:
                    if v in prev_state:
                        prev_state[v].append(state_index)
                    else:
                        prev_state[v] = [state_index]
                if 0 < j:
                    state_index = len(self.states)
                    self.states.append(prev_state)
                    self.state_for.append(this_for)
                    self.comb_state[this_for] = state_index
        # Theoretically optimization can take space
        # O(2**len(arrays) * len(arrays[0]))
        # We will optimize until we are done or hit a more reasonable limit.
        if optimize_limit is None:
            # Normally
            optimize_limit = len(self.states)**2
        # First we find all of the choices at the root.
        # This will be an array of arrays with format:
        # [state, key, values]
        todo = []
        for k, v in self.states[0].items():
            if 1 < len(v):
                todo.append([self.states[0], k, tuple(v)])
        while len(todo) and len(self.states) < optimize_limit:
            this_state, this_key, this_match = todo.pop(0)
            if this_key == 'matched':
                pass # We do not need to optimize this!
            elif this_match in self.comb_state:
                # Reuse the merged state; keep it in a list so match() can iterate.
                this_state[this_key] = [self.comb_state[this_match]]
            else:
                # Construct a new state that is all of these.
                new_state = {}
                for state_ind in this_match:
                    for k, v in self.states[state_ind].items():
                        if k in new_state:
                            new_state[k] = new_state[k] + v
                        else:
                            new_state[k] = v
                i = len(self.states)
                self.states.append(new_state)
                self.comb_state[this_match] = i
                self.state_for.append(this_match)
                this_state[this_key] = [i]
                for k, v in new_state.items():
                    if 1 < len(v):
                        todo.append([new_state, k, tuple(v)])
        #pp = pprint.PrettyPrinter()
        #pp.pprint(self.states)
        #pp.pprint(self.comb_state)
        #pp.pprint(self.state_for)
    def match (self, list1, ind=0, state=0):
        this_state = self.states[state]
        if 'matched' in this_state:
            return this_state['matched']
        elif list1[ind] in this_state:
            answer = []
            for next_state in this_state[list1[ind]]:
                answer = answer + self.match(list1, ind+1, next_state)
            return answer
        else:
            return []
foo = Matcher([[1, 2, 3], [2, 3, 4]])
print(foo.match([2, 2, 3]))
Please note that I deliberately set up a situation where there are 2 matches. It reports both of them. :-)
I came up with a further approach derived from Matt Timmermans's answer: building a simple decision tree that might have some arrays in multiple branches. It works even if the error in the array I'm searching is larger than 1.
The idea is the following: given the set of arrays As...
Pick an index and a pivot.
I fixed the pivot to a constant value that works well with my data, and tried all indices to find the best one. Trying multiple pivots might work better, but I didn't need to.
Partition As into two possibly-intersecting subsets, one for the arrays (whose index-th element is) smaller than the pivot, one for the larger arrays. Arrays very close to the pivot are added to both sets:
function partition( As, pivot, index ):
    return {
        As.filter( A => A[index] <= pivot + 1 ),
        As.filter( A => A[index] >= pivot - 1 ),
    }
Apply both previous steps to each subset recursively, stopping when a subset only contains a single element.
Here is an example of a possible tree generated with this algorithm (note that A2 appears in both the left and the right child of the root node):
               {A1, A2, A3, A4}
                   pivot:15
                   index:73
                  /        \
                 /          \
         {A1, A2}            {A2, A3, A4}
         pivot:7               pivot:33
         index:54              index:0
         /     \              /       \
        /       \            /         \
      A1         A2     {A2, A3}        A4
                        pivot:5
                        index:48
                        /     \
                       /       \
                     A2         A3
The search function then uses this as a normal decision tree: it starts from the root node and recurses either to the left or the right child depending on whether its value at index currentNode.index is greater or less than currentNode.pivot. It proceeds recursively until it reaches a leaf.
Once the decision tree is built, the time complexity is in the worst case O(n), but in practice it's probably closer to O(log(n)) if we choose good indices and pivots (and if the dataset is diverse enough) and find a fairly balanced tree.
The space complexity can be really bad in the worst case (O(2^n)), but it's closer to O(n) with balanced trees.
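A rough Python sketch of this build-and-search scheme, under stated assumptions: the pivot is one fixed constant, the tolerance is 1, the index is picked greedily to minimise the larger child, and a node that no index can split is kept as a multi-element leaf. The names build_tree and search_tree are invented for this illustration and are not from the post above.

```python
def build_tree(arrays, ids, pivot, tol=1):
    """Return a leaf (list of array ids) or a node (index, left, right)."""
    if len(ids) <= 1:
        return ids
    best = None
    for index in range(len(arrays[0])):
        # Arrays within tol of the pivot land in both subsets.
        left = [i for i in ids if arrays[i][index] <= pivot + tol]
        right = [i for i in ids if arrays[i][index] >= pivot - tol]
        cost = max(len(left), len(right))
        if best is None or cost < best[0]:
            best = (cost, index, left, right)
    cost, index, left, right = best
    if cost == len(ids):
        # No index shrinks the set: stop here to avoid endless recursion.
        return ids
    return (index,
            build_tree(arrays, left, pivot, tol),
            build_tree(arrays, right, pivot, tol))


def search_tree(node, query, pivot):
    """Walk down to a leaf; the ids returned still need a final check."""
    while isinstance(node, tuple):
        index, left, right = node
        node = left if query[index] <= pivot else right
    return node


arrays = [[1, 20, 3], [9, 2, 30], [25, 26, 27]]
tree = build_tree(arrays, [0, 1, 2], pivot=10)
print(search_tree(tree, [2, 21, 3], pivot=10))
```

The leaf only narrows the candidates; comparing the query against each returned array element by element is still required to confirm a match.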

Find ranges in array

I've been trying to find the optimal solution to the following (interesting?) problem that came up at work: Eventually I settled for a good enough solution but I'd like to know if there's a better one.
Let a1...an be an array of strings.
Let s1...sk be an unordered list of strings, all of them also members of the array.
The task is to find the minimum set of index ranges in a that the elements of s cover.
So for example if a = [ "x", "y", "a", "f", "c" ] and s = { "c","y","f" }, the answer would be (1;1), (3;4), assuming that the array is indexed from zero.
a is typically fairly large (hundreds of thousands of elements), while s is relatively small, typically length(s) < log(length(a)).
So the question is: can you find a time-efficient algorithm for this problem? (Space efficiency is not a concern within reasonable limits.)
Just a quick but important update: I need to perform this operation with different s values but the same a a lot. So precomputing stuff based on a is allowed, indeed it is the only way.
Build a hash table H(a) to map from element to index: ax->x in O(n) time and space. Then look up each sy in H(a) (in O(1) time on average for a total of O(k) for s) and keep track of the ranges. For that you can use an array of pair(min_index, max_index) sorted by min_index and do a binary search to either locate the range or where you should insert the new 1 element range.
So overall, the solution above would take O( n + k + k * log( nb_ranges ) ) time and O( n + nb_ranges ) space.
This is what you want, written in python:
def flattened(indexes):
    s, rest = indexes[0], indexes[1:]
    result = (s, s)
    for e in rest:
        if e == result[1] + 1:
            result = (result[0], e)
        else:
            yield result
            result = (e, e)
    yield result
a = ["x", "y", "a", "f", "c"]
s = ["c", "y", "f"]
# Create lookup table of ai to index in a
src_indexes = dict((key, i) for i, key in enumerate(a))
# Create sorted list of all indexes into a
raw_dst_indexes = sorted(src_indexes[key] for key in s)
# Convert sorted list of indexes into an array of ranges
dst_indexes = [r for r in flattened(raw_dst_indexes)]
print(dst_indexes)
I think you can throw the elements of S into a set or hashtable, anything with near O(1) to check for membership. Then just do a linear scan on A, with a flag to determine if you are currently covering elements in S, and the start position of that cover. Should be O(n + k).
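The linear scan might look something like this in Python; covered_ranges is a made-up name for this sketch, and the elements are assumed to be hashable.

```python
def covered_ranges(a, s):
    """Return inclusive (start, end) index ranges of a covered by s."""
    wanted = set(s)   # near O(1) membership checks
    ranges = []
    start = None      # start index of the run we are currently inside
    for i, item in enumerate(a):
        if item in wanted:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i - 1))
            start = None
    if start is not None:
        # The last run extends to the end of the array.
        ranges.append((start, len(a) - 1))
    return ranges


print(covered_ranges(["x", "y", "a", "f", "c"], {"c", "y", "f"}))
```

This scans all of a on every query, so it suits a one-off lookup better than the repeated-query case described in the question's update.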

product of two number in a array

Given an array of n distinct integers, find all pairs x, y in the array such that x * y = z for a given z. Do it without sorting and as efficiently as possible.
[edit] The integers are non-negative and within the range 0-65536, if that helps.
I don't want to sort because it will take a lot of time. Storage space is not an issue.
Here is linear time hash based solution:
Let hash be an array of size 65537 initialized to 0.
foreach element ele in Array
    if hash[ele] != 0 AND ele * hash[ele] == product
        print ele, product/ele
    end-if
    if ele != 0 AND product % ele == 0 AND product/ele <= 65536
        hash[product/ele] = ele
    end-if
end-foreach
There aren't any super efficient ways of doing this. The best I can think of is O(n^2):
Have an auxiliary function that takes a number (a) and a list, and goes through every element (b) checking a*b = z and saving the pair if it is.
Go through every element of your original list, and if a particular element (x) divides z (ie z % x = 0) then send x and the remainder of the list after x to the auxiliary function.
UPDATE:
I'm giving an O(n^2) solution because the question did not specify unique pairs. If only unique pairs are desired, this should be added to the question. Also, my solution assumes the order of pairs doesn't matter, which is another detail that should be clarified.
Iterate through the array. If an element x divides z (i.e. z % x == 0), check whether its other factor y = z / x exists in the hash table.
If it does, you have found a pair; otherwise add x to the hash table and continue.
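As a sketch of that hash-table pass in Python (factor_pairs is an invented name, and the sketch assumes z > 0 so that division by a factor is always defined):

```python
def factor_pairs(values, z):
    """Return pairs (y, x) of distinct array elements with y * x == z."""
    seen = set()
    pairs = []
    for x in values:
        if x != 0 and z % x == 0:
            y = z // x
            # y was seen earlier in the array, so each pair is reported once.
            if y in seen:
                pairs.append((y, x))
        # Add x only after the check so x is never paired with itself.
        seen.add(x)
    return pairs


print(factor_pairs([2, 3, 4, 6, 8, 12], 24))
```

Each element costs one set lookup and one insert, giving expected O(n) time overall.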

Compute rank of a combination?

I want to pre-compute some values for each combination in a set of combinations. For example, when choosing 3 numbers from 0 to 12, I'll compute some value for each one:
>>> for n in choose(range(13), 3):
...     print n, foo(n)
(0, 1, 2) 78
(0, 1, 3) 4
(0, 1, 4) 64
(0, 1, 5) 33
(0, 1, 6) 20
(0, 1, 7) 64
(0, 1, 8) 13
(0, 1, 9) 24
(0, 1, 10) 85
(0, 1, 11) 13
etc...
I want to store these values in an array so that given the combination, I can compute its index and get the value. For example:
>>> a = [78, 4, 64, 33]
>>> a[magic((0,1,2))]
78
What would magic be?
Initially I thought to just store it as a 3-d matrix of size 13 x 13 x 13, so I can easily index it that way. While this is fine for 13 choose 3, this would have way too much overhead for something like 13 choose 7.
I don't want to use a dict because eventually this code will be in C, and an array would be much more efficient anyway.
UPDATE: I also have a similar problem, but using combinations with repetitions, so any answers on how to get the rank of those would be much appreciated =).
UPDATE: To make it clear, I'm trying to conserve space. Each of these combinations actually indexes into something that takes up a lot of space, let's say 2 kilobytes. If I were to use a 13x13x13 array, that would be 4 megabytes, of which I only need 572 kilobytes using (13 choose 3) spots.
Here is a conceptual answer and a code based on how lex ordering works. (So I guess my answer is like that of "moron", except that I think that he has too few details and his links have too many.) I wrote a function unchoose(n,S) for you that works assuming that S is an ordered list subset of range(n). The idea: Either S contains 0 or it does not. If it does, remove 0 and compute the index for the remaining subset. If it does not, then it comes after the binomial(n-1,k-1) subsets that do contain 0.
def binomial(n,k):
    if n < 0 or k < 0 or k > n: return 0
    b = 1
    # integer division is exact here: b*(n-i) is always divisible by (i+1)
    for i in range(k): b = b*(n-i)//(i+1)
    return b
def unchoose(n,S):
    k = len(S)
    if k == 0 or k == n: return 0
    j = S[0]
    if k == 1: return j
    S = [x-1 for x in S]
    if not j: return unchoose(n-1,S[1:])
    return binomial(n-1,k-1)+unchoose(n-1,S)
def choose(X,k):
    n = len(X)
    if k < 0 or k > n: return []
    if not k: return [[]]
    if k == n: return [X]
    return [X[:1] + S for S in choose(X[1:],k-1)] + choose(X[1:],k)
(n,k) = (13,3)
# list(...) so that the slices taken in choose() can be concatenated with lists
for S in choose(list(range(n)),k): print(unchoose(n,S),S)
Now, it is also true that you can cache or hash values of both functions, binomial and unchoose. And what's nice about this is that you can compromise between precomputing everything and precomputing nothing. For instance you can precompute only for len(S) <= 3.
You can also optimize unchoose so that it adds the binomial coefficients with a loop if S[0] > 0, instead of decrementing and using tail recursion.
You can try using the lexicographic index of the combination. Maybe this page will help: http://saliu.com/bbs/messages/348.html
This MSDN page has more details: Generating the mth Lexicographical Element of a Mathematical Combination.
NOTE: The MSDN page has been retired. If you download the documentation at the above link, you will find the article on page 10201 of the pdf that is downloaded.
To be a bit more specific:
When treated as a tuple, you can order the combinations lexicographically.
So (0,1,2) < (0,1,3) < (0,1,4) etc.
Say you had the number 0 to n-1 and chose k out of those.
Now if the first element is zero, you know that it is one among the first n-1 choose k-1.
If the first element is 1, then it is one among the next n-2 choose k-1.
This way you can recursively compute the exact position of the given combination in the lexicographic ordering and use that to map it to your number.
This works in reverse too and the MSDN page explains how to do that.
Use a hash table to store the results. A decent hash function could be something like:
h(x) = (x1*p^(k - 1) + x2*p^(k - 2) + ... + xk*p^0) % pp
Where x1 ... xk are the numbers in your combination (for example (0, 1, 2) has x1 = 0, x2 = 1, x3 = 2) and p and pp are primes.
So you would store Hash[h(0, 1, 2)] = 78 and then you would retrieve it the same way.
Note: the hash table is just an array of size pp, not a dict.
I would suggest a specialised hash table. The hash for a combination should be the exclusive-or of the hashes for the values. Hashes for values are basically random bit-patterns.
You could code the table to cope with collisions, but it should be fairly easy to derive a minimal perfect hash scheme - one where no two three-item combinations give the same hash value, and where the hash-size and table-size are kept to a minimum.
This is basically Zobrist hashing - think of a "move" as adding or removing one item of the combination.
EDIT
The reason to use a hash table is that the lookup performance is O(n), where n is the number of items in the combination (assuming no collisions). Calculating lexicographical indexes into the combinations is significantly slower, IIRC.
The downside is obviously the up-front work done to generate the table.
For now, I've reached a compromise: I have a 13x13x13 array which just maps to the index of the combination, taking up 13x13x13x2 bytes = 4 kilobytes (using short ints), plus the normal-sized (13 choose 3) * 2 kilobytes = 572 kilobytes, for a total of 576 kilobytes. Much better than 4 megabytes, and also faster than a rank calculation!
I did this partly cause I couldn't seem to get Moron's answer to work. Also this is more extensible - I have a case where I need combinations with repetitions, and I haven't found a way to compute the rank of those, yet.
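A small Python sketch of this compromise, written only to make the layout concrete: a dense 13x13x13 table of small integers maps each combination to its position in a compact array. The names index_of and payload are hypothetical, and unused cells are marked with -1.

```python
from itertools import combinations

N, K = 13, 3

# Dense N x N x N table of compact indices; -1 marks unused cells.
index_of = [[[-1] * N for _ in range(N)] for _ in range(N)]
for idx, (a, b, c) in enumerate(combinations(range(N), K)):
    index_of[a][b][c] = idx

# Compact array with one slot per combination (the large records go here).
payload = [None] * (idx + 1)
payload[index_of[0][1][2]] = 78

print(len(payload), index_of[0][1][2], index_of[10][11][12])
```

The dense table grows as N**K, so this only stays cheap for small K; for something like 13 choose 7 a computed rank is the more practical route.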
What you want are called combinadics. Here's my implementation of this concept, in Python:
from math import comb as ncombs  # assumption: ncombs(n, k) is the binomial coefficient; its definition is not shown
def nthresh(k, idx):
    """Finds the largest value m such that C(m, k) <= idx."""
    mk = k
    while ncombs(mk, k) <= idx:
        mk += 1
    return mk - 1
def idx_to_set(k, idx):
    ret = []
    for i in range(k, 0, -1):
        element = nthresh(i, idx)
        ret.append(element)
        idx -= ncombs(element, i)
    return ret
def set_to_idx(input):
    ret = 0
    for k, ck in enumerate(sorted(input)):
        ret += ncombs(ck, k + 1)
    return ret
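The question's update also asks about combinations with repetition. One common reduction, which is not part of the answer above, is to add i to the i-th smallest element: that turns a multiset into an ordinary strictly increasing combination, which can then be ranked with the same combinadic sum. A hedged sketch, with rank_combination and rank_multiset as illustrative names and math.comb (Python 3.8+) standing in for the binomial coefficient:

```python
from itertools import combinations, combinations_with_replacement
from math import comb


def rank_combination(combo):
    """Combinadic rank of a combination of distinct non-negative integers."""
    return sum(comb(c, i + 1) for i, c in enumerate(sorted(combo)))


def rank_multiset(combo):
    """Rank of a combination with repetition via the shift c_i -> c_i + i."""
    return rank_combination(c + i for i, c in enumerate(sorted(combo)))


# Every rank from 0 to C(13, 3) - 1 is hit exactly once.
ranks = sorted(rank_combination(c) for c in combinations(range(13), 3))
print(ranks == list(range(comb(13, 3))))

# Choosing 2 of 4 values with repetition gives C(5, 2) = 10 multisets.
multi = sorted(rank_multiset(c)
               for c in combinations_with_replacement(range(4), 2))
print(multi == list(range(10)))
```

Note that this ranking follows the combinadic order, which is not the lexicographic order used in the recursive answer earlier; either works as an array index as long as it is used consistently.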
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration and it does not use very much memory. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
It should not be hard to convert this class to C++.

How can I randomly iterate through a large Range?

I would like to randomly iterate through a range. Each value will be visited only once and all values will eventually be visited. For example:
class Array
  def shuffle
    ret = dup
    j = length
    i = 0
    while j > 1
      r = i + rand(j)
      ret[i], ret[r] = ret[r], ret[i]
      i += 1
      j -= 1
    end
    ret
  end
end
(0..9).to_a.shuffle.each{|x| f(x)}
where f(x) is some function that operates on each value. A Fisher-Yates shuffle is used to efficiently provide random ordering.
My problem is that shuffle needs to operate on an array, which is not cool because I am working with astronomically large numbers. Ruby will quickly consume a large amount of RAM trying to create a monstrous array. Imagine replacing (0..9) with (0..99**99). This is also why the following code will not work:
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
  x = rand(bigint)
  redo if tried[x]
  tried[x] = true
  f(x) # some function
}
This code is very naive and quickly runs out of memory as tried obtains more entries.
What sort of algorithm can accomplish what I am trying to do?
[Edit1]: Why do I want to do this? I'm trying to exhaust the search space of a hash algorithm for a N-length input string looking for partial collisions. Each number I generate is equivalent to a unique input string, entropy and all. Basically, I'm "counting" using a custom alphabet.
[Edit2]: This means that f(x) in the above examples is a method that generates a hash and compares it to a constant, target hash for partial collisions. I do not need to store the value of x after I call f(x) so memory should remain constant over time.
[Edit3/4/5/6]: Further clarification/fixes.
[Solution]: The following code is based on @bta's solution. For the sake of conciseness, next_prime is not shown. It produces acceptable randomness and only visits each number once. See the actual post for more details.
N = size_of_range
Q = ( 2 * N / (1 + Math.sqrt(5)) ).to_i.next_prime
START = rand(N)
x = START
nil until f( x = (x + Q) % N ) == START # assuming f(x) returns x
I just remembered a similar problem from a class I took years ago; that is, iterating (relatively) randomly through a set (completely exhausting it) given extremely tight memory constraints. If I'm remembering this correctly, our solution algorithm was something like this:
Define the range to be from 0 to some number N
Generate a random starting point x[0] inside N
Generate an iterator Q less than N
Generate successive points x[n] by adding Q to the previous point and wrapping around if needed. That is, x[n+1] = (x[n] + Q) % N
Repeat until you generate a new point equal to the starting point.
The trick is to find an iterator that will let you traverse the entire range without generating the same value twice. If I'm remembering correctly, any relatively prime N and Q will work (the closer the number to the bounds of the range the less 'random' the input). In that case, a prime number that is not a factor of N should work. You can also swap bytes/nibbles in the resulting number to change the pattern with which the generated points "jump around" in N.
This algorithm only requires the starting point (x[0]), the current point (x[n]), the iterator value (Q), and the range limit (N) to be stored.
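A minimal Python sketch of these steps, assuming the step Q is simply drawn at random until it is coprime to N; the function name full_cycle and the seed parameter are made up for this illustration.

```python
import math
import random


def full_cycle(n, seed=None):
    """Yield every integer in range(n) exactly once, in a scattered order."""
    rng = random.Random(seed)
    start = rng.randrange(n)
    # The walk visits all n values only if the step is coprime to n.
    step = 1
    if n > 1:
        step = rng.randrange(1, n)
        while math.gcd(step, n) != 1:
            step = rng.randrange(1, n)
    x = start
    while True:
        yield x
        x = (x + step) % n
        if x == start:
            break


print(list(full_cycle(10, seed=1)))
```

Only start, step, n and the current value are kept, so memory use does not grow with the size of the range. The order is a fixed arithmetic progression modulo n, so it is scattered rather than statistically random.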
Perhaps someone else remembers this algorithm and can verify if I'm remembering it correctly?
As @Turtle answered, your problem doesn't have a solution. The solutions from @KandadaBoggu and @bta give you numbers in some ranges that may or may not be random. You get clusters of numbers.
But I don't know why you care about double occurrence of the same number. If (0..99**99) is your range, then if you could generate 10^10 random numbers per second (if you have a 3 GHz processor and about 4 cores on which you generate one random number per CPU cycle, which is impossible, and Ruby will slow it down a lot), then it would take about 10^180 years to exhaust all the numbers. You also have a probability of about 10^-180 that two identical numbers will be generated during a whole year. Our universe is probably about 10^10 years old, so if your computer could have started calculating when time began, you would have a probability of about 10^-170 that two identical numbers were generated. In other words, it is practically impossible and you don't have to care about it.
Even if you would use Jaguar (top 1 from www.top500.org supercomputers) with only this one task, you still need 10^174 years to get all numbers.
If you don't believe me, try
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
  x = rand(bigint)
  puts "Oh, no!" if tried[x]
  tried[x] = true
}
I'll buy you a beer if you will even once see "Oh, no!" on your screen during your life time :)
I could be wrong, but I don't think this is doable without storing some state. At the very least, you're going to need some state.
Even if you only use one bit per value (has this value been tried yes or no) then you will need X/8 bytes of memory to store the result (where X is the largest number). Assuming that you have 2GB of free memory, this would only let you track about 17 billion numbers (2^34 bits).
Break the range in to manageable batches as shown below:
def range_walker range, batch_size = 100
  size = (range.end - range.begin) + 1
  n = size/batch_size
  n.times do |i|
    x = i * batch_size + range.begin
    y = x + batch_size
    (x...y).sort_by{rand}.each{|z| p z}
  end
  d = (range.end - size%batch_size + 1)
  (d..range.end).sort_by{rand}.each{|z| p z }
end
You can further randomize solution by randomly choosing the batch for processing.
PS: This is a good problem for map-reduce. Each batch can be worked by independent nodes.
Reference:
Map-reduce in Ruby
you can randomly iterate an array with shuffle method
a = [1,2,3,4,5,6,7,8,9]
a.shuffle!
=> [5, 2, 8, 7, 3, 1, 6, 4, 9]
You want what's called a "full cycle iterator"...
Here is pseudocode for the simplest version, which is perfect for most uses...
function fullCycleStep(sample_size, last_value, random_seed = 31337, prime_number = 32452843) {
    if last_value = null then last_value = random_seed % sample_size
    return (last_value + prime_number) % sample_size
}
If you call this like so:
sample = 10
For i = 1 to sample
last_value = fullCycleStep(sample, last_value)
print last_value
next
It would generate random numbers, looping through all 10, never repeating If you change random_seed, which can be anything, or prime_number, which must be greater than, and not be evenly divisible by sample_size, you will get a new random order, but you will still never get a duplicate.
Database systems and other large-scale systems do this by writing the intermediate results of recursive sorts to a temp database file. That way, they can sort massive numbers of records while only keeping limited numbers of records in memory at any one time. This tends to be complicated in practice.
How "random" does your order have to be? If you don't need a specific input distribution, you could try a recursive scheme like this to minimize memory usage:
def gen_random_indices
  # Assume your input range is (0...(10**3)), i.e. 0 through 999
  (0..9).sort_by{rand}.each do |a|
    (0..9).sort_by{rand}.each do |b|
      (0..9).sort_by{rand}.each do |c|
        yield "#{a}#{b}#{c}".to_i
      end
    end
  end
end
gen_random_indices do |idx|
  run_test_with_index(idx)
end
Essentially, you are constructing the index by randomly generating one digit at a time. In the worst-case scenario, this will require enough memory to store 10 * (number of digits). You will encounter every number in the range (0..(10**3)) exactly once, but the order is only pseudo-random. That is, if the first loop sets a=1, then you will encounter all three-digit numbers of the form 1xx before you see the hundreds digit change.
The other downside is the need to manually construct the function to a specified depth. In your (0..(99**99)) case, this would likely be a problem (although I suppose you could write a script to generate the code for you). I'm sure there's probably a way to re-write this in a state-ful, recursive manner, but I can't think of it off the top of my head (ideas, anyone?).
[Edit]: Taking into account @klew and @Turtle's answers, the best I can hope for is batches of random (or close to random) numbers.
This is a recursive implementation of something similar to KandadaBoggu's solution. Basically, the search space (as a range) is partitioned into an array containing N equal-sized ranges. Each range is fed back in a random order as a new search space. This continues until the size of the range hits a lower bound. At this point the range is small enough to be converted into an array, shuffled, and checked.
Even though it is recursive, I haven't blown the stack yet. Instead, it errors out when attempting to partition a search space larger than about 10^19 keys. It has to do with the numbers being too large to convert to a long. It can probably be fixed:
# partition a range into an array of N equal-sized ranges
def partition(range, n)
  ranges = []
  first = range.first
  last = range.last
  length = last - first + 1
  step = length / n # integer division
  ((first + step - 1)..last).step(step) do |i|
    ranges << (first..i)
    first = i + 1
  end
  # append any leftover elements onto the last range
  ranges[-1] = (ranges[-1].first)..last if ranges[-1].last < last
  ranges
end
I hope the code comments help shed some light on my original question.
pastebin: full source
Note: PW_LEN under # options can be changed to a lower number in order to get quicker results.
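For readers without the full source, the partition-and-recurse scheme described above could be sketched roughly as follows. This is not the code from the pastebin: `each_in_shuffled_blocks` and its parameters are made-up names, it does its own splitting with ceiling division, and it assumes an inclusive integer range, `n >= 2` and `min_size >= 1`.

```ruby
# Hypothetical sketch: split `range` into at most `n` sub-ranges, visit the
# sub-ranges in random order, and recurse until a sub-range holds at most
# `min_size` elements. Small sub-ranges are expanded to an array and shuffled.
def each_in_shuffled_blocks(range, n, min_size, &block)
  first = range.first
  last = range.last
  length = last - first + 1
  if length <= min_size
    (first..last).to_a.shuffle.each(&block)
    return
  end
  step = (length + n - 1) / n            # ceiling division: at most n pieces
  starts = first.step(last, step).to_a   # first element of each piece
  starts.shuffle.each do |s|
    piece = s..[s + step - 1, last].min
    each_in_shuffled_blocks(piece, n, min_size, &block)
  end
end

seen = []
each_in_shuffled_blocks(0..999, 4, 16) { |i| seen << i }
puts seen.size
```

Only the start points of the current level's pieces and one small shuffled array are held in memory at a time, which is what makes this usable on ranges far too large to materialize.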
For a prohibitively large space, like
space = -10..1000000000000000000000
You can add this method to Range.
class Range
  # 2**127 - 1, a Mersenne prime used as the default step
  M127 = 170_141_183_460_469_231_731_687_303_715_884_105_727
  def each_random(seed = 0)
    # pass the seed along so the returned enumerator uses it too
    return to_enum(__method__, seed) { size } unless block_given?
    unless first.kind_of? Integer
      raise TypeError, "can't randomly iterate from #{first.class}"
    end
    sample_size = self.end - first + 1
    sample_size -= 1 if exclude_end?
    # a step coprime to sample_size visits every offset exactly once
    j = coprime sample_size
    v = seed % sample_size
    each do
      v = (v + j) % sample_size
      yield first + v
    end
  end
  protected
  def gcd(a,b)
    b == 0 ? a : gcd(b, a % b)
  end
  # smallest z >= M127 that shares no factor with a
  def coprime(a, z = M127)
    gcd(a, z) == 1 ? z : coprime(a, z + 1)
  end
end
You could then
space.each_random { |i| puts i }
729815750697818944176
459631501395637888351
189447252093456832526
919263002791275776712
649078753489094720887
378894504186913665062
108710254884732609237
838526005582551553423
568341756280370497598
298157506978189441773
27973257676008385948
757789008373827330134
487604759071646274309
217420509769465218484
947236260467284162670
677052011165103106845
406867761862922051020
136683512560740995195
866499263258559939381
596315013956378883556
326130764654197827731
55946515352016771906
785762266049835716092
515578016747654660267
...
The order looks reasonably random so long as your space is a few orders of magnitude smaller than M127.
Credit to @nick-steele and @bta for the approach.
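The reason every element is visited exactly once is that repeatedly adding a step that is coprime to the size of the space cycles through all residues before repeating. A tiny standalone sketch of that property, using a made-up helper `each_strided` and a deliberately small space:

```ruby
# Visit 0...n exactly once by stepping with a stride that is coprime to n.
# `each_strided` is a hypothetical name used only for this illustration.
def each_strided(n, stride, start = 0)
  raise ArgumentError, "stride must be coprime to n" unless n.gcd(stride) == 1
  v = start % n
  n.times do
    v = (v + stride) % n
    yield v
  end
end

order = []
each_strided(10, 7) { |v| order << v }
p order  # => [7, 4, 1, 8, 5, 2, 9, 6, 3, 0]
```

The small example also shows the limitation: consecutive outputs always differ by the same amount modulo n, so the sequence is a scrambled-looking arithmetic progression rather than a true shuffle.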
This isn't really a Ruby-specific answer but I hope it's permitted. Andrew Kensler gives a C++ "permute()" function that does exactly this in his "Correlated Multi-Jittered Sampling" report.
As I understand it, the exact function he provides really only works if your "array" is up to size 2^27, but the general idea could be used for arrays of any size.
I'll do my best to sort of explain it. The first part is you need a hash that is reversible "for any power-of-two sized domain". Consider x = i + 1. No matter what x is, even if your integer overflows, you can determine what i was. More specifically, you can always determine the bottom n bits of i from the bottom n bits of x. Addition is a reversible hash operation, as is multiplication by an odd number, as is doing a bitwise xor by a constant. If you know a specific power-of-two domain, you can scramble bits in that domain. E.g. x ^= (x & 0xFF) >> 5 is valid for the 8-bit domain. You can specify that domain with a mask, e.g. mask = 0xFF, and your hash function becomes x = hash(i, mask). Of course you can add a "seed" value into that hash function to get different randomizations. Kensler lays out more valid operations in the paper.
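To make "reversible" concrete, here is a small Ruby check that chains the operations listed above and confirms that no two inputs collide on an 8-bit domain. The function `toy_hash` and its constants are invented for this illustration; it is not Kensler's hash.

```ruby
MASK = 0xFF # 8-bit domain, chosen only to keep the exhaustive check small

# Toy mixing function built from the reversible steps described above.
# The constants are arbitrary; this is NOT the hash from the paper.
def toy_hash(i, mask, seed)
  x = i & mask
  x = (x + seed) & mask   # addition
  x = (x * 0x2D) & mask   # multiplication by an odd constant
  x ^= x >> 5             # xor with a right-shifted copy of itself
  x ^= 0xA7 & mask        # xor with a constant
  x
end

outputs = (0..MASK).map { |i| toy_hash(i, MASK, 42) }
puts outputs.uniq.size == MASK + 1  # every input maps to a distinct output
```

Ruby integers do not overflow, so the explicit `& mask` after each step stands in for the wrap-around a fixed-width integer would give for free.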
So you have a reversible function x = hash(i, mask, seed). The problem is that if you hash your index, you might end up with a value that is larger than your array size, i.e. your "domain". You can't just modulo this or you'll get collisions.
The reversible hash is the key to using a technique called "cycle walking", introduced in "Ciphers with Arbitrary Finite Domains". Because the hash is reversible (i.e. 1-to-1), you can just repeatedly apply the same hash until your hashed value is smaller than your array! Because you're applying the same hash, and the mapping is one-to-one, whatever value you end up on will map back to exactly one index, so you don't have collisions. So your function could look something like this for 32-bit integers (pseudocode):
fun permute(i, length, seed) {
    i = hash(i, 0xFFFFFFFF, seed)
    while(i >= length): i = hash(i, 0xFFFFFFFF, seed)
    return i
}
It could take a lot of hashes to get to your domain, so Kensler does a simple trick: he keeps the hash within the domain of the next power of two, which makes it require very few iterations (~2 on average), by masking out the unnecessary bits. The final algorithm looks like this:
fun next_pow_2(length) {
    # This implementation is for clarity.
    # See Kensler's paper for one way to do it fast.
    p = 1
    while (p < length): p *= 2
    return p
}
fun permute(i, length, seed) {
    # mask selects the bits of the smallest power of two >= length
    mask = next_pow_2(length)-1
    i = hash(i, mask, seed) & mask
    # cycle-walk until the value lands inside the array
    while(i >= length): i = hash(i, mask, seed) & mask
    return i
}
And that's it! Obviously the important thing here is choosing a good hash function, which Kensler provides in the paper but I wanted to break down the explanation. If you want to have different random permutations each time, you can add a "seed" value to the permute function which then gets passed to the hash function.
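Since the question was about Ruby, the pseudocode above could be translated roughly like this. It is a sketch under assumptions: `demo_hash` is a made-up stand-in built from reversible steps (add, multiply by an odd constant, xor-shift) and is much weaker than the hash in Kensler's paper, so treat the output quality as illustrative only.

```ruby
# Stand-in reversible hash on the domain 0..mask. NOT Kensler's hash;
# the constants are arbitrary and only need to keep each step invertible.
def demo_hash(i, mask, seed)
  x = i & mask
  x = (x + seed) & mask
  x = (x * 0x9E3779B1) & mask  # odd multiplier
  x ^= x >> 3
  x
end

def next_pow_2(length)
  pow = 1
  pow *= 2 while pow < length
  pow
end

# Maps each index in 0...length to a distinct index in 0...length.
def permute(i, length, seed)
  mask = next_pow_2(length) - 1
  i = demo_hash(i, mask, seed)
  # cycle walking: re-hash until the value falls inside the array
  i = demo_hash(i, mask, seed) while i >= length
  i
end

length = 1000
shuffled = (0...length).map { |i| permute(i, length, 12345) }
puts shuffled.sort == (0...length).to_a  # each index appears exactly once
```

Nothing is stored between calls, so `permute(i, length, seed)` can be evaluated for any single `i` on demand, which is the property that makes this attractive for ranges too large to shuffle in memory.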

Resources