Implementing an algorithm? - algorithm

I have to write a small program to implement the following algorithm:
Assume you have a search algorithm which, at each level of recursion, excludes half of the data from consideration when searching for a specific data item. Search stops only when one data item is left. How many levels of recursion are required when the number of elements in the data is 1024?
Does anybody have an idea how to analyze this, or any suggestion on how to start?

You need to find the minimal value of d such that:
1 * 2 * 2 * 2 * ... * 2 = 1024    (with a total of d factors of 2)
The above is true because each multiplication by 2 corresponds to one level up in the recursion: you go up from the stop clause of 1 element until you reach the initial data size, which is 1024.
The above equation is actually 2^d = 1024
And it is solved easily by taking log_2 of both sides:
log_2(2^d) = log_2(1024)
d = 10
P.S. Note that the above is the number of recursive calls, exclusive of the initial call, so total number of calls to the method is d+1=11, one from the calling environment, and 10 from the method itself.
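As a quick check of the arithmetic, here is a minimal Python sketch that counts the halvings directly. The function name recursion_levels is made up for illustration, and the sketch assumes each level keeps exactly half of the data (integer division).

```python
import math


def recursion_levels(n):
    """Count how many times n items must be halved until one item is left."""
    if n <= 1:
        # Stop clause: a single remaining item needs no further recursion.
        return 0
    # One more level of recursion, then continue on the remaining half.
    return 1 + recursion_levels(n // 2)


print(recursion_levels(1024))   # 10 recursive levels
print(math.log2(1024))          # 10.0, the closed-form answer
```

Adding the initial call from the calling environment gives the 11 total calls mentioned above.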

Related

safe array partition based on some criteria

I am trying to solve this problem. It can be summarized as:
Given a sequence of integers, find the number of safe partitions, where a safe partition is defined as follows:
A safe partition is a partition into subsequences S1,S2,…,SK such that for each valid i, min(Si)≤|Si|≤max(Si); that is, for each subsequence in this partition, its length is greater than or equal to its smallest element and smaller than or equal to its largest element.
Ex:
Input => 1 6 2 3 4 3 4
Output => 6 partitions
[1],[6,2,3,4,3,4]
[1,6,2],[3,4,3,4]
[1,6,2,3],[4,3,4]
[1],[6,2],[3,4,3,4]
[1],[6,2,3],[4,3,4]
[1,6],[2,3],[4,3,4]
I can probably find a solution somewhere on the internet, including code, but I am more interested in the approach to solving this problem, so I am asking here which points I am missing in my observations.
These are the things that come to mind when I read this problem:
If an element at index i extends a sequence safely, it is quite possible that it could also be the start of a new sequence. So at every element I am left with two choices: whether it extends the sequence or not.
So I think it can be represented mathematically as:
P(0..N) = 1 + P(i..N) + P(i+1..N), if A[i] is safe to extend the current partition
P(0..N) = 1 + P(i..N), if A[i] can't be used to extend it
where P is the partition function.
Is this reasoning valid? Am I missing something?
[I'm having trouble giving a direction without actually giving the solution, because once a person thinks in the right direction then the solution becomes evident. I'll try to highlight some facts which may put a person on the right track.]
Explicitly enumerating safe partitions is problematic, since there can be O(2^n) safe partitions. For example, consider:
1,N,1,N,1,N ... [N elements]
In this sequence, every subsequence of length > 1, as well as the subsequence [1], satisfies the criteria. The number of safe partitions for such a sequence of length n=2k is 3^(k-1). To prove that, let f(n) be the number of safe partitions of the first n elements and look at the following:
Base k = 1: f(1) = f(2) = 1
Step assumption: f(2j-1) = f(2j) = 3^(j-1) for all j <= k.
f(2k+1) = f(2k+2) = (f(2k) + f(2k-1)) + (f(2k-2) + f(2k-3)) + ... + (f(2) + f(1)) + 1
= 2*(f(2k) + f(2k-2) + ... + f(2)) + 1
= 2 * (3^(k-1) + 3^(k-2) + ... + 1) + 1
= 2 * (3^k - 1) / 2 + 1
= 3^k
Since enumeration is out of the question for any reasonable performance, the solution must somehow count without iterating over every partition. Since the proof that 1,N,...,1,N has 3^(k-1) safe partitions did not have to explicitly enumerate them, its principles can be generalized to any sequence.
NOTES:
I have solved similar problems before, so the direction was clear to me. For this question I tried to break my thoughts into something manageable and came up with the thought about complexity. I had a very strong feeling that this is exponential even before writing it down and trying to prove it. This comes from experience and from seeing other problems. The complexity function felt worse than Fibonacci because adding an element to a sequence seemed to be adding at least two elements of smaller sizes (similar to the Fibonacci sequence). Since Fibonacci is exponential, the 1,N,...,1,N partitioning must be exponential too. From there I went on and analyzed it with a recurrence relation.
The exact way I reached the solution matches my way of thought. Everybody has a different way of thought that works for them, and they need to develop and find it.
This is how I came to suspect that the number of safe partitions in the example was 3^(k-1):
I recursively calculated f(2k), with base condition f(1)=f(2)=1. Then for 3:
[1,N,1]
[1],[N,1]
[1,N],[1]
And for 4:
[1,N,1,N]
[1],[N,1,N]
[1,N],[1,N]
Meaning f(3)=f(4)=3. Then I recursively applied
f(2k+2)=2*(f(2k) + f(2k-2) + .. + f(2)) + 1
resulting in f(2)=1, f(4)=3, f(6)=9, f(8)=27. This looks suspiciously like 3^(k-1). Then I simply had to prove it by induction.
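To check such counts without listing partitions, one option is to sum over the possible last parts, exactly as in the recurrence for f. The following is only a sketch under that assumption: a straightforward O(n^2) Python count (the name count_safe_partitions is made up here), not necessarily the efficient solution the original problem expects.

```python
def count_safe_partitions(a):
    """Count partitions of a into contiguous parts with min <= length <= max."""
    n = len(a)
    # ways[i] = number of safe partitions of the prefix a[:i]
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty prefix has exactly one (empty) partition
    for end in range(1, n + 1):
        lo = hi = a[end - 1]
        # Grow the last part leftwards, tracking its min and max.
        for start in range(end - 1, -1, -1):
            lo = min(lo, a[start])
            hi = max(hi, a[start])
            length = end - start
            if lo <= length <= hi:
                # a[start:end] is a safe last part; count the ways to split the rest.
                ways[end] += ways[start]
    return ways[n]


print(count_safe_partitions([1, 6, 2, 3, 4, 3, 4]))  # 6, as in the question
print(count_safe_partitions([1, 8] * 4))             # 27 = 3^(4-1)
```

The second call uses the alternating 1,N,1,N,... sequence with N = 8 and length 2k = 8, which agrees with f(8) = 27.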

Best way to distribute a given resource (eg. budget) for optimal output

I am trying to find a solution in which a given resource (e.g. a budget) is distributed in the best way among different options, each of which yields a different result depending on the amount of resource it is given.
Let's say I have N = 1200 and some functions. (a, b, c, d are some unknown variables)
f1(x) = a * x
f2(x) = b * x^c
f3(x) = a*x + b*x^2 + c*x^3
f4(x) = d^x
f5(x) = log x^d
...
Also, let's say there are n of these functions, each yielding a different result based on its input x, where x = 0 or x >= m, and m is a constant.
Although I am not able to find an exact formula for the given functions, I am able to find their output. This means that I can compute:
X = f1(N1) + f2(N2) + f3(N3) + ... + fn(Nn), where N1 + ... + Nn = N,
for every way of distributing N into n numbers, and find the specific case where X is the greatest.
How would I actually go about finding the best distribution of N with the least computational power, using whatever libraries are currently available?
If you are happy with allocations constrained to be whole numbers, then there is a dynamic programming solution that fills a table of O(Nn) entries; since each entry considers up to N splits, the running time is roughly O(n N^2). You can increase accuracy by scaling if you want, but this will increase CPU time.
For each i=1 to n, maintain an array where element j gives the maximum yield using only the first i functions, giving them a total allowance of j.
For i=1 this is simply the result of f1().
For i=k+1, when working out the result for j, consider each possible way of splitting j units between f_{k+1}() and the table that tells you the best return from a distribution among the first k functions. This way you can calculate the table for i=k+1 using the table created for k.
At the end you get the best possible return for n functions and N resources. It is easier to find out how that best answer is achieved if you maintain a set of arrays telling you the best way to distribute k units among the first i functions, for all possible values of i and k. Then (with n = 100, say) you can look up the best allocation for f100(), subtract the amount allocated to f100() from N, look up the best allocation for f99() given the resulting resources, and carry on like this until you have worked out the best allocations for all f().
As an example suppose f1(x) = 2x, f2(x) = x^2 and f3(x) = 3 if x>0 and 0 otherwise. Suppose we have 3 units of resource.
The first table is just f1(x) which is 0, 2, 4, 6 for 0,1,2,3 units.
The second table is the best you can do using f1(x) and f2(x) for 0,1,2,3 units and is 0, 2, 4, 9, switching from f1 to f2 at x=2.
The third table is 0, 3, 5, 9. I can get 3 and 5 by using 1 unit for f3() and the rest for the best solution in the second table. 9 is simply the best solution in the second table: there is no better solution using 3 resources that gives any of them to f3().
So 9 is the best answer here. One way to work out how to get there is to keep the tables around and recalculate that answer. 9 comes from f3(0) + 9 from the second table so all 3 units are available to f2() + f1(). The second table 9 comes from f2(3) so there are no units left for f1() and we get f1(0) + f2(3) + f3(0).
When you are working out the resources to use at stage i=k+1, you have a table from i=k that tells you exactly the result to expect from the resources you have left over after you have decided to use some at stage i=k+1. The best distribution does not become incorrect, because at stage i=k you have worked out the result for the best distribution given every possible number of remaining resources.
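The table-building and back-tracking steps above can be sketched in Python roughly as follows. The function name best_allocation and the choice table are illustrative assumptions, and the sketch ignores the question's extra "x = 0 or x >= m" restriction (it could be added by skipping the disallowed values of x).

```python
def best_allocation(funcs, total):
    """Split `total` whole units among funcs to maximise the summed yield.

    Returns (best_value, allocation), where allocation[i] is the number of
    units given to funcs[i].
    """
    # best[j] = best yield from the functions handled so far, using j units.
    best = [funcs[0](j) for j in range(total + 1)]
    # choice[i][j] = units given to funcs[i] in the best split of j units
    # among funcs[0..i]; with only one function it receives everything.
    choice = [list(range(total + 1))]
    for f in funcs[1:]:
        new_best, new_choice = [], []
        for j in range(total + 1):
            # Try giving x units to f and the remaining j - x to earlier functions.
            value, x = max((f(x) + best[j - x], x) for x in range(j + 1))
            new_best.append(value)
            new_choice.append(x)
        best = new_best
        choice.append(new_choice)
    # Walk the tables backwards to recover the allocation.
    allocation = [0] * len(funcs)
    remaining = total
    for i in range(len(funcs) - 1, -1, -1):
        allocation[i] = choice[i][remaining]
        remaining -= allocation[i]
    return best[total], allocation


# The worked example: f1(x) = 2x, f2(x) = x^2, f3(x) = 3 if x > 0 else 0.
example = [lambda x: 2 * x, lambda x: x * x, lambda x: 3 if x > 0 else 0]
print(best_allocation(example, 3))  # (9, [0, 3, 0])
```

The result matches the hand calculation: all 3 units go to f2(), giving f1(0) + f2(3) + f3(0) = 9.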

A special sample method in Map-Reduce implementation

I have a table with roughly 4*10^8 records, and I want to get a sample of exactly 4*10^6 records from it.
But my way of getting the sample is somewhat special:
1. I select 1 record from the 4*10^8 records at random (every record has the same probability of being selected).
2. I repeat step 1 4*10^6 times (it does not matter if one record is selected multiple times).
I came up with a method to solve this:
1. Generate a table A(num int) in which every record holds a single number, a random integer from 1 to n (n is the size of my original table, roughly 4*10^8 as mentioned above).
2. Load table A as a resource file into every map task, and if the ordinal number of the record currently being processed is in table A, output this record; otherwise discard it.
I think my method is not so good, because if I want to sample more records from the original table, table A will become very large and can't be loaded as a resource file.
So, could anyone please suggest an elegant algorithm?
I'm not sure what "elegant" means, but perhaps you're interested in something analogous to reservoir sampling. Let k be the size of the sample and initialize a k-element array with nulls. The elements from which we are sampling arrive one by one. When the jth (counting from 1) element arrives, we iterate through the array and, for each cell, replace its contents by the current element independently with probability 1/j.
Naively, the running time is pretty bad -- to sample k elements from n with replacement costs O(k n). The number of writes into the array, however, is O(k log n) in expectation, because later elements in the stream rarely result in writes. Here's an efficient method based on the exponential distribution (warning: lightly tested Python ahead). The running time is O(n + k log n).
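import math
import random

def sample_from(population, k):
    # Draws k items uniformly with replacement from an iterable in one pass.
    for i, x in enumerate(population):
        if i == 0:
            # The first element fills every cell.
            sample = [x] * k
        else:
            # Element i + 1 must replace each cell with probability 1 / (i + 1).
            # Simulate this with a Poisson process: start at k * log(1 - 1/(i + 1)) < 0
            # and add Exp(1) gaps; every gap that keeps t below 0 is one write.
            t = float(k) * math.log(1.0 - 1.0 / float(i + 1))
            while True:
                t -= math.log(1.0 - random.random())
                if t >= 0.0:
                    break
                sample[random.randrange(k)] = x
    return sample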
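For comparison, the naive O(k n) version described in the first paragraph of this answer might look like the sketch below. The function name is made up for illustration; it is meant as a readable reference to test the faster version against, not as something to run on 4*10^8 records.

```python
import random


def naive_sample_with_replacement(population, k):
    """One-pass sampling with replacement: k independent size-1 reservoirs."""
    sample = [None] * k
    for j, x in enumerate(population, start=1):
        for cell in range(k):
            # The j-th element replaces each cell independently with probability 1/j.
            # For j == 1 this always succeeds, so every cell gets initialised.
            if random.random() < 1.0 / j:
                sample[cell] = x
    return sample


random.seed(1)
print(naive_sample_with_replacement(range(1, 101), 5))
```

Each cell behaves like an independent reservoir of size one, so every cell ends up holding a uniformly random element of the stream.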

How to test if one set of (unique) integers belongs to another set, efficiently?

I'm writing a program where I'm having to test if one set of unique integers A belongs to another set of unique numbers B. However, this operation might be done several hundred times per second, so I'm looking for an efficient algorithm to do it.
For example, if A = [1 2 3] and B = [1 2 3 4], it is true, but if B = [1 2 4 5 6], it's false.
I'm not sure how efficient it is to just sort and compare, so I'm wondering if there are any more efficient algorithms.
One idea I came up with was to map each number n to the corresponding n'th prime: that is, 1 = 2, 2 = 3, 3 = 5, 4 = 7 etc. Then I could calculate the product of A, and if that product is a factor of the corresponding product of B, we could say with certainty that A is a subset of B. For example, if A = [1 2 3], B = [1 2 3 4] the primes are [2 3 5] and [2 3 5 7] and the products 2*3*5=30 and 2*3*5*7=210. Since 210%30=0, A is a subset of B. I'm expecting the largest integer to be a couple of million at most, so I think it's doable.
Are there any more efficient algorithms?
The asymptotically fastest approach would be to just put each set in a hash table and query each element, which is O(N) time. You cannot do better (since it will take that much time to read the data).
Most set data structures already support expected and/or amortized O(1) query time. Some languages even support the subset test directly. For example, in Python, with A and B as sets, you could just do
A <= B
(A < B would test for a proper subset, which is false when the two sets are equal.)
Of course the picture changes drastically depending on what you mean by "this operation is repeated". If you have the ability to do precalculations on the data as you add it to the set (which presumably you do), this will allow you to subsume the minimal O(N) time into other operations such as constructing the set. But we can't advise without knowing much more.
Assuming you had full control of the set datastructure, your approach to keep a running product (whenever you add an element, you do a single O(1) multiplication) is a very good idea IF there exists a divisibility test that is faster than O(N)... in fact your solution is really smart, because we can just do a single ALU division and hope we're within float tolerance. Do note however this will only allow you roughly a speedup factor of 20x max I think, since 21! > 2^64. There might be tricks to play with congruence-modulo-an-integer, but I can't think of any. I have a slight hunch though that there is no divisibility test that is faster than O(#primes), though I'd like to be proved wrong!
If you are doing this repeatedly on duplicates, you may benefit from caching, depending on what exactly you are doing; give each set a unique ID (though since this makes updates hard, you may ironically wish to do something exactly like your scheme to make fingerprints, but mod max_int_size with collision detection). To manage memory, you can pin extremely expensive set comparisons (e.g. checking if a giant set is part of itself) into the cache, while otherwise using a most-recent policy if you run into memory issues. The nice thing about this is that it synergizes with an element-by-element rejection test. That is, you will be throwing out sets quickly if they don't have many overlapping elements, but if they have many overlapping elements the calculations will take a long time, and if you repeat these calculations, caching could come in handy.
Let A and B be two sets, and you want to check if A is a subset of B. The first idea that pops into my mind is to sort both sets and then simply check whether every element of A is contained in B, as follows:
Let n_A and n_B be the cardinality of A and B, respectively. Let i_A = 1, i_B = 1. Then the following algorithm (which is O(n_A + n_B)) will solve the problem:
// A and B assumed to be sorted, with 1-based indexing
i_A = 1;
i_B = 1;
n_A = size(A);
n_B = size(B);
if (n_A > n_B) return false; // also guards against reading from an empty B
while (i_A <= n_A) {
    // skip elements of B that are smaller than the current element of A
    while (A[i_A] > B[i_B]) {
        i_B++;
        if (i_B > n_B) return false;
    }
    if (A[i_A] != B[i_B]) return false;
    i_A++;
}
return true;
The same thing, but in a more functional, recursive way (some will find the previous easier to understand, others might find this one easier to understand):
// A and B assumed to be sorted
function subset(A, B)
    n_A = size(A)
    n_B = size(B)
    function subset0(i_A, i_B)
        if (i_A > n_A) return true;
        else if (i_B > n_B) return false;
        else
            if (A[i_A] <= B[i_B]) return (A[i_A] == B[i_B]) && subset0(i_A + 1, i_B + 1);
            else return subset0(i_A, i_B + 1);
    return subset0(1, 1)
In this last example, notice that subset0 is tail recursive: if (A[i_A] == B[i_B]) is false then there will be no recursive call; otherwise, if (A[i_A] == B[i_B]) is true, then there is no need to keep this information, since the result of true && subset0(...) is exactly the same as subset0(...). So any smart compiler will be able to transform this into a loop, avoiding stack overflows or any performance hits caused by function calls.
This will certainly work, but we might be able to optimize it a lot in the average case if you have and provide more information about your sets, such as the probability distribution of the values in the sets, if you somehow expect the answer to be biased (ie, it will more often be true, or more often be false), etc.
Also, have you already written any code to actually measure its performance? Or are you trying to pre-optimize?
You should start by writing the simplest and most straightforward solution that works, and measure its performance. If it's not already satisfactory, only then you should start trying to optimize it.
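As a starting point for such measurements, the sorted two-pointer check from this answer could be written in Python roughly like this. It is a sketch with 0-based indexing and a made-up function name; both inputs are assumed to be sorted lists of unique integers.

```python
def is_sorted_subset(a, b):
    """Return True if every element of sorted list a occurs in sorted list b."""
    i = 0  # current position in b
    for x in a:
        # Skip the elements of b that are smaller than x.
        while i < len(b) and b[i] < x:
            i += 1
        # Either b is exhausted or b[i] >= x; x is present only if they are equal.
        if i == len(b) or b[i] != x:
            return False
        i += 1
    return True


print(is_sorted_subset([1, 2, 3], [1, 2, 3, 4]))     # True
print(is_sorted_subset([1, 2, 3], [1, 2, 4, 5, 6]))  # False
```

Each index only moves forward, so the check is linear in len(a) + len(b) once the inputs are sorted.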
I'll present an O(m+n) time-per-test algorithm. But first, two notes regarding the problem statement:
Note 1 - Your edits say that set sizes may be a few thousand, and numbers may range up to a million or two.
In the following, let m, n denote the sizes of sets A, B and let R denote the size of the largest numbers allowed in sets.
Note 2 - The multiplication method you proposed is quite inefficient. Although it uses O(m+n) multiplies, it is not an O(m+n) method because the product lengths are worse than O(m) and O(n), so it would take more than O(m^2 + n^2) time, which is worse than the O(m ln(m) + n ln(n)) time required for sorting-based methods, which in turn is worse than the O(m+n) time of the following method.
For the presentation below, I suppose that sets A, B can completely change between tests, which you say can occur several hundred times per second. If there are partial changes, and you know which p elements change in A from one test to next, and which q change in B, then the method can be revised to run in O(p+q) time per test.
Step 0. (Performed one time only, at outset.) Clear an array F, containing R bits or bytes, as you prefer.
Step 1. (Initial step of per-test code.) For i from 0 to n-1, set F[B[i]], where B[i] denotes the i'th element of set B. This is O(n).
Step 2. For i from 0 to m-1, { test F[A[i]]. If it is clear, report that A is not a subset of B, and go to step 4; else continue }. This is O(m).
Step 3. Report that A is a subset of B.
Step 4. (Clear used bits) For i from 0 to n-1, clear F[B[i]]. This is O(n).
The initial step (clearing array F) is O(R) but steps 1-4 amount to O(m+n) time.
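Steps 0 to 4 can be sketched in Python as below. The class name SubsetTester and its methods are hypothetical, and the sketch assumes every set element is an integer in the range 0..R.

```python
class SubsetTester:
    """Reusable flag array for repeated subset tests on integers in 0..R."""

    def __init__(self, max_value):
        # Step 0: allocate and clear the flag array once, O(R).
        self.flags = bytearray(max_value + 1)

    def is_subset(self, a, b):
        flags = self.flags
        # Step 1: mark every element of B, O(n).
        for y in b:
            flags[y] = 1
        # Steps 2 and 3: A is a subset iff all of its elements are marked, O(m).
        result = all(flags[x] for x in a)
        # Step 4: clear only the flags that were set, O(n).
        for y in b:
            flags[y] = 0
        return result


tester = SubsetTester(2_000_000)
print(tester.is_subset([1, 2, 3], [1, 2, 3, 4]))     # True
print(tester.is_subset([1, 2, 3], [1, 2, 4, 5, 6]))  # False
```

Because step 4 restores the array to all zeros, the same tester object can be reused for every subsequent test without paying the O(R) clearing cost again.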
Given the limit on the size of the integers, if the set of B sets is small and changes seldom, consider representing the B sets as bitsets (bit arrays indexed by integer set member). This doesn't require sorting, and the test for each element is very fast.
If the A members are sorted and tend to be clustered together, then you can get another speedup by testing all the elements in one word of the bitset at a time.
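A minimal sketch of the bitset idea, using Python's arbitrary-precision integers as bit arrays (the helper names are made up for illustration). With this representation the whole subset test is a couple of bitwise operations that process many elements per machine word.

```python
def to_bitset(values):
    """Pack non-negative integers into one int with bit v set for each value v."""
    mask = 0
    for v in values:
        mask |= 1 << v
    return mask


def bitset_is_subset(mask_a, mask_b):
    """A is a subset of B iff A has no bit set that is missing from B."""
    return (mask_a & ~mask_b) == 0


b_mask = to_bitset([1, 2, 3, 4])  # built once, reused while B stays the same
print(bitset_is_subset(to_bitset([1, 2, 3]), b_mask))  # True
print(bitset_is_subset(to_bitset([1, 2, 5]), b_mask))  # False
```

This pays off mainly when the B bitsets are built once and reused, as the answer suggests; rebuilding both bitsets for every test costs as much as the flag-array approach.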

Number of ways to add up to a sum S with N numbers

Say S = 5 and N = 3 the solutions would look like - <0,0,5> <0,1,4> <0,2,3> <0,3,2> <5,0,0> <2,3,0> <3,2,0> <1,2,2> etc etc.
In the general case, N nested loops can be used to solve the problem. Run N nested loops, and inside them check whether the loop variables add up to S.
If we do not know N ahead of time, we can use a recursive solution. At each level, run a loop from 0 to S, and then call the function itself again. When we reach a depth of N, see if the numbers obtained add up to S.
Any other dynamic programming solution?
Try this recursive function:
f(s, n) = 1                                       if s = 0
        = 0                                       if s != 0 and n = 0
        = sum of f(s - i, n - 1) over i in [0, s]    otherwise
To use dynamic programming you can cache the value of f after evaluating it, and check if the value already exists in the cache before evaluating it.
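A minimal Python sketch of this recursion with caching, using functools.lru_cache from the standard library as the cache (the function name count_ways is made up for illustration):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_ways(s, n):
    """Number of ordered n-tuples of non-negative integers that sum to s."""
    if s == 0:
        return 1  # only the all-zero tuple remains
    if n == 0:
        return 0  # a positive sum cannot be made from zero numbers
    # Choose the value i of one number, then count the ways for the rest.
    return sum(count_ways(s - i, n - 1) for i in range(s + 1))


print(count_ways(5, 3))  # 21
```

With the cache, each (s, n) pair is evaluated once, so there are at most (S+1)*(N+1) distinct subproblems.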
There is a closed-form formula: binomial(s + n - 1, s), or equivalently binomial(s + n - 1, n - 1).
Those numbers are the simplex numbers.
If you want to compute them, use the log gamma function or arbitrary precision arithmetic.
See https://math.stackexchange.com/questions/2455/geometric-proof-of-the-formula-for-simplex-numbers
I have my own formula for this. My friend Gio and I wrote an investigative report about it. The formula that we got is 2^(n-1) - 1, where n is the number whose sums of addends we are counting.
Let's try.
If n is 1: it has 0 such sums. There are no two or more numbers that we can add to get a sum of 1 (excluding 0). Let's try a higher number.
Let's try 4. 4 can be written as: 1+1+1+1, 1+2+1, 1+1+2, 2+1+1, 1+3, 2+2, 3+1. That is 7 in total.
Let's check with the formula. 2^(4-1) - 1 = 2^3 - 1 = 8 - 1 = 7.
Let's try 15. 2^(15-1) - 1 = 2^14 - 1 = 16384 - 1 = 16383. Therefore, there are 16383 ways to add numbers that sum to 15.
(Note: Addends are positive numbers only.)
(You can try other numbers to check whether our formula is correct or not.)
This can be calculated in O(s+n) (or O(1) if you don't mind an approximation) in the following way:
Imagine we have a string with n-1 X's in it and s o's. So for your example of s=5, n=3, one example string would be
oXooXoo
Notice that the X's divide the o's into three distinct groupings: one of length 1, length 2, and length 2. This corresponds to your solution of <1,2,2>. Every possible string gives us a different solution, by counting the number of o's in a row (a 0 is possible: for example, XoooooX would correspond to <0,5,0>). So by counting the number of possible strings of this form, we get the answer to your question.
There are s+(n-1) positions to choose for s o's, so the answer is Choose(s+n-1, s).
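To make the correspondence between strings and solutions concrete, here is a small Python sketch that enumerates the positions of the X's and turns each choice into a tuple. The function name is made up, n >= 1 is assumed, and math.comb needs Python 3.8 or later.

```python
from itertools import combinations
from math import comb


def solutions_from_bars(s, n):
    """Yield every n-tuple of non-negative integers summing to s (n >= 1)."""
    slots = s + n - 1  # total length of the string of o's and X's
    for bars in combinations(range(slots), n - 1):
        parts = []
        prev = -1
        for b in bars:
            # Number of o's between the previous X and this one.
            parts.append(b - prev - 1)
            prev = b
        # The o's after the last X.
        parts.append(slots - prev - 1)
        yield tuple(parts)


sols = list(solutions_from_bars(5, 3))
print(len(sols), comb(5 + 3 - 1, 5))  # 21 21
print((1, 2, 2) in sols, (0, 5, 0) in sols)  # True True
```

Enumerating is only useful for small inputs; for counting alone, comb(s + n - 1, s) gives the answer directly.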
There is a fixed formula for the answer. If you want to find the number of ways to get N as the sum of R elements, the answer is always:
(N+R-1)!/((R-1)!*(N)!)
or in other words:
(N+R-1) C (R-1)
This actually looks a lot like a Towers of Hanoi problem, without the constraint of stacking disks only on larger disks. You have S disks that can be in any combination on N towers. So that's what got me thinking about it.
What I suspect is that there is a formula we can deduce that doesn't require the recursive programming. I'll need a bit more time though.

Resources