Pseudo-polynomial algorithms

I understand when a given algorithm can be called pseudo-polynomial; however, I can't find anywhere how to show that it is exponential with respect to the size of the input in bits. What I mean is a formal proof that the running time, viewed as a function of the input size, is exponential.
Maybe it would be easiest to explain based on the knapsack problem.
Yes, I've read this thread: What is pseudopolynomial time? How does it differ from polynomial time?
...but it's not quite what I want.
Thanks in advance

(That was my original post, so I'd be happy to elaborate!)
Let's take the subset sum problem as an example. In this problem, we want to determine whether there's a subset of a set S of n numbers that adds up to exactly W. There's a pseudopolynomial-time DP algorithm that runs in time O(nW). Let's formally show that this is exponential in x, the number of bits of input.
To do this, we need to think about how you'd structure the input to this problem. If we were to write things out in plain English, we could write down an input to the problem by writing out all the numbers in S as a comma-separated list, appending the value of W at the end. For example, we might write the question "Is there a subset of {1, 2, 3, 8, 12} that sums up to 5?" by writing
1,2,3,8,12,5
This is done in decimal. If we write the numbers in binary, we get
1,10,11,1000,1100,101
To get the whole thing to fit into a single string of bits, we need to somehow encode these numbers with separators interspersed. To do this, we'll use a standard trick: we'll double every bit of each number and replace the commas with the string 01. In this case, we'd get
110111000111110111000000011111000001110011
We can decode this input by reading off blocks of size 2. Every time we read 00, we know it's a 0. Every time we read 11, we know it's a 1. Every time we read 01, we know we've finished reading a number and should start on the next.
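As a sanity check, here is a small sketch of this encoding and decoding in Python (the function names are just illustrative):

    def encode(nums):
        # Double each bit of each number; join numbers with the separator 01.
        return '01'.join(''.join(b + b for b in format(v, 'b')) for v in nums)

    def decode(bits):
        nums, cur = [], ''
        for i in range(0, len(bits), 2):
            pair = bits[i:i + 2]
            if pair == '01':      # separator: the current number is complete
                nums.append(int(cur, 2))
                cur = ''
            else:                 # '00' decodes to 0, '11' decodes to 1
                cur += pair[0]
        nums.append(int(cur, 2))  # the final number has no trailing separator
        return nums

    print(encode([1, 2, 3, 8, 12, 5]))          # the bit string shown above
    print(decode(encode([1, 2, 3, 8, 12, 5])))  # [1, 2, 3, 8, 12, 5]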
So how many bits are required here? Well, if there are n numbers, we'll have n separators, requiring 2n bits. If the numbers themselves have a total of b bits, we'll need 2b bits to store them. Finally, we need lg W bits to write out W in binary, so we need 2 lg W bits to write out W. This means that the total number of bits, denoted x, satisfies
x = 2(n + b + lg W)
Now look at the runtime of our algorithm, which is O(nW). If our input set consists of n copies of the number 1, then b = n (each copy of 1 takes one bit), so the size of our input is x = 2(n + n + lg W) = 4n + 2 lg W. If we now choose W to be equal to 2^n, then lg W = n and the size of the input is x = 4n + 2n = 6n. This means that n = Θ(x) and W = 2^n = 2^Θ(x). Therefore, if the runtime of the algorithm is O(nW), the runtime is O(x · 2^Θ(x)), which is exponential in the size of the input.
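For reference, here is a minimal sketch of the O(nW) table-filling DP that this argument is about (Python, assuming nonnegative integers; not the only way to write it):

    def subset_sum(S, W):
        # reachable[w] is True if some subset of the numbers seen so far
        # sums to exactly w.
        reachable = [False] * (W + 1)
        reachable[0] = True               # the empty subset sums to 0
        for v in S:
            # Iterate downward so each number is used at most once.
            for w in range(W, v - 1, -1):
                if reachable[w - v]:
                    reachable[w] = True
        return reachable[W]

    print(subset_sum([1, 2, 3, 8, 12], 5))  # True (2 + 3 = 5)

The two nested loops are exactly where the Θ(nW) work happens, and the argument above shows that W can be exponential in the number of input bits.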
Hope this helps!


Most efficient way of storing exact set membership?

I have N slots. There are M slots occupied.
I want to be able to tell exactly whether each slot is occupied. (No Bloom filter answers, please.)
What is the absolute most storage-space efficient way of storing this information for M << N?
Guess 0: A bitmap of N bits.
Guess 1: An array of the positions of the occupied slots. Reasonably good for small M.
Guess 2: p0 + (M-1)p1 + (M-1)(M-2)p2 + ... where pX is the position of an occupied slot, among the remaining unoccupied slots. This is slightly more efficient than guess 1, as the choice of unoccupied slot narrows as slots are filled.
Guess 2 still has a lot of waste; it includes the order in which the slots were filled, which is information that is not required.
What method is more efficient than Guess 2?
If M and N are known, then one way of achieving the best compression is to store the index of the combination.
There are t = N!/((N-M)!*M!) ways of choosing the M slots to be filled, so you will always need at least log2(t) bits to represent this information.
Storing the index of the combination allows you to use exactly ceil(log2(t)) bits.
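One concrete way to compute such an index is the colexicographic "combinatorial number system": with the occupied slots sorted in increasing order as c[0] < c[1] < ... < c[M-1], the rank sum of comb(c[i], i+1) is a bijection onto [0, t). A Python sketch (math.comb needs Python 3.8+; names are illustrative):

    from math import comb

    def combination_rank(slots):
        # slots: sorted list of distinct occupied-slot indices.
        # Returns a number in [0, comb(N, M)) -- the colex rank.
        return sum(comb(c, i + 1) for i, c in enumerate(slots))

    def combination_unrank(rank, M):
        # Inverse mapping: recover the sorted slot list from its rank.
        slots = []
        for i in range(M, 0, -1):
            c = i - 1
            while comb(c + 1, i) <= rank:  # largest c with comb(c, i) <= rank
                c += 1
            rank -= comb(c, i)
            slots.append(c)
        return slots[::-1]

    print(combination_rank([1, 3]))    # 4
    print(combination_unrank(4, 2))    # [1, 3]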
Assuming that there is no other information about the distribution of the values (i.e., every possible sample of M values is equally probable), the optimal compression technique is the one given by @PeterdeRivaz: simply use the ordinal of the sample in the enumeration of all possible samples.
However, that is not trivial to compute, since the enumeration requires arithmetic on very large numbers.
Instead, it is possible to use a variant on Golomb compression, with only a small impact on the compression ratio.
Assume the numbers in the sample are in increasing order. We start by computing the successive differences. Because we will never have two equal numbers, no difference is 0; to gain a tiny additional compression, we treat the first element as its own difference from -1 (that is, we use one more than the first element), so that every value in the sequence is at least 1, and then subtract one from each value. We now select some convenient number of bits k and encode each value δ in the sequence as follows:
While δ ≥ 2^k, send a 1 bit and subtract 2^k from δ.
Now δ fits in k bits: send a 0 followed by the k-bit value of δ.
We can choose k as ⌊log2(N/M)⌋, which means that N < 2^(k+1)·M. (Taking the ceiling instead would be another possibility.) Consequently, the number of 1 bits sent over all iterations of step 1 is less than 2M (each 1 accounts for 2^k of the cumulative sum of the differences, and that sum is less than N). Each execution of step 2 sends exactly k + 1 bits, and there are exactly M executions of step 2, one for each value in the sample. Thus, the total number of bits sent is somewhere between M × (k + 1) and M × (k + 3). And since k ≤ log2(N/M), the total size of the transmission is certainly less than M log2 N − M log2 M + 3M. (We also have to send the parameters k and M, so there is a bit of overhead.)
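Here is a sketch of that encoder and decoder in Python (assumes k ≥ 1 and a sorted, duplicate-free sample; the bit string is represented as a str of '0'/'1' characters purely for clarity):

    def rice_encode(sample, k):
        bits, prev = [], -1
        for v in sample:
            d = v - prev - 1          # gaps are >= 1, so store gap - 1 >= 0
            prev = v
            while d >= (1 << k):      # step 1: unary overflow bits
                bits.append('1')
                d -= (1 << k)
            bits.append('0' + format(d, '0%db' % k))  # step 2: k-bit remainder
        return ''.join(bits)

    def rice_decode(code, k, M):
        vals, prev, i = [], -1, 0
        for _ in range(M):
            d = 0
            while code[i] == '1':     # undo step 1
                d += (1 << k)
                i += 1
            i += 1                    # skip the 0 marker
            d += int(code[i:i + k], 2)
            i += k
            prev = prev + d + 1
            vals.append(prev)
        return vals

    print(rice_encode([2, 5, 13], 2))        # 0100101011
    print(rice_decode('0100101011', 2, 3))   # [2, 5, 13]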
Now, let's consider the optimal transmission size. There are N choose M possible samples, so the size of the enumeration index in bits will be log2(N choose M). If N ≫ M, we can approximate N choose M as N^M/M!, and then using Stirling's approximation we get (in natural logarithms; in bits the last term is M·log2 e ≈ 1.44M):
log(N choose M) ≈ M log N − M log M + M
(That's actually a slight overestimate, but it is asymptotically correct.)
Thus, the difference between the compressed sequence and the information-theoretic limit is less than 2 bits per value. (In practice, it is generally around one bit per value, because step 1 executes far less than the maximum number of times.)

Determine the value of M: does M depend on K?

Here is an exercise I'm struggling with:
One way to improve the performance of QuickSort is to switch to InsertionSort when a subfile has <= M elements, instead of recursively calling itself.
Implement a recursive QuickSort with a cutoff to InsertionSort for subfiles with M or fewer elements. Empirically determine the value of M for which it performs the fewest key comparisons on inputs of 60000 random natural numbers less than K, for K = 10, 100, 1000, 10000, 100000, 1000000. Does the optimal value of M depend on K?
My issues:
I would like to know whether the value of M differs between statement 1 and statement 3. If so, what should the array size be, and how do I vary the random numbers? How do I compare M and K? Is there a mathematical equation, or should I just determine it empirically with my code?
Implement the sort algorithm as requested (see the sketch after this list).
Add support for recording the number of comparisons (e.g. increment a global)
Generate 5 sets of input data for each K. That's 30 files, with 1,800,000 lines in total.
Run the sort on every set for every K, trying several values of M. Start with the low-valued inputs and let the best-performing M guide your guesses as you move towards the high-valued inputs.
Describe your observations about the influence of K on the optimal M.
Pass the exercise like a pro
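A minimal Python sketch of steps 1 and 2 (Hoare partitioning with a middle pivot is chosen here because small K means many duplicate keys; the counter and the exact comparison accounting are illustrative):

    import random

    comparisons = 0  # global key-comparison counter (step 2)

    def insertion_sort(a, lo, hi):
        global comparisons
        for i in range(lo + 1, hi + 1):
            v, j = a[i], i - 1
            while j >= lo:
                comparisons += 1
                if a[j] <= v:
                    break
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = v

    def quicksort(a, lo, hi, M):
        global comparisons
        if hi - lo + 1 <= M:
            insertion_sort(a, lo, hi)    # cutoff to InsertionSort
            return
        pivot = a[(lo + hi) // 2]        # Hoare partition, middle pivot
        i, j = lo - 1, hi + 1
        while True:
            while True:
                i += 1
                comparisons += 1
                if a[i] >= pivot:
                    break
            while True:
                j -= 1
                comparisons += 1
                if a[j] <= pivot:
                    break
            if i >= j:
                break
            a[i], a[j] = a[j], a[i]
        quicksort(a, lo, j, M)
        quicksort(a, j + 1, hi, M)

    def trial(K, M, n=60000):
        # Comparisons used to sort n random naturals below K with cutoff M.
        global comparisons
        comparisons = 0
        data = [random.randrange(K) for _ in range(n)]
        quicksort(data, 0, n - 1, M)
        assert data == sorted(data)
        return comparisons

    for K in (10, 100, 1000, 10000, 100000, 1000000):
        for M in (1, 5, 10, 20, 40):
            print(K, M, trial(K, M))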

How does the Flajolet-Martin sketch work?

I am trying to understand this sketch but am not able to.
Correct me if I am wrong, but basically: let's say I have text data, i.e. words. I have a hash function which takes a word and produces an integer hash, and then I convert that hash to a binary bit vector, right?
Then I keep track of the first 1 I see from the left, and the position where that 1 is (say, k)... and the cardinality of this set is 2^k?
http://ravi-bhide.blogspot.com/2011/04/flajolet-martin-algorithm.html
But... say I have just one word, and the hash it generates happens to be 2^5. Then I am guessing there are 5 trailing 0's, so it will predict a cardinality of 2^5?
That doesn't sound right. What am I missing?
For a single word, the distribution of R is a geometric distribution with p = 1/2, and its standard deviation is sqrt(2) ≈ 1.41.
So for a word with a hash ending in 100000b the algorithm will, indeed, yield 2^5/0.77351 ≈ 41.37. But the probability of that is only 1/64, which is consistent with R having a standard deviation of about 1.41.
http://ravi-bhide.blogspot.com/2011/04/flajolet-martin-algorithm.html
Suppose we have a good, random hash function that acts on strings and generates integers. What can we say about the generated integers? Since they are effectively random, we would expect:
1/2 of them to have their binary representation end in 0 (i.e., divisible by 2),
1/4 of them to have their binary representation end in 00 (i.e., divisible by 4),
1/8 of them to have their binary representation end in 000 (i.e., divisible by 8).
Turning the problem around: if the hash function generated an integer ending in 0^m (m zero bits), then, intuitively, the number of unique strings is around 2^m.
What is really important to remember is that the Flajolet-Martin algorithm is meant to count the distinct elements (let's say M distinct elements) in a stream of N elements, when M is expected to be very, very large.
There is no point in using the algorithm if N or M are small enough for us to store all distinct elements in memory.
In the case where N and M are really large, the probability of the estimate being close to 2^k is actually very reasonable.
There is an explanation of this at http://infolab.stanford.edu/~ullman/mmds/ch4.pdf (page 143).
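To make the mechanics concrete, here is a tiny single-hash sketch in Python, using md5 as a stand-in for the "good, random hash function" (all names illustrative):

    import hashlib

    def trailing_zeros(x, width=32):
        # Number of trailing zero bits in x (width if x == 0).
        if x == 0:
            return width
        r = 0
        while x & 1 == 0:
            x >>= 1
            r += 1
        return r

    def flajolet_martin(items):
        # R is the largest number of trailing zero bits seen in any hash;
        # the estimate of the number of distinct items is 2^R / 0.77351.
        R = 0
        for item in items:
            h = int(hashlib.md5(item.encode()).hexdigest(), 16) & 0xFFFFFFFF
            R = max(R, trailing_zeros(h))
        return (1 << R) / 0.77351

    words = ['to', 'be', 'or', 'not', 'to', 'be'] * 1000
    print(flajolet_martin(words))   # estimate of 4 distinct words; noisy

As the answer above explains, a single hash gives a very noisy estimate; practical implementations run many independent hash functions and combine the results (e.g., a median of means).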

Determining running time of an algorithm to compare two arrays

I want to know how to determine the running time of an algorithm written in pseudocode so that I can familiarize myself with running time. So, for example, what is the running time of an algorithm that compares 2 arrays to determine whether they are the same?
Array 1 = [1, 5, 3, 2, 10, 12]
Array 2 = [3, 2, 1, 5, 10, 12]
So these two arrays are not the same since they are ordered differently.
My pseudocode is:
1) set current pointer to the first number in the first array
2) set second pointer to the first number in the second array
3) while (current pointer has not passed the end of the array) compare the element it points to with the element at the same position in the other array
4) if (current pointer element == second pointer element)
move current pointer to the next number
move second pointer to the next number
5) else output that the arrays are not the same and stop
end loop; output that the arrays are the same
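In runnable form, the pseudocode might look like this Python sketch (the length check is an assumption not in the original steps):

    def arrays_equal(a, b):
        if len(a) != len(b):        # differing lengths: not the same
            return False
        for i in range(len(a)):     # step 3: walk both arrays in lockstep
            if a[i] != b[i]:
                return False        # step 5: one mismatch decides it
            # step 4: the loop advances both positions automatically
        return True                 # every position matched

    print(arrays_equal([1, 5, 3, 2, 10, 12], [3, 2, 1, 5, 10, 12]))  # False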
So I am assuming, first off, that my code is correct. I know step 5 executes at most once, since it takes only one mismatch to report that the arrays are not the same, so step 5 takes only constant time (1). I know steps 1 and 2 also execute only once each.
So far I know the running time is 3 + ? (? being the running time of the loop itself).
Now I am lost on the loop part. Does the loop run n times (n being the number of elements in the array), since in the worst case every single number gets matched? Am I thinking of running time in the right way?
If someone can help with this, I'll appreciate it.
Thanks!
What you are asking about is called the time complexity of your algorithm. We talk about the time complexity of algorithms using so-called Big-O notation.
Big-O notation is a method for talking about the approximate number of steps our algorithm takes relative to the size of the algorithm's input, in the worst possible case for an input of that size.
Your algorithm runs in O(n) time (pronounced "big-oh of n" or "order n" or sometimes we just say "linear time").
You already know that steps 1, 2, and 5 all run in a constant number of steps relative to the size of the array. We say that those steps run in O(1) time ("constant time").
So let's consider step 3:
If there are n elements in the array, then step 3 needs to do n comparisons in the worst case. So we say that step 3 takes O(n) time.
Since the algorithm takes O(n) time on step 3, and all other steps are faster, we say that the total time complexity of your algorithm is O(n).
When we write O(f), where f is some function, we mean that the algorithm runs within some constant factor of f for large values.
Take your algorithm for example. For large values of n (say n = 1000), the algorithm doesn't take exactly n steps. Suppose that one loop iteration (the comparison in step 3 plus the pointer advances in step 4) takes 5 instructions to complete in your algorithm, on your machine of choice. (It could be any constant number; I'm just choosing 5 as an example.) And suppose that steps 1, 2, and 5 all take some constant number of steps each, totalling 10 instructions for all three of those steps.
Then for n = 1000 your algorithm would take:
Steps 1 + 2 + 5 = 10 instructions. The loop (steps 3 and 4) = 5*1000 = 5000 instructions.
This is a total of 5010 instructions. This is about 5*n instructions, which is within a constant factor of n, which is why we say it is O(n).
For very large n, the 10 in f = 5*n + 10 becomes more and more insignificant, as does the 5. For this reason, we drop the constants and simply say that f is within a constant factor of n for large n, i.e., that f is in O(n).
In this way it's easy to express the idea that a quadratic function like f1 = n^2 + 2 is eventually larger than any linear function like f2 = 10000*n + 50000 once n is large enough, simply by writing f1 as O(n^2) and f2 as O(n).
You are correct. The running time is O(n) where n is the number of elements in the arrays. Each time you add 1 element to the arrays, you would have to execute the loop 1 more time in the worst case.

Anticipate factorial overflow

I'm wondering how I could anticipate whether the next iteration will cause an integer overflow while calculating the factorial F.
Let's say that at each iteration I have an int I and the maximum value is MAX_INT.
It sounds like homework, I know. It's not. It's just me asking myself "stupid" questions.
Addendum
I thought about this: given a number of BITS (the width an integer can take, in bits), I could round the number I up to the next power of two and detect whether a left shift would exceed BITS. But what would that look like, algorithmically?
Alternative hint:
a * b ≤ MAX_INT
is equivalent to
a ≤ MAX_INT / b
if b > 0.
Factorials are a series of multiplications, and the number of bits needed to hold the result of a multiplication is at most the sum of the bits of the two multiplicands (and at least that sum minus one). So, keep a running total of how many bits your result uses, plus the number of bits needed to hold the value you are about to multiply in. When that total exceeds the number of bits available, you may be about to overflow; the test is conservative, but it never misses a real overflow.
If you've got m = (n-1)! so far and you're about to multiply by n, you can guard against overflow by checking (with integer division) that
m <= MAX_INT / n
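A sketch of that guard in Python (Python's ints don't actually overflow, so MAX_INT here is an illustrative 32-bit limit):

    MAX_INT = 2**31 - 1  # stand-in for the machine's integer limit

    def factorial_or_none(n):
        m = 1
        for i in range(2, n + 1):
            if m > MAX_INT // i:   # m * i would exceed MAX_INT
                return None        # overflow anticipated before it happens
            m *= i
        return m

    print(factorial_or_none(12))   # 479001600
    print(factorial_or_none(13))   # None: 13! > 2^31 - 1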
You can probably use Stirling's approximation, which says that
ln(n!) = n*ln(n) - n + ln(2*pi*n)/2 + O(1/n)
and is quite accurate.
You don't actually need to go about multiplying anything. Of course, this does not directly answer what you asked, but given that you are just curious, I hope it helps.
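For instance, one could use the log-gamma function (math.lgamma(n + 1) equals ln(n!)) to find the largest n whose factorial fits, without performing a single big multiplication; the usual floating-point caveats apply right at the boundary:

    import math

    MAX_INT = 2**31 - 1
    limit = math.log(MAX_INT)

    n = 1
    while math.lgamma(n + 2) <= limit:  # does ln((n+1)!) still fit?
        n += 1
    print(n)  # 12: the largest n with n! <= 2^31 - 1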
