Strange Python Speedup for Modular Inverse - algorithm

Usually I am lazy and calculate modular inverses as follows:
def inv(a,m):
    return 1 if a==1 else ((a-inv(m%a,a))*m+1)//a
print(inv(100**7000,99**7001))
But I was curious to know whether passing more information back, namely the full solution to Bézout's identity (instead of just one of the pair), yields a faster or slower algorithm in practice:
def bez(a,b):
    # returns [x,y,d] where a*x+b*y = d = gcd(a,b) since (b%a)*y+a*(x+(b//a)*y)=d
    if a==0: return [0,1,b] if b>0 else [0,-1,-b]
    r=bez(b%a,a)
    return [r[1]-(b//a)*r[0],r[0],r[2]]

def inv(a,m):
    r=bez(a,m)
    if r[2]!=1: return None
    return r[0] if r[0]>=0 else r[0]+abs(m)
print(inv(100**7000,99**7001))
I was amazed to find that the latter ran more than 50 times faster than the former! But both use nearly the same number of recursive calls, with one integer division and one modulo operation per call, and the bit-lengths of the operands in the former are only about twice those in the latter on average (since the arguments involved are essentially the same), so I expected its running time to be roughly 4 times that of the latter, not 50 times.
So why am I observing this strange speedup?
Note that I am using Python 3, and I observed this when running online, with the following code added at the top to stop the Python interpreter complaining about exceeding maximum recursion depth or stack space:
import resource,sys
sys.setrecursionlimit(1000000)
resource.setrlimit(resource.RLIMIT_STACK,[0x10000000,resource.RLIM_INFINITY])

I figured it out finally. Suppose the initial inputs are n-bit integers. In typical cases, apart from the first call to either version of inv, the recursive calls have parameters whose sizes differ by only O(1) on average, and there are O(n) recursive calls on average, both due to a number-theoretic phenomenon. This O(1) size difference means the two methods actually have drastically different average time complexities. The implicit assumption in my question was that multiplying or dividing two n-bit integers takes roughly O(n^2) time; that holds in the worst case (for schoolbook arithmetic), but not in the average case for the fast method, whose operands are much smaller than n bits!
For the fast method, it can be proven that when p,q > 0, the triple [x,y,d] returned by bez(p,q) satisfies abs(x) ≤ q and abs(y) ≤ p. Thus when each call bez(a,b) computes b%a and (b//a)*r[0], the modulo, division, and multiplication each take O(size(a)) time on average, since size(b)-size(a) ∈ O(1) on average and abs(r[0]) ≤ a. The total time is therefore roughly O(n^2).
For the slow method, however, when each call inv(a,m) computes ((a-inv(m%a,a))*m+1)//a, the value a-inv(m%a,a) is roughly the same size as a, so the multiplication by m, and the subsequent division whose dividend is roughly twice the size of a, each take O(size(a)^2) time! The total time is therefore roughly O(n^3).
Indeed, simulation agrees very closely with the above analysis. For reference, the fast method on inv(100**n,99**(n+1)) with n = 1400·k for k∈[1..6] took 17, 57, 107, 182, 281, 366 ms, while the slow method with n = 500·k for k∈[1..6] took 22, 141, 426, 981, 1827, 3101 ms. These are fit very well by 9·k^2+9·k and 13·k^3+8·k^2 respectively, with mean fractional errors of ≈3.3% and ≈1.6%. (We include the two highest-order terms because the second contributes a small but significant amount.)
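For anyone who wants to reproduce the comparison, here is a minimal timing sketch (my own harness, not the original benchmark; the two functions are the ones from the question, renamed, and the exponent is kept small enough that raising the recursion limit is all that is needed):

import sys, time
sys.setrecursionlimit(1000000)

def inv_slow(a, m):
    return 1 if a == 1 else ((a - inv_slow(m % a, a)) * m + 1) // a

def bez(a, b):
    if a == 0: return [0, 1, b] if b > 0 else [0, -1, -b]
    r = bez(b % a, a)
    return [r[1] - (b // a) * r[0], r[0], r[2]]

def inv_fast(a, m):
    r = bez(a, m)
    if r[2] != 1: return None
    return r[0] if r[0] >= 0 else r[0] + abs(m)

n = 1000
for f in (inv_fast, inv_slow):
    start = time.perf_counter()
    f(100**n, 99**(n + 1))
    print(f.__name__, time.perf_counter() - start)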

Related

Big O Notation O(n^2) what does it mean?

For example, it says that in 1 second 3000 numbers are sorted with selection sort. How can we predict how many numbers will be sorted in 10 seconds?
I checked that selection sort needs O(n^2), but I don't understand how I am supposed to calculate how many numbers will be sorted in 10 seconds.
We cannot use big O to reliably extrapolate actual running times or input sizes (whichever is the unknown).
Imagine the same code running on two machines A and B, with different parsers, compilers, hardware, operating systems, array implementations, etc.
Let's say they both can parse and run the following code:
procedure sort(reference A)
    declare i, j, n, x
    i ← 1
    n ← length(A)
    while i < n
        x ← A[i]
        j ← i - 1
        while j >= 0 and A[j] > x
            A[j+1] ← A[j]
            j ← j - 1
        end while
        A[j+1] ← x
        i ← i + 1
    end while
end procedure
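For readers who prefer runnable code to pseudocode, here is a direct Python translation of the procedure above (an illustration only; the hypothetical timings for systems A and B below refer to the pseudocode, not to this snippet):

def sort(A):
    # Insertion sort, translated line by line from the pseudocode above.
    i = 1
    n = len(A)
    while i < n:
        x = A[i]
        j = i - 1
        while j >= 0 and A[j] > x:
            A[j + 1] = A[j]
            j -= 1
        A[j + 1] = x
        i += 1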
Now system A spends 0.40 seconds on the initialisation part before the loop starts, independent of what A is, because on that configuration the initiation of the function's execution context, including the allocation of the variables, is a very, very expensive operation. It also needs to spend 0.40 seconds on the de-allocation of the declared variables and the call stack frame when it arrives at the end of the procedure, again because on that configuration the memory management is very expensive. Furthermore, the length function is costly as well, and takes 0.19 seconds. That's a total overhead of 0.99 seconds.
On system B the memory allocation and de-allocation is cheap and takes 1 microsecond. The length function is also fast and needs 1 microsecond. That's a total overhead of 2 microseconds.
System A is however much faster on the rest of the statements in comparison with system B.
Both implementations happen to need 1 second to sort an array A having 3000 values.
If we now take the reasoning that we could predict the array size that can be sorted in 10 seconds based on the results for 1 second, we would say:
𝑛 = 3000, and the duration is 1 second, which corresponds to 𝑛² = 9 000 000 operations. So if 9 000 000 operations correspond to 1 second, then 90 000 000 operations correspond to 10 seconds, and 𝑛 = √(90 000 000) ≈ 9 487 (the size of the array that can be sorted in 10 seconds).
However, if we follow the same reasoning, we can look at the time needed for completing the outer loop only (without the initialisation overhead), which also is O(𝑛²) and thus the same reasoning can be followed:
𝑛 = 3000, and the duration of the loop in system A is 0.01 second, which corresponds to 𝑛² = 9 000 000 operations. So if 9 000 000 operations can be executed in 0.01 second, then in 10 − 0.99 = 9.01 seconds (overhead subtracted) we can execute (9.01 / 0.01) × 9 000 000 operations, i.e. 𝑛² = 8 109 000 000, and now 𝑛 = √(𝑛²) ≈ 90 050.
The problem is that using the same reasoning on big O, the predicted outcomes differ by a factor of about 10!
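Put as a quick calculation, the two estimates above are:

from math import sqrt

n = 3000
# Naive extrapolation: attribute the whole 1 second to the n^2 part.
ops_per_second = n ** 2                     # 9 000 000 "operations" per second
print(sqrt(10 * ops_per_second))            # ~9487 items in 10 seconds

# Same reasoning for system A, where 0.99 s of the 1 s is fixed overhead.
ops_per_second = n ** 2 / 0.01              # the loop itself took only 0.01 s
print(sqrt((10 - 0.99) * ops_per_second))   # ~90050 items in 10 seconds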
We may be tempted to think that this is now only a "problem" of constant overhead, but similar things can be said about operations in the outer loop. For instance, it might be that x ← A[i] has a relatively high cost for some reason on some system. These are factors that are not revealed in the big O notation, which only retains the most significant term, omitting the lower-order terms and constant factors that also play a role.
The actual running time for an actual input size depends on a more complex function that is likely close to a polynomial, like 𝑛² + 𝑎𝑛 + 𝑏. Those coefficients 𝑎 and 𝑏 would be needed to make a more reasonable prediction possible. There might even be components that are non-polynomial, like 𝑛² + 𝑎𝑛 + 𝑏 + 𝑐√𝑛... This may seem unlikely, but the systems on which the code runs may perform all kinds of optimisations while the code runs, which may have such or similar effects on the actual running time.
The conclusion is that this type of reasoning gives no guarantee that the prediction is anywhere near reality -- without more information about the actual code, the system on which it runs, etc., it is nothing more than a guess. Big O is a measure of asymptotic behaviour.
As the comments say, big-oh notation has nothing to do with specific time measurements; however, the question still makes sense, because the big-oh notation is perfectly usable as a relative factor in time calculations.
Big-oh notation gives us an indication of how the number of elementary operations performed by an algorithm varies as the number of items to process varies.
Simple algorithms perform a fixed number of operations per item, but in more complicated algorithms the number of operations that need to be performed per item varies as the number of items varies. Sorting algorithms are a typical example of such complicated algorithms.
The great thing about big-oh notation is that it belongs to the realm of science, rather than technology, because it is completely independent of your hardware, and of the speed at which your hardware is capable of performing a single operation.
However, the question tells us exactly how much time it took for some hypothetical hardware to process a certain number of items, so we have an idea of how much time that hardware takes to perform a single operation, so we can reason based on this.
If 3000 numbers are sorted in 1 second, and the algorithm operates with O( N ^ 2 ), this means that the algorithm performed 3000 ^ 2 = 9,000,000 operations within that second.
If given 10 seconds to work, the algorithm will perform ten times that many operations within that time, which is 90,000,000 operations.
Since the algorithm works in O( N ^ 2 ) time, this means that after 90,000,000 operations it will have sorted Sqrt( 90,000,000 ) = 9,486 numbers.
To verify: 9,000,000 operations within a second means 1.11e-7 seconds per operation. Since the algorithm works at O( N ^ 2 ), this means that to process 9,486 numbers it will require 9,486 ^ 2 operations, which is roughly equal to 90,000,000 operations. At 1.11e-7 seconds per operation, 90,000,000 operations will be done in roughly 10 seconds, so we are arriving at the same result via a different avenue.
If you are seriously pursuing computer science or programming I would recommend reading up on big-oh notation, because it is a) very important and b) a very big subject which cannot be covered in stackoverflow questions and answers.

Can the efficiency of an algorithm be modelled as a function between input size and time?

Consider the following algorithm (just as an example as the implementation is obviously inefficient):
def add(n):
    for i in range(n):
        n += 1
    return n
The program adds a number to itself and returns it. Now the efficiency of an algorithm is sometimes modelled as a function between the size of the input and the number of primitive steps the algorithm has to compute. In this case, the input is an integer, n, and as n increases, the number of steps necessary to complete the algorithm also increases (in this case linearly). But is it true that the size of the input increases? Let's assume that the machine where the program is running represents integers in 8 bits. So if I increase the hypothetical input 3, for example, to 7, the number of bits involved remains the same: 00000011 -> 00000111. However, the steps necessary to compute the algorithm increase. So it seems that it's not always true that algorithmic efficiency can be modelled as a relation between input size and steps to compute. Could somebody explain to me where I go wrong, or, if I don't go wrong, why it still makes sense to model the efficiency of an algorithm as a function between the size of the input and the number of primitive steps to be computed?
Let S be the size of the input n. (Normally we'd use n for this size, but since the argument is also called n, that's confusing.) For positive n, there's a relation between S and n, namely S = ceil(log2(n+1)), the number of bits in n. The program loops n times, and since n < 2^S, it loops at most 2^S times. You can also show it loops at least 1/2 * 2^S times, so the runtime (measured in loop iterations) is Theta(2^S).
This shows there's a way to model the runtime as a function of the size, even if it's not exact.
Whether it makes sense: in your example it doesn't much, but if your input is an array to be sorted, taking the size as the number of elements in the array does make sense. (And it's typically what's used, for example, to model the number of comparisons done by different sort algorithms.)
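A small sketch of the size/step relation described above (my own illustration, using Python's bit_length as the size S, so 2^(S-1) <= n < 2^S while the loop in add(n) runs exactly n times):

for n in [3, 7, 100, 1000, 10**6]:
    S = n.bit_length()                     # size of the input in bits
    assert 2 ** (S - 1) <= n < 2 ** S      # n lies between 2^(S-1) and 2^S
    print("n =", n, "size S =", S, "loop iterations =", n)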

Does O(1) mean an algorithm takes one step to execute a required task?

I thought it meant it takes a constant amount of time to run. Is that different than one step?
O(1) is a class of functions. Namely, it includes functions bounded by a constant.
We say that an algorithm has the complexity of O(1) iff the number of steps it takes, as a function of the size of the input, is bounded by an (arbitrary) constant. This function can be constant, or it can grow, or behave chaotically, or undulate like a sine wave. As long as it never exceeds some predefined constant, it's O(1).
For more information, see Big O notation.
It means that even if you increase the size of whatever the algorithm is operating on, the number of calculations required to run remains the same.
More specifically it means that the number of calculations doesn't get larger than some constant no matter how big the input gets.
In contrast, O(N) means that if the size of the input is N, the number of steps required is at most a constant times N, no matter how big N gets.
So for example (in python code since that's probably easy for most to interpret):
def f(L, index):  # L a list, index an integer
    x = L[index]
    y = 2 * L[index]
    return x + y
then even though f has several calculations within it, the time taken to run is the same regardless of how long the list L is. However,
def g(L):  # L a list
    return sum(L)
This will be O(N) where N is the length of list L. Even though there is only a single calculation written, the system has to add all N entries together. So it has to do at least one step for each entry. So as N increases, the number of steps increases proportional to N.
As everyone has already tried to answer it, it simply means..
No matter how many mangoes you've got in a box, it'll always take you the same amount of time to eat 1 mango. How you plan on eating it is irrelevant: there may be a single step, or you might go through multiple steps and slice it nicely to consume it.

Efficient way to multiply a large set of small numbers

This question was asked in an interview.
You have an array of small integers. You have to multiply all of them. You need not worry about overflow; you have ample support for that. What can you do to speed up the multiplication on your machine?
Would multiple additions be better in this case?
I suggested multiplying using a divide and conquer approach but the interviewer was not impressed. What could be the best possible solution for this?
Here are some thoughts:
Divide-and-Conquer with Multithreading: Split the input into n / b blocks of size b and recursively multiply all the numbers in each block together. Then recursively multiply the n / b block products back together. If you have multiple cores and can run parts of this in parallel, you could save a lot of time overall.
Word-Level Parallelism: Let's suppose that your numbers are all bounded from above by some number U, which happens to be a power of two. Now, suppose that you want to multiply together a, b, c, and d. Start off by computing (4U²a + b) × (4U²c + d) = 16U⁴ac + 4U²ad + 4U²bc + bd. Now, notice that this expression mod U² is just bd. (Since bd < U², we don't need to worry about the mod U² step messing it up.) This means that if we compute this product and take it mod U², we get back bd. Since U² is a power of two, this can be done with a bitmask.
Next, notice that
4U²ad + 4U²bc + bd < 4U⁴ + 4U⁴ + U² < 9U⁴ < 16U⁴
This means that if we divide the entire expression by 16U⁴ and round down, we will end up getting back just ac. This division can be done with a bitshift, since 16U⁴ is a power of two.
Consequently, with one multiplication, you can get back the values of both ac and bd by applying a subsequent bitshift and bitmask. Once you have ac and bd, you can directly multiply them together to get back the value of abcd. Assuming that bitmasks and bitshifts are faster than multiplies, this reduces the number of multiplications necessary by 33% (two instead of three here).
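Here is a small Python sketch of that packing trick (illustrative only; the names are mine, and in Python the built-in bignums hide the benefit, but it shows the arithmetic works out):

def packed_products(a, b, c, d, U):
    # Computes (a*c, b*d) with a single multiplication, assuming
    # 0 <= a, b, c, d < U and U is a power of two, so that "mod U^2"
    # is a bitmask and "divide by 16U^4" is a bitshift.
    prod = (4 * U**2 * a + b) * (4 * U**2 * c + d)
    bd = prod & (U**2 - 1)                  # low part: prod mod U^2
    shift = (16 * U**4).bit_length() - 1    # 16U^4 == 2**shift
    ac = prod >> shift                      # high part: floor(prod / 16U^4)
    return ac, bd

U = 1 << 8
assert packed_products(200, 37, 199, 255, U) == (200 * 199, 37 * 255)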
Hope this helps!
Your divide and conquer suggestion was a good start. It just needed more explanation to impress.
With fast multiplication algorithms used to multiply large numbers (big-ints), it is much more efficient to multiply similar sized multiplicands than a series of mismatched sizes.
Here's an example in Clojure
; Define a vector of 100K random integers between 2 and 100 inclusive
(def xs (vec (repeatedly 100000 #(+ 2 (rand-int 99)))))
; Naive multiplication accumulating linearly through the array
(time (def a1 (apply *' xs)))
"Elapsed time: 7116.960557 msecs"
; Simple Divide and conquer algorithm
(defn d-c [v]
  (let [m (quot (count v) 2)]
    (if (< m 3)
      (reduce *' v)
      (*' (d-c (subvec v 0 m)) (d-c (subvec v m))))))
; Takes less than 1/10th the time.
(time (def a2 (d-c xs)))
"Elapsed time: 600.934268 msecs"
(= a1 a2) ;=> true (same result)
Note that this improvement does not rely on a set limit for the size of the integers in the array (100 was chosen arbitrarily and to demonstrate the next algorithm), but only on them being similar in size. This is a very simple divide and conquer. As the numbers get larger and more expensive to multiply, it would make sense to invest more time in iteratively grouping them by similar size. Here I am relying on the random distribution and the chance that the sizes will stay similar, but it is still going to be significantly better than the naive approach even in the worst case.
As suggested by Evgeny Kluev in the comments, for a large number of small integers, there is going to be a lot of duplication, so efficient exponentiation is also better than naive multiplication. This depends a lot more on the relative parameters than the divide and conquer, that is the numbers must be sufficiently small relative to the count for enough duplicates to accumulate to bother, but certainly performs well with these parameters (100K numbers in the range 2-100).
; Hopefully an efficient implementation
(defn pow [x n] (.pow (biginteger x) ^Integer n))
; Perform pow on duplications based on frequencies
(defn exp-reduce [v] (reduce *' (map (partial apply pow) (frequencies v))))
(time (def a3 (exp-reduce xs)))
"Elapsed time: 650.211789 msecs"
Note the very simple divide and conquer performed just a wee bit better in this trial, but it would be even better, relatively, if fewer duplicates were expected.
Of course we can also combine the two:
(defn exp-d-c [v] (d-c (mapv (partial apply pow) (frequencies v))))
(time (def a4 (exp-d-c xs)))
"Elapsed time: 494.394641 msecs"
(= a1 a2 a3 a4) ;=> true (all equal)
Note there are better ways to combine these two since the result of the exponentiation step is going to result in various sizes of multiplicands. The value of added complexity to do so depends on the expected number of distinct numbers in the input. In this case, there are very few distinct numbers so it wouldn't pay to add much complexity.
Note also that both of these are easily parallelized if multiple cores are available.
If many of the small integers occur multiple times, you could start by counting every unique integer. If c(n) is the number of occurrences of integer n, the product can be computed as
P = 2 ^ c(2) * 3 ^ c(3) * 4 ^ c(4) * ...
For the exponentiation steps, you can use exponentiation by squaring which can reduce the number of multiplications considerably.
If the count of numbers really is large compared to the range, then we have seen two asymptotic solutions presented to reduce the complexity considerably. One was based on successive squaring to compute c^k in O(log k) time for each number c, giving O(C mean(log k)) time if the largest number is C and k gives the exponent for each number between 1 and C. The mean(log k) term is maximized if every number appears an equal number of times, so if you have N numbers then the complexity becomes O(C log(N/C)), which is very weakly dependent on N and essentially just O(C) where C specifies the range of numbers.
The other approach we saw was sorting numbers by the number of times they appear, and keeping track of the product of leading numbers (starting with all numbers) and raising this to a power so that the least frequent number is removed from the array, and then updating the exponents on the remaining element in the array and repeating. If all numbers occur the same number of times K, then this gives O(C + log K) which is an improvement over O(C log K). However, say the kth number appears 2^k times. Then this will still give O(C^2 + C log(N/C)) time which is technically worse than the previous method O(C log(N/C)) if C > log(N/C). Thus, if you don't have good information on how evenly distributed the occurrences of each number are, you should go with the first approach, just take the appropriate power of each distinct number that appears in the product by using successive squaring, and take the product of the results. Total time O(C log (N/C)) if there are C distinct numbers and N total numbers.
To answer this question we need to interpret in some way the assumption from the OP: "need not worry about overflow". In the larger part of this answer it is interpreted as "ignore overflows". But I start with several thoughts about the other interpretation ("use multiprecision arithmetic"). In this case the process of multiplication may be approximately split into 3 stages:
Multiply together small sets of small numbers to get a large set of not-so-small numbers. Some of the ideas from second part of this answer may be used here.
Multiply together these numbers to get a set of large numbers. Either trivial (quadratic time) algorithm or Toom–Cook/Karatsuba (sub-quadratic time) methods may be used.
Multiply together large numbers. Either Fürer's or Schönhage–Strassen algorithm may be used. Which gives O(N polylog N) time complexity for the whole process.
Binary exponentiation may give some (not very significant) speed improvement, because most (if not all) of the complex multiplication algorithms mentioned here do squaring faster than multiplication of two unequal numbers. We could also factorize every "small" number and use binary exponentiation only for the prime factors. For evenly distributed "small" numbers this will decrease the number of exponentiations by a factor of log(number_of_values) and slightly improve the balance of squarings/multiplications.
Divide and conquer is OK when the numbers are evenly distributed. Otherwise (for example when the input array is sorted, or when binary exponentiation is used) we could do better by placing all multiplicands into a priority queue, ordered (perhaps approximately) by number length. Then we could multiply the two shortest numbers and place the result back in the queue (this process is very similar to Huffman encoding); see the sketch below. There is no need to use this queue for squaring. Also we should not use it while the numbers are not long enough.
More information on this could be found in the answer by A. Webb.
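A sketch of that priority-queue idea (my own rendering, ordering by bit length so the two shortest remaining numbers are always multiplied first, much like Huffman merging):

import heapq

def product_shortest_first(numbers):
    heap = [(x.bit_length(), x) for x in numbers]
    heapq.heapify(heap)
    while len(heap) > 1:
        _, a = heapq.heappop(heap)      # shortest remaining number
        _, b = heapq.heappop(heap)      # second shortest
        p = a * b
        heapq.heappush(heap, (p.bit_length(), p))
    return heap[0][1] if heap else 1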
If overflows may be ignored we could multiply the numbers with linear-time or better algorithms.
A sub-linear time algorithm is possible if the input array is sorted or the input data is presented as a set of tuples {value, number of occurrences}. In the latter case we could perform binary exponentiation of each value and multiply the results together. Time complexity is O(C log(N/C)), where C is the number of different values in the array. (See also this answer.)
If the input array is sorted, we could use binary search to find the positions where the value changes. This allows us to determine how many times each value occurs in the array. Then we could perform binary exponentiation of each value and multiply the results together. Time complexity is O(C log N). We could do better by using one-sided binary search here; in that case time complexity is O(C log(N/C)).
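For the sorted-array case, a minimal sketch (my own illustration, using a plain binary search via bisect rather than the one-sided variant mentioned above, so it corresponds to the O(C log N) version):

from bisect import bisect_right

def product_of_sorted(arr):
    result, i, n = 1, 0, len(arr)
    while i < n:
        value = arr[i]
        j = bisect_right(arr, value, i, n)   # first index past this run of equal values
        result *= pow(value, j - i)          # binary exponentiation of the run
        i = j
    return result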
But if input array is not sorted, we have to inspect each element, so O(N) time complexity is the best we can do. Still we could use parallelism (multithreading, SIMD, word-level parallelism) to obtain some speed improvement. Here I compare several such approaches.
To compare these approaches I've chosen very small (3-bit) values, which are pretty tightly packed (one value per 8-bit integer). And implemented them in low-level language (C++11) to get easier access to bit manipulation, specific CPU instructions, and SIMD.
Here are all the algorithms:
accumulate from standard library.
Trivial implementation with 4 accumulators.
Word-level parallelism for multiplication, as described in the answer by templatetypedef. With a 64-bit word size this approach allows up to 10-bit values (using only 3 multiplications instead of every 4), or it may be applied twice (as I did in the tests) with up to 5-bit values (requiring only 5 multiplications for every 8).
Table lookup. In the tests, 7 out of every 8 multiplications are replaced by a single table lookup. With values larger than in these tests, the number of replaced multiplications decreases, slowing down the algorithm. Values larger than 11-12 bits make this approach useless.
Binary exponentiation (see more details below). Values larger than 4 bits make this approach useless.
SIMD (AVX2). This implementation can use up to 8-bit values.
Here are the sources for all the tests on Ideone. Note that the SIMD test requires the AVX2 instruction set from Intel. The table lookup test requires the BMI2 instruction set. Other tests do not depend on any particular hardware (I hope). I ran these tests on 64-bit Linux, compiled with gcc 4.8.1, optimization level -O2.
Here are some more details for binary exponentiation test:
for (size_t i = 0; i < size / 8; i += 2)
{
    auto compr = (qwords[i] << 4) | qwords[i + 1];
    constexpr uint64_t lsb = 0x1111111111111111;
    if ((compr & lsb) != lsb) // if there is at least one even value
    {
        auto b = reinterpret_cast<uint8_t*>(qwords + i);
        acc1 *= accumulate(b, b + 16, acc1, multiplies<unsigned>{});
        if (!acc1)
            break;
    }
    else
    {
        const auto b2 = compr & 0x2222222222222222;
        const auto b4 = compr & 0x4444444444444444;
        const auto b24 = b4 & (b2 * 2);
        const unsigned c7 = __builtin_popcountll(b24);
        acc3 += __builtin_popcountll(b2) - c7;
        acc5 += __builtin_popcountll(b4) - c7;
        acc7 += c7;
    }
}
const auto prod4 = acc1 * bexp<3>(acc3) * bexp<5>(acc5) * bexp<7>(acc7);
This code packs the values even more densely than in the input array: two values per byte. The low-order bit of each value is handled differently: since we could stop after 32 zero bits are found here (with result "zero"), this case cannot alter performance very much, so it is handled by the simplest (library) algorithm.
Out of the 4 remaining values, "1" is not interesting, so we only need to count the occurrences of "3", "5", and "7" with bitwise manipulations and the "population count" intrinsic.
Here are the results:
source size: 4 Mb: 400 Mb:
1. accumulate: 0.835392 ns 0.849199 ns
2. accum * 4: 0.290373 ns 0.286915 ns
3. 2 mul in 1: 0.178556 ns 0.182606 ns
4. mult table: 0.130707 ns 0.176102 ns
5. binary exp: 0.128484 ns 0.119241 ns
6. AVX2: 0.0607049 ns 0.0683234 ns
Here we can see that accumulate library algorithm is pretty slow: for some reason gcc could not unroll the loop and use 4 independent accumulators.
It is not too difficult to do this optimization "by hand". The result is not particularly fast. But if we allocate 4 threads for this task, CPU would approximately match memory bandwidth (2 channels, DDR3-1600).
Word-level parallelism for multiplications is almost twice as fast. (We'll need only 3 threads to match memory bandwidth).
Table lookup is even faster. But its performance degrades when input array cannot fit in L3 cache. (We'll need 3 threads to match memory bandwidth).
Binary exponentiation has approximately the same speed. But with larger inputs this performance does not degrade, it even slightly improves because exponentiation itself uses less time compared to value counting. (We'll need only 2 threads to match memory bandwidth).
As expected, SIMD is the fastest. Its performance slightly degrades when input array cannot fit in L3 cache. Which means we are close to memory bandwidth with single thread.
I have one solution. Let us discuss it alongside the other solutions.
The key part of the question is how to reduce the number of multiplications: the integers are small, but the set is big.
My solution:
1. Use a small array to record how many times each number appears, and remove the number 1 from the array (you don't need to count it).
2. Find the number that appears the fewest times, say n times. Then multiply all the numbers together to get a result K, and compute K^n.
3. Remove this number (for instance, swap it with the last number of the array and reduce the array size by 1), so you won't consider it again. At the same time, reduce the occurrence counts of the other numbers by the count of the removed number.
4. Once again find the number that appears the fewest times, and do the same thing as step 2.
5. Repeat steps 2-4 until the counting is complete.
Let me use an example to show how many multiplications we need to do. Assume we have 5 numbers [1, 2, 3, 4, 5]. Number 1 appears 100 times, number 2 appears 150 times, number 3 appears 200 times, number 4 appears 300 times, and number 5 appears 400 times.
Method 1: multiply directly, or use divide and conquer. We need 100+150+200+300+400-1 = 1149 multiplications to get the result.
Method 2: compute (1^100)(2^150)(3^200)(4^300)(5^400), which needs (100-1)+(150-1)+(200-1)+(300-1)+(400-1)+4 = 1149 multiplications [same as method 1], because n^m naively takes m-1 multiplications. Plus you need time to go through all the numbers, though this time is short.
Method in this post: first, you need time to go through all the numbers, which is negligible compared to the time spent multiplying.
The real computation you are doing is:
((2*3*4*5)^150)*((3*4*5)^50)*((4*5)^100)*(5^100)
Then you only need to do 3+149+2+49+1+99+99+3 = 405 multiplications.
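A Python sketch of the method above (my own rendering; the counts are kept in a dict, and removing the least frequent value becomes a single pass over the values sorted by count):

from collections import Counter

def product_least_frequent_first(numbers):
    counts = Counter(x for x in numbers if x != 1)   # step 1: count, dropping the 1s
    values = sorted(counts, key=counts.get)          # least frequent first
    result, used = 1, 0
    for i, v in enumerate(values):
        exponent = counts[v] - used                  # remaining count of v after earlier removals
        if exponent:
            block = 1
            for w in values[i:]:                     # product of all values still present
                block *= w
            result *= block ** exponent              # steps 2-3: raise, then remove v
        used = counts[v]
    return result

assert product_least_frequent_first([2]*150 + [3]*200 + [4]*300 + [5]*400) == \
    2**150 * 3**200 * 4**300 * 5**400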

Is there a Sorting Algorithm that sorts in O(∞) permutations?

After reading this question and the various Phone Book sorting scenarios put forth in the answer, I found the concept of the BOGO sort to be quite interesting. Certainly there is no use for this type of sorting algorithm, but it did raise an interesting question in my mind -- could there be a sorting algorithm that is infinitely impossible to complete?
In other words, is there a process where one could attempt to compare and re-order a fixed set of data and can yet never achieve an actual sorted list?
This is much more of a theoretical/philosophical question than a practical one and if I was more of a mathematician I'd probably be able to prove/disprove such a possibility. Has anyone asked this question before and if so, what can be said about it?
[edit:] no deterministic process with a finite amount of state takes "O(infinity)" since the slowest it can be is to progress through all possible states. this includes sorting.
[earlier, more specific answer:]
no. for a list of size n you only have state space of size n! in which to store progress (assuming that the entire state of the sort is stored in the ordering of the elements and it really is "doing something," deterministically).
so the worst possible behaviour would cycle through all available states before terminating and take time proportional to n! (at the risk of confusing matters, there must be a single path through the state - since that is "all the state" you cannot have a process move from state X to Y, and then later from state X to Z, since that requires additional state, or is non-deterministic)
Idea 1:
function sort( int[] arr ) {
    int[] sorted = quicksort( arr ); // compare and reorder data
    while(true); // where'd this come from???
    return sorted; // return answer
}
Idea 2
How do you define O(infinity)? The formal definition of Big-O merely states that f(x) = O(g(x)) means that M*g(x) is an upper bound on f(x) for all sufficiently large x and some constant M.
Typically when you are talking about "infinity", you are talking about some sort of unbounded limit. So in this case, the only reasonable definition is to say that O(infinity) is O(a function that's larger than every function). Obviously a function that's larger than every function is an upper bound. Thus technically everything is "O(infinity)".
Idea 3
Assuming you mean theta notation (tight bound)...
If you impose the additional restrictions that the algorithm is smart (returns when it finds a sorted permutation) and that every permutation of the list must be visited in a finite amount of time, then the answer is no. There are only N! permutations of a list. The upper bound for such a sorting algorithm is then a finite sum of finite numbers, which is finite.
Your question doesn't really have much to do with sorting. An algorithm which is guaranteed never to complete would be pretty dull. Indeed, even an algorithm which might or might not ever complete would be pretty dull. Much more interesting would be an algorithm which is guaranteed to complete, eventually, but whose worst-case computation time with respect to the size of the input cannot be expressed as O(F(N)) for any function F that could itself be computed in bounded time. My hunch is that such an algorithm could be devised, but I'm not sure how.
How about this one:
Start at the first item.
Flip a coin.
If it's heads, switch it with the next item.
If it's tails, don't switch them.
If list is sorted, stop.
If not, move onto the next pair ...
It's a sorting algorithm -- the kind a monkey might do. Is there any guarantee that you'll arrive at a sorted list? I don't think so!
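A sketch of those steps (my own rendering; nothing guarantees it finishes within any fixed number of coin flips):

import random

def monkey_sort(a):
    def is_sorted():
        return all(a[j] <= a[j + 1] for j in range(len(a) - 1))
    while not is_sorted():
        for i in range(len(a) - 1):
            if random.random() < 0.5:                 # heads: swap with the next item
                a[i], a[i + 1] = a[i + 1], a[i]
            if is_sorted():                           # stop as soon as the list is sorted
                return a
    return a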
Yes -
SortNumbers(collectionOfNumbers)
{
    If IsSorted(collectionOfNumbers) {
        reverse(collectionOfNumbers(1:end/2))
    }
    return SortNumbers(collectionOfNumbers)
}
Input: A[1..n] : n unique integers in arbitrary order
Output: A'[1..n] : reordering of the elements of A
such that A'[i] R(A') A'[j] if i < j.
Comparator: a R(A') b iff A'[i] = a, A'[j] = b and i > j
More generally, make the comparator something that's either (a) impossible to reconcile with the output specification, so that no solution can exist, or (b) uncomputable (e.g., sort these (input, turing machine) pairs in order of the number of steps needed for the machine to halt on the input).
Even more generally, if you have a procedure that fails to halt on a valid input, the procedure is not an algorithm which solves the problem on that input/output domain... which means you don't have an algorithm at all, or that what you have is only an algorithm if you appropriately restrict the domain.
Let's suppose that you have a random coin flipper, infinite arithmetic, and infinite rationals. Then the answer is yes. You can write a sorting algorithm which has 100% chance of successfully sorting your data (so it really is a sorting function), but which on average will take infinite time to do so.
Here is an emulation of this in Python.
# We'll pretend that these are true random numbers.
import random
import fractions

def flip():
    return 0.5 < random.random()

# This tests whether a number is less than an infinite precision number in the range
# [0, 1]. It has a 100% probability of returning an answer.
def number_less_than_rand(x):
    high = fractions.Fraction(1, 1)
    low = fractions.Fraction(0, 1)
    while low < x and x < high:
        if flip():
            low = (low + high) / 2
        else:
            high = (low + high) / 2
    return high < x

def slow_sort(some_array):
    n = fractions.Fraction(100, 1)
    # This loop has a 100% chance of finishing, but its average time to complete
    # is also infinite. If you haven't studied infinite series and products, you'll
    # just have to take this on faith. Otherwise proving that is a fun exercise.
    while not number_less_than_rand(1 / n):
        n += 1
        print(n)
    some_array.sort()
