I'm testing a sorting algorithm and tried it with different amounts of data:
100 thousand elements
1 million elements
up to 10 million elements.
I need to estimate the complexity of this algorithm from measurements of how long each sort took.
How can I do that?
While you can't determine the running time of an algorithm without doing mathematical analysis, empirical measurements can give you a reasonable idea of how the running time of the algorithm (or rather the program) behaves.
For example, if you have n measurements (x1, y1), (x2, y2), ..., (xn, yn), where xi is the size of the input and yi is the time of the program on an input of that size, then you can plot these points to see whether the running time looks polynomial. In practice it often does. However, it's hard to see what the exponent should be from the plot.
To find the exponent you could find the slope of the line that best fits the points (log xi, log yi). This is because if y=C*x^k+lower order terms, then since the term C*x^k dominates we expect log y =~ k*log x + log C, i.e., the log-log equation is a linear one whenever the "original" equation is a polynomial one. (Whenever you see a linear function in the log-log plot, your running time is polynomial; the slope of the line tells you the degree of the polynomial.)
Here's a plot of the quadratic function y(x)=x^2:
And here's the corresponding log-log plot:
We can see that it's a line with slope 2 (in practice you would compute this using, for example, linear least squares). This is expected because log y(x) = 2 * log(x).
The code I used:
x = 1:1:100;            % input sizes
y = x.^2;               % the quadratic "running time" y(x) = x^2
plot(x, y);             % ordinary plot
figure;                 % open a new figure so both plots are kept
plot(log(x), log(y));   % log-log plot: a straight line with slope 2
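To get the exponent from real measurements rather than by eye, fit a line to the (log x, log y) points. Here is a minimal sketch in Python with NumPy (the timing numbers are made up for illustration):

import numpy as np

# Hypothetical measurements: input sizes and running times in seconds.
sizes = np.array([1e5, 1e6, 1e7])
times = np.array([0.05, 0.6, 7.1])

# Least-squares line through (log x, log y): the slope estimates the
# exponent k in y ~ C * x^k, and the intercept estimates log C.
slope, intercept = np.polyfit(np.log(sizes), np.log(times), 1)
print("estimated exponent k:", slope)
print("estimated constant C:", np.exp(intercept))

On this made-up data the fitted exponent comes out close to 1, which you would read as roughly linear (or n log n) growth.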
In practice the function looks messier and the slope can (or should) only be used as a rule of thumb when nothing else is available.
I imagine there are many other tricks to learn about program behavior from running time measurements. I'll give others a chance to share their experience.
I would like to know if there is any method to sample from the Geometric distribution in constant time, without using log, which can be hard to approximate. Thanks.
Without relying on logarithms, there is no algorithm to sample from a geometric(p) distribution in constant expected time. Rather, on a realistic computing model, such an algorithm's expected running time must grow at least as fast as 1 + log(1/p)/w, where w is the word size of the computer in bits (Bringmann and Friedrich 2013). The following algorithm, which is equivalent to one in the Bringmann paper, generates a geometric(px/py) random number without relying on logarithms, and when px/py is very small, the algorithm is considerably faster than the trivial algorithm of generating trials until a success:
Set pn to px, k to 0, and d to 0.
While pn*2 <= py, add 1 to k and multiply pn by 2.
With probability (1 - px/py)^(2^k), add 1 to d and repeat this step.
Generate a uniform random integer in [0, 2^k), call it m, then with probability (1 - px/py)^m, return d*2^k + m. Otherwise, repeat this step.
(The actual algorithm described in the Bringmann paper is in fact much more involved than this; see my note "On a Geometric Sampler".)
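Here is a minimal Python sketch of the steps above (my own transcription; it evaluates the "with probability (1 - px/py)^..." tests with floating-point powers, which the exact algorithm avoids):

import random

def geometric(px, py):
    # Number of failures before the first success for a geometric(px/py)
    # variate (0 < px <= py), following the block decomposition above.
    p = px / py
    pn, k = px, 0
    while pn * 2 <= py:        # find the largest k with px * 2^k <= py
        k += 1
        pn *= 2
    block = 1 << k             # block size 2^k
    d = 0                      # count whole blocks of 2^k consecutive failures
    while random.random() < (1.0 - p) ** block:
        d += 1
    while True:                # offset within the final block, by rejection
        m = random.randrange(block)
        if random.random() < (1.0 - p) ** m:
            return d * block + m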
REFERENCES:
Bringmann, K., and Friedrich, T., 2013, July. Exact and efficient generation of geometric random variates and random graphs, in International Colloquium on Automata, Languages, and Programming (pp. 267-278).
I made a program for computing the triple integral of a function f(x,y,z) over a general region, but it's not working for some reason.
Here's an excerpt of the program, for when the order of integration is dz dy dx:
(B-A)/N→D
0→V
Dsum(seq(fnInt(Y₅,Y,Y₉,Y₀),X,A+.5D,B,D))→V
For(K,1,P)
A+(B-A)rand→X
Y₉+(Y₀-Y₉)rand→Y
Y₇+(Y₈-Y₇)rand→Z
Y₆→ʟW(K)
End
Vmean(ʟW)→V
The variables used are explained below:
Y₆: Equation of f(x,y,z)
Y₇,Y₈: Lower and upper bounds of the innermost integral (dz)
Y₉,Y₀: Lower and upper bounds of the middle integral (dy)
A,B: Lower and upper bounds of the outermost integral (dx)
Y₅: Y₈-Y₇
N: Number of Δx intervals
D: Size of Δx interval
P: Number of points on D to guess the average value of f(x,y,z)
ʟW: List of various values of f(x,y,z)
V: Volume of the region of integration, then of the entire triple integral
So here's how I'm approaching it:
I first find the volume of just the region of integration using Dsum(seq(fnInt(Y₅,Y,Y₉,Y₀),X,A+.5D,B,D)). Then I pick a bunch of random (x,y,z) points in that region, and I plug those points into f(x,y,z) to generate a long list of various values for w = f(x,y,z). I then take the average of those w-values, which should give me a pretty good estimate for the average "height" of the 4D solid that is the triple integral; and by multiplying the region of integration "base" with the average w-value "height" (Vmean(ʟW)), it should give me a good estimate for the hypervolume of the triple integral.
It should naturally follow that as the number of (x,y,z) points tested increases, the value of the triple integral should more or less converge to the actual value.
For some reason, it doesn't. For some integrals it works fantastically; for others it misses by a long shot. A good example of this is ∫[0, 2] ∫[0, 2-x] ∫[0, 2-x-y] 2x dz dy dx. The correct answer is 4/3, or 1.333..., but the program converges to a completely different number: 2.67, give or take.
Why is it doing this? Why is the triple integral converging to a wrong number?
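For reference, here is a Python transliteration of the approach described above, so the behaviour can be reproduced outside the calculator (the function names, step counts, and use of Python's random module are mine):

import random

def triple_estimate(f, A, B, ylo, yhi, zlo, zhi, N=100, P=100000):
    # Volume of the region of integration: midpoint rule in x, with the
    # inner integral of (zhi - zlo) dy also done by a midpoint rule.
    def inner(x, M=200):
        a, b = ylo(x), yhi(x)
        h = (b - a) / M
        return h * sum(zhi(x, a + (j + 0.5) * h) - zlo(x, a + (j + 0.5) * h)
                       for j in range(M))
    D = (B - A) / N
    V = D * sum(inner(A + (i + 0.5) * D) for i in range(N))
    # Average f over random points chosen as in the program: x uniform in
    # [A,B], then y uniform in [ylo(x),yhi(x)], then z in [zlo(x,y),zhi(x,y)].
    total = 0.0
    for _ in range(P):
        x = random.uniform(A, B)
        y = random.uniform(ylo(x), yhi(x))
        z = random.uniform(zlo(x, y), zhi(x, y))
        total += f(x, y, z)
    return V * total / P

# The example above: 2x over 0 <= x <= 2, 0 <= y <= 2-x, 0 <= z <= 2-x-y.
print(triple_estimate(lambda x, y, z: 2 * x,
                      0, 2,
                      lambda x: 0, lambda x: 2 - x,
                      lambda x, y: 0, lambda x, y: 2 - x - y))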
EDIT: My guess is—assuming I didn't make any mistakes, for which there are no promises—that the RNG algorithm used by the calculator can only generate numbers slightly greater than 0 and is throwing the program off, but I have no way to confirm this, nor to account for it since "slightly greater than 0" isn't quantified.
I'm looking at listing/counting the number of integer points in R^N (in the sense of Euclidean space), within certain geometric shapes, such as circles and ellipses, subject to various conditions, for small N. By this I mean that N < 5, and the conditions are polynomial inequalities.
As a concrete example, take R^2. One of the queries I might like to run is "How many integer points are there in an ellipse (parameterised by x = 4 cos(theta), y = 3 sin(theta) ), such that y * x^2 - x * y = 4?"
I could implement this in Haskell like this:
ghci> let latticePoints = [(x,y) | x <- [-4..4], y <-[-3..3], 9*x^2 + 16*y^2 <= 144, y*x^2 - x*y == 4]
and then I would have:
ghci> latticePoints
[(-1,2),(2,2)]
Which indeed answers my question.
Of course, this is a very naive implementation, but it demonstrates what I'm trying to achieve. (I'm also only using Haskell here as I feel it most directly expresses the underlying mathematical ideas.)
Now, if I had something like "In R^5, how many integer points are there in a 4-sphere of radius 1,000,000, satisfying x^3 - y + z = 20?", I might try something like this:
ghci> :{
Prelude| let latticePoints2 = [(x,y,z,w,v) | x <-[-1000..1000], y <- [-1000..1000],
Prelude| z <- [-1000..1000], w <- [-1000..1000], v <- [-1000..1000],
Prelude| x^2 + y^2 + z^2 + w^2 + v^2 <= 1000000, x^3 - y + z == 20]
Prelude| :}
so if I now type:
ghci> latticePoints2
Not much will happen...
I imagine the issue is that it's effectively looping through 2000^5 (32 quadrillion!) points, and it's clearly unreasonable of me to expect my computer to deal with that. I can't imagine a similar implementation in Python or C would help matters much either.
So if I want to tackle a large number of points in this way, what would be my best bet in terms of general algorithms or data structures? I saw in another thread (Count number of points inside a circle fast) someone mention quadtrees as well as k-d trees, but I wouldn't know how to implement those, nor how to appropriately query one once it was implemented.
I'm aware some of these numbers are quite large, but the biggest circles, ellipses, etc. I'd be dealing with are of radius 10^12 (one trillion), and I certainly wouldn't need to deal with R^N for N > 5. If the above is NOT possible, I'd be interested to know what sort of numbers WOULD be feasible.
There is no general way to solve this problem. The problem of finding integer solutions to algebraic equations (equations of this sort are called Diophantine equations) is known to be undecidable. Apparently, you can write equations of this sort such that solving the equations ends up being equivalent to deciding whether a given Turing machine will halt on a given input.
In the examples you've listed, you've always constrained the points to be on some well-behaved shape, like an ellipse or a sphere. While this particular class of problem is definitely decidable, I'm skeptical that you can efficiently solve these problems for more complex curves. I suspect that it would be possible to construct short formulas that describe curves that are mostly empty but have a huge bounding box.
If you happen to know more about the structure of the problems you're trying to solve - for example, if you're always dealing with spheres or ellipses - then you may be able to find fast algorithms for this problem. In general, though, I don't think you'll be able to do much better than brute force. I'm willing to admit that (and in fact, hopeful that) someone will prove me wrong about this, though.
The idea behind the kd-tree method is that you recursively subdivide the search box and try to rule out whole boxes at a time. Given the current box, use some method that either (a) declares that all points in the box match the predicate, (b) declares that no points in the box match the predicate, or (c) makes no declaration (one possibility, which may be particularly convenient in Haskell: interval arithmetic). On (c), cut the box in half (say along the longest dimension) and recursively count in the halves. Obviously the method can choose (c) all the time, which devolves to brute force; the goal here is to do (a) or (b) as much as possible.
The performance of this method is very dependent on how it's instantiated. Try it -- it shouldn't be more than a couple dozen lines of code.
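As a concrete starting point, here is a minimal Python sketch of that recursion for the simplest predicate, x^2 + y^2 <= r^2, where the minimum and maximum of the left-hand side over a box are easy to bound by hand (for more general predicates this bounding step is exactly where interval arithmetic would go):

def count_in_disc(r2, xlo, xhi, ylo, yhi):
    # Count integer points (x, y) with xlo <= x <= xhi, ylo <= y <= yhi
    # and x*x + y*y <= r2, ruling whole boxes in or out where possible.
    if xlo > xhi or ylo > yhi:
        return 0
    def nearest(lo, hi):       # |coordinate| of the box closest to 0
        return 0 if lo <= 0 <= hi else min(abs(lo), abs(hi))
    def farthest(lo, hi):      # |coordinate| of the box farthest from 0
        return max(abs(lo), abs(hi))
    lo_val = nearest(xlo, xhi) ** 2 + nearest(ylo, yhi) ** 2
    hi_val = farthest(xlo, xhi) ** 2 + farthest(ylo, yhi) ** 2
    if hi_val <= r2:           # (a) every point in the box matches
        return (xhi - xlo + 1) * (yhi - ylo + 1)
    if lo_val > r2:            # (b) no point in the box matches
        return 0
    # (c) undecided: split along the longer side and recurse.
    if xhi - xlo >= yhi - ylo:
        xm = (xlo + xhi) // 2
        return (count_in_disc(r2, xlo, xm, ylo, yhi) +
                count_in_disc(r2, xm + 1, xhi, ylo, yhi))
    ym = (ylo + yhi) // 2
    return (count_in_disc(r2, xlo, xhi, ylo, ym) +
            count_in_disc(r2, xlo, xhi, ym + 1, yhi))

The work is then proportional to the number of boxes that straddle the boundary rather than to the area of the region.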
For a nicely connected region, assuming your shape is significantly smaller than the containing search space, and given a seed point, you could use a growth/building algorithm:
Given a seed point:
    Push seed point into test-queue
    while test-queue has items:
        Pop item from test-queue
        If item tests to be within region (eg using a callback function):
            Add item to inside-set
            for each neighbour point (generated on the fly):
                if neighbour not in outside-set and neighbour not in inside-set:
                    Add neighbour to test-queue
        else:
            Add item to outside-set
    return inside-set
The trick is to find an initial seed point that lies inside the region.
Make sure your set implementation gives O(1) duplicate checking. This method will eventually break down with large numbers of dimensions, as the surface area of the region comes to dominate its volume, but for 5 dimensions it should be fine.
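A minimal Python version of this growth algorithm, using a single seen set for the O(1) duplicate check (points that fail the test are simply discarded; the seen set keeps them from being re-queued):

from collections import deque

def grow_region(seed, inside, neighbours):
    # Flood-fill from seed: inside is the membership callback and
    # neighbours generates adjacent lattice points on the fly.
    inside_set = set()
    seen = {seed}
    queue = deque([seed])
    while queue:
        p = queue.popleft()
        if inside(p):
            inside_set.add(p)
            for q in neighbours(p):
                if q not in seen:
                    seen.add(q)
                    queue.append(q)
    return inside_set

# Example: the ellipse from the question, 9x^2 + 16y^2 <= 144, seeded at the origin.
pts = grow_region((0, 0),
                  lambda p: 9 * p[0] ** 2 + 16 * p[1] ** 2 <= 144,
                  lambda p: [(p[0] + dx, p[1] + dy)
                             for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))])
print(len(pts))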
For an N×M matrix with integer values, what is the most efficient way to find the minimum element in a region (x1,y1)-(x2,y2), where 0 <= x1 <= x2 < M and 0 <= y1 <= y2 < N?
We can assume that we will query different regions numerous times.
I am wondering if we can extend range minimum query (RMQ) methods to this question.
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=lowestCommonAncestor
A pretty straightforward solution would be to use the most efficient RMQ structure (a segment tree) and apply it row- or column-wise.
The worst-case query complexity would be O(min(N,M) * log(max(N,M))).
But I still believe we can do better than that.
It depends on what you mean by "the most efficient way". It is possible to minimize query time itself, or preprocessing time, or memory requirements.
If only query time should be minimized, "the most efficient way" is to pre-compute all possible regions. Then each query is handled by returning some pre-computed value. Query time is O(1). Both memory and preprocessing time are huge: O((NM)^2).
More practical is to use the Sparse Table algorithm from the page referenced in the OP. This algorithm prepares a table of all power-of-two length intervals and uses a pair of these intervals to handle any range-minimum query. Query time is O(1). Memory and preprocessing time are O(N log N). And this algorithm can easily be extended to the two-dimensional case.
Just prepare a table of all power-of-two length and power-of-two height rectangles and use four of these rectangles to handle any range-minimum-query. The result is just a minimum of four minimum values for each of these rectangles. Query time is O(1). Memory and preprocessing time are O(NM*log(N)*log(M)).
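Here is a minimal Python sketch of this two-dimensional Sparse Table (my own transcription, 0-indexed with inclusive coordinates; st[i][j][r][c] holds the minimum of the 2^i-by-2^j block whose top-left cell is (r, c)):

def build(a):
    n, m = len(a), len(a[0])
    K, L = n.bit_length(), m.bit_length()
    INF = float("inf")
    st = [[[[INF] * m for _ in range(n)] for _ in range(L)] for _ in range(K)]
    for r in range(n):
        for c in range(m):
            st[0][0][r][c] = a[r][c]
    for j in range(1, L):                      # widen blocks along columns
        for r in range(n):
            for c in range(m - (1 << j) + 1):
                st[0][j][r][c] = min(st[0][j - 1][r][c],
                                     st[0][j - 1][r][c + (1 << (j - 1))])
    for i in range(1, K):                      # then grow blocks along rows
        for j in range(L):
            for r in range(n - (1 << i) + 1):
                for c in range(m - (1 << j) + 1):
                    st[i][j][r][c] = min(st[i - 1][j][r][c],
                                         st[i - 1][j][r + (1 << (i - 1))][c])
    return st

def query_min(st, y1, x1, y2, x2):
    # Minimum over rows y1..y2 and columns x1..x2: four lookups, O(1).
    i = (y2 - y1 + 1).bit_length() - 1
    j = (x2 - x1 + 1).bit_length() - 1
    return min(st[i][j][y1][x1],
               st[i][j][y2 - (1 << i) + 1][x1],
               st[i][j][y1][x2 - (1 << j) + 1],
               st[i][j][y2 - (1 << i) + 1][x2 - (1 << j) + 1])

The table has K*L layers of size N*M, matching the O(NM*log(N)*log(M)) memory and preprocessing bounds above.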
This paper: "Two-Dimensional Range Minimum Queries" by Amihood Amir, Johannes Fischer, and Moshe Lewenstein suggests how to decrease memory requirements and preprocessing time for this algorithm to almost O(MN).
This paper: "Data Structures for Range Minimum Queries in Multidimensional Arrays" by Hao Yuan and Mikhail J. Atallah gives different algorithm with O(1) query time and O(NM) memory and preprocessing time.
Given no other information about the contents of the matrix or about the way it's stored, it's impossible to make any suggestion besides just scanning every entry in the given region. That's O((x2-x1) * (y2-y1)). Your question is too vague to state anything else.
You could perhaps do better (probabilistically, in the average case) if you knew something else about the matrix, for example if you know that the elements are probably sorted in some way.
Pseudocode:
function getMax(M, x1, x2, y1, y2)
    max = M[x1, y1]
    for x = x1 to x2 do
        for y = y1 to y2 do
            if M[x, y] > max then max = M[x, y]
    return max
This is O(n) in the input size, which can only reasonably be interpreted as the size of the matrix region, (x2 - x1) * (y2 - y1). If you want the minimum, change max to min and > to <. You cannot do better than O(n), i.e., checking each of the possible elements. Assume you had an algorithm that was faster than O(n); then it doesn't check all elements. To get a failing case for the algorithm, take one of the elements it doesn't check, replace it with (max + 1), and re-run the algorithm.
Some time ago I was pretty interested in GAs and I studied them quite a bit. I used the C++ library GAlib to write some programs and I was quite amazed by their ability to solve otherwise difficult-to-compute problems in a matter of seconds. They seemed like a great brute-forcing technique that works really smart and adapts.
I was reading a book by Michalewitz, if I remember the name correctly, and it all seemed to be based on the Schema Theorem, proved by MIT.
I've also heard that they cannot really be used to approach problems like factoring RSA private keys.
Could anybody explain why this is the case?
Genetic algorithms are not smart at all; they are very greedy optimization algorithms. They all work around the same idea: you have a group of points (a 'population of individuals'), and you transform that group into another one with stochastic operators, with a bias in the direction of best improvement ('mutation + crossover + selection'). Repeat until it converges or you are tired of it; nothing smart there.
For a genetic algorithm to work, a new population of points should perform close to the previous population of points. A small perturbation should create a small change. If, after a small perturbation of a point, you obtain a point that represents a solution with completely different performance, then the algorithm is nothing better than random search, which is usually not a good optimization algorithm. In the RSA case, if your points are directly the numbers, it's either YES or NO just by flipping a bit... Thus a genetic algorithm is no better than random search if you represent the RSA problem without much thinking, as in "let's encode the search points as the bits of the numbers".
I would say it is because factorisation of keys is not an optimisation problem but an exact problem. This distinction is not very precise, so here are the details.
Genetic algorithms are great for solving problems where there are minima (local/global) to home in on, but there aren't any in the factorisation problem. Genetic algorithms, like DCA or simulated annealing, need a measure of "how close I am to the solution", but you can't define one for our problem.
For an example of a problem where genetic approaches do well, there is the hill-climbing problem.
GAs are based on fitness evaluation of candidate solutions.
You basically have a fitness function that takes in a candidate solution as input and gives you back a scalar telling you how good that candidate is. You then go on and allow the best individuals of a given generation to mate with higher probability than the rest, so that the offspring will be (hopefully) more 'fit' overall, and so on.
There is no way to evaluate fitness (how good is a candidate solution compared to the rest) in the RSA factorization scenario, so that's why you can't use them.
GAs are not brute force; they're just a search algorithm. Each GA essentially looks like this:
candidates = seed_value;
while (!good_enough(best_of(candidates))) {
    candidates = compute_next_generation(candidates);
}
Where good_enough and best_of are defined in terms of a fitness function. A fitness function says how well a given candidate solves the problem. That seems to be the core issue here: how would you write a fitness function for factorization? For example 20 = 2*10 or 4*5. The tuples (2,10) and (4,5) are clearly winners, but what about the others? How “fit” is (1,9) or (3,4)?
Indirectly, you can use a genetic algorithm to factor an integer N. Dixon's integer factorization method uses equations involving powers of the first k primes, modulo N. These products of powers of small primes are called "smooth". If we are using the first k=4 primes - {2,3,5,7} - 42=2x3x7 is smooth and 11 is not (for lack of a better term, 11 is "rough"). Dixon's method requires an invertible k x k matrix consisting of the exponents that define these smooth numbers. For more on Dixon's method see https://en.wikipedia.org/wiki/Dixon%27s_factorization_method.
Now, back to the original question: There is a genetic algorithm for finding equations for Dixon's method.
Let r be the inverse of a smooth number mod N - so r is a rough number
Let s be smooth
Generate random solutions of rx = sy mod N. These solutions [x,y] are the population for the genetic algorithm. Each x, y has a smooth component and a rough component. For example suppose x = 369 = 9 x 41. Then (assuming 41 is not small enough to count as smooth), the rough part of x is 41 and the smooth part is 9.
Choose pairs of solutions - "parents" - to combine into linear combinations with ever smaller rough parts.
The algorithm terminates when a pair [x,y] is found with rough parts [1,1], [1,-1],[-1,1] or [-1,-1]. This yields an equation for Dixon's method, because rx=sy mod N and r is the only rough number left: x and y are smooth, and s started off smooth. But even 1/r mod N is smooth, so it's all smooth!
Every time you combine two pairs - say [v,w] and [x,y] - the smooth parts of the four numbers are obliterated, except for the factors the smooth parts of v and x share, and the factors the smooth parts of w and y share. So we choose parents that share smooth parts to the greatest possible extent. To make this precise, write
g = gcd(smooth part of v, smooth part of x)
h = gcd(smooth part of w, smooth part of y)
[v,w], [x,y] = [g v/g, h w/h], [g x/g, h y/h].
The hard-won smooth factors g and h will be preserved into the next generation, but the smooth parts of v/g, w/h, x/g and y/h will be sacrificed in order to combine [v,w] and [x,y]. So we choose parents for which v/g, w/h, x/g and y/h have the smallest smooth parts. In this way we really do drive down the rough parts of our solutions to rx = sy mod N from one generation to the next.
On further thought the best way to make your way towards smooth coefficients x, y in the lattice ax = by mod N is with regression, not a genetic algorithm.
Two regressions are performed, one with response vector R0 consisting of x-values from randomly chosen solutions of ax = by mod N; and the other with response vector R1 consisting of y-values from the same solutions. Both regressions use the same explanatory matrix X. In X are columns consisting of the remainders of the x-values modulo smooth divisors, and other columns consisting of the remainders of the y-values modulo other smooth divisors.
The best choice of smooth divisors is the one that minimizes the errors from each regression:
E0 = R0 - X (X^T X)^(-1) X^T R0
E1 = R1 - X (X^T X)^(-1) X^T R1
What follows is row operations to annihilate X. Then apply a result z of these row operations to the x- and y-values from the original solutions from which X was formed.
z R0 = z R0 - 0
     = z R0 - z X (X^T X)^(-1) X^T R0
     = z E0
Similarly, z R1 = z E1
Three properties are now combined in z R0 and z R1:
They are multiples of large smooth numbers, because z annihilates remainders modulo smooth numbers.
They are relatively small, since E0 and E1 are small.
Like any linear combination of solutions to ax = by mod N, z R0 and z R1 are themselves solutions to that equation.
A relatively small multiple of a large smooth number might just be the smooth number itself. Having a smooth solution of ax = by mod N yields an input to Dixon's method.
Two optimizations make this particularly fast:
There is no need to guess all the smooth numbers and columns of X at once. You can run regressions continuously, adding one column to X at a time, choosing columns that reduce E0 and E1 the most. At no time will any two smooth numbers with a common factor be selected.
You can also start with a lot of random solutions of ax = by mod N, and remove the ones with the largest errors between selections of new columns for X.