What is the complexity of the following grammar / algorithm?

I'm developing a generalized parsing algorithm and I'm testing it with the following rule:
S ::= a | SS
The algorithm shows me all parse trees generated for a string composed of n a's.
For example, the following table shows the time used by the algorithm as a function of the number of a's:
length   trees     time (ms)
1        1         1
2        1         1
3        2         2
4        5         2
5        14        2
6        42        2
7        132       5
8        429       13
9        1430      28
10       4862      75
11       16796     225
12       58786     471
13       208012    1877
14       742900    10206
I don't know the Big-O complexity of my algorithm. How can I measure it? Of course, the time depends on three things:
The length of the string to parse
The grammar complexity
The performance of the algorithm

S can match any string of all a's.
Any binary tree with n leaf nodes could be a parse tree, and the number of such trees is given by the Catalan numbers.
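As a quick check (a sketch in Python; the helper name catalan is mine), the trees column of your table is exactly the Catalan sequence C(n-1):

from math import comb

def catalan(k):
    # k-th Catalan number: C(2k, k) / (k + 1)
    return comb(2 * k, k) // (k + 1)

print([catalan(n - 1) for n in range(1, 15)])
# [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900]

Since the Catalan numbers grow roughly like 4^n / n^(3/2), any algorithm that enumerates all parse trees is at least exponential in n, which matches the rapid growth of the measured times.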

Big-O isn't a matter of measuring run-times; that's profiling. Big-O is a matter of algorithm analysis, which is impossible without seeing the algorithm.
Broadly speaking, break the algorithm down into basic operations, loops and recursive calls. The basic operations have a defined timing (generally, O(1)). The timing of loops is the number of iterations times the timing of the loop body. Recursion is trickier: you have to define the timing in terms of a recurrence relation, then solve for an explicit solution. Looking at the process call tree can offer hints as to what the explicit solution might be.
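As a small illustration (not your algorithm): a procedure that does c*n work and then calls itself twice on halves of the input gives the recurrence T(n) = 2*T(n/2) + c*n, whose explicit solution is T(n) = O(n log n); the call tree has about log n levels, each doing roughly c*n total work.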

We can't tell you the complexity either, because you didn't post the algorithm. But there is obviously a chance that you have an implementation that blows up pretty badly. The problem is not necessarily in the algorithm itself, though, but in the grammar. A suitable preprocessor for the grammar could rewrite it to the more natural form
S ::= a | a S
which would be much more efficient to handle.
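For intuition, here is a minimal sketch (in Python, and not your algorithm) of a recursive parser for the rewritten grammar S ::= a | a S; the grammar is unambiguous, so on a string of n a's there is exactly one parse and each position is visited a constant number of times, giving linear time:

def parse_S(s, i=0):
    # Parse S ::= 'a' | 'a' S starting at position i; return the end position or None.
    if i < len(s) and s[i] == 'a':
        rest = parse_S(s, i + 1)   # try the 'a' S alternative first
        if rest is not None:
            return rest
        return i + 1               # fall back to the single 'a' alternative
    return None

print(parse_S('aaaa') == 4)        # True: the whole string is one S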

Related

A question regarding the tower of hanoi recursive algorithm time complexity

I did a coding exercise today. After finishing the examination, I checked the results and came across a problem whose statement is as follows:
Given 4 disks in the tower of Hanoi problem, the recursive algorithm calls the same function at most ___ times.
A. 10
B. 16
C. 22
D. 31
The only thing I know is that I selected B. 16 and I was wrong.
I searched on the internet and found that it should be 2^n - 1 times, i.e. 15 times.
However, that is not among the options.
Which option is correct?
Any advice will be appreciated.
Thank you.
The 4-disk puzzle takes 15 moves. The number of recursive calls, though, depends on how it's implemented.
If your recursive base case is 1 disk => 1 move, then it's 15. If your recursive base case is 0 disks => 0 moves, then it's 31.
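A small sketch (in Python; the call-counting approach is mine) that makes the two conventions concrete:

def hanoi(n, src, aux, dst, calls):
    calls[0] += 1                      # count this call
    if n == 0:
        return                         # base case: 0 disks -> 0 moves
    hanoi(n - 1, src, dst, aux, calls)
    # move disk n from src to dst
    hanoi(n - 1, aux, src, dst, calls)

calls = [0]
hanoi(4, 'A', 'B', 'C', calls)
print(calls[0])   # 31 with the 0-disk base case; stopping at n == 1 instead gives 15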

Optimal Search Tree: Calculate the cost of the search tree and show that it is not optimal

Consider the following binary search tree, along with the following frequencies of lookups:
     13
    /  \
  11    26
 /  \     \
1    12    28

Key       | 13 | 11 | 26 |  1 | 12 | 28
Frequency | 26 |  5 | 25 |  1 |  3 | 15
I was given this question:
Calculate the cost of the search tree above for the given search
frequencies and show that the search tree isn't optimal for the given
search frequencies.
I calculated the cost, but my teacher says I did so incorrectly, and did not explain why.
To calculate the cost, we check the level of each node. 13 is at level 1 and the frequency of 13 is 26, so it contributes 26*1 = 26.
The nodes 11 and 26 are at level 2, and the nodes 1, 12 and 28 are at level 3.
In the end the cost is: 26*1 + 5*2 + 25*2 + 1*3 + 3*3 + 15*3. My teacher says that this calculation is incorrect, but didn't explain why.
Also, how do you show that a tree isn't optimal? Here is the definition I have from our lecture notes:
Let K be a set of keys and R a workload. A search tree T over K is optimal for R iff P(T) = min{P(T') | T' is a search tree for K}.
@templatetypedef Thank you very much for taking the time to help! Your answer is very helpful and I understood many things from it. Here is a tree I found that is better than the tree from the task:
        26
       /  \
     13    28
     /
   11
   / \
  1   12
The tree above (from the task) has a cost of 143 and this one has a cost of 138, so this one is indeed better and the task is solved :)
Fundamentally, you're approaching the question of calculating the total lookup time in a BST correctly. You're taking each node in the tree, using the depth to determine the number of comparisons necessary to perform a lookup that ends at that node, multiplying those values by the number of lookups, and summing the results. I didn't meticulously double-check your exact calculations and so it's possible that you missed something, though.
Your second question was about determining whether a binary search tree is optimal for a given set of lookups. You've given the rigorous mathematical definition, but in this case I think it might be a bit easier to explain this at a higher level.
The calculation you did earlier here is a way of starting with a BST and information about what lookups will be performed, then computing a number corresponding to the number of comparisons that will end up being made in the course of performing those lookups. That number essentially tells you how fast those lookups are going to be - higher numbers mean that the lookups take longer, and lower numbers mean that the lookups will take less time.
Now, imagine that you want to make a BST that will take the least total amount of time to perform the lookups in question. In other words, you want the "best" BST for the given set of keys and lookup frequencies. That BST would be one with the lowest total lookup cost, where that cost is calculated using the approach you worked through earlier on. The terminology for a BST with that property - that it has the best lookup speed for those frequencies among all the possible BSTs that you can make - is an optimal BST.
The question here is to show that the tree you have isn't optimal. That means that you need to show that this isn't the best possible tree you can make. One way to do this would be to find an even better tree. So can you find another BST with the same keys where the total lookup time is lower than the one you were given?
Good luck!
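If it helps, here is a short sketch (in Python; the dictionaries of depths are simply read off the two trees shown above) that reproduces the two costs:

freqs = {13: 26, 11: 5, 26: 25, 1: 1, 12: 3, 28: 15}

# depth of each key, counting the root as level 1
original_tree = {13: 1, 11: 2, 26: 2, 1: 3, 12: 3, 28: 3}
improved_tree = {26: 1, 13: 2, 28: 2, 11: 3, 1: 4, 12: 4}

def lookup_cost(depths):
    return sum(freqs[key] * depth for key, depth in depths.items())

print(lookup_cost(original_tree))   # 143
print(lookup_cost(improved_tree))   # 138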

Finding the constant c in the time complexity of certain algorithms

I need help finding and approximating the constant c in the complexity of insertion sort (c*n^2) and merge sort (c*n*lg n) by inspecting their measured running times.
A bit of background, my purpose was to "implement insertion sort and merge sort (decreasing order) algorithms and measure the performance of these two algorithms. For each algorithm, and for each n = 100, 200, 300, 400, 500, 1000, 2000, 4000, measure its running time when the input is
already sorted, i.e. n, n-1, …, 3, 2,1;
reversely sorted 1, 2, 3, … n;
random permutation of 1, 2, …, n.
The running time should exclude the time for initialization."
I have done the code for both algorithms and put the measurements (microseconds) in a spreadsheet. Now, I'm not sure how to find this c due to differing values for each condition of each algorithm.
For reference, the time table (times in microseconds; AS = already sorted, RS = reverse sorted):
         InsertionSort               MergeSort
n        AS     RS      Random       AS     RS     Random
100      12     419     231          192    191    211
200      13     2559    1398         1303   1299   1263
300      20     236     94           113    113    123
400      25     436     293          536    641    556
500      32     504     246          91     81     105
1000     65     1991    995          169    246    214
2000     9      8186    4003         361    370    454
4000     17     31777   15797        774    751    952
I can provide the code if necessary.
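For reference, a minimal timing sketch (in Python, not the OP's code) for the random-permutation case of insertion sort; the other input classes and merge sort would be measured the same way:

import random
import time

def insertion_sort_desc(a):
    # in-place insertion sort into decreasing order
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] < key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

for n in (100, 200, 300, 400, 500, 1000, 2000, 4000):
    data = random.sample(range(1, n + 1), n)      # random permutation of 1..n
    start = time.perf_counter()                   # initialization is excluded
    insertion_sort_desc(data)
    elapsed_us = (time.perf_counter() - start) * 1e6
    print(n, round(elapsed_us))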
It's hardly possible to determine the values of these constants, especially on modern processors that use caches, pipelines, and other performance features.
Of course, you can try to find an approximation, and for that you'll need Excel or any other spreadsheet.
Enter your data, create a chart, and then add a trendline. The spreadsheet will calculate the values of the constants for you.
The first thing to understand is that complexity and running time are not the same thing and may not have very much to do with each other.
Complexity is a theoretical measure that gives an idea of how much an algorithm slows down on bigger inputs compared to smaller inputs, or compared to other algorithms.
The running time depends on the exact implementation, the computer it runs on, the other programs running on the same computer, and many other things. You will also notice that the running time jumps when the input no longer fits in your cache, and jumps again when it no longer fits in RAM. As you can see, for n = 200 you got some strange running times; this will not help you find the constants.
In cases where you don't have the code, you have no choice but to use the running times to approximate the complexity. Then you should use only big inputs (1000 should be the smallest input in your case). If your algorithm is deterministic, just feed it the worst case. Random cases can be good or bad, so they never tell you much about the real complexity. Another problem is that complexity counts "operations", so evaluating an if-statement and incrementing a variable count the same, while in running time an if needs more time than an increment.
So what you can do is plot your complexity function and the values you measured and look for a factor that fits...
E.g. this is a plot of n² scaled by 1/500 together with the points from your chart.
First some notes:
You have very small n.
The algorithm's complexity starts corresponding to the runtime only when n is big enough. For n = 4000 that is only about 16 KB of data (assuming 4-byte integers), which can still fit in most CPU caches, so increasing to at least n = 1000000 can and will change the relation between runtime and n considerably!
Runtime measurement
For random data you need an average runtime, not a single measurement, so for each n do at least 5 measurements, each with a different dataset, and use the average time over all of them.
Now how to obtain c
If a program has complexity O(n^2), it means that for big enough n the runtime is:
t(n) = c*n^2
So take a few measurements. I chose the last 3 from your insertion sort, reverse sorted, because that should match the worst-case O(n^2) complexity if I am not mistaken:
c*1000^2 = 1991 us
c*2000^2 = 8186 us
c*4000^2 = 31777 us
Solve the equations:
c = t(n)/n^2
c = 1991/1000000   ≈ 0.00199 us
c = 8186/4000000   ≈ 0.00205 us
c = 31777/16000000 ≈ 0.00199 us
If everything is alright, then the c for different n should be roughly the same. In your case it is around 0.002 us (about 2 ns) per n^2 step, but as I mentioned above, with increasing n this will change due to CACHE usage. Also, if any dynamic container is used, then you have to include the complexity of its usage in the algorithm, which can sometimes be significant!!!
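As a cross-check, a least-squares fit of t(n) = c*n^2 through the origin (a sketch in Python, using the reverse-sorted insertion sort times from the question) gives essentially the same value:

ns = [1000, 2000, 4000]
ts = [1991, 8186, 31777]          # microseconds, insertion sort on reverse-sorted input

# minimizing sum((t - c*n^2)^2) gives c = sum(t*n^2) / sum(n^4)
c = sum(t * n * n for n, t in zip(ns, ts)) / sum(n ** 4 for n in ns)
print(c)                          # ~0.00199 microseconds per n^2 step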
Take the case of 4000 elements and divide the time by the respective complexity estimate, 4000² or 4000·lg 4000.
This is no worse than any other method.
For safety, you should check anyway that the last values lie on a relatively smooth curve, so that the value for 4000 is representative.
As others commented, this is rather poor methodology. You should also consider the standard deviation of the running times, or better, the histogram of running times, and cover a larger range of sizes.
On the other hand, getting accurate values is not that important, since knowing the constants does not help you compare the two algorithms.

Second-best solution to an assignment problem using the Hungarian Algorithm

For finding the best solution in the assignment problem it's easy to use the Hungarian Algorithm.
For example:
A | 3 4 2
B | 8 9 1
C | 7 9 5
When you apply the Hungarian Algorithm to this, you get:
A | 0 0 1
B | 5 5 0
C | 0 1 0
Which means A gets assigned to 'job' 2, B to job 3 and C to job 1.
However, I want to find the second-best solution, meaning the best solution with a cost strictly greater than the cost of the optimal solution. As I see it, I just need to find the assignment with the minimal sum in the last matrix that is not the same as the optimal one. I could do this by searching a tree (with pruning), but I'm worried about the complexity (being O(n!)). Is there an efficient method for this that I don't know about?
I was thinking about a search in which I sort the rows first and then greedily choose the lowest cost first, assuming most of the lowest costs will make up the minimal sum, plus pruning. But since the Hungarian Algorithm can produce a matrix with a lot of zeros, the complexity is terrible again...
What you describe is a special case of the K best assignments problem -- there was in fact a solution to this problem proposed by Katta G. Murty in the following 1968 paper "An Algorithm for Ranking all the Assignments in Order of Increasing Cost." Operations Research 16(3):682-687.
Looks like there are actually a reasonable number of implementations of this, at least in Java and Matlab, available on the web (see e.g. here.)
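For the special case K = 2 asked about here, a simple (if not maximally efficient) approach is to re-solve the problem once per pairing of the optimal assignment, with that pairing forbidden; the second-best assignment must avoid at least one of them. Below is a sketch in Python, assuming SciPy's linear_sum_assignment is available (note that if several assignments tie for optimal, this returns the best assignment different from the one found, which may have the same cost):

import numpy as np
from scipy.optimize import linear_sum_assignment

def second_best(cost):
    cost = np.asarray(cost, dtype=float)
    big = cost.sum() + 1.0                       # penalty larger than any real assignment
    rows, cols = linear_sum_assignment(cost)
    best = cost[rows, cols].sum()
    second = None
    for r, c in zip(rows, cols):
        restricted = cost.copy()
        restricted[r, c] = big                   # forbid one pairing of the optimum
        rr, cc = linear_sum_assignment(restricted)
        val = restricted[rr, cc].sum()
        if val < big and (second is None or val < second):
            second = val
    return best, second

print(second_best([[3, 4, 2],
                   [8, 9, 1],
                   [7, 9, 5]]))                  # (12.0, 13.0) for the matrix above

This solves n+1 assignment problems, i.e. roughly O(n^4) with an O(n^3) Hungarian solver, which is far better than O(n!); Murty's algorithm generalizes the same idea to arbitrary K.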
In R there is now an implementation of Murty's algorithm in the muRty package.
CRAN
GitHub
It covers:
Optimization in both the minimum and maximum direction;
output by rank (similar to dense rank in SQL); and
the use of either the Hungarian algorithm (as implemented in clue) or linear programming (as implemented in lpSolve) for solving the initial assignment(s).
Disclaimer: I'm the author of the package.

How to generate a function that will algebraically encode a sequence?

Is there any way to generate a function F that, given a sequence, such as:
seq = [1 2 4 3 0 5 4 2 6]
Then F(seq) will return a function that generates that sequence? That is,
F(seq)(0) = 1
F(seq)(1) = 2
F(seq)(2) = 4
... and so on
Also, if it is possible, what is the function of lowest complexity that does so, and what is the complexity of the generated functions?
EDIT
It seems I wasn't clear, so I'll give an example:
F([1 3 5 7 9])
# returns something like:
F(x) = 1 + 2*x
# limited to the domain x ∈ [0 1 2 3 4]
In other words, I want to compute a function that can restore a sequence of integers algebraically, using mathematical operations such as + and *, even after the sequence has been cleared from memory. I don't know if it is possible, but since one could easily code an approximation of such a function for trivial cases, I'm wondering how far it goes and whether there is actual research on this.
EDIT 2: Answering another question: I'm only interested in sequences of integers, if that is important.
Please let me know if it is still not clear!
Well, if you just want a function built from "+ and *", that is to say a polynomial, you can check Wikipedia for the Lagrange polynomial (https://en.wikipedia.org/wiki/Lagrange_polynomial).
It gives you the lowest-degree polynomial that encodes your sequence.
Unfortunately, you probably won't be able to store less than before, because with random integers the probability that the polynomial has degree d = n-1, where n is the size of the array, is very high.
Furthermore, you will have to store rational numbers instead of integers.
And finally, access to any element of the array will be O(d) (using Horner's algorithm for polynomial evaluation), compared to O(1) with the array.
Nevertheless, if you know that your sequences may be very simple and very long, it might be an option.
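A small sketch (in Python, with exact rationals via fractions.Fraction; the helper name is mine) of evaluating the Lagrange polynomial through the points (0, seq[0]), (1, seq[1]), ...:

from fractions import Fraction

def lagrange_eval(seq, x):
    # evaluate the lowest-degree polynomial through (i, seq[i]) at the point x
    n = len(seq)
    total = Fraction(0)
    for i, yi in enumerate(seq):
        term = Fraction(yi)
        for j in range(n):
            if j != i:
                term *= Fraction(x - j, i - j)   # basis polynomial L_i(x)
        total += term
    return total

seq = [1, 2, 4, 3, 0, 5, 4, 2, 6]
print([int(lagrange_eval(seq, x)) for x in range(len(seq))])   # [1, 2, 4, 3, 0, 5, 4, 2, 6]

Note that this direct evaluation is O(n²) per point; expanding the polynomial to coefficient form and using Horner's rule gets it down to the O(d) per access mentioned above.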
If the sequence comes from a polynomial with a low degree, an easy way to find the unique polynomial that generates it is using Newton's series. Constructing the polynomial for n numbers has O(n²) time complexity, and evaluating it takes O(n).
In Newton's series the polynomial is expressed in terms of x, x(x-1), x(x-1)(x-2) etc., instead of the more familiar x, x², x³. To get the coefficients, you basically compute the differences between subsequent items in the sequence, then the differences between the differences, until only one is left or you get a row of all zeros. The numbers along the left edge, divided by the factorial of the degree of the term, give you the coefficients. For example, with the first sequence you get these differences:
1 2 4 3 0 5 4 2 6
1 2 -1 -3 5 -1 -2 4
1 -3 -2 8 -6 -1 6
-4 1 10 -14 5 7
5 9 -24 19 2
4 -33 43 -17
-37 76 -60
113 -136
-249
The polynomial that generates this sequence is therefore:
f(x) = 1 + x(1 + (x-1)(1/2 + (x-2)(-4/6 + (x-3)(5/24 + (x-4)(4/120
+ (x-5)(-37/720 + (x-6)(113/5040 + (x-7)(-249/40320))))))))
It's the same polynomial you get using other techniques, like Lagrange interpolation; this is just the easiest way to generate it as you get the coefficients for a polynomial form that can be evaluated with Horner's method, unlike the Lagrange form for example.
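Here is a sketch (in Python, with exact rationals; the helper names are mine) of the construction just described: build the difference table, divide the leading entries by factorials, and evaluate with Horner-style nesting:

from fractions import Fraction
from math import factorial

def newton_coefficients(seq):
    # c[k] = (k-th forward difference at 0) / k!
    coeffs = []
    row = [Fraction(v) for v in seq]
    k = 0
    while row:
        coeffs.append(row[0] / factorial(k))
        row = [b - a for a, b in zip(row, row[1:])]   # next row of differences
        k += 1
    return coeffs

def newton_eval(coeffs, x):
    # f(x) = c0 + x*(c1 + (x-1)*(c2 + (x-2)*(...)))
    value = coeffs[-1]
    for k in range(len(coeffs) - 2, -1, -1):
        value = coeffs[k] + (x - k) * value
    return value

seq = [1, 2, 4, 3, 0, 5, 4, 2, 6]
c = newton_coefficients(seq)
print([int(newton_eval(c, x)) for x in range(len(seq))])   # [1, 2, 4, 3, 0, 5, 4, 2, 6]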
There is no magic if the sequence can be completely random. Interpolation is always possible, but it won't save you memory: any interpolation method requires the same amount of storage in the worst case, because if it didn't, it would be possible to compress everything down to a single bit.
On the other hand, it is sometimes possible to use brute force, heuristics (like genetic algorithms), or numerical methods to reproduce some kind of mathematical expression of a specified type, but good luck with that :)
If the goal is to save memory, just use an archiving tool instead.
I think it will be useful for you to read about this: http://en.wikipedia.org/wiki/Entropy_(information_theory)
