How do I calculate the computational complexity of automatic differentiation? - complexity-theory

I'm using autograd as implemented in PyTorch to train a neural network, and I need to calculate the computational complexity of the whole algorithm. Where can I find a complete derivation of the computational complexity of autograd? I've searched the PyTorch documentation but can't find an answer.
Thank you

Try this: https://github.com/Lyken17/pytorch-OpCounter. It calculates the MACs of a neural network (closely related to complexity).
You can find a guide on the hub; use profile to measure the FLOPs of models in PyTorch.
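For reference, a minimal sketch of how such a counter might be used, assuming the thop package (the pip name for pytorch-OpCounter) and a small placeholder model:

```python
# Minimal sketch of counting MACs/FLOPs with pytorch-OpCounter (the `thop` package);
# the model and input shape here are placeholders.
import torch
import torch.nn as nn
from thop import profile

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
dummy_input = torch.randn(1, 128)

# profile() returns the multiply-accumulate count and parameter count for one forward pass
macs, params = profile(model, inputs=(dummy_input,))
print(f"MACs: {macs:,.0f}, parameters: {params:,.0f}")
```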

It depends on how you calculate complexity in this case. Each forward operation is paired with its respective backward operation (returning the derivative with respect to the network's last node, usually the cost function).
If you treat each operation as one unit, then forward plus backward would be two units, which essentially does not change the complexity class; otherwise it depends on the type of neural network and the operations inside it.
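As a rough illustration of that pairing, one could compare the wall-clock cost of a forward pass against its backward pass; the network and sizes below are arbitrary placeholders, and timing is only a proxy for operation counts:

```python
# Rough sketch: compare forward and backward cost empirically.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.Tanh(), nn.Linear(1024, 1024))
x = torch.randn(256, 1024)

start = time.perf_counter()
out = model(x).sum()          # forward pass, reduced to a scalar
fwd = time.perf_counter() - start

start = time.perf_counter()
out.backward()                # backward pass: one derivative per recorded forward op
bwd = time.perf_counter() - start

print(f"forward: {fwd:.4f}s, backward: {bwd:.4f}s, ratio: {bwd / fwd:.2f}")
```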

Related

Quantum computing Grover's Algorithm

Question:
How much does exploiting quantum computing actually speed up computation? (We know that it has some effect, because of Grover's algorithm, but how much? Does BQP = P?)
What I know:
I understand Grover's algorithm, but answering this question seems tough.
Source for Grover's algorithm:
https://en.m.wikipedia.org/wiki/Grover%27s_algorithm
Is there any way to answer this?
Well, using a classical naive search algorithm, where you look at one entry after another in a register, it would take on average N/2 calls before you find the result you are looking for. Grover's algorithm, assuming you already have a register of all entries prepared in a superposition state, only takes on the order of sqrt(N) calls on average. For large registers, this is a huge gain.
What the story doesn't tell is that preparing the register is costly. Every time you call Grover's algorithm, you "consume" the entire register. Therefore, the real cost of Grover's algorithm would be sqrt(N) * (cost of preparing the register). Sadly, preparing the quantum register (a superposition of all entries in the register) scales with N. Therefore, Grover's algorithm might not provide an actual gain over the classical search algorithm!
It remains to be seen whether there are efficient ways to prepare the quantum register. If one could find an O(sqrt(N)) way to prepare it, Grover's algorithm would, at the very least, be as efficient as the classical search algorithm.
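To make the query-count part of this concrete, here is a small classical simulation of Grover iterations on an N-dimensional state vector; it only illustrates that roughly (pi/4)*sqrt(N) oracle calls suffice and says nothing about the register-preparation cost discussed above:

```python
# Classical simulation of Grover's iterations on a length-N amplitude vector.
import numpy as np

def grover_search(N, marked):
    state = np.full(N, 1.0 / np.sqrt(N))        # uniform superposition (the "prepared register")
    iterations = int(np.floor(np.pi / 4 * np.sqrt(N)))
    for _ in range(iterations):
        state[marked] *= -1.0                   # oracle: flip the phase of the marked entry
        mean = state.mean()
        state = 2 * mean - state                # diffusion: inversion about the mean
    return iterations, state[marked] ** 2       # oracle calls and success probability

calls, p = grover_search(N=4096, marked=123)
print(f"N=4096: {calls} oracle calls, success probability {p:.3f}")
```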
The observations by Exeko on the computational cost of Grover's-algorithm-based search operations are a valid and important concern when the algorithm is implemented out of the box. However, the cost of preparing and retrieving information from the quantum register can be reduced by introducing a quantum Bloom filter with verifiable random functions. A quantum Bloom filter helps eliminate false positives in the register, so we don't need to consume the entire register every time. We implemented Grover's algorithm on IBM Q last year with an additional quantum Bloom filter and a full-adder circuit, which helped us achieve a quadratic speed-up in end-to-end search performance.

Expectation maximization vs. direct numerical optimization of the likelihood function for estimating a high-dimensional Markov-switching/HMM model

I am currently estimating a Markov-switching model with many parameters by direct optimization of the log-likelihood function (through the forward-backward algorithm). I do the numerical optimization using MATLAB's genetic algorithm, since other approaches, such as the (mostly gradient- or simplex-based) algorithms in fmincon and fminsearchbnd, were not very useful, given that the likelihood function is not only very high-dimensional but also has many local maxima and is highly nonlinear.
The genetic algorithm seems to work very well. However, I am planning to further increase the dimension of the problem. I have read about an EM algorithm for estimating Markov-switching models. From what I understand, this algorithm produces a sequence of increasing log-likelihood values. It thus seems suitable for estimating models with very many parameters.
My question is whether the EM algorithm is suitable for my application involving many parameters (perhaps better suited than the genetic algorithm). Speed is not the main limitation (the genetic algorithm is already extremely slow), but I would need some certainty of ending up close to the global optimum and not in one of the many local optima. Do you have any experience or suggestions regarding this?
The EM algorithm finds local optima, and does not guarantee that they are global optima. In fact, if you start it off with an HMM where one of the transition probabilities is zero, that probability will typically never change from zero, because those transitions appear only with expectation zero in the expectation step; such starting points have no hope of finding a global optimum in which that transition probability is nonzero.
The standard workaround for this is to start it off from a variety of different random parameter settings, pick the best local optimum found, and hope for the best. You might be slightly reassured if a significant proportion of the runs converged to the same (or to equivalent) best local optimum, on the not very reliable theory that anything better would be found from at least the same fraction of random starts, and so would have shown up by now.
I haven't worked it out in detail, but the EM algorithm solves such a general class of problems that I expect, if it were guaranteed to find the global optimum, it would be capable of solving NP-complete problems with unprecedented efficiency.
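As a hedged illustration of the random-restart workaround described above, here is a sketch in Python using the hmmlearn package as a stand-in for the MATLAB setting; the Gaussian HMM, the number of restarts, and the placeholder data X are all assumptions:

```python
# Random restarts of EM (Baum-Welch) for an HMM, keeping the best local optimum.
import numpy as np
from hmmlearn.hmm import GaussianHMM

X = np.random.randn(500, 2)   # placeholder observations, shape (n_samples, n_features)

best_model, best_loglik = None, -np.inf
for seed in range(20):                               # 20 random starts
    model = GaussianHMM(n_components=3, n_iter=200, random_state=seed)
    model.fit(X)                                     # EM from this random start
    loglik = model.score(X)                          # log-likelihood at the local optimum
    if loglik > best_loglik:
        best_model, best_loglik = model, loglik

print(f"best log-likelihood over restarts: {best_loglik:.2f}")
```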

Maximum likelihood and support vector machine complexity

Can anyone give some references showing how to determine the computational complexity of maximum likelihood and support vector machine classifiers?
I have been searching the web but can't seem to find a good document that details how to derive the equations that model the computational complexity of those classifier algorithms.
Thanks
Support vector machines, and a number of maximum likelihood fits, are convex minimization problems. Therefore, they could in theory be solved in polynomial time using the ellipsoid method: http://en.wikipedia.org/wiki/Ellipsoid_method.
I suspect you can get much better estimates if you consider the specific methods used in practice. http://www.cse.ust.hk/~jamesk/papers/jmlr05.pdf says that standard SVM fitting on m instances costs O(m^3) time and O(m^2) space. http://research.microsoft.com/en-us/um/people/minka/papers/logreg/minka-logreg.pdf gives the cost per iteration for logistic regression, but does not give a theoretical basis for estimating the number of iterations. In practice, I would hope that this reaches quadratic convergence most of the time and is not too bad.
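If a theoretical bound is hard to pin down, one rough empirical check is to time the fit at several training-set sizes; the sketch below uses scikit-learn's SVC on synthetic data, so the measured growth is only indicative:

```python
# Empirical check of how SVM training time grows with the number of instances m.
import time
from sklearn.svm import SVC
from sklearn.datasets import make_classification

for m in (1000, 2000, 4000, 8000):
    X, y = make_classification(n_samples=m, n_features=20, random_state=0)
    start = time.perf_counter()
    SVC(kernel="rbf").fit(X, y)
    print(f"m={m}: {time.perf_counter() - start:.2f}s")
```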

How to determine the complexity of an algorithm function?

How do you know if an algorithm/function takes linear, constant, or logarithmic time for a specific operation? Does it depend on CPU cycles?
There are three ways you can do it (at least).
1. Look up the algorithm on the net and see what it says about its time complexity.
2. Examine the algorithm yourself to look at things like nested loops and recursion conditions, and how often each loop runs or each recursion is done, based on the input size. An extension of this is a rigorous mathematical analysis.
3. Experiment. Vary the input size and see how long the algorithm takes depending on it, then work out an equation that gives you that runtime as a function of the size (solving simultaneous equations is one possibility here for O(n^c)-type functions); see the sketch after this list.
Of these, probably the first is the easiest for the layman, since it will almost certainly have been produced by someone more knowledgeable doing the second :-)
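A sketch of the third (experimental) approach: time an operation at several input sizes and estimate the exponent c in O(n^c) by a least-squares fit in log-log space. The function under test (Python's built-in sorted) is just an example:

```python
# Time an operation at several input sizes and fit the growth exponent.
import time
import numpy as np

sizes = [10_000, 20_000, 40_000, 80_000, 160_000]
times = []
for n in sizes:
    data = np.random.rand(n).tolist()
    start = time.perf_counter()
    sorted(data)
    times.append(time.perf_counter() - start)

# The slope of log(time) vs log(n) approximates the exponent c.
c, _ = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"estimated exponent: {c:.2f}  (sorting is O(n log n), so expect a bit above 1)")
```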
In general, a function may take any amount of time to run an algorithm; the runtime can be highly non-linear and can even be unbounded if the algorithm never halts.
In short, the analysis of an algorithm uses an abstraction called the Turing machine, which is used to count the number of operations required to perform the algorithm before it halts.
For more precise information, see the Wikipedia article on computational complexity theory.
About the dependency on the CPU:
The answer is NO - time complexity is totally CPU-independent. This is because complexity describes how an algorithm's demands on CPU resources grow as the size of its input data grows. In other words, it is a function, and functions are the same everywhere - be it on different machines or on a different planet :)

Using a smoother with the L Method to determine the number of K-Means clusters

Has anyone tried applying a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or did it allow a lower number of k-means trials and hence a much greater increase in speed? Which smoothing algorithm/method did you use?
The "L-Method" is detailed in:
Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan
This calculates the evaluation metric for a range of different trial cluster counts. Then, to find the knee (which occurs at the optimum number of clusters), two lines are fitted using linear regression. A simple iterative process is applied to improve the knee fit - this reuses the existing evaluation metric calculations and does not require any re-runs of k-means.
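For what it's worth, a minimal sketch of that two-line fit (without the iterative refinement step) might look like the following; ks and metric are assumed arrays of trial cluster counts and the corresponding evaluation-metric values:

```python
# L-method sketch: try every split point, fit a line to each side, keep the split
# with the lowest length-weighted RMSE; the knee is the last point of the left line.
import numpy as np

def rmse(x, y):
    slope, intercept = np.polyfit(x, y, 1)         # least-squares line fit
    residuals = y - (slope * x + intercept)
    return np.sqrt(np.mean(residuals ** 2))

def l_method_knee(ks, metric):
    ks, metric = np.asarray(ks, float), np.asarray(metric, float)
    b = len(ks)
    best_c, best_err = None, np.inf
    for c in range(2, b - 1):                      # both sides need at least 2 points
        err_left = rmse(ks[:c], metric[:c])
        err_right = rmse(ks[c:], metric[c:])
        total = (c / b) * err_left + ((b - c) / b) * err_right
        if total < best_err:
            best_c, best_err = c, total
    return ks[best_c - 1]
```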
For the evaluation metric, I am using the reciprocal of a simplified version of the Dunn index, simplified for speed (basically my diameter and inter-cluster distance calculations are simplified). The reciprocal is used so that the index works in the correct direction (i.e. lower is generally better).
K-means is a stochastic algorithm, so it is typically run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters, the time quickly adds up. So it is in my interest to keep the number of runs in check. The overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.
I asked a similar question in the past here on SO. My question was about coming up with a consistent way of finding the knee of the L-shape you described. The curves in question represented the trade-off between complexity and a fit measure of the model.
The best solution was to find the point with the maximum distance d from the straight line joining the curve's two endpoints (illustrated by a figure in the original answer, not reproduced here).
Note: I haven't read the paper you linked to yet.
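A small sketch of that maximum-distance idea, assuming x holds the cluster counts and y the corresponding metric values:

```python
# Knee detection: pick the point farthest from the line joining the curve's endpoints.
import numpy as np

def knee_by_max_distance(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    p1 = np.array([x[0], y[0]])
    p2 = np.array([x[-1], y[-1]])
    line = (p2 - p1) / np.linalg.norm(p2 - p1)     # unit vector along the end-to-end line
    points = np.column_stack([x, y]) - p1
    proj = np.outer(points @ line, line)           # projection of each point onto the line
    dists = np.linalg.norm(points - proj, axis=1)  # perpendicular distance to the line
    return int(np.argmax(dists))                   # index of the knee point

# Example on a curve that flattens out quickly
xs = np.arange(1, 11)
ys = 1.0 / xs
idx = knee_by_max_distance(xs, ys)
print("knee index:", idx, "at x =", xs[idx])
```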
