In the gradient descent algorithm, can we choose the learning rate to be different in each iteration of the algorithm until its convergence?
Yes, there are a variety of ways to set your hyperparameters as functions of the epoch/iteration count or of the loss and its derivatives. Changing the learning rate in gradient descent intuitively means changing the step size; one tradeoff is that large steps can escape local optima but may require more steps to converge. Typically, starting large and shrinking the rate over time makes sense, but there are many more sophisticated methods that adapt the learning-rate scalar to accelerate or regularize the fit.
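As a minimal sketch (the function names, the inverse-time-decay schedule, and the toy objective are all illustrative choices, not the only ones), gradient descent with a per-iteration learning rate might look like this:

```python
import numpy as np

def gradient_descent(grad, x0, lr0=0.1, decay=0.01, n_iters=200):
    """Plain gradient descent with a learning rate that changes each iteration.

    lr0 / (1 + decay * t) is just one common choice (inverse-time decay);
    the step size could equally come from an exponential decay, a fixed
    schedule, or a line search at every step.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(n_iters):
        lr = lr0 / (1.0 + decay * t)   # learning rate shrinks as t grows
        x = x - lr * grad(x)
    return x

# Toy example: minimize f(x) = x^2, whose gradient is 2x
print(gradient_descent(lambda x: 2 * x, x0=5.0))  # approaches 0
```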
Although I have individually researched gradient descent and stochastic gradient descent, I am confused about whether one or the other should be chosen for a solution, or whether both can be used simultaneously to improve results. Any guidance you can provide will be much appreciated.
My understanding is that gradient descent computes the gradient of the cost over the entire training set, whereas stochastic gradient descent approximates that true gradient using much less than the entire training set.
The question of which to use, and when, comes down to whether there is sufficient computing power to calculate the exact gradient of the cost. If there is sufficient computing power and time, then calculate it exactly.
If the training set is too large, stochastic gradient descent is worth a try. Use both for testing the quality of the approximation.
In general, I would not use both, for the same reason I would never average an exact value and its approximation. (Example: 1 = 1, but 1 is also approximately 0.99, so (1 + 0.99)/2 = 0.995.)
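To make the distinction concrete, here is a small sketch (the least-squares problem, learning rates, and iteration counts are arbitrary illustrative choices): full-batch gradient descent uses the exact gradient over the whole training set, while SGD takes cheap, noisy steps from one example at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

def full_batch_gd(X, y, lr=0.1, n_iters=100):
    """Each step uses the exact gradient over the entire training set."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iters):
        grad = 2.0 / n * X.T @ (X @ w - y)
        w -= lr * grad
    return w

def sgd(X, y, lr=0.01, n_epochs=10):
    """Each step uses the gradient of a single example -- a noisy, cheap
    approximation of the full-batch gradient."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            grad = 2.0 * X[i] * (X[i] @ w - y[i])
            w -= lr * grad
    return w

print(full_batch_gd(X, y))
print(sgd(X, y))  # both should land close to [2, -1, 0.5]
```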
Let's say I have N positive-valued 1-d functions. Does it take more function evaluations for a numerical minimizer to minimize their product in N-dimensional space than to do N individual 1-d minimizations?
If so, is there an intuitive way to understand this? Somehow I feel like both problems should be equal in complexity.
Since the functions are positive, minimizing their product is the same as minimizing the sum of their logs. There are many algorithms for min(max)imizing N-dimensional functions. One is the old standby OPTIF9.
If you have to use hard limits, so you're minimizing in a box, that can be a lot harder, but you can usually avoid it.
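A small sketch of that log trick (assuming each positive function depends on its own coordinate, which is what makes the product separable; the particular functions and the SciPy solvers below are just illustrative):

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

# Two positive-valued 1-d functions, each of its own variable (an assumption
# made here for illustration).
fs = [lambda x: np.exp((x - 1.0) ** 2),
      lambda x: np.exp((x + 2.0) ** 2)]

# Joint minimization of the product in N-dimensional space ...
product = lambda x: fs[0](x[0]) * fs[1](x[1])
res_joint = minimize(product, x0=np.zeros(2), method="Nelder-Mead")

# ... versus minimizing the sum of logs, which separates into N 1-d problems.
res_1d = [minimize_scalar(lambda t, f=f: np.log(f(t))) for f in fs]

print(res_joint.x)              # ~ [1, -2]
print([r.x for r in res_1d])    # ~ [1, -2], the same minimizers
```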
The complexity is not linear in the number of variables. Typically, n small problems are better than one big problem. In other words: making the problem twice as big (in terms of variables) will make it more than twice as expensive to solve.
In some special cases it may be somewhat beneficial to batch a few problems, mainly due to fixed overhead (some solvers do a lot of things before actually starting iterating).
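As a rough illustration of that scaling (the exact evaluation counts depend heavily on the solver, tolerances, and functions; Nelder-Mead and the separable test functions here are just convenient stand-ins), one can count function evaluations for one joint N-dimensional minimization versus N separate 1-d minimizations:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def product_of_quadratics(x, centers):
    # prod_i exp((x_i - c_i)^2): positive and separable in its variables
    return np.exp(np.sum((x - centers) ** 2))

for n in (2, 4, 8, 16):
    centers = np.linspace(-1.0, 1.0, n)

    # One joint n-dimensional minimization of the product.
    joint = minimize(product_of_quadratics, x0=np.zeros(n),
                     args=(centers,), method="Nelder-Mead")

    # n independent 1-d minimizations.
    evals_1d = sum(minimize_scalar(lambda t, c=c: np.exp((t - c) ** 2)).nfev
                   for c in centers)

    print(n, joint.nfev, evals_1d)  # joint cost grows much faster than n * (1-d cost)
```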
I am currently estimating a Markov-switching model with many parameters by direct optimization of the log-likelihood function (through the forward-backward algorithm). I do the numerical optimization using MATLAB's genetic algorithm, since other approaches such as the (mostly gradient- or simplex-based) algorithms in fmincon and fminsearchbnd were not very useful, given that the likelihood function is not only of very high dimension but also exhibits many local maxima and is highly nonlinear.
The genetic algorithm seems to work very well. However, I am planning to further increase the dimension of the problem. I have read about an EM algorithm for estimating Markov-switching models. From what I understand, this algorithm produces a sequence of increasing log-likelihood values, so it seems suitable for estimating models with very many parameters.
My question is whether the EM algorithm is suitable for my application involving many parameters (perhaps better suited than the genetic algorithm). Speed is not the main limitation (the genetic algorithm is already extremely slow), but I would need some certainty of ending up close to the global optimum and not in one of the many local optima. Do you have any experience or suggestions regarding this?
The EM algorithm finds local optima, and does not guarantee that they are global optima. In fact, if you start it off with a HMM where one of the transition probabilities is zero, that probability will typically never change from zero, because those transitions will appear only with expectation zero in the expectation step, so those starting points have no hope of finding a global optimum which does not have that transition probability zero.
The standard workaround for this is to start it off from a variety of different random parameter settings, pick the best local optimum found, and hope for the best. You might be slightly reassured if a significant proportion of the runs converge to the same (or an equivalent) best local optimum, on the not-very-reliable theory that anything better would have been found from at least the same fraction of random starts, and so would have shown up by now.
I haven't worked it out in detail, but the EM algorithm solves such a general set of problems that I expect that, if it were guaranteed to find the global optimum, it would be capable of solving NP-complete problems with unprecedented efficiency.
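For what it's worth, the random-restart strategy is easy to script. The sketch below uses Python's hmmlearn and a plain Gaussian HMM as a stand-in for the Markov-switching model fitted in MATLAB (so the library, model, and data are all assumptions here), but the multi-start pattern is the same:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # stand-in EM fit, not the asker's MATLAB setup

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))  # placeholder data

best_model, best_ll = None, -np.inf
lls = []
for seed in range(20):                      # many random restarts
    model = GaussianHMM(n_components=3, n_iter=200, random_state=seed)
    model.fit(X)                            # EM from this initialization
    ll = model.score(X)                     # log-likelihood of the local optimum found
    lls.append(ll)
    if ll > best_ll:
        best_model, best_ll = model, ll

# If many restarts land on (nearly) the same best log-likelihood, that is weak
# reassurance that it may be the global optimum, as described above.
print(sorted(np.round(lls, 2)))
```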
I have a dataset with an unknown number of clusters and I aim to cluster it. Since I don't know the number of clusters in advance, I tried to use density-based algorithms, especially DBSCAN. The problem I have with DBSCAN is how to pick an appropriate epsilon. The method suggested in the DBSCAN paper assumes there is some noise: when we plot the sorted k-dist graph, we can detect a valley and use it as the threshold for epsilon. But my dataset was obtained from a controlled environment and contains no noise.
Does anybody have an idea of how to choose epsilon? Or can you suggest a better clustering algorithm that would fit this problem?
In general, there is no unsupervised epsilon detection. From what little you describe, DBSCAN is a very appropriate approach.
Real-world data tend to have a gentle gradient of distances; deciding what distance should be the cut-off is a judgement call requiring knowledge of the paradigm and end-use. In short, the problem requires knowledge not contained in the raw data.
I suggest that you use a simple stepping approach to converge on the solution you want. Set epsilon to some simple value that your observation suggests will be appropriate. If you get too much fragmentation, increase epsilon by a factor of 3; if the clusters are too large, decrease by a factor of 3. Repeat your runs until you get the desired results.
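A sketch of both ideas in scikit-learn (the data, k, the 90th-percentile starting guess, and the fragmentation threshold are all placeholder choices you would replace with your own judgement):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

# Sorted k-distance curve (k = min_samples): even without noise, a bend in
# this curve is a reasonable first guess for epsilon.
k = 5
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])
eps = k_dist[int(0.9 * len(k_dist))]   # crude starting guess, not a rule

# Stepping approach from the answer: scale eps up/down by 3 until the
# clustering looks right for the application.
for _ in range(5):
    labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps:.3f}  clusters={n_clusters}")
    if n_clusters > 10:      # "too much fragmentation" -- threshold is a stand-in
        eps *= 3
    elif n_clusters <= 1:    # clusters too large / everything merged
        eps /= 3
    else:
        break
```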
I am trying to understand how the R-tree works, and I saw that there are two types of splits: quadratic and linear.
What are actually the differences between linear and quadratic splits? And in which cases would one be preferred over the other?
The original R-Tree paper describes the differences between PickSeeds and LinearPickSeeds in sections 3.5.2 and 3.5.3, and the charts in section 4 show the performance differences between the two algorithms. Note that figure 4.2 uses an exponential scale for the Y-axis.
http://www.cs.bgu.ac.il/~atdb082/wiki.files/paper6.pdf
I would personally use LinearPickSeeds for cases where the R-Tree has high "churn" and memory usage is not critical, and QuadraticPickSeeds for cases where the R-Tree is relatively static or in a limited memory environment. But that's just a rule of thumb; I don't have benchmarks to back that up.
Both are heuristics for finding a small-area split.
In the quadratic split, you choose the two objects that create as much empty space as possible. In the linear split, you choose the two objects that are farthest apart.
The quadratic split provides a somewhat better-quality split. However, for many practical purposes, the linear split is about as simple, fast, and good as the quadratic one.
There are even more variants: exhaustive search, Greene's split, the Ang-Tan split, and the R*-tree split.
All of them are heuristics to find a good split in acceptable time.
In my experiments, the R*-tree split works best, because it produces more rectangular pages. Ang-Tan, while being "linear", produces slices that are actually a pain for most queries. Often, the cost at construction/insertion time is not too important, but query cost is.
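For concreteness, here is a rough 2-D Python sketch of the two seed-picking heuristics described above (it follows the spirit of Guttman's PickSeeds and LinearPickSeeds, not the paper's exact pseudocode, and the sample rectangles are made up):

```python
from itertools import combinations

# Rectangles as (xmin, ymin, xmax, ymax); 2-D only, to keep the sketch short.

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def cover(a, b):
    """Minimum bounding rectangle of two rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def quadratic_pick_seeds(rects):
    """Try every pair (O(n^2)) and pick the one wasting the most area."""
    return max(combinations(range(len(rects)), 2),
               key=lambda p: area(cover(rects[p[0]], rects[p[1]]))
                             - area(rects[p[0]]) - area(rects[p[1]]))

def linear_pick_seeds(rects):
    """One pass per dimension (O(n)): pick the pair with the greatest
    normalized separation along any dimension."""
    best, best_sep = None, -1.0
    for dim in (0, 1):
        lo, hi = dim, dim + 2
        i_high_low = max(range(len(rects)), key=lambda i: rects[i][lo])
        i_low_high = min(range(len(rects)), key=lambda i: rects[i][hi])
        width = max(r[hi] for r in rects) - min(r[lo] for r in rects)
        sep = (rects[i_high_low][lo] - rects[i_low_high][hi]) / (width or 1.0)
        if sep > best_sep and i_high_low != i_low_high:
            best, best_sep = (i_low_high, i_high_low), sep
    return best

rects = [(0, 0, 1, 1), (5, 5, 6, 6), (0.5, 0.5, 1.5, 1.5), (5.2, 0, 6, 1)]
print(quadratic_pick_seeds(rects))  # pair creating the most dead space
print(linear_pick_seeds(rects))     # pair farthest apart (normalized)
```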