What is the difference between Stochastic Gradient Descent and LightGBM? - lightgbm

Although I have researched these concepts individually, I am confused about whether one or the other should be chosen for a solution, or whether both can be used simultaneously to improve results. Any guidance you can provide will be much appreciated.

My understanding is that the cost function in gradient descent is evaluated over the entire training set, whereas stochastic gradient descent approximates the true gradient using much less data than the entire training set.
The question of which to use comes down to whether there is sufficient computing power and time to compute the exact gradient over the full training set. If there is, compute it exactly.
If the training set is too large, stochastic gradient descent is worth a try; you can run both to test the quality of the approximation.
In general, I would not use both together, for the same reason I would never average an exact value and its approximation. (Example: 1 = 1, but 1 is also approximately 0.99, so (1 + 0.99)/2 = 0.995, which is less accurate than the exact value you already had.)
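To make the distinction concrete, here is a minimal sketch (assuming a simple least-squares model; the function names and batch size are illustrative, not tied to any particular library):

```python
import numpy as np

def full_batch_gd_step(w, X, y, lr=0.01):
    """One gradient-descent step: the gradient uses every training example."""
    grad = X.T @ (X @ w - y) / len(y)          # exact gradient of 0.5*MSE
    return w - lr * grad

def sgd_step(w, X, y, lr=0.01, batch_size=32):
    """One stochastic step: the gradient is estimated from a small random batch."""
    idx = np.random.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size   # noisy estimate of the full gradient
    return w - lr * grad
```

The stochastic step is far cheaper per iteration, but its gradient is only a noisy estimate of the full-batch value.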

Related

Berndt–Hall–Hall–Hausman algorithm does not work

I am studying economics and trying to use the BHHH algorithm to maximize a log-likelihood function. However, the algorithm converges immediately because the approximate Hessian is too large in absolute value compared with the magnitude of the derivatives. This can happen when the starting point is far from the optimum, but I have the feeling it will also happen when the magnitude of the log-likelihood is large and so is the derivative. Am I right? In that case, which algorithm would be a better option? Thank you.
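For reference, BHHH approximates the (negative) Hessian by the sum of outer products of the per-observation score vectors; a minimal sketch of one iteration, assuming the scores have already been computed (the step size and ridge term here are illustrative, not part of the asker's code):

```python
import numpy as np

def bhhh_step(beta, scores, step=1.0, ridge=1e-8):
    """One BHHH iteration for maximizing a log-likelihood.

    scores: (n, k) array of per-observation gradients of the log-likelihood
            at beta.  Their sum is the total gradient; the sum of their
            outer products is the OPG approximation of (minus) the Hessian.
    """
    g = scores.sum(axis=0)                              # total gradient
    A = scores.T @ scores + ridge * np.eye(len(beta))   # OPG approximation
    direction = np.linalg.solve(A, g)                   # ascent direction A^{-1} g
    return beta + step * direction
```

If the entries of A are much larger than the gradient g, the solved direction A^{-1} g becomes very small, which is consistent with the premature "convergence" described above.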

Gradient Descent Algorithm And Different Learning Rates

In the gradient descent algorithm, can we choose the learning rate to be different in each iteration of the algorithm until its convergence?
Yes, there are a variety of ways to set your hyperparameters as a function of the epoch/iteration or of the loss and its derivatives. Changing the learning rate in gradient descent intuitively means changing the step size; one trade-off is that large steps can escape local optima but may require more steps to converge. Typically, starting large and shrinking over time makes sense, but there are many more sophisticated methods that adapt the learning rate to accelerate or regularize the fit.
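A small sketch of iteration-dependent learning rates (the schedules and constants are illustrative assumptions, not a specific library's API):

```python
def step_decay_lr(iteration, base_lr=0.1, drop=0.5, every=1000):
    """Step decay: start large, halve the learning rate every `every` iterations."""
    return base_lr * (drop ** (iteration // every))

def inverse_decay_lr(iteration, base_lr=0.1, k=1e-3):
    """Smooth 1/t-style decay, another common choice."""
    return base_lr / (1.0 + k * iteration)

# Inside a gradient-descent loop the update becomes:
#   w = w - step_decay_lr(t) * gradient(w)
```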

2nd order symplectic exponentially fitted integrator

I have to solve the equations of motion of a charged particle in an electromagnetic field. Since speed matters more than precision for my application, I cannot use adaptive-stepsize algorithms (such as Runge-Kutta Cash-Karp) because they would take too much time. I was looking for an algorithm that is both symplectic (like Boris integration) and exponentially fitted (so that it can handle the equations of motion even when they are stiff). I found a method, but it is for second-order differential equations:
https://www.math.purdue.edu/~xiaj/work/SEFRKN.pdf
Later I found a paper describing a fourth-order symplectic, exponentially fitted Runge-Kutta method:
http://users.ugent.be/~gvdbergh/files/publatex/annals1.pdf
Since speed is my priority, I am looking for a lower-order algorithm. Does a 2nd-order symplectic, exponentially fitted ODE algorithm exist?
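For context, the Boris push mentioned above is the standard second-order, volume-preserving scheme for the Lorentz-force equations; a minimal non-relativistic sketch (the field functions E(x), B(x) and the unstaggered formulation are assumptions for illustration):

```python
import numpy as np

def boris_push(x, v, E, B, dt, qm=1.0):
    """One Boris step for dx/dt = v, dv/dt = qm * (E + v x B).

    x, v : position and velocity (3-vectors); E(x), B(x) : field functions;
    qm   : charge-to-mass ratio.  Second-order accurate in dt.
    """
    # first half of the electric kick
    v_minus = v + 0.5 * qm * dt * E(x)
    # magnetic rotation
    t = 0.5 * qm * dt * B(x)
    s = 2.0 * t / (1.0 + np.dot(t, t))
    v_prime = v_minus + np.cross(v_minus, t)
    v_plus = v_minus + np.cross(v_prime, s)
    # second half of the electric kick, then position update
    v_new = v_plus + 0.5 * qm * dt * E(x)
    x_new = x + dt * v_new
    return x_new, v_new
```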

Which clustering algorithm is best when the number of clusters is unknown but there is no noise?

I have a dataset with an unknown number of clusters and I want to cluster it. Since I don't know the number of clusters in advance, I tried density-based algorithms, especially DBSCAN. The problem I have with DBSCAN is how to choose an appropriate epsilon. The method suggested in the DBSCAN paper assumes there is some noise, so that when the sorted k-dist graph is plotted a valley can be detected and used to set the threshold for epsilon. However, my dataset was obtained from a controlled environment and there is no noise.
Does anybody have an idea of how to choose epsilon? Or can you suggest a better clustering algorithm for this problem?
In general, there is no unsupervised epsilon detection. From what little you describe, DBSCAN is a very appropriate approach.
Real-world data tend to have a gentle gradient of distances; deciding what distance should be the cut-off is a judgement call requiring knowledge of the paradigm and end-use. In short, the problem requires knowledge not contained in the raw data.
I suggest that you use a simple stepping approach to converge on the solution you want. Set epsilon to some simple value that your inspection of the data suggests will be appropriate. If you get too much fragmentation, increase epsilon by a factor of 3; if the clusters are too large, decrease it by a factor of 3. Repeat your runs until you get the desired results.
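A minimal sketch of that stepping loop, assuming scikit-learn's DBSCAN and illustrative bounds on an acceptable number of clusters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def step_epsilon(X, eps=1.0, min_samples=5,
                 max_clusters=50, min_clusters=2, max_iter=20):
    """Multiply or divide eps by 3 until the cluster count looks reasonable.
    The bounds max_clusters/min_clusters stand in for the judgement call
    described above."""
    for _ in range(max_iter):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters > max_clusters:      # too fragmented -> merge more points
            eps *= 3.0
        elif n_clusters < min_clusters:    # over-merged -> split clusters apart
            eps /= 3.0
        else:
            break
    return eps, labels
```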

Using a smoother with the L Method to determine the number of K-Means clusters

Has anyone tried applying a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results, or allow fewer k-means trials and hence a much greater increase in speed? Which smoothing algorithm/method did you use?
The "L-Method" is detailed in:
Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan
This calculates the evaluation metric for a range of different trial cluster counts. Then, to find the knee (which occurs for an optimum number of clusters), two lines are fitted using linear regression. A simple iterative process is applied to improve the knee fit - this uses the existing evaluation metric calculations and does not require any re-runs of the k-means.
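A rough sketch of that two-line fit, assuming the evaluation metric has already been computed for each candidate cluster count (the names and the RMSE weighting are illustrative, loosely following Salvador & Chan):

```python
import numpy as np

def _fit_rmse(x, y):
    """RMSE of a least-squares line fit through (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return np.sqrt(np.mean(resid ** 2))

def l_method_knee(ks, metric):
    """Find the knee: for each candidate split, fit a line to each side and
    pick the split with the lowest total weighted RMSE.  Needs >= 5 points."""
    ks, metric = np.asarray(ks, float), np.asarray(metric, float)
    best_k, best_err = None, np.inf
    for c in range(2, len(ks) - 2):        # keep at least 2 points on each side
        err_l = _fit_rmse(ks[:c], metric[:c])
        err_r = _fit_rmse(ks[c:], metric[c:])
        total = (c * err_l + (len(ks) - c) * err_r) / len(ks)
        if total < best_err:
            best_k, best_err = ks[c], total
    return best_k
```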
For the evaluation metric, I am using the reciprocal of a simplified version of the Dunn Index, simplified for speed (basically my diameter and inter-cluster distance calculations are simplified). The reciprocal is so that the index works in the correct direction (i.e. lower is generally better).
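A rough sketch of a reciprocal Dunn-style index; the exact simplifications referred to above aren't specified, so this is just the standard form inverted (assumes at least two clusters and uses SciPy for pairwise distances):

```python
import numpy as np
from scipy.spatial.distance import cdist

def reciprocal_dunn(X, labels):
    """Max cluster diameter / min inter-cluster distance (lower is better)."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_diam = max(cdist(c, c).max() for c in clusters)
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return max_diam / min_sep
```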
K-means is a stochastic algorithm, so typically it is run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters the time quickly adds up. So it is in my interest to keep the number of runs in check. Overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.
I had asked a similar question in the past here on SO. My question was about coming up with a consistent way of finding the knee to the L-shape you described. The curves in question represented the trade-off between complexity and a fit measure of the model.
The best solution was to find the point with the maximum distance d according to the figure shown in that answer.
Note: I haven't read the paper you linked to yet.
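For completeness, a sketch of that maximum-distance construction, under the usual assumption that the reference line joins the first and last points of the curve (names are illustrative):

```python
import numpy as np

def knee_by_max_distance(x, y):
    """Return the index of the point farthest from the straight line that
    joins the first and last points of the curve."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    p1 = np.array([x[0], y[0]])
    p2 = np.array([x[-1], y[-1]])
    line = (p2 - p1) / np.linalg.norm(p2 - p1)   # unit vector along the chord
    points = np.column_stack([x, y]) - p1
    # distance to the line = norm of the component perpendicular to it
    proj = np.outer(points @ line, line)
    dists = np.linalg.norm(points - proj, axis=1)
    return int(np.argmax(dists))
```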
