I've implemented both batch and stochastic gradient descent, but I'm experiencing some issues. This is the stochastic update rule:
for i = 1 to m {
    theta(j) := theta(j) - step * derivative(i, j)    (for all j)
}
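For clarity, here is a minimal sketch of that stochastic update, assuming a linear hypothesis and a squared-error cost (the names are illustrative, not my actual code):

    import random

    def sgd(x, y, step=0.01, epochs=100):
        # x: list of feature vectors, y: list of targets, theta: parameter vector
        theta = [0.0] * len(x[0])
        for _ in range(epochs):
            for i in random.sample(range(len(x)), len(x)):  # visit examples in random order
                prediction = sum(t * xi for t, xi in zip(theta, x[i]))
                error = prediction - y[i]
                # update every theta(j) using only the i-th example's derivative
                theta = [t - step * error * x[i][j] for j, t in enumerate(theta)]
        return theta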
The issue I have is that, even though the cost function becomes smaller and smaller, the testing says the result is not good. If I change the step a bit and change the number of iterations, the cost function value is a bit bigger but the results are OK. Is this an overfitting "symptom"? How do I know which one is the right one? :)
As I said, even though the cost function is minimized further, the testing says the result is not good.
Gradient descent is a local search method for minimizing a function. When it reaches a local minimum in the parameter space, it won't be able to go any further. This makes gradient descent (and other local methods) prone to getting stuck in local minima, rather than reaching the global minimum. The local minima may or may not be good solutions for what you're trying to achieve. What to expect will depend on the function that you're trying to minimize.
In particular, high-dimensional NP-complete problems can be tricky. They often have exponentially many local optima, with many of them nearly as good as the global optimum in terms of cost, but with parameter values orthogonal to those for the global optimum. These are hard problems: you don't generally expect to be able to find the global optimum, instead just looking for a local minimum that is good enough. These are also relevant problems: many interesting problems have just these properties.
I'd suggest first testing your gradient descent implementation with an easy problem. You might try finding the minimum of a polynomial. Since it's a one-parameter problem, you can plot the progress of the parameter values along the curve of the polynomial. You should be able to see if something is drastically wrong, and can also observe how the search gets stuck in local minima. You should also be able to see that the initial parameter choice can matter quite a lot.
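For example, a one-parameter sketch along those lines (the quartic and the step size are arbitrary choices, and printing stands in for plotting):

    def gradient_descent_1d(df, x0, step=0.01, iters=200):
        # plain gradient descent on a one-parameter function; returns the whole trajectory
        xs = [x0]
        for _ in range(iters):
            xs.append(xs[-1] - step * df(xs[-1]))
        return xs

    # f(x) = x^4 - 3x^2 + x has two local minima; its derivative is 4x^3 - 6x + 1
    df = lambda x: 4 * x**3 - 6 * x + 1
    print(gradient_descent_1d(df, x0=2.0)[-1])   # ends up near x ~  1.1
    print(gradient_descent_1d(df, x0=-2.0)[-1])  # ends up near x ~ -1.3, a different minimum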
For dealing with harder problems, you might modify your algorithm to help it escape the local minima. A few common approaches:
Add noise. This reduces the precision of the parameters you've found, which can "blur" out local minima. The search can then jump out of local minima that are small compared to the noise, while still being trapped in deeper minima. A well-known approach for adding noise is simulated annealing.
Add momentum. Along with using the current gradient to define the step, also continue in the same direction as the previous step. If you take a fraction of the previous step as the momentum term, there is a tendency to keep going, which can take the search past the local minimum. By using a fraction, the steps decay exponentially, so poor steps aren't a big problem. This was always a popular modification to gradient descent when used to train neural networks, where gradient descent is known as backpropagation.
Use a hybrid search. First use a global search (e.g., genetic algorithms, various Monte Carlo methods) to find some good starting points, then apply gradient descent to take advantage of the gradient information in the function.
I won't make a recommendation on which to use. Instead, I'll suggest doing a little research to see what others have done with problems related to what you're working on. If it's purely a learning experience, momentum is probably the easiest to get working.
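If it helps, a minimal one-parameter sketch of the momentum idea (beta is the fraction of the previous step that is carried over; all names are illustrative):

    def gd_with_momentum(grad, theta0, step=0.01, beta=0.9, iters=500):
        # keep a running "velocity": the new step is the gradient step plus a
        # fraction (beta) of the previous step, which decays exponentially
        theta, velocity = theta0, 0.0
        for _ in range(iters):
            velocity = beta * velocity - step * grad(theta)
            theta += velocity
        return theta

    # e.g. minimizing t^2 (gradient 2t) from t = 3.0; converges to ~0
    print(gd_with_momentum(lambda t: 2 * t, theta0=3.0))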
There are lots of things that could be going on:
your step could be a bad choice
your derivative might be off
your "expected value" might be mistaken
your gradient descent could simply be slow to converge
I would try increasing the run length, and plot runs with a variety of step values. A smaller step will have a better chance of avoiding the problems of, er, steps that are too big.
I solved an analytically unsolvable problem with numerical methods. I am searching for X based on a desired Y value: computing f(x)=y is possible, but x=f^-1(y) is not.
Currently the algorithm does a binary search. It starts at X=50%, calculates Y, and returns Y_err=Y-Y_demand. It keeps stepping in intervals of 5% in the direction that shrinks Y_err until Y_err changes sign, then it reduces the step and steps in the opposite direction. This works, but it's embarrassingly slow and inefficient.
Below, an example chart of x=f^-1(y). I chose one with high coefficients for the nonlinear part.
[example chart of x = f^-1(y) omitted]
It varies depending on the coefficients, but always has this pseudoparabolic shape. It is of course nonlinear, and even 9th-order polynomial approximations don't offer satisfactory precision.
For simplicity's sake, let's say the inflection point is at X=50%, and I am looking only for solutions where X>50%.
How should I proceed? I'm looking to optimise as much as possible. What are some good algorithms? Thanks.
EDIT: Thank you for pointing out that this is not in fact a binary search. I've updated the code and now have much better results by comparison.
I'm not sure if Newton's method applies here, or at least I don't know how to apply it. One-way trial and error is all I can do. When I have some more time I will try to learn and implement regula falsi.
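(For reference, regula falsi applied to this setup would look roughly like the sketch below; f, lo, hi and y_demand are placeholder names, and the bracket is assumed to lie on the X > 50% branch.)

    def regula_falsi(f, lo, hi, y_demand, tol=1e-6, max_iter=100):
        # find x with f(x) ~= y_demand, given lo and hi that bracket the solution
        g = lambda x: f(x) - y_demand          # work with the error function Y_err
        g_lo, g_hi = g(lo), g(hi)
        assert g_lo * g_hi < 0, "lo and hi must bracket the solution"
        x = lo
        for _ in range(max_iter):
            # the secant line through (lo, g_lo) and (hi, g_hi) crosses zero here
            x = hi - g_hi * (hi - lo) / (g_hi - g_lo)
            gx = g(x)
            if abs(gx) < tol:
                break
            if g_lo * gx < 0:
                hi, g_hi = x, gx
            else:
                lo, g_lo = x, gx
        return x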
I'm conducting a molecular dynamics simulation for silica. Some time ago I turned to the fluctuating dipole model, and after much effort I'm still having problems implementing it.
In short, all oxygen atoms in the system are polarizable, and their dipole moments depend on their position with respect to all the other atoms in the system. More particularly, I use TS potential (http://digitallibrary.sissa.it/bitstream/handle/1963/2874/tangney.pdf?sequence=2), where dipoles are found iteratively at each time step.
This means that when evaluating forces acting on atoms, I have to take into account this potential energy dependency on coordinates.
Before, I was using simple pairwise potential models, and so I would set my program to compute forces using analytic formulas obtained by differentiating potential energy expression.
Now I'm at a loss: how do I implement this new potential? All the articles I've found give only the formulas, not the algorithm. As I see it, when I compute the forces acting on a certain atom, I have to take into account the change of that atom's dipole, the change of the dipoles of all the neighboring atoms, then the change of the dipoles of still more atoms, and so on, since they all depend on each other. After all, it is because of this interdependence that the dipoles are found iteratively at each time step. Clearly, I can't compute the forces iteratively for each atom, because the computational complexity of the algorithm would be far too high. Should I use some simple functions to account for the change of the dipoles? That doesn't look like a good idea either: the dipoles are calculated iteratively, with high precision, and then, where it actually matters (computing the forces), we would use crude functions?
So how do I implement this model? Also, is it possible to compute the forces analytically, as I did before, or is it necessary to compute them using a finite-difference formula for the derivative?
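(By the finite-difference option I mean something like the sketch below, where potential_energy is a placeholder that re-solves the dipoles self-consistently for each displaced geometry; the obvious drawback is two full dipole iterations per force component.)

    def numerical_force(potential_energy, positions, atom, axis, h=1e-5):
        # central-difference estimate of one force component: F = -dU/dx
        # potential_energy(positions) is assumed to converge the dipoles
        # self-consistently for the geometry it is given
        plus = [list(p) for p in positions]
        minus = [list(p) for p in positions]
        plus[atom][axis] += h
        minus[atom][axis] -= h
        return -(potential_energy(plus) - potential_energy(minus)) / (2 * h)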
I haven't found the answer to my question in the literature, but if you know of an article, site, or book where this material is covered, please direct me to that source.
Thank you for your time!
==================================================================================
UPDATE:
Thank you for your answer. Unfortunately, this was not my question. I didn't ask how to compute the dipoles, but how to compute the forces, given that those dipoles vary considerably with movement.
I tried to compute the forces in a straightforward manner (not taking into account the interdependence of the dipoles via their distances, just computing the dipoles at each step and then computing the forces as if those dipoles were static), but the results I got were not physically correct.
To analyze the situation, I set up a simulation of a system consisting of just two atoms: Si and O. They have opposite charges, and so they oscillate. The energy-versus-time graph looks like this:
The curve on the top represents kinetic energy, the one in the middle represents potential energy without taking dipole interaction into account, and the one at the bottom represents potential energy of the system, where dipole interaction was taken into account.
You can clearly see from this graph that the system is doing what it shouldn't do: climbing up the potential slope. So I decided this is due to the fact that I didn't take the coordinate dependence of the dipole moments into account. For instance, at a given time point we compute the forces, and they are directed so as to move both atoms toward each other. But when we do move them towards each other (even slightly), the dipole moment changes, and we find out that we actually ended up with higher potential energy than before! During the next time step the situation is the same.
So the question is how to take this effect into account, because the few ways I can think of are either far too computationally intensive or far too crude.
Not sure I fully understand your question, but it sounds like you might need to implement a Markov chain-type solution?
See this nice post for more info: http://freakonometrics.hypotheses.org/6803
EDIT.
The reason I suggest this is that it sounds like you have a system where the state of each atom depends on its neighbors, and in turn the neighbors' states depend on their neighbors, and so on. Conceptually this could be modeled as a huge matrix where you iteratively update each value based on its neighbors (???). This is intractable, but the linked article shows how to solve a problem with very large transition matrices using Markov chains instead of computing the actual matrix.
I am writing a maze generation algorithm, and this Wikipedia article caught my eye. I decided to implement it in Java, which was a cinch. The problem I am having is that, while a maze-like picture is generated, the maze often is not solvable and is often not interesting. What I mean by not interesting is that there are a vast number of unreachable places and often many solutions.
I implemented the 1234/3 rule (although it is easily changeable; see the comments for an explanation) with a roughly 50/50 distribution at the start. The mazes always reach an equilibrium where there is no change between time steps.
My question is, is there a way to guarantee the maze's solvability from a fixed start and end point? Also, is there a way to make the maze more interesting to solve (fewer/one solution and few/no unreachable places)? If this is not possible with cellular automata, please tell me. Thank you.
I don't think it's possible to ensure a solvable, interesting maze through simple cellular automata, unless there are some specific criteria that can be placed on the starting state. Cells have no knowledge of the overall shape, because each cell can't coordinate with the group as a whole.
If you're insistent on using them, you could do some combination of modification and pathfinding after generation is finished, but other methods (like the ones shown in the Wikipedia article or this question) are simpler to implement and won't result in walls that take up a whole cell (unless you want that).
the root of the problem is that "maze quality" is a global measure, but your automaton cells are restricted to a very local knowledge of the system.
to resolve this, you have three options:
add the global information from outside. generate mazes using the automaton and random initial data, then measure the maze quality (eg using flood fill or a bunch of other maze solving techniques) and repeat until you get a result you like (see the sketch after this list).
use a much more complex set of explicit rules and state. you can work out a set of rules / cell values that encode both the presence of walls and the lengths / quality of paths. for example, -1 would be a wall and a positive value would be the sum of all neighbours above and to the left. then positive values encode the path distance from top left, roughly. that's not enough, but it shows the general idea... you need to encode an algorithm about the maze "directly" in the rules of the system.
use a less complex, but still turing complete, set of rules, and encode the rules for maze generation in the initial state. for example, you could use conway's life and construct an initial state that is a "program" that implements maze generation via gliders etc etc.
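as a sketch of the first option (python for brevity, though the same idea ports directly to java; the grid convention of True for an open cell and False for a wall is just an assumption):

    from collections import deque

    def is_solvable(grid, start, end):
        # breadth-first flood fill over open cells; True if end is reachable from start
        rows, cols = len(grid), len(grid[0])
        seen, queue = {start}, deque([start])
        while queue:
            r, c = queue.popleft()
            if (r, c) == end:
                return True
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] and (nr, nc) not in seen:
                    seen.add((nr, nc))
                    queue.append((nr, nc))
        return False

generate with the automaton, test with is_solvable, and keep only mazes that pass; the same search can be extended to count unreachable cells or distinct solutions as a crude "interest" measure.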
if it helps any you could draw a parallel between the above and:
ghost in the machine / external user
FPGA
programming a general purpose CPU
Run a path finding algorithm over it. Dijkstra would give you a sure way to compute all solutions. A* would give you one good solution.
The difficulty of a maze can be measured by the speed at which these algorithms solve it.
You can add some dead-ends in order to shut down some solutions.
I was wondering if anyone knows what kind of algorithm could be used in my case. I have already run the optimizer on my multivariate function and found a solution to my problem, assuming that my function is regular enough. I slightly perturb the problem and would like to find the optimal solution, which is close to my last solution. Is there a very fast algorithm for this case, or should I just fall back to a regular one?
We probably need a bit more information about your problem; but since you know you're near the right solution, and if derivatives are easy to calculate, Newton-Raphson is a sensible choice, and if not, Conjugate-Gradient may make sense.
If you already have an iterative optimizer (for example, based on Powell's direction set method, or CG), why don't you use your initial solution as a starting point for the next run of your optimizer?
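For instance, a minimal warm-start sketch using SciPy (the quadratic objectives are just stand-ins for your real function):

    import numpy as np
    from scipy.optimize import minimize

    def objective(x):                 # stand-in for the original multivariate function
        return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

    def perturbed_objective(x):       # the same function after a small perturbation
        return (x[0] - 1.1) ** 2 + (x[1] + 1.9) ** 2

    first = minimize(objective, x0=np.zeros(2), method="BFGS")

    # restart from the previous optimum rather than from scratch; near the old
    # solution the optimizer typically needs only a handful of iterations
    second = minimize(perturbed_objective, x0=first.x, method="BFGS")
    print(second.x)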
EDIT: Regarding your comment: if calculating the Jacobian or the Hessian matrix gives you performance problems, try BFGS (http://en.wikipedia.org/wiki/BFGS_method); it avoids calculating the Hessian completely. You can find a (free-for-non-commercial) implementation of BFGS at http://www.alglib.net/optimization/lbfgs.php. A good description of the details you will find here.
And don't expect to gain anything by finding your initial solution with a less sophisticated algorithm.
So this is all about unconstrained optimization. If you need information about constrained optimization, I suggest you google for "SQP".
There are a bunch of algorithms for finding the roots of equations. If you know approximately where the root is, there are algorithms that will get you arbitrarily close very quickly, in a number of iterations that grows only logarithmically with the required precision, or better.
One is Newton's method
another is the Bisection Method
Note that these algorithms are for single-variable functions, but can be extended to multivariate functions.
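For example, a minimal Newton's method sketch for a single-variable function (f, df and the starting point here are arbitrary choices):

    def newton(f, df, x0, tol=1e-10, max_iter=50):
        # follow the tangent line to the next root estimate until the step is tiny
        x = x0
        for _ in range(max_iter):
            step = f(x) / df(x)
            x -= step
            if abs(step) < tol:
                break
        return x

    # example: the square root of 2 as the root of f(x) = x^2 - 2
    print(newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))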
Every minimization algorithm performs better (read: performs at all) if you have a good initial guess. In your case, the initial guess for the perturbed problem will be the minimum point of the unperturbed problem.
Then, you have to specify your requirements: you want speed. What accuracy do you want? Does space efficiency matter? Most importantly, what information do you have: only the value of the function, or do you also have the derivatives (possibly second derivatives)?
Some background on the problem would help too. Looking for a smooth function which has been discretized will be very different than looking for hundreds of unrelated parameters.
Global information (ie. is the function convex, is there a guaranteed global minimum or many local ones, etc) can be left aside for now. If you have trouble finding the minimum point of the perturbed problem, this is something you will have to investigate though.
Answering these questions will allow us to select a particular algorithm. There are many choices (and trade-offs) for multivariate optimization.
Also, which is quicker will very much depend on the problem (rather than on the algorithm), and should be determined by experimentation.
Though I don't know much about using computers in this capacity, I remember an article that used neuroevolutionary techniques to find "best-fit" equations relatively efficiently, given a known function complexity (linear, Nth-polynomial, exponential, logarithmic, etc.) and a set of point plots. As I recall, it was one of the earliest uses of what we now know as computational neuroevolution; because the functional complexity (and thus the number of terms) of the equation is known and fixed, a static neural net can be used and seeded with your closest values, then "mutated" and tested for fitness, with heuristics to make new nets closer to existing nets with high fitness. Using multithreading, many nets can be created, tested and evaluated in parallel.
I am looking for a general algorithm to help in situations with similar constraints as this example :
I am thinking of a system where images are constructed based on a set of operations. Each operation has a set of parameters. The total "gene" of the image is then the sequential application of the operations with the corresponding parameters. The finished image is then given a vote by one or more real humans according to how "beautiful" it is.
The question is what kind of algorithm would be able to do better than simply random search if you want to find the most beautiful image? (and hopefully improve the confidence over time as votes tick in and improve the fitness function)
Given that the operations will probably be correlated, it should be possible to do better than random search. So for example operation A with parameters a1 and a2 followed by B with parameters b1 could generally be vastly superior to B followed by A. The order of operations will matter.
I have tried googling for research papers on random walks and Markov chains, as those are my best guesses about where to look, but so far I have found no scenarios similar enough. I would really appreciate even just a hint of where to look for such an algorithm.
I think what you are looking for falls into a broad research area called metaheuristics (which includes many non-linear optimization algorithms such as genetic algorithms, simulated annealing, or tabu search).
Then, if your raw fitness function is just giving a statistical value that somehow approximates the real (but unknown) fitness function, you can probably still use most metaheuristics by (somehow) smoothing your fitness function (averaging repeated results would do that).
Do you mean the Metropolis algorithm?
This approach uses a random walk, weighted by the fitness function. It is useful for locating local extrema in complicated fitness landscapes, but is generally slower than deterministic approaches where those will work.
You're pretty much describing a genetic algorithm in which the sequence of operations represents the "gene" ("chromosome" would be a better term for this, where the parameter[s] passed to each operation represents a single "gene", and multiple genes make up a chromosome), the image produced represents the phenotypic expression of the gene, and the votes from the real humans represent the fitness function.
If I understand your question, you're looking for an alternative algorithm of some sort that will evaluate the operations and produce a "beauty" score similar to what the real humans produce. Good luck with that - I don't think there really is any such thing, and I'm not surprised that you didn't find anything. Human brains, and correspondingly human evaluations of aesthetics, are much too staggeringly complex to be reducible to a simplistic algorithm.
Interestingly, your question seems to encapsulate the bias against using real human responses as the fitness function in genetic-algorithm-based software. This is a subject of relevance to me, since my namesake software is specifically designed to use human responses (or "votes") to evaluate music produced via a genetic process.
Simple Markov Chain
Markov chains, which you mention, aren't a bad way to go. A Markov chain is just a state machine, represented as a graph with edge weights which are transition probabilities. In your case, each of your operations is a node in the graph, and the edges between the nodes represent allowable sequences of operations. Since order matters, your edges are directed. You then need three components:
A generator function to construct the graph of allowed transitions (which operations are allowed to follow one another). If any operation is allowed to follow any other, then this is easy to write: all nodes are connected, and your graph is said to be complete. You can initially set all the edge weights to 1.
A function to traverse the graph, crossing N nodes, where N is your 'gene-length'. At each node, your choice is made randomly, but proportionally weighted by the values of the edges (so better edges have a higher chance of being selected).
A weighting update function which can be used to adjust the weightings of the edges when you get feedback about an image. For example, a simple update function might be to give each edge involved in a 'pleasing' image a positive vote each time that image is nominated by a human. The weighting of each edge is then normalised, with the currently highest voted edge set to 1, and all the others correspondingly reduced.
This graph is then a simple learning network which will be refined by subsequent voting. Over time as votes accumulate, successive traversals will tend to favour the more highly rated sequences of operations, but will still occasionally explore other possibilities.
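A minimal sketch of those three components (the operation names, vote size, and renormalisation scheme here are illustrative choices, not the only reasonable ones):

    import random

    def build_graph(operations):
        # complete directed graph; every edge weight starts at 1
        return {a: {b: 1.0 for b in operations if b != a} for a in operations}

    def traverse(graph, length):
        # random walk of `length` nodes, choosing each edge in proportion to its weight
        sequence = [random.choice(list(graph))]
        while len(sequence) < length:
            edges = graph[sequence[-1]]
            sequence.append(random.choices(list(edges), weights=list(edges.values()))[0])
        return sequence

    def reward(graph, sequence, vote=1.0):
        # give each edge used in a pleasing image a vote, then renormalise so the
        # highest-weighted edge is 1 and all others are correspondingly reduced
        for a, b in zip(sequence, sequence[1:]):
            graph[a][b] += vote
        best = max(w for edges in graph.values() for w in edges.values())
        for edges in graph.values():
            for b in edges:
                edges[b] /= best

    graph = build_graph(["A", "B", "C", "D"])
    gene = traverse(graph, length=3)      # e.g. ['C', 'A', 'D']
    reward(graph, gene)                   # call this when a human votes the image up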
Advantages
The main advantage of this approach is that it's easy to understand and code, and makes very few assumptions about the problem space. This is good news if you don't know much about the search space (e.g. which sequences of operations are likely to be favourable).
It's also easy to analyse and debug - you can inspect the weightings at any time and very easily calculate things like the top 10 best sequences known so far, etc. This is a big advantage - other approaches are typically much harder to investigate ("why did it do that?") because of their increased abstraction. A simplex crawler, for instance, may be very efficient, but you can easily melt your brain trying to follow and debug its convergence steps!
Even if you implement a more sophisticated production algorithm, having a simple baseline algorithm is crucial for sanity checking and efficiency comparisons. It's also easy to tinker with, by messing with the update function. For example, an even more baseline approach is pure random walk, which is just a null weighting function (no weighting updates) - whatever algorithm you produce should perform significantly better than this if its existence is to be justified.
This idea of baselining is very important if you want to evaluate the quality of your algorithm's output empirically. In climate modelling, for example, a simple test is "does my fancy simulation do any better at predicting the weather than one where I simply predict today's weather will be the same as yesterday's?" Since weather is often correlated on a timescale of several days, this baseline can give surprisingly good predictions!
Limitations
One disadvantage of the approach is that it is slow to converge. A more aggressive choice of update function will push promising results faster (for example, weighting new results according to a power law, rather than the simple linear normalisation), at the cost of giving alternatives less credence.
This is equivalent to fiddling with the mutation rate and gene pool size in a genetic algorithm, or the cooling rate of a simulated annealing approach. The tradeoff between 'climbing hills or exploring the landscape' is an inescapable "twiddly knob" (free parameter) which all search algorithms must deal with, either directly or indirectly. You are trying to find the highest point in some fitness search space. Your algorithm is trying to do that in fewer tries than random inspection, by looking at the shape of the space and trying to infer something about it. If you think you're going up a hill, you can take a guess and jump further. But if it turns out to be a small hill in a bumpy landscape, then you've just missed the peak entirely.
Also note that since your fitness function is based on human responses, you are limited to a relatively small number of iterations regardless of your choice of algorithmic approach. For example, you would see the same issue with a genetic algorithm approach (fitness function limits the number of individuals and generations) or a neural network (limited training set).
A final potential limitation is that if your "gene-lengths" are long, there are many nodes, and many transitions are allowed, then the size of the graph will become prohibitive, and the algorithm impractical.