Continuous action-state-space and tiling - algorithm

After getting used to the Q-learning algorithm in a discrete action-state space, I would now like to extend it to continuous spaces. To do this I read the chapter On-Policy Control with Approximation of Sutton's introduction. There, differentiable functions such as a linear function or an ANN are recommended to solve the problem of a continuous action-state space. Nevertheless, Sutton then describes the tiling method, which maps the continuous variables onto a discrete representation. Is this always necessary?
To understand these methods, I tried to implement the Hill Climbing Car example from the book without the tiling method, using a linear function q. As my state space is two-dimensional and my action is one-dimensional, I used a three-dimensional weight vector w in this equation:
When I now try to choose the action that maximizes the output, the obvious answer is a=1 if w_2 > 0. As a result, the weight slowly converges to zero from above and the agent does not learn anything useful. Since Sutton is able to solve the problem using tiling, I am wondering whether my problem is caused by the absence of the tiling method or whether I am doing something else wrong.
So: is tiling always necessary?

Regarding your main question about tiling: no, tiling is not always necessary.
As you did, it's a good idea to implement an easy example such as the Hill Climbing Car in order to fully understand the concepts. Here, however, you are misunderstanding something important. When the book talks about linear methods, it is referring to methods that are linear in the parameters, which means that you can extract a set of (non-linear) features and combine them linearly. This kind of approximator can represent functions much more complex than a standard linear regression.
The parametrization you have proposed is not able to represent a non-linear Q-function. Taking into account that in the Hill Climbing problem you want to learn Q-functions of this style:
You will need something more powerful than your linear parametrization. An easy solution for your problem could be to use a Radial Basis Function (RBF) network. In this case, you use a set of features (or basis functions, for example Gaussian functions) to map your state space:
Additionally, if your action space is discrete and small, the easiest solution is to maintain an independent RBF network for each action. To select the action, simply compute the Q-value for each action and select the one with the highest value. In this way you avoid the (complex) optimization problem of selecting the best action in a continuous function.
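To make the "one RBF network per action" idea concrete, here is a minimal sketch in Python. It is not code from Sutton or from the Busoniu et al. book; the function names, the 4x4 grid of centers, the shared width, and the assumption that the state has been rescaled to the unit square are all mine, chosen only for illustration.

```python
import math

def rbf_features(state, centers, width):
    """Evaluate one Gaussian basis function per center at the given state."""
    feats = []
    for center in centers:
        sq_dist = sum((s - c) ** 2 for s, c in zip(state, center))
        feats.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    return feats

def q_value(weights, feats):
    """Q is linear in the weights but non-linear in the state."""
    return sum(w * f for w, f in zip(weights, feats))

def greedy_action(state, weights_per_action, centers, width):
    """Compute Q for every discrete action and return the index of the best."""
    feats = rbf_features(state, centers, width)
    values = [q_value(w, feats) for w in weights_per_action]
    return max(range(len(values)), key=values.__getitem__)

def q_learning_update(weights_per_action, centers, width,
                      state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    """One Q-learning step that only touches the weights of the taken action."""
    feats = rbf_features(state, centers, width)
    next_feats = rbf_features(next_state, centers, width)
    best_next = max(q_value(w, next_feats) for w in weights_per_action)
    td_error = reward + gamma * best_next - q_value(weights_per_action[action], feats)
    weights = weights_per_action[action]
    for i, f in enumerate(feats):
        weights[i] += alpha * td_error * f
    return td_error

# Assumed setup: states rescaled to the unit square, a 4x4 grid of centers,
# and three discrete actions (e.g. reverse, coast, forward).
centers = [(i / 3.0, j / 3.0) for i in range(4) for j in range(4)]
weights_per_action = [[0.0] * len(centers) for _ in range(3)]
```

Because the action only selects which weight vector is used, the greedy step is a plain maximum over a handful of numbers, so the degenerate "a=1 whenever w_2 > 0" behaviour of a function that is linear in the action cannot occur.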
You can find a more detailed explanation in the book by Busoniu et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, pages 49-51. It's available for free here.

Related

Finding an optimum learning rule for an ANN

How do you find an optimum learning rule for a given problem, say a multiple category classification?
I was thinking of using Genetic Algorithms, but I know there are issues surrounding performance. I am looking for real world examples where you have not used the textbook learning rules, and how you found those learning rules.
Nice question BTW.
Classification algorithms can be characterized in many ways, for example:
What the algorithm strongly prefers (that is, what type of data is most suitable for it).
Training overhead (does it take a lot of time to train?).
When it is effective (large, medium, or small amounts of data).
The complexity of the analyses it can deliver.
Therefore, for your problem of classifying multiple categories, I would use Online Logistic Regression (from SGD), because it works well with small to medium data sizes (fewer than tens of millions of training examples) and it's really fast.
Another example:
Let's say that you have to classify a large amount of text data. Then Naive Bayes is your baby, because it strongly prefers text analysis, even though SVM and SGD are faster and, in my experience, easier to train. But those methods (SVM and SGD) apply when the data size is medium or small, not large.
In general, any data mining person will ask themselves the four aforementioned questions when starting any ML or simple mining project.
After that you have to measure the AUC, or any other relevant metric, to see how well you have done, because you might use more than one classifier in a project. Sometimes, when you think you have found your perfect classifier, the results turn out to be poor under some measurement techniques, so you go back over your questions to find where you went wrong.
Hope that helps.
When you input a vector x to the net, the net gives an output that depends on all the weights (vector w). There will be an error between the output and the true answer. The average error (e) is a function of w, say e = F(w). Suppose you have a one-layer, two-dimensional network; then the graph of F may look like this:
When we talk about training, we are actually talking about finding the w that minimizes e. In other words, we are searching for the minimum of a function. To train is to search.
So your question is how to choose the search method. My suggestion would be: it depends on what the surface of F(w) looks like. The wavier it is, the more randomized the method should be, because a simple method based on gradient descent has a bigger chance of getting trapped in a local minimum, so you lose the chance to find the global minimum. On the other hand, if the surface of F(w) looks like one big pit, then forget the genetic algorithm. Simple backpropagation or anything based on gradient descent would be very good in this case.
You may ask: how can I know what the surface looks like? That comes with experience. Or you might randomly sample some w and calculate F(w) to get an intuitive view of the surface.

Algorithm, find local/global minima, function of 2 variables

Let us have a function of 2 variables:
z=f(x,y) = ....
Can you advise me on a suitable method (easy to implement as an algorithm, with fast convergence) to calculate the local extrema on some intervals, or the global extremum?
Thanks for your help.
Gradient Descent is a wise choice for finding local minima for functions, assuming you can calculate the gradient.
Depending on the specific domain - sometimes there are other solutions as well.
For example, for linear least squares (which is used for regression in the field of machine learning), you can find the local minimum (which is also the global one, since the function in this case is convex) using the normal equations.
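As a small illustration of the normal equations, here is a sketch for the simplest case, fitting a straight line y = intercept + slope * x. The function name and the sample data are made up for the example; for more than a couple of parameters you would normally hand the linear system to a numerical library instead of solving it by hand.

```python
def fit_line_normal_equations(xs, ys):
    """Least-squares line fit by solving the 2x2 normal equations directly."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations: [[n, sx], [sx, sxx]] @ [intercept, slope] = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are identical; the fit is not unique")
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

intercept, slope = fit_line_normal_equations([0, 1, 2, 3], [1, 3, 5, 7])
```

No iteration is involved: because the squared error is convex in the parameters, the single stationary point found this way is the global minimum.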
EDIT: As suggested in the comments: if you don't have any information on the function, you might be able to use a hill climbing algorithm in which you sample candidate directions to advance in (you need to sample because there are infinitely many directions if the function is over the real numbers) and choose the most promising one.
You can also try to extract the derivatives numerically using numerical differentiation, and use gradient descent.
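A rough sketch of that last suggestion, gradient descent driven by central-difference derivatives, could look like this. The step size, tolerance, and the test function `bowl` are arbitrary choices for the example, and like any gradient descent it only finds a local minimum near the starting point.

```python
def numerical_gradient(f, x, y, h=1e-5):
    """Approximate (df/dx, df/dy) with central differences."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

def gradient_descent(f, x, y, step=0.1, tol=1e-8, max_iter=10000):
    """Walk downhill until the gradient is (almost) zero or we run out of steps."""
    for _ in range(max_iter):
        gx, gy = numerical_gradient(f, x, y)
        if gx * gx + gy * gy < tol * tol:
            break
        x -= step * gx
        y -= step * gy
    return x, y

def bowl(x, y):
    """Example function with a single minimum at (1, -3)."""
    return (x - 1) ** 2 + 2 * (y + 3) ** 2

x_min, y_min = gradient_descent(bowl, 0.0, 0.0)
```

A fixed step is the simplest choice but has to be small enough for the steepest part of the function; a line search or a decaying step is the usual remedy if the iteration oscillates or diverges.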
You might also look into simulated annealing if you like the idea of algorithms driven by ideas from thermodynamics and metallurgy.
Or perhaps you'd rather look at genetic algorithms, because you like the current explosion of knowledge in biology.

Multiple parameter optimization with lots of local minima

I'm looking for algorithms to find a "best" set of parameter values. The function in question has a lot of local minima and changes very quickly. To make matters even worse, testing a set of parameters is very slow - on the order of 1 minute - and I can't compute the gradient directly.
Are there any well-known algorithms for this kind of optimization?
I've had moderate success with just trying random values. I'm wondering if I can improve the performance by making the random parameter chooser have a lower chance of picking parameters close to ones that had produced bad results in the past. Is there a name for this approach so that I can search for specific advice?
More info:
Parameters are continuous
There are on the order of 5-10 parameters. Certainly not more than 10.
How many parameters are there -- eg, how many dimensions in the search space? Are they continuous or discrete - eg, real numbers, or integers, or just a few possible values?
Approaches that I've seen used for this kind of problem have a similar overall structure: take a large number of sample points and somehow adjust them all towards regions that have "good" answers. Since you have a lot of points, their relative differences serve as a makeshift gradient.
Simulated Annealing: The classic approach. Take a bunch of points and probabilistically move some to a neighbouring point chosen at random, depending on how much better it is.
Particle Swarm Optimization: Take a "swarm" of particles with velocities in the search space and probabilistically move a particle at random; if it's an improvement, let the whole swarm know.
Genetic Algorithms: This is a little different. Rather than using the neighbours' information as above, you take the best results each time and "cross-breed" them, hoping to get the best characteristics of each.
The Wikipedia links have pseudocode for the first two; GA methods have so much variety that it's hard to list just one algorithm, but you can follow links from there. Note that there are implementations for all of the above out there that you can use or take as a starting point.
Note that all of these, and really any approach to this kind of high-dimensional search problem, are heuristics, which means they have parameters that have to be tuned to your particular problem. That can be tedious.
By the way, the fact that the function evaluation is so expensive can be made to work for you a bit; since all the above methods involve lots of independent function evaluations, that piece of the algorithm can be trivially parallelized with OpenMP or something similar to make use of as many cores as you have on your machine.
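To give a feel for the first of these methods, here is a bare-bones simulated annealing sketch. It uses a single walker rather than a bunch of points, Gaussian proposal steps, and a geometric cooling schedule; those choices, the parameter defaults, and the test function `rugged` are assumptions for the example, and they are exactly the knobs you would have to tune for a real problem.

```python
import math
import random

def simulated_annealing(cost, start, step=0.5, t_start=1.0, t_end=1e-3,
                        n_iter=2000, rng=None):
    """Minimize `cost` over a list of floats with a single annealing walker."""
    rng = rng or random.Random()
    current = list(start)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for i in range(n_iter):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / max(1, n_iter - 1))
        candidate = [x + rng.gauss(0.0, step) for x in current]
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse points with probability
        # exp(-delta / t), which shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost

def rugged(p):
    """Example cost with many local minima."""
    return sum(x * x + 2.0 * math.sin(5.0 * x) for x in p)

best, best_cost = simulated_annealing(rugged, [3.0, -2.0], rng=random.Random(0))
```

With a function that takes a minute per evaluation, `n_iter` is the real budget: 2000 iterations would already be more than a day of sequential evaluations, which is why running several independent walkers in parallel is attractive.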
Your situation seems to be similar to that of the poster of Software to Tune/Calibrate Properties for Heuristic Algorithms, and I would give you the same advice I gave there: consider a Metropolis-Hastings like approach with multiple walkers and a simulated annealing of the step sizes.
The difficulty in using Monte Carlo methods in your case is the expensive evaluation of each candidate. How expensive, compared to the time you have at hand? If you need a good answer in a few minutes this isn't going to be fast enough. If you can leave it running overnight, it'll work reasonably well.
Given a complicated search space, I'd recommend a random initial distribution. Your final answer may simply be the best individual result recorded during the whole run, or the mean position of the walker with the best result.
Don't be put off that I was discussing maximizing there and you want to minimize: the figure of merit can be negated or inverted.
I've tried Simulated Annealing and Particle Swarm Optimization. (As a reminder, I couldn't use gradient descent because the gradient cannot be computed).
I've also tried an algorithm that does the following:
Pick a random point and a random direction
Evaluate the function
Keep moving along the random direction for as long as the result keeps improving, speeding up on every successful iteration.
When the result stops improving, step back and instead attempt to move into an orthogonal direction by the same distance.
This "orthogonal direction" was generated by creating a random orthogonal matrix (adapted this code) with the necessary number of dimensions.
If moving in the orthogonal direction improved the result, the algorithm just continued with that direction. If none of the directions improved the result, the jump distance was halved and a new set of orthogonal directions would be attempted. Eventually the algorithm concluded it must be in a local minimum, remembered it and restarted the whole lot at a new random point.
This approach performed considerably better than Simulated Annealing and Particle Swarm: it required fewer evaluations of the (very slow) function to achieve a result of the same quality.
Of course my implementations of S.A. and P.S.O. could well be flawed - these are tricky algorithms with a lot of room for tweaking parameters. But I just thought I'd mention what ended up working best for me.
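For anyone who wants to try something similar, here is a simplified sketch of the procedure described above. It is my reading of the steps, not the original code: the random orthogonal directions come from Gram-Schmidt on Gaussian vectors, each direction is tried with both signs, and the evaluation budget, speed-up factor, and the `sphere` test function are assumptions for the example.

```python
import random

def random_orthonormal_basis(dim, rng):
    """Gram-Schmidt on random Gaussian vectors yields random orthogonal directions."""
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = sum(vi * vi for vi in v) ** 0.5
        if norm > 1e-12:
            basis.append([vi / norm for vi in v])
    return basis

def local_search(cost, point, rng, step=1.0, min_step=1e-3,
                 speedup=1.5, max_evals=5000):
    """Follow improving directions, speeding up; halve the jump when stuck."""
    best = list(point)
    best_cost = cost(best)
    evals = 1
    while step > min_step and evals < max_evals:
        improved = False
        for direction in random_orthonormal_basis(len(best), rng):
            for sign in (1.0, -1.0):
                jump = step
                while evals < max_evals:
                    trial = [p + sign * jump * d for p, d in zip(best, direction)]
                    trial_cost = cost(trial)
                    evals += 1
                    if trial_cost >= best_cost:
                        break  # "step back": keep the last good point
                    best, best_cost = trial, trial_cost
                    jump *= speedup  # speed up after every success
                    improved = True
        if not improved:
            step *= 0.5  # no direction helped: halve the jump distance
    return best, best_cost

def restarting_search(cost, dim, low, high, n_restarts=5, seed=0):
    """Run the local search from several random points and keep the best minimum."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_restarts):
        start = [rng.uniform(low, high) for _ in range(dim)]
        results.append(local_search(cost, start, rng))
    return min(results, key=lambda r: r[1])

def sphere(p):
    """Example cost with its minimum at (1, 1, 1)."""
    return sum((x - 1.0) ** 2 for x in p)

best, best_cost = restarting_search(sphere, dim=3, low=-5.0, high=5.0)
```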
I can't really help you with finding an algorithm for your specific problem.
However, with regard to the random choosing of parameters, I think what you are looking for are genetic algorithms. Genetic algorithms are generally based on choosing some random inputs, selecting those that are the best fit (so far) for the problem, and randomly mutating/combining them to generate a next generation, from which the best are again selected.
If the function is more or less continuous (that is, small mutations of good inputs generally won't generate bad inputs, with "small" being somewhat loosely defined), this would work reasonably well for your problem.
There is no general way to answer your question. There are lots of books/papers on the subject matter, but you'll have to choose your path according to your needs, which are not clearly stated here.
Some things to know, however: 1 min/test is way too much for any algorithm to handle. I guess that in your case, you really must do one of the following:
get 100 computers to cut your parameter testing time down to something reasonable
really try to work out your parameters by hand and mind. There must be some redundancy, or at least some sanity check, so that you can test your case in <1min
for possible result sets, try to figure out some 'operations' that modify a set slightly instead of just randomizing it. For example, in TSP a basic operator is lambda, which swaps two nodes and thus creates a new route. Yours could be shifting some number up or down by some value.
then, find yourself some nice algorithm; your starting point can be somewhere here. The book is an invaluable resource for anyone who is starting out with problem-solving.

Genetic Algorithms applied to Curve Fitting

Let's imagine I have an unknown function that I want to approximate via Genetic Algorithms. For this case, I'll assume it is y = 2x.
I'd have a DNA composed of 5 elements, one y for each x from x = 0 to x = 4, with which, after a lot of trials and computation, I'd arrive near something of the form:
best_adn = [ 0, 2, 4, 6, 8 ]
Keep in mind that I don't know beforehand whether it is a linear function, a polynomial, or something much uglier. Also, my goal is not to infer the type of function from best_adn; I just want those points so I can use them later.
This was just an example problem. In my case, instead of having only 5 points in the DNA, I have something like 50 or 100. What is the best approach with GA to find the best set of points?
1. Generate a population of 100 and discard the worst 20%.
2. Recombine the remaining 80%? How? Cutting them at a random point and then putting together the first part of the ADN of the father with the second part of the ADN of the mother?
3. Mutation: how should I define mutation in this kind of problem?
4. Is it worth using Elitism?
5. Any other simple idea worth using around?
Thanks
Usually you only find these out by experimentation... perhaps writing a GA to tune your GA.
But that aside, I don't understand what you're asking. If you don't know what the function is, and you also don't know the points to begin with, how do you determine fitness?
From my current understanding of the problem, this is better fitted by a neural network.
edit:
2. Recombine the remaining 80%? How? Cutting them at a random point and then putting together the first part of ADN of the father with the second part of ADN of the mother?
This is called crossover. If you want to be saucy, do something like picking a random starting point and swapping a random length. For instance, you have 10 elements in an object: randomly choose a spot X between 1 and 10 and swap x..10-rand%10+1.. you get the picture... spice it up a little.
3. Mutation, how should I define in this kind of problem mutation?
Usually that depends more on what is defined as a legal solution than on anything else. You can do mutation the same way you do crossover, except that you fill the segment with random (legal) data rather than swapping with another specimen, and you do it at a MUCH lower rate.
4. Is it worth using Elitism?
Experiment and find out.
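As a starting point for that experimentation, here is a minimal sketch of one-point crossover, low-rate mutation, and elitism for a list-of-numbers DNA. The `fitness` function is a hypothetical stand-in that scores against the y = 2x example from the question (in a real problem you need your own measure), and the mutation rate, keep fraction, and elite count are arbitrary values to tune.

```python
import random

TARGET = [0, 2, 4, 6, 8]  # hypothetical stand-in: y = 2x for x = 0..4

def fitness(dna):
    """Higher is better: negative squared error against the stand-in target."""
    return -sum((g - t) ** 2 for g, t in zip(dna, TARGET))

def one_point_crossover(father, mother, rng):
    """First part of the father's DNA joined to the second part of the mother's."""
    cut = rng.randrange(1, len(father))
    return father[:cut] + mother[cut:]

def mutate(dna, rng, rate=0.05, scale=1.0):
    """Perturb each gene with a small probability."""
    return [g + rng.gauss(0.0, scale) if rng.random() < rate else g for g in dna]

def next_generation(population, fitness, rng, keep=0.8, elite=1):
    """Drop the worst, copy the elite unchanged, breed the rest from survivors."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:max(2, int(len(ranked) * keep))]
    children = [list(dna) for dna in ranked[:elite]]  # elitism
    while len(children) < len(population):
        father, mother = rng.sample(survivors, 2)
        children.append(mutate(one_point_crossover(father, mother, rng), rng))
    return children

rng = random.Random(1)
population = [[rng.uniform(0.0, 10.0) for _ in range(5)] for _ in range(100)]
initial_best = max(fitness(dna) for dna in population)
for _ in range(30):
    population = next_generation(population, fitness, rng)
final_best = max(fitness(dna) for dna in population)
```

With `elite=1` the best individual is carried over untouched, so the best fitness can never get worse from one generation to the next; setting `elite=0` is an easy way to see what elitism buys you on your own problem.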
Gaussian adaptation usually outperforms standard genetic algorithms. If you don't want to write your own package from scratch, the Mathematica Global Optimization package is EXCELLENT -- I used it to fit a really nasty nonlinear function where standard fitters failed miserably.
Edit:
Wikipedia Article
If you hunt down copies of the papers listed in the article, you can find whitepapers and implementations. In general though, you should have some idea of what the solution space for maximizing your fitness function looks like. If the number of variables is small, or the number of local maxima is small, or they are connected/slope down to a global maximum, simple least squares works fine. If the area around each local maximum is small (i.e. you have to get a damned good solution to hit the best one, otherwise you hit a bad one), then fancier algorithms are needed.
Choosing variables for a genetic algorithm depends on what the solution space will look like.

What are some compact algorithms for generating interesting time series data?

The question sort of says it all.
Whether it's for code testing purposes, or you're modeling a real-world process, or you're trying to impress a loved one, what are some algorithms that folks use to generate interesting time series data? Are there any good resources out there with a consolidated list? No constraints on values (except plus or minus infinity) or dimensions, but I'm looking for examples that people have found useful or exciting in practice.
Bonus points for parsimonious and readable code samples.
There are a ton of PRN generators out there, and you can always get free random bits, or even buy them on CD or DVD.
I've used simple sine wave generators mixed together with some phase and amplitude noise thrown in to get signals that sound and look interesting to humans when put through speakers or lights, but I don't know what you mean by interesting.
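A minimal sketch of that sine-mixing idea is below. The way the noise is applied (a random-walk phase drift and a per-sample amplitude jitter) is just one plausible reading, and the function name, parameters, and example frequencies are made up for illustration.

```python
import math
import random

def noisy_sine_mix(n_samples, components, sample_rate=100.0,
                   amp_noise=0.1, phase_noise=0.05, seed=None):
    """Sum of sine waves given as (frequency_hz, amplitude) pairs, with noise."""
    rng = random.Random(seed)
    phases = [0.0] * len(components)
    series = []
    for n in range(n_samples):
        t = n / sample_rate
        value = 0.0
        for k, (freq, amp) in enumerate(components):
            phases[k] += rng.gauss(0.0, phase_noise)            # phase random walk
            jittered_amp = amp * (1.0 + rng.gauss(0.0, amp_noise))  # amplitude jitter
            value += jittered_amp * math.sin(2.0 * math.pi * freq * t + phases[k])
        series.append(value)
    return series

# Three components at unrelated frequencies give a non-repeating look.
signal = noisy_sine_mix(1000, [(1.0, 1.0), (2.7, 0.5), (7.3, 0.25)], seed=42)
```

Setting both noise levels to zero gives the clean mixture back, which is handy when the series is used as test data and you want a known ground truth.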
There are ways to generate data that looks interesting in a chart form, but that would be different than data used on a stock chart, and neither would make a nice "static" image such as produced by an analog television tuned to a null channel.
You can use Conway's game of life as a PRN, and "listen" to cells (or run all the cells through a logic circuit) to get some interesting time based signals.
It would be interesting to look at the graph of DB updates/inserts for Stackoverflow over time, and you could mine that data.
There really are infinite ways to generate an "interesting" time series data. Can you narrow the scope of your question?
Don't have an answer for the algorithm part but you can see how "realistic" your data is with Benford's law
Try the kind of recurrences that can give variously simple or chaotic series depending on the part of their phase space you explore: the simplest I can think of is the logistic map x(n+1) = r * x(n) * ( 1 - x(n) ). For most r from roughly 3.57 up to 4 you get chaotic results that depend sensitively on the initial point.
If you graph this versus time you can get lots of different series just by manipulating that parameter r. If you were to graph it as x(n+1) v. x(n) without connecting dots, you see a simple parabola take shape over time.
This is one of the most basic functions from chaos theory. Trying more interesting polynomials, graphing them as x(n+1) v. x(n) and watching a shape form, and then graphing x(n) v. n is a fun and interesting way to create series.
Graphing x(n+1) v. x(n) makes it quickly obvious if you're only visiting a small number of points. Deeper recurrences become more interesting as well, and using different values of x(0) to check on sensitivity to initial conditions is also of interest.
But for simplicity, control by a single parameter, and ability to find something to read about your recurrence, it'll be hard to beat the logistic map.
I recommend: http://en.wikipedia.org/wiki/Logistic_map. It has a nice description of what to expect from different values of r.
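Since the question asked for parsimonious code, here is a short Python sketch of the logistic map. The function name and the example values of r and x(0) are just illustrative choices.

```python
def logistic_map(r, x0, n):
    """Return the first n terms of x(k+1) = r * x(k) * (1 - x(k)), starting at x0."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

settled = logistic_map(2.0, 0.5, 50)    # x = 0.5 is a fixed point for r = 2
chaotic = logistic_map(3.9, 0.2, 200)   # irregular series
nearby = logistic_map(3.9, 0.2 + 1e-9, 200)  # tiny change in x(0)

# Pairs for the x(n+1) v. x(n) plot: they all lie on the parabola r*x*(1-x).
pairs = list(zip(chaotic[:-1], chaotic[1:]))
```

Plotting `chaotic` and `nearby` against n on the same axes shows the sensitivity to initial conditions: the two series track each other at first and then separate completely.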

Resources