In reinforcement learning using feature approximation, does one have a single set of weights or a set of weights for each action?

This question is an attempt to reframe an earlier question to make it clearer.
This slide shows an equation for Q(state, action) in terms of a set of weights and feature functions.
These discussions (The Basic Update Rule and Linear Value Function Approximation) show a set of weights for each action.
The reason they are different is that the first slide assumes you can anticipate the result of performing an action and then find features for the resulting states. (Note that the feature functions are functions of both the current state and the anticipated action.) In that case, the same set of weights can be applied to all the resulting features.
But in some cases, one can't anticipate the effect of an action. Then what does one do? Even if one has perfect weights, one can't apply them to the results of applying the actions if one can't anticipate those results.
My guess is that the second pair of slides deals with that problem. Instead of performing an action and then applying weights to the features of the resulting states, compute features of the current state and apply possibly different weights for each action.
Those are two very different ways of doing feature-based approximation. Are they both valid? The first one makes sense in situations like Taxi, in which one can effectively simulate what the environment will do for each action. But in some cases, e.g., cart-pole, that's not possible or feasible. Then it would seem you need a separate set of weights for each action.
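To pin down what I mean, here is a minimal sketch of the two parameterizations as I understand them (all names and sizes are made up for illustration):

import numpy as np

n_features, n_actions = 4, 4     # made-up sizes

def phi_sa(state, action):       # hypothetical features of a (state, action) pair
    return np.ones(n_features)

def phi_s(state):                # hypothetical features of the state alone
    return np.ones(n_features)

# First style: a single weight vector applied to (state, action) features.
w = np.zeros(n_features)
def q_single(state, action):
    return w @ phi_sa(state, action)

# Second style: one weight vector per action, applied to state features.
W = np.zeros((n_actions, n_features))
def q_per_action(state, action):
    return W[action] @ phi_s(state)

In both cases the greedy policy is the argmax of Q(s, a) over actions; the difference is only in how Q is parameterized.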
Is this the right way to think about it, or am I missing something?
Thanks.

Related

How to use a Kalman filter to calculate the next event date

I have a problem in which a particular error/event happened on
1-Jan-2017,
22-Feb-2017,
3-April-2017, and
9-July-2017,
and I have to predict when the next event is going to occur. I am planning to try the Kalman filter, but it involves a lot of statistical terminology, and on the internet I didn't find any easy explanation or simple programming example of a Kalman filter algorithm that estimates the next event dates. Can someone explain it in simple terms, or suggest a parallel algorithm that can be used for the same purpose?
Let E_i be the i-th event, and let IET_i = E_{i+1} - E_i be the i-th inter-event time, i.e., the time between one event and the next. Then E_{i+1} = E_i + IET_i: the next event can be forecast from the most recent event based on the IET.
Since the past is already determined, the only random quantity when projecting the next event is the IET, so E[E_{i+1}] = E_i + E[IET_i] (where E[] denotes expected value). You don't need to know the distribution of the IETs to estimate their expected value; you only need to assume that the IETs are identically distributed. (They don't even need to be independent.) In other words, if the IETs are identically distributed, then the average of the historical IETs is an unbiased estimator of their expected value.
There is a simple Kalman filter estimator to update estimates of an average as you obtain new data. See equations (2) & (3) from this post on math.stackexchange.
Note that this approach just gives a point predictor for the expected value. It won't allow you to make any probability statements about how likely it is the next event happens before or after some specified date. To do that you would need distributional information about the IETs.
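For instance, here is a minimal sketch of that idea applied to the dates in the question (the update is the standard incremental running mean, which, as the edit in the next answer notes, is essentially what the linked Kalman estimator reduces to):

from datetime import date, timedelta

events = [date(2017, 1, 1), date(2017, 2, 22), date(2017, 4, 3), date(2017, 7, 9)]

# Inter-event times, in days.
iets = [(b - a).days for a, b in zip(events, events[1:])]

# Incremental running mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n.
mean_iet = 0.0
for n, iet in enumerate(iets, start=1):
    mean_iet += (iet - mean_iet) / n

# Point prediction for the next event.
print(events[-1] + timedelta(days=round(mean_iet)))   # 2017-09-10 for these dates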
Edit: Thanks to #pjs for his remarks. I will update my answer accordingly as soon as I can. However, many authors in the robotics/computer vision communities (e.g. Thrun et al.) seem to define Kalman filters directly as Gaussian filters (and, for those familiar with the computer vision/SLAM literature, some computer vision works seem to discard standard EKF-based SLAM since the Gaussian assumption doesn't hold for 3d points). In #pjs's answer, the Gaussian filter is actually nothing more than a running average and doesn't provide a covariance (which, in some applications, can be considered the only justification for using a Kalman filter instead of non-linear minimization of an equivalent cost function), so it seems pretty useless without an assumption on the distribution. So I wonder if this is what motivates those authors' choices, or if it is just to simplify the discussion.
I think that Kalman filtering has very little to do with what you want to achieve. I will detail why after briefly explaining, in simple terms, what a Kalman filter does.
A Kalman filter estimates the current state x_t of a dynamic system based on all the previous observations, or in more mathematical terms, it models the probability distribution
p(x_t|z_1,...,z_t)
where the z_i are your observations (i.e. measurements). Moreover, it is designed with a Gaussian assumption in mind. That is, it assumes that the distributions of your states/errors, including the one above, are Gaussian. Furthermore, it requires a model that links the measurements to the states, something like
z_t=f(x_t)+some_gaussian_noise
and you also need a transition model that links the previous state to the current one, e.g.
x_t=g(x_{t-1})+some_gaussian_noise
This comes with the assumption of having a "complete state": the knowledge of the current state is taken to be enough to predict the next one.
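To make the structure concrete, here is a minimal one-dimensional linear Kalman filter (f and g are both the identity here, and all the variances are made-up illustration values):

# Scalar Kalman filter for the model
#   x_t = x_{t-1} + gaussian_noise(0, q)   (transition)
#   z_t = x_t     + gaussian_noise(0, r)   (measurement)
def kalman_1d(measurements, x0=0.0, p0=1.0, q=0.01, r=0.5):
    x, p = x0, p0                 # state estimate and its variance
    estimates = []
    for z in measurements:
        p = p + q                 # predict: variance grows by process noise
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # update toward the measurement
        p = (1 - k) * p           # variance shrinks after the update
        estimates.append(x)
    return estimates

print(kalman_1d([1.2, 0.9, 1.1, 1.0]))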
So, this is why I think it won't work with your model:
Given the information you've provided, I see no sign that you can assume the distribution of the events is Gaussian. It probably is not.
You don't have any transition equation, and I don't even think it is possible to define one for your problem. Moreover, state completeness doesn't hold.
Your state, as well as its observations, seem to be discrete, while Kalman filters are designed with continuous parameter spaces in mind.
Unfortunately, you haven't provided much information, so I can only suggest that you model your problem as a Markov chain, which I think you already had thought about.
Hope this helps a bit.

How to handle multiple optimal edit paths when implementing the Needleman-Wunsch algorithm?

I am trying to implement the Needleman-Wunsch algorithm for biological sequence comparison. In some circumstances there exist multiple optimal edit paths.
What is the common practice for handling this in bio-sequence-comparison tools? Is there any priority/preference among substitution/insertion/deletion?
If I want to keep multiple edit paths in memory, is any data structure recommended? Or, more generally, how should one store paths with branches and merges?
Any comments appreciated.
If two paths have identical scores, that means the likelihood of both is the same no matter which kinds of operations they used. Priority for substitutions vs. insertions or deletions has already been handled in getting that score. So if two scores are the same, common practice is to break the tie arbitrarily.
You should be able to handle this by recording, in your traceback matrix, all the cells you could have arrived at the current one from. Then, during traceback, start a separate branch whenever you come to a branching point. To allow for merges too, store some additional data about each cell (how will depend on what language you're using) indicating how many different paths left from it. Then, during traceback, wait at a given cell until that number of paths has arrived back at it, and merge them into one. You can follow the different branches either with true parallel processing or by just alternating which one you advance.
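A minimal sketch of that idea (the scoring values match=1, mismatch=-1, gap=-1 are arbitrary; the enumeration shares merged prefixes implicitly through the recursion rather than with explicit counters):

# Needleman-Wunsch recording every optimal predecessor per cell,
# then enumerating all co-optimal alignments from the traceback.
def all_optimal_alignments(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[[] for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
        back[i][0].append(('up', i - 1, 0))
    for j in range(1, m + 1):
        score[0][j] = j * gap
        back[0][j].append(('left', 0, j - 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up, left = score[i - 1][j] + gap, score[i][j - 1] + gap
            best = max(diag, up, left)
            score[i][j] = best
            if diag == best:
                back[i][j].append(('diag', i - 1, j - 1))
            if up == best:
                back[i][j].append(('up', i - 1, j))
            if left == best:
                back[i][j].append(('left', i, j - 1))
    def walk(i, j):               # yield all optimal alignments ending at (i, j)
        if i == 0 and j == 0:
            yield '', ''
        for move, pi, pj in back[i][j]:
            for ra, rb in walk(pi, pj):
                if move == 'diag':
                    yield ra + a[i - 1], rb + b[j - 1]
                elif move == 'up':
                    yield ra + a[i - 1], rb + '-'
                else:
                    yield ra + '-', rb + b[j - 1]
    return list(walk(n, m))

# The classic GCATGCU/GATTACA example has several co-optimal paths.
for pair in all_optimal_alignments("GCATGCU", "GATTACA"):
    print(*pair)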
Unless you have a reason to prefer one input sequence over the other in advance, it should not matter.
Otherwise, you might consider seq_a as the vertical axis and seq_b as the horizontal axis, and then always choose to step in your preferred direction when there is a tie to break. But I'm not convinced this makes any difference to the alignment, assuming one favors one of the starting sequences over the other.
Like many similar algorithms, Needleman-Wunsch is essentially the task of finding the shortest path through a graph (a square grid in this case). So I would use A* to define an alignment, and store the possible paths as a dictionary keyed by the nodes passed.

How to determine if a current set of data values represent or relate to previous historic data values?

I am trying to develop a method to identify the browsing pattern of a user on the basis of page requests.
In a simple example, I have created 8 pages, and for each page request from the user I have stored that page's request frequency in the database, as shown below.
My hypothesis is to identify differences in the page-request pattern, on the assumption that if the pattern differs from the pre-existing one, then it's a different (fraudulent) user. I am trying to develop this method as part of a multi-factor authentication system.
Now when a user logs in and browses with a different pattern from the ones observed previously, the system should be able to identify it as a change in pattern.
The question is how to use these data values to check whether the current pattern relates to the pre-existing patterns or not.
OK, here's a pretty simple idea (and basically, what you're looking to do is generate a set of features, then identify if the current session behaviour is different to the previously observed behaviour). I like to think of these one-class problems (only normal behaviour to train on, want to detect significant departure) as density estimation problems, so here's a simple probability model which will allow you to get the probability of a current request pattern. Basically, when this gets too low (and how low that is will be something you need to tune for the desired behaviour), something is going on.
Our observations consist of counts for each of the pages. Let their sum, the total number of requests, be equal to c_total, and counts for each page i be p_i. Then I'd propose:
c_total ~ Poisson(\lambda)
p|c_total ~ Multinomial(\theta, c_total)
This allows you to assign probability to a new observation given learned user-specific parameters \lambda (uni-variate) and \theta (vector of same dimension as p). To do this, calculate the probability of seeing that many requests from the pmf of the Poisson distribution, then calculate the probability of seeing the page counts from the multinomial, and multiply them together. You probably then want to normalise by c_total so that you can compare sessions with different numbers of requests (since the more requests, the more numbers < 1 you're multiplying together).
So, all that's left is to get the parameters from previous, "good" sessions from that user. The simplest thing is maximum likelihood, where \lambda is the mean total number of requests in previous sessions, and \theta_i is the proportion of all page views which were p_i (for that particular user). This may work for you: however, given that you want to be learning from very small numbers of observations, I'd be tempted to go with a full Bayesian model. This will also let you neatly update parameters after each non-suspicious observation. Inference in these distributions is very easy, with conjugate priors for \lambda and \theta and analytic predictive distributions, so it won't be difficult if you're familiar with these kinds of model at all.
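A minimal sketch of the maximum-likelihood version (scipy's distributions; the page counts are invented for illustration):

import numpy as np
from scipy.stats import poisson, multinomial

# Past "good" sessions for one user: each row is request counts over 8 pages.
sessions = np.array([[5, 2, 1, 1, 7, 1, 3, 2],
                     [6, 1, 2, 1, 8, 1, 2, 1],
                     [4, 3, 1, 2, 6, 1, 4, 1]])

lam = sessions.sum(axis=1).mean()               # ML estimate of \lambda
theta = sessions.sum(axis=0) / sessions.sum()   # ML estimate of \theta

def session_score(counts):
    c_total = counts.sum()
    logp = poisson.logpmf(c_total, lam)                       # P(c_total)
    logp += multinomial.logpmf(counts, n=c_total, p=theta)    # P(p | c_total)
    return logp / c_total             # normalise by session length

print(session_score(np.array([5, 2, 1, 1, 7, 1, 3, 2])))   # typical: higher
print(session_score(np.array([1, 1, 9, 9, 1, 9, 1, 1])))   # unusual: much lower

When the normalised log-probability drops below a threshold tuned on held-out good sessions, flag the session.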
One approach would be to use an unsupervised learning method such as a Self-Organizing Map (SOM, http://en.wikipedia.org/wiki/Self-organizing_map). Train the SOM on data representing expected/normal user behavior and then see how well the candidate data set fits the trained map. Keywords to search for in conjunction with "Self-organizing maps" might be "novelty/anomaly/intrusion detection" (turns up e.g. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.2616&rep=rep1&type=pdf)
You should think about whether fraudulent use-cases can be modeled in advance (in which case you can train detectors specifically for them) or whether only deviations from normal behavior are of interest.
If you want to start simple, implement a cosine similarity measure. This would allow you to define a set of "good" vectors. The current user's activity can then be compared against the good vectors; if it is not close enough to any of them, the activity is flagged.
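For example, a minimal sketch, assuming each session is reduced to a vector of page-request counts (the 0.8 threshold is a made-up tuning value):

import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

good_vectors = np.array([[5, 2, 1, 1, 7, 1, 3, 2],    # known-good sessions
                         [6, 1, 2, 1, 8, 1, 2, 1]])
current = np.array([1, 1, 9, 9, 1, 9, 1, 1])          # current session

# Flag the session if it is not close enough to any known-good vector.
best = max(cosine(current, g) for g in good_vectors)
print(best, "flagged" if best < 0.8 else "ok")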

Action constraints in actor-critic reinforcement learning

I've implemented the natural actor-critic RL algorithm on a simple grid world with four possible actions (up,down,left,right), and I've noticed that in some cases it tends to get stuck oscillating between up-down or left-right.
Now, in this domain up-down and left-right are opposites, and I feel that learning might be improved if I were somehow able to make the agent aware of this fact. I was thinking of simply adding a step after the action activations are calculated (e.g. subtracting the left activation from the right activation and vice versa). However, I'm afraid this could cause convergence issues in the general case.
It seems as though adding constraints would be a common desire in the field, so I was wondering if anyone knows of a standard method I should use for this purpose; and if not, whether my ad-hoc approach seems reasonable.
Thanks in advance!
I'd stay away from using heuristics in the selection of actions, if at all possible. If you want to add heuristics to your training, I'd do it in the calculation of the reward function. That way the agent will learn and embody the heuristic as a part of the value function it is approximating.
About the oscillation behavior, do you allow for the action of no movement (i.e. stay in the same location)?
Finally, I wouldn't worry too much about violating the general case and convergence guarantees. They are merely guidelines when doing applied work.
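If you do go the reward-function route, here is a minimal sketch that penalises an action that immediately undoes the previous one (the OPPOSITE map and the penalty size are illustration values; note that ad-hoc shaping like this can in principle change the optimal policy, which potential-based shaping avoids):

# Shaping term added to the environment's reward for the grid world.
OPPOSITE = {'up': 'down', 'down': 'up', 'left': 'right', 'right': 'left'}

def shaped_reward(env_reward, prev_action, action, penalty=0.1):
    # Discourage immediately reversing the previous move.
    if prev_action is not None and action == OPPOSITE[prev_action]:
        return env_reward - penalty
    return env_reward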

Multiobjective Optimisation: Selection using NSGA vs Selection using VEGA

I was wondering what differences exist between the Vector Evaluated Genetic Algorithm (VEGA) and the Nondominated Sorting Genetic Algorithm (NSGA) in the context of selection in multi-objective optimisation?
(I am aware that NSGA is pareto-based while VEGA is non-pareto based.)
The differences are quite large. As you say, one is Pareto-based and the other is not. In MOO, that's a huge thing. VEGA works by partitioning the population into disjoint sets and forcing the different sets to evolve towards different single objectives. There's a bit of machinery there to help combine them into a meaningful representation of the Pareto set, but it's basically just a union of solutions with respect to different objectives. Selection is done by selecting solutions that are better with respect to their individually set objective functions.
NSGA and other Pareto-based methods are completely different. They do selection not based on any particular choice of objective, but on the properties of the solutions as compared to one another. Each such algorithm makes slightly different choices in how it performs these comparisons, and NSGA-II (you should definitely use the second version of the algorithm) does it by non-dominated sorting. Basically, you find all the non-dominated solutions and call them set #1. Then you find all the solutions that would be non-dominated if you removed the elements of set #1; they become set #2. You keep going until all the solutions are accounted for, and the result is something like peeling the layers of an onion. The selection procedure is then that you always select members of the lower classes (set #1, then #2, and so on). If you can't take all the elements of a particular level, you break ties by choosing solutions within that level that are farther from the others (NSGA-II's crowding distance), the idea being that if you can't take them all, you should at least try not to pick the ones you do take from one tiny little cluster.
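Here is a minimal sketch of that layer-peeling step (a naive version assuming all objectives are minimised; NSGA-II itself uses a faster bookkeeping scheme and breaks ties within a front by crowding distance):

def dominates(p, q):
    # p dominates q: no worse in every objective, strictly better in at least one.
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def non_dominated_sort(points):
    remaining = list(points)
    fronts = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q != p)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts

print(non_dominated_sort([(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]))
# [[(1, 5), (2, 2), (5, 1)], [(3, 3)], [(4, 4)]]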
In general, you should be looking at Pareto-based methods. They've been the proven choice for at least 10-15 years. In particular, you should focus on elitist Pareto-based methods like NSGA-II, SPEA2, the epsilon-MOEA, and a few more recent contenders.
