SARSA in Reinforcement Learning - algorithm

I am coming across the SARSA algorithm in model-free reinforcement learning. Specifically, in each state you take an action a and then observe a new state s'.
My question is: if you don't have the state transition probability distribution P(s' | s, a), how do you know what your next state will be?
My attempt: do you simply try that action a out and then observe the result from the environment?

Typically yes, you perform the action in the environment, and the environment tells you what the next state is.

Yes. Based on the agent's experience, stored in an action-value function, its behavior policy π maps the current state s to an action a, which leads it to a next state s' and then to a next action a'.
Flowchart of the sequence of state-action pairs.

A technique called TD-Learning is used in Q-learning and SARSA to avoid learning the transition probabilities.
In short, when you are sampling, i.e. interacting with the system and collecting data samples (state, action, reward, next state, next action), the transition probabilities are implicitly taken into account when you use those samples to update the parameters of your model. Every time you choose an action in the current state and then observe a reward and a new state, the system in fact generated that reward and new state according to the transition probability p(s', r | s, a).
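Here is a minimal sketch of that sampling loop in Python, assuming a hypothetical Gym-like interface where env.reset() returns a state and env.step(a) returns the next state, the reward, and a done flag (these names and signatures are illustrative, not from any specific library):

import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular SARSA against a hypothetical Gym-like environment.
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0

    def policy(s):
        # epsilon-greedy behaviour policy derived from the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            # The environment samples the reward and the next state from
            # p(s', r | s, a) for us -- we never need that distribution.
            s_next, r, done = env.step(a)
            a_next = policy(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q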
You can find a simple description in the book Artificial Intelligence: A Modern Approach.

Related

Confusion in selecting the reward in Q-learning

I am new to the field of Q-learning (QL) and I am trying to implement a small task using QL in MATLAB. The task is: say there is one transmitter, one receiver, and between them there are 10 relays. The main part is that I want to use QL to choose the one relay that will carry the signal from the transmitter to the receiver successfully.
So, as per QL theory, we need to define state, action, and reward. Hence I have chosen them as:
State: [P1,...,P10], where P1 is the power from the 1st relay to the receiver; likewise, P10 is the power from the 10th relay to the receiver.
Action: [1,...,10], where an action is simply choosing the relay with the highest power at that time.
My query is that I do not understand how I should choose the reward in this case.
Any help in this regard will be highly appreciated.
There is only one state (i.e., this is actually a multi-armed bandit problem).
There are ten actions, one per relay.
The reward of each action is the power of the corresponding relay.
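A minimal sketch of this single-state setup in Python (the measure_power function is hypothetical and stands in for observing the power of the chosen relay; the original task is in MATLAB, so treat this as pseudocode for the structure):

import random

def choose_relay(measure_power, n_relays=10, steps=1000, alpha=0.1, epsilon=0.1):
    # Single-state Q-learning, i.e. an epsilon-greedy multi-armed bandit.
    Q = [0.0] * n_relays  # one value estimate per relay (action)
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(n_relays)                # explore
        else:
            a = max(range(n_relays), key=Q.__getitem__)   # exploit
        reward = measure_power(a)  # hypothetical: observed power of relay a
        Q[a] += alpha * (reward - Q[a])  # no next state, so no discounted term
    return max(range(n_relays), key=Q.__getitem__)  # best relay found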

How to deal with discrete time system in GEKKO?

I am dealing with a discrete time system with a sampling time of 300 s.
My question is how to express the state equation and output equation, like
x(k+1)=A*x(k)+B*u(k)
y(k)=C*x(k)
where x(k) is the state and y(k) is the output. I have all the values of the A, B, and C matrices.
I found some information about discrete time systems on the webpage https://apmonitor.com/wiki/index.php/Apps/DiscreteStateSpace
I want to know whether there is another way to express the state equations other than
x,y,u = m.state_space(A,B,C,D=None,discrete=True)
The discrete state space model is the preferred way to pose your model. You could also convert your equations to a discrete time series form or to a continuous state space form; these are all equivalent forms. Another way to write your model is to use IMODE=2 (algebraic equations), but this is much more complicated. Here is an example of MIMO identification where we estimate ARX parameters with IMODE=2. I recommend the m.state_space model and using it with IMODE>=4.
Here are a pendulum state space model example and a flight control state space model.
These both use continuous state space models but the methods are similar to what is needed for your application.
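A minimal sketch of the discrete form, assuming illustrative 2-state matrices and encoding the 300 s sample time in the spacing of m.time (replace A, B, C with your own values; the NODES=2 setting reflects the usual requirement for discrete models, so check it against the documentation):

import numpy as np
from gekko import GEKKO

# Illustrative matrices; substitute your actual A, B, C
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[1.0],
              [0.5]])
C = np.array([[1.0, 0.0]])

m = GEKKO(remote=False)
x, y, u = m.state_space(A, B, C, D=None, discrete=True)

# With discrete=True, each step in m.time is one sample k -> k+1 (here 300 s)
m.time = np.arange(0, 3001, 300)
u[0].value = np.ones(len(m.time))  # example input sequence u(k)

m.options.IMODE = 4   # dynamic simulation
m.options.NODES = 2   # no collocation points between discrete samples
m.solve(disp=False)

print(y[0].value)     # simulated output y(k)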

cb_explore input format : Use of providing probability value in training

The cb_explore input format requires specifying action:cost:action_probability for each example.
However, the cb algorithms are already trying to learn the optimal policy, i.e. the probability for each action, from the data. Why, then, does it need the probability of each action in the input? Is it just for initialization?
If I understand correctly, you are asking why the label associated with cb_explore is a set of action/probability pairs.
The probability of the label action is used as an importance weight for training. This has the effect of amplifying the updates for actions that are played less frequently, making them less likely to be drowned out by actions played more frequently.
This type of label is also very useful at predict time, because it generates a log that can be used to perform unbiased counterfactual analyses. In other words, by logging the probability of playing each of the actions before sampling (see cb_sample, which implements how a single action/probability vector is sampled, as for example in the ccb reduction: https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/cb_sample.cc#L37), we can then use the log to train another policy and compare how it performs against the original.
See the "A Multi-World Testing Decision Service" paper for a description of how unbiased offline experimentation with logged data works: https://arxiv.org/pdf/1606.03966v1.pdf

Q-learning with a state-action-state reward structure and a Q-matrix with states as rows and actions as columns

I have set up a Q-learning problem in R, and would like some help with the theoretical correctness of my approach in framing the problem.
Problem structure
For this problem, the environment consists of 10 possible states. When in each state, the agent has 11 potential actions which it can choose from (these actions are the same regardless of the state which the agent is in). Depending on the particular state which the agent is in and the subsequent action which the agent then takes, there is a unique distribution for transition to a next state i.e. the transition probabilities to any next state are dependant on (only) the previous state as well as the action then taken.
Each episode has 9 iterations i.e. the agent can take 9 actions and make 9 transitions before a new episode begins. In each episode, the agent will begin in state 1.
In each episode, after each of the agent's 9 actions, the agent will get a reward which depends on the agent's (immediately) previous state, their (immediately) previous action, and the state which they landed on, i.e. the agent's reward structure depends on a state-action-state triplet (of which there will be 9 in an episode).
The transition probability matrix of the agent is static, and so is the reward matrix.
I have set up two learning algorithms. In the first, the q-matrix update happens after each action in each episode. In the second, the q-matrix is updated after each episode. The algorithm uses an epsilon greedy learning formula.
The big problem is that in my Q-learning, my agent is not learning. It gets less and less of a reward over time. I have looked into other potential problems such as simple calculation errors, or bugs in code, but I think that the problem lies with the conceptual structure of my q-learning problem.
Questions
I have set up my Q-matrix as a 10-row by 11-column matrix, i.e. the 10 states are the rows and the 11 actions are the columns. Would this be the best way to do so? It means the agent is learning a policy which says "whenever you are in state x, do action y".
Given this unique structure of my problem, would the standard Q-update still apply? i.e. Q[cs,act]<<-Q[cs,act]+alpha*(Reward+gamma*max(Q[ns,])-Q[cs,act])
where cs is the current state; act is the chosen action; Reward is the reward given your current state, your chosen action, and the next state you transition to; and ns is the next state you transition to given your last state and last action (note that this transition is stochastic).
Is there an OpenAI Gym equivalent in R? Are there Q-learning packages for problems of this structure?
Thanks and cheers
There is a problem in your definition of the problem.
Q(s,a) is the expected utility of taking action a in state s and following the optimal policy afterwards.
Expected rewards are different after taking 1, 2, or 9 steps. That means that the return from being in state s_0 and taking action a_0 at step 0 is different from what you get at step 9.
The "state" as you have defined it does not determine your expected reward on its own; the combination of state and step does.
To model the problem adequately, you should reframe it and consider the state to be the pair 'position' + 'step'. You will then have 90 states (10 positions x 9 steps).
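A minimal sketch of that reframing (in Python rather than R, with illustrative names), where the Q-table is indexed by the (position, step) pair so that the standard update you wrote still applies unchanged:

import numpy as np

n_pos, n_steps, n_actions = 10, 9, 11
alpha, gamma = 0.1, 0.95

# One row per (position, step) pair: 10 * 9 = 90 states, 11 actions
Q = np.zeros((n_pos * n_steps, n_actions))

def state_index(pos, step):
    # Map (position, step) to a row of Q; pos in 0..9, step in 0..8
    return step * n_pos + pos

def q_update(pos, step, action, reward, next_pos):
    s = state_index(pos, step)
    if step + 1 < n_steps:
        target = reward + gamma * Q[state_index(next_pos, step + 1)].max()
    else:
        target = reward  # last step of the episode: no future return
    Q[s, action] += alpha * (target - Q[s, action])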

Cellular Automata update rules Mathematica

I am trying to build some rules for a cellular automaton. In every cell I don't have a single element (predator/prey); I have a population count. To achieve movement of my population, can I compare every cell with just one of its neighbours each time, or do I have to compare the cell with all of its neighbours and add some conditions?
I am using a Moore neighbourhood with the following update function:
update[site,_,_,_,_,_,_,_,_]
I tried to make them move according to all of their neighbours at once, but it is very complicated, and I am wondering whether it would be wrong to simplify this and check each cell against its neighbours individually.
Thanks
As general advice, I would recommend against going down the pattern-matching route in Mathematica for specifying the rule table of a CA; such rule tables tend to get out of hand very quickly.
Doing a predator-prey kind of simulation with a CA is a little tricky since, in each step (unlike in a traditional CA), the value of the center cell changes along with the value of the neighbor cell.
This leads to issues: when the transition function is applied to the neighbor cell, it again computes a new value for itself, but it also needs to "remember" the changes made to it earlier, when it was acting as a neighbor.
Fluid dynamics simulations using CA run into problems like this, and they use a different neighborhood called the Margolus neighborhood. In a Margolus neighborhood, the CA lattice is broken into distinct blocks and the update rule is applied to each block. In the next step, the block boundaries are shifted and the transition rule is applied to the new blocks, so information transfers across block boundaries.
Before I give my best attempt at an answer, you have to realize that your question is oddly phrased. The update rules of your cellular automaton are specified by you, so I wouldn't know whether you have additional conditions you need to implement.
I think you're asking for the best way to select a neighborhood, and you can do this with Part:
(* We abbreviate 'nbhd' for neighborhood *)
getNbhd[A_, i_Integer?Positive, j_Integer?Positive] :=
  A[[i - 1 ;; i + 1, j - 1 ;; j + 1]];
This will select the appropriate Moore neighborhood, including the additional central cell, which you can filter out when you call your update function.
Specifically, to perform the update step of a cellular automaton, one must update all cells simultaneously. In practice, this means creating a separate array, placing the updated values there, and scrapping the original array afterward.
For more details, see my blog post on Cellular Automata, which includes an implementation of Conway's Game of Life in Mathematica.
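The simultaneous update is language-agnostic; as a sketch of the double-buffer pattern in Python/NumPy (not Mathematica, and with a hypothetical per-cell rule update_cell), it looks like this:

import numpy as np

def step(grid, update_cell):
    # One synchronous CA step: read only from grid, write into a fresh array.
    new_grid = np.empty_like(grid)
    rows, cols = grid.shape
    for i in range(rows):
        for j in range(cols):
            # Moore neighbourhood with wrap-around (toroidal) boundaries
            nbhd = grid[np.ix_([(i - 1) % rows, i, (i + 1) % rows],
                               [(j - 1) % cols, j, (j + 1) % cols])]
            new_grid[i, j] = update_cell(nbhd)  # hypothetical per-cell rule
    return new_grid  # the old grid is discarded only after the full pass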

Resources