I am new to the field of Q-learning (QL) and I am trying to implement a small task with QL in MATLAB. The task is: say there is one transmitter, one receiver, and between them 10 relays. The main goal is to use QL to choose the one relay that will carry the signal from transmitter to receiver successfully.
So, as per QL theory, we need to define state, action, and reward. Hence I have chosen them as:
State: [P1,...,P10], where P1 is the power from the 1st relay to the receiver; likewise, P10 is the power from the 10th relay to the receiver.
Action: [1,...,10], where an action is choosing the relay that has the highest power at that time.
My question is: how should I choose the reward in this case?
Any help in this regard will be highly appreciated.
There is only one state (i.e., this is actually a multi-armed bandit problem).
There are ten actions, one per relay.
The reward of each action is the power of the corresponding relay.
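For illustration, here is a minimal epsilon-greedy sketch of that bandit in Python (the noisy power model and all the constants below are placeholder assumptions; the same few lines port directly to MATLAB):

import numpy as np

n_relays = 10
epsilon = 0.1                           # exploration rate (assumed)
true_power = np.random.rand(n_relays)   # placeholder relay powers
Q = np.zeros(n_relays)                  # estimated mean power per relay
counts = np.zeros(n_relays)

for t in range(1000):
    if np.random.rand() < epsilon:
        a = np.random.randint(n_relays)           # explore: random relay
    else:
        a = int(np.argmax(Q))                     # exploit: best estimate so far
    reward = true_power[a] + 0.1 * np.random.randn()  # noisy power reading
    counts[a] += 1
    Q[a] += (reward - Q[a]) / counts[a]           # incremental sample mean

print("best relay:", int(np.argmax(Q)))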
I started using Veins (4.4) under OMNeT++ (5.0) about a week ago.
My current task is to let vehicles adjust their transmission range according to a specific context. I have read a lot of related questions, such as these (and in other topics/forums):
Dynamical transmission range in the ieee802.11p module
Vehicles Receive Beacon Messages outside RSU Range
How coverage distance and interference distance are affected by each other
Maximum transmission range vs maximum interference distance
Reduce the coverage area between vehicles
how to set the transmission range of a node under Veins 2.0?
My Question:
How to -really- change the transmission range of just some nodes?
From the links above, I learned that the term "transmission range" is, technically, related to the received power, noise, sensitivity threshold, etc., which together define the probability of reception.
Since I am new to Veins (and OMNeT++ as well), I did a few tests and concluded the following:
The "TraCIMobility" module (there is one instance per vehicle) can adjust a node's parameters, such as its ID, speed, etc.
I could also obtain each vehicle's "Mac1609_4" module and change some of its parameters, like "txPower", during simulation run-time, but this had no effect on the real communication range.
I could not do the same for the "connection manager" module (because it is global); it alone is responsible for (and overrides) the effective communication range. This module can be configured in the ".ini" file, but I want different transmission powers per vehicle and, most importantly, powers that can be changed during run-time.
The formula to calculate the transmission range is in the links above; I got it, but there must be a way to define or change these parameters in one of the layers (even if it is in the PHY layer, i.e., something like the attached signal strength...).
Again, maybe there are some wrong ideas in what I have said; I just want to know what/how to change this transmission range.
Best regards,
You were right to increase the mac1609_4.txPower parameter to have a node send with more power (hence, the signal being decodable further away). Note, however, that (for Veins 4.4) you will also need to increase connectionManager.pMax then, as this value is used to determine the maximum distance (away from a transmitting simulation module) that a receiving simulation module will be informed about an ongoing transmission. Any receiving simulation module further away will not be influenced by the transmission (in the sense of it being a candidate for decoding, but also in the sense of it contributing to interference).
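For reference, a sketch of the relevant omnetpp.ini lines (the module paths follow the Veins 4.4 tutorial example; the 20 mW figure is only an illustration):

*.connectionManager.pMax = 20mW
*.**.nic.mac1609_4.txPower = 20mW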
Also note that transmissions on an (otherwise) perfectly idle channel will reach much further than transmissions on a typically-loaded channel. If you want to obtain a good measurement of how far a transmission reaches, have some nodes create interference (by transmitting broadcasts of their own), then look at how the Frame Delivery Rate (FDR) drops as distance between sender and receiver increases.
Finally, note that both 1) the noise floor and 2) the minimum power level necessary for the simulation module of a receiver to attempt decoding a frame need to be calibrated to the WLAN card you want to simulate. The values chosen in the Veins 4.4 tutorial example are very useful for demonstrating the concepts of Veins, whereas the values of more recent versions of Veins come closer to what you would expect from a "typical" WLAN card used in some of the more recent field tests. See the paper Bastian Bloessl and Aisling O'Driscoll, "A Case for Good Defaults: Pitfalls in VANET Physical Layer Simulations," Proceedings of IFIP Wireless Days Conference 2019, Manchester, UK, April 2019 for a more detailed discussion of these parameters.
I am just giving my opinion, in case someone else ends up in my situation:
In Veins (the old version I am using is 4.4), the "connection manager" is responsible for evaluating a "potential" exchange of packets; thus, its transmission power is almost always set to the upper bound.
I was confused because, after I changed the vehicles' "Mac1609_4" transmission power, the connection manager was still "graphically" showing me that packets were received by some far nodes, which in fact was not the case; it was merely evaluating whether they were properly received or not (via the formula discussed in the links above).
Thus: changing the "txPower" of each vehicle really did have an effect; it just wasn't visible graphically (the messages were not passed up to the upper layers at the far nodes).
In sum, to make a transmission-range-aware scheme, this is what must be done:
In the sender node (vehicle), similarly to the "traci" pointer that deals with the mobility features, a pointer to the "Mac1609_4" module must be created and bound as follows:
In "tracidemo11p.h" add ->
#include "veins/modules/mac/ieee80211p/Mac1609_4.h"//added
#include "veins/base/utils/FindModule.h"//added
and, as a protected member variable of the "TraCIDemo11p" class in the same ".h" file ->
Mac1609_4* mac; // added: pointer to this vehicle's MAC layer
In "tracidemo11p.cc" add ->
mac = FindModule<Mac1609_4*>::findSubModule(getParentModule()); // e.g., in initialize(); locates this vehicle's MAC submodule
Now you can manipulate "mac" just as you do "traci"; the appropriate methods are in "modules/mac/ieee80211p/Mac1609_4.cc & .h".
For our purpose, the method is:
mac->setTxPower(10); // for example; Mac1609_4 interprets this value in mW
This takes effect at simulation run-time for each node instance.
I may have described this in basic terms because I am new to OMNeT++/Veins; all of this was done in less than one week (so it should be approachable for new users as well).
I hope it will be helpful (and correct).
I am coming across the SARSA algorithm in model-free reinforcement learning. Specifically, in each state you take an action a and then observe a new state s'.
My question is: if you don't have the state-transition probability P(s' | s, a), how do you know what your next state will be?
My attempt at an answer: do you simply try that action a out, and then observe the result from the environment?
Typically yes, you perform the action in the environment, and the environment tells you what the next state is.
Yes. Based on the agent's experience, stored in an action-value function, its behavior policy pi maps the current state s to an action a, which leads to a next state s' and then to a next action a'.
(Figure: flowchart of the sequence of state-action pairs.)
A technique called TD-Learning is used in Q-learning and SARSA to avoid learning the transition probabilities.
In short, when you are sampling, i.e., interacting with the system and collecting data samples (state, action, reward, next state, next action), the transition probabilities in SARSA are implicitly taken into account when you use the samples to update the parameters of your model. Every time you choose an action in the current state and then receive a reward and the new state, the system has in fact generated that reward and new state according to the transition probability p(s', r | s, a).
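As a sketch of that sampling loop in Python (the env object with a reset/step interface and the table sizes n_states, n_actions are assumptions, not tied to any particular library):

import numpy as np

alpha, gamma, epsilon = 0.1, 0.99, 0.1    # assumed hyperparameters
Q = np.zeros((n_states, n_actions))       # n_states, n_actions: assumed known

def eps_greedy(s):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

s = env.reset()                           # env: hypothetical environment object
a = eps_greedy(s)
done = False
while not done:
    # the environment samples s' and r from p(s', r | s, a) for us
    s_next, r, done = env.step(a)
    a_next = eps_greedy(s_next)
    # SARSA update; at a terminal state there is no bootstrap term
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    s, a = s_next, a_next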
You can find a simple description in the book Artificial Intelligence: A Modern Approach.
I have set up a Q-learning problem in R, and would like some help with the theoretical correctness of my approach in framing the problem.
Problem structure
For this problem, the environment consists of 10 possible states. When in each state, the agent has 11 potential actions to choose from (these actions are the same regardless of the state the agent is in). Depending on the particular state the agent is in and the action it then takes, there is a unique distribution for the transition to a next state, i.e., the transition probabilities to any next state depend on (only) the previous state and the action then taken.
Each episode has 9 iterations, i.e., the agent can take 9 actions and make 9 transitions before a new episode begins. In each episode, the agent begins in state 1.
In each episode, after each of the agent's 9 actions, the agent gets a reward which depends on the agent's (immediately) previous state and (immediately) previous action, as well as the state it landed on, i.e., the reward structure depends on a state-action-state triplet (of which there will be 9 in an episode).
The transition probability matrix of the agent is static, and so is the reward matrix.
I have set up two learning algorithms. In the first, the Q-matrix update happens after each action in each episode. In the second, the Q-matrix is updated after each episode. The algorithm uses an epsilon-greedy action-selection rule.
The big problem is that my agent is not learning: it gets less and less reward over time. I have looked into other potential causes, such as simple calculation errors or bugs in the code, but I think the problem lies in the conceptual structure of my Q-learning problem.
Questions
I have set up my Q-matrix as a 10-row by 11-column matrix, i.e., the 10 states are the rows and the 11 actions are the columns. Would this be the best way to do so? It means the agent learns a policy which says "whenever you are in state x, do action y".
Given this unique structure of my problem, would the standard Q-update still apply? i.e.,
Q[cs,act] <<- Q[cs,act] + alpha*(Reward + gamma*max(Q[ns,]) - Q[cs,act])
where cs is the current state, act is the action chosen, Reward is the reward given your current state, your chosen action, and the next state you transition to, and ns is the next state you transition to given your last state and last action (note that this transition is stochastic).
Is there an OpenAI Gym equivalent in R? Are there Q-learning packages for problems of this structure?
Thanks and cheers
There is a problem in how you have defined the problem.
Q(s,a) is the expected utility of taking action a in state s and following the optimal policy afterwards.
Expected rewards are different after taking 1, 2, or 9 steps. That means the reward for being in state s_0 and taking action a_0 at step 0 is different from what you get at step 9.
The "state" as you have defined it does not determine your expected reward; the combination of state+step does.
To model the problem adequately, you should reframe it and consider the state to be the pair 'position'+'step'. You will then have 90 states (10 positions × 9 steps).
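A sketch of that reframing (in Python for brevity; alpha, gamma, and the transition variables are assumptions, and the indexing carries over directly to an R matrix):

import numpy as np

n_pos, n_steps, n_actions = 10, 9, 11
alpha, gamma = 0.1, 1.0                      # assumed; gamma = 1 suits a finite horizon
Q = np.zeros((n_pos * n_steps, n_actions))   # 90 augmented states x 11 actions

def state_index(pos, step):                  # pos in 0..9, step in 0..8
    return pos * n_steps + step

def update(pos, step, a, reward, next_pos):
    # one Q-learning update on the (position, step) augmented state
    s = state_index(pos, step)
    if step + 1 < n_steps:                   # not the last step of the episode
        target = reward + gamma * Q[state_index(next_pos, step + 1)].max()
    else:
        target = reward                      # episode ends: no bootstrap term
    Q[s, a] += alpha * (target - Q[s, a])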
I have a table, where each row of the table contains state (registers). There is logic that chooses one particular row. Only one row receives the "selected" signal. State from that chosen row is then accessed. Either a portion of the state is connected as an output to the IO of the module, or else a portion of the IO is used as input to update the state.
If I were implementing this with a circuit, I would use pass-gates. The selected signal would turn on one set of pass-gates, which would connect the row's registers to a bus. The bus would then be wired to the IO bundle. This is fast, small area, and low energy.
There is a straightforward way of implementing this in Chisel: encode the selected row as a binary number, and then apply that number to the select input of a traditional mux. Unfortunately, for a table with 20 to 50 rows and state of hundreds of bits, this implementation can be quite slow and wasteful in area and energy.
The question has two parts:
1) Is there a way to specify busses in Chisel, such that you have pass-gates or traditional tri-state drivers all hung off the bus?
2) Failing that, is there a fast, small area, low energy way of doing this in Chisel?
Thanks
1) Chisel does not fully support bidirectional wires, but via the experimental Analog type (see example), you can at least stitch a bus through your Chisel code between Verilog Black Boxes.
2) Have you tried Mux1H in chisel3.util? It emits essentially a sum of products of the inputs and their corresponding select bits. I'm not sure how this compares to your proposed implementation; I would love to see a quality-of-results (QoR) comparison. If this construct is not sufficient and you cannot express precisely what you want in Chisel, you can use a parameterized BlackBox to implement your one-hot mux and instantiate it as you please.
The problem: I have a series of chat messages, between two users, with timestamps. I could present, say, an entire day's worth of chat messages at once. During the day, however, there were multiple discrete conversations/sessions, and it would be more useful to the user to see these divided up rather than the whole day as one continuous stream.
Is there an algorithm or heuristic that can 'deduce' implicit session/conversation starts and breaks from the timestamps, besides an arbitrary rule like 'if the gap is more than x minutes, it's a separate session'? And if such a rule is the only option, how would that interval be determined? In any case, I'd like to avoid a fixed threshold.
For example, say fifty messages are sent between 2:00 and 3:00, then a break, and then twenty messages between 4:00 and 5:00. A break would be inserted there, but how would it be determined?
I'm sure that there is already literature on this subject, but I just don't know what to search for.
I was playing around with things like edge detection algorithms and gradient-based approaches for a while.
(see comments for more clarification)
EDIT (Better idea):
You can view each message as being of two types:
A continuation of a previous conversation
A brand new conversation
You can model these two types of messages as independent Poisson processes, where the time difference between adjacent messages follows an exponential distribution.
You can then empirically determine the exponential parameters for these two types of messages by hand (wouldn't be too hard to do given some initial data). Now you have a model for these two events.
Finally when a new message comes along, you can calculate the probability of the message being of type 1 or type 2. If type 2, then you have a new conversation.
Clarification:
The probability of the message being a new conversation, given that the delay is some time T.
P(new conversation | delay=T) = P(new conversation AND delay=T)/P(delay=T)
Using Bayes' Rule:
= P(delay=T | new conversation)*P(new conversation)/P(delay=T)
The same calculation goes for P(old conversation | delay=T).
P(delay=T | new conversation) comes from the model. P(new conversation) is easily calculable from the data used to generate your model. P(delay=T) you don't need to calculate at all since all you want to do is compare the two probabilities.
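A sketch of that comparison in Python (the two rates and the prior below are assumed placeholders; you would estimate them from your own data):

import math

lam_new = 1 / 3600.0   # assumed: gaps before a new conversation average one hour
lam_old = 1 / 30.0     # assumed: gaps within a conversation average 30 seconds
p_new = 0.05           # assumed prior: 5% of messages start a new conversation

def exp_pdf(lam, t):
    return lam * math.exp(-lam * t)

def is_new_conversation(delay_seconds):
    # compare P(delay | new) * P(new) with P(delay | old) * P(old);
    # the shared denominator P(delay) cancels, so we never compute it
    score_new = exp_pdf(lam_new, delay_seconds) * p_new
    score_old = exp_pdf(lam_old, delay_seconds) * (1 - p_new)
    return score_new > score_old

print(is_new_conversation(10))     # short gap -> likely same conversation
print(is_new_conversation(7200))   # two-hour gap -> likely a new conversation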
The difference in timestamps between adjacent messages depends on the type of conversation and the people participating. Thus you'll want an algorithm that takes into account local characteristics, as opposed to a global threshold parameter.
My proposition would be as follows:
Get the time difference between the last 10 adjacent messages.
Compute the mean (or median)
If the delay until the next message is more than 30 times the mean, it's a new conversation.
Of course, I came up with these numbers on the spot. They would have to be tuned to fit your purpose.
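A sketch of that heuristic in Python (the window of 10 and factor of 30 are the numbers from above; resetting the gap history at each break is an extra assumption so a huge break gap doesn't inflate the running mean):

def split_sessions(timestamps, window=10, factor=30.0):
    # timestamps: message times in seconds, sorted ascending
    if not timestamps:
        return []
    sessions = [[timestamps[0]]]
    gaps = []                                   # recent within-session gaps
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        recent = gaps[-window:]
        mean = sum(recent) / len(recent) if recent else None
        if mean is not None and gap > factor * mean:
            sessions.append([cur])              # gap dwarfs the recent mean: new session
            gaps = []                           # assumption: restart the statistics
        else:
            sessions[-1].append(cur)
            gaps.append(gap)
    return sessions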