How are the losses computed for the CATS algorithm - vowpalwabbit

I am struggling to understand how the losses are computed when using the CATS algorithm in the VowpalWabbit library.
Does anyone know how they are computed (the average and the last)?
I tried to calculate the average of the cost as described in the documentation (loss = cost = -reward).

The CATS algorithm computes the loss that is reported using the get_loss() function, here: https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/core/src/reductions/cats.cc#L58-L82.
What it does can be broken down into a few steps:
Normalize and discretize the set of actions (first by their interval, then into buckets when we floor the "ac"). This turns the chosen action into an "action index", much like that of the standard CB algorithm.
This index is used to compute the "center" position of the action - in other words, the value we would get if we always chose the center of the bucket, rather than some other point within the bandwidth of the discretization.
Then we compare the logged action with this center; if the logged action falls within the bandwidth, we compute a loss (this functions as the indicator function).
If we are computing the loss, we need to ensure that we properly account for actions whose bandwidth extends beyond the allowed min/max, and then use this, along with the logged probability of choosing the action, to perform an IPS-like computation over the cost of choosing the action.
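For concreteness, here is a rough Python sketch of those steps. This is not the actual VW implementation (the linked cats.cc is authoritative); the parameter names (min_value, max_value, num_actions, bandwidth, logged_pdf_value) are just stand-ins for the corresponding CATS options and logged data:

```python
# Hypothetical sketch of the loss computation described above -- NOT the exact
# VW code. See cats.cc (linked above) for the authoritative implementation.
def cats_loss(predicted_action, logged_action, logged_cost, logged_pdf_value,
              min_value, max_value, num_actions, bandwidth):
    continuous_range = max_value - min_value
    unit_range = continuous_range / num_actions   # width of one discretized bucket

    # 1. Normalize and discretize the predicted action into an action index.
    ac = (predicted_action - min_value) / unit_range
    action_index = int(ac)                        # floor

    # 2. Compute the "center" position of that discretized action.
    center = min_value + (action_index + 0.5) * unit_range

    loss = 0.0
    # 3. Indicator: only incur a loss if the logged action falls within the bandwidth.
    if abs(logged_action - center) <= bandwidth:
        # 4. Clip the smoothing interval so it stays inside [min_value, max_value],
        #    then do an IPS-like correction using the logged probability density.
        lo = max(center - bandwidth, min_value)
        hi = min(center + bandwidth, max_value)
        loss = logged_cost / ((hi - lo) * logged_pdf_value)
    return loss
```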

Related

Computational Complexity of Finding Area Under Discrete Curve

I apologize if my questions are extremely misguided or loosely scoped. Math is not my strongest subject. For context, I am trying to figure out the computational complexity of calculating the area under a discrete curve. In the particular use case that I am interested in, the y-axis is the length of a queue and the x-axis is time. The curve will always have the following bounds: it begins at zero, it is composed of multiple timestamped samples that are greater than zero, and it eventually shrinks to zero. My initial research has yielded two potential mathematical approaches to this problem. The first is a Riemann sum over the domain [a, b], where a is initially zero and b eventually becomes zero (not sure if my understanding is completely correct there). I think the mathematical representation of this is the formula found here:
https://en.wikipedia.org/wiki/Riemann_sum#Connection_with_integration.
The second is a discrete convolution. However, I am unable to tell the difference between, and applicability of, a discrete convolution and a Riemann sum over the domain [a, b] where a is initially zero and b eventually becomes zero.
My questions are:
Is there a difference between the two?
Which approach is most applicable/efficient for what I am trying to figure out?
Is it even appropriate to ask about the computational complexity of either mathematical approach? If so, what are the complexities of each in this particular application?
Edit:
For added context, there will be a function calculating average queue length by taking the sum of the area under two separate curves and dividing it by the total time interval spanning those two curves. The particular application can be seen on page 168 of this paper: https://www.cse.wustl.edu/~jain/cv/raj_jain_paper4_decbit.pdf
Is there a difference between the two?
A discrete convolution requires two functions. If the first one corresponds to the discrete curve, what is the second one?
Which approach is most applicable/efficient for what I am trying to figure out?
A Riemann sum is an approximation of an integral. It's typically used to approximate the area under a continuous curve. You can of course use it on a discrete curve, but it's not an approximation anymore, and I'm not sure you can call it a "Riemann" sum.
Is it even appropriate to ask about the computational complexity of either mathematical approach? If so, what are the complexities of each in this particular application?
In any case, the complexity of computing the area under a discrete curve is linear in the number of samples, and it's pretty straightforward to see why: you need to do something with each sample, once or twice.
What you probably want looks like a Riemann sum with the trapezoidal rule. Pick the first two samples, calculate their average, and multiply that by the distance between two samples. Repeat for every adjacent pair and sum it all.
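As a concrete illustration (not from the original answer), a minimal sketch of that single trapezoidal pass over timestamped samples might look like this:

```python
def area_under_curve(samples):
    """samples: list of (timestamp, queue_length) pairs, sorted by timestamp."""
    area = 0.0
    for (t0, y0), (t1, y1) in zip(samples, samples[1:]):
        area += 0.5 * (y0 + y1) * (t1 - t0)  # trapezoid between adjacent samples
    return area

# Example: a queue that grows to 3 and drains back to 0 over 4 time units.
print(area_under_curve([(0, 0), (1, 2), (2, 3), (3, 1), (4, 0)]))
```

Each sample is touched a constant number of times, so the cost is O(N) in the number of samples.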
So, this is for the router feedback filter in the referenced paper...
That algorithm is specifically designed so that you can implement it without storing a lot of samples and timestamps.
It works by accumulating total queue_length * time during each cycle.
At the start of each "cycle", record the current queue length and current clock time and set the current cycle's total to 0. (The paper defines the cycle so that the queue length is 0 at the start, but that's not important here)
Every time the queue length changes, get the new current clock time and add (new_clock_time - previous_clock_time) * previous_queue_length to the total. Also do this at the end of the cycle. Then, record the new current queue length and clock time.
When you need to calculate the current "average queue length", it's just (previous_cycle_total + current_cycle_total + (current_clock_time - previous_clock_time)*previous_queue_length) / total_time_since_previous_cycle_start
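A minimal sketch of that accumulator, with illustrative class and method names, might look like the following; it assumes a monotonic clock and that on_cycle_end() is called wherever the paper's cycle boundary falls:

```python
import time

class AverageQueueLength:
    """Accumulates queue_length * time so no per-sample history needs storing."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        now = clock()
        self.prev_cycle_start = now   # start of the previous cycle
        self.cur_cycle_start = now    # start of the current cycle
        self.prev_cycle_total = 0.0   # area accumulated in the previous cycle
        self.cur_cycle_total = 0.0    # area accumulated so far in this cycle
        self.last_change = now
        self.queue_length = 0

    def on_queue_change(self, new_length):
        now = self.clock()
        self.cur_cycle_total += (now - self.last_change) * self.queue_length
        self.last_change = now
        self.queue_length = new_length

    def on_cycle_end(self):
        # Flush the final interval of this cycle, then roll the window forward.
        self.on_queue_change(self.queue_length)
        self.prev_cycle_total, self.cur_cycle_total = self.cur_cycle_total, 0.0
        self.prev_cycle_start, self.cur_cycle_start = self.cur_cycle_start, self.last_change

    def average(self):
        now = self.clock()
        pending = (now - self.last_change) * self.queue_length
        total_area = self.prev_cycle_total + self.cur_cycle_total + pending
        elapsed = now - self.prev_cycle_start
        return total_area / elapsed if elapsed > 0 else 0.0
```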

How to write an algorithm that takes into account 3 weighted actions, with time decay?

I'm interested in creating an algorithm that provides a user ranking based on 3 actions that are weighted in significance. Example:
Action A (50%)
Action B (30%)
Action C (20%)
I would then like to have a time decay which provides maximum value at the time of the action and decays to 0 over a period such as a day/week/month/year.
Any suggestions on where to start, how to go about implementing an algorithm like this?
Updates based on Jim's comments:
The values of A, B, C are an aggregate number of points with equal value... the # of times a user performed the action.
The time component should decay linearly, with no acceleration.
Any suggestions on where to start
The obvious solution is to keep track of every event, along with a timestamp for that event. Then the rest is just math. However, that may require more storage, and more computation time than is desirable.
So my suggestion is to use binning. If the overall time decay period is one day, then use 12 two-hour bins. For example, at midnight the first bin (which represents the 00:00am to 02:00am time period) is cleared. Then any events that occur before 2:00am update the ABC counters in that bin. The bin has full weight until 2:00am, after which it is reduced in weight, until getting cleared again at midnight.
If the time period is a week, use 7 daily bins or 14 half-day bins. For a one month period, use 15 two-day bins, or 10 three-day bins. And for a year, use 12 monthly bins.
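A rough sketch of this binning idea, assuming a one-day decay split into 12 two-hour bins and the A/B/C weights from the question (all names here are illustrative):

```python
import time

WEIGHTS = {"A": 0.5, "B": 0.3, "C": 0.2}   # action significance from the question
NUM_BINS = 12                               # 12 two-hour bins for a one-day decay
BIN_SECONDS = 2 * 60 * 60

class DecayingScore:
    def __init__(self, clock=time.time):
        self.clock = clock
        self.bins = [dict.fromkeys(WEIGHTS, 0) for _ in range(NUM_BINS)]
        self.current = self._abs_bin()

    def _abs_bin(self):
        return int(self.clock() // BIN_SECONDS)

    def _rotate(self):
        # Clear every bin we have rolled past since the last update.
        newest = self._abs_bin()
        if newest - self.current >= NUM_BINS:
            self.bins = [dict.fromkeys(WEIGHTS, 0) for _ in range(NUM_BINS)]
        else:
            for b in range(self.current + 1, newest + 1):
                self.bins[b % NUM_BINS] = dict.fromkeys(WEIGHTS, 0)
        self.current = newest

    def record(self, action):
        self._rotate()
        self.bins[self.current % NUM_BINS][action] += 1

    def score(self):
        self._rotate()
        total = 0.0
        for age in range(NUM_BINS):
            # Newest bin gets full weight; the oldest keeps only 1/NUM_BINS (linear decay).
            decay = (NUM_BINS - age) / NUM_BINS
            counts = self.bins[(self.current - age) % NUM_BINS]
            total += decay * sum(WEIGHTS[a] * counts[a] for a in WEIGHTS)
        return total
```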
Taking a shot at giving a high level design.
There are only two reasons why a user's score will change:
The user performed some action; or,
A unit of time passed.
The time's interaction results in a linear decay†.
The Algorithm
You are trying to rank users on the basis of a score generated from their contributions to actions A, B, and C. Let's start with outlining what the software will do when one of the two causes for a score change occurs.
When a user performs an action: Generate the user's scores for the rest of time, assuming that the user will perform no further actions, and put them in a queue within the user object. The front of the queue will tell the current score of the user.
When a unit of time passes: Just dequeue the front from its score queue.
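A minimal sketch of those two event handlers, assuming a linear decay over some fixed number of time units (the decay length, weights, and all names are illustrative):

```python
from collections import deque

WEIGHTS = {"A": 0.5, "B": 0.3, "C": 0.2}
DECAY_UNITS = 30  # an action's value decays linearly to 0 over 30 time units (assumed)

class User:
    def __init__(self):
        self.scores = deque()  # scores[i] = this user's score i time units from now

    def on_action(self, action):
        # Regenerate the future scores assuming no further actions: the new action
        # contributes its full weight now and decays linearly to 0.
        base = WEIGHTS[action]
        for i in range(DECAY_UNITS):
            contribution = base * (DECAY_UNITS - i) / DECAY_UNITS
            if i < len(self.scores):
                self.scores[i] += contribution
            else:
                self.scores.append(contribution)

    def on_time_tick(self):
        # A unit of time passed: drop the front of the score queue.
        if self.scores:
            self.scores.popleft()

    def current_score(self):
        return self.scores[0] if self.scores else 0.0
```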
The Data Structures
It seems to me that the traditional data structures - Arrays, Trees, Hashmaps - and even the usual augmented data structures - Linked Hashmap, Red Black Tree - will not be sufficient to calculate rank for such a scoring model. You will need to move a level up to get the right data structure for generating rank from this scoring system.
I can imagine a multi-doubly-linked kind of Hashmap. It would look somewhat like this:
So in the diagram above, we have one common storage containing all the user objects. Then we have multiple singly/doubly linked indices into the user storage. This way, all the indices associated with the user object can be updated when the user's score changes.
Finally, the ranking can be allowed to not necessarily begin from 1. The sorted-concurrent-hashmap can be updated and could hold negative ranks. Since the map is sorted, the most negative rank will be the first rank and further ranks can be obtained by sorted map's traversal. The ranks can be normalized back to start with some high positive number when the minimum rank gets close to the underflow limit.
This is a pretty big problem. There are many more ideas and optimizations that I have in mind. It is too big a task to mention all of them here. If you have a specific question, I can try to answer that.
†The time's interaction results in a linear decay. So I assume that calculating the time-decaying scores from the user's current score to the next (let's say) 100 scores is simple. How many future scores need to be calculated will depend on what you consider to be one unit of time.

Exploration Algorithm

Massively edited this question to make it easier to understand.
Given an environment with arbitrary dimensions and arbitrary positioning of an arbitrary number of obstacles, I have an agent exploring the environment with a limited range of sight (obstacles don't block sight). It can move in the four cardinal directions of NSEW, one cell at a time, and the graph is unweighted (each step has a cost of 1). Linked below is a map representing the agent's (yellow guy) current belief of the environment at the instant of planning. Time does not pass in the simulation while the agent is planning.
http://imagizer.imageshack.us/a/img913/9274/qRsazT.jpg
What exploration algorithm can I use to maximise the cost-efficiency of utility, given that revisiting cells is allowed? Each cell holds a utility value. Ideally, I would seek to maximise the sum of utility of all cells SEEN (not visited) divided by the path length, although if that is too complex for any suitable algorithm then the number of cells seen will suffice. There is a maximum path length but it is generally in the hundreds or higher. (The actual test environments used on my agent are at least 4x bigger, although theoretically there is no upper bound on the dimensions that can be set, and the maximum path length would thus increase accordingly.)
I consider BFS and DFS to be intractable, A* to be non-optimal given a lack of suitable heuristics, and Dijkstra's to be inappropriate for generating a single unbroken path. Is there any algorithm you can think of? Also, I need help with loop detection, as I've never done that before; this is the first time I've allowed revisits.
One approach I have considered is to reduce the map into a spanning tree, except that instead of defining it as a tree that connects all cells, it is defined as a tree that can see all cells. My approach would result in the following:
http://imagizer.imageshack.us/a/img910/3050/HGu40d.jpg
In the resultant tree, the agent can go from a node to any adjacent nodes that are 0-1 turn away at intersections. This is as far as my thinking has gotten right now. A solution generated using this tree may not be optimal, but it should at least be near-optimal with much fewer cells being processed by the algorithm, so if that would make the algorithm more likely to be tractable, then I guess that is an acceptable trade-off. I'm still stuck with thinking how exactly to generate a path for this however.
Your problem is very similar to a canonical Reinforcement Learning (RL) problem, the Grid World. I would formalize it as a standard Markov Decision Process (MDP) and use any RL algorithm to solve it.
The formalization would be:
States s: your NxM discrete grid.
Actions a: UP, DOWN, LEFT, RIGHT.
Reward r: the value of the cells that the agent can see from the destination cell s', i.e. r(s,a,s') = sum(value(seen(s'))).
Transition function: P(s' | s, a) = 1 if s' is not out of the boundaries or a black cell, 0 otherwise.
Since you are interested in the average reward, the discount factor is 1 and you have to normalize the cumulative reward by the number of steps. You also said that each step has cost one, so you could subtract 1 from the immediate reward r at each time step, but this would not add anything since you will already average by the number of steps.
Since the problem is discrete the policy could be a simple softmax (or Gibbs) distribution.
As a solving algorithm you can use Q-learning, which guarantees the optimality of the solution provided a sufficient number of samples. However, if your grid is too big (and you said that there is no limit) I would suggest policy search algorithms, like policy gradient or relative entropy (although they guarantee convergence only to local optima). You can find something about Q-learning basically everywhere on the Internet. For a recent survey on policy search I suggest this.
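As a hedged illustration of that formalization, a minimal tabular Q-learning sketch might look like the following; the grid representation, the seen_value function (the sum of utilities visible from a cell), and the hyperparameters are all assumptions, and an epsilon-greedy policy stands in for the softmax policy mentioned above:

```python
import random
from collections import defaultdict

ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

def q_learning(grid, seen_value, start, episodes=5000, steps=200,
               alpha=0.1, gamma=1.0, epsilon=0.2):
    """grid[r][c] is True for blocked cells; seen_value(state) returns the sum of
    utilities of the cells visible from `state` (the immediate reward r(s,a,s'))."""
    rows, cols = len(grid), len(grid[0])
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0

    def step(state, action):
        r, c = state
        dr, dc = ACTIONS[action]
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and not grid[nr][nc]:
            return (nr, nc)
        return state  # hitting a boundary or obstacle leaves the agent in place

    for _ in range(episodes):
        state = start
        for _ in range(steps):
            # epsilon-greedy exploration (a softmax/Gibbs policy works too)
            if random.random() < epsilon:
                action = random.choice(list(ACTIONS))
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            nxt = step(state, action)
            reward = seen_value(nxt)
            best_next = max(Q[(nxt, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = nxt
    return Q
```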
The cool thing about these approaches is that they encode the exploration in the policy (e.g., the temperature in a softmax policy, the variance in a Gaussian distribution) and will try to maximize the cumulative long term reward as described by your MDP. So usually you initialize your policy with a high exploration (e.g., a complete random policy) and by trial and error the algorithm will make it deterministic and converge to the optimal one (however, sometimes also a stochastic policy is optimal).
The main difference between all the RL algorithms is how they perform the update of the policy at each iteration and how they manage the exploration-exploitation tradeoff (how much should I explore vs. how much should I exploit the information I already have).
As suggested by Demplo, you could also use Genetic Algorithms (GA), but they are usually slower and require more tuning (elitism, crossover, mutation...).
I have also tried some policy search algorithms on your problem and they seem to work well, although I initialized the grid randomly and do not know the exact optimal solution. If you provide some additional details (a test grid, the max number of steps, and whether the initial position is fixed or random) I can test them more precisely.

AI algorithm possible solution for shortest path

I need advice on a heuristic for the minesweeper game. If I have found 10 fields without mines, how should I estimate which field to open next? I was thinking about computing the probability of a mine around every field with a number and, at the end of the computation, choosing the field with the lowest probability, but I don't think that will give me good results, because I need to open an already safe field, and what I really need is to open the field which will open up the biggest area on the board. I would like to read good ideas, but just without cheating algorithms.
You could try an A* search with Monte Carlo simulation. That is, define a cost/reward for each type of cell being opened (each type of action).
Assume you have K different actions you can perform (a_1, a_2, a_3, ...) at the current timestep.
For each action (open cell X), use the game model to simulate what would happen next. Store the reward for the sequence of actions, and accumulate that reward back onto the original action. You can add probability weights to the actions and their consequences to make the estimate more accurate.
Take the average of the simulated rewards for each action and action sequence. After M simulations at depth D (where M and D are just pre-defined values to ensure the algorithm doesn't take too long), choose the action from (a_1, a_2, a_3, ...) with the highest simulated reward. Pruning is necessary to make this method efficient (that is, to avoid wasting time on actions that definitely do not lead to high reward after a few simulation steps).
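A minimal sketch of this simulate-and-average loop (without the pruning), where legal_actions, simulate_step, and rollout_policy are assumed to be supplied by the game model:

```python
def choose_action(state, legal_actions, simulate_step, rollout_policy,
                  num_simulations=200, depth=5):
    """Monte Carlo action selection: for each candidate action, run several
    simulated continuations and keep the action with the highest average reward.
    simulate_step(state, action) must return (next_state, reward, done)."""
    best_action, best_value = None, float("-inf")
    for action in legal_actions(state):
        total = 0.0
        for _ in range(num_simulations):
            s, reward, done = simulate_step(state, action)
            total += reward
            for _ in range(depth - 1):
                if done:
                    break
                a = rollout_policy(s)          # e.g. pick a random legal action
                s, r, done = simulate_step(s, a)
                total += r
        average = total / num_simulations
        if average > best_value:
            best_action, best_value = action, average
    return best_action
```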

How to pick base samples deterministically in the particle filter algorithm?

The particle filter algorithm is known for its use in tracking objects in a video sequence: at each iteration, the algorithm generates hypotheses (or samples) about the motion of the object. In order to generate a new hypothesis, the first step of the condensation algorithm involves the selection of a sample: the example provided on this web page shows an implementation of the selection step, which uses binary search in order to pick a base sample; the comment accompanying the pick_base_sample() function explains that
The use of this routine makes Condensation O(NlogN) where N is the number of samples. It is probably better to pick base samples deterministically, since then the algorithm is O(N) and probably marginally more efficient, but this routine is kept here for conceptual simplicity and because it maps better to the published literature.
What does it mean to pick base samples deterministically?
How do you pick base samples deterministically?
The condensation algorithm makes use of multiple samples to represent the current estimated state, each sample has an associated weight (that estimates the probability that the sample is correct).
The selection step chooses N samples from this set (with replacement, so the same sample can appear multiple times).
To explain the selection step, imagine drawing the samples as a series of line segments. Let the width of each line segment equal the weight of that sample.
For example, suppose we had samples A (weight 0.1), B (weight 0.3), and C (weight 0.6).
We would draw:
ABBBCCCCCC
The normal random selection process involves drawing samples by picking a random point along this line and seeing which sample appears at that position. The perceived problem with this approach is that it takes O(logN) operations to work out which sample appears at a particular location when using a tree data structure to hold the weights. (Although in practice I would not expect this to be the main processing bottleneck in an implementation)
An alternative deterministic (basically think "repeatable" and "not involving random numbers") approach is to simply choose samples by picking N regularly spaced points along the same line. The advantage of this is that the algorithm to do this takes time O(N) instead of O(NlogN).
(The deterministic algorithm is to loop over all the samples keeping track of the total weight seen so far. Whenever the total weight reaches the next regularly spaced point you collect a new sample. This only requires a single pass over the samples so is O(N).)
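A minimal sketch of that single pass, often called systematic resampling, with illustrative names:

```python
def pick_base_samples(samples, weights, n):
    """Deterministic O(N) selection: walk the cumulative weights once and take a
    sample whenever the running total passes the next regularly spaced point."""
    total = sum(weights)
    step = total / n
    next_point = step / 2.0  # centre the n regularly spaced points along the line
    cumulative = 0.0
    picked = []
    for sample, weight in zip(samples, weights):
        cumulative += weight
        while len(picked) < n and next_point <= cumulative:
            picked.append(sample)
            next_point += step
    return picked

# Example from above: A (0.1), B (0.3), C (0.6) with n = 10 yields ABBBCCCCCC.
print(pick_base_samples(["A", "B", "C"], [0.1, 0.3, 0.6], 10))
```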
