Dynamic Programming of Markov Decision Process with Value Iteration

I am learning about MDPs and value iteration through self-study, and I hope someone can improve my understanding.
Consider the problem of a 3-sided die with the numbers 1, 2, 3. If you roll a 1 or a 2 you get that value in $, but if you roll a 3 you lose all your money and the game ends (finite horizon problem).
Conceptually I understand how this is done with the following formula:
So let's break that down:
Since this is a finite horizon problem we can ignore gamma.
If I observe 1, I can either go or stop. The utility/value of that is:
V(1) = max(Q(1, g), Q(1, s))
Q(1, g) = r + SUM( P( 2 | 1,g) * V(2) + P( 3 | 1,g) * V(3))
Q(1, s) = r + SUM( P( 2 | 1,s) * V(2) + P( 3 | 1,s) * V(3))
where r = 1
If I observe 2, I can either go or stop:
V(2) = max(Q(2, g), Q(2, s))
Q(2, g) = r + SUM( P( 1 | 2,g) * V(1) + P( 3 | 2,g) * V(3))
Q(2, s) = r + SUM( P( 1 | 2,s) * V(1) + P( 3 | 2,s) * V(3))
where r = 2
I observe 3, the game ends.
Intuitively V(3) is 0 because the game is over, so we can remove that half from the equation of Q(1, g). We defined V(2) above also so we can substitute that as:
Q(1, g) = r + SUM( P( 2 | 1,g) *
MAX ((P( 1 | 2,g) * V(1)) , (P( 1 | 2,s) * V(1))))
This is where things take a bad turn. I am not sure how to solve Q(1, g) if it has its own definition in its solution. This is likely due to my poor math background.
What I do understand is that the utilities or the values of the states will change based on the reward and therefore the decision will change.
Specifically if rolling three gave you $3 while rolling one ended the game, that will affect your decision because the utility has changed.
But I am not sure how to write code to calculate that.
Can someone explain how Dynamic Programming works in this? How do I solve Q(1,g) or Q(1,s) when it is in its own definition?

Special solution:
For your example, it is pretty easy to know whether "go" or "stop" should be chosen: there is a money value X at which it makes no difference whether you "go" or "stop"; for all smaller values you should "go", and for all bigger values you should "stop". So the only question is what this value is:
X=E("stop"|X)=E("go"|X)=1/3(1+X)+1/3(2+X) =>
1/3X=1 =>
X=3
Already in the first line, I used that even if I choose "go" and win I will choose stop in the next round. So knowing what decision should be made, it is easy to calculate the expected win with the perfect strategy, here in python:
def calc(money):
    PROB = 1.0 / 3.0
    if money < 3:  # go
        # rolling a 3 (probability PROB) ends the game with nothing
        return PROB * calc(money + 1) + PROB * calc(money + 2) - PROB * 0
    else:  # stop
        return money
print("Expected win: %.11f" % calc(0))
>>> Expected win: 1.37037037037
General solution:
I'm not sure the above course of action can be generalized to arbitrary scenarios. However, there is another way to solve such problems.
Let's change the game a little bit: infinitely many turns are no longer possible, but at most N turns. Then your recursion becomes:
E(money, N)=max(money, 1/3*E(money+1, N-1)+1/3*E(money+2, N-1)), with E(money, 0)=money
As you can easily see, the value E(money, N) no longer depends on itself but on the results of a game with a smaller number of turns.
Without proof, I state that the value you are looking for is E(money)=lim_{N->infinity} E(money, N).
For your special problem the Python code would look as follows:
PROB = 1.0 / 3.0
MAX_GOS = 20  # neglect all possibilities with more than 20 "go" decisions
LENGTH = 2 * MAX_GOS + 1  # each "go" wins at most $2
# What is the expected value if the game ended now?
expected = list(range(LENGTH))
for gos_left in range(1, MAX_GOS + 1):
    next_expected = [0] * len(expected)
    for money in range(LENGTH - gos_left * 2):
        # decision: stop or go
        next_expected[money] = max(expected[money], PROB * expected[money + 1] + PROB * expected[money + 2])
    expected = next_expected
print("Expected win: %.11f" % expected[0])
>>> Expected win: 1.37037037037
I'm glad both methods yielded the same result!
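The finite-horizon recursion E(money, N) can also be written directly as a memoized function, which may make it easier to see why nothing is defined in terms of itself any more. This is only a sketch under the game rules stated in the question (win $1 or $2 with probability 1/3 each, lose everything otherwise); the names `expected_win` and `turns_left` are made up for this illustration.

```python
from functools import lru_cache

PROB = 1.0 / 3.0  # probability of each face of the 3-sided die


@lru_cache(maxsize=None)
def expected_win(money, turns_left):
    """Expected final winnings with optimal play and at most turns_left rolls."""
    if turns_left == 0:
        # No rolls left: we keep what we have.
        return float(money)
    # "go": roll 1 or 2 and continue optimally; rolling 3 contributes 0.
    go = (PROB * expected_win(money + 1, turns_left - 1)
          + PROB * expected_win(money + 2, turns_left - 1))
    # "stop": keep the current money.
    return max(float(money), go)


# The values settle after a few turns.
for turns in range(6):
    print(turns, expected_win(0, turns))
```

For this game the value for `money = 0` stops changing once three or more turns are allowed, and it equals 37/27 ≈ 1.37037, in agreement with both results above.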

Related

Easily implementable solution for this brain teaser?

So I have a brain teaser I read on one of the algorithm and puzzle meetups we have on our uni that goes like this:
There's a school that awards students that, during a given period, are
never late more than once and who don't ever happen to be absent for
three or more consecutive days. How many possible permutations with repetitions of
presence (or lack thereof) can we build for a given timeframe that
grant the student an award? Assume that each day is just a state
On-time, Late or Absent for the whole day, don't worry about specific
classes. Example: for three day timeframes, we can create 19 such
permutations with repetitions that grant an award.
I already posted it on math.SE yesterday because I was interested in whether there was some ready-made formula we could derive to solve it, but it turns out there isn't, and all the transformations are rather complex.
Thus, I'm asking here: how would you approach such a problem with an algorithm? I tried narrowing down the space of possibilities, but after a while enumerating all the possible permutations with repetitions became too much and the algorithm started becoming really complex, while I believe there should be some easy-to-implement way to solve it, especially since most of the puzzles we exchange at the meetup are like that.

Here is a simplified version of Python 3 code implementing the recursion in the answer by @ProgrammerPerson:
from functools import lru_cache
def count_variants(max_late, base_absent, period_length):
    """
    max_late – maximum allowed number of days the student can be late;
    base_absent – the number of consecutive days the student can be absent;
    period_length – days in a period."""
    @lru_cache(max_late * base_absent * period_length)
    def count(late, absent, days):
        if late < 0: return 0
        if absent < 0: return 0
        if days == 0: return 1
        return (count(late, base_absent, days-1) +  # Student is on time. Absent reset.
                count(late-1, base_absent, days-1) +  # Student is late. Absent reset.
                count(late, absent-1, days-1))  # Student is absent.
    return count(max_late, base_absent, period_length)
Run example:
In [2]: count_variants(1, 2, 3)
Out[2]: 19
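For small periods the result can be cross-checked by brute force: enumerate every attendance record and keep those that satisfy both rules. This is only a sanity-check sketch; the function name `count_brute_force` and the letters "O", "L", "A" for on-time, late and absent are assumptions made for this example, and the approach is exponential in the period length.

```python
from itertools import product


def count_brute_force(period_length, max_late=1, max_absent_run=2):
    """Count award-winning records by checking all 3**period_length strings."""
    total = 0
    for days in product("OLA", repeat=period_length):
        record = "".join(days)
        if record.count("L") > max_late:
            continue  # late too often
        if "A" * (max_absent_run + 1) in record:
            continue  # absent too many days in a row
        total += 1
    return total


for n in range(1, 6):
    print(n, count_brute_force(n))
```

For a three-day period this gives 19, matching the example in the question.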

This screams recursion (and/or dynamic programming)!
Suppose we try to solve a slightly more general problem:
We give an award if a student is late no more than L times, and isn't
absent for A or more consecutive days.
Now we want to compute the number of possibilities for an n days time frame.
Call this method P(L, A, n)
Now try to build up a recursion based on three cases for the first day of the period.
1) If the student is on-time for the first day, then the number is simply
P(L, A, n-1)
2) If the student is late the first day, then the number is
P(L-1, A, n-1)
3) If the student is absent the first day, then the number is
P(L, A-1, n-1)
This gives us the recursion:
P(L, A, n) = P(L, A, n-1) + P(L-1, A, n-1) + P(L, A-1, n-1)
You can either memoize the recursion, or just have tables which you lookup.
Be careful about the base cases which are
P(0, *, *), P(*, 0, *) and P(*, *, 0) and can be computed by easy mathematical formulae.
Here is some quick Python code, with memoization + recursion, to demonstrate:
import math
def binom(n, r):
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))
# The memoization table.
table = {}
def P(L, A, n):
    if L == 0:
        # Only ontime or absent.
        # More absents than period.
        if A > n:
            return 2**n
        # 2^n total possibilities.
        # of that n-A+1 are non-rewarding.
        return 2**n - (n - A + 1)
    if A == 0:
        # Only Late or ontime.
        # need fewer than L+1 late.
        # This is n choose 0 + n choose 1 + ... + n choose L
        total = 0
        for l in range(0, min(L, n)):
            total += binom(n, l)
        return total
    if n == 0:
        return 1
    if (L, A, n) in table:
        return table[(L, A, n)]
    result = P(L, A, n-1) + P(L-1, A, n-1) + P(L, A-1, n-1)
    table[(L, A, n)] = result
    return result
print(P(1, 3, 3))
Output is 19.

Let S(n) be the number of strings of length n without 3 repeated 1s.
Any such string (with length at least 3) ends in "0", "01" or "011" (and after removing the suffix, any string without three consecutive 1s can appear).
Then for n > 2, S(n) = S(n-1) + S(n-2) + S(n-3), and S(0)=1, S(1)=2, S(2)=4.
If you have a late day on day i (counting from 0), then you have S(i) ways of arranging absent days before, and S(n-i-1) ways of arranging absent days after.
Thus, the solution to the original problem is S(n) + sum(S(i)*S(n-i-1) | i = 0...n-1)
We can compute solutions iteratively like this:
def ways(n):
    S = [1, 2, 4] + [0] * (n-2)
    for i in range(3, n+1):
        S[i] = S[i-1] + S[i-2] + S[i-3]
    return S[n] + sum(S[i] * S[n-i-1] for i in range(n))
for i in range(1, 20):
    print(i, ways(i))
Output:
1 3
2 8
3 19
4 43
5 94
6 200
7 418
8 861
9 1753
10 3536
11 7077
12 14071
13 27820
14 54736
15 107236
16 209305
17 407167
18 789720
19 1527607

Solving linear equations represented as a string

I'm given a string 2*x + 5 - (3*x-2)=x + 5 and I need to solve for x. My thought process is that I'd convert it to an expression tree, something like,
                =
              /   \
            -       +
          /   \    / \
         +     -  x   5
        / \   / \
       *   5 *   2
      / \   / \
     2   x 3   x
But how do I actually reduce the tree from here? Any other ideas?

You have to reduce it using axioms from algebra
a * (b + c) -> (a * b) + (a * c)
This is done by checking the types of each node in the parse tree. Once the expression is fully expanded into terms, you can then check that they are actually linear, etc.
The values in the tree will be either variables or numbers. It isn't very neat to represent these as classes inheriting from some AbstractTreeNode class, however, because C++ doesn't have multiple dispatch. So it is better to do it the C way.
enum NodeType {
    Number,
    Variable,
    Addition,        // binary '+'
    Multiplication   // binary '*'
};
struct Node {
    NodeType type;
    // Only the member matching `type` is valid.
    union {
        const char* name;    // Variable: the variable name ("x")
        int value;           // Number: the numerical value
        Node* children[2];   // Addition / Multiplication: the two operands
    };
};
Now you can query the types of a node and its children using switch/case.
As I said earlier, idiomatic C++ code would use virtual functions but lacks the multiple dispatch needed to solve this cleanly. (You would need to store the type anyway.)
Then you group terms, etc and solve the equation.
You can have rules to normalise the tree, for example
constant + variable -> variable + constant
Would put x always on the left of a term. Then x * 2 + x * 4 could be simplified more easily
var * constant + var * constant -> (sum of constants) * var
In your example...
First, simplify the '=' by moving the terms (as per the rule above)
The right hand side will be -1 * (x + 5), becoming -1 * x + -1 * 5. The left hand side will be harder - consider replacing a - b with a + -1 * b.
Eventually,
2x + 5 + -3x + 2 + -x + -5 = 0
Then you can group terms whichever way you want (by scanning along, etc.).
(2 + -3 + -1) x + 5 + 2 + -5 = 0
Sum them up and when you have mx + c, solve it.

Assuming you have a first-order equation, check all the leaves on each side. On each side, have two bins: one to add up all the leaves containing a multiple of x and one for all the leaves containing a constant. Either add to a bin or multiply each bin as you step up the tree along each branch from the leaves. You will end up with something that is conceptually like
a*x + b = c*x + d
At that point, you can just solve
x = (d - b) / (a - c)
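As a rough illustration of the two-bin idea, each subtree can be reduced to a pair (a, b) standing for a*x + b. The sketch below leans on Python's own `ast` parser instead of building a custom tree, so it assumes the equation uses Python-compatible syntax (explicit `*`, a single `=`, one variable). The names `linear_form` and `solve_linear` are hypothetical, and only `+`, `-`, `*` and unary minus are handled.

```python
import ast
from fractions import Fraction


def linear_form(node):
    """Reduce an expression subtree to (a, b), meaning a*x + b."""
    if isinstance(node, ast.Expression):
        return linear_form(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return Fraction(0), Fraction(node.value)
    if isinstance(node, ast.Name):
        # Assumption: every name is the single unknown x.
        return Fraction(1), Fraction(0)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        a, b = linear_form(node.operand)
        return -a, -b
    if isinstance(node, ast.BinOp):
        a1, b1 = linear_form(node.left)
        a2, b2 = linear_form(node.right)
        if isinstance(node.op, ast.Add):
            return a1 + a2, b1 + b2
        if isinstance(node.op, ast.Sub):
            return a1 - a2, b1 - b2
        if isinstance(node.op, ast.Mult):
            if a1 != 0 and a2 != 0:
                raise ValueError("x * x term: equation is not linear")
            return a1 * b2 + a2 * b1, b1 * b2
    raise ValueError("unsupported expression")


def solve_linear(equation):
    """Solve 'lhs = rhs' for x, where both sides are linear in x."""
    left, right = equation.split("=")
    a, b = linear_form(ast.parse(left.strip(), mode="eval"))
    c, d = linear_form(ast.parse(right.strip(), mode="eval"))
    if a == c:
        raise ValueError("no unique solution")
    return (d - b) / (a - c)


print(solve_linear("2*x + 5 - (3*x-2)=x + 5"))
```

Using `Fraction` keeps the arithmetic exact; for the equation from the question the two sides reduce to -x + 7 and x + 5, so the solution is x = 1.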

Assuming the equation can reduce to f(x) = 0, and f(x) = a * x + b.
You can transform every leaf in the expression tree into the form f(x), for example: 2 -> 0 * x + 2, 3 * x -> 3 * x + 0. Then you can do the arithmetic operations on f(x) in the expression tree, and finally solve the equation f(x) = 0.
If the function is much more complicated than a polynomial, you can do a binary search on x, using the expression tree to calculate the left and right sides of the equation.

Dynamic programming idiom for combinations

Consider the problem in which you have a value of N and you need to calculate how many ways you can sum up to N dollars using [1,2,5,10,20,50,100] Dollar bills.
Consider the classic DP solution:
C = [1,2,5,10,20,50,100]
def comb(p):
    if p==0:
        return 1
    c = 0
    for x in C:
        if x <= p:
            c += comb(p-x)
    return c
It counts different orderings of the same parts separately. For example, comb(4) will yield 5 results: [1,1,1,1],[2,1,1],[1,2,1],[1,1,2],[2,2], whereas there are actually only 3 distinct results ([2,1,1],[1,2,1],[1,1,2] are all the same).
What is the DP idiom for calculating this problem? (non-elegant solutions such as generating all possible solutions and removing duplicates are not welcome)

Not sure about any DP idioms, but you could try using Generating Functions.
What we need to find is the coefficient of x^N in
(1 + x + x^2 + ...)(1+x^5 + x^10 + ...)(1+x^10 + x^20 + ...)...(1+x^100 + x^200 + ...)
(number of times 1 appears*1 + number of times 5 appears * 5 + ... )
Which is same as the reciprocal of
(1-x)(1-x^5)(1-x^10)(1-x^20)(1-x^50)(1-x^100).
You can now factorize each in terms of products of roots of unity, split the reciprocal in terms of Partial Fractions (which is a one time step) and find the coefficient of x^N in each (which will be of the form Polynomial/(x-w)) and add them up.
You could do some DP in calculating the roots of unity.

You should not start from the beginning of the list each time, but at most from the denomination you came from at each depth.
That means you have to pass two parameters: the start index and the remaining total.
C = [1,5,10,20,50,100]
def comb(p,start=0):
    if p==0:
        return 1
    c = 0
    for i,x in enumerate(C[start:]):
        if x <= p:
            c += comb(p-x,i+start)
    return c
or equivalent (it might be more readable)
C = [1,5,10,20,50,100]
def comb(p,start=0):
    if p==0:
        return 1
    c = 0
    for i in range(start,len(C)):
        x=C[i]
        if x <= p:
            c += comb(p-x,i)
    return c
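The same "never go back to an earlier denomination" idea is often written bottom-up, which avoids the exponential recursion. The following is a sketch of that table-based idiom, not part of the answer above; the name `count_ways` is made up, and the default denominations are taken from the question (including the $2 bill).

```python
def count_ways(total, denominations=(1, 2, 5, 10, 20, 50, 100)):
    """Count the multisets of bills that sum to total (order is ignored)."""
    # ways[a] = number of ways to form amount a with the bills seen so far.
    ways = [1] + [0] * total
    # Outer loop over bills, inner loop over amounts: each combination is
    # built in one fixed bill order, so reorderings are not counted twice.
    for bill in denominations:
        for amount in range(bill, total + 1):
            ways[amount] += ways[amount - bill]
    return ways[total]


print(count_ways(4))
```

With the question's denominations `count_ways(4)` is 3, the three distinct results listed in the question. Swapping the two loops would count ordered sequences again, which is the behaviour of the original `comb`.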

Terminology: What you are looking for is the "integer partitions"
into prescribed parts (you should replace "combinations" in the title).
Ignoring the "dynamic programming" part of the question, a routine
for your problem is given in the first section of chapter 16
("Integer partitions", p.339ff) of the fxtbook, online at
http://www.jjj.de/fxt/#fxtbook

John Tukey "median median" (or "resistant line") statistical test for R and linear regression

I'm searching for the John Tukey algorithm which computes a "resistant line" or "median-median line" for my linear regression in R.
A student on a mailing list explained this algorithm in these terms:
"The way it's calculated is to divide
the data into three groups, find the
x-median and y-median values (called
the summary point) for each group, and
then use those three summary points to
determine the line. The outer two
summary points determine the slope,
and an average of all of them
determines the intercept."
An article about John Tukey's median-median for the curious: http://www.johndcook.com/blog/2009/06/23/tukey-median-ninther/
Do you have an idea of where I could find this algorithm or an R function? In which packages?
Thanks a lot!

There's a description of how to calculate the median-median line here. An R implementation of that is
median_median_line <- function(x, y, data)
{
  if(!missing(data))
  {
    x <- eval(substitute(x), data)
    y <- eval(substitute(y), data)
  }
  stopifnot(length(x) == length(y))
  #Step 1: group sizes (left, middle, right)
  one_third_length <- floor(length(x) / 3)
  groups <- rep(1:3, times = switch((length(x) %% 3) + 1,
    rep(one_third_length, 3),
    c(one_third_length, one_third_length + 1, one_third_length),
    c(one_third_length + 1, one_third_length, one_third_length + 1)
  ))
  #Step 2: sort by x, keeping each (x, y) pair together
  ord <- order(x)
  x <- x[ord]
  y <- y[ord]
  #Step 3
  median_x <- tapply(x, groups, median)
  median_y <- tapply(y, groups, median)
  #Step 4
  slope <- (median_y[3] - median_y[1]) / (median_x[3] - median_x[1])
  intercept <- median_y[1] - slope * median_x[1]
  #Step 5
  middle_prediction <- intercept + slope * median_x[2]
  intercept <- intercept + (median_y[2] - middle_prediction) / 3
  c(intercept = unname(intercept), slope = unname(slope))
}
To test it, here's an example:
dfr <- data.frame(
time = c(.16, .24, .25, .30, .30, .32, .36, .36, .50, .50, .57, .61, .61, .68, .72, .72, .83, .88, .89),
distance = c(12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1, 124.6, 129.7, 150.2, 182.2, 189.4, 220.4, 250.4, 261.0, 334.5, 375.5, 399.1))
median_median_line(time, distance, dfr)
#intercept slope
# -113.6 520.0
Note the slightly odd way of specifying the groups. The instructions are quite picky about how you define group sizes, so the more obvious method of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work.
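For readers who want to follow the five steps outside R, here is a small Python sketch of the same procedure. It is a port written for illustration, not a reference implementation: the group-size rule mirrors the R function above (the extra point goes to the middle group, or one extra point to each outer group), and no care is taken for degenerate input such as identical outer x-medians.

```python
from statistics import median


def median_median_line(xs, ys):
    """Return (intercept, slope) of the median-median (resistant) line."""
    pairs = sorted(zip(xs, ys))  # sort by x, keeping (x, y) pairs together
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three points")
    k, r = divmod(n, 3)
    left_size = k + (1 if r == 2 else 0)
    middle_size = k + (1 if r == 1 else 0)
    groups = [
        pairs[:left_size],
        pairs[left_size:left_size + middle_size],
        pairs[left_size + middle_size:],
    ]
    # Summary point of each group: (median of x values, median of y values).
    summary = [(median(p[0] for p in g), median(p[1] for p in g)) for g in groups]
    (lx, ly), (mx, my), (rx, ry) = summary
    # The outer summary points give the slope.
    slope = (ry - ly) / (rx - lx)
    # The intercept is the average over all three summary points.
    intercept = ((ly - slope * lx) + (my - slope * mx) + (ry - slope * rx)) / 3
    return intercept, slope


time = [.16, .24, .25, .30, .30, .32, .36, .36, .50, .50,
        .57, .61, .61, .68, .72, .72, .83, .88, .89]
distance = [12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1, 124.6, 129.7,
            150.2, 182.2, 189.4, 220.4, 250.4, 261.0, 334.5, 375.5, 399.1]
print(median_median_line(time, distance))
```

On the time/distance data from the example this gives an intercept of about -113.6 and a slope of about 520, the same values as the R function.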

I'm a little late to the party, but have you tried line() from the stats package?
From the helpfile:
Value
An object of class "tukeyline".
References
Tukey, J. W. (1977). Exploratory Data Analysis, Reading Massachusetts: Addison-Wesley.

As a member of the R Core team, I have now dug into the source code, and also studied its history.
Conclusion: The C source code, added in 1996/1997, when R was still called alpha (and around version 0.14alpha), already computed the quantiles not quite correctly... for some sample sizes.
More about this on the R mailing lists (not yet).

Google Code Jam 2008: Round 1A Question 3

At Google Code Jam 2008 round 1A, there is a problem:
Calculate last three digits before the
decimal point for the number
(3+sqrt(5))^n
n can be a big number, up to 1000000.
For example: if n = 2 then (3+sqrt(5))^2 = 27.4164079... answer is 027.
For n = 5: (3+sqrt(5))^5 = 3935.73982... answer is 935.
One solution is to create a 2x2 matrix M: [[0, 1], [-4, 6]], then calculate the matrix P = M^n, where the calculation is performed modulo 1000,
and the result is (6*P[0,0] + 28*P[0,1] - 1) mod 1000.
Can someone explain this solution to me?

I'll present a method to solve this problem without even understanding the solution.
Assuming that you are familiar with the fibonacci numbers:
ghci> let fib = 0 : 1 : zipWith (+) fib (tail fib)
ghci> take 16 fib
[0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610]
And are also familiar with its closed form expression:
ghci> let calcFib i = round (((1 + sqrt 5) / 2) ^ i / sqrt 5)
ghci> map calcFib [0..15]
[0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610]
And you notice the similarity of ((1 + sqrt 5) / 2)^n and (3 + sqrt 5)^n.
From here one can guess that there is probably a series similar to fibonacci to calculate this.
But what series? So you calculate the first few items:
ghci> let calcThing i = floor ((3 + sqrt 5) ^ i)
ghci> map calcThing [0..5]
[1,5,27,143,751,3935]
Guessing that the formula is of the form:
thing_n = a*thing_{n-1} + b*thing_{n-2}
We have:
27 = a*5 + b*1
143 = a*27 + b*5
We solve the set of linear equations and get:
thing_n = 4*thing_{n-1} + 7*thing_{n-2} (a = 4, b = 7)
We check:
ghci> let thing = 1 : 5 : zipWith (+) (map (* 4) (tail thing)) (map (* 7) thing)
ghci> take 10 thing
[1,5,27,143,761,4045,21507,114343,607921,3232085]
ghci> map calcThing [0..9]
[1,5,27,143,751,3935,20607,107903,564991,2958335]
Then we find out that sadly this does not compute our function. But then we are cheered by the fact that it gets the right-most digit right. Not understanding why, but encouraged by this fact, we try something similar: finding the parameters for a modified formula:
thing_n = a*thing_{n-1} + b*thing_{n-2} + c
We then arrive at:
thing_n = 6*thing_{n-1} - 4*thing_{n-2} + 1
We check it:
ghci> let thing = 1 : 5 : map (+1) (zipWith (+) (map (*6) (tail thing)) (map (* negate 4) thing))
ghci> take 16 thing == map calcThing [0..15]
True

Just to give an answer to a very old question:
Thanks to yairchu I got the idea to reread the proof of Binet's formula on the Wikipedia page. It's not really that clear there, but we can work with it.
On the Wikipedia page there is a closed form with 'computation by rounding': F_n = round(φ^n/√5).
Suppose we could replace φ^n/√5 with (3 + √5)^n (call 3 + √5 x). Then we could compute the floor of x^n fairly easily, especially mod 1000, by finding the nth term of our freshly constructed sequence (this is the analogue of F; later we will call this analogue U).
What sequence are we looking for? Well, we'll try following the proof of Binet's formula. We need a quadratic equation with x as a root. Let's say x^2 = 6x - 4; this one has roots x and y := 3 - √5. The handy part is now:
Define U_n (for every a and b) such that:
U_n = a x^n + b y^n
By the definition of x and y you can see that
U_n = 6 U_{n-1} - 4 U_{n-2}
Now we can choose a and b freely. We need the U_n to be integers, so I propose choosing a=b=1. Now U_0 = 2, U_1 = 6, U_2 = 28...
We still need to get our 'computation by rounding'. You can see that 0 < y^n < 1 for every n ≥ 1 (because y ≅ 0.76 < 1), so U_n = x^n + y^n = ⌈x^n⌉.
If we can compute U_n we can find ⌊x^n⌋: just subtract 1.
We could compute U_n by its recursive formula, but that would require O(n) computation time. We can do better!
For computing such a recursive formula we can use matrices:
⌈ 0  1⌉ ⌈U(n-1)⌉   ⌈ U(n) ⌉
⌊-4  6⌋ ⌊ U(n) ⌋ = ⌊U(n+1)⌋
Call this matrix M. Now M*(U(1), U(2)) computes (U(2), U(3)).
Now we can compute P = M^(n-1) (notice that I use one less than n; you can see that this is right if you test the small cases: n=0, n=1, n=2). P*(6,28) now gives us the nth and (n+1)th terms of our sequence, so:
(P*(6,28))_0 - 1 = ⌊x^n⌋
Now we can take everything mod 1000 (this simplifies the calculations a lot) and we get the desired result in computation time O(log(n)) (or even better with the computational wonders of powers of matrices over a finite ring). This explains the very weird-looking solution, I guess.
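The explanation above can be turned into a short program. This is a sketch under the definitions used here (U(1) = 6, U(2) = 28, M = [[0, 1], [-4, 6]]); the helper names `mat_mult`, `mat_pow` and `last_three_digits` are made up for the illustration, and n = 0 is handled separately because M^(n-1) is not available modulo 1000 in that case.

```python
def mat_mult(a, b, mod):
    """Multiply two 2x2 matrices modulo mod."""
    return [[(a[i][0] * b[0][j] + a[i][1] * b[1][j]) % mod for j in range(2)]
            for i in range(2)]


def mat_pow(m, exponent, mod):
    """Raise a 2x2 matrix to a non-negative power by repeated squaring."""
    result = [[1, 0], [0, 1]]  # identity matrix
    base = [[v % mod for v in row] for row in m]
    while exponent > 0:
        if exponent & 1:
            result = mat_mult(result, base, mod)
        base = mat_mult(base, base, mod)
        exponent >>= 1
    return result


def last_three_digits(n):
    """Last three digits before the decimal point of (3 + sqrt(5))**n."""
    if n == 0:
        return "001"  # (3 + sqrt(5))**0 = 1
    p = mat_pow([[0, 1], [-4, 6]], n - 1, 1000)
    u_n = (6 * p[0][0] + 28 * p[0][1]) % 1000  # first entry of P*(6, 28)
    return "%03d" % ((u_n - 1) % 1000)


print(last_three_digits(2), last_three_digits(5))
```

For n = 2 and n = 5 this produces 027 and 935, the two sample answers from the problem statement. The repeated squaring is what gives the O(log(n)) running time.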

I don't know how to explain that, but the author of the problem has composed this analysis.
