Eligibility trace algorithm, the update order - algorithm

I am reading Silver et al. (2012), "Temporal-Difference Search in Computer Go", and trying to understand the update order in the eligibility trace algorithm.
In Algorithms 1 and 2 of the paper, the weights are updated before the eligibility trace is updated. I wonder whether this order is correct (lines 11 and 12 of Algorithm 1, and lines 12 and 13 of Algorithm 2).
Consider the extreme case lambda=0: the parameters are not updated for the initial state-action pair, since e is still 0 at that point. So I suspect the order should be the opposite.
Can someone clarify this point?
I find the paper very instructive for learning the reinforcement learning area, so would like to understand the paper in detail.
If there is a more suitable platform to ask this question, please kindly let me know as well.

It looks to me like you're correct, e should be updated before theta. That's also what should happen according to the math in the paper. See, for example, Equations (7) and (8), where e_t is first computed using phi(s_t), and only THEN is theta updated using delta V_t (which would be delta Q in the control case).
Note that what you wrote about the extreme case with lambda=0 is not entirely correct. The initial state-action pair will still be involved in an update (not in the first iteration, but they will be incorporated in e during the second iteration). However, it looks to me like the very first reward r will never be used in any updates (because it only appears in the very first iteration, where e is still 0). Since this paper is about Go, I suspect it will not matter though; unless they're doing something unconventional, they probably only use non-zero rewards for the terminal game state.
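To make the difference between the two orderings concrete, here is a minimal sketch of one episode of linear TD(lambda) in Python. It is not the paper's Algorithm 1 or 2: the function name, the two-state toy episode, the step size and the use of gamma*lambda as the trace decay are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def td_lambda_episode(features, rewards, alpha=0.1, gamma=1.0, lam=0.0,
                      trace_first=True):
    """One episode of linear TD(lambda); returns the weight vector theta.

    features[t] is phi(s_t); rewards[t] is the reward received on the
    transition from s_t to s_{t+1}. Hypothetical helper, for illustration.
    """
    n = len(features[0])
    theta = [0.0] * n
    e = [0.0] * n
    for t, r in enumerate(rewards):
        phi, phi_next = features[t], features[t + 1]
        delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)
        if trace_first:
            # Order suggested in the answer: trace first, then weights.
            e = [gamma * lam * ei + p for ei, p in zip(e, phi)]
        theta = [th + alpha * delta * ei for th, ei in zip(theta, e)]
        if not trace_first:
            # Order questioned above: weights first, then trace.
            e = [gamma * lam * ei + p for ei, p in zip(e, phi)]
    return theta

# Toy episode: s0 -> s1 -> terminal, with a reward of 1 on the first step only.
FEATURES = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
REWARDS = [1.0, 0.0]

trace_then_weights = td_lambda_episode(FEATURES, REWARDS, trace_first=True)
weights_then_trace = td_lambda_episode(FEATURES, REWARDS, trace_first=False)
```

With lambda=0 on this toy episode, updating the trace first gives theta = [0.1, 0.0], while updating the weights first leaves theta at [0.0, 0.0]: the first reward is multiplied by a trace that is still zero and is never used, which matches the observation above.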

Related

How to improve the performance of an alpha-beta minimax method in Ruby?

I’ve been a programming student for about six months. I’d been wanting to write a chess game for a long time, and finally did it. I’m very happy with the result; however, there’s one point I don’t know how to address. The AI is based on an alpha-beta pruning #minimax method that chooses the best move on the basis of the best possible outcome for the current player, with a default depth of 3, which means the computer can think ahead 3 turns. The computer does choose the correct move, but the method in its current implementation is very slow.
#provisional makes and 'unmakes' a possible move, and returns the value returned from the code block. The evaluation function #evaluate is very simple: it’s just the sum of the material value of the pieces and their location values according to how they are placed on the board.
I’d really appreciate some light here, as I don’t know how to get a faster version of this method.
Thank you so much for your time.
This is the method:
def minimax(move, depth, alpha, beta, maximizing_player)
  return board.evaluate if depth.zero?
  board.provisional(move, color) do
    if maximizing_player
      best_minimizing_evaluation = Float::INFINITY
      board.generate_moves(:black).each do |possible_move|
        evaluation = minimax(possible_move, depth - 1, alpha, beta, false)
        best_minimizing_evaluation = [best_minimizing_evaluation, evaluation].min
        beta = [beta, evaluation].min
        break if beta <= alpha
      end
      best_minimizing_evaluation
    else
      best_maximizing_evaluation = -Float::INFINITY
      board.generate_moves(:white).each do |possible_move|
        evaluation = minimax(possible_move, depth - 1, alpha, beta, true)
        best_maximizing_evaluation = [best_maximizing_evaluation, evaluation].max
        alpha = [alpha, evaluation].max
        break if beta <= alpha
      end
      best_maximizing_evaluation
    end
  end
end
With an initial depth value of 3, the method takes between 15 and 50 seconds to return the chosen move; this is a lot, and makes the game barely enjoyable. With a depth of 2, the times are more reasonable, about a third of the previous ones, but I’d really like to keep the depth at 3. With a depth of 1, of course, it takes less than a second.
I realize that some improvements can be made, but I don’t know how to make them:
Will a negamax version of this method significantly improve its performance?
I’m aware of tail recursion optimization; however, is it possible in this case? I don’t know whether it applies to a tree-generating method like this.
I’ve been told that pre-sorting the moves before they are evaluated can improve the performance of alpha-beta minimax, but how can I sort the moves? I can’t sort them by the immediate outcome because, for instance, sometimes it’s worth sacrificing a piece to win a better position. I could sort them by the best possible outcome in 2 turns... but then I’d be computing twice, once for the sorting of the moves and once for the actual move evaluation.
Is implementing a transposition table worth it? I mean, there are tons of potential positions, and they can easily make a file really big. For example, in a day, the program generated a 100 MB text file, and I didn’t notice a huge performance improvement, as the computer doesn’t always make the same moves, and neither do I. The different positions in a chess game are innumerable, unlike in a game like Tic Tac Toe.
Thank you so much.
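On the move-ordering point, a small experiment on a toy game tree shows why ordering matters for alpha-beta. This is only a sketch in Python, not the Ruby engine above: the tree, the `static_guess` heuristic and the leaf counter are all made up for illustration. In a chess engine, common cheap orderings are things like trying captures first or reusing the best move from a shallower search, so no full extra search is needed just to sort.

```python
import math

def static_guess(node):
    """Cheap stand-in for a static evaluation: follow first children to a leaf."""
    while isinstance(node, list):
        node = node[0]
    return node

def alphabeta(node, alpha, beta, maximizing, stats, ordered):
    if not isinstance(node, list):
        stats["leaves"] += 1          # count how many leaves get evaluated
        return node
    children = node
    if ordered:
        # Most promising child first: highest guess for max, lowest for min.
        children = sorted(node, key=static_guess, reverse=maximizing)
    best = -math.inf if maximizing else math.inf
    for child in children:
        value = alphabeta(child, alpha, beta, not maximizing, stats, ordered)
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if beta <= alpha:
            break                     # cutoff: remaining siblings cannot matter
    return best

def search(tree, ordered):
    stats = {"leaves": 0}
    value = alphabeta(tree, -math.inf, math.inf, True, stats, ordered)
    return value, stats["leaves"]

# Worst case for the unordered search: the best reply is examined last.
TREE = [[2, 3, 4], [5, 6, 7], [8, 9, 10]]
unordered = search(TREE, ordered=False)
with_ordering = search(TREE, ordered=True)
```

On this toy tree both searches return the same value (8), but the unordered search evaluates 9 leaves and the ordered one only 5. The gap grows quickly with depth and branching factor.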

Modelica events and hybrid modelling

I would like to understand the general idea behind hybrid modelling (in particular state events) from a numerical point of view (although I am not a mathematician :)). Given the following Modelica model:
model BouncingBall
  constant Real g=9.81;
  Real h(start=1);
  Real v(start=0);
equation
  der(h)=v;
  der(v)=-g;
algorithm
  when h < 0 then
    reinit(v,-pre(v));
  end when;
end BouncingBall;
I understand the concept of when and reinit.
The equations in the when statement are only active when the condition becomes true, right?
Let's assume that the ball would hit the floor at exactly 2 sec. Since I am using a multi-step solver, does that mean that the solver "goes beyond 2 seconds" and only then recognizes that h<0 (let's assume at simulation time = 2.5 sec, h = -0.7)? What does "The time for the event is searched using a crossing function" mean? Is there a simple explanation (example)?
Is the solver now going back? Taking a smaller step-size?
What does the pre() operation mean in that context?
noEvent(): "Expressions are taken literally instead of generating crossing functions. Since there is no crossing function, there is no requirement that the expression can be evaluated beyond the event limit": What does that mean? Given the same example with the bouncing ball: the solver detects at time 2.5 that h = -0.7. What's the difference between with and without noEvent()?
Yes, the body of when is only executed at events.
Simple view: The solver takes steps, and then uses a continuous extension to generate a (smooth) interpolation formula for the previous step. That interpolation formula can be used to generate a plot, and also for finding the first point where h has crossed zero (likely 2.000000001). An event iteration is then done at that interpolated point - and afterwards the solver is restarted.
I wouldn't say that the solver goes back. It takes a partial step and then continues forward. Some solvers need to reduce the step-size a lot after the event - others don't.
pre(x) is set to the value of x before the event.
noEvent(h<0) basically means: evaluate the expression as written, without all the bells and whistles of crossing functions. You cannot use "when noEvent(h<0) then".
There are some additional points:
If you are familiar with Sturm-sequences or control theory you might realize that it is not necessary to interpolate a formula to determine if it crossed zero or not in an interval (and some tools use that). The fact that the function is not necessarily smooth makes it a bit more complicated, and also means that derivative-tests cannot be used.
How much the solver is reset depends on the kind of solver. One-step solvers (Runge-Kutta) can be restarted directly as if virtually nothing happened, whereas multi-step solvers (BDF/Adams - such as dassl/lsodar/cvode) need to start with lower order and smaller step-size.
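The "take a step, notice the sign change, locate the crossing, restart" idea can be sketched in a few lines of Python. This is not how any particular Modelica tool is implemented: the fixed step size, the bisection search and all function names are assumptions. The within-step formula here happens to be exact (constant acceleration), whereas a real solver would use its interpolation formula for the last step.

```python
import math

G = 9.81

def step(h, v, dt):
    """Advance the free-fall state by dt (exact for constant acceleration)."""
    return h + v * dt - 0.5 * G * dt * dt, v - G * dt

def locate_crossing(h, v, dt, tol=1e-12):
    """Bisect on the crossing function z(tau) = h(tau) inside one step."""
    lo, hi = 0.0, dt                  # z(lo) >= 0 and z(hi) < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if step(h, v, mid)[0] >= 0.0:
            lo = mid
        else:
            hi = mid
    return hi                         # first instant at which h < 0 holds

def simulate(t_end=1.0, dt=0.1, h=1.0, v=0.0):
    t, events = 0.0, []
    while t < t_end - 1e-12:
        h_new, v_new = step(h, v, dt)
        if h >= 0.0 and h_new < 0.0:
            # The step overshot the floor: take a partial step to the event.
            tau = locate_crossing(h, v, dt)
            h_event, pre_v = step(h, v, tau)   # pre_v plays the role of pre(v)
            events.append(t + tau)
            h, v = h_event, -pre_v             # reinit(v, -pre(v))
            t += tau                           # restart from the event time
        else:
            h, v, t = h_new, v_new, t + dt
    return events, h, v

events, h_final, v_final = simulate()
```

With a step of 0.1 s the solver first sees h < 0 at t = 0.5 s, but the event is placed at about 0.4515 s, which is sqrt(2/g) for a drop from h = 1. Integration then continues forward from that point with the reversed velocity.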

Lights Out game solving method "reduced row echelon form": does it always give the correct result?

I am studying the algorithm given here, and
somewhere it is claimed that it is efficient and always gives the correct result.
But when I run the algorithm, it does not give me correct or efficient output for the following patterns.
For a 5 x 5 grid, where (n) is the light number and the 0/1 after it states whether the light is on or off (1 = ON, 0 = OFF):
(1)1 (2)0 (3)0 (4)0 (5)0
(6)0 (7)1 (8)0 (9)0 (10)0
(11)0 (12)0 (13)1 (14)0 (15)0
(16)0 (17)0 (18)0 (19)1 (20)0
(21)0 (22)0 (23)0 (24)0 (25)1
The output should be 1,7,13,19,25 (pressing these lights will turn the full grid OFF). But what I am getting is this: 3,5,6,7,8,10,13,16,18,19,20,21,23.
For some patterns, however, it gives me the correct output, as below.
(1)0 (2)0 (3)0 (4)0 (5)1
(6)0 (7)0 (8)0 (9)1 (10)0
(11)0 (12)0 (13)1 (14)0 (15)0
(16)0 (17)1 (18)0 (19)0 (20)0
(21)1 (22)0 (23)0 (24)0 (25)0
The output should be 5,9,13,17,21, and the algorithm gives me the correct result.
If somebody needs the code, let me know and I can post it.
Can somebody please let me know whether this method will always give a correct as well as efficient result?
(I'm the author of the code you linked to.) To the best of my knowledge, the code is correct (and I'm sure that the high-level algorithm of using Gaussian elimination over GF(2) is correct). The solution it produces is guaranteed to solve the puzzle, though it's not necessarily the minimal number of button presses. The "efficiency" I was referring to in the writeup refers to the time complexity of solving the puzzle overall (it can solve a Lights Out grid in polynomial time, as opposed to the exponential-time brute-force solution of trying all possible combinations) rather than to the "efficiency" of the generated solution.
I actually don't know any efficient algorithms for finding a solution requiring the minimum number of button presses. Let me know if you find one!
Hope this helps!
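For readers who want to experiment, here is a minimal Python sketch of Gaussian elimination over GF(2) for Lights Out. It is not the linked code: the function names and the choice of setting every free variable to 0 are assumptions. It shows why the result is valid but not necessarily minimal: any solution plus a null-space vector of the toggle matrix is also a solution, and elimination simply returns one of them.

```python
def toggle_matrix(n):
    """rows[i][j] == 1 when pressing button j flips light i (plus-shaped)."""
    size = n * n
    rows = [[0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    rows[rr * n + cc][r * n + c] = 1
    return rows

def solve_lights_out(lights, n):
    """Return a 0/1 press list that turns every light off, or None."""
    size = n * n
    aug = [row + [b] for row, b in zip(toggle_matrix(n), lights)]
    pivots, rank = [], 0
    for col in range(size):
        pivot = next((i for i in range(rank, size) if aug[i][col]), None)
        if pivot is None:
            continue                          # free variable, no pivot here
        aug[rank], aug[pivot] = aug[pivot], aug[rank]
        for i in range(size):
            if i != rank and aug[i][col]:
                # Row addition over GF(2) is XOR.
                aug[i] = [a ^ b for a, b in zip(aug[i], aug[rank])]
        pivots.append(col)
        rank += 1
    if any(aug[i][size] for i in range(rank, size)):
        return None                           # 0 = 1 row: pattern unsolvable
    presses = [0] * size
    for i, col in enumerate(pivots):
        presses[col] = aug[i][size]           # free variables stay at 0
    return presses

def apply_presses(lights, presses, n):
    a = toggle_matrix(n)
    size = n * n
    return [light ^ (sum(a[i][j] & presses[j] for j in range(size)) & 1)
            for i, light in enumerate(lights)]

# The diagonal pattern from the question (lights 1, 7, 13, 19, 25 are ON).
diagonal = [1 if i // 5 == i % 5 else 0 for i in range(25)]
solution = solve_lights_out(diagonal, 5)
after = apply_presses(diagonal, solution, 5)
```

Applying the returned presses turns the whole grid off, even though the press list may be longer than the five-press answer expected in the question. When the null space is small, one way to get a shorter answer is to add each null-space combination to the solution and keep the one with the fewest presses.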

Finding a value of a variant in a permutation equation [closed]

I have a math problem that I can't solve: I don't know how to find the value of n so that
365! / ((365-n)! * 365^n) = 50%.
I am using a Casio 500ms scientific calculator, but I don't know how to do this with it.
Sorry if my question is too easy; I am changing careers, so I have to review and upgrade my math, a subject that I have neglected for years.
One COULD in theory use a root-finding scheme like Newton's method, IF you could take derivatives. But this function is defined only on the integers, since it uses factorials.
One way out is to recognize the identity
n! = gamma(n+1)
which will effectively allow you to extend the function onto the real line. The gamma function is defined on the positive real line, though it does have singularities at zero and at the negative integers. And of course, you still need the derivative of this expression, which can be computed, since gamma is differentiable.
By the way, a danger with methods like Newton's method on problems like this is it may still diverge into the negative real line. Choose poor starting values, and you may get garbage out. (I've not looked carefully at the shape of this function, so I won't claim for what set of starting values it will diverge on you.)
Is it worth jumping through the above set of hoops? Of course not. A better choice than Newton's method might be something like Brent's algorithm, or a secant method, which here will not require you to compute the derivative. But even that is a waste of effort.
Recognizing that this is indeed a problem on the integers, one could use a tool like bisection to resolve the solution extremely efficiently. It never requires derivatives, and it will work nicely enough on the integers. Once you have resolved the interval to be as short as possible, the algorithm will terminate, taking very few function evaluations in the process.
Finally, be careful with this function, as it involves some rather large factorials, which can easily overflow in many tools. For example, in MATLAB, if I try to evaluate factorial(365):
factorial(365)
ans =
Inf
I get an overflow. I would need to move into a tool like the symbolic toolbox, or my own suite of variable precision integer tools. Alternatively, one could recognize that many of the terms in these factorials will cancel out, so that
365! / (365 - n)! = 365*(365-1)*(365-2)*...*(365-n+1)
The point is, we get an overflow for such a large value if we are not careful. If you have a tool that will not overflow, then use it, and use bisection as I suggested. Here, using the symbolic toolbox in MATLAB, I get a solution using only 7 function evaluations.
f = @(n) vpa(factorial(sym(365))/(factorial(sym(365 - n))*365^sym(n)));
f(0)
ans =
1.0
f(365)
ans =
1.4549552156187034033714015903853e-157
f(182)
ans =
0.00000000000000000000000095339164972764493041114884521295
f(91)
ans =
0.000004634800180846641815683109605743
f(45)
ans =
0.059024100534225072005461014516788
f(22)
ans =
0.52430469233744993108665513602619
f(23)
ans =
0.49270276567601459277458277166297
Or, if you can't take an option like that, but do have a tool that can evaluate the log of the gamma function, AND you have a rootfinder available as MATLAB does...
f = @(n) exp(gammaln(365+1) - gammaln(365-n + 1) - n*log(365));
fzero(@(n) f(n) - .5,10)
ans =
22.7677
As you can see here, I used the identity relating gamma and the factorial function, then used the log of the gamma function, in MATLAB, gammaln. Once all the dirty work was done, then I exponentiated the entire mess, which will be a reasonable number. Fzero tells us that the cross-over occurs between 22 and 23.
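If you have neither MATLAB nor a symbolic toolbox, the same two ideas (cancel the factorials, then bisect on the integers) fit in a few lines of Python. This is just a sketch; the function names are my own. The product form never builds a large factorial, so ordinary floating point is enough.

```python
def no_shared_birthday(n, days=365):
    """365! / ((365-n)! * 365^n), computed as a running product of n terms."""
    p = 1.0
    for k in range(n):
        p *= (days - k) / days
    return p

def bracket_crossing(target=0.5, days=365):
    """Integer bisection: returns (lo, hi) with f(lo) >= target > f(hi), hi = lo + 1."""
    lo, hi = 0, days                  # f(0) = 1 and f(365) is essentially 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if no_shared_birthday(mid, days) >= target:
            lo = mid
        else:
            hi = mid
    return lo, hi

crossing = bracket_crossing()
```

This brackets the crossing between n = 22 and n = 23, in agreement with the values of f(22) and f(23) shown above, so no integer n gives exactly 50%.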
If a numerical approximation is ok, ask Wolfram Alpha:
n ~= -22.2298272...
n ~= 22.7676903...
I'm going to assume you have some special reason for wanting an actual algorithm, even though you only have one specific problem to solve.
You're looking for a value n where...
365! / ((365-n)! * 365^n) = 0.5
And therefore...
(365! / ((365-n)! * 365^n)) - 0.5 = 0.0
The general form of the problem is to find a value x such that f(x)=0. One classic algorithm for this kind of thing is the Newton-Raphson method.
[EDIT - as woodchips points out in the comment, the factorial is an integer-only function. My defence - for some problems (the birthday problem among them) it's common to generalise using approximation functions. I remember the Stirling approximation of factorials being used for the birthday problem - according to this, Knuth uses it. The Wikipedia page for the Birthday problem mentions several approximations that generalise to non-integer values.
It's certainly bad that I didn't think to mention this when I first wrote this answer.]
One problem with that is that you need the derivative of that function. That's more a mathematics issue, though you can estimate the derivative at any point by taking values a short distance either side.
You can also look at this as an optimisation problem. The general form of optimisation problems is to find a value x such that f(x) is maximised/minimised. In your case, with n playing the role of x, you could define your function as...
f(n)=((365! / ((365-n)! * 365^n)) - 0.5)^2
Because of the squaring, the result can never be negative, so try to minimise. Whatever value of n gets you the smallest f(n) will also give you the result you want.
There isn't so much an algorithm for optimisation problems as a whole field - the method you use depends on the complexity of your function. However, this case should be simple so long as your language can cope with big numbers. Probably the simplest optimisation algorithm is called hill-climbing, though in this case it should probably be called rolling-down-the-hill. And as luck would have it, Newton-Raphson is a hill-climbing method (or very close to being one - there may be some small technicality that I don't remember).
[EDIT as mentioned above, this won't work if you need an integer solution for the problem as actually stated (rather than a real-valued approximation). Optimisation in the integer domain is one of those awkward issues that helps make optimisation a field in itself. The branch and bound is common for complex functions. However, in this case hill-climbing still works. In principle, you can even still use a tweaked version of Newton-Raphson - you just have to do some rounding and check that you don't keep rounding back to the same place you started if your moves are small.]
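As a rough illustration of the Newton-Raphson idea with an estimated derivative, here is a Python sketch. It assumes the factorials are generalised through the log-gamma function (n! = gamma(n+1)) so that the function is defined for non-integer n; the step h, the tolerance and the starting point 10 are arbitrary choices of mine.

```python
import math

def f(n, days=365):
    """365! / ((365-n)! * 365^n) - 0.5, extended to real n via log-gamma."""
    log_p = (math.lgamma(days + 1) - math.lgamma(days - n + 1)
             - n * math.log(days))
    return math.exp(log_p) - 0.5

def newton(func, x, h=1e-5, tol=1e-10, max_iter=50):
    """Newton-Raphson using a central-difference estimate of the derivative."""
    for _ in range(max_iter):
        fx = func(x)
        if abs(fx) < tol:
            return x
        # Slope from values a short distance either side of x.
        slope = (func(x + h) - func(x - h)) / (2 * h)
        x -= fx / slope
    raise RuntimeError("Newton-Raphson did not converge")

root = newton(f, 10.0)
```

Starting from 10, this settles near n = 22.77 after a handful of iterations. For the problem as stated you would then compare the neighbouring integers 22 and 23. A poor starting value can still send the iteration somewhere unhelpful, so this is not a robust solver.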

How can I use TDD to solve a puzzle with an unknown answer?

Recently I wrote a Ruby program to determine solutions to a "Scramble Squares" tile puzzle.
I used TDD to implement most of it, leading to tests that looked like this:
it "has top, bottom, left, right" do
  c = Cards.new
  card = c.cards[0]
  card.top.should == :CT
  card.bottom.should == :WB
  card.left.should == :MT
  card.right.should == :BT
end
This worked well for the lower-level "helper" methods: identifying the "sides" of a tile, determining if a tile can be validly placed in the grid, etc.
But I ran into a problem when coding the actual algorithm to solve the puzzle. Since I didn't know valid possible solutions to the problem, I didn't know how to write a test first.
I ended up writing a pretty ugly, untested, algorithm to solve it:
def play_game
  working_states = []
  after_1 = step_1
  i = 0
  after_1.each do |state_1|
    step_2(state_1).each do |state_2|
      step_3(state_2).each do |state_3|
        step_4(state_3).each do |state_4|
          step_5(state_4).each do |state_5|
            step_6(state_5).each do |state_6|
              step_7(state_6).each do |state_7|
                step_8(state_7).each do |state_8|
                  step_9(state_8).each do |state_9|
                    working_states << state_9[0]
                  end
                end
              end
            end
          end
        end
      end
    end
  end
end
So my question is: how do you use TDD to write a method when you don't already know the valid outputs?
If you're interested, the code's on GitHub:
Tests: https://github.com/mattdsteele/scramblesquares-solver/blob/master/golf-creator-spec.rb
Production code: https://github.com/mattdsteele/scramblesquares-solver/blob/master/game.rb
This isn't a direct answer, but this reminds me of the comparison between the Sudoku solvers written by Peter Norvig and Ron Jeffries. Ron Jeffries' approach used classic TDD, but he never really got a good solution. Norvig, on the other hand, was able to solve it very elegantly without TDD.
The fundamental question is: can an algorithm emerge using TDD?
From the puzzle website:
The object of the Scramble Squares® puzzle game is to arrange the nine colorfully illustrated square pieces into a 12" x 12" square so that the realistic graphics on the pieces' edges match perfectly to form a completed design in every direction.
So one of the first things I would look for is a test of whether two tiles, in a particular arrangement, match one another. This is with regard to your question of validity. Without that method working correctly, you can't evaluate whether the puzzle has been solved. That seems like a nice starting point, a nice bite-sized piece toward the full solution. It's not an algorithm yet, of course.
Once match() is working, where do we go from here? Well, an obvious solution is brute force: from the set of all possible arrangements of the tiles within the grid, reject those where any two adjacent tiles don't match. That's an algorithm, of sorts, and it's pretty certain to work (although in many puzzles the heat death of the universe occurs before a solution).
How about collecting the set of all pairs of tiles that match along a given edge (LTRB)? Could you get from there to a solution, quicker? Certainly you can test it (and test-drive it) easily enough.
The tests are unlikely to give you an algorithm, but they can help you to think about algorithms, and of course they can make validating your approach easier.
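One way to test a solver without knowing the answer in advance is to test the property "every adjacent pair of edges matches" instead of a specific arrangement, and to build a scrambled puzzle from a grid that is solved by construction. The following Python sketch shows the idea. It is not the code from the repository: representing a tile as a (top, right, bottom, left) tuple of signed integers that match when they sum to zero, and every function name, are assumptions for illustration.

```python
import random

EDGES = [-4, -3, -2, -1, 1, 2, 3, 4]   # two halves of a picture are k and -k

def rotations(tile):
    """All four rotations of a (top, right, bottom, left) tile."""
    return [tile[i:] + tile[:i] for i in range(4)]

def fits(grid, tile, n):
    """Can tile go in the next free cell of a row-major, partially filled grid?"""
    r, c = divmod(len(grid), n)
    if c > 0 and grid[-1][1] + tile[3] != 0:     # left neighbour's right edge
        return False
    if r > 0 and grid[-n][2] + tile[0] != 0:     # upper neighbour's bottom edge
        return False
    return True

def solve(tiles, n, grid=None, used=None):
    """Backtracking search; replaces one nested loop per cell with recursion."""
    grid = [] if grid is None else grid
    used = set() if used is None else used
    if len(grid) == n * n:
        return list(grid)
    for i, tile in enumerate(tiles):
        if i in used:
            continue
        for rotated in rotations(tile):
            if fits(grid, rotated, n):
                used.add(i)
                grid.append(rotated)
                result = solve(tiles, n, grid, used)
                if result is not None:
                    return result
                grid.pop()
                used.remove(i)
    return None

def is_solution(grid, n):
    """Independent check used by the test: every shared edge must match."""
    for r in range(n):
        for c in range(n):
            tile = grid[r * n + c]
            if c + 1 < n and tile[1] + grid[r * n + c + 1][3] != 0:
                return False
            if r + 1 < n and tile[2] + grid[(r + 1) * n + c][0] != 0:
                return False
    return True

def make_puzzle(n, seed=0):
    """Build a solved grid, then rotate and shuffle its tiles."""
    rng = random.Random(seed)
    grid = []
    for r in range(n):
        for c in range(n):
            top = -grid[(r - 1) * n + c][2] if r > 0 else rng.choice(EDGES)
            left = -grid[-1][1] if c > 0 else rng.choice(EDGES)
            grid.append((top, rng.choice(EDGES), rng.choice(EDGES), left))
    tiles = [rng.choice(rotations(tile)) for tile in grid]
    rng.shuffle(tiles)
    return tiles

tiles = make_puzzle(3)
found = solve(tiles, 3)
```

The test then asserts that `is_solution(found, 3)` holds and that `found` uses each input tile exactly once. The test does not care which solution comes back, so the solver can be written test-first and refactored freely.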
dunno if this "answers" the question either
analysis of the "puzzle"
9 tiles
each has 4 sides
each tile has half a pattern / picture
BRUTE FORCE APPROACH
to solve this problem
you need to generate 9! combinations ( 9 tiles X 8 tiles X 7 tiles... )
limited by the number of matching sides to the current tile(s) already in place
CONSIDERED APPROACH
Q How many sides are different?
IE how many matches are there?
therefore 9 X 4 = 36 sides / 2 ( since each side "must" match at least 1 other side )
otherwise it's an uncompletable puzzle
NOTE: at least 12 must match "correctly" for a 3 X 3 puzzle
label each matching side of a tile using a unique letter
then build a table holding each tile
you will need 4 entries into the table for each tile
4 sides ( corners ) hence 4 combinations
if you sort the table by side and INDEX into the table
side,tile_number
ABcd tile_1
BCda tile_1
CDab tile_1
DAbc tile_1
using the table should speed things up
since you should only need to match 1 or 2 sides at most
this limits the amount of NON PRODUCTIVE tile placing it has to do
depending on the design of the pattern / picture
there are 3 combinations ( orientations ) since each tile can be placed using 3 orientations
- the same ( multiple copies of the same tile )
- reflection
- rotation
God help us if they decide to make life very difficult
by putting similar patterns / pictures on the other side that also need to match
OR even making the tiles into cubes and matching 6 sides!!!
Using TDD,
you would write tests and then code to solve each small part of the problem,
as outlined above and write more tests and code to solve the whole problem
NO, it's not easy; you need to sit down and write tests and code to practice
NOTE: this is a variation of the map colouring problem
http://en.wikipedia.org/wiki/Four_color_theorem

Resources