In looking through the dynamic programming algorithm for computing the minimum edit distance between two strings, I am having a hard time grasping one thing. To me it seems that, given two strings s and t, inserting a character into s would be the same as deleting a character from t. Why then do we need to consider these operations separately when computing the edit distance? I always have a hard time computing the indices in the recurrence relation because I can't intuitively understand this part.
I've read through Skiena and some other sources, but none of them explains this part well. This SO link explains the insert and delete operations better than elsewhere in terms of understanding which string is being inserted into or deleted from, but I still can't figure out why they aren't one and the same.
Edit: Ok, I didn't do a very good job of detailing the source of my confusion.
The way Skiena explains computing the minimum edit distance m(i,j) of the first i characters of a string s and the first j characters of a string t, based on already having computed solutions to the subproblems, is as follows. m(i,j) will be the minimum of the following 3 possibilities:
opt[MATCH] = m[i-1][j-1].cost + match(s[i],t[j]);
opt[INSERT] = m[i][j-1].cost + indel(t[j]);
opt[DELETE] = m[i-1][j].cost + indel(s[i]);
The way I understand it, the 3 operations are all operations on the string s. An INSERT means you have to insert a character at the end of string s to get the minimum edit distance. A DELETE means you have to delete the character at the end of string s to get the minimum edit distance.
Given s = "SU" and t = "SATU" INSERT and DELETE would be as follows:
Insert:
SU_
SATU
Delete:
SU
SATU_
My confusion was that an INSERT into s is the same as a DELETION from t. I'm probably confused on something basic but it's not intuitive to me yet.
Edit 2: I think this link kind of clarifies my confusion but I'd love an explanation given my specific questions above.
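For reference, the recurrence can be written out as a full table-filling routine. This is a sketch in Python with unit costs; the function name and cost choices are mine, not Skiena's:

```python
def edit_distance(s, t):
    # m[i][j] = minimum edits to turn the first i chars of s into the first j chars of t
    m = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        m[i][0] = i                      # delete all i characters of s
    for j in range(1, len(t) + 1):
        m[0][j] = j                      # insert all j characters of t
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            match = m[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else 1)
            insert = m[i][j - 1] + 1     # insert t[j] at the end of (a prefix of) s
            delete = m[i - 1][j] + 1     # delete s[i] from the end of (a prefix of) s
            m[i][j] = min(match, insert, delete)
    return m[len(s)][len(t)]
```

For s = "SU", t = "SATU" this gives 2, via two INSERTs into s.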
They aren't the same thing any more than < and > are the same thing. There is of course a sort of duality, and you are correct to point it out: a < b if and only if b > a, so if you have a good algorithm to test for b > a, then it makes sense to use it when you need to test whether a < b.
It is much easier to directly test if s can be obtained from t by deletion rather than to directly test if t can be obtained from s by insertion. It would be silly to randomly insert letters to s and see if you get t. I can't imagine that any implementation of edit-distance actually does that. Still, it doesn't mean that you can't distinguish between insertion and deletion.
More abstractly: there is a relation R on any set of strings, defined by
s R t <=> t can be obtained from s by insertion
Deletion is the inverse relation. Closely related, but not the same.
The problem of edit distance can be restated as a problem of converting the source string into target string with minimum number of operations (including insertion, deletion and replacement of a single character).
Thus, in the process of converting a source string into a target string, if inserting a character from target string or deleting a character from the source string or replacing a character in the source string with a character from the target string yields the same (minimum) edit distance, then, well, all the operations can be said to be equivalent. In other words, it does not matter how you arrive at the target string as long as you have done minimum number of edits.
This is realized by looking at how the cost matrix is calculated. Consider a simpler problem where source = AT (represented vertically) and target = TA (represented horizontally). The matrix is then constructed as (coming from west, northwest, north in that order):
  | ε |        T         |        A         |
ε | 0 |        1         |        2         |
A | 1 | min(2, 1, 2) = 1 | min(2, 1, 3) = 1 |
T | 2 | min(3, 1, 2) = 1 | min(2, 2, 2) = 2 |
The idea of filling this matrix is:
If we moved east, we insert the current target string character.
If we moved south, we delete the current source string character.
If we moved southeast, we replace the current source character with current target character.
If all or any two of these impart the same cost in terms of editing, then they can be said to be equivalent and you can break the ties arbitrarily.
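The matrix above can be reproduced mechanically. Here is a sketch in Python (unit costs; the function name is mine):

```python
def cost_matrix(source, target):
    # c[i][j] = minimum cost of converting source[:i] into target[:j]
    c = [[0] * (len(target) + 1) for _ in range(len(source) + 1)]
    for i in range(1, len(source) + 1):
        c[i][0] = i                              # south moves: deletions
    for j in range(1, len(target) + 1):
        c[0][j] = j                              # east moves: insertions
    for i in range(1, len(source) + 1):
        for j in range(1, len(target) + 1):
            west = c[i][j - 1] + 1               # insert target[j-1]
            northwest = c[i - 1][j - 1] + (source[i - 1] != target[j - 1])  # replace (free on match)
            north = c[i - 1][j] + 1              # delete source[i-1]
            c[i][j] = min(west, northwest, north)
    return c

# cost_matrix("AT", "TA") reproduces the rows of the table:
# [[0, 1, 2], [1, 1, 1], [2, 1, 2]]
```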
One of the first experiences with this comes when we compute c(2, 2) in the cost matrix. (The borders are clear: c(0, 1) and c(0, 2) are the minimum costs of converting the empty string to "T" and "TA" respectively, and c(1, 0) and c(2, 0) are the costs of converting "A" and "AT" respectively to the empty string.)
The value of c(2, 2) can be realized either by:
inserting the current character in target, 'A' (we move east from c(2, 1)) -- cost is 1 + 1 = 2, or
replacing the current character 'T' in source with the current character 'A' in target (we move southeast from c(1, 1)) -- cost is 1 + 1 = 2, or
deleting the current character in source, 'T' (we move south from c(1, 2)) -- cost is 1 + 1 = 2.
Since all values are the same, which one are you going to choose?
If you choose to move from west, your alignment could be:
A T -
- T A
(one deletion, one 0-cost replacement, one insertion)
If you choose to move from north, your alignment could be:
- A T
T A -
(one insertion, one 0-cost replacement, one deletion)
If you choose to move from northwest, your alignment could be:
A T
T A
(Two 1-cost replacements).
All these edit graphs are equivalent in terms of edit distance (under the given cost function).
Edit distance is only interested in the minimum number of operations required to transform one sequence into another; it is not interested in the uniqueness of the transformation. In practice, there are often multiple ways to transform one string into another, that all have the minimum number of operations.
I have a grid with x fields. This grid should be filled with as many squares (let's call them "farms") of size 2x2 (so each farm is 4 fields) as possible. Each farm has to be connected to a certain field ("root") through "roads".
I have written a kind of brute-force algorithm which tries every combination of farms and roads. Every time a farm is placed on the grid, the algorithm checks whether the farm has a connection to the root, using the A* algorithm. It works very well on small grids, but on large grids it's too time consuming.
Here is a small already solved grid
http://www.tmk-stgeorgen.at/algo/small.png
Blue squares are the farms, red squares are free space or "roads" and the filled red square is the root field, to which every farm needs a connection.
I need to solve this grid:
http://www.tmk-stgeorgen.at/algo/grid.png
Is there any fast standard algorithm, which I can use?
i think the following is better than a search, but it's based on a search, so i'll describe that first:
search
you can make a basic search efficient in various ways.
first, you need to enumerate the possible arrangements efficiently. i think i would do this by storing the number of shifts relative to the first position a farm can be placed, starting from the bottom (near the root). so (0) would be a single farm on the left of the bottom line; (1) would be that farm shifted one right; (0,0) would be two farms, first as (0), second at the first position possible scanning upwards (second line, touching first farm); (0,1) would have the second farm one to the right; etc.
second, you need to prune as efficiently as possible. there it's a trade-off between doing smart but expensive things, and dumb but fast things. dumb but fast would be a flood fill from the root, checking whether all farms can be reached. smarter would be working out how to do that in an incremental fashion when you add one farm - for example, you know that you can rely on previous flood fill results for cells below the lowest cell the new farm covers. even smarter would be identifying which roads are critical (unique access to another farm) and "protecting" them in some way.
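here's roughly what the dumb-but-fast check could look like - a sketch in python; the representation (cells as (row, col) tuples, farms as 4-cell sets, everything not covered by a farm counts as road) and the function name are my own inventions:

```python
from collections import deque

def all_farms_reachable(cells, farms, root):
    # cells: set of (row, col) on the grid; farms: list of 4-cell sets; root: the root cell.
    # road = any grid cell not covered by a farm.
    farm_cells = set().union(*farms) if farms else set()
    road = cells - farm_cells
    if root not in road:
        return False
    seen, queue = {root}, deque([root])
    while queue:                                  # plain BFS flood fill over road cells
        r, c = queue.popleft()
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nb in road and nb not in seen:
                seen.add(nb)
                queue.append(nb)

    def touches(farm):                            # a farm is reached if any of its
        return any((fr + dr, fc + dc) in seen     # cells borders a flooded road cell
                   for (fr, fc) in farm
                   for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))

    return all(touches(f) for f in farms)
```

this runs once per placed farm in the naive version; the incremental variant above would reuse `seen` between placements.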
third, there may be extra tweaks you can do at a higher level. for example, it might be better to solve for a symmetric grid (and use symmetry to avoid repeating the same pattern in different ways) and then check which solutions are consistent with the grid you actually have. another approach that might be useful, but that i can't see how to make work, is to focus on the road rather than the farms.
caching
here's the secret sauce. the search i have described "fills up" farms into the space from the bottom, left to right scanning.
now imagine that you have run the search to the point where the space is full, with a nearly-optimal distribution. it may be that to improve that solution you have to backtrack almost to the start to rearrange a few farms "near the bottom". which is expensive because then you have to continue the search to re-fill the space above.
but you don't need to repeat the entire search if the "boundary" around the farms is the same as an earlier arrangement. because you've already "filled in" above that boundary in some optimal way. so you can cache by "best result for a given boundary" and simply look-up those solutions.
the boundary description must include the shape of the boundary and the positions of roads that provide access to the root. that is all.
Here's something kind of crude in Haskell, which could probably benefit from optimization, memoization, and better heuristics...
The idea is to start with a grid that is all farm and place roads on it, starting with the root and expanding from there. The recursion uses a basic heuristic, where the candidates are chosen from all adjacent straight-two-block segments all along the road/s, and only if they satisfy the requirement that adding the segment will increase the number of farms connected to the road/s (overlapping segments are just added as one block rather than two).
import qualified Data.Map as M
import Data.List (nubBy)

-- (row, [columns on the grid])
grid' = M.fromList [(9,[6])
                   ,(8,[5..7])
                   ,(7,[4..8])
                   ,(6,[3..9])
                   ,(5,[2..10])
                   ,(4,[1..11])
                   ,(3,[2..10])
                   ,(2,[3..9])
                   ,(1,[4..7])]

grid = M.fromList [(19,[10])
                  ,(18,[9..11])
                  ,(17,[8..12])
                  ,(16,[7..13])
                  ,(15,[6..14])
                  ,(14,[5..15])
                  ,(13,[4..16])
                  ,(12,[3..17])
                  ,(11,[2..18])
                  ,(10,[1..19])
                  ,(9,[1..20])
                  ,(8,[1..19])
                  ,(7,[2..18])
                  ,(6,[3..17])
                  ,(5,[4..16])
                  ,(4,[5..15])
                  ,(3,[6..14])
                  ,(2,[7..13])
                  ,(1,[8..11])]

root' = (1,7)  --(row,column)
root = (1,11)  --(row,column)

isOnGrid (row,col) =
  case M.lookup row grid of
    Nothing -> False
    Just a  -> elem col a

isFarm (topLeftRow,topLeftCol) =
  and (map isOnGrid [(topLeftRow,topLeftCol),(topLeftRow,topLeftCol + 1)
                    ,(topLeftRow - 1,topLeftCol),(topLeftRow - 1,topLeftCol + 1)])

isNotOnFarm tile@(r,c) farm@(fr,fc) =
  not (elem r [fr,fr - 1]) || not (elem c [fc, fc + 1])

isOnFarm tile@(r,c) farm@(fr,fc) =
  elem r [fr,fr - 1] && elem c [fc, fc + 1]

farmOnFarm farm@(fr,fc) farm' =
  or (map (flip isOnFarm farm') [(fr,fc),(fr,fc + 1),(fr - 1,fc),(fr - 1,fc + 1)])

addRoad tile@(r,c) result@(road,(numFarms,farms))
  | not (isOnGrid tile) || elem tile road = result
  | otherwise = (tile:road,(length $ nubBy (\a b -> farmOnFarm a b) farms',farms'))
  where
    newFarms' = filter (isNotOnFarm tile) farms
    newFarms = foldr comb newFarms' adjacentFarms
    farms' = newFarms ++ adjacentFarms
    comb adjFarm newFarms'' =
      foldr (\a b -> if farmOnFarm a adjFarm || a == adjFarm then b else a:b) [] newFarms''
    adjacentFarms = filter (\x -> isFarm x && and (map (flip isNotOnFarm x) road))
                           [(r - 1,c - 1),(r - 1,c),(r,c - 2),(r + 1,c - 2)
                           ,(r + 2,c - 1),(r + 2,c),(r + 1,c + 1),(r,c + 1)]

candidates result@(road,(numFarms,farms)) =
    filter ((>numFarms) . fst . snd)
  $ map (\roads -> foldr (\a b -> addRoad a b) result roads)
  $ concatMap (\(r,c) -> [[(r + 1,c),(r + 1,c - 1)],[(r + 1,c),(r + 1,c + 1)]
                         ,[(r,c - 1),(r + 1,c - 1)],[(r,c - 1),(r - 1,c - 1)]
                         ,[(r,c + 1),(r + 1,c + 1)],[(r,c + 1),(r - 1,c + 1)]
                         ,[(r - 1,c),(r - 1,c - 1)],[(r - 1,c),(r - 1,c + 1)]
                         ,[(r + 1,c),(r + 2,c)],[(r,c - 1),(r,c - 2)]
                         ,[(r,c + 1),(r,c + 2)],[(r - 1,c),(r - 2, c)]]) road

solve = solve' (addRoad root ([],(0,[]))) where
  solve' result@(road,(numFarms,farms)) =
    if null candidates'
      then [result]
      else do candidate <- candidates'
              solve' candidate
    where candidates' = candidates result

b n = let (road,(numFarms,farms)) = head $ filter ((>=n) . fst . snd) solve
      in (road,(numFarms,nubBy (\a b -> farmOnFarm a b) farms))
Output, small grid:
format: (road/s,(numFarms,farms))
*Main> b 8
([(5,5),(5,4),(6,6),(4,6),(5,6),(4,8),(3,7),(4,7),(2,7),(2,6),(1,7)]
,(8,[(2,4),(3,8),(5,9),(8,6),(6,7),(5,2),(4,4),(7,4)]))
(0.62 secs, 45052432 bytes)
Diagram (O's are roads):
     X
    XXX
   XXXXX
  XXXOXXX
 XXOOOXXXX
XXXXXOOOXXX
 XXXXXOXXX
  XXXOOXX
   XXXO
Output, large grid:
format: (road/s,(numFarms,farms))
*Main> b 30
([(9,16),(9,17),(13,8),(13,7),(16,10),(7,6),(6,6),(9,3),(8,4),(9,4),(8,5)
,(8,7),(8,6),(9,7),(10,8),(10,7),(11,8),(12,9),(12,8),(14,9),(13,9),(14,10)
,(15,10),(14,11),(13,12),(14,12),(13,14),(13,13),(12,14),(11,15),(11,14)
,(10,15),(8,15),(9,15),(8,14),(8,13),(7,14),(7,15),(5,14),(6,14),(5,12)
,(5,13),(4,12),(3,11),(4,11),(2,11),(2,10),(1,11)]
,(30,[(2,8),(4,9),(6,10),(4,13),(6,15),(7,12),(9,11),(10,13),(13,15),(15,13)
,(12,12),(13,10),(11,9),(9,8),(10,5),(8,2),(10,1),(11,3),(5,5),(7,4),(7,7)
,(17,8),(18,10),(16,11),(12,6),(14,5),(15,7),(10,18),(8,16),(11,16)]))
(60.32 secs, 5475243384 bytes)
*Main> b 31
still waiting....
I don't know if this solution will maximize your number of farms, but you can try to place them in a regular way: align them horizontally or vertically. You can stick 2 columns (or rows) together for the best density of farms. You should just take care to leave 1 space on top/bottom (or left/right).
When you can't fit another column (row), just check if you can put some farms near the border of your grid.
Hope this helps!
From ProjectEuler.net:
Prob 76: How many different ways can one hundred be written as a sum of at least two positive integers?
I have no idea how to start this... any pointers in the right direction, or help? I'm not looking for the answer itself, just some hints on how to approach it.
For example 5 can be written like:
4 + 1
3 + 2
3 + 1 + 1
2 + 2 + 1
2 + 1 + 1 + 1
1 + 1 + 1 + 1 + 1
So 6 possibilities total.
Partition Numbers (or Partition Functions) are the key to this one.
Problems like these are usually easier if you start at the bottom and work your way up to see if you can detect any patterns.
P(1) = 1 = {1}
P(2) = 2 = {[2], [1 + 1]}
P(3) = 3 = {[3], [2 + 1], [1 + 1 + 1]}
P(4) = 5 = {[4], [3 + 1], [2 + 2], [2 + 1 + 1], [1 + 1 + 1 + 1]}
P(5) = 7 ...
P(6) = 11 ...
P(7) = 15 ...
P(8) = 22 ...
P(9) = 30 ...
Hint: See if you can build P(N) up from some combination of the results prior to P(N).
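If you want to check whatever pattern you find against the numbers above, a bottom-up table works. This is a sketch in Python (the function name is mine); like the list above, it counts the one-term sum [n] itself:

```python
def partitions(n):
    # ways[m] = number of ways to write m using the parts allowed so far
    ways = [1] + [0] * n              # one way to make 0: the empty sum
    for part in range(1, n + 1):      # allow parts 1, 2, ..., n in turn
        for m in range(part, n + 1):
            ways[m] += ways[m - part]
    return ways[n]                    # P(n), including the one-term sum [n]

# partitions(4) == 5 and partitions(5) == 7, matching the list above
```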
The solution can be found using a chopping algorithm.
Use for example the 6. Then we have:
6
5+1
4+2
3+3
but we are not finished yet.
If we take the 5+1 and consider the +1 part as finished (because all other ending combinations are handled by the 4+2 and 3+3), we then need to apply the same trick to the 5:
4+1+1
3+2+1
And we can continue. But not mindlessly: for example, 4+2 produces 3+1+2 and 2+2+2, but we don't want the 3+1+2 because we will already have 3+2+1. So we only use the productions of 4 whose lowest number is greater than or equal to 2.
6
5+1
4+1+1
3+1+1+1
2+1+1+1+1
1+1+1+1+1+1
2+2+1+1
3+2+1
4+2
2+2+2
3+3
Next step is to put this in an algorithm.
OK, we need a recursive function that takes two parameters: the number to be chopped and the minimal value:
func CountCombinations(Number, Minimal)
    if Number <= 1 then return 1
    temp = 1
    for i = Minimal to Floor(Number / 2)
        temp = temp + CountCombinations(Number - i, i)
    end for
    return temp
end func
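The pseudocode above translates directly; here is a sketch in Python (the function name is mine), with memoization so that large inputs stay fast:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_combinations(number, minimal):
    # counts the ways to write `number` as a sum of parts >= `minimal`,
    # including the one-term sum `number` itself
    if number <= 1:
        return 1
    total = 1                                  # the sum consisting of `number` alone
    for i in range(minimal, number // 2 + 1):
        total += count_combinations(number - i, i)
    return total

# count_combinations(6, 1) == 11; the problem's answer is count_combinations(100, 1) - 1
```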
To check the algorithm:
C(6,1) = 1 + C(5,1) + C(4,2) + C(3,3) = 11, which is correct.
C(5,1) = 1 + C(4,1) + C(3,2) = 7
C(4,1) = 1 + C(3,1) + C(2,2) = 5
C(3,1) = 1 + C(2,1) = 3
C(2,1) = 1 + C(1,1) = 2
C(1,1) = 1
C(2,2) = 1
C(3,2) = 1
C(4,2) = 1 + C(2,2) = 2
C(3,3) = 1
By the way, the number of combinations for 100:
C(100, 1) = 190569292, counting the one-term sum "100" itself.
Since the problem asks for sums of at least two positive integers, the answer is 190569292 - 1 = 190569291.
A good way to approach these is not get fixated on the '100' but try to consider what the difference between totalling for a sum n and n+1 would be, by looking for patterns as n increases 1,2,3....
I'd have a go now but I have work to do :)
Like most problems in Project Euler with big numbers, the best way to think about them is not to get stumped with that huge upper bound, and think of the problem in smaller terms, and gradually work your way up. Maybe, on the way you'll recognize a pattern, or learn enough to get you to the answer easily.
The only other hint I think I can give you without spoiling your epiphany is the word 'partition'.
Once you've figured that out, you'll have it in no time :)
one approach is to think of a recursive function: find combinations of numbers drawn from the positive integers (duplicates allowed, order ignored) that add up to 100
the zero case is the number 1, i.e. for the number 1 there are zero solutions
the unit case is the number 2, i.e. for the number 2 there is only one solution (1 + 1)
another approach is to think of a generative function: start from the base case and build series up to the target, keeping a map/hash of the intermediate values and counts
you can iterate up from 1, or recurse down from 100; you'll get the same answer either way. at each point you could (for a naive solution) generate all series of positive integers drawn from 1 up to the target number minus 1, and count only those that add up to the target number
good luck!
Note: my maths is a bit rusty, but hopefully this will help...
You are going well with your break down of the problem.
Think Generally:
A number n can be written as (n-1)+1 or (n-2)+2
You generalise this to (n-m)+m
Remember that the above also applies to all numbers (including m)
So the idea is to find the first decomposition (let's say 5 = (5-1)+1) and then treat (5-1) as a new n: 5 = 4+1, so 5 = ((4-1)+1)+1. Then once that is exhausted, begin again with 5 = 3+2, which becomes 5 = ((3-1)+1)+2 = 2+1+2, breaking down each term as you go along.
Many math problems can be solved by induction.
You know the answer for a specific value, and you can find the answer for every value, if you find something that link n with n+1.
For example, in your case you know that the answer to How many different ways can one be written as a sum of at least two positive integers? is just 0 (there is no such way at all), and the answer for two is just 1 (namely 1 + 1).
What do I mean by the link between n and n+1? I mean exactly that you must find a formula such that, provided you know the answer for n, you can find the answer for n+1. Then, applying that formula recursively, you'll know the answer and you're done (note: this is just the mathematical part of it; in real life you might find that this approach is too slow to be practical, in which case you're not done yet - here, though, I think it will be fine).
Now, suppose that you know n can be written as a sum of at least two positive integers in k different ways, one of which would be:
n=a1+a2+a3+...am (this sum has m terms)
What can you say about n+1? Since you would like just hints, I'm not writing the solution here, only what follows. You certainly get the same k ways with an extra +1 term appended to each, one of which would be:
n+1=a1+a2+a3+...am+1 (this sum has m+1 terms)
Then, of course, you will have another k possibilities, such as those in which the last term of each sum is not appended to but increased by one, like:
n+1=a1+a2+a3+...(am+1) (this sum has m terms)
Thus this construction gives you up to 2k ways of writing n+1 as a sum of at least two positive integers (careful: some of them coincide), and it still misses some. Find the rest, if you can :-)
And enjoy :-))