Solving ACM ICPC - SEERC 2009 - algorithm

I have been sitting on this for almost a week now. Here is the question in a PDF format.
I could only think of one idea so far but it failed. The idea was to recursively create all connected subgraphs which works in O(num_of_connected_subgraphs), but that is way too slow.
I would really appreciate someone giving me a direction. I'm inclined to think that the only way is dynamic programming but I can't seem to figure out how to do it.

OK, here is a conceptual description for the algorithm that I came up with:
Form an array of the (x,y) board map from -7 to 7 in both dimensions and place the opponent's pieces on it.
Starting with the first row (lowest Y value, -N):
    enumerate all possible combinations of the 2nd player's pieces on the row, eliminating only those that conflict with the opponent's pieces.
    for each combination on this row:
        --group connected pieces into separate networks and number these
          networks starting with 1, ascending
        --encode the row as a vector using:
            = 0 for any unoccupied or opponent position
            = (1-8) for the network group that the piece/position is in.
        --give each such grouping a COUNT of 1, and add it to a dictionary/hashset using the encoded vector as its key
Now, for each succeeding row, in ascending order {y=y+1}:
    For every entry in the previous row's dictionary:
        --If the entry has exactly 1 group, add its COUNT to TOTAL
        --enumerate all possible combinations of the 2nd player's pieces
          on the current row, eliminating only those that conflict with the
          opponent's pieces. (change:) you should skip the initial combination
          (where all entries are zero) for this step, as the step above actually
          covers it. For each such combination on the current row:
            + produce a grouping vector as described above
            + compare the current row's group-vector to the previous row's
              group-vector from the dictionary:
                ++ if there are any group-*numbers* from the previous row's
                   vector that are not adjacent to any groups in the current
                   row's vector, *for at least one value of X*, then skip
                   to the next combination.
                ++ any groups for the current row that are adjacent to any
                   groups of the previous row, acquire the lowest such group
                   number
                ++ any groups for the current row that are not adjacent to
                   any groups of the previous row, are assigned an unused
                   group number
            + Re-Normalize the group-number assignments for the current-row's
              combination (**) and encode the vector, giving it a COUNT equal
              to the previous row-vector's COUNT
            + Add the current-row's vector to the dictionary for the current
              Row, using its encoded vector as the key. If it already exists,
              then add its COUNT to the COUNT for the pre-existing entry
Finally, for every entry in the dictionary for the last row:
    If the entry has exactly one group, then add its COUNT to TOTAL
**: Re-Normalizing simply means to re-assign the group numbers so as to eliminate any permutations in the grouping pattern. Specifically, this means that new group numbers should be assigned in increasing order, from left-to-right, starting from one. So for example, if your grouping vector looked like this after grouping it to the previous row:
2 0 5 5 0 3 0 5 0 7 ...
it should be re-mapped to this normal form:
1 0 2 2 0 3 0 2 0 4 ...
Note that as in this example, after the first row, the groupings can be discontiguous. This relationship must be preserved, so the two groups of "5"s are re-mapped to the same number ("2") in the re-normalization.
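The re-normalization step can be sketched in a few lines of Python. This is only an illustration of the rule described above; the function name `normalize` is made up for the example, and the vector is assumed to be a plain list of integers with 0 meaning "empty or opponent".

```python
def normalize(groups):
    """Renumber group ids in order of first appearance, left to right.

    0 is left untouched; equal ids stay equal, so groups that are
    discontiguous within the row are still mapped to one number.
    """
    mapping = {}  # old group id -> new group id
    result = []
    for g in groups:
        if g == 0:
            result.append(0)
            continue
        if g not in mapping:
            mapping[g] = len(mapping) + 1
        result.append(mapping[g])
    return result


print(normalize([2, 0, 5, 5, 0, 3, 0, 5, 0, 7]))
# [1, 0, 2, 2, 0, 3, 0, 2, 0, 4]
```

Converting the result to a tuple makes it usable directly as the dictionary key.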
OK, a couple of notes:
A. I think that this approach is correct, but I am really not certain, so it will definitely need some vetting, etc.
B. Although it is long, it's still pretty sketchy. Each individual step is non-trivial in itself.
C. Although there are plenty of individual optimization opportunities, the overall algorithm is still pretty complicated. It is a lot better than brute-force, but even so, my back-of-the-napkin estimate is still around (2.5 to 10)*10^11 operations for N=7.
So it's probably tractable, but still a long way off from doing 74 cases in 3 seconds. I haven't read all of the detail of Peter de Revaz's answer, but his idea of rotating the "diamond" might be workable for my algorithm. Although it would increase the complexity of the inner loop, it may drop the size of the dictionaries (and thus, the number of grouping-vectors to compare against) by as much as 100x, though it's really hard to tell without actually trying it.
Note also that there isn't any dynamic programming here. I couldn't come up with an easy way to leverage it, so that might still be an avenue for improvement.
OK, I enumerated all possible valid grouping-vectors to get a better estimate of (C) above, which lowered it to about 3.5*10^9 operations for N=7. That's much better, but still about an order of magnitude over what you probably need to finish 74 tests in 3 seconds. That does depend on the tests, though: if most of them are smaller than N=7, it might be able to make it.

Here is a rough sketch of an approach for this problem.
First note that the lattice points need |x|+|y| < N, which for N=7 results in a diamond shape going from coordinates 0,6 to 6,0, i.e. with 7 points on each side.
If you imagine rotating this diamond by 45 degrees, you will end up with a 7*7 square lattice which may be easier to think about. (Although note that there are also intermediate 6 high columns.)
For example, for N=3 the original lattice points are:
..A..
.BCD.
EFGHI
.JKL.
..M..
Which rotate to
A   D   I
  C   H
B   G   L
  F   K
E   J   M
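As a rough sketch of that rotation, the points can be grouped by the diagonal x+y they lie on. The function name `rotated_rows` and the coordinate convention (A at (0, 2), y pointing up) are assumptions made for this example only.

```python
def rotated_rows(n):
    """Group the lattice points with |x|+|y| < n by their diagonal x+y.

    Each diagonal becomes one row of the rotated lattice; within a row the
    points are ordered by x. Rows alternate between n and n-1 points.
    """
    rows = {}
    for y in range(-n + 1, n):
        for x in range(-n + 1, n):
            if abs(x) + abs(y) < n:
                rows.setdefault(x + y, []).append((x, y))
    return [sorted(rows[d]) for d in sorted(rows, reverse=True)]


for row in rotated_rows(3):
    print(row)
# first row: [(0, 2), (1, 1), (2, 0)], i.e. the points A, D, I
```

For N=7 this gives 13 rows that alternate between 7 and 6 points, which is where the intermediate 6-high columns come from.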
On the (possibly rotated) lattice I would attempt to solve by dynamic programming the problem of counting the number of ways of placing armies in the first x columns such that the last column is a certain string (plus a boolean flag to say whether some points have been placed yet).
The string contains a digit for each lattice point.
0 represents an empty location
1 represents an isolated point
2 represents the first of a new connected group
3 represents an intermediate in a connected group
4 represents the last in a connected group
During the algorithm the strings can represent shapes containing multiple connected groups, but we reject any transformations that leave an orphaned connected group.
When you have placed all columns you need to only count strings which have at most one connected group.
For example, the string for the first 5 columns of the shape below is:
....+ = 2
..+++ = 3
..+.. = 0
..+.+ = 1
..+.. = 0
..+++ = 3
..+++ = 4
The middle + is currently unconnected, but may become connected by a later column so still needs to be tracked. (In this diagram I am also assuming an up/down/left/right 4-connectivity. The rotated lattice should really use a diagonal connectivity, but I find that a bit harder to visualise and I am not entirely sure it is still a valid approach with this connectivity.)
I appreciate that this answer is not complete (and could do with lots more pictures/explanation), but perhaps it will prompt someone else to provide a more complete solution.

Related

Possible NxN matrices, t 1's in each row and column, none in diagonal?

Background:
This is extra credit in a logic and algorithms class. We are currently covering propositional logic, P implies Q, that kind of thing, so I think the Prof wanted to give us an assignment out of our depth.
I will implement this in C++, but right now I just want to understand what's going on in the example....which I don't.
Example
Enclosed is a walkthrough for the Lefty algorithm which computes the number
of nxn 0-1 matrices with t ones in each row and column, but none on the main
diagonal.
The algorithm used to verify the equations presented counts all the possible
matrices, but does not construct them.
It is called "Lefty", it is reasonably simple, and is best described with an
example.
Suppose we wanted to compute the number of 6x6 0-1 matrices with 2 ones
in each row and column, but no ones on the main diagonal. We first create a
state vector of length 6, filled with 2s:
(2 2 2 2 2 2)
This state vector symbolizes the number of ones we must yet place in each
column. We accompany it with an integer which we call the "puck", which is
initialized to 1. This puck will increase by one each time we perform a ones
placement in a row of the matrix (a "round"), and we will think of the puck as
"covering up" the column that we won't be able to place ones in for that round.
Since we are starting with the first row (and hence the first round), we place
two ones in any column, but since the puck is 1, we cannot place ones in the
first column. This corresponds to the forced zero that we must place in the first
column, since the 1,1 entry is part of the matrix's main diagonal.
The algorithm will iterate over all possible choices, but to show each round,
we shall make a choice, say the 2nd and 6th columns. We then drop the state
vector by subtracting 1 from the 2nd and 6th values, and advance the puck:
(2 1 2 2 2 1); 2
For the second round, the puck is 2, so we cannot place a one in that column.
We choose to place ones in the 4th and 6th columns instead and advance the
puck:
(2 1 2 1 2 0); 3
Now at this point, we can place two ones anywhere but the 3rd and 6th
columns. At this stage the algorithm treats the possibilities differently: We
can place some ones before the puck (in the column indexes less than the puck
value), and/or some ones after the puck (in the column indexes greater than
the puck value). Before the puck, we can place a one where there is a 1, or
where there is a 2; after the puck, we can place a one in the 4th or 5th columns.
Suppose we place ones in the 4th and 5th columns. We drop the state vector
and advance the puck once more:
(2 1 2 0 1 0); 4
For the 4th round, we once again notice we can place some ones before the
puck, and/or some ones after.
Before the puck, we can place:
(a) two ones in columns of value 2 (1 choice)
(b) one one in the column of value 2 (2 choices)
(c) one one in the column of value 1 (1 choice)
(d) one one in a column of value 2 and one one in a column of value 1 (2
choices).
After we choose one of the options (a)-(d), we must multiply the listed
number of choices by one for each way to place any remaining ones to the right
of the puck.
So, for option (a), there is only one way to place the ones.
For option (b), there are two possible ways for each possible placement of
the remaining one to the right of the puck. Since there is only one nonzero value
remaining to the right of the puck, there are two ways total.
For option (c), there is one possible way for each possible placement of the
remaining one to the right of the puck. Again, since there is only one nonzero
value remaining, there is one way total.
For option (d), there are two possible ways to place the ones.
We choose option (a). We drop the state vector and advance the puck:
(1 1 1 0 1 0); 5
Since the puck is "covering" the 1 in the 5th column, we can only place
ones before the puck. There are (3 take 2) ways to place two ones in the three
columns of value 1, so we multiply 3 by the number of ways to get remaining
possibilities. After choosing the 1st and 3rd columns (though it doesn't matter
since we're left of the puck; any two of the three will do), we drop the state
vector and advance the puck one final time:
(0 1 0 0 1 0); 6
There is only one way to place the ones in this situation, so we terminate
with a count of 1. But we must take into account all the multiplications along
the way: 1*1*1*1*3*1 = 3.
Another way of thinking of the varying row is to start with the first matrix,
focus on the lower-left 2x3 submatrix, and note how many ways there were to
permute the columns of that submatrix. Since there are only 3 such ways, we
get 3 matrices.
What I think I understand
This algorithm counts all possible 6x6 arrays with two 1's in each row and column and none on the descending diagonal.
Instead of constructing the matrices it uses a "state_vector" filled with six 2's, representing how many 1's still have to be placed in each column, and a "puck" that represents the index of the diagonal and the current row as the algorithm iterates.
What I don't understand
The algorithm comes up with a value of 1 for each row except row 5, which is assigned a 3. At the end these values are multiplied for the end result. These values are supposed to be the possible placements for each row, but there are many possibilities for row 1. Why was it given a one? Why did the algorithm wait until row 5 to figure out all the possible permutations?
Any help will be much appreciated!
I think what is going on is a tradeoff between doing combinatorics and doing recursion.
The algorithm is using recursion to add up all the counts for each choice of placing the 1's. The example considers a single choice at each stage, but to get the full count it needs to add the results for all possible choices.
Now it is quite possible to get the final answer simply using recursion all the way down. Every time we reach the bottom we just add 1 to the total count.
The normal next step is to cache the result of calling the recursive function as this greatly improves the speed. However, the memory use for such a dynamic programming approach depends on the number of states that need to be expanded.
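To make the "recursion plus cache" idea concrete, here is a minimal Python sketch. It is not the Lefty algorithm from the handout: it only does the plain memoized recursion over the state vector and leaves out the combinatorial shortcut for columns the puck has already passed. The name `count_matrices` and the 0-based row index are choices made for this example.

```python
from functools import lru_cache
from itertools import combinations


def count_matrices(n, t):
    """Count n x n 0-1 matrices with t ones per row and column, zero diagonal."""

    @lru_cache(maxsize=None)
    def rec(row, state):
        # state[c] = number of ones still to be placed in column c
        if row == n:
            return 1 if all(c == 0 for c in state) else 0
        # The puck covers column `row`; full columns are unavailable too.
        allowed = [c for c in range(n) if c != row and state[c] > 0]
        total = 0
        for cols in combinations(allowed, t):
            nxt = list(state)
            for c in cols:
                nxt[c] -= 1
            total += rec(row + 1, tuple(nxt))
        return total

    return rec(0, (t,) * n)


print(count_matrices(4, 1))  # 9, the derangements of 4 elements
print(count_matrices(4, 2))  # 9, the complement of the t=1 case
print(count_matrices(6, 2))  # the 6x6 case from the walkthrough
```

Each call sums over all choices for one row, which is exactly the "add the results for all possible choices" step; the walkthrough in the question follows only one branch of this tree.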
The combinatorics in the later stages is making use of the fact that once the puck has passed a column, the exact arrangement of counts in the columns doesn't matter so you only need to evaluate one representative of each type and then add up the resulting counts multiplied by the number of equivalent ways.
This both reduces the memory use and improves the speed of the algorithm.
Note that you cannot use combinatorics for counts to the right of the puck, as for these the order of the counts is still important due to the restriction about the diagonal.
P.S. You can actually compute the number of n*n matrices with two 1's in each row and column (and no diagonal entries) with pure combinatorics as:
a(n) = Sum_{k=0..n} Sum_{s=0..k} Sum_{j=0..n-k} (-1)^(k+j-s)*n!*(n-k)!*(2n-k-2j-s)!/(s!*(k-s)!*(n-k-j)!^2*j!*2^(2n-2k-j))
According to OEIS.

Heuristics for this (probably) NP-complete puzzle game

I asked whether this problem was NP-complete on the Computer Science forum, but asking for programming heuristics seems better suited for this site. So here it goes.
You are given an NxN grid of unit squares and 2N binary strings of length N. The goal is to fill the grid with 0's and 1's so that each string appears once and only once in the grid, either horizontally (left to right) or vertically (top down). Or determine that no such solution exists. If N is not fixed I suspect this is an NP-complete problem. However are there any heuristics that can hopefully speed up the search to faster than brute force trying all ways to fill in the grid with N vertical strings?
I remember programming this for my friend that had the 5x5 physical version of this game, but I used brute force back then. I can only think of this heuristic:
Consider a 4x4 map with these 8 strings (read each from left to right):
1 1 0 1
1 0 0 1
1 0 1 1
1 0 1 0
1 1 1 1
1 0 0 0
0 0 1 1
1 1 1 0
(Note that this is already solved, since the second 4 is the first 4 transposed)
First attempt:
We will choose columns from left to right. Since 7 of 8 strings start with 1, we will try to put the one with the most 1s in the first column (so that we can lay rows more easily when the columns are done).
In the second column, most strings have 0, so you can also try putting a string with the most zeros in the second column, and so on.
This I would call a wide-1 prediction, since it only looks at one column at a time.
(Possible) Improvement:
You can look at 2 columns at a time (a wide-2 prediction, if I may call it that). In this case, from the 8 strings, the most common combination of the first two bits is 10 (4/8), so you would like to choose the first two columns so that the combination 10 occurs as often as possible (in this case, 1111 followed by 1000 has 10 at the start of 3 of the 4 rows).
(Of course you don't have to stop at 2)
Weaknesses:
I don't know if this would work. I just made it up and thought it might work.
If you choose to use wide-X prediction, the number of possibilities is exponential in X.
This can absolutely fail if the distribution of combinations is even.
What you can do:
As I said, this game has a physical 5x5 adaptation, only there you can also lay the strings from right-to-left and bottom-to-top. If you found its name, you could google further. I unfortunately don't remember it.
Sounds like you want the crossword grid filling algorithm:
First, build 2N subsets of your 2N strings -- each subset has all the strings with a particular bit at a particular position. So subset(0,3) is all the strings that have a 0 in the 3rd position and subset(1,5) is all the strings that have a 1 in the 5th position.
The algorithm is a basic brute-force depth-first search trying all possible mappings of strings to slots in the grid, with severe pruning of impossible branches.
Your search state is a set of assignments of strings to slots and a set of sets of possible assignments to the remaining slots. The initial state has 0 assignments and 2N sets, all of which contain all 2N strings.
At each step of the search, pick the most constrained set (the set with the fewest elements) from the set of possible sets. Try each element of the set in turn in that slot (adding it to the assignments and removing it from the set of sets), and constrain all the remaining sets by removing the chosen string and intersecting the crossing sets with subset(X,K) (computed in the first step), where X is the bit from the chosen string and K is the row/column number of the chosen string.
If you find an empty set when picking above, there is no solution with the choices so far, so backtrack up the tree to a different choice
This is still EXPTIME, but it is about as fast as you can get it. Since the main time consuming step is the set intersections, using 2N bit binary strings for your set representation is very fast -- for N=32, the sets fit in a 64-bit word and can be intersected with a single AND instruction. It also helps to have a POPCOUNT instruction, since you also need set sizes.
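The bitmask representation can be sketched as follows. This only covers the subset construction and one pruning step, not the whole search; the helper names `build_subsets` and `crossing_candidates`, and the use of 0-based positions, are assumptions for this example. The strings are the 4x4 example from the other answer.

```python
def build_subsets(strings):
    """subsets[(bit, pos)] = bitmask of strings having `bit` at position `pos`."""
    n = len(strings[0])
    subsets = {}
    for bit in "01":
        for pos in range(n):
            mask = 0
            for i, s in enumerate(strings):
                if s[pos] == bit:
                    mask |= 1 << i
            subsets[(bit, pos)] = mask
    return subsets


def crossing_candidates(subsets, remaining, placed, line_index):
    """Candidate masks for every line crossing a newly placed string.

    `placed` sits in row (or column) `line_index`, so the k-th crossing
    line must hold a string whose bit at `line_index` equals placed[k].
    """
    return [remaining & subsets[(bit, line_index)] for bit in placed]


strings = ["1101", "1001", "1011", "1010", "1111", "1000", "0011", "1110"]
subsets = build_subsets(strings)
remaining = (1 << len(strings)) - 1
remaining &= ~(1 << 0)  # strings[0] is now used for row 0
for k, mask in enumerate(crossing_candidates(subsets, remaining, strings[0], 0)):
    print(k, [strings[i] for i in range(len(strings)) if mask >> i & 1])
```

After "1101" is placed in row 0, the third column has a single candidate left ("0011"), so the search would fill that slot next.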
This can be solved as a 0/1 integer linear program with O(N^2) variables and constraints. First there are variables Xij which are 1 if string i is assigned to line j (where j=1 to N are rows and j = (N+1) to 2N are columns). Then there is a variable for each square in the grid, which indicates if the entry is 0 or 1. If the position of the square is (i,j) with variable Yij then the sum of all X variables for line j that correspond to strings that have a 1 in position i is equal to Yij, and the sum of all X variables for line j that correspond to strings that have a 0 in position i is equal to (1 - Yij). And similarly for line i and position j. Finally, the sum of all X variables Xij for each string i (summed over all lines j) is equal to 1.
There has been a lot of research in speeding up solvers for 0/1 integer programming so this may be able to often handle fairly large N (like N=100) for many examples. Also, in some cases, solving the relaxed non-integer linear program and rounding the solution off to 0/1 may produce a valid solution, in polynomial time.
We could choose the first lg 2N rows out of the 2N strings, and then since 2^(lg 2N) = 2N, in a lot of cases there shouldn't be very many ways to assign the N columns so that the prefixes of length lg 2N are respected. Then all the rows are filled in so they can be checked to see if a solution has been found. We can also try assigning more rows in the beginning, and fill in different combinations of rows besides the initial rows. (e.g. we can try filling in contiguous rows starting anywhere in the grid).
Running time for assigning lg 2N rows out of 2N strings is O((2N)^(lg 2N)) = O(2^((lg 2N)^2)), which grows slower than 2^N. Assigning columns to match the prefixes is the part that's the hardest to predict run time. If a prefix occurs K times among the assigned rows, and there are M remaining strings that have the prefix, then the number of assignments for this prefix is M*(M-1)...(M-K+1). The total number of possible column assignments is the product of these terms over all prefixes that occur among the rows. If this gets to be too large, the number of rows initially assigned can be increased. But it's hard to predict the worst-case run time unless an assumption is made like the NxN grid is filled in randomly.

Algorithm for expressing reordering, as minimum number of object moves

This problem arises in synchronization of arrays (ordered sets) of objects.
Specifically, consider an array of items, synchronized to another computer. The user moves one or more objects, thus reordering the array, behind my back. When my program wakes up, I see the new order, and I know the old order. I must transmit the changes to the other computer, reproducing the new order there. Here's an example:
index      0  1  2
old order  A  B  C
new order  C  A  B
Define a move as moving a given object to a given new index. The problem is to express the reordering by transmitting a minimum number of moves across a communication link, such that the other end can infer the remaining moves by taking the unmoved objects in the old order and moving them into as-yet unused indexes in the new order, starting with the lowest index and going up. This method of transmission would be very efficient in cases where a small number of objects are moved within a large array, displacing a large number of objects.
Hang on. Let's continue the example. We have
CANDIDATE 1
Move A to index 1
Move B to index 2
Infer moving C to index 0 (the only place it can go)
Note that the first two moves are required to be transmitted. If we don't transmit Move B to index 2, B will be inferred to index 0, and we'll end up with B A C, which is wrong. We need to transmit two moves. Let's see if we can do better…
CANDIDATE 2
Move C to index 0
Infer moving A to index 1 (the first available index)
Infer moving B to index 2 (the next available index)
In this case, we get the correct answer, C A B, transmitting only one move, Move C to index 0. Candidate 2 is therefore better than Candidate 1. There are four more candidates, but since it's obvious that at least one move is needed to do anything, we can stop now and declare Candidate 2 to be the winner.
I think I can do this by brute force, trying all possible candidates, but for an array of N items there are N! (N factorial) possible candidates, and even if I am smart enough to truncate unnecessary searches as in the example, things might still get pretty costly in a typical array which may contain hundreds of objects.
The solution of just transmitting the whole order is not acceptable, because, for compatibility, I need to emulate the transmissions of another program.
If someone could just write down the answer that would be great, but advice to go read Chapter N of computer science textbook XXX would be quite acceptable. I don't know those books because, I'm, hey, only an electrical engineer.
Thanks!
Jerry Krinock
I think that the problem is reducible to the longest common subsequence problem: just find the longest common subsequence of the old and new orders and transmit moves for the items that do not belong to it. There is no proof of optimality, just my intuition, so I might be wrong. Even if I'm wrong, that may be a good starting point for some fancier algorithm.
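A possible sketch of that idea in Python follows. Since both orders contain the same distinct items, the longest common subsequence can be found as a longest increasing subsequence of old positions. The function names `minimal_moves` and `apply_moves` are invented for this example, and the receiver logic is written from the inference rule stated in the question (unmoved items fill the free indexes in old order).

```python
from bisect import bisect_left


def minimal_moves(old, new):
    """Return (item, new_index) moves for every item outside one longest
    subsequence of `new` that keeps the relative order it had in `old`."""
    pos_in_old = {item: i for i, item in enumerate(old)}
    seq = [pos_in_old[item] for item in new]
    tails, tail_vals = [], []  # patience-sorting piles: index in seq, value
    prev = [-1] * len(seq)
    for i, v in enumerate(seq):
        k = bisect_left(tail_vals, v)
        if k == len(tails):
            tails.append(i)
            tail_vals.append(v)
        else:
            tails[k], tail_vals[k] = i, v
        prev[i] = tails[k - 1] if k > 0 else -1
    keep = set()
    i = tails[-1] if tails else -1
    while i != -1:  # walk back through the chosen subsequence
        keep.add(i)
        i = prev[i]
    return [(item, idx) for idx, item in enumerate(new) if idx not in keep]


def apply_moves(old, moves):
    """Receiver side: place moved items, then fill gaps in old order."""
    result = [None] * len(old)
    moved = {item for item, _ in moves}
    for item, idx in moves:
        result[idx] = item
    rest = (x for x in old if x not in moved)
    return [next(rest) if slot is None else slot for slot in result]


moves = minimal_moves(list("ABC"), list("CAB"))
print(moves)                            # [('C', 0)]
print(apply_moves(list("ABC"), moves))  # ['C', 'A', 'B']
```

The sketch assumes all items are distinct and hashable and that `None` is not itself an item. It runs in O(N log N), which should be comfortable for arrays of a few hundred objects.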
Information theory based approach
First, have a bit series such that 0 corresponds to 'regular order' and 1 corresponds to 'irregular entry'. Whenever there is an irregular entry, also add the original location of the entry that is next.
Eg. Assume original order of ABCDE for the following cases
ABDEC: 001 3 01 2
BCDEA: 1 1 0001 0
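The two examples are consistent with the following reading of the scheme, shown here as a small Python sketch. The rule that an entry is "regular" when its original location is one past the previous entry's original location is inferred from the examples, not stated in the answer, and `encode_order` is a made-up name.

```python
def encode_order(old, new):
    """Emit '0' for an entry that directly follows its predecessor's original
    location, otherwise '1 <original location>'."""
    pos = {item: i for i, item in enumerate(old)}
    tokens = []
    expected = 0  # original location that would count as "regular" next
    for item in new:
        p = pos[item]
        tokens.append("0" if p == expected else f"1 {p}")
        expected = p + 1
    return tokens


print(encode_order("ABCDE", "ABDEC"))  # ['0', '0', '1 3', '0', '1 2']
print(encode_order("ABCDE", "BCDEA"))  # ['1 1', '0', '0', '0', '1 0']
```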
Now, if the probability of making a 'move' is p, this method requires roughly n + n*p*log(n) bits.
Note that if p is small the number of 0s is going to be high. You can further compress the result to:
n*(p*log(1/p) + (1-p)*log(1/(1-p))) + n*p*log(n) bits

Arranging groups of people optimally

I have this homework assignment that I think I managed to solve, but am not entirely sure as I cannot prove my solution. I would like comments on what I did, its correctness and whether or not there's a better solution.
The problem is as follows: we have N groups of people, where group i has g[i] people in it. We want to put these people on two rows of S seats each, such that: each group can only be put on a single row, in a contiguous sequence, OR if the group has an even number of members, we can split them in two and put them on two rows, but with the condition that they must form a rectangle (so they must have the same indices on both rows). What is the minimum number of seats S needed so that nobody is standing up?
Example: groups are 4 11. Minimum S is 11. We put all 4 in one row, and the 11 on the second row. Another: groups are 6 2. We split the 6 on two rows, and also the two. Minimum is therefore 4 seats.
This is what I'm thinking:
Calculate T = (sum of all groups + 1) / 2. Store the group numbers in an array, but split all the even values x in two values of x / 2 each. So 4 5 becomes 2 2 5. Now run subset sum on this vector, and find the minimum value higher than or equal to T that can be formed. That value is the minimum number of seats per row needed.
Example: 4 11 => 2 2 11 => T = (15 + 1) / 2 = 8. Minimum we can form from 2 2 11 that's >= 8 is 11, so that's the answer.
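Written out in Python, the approach described above might look like the sketch below. It implements the procedure exactly as stated, which is unproven, and the function name `min_seats` and the bitset subset-sum are choices made for this example.

```python
def min_seats(groups):
    """Smallest subset sum >= T over the groups with even ones halved."""
    parts = []
    for g in groups:
        if g % 2 == 0:
            parts += [g // 2, g // 2]  # an even group may be split in two
        else:
            parts.append(g)
    target = (sum(groups) + 1) // 2  # T, using integer division
    reachable = 1  # bit s is set iff some subset of parts sums to s
    for p in parts:
        reachable |= reachable << p
    s = target
    while not (reachable >> s) & 1:  # the full sum is always reachable
        s += 1
    return s


print(min_seats([4, 11]))  # 11
print(min_seats([6, 2]))   # 4
```

Both examples from the problem statement come out as expected, which of course is not a proof of correctness.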
This seems to work, at least I can't find any counter example. I don't have a proof though. Intuitively, it seems to always be possible to arrange the people under the required conditions with the number of seats supplied by this algorithm.
Any hints are appreciated.
I think your solution is correct. The minimum number of seats per row in an optimal distribution would be your T (which is mathematically obvious).
Splitting even numbers is also correct, since they have two possible arrangements; by logically putting all the "rectangular" groups of people on one end of the seat rows you can also guarantee that they will always form a proper rectangle, so that this condition is met as well.
So the question boils down to finding a sum equal to or as close as possible to T (i.e. the partition problem).
Minor nit: I'm not sure if the proposed solution above works in the edge case where each group has 0 members, because your numerator in T = (SUM ALL + 1) / 2 is always positive, so there will never be a subset sum that is greater than or equal to T.
To get around this, maybe a modulus operation might work here. We know that we need at least n seats in a row if n is the maximal odd term, so maybe the equation should have a max(n * (n % 2)) term in it. It will come out to max(odd) or 0. Since the maximal odd term is always added to S, I think this is safe (stated boldly without proof...).
Then we want to know if we should split the even terms or not. Here's where the subset sum approach might work, but with T simply equal to SUM ALL / 2.

How do I calculate the shanten number in mahjong?

This is a followup to my earlier question about deciding if a hand is ready.
Knowledge of mahjong rules would be excellent, but a poker- or romme-based background is also sufficient to understand this question.
In Mahjong 14 tiles (tiles are like
cards in Poker) are arranged to 4 sets
and a pair. A straight ("123") always
uses exactly 3 tiles, not more and not
less. A set of the same kind ("111")
consists of exactly 3 tiles, too. This
leads to a sum of 3 * 4 + 2 = 14
tiles.
There are various exceptions like Kan
or Thirteen Orphans that are not
relevant here. Colors and value ranges
(1-9) are also not important for the
algorithm.
A hand consists of 13 tiles, every time it's our turn we get to pick a new tile and have to discard any tile so we stay on 13 tiles - except if we can win using the newly picked tile.
A hand that can be arranged to form 4 sets and a pair is "ready". A hand that requires only 1 tile to be exchanged is said to be "tenpai", or "1 from ready". Any other hand has a shanten-number which expresses how many tiles need to be exchanged to be in tenpai. So a hand with a shanten number of 1 needs 1 tile to be tenpai (and 2 tiles to be ready, accordingly). A hand with a shanten number of 5 needs 5 tiles to be tenpai and so on.
I'm trying to calculate the shanten number of a hand. After googling around for hours and reading multiple articles and papers on this topic, this seems to be an unsolved problem (except for the brute force approach). The closest algorithm I could find relied on chance, i.e. it was not able to detect the correct shanten number 100% of the time.
Rules
I'll explain a bit about the actual rules (simplified) and then my idea of how to tackle this task. In mahjong, there are 4 colors, 3 normal ones like in card games (ace, heart, ...) that are called "man", "pin" and "sou". These colors run from 1 to 9 each and can be used to form straights as well as groups of the same kind. The fourth color is called "honors" and can be used for groups of the same kind only, but not for straights. The seven honors will be called "E, S, W, N, R, G, B".
Let's look at an example of a tenpai hand: 2p, 3p, 3p, 3p, 3p, 4p, 5m, 5m, 5m, W, W, W, E. Next we pick an E. This is a complete mahjong hand (ready) and consists of a 2-4 pin straight (remember, pins can be used for straights), a 3 pin triple, a 5 man triple, a W triple and an E pair.
Changing our original hand slightly to 2p, 2p, 3p, 3p, 3p, 4p, 5m, 5m, 5m, W, W, W, E, we get a hand in 1-shanten, i.e. it requires an additional tile to be tenpai. In this case, exchanging a 2p for a 3p brings us back to tenpai, so by drawing a 3p and an E we win.
1p, 1p, 5p, 5p, 9p, 9p, E, E, E, S, S, W, W is a hand in 2-shanten. There is 1 completed triplet and 5 pairs. We need one pair in the end, so once we pick one of 1p, 5p, 9p, S or W we need to discard one of the other pairs. Example: We pick a 1 pin and discard a W. The hand is in 1-shanten now and looks like this: 1p, 1p, 1p, 5p, 5p, 9p, 9p, E, E, E, S, S, W. Next, we wait for either a 5p, 9p or S. Assuming we pick a 5p and discard the leftover W, we get this: 1p, 1p, 1p, 5p, 5p, 5p, 9p, 9p, E, E, E, S, S. This hand is in tenpai and can complete on either a 9 pin or an S.
To avoid making this text even longer, I'll point you to Wikipedia or the various search results on Google for more examples. All of them are a bit more technical though, so I hope the above description suffices.
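To pin down the "4 sets and a pair" condition from the rules above, here is a small Python sketch of a completeness check. It is not a shanten calculator; it only tests whether 14 tiles form a finished hand under the simplified rules. The tile notation ("3p", "5m", "W") follows the examples, while the function names and the assumption that every tile name is valid are mine.

```python
from collections import Counter

HONORS = "ESWNRGB"


def splits_into_sets(counts, allow_straights):
    """True if the tile counts of one color split fully into triplets
    (and straights, when allowed)."""
    i = next((k for k, c in enumerate(counts) if c > 0), None)
    if i is None:
        return True
    if counts[i] >= 3:  # use the lowest tile in a triplet
        rest = counts[:i] + (counts[i] - 3,) + counts[i + 1:]
        if splits_into_sets(rest, allow_straights):
            return True
    if allow_straights and i + 2 < len(counts) and counts[i + 1] and counts[i + 2]:
        rest = list(counts)  # use the lowest tile to start a straight
        for k in (i, i + 1, i + 2):
            rest[k] -= 1
        if splits_into_sets(tuple(rest), allow_straights):
            return True
    return False


def is_complete(tiles):
    """True if the 14 tiles can be arranged as 4 sets and a pair."""
    if len(tiles) != 14:
        return False
    hand = Counter(tiles)
    for pair_tile, count in hand.items():
        if count < 2:
            continue
        rest = hand.copy()
        rest[pair_tile] -= 2  # try this tile as the pair
        suits_ok = all(
            splits_into_sets(tuple(rest[f"{r}{s}"] for r in range(1, 10)), True)
            for s in "mps"
        )
        if suits_ok and splits_into_sets(tuple(rest[h] for h in HONORS), False):
            return True
    return False


tenpai = "2p 3p 3p 3p 3p 4p 5m 5m 5m W W W E".split()
print(is_complete(tenpai + ["E"]))   # True
print(is_complete(tenpai + ["2p"]))  # False
```

Trying every tile as the pair and then splitting each color independently reflects the observation that every color has to be complete in itself.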
Algorithm
As stated, I'd like to calculate the shanten number of a hand. My idea was to split the tiles into 4 groups according to their color. Next, all tiles are sorted into sets within their respective groups so we end up with either triplets, pairs or single tiles in the honor group or, additionally, straights in the 3 normal groups. Completed sets are ignored. Pairs are counted, and the final number is decremented (we need 1 pair in the end). Single tiles are added to this number. Finally, we divide the number by 2 (since every time we pick a good tile that brings us closer to tenpai, we can get rid of another unwanted tile).
However, I cannot prove that this algorithm is correct, and I also have trouble incorporating straights for difficult groups that contain many tiles in a close range. Every kind of idea is appreciated. I'm developing in .NET, but pseudo code or any readable language is welcome, too.
I've thought about this problem a bit more. To see the final results, skip over to the last section.
First idea: Brute Force Approach
First of all, I wrote a brute force approach. It was able to identify 3-shanten within a minute, but it was not very reliable (sometimes it took a lot longer, and enumerating the whole space is impossible even for just 3-shanten).
Improvement of Brute Force Approach
One thing that came to mind was to add some intelligence to the brute force approach. The naive way is to add any of the remaining tiles, see if it produced Mahjong, and if not try the next recursively until it was found. Assuming there are about 30 different tiles left and the maximum depth is 6 (I'm not sure if a 7+-shanten hand is even possible [Edit: according to the formula developed later, the maximum possible shanten number is (13-1)*2/3 = 8]), we get (13*30)^6 possibilities, which is large (10^15 range).
However, there is no need to put every leftover tile in every position in your hand. Since every color has to be complete in itself, we can add tiles to the respective color groups and note down if the group is complete in itself. Details like having exactly 1 pair overall are not difficult to add. This way, there are max around (13*9)^6 possibilities, that is around 10^12 and more feasible.
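As a quick sanity check of the two estimates above, here is a minimal sketch that evaluates both expressions. The names search_space, HAND_POSITIONS and MAX_DEPTH are made up for illustration; the figures 13, 30, 9 and 6 are the rough assumptions from the text, not exact counts.

```python
# Rough size of the brute-force search space, using the figures from the text.
HAND_POSITIONS = 13  # tiles in hand that a new tile could replace
MAX_DEPTH = 6        # assumed maximum search depth


def search_space(candidate_tiles: int, depth: int = MAX_DEPTH) -> int:
    """(positions * candidate tiles) choices per step, raised to the depth."""
    return (HAND_POSITIONS * candidate_tiles) ** depth


naive = search_space(30)    # any of about 30 different remaining tiles
per_color = search_space(9)  # only the 9 ranks of the matching color

print(f"naive:     {naive:.1e}")
print(f"per color: {per_color:.1e}")
print(f"reduction: about {naive // per_color}x")
```

The per-color restriction shrinks the space by roughly three orders of magnitude, which matches the 10^15 versus 10^12 ranges quoted above.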
A better solution: Modification of the existing Mahjong Checker
My next idea was to use the code I wrote earlier to test for Mahjong and modify it in two ways:
don't stop when an invalid hand is found but note down a missing tile
if there are multiple possible ways to use a tile, try out all of them
This should be the best idea, and with some heuristics added it should give the optimal algorithm. However, I found it quite difficult to implement, although it is definitely possible. I'd prefer a solution that is easier to write and maintain first.
An advanced approach using domain knowledge
Talking to a more experienced player, I learned that there are some rules that can be used. For instance, a set of 3 tiles never needs to be broken up, as that would never decrease the shanten number. It may, however, be used in different ways (say, either for a 111 or a 123 combination).
Enumerate all possible 3-sets and create a new simulation for each of them. Remove the 3-set. Now create all 2-sets in the resulting hand and simulate for every tile that improves them to a 3-set. At the same time, simulate for any of the 1-sets being removed. Keep doing this until all 3- and 2-sets are gone. There should be a 1-set (that is, a single tile) left in the end.
Learnings from implementation and final algorithm
I implemented the above algorithm. For easier understanding I wrote it down in pseudocode:
Remove completed 3-sets
If removed, return (i.e. do not simulate NOT taking the 3-set later)
Remove 2-set by looping through discarding any other tile (this creates a number of branches in the simulation)
If removed, return (same as earlier)
Use the number of left-over single tiles to calculate the shanten number
By the way, this is actually very similar to the approach I take when calculating the number myself, and it obviously never yields too high a number.
This works very well for almost all cases. However, I found that sometimes the earlier assumption ("removing already completed 3-sets is NEVER a bad idea") is wrong. Counter-example: 23566M 25667P 159S. The important part is the 25667. By removing a 567 3-set we end up with a left-over 6 tile, leading to 5-shanten. It would be better to use two of the single tiles to form 56x and 67x, leading to 4-shanten overall.
To fix this, we simply have to remove the wrong optimization, leading to this code:
Remove completed 3-sets
Remove 2-set by looping through discarding any other tile
Use the number of left-over single tiles to calculate the shanten number
I believe this always accurately finds the smallest shanten number, but I don't know how to prove that. The time taken is in a "reasonable" range (on my machine 10 seconds max, usually 0 seconds).
The final point is calculating the shanten number from the number of left-over single tiles. First of all, it is obvious that the number is of the form 3*n+1 (because we started out with 13 tiles and always subtracted 3 tiles).
If there is 1 tile left, we're tenpai already (we're just waiting for the final pair). With 4 tiles left, we have to discard 2 of them to form a 3-set, leaving us with a single tile again. This leads to 2 additional discards. With 7 tiles, we have 2 times 2 discards, adding 4. And so on.
This leads to the simple formula shanten_added = (number_of_singles - 1) * (2/3).
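A minimal sketch of that formula, assuming the count of left-over singles really is of the form 3*n+1 as argued above. The function name shanten_from_singles is made up for illustration.

```python
def shanten_from_singles(number_of_singles: int) -> int:
    """shanten_added = (number_of_singles - 1) * (2/3), in integer arithmetic."""
    if number_of_singles < 1 or (number_of_singles - 1) % 3 != 0:
        # The formula is only defined for 1, 4, 7, 10, 13 left-over tiles.
        raise ValueError("expected a count of the form 3*n + 1")
    return (number_of_singles - 1) * 2 // 3


for singles in (1, 4, 7, 10, 13):
    print(singles, "singles ->", shanten_from_singles(singles))
```

With 13 singles this gives 8, the maximum mentioned earlier. A count such as 11 is rejected, because the formula would not produce an integer for it.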
The described algorithm works well and passed all my tests, so I'm assuming it is correct. As stated, I can't prove it though.
Since the algorithm removes the most likely tile combinations first, it has a kind of built-in optimization. With a simple added check, if (current_depth > best_shanten) then return;, it does very well even for high shanten numbers.
My best guess would be an A* inspired approach. You need to find some heuristic which never overestimates the shanten number and use it to search the brute-force tree only in the regions where it is possible to get into a ready state quickly enough.
Correct algorithm sample: syanten.cpp
Recursively cut forms out of the hand in this order: sets, pairs, incomplete forms, and count each kind. Do this in all variations. The result is the minimal shanten value over all variants:
Shanten = Min(Shanten, 8 - Sets * 2 - Pairs - IncompleteForms)
A C# sample (rewritten from C++) can be found here (in Russian).
I've done a little bit of thinking and came up with a slightly different formula than mafu's. First of all, consider a hand (a very terrible hand):
1s 4s 6s 1m 5m 8m 9m 9m 7p 8p West East North
By using mafu's algorithm all we can do is cast out a pair (9m,9m). Then we are left with 11 singles. Now if we apply mafu's formula we get (11-1)*2/3 which is not an integer and therefore cannot be a shanten number. This is where I came up with this:
N = ( (S + 1) / 3 ) - 1
N stands for shanten number and S for score sum.
What is a score? It's the number of tiles you need to make an incomplete set complete. For example, if you have (4,5) in your hand you need either 3 or 6 to make it a complete 3-set, that is, only one tile. So this incomplete pair gets score 1. Accordingly, (1,1) needs only one more 1 to become a 3-set. Any single tile obviously needs 2 tiles to become a 3-set and gets score 2. Any complete set of course gets score 0. Note that we ignore the possibility of singles becoming pairs. Now if we try to find all of the incomplete sets in the above hand we get:
(4s,6s) (8m,9m) (7p,8p) 1s 1m 5m 9m West East North
Then we count the sum of its scores = 1*3+2*7 = 17.
Now if we apply this number to the formula above we get (17+1)/3 - 1 = 5 which means this hand is 5-shanten. It's somewhat more complicated than Alexey's and I don't have a proof but so far it seems to work for me. Note that such a hand could be parsed in the other way. For example:
(4s,6s) (9m,9m) (7p,8p) 1s 1m 5m 8m West East North
However, it still gets a score sum of 17 and is 5-shanten according to the formula. I can't prove this either, and it is a little more complicated than Alexey's formula, but it introduces scores that could perhaps be applied to something else.
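The score formula can be sketched as follows. This is only an illustration under stated assumptions: the hand is already split into groups by hand, each group is written as a tuple of tile names, and a two-tile group is assumed to be a valid incomplete set. The names score and shanten_from_scores are made up.

```python
def score(group: tuple) -> int:
    """Tiles still needed to complete a 3-set: single 2, incomplete pair 1, set 0."""
    return 3 - len(group)


def shanten_from_scores(groups) -> int:
    """N = ((S + 1) / 3) - 1, where S is the sum of all group scores."""
    total = sum(score(g) for g in groups)
    if (total + 1) % 3 != 0:
        raise ValueError("score sum does not fit the formula")
    return (total + 1) // 3 - 1


first_parse = [("4s", "6s"), ("8m", "9m"), ("7p", "8p"),
               ("1s",), ("1m",), ("5m",), ("9m",),
               ("West",), ("East",), ("North",)]
second_parse = [("4s", "6s"), ("9m", "9m"), ("7p", "8p"),
                ("1s",), ("1m",), ("5m",), ("8m",),
                ("West",), ("East",), ("North",)]

print(shanten_from_scores(first_parse))
print(shanten_from_scores(second_parse))
```

Both ways of parsing the example hand have a score sum of 17 and therefore the same result, as described above.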
Take a look here: ShantenNumberCalculator. It calculates shanten really fast. There is also some related material (in Japanese, but with code examples) at http://cmj3.web.fc2.com
The essence of the algorithm: cut out all pairs, sets and unfinished forms in ALL possible ways, and thereby find the minimum value of the number of shanten.
The maximum value of the shanten for an ordinary hand: 8.
That is, we have the beginnings of 4 sets and one pair, but only one tile of each (total 13 - 5 = 8).
Accordingly, a pair reduces the shanten number by one, and two neighboring tiles isolated from the rest (a preset) also reduce it by one;
a complete set (3 identical or 3 consecutive tiles) reduces the shanten number by 2, since two suitable tiles have joined an isolated tile.
Shanten = 8 - Sets * 2 - Pairs - Presets
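The idea of cutting the hand in all possible ways and taking the minimum can be sketched like this. It is not the code from the linked resources, just a minimal illustration for ordinary hands. The hand notation ('m', 'p', 's' for the colors, 'z' for honors) and the names parse_hand and shanten are assumptions. The cap in value() (one pair plus at most four other blocks) is also an assumption that the text above does not state.

```python
def parse_hand(text):
    """Turn e.g. '23566m 25667p 159s' into 34 tile counts (z = honors 1-7)."""
    counts = [0] * 34
    for chunk in text.split():
        base = "mpsz".index(chunk[-1]) * 9
        for rank in chunk[:-1]:
            counts[base + int(rank) - 1] += 1
    return counts


def shanten(counts):
    """Minimum of 8 - sets*2 - pairs - presets over all ways to cut the hand."""
    counts = list(counts)
    best = 8

    def value(sets, pairs, presets):
        # Assumed cap: surplus pairs/presets beyond 1 pair + 4 blocks don't count.
        head = 1 if pairs else 0
        others = min(pairs - head + presets, 4 - sets)
        return 8 - sets * 2 - head - others

    def take(tiles, i, sets, pairs, presets):
        # Remove the tiles, search the rest of the hand, then put them back.
        for t in tiles:
            counts[t] -= 1
        search(i, sets, pairs, presets)
        for t in tiles:
            counts[t] += 1

    def search(i, sets, pairs, presets):
        nonlocal best
        while i < 34 and counts[i] == 0:
            i += 1
        if i == 34:  # every tile has been assigned
            best = min(best, value(sets, pairs, presets))
            return
        if counts[i] >= 3:
            take([i, i, i], i, sets + 1, pairs, presets)
        if counts[i] >= 2:
            take([i, i], i, sets, pairs + 1, presets)
        if i < 27:  # honors (index 27 and up) cannot form runs
            rank = i % 9
            has_next = rank <= 7 and counts[i + 1] > 0
            has_next2 = rank <= 6 and counts[i + 2] > 0
            if has_next and has_next2:
                take([i, i + 1, i + 2], i, sets + 1, pairs, presets)
            if has_next:
                take([i, i + 1], i, sets, pairs, presets + 1)
            if has_next2:
                take([i, i + 2], i, sets, pairs, presets + 1)
        take([i], i, sets, pairs, presets)  # treat one copy as isolated

    search(0, 0, 0, 0)
    return best


print(shanten(parse_hand("23566m 25667p 159s")))
print(shanten(parse_hand("15899m 78p 146s 134z")))
```

The two hands printed are the ones discussed earlier in this thread (with East, West, North written as 1z, 3z, 4z). Special hands such as seven pairs are not covered by this sketch.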
Determining whether your hand is already in tenpai sounds like a multi-knapsack problem. Greedy algorithms won't work - as Dialecticus pointed out, you'll need to consider the entire problem space.

Resources