Finding the longest Common Subsequence from a table - algorithm

I have two input strings for the LCS problem:
1: ABCDGH
2:AEDFHR
The following table represents the Dynamic Programming Solution to the bottom-up table for the length of the LCS:
Based on the method provided in this video, when trying to find the actual letters in the LCS, you start from the end of the table and go backwards. If the cells to the left and right aren't the same as the current one and the cell diagonal is one less, then you know the character in the current column is included and you move back diagonally. Otherwise you either move to the left or the right.
Following that approach, you would have this sequence of movements (H,R), (H,H),and then to (F,G). But once you get there, how would the algorithm decide where to go next? It seems that it should go left as that would lead to 'D' being included in the LCS from the next column to the left, but the cells to the left, right, and diagonal of (F,G) all have values of 2 and the cell to the diagonal isn't one less. So what should the logic in the algorithm be in cases where you have a cell surrounded by the same value?

Dynamic programming problems often have multiple optimal solutions. When two or three adjacent cells have the same value, they are equally good, and if that value is also the best one among the adjacent cells, jumping to either one of them will lead to one of the optimal solutions. (Note that your problem statement might impose additional constraints, such as "if there are multiple optimal solutions, pick the one where the last substitution is as early as possible".)

Related

Auto align elements in table

I need to make an algorithm which will align elements in table by using smallest possible number of rows.
Elements that need to bee sorted should keep the horizontal position/alignment
Like this:
[
I hope someone already did this.
Thanks!
To clarify: I assume that you mean that each item must fit on one row, and one row only (it cannot break up into the next row), but that they can move horizontally.
Heuristically/naively, I would do it like this:
Sort the elements by length.
Try to fill the first free row by naively picking (from longest to shortest) items until the row is full or no more matching elements can be found.
Repeat until all elements are done.
This will finish (relatively) quickly (somewhere between O(nlogn) and O(n^2) depending on heuristic "shortcuts") but leave more holes than necessary and turn up otherwise non-optimal solutions.
I'd wager this problem is equivalent to one of the classical NP-complete problems https://en.wikipedia.org/wiki/Karp%27s_21_NP-complete_problems , so you likely will not find a good practical non-heuristic solution.

Fast method of tile fitting/arranging

For example, let's say we have a bounded 2D grid which we want to cover with square tiles of equal size. We have unlimited number of tiles that fall into a defined number of types. Each type of tile specifies the letters printed on that tile. Letters are printed next to each edge and only the tiles with matching letters on their adjacent edges can be placed next to one another on the grid. Tiles may be rotated.
Given the size of the grid and tile type definitions, what is the fastest method of arranging the tiles such that the above constraint is met and the entire/majority of the grid is covered? Note that my use case is for large grids (~20 in each dimension) and medium-large number of solutions (unlike Eternity II).
So far, I've tried DFS starting in the center and picking the locations around filled area that allow the least number of possibilities and backtrack in case no progress can be made. This only works for simple scenarios with one or two types. Any more and too much backtracking ensues.
Here's a trivial example, showing input and the final output:
This is a hard puzzle.
The Eternity 2 was a puzzle of this form with a 16 by 16 square grid.
Despite a 2 million dollar prize, no one found the solution in several years.
The paper "Jigsaw Puzzles, Edge Matching, and Polyomino Packing: Connections and Complexity" by Erik D. Demaine, Martin L. Demaine shows that this type of puzzle is NP-complete.
Given a problem of this sort with a square grid I would try a brute force list of all possible columns, and then a dynamic programming solution across all of the rows. This will find the number of solutions, and with a little extra work can be used to generate all of the solutions.
However if your rows are n long and there are m letters with k tiles, then the brute force list of all possible columns has up to mn possible right edges with up to m4k combinations/rotations of tiles needed to generate it. Then the transition from the right edge of one column to the right edge of the next next potentially has up to m2n possibilities in it. Those numbers are usually not worst case scenarios, but the size of those data structures will be the upper bound on the feasibility of that technique.
Of course if those data structures get too large to be feasible, then the recursive algorithm that you are describing will be too slow to enumerate all solutions. But if there are enough, it still might run acceptably fast even if this approach is infeasible.
Of course if you have more rows than columns, you'll want to reverse the role of rows and columns in this technique.

Algorithm for global multiple sequence alignment using only indels

I'm writing a Sublime Text script to align several lines of code. The script takes each line, splits it by a predefined set of delimiters (,;:=), and rejoins it with each segment in a 'column' padded to the same width. This works well when all lines have the same set of delimiters, but some lines may have extra segments, an optional comma at the end, and so forth.
My idea is to come up with a canonical list of delimiters. Specifically, given several strings of delimiters, I would like to find the shortest string that can be formed from any of the given strings using only insertions, with ties broken in some sensible manner. After some research, I learned that this is the well-known problem of global multiple sequence alignment, except that there are no mismatches, only matches and indels.
The dynamic programming approach, unfortunately, is exponential in the number of strings - at least in the general case. Is there any hope for a faster solution when mismatches are disallowed?
I'm a little hesitant to make a blanket statement that there is no such hope, even when mismatches are disallowed, but I'm pretty sure that there isn't. Here's why.
The size of the dynamic programming table generated when doing sequence alignment is approximately (string length)^(number of strings), hence the exponential run-time/space requirement. To give you a feel of where that comes from, here's an example with two strings, ABC and ACB, each of length 3. This gives us a 3x3 table:
A B C
A 0 1 2
C 1 1 1
B 2 1 2
We initialize this table starting from the upper left and working our way down to the lower right from there. The total cost to get to any location in the table is given by the number at that location (for simplicity, I'm assuming that insertions, deletions, and substitutions all have a cost of 1). The operation used to get to a given location is given by the direction that you moved from the previous value. Moving to the right means you are inserting elements from the top string. Moving down inserts elements from the sideways string. Moving diagonally means you are aligning elements from the top and bottom. If these elements don't match, then this represents a substitution and you increase the cost to get there.
And that's the problem. Saying mismatches aren't allowed doesn't rule out the operations that are responsible for the length and height of the table (insertions/deletions). Worse, disallowing mismatches doesn't even rule out a potential move. Diagonal movements in the table are still possible sometimes, just not when the two elements don't match. Plus, you still need to check to see if the elements match, so you're basically still considering that move. As a result, this shouldn't be able to improve your worst case time and seems unlikely to have a substantial effect on your average or best case time either.
On the bright side, this is a pretty important problem in bioinformatics, so people have come up with some solutions. They have their flaws, but may work well-enough for your case (particularly since it seems likely that you'll be less likely to have spurious alignments than you would with DNA, given that your strings are not-composed of a four-letter alphabet). So take a look at Star Alignment and Neighbor Joining.

Generating minimal/irreducible Sudokus

A Sudoku puzzle is minimal (also called irreducible) if it has a unique solution, but removing any digit would yield a puzzle with multiple solutions. In other words, every digit is necessary to determine the solution.
I have a basic algorithm to generate minimal Sudokus:
Generate a completed puzzle.
Visit each cell in a random order. For each visited cell:
Tentatively remove its digit
Solve the puzzle twice using a recursive backtracking algorithm. One solver tries the digits 1-9 in forward order, the other in reverse order. In a sense, the solvers are traversing a search tree containing all possible configurations, but from opposite ends. This means that the two solutions will match iff the puzzle has a unique solution.
If the puzzle has a unique solution, remove the digit permanently; otherwise, put it back in.
This method is guaranteed to produce a minimal puzzle, but it's quite slow (100 ms on my computer, several seconds on a smartphone). I would like to reduce the number of solves, but all the obvious ways I can think of are incorrect. For example:
Adding digits instead of removing them. The advantage of this is that since minimal puzzles require at least 17 filled digits, the first 17 digits are guaranteed to not have a unique solution, reducing the amount of solving. Unfortunately, because the cells are visited in a random order, many unnecessary digits will be added before the one important digit that "locks down" a unique solution. For instance, if the first 9 cells added are all in the same column, there's a great deal of redundant information there.
If no other digit can replace the current one, keep it in and do not solve the puzzle. Because checking if a placement is legal is thousands of times faster than solving the puzzle twice, this could be a huge time-saver. However, just because there's no other legal digit now doesn't mean there won't be later, once we remove other digits.
Since we generated the original solution, solve only once for each cell and see if it matches the original. This doesn't work because the original solution could be anywhere within the search tree of possible solutions. For example, if the original solution is near the "left" side of the tree, and we start searching from the left, we will miss solutions on the right side of the tree.
I would also like to optimize the solving algorithm itself. The hard part is determining if a solution is unique. I can make micro-optimizations like creating a bitmask of legal placements for each cell, as described in this wonderful post. However, more advanced algorithms like Dancing Links or simulated annealing are not designed to determine uniqueness, but just to find any solution.
How can I optimize my minimal Sudoku generator?
I have an idea on the 2nd option your had suggested will be better for that provided you add 3 extra checks for the 1st 17 numbers
find a list of 17 random numbers between 1-9
add each item at random location provided
these new number added dont fail the 3 basic criteria of sudoku
there is no same number in same row
there is no same number in same column
there is no same number in same 3x3 matrix
if condition 1 fails move to the next column or row and check for the 3 basic criteria again.
if there is no next row (ie at 9th column or 9th row) add to the 1st column
once the 17 characters are filled, run you solver logic on this and look for your unique solution.
Here are the main optimizations I implemented with (highly approximate) percentage increases in speed:
Using bitmasks to keep track of which constraints (row, column, box) are satisfied in each cell. This makes it much faster to look up whether a placement is legal, but slower to make a placement. A complicating factor in generating puzzles with bitmasks, rather than just solving them, is that digits may have to be removed, which means you need to keep track of the three types of constraints as distinct bits. A small further optimization is to save the masks for each digit and each constraint in arrays. 40%
Timing out the generation and restarting if it takes too long. See here. The optimal strategy is to increase the timeout period after each failed generation, to reduce the chance that it goes on indefinitely. 30%, mainly from reducing the worst-case runtimes.
mbeckish and user295691's suggestions (see the comments to the original post). 25%

3x3 grid puzzle solving (JS)

I have an image separated into a 3x3 grid. The grid is represented by an array. Each column or row can be rotated through. E.g, the top row [1,2,3] could become [3,1,2] etc.
The array needs to end up as:
[1,2,3]
[4,5,6]
[7,8,9]
And would start from something like:
[5,3,9]
[7,1,4]
[8,6,2]
It will always be solvable, so this doesn't need to be checked.
I've tried a long handed approach of looking for '1' and moving left then up to its correct place, and so on for 2, 3,... but end up going round in circles for ever.
Any help would be appreciated, even if you can just give me a starting point/reference... I can't seem to think through this one.
your problem is that the moves to shift one value will mess up others. i suspect with enough set theory you can work out an exact solution, but here's a heuristic that has more chance of working.
first, note that if every number in a row belongs to that row then it's either trivial to solve, or some values are swapped. [2,3,1] is trivial, while [3,2,1] is swapped, for example.
so an "easier" target than placing 1 top left is to get all rows into that state. how might we do that? let's look at the columns...
if the column contains one number from each row, then we are in a similar state to above (it's either trivial to shift so numbers are in the correct rows, or it's swapped).
so, what i would suggest is:
for column in columns:
if column is not one value from each row:
pick a value from column that is from a duplicate row
rotate that row
for column in columns:
as well as possible, shift until each value is in correct row
for row in rows:
as well as possible, shift until each value is in correct column
now, that is not guaranteed to work, although it will tend to get close, and can solve some set of "almost right" arrangements.
so what i would then do is put that in a loop and, on each run, record a "hash" of the state (for example, a string containing the values read row by row). and then on each invocation if i detect (by checking if the hash was one we had seen already) that the state has already occurred (so we are repeating ourselves) i would invoke a "random shuffle" that mixes things up.
so the idea is that we have something that has a chance of working once we are close, and a shuffle that we resort to when that gets stuck in a loop.
as i said, i am sure there are smarter ways to do this, but if i were desperate and couldn't find anything on google, that's the kind of heuristic i would try... i am not even sure the above is right, but the more general tactic is:
identify something that will solve very close solutions (in a sense, find out where the puzzle is "linear")
try repeating that
shuffle if it repeats
and that's really all i am saying here.
Since the grid is 3x3, you can not only find the solution, but find the smallest number of moves to solve the problem.
You would need to use Breadth First Search for this purpose.
Represent each configuration as a linear array of 9 elements. After each move, you reach a different configuration. Since the array is essentially a permutation of numbers between 1-9, there would be only 9! = 362,880 different configurations possible.
If we consider each configuration as a node, and making each move is considered as taking an edge, we can explore the entire graph in O(n), where n is the number of configurations. We need to make sure, that we do not re-solve a configuration which we have already seen before, so you would need a visited array, which marks each configuration visited as it sees it.
When you reach the 'solved' configuration, you can trace back the moves taken by using a 'parent' array, which stores the configuration you came from.
Also note, if it had been a 4x4 grid, the problem would have been quite intractable, since n would equal (4x4)! = 16! = 2.09227899 × 10^13. But for smaller problems like this, you can find the solution pretty fast.
Edit:
TL;DR:
Guaranteed to work, and pretty fast at that. 362,880 is a pretty small number for today's computers
It will find the shortest sequence of moves.

Resources