Find optimal local alignment of two strings using local & global alignments - algorithm

I have a homework question that I have been trying to solve for many hours without success; maybe someone can guide me to the right way of thinking about it.
The problem:
We want to find an optimal local alignment of two strings S1 and S2. We know that there exists such an alignment in which the two aligned substrings of S1 and S2 both have length at most q.
We also know that the number of table cells holding the maximal value, opt, is at most r.
Describe an algorithm solving the problem in time O(mn + r*q^2) using working space of at most O(n + r + q^2).
Restrictions: you may run the algorithm that finds the optimal local alignment value, with additions of your choice (like the list of index pairs), only once. However, you may run any variant of the algorithm for solving the optimal global alignment problem as many times as you wish.
I know how to solve this problem by running the local alignment many times and the global alignment only once, but not the other way around.
(The question attached the standard global and local alignment algorithms as images; they are omitted here.)
Any help would be appreciated.

The answer, in case someone is interested in this question in the future:
1. Compute the optimal local alignment score OPT of the two strings in $O(mn)$ time and $O(n)$ space by maintaining just a single row of the DP matrix. (Since we are only computing the score and don't need to perform a traceback to build the full solution, we don't need to keep the full DP matrix.) As you do so, keep track of the highest cell value seen so far, along with a list of the coordinates $(i, j)$ of the cells having this value: whenever a new maximum is seen, clear the list and update the maximum; whenever a cell $\ge$ the current maximum is seen (including the case where we just saw a new maximum), add the coordinates of the current cell to the list. At the end, we have a list of the endpoints of all optimal local alignments; by assumption, there are at most $r$ of these. (A sketch of this pass appears below.)
2. For each entry $(i, j)$ in the list:
   - Set $R1$ to the reverse of the substring $S1[i-q+1..i]$, and $R2$ to the reverse of $S2[j-q+1..j]$.
   - Perform an optimal global alignment of $R1$ and $R2$ in $O(q^2)$ time, this time maintaining the full $O(q^2)$ DP matrix.
   - Search for the highest entry in the matrix (also $O(q^2)$ time; or this can be done during the previous step).
   - If this entry is OPT, we have found a solution: trace back towards the top-left corner from this cell to recover the full alignment, reverse it, output it, and stop.
By assumption, at least one of the alignments performed in step 2 reaches a score of OPT. (Note that reversing both strings does not change the score of an alignment.)
Step 2 iterates at most $r$ times and does $O(q^2)$ work each time, using at most $O(q^2)$ space, so overall the time and space bounds are met.
(A simpler approach that avoids reversing strings would be to simply perform local alignments of the length-$q$ substrings, but the restrictions appear to forbid this.)
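Below is a minimal Python sketch of step 1, assuming an illustrative match/mismatch/linear-gap scoring scheme (the actual scheme is whatever the course uses): score-only Smith-Waterman with a single DP row, collecting the endpoint list. The second pass would then be a standard full-matrix global alignment (Needleman-Wunsch) on each pair of reversed length-q substrings, tracing back from any cell that scores OPT.

```python
# Hedged sketch of step 1: score-only local alignment in O(mn) time and
# O(n) space, tracking the endpoints of all cells achieving the maximum.
# The scoring parameters are illustrative assumptions.

def local_alignment_endpoints(s1, s2, match=2, mismatch=-1, gap=-1):
    m, n = len(s1), len(s2)
    prev = [0] * (n + 1)                       # row i-1 of the DP matrix
    best, endpoints = 0, []
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            curr[j] = max(0,
                          prev[j - 1] + sub,   # diagonal ((mis)match)
                          prev[j] + gap,       # gap in s2
                          curr[j - 1] + gap)   # gap in s1
            if curr[j] > best:                 # new maximum: reset the list
                best, endpoints = curr[j], []
            if curr[j] == best and best > 0:   # record every cell at the max
                endpoints.append((i, j))
        prev = curr
    return best, endpoints

opt, ends = local_alignment_endpoints("GATTACA", "GCATGCU")
```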

How to efficiently diversify Dijkstra's algorithm (while preserving shortest path(s))?

Please look at the images and their descriptions below.
P.S. ignore the gray circular boundary (that's just max radius for debug testing).
Figure 1: No shuffling of branches. Branches are in order: Top, Left, Down, Right
Figure 2: Has branch shuffling: every time a node branches to its 4 potential children, the order is randomized.
So, as you can see, the four images have the same path length. The lower three are more diverse, and are preferred. Shuffling the order of the array at every branch seems a bit inefficient. Are there any ways to improve it?
My idea is that I could create a list of all the possible shuffles (since there are 4 elements, that should be 24 permutations, right?) and generate a random number to use as an index into that list.
Are there any alternatives? Or perhaps I should look into a different algorithm altogether?
P.S. This is for game development purposes, so diversity of paths is highly preferred.
Every time you calculate the path length to a node, before comparing it against that node's previous best length, add a small random number so that the calculated length lies between real_length and real_length + 0.5. This will randomize the choice between paths of equal length.
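A minimal sketch of that idea, assuming the graph is an adjacency map with integer (for example, unit) edge weights, so that a jitter below 0.5 can only reorder paths of equal real length; all names here are illustrative.

```python
import heapq
import random

# Hedged sketch: Dijkstra where each tentative length gets a one-off jitter
# in [0, 0.5). With integer edge weights, a strictly shorter path still
# always wins; ties between equal-length paths are broken randomly.

def jittered_dijkstra(graph, source):
    real = {source: 0.0}      # true path lengths
    key = {source: 0.0}       # jittered lengths used for all comparisons
    parent = {source: None}
    heap = [(0.0, source)]
    done = set()
    while heap:
        _, u = heapq.heappop(heap)
        if u in done:
            continue                          # stale heap entry
        done.add(u)
        for v, w in graph.get(u, []):         # graph: node -> [(nbr, weight)]
            cand_real = real[u] + w
            cand_key = cand_real + random.uniform(0.0, 0.5)
            if cand_key < key.get(v, float("inf")):
                real[v], key[v], parent[v] = cand_real, cand_key, u
                heapq.heappush(heap, (cand_key, v))
    return real, parent
```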

Mutually Overlapping Subset of Activities

I am prepping for a final and this was a practice problem. It is not a homework problem.
How do I go about attacking this? Also, more generally, how do I know when to use greedy vs. dynamic programming? Intuitively, I think this is a good place to use greedy. I'm also thinking that if I could somehow create an orthogonal line and "sweep" it, checking the number of intersections at each point and updating a global max, then I could just return the max at the end of the sweep. I'm not sure how to implement a plane sweep algorithmically, though.
a. We are given a set of activities I1 ... In; each activity Ii is represented by its left point Li and its right point Ri. Design a very efficient algorithm that finds the maximum size of a subset of mutually overlapping activities (write your solution in English, bullet by bullet).
b. Analyze the time complexity of your algorithm.
Proposed solution:
Ex set: {(0,2) (3,7) (4,6) (7,8) (1,5)}
Max is 3 from interval 4-5
1) Split start and end points into two separate arrays and sort them in non-decreasing order
Start points: [0,1,3,4,7] (SP)
End points: [2,5,6,7,8] (EP)
I know that I can use two pointers to sort of simulate the plane sweep, but I'm not exactly sure how. I'm stuck here.
I'd say your idea of a sweep is good.
You don't need to worry about a planar sweep; just use the start/end points. Put them in a queue (or merge the two sorted arrays). In every step, take the smaller element from the front: if it's a start point, increment the current task count; otherwise, decrement it. The answer is the highest count reached, as sketched below.
Since you don't need to point out which tasks are overlapping, just the count of them, you don't need to worry about the specific task durations.
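A minimal sketch of that sweep over the two sorted arrays from the question. The end-before-start tie-break used here treats touching intervals as non-overlapping; flip it if touching should count as overlap.

```python
# Hedged sketch: merge the sorted start and end points, keeping a running
# count of open intervals and the maximum count ever seen.

def max_overlap(intervals):
    starts = sorted(l for l, r in intervals)
    ends = sorted(r for l, r in intervals)
    i = j = count = best = 0
    while i < len(starts):
        if starts[i] < ends[j]:   # an interval opens before the next close
            count += 1
            best = max(best, count)
            i += 1
        else:                     # an interval closes (ends win ties)
            count -= 1
            j += 1
    return best

print(max_overlap([(0, 2), (3, 7), (4, 6), (7, 8), (1, 5)]))  # -> 3
```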
Regarding your greedy vs. DP question: in my non-professional opinion, greedy may not always provide a valid answer, whereas DP only works for problems that can be divided into smaller subproblems well. I wouldn't call your sweep solution either one.

Approximated closest pair algorithm

I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
In this example, I don't need to evaluate dist(B,C) or dist(A,D): by the triangle inequality, dist(B,C) >= dist(A,C) - dist(A,B) = 4.5 and dist(A,D) >= dist(A,C) - dist(C,D) = 3, so both are guaranteed to be greater than the current known minimum distance of 0.5.
Is it possible to use this kind of information to reduce the O(n²) to something like O(nlogn)?
Is it possible to reduce the cost to something close to O(nlogn) if I accept a kind of approximate solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution as the number of reinforcements goes to infinity, but provides a good approximation for small n.
Processing time (measured by big-O notation) is not the only issue; keeping a very large number of previously calculated distances can also be a problem.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or something related. I have been just thinking about this problem.
I suggest using ideas derived from quickly solving k-nearest-neighbour searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation), so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result to be, you may be able to just search the HashMap with the largest hashMap.size()/radius (because that HashMap contains the highest density of points and is thus a good search candidate).
Good Luck
If you only have the sampled distances, not the original point locations in a plane you can operate on, then I suspect you are bounded at O(E).
Specifically, it would seem from your description that any valid solution would need to inspect every edge in order to rule out its having something interesting to say; meanwhile, inspecting every edge and taking the smallest solves the problem.
Planar versions get below O(V^2) by using planar distances to deduce limitations on sets of edges, allowing us to avoid looking at most of the edge weights.
Use the same idea as in space partitioning: recursively split the given set of points by choosing two points and dividing the set into two parts, points that are closer to the first point and points that are closer to the second. That is the same as splitting the points by a line passing between the two chosen points.
This produces a (binary) space partitioning on which standard nearest-neighbour search algorithms can be used.
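A minimal sketch of that partitioning, together with an approximate closest-pair scan over the resulting buckets; the leaf size, the Euclidean metric, and the random pivot choice are all illustrative assumptions.

```python
import random

def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Recursively split the point set by which of two randomly chosen pivots
# each point is closer to, stopping at small buckets.
def build_buckets(points, dist=euclid, leaf_size=16):
    if len(points) <= leaf_size:
        return [points]
    a, b = random.sample(points, 2)
    left = [p for p in points if dist(p, a) <= dist(p, b)]
    right = [p for p in points if dist(p, a) > dist(p, b)]
    if not left or not right:               # degenerate split; stop here
        return [points]
    return (build_buckets(left, dist, leaf_size) +
            build_buckets(right, dist, leaf_size))

# Brute-force each bucket; approximate because the true closest pair may
# straddle two buckets.
def approx_closest_pair(points, dist=euclid):
    best = (float("inf"), None, None)
    for bucket in build_buckets(points, dist):
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                d = dist(bucket[i], bucket[j])
                if d < best[0]:
                    best = (d, bucket[i], bucket[j])
    return best
```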

Is my heuristic algorithm correct? (Sudoku solver)

First off, yes, this IS homework, but it's primarily a theoretical question rather than a practical one. I am simply asking for confirmation that I am thinking correctly, or hints if I am not.
I have been asked to write a simple Sudoku solver (in Prolog, but that is not so important right now) with the only limitation being that it must utilize a heuristic function using the Best-First algorithm. The only heuristic function I have been able to come up with is explained below:
1. Select an empty cell.
1a. If there are no empty cells and there is a solution, return the solution; else return No.
2. Find all possible values the cell can hold. %% It can't take values currently assigned to cells in the same line/column/box.
3. Assign each of those values a heuristic number, starting from 1.
4. Pick the value whose heuristic number is the lowest and which you haven't checked yet.
4a. If there are no more values, return No.
5. If a solution is not found, go to 1; else return the solution.
// I am sorry for any errors in this "pseudocode." If you want any clarification, let me know.
So, am I doing this right, or is there another way around it and mine is wrong?
Thanks in advance.
The heuristic I would use is this (a sketch follows the list):
1. Repeatedly find any empty spaces where there is only one possible number you can insert. Fill them with the number 1-9 that fits.
2. If every empty space has two or more possibilities, push the game state onto a stack, then pick a random square to fill in with a random value.
3. Go to step 1.
4. If you manage to fill every square, you've found a valid solution.
5. If you get to a point where there are no valid options, pop the last game state off the stack (i.e. backtrack to the last time you made a random choice), make a different choice, and try again.
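A hedged Python sketch of this strategy. It picks the first undecided cell deterministically rather than a random square, and uses the recursion stack plus a saved copy of the grid in place of an explicit stack of game states.

```python
# grid is a 9x9 list of lists with 0 marking an empty cell.

def candidates(grid, r, c):
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [v for v in range(1, 10) if v not in used]

def solve(grid):
    # Step 1: repeatedly fill cells that have exactly one candidate.
    changed = True
    while changed:
        changed = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    cands = candidates(grid, r, c)
                    if len(cands) == 1:
                        grid[r][c] = cands[0]
                        changed = True
    empties = [(r, c) for r in range(9) for c in range(9) if grid[r][c] == 0]
    if not empties:
        return True                        # every square filled: solved
    r, c = empties[0]
    cands = candidates(grid, r, c)
    if not cands:
        return False                       # dead end: backtrack
    saved = [row[:] for row in grid]       # "push the game state"
    for v in cands:                        # try a choice; on failure, restore
        grid[r][c] = v
        if solve(grid):
            return True
        for i in range(9):
            grid[i][:] = saved[i][:]
    return False
```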
As an interesting side note, you've been told to do this using a greedy heuristic approach, but Sudoku can also be reduced to a boolean satisfiability (SAT) problem and solved using a general-purpose SAT solver. This is very elegant and can actually be faster than a heuristic approach.
When I wrote a Sudoku solver myself in Prolog, the algorithm I used was the following:
1. Filter out the cells already solved (i.e. the given values at the start).
2. For each cell, build a list containing all its neighbours (that's 20 cells).
3. For each cell, build a list containing all the possible values it can take (easy to do once the above is done).
4. In the list containing all the cells to solve, put one with the minimum number of available values on top.
5. If the list is empty, you have a solution. Otherwise, if the top cell has 0 remaining possibilities, go to 7; else go to 6.
6. For the cell at the top of the list: pick a random number from the possible values of the cell. Remove this value from the possible values of its neighbours. Go to 5.
7. Backtrack (i.e. fail, in Prolog).
This algorithm always sorts the "most solved" cell first and detects failure early enough. It reduces solving time quite a lot compared to an algorithm that solves a random cell.
What you have described is the Most Constrained Variable heuristic. It picks the cell that has the fewest possibilities and then branches recursively in depth, starting from that cell. This heuristic is extremely fast in depth-first search algorithms because it detects collisions early, near the root, while the search tree is still small.
Here is an implementation of the Most Constrained Variable heuristic in C#: Exercise #2: Sudoku Solver
That text also contains an analysis of the total number of visits to Sudoku cells by this algorithm; it is surprisingly small. It almost looks like the heuristic solves Sudoku on the first try.
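For illustration, a minimal sketch of the Most Constrained Variable choice itself, reusing the candidates() helper from the earlier sketch; picking this cell first is the only change needed to turn the backtracking solver above into the heuristic described here.

```python
# Among the empty cells, branch on the one with the fewest remaining
# candidates; returns None when the grid is full.

def most_constrained_cell(grid):
    empties = [(r, c) for r in range(9) for c in range(9) if grid[r][c] == 0]
    if not empties:
        return None
    return min(empties, key=lambda rc: len(candidates(grid, *rc)))
```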

Sorting a list of numbers with modified cost

First, this was one of the four problems we had to solve in a project last year, and I couldn't find a suitable algorithm, so we handed in a brute-force solution.
Problem: The numbers are in an unsorted list that supports only one type of operation. The operation is defined as follows:
Given a position i and a position j, the operation moves the number at position i to position j without altering the relative order of the other numbers. If i > j, the positions of the numbers between positions j and i-1 increase by 1; otherwise, if i < j, the positions of the numbers between positions i+1 and j decrease by 1. The operation requires i steps to find the number to move and j steps to locate the position to move it to, so the number of steps required to move a number from position i to position j is i+j.
We need to design an algorithm that, given a list of numbers, determines the optimal (in terms of cost) sequence of moves to rearrange the sequence.
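To make the cost model concrete, here is a small illustration in Python (positions are 1-based, as in the problem statement):

```python
# Move the element at 1-based position i to 1-based position j; the elements
# in between shift by one, and the move costs i + j steps.

def move(lst, i, j):
    lst = list(lst)
    x = lst.pop(i - 1)
    lst.insert(j - 1, x)
    return lst, i + j

print(move([5, 1, 2, 3, 4], 1, 5))  # ([1, 2, 3, 4, 5], 6)
```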
Attempts:
Part of our investigation was around NP-completeness: we turned it into a decision problem and tried to find a suitable transformation to any of the problems listed in Garey and Johnson's book Computers and Intractability, with no results. There is also no direct reference (from our point of view) to this kind of variation in Donald E. Knuth's The Art of Computer Programming, Vol. 3: Sorting and Searching. We also analyzed algorithms for sorting linked lists, but none of them gives a good idea for finding the optimal sequence of movements.
Note that the idea is not to find an algorithm that sorts the sequence, but one that tells me the optimal sequence of movements, in terms of cost, that organizes the sequence. You can make a copy and sort it to determine the final position of the elements if you want; in fact, we may assume that the list contains the numbers from 1 to n, so we know where we want to put each number. We are just concerned with minimizing the total cost of the steps.
We tested several greedy approaches, but all of them failed. Divide-and-conquer sorting algorithms can't be used because they swap portions of the list at no cost, and our dynamic programming approaches had to consider too many cases.
The brute-force recursive algorithm takes all the possible movements from i to j, and then again all the possible movements of the remaining elements; at the end, it returns the sequence with the least total cost that sorted the list. As you can imagine, the cost of this algorithm is brutal and makes it impractical for more than 8 elements.
Our observations:
n movements are not necessarily cheaper than n+1 movements (unlike swaps in arrays, which are O(1)).
There are basically two ways of moving an element from position i to position j: move it directly, or move other elements around i in such a way that it reaches position j.
You make at most n-1 movements (an untouched element reaches its position alone).
In an optimal sequence of movements, you never move the same element twice.
This problem looks like a good candidate for an approximation algorithm, but that would only give us a good-enough answer. Since you want the optimal answer, this is what I'd do to improve on the brute-force approach.
Instead of blindly trying every permutation, I'd use a backtracking approach that maintains the best solution found so far and prunes any branch that exceeds the cost of that best solution. I would also add a transposition table to avoid redoing searches on states that were already reached by previous branches using different move permutations.
I would also add a few heuristics to explore moves that are more likely to reach good results before other moves, for example preferring moves with a small cost first. I'd need to experiment before I could tell which heuristics, if any, work best.
I would also find the longest increasing subsequence of numbers in the original array (a standard sketch follows below). This gives us a set of numbers that don't need to be moved, which should considerably cut the number of branches we need to explore. It also greatly speeds up searches on lists that are almost sorted.
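For reference, a standard O(n log n) longest-increasing-subsequence routine that could drive that pruning step:

```python
import bisect

# tails[k] holds the smallest possible tail value of an increasing
# subsequence of length k+1; prev links let us reconstruct one LIS.

def longest_increasing_subsequence(seq):
    tails, tails_idx = [], []
    prev = [None] * len(seq)
    for i, x in enumerate(seq):
        k = bisect.bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
            tails_idx.append(i)
        else:
            tails[k], tails_idx[k] = x, i
        prev[i] = tails_idx[k - 1] if k > 0 else None
    out, i = [], tails_idx[-1] if tails_idx else None
    while i is not None:                   # walk predecessor links backwards
        out.append(seq[i])
        i = prev[i]
    return out[::-1]

print(longest_increasing_subsequence([5, 1, 2, 3, 4]))  # [1, 2, 3, 4]
```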
I'd expect these improvements to handle lists far larger than 8 elements, but when dealing with large lists of random numbers, I'd still prefer an approximation algorithm.
By popular demand (one person), this is what I'd do to solve this with a genetic algorithm (the metaheuristic I'm most familiar with).
First, I'd start by calculating the longest increasing subsequence of numbers (see above). Every item that is not part of that subsequence has to be moved; all we need to know now is in what order.
The genomes used as input for the genetic algorithm are simply arrays where each element represents an item to be moved; the order in which the items appear in the array is the order in which they have to be moved. The fitness function is the cost calculation described in the original question (sketched below).
We now have all the elements needed to plug the problem into a standard genetic algorithm. The rest is just tweaking. Lots and lots of tweaking.
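A minimal sketch of such a fitness function, assuming distinct values: it simulates the moves in genome order, placing each moved value into its correct slot relative to the current list contents, and sums the 1-based i+j costs defined in the question.

```python
# order is the genome: the values to move, in the order they are moved.

def move_cost(lst, order):
    work = list(lst)
    total = 0
    for value in order:
        i = work.index(value) + 1                   # 1-based source position
        work.pop(i - 1)
        j = sum(1 for x in work if x < value) + 1   # correct relative slot
        work.insert(j - 1, value)
        total += i + j
    return total

print(move_cost([5, 1, 2, 3, 4], [5]))  # 6: one move of cost 1 + 5
```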
