Algorithm for global multiple sequence alignment using only indels - algorithm

I'm writing a Sublime Text script to align several lines of code. The script takes each line, splits it by a predefined set of delimiters (,;:=), and rejoins it with each segment in a 'column' padded to the same width. This works well when all lines have the same set of delimiters, but some lines may have extra segments, an optional comma at the end, and so forth.
My idea is to come up with a canonical list of delimiters. Specifically, given several strings of delimiters, I would like to find the shortest string that can be formed from any of the given strings using only insertions, with ties broken in some sensible manner. After some research, I learned that this is the well-known problem of global multiple sequence alignment, except that there are no mismatches, only matches and indels.
The dynamic programming approach, unfortunately, is exponential in the number of strings - at least in the general case. Is there any hope for a faster solution when mismatches are disallowed?

I'm a little hesitant to make a blanket statement that there is no such hope, even when mismatches are disallowed, but I'm pretty sure that there isn't. Here's why.
The size of the dynamic programming table generated when doing sequence alignment is approximately (string length)^(number of strings), hence the exponential run-time/space requirement. To give you a feel of where that comes from, here's an example with two strings, ABC and ACB, each of length 3. This gives us a 3x3 table:
A B C
A 0 1 2
C 1 1 1
B 2 1 2
We initialize this table starting from the upper left and working our way down to the lower right from there. The total cost to get to any location in the table is given by the number at that location (for simplicity, I'm assuming that insertions, deletions, and substitutions all have a cost of 1). The operation used to get to a given location is given by the direction that you moved from the previous value. Moving to the right means you are inserting elements from the top string. Moving down inserts elements from the sideways string. Moving diagonally means you are aligning elements from the top and bottom. If these elements don't match, then this represents a substitution and you increase the cost to get there.
And that's the problem. Saying mismatches aren't allowed doesn't rule out the operations that are responsible for the length and height of the table (insertions/deletions). Worse, disallowing mismatches doesn't even rule out a potential move. Diagonal movements in the table are still possible sometimes, just not when the two elements don't match. Plus, you still need to check to see if the elements match, so you're basically still considering that move. As a result, this shouldn't be able to improve your worst case time and seems unlikely to have a substantial effect on your average or best case time either.
On the bright side, this is a pretty important problem in bioinformatics, so people have come up with some solutions. They have their flaws, but may work well-enough for your case (particularly since it seems likely that you'll be less likely to have spurious alignments than you would with DNA, given that your strings are not-composed of a four-letter alphabet). So take a look at Star Alignment and Neighbor Joining.

Related

Generate or find a shortest text given list of words

Let's say I have a list of 1000+ words and I would like to generate a text that includes these words from the list. I would like to use as few extra words outside of the list as possible. How would one tackle such a problem? Or alternatively, is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
I am not sure how you'd like the text to be generated, so I'll attempt to answer the second question:
Is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
This is obviously a computationally demanding endeavour so I'll assume you are alright with spending like a gig of RAM on this and some time (but maybe not too long). Since you are looking for the shortest continuous text which satisfies some condition, one can conclude the following:
If the text satisfies the condition, you want to shorten it.
If it doesn't, you want to make it longer so that hopefully it will start satisfying the condition.
Now, when it comes to the condition, it is whatever predicate that will say whether the continuous section of the text is "good enough" or not, based on some relatively simple statistics. For instance, the predicate could check if some cumulative index based on what ratio of the words from your list are included in the section, modified by the number of words from outside the list, is greater than some expected value.
What my mind races to when I see something like this is the sliding window technique, described in this article. I do not know if this is a good article, I did not take the time to read it, but scanning through it seems to be decent. It's also known as the caterpillar method, which is a particularly common name for it in Poland.
Basically, you have two pointers, a left pointer and a right pointer. In the case of looking for the shortest continuous fragment of a larger text, such that the fragment satisfies a condition and given that if a condition is met for a fragment, then it is met for a larger fragment containing the previous fragment, you advance the right pointer forward as long as the condition is unmet, and then once it is met, you advance the left pointer, until the condition isn't met. This repeats until either or both pointers reach the end of the text.
This is a neat technique, which allows you to iterate over the whole text exactly once, linearly. It is clearly desirable in your case to have an algorithm linear with respect to the length of the text.
Now, we have to consider the statistics you will be collecting. You will probably want to know how many words from the list, and how many words from outside of the list are present in a continuous fragment. An extra condition for these statistics is that they will need to be relatively easily modifiable (preferably in constant time, but that will be hard to achieve) every time one of the pointers advances.
In order to keep track of the words, we will use a hashmap of ordered sets of indeces. In Java these data structures are called HashMap and TreeSet, in C++ they're unordered_map and set. The keys to the hashmap will be strings representing words. The values will be sets of indices of where the words appear in the text. Note that lookup in a hashmap is linear relative to the length of the key, so we can assume constant as most words are like <10 characters long, and checking how many values in a set there are between two given values is logarithmic relative to the size of the set. So getting the number of times a word appears in a fragment of the text is easy and fast. Keeping track of whether a word exists in the given list or not can also be achieved with a hashmap (or a hashset).
So let's get back to the statistics. Say you want to keep track of the number of words from inside and from outside your list in a given fragment. This can be achieved very simply:
Every time you add a word to the fragment by advancing its right end, you check if it appears in the list in constant time and if so, you add one to the "good words" number, and otherwise, you add one to the "bad words" number.
Every time you remove a word from the fragment by advancing the left end, you do the same but you decrement the counters instead.
Now if you want to track how many unique words from inside and from outside the list there are in the fragment, every time you will need to check the number of times a given word exists in the fragment. We established earlier that this can be done logarithmically relative to the length of the fragment, so now the trick is simple. You only modify the counters if the number of appearances of a word in the fragment either
rose from 0 to 1 when advancing the right pointer, or
fell from 1 to 0 when advancing the left pointer.
Otherwise, you ignore the word, not changing the counters.
Additional memory optimisations include removing indices from the sets of indices when they are out of scope of the fragment and removing hashmap entries from the hashmap if a set of indices becomes empty.
It is now up to you to perhaps find a better heuristic, some other statistical values which you can easily track whatever it is you intend to check in your predicate. Although it is important that whenever a fragment meets your condition, a bigger fragment must meet it too.
In the case described above you could keep track of all the fragments which had at least... I don't know... 90% of the words from your list and from those choose the shortest one or the one with the fewest foreign words.

What kind of PRNG would match such scatter plots?

I've been given the challenge to find the seed from a series of pseudo-randomly generated alphanumerical IDs and after some analysis, I'm stuck in a dead end that I hope you'll be able to get me out of.
Each ID is obtained by passing the previous one through the encryption algorithm, that I'm supposed to reverse engineer in order to find the seed. The list given to me is composed of the 2070 first IDs (without the seed obviously). The IDs start as 4 alphanumerical characters, and switch to 5 after some time (e.g. "2xm1", "34nj", "avdfe", "2lgq9")
This switch happens once the algorithm, after encrypting an ID, returns an ID that has already been generated previously. At this point, it adds one character to this returned ID, making it longer and thus unique. It then proceeds as usual, generating IDs of the new length. This effectively means that the generation algorithm is surjective.
My first reflex was to try to convert those IDs from base36 to some other base, notably decimal. I used the results to scatter plot a chart of the IDs' decimal values in terms of their rank in the list, when I noticed a pattern that I couldn't understand the origin of.
After isolating the two parts of the list in terms of ID length, I scatter plotted the same graph for the 4-characters IDs sub-list and 5-characters IDs sub-list, allowing me to notice the strange density patterns.
After some analysis, I've observed 2 things :
For each sub-list, the delimitation between the 2 densities is 6x36^(n-1), n being the number of characters in the ID. In other terms, it is 1/6th of the entire range of values for a given ID length. The range of values is [0; (36^n)-1]
The repartition of those IDs in relation to this limit tends towards 50/50, half of them being above the 1/6th limit, half of them being under it.
I've tried to correlate such a behavior with other known PRNG scatter-plots, but none of them matched what I get on my graphs.
I'm hoping some of you might know about an encryption method, formula, or function matching such a specific scatter plot, or have any idea about what could be going on behind the scenes.
Thanks in advance for your answers.
This answer may not be very useful but I think it can help. the graph plot you shown is most likely that it doesn't belong to one of the most known PRNG used and of course it would never belong to cryptographic PRNG.
But I have a notice I dont know if it can help. This PRNG seems to have a full period equals to full cycle of numbers generated for a fixed character places. I mean it operate with a pattern for 4 digits then repeat pattern but with higher magnitude for 5 characters which will propably means that this same pattern of distribution will repeat for 6 characters but with higher magnitude.
So, in summery, this can mean that this pattern can be exploited if you know what is the value of this magnitude so you know the increments for 6 characters graph plot and then you can just stretch the 5 characters graph on the Y-Axis to get some kind of a solution (which would be the seed for 6 characters graph).
EDIT: To clear things more clearly regarding your comment. what I mean is that this PRNG generate random numbers but these random numbers would not be repeated to infinity instead there will be some point in time were the same sequence will be regenerated. The I've inadvertantly left behind a piece of information: confirm this since when it encounter same number generated before ( reached this point in time where same sequence is regenerated ). It will just add 1 extra character to the sequence which would not change the distribution on the graph but instead will make the graph appear like if it was stretched along Y-Axis (like if Y intercept of the graph function just got bigger).

Algorithm to query between which boundaries a number is in

I am trying to query what interval in a set of non-overlapping intervals a number is in.
I realise that there are numerous solutions to this problem out there already, but I believe that my case is yet again special and somewhat different.
I am dynamically adding and removing boundaries that segregate the interval from negative to positive infinity into sections. There are no holes in these sections, i.e. the set of boundaries/intervals is continuous. This means that when a boundary is removed, the sections below and above it are merged.
The boundaries are added in no particular order. See the following image for clarification:
In case the queried number is exactly on a boundary, I don't care which of the regions it belongs to, and want to return either one of them.
The number of queries may be considerably larger or considerably smaller than the number of boundaries.
A naive algorithm would run in O(n). Assuming that until the next query is made, the set of boundaries will have changed completely, this could be faster than using binary search trees, I think (?), but this does not seem like an optimal solution.
Is there some algorithm that fits this scenario?

When using dynamic programming, capturing the entire path for a min-sum?

I am trying to use the Viterbi min-sum algorithm which tries to find the pathway through a bunch of nodes that minimizes the overall Hamming distance (fancy term for "xor two numbers and count the resulting bits") against some fixed input.
I understand find how to use DP to compute the minimal distance overall, but I am having trouble using it to also capture the corresponding path that corresponds to the minimal distance.
It seems like memoizing the path at each node would be really memory-intensive. Is there a standard way to handle these kinds of problems?
Edit:
http://i.imgur.com/EugiEWG.jpg
Here is a sample trellis with what I am talking about. The general idea is to find the path through the trellis that most closely emulates the input bitstring, with minimal error (measured by minimizing overall Hamming distance, or the number of mismatched bits).
As you can see, the first chunk of my input string is 01, and I can traverse there in column 1 of the trellis. The next chunk is 10, and I can move there in column 2. Next chunk is 11. Fine so far. Next chunk is 10, which is a problem because I can't reach that state from where I am now, so I have to go to the next best thing (00) and the rest can be filled fine.
But this can become more complex. I'd need to be able to somehow get the corresponding path to the minimal Hamming distance.
(The point of this exercise is that the trellis represents what are ACTUALLY valid transitions, whereas the input string is something you receive through telecommunicationa and might get garbled and have incorrect bits here and there. This program tries to figure out what the input string SHOULD be by minimizing error).
There's the usual "follow path backwards" technique, requiring only the table of values (but the whole table of values, no cheating with "keep only the most recent part"). The algorithm is simple: start at the end, decide which way you came from. You can make that decision, because either there's exactly one way such that if you came from it you'd compute the value that matches the stored one, or several result in the same value and it wouldn't matter which one you chose.
Storing also a table of "back-pointers" doesn't take much space (about as much as the table of weights, but you can actually omit most of the table of weights if you do this), doing it that way allows you to have a much simpler backwards phase: just follow the pointers. That really is the path, just stored backwards.
You are correct that the immediate approach for calculating the paths, is space expensive.
This problem comes up often in DNA sequencing, where the cost is prohibitive. There are a number of ways to overcome it (see more here):
You can reduce up to a square root of the space if you are willing to double the execution time (see 2.1.1 in the link above).
Using a compressed tree, you can reduce one of the dimensions logarithmically (see 2.1.2 in the link above).

Generating minimal/irreducible Sudokus

A Sudoku puzzle is minimal (also called irreducible) if it has a unique solution, but removing any digit would yield a puzzle with multiple solutions. In other words, every digit is necessary to determine the solution.
I have a basic algorithm to generate minimal Sudokus:
Generate a completed puzzle.
Visit each cell in a random order. For each visited cell:
Tentatively remove its digit
Solve the puzzle twice using a recursive backtracking algorithm. One solver tries the digits 1-9 in forward order, the other in reverse order. In a sense, the solvers are traversing a search tree containing all possible configurations, but from opposite ends. This means that the two solutions will match iff the puzzle has a unique solution.
If the puzzle has a unique solution, remove the digit permanently; otherwise, put it back in.
This method is guaranteed to produce a minimal puzzle, but it's quite slow (100 ms on my computer, several seconds on a smartphone). I would like to reduce the number of solves, but all the obvious ways I can think of are incorrect. For example:
Adding digits instead of removing them. The advantage of this is that since minimal puzzles require at least 17 filled digits, the first 17 digits are guaranteed to not have a unique solution, reducing the amount of solving. Unfortunately, because the cells are visited in a random order, many unnecessary digits will be added before the one important digit that "locks down" a unique solution. For instance, if the first 9 cells added are all in the same column, there's a great deal of redundant information there.
If no other digit can replace the current one, keep it in and do not solve the puzzle. Because checking if a placement is legal is thousands of times faster than solving the puzzle twice, this could be a huge time-saver. However, just because there's no other legal digit now doesn't mean there won't be later, once we remove other digits.
Since we generated the original solution, solve only once for each cell and see if it matches the original. This doesn't work because the original solution could be anywhere within the search tree of possible solutions. For example, if the original solution is near the "left" side of the tree, and we start searching from the left, we will miss solutions on the right side of the tree.
I would also like to optimize the solving algorithm itself. The hard part is determining if a solution is unique. I can make micro-optimizations like creating a bitmask of legal placements for each cell, as described in this wonderful post. However, more advanced algorithms like Dancing Links or simulated annealing are not designed to determine uniqueness, but just to find any solution.
How can I optimize my minimal Sudoku generator?
I have an idea on the 2nd option your had suggested will be better for that provided you add 3 extra checks for the 1st 17 numbers
find a list of 17 random numbers between 1-9
add each item at random location provided
these new number added dont fail the 3 basic criteria of sudoku
there is no same number in same row
there is no same number in same column
there is no same number in same 3x3 matrix
if condition 1 fails move to the next column or row and check for the 3 basic criteria again.
if there is no next row (ie at 9th column or 9th row) add to the 1st column
once the 17 characters are filled, run you solver logic on this and look for your unique solution.
Here are the main optimizations I implemented with (highly approximate) percentage increases in speed:
Using bitmasks to keep track of which constraints (row, column, box) are satisfied in each cell. This makes it much faster to look up whether a placement is legal, but slower to make a placement. A complicating factor in generating puzzles with bitmasks, rather than just solving them, is that digits may have to be removed, which means you need to keep track of the three types of constraints as distinct bits. A small further optimization is to save the masks for each digit and each constraint in arrays. 40%
Timing out the generation and restarting if it takes too long. See here. The optimal strategy is to increase the timeout period after each failed generation, to reduce the chance that it goes on indefinitely. 30%, mainly from reducing the worst-case runtimes.
mbeckish and user295691's suggestions (see the comments to the original post). 25%

Resources