Generate or find a shortest text given list of words - algorithm

Let's say I have a list of 1000+ words and I would like to generate a text that includes these words from the list. I would like to use as few extra words outside of the list as possible. How would one tackle such a problem? Or alternatively, is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.

I am not sure how you'd like the text to be generated, so I'll attempt to answer the second question:
Is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
This is obviously a computationally demanding endeavour so I'll assume you are alright with spending like a gig of RAM on this and some time (but maybe not too long). Since you are looking for the shortest continuous text which satisfies some condition, one can conclude the following:
If the text satisfies the condition, you want to shorten it.
If it doesn't, you want to make it longer so that hopefully it will start satisfying the condition.
Now, when it comes to the condition, it is whatever predicate that will say whether the continuous section of the text is "good enough" or not, based on some relatively simple statistics. For instance, the predicate could check if some cumulative index based on what ratio of the words from your list are included in the section, modified by the number of words from outside the list, is greater than some expected value.
What my mind races to when I see something like this is the sliding window technique, described in this article. I have not read the article closely, but from a quick scan it seems decent. The technique is also known as the caterpillar method, which is a particularly common name for it in Poland.
Basically, you have two pointers, a left pointer and a right pointer, marking the two ends of the current fragment. The technique applies when the condition is monotonic: if it is met for a fragment, then it is also met for any larger fragment containing that fragment. To find the shortest continuous fragment of a larger text that satisfies such a condition, you advance the right pointer as long as the condition is unmet; once it is met, you advance the left pointer until the condition is no longer met. This repeats until the pointers reach the end of the text.
This is a neat technique, which allows you to iterate over the whole text exactly once, linearly. It is clearly desirable in your case to have an algorithm linear with respect to the length of the text.
Now, we have to consider the statistics you will be collecting. You will probably want to know how many words from the list, and how many words from outside of the list are present in a continuous fragment. An extra condition for these statistics is that they will need to be relatively easily modifiable (preferably in constant time, but that will be hard to achieve) every time one of the pointers advances.
In order to keep track of the words, we will use a hashmap of ordered sets of indices. In Java these data structures are called HashMap and TreeSet, in C++ they're unordered_map and set. The keys to the hashmap will be strings representing words. The values will be sets of indices of where the words appear in the text. Note that lookup in a hashmap is linear relative to the length of the key, so we can treat it as constant since most words are shorter than 10 characters, and checking how many values in a set lie between two given values is logarithmic relative to the size of the set. So getting the number of times a word appears in a fragment of the text is easy and fast. Keeping track of whether a word exists in the given list or not can also be achieved with a hashmap (or a hashset).
So let's get back to the statistics. Say you want to keep track of the number of words from inside and from outside your list in a given fragment. This can be achieved very simply:
Every time you add a word to the fragment by advancing its right end, you check if it appears in the list in constant time and if so, you add one to the "good words" number, and otherwise, you add one to the "bad words" number.
Every time you remove a word from the fragment by advancing the left end, you do the same but you decrement the counters instead.
Now if you want to track how many unique words from inside and from outside the list there are in the fragment, every time you will need to check the number of times a given word exists in the fragment. We established earlier that this can be done logarithmically relative to the length of the fragment, so now the trick is simple. You only modify the counters if the number of appearances of a word in the fragment either
rose from 0 to 1 when advancing the right pointer, or
fell from 1 to 0 when advancing the left pointer.
Otherwise, you ignore the word, not changing the counters.
Additional memory optimisations include removing indices from the sets of indices when they are out of scope of the fragment and removing hashmap entries from the hashmap if a set of indices becomes empty.
It is now up to you to perhaps find a better heuristic, some other statistical values which you can easily track whatever it is you intend to check in your predicate. Although it is important that whenever a fragment meets your condition, a bigger fragment must meet it too.
In the case described above you could keep track of all the fragments which had at least... I don't know... 90% of the words from your list and from those choose the shortest one or the one with the fewest foreign words.
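The procedure above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the text is already split into a list of words, the predicate is "the fragment contains at least a given fraction of the distinct words from the list", and a plain per-word counter replaces the ordered sets of indices (a counter is enough for this particular predicate). The names `shortest_window` and `min_fraction` are made up for the example.

```python
import math
from collections import Counter

def shortest_window(text_words, wanted, min_fraction=1.0):
    """Return (start, end) so that text_words[start:end] is the shortest
    fragment containing at least min_fraction of the distinct wanted words,
    or None if no fragment qualifies."""
    wanted = set(wanted)
    if not wanted:
        return None
    need = max(1, math.ceil(len(wanted) * min_fraction))
    counts = Counter()   # occurrences of each wanted word inside the window
    distinct = 0         # how many wanted words currently have count > 0
    best = None
    left = 0
    for right, word in enumerate(text_words):
        # Advance the right pointer: the word enters the fragment.
        if word in wanted:
            counts[word] += 1
            if counts[word] == 1:      # rose from 0 to 1
                distinct += 1
        # While the condition holds, record the fragment and shrink it.
        while distinct >= need:
            if best is None or right + 1 - left < best[1] - best[0]:
                best = (left, right + 1)
            gone = text_words[left]
            if gone in wanted:
                counts[gone] -= 1
                if counts[gone] == 0:  # fell from 1 to 0
                    distinct -= 1
            left += 1
    return best

words = "the quick brown fox jumps over the lazy dog the fox".split()
print(shortest_window(words, ["fox", "the"]))  # (9, 11)
```

Each pointer only moves forward, so the whole text is scanned once. Lowering `min_fraction` to something like 0.9 gives the "at least 90% of the words" variant mentioned above.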

Related

What data structure/algorithm to use to compute similarity between input sequence and a database of stored sequences?

By this question, I mean if I have an input sequence abchytreq and a database / data structure containing jbohytbbq, I would compare the two elements pairwise to get a match of 5/9, or 55%, because of the pairs (b-b, hyt-hyt, q-q). Each sequence additionally needs to be linked to another object (but I don't think this will be hard to do). The sequence does not necessarily need to be a string.
The maximum number of elements in the sequence is about 100. This is easy to do when the database/datastructure has only one or a few sequences to compare to, but I need to compare the input sequence to over 100000 (mostly) unique sequences, and then return a certain number of the most similar previously stored data matches. Additionally, each element of the sequence could have a different weighting. Back to the first example: if the first input element was weighted double, abchytreq would only be a 50% match to jbohytbbq.
I was thinking of using BLAST and creating a little hack as needed to account for any weighting, but I figured that might be a little bit overkill. What do you think?
One more thing. Like I said, comparison needs to be pairwise, e.g. abcdefg would be a zero percent match to bcdefgh.
A modified Edit Distance algorithm with weightings for character positions could help.
https://www.biostars.org/p/11863/
Multiply the resulting distance matrix with a matrix of weights for character positions.
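For comparison, here is a minimal Python sketch of the strictly pairwise, position-weighted score the question describes. Note that this is not the edit-distance approach suggested above: it never shifts elements, so abcdefg against bcdefgh scores zero. The function names and the convention that missing positions count as mismatches are assumptions made for the example.

```python
import heapq

def weighted_match(a, b, weights=None):
    """Fraction of total position weight on which sequences a and b agree.
    Positions past the end of the shorter sequence count as mismatches."""
    n = max(len(a), len(b))
    if weights is None:
        weights = [1.0] * n
    total = sum(weights[:n])
    matched = sum(w for x, y, w in zip(a, b, weights) if x == y)
    return matched / total if total else 0.0

def top_matches(query, database, k, weights=None):
    """Return the k stored sequences most similar to the query."""
    return heapq.nlargest(
        k, database, key=lambda s: weighted_match(query, s, weights))

print(weighted_match("abchytreq", "jbohytbbq"))                 # 5/9
print(weighted_match("abchytreq", "jbohytbbq", [2] + [1] * 8))  # 0.5
```

With about 100000 sequences of at most 100 elements, this brute-force scan is roughly ten million element comparisons per query, which may well be acceptable before reaching for anything heavier.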
I'm not entirely clear on the question; for instance, would you return all matches of 90% or better, regardless of how many or few there are, or would you return the best 10% of the input, even if some of them match only 50%? Here are a couple of suggestions:
First: Do you know the story of the wise bachelor? The foolish bachelor makes a list of requirements for his mate --- slender, not blonde (Mom was blonde, and he hates her), high IQ, rich, good cook, loves horses, etc --- then spends his life considering one mate after another, rejecting each for failing one of his requirements, and dies unfulfilled. The wise bachelor considers that he will meet 100 marriageable women in his life, examines the first sqrt(100) = 10 of them, then marries the next mate with a better score than the best of the first ten; she might not be perfect, but she's good enough. This is a version of the secretary problem from optimal stopping theory. In the classic formulation, where the goal is to pick the single best candidate, the cutoff is n/e (about 37 of 100) rather than the square root; a square-root cutoff is associated with variants that aim for a good expected score rather than the very best candidate.
Second: I suppose that you have a scoring function that tells you exactly which of two dictionary words is the better match to the target, but is expensive to compute. Perhaps you can find a partial scoring function that is easy to compute and would allow you to quickly scan the dictionary, discarding those inputs that are unlikely to be winners, and then apply your total scoring function only to that subset of the dictionary that passed the partial scoring function. You'll have to define the partial scoring function based on your needs. For instance, you might want to apply your total scoring function to only the first five characters of the target and the dictionary word; if that doesn't eliminate enough of the dictionary, increase to ten characters on each side.
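A rough Python sketch of that two-stage idea follows. The scoring function here is a plain positional match standing in for whatever expensive function you actually have, and `prefix_len` and `min_prefix_score` are hypothetical tuning knobs. Keep in mind that the cheap filter is a heuristic: it can discard a sequence that would have scored well overall.

```python
def positional_score(a, b):
    """Stand-in for the full scoring function: fraction of equal positions."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b), 1)

def two_stage_search(query, database, k, prefix_len=5, min_prefix_score=0.5):
    """Filter cheaply on a short prefix, then fully score only the survivors."""
    head = query[:prefix_len]
    survivors = [s for s in database
                 if positional_score(head, s[:prefix_len]) >= min_prefix_score]
    survivors.sort(key=lambda s: positional_score(query, s), reverse=True)
    return survivors[:k]

database = ["jbohytbbq", "abchytrex", "zzzzzzzzz", "abzzzzzzz"]
print(two_stage_search("abchytreq", database, k=2))
# ['abchytrex', 'jbohytbbq']
```

If too many candidates survive, raise `prefix_len` or `min_prefix_score`; if good matches are being lost, lower them.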

Algorithm for global multiple sequence alignment using only indels

I'm writing a Sublime Text script to align several lines of code. The script takes each line, splits it by a predefined set of delimiters (,;:=), and rejoins it with each segment in a 'column' padded to the same width. This works well when all lines have the same set of delimiters, but some lines may have extra segments, an optional comma at the end, and so forth.
My idea is to come up with a canonical list of delimiters. Specifically, given several strings of delimiters, I would like to find the shortest string that can be formed from any of the given strings using only insertions, with ties broken in some sensible manner. After some research, I learned that this is the well-known problem of global multiple sequence alignment, except that there are no mismatches, only matches and indels.
The dynamic programming approach, unfortunately, is exponential in the number of strings - at least in the general case. Is there any hope for a faster solution when mismatches are disallowed?
I'm a little hesitant to make a blanket statement that there is no such hope, even when mismatches are disallowed, but I'm pretty sure that there isn't. Here's why.
The size of the dynamic programming table generated when doing sequence alignment is approximately (string length)^(number of strings), hence the exponential run-time/space requirement. To give you a feel of where that comes from, here's an example with two strings, ABC and ACB, each of length 3. This gives us a 3x3 table:
      A  B  C
  A   0  1  2
  C   1  1  1
  B   2  1  2
We initialize this table starting from the upper left and working our way down to the lower right from there. The total cost to get to any location in the table is given by the number at that location (for simplicity, I'm assuming that insertions, deletions, and substitutions all have a cost of 1). The operation used to get to a given location is given by the direction that you moved from the previous value. Moving to the right means you are inserting elements from the top string. Moving down inserts elements from the sideways string. Moving diagonally means you are aligning elements from the top and sideways strings. If these elements don't match, then this represents a substitution and you increase the cost to get there.
And that's the problem. Saying mismatches aren't allowed doesn't rule out the operations that are responsible for the length and height of the table (insertions/deletions). Worse, disallowing mismatches doesn't even rule out a potential move. Diagonal movements in the table are still possible sometimes, just not when the two elements don't match. Plus, you still need to check to see if the elements match, so you're basically still considering that move. As a result, this shouldn't be able to improve your worst case time and seems unlikely to have a substantial effect on your average or best case time either.
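To make the argument concrete, here is a small Python sketch of the two-string table with a switch for disallowing mismatches. It is an illustration only: the function name is made up, and unlike the table shown above it includes the extra first row and column for the empty prefix. The point to notice is that the loops visit exactly the same cells either way.

```python
def alignment_table(top, side, allow_mismatch=True):
    """Fill the pairwise alignment cost table (unit costs).
    table[i][j] = cost of aligning side[:i] with top[:j]."""
    rows, cols = len(side) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        table[0][j] = j          # only moves to the right
    for i in range(rows):
        table[i][0] = i          # only moves down
    for i in range(1, rows):
        for j in range(1, cols):
            # Moving right or down is always possible (an indel).
            best = min(table[i][j - 1], table[i - 1][j]) + 1
            # The diagonal still has to be examined in both modes.
            if side[i - 1] == top[j - 1]:
                best = min(best, table[i - 1][j - 1])
            elif allow_mismatch:
                best = min(best, table[i - 1][j - 1] + 1)
            table[i][j] = best
    return table

print(alignment_table("ABC", "ACB")[-1][-1])                        # 2
print(alignment_table("ABC", "ADC")[-1][-1])                        # 1
print(alignment_table("ABC", "ADC", allow_mismatch=False)[-1][-1])  # 2
```

Disallowing mismatches only removes one branch inside the inner loop; the table size, and therefore the exponential growth with the number of strings, is unchanged.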
On the bright side, this is a pretty important problem in bioinformatics, so people have come up with some solutions. They have their flaws, but may work well enough for your case (particularly since you seem less likely to get spurious alignments than you would with DNA, given that your strings are not composed of a four-letter alphabet). So take a look at Star Alignment and Neighbor Joining.

Finding a list of adjacent words between two words

I am working on a programming challenge for practice and am having trouble finding a good data structure/algorithm to use to implement a solution.
Background:
Call two words “adjacent” if you can change one word into the other by adding, deleting, or changing a single letter.
A “word list” is an ordered list of unique words where successive words are adjacent.
The problem:
Write a program which takes two words as inputs and walks through the dictionary and creates a list of words between them.
Examples:
hate → love: hate, have, hove, love
dogs → wolves: dogs, does, doles, soles, solves, wolves
man → woman: man, ran, roan, roman, woman
flour → flower: flour, lour, dour, doer, dower, lower, flower
I am not quite sure how to approach this problem. My first attempt involved creating permutations of the first word and then trying to replace letters in it. My second thought was maybe something like a suffix tree.
Any thoughts or ideas toward at least breaking the problem down would be appreciated. Keep in mind that this is not homework, but a programming challenge I am working on myself.
This puzzle was first stated by Charles Dodgson, who wrote Alice's Adventures in Wonderland under his pseudonym Lewis Carroll.
The basic idea is to create a graph structure in which the nodes are words in a dictionary and the edges connect words that are one letter apart, then do a breadth-first search through the graph, starting at the first word, until you find the second word.
I discuss this problem, and give an implementation that includes a clever algorithm for identifying "adjacent to" words, at my blog.
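A minimal Python sketch of the breadth-first search is below. It does not reproduce the blog's algorithm for finding adjacent words; as an assumption for this example, neighbours are found by generating every single-letter addition, deletion, and change of a lowercase word and keeping those present in the dictionary, so the graph is never built explicitly.

```python
import string
from collections import deque

def neighbours(word, dictionary):
    """Dictionary words reachable by adding, deleting, or changing one letter."""
    candidates = set()
    for i in range(len(word) + 1):
        for c in string.ascii_lowercase:
            candidates.add(word[:i] + c + word[i:])           # add a letter
        if i < len(word):
            candidates.add(word[:i] + word[i + 1:])           # delete a letter
            for c in string.ascii_lowercase:
                candidates.add(word[:i] + c + word[i + 1:])   # change a letter
    candidates.discard(word)
    return candidates & dictionary

def word_list(start, goal, dictionary):
    """Shortest chain of adjacent words from start to goal, or None."""
    dictionary = set(dictionary)
    previous = {start: None}     # also serves as the set of visited words
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == goal:
            path = []
            while word is not None:
                path.append(word)
                word = previous[word]
            return path[::-1]
        for nxt in neighbours(word, dictionary):
            if nxt not in previous:
                previous[nxt] = word
                queue.append(nxt)
    return None

print(word_list("hate", "love", {"hate", "have", "hove", "love"}))
# ['hate', 'have', 'hove', 'love']
```

Because the search is breadth-first, the first time the goal is dequeued the chain found is a shortest one for the given dictionary.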
I have done this myself and used it to create a (not very good) Windows game.
I used the approach recommended by others of implementing this as a graph, where each node is a word and they are connected if they differ in one letter. This means you can use well known graph theory results to find paths between words (eg simple recursion where knowing the words at distance 1 allows you to find the words at distance 2).
The tricky part is building up the graph. The bad news is that it is O(n^2). The good news is that it doesn't have to be done in real time - rather than your program reading the dictionary words from a file, it reads in the data structure you baked earlier.
The key insight is that the order doesn't matter, in fact it gets in the way. You need to construct another form in which to hold the words which strips out the order information and allows words to be compared more easily. You can do this in O(n). You have lots of choices; I will show two.
For word puzzles I quite often use an encoding which I call an anagram dictionary. A word is represented by another word which has the same letters but in alphabetic sequence. So "cars" becomes "acrs". Both "lists" and "slits" become "ilsst". This is a better structure for comparison than the original word, but much better comparisons exist (however, it is a very useful structure for other word puzzles).
Letter counts. An array of 26 values which show the frequency of that letter in the word. So for "cars" it starts 1,0,1,0,0... as there is one "a" and one "c". Hold an external list of the non-zero entries (which letters appear in the word) so you only have to check 5 or 6 values at most instead of 26. Very simple to compare two words held in this form quickly by ensuring at most two counts are different. This is the one I would use.
So, this is how I did it.
I wrote a program which implemented the data structure up above.
It had a class called WordNode. This contains the original word; a List of all other WordNodes which are one letter different; an array of 26 integers giving the frequency of each letter, a list of the non-zero values in the letter count array.
The initialiser populates the letter frequency array and the corresponding list of non-zero values. It sets the list of connected WordNodes to empty.
After I have created an instance of the WordNode class for every word, I run a compare method on each pair of words, which checks whether the frequency counts differ in no more than two places. That normally takes slightly fewer comparisons than there are letters in the words; not too bad. If they differ in exactly two places, the words differ by one letter, and I add that WordNode to the list of WordNodes differing in only one letter.
This means we now have a graph of all the words one letter different.
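The letter-count comparison might look roughly like this in Python, using `collections.Counter` in place of the 26-entry array. Two things here are assumptions beyond the description above: the check also accepts a single added or deleted letter (counts differing in one place), and the function names are invented. Since letter order is ignored, the test also accepts rearranged words such as "cars" and "scat", so it is looser than a strict one-edit comparison.

```python
import itertools
from collections import Counter

def counts_differ_by_one_letter(a, b):
    """True if the letter counts of a and b differ by one changed,
    added, or removed letter (letter order is ignored)."""
    ca, cb = Counter(a), Counter(b)
    extra_in_a = sum((ca - cb).values())   # letters a has that b lacks
    extra_in_b = sum((cb - ca).values())   # letters b has that a lacks
    return (extra_in_a, extra_in_b) in {(1, 1), (1, 0), (0, 1)}

def build_graph(words):
    """O(n^2) pass over all pairs; meant to be run once and saved."""
    graph = {w: [] for w in words}
    for a, b in itertools.combinations(words, 2):
        if counts_differ_by_one_letter(a, b):
            graph[a].append(b)
            graph[b].append(a)
    return graph

print(build_graph(["hate", "have", "hove", "love"])["have"])
# ['hate', 'hove']
```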
You can export either the whole data structure or strip out the letter frequency and other stuff you don't need and save it (I used serialized XML. If you go that way, make sure you check it handles the List of WordNodes as references and not embedded objects).
Your actual game then only has to read in this data structure (instead of a dictionary) and it can find the words one letter different with a direct lookup, in essentially zero time.
Pity my game was crap.
I don't know if this is the type of solution that you're looking for, but an active area of research is in constructing "edit distance 1" dictionaries for quickly looking up adjacent words (to use your parlance) for search term suggestions, data entry correction, and bioinformatics (e.g. finding similarities in chromosomes). See for example this research paper. Short of indexing your entire dictionary, at the very least this might suggest a search heuristic that you can use.
The simplest (recursive) algorithm I can think of (well, the only one I can think of at the moment) is:
Initialize an empty blacklist.
Take all words from your dictionary that are a valid step from the current word.
Remove the ones that are in the blacklist, and add the remaining ones to it.
Check if you can find the target word.
If not, repeat the algorithm for all words you found in the last step.
If yes, you found it. Return from the recursion, printing all words in the path you found.
Maybe someone with a bit more time can add the ruby code for this?
Try this
x = 'hate'
puts x = x.next until x == 'love'
And if you couple it with dictionary lookup, you will get a list of all valid words in between in that dictionary.

How can we optimise the creation of a trie if we know the input is in alphabetical order?

I am implementing a prefix tree, with a standard insertion mechanism. If we know we will be given a list of words in alphabetical order, is there any way we can change the insertion to skip a few steps? I am coding in Java, although I'm not looking for code in any particular language. I have considered adding the Nodes for each word to a queue, then hopping backwards through it until we're at a prefix of the next word, but this may be circumventing the whole point of the prefix tree!
Any thoughts on something like this? I'm finding it hard to come up with an implementation that's of any use unless the input is many many very similar words ("aaaaaaaaaab", "aaaaaaaaaac", "aaaaaaaaaad", ...) or something. But even then doing a string comparison on the prefixes is probably a similar cost to just using the prefix tree normally.
There is no way that you can avoid looking at all the characters in the input strings from which you're building the tree. If there was a way to do this, then I could make your algorithm incorrect. In particular, suppose that there is a word w and you don't look at one of its characters (say, the kth character). Then when your algorithm runs and tries to place the word somewhere in the trie, it must be able to place it without knowing all the characters. Therefore, if I change the kth character of the word to something else, your algorithm would put it in exactly the same place as before, which is incorrect because one of the characters in the word won't be correct.
Since the normal algorithm for constructing a trie takes time proportional to the number of characters in the input, you won't be able to asymptotically outperform it without doing some crazy tricks like parallelizing the construction code or packing the characters into machine words and hitting them with your Hammer of Bit Hackery.
However, you could potentially get a constant factor speedup. Following large numbers of pointers in a linked structure can be slow due to cache performance, so you could speed up the algorithm by minimizing the number of pointers you have to follow. One thing you could do would be to maintain the position of the end of the last string that you inserted, along with a list (preferably as a dynamic array) of nodes tracing the path back up to the root. To insert a new word, you could do the following:
Find the longest prefix of the string that matches the last string you inserted.
Jump to the pointer in the array marking where that would take you.
Trace the rest of the path down as normal, adding all the nodes that you trace out to the array and overwriting the previous pointers.
This way, if you insert a lot of words with a common prefix of a reasonable length, you can avoid doing a bunch of pointer-chasing back through a shared part of the structure. This could conceivably give you a performance boost if you have lots of words with the same prefix. It's not asymptotically better than before (and, in fact, uses more memory), but the savings from not following pointers could add up. I haven't tested this, but it seems like it might work.
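The three steps could be sketched like this in Python. Like the idea itself, this is untested for actual speed; the class and method names are invented for the example, and a dict per node stands in for whatever child representation your trie uses.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False

class SortedTrieBuilder:
    """Trie that remembers the node path of the previously inserted word."""

    def __init__(self):
        self.root = TrieNode()
        self.path = [self.root]  # path[i] = node after i characters of self.last
        self.last = ""

    def insert(self, word):
        # 1. Longest prefix shared with the previously inserted word.
        common = 0
        limit = min(len(word), len(self.last))
        while common < limit and word[common] == self.last[common]:
            common += 1
        # 2. Jump straight to the node for that prefix.
        del self.path[common + 1:]
        node = self.path[common]
        # 3. Walk the rest as usual, recording the nodes visited.
        for ch in word[common:]:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = TrieNode()
            node = child
            self.path.append(node)
        node.is_word = True
        self.last = word

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

builder = SortedTrieBuilder()
for w in ["car", "card", "care", "cat", "dog"]:
    builder.insert(w)
print(builder.contains("care"), builder.contains("ca"))  # True False
```

Note that the prefix comparison in step 1 still reads every shared character, in line with the argument above that no character can be skipped; what is saved is the pointer-chasing through the shared nodes.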
Hope this helps!

data structure for NFA representation

In my lexical analyzer generator I use the McNaughton and Yamada algorithm for NFA construction, and one of its properties is that a transition from I to J is marked with the character at position J.
So, each node of the NFA can be represented simply as a list of next possible states.
Which data structure is best suited for storing this type of data? It must provide fast lookup for all possible states and use little space, but insertion time is not so important.
My understanding is that you want to encode a graph, where the nodes are states and the edges are transitions, and where every edge is labelled with a character. Is that correct?
The dull but practical answer is to have an object for each state, and to encode the transitions in some little structure in that object.
The simplest one would be an array, indexed by character code: that's as fast as it gets, but not naturally space-efficient. You can make it more space efficient by using a sort of offset, truncated array: store only the part of the array which contains transitions, along with the start and end indices of that part. When looking up a character in it, check that its code is within the bounds; if it isn't, treat it as a null edge (or an edge back to the start state or whatever), and if it is, fetch the element at index (character code - start). Does that make sense?
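A small Python sketch of the offset, truncated array follows. The class name is made up, and `None` plays the role of the null edge; the stored value can be a single state or a collection of states, whichever your NFA needs.

```python
class OffsetTransitions:
    """Transitions of one state, stored as the smallest array slice
    that covers all character codes with an outgoing edge."""

    def __init__(self, transitions):
        # transitions: dict mapping a character to its target state(s)
        codes = [ord(c) for c in transitions]
        self.start = min(codes) if codes else 0
        size = max(codes) - self.start + 1 if codes else 0
        self.table = [None] * size
        for c, target in transitions.items():
            self.table[ord(c) - self.start] = target

    def lookup(self, ch):
        index = ord(ch) - self.start
        if 0 <= index < len(self.table):
            return self.table[index]
        return None   # outside the stored range: null edge

state = OffsetTransitions({"a": 1, "c": 2, "f": 3})
print(state.lookup("c"), state.lookup("b"), state.lookup("z"))  # 2 None None
print(len(state.table))  # 6 slots, covering 'a' through 'f'
```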
A more complex option would be a little hashtable, which would be more compact but slightly slower. I would suggest closed hashing, because collision lists will use too much memory; linear probing should be enough. You could look into using perfect hashing (look it up), which takes a lot of time to generate the table but then gives collision-free lookup. The generation process is quite complex, though.
A clever approach is to use both arrays and hashtables, and to pick one or the other based on the number of edges: if the compacted array would be more than, say, a third full, use it, but if not, use a hashtable.
Now, something a bit more radical you could do would be to use arrays, but to overlap them: if they're sparse, they'll have lots of holes in them, and if you're clever, you can arrange them so that the entries in each array line up with holes in the others. That will give you fast lookups, but also excellent memory efficiency. You will need some scheme for distinguishing when a lookup has found something from when it has found a slot holding some other state's transition, but I'm sure you can think of something.
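One common scheme for telling the two cases apart is a parallel "check" array that records which state owns each slot; this is the same idea used in row-displacement compression of parser tables. The Python sketch below uses a simple first-fit placement. The function names are invented, and symbols are assumed to be small integers below `alphabet_size`.

```python
def pack_rows(rows, alphabet_size):
    """Overlap sparse transition rows into shared arrays.
    rows[state] is a dict {symbol: target}. Returns (base, nxt, check)."""
    nxt, check, base = [], [], []
    for state, row in enumerate(rows):
        # First fit: smallest offset at which none of this row's
        # entries lands on a slot another state already occupies.
        offset = 0
        while any(offset + s < len(check) and check[offset + s] is not None
                  for s in row):
            offset += 1
        grow = offset + alphabet_size - len(check)
        if grow > 0:
            nxt.extend([None] * grow)
            check.extend([None] * grow)
        for s, target in row.items():
            nxt[offset + s] = target
            check[offset + s] = state   # remember the owner of this slot
        base.append(offset)
    return base, nxt, check

def lookup(base, nxt, check, state, symbol):
    i = base[state] + symbol
    if i < len(check) and check[i] == state:
        return nxt[i]
    return None   # empty slot, or a slot owned by another state

rows = [{0: 1, 2: 2}, {1: 0}, {0: 2, 1: 1}]
base, nxt, check = pack_rows(rows, alphabet_size=3)
print(base)                               # [0, 0, 3]
print(lookup(base, nxt, check, 0, 2))     # 2
print(lookup(base, nxt, check, 0, 1))     # None
```

Lookup stays a constant-time array access plus one comparison, at the cost of storing the check array alongside the transitions.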
