String of words - DP - algorithm

I have a string of words and I must determine the longest substring such that the last 2 letters of a word are the first 2 letters of the word after it.
For example, for the words:
star, artifact, book, ctenophore, list, reply
Edit: So the longest substring would be star, artifact, ctenophore, reply
I'm looking for an idea to solve this problem in O(n). No code needed; I'd appreciate any suggestions on how to solve it.

The closest thing to O(n) I have is this:
You should mark every word with an Id. Let's take your example :
star => 1st possible substring. Since you're looking for the longest substring, a substring that starts with ar can't be the longest, because you could add star to the front.
let's set star's ID to 1, and its string comparison is ar
artifact => the first two characters match the first possible substring. let's set artifact's ID to 1 as well, and change the string comparison to ct
book => the first two characters don't match anything in the string comparisons (there's only ct there), so we set book's ID to 2, and we add a new string comparison: ok
...
list => the first two characters don't match anything in the string comparisons (re from ID == 1 and ok from ID == 2), so we create another ID = 3 and another string comparison
In the end, you just need to go through the IDs and see which one has the most elements. You can probably count it as you go as well.
The main idea of this algorithm is to memorize every substring we're looking for. If we find a match, we just update the right substring with its two new last characters; if we don't, we add it to the "memory list".
Repeating this procedure makes it O(n*m), where m is the number of distinct IDs.

First, read in all words into a structure. (You don't really need to, but it's easier to work that way. You could also read them in as you go.)
The idea is to have a lookup table (such as a Dictionary in .NET) containing key-value pairs in which the last two letters of each word get an entry, and the corresponding value is always the longest 'substring' found so far that ends in those two letters.
Time complexity is O(n) - you only go through the list once.
Logic:
maxWord <- ""
word <- read next word
initial <- get first two letters of word
end <- get last two letters of word
if lookup contains key initial              // that is the longest string so far... add to it
    newWord <- lookup[initial] value + ", " + word
    if lookup doesn't contain key end       // nothing ends with these two letters so far
        lookup add (end, newWord) pair
    else if lookup[end] value length < newWord length   // if this will be the longest string ending in these two letters, we replace the previous one
        lookup[end] <- newWord
    if maxWord length < newWord length      // track the max here so you don't have to run through the lookup table again when you finish
        maxWord <- newWord
else                                        // lookup doesn't contain initial; we use only the word and, as above, check if it's the longest ending with these two letters
    if lookup doesn't contain key end
        lookup add (end, word) pair
    else if lookup[end] value length < word length
        lookup[end] <- word
    if maxWord length < word length
        maxWord <- word
The maxWord variable will contain the longest string.
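A quick sketch of that logic in Python (I keep each chain as a list and compare chains by word count rather than by concatenated string length; otherwise it follows the pseudocode above):

```python
def longest_chain(words):
    """One pass over the word list; `lookup` maps a two-letter suffix to
    the longest chain found so far that ends with those two letters."""
    lookup = {}
    best = []
    for word in words:
        initial, end = word[:2], word[-2:]
        # Extend the longest chain ending with our first two letters,
        # or start a fresh chain with just this word.
        chain = lookup.get(initial, []) + [word]
        # Keep it only if it beats the longest chain ending in `end`.
        if len(chain) > len(lookup.get(end, [])):
            lookup[end] = chain
        # Track the overall maximum as we go.
        if len(chain) > len(best):
            best = chain
    return best

print(longest_chain(["star", "artifact", "book", "ctenophore", "list", "reply"]))
# → ['star', 'artifact', 'ctenophore', 'reply']
```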
Here is the actual working code in C#, if you want it: http://pastebin.com/7wzdW9Es

Related

Data structure for Camelcase search (similar to Eclipse and Idea)

I'm trying to implement an algorithm for camel case search (skipping letters between capital letters), for example:
Given list and input string.
List:
ManualProxy.java
MapPropertyBase.java
MaxProduct.java
ManualProviderBase.java
Map.java
Input:
"Ma" - we should return all words from the list, because all of them start with "Ma"
"MaP" - return all except "Map.java", because the others start with "Ma" and their second word starts with "P"
"MProd" - return MaxProduct.java, because its first word starts with "M" and its second word starts with "Prod"
I'm considering a trie, but the problem is that I'd need to traverse the whole tree to find the second capital letter. Another option is a trie over the capital letters with sub-tries at the nodes, but in that case I'd have to return from a node and, if there are matches, continue traversing at the next capital letter.
Is there any better approach to this problem?
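One simple alternative to a trie: split the pattern and each name into capital-letter "humps" and require each pattern hump to be a prefix of the corresponding name hump. A rough Python sketch of that idea (file names shown without the .java suffix for brevity):

```python
import re

def camel_match(pattern, name):
    """True if `pattern` camel-case-matches `name`: each capital-letter
    'hump' of the pattern must be a prefix of the corresponding hump
    of the name."""
    split = lambda s: re.findall(r"[A-Z][a-z0-9]*", s)
    humps, query = split(name), split(pattern)
    if len(query) > len(humps):
        return False
    return all(h.startswith(q) for q, h in zip(query, humps))

files = ["ManualProxy", "MapPropertyBase", "MaxProduct",
         "ManualProviderBase", "Map"]
print([f for f in files if camel_match("MaP", f)])
# → ['ManualProxy', 'MapPropertyBase', 'MaxProduct', 'ManualProviderBase']
```

This is linear in the list size per query; a trie over the first humps would only pay off for much larger lists.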

How can I count the number of equal words between two strings?

How can I count the number of words that appear in two strings?
I'm thinking in something like this
let $nequalwords := count($item[text() eq $speech])
What is the best way to do this?
I thought of going with two nested loops comparing word by word, but I don't know whether there is a better way to do this.
How about splitting the strings on whitespace so that you end up with words, and then building one sequence out of each string's distinct words? Any word that appears in both strings occurs twice in that combined sequence, so subtracting the count of its distinct values from its total count tells you how many words appeared in both strings. For example:
let $distinct-words1 := distinct-values(tokenize($string1, "\s+"))
let $distinct-words2 := distinct-values(tokenize($string2, "\s+"))
let $all-words := ($distinct-words1, $distinct-words2)
return
count($all-words) - count(distinct-values($all-words))
How about
count(tokenize($string1, "\s+")[. = tokenize($string2, "\s+")])
This is the number of words in the first string that also appear in the second string. Which might or might not be what you actually want. For example, if the two strings are "the more the merrier" and "the rite of spring", the answer will be 2.
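For comparison, here is a rough Python equivalent of both counting semantics above (function names are mine):

```python
def distinct_common(s1, s2):
    # Distinct words appearing in both strings (the first answer's count).
    return len(set(s1.split()) & set(s2.split()))

def tokens_in_second(s1, s2):
    # Tokens of s1 that also appear anywhere in s2, duplicates counted
    # (the second answer's count).
    words2 = set(s2.split())
    return sum(1 for w in s1.split() if w in words2)

print(distinct_common("the more the merrier", "the rite of spring"))   # → 1
print(tokens_in_second("the more the merrier", "the rite of spring"))  # → 2
```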

How to find anagram frequency in a string?

Given a string value of arbitrary length, you're supposed to determine the frequency of words which are anagrams of each other.
public static Map<String, Integer> generateAnagramFrequency(String str)
{ ... }
For example: if the string is "find art in a rat for cart and dna trac"
your output should be a map:
find -> 1
art -> 2
in -> 1
a -> 1
cart -> 2
and -> 2
The keys should be the first occurrence of the word, and the number is the number of anagrams of that word, including itself.
The solution I came up with so far is to sort the characters of each word and compare them character by character until the end of either string (sorting a word of length k is O(k log k)). I am looking for some other efficient method that doesn't change the two strings being compared. Thanks.
I've written a JavaScript implementation of n-gram creation (word analysis) at Extract keyphrases from text (1-4 word ngrams).
This function can easily be altered to analyse the frequency of anagrams: replace s = text[i]; with s = text[i].split('').sort().join(''), so that the order of characters no longer matters.
Create a "signature" for each word by sorting its letters alphabetically. Sort the words by their signatures. Run through the sorted list in order; if the signature is the same as the previous signature, you have an anagram.
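A rough Python sketch of that signature approach, using the question's key convention (the key is the first occurrence of each anagram group; note it also counts "for", which the question's sample output seems to omit):

```python
def generate_anagram_frequency(s):
    """Sort each word's letters to get a signature; words sharing a
    signature are anagrams. The map key is the first word seen with
    each signature, the value counts all words sharing it."""
    counts = {}       # first word seen for a signature -> anagram count
    first_seen = {}   # signature -> first word seen with it
    for word in s.split():
        sig = "".join(sorted(word))
        if sig not in first_seen:
            first_seen[sig] = word
            counts[word] = 0
        counts[first_seen[sig]] += 1
    return counts

print(generate_anagram_frequency("find art in a rat for cart and dna trac"))
# → {'find': 1, 'art': 2, 'in': 1, 'a': 1, 'for': 1, 'cart': 2, 'and': 2}
```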

Boyer Moore Algorithm Understanding and Example?

I am facing issues in understanding Boyer Moore String Search algorithm.
I am following the following document. Link
I am not able to work out exactly what delta1 and delta2 mean here, and how they are applied in the string search algorithm.
The language seemed a little vague.
Kindly if anybody out there can help me out in understanding this, it would be really helpful.
Or, if you know of any other link or document available that is easy to understand, then please share.
Thanks in advance.
The insight behind Boyer-Moore is that if you start searching for a pattern in a string starting with the last character in the pattern, you can jump your search forward multiple characters when you hit a mismatch.
Let's say our pattern p is the sequence of characters p1, p2, ..., pn and we are searching a string s, currently with p aligned so that pn is at index i in s.
E.g.:
s = WHICH FINALLY HALTS.  AT THAT POINT...
p = AT THAT
i =       ^
The B-M paper makes the following observations:
(1) if we try matching a character that is not in p then we can jump forward n characters:
'F' is not in p, hence we advance n characters:
s = WHICH FINALLY HALTS.  AT THAT POINT...
p =        AT THAT
i =              ^
(2) if we try matching a character whose last position is k from the end of p then we can jump forward k characters:
' 's last position in p is 4 from the end, hence we advance 4 characters:
s = WHICH FINALLY HALTS.  AT THAT POINT...
p =            AT THAT
i =                  ^
Now we scan backwards from i until we either succeed or we hit a mismatch.
(3a) if the mismatch occurs k characters from the start of p and the mismatched character is not in p, then we can advance (at least) k characters.
'L' is not in p and the mismatch occurred against p6, hence we can advance (at least) 6 characters:
s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                  AT THAT
i =                        ^
However, we can actually do better than this.
(3b) since we know that at the old i we'd already matched some characters (1 in this case), and those matched characters don't match the start of p, we can actually jump forward a little more (this extra distance is called 'delta2' in the paper):
s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                   AT THAT
i =                         ^
At this point, observation (2) applies again, giving
s = WHICH FINALLY HALTS.  AT THAT POINT...
p =                       AT THAT
i =                             ^
and bingo! We're done.
The algorithm is based on a simple principle. Suppose that I'm trying to match a substring of length m. I'm going to first look at the character at index m. If that character is not in my substring, I know the match I want can't start at indices 1, 2, ..., m.
If that character is in my string, I'll assume that it is at the last place in my string that it can be. I'll then jump back and start trying to match my string from that possible starting place. This piece of information is my first table.
Once I start matching from the beginning of the substring, when I find a mismatch I can't just start from scratch, because I could already be partway through a match that starts at a different point. For instance, if I'm trying to match anand in ananand, I successfully match anan, then realize that the following a is not a d; but I've just matched an, and so I should jump back to trying to match the third character of my substring. This "if I fail after matching x characters, I could be on the y'th character of a match" information is stored in the second table.
Note that when I fail to match, the second table knows how far along in a match I might be based on what I just matched, and the first table knows how far back I might be based on the character I just saw but failed to match. You want to use the more pessimistic of those two pieces of information.
With this in mind the algorithm works like this:
start at beginning of string
start at beginning of match
while not at the end of the string:
    if match_position is 0:
        jump ahead m characters
        look at the character, jump back based on table 1
        if it matches the first character:
            advance match position
            advance string position
    else if I match:
        if I reached the end of the match:
            FOUND MATCH - return
        else:
            advance string position and match position
    else:
        pos1 = table1[ character I failed to match ]
        pos2 = table2[ how far into the match I am ]
        if pos1 < pos2:
            jump back pos1 in string
            set match position at beginning
        else:
            set match position to pos2
FAILED TO MATCH
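If you build only the first table (delta1, the bad-character rule), you get the Boyer-Moore-Horspool simplification, which is easy to make runnable. A Python sketch — it shows the right-to-left scan and the skip table, just without delta2:

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: bad-character rule (delta1) only.
    Returns the index of the first match, or -1."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return -1
    # Skip table: how far the window may shift when the character
    # aligned with the pattern's last position causes a mismatch.
    skip = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    i = m - 1                      # text index aligned with pattern's end
    while i < len(text):
        j, k = m - 1, i            # scan backwards from the end
        while j >= 0 and text[k] == pattern[j]:
            j -= 1
            k -= 1
        if j < 0:
            return k + 1           # full match; return its start index
        i += skip.get(text[i], m)  # shift the window using the table
    return -1

print(horspool_search("WHICH FINALLY HALTS.  AT THAT POINT...", "AT THAT"))
# → 22
```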

Algorithm for multiple word matching in text

I have a large set of words (about 10,000) and I need to find if any of those words appear in a given block of text.
Is there a faster algorithm than doing a simple text search for each of the words in the block of text?
Input the 10,000 words into a hashtable, then check each word in the block of text for an entry in it.
Whether it's faster I don't know; it's just another method (it would depend on how many words you are searching for).
A simple Perl example:
my $word_block = "the guy went afk after being popped by a brownrabbit";
my %hash = ();
my @words = split /\s/, $word_block;
while (<DATA>) { chomp; $hash{$_} = 1; }
foreach my $word (@words)
{
    print "found word: $word\n" if exists $hash{$word};
}
__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown
Try out the Aho-Corasick algorithm:
http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
Build up a trie of your words, and then use that to find which words are in the text.
The answer heavily depends on the actual requirements.
How large is the word list?
How large is the text block?
How many text blocks must be processed?
How often must each text block be processed?
Do the text blocks or the word list change? If so, how frequently?
Assuming relatively small text blocks compared to the word list and processing each text block only once, I suggest putting the words from the word list into a hash table. Then you can perform a hash lookup for each word in the text block and find out whether the word list contains it.
If you have to process the text blocks multiple times, I suggest inverting the text blocks instead: for each word, keep a list of all the text blocks that contain it.
In still other situations it might be helpful to generate a bit vector for each text block with one bit per word indicating if the word is contained in the text block.
You can build a graph used as a state machine: when you process the i-th character of your input word, Ci, you try to move to the i-th level of your graph by checking whether the previous node, linked to Ci-1, has a child node linked to Ci.
ex: if you have the following words in your corpus
("art", "are", "be", "bee")
you will have the following nodes in your graph
n11 = 'a'
n21 = 'r'
n11.sons = (n21)
n31 = 'e'
n32= 't'
n21.sons = (n31, n32)
n41 = 'art' (here we have a leaf in our graph, and the word built from all the nodes above is associated with this node)
n31.sons = (n41)
n42 = 'are' (here again we have a word)
n32.sons = (n42)
n12 = 'b'
n22 = 'e'
n12.sons = (n22)
n33 = 'e'
n34 = 'be' (word)
n22.sons = (n33,n34)
n43 = 'bee' (word)
n33.sons = (n43)
During processing, if you reach a leaf while you are processing the last character of your input word, and only in this case, your input is in your corpus.
This method is more complicated to implement than a single Dictionary or Hashtable, but it will be much better optimized in terms of memory use.
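That node layout is essentially a trie. A minimal Python sketch (names are mine, and it only does membership tests rather than streaming the text through a state machine):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.word = None     # set when a corpus word ends at this node

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w        # mark a word ending here (a "leaf" above)
    return root

def contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    # Only a word-terminal node counts, per the leaf rule above.
    return node.word is not None

trie = build_trie(["art", "are", "be", "bee"])
print(contains(trie, "bee"), contains(trie, "ar"))  # → True False
```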
The Boyer-Moore string algorithm should work. Depending on the size/number of words in the block of text, you might want to flip it around and search the word list instead (are there more words in the list than in the block?). Also, you probably want to remove any duplicates from both lists.
