Data structure for Camelcase search (similar to Eclipse and Idea) - algorithm

I'm trying to implement an algorithm for camel case search (matching on capital letters while skipping the letters between them). For example, given a list and an input string:
List:
ManualProxy.java
MapPropertyBase.java
MaxProduct.java
ManualProviderBase.java
Map.java
Input:
"Ma" - we should return all words from the list, because all of them start with "Ma"
"MaP" - return all except "Map.java", because the first word starts with "Ma" and the second word starts with "P"
"MProd" - return MaxProduct.java, because the first word starts with "M" and the second word starts with "Prod"
I considered using a trie, but the problem is that I would need to traverse the whole tree to find the second capital letter. Another option is a trie over the capital letters, with a separate trie attached to each node for the lowercase letters that follow. But in that case I have to return from the nested trie and, if there are matches, continue the traversal for the next capital letter.
Is there any better approach to this problem?
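To make the matching rule from the examples concrete, here is a minimal Python sketch. It is a baseline linear scan, not the index structure the question asks for, and it assumes the i-th capitalised fragment of the query must be a prefix of the i-th camel-case word of the name. The function names `camel_pattern` and `camel_match` are made up for illustration.

```python
import re

def camel_pattern(query):
    """Turn a query such as "MProd" into a regex such as M[^A-Z]*Prod."""
    parts = []
    for i, ch in enumerate(query):
        # Before every capital letter except the first character,
        # allow any run of non-capital characters to be skipped.
        if ch.isupper() and i > 0:
            parts.append("[^A-Z]*")
        parts.append(re.escape(ch))
    return re.compile("".join(parts))

def camel_match(query, names):
    """Return the names matching the query, anchored at the start of each name."""
    pattern = camel_pattern(query)
    return [name for name in names if pattern.match(name)]

names = [
    "ManualProxy.java",
    "MapPropertyBase.java",
    "MaxProduct.java",
    "ManualProviderBase.java",
    "Map.java",
]
print(camel_match("MProd", names))
```

A matcher like this is useful as a reference implementation when testing a faster trie-based index against it.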

Can the empty string be a prefix or suffix of a string?

Lately I was solving a problem involving a unique prefix tree (or trie), and I was confused about the term "prefix", so I dug into it as far as I could. The definition I found is:
"A string x is a prefix of another string y if there is a string v such that y = xv. v is called a suffix of y."
From this definition, a question arises:
can a string be a prefix of itself?
I think it can. A string can be a prefix of itself.
But according to the definition, if a string is a prefix of itself, then v must be the empty string. And v is also a suffix of y. So another question arises: can the empty string be a suffix of a string?
Wikipedia puts it nicely: all the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.
For example, the nodes associated with the strings "tea", "ted" and "ten" have the common prefix "te", preceded by the prefix "t", whose parent, the root, is the empty string.
An empty string is, according to the theory, still a string, just one of length zero. It is less a suffix than a root: it is as if you said that 4 is 0 + 2 + 2. You could say that, but would you?
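The quoted definition (y = xv) can be checked mechanically. The following Python sketch applies it literally; the helper names `is_prefix` and `is_suffix` are hypothetical, written only for this illustration.

```python
def is_prefix(x, y):
    """x is a prefix of y if y = x + v for some string v."""
    return y[:len(x)] == x

def is_suffix(v, y):
    """v is a suffix of y if y = x + v for some string x."""
    return len(v) <= len(y) and y[len(y) - len(v):] == v

print(is_prefix("tea", "tea"))  # x = "tea", v = ""
print(is_suffix("", "tea"))     # x = "tea", v = ""
print(is_prefix("", "tea"))     # x = "", v = "tea"
```

Under this definition, every string is a prefix and a suffix of itself, and the empty string is a prefix and a suffix of every string. Python's built-in `str.startswith` and `str.endswith` follow the same convention.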

String of words - DP

I have a string of words and I must determine the longest substring (a sequence of words taken in order) such that the last 2 letters of each word are the first 2 letters of the word after it.
For example, for the words:
star, artifact, book, ctenophore, list, reply
Edit: So the longest substring would be star, artifact, ctenophore, reply
I'm looking for an idea to solve this problem in O(n). No code needed; I appreciate any suggestions on how to solve it.
The closest thing to O(n) I have is this :
You should mark every word with an ID. Let's take your example:
star => the first possible substring. Since you're looking for the longest substring, a substring that starts with ar is not the longest, because you can add star in front of it.
Let's set the ID of star to 1, and its comparison string to ar.
artifact => its first two characters match the comparison string of the first substring. Let's set the ID of artifact to 1 as well, and change the comparison string to ct.
book => its first two characters don't match any comparison string (there's only ct so far), so we set the ID of book to 2 and add a new comparison string: ok
...
list => its first two characters don't match any comparison string (re from ID == 1 and ok from ID == 2), so we create another ID = 3 and another comparison string.
In the end, you just need to go through the IDs and see which one has the most elements. You can probably count them as you go as well.
The main idea of this algorithm is to remember every substring we're still extending. If we find a match, we update that substring's comparison string with the last two characters of the new word; if we don't, we add a new entry to the "memory list".
Repeating this procedure for every word makes it O(n*m), where n is the number of words and m is the number of different IDs.
First, read in all words into a structure. (You don't really need to, but it's easier to work that way. You could also read them in as you go.)
The idea is to keep a lookup table (such as a Dictionary in .NET) keyed by the last two letters of a word; the value for each key is always the longest 'substring' found so far that ends with those two letters.
Time complexity is O(n) - you only go through the list once.
Logic:
maxWord <- ""
for each word in the input:
    initial <- get first two letters of word
    end <- get last two letters of word
    if lookup contains key initial //that is the longest string so far... add to it
        newWord <- lookup [initial] value + ", " + word
        if lookup doesn't contain key end //nothing ends with these two letters so far
            lookup add (end, newWord) pair
        else if lookup [end] value length < newWord length //if this will be the longest string ending in these two letters, we replace the previous one
            lookup [end] <- newWord
        if maxWord length < newWord length //put this code here so you don't have to run through the lookup table again and find it when you finish
            maxWord <- newWord
    else //lookup doesn't contain initial, we use only the word, and similar to above, check if it's the longest that ends with these two letters
        if lookup doesn't contain key end
            lookup add (end, word) pair
        else if lookup [end] value length < word length
            lookup [end] <- word
        if maxWord length < word length
            maxWord <- word
The maxWord variable will contain the longest string.
Here is the actual working code in C#, if you want it: http://pastebin.com/7wzdW9Es
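For readers who cannot reach the linked code, here is a rough Python sketch of the same lookup-table idea. It is not the linked C# code. It assumes "longest" means the most words (the pseudocode compares string lengths instead), and the name `longest_chain` is made up for illustration.

```python
def longest_chain(words):
    """Return the longest run of words where each word starts with the
    last two letters of the word chosen before it."""
    best_ending = {}  # last two letters -> longest chain ending with them
    longest = []
    for word in words:
        head, tail = word[:2], word[-2:]
        # Extend the best chain that ends with this word's first two letters,
        # or start a new chain if there is none.
        chain = best_ending.get(head, []) + [word]
        if len(chain) > len(best_ending.get(tail, [])):
            best_ending[tail] = chain
        if len(chain) > len(longest):
            longest = chain
    return longest

print(longest_chain(["star", "artifact", "book", "ctenophore", "list", "reply"]))
```

Copying the chain on every step keeps the sketch short but costs extra time. Storing only the chain length and a back-pointer per word would keep the pass at one dictionary lookup per word.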

Performing operations on each line of a string

I have a string named "string" that contains six lines.
I want to remove a "z" from the end of each line (every line has one) and capitalize the first letter in each line (ignoring numbers and white space; e.g., "1. apple" -> "1. Apple").
I have some idea of how to do it, but have no idea how to do it in Ruby. How do I accomplish this? A loop? What would the syntax be?
Using regular expressions (see String#gsub):
s = <<EOS
1. applez
2. bananaz
3. catz
4. dogz
5. elephantz
6. fruitz
EOS
puts s.gsub(/z$/i, '').gsub(/^([^a-z]*)([a-z])/i) { $1 + $2.upcase }
# /z$/i - matches a trailing `z` (or `Z`) at the end of each line.
# /^([^a-z]*)([a-z])/i - matches the leading non-letters and the first letter of each line,
# capturing them as group 1 ($1) and group 2 ($2)
output:
1. Apple
2. Banana
3. Cat
4. Dog
5. Elephant
6. Fruit
I would approach this by breaking your problem into smaller steps. After we've solved each of the smaller problems, you can put it all back together for a more elegant solution.
Given the initial string put forth by falsetru:
s = <<EOS
1. applez
2. bananaz
3. catz
4. dogz
5. elephantz
6. fruitz
EOS
1. Break your string into an array of substrings, separated by the newline.
substrings = s.split(/\n/)
This uses the String class' split method and a regular expression. It searches for all occurrences of newline (backslash-n) and treats this as a delimiter, splitting the string into substrings based on this delimiter. Then it throws all of these substrings into an array, which we've named substrings.
2. Iterate through your array of substrings to do some stuff (details on what stuff later)
substrings.each do |substring|
  .
  # Do stuff to each substring
  .
end
This is one form for how you iterate across an array in Ruby. You call the Array's each method, and you give it a block of code which it will run on each element in the array. In our example, we'll use the variable name substring within our block of code so that we can do stuff to each substring.
3. Remove the z character at the end of each substring
substrings.each do |substring|
  substring.gsub!(/z$/, '')
end
Now, as we iterate through the array, the first thing we want to do is remove the z character at the end of each string. You do this with the gsub! method of String, which is a search-and-replace method. The first argument for this method is the regular expression of what you're looking for. In this case, we are looking for a z followed by the end of a line (denoted by the dollar sign; since each substring holds a single line, that is also the end of the string). The second argument is an empty string, because we want to replace what's been found with nothing (another way of saying that we just want to remove what's been found).
4. Find the index of the first letter in each substring
substrings.each do |substring|
  substring.gsub!(/z$/, '')
  index = substring.index(/[a-zA-Z]/)
end
The String class also has a method called index, which returns the index of the first occurrence of a string that matches the regular expression you provide. In our case, since we want to ignore numbers, symbols and spaces, we are really just looking for the first letter in the substring. To do this, we use the regular expression /[a-zA-Z]/, which basically says, "Find me anything in the range of small A to small Z or big A to big Z." Now we have an index (using our example strings, the index is 3).
5. Capitalize the letter at the index we have found
substrings.each do |substring|
  substring.gsub!(/z$/, '')
  index = substring.index(/[a-zA-Z]/)
  substring[index] = substring[index].capitalize
end
Based on the index value that we found, we want to replace the letter at that index with that same letter, but capitalized.
6. Put our substrings array back together as a single-string separated by newlines.
Now that we've done everything we need to do to each substring, our each iterator block ends, and we have what we need in the substrings array. To put the array back together as a single string, we use the join method of Array class.
result = substrings.join("\n")
With that, we now have a String called result, which should be what you're looking for.
Putting It All Together
Here is what the entire solution looks like, once we put together all of the steps:
substrings = s.split(/\n/)
substrings.each do |substring|
  substring.gsub!(/z$/, '')
  index = substring.index(/[a-zA-Z]/)
  substring[index] = substring[index].capitalize
end
result = substrings.join("\n")

How to find anagram frequency in a string?

Given a string value of arbitrary length, you're supposed to determine the frequency of words which are anagrams of each other.
public static Map<String, Integer> generateAnagramFrequency(String str)
{ ... }
For example: if the string is "find art in a rat for cart and dna trac"
your output should be a map:
find -> 1
art -> 2
in -> 1
a -> 1
for -> 1
cart -> 2
and -> 2
The keys should be the first occurrence of the word, and the number is the number of anagrams of that word, including itself.
The solution I came up with so far is to sort the letters of each word and compare the two sorted strings character by character until the end of either string. Sorting a word of length n costs O(n log n). I am looking for some other efficient method which doesn't change the 2 strings being compared. Thanks.
I've written a JavaScript implementation that builds n-grams (word analysis), at Extract keyphrases from text (1-4 word ngrams).
This function can easily be altered to analyse the frequency of anagrams: replace s = text[i]; with s = text[i].split('').sort().join(''), so that the order of characters doesn't matter any more.
Create a "signature" for each word by sorting its letters alphabetically. Sort the words by their signatures. Run through the sorted list in order; if the signature is the same as the previous signature, you have an anagram.
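The signature idea can be sketched in Python as follows. This variant groups words in a dictionary instead of sorting the word list, which is a substitution made here for brevity, and the name `anagram_frequency` is hypothetical. Python strings are immutable, so the original words are never modified.

```python
def anagram_frequency(text):
    """Map the first occurrence of each anagram group to the group's size."""
    first_seen = {}  # sorted-letter signature -> first word with that signature
    counts = {}      # first word -> number of anagrams seen, including itself
    for word in text.split():
        signature = "".join(sorted(word))
        key = first_seen.setdefault(signature, word)
        counts[key] = counts.get(key, 0) + 1
    return counts

print(anagram_frequency("find art in a rat for cart and dna trac"))
```

Because dictionaries preserve insertion order in Python 3.7 and later, the keys come out in order of first occurrence.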

Algorithm for multiple word matching in text

I have a large set of words (about 10,000) and I need to find if any of those words appear in a given block of text.
Is there a faster algorithm than doing a simple text search for each of the words in the block of text?
Put the 10,000 words into a hash table, then check whether each word in the block of text has an entry in it.
I don't know whether it is faster; it is just another method (it would depend on how many words you are searching for).
Simple Perl example:
use strict;
use warnings;

my $word_block = "the guy went afk after being popped by a brownrabbit";
my %hash = ();
my @words = split /\s+/, $word_block;
# Load the word list (one word per line after __DATA__) into the hash.
while (<DATA>) { chomp; $hash{$_} = 1; }
foreach my $word (@words)
{
    print "found word: $word\n" if exists $hash{$word};
}
__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown
Try out the Aho-Corasick algorithm:
http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
Build up a trie of your words, and then use that to find which words are in the text.
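As a rough illustration of that suggestion, here is a compact Python sketch of an Aho-Corasick automaton: a trie plus failure links built breadth-first. The function names are made up for this example. It reports words found anywhere in the text, including inside longer words, so whole-word matching would need an extra boundary check.

```python
from collections import deque

def build_automaton(words):
    goto = [{}]    # goto[state][char] -> next state (the trie edges)
    fail = [0]     # failure link of each state
    out = [set()]  # words recognised on reaching each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def find_words(text, words):
    """Return the set of words that occur somewhere in text."""
    goto, fail, out = build_automaton(words)
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found

word_list = ["afk", "lol", "brownrabbit", "popped", "garbage", "trash", "sitdown"]
print(find_words("the guy went afk after being popped by a brownrabbit", word_list))
```

The text is scanned once regardless of how many words are in the list. In practice the automaton would be built once and reused across text blocks.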
The answer heavily depends on the actual requirements.
How large is the word list?
How large is the text block?
How many text blocks must be processed?
How often must each text block be processed?
Do the text blocks or the word list change? If so, how frequently?
Assuming relatively small text blocks compared to the word list, and processing each text block only once, I suggest putting the words from the word list into a hash table. Then you can perform a hash lookup for each word in the text block and find out if the word list contains the word.
If you have to process the text blocks multiple times, I suggest inverting the text blocks. Inverting a text block means creating, for each word, a list of all the text blocks that contain that word.
In still other situations it might be helpful to generate a bit vector for each text block, with one bit per word indicating whether the word is contained in the text block.
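A minimal Python sketch of such an inverted index might look like this. It assumes words are separated by whitespace and compared case-insensitively, and the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(blocks):
    """Map each word to the set of block numbers in which it appears."""
    index = defaultdict(set)
    for block_id, text in enumerate(blocks):
        for word in text.lower().split():
            index[word].add(block_id)
    return index

def blocks_containing(index, words):
    """For each listed word that occurs at all, return the sorted block numbers."""
    return {word: sorted(index[word]) for word in words if word in index}

blocks = ["the guy went afk", "popped by a brownrabbit", "afk again"]
index = build_inverted_index(blocks)
print(blocks_containing(index, ["afk", "lol", "brownrabbit"]))
```

The index is built once. After that, each query costs one dictionary lookup per word in the word list, independent of the size of the text blocks.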
You can build a graph and use it as a state machine: when you process the i-th character of your input word, Ci, you try to go to the i-th level of the graph by checking whether the previous node, the one linked to Ci-1, has a child node linked to Ci.
For example, if you have the following words in your corpus
("art", "are", "be", "bee")
you will have the following nodes in your graph:
n11 = 'a'
n21 = 'r'
n11.sons = (n21)
n31 = 't'
n32 = 'e'
n21.sons = (n31, n32)
n41 = 'art' (here we have a leaf in our graph, and the word built from all the upper nodes is associated with this node)
n31.sons = (n41)
n42 = 'are' (here again we have a word)
n32.sons = (n42)
n12 = 'b'
n22 = 'e'
n12.sons = (n22)
n33 = 'e'
n34 = 'be' (word)
n22.sons = (n33,n34)
n43 = 'bee' (word)
n33.sons = (n43)
During processing, if you reach a leaf while handling the last character of your input word, and only in this case, your input is in your corpus.
This method is more complicated to implement than a single Dictionary or Hashtable, but it can use less memory when many words share prefixes.
The Boyer-Moore string search algorithm should work. Depending on the size of the block of text and the number of words in it, you might want to use the block's words as the keys and search the word list for them (are there more words in the list than in the block?). Also, you probably want to remove any duplicates from both lists.
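The structure described above is essentially a trie. A small Python sketch using nested dictionaries could look like the following. It marks word ends with a sentinel key rather than separate leaf nodes, which is a simplification of the node layout listed above, and the names `END`, `build_trie` and `contains` are made up for illustration.

```python
END = "$end"  # sentinel key marking that a word ends at this node

def build_trie(words):
    """Build a trie as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def contains(trie, word):
    """Walk one level per character; succeed only if a word ends exactly there."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = build_trie(["art", "are", "be", "bee"])
print(contains(trie, "bee"), contains(trie, "ar"))
```

Whether this saves memory compared with a hash table depends on how many prefixes the words share and on the per-node overhead of the language used.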
