How to find anagram frequency in a string? - algorithm

Given a string value of arbitrary length, you're supposed to determine the frequency of words which are anagrams of each other.
public static Map<String, Integer> generateAnagramFrequency(String str)
{ ... }
For example: if the string is "find art in a rat for cart and dna trac"
your output should be a map:
find -> 1
art -> 2
in -> 1
a -> 1
for -> 1
cart -> 2
and -> 2
The keys should be the first occurrence of the word, and the number is the number of anagrams of that word, including itself.
The solution I came up with so far is to sort all the words and compare each character from both strings until the end of either string; that would be O(n log n). I am looking for some other efficient method which doesn't change the two strings being compared. Thanks.

I've written a JavaScript implementation of n-gram creation (word analysis) at Extract keyphrases from text (1-4 word ngrams).
This function can easily be altered to analyse the frequency of anagrams: replace s = text[i]; with a version that sorts the characters, e.g. s = text[i].split('').sort().join('') if text[i] is a string, so that the order of characters no longer matters.

Create a "signature" for each word by sorting its letters alphabetically. Sort the words by their signatures. Run through the sorted list in order; if the signature is the same as the previous signature, you have an anagram.

Related

Ruby. Split string in separate decimal numbers

I have a long string which contains only decimal numbers with two digits after the comma
str = "123,457568,22321,5484123,77"
The numbers in the string are only decimals with two digits after the comma. How can I separate them into different numbers like this?
arr = ["123,45" , "7568,22" , "321,54" , "84123,77"]
You could try a regex split here:
str = "123,457568,22321,5484123,77"
nums = str.split(/(?<=,\d{2})/)
puts nums
This prints:
123,45
7568,22
321,54
84123,77
The logic above says to split at every point that is immediately preceded by a comma followed by two digits.
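In case you're not on Ruby, the same zero-width split works in Java too; a small sketch (the class name is only for illustration):

import java.util.Arrays;

public class DecimalSplit {
    public static void main(String[] args) {
        String str = "123,457568,22321,5484123,77";
        // split at every position preceded by a comma and exactly two digits
        String[] nums = str.split("(?<=,\\d{2})");
        System.out.println(Arrays.toString(nums));
        // [123,45, 7568,22, 321,54, 84123,77]
    }
}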
Scan String for Commas Followed by Two Digits
This is a case where you really need to know your data. If you always have floats with two decimal places, and commas are decimals in your locale, then you can use String#scan as follows:
str.scan /\d+,\d{2}/
#=> ["123,45", "7568,22", "321,54", "84123,77"]
Since your input data isn't consistent (which can be assumed by the lack of a reliable separator between items), you may not be able to guarantee that each item has a fractional component at all, or that the component has exactly two digits. If that's the case, you'll need to find a common pattern that is reliable for your given inputs or make changes to the way you assign data from your data source into str.

Replace with multiple patterns mutually exclusively

I have the following text:
a phrase whith length one, which is "uno"
Using the following dictionary,
1) phrase --- frase
2) a phrase --- una frase
3) one --- uno
4) uno --- one
I'm trying to replace the occurrences of the dictionary items in the text. The desired output is:
[a phrase|una frase] whith length [one|uno], which is "[uno|one]"
I've done this:
text = %(a phrase whith length one, which is "uno")
dictionary.each do |original, translation|
  text.gsub! original, "[#{original}|#{translation}]"
end
This snippet outputs the following for each dictionary word:
1) a [phrase|frase] whith length one, which is "uno"
2) a [phrase|frase] whith length one, which is "uno"
3) a [phrase|frase] whith length [one|uno], which is "uno"
4) a [phrase|frase] whith length [one|[uno|one]], which is "[uno|one]"
I see two problems here:
The word phrase is being replaced instead of a phrase. I think that this can be fixed by sorting the dictionary by length, giving priority to longer terms.
The already replaced words are being re-replaced, like uno in [one|uno]. I thought of using some sort of regular expression list (with Regexp.union), but I don't know how efficient and clean it'll be.
Any ideas?
To solve your second problem, you have to replace in a single pass.
Convert the dictionary into a hash with the key-value pairs in the order you mention (sorted by length, perhaps).
dictionary = {
  "a phrase" => "[a phrase|una frase]",
  "phrase" => "[phrase|frase]",
  "one" => "[one|uno]",
  "uno" => "[uno|one]",
}
Then replace all in a single pass.
text.gsub(Regexp.union(*dictionary.keys.map { |w| /\b#{Regexp.escape(w)}\b/ }), dictionary)
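For comparison, the same single-pass idea sketched in Java with one regex alternation and a dictionary lookup per match (class and variable names are mine, not from the question):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SinglePassReplace {
    public static void main(String[] args) {
        String text = "a phrase whith length one, which is \"uno\"";

        // longest keys first, so "a phrase" wins over "phrase"
        Map<String, String> dictionary = new LinkedHashMap<>();
        dictionary.put("a phrase", "[a phrase|una frase]");
        dictionary.put("phrase", "[phrase|frase]");
        dictionary.put("one", "[one|uno]");
        dictionary.put("uno", "[uno|one]");

        // one alternation over all keys, anchored on word boundaries
        String alternation = dictionary.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");

        // single pass: each match is looked up once and never revisited
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(dictionary.get(m.group())));
        }
        m.appendTail(out);

        System.out.println(out);
        // [a phrase|una frase] whith length [one|uno], which is "[uno|one]"
    }
}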

String of words - DP

I have a string of words and I must determine the longest substring such that the last 2 letters of a word are the first 2 letters of the word after it.
For example, for the words:
star, artifact, book, ctenophore, list, reply
Edit: So the longest substring would be star, artifact, ctenophore, reply
I'm looking for an idea to solve this problem in O(n). No code please; I appreciate any suggestions on how to solve it.
The closest thing to O(n) I have is this :
You should mark every word with an Id. Let's take your example :
star => 1st possible substring. Since you're looking for the longest substring, a substring that starts with ar can't be the longest, because you could add star in front of it.
Let's set the star ID to 1; its comparison string is ar.
artifact => its first two characters match the first possible substring, so let's set the artifact ID to 1 as well and change the comparison string to ct.
book => its first two characters don't match anything in the comparison strings (there's only ct there), so we set the book ID to 2 and add a new comparison string: ok
...
list => its first two characters don't match anything in the comparison strings (re from ID == 1 and ok from ID == 2), so we create another ID = 3 and another comparison string
In the end, you just need to go through the IDs and see which one has the most elements. You can probably count it as you go as well.
The main idea of this algorithm is to memorize every substring we're looking for. If we find a match, we just update the right substring with the two new last characters, and if we don't, we add it to the "memory list"
Repeating this procedure makes it O(n*m), with m the number of different IDs.
First, read in all words into a structure. (You don't really need to, but it's easier to work that way. You could also read them in as you go.)
The idea is to have a lookup table (such as a Dictionary in .NET) containing key-value pairs, where the last two letters of a word serve as the key and the corresponding value is always the longest 'substring' found so far that ends with those two letters.
Time complexity is O(n) - you only go through the list once.
Logic:
maxWord <- ""
word <- read next word
initial <- get first two letters of word
end <- get last two letters of word
if lookup contains key initial //that is the longest string so far... add to it
newWord <- lookup [initial] value + ", " + word
if lookup doesn't contain key end //nothing ends with these two letters so far
lookup add (end, newWord) pair
else if lookup [end] value length < newWord length //if this will be the longest string ending in these two letters, we replace the previous one
lookup [end] <- newWord
if maxWord length < newWord length //put this code here so you don't have to run through the lookup table again and find it when you finish
maxWord <- newWord
else //lookup doesn't contain initial, we use only the word, and similar to above, check if it's the longest that ends with these two letters
if lookup doesn't contain key end
lookup add (end, word) pair
else if lookup [end] value length < word length
lookup [end] <- word
if maxWord length < word length
maxWord <- word
The maxWord variable will contain the longest string.
Here is the actual working code in C#, if you want it: http://pastebin.com/7wzdW9Es
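For a flavour of the same logic without leaving this page, here is a rough Java transcription of the pseudocode above (not the linked C# code). It assumes every word has at least two letters; the method and variable names are mine:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LongestWordChain {

    public static String longestChain(List<String> words) {
        // last two letters -> longest chain found so far that ends with those letters
        Map<String, String> lookup = new HashMap<>();
        String maxChain = "";

        for (String word : words) {                       // assumes word.length() >= 2
            String initial = word.substring(0, 2);
            String end = word.substring(word.length() - 2);

            String candidate = lookup.containsKey(initial)
                    ? lookup.get(initial) + ", " + word   // extend the best chain ending with 'initial'
                    : word;                               // otherwise start a new chain

            String existing = lookup.get(end);
            if (existing == null || existing.length() < candidate.length()) {
                lookup.put(end, candidate);
            }
            if (maxChain.length() < candidate.length()) {
                maxChain = candidate;
            }
        }
        return maxChain;
    }

    public static void main(String[] args) {
        System.out.println(longestChain(
                List.of("star", "artifact", "book", "ctenophore", "list", "reply")));
        // star, artifact, ctenophore, reply
    }
}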

String that can contain multiple numbers - how do I extract the longest number?

I have a string that
contains at least one number
can contain multiple numbers
Some examples are:
https://www.facebook.com/permalink.php?story_fbid=53199604568&id=218700384
https://www.facebook.com/username_13/posts/101505775425651120
https://www.facebook.com/username/posts/101505775425699820
I need a way to extract the longest number from the string. So for the 3 strings above, it would extract
53199604568
101505775425651120
101505775425699820
How can I do this?
#get the lines first
text = <<ENDTEXT
https://www.facebook.com/permalink.php?story_fbid=53199604568&id=218700384
https://www.facebook.com/username_13/posts/101505775425651120
https://www.facebook.com/username/posts/101505775425699820
ENDTEXT
lines = text.split("\n")
#this bit is the actual answer to your question
lines.collect{|line| line.scan(/\d+/).sort_by(&:length).last}
Note that I'm returning the numbers as strings here. You could convert them to numbers with to_i.
Parse the list to get an array of integers (use a 64-bit type, since the longer values overflow a 32-bit int), then use the Max function; see array.Max for syntax.
s = "https://www.facebook.com/permalink.php?story_fbid=53199604568&id=218700384"
s.scan(/\d+/).max{|a,b| a.length <=> b.length}.to_i

Algorithm for multiple word matching in text

I have a large set of words (about 10,000) and I need to find if any of those words appear in a given block of text.
Is there a faster algorithm than doing a simple text search for each of the words in the block of text?
Input the 10,000 words into a hashtable, then check whether each word in the block of text has an entry in that hash.
Whether it's faster I don't know; it's just another method (it would depend on how many words you are searching for).
A simple Perl example:
my $word_block = "the guy went afk after being popped by a brownrabbit";
my %hash = ();
my @words = split /\s/, $word_block;
while(<DATA>) { chomp; $hash{$_} = 1; }
foreach my $word (@words)
{
    print "found word: $word\n" if exists $hash{$word};
}
__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown
Try out the Aho-Corasick algorithm:
http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
Build up a trie of your words, and then use that to find which words are in the text.
The answer heavily depends on the actual requirements.
How large is the word list?
How large is the text block?
How many text blocks must be processed?
How often must each text block be processed?
Do the text blocks or the word list change? If so, how frequently?
Assuming relatively small text blocks compared to the word list, and processing each text block only once, I suggest putting the words from the word list into a hash table. Then you can perform a hash lookup for each word in the text block and find out whether the word list contains it.
If you have to process the text blocks multiple times, I suggest inverting the text blocks. Inverting the text blocks means creating, for each word, a list of all the text blocks that contain that word.
In still other situations it might be helpful to generate a bit vector for each text block, with one bit per word indicating whether the word is contained in the text block.
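A small Java sketch of that inversion step, mapping each word to the set of text blocks that contain it (the tokenisation on non-word characters and the use of block indexes are my own assumptions):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class InvertedIndex {

    // Build word -> set of indexes of the text blocks that contain it.
    static Map<String, Set<Integer>> invert(List<String> blocks) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int i = 0; i < blocks.size(); i++) {
            for (String word : blocks.get(i).toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new HashSet<>()).add(i);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> blocks = List.of(
                "the guy went afk after being popped by a brownrabbit",
                "nothing of interest here");
        Map<String, Set<Integer>> index = invert(blocks);
        // each word from the word list is now a single lookup across all blocks
        System.out.println(index.getOrDefault("afk", Set.of()));   // [0]
        System.out.println(index.getOrDefault("trash", Set.of())); // []
    }
}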
You can build a graph used as a state machine: when you process the i-th character of your input word, Ci, you try to move to the i-th level of your graph by checking whether your previous node, linked to Ci-1, has a child node linked to Ci.
For example, if you have the following words in your corpus
("art", "are", "be", "bee")
you will have the following nodes in your graph
n11 = 'a'
n21 = 'r'
n11.sons = (n21)
n31 = 'e'
n32= 't'
n21.sons = (n31, n32)
n41 = 'art' (here we have a leaf in our graph, and the word built from all the upper nodes is associated with this node)
n31.sons = (n41)
n42 = 'are' (here again we have a word)
n32.sons = (n42)
n12 = 'b'
n22 = 'e'
n12.sons = (n22)
n33 = 'e'
n34 = 'be' (word)
n22.sons = (n33,n34)
n43 = 'bee' (word)
n33.sons = (n43)
During the traversal, if you reach a leaf while processing the last character of your input word, and only in this case, it means that your input is in your corpus.
This method is more complicated to implement than a plain Dictionary or Hashtable, but it is much better optimized in terms of memory use.
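A minimal Java sketch of that graph/state-machine idea, modelled as a trie with one child map per node (class and method names are mine; word nodes play the role of the leaves described above):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordTrie {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true for nodes like n41 ('art'), n42 ('are'), n34 ('be'), n43 ('bee')
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Walk the trie character by character; accept only if the walk
    // ends on a word node when the last character has been consumed.
    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        WordTrie trie = new WordTrie();
        for (String w : List.of("art", "are", "be", "bee")) {
            trie.add(w);
        }
        System.out.println(trie.contains("bee")); // true
        System.out.println(trie.contains("bea")); // false
    }
}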
The Boyer-Moore string search algorithm should work. Depending on the size/number of words in the block of text, you might want to use it as the key to search the word list (are there more words in the list than in the block?). Also, you probably want to remove any duplicates from both lists.
