Search for multiple words by prefix (trie data structure) - algorithm

How can I use a trie (or another data structure or algorithm) to efficiently search for multiple words by prefix?
For example: suppose this is my data set:
Alice Jones
Bob Smith
Bobby Walker
John Doe
(10000 names in total)
A trie data structure allows me to efficiently retrieve all names starting with "Bo" (thus without iterating over all the names). But I also want to search on the last name by prefix, thus searching for "Wa" should find "Bobby Walker". And to complicate things: when the user searches for "Bo Wa" this should also find the same name. How can I implement this? Should I use a separate trie structure for each part of the name? (And how to combine the results)?
Background: I'm writing the search functionality for a big address book (10000+ names). I want to have a really fast autocomplete function that shows results while people are typing the first few letters of the first & last name. I already have a solution that uses a regex, but it needs to iterate over all names which is to slow.

A very good data structure would be a Burst Trie
There's a Scala implementation.

You could try a second trie with the reversed string and a wildcard search:http://phpir.com/tries-and-wildcards/

I think that a sorted array will also fit for your requirements, an array containing Person objects (they have a firstName and a lastName field). Let's say that you have a prefix and want to find all the values that fit your prefix. Simply run a binary search to find the first position (let's say is firstIndex) where your prefix appears on firstName and one more to find the last position (lastIndex). Now you can retrieve your values in O(lastIndex - firstIndex). The same goes when you want to find them by lastName. When you have a prefixFirstName and a prefixLastName you can search for the interval where values match for prefixFirstName and then, on this interval, you can check for the values matching the prefixLastName. For a conclusion, when you have one or two prefixes you run 4 binary searches (around 17 iterations per search for 100k names) which is fast enough and you can retrieve them in linear time. Even if it isn't the fastest solution, I suggested it since it's easy to understand and easy to code.

Related

Algorithm to search for a list of words in a text

I have a list of words, fairly small about 1000 or so. I want to check if any of the words in that list occur in an input text. If so I would like know which ones occur. The input text is a few hundred words each and these are text paragraphs from the web - meaning there a lot of them from different sites. I am trying to find the best algorithm for it.
I can see two obvious ways to do this --
A brute force way of searching for each word from the list in the text.
Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast.
Is there a better solution?
I am using python though I am not sure if that changes the algorithm anyway.
Also as an optimization to the solution 2 above, I would like to store the hash table generated to persistent storage (DB) so that if the list of words changes I can re-use the hash table without having to create it again. Of course if the input text changes I have to generate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project and I can only store json documents in it. I am a new to MongoDB and have only just started working with it and still do not fully understand the full potential of it.
I have searched SO and see two questions along similar lines and one of them suggests a hash table but I would like to get any pointers towards the optimization I have in mind.
Here are the previously asked questions on SO -
Is there an efficient algorithm to perform inverted full text search?
Searching a large list of words in another large list
EDIT: I just found another question on SO which is about the same problem.
Algorithm for multiple word matching in text
I guess there is no better solution than a hash table. But I would really like to optimize it so that changes to the word list can let me run the algorithm on all the text I have stored up quickly. Should I change the tags added to the question to also include some database technologies?
There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.
The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.
You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:
"dog|cat|horse|skunk"
And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.
There is a difference, though, in the results from a regex and the results from the Aho-Corasick algorithm. For example if you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." The regex library search will report finding "dogma". The Aho-Corasick implementation will report finding "dog" and "dogma" at the same position.
If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.
Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to only give whole words. Typically, that's done with the \b, as in:
"\b(cat|dog|horse|skunk)\b"
The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:
hashTable = Build hash table from target words
for each word in input text
if word in hashTable then
output word
Or, if you want a list of matching words that are in the input text:
hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
if word in hashTable then
add word to foundWords

Search a string as you type the character

I have contacts stored in my mobile. Lets say my contacts are
Ram
Hello
Hi
Feat
Eat
At
When I type letter 'A' I should get all the matching contacts say "Ram, Feat, Eat, At".
Now I type one more letter T. Now my total string is "AT" now my program should reuse the results of previous search for "A". Now it should return me "Feat, Eat, At"
Design and develop a program for this.
This is interview question at Samsung mobile development
I tried solving with Trie data structures. Could not get good solution for reusing already searched string results. I also tried solution with dictionary data structure, solution has same disadvantage as Trie.
question is how do I search the contacts for each letter typed reusing the search results of earlier searched string? What data structure and algorithm should be used for efficiently solving the problem.
I am not asking for program. So programming language is immaterial for me.
State machine appears to be good solution. Does anyone have suggestion?
Solution should be fast enough for million contacts.
It kind of depends on how many items you're searching. If it's a relatively small list, you can do a string.contains check on everything. So when the user types "A", you search the entire list:
for each contact in contacts
if contact.Name.Contains("A")
Add contact to results
Then the user types "T", and you sequentially search the previous returned results:
for each contact in results
if contact.Name.Contains("AT")
Add contact to new search results
Things get more interesting if the list of contacts is huge, but for the number of contacts that you'd normally have in a phone (a thousand would be a lot!), this is going to work very well.
If the interviewer said, "use the results from the previous search for the new search," then I suspect that this is the answer he was looking for. It would take longer to create a new suffix tree than to just sequentially search the previous result set.
You could optimize this a little bit by storing the position of the substring along with the contact so that all you have to do the next time around is check to see if the next character is as expected, but doing so complicates the algorithm a bit (you have to treat the first search as a special case, and you have to explicitly check string lengths, etc.), and is unlikely to provide much benefit after the first few characters because the size of the list to be searched would be pretty small. The pure sequential search with contains check is going to be plenty fast. Users wouldn't notice the few microseconds you'd save with that optimization.
Update after edit to question
If you want to do this with a million contacts, sequential search might not be the best way to go at the start. Although I'd still give it a try. "Fast enough for a million contacts" raises the question of what exactly "fast enough" means. How long does it take to search one million contacts for the existence of a single letter? How long is the user willing to wait? Remember also that you only have to show one page of contacts before the user takes another action. And you can almost certainly to that before the user presses the second key. Especially if you have a background thread doing the search while the foreground thread handles input and writing the first page of matched strings to the display.
Anyway, you could speed up the initial search by creating a bigram index. That is, for each bigram (sequence of two characters), build a list of names that contain that bigram. You'll also want to create a list of strings for each single character. So, given your list of names, you'd have:
r - ram
a - ram, feat, eat, a
m - ram
h - hello, hi
...
ra - ram
am - ram
...
at - feat, eat, at
...
etc.
I think you get the idea.
That bigram index gets stored in a dictionary or hash map. There are only 325 possible bigrams in the English language, and of course the 26 letters, so at most your dictionary is going to have 351 entries.
So you have almost instant lookup of 1- and 2-character names. How does this help you?
An analysis of Project Gutenberg text shows that the most common bigram in the English language occurs only 3.8% of the time. I realize that names won't share exactly that distribution, but that's a pretty good rough number. So after the first two characters are typed, you'll probably be working with less than 5% of the total names in your list. Five percent of a million is 50,000. With just 50,000 names, you can start using the sequential search algorithm that I described originally.
The cost of this new structure isn't too bad, although it's expensive enough that I'd certainly try the simple sequential search first, anyway. This is going to cost you an extra 2 million references to the names, in the worst case. You could reduce that to a million extra references if you build a 2-level trie rather than a dictionary. That would take slightly longer to lookup and display the one-character search results, but not enough to be noticeable by the user.
This structure is also very easy to update. To add a name, just go through the string and make entries for the appropriate characters and bigrams. To remove a name, go through the name extracting bigrams, and remove the name from the appropriate lists in the bigram index.
Look up "generalized suffix tree", e.g. https://en.wikipedia.org/wiki/Generalized_suffix_tree . For a fixed alphabet size this data structure gives asymptotically optimal solution to find all z matches of a substring of length m in a set of strings in O(z + m) time. Thus you get the same sort of benefit as if you restricted your search to the matches for the previous prefix. Also the structure has optimal O(n) space and build time where n is the total length of all your contacts. I believe you can modify the structure so that you just find the k strings that contain the substring in O(k + m) time, but in general you probably shouldn't have too many matches per contact that have a match, so this may not even be necessary.
What I'm thinking to do is, keeping track of the so far matched string. Suppose in the first step, we identify the strings those have "A" in them and we keep trace of the positions of 'A". Then in the next step we only iterate over these strings and instead of searching them full we only check for occurrence of "T" as the next character to "A" we kept trace in the previous step and so on.

Given a huge set of street names, what is the most efficient way to test whether a text contains one of the street names from the set?

I have an interesting problem that I need help with. I am currently working on a feature of my program and stumbled into this issues
I have a huge list of street names in Indonesia ( > 100k rows ) stored in database,
Each street name may have more than 1 word. For example : "Sudirman", "Gatot Subroto", or "Jalan Asia Afrika" are all legit street names
have a bunch of texts ( > 1 Million rows ) in databases, that I split into sentences. Now, the features ( function to be exact ) that I need to do , is to test whether there are street names inside the sentences or no, so just a true / false test
I have tried to solve it by doing these steps:
a. Putting the street names into a Key,Value Hash
b. Split each sentences into words
c. Test whether words are in the hash
This is fast, but will not work with multiple words
Another alternatives that I thought of is to do these steps:
a. Split each sentences into words
b. Query the database with LIKE statement ( i,e. SELECT #### FROM street_table WHERE name like '%word%' )
c. If query returned a row, it means that the sentence contains street names
Now, this solution is going to be a very IO intensive.
So my question is "What is the most efficient way to do this test" ? regardless of the programming language. I do this in python mainly, but any language will do as long as I can grasp the concepts
============EDIT 1 =================
Will this be periodical ?
Yes, I will call this feature / function with an interval of 1 minute. Each call will take 100 row of texts at least and test them against the street name database
A simple solution would be to create a dictionary/multimap with first-word-of-street-name=>full-street-name(s). When you iterate each word in your sentence you'll look up potential street names, and check if you have a match (by looking at the next words).
This algorithm should be fairly easy to implement and should perform pretty good too.
Using nlp, you can determine the proper noun in a sentence. Please refer to the link below.
http://nlp.stanford.edu/software/lex-parser.shtml
The standford parser is accurate in its calculation. Once you have the proper noun, you can decide the approach to follow.
So you have a document and want to seach if it contains any of your list of streetnames?
Turbo Boyer-Moore is a good starting point for doing that.
Here is more information on turbo boyer moore
But, i strongly believe, you will have to do something about the organisation of your list of street names. there should be some bucket access to it, i.e. you can easily filter for street names:
Here an example:
Street name: Asia-Pacific-street
You can access your list by:
A (getting a starting point for all that start with an A)
AS (getting a starting point for all that start with an AS)
and so on...
I believe you should have lots of buckets for that, at least 26 (first letter) * 26 (second letter)
more information about bucketing
The Aho-Corasick algorithm could be pretty useful. One of it's advantages is that it's run time is independent of how many words you are searching for (only how long the text is you are searching through). It will be especially useful if your list of street names is not changing frequently.
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

String comparison algorithm, relevancy, how much "alike" 2 strings are

I have 2 sources of information for the same data (companies), which I can join together via a unique ID (contract number). The presence of the second, different source, is due to the fact that the 2 sources are updated manually, independently. So what I have is an ID and a company Name in 2 tables.
I need to come up with an algorithm that would compare the Name in the 2 tables for the same ID, and order all the companies by a variable which indicates how different the strings are (to highlight the most different ones, to be placed at the top of the list).
I looked at the simple Levenshtein distance calculation algorithm, but it's at the letter level, so I am still looking for something better.
The reason why Levenshtein doesn't really do the job is this: companies have a name, prefixed or postfixed by the organizational form (LTD, JSC, co. etc). So we may have a lot of JSC "Foo" which will differ a lot from Foo JSC., but what I am really looking for in the database is pairs of different strings like SomeLongCompanyName JSC and JSC OtherName.
Are there any Good ways to do this? (I don't really like the idea of using regex to separate words in each string, then find matches for every word in the other string by using the Levenshtein distance, so I am searching for other ideas)
How about:
1. Replace all punctuation by whitespace.
2. Break the string up into whitespace-delimited words.
3. Move all words of <= 4 characters to the end, sorted alphabetically.
4. Levenshtein.
Could you filter out (remove) those "common words" (similar to removing stop words for fulltext indexing) and then search on that? If not, could you sort the words alphabetically before comparing?
As an alternative or in addition to the Levenshtein distance, you could use Soundex. It's not terribly good, but it can be used to index the data (which is not possible when using Levenshtein).
Thank you both for ideas.
I used 4 indices which are levenshtein distances divided by the sum of the length of both words (relative distances) of the following:
Just the 2 strings
The string composed of the result after separating the word sequences, eliminating the non-word chars, ordering ascending and joining with space as separator.
The string which is contained between quotes (if no such string is present, the original string is taken)
The string composed of alphabetically ordered first characters of each word.
each of these in return is an integer value between 1 and 1000. The resulting value is the product of:
X1^E1 * X2^E2 * X3^E3 * X4^E4
Where X1..X4 are the indices, and E1..E4 are user-provided preferences of valuable (significant) is each index. To keep the result inside reasonable values of 1..1000, the vector (E1..E4) is normalized.
The results are impressive. The whole thing works much faster than I've expected (built it as a CLR assembly in C# for Microsoft SQL Server 2008). After picking E1..E4 correctly, the largest index (biggest difference) on non-null values in the whole database is 765. Right untill about 300 there is virtually no matching company name. Around 200 there are companies that have kind of similar names, and some are the same names but written in very different ways, with abbreviations, additional words, etc. When it comes down to 100 and less - practically all the records contain names that are the same but written with slight differences, and by 30, only the order or the punctuation may differ.
Totally works, result is better than I've expected.
I wrote a post on my blog, to share this library in case someone else needs it.

Fast filtering of a string collection by substring?

Do you know of a method for quickly filtering a list of strings to obtain the subset that contain a specified string? The obvious implementation is to just iterate through the list, checking each string for whether it contains the search string. Is there a way to index the string list so that the search can be done faster?
Wikipedia article lists a few ways to index substrings. You've got:
Suffix tree
Suffix array
N-gram index, an inverted file for all N-grams of the text
Compressed suffix array1
FM-index
LZ-index
Yes, you could for example create an index for all character combinations in the strings. A string like "hello" would be added in the indexes for "he", "el", "ll" and "lo". To search for the string "hell" you would get the index of all strings that exist in all of the "he", "el" and "ll" indexes, then loop through those to check for the actual content in the strings.
If you can preprocess the collection then you can do a lot of different things.
For example, you could build a trie including all your strings' suffixes, then use that to do very fast matching.
If you're going to be repeatedly searching the same text, then a suffix tree is probably worthwhile. If carefully applied, you can achieve linear time processing for most string problems. If not, then in practice you won't be able to do much better than Rabin-Karp, which is based on hashing, and is linear in expected time.
There are many freely available implementations of suffix trees. See for example, this C implementation, or for Java, check out the Biojava framework.
Not really anything that's viable, no, unless you have additional a priori knowledge of your data and/or search term - for instance, if you're only searching for matches at the beginning of your strings, then you could sort the strings and only look at the ones within the bounds of your search term (or even store them in a binary tree and only look at the branches that could possibly match). Likewise, if your potential search terms are limited, you could run all the possible searchs against a string when it's initially input and then just store a table of which terms match and which don't.
Aside from that kind of thing, just iterating through is basically it.
That depends on if the substring is at the beginning of the string or can be anywhere in the string.
If it's anywhere then you pretty much need to iterate over the entire list unless your list is so large and the query happens sufficiently often that it's worth building a more sophisticated indexing solution.
If the substring is at the beginning of the string then it's easy. Sort the list, find the start/end by biseciton search and take that subset.

Resources