Hash Table and Substring Matching - algorithm

I have hundreds of keys for example like:
redapple
maninred
foraman
blueapple
i have data related to these keys, data is a string and has related key at the end.
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
foraman: they-bought-the-present-foraman
blueapple: it-was-surprising-but-it-was-a-blueapple
i am expected to use hash table and hash function to record the data according to keys and i am expected to be able to retieve data from table.
i know to use hash function and hash table, there is no problem here.
But;
i am expected to give the program a string which takes place as a substring and retrieve the data for the matching keys.
For example:
i must give "red" and must be able to get
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
as output.
or
i must give "apple" and must be able to get
redapple: the-tree-has-redapple
blueapple: it-was-surprising-but-it-was-a-blueapple
as output.
i only can think to search all keys if they has a matching substring, is there some other solution? If i search all the key strings for every query, use of hashing is unneeded, meaningless, is it?
But, searching all keys for substring is O(N), i am expected to solve the problem with O(1).
With hashing i can hash a key e.g. "redapple" to e.g. 943, and "maninred" to e.g. 332.
And query man give the string "red" how can i found out from 943 and 332 that the keys has "red" substring? It is out of my cs thinking skills.
Thanks for any advise, idea.

Possible you should use the invert index for n-gramm, the same approach is used for spell correction. For word redapple you will have following set of 3-gramms red, eda, dap, app, ppl, ple. For each n-gramm you will have a list of string in which contains it. For example for red it will be
red -> maninred, redapple
words in this list must be ordered. When you want to find the all string that contains a a give substring, you dived the substring on n-gramm and intercept the list of words for n-gramm.
This alogriphm is not O(n), but it practice it has enough speed.

It cannot be nicely done in a hash table. Given a a substring - you cannot predict the hashed result of the entire string1
A reasonable alternative is using a suffix tree. Each terminal in the suffix tree will hold list of references of the complete strings, this suffix is related to.
Given a substring t, if it is indeed a substring of some s in your collection, then there is a suffix x of s - such that t is a prefix of x. By traversing the suffix tree while reading t, and find all the terminals reachable from the the node you reached from there. These terminals contain all the needed strings.
(1) assuming reasonable hash function, if hashCode() == 0 for each element, you can obviously predict the hash value.

I have researched this problem recently and i'm sure that this can not be done. I hope hash table will help me improve speed of searching like you but it makes me disapointed.

Related

Search for multiple words by prefix (trie data structure)

How can I use a trie (or another data structure or algorithm) to efficiently search for multiple words by prefix?
For example: suppose this is my data set:
Alice Jones
Bob Smith
Bobby Walker
John Doe
(10000 names in total)
A trie data structure allows me to efficiently retrieve all names starting with "Bo" (thus without iterating over all the names). But I also want to search on the last name by prefix, thus searching for "Wa" should find "Bobby Walker". And to complicate things: when the user searches for "Bo Wa" this should also find the same name. How can I implement this? Should I use a separate trie structure for each part of the name? (And how to combine the results)?
Background: I'm writing the search functionality for a big address book (10000+ names). I want to have a really fast autocomplete function that shows results while people are typing the first few letters of the first & last name. I already have a solution that uses a regex, but it needs to iterate over all names which is to slow.
A very good data structure would be a Burst Trie
There's a Scala implementation.
You could try a second trie with the reversed string and a wildcard search:http://phpir.com/tries-and-wildcards/
I think that a sorted array will also fit for your requirements, an array containing Person objects (they have a firstName and a lastName field). Let's say that you have a prefix and want to find all the values that fit your prefix. Simply run a binary search to find the first position (let's say is firstIndex) where your prefix appears on firstName and one more to find the last position (lastIndex). Now you can retrieve your values in O(lastIndex - firstIndex). The same goes when you want to find them by lastName. When you have a prefixFirstName and a prefixLastName you can search for the interval where values match for prefixFirstName and then, on this interval, you can check for the values matching the prefixLastName. For a conclusion, when you have one or two prefixes you run 4 binary searches (around 17 iterations per search for 100k names) which is fast enough and you can retrieve them in linear time. Even if it isn't the fastest solution, I suggested it since it's easy to understand and easy to code.

How to search in polynomial hash table in ascending order in C?

I there guys,
i'm developing a small program in C, that reads strings from a .txt file with 2 letters and 3 numbers format. Like this
AB123
I developed a polynomial hash function, that calculates an hash key like this
hash key(k) = k1 + k2*A² + k3*A^3... +Kn*A^n
where k1 is the 1º letter of the word, k2 the 2º (...) and A is a prime number to improve the number of collisions, in my case its 11.
Ok, so i got the table generated, i can search in the table no problem, but only if i got the full word... That i could figure it out.
But what if i only want to use the first letter? Is it possible to search in the hash table, and get the elements started by for example 'A' without going through every element?
In order to have more functionality you have to introduce more data structures. It all depends on how deep you want to go, which depends on what exactly you need to code to do.
I suspect that you want some kind of filtering for the user. When user enters "A" it should be given all strings that have "A" at the start, and when afterwards it enters "B" the list should be filtered down to all strings starting with "AB".
If this is the case then you don't need over-complicated structures. Just iterate through the list and give the user the appropriate sublist. Humans are slow, and they won't notice the difference between 3 ms response and 300 ms response.
If your hash function is well designed, every place in the table is capable of storing a string beginning with any prefix, so this approach is doomed from the start.
It sounds like what you really want might be a trie.

Algorithm to search for a list of words in a text

I have a list of words, fairly small about 1000 or so. I want to check if any of the words in that list occur in an input text. If so I would like know which ones occur. The input text is a few hundred words each and these are text paragraphs from the web - meaning there a lot of them from different sites. I am trying to find the best algorithm for it.
I can see two obvious ways to do this --
A brute force way of searching for each word from the list in the text.
Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast.
Is there a better solution?
I am using python though I am not sure if that changes the algorithm anyway.
Also as an optimization to the solution 2 above, I would like to store the hash table generated to persistent storage (DB) so that if the list of words changes I can re-use the hash table without having to create it again. Of course if the input text changes I have to generate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project and I can only store json documents in it. I am a new to MongoDB and have only just started working with it and still do not fully understand the full potential of it.
I have searched SO and see two questions along similar lines and one of them suggests a hash table but I would like to get any pointers towards the optimization I have in mind.
Here are the previously asked questions on SO -
Is there an efficient algorithm to perform inverted full text search?
Searching a large list of words in another large list
EDIT: I just found another question on SO which is about the same problem.
Algorithm for multiple word matching in text
I guess there is no better solution than a hash table. But I would really like to optimize it so that changes to the word list can let me run the algorithm on all the text I have stored up quickly. Should I change the tags added to the question to also include some database technologies?
There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.
The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.
You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:
"dog|cat|horse|skunk"
And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.
There is a difference, though, in the results from a regex and the results from the Aho-Corasick algorithm. For example if you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." The regex library search will report finding "dogma". The Aho-Corasick implementation will report finding "dog" and "dogma" at the same position.
If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.
Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to only give whole words. Typically, that's done with the \b, as in:
"\b(cat|dog|horse|skunk)\b"
The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:
hashTable = Build hash table from target words
for each word in input text
if word in hashTable then
output word
Or, if you want a list of matching words that are in the input text:
hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
if word in hashTable then
add word to foundWords

String comparison algorithm, relevancy, how much "alike" 2 strings are

I have 2 sources of information for the same data (companies), which I can join together via a unique ID (contract number). The presence of the second, different source, is due to the fact that the 2 sources are updated manually, independently. So what I have is an ID and a company Name in 2 tables.
I need to come up with an algorithm that would compare the Name in the 2 tables for the same ID, and order all the companies by a variable which indicates how different the strings are (to highlight the most different ones, to be placed at the top of the list).
I looked at the simple Levenshtein distance calculation algorithm, but it's at the letter level, so I am still looking for something better.
The reason why Levenshtein doesn't really do the job is this: companies have a name, prefixed or postfixed by the organizational form (LTD, JSC, co. etc). So we may have a lot of JSC "Foo" which will differ a lot from Foo JSC., but what I am really looking for in the database is pairs of different strings like SomeLongCompanyName JSC and JSC OtherName.
Are there any Good ways to do this? (I don't really like the idea of using regex to separate words in each string, then find matches for every word in the other string by using the Levenshtein distance, so I am searching for other ideas)
How about:
1. Replace all punctuation by whitespace.
2. Break the string up into whitespace-delimited words.
3. Move all words of <= 4 characters to the end, sorted alphabetically.
4. Levenshtein.
Could you filter out (remove) those "common words" (similar to removing stop words for fulltext indexing) and then search on that? If not, could you sort the words alphabetically before comparing?
As an alternative or in addition to the Levenshtein distance, you could use Soundex. It's not terribly good, but it can be used to index the data (which is not possible when using Levenshtein).
Thank you both for ideas.
I used 4 indices which are levenshtein distances divided by the sum of the length of both words (relative distances) of the following:
Just the 2 strings
The string composed of the result after separating the word sequences, eliminating the non-word chars, ordering ascending and joining with space as separator.
The string which is contained between quotes (if no such string is present, the original string is taken)
The string composed of alphabetically ordered first characters of each word.
each of these in return is an integer value between 1 and 1000. The resulting value is the product of:
X1^E1 * X2^E2 * X3^E3 * X4^E4
Where X1..X4 are the indices, and E1..E4 are user-provided preferences of valuable (significant) is each index. To keep the result inside reasonable values of 1..1000, the vector (E1..E4) is normalized.
The results are impressive. The whole thing works much faster than I've expected (built it as a CLR assembly in C# for Microsoft SQL Server 2008). After picking E1..E4 correctly, the largest index (biggest difference) on non-null values in the whole database is 765. Right untill about 300 there is virtually no matching company name. Around 200 there are companies that have kind of similar names, and some are the same names but written in very different ways, with abbreviations, additional words, etc. When it comes down to 100 and less - practically all the records contain names that are the same but written with slight differences, and by 30, only the order or the punctuation may differ.
Totally works, result is better than I've expected.
I wrote a post on my blog, to share this library in case someone else needs it.

Fast filtering of a string collection by substring?

Do you know of a method for quickly filtering a list of strings to obtain the subset that contain a specified string? The obvious implementation is to just iterate through the list, checking each string for whether it contains the search string. Is there a way to index the string list so that the search can be done faster?
Wikipedia article lists a few ways to index substrings. You've got:
Suffix tree
Suffix array
N-gram index, an inverted file for all N-grams of the text
Compressed suffix array1
FM-index
LZ-index
Yes, you could for example create an index for all character combinations in the strings. A string like "hello" would be added in the indexes for "he", "el", "ll" and "lo". To search for the string "hell" you would get the index of all strings that exist in all of the "he", "el" and "ll" indexes, then loop through those to check for the actual content in the strings.
If you can preprocess the collection then you can do a lot of different things.
For example, you could build a trie including all your strings' suffixes, then use that to do very fast matching.
If you're going to be repeatedly searching the same text, then a suffix tree is probably worthwhile. If carefully applied, you can achieve linear time processing for most string problems. If not, then in practice you won't be able to do much better than Rabin-Karp, which is based on hashing, and is linear in expected time.
There are many freely available implementations of suffix trees. See for example, this C implementation, or for Java, check out the Biojava framework.
Not really anything that's viable, no, unless you have additional a priori knowledge of your data and/or search term - for instance, if you're only searching for matches at the beginning of your strings, then you could sort the strings and only look at the ones within the bounds of your search term (or even store them in a binary tree and only look at the branches that could possibly match). Likewise, if your potential search terms are limited, you could run all the possible searchs against a string when it's initially input and then just store a table of which terms match and which don't.
Aside from that kind of thing, just iterating through is basically it.
That depends on if the substring is at the beginning of the string or can be anywhere in the string.
If it's anywhere then you pretty much need to iterate over the entire list unless your list is so large and the query happens sufficiently often that it's worth building a more sophisticated indexing solution.
If the substring is at the beginning of the string then it's easy. Sort the list, find the start/end by biseciton search and take that subset.

Resources