Efficiently Querying Phrases in a Huge Document Collection - algorithm

In an interview, I was asked to design a search engine for an online library that contains thousands of documents with millions of words. Initially, the interviewer only asked me to search for single words, clarifying:
Search for exact keywords ("overflow" returns true while "overflo" does not)
Case sensitivity can be ignored
My answer was to use a crawler that runs through each document before any query is executed and builds a lookup table recording which documents each word appears in. Then, once a query is executed, all the algorithm has to do is find the word in the lookup table and return the list of documents it appears in.
As a second step, they asked what I would do if they wanted to search for multiple words (not necessarily consecutive), and my answer was to run a separate lookup for each word and intersect the results.
Finally, the interviewer asked what I would do if they wanted to query consecutive words, i.e. phrases (e.g. "stack overflow"). At this point my lookup table failed, since it stores no connection between consecutive words, and I couldn't come up with a solution within this approach. How can I handle this kind of query? Are there any problems with my initial answers and design? I have searched the Internet but couldn't find anything noteworthy.

For the second case, build the map so that every key is a word and every value is a set of objects with the following properties:
{
    document: location/document_name,   // which document the word occurs in
    index: item_index,                  // the word's position within that document
    toString: a hash of the object      // so set membership tests work
}
Then, when you need to find results for "stack overflow": for every element in the set for "stack", check whether the same document appears in the set for "overflow" with item_index + 1. To get the results for two words, just return the document names that pass this check. If there are three words, repeat the process: for the third word, take only the candidates that already passed for words 1 and 2, and check whether those items exist in set 3 with item_index + 1.
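A minimal sketch of this positional-index idea in Python (the toy corpus, index layout, and function names are my own illustrative assumptions, not the poster's):

from collections import defaultdict

def build_index(docs):
    """Map each word to a set of (document_name, position) pairs."""
    index = defaultdict(set)
    for name, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].add((name, pos))
    return index

def phrase_query(index, phrase):
    """Return the documents containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    # Start with every posting of the first word as a candidate.
    candidates = set(index.get(words[0], set()))
    for offset, word in enumerate(words[1:], start=1):
        postings = index.get(word, set())
        # Keep a candidate only if the same document holds the next word
        # at the next position (the item_index + 1 check described above).
        candidates = {(doc, pos) for doc, pos in candidates
                      if (doc, pos + offset) in postings}
    return {doc for doc, _ in candidates}

docs = {"a.txt": "ask on stack overflow", "b.txt": "the stack may overflow"}
index = build_index(docs)
print(phrase_query(index, "stack overflow"))  # {'a.txt'}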

Related

Is there a way to change the Search API facet count to show a total word count instead of the count of matching fragments (documents)?

I'm creating an application using MarkLogic 8 and the Search API. I need to create facets based on MarkLogic-defined collections, but instead of the facet count giving a tally of the number of fragments (documents) that contain X occurrences of the searched keyword, I need the facet count to reflect the total number of times the keyword appears across all documents in the collection.
Right now, I'm using search:search() to process the query and return a search response element with the facet option enabled.
In the MarkLogic documentation, I've been looking at cts:frequency() which says:
"If you want the total frequency instead of the fragment-based frequency (that is, the total number of occurences of the value in the items specified in the cts:query option of the lexicon API), you must specify the item-frequency option to the lexicon API value input to cts:frequency."
But, I can't get that to work.
I've tried running a query like this in query console, but it times out.
cts:element-values(
  QName("http://www.tei-c.org/ns/1.0", "TEI"),
  "", "item-frequency",
  cts:and-query((
    fn:collection("KirchlicheDogmatik/volume4/part3"),
    cts:word-query("lehre"))))
The issue is probably that you have a range index on <TEI>, which contains the entire document. Range indexes are memory-mapped, so you have essentially forced the complete text contents of your database into memory. It's hard to say exactly what's going on, but it's probably struggling to inspect the values (range indexes are designed for smaller atomic values) and possibly swapping to disk.
MarkLogic has great documentation on its indexing, so I'd recommend starting there for a better understanding on how to use them: https://docs.marklogic.com/guide/concepts/indexing#id_51573
Note that even using the item-frequency option, results (or counts) are not guaranteed to be one-to-one with the "total number of times the keyword appears." It will report the number of "items" matching - in your example it would report on the number of <TEI> elements matching.
The problem of getting an exact count of terms matching a query across the whole database is actually quite hard. To get exact matching values within a document, you would need to use cts:highlight or cts:walk, which requires loading the whole document into memory. That typically works fine for a subset of documents, but ultimately to get an accurate value for the entire database, you would need to load the entire database into memory and process every document.
Nearly any approach to getting a term match count requires some kind of approximation and depends heavily on your markup. For example, if you index <p> (or even better <s>) elements, it would be possible to construct a query that uses indexes to count the number of matching paragraphs (or sentences), but that would still load an incredibly large amount of data into memory and keep it there. This is technically feasible if you are willing to allocate enough memory (and/or enough servers), but it hardly seems worth it.
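To make the fragment-versus-total distinction concrete, here is a generic sketch (plain Python with an invented toy corpus, not MarkLogic code): a document-level index can answer "how many documents match" cheaply, but a true total count requires visiting every matching document.

docs = {
    "d1": "lehre und lehre",
    "d2": "eine andere lehre",
    "d3": "nichts relevantes",
}

term = "lehre"
# Fragment-based frequency: how many documents contain the term at all.
fragment_frequency = sum(1 for text in docs.values() if term in text.split())
# Total frequency: every document must be scanned and its hits counted.
total_frequency = sum(text.split().count(term) for text in docs.values())
print(fragment_frequency, total_frequency)  # 2 3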

Search for multiple words by prefix (trie data structure)

How can I use a trie (or another data structure or algorithm) to efficiently search for multiple words by prefix?
For example: suppose this is my data set:
Alice Jones
Bob Smith
Bobby Walker
John Doe
(10000 names in total)
A trie data structure allows me to efficiently retrieve all names starting with "Bo" (thus without iterating over all the names). But I also want to search on the last name by prefix, thus searching for "Wa" should find "Bobby Walker". And to complicate things: when the user searches for "Bo Wa" this should also find the same name. How can I implement this? Should I use a separate trie structure for each part of the name? (And how to combine the results)?
Background: I'm writing the search functionality for a big address book (10,000+ names). I want a really fast autocomplete that shows results while people are typing the first few letters of the first & last name. I already have a solution that uses a regex, but it needs to iterate over all names, which is too slow.
A very good data structure would be a Burst Trie.
There's a Scala implementation.
You could try a second trie with the reversed string and a wildcard search: http://phpir.com/tries-and-wildcards/
I think a sorted array will also fit your requirements: an array of Person objects (each with a firstName and a lastName field). Say you have a prefix and want to find all values that match it. Run one binary search to find the first position where the prefix appears in firstName (call it firstIndex), and one more to find the last position (lastIndex). Now you can retrieve the matching values in O(lastIndex - firstIndex). The same goes when you want to search by lastName. When you have both a prefixFirstName and a prefixLastName, you can find the interval where values match prefixFirstName and then, within that interval, check for values matching prefixLastName. In conclusion, with one or two prefixes you run four binary searches (around 17 iterations per search for 100k names), which is fast enough, and you retrieve the matches in linear time. Even if it isn't the fastest solution, I suggest it because it's easy to understand and easy to code.
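A sketch of the sorted-array idea, using Python's bisect module for the binary searches (storing full-name strings rather than Person objects, and splitting on whitespace for the last name, are simplifying assumptions):

import bisect

names = sorted(["Alice Jones", "Bob Smith", "Bobby Walker", "John Doe"])

def prefix_range(sorted_names, prefix):
    """Return the slice of sorted_names whose entries start with prefix."""
    lo = bisect.bisect_left(sorted_names, prefix)
    hi = bisect.bisect_right(sorted_names, prefix + "\uffff")
    return sorted_names[lo:hi]

def search(first_prefix, last_prefix=""):
    # Narrow to the first-name interval with two binary searches, then
    # scan that (small) interval for the last-name prefix.
    return [name for name in prefix_range(names, first_prefix)
            if name.split()[1].startswith(last_prefix)]

print(prefix_range(names, "Bo"))  # ['Bob Smith', 'Bobby Walker']
print(search("Bo", "Wa"))         # ['Bobby Walker']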

Prefix similarity search

I am trying to find a way to build a fuzzy search where both the text database and the queries may have spelling variants. In particular, the text database is material collected from the web and likely would not benefit from a full-text engine's prep phase (word stemming).
I could imagine using pg_trgm as a starting point and then validating hits with Levenshtein.
However, people tend to type prefix queries. E.g., in the realm of music, I would expect "beetho symphony" to be a reasonable search term. So, if someone were typing "betho symphony", is there a reasonable way (using PostgreSQL, with perhaps Tcl or Perl scripting) to discover that the "betho" part should be compared with "beetho" (returning an edit distance of 1)?
What I ended up with is a simple modification of the common algorithm: normally one would just pick the last value from the matrix (or from the final vector pair). Referring to the "iterative" algorithm at http://en.wikipedia.org/wiki/Levenshtein_distance, I put the string to be probed as the first argument and the query string as the second. Now, when the algorithm finishes, the minimum value in the final column gives the proper result.
Sample results:
query "fantas", words in database "fantasy", "fantastic" => 0
query "fantas", wor in database "fan" => 3
The inputs to the edit distance are words pre-selected from a candidate list based on trigram similarity.
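A sketch of that modification in Python: build the usual DP matrix with the probed word on one axis and the query on the other, then take the minimum of the final column instead of the bottom-right cell (the function name and test values simply echo the examples above):

def prefix_distance(word, query):
    """Edit distance from `query` to the closest prefix of `word`."""
    m, n = len(word), len(query)
    # dist[i][j] = Levenshtein distance between word[:i] and query[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if word[i - 1] == query[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete from word
                             dist[i][j - 1] + 1,         # insert into word
                             dist[i - 1][j - 1] + cost)  # substitute
    # Minimum of the final column: best match of the whole query
    # against any prefix of the word.
    return min(dist[i][n] for i in range(m + 1))

print(prefix_distance("fantasy", "fantas"))    # 0
print(prefix_distance("fantastic", "fantas"))  # 0
print(prefix_distance("fan", "fantas"))        # 3
print(prefix_distance("beethoven", "betho"))   # 1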
You can modify the edit distance algorithm to give a lower weight to the latter part of the string.
E.g.: Match(i,j) = 1/max(i,j)^2 instead of Match(i,j) = 1 for every i and j (i and j are the positions of the symbols being compared).
What this does is ensure that dist('ABCD', 'ABCE') < dist('ABCD', 'EBCD').
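A sketch of that weighting in Python (this is my reading of the suggestion: every edit at cell (i, j) is charged 1/max(i, j)^2 instead of 1, so errors near the start of the string cost more):

def weighted_distance(a, b):
    """Edit distance where edits at position (i, j) cost 1/max(i, j)**2."""
    m, n = len(a), len(b)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + 1 / i ** 2
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + 1 / j ** 2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = 1 / max(i, j) ** 2
            cost = 0 if a[i - 1] == b[j - 1] else w
            dist[i][j] = min(dist[i - 1][j] + w,
                             dist[i][j - 1] + w,
                             dist[i - 1][j - 1] + cost)
    return dist[m][n]

# A late mismatch is cheaper than an early one:
print(weighted_distance("ABCD", "ABCE") < weighted_distance("ABCD", "EBCD"))  # True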

Hash Table and Substring Matching

I have hundreds of keys, for example:
redapple
maninred
foraman
blueapple
I have data related to these keys; each datum is a string with its key at the end.
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
foraman: they-bought-the-present-foraman
blueapple: it-was-surprising-but-it-was-a-blueapple
I am expected to use a hash table and a hash function to record the data according to the keys, and to be able to retrieve data from the table.
I know how to use a hash function and a hash table; there is no problem there.
But I am also expected to give the program a string that occurs as a substring and retrieve the data for all matching keys.
For example:
i must give "red" and must be able to get
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
as output.
or
i must give "apple" and must be able to get
redapple: the-tree-has-redapple
blueapple: it-was-surprising-but-it-was-a-blueapple
as output.
The only approach I can think of is to search all keys for a matching substring; is there some other solution? If I search all the key strings on every query, hashing is unneeded and meaningless, isn't it?
Besides, searching all keys for a substring is O(N), and I am expected to solve the problem in O(1).
With hashing I can hash a key, e.g. "redapple" to 943, and "maninred" to 332.
But given the query string "red", how can I find out from 943 and 332 that those keys contain "red" as a substring? It is beyond my CS thinking skills.
Thanks for any advice or ideas.
Possibly you should use an inverted index over n-grams; the same approach is used for spell correction. For the word redapple you will have the following set of 3-grams: red, eda, dap, app, ppl, ple. For each n-gram you keep a list of the strings that contain it. For example, for red it will be
red -> maninred, redapple
The words in this list must be ordered. When you want to find all strings that contain a given substring, you divide the substring into n-grams and intersect the word lists of those n-grams.
This algorithm is not O(1), but in practice it is fast enough.
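A sketch of that inverted trigram index in Python (function names and the final verification step are my own additions; queries shorter than the gram size would need extra handling):

from collections import defaultdict

def build_trigram_index(keys):
    """Map every 3-gram to the set of keys containing it."""
    index = defaultdict(set)
    for key in keys:
        for i in range(len(key) - 2):
            index[key[i:i + 3]].add(key)
    return index

def substring_search(index, query):
    """Intersect the posting lists of the query's trigrams."""
    grams = [query[i:i + 3] for i in range(len(query) - 2)]
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Trigram intersection can over-match, so verify each candidate.
    return {key for key in candidates if query in key}

keys = ["redapple", "maninred", "foraman", "blueapple"]
index = build_trigram_index(keys)
print(substring_search(index, "red"))    # {'redapple', 'maninred'}
print(substring_search(index, "apple"))  # {'redapple', 'blueapple'}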
It cannot be done nicely with a hash table: given a substring, you cannot predict the hash of the entire string.(1)
A reasonable alternative is a suffix tree. Each terminal in the suffix tree holds a list of references to the complete strings its suffix belongs to.
Given a substring t, if it is indeed a substring of some s in your collection, then there is a suffix x of s such that t is a prefix of x. So traverse the suffix tree while reading t, then collect all the terminals reachable from the node you arrive at. These terminals reference all the needed strings.
(1) Assuming a reasonable hash function; if hashCode() == 0 for every element, you can obviously predict the hash value.
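The same principle can be illustrated with a plain sorted suffix array instead of a real suffix tree (a simplification of the answer above, relying on the fact that every substring of a key is a prefix of one of its suffixes):

import bisect

def build_suffix_array(keys):
    """Collect (suffix, key) pairs for every suffix of every key, sorted."""
    suffixes = [(key[i:], key) for key in keys for i in range(len(key))]
    suffixes.sort()
    return suffixes

def substring_lookup(suffixes, query):
    # Binary-search the first suffix starting with `query`, then scan
    # forward while suffixes still start with it.
    pos = bisect.bisect_left(suffixes, (query,))
    matches = set()
    while pos < len(suffixes) and suffixes[pos][0].startswith(query):
        matches.add(suffixes[pos][1])
        pos += 1
    return matches

suffixes = build_suffix_array(["redapple", "maninred", "foraman", "blueapple"])
print(substring_lookup(suffixes, "apple"))  # {'redapple', 'blueapple'}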
I have researched this problem recently and I'm sure it cannot be done with a hash table alone. Like you, I hoped a hash table would help me improve search speed, but it left me disappointed.

Fuzzy match of an English sentence with a set of English sentences stored in a database

There are about 1000 records in a database table. There is a column named title which is used to store the titles of articles. Before inserting a record, I need to check whether an article with a similar title already exists in that table. If so, I will skip the insert.
What's the fastest way to perform this kind of fuzzy matching? Assume all words in the sentences can be found in an English dictionary. If 70% of the words in sentence #1 can be found in sentence #2, we consider them a match. Ideally, the algorithm can pre-compute a value for each sentence so that the value can be stored in the database.
For 1000 records, doing the dumb thing and just iterating over all the records could work (assuming the strings aren't too long and you aren't getting hit with too many queries). Just pull all of the titles out of your database, and then sort them by their distance to your given string (for example, you could use Levenshtein distance as the metric).
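A sketch of that brute-force loop in Python (difflib's similarity ratio stands in for Levenshtein distance here, purely to keep the example self-contained):

import difflib

def closest_titles(titles, query, limit=5):
    """Rank all titles by string similarity to `query`, most similar first."""
    return sorted(titles,
                  key=lambda t: difflib.SequenceMatcher(None, t, query).ratio(),
                  reverse=True)[:limit]

print(closest_titles(["what this love is for?", "an unrelated title"],
                     "what is love"))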
A fancier way to do approximate string matching would be to precompute n-grams of all your strings and store them in your database (some systems support this feature natively). This will definitely scale better performance-wise, but it could mean more work:
http://en.wikipedia.org/wiki/N-gram
You can read up on forward/reverse indexing of token-value storage for faster search results. I personally prefer reverse indexing, which stores a hash map from token (key) to value (here, the title).
Whenever you write a new article, like a new Stack Overflow question, the tokens in its title would be looked up and matched against all the titles available.
To optimize the result, i.e. to get the fuzzy ranking, you can sort the titles by how many of the searched tokens they contain. E.g., if t1, t2 and t3 refer to the tokens 'what', 'is' and 'love', then the title 'what this love is for?' would appear in all three token mappings, so it would be placed topmost.
You can play around with this further. I hope this approach is simple and appealing.
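A sketch combining that reverse index with the 70% word-overlap rule from the question (the threshold handling and names are my own assumptions):

from collections import defaultdict

def build_reverse_index(titles):
    """Map each lowercased token to the set of titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for token in title.lower().split():
            index[token].add(title)
    return index

def similar_titles(index, new_title, threshold=0.7):
    tokens = set(new_title.lower().split())
    hits = defaultdict(int)  # candidate title -> number of shared tokens
    for token in tokens:
        for title in index.get(token, ()):
            hits[title] += 1
    # Keep candidates sharing at least `threshold` of the new title's
    # words, most shared tokens first.
    return [t for t, n in sorted(hits.items(), key=lambda kv: -kv[1])
            if n / len(tokens) >= threshold]

index = build_reverse_index(["what this love is for?", "an unrelated title"])
print(similar_titles(index, "what is love"))  # ['what this love is for?']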
