Data Structures for Indexing

I was just reading about indexing and discovered that there are two main data structures that can be used for indexing, namely:
1) Inverted indexes
2) Suffix trees
It appears to me that a suffix tree, by its very structure, has no need for join queries to answer phrase queries if it indexes the text of the whole document as a single string.
So why are people still using and talking about inverted indexes?

Suffix trees can answer exact phrase queries easily, but inverted indexes are more versatile and support everything else you need, such as stemming, synonym matching, and result ranking, unless you extend your suffix tree to also carry inverted-index information.
Also, exact phrase queries are not that common, while suffix trees are far more complicated, slow to build, and require much more storage. For typical full-text search applications, that is too high a price for what you get.
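
To make the contrast concrete, here is a minimal sketch of how a positional inverted index answers a phrase query without any join-like machinery. All class and method names are illustrative; real engines store postings far more compactly than nested hash maps.

    import java.util.*;

    // Minimal positional inverted index: term -> (docId -> positions).
    // Illustrative names only; real engines compress postings heavily.
    class PositionalIndex {
        private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

        void add(int docId, String text) {
            String[] tokens = text.toLowerCase().split("\\s+");
            for (int pos = 0; pos < tokens.length; pos++) {
                postings.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                        .computeIfAbsent(docId, d -> new ArrayList<>())
                        .add(pos);
            }
        }

        // A document matches when every phrase term occurs at consecutive
        // positions, so no join is needed: just position arithmetic.
        Set<Integer> phraseQuery(String... terms) {
            Set<Integer> result = new HashSet<>();
            Map<Integer, List<Integer>> first = postings.getOrDefault(terms[0], Map.of());
            for (Map.Entry<Integer, List<Integer>> doc : first.entrySet()) {
                for (int start : doc.getValue()) {
                    boolean match = true;
                    for (int i = 1; i < terms.length && match; i++) {
                        List<Integer> p = postings.getOrDefault(terms[i], Map.of()).get(doc.getKey());
                        match = p != null && p.contains(start + i);
                    }
                    if (match) { result.add(doc.getKey()); break; }
                }
            }
            return result;
        }
    }

Indexing "the quick brown fox" as document 1 and calling phraseQuery("quick", "brown") returns {1}, since "brown" appears exactly one position after "quick".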

Related

Which products or databases use the Trie data structure, and for what purposes?

I wanted to understand the usage of the Trie data structure. I have seen that a Trie is good for type-ahead suggestions.
My understanding was that Tries are used for distributed searching as well. But I learned that Elasticsearch uses inverted indexes, which seem to be a tabular structure.
So I wanted to understand which kinds of products/DBs use a Trie and for what purposes, and where it makes sense to use an inverted index rather than a Trie.
I expected products like Elasticsearch to use a Trie, but it uses an inverted index.
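
For context on the type-ahead use case mentioned in the question, here is a minimal Trie sketch showing the prefix lookup that an inverted index does not directly provide. The class is illustrative and not taken from any product; Lucene, for instance, uses compressed FSTs rather than a plain pointer Trie.

    import java.util.*;

    // Minimal Trie for type-ahead: collect every indexed word under a prefix.
    class Trie {
        private final Map<Character, Trie> children = new HashMap<>();
        private boolean isWord;

        void insert(String word) {
            Trie node = this;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Trie());
            }
            node.isWord = true;
        }

        List<String> suggest(String prefix) {
            Trie node = this;
            for (char c : prefix.toCharArray()) {
                node = node.children.get(c);
                if (node == null) return List.of();  // no word starts with this prefix
            }
            List<String> out = new ArrayList<>();
            collect(node, new StringBuilder(prefix), out);
            return out;
        }

        private static void collect(Trie node, StringBuilder path, List<String> out) {
            if (node.isWord) out.add(path.toString());
            for (Map.Entry<Character, Trie> e : node.children.entrySet()) {
                path.append(e.getKey());
                collect(e.getValue(), path, out);
                path.deleteCharAt(path.length() - 1);
            }
        }
    }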

Lucene index modeling - Why are skip lists used instead of a B-tree?

I have recently started learning Lucene and came to know how Lucene stores and queries indices. Lucene seems to be using a skip list as an underlying data structure. However, I did not find any reason to use a skip list over a binary tree.
The advantage of skip lists is that they perform well under concurrent use. But Lucene allows a single writer thread per index, and readers read from immutable segments, so the skip list is not helping there either. Beyond that, a self-balancing binary tree trumps a skip list: it provides worst-case O(log n) complexity for reading and writing, whereas a skip list provides the same complexity only in the average case. A binary tree would also serve range queries in better time than a skip list. And for serving a conjunction query, Lucene uses the skip lists of multiple postings lists to find their intersection; a binary tree would have been enough for that case too.
Is there any specific reason, which I have missed, that skip lists are used in Lucene for indexing?
Lucene builds an inverted index using skip lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). See this SO answer for How does lucene index documents?
That answer also indicates that the primary benefit of using skip lists is that it avoids ever having to rebalance a B-tree. If you'd like to dig deeper, that answer cites another one that provides a lot more detail: Skip List vs. Binary Search Tree, which in turn references additional whitepapers.
Researching this a bit more, there is one other advantage to using skip lists rather than a B-tree: it is not just the rebalancing that is avoided, but also the locking of a portion of the tree while the rebalancing takes place. This aspect is discussed further here. This latter advantage improves concurrency.
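
To illustrate the access pattern in question, here is a sketch of the leapfrog intersection of two sorted postings lists. Plain arrays stand in for Lucene's on-disk structures; the skip list's only job is to make each forward jump cheap, with purely sequential reads and no rebalancing.

    import java.util.*;

    // Leapfrog intersection of two sorted postings lists, the access pattern
    // behind Lucene's advance()/skipTo() calls during conjunction queries.
    class PostingsIntersection {
        static List<Integer> intersect(int[] a, int[] b) {
            List<Integer> hits = new ArrayList<>();
            int i = 0, j = 0;
            while (i < a.length && j < b.length) {
                if (a[i] == b[j]) { hits.add(a[i]); i++; j++; }
                else if (a[i] < b[j]) i++;  // a skip list jumps several entries here
                else j++;
            }
            return hits;
        }
    }

Note that the cursors only ever move forward, which is why a forward-only skip structure suffices and a tree's ordered-traversal machinery buys nothing.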

ElasticSearch / Lucene: automata for the dictionary of terms

Issue: I need to highlight matched terms. The out-of-the-box solution cannot be applied because we don't keep sources inside ES.
Possible solution:
Retrieve ids from ES by search query
Retrieve sources by ids
Match the source with the query word by word, using the Levenshtein distance algorithm or a Lucene FSM class
Considering we don't retrieve a lot of content at a time, this should not take much time.
The question is the following:
Does the Lucene library contain an FSM/automaton that can represent a dictionary? The desired solution: get a Lucene automaton representing the dictionary and feed the query to it term by term. The automaton should accept terms that are contained in the dictionary. Edit distance should be taken into account as well.
Searching for a solution, I found Lucene classes like LevenshteinAutomata and FuzzyQuery. But LevenshteinAutomata (as I understand it) represents only one term, so for several terms I would need several automata.
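
Absent a ready-made multi-term automaton, one hedged workaround is to check each query term against the dictionary with a plain dynamic-programming edit distance. This sketch is illustrative, not a Lucene API; it assumes the small per-request result sets described above make brute force acceptable.

    import java.util.*;

    // Accept a query term if some dictionary term is within maxEdits of it,
    // using the classic DP edit distance. All names are illustrative.
    class FuzzyDictionary {
        private final Set<String> terms;

        FuzzyDictionary(Collection<String> dictionary) {
            this.terms = new HashSet<>(dictionary);
        }

        boolean accepts(String query, int maxEdits) {
            if (terms.contains(query)) return true;  // exact hit, no DP needed
            for (String t : terms) {
                if (Math.abs(t.length() - query.length()) <= maxEdits  // cheap pre-filter
                        && editDistance(t, query) <= maxEdits) {
                    return true;
                }
            }
            return false;
        }

        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + subst);
                }
            }
            return d[a.length()][b.length()];
        }
    }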

Would my approach to fuzzy search, for my dataset, be better than using Lucene?

I want to implement a fuzzy search facility in the web app I'm currently working on. The back end is in Java, and it just so happens that the search engine everyone here recommends, Lucene, is coded in Java as well. I am, however, shying away from using it for several reasons:
I would feel accomplished building something of my own.
Lucene has a plethora of features that I don't see myself utilizing; I'd like to minimize bloat.
From what I understand, Lucene's fuzzy search implementation manually evaluates the edit distance of every indexed term. I feel the approach I want to take (detailed below) would be more efficient.
The data to be indexed could potentially be the entire set of nouns and pronouns in the English language, so you can see how Lucene's approach to fuzzy search makes me wary.
What I want to do is take an n-gram-based approach to the problem: read and tokenize each item from the database and save the results to disk in files named by a given n-gram and its location.
For example: let's assume n = 3 and my file-naming scheme is something like: [n-gram]_[location_of_n-gram_in_string].txt.
The file bea_0.txt would contain:
bear
beau
beacon
beautiful
beats by dre
When I receive a term to be searched, I can simply tokenize it into n-grams and use them, along with their corresponding locations, to read in the corresponding n-gram files (if present). I can then perform any filtering operations (eliminating those not within a given length range, performing edit distance calculations, etc.) on this set of data instead of doing so for the entire dataset.
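
For illustration, here is a small sketch of how a term could be split into positional n-grams and mapped onto the [n-gram]_[location].txt buckets described above. The names mirror the scheme in the question and are otherwise hypothetical.

    import java.util.*;

    // Sketch of the proposed scheme: split a term into positional n-grams and
    // derive the [n-gram]_[location].txt bucket each one belongs to.
    class NGramBuckets {
        static List<String> bucketFiles(String term, int n) {
            List<String> files = new ArrayList<>();
            for (int i = 0; i + n <= term.length(); i++) {
                files.add(term.substring(i, i + n) + "_" + i + ".txt");
            }
            return files;
        }

        public static void main(String[] args) {
            // "beacon" -> [bea_0.txt, eac_1.txt, aco_2.txt, con_3.txt]
            System.out.println(bucketFiles("beacon", 3));
        }
    }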
My question is... well I guess I have a couple of questions.
Have there been any improvements to Lucene's fuzzy search that I'm not aware of that would make my approach unnecessary?
Is this a good approach to implementing fuzzy search (considering the set of data I'm dealing with), or is there something I'm oversimplifying/missing?
Lucene 3.x fuzzy queries used to evaluate the Levenshtein distance between the queried term and every index term (a brute-force approach). Given that this approach is rather inefficient, the Lucene spellchecker used to rely on something similar to what you describe: Lucene would first search for terms with n-grams similar to the queried term's, and would then score these terms according to a string distance (such as Levenshtein or Jaro-Winkler).
However, this changed a lot in Lucene 4.0 (an ALPHA preview was released a few days ago): FuzzyQuery now uses a Levenshtein automaton to efficiently intersect the terms dictionary. This is so much faster that there is now a new direct spellchecker that doesn't require a dedicated index and directly intersects the terms dictionary with an automaton, similarly to FuzzyQuery.
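
For reference, using the automaton-backed FuzzyQuery looks roughly like the sketch below. The already-open IndexSearcher and the field name "body" are assumptions about your setup, not part of any fixed recipe.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;

    class FuzzySearchExample {
        // Runs an automaton-backed fuzzy query against an existing index.
        static TopDocs fuzzy(IndexSearcher searcher, String term) throws Exception {
            FuzzyQuery query = new FuzzyQuery(new Term("body", term), 2);  // at most 2 edits
            return searcher.search(query, 10);  // top 10 hits
        }
    }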
For the record, since you are dealing with an English corpus, Lucene (or Solr, though I guess you could use them in vanilla Lucene) has some phonetic analyzers that might be useful (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone).
Lucene 4.0 alpha was just released, and many things are easier to customize now, so you could also build a custom fuzzy search on top of it.
In any case, Lucene has many years of performance improvements behind it, so you would be hard pressed to achieve the same performance. Of course, your own approach might be good enough for your case...

NoSQL or YesSQL

I have a huge dictionary of words:
"word1" => [value1]
"word2" => [value2]
"word3" => [value3, value2]
...
"word400000000" => [value455, value3435, ..., value3423]
The number of words is really big.
Now I want to be able to retrieve, really fast, all the values pointed to by a word, where the word is a string value.
What are the best tools to use? I thought of a simple DB solution, but the DBA guys said it would not be really fast.
So, before I open Cormen's book, are there ready-made solutions to this problem?
Look at key/value storage engines such as Berkeley DB. They are very fast at that sort of thing.
In RDBMSs (YesSQL) you will most probably search values with LIKE or = operators across all records, i.e. the search will take O(n). What you actually need is a data structure called an inverted index, which allows you to find the list of needed values in O(1). For a description of the structure and algorithms see the Wikipedia article; for ready-to-use tools, keep reading.
There are plenty of implementations of inverted indexes in search engines like Lucene/Solr and Sphinx (which, by the way, supports several databases as data sources), and also in some key-value stores like Berkeley DB or Apache Cassandra. The distinction between search engines and key-value stores is the following:
Search engines implement the inverted index more directly (AFAIK, key-value DBs use BigTable-like structures, which are much more complex than the inverted index itself).
Search engines have plenty of tools for text analysis (parsing, stemming). I don't know if you actually need that, but if you do, use a search engine.
Key-value DBs are real databases. That is, unlike search engines, they have real data types, not only strings. Moreover, some such DBs (e.g. Berkeley DB) can store a programming language's native data types without converting them to any internal format. So, if you need a real database with all its features, use a key-value store.
Also note that an inverted index is a really simple structure, so you can easily implement it yourself if none of the previous options suits you; see the sketch below.
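
As a minimal sketch of that do-it-yourself option (all names are illustrative), the structure from the question maps directly onto a hash map of sets, which gives the expected O(1) lookup discussed in the next answer:

    import java.util.*;

    // The structure from the question, built directly: word -> set of values.
    class WordIndex {
        private final Map<String, Set<Integer>> index = new HashMap<>();

        void put(String word, int value) {
            index.computeIfAbsent(word, w -> new HashSet<>()).add(value);
        }

        Set<Integer> lookup(String word) {
            return index.getOrDefault(word, Set.of());  // empty set if absent
        }
    }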
It really depends on what behavior you want. If you just want to be able to do an exact text search, then a hash table is probably a really great idea. It has expected O(1) lookup, which is about as fast as you're going to get.
If you need the elements in sorted order (for example, so you can iterate across them in a reasonable order), then one of the myriad balanced search trees might be a good candidate; for example, a red-black tree or an AVL tree.
If you're working with a huge data set that can't all fit into main memory, then a very good choice might be a B-tree, a balanced (but not binary) multiway search tree that minimizes the number of disk reads required to find a given element. Most database systems use some flavor of B-tree for their lookups.
You can use Cassandra (http://cassandra.apache.org/). It is easy to get started with, has plenty of documentation, and is a really fast solution to your problem.
Hope this helps.
If you know that you will only want to search for values based on words, and not the other way around, use a simple key-value store. Maybe Redis would be best.
If you think you will ever need to search based on the values, then you'll likely need secondary indices or offline MapReduce jobs. Maybe Cassandra would be best. A sketch of the Redis option follows.
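
As one hedged illustration of the Redis option, using the Jedis client and assuming a Redis server on localhost, each word could be stored as a Redis set whose members are its values:

    import redis.clients.jedis.Jedis;

    class RedisWordStore {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {  // assumes a local Redis
                jedis.sadd("word3", "value3", "value2");
                System.out.println(jedis.smembers("word3"));    // the word's values, unordered
            }
        }
    }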
