If DocumentDB can do its own indexing and Azure Search can also do indexing, then why would I want to use them together? Any use cases?
Also, DocumentDB is already expensive; if I use Azure Search with it, how does that affect my DocumentDB performance and cost?
DocumentDB shines as a general-purpose document database, while Azure Search shines as a full-text search (FTS) engine.
For example, Azure Search provides:
Linguistically-aware indexing and search that takes into account word forms (e.g., singular vs. plural, verb tenses, and many other kinds of grammatical inflections) in ~60 languages.
High-quality lemmatization and tokenization. For example, word-breaking Chinese text is hard because whitespace is optional.
Synonyms
Proximity search, similar-pronunciation search (soundex/metaphone), wildcard and regex search using the Lucene query syntax (see the query sketch after this list)
Customizable ranking, so that you can boost newer documents, for example
Suggestions
... dozens of other text processing and natural language-related features
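For instance, the Lucene query syntax mentioned above covers fuzzy and wildcard matching. A rough sketch of such a query against the Azure Search REST API might look like the following; the service name, index name, key, and api-version are all placeholders:

    import requests

    service = "my-search-service"        # hypothetical service name
    index = "hotels"                     # hypothetical index name
    url = f"https://{service}.search.windows.net/indexes/{index}/docs/search"

    resp = requests.post(
        url,
        params={"api-version": "2017-11-11"},   # use whichever version your service supports
        headers={"api-key": "<query-key>", "Content-Type": "application/json"},
        # queryType=full enables the Lucene query syntax: fuzzy (~) and wildcard (*) here
        json={"search": "seatle~1 OR sea*", "queryType": "full"},
    )
    print(resp.json())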
If all you need are simple numerical filters or exact string comparisons, just use DocumentDB.
If you need natural language search for some of your content, use Azure Search together with DocumentDB. Connecting them is easy with the DocumentDB indexer.
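A rough sketch of wiring this up through the REST API follows; the names, connection string, api-version, and the two-hour schedule are placeholders, the target index is assumed to already exist, and newer API versions call the data source type "cosmosdb" rather than "documentdb":

    import requests

    service = "my-search-service"                      # hypothetical service name
    base = f"https://{service}.search.windows.net"
    headers = {"api-key": "<admin-key>", "Content-Type": "application/json"}
    params = {"api-version": "2017-11-11"}

    # 1. Register the DocumentDB collection as a data source.
    requests.post(f"{base}/datasources", params=params, headers=headers, json={
        "name": "mydocdb-ds",
        "type": "documentdb",
        "credentials": {"connectionString": "AccountEndpoint=...;AccountKey=...;Database=mydb"},
        "container": {"name": "mycollection"},
    })

    # 2. Create the indexer that pulls from the data source into the target index.
    requests.post(f"{base}/indexers", params=params, headers=headers, json={
        "name": "mydocdb-indexer",
        "dataSourceName": "mydocdb-ds",
        "targetIndexName": "myindex",                  # must already exist
        "schedule": {"interval": "PT2H"},              # how often the indexer runs (and consumes RUs)
    })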
In terms of cost implications, using Azure Search with DocumentDB doesn't change the cost of DocumentDB itself. If you use the DocumentDB indexer, it will consume a certain amount of Request Units (RUs); how much depends on your data and the query you use, as well as your indexing schedule.
Related
I would like to know the pros and cons of trying to search my data (basically full-text search on a limited set of fields).
My data is currently in DynamoDB, and I realize that is not well suited to full-text search. Are there ways of doing a full-text search in DynamoDB? What are the pros and cons of doing that?
I can also use a Search cluster (like ElasticSearch). Any reasons that you would not go with a search cluster?
Are there other ways to do a full-text search? Other solutions?
DynamoDB is best suited for key-value inserts and retrievals.
It does not support full-text search; if you try to do a scan with a filter condition, that is O(n) and can be very costly, since you are consuming a lot of read capacity.
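To illustrate, here is a minimal boto3 sketch (table and attribute names are hypothetical) showing that a filtered Scan still reads, and bills read capacity for, every item before the filter is applied:

    import boto3
    from boto3.dynamodb.conditions import Attr

    table = boto3.resource("dynamodb").Table("Articles")    # hypothetical table

    resp = table.scan(
        FilterExpression=Attr("title").contains("search"),  # filter runs AFTER the items are read
        ReturnConsumedCapacity="TOTAL",
    )
    print(len(resp["Items"]))
    print(resp["ConsumedCapacity"])  # capacity billed for the whole scan, not just the matches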
Now, coming to the options:
If the use case is not full-text search but only exact key matches, you can try to come up with composite keys, but that has drawbacks like
a. You cannot change the schema afterwards, and it may require huge effort if you need to search on a new field.
b. Designing these kinds of keys is tricky, considering that a few keys will always be hot, which may result in a hot partition.
The ideal solution is to use Elasticsearch or Solr indexing. You can have a Lambda function listening to the DynamoDB stream, transforming the records, and putting the data into Elasticsearch (a sketch follows below). But it has limitations like
a. An Elasticsearch cluster is costly.
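A rough sketch of that Lambda, assuming a hypothetical "items" index, a string primary key named "id", elasticsearch-py 7.x-style client calls, and no auth or error handling:

    from elasticsearch import Elasticsearch
    from elasticsearch.exceptions import NotFoundError

    es = Elasticsearch(hosts=["https://my-es-host:9200"])   # hypothetical endpoint

    def handler(event, context):
        """Triggered by a DynamoDB stream; mirrors each change into Elasticsearch."""
        for record in event["Records"]:
            doc_id = record["dynamodb"]["Keys"]["id"]["S"]   # assumes a string key named "id"
            if record["eventName"] == "REMOVE":
                try:
                    es.delete(index="items", id=doc_id)
                except NotFoundError:
                    pass
            else:
                image = record["dynamodb"]["NewImage"]
                # Naively flatten the DynamoDB attribute-value format ({"S": "..."} etc.).
                doc = {k: list(v.values())[0] for k, v in image.items()}
                es.index(index="items", id=doc_id, body=doc)
        return {"status": "ok"}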
My high-level understanding of how autocomplete/search for text/items works in any scalable product like Amazon e-commerce or Google was:
Elasticsearch (ES) based approach
Documents are stored in a DB. Once persisted, they are handed to Elasticsearch, which creates the index and stores the index/documents (based on the tokenizer) in memory or on disk, depending on configuration.
Once the user types, say, 3 characters, ES searches its indexes (it can even be configured to index n-grams), ranks the matches based on weight, and returns them to the user.
But after reading a couple of resources on Google about trie-based search,
it looks like some scalable products also use a trie data structure to do prefix-based search.
My question is: can a trie-based approach be a good alternative to ES, does ES internally use a trie, or am I missing something completely here?
ES autocompletion can be achieved in three ways:
using prefix queries
using (edge-)ngrams
or using the completion suggester
The first option is the poor man's completion feature. I'm mentioning it because it can be useful in certain situations, but you should avoid it if you have a substantial number of documents.
The second option uses the conventional ES indexing features, i.e. it will tokenize the text, all (edge-)ngrams will be indexed, and then you can search for any prefix/infix/suffix that has been indexed.
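For example, a sketch of an edge-ngram analyzer on a hypothetical "products" index with a "title" field (elasticsearch-py 7.x-style calls against a local cluster):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(hosts=["http://localhost:9200"])   # assumed local cluster

    es.indices.create(index="products", body={
        "settings": {
            "analysis": {
                "filter": {
                    # index prefixes of each token, 2 to 15 characters long
                    "autocomplete_filter": {"type": "edge_ngram", "min_gram": 2, "max_gram": 15}
                },
                "analyzer": {
                    "autocomplete": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "autocomplete_filter"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                # ngrams at index time only; the query itself uses the standard analyzer
                "title": {"type": "text", "analyzer": "autocomplete", "search_analyzer": "standard"}
            }
        },
    })

    es.index(index="products", body={"title": "Nokia 3310"}, refresh=True)
    resp = es.search(index="products", body={"query": {"match": {"title": "nok"}}})
    print([hit["_source"]["title"] for hit in resp["hits"]["hits"]])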
The third option uses a different approach and is optimized for speed. Basically, when indexing a field of type completion, ES will create a "finite state transducer" and store it in memory for ultra fast access.
A finite state transducer is close to a trie in terms of implementation. You can check this excellent article, which shows how a trie compares to a finite state transducer.
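A sketch of the completion suggester on a hypothetical "products" index (again elasticsearch-py 7.x-style calls):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(hosts=["http://localhost:9200"])

    es.indices.create(index="products", body={
        "mappings": {"properties": {"suggest": {"type": "completion"}}}
    })
    es.index(index="products",
             body={"suggest": {"input": ["nokia 3310", "3310 nokia"]}},
             refresh=True)

    resp = es.search(index="products", body={
        "suggest": {
            "product_suggest": {"prefix": "nok", "completion": {"field": "suggest"}}
        }
    })
    for option in resp["suggest"]["product_suggest"][0]["options"]:
        print(option["text"])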
UPDATE (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
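A sketch of search_as_you_type on a hypothetical "products" index, with 7.x-style client calls; the _2gram/_3gram subfields are created automatically by the field type:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(hosts=["http://localhost:9200"])

    es.indices.create(index="products", body={
        "mappings": {"properties": {"title": {"type": "search_as_you_type"}}}
    })
    es.index(index="products", body={"title": "Nokia 3310"}, refresh=True)

    resp = es.search(index="products", body={
        "query": {
            "multi_match": {
                "query": "nok",
                "type": "bool_prefix",
                "fields": ["title", "title._2gram", "title._3gram"],
            }
        }
    })
    print([hit["_source"]["title"] for hit in resp["hits"]["hits"]])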
I'm writing a diploma thesis about full-text search engines, and I'm struggling with one fundamental thing.
If we want to have searching on our website, the reason we need dedicated full-text search engines alongside our classic MySQL/PostgreSQL/Oracle databases is that databases in general don't have the full-text search capabilities needed for quality search. However, full-text search engines need to have a lot of features that classic databases have, in order to be able to search by specific fields, sort results, have proper scalability, etc.
I have seen people run a classic database and additionally maintain e.g. Elasticsearch. I have been reading a lot about all of the features full-text search engines need to have, and they really overlap with the features of classic databases. So, my question is: do we even need classic databases alongside full-text search engines? Do databases have some additional features that full-text search engines don't? Can we just keep all of our data in a single full-text search database like Elasticsearch?
What are the uses of the Semantic Web in information retrieval? By Semantic Web here I mean structured datasets like DBpedia and Freebase.
I have integrated RDF information with Lucene in several projects, and I think a lot of the value you can get from the integration is that you can go beyond the simple keyword search that Lucene would normally enable. That opens up possibilities for full-text search over your RDF information, but also semantically enriched full-text search.
In the former case, there is no 'like' operator in SPARQL, and the regex function, while similar in capability to SQL's LIKE, is not really tractable to evaluate against a dataset of any appreciable size. However, if you're able to use Lucene to do the search instead of relying on regex, you can get better scale and performance out of a single keyword search over your RDF.
In the latter case, if the query engine is integrated with the Lucene text/RDF index (think LARQ; both Jena and Stardog support this), you can do far more complex semantic searches over your full-text index: queries like "get all the genres of movies that have at least 10 reviews and whose reviews contain the phrase 'two thumbs up'". That's difficult to swing with a Lucene index alone, but it becomes quite trivial in the intersection of Lucene & SPARQL.
You can use DBpedia in information retrieval, since it has the structured information from Wikipedia.
Wikipedia has knowledge of almost every topic of interest in the form of articles, categories, and info-boxes, and that is what information retrieval systems use to extract meaningful information in the form of triples, i.e., subject, predicate & object.
You can query the information via SPARQL using the DBpedia endpoint: https://dbpedia.org/sparql
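For example, a minimal Python sketch using SPARQLWrapper against that endpoint; the specific query (pulling a few English abstracts labelled "Information retrieval") is just an illustration:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?s ?abstract WHERE {
            ?s rdfs:label "Information retrieval"@en ;
               dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "en")
        } LIMIT 5
    """)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["s"]["value"], row["abstract"]["value"][:80])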
I want to develop a Google-Desktop-Search-like application, and I want to know which indexing techniques/algorithms I should use so I can get very fast data retrieval.
In general, what you want is an Inverted Index. You can do the indexing yourself, but it's a lot of work to get right - you need to handle stemming, stop words, extending the posting list to include positions in the document so you can handle multi-word queries, and so forth. Then, you need to store the index, probably in a B-Tree on disk - or you can make life easier for yourself by using an existing database for the disk storage, such as BDB. You also need to write a query planner that interprets user queries, performs query expansion and converts them to a series of index scans. Wikipedia's article on Search Engine Indexing provides a good overview of all the challenges, too.
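To make the idea concrete, here is a minimal in-memory sketch of such an inverted index with positional posting lists and a tiny stop-word list; stemming, ranking, query expansion, and on-disk storage (the B-tree/BDB part) are deliberately left out:

    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

    def tokenize(text):
        return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

    class InvertedIndex:
        def __init__(self):
            # term -> {doc_id: [positions]}
            self.postings = defaultdict(lambda: defaultdict(list))

        def add(self, doc_id, text):
            for pos, term in enumerate(tokenize(text)):
                self.postings[term][doc_id].append(pos)

        def search_phrase(self, phrase):
            """Return doc_ids containing the phrase's terms as consecutive tokens."""
            terms = tokenize(phrase)
            if not terms or any(t not in self.postings for t in terms):
                return set()
            # Only docs containing every term can match.
            docs = set.intersection(*(set(self.postings[t]) for t in terms))
            hits = set()
            for doc in docs:
                for p in self.postings[terms[0]][doc]:
                    # Check the remaining terms appear at consecutive positions.
                    if all(p + i in self.postings[t][doc] for i, t in enumerate(terms)):
                        hits.add(doc)
                        break
            return hits

    index = InvertedIndex()
    index.add(1, "The quick brown fox jumps over the lazy dog")
    index.add(2, "A quick tour of the brown library")
    print(index.search_phrase("quick brown"))   # {1}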
Or, you can leverage existing work and use ready-made full text indexing solutions like Apache Lucene and Compass (which is built on Lucene). These tools handle practically everything detailed above (and more), which just leaves you writing the tool to build and update the index by feeding all your documents into Lucene, and the UI to allow users to search it.
The Burrows-Wheeler transform, used to compress data in bzip2, can also be used to make substring searching of text very fast: with the right index structures, the search time depends on the length of the pattern rather than the size of the text (a small sketch follows after the links below).
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
I haven't seen a simple introduction online, but here is a lot of detail:
http://www.ddj.com/architect/184405504
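To make that concrete, here is a minimal sketch of the BWT (naive sorted-rotations construction, fine only for small texts) plus FM-index backward search, where the per-query work grows with the pattern length rather than the text size; nothing here is tied to bzip2:

    def bwt(text):
        """Burrows-Wheeler transform via sorted rotations; '$' marks the end of the text."""
        text = text + "$"
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    def backward_search(bwt_text, pattern):
        """Count occurrences of `pattern` using FM-index backward search over the BWT."""
        # C[c] = number of characters in the text that sort strictly before c.
        C, total = {}, 0
        for c in sorted(set(bwt_text)):
            C[c] = total
            total += bwt_text.count(c)

        # occ(c, i) = occurrences of c in bwt_text[:i]; naive here, real
        # implementations precompute rank structures for O(1) lookups.
        def occ(c, i):
            return bwt_text[:i].count(c)

        lo, hi = 0, len(bwt_text)          # current range of matching suffix-array rows
        for c in reversed(pattern):
            if c not in C:
                return 0
            lo = C[c] + occ(c, lo)
            hi = C[c] + occ(c, hi)
            if lo >= hi:
                return 0
        return hi - lo                      # number of occurrences

    transformed = bwt("abracadabra")
    print(transformed)                           # the BWT of the text
    print(backward_search(transformed, "abra"))  # -> 2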