Performance Comparison: Elasticsearch vs. Solr vs. Sphinx vs. Whoosh

I've searched for comparisons of these tools, but every comparison I found has been about search features. I want to compare their performance (speed) for a project.

Related

Issue with results after upgrading Elasticsearch

I am upgrading Elasticsearch from 2.2 to 7.1. I am maintaining both instances and comparing the results of the new and old versions by making the same search queries.
Note: I have not changed the mappings, settings or querying logic
My results are almost the same but vary a little in scoring. Is that expected, even though the documents, mappings, settings, and query logic are the same?
Elasticsearch 2.x uses TF/IDF for scoring, and this ES doc explains it in detail.
ES 7.x uses the improved BM25 algorithm for score calculation; this is another nice article from Elastic that explains it in detail.
In short, yes: there are significant changes in the scoring formula between ES 2.x and 7.x because the underlying algorithm itself changed, so even though everything else (documents, mappings, settings, and queries) is the same, you will get different scores.
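For reference, here is a simplified sketch of the BM25 formula that ES 7.x uses, where f(q_i, D) is the term frequency, |D| the field length, avgdl the average field length, and Lucene's defaults are k_1 = 1.2 and b = 0.75 (the actual Lucene implementation adds boosts and minor refinements, so treat this as an approximation):

```latex
\mathrm{score}(D, Q) \;=\; \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}{\,f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```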
You can use the Explain API on your query to understand how the score of each document returned by the query was computed.
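As an illustration (not part of the original answer), here is a minimal sketch of calling the Explain API with the Elasticsearch low-level Java REST client; the index name "products", document id "1", and the match query are made-up placeholders:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ExplainScore {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // GET /<index>/_explain/<id> returns a per-document score breakdown
            Request request = new Request("GET", "/products/_explain/1");
            request.setJsonEntity(
                "{\"query\": {\"match\": {\"title\": \"wireless keyboard\"}}}");
            Response response = client.performRequest(request);
            // The response body shows how each scoring factor contributed
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```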

Using Azure Search with DocumentDB

If DocumentDB can do its own indexing and Azure Search can also do indexing, then why would I want to use them together? Any use cases?
Also, using DocumentDB is already expensive; if I use Azure Search with it, how does that affect my DocumentDB performance and cost?
DocumentDB shines as a general purpose document database, while Azure Search shines as a full text search (FTS) engine.
For example, Azure Search provides:
Linguistically-aware indexing and search that takes into account word forms (e.g., singular vs. plural, verb tenses, and many other kinds of grammatical inflections) in ~60 languages.
High-quality lemmatization and tokenization. For example, word-breaking Chinese text is hard because white space is optional.
Synonyms
Proximity search, similar pronunciation search (soundex / metaphone), wildcard and regex search using Lucene query syntax
Customizable ranking, so that you can boost newer documents, for example
Suggestions
... dozens of other text processing and natural language-related features
If all you need are simple numerical filters or exact string comparisons, just use DocumentDB.
If you need natural language search for some of your content, use Azure Search together with DocumentDB. Connecting them is easy with the DocumentDB indexer.
In terms of cost, using Azure Search with DocumentDB doesn't change the cost of DocumentDB itself. If you use the DocumentDB indexer, it will consume a certain amount of Request Units (RUs); how much depends on your data, the query you use, and your indexing schedule.
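As a rough sketch of what "connecting them" looks like (my own illustration, not from the original answer): the Azure Search indexer is set up through the service's REST API, here with plain Java 11 HttpClient. The service URL, API key, names, connection string, and api-version are placeholders, and the exact payload shape should be checked against the current Azure Search REST reference.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateCosmosIndexer {
    static final String SERVICE = "https://my-search-service.search.windows.net"; // placeholder
    static final String API_KEY = "<admin-api-key>";                               // placeholder

    static void post(HttpClient http, String path, String json) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create(SERVICE + path + "?api-version=2020-06-30"))
            .header("Content-Type", "application/json")
            .header("api-key", API_KEY)
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();
        HttpResponse<String> res = http.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(path + " -> " + res.statusCode());
    }

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // 1. Data source pointing at the Cosmos DB / DocumentDB collection (values are placeholders).
        post(http, "/datasources", "{"
            + "\"name\": \"products-cosmos\","
            + "\"type\": \"cosmosdb\","
            + "\"credentials\": { \"connectionString\": \"AccountEndpoint=...;AccountKey=...;Database=mydb\" },"
            + "\"container\": { \"name\": \"products\" } }");
        // 2. Indexer that copies documents into an existing search index on a schedule.
        post(http, "/indexers", "{"
            + "\"name\": \"products-indexer\","
            + "\"dataSourceName\": \"products-cosmos\","
            + "\"targetIndexName\": \"products-index\","
            + "\"schedule\": { \"interval\": \"PT1H\" } }");
    }
}
```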

Are classic databases needed when we have full-text search engines?

I'm writing a diploma thesis about full-text search engines, and I'm struggling with one fundamental thing.
If we want search on our website, the reason we need dedicated full-text search engines alongside our classic MySQL/PostgreSQL/Oracle databases is that databases in general don't have the full-text search capabilities needed for quality search. However, full-text search engines need a lot of the features that classic databases have in order to search by specific fields, sort results, scale properly, and so on.
I have seen people keep a classic database and additionally maintain, for example, Elasticsearch. I have been reading a lot about the features full-text search engines need to have, and they really overlap with the features of classic databases. So, my question is: do we even need classic databases alongside full-text search engines? Do databases have some additional features that full-text search engines don't? Can we just keep all of our data in a single full-text search database like Elasticsearch?

Algolia vs Solr search

I'm building a product search platform. I used the Solr search engine before, and I found its performance fine, but it doesn't generate a user interface. Recently I found that Algolia has more features, easier setup, and generates a user interface.
So if someone used Algolia before:
Is Algolia's performance better than Solr's?
Is there any difference between Algolia and Websolr?
I'm using Algolia and Solr in production for an e-commerce website.
You're right about what you say on Algolia. It's fast (really) and has a lot of powerful features.
You have a complete dashboard to manage your search engine.
Solr is OK, but it's also a black box. You can fine-tune your search engine, but it exhibits poor performance for semantic searches (I tested it).
If you have to make a choice, it depends on a lot of things.
With Algolia, there are no servers to manage, and configuration and integration are easy. It's fast with 20 million records for me (less than 15 ms per search).
With Solr, you can customize a bit more, but it's a lot of work. If I had to make a choice, it would be between Algolia and Elasticsearch; Solr is losing momentum, and it's hard to imagine it growing again in the next few years.
In summary, if you want to be fast and efficient, choose Algolia. If you want to dive deep into search engine architecture and you have a lot of time (count it in months), you can try Elasticsearch.
I hope my answer was helpful; ask me if you have more questions.
Speed is a critical part of keeping users happy. Algolia is aggressively designed to reduce latency. In a benchmarking test, Algolia returned results up to 200x faster than Elasticsearch.
Out-of-the-box, Algolia provides prefix matching for as-you-type search, typo-tolerance with intelligent result highlighting, and a flexible, powerful ranking formula. The ranking formula makes it easy to combine textual relevance with business data like prices and popularity metrics. With Lucene-based search tools like Solr and Elasticsearch, the ranking formula must be designed and built from scratch, which can be very difficult for teams without deep search experience to get right.
Algolia’s highly optimized infrastructure is distributed across the world in 15 regions and 47 datacenters. Algolia provides a 99.99% reliability guarantee and can deliver fast search to users wherever in the world they’re connecting from. Elasticsearch and Solr do not automatically distribute to multiple regions, and doing so can incur significant server costs and devops resources.

Would my approach to fuzzy search, for my dataset, be better than using Lucene?

I want to implement a fuzzy search facility in the web app I'm currently working on. The back-end is in Java, and it just so happens that the search engine everyone recommends here, Lucene, is coded in Java as well. I am, however, shying away from using it for several reasons:
I would feel accomplished building something of my own.
Lucene has a plethora of features that I don't see myself utilizing; I'd like to minimize bloat.
From what I understand, Lucene's fuzzy search implementation manually evaluates the edit distance to each indexed term. I feel the approach I want to take (detailed below) would be more efficient.
The data to be indexed could potentially be the entire set of nouns and pronouns in the English language, so you can see why Lucene's approach to fuzzy search makes me wary.
What I want to do is take an n-gram based approach to the problem: read and tokenize each item from the database and save them to disk in files named by a given n-gram and its location.
For example: let's assume n = 3 and my file-naming scheme is something like: [n-gram]_[location_of_n-gram_in_string].txt.
The file bea_0.txt would contain:
bear
beau
beacon
beautiful
beats by dre
When I receive a term to be searched, I can simply tokenize it into n-grams and use them, along with their corresponding locations, to read the corresponding n-gram files (if present). I can then perform any filtering operations (eliminating terms outside a given length range, computing edit distances, etc.) on this set of data instead of on the entire dataset.
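A minimal sketch of that lookup step (my own illustration, not from the question): split the query into positional trigrams, read each matching [n-gram]_[position].txt file, and apply a cheap length pre-filter before doing any edit-distance work. The file layout and the length threshold are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NgramLookup {
    static final int N = 3;

    public static Set<String> candidates(Path indexDir, String query) throws IOException {
        Set<String> result = new HashSet<>();
        for (int pos = 0; pos + N <= query.length(); pos++) {
            String gram = query.substring(pos, pos + N);
            Path file = indexDir.resolve(gram + "_" + pos + ".txt"); // e.g. bea_0.txt
            if (!Files.exists(file)) continue;                        // n-gram never indexed
            List<String> terms = Files.readAllLines(file);
            for (String term : terms) {
                // cheap pre-filter: keep only terms of roughly comparable length
                if (Math.abs(term.length() - query.length()) <= 2) {
                    result.add(term);
                }
            }
        }
        return result; // run Levenshtein (or similar) only over this reduced set
    }
}
```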
My question is... well I guess I have a couple of questions.
Have there been any improvements in Lucene's fuzzy search that I'm not aware of that would make my approach unnecessary?
Is this a good approach to implement fuzzy-search, (considering the set of data I'm dealing with), or is there something I'm oversimplifying/missing?
Lucene 3.x's fuzzy query used to evaluate the Levenshtein distance between the queried term and every indexed term (a brute-force approach). Given that this approach is rather inefficient, the Lucene spellchecker used to rely on something similar to what you describe: Lucene would first search for terms with n-grams similar to the queried term and would then score these terms according to a string distance (such as Levenshtein or Jaro-Winkler).
However, this has changed a lot in Lucene 4.0 (an alpha preview was released a few days ago): FuzzyQuery now uses a Levenshtein automaton to efficiently intersect the terms dictionary. This is so much faster that there is now a new direct spellchecker that doesn't require a dedicated index and directly intersects the terms dictionary with an automaton, similarly to FuzzyQuery.
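As a hedged sketch of the two APIs mentioned above (index path and field name are placeholders): FuzzyQuery with a maximum edit distance, and DirectSpellChecker suggesting corrections straight from the main index with no separate spell index.

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.SuggestWord;
import org.apache.lucene.store.FSDirectory;

public class FuzzyExample {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/tmp/index")))) {
            // Fuzzy query: match indexed terms within 2 edits of "beutiful"
            IndexSearcher searcher = new IndexSearcher(reader);
            FuzzyQuery query = new FuzzyQuery(new Term("title", "beutiful"), 2);
            TopDocs hits = searcher.search(query, 10);
            System.out.println("fuzzy hits: " + hits.totalHits);

            // Direct spell checking against the same index, no dedicated spell index
            DirectSpellChecker checker = new DirectSpellChecker();
            SuggestWord[] suggestions =
                checker.suggestSimilar(new Term("title", "beutiful"), 5, reader);
            for (SuggestWord s : suggestions) {
                System.out.println(s.string + " (score=" + s.score + ")");
            }
        }
    }
}
```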
For the record, since you are dealing with an English corpus, Lucene (or Solr, but I guess you could use them in vanilla Lucene) has some phonetic analyzers that might be useful (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone).
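A quick illustration of what phonetic encoding buys you (assuming Apache Commons Codec on the classpath, which is what Lucene's phonetic filters delegate to): words that sound alike collapse to the same code, so misspellings like "Smyth" still match "Smith".

```java
import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.commons.codec.language.Soundex;

public class PhoneticDemo {
    public static void main(String[] args) {
        DoubleMetaphone dm = new DoubleMetaphone();
        Soundex soundex = new Soundex();
        // Both spellings produce the same Double Metaphone code
        System.out.println(dm.doubleMetaphone("Smith"));
        System.out.println(dm.doubleMetaphone("Smyth"));
        // Both names produce the same Soundex code
        System.out.println(soundex.encode("Robert"));
        System.out.println(soundex.encode("Rupert"));
    }
}
```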
The Lucene 4.0 alpha was just released, and many things are easier to customize now, so you could also build on it and create a custom fuzzy search.
In any case, Lucene has many years of performance improvements behind it, so you would be hard-pressed to achieve the same performance. Of course, your approach might be good enough for your case...
