How does Elasticsearch/Lucene achieve such performance when querying multiple fields? - elasticsearch

According to the answer given here, Elasticsearch doesn't seem to use compound indexes for querying multiple fields, and instead queries multiple indexes and then intersects the results.
My question is how does it achieve such high performance? Surely a composite index is faster since it leads you straight to the desired data, rather than querying multiple indexes, which in turn return more data, and then compare the results?
I get the advantages of the multiple indexes, regarding the field order, etc., but in terms of performance, surely it's inferior...

Related

Strategies to compare performance of two Elasticsearch queries?

Since actual query runtime varies, it's not always useful to just check the runtime of two queries to determine which is generally faster. What are some ways to generally test whether one query is more efficient than another?
As an example of what I'm after, in MongoDB I can run explain on a query to get the number of documents iterated vs. returned. If the documents iterated is several orders of magnitude higher than what it's actually returning, I know I have an inefficient query. I know that since Elasticsearch indexes data much differently than other dbs, this may not translate well, but I'm wondering if there's some rough equivalent.
I'm looking at the Profile API which looks like a good starting place. Are fields like next_doc and next_doc_count what I'm after? Are there any others I should look for? Thanks!!

Elastic Search: One index with custom type to differentiate document schemas VS multiple index, one per document type?

I am not experienced in ES (my background is more of relational databases) and I am trying to achieve the goal of having a search bar in my web application to search the entire content of it (or the content I will be willing to index in ES).
The architecture implemented is Jamstack with a gatsby application fetching content (sometimes at build time, sometimes at runtime) from a strapi application (headless cms). In the middle, I developed a microservice to write the documents created in the strapi application to the ES database. At this moment, there is only one index for all the documents, regardless the type.
My problem is, as the application grows and different types of documents are created (sometimes very different from one another, as example I can have an article (news) and a hospital) I am having hard time to correctly query the database as I have to define a lot of specific conditions when making the query (to cover all types of documents).
My solution to this is to keep only one index and break down the query in several ones and when the user hits the search button those queries are run and the results will be joined together before being presented OR break down the only index into several ones, one per document which leads me to another doubt, is it possible to query multiple indexes at once and define specific index fields in the query?
Which is the best approach? I hope I could make my self clear in this.
Thanks in advance.
According to the example you provided, where one type of document can be of type news and another type is hospital, it makes sense to create multiple indices(but you also need to tell, how many such different types you have). there are pros and cons with both the approach and once you know them, you can choose one based on your use-case.
Before I start listing out the pros/cons, the answer to your other question is that you can query multiple indices in a single search query using multi-search API.
Pros of having a single index
less management overhead of multiple indices(this is why I asked how many such indices you may have in your application).
More performant search queries as data are present in a single place.
Cons
You are indexing different types of documents, so you will have to include a complex filter to get the data that you need.
Relevance will not be good, as you have a mix of documents which impacts the IDF of similarity algo(BM25), and impacts the relevance.
Pros of having a different index
It's better to separate the data based on their properties, for better relevant results.
Your search queries will not be complex.
If you have really huge data, it makes sense to break the data, to have the optimal shard size and better performance.
cons
More management overhead.
if you need to search in all indices, you have to implement multi-search and wait for all indices search result, which might be costly.

How important is it to use separate indices for percolator queries and their documents?

The ElasticSearch documentation on the Percolate query recommends using separate indices for the query and the document being percolated:
Given the design of percolation, it often makes sense to use separate indices for the percolate queries and documents being percolated, as opposed to a single index as we do in examples. There are a few benefits to this approach:
Because percolate queries contain a different set of fields from the percolated documents, using two separate indices allows for fields to be stored in a denser, more efficient way.
Percolate queries do not scale in the same way as other queries, so percolation performance may benefit from using a different index configuration, like the number of primary shards.
At the bottom of the page here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
I understand this in theory, but I'd like to know more about how necessary this is for a large index (say, 1 million registered queries).
The tradeoff in my case is that creating a separate index for the document is quite a bit of extra work to maintain, mainly because both indices need to stay "in sync". This is difficult to guarantee without transactions, so I'm wondering if the effort is worth it for the scale I need.
In general I'm interested in any advice regarding the design of the index/mapping so that it can be queried efficiently. Thanks!

Is having empty fields bad for lucene index?

ES doc on mappings states below
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems. In these cases, it’s much better to utilize two independent indices.
I'm wondering how strictly should I take this.
Say I have three types of documents, with each sharing same 60-70% of fields and the rest being unique to each type.
Should I put each type in a separate index?
Or one single index would be fine as well, meaning there won't be lots of storage waste or any noticeable performance hit on search or index operations?
Basically I'm looking for any information to either confirm or disprove the quote above.
If your types overlap 60-70% then ES will be fine, that does not sound 'mutually exclusive' at all. Notice that:
Things will improve in future versions of ES
If you don't need them, you can disable norms and doc_values, as recommended here

Score equivalence Oracle Text / Lucene

Is there an equivalence between the scores an Oracle Text Score would calculate and a Lucene one ?
Would you be able to mix the sources to get one unified resultset through the score ?
Scores are not comparable between queries or data changes in Lucene, much less being comparable to another technology. Lucene scores of the same document can be changed dramatically by having other documents added or removed from the index. Scoring as a percentage of maximum becomes the obvious solution, but the same problems remain, as well as that other algorithms in another technology will ikely render different distribution. You can read about why you should not compare scores like this here and here
A way I managed to lash something similar together was to fetch matches from the other data source, and create a temporary index in a RAMDirectory, and then search again incorporating it with a MultiSearcher. That way everything is getting scored on a single, cohesive data set, within a single search. Scoring should be reasonable enough, though this isn't exactly the most efficient way to search.

Resources