The Elasticsearch documentation on mappings states:
Types are not as well suited for entirely different types of data. If your two types have mutually exclusive sets of fields, that means half your index is going to contain "empty" values (the fields will be sparse), which will eventually cause performance problems. In these cases, it’s much better to utilize two independent indices.
I'm wondering how strictly I should take this.
Say I have three types of documents, each sharing the same 60-70% of fields, with the rest being unique to each type.
Should I put each type in a separate index?
Or would a single index be fine as well, meaning there won't be much storage waste or any noticeable performance hit on search or index operations?
Basically I'm looking for any information to either confirm or disprove the quote above.
If your types overlap by 60-70%, then ES will be fine; that does not sound 'mutually exclusive' at all. Note that:
Things will improve in future versions of ES
If you don't need them, you can disable norms and doc_values, as recommended here
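For reference, norms and doc_values are per-field mapping settings. A minimal sketch of disabling them (the index and field names here are made up for illustration) might look like:

```python
# Sketch: a mapping that disables norms and doc_values on fields that
# don't need scoring or sorting/aggregations. Field names are hypothetical.
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "norms": False,       # drop length-normalization data used for scoring
            },
            "status_code": {
                "type": "keyword",
                "doc_values": False,  # field can no longer be sorted/aggregated on; saves disk
            },
        }
    }
}

# With the official Python client this would be applied roughly as:
# es.indices.create(index="my-index", body=mapping)
```

Note that both settings are trade-offs: disabling norms removes field-length scoring, and disabling doc_values makes the field unusable for sorting and aggregations.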
Related
We are working on a large Elasticsearch index (>1 billion documents) with many searchable fields evenly distributed across >100 shards on 6 nodes.
Only a few of these fields are actually changed, but they are changed very often. About 25% of total requests are changes to these fields.
We have one field that simply holds a boolean, which accounts for more than 90% of the changes done to the document.
It looks like we are taking huge performance hits re-indexing the entire documents, even though a simple boolean is changing.
Reading around, I found that this might be a case where one could store the boolean value in a parent-child relation, as this would effectively place it in a separate document and thus avoid re-creating the entire document on every update. But I also read that this comes with disadvantages, such as higher heap usage for maintaining the relation.
What would be the best way to solve this challenge?
Yes. Since Elasticsearch stores documents in immutable Lucene segments, every update effectively creates a new copy of the document and marks the old one as deleted, to be garbage-collected during a later segment merge.
Parent-child (a.k.a. join) or nested fields can help with this, but they come with a significant performance hit of their own, so your search performance will probably suffer.
Another approach is to use external fast storage like Redis (as described in this article) though it would be harder to maintain and might also turn out to be inefficient for searches, depending on your use case.
The general rule of thumb here is: all use cases are different, and you should carefully benchmark every feasible option.
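If you do experiment with the parent-child route, here is a rough sketch of what a join mapping could look like, with the volatile boolean moved into a tiny child document so that flipping it only rewrites the child. All names here ("content", "flag", "body", "enabled") are assumptions for illustration, not a recommended schema:

```python
# Sketch: a join-field mapping where the frequently-updated boolean lives in a
# small child document, so toggling it rewrites only the child, not the parent.
# All type and field names are hypothetical.
mapping = {
    "mappings": {
        "properties": {
            "relation": {
                "type": "join",
                "relations": {"content": "flag"},  # parent type -> child type
            },
            "body": {"type": "text"},        # large, rarely-changing fields on the parent
            "enabled": {"type": "boolean"},  # the volatile boolean, stored on the child
        }
    }
}

# Updating the flag then touches only the child document. The child must be
# routed to the parent's shard, e.g. with the official Python client:
# es.index(index="docs", id="1-flag", routing="1",
#          body={"relation": {"name": "flag", "parent": "1"}, "enabled": True})
```

The cost, as noted above, is extra heap for the join relation and slower queries (has_child / has_parent), which is why benchmarking against plain partial updates is essential.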
I am not experienced in ES (my background is more in relational databases), and I am trying to add a search bar to my web application that searches its entire content (or at least the content I am willing to index in ES).
The architecture is Jamstack: a Gatsby application fetching content (sometimes at build time, sometimes at runtime) from a Strapi application (headless CMS). In the middle, I developed a microservice that writes the documents created in Strapi to the ES database. At the moment, there is only one index for all documents, regardless of type.
My problem is that, as the application grows and different types of documents are created (sometimes very different from one another; for example, I can have an article (news) and a hospital), I am having a hard time querying the database correctly, because I have to define a lot of type-specific conditions in the query (to cover all document types).
My solution is either to keep a single index and break the query down into several ones; when the user hits the search button, those queries are run and the results are merged before being presented. Or to break the single index into several, one per document type, which leads me to another doubt: is it possible to query multiple indices at once and reference index-specific fields in the query?
Which is the best approach? I hope I have made myself clear.
Thanks in advance.
Going by the example you provided, where one document type is news and another is hospital, it makes sense to create multiple indices (though you also need to say how many such different types you have). There are pros and cons to both approaches, and once you know them, you can choose based on your use case.
Before I list the pros/cons, the answer to your other question is yes: a single search request can target several indices at once, and the multi-search API additionally lets you send several independent searches in one request.
Pros of having a single index
Less management overhead than multiple indices (this is why I asked how many such indices your application might have).
More performant search queries, as all the data lives in a single place.
Cons
You are indexing different types of documents, so you will have to include complex filters to get the data you need.
Relevance will suffer, as the mix of document types skews the IDF of the similarity algorithm (BM25).
Pros of having different indices
It's better to separate data based on its properties, for more relevant results.
Your search queries will be simpler.
If you have really huge data, it makes sense to split it, so as to keep shard sizes optimal and performance better.
Cons
More management overhead.
If you need to search across all indices, you have to use multi-search and wait for results from every index, which can be costly.
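To make the multi-index searching above concrete, here is a minimal sketch. The index and field names ("news", "hospital", "title", "name") are assumptions from the question, not a fixed API:

```python
# Sketch: two ways to search several indices at once.
# Index and field names are hypothetical, taken from the question's example.

# 1) A single search request targeting multiple indices (comma-separated):
# es.search(index="news,hospital",
#           body={"query": {"match": {"title": "covid"}}})

# 2) The multi-search (_msearch) API: several independent queries in one
#    request, each with its own target index and index-specific fields.
msearch_body = [
    {"index": "news"},
    {"query": {"match": {"title": "covid"}}},
    {"index": "hospital"},
    {"query": {"match": {"name": "covid"}}},
]
# responses = es.msearch(body=msearch_body)
# responses["responses"][i] then corresponds to the i-th query above.
```

The msearch form is what lets you keep per-index fields in each query while still merging results in a single round trip.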
The Elasticsearch documentation on the percolate query recommends using separate indices for the queries and the documents being percolated:
Given the design of percolation, it often makes sense to use separate indices for the percolate queries and documents being percolated, as opposed to a single index as we do in examples. There are a few benefits to this approach:
Because percolate queries contain a different set of fields from the percolated documents, using two separate indices allows for fields to be stored in a denser, more efficient way.
Percolate queries do not scale in the same way as other queries, so percolation performance may benefit from using a different index configuration, like the number of primary shards.
At the bottom of the page here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
I understand this in theory, but I'd like to know more about how necessary this is for a large index (say, 1 million registered queries).
The tradeoff in my case is that creating a separate index for the document is quite a bit of extra work to maintain, mainly because both indices need to stay "in sync". This is difficult to guarantee without transactions, so I'm wondering if the effort is worth it for the scale I need.
In general I'm interested in any advice regarding the design of the index/mapping so that it can be queried efficiently. Thanks!
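For context, the two-index setup from the documentation can be sketched roughly as follows. The index name "queries" and the field name "message" are assumptions for illustration:

```python
# Sketch: a dedicated index for registered percolator queries, separate from
# the index holding regular documents. Names are hypothetical.
queries_mapping = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            # Any field referenced by the stored queries must also be mapped
            # here so the percolator can parse and pre-index them:
            "message": {"type": "text"},
        }
    }
}
# es.indices.create(index="queries", body=queries_mapping)

# Registering a query (one of the "1 million registered queries"):
# es.index(index="queries", id="q1",
#          body={"query": {"match": {"message": "alert"}}})

# Percolating a candidate document against all registered queries:
percolate_request = {
    "query": {
        "percolate": {
            "field": "query",
            "document": {"message": "this is an alert"},
        }
    }
}
# es.search(index="queries", body=percolate_request)
```

Note that the documents being percolated are passed in the request, so the queries index only needs the field mappings kept in sync with the document index, not the documents themselves; that is the part that requires discipline without transactions.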
This question is for Elasticsearch primarily, but I believe the answer will be based on underlying Lucene semantics.
I'm contemplating using multiple types in the same index. A lot of fields will be sortable, and a lot of fields will only be used by one particular type, i.e. fields will be sparse, with say 10% coverage on average.
Since sorting keeps values for all docs in memory (regardless of type), I'd like to know whether there is any memory overhead with regard to the missing field values (the ~90% in my case).
In a recent post on the official Elasticsearch blog titled "Index vs Type", the author tackles a common problem: choosing whether to model your data using several indices or several types.
One fact is that Lucene indices don't like sparsity. As a result, the author says that
Fields that exist in one type will also consume resources for documents of types where this field does not exist. [...] And the issue is even worse with doc values: for speed reasons, doc values often reserve a fixed amount of disk space for every document, so that values can be addressed efficiently.
There is a Lucene issue that aims to improve this situation; it has been fixed in Lucene 5.4 and will be available in Elasticsearch v2.2. Even then, the author advises modeling your data in a way that limits sparsity as much as possible.
I know there are several topics on the web, as well as on SO, regarding indexing and query performance within Lucene, but I have yet to find one that discusses whether or not (and if so, how much?) creating payloads will affect query performance...
Here's the scenario ...
Let's say I want to index a collection of documents (anywhere from 100K - 10M), and each document has a subsection that I want to be able to search separately (or perhaps rank higher, depending on whether a match was found within that section).
I'm considering adding a payload (during indexing) to any term that appears within that subsection, so I can efficiently make that determination at query-time.
Does anyone know of any performance issues related to using payloads, or even better, could you point me to any online documentation about this topic?
Thanks!
EDIT: I appreciate the alternative solutions to my scenario, but in case I do need to use payloads in the future, does anyone have any comments regarding the original question about query performance?
The textbook solution to what you want to do is index each original document as two fields: one for the full document, and the other for the subsection. You can boost the subsection field separately either during indexing or during retrieval.
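The two-field approach could be sketched like this at the Elasticsearch level (the field names "full_text" and "subsection", and the boost factor, are assumptions for illustration; in raw Lucene the equivalent is two fields on the Document plus a query-time boost):

```python
# Sketch: index the subsection as its own field alongside the full document,
# then boost subsection matches at query time. Field names are hypothetical.
doc = {
    "full_text": "entire document contents ...",
    "subsection": "the special subsection contents ...",
}

query = {
    "query": {
        "multi_match": {
            "query": "search terms",
            # "^2" applies a query-time boost, ranking subsection hits
            # above matches found only in the full text:
            "fields": ["full_text", "subsection^2"],
        }
    }
}
# es.index(index="docs", body=doc)
# es.search(index="docs", body=query)
```

Query-time boosting is generally preferred over index-time boosting, since the factor can be tuned without re-indexing.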
Having said that, you can read about Lucene payloads here: Getting Started with Payloads.
Your use case doesn't fit well with the purpose of payloads; it looks to me as though any payload information would be redundant.
Payloads are attached to individual occurrences of terms in the document, not to document/term pairs. In order to store and access payloads, you have to use the offset of the term occurrence within the document. In your case, if you know the offset, you should be able to calculate which section the term occurrence is in, without using payload data.
The broader question is the effect of payloads on performance. My experience is that when properly used, the payload implementation takes up less space and is faster than whatever workaround I was previously using. The biggest impact on disk space will be wherever you currently use Field.setOmitTermFreqAndPositions(true) to reduce index size. You will need to include positions to use payloads, which potentially makes the index much larger.