ElasticSearch Segment merge not happening when deleted documents count is greater than 50% - elasticsearch

Elasticsearch version: 7.10.0
I have an elasticsearch index with 8 shards in 8 different nodes and with a document count greater than 25 million documents(nested not included). It's an heavy update index. The document size grows over a period of time because of deleted documents. I did a search on this issue and read blogs like one below which tells a segment will automatically be merged when the deleted docs count in that segment is greater than 50%.
https://discuss.elastic.co/t/too-many-deleted-docs/84964/4
I did a /_segments for the index and found segments like the below
"segments": {
"_bbx": {
"generation": 14685,
"num_docs": 27901732,
"deleted_docs": 23290932,
"size_in_bytes": 5071187083,
"memory_in_bytes": 137008,
"committed": true,
"search": true,
"version": "8.7.0",
"compound": false,
"attributes": {
"Lucene87StoredFieldsFormat.mode": "BEST_SPEED"
}
},
Full response of /_segment call can be found here
https://drive.google.com/file/d/1mLE2xw0u7lnogHnfzz65rWCBS8JrcnNm/view?usp=sharing
In many segments like the one above the deleted_docs count is more than 75% of the num_docs but is still not getting merged. We haven't set any max_merged_segment so the default is 5gb. We also haven't changed any MergePolicy and are using the default ones as of Es version 7.10.0.
Is my understanding correct ?
Any thoughts on this would be helpful. Thanks in advance.

The num_docs contains only the present documents and doesn't include the deleted documents.
So in this case we have 23,290,932 deleted documents out of a total of 51,192,664 (27,901,732 + 23,290,932) documents which means 45.5% are deleted in that segment. Hence segment merge didn't happen.
Note : Posted the same question in elasticsearch forums got this reply
https://discuss.elastic.co/t/elasticsearch-segment-merge-not-happening-when-deleted-documents-count-is-greater-than-50/277209

Related

Does non-indexed field update triggers reindexing in elasticsearch8?

My index mapping is the following:
{
"mappings": {
"dynamic": False,
"properties": {
"query_str": {"type": "text", "index": False},
"search_results": {
"type": "object",
"enabled": False
},
"query_embedding": {
"type": "dense_vector",
"dims": 768,
},
}
}
Field search_result is disabled. Actual search is performed only via query_embedding, other fields are just non-searchable data.
If I will update search_result field in existing document, will it trigger reindexing?
The docs say that "The enabled setting, which can be applied only to the top-level mapping definition and to object fields, causes Elasticsearch to skip parsing of the contents of the field entirely. The JSON can still be retrieved from the _source field, but it is not searchable or stored in any other way". So, it seems logical not to re-index docs if changes took place only in non-indexed part, but I'm not sure
Elasticsearch documents (Lucene Segments) are inmutable, so every change you make in a document will delete the document and create a new one. This is a Lucene's behavior:
Lucene's index is composed of segments, each of which contains a
subset of all the documents in the index, and is a complete searchable
index in itself, over that subset. As documents are written to the
index, new segments are created and flushed to directory storage.
Segments are immutable; updates and deletions may only create new
segments and do not modify existing ones. Over time, the writer merges
groups of smaller segments into single larger ones in order to
maintain an index that is efficient to search, and to reclaim dead
space left behind by deleted (and updated) documents.
When you set enable:false you are just avoiding to have the field content in the searchable structures but the data still lives in Lucene.
You can see a similar answer here:
Partial update on field that is not indexed

Reindexing more than 10k documents in Elasticsearch

Let's say I have an index- A. It contains 26k documents. Now I want to change a field status with type as Keyword. As I can't change A's status field type which is already existing, I will create a new index: B with my setting my desired type.
I followed reindex API:
POST _reindex
{
"source": {
"index": "A",
"size": 10000
},
"dest": {
"index": "B",
"version_type": "external"
}
}.
But the problem is, here I can migrate only 10k docs. How to copy the rest?
How can I copy all the docs without losing any?
delete the size: 10000 and problem will be solved.
by the way the size field in Reindex API means that what batch size elasticsearch should use to fetch and reindex docs every time. by default the batch size is 100. (you thought it means how many document you want to reindex)

Moving data from oine Elasticsearch index to another with higher number of shards or increasing shard number in existing index

I am new to Elasticsearch and I have been reading documentation in order to find a way of increasing amount of shards that my index consists of. Currently my index looks like this:
country_data 0 p STARTED 227 100.7kb 192.168.0.115 $HOSTNAME
country_data 0 r STARTED 227 100.7kb 192.168.0.116 $HOSTNAME
I wanted to increase the number of shard to 5 however I was unable to find a proper way of doing it. I learnt from another Stackoverflow question that I should be able to do it like this:
POST _reindex?slices=5
{
"source": {
"index": "country_data"
},
"dest": {
"index": "country_data_new"
}
}
However when I did that I got a copy of my country_data with same amount of shards and replicas (1 and 1). I tried to learn more about it in documentation but all I found is this: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/option_slices.html
I couldn't find anything in documentation about increasing number of shards in existing index or how can I move data to new index which would have more shards. I would be grateful for any insights into this problem or at least a website where could I learn how to do it.
This can be done in any of the below mentioned way.
1st Option : You can use the elastic search Split Index API.
I suggest you to please go through the documentation once before proceeding with this method.
2nd Option : Create a new index with same mappings and give the required settings for new shards. Then use the reindex API to copy data from source index to destination index
To create the new Index:
PUT /<NEW_INDEX_NAME>
{
"settings": {
"number_of_shards": <REQUIRED_NUMBER_OF_SHARDS>
},
"mappings": {<MAPPINGS_OF_SOURCE_INDEX>}
}
}
If you don't give the number of shards in the settings while creating an index, by default it creates index with one primary and one replica shard.
To Reindex from source to newly created index:
POST _reindex
{
"source": {
"index": "<SOURCE_INDEX_NAME>"
},
"dest": {
"index": "<NEW_INDEX_NAME>"
}
}

ElasticSearch not able to return data going above 10,000 offset, I am not allowed to make index level changes. Can't use Scroll API

I am running ES query step by step for different offset and limit. For example 100 to 149, then 150 to 199, then 200 to 249.. and so on.
When I keep offset+limit more than 10,000 then getting below exception:
{
"error": {
"root_cause": [
{
"type": "query_phase_execution_exception",
"reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "xyz",
"node": "123",
"reason": {
"type": "query_phase_execution_exception",
"reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter."
}
}
]
},
"status": 500
}
I know we can solve this by increasing the "max_result_window". I tried it and it helped too. I increased it to 15,000 then 30,000. But I am not allowed to make index level changes.
So, I changed it back to default one 10,000.
How can I solve this problem? This query is getting hit by an API call.
There are two approach which worked for me-
increasing the max_result_window
Using filter
a. by knowing the unique id of data records
b. by knowing the time frame
First approach was applied using below
PUT /index/_settings
{ "max_result_window" : 10000 }
This worked and solved my problem, but number of records is dynamic element and increasing very fast. So, it is not good to keep increasing this window. Also in my case we use index on sharing basis. So,this change will effect all the users or group on this shared index. So, we moved on to second approach.
Second approach
Part1: First I applied filter on last update timestamp and if record count is greater than 10K then I divide the time frame by half and keep doing it until it reaches count less than 10k.
Part2: As same data is also available in OLTP, I got the complete list of a unique identifier and sorted it. Then applied filter on that identifier and only fetched data in range of 10K. Once 10K data is fetched using pagination, then change the filter and move to next batch of 10k data.
Part3: Applied sorting on last updated timestamp and started fetching data using pagination. Once record count reaches 10k, get the timestamp of 9999 record and apply greater_than filter on identifier and then fetch next 10k records.
All mentioned solution helped me. But I selected the Part3 of second approach. As it is easy to implement and give a sorted data quickly.
Consider scroll API - https://www.elastic.co/guide/en/elasticsearch/reference/2.2/search-request-scroll.html
This is also suggested in manual

Term filter causes memory to spike drastically

The term filter that is used:
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search' -d '{
"filter": {
"term": {
"void": false
}
},
"fields": [
[
"user_id1",
"user_name",
"date",
"status",
"q1",
"q1_unique_code",
"q2",
"q3"
]
],
"size": 50000,
"sort": [
"date_value"
]
}'
The void field is a boolean field.
The index store size is 504mb.
The elasticsearch setup consists of only a single node and the index
consists of only a single shard and 0 replicas. The version of
elasticsearch is 0.90.7
The fields mentioned above is only the first 8 fields. The actual
term filter that we execute has 350 fields mentioned.
We noticed the memory spiking by about 2-3gb though the store size is only 504mb.
Running the query multiple times seems to continuously increase the memory.
Could someone explain why this memory spike occurs?
It's quite an old version of Elasticsearch
You're returning 50,000 records in one get
Sorting the 50k records
Your documents are pretty big - 350 fields.
Could you instead return a smaller number of records? and then page through them?
Scan and Scroll could help you.
it's not clear whether you've indexed individual fields - this could help as the _source being read from disk may be incurring a memory overhead.

Resources