We have an index of around 20GB; the documents have several large fields, many of which are now redundant.
So I decided to use bulk update to set those fields to empty, in the expectation of recovering space on the server.
I tested a small number of instances, using code of the form:
POST myindex/doc/_bulk
{"update":{"_id":"ccp-23-1002"}}
{"doc" : { "long_text_1":"", "long_text_2":""}}
{"update":{"_id":"ccp-28-1007"}}
{"doc" : { "long_text_1":"", "long_text_2":""}}
This worked fine: a search showed that the fields long_text_1 and long_text_2 were now blank on the specified docs, with the other fields unchanged.
So then I scripted something to run the above across all the docs in the index, 1000 at a time. After a few had gone through, I checked the data in the console using
GET _cat/indices?v&s=store.size&h=index,docs.count,store.size
... which showed that while the index in question had the same number of documents, the store.size had got larger, not smaller!
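For anyone wanting to reproduce the batching step, here is a minimal sketch of how the bulk bodies could be built. The function name, the third document ID and the batch size of 2 in the demo are made up for illustration; only the NDJSON shape is taken from the request above.

```python
import json
from itertools import islice


def bulk_update_bodies(doc_ids, fields_to_blank, batch_size=1000):
    """Yield one NDJSON body per batch of IDs, suitable for POST myindex/doc/_bulk."""
    partial_doc = {"doc": {field: "" for field in fields_to_blank}}
    ids = iter(doc_ids)
    while True:
        batch = list(islice(ids, batch_size))
        if not batch:
            return
        lines = []
        for doc_id in batch:
            # Each update is an action line followed by a partial-document line.
            lines.append(json.dumps({"update": {"_id": doc_id}}))
            lines.append(json.dumps(partial_doc))
        # The bulk body is newline-delimited JSON and must end with a newline.
        yield "\n".join(lines) + "\n"


# "ccp-30-0001" is a hypothetical ID added so the demo spans two batches.
bodies = list(
    bulk_update_bodies(
        ["ccp-23-1002", "ccp-28-1007", "ccp-30-0001"],
        ["long_text_1", "long_text_2"],
        batch_size=2,
    )
)
print(bodies[0])
```

Each yielded string would then be sent as the body of one bulk request with whatever HTTP client you already use.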
Presumably what is happening is that each update creates a new doc with the same data as the old one, except for the fields changed in the update request, while the old doc is still sitting in the index, marked as dead but taking up space. So the exercise is having exactly the opposite of the intended effect.
So my question is: how do I instruct ES to compact the index or otherwise reclaim this dead space?
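The effect described above can be reproduced with a toy model. This is only a sketch of append-only storage with delete markers, not how Lucene is implemented; the class and its methods are invented for illustration. In a real cluster merges run in the background, and the force merge API (for example POST myindex/_forcemerge?only_expunge_deletes=true) is the documented way to request one explicitly; whether that is appropriate for a given index is worth checking against the docs first.

```python
class ToyIndex:
    """Append-only store: old versions are marked dead, never edited in place."""

    def __init__(self):
        # Each entry is [doc_id, source, is_live].
        self.entries = []

    def index(self, doc_id, source):
        for entry in self.entries:
            if entry[0] == doc_id and entry[2]:
                entry[2] = False  # the old version stays on "disk", marked deleted
        self.entries.append([doc_id, source, True])

    def update(self, doc_id, partial):
        current = next(e for e in self.entries if e[0] == doc_id and e[2])
        self.index(doc_id, {**current[1], **partial})

    def docs_count(self):
        return sum(1 for e in self.entries if e[2])

    def store_size(self):
        # Crude stand-in for bytes on disk: live and dead versions both count.
        return sum(len(str(e[1])) for e in self.entries)

    def merge(self):
        # A merge rewrites the data and keeps only the live versions.
        self.entries = [e for e in self.entries if e[2]]


idx = ToyIndex()
idx.index("ccp-23-1002", {"title": "a", "long_text_1": "x" * 1000})
size_before = idx.store_size()

idx.update("ccp-23-1002", {"long_text_1": ""})
size_after_update = idx.store_size()

idx.merge()
size_after_merge = idx.store_size()

print(size_before, size_after_update, size_after_merge)
```

The document count stays at 1 throughout, while the size first grows (old and new version both stored) and only drops below the starting size after the merge.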
In the write tuning section, Elastic recommends increasing the refresh interval.
We're doing document ingestions where during ingestion we may do reads, essentially like,
GET /my-index/_doc/mydocumentid
that is, a read of the document by its _id, as opposed to a search. Some descriptions suggest that the document id is just added to the Lucene index like other attributes. Does this mean that the read by id would still reset the refresh_interval and force a re-index instead of allowing it to wait for the full refresh_interval?
This is actually a tricky one:
You are correct that a GET on an _id works right away (unlike a multi-document operation like a search, which needs to wait for an explicit ?refresh from you or for the refresh_interval). But the underlying implementation changed twice:
Initially the GET on an _id read the data right from the translog, so it didn't need a refresh / the creation of a segment.
The code was complex, so in 5.0 we changed it to read from a segment, with a GET on an _id automatically triggering a _refresh. It looked the same from the outside and the code was simpler.
But for use cases that did a lot of GETs on _id this was expensive, since it creates lots of tiny segments. So in 7.6 we changed it back to read from the translog again.
So if you are using a current version, it doesn't trigger a _refresh.
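As a rough mental model of the current behaviour described above, here is a toy shard in which a GET by ID can read recent writes without a refresh, while a search only sees refreshed data. The class is invented for illustration and simplifies a lot; in particular, real Elasticsearch trims the translog on flush, not on refresh.

```python
class ToyShard:
    def __init__(self):
        self.translog = {}      # recent writes, not yet visible to search
        self.searchable = {}    # documents in refreshed segments
        self.refresh_count = 0

    def index(self, doc_id, source):
        self.translog[doc_id] = source

    def refresh(self):
        # Simplification: move everything into the searchable view at once.
        self.searchable.update(self.translog)
        self.translog.clear()
        self.refresh_count += 1

    def get(self, doc_id):
        # Real-time read by ID: look at recent writes first, no refresh needed.
        if doc_id in self.translog:
            return self.translog[doc_id]
        return self.searchable.get(doc_id)

    def search(self, predicate):
        # Searches only see what a refresh has already made visible.
        return [i for i, src in self.searchable.items() if predicate(src)]


shard = ToyShard()
shard.index("mydocumentid", {"status": "new"})

got = shard.get("mydocumentid")
hits_before = shard.search(lambda src: src["status"] == "new")
refreshes_after_get = shard.refresh_count

shard.refresh()
hits_after = shard.search(lambda src: src["status"] == "new")
print(got, hits_before, refreshes_after_get, hits_after)
```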
a get on the _id is not a search, so no
Curious if there is some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I’ll have a group of related document IDs and only want to return them if they are part of a larger query. Hoping to do this on the database side. Theoretically it seemed possible, since ES has to cache stuff related to large scrolls.
It's an interesting use case, but you need to understand that Elasticsearch (ES) doesn't return the IDs of all matching documents in the search result. By default it returns only 10 documents in the response, which can be changed with the size parameter.
If you increase the size param and your query matches millions of docs, query performance will be very bad, and firing such queries frequently might even bring the entire cluster down (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches things, but if you try to cache a huge amount of data that gets invalidated very frequently, you will not get the expected performance benefit, so it is better to benchmark it.
You are already on the right path in using the scroll API to iterate over millions of search results; the points below may help you improve further.
First get the count of search results. The default search response includes it, with a relation of eq or gte, which gives you an idea of how many results there are, so you can choose the size param for subsequent calls that check whether your ID is present.
Check whether you are making effective use of filter context in your query; filters are cached by ES by default.
Benchmark some of your heavy scroll API calls against your own data.
Refer to this thread to fine-tune your cluster and index configuration and optimize ES response times further.
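To make the filter-context point concrete, here is one possible sketch. Note that it goes a step beyond the answer above: it adds an ids query next to the original query, so ES only has to return the candidate IDs that also match. That combination is my suggestion rather than something stated above, and the helper name, the example term query and the IDs are all hypothetical.

```python
def restrict_to_ids(query, candidate_ids):
    """Build a search body that returns only the candidates matching `query`."""
    candidate_ids = list(candidate_ids)
    return {
        # At most len(candidate_ids) hits can come back.
        "size": len(candidate_ids),
        # Only the _id of each hit is needed, so skip the document body.
        "_source": False,
        "query": {
            "bool": {
                # Filter context: no scoring, and eligible for caching.
                "filter": [
                    query,
                    {"ids": {"values": candidate_ids}},
                ]
            }
        },
    }


# Hypothetical "large" query and candidate IDs.
body = restrict_to_ids({"term": {"status": "active"}}, ["doc-1", "doc-7"])
print(body)
```

The hits of such a request would be exactly the candidate documents that are part of the larger query, without scrolling through millions of results.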
I am facing a strange issue with the number of deleted docs in an Elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see that the total number of docs is increasing, I have also been seeing non-zero values in the docs deleted column. I am unable to understand where this number comes from.
I tried to find out whether an update first deletes the doc and then re-indexes it, which would increase the delete count. However, I could not find any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: It is a single node elasticsearch.
I expect to know the reason behind deletion of docs.
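For readers decoding the output line above, here is a small sketch that labels the columns. It assumes the default column order of _cat/indices (the same headers that adding ?v would print); the function name is made up.

```python
# Assumed default column order of GET _cat/indices.
CAT_INDICES_COLUMNS = [
    "health", "status", "index", "uuid", "pri", "rep",
    "docs.count", "docs.deleted", "store.size", "pri.store.size",
]


def parse_cat_indices_line(line):
    """Map one whitespace-separated output line to its column names."""
    return dict(zip(CAT_INDICES_COLUMNS, line.split()))


row = parse_cat_indices_line(
    "yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 "
    "zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb"
)

live = int(row["docs.count"])
deleted = int(row["docs.deleted"])
# Share of stored documents that are only delete markers waiting for a merge.
deleted_ratio = deleted / (live + deleted)
print(row["docs.count"], row["docs.deleted"], round(deleted_ratio, 3))
```

So in this output 21219975 is the live document count and 4302430 is the docs.deleted value being asked about.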
You are correct that updates are the reason you see a count of deleted documents.
In Lucene there is no such thing as an update; documents in Lucene are immutable.
So how does Elasticsearch provide the update feature?
It does so by making use of the _source field, which is why _source must be enabled to use the update feature. When you use the update API, Elasticsearch reads the _source to get all the fields and their existing values, and replaces the values of only the fields sent in the update request. It marks the existing document as deleted and indexes a new document with the updated _source.
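The merge step can be pictured roughly like this. It is a sketch of the idea, not Elasticsearch's code: the function name and sample document are invented, and it assumes objects are merged recursively while other values (including arrays) are replaced.

```python
import copy


def merge_partial(source, partial):
    """Return a new _source with the fields from `partial` merged in."""
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = merge_partial(merged[key], value)
        else:
            # Scalars and arrays are simply replaced.
            merged[key] = copy.deepcopy(value)
    return merged


old_source = {
    "title": "Report",
    "long_text_1": "lorem ipsum",
    "meta": {"lang": "en", "pages": 3},
}
new_source = merge_partial(old_source, {"long_text_1": "", "meta": {"pages": 4}})
print(new_source)
```

The old _source is left untouched, which mirrors the point above: the existing document is not edited, a complete new one is indexed in its place.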
What is the advantage of this if it's not an actual update?
It removes the burden on the application of always assembling the complete document even when only a small subset of fields needs updating. Rather than sending the full document, only the fields that need an update can be sent using the update API. The rest is taken care of by Elasticsearch.
It saves some network round trips, reduces payload size, and also reduces the chance of version conflicts.
You can read more how update works here.
I am updating existing documents by deleting and reindexing them. I did it this way because the documents have nested components and it was easier to massage the document myself rather than construct an update operation.
Mostly this works fine but occasionally the system updates the same document twice very quickly. I think what is happening is that the search for the second update gets the original document (before it was updated the first time) because the previous updates have not yet been reflected in the indexes. By the time I try to delete the document (by id) the index has updated and it comes up as not found.
I am not doing bulk updates.
Is this a known issue and if so how does one work around it?
I can't find any reference to problems like this anywhere so I am puzzled.
I am working on a project that uses Elasticsearch. I have my core search UI working. I'm now looking to improve some things. In this process, I discovered that I do not really understand what happens during "indexing". I understand what an index is. I understand what a document is. I understand that indexing happens either a) when a document is added, b) when a document is updated, or c) when the refresh endpoint is called.
Still, I do not really understand the detail behind indexing. For example, does indexing happen if a document is removed? What really happens during indexing? I keep looking for some documentation that explains this. However, I'm not having any luck.
Can someone please explain what happens during indexing and possibly point out some documentation?
Thank you!
Indexing is a huge process with a lot of steps involved. I will try to give a brief intro to the major steps in the indexing process.
Making Text Searchable
Every word in a text field needs to be searchable.
The data structure that best supports the multiple-values-per-field requirement is the inverted index. The inverted index contains a sorted list of all of the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.
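A tiny sketch of that structure, assuming a naive analyzer that just lowercases and splits on whitespace (real analyzers do much more); the function name and the two sample documents are made up.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Terms are kept in sorted order, each with its list of documents.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    1: "The quick brown fox",
    2: "the lazy brown dog",
}
inverted = build_inverted_index(docs)
for term, doc_ids in inverted.items():
    print(term, doc_ids)
```

Looking up a term is then a single dictionary access, for example inverted["brown"] gives the documents that contain it.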
Updating Index :
First of all, please note that a Lucene index is immutable.
Hence, for any CRUD operation other than a read, instead of rewriting the whole inverted index, Lucene adds new supplementary indices to reflect more recent changes.
Indexing Process
New documents are collected in an in-memory indexing buffer.
Every so often, the buffer is committed:
A new segment—a supplementary inverted index—is written to disk.
A new commit point is written to disk, which includes the name of the new segment.
The disk is fsync’ed—all writes waiting in the filesystem cache are flushed to disk, to ensure that they have been physically written.
The new segment is opened, making the documents it contains visible to search.
The in-memory buffer is cleared, and is ready to accept new documents.
What happens in case of Delete
Segments are immutable, so documents cannot be removed from older segments.
When a document is “deleted,” it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.
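The "matches but is filtered out" behaviour can be sketched as follows. The segment names, the deleted set standing in for the .del file, and the search function are all illustrative, not Lucene's actual data structures.

```python
def search(segments, deleted, term):
    """Match against every segment, then drop hits that are marked deleted."""
    hits = []
    for segment_name, docs in segments.items():
        for doc_id, text in docs.items():
            if term in text.split():
                hits.append((segment_name, doc_id))
    # Deleted documents did match above; they are removed only at this point.
    return [hit for hit in hits if hit not in deleted]


# doc1 was updated: the old version lives on in seg_0, the new one in seg_1.
segments = {
    "seg_0": {"doc1": "old text about merging"},
    "seg_1": {"doc1": "new text about merging"},
}
deleted = {("seg_0", "doc1")}  # plays the role of the .del file

results = search(segments, deleted, "merging")
print(results)
```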
When is it actually removed
In Segment Merging, deleted documents are purged from the filesystem.
References :
Elasticsearch Docs
Inverted Index
Lucene Talks