We use an ELK stack for our logging. I've been asked to design a process for how we would remove sensitive information that had been logged accidentally.
Now, based on my reading of how Elasticsearch (Lucene) handles deletes and updates, the data is still in the index, just not available. It will ultimately get cleaned up as segments get merged, etc.
Is there a process to run an update (to redact something) or delete (to remove something) and guarantee its removal?
When updating or deleting some value, ES will mark the current document as deleted and index the new document. The deleted value will still be present in the index, but will never be returned by a search. Granted, if someone gets access to the underlying index files, they might be able to use some tool (Luke or similar) to view what's inside the index files and potentially see the deleted sensitive data.
The only way to guarantee that the documents marked as deleted are really removed from the index segments is to force a merge of the existing segments.
POST /myindex/_forcemerge?only_expunge_deletes=true
Be aware, though, that there is a setting called index.merge.policy.expunge_deletes_allowed that defines a threshold below which the force merge doesn't happen. By default this threshold is set at 10%, so a segment with fewer than 10% deleted documents is left untouched by the force merge call. You might need to lower the threshold in order for the deletion to happen... or, maybe easier, make sure not to index sensitive information in the first place.
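As a rough sketch of the two steps above (lower the threshold, then force the merge), the following builds both HTTP requests with the Python standard library. The node URL, the helper names, and the threshold value of 0 are assumptions for illustration, and it assumes the setting can be changed on a live index through the _settings endpoint; check that against your Elasticsearch version before relying on it.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured node


def settings_request(index, threshold_percent):
    """Build a PUT that lowers the expunge-deletes threshold for one index."""
    body = {"index.merge.policy.expunge_deletes_allowed": threshold_percent}
    return urllib.request.Request(
        f"{ES_URL}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def forcemerge_request(index):
    """Build the force-merge call shown in the answer above."""
    return urllib.request.Request(
        f"{ES_URL}/{index}/_forcemerge?only_expunge_deletes=true",
        method="POST",
    )


def expunge_deletes(index, threshold_percent=0):
    """Send both requests in order; this needs a reachable cluster."""
    for req in (settings_request(index, threshold_percent), forcemerge_request(index)):
        with urllib.request.urlopen(req) as resp:
            print(req.get_method(), req.full_url, resp.status)
```

Calling expunge_deletes("myindex") would send the settings change first and the force merge second. You may want to restore the original threshold afterwards, since force merges are expensive.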
Related
I've created some tests for my ElasticSearch functionality and I've noticed some strange behavior. If I have a test that:
Inserts a document and confirms there are no errors
Retrieves that same document, confirms there are no errors and confirms it has the expected values
Deletes the document, confirms there are no errors and confirms 1 document was deleted
Then the 3rd test will fail because 0 documents were deleted. If I take one of the following steps:
Debug the test and put a breakpoint after insert but before delete
Add time.Sleep(time.Second) immediately before the delete step
then 1 document is deleted and the 3rd test will pass. In the cases where the 3rd test has failed, I've gone into my ES instance and confirmed that the document exists.
This leads me to believe that after inserting a document there is some span of time where something has to happen before I can delete the document.
My question is: what needs to happen after insert so that I can delete a document, and is there a better way for me to handle this in my tests than sleeping for 1 second?
I am coding in Golang and I am using the Olivere ES Client.
Elasticsearch operations can be inconsistent.
You can check the refresh or wait_for_active_shards options to see if they fit your test.
NB: it’s always difficult to add tests to an inconsistent system.
I would not use the term inconsistent. Storing and retrieving a document are real-time operations. Search happens in near real-time.
While you can always search for documents, they will only make it into your result set once the data structures for search exist (typically the inverted indices). Creating and maintaining this data structure for every single document that gets indexed would be costly and inefficient. That's why the data structure gets created at the latest when the refresh interval has expired (the default refresh interval is 1 second).
Also, when deleting a document, the document does not get immediately removed from disk. It first gets marked for deletion, ensuring that it will no longer show up in any results. But only after some Elasticsearch internal housekeeping (segment merges), the documents marked for deletion eventually get wiped.
That should give you an idea why, for search, we talk about near real-time behaviour, or what you describe as a "gap".
Especially for unit/integration tests, you want to make sure that a document can be found right after it has been indexed. You can easily achieve this by turning your index/write request into a blocking one by adding the parameter refresh=wait_for. With this, the indexing request only returns AFTER the data structures needed for search have been created, which ensures that the document is available to your next request for whatever action you want to execute.
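To make the refresh=wait_for idea concrete, here is a minimal sketch of such a blocking index request, written as plain HTTP with the Python standard library rather than with the Go client from the question. The node URL, the index name, the document, and the typeless _doc path (as used by recent Elasticsearch versions) are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local test node


def index_request(index, doc_id, doc, refresh="wait_for"):
    """Build an index request that only returns once the doc is searchable."""
    query = urllib.parse.urlencode({"refresh": refresh})
    return urllib.request.Request(
        f"{ES_URL}/{index}/_doc/{doc_id}?{query}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical test document; sending it requires a running node:
#   urllib.request.urlopen(req)
req = index_request("tests", "1", {"name": "example"})
print(req.get_method(), req.full_url)
```

With the write blocking like this, the test no longer needs the one-second sleep before the search or delete step.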
I am facing a strange issue with the number of deleted docs in an Elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see that the total number of docs is increasing, I have also been seeing non-zero values in the docs deleted column. I am unable to understand where this number comes from.
I tried reading whether the update doc first deletes the doc and then re-indexes it so in this way the delete count gets increased. However, I could not get any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: It is a single node elasticsearch.
I expect to know the reason behind deletion of docs.
You are correct that updates are the reason you see a count for deleted documents.
If we talk about Lucene, there is no such thing as an update there. It can also be said that documents in Lucene are immutable.
So how does Elastic provide the update feature?
It does so by making use of the _source field, which is why _source must be enabled to use Elastic's update feature. When you use the update API, Elastic reads the _source to get all the fields and their existing values, and replaces the values of only the fields sent in the update request. It then marks the existing document as deleted and indexes a new document with the updated _source.
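The mechanism described above can be mimicked with a small toy model. This is only an illustration of "mark the old copy deleted, index a merged copy"; the class and its field names are hypothetical and have nothing to do with the real Elasticsearch or Lucene code.

```python
class ToyIndex:
    """Toy model of update-as-delete-plus-reindex; not Elasticsearch code."""

    def __init__(self):
        # Append-only list of [doc_id, source, deleted_flag] entries.
        self.entries = []

    def index(self, doc_id, source):
        # Mark any live copy as deleted, then append the new version.
        for entry in self.entries:
            if entry[0] == doc_id and not entry[2]:
                entry[2] = True
        self.entries.append([doc_id, dict(source), False])

    def update(self, doc_id, partial):
        # Read the stored source, merge in the changed fields, reindex.
        current = next(e[1] for e in self.entries if e[0] == doc_id and not e[2])
        self.index(doc_id, {**current, **partial})

    def stats(self):
        live = sum(1 for e in self.entries if not e[2])
        return {"docs.count": live, "docs.deleted": len(self.entries) - live}


idx = ToyIndex()
idx.index("1", {"title": "a", "body": "b"})
idx.update("1", {"body": "c"})
print(idx.stats())  # {'docs.count': 1, 'docs.deleted': 1}
```

A single update leaves one live document and one deleted one, which is the same pattern as the non-zero docs.deleted column in the question.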
What is the advantage of this if it's not an actual update?
It spares the application from always having to assemble the complete document even when only a small subset of fields needs to change. Rather than sending the full document, you send only the fields that need updating via the update API; Elastic takes care of the rest.
It saves some network round-trips, reduces payload size, and also reduces the chance of version conflicts.
You can read more about how update works here.
We have an index of around 20GB; the documents have several large fields, many of which are now redundant.
So I decided to use bulk update to set those fields to empty, in the expectation of recovering space on the server.
I tested a small number of instances, using code of the form:
POST myindex/doc/_bulk
{"update":{"_id":"ccp-23-1002"}}
{"doc" : { "long_text_1":"", "long_text_2":""}}
{"update":{"_id":"ccp-28-1007"}}
{"doc" : { "long_text_1":"", "long_text_2":""}}
This worked fine: I ran a search, and it showed that the fields long_text_1 and long_text_2 were now blank on the specified docs, with the other fields unchanged.
So then I scripted something to run the above across all the docs in the index, 1000 at a time. After a few had gone through, I checked the data in the console using
GET _cat/indices?v&s=store.size&h=index,docs.count,store.size
... which showed that while the index in question had the same number of documents, the store.size had got larger, not smaller!
Presumably what is happening is that in each case after an update, a new doc has been created with the same data as the old doc, except with the fields specified in the update request changed; and the old doc is still sitting in the index, presumably marked as dead, but taking up space. So the exercise is having exactly the opposite of the intended effect.
So my question is, how to instruct ES to compact the index or otherwise reclaim this dead space?
I am updating existing documents by deleting and reindexing them. I did it this way because the documents have nested components and it was easier to massage the document myself rather than construct an update operation.
Mostly this works fine, but occasionally the system updates the same document twice very quickly. I think what is happening is that the search for the second update gets the original document (before it was updated the first time) because the previous updates have not yet been reflected in the indexes. By the time I try to delete the document (by id), the index has updated and it comes up as not found.
I am not doing bulk updates.
Is this a known issue and if so how does one work around it?
I can't find any reference to problems like this anywhere so I am puzzled.
I am working on a project that uses Elasticsearch. I have my core search UI working. I'm now looking to improve some things. In this process, I discovered that I do not really understand what happens during "indexing". I understand what an index is. I understand what a document is. I understand that indexing happens either a) when a document is added, b) when a document is updated, or c) when the refresh endpoint is called.
Still, I do not really understand the detail behind indexing. For example, does indexing happen if a document is removed? What really happens during indexing? I keep looking for some documentation that explains this. However, I'm not having any luck.
Can someone please explain what happens during indexing and possibly point out some documentation?
Thank you!
Indexing is a huge process and has a lot of steps involved in it. I will try to provide a brief intro to the major steps in the indexing process.
Making Text Searchable
Every word in a text field needs to be searchable.
The data structure that best supports the multiple-values-per-field requirement is the inverted index. The inverted index contains a sorted list of all of the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.
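As a small illustration of that structure, the sketch below builds an inverted index from two made-up documents. The tokenizer is deliberately naive (lowercase and split on whitespace), which is an assumption for the example and far simpler than a real analyzer.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique term to the sorted list of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the terms, and the ids within each posting list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {1: "The quick brown fox", 2: "The lazy brown dog"}
index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["fox"])    # [1]
```

Looking up a term is then a single dictionary access that returns every document containing it.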
Updating Index
First of all, please note that a Lucene index is immutable.
Hence, for any CRUD operation other than a read, instead of rewriting the whole inverted index, Lucene adds new supplementary indices to reflect more recent changes.
Indexing Process
New documents are collected in an in-memory indexing buffer.
Every so often, the buffer is committed:
A new segment—a supplementary inverted index—is written to disk.
A new commit point is written to disk, which includes the name of the new segment.
The disk is fsync’ed—all writes waiting in the filesystem cache are flushed to disk, to ensure that they have been physically written.
The new segment is opened, making the documents it contains visible to search.
The in-memory buffer is cleared, and is ready to accept new documents.
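The steps above can be sketched as a toy model in which documents sit in a buffer until a commit turns them into a new immutable segment. The class is hypothetical and leaves out the commit point and the fsync; it only shows why a document is not searchable before the segment is opened.

```python
class ToyShard:
    """Toy model of the buffer-then-segment flow; not Lucene code."""

    def __init__(self):
        self.buffer = []    # in-memory indexing buffer
        self.segments = []  # each segment is an immutable tuple of docs

    def add(self, doc):
        self.buffer.append(doc)

    def commit(self):
        # Write the buffered docs as a new segment, then clear the buffer.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, word):
        # Only documents in opened segments are visible to search.
        return [d for seg in self.segments for d in seg if word in d.split()]


shard = ToyShard()
shard.add("hello world")
print(shard.search("hello"))  # []
shard.commit()
print(shard.search("hello"))  # ['hello world']
```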
What happens in case of Delete
Segments are immutable, so documents cannot be removed from older segments.
When a document is “deleted,” it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.
When is it actually removed
In Segment Merging, deleted documents are purged from the filesystem.
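A toy sketch of the delete and merge behaviour described above: deleted ids are only recorded on the side, search filters them out of the hits, and a merge writes a new segment without them. The function names and the single shared set of deleted ids are simplifications made for this example.

```python
def search(segments, deleted, word):
    """Match against all segments, then drop hits that are marked deleted."""
    hits = [
        (doc_id, text)
        for seg in segments
        for doc_id, text in seg
        if word in text.split()
    ]
    return [hit for hit in hits if hit[0] not in deleted]


def merge(segments, deleted):
    """Write one new segment containing only the live documents."""
    merged = tuple(
        (doc_id, text)
        for seg in segments
        for doc_id, text in seg
        if doc_id not in deleted
    )
    return [merged], set()


segments = [((1, "secret token"), (2, "public note")), ((3, "secret plan"),)]
deleted = {1}  # doc 1 is marked deleted but is still stored in its segment
print(search(segments, deleted, "secret"))  # [(3, 'secret plan')]
segments, deleted = merge(segments, deleted)
print(segments)  # [((2, 'public note'), (3, 'secret plan'))]
```

Before the merge, the text of document 1 is still physically present in the first segment; only after the merge is it gone.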