Is it possible to delete duplicate documents after they are indexed? - elasticsearch

Our log processing went a bit haywire and we ended up with 2-5 copies of some logs in our ElasticSearch cluster. Fortunately, each one has a md5 hash stored with it in an indexed field (which is its own problem, but for another day).
So, in theory, we should be able to delete all by one copy of every document by using that field. However, I can't figure out how to do that. Any suggestions?

Related

Does updating a doc increase the "delete" count of the index?

I am facing a strange issue in the number of docs getting deleted in an elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see that the total number of docs are increasing, I have also been seeing some non-zero values in the docs deleted column. I am unable to understand from where did this number come from.
I tried reading whether the update doc first deletes the doc and then re-indexes it so in this way the delete count gets increased. However, I could not get any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: It is a single node elasticsearch.
I expect to know the reason behind deletion of docs.
You are correct that updates are the cause that you see a count for documents delete.
If we talk about lucene then there is nothing like update there. It can also be said that documents in lucene are immutable.
So how does elastic provides the feature of update?
It does so by making use of _source field. Therefore it is said that _source should be enabled to make use of elastic update feature. When using update api, elastic refers to the _source to get all the fields and their existing values and replace the value for only the fields sent in update request. It marks the existing document as deleted and index a new document with the updated _source.
What is the advantage of this if its not an actual update?
It removes the overhead from application to always compile the complete document even when a small subset of fields need to update. Rather than sending the full document, only the fields that need an update can be sent using update api. Rest is taken care by elastic.
It reduces some extra network round-trips, reduce payload size and also reduces the chances of version conflict.
You can read more how update works here.

Reasons & Consequences of putting a Date in Elastic Index Name

I am looking at sending my App logs to Elastic (6.x) via FileBeat and Logstash. As mentioned in Configure the Logstash output and recommended elsewhere, it seems that I need add the Date to the Index name. The reason for doing so was that when the time came to delete old data, it was easier to delete an entire Index by date, rather than individual documents. Is this true?
If I should be following this recommendation of adding the Date to the Index Name, I’m curious what additional things I need to do to ensure seamless querying? By this I mean querying esp. in Kibana, for e.g. over the past day which would need to look at today’s index as well as yesterday’s index.
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp i.e. setting it up so that I do not see or have to deal with the date named indexes?
Edit: Kamal raised a good point that I have not provided any information about my cluster and my needs. The following is what I'm working with:
What is your daily data creation/expected count
I'm not sure. I don't expect anything more than a GB of data day, and no more than a couple of 100K documents a day. Since these are logs, I don't expect any updates to the documents once they are created.
Growth rate of the data in the future (1 year - 5 years)
At the moment, I don't see the growth rate to cross a GB a day.
How many teams are using the same cluster apart from yours if there is
any
The cluster would be used (actually queried) by just my team. We are about 5 right now, but I don't see more than 10 users (and that's not concurrent, just over a day or month)
Usage patterns, type of queries used etc.
I'm not sure, but there certainly would not be updates to the data other than deletions
Hardware details
I've not worked this out with management. For most part I expect 3 nodes. Also this is not critical i.e. if we lose all of our logs for some reason, I would not lose sleep over it.
First of all you need to take a step back and understand do you really need multiple index or single one(where you need to filter documents while querying using a date field for a particular date).
Some of questions you must have before you take on such decision
What is your daily data creation/expected count
Growth rate of the data in the future (1 year - 5 years)
How many teams are using the same cluster apart from yours if there is any
Usage patterns, type of queries used etc.
Hardware details
Advantages
In a way, having multiple indexes(with date field as its index name) would be more beneficial.
You can delete the old indexes without affecting new ones.
In case if you have to change the mapping, you can do so with the new index without affecting the old ones. Comparatively less overhead while for single index, you have to reindex all the documents which would take lot more time if size is pretty huge. And if this keeps happening every now and then, you would need to come up with solution where you have to execute such operations at the times of minimal usages. That means, it can harm productivity.
searching using multiple indexes still is convenient.
not really sure but its easier for scaling using multiple indexes.
Disadvantages are:
Additional shards are created for each and every index that can waste some storage space.
Overhead to maintain multiple indexes by monitoring/operations team.
At times can lead to over-creation of indexes.
No mapping changes and less documents insertion(in 100s or few 100s), it'd be better to use single index.
The only way and the only correct way to figure out what's best is to have a cluster that closely resembles the production one with data too resembling to production, try various configurations and see which solution fits best.
Speaking of querying in Kibana, is there a way of simply working with
the base index name without the date stamp i.e. setting it up so that
I do not see or have to deal with the date named indexes?
Yes there is. If you have indexes with names like logs-0001, logs-0002, you can use logs-* as indexname when you query.
Including date in an index name is a very common use case implemened by many Elasticsearch users. It helps with archiving/ purging old indices as you mentioned. You dont need to do anything additionally to be able to query. Setup your index basename as an index pattern for your indices for ex. logstash-* and you can query on that particular index pattern in Kibana.

How to absolutely delete something from ElasticSearch?

We use an ELK stack for our logging. I've been asked to design a process for how we would remove sensitive information that had been logged accidentally.
Now based on my reading around how ElasticSearch (Lucene) handles deletes and updates the data is still in the index just not available. It will ultimately get cleaned up as indexes get merged, etc..
Is there a process to run an update (to redact something) or delete (to remove something) and guarantee its removal?
When updating or deleting some value, ES will mark the current document as deleted and index the new document. The deleted value will still be available in the index, but will never get back from a search. Granted, if someone gets access to the underlying index files, he might be able to use some tool (Luke or similar) to view what's inside the index files and potentially see the deleted sensitive data.
The only way to guarantee that the documents marked as deleted are really deleted from the index segments, is to force a merge of the existing segments.
POST /myindex/_forcemerge?only_expunge_deletes=true
Be aware, though, that there is a setting called index.merge.policy.expunge_deletes_allowed that defines a threshold below which the force merge doesn't happen. By default this threshold is set at 10%, so if you have less than 10% deleted documents, the force merge call won't do anything. You might need to lower the threshold in order for the deletion to happen... or maybe easier, make sure to not index sensitive information.

Elasticsearch: time to index document

I am updating existing documents by deleting and reindexing them. I did it this way because the documents have nested components and it was easier to massage the document myself rather than construct an update operation.
Mostly this works fine but occasionally the system updates the same document twice very quickly. I think what is happening is that the the search for the second update gets the original document (before it was updated the first time) because the the previous updates have not yet been reflected in the indexes. By the time I try to delete the document (by id) the index has updated and it comes up as not found.
I am not doing bulk updates.
Is this a known issue and if so how does one work around it?
I can't find any reference to problems like this anywhere so I am puzzled.

How to delete data from ElasticSearch through JavaAPI

EDITED
I'm trying to find out how to delete data from Elasticsearch according to a criteria. I know that older versions of ElasticSearch had Delete By Query feature, but it had really serious performance issues, so it was removed. I know also for that there is a Java plugin for delete by query:
org.elasticsearch.plugin:delete-by-query:2.2.0
But I don't know if it has a better implementation of delete which has a better performance or it's the same as the old one.
Also, someone suggested using scroll to remove data, but I know how to retrieve data scrolling, not how to use scroll to remove!
Does anyone have an idea (the amount of documents to remove in a call would be huge, over 50k documents.
Thanks in advance!
Finally used this guy's third option
You are correct that you want to use the scroll/scan. Here are the steps:
begin a new scroll/scan
Get next N records
Take the IDs from each record and do a BulkDelete of those IDs
go back to step 2
So you don't delete exactly using the scroll/scan, you just use that as a tool to get all the IDs for the records that you want to delete. In this way you're only deleting N records at a time and not all 50,000 in 1 chunk (which would cause you all kinds of problems).

Resources