How to delete data from Elasticsearch through the Java API

I'm trying to find out how to delete data from Elasticsearch according to certain criteria. I know that older versions of Elasticsearch had a Delete By Query feature, but it had really serious performance issues, so it was removed. I also know that there is a Java plugin for delete by query:
org.elasticsearch.plugin:delete-by-query:2.2.0
But I don't know whether it has a better-performing implementation of delete or whether it's the same as the old one.
Also, someone suggested using scroll to remove data, but I only know how to retrieve data by scrolling, not how to use scroll to delete!
Does anyone have an idea? (The number of documents to remove in a single call would be huge, over 50k documents.)
Thanks in advance!
Finally, I used this guy's third option.

You are correct that you want to use the scroll/scan. Here are the steps:
1. Begin a new scroll/scan.
2. Get the next N records.
3. Take the IDs from each record and do a bulk delete of those IDs.
4. Go back to step 2.
So you don't delete directly using the scroll/scan; you just use it as a tool to get the IDs of all the records you want to delete. This way you're only deleting N records at a time and not all 50,000 in one chunk (which would cause you all kinds of problems).
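
A minimal sketch of that loop, assuming the legacy 2.x TransportClient API; the index name, query, and batch size are placeholders, not from the question:

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class ScrollDelete {
    // Deletes every document matching the criteria, N at a time.
    public static void deleteByScroll(Client client) {
        SearchResponse scrollResp = client.prepareSearch("my_index")      // placeholder index
                .setScroll(new TimeValue(60000))                          // keep the scroll context alive for 60s
                .setQuery(QueryBuilders.termQuery("status", "obsolete"))  // placeholder criteria
                .setSize(500)                                             // N records per batch
                .get();
        while (scrollResp.getHits().getHits().length > 0) {
            // Step 3: collect the IDs of this batch and bulk-delete them.
            BulkRequestBuilder bulk = client.prepareBulk();
            for (SearchHit hit : scrollResp.getHits().getHits()) {
                bulk.add(client.prepareDelete(hit.getIndex(), hit.getType(), hit.getId()));
            }
            bulk.get();
            // Step 4: fetch the next N records and repeat.
            scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .get();
        }
    }
}

The bounded setSize is what spreads the 50k+ deletes over many small bulk requests instead of one huge one.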

Related

Editing and re-indexing large amounts of data in elasticsearch (millions of records)

I recently made a new version of an index for my elasticsearch data with some new fields included. I re-indexed from the old index, so the new index has all of the old data along with a new mapping that includes the new fields.
Now, I'd like to update all of my elasticsearch data in the index to include these new fields, which I can calculate by making some separate database + api calls to other sources.
What is the best way to do this, given that there are millions of records in the index?
Logistically speaking, I'm not sure how to accomplish this... as in, how can I keep track of the records that I've updated? I've been reading about the scroll API, but I'm not certain it's viable because of the max scroll time of 24 hours (what if it takes longer than that?). Also, a serious consideration is that since I need to make other database calls to calculate the new field values, I don't want to hammer that database for too long in a single session.
Would there be some way to run an update for, say, 10 minutes every night, but keep track of which records have been updated/need updating?
I'm just not sure about a lot of this and would appreciate any insights or other ideas on how to go about it.
You would need to run an update by query on your original index, which is expensive.
You might be able to use aliases that point to the indices behind them; when you want to make a change, create a new index with the new mappings etc. and attach it to the alias, so new data coming in gets written correctly. Then reindex the "old" data into the new index.
That will depend on the details of what you're doing, though.
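
As a rough illustration of that alias flow with the 2.x Java admin client (all index/alias names and the mapping here are invented):

import org.elasticsearch.client.Client;

public class AliasSwap {
    public static void swapToNewIndex(Client client) {
        // Invented mapping containing the new field.
        String mapping = "{\"properties\": {\"new_field\": {\"type\": \"string\"}}}";

        // 1. Create the new index with the updated mappings.
        client.admin().indices().prepareCreate("myindex_v2")
                .addMapping("doc", mapping)
                .get();

        // 2. Repoint the alias in a single request, so everything
        //    reading or writing "myindex" switches over atomically.
        client.admin().indices().prepareAliases()
                .removeAlias("myindex_v1", "myindex")
                .addAlias("myindex_v2", "myindex")
                .get();

        // 3. Reindex the "old" data from myindex_v1 into myindex_v2,
        //    e.g. with a scroll + bulk-index loop.
    }
}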

Elastic Search Number of Document Views

I have a web app that is used to search and view documents in Elastic Search.
The goal now is to maintain two values.
1. How many times the document was fetched in total (life time views)
2. How many times the document was fetched in last 30 days.
Achieving the first is somewhat possible, but the second one seems to be a very hard problem.
The two values need to be part of the document as they will be used for sorting the results.
What is the best way to achieve this?
To maintain expiring data like that, you will need to store each view with its timestamp. I suppose you could store them in an array on the ES document, but you're asking for trouble doing it like that: the update operation you'd need to call every time the document is viewed has to delete and recreate the document (that's how ES does updates), and if two views happen at the same time it will be difficult to make sure both get stored.
There are two ways to store the views and make use of them in the query:
1. Put them in a separate store (it could be a different index in ES if you like), and run a cron job or similar every day to update every item in the main index with the number of views from the last thirty days in the view store. Even with a lot of data it should be possible to make this quite efficient, depending on your choice of store for views.
2. Use the Elasticsearch parent/child datatype to store views in the same index as the main documents, as children. I'm not sure that I'd particularly recommend this approach, but I think it should be possible with aggregations to write a query that sorts primary documents by the number of children (filtered by date). It might be quite slow, though.
I doubt there is any other way to do this with current versions of ES, because it doesn't support joining across indices. Either the data must be aggregated in advance onto the document, or it has to be available in the same index.
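
A sketch of the first option in the Java API; the view index, field names, and sizes here are all hypothetical:

import java.util.Collections;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

public class ViewCountJob {
    // Nightly cron job: count views from the last 30 days per document
    // in the view store, then write the counts onto the main documents.
    public static void refreshCounts(Client client) {
        SearchResponse resp = client.prepareSearch("views")
                .setSize(0)  // we only need the aggregation, not the hits
                .setQuery(QueryBuilders.rangeQuery("timestamp").gte("now-30d"))
                .addAggregation(AggregationBuilders.terms("by_doc")
                        .field("doc_id").size(10000))
                .get();

        Terms byDoc = resp.getAggregations().get("by_doc");
        BulkRequestBuilder bulk = client.prepareBulk();
        for (Terms.Bucket bucket : byDoc.getBuckets()) {
            // Partial update: only the views_30d field is touched.
            bulk.add(client.prepareUpdate("main_index", "doc", bucket.getKeyAsString())
                    .setDoc(Collections.singletonMap(
                            "views_30d", (Object) bucket.getDocCount())));
        }
        if (bulk.numberOfActions() > 0) {
            bulk.get();
        }
    }
}

Note that documents with zero views in the window would also need resetting to 0, which this sketch doesn't handle.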

Is it possible to delete duplicate documents after they are indexed?

Our log processing went a bit haywire and we ended up with 2-5 copies of some logs in our Elasticsearch cluster. Fortunately, each one has an md5 hash stored with it in an indexed field (which is its own problem, but for another day).
So, in theory, we should be able to delete all but one copy of every document by using that field. However, I can't figure out how to do that. Any suggestions?
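
One way to sketch the idea with the Java API (index and field names are invented; this assumes the md5 field is not analyzed, so a term query matches it exactly): find hashes that occur more than once with a terms aggregation, then bulk-delete every copy but the first.

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

public class Dedupe {
    public static void deleteDuplicates(Client client) {
        // Buckets of md5 values that occur on 2+ documents.
        SearchResponse resp = client.prepareSearch("logs")
                .setSize(0)
                .addAggregation(AggregationBuilders.terms("dupes")
                        .field("md5").minDocCount(2).size(1000))
                .get();

        Terms dupes = resp.getAggregations().get("dupes");
        BulkRequestBuilder bulk = client.prepareBulk();
        for (Terms.Bucket bucket : dupes.getBuckets()) {
            SearchHit[] copies = client.prepareSearch("logs")
                    .setQuery(QueryBuilders.termQuery("md5", bucket.getKeyAsString()))
                    .setSize((int) bucket.getDocCount())  // 2-5 copies each
                    .get().getHits().getHits();
            for (int i = 1; i < copies.length; i++) {  // keep copies[0]
                bulk.add(client.prepareDelete(
                        copies[i].getIndex(), copies[i].getType(), copies[i].getId()));
            }
        }
        if (bulk.numberOfActions() > 0) {
            bulk.get();
        }
    }
}

If there are more than 1000 duplicated hashes, the aggregation size needs raising, since a terms aggregation only returns the top buckets.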

Elasticsearch slow performance for huge data retrieval with source field

I'm using Elasticsearch to search across more than 10 million records; most records contain 1 to 25 words. I want to retrieve data from it, but the method I'm using now is drastically slow for big retrievals because I'm fetching the data from the source field. I want a method that can make this process faster. I'm free to use another database or anything else alongside Elasticsearch. Can anyone suggest some good ideas and examples for this?
I've tried searching for a solution on Google, and one solution I found was pagination. I've already applied it wherever possible, but pagination is not an option when I want to retrieve many (5000+) hits in one query.
Thanks in advance.
Try using scroll
While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.
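
A short sketch of such a scroll loop with the 2.x Java client (index, query, and field names are placeholders); limiting _source to just the fields you need also cuts the transfer cost:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class ScrollFetch {
    public static void fetchAll(Client client) {
        SearchResponse resp = client.prepareSearch("my_index")            // placeholder index
                .setScroll(new TimeValue(60000))
                .setQuery(QueryBuilders.matchQuery("text", "some words")) // placeholder query
                .setFetchSource(new String[]{"text"}, null)               // only pull the needed fields
                .setSize(1000)                                            // page size per round-trip
                .get();
        while (resp.getHits().getHits().length > 0) {
            for (SearchHit hit : resp.getHits().getHits()) {
                consume(hit.getSourceAsString());
            }
            resp = client.prepareSearchScroll(resp.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .get();
        }
    }

    private static void consume(String source) { /* process one record */ }
}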

How exactly does elasticsearch versioning work?

My understanding was that Elasticsearch would store the latest copy of the document and just update the version field number. But I was playing around with a few thousand documents and had the need to index them repeatedly without changing any data in them. My thinking was that the index size would remain the same, but that wasn't the case: the index size seemed to increase.
This confused me a little bit, so I just wanted to seek clarification on the internal mechanism of versioning within Elasticsearch.
An update is a delete + insert Lucene operation behind the scenes.
But you should know that Lucene does not really delete the document; it only marks it as deleted.
To remove deleted docs, you have to optimize your Lucene segments.
$ curl -XPOST 'http://localhost:9200/twitter/_optimize?only_expunge_deletes=true'
See the Optimize API. Also have a look at the merge options. Merging segments happens behind the scenes from time to time.
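
The rough Java API counterpart of that curl call, assuming a client old enough to still expose the optimize API (later releases renamed it to force merge):

import org.elasticsearch.client.Client;

public class ExpungeDeletes {
    // Merges segments so that documents marked as deleted are dropped.
    public static void expunge(Client client) {
        client.admin().indices().prepareOptimize("twitter")
                .setOnlyExpungeDeletes(true)
                .get();
    }
}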
For a general overview of versioning support in Elasticsearch, please refer to Elasticsearch Versioning Support.
