Using a job to delete data from Elasticsearch asynchronously - elasticsearch

I would like to delete data from Elasticsearch using the API (curl).
I would like to start the deletion process and later query the progress of that deletion.
Is it possible to use a job to do this?
I tried looking at the relevant documentation, but it contains very few examples.
I would appreciate any relevant information or links.

You have two solutions:
Use the delete-by-query API with a range query, which you can then monitor using the Task API (see the sketch below)
Use daily indices (e.g. my-logs-2018-09-10, my-logs-2018-09-11, etc.) so that deleting data in the past is simply a matter of deleting the indices for the days you want to ditch. No need to monitor anything, as this happens instantaneously
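
A minimal sketch of the first option, assuming a hypothetical index my-logs with a date field @timestamp (adjust both to your data). Passing wait_for_completion=false makes _delete_by_query return a task ID instead of blocking, and you can poll that ID for progress:

curl -X POST "localhost:9200/my-logs/_delete_by_query?wait_for_completion=false" \
  -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "@timestamp": { "lt": "now-30d" }
    }
  }
}'

# The response contains a task ID such as "oTUltX4IQMOUUVeiohTt8A:12345" (example value).
# Poll it to see how many documents have been deleted so far:
curl -X GET "localhost:9200/_tasks/oTUltX4IQMOUUVeiohTt8A:12345"

The task status includes total, deleted, and batches counters, which is exactly the progress information asked about.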

Related

Elasticsearch count of searches against an index resets to zero after cluster restart

We use Elasticsearch - one cluster is 7.16 and another is 8.4. Behavior is the same in both.
We need to be able to get a count of search queries run against an index since the index's creation.
We retrieve the number of searches that have been run against a given index using the _stats endpoint, like so:
GET /_stats?filter_path=indices.my_index.primaries.search.query_total
The problem is that this stat resets to zero after a cluster reboot. Does this data persist anywhere for a given index so that I can get the total since the inception of the index? If not, is there an action I can take to record that stat before a reboot so that I can always access the full total?
EDIT - this is the only item I was able to find on this subject, and the answer in this discussion does not look promising: https://discuss.elastic.co/t/why-close-reopen-index-will-reset-index-stats-to-zero/170830
As far as I know, there is no out-of-the-box solution for this use case, but it's not that hard to build yourself either: you can simply call the same _stats API periodically and store the result in another Elasticsearch index (or a database) so that it doesn't get reset. IMHO it's not that big a job.
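
A minimal sketch of that approach as a shell script, assuming a hypothetical history index named index-stats-history and that jq is available to extract the counter (both are assumptions, not part of the question):

#!/bin/sh
# Read the current search counter for my_index from the _stats endpoint.
TOTAL=$(curl -s 'localhost:9200/_stats?filter_path=indices.my_index.primaries.search.query_total' \
  | jq '.indices.my_index.primaries.search.query_total')

# Append a timestamped snapshot to a separate index so the value survives restarts.
curl -s -X POST 'localhost:9200/index-stats-history/_doc' \
  -H 'Content-Type: application/json' \
  -d "{\"index\": \"my_index\", \"query_total\": $TOTAL, \"@timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}"

Run it from cron every few minutes; the all-time total is then the last snapshot taken before a restart plus the current post-restart counter.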

Periodically remove documents from Elasticsearch index depending on field

Let's say I have an index called car. The documents in car have the following fields:
constructionYear
seats
decommissioned
…
Now I want to periodically delete all documents where decommissioned is true.
Is there a way to configure such a job on the Elasticsearch server? Or do I have to perform a REST call every time I want to clean up the index?
You'd need to build a delete-by-query to manage this, and then schedule it outside of Elasticsearch to run every so often; there's no built-in scheduler in Elasticsearch to do this (a sketch follows below).
However, to the point of Yuri's comment above, why not just leave them? You can still run analytics on the data.
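
A minimal sketch of such a delete-by-query, using the car index and decommissioned field from the question (the cron entry and script path are only illustrations of external scheduling):

# Delete every document where decommissioned is true.
curl -X POST "localhost:9200/car/_delete_by_query" \
  -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": { "decommissioned": true }
  }
}'

# Hypothetical cron entry running the cleanup nightly at 02:00:
# 0 2 * * * /opt/scripts/purge_decommissioned.sh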
Actually, you can utilize a Watcher for this purpose.
It's not what they were made for, yet you can set up a webhook action there to go through whatever your search input returns and make a REST call to delete the unwanted docs by ID.
That way you're able to keep everything within your Elastic cluster.
P.S. Though it may make sense to rethink your "data model" a bit, really.
Elasticsearch is not what your regular RDBMS is, and selective deletes can get VERY expensive.
It's better to leave them sitting there and simply modify your queries to acknowledge that attribute.
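
A minimal sketch of such a watch, assuming the X-Pack Watcher API is available on your cluster. Rather than deleting the returned hits one by one by ID, this version points the webhook straight at _delete_by_query, which is simpler but follows the same idea:

PUT _watcher/watch/purge_decommissioned_cars
{
  "trigger": { "schedule": { "interval": "24h" } },
  "input": {
    "search": {
      "request": {
        "indices": ["car"],
        "body": { "query": { "term": { "decommissioned": true } } }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "purge": {
      "webhook": {
        "method": "post",
        "host": "localhost",
        "port": 9200,
        "path": "/car/_delete_by_query",
        "headers": { "Content-Type": "application/json" },
        "body": "{\"query\":{\"term\":{\"decommissioned\":true}}}"
      }
    }
  }
}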

Re-processing data for Elasticsearch with a new pipeline

I have an ELK-stack server that is being used to analyse Apache web log data. We're loading ALL of the logs, going back several years. The purpose is to look at some application-specific trends over this time period.
The data-processing pipeline is still being tweaked, as this is the first time anyone has looked in detail into this data and some people are still trying to decide how they want the data to be processed.
Some changes were suggested, and while they're easy enough to make in the Logstash pipeline for new, incoming data, I'm not sure how to apply these changes to the data that's already in Elasticsearch. It took several days to load the current data set, and quite a bit more data has been added since, so re-processing everything through Logstash with the modified pipeline would probably take several days longer.
What's the best way to apply these changes to data that has already been ingested into Elasticsearch? In the early stages of testing this set-up, I would just remove the index and rebuild from scratch, but that was done with very limited data sets; with the amount of data in use here, I'm not sure that's feasible. Is there a better way?
Set up an ingest pipeline and use the reindex API to move data from the current index to a new index, with the pipeline configured for the destination index (see the sketch below).
Ingest Node
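
A minimal sketch of that approach; the pipeline name apache_rework, the set processor, and the index names are placeholders for whatever your modified processing actually does:

# Define an ingest pipeline that applies the new processing rules.
curl -X PUT "localhost:9200/_ingest/pipeline/apache_rework" \
  -H 'Content-Type: application/json' -d'
{
  "description": "Re-apply updated parsing to existing documents",
  "processors": [
    { "set": { "field": "pipeline_version", "value": "2" } }
  ]
}'

# Reindex the old data through the pipeline into a new index.
# wait_for_completion=false returns a task ID you can poll via the Task API.
curl -X POST "localhost:9200/_reindex?wait_for_completion=false" \
  -H 'Content-Type: application/json' -d'
{
  "source": { "index": "apache-logs" },
  "dest": { "index": "apache-logs-v2", "pipeline": "apache_rework" }
}'

Reindexing within the cluster avoids pushing everything back through Logstash, which is usually much faster than a full reload.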

Summarization in Elasticsearch

I am a newbie to Elasticsearch. We are currently using the Splunk platform for our analytics application and are looking to migrate to ELK. Splunk provides options to schedule searches to run periodically in the background and to store the search results in a separate summary index. Is similar functionality available in Elasticsearch? If so, please point me to the documentation describing the process.
Thanks,
Keerthana
This is a great use case. Of course Elasticsearch can perform such tasks, but it is more manual: you have to write your own script. For example, if you want to summarize data, you can use Elasticsearch aggregations, take the result (which comes back in JSON format) and store it in an index where you keep summary data. That way, even if you delete your raw data, your summary data lives on.
Elasticsearch comes with different clients. I like to use the Python Elasticsearch DSL library.
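
A minimal sketch of the idea in plain curl (the index names raw-events and daily-summaries and the @timestamp field are placeholders; the same flow can be written with the Python client mentioned above):

# Aggregate the raw data, e.g. daily event counts.
curl -s -X POST "localhost:9200/raw-events/_search?size=0" \
  -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "per_day": {
      "date_histogram": { "field": "@timestamp", "calendar_interval": "day" }
    }
  }
}'

# Take each bucket from the JSON response and index it into the summary index:
curl -s -X POST "localhost:9200/daily-summaries/_doc" \
  -H 'Content-Type: application/json' \
  -d '{"day": "2023-01-01", "event_count": 12345}'

Scheduling that script with cron (or similar) approximates Splunk's scheduled searches and summary indexing.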

Elasticsearch query a specific node for scroll

I have a scan/scroll query where each document that comes back has something done to it, and then the changes are written back. Basically, I'm mapping over the whole index (or document type, actually).
If the function applied during this mapping starts to become too slow then I need to find a way to split this across several machines.
I could share a scroll ID across multiple machines using ZooKeeper or something, but will there be issues querying ES from two clients at almost the same time?
Alternatively, is there a way to write a query that will only run against one specified node? This way, if I had one 'mapping process' on the same box as one node then I could remove the network overhead.
Check "_only_node" or "_prefer_node" option in ElasticSearch API.
