Periodically remove documents from an Elasticsearch index depending on a field

Let's say I have an index called car. The documents in car have the following fields:
constructionYear
seats
decommissioned
…
Now I want to periodically delete all documents where decommissioned is true.
Is there a way to configure such a job on the Elasticsearch server? Or do I have to perform a REST call every time I want to clean up the index?

You'd need to run a delete-by-query to manage this, and then schedule it outside of Elasticsearch to run every so often. There's no built-in scheduler in Elasticsearch for this.
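For illustration, a minimal delete-by-query request for the fields in the question might look like this (assuming decommissioned is mapped as a boolean):

POST /car/_delete_by_query
{
  "query": {
    "term": { "decommissioned": true }
  }
}

You could then fire this from cron, a CI scheduler, or any other external job runner.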
However, to the point of Yuri's comment above, why not just leave them? You can still run analytics on the data.

Actually, you can use a Watcher for this purpose.
It's not what Watchers were made for, but you can configure a Webhook action there to go through whatever your search input returns and make a REST call to delete the unwanted docs by ID.
That way you'll be able to keep it all within your Elastic cluster.
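As a rough sketch (the watch name and schedule are hypothetical, and the exact Watcher syntax varies across versions), such a watch could combine a search input with a webhook action that triggers a delete-by-query:

PUT _watcher/watch/cleanup_decommissioned
{
  "trigger": { "schedule": { "interval": "1d" } },
  "input": {
    "search": {
      "request": {
        "indices": ["car"],
        "body": { "query": { "term": { "decommissioned": true } } }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "purge": {
      "webhook": {
        "method": "POST",
        "host": "localhost",
        "port": 9200,
        "path": "/car/_delete_by_query",
        "body": "{\"query\":{\"term\":{\"decommissioned\":true}}}"
      }
    }
  }
}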
P.S. Though it makes sense to rethink your "data model" a bit, really.
Elasticsearch is not a regular RDBMS, and selective deletes can get VERY expensive.
It's better to let the documents sit there and simply modify your queries to account for that attribute.
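For example, a query that simply skips decommissioned cars might look like this (a minimal sketch):

GET /car/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "decommissioned": false } }
      ]
    }
  }
}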

Related

Elasticsearch count of searches against an index resets to zero after cluster restart

We use Elasticsearch - one cluster is 7.16 and another is 8.4. Behavior is the same in both.
We need to be able to get a count of search queries run against an index since the index's creation.
We retrieve the number of searches that have been run against a given index by using the _stats endpoint, like so:
GET /_stats?filter_path=indices.my_index.primaries.search.query_total
The problem is that this stat resets to zero after a cluster reboot. Does this data persist anywhere for a given index such that I can get the total since inception of the index? If not, is there an action I can take to somehow record that stat before a reboot so I can always access the full total number?
EDIT - this is the only item I was able to find on this subject, and the answer in this discussion does not look promising: https://discuss.elastic.co/t/why-close-reopen-index-will-reset-index-stats-to-zero/170830
As far as I know, there is no out-of-the-box solution for this use case, but it's not that hard to build yourself either. You can simply call the same _stats API periodically and store the result in another Elasticsearch index or a database so that it doesn't get reset. IMHO it's not that much work.
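A minimal sketch of that idea (the snapshot index name and document fields here are hypothetical): poll the stat on a schedule and index each reading with a timestamp, so the history survives restarts.

GET /_stats?filter_path=indices.my_index.primaries.search.query_total

POST /search-stats-history/_doc
{
  "index": "my_index",
  "query_total": 123456,
  "recorded_at": "2023-01-15T00:00:00Z"
}

Summing the deltas between consecutive readings (treating a drop back to zero as a restart) then gives you the total since the index's creation.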

Query ElasticSearch after the index operation

I have service A, which executes some text processing. After it, service B has to execute some set of Elasticsearch queries on the document. The connectivity between the services is provided by Kafka. The solution is tightly coupled to ES free-text search capabilities, so I can't query another way.
Possible solution:
To store the document in ES and query it. The problem is that ES is eventually consistent and I don't know whether the document has already been indexed or not.
Is there some API to ensure that the document is already indexed?
Another option is to publish a message from service A with delay X+5 seconds, where X is the refresh interval of the index where the document should be stored. That seems like an unreliable solution to me. What do you think?
Another direction I thought about is some way to run the ES queries against the document while it is in memory. For example, if I had some magic way to convert the ES query to the Lucene DSL, then I wouldn't need to deal with the eventually consistent behavior of Elasticsearch and could query Lucene directly.
Maybe there are some other solutions?
Take a look at the ?refresh flag, so that an indexing request only returns once a refresh has happened. Otherwise you can use the GET API to check whether the document exists or not.
However, there are no magic options here; Elasticsearch is eventually consistent and you need to factor that in.
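A minimal sketch of both approaches (the index name and document body are hypothetical). With refresh=wait_for, the index call blocks until the document is searchable, so service B can safely query it afterwards:

PUT /documents/_doc/1?refresh=wait_for
{
  "text": "processed content from service A"
}

Alternatively, a GET tells you whether the document has been written, though it does not guarantee the document is visible to search yet:

GET /documents/_doc/1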

How does ElasticSearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that keeping it all under 1 index would be rough to query. The row data is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it is explaining what to do, but not necessarily all the time explaining why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and if their results are relevant enough. If needed, change the index mappings or the queries you're using to achieve faster results, and indeed add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings (see the sketch after this list). It disables analysis on the field, preventing full-text searches inside the field and allowing only exact string matches. Useful for fields like a "tags" field or a "status" field with values like ["draft", "review", "published"].
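As a minimal sketch (the field names come from the examples above; the index name is hypothetical), such a mapping might look like:

PUT /my-index
{
  "mappings": {
    "properties": {
      "status": { "type": "keyword" },
      "tags":   { "type": "keyword" }
    }
  }
}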
Good luck!

Using a job to delete data from Elasticsearch asynchronously

I would like to delete data from Elasticsearch using the API (curl).
I would like to start the deletion process and later query about the progress of deletion process.
Is it possible to use a job to do it?
I tried looking at the relevant documentation but there are very few examples.
Would appreciate any relevant information or links.
You have two solutions:
Use the delete-by-query API with a range query, which you can then monitor using the Task API (see the sketch after this list)
Use daily indices (e.g. my-logs-2018-09-10, my-logs-2018-09-11, etc) so deleting data in the past is simply a matter of deleting the indices for the days you want to ditch. No need to monitor as this happens instantaneously
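A minimal sketch of the first option (the index name and date field are hypothetical): submit the delete-by-query as a background task, then poll it.

POST /my-logs/_delete_by_query?wait_for_completion=false
{
  "query": {
    "range": { "@timestamp": { "lt": "now-30d" } }
  }
}

This returns a task id such as {"task": "oTUltX4IQMOUUVeiohTt8A:12345"}, whose progress you can then check with:

GET /_tasks/oTUltX4IQMOUUVeiohTt8A:12345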

Couchbase XDCR Elasticsearch speed and deletions

We are thinking about implementing some sort of message cache which would hold onto the messages we send to our search index, so we could persist them while the index was down for an extended period of time (for example a complete re-index) and then 're-apply' the messages. These messages are creations or updates of the documents we index. If space were cheap enough, with something as scalable as Couchbase we may even be able to hold all messages, but I haven't done any estimation of message size and quantity yet. Anyway, I suggested Couchbase + XDCR + Elasticsearch for this task as most of the work would be done automatically; however, there are 4 questions I have remaining:
1. If we were implementing this as a cache, I would not want Elasticsearch to remove any documents that were not in Couchbase. Is this possible to do (perhaps it is even the default behaviour)?
2. Is it possible to apply some sort of versioning so that a document in the index is not overwritten by an older version coming from Couchbase?
3. If I were to add a new field to the index, I might need to re-index from the actual document datasource and then re-apply all the messages stored in Couchbase. I may have 100 million documents in Elasticsearch and say 500,000 documents in Couchbase that I want to re-apply to Elasticsearch. What would the speed be like?
4. Would I be able to apply any sort of logic in between Couchbase and Elasticsearch?
Update:
So we store documents in an RDBMS as we need instant access to inserted docs plus some other stuff. We send limited versions of the document to a search engine via messages. If we want to add a field to the index we need to re-index the system from the RDBMS somehow. If we have this Couchbase message cache we could add the field to messages first, then switch off the indexing of old messages and re-index from the RDBMS. We could then switch back on the indexing of the messages and the entire 'queue' of messages would be indexed without having lost anything.
This system (if it worked) would remove the need for an MQ server, a message listener and make sure no documents were missing from the index.
The versioning would be necessary as we don't want to apply an 'update' to the index which actually contains a more recent document (not sure if this would ever happen now I think about it).
I appreciate it's probably not too great a job to implement points 1 and 4 by changing the Elasticsearch plugin code but I would like to confirm that the idea is reasonable first!
The Couchbase-Elasticsearch integration today should be seen as an indexing engine for Couchbase. This means the index is "managed/controlled" by the data in Couchbase.
XDCR is used to send "all the events" to Elasticsearch. This means the index is updated/deleted every time a document (stored in Couchbase) is created, modified or deleted.
So "all the documents" stored in a Couchbase bucket are indexed into Elasticsearch.
Let's answer your questions one by one, based on the current implementation of the Couchbase-Elasticsearch integration.
1. When a document is removed from Couchbase, the Elasticsearch index is updated (the entry is removed).
2. I'm not sure I understand the question. How could an "older" version come from Couchbase? Anyway, once again: every time a document stored in Couchbase is modified, the index in Elasticsearch is updated.
3. I'm not sure where you want to add the new field. If it is in a document stored in Couchbase, the index will be updated when the document is sent to Elasticsearch. But based on what I said before: all documents "stored" in Couchbase will be present in the Elasticsearch index.
4. Not with the plugin as it is today, but as you know it is an open source project, so you can either add some logic to it or even contribute your ideas to the project ( https://github.com/couchbaselabs/elasticsearch-transport-couchbase )
So let me ask you a few more questions:
- How do you insert documents in your application? (And where: Couchbase? Elasticsearch?)
- What are the types of documents?
- What do you want to cache in Couchbase?
