Elasticsearch indexes but does not store documents - elasticsearch

I'm having troubles storing documents within a 3-node Elasticsearch cluster that previously was able to store documents. I use the Java API to send bulks of documents to Elasticsearch, which are accepted (no failure in BulkResponse object) AND Elasticsearch has heavy index activities. However, the number of documents are not increased and I assume that none of them are store.
I've looked into Elasticsearch logs (of all three nodes) but I see no errors or warnings.
Note: I've had to restart two nodes previously but search/query is working perfectly. (the count in the image starts at ~17:00 as I've installed the Marvel plugin at this time)
What can I do to solve or debug the problem?

Sorry for this point blank code blindness by me! I forgot to skip the cursor when reading from MongoDB and therefore re-inserted the same 1000 documents into Elasticsearch for thousands of times!
Learning: If this problem occurs check if you select the correct documents in your database and that these documents are not already stored in ES.
Sidenote to Marvel: It would be great is this could be indicated in any way - e.g. by having a chart with "updated documents" (I rechecked and could not find one)

Related

Elasticsearch count of searches against an index resets to zero after cluster restart

We use Elasticsearch - one cluster is 7.16 and another is 8.4. Behavior is the same in both.
We need to be able to get a count of search queries run against an index since the index's creation.
We retrieve the amount of searches that have been run against a given index by using the _stats endpoint as such:
GET /_stats?filter_path=indices.my_index.primaries.search.query_total
The problem is that this stat resets to zero after a cluster reboot. Does this data persist anywhere for a given index such that I can get the total since inception of the index? If not, is there an action I can take to somehow record that stat before a reboot so I can always access the full total number?
EDIT - this is the only item I was able to find on this subject, and the answer in this discussion does not look promising: https://discuss.elastic.co/t/why-close-reopen-index-will-reset-index-stats-to-zero/170830
As far as I know, there is no Out of the box solution to achieve your use-case, but its not that hard to build it yourself either, You can simply call the same _stats API periodically and store it in some other index of Elasticsearch or DB so that its not reset. IMHO Its not that big work.

How to check the index is used for searching or indexing

I've a lot of elasticsearch clusters which hold the historical indices(more than 10 years old), some of these indices are created newly with latest settings and fields, but old ones are not deleted.
Now I need to delete the old indices which are not receiving any search and index requests.
I've already gone to elasticsearch curator but it would not work with older version of ES.
Is there is any API which can just gives the last time of index and search request in ES, that would serve my purpose very well.
EDIT:- I've also check https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html but this also doesn't give the last time when indexing or search request came. all it gave is the number of these requests from last restart.

Does updating a doc increase the "delete" count of the index?

I am facing a strange issue in the number of docs getting deleted in an elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see that the total number of docs are increasing, I have also been seeing some non-zero values in the docs deleted column. I am unable to understand from where did this number come from.
I tried reading whether the update doc first deletes the doc and then re-indexes it so in this way the delete count gets increased. However, I could not get any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: It is a single node elasticsearch.
I expect to know the reason behind deletion of docs.
You are correct that updates are the cause that you see a count for documents delete.
If we talk about lucene then there is nothing like update there. It can also be said that documents in lucene are immutable.
So how does elastic provides the feature of update?
It does so by making use of _source field. Therefore it is said that _source should be enabled to make use of elastic update feature. When using update api, elastic refers to the _source to get all the fields and their existing values and replace the value for only the fields sent in update request. It marks the existing document as deleted and index a new document with the updated _source.
What is the advantage of this if its not an actual update?
It removes the overhead from application to always compile the complete document even when a small subset of fields need to update. Rather than sending the full document, only the fields that need an update can be sent using update api. Rest is taken care by elastic.
It reduces some extra network round-trips, reduce payload size and also reduces the chances of version conflict.
You can read more how update works here.

Documents in elasticsearch getting deleted automatically?

I'm creating an index though logstash and pushing data to it from a MySQL database. But what I noticed in elasticsearch was once the whole data is uploaded, it starts deleting some of the docs. The total number of docs is 160729. Without the scheduler it works fine.
I inserted the cron scheduler in order to check whether new rows have been added to the table. Can that be the issue?
My logstash conf looks like this.
Where am I going wrong? Or is this behavior common?
Any help could be appreciated.
The docs.deleted number doesn't mean that your documents are being deleted, but simply that existing documents are being "updated" and the older version of the updated document is marked as deleted in the process.
Those documents marked as deleted will be eventually cleaned up as Lucene merges segments in the background.

Couchbase XDCR Elasticsearch speed and deletions

We are thinking about implementing some sort of message cache which would hold onto the messages we send to our search index so we could persist while the index was down for an extended period of time (for example a complete re-index) then 're-apply' the messages. These messages are creations or updates of the documents we index. If space were cheap enough, with something as scalable as Couchbase we may even be able to hold all messages but I haven't done any sort of estimations of message size and quantity yet. Anyway, I suggested Couchbase + XDCR + Elasticsearch for this task as most of the work would be done automatically however there are 4 questions I have remaining:
If we were implementing this as a cache, I would not want Elasticsearch to remove any documents that were not in Couchbase, is this possible to do (perhaps it is even the default behaviour)?
Is it possible to apply some sort of versioning so that a document in the index is not over-written by an older version coming from Couchbase?
If I were to add a new field to the index, I might need to re-index from the actual document datasource then re-apply all the messages stored in Couchbase. I may have 100 million documents in Elasticsearch and say 500,000 documents in Couchbase that I want to re-apply to Elasticsearch? What would the speed be like.
Would I be able to apply any sort of logic in-between Couchbase and Elasticsearch?
Update:
So we store documents in an RDBMS as we need instant access to inserted docs plus some other stuff. We send limited versions of the document to a search engine via messages. If we want to add a field to the index we need to re-index the system from the RDBMS somehow. If we have this Couchbase message cache we could add the field to messages first, then switch off the indexing of old messages and re-index from the RDBMS. We could then switch back on the indexing of the messages and the entire 'queue' of messages would be indexed without having lost anything.
This system (if it worked) would remove the need for an MQ server, a message listener and make sure no documents were missing from the index.
The versioning would be necessary as we don't want to apply an 'update' to the index which actually contains a more recent document (not sure if this would ever happen now I think about it).
I appreciate it's probably not too great a job to implement points 1 and 4 by changing the Elasticsearch plugin code but I would like to confirm that the idea is reasonable first!
The Couchbase-Elasticsearch integration today should be seen as an indexing engine for Couchbase. This means the index is "managed/controlled" by the data that are in Couchbase.
The XDCR is used to sent "all the events" to Elasticsearch. This means the index is update/delete every time a document (stored in Couchbase) is created, modified or deleted.
So "all the documents" stored into a Couchbase bucket are indexed into Elasticsearch.
Let's answer your questions one by one, based on the current implementation of the Couchbase-Elasticsearch.
When a document is removed from Couchbase, the Elasticsearch index is update (entry removed).
Not sure to understand the question. How an "older" version could come from Couchbase? Anyway once again everytime the document that is stored into Couchbase is modified, the index in Elasticsearch is updated.
Not sure to understand where you want to add a new field? If this is into a document that is stored into Couchbase, when the document will be sent to Elasticsearch the index will be updated. But based on what I have said before : all document "stored" into Couchbase will be present in Elasticsearch index.
Not with the plugin as it is today, but as you know it is an open source project so you can either add some logic to it or even contribute your ideas to the project ( https://github.com/couchbaselabs/elasticsearch-transport-couchbase )
So let me ask you more questions:
- how do you inser the document into you application? (and where Couchbase? Elasticsearch?)
- what are the types of documents?
- what do you want to cache into Couchbase?

Resources