Getting an index's item count with ElasticSearch - elasticsearch

I am writing some code where we are inserting 200,000 items into an ElasticSearch index.
Whilst this works fine, when we get a count of items in the index to ascertain everything went in, we are not getting the same number. However, if we wait a second or two, the count is correct.
Therefore, is there a programmatic way we can get a real count from ElasticSearch without having to sleep or similar?

Newly indexed records become visible in search results only after a refresh operation. Refresh runs automatically at the interval specified by the index.refresh_interval setting, which is 1s by default. When writing Elasticsearch tests, it's customary to call refresh after indexing to make sure that all indexed records are available in searches. However, excessive refresh calls (after each record, for example) in production code might hamper Elasticsearch indexing performance.
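As a minimal sketch of "refresh once, then count", the Python below calls the _refresh and _count REST endpoints in that order. The base URL, the helper names http_send and refreshed_count, and the unsecured local cluster are assumptions made for illustration, not part of the original answer.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local cluster without auth


def http_send(method, path):
    """Send a body-less request to Elasticsearch and return the parsed JSON."""
    request = urllib.request.Request(BASE_URL + path, method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def refreshed_count(index, send=http_send):
    """Refresh the index once (after the bulk load), then read its count."""
    send("POST", f"/{index}/_refresh")  # make all indexed docs searchable
    return send("GET", f"/{index}/_count")["count"]
```

Calling refreshed_count("items") once after all 200,000 items are inserted keeps the refresh cost to a single call instead of one per record.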

Related

Cannot delete a document immediately after it is inserted?

I've created some tests for my ElasticSearch functionality and I've noticed some strange behavior. If I have a test that:
1) Inserts a document and confirms there are no errors
2) Retrieves that same document, confirms there are no errors and confirms it has the expected values
3) Deletes the document, confirms there are no errors and confirms 1 document was deleted
Then the 3rd test will fail because 0 documents were deleted. If I take one of the following steps:
Debug the test and put a breakpoint after insert but before delete
Add time.Sleep(time.Second) immediately before the delete step
then 1 document is deleted and the 3rd test will pass. In the cases when the 3rd test has failed, I've gone into my ES instance and confirmed that the document still exists.
This leads me to believe that after inserting a document there is some span of time where something has to happen before I can delete the document.
My question is: what needs to happen after the insert so that I can delete the document, and is there a better way for me to handle this in my tests than sleeping for 1 second?
I am coding in Golang and I am using the Olivere ES Client
Elasticsearch operations can be inconsistent.
You can check the refresh or wait_for_active_shards options to see whether they fit your test.
NB: it’s always difficult to write tests against an inconsistent system.
I would not use the term inconsistency. Storing and retrieving a document are real-time operations. Search happens in near real-time.
While you can always search for documents, they will only make it into your result set once the data structures for search exist (typically the inverted indices). Creating and maintaining these data structures for every single document that gets indexed would be costly and inefficient. That's why they are created, at the latest, when the refresh interval has expired (the default refresh interval is 1 second).
Also, when deleting a document, the document does not get removed from disk immediately. It first gets marked for deletion, ensuring that it will no longer show up in any results. Only after some Elasticsearch internal housekeeping (segment merges) do the documents marked for deletion eventually get wiped.
That should give you an idea why, for search, we talk about near real-time behaviour, or what you describe as a "gap".
Especially for unit/integration tests, you want to make sure that a document can be found after it has been indexed. You can easily achieve this by turning your index/write request into a blocking one by adding the parameter refresh=wait_for. With this, the indexing request only returns AFTER the data structures needed for search have been created, which ensures that the document is available to whatever action your next request executes.
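To make the refresh=wait_for suggestion concrete, here is a hedged Python sketch that builds (but does not send) such an index request against the REST API. The base URL, the /_doc/ endpoint (Elasticsearch 7 or later is assumed) and the helper name index_request are illustrative assumptions; the question itself uses the Olivere Go client, which is not shown here.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local cluster without auth


def index_request(index, doc_id, document, wait_for_refresh=True):
    """Build PUT /<index>/_doc/<id>, optionally blocking until searchable."""
    url = f"{BASE_URL}/{index}/_doc/{urllib.parse.quote(doc_id, safe='')}"
    if wait_for_refresh:
        # refresh=wait_for: respond only after a refresh has made the doc visible
        url += "?" + urllib.parse.urlencode({"refresh": "wait_for"})
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

In a test, passing the result to urllib.request.urlopen before issuing the delete should remove the need for a sleep, at the cost of each write waiting for the next refresh.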

Check if document is part of Elasticsearch query?

Curious if there is some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I’ll have a group of related document IDs and only want to return them if they are part of a larger query. Hoping to do this on the database side. Theoretically it seemed possible, since ES has to cache stuff related to large scrolls.
It's an interesting use case, but you need to understand that Elasticsearch (ES) doesn't return the IDs of all matching documents in the search result. By default it returns only 10 documents in the response, which can be changed with the size parameter.
If you increase the size param and your query matches millions of docs, ES query performance will be very bad, and firing such queries frequently might even bring the entire cluster down (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches things, but if you cache a huge amount of data that gets invalidated very frequently, you will not get the expected performance benefit, so benchmark it first.
You are already on the right path in using the scroll API to iterate over millions of search results; the points below may improve things further.
First get the count of search results. It is included in the default search response (hits.total, whose relation is eq or gte) and gives you an idea of how many results you have, based on which you can set the size param for subsequent calls to see whether your ID is present.
Check whether you make effective use of the filter context in your query, which ES caches by default.
Benchmark some of your heavy scroll API calls with your data.
Refer to this thread to fine-tune your cluster and index configuration to optimize ES response times further.
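One way to lean on the filter context mentioned above, offered as an assumption rather than as the answerer's method, is to skip scrolling and combine the large query with an ids filter so that only the candidate IDs can come back. This is a different technique from the scroll-based approach in the answer. The helper names membership_query and matching_ids are hypothetical; the bool, filter and ids query clauses are standard query DSL.

```python
def membership_query(original_query, candidate_ids):
    """Build a search body that returns only candidates matching the query."""
    ids = list(candidate_ids)
    return {
        "size": len(ids),   # at most one hit per candidate ID
        "_source": False,   # only the IDs are needed, not the documents
        "query": {
            "bool": {
                # filter context: no scoring, and results are cacheable
                "filter": [original_query, {"ids": {"values": ids}}],
            }
        },
    }


def matching_ids(search_response):
    """Extract the set of document IDs from a search response."""
    return {hit["_id"] for hit in search_response["hits"]["hits"]}
```

The body would be sent to the index's _search endpoint; whether this beats a scroll on a given data set is something to benchmark, as the answer suggests.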

elastic query returns same results after insert

I'm using elasticsearch.js to move a document from one index to another.
1a) Query index_new for all docs and display on the page.
1b) Use query of index_old to obtain a document by id.
2) Use an insert to index_new, inserting result from index_old.
3) Delete document from index_old (by id).
4) Requery index_new to see all docs (including the new one). However, at this point, it returns the same list of results as returned in 1a. Not including the new document.
Is this because of caching? When I refresh the whole page, and 1a is triggered, the new document is there.. But not without a refresh.
Thanks,
Daniel
This is due to the segment merging and refreshing that happens inside Elasticsearch indices, per shard and replica.
Whenever you write to the index, you never write to the original index file but rather to newer, smaller files called segments, which then get merged into bigger files by background batch jobs.
The next question you might have is:
How often does this happen, and how can one control it?
There is an index-level setting called refresh_interval. It can take different values depending on the strategy you want to use.
refresh_interval -
-1 : Stops Elasticsearch from refreshing automatically; you control refreshes yourself with the _refresh API.
X : X is a time interval, typically given in seconds (for example 5s). Elasticsearch then refreshes the index every X seconds.
If you have replication enabled on your indices, you might also see result values toggling. This happens because an index has multiple shards and a shard has multiple replicas, and different replicas refresh on different schedules. A query can be routed to different shard replicas over time, which show different states within that window.
Hence, if you set a periodic refresh interval, expect a consistent state within the next X to 2X seconds at most.
Segment Merge Background details
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/indices-update-settings.html
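As a rough illustration of changing refresh_interval, the Python sketch below builds (without sending) a request to the index settings endpoint. The base URL and the helper name refresh_interval_request are assumptions; index_new is taken from the question.

```python
import json
import urllib.request


def refresh_interval_request(index, interval, base_url="http://localhost:9200"):
    """Build PUT /<index>/_settings that sets index.refresh_interval.

    interval is a string such as "-1" (no automatic refresh) or "5s".
    """
    body = {"index": {"refresh_interval": interval}}
    return urllib.request.Request(
        f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

For example, refresh_interval_request("index_new", "-1") would turn automatic refresh off, and a later call with "1s" would restore the default interval.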

ElasticSearch indexed document not returned immediately

I'm using ES as backend. So, my architecture is based on a client-server.
Very often, maybe too often, I notice that when I perform two operations from the client, an index and a search almost one after the other, the newly indexed document is not returned by ES.
When I refresh the result, the last indexed document is obtained from the server.
Should I keep something in mind in order to avoid this behavior?
Is this behavior usual?
Yes, it is usual behaviour. By default, ElasticSearch refreshes each shard every 1 second.
ElasticSearch can become really slow if you refresh after every index operation.

ElasticSearch - Configuration to Analyse a document on Indexing

In a single request, I want to retrieve documents from a SOR, store them in ElasticSearch, and then search those documents using the ES search API.
There seems to be some lag from the time the document is indexed and the time it is analyzed and ready to be searched.
Is there any way to configure ES to not return from the request to index a document until the analyzer has analyzed it and so that it can immediately be searched?
Elasticsearch is "near real-time" by nature, i.e. all indices are refreshed every second (by default). While that is enough in the majority of cases, it may not be in yours.
If you need your documents to be available immediately, you need to refresh your indices explicitly by calling
POST /_refresh
or if you only want to refresh one index
POST /my_index/_refresh
The refresh needs to happen after the indexing call has returned and before the search call is sent off.
Note that doing this on every document indexing will hurt the performance of your system. It might be better to make your application aware of the near real-time nature of ES and handle this on the client-side.
The refresh API, as suggested in the accepted answer, is heavy in nature and you may not want to call this API after every index operation, if you are going to do a significant number of indexing operations.
What happens under the hood is that the documents held in Elasticsearch's in-memory indexing buffer are written to a new segment, which is then opened for search. This operation is best left to the discretion of Elasticsearch; however, there are some configuration parameters you can play around with.
There is an alternative approach you can take, it may or may not suit your specific use case, but here it goes.
Query the index/_stats/refresh API and record the refresh total it reports, index your document, and then keep repeating the same stats query. If the total has increased since you indexed the document, a refresh has happened and the document is ready to be searched.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
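The polling idea above can be sketched in Python as follows. It assumes the stats response exposes a refresh counter at _all.primaries.refresh.total; fetch_stats is a hypothetical callable that returns the parsed JSON of index/_stats/refresh, and the timeout and poll interval are arbitrary choices.

```python
import time


def refresh_total(stats):
    """Read the refresh counter from a parsed _stats/refresh response."""
    return stats["_all"]["primaries"]["refresh"]["total"]


def wait_for_refresh(fetch_stats, baseline, timeout=5.0, poll_interval=0.1,
                     sleep=time.sleep):
    """Poll until the refresh counter exceeds baseline or the timeout passes.

    Returns True if a refresh was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if refresh_total(fetch_stats()) > baseline:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)
```

Reading the baseline with refresh_total(fetch_stats()) before indexing follows the order described above, but a refresh that lands between that read and the indexing call would also raise the counter, so treat a True result as a strong hint rather than a guarantee.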
