ElasticSearch delete_by_query: Trying to create too many scroll contexts


I am trying to fix this error while running the delete_by_query API on AWS ElasticSearch:
Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.
After going through a few posts I have a basic idea of what a scroll context is; however, my understanding of "open scroll contexts" remains murky.
I have a few questions regarding my understanding:
Does it mean that the API will open up a scroll context for the specified time (scroll = some period) and after processing (within the stipulated scroll time), open up a new context?
If so, do the previously processed contexts remain open till the API terminates?
I have 4 Java EC2 instances running, each of which will execute the delete_by_query API. Can this also cause too many scroll contexts to remain open, or is it unrelated?
Please do shed some light if there's anything lacking.
Coming to fixing the error:
The straightforward solution would be to increase the search.max_open_scroll_context parameter; however, it has negative side effects, as mentioned in this document's tip/note section.
Are there any other solutions?
Can increasing the batch size per scroll help?
Edit: The EC2 instances (2 in east-1 and 2 in west-2) are running a Spring application at a frequency of 1s (this frequency is deliberate and can't be changed due to some restrictions), listening to SQS (in the corresponding regions) for messages, and delete_by_query will act upon these messages (deleting based on some parameter from the message received).
Note: The SQS queue has a considerable amount of data coming in.
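For reference, the "batch size per scroll" asked about above corresponds to the scroll_size query parameter of delete_by_query, and the keep-alive of the scroll context corresponds to the scroll parameter. The following is only a minimal sketch of how such a request path could be assembled; the helper name, the option names and the index name are hypothetical, and whether tuning these values avoids the limit in this setup is not established here.

```typescript
// Hypothetical helper (not part of any Elasticsearch client):
// builds the path and query string for a delete_by_query call.
interface DeleteByQueryOptions {
  index: string;        // target index name (assumption for illustration)
  scrollSize?: number;  // maps to scroll_size: documents per scroll batch
  scroll?: string;      // maps to scroll: keep-alive of the scroll context, e.g. "1m"
}

function buildDeleteByQueryPath(opts: DeleteByQueryOptions): string {
  const params: string[] = [];
  if (opts.scrollSize !== undefined) {
    params.push(`scroll_size=${opts.scrollSize}`);
  }
  if (opts.scroll !== undefined) {
    params.push(`scroll=${encodeURIComponent(opts.scroll)}`);
  }
  const query = params.length > 0 ? `?${params.join("&")}` : "";
  return `/${encodeURIComponent(opts.index)}/_delete_by_query${query}`;
}

// Example: larger batches and a short keep-alive.
const path = buildDeleteByQueryPath({ index: "my-index", scrollSize: 5000, scroll: "1m" });
console.log(path);
```

The query body (which documents to delete) would still be sent as the POST payload; only the URL part is shown.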

Related

Dynatrace PurePath: what does each yellow bar mean?

I am using Dynatrace to help orient my efforts as I'm optimizing an endpoint of our service.
Looking at the Controller's PurePath, I am currently wondering: what does each individual yellow bar mean exactly?
It seems to be some sort of aggregate, although I don't think we have any kind of batching activated. Yet we see the same statement aggregated multiple times into one bar, then right after that the same statement aggregated again into a single bar, in the same timeframe (for example: we see an 89x, then a 90x following each other).
As per company policy, I had to hide a bunch of things with black rectangles: sorry for that!

We have been using Dynatrace for a long time now. These yellow boxes show the time taken to execute the respective query. The query can be seen at the start of that row.
Looking at your diagram, it seems you are executing a few queries in parallel; e.g. the last 4 queries started at the same time and, depending on complexity, each took a different time to complete.
The multiplying factor shown as 90x or 89x is the number of times that query is executed. This is what the documentation says.
I truly do not agree with that. Why would a developer or DB server run the same query that many times? Maybe the agent installed on that DB server is getting confused because the same query is being executed across different requests. This is just my guess.
Regards,
Vikrant Korde

ElasticSearch document refresh=true does not appear to work

In order to speed up searches on our website, I have created a small elastic search instance which keeps a copy of all of the "searchable" fields from our database. It holds only a couple million documents with an average size of about 1KB per document. Currently (in development) we have just 2 nodes, but will probably want more in production.
Our application is a "primarily read" application: maybe 1000 documents/day get updated, but they get read and searched tens of thousands of times/day.
Each document represents a case in a ticketing system, and the case may change status during the day as users research and close cases. If a researcher closes a case and then immediately refreshes his queue of open work, we expect the case to disappear from their queue, which is driven by a query to our Elastic Search instance, filtering by status. The status is a field in the case index.
The complaint we're getting is that when a researcher closes a case, upon immediate refresh of his queue, the case still comes back when filtering on "in progress" cases. If he refreshes the view a second or two later, it's gone.
In an effort to work around this, I added refresh=true when updating the document, e.g.
curl -XPUT 'https://my-dev-es-instance.com/cases/_doc/11?refresh=true' -d '{"status":"closed", ... }'
But still the problem persists.
Here's the response I got from the above request:
{"_index":"cases","_type":"_doc","_id":"11","_version":2,"result":"updated","forced_refresh":true,"_shards":{"total":2,"successful":1,"failed":0},"_seq_no":70757,"_primary_term":1}
The response seems to verify that the forced_refresh request was received, although it does say out of total 2 shards, 1 was successful and 0 failed. Not sure about the other one, but since I have only 2 nodes, does this mean it updated the secondary?
According to the doc:
To refresh the shard (not the whole index) immediately after the operation occurs, so that the document appears in search results immediately, the refresh parameter can be set to true. Setting this option to true should ONLY be done after careful thought and verification that it does not lead to poor performance, both from an indexing and a search standpoint. Note, getting a document using the get API is completely realtime and doesn’t require a refresh.
Are my expectations reasonable? Is there a better way to do this?

After more testing, I have concluded that my issue was due to application logic error, and not a problem with ElasticSearch. The refresh flag is behaving as expected. Apologies for the misinformation.

How to deal with Elasticsearch index delay

Here's my scenario:
I have a page that contains a list of users. I create a new user through my web interface and save it to the server. The server indexes the document in elasticsearch and returns successfully. I am then redirected to the list page, which doesn't contain the new user because it can take up to 1 second for documents to become available for search in elasticsearch.
Near real-time search in elasticsearch.
The elasticsearch guide says you can manually refresh the index, but says not to do it in production.
...don’t do a manual refresh every time you index a document in production; it will hurt your performance. Instead, your application needs to be aware of the near real-time nature of Elasticsearch and make allowances for it.
I'm wondering how other people get around this? I wish there was an event or something I could listen for that would tell me when the document was available for search but there doesn't appear to be anything like that. Simply waiting for 1-second is plausible but it seems like a bad idea because it presumably could take much less time than that.
Thanks!

Even though you can force ES to refresh itself, you've correctly noticed that it might hurt performance. One solution around this and what people often do (myself included) is to give an illusion of real-time. In the end, it's merely a UX challenge and not really a technical limitation.
When redirecting to the list of users, you could artificially include the new record that you've just created in the list, as if it had been returned by ES itself. Nothing prevents you from doing that. By the time you decide to refresh the page, the new user record will be correctly returned by ES, and no one cares where that record comes from. All the user cares about at that moment is seeing the new record that he's just created, simply because we're used to thinking sequentially.
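The idea of artificially including the just-created record could be sketched as follows. The User shape and the function name are assumptions made for illustration; the only logic shown is "prepend the new record unless the search results already contain it".

```typescript
// Hypothetical record shape for illustration.
interface User {
  id: string;
  name: string;
}

// Returns the list to display: the search results, plus the record that was
// just created if the search index has not caught up with it yet.
function withNewRecord(searchResults: User[], created: User): User[] {
  const alreadyIndexed = searchResults.some((u) => u.id === created.id);
  return alreadyIndexed ? searchResults : [created, ...searchResults];
}

// Example: the index still returns only the old user.
const shown = withNewRecord(
  [{ id: "1", name: "alice" }],
  { id: "2", name: "bob" }
);
console.log(shown.map((u) => u.name));
```

Checking by id keeps the record from appearing twice once ES starts returning it.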
Another way to achieve this is by reloading an empty user list skeleton and then via Ajax or some other asynchronous way, retrieve the list of users and display it.
Yet another way is to provide a visual hint/clue on the UI that something is happening in the background and that an update is to be expected very shortly.
In the end, it all boils down to not surprising users and giving them enough clues as to what has happened, what is happening and what they should still expect to happen.
UPDATE:
Just for completeness' sake, this answer predates ES5, which introduced a way to make sure that the indexing call does not return until the document is visible when searching the index (or an error code is returned). By using ?refresh=wait_for when indexing your data you can be certain that when ES responds, the new data will be searchable.
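As a rough illustration of where the refresh parameter goes, here is a small sketch that builds the path of an index request. The helper name is hypothetical, and the /index/_doc/id layout is an assumption that matches newer Elasticsearch versions (in ES5 the path contained a custom type name instead of _doc).

```typescript
// The three values the refresh parameter accepts on index requests.
type RefreshPolicy = "true" | "false" | "wait_for";

// Hypothetical helper: builds the request path for indexing one document.
function buildIndexPath(index: string, id: string, refresh?: RefreshPolicy): string {
  const base = `/${encodeURIComponent(index)}/_doc/${encodeURIComponent(id)}`;
  return refresh === undefined ? base : `${base}?refresh=${refresh}`;
}

// Example: the request returns only after a refresh has made the document searchable.
console.log(buildIndexPath("users", "42", "wait_for"));
```

With wait_for the request waits for a refresh rather than forcing one, which is why it is usually preferred over refresh=true for this use case.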

Elasticsearch 5 has an option to block an indexing request until the next refresh has occurred:
?refresh=wait_for
See: https://www.elastic.co/guide/en/elasticsearch/reference/5.0/docs-refresh.html#docs-refresh

Here is a fragment of code which is what I did in my Angular application to cope with this. In the component:
async doNewEntrySave() {
  try {
    // Index the new document; the response itself is not needed here.
    await this.client.createRequest(this.doc).toPromise();
    // Show the "waiting" banner while the search index catches up.
    this.modeRefreshDelay = true;
    setTimeout(() => {
      this.modeRefreshDelay = false;
      this.refreshPage();
    }, 2500);
  } catch (err) {
    this.error.postError(err);
  }
}
In the template:
<div *ngIf="modeRefreshDelay">
  <h2>Waiting for update ...</h2>
</div>
I understand this is a quick-and-dirty solution but it illustrates how the user experience should work. Obviously it breaks if the real-world latency turns out to be more than 2.5 seconds. A fancier version would loop until the new record showed up in the page (with a limit, of course).
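A sketch of what such a bounded loop might look like is given below. The function name and parameters are hypothetical, and the check callback stands in for whatever search request the application uses to see whether the new record is visible yet.

```typescript
// Repeatedly runs `check` until it reports success or the attempt limit is hit.
// Resolves to true if the record showed up, false if the limit was reached.
async function pollUntil(
  check: () => Promise<boolean>,
  intervalMs: number,
  maxAttempts: number
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await check()) {
      return true;
    }
    // Wait before the next attempt, but not after the last one.
    if (attempt < maxAttempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  return false;
}
```

In the component above, the fixed setTimeout could be replaced by something like `await pollUntil(() => this.recordIsVisible(), 500, 10)`, where recordIsVisible is an assumed method that searches for the new document.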
Unless you completely redesign ElasticSearch you will always have some latency between the successful index operation and the time when that document shows up in search results.

Data should be available immediately after indexing is complete. A couple of general questions:
Have you checked CPU and RAM to determine whether you are taxing your ES cluster? If so, you may need to beef up your hardware config to account for it. ES loves RAM!
Are you using NAS (network-attached-storage) or virtualized storage like EBS? Elastic recommends not doing so because of the latency. If you can use DAS (direct-attached) and SSD, you'll be in much, much better shape.
To give you an AWS example, moving from m4.xlarge instances to r3.xlarge made HUGE performance improvements for us.

Segmenting on users who have performed a behaviour not behaving as expected

I want to look at the effect of having performed a specific action sequence at any (tracked) time in the past on user retention and engagement.
The action sequence is that of performing an optional New User Flow.
This is signalled to Google Analytics via sending it appropriate events. That works fine. The events show up in reports as expected.
My problem is what happens to results when I use these events to create segments. I have tried two different ways of creating a segment based on this in Advanced Segments: via Conditions (defining the segment via the end event, filtered over users not sessions), and via Sequences (defining start and end events, again filtered over users not sessions).
What I get when I look at various retention/loyalty reports, using either of these segments, is very clearly a result which does this segmentation within a session, not across a user's sessions. So for NUF completers, I am seeing all my loyalty/recency on Session 1, in which people are most likely to do the NUF, if they ever do it at all. This is not what I want. (Mind you, it is something that could be really useful in another context, with another event! But not for the new user flow.)
What are my options for getting what I want? I see two possible ways forward:
Using custom dimensions, assigning a custom dimension value in the code when the New User Flow is completed. However I do not know if this will solve the cross-session persistence problem.
Injecting a UserID, which we do not currently do, and (somehow!) using the reports available when you inject a UserID to do this.
Is either of these paths plausible? Is there a better way forward? Is it silly to even try to do this in Google Analytics? I'm way more familiar with App Tracking solutions (e.g. Flurry, Mixpanel, DeltaDNA), which do this as a matter of course, than with Google Analytics, and the fact that this is at the very least awkward in Google Analytics is coming as a bit of a surprise.
thanks,
Heather

Immediately Display New Metrics

I am using graphite and coda hale metrics to try and track the number of times particular APIs are called and also the top 10 callers. I have assigned a metric to each user who calls the API and use graphite to bring back the top 10.
The problem is that a new user, i.e. a new metric, will only be displayed in Graphite when the tool is refreshed. Has anyone come across a workaround for this? Is there some way Graphite can automatically detect new meters?
Just to be clear: I can see the top ten API callers for the last 30 minutes, unless it is a brand new user that has never logged in before.

It seems that graphite-web uses an on-disk index generated by a glorified find command. Another script is available so you can run it from cron to update the metric index file.
Whenever you update the index file, the graphite-web process will detect it and reload it.
Since reloading the index might be heavy for a large number (1M) of metrics, I would advise modifying the update script a bit to update the file conditionally (only if it is different, for instance).
EDIT: after test, graphite does not seem to call the reloading code
