I'm collecting some statistics in Elasticsearch for the number of requests per hour to an API. Rather than indexing a new document per request, I'm updating a statistic document per user. On each request, I'm doing something like this from .NET:
client
    .Update<StatisticsDocument>(statsId, u => u
        .RetryOnConflict(3)
        .Script(s => s.Source("ctx._source.requests+=1"))
        .Upsert(new StatisticsDocument { Id = statsId, Requests = 1, User = userId, Created = thisHour }));
So, requests is incremented. If a document for this hour and the user with userId doesn't exist, I'm creating a new one using the Upsert method.
Multiple threads and processes can call this code, which is why I've added RetryOnConflict(3) to retry on conflict.
I'm seeing a pretty large number of conflicts anyway, especially under high load. I understand that retrying 3 times doesn't always solve this issue since there are many threads trying to update the same document over and over again. But I'm still surprised by a pretty high number of conflicts.
Any ideas about what to do here? I'm thinking about keeping the metric in memory and then writing it every minute or something like that. But if Elasticsearch and/or Nest has some kind of built-in feature to help even further, that would be the first priority to implement.
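The in-memory idea mentioned above could look roughly like the following. This is a minimal sketch in Java rather than .NET, and the names RequestCounterBuffer, increment and drain are hypothetical. It assumes that a timer (not shown) calls drain() once per interval and sends one scripted update per document, with the delta passed as a script parameter (for example ctx._source.requests += params.count) instead of one update per request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Buffers per-document request counts in memory so they can be written in batches. */
class RequestCounterBuffer {
    // One counter per statistics document id. Entries are never removed in this
    // sketch, so a real version would also evict ids for hours that have passed.
    private final ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();

    /** Called on every API request; no Elasticsearch call happens here. */
    void increment(String statsId) {
        counts.computeIfAbsent(statsId, id -> new AtomicLong()).incrementAndGet();
    }

    /** Returns the deltas accumulated since the last drain and resets them to zero. */
    Map<String, Long> drain() {
        Map<String, Long> deltas = new HashMap<>();
        for (Map.Entry<String, AtomicLong> entry : counts.entrySet()) {
            // getAndSet is atomic, so increments racing with the drain are not lost;
            // they simply end up in the next batch.
            long delta = entry.getValue().getAndSet(0);
            if (delta > 0) {
                deltas.put(entry.getKey(), delta);
            }
        }
        return deltas;
    }
}
```

With this shape, each document receives at most one update per flush interval from each process, so the number of version conflicts should drop accordingly. The trade-off is that counts still held in memory are lost if the process dies before the next flush.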
I have an application that at some point has to perform REST requests towards another (non-reactive) system. It happens that a high number of requests are performed towards exactly the same remote resource (the resulting HTTP request is the same).
I was thinking to avoid flooding the other system by using a simple cache in my app.
I am in full control of the cache and I have proper moments when to invalidate it, so this is not an issue. Without this cache, I'm running into other issues, like connection timeout or read timeout, the other system having troubles with high load.
Map<String, Future<Element>> cache = new ConcurrentHashMap<>();

Future<Element> lookupElement(String id) {
    String key = createKey(id);
    // The lambda parameter must not reuse the name "key", which is already in scope.
    return cache.computeIfAbsent(key, k ->
        performRESTRequest(id).onSuccess(element -> {
            // some further processing
        }));
}
As I mentioned, lookupElement() is invoked from different worker threads with the same id.
The first thread will enter computeIfAbsent and perform the remote request, while the other threads will be blocked by ConcurrentHashMap.
However, when the first thread finishes, the waiting threads will receive the same Future object. Imagine 30 "clients" reacting to the same Future instance.
In my case this works fine and fast up to a particular load, but when the processing input of the app increases, resulting in even more invocations of lookupElement(), my app becomes slower and slower (although it reports 300% CPU usage, it logs slowly) until it starts to report OutOfMemoryError.
My questions are:
Do you see any Vertx specific issue with this approach?
Is there a more Vertx friendly caching approach I could use when there is a high concurrency on the same cache key?
Is it a good practice to cache the Future?
So, a bit unusual to respond to my own question, but I managed to solve the problem.
I was having two dilemmas:
Is ConcurrentHashMap and computeIfAbsent() appropriate for Vertx?
Is it safe to cache a Future object?
I am using this caching approach in two places in my app, one for protecting the REST endpoint, and one for some more complex database query.
What was happening is that for the database query there were up to 1300 "clients" waiting for a response, that is, 1300 listeners waiting for the onSuccess() of the same Future. When the Future completed, strange things were happening: some kind of thread strangulation.
I did a bit of refactoring to eliminate this concurrency on the same resource/key, but I kept both caches, and things went back to normal.
In conclusion I think my caching approach is safe as long as we have enough spreading or in other words, we don't have such a high concurrency on the same resource. Having 20-30 listeners on the same Future works just fine.
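As an illustration of the caching pattern discussed here, below is a minimal sketch that coalesces concurrent lookups for the same key into one in-flight request. It is written against java.util.concurrent.CompletableFuture as a stand-in for the Vert.x Future, so it is an assumption-laden sketch rather than Vert.x code, and the class name CoalescingCache and its methods are hypothetical. It differs from the code in the question in two ways: the loader runs outside the map's lock (putIfAbsent instead of computeIfAbsent), and failed lookups are evicted so that an error is not cached.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Shares one in-flight (or completed) lookup per key among all callers. */
class CoalescingCache<V> {
    private final ConcurrentHashMap<String, CompletableFuture<V>> cache = new ConcurrentHashMap<>();
    private final Function<String, CompletableFuture<V>> loader;

    CoalescingCache(Function<String, CompletableFuture<V>> loader) {
        this.loader = loader;
    }

    CompletableFuture<V> lookup(String key) {
        CompletableFuture<V> cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        CompletableFuture<V> created = new CompletableFuture<>();
        CompletableFuture<V> existing = cache.putIfAbsent(key, created);
        if (existing != null) {
            return existing; // another caller claimed the slot first
        }
        // This caller owns the slot and starts the single real request. The loader
        // runs here, outside any ConcurrentHashMap lock, so other keys are not blocked.
        try {
            loader.apply(key).whenComplete((value, error) -> {
                if (error != null) {
                    cache.remove(key, created); // do not keep failures in the cache
                    created.completeExceptionally(error);
                } else {
                    created.complete(value);
                }
            });
        } catch (RuntimeException e) {
            cache.remove(key, created);
            created.completeExceptionally(e);
        }
        return created;
    }

    /** Explicit invalidation, for the "proper moments" mentioned in the question. */
    void invalidate(String key) {
        cache.remove(key);
    }
}
```

This does not remove the cost of having many listeners on one future; it only ensures a single request per key and avoids holding a map lock while that request is started.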
We have finally started using the high-level REST client to ease the development of queries from a backend engineering perspective. For indexing, we are using client.update(request, RequestOptions.DEFAULT) so that new documents are created and existing ones modified.
The issue we are seeing is that indexing is delayed by almost 5 minutes. I see that the client uses async HTTP calls internally, but that should not take so long. I looked for timing options inside the library and didn't find anything. Am I missing something, or is the official documentation missing this?
Since refresh_interval is set to -1 in your index settings, the index is never refreshed automatically. It is only refreshed when you do it manually, which is why you don't see the data right after it has been updated.
You have three options here:
A. You can call the _update endpoint with the refresh=true (or refresh=wait_for) parameter to make sure that the index is refreshed just after your update.
B. You can simply set refresh_interval: 1s (or any other duration that makes sense for you) in your index settings, to make sure the index is automatically refreshed on a regular basis.
C. You can explicitly call index/_refresh on your index to refresh it whenever you think is appropriate.
Option B is the one that usually makes sense in most use cases.
There are several references to using refresh wait_for, but I had a hard time finding what exactly needed to be done in the REST high-level client.
For all of you that are searching this answer:
IndexRequest request = new IndexRequest(index, DOC_TYPE, id);
request.setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL);
Consider: you have a collection of user ids and want to load the details of each user represented by their id from an API. You want to bag up all of those users into some kind of collection and send it back to the calling code. And you want to use LINQ.
Something like this:
var userTasks = userIds.Select(userId => GetUserDetailsAsync(userId));
var users = await Task.WhenAll(userTasks); // users is User[]
This was fine for my app when I had relatively few users. But there came a point where it didn't scale. When it got to the point of thousands of users, this resulted in thousands of HTTP requests being fired concurrently and bad things started to happen. Not only did we realise we were launching a denial of service attack on the API we were consuming as we did this, we were also bringing our own application to the point of collapse through thread starvation.
Not a proud day.
Once we realised that the cause of our woes was a Task.WhenAll / Select combo, we were able to move away from that pattern. But my question is this:
What is going wrong here?
As I read around on the topic, this scenario seems well described by #6 on Mark Heath's list of Async antipatterns: "Excessive parallelization":
Now, this does "work", but what if there were 10,000 orders? We've flooded the thread pool with thousands of tasks, potentially preventing other useful work from completing. If ProcessOrderAsync makes downstream calls to another service like a database or a microservice, we'll potentially overload that with too high a volume of calls.
Is this actually the reason? I ask as my understanding of async / await becomes less clear the more I read about the topic. It's very clear from many pieces that "threads are not tasks". Which is cool, but my code appears to be exhausting the number of threads that ASP.NET Core can handle.
So is that what it is? Is my Task.WhenAll and Select combo exhausting the thread pool or similar? Or is there another explanation for this that I'm not aware of?
Update:
I turned this question into a blog post with a little more detail / waffle. You can find it here: https://blog.johnnyreilly.com/2020/06/taskwhenall-select-is-footgun.html
N+1 Problem
Putting threads, tasks, async, parallelism to one side, what you describe is an N+1 problem, which is something to avoid for exactly what happened to you. It's all well and good when N (your user count) is small, but it grinds to a halt as the users grow.
You may want to find a different solution. Do you have to do this operation for all users? If so, then maybe switch to a background process and fan-out for each user.
Back to the footgun (I had to look that up BTW 🙂).
Tasks are a promise, similar to JavaScript. In .NET they may complete on a separate thread - usually a thread from the thread pool.
In .NET Core, they usually do complete on a separate thread if they are not complete at the point of awaiting; for an HTTP request, that is almost certain to be the case.
You may have exhausted the thread pool, but since you're making HTTP requests, I suspect you've exhausted the number of concurrent outbound HTTP requests instead. "The default connection limit is 10 for ASP.NET hosted applications and 2 for all others." See the documentation here.
Is there a way to achieve some parallelism and not exhaust a resource (threads or HTTP connections)? Yes.
Here's a pattern I often implement for just this reason, using Batch() from morelinq.
IEnumerable<User> users = Enumerable.Empty<User>();
IEnumerable<IEnumerable<string>> batches = userIds.Batch(10);
foreach (IEnumerable<string> batch in batches)
{
    // Select yields one task per user id, so this is a sequence of tasks, not a single Task<User>.
    IEnumerable<Task<User>> batchTasks = batch.Select(userId => GetUserDetailsAsync(userId));
    User[] batchUsers = await Task.WhenAll(batchTasks);
    users = users.Concat(batchUsers);
}
You still get ten asynchronous HTTP requests to GetUserDetailsAsync(), and you don't exhaust threads or concurrent HTTP requests (or at least max out with the 10).
Now if this is a heavily used operation or the server with GetUserDetailsAsync() is heavily used elsewhere in the app, you may hit the same limits when your system is under load, so this batching is not always a good idea. YMMV.
You already have an excellent answer here, but just to chime in:
There's no problem with creating thousands of tasks. They're not threads.
The core problem is that you're hitting the API way too much. So the best solutions are going to change how you call that API:
Do you really need user details for thousands of users, all at once? If this is for a dashboard display, then change your API to enforce paging; if this is for a batch process, then see if you can access the data directly from the batch process.
Use a batch route for that API if it supports one.
Use caching if possible.
Finally, if none of the above are possible, look into throttling the API calls.
The standard pattern for asynchronous throttling is to use SemaphoreSlim, which looks like this:
using var throttler = new SemaphoreSlim(10);
var userTasks = userIds.Select(async userId =>
{
    await throttler.WaitAsync();
    // The result must be returned, otherwise Task.WhenAll has no User values to collect.
    try { return await GetUserDetailsAsync(userId); }
    finally { throttler.Release(); }
});
var users = await Task.WhenAll(userTasks); // users is User[]
Again, this kind of throttling is best only if you can't make the design changes to avoid thousands of API calls in the first place.
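For readers who want to see the same throttling idea outside .NET, here is a rough Java sketch. It is not a translation of SemaphoreSlim.WaitAsync: java.util.concurrent.Semaphore blocks the calling thread while it waits for a permit, whereas WaitAsync does not. The class ThrottledFetch and the method fetchAll are hypothetical names, and the fetch function stands in for something like GetUserDetailsAsync.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

class ThrottledFetch {
    /**
     * Starts one asynchronous fetch per id, but never has more than
     * maxConcurrent fetches in flight. Results are returned in input order.
     */
    static <T, R> List<R> fetchAll(List<T> ids,
                                   Function<T, CompletableFuture<R>> fetch,
                                   int maxConcurrent) {
        Semaphore permits = new Semaphore(maxConcurrent);
        List<CompletableFuture<R>> futures = new ArrayList<>();
        for (T id : ids) {
            // Blocks the calling thread until one of the running fetches finishes.
            permits.acquireUninterruptibly();
            CompletableFuture<R> started;
            try {
                started = fetch.apply(id);
            } catch (RuntimeException e) {
                permits.release(); // the fetch never started, so give the permit back
                throw e;
            }
            // Release the permit when the fetch completes, successfully or not.
            futures.add(started.whenComplete((result, error) -> permits.release()));
        }
        List<R> results = new ArrayList<>();
        for (CompletableFuture<R> future : futures) {
            results.add(future.join()); // rethrows the first failure, if any
        }
        return results;
    }
}
```

A call such as ThrottledFetch.fetchAll(userIds, id -> getUserDetails(id), 10) would keep at most ten requests outstanding, assuming a hypothetical getUserDetails that returns a CompletableFuture.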
While no thread waits on a pure async operation while it is in flight, the continuation still needs a thread. Assuming your GetUserDetailsAsync awaits some IO-bound operation, the continuation (parsing output, returning the result, ...) has to run on some thread before the Task created by GetUserDetailsAsync can have its Result set, so every one of those tasks will wait for a thread from the thread pool in order to finish.
I have set up a cluster of three Elasticsearch nodes, all master-eligible, with two being the minimum required. I have configured a client to bulk upload using the low-level client with a static connection pool, using the code below.
What I am trying to test is live failover scenarios, i.e. start the client with three nodes available and then randomly drop one (shutting down the VM) while keeping two up. However, I am not seeing the behavior I would expect: it keeps trying the dead node. It actually seems to take up to about sixty seconds before it moves to the next node.
What I would expect it to do is take the failed attempt, mark that node as potentially dead, and at least move on to the next node. What is odd is that this is the behavior I get if I start my application with only two of the three nodes available in my list, or if I just stop the Elasticsearch service during a test rather than powering down.
Is there a correct way to deal with such a case and get it to move to the next available node as quickly as possible? Or do I need to potentially back off in my code for up to sixty seconds before attempting a republication?
var nodes = new[]
{
    new Node(new Uri("http://172.16.2.10:9200")),
    new Node(new Uri("http://172.16.2.11:9200")),
    new Node(new Uri("http://172.16.2.12:9200"))
};
var connectionPool = new StaticConnectionPool(nodes);
var settings = new ConnectionConfiguration(connectionPool)
    .PingTimeout(TimeSpan.FromSeconds(10))
    .RequestTimeout(TimeSpan.FromSeconds(20))
    .ThrowExceptions()
    .MaximumRetries(3);
_lowLevelClient = new ElasticLowLevelClient(settings);
The following I then have wrapped in a try catch where I retry for a maximum of three times before I consider it a failed attempt and revert to an error strategy.
ElasticsearchResponse<Stream> indexResponse = _lowLevelClient.Bulk<Stream>(data);
Any input is appreciated,
Thank you.
The tests for the client include tests for failover scenarios, from which the API conventions documentation is generated. Specifically, take a look at the retry and failover documentation.
With a StaticConnectionPool, the nodes to which requests can be made are static and never refreshed to reflect nodes that may join and leave the cluster. They will, however, be marked as dead if a bad response is returned, and will be taken out of rotation for a configurable dead time, controlled by DeadTimeout and MaxDeadTimeout on connection settings.
The audit trail on the response should provide a timeline of what has happened for a given request, which is easiest to see with response.DebugInformation. The Virtual Clustering test harness (an example) that is part of the Tests project may help to ascertain the correct settings for the behaviour you're after.
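To make the "marked as dead and taken out of rotation for a dead time" idea concrete, here is a deliberately simplified Java sketch. It is not the client's implementation: it uses a single fixed dead timeout, whereas the real settings distinguish DeadTimeout and MaxDeadTimeout, and the class DeadNodeTracker and its methods are hypothetical names. Time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** A fixed node list in which failed nodes are skipped until a dead timeout expires. */
class DeadNodeTracker {
    private final List<String> nodes;
    private final long deadTimeoutMillis;
    // For each node currently considered dead: the time at which it may be retried.
    private final Map<String, Long> deadUntil = new ConcurrentHashMap<>();

    DeadNodeTracker(List<String> nodes, long deadTimeoutMillis) {
        this.nodes = new ArrayList<>(nodes);
        this.deadTimeoutMillis = deadTimeoutMillis;
    }

    /** Called after a failed request to this node. */
    void markDead(String node, long nowMillis) {
        deadUntil.put(node, nowMillis + deadTimeoutMillis);
    }

    /** Called after a successful request to this node. */
    void markAlive(String node) {
        deadUntil.remove(node);
    }

    /** Nodes that may be tried at the given time, in their original order. */
    List<String> candidates(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (String node : nodes) {
            Long until = deadUntil.get(node);
            if (until == null || nowMillis >= until) {
                result.add(node);
            }
        }
        return result;
    }
}
```

In this model a node only leaves rotation after a request to it has actually failed, so how long that first failure takes to surface still depends on the ping and request timeouts configured on the client.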
Here's my scenario:
I have a page that contains a list of users. I create a new user through my web interface and save it to the server. The server indexes the document in Elasticsearch and returns successfully. I am then redirected to the list page, which doesn't contain the new user because it can take up to 1 second for documents to become available for search in Elasticsearch.
Near real-time search in elasticsearch.
The elasticsearch guide says you can manually refresh the index, but says not to do it in production.
...don’t do a manual refresh every time you index a document in production; it will hurt your performance. Instead, your application needs to be aware of the near real-time nature of Elasticsearch and make allowances for it.
I'm wondering how other people get around this? I wish there was an event or something I could listen for that would tell me when the document was available for search but there doesn't appear to be anything like that. Simply waiting for 1-second is plausible but it seems like a bad idea because it presumably could take much less time than that.
Thanks!
Even though you can force ES to refresh itself, you've correctly noticed that it might hurt performance. One solution around this and what people often do (myself included) is to give an illusion of real-time. In the end, it's merely a UX challenge and not really a technical limitation.
When redirecting to the list of users, you could artificially include the new record that you've just created in the list of users, as if that record had been returned by ES itself. Nothing prevents you from doing that. By the time you decide to refresh the page, the new user record will be correctly returned by ES, and no one cares where that record is coming from. All the user cares about at that moment is seeing the new record that he's just created, simply because we're used to thinking sequentially.
Another way to achieve this is by reloading an empty user list skeleton and then via Ajax or some other asynchronous way, retrieve the list of users and display it.
Yet another way is to provide a visual hint/clue on the UI that something is happening in the background and that an update is to be expected very shortly.
In the end, it all boils down to not surprising users and giving them enough clues as to what has happened, what is happening and what they should still expect to happen.
UPDATE:
Just for completeness' sake, this answer predates ES5, which introduced a way to make sure that the indexing call would not return until the document is either visible when searching the index or return an error code. By using ?refresh=wait_for when indexing your data you can be certain that when ES responds, the new data will be indexed.
Elasticsearch 5 has an option to block an indexing request until the next refresh has occurred:
?refresh=wait_for
See: https://www.elastic.co/guide/en/elasticsearch/reference/5.0/docs-refresh.html#docs-refresh
Here is a fragment of code which is what I did in my Angular application to cope with this. In the component:
async doNewEntrySave() {
  try {
    const resp = await this.client.createRequest(this.doc).toPromise();
    this.modeRefreshDelay = true;
    setTimeout(() => {
      this.modeRefreshDelay = false;
      this.refreshPage();
    }, 2500);
  } catch (err) {
    this.error.postError(err);
  }
}
In the template:
<div *ngIf="modeRefreshDelay">
<h2>Waiting for update ...</h2>
</div>
I understand this is a quick-and-dirty solution, but it illustrates how the user experience should work. Obviously it breaks if the real-world latency turns out to be more than 2.5 seconds. A fancier version would repeat the delay until the new record showed up in the page (with a limit of course).
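The "loop with a limit" variant could be sketched as below. This is shown in Java rather than in the Angular/TypeScript of the fragment above, and the names Poller and pollUntil are hypothetical. The check passed in is assumed to be something like "search for the new record and report whether it came back".

```java
import java.util.function.BooleanSupplier;

class Poller {
    /**
     * Runs check up to maxAttempts times, sleeping delayMillis between attempts.
     * Returns true as soon as check succeeds, or false once the attempts are
     * used up or the thread is interrupted.
     */
    static boolean pollUntil(BooleanSupplier check, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (check.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
        return false;
    }
}
```

Compared with a fixed 2.5 second wait, this returns as soon as the record is visible and gives the caller a clear signal (false) when it never showed up, so the UI can fall back to an error or a manual refresh hint.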
Unless you completely redesign ElasticSearch you will always have some latency between the successful index operation and the time when that document shows up in search results.
Data should be available immediately after indexing is complete. Couple of general questions:
Have you checked CPU and RAM to determine whether you are taxing your ES cluster? If so, you may need to beef up your hardware config to account for it. ES loves RAM!
Are you using NAS (network-attached-storage) or virtualized storage like EBS? Elastic recommends not doing so because of the latency. If you can use DAS (direct-attached) and SSD, you'll be in much, much better shape.
To give you an AWS example, moving from m4.xlarge instances to r3.xlarge made HUGE performance improvements for us.