Azure Redis cache - timeouts on GET calls - caching

We've got several web and worker roles in Azure connecting to our Azure Redis cache via the StackExchange.Redis library, and we're receiving regular timeouts that are making our end-to-end solution grind to a halt. An example of one of them is below:
System.TimeoutException: Timeout performing GET stream:459, inst: 4, mgr: Inactive, queue: 12, qu=0, qs=12, qc=0, wr=0/0, in=65536/0
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 1785
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 79
   at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line 1346
   at OptiRTC.Cache.RedisCacheActions.<>c__DisplayClass41.<Get>b__3() in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 104
   at Polly.Retry.RetryPolicy.Implementation(Action action, IEnumerable`1 shouldRetryPredicates, Func`1 policyStateFactory)
   at OptiRTC.Cache.RedisCacheActions.Get[T](String key, Boolean allowDirtyRead) in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 107
   at OptiRTC.Cache.RedisCacheAccess.d__e4.MoveNext() in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheAccess.cs:line 1196; TraceSource 'WaWorkerHost.exe' event
All the timeouts have different queue and qs numbers, but the rest of the messages are consistent. These StringGet calls are across different keys in the cache. In each of our services, we use a singleton cache access class with a single ConnectionMultiplexer that is registered with our IoC container in the web or worker role startup:
container.RegisterInstance<ICacheAccess>(cacheAccess);
In our implementation of ICacheAccess, we're creating the multiplexer as follows:
ConfigurationOptions options = new ConfigurationOptions();
options.EndPoints.Add(serverAddress);
options.Ssl = true;
options.Password = accessKey;
options.ConnectTimeout = 1000;
options.SyncTimeout = 2500;
redis = ConnectionMultiplexer.Connect(options);
where the redis object is used throughout the instance. We've got about 20 web and worker role instances connecting to the cache via this ICacheAccess implementation, but the management console shows an average of 200 concurrent connections to the cache.
I've seen other posts that reference using version 1.0.333 of StackExchange.Redis, which we're doing via NuGet, but when I look at the actual version of the StackExchange.Redis.dll reference added, it shows 1.0.316.0. We've tried adding and removing the NuGet reference as well as adding it to a new project, and we always get the version discrepancy.
Any insight would be appreciated. Thanks.
Additional information:
We've upgraded to 1.0.371. We have two services that each access the same cache object at different intervals: one edits it and occasionally reads it, and one reads it several times a second. Both services are deployed with the same caching code and StackExchange.Redis library version. I almost never see timeouts in the service that edits the object, but I get timeouts between 50% and 75% of the time in the service that reads it. The timeouts have the same format as the one shown above, and they continue to occur after wrapping the db.StringGet call in a Polly retry block that handles both RedisException and System.TimeoutException and retries once after 500ms.
We contacted Microsoft about this issue, and they confirm that they see nothing in the Redis logs that indicate an issue on the Redis service side. Our cache miss % is extremely low on the Redis server, but we continue to get these timeouts, which substantially hinder our application's functionality.
In response to the comments, yes, we always have a number in qs and never in qc. We always have a number in the first part of the in and never in the second.
Even more additional information:
When I run a service with fewer instances at a higher CPU, I get significantly more of these timeout errors than when instances are running at lower CPUs. More specifically, I pulled some numbers from our services this morning. When they were running at around 30% CPU, I saw very few timeout issues - just 42 over 30 minutes. When I removed half the instances and they started to run at around 60-65% CPU, the rate increased 10-fold to 536 over 30 minutes.

I know this thread is months old but I think my own experiences can add some value here. I had the same problem with Azure Redis Cache (timeouts on Gets) but realized that it was almost exclusively happening on Gets where the string value was relatively large (> 250K in length). I implemented gzip on both Gets and Sets (when the string value is large) and now I almost never get a timeout.
Even if this doesn't solve your particular problem, it's probably good practice to compress the values in general to reduce costs and improve performance.
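As a rough illustration of the compression idea above, here is a minimal Python sketch (not the poster's actual C# code). The 64 KB threshold and the one-byte marker prefix are assumptions made for this example only; the resulting bytes would be what gets passed to the cache client's set call and read back from its get call.

```python
import gzip

# Hypothetical threshold: only compress values larger than this many bytes.
COMPRESS_THRESHOLD = 64 * 1024
# One-byte markers (an assumption of this sketch) so the reader knows how to decode.
RAW, GZIPPED = b"\x00", b"\x01"


def encode_value(text: str) -> bytes:
    """Encode a string for storage, gzip-compressing it when it is large."""
    data = text.encode("utf-8")
    if len(data) > COMPRESS_THRESHOLD:
        return GZIPPED + gzip.compress(data)
    return RAW + data


def decode_value(blob: bytes) -> str:
    """Reverse encode_value, decompressing only when the marker says so."""
    marker, payload = blob[:1], blob[1:]
    if marker == GZIPPED:
        payload = gzip.decompress(payload)
    return payload.decode("utf-8")


if __name__ == "__main__":
    large = "sensor-reading;" * 50_000  # about 750 KB of repetitive text
    stored = encode_value(large)
    print(len(large), "->", len(stored), "bytes")
    assert decode_value(stored) == large
```

Small values skip compression because gzip adds overhead that is not worth paying for a few bytes. How much large values shrink depends entirely on how repetitive the data is.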

Regarding the version numbers, it seems that the AssemblyVersion has been locked at 1.0.316 for the last several releases, but the AssemblyFileVersion has been updated to match the NuGet package version. For now, I recommend ignoring AssemblyVersion and just using AssemblyFileVersion to ensure you have the correct binary.
Please contact us at AzureCache@microsoft.com if you are still seeing timeouts using Azure Redis Cache.

Related

How to implement a cache in a Vertx application?

I have an application that at some point has to perform REST requests towards another (non-reactive) system. It happens that a high number of requests are performed towards exactly the same remote resource (the resulting HTTP request is the same).
I was thinking to avoid flooding the other system by using a simple cache in my app.
I am in full control of the cache and I have proper moments when to invalidate it, so this is not an issue. Without this cache, I'm running into other issues, like connection timeout or read timeout, the other system having troubles with high load.
Map<String, Future<Element>> cache = new ConcurrentHashMap<>();

Future<Element> lookupElement(String id) {
    String key = createKey(id);
    // The lambda parameter must not reuse the name of the local variable "key".
    return cache.computeIfAbsent(key, k -> performRESTRequest(id))
        .onSuccess(element -> {
            // some further processing
        });
}
As I mentioned, lookupElement() is invoked from different worker threads with the same id.
The first thread will enter computeIfAbsent and perform the remote request while the other threads are blocked by ConcurrentHashMap.
However, when the first thread finishes, the waiting threads will receive the same Future object. Imagine 30 "clients" reacting to the same Future instance.
In my case this works quite well and fast up to a particular load, but when the processing input of the app increases, resulting in even more invocations of lookupElement(), my app becomes slower and slower (although it reports 300% CPU usage, it logs slowly) until it starts to report OutOfMemoryError.
My questions are:
Do you see any Vertx specific issue with this approach?
Is there a more Vertx friendly caching approach I could use when there is a high concurrency on the same cache key?
Is it a good practice to cache the Future?
So, a bit unusual to respond to my own question, but I managed to solve the problem.
I was having two dilemmas:
Are ConcurrentHashMap and computeIfAbsent() appropriate for Vertx?
Is it safe to cache a Future object?
I am using this caching approach in two places in my app, one for protecting the REST endpoint, and one for a more complex database query.
What was happening is that for the database query there were up to 1300 "clients" waiting for a response, that is, 1300 listeners waiting for an onSuccess() of the same Future. When the Future completed, strange things were happening: some kind of thread strangulation.
I did a bit of refactoring to eliminate this concurrency on the same resource/key, but I kept both caches, and things went back to normal.
In conclusion, I think my caching approach is safe as long as the load is spread widely enough, or in other words, as long as we don't have such high concurrency on the same resource. Having 20-30 listeners on the same Future works just fine.
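For readers who want to see the "cache the future" idea in isolation, here is a minimal sketch using Python's asyncio rather than Vert.x. It is an analogy only: SingleFlightCache, fake_rest_request and demo are hypothetical names, and the event-loop model differs from Vert.x worker threads. It shows 30 concurrent callers sharing one in-flight request for the same key.

```python
import asyncio


class SingleFlightCache:
    """Caches the in-flight task per key so concurrent callers share one fetch."""

    def __init__(self, fetch):
        self._fetch = fetch  # coroutine function: key -> value
        self._tasks = {}     # key -> asyncio.Task

    async def get(self, key):
        task = self._tasks.get(key)
        if task is None:
            # There is no await between the lookup and the store, so on a single
            # event loop only one caller creates the task for a given key.
            task = asyncio.create_task(self._fetch(key))
            self._tasks[key] = task
        try:
            # shield() keeps one cancelled caller from cancelling the shared task.
            return await asyncio.shield(task)
        except Exception:
            # Do not keep a failed result cached; let the next caller retry.
            if self._tasks.get(key) is task:
                del self._tasks[key]
            raise

    def invalidate(self, key):
        self._tasks.pop(key, None)


async def demo():
    calls = 0

    async def fake_rest_request(key):
        # Stand-in for the remote REST call.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"element-{key}"

    cache = SingleFlightCache(fake_rest_request)
    results = await asyncio.gather(*(cache.get("42") for _ in range(30)))
    return calls, results


if __name__ == "__main__":
    print(asyncio.run(demo())[0], "remote call(s) for 30 lookups")
```

Every waiter still runs its own continuation when the shared result arrives, so a very large number of waiters on one key can still cause a burst of work at completion time, which matches the observation above.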

How to use Pomelo.EntityFrameworkCore.MySql provider for ef core 3 in async mode properly?

We are building an asp.net core 3 application which uses ef core 3.0 with Pomelo.EntityFrameworkCore.MySql provider 3.0.
Right now we are trying to replace all database calls from sync to async, like:
//from
dbContext.SaveChanges();
//to
await dbContext.SaveChangesAsync();
Unfortunately, when we do this we experience two issues:
The number of connections to the server grows significantly compared to the same tests with sync calls
The average processing speed of our application drops significantly
What is the recommended way to use ef core with MySql asynchronously? Any working example or evidence of using ef-core 3 with MySql asynchronously would be appreciated.
It's hard to say what the issue here is without seeing more code. Can you provide us with a small sample app that reproduces the issue?
Any DbContext instance uses exactly one database connection for normal operations, independent of whether you call sync or async methods.
Number of connections to the server grows significantly compared to the same tests for sync calls
What kind of tests are we talking about? Are they automated? If so, how many tests are being run? Because of the nature of async calls, if you run 1000 tests in parallel, every test with its own DbContext, you will end up with 1000 parallel connections.
Though with Pomelo, you will not end up additionally with 1000 threads, as you would with using Oracle's provider.
Update:
We test an ASP.NET Core (MVC) call which goes to the db and reads and writes something. 50 threads, using DbContextPool with limit 500. If I use dbContext.SaveChanges(), Add(), and all other context methods synchronously, I end up with around 50 connections to MySql. Using dbContext.SaveChangesAsync(), AddAsync(), ReadAsync() etc., I end up seeing a maximum of 250 connections to MySql, and the average response time of the page gets worse by a factor of 2 to 3.
(I am talking about ASP.NET and requests below. The same is true for parallel run test cases.)
If you use Async methods all the way, nothing will block, so your 50 threads are free to handle the next 50 requests while the database is still executing the queries for the first 50 requests.
This will happen again and again because ASP.NET might process your requests faster than your database can return its results. So you will end up with a lot of parallel database queries.
This does not happen when executing the Sync methods, because every thread blocks and you end up with a maximum of 50 parallel queries (one per thread).
So this is expected behavior and just a consequence of async method calls.
You can always modify your code or web server configuration to limit the amount of concurrent ASP.NET requests.
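To make the effect of a concurrency cap concrete, here is a minimal sketch in Python's asyncio (not ASP.NET or EF Core). run_bounded and fake_query are hypothetical names, and the sleep stands in for an awaited database call. A semaphore keeps at most `limit` queries in flight, which is roughly what blocking threads did implicitly in the sync version.

```python
import asyncio
from functools import partial


async def run_bounded(jobs, limit):
    """Run coroutine factories with at most `limit` in flight.

    Returns the results in order, plus the peak number of concurrent jobs.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:  # waits here while `limit` jobs are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_query(i):
    # Stand-in for an awaited database call.
    await asyncio.sleep(0.01)
    return i * 2


if __name__ == "__main__":
    jobs = [partial(fake_query, i) for i in range(200)]
    results, peak = asyncio.run(run_bounded(jobs, limit=50))
    print("peak concurrent queries:", peak)
```

Without the semaphore, all 200 queries would be issued at once; with it, the extra requests wait their turn instead of opening more connections.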
50 threads, using DbContextPool with limit 500.
Also be aware that DbContextPool does not limit how many DbContext objects can concurrently exist, but only how many will be kept in the pool. So if you set DbContextPool to 500, you can create more than 500 contexts, but only 500 will be kept alive after using them.
Update:
There is a very interesting low-level talk about lock-free database pool programming from @roji that addresses this behavior. It takes your position that there should be an upper limit in the connection pool that results in blocking when exceeded, and it makes a great case for this behavior.
According to @bgrainger from MySqlConnector, that is how it is already implemented (the docs did not explicitly state this, but they do now). The MaxPoolSize connection string option has a default value of 100, so if you use connection pooling, don't override this value, and don't use multiple connection pools, you should not have more than 100 connections active at a given time.
From GitHub:
This is a documentation error, if you are interpreting the docs to mean that you can create an unlimited number of connections.
When pooling is true, each connection pool (there is one per unique connection string) only allows MaximumPoolSize connections to be open simultaneously. Each additional call to MySqlConnection.Open will block until a connection is returned to the pool.
When pooling is false, there is no limit to the number of connections that can be opened simultaneously; it's up to the user to manage the concurrency.
Check to see whether you have Pooling=false in your connection string, as mentioned by Bradley Grainger in comments.
After I removed pooling=false from my connection string, my app ran literally 3x faster.

Is there a "best practice" in microservice development for versioning a database table?

A system is being implemented using microservices. In order to decrease interactions between microservices implemented "at the same level" in an architecture, some microservices will locally cache copies of tables managed by other services. The assumption is that the locally cached table (a) is frequently accessed in a "read mode" by the microservice, and (b) has relatively static content (i.e., more of a "lookup table" than transactional content).
The local caches will stay in sync using inter-service messaging. As the content should be fairly static, this should not be a significant issue/workload. However, on startup of a microservice, there is a possibility that the local cache has gone stale.
I'd like to implement some sort of rolling revision number on the source table, so that microservices with local caches can check this revision number to potentially avoid a re-sync event.
Is there a "best practice" for this approach? Or a "better alternative", given that each microservice is backed by its own database (i.e., no shared database)?
In my opinion, you shouldn't be loading the data at startup. It might be a bit complicated to maintain a version.
Cache-Aside Pattern
Generally, in a microservices architecture you consider the "cache-aside pattern". You don't build the cache up front but on demand. When you get a request, you check the cache; if the value is not there, you load the latest value from the source, put it in the cache, and return the response. From then on it is served from the cache. The benefit is that you don't need to load everything up front. Say you have 200 records but services only use 50 of them frequently: loading all 200 means maintaining extra cache entries that may never be needed.
Let the requests build the cache; it's a one-time DB hit per entry. You can set an expiry on cache entries, and an incoming request will rebuild an entry after it expires.
If you have data which is totally static (it never changes), then this pattern may not be worth a discussion. But if you have a lookup table that can change even once a week or once a month, then you should be using this pattern with a longer cache expiration time. Maintaining the version could be costly. But it's really up to you how you want to implement it.
https://learn.microsoft.com/en-us/azure/architecture/patterns/cache-aside
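A minimal sketch of cache-aside with expiry, in Python, might look like the following. CacheAside and its parameters are hypothetical names for illustration; the load callback stands in for the call to the owning service or database, and the clock is injectable so the expiry behaviour can be exercised without waiting.

```python
import time


class CacheAside:
    """On-demand cache: load a value on a miss, then serve it until it expires."""

    def __init__(self, load, ttl_seconds, clock=time.monotonic):
        self._load = load        # function: key -> value (the source of truth)
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit, still fresh
        value = self._load(key)  # miss or expired: one hit on the source
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop an entry early, for example when an update message arrives."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    lookups = []

    def load_from_source(key):
        lookups.append(key)
        return key.upper()

    cache = CacheAside(load_from_source, ttl_seconds=3600)
    cache.get("status-codes")
    cache.get("status-codes")
    print("source hits:", len(lookups))
```

Only keys that are actually requested ever occupy the cache, and a longer TTL trades freshness for fewer source hits.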
We ran into this same issue and have temporarily solved it by using a LastUpdated timestamp comparison (same concept as your VersionNumber). Every night (when our application tends to be slow) each service publishes a ServiceXLastUpdated message that includes the most recent timestamp when the data it owns was added/edited. Any other service that subscribes to this data processes the message, and if there's a mismatch it requests all rows "touched" since its last local update so that it can get back in sync.
For us, for now, this is okay, as new services don't tend to come online and be in use the same day. But our plan going forward is that any time a service starts up, it can publish a message for each subscribed service indicating its most recent cache update timestamp. If a "source" service sees the timestamp is not current, it can send updates to re-sync the data. This has the advantage of only sending the needed updates to the specific service(s) that need them, even though (at least for us) all subscribed services have access to the messages.
We started with persistent queues, so if all instances of a microservice were down, the messages would just build up in its queue. There are 2 issues with this that led us to build something better:
1) It obviously doesn't solve the "first startup" scenario as there is no queue for messages to build up in
2) If ANYTHING goes wrong either in storing queued messages or processing them, you end up out of sync. If that happens, you still need a proactive mechanism like we have now to bring things back in sync. So, it seemed worth going this route
I wouldn't say our method is a "best practice" and if there is one I'm not aware of it. But, the way we're doing it (including planned future work) has so far proven simple to build, easy to understand and monitor, and robust in that it's extremely rare we get an event caused by out-of-sync local data.
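The timestamp comparison described in this answer can be sketched roughly as follows, in Python. All class and method names here are hypothetical, the messaging layer is replaced by direct method calls, and integer timestamps stand in for real ones. The subscriber fetches only rows touched since its last sync.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: str
    value: str
    updated_at: int  # increasing timestamp or revision number


class SourceTable:
    """The table owned by the "source" service."""

    def __init__(self):
        self._rows = {}

    def upsert(self, key, value, updated_at):
        self._rows[key] = Row(key, value, updated_at)

    def last_updated(self):
        return max((r.updated_at for r in self._rows.values()), default=0)

    def rows_since(self, timestamp):
        return [r for r in self._rows.values() if r.updated_at > timestamp]


class LocalCopy:
    """A subscriber's locally cached copy of the table."""

    def __init__(self):
        self.rows = {}
        self.last_synced = 0

    def handle_last_updated(self, source):
        """Process a LastUpdated announcement; return how many rows were refreshed."""
        announced = source.last_updated()
        if announced <= self.last_synced:
            return 0  # already current, no re-sync needed
        changed = source.rows_since(self.last_synced)
        for row in changed:
            self.rows[row.key] = row
        self.last_synced = announced
        return len(changed)


if __name__ == "__main__":
    source = SourceTable()
    source.upsert("a", "1", 10)
    source.upsert("b", "2", 20)
    copy = LocalCopy()
    print(copy.handle_last_updated(source), "rows fetched on first sync")
```

This sketch does not handle deleted rows; a real implementation would need soft deletes or tombstones carrying their own timestamps for the comparison to catch them.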

Golang app-engine performance parameters

Using stock out-of-the-box configuration on a golang app-engine project, I am getting very disappointing performance. Any hints on what I might be missing? How should a golang google app be optimized?
Sending a few dozen requests, not more than six concurrently, I find only one instance handling all the requests, up to six requests concurrently (not sequentially) on that one instance - where I expected to see up to six instances. Possibly as a result, things seem to be blocking. I am seeing many timeouts, even on administrative functions like blobstore.Create(), which didn't happen when requests were being sent and processed individually.
EDIT1: These three lines
context.Infof("Sending request to blobstore to create %s as %s", Name, MimeType)
blobWriter, err := blobstore.Create(context, MimeType)
if err != nil {
    context.Warningf("Unable to access content store: %v", err)
}
are producing:
I 12:47:36.201 Sending request to blobstore to create download.jpg as application/octet-stream
W 12:47:41.251 Unable to access content store: Canceled: Deadline exceeded (timeout)
On failure here it is always about five seconds in blobstore.Create (a few milliseconds when it passes). Timeouts also occur in blobstore.Write and blobstore.Close and datastore, but with 20 to 30 second delays.
--End EDIT1.
There also seem to be performance issues. There is one computationally intensive bit, taking nearly a second to complete on my home machine (at 1.7GHz). According to the logged time stamps, that same code running on the remote app-engine (at 600MHz) is taking over 30 seconds on average, with a maximum of 109 seconds. That doesn't seem right!
EDIT2: The most computationally intensive bit used the resize function:
https://code.google.com/p/appengine-go/source/browse/example/moustachio/resize/resize.go
(with the obvious bug fixes). Not the most efficient resizer, but fast enough for now in a stand-alone app. However, it runs an order of magnitude slower in appengine (either the local SDK version 1.9 or running on Google's servers). Perhaps Google's version of the image library is slower? The library seems the likely cause, since a recursive Fibonacci computation runs inside appengine in the same time as outside (same order of magnitude as C code).
--- End EDIT2
Any hints on how to get google app performance more similar to a multi-threaded stand-alone application? So far these preliminary scaling experiments have been a miserable failure!
UPDATE: Using runtime.GOMAXPROCS(6), for a maximum of 6 concurrent requests, made no measurable difference. Using "manual_scaling" with more instances than requests was helpful: requests usually get assigned to different instances, but sometimes not, which leads to problems.
A partial solution: Segregate computationally intensive requests on a separate module, running on separate instances, so that they do not block smaller more time-sensitive requests. Next, break down larger functions into several smaller requests, so that several can run "concurrently" on the same instance without timing out? (Make the client send several requests to do one job!)
It would be much better if I could ask the appengine just to start new instances for each request when none are available. Experimentally, starting a new instance is much cheaper than running two requests in slow motion on one instance.

Stream writeheaders take too much time WCF

A similar question is asked at NewRelic stream & writeHeaders
I am profiling my WCF services on New Relic. There is a WCF service which calls another WCF service.
Now I suppose that while calling the other WCF service, when it creates the request, somewhere the internal process writes headers to the request stream, which is sometimes slow.
The traces I found in New Relic tell me that a particular method of one of my WCF services, which calls a method of another of my WCF services, takes around 50-60 seconds, of which 95-100% is consumed by System.Net.ConnectStream.WriteHeaders.
Stream[url of WCF service/soap]: WriteHeaders -> 99.78 % time (approx 49 seconds).
I don't understand what this is or how to reduce this time.
I have searched but didn't find what ConnectStream actually does, or any details about it that would help me reduce the time it takes.
Please, let me know your suggestions.
It sounds like you're streaming a large file up from a client, catching it in one WCF web service, then re-writing the data into a new HttpWebRequest, then sending it to another host. I think I'd be tempted to try buffering the data from the client to your web service rather than streaming.
I've spent the last year working on a project that sounds similar to what you're doing. The difference between streaming and buffering is this:
Streaming reads (from source) and then writes (to target) the data in an iterative process you don't have much control over. If the source file is large (like a gig or more), the WCF request/response will iterate a dozen or more times back and forth between the client and host before the request is complete.
Buffering, on the other hand, accumulates the entire content of the target file BEFORE filling the request and sending it to the host, thus speeding up the process. And since the performance penalty incurred by buffering (time required to accumulate the bytes in memory) is placed on the client, it's generally not a problem.
So when buffering data from the client, your host will receive one HTTP request with a complete byte array (let's say) that's ready to be repackaged into the request you're passing on to the second, target WCF host. At that point, again, you have the choice between buffering and streaming. On the host, where performance matters, streaming the request to the second host will improve your scalability but (again) potentially hurt your speed.
On the client side:
With binding
    .TransferMode = TransferMode.Buffered ' instead of TransferMode.Streamed
    .MessageEncoding = WSMessageEncoding.Text
    .TextEncoding = System.Text.Encoding.UTF8
    .MaxReceivedMessageSize = Integer.MaxValue
    .ReaderQuotas.MaxArrayLength = Integer.MaxValue
    .ReaderQuotas.MaxBytesPerRead = Integer.MaxValue
    .ReaderQuotas.MaxDepth = Integer.MaxValue
    .ReaderQuotas.MaxNameTableCharCount = Integer.MaxValue
    .ReaderQuotas.MaxStringContentLength = Integer.MaxValue
    .MaxBufferSize = Integer.MaxValue
    .MaxBufferPoolSize = Integer.MaxValue
End With
On the host side:
With binding
    .TransferMode = TransferMode.Buffered
    .MaxReceivedMessageSize = Integer.MaxValue
End With
I've seen the same thing before when the service you're calling is stalling or is flooded with too many concurrent connections. If the issue is the former, profiling your WCF service may help identify the root cause; maybe it's slow to respond due to database access or some other I/O-bound process. If the issue is the latter, it may be something that you can resolve by tuning the performance of the service (http://msdn.microsoft.com/en-us/library/ee377061(v=bts.10).aspx)
This can also manifest itself as "BeginRequest" for an ASP.NET application in New Relic. Rarely does BeginRequest or WriteHeaders mean the problem is really with sending the data itself, though it could be if you have large payloads, but in regular calls where the data transmitted is small, the problem with a slow time to connect or slow response will appear in these two areas.

Resources