Scalability issue with singleton object (remoting) - .net-remoting

Hi All,
We have a singleton object hosted in windows service.
It works fine until the number of simultaneous client requests exceeds some magic number around 100.
After this, all new calls seem to be queued and processed one by one, only when one of the current connections is released.
We would very much appreciate if someone could tell us how to get rid of this limitation.
At the time it happens, the number of threads (according to Task Manager) is about 120, so thread pooling shouldn't be an issue (there are 2 CPUs, which allows up to 512 threads, if I understand correctly).
There is also plenty of free memory (the process allocates about 200-300 MB and there is still more than 1 GB of free memory).
We use .NET Framework 3.5.
Below is a fragment of app.config.
<configuration>
  <system.runtime.remoting>
    <application>
      <service>
        <wellknown type="CompanyName.Server.ServerStub, MyServer" objectUri="MyServer" mode="Singleton"/>
      </service>
      <channels>
        <channel port="3210" ref="tcp">
          <serverProviders>
            <formatter ref="binary" typeFilterLevel="Full"/>
          </serverProviders>
        </channel>
      </channels>
    </application>
  </system.runtime.remoting>
</configuration>

There is always only one singleton object. It is handling all requests one by one. After about 100 requests you'll probably notice some slowdown because some buffers are filling up.

Related

How to avoid saturation in Akka HTTP (latency spikes)?

I have an akka-http (Scala) API server that serves data to a NodeJS server.
Right after startup, everything works fine: everything is fast and latency is low. But suddenly, latency increases rapidly. The API no longer responds, and the website becomes unusable.
The strange thing is that the traffic and the request count remain stable. The latency spikes seem uncorrelated with them.
I guess this saturation is achieved with the blocking of all the threads in akka thread pool. Unfortunately, my Akka dispatcher is blocking, because I'm doing a lot of SQL queries (in MySQL) and because I'm not using a reactive library. I'm using Slick 2, which is, contrary to Slick 3, blocking-only.
Here's my dispatcher:
http-blocking-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    fixed-pool-size = 46
  }
  throughput = 1
}
So, my question is: how can I avoid this sort of bottleneck? How can I keep latency proportional to the traffic? Is there a way to evict the requests that cause the saturation, to prevent them from compromising everything?
Thank you!
You should not use Akka's own thread pool for running long blocking tasks. Create your own thread pool, and run your slick queries using it, leaving free threads for akka. That's for your first 2 questions.
I don't know of any good answer for the last one. You could maybe look into specific Slick settings to set a timeout on SQL queries, but I don't know if such things exist. Otherwise, try to analyse why your queries take so much time. Could you be missing an index or two?
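The advice above (keep blocking work off the framework's own threads) can be sketched in plain Java. This is only an illustration, not Akka or Slick code: the class name `BlockingCallIsolation`, the pool size, and the fallback value are all assumptions made for the example. Note that the timeout only stops the caller from waiting; it does not abort the underlying query.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper: runs blocking calls (e.g. SQL queries) on their own pool,
// so request-handling threads are never parked waiting on the database.
public class BlockingCallIsolation {
    private final ExecutorService blockingPool;

    // poolSize is an assumption; it would normally track the DB connection pool size.
    public BlockingCallIsolation(int poolSize) {
        this.blockingPool = Executors.newFixedThreadPool(poolSize);
    }

    // Submits the blocking call to the dedicated pool and returns immediately.
    public <T> CompletableFuture<T> runBlocking(Supplier<T> blockingCall) {
        return CompletableFuture.supplyAsync(blockingCall, blockingPool);
    }

    // Waits at most timeoutMillis for the result, then gives up and returns fallback.
    // The task itself keeps running on the pool; only the caller is released.
    public <T> T runBlockingWithTimeout(Supplier<T> blockingCall, long timeoutMillis, T fallback) {
        CompletableFuture<T> future = runBlocking(blockingCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public void shutdown() {
        blockingPool.shutdownNow();
    }
}
```

In an Akka application the equivalent would be a separately configured dispatcher used only for the database calls; the sketch just shows the separation of pools.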

Hard limit connections Spring Boot

I'm working on a simple microservice written in Spring Boot. This service will act as a proxy towards another resource that has a hard concurrent connection limit, and the requests take a while to process.
I would like to impose a hard limit on the concurrent connections allowed to my microservice and reject any beyond it, either with a 503 or at the TCP/IP level. I've looked into the different configurations that can be made for Jetty/Tomcat/Undertow but haven't yet found anything completely convincing.
I found some settings regulating thread pools:
server.tomcat.max-threads=0 # Maximum amount of worker threads.
server.undertow.io-threads= # Number of I/O threads to create for the worker.
server.undertow.worker-threads= # Number of worker threads.
server.jetty.acceptors= # Number of acceptor threads to use.
server.jetty.selectors= # Number of selector threads to use.
But if I understand correctly, these all configure thread pool sizes and will just result in connections being queued at some level.
This seems really interesting, but it hasn't been merged yet and is targeted for Spring Boot 1.5: https://github.com/spring-projects/spring-boot/pull/6571
Am I out of luck using a setting for now? I could of course implement a filter but would rather block it on an earlier level and not have to reinvent the wheel. I guess using apache or something else in front is also an option, but still that feels like an overkill.
Try looking at EmbeddedServletContainerCustomizer.
This gist could give you an idea of how to do that.
TomcatEmbeddedServletContainerFactory factory = ...;
factory.addConnectorCustomizers(connector ->
    ((AbstractProtocol) connector.getProtocolHandler()).setMaxConnections(10000));
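For comparison, the filter-style alternative mentioned in the question can be sketched with a semaphore. This is a minimal, framework-free illustration: `ConcurrencyLimiter` and its `handle` method are hypothetical names, and in a real service the same logic would sit in a servlet filter that sets the response status.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical limiter: at most maxConcurrent handlers run at once;
// anything beyond that is rejected immediately instead of being queued.
public class ConcurrencyLimiter {
    public static final int SERVICE_UNAVAILABLE = 503;

    private final Semaphore permits;

    public ConcurrencyLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the handler and returns its status code if a permit is free;
    // otherwise returns 503 without waiting.
    public int handle(Supplier<Integer> handler) {
        if (!permits.tryAcquire()) {
            return SERVICE_UNAVAILABLE;
        }
        try {
            return handler.get();
        } finally {
            permits.release();
        }
    }
}
```

Unlike a connector-level setting, this rejects at the application layer, so the connection has already been accepted by the time the limit is checked.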

Azure Redis cache - timeouts on GET calls

We've got several web and worker roles in Azure connecting to our Azure Redis cache via the StackExchange.Redis library, and we're receiving regular timeouts that are making our end-to-end solution grind to a halt. An example of one of them is below:
System.TimeoutException: Timeout performing GET stream:459, inst: 4, mgr: Inactive, queue: 12, qu=0, qs=12, qc=0, wr=0/0, in=65536/0
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 1785
at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 79
at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags) in c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line 1346
at OptiRTC.Cache.RedisCacheActions.<>c__DisplayClass41.<Get>b__3() in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 104
at Polly.Retry.RetryPolicy.Implementation(Action action, IEnumerable`1 shouldRetryPredicates, Func`1 policyStateFactory)
at OptiRTC.Cache.RedisCacheActions.Get[T](String key, Boolean allowDirtyRead) in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 107
at OptiRTC.Cache.RedisCacheAccess.d__e4.MoveNext() in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheAccess.cs:line 1196;
TraceSource 'WaWorkerHost.exe' event
All the timeouts have different queue and qs numbers, but the rest of the messages are consistent. These StringGet calls are across different keys in the cache. In each of our services, we use a singleton cache access class with a single ConnectionMultiplexer that is registered with our IoC container in the web or worker role startup:
container.RegisterInstance<ICacheAccess>(cacheAccess);
In our implementation of ICacheAccess, we're creating the multiplexer as follows:
ConfigurationOptions options = new ConfigurationOptions();
options.EndPoints.Add(serverAddress);
options.Ssl = true;
options.Password = accessKey;
options.ConnectTimeout = 1000;
options.SyncTimeout = 2500;
redis = ConnectionMultiplexer.Connect(options);
where the redis object is used throughout the instance. We've got about 20 web and worker role instances connecting to the cache via this ICacheAccess implementation, but the management console shows an average of 200 concurrent connections to the cache.
I've seen other posts that reference using version 1.0.333 of StackExchange.Redis, which we're doing via NuGet, but when I look at the actual version of the StackExchange.Redis.dll reference added, it shows 1.0.316.0. We've tried adding and removing the NuGet reference as well as adding it to a new project, and we always get the version discrepancy.
Any insight would be appreciated. Thanks.
Additional information:
We've upgraded to 1.0.371. We have two services that each access the same cache object at different intervals: one edits it and occasionally reads it, and one reads it several times a second. Both services are deployed with the same caching code and StackExchange.Redis library version. I almost never see timeouts in the service that edits the object, but I get timeouts between 50 and 75% of the time in the service that reads it. The timeouts have the same format as the one shown above, and they continue to occur after wrapping the db.StringGet call in a Polly retry block that handles both RedisException and System.TimeoutException and retries once after 500ms.
We contacted Microsoft about this issue, and they confirm that they see nothing in the Redis logs that indicate an issue on the Redis service side. Our cache miss % is extremely low on the Redis server, but we continue to get these timeouts, which substantially hinder our application's functionality.
In response to the comments, yes, we always have a number in qs and never in qc. We always have a number in the first part of the in and never in the second.
Even more additional information:
When I run a service with fewer instances at a higher CPU, I get significantly more of these timeout errors than when instances are running at lower CPUs. More specifically, I pulled some numbers from our services this morning. When they were running at around 30% CPU, I saw very few timeout issues - just 42 over 30 minutes. When I removed half the instances and they started to run at around 60-65% CPU, the rate increased 10-fold to 536 over 30 minutes.
I know this thread is months old but I think my own experiences can add some value here. I had the same problem with Azure Redis Cache (timeouts on Gets) but realized that it was almost exclusively happening on Gets where the string value was relatively large (> 250K in length). I implemented gzip on both Gets and Sets (when the string value is large) and now I almost never get a timeout.
Even if this doesn't solve your particular problem, it's probably good practice to compress the values in general to reduce costs and improve performance.
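The compress-on-large-values approach described above could look roughly like the following in Java. It is a sketch, not the answerer's code: the class name `CacheValueCodec`, the one-byte marker format, and the exact threshold constant are assumptions chosen for illustration (the 250K figure comes from the answer).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical codec: values longer than the threshold are gzipped before
// being written to the cache; a leading marker byte records which form was stored.
public class CacheValueCodec {
    static final int THRESHOLD_CHARS = 250_000; // assumed cut-off, taken from the answer
    static final byte RAW = 0;
    static final byte GZIP = 1;

    public static byte[] encode(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        boolean compress = value.length() > THRESHOLD_CHARS;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(compress ? GZIP : RAW);
        try {
            if (compress) {
                try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                    gz.write(utf8);
                }
            } else {
                out.write(utf8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static String decode(byte[] stored) {
        try {
            // Skip the marker byte, then unwrap gzip only if the marker says so.
            InputStream in = new ByteArrayInputStream(stored, 1, stored.length - 1);
            if (stored[0] == GZIP) {
                in = new GZIPInputStream(in);
            }
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The marker byte lets small values skip compression entirely, so the extra CPU cost is only paid for the large payloads that were timing out.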
Regarding the version numbers, it seems that the AssemblyVersion has been locked at 1.0.316 for the last several releases, but the AssemblyFileVersion has been updated to match the NuGet package version. For now, I recommend ignoring AssemblyVersion and just using AssemblyFileVersion to ensure you have the correct binary.
Please contact us at AzureCache@microsoft.com if you are still seeing timeouts using Azure Redis Cache.

Spring Integration Poller too slow

We have a Spring Integration project which uses the following:
<int-file:inbound-channel-adapter
    directory="file:#{'${poller.landingzonepath}'.toLowerCase()}" channel="createMessageChannel"
    filename-regex="${ingestion.filenameRegex}" queue-size="10000"
    id="directoryPoller" scanner="leafScanner">
  <!-- <int:poller fixed-rate="${ingestion.filepoller.interval:10000}" max-messages-per-poll="100" /> -->
  <int:poller fixed-rate="10000" max-messages-per-poll="1000" />
</int-file:inbound-channel-adapter>
We also have a leafScanner which extends the default RecursiveLeafOnlyDirectoryScanner. Our leafScanner doesn't do much; it just checks a directory against a regex property.
The issue we're seeing occurs when there are 250,000 .landed files (the ones we care about), which means about 500k actual files in the directory that we are polling. This is a redesign of an older system, and the redesign was meant to make the application more scalable while being agnostic of the directory names inside the polled parent directory. We wanted to get away from a poller per specific directory, but it seems that unless we're doing something wrong, we'll have to go back to this.
If anyone has any possible solutions, or configuration items we could try please let me know. On my local machine with 66k .landed files, it takes about 16 minutes before the first file is presented to our transformer to do something.
As the JavaDocs indicate, the RecursiveLeafOnlyDirectoryScanner will not scale well with large directories or deep trees.
You could make your leafScanner stateful: instead of subclassing RecursiveLeafOnlyDirectoryScanner, subclass DefaultDirectoryScanner, implement listEligibleFiles, and return as soon as you have 1000 files, after saving where you are. On the next poll, continue from where you left off; when you get to the end, start again at the beginning.
You could maintain state in a field (which would mean you'd start over after a JVM restart) or use some persistence.
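The stateful, batch-at-a-time idea can be sketched without Spring as follows. `ResumableBatchScanner` is a hypothetical class; in a real project the same logic would live in a DefaultDirectoryScanner subclass's listEligibleFiles override. The sketch assumes the traversal order is stable between polls, and skipping still visits the earlier entries (it only avoids materializing them), so a real implementation might save a path rather than a count.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical scanner: each call returns at most batchSize files and remembers
// how far it got, so the next poll resumes from there instead of starting over.
public class ResumableBatchScanner {
    private final int batchSize;
    private long position = 0; // in-memory state; lost on JVM restart

    public ResumableBatchScanner(int batchSize) {
        this.batchSize = batchSize;
    }

    public synchronized List<Path> nextBatch(Path root) {
        List<Path> batch = collect(root, position);
        if (batch.isEmpty() && position > 0) {
            // Reached the end of the tree: start again at the beginning.
            position = 0;
            batch = collect(root, 0);
        }
        position += batch.size();
        return batch;
    }

    // Walks the tree lazily and stops as soon as batchSize files are collected.
    private List<Path> collect(Path root, long skip) {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .skip(skip)
                       .limit(batchSize)
                       .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Persisting `position` (or the last path seen) would cover the JVM-restart case mentioned above.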
Just an update. The reason our implementation was so slow was locking (which tries to prevent duplicates). Locking is automatically disabled by adding a filter.
The max-messages-per-poll is also very important if you want to add a thread pool. Without this you will see no performance improvements.

Consequences of changing USERPostMessageLimit

One of our legacy applications relies heavily on PostThreadMessage() for inter-thread communication, so we increased USERPostMessageLimit in the registry (way) beyond the normal 10,000.
However, documentation on MSDN states that "This limit should be sufficiently large. If your application exceeds the limit, it should be redesigned to avoid consuming so many system resources." [1]
Can anyone enlighten me as to how exactly consuming too many system resources manifests itself? What exactly are system resources? Can I somehow monitor an application's usage of system resources? Any information would be very helpful in deciding whether it is worth the time and effort to redesign this application.
The resources it is referring to are those used by the threads for receiving and handling the messages. You can monitor the thread pool size and other resources using Task Manager (look at View->Select Columns). It may help you identify the specific resource if the consumer is resource-locked: look for a resource count that tops out even while your threads are increasing.
However, if you need to increase USERPostMessageLimit, then the message producer is simply overloading the message consumer; by increasing this limit you are compounding your problem, not fixing it. Reduce USERPostMessageLimit back to the default, and if your message producer cannot post a message, try sleeping before retrying, allowing the consuming thread to clear some messages.
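The sleep-and-retry pattern suggested above can be illustrated with a bounded queue in Java. This is only an analogy to PostThreadMessage(), not Win32 code: `BackoffProducer`, `postWithBackoff`, and the attempt and sleep parameters are hypothetical.

```java
import java.util.concurrent.BlockingQueue;

// Hypothetical producer helper: if the bounded queue is full, back off briefly
// so the consumer can drain it, instead of raising the queue limit.
public class BackoffProducer {
    // Returns true if the message was enqueued within maxAttempts tries.
    public static <T> boolean postWithBackoff(BlockingQueue<T> queue, T message,
                                              int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (queue.offer(message)) {
                return true; // there was room in the queue
            }
            try {
                Thread.sleep(sleepMillis); // give the consumer time to catch up
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // still full: the caller decides whether to drop or escalate
    }
}
```

Keeping the queue bounded makes overload visible at the producer, which is the point the answer makes about not raising the limit.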

Resources