We are using Kafka Streams as a stream processing engine in our application. One of our topologies aggregates the data from 30 millions devices. When the application runs on two machines, everything works fine. However, when running on one machine, the lag grows.
Inspecting the threads showed the following:
All threads are stuck on method size() of NamedCache, which in turn calls the method size() of ConcurrentSkipListSet. As we learn from the documentation of ConcurrentSkipListSet, size() is a very slow method since, it requires the traversal of the collection. The problem was solved by disabling the KafkaStreams cache by setting cache.max.bytes.buffering to 0 and giving more memory to RocksDB cache.
Have you encountred such an issue too ?
How can i use heap memory with Kafka streams and rocks DB ?
Not sure what version you are using, but one performance issue with NamedCache.size() was fixed in 2.3.1 and 2.4.0: https://issues.apache.org/jira/browse/KAFKA-8736
If you are an older version and it works with 2 instances, why not just run on two instance?
Related
In Spring cloud stream, what exactly is the usage of that property spring.cloud.stream.instanceCount?
I mean if that value become wrong because at a moment one or more micro services instances are down, how could this affect the behavior of our infrastructure?
instanceCount is used to partition data across different consumers. Having one or more services down should not really impact your producers, that's the job of the broker.
So let's say you have a source that sends data to 3 partitions, so you'd have instanceCount=3 and each instance would have it's own partition assigned via instanceIndex.
Each instance would be consuming data, but if instance 2 crashes, 0,1 would still be reading data from the partitions, and source would still be sending data as usual.
Assuming your platform has some sort of recoverability in place, your crashed instance should come back to life and resume it's operations.
What we still don't support is dynamic allocation of partitions on runtime, we are investigating this as a story for a future release.
I have a akka-http (Scala) API server that serves data to a NodeJS server.
In moments after startup, everything works fine, everything is fast. Latency is low. But suddenly, latency increases fastly. The API no longer responds, and the website becomes unusable.
The strange thing is that the traffic and the requests count remain stable. Latency spikes seem decorrelated from them.
I guess this saturation is achieved with the blocking of all the threads in akka thread pool. Unfortunately, my Akka dispatcher is blocking, because I'm doing a lot of SQL queries (in MySQL) and because I'm not using a reactive library. I'm using Slick 2, which is, contrary to Slick 3, blocking-only.
Here's my dispatcher :
http-blocking-dispatcher {
type = Dispatcher
executor = "thread-pool-executor"
thread-pool-executor {
fixed-pool-size = 46
}
throughput = 1
}
So, my question is, how to avoid this sort of bottling ? How to keep a latency proportional with the traffic ? Is there a way to evict the requests that cause the saturation in order to prevent them to compromise everything ?
Thank you !
You should not use Akka's own thread pool for running long blocking tasks. Create your own thread pool, and run your slick queries using it, leaving free threads for akka. That's for your first 2 questions.
I don't know of any good answer for the last one. You could maybe look into specific slick settings to set a timeout on sql queries, but I don't know if such things exist. Otherwise try to analyse why your queries takes so much time, could you be missing an index or two?
We've got several web and worker roles in Azure connecting to our Azure Redis cache via the StackExchange.Redis library, and we're receiving regular timeouts that are making our end-to-end solution grind to a halt. An example of one of them is below:
System.TimeoutException: Timeout performing GET stream:459, inst: 4,
mgr: Inactive, queue: 12, qu=0, qs=12, qc=0, wr=0/0, in=65536/0 at
StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message
message, ResultProcessor1 processor, ServerEndPoint server) in
c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line
1785 at StackExchange.Redis.RedisBase.ExecuteSync[T](Message
message, ResultProcessor1 processor, ServerEndPoint server) in
c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line
79 at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key,
CommandFlags flags) in
c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line
1346 at
OptiRTC.Cache.RedisCacheActions.<>c__DisplayClass41.<Get>b__3() in
c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 104 at
Polly.Retry.RetryPolicy.Implementation(Action action, IEnumerable1
shouldRetryPredicates, Func`1 policyStateFactory) at
OptiRTC.Cache.RedisCacheActions.Get[T](String key, Boolean
allowDirtyRead) in
c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 107 at
OptiRTC.Cache.RedisCacheAccess.d__e4.MoveNext()
in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheAccess.cs:line 1196;
TraceSource 'WaWorkerHost.exe' event
All the timeouts have different queue and qs numbers, but the rest of the messages are consistent. These StringGet calls are across different keys in the cache. In each of our services, we use a singleton cache access class with a single ConnectionMultiplexer that is registered with our IoC container in the web or worker role startup:
container.RegisterInstance<ICacheAccess>(cacheAccess);
In our implementation of ICacheAccess, we're creating the multiplexer as follows:
ConfigurationOptions options = new ConfigurationOptions();
options.EndPoints.Add(serverAddress);
options.Ssl = true;
options.Password = accessKey;
options.ConnectTimeout = 1000;
options.SyncTimeout = 2500;
redis = ConnectionMultiplexer.Connect(options);
where the redis object is used throughout the instance. We've got about 20 web and worker role instances connecting to the cache via this ICacheAccess implementation, but the management console shows an average of 200 concurrent connections to the cache.
I've seen other posts that reference using version 1.0.333 of StackExchange.Redis, which we're doing via NuGet, but when I look at the actual version of the StackExchange.Redis.dll reference added, it shows 1.0.316.0. We've tried adding and removing the NuGet reference as well as adding it to a new project, and we always get the version discrepancy.
Any insight would be appreciated. Thanks.
Additional information:
We've upgraded to 1.0.371. We have two services that each access the same cache object at different intervals, one to edit and occasionally read and one that reads this object several times a second. Both services are deployed with the same caching code and StackExchange.Redis library version. I almost never see time outs in the service that edits the object but I get timeouts between 50 and 75% of the time on the services that reads it. The timeouts have the same format as the one indicated above, and they continue to occur after wrapping the db.StringGet call in a Polly retry block that handles both RedisException and System.TimeoutException and retries once after 500ms.
We contacted Microsoft about this issue, and they confirm that they see nothing in the Redis logs that indicate an issue on the Redis service side. Our cache miss % is extremely low on the Redis server, but we continue to get these timeouts, which substantially hinder our application's functionality.
In response to the comments, yes, we always have a number in qs and never in qc. We always have a number in the first part of the in and never in the second.
Even more additional information:
When I run a service with fewer instances at a higher CPU, I get significantly more of these timeout errors than when instances are running at lower CPUs. More specifically, I pulled some numbers from our services this morning. When they were running at around 30% CPU, I saw very few timeout issues - just 42 over 30 minutes. When I removed half the instances and they started to run at around 60-65% CPU, the rate increased 10-fold to 536 over 30 minutes.
I know this thread is months old but I think my own experiences can add some value here. I had the same problem with Azure Redis Cache (timeouts on Gets) but realized that it was almost exclusively happening on Gets where the string value was relatively large (> 250K in length). I implemented gzip on both Gets and Sets (when the string value is large) and now I almost never get a timeout.
Even if this doesn't solve your particular problem, it's probably good practice to compress the values in general to reduce costs and improve performance.
Regarding the version numbers, it seems that the AssemblyVersion has been locked at 1.0.316 for the last several releases, but the AssemblyFileVersion has been updated to match the NuGet package version. For now, I recommend ignoring AssemblyVersion and just using AssemblyFileVersion to ensure you have the correct binary.
Please contact us at AzureCache#microsoft.com if you are still seeing timeouts using Azure Redis Cache.
I'm using Infinispan 6.0.0 in a 3-node setup (distributed caching with 2 replicas for each entry, no writes into persistent store) and I'm just reading the file line-by-line and storing that lines' contents into the cache. The speed seems a bit low to me (I can achieve more writes onto the SSD (persistent storage) than into RAM with Infinispan), but there isn't any obvious bottleneck in the test code (I'm using buffered input streams, and their limits certainly aren't reached. As for now, I'm able to write 100K entries each ~45 seconds and that doesn't satisfy me. Assume simplified code snippet:
while ((s = reader.readLine()) != null) {
cache.put(s.substring(0,2), s.substring(2,5));
}
And CacheManager is created as follows:
return new DefaultCacheManager(
GlobalConfigurationBuilder.defaultClusteredBuilder()
.transport().addProperty("configurationFile", "jgroups.xml").build(),
new ConfigurationBuilder()
.clustering().cacheMode(CacheMode.DIST_ASYNC).hash().numOwners(2)
.transaction().transactionMode(TransactionMode.TRANSACTIONAL).lockingMode(LockingMode.OPTIMISTIC)
.build());
What could I be possibly doing wrong?
I am not fully aware of all the asynchronous mode specialities, but I'd afraid that something in the two-phase commit (Prepare and Commit) might force some blocking RPC => waiting for network latency => slow down.
Do you need transactional behaviour? If not, switch them off. If you really need it, you may disable just the autocommit feature and load the cluster via non-transactional operations. Or, you may try one phase commits.
Another option could be mass loading via putAll (with tens or hundreds of entries, depends on your entry size), but routing of this message is not really smart. In transactional mode it could behave a bit better, I guess.
The last option if you just want to load the cluster fast and then operate on it could be transferring the bulk data to each node without Infinispan (using your own JGroups channel, or just with sockets), and loading all nodes with the CACHE_MODE_LOCAL flag.
By default Infinispan follows the Map.put() contract of returning the previous value, so even though you are using the DIST_ASYNC cache mode you're still implicitly performing a synchronous cache.get() for every put.
You can avoid this in two ways:
configurationBuilder.unsafe().unreliableReturnValues(true) will suppress the remote lookup for all the operations on the cache.
cache.getAdvancedCache().withFlags(Flag.IGNORE_RETURN_VALUES).put(k, v) will suppress the remote lookup for a single operation.
I made a elasticsearch cluster with big data, and the client can send searching request to it.
Sometimes, the cluster costs much time to deal with one request.
My question is, is there any API to kill the specified thread which cost too much time?
I wanted to follow up on this answer now that elasticsearch 1.0.0 has been released. I am happy to announce that there is new functionality that has been introduced that implements some protection for the heap, called the circuit breaker.
With the current implementation, the circuit breaker tries to anticipate how much data is going to be loaded into the field data cache, and if it's greater than the limit (80% by default) it will trip the circuit breaker and there by kill your query.
There are two parameters for you to set if you want to modify them:
indices.fielddata.breaker.limit
indices.fielddata.breaker.overhead
The overhead is the constant that is used to estimate how much data will be loaded into the field cache; this is 1.03 by default.
This is an exciting development to elasticsearch and a feature I have been waiting to be implemented for months.
This is the pull request if interested in seeing how it was made; thanks to dakrone for getting this done!
https://github.com/elasticsearch/elasticsearch/pull/4261
Hope this helps,
MatthewJ
Currently it is not possible to kill or stop the long running queries, But Elasticsearch is going to add a task management api to do this. The API is likely to be added in Elasticsearch 5.0, maybe in 2016 or later.
see Task management 1 and Task management 2.