StackExchange.Redis timeout and "No connection is available to service this operation" - stackexchange.redis

I have the following issues in our production environment (Web-Farm - 4 nodes, on top of it Load balancer):
1) Timeout performing HGET key, inst: 3, queue: 29, qu=0, qs=29, qc=0, wr=0/0
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor``1 processor, ServerEndPoint server) in ConnectionMultiplexer.cs:line 1699 This happens 3-10 times in a minute
2) No connection is available to service this operation: HGET key at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor``1 processor, ServerEndPoint server) in ConnectionMultiplexer.cs:line 1666
I tried to implement as Marc suggested (Maybe I interpreted it incorrectly) - better to have fewer connections to Redis than multiple.
I made the following implementation:
public class SeRedisConnection
{
private static ConnectionMultiplexer _redis;
private static readonly object SyncLock = new object();
public static IDatabase GetDatabase()
{
if (_redis == null || !_redis.IsConnected || !_redis.GetDatabase().IsConnected(default(RedisKey)))
{
lock (SyncLock)
{
try
{
var configurationOptions = new ConfigurationOptions
{
AbortOnConnectFail = false
};
configurationOptions.EndPoints.Add(new DnsEndPoint(ConfigurationHelper.CacheServerHost,
ConfigurationHelper.CacheServerHostPort));
_redis = ConnectionMultiplexer.Connect(configurationOptions);
}
catch (Exception ex)
{
IoC.Container.Resolve<IErrorLog>().Error(ex);
return null;
}
}
}
return _redis.GetDatabase();
}
public static void Dispose()
{
_redis.Dispose();
}
}
Actually dispose is not being used right now. Also I have some specifics of the implementation which could cause such behavior (I'm only using hashes):
1. Add, Remove hashes - async
2. Get -sync
Could somebody help me how to avoid this behavior?
Thanks a lot in advance!
SOLVED - Increasing Client connection timeout after evaluating network capabilities.
UPDATE 2: Actually it didn't solve the problem. When cache volume starting to get increased e.g. from 2GB.
Then I saw the same pattern actually these timeouts were happend about every 5 minutes.
And our sites were frozen for some period of time every 5 minutes until fork operation was finished.
Then I found out that there is an option to make a fork (save to disk) every x seconds:
save 900 1
save 300 10
save 60 10000
In my case it was "save 300 10" - save in every 5 minutes if at least 10 updates were happened. Also I found out that "fork" could be very expensive. Commented "save" section resolved the problem at all. We can commented "save" section as we are using only Redis as "cache in memory" - we don't need any persistance.
Here is configuration of our cache servers "Redis 2.4.6" windows port: https://github.com/rgl/redis/downloads
Maybe it has been solved in recent versions of Redis windows port in MSOpentech: http://msopentech.com/blog/2013/04/22/redis-on-windows-stable-and-reliable/
but I haven't tested yet.
Anyway StackExchange.Redis has nothing to do with this issue and it works pretty stable in our production environment, thanks to Marc Gravell.
FINAL UPDATE:
Redis is single-threaded solution - it is ultimately fast but when it comes to the point of releasing the memory (Removing items that are stale or expired) the problems are emerged due to one thread should reclaim the memory (that is not fast operation - whatever algorithm is used) and the same thread should handle GET, SET operations. Of course it happens when we are talking about medium-loaded production environment. Even if you use a cluster with slaves when the memory barrier is reached it will have the same behavior.

It looks like in most cases this exception is a client issue. Previous versions of StackExchange.Redis used Win32 socket directly which sometimes has a negative impact. Probably Asp.net internal routing somehow related to it.
The good news is that StackExchange.Redis's network infra was completely rewritten recently. The last version is 2.0.513. Try it and there is a good chance that your problem will go.

Related

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an apache beam pipeline to GCP dataflow in a DEV environment and everything worked well. Then I deployed it to production in Europe environment (to be specific - job region:europe-west1, worker location:europe-west1-d) where we get high data velocity and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds to release events sooner than the end of session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey. In this step there are thousands of events under the droppedDueToLateness counter and the dataFreshness keeps increasing (increasing since when I deployed it). All steps before this one operates good and all steps after are affected by it, but doesn't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K keys to 200K keys per second (depends on time of day), which seems quite a lot to me. The cpu utilization doesn't go over the 70% and I am using streaming engine. Number of workers most of the time is 2. Max worker memory capacity is 32GB while the max worker memory usage currently stands on 23GB. I am using e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicious is the huge amount of keys being processed in the EventToSession/GroupPairsByKey step. But on the other, session is usually related to a single customer so google should expect handle this amount of keys to handle per second, no?
Would like to get suggestions how to solve the dataFreshness and events droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event))
.withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
.apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
.apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
.discardingFiredPanes()
.withAllowedLateness(Duration.standardDays(30)))
.apply("GroupPairsByKey", GroupByKey.create())
.apply("CreateCollectionOfValuesOnly", Values.create())
.apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
regarding constantly increasing data freshness: as long as allowing late data to arrive a session window, that specific window will persist in memory. This means that allowing 30 days late data will keep every session for at least 30 days in memory, which obviously can over load the system. Moreover, I found we had some ever-lasting sessions by bots visiting and taking actions in websites we are monitoring. These bots can hold sessions forever which also can over load the system. The solution was decreasing allowed lateness to 2 days and use bounded sessions (look for "bounded sessions").
regarding events dropped due to lateness: these are events that on time of arrival they belong to an expired window, such window that the watermark has passed it's end (See documentation for the droppedDueToLateness here). These events are being dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data so the solution was to check each event's timestamp before it is going to the sessions part and stream to the session part only events that won't be dropped - events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest will be written to BigQuery without the session data (Apparently apache beam drops an event if the event's timestamp is before event_arrival_time - (gap_duration + allowed_lateness) even if there is a live session this event belongs to...)
p.s - in the bounded sessions part where he demonstrates how to implement a time bounded session I believe he has a bug allowing a session to grow beyond the provided max size. Once a session exceeded the max size, one can send late data that intersects this session and is prior to the session, to make the start time of the session earlier and by that expanding the session. Furthermore, once a session exceeded max size it can't be added events that belong to it but don't extend it.
In order to fix that I switched the order of the current window span and if-statement and edited the if-statement (the one checking for session max size) in the mergeWindows function in the window spanning part, so a session can't pass the max size and can only be added data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
List<IntervalWindow> sortedWindows = new ArrayList<>();
for (IntervalWindow window : c.windows()) {
sortedWindows.add(window);
}
Collections.sort(sortedWindows);
List<MergeCandidate> merges = new ArrayList<>();
MergeCandidate current = new MergeCandidate();
for (IntervalWindow window : sortedWindows) {
MergeCandidate next = new MergeCandidate(window);
if (current.intersects(window)) {
if ((current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis())) {
current.add(window);
continue;
}
}
merges.add(current);
current = next;
}
merges.add(current);
for (MergeCandidate merge : merges) {
merge.apply(c);
}
}

How to fix the slow checkpoint with small state size issue?

I have a flink app (flink version is 1.9.2) which enabled checkpoint function. When I run it in the apache flink platform. I always get the checkpoint failed message: Checkpoint expired before completing.After check the threadDumps of the taskManager during a checkpoint, I found that a thread which contains two operators that request external service was always in runnable state. Below are my design of this operator and the checkpoint configuration. Please help advise how to resolve the issue ?
operator design:
public class OperatorA extends RichMapFunction<POJOA, POJOA> {
private Connection connection;
private String getCusipSourceIdPairsQuery;
private String getCusipListQuery;
private MapState<String, List<POJOX>> modifiedCusipState;
private MapState<String, List<POJOX>> bwicMatchedModifiedCusipState;
#Override
public POJOA map(POJOA value) throw Exception {
// create local variable PreparedStatement every time invoke this map method
// update/clear those two MapStates
}
#Override
public void open(Configuration parameters) {
// initialize jdbc connection and TTL MapStates using GlobalJobParameters
}
#Override
public void close() {
// close jdbc connection
}
}
public class OperatorB extends RichMapFunction<POJOA, POJOA> {
private MyServiceA serviceA;
private MyServiceB serviceB;
#Override
public POJOA map(POJOA value) throw Exception {
// call a restful GET API of ServiceB, get a XML response, about 500 fields in the response.
// use serviceA's function to extract the XML document and then populate the value fields.
}
#Override
public void open(Configuration parameters) {
// initialize local jdbc connection and PreparedStatement using globalJobParameters. then use the executed results to initialize serviceA.
// initialize serviceB.
}
}
checkpoint configuration:
Checkpointing Mode Exactly Once
Interval 15m 0s
Timeout 10m 0s
Minimum Pause Between Checkpoints 5m 0s
Maximum Concurrent Checkpoints 1
Persist Checkpoints Externally Disabled
Sample checkpoint history:
ID Status Acknowledged Trigger Time Latest Acknowledgement End to End Duration State Size Buffered During Alignment
20 In Progress 3/12 (25%) 15:03:13 15:04:14 1m 1s 5.65 KB 0 B
19 Failed 3/12 14:48:13 14:50:12 10m 0s 5.65 KB 0 B
18 Failed 3/12 14:33:13 14:34:50 10m 0s 5.65 KB 0 B
17 Failed 4/12 14:18:13 14:27:04 9m 59s 2.91 MB 64.0 KB
16 Failed 3/12 14:03:13 14:05:18 10m 0s 5.65 KB 0 B
Doing any sort of blocking i/o in a Flink user function (e.g., a RichMap or ProcessFunction) is asking for trouble with checkpointing. The reason is that it is very easy to end up with significant backpressure, which then prevents the checkpoint barriers from making sufficiently rapid progress through the execution graph, leading to checkpoint timeouts.
The preferred way to improve on this would be to use async i/o rather than a RichMap. This will allow for there to be more outstanding requests at any given moment (assuming the external service is capable of handling the higher load), and won't leave the operator blocked in user code waiting for responses to synchronous requests -- thereby allowing checkpoints to progress unimpeded.
An alternative would be to increase the parallelism of your cluster, which should reduce the backpressure, but at the expense of tying up more computing resources that won't really doing much other than waiting.
In the worst case, where the external service simply isn't capable of keeping up with your throughput requirements, then backpressure is unavoidable. This then is going to be more difficult to manage, but unaligned checkpoints, coming in Flink 1.11, should help.
Here are some tips that I usually use during locating expired checkpoint problem:
Check Checkpoint UI to know the distribution of subtasks which cause the expiration.
If most subtasks already finish the checkpoint, skip to Tip 3, otherwise skip to Tip 4.
The most possible reason is data skew and the problem subtask receives much more records than other subtasks. If it's not a data skew problem, take a look at the host that the subtask is running on, and check whether there're issue about CPU/MEM/DISK which may slow down the consuming of the subtask.
This situation is relatively rare, and it's usually caused by used codes. For example, user tries to access database in operators but the connection is not stable which slows down the processing.
I recently encountered a similar problem. Suggestions provided by #David Anderson are really good! Nevertheless, I have a few things to add.
You can try to tune your checkpoints according to Apache Flink documentation.
In my case, checkpoint interval was lower than min pause between checkpoints, so I increased it to make it bigger. In my case, I multiplied checkpoint interval by 2 and set this value as min pause between checkpoints.
You can also try to increase checkpoint timeout.
Another issue may be ValueState. My pipleline was keeping state for a long period of time and it wasn't evicted what was causing thoroughput problems. I set TTL for the ValueState (in my case for 30 minutes) and it started to work better. TTL is well described in the Apache Flink documentation. It's really simple and looks like that:
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.build();
ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("text state", String.class);
stateDescriptor.enableTimeToLive(ttlConfig);
It's also worth noticing that this SO thread is related to the similar topic: Flink Checkpoint Failure - Checkpoints time out after 10 mins and tips provided there may be useful.
Regards,
Piotr

How to do multi-threaded writing in ehcache?

Here's my code of using ehcache when I do multi-threaded reading and writing:
write code:
try {
targetCache.acquireWriteLockOnKey(key);
targetCache.putIfAbsent(new Element(key, value));
}
finally {
targetCache.releaseWriteLockOnKey(key);
}
reading code:
try{
cache.acquireReadLockOnKey(key);
cacheCarId = (String)ele.getObjectValue();
}
finally {
cache.releaseReadLockOnKey(key);
}
key and value are both String.
My config is as follows:
CacheConfiguration config = new CacheConfiguration();
config.name("carCache");
config.maxBytesLocalHeap(128, MemoryUnit.parseUnit("M"));
config.eternal(false);
config.timeToLiveSeconds(60);
config.setTimeToIdleSeconds(60);
SizeOfPolicyConfiguration sizeOfPolicyConfiguration = new SizeOfPolicyConfiguration();
sizeOfPolicyConfiguration.maxDepth(10000);
sizeOfPolicyConfiguration.maxDepthExceededBehavior("abort");
config.addSizeOfPolicy(sizeOfPolicyConfiguration);
Cache memoryOnlyCache = new Cache(config);
CacheManager.getInstance().addCache(memoryOnlyCache);
Values are evict within 60s and will be written by multi-thread. The total number of key is less than 25,000.
The reading and writing was ok at the beginning, but after a couple of hours, i get inconsistence of reading and writing...
Could Anybody help me with this problem? Thanks a lot
A Cache is already a thread safe data structure, so you should not need to use explicit locking as you do.
Also the method Cache.putIfAbsent is already an atomic operation that guarantees that only one thread will succeed with the put.
Note that eviction and expiry are two different things. With your configuration, eviction happens when the cache size grows beyond 128MB and expiry indeed happens after 60 seconds. However Ehcache does expiry in-line, so it is triggered when you read or write the mapping.
As for your remark on inconsistence, you will need to describe in more detail what you mean by that.

MongoDB-Java performance with rebuilt Sync driver vs Async

I have been testing MongoDB 2.6.7 for the last couple of months using YCSB 0.1.4. I have captured good data comparing SSD to HDD and am producing engineering reports.
After my testing was completed, I wanted to explore the allanbank async driver. When I got it up and running (I am not a developer, so it was a challenge for me), I first wanted to try the rebuilt sync driver. I found performance improvements of 30-100%, depending on the workload, and was very happy with it.
Next, I tried the async driver. I was not able to see much difference between it and my results with the native driver.
The command I'm running is:
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://192.168.0.13:27017/ycsb -p mongodb.writeConcern=strict -threads 96
Over the course of my testing (mostly with the native driver), I have experimented with more and less threads than 96; turned on "noatime"; tried both xfs and ext4; disabled hyperthreading; disabled half my 12 cores; put the journal on a different drive; changed sync from 60 seconds to 1 second; and checked the network bandwidth between the client and server to ensure its not oversubscribed (10GbE).
Any feedback or suggestions welcome.
The Async move exceeded my expectations. My experience is with the Python Sync (pymongo) and Async driver (motor) and the Async driver achieved greater than 10x the throughput. further, motor is still using pymongo under the hoods but adds the async ability. that could easily be the case with your allanbank driver.
Often the dramatic changes come from threading policies and OS configurations.
Async needn't and shouldn't use any more threads than cores on the VM or machine. For example, if you're server code is spawning a new thread per incoming conn -- then all bets are off. start by looking at the way the driver is being utilized. A 4 core machine uses <= 4 incoming threads.
On the OS level, you may have to fine-tune parameters like net.core.somaxconn, net.core.netdev_max_backlog, sys.fs.file_max, /etc/security/limits.conf nofile and the best place to start is looking at nginx related performance guides including this one. nginx is the server that spearheaded or at least caught the attention of many linux sysadmin enthusiasts. Contrary to popular lore one should reduce your keepalive timeout opposed to lengthen it. The default keep-alive timeout is some absurd (4 hours) number of seconds. you might want to cut the cord in 1 minute. basically, think a short sweet relationship with your clients connections.
Bear in mind that Mongo is not Async so you can use a Mongo driver pool. nevertheless, don't let the driver get stalled on slow queries. cut it off in 5 to 10 seconds using the following equivalents in Java. I'm just cutting and pasting here with no recommendations.
# Specifies a time limit for a query operation. If the specified time is exceeded, the operation will be aborted and ExecutionTimeout is raised. If max_time_ms is None no limit is applied.
# Raises TypeError if max_time_ms is not an integer or None. Raises InvalidOperation if this Cursor has already been used.
CONN_MAX_TIME_MS = None
# socketTimeoutMS: (integer) How long (in milliseconds) a send or receive on a socket can take before timing out. Defaults to None (no timeout).
CLIENT_SOCKET_TIMEOUT_MS=None
# connectTimeoutMS: (integer) How long (in milliseconds) a connection can take to be opened before timing out. Defaults to 20000.
CLIENT_CONNECT_TIMEOUT_MS=20000
# waitQueueTimeoutMS: (integer) How long (in milliseconds) a thread will wait for a socket from the pool if the pool has no free sockets. Defaults to None (no timeout).
CLIENT_WAIT_QUEUE_TIMEOUT_MS=None
# waitQueueMultiple: (integer) Multiplied by max_pool_size to give the number of threads allowed to wait for a socket at one time. Defaults to None (no waiters).
CLIENT_WAIT_QUEUE_MULTIPLY=None
Hopefully you will have the same success. I was ready to bail on Python prior to async

Delay / Lag between Commit and select with Distributed Transactions when two connections are enlisted to the transaction in Oracle with ODAC

We have our application calling to two Oracle databases using two connections (which are kept open through out the application). For certain functionality, we use distributed transactions. We have Enlist=false in the connection string and manually enlist the connection to the transaction.
The problem comes with a scenario where, we update the same record very frequently within a distributed transaction, on which we see a delay to see the commited data in the previous run.
ex.
using (OracleConnection connection1 = new OracleConnection())
{
using(OracleConnection connection2 = new OracleConnection())
{
connection1.ConnectionString = connection1String;
connection1.Open();
connection2.ConnectionString = connection2String;
connection2.Open();
//for 100 times, do an update
{
.. check the previously updated value
connection1.EnlistTransaction(currentTransaction);
connection2.EnlistTransaction(currentTransaction);
.. do an update using connection1
.. do some updates with connection2
}
}
}
as in the above code fragment, we do update and check the previously updated value in the next iteration. The issues comes up when we run this for a single record frequently, on which we don't see the committed update in the last iteration in the next iteration even though it was committed in the previous iteration. But when this happens this update is visible in other applications in a very very small delay, and even within our code it's visible if we were to debug and run the line again.
It's almost like delay in the commit even though previous commit returned from the code.
Any one has any ideas ?
It turned out that I there's no way to control this behavior through ODAC. So the only viable solution was to implement a retry behavior in our code, since this occurs very rarely and when it happens, delay 10 seconds and retry the same.
Additional details on things I that I found on this can be found here.

Resources