StackExchange.Redis TIMEOUT growing unsent queue - stackexchange.redis

Our test environment last weekend saw a number VMs start logging timeouts where the unsent queue just kept growing:
Timeout performing GET 0:B:ac64ebd0-3640-4b7b-a108-7fd36f294640, inst:
0, mgr: ExecuteSelect, queue: 35199, qu: 35199, qs: 0, qc: 0, wr: 0,
wq: 0, in: 0, ar: 0, IOCP: (Busy=2,Free=398,Min=4,Max=400), WORKER:
(Busy=5,Free=395,Min=4,Max=400)
Timeout performing SETEX 0:B:pfed2b3f5-fbbf-4ed5-9a58-f1bd888f01,
inst: 0, mgr: ExecuteSelect, queue: 35193, qu: 35193, qs: 0, qc: 0,
wr: 0, wq: 0, in: 0, ar: 0, IOCP: (Busy=2,Free=398,Min=4,Max=400),
WORKER: (Busy=6,Free=394,Min=4,Max=400)
I've read quite a few posts on analyzing these but most of the time it doesn't involve the unsent message queue growing. No connectivity errors were logged during this time; an AppPool recycle resolved the issue. Has anyone else seen this issue before?
Some potentially relevant extra info:
Same timeouts seen on 1.0.450 and 1.0.481 versions of StackExchange.Redis nuget package
ASP.Net v4.5 Web API 1.x site was the one affected
Upgraded to Redis 3.0.4 (from 3.0.3) same week errors were encountered (but days prior)
Installed New Relic .NET APM v5.5.52.0 that includes some StackExchange.Redis instrumentation (https://docs.newrelic.com/docs/release-notes/agent-release-notes/net-release-notes/net-agent-55520), again, a couple days prior to the timeouts. We've rolled this back here just in case.

I'm encountering same issue.
To research issue, we logging ConnectionCounters of ConnectionMultiplexer for monitoring on every 10 seconds.
It shows growing pendingUnsentItems only, it means StackExchange.Redis does not send/receive from socket.
completedAsynchronously completedSynchronously pendingUnsentItems responsesAwaitingAsyncCompletion sentItemsAwaitingResponse
1 10 4 0 0
1 10 28 0 0
1 10 36 0 0
1 10 51 0 0
1 10 65 0 0
1 10 72 0 0
1 10 85 0 0
1 10 104 0 0
1 10 126 0 0
1 10 149 0 0
1 10 169 0 0
1 10 190 0 0
1 10 207 0 0
1 10 230 0 0
1 10 277 0 0
1 10 296 0 0
...snip
1 10 19270 0 0
1 10 19281 0 0
1 10 19291 0 0
1 10 19302 0 0
1 10 19313 0 0
I guess socket writer thread was stopped?
My Environment is
StackExchange.Redis 1.0.481
Windows Server 2012 R2
.NET Framework 4.5
ASP.NET MVC 5.2.3
Installed New Relic .NET APM v5.7.17.0

Looks like the issue was seen when using New Relic .NET APM between versions 5.5-5.7 and is fixed in 5.8.

Related

Why are there so many free movable DMA32 blocks on the x86 64bits platform?

Why are there so many free movable DMA32 blocks on the x86 64bits platform?
As its name, I think it is used for DMA. But 730 free blocks(with order 10) means more than 1GB memory. How huge the memory is!
cat /proc/pagetypeinfo says:
sudo cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 0 1 1 0 2 1 1 0 1 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 3
Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Unmovable 1 0 0 0 0 0 1 1 1 1 0
Node 0, zone DMA32, type Movable 3 4 5 4 2 3 4 4 1 2 730
Node 0, zone DMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Unmovable 17 2 2 1 0 0 0 1 2 13 0
Node 0, zone Normal, type Movable 15 4 0 15 4 1 1 0 0 0 934
Node 0, zone Normal, type Reclaimable 0 6 21 9 6 3 3 1 2 0 0
Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Number of blocks type Unmovable Movable Reclaimable HighAtomic Isolate
Node 0, zone DMA 1 7 0 0 0
Node 0, zone DMA32 2 1526 0 0 0
Node 0, zone Normal 160 2314 78 0 0

How to create relational matrix?

I have the following data:
client_id <- c(1,2,3,1,2,3)
product_id <- c(10,10,10,20,20,20)
connected <- c(1,1,0,1,0,0)
clientID_productID <- paste0(client_id,";",product_id)
df <- data.frame(client_id, product_id,connected,clientID_productID)
client_id product_id connected clientID_productID
1 1 10 1 1;10
2 2 10 1 2;10
3 3 10 0 3;10
4 1 20 1 1;20
5 2 20 0 2;20
6 3 20 0 3;20
The goal is to produce a relational matrix:
client_id product_id clientID_productID client_pro_1_10 client_pro_2_10 client_pro_3_10 client_pro_1_20 client_pro_2_20 client_pro_3_20
1 1 10 1;10 0 1 0 0 0 0
2 2 10 2;10 1 0 0 0 0 0
3 3 10 3;10 0 0 0 0 0 0
4 1 20 1;20 0 0 0 0 0 0
5 2 20 2;20 0 0 0 0 0 0
6 3 20 3;20 0 0 0 0 0 0
In other words, when product_id equals 10, clients 1 and 2 are connected. Importantly, I do not want client 1 to be connected with herself. When product_id=20, I have only one client, meaning that there is no connection, so I should have only zeros.
To be more specific, all that I am trying to create is a square matrix of relations, with all the combinations of client/product in the columns. A client can only be connected with another if they bought the same product.
I have searched a bunch and played with other code. The difference between this problem and others already answered is that I want to keep on my table client number 3, even though she never bought any product. I want to show that she does not have a relationship with any other client. Right now, I am able to create the matrix by stacking the relationships by product (How to create relational matrix in R?), but I am struggling with a way to not stack them.
I apologize if the question is not specific enough, or too specific. Thank you anyway, stackoverflow is a lifesaver for beginners.
I believe I figured it out.
It is for sure not the most elegant answer, though.
client_id <- c(1,2,3,1,2,3)
product_id <- c(10,10,10,20,20,20)
connected <- c(1,1,0,1,0,0)
clientID_productID <- paste0(client_id,";",product_id)
df <- data.frame(client_id, product_id,connected,clientID_productID)
df2 <- inner_join(df[c(1:3)], df[c(1:3)], by = c("product_id", "connected"))
df2$Source <- paste0(df2$client_id.x,"|",df2$product_id)
df2$Target <- paste0(df2$client_id.y,"|",df2$product_id)
df2 <- df2[order(df2$product_id),]
indices = unique(as.character(df2$Source))
mtx <- as.matrix(dcast(df2, Source ~ Target, value.var="connected", fill=0))
rownames(mtx) = mtx[,"Source"]
mtx <- mtx[,-1]
diag(mtx)=0
mtx = as.data.frame(mtx)
mtx = mtx[indices, indices]
I got the result I wanted:
1|10 2|10 3|10 1|20 2|20 3|20
1|10 0 1 0 0 0 0
2|10 1 0 0 0 0 0
3|10 0 0 0 0 0 0
1|20 0 0 0 0 0 0
2|20 0 0 0 0 0 0
3|20 0 0 0 0 0 0

EC2 high stolen time without load

I can see very high % of stolen time on a EC2 web server (t2.micro) without any load (one current user) with a high page load time. Is there a correlation between hight load time and hight stolen time? I have the same symptoms with another server from class t2.medium
Do you have an explanation?
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 79824 7428 479172 0 0 0 0 52 49 18 0 0 0 82
1 0 0 79792 7436 479172 0 0 0 6 54 49 18 0 0 0 82
1 0 0 79824 7444 479172 0 0 0 5 54 51 18 0 0 0 82

How to track total number of rejected and dropped events

What is the correct way to track number of dropped or rejected events in the managed elasticsearch cluster?
GET /_nodes/stats/thread_pool which gives you something like:
"thread_pool": {
"bulk": {
"threads": 4,
"queue": 0,
"active": 0,
"rejected": 0,
"largest": 4,
"completed": 42
}
....
"flush": {
"threads": 0,
"queue": 0,
"active": 0,
"rejected": 0,
"largest": 0,
"completed": 0
}
...
Another way to get more concise and better formatted info (especially if you are dealing with several nodes) about thread pools is to use the _cat threadpool API
$ curl -XGET 'localhost:9200/_cat/thread_pool?v'
host ip bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
10.10.1.1 10.10.1.1 1 10 0 2 0 0 10 0 0
10.10.1.2 10.10.1.2 2 0 1 4 0 0 4 10 2
10.10.1.3 10.10.1.3 1 0 0 1 0 0 5 0 0
UPDATE
You can also decide which thread pools to show and for each thread pool which fields to include in the output. For instance below, we're showing the following fields from the search threadpool:
sqs: The maximum number of search requests that can be queued before being rejected
sq: The number of search requests in the search queue
sa: The number of currently active search threads
sr: The number of rejected search threads (since the last restart)
sc: The number of completed search threads (since the last restart)
Here is the command:
curl -s -XGET 'localhost:9200/_cat/thread_pool?v&h=ip,sqs,sq,sa,sr,sc'
ip sqs sq sa sr sc
10.10.1.1 100 0 1 0 62636120
10.10.1.2 100 0 2 0 15528863
10.10.1.3 100 0 4 0 64647299
10.10.1.4 100 0 5 372 103014657
10.10.1.5 100 0 2 0 13947055

Rebalancing a Cassandra Ring with Vnodes

We had a system with a 3-node Cassandra 2.0.6 ring. Over time, the application load on that system increased until a limit where the ring could not handle it anymore, causing the typical node overload failures.
We doubled the size of the ring, and recently even added one more node, to try to handle the load, but there're still only 3 nodes taking all the load; but not the original 3 nodes of the initial ring.
We did the bootstrap + cleanup process described in the adding nodes guide. We also tried repairs on each node after not seeing much improvements in the ring load. Our load is 99.99% writes on this system.
Here's a chart of the cluster load illustrating the issue:
The highest load tables have a high cardinality on the partition key that I'd expect distributes well over vnodes.
Edit: nodetool info
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN x.y.z.92 56.83 GB 256 13.8% x-y-z-b53e8ab55e0a rack1
UN x.y.z.253 136.87 GB 256 15.2% x-y-z-bd3cf08449c8 rack1
UN x.y.z.70 69.84 GB 256 14.2% x-y-z-39e63dd017cd rack1
UN x.y.z.251 74.03 GB 256 14.4% x-y-z-36a6c8e4a8e8 rack1
UN x.y.z.240 51.77 GB 256 13.0% x-y-z-ea239f65794d rack1
UN x.y.z.189 128.49 GB 256 14.3% x-y-z-7c36c93e0022 rack1
UN x.y.z.99 53.65 GB 256 15.2% x-y-z-746477dc5db9 rack1
Edit: tpstats (node highly loaded)
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 11591287 0 0
RequestResponseStage 0 0 283211224 0 0
MutationStage 32 405875 349531549 0 0
ReadRepairStage 0 0 3591 0 0
ReplicateOnWriteStage 0 0 0 0 0
GossipStage 0 0 3246983 0 0
AntiEntropyStage 0 0 72055 0 0
MigrationStage 0 0 133 0 0
MemoryMeter 0 0 205 0 0
MemtablePostFlusher 0 0 94915 0 0
FlushWriter 0 0 12521 0 0
MiscStage 0 0 34680 0 0
PendingRangeCalculator 0 0 14 0 0
commitlog_archiver 0 0 0 0 0
AntiEntropySessions 1 1 1 0 0
InternalResponseStage 0 0 30 0 0
HintedHandoff 0 0 1957 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 196
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 31663792
_TRACE 24409
REQUEST_RESPONSE 4
COUNTER_MUTATION 0
How could I further troubleshoot this issue?
You need to run nodetool cleanup on the previous nodes that were part of the ring. Nodetool cleanup will remove the partition keys that the node currently does not own.
Seems like after the addition of the nodes, the keys have not been deleted hence causing the load to be higher on the previous nodes.
Try running
nodetool cleanup
on the previous nodes

Resources