rsyslog omfwd-tcp stops sending after some time - rsyslog

I have an rsyslog instance that should just forward messages. It has the regular UDP port 514 open and receives messages. Forwarding via omfwd over TCP works for a while and then stops.
if $syslogfacility != 1 then {
    action(Name="syslog-fwd" Type="omfwd" Target="127.0.0.1" Port="10514"
           template="JSONDefaultstr" Action.ResumeInterval="5" Protocol="tcp")
    stop
}
In the log I can see the following:
2093.110977082:syslog-fwd queue:Reg/w0: wti 0x55e240948920: wti.c: worker awoke from idle processing
2093.110980024:syslog-fwd queue:Reg/w0: queue.c: DeleteProcessedBatch: we deleted 0 objects and enqueued 0 objects
2093.110982399:syslog-fwd queue:Reg/w0: queue.c: doDeleteBatch: delete batch from store, new sizes: log 1, phys 1
2093.110984879:syslog-fwd queue:Reg/w0: syslog-fwd queue: queue.c: dequeued 1 consumable elements, szlog 0 sz phys 1
2093.110991750:syslog-fwd queue:Reg/w0: ../action.c: action 'syslog-fwd': is transactional - executing in commit phase
2093.110994557:syslog-fwd queue:Reg/w0: omfwd.c: omfwd: beginTransaction
2093.110997258:syslog-fwd queue:Reg/w0: omfwd.c: omfwd: doTryResume 127.0.0.1 iRet 0
2093.110999651:syslog-fwd queue:Reg/w0: ../action.c: action[syslog-fwd] transitioned to state: itx
2093.111002109:syslog-fwd queue:Reg/w0: ../action.c: processBatchMain: i 0, processMsgMain iRet -2121
2093.111004393:syslog-fwd queue:Reg/w0: ../action.c: processBatchMain: i 0, COMM state set
2093.111006850:syslog-fwd queue:Reg/w0: ../action.c: actionCommit[syslog-fwd]: enter, 1 msgs
2093.111009128:syslog-fwd queue:Reg/w0: ../action.c: actionCommit[syslog-fwd]: processing...
2093.111011368:syslog-fwd queue:Reg/w0: ../action.c: actionTryCommit[syslog-fwd] enter
2093.111013724:syslog-fwd queue:Reg/w0: ../action.c: doTransaction: have commitTransaction IF, using that, pWrkrInfo 0x55e2409489f0
2093.111016211:syslog-fwd queue:Reg/w0: ../action.c: entering actionCallCommitTransaction[syslog-fwd], state: itx, nMsgs 1
2093.111018502:syslog-fwd queue:Reg/w0: omfwd.c: omfwd: doTryResume 127.0.0.1 iRet 0
2093.111020942:syslog-fwd queue:Reg/w0: omfwd.c: 127.0.0.1:10514/tcp
2093.111024094:syslog-fwd queue:Reg/w0: omfwd.c: omfwd: add 227 bytes to send buffer (curr offs 0)
2093.111047664:syslog-fwd queue:Reg/w0: omfwd.c: omfwd: TCP sent 227 bytes, requested 227
2093.111051182:syslog-fwd queue:Reg/w0: ../action.c: actionCallCommitTransaction[syslog-fwd] state: itx mod commitTransaction returned 0
2093.111053587:syslog-fwd queue:Reg/w0: ../action.c: action[syslog-fwd] transitioned to state: rdy
2093.111055999:syslog-fwd queue:Reg/w0: ../action.c: actionCommit[syslog-fwd]: return actionTryCommit 0
2093.111058371:syslog-fwd queue:Reg/w0: ../action.c: actionCommit[syslog-fwd]: done, iRet 0
2093.111060964:syslog-fwd queue:Reg/w0: queue.c: regular consumer finished, iret=0, szlog 0 sz phys 1
2093.111063484:syslog-fwd queue:Reg/w0: queue.c: DeleteProcessedBatch: etry 0 state 3
2093.111066649:syslog-fwd queue:Reg/w0: queue.c: DeleteProcessedBatch: we deleted 1 objects and enqueued 0 objects
2093.111069152:syslog-fwd queue:Reg/w0: queue.c: doDeleteBatch: delete batch from store, new sizes: log 0, phys 0
2093.111071641:syslog-fwd queue:Reg/w0: syslog-fwd queue: queue.c: dequeued 0 consumable elements, szlog 0 sz phys 0
2093.111074225:syslog-fwd queue:Reg/w0: queue.c: regular consumer finished, iret=4, szlog 0 sz phys 0
2093.111076514:syslog-fwd queue:Reg/w0: wti.c: syslog-fwd queue:Reg/w0: worker IDLE, waiting for work.
2093.280167252:imtcp.c : nsdpoll_ptcp.c: epoll returned 1 entries
2093.280182600:imtcp.c : tcpsrv.c: tcpsrv: ready to process 1 event entries
...
This works fine... but then suddenly:
2093.280485033:syslog-fwd queue:Reg/w0: wti 0x55e240948920: wti.c: worker awoke from idle processing
2093.280488998:syslog-fwd queue:Reg/w0: queue.c: DeleteProcessedBatch: we deleted 0 objects and enqueued 0 objects
2093.280491486:syslog-fwd queue:Reg/w0: queue.c: doDeleteBatch: delete batch from store, new sizes: log 2, phys 2
2093.280494077:syslog-fwd queue:Reg/w0: syslog-fwd queue: queue.c: dequeued 2 consumable elements, szlog 0 sz phys 2
2093.293312843:imtcp.c : nsdpoll_ptcp.c: epoll returned 1 entries
2093.293326156:imtcp.c : tcpsrv.c: tcpsrv: ready to process 1 event entries
And than "wti 0x55e240948920: wti.c: worker awoke from idle processing" never come up again.
Queue get filled:
2094.037943773:main Q:Reg/w0 : ../action.c: action 'syslog-fwd': called, logging to builtin:omfwd (susp 0/0, direct q 0)
2094.037946442:main Q:Reg/w0 : syslog-fwd queue: queue.c: qqueueAdd: entry added, size now log 11, phys 13 entries
2094.037948880:main Q:Reg/w0 : syslog-fwd queue: queue.c: EnqueueMsg advised worker start
2094.037951334:main Q:Reg/w0 : ../action.c: action 'syslog-fwd': set suspended state to 0
...
2363.077252235:main Q:Reg/w0 : ../action.c: action 'syslog-fwd': called, logging to builtin:omfwd (susp 0/0, direct q 0)
2363.077255829:main Q:Reg/w0 : syslog-fwd queue: queue.c: queue nearly full (3000 entries), but could not drop msg (iRet: 0, severity 6)
2363.077258619:main Q:Reg/w0 : syslog-fwd queue: queue.c: doEnqSingleObject: queue FULL - waiting 2000ms to drain.
And now the funny part: when I add the following rule (before the other one):
if $syslogfacility == 4 then {
    action(Name="write4" Type="omfile" File="/var/log/syslog4")
    stop
}
Everything works fine. Messages in /var/log/syslog4:
Oct 31 07:54:26 otherhost.com sssd_be: GSSAPI client step 2
Oct 31 07:54:27 somehost.com sssd_be: GSSAPI client step 1
Anybody with a hint?

I've observed the same issue myself and have raised it as a bug:
https://github.com/rsyslog/rsyslog/issues/3273
In my case it was not related to omfwd, as I was sending via omrelp; instead it is imfile that causes the issue.
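Until the bug is resolved, the usual mitigation is to decouple the forwarding action behind an asynchronous action queue with unlimited retries. The following is only an illustrative sketch of those knobs; the parameter values are examples and this is not a confirmed fix for the linked bug:
# decouple the forwarder behind its own in-memory action queue (illustrative values)
if $syslogfacility != 1 then {
    action(Name="syslog-fwd" Type="omfwd" Target="127.0.0.1" Port="10514"
           template="JSONDefaultstr" Protocol="tcp"
           Action.ResumeInterval="5" Action.ResumeRetryCount="-1"
           queue.type="LinkedList" queue.size="10000")
    stop
}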

Related

ZeroMQ: very high latency with PUSH/PULL

Summary
I see very high latency at the receiver using zmq.PUSH/PULL sockets.
Details
I'm new to ZeroMQ, and am trying to send a few MBytes from the sender to the receiver (my final aim is to send a camera stream from the sender to the receiver). The sender and receiver are different machines on the same local network (one is a Mac and the other is running Ubuntu 18.04, if that matters). The sender is able to send its packets very fast (I'm guessing they get buffered in some zmq/tcp queue), but the receiver receives them very slowly with increasing latency per packet.
Here's the sender :
1 """
2 Sender
3 """
4 import zmq
5 from time import time
6 import sys
7
8 addr = "tcp://192.168.86.33:5555"
9
10 context = zmq.Context()
11 socket = context.socket(zmq.PUSH)
12 socket.bind(addr)
13
14 num = 0
15 while True:
16 frame = [0] * 1000000
17 ts = time()
18 num += 1
19 socket.send_pyobj(dict(num=num, frame=frame, ts=ts))
20
21 delay = time() - ts
22 print("Sender:: pkt_num: {}, delay: {}".format(num, delay))
And the receiver :
1 """
2 Receiver
3 """
4 import zmq
5 from time import time
6
7 addr = "tcp://192.168.86.33:5555"
8 context = zmq.Context()
9
10 socket = context.socket(zmq.PULL)
11 socket.connect(addr)
12
13 while True:
14 msg = socket.recv_pyobj()
15 frame = msg['frame']
16 num = msg['num']
17 ts = msg['ts']
18
19 delay = time() - ts
20
21 if True:
22 print("Receiver:: pkt_num: {} latency: {}".format(num, delay))
When I run this, I see the sender is able to send its packets very quickly :
Sender:: pkt_num: 1, delay: 0.026965618133544922
Sender:: pkt_num: 2, delay: 0.018309354782104492
Sender:: pkt_num: 3, delay: 0.01821303367614746
Sender:: pkt_num: 4, delay: 0.016669273376464844
Sender:: pkt_num: 5, delay: 0.01674652099609375
Sender:: pkt_num: 6, delay: 0.01668095588684082
Sender:: pkt_num: 7, delay: 0.015082836151123047
Sender:: pkt_num: 8, delay: 0.014363527297973633
Sender:: pkt_num: 9, delay: 0.014063835144042969
Sender:: pkt_num: 10, delay: 0.014398813247680664
But the receiver sees very high and growing packet latencies :
Receiver:: pkt_num: 1 latency: 0.1272585391998291
Receiver:: pkt_num: 2 latency: 0.2539491653442383
Receiver:: pkt_num: 3 latency: 0.40800905227661133
Receiver:: pkt_num: 4 latency: 0.5737316608428955
Receiver:: pkt_num: 5 latency: 0.7272651195526123
Receiver:: pkt_num: 6 latency: 0.9418754577636719
Receiver:: pkt_num: 7 latency: 1.0799565315246582
Receiver:: pkt_num: 8 latency: 1.228663682937622
Receiver:: pkt_num: 9 latency: 1.3731486797332764
Receiver:: pkt_num: 10 latency: 1.5067603588104248
I tried swapping the sender and receiver between the Mac and Linux machines, and saw the same behavior. Since my goal is to send a video stream from the sender to the receiver, these high latencies make this unusable for that purpose.
Edit 1
Based on user3666197's suggestion, I edited the sender/receiver test code to remove some overheads. On the sender side I now keep re-sending the same dict instead of rebuilding it every iteration. I also added more prints.
Sender:
num = 0
frame = [0] * 1000000
payload = dict(num=0, frame=frame, ts=0.0)
while True:
    payload['num'] += 1
    payload['ts'] = time()
    socket.send_pyobj(payload)

    delay = time() - payload['ts']
    print("Sender:: pkt_num: {:>6d}, delay: {:6f}" \
          .format(payload['num'], delay))
Receiver:
socket = context.socket(zmq.PULL)
socket.connect(addr)
clk = zmq.Stopwatch()
clk.start()

while True:
    iterT = clk.stop()
    clk.start()
    msg = socket.recv_pyobj()
    rcvT = clk.stop()
    delay = time() - msg['ts']

    print("Server:: pkt_num: {:>6d} latency: {:>6f} iterT: {} rcvT: {}" \
          .format(msg['num'], delay, iterT, rcvT))
    clk.start()
The sender's per-packet delay has come down further. An interesting data point this reveals is that the receiver takes almost 0.15 s to receive each packet, which seems to be the main problem.
Sender:: pkt_num: 1, delay: 1.797830
Sender:: pkt_num: 2, delay: 0.025297
Sender:: pkt_num: 3, delay: 0.019500
Sender:: pkt_num: 4, delay: 0.019500
Sender:: pkt_num: 5, delay: 0.018166
Sender:: pkt_num: 6, delay: 0.017320
Sender:: pkt_num: 7, delay: 0.017258
Sender:: pkt_num: 8, delay: 0.017277
Sender:: pkt_num: 9, delay: 0.017426
Sender:: pkt_num: 10, delay: 0.017340
In the receiver's prints, rcvT is the per-packet receive time in microseconds.
Server:: pkt_num: 1 latency: 2.395570 iterT: 1 rcvT: 331601
Server:: pkt_num: 2 latency: 0.735229 iterT: 1 rcvT: 137547
Server:: pkt_num: 3 latency: 0.844345 iterT: 1 rcvT: 134385
Server:: pkt_num: 4 latency: 0.991852 iterT: 1 rcvT: 166980
Server:: pkt_num: 5 latency: 1.089429 iterT: 2 rcvT: 117047
Server:: pkt_num: 6 latency: 1.190770 iterT: 2 rcvT: 119466
Server:: pkt_num: 7 latency: 1.348077 iterT: 2 rcvT: 174566
Server:: pkt_num: 8 latency: 1.460732 iterT: 1 rcvT: 129858
Server:: pkt_num: 9 latency: 1.585445 iterT: 2 rcvT: 141948
Server:: pkt_num: 10 latency: 1.717757 iterT: 1 rcvT: 149666
Edit 2
I implemented the solution pointed out in this answer that uses PUB/SUB, and it works perfectly; I get 30fps video on the receiver. I still do not understand why my PUSH/PULL sample sees a delay.
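For reference, the subscriber side of such a PUB/SUB setup might look roughly like the sketch below; the sender only changes zmq.PUSH to zmq.PUB and keeps the bind(), and the address is reused from the earlier listings. One likely reason the latency stops accumulating is that a PUB socket drops messages for a slow subscriber instead of queueing them the way PUSH/PULL does, though that is my reading rather than something verified here.
"""
Subscriber sketch - illustrative only
"""
import zmq
from time import time

addr = "tcp://192.168.86.33:5555"   # same address as in the listings above

context = zmq.Context()
socket = context.socket(zmq.SUB)
socket.setsockopt_string(zmq.SUBSCRIBE, "")   # no topic filtering, receive everything
socket.connect(addr)

while True:
    msg = socket.recv_pyobj()
    print("Receiver:: pkt_num: {} latency: {:.6f}".format(msg['num'], time() - msg['ts']))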
Q : "ZeroMQ: very high latency with PUSH/PULL"
The ZeroMQ PUSH/PULL archetype is a part of the story. Let's decompose the claimed latency :
1 )
Let's measure the actual payload assembly costs [us] :
aClk = zmq.Stopwatch()
num = 0
while True:
    aClk.start() #--------------------------------------- .start() the microsecond timer
    frame = [0] * 1000000
    ts = time()
    num += 1
    aPayLOAD = dict( num   = num,
                     frame = frame,
                     ts    = ts
                     )
    _1 = aClk.stop() #----------------------------------- .stop() the microsecond timer

    aPayLOAD['ts'] = time()
    aClk.start() #--------------------------------------- .start() the microsecond timer
    socket.send_pyobj( aPayLOAD )
    _2 = aClk.stop() #----------------------------------- .stop() the microsecond timer

    delay = time() - aPayLOAD['ts']
    print( "Sender:: pkt_num: {0:>6d}, assy: {1:>6d} [us] .send_pyobj(): {2:>6d} [us] 'delay': {3:>10.6f} [s] ".format( num, _1, _2, delay ) )
2 )
Doing the same for the receiving side, you have a clear picture of how many [us] got consumed on the Python side, before the ZeroMQ has ever touched the first byte of the payload.
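A receiver-side counterpart of that measurement could look like this sketch, assuming the PULL socket and imports from the receiver listing above (variable names are mine):
aClk = zmq.Stopwatch()

while True:
    aClk.start()                     # .start() the microsecond timer
    aMSG = socket.recv_pyobj()       #          blocking receive + unpickling
    _1 = aClk.stop()                 # .stop()  -> [us] spent inside .recv_pyobj()

    aClk.start()                     # .start() the microsecond timer
    frame = aMSG['frame']            #          payload unpacking on the Python side
    delay = time() - aMSG['ts']
    _2 = aClk.stop()                 # .stop()  -> [us] spent unpacking / computing

    print( "Receiver:: pkt_num: {0:>6d}, .recv_pyobj(): {1:>6d} [us] unpack: {2:>6d} [us] 'delay': {3:>10.6f} [s]".format( aMSG['num'], _1, _2, delay ) )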
3 )
Next comes the performance tuning :
Avoid Python garbage collection
Improve the wasted memory management: avoid creating new instances every iteration; rather inject data into memory-efficient "handlers" (e.g. numpy.ndarray data-zones) instead of re-composing list-s into a Python variable
Refactor the code to avoid expensive dict-based key-value mapping and rather use compact binary maps, as available in the struct module (your proposed mapping is both static & trivial); a sketch follows after this list
Test adding a compression step (it may improve the over-the-network hop latency a bit)
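As an illustration of the struct-based idea, here is a small sketch under the assumption that the frame can be shipped as raw bytes of a known, fixed length (no sockets involved, just the packing and unpacking):
import struct
from time import time

HEADER = struct.Struct("!Id")            # pkt_num as uint32 + timestamp as double, network byte order
FRAME_LEN = 1_000_000                    # assumed fixed frame size in bytes

# packing side: one flat bytes object instead of a pickled dict
frame = bytes(FRAME_LEN)
num = 1
wire = HEADER.pack(num, time()) + frame  # this is what would go to socket.send(wire, copy=False)

# unpacking side: slice the header off, keep the rest as the frame
num, ts = HEADER.unpack_from(wire)
frame_view = memoryview(wire)[HEADER.size:]
delay = time() - ts
print("pkt_num:", num, "wire bytes:", len(wire), "delay:", delay)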
4 )
Only now comes the ZeroMQ part:
you may test improved performance envelopes from adding more I/O threads to either of the Context( nIO_threads ) instances
you may go for .setsockopt( zmq.CONFLATE, 1 ) to ignore all non-"last" frames, as video streaming may get no added value from a "late" re-creation of already "old" frames (a sketch of both follows below)
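A minimal sketch of those two knobs on the pulling side (the values are illustrative; note that CONFLATE has to be set before connect() and only works with single-part messages):
import zmq

context = zmq.Context(io_threads=2)       # more than the default single I/O thread
socket = context.socket(zmq.PULL)
socket.setsockopt(zmq.CONFLATE, 1)        # keep only the newest not-yet-delivered message
socket.connect("tcp://192.168.86.33:5555")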

Inconsistent delay of ReadFile on a serial port device

I need to read data from a serial port device (which sends data every second) on Windows in REALTIME (<= 5 ms). But the time taken by ReadFile is unpredictable, which drives me crazy. A piece of the code can be found at:
https://gist.github.com/morris-stock/62b1674b4cda0e9df84d4738e54773f8
the delay is dumped at https://gist.github.com/morris-stock/62b1674b4cda0e9df84d4738e54773f8#file-serialport_win-cc-L283
Poco::Timestamp now;
if (!ReadFile(_handle, buffer, size, &bytesRead, NULL))
    throw Poco::IOException("failed to read from serial port");
Poco::Timestamp::TimeDiff elapsed = now.elapsed();
std::cout << Poco::DateTimeFormatter::format(now, "%Y-%m-%d %H:%M:%S.%i")
          << ", elapsed: " << elapsed << ", data len: " << bytesRead << std::endl << std::flush;
Sometimes ReadFile takes about 3000 us to return (which is OK, affected by COMMTIMEOUTS), but sometimes it takes 15600 us (NOT affected by COMMTIMEOUTS).
Please let me know if there is anything I can do to make the problem clear.
P.S.
COMMTIMEOUTS:
COMMTIMEOUTS cto;
cto.ReadIntervalTimeout = 1;
cto.ReadTotalTimeoutConstant = 1;
cto.ReadTotalTimeoutMultiplier = 0;
cto.WriteTotalTimeoutConstant = MAXDWORD;
cto.WriteTotalTimeoutMultiplier = 0;
The main reading thread part:
https://gist.github.com/morris-stock/62b1674b4cda0e9df84d4738e54773f8#file-serialdevice-cc-L31
Device data type:
baudrate: 9600; it sends about 400 bytes per second (continuously, then no data for the rest of the second).
Console output:
wPacketLength: 64
wPacketVersion: 2
dwServiceMask: 1
dwReserved1: 0
dwMaxTxQueue: 0
dwMaxRxQueue: 0
dwMaxBaud: 268435456
dwProvSubType: 1
dwProvCapabilities: 255
dwSettableParams: 127
dwSettableBaud: 268959743
wSettableData: 15
wSettableStopParity: 7943
dwCurrentTxQueue: 0
dwCurrentRxQueue: 68824
dwProvSpec1: 0
dwProvSpec2: 1128813859
wcProvChar: 0039F16C
2018-01-22 03:35:52.658, elapsed: 15600, data len: 0
2018-01-22 03:35:52.673, elapsed: 15600, data len: 0
2018-01-22 03:35:52.689, elapsed: 15600, data len: 0
2018-01-22 03:35:52.704, elapsed: 15600, data len: 0
2018-01-22 03:35:52.720, elapsed: 15600, data len: 0
2018-01-22 03:35:52.736, elapsed: 15600, data len: 0
2018-01-22 03:35:52.751, elapsed: 15600, data len: 0
In my case, it's the Windows system clock resolution that matters.
ClockRes gives me:
C:\work\utils\ClockRes>Clockres.exe
Clockres v2.1 - Clock resolution display utility
Copyright (C) 2016 Mark Russinovich
Sysinternals
Maximum timer interval: 15.600 ms
Minimum timer interval: 0.500 ms
Current timer interval: 1.000 ms
and
C:\work\utils\ClockRes>Clockres.exe
Clockres v2.1 - Clock resolution display utility
Copyright (C) 2016 Mark Russinovich
Sysinternals
Maximum timer interval: 15.600 ms
Minimum timer interval: 0.500 ms
Current timer interval: 15.600 ms
By calling timeBeginPeriod(1) when my app starts, I can get a more consistent result.
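For context, requesting the higher timer resolution goes through the winmm timer API; a minimal sketch (assuming the project links against winmm.lib, with the serial-port loop elided):
#include <windows.h>
#include <timeapi.h>    // timeBeginPeriod / timeEndPeriod
#include <iostream>

int main()
{
    // Ask for a 1 ms system timer resolution; otherwise waits and timeouts
    // get rounded up to the default ~15.6 ms tick seen in the log above.
    if (timeBeginPeriod(1) != TIMERR_NOERROR)
        std::cerr << "timeBeginPeriod(1) failed" << std::endl;

    // ... open the serial port and run the ReadFile loop here ...

    timeEndPeriod(1);   // restore the previous resolution on exit
    return 0;
}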
Thanks everyone for your kind help.

How to track total number of rejected and dropped events

What is the correct way to track the number of dropped or rejected events in a managed Elasticsearch cluster?
You can use GET /_nodes/stats/thread_pool, which gives you something like:
"thread_pool": {
"bulk": {
"threads": 4,
"queue": 0,
"active": 0,
"rejected": 0,
"largest": 4,
"completed": 42
}
....
"flush": {
"threads": 0,
"queue": 0,
"active": 0,
"rejected": 0,
"largest": 0,
"completed": 0
}
...
Another way to get more concise and better-formatted info about thread pools (especially if you are dealing with several nodes) is to use the _cat thread_pool API:
$ curl -XGET 'localhost:9200/_cat/thread_pool?v'
host ip bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
10.10.1.1 10.10.1.1 1 10 0 2 0 0 10 0 0
10.10.1.2 10.10.1.2 2 0 1 4 0 0 4 10 2
10.10.1.3 10.10.1.3 1 0 0 1 0 0 5 0 0
UPDATE
You can also decide which thread pools to show and, for each thread pool, which fields to include in the output. For instance, below we're showing the following fields from the search thread pool:
sqs: The maximum number of search requests that can be queued before being rejected
sq: The number of search requests in the search queue
sa: The number of currently active search threads
sr: The number of rejected search threads (since the last restart)
sc: The number of completed search threads (since the last restart)
Here is the command:
curl -s -XGET 'localhost:9200/_cat/thread_pool?v&h=ip,sqs,sq,sa,sr,sc'
ip sqs sq sa sr sc
10.10.1.1 100 0 1 0 62636120
10.10.1.2 100 0 2 0 15528863
10.10.1.3 100 0 4 0 64647299
10.10.1.4 100 0 5 372 103014657
10.10.1.5 100 0 2 0 13947055
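If you want a running total rather than point-in-time snapshots, one option is to poll the node stats and diff the rejected counters yourself. A rough sketch (not an official client; the URL and the 60-second interval are assumptions, and the counters reset when a node restarts):
import json
import time
import urllib.request

ES_URL = "http://localhost:9200"

def total_rejected():
    # Sum the "rejected" counter over every thread pool on every node.
    with urllib.request.urlopen(ES_URL + "/_nodes/stats/thread_pool") as resp:
        stats = json.load(resp)
    return sum(pool.get("rejected", 0)
               for node in stats["nodes"].values()
               for pool in node["thread_pool"].values())

previous = total_rejected()
while True:
    time.sleep(60)
    current = total_rejected()
    print("rejections in the last minute:", current - previous)
    previous = current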

StackExchange.Redis TIMEOUT growing unsent queue

Our test environment last weekend saw a number of VMs start logging timeouts where the unsent queue just kept growing:
Timeout performing GET 0:B:ac64ebd0-3640-4b7b-a108-7fd36f294640, inst:
0, mgr: ExecuteSelect, queue: 35199, qu: 35199, qs: 0, qc: 0, wr: 0,
wq: 0, in: 0, ar: 0, IOCP: (Busy=2,Free=398,Min=4,Max=400), WORKER:
(Busy=5,Free=395,Min=4,Max=400)
Timeout performing SETEX 0:B:pfed2b3f5-fbbf-4ed5-9a58-f1bd888f01,
inst: 0, mgr: ExecuteSelect, queue: 35193, qu: 35193, qs: 0, qc: 0,
wr: 0, wq: 0, in: 0, ar: 0, IOCP: (Busy=2,Free=398,Min=4,Max=400),
WORKER: (Busy=6,Free=394,Min=4,Max=400)
I've read quite a few posts on analyzing these, but most of them don't involve the unsent message queue growing. No connectivity errors were logged during this time; an AppPool recycle resolved the issue. Has anyone else seen this issue before?
Some potentially relevant extra info:
Same timeouts seen on 1.0.450 and 1.0.481 versions of StackExchange.Redis nuget package
ASP.Net v4.5 Web API 1.x site was the one affected
Upgraded to Redis 3.0.4 (from 3.0.3) the same week the errors were encountered (but days prior)
Installed New Relic .NET APM v5.5.52.0 that includes some StackExchange.Redis instrumentation (https://docs.newrelic.com/docs/release-notes/agent-release-notes/net-release-notes/net-agent-55520), again, a couple days prior to the timeouts. We've rolled this back here just in case.
I'm encountering the same issue.
To research the issue, we log the ConnectionCounters of the ConnectionMultiplexer every 10 seconds for monitoring.
It shows only pendingUnsentItems growing, which means StackExchange.Redis does not send/receive from the socket.
completedAsynchronously completedSynchronously pendingUnsentItems responsesAwaitingAsyncCompletion sentItemsAwaitingResponse
1 10 4 0 0
1 10 28 0 0
1 10 36 0 0
1 10 51 0 0
1 10 65 0 0
1 10 72 0 0
1 10 85 0 0
1 10 104 0 0
1 10 126 0 0
1 10 149 0 0
1 10 169 0 0
1 10 190 0 0
1 10 207 0 0
1 10 230 0 0
1 10 277 0 0
1 10 296 0 0
...snip
1 10 19270 0 0
1 10 19281 0 0
1 10 19291 0 0
1 10 19302 0 0
1 10 19313 0 0
I guess the socket writer thread was stopped?
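For reference, a periodic dump like the table above can be collected along these lines; this is only a sketch, assuming a shared ConnectionMultiplexer named redis and the ConnectionCounters property names from StackExchange.Redis 1.x:
// Sketch: log interactive-connection counters every 10 seconds ("redis" is the shared multiplexer).
var timer = new System.Threading.Timer(_ =>
{
    var counters = redis.GetCounters();   // snapshot of multiplexer counters
    var i = counters.Interactive;         // the interactive (non pub/sub) connection
    Console.WriteLine(
        "completedAsync={0} completedSync={1} pendingUnsent={2} awaitingAsyncCompletion={3} awaitingResponse={4}",
        i.CompletedAsynchronously,
        i.CompletedSynchronously,
        i.PendingUnsentItems,
        i.ResponsesAwaitingAsyncCompletion,
        i.SentItemsAwaitingResponse);
}, null, TimeSpan.Zero, TimeSpan.FromSeconds(10));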
My environment is:
StackExchange.Redis 1.0.481
Windows Server 2012 R2
.NET Framework 4.5
ASP.NET MVC 5.2.3
Installed New Relic .NET APM v5.7.17.0
Looks like the issue was seen when using New Relic .NET APM between versions 5.5-5.7 and is fixed in 5.8.

Rebalancing a Cassandra Ring with Vnodes

We had a system with a 3-node Cassandra 2.0.6 ring. Over time, the application load on that system increased to the point where the ring could not handle it anymore, causing the typical node-overload failures.
We doubled the size of the ring, and recently even added one more node, to try to handle the load, but there are still only 3 nodes taking all the load, and they are not the original 3 nodes of the initial ring.
We did the bootstrap + cleanup process described in the adding-nodes guide. We also tried repairs on each node after not seeing much improvement in the ring load. Our load is 99.99% writes on this system.
Here's a chart of the cluster load illustrating the issue:
The highest load tables have a high cardinality on the partition key that I'd expect distributes well over vnodes.
Edit: nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN x.y.z.92 56.83 GB 256 13.8% x-y-z-b53e8ab55e0a rack1
UN x.y.z.253 136.87 GB 256 15.2% x-y-z-bd3cf08449c8 rack1
UN x.y.z.70 69.84 GB 256 14.2% x-y-z-39e63dd017cd rack1
UN x.y.z.251 74.03 GB 256 14.4% x-y-z-36a6c8e4a8e8 rack1
UN x.y.z.240 51.77 GB 256 13.0% x-y-z-ea239f65794d rack1
UN x.y.z.189 128.49 GB 256 14.3% x-y-z-7c36c93e0022 rack1
UN x.y.z.99 53.65 GB 256 15.2% x-y-z-746477dc5db9 rack1
Edit: nodetool tpstats (on a highly loaded node)
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 11591287 0 0
RequestResponseStage 0 0 283211224 0 0
MutationStage 32 405875 349531549 0 0
ReadRepairStage 0 0 3591 0 0
ReplicateOnWriteStage 0 0 0 0 0
GossipStage 0 0 3246983 0 0
AntiEntropyStage 0 0 72055 0 0
MigrationStage 0 0 133 0 0
MemoryMeter 0 0 205 0 0
MemtablePostFlusher 0 0 94915 0 0
FlushWriter 0 0 12521 0 0
MiscStage 0 0 34680 0 0
PendingRangeCalculator 0 0 14 0 0
commitlog_archiver 0 0 0 0 0
AntiEntropySessions 1 1 1 0 0
InternalResponseStage 0 0 30 0 0
HintedHandoff 0 0 1957 0 0
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 196
PAGED_RANGE 0
BINARY 0
READ 0
MUTATION 31663792
_TRACE 24409
REQUEST_RESPONSE 4
COUNTER_MUTATION 0
How could I further troubleshoot this issue?
You need to run nodetool cleanup on the previous nodes that were part of the ring. nodetool cleanup will remove the partition keys that the node currently does not own.
It seems that after the addition of the new nodes the old keys have not been deleted, hence causing the load to be higher on the previous nodes.
Try running
nodetool cleanup
on the previous nodes
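If the original nodes are reachable over SSH, the cleanup can be kicked off on each of them with a small loop like this (hostnames are placeholders):
for host in old-node-1 old-node-2 old-node-3; do
    ssh "$host" nodetool cleanup
done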
