ZeroMQ: very high latency with PUSH/PULL

Summary
I see very high latency at the receiver using zmq.PUSH/PULL sockets.
Details
I'm new to ZeroMQ, and am trying to send a few MBytes from the sender to the receiver (my final aim is to send a camera stream from the sender to the receiver). The sender and receiver are different machines on the same local network (one is a Mac and the other is running Ubuntu 18.04, if that matters). The sender is able to send its packets very fast (I'm guessing they get buffered in some zmq/tcp queue), but the receiver receives them very slowly with increasing latency per packet.
Here's the sender:
"""
Sender
"""
import zmq
from time import time
import sys

addr = "tcp://192.168.86.33:5555"

context = zmq.Context()
socket = context.socket(zmq.PUSH)
socket.bind(addr)

num = 0
while True:
    frame = [0] * 1000000
    ts = time()
    num += 1
    socket.send_pyobj(dict(num=num, frame=frame, ts=ts))

    delay = time() - ts
    print("Sender:: pkt_num: {}, delay: {}".format(num, delay))
And the receiver:
"""
Receiver
"""
import zmq
from time import time

addr = "tcp://192.168.86.33:5555"
context = zmq.Context()

socket = context.socket(zmq.PULL)
socket.connect(addr)

while True:
    msg = socket.recv_pyobj()
    frame = msg['frame']
    num = msg['num']
    ts = msg['ts']

    delay = time() - ts

    if True:
        print("Receiver:: pkt_num: {} latency: {}".format(num, delay))
When I run this, I see the sender is able to send its packets very quickly:
Sender:: pkt_num: 1, delay: 0.026965618133544922
Sender:: pkt_num: 2, delay: 0.018309354782104492
Sender:: pkt_num: 3, delay: 0.01821303367614746
Sender:: pkt_num: 4, delay: 0.016669273376464844
Sender:: pkt_num: 5, delay: 0.01674652099609375
Sender:: pkt_num: 6, delay: 0.01668095588684082
Sender:: pkt_num: 7, delay: 0.015082836151123047
Sender:: pkt_num: 8, delay: 0.014363527297973633
Sender:: pkt_num: 9, delay: 0.014063835144042969
Sender:: pkt_num: 10, delay: 0.014398813247680664
But the receiver sees very high and growing per-packet latencies:
Receiver:: pkt_num: 1 latency: 0.1272585391998291
Receiver:: pkt_num: 2 latency: 0.2539491653442383
Receiver:: pkt_num: 3 latency: 0.40800905227661133
Receiver:: pkt_num: 4 latency: 0.5737316608428955
Receiver:: pkt_num: 5 latency: 0.7272651195526123
Receiver:: pkt_num: 6 latency: 0.9418754577636719
Receiver:: pkt_num: 7 latency: 1.0799565315246582
Receiver:: pkt_num: 8 latency: 1.228663682937622
Receiver:: pkt_num: 9 latency: 1.3731486797332764
Receiver:: pkt_num: 10 latency: 1.5067603588104248
I tried swapping the sender and receiver between the Mac and Linux machines, and saw the same behavior. Since my goal is to send a video stream from the sender to the receiver, these high latencies make this unusable for that purpose.
Edit 1
Based on user3666197's suggestion, I edited the sender/receiver test code to remove some overheads. On the sender side I now keep sending the same dict. I also added more prints.
Sender:

num = 0
frame = [0] * 1000000
payload = dict(num=0, frame=frame, ts=0.0)
while True:
    payload['num'] += 1
    payload['ts'] = time()
    socket.send_pyobj(payload)

    delay = time() - payload['ts']
    print("Sender:: pkt_num: {:>6d}, delay: {:6f}" \
          .format(payload['num'], delay))
Receiver:

socket = context.socket(zmq.PULL)
socket.connect(addr)
clk = zmq.Stopwatch()
clk.start()

while True:
    iterT = clk.stop()
    clk.start()
    msg = socket.recv_pyobj()
    rcvT = clk.stop()
    delay = time() - msg['ts']

    print("Server:: pkt_num: {:>6d} latency: {:>6f} iterT: {} rcvT: {}" \
          .format(msg['num'], delay, iterT, rcvT))
    clk.start()
The sender's per-packet delay has reduced further. An interesting datapoint this reveals is that the receiver takes almost 0.15 s to receive each packet, which seems to be the main problem.
Sender:: pkt_num: 1, delay: 1.797830
Sender:: pkt_num: 2, delay: 0.025297
Sender:: pkt_num: 3, delay: 0.019500
Sender:: pkt_num: 4, delay: 0.019500
Sender:: pkt_num: 5, delay: 0.018166
Sender:: pkt_num: 6, delay: 0.017320
Sender:: pkt_num: 7, delay: 0.017258
Sender:: pkt_num: 8, delay: 0.017277
Sender:: pkt_num: 9, delay: 0.017426
Sender:: pkt_num: 10, delay: 0.017340
In the receiver's prints, rcvT is the per-packet receive time in microseconds.
Server:: pkt_num: 1 latency: 2.395570 iterT: 1 rcvT: 331601
Server:: pkt_num: 2 latency: 0.735229 iterT: 1 rcvT: 137547
Server:: pkt_num: 3 latency: 0.844345 iterT: 1 rcvT: 134385
Server:: pkt_num: 4 latency: 0.991852 iterT: 1 rcvT: 166980
Server:: pkt_num: 5 latency: 1.089429 iterT: 2 rcvT: 117047
Server:: pkt_num: 6 latency: 1.190770 iterT: 2 rcvT: 119466
Server:: pkt_num: 7 latency: 1.348077 iterT: 2 rcvT: 174566
Server:: pkt_num: 8 latency: 1.460732 iterT: 1 rcvT: 129858
Server:: pkt_num: 9 latency: 1.585445 iterT: 2 rcvT: 141948
Server:: pkt_num: 10 latency: 1.717757 iterT: 1 rcvT: 149666
Edit 2
I implemented the solution pointed out in this answer that uses PUB/SUB, and it works perfectly; I get 30 fps video on the receiver. I still do not understand why my PUSH/PULL sample sees such delays.
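For reference, the core of the PUB/SUB arrangement looks roughly like this (a sketch only, with a loopback address and a random port standing in for my real addresses; the linked answer has the full code):

```python
import zmq

context = zmq.Context()

# Publisher (camera side) -- the real sender binds tcp://*:5555
pub = context.socket(zmq.PUB)
port = pub.bind_to_random_port("tcp://127.0.0.1")

# Subscriber (viewer side) -- the real receiver connects to the sender's address
sub = context.socket(zmq.SUB)
sub.setsockopt(zmq.SUBSCRIBE, b"")   # subscribe to every topic
sub.connect("tcp://127.0.0.1:%d" % port)

# PUB silently drops messages sent before the subscription is in place
# (the "slow joiner" effect), so a camera loop simply keeps publishing frames.
```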

Q : "ZeroMQ: very high latency with PUSH/PULL"
The ZeroMQ PUSH/PULL archetype is only part of the story. Let's decompose the claimed latency:
1 )
Let's measure the actual payload assembly costs [us]:

aClk = zmq.Stopwatch()
num = 0
while True:
    aClk.start() #--------------------------------------- .start() the microsecond timer
    frame = [0] * 1000000
    ts = time()
    num += 1
    aPayLOAD = dict( num   = num,
                     frame = frame,
                     ts    = ts
                     )
    _1 = aClk.stop() #----------------------------------- .stop() the microsecond timer
    aPayLOAD['ts'] = time()
    aClk.start() #--------------------------------------- .start() the microsecond timer
    socket.send_pyobj( aPayLOAD )
    _2 = aClk.stop() #----------------------------------- .stop() the microsecond timer
    delay = time() - aPayLOAD['ts']
    print( "Sender:: pkt_num: {0:>6d}, assy: {1:>6d} [us] .send_pyobj(): {2:>6d} [us] 'delay': {3:>10.6f} [s] ".format( num, _1, _2, delay ) )
2 )
Doing the same for the receiving side gives you a clear picture of how many [us] are consumed on the Python side before ZeroMQ has ever touched the first byte of the payload.
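The dominant Python-side cost can be shown without any sockets at all: it is the pickling and unpickling of the 1000000-element list, which .send_pyobj() / .recv_pyobj() perform on every message (a self-contained sketch; perf_counter stands in for the Stopwatch):

```python
import pickle
from time import perf_counter

aPayLOAD = dict(num=1, frame=[0] * 1000000, ts=0.0)

t0 = perf_counter()
blob = pickle.dumps(aPayLOAD)       # what .send_pyobj() does under the hood
t1 = perf_counter()
restored = pickle.loads(blob)       # what .recv_pyobj() does under the hood
t2 = perf_counter()

print("pickled size: {:>9d} [B]".format(len(blob)))
print("dumps: {:.6f} [s]  loads: {:.6f} [s]".format(t1 - t0, t2 - t1))
```

The serialized blob is megabytes in size, and both directions pay the full serialization cost per message.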
3 )
Next comes the performance tuning:
- Avoid Python garbage collection pauses
- Reduce wasteful memory management: avoid creating new instances; instead inject data into memory-efficient "handlers" (e.g. numpy.ndarray data-zones) rather than re-composing list-s into a Python variable
- Refactor the code to avoid expensive dict-based key-value mapping, and rather use compact binary maps, as available in the struct module (your proposed mapping is both static and trivial)
- Test adding a compression step (it may improve the over-the-network hop latency a bit)
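A compact binary map of that kind could look like the sketch below; the layout and names are illustrative only, and it assumes the frame is already a bytes-like buffer (e.g. numpy.ndarray.tobytes()) rather than a Python list:

```python
import struct

# Fixed header: packet number as uint64, timestamp as double (network byte order)
HEADER = struct.Struct("!Qd")

def pack(num, ts, frame_bytes):
    # header followed by the raw frame payload
    return HEADER.pack(num, ts) + frame_bytes

def unpack(blob):
    num, ts = HEADER.unpack_from(blob, 0)
    return num, ts, blob[HEADER.size:]
```

Sending such a blob with plain socket.send() then avoids the pickle round-trip entirely.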
4 )
Only now comes the ZeroMQ part:
- Test for improved performance envelopes from adding more IO-threads into either of the Context( nIO_threads ) instances
- Consider .setsockopt( zmq.CONFLATE, 1 ) to ignore all non-"last" frames, as video streaming gets no added value from "late" re-creation of already "old" frames
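Both knobs together might be sketched like this on the receiving side (the thread count and address are illustrative; note that CONFLATE must be set before .connect()):

```python
import zmq

context = zmq.Context(io_threads=2)          # more I/O threads in this Context

socket = context.socket(zmq.PULL)
socket.setsockopt(zmq.CONFLATE, 1)           # keep only the newest message
socket.connect("tcp://192.168.86.33:5555")   # option set before connecting
```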

Related

Why does printing out local data mess with MPI_GATHERV results?

The following code needs 12 processors to run properly.
program GameOfLife
    use mpi
    implicit none

    integer ierr, myid, numprocs
    integer send, recv, count, tag
    parameter (tag=111)
    integer recv_buff, send_buff, request
    integer stat(MPI_STATUS_SIZE)
    integer N, m, i, j, sum
    parameter (N=3)      ! # of squares per processor
    integer W, H
    parameter (W=4,H=3)  ! # of processors up and across
    integer A(N,N), buff(0:N+1,0:N+1), G(N*H, N*W)
    ! real r
    integer sizes(2), subsizes(2), starts(2), recvcount(N*N)
    integer newtype, intsize, resizedtype
    integer(kind=MPI_ADDRESS_KIND) extent, begin
    integer disp(W*H)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

    ! fill up subgrid
    do i = 1, N
        do j = 1, N
            ! call random_number(r)
            A(i,j) = myid ! floor(r*2)
        end do
    end do

    do i = 1, N
        print *, A(i, :)
    end do

    starts = [0,0]
    sizes = [N*H, N*W]
    subsizes = [N, N]
    call MPI_Type_create_subarray(2, sizes, subsizes, starts, &
                                  MPI_ORDER_FORTRAN, MPI_INTEGER, newtype, ierr)
    call MPI_Type_size(MPI_INTEGER, intsize, ierr)
    extent = intsize*N
    begin = 0
    call MPI_Type_create_resized(newtype, begin, extent, resizedtype, ierr)
    call MPI_Type_commit(resizedtype, ierr)

    disp = [0, 1, 2, 9, 10, 11, 18, 19, 20, 27, 28, 29]
    recvcount = 1
    call MPI_GATHERV(A, N*N, MPI_INTEGER, G, recvcount, disp, resizedtype, 0, MPI_COMM_WORLD, ierr)
    call MPI_WAIT(request, stat, ierr)

    if ( myid == 0 ) then
        do i = 1, N*H
            print *, G(i,:)
        end do
    endif

    call MPI_FINALIZE(ierr)
end program GameOfLife
When run without printing out the matrix A, everything works mostly okay. But if I try to print out A before feeding it to the gather statement, I get a jumbled mess.
What's going on here? I assume memory is being accessed at the same time, or something along those lines.
Output of G looks like
0 0 0 4 4 4 0 -1302241474 1 13 13 13
0 0 0 4 4 4 0 0 0 13 13 13
0 0 0 4 4 4 -10349344 -12542198 -10350200 13 13 13
1 1 1 5 5 5 59 59 59 14 14 14
1 1 1 5 5 5 -1342953792 0 0 14 14 14
1 1 1 5 5 5 32767 0 0 14 14 14
2 2 2 6 6 6 -1342953752 1451441280 0 15 15 15
2 2 2 6 6 6 32767 10985 0 15 15 15
2 2 2 6 6 6 -10350200 1 0 15 15 15
3 3 3 7 7 8 8 8 12 12 12 0
3 3 3 7 7 8 8 8 12 12 12 0
3 3 3 7 7 8 8 8 12 12 12 0

How to track total number of rejected and dropped events

What is the correct way to track the number of dropped or rejected events in a managed Elasticsearch cluster?
Use GET /_nodes/stats/thread_pool, which gives you something like:
"thread_pool": {
    "bulk": {
        "threads": 4,
        "queue": 0,
        "active": 0,
        "rejected": 0,
        "largest": 4,
        "completed": 42
    }
    ....
    "flush": {
        "threads": 0,
        "queue": 0,
        "active": 0,
        "rejected": 0,
        "largest": 0,
        "completed": 0
    }
    ...
Another way to get more concise and better-formatted info about thread pools (especially if you are dealing with several nodes) is to use the _cat thread_pool API:
$ curl -XGET 'localhost:9200/_cat/thread_pool?v'
host ip bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
10.10.1.1 10.10.1.1 1 10 0 2 0 0 10 0 0
10.10.1.2 10.10.1.2 2 0 1 4 0 0 4 10 2
10.10.1.3 10.10.1.3 1 0 0 1 0 0 5 0 0
UPDATE
You can also decide which thread pools to show and, for each thread pool, which fields to include in the output. For instance, below we're showing the following fields from the search thread pool:
sqs: The maximum number of search requests that can be queued before being rejected
sq: The number of search requests in the search queue
sa: The number of currently active search threads
sr: The number of rejected search requests (since the last restart)
sc: The number of completed search requests (since the last restart)
Here is the command:
curl -s -XGET 'localhost:9200/_cat/thread_pool?v&h=ip,sqs,sq,sa,sr,sc'
ip sqs sq sa sr sc
10.10.1.1 100 0 1 0 62636120
10.10.1.2 100 0 2 0 15528863
10.10.1.3 100 0 4 0 64647299
10.10.1.4 100 0 5 372 103014657
10.10.1.5 100 0 2 0 13947055
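To track drops over time, you can poll either endpoint periodically and diff the rejected counters between snapshots. A sketch below works on the parsed JSON shape of the GET /_nodes/stats/thread_pool response shown earlier (key names taken from that sample):

```python
def rejected_counts(stats):
    """Extract per-(node, pool) 'rejected' counters from a parsed
    GET /_nodes/stats/thread_pool response body."""
    counts = {}
    for node_id, node in stats.get("nodes", {}).items():
        for pool, c in node.get("thread_pool", {}).items():
            counts[(node_id, pool)] = c.get("rejected", 0)
    return counts

def rejected_since(prev, curr):
    """Rejections between two snapshots; counters reset when a node
    restarts, so a negative delta is treated as the new count."""
    out = {}
    for key, now in curr.items():
        before = prev.get(key, 0)
        out[key] = now - before if now >= before else now
    return out
```

Feeding consecutive snapshots into rejected_since gives the number of rejections in each polling interval, which is easy to ship to whatever monitoring system you use.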

StackExchange.Redis TIMEOUT growing unsent queue

Our test environment last weekend saw a number of VMs start logging timeouts where the unsent queue just kept growing:
Timeout performing GET 0:B:ac64ebd0-3640-4b7b-a108-7fd36f294640, inst:
0, mgr: ExecuteSelect, queue: 35199, qu: 35199, qs: 0, qc: 0, wr: 0,
wq: 0, in: 0, ar: 0, IOCP: (Busy=2,Free=398,Min=4,Max=400), WORKER:
(Busy=5,Free=395,Min=4,Max=400)
Timeout performing SETEX 0:B:pfed2b3f5-fbbf-4ed5-9a58-f1bd888f01,
inst: 0, mgr: ExecuteSelect, queue: 35193, qu: 35193, qs: 0, qc: 0,
wr: 0, wq: 0, in: 0, ar: 0, IOCP: (Busy=2,Free=398,Min=4,Max=400),
WORKER: (Busy=6,Free=394,Min=4,Max=400)
I've read quite a few posts on analyzing these but most of the time it doesn't involve the unsent message queue growing. No connectivity errors were logged during this time; an AppPool recycle resolved the issue. Has anyone else seen this issue before?
Some potentially relevant extra info:
Same timeouts seen on 1.0.450 and 1.0.481 versions of StackExchange.Redis nuget package
ASP.Net v4.5 Web API 1.x site was the one affected
Upgraded to Redis 3.0.4 (from 3.0.3) same week errors were encountered (but days prior)
Installed New Relic .NET APM v5.5.52.0 that includes some StackExchange.Redis instrumentation (https://docs.newrelic.com/docs/release-notes/agent-release-notes/net-release-notes/net-agent-55520), again, a couple days prior to the timeouts. We've rolled this back here just in case.
I'm encountering the same issue.
To research it, we log the ConnectionCounters of the ConnectionMultiplexer every 10 seconds for monitoring.
They show only pendingUnsentItems growing, which means StackExchange.Redis is not sending to or receiving from the socket.
completedAsynchronously completedSynchronously pendingUnsentItems responsesAwaitingAsyncCompletion sentItemsAwaitingResponse
1 10 4 0 0
1 10 28 0 0
1 10 36 0 0
1 10 51 0 0
1 10 65 0 0
1 10 72 0 0
1 10 85 0 0
1 10 104 0 0
1 10 126 0 0
1 10 149 0 0
1 10 169 0 0
1 10 190 0 0
1 10 207 0 0
1 10 230 0 0
1 10 277 0 0
1 10 296 0 0
...snip
1 10 19270 0 0
1 10 19281 0 0
1 10 19291 0 0
1 10 19302 0 0
1 10 19313 0 0
I guess the socket writer thread was stopped?
My environment is:
StackExchange.Redis 1.0.481
Windows Server 2012 R2
.NET Framework 4.5
ASP.NET MVC 5.2.3
Installed New Relic .NET APM v5.7.17.0
Looks like the issue was seen when using New Relic .NET APM between versions 5.5-5.7 and is fixed in 5.8.

First Come First Serve Wait Time Algorithm

As I understand it, FCFS does not need any sorting, but there are no examples on the internet covering what happens if the waiting time goes negative.
I understand that the formula for waiting time is: WT = CT (Completion Time) - AT (Arrival Time) - BT (Burst Time).
But what happens if i have:
AT: 0 / BT: 2
AT: 0 / BT: 4
AT: 15 / BT: 5
AT: 21 / BT: 10
My calculations would be:
WT = 2 - 0 - 2 = 0
WT = 6 - 0 - 4 = 2
WT = 11 - 15 - 5 = -9
WT = 21 - 21 - 10 = -10
Avg: (0+2-10-9)/4 = -17/4 = -4.25 (seems unreal)
Is the correct answer negative, or should I modify the algorithm somehow?
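If the completion times account for CPU idle time (a process that arrives at t=15 cannot complete at t=11), the waits stop going negative: the start time of each process is max(previous completion, arrival). A minimal sketch of that calculation:

```python
def fcfs_wait_times(procs):
    """procs: list of (arrival, burst) tuples, already in FCFS order.
    The CPU idles until the next process arrives, so completion time
    is max(previous completion, arrival) + burst."""
    waits, clock = [], 0
    for arrival, burst in procs:
        start = max(clock, arrival)            # CPU may sit idle until arrival
        clock = start + burst                  # completion time
        waits.append(clock - arrival - burst)  # WT = CT - AT - BT
    return waits

print(fcfs_wait_times([(0, 2), (0, 4), (15, 5), (21, 10)]))  # → [0, 2, 0, 0]
```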

How can I measure the performance and TCP RTT of my server code?

I created a basic TCP server that reads incoming binary data in protocol buffer format and writes a binary msg as response. I would like to benchmark the roundtrip time.
I tried iperf, but could not make it send the same input file multiple times. Is there another benchmark tool that can send a binary input file repeatedly?
If you have access to a linux or unix machine [1], you should use tcptrace. All you need to do is loop through your binary traffic test while capturing with wireshark or tcpdump.
After you have that .pcap file [2], analyze with tcptrace -xtraffic <pcap_filename> [3]. This will generate two text files, and the average RTT stats for all connections in that pcap are shown at the bottom of the one called traffic_stats.dat.
[mpenning@Bucksnort tcpperf]$ tcptrace -xtraffic willers.pcap
mod_traffic: characterizing traffic
1 arg remaining, starting with 'willers.pcap'
Ostermann's tcptrace -- version 6.6.1 -- Wed Nov 19, 2003
16522 packets seen, 16522 TCP packets traced
elapsed wallclock time: 0:00:00.200709, 82318 pkts/sec analyzed
trace file elapsed time: 0:03:21.754962
Dumping port statistics into file traffic_byport.dat
Dumping overall statistics into file traffic_stats.dat
Plotting performed at 15.000 second intervals
[mpenning@Bucksnort tcpperf]$
[mpenning@Bucksnort tcpperf]$ cat traffic_stats.dat
Overall Statistics over 201 seconds (0:03:21.754962):
4135308 ttl bytes sent, 20573.672 bytes/second
4135308 ttl non-rexmit bytes sent, 20573.672 bytes/second
0 ttl rexmit bytes sent, 0.000 bytes/second
16522 packets sent, 82.199 packets/second
200 connections opened, 0.995 conns/second
11 dupacks sent, 0.055 dupacks/second
0 rexmits sent, 0.000 rexmits/second
average RTT: 67.511 msecs <------------------
[mpenning@Bucksnort tcpperf]$
The .pcap file used in this example was a capture I generated when I looped through an expect script that pulled data from one of my servers. This was how I generated the loop...
#!/usr/bin/python
from subprocess import Popen, PIPE
import time

for ii in xrange(0,200):
    # willers.exp is an expect script
    Popen(['./willers.exp'], stdin=PIPE, stdout=PIPE, stderr=PIPE)
    time.sleep(1)
You can adjust the sleep time between loops based on your server's accept() performance and the duration of your tests.
END NOTES:
1. A Knoppix Live-CD will do
2. Filtered to only capture test traffic
3. tcptrace is capable of very detailed per-socket stats if you use other options...
================================
[mpenning@Bucksnort tcpperf]$ tcptrace -lr willers.pcap
1 arg remaining, starting with 'willers.pcap'
Ostermann's tcptrace -- version 6.6.1 -- Wed Nov 19, 2003
16522 packets seen, 16522 TCP packets traced
elapsed wallclock time: 0:00:00.080496, 205252 pkts/sec analyzed
trace file elapsed time: 0:03:21.754962
TCP connection info:
200 TCP connections traced:
TCP connection 1:
host c: myhost.local:44781
host d: willers.local:22
complete conn: RESET (SYNs: 2) (FINs: 1)
first packet: Tue May 31 22:52:24.154801 2011
last packet: Tue May 31 22:52:25.668430 2011
elapsed time: 0:00:01.513628
total packets: 73
filename: willers.pcap
c->d: d->c:
total packets: 34 total packets: 39
resets sent: 4 resets sent: 0
ack pkts sent: 29 ack pkts sent: 39
pure acks sent: 11 pure acks sent: 2
sack pkts sent: 0 sack pkts sent: 0
dsack pkts sent: 0 dsack pkts sent: 0
max sack blks/ack: 0 max sack blks/ack: 0
unique bytes sent: 2512 unique bytes sent: 14336
actual data pkts: 17 actual data pkts: 36
actual data bytes: 2512 actual data bytes: 14336
rexmt data pkts: 0 rexmt data pkts: 0
rexmt data bytes: 0 rexmt data bytes: 0
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 0 outoforder pkts: 0
pushed data pkts: 17 pushed data pkts: 33
SYN/FIN pkts sent: 1/1 SYN/FIN pkts sent: 1/0
req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
adv wind scale: 6 adv wind scale: 1
req sack: Y req sack: Y
sacks sent: 0 sacks sent: 0
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
urgent data bytes: 0 bytes urgent data bytes: 0 bytes
mss requested: 1460 bytes mss requested: 1460 bytes
max segm size: 792 bytes max segm size: 1448 bytes
min segm size: 16 bytes min segm size: 32 bytes
avg segm size: 147 bytes avg segm size: 398 bytes
max win adv: 40832 bytes max win adv: 66608 bytes
min win adv: 5888 bytes min win adv: 66608 bytes
zero win adv: 0 times zero win adv: 0 times
avg win adv: 14035 bytes avg win adv: 66608 bytes
initial window: 32 bytes initial window: 40 bytes
initial window: 1 pkts initial window: 1 pkts
ttl stream length: 2512 bytes ttl stream length: NA
missed data: 0 bytes missed data: NA
truncated data: 0 bytes truncated data: 0 bytes
truncated packets: 0 pkts truncated packets: 0 pkts
data xmit time: 1.181 secs data xmit time: 1.236 secs
idletime max: 196.9 ms idletime max: 196.9 ms
throughput: 1660 Bps throughput: 9471 Bps
RTT samples: 18 RTT samples: 24
RTT min: 43.8 ms RTT min: 0.0 ms
RTT max: 142.5 ms RTT max: 7.2 ms
RTT avg: 68.5 ms RTT avg: 0.7 ms
RTT stdev: 35.8 ms RTT stdev: 1.6 ms
RTT from 3WHS: 80.8 ms RTT from 3WHS: 0.0 ms
RTT full_sz smpls: 1 RTT full_sz smpls: 3
RTT full_sz min: 142.5 ms RTT full_sz min: 0.0 ms
RTT full_sz max: 142.5 ms RTT full_sz max: 0.0 ms
RTT full_sz avg: 142.5 ms RTT full_sz avg: 0.0 ms
RTT full_sz stdev: 0.0 ms RTT full_sz stdev: 0.0 ms
post-loss acks: 0 post-loss acks: 0
segs cum acked: 0 segs cum acked: 9
duplicate acks: 0 duplicate acks: 1
triple dupacks: 0 triple dupacks: 0
max # retrans: 0 max # retrans: 0
min retr time: 0.0 ms min retr time: 0.0 ms
max retr time: 0.0 ms max retr time: 0.0 ms
avg retr time: 0.0 ms avg retr time: 0.0 ms
sdv retr time: 0.0 ms sdv retr time: 0.0 ms
================================
You can always stick a shell loop around a program like iperf. Also, assuming iperf (or a program like ttcp) can read from stdin, a shell loop could cat a file N times into it.
If you want a program which sends a file, waits for your binary response, and then sends another copy of the file, you probably are going to need to code that yourself.
You will need to measure the time in the client application for a roundtrip time, or monitor the network traffic going from, and coming to, the client to get the complete time interval. Measuring the time at the server will exclude any kernel level delays in the server and all the network transmission times.
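A minimal client-side round-trip timer could be sketched like this (the host, port, and expected response length are assumptions about your protocol):

```python
import socket
import time

def measure_rtt(host, port, payload, resp_len):
    """Send payload once, then time how long until resp_len response
    bytes arrive. Returns (seconds, response_bytes)."""
    with socket.create_connection((host, port)) as s:
        # Disable Nagle so the measurement isn't skewed by send coalescing
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        t0 = time.perf_counter()
        s.sendall(payload)
        received = b""
        while len(received) < resp_len:
            chunk = s.recv(resp_len - len(received))
            if not chunk:            # peer closed early
                break
            received += chunk
        return time.perf_counter() - t0, received
```

Call it in a loop against your server to gather RTT samples at whatever request rate you need.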
Note that TCP performance will go down as the load goes up. If you're going to test under heavy load, you need professional tools that can scale to thousands (or even millions in some cases) of new connection/second or concurrent established TCP connections.
I wrote an article about this on my blog (feel free to remove if this is considered advertisement, but I think it's relevant to this thread): http://synsynack.wordpress.com/2012/04/09/realistic-latency-measurement-in-the-application-layers
As a very simple highlevel tool netcat comes to mind ... so something like time (nc hostname 1234 < input.binary | head -c 100) assuming the response is 100 bytes long.
