Is it slow to publish to no subscribers relative to one subscriber with ZeroMQ? - zeromq

Is it significantly slower to publish when there is one subscriber vs. no subscribers at all?
More Detail:
We're writing a ZeroMQ application where speed is very important. We have many nodes that communicate via REQ/REP as well as PUB/SUB, and the network automatically selects the { ipc: | tcp: } transport-class if the nodes are on the same machine.
We'd like to sometimes log the messages between certain nodes. With PUB/SUB this is easy: we just have a "logging node" subscribed to the publisher. However, with REQ/REP we cannot read the request/response without becoming a proxy or otherwise slowing down the connection.
We're considering having every node that uses REQ/REP also publish to a unique TCP address each time it sends a message ( so each node has a "logging address" that it sends all its messages to ). We'd then subscribe to the "logging addresses" we're interested in whenever we want to log.
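A minimal pyzmq sketch of that idea ( endpoints and the helper function are hypothetical, not our actual code ): each node keeps a PUB socket bound to its own "logging address" and mirrors every REQ/REP payload onto it; if no logger is subscribed, the mirrored copies are simply dropped.
import zmq

ctx = zmq.Context.instance()

req = ctx.socket(zmq.REQ)
req.connect("ipc:///tmp/service.ipc")             # the real REQ/REP peer (a REP service is assumed)

log_pub = ctx.socket(zmq.PUB)
log_pub.bind("ipc:///tmp/node42.logging.ipc")     # this node's "logging address"

def send_request(payload: bytes) -> bytes:
    log_pub.send_multipart([b"REQ", payload])     # mirrored copy; dropped if nobody is subscribed
    req.send(payload)
    reply = req.recv()                            # blocks until the REP peer answers
    log_pub.send_multipart([b"REP", reply])
    return reply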
Question:
Will we suffer a performance penalty if we are NOT subscribed to the "logging address"? A slowdown while logging is okay, but a performance penalty during normal operation is not desirable.

How does the subscription work?
Until v3.1, the subscription mechanics ( a.k.a. the TOPIC-filter ) were handled on the SUB-side, so this part of the processing got distributed among all the SUB-s ( at the cost of uniformly wide data-traffic across all the transport-classes involved ), and there was no penalty on the PUB-side, except for sourcing such data-flow related workload ( ref. below ).
Since v3.1, the TOPIC-filter is processed on the PUB-side, at the cost of that processing overhead, but saving all the transport capacity that was previously wasted just to find out on the SUB-side that a message does not match the TOPIC-filter and will be disposed of.
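A minimal pyzmq sketch ( endpoints are illustrative ) of where the TOPIC-filter is declared; on v3.1+ libzmq propagates it to the PUB-side, so non-matching messages are filtered before they ever hit the wire:
import time
import zmq

ctx = zmq.Context.instance()

pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")

sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"logs.")        # the TOPIC-filter: only "logs."-prefixed messages pass

time.sleep(0.5)                                # allow the subscription to propagate ("slow joiner")

pub.send(b"logs. something worth keeping")     # matches the filter -> delivered
pub.send(b"data. something else")              # does not match -> filtered ( PUB-side on v3.1+ )

print(sub.recv())                              # prints only the matching message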
Quantitative metric for what "significantly slower" indeed means in-vivo
As postulated in the Question, the comparison ought to be related to:
Scenario A: a PUB-process has no SUB-consumer connected / subscribed to any TOPIC-filter
Scenario B: a PUB-process has one SUB-consumer connected / subscribed to a TOPIC-filter
ZeroMQ has a state-full internal FSA, which saves both programming architecture and resource utilisation. This said, Scenario A produces zero workload, i.e. it has no impact related to PUB-processing, as no such processing actually happens until a first real SUB connects.
If your Scenario B does indeed represent the use-case, the additional processing overhead related to serving just one single SUB-consumer is easily measurable:
import zmq
from zmq import Stopwatch as StopWATCH    # a libzmq-backed microsecond stopwatch

aStopWATCH = StopWATCH()
# s_PUB_send is assumed to be an already .bind()-ed PUB socket
# -----------------------------------------------------------------<TEST_SECTION>-start
aStopWATCH.start();s_PUB_send.send( b"This is a MESSAGE measured for 0 SUB-s", zmq.NOBLOCK );t0 = aStopWATCH.stop()
# -----------------------------------------------------------------<TEST_SECTION>-end
# .connect() the first SUB-process and let it .setsockopt() for v3.1+ accordingly
# -----------------------------------------------------------------<TEST_SECTION>-start
aStopWATCH.start();s_PUB_send.send( b"This is a MESSAGE measured for 1 SUB-s", zmq.NOBLOCK );t1 = aStopWATCH.stop()
# -----------------------------------------------------------------<TEST_SECTION>-end
print( "\nZeroMQ has consumed {0:} [us] for PUB-side processing on [Scenario A]"
       "\nZeroMQ has consumed {1:} [us] for PUB-side processing on [Scenario B]".format( t0, t1 ) )
The same test might be re-used to measure the difference for a SUB-consumer that is .connect()-ed ( so the FSA knows about the live counterparty ) but subscribed to nothing specific ( .setsockopt( zmq.SUBSCRIBE, "" ) ), irrespective of the actually used { pre-v3.1 | v3.1+ } API ( just be careful to handle different API versions in distributed-systems, where one cannot enforce uniform API-versions for remote nodes that are outside of one's Configuration Management control ). A SUB-side sketch follows.
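A sketch ( assumed endpoint ) of the SUB side used in Scenario B above: connect, declare the TOPIC-filter, and keep draining messages so the PUB-side Stopwatch numbers reflect a live, subscribed counterparty.
import zmq

ctx = zmq.Context.instance()

s_SUB_recv = ctx.socket(zmq.SUB)
s_SUB_recv.connect("tcp://127.0.0.1:5557")           # must match the PUB-side .bind()
s_SUB_recv.setsockopt(zmq.SUBSCRIBE, b"")            # empty TOPIC-filter == receive everything

while True:
    print(s_SUB_recv.recv())                         # keep draining so the HWM never builds up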
And if performance is already bleeding?
One may further fine-tune performance attributes in projects that are already performance-constrained.
For selected processing tasks whose performance one may guess a-priori is not so critical, one may segregate the workload-streams' processing by mapping each one onto disjunct sub-sets of the multiple created I/O-threads:
s_REQ_sock.setsockopt( ZMQ_AFFINITY,   0 )        # 0 == no specific affinity, any I/O-thread may serve it
s_PUB_send.setsockopt( ZMQ_AFFINITY,   1 )        # bitmask: pin onto I/O-thread 0
s_SUB_recv.setsockopt( ZMQ_AFFINITY, ... )        # resp. pin onto (an)other I/O-thread(s)
s_SUB_recv.setsockopt( ZMQ_MAXMSGSIZE, 32000 )    # protective ceiling
s_SUB_recv.setsockopt( ZMQ_CONFLATE,   True )     # retain just the last msg
s_SUB_recv.setsockopt( ZMQ_LINGER,     0 )        # avoid blocking on close
s_SUB_recv.setsockopt( ZMQ_TOS,        anAppToS_NETWORK_PRIO_CODE )   # network ToS priority code
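The calls above use the C-level ZMQ_* names; a pyzmq sketch of the wiring they assume is below ( endpoints and values are illustrative ). Note that ZMQ_AFFINITY only has an effect when the Context owns more than one I/O-thread:
import zmq

ctx = zmq.Context(io_threads=2)                  # two I/O-threads, so affinity masks matter

s_PUB_send = ctx.socket(zmq.PUB)
s_PUB_send.setsockopt(zmq.AFFINITY, 1)           # bitmask: serve this socket from I/O-thread 0
s_PUB_send.setsockopt(zmq.LINGER, 0)             # do not block on close with undelivered msgs
s_PUB_send.bind("tcp://127.0.0.1:5557")

s_SUB_recv = ctx.socket(zmq.SUB)
s_SUB_recv.setsockopt(zmq.AFFINITY, 2)           # bitmask: serve this socket from I/O-thread 1
s_SUB_recv.setsockopt(zmq.MAXMSGSIZE, 32000)     # protective ceiling on inbound message size
s_SUB_recv.setsockopt(zmq.CONFLATE, 1)           # retain just the last message
s_SUB_recv.setsockopt(zmq.SUBSCRIBE, b"")
s_SUB_recv.connect("tcp://127.0.0.1:5557")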

Related

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an Apache Beam pipeline to GCP Dataflow in a DEV environment and everything worked well. Then I deployed it to production in a Europe environment (to be specific - job region: europe-west1, worker location: europe-west1-d) where we get high data velocity, and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds to release events sooner than the end of session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey step. In this step there are thousands of events under the droppedDueToLateness counter, and the dataFreshness keeps increasing (it has been increasing since I deployed). All steps before this one operate well, and all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K and 200K keys per second (depending on the time of day), which seems like quite a lot to me. The CPU utilization doesn't go over 70%, and I am using Streaming Engine. The number of workers is 2 most of the time. Max worker memory capacity is 32GB while the max worker memory usage currently stands at 23GB. I am using the e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. But on the other hand, a session is usually related to a single customer, so Google should be expected to handle this amount of keys per second, no?
I would like to get suggestions on how to solve the dataFreshness and droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event))
.withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
.apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
.apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
.discardingFiredPanes()
.withAllowedLateness(Duration.standardDays(30)))
.apply("GroupPairsByKey", GroupByKey.create())
.apply("CreateCollectionOfValuesOnly", Values.create())
.apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
Regarding the constantly increasing data freshness: as long as late data is allowed to arrive at a session window, that specific window will persist in memory. This means that allowing 30 days of late data will keep every session in memory for at least 30 days, which obviously can overload the system. Moreover, I found we had some ever-lasting sessions created by bots visiting and taking actions on the websites we are monitoring. These bots can hold sessions forever, which can also overload the system. The solution was decreasing the allowed lateness to 2 days and using bounded sessions (look for "bounded sessions").
Regarding events dropped due to lateness: these are events that, at their time of arrival, belong to an expired window - a window whose end the watermark has already passed (see the documentation for droppedDueToLateness). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it goes into the sessions part and to stream into the session part only events that won't be dropped - events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event if its timestamp is before event_arrival_time - (gap_duration + allowed_lateness), even if there is a live session this event belongs to...) A small sketch of this check follows.
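The pipeline itself is Java, but the admission test described above is plain timestamp arithmetic; an illustrative Python sketch of the same condition ( names are hypothetical, gap and lateness as described above ):
from datetime import datetime, timedelta

GAP_DURATION     = timedelta(minutes=30)
ALLOWED_LATENESS = timedelta(days=2)

def goes_to_session_branch(event_timestamp: datetime, arrival_time: datetime) -> bool:
    """True if the event would NOT be droppedDueToLateness by the session window."""
    return event_timestamp >= arrival_time - (GAP_DURATION + ALLOWED_LATENESS)

# events failing this test are written to BigQuery directly, without session data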
P.S. - in the bounded sessions part, where the author demonstrates how to implement a time-bounded session, I believe there is a bug allowing a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects this session and precedes it, making the session's start time earlier and thereby expanding the session. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
In order to fix that, I switched the order of the current window span and the if-statement, and edited the if-statement (the one checking the session max size) in the mergeWindows function in the window-spanning part, so a session can't pass the max size and can only receive data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
    List<IntervalWindow> sortedWindows = new ArrayList<>();
    for (IntervalWindow window : c.windows()) {
        sortedWindows.add(window);
    }
    Collections.sort(sortedWindows);
    List<MergeCandidate> merges = new ArrayList<>();
    MergeCandidate current = new MergeCandidate();
    for (IntervalWindow window : sortedWindows) {
        MergeCandidate next = new MergeCandidate(window);
        if (current.intersects(window)) {
            if ((current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis())) {
                current.add(window);
                continue;
            }
        }
        merges.add(current);
        current = next;
    }
    merges.add(current);
    for (MergeCandidate merge : merges) {
        merge.apply(c);
    }
}

MQ slow dequeuing rates on a XMITQ

We have been facing an issue where the message rate of a XMITQ is very slow compared with what should be normal performance.
We have many other Qmgrs with bigger MQ flows that don't have the same issue.
Our HUB qmgr connects to a business line's HUB qmgr in the same company, and even with the destination queues on their side being empty, the flow is really slow.
At the OS and network level they say nothing can be done. At the MQ level we have changed the buffer sizes so they match the OS level and use the system TCP windows.
Now at the MQ level we have the channel SDR set up with BATCHSZ 100, but it seems the receiver is configured with 30.
We noticed that because we see messages flow in batches of 30 messages. Also, not sure if it is related, but we see the XMITQ always has 30 uncommitted messages.
Our questions for advice:
Would increasing the BATCHSZ parameter on SDR/RCVR help the performance?
Is there any other parameter at MQ level that could help it?
DIS CHS(NAME) ALL
AMQ8417: Display Channel Status details.
CHANNEL(QMGRA.QMGRB.T7) CHLTYPE(SDR)
BATCHES(234) BATCHSZ(30)
BUFSRCVD(235) BUFSSENT(6391)
BYTSRCVD(6996) BYTSSENT(14396692)
CHSTADA(2020-04-16) CHSTATI(14.38.17)
COMPHDR(NONE,NONE) COMPMSG(NONE,NONE)
COMPRATE(0,0) COMPTIME(0,0)
CONNAME(159.50.69.38(48702)) CURLUWID(398F3E5EEA43381C)
CURMSGS(30) CURRENT
CURSEQNO(43488865) EXITTIME(0,0)
HBINT(300) INDOUBT(YES)
JOBNAME(000051FC00000001) LOCLADDR(10.185.8.122(54908))
LONGRTS(999999999) LSTLUWID(398F3E5EE943381C)
LSTMSGDA(2020-04-16) LSTMSGTI(14.49.46)
LSTSEQNO(43488835) MCASTAT(RUNNING)
MONCHL(HIGH) MSGS(6386)
NETTIME(2789746,3087573) NPMSPEED(NORMAL)
RQMNAME(QMGRB) SHORTRTS(10)
SSLCERTI(*******************)
SSLKEYDA( ) SSLKEYTI( )
SSLPEER(*******************)
SSLRKEYS(0) STATUS(RUNNING)
STOPREQ(NO) SUBSTATE(RECEIVE)
XBATCHSZ(23,7) XMITQ(QMGRB.X7)
XQTIME(215757414,214033427) RVERSION(08000008)
RPRODUCT(MQMM)
qm.ini:
Log:
LogPrimaryFiles=10
LogSecondaryFiles=10
LogFilePages=16384
LogType=LINEAR
LogBufferPages=4096
LogPath=/apps/wmq/QMGR/log/QMGR/
LogWriteIntegrity=SingleWrite
Service:
Name=AuthorizationService
EntryPoints=13
TCP:
SvrSndBuffSize=0
SvrRcvBuffSize=0
ServiceComponent:
Service=AuthorizationService
Name=MQSeries.UNIX.auth.service
Module=/opt/mqm75/lib64/amqzfu
ComponentDataSize=0
Channels:
MaxChannels=500
UPDATED: 15:41 GMT
Just to update the information, both sides are now with BATCHSZ 100 and it seems slightly better.
AMQ8417: Display Channel Status details.
CHANNEL(QMGRA.QMGRB.T7) CHLTYPE(SDR)
BATCHES(403) BATCHSZ(100)
BUFSRCVD(405) BUFSSENT(23525)
BYTSRCVD(11756) BYTSSENT(53751066)
CHSTADA(2020-04-17) CHSTATI(15.13.51)
COMPHDR(NONE,NONE) COMPMSG(NONE,NONE)
COMPRATE(0,0) COMPTIME(0,0)
CONNAME(159.50.69.38(48702)) CURLUWID(6D66985E94343410)
CURMSGS(0) CURRENT
CURSEQNO(44115897) EXITTIME(0,0)
HBINT(300) INDOUBT(NO)
JOBNAME(0000172A00000001) LOCLADDR(10.185.8.122(2223))
LONGRTS(999999999) LSTLUWID(6D66985E93343410)
LSTMSGDA(2020-04-17) LSTMSGTI(15.30.06)
LSTSEQNO(44115897) MCASTAT(RUNNING)
MONCHL(HIGH) MSGS(23505)
NETTIME(101563,480206) NPMSPEED(NORMAL)
RQMNAME(QMGRB) SHORTRTS(10)
SSLCERTI(*************************************)
SSLKEYDA( ) SSLKEYTI( )
SSLPEER(****************************)
SSLRKEYS(0) STATUS(RUNNING)
STOPREQ(NO) SUBSTATE(MQGET)
XBATCHSZ(1,1) XMITQ(QMGRB.X7)
XQTIME(191225,794134) RVERSION(08000008)
RPRODUCT(MQMM)
AMQ8450: Display queue status details.
QUEUE(QMGRB.X7) TYPE(QUEUE)
CURDEPTH(0) IPPROCS(1)
LGETDATE(2020-04-17) LGETTIME(15.30.06)
LPUTDATE(2020-04-17) LPUTTIME(15.30.06)
MEDIALOG(S2488154.LOG) MONQ(LOW)
MSGAGE(0) OPPROCS(9)
QTIME(794134, 191225) UNCOM(NO)
I'll put a few observations in this answer, but based on any further feedback I may add more.
You are running a very old version of the software on the sender side; MQ 7.5 went out of support almost two years ago (April 30 2018). IBM will, for a cost, provide extended support for an additional three years, so maybe you fall into that group. The 7.5.0.2 maintenance release itself came out on July 11th 2013, so it is almost seven years old at this point. I would strongly suggest you move to a newer version.
Note that MQ v8.0 goes out of support April 30 2020, and IBM just announced a few days ago that MQ v9.0 goes out of support September 30 2021. When you do migrate you should go with either 9.1 which has no announced end of support (they give five years minimum so it could be 2023) or go with the next version of MQ that should be out some time this year.
You mention setting the following:
TCP:
SvrSndBuffSize=0
SvrRcvBuffSize=0
The above settings apply to the SVRCONN end of a client connection. You can see this in the MQ v7.5 Knowledge Center page WebSphere MQ>Configuring>Changing configuration information>Changing queue manager configuration information>TCP, LU62, NETBIOS, and SPX:
SvrSndBuffSize=32768|number
The size in bytes of the TCP/IP send buffer used by the server end of a client-connection
server-connection channel.
SvrRcvBuffSize=32768|number
The size in bytes of the TCP/IP receive buffer used by the server end of a client-connection
server-connection channel.
At IBM MQ v7.5.0.2, APAR IV58073 introduced the concept of setting various buffer settings to a value of 0, which means that the operating system defaults will be used. Unfortunately, like many things in the Knowledge Center, it does not look like IBM documented this correctly for 7.5.
You can however review the IBM MQ v8.0 Knowledge Center to get the full picture regarding these settings on the page Configuring>Changing configuration information>Changing queue manager configuration information>TCP, LU62, and NETBIOS; specifically, you would want to set these two settings to have any impact on your sender channel:
SndBuffSize=number| 0
The size in bytes of the TCP/IP send buffer used by the sending end of
channels. This stanza value can be overridden by a stanza more
specific to the channel type, for example RcvSndBuffSize. If the
value is set as zero, the operating system defaults are used. If no
value is set, then the IBM MQ default, 32768, is used.
RcvSndBuffSize=number| 0
The size in bytes of the TCP/IP send buffer used by the sender end of
a receiver channel. If the value is set as zero, the operating system
defaults are used. If no value is set, then the IBM MQ default, 32768,
is used.
Starting at IBM MQ v8.0 any newly created queue manager will have all of the following in the qm.ini:
TCP:
SndBuffSize=0
RcvBuffSize=0
RcvSndBuffSize=0
RcvRcvBuffSize=0
ClntSndBuffSize=0
ClntRcvBuffSize=0
SvrSndBuffSize=0
SvrRcvBuffSize=0
However, any queue manager that is upgraded will not get those settings by default; meaning, if those are not present they will not be added, and if they are present they will remain the same. If the setting is not present then, as it says above, "the IBM MQ default, 32768, is used."
I had extensive discussions with IBM support on this topic and came to the conclusion that they did not see any reason to not set it to 0, they only saw benefit in doing so, but with an abundance of caution they do not change it to 0 for you.
I would recommend you add all of those to your qm.ini, but at minimum add the two I highlighted above.
These are good settings to implement, but they may not solve your problem if nothing changed recently on either end. If however something did change, for example a network difference, or MQ was upgraded to 8.0.0.8 on the remote side, then this setting just might solve your problem.
In the channel status output two values are interesting:
NETTIME(2789746,3087573)
XQTIME(215757414,214033427)
NETTIME means that based on recent activity it took 2.7 seconds to receive a response from the RCVR channel, and over a longer period of time it took 3.1 seconds to receive a response from the RCVR channel. Can you compare this to a TCP ping from the sender channel server to the receiver channel server? 2.7 seconds for a response over the network seems excessive. In the presentation Keeping MQ Channels Up and Running, given at Capitalware's MQ Technical Conference v2.0.1.4, Paul Clarke, who used to work for IBM, states "NETTIME only measures network time, and does not include the MQCMIT for example".
XQTIME means that based on recent activity and over a longer period of time it took ~215 seconds for a message on the XMITQ to be picked up by the SDR channel to be sent.
See below for how IBM documents these:
NETTIME
Amount of time, displayed in microseconds, to send a request to the remote end of the channel and receive a response. This time only measures the network time for such an operation. Two values are displayed:
A value based on recent activity over a short period.
A value based on activity over a longer period.
XQTIME
The time, in microseconds, that messages remained on the transmission queue before being retrieved. The time is measured from when the message is put onto the transmission queue until it is retrieved to be sent on the channel and, therefore, includes any interval caused by a delay in the putting application.
Two values are displayed:
A value based on recent activity over a short period.
A value based on activity over a longer period.
Information on the BATCHSZ channel parameter can be found in the IBM MQ v8.0 Knowledge Center page Reference>Configuration reference>Channel attributes>Channel attributes in alphabetical order>Batch size (BATCHSZ). I have quoted it and highlighted a few areas in bold.
This attribute is the maximum number of messages to be sent before a sync point is taken.
The batch size does not affect the way the channel transfers messages; messages are always transferred individually, but are committed or backed out as a batch.
To improve performance, you can set a batch size to define the maximum number of messages to be transferred between two sync points. The batch size to be used is negotiated when a channel starts, and the lower of the two channel definitions is taken. On some implementations, the batch size is calculated from the lowest of the two channel definitions and the two queue manager MAXUMSGS values. The actual size of a batch can be less; for example, a batch completes when there are no messages left on the transmission queue or the batch interval expires.
A large value for the batch size increases throughput, but recovery times are increased because there are more messages to back out and send again. The default BATCHSZ is 50, and you are advised to try that value first. You might choose a lower value for BATCHSZ if your communications are unreliable, making the need to recover more likely.
This attribute is valid for channel types of:
Sender
Server
Receiver
Requester
Cluster sender
Cluster receiver
Follow up questions:
Are the messages that are PUT to this XMITQ persistent?
Answer: Yes, in our PROD env all messages are persistent.
Have you had a recent increase in volume going to this XMITQ?
Answer: No. We use a monitoring tool; we extracted a report that shows a very similar message rate during the period - the same rate over the last 2 weeks.
Do the putting applications set MQPMO_SYNCPOINT and then commit after 1 or more messages are PUT to the queue?
Answer: I will check with the application team.
A couple of things..
You have XBATCHSZ(1,1), so your recent batch size is 1 message per batch.
Total messages 23505, batches 403, so an average of 58 messages per batch. If your recent batch size is 1, then you must have had some larger (100?) batch sizes.
XQTIME 191225 is the number of microseconds messages were on the xmit queue before being sent. This is about 0.2 seconds!
NETTIME 101563 microseconds. This is a long time (0.1 seconds); 10,000 would be a good value. Compare this with a "TCP PING".
BUFSSENT 23525 is similar to the number of messages, so the message size is typically under 32K. BYTSSENT / MSGS gives 2286, so these are small messages. (The arithmetic is spelled out below.)
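For reference, the arithmetic behind these observations, using the values from the second DIS CHS / queue status output above ( a quick Python check ):
msgs, batches, bytes_sent = 23505, 403, 53751066     # MSGS, BATCHES, BYTSSENT
print(msgs / batches)            # ~58 messages per batch on average
print(bytes_sent / msgs)         # ~2287 bytes per message -> small messages
print(191225 / 1_000_000)        # XQTIME short-term value in seconds (~0.19 s)
print(101563 / 1_000_000)        # NETTIME short-term value in seconds (~0.10 s)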
Things to check
The queue at the remote end. Has it filled up? This would cause the sender's transmission queue to accumulate more messages.
The NETTIME seems very long. Compare this with a TCP ping. NETTIME can include slow I/O at the remote end, or a full queue at the remote end.
XQTIME is high. This could be caused by sending applications not committing, or by slow disk I/O.
I wrote "Why is my xmit queue filling up" in this blog
*Search for the title
have a read.
Capture these metrics over a day and see if they are typical
regards
Colin Paice

Why does ZeroMQ not receive a string when it becomes too large on a PUSH/PULL MT4 - Python setup?

I have an EA set up in place that loops over history trades and builds one large string with trade information. I then send this string every second from MT4 to the Python backend using a plain PUSH/PULL pattern.
For whatever reason, the data isn't received on the PULL side when the transferred string becomes too long. The backend PULL-socket slices each string and further processes it.
Any chance that the PULL-side is too slow to grab and process all the data which then causes an overflow (so that a delay arises due to the processing part)?
Talking about sizes, we are well below 5 KB per second.
This is the PULL-socket, which manipulates the data after receiving it:
import zmq
from time import sleep

# zmq_socket is assumed to be an already-bound PULL socket with a receive timeout set
while True:
    # check 24/7 for available data in the pull socket
    try:
        msg = zmq_socket.recv_string()
        data = msg.split("|")
        print(data)
        # if data is available and msg is account info, handle as follows
        if data[0] == "account_info":
            [...]
    except zmq.error.Again:
        print("\nResource timeout.. please try again.")
        sleep(0.000001)
I am a bit curious now, since the PULL socket seems to not even be able to process a string containing 40 trades with their according information on a single MT4 client - Python connection. I actually planned to set it up to handle more than 5,000 MT4 client - Python backend connections at once.
Q : Any chance that the pull side is too slow to grab and process all the data which then causes an overflow (so that a delay arises due to the processing part)?
Zero chance.
Sending 640 B each second is definitely no showstopper ( 5 KB per second is nowhere near a performance ceiling... )
The posted problem formulation is otherwise undecidable.
Step 1) Prove ( POSACK/NACK ) whether the PUSH side accepts the payload for sending error-free.
Step 2) Prove the PULL side is not to be blamed - run [ PUSH.send( 640*chr( 64+i ) ) for i in range( 10 ) ] via a python-to-python tcp:// transport-class solo-channel crossing a host-to-host hop, over at least your local physical network ( no VMCI/emulated vLAN, no other localhost colocation ) - a minimal probe is sketched after this list.
Step 3) If both steps above got POSACK-ed, your next chances are the ZeroMQ configuration space and/or an MT4-based PUSH-side incompatibility, most probably "hidden" inside a ( not mentioned ) third-party ZeroMQ wrapper being used / first-party issues with string handling / processing ( which you must have already read about, as it has been observed and mentioned so many times in past posts about this trouble with well-"hidden" MQL4 internal eco-system changes ).
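A minimal python-to-python probe for Step 2 ( a sketch only; the port and addresses are placeholders ): run the pull role on one host first, then the push role on the other host, pointing at the first host's address.
import sys
import zmq

ctx = zmq.Context.instance()

if sys.argv[1] == "pull":                            # host A:  python probe.py pull
    sock = ctx.socket(zmq.PULL)
    sock.bind("tcp://0.0.0.0:5555")
    for _ in range(10):
        print(len(sock.recv_string()))               # each line should report 640
else:                                                # host B:  python probe.py push <hostA-ip>
    sock = ctx.socket(zmq.PUSH)
    sock.connect("tcp://%s:5555" % sys.argv[2])
    for i in range(10):
        sock.send_string(640 * chr(64 + i))          # 640-character payloads, as in Step 2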
Anyway, stay tuned. ZeroMQ is a sure bet and a true workhorse for professional and low-latency designs in the distributed-systems domain.

How many bytes are used for longer string when sending via ZMQ?

I'm using ZeroMQ / ZMQ from Python and Java and have a question. When sending a shorter string, ZMQ uses one byte as described here (http://zguide.zeromq.org/page:all#A-Minor-Note-on-Strings)
Then what goes onto the wire is a length (one byte for shorter
strings) and the string contents as individual characters.
Does anyone know how many bytes are used when sending a longer string?
How many bytes are used for longer string when sending via ZMQ?
That depends on many more things than just the string itself:
Your post refers to an indeed historical text - the zguide pages.
While this was surely a very helpful first-read source in the early days of ZeroMQ v2.x, today we live with distributed-systems spanning many versions, from v2.1+, 3.x, 4.x, with 4.2 being the last stable API version so far in 2018-Q2.
No one can a priori guess which API-version was used on the message-sender's side until a receiver actually sets up / accepts the link and .recv()-s the respective message. Relying on the C-lang based s_recv()-helper tricks in the post-v4.0 API is not a sure direction to follow.
Being in Python, many protocol-hardwired details remain beyond your sight, yet there are byte-maps that get filled under the hood exactly as the benevolent dictatorship, indoctrinated in the published ZeroMQ RFC/ZMTP specifications, dictates.
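One thing that can be pinned down locally ( the remote peer's version still has to be agreed upon out-of-band ) is which libzmq and binding versions are actually in play:
import zmq

print(zmq.zmq_version())      # the libzmq version, which decides the wire-level ZMTP framing
print(zmq.pyzmq_version())    # the python binding's own version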
If we cannot guess or know beforehand, can we ... ?
Yes, we can experiment. Best setup a controlled distributed-system experiment.
Node A : The Sender
can be pythonic, being a sender:
- setup a REQ-archetype AccessNode ( details in "ZeroMQ Hierarchy in less than a five seconds" ),
- setup .setsockopt( zmq.IDENTITY, ... ) with a randomness-generated static identity,
- setup a .setsockopt( zmq.REQ_RELAXED, 1 ),
- .bind() it to a known endpoint of a transport-class of one's choice
- start an xrange()-generator controlled for L in xrange( 1, int( 1E+9 ) )-loop of .send()-s
- there load a payload .send( r"{0:}|{1:}".format( str( L ), L * r"*" ) )
- handle the respective .recv() for a REP-side "answer",
- check the error-states and adapt the time.sleep()-s / re-loops according to the sender-side socket capacity and capability to send further payloads ( a sketch of this sender follows )
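A pyzmq sketch of such a Node A sender ( endpoint, identity and the loop bound are illustrative; note the .recv() only returns once the Node B side actually answers, so a pure sniffer setup may need zmq.NOBLOCK / polling instead ):
import random
import time
import zmq

ctx = zmq.Context.instance()

req = ctx.socket(zmq.REQ)
req.setsockopt_string(zmq.IDENTITY, "nodeA-%08x" % random.getrandbits(32))  # randomness-generated static identity
req.setsockopt(zmq.REQ_RELAXED, 1)                   # relax the strict send/recv alternation
req.bind("tcp://0.0.0.0:5557")                       # known endpoint for Node B to .connect() to

for L in range(1, 1000):                             # capped here; the text loops up to int( 1E+9 )
    req.send_string("{0:}|{1:}".format(L, L * "*"))  # payload: the length L, then L asterisks
    reply = req.recv()                               # the REP-side "answer"
    time.sleep(0.001)                                # adapt pacing to the sender-side capacity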
Node B : The Receiver | The MitM-sniffer
ought to be as low-level as possible, so as to disassemble the RFC/ZMTP wire-line protocol; Python therefore drops out of the candidate list. Other options may include wire-level sniffer(s), if the selected transport-class permits it ( an ipc:// or a vmci:// one will not )
setup a ROUTER-archetype AccessNode,
.connect() it to the known Node A's transport-class endpoint used,
start .recv()-ing messages,
If your experiment has correctly acquired / sniffed the wire-level details about the ZMTP-compliant transport sizes of the known payload compositions, your question gets repeatable, verifiable, quantitatively correct records on the string-size to message-size mapping-function.
A BONUS POINT, for those indeed interested: next, re-run the controlled white-box distributed-system experiment above, now with the Node A ( The Sender ) side extending its behaviour to also { randomly | deterministically } { change | alter } its own configuration ( or map both such options onto a pair of the same payload re-.send()-s ) with a .setsockopt( zmq.REQ_CORRELATE, { on | off } ) inside its for-loop, and record the observed changes in the expected outputs.
This adds a final touch to the definitive answer, as far as the API v.4.2.x permits in the 2018-Q2.

RXTXLostPackets count is non zero even when allowTxDuringRx=false

I am using Veins 4.6 with SUMO 0.30 and OMNeT++ 5.1.1 on Ubuntu 14.04. I have created a custom network with a cross (one intersection with 4 roads) and ran the simulation with 200 vehicles. I did not observe this behaviour with 4 vehicles. I have seen it with 50 vehicles too. I need to get the count of total lost packets for my master's project. So I was looking at statistics and found that RXTXLostPackets is not zero. As far as I understood from the documentation, it should be zero if allowTxDuringRx=false. The default is false (PhyLayer80211p.ned). As I have not changed any code yet, I was confused whether this is expected behaviour.
What I have done so far.
In Mac1609_4::handleLowerControl, statsTXRXLostPackets is updated when Decider80211p responds with RECWHILESEND.
In Decider80211p::processSignalEnd, if the value of whileSending is true, RECWHILESEND is sent to the MAC layer as a control message.
In Decider80211p::processSignalEnd, if (frame->getWasTransmitting() || phy11p->getRadioState() == Radio::TX), the frame is considered as received while sending and whileSending is set to true.
The wasTransmitting variable is set to true in the Decider80211p::switchToTx and Decider80211p::processNewSignal functions.
currentFrame->setWasTransmitting(true);
currentFrame->setBitError(true);
in Decider80211p::processNewSignal:
if (phy11p->getRadioState() == Radio::TX ) {
frame->setBitError(true); --> tried disabling both these values and the RXTXLostPackets was zero.
frame->setWasTransmitting(true);
DBG_D11P << "AirFrame: " << frame->getId() << " (" << recvPower << ") received, while already sending. Setting BitErrors to true" << std::endl;
}
There is one thread with a similar issue, with the fix of adding these lines in the processSignalEnd function. But it looks like Veins 4.6 does not use curSyncFrame anymore.
Veins - Unexpected behavior with lost packets in certain vehicles
if (!frame->getWasTransmitting()){
curSyncFrame = 0;
}
I could not clearly understand the issue. The code and configuration files I have used are here. https://github.com/Rajeswar59/veins_learning.
Can anyone please take a look and help me with this. Thanks in advance.
edit: I went through the logs. This is what I could understand as of now.
https://drive.google.com/open?id=0BzjDW8PQhkSmSEUtZ2lpcld4ZXc --> some portion of logs are here.
---> order of sending
#13332 0.247987176594 node[30] --> node[48] id=22266
#13375 0.247987796864 node[18] --> node[20] id=22447
#13384 0.247987864534 node[20] --> node[30] id=22573
From the logs I have concentrated on node 18. Two nodes that transmitted before 30 are 32 and 4. These 2 messages are received successfully by all 3 nodes. When a message arrives, the decider tries to set the channel state as busy in processNewSignal and sets it idle after processing the packet. This calls the Mac1609_4.cc channelBusy and channelIdle functions respectively, so the channelIdle variable is set accordingly. Also, if the channel is to be set busy, it will stop contention and calculate currentBackoff if any packet is waiting to be transmitted. If the channel is being set idle at the end of reception, startContent is called. Based on this only, the lastIdle variable is set, which is used to calculate nextMacEvent. So when the last successful message was received, all the nodes which have a packet to send decide nextMacEvent, and it is sent as a self message in Mac1609_4.cc. On receiving the nextMacEvent self message we start transmitting without checking whether any other node has started a transmission. We cannot identify that, probably because we set the channel busy only when we receive messages after some propagation delay. So between the last successful transmission and nextMacEvent, other nodes also decide to transmit without checking the current channel state. That's why the node has some receive events while sending. As far as my understanding goes, before transmission we should sense the current state of the channel and retry the backoff accordingly. We do not check this at the nextMacEvent. It looks like collision behaviour, but should we not check the current state of the channel when the backoff counter reaches zero and retry? Please correct me if I am wrong anywhere.
Thanks for your patience.
Any help or advice??
My Learnings(probably last update):
After some digging, these are my learnings, in case they help someone. The basic CSMA mechanism says that before attempting a transmission, the node has to sense the channel, initiate transmission if the channel is sensed idle for the AIFS time, or go into backoff if the channel is busy. In Veins the channel-busy status is stored in the idleChannel variable, whose status is checked in the Mac1609_4::channelBusySelf() function before initiating transmission (nextMacEvent in Mac1609_4::handleSelfMsg). The idleChannel variable is updated in the Mac1609_4::channelBusy and Mac1609_4::channelIdle functions when a message reception starts and when it ends, respectively. So when a previously transmitting node sends a packet, all the receiving nodes receive the packet with varying delay, i.e., they start receiving at different times and update their idleChannel variable. After that they calculate the best time to transmit and start transmission. Each node does check whether the channel is idle or not, but because the idleChannel status is only updated at the next reception, and because of the transmission delay between transmission start at the sender and reception start at the receiver, the two transmitting nodes can't see each other's transmission. As far as I understand, this is called a collision: two or more nodes start transmission at (nearly) the same time. So the BitError statistic is set and statsTXRXLostPackets is also set. So while calculating totalLostPackets we can take only one of these two values. The toy timeline below illustrates this race.
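This is not Veins code - just a toy Python timeline ( hypothetical numbers ) of the race described above: a node only learns the channel is busy after the propagation delay, so two nodes whose backoff expires within that window both see "idle" and transmit.
PROPAGATION_DELAY = 0.000002                    # 2 us, illustrative

transmissions = []                              # list of (start_time, node)

def try_transmit(node, now):
    # a node senses "busy" only for transmissions that started at least one
    # propagation delay earlier; anything newer is not yet visible to it
    busy = any(start <= now - PROPAGATION_DELAY for start, _ in transmissions)
    if not busy:
        transmissions.append((now, node))
    return not busy

print(try_transmit("node[18]", 0.2479877))      # True  - channel looks idle
print(try_transmit("node[20]", 0.2479878))      # True  - still looks idle -> collision
print(try_transmit("node[30]", 0.2479905))      # False - the first transmission is now visible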
