measuring cross process latency on windows - windows

I am building latency measurement into a communication middleware I am building. The way I have it working is that I periodically send a probe msg from my publishing apps. Subscribing apps receive this probe, cache it, and send an echo back at a time of their choosing, noting how much time the msg was kept “on hold”. The subscribing app receives these echos and calculates latency as (now() – time_sent – time_on_hold) / 2.
This kinda works, but the numbers are vastly different (3x) when “time on hold” is greater than 0. I.e if I echo the msg back immediately I get around 50us on my dev env and if I wait, then send the msg back the time jumps to 150us (though I discount whatever time I was on hold). I use QueryPerfomanceCounter for all measurements.
This is all inside a single Windows 7 box. What am I missing here?
TIA.

A bit more information. I am using the following to measure time:
static long long timeFreq;
static struct Init
{
Init()
{
QueryPerformanceFrequency((LARGE_INTEGER*) &timeFreq);
}
} init;
long long OS::now()
{
long long result;
QueryPerformanceCounter((LARGE_INTEGER*)&result);
return result;
}
double OS::secondsDiff(long long ts1, long long ts2)
{
return (double) (ts1-ts2)/timeFreq;
}
On the publish side I do something like:
Probe p;
p.sentTimeStamp = OS::now();
send(p);
Response r = recv();
latency=OS::secondsDiff(OS::now()- r.sentTimeStamp) - r.secondsOnHoldOnReceiver;
And on the receiver side:
Probe p = recv();
long long received = OS::now();
sleep();
Response r;
r.sentTimeStamp = p.timeStamp;
r.secondOnHoldOnReceiver = OS::secondsDiff(OS::now(), received);
send(r);

Ok, I have edited my answer to reflect your answer: Sorry for the delay, but I didn't notice that you had elaborated on the question by creating an answer.
It's seems that functionally you are doing nothing wrong.
I think that when you distribute your application outside of localhost conditions, the additional 100us (if it is indeed roughly constant) will pale into insignificance compared to the average latency of a functioning network.
For the purposes of answering your question am thinking that there is a thread/interrupt scheduling issue on the server side that needs to be investigated, as you do not seem to be doing anything on the client that is not accounted for.
Try the following test scenario:
Send two Probes to clients A and B. (all localhost)
Send the Probe to 'Client B' one second (or X/2 seconds) after you send the probe to Client A.
Ensuring that 'Client A' waits for two seconds (or X seconds) and 'Client B' waits 'one second (or X/2 seconds)
The idea being that hopefully, both clients will send back their probe answers at roughly the same time and both after a sleep/wait (performing the action that exposes the problem). The objective is to try to get one of the clients responses to 'wake up' the publisher to see if the next clients answer will be processed immediately.
If one of these returned probes is not showing the anomily (most likely the second response) it could point to the fact that the publisher thread is waking from a sleep cycle (on recv 1st responce) and is immediately available to process the second response.
Again, if it turns out that the 100us delay is roughly constant, it will be +-10% of 1ms which is the timeframe appropriate for realworld network conditions.

Related

Kotlin coroutines slow start

I've been attempting to do a bit of performance review on an app I have, it's a back end Kotlin app that just pulls in some data, does a bit of data transformation and dumps it out, nothing too fancy. One thing that caught my eye was the final bit of execution where we dump our final data onto a queue, at first I noticed when we start up the app the final network call takes a very long time at first, sometimes over a second. Normally we run this network call in a coroutine to stop that last call blocking everything but I started trying to time the coroutine and the network call separately and got some odd results, from what I can see the coroutine takes can take forever to launch/complete compared to the network call. It's entirely possible I'm not recording things correctly but this is the general timing approach I have:
val coroutineTime - Instant.now().toEpochMillis()
GlobalScope.launch {
executionTime = measureTimeMillis { <--DO Message Sending -->}
totalTime = Instant.now().toEpochMillis() - coroutineTime
// Log out execution Time and total time
}
Now here what I'll see is something like
- totalTime = ~800ms
- executionTime = ~150ms
These aren't one-offs either, I have multiple of these processes going on at once ( up to 10 threads I think) and the first total times will always take significantly longer than the actual executionTime/network call. Eventually after a new dozen messages the overhead will calm down and these times will become equivalent at about 15ms, but having nearly 700ms overhead on coroutine start up seems insane to me.
Is this normal/expected behavior? I've tested this in a separate app and see similar but less extreme results where the first coroutine will take about 70ms to boot up, I'm struggling to find any other examples of this type of discussion outside of kotlin being used in android development.
As a first note, it's almost never a good idea to use the GlobalScope unless you really know what you're doing. This is why it was marked as delicate API. You should instead use a scope that is appropriately closed (following the lifecycle of whatever component launches this work).
Now, AFAIK, this GlobalScope runs on the default dispatcher, so maybe this is due to a cold start of that default thread pool. Later, it could also be a problem to use this dispatcher for network calls depending on the amount of concurrent coroutines you have. It would be more appropriate to use Disptachers.IO instead for IO bound work (or a custom thread pool).
It still doesn't explain the cold start, but I would first change that before investigating.
This is expected behavior if you use coroutines inappropriately ;-)
My guess is that your message sending is a blocking operation. By default GlobalScope.launch() dispatches coroutines with Dispatchers.Default which is designed to perform CPU-intensive operations, it has a limited number of threads and you should never block when using it. If you do you may run out of threads and coroutines will need to wait until some blocking operations will finish.
If you need to run blocking or IO code, you should use Dispatchers.IO instead:
GlobalScope.launch(Dispatchers.IO) {
I was facing similar issue, I have a function that loads some data from shared prefs, makes some calculations on the data (all this done in Dispatcher.Default), and return the result on Dispatcher.Main. I measured how long it took the Coroutine to actually start executing the block inside Dispatchers.Main.launch { } after calculations are done(time from tag2 to tag3 below), and got about 950ms (!!) Here is the function :
fun someName() {
CoroutineScope(Dispatchers.Default).launch {
val time = System.currentTimeMillis()
//load data and calculations
Log.d("tag2", "load and calculations took " + (System.currentTimeMillis() - time))
CoroutineScope(Dispatchers.Main.immediate).launch {
Log.d("tag3", "reached main thread code " + (System.currentTimeMillis() - time))
//do something
Log.d("tag4", "do something took " + (System.currentTimeMillis() - time))
}
}
}
But then I realized this happens while app launch, and main thread is busy creating all the UI, so even with .immediate it takes time until main thread will get to execute the dispatched code... then I tried to run this function after app already started and waiting, and found that from tag2 to tag 3 takes about 1ms (!!) (with .immediate). So looks like when dispatching something on Coroutine, when thread isn't busy it will start immediately

zeromq: ZMQ_CONFLATE==1 does not stop queues from saving old messages

With ZeroMQ and CPPZMQ 4.3.2, I want to drop old messages for all my sockets including
PAIR
Pub/Sub
REQ/REP
So I use m_socks[channel].setsockopt(ZMQ_CONFLATE, 1) on all my sockets before binding/connecting.
Test
However, when I made the following test, it seems that the old messages are still flushed out on each reconnection. In this test,
I use a thread to keep sending generated sinewave to a receiver thread
Every 10 seconds I double the sinewave's frequency
Then after 10 seconds I stop the process
Below is the pseudocode of the sender
// on sender end
auto thenSec = high_resolution_clock::now();
while(m_isRunning) {
// generate sinewave, double the frequency every 10s or so
auto nowSec = high_resolution_clock::now();
if (duration_cast<seconds>(nowSec - thenSec).count() > 10) {
m_sine.SetFreq(m_sine.GetFreq()*2);
thenSec = nowSec;
}
m_sine.Generate(audio);
// send to rendering thread
m_messenger.send("inproc://sound-ear.pair",
(const void*)(audio),
audio_size,
zmq::send_flags::dontwait
);
}
Note that I already use DONTWAIT to mitigate blocking.
On the receiver side I have a zmq::poller_event handler that simply receives the last message on event polling.
In the stop sequence I reset the sinewave frequency to its lowest value, say, 440Hz.
Expected
The expected behaviour would be:
If I stop both the sender and the receiver after 10s when the frequency is doubled,
and I restart both,
then I should see the sinewave reset to 440Hz.
Observed
But the observed behaviour is that the received sinewave is still of the doubled frequency after restarting the communication, i.e., 880Hz.
Question
Am I doing it wrong or should I use some kind of killswitch to force drop all messages in this case?
OK, I think I solved it myself. Kind of.
Actual solution
I finally realized that the behaviour I want is to flush all messages when I stop the rendering. According to the official doc(How can I flush all messages that are in the ZeroMQ socket queue?), this can only be achieved by
set the sockets of both sender's and receiver's ZMQ_LINGER option to 0, meaning to keep nothing on closing those sockets;
closing the sockets on both sender and receiver ends, which also involves bootstrapping pollers and all references to the sockets.
This seems a lot of unnecessary work if I'm to restart rendering my data again, right after the stop sequence. But I found no other way to solve this cleanly.
Initial effort
It seems to me that ZMQ_CONFLATE does not make a difference on PAIR. I really have to tweak high water marks on sender and receiver ends using ZMQ_SNDHWM and ZMQ_RCVHWM.
However, I said "kind of solved" because tweaking HWM in the end is not the optimal solution for a realtime application,
having ZMQ_SNDHWM / ZMQ_RCVHWM set to the minimum "1", we still have a sizable latency in terms of realtime.
Also, the consumer thread could fall into underrun situatioin, i.e., perceivable jitters with the lowest HWM.
If I'm not doing anything wrong, I guess the optimal solution would still be shared memory for my targeted scenario. This is sad because I really enjoyed the simplicity of ZMQ's multicast messaging patternsand hate to deal with thread locking littered everywhere.

Does an OMNET++ / Veins simulation get very slow if both Vehicles and RSUs broadcast messages periodically?

Let me give a brief context first:
I have a scenario where the RSUs will broadcast a fixed message 'RSUmessage' about every TRSU seconds. I have implemented the following code for RSU broadcast (also, these fixed messages have Psid = -100 to differentiate them from others):
void TraCIDemoRSU11p::handleSelfMsg(cMessage* msg) {
if (WaveShortMessage* wsm = dynamic_cast<WaveShortMessage*>(msg)) {
if(wsm->getPsid()==-100){
sendDown(RSUmessage->dup());
scheduleAt(simTime() + trsu +uniform(0.02, 0.05), RSUmessage);
}
}
else {
BaseWaveApplLayer::handleSelfMsg(wsm);
}
}
A car can receive these messages from other cars as well as RSUs. RSUs discard the messages received from cars. The cars will receive multiple such messages, do some comparison stuff and periodically broadcast a similar type of message : 'aggregatedMessage' per interval Tcar. aggregatedMessage also have Psid=-100 ,so that the message can be differentiated from other messages easily.
I am scheduling the car events using self messages. (Though it could have been done inside handlePositionUpdate I believe). The handleSelfMsg of a car is following:
void TraCIDemo11p::handleSelfMsg(cMessage* msg) {
if (WaveShortMessage* wsm = dynamic_cast<WaveShortMessage*>(msg)) {
wsm->setSerial(wsm->getSerial() +1);
if (wsm->getPsid() == -100) {
sendDown(aggregatedMessage->dup());
//sendDelayedDown(aggregatedMessage->dup(), simTime()+uniform(0.1,0.5));
scheduleAt(simTime()+tcar+uniform(0.01, 0.05), aggregatedMessage);
}
//send this message on the service channel until the counter is 3 or higher.
//this code only runs when channel switching is enabled
else if (wsm->getSerial() >= 3) {
//stop service advertisements
stopService();
delete(wsm);
}
else {
scheduleAt(simTime()+1, wsm);
}
}
else {
BaseWaveApplLayer::handleSelfMsg(msg);
}
}
PROBLEM: With this setting, the simulation is very very slow. I get about 50 simulation seconds in 5-6 hours or more in Express mode in OMNET GUI. (No. of RSU: 64, Number of Vehicle: 40, around 1kmx1km map)
Also, I am referring to this post. The OP says that he got faster speed by removing the sending of message after each RSU received a message. In my case I cannot remove that, because I need to send out the broadcast messages after each interval.
Question: I think that this slowness is because every node is trying to sendDown messages at the beginning of each simulated second. Is it the case that when all vehicles and nodes schedules and sends message at the same time OMNET slows down? (Makes sense to slow down , but by what degree) But, there are only around 100 nodes overall in the simulation. Surely it cannot be this slow.
What I tried : I tried using sendDelayedDown(wsm->dup(), simTime()+uniform(0.1,0.5)); to spread the sending of the messages through out 1st half of each simulated second. This seems to stop messages piling up at the beginning of each simulation seconds and sped things up a bit, but not so much overall.
Can anybody please let me know if this is normal behavior or whether I am doing something wrong.
Also I apologize for the long post. I could not explain my problem without giving the context.
It seems you are flooding your network with messages: Every message from an RSU gets duplicated and transmitted again by every Car which has received this message. Hence, the computational time increases quadratically with the number of nodes (sender of messages) in your network (every sent message has to be handled by every node which is in range to receive it). The limit of 3 transmissions per message does not seem to help much and, as the comment in the code indicates, is not used at all, if there is no channel switching.
Therefore, if you can not improve/change your code to simply send less messages, you have to live with that. Your little tweak to send the messages in a delayed manner only distributes the messages over one second but does not solve the problem of flooding.
However, there are still some hints you can follow to improve the performance of your simulation:
Compile in release mode: make MODE=Release
Run your simulation in the terminal environment Cmdenv: ./run -u Cmdenv ...
If you need to use the graphical environment by all means, you should at least speed up the animations by using the slider in the upper part of the interface.
Removing the simtime-resolution parameter from the omnetpp.ini file solves the problem.
It seems the simulation kernel has an issue when the channel delay does not match the simulation-time resolution.
You can verify the solution by cloning the following repository. Note that you need a functional installation of the OMNeT++ framework. Specifically, I test this fix in OMNeT++ 5.6.2.
https://github.com/Ryuuba/flooding

Rocketmq:MQBrokerException: CODE: 2 DESC: [TIMEOUT_CLEAN_QUEUE]

when i send message to broker,this exception occasionally occurs.
MQBrokerException: CODE: 2 DESC: [TIMEOUT_CLEAN_QUEUE]broker busy, start flow control for a while
This means broker is too busy(when tps>1,5000) to handle so many sending message request.
What would be the most impossible reason to cause this? Disk ,cpu or other things? How can i fix it?
There are many possible ways.
The root cause is that, there are some messages has waited for long time and no worker thread processes them, rocketmq will trigger the fast failure.
So the below is the cause:
Too many thread are working and they are working very slow to process storing message which makes the cache request is timeout.
The jobs it self cost a long time to process for message storing.
This may be because of:
2.1 Storing message is busy, especially when SYNC_FLUSH is used.
2.2 Syncing message to slave takes long when SYNC_MASTER is used.
In
/broker/src/main/java/org/apache/rocketmq/broker/latency/BrokerFastFailure.java you can see:
final long behind = System.currentTimeMillis() - rt.getCreateTimestamp();
if (behind >= this.brokerController.getBrokerConfig().getWaitTimeMillsInSendQueue()) {
if (this.brokerController.getSendThreadPoolQueue().remove(runnable)) {
rt.setStopRun(true);
rt.returnResponse(RemotingSysResponseCode.SYSTEM_BUSY, String.format("[TIMEOUT_CLEAN_QUEUE]broker busy, start flow control for a while, period in queue: %sms, size of queue: %d", behind, this.brokerController.getSendThreadPoolQueue().size()));
}
}
In common/src/main/java/org/apache/rocketmq/common/BrokerConfig.java, getWaitTimeMillsInSendQueue() method returns
public long getWaitTimeMillsInSendQueue() {
return waitTimeMillsInSendQueue;
}
The default value of waitTimeMillsInSendQueue is 200, thus you can just set it bigger to make the queue waiting for longer time. But if you wanna solve the problem completely, you should follow Jaskey's advice and check your code.

Rate limiting algorithm for throttling request

I need to design a rate limiter service for throttling requests.
For every incoming request a method will check if the requests per second has exceeded its limit or not. If it has exceeded then it will return the amount of time it needs to wait for being handled.
Looking for a simple solution which just uses system tick count and rps(request per second). Should not use queue or complex rate limiting algorithms and data structures.
Edit: I will be implementing this in c++. Also, note I don't want to use any data structures to store the request currently getting executed.
API would be like:
if (!RateLimiter.Limit())
{
do work
RateLimiter.Done();
}
else
reject request
The most common algorithm used for this is token bucket. There is no need to invent a new thing, just search for an implementation on your technology/language.
If your app is high avalaible / load balanced you might want to keep the bucket information on some sort of persistent storage. Redis is a good candidate for this.
I wrote Limitd is a different approach, is a daemon for limits. The application ask the daemon using a limitd client if the traffic is conformant. The limit is configured on the limitd server and the app is agnostic to the algorithm.
since you give no hint of language or platform I'll just give out some pseudo code..
things you are gonna need
a list of current executing requests
a wait to get notified where a requests is finished
and the code can be as simple as
var ListOfCurrentRequests; //A list of the start time of current requests
var MaxAmoutOfRequests;// just a limit
var AverageExecutionTime;//if the execution time is non deterministic the best we can do is have a average
//for each request ether execute or return the PROBABLE amount to wait
function OnNewRequest(Identifier)
{
if(count(ListOfCurrentRequests) < MaxAmoutOfRequests)//if we have room
{
Struct Tracker
Tracker.Request = Identifier;
Tracker.StartTime = Now; // save the start time
AddToList(Tracker) //add to list
}
else
{
return CalculateWaitTime()//return the PROBABLE time it will take for a 'slot' to be available
}
}
//when request as ended release a 'slot' and update the average execution time
function OnRequestEnd(Identifier)
{
Tracker = RemoveFromList(Identifier);
UpdateAverageExecutionTime(Now - Tracker.StartTime);
}
function CalculateWaitTime()
{
//the one that started first is PROBABLY the first to finish
Tracker = GetTheOneThatIsRunnigTheLongest(ListOfCurrentRequests);
//assume the it will finish in avg time
ProbableTimeToFinish = AverageExecutionTime - Tracker.StartTime;
return ProbableTimeToFinish
}
but keep in mind that there are several problems with this
assumes that by returning the wait time the client will issue a new request after the time as passed. since the time is a estimation, you can not use it to delay execution, or you can still overflow the system
since you are not keeping a queue and delaying the request, a client can be waiting for more time that what he needs.
and for last, since you do not what to keep a queue, to prioritize and delay the requests, this mean that you can have a live lock, where you tell a client to return later, but when he returns someone already took its spot, and he has to return again.
so the ideal solution should be a actual execution queue, but since you don't want one.. I guess this is the next best thing.
according to your comments you just what a simple (not very precise) requests per second flag. in that case the code can be something like this
var CurrentRequestCount;
var MaxAmoutOfRequests;
var CurrentTimestampWithPrecisionToSeconds
function CanRun()
{
if(Now.AsSeconds > CurrentTimestampWithPrecisionToSeconds)//second as passed reset counter
CurrentRequestCount=0;
if(CurrentRequestCount>=MaxAmoutOfRequests)
return false;
CurrentRequestCount++
return true;
}
doesn't seem like a very reliable method to control whatever.. but.. I believe it's what you asked..

Resources