I have read a bunch of posts on SO regarding the computation of end-to-end delay in Veins, but have not found an answer that satisfactorily explains why the delay appears to be so low.
I am using:
Veins 4.7
SUMO 0.32.0
OMNeT++ 5.3
Channel switching is turned off.
I have the following code, sending a message from the transmitting node:
if (sendMessage) {
    WaveShortMessage* wsm = new WaveShortMessage();
    sendDown(wsm);
}
The receiving node computes the delay using the wsm creation time, but I have also tried setting the timestamp on the transmitting side. The result is the same.
simtime_t delay = simTime() - wsm->getCreationTime();
delayVector.record(delay);
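For reference, a minimal sketch of both approaches (explicit timestamp on the sender, creation time on the receiver) in a Veins 4.7-style application layer; the class name MyAppLayer and the handler names are placeholders, while setTimestamp()/getTimestamp() and getCreationTime() are standard OMNeT++ cMessage methods:

// Sender: stamp the message explicitly before handing it to the MAC.
void MyAppLayer::sendBeacon() {
    WaveShortMessage* wsm = new WaveShortMessage();
    wsm->setTimestamp(simTime());              // explicit TX timestamp
    sendDown(wsm);
}

// Receiver: end-to-end delay is reception time minus TX timestamp
// (or minus getCreationTime(), which gives the same result here).
void MyAppLayer::onData(WaveShortMessage* wsm) {
    simtime_t delay = simTime() - wsm->getTimestamp();
    delayVector.record(delay);                 // cOutVector member of the module
}

Either way, the recorded value only covers whatever the simulation actually models between creation and reception.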
The sample output for the delay vector is as follows:
Item#  Event#  Time             Value
0      165     14.400239402394  2.39402394E-4
1      186     14.500240403299  2.40403299E-4
2      207     14.600241404069  2.41404069E-4
3      228     14.700242404729  2.42404729E-4
This means that the end-to-end delay (from creation to reception) is roughly a quarter of a millisecond, which seems quite low, and a fair bit below what is typically reported in the literature. This seems to be consistent with what other people have reported as an issue (e.g. end to end delay in Veins).
Am I missing something in this computation? I have tried adding load to the network by using a high number of vehicular nodes (21 nodes within a 1000x50 sandbox on a straight highway, with an average speed of 50 km/h), but the result is about the same; the difference is negligible. I have read several research papers that suggest that end-to-end delay should increase dramatically at high vehicular densities.
This end-to-end delay is to be expected. If your application's simulation model does not explicitly model processing delay (e.g., by an application running on a slow general purpose computer), all that is left to delay a frame is propagation delay (lightspeed, so negligible here) and queueing delay on the MAC (the time from inserting the frame into the TX queue until its transmission finishes).
To give an example, for a 2400 bit frame sent at 6 Mbit/s this delay is roughly 0.45 ms. You are likely using slightly shorter frames, so your values appear to be reasonable.
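To make the arithmetic explicit, here is a back-of-the-envelope check of that figure; the 40 us for PHY preamble and SIGNAL field is an assumed rough value for a 10 MHz 802.11p channel, and MAC header and AIFS are deliberately left out, so treat the result as an estimate rather than a Veins output:

// Rough airtime estimate for the 2400-bit / 6 Mbit/s example above
// (plain arithmetic, not a Veins API).
#include <cstdio>

int main() {
    const double frameBits   = 2400.0;  // frame size from the example
    const double bitrate     = 6e6;     // 6 Mbit/s control channel rate
    const double phyOverhead = 40e-6;   // assumed preamble + SIGNAL duration

    double airtime = frameBits / bitrate + phyOverhead;
    std::printf("approx. airtime: %.0f us\n", airtime * 1e6);  // prints roughly 440 us
    return 0;
}

With MAC/PHY headers and the channel access time added on top, this lands in the region of the 0.45 ms quoted above.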
For background information, see F. Klingler, F. Dressler, C. Sommer: "The Impact of Head of Line Blocking in Highly Dynamic WLANs" (DOI 10.1109/TVT.2018.2837157), which also includes a comparison of theory vs. Veins vs. real measurements.
As I understand it, pprof stops and samples a Go program every 10 ms. So a 30 s run should yield about 3000 samples, but what is the meaning of the 26.26s? How can the sample count be shown as a time duration?
What's more, I have even gotten output showing a sample time bigger than the wall time; how can that be?
Duration: 5.13s, Total samples = 5.57s (108.58%)
That confusing wording was reported in google/pprof issue 128
The "Total samples" part is confusing.
Milliseconds are continuous, but samples are discrete — they're individual points, so how can you sum them up into a quantity of milliseconds?
The "sum" is the sum of a discrete number (a quantity of samples), not a continuous range (a time interval).
Reporting the sums makes perfect sense, but reporting discrete numbers using continuous units is just plain confusing.
Please update the formatting of the Duration line to give a clearer indication of what a quantity of samples reported in milliseconds actually means.
Raul Silvera's answer:
Each callstack in a profile is associated to a set of values. What is reported here is the sum of these values for all the callstacks in the profile, and is useful to understand the weight of individual frames over the full profile.
We're reporting the sum using the unit described in the profile.
Would you have a concrete suggestion for this example?
Maybe just a rewording would help, like:
Duration: 1.60s, Samples account for 14.50ms (0.9%)
Further pprof improvements are still being discussed in golang/go issue 36821:
pprof CPU profiles lack accuracy (closeness to the ground truth) and precision (repeatability across different runs).
The issue is with the OS timers used for sampling; OS timers are coarse-grained and have a high skid.
I will propose a design to extend CPU profiling by sampling CPU Performance Monitoring Unit (PMU) aka hardware performance counters
It includes examples where Total samples exceeds duration: with GOMAXPROCS greater than 1, several threads can be consuming CPU at the same time, so the summed sample time can legitimately exceed the wall-clock duration.
Dependence on the number of cores and length of test execution:
The results of goroutine.go test depend on the number of CPU cores available.
On a multi-core CPU, if you set GOMAXPROCS=1, goroutine.go will not show a huge variation, since each goroutine runs for several seconds.
However, if you set GOMAXPROCS to a larger value, say 4, you will notice a significant measurement attribution problem.
One reason for this problem is that the itimer samples on Linux are not guaranteed to be delivered to the thread whose timer expired.
Since Go 1.17 (and improved in Go 1.18), you can add pprof labels to know more:
A cool feature of Go's CPU profiler is that you can attach arbitrary key value pairs to a goroutine. These labels will be inherited by any goroutine spawned from that goroutine and show up in the resulting profile.
Let's consider the example below that does some CPU work() on behalf of a user.
By using the pprof.Labels() and pprof.Do() API, we can associate the user with the goroutine that is executing the work() function.
Additionally the labels are automatically inherited by any goroutine spawned within the same code block, for example the backgroundWork() goroutine.
func work(ctx context.Context, user string) {
    labels := pprof.Labels("user", user)
    pprof.Do(ctx, labels, func(_ context.Context) {
        go backgroundWork()
        directWork()
    })
}
How you use these labels is up to you.
You might include things such as user ids, request ids, http endpoints, subscription plan or other data that can allow you to get a better understanding of what types of requests are causing high CPU utilization, even when they are being processed by the same code paths.
That being said, using labels will increase the size of your pprof files. So you should probably start with low cardinality labels such as endpoints before moving on to high cardinality labels once you feel confident that they don't impact the performance of your application.
Is the internal clock on the ATTiny85 sufficiently accurate for one-wire timing?
Per https://learn.sparkfun.com/tutorials/ws2812-breakout-hookup-guide, one-wire timing seems to need accuracy around the 0.05 us range, so a 10% clock error on the AVR at 8 MHz would cause 0.0125 us sized timing differences (assuming the 10% error figure is accurate, and that it's 10% error on frequency, not +/- 10% variance on each pulse).
Not a ton of margin - but is it good enough?
First of all, WS2812 LEDs do not use the 1-Wire protocol.
The control protocol of the WS2812 is described in its datasheet.
The short answer is yes: the ATtiny85, and indeed the whole AVR family, have enough clock accuracy to control a WS2812 chain. However, the routine should be written in assembler, and interrupts must be disabled, to guarantee that the timing requirements are met. When the programming is done well, the 8 MHz internal oscillator is even enough to output different data to two WS2812 chains simultaneously.
So, when running at 8 MHz ±10%, one clock cycle is roughly 112...138 ns.
The datasheet requires (with 150 ns tolerance):
When transmitting a "one": high level 550...850 ns; 6 clock cycles (672...828 ns) fits this range (5 clock cycles, 560...690 ns, also fits);
the following low level 450...750 ns; 5 cycles (560...690 ns) fits.
When transmitting a "zero": high level 200...500 ns; 3 cycles (336...414 ns) fits;
the following low level 650...950 ns; 6 cycles (672...828 ns) fits.
So, as you can see, even with the ±10% tolerance of the clock source, you can find an integer number of cycles that is guaranteed to fall within the required intervals.
Speaking from experience, it will still work even if the low level following a pulse is extended by a couple of hundred nanoseconds.
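As an illustration only (not a production driver), here is a heavily simplified avr-gcc sketch of the cycle counts above, with the data line assumed to be on PB0. The store and call/loop instructions themselves consume part of the budgeted cycles, which is exactly why real WS2812 drivers are written as cycle-counted assembly; as noted above, the slightly stretched low phase is usually tolerated.

// Illustrative sketch only (avr-gcc/avr-g++, 8 MHz ATtiny85, data pin PB0 assumed).
#include <avr/io.h>
#include <avr/interrupt.h>

static inline void ws2812_send_bit(uint8_t bit) {
    if (bit) {                              // "one": ~6 cycles high, ~5 cycles low
        PORTB |= _BV(PB0);
        __builtin_avr_delay_cycles(6);
        PORTB &= ~_BV(PB0);
        __builtin_avr_delay_cycles(5);
    } else {                                // "zero": ~3 cycles high, ~6 cycles low
        PORTB |= _BV(PB0);
        __builtin_avr_delay_cycles(3);
        PORTB &= ~_BV(PB0);
        __builtin_avr_delay_cycles(6);
    }
}

int main(void) {
    DDRB |= _BV(PB0);                       // data pin as output
    cli();                                  // no interrupts while a frame is shifted out
    ws2812_send_bit(1);
    ws2812_send_bit(0);
    sei();
    for (;;) {}
}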
There are known issues with using the internal oscillator for a UART: the UART needs roughly 2% timing accuracy, while the internal oscillator can be up to 10% off with the factory setting. While it can be calibrated (the AVR has the OSCCAL register for that purpose), its frequency is still influenced by temperature.
It is worth a try, but it might not be reliable across temperature changes or a fluctuating operating voltage.
References: ATmega's internal oscillator - how bad is it, Timing accuracy on tiny2313, Tuning internal oscillator
The timing requirements of NeoPixels (WS2812B) are wide enough that the only really critical part is the minimum width of a 1 bit. The ATtiny85 at 16 MHz is plenty fast to drive a string of them from a GPIO pin. At 8 MHz, it may not work (I haven't tried yet). I just released a small Arduino sketch which allows you to control NeoPixel strings of any length on an ATtiny85 without using any RAM.
https://github.com/bitbank2/NeoPixel
For devices with hardware SPI (e.g. ATMega328p), it's better to use SPI to shift out the bits (also included in my code).
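As a sketch of the general idea (an assumption about the technique, not code taken from the repository above): each WS2812 data bit can be expanded into three SPI bits, "1" as 110 and "0" as 100, so that an SPI clock of roughly 2.4 MHz makes every data bit about 1.25 us long; inter-byte gaps on the bus only stretch the low phase, which the LEDs usually tolerate. The function name below is illustrative:

// Expand one color byte (MSB first) into three bytes of SPI stream.
#include <stdint.h>

void encodeWs2812Byte(uint8_t value, uint8_t out[3]) {
    uint32_t stream = 0;
    for (int8_t i = 7; i >= 0; --i) {
        stream <<= 3;
        stream |= (value & (1 << i)) ? 0b110 : 0b100;  // "1" -> 110, "0" -> 100
    }
    out[0] = (stream >> 16) & 0xFF;
    out[1] = (stream >> 8) & 0xFF;
    out[2] = stream & 0xFF;
}

The three output bytes would then be shifted out back-to-back over hardware SPI (for example with the Arduino SPI library) while the MOSI pin drives the LED data line.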
I've been working on an Arduino (ATMega328p) prototype that has to log data during certain events. An LSM6DS33 sensor is used to generate 6 values (2 bytes each) at a sample rate of 104 Hz. This data needs to be logged for a period of 500-20000ms.
In my code, I generate an interrupt every 1/104 sec using Timer1. When this interrupt occurs, data is read from the sensor, calibrated and then written to an SD card. Normally, this is not an issue. Reading the data from the sensor takes ~3350us, calibrating ~5us and writing ~550us. This means a total cycle takes ~4000us, whereas 9615us is available.
In order to save power, I wish to lower the voltage to 3.3 V. According to the Atmel datasheet, this also means that the clock frequency should be lowered to 8 MHz. Assuming everything will go twice as slow, a measurement cycle would still be possible because ~8000us < 9615us.
After some testing (still 5 V @ 16 MHz), however, it occurred to me that every now and then a write cycle would take ~1880us instead of ~550us. I am using the SdFat library to write to and test SD cards (RawWrite example). The following results came in when I tested the card:
Start raw write of 100000 KB
Target rate: 100 KB/sec
Target time: 100 seconds
Min block write time: 1244 micros
Max block write time: 12324 micros
Avg block write time: 1247 micros
As seen, the average time to write is fairly consistent, but sometimes a peak duration of 10x the average occurs! According to the writer of the library, this is because the SD card needs some erase cycles after a certain number of write cycles. This causes a write delay (src: post #18). This delay, however, pushes the time required for a cycle out of the available 9615us bracket, because the total measurement cycle would then be 10672us.
The data I am trying to write is first put into a string using sprintf:
char buf[80] = "";  // must be large enough for six tab-separated long values
sprintf(buf, "%li\t%li\t%li\t%li\t%li\t%li",
        rawData[0], rawData[1], rawData[2], rawData[3], rawData[4], rawData[5]);
myLog.println(buf);
This writes the data to a txt file. But at my sample rate, only 21*104 = 2184 B/s would need to be written. Lowering the speed of the RawWrite example to 6 KB/s causes the SD card to write without the extended write delay. Yet my code still sees these delays, even though less data is written.
My question is: how do I prevent this delay from occurring (if possible)? And if not possible, how can I work around it? It would help if I understood why exactly the delay occurs, because the interval is not always the same (every 10-15 writes).
Some additional info:
The sketch currently uses 69% of RAM (2 kB) for variables. Creating two 512 byte buffers, as suggested in the same forum, is therefore not possible for me.
Initially, I used two strings. Merging them into one didn't affect the write speed significantly.
I don't know how to work around the delay, but I experienced more stable and faster write times when writing to a binary file instead of a ".csv" or ".txt" file.
The following link provides a fine script for writing data as a binary struct to the SD card. (There are a few small typos in his example, but they are easily fixed.)
https://hackingmajenkoblog.wordpress.com/2016/03/25/fast-efficient-data-storage-on-an-arduino/
This will not remove the time variation, but it should minimize the writing time and thus mitigate the timing issue.
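As a minimal sketch of the idea (the names below are illustrative, not taken from the linked article), the six 16-bit samples plus a timestamp can be packed into a struct and written raw with SdFat, which avoids the sprintf formatting and keeps each record small:

// Binary-record logging sketch (Arduino + SdFat); 'file' is assumed to be open for writing.
#include <SPI.h>
#include <SdFat.h>
#include <string.h>

struct LogRecord {
    uint32_t micros_ts;   // timestamp of the sample
    int16_t  raw[6];      // six 16-bit values from the LSM6DS33
};                        // 16 bytes per record on AVR (no padding)

SdFat sd;
SdFile file;

void logSample(const int16_t raw[6]) {
    LogRecord rec;
    rec.micros_ts = micros();
    memcpy(rec.raw, raw, sizeof(rec.raw));
    file.write(&rec, sizeof(rec));   // one small binary write per sample
}

The records can later be read back on a PC and converted to CSV offline.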
I'm trying to simulate an emergency braking application using Veins and analyze its performance. Research papers on 802.11p show that as the beacon frequency and the number of vehicles increase, the delay should increase considerably due to the MAC layer delay of the protocol (for 50 vehicles at 8 Hz, about 300 ms average delay).
But when simulating the application with Veins, the delay values do not show much difference (they range from 1 ms to 4 ms). I've checked the MAC layer functionality and it appears that the channel is idle most of the time. So when a packet reaches the MAC layer, the channel has usually already been idle for more than a DIFS, so the packet gets sent quickly. I tried increasing the packet size and reducing the bitrate; this increases the delay by a certain amount, but no drastic increase of delay due to the backoff process can be seen.
Do you know why this happens?
As you use 802.11p, the default data rate on the control channel is 6 Mbit/s (source: ETSI EN 302 663).
6 Mbit/s = 750 kbyte/s = 750,000 bytes/s
Your beacons contain 500 bytes, so the transmission of one beacon takes about 0.0007 seconds. As you have about 50 cars in your multi-lane scenario and they are, for example, sending beacons at a 10 Hz frequency, transmitting those 500 beacons takes only about 0.35 s out of every second.
In my opinion, these are too few cars to create the effect you mention, because the channel is idle for roughly 65% of the time.
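To make that estimate reproducible, here is the channel load computed from the numbers above (plain arithmetic, not a Veins API; the result is in the same ballpark as the ~0.35 s estimate):

// Back-of-the-envelope channel load for 50 cars sending 500-byte beacons at 10 Hz.
#include <cstdio>

int main() {
    const double bitrate    = 6e6;          // 6 Mbit/s control channel
    const double beaconBits = 500.0 * 8.0;  // 500-byte beacons
    const double nCars      = 50.0;
    const double beaconRate = 10.0;         // 10 Hz per car

    double airtime  = beaconBits / bitrate;           // ~0.00067 s per beacon
    double busyFrac = nCars * beaconRate * airtime;   // ~0.33, i.e. about 1/3 of each second
    std::printf("airtime %.0f us, channel busy %.0f%%\n",
                airtime * 1e6, busyFrac * 100.0);
    return 0;
}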
Is there any function that returns the channel busy time? I use Veins 2.2 with the 802.11p MAC and decider. If there is no such function, how can the channel busy time be measured?
Channel busy time in Veins 2.2 is measured at two points: in the Phy layer and in the Mac layer. Both record a corresponding scalar value at the end of the simulation. Note that there is a difference in meaning between the two:
Mac busy time is (in almost all cases) what you want to record: it records how many seconds the Mac treated the channel as busy. Divide the scalar totalBusyTime by the total simulation time and you know the fraction of time that the Mac could not send.
Phy busy time is calculated very differently: its value busyTime increases for each frame received above the sensitivity threshold. To give an example, if exactly 1 frame is being received at any given time during the simulation, the value of this scalar would be 100%. If 4 frames are interfering for the whole of your simulation, the value of this scalar would be 400% (which is different from the Mac busy time you probably want).