I am considering using INET/OMNeT++ to evaluate a routing algorithm we are working on. Since I am using the tool for the first time, I have been running some of the examples and reading the source code.
Then I found an example that ships with INET: /inet/examples/wireless/throughput.
The problem is that I don't get the same values.
In the README file one can read:
"Throughput is measured by the "sink" submodule of the AP. It is recorded
into the output scalar file, but can also be inspected during runtime.
The Excel sheet includes throughput measured by the simulation, and compares
it to the theoretical maximum which is roughly 5.12 Mbps (at 11 Mbps bitrate
and 1000-byte packets). The theoretical value and the simulation output
are very close, the difference being less than 1 kbps."
The same value is presented in Timing.xls
However, I obtain a different value when I execute the simulation: 846266 bit/sec
Do I need to perform some additional calculation to obtain the final value of throughput?
Is that a bug?
Is the value no longer valid due to some modification in INET?
The default bitrate for the throughput example is 1 Mbps, so the value you obtained is correct.
To change the bitrate, edit this line in omnetpp.ini in the throughput directory:
**.wlan*.bitrate = 1Mbps
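For example, to get results comparable to the README's roughly 5.12 Mbps figure (which assumes an 11 Mbps bitrate and 1000-byte packets), the same line can be set to:
**.wlan*.bitrate = 11Mbps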
I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. In a relatively idle situation, I ran the following perf command; its output is shown below. The counters are offcore_response.all_data_rd.l3_miss.any_response and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.any_response,mem_load_uops_retired.l3_miss sleep 10
Performance counter stats for 'system wide':
3,713,037 offcore_response.all_data_rd.l3_miss.any_response
2,909,573 mem_load_uops_retired.l3_miss
10.016644133 seconds time elapsed
These two values seem consistent, as the latter excludes prefetch requests and those not targeted at DRAM. But they do not match the read counter in the IMC. That counter is called UNC_IMC_DRAM_DATA_READS and is documented here. I read the counter and reread it 1 second later. The difference was around 30,000,000 (EDITED). Multiplied by 10 (to estimate for 10 seconds), the resulting value is around 300 million (EDITED), which is roughly 100 times the value of the above-mentioned performance counters (EDITED). It is nowhere near 3 million! What am I missing?
P.S.: The difference is much smaller (but still large) when the system has more load.
The question is also asked here:
https://community.intel.com/t5/Software-Tuning-Performance/Performance-Counters-and-IMC-Counter-Not-Matching/m-p/1288832
UPDATE:
Please note that PCM output matches my IMC counter reads.
This is the relevant PCM output:
The values for the READ, WRITE and IO columns are calculated based on UNC_IMC_DRAM_DATA_READS, UNC_IMC_DRAM_DATA_WRITES and UNC_IMC_DRAM_IO_REQUESTS, respectively. It seems that requests classified as IO are also counted as either READ or WRITE. In other words, during the depicted one-second interval, roughly 2.01 GB of the 2.42 GB of READ and WRITE traffic belongs to IO (only "almost", because of the inaccuracy reported in the above-mentioned doc). Based on this explanation, the above three columns seem consistent with each other.
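As a rough cross-check (assuming each counted request corresponds to one 64-byte cache-line read from DRAM), the ~30,000,000 reads per second measured from the IMC counter earlier translate to about the same bandwidth as PCM's READ column:
30,000,000 reads/s × 64 B ≈ 1.92 GB/s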
The problem is that there still exists a LARGE gap between the IMC and PMC values!
The situation is the same when I boot in runlevel 1. The only processes being scheduled are swapper, kworker and migration. Disk IO is almost 85 KB/s. I'm wondering what leads to such a (relatively) huge amount of IO. Is it possible to detect what causes it (e.g., using a counter or a tool)?
UPDATE 2:
I think that there is something wrong with the IO column. It is always something in the range [1.99,2.01], regardless of the amount of load in the system!
UPDATE 3:
In runlevel 1, the average number of occurrences of the uops_retired.all event in a 1-second interval is 15,000,000. During the same period, the number of read requests recorded by the associated IMC counter is around 30,000,000. In other words, assuming that all memory accesses are directly caused by CPU instructions, there would be two memory accesses for each retired micro-operation. This seems impossible, especially given that there are multiple levels of caches. Therefore, in the idle scenario, the read accesses are perhaps caused by IO.
Actually, it was mostly caused by the GPU. That is why this traffic is not visible in the core performance counters. Here is the relevant PCM output for a sample execution on a relatively idle system with a resolution of 3840x2160 and a refresh rate of 60 Hz (set using xrandr):
And this is for the situation with a resolution of 800x600 and the same refresh rate (60 Hz):
As can be seen, changing screen resolution reduced read and IO traffic considerably (more than 100x!).
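That reduction is consistent with display scanout traffic: assuming the display engine reads a 32-bit-per-pixel framebuffer from memory at the refresh rate, the expected read bandwidth at 3840x2160 is
3840 × 2160 pixels × 4 B × 60 Hz ≈ 1.99 GB/s,
which matches the ~2 GB/s that the IO column was pinned at in the earlier updates.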
I executed a linear search on an array containing all unique elements in the range [1, 10000], sorted in increasing order, for every search value from 1 to 10000, and plotted the runtime vs. search value graph as follows:
Upon closely analysing the zoomed-in version of the plot:
I found that the runtime for some larger search values is smaller than for lower search values, and vice versa.
My best guess is that this is related to how the data is processed by the CPU using main memory and cache, but I don't have a firm, quantifiable explanation for it.
Any hint would be greatly appreciated.
PS: The code was written in C++ and executed on a Linux platform hosted on a virtual machine with 4 vCPUs on Google Cloud. The runtime was measured using the C++ chrono library.
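For reference, the measurement loop was essentially of the following form (a simplified sketch, not the exact code; each lookup is timed individually with std::chrono::steady_clock):
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Sorted array of all unique values 1..10000.
    std::vector<int> a(10000);
    for (int i = 0; i < 10000; ++i) a[i] = i + 1;

    for (int key = 1; key <= 10000; ++key) {
        auto t0 = std::chrono::steady_clock::now();
        volatile int found = -1;                      // volatile so the search is not optimized away
        for (int i = 0; i < (int)a.size(); ++i) {     // plain linear search
            if (a[i] == key) { found = i; break; }
        }
        auto t1 = std::chrono::steady_clock::now();
        long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("%d\t%lld\n", key, ns);           // search value, runtime in nanoseconds
    }
    return 0;
}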
CPU cache size depends on the CPU model, and there are several cache levels, so your experiment should take all of those factors into account. The L1 data cache is typically around 32 KiB, which is smaller than your 10000-element array (about 40 KB with 4-byte ints). But I don't think this is cache misses: even a miss that goes all the way to main memory costs on the order of 100 ns, which is much smaller than the gap between the lowest line and the second line, which is about 5 µs. I suppose that second line/cloud comes from context switching. The longer the task, the more likely a context switch is to occur during it. This is why the cloud on the right side is thicker.
Now for the zoomed-in figure. As Linux is not a real-time OS, its time measurement is not perfectly reliable. IIRC the minimal reporting unit here is a microsecond. Now, if a certain task takes exactly 15.45 microseconds, then the value reported depends on when it started. If the task started exactly on a microsecond boundary, the time reported would be 15 microseconds. If it started when the internal clock was 0.1 microseconds past a boundary, you would get 16 microseconds. What you see on the graph is the underlying straight line quantized onto the discrete-valued (integer-microsecond) axis. So the task duration you get is not the actual duration, but the real value plus the task's start offset within the microsecond (which is roughly uniformly distributed, ~U[0,1)), rounded to the nearest integer value.
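To see how this quantization alone turns one true duration into two adjacent reported values, here is a small simulation of the model described above (the 15.45 µs duration and the uniform start offset are purely illustrative assumptions):
#include <cmath>
#include <cstdio>
#include <random>

int main() {
    // Model: reported = round(true duration + start offset within the microsecond).
    const double true_duration_us = 15.45;                  // assumed true duration, for illustration
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> offset(0.0, 1.0);

    int reported15 = 0, reported16 = 0;
    const int trials = 100000;
    for (int i = 0; i < trials; ++i) {
        int reported = static_cast<int>(std::lround(true_duration_us + offset(rng)));
        if (reported == 15) ++reported15;
        else if (reported == 16) ++reported16;
    }
    std::printf("reported 15 us: %d, reported 16 us: %d (of %d runs)\n",
                reported15, reported16, trials);
    return 0;
}
A single fixed-length task therefore shows up as a mix of 15 µs and 16 µs measurements, which is the kind of banding visible in the zoomed-in plot.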
As I understand it, pprof stops and samples a Go program every 10 ms. So a 30 s run should get about 3000 samples, but what is the meaning of the 26.26s? How can the sample count be shown as a time duration?
What's more, I have even gotten output showing the sample time being bigger than the wall time; how can that be?
Duration: 5.13s, Total samples = 5.57s (108.58%)
That confusing wording was reported in google/pprof issue 128:
The "Total samples" part is confusing.
Milliseconds are continuous, but samples are discrete — they're individual points, so how can you sum them up into a quantity of milliseconds?
The "sum" is the sum of a discrete number (a quantity of samples), not a continuous range (a time interval).
Reporting the sums makes perfect sense, but reporting discrete numbers using continuous units is just plain confusing.
Please update the formatting of the Duration line to give a clearer indication of what a quantity of samples reported in milliseconds actually means.
Raul Silvera's answer:
Each callstack in a profile is associated to a set of values. What is reported here is the sum of these values for all the callstacks in the profile, and is useful to understand the weight of individual frames over the full profile.
We're reporting the sum using the unit described in the profile.
Would you have a concrete suggestion for this example?
Maybe just a rewording would help, like:
Duration: 1.60s, Samples account for 14.50ms (0.9%)
There are further pprof improvements being discussed in golang/go issue 36821:
pprof CPU profiles lack accuracy (closeness to the ground truth) and precision (repeatability across different runs).
The issue is with the use of OS timers used for sampling; OS timers are coarse-grained and have a high skid.
I will propose a design to extend CPU profiling by sampling CPU Performance Monitoring Unit (PMU) aka hardware performance counters
It includes examples where Total samples exceeds duration.
Dependence on the number of cores and length of test execution:
The results of goroutine.go test depend on the number of CPU cores available.
On a multi-core CPU, if you set GOMAXPROCS=1, goroutine.go will not show a huge variation, since each goroutine runs for several seconds.
However, if you set GOMAXPROCS to a larger value, say 4, you will notice a significant measurement attribution problem.
One reason for this problem is that the itimer samples on Linux are not guaranteed to be delivered to the thread whose timer expired.
Since Go 1.17 (and improved in Go 1.18), you can add pprof labels to know more:
A cool feature of Go's CPU profiler is that you can attach arbitrary key value pairs to a goroutine. These labels will be inherited by any goroutine spawned from that goroutine and show up in the resulting profile.
Let's consider the example below that does some CPU work() on behalf of a user.
By using the pprof.Labels() and pprof.Do() API, we can associate the user with the goroutine that is executing the work() function.
Additionally the labels are automatically inherited by any goroutine spawned within the same code block, for example the backgroundWork() goroutine.
func work(ctx context.Context, user string) {
    labels := pprof.Labels("user", user)
    pprof.Do(ctx, labels, func(_ context.Context) {
        go backgroundWork()
        directWork()
    })
}
How you use these labels is up to you.
You might include things such as user ids, request ids, http endpoints, subscription plan or other data that can allow you to get a better understanding of what types of requests are causing high CPU utilization, even when they are being processed by the same code paths.
That being said, using labels will increase the size of your pprof files. So you should probably start with low cardinality labels such as endpoints before moving on to high cardinality labels once you feel confident that they don't impact the performance of your application.
This is a two-part question:
I have a fluid flow sensor connected to an NI-9361 on my DAQ. For those that don't know, that's a pulse counter card. Nonetheless, from the data read from the card, I'm able to calculate the fluid flowing through the device in gallons per hour, minute, second, etc. But what I need to do is the following:
Calculate total number of gallons of fluid that has flowed through the sensor since the loop began running
if possible, store that total so that it can be incremented next time the program runs
I know how to calculate it by hand; I'm just not sure how to achieve the running summation required to calculate the total amount of fluid that has passed through the sensor, or how to store the variable so it can be incremented at the next program execution. I'm presuming the latter would involve writing a TDMS file, then opening and reading back the data, unless there's a better way?
Edit:
Below is the code used to determine GPM flow through my sensor. This setup is in accordance with the 9361 manual; it executes and yields proper results.
See this link for details:
http://zone.ni.com/reference/en-XX/help/373197L-01/criodevicehelp/crio-9361/
I can extrapolate how many gallons flow per second, or per sample period. The 1526.99 scalar is the flow sensor manufacturer's constant: the number of pulses per gallon passing through the sensor. The 9361 is set to frequency/period mode, so I'm calculating cycles per second and dividing by the pulses-per-gallon constant to get gallons per second/minute.
I suppose I could get a time reference by looking at the sample period, so I guess the better question is, how do I keep an incrementing sum?
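In pseudo-code terms (C++ here purely for illustration; the file name, loop period and flow-rate source are placeholders), what I'm trying to achieve is something like the following: carry a running total across loop iterations, and read/write it to disk so the next run can continue from it.
#include <chrono>
#include <fstream>
#include <thread>

// Placeholder for the gallons-per-second value already computed from the NI-9361 reading.
double readGallonsPerSecond() { return 0.5; }

int main() {
    // Restore the running total saved by the previous run, if any.
    double totalGallons = 0.0;
    {
        std::ifstream in("total_gallons.txt");
        if (in) in >> totalGallons;
    }

    const double dtSeconds = 0.1;                 // loop/sample period (placeholder)
    while (true) {
        double gps = readGallonsPerSecond();      // current flow rate
        totalGallons += gps * dtSeconds;          // accumulate: rate x elapsed time

        // Persist the total so the next program execution can pick it up.
        std::ofstream out("total_gallons.txt", std::ios::trunc);
        out << totalGallons;

        std::this_thread::sleep_for(std::chrono::duration<double>(dtSeconds));
    }
}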
I've been working on an Arduino (ATMega328p) prototype that has to log data during certain events. An LSM6DS33 sensor is used to generate 6 values (2 bytes each) at a sample rate of 104 Hz. This data needs to be logged for a period of 500-20000ms.
In my code, I generate an interrupt every 1/104 sec using Timer1. When this interrupt occurs, data is read from the sensor, calibrated and then written to an SD card. Normally, this is not an issue. Reading the data from the sensor takes ~3350us, calibrating ~5us and writing ~550us. This means a total cycle takes ~4000us, whereas 9615us is available.
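In outline, this cycle looks roughly like the following (a simplified sketch rather than my exact code; the sensor, calibration and SD-write functions are placeholders):
#include <Arduino.h>

volatile bool sampleDue = false;

// Placeholders for the real sensor read (~3350us), calibration (~5us) and SD write (~550us).
void readSensor() {}
void calibrate() {}
void writeToSd() {}

// Timer1 compare-match interrupt at ~104 Hz: flag that a sample is due.
ISR(TIMER1_COMPA_vect) {
  sampleDue = true;
}

void setup() {
  noInterrupts();
  TCCR1A = 0;
  TCCR1B = _BV(WGM12) | _BV(CS11) | _BV(CS10);  // CTC mode, prescaler 64
  OCR1A = 2403;                                 // 16 MHz / 64 / (2403 + 1) ≈ 104 Hz
  TIMSK1 = _BV(OCIE1A);
  interrupts();
  // sensor and SD card initialization would go here
}

void loop() {
  if (sampleDue) {
    sampleDue = false;
    readSensor();
    calibrate();
    writeToSd();
  }
}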
In order to save power, I wish to lower the voltage to 3.3V. According to the Atmel datasheet, this also means that the clock frequency should be lowered to 8 MHz. Assuming everything runs twice as slow, a measurement cycle would still be possible because ~8000us < 9615us.
After some testing (still 5V @ 16MHz), however, I noticed that every now and then a write cycle takes ~1880us instead of ~550us. I am using the SdFat library to write to and test SD cards (its RawWrite example). The following results came in when I tested the card:
Start raw write of 100000 KB
Target rate: 100 KB/sec
Target time: 100 seconds
Min block write time: 1244 micros
Max block write time: 12324 micros
Avg block write time: 1247 micros
As seen, the average write time is fairly consistent, but sometimes a peak duration of 10x the average occurs! According to the writer of the library, this is because the SD card needs some erase cycles after a certain number of write cycles, which causes a write delay (src: post #18). This delay, however, pushes the time required for a cycle out of the available 9615us bracket, because the total measurement cycle would then be 10672us.
The data I am trying to write is first put into a string using sprintf:
char buf[80] = ""; // sized for six tab-separated values plus the terminator; 20 bytes was too small and could overflow
sprintf(buf,"%li\t%li\t%li\t%li\t%li\t%li",rawData[0],rawData[1],rawData[2],rawData[3],rawData[4],rawData[5]);
myLog.println(buf);
This writes the data to a txt file. But at my sample rate, only 21 × 104 = 2184 B/s would be needed. Lowering the speed of the RawWrite example to 6 KB/s causes the SD card to write without the extended write delay. Yet my code still has the delays, even though less data is written.
My question is: how do I prevent this delay from occurring (if possible)? And if not possible, how can I work around it? It would help if I understood why exactly the delay occurs, because the interval is not always the same (every 10-15 writes).
Some additional info:
The sketch currently uses 69% of RAM (2 kB) for variables. Creating two 512-byte buffers, as suggested in the same forum, is not possible for me.
Initially, I used two strings. Merging them into one didn't affect the write speed significantly.
I don't know how to work around the delay, but I experienced more stable and faster write times when writing to a binary file instead of a ".csv" or ".txt" file.
The following link provides a fine script for writing data as a binary struct to the SD card. (There are a few small typos in his example, but they are easily fixed.)
https://hackingmajenkoblog.wordpress.com/2016/03/25/fast-efficient-data-storage-on-an-arduino/
This will not help you with the time variation, but it might minimize the writing time and thus mitigate the timing issue.
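A minimal sketch of that idea (using the standard Arduino SD library's File object here; the struct layout, chip-select pin and file name are placeholders, not the code from the link):
#include <SD.h>
#include <string.h>

// One sample: six 16-bit raw values from the LSM6DS33 (12 bytes per record
// instead of ~40 characters of tab-separated text).
struct LogRecord {
  int16_t raw[6];
};

File logFile;

void setup() {
  SD.begin(10);                                   // chip-select pin is board dependent
  logFile = SD.open("log.bin", FILE_WRITE);
}

void writeSample(const int16_t raw[6]) {
  LogRecord rec;
  memcpy(rec.raw, raw, sizeof(rec.raw));
  // Write the record as raw bytes: no sprintf formatting, and fewer bytes per sample.
  logFile.write(reinterpret_cast<const uint8_t*>(&rec), sizeof(rec));
}

void loop() {
  // writeSample() would be called from the existing 104 Hz measurement cycle.
}
The binary log can later be converted to text/CSV on a PC if needed.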