Golang - What is the meaning of the seconds in a CPU profiling graph?

For example, the figure shows this for runtime.scanobject:
runtime.scanobject 9.69s(4.51%) of 18.30s(8.52%)
What do the seconds and percentages mean?

When CPU profiling is enabled, the Go program stops about 100 times per second and records a sample consisting of the program counters on the currently executing goroutine's stack.
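The seconds and percentages are derived from those samples. Each sample stands for one sampling period (about 10 ms at 100 samples per second), so a function's time is the number of samples it appeared in multiplied by that period. In a label such as runtime.scanobject 9.69s(4.51%) of 18.30s(8.52%), the first value is the flat time (samples in which runtime.scanobject itself was executing) and the value after "of" is the cumulative time (samples in which it was anywhere on the stack, i.e. including time spent in the functions it called). Both percentages are relative to the total sampled time of the profile.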
Here is a nice reference for you to read more about it: https://blog.golang.org/profiling-go-programs
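To make the two numbers concrete: in a pprof node label, the first value is the flat time (samples where the function itself was running) and the value after "of" is the cumulative time (samples where it appeared anywhere on the stack). Below is a minimal sketch that computes both from a few made-up stack samples, assuming the default rate of 100 Hz (10 ms per sample). The samples and the flatCum helper are hypothetical; this is not pprof's implementation.

```go
package main

import "fmt"

// samplePeriodMs assumes Go's default CPU profiling rate of 100 Hz,
// i.e. one sample roughly every 10 ms of CPU time.
const samplePeriodMs = 10

// flatCum returns the flat and cumulative time (in ms) of fn.
// Each stack is listed leaf first: stack[0] was executing when sampled.
func flatCum(samples [][]string, fn string) (flat, cum int) {
	for _, stack := range samples {
		if len(stack) > 0 && stack[0] == fn {
			flat += samplePeriodMs // fn itself was running
		}
		for _, f := range stack {
			if f == fn {
				cum += samplePeriodMs // fn or one of its callees was running
				break                 // count a sample once, even with recursion
			}
		}
	}
	return flat, cum
}

func main() {
	// Four made-up stack samples (leaf first).
	samples := [][]string{
		{"runtime.scanobject", "runtime.gcDrain"},
		{"runtime.findObject", "runtime.scanobject", "runtime.gcDrain"},
		{"main.work", "main.main"},
		{"runtime.scanobject", "runtime.gcDrain"},
	}
	total := len(samples) * samplePeriodMs
	flat, cum := flatCum(samples, "runtime.scanobject")
	fmt.Printf("runtime.scanobject %dms(%.2f%%) of %dms(%.2f%%)\n",
		flat, 100*float64(flat)/float64(total),
		cum, 100*float64(cum)/float64(total))
}
```

With these four samples it prints runtime.scanobject 20ms(50.00%) of 30ms(75.00%): the function was the leaf in two samples and on the stack in three.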


Is it possible to run perf stat with an interval that is not a time?

I would like to know whether it is possible to easily modify Linux perf's stat module to use an interval of cycles (or instructions per cycle) instead of an interval of time. The goal is to improve the precision of the counters collected per interval; the time unit is not accurate enough.
A friend suggested this idea. I looked at the source code a little and, as far as I understand (I may be wrong), perf stat does the following:
creates a time condition computed with rdtsc (or clock_gettime)
creates a "wait" in the perf process
launches the program under test
checks whether the time condition is met: if not, it keeps waiting; if so, it breaks out of the wait, saves the counter values read from the mapped registers, and calls the wait function again if the program has not finished
I would like output like this:
cycles   counts   unit   events
10000    25000           instructions
10000    450             branch-misses
20000    21000           instructions
20000    850             branch-misses
Unfortunately, I see a big problem: I would want to use the value of a counter as a condition while I don't have that value yet. Or should I read the counter(s) that define my "interval condition" all the time? I also read that with a time interval, counters shouldn't be read at intervals shorter than 100 ms because of the overhead. If I read some counters every 10000 cycles, would I have the same problem? I don't know where this overhead comes from (system calls?).

How to interpret a CPU profiling graph

I was following the Go blog here.
I tried to profile my program, but the output looks a bit different. (It seems that Go has moved from sampling to instrumentation?)
I wonder what these numbers mean,
especially "Showing nodes accounting for 2.59s, 92.5% of 2.8s".
What does "Total samples = 2.8s" mean? Were the samples drawn over an interval of 2.8 seconds?
Does it mean that only nodes that are running over 92.5% of the sample time are shown?
I also wonder how these numbers are generated. In the original Go blog post, the measure is how many times a function is detected executing among all samples. Here, however, we are dealing with seconds. How does the Go profiling tool know how many seconds a function call takes?
Any help will be appreciated.
Think of the graph as a graph of a resource: time. You start at the top with, for example, 10 seconds. Then you see that 5 seconds went to compress/gzip and 5 went to encoding/json. The arrows show how that time is divided, so here they show that 5 seconds went to each part of the program. Now we have 3 nodes: the first with 10 seconds, compress/gzip with 5 seconds, and encoding/json with 5 seconds. The 5 seconds in encoding/json are then broken down further into the functions that took up most of that time. A label such as 0.01s (percentage) of 0.02s (larger percentage) means that the function itself spent 0.01 s executing (its flat time), out of 0.02 s spent in it together with the functions it called (its cumulative time); the number on an arrow is the time that flowed along that particular call. The percentages are each value's share of the whole pie, the total sampled time, so you can see, for example, that the encoding/json string encoder took 0.36 percent of your program's total execution time. Note that Go's CPU profiler has not switched to instrumentation; it still samples. The seconds are the number of samples a function appeared in multiplied by the sampling period (10 ms at the default rate of 100 samples per second), and "Total samples = 2.8s" means the samples add up to 2.8 s of CPU time, not that they were drawn over a 2.8-second interval.
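The "Showing nodes accounting for 2.59s, 92.5% of 2.8s" line refers to pprof hiding nodes that are too small to be worth drawing. It does not mean that only nodes above 92.5% are shown; it means the nodes still visible account for 2.59 s of the 2.8 s total. The sketch below is a simplified model of that filter, assuming pprof's -nodefraction rule (hide nodes whose cumulative time is below the fraction times the total, 0.005 by default) and ignoring other limits such as -nodecount. The profile data and names are made up.

```go
package main

import "fmt"

// fnTime is one function's flat and cumulative CPU time in milliseconds.
type fnTime struct {
	name      string
	flat, cum int
}

// visibleNodes keeps nodes whose cum time reaches nodeFraction*total and
// returns them with the sum of their flat times. Simplified model of
// pprof's -nodefraction filter; it ignores the -nodecount limit.
func visibleNodes(nodes []fnTime, total int, nodeFraction float64) ([]fnTime, int) {
	cutoff := float64(total) * nodeFraction
	var kept []fnTime
	flatSum := 0
	for _, n := range nodes {
		if float64(n.cum) < cutoff {
			continue // too small to draw
		}
		kept = append(kept, n)
		flatSum += n.flat
	}
	return kept, flatSum
}

func main() {
	// Hypothetical profile: 280 samples x 10 ms = 2800 ms total.
	// main.compute calls runtime.mallocgc and main.tiny;
	// main.main calls main.compute, main.encode and runtime.memmove.
	nodes := []fnTime{
		{"main.main", 0, 2800},
		{"main.compute", 2400, 2600},
		{"runtime.mallocgc", 190, 190},
		{"main.encode", 190, 190},
		{"main.tiny", 10, 10},
		{"runtime.memmove", 10, 10},
	}
	total := 0
	for _, n := range nodes {
		total += n.flat // each sample has exactly one leaf function
	}
	kept, shown := visibleNodes(nodes, total, 0.005)
	fmt.Printf("Showing nodes accounting for %dms, %.1f%% of %dms total\n",
		shown, 100*float64(shown)/float64(total), total)
	fmt.Printf("Dropped %d nodes\n", len(nodes)-len(kept))
}
```

With this data the cutoff is 14 ms, so main.tiny and runtime.memmove are hidden and the header reads 2780ms, 99.3% of 2800ms. The shown total is smaller than the profile total exactly by the flat time of the hidden nodes.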

WinDbg runaway command output explained

I have a production CPU issue: after days of normal activity, the CPU suddenly starts to spike. I saved a dump file and ran the !runaway command to get the list of threads that consumed the most CPU time. The output is below:
User Mode Time
Thread Time
21:110 0 days 10:51:39.781
19:f84 0 days 10:41:59.671
5:cc4 0 days 0:53:25.343
48:74 0 days 0:34:20.140
47:1670 0 days 0:34:09.812
13:460 0 days 0:32:57.640
8:14d4 0 days 0:19:30.546
7:d90 0 days 0:03:15.000
23:1520 0 days 0:02:21.984
22:ca0 0 days 0:02:08.375
24:72c 0 days 0:02:01.640
29:10ac 0 days 0:01:58.671
27:1088 0 days 0:01:44.390
As you can see, the output shows two threads, 21 and 19, that together consumed more than 20 hours of CPU time. I was able to get the call stack of one of those threads like so:
(the output doesn't matter at the moment; let's call it the "X callstack")
What I would like is an explanation of the !runaway command output. From what I understand, a dump file is a snapshot of the current state of the application, so my questions are:
How can the !runaway command show a 10:51 hours value for thread 21, when the dumping process only took a few seconds?
Does it mean that the specific "instance" of the X callstack I found with the !CLRStack command has been hanging for more than 10 hours? Or is it the total time thread 21 spent on all of its X callstack executions? If so, it seems strange that thread 21 is responsible for so many executions of the X callstack, since I know the origin is a web request (the runtime should assign a random thread to each call).
I have a hypothesis that may answer both questions:
Maybe WinDbg calculates the time by taking the thread callstack's actual time and dividing it by the scope of the dumping process. For example, if the specific execution of the X callstack took 1 second, the whole dumping process took 3 seconds (33%), and the process had been running for a total of 24 hours, the output would show:
8 hours (33% of 24 hours)
Am I right, or did I get it completely wrong?
This answer is intended to be comprehensible for the OP. It's not intended to be correct in every bit and byte.
[...] and dividing it by the scope of the dumping process [...]
This understanding is probably the root of all evil: dumping a process only gives you the state of the process at a certain point in time. From the process's point of view, dumping takes 0.0 seconds, since all threads are suspended during the operation. (So, relative to your process, nothing changes and time stands still; of course, wall-clock time moves on.)
You are thinking of dumping a process as monitoring it over a longer period of time, which is not the case. Dumping a process just takes time because it involves disk activity etc.
So no, there is no "scope", and thus you cannot measure performance issues with crash dumps (or at least it's really hard).
How can the !runaway command show a 10:51 hours value for thread 21, [...]
How can your C# program know how long it has been running if you only have a timer event that fires every second? The answer is: it uses a variable and increments it each time the event fires.
That's roughly how Windows does it. Windows is responsible for thread scheduling and each time it re-schedules threads, it updates a variable that contains the thread time.
When the crash dump is written, this information, which the OS has been collecting all along, is included in the dump.
[...] when the dumping process only took a few seconds?
Since the crash dump is taken by a WinDbg thread, the time for that is charged to that thread. You would need to debug WinDbg and run !runaway on a WinDbg thread to see how much CPU time the dump took. That is potentially a nice exercise, and the .dbgdbg (debug the debugger) command may be new to you; other than that, it is not really helpful in this particular case.
Does it mean that the specific "instance" of the X callstack I found with the !CLRStack command has been hanging for more than 10 hours?
No. It means that at the point in time when you created the crash dump, that specific method was executing. Not more, not less.
This information is unrelated to !runaway, because the thread may have been doing something totally different for a long time and only just started this call.
Or is it the total time thread 21 spent on all of its X callstack executions?
No. A crash dump does not contain such detailed performance data. You need a performance profiler like JetBrains dotTrace to get that information. A profiler looks at call stacks very often, aggregates identical call stacks, and derives the CPU time per call stack.

How to prevent SD card from creating write delays during logging?

I've been working on an Arduino (ATmega328P) prototype that has to log data during certain events. An LSM6DS33 sensor generates 6 values (2 bytes each) at a sample rate of 104 Hz. This data needs to be logged for a period of 500-20000 ms.
In my code, I generate an interrupt every 1/104 sec using Timer1. When this interrupt occurs, data is read from the sensor, calibrated and then written to an SD card. Normally, this is not an issue. Reading the data from the sensor takes ~3350us, calibrating ~5us and writing ~550us. This means a total cycle takes ~4000us, whereas 9615us is available.
In order to save power, I want to lower the voltage to 3.3 V. According to the Atmel datasheet, this also means that the clock frequency should be lowered to 8 MHz. Assuming everything runs twice as slowly, a measurement cycle would still be possible because ~8000us < 9615us.
After some testing (still at 5 V, 16 MHz), however, I noticed that every now and then a write cycle took ~1880us instead of ~550us. I am using the SdFat library to write and test SD cards (RawWrite example). These are the results when I tested the card:
Start raw write of 100000 KB
Target rate: 100 KB/sec
Target time: 100 seconds
Min block write time: 1244 micros
Max block write time: 12324 micros
Avg block write time: 1247 micros
As you can see, the average write time is fairly consistent, but occasionally a write takes about 10 times the average! According to the author of the library, this is because the SD card needs to perform erase cycles after a certain number of write cycles, which causes a write delay (source: posts #18 and #22 of the forum thread). This delay, however, pushes the time required for a cycle beyond the available 9615us, because the total measurement cycle would then be 10672us.
The data I am trying to write is first put into a string using sprintf:
char buf[20] = "";
This writes the data to a .txt file. At my sample rate, only 21 * 104 = 2184 B/s is needed. Lowering the target rate of the RawWrite example to 6 KB/s makes the SD card write without the extended write delays. Yet my code still gets them, even though it writes less data.
My question is: how do I prevent this delay from occurring (if possible)? And if not possible, how can I work around it? It would help if I understood why exactly the delay occurs, because the interval is not always the same (every 10-15 writes).
Some additional info:
The sketch's variables currently use 69% of the 2 kB of RAM, so creating two 512-byte buffers, as suggested in the same forum thread, is not possible for me.
Initially, I used two strings. Merging them into one didn't significantly affect the write speed.
I don't know how to work around the delay, but I get more stable and faster write times when I write to a binary file instead of a ".csv" or ".txt" file.
The following link provides a good script for writing data as a binary struct to the SD card. (There are some small typos in the example, but they are easily fixed.)
This will not help you with the time variation, but it might minimize the write time and thus make the time variation negligible.

Getting CPU usage and calculating % used

I need to calculate CPU usage and aggregate it from the proc filesystem in Linux.
/proc/stat gives me data, but how can I find out the percentage of CPU used at a given time? As far as I can tell, /proc/stat gives me the count of processes running on the cores at any time, which doesn't give me any idea of the CPU usage.
I am coding this in Go and have to do it without scripts.
Thanks in advance!
/proc/stat does not only give you the count of processes on each core. man proc will tell you the exact format of that file. Copied from it, here is the part you should be interested in:
cpu 3357 0 4313 1362393
The amount of time, measured in units of USER_HZ
(1/100ths of a second on most architectures, use
sysconf(_SC_CLK_TCK) to obtain the right value), that the
system spent in user mode, user mode with low priority
(nice), system mode, and the idle task, respectively.
The last value should be USER_HZ times the second entry
in the uptime pseudo-file.
It is then easy to subtract the idle field of two measurements taken some time apart, which gives you the time this CPU spent doing nothing. The other value you can extract is the time spent doing something, which is the difference between two measurements of:
time in user mode + time spent in user mode with low priority + time spent in system mode
You will then have two values: A, the time spent doing nothing, and B, the time actually spent doing something. B / (A + B) gives the fraction of time the CPU was busy; multiply by 100 for a percentage.
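Since the question asks for Go without scripts, here is a minimal sketch of that calculation. It reads the aggregate "cpu" line (the first line of /proc/stat) twice, one second apart, and applies B / (A + B) using only the user, nice, system and idle fields named in the answer. Newer kernels append more fields (iowait, irq, softirq, steal, ...), which this sketch deliberately ignores. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// cpuTimes holds the first four fields of the aggregate "cpu" line, in USER_HZ ticks.
type cpuTimes struct {
	user, nice, system, idle uint64
}

// parseCPULine parses a line such as "cpu 3357 0 4313 1362393 ...".
// Per-core lines ("cpu0", ...) are rejected; extra fields are ignored.
func parseCPULine(line string) (cpuTimes, error) {
	f := strings.Fields(line)
	if len(f) < 5 || f[0] != "cpu" {
		return cpuTimes{}, fmt.Errorf("unexpected line: %q", line)
	}
	var v [4]uint64
	for i := range v {
		n, err := strconv.ParseUint(f[i+1], 10, 64)
		if err != nil {
			return cpuTimes{}, err
		}
		v[i] = n
	}
	return cpuTimes{v[0], v[1], v[2], v[3]}, nil
}

// busyPercent returns 100 * B / (A + B) between two snapshots, where
// A is the idle delta and B is the user+nice+system delta.
func busyPercent(prev, cur cpuTimes) float64 {
	a := cur.idle - prev.idle
	b := (cur.user + cur.nice + cur.system) - (prev.user + prev.nice + prev.system)
	if a+b == 0 {
		return 0 // no ticks elapsed between the snapshots
	}
	return 100 * float64(b) / float64(a+b)
}

// readCPU reads the aggregate "cpu" line (the first line) of /proc/stat.
func readCPU() (cpuTimes, error) {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return cpuTimes{}, err
	}
	first, _, _ := strings.Cut(string(data), "\n")
	return parseCPULine(first)
}

func main() {
	prev, err := readCPU()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	time.Sleep(time.Second)
	cur, err := readCPU()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("CPU busy: %.1f%%\n", busyPercent(prev, cur))
}
```

Because only deltas are used, the USER_HZ value cancels out and does not need to be looked up. To include the newer fields, add iowait to A and irq, softirq and steal to B.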
