Where and how is the grace period for Linux RCU initialized and accounted for? - linux-kernel

Trying to understand how and where the grace period in RCU is initialized. Is there a static declaration somewhere in the Linux kernel that defines the grace period, or is it established in some other, more involved way, for example as in this small snippet:
for_each_online_cpu(cpu)
        run_on(cpu);    /* force this thread to run on each CPU in turn */
Is the grace period then the total time it takes to run the current thread on all CPUs, so that once this time has elapsed, all readers are guaranteed to have finished their critical sections?
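For context, the snippet above is the classic toy implementation of a grace-period wait for non-preemptible kernels: because a reader cannot be preempted inside its critical section, once this thread has run on every CPU, every reader that was active when the wait began must have finished. What the grace period guarantees is easiest to see from the usage pattern; here is a minimal sketch (the struct and names are illustrative, not from any kernel source file):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
        int data;
};

static struct foo __rcu *global_foo;

/* Reader: the grace period cannot end while any critical section
 * that started before it is still running. */
static int read_foo(void)
{
        struct foo *p;
        int val = -1;

        rcu_read_lock();
        p = rcu_dereference(global_foo);
        if (p)
                val = p->data;
        rcu_read_unlock();
        return val;
}

/* Updater: publish the new version, wait one grace period,
 * then free the old version nobody can still be reading. */
static void update_foo(int data)
{
        struct foo *newp = kmalloc(sizeof(*newp), GFP_KERNEL);
        struct foo *oldp;

        if (!newp)
                return;
        newp->data = data;
        oldp = rcu_dereference_protected(global_foo, 1);
        rcu_assign_pointer(global_foo, newp);
        synchronize_rcu();      /* blocks until the grace period ends */
        kfree(oldp);
}

So the grace period is not a statically declared duration: synchronize_rcu() returns whenever the kernel has observed that all pre-existing readers are done, however long that takes.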

Related

Is it possible to do a perf stat with an interval that is not a time?

I would like to know whether it is feasible to modify Linux perf's stat tool so that it reports counters at an interval of cycles (or instructions per cycle) instead of an interval of time. The goal is to improve the precision of the per-interval counter values; the time unit is not accurate enough.
A friend suggested this idea, and after looking at the source code a little, my (possibly wrong) understanding is that perf:
1. establishes a time condition computed via rdtsc (or clock_gettime),
2. waits inside the perf process,
3. launches the program under test,
4. checks the time condition: if the interval has elapsed, it saves the memory-mapped counter register values; otherwise it goes back to waiting.
I would like a result like this:
cycles    counts    unit events
10000     25000     instructions
10000       450     branch-misses
20000     21000     instructions
20000       850     branch-misses
Unfortunately, I see a big problem: the interval condition would depend on the value of a counter I have not read yet. Or should I continuously read the counter(s) that define my "interval condition"? I also read that, for a time interval, one should not collect counters more often than every 100ms because it generates overhead. If I read some counters every 10000 cycles, would I have the same problem? I also don't know where this overhead comes from (system calls?).
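For what it's worth, the kernel's perf_event_open(2) interface can already deliver a notification every N occurrences of a hardware event via sample_period, which gives a cycle-based interval without modifying perf itself. Below is a rough sketch under that assumption: a cycles event overflows every 10,000 cycles, and a grouped instructions counter is read in the SIGIO handler (CYCLE_INTERVAL and the busy loop are placeholders):

#include <fcntl.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define CYCLE_INTERVAL 10000

static int cycles_fd, insns_fd;

static long perf_open(struct perf_event_attr *attr, int group_fd)
{
        return syscall(__NR_perf_event_open, attr, 0, -1, group_fd, 0);
}

static void on_overflow(int sig)
{
        uint64_t insns;

        (void)sig;
        if (read(insns_fd, &insns, sizeof(insns)) != sizeof(insns))
                return;
        /* printf is not async-signal-safe; acceptable for a demo only */
        printf("+%d cycles: %llu instructions so far\n",
               CYCLE_INTERVAL, (unsigned long long)insns);
        ioctl(cycles_fd, PERF_EVENT_IOC_REFRESH, 1);    /* re-arm */
}

int main(void)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.sample_period = CYCLE_INTERVAL;    /* overflow every N cycles */
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        cycles_fd = perf_open(&attr, -1);       /* group leader */

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.exclude_kernel = 1;
        insns_fd = perf_open(&attr, cycles_fd); /* same group as cycles */

        signal(SIGIO, on_overflow);
        fcntl(cycles_fd, F_SETOWN, getpid());
        fcntl(cycles_fd, F_SETFL, O_ASYNC);     /* SIGIO on each overflow */
        ioctl(cycles_fd, PERF_EVENT_IOC_REFRESH, 1);

        for (volatile long i = 0; i < 200000000L; i++)
                ;                               /* stand-in workload */
        return 0;
}

This also hints at where the overhead comes from: each interval costs one signal delivery plus a read(2) system call, so at every 10,000 cycles that per-interval cost would be a large fraction of the interval itself.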

Performance Counters and IMC Counter Not Matching

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. In a relatively idle situation, I ran the following perf command; its output is shown below. The counters are offcore_response.all_data_rd.l3_miss.any_response and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.any_response,mem_load_uops_retired.l3_miss sleep 10
Performance counter stats for 'system wide':

        3,713,037    offcore_response.all_data_rd.l3_miss.any_response
        2,909,573    mem_load_uops_retired.l3_miss

     10.016644133 seconds time elapsed
These two values seem consistent with each other, as the latter excludes prefetch requests and requests not targeted at DRAM. But they do not match the read counter in the IMC. This counter is called UNC_IMC_DRAM_DATA_READS and is documented here. I read the counter, then reread it 1 second later. The difference was around 30,000,000 (EDITED). Multiplied by 10 (to estimate 10 seconds), the result is around 300 million (EDITED), which is roughly 100 times the value of the above-mentioned performance counters (EDITED). It is nowhere near 3 million! What am I missing?
P.S.: The difference is much smaller (but still large), when the system has more load.
The question is also asked here:
https://community.intel.com/t5/Software-Tuning-Performance/Performance-Counters-and-IMC-Counter-Not-Matching/m-p/1288832
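(Side note: on client parts like this Haswell, the kernel also exposes the free-running IMC counters through the uncore_imc PMU, so the read/reread cross-check can be scripted with perf_event_open(2). A sketch, assuming data_reads is event 0x01 counting 64-byte lines, which is worth verifying in /sys/bus/event_source/devices/uncore_imc/events/data_reads on the target machine; it typically needs root:)

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        struct perf_event_attr attr;
        uint64_t before, after;
        int type, fd;
        FILE *f;

        /* dynamic PMU type number, published by the kernel in sysfs */
        f = fopen("/sys/bus/event_source/devices/uncore_imc/type", "r");
        if (!f || fscanf(f, "%d", &type) != 1) {
                fprintf(stderr, "uncore_imc PMU not available\n");
                return 1;
        }
        fclose(f);

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = type;
        attr.config = 0x01;     /* assumed: data_reads, in 64-byte lines */

        /* uncore events are system-wide: pid = -1, one specific cpu */
        fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        read(fd, &before, sizeof(before));
        sleep(1);
        read(fd, &after, sizeof(after));

        printf("DRAM reads: %llu lines/s = %llu bytes/s\n",
               (unsigned long long)(after - before),
               (unsigned long long)(after - before) * 64);
        return 0;
}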
UPDATE:
Please note that PCM output matches my IMC counter reads.
This is the relevant PCM output:
The values of the READ, WRITE and IO columns are calculated from UNC_IMC_DRAM_DATA_READS, UNC_IMC_DRAM_DATA_WRITES and UNC_IMC_DRAM_IO_REQUESTS, respectively. It seems that requests classified as IO are also counted as either READ or WRITE. In other words, during the depicted one-second interval, almost (because of the inaccuracy reported in the above-mentioned doc) 2.01GB of the 2.42GB of READ and WRITE requests belong to IO. Based on this explanation, the three columns seem consistent with each other.
The problem is that there still exists a LARGE gap between the IMC and PMC values!
The situation is the same when I boot into runlevel 1. The only processes on the scheduler are swapper, kworker and migration. Disk IO is almost 85KB/s. I'm wondering what causes such a (relatively) huge amount of IO. Is it possible to detect its source (e.g., using a counter or a tool)?
UPDATE 2:
I think there is something wrong with the IO column: it always stays in the range [1.99, 2.01] GB, regardless of the amount of load on the system!
UPDATE 3:
In runlevel 1, the average number of occurrences of the uops_retired.all event in a 1-second interval is 15,000,000. During the same period, the number of read requests recorded by the associated IMC counter is around 30,000,000. In other words, if all memory accesses were directly caused by CPU instructions, each retired micro-operation would produce two memory accesses. This seems impossible, especially given that there are multiple levels of caches. Therefore, in the idle scenario, the read accesses are probably caused by IO.
Actually, the traffic was mostly caused by the GPU, which is why it was invisible to the core performance counters. Here is the relevant PCM output for a sample execution on a relatively idle system with resolution 3840x2160 and refresh rate 60, set using xrandr:
And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60):
As can be seen, changing the screen resolution reduced read and IO traffic considerably (by more than 100x!).

Changing scheduler tick time

I want to change the scheduler tick time (the amount of time the CPU spends on each process).
Initially I read about jiffies: the jiffies variable represents the number of timer ticks since boot, and CONFIG_HZ in the configuration file is the number of timer ticks per second. Please correct me if this is not right.
Is the CONFIG_HZ value the same as the scheduler tick time? If it is different, please point me to where I can change the scheduler tick time.
Yes, CONFIG_HZ defines the number of ticks in one second.
Basically, the scheduler is invoked every 1/CONFIG_HZ seconds to handle task wakeup, task sleeping, and load balancing.
scheduler_tick() is the function that gets called every 1/CONFIG_HZ seconds.
CONFIG_HZ is defined in Kconfig and its value is set in .config, which can be modified using menuconfig.
The global variable jiffies holds the number of ticks that have occurred since the system booted.
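To make the tick/jiffy relationship concrete, here is a throwaway kernel-module sketch (the module name and messages are made up) that prints HZ and jiffies:

#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/module.h>

static int __init tick_demo_init(void)
{
        /* HZ is the compile-time CONFIG_HZ value */
        pr_info("HZ = %d ticks/second, so one tick is %d ms\n",
                HZ, 1000 / HZ);

        /* jiffies counts timer ticks; note it starts at a nonzero
         * offset (INITIAL_JIFFIES), so this is not exact uptime */
        pr_info("jiffies = %lu (~%u s)\n",
                jiffies, jiffies_to_msecs(jiffies) / 1000);
        return 0;
}

static void __exit tick_demo_exit(void)
{
}

module_init(tick_demo_init);
module_exit(tick_demo_exit);
MODULE_LICENSE("GPL");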
I'd like to clarify the terms. A jiffy is, strictly speaking, a unit of time: just as we have hours, minutes and seconds, the kernel measures time in jiffy units. It happens that the scheduler is launched every jiffy (roughly speaking). For more details I suggest the "Linux Kernel Development" book - https://github.com/eeeyes/My-Lib-Books/blob/master/Linux%20Kernel%20Development%2C%203rd%20Edition.pdf

In what context is the jiffies counter updated?

I'm writing a system that processes network packets on SMP (CentOS 6.4).
I'm using CPU isolation and running a single kthread on some of the cores. If I don't release the CPU once in a while by calling schedule(), the system hits the watchdog. I've tried moving to real-time priority and releasing the CPU for a specific amount of time, for example 50 jiffies every 450 jiffies, but it still gets stuck.
My question: is jiffies updated by a softirq kthread? Could my not releasing the CPU prevent jiffies from incrementing?
Thanks
jiffies is incremented when the timer interrupt is hit. The timer interrupt is raised by the system timer. It is not updated by a softirq kthread.
On x86, the system timer is traditionally implemented via the programmable interval timer (PIT); PowerPC implements it via the decrementer.
From your description, your thread is monopolizing the CPU, so a watchdog hit is expected once its timeout elapses. On most systems a jiffy is 10ms, but you can check by looking at HZ: the HZ value gives the number of timer interrupts per second, so there are HZ jiffies in a second.
In your case, whenever you release the CPU, the watchdog thread gets a chance to run: it reads the current jiffies and compares it with the jiffies value stored when it last ran. If the difference is greater than or equal to the watchdog timeout, the watchdog fires and, if so configured, resets the system.
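For the original packet-processing loop, the fix this answer implies looks roughly like the sketch below (the function name and the polling work are placeholders); either cond_resched() or a fixed sleep gives the timekeeping and watchdog machinery a chance to run:

#include <linux/kthread.h>
#include <linux/sched.h>

static int pkt_poll_fn(void *data)
{
        while (!kthread_should_stop()) {
                /* ... poll NIC rings / process packets (omitted) ... */

                /* Yield occasionally: a CPU monopolized by this loop
                 * is exactly what makes the soft-lockup watchdog fire. */
                cond_resched();

                /* Or release the CPU for a fixed time, e.g. 50 jiffies: */
                /* schedule_timeout_interruptible(50); */
        }
        return 0;
}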

What's a good name for "total wall clock time of all cpus?"

There are only two hard things in Computer Science: cache invalidation
and naming things.
-- Phil Karlton
My app reports CPU time, and people reasonably want to know what total that time is out of, so they can compute the % CPU utilized. My question is: what is the name for the wall-clock time multiplied by the number of CPUs?
If you add up the total user, system, idle, etc. time for a system, you get the total wall-clock time multiplied by the number of CPUs. What's a good name for that? According to Wikipedia, CPU time is:
CPU time (or CPU usage, process time) is the amount of time for which
a central processing unit (CPU) was used for processing instructions
of a computer program, as opposed to, for example, waiting for
input/output (I/O) operations.
"total time" suggests just wall clock time, and doesn't connote that over a 10 second span, a four-cpu system would have 40 seconds of "total time."
Total Wall Clock Time of all CPUs
Naming things is hard; why waste a good 'un once you've found it?
Aggregate time: 15 of 40 seconds.
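Whatever name wins, the computation being reported is just this (the measured values below are placeholders):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        double wall_s = 10.0;   /* measured wall-clock span */
        double cpu_s = 15.0;    /* CPU time the app consumed in that span */
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

        /* "aggregate time": wall-clock time times the number of CPUs,
         * e.g. 10 s on a 4-CPU box = 40 s */
        double aggregate = wall_s * ncpus;

        printf("Aggregate time: %.0f of %.0f seconds (%.0f%%)\n",
               cpu_s, aggregate, 100.0 * cpu_s / aggregate);
        return 0;
}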
