How to permanently limit CPU frequency - linux-kernel

I need to limit the cpu frequency of my Linux machine.
I know about the cpufreq sysfs interface, but limiting the frequency at runtime is too late in my case.
Is there a kernel parameter for this?
Where do the values for cpuinfo_max_freq and scaling_max_freq come from?
Can I change them before the governor starts changing cpu frequencies?
Can I change the default governor (to powersave for example)?

You can achieve this by compiling the kernel with only the "powersave" governor enabled.
This sets the frequency statically to the lowest frequency supported by the CPU.
Check the kernel documentation to see whether this can be set as a kernel parameter for a multi-governor kernel.
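As a rough sketch (exact option names depend on your kernel version, and the boot parameter only exists on newer kernels), the single-governor build and the multi-governor alternative would look something like this:

    # .config fragment: build the powersave governor and make it the default
    CONFIG_CPU_FREQ=y
    CONFIG_CPU_FREQ_GOV_POWERSAVE=y
    CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE=y

    # On newer kernels, a multi-governor build can pick the default governor
    # at boot time (see Documentation/admin-guide/kernel-parameters.txt):
    cpufreq.default_governor=powersave

With the single-governor build there is nothing for the system to switch to, so the CPU stays at the lowest frequency from boot onward.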

Related

Is there a way to measure cache coherence misses

Given a program running on multiple cores, if two or more cores are operating on the same cache line, is there a way to measure the number of cache coherence invalidations/misses there are (i.e. when Core1 writes to the cache line, which then forces Core2 to refresh its copy of the cache line so that both cores are consistent)?
Let me know if I'm using the wrong terminology for this concept.
Yes, hardware performance counters can be used to do so.
However, the way to fetch them tends to depend on the operating system and your processor. On Linux, the perf tool can be used to track performance counters (in particular perf stat -e COUNTER_NAME_1,COUNTER_NAME_2,etc.). Alternatively, on both Linux & Windows, Intel VTune can do this too.
The list of the hardware counters can be retrieved using perf list (or with PMU-Tools).
The kind of metric you want to measure looks like Request For Ownership (RFO) in the MESI cache-coherence protocol. Fortunately, most modern (x86_64) processors include hardware events to measure RFOs. On Intel Skylake processors, there is a hardware event called l2_rqsts.all_rfo, and more precisely l2_rqsts.rfo_hit and l2_rqsts.rfo_miss, to do this at the L2-cache level. Alternatively, there are many more advanced RFO-related hardware events that can be used at the offcore level.
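For example, a perf invocation counting those events could look like this (./my_program is a placeholder for the binary under test; verify the exact event names with perf list on your machine, since they differ between microarchitectures):

    perf stat -e l2_rqsts.all_rfo,l2_rqsts.rfo_hit,l2_rqsts.rfo_miss ./my_program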

Should I enable SMP on heterogeneous multi-threaded CPUs?

I'm building the Linux kernel for a big.LITTLE board and I've been wondering about the CONFIG_SMP option, which enables the kernel's Symmetric-processing support.
Linux's documentation says this should be enabled on multi-threaded processors, but I wonder whether Symmetric Multi-Processing would only work properly on processors that are actually symmetric.
I understand what SMP is, but I haven't found any hint or documentation saying anything about its use on Linux built for ARM's big.LITTLE.
Yes, if you want to use more than a single core you have to enable CONFIG_SMP. This in itself will make all cores (both big and little ones) available to the kernel.
Then, you have two options (I'm assuming you are using the mainline Linux kernel or something not excessively different from it, e.g. not an Android kernel):
If you also enable CONFIG_BL_SWITCHER (-> Kernel Features -> big.LITTLE support -> big.LITTLE switcher support) and CONFIG_ARM_BIG_LITTLE_CPUFREQ (-> CPU Power Management -> CPU Frequency scaling -> CPU Frequency scaling -> Generic ARM big LITTLE CPUfreq driver), each big core in your SoC will be paired to a little core, and only one of the cores in each pair will be active at any given time, depending on the CPU load. So basically the number of logical cores will be half the number of physical cores, and each logical core will combine one physical big core and one physical little core (unless the total number of big cores differs from the number of little cores, in which case there will be non-paired physical cores that are also logical cores). For each logical core, switching between the big and little physical core will be managed by the cpufreq governor and will be conceptually equivalent to CPU frequency switching.
If you don't enable the above two configuration options, then all physical cores will be available as logical cores, can be active at the same time and are treated by the scheduler as if they were identical.
The first option is more suited if you are aiming at low power consumption, while the second option allows you to get the most out of the CPU.
This will change when Heterogeneous Multi-Processing (HMP) support is integrated in the mainline kernel.
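As a sketch, the .config fragments for the two options described above would look roughly like this (option names as given above; your particular kernel tree may require additional dependencies):

    # Option 1: paired big/little cores, switching driven by the cpufreq governor
    CONFIG_SMP=y
    CONFIG_BL_SWITCHER=y
    CONFIG_ARM_BIG_LITTLE_CPUFREQ=y

    # Option 2: every physical core exposed as a logical core
    CONFIG_SMP=y
    # CONFIG_BL_SWITCHER is not set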

Does OpenCL workgroup size matter in the OS X CPU runtime?

In the OS X OpenCL CPU runtime, the documentation here indicates that "Work items are scheduled in different tasks submitted to Grand Central Dispatch". That would seem to indicate that workgroups are essentially a no-op, and you should shoot for (number of work items) = (number of hardware threads) with (number of workgroups) being irrelevant. However, on other implementations, there are low-cost switches between items in the same workgroup via essentially coroutines (setjmp and longjmp), which would make it much less expensive to schedule more work-items (since you avoid a full OS-managed thread context switch between items), which in turn would make it easier to reuse code between CPU and GPU targets. According to "Heterogeneous Computing with OpenCL", AMD's CPU runtime does this, and I vaguely recall some documentation indicating the same is true for Intel's CPU runtime.
Can anyone confirm the behavior of workgroups in the OS X CPU runtime?
As stated later in the document (see the Autovectorizer section), workgroup size on CPU is linked to autovectorized code.
The autovectorizer aggregates several consecutive work-items into a single kernel function calling vector instructions (SSE, AVX) as much as possible.
Setting the workgroup size to 1 disables the autovectorizer. Larger values will enable vector code when available. In most cases, the generated code is able to efficiently use all the CPU resources.
In all cases, OpenCL on CPU runs on a small number of hardware threads.
Update: to answer the question in the comments.
It usually works quite well. Start with a "scalar" kernel and benchmark it to see the speedup provided by the autovectorizer, then "hand-vectorize" only if the speedup is not good enough. To help the compiler, avoid using "if", and prefer conditional assignments and bit operations.
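A minimal illustrative OpenCL C kernel (the kernel name and arguments are made up) contrasting the branchy form with a branch-free conditional assignment:

    // Hypothetical kernel: clamp values to a threshold without branching,
    // so the autovectorizer can pack consecutive work-items into vector instructions.
    __kernel void clamp_threshold(__global const float *in,
                                  __global float *out,
                                  const float threshold)
    {
        size_t i = get_global_id(0);
        float v = in[i];
        /* Branchy form the autovectorizer may struggle with:
           if (v > threshold) v = threshold; */
        out[i] = (v > threshold) ? threshold : v;   /* conditional assignment */
    }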

How to get real CPU frequency while system running?

Because of the Intel Turbo Boost technology, I can't trust the CPU frequency written on the chip. I want to get the real CPU frequency while the system is running. I found that the cpufreq device could help, but dev.cpu.n.freq turned out to be supported only as dev.cpu.0.freq.
There are no other OIDs like dev.cpu.1.freq or dev.cpu.n.freq.
Is there any useful tool that can show the CPU frequency immediately?
Regarding the absence of the dev.cpu.N.freq sysctl for N>0, look at the BUGS section of cpufreq(4):
When multiple CPUs offer frequency control, they cannot be set to different levels and must all offer the same frequency settings.
This is why only CPU 0 is reported.
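A quick check on a FreeBSD machine (assuming the cpufreq driver is attached; the reported value is the currently set level in MHz and applies to all CPUs):

    sysctl dev.cpu.0.freq          # current frequency level, shared by all CPUs
    sysctl dev.cpu.0.freq_levels   # available frequency/power levels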

high resolution timing in kernel?

I'm writing a kernel module that requires a function to be called at 0.1 ms intervals, measured with at least 0.01 ms precision. On my 250 MHz ARM CPU the HZ variable (jiffies per second) is 100, so anything jiffies-based is a no-go unless I can increase the jiffies granularity.
Any suggestions where/how to look?
Assuming the kernel you are running has hi-res timer support turned on (it is a build-time config option) and that you have suitable timer hardware that can raise an interrupt at such granularity, you can use the in-kernel hrtimer API to register a timer with the interval you need.
Here is the hrtimer documentation: http://www.mjmwired.net/kernel/Documentation/timers/hrtimers.txt
Bear in mind, though, that to truly get uninterrupted responses on such a scale you most probably also need to apply and configure the Linux RT (aka PREEMPT_RT) patches.
You can read more here: http://elinux.org/Real_Time
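A minimal sketch of such a module, assuming a kernel with hrtimers enabled (the module and function names here are made up, and exact API details can vary between kernel versions):

    #include <linux/module.h>
    #include <linux/hrtimer.h>
    #include <linux/ktime.h>

    static struct hrtimer sample_timer;
    static ktime_t period;

    /* Callback runs in interrupt context, so keep the work short. */
    static enum hrtimer_restart sample_timer_fn(struct hrtimer *timer)
    {
        /* do the periodic work here */
        hrtimer_forward_now(timer, period);   /* re-arm for the next period */
        return HRTIMER_RESTART;
    }

    static int __init sample_init(void)
    {
        period = ktime_set(0, 100 * 1000);    /* 100 us = 0.1 ms */
        hrtimer_init(&sample_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        sample_timer.function = sample_timer_fn;
        hrtimer_start(&sample_timer, period, HRTIMER_MODE_REL);
        return 0;
    }

    static void __exit sample_exit(void)
    {
        hrtimer_cancel(&sample_timer);
    }

    module_init(sample_init);
    module_exit(sample_exit);
    MODULE_LICENSE("GPL");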
