Can IRQ affinity be set for an MSI-X device (Linux)? - linux-kernel

I've set IRQ affinity in the past on Linux by writing values to the proc files. [1]
However, I noticed that when I do this on a system that uses MSI-X for the device
(PCIe) I want to set affinity for, e.g. a NIC, the /proc/interrupts counters increment
on every core for that IRQ, not just on the single core I set it for. On a non-MSI-X
system, the specified core answers the interrupts.
I'm using Linux kernel 3.11.
Short: Can IRQ affinity be set for devices that use MSI-X interrupts?
[1] https://www.kernel.org/doc/Documentation/IRQ-affinity.txt

Unburying this thread: I am trying to set IRQ (MSI-X) CPU affinity for my SATA controller in order to avoid CPU-switching delays.
So far, I got the current used IRQ via:
IRQ=$(awk -F: '/ahci/ {gsub(/ /, "", $1); print $1}' /proc/interrupts)
Just looking at the interrupts via cat /proc/interrupts shows that multiple CPUs are involved in handling my SATA controller.
I then set the IRQ affinity via a hex bitmask (02 selects the second CPU, i.e. CPU1):
echo 02 > /proc/irq/$IRQ/smp_affinity
I can test the effective affinity with
cat /proc/irq/$IRQ/effective_affinity
After a while of disk benchmarking, I noticed that the affinity stays as configured.
Example:
Before benchmark, having bound IRQ 134 to cpu 2:
cat /proc/interrupts | egrep "ahci|CPU"
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
134: 12421581 1 0 17 4166 0 0 0 IR-PCI-MSI 376832-edge ahci[0000:00:17.0]
After benchmark:
cat /proc/interrupts | egrep "ahci|CPU"
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
134: 12421581 2724836 0 17 4166 0 0 0 IR-PCI-MSI 376832-edge ahci[0000:00:17.0]
So in my case, the affinity that I set up stayed as it should.
I can only imagine that you have irqbalance running as a service.
Have you checked that?
In my case, running irqbalance redistributes the affinity and overrides the one I set up.
My test system: CentOS 8.2 4.18.0-193.6.3.el8_2.x86_64 #1 SMP Wed Jun 10 11:09:32 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
In the end, I did not achieve better disk utilization / performance. My underlying problem is that fio benchmarks do not drive the disk to 100%, only to somewhere between 75-85% (and sometimes 97%, without me knowing why).
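One detail worth making explicit for the original question: with MSI-X, a device exposes several interrupt vectors and each vector gets its own line in /proc/interrupts, so affinity has to be written per vector. A minimal sketch of pinning one vector per CPU, round-robin — the match string eth0 and the vector naming are assumptions, and it must run as root:

```shell
# Sketch: pin each MSI-X vector of a device to its own CPU, one vector per core.
# The match string "eth0" is hypothetical; adjust it to your device.
pin_vectors() {
    dev=$1
    cpu=0
    for irq in $(awk -F: -v d="$dev" '$0 ~ d {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
        # smp_affinity takes a hex CPU bitmask, e.g. 4 = CPU2
        printf '%x\n' $((1 << cpu)) > "/proc/irq/$irq/smp_affinity"
        cpu=$((cpu + 1))
    done
}
# pin_vectors eth0    # run as root
```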

Related

How to make cpuset.cpu_exclusive function of cpuset work correctly

I'm trying to use the kernel's cpuset to isolate my process. To do this, I followed the instructions (2.1 Basic Usage) from the kernel doc cpusets; however, it didn't work in my environment.
I have tried on both my CentOS 7 server and my Ubuntu 16.04 work PC, but neither worked.
centos kernel version:
[root@node ~]# uname -r
3.10.0-327.el7.x86_64
ubuntu kernel version:
4.15.0-46-generic
What I have tried is as follows.
root@Latitude:/sys/fs/cgroup/cpuset# pwd
/sys/fs/cgroup/cpuset
root@Latitude:/sys/fs/cgroup/cpuset# cat cpuset.cpus
0-3
root@Latitude:/sys/fs/cgroup/cpuset# cat cpuset.mems
0
root@Latitude:/sys/fs/cgroup/cpuset# cat cpuset.cpu_exclusive
1
root@Latitude:/sys/fs/cgroup/cpuset# cat cpuset.mem_exclusive
1
root@Latitude:/sys/fs/cgroup/cpuset# find . -name cpuset.cpu_exclusive | xargs cat
0
0
0
0
0
1
root@Latitude:/sys/fs/cgroup/cpuset# mkdir my_cpuset
root@Latitude:/sys/fs/cgroup/cpuset# echo 1 > my_cpuset/cpuset.cpus
root@Latitude:/sys/fs/cgroup/cpuset# echo 0 > my_cpuset/cpuset.mems
root@Latitude:/sys/fs/cgroup/cpuset# echo 1 > my_cpuset/cpuset.cpu_exclusive
bash: echo: write error: Invalid argument
root@Latitude:/sys/fs/cgroup/cpuset#
It just printed the error bash: echo: write error: Invalid argument.
Googling it, however, didn't turn up the correct answer.
As I pasted above, before this operation I confirmed that the cpuset root path has the cpu_exclusive function enabled and that none of the CPUs are excluded by other sub-cpusets.
Using ps -o pid,psr,comm -p $PID, I can confirm that the CPUs can be assigned to a process if I don't care about cpu_exclusive. But I have also found that without cpu_exclusive set, the same CPUs can also be assigned to other processes.
I don't know if some pre-setting is missing.
What I expect is "using cpuset to obtain exclusive use of CPUs". Can anybody give any clues?
Thanks very much.
I believe it is a misunderstanding of the cpu_exclusive flag, as it was for me. Here is the doc, https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt, quoting:
If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.
So one possible reason you get bash: echo: write error: Invalid argument is that some other cgroup cpuset is enabled, and it conflicts with your echo 1 > my_cpuset/cpuset.cpu_exclusive.
Please run find . -name cpuset.cpus | xargs cat to list all your cgroups' target CPUs.
Assume you have 12 CPUs: if you want to set cpu_exclusive on my_cpuset, you need to carefully modify all the other cgroups to use, e.g., CPUs 0-7, then set the cpus of my_cpuset to 8-11. Only after all these CPU configurations can you set cpu_exclusive to 1.
But still, other processes can use CPUs 8-11; only tasks that belong to the other cgroups will not use them.
For me, it was some running Docker containers that prevented me from setting my cpuset's cpu_exclusive.
Going by the kernel doc, I do not think it is possible to use CPUs exclusively through cgroups alone. One approach (which I know runs in production) is to isolate the CPUs and manage CPU affinity/cpusets ourselves.
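The carve-out described above can be sketched as follows. The CPU ranges and the cgroup name are illustrative, it assumes cgroup v1 mounted at /sys/fs/cgroup/cpuset, and the writes need root — treat it as a sketch, not a drop-in script:

```shell
# Sketch: give every pre-existing cpuset CPUs 0-7, then claim 8-11 exclusively.
# Ordering matters: the exclusive write fails while any sibling still lists 8-11.
carve_out_cpus() {
    cd /sys/fs/cgroup/cpuset || return 1
    for f in $(find . -mindepth 2 -name cpuset.cpus); do
        echo 0-7 > "$f"            # shrink all other cgroups first
    done
    mkdir -p my_cpuset
    echo 8-11 > my_cpuset/cpuset.cpus
    echo 0    > my_cpuset/cpuset.mems
    echo 1    > my_cpuset/cpuset.cpu_exclusive   # succeeds once nothing overlaps
}
# carve_out_cpus    # run as root
```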

Disable timer interrupt on Linux kernel

I want to disable the timer interrupt on some of the cores (1-2) of my machine, an x86 box running CentOS 7 with the RT patch. Both cores are isolated with nohz_full (you can see the cmdline below), but the timer interrupt keeps interrupting the real-time processes running on core1 and core2.
1. uname -r
3.10.0-693.11.1.rt56.632.el7.x86_64
2. cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-693.11.1.rt56.632.el7.x86_64 \
root=/dev/mapper/centos-root ro crashkernel=auto \
rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet \
default_hugepagesz=2M hugepagesz=2M hugepages=1024 \
intel_iommu=on isolcpus=1-2 irqaffinity=0 intel_idle.max_cstate=0 \
processor.max_cstate=0 idle=mwait tsc=perfect rcu_nocbs=1-2 rcu_nocb_poll \
nohz_full=1-2 nmi_watchdog=0
3. cat /proc/interrupts
CPU0 CPU1 CPU2
0: 29 0 0 IO-APIC-edge timer
.....
......
NMI: 0 0 0 Non-maskable interrupts
LOC: 835205157 308723100 308384525 Local timer interrupts
SPU: 0 0 0 Spurious interrupts
PMI: 0 0 0 Performance monitoring interrupts
IWI: 0 0 0 IRQ work interrupts
RTR: 0 0 0 APIC ICR read retries
RES: 347330843 309191325 308417790 Rescheduling interrupts
CAL: 0 935 935 Function call interrupts
TLB: 320 22 58 TLB shootdowns
TRM: 0 0 0 Thermal event interrupts
THR: 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 Machine check exceptions
MCP: 2 2 2 Machine check polls
CPUs/Clocksource:
4. lscpu | grep CPU.s
CPU(s): 3
On-line CPU(s) list: 0-2
NUMA node0 CPU(s): 0-2
5. cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
Thanks a lot for any help.
Moses
Even with nohz_full= you get some ticks on the isolated CPUs:
Some process-handling operations still require the occasional scheduling-clock tick. These operations include calculating CPU load, maintaining sched average, computing CFS entity vruntime, computing avenrun, and carrying out load balancing. They are currently accommodated by scheduling-clock tick every second or so. On-going work will eliminate the need even for these infrequent scheduling-clock ticks.
(Documentation/timers/NO_HZ.txt, cf. (Nearly) full tickless operation in 3.10 LWN, 2013)
Thus, you have to check the rate of the local timer, e.g.:
$ perf stat -a -A -e irq_vectors:local_timer_entry sleep 120
(while your isolated threads/processes are running)
Also, nohz_full= is only effective if there is just one runnable task on each isolated core. You can check that with e.g. ps -L -e -o pid,tid,user,state,psr,cmd and cat /proc/sched_debug.
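To spot violations of that one-runnable-task rule quickly, a small (assumed) one-liner can count how many threads are currently placed on the isolated cores:

```shell
# Count threads whose last-run CPU (psr column) is one of the isolated cores 1-2.
# More than one runnable task per isolated core defeats nohz_full.
ps -L -e -o psr= | awk '$1 == 1 || $1 == 2 { n++ } END { print n + 0 }'
```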
Perhaps you need to move some (kernel) tasks to your housekeeping core, e.g.:
# tuna -U -t '*' -c 0-4 -m
You can get more insights into what timers are still active by looking at /proc/timer_list.
Another method to investigate causes for possible interruption is to use the functional tracer (ftrace). See also Reducing OS jitter due to per-cpu kthreads for some examples.
I see nmi_watchdog=0 in your kernel parameters, but you don't disable the soft watchdog. Perhaps this is another timer tick source that would show up with ftrace.
You can disable all watchdogs with nowatchdog.
Btw, some of your kernel parameters seem to be off:
tsc=perfect - do you mean tsc=reliable? The 'perfect' value isn't documented in the kernel docs
idle=mwait - do you mean idle=poll? Again, I can't find the 'mwait' value in the kernel docs
intel_iommu=on - what's the purpose of this?

theano: simple neural network with 6 layers trains at only 132 datapoints per second. Why is it so slow?

I have a quite small neural network with two fully connected sigmoid subnetworks 10->100->10, whose outputs are concatenated and then fed into another 20->100->1 network.
This architecture is quite small, with just a few weight matrices, the largest being 20x100=2000 weights.
Even though I am using theano with all flags set for GPU acceleration, the system reaches only 132 iterations (datapoints!) per second. I am not using minibatches because it's not a typical neural network but a matrix factorization model.
The system I am using has the following specs:
OS
uname -a
Linux node081 2.6.32-358.18.1.el6.x86_64 #1 SMP Wed Aug 28 17:19:38 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
software
Python 3.5.2 (compiled by myself)
theano 0.9.0dev2.dev-ee4c4e21b9e9037f2aa9626c3d779382840ea2e3
NumPy 1.11.2
cpu
Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
nproc returns 12, so it likely has 12 logical cores
gpu
this is the output of nvidia-smi (with my process running):
+------------------------------------------------------+
| NVIDIA-SMI 5.319.49 Driver Version: 319.49 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m On | 0000:03:00.0 Off | Off |
| N/A 29C P0 49W / 225W | 92MB / 5119MB | 19% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 5903 python3 77MB |
+-----------------------------------------------------------------------------+
command and env vars
OMP_NUM_THREADS=8 THEANO_FLAGS=mode=FAST_RUN,device=gpu,init_gpu_device=gpu,floatX=float32,nvcc.flags=-D_FORCE_INLINES,print_active_device=True,enable_initial_driver_test=True,warn_float64=raise,force_device=True,assert_no_cpu_op=raise python3 "$@"
theano config settings set at run-time
theano.config.optimizer='fast_run'
theano.config.openmp=True
theano.config.openmp_elemwise_minsize=4
theano.config.floatX='float32'
theano.config.assert_no_cpu_op='raise'
I also tried deactivating OpenMP, and it runs slightly more slowly.
It seems that I took all the precautions to make sure GPU acceleration is set up correctly. What might be the reason for getting only 132 gradient updates per second? Is there any further check I should perform?
In Theano:
Compilation is much faster with optimizer=fast_compile than with optimizer=fast_run.
With the new backend and its new optimizer, compilation time has improved by 5X on certain networks. I'd suggest you always stick with the new backend; you can enable it with the device=cuda flag.
While you're using the new backend, I'd advise you to use the bleeding edge if you want speed. There are a lot of optimizations being done every week that have the potential to give great speedups.
From the docs, you can set the flag allow_gc=False to get faster speed.
You can set the config.nvcc.fastmath flag to True if you want some speedup from division and multiplication operations at the cost of precision.
If you have convolutional operations in your network, you can set some config.dnn flags depending upon your network and needs. Also, setting the cnmem flag will help.
Finally, whenever you're reporting code to be slow, please share profiling results to help development :)
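Pulling those flags together, an invocation might look like this — the script name is hypothetical and the exact flag set is only a suggestion, assuming Theano 0.9+ with the gpuarray backend:

```shell
# Hypothetical flag combination for the new (gpuarray) backend; adjust to taste.
THEANO_FLAGS='device=cuda,floatX=float32,optimizer=fast_run,allow_gc=False,nvcc.fastmath=True' \
    python3 train.py
```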

How to get CPU utilisation and RAM utilisation on a Mac from the command line

I know that using Activity Monitor we can see CPU utilisation, but I want to get this information for my remote system through a script. The top command is not helpful for me. Please suggest any other way to get it.
What is the objection to top in logging mode?
top -l 1 | grep -E "^CPU|^Phys"
CPU usage: 3.27% user, 14.75% sys, 81.96% idle
PhysMem: 5807M used (1458M wired), 10G unused.
Or use sysctl
sysctl vm.loadavg
vm.loadavg: { 1.31 1.85 2.00 }
Or use vm_stat
vm_stat
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free: 3569.
Pages active: 832177.
Pages inactive: 283212.
Pages speculative: 2699727.
Pages throttled: 0.
Pages wired down: 372883.
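Those vm_stat page counts can be turned into megabytes with a little awk, using the 4096-byte page size that vm_stat reports. The sample data is inlined here so the pipeline is self-contained; on a Mac you would pipe vm_stat in directly:

```shell
# Convert "Pages free" from vm_stat output into MiB.
# On macOS: vm_stat | awk '...'; here the sample from above is inlined instead.
awk '/Pages free/ { gsub(/\./, "", $3); printf "%.1f MiB free\n", $3 * 4096 / 1048576 }' <<'EOF'
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free: 3569.
Pages active: 832177.
EOF
```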

How to obtain the virtual private memory of a process from the command line under OSX?

I would like to obtain the virtual private memory consumed by a process under OSX from the command line. This is the value that Activity Monitor reports in the "Virtual Mem" column. ps -o vsz reports the total address space available to the process and is therefore not useful.
You can obtain the virtual private memory use of a single process by running
top -l 1 -s 0 -i 1 -stats vprvt -pid PID
where PID is the process ID of the process you are interested in. This results in about a dozen lines of output ending with
VPRVT
55M+
So by parsing the last line of output, one can at least obtain the memory footprint in MB. I tested this on OSX 10.6.8.
update
I realized (after I got downvoted) that @user1389686 gave an answer in the comment section of the OP that was better than my paltry first attempt. What follows is based on user1389686's own answer. I cannot take credit for it -- I've just cleaned it up a bit.
original, edited with -stats vprvt
As Mahmoud Al-Qudsi mentioned, top does what you want. If PID 8631 is the process you want to examine:
$ top -l 1 -s 0 -stats vprvt -pid 8631
Processes: 84 total, 2 running, 82 sleeping, 378 threads
2012/07/14 02:42:05
Load Avg: 0.34, 0.15, 0.04
CPU usage: 15.38% user, 30.76% sys, 53.84% idle
SharedLibs: 4668K resident, 4220K data, 0B linkedit.
MemRegions: 15160 total, 961M resident, 25M private, 520M shared.
PhysMem: 917M wired, 1207M active, 276M inactive, 2400M used, 5790M free.
VM: 171G vsize, 1039M framework vsize, 1523860(0) pageins, 811163(0) pageouts.
Networks: packets: 431147/140M in, 261381/59M out.
Disks: 487900/8547M read, 2784975/40G written.
VPRVT
8631
Here's how I get at this value using a bit of Ruby code:
# Return the virtual private memory size of the current process, in bytes
def virtual_private_memory
  s = `top -l 1 -s 0 -stats vprvt -pid #{Process.pid}`.split($/).last
  return nil unless s =~ /\A(\d+)([KMG])/
  $1.to_i * case $2
            when "K" then 1000
            when "M" then 1000000
            when "G" then 1000000000
            else raise ArgumentError.new("unrecognized multiplier in #{s}")
            end
end
An updated answer that works under Yosemite, from user1389686:
top -l 1 -s 0 -stats mem -pid PID
