In the OpenMPI codebase, each module has multiple variants. When calling mpirun, you can select the modules from the Modular Component Architecture (MCA) that you would like to use. The options include...
collective algorithms (coll): basic, tuned, inter, cuda, ml, sm, ...
byte-transfer layer (btl): openib, tcp, ...
point-to-point management layer (pml): cm, ob1, ...
matching transport layer (mtl): mxm, psm, ...
You can specify your choice of MCA components like this:
mpirun --mca btl self,openib --mca pml ob1 -np $nProcs ./myprogram
My questions:
If I leave some MCA parameters unspecified, what are the defaults?
Is there a verbose mode that will print all of the MCA components that are being used? (I tried adding -v to my mpirun command, and it didn't print anything extra.)
Depending on the version of Open MPI you have, either ompi_info --param all all (older versions) or ompi_info --all (newer versions) dumps the full list of available MCA parameters. The list shows each parameter's default value and where that value comes from, and most parameters are also documented there. Some MCA parameters only become available if certain other parameters are set. For example, the parameters that control the selection of algorithms for the collective communication operations in the tuned module only become available if coll_tuned_use_dynamic_rules is set to true. To have ompi_info list those too, pass --mca coll_tuned_use_dynamic_rules true to it.
To have all MCA variables dumped the moment MPI_Init() is called, set mpi_show_mca_params to all. The value of each MCA parameter and where that value comes from are then dumped to the standard error stream.
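As a rough sketch of how the pieces above fit together, the following Python helper assembles an mpirun command line with one --mca NAME VALUE pair per parameter, including mpi_show_mca_params. The helper name mpirun_command is made up for illustration; the parameter names and values are the ones from the question and the answer.

```python
import shlex


def mpirun_command(program, nprocs, mca=None):
    """Build an mpirun argument list with one '--mca NAME VALUE' pair per parameter.

    Hypothetical helper, only meant to illustrate the command-line layout.
    """
    argv = ["mpirun"]
    for name, value in (mca or {}).items():
        argv += ["--mca", name, str(value)]
    argv += ["-np", str(nprocs), program]
    return argv


cmd = mpirun_command(
    "./myprogram",
    4,
    mca={"btl": "self,openib", "pml": "ob1", "mpi_show_mca_params": "all"},
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) as is; the MCA dump then goes to the standard error stream of the job, as described above.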
I would like to know what different kinds of IPIs are available for x86_64 in Linux. In particular, I want to find the different interrupt handlers for IPI interrupts.
In Understanding the Linux Kernel, 3rd Edition by Daniel P. Bovet, Marco Cesati
https://www.oreilly.com/library/view/understanding-the-linux/0596005652/ch04s06.html
lists three kinds of IPIs:
CALL_FUNCTION_VECTOR
RESCHEDULE_VECTOR
INVALIDATE_TLB_VECTOR
However in the latest kernels, I find the below comment in arch/x86/include/asm/entry_arch.h.
/*
 * This file is designed to contain the BUILD_INTERRUPT specifications for
 * all of the extra named interrupt vectors used by the architecture.
 * Usually this is the Inter Process Interrupts (IPIs)
 */
/*
 * The following vectors are part of the Linux architecture, there
 * is no hardware IRQ pin equivalent for them, they are triggered
 * through the ICC by us (IPIs)
 */
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/include/asm/entry_arch.h?h=v5.6.15
Could someone confirm whether all the vectors listed in that file are different kinds of IPI for x86_64?
For ARM I could find a unified handler - handle_IPI() for all the IPIs. A switch case is used to find out which IPI.
On x86 any interrupt vector can be triggered by an IPI, so there is no designated interrupt vector for IPIs.
The image above depicts the format of the register used to send IPIs (the local APIC's Interrupt Command Register). In Fixed mode, the Vector field makes the target CPUs execute the interrupt service routine associated with that vector. It's as if an int vector instruction had been executed on the targets.
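To make the register format a bit more concrete, here is a small Python sketch that packs the fields of the low 32 bits of the ICR for a Fixed-mode IPI. The bit positions (vector in bits 0-7, delivery mode in bits 8-10, destination shorthand in bits 18-19) are the xAPIC layout as I remember it from the Intel manual, so double-check them there. The vector value 0xFB and the function name icr_low are just illustrative.

```python
# Field positions in the low 32 bits of the xAPIC Interrupt Command Register
# (assumed from the Intel SDM layout; verify against the manual).
DELIVERY_MODE_SHIFT = 8     # bits 8-10
DEST_SHORTHAND_SHIFT = 18   # bits 18-19

FIXED = 0b000               # Fixed delivery mode

# Destination shorthand encodings.
NO_SHORTHAND = 0
SELF = 1
ALL_INCLUDING_SELF = 2
ALL_EXCLUDING_SELF = 3


def icr_low(vector, delivery_mode=FIXED, shorthand=NO_SHORTHAND):
    """Pack vector, delivery mode and destination shorthand into one integer."""
    if not 0 <= vector <= 0xFF:
        raise ValueError("vector must fit in 8 bits")
    return (
        (shorthand << DEST_SHORTHAND_SHIFT)
        | (delivery_mode << DELIVERY_MODE_SHIFT)
        | vector
    )


# Fixed-mode IPI with an illustrative vector, sent to every CPU but the sender.
value = icr_low(0xFB, shorthand=ALL_EXCLUDING_SELF)
print(hex(value))
```

The other ICR fields (destination mode, level, trigger mode, destination) are left at zero here to keep the sketch short.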
So Linux can, theoretically, directly invoke any interrupt on any other CPU.
However, kernel modules often need to run a function on specific CPUs; so Linux has a set of utility functions like smp_call_function_single that will make the life of the programmer easy.
These functions are implemented with a mechanism that's worth a chapter of its own. I don't know the details, but it's not hard to imagine the basic idea behind it: have a global queue of functions to execute and an interrupt vector that, once invoked, dequeues an item and executes it.
By calling that interrupt vector with an IPI, Linux can make the target CPUs execute the given function.
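The basic idea can be modelled in a few lines of Python. This is only a toy model of the queue-plus-vector scheme described above, not the kernel's actual data structures; the names ToyCPU and toy_call_function_single are invented, and I use one queue per CPU rather than a single global one just to keep the model short.

```python
from collections import deque


class ToyCPU:
    """Toy stand-in for a CPU with a queue of pending function calls."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.call_queue = deque()
        self.log = []

    def call_function_interrupt(self):
        # Plays the role of the handler behind the "call function" vector:
        # dequeue every pending item and execute it on this CPU.
        while self.call_queue:
            func, arg = self.call_queue.popleft()
            self.log.append(func(self.cpu_id, arg))


def toy_call_function_single(cpus, target, func, arg):
    """Queue func(arg) for the target CPU, then 'send the IPI'."""
    cpu = cpus[target]
    cpu.call_queue.append((func, arg))
    # In the real thing this would be an IPI raising the vector on the target;
    # here the handler is simply called directly.
    cpu.call_function_interrupt()


cpus = [ToyCPU(i) for i in range(4)]
toy_call_function_single(cpus, 2, lambda cpu, x: f"cpu{cpu} got {x}", "hello")
print(cpus[2].log)
```

Only the targeted CPU runs the function; the other CPUs never see the queued item.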
The interrupt vectors you found are used for this. You probably want to look at their 64-bit counterparts in entry_64.S, under the guard #ifdef CONFIG_SMP.
apicinterrupt and apicinterrupt3 are just macros that define a label named by the second argument, call interrupt_entry with the first argument (the vector number) NOTted, and call the function named in the third argument.
Be careful: the 32-bit analog does some nasty prefix-concatenation with the target function name.
apicinterrupt CALL_FUNCTION_SINGLE_VECTOR call_function_single_interrupt smp_call_function_single_interrupt is roughly equivalent to defining the function:
;Metadata stuff (e.g. section placement)
call_function_single_interrupt: ;<-- second arg (the label)
push ~CALL_FUNCTION_SINGLE_VECTOR ;<-- first arg (the vector number, NOTted)
call interrupt_entry
;other stuff (tracing, flags, etc)
call smp_call_function_single_interrupt ;<-- third arg
;other stuff (like above, plus returning)
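The NOT in the push is easy to check with a bit of arithmetic. The sketch below uses 0xFB as an illustrative value for CALL_FUNCTION_SINGLE_VECTOR (check irq_vectors.h for the real one in your kernel). As far as I can tell, the handler side simply applies NOT again to get the vector back, and the pushed value is always negative, which would keep it distinct from non-negative values stored in the same stack slot; treat that last part as my assumption.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

# Illustrative vector number; the real value lives in irq_vectors.h.
CALL_FUNCTION_SINGLE_VECTOR = 0xFB

# What "push ~VECTOR" leaves on the stack, as a signed Python integer ...
pushed = ~CALL_FUNCTION_SINGLE_VECTOR
# ... and as the 64-bit pattern it would have in memory.
pushed_bits = pushed & MASK64

# Applying NOT a second time recovers the original vector number.
recovered = ~pushed

print(pushed, hex(pushed_bits), hex(recovered))
```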
The vector numbers are defined in irq_vectors.h and are, of course, also used in idt.c for the IDT.
The target functions (the interrupt handlers) are mostly (all? I didn't check) defined in smp.c and they probably are the closest thing to the ARM's handle_IPI handler.
Those seem to be the only vectors invoked through an IPI.
I am wondering what the difference is between these two arguments on the Linux kernel command line:
noexec=off
nosmep
In both cases the kernel is prevented from executing code that lives in userland memory.
But I cannot see any difference between them.
The error message in dmesg is different but the behaviour seems to be the same.
Thanks
The noexec parameter controls whether the kernel can use the XD flag (also called the NX flag) of the paging structures to mark pages that are not supposed to be executable as such. The nosmep parameter, on the other hand, specifies whether SMEP is enabled. Note that nosmep only has an effect when both the kernel version and the processor support SMEP (See: How can i enable/disable kernel kaslr, smep and smap). In addition, XD only has an effect when the kernel is running in 64-bit mode or using 32-bit PAE paging, and IA32_EFER.NXE is set to 1.
The XD and SMEP flags determine whether the instruction at a given memory location can be fetched. SMEP overrides XD, which means that if SMEP is set, supervisor-mode code is not allowed to fetch instructions (for execution) from a User page irrespective of the XD flag. Otherwise, if SMEP is not supported or is disabled, instruction fetch is not allowed in the following cases:
Supervisor-mode code attempts to fetch instructions from a User or Supervisor page with a translation whose XD flag is 1 in at least one of the paging structures.
User-mode code attempts to fetch instructions from a User page with a translation whose XD flag is 1 in at least one of the paging structures.
User-mode code attempts to fetch instructions from a Supervisor page.
In any of these cases, a page fault Exception (#PF) occurs.
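The rules above can be written down as a small decision function. This is a simplified model, assuming XD is actually being enforced (IA32_EFER.NXE = 1) and collapsing "XD set in at least one paging structure" into a single boolean; the function name fetch_allowed is made up.

```python
def fetch_allowed(supervisor_mode, user_page, xd, smep_enabled):
    """Return True if an instruction fetch is permitted, False if it faults (#PF).

    supervisor_mode: the fetch is made by supervisor-mode code
    user_page:       the translation maps a User page (else a Supervisor page)
    xd:              XD is 1 in at least one paging structure of the translation
    smep_enabled:    SMEP is supported and enabled
    """
    if xd:
        # A page marked execute-disable can never be fetched from.
        return False
    if supervisor_mode:
        # SMEP forbids supervisor fetches from User pages.
        return not (smep_enabled and user_page)
    # User-mode code may only fetch from User pages.
    return user_page


# Kernel jumping into userland code on a page without XD:
print(fetch_allowed(True, True, xd=False, smep_enabled=True))   # blocked by SMEP
print(fetch_allowed(True, True, xd=False, smep_enabled=False))  # allowed
```

This also shows why the two boot parameters look similar from the outside: with a userland page that has XD set, the kernel fetch faults either way, and only the reason for the fault differs.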
I compiled the BPF example from samples/bpf/parse_simple.c (from the Linux kernel tree) with a very simple change:
SEC("simple")
int handle_ingress(struct __sk_buff *skb)
{
return TC_ACT_SHOT;
}
So I want ANY packet to be dropped. This happens on Ubuntu 16.04.3 LTS with kernel 4.4.0-98, llvm and clang version 3.8 installed from packages, and the latest iproute2 from GitHub. I install it as follows:
$ tc qdisc add dev eth0 clsact
$ tc filter add dev eth0 ingress bpf \
object-file ./net-next.git/samples/bpf/parse_simple.o \
section simple verbose
Prog section 'simple' loaded (5)!
- Type: 3
- Instructions: 2 (0 over limit)
- License: GPL
Verifier analysis:
0: (b7) r0 = 2
1: (95) exit
processed 2 insns, stack depth 0
So it seems to install successfully. However, this filter/eBPF program does not drop packets: when I generate ingress traffic on the eth0 interface, e.g. ICMP, it still goes through. What am I doing wrong?
TL;DR: You should add the direct-action flag to the tc filter command, as in
tc filter add dev eth0 ingress bpf \
object-file ./net-next.git/samples/bpf/parse_simple.o \
section simple direct-action verbose
               ^^^^^^^^^^^^^
The short help for the tc bpf filter (bpf help) mentions this flag, but it has not made its way to the tc-bpf(8) manual page at this time, if I remember correctly.
So, what is this flag for?
eBPF programs can be attached in two ways with tc: as actions, or as classifiers. Classifiers, attached with tc filter add, are supposed to be used for filtering packets, and do not apply an action by default. This means that their return values have the following meaning (from man tc-bpf):
0 , denotes a mismatch
-1 , denotes the default classid configured from the command line
else , everything else will override the default classid to provide a facility for non-linear matching
Actions attached with tc action add, on the other hand, can drop or mirror or perform other operations with packets, but they are not supposed to actually filter them.
Because eBPF is kind of more flexible than the traditional actions and filters of tc, you can actually do both at once, filter a packet (i.e. identify this packet) and perform an action on it. To reflect this flexibility, the direct-action, or da flag was added (for kernel 4.4 or newer, with matching iproute2 package). It tells the kernel to use the return values of actions (TC_ACT_SHOT, TC_ACT_OK, etc.) for classifiers. And this is what you need here to return TC_ACT_SHOT in a way the kernel understands you want to drop the packet.
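To illustrate why the same return value behaves so differently, here is a small Python sketch of the two interpretations. It is only a model of the semantics described above, not of the kernel code. The numeric TC_ACT_* values are the ones from linux/pkt_cls.h as far as I know (TC_ACT_SHOT being 2 matches the r0 = 2 in the verifier output of the question); the function name interpret_return is invented.

```python
# Action codes (values as in linux/pkt_cls.h, to the best of my knowledge).
TC_ACT_UNSPEC = -1
TC_ACT_OK = 0
TC_ACT_RECLASSIFY = 1
TC_ACT_SHOT = 2
TC_ACT_PIPE = 3

ACTION_NAMES = {
    TC_ACT_UNSPEC: "TC_ACT_UNSPEC",
    TC_ACT_OK: "TC_ACT_OK (pass)",
    TC_ACT_RECLASSIFY: "TC_ACT_RECLASSIFY",
    TC_ACT_SHOT: "TC_ACT_SHOT (drop)",
    TC_ACT_PIPE: "TC_ACT_PIPE",
}


def interpret_return(ret, direct_action):
    """Describe how a program's return value is read in each mode."""
    if direct_action:
        # With direct-action, the value is taken as an action code.
        return "action: " + ACTION_NAMES.get(ret, f"other ({ret})")
    # Plain classifier semantics, as quoted from man tc-bpf.
    if ret == 0:
        return "classifier: mismatch"
    if ret == -1:
        return "classifier: match, default classid"
    return f"classifier: match, classid overridden to {ret}"


print(interpret_return(TC_ACT_SHOT, direct_action=False))
print(interpret_return(TC_ACT_SHOT, direct_action=True))
```

Without the flag, returning 2 just means "matched, use classid 2", so nothing is dropped; with the flag, the same 2 is read as TC_ACT_SHOT.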
If I remember correctly, the reason why we use this flag instead of just dropping filters in favor of actions is that you need a filter anyway with tc to attach your action to (to be confirmed). So with the direct-action flag you do not have to attach both one filter and one action; the filter can do both operations. This should be the preferred way to go for eBPF programming with tc.
I would like to disable c-states on my computer.
I disabled C-states in the BIOS, but it had no effect. However, I found an explanation:
"Most newer Linux distributions, on systems with Intel processors, use the “intel_idle” driver (probably compiled into your kernel and not a separate module) to use C-states. This driver uses knowledge of the various CPUs to control C-states without input from system firmware (BIOS). This driver will mostly ignore any other BIOS setting and kernel parameters"
I found two solutions to this problem, but I don't know how to apply them:
1) " so if you want control over C-states, you should use kernel parameter “intel_idle.max_cstate=0” to disable this driver."
I don't know how to check the value of intel_idle.max_cstate, nor how to change it.
2) "To dynamically control C-states, open the file /dev/cpu_dma_latency and write the maximum allowable latency to it. This will prevent C-states with transition latencies higher than the specified value from being used, as long as the file /dev/cpu_dma_latency is kept open. Writing a maximum allowable latency of 0 will keep the processors in C0"
I can't read the file cpu_dma_latency.
Thanks for your help.
Computer:
Intel Xeon CPU E5-2620
Gnome 2.28.2
Linux 2.6.32-358
To alter the value at boot time, you can modify the GRUB configuration or edit it on the fly -- the method to modify that varies by distribution. This is the Ubuntu documentation to change kernel parameters either for a single boot, or permanently. For a RHEL-derived distribution, I don't see docs that are quite as clear, but you directly modify /boot/grub/grub.conf to include the parameter on the "kernel" lines for each bootable stanza.
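To check whether the parameter actually made it onto the kernel command line after a reboot, you can look at /proc/cmdline. Here is a minimal Python sketch of that check; the sample command line is made up, and parse_cmdline is a hypothetical helper that ignores quoting, which is enough for a simple name=value parameter like this one.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Parameters without '=' (e.g. 'quiet') map to None. Quoted values
    containing spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Made-up example of what /proc/cmdline might contain after the change.
sample = "ro root=/dev/sda1 quiet intel_idle.max_cstate=0"
params = parse_cmdline(sample)
print(params.get("intel_idle.max_cstate"))
```

On the real machine you would pass the contents of /proc/cmdline instead of the sample string. If I am not mistaken, the value the driver is currently using is also exposed read-only under /sys/module/intel_idle/parameters/max_cstate.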
For the second part of the question, many device files are read-only or write-only. You could use a small perl script like this (untested and not very clean, but should work) to keep the file open:
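#!/usr/bin/perl
use strict;
use warnings;

# The latency request only lasts while the file stays open.
open(my $fh, '>', '/dev/cpu_dma_latency')
    or die "Cannot open /dev/cpu_dma_latency: $!";
# Write 0 us as a raw 32-bit integer, unbuffered; as far as I recall,
# older kernels such as 2.6.32 expect exactly 4 bytes here.
syswrite($fh, pack('l', 0)) or die "Write failed: $!";
print "Press CTRL-C to end.\n";
while (1) {
    sleep 5;
}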
Redhat has a C snippet in a KB article here as well and more description of the parameter.
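For what it's worth, the same idea as a Python sketch, wrapped in a context manager so that the file is held open exactly as long as the with block runs. It is untested against a real /dev/cpu_dma_latency here; I am assuming the kernel accepts the latency as a raw 32-bit integer in native byte order. The names encode_latency and cpu_dma_latency are my own.

```python
import contextlib
import os
import struct


def encode_latency(latency_us):
    """Encode the maximum allowable latency as a native-endian 32-bit integer."""
    return struct.pack("=i", latency_us)


@contextlib.contextmanager
def cpu_dma_latency(latency_us, path="/dev/cpu_dma_latency"):
    """Write the latency request and keep the file open for the 'with' body."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, encode_latency(latency_us))
        yield fd
    finally:
        # Closing the file drops the latency request again.
        os.close(fd)
```

Usage would look like `with cpu_dma_latency(0): run_latency_sensitive_work()`, run as root, where run_latency_sensitive_work stands for whatever you need to run with the deeper C-states disabled.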
I want to know which core of a multicore processor initializes first when the CPU boots (I mean at the bootloader level). Is it the first core, or a random core?
You want to read the local APIC, which you can read about in "volume 2a":
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Each processor has a corresponding local APIC, and each local APIC has an APIC ID register, which is assigned a unique value at system init time.
The initial core that comes online is called the bootstrap processor (BSP), and can indeed be any physical core on the die. More info is in "volume 3a", where they talk about the bootstrap processor selection process.
Here is an excerpt from vol3a:
8.4.1 BSP and AP Processors
The MP initialization protocol defines two classes of processors: the bootstrap processor (BSP) and the application processors (APs). Following a power-up or RESET of an MP system, system hardware dynamically selects one of the processors on the system bus as the BSP. The remaining processors are designated as APs.
As part of the BSP selection mechanism, the BSP flag is set in the IA32_APIC_BASE MSR (see Figure 10-5) of the BSP, indicating that it is the BSP. This flag is cleared for all other processors.
The BSP executes the BIOS’s boot-strap code to configure the APIC environment, sets up system-wide data structures, and starts and initializes the APs. When the BSP and APs are initialized, the BSP then begins executing the operating-system initialization code.
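If you want to see which logical CPU carries the BSP flag on a running Linux box, you can read IA32_APIC_BASE through the msr driver. The sketch below assumes the MSR sits at address 0x1B with the BSP flag in bit 8 (that is how I read the manual), and that /dev/cpu/N/msr is available (msr module loaded, running as root). The function names are made up.

```python
import os
import struct

IA32_APIC_BASE = 0x1B   # MSR address (assumed from the Intel SDM)
BSP_FLAG = 1 << 8       # bit 8: set only on the bootstrap processor


def is_bsp(apic_base_value):
    """Return True if the BSP flag is set in an IA32_APIC_BASE value."""
    return bool(apic_base_value & BSP_FLAG)


def read_msr(cpu, msr=IA32_APIC_BASE):
    """Read one 64-bit MSR of a given CPU via /dev/cpu/<cpu>/msr (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The file offset selects the MSR address; each read returns 8 bytes.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


# Illustrative values: APIC enabled with and without the BSP flag.
print(is_bsp(0xFEE00900), is_bsp(0xFEE00800))
```

Looping read_msr over all CPU numbers and printing the one for which is_bsp returns True should show which core the hardware picked as the BSP.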
This depends on the architecture of the processor itself. There is not really a standard for something like this. For instance, the PS3's processor has 9 cores, one of which schedules tasks to the other 8. In this case it is fair to think about it in terms of that one core processing instructions before the other 8. As far as other processors are concerned, this is a more difficult thing to discern. It would be sensible to assume that the bootloader sends its instructions to the set of cores, at which point whatever logic gates assign instructions to cores do so in whatever manner they always do. In most cases I know of, there is not really a difference between task scheduling at boot and at any other time. The most basic task scheduling hardware will just select the first available core, which is usually whichever core is considered the "first" one by the machine. But like I keep saying, different machines do it differently, so I would suggest finding out what processor you are using and checking what that one does. Good luck.
Each processor has its own local APIC with its own local APIC ID, which can be read from a local APIC register (the same register returns a different ID on each processor).