Slots command in hostfile for mpirun not recognised - multiprocessing

I saw another question that seemed similar, mpirun: token slots not supported, but its solution did not work for me.
I get the error
token slots not supported at this time
when running the command mpirun -hostfile temp.txt hostname
where temp.txt is
hostname1 slots=2
hostname2 slots=2
I have mpirun version 2021.5, Release Date: 20211102 (id: 9279b7d62).
It also did not work to write
hostname1:2
hostname2:2
instead; in that case the command runs, but it launches the default number of processes, i.e. the number of physical processors that are available.
EDIT: I am adding the full output
[host@RAMSES]$ mpirun -hostfile temp.txt hostname
[mpiexec@host] HYD_hostfile_process_tokens (../../../../../src/pm/i_hydra/libhydra/hostfile/hydra_hostfile.c:47): token slots not supported at this time
[mpiexec@host] HYD_hostfile_unique_parse (../../../../../src/pm/i_hydra/libhydra/hostfile/hydra_hostfile.c:232): unable to process token
[mpiexec@host] match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:83): match handler returned error
[mpiexec@host] HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
[mpiexec@host] mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1359): error parsing input array
[mpiexec@host] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1784): error parsing parameters

So I found that on my version of MPI I had to specify processor placement not in the hostfile, as most of the examples I found do, but in the machinefile.
So the new command and file look like:
mpirun -machinefile machine.txt hostname
machine.txt:
host1:2
host2:2
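To sanity-check the placement, the machinefile can be paired with an explicit rank count; this is just a sketch, and the count of 4 is illustrative:
mpirun -machinefile machine.txt -n 4 hostname
# each host name should be printed twice if the :2 counts are honoured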

Related

snmpd.conf clientaddr not working for sending trap /inform with given IP source address

Given the following sample/simple snmpd.conf (Net-SNMP 5.7.2 on RHEL 7.4)
rwcommunity private 192.168.56.101
trapsess -Ci --clientaddr=192.168.56.128 -v 2c -c private 192.168.56.101:162
when starting an SNMP daemon:
snmpd -f -Lo -D -C -c data/snmpd_test.conf udp:192.168.56.128:161
We obtain a "Start Up" InformRequest with IP source 192.56.168.1 instead of ...128 (Wireshark snapshot below).
This is not surprising, as the -D option outputs debug information saying:
trace: netsnmp_config_process_memory_list(): read_config.c, 696:
read_config:mem: processing memory: clientaddr 192.168.56.128
trace: run_config_handler(): read_config.c, 562:
9:read_config:parser: clientaddr handler not registered for this time
Web sources, however, say:
snmp.conf
...This value is also used by snmpd when generating notifications.
snmpd.conf
trapsess [SNMPCMD_ARGS] HOST
provides a more generic mechanism for defining notification destinations.
SNMPCMD_ARGS should be the command-line options required for an equivalent
snmptrap (or snmpinform) command to send the desired notification
I also read some old threads like this one.
However, this option works well with snmptrap:
snmptrap -D -Lo -Ci --clientaddr=192.168.56.128 -M+path_to_my_mibs -v 2c -c private 192.168.56.101:162 "" .1.3.6.1.4.1.a.b.c.d.e.f.0 i 0
This option also works when placed in snmp.conf (mind, there is no 'd' here), where it applies to snmpset and snmpget (and maybe others).
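For instance, a minimal client-side entry would be just the following (the /etc/snmp/snmp.conf path is an assumption; a per-user ~/.snmp/snmp.conf works as well):
# /etc/snmp/snmp.conf -- note: snmp.conf, not snmpd.conf
clientaddr 192.168.56.128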
So my question is: is it a documentation error, a bug, or a misuse of the Net-SNMP stack?
After a long struggle I may have an answer, and I am writing a short note as I just found a trick.
It seems that clientaddr is not parsed correctly anywhere in snmpd.conf
(I also tried it inside the trapsess line).
But it seems to be a valid option on the snmpd command line,
just as it is a valid option on the snmptrap command line, so I assumed the same parsing mechanism could apply to both.
A further condition is that the IP address must be a valid one,
which means that
snmpd -f -Lo -D -C -c data/snmpd_test.conf --clientaddr=192.168.56.128 udp:192.168.56.128:161
seems to fully solve my problem.
I will perform more tests and, if this holds up, format this answer a little better, but it seems a good hint.

command output not captured by shell script when invoked by snmp pass

The problem
SNMPD is correctly delegating SNMP polling requests to another program but the response from that program is not valid. A manual run of the program with the same arguments is responding correctly.
The detail
I've installed the correct LSI raid drivers on a server and want to configure SNMP. As per the instructions, I've added the following to /etc/snmp/snmpd.conf to redirect SNMP polling requests with a given OID prefix to a program:
pass .1.3.6.1.4.1.3582 /usr/sbin/lsi_mrdsnmpmain
It doesn't work correctly for SNMP polling requests:
snmpget -v1 -c public localhost .1.3.6.1.4.1.3582.5.1.4.2.1.2.1.32.1
I get the following response:
Error in packet
Reason: (noSuchName) There is no such variable name in this MIB.
Failed object: SNMPv2-SMI::enterprises.3582.5.1.4.2.1.2.1.32.1
What I've tried
SNMPD passes two arguments, -g and <oid>, and expects a three-line response: <oid>, <data-type> and <data-value>.
If I manually run the following:
/usr/sbin/lsi_mrdsnmpmain -g .1.3.6.1.4.1.3582.5.1.4.2.1.2.1.32.0
I get the correct three-line response:
.1.3.6.1.4.1.3582.5.1.4.2.1.2.1.32.0
integer
30
This means that the pass command and the /usr/sbin/lsi_mrdsnmpmain program are both working correctly in this example.
I tried replacing /usr/sbin/lsi_mrdsnmpmain with a bash script. The bash script delegates the call and logs the supplied arguments and output from the delegated call:
#!/bin/bash
# Log the arguments received from snmpd, delegate to the real handler,
# echo its output back to snmpd, and log that output too.
echo "In: '$@" > /var/log/snmp-pass-test
RETURN=$(/usr/sbin/lsi_mrdsnmpmain "$@")
echo "$RETURN"
echo "Out: '$RETURN'" >> /var/log/snmp-pass-test
I then modified the pass command to point at the bash script. If I run the bash script manually (/usr/sbin/snmp-pass-test -g .1.3.6.1.4.1.3582.5.1.4.2.1.2.1.32.0) I get the correct three-line response, as I did when I ran /usr/sbin/lsi_mrdsnmpmain manually, and I get the following logged:
In: '-g .1.3.6.1.4.1.3582.5.1.4.2.1.2.1.32.0
Out: '.1.3.6.1.4.1.3582.5.1.4.2.1.2.1.32.0
integer
30'
When I rerun the snmpget test, I get the same Error in packet... error and the bash script's logging shows that the captured delegated call output is empty:
In: '-g .1.3.6.1.4.1.3582.5.1.4.2.1.2.1.32.0
Out: ''
If I modify the bash script to only echo an empty line I also get the same Error in packet... message.
I've also tried ensuring that the environment variables that are present when I manually call /usr/sbin/lsi_mrdsnmpmain are the same for the bash script but I get the same empty output.
Finally, my questions
Why would the bash script behave differently in these two scenarios?
Is it likely that the problem with the bash script is the same as the one originally noticed (the manually run program behaves differently from the SNMPD-run program)?
Updates
eewanco's suggestions
What user is running the program in each scenario?
I added echo "$(whoami)" > /var/log/snmp-pass-test to the bash script and root was added to the logs
Maybe try executing it in cron
I added the following to root's crontab, and the correct three-line response was logged:
* * * * * /usr/sbin/lsi_mrdsnmpmain -g .1.3.6.1.4.1.3582.5.1.4.2.1.2.1.32.1 >> /var/log/snmp-test-cron 2>&1
Grisha Levit's suggestion
Try logging the stderr
There aren't any errors logged
Checking /var/log/messages
When I run it via SNMPD, I get MegaRAID SNMP AGENT: Error in getting Shared Memory(lsi_mrdsnmpmain) logged. When I run it directly, I don't. I've done a bit of googling and I may need lm_sensors installed; I'll try this.
I installed lm_sensors & compat-libstdc++-33.i686 (the latter because it said it was a pre-requisite from the instructions and I was missing it), uninstalled and reinstalled the LSI drivers and am experiencing the same issue.
SELinux
I accidentally stumbled upon a page about extending snmpd with scripts, and it says to check that the script has the right SELinux context. I ran grep AVC /var/log/audit/audit.log | grep snmp before and after running a snmpget, and the following entry is added as a direct result of running snmpget:
type=AVC msg=audit(1485967641.075:271): avc: denied { unix_read unix_write } for pid=5552 comm="lsi_mrdsnmpmain" key=558265 scontext=system_u:system_r:snmpd_t:s0 tcontext=system_u:system_r:initrc_t:s0 tclass=shm
I'm now assuming that SELinux is causing the call to fail; I'll dig further...see answer for solution.
strace (eewanco's suggestion)
Try using strace with and without snmp and see if you can catch a system call failure or some additional hints
For completeness, I wanted to see if strace would have hinted that SELinux was denying. I had to remove the policy packages using semodule -r <policy-package-name> to reintroduce the problem, then ran the following:
strace snmpget -v1 -c public localhost .1.3.6.1.4.1.3582.5.1.4.2.1.2.1.32.1 >> strace.log 2>&1
The end of strace.log is as follows and unless I'm missing something, it doesn't seem to provide any hints:
...
sendmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("127.0.0.1")}, msg_iov(1)= [{"0;\2\1\0\4\20public\240$\2\4I\264-m\2"..., 61}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=, ...}, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 61
select(4, [3], NULL, NULL, {0, 999997}) = 1 (in [3], left {0, 998475})
brk(0xab9000) = 0xab9000
recvmsg(3, {msg_name(16)={sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("127.0.0.1")}, msg_iov(1)= [{"0;\2\1\0\4\20public\242$\2\4I\264-m\2"..., 65536}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 61
write(2, "Error in packet\nReason: (noSuchN"..., 81Error in packet
Reason: (noSuchName) There is no such variable name in this MIB.
) = 81
write(2, "Failed object: ", 15Failed object: ) = 15
write(2, "SNMPv2-SMI::enterprises.3582.5.1"..., 48SNMPv2- SMI::enterprises.3582.5.1.4.2.1.2.1.32.1
) = 48
write(2, "\n", 1
) = 1
brk(0xaa9000) = 0xaa9000
close(3) = 0
exit_group(2) = ?
+++ exited with 2 +++
It was SELinux that was denying snmpd a delegated call to /usr/sbin/lsi_mrdsnmpmain (and probably beyond).
To identify it, I ran grep AVC /var/log/audit/audit.log and for each entry, I ran the following:
echo "<grepped-output>" | audit2allow -a -M <filename>
This creates a SELinux policy package that should allow the delegated call through. The package is then loaded using the following:
semodule -i <filename>.pp
I had to do this 5 times as there were different causes of denial (unix_read unix_write, associate, read write). I'll look to combine the modules into one.
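As a sketch of combining them (the module name snmp_lsi_local is made up here), all the snmp-related AVC denials could be fed to audit2allow in one go and loaded as a single package:
grep AVC /var/log/audit/audit.log | grep snmp | audit2allow -M snmp_lsi_local
semodule -i snmp_lsi_local.pp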
Now when I run snmpget I get the correct delegated output:
SNMPv2-SMI::enterprises.3582.5.1.4.2.1.2.1.32.1 = INTEGER: 34

libvirt image based provisioning using logical volumes

Are there known issues with image-based provisioning using logical volumes in libvirt? I am getting this error while trying to do so:
Unable to save
Failed to create a compute kvm2 (Libvirt) instance test3.xxx.local: Call
to virNetworkCreateXML failed:
internal error: Child process (/usr/sbin/lvcreate --name
test3.xxx.local-disk1 -L 1K --type snapshot --virtualsize 10485760K -s
/vm-images-pool/images-vol/template_minimal) unexpected exit status 3:
2017-01-05 00:42:08.133+0000: 12330: debug : virFileClose:102 : Closed fd 29
2017-01-05 00:42:08.133+0000: 12330: debug : virFileClose:102 : Closed fd 31
2017-01-05 00:42:08.133+0000: 12330: debug : virFileClose:102 : Closed fd 27
Volume group name expected (no slash) Run `lvcreate --help' for more
information
This link from Red Hat flags it as a known issue:
https://access.redhat.com/solutions/1995053
That doc has a date of October 20 2015. Not sure if anything changed after that to support LVs.
I tried to satisfy the requirement in that doc by creating a pool based on dir like this:
Setup:
Storage pool vm-images-pool-dir of type dir
Storage pool vm-images-pool of type logical
template_minimal is the image template.
[root@kvm2 libvirt]# virsh vol-list vm-images-pool-dir
Name Path
----------------------------------------------------------------------------
template_minimal /vm-images-pool/images-vol/template_minimal
vm-images-pool storage pool is of type VG with one volume:
images-vol vm-images-pool -wi-ao---- 249.00g
images-vol is mounted under /vm-images-pool/images-vol/
Any insight is appreciated.
Thanks,
TG
=======================================
more details.
Daniel, Thanks. I am a bit confused. I couldn't put the actual commands earlier since I had cleaned them up. I recreated the setup. Here are the commands I used:
virsh pool-define-as vm-images-pool logical --source-dev /dev/mapper/mpathd
virsh pool-build vm-images-pool
virsh pool-start vm-images-pool
virsh vol-create-as vm-images-pool images-vol --capacity 249G
virsh pool-define-as vm-images-pool-dir dir - - - - /vm-images-pool/images-vol/
virsh pool-build vm-images-pool-dir
virsh pool-start vm-images-pool-dir
[root@kvm2 ~]# virsh vol-list vm-images-pool-dir
Name Path
------------------------------------------------------------------------------
lost+found /vm-images-pool/images-vol/lost+found
template_minimal /vm-images-pool/images-vol/template_minimal
=======================================
/vm-images-pool/images-vol/template_minimal is the path used for template image
==================================
more tests:
I mounted the logical volume at a mount point matching the directory-based storage pool:
[root@kvm2 ~]# df -h /vm-images-pool-dir/images-vol
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vm--images--pool-images--vol 245G 1.2G 232G 1% /vm-images-pool-dir/images-vol
[root@kvm2 ~]# virsh vol-list vm-images-pool-dir
Name Path
------------------------------------------------------------------------------
lost+found /vm-images-pool-dir/images-vol/lost+found
template_minimal /vm-images-pool-dir/images-vol/template_minimal
[root@kvm2 ~]#
I used /vm-images-pool-dir/images-vol/template_minimal as the template path;
same result:
Unable to save
Failed to create a compute kvm2 (Libvirt) instance test3.xxx.local: Call
to virNetworkCreateXML failed: internal error: Child process
(/usr/sbin/lvcreate --name test3.xxx.local-disk1 -L 1K --type
snapshot --virtualsize 10485760K -s /vm-images-pool-dir/images-vol/template_minimal)
unexpected exit status 3:
2017-01-05 16:45:10.694+0000: 40712: debug : virFileClose:102 : Closed fd 27
2017-01-05 16:45:10.694+0000: 40712: debug : virFileClose:102 : Closed fd 29
2017-01-05 16:45:10.694+0000: 40712: debug : virFileClose:102 : Closed fd 24
Volume group name expected (no slash) Run `lvcreate --help' for more
information.
The source of the image is "/vm-images-pool-dir/images-vol/template_minimal" and the guest's target back end is an LV of 10G on another storage pool called "virtual-machines".
I am not understanding what the lvcreate command is trying to do; shouldn't it at least use "virtual-machines" as the target VG? The tool I am using is Satellite 6.2. I am thinking it's something silly that I am overlooking. Not sure where :)
Thanks
TG
Based on the paths in that command, it seems you wanted to create a new file-based volume in /vm-images-pool/images-vol/, i.e. your "vm-images-pool-dir" pool. The fact that you are seeing an error from "lvcreate", though, suggests that you mistakenly specified "vm-images-pool" to libvirt as the pool to use, causing it to try to create a logical volume instead. You don't show the actual command / API you are running, but check that you've given the right pool name to it.
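For illustration only (the volume name and size below are placeholders, not taken from the question), creating a file-based volume in the dir pool would look something like:
virsh vol-create-as vm-images-pool-dir test3.xxx.local-disk1 10G --format qcow2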
I know the question was asked long ago, but I just hit the same problem and found the answer. I couldn't find the exact virsh command you are using that leads to this error, but here is the XML file I used with virsh vol-create libvirtVG logical.xml:
<volume>
<name>vol02</name>
<capacity unit='KiB'>2097152</capacity>
<allocation unit='KiB'>0</allocation>
<backingStore>
<path>/dev/libvirtVG/sles15sp1</path>
</backingStore>
</volume>
To get rid of the error I had to set the allocation to the same value as the capacity. You can also see that virt-manager does this automatically for you:
https://github.com/virt-manager/virt-manager/blob/master/virtinst/storage.py#L646
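For reference, the corrected definition would differ only in the <allocation> line (a sketch based on the description above, reusing the same names):
<volume>
<name>vol02</name>
<capacity unit='KiB'>2097152</capacity>
<allocation unit='KiB'>2097152</allocation>
<backingStore>
<path>/dev/libvirtVG/sles15sp1</path>
</backingStore>
</volume>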
The equivalent using the virsh vol-create-as command would be:
virsh vol-create-as libvirtVG vol02 2048MiB --allocation 2048MiB \
--backing-vol /dev/libvirtVG/sles15sp1

How to manage process distribution in mpich

I am running MPI code on host1 (quad core) and host2 (dual core):
mpiexec -hosts host1,host2 -n 6 ./mytask
I want to assign 4 processes to host1 and 2 to host2. I tried --map-by core, but I found that the processes are distributed 3 to each host.
This is the mpiexec help output
mpiexec -h
Usage: ./mpiexec [global opts] [local opts for exec1] [exec1] [exec1 args] : [local opts for exec2] [exec2] [exec2 args] : ...
Global options (passed to all executables):
Global environment options:
-genv {name} {value} environment variable name and value
-genvlist {env1,env2,...} environment variable list to pass
-genvnone do not pass any environment variables
-genvall pass all environment variables not managed
by the launcher (default)
Other global options:
-f {name} file containing the host names
-hosts {host list} comma separated host list
-wdir {dirname} working directory to use
-configfile {name} config file containing MPMD launch options
Local options (passed to individual executables):
Local environment options:
-env {name} {value} environment variable name and value
-envlist {env1,env2,...} environment variable list to pass
-envnone do not pass any environment variables
-envall pass all environment variables (default)
Other local options:
-n/-np {value} number of processes
{exec_name} {args} executable name and arguments
Hydra specific options (treated as global):
Launch options:
-launcher launcher to use (ssh rsh fork slurm ll lsf sge manual persist)
-launcher-exec executable to use to launch processes
-enable-x/-disable-x enable or disable X forwarding
Resource management kernel options:
-rmk resource management kernel to use (user slurm ll lsf sge pbs cobalt)
Processor topology options:
-topolib processor topology library (hwloc)
-bind-to process binding
-map-by process mapping
-membind memory binding policy
Checkpoint/Restart options:
-ckpoint-interval checkpoint interval
-ckpoint-prefix checkpoint file prefix
-ckpoint-num checkpoint number to restart
-ckpointlib checkpointing library (none)
Demux engine options:
-demux demux engine (poll select)
Other Hydra options:
-verbose verbose mode
-info build information
-print-all-exitcodes print exit codes of all processes
-iface network interface to use
-ppn processes per node
-profile turn on internal profiling
-prepend-rank prepend rank to output
-prepend-pattern prepend pattern to output
-outfile-pattern direct stdout to file
-errfile-pattern direct stderr to file
-nameserver name server information (host:port format)
-disable-auto-cleanup don't cleanup processes on error
-disable-hostname-propagation let MPICH auto-detect the hostname
-order-nodes order nodes as ascending/descending cores
-localhost local hostname for the launching node
-usize universe size (SYSTEM, INFINITE, <value>)
Please see the instructions provided at
http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
for further details
There are several options. One is to list each host once per desired process with -H:
pi@RPi:~ $ mpiexec -H rpi,rpi,rpi,rpi5,rpi7,rpi7 -np 6 helloworld.py
Hello World! I am process 3 of 6 on RPi5.
Hello World! I am process 5 of 6 on RPi7.
Hello World! I am process 4 of 6 on RPi7.
Hello World! I am process 0 of 6 on RPi.
Hello World! I am process 2 of 6 on RPi.
Hello World! I am process 1 of 6 on RPi.
Another option is a hostfile, passed with -hostfile filename:
pi@RPi:~ $ cat filename
RPi slots=4 max_slots=4
RPi5 slots=2 max_slots=2
RPi7 slots=4 max_slots=4
Also, use the -nooversubscribe option.
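With MPICH's Hydra launcher, a hostfile with host:count entries passed via -f should give the uneven split described in the question (a sketch; hosts.txt is an arbitrary name, and the host:count form is the one Hydra's documentation describes):
host1:4
host2:2
and then:
mpiexec -f hosts.txt -n 6 ./mytask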

Invalid job array specification in slurm

I am submitting a toy array job in Slurm. My command line is:
$ sbatch -p development -t 0:30:0 -n 1 -a 1-2 j1
where j1 is the script:
#!/bin/bash
echo job id is $SLURM_JOB_ID
echo array job id is $SLURM_ARRAY_JOB_ID
echo task id is $SLURM_ARRAY_TASK_ID
When I submit this, I get an error:
--> Verifying valid submit host (login1)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/03400/myname)...OK
--> Verifying availability of your work dir (/work/03400/myname)...OK
--> Verifying availability of your scratch dir (/scratch/03400/myname)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (development)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (PRJ-1234)...OK
sbatch: error: Batch job submission failed: Invalid job array specification
The same job works fine without the array specification:
$ sbatch -p development -t 0:30:0 -n 1 j1
This post is a bit old, but in case it happens to other people: I had the same issue, but the accepted answer did not cover what the problem was in my case.
This error (sbatch: error: Batch job submission failed: Invalid job array specification) can also be raised when the array size is too large.
From https://slurm.schedmd.com/slurm.conf.html
MaxArraySize
The maximum job array size. The maximum job array task index value will be one less than MaxArraySize to allow for an index value of zero. Configure MaxArraySize to 0 in order to disable job array use. The value may not exceed 4000001. The value of MaxJobCount should be much larger than MaxArraySize. The default value is 1001.
To check the value, the slurm.conf file should be accessible to all Slurm users (still according to the documentation above) and may be found somewhere near /etc/slurm.conf (see https://slurm.schedmd.com/slurm.conf.html#lbAM; in my case I found it at /etc/slurm/slurm.conf).
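If you have shell access to the cluster, the live value can also be queried directly (a sketch; scontrol show config prints the running configuration):
scontrol show config | grep -i MaxArraySize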
The syntax for your array specification is correct. But the printout you paste is not standard Slurm; I guess you are working on Stampede; they have their own sbatch wrapper.
What you could do is use the -vvv option to sbatch to see exactly what Slurm sees:
$ sbatch -vvv -p development -t 0:30:0 -n 1 -a 1-2 j1 |& grep array
This should return
sbatch: array : 1-2
and if it does not, it means the information is somehow lost along the way.
What you can try is removing the array specification from the submission command line and inserting it in the submission script, like this:
$ sbatch -p development -t 0:30:0 -n 1 j1
with j1 being
#!/bin/bash
#SBATCH -a 1-2
echo job id is $SLURM_JOB_ID
echo array job id is $SLURM_ARRAY_JOB_ID
echo task id is $SLURM_ARRAY_TASK_ID
The next step is to contact the system administrators with the information you will get from running the above tests and ask for help.
