I'm trying to get write size distribution by process. I ran:
sudo dtrace -n 'sysinfo:::writech { @dist[execname] = quantize(arg0); }'
and got the following error:
dtrace: invalid probe specifier sysinfo:::writech...
This is on Mac OS X. Please help.
The error message is telling you that Mac OS X doesn't support the sysinfo::: provider. Perhaps you meant to use one of these?
# dtrace -ln sysinfo::writech:
ID PROVIDER MODULE FUNCTION NAME
dtrace: failed to match sysinfo::writech:: No probe matches description
# dtrace -ln sysinfo:::
ID PROVIDER MODULE FUNCTION NAME
dtrace: failed to match sysinfo:::: No probe matches description
# dtrace -ln 'syscall::write*:'
ID PROVIDER MODULE FUNCTION NAME
147 syscall write entry
148 syscall write return
381 syscall writev entry
382 syscall writev return
933 syscall write_nocancel entry
934 syscall write_nocancel return
963 syscall writev_nocancel entry
964 syscall writev_nocancel return
The following script works for me:
# dtrace -n 'syscall::write:entry {@dist[execname] = quantize(arg0)}'
dtrace: description 'syscall::write:entry ' matched 1 probe
^C
activitymonitor
value ------------- Distribution ------------- count
2 | 0
4 |######################################## 4
8 | 0
Activity Monito
value ------------- Distribution ------------- count
2 | 0
4 |######################################## 6
8 | 0
...
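One thing to be aware of: with the syscall provider, arg0 of write(2) is the file descriptor and arg2 is the requested byte count, so if what you are after is the size distribution (as in the original sysinfo:::writech attempt) you probably want arg2. A sketch along those lines, untested on macOS, that also catches the _nocancel variant:
sudo dtrace -n '
    syscall::write:entry,
    syscall::write_nocancel:entry
    {
        /* arg2 is the nbytes argument to write(2) */
        @bytes[execname] = quantize(arg2);
    }'
writev and writev_nocancel are left out here because their total size is spread across an iovec array rather than passed as a single argument.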
01/23/23-10:07:16 :: adnodemgrctl.sh version 120.11.12020000.12
Calling txkChkEBSDependecies.pl to perform dependency checks for ALL MANAGED SERVERS
ERROR DESCRIPTION:
(FATAL ERROR
PROGRAM : /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl(/oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/bin/txkrun.pl)
TIME : Mon Jan 23 10:07:17 2023
FUNCTION: TXK::Process::run [ Level 3 ]
MESSAGES:
Command error: = 2304, = unzip -o /oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK/txkChkEBSDependecies_Mon_Jan_23_10_07_17_2023/formsapp.ear -d /oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK/txkChkEBSDependecies_Mon_Jan_23_10_07_17_2023
STACK TRACE
at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Error.pm line 168
TXK::Error::abort('TXK::Error', 'HASH(0xd432e8)') called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Common.pm line 299
TXK::Common::doError('TXK::Process=HASH(0x19224d0)', 'Command error: = 2304, = unzip -o /oracle/u02/...', undef) called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Common.pm line 314
TXK::Common::setError('TXK::Process=HASH(0x19224d0)', 'Command error: = 2304, = unzip -o /oracle/u02/...') called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Process.pm line 449
TXK::Process::run('TXK::Process=HASH(0x19224d0)', 'HASH(0x1a22af0)') called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl line 1163
TXK::RunScript::unzipFile('/oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK/tx...', '/oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK/tx...') called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl line 815
TXK::RunScript::extractLatestFrmSrvJar() called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl line 1248
TXK::RunScript::checkFormsApp() called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl line 276
eval {...} called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl line 182
require /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txkChkEBSDependecies.pl called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/RunScript.pm line 105
TXK::RunScript::require('TXK::RunScript', '/oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txk...') called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Script.pm line 177
eval {...} called at /oracle/u02/VIS/fs2/EBSapps/appl/au/12.0.0/perl/TXK/Script.pm line 177
TXK::Script::run('TXK::Script=HASH(0x139e460)', '/oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK', '/oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/patch/115/bin/txk...') called at /oracle/u02/VIS/fs2/EBSapps/appl/fnd/12.0.0/bin/txkrun.pl line 174
)
W A R N I N G *
Error while executing the perl script txkChkEBSDependecies.pl
We have determined that one or more checks failed for MANAGED SERVERS.
Please check logs for more details.
Validated the passed arguments for the option ebs-get-nmstatus
Connecting to Node Manager ...
Successfully Connected to Node Manager.
The Node Manager is up.
NodeManager log is located at /oracle/u02/VIS/fs2/FMW_Home/wlserver_10.3/common/nodemanager/nmHome1
01/23/23-10:07:21 :: adnodemgrctl.sh: exiting with status 2
Can anyone help me with this problem?
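For what it's worth, the failing step in the log is the unzip of formsapp.ear: Perl's system() status 2304 corresponds to exit code 9 (2304 >> 8), which for Info-ZIP unzip usually means the archive itself was not found. As a first check, a sketch using the paths copied from the log above (adjust to your environment) is to re-run the command by hand and look at the real unzip output:
cd /oracle/u02/VIS/fs2/inst/apps/VIS_vision/logs/appl/rgf/TXK/txkChkEBSDependecies_Mon_Jan_23_10_07_17_2023
ls -l formsapp.ear                # does the archive actually exist here?
unzip -o formsapp.ear -d .        # same command the script runs, with -d pointing at this directory
echo "unzip exit code: $?"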
I encountered a problem when trying to spawn child processes on an HPC cluster using a slurm script. The parent process is a Python script, and the child process is a C++ program. The source code is below:
parent process:
# parent process : mpitest.py
from mpi4py import MPI
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=x)
child process:
// child process: test_mpi.cpp, in which parallelization is implemented using boost::mpi
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <boost/mpi.hpp>
#include <iostream>
using namespace boost;
int main(int argc, char* argv[]) {
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;
    int commrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &commrank);  // get this process's rank in MPI_COMM_WORLD
    std::cout << commrank << std::endl;
    return 0;
}
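For reference, the child binary is assumed to be built against the same Boost and Open MPI stack that the job loads, with something along these lines (the exact library names are a guess and may differ on your system; Boost.MPI links against boost_mpi and boost_serialization):
module load Boost/1.74.0-gompi-2020b
mpic++ -o test_mpi test_mpi.cpp -lboost_mpi -lboost_serialization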
the slurm script:
#!/bin/bash
#SBATCH --job-name=mpitest
#SBATCH --partition=day
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 6
#SBATCH --mem 5G
#SBATCH -t 01-00:00:00
#SBATCH --output="mpitest.out"
#SBATCH --error="mpitest.error"
#run program
module load Boost/1.74.0-gompi-2020b
#the MPI version is: OpenMPI/4.0.5
mpirun -np y python mpitest.py
In the above code, there are 2 parameters x (maxprocs in MPI.COMM_SELF.Spawn) and y (-np in mpirun). The slurm script only ran normally when x = 1, y = 1. However, when I tried to increase x and y, the following error occurred (x = 1, y = 2):
[c05n04:13182] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c05n04:13182] [[31687,1],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[c05n11:07942] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c05n11:07942] [[31687,2],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=1)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[c05n11:07942] *** An error occurred in MPI_Init
[c05n11:07942] *** reported by process [2076639234,0]
[c05n11:07942] *** on a NULL communicator
[c05n11:07942] *** Unknown error
[c05n11:07942] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c05n11:07942] *** and potentially your MPI job)
Similarly, when x = 2, y = 2, the following error occurred:
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=2)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes
[c18n08:16481] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c18n08:16481] [[54742,1],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[c18n11:01329] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c18n11:01329] [[54742,2],0] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[c18n11:01332] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[c18n11:01332] [[54742,2],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=2)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[c18n11:01332] *** An error occurred in MPI_Init
[c18n11:01332] *** reported by process [3587571714,1]
[c18n11:01332] *** on a NULL communicator
[c18n11:01332] *** Unknown error
[c18n11:01332] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c18n11:01332] *** and potentially your MPI job)
[c18n08:16469] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[c18n08:16469] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[c18n08:16469] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
What I want to do is to use several nodes to run the C++ program in parallel using this python script, i.e. x = 1, y > 1. What should I modify in the python code or the slurm script?
Update: following the advice from Gilles Gouaillardet, I modified the slurm script:
#!/bin/bash
#SBATCH --job-name=mpitest
#SBATCH --partition=day
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 6
#SBATCH --mem 5G
#SBATCH -t 01-00:00:00
#SBATCH --output="mpitest.out"
#SBATCH --error="mpitest.error"
#run program
module load Boost/1.74.0-gompi-2020b
#the MPI version is: OpenMPI/4.0.5
mpirun --mca pml ^ucx --mca btl ^ucx --mca osc ^ucx -np 2 python mpitest.py
It still shows the following error: (here mpirun -np 2 and maxprocs=1 were used)
[c18n08:21555] [[49573,1],0] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
[c18n10:03381] [[49573,3],0] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=1)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
[c18n08:21556] [[49573,1],1] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
[c18n10:03380] [[49573,2],0] ORTE_ERROR_LOG: Not found in file dpm/dpm.c at line 493
Traceback (most recent call last):
File "mpitest.py", line 4, in <module>
sub_comm = MPI.COMM_SELF.Spawn('test_mpi', args=[], maxprocs=1)
File "mpi4py/MPI/Comm.pyx", line 1534, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[c18n10:03380] *** An error occurred in MPI_Init
[c18n10:03380] *** reported by process [3248816130,0]
[c18n10:03380] *** on a NULL communicator
[c18n10:03380] *** Unknown error
[c18n10:03380] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c18n10:03380] *** and potentially your MPI job)
[c18n08:21542] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[c18n08:21542] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[c18n08:21542] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
I have a script like this:
script.sh
#!/bin/bash
clang -v
If I do a dtruss on it then I would expect to see an execve call to clang.
$ sudo dtruss -f -a -e ./script.sh
However, the trace does not contain an execve. Instead there is an error:
...
1703/0x16931: 856 4 0 sigaction(0x15, 0x7FFEE882A3B8, 0x7FFEE882A3F8) = 0 0
1703/0x16931: 858 4 0 sigaction(0x16, 0x7FFEE882A3C8, 0x7FFEE882A408) = 0 0
1703/0x16931: 874 4 0 sigaction(0x2, 0x7FFEE882A3C8, 0x7FFEE882A408) = 0 0
1703/0x16931: 881 4 0 sigaction(0x3, 0x7FFEE882A3C8, 0x7FFEE882A408) = 0 0
1703/0x16931: 883 4 0 sigaction(0x14, 0x7FFEE882A3C8, 0x7FFEE882A408) = 0 0
dtrace: error on enabled probe ID 2149 (ID 280: syscall::execve:return): invalid address (0x7fc2b5502c30) in action #12 at DIF offset 12
1703/0x16932: 2873: 0: 0 fork() = 0 0
1703/0x16932: 2879 138 5 thread_selfid(0x0, 0x0, 0x0) = 92466 0
1703/0x16932: 2958 8 0 issetugid(0x0, 0x0, 0x0) = 0 0
1703/0x16932: 2975 8 1 csrctl(0x0, 0x7FFEEE21DC3C, 0x4) = 0 0
1703/0x16932: 2985 12 6 csops(0x0, 0x0, 0x7FFEEE21E550) = 0 0
1703/0x16932: 3100 13 3 shared_region_check_np(0x7FFEEE21DA98, 0x0, 0x0)
...
What is causing this error?
How can I get the execve command to show so that I can see the program called and its arguments?
This means that the DTrace script that dtruss is using internally is accessing an invalid memory address, which is happening while it's trying to trace the execve call you're curious about. So basically, dtruss (or possibly DTrace itself) appears to have a bug which is preventing you from getting the information you want. Apple hasn't been the best about keeping DTrace and the tools that depend on it working well on macOS, unfortunately :-/.
For Bash / shell scripts in particular, you can make it print every command that it runs by adding set -x at the top of your script (more info in this other answer).
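For example, your script.sh with tracing enabled would look like this:
#!/bin/bash
set -x      # print each command (with its expanded arguments) before running it
clang -v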
If you want, you could also try using DTrace directly instead -- this is a pretty simple one-liner (haven't tried running this myself so apologies if there are typos):
sudo dtrace -n 'proc:::exec-success /ppid == $target/ { trace(curpsinfo->pr_psargs); }' -c './script.sh'
The way this works is:
proc:::exec-success: Trace all exec-success events in the system, which fire in the subprocess when an exec*-family syscall returns successfully.
/ppid == $target/: Filter which means this only fires when the parent process's PID (ppid) matches the PID returned for the process started by the -c option we passed to the dtrace command ($target).
{ trace(curpsinfo->pr_psargs); }: This is the action to take when the event fires and it matches our filter. We simply print (trace) the arguments passed to the process, which is stored in the curpsinfo variable.
(If that fails with a similar-looking error, it's likely that the bug is in macOS's implementation of curpsinfo somewhere.)
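In that case, a fallback sketch that avoids curpsinfo entirely (it only reports the program name, not the full argument list) would be:
sudo dtrace -n 'proc:::exec-success /ppid == $target/ { trace(execname); }' -c './script.sh'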
I am running an application with process ID 423 and basically want to debug this process.
The problem is that, using the command sudo dtruss -a -t open_nocancel -p 423, I don't see print messages being executed, and system signals like sudo kill -30 423 don't seem to show up in the trace either. Am I missing something? How do I achieve this?
Sample trace output below:
PID/THRD RELATIVE ELAPSD CPU SYSCALL(args) = return
423/0xcf5: 109498638 14 9 open_nocancel("/Users/krishna/.rstudio-desktop/sdb/s-3F25A09C/373AE888\0", 0x0, 0x1B6) = 21 0
423/0xcf5: 109509540 20 16 open_nocancel("/Users/krishna/.rstudio-desktop/history_database\0", 0x209, 0x1B6) = 20 0
423/0xcf5: 109510342 56 44 open_nocancel(".\0", 0x0, 0x1) = 20 0
423/0xcf5: 109516113 19 15 open_nocancel("/Users/krishna/.rstudio-desktop/history_database\0", 0x209, 0x1B6) = 20 0
423/0xcf5: 109517099 35 30 open_nocancel(".\0", 0x0, 0x1) = 20 0
423/0xcf5: 109576820 16 11 open_nocancel("/Users/krishna/.rstudio-desktop/sdb/s-3F25A09C/373AE888\0", 0x0, 0x1B6) = 21 0
423/0xcf5: 109673038 16 10 open_nocancel("/Users/krishna/.rstudio-desktop/sdb/s-3F25A09C/373AE888\0", 0x0, 0x1B6) = 21 0
The command sudo dtruss -a -t open_nocancel -p 423 will trace only the open_nocancel system call. Per the OS X man page for dtruss:
NAME
dtruss - process syscall details. Uses DTrace.
SYNOPSIS
dtruss [-acdeflhoLs] [-t syscall] { -p PID | -n name | command }
...
-t syscall
examine this syscall only
If you want to trace other system calls, you need to either change the -t ... argument, or remove it.
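For example:
sudo dtruss -a -t write -p 423     # trace only write(2) calls
sudo dtruss -a -p 423              # omit -t to trace every system call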
I used Homebrew to install GNU parallel on my Mac so I can run some tests remotely on my university's servers. I was quickly running through the tutorials, but when I ran
parallel -S <username>@$SERVER1 echo running on ::: <username>@$SERVER1
I got the message
parallel: Warning: Could not figure out number of cpus on <username@server> (). Using 1.
Possibly related: I never added parallel to my path and got the warning that "parallel" wasn't a recognized command, but parallel ran anyway and still echoed correctly. This particular server has 16 cores; how can I get parallel to recognize them?
GNU Parallel is less tested on OS X as I do not have access to an OS X installation, so you have likely found a bug.
GNU Parallel has since 20120322 used these to find the number of CPUs:
sysctl -n hw.physicalcpu
sysctl -a hw 2>/dev/null | grep [^a-z]physicalcpu[^a-z] | awk '{ print \$2 }'
And the number of cores:
sysctl -n hw.logicalcpu
sysctl -a hw 2>/dev/null | grep [^a-z]logicalcpu[^a-z] | awk '{ print \$2 }'
Can you test what output you get from those?
Which version of GNU Parallel are you using?
As a workaround, you can force GNU Parallel to assume 16 cores:
parallel -S 16/<username>@$SERVER1 echo running on ::: <username>@$SERVER1
Since version 20140422 you have been able to export your path to the remote server:
parallel --env PATH -S 16/<username>@$SERVER1 echo running on ::: <username>@$SERVER1
That way you just need to add the dir where parallel lives on the server to your PATH on the local machine. E.g. if parallel on the remote server is in /home/u/user/bin/parallel:
PATH=$PATH:/home/u/user/bin parallel --env PATH -S <username>@$SERVER1 echo running on ::: <username>@$SERVER1
Information for Ole
My iMac (OS X Mavericks on an Intel Core i7) gives the following, which all looks correct:
sysctl -n hw.physicalcpu
4
sysctl -a hw
hw.ncpu: 8
hw.byteorder: 1234
hw.memsize: 17179869184
hw.activecpu: 8
hw.physicalcpu: 4
hw.physicalcpu_max: 4
hw.logicalcpu: 8
hw.logicalcpu_max: 8
hw.cputype: 7
hw.cpusubtype: 4
hw.cpu64bit_capable: 1
hw.cpufamily: 1418770316
hw.cacheconfig: 8 2 2 8 0 0 0 0 0 0
hw.cachesize: 17179869184 32768 262144 8388608 0 0 0 0 0 0
hw.pagesize: 4096
hw.busfrequency: 100000000
hw.busfrequency_min: 100000000
hw.busfrequency_max: 100000000
hw.cpufrequency: 3400000000
hw.cpufrequency_min: 3400000000
hw.cpufrequency_max: 3400000000
hw.cachelinesize: 64
hw.l1icachesize: 32768
hw.l1dcachesize: 32768
hw.l2cachesize: 262144
hw.l3cachesize: 8388608
hw.tbfrequency: 1000000000
hw.packages: 1
hw.optional.floatingpoint: 1
hw.optional.mmx: 1
hw.optional.sse: 1
hw.optional.sse2: 1
hw.optional.sse3: 1
hw.optional.supplementalsse3: 1
hw.optional.sse4_1: 1
hw.optional.sse4_2: 1
hw.optional.x86_64: 1
hw.optional.aes: 1
hw.optional.avx1_0: 1
hw.optional.rdrand: 0
hw.optional.f16c: 0
hw.optional.enfstrg: 0
hw.optional.fma: 0
hw.optional.avx2_0: 0
hw.optional.bmi1: 0
hw.optional.bmi2: 0
hw.optional.rtm: 0
hw.optional.hle: 0
hw.cputhreadtype: 1
hw.machine = x86_64
hw.model = iMac12,2
hw.ncpu = 8
hw.byteorder = 1234
hw.physmem = 2147483648
hw.usermem = 521064448
hw.pagesize = 4096
hw.epoch = 0
hw.vectorunit = 1
hw.busfrequency = 100000000
hw.cpufrequency = 3400000000
hw.cachelinesize = 64
hw.l1icachesize = 32768
hw.l1dcachesize = 32768
hw.l2settings = 1
hw.l2cachesize = 262144
hw.l3settings = 1
hw.l3cachesize = 8388608
hw.tbfrequency = 1000000000
hw.memsize = 17179869184
hw.availcpu = 8
sysctl -n hw.logicalcpu
8