Diagnosing high CPU usage on Docker for Mac

How do I diagnose the cause of Docker on macOS, specifically com.docker.hyperkit, using 100% of CPU?
Docker stats
docker stats shows that all the running containers have low CPU, memory, net I/O, and block I/O.
iosnoop
iosnoop shows that com.docker.hyperkit performs about 50 writes per second totaling 500KB per second to the file Docker.qcow2. According to What is Docker.qcow2?, Docker.qcow2 is a sparse file that's the persistent storage for all Docker containers.
In my case the file isn't that sparse. The physical size matches the logical size.
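For reference, a hedged sketch of how such an iosnoop sample can be collected (the -p option is assumed from the DTrace toolkit version of iosnoop, and $DOCKER_PID is the hyperkit PID; SIP may need to be relaxed for DTrace-based tools to attach):
# Watch disk I/O issued by the hyperkit process; press Ctrl-C to stop.
sudo iosnoop -p "$DOCKER_PID"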
dtrace (dtruss)
Running sudo dtruss -p $DOCKER_PID shows a large number of psynch_cvsignal and psynch_cvwait calls.
psynch_cvsignal(0x7F9946002408, 0x4EA701004EA70200, 0x4EA70100) = 257 0
psynch_mutexdrop(0x7F9946002318, 0x5554700, 0x5554700) = 0 0
psynch_mutexwait(0x7F9946002318, 0x5554702, 0x5554600) = 89474819 0
psynch_cvsignal(0x10BF7B470, 0x4C8095004C809600, 0x4C809300) = 257 0
psynch_cvwait(0x10BF7B470, 0x4C8095014C809600, 0x4C809300) = 0 0
psynch_cvwait(0x10BF7B470, 0x4C8096014C809700, 0x4C809600) = -1 Err#316
psynch_cvsignal(0x7F9946002408, 0x4EA702004EA70300, 0x4EA70200) = 257 0
psynch_cvwait(0x7F9946002408, 0x4EA702014EA70300, 0x4EA70200) = 0 0
psynch_cvsignal(0x10BF7B470, 0x4C8097004C809800, 0x4C809600) = 257 0
psynch_cvwait(0x10BF7B470, 0x4C8097014C809800, 0x4C809600) = 0 0
psynch_cvwait(0x10BF7B470, 0x4C8098014C809900, 0x4C809800) = -1 Err#316
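A hedged way to quantify which syscalls dominate (this assumes dtruss's default output, where the syscall name appears before the opening parenthesis):
# Sample for a few seconds, then Ctrl-C; counts syscalls by name, most frequent first.
sudo dtruss -p "$DOCKER_PID" 2>&1 | awk -F'(' '{ print $1 }' | sort | uniq -c | sort -rn | head -20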
Update: top on Docker host
From https://stackoverflow.com/a/58293240/30900:
docker run -it --rm --pid host busybox top
CPU usage on the Docker embedded host is ~3%, while CPU usage on my MacBook was ~100%. So the Docker embedded host isn't causing the CPU usage spike.
Update: running DTrace scripts for the most common stack traces
Stack traces from the DTrace scripts in the answer below: https://stackoverflow.com/a/58293035/30900.
These kernel stack traces look innocuous.
AppleIntelLpssGspi`AppleIntelLpssGspi::regRead(unsigned int)+0x1f
AppleIntelLpssGspi`AppleIntelLpssGspi::transferMmioDuplexMulti(void*, void*, unsigned long long, unsigned int)+0x91
AppleIntelLpssSpiController`AppleIntelLpssSpiController::transferDataMmioDuplexMulti(void*, void*, unsigned int, unsigned int)+0xb2
AppleIntelLpssSpiController`AppleIntelLpssSpiController::_transferDataSubr(AppleInfoLpssSpiControllerTransferDataRequest*)+0x5bc
AppleIntelLpssSpiController`AppleIntelLpssSpiController::_transferData(AppleInfoLpssSpiControllerTransferDataRequest*)+0x24f
kernel`IOCommandGate::runAction(int (*)(OSObject*, void*, void*, void*, void*), void*, void*, void*, void*)+0x138
AppleIntelLpssSpiController`AppleIntelLpssSpiDevice::transferData(IOMemoryDescriptor*, void*, unsigned long long, unsigned long long, IOMemoryDescriptor*, void*, unsigned long long, unsigned long long, unsigned int, AppleIntelSPICompletion*)+0x151
AppleHSSPISupport`AppleHSSPIController::transferData(IOMemoryDescriptor*, void*, unsigned long long, unsigned long long, IOMemoryDescriptor*, void*, unsigned long long, unsigned long long, unsigned int, AppleIntelSPICompletion*)+0xcc
AppleHSSPISupport`AppleHSSPIController::doSPITransfer(bool, AppleHSSPITransferRetryReason*)+0x97
AppleHSSPISupport`AppleHSSPIController::InterruptOccurred(IOInterruptEventSource*, int)+0xf8
kernel`IOInterruptEventSource::checkForWork()+0x13c
kernel`IOWorkLoop::runEventSources()+0x1e2
kernel`IOWorkLoop::threadMain()+0x2c
kernel`call_continuation+0x2e
53
kernel`waitq_wakeup64_thread+0xa7
pthread`__psynch_cvsignal+0x495
pthread`_psynch_cvsignal+0x28
kernel`psynch_cvsignal+0x38
kernel`unix_syscall64+0x27d
kernel`hndl_unix_scall64+0x16
60
kernel`hndl_mdep_scall64+0x4
113
kernel`ml_set_interrupts_enabled+0x19
524
kernel`ml_set_interrupts_enabled+0x19
kernel`hndl_mdep_scall64+0x10
5890
kernel`machine_idle+0x2f8
kernel`call_continuation+0x2e
43395
The most common stack traces in user space over 17 seconds clearly implicate com.docker.hyperkit. There were 1365 stack traces in 17 seconds in which com.docker.hyperkit created threads, which averages to about 80 threads per second.
com.docker.hyperkit`0x000000010cbd20db+0x19f9
com.docker.hyperkit`0x000000010cbdb98c+0x157
com.docker.hyperkit`0x000000010cbf6c2d+0x4bd
libsystem_pthread.dylib`_pthread_body+0x7e
libsystem_pthread.dylib`_pthread_start+0x42
libsystem_pthread.dylib`thread_start+0xd
19
Hypervisor`hv_vmx_vcpu_read_vmcs+0x1
com.docker.hyperkit`0x000000010cbd4c4f+0x2a
com.docker.hyperkit`0x000000010cbd20db+0x174a
com.docker.hyperkit`0x000000010cbdb98c+0x157
com.docker.hyperkit`0x000000010cbf6c2d+0x4bd
libsystem_pthread.dylib`_pthread_body+0x7e
libsystem_pthread.dylib`_pthread_start+0x42
libsystem_pthread.dylib`thread_start+0xd
22
Hypervisor`hv_vmx_vcpu_read_vmcs
com.docker.hyperkit`0x000000010cbdb98c+0x157
com.docker.hyperkit`0x000000010cbf6c2d+0x4bd
libsystem_pthread.dylib`_pthread_body+0x7e
libsystem_pthread.dylib`_pthread_start+0x42
libsystem_pthread.dylib`thread_start+0xd
34
com.docker.hyperkit`0x000000010cbd878d+0x36
com.docker.hyperkit`0x000000010cbd20db+0x42f
com.docker.hyperkit`0x000000010cbdb98c+0x157
com.docker.hyperkit`0x000000010cbf6c2d+0x4bd
libsystem_pthread.dylib`_pthread_body+0x7e
libsystem_pthread.dylib`_pthread_start+0x42
libsystem_pthread.dylib`thread_start+0xd
47
Hypervisor`hv_vcpu_run+0xd
com.docker.hyperkit`0x000000010cbd20db+0x6b6
com.docker.hyperkit`0x000000010cbdb98c+0x157
com.docker.hyperkit`0x000000010cbf6c2d+0x4bd
libsystem_pthread.dylib`_pthread_body+0x7e
libsystem_pthread.dylib`_pthread_start+0x42
libsystem_pthread.dylib`thread_start+0xd
135
Related issues
GitHub - docker/for-mac: com.docker.hyperkit 100% cpu usage is back again #3499. One comment suggests adding the volume caching described here: https://www.docker.com/blog/user-guided-caching-in-docker-for-mac/. I tried this and got a small ~10% reduction in CPU usage.

I have the same problem. My CPU % went back down to normal after I removed all my volumes.
docker system prune --volumes
I also manually removed some named volumes:
docker volume rm NameOfVolumeHere
That doesn't solve the overall issue of not being able to use volumes with Docker for Mac. Right now I'm just being careful about the number of volumes I use and closing Docker Desktop when not in use.

My suspicion is that the issue is I/O related. With macOS volumes, this involves osxfs, where there is some performance tuning you can do. Mainly, if you can accept fewer consistency guarantees, you can set the volume mode to delegated for faster performance. See the docs for more details: https://docs.docker.com/docker-for-mac/osxfs-caching/. However, if your image contains a large number of small files, performance will suffer, especially if you also have lots of image layers.
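For example, a minimal sketch of a delegated bind mount (the host path and image here are placeholders, not taken from the question):
# "delegated" lets the container's view of the mount lag behind the host, trading consistency for write performance.
docker run -it --rm -v "$(pwd)/code:/app/code:delegated" alpine ls /app/code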
You can also try the following command to debug any process issues within the embedded VM that docker uses:
docker run -it --rm --pid host busybox top
(To exit, use <ctrl>-c)
To track down whether it's I/O, you can also try the following:
$ docker run -it --rm --pid host alpine /bin/sh
$ apk add sysstat
$ pidstat -d 5 12
That will run inside the alpine container in the VM pid namespace, showing any I/O happening from any process, whether or not that process is inside a container. The stats are collected every 5 seconds for one minute (12 times), and then it will give you an average table per process. You can then press <ctrl>-d to destroy the alpine container.
From the comments and edits, these stats may check out. A 4-core MBP has 8 threads, so full CPU utilization would be 800% if macOS reports it the same way as other Unix-based systems. Inside the VM, the top command shows over 100% load for the one-minute average (though less for the 5 and 15 minute averages), which is roughly what you see for the hyperkit process on the host. The instantaneous usage is over 12% from top, not 3%, since you need to add the system and user percentages. And the I/O numbers shown in pidstat align roughly with what you see written to the qcow2 image.
If the docker engine itself is thrashing (e.g. restarting containers, or running lots of healthchecks), then you can debug that by watching the output of:
docker events
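For example, a hedged sketch that narrows the event stream to container lifecycle events so restart loops or health-check churn stand out (repeated --filter event=... values are ORed together):
docker events --filter 'type=container' --filter 'event=start' --filter 'event=die' --filter 'event=restart'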

EDIT: after a few weeks, my CPU issues have come back, so the solutions below probably aren't worth it.
My CPU was always running crazy high, and it wasn't I/O, as determined using docker stats.
I did a bunch of stuff, but the usage suddenly decreased to reasonable levels and has stayed that way for over a week now, after doing the following:
Ensure you have the right number of CPUs set - not what you have, but HALF that amount. Mine was more than half, and I feel this was the real problem - Preferences | Resources (see the check after this list).
Decrease the number of file shares if possible - Preferences | Resources, /private, /tmp/, /var/folders
Disable "Use gRPC FUSE for file sharing" - Preferences | Resources
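A quick hedged check that the CPU setting took effect, by asking the Docker VM how many CPUs it actually sees:
# /proc/cpuinfo inside any container reflects the VM's CPU allocation.
docker run --rm busybox grep -c ^processor /proc/cpuinfo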

Changing the volumes to use a delegated configuration worked for me and resulted in a drastic drop in CPU usage.
See the documentation: https://docs.docker.com/docker-for-mac/osxfs-caching/#delegated
Here is how it's set in my docker-compose.yml:
version: "3"
services:
my_service:
image: python3.6
ports:
- "80:10000"
volumes:
- ./code:/www/code:cached
For me this worked on macOS 10.15.5 with Docker Desktop 2.3.0.

This is a small DTrace script I use to find where the kernel is spending its time (it comes from Solaris and dates back to the early days of Solaris 10):
#!/usr/sbin/dtrace -s
profile:::profile-1001hz
/arg0/
{
@[ stack() ] = count();
}
It simply samples kernel stack traces and counts each one it encounters in the @ aggregation.
Run it as root:
... # ./kernelhotspots.d > /tmp/kernel_hot_spots.txt
Let it run for a decent amount of time while you're having CPU issues, then hit CTRL-C to break the script. It will emit all the kernel stack traces it encountered, the most common last. If you need more (or fewer) stack frames than the default, use:
@[ stack( 15 ) ] = count();
That will show stack frames up to 15 calls deep.
The last few stack traces will be where your kernel is spending most of its time. That may or may not be informative.
This script will do the same for user-space stack traces:
#!/usr/sbin/dtrace -s
profile:::profile-1001hz
/arg1/
{
@[ ustack() ] = count();
}
Run it similarly:
... # ./userspacehotspots.d > /tmp/userspace_hot_spots.txt
ustack() is a bit slower - to emit the actual function names, DTrace has to do a lot more work to get them from the address spaces of the appropriate processes.
Disabling System Integrity Protection might help you get better stack traces.
See DTrace Action Basics for some more details.

Had the same issue with Docker today on Big Sur (tried pruning images and changing to Apple virtualization; nothing helped). However, disabling Docker Desktop from starting at login in preferences and never opening the desktop GUI seems to fix it for me. Docker now runs with only 10% CPU usage even after starting a few containers. However, once I open the desktop GUI it slowly rises again to 90%+ CPU and keeps hogging the CPU even after closing the Docker Desktop process. Docker version 20.10.13, build a224086.

The solution I found was to increase the resources given to Docker. I increased the Memory from 2GB to 8GB, the Swap from 1GB to 2GB, and the disk image size to 160GB. This completely solved the problem for me, and it's an easy one for readers to try.

Disabling "Use gRPC FUSE for file sharing" might not be a good idea. I found this feedback in another issue raised by the Docker community; see below:
So we'll look into that. However,
osxfs will not be supported long term.
We can't maintain two solutions.
Link to the Docker issue thread.

There is an open issue here: https://github.com/docker/for-mac/issues/6166
It seems there are a few bugs going on.
For some people (myself included), unchecking "Open Docker Dashboard at startup" and manually restarting Docker does the job.
For other people, increasing resources like CPU and memory works.

Related

How to share memory in cluster machine (qsub openmpi)

Dear all,
I have a question about sharing memory on a cluster. I am new to clusters and have failed to solve my problem after trying for several weeks, so I am looking for help here; any suggestion would be appreciated!
I want to use SOAPdenovo, software that was used to assemble the human genome, to assemble my data. However, it failed at one step because of a shortage of memory (my machine has 512 GB of memory). So I turned to a cluster machine (which has three big nodes, each with 512 GB of memory as well) and started to learn to submit jobs with qsub. Considering that one node couldn't solve my problem, I googled and found that OpenMPI might help, but when I ran OpenMPI with the demo data, it seemed to just run the command several times. Then I found that to use OpenMPI, the software must be built against the OpenMPI library, and I don't know whether SOAPdenovo supports OpenMPI; I have asked the author but haven't received an answer yet. Supposing SOAPdenovo supports OpenMPI, how should I solve my problem? If it doesn't support OpenMPI, can I use memory from different nodes to run the software?
This problem has tormented me for weeks; thanks for any help. Below is what I have done and some information about the cluster machine:
Install openmpi and submit the job
1) The job script:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
export PATH=/tools/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/tools/openmpi/lib:$LD_LIBRARY_PATH
soapPath="/tools/SOAPdenovo2/SOAPdenovo-63mer"
workPath="/NGS"
outputPath="assembly/soap/demo"
/tools/openmpi/bin/mpirun $soapPath all -s $workPath/$outputPath/config_file -K 23 -R -F -p 60 -V -o $workPath/$outputPath/graph_prefix > $workPath/$outputPath/ass.log 2> $workPath/$outputPath/ass.err
2) Submit the job:
qsub -pe orte 60 mpi.qsub
3) The log in ass.err
a) It seems SOAPdenovo ran several times, according to the log:
cat ass.err | grep "Pregraph" | wc -l
60
b) Detailed information
less ass.err (it seems it just ran SOAPdenovo several times, because when I run it on my own machine it outputs only one Pregraph):
Version 2.04: released on July 13th, 2012
Compile Apr 27 2016 15:50:02
********************
Pregraph
********************
Parameters: pregraph -s /NGS/assembly/soap/demo/config_file -K 23 -p 16 -R -o /NGS/assembly/soap/demo/graph_prefix
In /NGS/assembly/soap/demo/config_file, 1 lib(s), maximum read length 35, maximum name length 256.
Version 2.04: released on July 13th, 2012
Compile Apr 27 2016 15:50:02
********************
Pregraph
********************
and so on
c) Information from stdout
cat ass.log:
--------------------------------------------------------------------------
WARNING: A process refused to die despite all the efforts!
This process may still be running and/or consuming resources.
Host: smp03
PID: 75035
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 58 with PID 0 on node c0214.local exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Information about cluster:
1) qconf -sql
all.q
smp.q
2) qconf -spl
mpi
mpich
orte
zhongxm
3) qconf -sp zhongxm
pe_name zhongxm
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
4) qconf -sq smp.q
qname smp.q
hostlist @smp.q
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make zhongxm
rerun FALSE
slots 1
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
5) qconf -sq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make zhongxm
rerun FALSE
slots 16,[c0219.local=32]
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists mobile
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
According to https://hpc.unt.edu/soapdenovo the software doesn't support MPI:
This code is NOT compiled with MPI, and should only be used in parallel on a SINGLE node, via a threaded model.
So you can't just start the software with mpiexec on the cluster to get access to more memory. Cluster machines are connected by non-coherent networks (Ethernet, InfiniBand), which are slower than the memory bus, and the PCs in a cluster do not share their memory. Clusters use MPI libraries (OpenMPI or MPICH) to work with the network, and all requests between nodes are explicit: the program calls MPI_Send in one process and MPI_Recv in another. There are also one-sided calls like MPI_Put/MPI_Get to access remote memory (RDMA - remote direct memory access), but this is not the same as local memory.
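As a hedged check of how the binary was built (the path is the one from the job script above), output consistent with a static, non-MPI, threads-only build would look like this:
# "statically linked" from file explains ldd reporting "not a dynamic executable";
# a dynamically linked MPI build would list libmpi in the ldd output.
file /tools/SOAPdenovo2/SOAPdenovo-63mer
ldd /tools/SOAPdenovo2/SOAPdenovo-63mer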
osgx, thank you very much for your reply, and sorry for the delay in this message.
Since my background is not in computer science, I don't understand some of the terminology very well, like ELF. So I have some new questions, listed below; thanks in advance for your help:
1) When I run "ldd SOAPdenovo-63mer", it outputs "not a dynamic executable". Does this mean "the code is not compiled with MPI", as you mentioned?
2) In short, I can't solve the problem with the cluster, and I have to look for a machine with more than 512 GB of memory?
3) Also, I used another piece of software called ALLPATHS-LG (http://www.broadinstitute.org/software/allpaths-lg/blog/), which also failed due to a shortage of memory. According to FAQ C1 (http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=336), what does "it uses shared memory parallelization" mean? Does it mean it can use the memory of the whole cluster, or only the memory of one node, so that I have to find a machine with enough memory?
C1. Can I run ALLPATHS-LG on a cluster?
You can, but it will only use one machine, not the entire cluster. That machine would need to have enough memory to fit the entire assembly. ALLPATHS-LG does not support distributed computing using MPI, instead it uses Shared Memory Parallelization.
By the way, this is the first time I have posted here. I think I should have used a comment to reply, but considering how many words this is, I used "Answer Your Question".

Oracle Database CPU Usage on AIX

I want to find the CPU process usage for all Oracle processes on an AIX box.
On Solaris I can do the following:
prstat -n 400 -c -s cpu -p 9013 1 1
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
9013 oracle 3463M 2928M sleep 53 0 0:00:35 0.9% oracle/2
Total: 1 processes, 2 lwps, load averages: 2.25, 2.32, 2.40
This basically reports the CPU usage for a given process ID (in this case 9013). Given a list of all Oracle PIDs, I can use this command to get the CPU usage for each one, sum them up, and hey presto, I have my Oracle database CPU usage.
How can I get the same with AIX?
Thanks
You can try nmon or topas, which will show the current %CPU. You might also want to look into using WLM to create a class for all the Oracle processes, then use wlmstat to see the CPU usage for that class. That would save you the trouble of adding them up manually.
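If you do want to add them up manually, here is a hedged sketch; it assumes AIX's ps accepts the -o pcpu,user format and that the database processes run as the oracle user:
# Sum %CPU across all processes owned by the oracle user.
ps -eo pcpu,user | awk '$2 == "oracle" { total += $1 } END { printf "Oracle total %%CPU: %.1f\n", total }'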

Random corruption in file created/updated from shell script on a singular client to NFS mount

We have a bash script (job wrapper) that writes to a file, launches a job, then at job completion appends information about the job to the file. The wrapper runs on one of several thousand batch nodes, but the problem has only cropped up with several batch machines (I believe RHEL6) accessing one NFS server, and at least one known instance of a different batch job on a different batch node using a different NFS server. In all cases, only one client host is writing to the files in question. Some jobs take hours to run, others take minutes.
In the same time period that this has occurred, there seems to be 10-50 issues out of 100,000+ jobs.
Here is what I believe to effectively be the (distilled) version of the job wrapper:
#!/bin/bash
## cwd is /nfs/path/to/jobwd
## This file is /nfs/path/to/jobwd/job_wrapper
gotEXIT()
{
## end of script, however gotEXIT is called because we trap EXIT
END="EndTime: `date`\nStatus: Ended"
echo -e "${END}" >> job_info
cat job_info | sendmail jobtracker@example.com
}
trap gotEXIT EXIT
function jobSetVar { echo "job.$1: $2" >> job_info; }
export -f jobSetVar
MSG="${email_metadata}\n${job_metadata}"
echo -e "${MSG}\nStatus: Started" | sendmail jobtracker@example.com
echo -e "${MSG}" > job_info
## At the job's end, the output from the `time` command is the first non-corrupt data in job_info
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
## 10-360 minutes later…
RC=$?
echo -e "ExitCode: ${RC}" >> job_info
So I think there are two possibilities:
echo -e "${MSG}" > job_info
This command writes out the corrupt data.
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
This corrupts the existing data, then outputs its data correctly.
However, some job, but not all, call jobSetVar, which doesn't end up being corrupt.
So I dug into time.c (from GNU time 1.7) to see when the file is opened. To summarize, time.c is effectively this:
FILE *outfp;
void main (int argc, char** argv) {
const char **command_line;
RESUSE res;
/* internally, getargs opens "job_info", so outfp = fopen ("job_info", "a") */
command_line = getargs (argc, argv);
/* run_command doesn't care about outfp */
run_command (command_line, &res);
/* internally, summarize calls fprintf and putc on outfp FILE pointer */
summarize (outfp, output_format, command_line, &res);
fflush (outfp);
}
So, time has FILE *outfp (the job_info handle) open for the entire duration of the job. It then writes the summary at the end of the job and doesn't actually appear to close the file (not sure if that matters given the fflush). I've no clue whether bash also has the file handle open concurrently.
EDIT:
A corrupted file typically consists of the corrupted part followed by the non-corrupted part.
The corrupted section, which occurs before the non-corrupted section, is typically mostly 0x0000 bytes, with maybe some cyclic garbage mixed in:
Here's an example hexdump:
40000000 00000000 00000000 00000000
00000000 00000000 C8B450AC 772B0000
01000000 00000000 C8B450AC 772B0000
[ 361 x 0x00]
Then, at the 409th byte, it continues with the non-corrupted section:
Elapsed: 879.07
User: 0.71
System: 31.49
ExitCode: 0
EndTime: Fri Dec 6 15:29:27 PST 2013
Status: Ended
Another file looks like this:
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
[96 x 0x00]
[Repeat above 3 times ]
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
Followed by the non corrupted section:
Elapsed: 12621.27
User: 12472.32
System: 40.37
ExitCode: 0
EndTime: Thu Nov 14 08:01:14 PST 2013
Status: Ended
There are other files that have much more random corruption sections, but more than a few were cyclical similar to above.
EDIT 2: The first email, sent from the echo -e statement, goes through fine. The last email is never sent because the email metadata is lost to the corruption. So MSG isn't corrupted at that point. It's assumed that job_info probably isn't corrupt at that point either, but we haven't been able to verify that yet. This is a production system which hasn't had major code modifications, and I have verified through an audit that no jobs have been run concurrently which would touch this file. The problem seems to be somewhat recent (the last 2 months), but it's possible it happened before and slipped through. This error does prevent reporting, which means jobs are considered failed, so they are typically resubmitted, but one user in particular has ~9 hour jobs for which this error is particularly frustrating. I wish I could come up with more info or a way of reproducing this at will, but I was hoping somebody has maybe seen a similar problem, especially recently. I don't manage the NFS servers, but I'll try to talk to the admins to see what updates the NFS servers (RHEL6, I believe) were running at the time of these issues.
Well, the emails corresponding to the corrupt job_info files should tell you what was in MSG (which will probably be business as usual). You may want to check how NFS is being run: there's a remote possibility that you are running NFS over UDP without checksums. That could explain some corrupt data. I also hear that UDP/TCP checksums are not strong enough and the data can still end up corrupt -- maybe you are hitting such a problem (I have seen corrupt packets slipping through a network stack at least once before, and I'm quite sure some checksumming was going on). Presumably the MSG goes out as a single packet and there might be something about it that makes checksum conflicts with the garbage you see more likely. Of course it could also be an NFS bug (client or server), a server-side filesystem bug, busted piece of RAM... possibilities are almost endless here (although I see how the fact that it's always MSG that gets corrupted makes some of those quite unlikely). The problem might be related to seeking (which happens during the append). You could also have a bug somewhere else in the system, causing multiple clients to open the same job_info file, making it a jumble.
You can also try using a different file for the 'time' output and then merging it into job_info at the end of the script. That may help to isolate the problem further (see the sketch below).
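A minimal sketch of that suggestion, reusing the names from the wrapper above (job_time_info is a hypothetical file name):
## Let `time` write to its own file, then merge it into job_info afterwards.
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -o job_time_info job_command
RC=$?
cat job_time_info >> job_info
echo -e "ExitCode: ${RC}" >> job_info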
The shell opens the 'job_info' file for writing, outputs MSG, and should close its file descriptor before launching the main job. The 'time' program opens the same file for append as a stream, and I suspect the seek over NFS is not done correctly, which may cause that garbage. I can't explain why, but normally this should not happen (and does not happen). Such rare occurrences may point to a race condition somewhere; it could be caused by out-of-sequence packet delivery (due to a network latency spike) or retransmits that cause duplicate data, or a bug somewhere. At first look I would suspect a bug, but that bug may be triggered by some network behavior, e.g. an unusually large delay or a spike of packet loss.
File access between different processes is serialized by the kernel, but as an additional safeguard it may be worth adding some artificial delays - sleep timers between outputs, for example.
The network is not transparent, especially a large one. There can be WAN optimization devices, which are known to cause application issues sometimes. CIFS and NFS are good candidates for optimization over a WAN with local caching of filesystem operations. It might be worth asking the network admins about recent changes.
Another thing to try, although it can be difficult because the occurrences are rare, is capturing the interesting NFS sessions via tcpdump or wireshark. In really tough cases we capture simultaneously on both the client and server side and then compare the protocol logic to prove that the network is or is not working correctly. That's a whole topic in itself; it requires thorough preparation and luck, but it's usually a last resort of desperate troubleshooting :)
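A hedged example of a client-side capture; the interface name and NFS server hostname are placeholders, and 2049 is the standard NFS port:
# Capture full packets to a file for later analysis in wireshark.
sudo tcpdump -i eth0 -s 0 -w /tmp/nfs_job_info.pcap host nfs-server.example.com and port 2049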
It turns out this was actually another issue altogether, apparently to do with an out-of-date page being written to disk.
A bug fix was supplied to the linux-nfs implementation:
http://www.spinics.net/lists/linux-nfs/msg41357.html

How to detect non-busy machines over a LAN automatically?

I'm writing an MPI program to be run over a local area network. These machines can be ssh'd to by any student at any time.
Although I always test my program at night, the performance has been very inconsistent. My guess is that some nodes were busy when I ran the program.
So my question is: can I write a script to detect non-busy machines and update the machine file? What's an easy way to write it?
Thanks a lot.
SSH into each machine, then read the /proc/loadavg file or determine the "busyness" in some other way.
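A minimal sketch of that approach, assuming passwordless ssh; the hosts.txt input, the machinefile output, and the 0.5 load threshold are placeholders:
#!/bin/bash
# Build an MPI machinefile from hosts whose 1-minute load average is below a threshold.
THRESHOLD=0.5
> machinefile
while read -r host; do
  load=$(ssh -o ConnectTimeout=5 "$host" "cut -d' ' -f1 /proc/loadavg" 2>/dev/null) || continue
  # awk handles the floating-point comparison that bash can't do natively
  if awk -v l="$load" -v t="$THRESHOLD" 'BEGIN { exit !(l < t) }'; then
    echo "$host" >> machinefile
  fi
done < hosts.txt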
I think the easiest way would be installing the check_load[1] script from Nagios on every node you want to check and calling it via ssh with some sensible parameters:
# /usr/lib64/nagios/plugins/check_load -w 1,2,3 -c 3,4,5
OK - load average: 0.20, 0.43, 0.50|load1=0.200;1.000;3.000;0; load5=0.430;2.000;4.000;0; load15=0.500;3.000;5.000;0;
# /usr/lib64/nagios/plugins/check_load -w 0.1,2,3 -c 3,4,5
WARNING - load average: 0.18, 0.43, 0.50|load1=0.180;0.100;3.000;0; load5=0.430;2.000;4.000;0; load15=0.500;3.000;5.000;0;
# /usr/lib64/nagios/plugins/check_load -w 0.01,2,3 -c 0.1,4,5
CRITICAL - load average: 0.41, 0.46, 0.51|load1=0.410;0.010;0.100;0; load5=0.460;2.000;4.000;0; load15=0.510;3.000;5.000;0;
CRITICAL would mean "really busy", WARNING could be "is kinda busy" and OK would mean "the machine is idle".
You have to pay attention to the thresholds you give as the 1/5/15-minute values for warning and critical; for instance, a machine with 16 cores having a load of 3 is perfectly OK, while a load of 3 on a single-core machine would mean it's really, really busy.
Good luck!
Alex.
[1] http://nagiosplugins.org/man/check_load

Difference between memory_get_peak_usage and actual php process' memory usage

Why does the result of PHP's memory_get_peak_usage differ so much from the memory shown as allocated to the process by the 'top' or 'ps' commands on Linux?
I've set memory_limit to 2 MB in php.ini.
My one-line PHP script containing
echo memory_get_peak_usage(true);
says that it is using 786432 bytes (768 KB).
If I ask the system about the current PHP process with
echo shell_exec('ps -p '.getmypid().' -Fl');
it gives me
F S UID PID PPID C PRI NI ADDR SZ WCHAN RSS PSR STIME TTY TIME CMD
5 S www-data 14599 14593 0 80 0 - 51322 pipe_w 6976 2 18:53 ? 00:00:00 php-fpm: pool www
The RSS value is 6976 pages, so memory usage is 6976 * 4096 = 28573696 bytes = ~28 MB.
Where does that 28 MB come from, and is there any way to decrease the memory used by the php-fpm process?
The memory is mostly used by the PHP process itself. memory_get_peak_usage() returns the memory used by your specific script. Ways to reduce the memory overhead are to reduce the number of extensions, statically compile PHP, etc. But don't forget that php-fpm (should) fork, and a lot of the memory that doesn't differ between PHP processes is in fact shared (until it changes).
PHP itself may only be set to a 2 MB limit, but it's running WITHIN an Apache child process, and that process will have a much higher memory footprint.
If you were running the script from the command line, you'd get memory usage of PHP by itself, as it's not wrapped within Apache and is running on its own.
The peak memory usage is for the current script only.
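As a hedged illustration of the command-line case, you can compare the two numbers for the same CLI process in one go (ps -o rss= reports resident memory in KB on Linux):
php -r 'echo memory_get_peak_usage(true), " bytes peak\n"; echo trim(shell_exec("ps -o rss= -p " . getmypid())), " KB RSS\n";'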
