I'm writing an MPI program to be run over a local area network. These machines can be ssh'd to by any student at any time.
Although I always test my program at night, the performance has been very inconsistent. My guess is that some nodes were busy when I ran the program.
So my question is: can I write a script to detect non-busy machines and update the machine file? What's an easy way to write it?
Thanks a lot.
SSH into each machine, then read the /proc/loadavg file or determine the "business" in some other way.
I think the easiest way would be installing the check_load[1] script from Nagios to every node you want to check and call it via ssh with some sensible parameters:
# /usr/lib64/nagios/plugins/check_load -w 1,2,3 -c 3,4,5
OK - load average: 0.20, 0.43, 0.50|load1=0.200;1.000;3.000;0; load5=0.430;2.000;4.000;0; load15=0.500;3.000;5.000;0;
# /usr/lib64/nagios/plugins/check_load -w 0.1,2,3 -c 3,4,5
WARNING - load average: 0.18, 0.43, 0.50|load1=0.180;0.100;3.000;0; load5=0.430;2.000;4.000;0; load15=0.500;3.000;5.000;0;
# /usr/lib64/nagios/plugins/check_load -w 0.01,2,3 -c
0.1,4,5
CRITICAL - load average: 0.41, 0.46, 0.51|load1=0.410;0.010;0.100;0; load5=0.460;2.000;4.000;0; load15=0.510;3.000;5.000;0;
CRITICAL would mean "really busy", WARNING could be "is kinda busy" and OK would mean "the machine is idle".
You have to pay attention for the tresholds you have to give as 1/5/15 minute for warning and critical; for instance, a machine with 16 cores having a load of 3 is perfectly ok, while a load of 3 on a single-core machine would mean it's really really busy.
Good luck!
Alex.
[1] http://nagiosplugins.org/man/check_load
Related
Current setup that we do have ~2000 servers (in 1 group)
I would like to know if there is a way to run x.yml on all the group (where all the 2k servers are in ) but with multiple plays (threaded , or something)
ansible-playbook -i prod.ini -l my_group[50%] x.yml
ansible-playbook -i prod.ini -l my_group[other 50%] x.yml
solutions with awx or ansible-tower are not relevant.
using even 500-1000 forks didn't gave any improvement
try to combine forks, and the free strategy.
the default behavior of Ansible is:
Ansible runs each task on all hosts affected by a play before starting the next task on any host, using 5 forks.
So event if your increase the forks number, the tasks on special forks will still wait any host finish to go ahead. The free strategy allows each host to run until the end of the play as fast as it can
- hosts: all
strategy: free
tasks:
# ...
ansible-playbook -i prod.ini -f 500 -l my_group x.yml
As mentioned above, you should preferably increase fork and set the strategy to free. Increasing fork will help you run the playbook on more server and setting the strategy to free would allow you to run a task for servers independently without waiting for others.
Please refer to below doc for more clarifaction.
docs
resolved by using patterns my_group[:1000] and my_group[999:]
forks didnt give any time decrease in my case.
also free strategy did multiplied the time which was pretty weird.
also debugging free strategy summary is free difficult when u have 2k servers and about 50 tasks in playbook .
thanks everyone for sharing
much appreciated
I am trying to download data from the European Nucleotide Archive (ENA) using Aspera CLI however my downloads are getting stalled. I have downloaded several files earlier using the same tool but this is happening since last one month. I usually use the following command:
ascp -QT -P33001 -k 1 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp#fasp.sra.ebi.ac.uk:/vol1/fastq/ERR192/009/ERR1924229/ERR1924229.fastq.gz .
From a post on Beta Science, I learnt that this might be due to not limiting the download speed and hence tried usng the -l argument but was of no help.
ascp -QT -l 300m -P33001 -k 1 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp#fasp.sra.ebi.ac.uk:/vol1/fastq/ERR192/009/ERR1924229/ERR1924229.fastq.gz .
Your command works.
you might be overdriving your local network ?
how much bandwidth do you have ?
here "-l 300m" sets a target rate to 300 Mbps, if you have less than 30, this can cause such problems.
try to reduce the target rate to what you actually have.
(using wired ? Wifi ?)
How do I diagnose the cause of Docker on MacOS, specifically com.docker.hyperkit using 100% of CPU?
Docker stats
Docker stats shows all the running containers have low CPU, memory, net IO and block IO.
iosnoop
iosnoop shows that com.docker.hyperkit performs about 50 writes per second totaling 500KB per second to the file Docker.qcow2. According to What is Docker.qcow2?, Docker.qcow2 is a sparse file that's the persistent storage for all Docker containers.
In my case the file isn't that sparse. The physical size matches the logical size.
dtrace (dtruss)
dtruss sudo dtruss -p $DOCKER_PID shows a large number of psynch_cvsignal and psynch_cvwait calls.
psynch_cvsignal(0x7F9946002408, 0x4EA701004EA70200, 0x4EA70100) = 257 0
psynch_mutexdrop(0x7F9946002318, 0x5554700, 0x5554700) = 0 0
psynch_mutexwait(0x7F9946002318, 0x5554702, 0x5554600) = 89474819 0
psynch_cvsignal(0x10BF7B470, 0x4C8095004C809600, 0x4C809300) = 257 0
psynch_cvwait(0x10BF7B470, 0x4C8095014C809600, 0x4C809300) = 0 0
psynch_cvwait(0x10BF7B470, 0x4C8096014C809700, 0x4C809600) = -1 Err#316
psynch_cvsignal(0x7F9946002408, 0x4EA702004EA70300, 0x4EA70200) = 257 0
psynch_cvwait(0x7F9946002408, 0x4EA702014EA70300, 0x4EA70200) = 0 0
psynch_cvsignal(0x10BF7B470, 0x4C8097004C809800, 0x4C809600) = 257 0
psynch_cvwait(0x10BF7B470, 0x4C8097014C809800, 0x4C809600) = 0 0
psynch_cvwait(0x10BF7B470, 0x4C8098014C809900, 0x4C809800) = -1 Err#316
Update: top on Docker host
From https://stackoverflow.com/a/58293240/30900:
docker run -it --rm --pid host busybox top
The CPU usage on docker embedded host is ~3%. CPU usage on my MacBook was ~100%. So, the docker embedded host isn't causing the CPU usage spike.
Update: running dtrace scripts of most common stack traces
Stack traces from the dtrace scripts in the answer below: https://stackoverflow.com/a/58293035/30900.
These kernel stack traces look innocuous.
AppleIntelLpssGspi`AppleIntelLpssGspi::regRead(unsigned int)+0x1f
AppleIntelLpssGspi`AppleIntelLpssGspi::transferMmioDuplexMulti(void*, void*, unsigned long long, unsigned int)+0x91
AppleIntelLpssSpiController`AppleIntelLpssSpiController::transferDataMmioDuplexMulti(void*, void*, unsigned int, unsigned int)+0xb2
AppleIntelLpssSpiController`AppleIntelLpssSpiController::_transferDataSubr(AppleInfoLpssSpiControllerTransferDataRequest*)+0x5bc
AppleIntelLpssSpiController`AppleIntelLpssSpiController::_transferData(AppleInfoLpssSpiControllerTransferDataRequest*)+0x24f
kernel`IOCommandGate::runAction(int (*)(OSObject*, void*, void*, void*, void*), void*, void*, void*, void*)+0x138
AppleIntelLpssSpiController`AppleIntelLpssSpiDevice::transferData(IOMemoryDescriptor*, void*, unsigned long long, unsigned long long, IOMemoryDescriptor*, void*, unsigned long long, unsigned long long, unsigned int, AppleIntelSPICompletion*)+0x151
AppleHSSPISupport`AppleHSSPIController::transferData(IOMemoryDescriptor*, void*, unsigned long long, unsigned long long, IOMemoryDescriptor*, void*, unsigned long long, unsigned long long, unsigned int, AppleIntelSPICompletion*)+0xcc
AppleHSSPISupport`AppleHSSPIController::doSPITransfer(bool, AppleHSSPITransferRetryReason*)+0x97
AppleHSSPISupport`AppleHSSPIController::InterruptOccurred(IOInterruptEventSource*, int)+0xf8
kernel`IOInterruptEventSource::checkForWork()+0x13c
kernel`IOWorkLoop::runEventSources()+0x1e2
kernel`IOWorkLoop::threadMain()+0x2c
kernel`call_continuation+0x2e
53
kernel`waitq_wakeup64_thread+0xa7
pthread`__psynch_cvsignal+0x495
pthread`_psynch_cvsignal+0x28
kernel`psynch_cvsignal+0x38
kernel`unix_syscall64+0x27d
kernel`hndl_unix_scall64+0x16
60
kernel`hndl_mdep_scall64+0x4
113
kernel`ml_set_interrupts_enabled+0x19
524
kernel`ml_set_interrupts_enabled+0x19
kernel`hndl_mdep_scall64+0x10
5890
kernel`machine_idle+0x2f8
kernel`call_continuation+0x2e
43395
The most common stack traces in user space over 17 seconds clearly implicate com.docker.hyperkit. There 1365 stack traces in 17 seconds in which com.docker.hyperkit created threads which averages to 80 threads per second.
com.docker.hyperkit`0x000000010cbd20db+0x19f9
com.docker.hyperkit`0x000000010cbdb98c+0x157
com.docker.hyperkit`0x000000010cbf6c2d+0x4bd
libsystem_pthread.dylib`_pthread_body+0x7e
libsystem_pthread.dylib`_pthread_start+0x42
libsystem_pthread.dylib`thread_start+0xd
19
Hypervisor`hv_vmx_vcpu_read_vmcs+0x1
com.docker.hyperkit`0x000000010cbd4c4f+0x2a
com.docker.hyperkit`0x000000010cbd20db+0x174a
com.docker.hyperkit`0x000000010cbdb98c+0x157
com.docker.hyperkit`0x000000010cbf6c2d+0x4bd
libsystem_pthread.dylib`_pthread_body+0x7e
libsystem_pthread.dylib`_pthread_start+0x42
libsystem_pthread.dylib`thread_start+0xd
22
Hypervisor`hv_vmx_vcpu_read_vmcs
com.docker.hyperkit`0x000000010cbdb98c+0x157
com.docker.hyperkit`0x000000010cbf6c2d+0x4bd
libsystem_pthread.dylib`_pthread_body+0x7e
libsystem_pthread.dylib`_pthread_start+0x42
libsystem_pthread.dylib`thread_start+0xd
34
com.docker.hyperkit`0x000000010cbd878d+0x36
com.docker.hyperkit`0x000000010cbd20db+0x42f
com.docker.hyperkit`0x000000010cbdb98c+0x157
com.docker.hyperkit`0x000000010cbf6c2d+0x4bd
libsystem_pthread.dylib`_pthread_body+0x7e
libsystem_pthread.dylib`_pthread_start+0x42
libsystem_pthread.dylib`thread_start+0xd
47
Hypervisor`hv_vcpu_run+0xd
com.docker.hyperkit`0x000000010cbd20db+0x6b6
com.docker.hyperkit`0x000000010cbdb98c+0x157
com.docker.hyperkit`0x000000010cbf6c2d+0x4bd
libsystem_pthread.dylib`_pthread_body+0x7e
libsystem_pthread.dylib`_pthread_start+0x42
libsystem_pthread.dylib`thread_start+0xd
135
Related issues
Github - docker/for-mac: com.docker.hyperkit 100% cpu usage is back again #3499
. One comment suggests adding volume caching described here: https://www.docker.com/blog/user-guided-caching-in-docker-for-mac/. I tried this and got a small ~10% reduction in CPU usage.
I have the same problem. My CPU % went back down to normal after I removed all my volumes.
docker system prune --volumes
I also manually removed some named volumes:
docker volume rm NameOfVolumeHere
That doesn't solve the overall issue of not being able to use volumes with Docker for mac. Right now I'm just being careful about the amount of volumes I use and closing Docker desktop when not in use.
My suspicion is that the issue is IO related. With MacOS volumes, this involves osxfs where there is some performance tuning you can perform. Mainly, if you can accept fewer consistency checks, you can set the volume mode to delegated for faster performance. See the docs for more details: https://docs.docker.com/docker-for-mac/osxfs-caching/. However, if your image contains a large number of small files, performance will suffer, especially if you also have lots of image layers.
You can also try the following command to debug any process issues within the embedded VM that docker uses:
docker run -it --rm --pid host busybox top
(To exit, use <ctrl>-c)
To track down if it's IO, you can also try the following:
$ docker run -it --rm --pid host alpine /bin/sh
$ apk add sysstat
$ pidstat -d 5 12
That will run inside the alpine container running in the VM pid namespace, showing any IO happening from any process, whether or not that process is inside of a container. The stats are every 5 seconds for one minute (12 times) and then it will give you an average table per process. You can then <ctrl>-d to destroy the alpine container.
From the comments and edits, these stats may check out. A 4 core MBP has 8 threads, so full CPU utilization should be 800% if MacOS is reporting the same as other Unix based systems. Inside the VM there's over 100% load shown in the top command for the average in the past minute (though less from the 5 and 15 averages) which is roughly what you see for the hyperkit process on the host. The instantaneous usage is over 12% from top, not 3%, since you need to add the system and user percentages. And the IO numbers shown in pidstat align roughly with what you see written to the qcow2 image.
If the docker engine itself is thrashing (e.g. restarting containers, or running lots of healthchecks), then you can debug that by watching the output of:
docker events
EDIT: after a few weeks, my cpu issues have come back - so the below solutions probably aren't worth it
My CPU was always running crazy high, and it wasn't I/O, as determined using docker stats
I did a bunch of stuff, but had it suddenly decrease to reasonable levels and stay that way for over a week now, after doing the following:
Ensure you have the right # of CPU's set - not what you have, but HALF that amount. Mine was more than half, and I feel this was the real problem, in Preferences | Resources
decrease # of file shares if possible - Preferences | Resources, /private, /tmp/, /var/folders
disable use gRPC FUSE for file sharing - Preferences | Resources
Changing the volumes to use a delegated configuration worked for me and resulted in a drastic drop in CPU usage.
see the document: https://docs.docker.com/docker-for-mac/osxfs-caching/#delegated
how set in my docker-compose.yml:
version: "3"
services:
my_service:
image: python3.6
ports:
- "80:10000"
volumes:
- ./code:/www/code:cached
For me this worked, macOS 10.15.5, Docker Desktop 2.3.0
This is a small dTrace script I use to find where the kernel is spending its time (it's from Solaris, and dates back to the early days of Solaris 10):
#!/usr/sbin/dtrace -s
profile:::profile-1001hz
/arg0/
{
#[ stack() ] = count();
}
It simply samples kernel stack traces and counts each one it encounters in the # aggregation.
Run it as root:
... # ./kernelhotspots.d > /tmp/kernel_hot_spots.txt
Let it run for a decent amount of time while you're having CPU issues, then hit CTRL-C to break the script. It will emit all the kernel stack traces it encountered, the most common last. If you need more (or less) stack frames from the default with
#[ stack( 15 ) ] = count();
That will show a stack frame 15 calls deep.
The last few stack traces will be where your kernel is spending most of its time. That may or may not be informative.
This script will do the same for user-space stack traces:
#!/usr/sbin/dtrace -s
profile:::profile-1001hz
/arg1/
{
#[ ustack() ] = count();
}
Run it similarly:
... # ./userspacehotspots.d > /tmp/userspace_hot_spots.txt
ustack() is a bit slower - to emit the actual function names, dTrace has to do a lot more work to get them from the address spaces of the appropriate processes.
Disabling System Integrity Protection might help you get better stack traces.
See DTrace Action Basics for some more details.
Had same issue with docker today in Big Sur (tried pruning images, changing to apple virtualization, nothing helped). However, disabling the docker desktop to startup in preferences and never opening the desktop gui seems to fix it for me. Docker now runs with only 10%cpu usage even after starting a few containers. However, once I open the desktop gui it slowly rises again to +90% cpu and keeps on hogging the cpu even after closing the DockerDesktop process. Docker version 20.10.13, build a224086.
The solution I found was to increase the resources given to Docker. I increased the Memory from 2GB to 8GB, the Swap from 1GB to 2GB, and the disk image size to 160GB. Completely solved the problem for me, and it's an easy one for readers to try.
to disable use gRPC FUSE for file sharing might not good idea. I found the feedback from another issue made by docker community. see bellow:
So we'll look into that. However,
osxfs will not be supported long term.
We can't maintain two solutions.
hier to docker issue thread
There is an open issue here https://github.com/docker/for-mac/issues/6166
It seems there are a few bugs going on
For some people (me including) unchecking the "Open Docker Dashboard at startup" and manually restarting docker do the job.
For other people increasing resources like CPU and Memory works
dear all!
I have a question about sharing memory in cluster. I am a new to cluster, and fail to solve my problem after trying about several weeks, so I look for help here, any suggestion would be grateful!
I want to use soapdenovo, a software that was used to assemble human genome to assemble my data. However, it failed in one step because shortage of memory (the memory is 512G in my machine). So I turned to cluster machine (which have three big nodes, each node have 512 memory too), and started to learn submit job with qsub. Considering that one node couldn't solve my problem, I googled and found that openmpi may help, but when I running openmpi with demo data, it seemed it only run the command several times. Then I found to use openmpi, the software must include library of openmpi, and I didn't know whether soapdenovo is support openmpi, I had asked the question but the author didn't give me answer yet. Suppose soapdenovo support the openmpi, how should I solve my problem. If it didn't support openmpi, can I use memory in different nodes to run the software?
The problem had tortured my so much, thanks for any help. Following is what had I do and some information about the cluster machine:
Install openmpi and submit the job
1) The script of job:
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
export PATH=/tools/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/tools/openmpi/lib:$LD_LIBRARY_PATH
soapPath="/tools/SOAPdenovo2/SOAPdenovo-63mer"
workPath="/NGS"
outputPath="assembly/soap/demo"
/tools/openmpi/bin/mpirun $soapPath all -s $workPath/$outputPath/config_file -K 23 -R -F -p 60 -V -o $workPath/$outputPath/graph_prefix > $workPath/$outputPath/ass.log 2> $workPath/$outputPath/ass.err
2) Submit the job:
qsub -pe orte 60 mpi.qsub
3) The log in ass.err
a) It seemed it run soapdenovo several times according to the log
cat ass.err | grep "Pregraph" | wc -l
60
b) detail information
less ass.err (it seemed it only run soapdenov several times, because when I run it in my machine, it would only output one Pregraph):
Version 2.04: released on July 13th, 2012
Compile Apr 27 2016 15:50:02
********************
Pregraph
********************
Parameters: pregraph -s /NGS/assembly/soap/demo/config_file -K 23 -p 16 -R -o /NGS/assembly/soap/demo/graph_prefix
In /NGS/assembly/soap/demo/config_file, 1 lib(s), maximum read length 35, maximum name length 256.
Version 2.04: released on July 13th, 2012
Compile Apr 27 2016 15:50:02
********************
Pregraph
********************
and so on
c) information of stdin
cat ass.log:
--------------------------------------------------------------------------
WARNING: A process refused to die despite all the efforts!
This process may still be running and/or consuming resources.
Host: smp03
PID: 75035
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 58 with PID 0 on node c0214.local exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Information about cluster:
1) qconf -sql
all.q
smp.q
2) qconf -spl
mpi
mpich
orte
zhongxm
3) qconf -sp zhongxm
pe_name zhongxm
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
4) qconf -sq smp.q
qname smp.q
hostlist #smp.q
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make zhongxm
rerun FALSE
slots 1
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
5) qconf -sq all.q
qname all.q
hostlist #allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make zhongxm
rerun FALSE
slots 16,[c0219.local=32]
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists mobile
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
According to https://hpc.unt.edu/soapdenovo the software doesn't support MPI:
This code is NOT compiled with MPI, and should only be used in parallel on a SINGLE node, via a threaded model.
So, you can't just start the software with mpiexec on cluster to have access to more memory. Cluster machines are connected with non-coherent networks (Ethernet, Infiniband) which are slower than memory bus, and PCs in cluster do not share their memory. Clusters use MPI libraries (OpenMPI or MPICH) to work with network, and all requests between nodes is explicit: program calls MPI_Send in one process and MPI_Recv in other. There are also one-way calls like MPI_Put/MPI_Get to access remote memory (RDMA - remote direct memory access), but this is not the same as local memory.
osgx, thank you for your reply very much and sorry for the delay of this message.
Since I don't major in computer, I think I can't understand some glossary very well, like ELF. So there are some new questions and I list my question as follow, thanks for help advace:
1) When I "ldd SOAPdenovo-63mer", it outputed "not a dynamic executable", did this mean "the code is not complied with MPI" that you mentioned?
2) In short, I can't solve the problem with the cluster, and I have to look for a machine with more than 512G memory?
3) Also, I used another software called ALLPATHS-LG (http://www.broadinstitute.org/software/allpaths-lg/blog/) that was also failed for shortage of memory, and according to FAQ C1 (http://www.broadinstitute.org/software/allpaths-lg/blog/?page_id=336), what "it uses share memory parallelization" mean, did it means it can use memory in cluster, or only memory in a node, and I have to find a machine with enough memory?
C1. Can I run ALLPATHS-LG on a cluster?
You can, but it will only use one machine, not the entire cluster. That machine would need to have enough memory to fit the entire assembly. ALLPATHS-LG does not support distributed computing using MPI, instead it uses Shared Memory Parallelization.
By the way, this is first time I posted here, I think I should use commit to reply, considering so many words, I use "Answer Your Question".
We have bash script (job wrapper) that writes to a file, launches a job, then at job completion it appends to the file information about the job. The wrapper is run on one of several thousand batch nodes, but has only cropped up with several batch machines (I believe RHEL6) accessing one NFS server, and at least one known instance of a different batch job on a different batch node using a different NFS server. In all cases, only one client host is writing to the files in question. Some jobs take hours to run, others take minutes.
In the same time period that this has occurred, there seems to be 10-50 issues out of 100,000+ jobs.
Here is what I believe to effectively be the (distilled) version of the job wrapper:
#!/bin/bash
## cwd is /nfs/path/to/jobwd
## This file is /nfs/path/to/jobwd/job_wrapper
gotEXIT()
{
## end of script, however gotEXIT is called because we trap EXIT
END="EndTime: `date`\nStatus: Ended”
echo -e "${END}" >> job_info
cat job_info | sendmail jobtracker#example.com
}
trap gotEXIT EXIT
function jobSetVar { echo "job.$1: $2" >> job_info; }
export -f jobSetVar
MSG=“${email_metadata}\n${job_metadata}”
echo -e "${MSG}\nStatus: Started" | sendmail jobtracker#example.com
echo -e "${MSG}" > job_info
## At the job’s end, the output from `time` command is the first non-corrupt data in job_info
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
## 10-360 minutes later…
RC=$?
echo -e "ExitCode: ${RC}" >> job_info
So I think there are two possibilities:
echo -e "${MSG}" > job_info
This command throws out corrupt data.
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
This corrupts the existing data, then outputs it’s data correctly.
However, some job, but not all, call jobSetVar, which doesn't end up being corrupt.
So, I dig into time.c (from GNU time 1.7) to see when the file is open. To summarize, time.c is effectively this:
FILE *outfp;
void main (int argc, char** argv) {
const char **command_line;
RESUSE res;
/* internally, getargs opens “job_info”, so outfp = fopen ("job_info", "a”) */
command_line = getargs (argc, argv);
/* run_command doesn't care about outfp */
run_command (command_line, &res);
/* internally, summarize calls fprintf and putc on outfp FILE pointer */
summarize (outfp, output_format, command_line, &res); /
fflush (outfp);
}
So, time has FILE *outfp (job_info handle) open the entire time of the job. It then writes the summary at the end of the job, and then doesn’t actually appear to close the file (not sure if this is necessary with fflush?) I've no clue if bash also has the file handle open concurrently as well.
EDIT:
A corrupted file will typically end consist of the corrupted part, followed with the non-corrupted part, which may look like this:
The corrupted section, which would occur before the non-corrupted section, is typically largely a bunch of 0x0000, with maybe some cyclic garbage mixed in:
Here's an example hexdump:
40000000 00000000 00000000 00000000
00000000 00000000 C8B450AC 772B0000
01000000 00000000 C8B450AC 772B0000
[ 361 x 0x00]
Then, at the 409th byte, it continues with the non-corrupted section:
Elapsed: 879.07
User: 0.71
System: 31.49
ExitCode: 0
EndTime: Fri Dec 6 15:29:27 PST 2013
Status: Ended
Another file looks like this:
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
[96 x 0x00]
[Repeat above 3 times ]
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
Followed by the non corrupted section:
Elapsed: 12621.27
User: 12472.32
System: 40.37
ExitCode: 0
EndTime: Thu Nov 14 08:01:14 PST 2013
Status: Ended
There are other files that have much more random corruption sections, but more than a few were cyclical similar to above.
EDIT 2: The first email, sent from the echo -e statement goes through fine. The last email is never sent due to no email metadata from corruption. So MSG isn't corrupted at that point. It's assumed that job_info probably isn't corrupt at that point either, but we haven't been able to verify that yet. This is a production system which hasn't had major code modifications and I have verified through audit that no jobs have been ran concurrently which would touch this file. The problem seems to be somewhat recent (last 2 months), but it's possible it's happened before and slipped through. This error does prevent reporting which means jobs are considered failed, so they are typically resubmitted, but one user in specific has ~9 hour jobs in which this error is particularly frustrating. I wish I could come up with more info or a way of reproducing this at will, but I was hoping somebody has maybe seen a similar problem, especially recently. I don't manage the NFS servers, but I'll try to talk to the admins to see what updates the NFS servers at the time of these issues (RHEL6 I believe) were running.
Well, the emails corresponding to the corrupt job_info files should tell you what was in MSG (which will probably be business as usual). You may want to check how NFS is being run: there's a remote possibility that you are running NFS over UDP without checksums. That could explain some corrupt data. I also hear that UDP/TCP checksums are not strong enough and the data can still end up corrupt -- maybe you are hitting such a problem (I have seen corrupt packets slipping through a network stack at least once before, and I'm quite sure some checksumming was going on). Presumably the MSG goes out as a single packet and there might be something about it that makes checksum conflicts with the garbage you see more likely. Of course it could also be an NFS bug (client or server), a server-side filesystem bug, busted piece of RAM... possibilities are almost endless here (although I see how the fact that it's always MSG that gets corrupted makes some of those quite unlikely). The problem might be related to seeking (which happens during the append). You could also have a bug somewhere else in the system, causing multiple clients to open the same job_info file, making it a jumble.
You can also try using different file for 'time' output and then merge them together with job_info at the end of script. That may help to isolate problem further.
Shell opens 'job_info' file for writing, outputs MSG and then shall close its file descriptor before launching main job. 'time' program opens same file for append as stream and I suspect the seek over NFS is not done correctly which may cause that garbage. Can't explain why, but normally this shall not happen (and is not happening). Such rare occasions may point to some race condition somewhere, can be caused by out of sequence packet delivery (due to network latency spike) or retransmits which causes duplicate data, or a bug somewhere. At first look I would suspect some bug, but that bug may be triggered by some network behavior, e.g. unusually large delay or spike of packet loss.
File access between different processes are serialized by kernel, but for additional safeguard may be worth adding some artificial delays - sleep timers between outputs for example.
Network is not transparent, especially a large one. There can be WAN optimization devices which are known to cause application issues sometimes. CIFS and NFS are good candidates for optimization over WAN with local caching of filesystem operations. Might be worth looking for recent changes with network admins..
Another thing to try, although can be difficult due to rare occurrences is capture of interesting NFS sessions via tcpdump or wireshark. In really tough cases we do simultaneous capturing on both client and server side and then compare the protocol logic to prove that network is or is not working correctly. That's a whole topic in itself, requires thorough preparation and luck but usually a last resort of desperate troubleshooting :)
It turns out this was actually another issue altogether, apparently to do with an out-of-date page being written to disk.
A bug fix was supplied to the linux-nfs implementation:
http://www.spinics.net/lists/linux-nfs/msg41357.html