Julia program stalls when run from crontab scheduler (Linux) - bash

I have a really specific and tricky bug that I can't figure out how to fix/work around and I can't find a similar case on here.
I have a bash script that invokes a Julia script partway through to generate animation frames, then calls ffmpeg to render the animation. When I run it from the terminal everything works great. I wanted to automate the process so I'd get a fun random simulation once a day, so I added it to my crontab. It runs, but only to a certain point: the animation always stops at a specific frame, then the rest of the script continues and spits out the chopped-off animation.
I thought maybe cron was the problem, so I installed jobber and ran the job from there--with jobber the script just stalls at the Julia part. From the resource manager I can see the Julia process still using memory (although well beneath the limit) but it's just gone to sleep.
Another strange thing I have noticed is that when I invoke the script manually from the command line it generates the animation frames ~2-4x faster than when it's running automatically via crontab/jobber.
Is this a weird resource issue? To get the longer animations to render in the first place I had to raise my ulimit settings, but I changed the config file, so they should be set higher for everything. How can I debug this further and/or fix it?
If you want to see the code being run (both the shell script and the Julia script it invokes), it's pretty much up to date on my GitHub here. In the threeBodyProb.jl file, I'm pretty sure the hang-up is with the frame function in the for loop at the end of the file.
I am running Linux Mint 19.1 Cinnamon. Thanks in advance for the help!
Here is the part of the bash script where it hangs up:
./threeBodyProb.jl
echo animation generated, running ffmpeg >> /home/kirk/Documents/3Body/cron_log.txt
cd tmpPlots
</dev/null ffmpeg -framerate 30 -i "%06d.png" -c:v libx264 -preset slow -coder 1 -movflags +faststart -g 15 -crf 18 -pix_fmt yuv420p -profile:v high -y -bf 2 -fs 15M -vf "scale=720:720,setdar=1/1" "/home/kirk/Documents/3Body/3Body_fps30.mp4"
And here is the for loop that hangs up in Julia:
plotLoadPath="/home/kirk/Documents/3Body/tmpPlots/"
threeBodyAnim=Animation(plotLoadPath,String[])
for i=1:35:length(t)
gr(legendfontcolor = plot_color(:white)) #legendfontcolor=:white plot arg broken right now (at least in this backend)
print("$(#sprintf("%.2f",i/length(t)*100)) % complete\r") #output percent tracker
pos=[plotData[1][i],plotData[2][i],plotData[3][i],plotData[4][i],plotData[5][i],plotData[6][i]] #current pos
limx,limy=getLims(pos./1.5e11,10) #convert to AU, 10 AU padding
p=plot(plotData[1][1:i]./1.5e11,plotData[2][1:i]./1.5e11,label="",linecolor=colors[1]) #plot orbits up to i
p=plot!(plotData[3][1:i]./1.5e11,plotData[4][1:i]./1.5e11,label="",linecolor=colors[2])
p=plot!(plotData[5][1:i]./1.5e11,plotData[6][1:i]./1.5e11,label="",linecolor=colors[3])
p=scatter!(starsX,starsY,markercolor=:white,markersize=:1,label="") #fake background stars
star1=makeCircleVals(rad[1],[plotData[1][i],plotData[2][i]]) #generate circles with appropriate sizes for each star
star2=makeCircleVals(rad[2],[plotData[3][i],plotData[4][i]]) #at current positions
star3=makeCircleVals(rad[3],[plotData[5][i],plotData[6][i]])
p=plot!(star1[1]./1.5e11,star1[2]./1.5e11,label="$(@sprintf("%.1f", m[1]./2e30))",color=colors[1],fill=true) #plot star circles with labels
p=plot!(star2[1]./1.5e11,star2[2]./1.5e11,label="$(@sprintf("%.1f", m[2]./2e30))",color=colors[2],fill=true)
p=plot!(star3[1]./1.5e11,star3[2]./1.5e11,label="$(@sprintf("%.1f", m[3]./2e30))",color=colors[3],fill=true)
p=plot!(background_color=:black,background_color_legend=:transparent,foreground_color_legend=:transparent,
background_color_outside=:white,aspect_ratio=:equal,legendtitlefontcolor=:white) #formatting for plot frame
p=plot!(xlabel="x: AU",ylabel="y: AU",title="Random Three Body Problem\nt: $(#sprintf("%0.2f",t[i]/365/24/3600)) yrs after start",
legend=:best,xaxis=("x: AU",(limx[1],limx[2]),font(9,"Courier")),yaxis=("y: AU",(limy[1],limy[2]),font(9,"Courier")),
grid=false,titlefont=font(14,"Courier"),size=(720,721),legendfontsize=8,legendtitle="Mass (in solar masses)",legendtitlefontsize=8) #add in axes/title/legend with formatting
frame(threeBodyAnim,p) #generate the frame
end
If it helps, when run from cron or jobber it always generates 407 frames and fails at the 408th.
UPDATE: Following @TasosPapastylianou's suggestion below, I think the problem may be the differing Julia environment when run directly in the terminal vs. from a background process like crontab or jobber. I've added the output of a test script that dumps the environment, run from crontab/jobber and directly from the command line. I'm not sure if this is the problem, and if it is, how I should tell cron to use this environment for this job (I tried sourcing .bashrc and .profile in the script, but that had no effect on the env output; one version of what I tried is sketched after the dumps below).
From cron:
SHLVL=1
HOME=/home/kirk
LOGNAME=kirk
_=/home/kirk/bashTest.sh
PATH=/opt/someApp/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LANG=en_US.UTF-8
SHELL=/bin/bash
PWD=/home/kirk
OPENBLAS_MAIN_FREE=1
From jobber:
MAIL=/var/mail/kirk
USER=kirk
HOME=/home/kirk
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
LOGNAME=kirk
XDG_SESSION_ID=c4
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
XDG_RUNTIME_DIR=/run/user/1000
LANG=en_US.UTF-8
SHELL=/bin/sh
PWD=/home/kirk
XDG_DATA_DIRS=/home/kirk/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share
OPENBLAS_MAIN_FREE=1
And when run manually from command line:
GJS_DEBUG_TOPICS=JS ERROR;JS LOG
LESSOPEN=| /usr/bin/lesspipe %s
PERLBREW_VERSION=0.86
PGPLOT_DIR=/home/kirk/Documents/research/MESA/mesasdk/lib/pgplot
USER=kirk
LANGUAGE=en_US
XDG_SEAT=seat0
SSH_AGENT_PID=1786
XDG_SESSION_TYPE=x11
SHLVL=1
CONDA_SHLVL=0
HOME=/home/kirk
DESKTOP_SESSION=cinnamon
GTK_MODULES=gail:atk-bridge
XDG_SEAT_PATH=/org/freedesktop/DisplayManager/Seat0
PERLBREW_ROOT=/home/kirk/perl5/perlbrew
PERLBREW_MANPATH=/home/kirk/perl5/perlbrew/perls/perl-5.24.1/man
MESA_DIR=/home/kirk/Documents/research/MESA/mesa-r11701
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
CINNAMON_VERSION=4.0.10
COLORTERM=truecolor
_CE_M=
MANDATORY_PATH=/usr/share/gconf/cinnamon.mandatory.path
QT_QPA_PLATFORMTHEME=qt5ct
HEADAS=/home/kirk/Documents/research/HEASOFT/heasoft-6.26/x86_64-pc-linux-gnu-libc2.27
LOGNAME=kirk
_=./bashTest.sh
DEFAULTS_PATH=/usr/share/gconf/cinnamon.default.path
GIO_EXTRA_MODULES=/usr/lib/x86_64-linux-gnu/gio/modules/
GTK_OVERLAY_SCROLLING=1
XDG_SESSION_ID=c12
TERM=xterm-256color
MESASDK_VERSION=x86_64-linux-20190503
XMM_DIR=/home/kirk/Documents/research/XMM_Newton/xmmsas_20190531_1155
_CE_CONDA=
GNOME_DESKTOP_SESSION_ID=this-is-deprecated
PATH=/home/kirk/Documents/research/MESA/mesasdk/bin:/home/kirk/anaconda3/bin:/home/kirk/anaconda3/condabin:/home/kirk/perl5/perlbrew/bin:/home/kirk/perl5/perlbrew/perls/perl-5.24.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
GDM_LANG=en_US
PERLBREW_HOME=/home/kirk/.perlbrew
SESSION_MANAGER=local/kirk-Inspiron-7352:#/tmp/.ICE-unix/1709,unix/kirk-Inspiron-7352:/tmp/.ICE-unix/1709
GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/bc8ec572_68ae_4d79_88a2_3cb33f74d86c
XDG_RUNTIME_DIR=/run/user/1000
XDG_SESSION_PATH=/org/freedesktop/DisplayManager/Session0
DISPLAY=:0
VALGRIND_LIB=/home/kirk/Documents/research/MESA/mesasdk/lib/valgrind
LANG=en_US.UTF-8
XDG_CURRENT_DESKTOP=X-Cinnamon
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
PERLBREW_PATH=/home/kirk/perl5/perlbrew/bin:/home/kirk/perl5/perlbrew/perls/perl-5.24.1/bin
XDG_SESSION_DESKTOP=cinnamon
GNOME_TERMINAL_SERVICE=:1.62
XAUTHORITY=/home/kirk/.Xauthority
SSH_AUTH_SOCK=/run/user/1000/keyring/ssh
XDG_GREETER_DATA_DIR=/var/lib/lightdm-data/kirk
MESASDK_ROOT=/home/kirk/Documents/research/MESA/mesasdk
CONDA_PYTHON_EXE=/home/kirk/anaconda3/bin/python
SHELL=/bin/bash
QT_ACCESSIBILITY=1
GDMSESSION=cinnamon
LESSCLOSE=/usr/bin/lesspipe %s %s
QT_LOGGING_RULES=qt5ct.debug=false
PERLBREW_PERL=perl-5.24.1
GJS_DEBUG_OUTPUT=stderr
GPG_AGENT_INFO=/run/user/1000/gnupg/S.gpg-agent:0:1
XDG_VTNR=7
PWD=/home/kirk
CONDA_EXE=/home/kirk/anaconda3/bin/conda
XDG_DATA_DIRS=/usr/share/cinnamon:/usr/share/gnome:/home/kirk/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share
XDG_CONFIG_DIRS=/etc/xdg/xdg-cinnamon:/etc/xdg
OMP_NUM_THREADS=2
PERLBREW_SHELLRC_VERSION=0.82
VTE_VERSION=5202
MANPATH=/home/kirk/Documents/research/MESA/mesasdk/share/man:/home/kirk/perl5/perlbrew/perls/perl-5.24.1/man:/usr/local/man:/usr/local/share/man:/usr/share/man
OPENBLAS_MAIN_FREE=1
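For reference, this is the kind of thing I meant by sourcing the profile, shown two ways: inside the wrapper script, and as a crontab entry that runs the wrapper through a login shell. The schedule and script name below are placeholders rather than my real setup, and neither variant changed the env output for me.
# at the top of the wrapper script
source /home/kirk/.profile
source /home/kirk/.bashrc
# or, as a crontab entry that forces a login shell so ~/.profile is read
SHELL=/bin/bash
0 8 * * * bash -lc '/home/kirk/Documents/3Body/run3Body.sh >> /home/kirk/Documents/3Body/cron_log.txt 2>&1'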
UPDATE 2: Again following @TasosPapastylianou's suggestion, after telling the Julia script to log any errors when run from crontab, I get the following stacktrace when it attempts to generate frame 408:
ERROR: LoadError: SystemError: opening file "/tmp/juliaFCI2yw.png": No such file or directory
Stacktrace:
[1] #systemerror#43(::Nothing, ::Function, ::String, ::Bool) at ./error.jl:134
[2] systemerror at ./error.jl:134 [inlined]
[3] #open#309(::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Function, ::String) at ./iostream.jl:289
[4] open at ./iostream.jl:281 [inlined]
[5] #open#310(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::getfield(Base, Symbol("##274#275")){String}, ::String) at ./iostream.jl:373
[6] open at ./iostream.jl:373 [inlined]
[7] read at ./io.jl:297 [inlined]
[8] _show(::IOStream, ::MIME{Symbol("image/png")}, ::Plots.Plot{Plots.GRBackend}) at /home/kirk/.julia/packages/Plots/h3o4c/src/backends/gr.jl:1603
[9] show(::IOStream, ::MIME{Symbol("image/png")}, ::Plots.Plot{Plots.GRBackend}) at /home/kirk/.julia/packages/Plots/h3o4c/src/output.jl:198
[10] png(::Plots.Plot{Plots.GRBackend}, ::String) at /home/kirk/.julia/packages/Plots/h3o4c/src/output.jl:8
[11] frame(::Animation, ::Plots.Plot{Plots.GRBackend}) at /home/kirk/.julia/packages/Plots/h3o4c/src/animation.jl:20
[12] top-level scope at /home/kirk/Documents/3Body/threeBodyProb.jl:265
[13] include at ./boot.jl:326 [inlined]
[14] include_relative(::Module, ::String) at ./loading.jl:1038
[15] include(::Module, ::String) at ./sysimg.jl:29
[16] exec_options(::Base.JLOptions) at ./client.jl:267
[17] _start() at ./client.jl:436
I'm unsure how to diagnose this--does cron limit the number of files a process can create or something like that? In my bash script I have also manually added the following settings (just in case) but that still resulted in the stacktrace above:
ulimit -n 4096
ulimit -t unlimited
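For what it's worth, to confirm whether these settings actually take effect under cron, the wrapper can also log them right before calling Julia (diagnostic only, appended to the existing log file):
ulimit -a >> /home/kirk/Documents/3Body/cron_log.txt
env | sort >> /home/kirk/Documents/3Body/cron_log.txt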

Thanks so much for the help @TasosPapastylianou--that error message eventually led me to this post, which fixed my problem (and also significantly sped up the animation rendering process as a nice byproduct).
Ultimately it appears the problem was not with cron or the bash script, but instead with Julia's GR backend. I added the line
GR.inline("png")
to the top of the for loop generating the plots, to explicitly tell it I was making png files, and apparently that fixes everything--I'm not really sure why this is needed, or why it's only needed when running from crontab/jobber, so if anyone has further insight I'd love to know, but I'm glad it works!
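In context, the change amounts to something like the following. This is a minimal sketch: the plotting calls are abbreviated to the loop already shown above, and it assumes GR is loaded directly (using GR) so that GR.inline is in scope.
using Plots, GR
gr()
for i=1:35:length(t)
GR.inline("png") #explicitly tell the GR backend the frames are png files
#...same plotting calls as in the loop above, building p...
frame(threeBodyAnim,p) #generate the frame
end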
Thanks again to everyone for their help and insights--this tip should be helpful to anyone making animations in a similar way with Julia due to the dramatic improvement in performance that came from this one line!

Related

ffmpeg - Hyperthreading causes "Conversion failed" with multiple parallel instances

I'm trying to extract images from multiple videos in parallel, using ffmpeg.
Here's my bash script:
for video in *.MOV; do
base=`basename "$video" .MOV`
ffmpeg -i "$video" -r 0.02 "$base"/out_%02d.png > logs/"$base" 2>&1 &
done
When running this (on 60 videos), I check the logs/ files and 40 of them have crashed at the beginning with the following error:
Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height
Conversion failed!
However, it works fine with a smaller number of videos (around 5, even on videos that didn't work before).
EDIT: I tried disabling hyperthreading and it works fine now. Why does hyperthreading cause ffmpeg to fail?
The secondary "hyper" thread is probably usually getting stuck in the ready-to-execute state because the executing thread's stream processing doesn't need to pause to wait for more data to come in. Without disabling hyperthreading, adding -threads 1 to your ffmpeg command may help with the parallelized usage.

Julia parallel computing over multiple nodes in cluster

I am running some jobs on a shared cluster and I've been trying to use more than 1 node at a time. While using julia -p #processors works for the cores on one node, it doesn't find the other nodes.
The cluster uses SGE, and I have tried a lot of different ways to make the nodes work, but only one was successful. Is there an easy way built into Julia to launch it with julia -mpi 32 or something similar?
Using
using ClusterManagers
println(nworkers(),nprocs(),Sys.CPU_CORES)
ClusterManagers.addprocs_sge(16)
ClusterManagers.addprocs_sge(15)
println(nworkers(),nprocs(),Sys.CPU_CORES)
doesn't work (I have submitted a job reserving 2 nodes with 16 cores each on the SGE), the output file of the job is empty and instead I get 16 different output files julia-70755.o8252776.* (* = 1...16) with the following text:
julia_worker:9009#192.168.17.206
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Starting Julia with julia --machinefile $PE_HOSTFILE also failed with:
Warning: Permanently added the RSA host key for IP address '192.168.18.10' to the list of known hosts.
ERROR: connect: invalid argument (EINVAL)
in uv_error at ./libuv.jl:68 [inlined]
in connect!(::TCPSocket, ::IPv4, ::UInt16) at ./socket.jl:652
in connect!(::TCPSocket, ::SubString{String}, ::UInt16) at ./socket.jl:688
in connect at ./stream.jl:959 [inlined]
in connect_to_worker(::SubString{String}, ::Int16) at ./managers.jl:483
in connect(::Base.SSHManager, ::Int64, ::WorkerConfig) at ./managers.jl:425
in create_worker(::Base.SSHManager, ::WorkerConfig) at ./multi.jl:1786
in setup_launched_worker(::Base.SSHManager, ::WorkerConfig, ::Array{Int64,1}) at ./multi.jl:1733
in (::Base.##669#673{Base.SSHManager,Array{Int64,1}})() at ./task.jl:360
in sync_end() at ./task.jl:311
in macro expansion at ./task.jl:327 [inlined]
in #addprocs_locked#665(::Array{Any,1}, ::Function, ::Base.SSHManager) at ./multi.jl:1688
in (::Base.#kw##addprocs_locked)(::Array{Any,1}, ::Base.#addprocs_locked, ::Base.SSHManager) at ./<missing>:0
in #addprocs#664(::Array{Any,1}, ::Function, ::Base.SSHManager) at ./multi.jl:1658
in (::Base.#kw##addprocs)(::Array{Any,1}, ::Base.#addprocs, ::Base.SSHManager) at ./<missing>:0
in #addprocs#764(::Bool, ::Cmd, ::Int64, ::Array{Any,1}, ::Function, ::Array{Any,1}) at ./managers.jl:112
in process_options(::Base.JLOptions) at ./client.jl:227
in _start() at ./client.jl:321
UndefRefError()
It was suggested that I use the MPI.jl package, but it doesn't look to me like it really supports the Julia parallel syntax, at least the way I'm using it, which is just writing @sync @parallel before a for loop that I want to run in parallel (i.e. Metropolis-Monte Carlo); the pattern I mean is sketched below.
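For concreteness, this is roughly the usage pattern I'm referring to: a toy Monte Carlo sketch, not my actual code, in Julia 0.5/0.6 syntax where @parallel lives in Base. The reduction form shown here blocks until all workers finish, so @sync is not strictly needed with it.
addprocs(4) #or use the workers supplied by --machinefile
hits = @parallel (+) for i = 1:10^7
Int(rand()^2 + rand()^2 < 1) #stand-in for one Monte Carlo step
end
println(4*hits/10^7) #crude estimate of pi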
The IT team got back to me and told me that the SGE setup does not allow passwordless ssh, which is why addprocs_sge() wouldn't work. However, they have now added a file for the job that I can pass to Julia, and told me to run the job with this script:
qlogin -pe mpi_28_tasks_per_node 56
module load julia/0.5.1
julia --machinefile $TMPDIR/machines
The machines file looks like this:
::::::::::::::
/scratch/8548498.1.u/machines
::::::::::::::
{hostname1}
{hostname1}
...
{hostname2}
{hostname2}
You might want to read the Julia docs on parallel computing, where there is a section on cluster managers. Also, take a look at ClusterManagers.jl, which supports SGE:
julia> using ClusterManagers
julia> ClusterManagers.addprocs_sge(5)
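On a cluster where addprocs_sge is permitted (which, per the update above, was not the case here because passwordless ssh is disallowed), a minimal sanity check might look like this hypothetical session:
julia> using ClusterManagers
julia> ClusterManagers.addprocs_sge(31) #ask SGE for 31 workers
julia> nworkers() #should report however many workers SGE actually granted
julia> remotecall_fetch(gethostname, 2) #confirm worker 2 landed on another node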

Random corruption in file created/updated from shell script on a singular client to NFS mount

We have a bash script (a job wrapper) that writes to a file, launches a job, and then at job completion appends information about the job to the file. The wrapper runs on one of several thousand batch nodes, but the problem has only cropped up on a few batch machines (I believe RHEL6) accessing one NFS server, with at least one known instance of a different batch job on a different batch node using a different NFS server. In all cases, only one client host is writing to the files in question. Some jobs take hours to run, others take minutes.
In the same time period that this has occurred, there seems to be 10-50 issues out of 100,000+ jobs.
Here is what I believe to effectively be the (distilled) version of the job wrapper:
#!/bin/bash
## cwd is /nfs/path/to/jobwd
## This file is /nfs/path/to/jobwd/job_wrapper
gotEXIT()
{
## end of script, however gotEXIT is called because we trap EXIT
END="EndTime: `date`\nStatus: Ended”
echo -e "${END}" >> job_info
cat job_info | sendmail jobtracker@example.com
}
trap gotEXIT EXIT
function jobSetVar { echo "job.$1: $2" >> job_info; }
export -f jobSetVar
MSG="${email_metadata}\n${job_metadata}"
echo -e "${MSG}\nStatus: Started" | sendmail jobtracker@example.com
echo -e "${MSG}" > job_info
## At the job's end, the output from the `time` command is the first non-corrupt data in job_info
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
## 10-360 minutes later…
RC=$?
echo -e "ExitCode: ${RC}" >> job_info
So I think there are two possibilities:
echo -e "${MSG}" > job_info
This command throws out corrupt data.
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
This corrupts the existing data, then outputs its data correctly.
However, some jobs, but not all, call jobSetVar, and that output doesn't end up being corrupt.
So, I dig into time.c (from GNU time 1.7) to see when the file is open. To summarize, time.c is effectively this:
FILE *outfp;
void main (int argc, char** argv) {
const char **command_line;
RESUSE res;
/* internally, getargs opens "job_info", so outfp = fopen ("job_info", "a") */
command_line = getargs (argc, argv);
/* run_command doesn't care about outfp */
run_command (command_line, &res);
/* internally, summarize calls fprintf and putc on outfp FILE pointer */
summarize (outfp, output_format, command_line, &res);
fflush (outfp);
}
So, time keeps FILE *outfp (the job_info handle) open for the entire duration of the job. It then writes the summary at the end of the job, and then doesn't actually appear to close the file (not sure if this is necessary with fflush?). I've no clue whether bash also has the file handle open concurrently.
EDIT:
A corrupted file will typically consist of the corrupted part followed by the non-corrupted part. The corrupted section, which occurs before the non-corrupted section, is typically largely a bunch of 0x0000, with maybe some cyclic garbage mixed in:
Here's an example hexdump:
40000000 00000000 00000000 00000000
00000000 00000000 C8B450AC 772B0000
01000000 00000000 C8B450AC 772B0000
[ 361 x 0x00]
Then, at the 409th byte, it continues with the non-corrupted section:
Elapsed: 879.07
User: 0.71
System: 31.49
ExitCode: 0
EndTime: Fri Dec 6 15:29:27 PST 2013
Status: Ended
Another file looks like this:
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
[96 x 0x00]
[Repeat above 3 times ]
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
Followed by the non-corrupted section:
Elapsed: 12621.27
User: 12472.32
System: 40.37
ExitCode: 0
EndTime: Thu Nov 14 08:01:14 PST 2013
Status: Ended
There are other files with much more random corruption sections, but more than a few were cyclical, similar to the above.
EDIT 2: The first email, sent from the echo -e statement, goes through fine. The last email is never sent, because the email metadata is lost to the corruption. So MSG isn't corrupted at that point. It's assumed that job_info probably isn't corrupt at that point either, but we haven't been able to verify that yet. This is a production system which hasn't had major code modifications, and I have verified through audit that no jobs have been run concurrently which would touch this file. The problem seems to be somewhat recent (the last 2 months), but it's possible it's happened before and slipped through. This error prevents reporting, which means jobs are considered failed, so they are typically resubmitted, but one user in particular has ~9-hour jobs for which this error is particularly frustrating. I wish I could come up with more info or a way of reproducing this at will, but I was hoping somebody has maybe seen a similar problem, especially recently. I don't manage the NFS servers, but I'll try to talk to the admins to see what updates the NFS servers (RHEL6 I believe) were running at the time of these issues.
Well, the emails corresponding to the corrupt job_info files should tell you what was in MSG (which will probably be business as usual). You may want to check how NFS is being run: there's a remote possibility that you are running NFS over UDP without checksums. That could explain some corrupt data. I also hear that UDP/TCP checksums are not strong enough and the data can still end up corrupt -- maybe you are hitting such a problem (I have seen corrupt packets slipping through a network stack at least once before, and I'm quite sure some checksumming was going on). Presumably the MSG goes out as a single packet and there might be something about it that makes checksum conflicts with the garbage you see more likely. Of course it could also be an NFS bug (client or server), a server-side filesystem bug, busted piece of RAM... possibilities are almost endless here (although I see how the fact that it's always MSG that gets corrupted makes some of those quite unlikely). The problem might be related to seeking (which happens during the append). You could also have a bug somewhere else in the system, causing multiple clients to open the same job_info file, making it a jumble.
You can also try using a different file for the 'time' output and then merging it into job_info at the end of the script (see the sketch below). That may help isolate the problem further.
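A minimal sketch of that idea, using a hypothetical time_info scratch file alongside the existing wrapper:
## write the timing summary to a separate scratch file, then merge it at the end
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o time_info job_command
RC=$?
echo -e "ExitCode: ${RC}" >> time_info
cat time_info >> job_info
rm -f time_info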
The shell opens the 'job_info' file for writing, outputs MSG, and should then close its file descriptor before launching the main job. The 'time' program opens the same file for append as a stream, and I suspect the seek over NFS is not done correctly, which may cause that garbage. I can't explain why, but normally this should not happen (and does not happen). Such rare occurrences may point to a race condition somewhere; it could be caused by out-of-sequence packet delivery (due to a network latency spike), by retransmits that cause duplicate data, or by a bug somewhere. At first look I would suspect a bug, but one that may be triggered by some network behavior, e.g. an unusually large delay or a spike of packet loss.
File access between different processes is serialized by the kernel, but as an additional safeguard it may be worth adding some artificial delays - sleep timers between outputs, for example.
The network is not transparent, especially a large one. There can be WAN optimization devices, which are known to cause application issues sometimes. CIFS and NFS are good candidates for optimization over a WAN with local caching of filesystem operations. It might be worth asking the network admins about recent changes.
Another thing to try, although it can be difficult given how rarely this occurs, is to capture the interesting NFS sessions via tcpdump or wireshark. In really tough cases we capture simultaneously on both the client and server side and then compare the protocol logic to prove whether the network is working correctly. That's a whole topic in itself; it requires thorough preparation and luck, but it's usually the last resort of desperate troubleshooting :)
It turns out this was actually another issue altogether, apparently to do with an out-of-date page being written to disk.
A bug fix was supplied to the linux-nfs implementation:
http://www.spinics.net/lists/linux-nfs/msg41357.html

Wrong coloring w/ Powerline in Terminal.app

I set up tmux powerline and installed all the corresponding fonts. The problem I am running into now is colors not appearing the same when acting as a background in the hardline.
I made sure to set tmux to use 256 color mode
tmux.conf: http://hastebin.com/durehunuge.conf
Any ideas on how to get the colors to match?
Sadly the only way that I was able to fix this was by switching to iTerm2.
I assume you have trouble with the "arrow" symbols?
If so then you can easily fix that by using the correct symbol.
In your theme file you have some lines that look like this:
if patched_font_in_use; then
TMUX_POWERLINE_SEPARATOR_LEFT_BOLD="⮂"
TMUX_POWERLINE_SEPARATOR_LEFT_THIN="⮃"
TMUX_POWERLINE_SEPARATOR_RIGHT_BOLD="⮀"
TMUX_POWERLINE_SEPARATOR_RIGHT_THIN="⮁"
else
TMUX_POWERLINE_SEPARATOR_LEFT_BOLD="◀"
TMUX_POWERLINE_SEPARATOR_LEFT_THIN="❮"
TMUX_POWERLINE_SEPARATOR_RIGHT_BOLD="▶"
TMUX_POWERLINE_SEPARATOR_RIGHT_THIN="❯"
fi
Those are used in your segments, eg:
"weather 89 211" \
"date 235 136" \
"time 235 136 ${TMUX_POWERLINE_SEPARATOR_LEFT_THIN}" \
which would render on my machine as:
⮂ ☼ -1°C ⮂ 02.03.2013 ⮃ 10:02
As you can see the time arrow is thin without background.

system(): why do I not have the same permissions when using R in EMACS as I do in the bash terminal?

update: the error only occurs when logged into R from within emacs
what works:
When I ssh into a remote server and run
$ ./foo.rb
from the bash shell, it works. It also works if I launch R and execute
$ R
system('./foo.rb')
I am in a group with permission to read/write/execute the file. File permissions are -rwxrwx---
what doesn't work:
Launch emacs and start an R session:
M-x R
ssh-myserver:.
system('./foo.rb')
I get the following error:
ruby: Permission denied -- foo.rb (LoadError)
Why is this? Is there a way to work around it?
I cannot find any relevant information in ?system or ?system2.
Here is the output from sessionInfo()
> sessionInfo()
R version 2.12.2 (2011-02-25)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] PECAn_0.1.1 xtable_1.5-6 gridExtra_0.7 RMySQL_0.7-5
[5] DBI_0.2-5 ggplot2_0.8.9 proto_0.3-8 reshape_0.8.3
[9] plyr_1.6 rjags_2.2.0-2 coda_0.13-5 lattice_0.19-17
[13] randtoolbox_1.09 rngWELL_0.9 MASS_7.3-11 XML_3.2-0
loaded via a namespace (and not attached):
[1] digest_0.4.2
Warning message:
'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible
output of 'id' and 'env' from ssh and emacs, per comment by @sarnold (changed user names, group names, and IP addresses)
1. server
1.1 'id'
uid=1668(dleb) gid=1668(dleb) groups=117(ebusers),159(lab_admin),166(lab),1340(pal_web),1668(dleb)
1.2 'env'
LC_PAPER=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
SHELL=/usr/local/bin/system-specific
KDE_NO_IPV6=1
SSH_CLIENT=888.888.888.88 51857 22
NCARG_FONTCAPS=/usr/lib64/ncarg/fontcaps
LC_NUMERIC=en_US.UTF-8
USER=dleb
LS_COLORS=
LC_TELEPHONE=en_US.UTF-8
KDEDIR=/usr
NCARG_GRAPHCAPS=/usr/lib64/ncarg/graphcaps
MAIL=/var/mail/dleb
PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin
LC_IDENTIFICATION=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
R_LIBS=/home/a-m/dleb/lib/R
PWD=/home/dleb
NCARG_ROOT=/usr
KDE_IS_PRELINKED=1
LANG=en_US.UTF-8
NCARG_DATABASE=/usr/lib64/ncarg/database
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
LOADEDMODULES=
LC_MEASUREMENT=en_US.UTF-8
NCARG_LIB=/usr/lib64/ncarg
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
NCARG_NCARG=/usr/share/ncarg
SHLVL=1
HOME=/home/a-m/dleb
LOGNAME=dleb
CVS_RSH=ssh
SSH_CONNECTION=888.888.888.88 51857 999.999.999.99 22
LC_CTYPE=en_US.UTF-8
MODULESHOME=/usr/share/Modules
LESSOPEN=|/usr/bin/lesspipe.sh %s
DISPLAY=localhost:15.0
LC_TIME=en_US.UTF-8
G_BROKEN_FILENAMES=1
LC_NAME=en_US.UTF-8
_=/bin/env
2. emacs/ess R session
2.1 system('id')
uid=1668(dleb) gid=1668(dleb) groups=117(ebusers),159(lab_admin),166(lab),1340(pal_web),1668(dleb)
2.2 system('env')
LN_S=ln -s
R_TEXI2DVICMD=/usr/bin/texi2dvi
LC_PAPER=en_US.UTF-8
SED=/bin/sed
LC_ADDRESS=en_US.UTF-8
R_PDFVIEWER=/usr/bin/xdg-open
LC_MONETARY=en_US.UTF-8
HOSTNAME=ebi-forecast
R_INCLUDE_DIR=/usr/include/R
R_PRINTCMD=lpr
SHELL=/usr/local/bin/system-specific
TERM=dumb
AWK=gawk
HISTSIZE=1
R_RD4DVI=ae
SSH_CLIENT=888.888.888.88 51159 22
KDE_NO_IPV6=1
R_RD4PDF=times,hyper
R_PAPERSIZE=a4
NCARG_FONTCAPS=/usr/lib64/ncarg/fontcaps
PERL=/usr/bin/perl
LC_NUMERIC=en_US.UTF-8
SSH_TTY=/dev/pts/14
LC_ALL=C
EMACS=t
USER=dleb
LC_TELEPHONE=en_US.UTF-8
LS_COLORS=
LD_LIBRARY_PATH=/usr/lib64/R/lib:/usr/local/lib64:/usr/lib/jvm/jre/lib/amd64/server:/usr/lib/jvm/jre/lib/amd64:/usr/lib/jvm/java/lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
TAR=/bin/gtar
ENV=
R_ZIPCMD=/usr/bin/zip
KDEDIR=/usr
PAGER=/usr/bin/less
NCARG_GRAPHCAPS=/usr/lib64/ncarg/graphcaps
R_GZIPCMD=/usr/bin/gzip
PATH=/bin:/usr/bin:/usr/sbin:/usr/local/bin
LC_COLLATE=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
EGREP=/bin/grep -E
PWD=/home/a-m/dleb/pecan
INPUTRC=/etc/inputrc
R_LIBS=/home/a-m/dleb/lib/R
NCARG_ROOT=/usr
R_SHARE_DIR=/usr/share/R
WHICH=/usr/bin/which
EDITOR=vi
LANG=en_US.UTF-8
KDE_IS_PRELINKED=1
R_LIBS_SITE=/usr/local/lib/R/site-library:/usr/local/lib/R/library:/usr/lib64/R/library:/usr/share/R/library
M ODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
NCARG_DATABASE=/usr/lib64/ncarg/database
LC_MEASUREMENT=en_US.UTF-8
LOADEDMODULES=
PS3=
R_BROWSER=/usr/bin/xdg-open
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
NCARG_LIB=/usr/lib64/ncarg
HOME=/home/a-m/dleb
SHLVL=1
NCARG_NCARG=/usr/share/ncarg
R_ARCH=
TR=/usr/bin/tr
MAKE=make
R_UNZIPCMD=/usr/bin/unzip
LOGNAME=dleb
CVS_RSH=ssh
LC_CTYPE=en_US.UTF-8
SSH_CONNECTION=888.888.888.88 51159 999.999.999.99 22
R_BZIPCMD=/usr/bin/bzip2
MODULESHOME=/usr/share/Modules
LESSOPEN=|/usr/bin/lesspipe.sh %s
PROMPT_COMMAND=
R_HOME=/usr/lib64/R
DISPLAY=localhost:22.0
R_PLATFORM=x86_64-redhat-linux-gnu
INSIDE_EMACS=23.2.1,tramp:2.1.18-23.2
R_LIBS_USER=~/R/x86_64-redhat-linux-gnu-library/2.12
LC_TIME=en_US.UTF-8
R_DOC_DIR=/usr/share/doc/R-2.12.2
R_SESSION_TMPDIR=/tmp/RtmpqA6bpJ
HISTFILE=/home/a-m/dleb/.tramp_history
G_BROKEN_FILENAMES=1
LC_NAME=en_US.UTF-8
_=/bin/env
Assuming you started up R as the same user, you do. Your error is not coming from a permissions problem for foo.rb, however, or else your shell would be giving the error (i.e. sh: ./test.rb: Permission denied; see the example below). Here, ruby itself is giving the error. Without knowing exactly what is in your foo.rb, I would suggest digging in there to see what it is trying to load/source, and checking the permissions on those files.
#!/usr/bin/env ruby
puts 'Hello world'
Now in R....
> system('ls -l test.rb')
-rw-r--r-- 1 jcolby staff 40 Oct 21 08:23 test.rb
> system('./test.rb')
sh: ./test.rb: Permission denied
> system('chmod a+x test.rb')
> system('./test.rb')
Hello world
I presume the M ODULEPATH in the Emacs-derived output is simply a copy and paste typo.
The differences between the two env outputs are much greater than I expected; I've selected the entries that look slightly suspicious to me:
$ diff -u works fails
--- works 2011-10-24 15:04:02.000000000 -0700
+++ fails 2011-10-24 15:12:36.000000000 -0700
...
+LD_LIBRARY_PATH=/usr/lib64/R/lib:/usr/local/lib64:/usr/lib/jvm/jre/lib/amd64/server:/usr/lib/jvm/jre/lib/amd64:/usr/lib/jvm/java/lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
...
-PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/opt/dell/srvadmin/bin
-PWD=/home/dleb
...
+PATH=/bin:/usr/bin:/usr/sbin:/usr/local/bin
...
+PWD=/home/a-m/dleb/pecan
...
In the emacs-derived session, your LD_LIBRARY_PATH environment variable may be changing which dynamically linked libraries are used when executing ruby. If you ssh into your server and execute foo.rb with the changed LD_LIBRARY_PATH, does it work or fail?
LD_LIBRARY_PATH=/usr/lib64/R/lib:/usr/local/lib64:/usr/lib/jvm/jre/lib/amd64/server:/usr/lib/jvm/jre/lib/amd64:/usr/lib/jvm/java/lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib ./foo.rb
The PATH environment variable between the two sessions is different; perhaps you have permission to execute /usr/local/bin/ruby (or the libraries in /usr/local/lib/ruby/) but not /usr/bin/ruby (or the libraries in /usr/lib/ruby/). Does your script use #!env ruby or does it use #!/usr/bin/ruby (or some other fixed path)?
Your pwd in one instance is /home/dleb, the other /home/a-m/dleb/pecan -- but HOME is set to /home/a-m/dleb on both systems. Is /home/dleb a symbolic link or does it actually exist separate from /home/a-m/dleb? (This really is grasping at straws -- I don't think this is it, but this problem is baffling.)
One last thing to consider: is your server confined with a tool such as AppArmor, SELinux, TOMOYO, or SMACK? Any of these mandatory access control tools can prevent an application from writing in specific locations; perhaps they aren't yet configured for your site. Check the dmesg(1) output to see if there are any rejection messages; most or all of these tools log to dmesg(1) if auditd(8) isn't running.
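For example, something along these lines on the server (the exact messages depend on which tool, if any, is in use):
# look for recent mandatory-access-control denials
dmesg | grep -iE 'denied|avc|apparmor'
# if auditd is running, its log is more reliable than dmesg
sudo ausearch -m avc -ts recent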
