Warning using Parallel::ForkManager, but only on Windows

I sometimes get this warning when using Parallel::ForkManager, but only on Windows, not on a Unix-based system. What does it mean, and should I worry about it?
child process '-17108' disappeared. A call to waitpid outside of
Parallel::ForkManager might have reaped it.
Here is the sample code from the docs that my code is based on:
use LWP::Simple;
use Parallel::ForkManager;

my @links = (
    ["http://www.foo.bar/rulez.data", "rulez_data.txt"],
    ["http://new.host/more_data.doc", "more_data.doc"],
);

# Max 30 processes for parallel download
my $pm = Parallel::ForkManager->new(30);

LINKS:
foreach my $linkarray (@links) {
    $pm->start and next LINKS;    # do the fork

    my ($link, $fn) = @$linkarray;
    warn "Cannot get $fn from $link"
        if getstore($link, $fn) != RC_OK;

    $pm->finish;    # do the exit in the child process
}
$pm->wait_all_children;

I had a similar issue, and placing a sleep 1 before $pm->start and next LINKS; fixed it. I guess it's due to continuous forking, where Perl loses track of the forked processes. I may be wrong!

Related

Ruby, strange forked processes behaviour on MacOS vs Debian

Using Ruby (tested with versions 2.6.9, 2.7.5, 3.0.3, 3.1.1) and forking processes to handle socket communication, there seems to be a huge difference between macOS and Debian Linux.
While running on Debian, the forked processes are called in a balanced manner - that is: with 10 TCP server forks and 100 client calls, each fork gets 10 calls. The order of the pid call stack is also always the same, even though it is not ordered by pid (caused by load when the forks are instantiated).
Doing the same on macOS (Catalina), the forked processes are not called in a balanced manner - that is: "pid A" might get called 23 or however many times, while e.g. "pid G" is never used.
Sample code (originally from: https://relaxdiego.com/2017/02/load-balancing-sockets.html)
#!/usr/bin/env ruby
# server.rb
require 'socket'

# Open a socket
socket = TCPServer.open('0.0.0.0', 9999)
puts "Server started ..."

# For keeping track of children pids
wpids = []

# Forward any relevant signals to the child processes.
[:INT, :QUIT].each do |signal|
  Signal.trap(signal) {
    wpids.each { |wpid| Process.kill(:KILL, wpid) }
  }
end

5.times {
  wpids << fork do
    loop {
      connection = socket.accept
      connection.puts "Hello from #{ Process.pid }"
      connection.close
    }
  end
}

Process.waitall
Run some netcat to the server on a second terminal:
for i in {1..20}; do nc -d localhost 9999; done
As said: running on Linux, each forked process gets 4 calls; doing the same on macOS, usage per forked process is random.
Is there a solution or correction to make it work in a balanced manner on macOS as well?
The problem is that the default socket backlog size is 5 on macOS and 128 on Linux. You can change the backlog size by passing it to TCPServer#listen:
socket.listen(128)
Or you can use the backlog size from the environment variable SOMAXCONN:
socket.listen(ENV.fetch('SOMAXCONN', 128).to_i)
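Applied to the server code above, the fix is a one-line addition right after the socket is opened. This is a sketch, not a tested fix: 128 is an arbitrary value, and anything large enough to absorb your burst of pending connections should behave the same way:

require 'socket'

# Open the socket as before, then enlarge the accept backlog
# (macOS defaults to 5, which is what starves some of the forks)
socket = TCPServer.open('0.0.0.0', 9999)
socket.listen(128)

With a larger backlog, pending connections queue in the kernel and each fork's accept can pick them up in turn.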

Does $argv behave the same between CentOS and RHEL systems?

I am trying to troubleshoot an old Tcl accounting script called GOTS - Grant Of The System. It creates a time-stamped logfile entry for each user login and another for the logout. The problem is that it is not creating the second log entry on logout. I think I have tracked down the area where it goes wrong, and I have attached it here. FYI, the log file exists, and the script does not exit with the error "GOTS was called incorrectly!!". It should be executing the if-branch for [string match "$argv" "end_session"].
This software runs properly on RHEL 6.9 but fails as described on CentOS 7. I am thinking that there is a system variable, or a difference in the $argv argument vector between the two systems, that creates this behavior.
Am I correct in suspecting $argv, and if not, does anyone see the true problem?
How do I print or display the $argv values on logout?
# Find out if we're beginning or ending a session
if { [string match "$argv" "end_session"] } {
    if { ![file writable $Log] } {
        onErrorNotify "4 LOG"
    }
    set ifd [open $Log a]
    puts $ifd "[clock format [clock seconds]]\t$Instrument\t$LogName\t$GroupName"
    close $ifd
    unset ifd
    exit 0
} elseif { [string match "$argv" "begin_session"] == 0 } {
    puts stderr "GOTS was called incorrectly!!"
    exit -1
}
The end_session argument is supplied by the /etc/gdm/PostSession/Default file:
#!/bin/sh
### Begin GOTS PostSession
# Do not run GOTS if root is logging out
if test "${USER}" == "root" ; then
    exit 0
fi
/usr/local/lib/GOTS/gots end_session > /var/tmp/gots_postsession.log 2> /var/tmp/gots_postsession.log
exit 0
### End GOTS PostSession
This is the postsession log file:
Application initialization failed: couldn't connect to display ":1"
Error in startup script: invalid command name "option"
while executing
"option add *Font "-adobe-new century schoolbook-medium-r-*-*-*-140-*-*-*-*-*-*""
(file "/usr/local/lib/GOTS/gots" line 26)
After a lot of troubleshooting, we have determined that, for whatever reason, CentOS is not allowing part of the /etc/gdm/PostSession/Default file to execute:
fi
/usr/local/lib/GOTS/gots end_session
But it does update the postsession log file as it should. Does anyone have any idea what could be interfering with only part of PostSession/Default?
Could it be that you are hitting Bug 851769?
That said, am I correct in stating that, as your investigation shows, this is not a Tcl-related issue or question anymore?
So it turns out that our script has certain elements that depend on the X server running at logout to display some of the GUI error messages. This is from:
Gnome Configuration
"When a user terminates their session, GDM will run the PostSession script. Note that the Xserver will have been stopped by the time this script is run, so it should not be accessed.
Note that the PostSession script will be run even when the display fails to respond due to an I/O error or similar. Thus, there is no guarantee that X applications will work during script execution."
We are having to rewrite those error-message callouts so they simply write the errors to a file instead of depending on the display. The errors are for things that should be present at the beginning anyway.

Ruby: running external processes in parallel and keeping track of exit codes

I have a smoke test that I run against my servers before making them live. At the moment it runs serially and takes around 60s per server. I can run the tests in parallel, and I've done it with Thread.new, which is great as it runs them a lot faster, but I lose track of whether each test actually passed.
I'm trying to improve this by using Process.spawn to manage my processes.
pids = []
uris.each do |uri|
command = get_http_tests_command("Smoke")
update_http_tests_config( uri )
pid = Process.spawn( system( command ) )
pids.push pid
Process.detach pid
end
# make sure all pids return a passing status code
# results = Process.waitall
I'd like to kick off all my tests and then afterwards make sure that all of them returned a passing status code.
I tried using Process.waitall, but I believe that to be incorrect; it is used for forks, not spawns.
After all the processes have completed, I'd like to return true if they all pass, or false if any one of them fails.
Try:
statuses = pids.map { |pid| Process.wait(pid, 0); $? }
This waits for each of the process IDs to finish and checks the result status set in $? for each process.
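Putting it together with the loop from the question, a sketch of the full pattern might look like this (get_http_tests_command and update_http_tests_config are the asker's helpers, assumed to exist; note that Process.spawn is given the command string directly rather than the result of system, and Process.detach is dropped, since a detached pid can no longer be waited on):

pids = uris.map do |uri|
  update_http_tests_config( uri )
  # spawn the command itself; system(command) would run it inline and hand spawn a boolean
  Process.spawn( get_http_tests_command("Smoke") )
end

# Wait for each child and collect its Process::Status
statuses = pids.map { |pid| Process.wait(pid, 0); $? }

# true only if every smoke test exited with status 0
all_passed = statuses.all?(&:success?)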

Random corruption in file created/updated from shell script on a singular client to NFS mount

We have a bash script (a job wrapper) that writes to a file, launches a job, and then at job completion appends information about the job to the file. The wrapper runs on one of several thousand batch nodes, but the problem has only cropped up on several batch machines (RHEL6, I believe) accessing one NFS server, plus at least one known instance of a different batch job on a different batch node using a different NFS server. In all cases, only one client host is writing to the files in question. Some jobs take hours to run, others take minutes.
In the same time period, there seem to have been 10-50 occurrences out of 100,000+ jobs.
Here is what I believe to be, effectively, the distilled version of the job wrapper:
#!/bin/bash
## cwd is /nfs/path/to/jobwd
## This file is /nfs/path/to/jobwd/job_wrapper

gotEXIT()
{
    ## end of script; gotEXIT is called because we trap EXIT
    END="EndTime: `date`\nStatus: Ended"
    echo -e "${END}" >> job_info
    cat job_info | sendmail jobtracker@example.com
}
trap gotEXIT EXIT

function jobSetVar { echo "job.$1: $2" >> job_info; }
export -f jobSetVar

MSG="${email_metadata}\n${job_metadata}"
echo -e "${MSG}\nStatus: Started" | sendmail jobtracker@example.com
echo -e "${MSG}" > job_info

## At the job's end, the output from the `time` command is the first non-corrupt data in job_info
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
## 10-360 minutes later...
RC=$?
echo -e "ExitCode: ${RC}" >> job_info
So I think there are two possibilities:
echo -e "${MSG}" > job_info
This command writes out corrupt data.
/usr/bin/time -f "Elapsed: %e\nUser: %U\nSystem: %S" -a -o job_info job_command
This corrupts the existing data, then outputs its own data correctly.
However, some jobs, but not all, call jobSetVar, which doesn't end up being corrupt.
So I dug into time.c (from GNU time 1.7) to see when the file is opened. To summarize, time.c is effectively this:
FILE *outfp;

void main (int argc, char** argv) {
    const char **command_line;
    RESUSE res;

    /* internally, getargs opens "job_info", so outfp = fopen("job_info", "a") */
    command_line = getargs (argc, argv);

    /* run_command doesn't care about outfp */
    run_command (command_line, &res);

    /* internally, summarize calls fprintf and putc on the outfp FILE pointer */
    summarize (outfp, output_format, command_line, &res);
    fflush (outfp);
}
So time keeps FILE *outfp (the job_info handle) open for the entire duration of the job. It then writes the summary at the end of the job, and doesn't actually appear to close the file (I'm not sure whether that is necessary given the fflush). I have no clue whether bash also holds the file handle open concurrently.
EDIT:
A corrupted file will typically consist of the corrupted part followed by the non-corrupted part. The corrupted section, which occurs before the non-corrupted section, is typically largely a run of 0x0000, with maybe some cyclic garbage mixed in. Here's an example hexdump:
40000000 00000000 00000000 00000000
00000000 00000000 C8B450AC 772B0000
01000000 00000000 C8B450AC 772B0000
[ 361 x 0x00]
Then, at the 409th byte, it continues with the non-corrupted section:
Elapsed: 879.07
User: 0.71
System: 31.49
ExitCode: 0
EndTime: Fri Dec 6 15:29:27 PST 2013
Status: Ended
Another file looks like this:
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
[96 x 0x00]
[Repeat above 3 times ]
01000000 04000000 805443FC 9D2B0000 E04144FC 9D2B0000 E04144FC 9D2B0000
Followed by the non corrupted section:
Elapsed: 12621.27
User: 12472.32
System: 40.37
ExitCode: 0
EndTime: Thu Nov 14 08:01:14 PST 2013
Status: Ended
There are other files that have much more random corruption sections, but more than a few were cyclical, similar to the above.
EDIT 2: The first email, sent from the echo -e statement, goes through fine, so MSG isn't corrupted at that point. The last email is never sent, because the corruption destroys the email metadata in job_info. It's assumed that job_info probably isn't corrupt at that point either, but we haven't been able to verify that yet.
This is a production system which hasn't had major code modifications, and I have verified through audit that no jobs were run concurrently that would touch this file. The problem seems to be somewhat recent (the last 2 months), but it's possible it happened before and slipped through. This error prevents reporting, which means jobs are considered failed, so they are typically resubmitted, but one user in particular has ~9 hour jobs for which this error is especially frustrating.
I wish I could come up with more info or a way of reproducing this at will, but I was hoping somebody had maybe seen a similar problem, especially recently. I don't manage the NFS servers, but I'll try to talk to the admins to find out what updates the NFS servers (RHEL6, I believe) were running at the time of these issues.
Well, the emails corresponding to the corrupt job_info files should tell you what was in MSG (which will probably be business as usual).
You may want to check how NFS is being run: there's a remote possibility that you are running NFS over UDP without checksums. That could explain some corrupt data. I also hear that UDP/TCP checksums are not strong enough and the data can still end up corrupt - maybe you are hitting such a problem (I have seen corrupt packets slip through a network stack at least once before, and I'm quite sure some checksumming was going on). Presumably MSG goes out as a single packet, and there might be something about it that makes checksum conflicts with the garbage you see more likely.
Of course it could also be an NFS bug (client or server), a server-side filesystem bug, a busted piece of RAM... the possibilities are almost endless here (although I see how the fact that it's always MSG that gets corrupted makes some of those quite unlikely). The problem might be related to seeking (which happens during the append). You could also have a bug somewhere else in the system, causing multiple clients to open the same job_info file and making it a jumble.
You can also try using a different file for the time output and then merging it with job_info at the end of the script. That may help isolate the problem further.
The shell opens the job_info file for writing, outputs MSG, and then should close its file descriptor before launching the main job. The time program opens the same file for append as a stream, and I suspect the seek over NFS is not done correctly, which may cause that garbage. I can't explain why, but normally this should not happen (and mostly doesn't). Such rare occurrences may point to a race condition somewhere; it could be caused by out-of-sequence packet delivery (due to a network latency spike), by retransmits that duplicate data, or by a bug somewhere. At first glance I would suspect a bug, but that bug may be triggered by network behavior, e.g. an unusually large delay or a spike of packet loss.
File access between different processes is serialized by the kernel, but as an additional safeguard it may be worth adding some artificial delays - sleep timers between outputs, for example.
The network is not transparent, especially a large one. There can be WAN optimization devices, which are known to cause application issues at times. CIFS and NFS are good candidates for optimization over WAN, with local caching of filesystem operations. It might be worth asking the network admins about recent changes.
Another thing to try, although it can be difficult due to the rare occurrences, is capturing the interesting NFS sessions via tcpdump or wireshark. In really tough cases we do simultaneous captures on both the client and server side and then compare the protocol logic to prove that the network is or is not working correctly. That's a whole topic in itself; it requires thorough preparation and luck, but it is usually the last resort of desperate troubleshooting :)
It turns out this was actually another issue altogether, apparently to do with an out-of-date page being written to disk.
A bug fix was supplied to the linux-nfs implementation:
http://www.spinics.net/lists/linux-nfs/msg41357.html

How can I check from Ruby whether a process with a certain pid is running?

If there is more than one way, please list them. I only know of one, but I'm wondering if there is a cleaner, in-Ruby way.
The difference between the Process.getpgid and Process::kill approaches seems to be what happens when the pid exists but is owned by another user: Process.getpgid will return an answer, while Process::kill will throw an exception (Errno::EPERM).
Based on that, I recommend Process.getpgid, if only because it saves you from having to catch two different exceptions.
Here's the code I use:
begin
  Process.getpgid( pid )
  true
rescue Errno::ESRCH
  false
end
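Wrapped into a helper for reuse (a small sketch of the same technique; the method name running? is my own):

# true if a process with the given pid exists, regardless of who owns it
def running?(pid)
  Process.getpgid(pid)
  true
rescue Errno::ESRCH
  false
end

running?(Process.pid)  # => true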
If it's a process you expect to "own" (e.g. you're using this to validate a pid for a process you control), you can just send sig 0 to it.
>> Process.kill 0, 370
=> 1
>> Process.kill 0, 2
Errno::ESRCH: No such process
from (irb):5:in `kill'
from (irb):5
>>
@John T, @Dustin: Actually, guys, I perused the Process rdocs, and it looks like
Process.getpgid( pid )
is a less violent means of applying the same technique.
For child processes, other solutions like sending a signal won't behave as expected: they will indicate that the process is still running when it has actually exited, because the child remains a zombie until it is reaped.
You can use Process.waitpid if you want to check on a process that you spawned yourself. The call won't block if you use the Process::WNOHANG flag, and nil is returned as long as the child process hasn't exited.
Example:
pid = Process.spawn('sleep 5')
Process.waitpid(pid, Process::WNOHANG) # => nil
sleep 5
Process.waitpid(pid, Process::WNOHANG) # => pid
If the pid doesn't belong to a child process, an exception will be thrown (Errno::ECHILD: No child processes).
The same applies to Process.waitpid2.
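For illustration, a minimal sketch using Process.waitpid2, which returns both the pid and a Process::Status object once the child has exited (and nil under Process::WNOHANG while it is still running):

pid = Process.spawn('sleep 1')
Process.waitpid2(pid, Process::WNOHANG)  # => nil while the child is alive
sleep 2
_, status = Process.waitpid2(pid)        # => [pid, #<Process::Status ...>]
status.success?                          # => true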
This is how I've been doing it:
def alive?(pid)
  !!Process.kill(0, pid) rescue false
end
You can try using
Process::kill 0, pid
where pid is the pid number; if the process is running, it should return 1.
Under Linux you can obtain a lot of attributes of a running program using the proc filesystem:
File.read("/proc/#{pid}/cmdline")
File.read("/proc/#{pid}/comm")
A *nix-only approach would be to shell out to ps and check whether the returned string contains more than the header line.
Example IRB Output
1.9.3p448 :067 > `ps -p 56718`
" PID TTY TIME CMD\n56718 ttys007 0:03.38 zeus slave: default_bundle \n"
Packaged as a Method
def process?(pid)
  # on many systems ps prints a header line even when the pid does not exist,
  # so check for a second line rather than for any newline
  `ps -p #{pid.to_i}`.lines.length > 1
end
I've dealt with this problem before, and yesterday I compiled it into the "process_exists" gem.
It sends the null signal (0) to the process with the given pid to check whether it exists. It works even if the current user does not have permission to send a signal to the receiving process.
Usage:
require 'process_exists'
pid = 12
pid_exists = Process.exists?(pid)
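For reference, the null-signal approach the gem describes boils down to something like this sketch (my own illustration of the technique, not the gem's actual source):

def pid_exists?(pid)
  Process.kill(0, pid)   # signal 0 runs the existence/permission checks but delivers nothing
  true
rescue Errno::EPERM
  true                   # the process exists but is owned by another user
rescue Errno::ESRCH
  false                  # no such process
end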
