How to free up memory after running a process in a shell - bash

I have a script that runs a JAVA process that loads data into a database every 10 secs using a loop. This script seems to work perfectly, but after a couple of days I start getting Memory issues. If I stop the script everything frees up, I can start it again and it will run happily for another couple of days.
RUNME=Y
PROPERTIES=someproprties.properties
CHECKFILE=somelockfile.lock
touch $CHECKFILE
while [ "$RUNME" = "Y" ]; do
if [ -f $CHECKFILE ]
then
#Run Process
$DR_HOME/bin/dr -cp $CP_PLUGIN -Xmx64g --engine parallelism=1 --runjson $HOME_DIR/workflows/some_dataflow.dr --overridefile $PROPERTIES 1> /dev/null 2>> $LOG_FILE
#Give Process a little time to finish up before moving on
sleep 10s
else
RUNME=N
fi
done
I had assumed that once the process had run it would make any memory that it had allocated for the process available again, so that the next iteration of the loop could use this. Given that this does not seem to be the case, is there a way I can force the release of memory post the running the process. I appreciate that this may be something that I need to address in the actual JAVA Process rather than in a Shell - but as this is the area I have more control over, I thought I would at least ask.

To check the processes which are running and memory used
sid=$(ps -p $$ -osid=)
while ....
ps --sid $sid -opid,tty,cpu,vsz,etime,command
vsz shows the virtual size used by the process
Then if it's really bash process, it may be environment which is growing, but from the script it can't be that.

Related

Why does bash "forget" about my background processes?

I have this code:
#!/bin/bash
pids=()
for i in $(seq 1 999); do
sleep 1 &
pids+=( "$!" )
done
for pid in "${pids[#]}"; do
wait "$pid"
done
I expect the following behavior:
spin through the first loop
wait about a second on the first pid
spin through the second loop
Instead, I get this error:
./foo.sh: line 8: wait: pid 24752 is not a child of this shell
(repeated 171 times with different pids)
If I run the script with shorter loop (50 instead of 999), then I get no errors.
What's going on?
Edit: I am using GNU bash 4.4.23 on Windows.
POSIX says:
The implementation need not retain more than the {CHILD_MAX} most recent entries in its list of known process IDs in the current shell execution environment.
{CHILD_MAX} here refers to the maximum number of simultaneous processes allowed per user. You can get the value of this limit using the getconf utility:
$ getconf CHILD_MAX
13195
Bash stores the statuses of at most twice as that many exited background processes in a circular buffer, and says not a child of this shell when you call wait on the PID of an old one that's been overwritten. You can see how it's implemented here.
The way you might reasonably expect this to work, as it would if you wrote a similar program in most other languages, is:
sleep is executed in the background via a fork+exec.
At some point, sleep exits leaving behind a zombie.
That zombie remains in place, holding its PID, until its parent calls wait to retrieve its exit code.
However, shells such as bash actually do this a little differently. They proactively reap their zombie children and store their exit codes in memory so that they can deallocate the system resources those processes were using. Then when you wait the shell just hands you whatever value is stored in memory, but the zombie could be long gone by then.
Now, because all of these exit statuses are being stored in memory, there is a practical limit to how many background processes can exit without you calling wait before you've filled up all the memory you have available for this in the shell. I expect that you're hitting this limit somewhere in the several hundreds of processes in your environment, while other users manage to make it into the several thousands in theirs. Regardless, the outcome is the same - eventually there's nowhere to store information about your children and so that information is lost.
I can reproduce on ArchLinux with docker run -ti --rm bash:5.0.18 bash -c 'pids=; for ((i=1;i<550;++i)); do true & pids+=" $!"; done; wait $pids' and any earlier. I can't reproduce with bash:5.1.0 .
What's going on?
It looks like a bug in your version of Bash. There were a couple of improvements in jobs.c and wait.def in Bash:5.1 and Make sure SIGCHLD is blocked in all cases where waitchld() is not called from a signal handler is mentioned in the changelog. From the look of it, it looks like an issue with handling a SIGCHLD signal while already handling another SIGCHLD signal.

checking per ssh if a specific program is still running, in parallel

I have several machines where I have a program running. Every 30 seconds or so I want to check if those programs are still running. I use the following command to do that.
ssh ${USER}#${HOSTS[i]} "bash -c 'if [[ -z \"\$(pgrep -u ${USER} program)\" ]]; then exit 1; else exit 0; fi'"
Now running this on >100 machines takes a long time and I want to speed that up by checking in parallel. I am aware of '&' and 'parallel', but I am unsure how to retreive the return value (task completed or not).
The following lets all connections complete before starting any in the next batch, and thus can potentially wait for more than 30 seconds -- but should give you a good idea of how to do what you're looking for:
hosts=( host1 host2 host3 )
user=someuser
script="script you want to run on each remote host"
last_time=$(( SECONDS - 30 ))
while (( ( SECONDS - last_time ) >= 30 )) || \
sleep $(( 30 - (SECONDS - last_time) )); do
last_time=$SECONDS
declare -A pids=( )
for host in "${hosts[#]}"; do
ssh "${user}#${host}" "$script" & pids[$!]="$host"
done
for pid in "${!pids[#]}"; do
wait "$pid" || {
echo "Failure monitoring host ${pids[$pid]} at time $SECONDS" >&2
}
done
done
Now, bigger picture: Don't do that.
Almost every operating system has a process supervision framework. Ubuntu has Upstart; Fedora and CentOS 7 have systemd; MacOS X has launchd; runit, daemontools, and others can be installed anywhere (and are very, very easy to use -- look at the run scripts at http://smarden.org/runit/runscripts.html for examples).
Using these tools are the Right Way to monitor a process and ensure that it restarts whenever it exits: Unlike this (very high-overhead) solution they have almost no overhead at all, since they rely on the operating system notifying a process's parent when that process exits, rather than doing the work of polling for a process (and that only after all the overhead of connecting via SSH, negotiating a pair of session keys, starting a shell to run your script, etc, etc, etc).
Yes, this may be a small private project. Still, you're making extra complexity (and thus, extra bugs) for yourself -- and if you learn to use the tools to do this right, you'll know how to do things right when you have something that isn't a small private project.

How can I create a process in Bash that has zero overhead but which gives me a process ID

For those of you who know what you're talking about I apologise for butchering the way that I'm going to phrase this question. I know nothing about bash whatsoever. With that caveat out of the way, let me get out my cleaver...
I am building a Rails app which has what's called a procfile which sets up any processes that need to be run in different environments
e.g.
web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb
redis: redis-server
worker: bundle exec sidekiq
proxylocal: bin/proxylocal_local
Each one of these lines specs a process to be run. It also expects a pid to be returned after the process spins up. The syntax is
process_name: process_invokation_script
However the last process, proxylocal, only actually starts a process in development. In production it doesn't do anything.
Unfortunately that causes the Procfile to choke as it needs a process ID returned. So is there some super-simple, zero-overhead process that I can spawn in that case just to keep the procfile happy?
The sleep command does nothing for a specified period of time, with very low overhead. Give it an argument longer than your code will run.
For example
sleep 2147483647
does nothing for 231-1 seconds, just over 68 years. I picked that number because any reasonable implementation of sleep should be able to handle it.
In the unlikely event that that doesn't work (say if you're on an old 16-bit system that can't sleep for more than 216-1 seconds), you can do a sleep in an infinite loop:
sh -c 'while : ; do sleep 30000 ; done'
This assumes that you need the process to run for a very long time; that depends on what your application needs to do with the process ID. If it's required to be unique as long as the application is running, you need something that will continue to run for a long time; if the process terminates, its PID can be re-used by another process.
If that's not a requirement, you can use sleep 0 or true, which will terminate immediately.
If you need to give the application a little time to get the process ID before the process terminates, something like sleep 10 or even sleep 1 might work, though determining just how long it needs to run can be tricky and error-prone.
If Heroku isn't doing anything with proxylocal I'm not sure why you'd even want this in your Procifle. I'm also a bit confused about whether you want to change the Procfile or what bin/proxylocal_local does and how you would even do that.
That being said, if you are able to do anything you like for production your script can just call cat and it will create a pid and then just sit waiting for the next command (which never comes).
For truly minimal overhead, you don't want to run any external commands. When the shell starts a command, it first forks itself, then the child shell execs the external command. If the forked child can run a builtin, you can skip the exec.
Start by creating a read-only fifo somewhere.
mkfifo foo
chmod 400 foo
Then, whenever you need a do-nothing process, just fork a shell which tries to read from the fifo. It's read-only, so no one can write to it, so all reads will block.
read < foo &

Understanding the behavior of processes - why all process run together and sleep together?

I have written a script to initiate multi-processing
for i in `seq 1 $1`
do
/usr/bin/php index.php name&
done
wait
A cron run every min - myscript.sh 3 now three background process get initiated and after some time I see list of process via ps command. I see all the processes are together in "Sleep" or "Running" mode...Now I wanted to achieve that when one goes to sleep other processes must process..how can I achieve it?. Or this is normal.
This is normal. A program that can run will be given time by the operating system... when possible. If all three are sleeping, then the system is most likely busy and time is being given to other processes.

Introduce timeout in a bash for-loop

I have a task that is very well inside of a bash for loop. The situation is though, that a few of the iterations seem to not terminate. What I'm looking for is a way to introduce a timeout that if that iteration of command hasn't terminated after e.g. two hours it will terminate, and move on to the next iteration.
Rough outline:
for somecondition; do
while time-run(command) < 2h do
continue command
done
done
One (tedious) way is to start the process in the background, then start another background process that attempts to kill the first one after a fixed timeout.
timeout=7200 # two hours, in seconds
for somecondition; do
command & command_pid=$!
( sleep $timeout & wait; kill $command_pid 2>/dev/null) & sleep_pid=$!
wait $command_pid
kill $sleep_pid 2>/dev/null # If command completes prior to the timeout
done
The wait command blocks until the original command completes, whether naturally or because it was killed after the sleep completes. The wait immediately after sleep is used in case the user tries to interrupt the process, since sleep ignores most signals, but wait is interruptible.
If I'm understanding your requirement properly, you have a process that needs to run, but you want to make sure that if it gets stuck it moves on, right? I don't know if this will fully help you out, but here is something I wrote a while back to do something similar (I've since improved this a bit, but I only have access to a gist at present, I'll update with the better version later).
#!/bin/bash
######################################################
# Program: logGen.sh
# Date Created: 22 Aug 2012
# Description: parses logs in real time into daily error files
# Date Updated: N/A
# Developer: #DarrellFX
######################################################
#Prefix for pid file
pidPrefix="logGen"
#output direcory
outDir="/opt/Redacted/logs/allerrors"
#Simple function to see if running on primary
checkPrime ()
{
if /sbin/ifconfig eth0:0|/bin/grep -wq inet;then isPrime=1;else isPrime=0;fi
}
#function to kill previous instances of this script
killScript ()
{
/usr/bin/find /var/run -name "${pidPrefix}.*.pid" |while read pidFile;do
if [[ "${pidFile}" != "/var/run/${pidPrefix}.${$}.pid" ]];then
/bin/kill -- -$(/bin/cat ${pidFile})
/bin/rm ${pidFile}
fi
done
}
#Check to see if primary
#If so, kill any previous instance and start log parsing
#If not, just kill leftover running processes
checkPrime
if [[ "${isPrime}" -eq 1 ]];then
echo "$$" > /var/run/${pidPrefix}.$$.pid
killScript
commands && commands && commands #Where the actual command to run goes.
else
killScript
exit 0
fi
I then set this script to run on cron every hour. Every time the script is run, it
creates a lock file named after a variable that describes the script that contains the pid of that instance of the script
calls the function killScript which:
uses the find command to find all lock files for that version of the script (this lets more than one of these scripts be set to run in cron at once, for different tasks). For each file it finds, it kills the processes of that lock file and removes the lock file (it automatically checks that it's not killing itself)
Starts doing whatever it is I need to run and not get stuck (I've omitted that as it's hideous bash string manipulation that I've since redone in python).
If this doesn't get you squared let me know.
A few notes:
the checkPrime function is poorly done, and should either return a status, or just exit the script itself
there are better ways to create lock files and be safe about it, but this has worked for me thus far (famous last words)

Resources