GNU parallel to parallelize a for loop - bash

I have seen several questions about this topic, but I lack the ability to translate them to my specific problem. I have a for loop that loops through subdirectories and then executes a .sh script on a compressed text file inside each directory. I want to parallelize this process, but I'm struggling to apply GNU parallel.
Here is my loop:
for d in ./*/ ; do (cd "$d" && script.sh); done
I understand I need to input a list into parallel, so I have been trying this:
ls -d */ | parallel cd && script.sh
While this appears to get started, I get an error when gzip tries to decompress one of the .txt.gz files inside the directory, saying the file does not exist:
gzip: *.txt.gz: No such file or directory
However, when I run the original for loop, I have no issues aside from it taking a century to finish. Also, I only get the gzip error once when using parallel, which is strange considering I have over 1000 subdirectories.
My questions are:
How do I get parallel to work in my case? How do I get parallel to parallelize the application of a .sh script to thousands of files in their own subdirectories? In other words, what is the solution to my problem? I need to make progress.
What am I missing? Syntax, loop, bad script? I want to learn.
Is parallel actually attempting to run all these .sh scripts in parallel? Why don't I get an error for every .txt.gz file?
Is parallel the best option for the application? Is there another option that is better suited to my needs?

Two problems:
In:
ls -d */ | parallel cd && script.sh
what is parallelized is just cd, not script.sh. script.sh is only executed once, after all the parallel cd jobs have run, and only if there was no error. It is the same as:
ls -d */ | parallel cd
if [ $? -eq 0 ]; then script.sh; fi
You do not pass the target directory to cd. So what is executed by parallel is just cd, which changes the current directory to your home directory. The final script.sh is executed in the current directory (from where you invoked the command), where there are probably no *.txt.gz files, hence the error.
You can check yourself the effect of the first problem with:
$ mkdir /tmp/foobar && cd /tmp/foobar && mkdir a b c
$ ls -d */ | parallel cd && pwd
/tmp/foobar
The output of pwd is printed only once, even if you have more than one input directory. You can fix it by quoting the command and then check the second problem with:
$ ls -d */ | parallel 'cd && pwd'
/homes/myself
/homes/myself
/homes/myself
You should see as many pwd outputs as there are input directories but it is always the same output: your home directory. You can fix the second problem by using the {} replacement string that is substituted with the current input. Check it with:
$ ls -d */ | parallel 'cd {} && pwd'
/tmp/foobar/a
/tmp/foobar/b
/tmp/foobar/c
Now, you should have all input directories properly listed in the output.
For your specific problem this should work:
ls -d */ | parallel 'cd {} && script.sh'
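As a side note, you can avoid parsing ls output altogether by letting parallel take the directories as arguments (the ::: syntax is standard GNU parallel; -j limits the number of simultaneous jobs, and 8 is just an example value):
parallel -j 8 'cd {} && script.sh' ::: */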

Related

Fish shell - advanced control flow

Normally, fish shell processes commands like this:
1 && 3 (2)
This is perfectly useful and it mirrors the order of execution that I would want most of the time.
I was wondering whether a different syntax exists to get a slightly different order of execution.
Sometimes I want this:
2 && 3 (1)
Is that possible without using multiple lines?
This is a trivial example:
cd ~ && cat (pwd | psub)
In this example I want to run pwd first, then cd, and then cat.
Edit: oh! This seems to work:
cat (pwd | psub && cd ~)
This is one of those cases where I'm going to recommend just using multiple lines [0].
It's cleaner and clearer:
set -l dir (pwd)
cd ~ && cat (printf '%s\n' $dir | psub)
This is completely ordinary and straightforward, and that's a good thing. It's also easily extensible - want to run the cd only if the pwd succeeded?
set -l dir (pwd)
and cd ~ && cat (printf '%s\n' $dir | psub)
as set passes on the previous $status, so here it passes on the status of pwd.
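A quick illustration of that status-passing behaviour (a hypothetical fish session; the exact behaviour depends on your fish version):
$ set -l out (false)
$ echo $status
1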
The underlying philosophy here is that fish script isn't built for code golf. It doesn't have many shortcuts, even ones that POSIX shell script or especially shells like bash and zsh have. The fish way is to simply write the code.
Your answer of
cat (pwd | psub && cd ~)
doesn't work because that way the cat is no longer only executed if the cd succeeds - command substitutions can fail. Instead the cd is now only done if the psub succeeded - notably this also happens if pwd fails.
(of course that cat (pwd | psub) is fairly meaningless and could just be pwd, I'm assuming you have some actual code you want to run like this)
[0]: Technically this doesn't have to be multiple lines; you can write it as set -l dir (pwd); cd ~ && cat (printf '%s\n' $dir | psub). I would, however, also recommend using multiple lines.

mv/cp commands not working as expected with xargs in bash

Hi, I have two parent directories with these contents, under /tmp:
Note the parent directory names have ";" in them - not recommended on Unix-like systems, but those directories are pushed by an external application, and that's the way we have to deal with it.
I need to move these parent directories (along with their contents) to /tmp/archive on a RHEL 7.9 (Maipo) machine.
My simple code:
ARCHIVE="/tmp/archive"
[ -d "${ARCHIVE}" ] && mkdir -p "${ARCHIVE}"
ls -lrth /tmp | awk "\$NF ~ /2021-.*/{print \$NF}" | xargs -I "{}" mv "{}" ${ARCHIVE}/
But when I run this script, mv moves one of the parent directories as-is, but for the other one it only moves the contents of the parent directory, not the directory itself.
I tried the same script with cp -pvr in place of mv, and it's the same behavior.
When I run the same script on an Ubuntu 18 system, the behavior is as expected, i.e. the parent directories get moved to the archive folder.
Why is there this difference in behavior between an Ubuntu and a RHEL system for the same script?
Try a simpler approach:
mkdir -p /tmp/archive
mv -v /tmp/2021-*\;*\;*/ /tmp/archive
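If the directory names have to be generated programmatically rather than matched with a glob, a safer route than parsing ls is to let find hand the names to mv directly. A sketch, assuming GNU find and mv (present on both RHEL and Ubuntu):
find /tmp -maxdepth 1 -type d -name '2021-*' -exec mv -t /tmp/archive/ {} +
Because the names are never word-split by the shell, the semicolons are harmless.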

Bash script to check if a new file has been created in a directory after running a command

Using a bash script, I'm trying to detect whether a file has been created in a directory while running commands. Let me illustrate the problem:
#!/bin/bash
# give base directory to watch file changes
WATCH_DIR=./tmp
# get list of files on that directory
FILES_BEFORE= ls $WATCH_DIR
# actually a command is running here but lets assume I've created a new file there.
echo >$WATCH_DIR/filename
# and I'm getting new list of files.
FILES_AFTER= ls $WATCH_DIR
# detect changes and if any changes has been occurred exit the program.
After that I tried to compare FILES_BEFORE and FILES_AFTER, but I couldn't accomplish that. I've tried:
comm -23 <($FILES_AFTER |sort) <($FILES_BEFORE|sort)
diff $FILES_AFTER $FILES_BEFORE > /dev/null 2>&1
cat $FILES_AFTER $FILES_BEFORE | sort | uniq -u
None of them gave me a result I could use to tell whether there was a change. What I need is to detect the change and exit the program if there is one. I am not really good at bash scripting; I searched a lot on the internet but couldn't find what I need. Any help will be appreciated. Thanks.
Thanks to the informative comments, I realized that I had missed the basics of bash scripting, but I finally made it work. I'll leave my solution here as an answer for those who struggle like me:
WATCH_DIR=./tmp
FILES_BEFORE=$(ls $WATCH_DIR)
echo >$WATCH_DIR/filename
FILES_AFTER=$(ls $WATCH_DIR)
if diff <(echo "$FILES_AFTER") <(echo "$FILES_BEFORE")
then
echo "No changes"
else
echo "Changes"
fi
It outputs "Changes" on the first run and "No Changes" for the other unless you delete the newly added documents.
I'm trying to interpret your script (which contains some errors) into an understanding of your requirements.
I think the simplest way is simply to redirect the ls command output to named files, then diff those files:
#!/bin/bash
# give base directory to watch file changes
WATCH_DIR=./tmp
# get list of files on that directory
ls $WATCH_DIR > /tmp/watch_dir.before
# actually a command is running here but lets assume I've created a new file there.
echo >$WATCH_DIR/filename
# and I'm getting new list of files.
ls $WATCH_DIR > /tmp/watch_dir.after
# detect changes and if any changes has been occurred exit the program.
diff -c /tmp/watch_dir.after /tmp/watch_dir.before
If any files are modified by the 'commands', i.e. a file exists in the 'before' list but its contents change, the above will not show that as a difference.
In this case you might be better off using a 'marker' file created to mark the instant the monitoring started, then using the find command to list any newer/modified files since the marker file. Something like this:
#!/bin/bash
# give base directory to watch file changes
WATCH_DIR=./tmp
# get list of files on that directory
ls $WATCH_DIR > /tmp/watch_dir.before
# actually a command is running here but lets assume I've created a new file there.
echo >$WATCH_DIR/filename
# and I'm getting new list of files.
find $WATCH_DIR -type f -newer /tmp/watch_dir.before -exec ls -l {} \;
What this won't do is show any files that were deleted, so perhaps a hybrid list could be used.
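Another option, if polling feels fragile: the inotifywait tool from the inotify-tools package (assuming it is installed; it is not part of a base install) can report file creations as they happen. A minimal sketch using the same WATCH_DIR:
inotifywait -m -e create,moved_to --format '%w%f' "$WATCH_DIR" |
while read -r newfile; do
echo "created: $newfile"
done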
Here is how I got it to work. It's also set up so that you can have multiple watched directories with the same script via cron.
For example, if you wanted one to run every minute:
* * * * * /usr/local/bin/watchdir.sh /makepdf
And one every hour:
0 * * * * /user/local/bin/watchdir.sh /incoming
#!/bin/bash
WATCHDIR="$1"
NEWFILESNAME=.newfiles$(basename "$WATCHDIR")
if [ ! -f "$WATCHDIR"/.oldfiles ]
then
ls -A "$WATCHDIR" > "$WATCHDIR"/.oldfiles
fi
ls -A "$WATCHDIR" > $NEWFILESNAME
DIRDIFF=$(diff "$WATCHDIR"/.oldfiles $NEWFILESNAME | cut -f 2 -d " ")
for file in $DIRDIFF
do
if [ -e "$WATCHDIR"/$file ];then
#do what you want to the file(s) here
echo $file
fi
done
rm $NEWFILESNAME

changing directories by using cd

I executed the following command:
cd /mnt/c/Users/Daniel/Documents/Assg/ | cat file.txt
My question is: why doesn't it change directory? The contents of file.txt are displayed, but the directory is not changed. I understand that if we execute the same command in the following order it won't work, because cd changes directory in a child process, so the net result is the same:
cat file.txt | cd /mnt/c/Users/Daniel/Documents/Assg/
Try just cd /mnt/c/Users/Daniel/Documents/Assg/
As was already stated, the following:
cd /mnt/c/Users/Daniel/Documents/Assg/
should do the trick, but I'd like to go a bit more into why the command you presented doesn't work as expected. In Bash (and other shells), you can have multiple "subshells" running under a parent shell. Each of these subshells has its own working directory. When you run commands in a pipeline, as you have done, a subshell is created. The working directory of the subshell was changed, but that didn't have any effect on the shell you were working in.
It depends on the shell you use.
When you run two commands in a pipeline, typically one or both of the commands is run in a separate child process. In older shells this would be both; in later shells it can be either the first or the last.
At one point, the ksh93 team decided to make the last command in the pipeline the parent. This would prevent race conditions, and if the command was a builtin, it allowed it to run inside the current shell process and preserve the results of the pipeline.
Nevertheless, cd is a command that does not consume or produce any input or output (except for diagnostics on stderr), and using it in a pipeline by itself is just silly. A better, because more predictable, command line would be:
cd /mnt/c/Users/Daniel/Documents/Assg/ && cat file.txt
This will assure that cat only runs if cd succeeds, and will then show the contents of file.txt from the given directory.
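You can see the subshell effect for yourself (a hypothetical bash session; the parent shell's working directory is untouched):
$ pwd
/home/daniel
$ cd /tmp | cat
$ pwd
/home/daniel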
You have different options.
Perform cat after trying to change dir
cd /mnt/c/Users/Daniel/Documents/Assg/ ; cat file.txt
Perform cat only when change dir worked
cd /mnt/c/Users/Daniel/Documents/Assg/ && cat file.txt
Perform cat in the other directory, but return to the current dir when finished.
(cd /mnt/c/Users/Daniel/Documents/Assg/ && cat file.txt)
# or
cat /mnt/c/Users/Daniel/Documents/Assg/file.txt
EDIT:
Your question, "why doesn't cd /mnt/c/Users/Daniel/Documents/Assg/ | cat file.txt change directory?", can be answered in two ways.
The technical explanation is given by @Henk (the pipe introduces a subshell, and environment settings in a subshell get lost when the subshell exits).
The functional explanation is that you used the wrong syntax for what you are trying to accomplish.

Can Linux Bash search for a file every 60 seconds and execute commands on it? How would I do this?

Basically I want to do something like this from bash:
if a file exists in a directory, rename/move/whatever it;
if it doesn't exist, loop every 60 seconds:
# Create ~/bin
cd ~/
if dir ~/bin does not exist
then mkdir ~/bin
#!/bin/bash
# Create ~/blahhed && ~/blahs
if dir ~/blahhed does not exist
then mkdir ~/blahhed
if dir ~/blahs does not exist
then mkdir ~/blahs
# This will copy a file from ~/blahhed to ~/blahs
if ~/blahhed/file exists
then mv ~/blahhed/file ~/blahs/file
rm ~/blahhed/file
else loop for 60s
# This appends the date and time
# to the end of the file name
date_formatted=$(date +%m_%d_%y-%H,%M,%S)
if ~/blahs/file does exist
then mv ~/blahs/file ~/blahs/file.$date_formatted
rm ~/blahs/file
else loop for 60s
OK, I've rewritten it like this. Am I on the right track here?
# Create ~/bin
cd ~/
if [! -d ~/bin]; then
mkdir ~/bin
if [ -d ~/bin]; then
#!/bin/bash
# Create ~/blahhed && ~/blahs
if [! -d ~/blahhed]; then
mkdir ~/blahhed
if [! -d ~/blahs]; then
mkdir ~/blahs
# This will copy a file from ~/blahhed to ~/blahs
while if [ -d ~/blahhed/file]; then
do
mv ~/blahhed/file ~/blahs/file
rm ~/blahhed/file
continue
# This appends the date and time
# to the end of the file name
date_formatted=$(date +%m_%d_%y-%H,%M,%S)
if [! -d ~/blahs/file]; then
mv ~/blahs/file ~/blahs/file.$date_formatted
rm ~/blahs/file
sleep 60 seconds
You could use watch(1), which is able to run a program or script every N seconds.
To run some script every few minutes (not seconds), or every few hours or days, use some crontab(5) entries. To run it at some given (relative or absolute) time, consider at(1) (which you might use with some here document in your shell terminal, etc.).
However, to execute commands when a file exists or changes, you might use make(1) (which you could run from watch); that command is configurable in a Makefile (see the documentation of GNU make).
And if you really care about files appearing or changing (and doing something on such changes), consider using inotify(7)-based facilities, e.g. incrond with incrontab(5).
To test the existence of directories or files, use test(1), often spelled [, e.g.
## test in a script if directory ~/foo/ exist
if [ -d ~/foo/ ]; then
echo the directory foo exists
fi
Spaces are important above. You could use [ -d "$HOME/foo/" ].
It looks like you may want to mimic logrotate(8). See also the syslog(3) library function and the logger(1) command.
To debug your bash script, start it temporarily (while debugging; see execve(2) & bash(1) for details) with
#!/bin/bash -vx
and make your foo.sh script executable with chmod a+x foo.sh
To stop execution of some script for some seconds, use sleep(1).
The mkdir(1) command accepts -p (and then won't create a directory if it already exists). mv(1) also has many options (including for backup).
To search for files in a file tree, use find(1). To search for content inside files, use grep. I also like ack.
Read also the Advanced Bash Scripting Guide and (if coding in C ...) Advanced Linux Programming, and also the documentation of GNU bash (e.g. for shell builtins and control statements).
Did you consider using some revision control system like git? It is useful for managing the evolution of source files (including shell scripts).
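Putting several of those pieces together, here is a condensed sketch of the loop the question describes, reusing its ~/blahhed and ~/blahs names (a sketch, not a tested solution; adjust to taste):
#!/bin/bash
# create the directories; -p makes this a no-op if they already exist
mkdir -p ~/blahhed ~/blahs
while true; do
if [ -e ~/blahhed/file ]; then
# move the file over, stamping it with the date and time
date_formatted=$(date +%m_%d_%y-%H,%M,%S)
mv ~/blahhed/file ~/blahs/file."$date_formatted"
fi
sleep 60
done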
I've seen solutions similar to what you are asking, but using crontab with find -mmin -1, which will search for any files with a modification time <= 60 seconds within the specified location.
Something along these lines (untested):
$ -> vi /tmp/file_finder.sh
# Add the following lines
#!/bin/bash
find /path/to/check -mmin -1 -type f | while read -r fname; do
echo "$fname"
done
# Change perms
$ -> chmod 755 /tmp/file_finder.sh
$ -> crontab -e
* * * * * /tmp/file_finder.sh
With the above, you have now set up cron to run every minute and kick off a script that will search the given directory for files with a modification time <= 60 seconds (new or updated).
Caveat: you should look for files with a modification time of up to 5 minutes; that way you don't consider a file which may still be in the process of being written to.
I think you answered yourself (kind of)
Some suggestions:
1- use a while loop and at the end add sleep 60
2- write your procedure in a file (e.g. test1)
and then
watch -n 60 ./test1
