Rewriting a for-loop to use it with "parallel" - bash

I am trying to rewrite a for loop (that works great) so that it runs in parallel, but I am having all sorts of problems. Here is my for loop:
function no_sam {
filename=$(basename "$file")
extension="${file##*.}"
if [ $extension = "sam" ];
then
filename="${filename%.*}"
feat_out=$filename.out
htseq-count -f $types -r "$pos" -m "$mode" -i "$attribute" -s "$strand" -t "$feature" -a "$qual" "$file" "$input_gff" > "$feat_out"
grep -v "_" "$feat_out" > temp && mv temp "$feat_out"
mv "$feat_out" "$counts_folder"
elif [ $extension = "bam" ];
then
filename="${filename%.*}"
feat_out=$filename.out
htseq-count -f $types -r "$pos" -m "$mode" -i "$attribute" -s "$strand" -t "$feature" -a "$qual" "$file" "$input_gff" > "$feat_out"
grep -v "_" "$feat_out" > temp && mv temp "$feat_out"
mv "$feat_out" "$counts_folder"
fi
}
for file in "${multi[@]}"; do
no_sam
done
And when I replaced the for loop with GNU parallel, I got an error:
"${multi[@]}" no_sam | parallel
testfile.sam: command not found

Try this (based on https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Using-shell-variables and https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Calling-Bash-functions):
function no_sam {
    file="$1"    # GNU parallel passes each input as the first argument
    filename=$(basename "$file")
    extension="${file##*.}"
    # the sam and bam branches ran identical commands, so they are merged here
    if [ "$extension" = "sam" ] || [ "$extension" = "bam" ]; then
        filename="${filename%.*}"
        feat_out=$filename.out
        htseq-count -f $types -r "$pos" -m "$mode" -i "$attribute" -s "$strand" -t "$feature" -a "$qual" "$file" "$input_gff" > "$feat_out"
        # use a per-file temp name: parallel jobs sharing one "temp" file would overwrite each other
        grep -v "_" "$feat_out" > "$feat_out.tmp" && mv "$feat_out.tmp" "$feat_out"
        mv "$feat_out" "$counts_folder"
    fi
}
export -f no_sam
# the function runs in child shells, so the variables it reads must be exported as well
export types pos mode attribute strand feature qual input_gff counts_folder
parallel no_sam ::: "${multi[@]}"
From the way you try to use GNU Parallel I think you will benefit from spending an hour walking through man parallel_tutorial.
Your command line will love you for it.
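To see why export -f matters without needing GNU Parallel installed, here is a minimal sketch. The function describe and the variable prefix are hypothetical stand-ins for no_sam and its settings; each child bash gets one array element as "$1", roughly what parallel arranges for every job.

```bash
#!/bin/bash
# Hypothetical stand-in for no_sam: prints the output name it would use.
describe() {
    local file=$1 name
    name=$(basename "$file")
    echo "${prefix}${name%.*}.out"
}

prefix="counts_"      # hypothetical setting that the function reads
export -f describe    # make the function visible to child bash processes
export prefix         # the variable it depends on has to be exported too

files=("a b.sam" "c.bam")

# Run the function once per element, each time in a fresh child shell.
results=$(for f in "${files[@]}"; do bash -c 'describe "$1"' _ "$f"; done)
echo "$results"
```

Without the two export lines the child shells would report that describe is not found, or would see an empty prefix.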

Related

Looping through each file in directory - bash

I'm trying to perform certain operation on each file in a directory but there is a problem with order it's going through. It should do one file at the time. The long line (unzipping, grepping, zipping) works fine on a single file without a script, so there is a problem with a loop. Any ideas?
The script should grep through each zipped file and look for word1 or word2. If at least one of them exists, then:
unzip file
grep word1 and word2 and save it to file_done
remove unzipped file
zip file_done to /donefiles/ with original name
remove file_done from original directory
#!/bin/bash
for file in *.gz; do
counter=$(zgrep -c 'word1\|word2' $file)
if [[ $counter -gt 0 ]]; then
echo $counter
for file in *.gz; do
filenoext=${file::-3}
filedone=${filenoext}_done
echo $file
echo $filenoext
echo $filedone
gunzip $file | grep 'word1\|word2' $filenoext > $filedone | rm -f $filenoext | gzip -f -c $filedone > /donefiles/$file | rm -f $filedone
done
else
echo "nothing to do here"
fi
done
The code snippet you've provided has a few problems, e.g. an unneeded nested for loop and an erroneous pipeline
(the whole line gunzip $file | grep 'word1\|word2' $filenoext > $filedone | rm -f $filenoext | gzip...).
Note also that your code will work correctly only if the *.gz files have no spaces (or special characters) in their names.
Also, zgrep -c 'word1\|word2' will match strings like line_starts_withword1_orword2_ as well.
Here is a working version of the script:
#!/bin/bash
for file in *.gz; do
counter=$(zgrep -c -E 'word1|word2' $file) # now counter is the number of lines in $file that contain word1 or word2
if [[ $counter -gt 0 ]]; then
name=$(basename $file .gz)
zcat $file | grep -E 'word1|word2' > ${name}_done
gzip -f -c ${name}_done > /donefiles/$file
rm -f ${name}_done
else
echo 'nothing to do here'
fi
done
What we can improve here:
since we are unzipping the file anyway to check for word1|word2, we can write the matches to a temp file and avoid unzipping twice
we don't need to count how many times word1 or word2 occurs in the file; we only need to check whether they are present
${name}_done can be a temp file that is cleaned up automatically
we can use a while loop to handle file names with spaces
#!/bin/bash
tmp=$(mktemp /tmp/gzip_demo.XXXXXX) # create temp file for us
trap 'rm -f "$tmp"' EXIT INT TERM QUIT HUP # clean $tmp upon exit or termination
find . -maxdepth 1 -mindepth 1 -type f -name '*.gz' | while IFS= read -r f; do # IFS= and -r keep leading spaces and backslashes intact
# quotes around $f are now required in case of spaces in it
s=$(basename "$f") # short name w/o dir
gunzip -f -c "$f" | grep -P '\b(word1|word2)\b' > "$tmp"
[ -s "$tmp" ] && gzip -f -c "$tmp" > "/donefiles/$s" # create archive if anything is found
done
It looks like you have an inner loop inside the outer one :
#!/bin/bash
for file in *.gz; do
counter=$(zgrep -c 'word1\|word2' $file)
if [[ $counter -gt 0 ]]; then
echo $counter
for file in *.gz; do #<<< HERE
filenoext=${file::-3}
filedone=${filenoext}_done
echo $file
echo $filenoext
echo $filedone
gunzip $file | grep 'word1\|word2' $filenoext > $filedone | rm -f $filenoext | gzip -f -c $filedone > /donefiles/$file | rm -f $filedone
done
else
echo "nothing to do here"
fi
done
The inner loop goes through all the files in the directory whenever one of them contains word1 or word2. You probably want this:
#!/bin/bash
for file in *.gz; do
counter=$(zgrep -c 'word1\|word2' $file)
if [[ $counter -gt 0 ]]; then
echo $counter
filenoext=${file::-3}
filedone=${filenoext}_done
echo $file
echo $filenoext
echo $filedone
gunzip $file && grep 'word1\|word2' $filenoext > $filedone && rm -f $filenoext && gzip -f -c $filedone > /donefiles/$file && rm -f $filedone
else
echo "nothing to do here"
fi
done
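For comparison, here is a minimal sketch of a single loop that quotes every expansion (so names with spaces are fine) and only tests for a match instead of counting. The function name filter_gz and its two directory arguments are assumptions made for illustration; the original writes to /donefiles.

```bash
#!/bin/bash
# filter_gz SRC_DIR DEST_DIR   (hypothetical helper name)
# For every .gz file in SRC_DIR that contains word1 or word2, write a
# gzipped copy holding only the matching lines to DEST_DIR.
filter_gz() {
    local src=$1 dest=$2 file
    for file in "$src"/*.gz; do
        [ -e "$file" ] || continue    # the glob matched nothing
        # grep -q only reports whether there is a match; no counting needed
        if gzip -dc "$file" | grep -q -E 'word1|word2'; then
            gzip -dc "$file" | grep -E 'word1|word2' |
                gzip -c > "$dest/$(basename "$file")"
        else
            echo "nothing to do here: $file"
        fi
    done
}
```

Calling filter_gz . /donefiles would match the layout described in the question. The steps are separate commands here, so each one finishes before the next starts, unlike the stages of a pipeline.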

How does CMake detect changed files

I have a "C"/C++ CMake project which works fine. However, I'm sometimes (re)building on a remote cluster where the time is slightly different. This machine runs Linux and I'm building using make. I'm wondering if there is some make/CMake way to change how the changes to the files are detected, e.g. to MD5 or diff rather than using timestamps. Otherwise I guess I'd either have to endure the constant make clean / make -j cycle or have to change my local time every time I'm working with that particular server.
I was poking through the CMake documentation to see if there is a flag which would change this behavior but found none. How would this work on platforms which have no RTC (e.g. Raspberry)?
Right, so knowing that CMake / make does not do what I want and I don't want the hassle of synchronizing the time of my machine to the target, I came up with the following:
#!/bin/bash
touch src_hash.md5
echo -n make "$@" > mymake.sh
find `pwd`/../src `pwd`/../include -print0 |
while IFS= read -r -d $'\0' f; do
if [[ ! -d "$f" ]]; then
MD5=`md5sum "$f" | awk -v fn="$f" '{ print "\"" fn "\" " $1; }'`
echo $MD5 >> src_hash.md5.new
OLDMD5=`grep -e "^\"$f\"" src_hash.md5`
if [[ "$OLDMD5" == "" ]]; then
echo "$MD5 -- [a new file]"
continue # a new file, make can handle that well on its own
fi
HASH=`echo $MD5 | awk '{ print $2; }'`
OLDHASH=`echo $OLDMD5 | awk '{ print $2; }'`
if [[ "$HASH" != "$OLDHASH" ]]; then
echo "$MD5 -- changed from $OLDHASH"
echo -n " \"--what-if=${f}\"" >> mymake.sh
# this is running elsewhere, can't pass stuff via variables
fi
fi
done
touch src_hash.md5.new
mv src_hash.md5.new src_hash.md5
echo using: `cat mymake.sh`
echo >> mymake.sh # add a newline
chmod +x mymake.sh
./mymake.sh
rm -f mymake.sh
This keeps a list of source file hashes in src_hash.md5, and each time it runs it compares the current files to those hashes (and updates the list accordingly).
At the end, it calls make, passing any arguments you give to the script (such as -j). It makes use of the --what-if= switch, which tells make to act as if the given file had changed; that way the dependencies of build targets on sources / headers are handled elegantly.
You might also want to pass the paths to the source / include files as arguments so that they are not hardcoded inside.
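The hash comparison itself can be sketched much more briefly by letting md5sum check its own manifest. This is only an illustration under assumptions: the helper names write_manifest and changed_files are made up, GNU md5sum is assumed, and the manifest has to live outside the scanned directory so that it is not hashed itself.

```bash
#!/bin/bash
# write_manifest DIR MANIFEST   (hypothetical helper)
# Record the MD5 of every regular file under DIR.
write_manifest() {
    find "$1" -type f -exec md5sum {} + > "$2"
}

# changed_files MANIFEST        (hypothetical helper)
# Print the files whose current content no longer matches the manifest.
changed_files() {
    # --quiet suppresses the OK lines; mismatches are reported as "path: FAILED"
    LC_ALL=C md5sum --check --quiet "$1" 2>/dev/null | sed -n 's/: FAILED$//p'
}
```

Each path printed by changed_files could then be handed to make as a --what-if= argument, as the script above does. New files are not listed, which matches the script's assumption that make handles those on its own.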
Or one more iteration on the said script, using touch to change and restore the file timestamps for situations when make is extra stubborn about not rebuilding anything:
#!/bin/bash
if [[ ! -d ../src ]]; then
>&2 echo "error: ../src is not a directory or does not exist"
exit -1
fi
if [[ ! -d ../include ]]; then
>&2 echo "error: ../include is not a directory or does not exist"
exit -1
fi
echo "Scanning for changed files in ../src and ../include"
touch src_hash.md5 # in case this runs for the first time
rm -f mymaketouch.sh
rm -f mymakerestore.sh
touch mymaketouch.sh
touch mymakerestore.sh
echo -n make "$@" > mymake.sh
CWD="`pwd`"
find ../src ../include -print0 |
while IFS= read -r -d $'\0' f; do
if [[ ! -d "$f" ]]; then
fl=`readlink -f "$CWD/$f"`
MD5=`md5sum "$fl" | awk -v fn="$fl" '{ print "\"" fn "\" " $1; }'`
HASH=`echo $MD5 | awk '{ print $2; }'`
echo $MD5 >> src_hash.md5.new
OLDMD5=`grep -e "^\"$fl\"" src_hash.md5`
OLDHASH=`echo $OLDMD5 | awk '{ print $2; }'`
if [[ "$OLDMD5" == "" ]]; then
echo "$f $HASH -- [a new file]"
continue # a new file, make can handle that well on its own
fi
if [[ "$HASH" != "$OLDHASH" ]]; then
echo "$f $HASH -- changed from $OLDHASH"
echo "touch -m \"$fl\"" >> mymaketouch.sh # will touch it and change modification time
stat "$fl" -c "touch -m -d \"%y\" \"%n\"" >> mymakerestore.sh # will restore it later on so that we do not run into problems when copying newer from a different system
echo -n " \"--what-if=$fl\"" >> mymake.sh
# this is running elsewhere, can't pass stuff via variables
fi
fi
done
echo using: `cat mymake.sh`
echo >> mymake.sh # add a newline
echo 'exit $?' >> mymake.sh
chmod +x mymaketouch.sh
chmod +x mymakerestore.sh
chmod +x mymake.sh
control_c() # run if user hits control-c
{
echo -en "\nrestoring modification times\n"
./mymakerestore.sh
rm -f mymaketouch.sh
rm -f mymakerestore.sh
rm -f mymake.sh
rm -f src_hash.md5.new
exit -1
}
trap control_c SIGINT
./mymaketouch.sh
./mymake.sh
RETVAL=$?
./mymakerestore.sh
rm -f mymaketouch.sh
rm -f mymakerestore.sh
rm -f mymake.sh
touch src_hash.md5.new # in case there was nothing new
mv src_hash.md5.new src_hash.md5
# do it now in case someone hits ctrl+c mid-build and not all files are built
exit $RETVAL
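The touch-and-restore idea from the script above can also be sketched for a single file, using a reference file instead of a generated restore script. The function name with_touched is hypothetical; touch -r copies the timestamps from one file to another.

```bash
#!/bin/bash
# with_touched FILE COMMAND [ARGS...]   (hypothetical helper)
# Make FILE look freshly modified while COMMAND runs, then put the
# original timestamps back and return COMMAND's exit status.
with_touched() {
    local file=$1 ref status
    shift
    ref=$(mktemp) || return 1
    touch -r "$file" "$ref"    # remember the original timestamps
    touch -m "$file"           # set the modification time to now
    "$@"
    status=$?
    touch -r "$ref" "$file"    # restore the original timestamps
    rm -f "$ref"
    return $status
}
```

For example, with_touched ../src/main.cpp make -j4 would rebuild whatever depends on that file (the path is only an illustration). Unlike the script above, this sketch does not restore the timestamps if it is interrupted with ctrl+c.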
Or even run hashing in parallel in case you are building a large project:
#!/bin/bash
if [[ ! -d ../src ]]; then
>&2 echo "error: ../src is not a directory or does not exist"
exit -1
fi
if [[ ! -d ../include ]]; then
>&2 echo "error: ../include is not a directory or does not exist"
exit -1
fi
echo "Scanning for changed files in ../src and ../include"
touch src_hash.md5 # in case this runs for the first time
rm -f mymaketouch.sh
rm -f mymakerestore.sh
touch mymaketouch.sh
touch mymakerestore.sh
echo -n make "$@" > mymake.sh
CWD="`pwd`"
rm -f src_hash.md5.new # will use ">>", make sure to remove the file
find ../src ../include -print0 |
while IFS= read -r -d $'\0' f; do
if [[ ! -d "$f" ]]; then
fl="$CWD/$f"
(echo `md5sum "$f" | awk -v fn="$fl" '{ print "\"" fn "\" " $1; }'` ) & # parallel, echo is atomic (http://stackoverflow.com/questions/9926616/is-echo-atomic-when-writing-single-lines)
# run in parallel (remove the ampersand if you run into trouble)
fi
done >> src_hash.md5.new # >> is atomic but > wouldn't be
# this is fast
cat src_hash.md5 > src_hash.md5.diff
echo separator >> src_hash.md5.diff
cat src_hash.md5.new >> src_hash.md5.diff
# make a compound file for awk (could also read the other file in awk but this seems simpler right now)
cat src_hash.md5.diff | awk 'BEGIN { FS="\""; had_sep = 0; }
{
if(!had_sep && $1 == "separator")
had_sep = 1;
else {
sub(/[[:space:]]/, "", $3);
if(!had_sep)
old_hashes[$2] = $3;
else {
f = $2;
if((idx = index(f, "../")) != 0)
f = substr(f, idx, length(f) - idx + 1);
if($2 in old_hashes) {
if(old_hashes[$2] != $3)
print "\"" f "\" " $3 " -- changed from " old_hashes[$2];
} else
print "\"" f "\" -- a new file " $3;
}
}
}'
# print verbose for the user only
cat src_hash.md5.diff | awk 'BEGIN { FS="\""; had_sep = 0; }
{
if(!had_sep && $1 == "separator")
had_sep = 1;
else {
sub(/[[:space:]]/, "", $3);
if(!had_sep)
old_hashes[$2] = $3;
else {
if($2 in old_hashes) {
if(old_hashes[$2] != $3)
printf($2 "\0"); /* use \0 as a line separator for the below loop */
}
}
}
}' |
while IFS= read -r -d $'\0' fl; do
echo "touch -m \"$fl\"" >> mymaketouch.sh # will touch it and change modification time
stat "$fl" -c "touch -m -d \"%y\" \"%n\"" >> mymakerestore.sh # will restore it later on so that we do not run into problems when copying newer from a different system
echo -n " \"--what-if=$fl\"" >> mymake.sh
# this is running elsewhere, can't pass stuff via variables
done
# run again, handle files that require change
rm -f src_hash.md5.diff
echo using: `cat mymake.sh`
echo >> mymake.sh # add a newline
echo 'exit $?' >> mymake.sh
chmod +x mymaketouch.sh
chmod +x mymakerestore.sh
chmod +x mymake.sh
control_c() # run if user hits control-c
{
echo -en "\nrestoring modification times\n"
./mymakerestore.sh
rm -f mymaketouch.sh
rm -f mymakerestore.sh
rm -f mymake.sh
rm -f src_hash.md5.new
exit -1
}
trap control_c SIGINT
./mymaketouch.sh
./mymake.sh
RETVAL=$?
./mymakerestore.sh
rm -f mymaketouch.sh
rm -f mymakerestore.sh
rm -f mymake.sh
touch src_hash.md5.new # in case there was nothing new
mv src_hash.md5.new src_hash.md5
# do it now in case someone hits ctrl+c mid-build and not all files are built
exit $RETVAL

tail command throws errors when used in bash script

for file in swap_pricer swap_id_marks swaption_id_marks
do
if [ ! -e $file ] && [ "$context" == "INTRADAY" ]
then
cp -f $working_dir/brl/$file $file
else
tail -n 7 $working_dir/brl/$file >> $file
fi
echo "[`date +'%D %T'`] Removing file ${excel_txt_dir}/$file.txt"
if [ -e ${excel_txt_dir}/${file}.txt ]
then
rm -f ${excel_txt_dir}/${file}.txt
fi
cp -f ${file}.txt ${excel_txt_dir}/${file}.txt
cp -f ${file}.txt ${excel_txt_dir}/${file}_${naming_date}.txt
cp -f ${file}.txt ${file}_${naming_date}.txt
cp -f ${file}.txt ${excel_txt_dir}/${file}_${naming_date}_${price_time}.txt
done
The code above is part of a bash script...which has been copied from csh script.
I am getting an error:
tail: cannot open input
Please help me to resolve the error.
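In your script, tail is interpreting +7 as a file name and complaining that it can't open it.
Newer versions of tail no longer accept the old tail +7 syntax from the csh script. Try tail -n +7 instead of tail +7 if the output should start at line 7; tail -n 7 prints the last 7 lines instead.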
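The difference between the two forms is easy to check on ten numbered lines. This is a small sketch using seq as a stand-in for the real input files:

```bash
#!/bin/bash
# tail -n 7  : the last 7 lines of the input
# tail -n +7 : everything from line 7 to the end
last_seven=$(seq 1 10 | tail -n 7)
from_seventh=$(seq 1 10 | tail -n +7)

echo "tail -n 7  gives:" $last_seven
echo "tail -n +7 gives:" $from_seventh
```

With ten input lines the first form yields lines 4 to 10 and the second yields lines 7 to 10, so the two are not interchangeable.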

Pip error in virtualenv

I created a virtualenv by typing
virtualenv --no-site-packages newgame
I then activated the virtualenv by cd'ing into my newgame folder and typing
source bin/activate
This seems to have worked because I now see (newgame)Benjamins-MacBook:newgame test in terminal.
Now THIS is the part where I'm stuck. I type pip install lpthw.web and I get the following
-bash: /Users/test/Python Projects/newgame/bin/pip: "/Users/test/Python: bad interpreter: No such file or directory
Any idea what I'm doing wrong?
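It's a bug in virtualenv: it cannot handle paths with spaces in them (your Python Projects). The generated bin/pip script starts with a #! line that contains the full interpreter path, and that path is cut at the first space, which is why the error mentions "/Users/test/Python.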
Here's a quick script to remove spaces in your file names/directories:
function serenity-now() {
    local i new
    for i in *; do
        new=${i// /-}    # replace every space with a hyphen
        # rename only files and directories whose name actually changes
        if [ "$i" != "$new" ] && { [ -f "$i" ] || [ -d "$i" ]; }; then
            mv -- "$i" "$new"
        fi
    done
    # delete editor backup files (names ending in ~)
    find . -type f -name "*~" -exec rm {} \;
}
The last line is optional, but does help clean things up.

bash scripting challenge

I need to write a bash script that will iterate through the contents of a directory (including subdirectories) and perform the following replacements:
replace 'foo' in any file names with 'bar'
replace 'foo' in the contents of any files with 'bar'
So far all I've got is
find . -name '*' -exec {} \;
:-)
With RH rename:
find -f \( -exec sed -i s/foo/bar/g {} \; , -name \*foo\* -exec rename foo bar {} \; \)
find "$@" -depth -exec sed -i -e s/foo/bar/g {} \; , -name '*foo*' -print0 |
while IFS= read -r -d '' file; do
base=$(basename "$file")
mv "$file" "$(dirname "$file")/${base//foo/bar}"
done
UPDATED: 1632 EST
Now handles whitespace, but 'while read item' never terminates. Better, but still not right. Will keep working on this.
aj@mmdev0:~/foo_to_bar$ cat script.sh
#!/bin/bash
dirty=true
while ${dirty}
do
find ./ -name "*" |sed -s 's/ /\ /g'|while read item
do
if [[ ${item} == "./script.sh" ]]
then
continue
fi
echo "working on: ${item}"
if [[ ${item} == *foo* ]]
then
rename 's/foo/bar/' "${item}"
dirty=true
break
fi
if [[ ! -d ${item} ]]
then
cat "${item}" |sed -e 's/foo/bar/g' > "${item}".sed; mv "${item}".sed "${item}"
fi
dirty=false
done
done
#!/bin/bash
function RecurseDirs
{
oldIFS=$IFS
IFS=$'\n'
for f in *
do
if [[ -f "${f}" ]]; then
newf=`echo "${f}" | sed -e 's/foo/bar/g'`
sed -e 's/foo/bar/g' < "${f}" > "${newf}"
fi
if [[ -d "${f}" && "${f}" != '.' && "${f}" != '..' && ! -L "${f}" ]]; then
cd "${f}"
RecurseDirs .
cd ..
fi
done
IFS=$oldIFS
}
RecurseDirs .
bash 4.0
#!/bin/bash
shopt -s globstar
path="/path"
cd $path
for file in **
do
if [ -d "$file" ] && [[ "$file" == *foo* ]];then # a quoted regex after =~ is matched literally, so use a glob match
echo mv "$file" "${file//foo/bar}"
elif [ -f "$file" ];then
while read -r line
do
case "$line" in
*foo*) line="${line//foo/bar}";;
esac
echo "$line"
done < "$file" > temp
echo mv temp "$file"
fi
done
Remove the 'echo' in front of the mv commands to commit the changes.
for f in `tree -fi | grep foo`; do sed -i -e 's/foo/bar/g' $f ; done
Yet another find-exec solution:
find . -type f -exec bash -c '
path="{}";
dirName="${path%/*}";
baseName="${path##*/}";
nbaseName="${baseName/foo/bar}";
#nbaseName="${baseName//foo/bar}";
# cf. http://www.bash-hackers.org/wiki/doku.php?id=howto:edit-ed
ed -s "${path}" <<< $'H\ng/foo/s/foo/bar/g\nwq';
#sed -i "" -e 's/foo/bar/g' "${path}"; # alternative for large files
exec mv -iv "{}" "${dirName}/${nbaseName}"
' \;
correction to find-exec approach by gregb (adding quotes):
# compare
bash -c '
echo $'a\nb\nc'
'
bash -c '
echo $'"'a\nb\nc'"'
'
# therefore we need
find . -type f -exec bash -c '
...
ed -s "${path}" <<< $'"'H\ng/foo/s/foo/bar/g\nwq'"';
...
' \;
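Pulling the ideas from the answers above together, here is a minimal sketch that first edits file contents and then renames depth-first, so that a directory is renamed only after everything inside it. The function name foo_to_bar is hypothetical, and GNU find and GNU sed (for sed -i without a suffix argument) are assumed.

```bash
#!/bin/bash
# foo_to_bar ROOT   (hypothetical helper)
# Replace foo with bar in file contents and in file/directory names below ROOT.
foo_to_bar() {
    local root=$1 p dir base
    local paths=()

    # 1. Contents: edit every regular file in place.
    find "$root" -type f -exec sed -i 's/foo/bar/g' {} +

    # 2. Names: collect the matches first, deepest entries first (-depth),
    #    so that renaming a directory never invalidates a path still to come.
    while IFS= read -r -d '' p; do
        paths+=("$p")
    done < <(find "$root" -mindepth 1 -depth -name '*foo*' -print0)

    for p in "${paths[@]}"; do
        dir=${p%/*}      # parent directory
        base=${p##*/}    # last path component, the only part that is renamed
        mv -- "$p" "$dir/${base//foo/bar}"
    done
}
```

ROOT itself is left alone because of -mindepth 1. Note that sed is applied to every regular file, including binary ones, so this is best tried on a copy first.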

Resources