How can I generate multiple output files in for loop - bash

I am new to bash so this is a very basic question:
I am trying to use the for loop written below to run a series of commands on several files in a directory, ending in a new output file (f:r.bw) for every single file.
Basically, I have files like chr1.gz, chr2.gz and so on that should end up as chr1.bw, chr2.bw ...
The way it is now, it seems to constantly overwrite the same output file and I cannot figure out what the correct syntax is.
$ for file in *.gz
do
zcat < $file | grep -v ^track > f:r.temp
wigToBigWig -clip -fixedSummaries -keepAllChromosomes f:r.temp hg19.chrom.sizes f:r.bw
rm f:r.temp
done
Thanks for the help.

Instead of using a fixed filename f:r.temp, base your destination name on $file:
for file in *.gz; do
  zcat <"$file" | grep -v '^track' >"${file%.gz}.temp"
  wigToBigWig -clip -fixedSummaries -keepAllChromosomes \
    "${file%.gz}.temp" hg19.chrom.sizes "${file%.gz}.bw"
  rm -f "${file%.gz}.temp"
done
${file%.gz} is a parameter expansion operation, which trims .gz off the end of the name; ${file%.gz}.bw, thus, trims the .gz and adds a .bw.
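A quick way to see what the expansion produces (chr1.gz here is just an illustrative name):
file=chr1.gz
echo "${file%.gz}.temp"   # prints chr1.temp
echo "${file%.gz}.bw"     # prints chr1.bw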
Even better, if wigToBigWig doesn't need a real (seekable) input file, you can give it a pipeline into the zcat | grep process directly and not need any temporary file:
for file in *.gz; do
  wigToBigWig -clip -fixedSummaries -keepAllChromosomes \
    <(zcat <"$file" | grep -v '^track') \
    hg19.chrom.sizes \
    "${file%.gz}.bw"
done

Related

Can gzip in a BASH script give an exit status before a file is completely unzipped? How to prevent this?

I am trying to write a script to collect a few statistics on .fastq files sequentially. In my script, I unzip and re-zip each file in a loop (gzip -d / gzip), applying commands of the form command "${file_name//.fastq.gz/.fastq}" to collect the stats on each file while it is unzipped.
When I do this, my script sometimes fails to collect statistics for certain files (e.g. giving me line counts of zero). However, it does this seemingly at random: running the script more than once, it sometimes collects stats for a given file and sometimes doesn't.
I believe this is because gzip is returning an exit status before the file is fully unzipped, meaning that my script continues and sometimes collects stats from half-unzipped files. In support of this, I have seen different file sizes reported for the same file (but not for stats like header line counts, which seem to be either zero or the values I would expect).
Is there a good way to tell bash to wait until the file is completely unzipped? I have tried using variants of until [ -f "${file_name//.fastq.gz/.fastq}" ] and until [ command "${file_name//.fastq.gz/.fastq}" != 0 ], but I still don't always get the right results this way (I have checked by unzipping the files, applying each command manually, and then comparing the values to the script's output each time).
I've posted the script below and linked a picture showing the output from two different runs of the script on four files to highlight the issue. For the record, I have been running this script on a Sun Grid Engine and it has returned no error messages to accompany this.
#!/bin/bash
#$ -cwd
#$ -l h_rt=01:00:00
#$ -l h_vmem=1G
#$ -o fastq_qc_stats_job_out_file.txt
#$ -e fastq_qc_stats_job_error_file.txt
#First, make a file to store the output
if echo "${PWD}/" | grep -iq "/[a-Z]*_[0-9]*/" # if in numbered batches (i.e. the data to be analysed is split across multiple numbered files)
then batch=`echo "${PWD}/" | grep -o "/[a-Z]*_[0-9]*/" | cut -d '_' -f 2 | cut -d '/' -f 1` # get the batch number to name the output file,
else batch=`basename $PWD`; fi # otherwise just use the final part of the directory name.
header_line=`echo {'FILE','RUN','SIZE','READ_COUNT','MEAN_READ_LENGTH'} | sed 's/ /,/g'` # make a header line
echo $header_line > "QC_FASTQ_stats_${batch}.csv" # make a .csv file with the header line (by batch)
#Now loop through the FASTQ files and add the following information for each of them
for file in `ls *.fastq.gz`
do gzip -d $file # unzip the file
f="${file//.fastq.gz/.fastq}"
accession=`echo ${f} | cut -d '.' -f 1 | cut -d '_' -f 1`
filesize=`du -h ${f} | awk '{print $1}'`
readcount=`grep -E '^@[EDS]RR' ${f} | grep -E ' length=' | wc -l`
averagelength=`grep ' length=' ${f} | awk '{print $NF}' | cut -d '=' -f 2 | awk '{ total += $1 } END { print total/NR }'` # calculates mean
filestats=`echo $file $accession $filesize $readcount $averagelength | sed 's/ /,/g'`
echo $filestats >> "QC_FASTQ_stats_${batch}.csv" # add stats for each .fastq file to the .csv file
gzip ${f} # re-zip the file
done
An example of the output variation when run twice for the same files - see 4th file
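For what it's worth, the loop body above can also be written to stream each compressed file with zcat, so no .fastq is ever written to disk or re-zipped; a rough, untested sketch that simply reuses the same commands (note the reported file size then refers to the compressed file):
for file in *.fastq.gz
do
accession=`echo ${file} | cut -d '.' -f 1 | cut -d '_' -f 1`
filesize=`du -h ${file} | awk '{print $1}'` # size of the .gz, not the unzipped file
readcount=`zcat ${file} | grep -E '^@[EDS]RR' | grep -c ' length='`
averagelength=`zcat ${file} | grep ' length=' | awk '{print $NF}' | cut -d '=' -f 2 | awk '{ total += $1 } END { print total/NR }'`
filestats=`echo $file $accession $filesize $readcount $averagelength | sed 's/ /,/g'`
echo $filestats >> "QC_FASTQ_stats_${batch}.csv"
done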

Add a prefix to logs with AWK

I am facing a problem with a script I need to use for log analysis; let me explain the question:
I have a gzipped file like:
5555_prova.log.gz
Inside the file there are many log lines like this one:
2018-06-12 03:34:31 95.245.15.135 GET /hls.playready.vod.mediasetpremium/farmunica/2018/06/218742_163f10da04c7d2/hlsrc/w12/21.ts
I need a script that reads the gzipped log file and outputs to stdout a modified log line like this one:
5555 2018-06-12 03:34:31 95.245.15.135 GET /hls.playready.vod.mediasetpremium/farmunica/2018/06/218742_163f10da04c7d2/hlsrc/w12/21.ts
As you can see, the log line now starts with the number taken from the gzip file name.
I need this new line to feed a logstash data crunching chain.
I have tried with a script like this:
echo "./5555_prova.log.gz" | xargs -ISTR -t -r sh -c "gunzip -c STR | awk '{$0="5555 "$0}' "
This is not exactly what I need (the prefix is static and not captured from the file name with a regular expression), but even with this simplified version I receive an error:
sh -c gunzip -c ./5555_prova.log.gz | awk '{-bash=5555 -bash}'
-bash}' : -c: line 0: unexpected EOF while looking for matching `''
-bash}' : -c: line 1: syntax error: unexpected end of file
As you can see from the output above, $0 is no longer the whole line passed to awk via the pipe, but a strange -bash.
I need to use xargs because the list of gzipped files is fed to the command line by another tool (an inotifywait instance listening on a directory where the files are written via FTP).
What am I missing? Do you have any suggestions to point me in the right direction?
Regards,
S.
Trying to follow @Charles Duffy's suggestion, I have written this code:
#/bin/bash
#
# Usage: sendToLogstash.sh [pattern]
#
# Executes a command whenever files matching the pattern are closed in write
# mode or moved to. "{}" in the command is replaced with the matching filename (via xargs).
# Requires inotifywait from inotify-tools.
#
# For example,
#
# whenever.sh '/usr/local/myfiles/'
#
#
DIR="$1"
PATTERN="\.gz$"
script=$(cat <<'EOF'
awk -v filename="$file" 'BEGIN{split(filename,array,"_")}{$0=array[1] OFS $0} 1' < $(gunzip -dc "$DIR/$file")
EOF
)
inotifywait -q --format '%f' -m -r -e close_write -e moved_to "$DIR" \
| grep --line-buffered $PATTERN | xargs -I{} -r sh -c "file={}; $script"
But I got the error:
[root@ms-felogstash ~]# ./test.sh ./poppo
gzip: /1111_test.log.gz: No such file or directory
gzip: /1111_test.log.gz: No such file or directory
sh: $(gunzip -dc "$DIR/$file"): ambiguous redirect
Thanks for your help, I feel very lost writing bash scripts.
Regards,
S.
EDIT: In case you are dealing with multiple .gz files and want to print their content along with their file names (first _-delimited field), the following may help you.
for file in *.gz; do
awk -v filename="$file" 'BEGIN{split(filename,array,"_")}{$0=array[1] OFS $0} 1' <(gzip -dc "$file")
done
I haven't tested your code (and couldn't completely follow it), so here is another approach: if your code can pass the file name to awk, it is pretty simple to prepend the file's leading digits, as follows (just an example).
awk 'FNR==1{split(FILENAME,array,"_")} {$0=array[1] OFS $0} 1' 5555_prova.log_file
Here I am using awk's built-in FILENAME variable (only on the first line of the file), splitting it into an array named array, and then prepending its first element to each line of the file.
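To make the split concrete, array[1] is everything before the first _ in the name, e.g.:
awk 'BEGIN{split("5555_prova.log.gz",array,"_"); print array[1]}'   # prints 5555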
Also wrap "gunzip -c STR this with ending " which seems to be missing before you pass its output to awk too.
NEVER, EVER use xargs -I with a string substituted into sh -c (or bash -c or any other context where that string is interpreted as code). This allows malicious filenames to run arbitrary commands -- think about what happens if someone runs touch $'$(rm -rf ~)\'$(rm -rf ~)\'.gz', and gets that file into your log.
Instead, let xargs append arguments after your script text, and write your script to iterate over / read those arguments as data, rather than having them substituted into code.
To show how to use xargs safely (well, safely if we assume that you've filtered out filenames with literal newlines):
# This way you don't need to escape the quotes in your script by hand
script=$(cat <<'EOF'
for arg; do gunzip -c <"$arg" | awk '{$0="5555 "$0; print}'; done
EOF
)
# if you **did** want to escape them by hand, it would look like this:
# script='for arg; do gunzip -c <"$arg" | awk '"'"'{$0="5555 "$0; print}'"'"'; done'
echo "./5555_prova.log.gz" | xargs -d $'\n' sh -c "$script" _
To be safer with all possible filenames, you'd instead use:
printf '%s\0' "./5555_prova.log.gz" | xargs -0 sh -c "$script" _
Note the use of NUL-delimited input (created with printf '%s\0') and xargs -0 to consume it.
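Putting the pieces together with a prefix taken from each file name rather than a hard-coded 5555 (a sketch only; the %w%f format and -n 1 are assumptions about GNU inotifywait/xargs, and none of this has been run against your directory layout):
#!/bin/bash
dir="$1"
script=$(cat <<'EOF'
for arg; do
  base=${arg##*/}      # file name without the directory part
  prefix=${base%%_*}   # everything before the first underscore, e.g. 5555
  gunzip -c <"$arg" | awk -v p="$prefix" '{print p, $0}'
done
EOF
)
inotifywait -q -m -r -e close_write -e moved_to --format '%w%f' "$dir" \
  | grep --line-buffered '\.gz$' \
  | xargs -d $'\n' -n 1 -r sh -c "$script" _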

Bash - change the filename by changing the filename variable

I want to save the results of a multiple grep in a .txt format. I do
for i in GO_*.txt; do
grep -o "GO:\w*" ${i} | grep -f - ../PFAM2GO.txt > ${i}_PFAM+GO.txt
done
The thing is that, obviously, the final filename also includes the original file extension, giving GO_*.txt_PFAM+GO.txt.
Now, I'd like to have only GO_*_PFAM+GO.txt. Is there a way to modify ${i} so as to drop the .txt, without having to do a rename or mv afterwards?
Note: the * part has variable length.
You can use parameter expansion to remove the extension from the filename:
for i in GO_*.txt; do
  name="${i%.txt}"
  grep -o "GO:\w*" "${i}" | grep -f - ../PFAM2GO.txt > "${name}_PFAM+GO.txt"
done
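As a quick check of what the expansion produces (GO_0005634.txt is just a made-up name here):
i=GO_0005634.txt
name="${i%.txt}"
echo "${name}_PFAM+GO.txt"   # prints GO_0005634_PFAM+GO.txt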

Faster grep in many files from several strings in a file

I have the following working script that greps, in a directory of many files, for specific strings previously saved into a file.
I grep all the files by their extension, as their names are random, and note that every string from my previously saved file should be searched for in all the files.
Also, I cut grep's output, as it returns 2 or 3 lines of the matched file and I only want the specific part that shows the filename.
I might be doing something redundant; how could it be faster?
#!/bin/bash
#working but slow
cd /var/FILES_DIRECTORY
while read line
do
LC_ALL=C fgrep "$line" *.cps | cut -c1-27 >> /var/tmp/test_OUT.txt
done < "/var/tmp/test_STRINGS.txt"
grep -F -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27
Isn't this what you're looking for?
This should speed up your script:
#!/bin/bash
#working fast
cd /var/FILES_DIRECTORY
export LC_ALL=C
grep -F -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27 > /var/tmp/test_OUT.txt
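If the cut -c1-27 is only there to recover the name of the matching file, grep's -l option may simplify this further; note that it prints each matching file once, rather than once per matching line:
grep -lF -f /var/tmp/test_STRINGS.txt *.cps > /var/tmp/test_OUT.txt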

Trying to write a script to clean <script.aa=([].slice+'hjkbghkj') from multiple htm files, recursively

I am trying to modify a bash script to remove a glob of malicious code from a large number of files.
The community will benefit from this, so here it is:
#!/bin/bash
grep -r -l 'var createDocumentFragm' /home/user/Desktop/infected_site/* > /home/user/Desktop/filelist.txt
for i in $(cat /home/user/Desktop/filelist.txt)
do
cp -f $i $i.bak
done
for i in $(cat /home/user/Desktop/filelist.txt)
do
$i | sed 's/createDocumentFragm.*//g' > $i.awk
awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
This is where the script bombs out with this message:
+ for i in '$(cat /home/user/Desktop/filelist.txt)'
+ sed 's/createDocumentFragm.*//g'
+ /home/user/Desktop/infected_site/index.htm
I get 2 errors and the script stops.
/home/user/Desktop/infected_site/index.htm: line 1: syntax error near unexpected token `<'
/home/user/Desktop/infected_site/index.htm: line 1: `<html><head><script>(function (){ '
I have the first 2 parts done.
The files containing createDocumentfragm have been enumerated in a text file correctly.
The files listed in filelist.txt have been duplicated in their original location with a .bak added to them, i.e. infected_site/some_directory/infected_file.htm and infected_file.htm.bak, effectively making sure we have a backup.
All I need to do now is write an AWK command that will use the list of files in filelist.txt, use the entire glob of malicious text as a pattern, and remove it from the files, using just the uppercase SCRIPT tag as the starting point; the lowercase script tag is too generic and could delete legitimate text.
I suspect this may help me, but I don't know how to use it correctly.
http://backreference.org/2010/03/13/safely-escape-variables-in-awk/
Once I have this part figured out, and after you have verified that the files weren't mangled, you can do this to clean out the .bak files:
for i in $(cat /home/user/Desktop/filelist.txt)
do
rm -f $i.bak
done
Several things:
You have:
$i | sed 's/var createDocumentFragm.*//g' > $i.awk
You probably meant this (keeping your use of cat, which we'll talk about in a moment):
cat $i | sed 's/var createDocumentFragm.*//g' > $i.awk
You're treating each file in your file list as if it was a command and not a file.
Now, about your use of cat. If you're using cat for almost anything but concatenating multiple files together, you probably are doing something not quite right. For example, you could have done this:
sed 's/var createDocumentFragm.*//g' "$i" > $i.awk
I'm also a bit confused about the awk statement. Exactly what file are you using awk on? Your awk statement is using STDIN and STDOUT, so it's reading file names from the for loop and then printing the output on the screen. Is the sed statement supposed to feed into the awk statement?
Note that I don't have to print out my file to STDOUT, then pipe that into sed. The sed command can take the file name directly.
You also want to avoid for loops over a list of files. That is very inefficient, and can cause problems with the command line getting overloaded. Not a big issue today, but can affect you when you least suspect it. What happens is that your $(cat /home/user/Desktop/filelist.txt) must execute first before the for loop can even start.
A little rewriting of your program:
cd ~/Desktop
grep -r -l 'var createDocumentFragm' infected_site/* > filelist.txt
while read file
do
    cp -f "$file" "$file.bak"
    sed 's/var createDocumentFragm.*//g' "$file" > "$file.awk"
    awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
done < filelist.txt
We can use one loop, and we made it a while loop. I could even feed the grep into that while loop:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
    cp -f "$file" "$file.bak"
    sed 's/var createDocumentFragm.*//g' "$file" > "$file.awk"
    awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p'
done
and then I don't even have to create a temporary file.
Let me know what's going on with the awk. I suspect you wanted something like this:
grep -r -l 'var createDocumentFragm' infected_site/* | while read file
do
    cp -f "$file" "$file.bak"
    sed 's/var createDocumentFragm.*//g' "$file" \
        | awk '/<\/SCRIPT>/{p=1;print}/<\/script>/{p=0}!p' > "$file.awk"
done
Also note that I put quotes around the file names. This helps prevent problems if a file name has a space in it.
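Once you've spot-checked a few of the .awk outputs, a possible last step is to promote them over the originals (a sketch only; it assumes every *.awk file under infected_site is a cleaned copy you want to keep, with the .bak copies left as the safety net):
find infected_site -type f -name '*.awk' | while read -r cleaned
do
    mv -f "$cleaned" "${cleaned%.awk}"
done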

Resources