How to iterate over files in many folders - bash

I have 15 folders and each folder contains a *.gz file. I would like to use that file with one of the packages to do some filtering.
For this I would like to write something that can open each folder, read that specific file, perform the actions mentioned, and then save the results in the same folder with a different extension.
What I did is (PBS script):
#!/bin/bash
#PBS -N Trimmomatics_filtering
#PBS -l nodes=1:ppn=8
#PBS -l walltime=04:00:00
#PBS -l vmem=23gb
#PBS -q ext_chem_guest
# Go to the Trimmomatics directory
cd /home/tb44227/bioinfo_packages/Trimmomatic/Trimmomatic-0.36
# Java module load
module load java/1.8.0-162
# Input File (I have a list of 15 folders and each contained fastq.gz file)
inputFile= for f in /home/tb44227/nobackup/small_RNAseq_260917/support.igatech.it/sequences-export/536-RNA-seq_Disco_TuDO/delivery_25092017/754_{1..15}/*fastq.gz; $f
# Start the code to filter the file and save the results in the same folder where the input file is
java -jar trimmomatic-0.36.jar SE -threads ${PBS_NUM_PPN} -phred33 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:17 $inputFile $outputFile
# Output File
outputFile=$inputFile{.TRIMMIMG}
My question is: how could I define $inputFile and $outputFile so that the script reads all 15 files?
Thanks

If your application only processes a single input file at a time, you have two options:
Process all files in one single job
Process each file in a different job
From the user's perspective the second option is usually more interesting, as multiple jobs may run simultaneously if resources are available. However, this depends on the number of files you need to process and on your system's usage policy, as sending too many jobs in a short amount of time can cause problems in the job scheduler.
The first option is, more or less, what you already have. You can use the find program and a simple bash loop: store the find output in a variable, then iterate over it, as in this example:
#!/bin/bash
# PBS job parameters
module load java
root_dir=/home/tb44227/nobackup/small_RNAseq_260917/support.igatech.it/sequences-export/536-RNA-seq_Disco_TuDO/delivery_25092017
# Get all files to be processed
files=$(find "$root_dir" -type f -name "*fastq.gz")
for inputfile in $files; do
    outputfile="${inputfile}.TRIMMING"
    # Process one file at a time
    java -jar ... "$inputfile" "$outputfile"
done
Then, you just submit your job script, which will generate a single job.
$ qsub myjobscript.sh
The second option is more powerful, but requires you to change the job script for each file. Most job managers let you pass the job script on standard input. This is really helpful because it avoids generating intermediate files, which pollute your directories.
#!/bin/bash
function submit_job() {
# Submit job. Jobscript passed through standard input using a HEREDOC.
# Must define $inputfile and $outputfile before calling the function.
qsub - <<- EOF
# PBS job parameters
module load java
# Process a single file only
java -jar ... $inputfile $outputfile
EOF
}
root_dir=/home/tb44227/nobackup/small_RNAseq_260917/support.igatech.it/sequences-export/536-RNA-seq_Disco_TuDO/delivery_25092017
# Get all files to be processed
files=$(find "$root_dir" -type f -name "*fastq.gz")
for inputfile in $files; do
    outputfile="${inputfile}.TRIMMING"
    submit_job
done
Since you are calling qsub inside the script, you just need to call the script itself, like any regular shell script file.
$ bash multijobscript.sh
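If the goal is specifically to keep each output next to its input but with a different extension, bash parameter expansion can derive the output name from the input name. A minimal sketch, assuming the inputs end in .fastq.gz and that a .trimmed.fastq.gz suffix is acceptable (both suffixes are only examples):
inputfile=/some/folder/754_1/sample.fastq.gz
# Strip the .fastq.gz suffix and append a new one; the result stays in the same folder
outputfile="${inputfile%.fastq.gz}.trimmed.fastq.gz"
echo "$outputfile"    # /some/folder/754_1/sample.trimmed.fastq.gz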

Related

Creating variables in running script

I'm trying to convert some files to read-only in a backup environment. Data Domain has a retention-lock feature that can lock files with an external trigger, which is touch -a -t "dateuntillocked" /backup/foo.
In this situation there are also metadata files in the folder that should not be locked, otherwise the next backup job cannot update the metadata file and fails.
I extracted the metadata file names, but the file count can change. For example:
foo1.meta foo2.meta ... fooN.meta
Is it possible to create a variable for each entry and add it to the command dynamically?
Like:
var1=/backup/foo234.meta
var2=/backup/foo322.meta
.
.
varN=/backup/fooNNN.meta
<find command> | grep -v $var1 $var2....varN | while read line; do touch -a -t "$dateuntillocked" "$line"; done
Another elaboration of the case:
For example, you run ls in a folder, but the number of files can differ over time. The script should create a variable for every file and use it in a touch command within a while loop. If there are 3 files in the folder, the script creates 3 variables and uses those 3 variables with touch in the while loop; if ls finds 4 files, the script dynamically creates 4 variables and uses them all in the while loop, and so on. I am not a programmer, so my logic may differ. There may also be an easier way to do this.
Just guessing what your intentions might be.
You can combine find | grep | command into a single command:
find /backup -name 'foo*.meta' -exec touch -a -t "$dateuntillocked" {} +
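The command above touches the .meta files themselves. If the intent is instead to lock everything except the metadata files, find can do the exclusion directly, with no need to build one variable per file; a minimal sketch, assuming everything to be locked lives under /backup:
find /backup -type f -not -name '*.meta' -exec touch -a -t "$dateuntillocked" {} +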

Bash script to check if a new file has been created on a directory after run a command

Using a bash script, I'm trying to detect whether or not a file has been created in a directory while running commands. Let me illustrate the problem:
#!/bin/bash
# give base directory to watch file changes
WATCH_DIR=./tmp
# get list of files on that directory
FILES_BEFORE= ls $WATCH_DIR
# actually a command is running here but lets assume I've created a new file there.
echo >$WATCH_DIR/filename
# and I'm getting new list of files.
FILES_AFTER= ls $WATCH_DIR
# detect changes and if any changes has been occurred exit the program.
After that I tried to compare FILES_BEFORE and FILES_AFTER, but couldn't accomplish that. I've tried:
comm -23 <($FILES_AFTER |sort) <($FILES_BEFORE|sort)
diff $FILES_AFTER $FILES_BEFORE > /dev/null 2>&1
cat $FILES_AFTER $FILES_BEFORE | sort | uniq -u
None of them gave me a result that tells me whether there is a change or not. What I need is to detect the change and exit the program if there is any. I am not really good at bash scripting and searched a lot on the internet, but couldn't find what I need. Any help will be appreciated. Thanks.
Thanks to the informative comments, I realized that I had missed the basics of bash scripting, but I finally made it work. I'll leave my solution here as an answer for those who struggle like me:
WATCH_DIR=./tmp
FILES_BEFORE=$(ls $WATCH_DIR)
echo >$WATCH_DIR/filename
FILES_AFTER=$(ls $WATCH_DIR)
if diff <(echo "$FILES_AFTER") <(echo "$FILES_BEFORE")
then
    echo "No changes"
else
    echo "Changes"
fi
It outputs "Changes" on the first run and "No changes" on later runs, unless you delete the newly added files.
I'm trying to interpret your script (which contains some errors) into an understanding of your requirements.
I think the simplest way is simply to redirect the ls command output to named files, then diff those files:
#!/bin/bash
# give base directory to watch file changes
WATCH_DIR=./tmp
# get list of files on that directory
ls $WATCH_DIR > /tmp/watch_dir.before
# actually a command is running here but lets assume I've created a new file there.
echo >$WATCH_DIR/filename
# and I'm getting new list of files.
ls $WATCH_DIR > /tmp/watch_dir.after
# detect changes and if any changes has been occurred exit the program.
diff -c /tmp/watch_dir.after /tmp/watch_dir.before
If any files are modified by the 'commands', i.e. they exist in the 'before' list but their contents change, the above will not show that as a difference.
In this case you might be better off creating a 'marker' file to mark the instant monitoring started, then using the find command to list any files newer than the marker file. Something like this:
#!/bin/bash
# give base directory to watch file changes
WATCH_DIR=./tmp
# create a marker file; its timestamp records when monitoring started
ls $WATCH_DIR > /tmp/watch_dir.before
# actually a command is running here but lets assume I've created a new file there.
echo >$WATCH_DIR/filename
# list any files created or modified after the marker file
find $WATCH_DIR -type f -newer /tmp/watch_dir.before -exec ls -l {} \;
What this won't do is show any files that were deleted, so perhaps a hybrid list could be used.
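A minimal sketch of such a hybrid, reusing the marker/listing file from above: find reports files that are new or modified, while comm reports names that have disappeared from the listing (deleted files).
# New or modified files since the marker file was written
find "$WATCH_DIR" -type f -newer /tmp/watch_dir.before
# Names present before but missing now, i.e. deleted files
ls "$WATCH_DIR" > /tmp/watch_dir.after
comm -23 <(sort /tmp/watch_dir.before) <(sort /tmp/watch_dir.after)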
Here is how I got it to work. It is also set up so that you can have multiple watched directories with the same script via cron.
For example, if you wanted one to run every minute:
* * * * * /usr/local/bin/watchdir.sh /makepdf
and one every hour.
0 * * * * /user/local/bin/watchdir.sh /incoming
#!/bin/bash
WATCHDIR="$1"
NEWFILESNAME=.newfiles$(basename "$WATCHDIR")
if [ ! -f "$WATCHDIR"/.oldfiles ]
then
    ls -A "$WATCHDIR" > "$WATCHDIR"/.oldfiles
fi
ls -A "$WATCHDIR" > "$NEWFILESNAME"
DIRDIFF=$(diff "$WATCHDIR"/.oldfiles "$NEWFILESNAME" | cut -f 2 -d " ")
for file in $DIRDIFF
do
    if [ -e "$WATCHDIR/$file" ]; then
        # do what you want to the file(s) here
        echo "$file"
    fi
done
rm "$NEWFILESNAME"

how to write a bash script that creates new scripts iteratively

How would I write a script that loops through all of my subjects and creates a new script per subject? The goal is to create a script that runs a program called FreeSurfer per subject on a supercomputer. The supercomputer queue restricts how long each script/job will take, so I will have each job run 1 subject. Ultimately I would like to automate the job submitting process since I cannot submit all the jobs at the same time. In my subjects folder I have three subjects: 3123, 3315, and 3412.
I am familiar with MATLAB scripting, so I was envisioning something like this
for i=1:length(subjects)
nano subjects(i).sh
<contents of FreeSurfer script>
input: /subjects(i)/scan_name.nii
output: /output/subjects(i)/<FreeSurfer output folders>
end
I know I mixed aspects of MATLAB and linux but hopefully it's relatively clear what the goal is. Please let me know if there is a better method.
Here is an example of the FreeSurfer script for a given subject
#!/bin/bash
#PBS -l walltime=25:00:00
#PBS -q long
export FREESURFER_HOME=/gpfs/software/freesurfer/6.0.0/freesurfer
source $FREESURFER_HOME/SetUpFreeSurfer.sh
export SUBJECTS_DIR=/gpfs/projects/Group/ppmi/freesurfer/subjects/
recon-all -i /gpfs/projects/Group/ppmi/all_anat/3105/Baseline/*.nii -s $SUBJECTS_DIR/freesurfer/subjects/3105 -autorecon-all
The -i option gives the input and the -s option gives the output.
Change your script to accept the subject as an argument, so that you have only one generic script:
#!/bin/bash
#PBS -l walltime=25:00:00
#PBS -q long
subject="$1"
export FREESURFER_HOME=/gpfs/software/freesurfer/6.0.0/freesurfer
source $FREESURFER_HOME/SetUpFreeSurfer.sh
export SUBJECTS_DIR=/gpfs/projects/Group/ppmi/freesurfer/subjects/
recon-all -i /gpfs/projects/Group/ppmi/all_anat/"$subject"/Baseline/*.nii -s $SUBJECTS_DIR/freesurfer/subjects/"$subject" -autorecon-all
Then you can call it for all your subjects:
for s in 3123 3315 3412
do
    ./yourscriptnamehere.sh "$s"
done
Add error handling as desired.
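Since the question mentions a queue with per-job walltime limits, each subject can also be submitted as its own job rather than run directly. A minimal sketch using qsub's standard -v option to pass the subject ID through the environment (the variable name SUBJECT and the script name are only examples; the job script would then set subject="$SUBJECT" instead of subject="$1"):
for s in 3123 3315 3412
do
    # Submit one independent job per subject
    qsub -v SUBJECT="$s" yourscriptnamehere.sh
done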

How to monitor multiple files through a shell script

I want to monitor Apache and Tomcat logs through a shell script.
I can monitor a single file through a script. But how do I monitor multiple files?
I have written a sample script for a single file:
#!/bin/bash
file=/root/logs_flow/apache_access_log
current=`date +%s`
last_modified=`stat -c "%Y" $file`
if [ $(($current-$last_modified)) -gt 180 ]; then
mail -s "$file is not updating proper" ramacn11#xx.xx.xxx
else
mail -s "$file is updating proper" ramacn11#xx.xxx.xxx
fi
I want to monitor apache_error_log and the Tomcat logs with the same script.
An easy solution, starting from what you already have, would be to call your script with the file to monitor as an argument:
script.sh /root/logs_flow/apache_access_log
Then inside it you put:
file=$1
Now you can put a bunch of these in cron:
* * * * * script.sh /root/logs_flow/apache_access_log
* * * * * script.sh /some/other/file.log
You might want to expand your script a bit to check whether an argument was passed and whether it is a valid filename.
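A minimal sketch of that check, placed at the top of the script (the usage and error messages are only examples):
if [ $# -lt 1 ]; then
    echo "Usage: $0 /path/to/logfile" >&2
    exit 1
fi
if [ ! -f "$1" ]; then
    echo "$1 is not a regular file" >&2
    exit 1
fi
file=$1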
You can list files that have or haven't been updated in a period of time using the find command, which will be more portable than processing the output of stat, which varies by operating system.
The following will output the names of the specified logs whose modification time is more than 3 minutes ago (the -3m unit suffix is BSD find syntax; with GNU find, use -not -mmin -3 instead):
find httpd.log tomcat.log -not -mtime -3m
Or, for easier management of the file list, you could use a bash array:
#!/usr/bin/env bash
files=(
/root/logs_flow/apache_access_log
/var/log/tomcat.log
/var/log/www/apache-*.log # This is an expanding glob.
)
find "${files[#]}" -not -mtime -3m
Files in the array will be listed if they are more than 3 minutes old.
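To tie this back to the original mail-based script, the find output can drive the notification. A minimal sketch (the recipient address is a placeholder):
stale=$(find "${files[@]}" -not -mtime -3m)
if [ -n "$stale" ]; then
    # Mail the list of logs that have not been updated in the last 3 minutes
    mail -s "Logs not updating" you@example.com <<< "$stale"
fi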
To read from multiple log files at once, one could do:
tail -f /home/user/log_A -f /home/user/log_B | egrep -v "^$|="
Note: The egrep -v "^$|=" part is to remove header lines and empty lines from the output of the tail command. You can remove that if you want to keep the headers.

Bash script for looping through a program for several different input files

I would like a bash script that runs a program (taking several variables and an input file) in the background, but I want the script to process only one input file at a time (all input files are in one directory), e.g.
#!/bin/sh
for file in *
do
~/FrameDP/bin/FrameDP.pl --cfg ~/FrameDP/cfg/FrameDP.cfg --no_train --infile /home/bop08olp/FrameDP/data/"$file" --outdir ~/FrameDP/test
done
I'm guessing the above script is not going to wait between each processing of separate input files, but will just start them all at once. The program generates lots of child processes. Any pointers appreciated. Thanks!
sh loops are not parallelized; you can verify this by running:
for i in *
do
    sleep 1
done
And you'll see that if you have more than one file, the total time is greater than 1 second: it equals 'number of files' seconds.
So if running the code snippet you gave actually started all the processes at the same time, it is because your Perl script has a feature to run in the background; you should review your Perl script, not your shell script.
Maybe I am thinking too simply and I am too late, but what about
find . -depth 1 -exec ~/FrameDP/bin/FrameDP.pl --cfg ~/FrameDP/cfg/FrameDP.cfg --no_train --infile /home/bop08olp/FrameDP/data/{} --outdir ~/FrameDP/test \; &
where {} takes the place of the former variable $file and "\;" is the terminator find requires to mark the end of the -exec clause?
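Note that the numeric -depth 1 form is BSD find syntax; GNU find treats -depth as a flag without an argument. A roughly equivalent GNU invocation (same paths as above) would be:
find . -mindepth 1 -maxdepth 1 -type f -exec ~/FrameDP/bin/FrameDP.pl --cfg ~/FrameDP/cfg/FrameDP.cfg --no_train --infile /home/bop08olp/FrameDP/data/{} --outdir ~/FrameDP/test \;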
