Shell script to analyze files in multiple directories - shell

I am new to Unix and shell scripting, so I'm hoping somebody can help me. I have a program that analyzes a file and then averages over several files (3 in the pared-down example below). I submit my analysis code using the bsub command in the shell script. My question is: how do I modify my script so that I can do the analysis, which entails averaging as mentioned, over all 3 files when they sit in different directories? The only idea that comes to mind is to copy each file into a separate directory and then do all of the analysis there, but the files are huge, so copying them doesn't seem like the best idea. I'll be happy to clarify anything that's not clear.
#!/bin/sh
lsf_job_file="submit_analysis.lsf"
start=1
stop=3
increment=1

for id in $(seq $start $increment $stop)
do
    file_dir="${id}"
    cp -R base "${file_dir}"
    cd "${file_dir}"
    # Submit the job
    bsub < "${lsf_job_file}"
    cd ..
done
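One way to avoid copying the huge files is to leave them where they are and hand each job the path to its input instead. Below is a minimal sketch of that idea; the data locations (/data/dir1 ... /data/dir3), the results directories, and the analyze.sh wrapper script are all placeholder names for illustration, not part of your setup:

#!/bin/sh
start=1
stop=3
increment=1

for id in $(seq $start $increment $stop)
do
    input_file="/data/dir${id}/input.dat"   # placeholder: the large file stays in place
    out_dir="results/${id}"                 # per-file results are written here
    mkdir -p "${out_dir}"

    # Submit the analysis as a command, passing the path instead of copying the file
    bsub -o "${out_dir}/job.log" ./analyze.sh "${input_file}" "${out_dir}"
done

Once all three jobs have finished, the averaging step can read the per-file results from results/1, results/2, and results/3.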

Related

Executing a script takes so long on Git Bash

I'm currently executing a script on Git Bash on a Windows 7 VM. The same script is executed within 15-20 seconds on my Mac machine, but it takes almost 1 hour to run on my Windows.
The script itself contains packages that extract data from XML files, and does not call upon any APIs or anything of the sort.
I have no idea what's going on, and I've tried solving it with the following answers, but to no avail:
https://askubuntu.com/a/738493
https://github.com/git-for-windows/git/wiki/Diagnosing-performance-issues
I would like to have someone help me out in diagnosing or giving a few pointers on what I could do to either understand where the issue is, or how to resolve it altogether.
EDIT:
I am not able to share the entire script, but you can see the type of commands that the script uses through previous questions I have asked on Stack Overflow. Essentially, the script uses a mixture of XMLStarlet commands.
https://stackoverflow.com/a/58694678/3480297
https://stackoverflow.com/a/58693691/3480297
https://stackoverflow.com/a/58080702/3480297
EDIT2:
As a high level overview, the script essentially loops over a folder for XML files, and then retrieves certain data from each one of those files, before creating an HTML page and pasting that data in tables.
A breakdown of these steps in terms of the code can be seen below:
Searching folder for XML files and looping through each one
for file in "$directory"*
do
    if [[ "$file" == *".xml"* ]]; then
        filePath+=( "$file" )
    fi
done

for ((j=0; j < ${#filePath[@]}; j++)); do
    retrieveData "${filePath[j]}"
done
Retrieving data from the XML file in question
function retrieveData() {
    filePath=$1
    # Retrieve data from the relevant xml file
    dataRow=$(xml sel -t -v "//xsd:element[@name=\"$data\"]/@type" -n "$filePath")
    outputRow "$dataRow"
}
Outputting the data to an HTML table
function outputRow() {
    rowValue=$1
    cat >> "$HTMLFILE" << EOF
<td>
    <div>$rowValue</div>
</td>
EOF
}
As previously mentioned, the actual XMLStarlet commands used to retrieve the relevant data can differ; the links to my previous questions show the different types of commands used.
Your git-bash installation is out of date.
Execute git --version to confirm this. Are you using something from before 2.x?
Please install the latest version of git-bash, which is 2.24.0 as of 2019-11-13.
See the Release Notes for git for more information about performance improvements over time.

Creating steps in bash script

To start, I am relatively new to shell scripting. I was wondering if anyone could help me create "steps" within a bash script. For example, I'd like to run one analysis and then have the script proceed to the next analysis with the output files generated in the first analysis.
So for example, the script below will generate output file "filt_C2":
./sortmerna --ref ./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-id98.db:./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-id98.db:./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-id95.db:./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s-id98.db:./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-database-id98.db:./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s.db --reads ~/path/to/file/C2.fastq --aligned ~/path/to/file/rrna_C2 --num_alignments 1 --other ~/path/to/file/filt_C2 --fastx --log -a 8 -m 64000
Once this step is complete, I would like to run another step that will use the output file "filt_C2" that was generated. I have been creating multiple bash scripts for each step; however, it would be more efficient if I could do each step in one bash file. So, is there a way to make a script that will complete Step 1, then move to Step 2 using the files generated in step 1? Any tips would be greatly appreciated. Thank you!
Welcome to bash scripting!
Here are a few tips:
You can have multiple lines, as many as you like, in a bash script file.
You may call other bash scripts (or any other executable programs) from within your shell script, just as Frank has mentioned in his answer.
You may use variables to make your script more generic, say, if you want to name your result "C3" instead of "C2". (Not shown below)
You may use bash functions if your script becomes more complicated, e.g. see https://ryanstutorials.net/bash-scripting-tutorial/bash-functions.php
I recommend placing sortmerna in a directory that is in your PATH environment variable, and replacing the repeated ~/path/to/file with a variable (say WORKDIR) for consistency and flexibility.
For example, let’s say you name your script print_analysis.sh:
#!/bin/bash
# print_analysis.sh
# Written by Nikki E. Andrzejczyk, November 2018
# Set variables
WORKDIR=~/path/to/file
# Stage 1: Generate filt_C2 using SortMeRNA
./sortmerna --ref ./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-id98.db:./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-id98.db:./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-id95.db:./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s-id98.db:./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-database-id98.db:./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s.db \
--reads "$WORKDIR/C2.fastq" \
--aligned "$WORKDIR/rrna_C2" \
--num_alignments 1 \
--other "$WORKDIR/filt_C2" \
--fastx --log -a 8 -m 64000
# Stage 2: Process filt_C2 to generate result_C2
./stage2 "$WORKDIR/filt_C2" > "$WORKDIR/result_C2.txt"
# Stage 3: Print the result in result_C2
less "$WORKDIR/result_C2.txt"
Note how I use a trailing backslash (\) to split the long sortmerna command into multiple shorter lines, and # for human-readable comments.
There is still room for improvement as mentioned above but not implemented in this quick example, but hope this quick example shows you how to expand your bash script and make it do multiple steps in one go.
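If a later stage depends on files produced by an earlier one, it also helps to make the script stop as soon as a stage fails, so Stage 2 never runs on a missing or truncated filt_C2. A small, hypothetical sketch of that pattern (the stage commands stand in for the ones above):

#!/bin/bash
set -euo pipefail   # abort the whole script if any stage fails

WORKDIR=~/path/to/file

echo "Stage 1: filtering reads with sortmerna"
# ... the sortmerna command from above goes here ...

echo "Stage 2: processing filt_C2"
./stage2 "$WORKDIR/filt_C2" > "$WORKDIR/result_C2.txt"

echo "All stages finished; results are in $WORKDIR/result_C2.txt"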
Bash is actually a very powerful scripting and programming language. To learn more, you may want to start with Bash tutorials like the following:
https://ryanstutorials.net/bash-scripting-tutorial/
http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html
Hope this helps! If you have any other questions, or if I had misunderstood your question, please feel free to ask!
Cheers,
Anthony

Script for running multiple Make Commands

I would like some insight on how to get started, or what general direction to look in, when trying to make a script or makefile that runs 3 make commands that take the same input. These three commands all ask for the same input but output different Excel files, because they manipulate the pulled data in different ways. If I could create a script or makefile that runs all three commands after the input is given once, it would save me a ton of time.
This is all being done in putty pretty much (in terms of the commands)
Thanks,
NP
You want to use a shell script.
For instance, you can create run.sh with:
#!/bin/bash
make FLAG1=ON "$@"
make FLAG2=ON "$@"
make FLAG3=ON "$@"
Make it executable and run ./run.sh MYCOMMONFLAG1=ON MYCOMMONFLAG2=OFF ...
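If the three builds should stop at the first failure rather than always running all three, a small variation of the same script (a sketch, chaining the calls with &&) would be:

#!/bin/bash
# Stop as soon as one of the make invocations fails
make FLAG1=ON "$@" && \
make FLAG2=ON "$@" && \
make FLAG3=ON "$@"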

Just partially creation of csv file using crontab command

I have some problem in automation the generation of a csv file. The bash code used to produce the csv works in parallel using 3 cores in order to reduce the time consumption; initially different csv files are produced, which are subsequently combined to form a single csv file. The core of the code is this cycle:
...
waitevery=3
for j in $(seq 1 24); do
    if ((j==1)); then
        printf '%s\n' A B C D E | paste -sd ',' >> code${namefile}01${rr}.csv
    fi
    j=$(printf "%02d" $j)
    ../src/thunderstorm --mask-file=mask.grib const_${namefile}$j${rr}.grib surf_${namefile}$j${rr}.grib ua_${namefile}$j${rr}.grib hl_const.grib out &
    if ! ((c % waitevery)); then
        wait
    fi
    c=$((c+1))
done
...
where ../src/thunderstorm is a Fortran (.F90) program which produces the second and subsequent files.
If I run this code manually it produces the right csv file, but if I run it from a crontab entry it generates a csv file containing only the header A B C D E.
Some suggestions?
Thanks!
cron runs your script in an environment that often does not match your expectations.
check that the PATH is correct and that the script is called from the correct location: ../src is obviously relative, but to what?
I find cron-scripts to be much more reliable when using full paths for input, output and programs.
As @umläute points out, cron runs your scripts but does not run the typical initializations that you may have when you open a terminal session. This means you must not make any assumptions about your environment.
For scripts that may be invoked from the shell and may be invoked from cron I usually add at the beginning something like this:
BIN_DIR=/home/myhome/bin
PATH=$PATH:$BIN_DIR
Also, make sure you do not use relative paths to executables like ../src/thunderstorm. The working directory of the script invoked by cron may not be what you think. You may use $BIN_DIR/../src/thunderstorm. If you want to save typing add the relevant directories to the PATH.
The same logic goes for all other shell variables.
Doing a good initialization at the beginning of your script will allow you to run it from the shell for testing (or manual execution) and then run it as a cron job too.
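A sketch of such an initialization block (the directory names are placeholders; adjust them to your layout):

#!/bin/bash
# Cron-safe preamble: explicit PATH, absolute paths, known working directory
BIN_DIR=/home/myhome/bin
WORK_DIR=/home/myhome/csv_runs   # placeholder: where the .grib inputs and csv output live
export PATH="$PATH:$BIN_DIR"

cd "$WORK_DIR" || exit 1         # fail loudly instead of writing files somewhere unexpected

# ... the existing loop goes here, invoking "$BIN_DIR/../src/thunderstorm" with full paths ...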

Can a shell script indicate that its lines be loaded into memory initially?

UPDATE: this is a repost of How to make shell scripts robust to source being changed as they run
This is a little thing that bothers me every now and then:
I write a shell script (bash) for a quick and dirty job
I run the script, and it runs for quite a while
While it's running, I edit a few lines in the script, configuring it for a different job
But the first process is still reading the same script file and gets all screwed up.
Apparently, the script is interpreted by loading each line from the file as it is needed. Is there some way that I can have the script indicate to the shell that the entire script file should be read into memory all at once? For example, Perl scripts seem to do this: editing the code file does not affect a process that's currently interpreting it (because it's initially parsed/compiled?).
I understand that there are many ways I could get around this problem. For example, I could try something like:
cat script.sh | sh
or
sh -c "`cat script.sh`"
... although those might not work correctly if the script file is large and there are limits on the size of stream buffers and command-line arguments. I could also write an auxiliary wrapper that copies a script file to a locked temporary file and then executes it, but that doesn't seem very portable.
So I was hoping for the simplest solution that would involve modifications only to the script, not the way in which it is invoked. Can I just add a line or two at the start of the script? I don't know if such a solution exists, but I'm guessing it might make use of the $0 variable...
The best answer I've found is a very slight variation on the solutions offered to How to make shell scripts robust to source being changed as they run. Thanks to camh for noting the repost!
#!/bin/sh
{
    # Your stuff goes here
    exit
}
This ensures that all of your code is parsed initially; note that the 'exit' is critical to ensuring that the file isn't accessed later to see if there are additional lines to interpret. Also, as noted on the previous post, this isn't a guarantee that other scripts called by your script will be safe.
Thanks everyone for the help!
Use an editor that doesn't modify the existing file, and instead creates a new file then replaces the old file. For example, using :set writebackup backupcopy=no in Vim.
How about changing the way you edit it instead?
If the script is running, before editing it, do this:
mv script script-old
cp script-old script
rm script-old
Since the shell keeps the file open, everything will work okay as long as you don't change the contents of the open inode.
The above works because mv will preserve the old inode while cp will create a new one. Since a file's contents will not actually be removed if it is opened, you can remove it right away and it will be cleaned up once the shell closes the file.
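If you edit long-running scripts often, those three commands can be wrapped in a small helper (a sketch; the function name is made up):

#!/bin/bash
# break_link FILE: replace FILE with an identical copy under a new inode, so a
# process that already has the old file open keeps reading the old contents.
break_link() {
    mv "$1" "$1.old"
    cp "$1.old" "$1"
    rm "$1.old"
}

break_link script.sh   # now edit script.sh as usual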
According to the bash documentation, if instead of
#!/bin/bash
body of script
you try
#!/bin/bash
script=$(cat <<'SETVAR'
body of script
SETVAR
)
eval "$script"
then I think you will be in business.
Consider creating a new bang path for your quick-and-dirty jobs. If you start your scripts with:
#!/usr/local/fastbash
or something, then you can write a fastbash wrapper that uses one of the methods you mentioned. For portability, one can just create a symlink from fastbash to bash, or have a comment in the script saying one can replace fastbash with bash.
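A hypothetical sketch of what such a fastbash wrapper might look like, on systems that allow a script to serve as a #! interpreter: it reads the whole script into memory up front and hands the text to bash, so later edits to the file no longer affect the running job:

#!/bin/bash
# /usr/local/fastbash (hypothetical): slurp the script, then run it from memory
script_file="$1"
shift
exec bash -c "$(cat "$script_file")" "$script_file" "$@"

As the question notes, this inherits the command-line length limits of the sh -c approach, so it only suits scripts of moderate size.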
If you use Emacs, try M-x customize-variable break-hardlink-on-save. Setting this variable will tell Emacs to write to a temp file and then rename the temp file over the original instead of editing the original file directly. This should allow the running instance to keep its unmodified version while you save the new version.
Presumably, other semi-intelligent editors would have similar options.
A self-contained way to make a script resistant to this problem is to have the script copy and re-execute itself like this:
#!/bin/bash
if [[ $0 != /tmp/copy-* ]] ; then
    rm -f /tmp/copy-$$
    cp "$0" /tmp/copy-$$
    exec /tmp/copy-$$ "$@"
    echo "error copying and execing script"
    exit 1
fi
rm "$0"
# rest of script...
(This will not work if the original script's path begins with /tmp/copy-.)
(This is inspired by R Samuel Klatchko's answer)
