Show real-time progress with GNU Parallel and Stata - bash

I'm using GNU parallel to run a Stata do file for many different data sets.
I have a Bash script that contains the following:
parallel -a arguments.txt -j 3 stata -b do $dofileloc {}
Since the do file has several different parts for each dataset, I would like the progress shown in real time (e.g. display "data loaded for XYZ" after a part of the Stata do file finishes for a dataset).
So I'd like to redirect messages from Stata to the command line, but I'm having trouble doing this.
If I don't run Stata in batch mode I can see everything, but that is a bit messy. I have tried using the shell command in Stata, but I can't seem to figure out the correct combination.
I would appreciate any tips.

Does this do what you want?
parallel --tag --linebuffer -a arguments.txt -j 3 stata -b do $dofileloc {}
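Two flags do the work here: --tag prefixes every line of output with the argument that produced it, so you can tell the datasets apart, and --linebuffer passes output through line by line while the jobs are still running, instead of holding each job's output back until it finishes.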

Related

How does one parallelize a shell script with arguments using GNU parallel?

I am new to bash scripting. I have a shell script that runs several functions for longitudinal image processing in MATLAB through the terminal. I would like to parallelize the process in the terminal.
Here is a brief example of how it runs:
./script.sh *.nii -surface -m /Applications/MATLAB_R2018b.app/bin/matlab
*.nii refers to images from a single subject taken at different times (i.e. subj1img1, subj1img2, subj1img3). There are 3 images per subject in my case, so in each run the script runs through all images of a single subject.
I would like to parallelize this process so that I can run this script for multiple subjects at the same time. Reading through GNU parallel with my limited experience, I wasn't able to figure out the code I need to write to make this happen. I'd really appreciate any suggestions.
parallel ./script.sh {} -surface -m /Applications/MATLAB_R2018b.app/bin/matlab ::: *.nii
Alternatively, you can start them in the background with & in a for loop, as below; note that this launches one job per file with no limit on how many run at once:
for f in *.nii
do
    ./script.sh "$f" -surface -m /Applications/MATLAB_R2018b.app/bin/matlab &
done
wait    # block until all background jobs have finished
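If the script really needs all three images of one subject per run, as described above, here is a minimal sketch, assuming filenames of the form subj1img1.nii (the -j 4 job limit is arbitrary): strip the imgN.nii suffix to get the unique subject prefixes, then hand each job a whole subject's set of images. The glob {}img*.nii is expanded by the shell that GNU parallel uses to run each job.
ls *.nii | sed 's/img[0-9]*\.nii$//' | sort -u |
  parallel -j 4 './script.sh {}img*.nii -surface -m /Applications/MATLAB_R2018b.app/bin/matlab'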

Creating steps in bash script

To start, I am relatively new to shell scripting. I was wondering if anyone could help me create "steps" within a bash script. For example, I'd like to run one analysis and then have the script proceed to the next analysis with the output files generated in the first analysis.
So for example, the script below will generate output file "filt_C2":
./sortmerna --ref ./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-id98.db:./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-id98.db:./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-id95.db:./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s-id98.db:./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-database-id98.db:./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s.db --reads ~/path/to/file/C2.fastq --aligned ~/path/to/file/rrna_C2 --num_alignments 1 --other ~/path/to/file/filt_C2 --fastx --log -a 8 -m 64000
Once this step is complete, I would like to run another step that will use the output file "filt_C2" that was generated. I have been creating multiple bash scripts for each step; however, it would be more efficient if I could do each step in one bash file. So, is there a way to make a script that will complete Step 1, then move to Step 2 using the files generated in step 1? Any tips would be greatly appreciated. Thank you!
Welcome to bash scripting!
Here are a few tips:
You can have multiple lines, as many as you like, in a bash script file.
You may call other bash scripts (or any other executable programs) from within your shell script, just as Frank has mentioned in his answer.
You may use variables to make your script more generic, say, if you want to name your result "C3" instead of "C2" (sketched further below).
You may use bash functions if your script becomes more complicated, e.g. see https://ryanstutorials.net/bash-scripting-tutorial/bash-functions.php
I recommend placing sortmerna in a directory that is on your PATH environment variable, and replacing the repeated ~/path/to/file with a variable (say WORKDIR) for consistency and flexibility.
For example, let’s say you name your script print_analysis.sh:
#!/bin/bash
# print_analysis.sh
# Written by Nikki E. Andrzejczyk, November 2018
# Set variables
WORKDIR=~/path/to/file
# Stage 1: Generate filt_C2 using SortMeRNA
./sortmerna --ref ./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-id98.db:./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-id98.db:./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-id95.db:./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s-id98.db:./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-database-id98.db:./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s.db \
--reads "$WORKDIR/C2.fastq" \
--aligned "$WORKDIR/rrna_C2" \
--num_alignments 1 \
--other "$WORKDIR/filt_C2" \
--fastx --log -a 8 -m 64000
# Stage 2: Process filt_C2 to generate result_C2
./stage2 "$WORKDIR/filt_C2" > "$WORKDIR/result_C2.txt"
# Stage 3: Print the result in result_C2
less "$WORKDIR/result_C2.txt"
Note how I use a trailing backslash (\) to split the long sortmerna command into multiple shorter lines, and # to introduce human-readable comments.
There is still room for improvement as mentioned in the tips above (for example, a variable for the sample name, sketched below), but I hope this quick example shows you how to expand your bash script and make it do multiple steps in one go.
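For instance, a minimal sketch of the variable idea from the tips above (the SAMPLE name and its default are mine, purely illustrative): take the sample name as the first argument so the same script works for C2, C3, and so on.
#!/bin/bash
set -e   # stop at the first failing stage rather than running on bad input
# Hypothetical variation: the sample name comes from the first argument,
# defaulting to C2, so one script covers every sample.
SAMPLE=${1:-C2}
WORKDIR=~/path/to/file
./stage2 "$WORKDIR/filt_$SAMPLE" > "$WORKDIR/result_$SAMPLE.txt"
You would then run it as ./print_analysis.sh C3 to process sample C3.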
Bash is actually a very powerful scripting and programming language. To learn more, you may want to start with Bash tutorials like the following:
https://ryanstutorials.net/bash-scripting-tutorial/
http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html
Hope this helps! If you have any other questions, or if I have misunderstood your question, please feel free to ask!
Cheers,
Anthony

Script for running multiple Make Commands

I would like some insight on how to get started, or what general direction to look in, when trying to make a script or makefile that will run 3 make commands at once that take the same input. These three commands all ask for the same input but output different Excel files, because each manipulates the pulled data in a different way. Therefore, if I were able to create a script or makefile that runs all three commands at once while giving the input only once, it would SAVE ME A TON OF TIME.
This is all being done in PuTTY, pretty much (in terms of the commands).
Thanks,
NP
You want to use a shell script.
For instance, you can create run.sh with:
#!/bin/bash
make FLAG1=ON "$@"
make FLAG2=ON "$@"
make FLAG3=ON "$@"
(Use "$@" rather than $* so that arguments containing spaces are passed through intact.)
Make it executable (chmod +x run.sh) and run ./run.sh MYCOMMONFLAG1=ON MYCOMMONFLAG2=OFF ...
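If the three builds are independent (they write different output files and share no intermediate files), here is a sketch of running them concurrently with GNU parallel instead of one after another; whether this is safe depends entirely on your Makefile:
#!/bin/bash
# Sketch: launch the three variants at once; common flags given on the
# command line still pass through via "$@".
parallel -j 3 make {} "$@" ::: FLAG1=ON FLAG2=ON FLAG3=ON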

How to process multiple files in sequence using OPEN MP and/or MPI?

I'm using the parallel_multicore version of the DBSCAN clustering algorithm available below:
http://cucis.ece.northwestern.edu/projects/Clustering/index.html
To run the code simply requires the following line:
./omp_dbscan -i trial.txt -m 4 -e 0.5 -o output.txt -t 8
where -i is the input, -m and -e are two parameters, -o is the output and -t is the number of threads.
What I want to do is adapt this command so that I can process lots of input files (say trial_1.txt, trial_2.txt, trial_3.txt and so on) sequentially, but I'm not really sure how to do this in this language. Any help would be greatly appreciated, as I'm thoroughly lost!
Any Unix server will have a shell installed.
Unix shells have been used to automate simple processes since the beginning of Unix; they are the scripting language at the very heart of any Unix system.
Their syntax is very easy to learn, and it will allow you to easily automate such tasks, so find a tutorial on shell scripting!
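To make that concrete, here is a minimal sketch, assuming your inputs are named trial_1.txt through trial_10.txt (adjust the range to your files): loop over the indices and derive a matching output name for each run.
for i in {1..10}
do
    ./omp_dbscan -i "trial_${i}.txt" -m 4 -e 0.5 -o "output_${i}.txt" -t 8
done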

How do I curl multiple resources in one command?

Say I am trying to download a set of 50 lecture notes efficiently. These notes are inside the prof subdirectory of a university website. The 45th lecture note is inside the lect45 subdirectory as a pdf entitled lect45.pdf. I get my first pdf as follows:
curl -O http://www.university.edu/~prof/lect1/lect1.pdf
How do I get all my 50 notes efficiently using cURL and bash? I'm trying to do this from the command line, not through a Python / Ruby / Perl script. I know something like the below will generate a lot of 404s:
curl -O http://www.university.edu/~prof/lect{1..50}/lect{1..50}.pdf
so what will work better? I would prefer an elegant one-liner over a loop.
Do it in several processes:
for i in {1..50}
do
curl -O http://www.university.edu/~prof/lect$i/lect$i.pdf &
done
or as a one-liner (just different formatting):
for i in {1..50}; do curl -O http://www.university.edu/~prof/lect$i/lect$i.pdf & done
The & makes all processes run in parallel.
Don't be scared by the output: the shell reports each of the 50 processes as it starts, and later reports each one again as it terminates, so there is a lot of noise.
You probably don't want to run all 50 in parallel ;-)
EDIT:
Your example uses {1..50} twice, which makes a matrix of the numbers; try echo {1..3}/{1..3} to see what I mean. That is why it generates so many 404s: every lecture number gets combined with every file number, and only the matching pairs exist on the server.
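If you like the background approach but want a cap on how many downloads run at once, here is a sketch using xargs -P (the limit of 8 simultaneous transfers is arbitrary):
for i in {1..50}; do echo "http://www.university.edu/~prof/lect$i/lect$i.pdf"; done |
  xargs -n 1 -P 8 curl -O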
Take a look at the GNU parallel shell tool.
So for this particular case it would look like:
seq 50 | parallel curl -O http://www.university.edu/~prof/lect{}/lect{}.pdf
As for curl: it doesn't have its own parallel mechanism, and why should it? And your example with the shell expansion {1..50} seems valid to me.
