Simulating User Interaction in GROMACS in Bash

I am currently doing parallel cascade simulations in GROMACS 4.6.5 and I am inputting the commands using a bash script:
#!/bin/bash
pdb2gmx -f step_04_01.pdb -o step_04_01.gro -water none -ff amber99sb -ignh
grompp -f minim.mdp -c step_04_01.gro -p topol.top -o em.tpr
mdrun -v -deffnm em
grompp -f nvt.mdp -c em.gro -p topol.top -o nvt.tpr
mdrun -v -deffnm nvt
grompp -f md.mdp -c nvt.gro -t nvt.cpt -p topol.top -o step_04_01.tpr
mdrun -v -deffnm step_04_01
trjconv -s step_04_01.tpr -f step_04_01.xtc -pbc mol -o step_04_01_pbc.xtc
g_rms -s itasser_2znh.tpr -f step_04_01_pbc.xtc -o step_04_01_rmsd.xvg
Commands such as trjconv and g_rms require user interaction to select options. For instance when running trjconv you are given:
Select group for output
Group 0 ( System) has 6241 elements
Group 1 ( Protein) has 6241 elements
Group 2 ( Protein-H) has 3126 elements
Group 3 ( C-alpha) has 394 elements
Group 4 ( Backbone) has 1182 elements
Group 5 ( MainChain) has 1577 elements
Group 6 ( MainChain+Cb) has 1949 elements
Group 7 ( MainChain+H) has 1956 elements
Group 8 ( SideChain) has 4285 elements
Group 9 ( SideChain-H) has 1549 elements
Select a group:
And the user is expected to enter e.g. 0 into the terminal to select Group 0. I have tried using expect and send, e.g.:
trjconv -s step_04_01.tpr -f step_04_01.xtc -pbc mol -o step_04_01_pbc.xtc
expect "Select group: "
send "0"
However this does not work. I have also tried using -flag like in http://www.gromacs.org/Documentation/How-tos/Using_Commands_in_Scripts#Within_Script but it says that it is not a recognised input.
Is my expect/send formatted correctly? Is there another way around this in GROMACS?

I don't know GROMACS but I think they are just asking you to use the bash syntax:
yourcomand ... <<EOF
1st answer to a question
2nd answer to a question
EOF
so you might have
trjconv -s step_04_01.tpr -f step_04_01.xtc -pbc mol -o step_04_01_pbc.xtc <<EOF
0
EOF
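Applying the same idea to the g_rms step, which asks two questions (the group for the least-squares fit and the group for the RMSD calculation), you stack the answers in the here-document. A sketch, assuming you want group 4 (Backbone) for both:
g_rms -s itasser_2znh.tpr -f step_04_01_pbc.xtc -o step_04_01_rmsd.xvg <<EOF
4
4
EOF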

You can use
echo 0 | trjconv -s step_04_01.tpr -f step_04_01.xtc -pbc mol -o step_04_01_pbc.xtc
And if you need to have multiple inputs, just use
echo 4 4 | g_rms -s itasser_2znh.tpr -f step_04_01_pbc.xtc -o step_04_01_rmsd.xvg
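If a tool insists on reading one answer per line, printf makes the newlines explicit and is equivalent to the echo form above:
printf '4\n4\n' | g_rms -s itasser_2znh.tpr -f step_04_01_pbc.xtc -o step_04_01_rmsd.xvg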

Related

How to make a bash script that will use cdhit on each file in the directory separately?

I have a directory with >500 multifasta files. I want to use the same program (cd-hit-est) to cluster the sequences in each of the files and then save the output in another directory. I want each output file to have the same name as the original file.
for file in /dir/*.fasta;
do
echo "$file";
cd-hit-est -i $file -o /anotherdir/${file} -c 0.98 -n 9 -d 0 -M 120000 -T 32;
done
I get partial output and then an error:
...
^M# comparing sequences from 33876 to 33910
.................---------- new table with 34 representatives
^M# comparing sequences from 33910 to 33943
.................---------- new table with 33 representatives
^M# comparing sequences from 33943 to 33975
................---------- new table with 32 representatives
^M# comparing sequences from 33975 to 34006
................---------- new table with 31 representatives
^M# comparing sequences from 34006 to 34036
...............---------- new table with 30 representatives
^M# comparing sequences from 34036 to 34066
...............---------- new table with 30 representatives
^M# comparing sequences from 34066 to 35059
.....................
Fatal Error:
file opening failed
Program halted !!
---------- new table with 993 representatives
35059 finished 34719 clusters
No output file was produced. Could anyone help me understand where I made a mistake?
doit() {
  file="$1"
  echo "$file"
  cd-hit-est -i "$file" -o /anotherdir/"$(basename "$file")" -c 0.98 -n 9 -d 0 -M 120000 -T 32
}
env_parallel doit ::: /dir/*.fasta
OK, it seems that I have an answer now, in case somebody is looking for a similar solution.
for file in /dir/*.fasta;
do
echo "$file";
cd-hit-est -i "$file" -o /anotherdir/$(basename "$file") -c 0.98 -n 9 -d 0 -M 120000 -T 32;
done
Naming the output file differently, stripping the directory part with basename, did the trick.
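To see why this fixes it: $file carries the full path from the glob, so prefixing it with another directory yields a nested path that does not exist, which is why cd-hit-est could not open its output file. A quick illustration with a hypothetical file name:
file=/dir/sample.fasta
echo "/anotherdir/${file}"               # /anotherdir//dir/sample.fasta  (no such directory)
echo "/anotherdir/$(basename "$file")"   # /anotherdir/sample.fasta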

Reading from file and passing it to another command

I have a two column tab delimited file the contains input for a command.
The input file looks like this:
2795.bam 2865.bam
2825.bam 2865.bam
2794.bam 2864.bam
the command line is:
macs2 callpeak -t trt.bam -c ctrl.bam -n Macs.name.bam --gsize hs --nomodel
where trt.bam are the names of files in column 1 and ctrl.bam are the names of files in col2.
What I am trying to do is read these values from the input file and run the command on them. To achieve this I am doing the following:
cat temp | awk '{print $1 "\t" $2 }' | macs2 callpeak -t $1 -c $2 -n Macs.$1 --gsize hs --nomodel
This is failing. The error that I get is:
usage: macs2 callpeak [-h] -t TFILE [TFILE ...] [-c [CFILE [CFILE ...]]]
[-f {AUTO,BAM,SAM,BED,ELAND,ELANDMULTI,ELANDEXPORT,BOWTIE,BAMPE,BEDPE}]
[-g GSIZE] [--keep-dup KEEPDUPLICATES]
[--buffer-size BUFFER_SIZE] [--outdir OUTDIR] [-n NAME]
[-B] [--verbose VERBOSE] [--trackline] [--SPMR]
[-s TSIZE] [--bw BW] [-m MFOLD MFOLD] [--fix-bimodal]
[--nomodel] [--shift SHIFT] [--extsize EXTSIZE]
[-q QVALUE | -p PVALUE] [--to-large] [--ratio RATIO]
[--down-sample] [--seed SEED] [--tempdir TEMPDIR]
[--nolambda] [--slocal SMALLLOCAL] [--llocal LARGELOCAL]
[--broad] [--broad-cutoff BROADCUTOFF]
[--cutoff-analysis] [--call-summits]
[--fe-cutoff FECUTOFF]
macs2 callpeak: error: argument -t/--treatment: expected at least one argument
In an ideal situation this should be taking inputs like this:
macs2 callpeak -t 2795.bam -c 2865.bam -n Macs.2795 --gsize hs --nomodel
where MACS is standalone software that runs on Linux. In the present situation, the software is failing to read the input from the file.
Any inputs are deeply appreciated.
I believe what you want to achieve is a loop over all lines in your input file. In bash, you can achieve this as:
while read -r tfile cfile; do
macs2 callpeak -t "$tfile" -c "$cfile" -n "Macs.$tfile" --gsize hs --nomodel
done < "input_file.txt"
See: https://mywiki.wooledge.org/BashFAQ/001 (cf. Sundeep's comment)
original answer:
while read -a a; do
macs2 callpeak -t "${a[0]}" -c "${a[1]}" -n "Macs.${a[0]}" --gsize hs --nomodel
done < "input_file.txt"
This will read the input file input_file.txt line by line, storing each line in a bash array named a via read -a a. From that point forward, you run your command with the fields ${a[0]} and ${a[1]}.
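An equivalent one-liner, if you prefer xargs; a sketch, assuming the file names contain no whitespace:
# take the two columns two tokens at a time; the _ fills $0, so the fields land in $1 and $2
xargs -n2 sh -c 'macs2 callpeak -t "$1" -c "$2" -n "Macs.$1" --gsize hs --nomodel' _ < input_file.txt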

Bash Multiple cURL request Issue

The script submits the files, and after each submission the API service returns the "task_id" of the submitted sample (collected in #task.csv):
#file_submitter.sh
#!/bin/bash
for i in $(find $1 -type f);do
task_id="$(curl -s -F file=#$i http://X.X.X.X:8080/api/abc/v1/upload &)"
echo "$task_id" >> task.csv
done
Run method:
$./submitter.sh /home/files/
Results (here 761 and 762 are the task_ids of the submitted samples, as returned by the API service):
#task.csv
{"task_url": "http://X.X.X.X:8080/api/abc/v1/task/761"}
{"task_url": "http://X.X.X.X:8080/api/abc/v1/task/762"}
I'm passing the folder path and using find $1 -type f to find all the files in the directory to upload. I'm using the "&" operator to submit/upload the files in the background; each upload makes the API service print a 'task_id' on stdout, and I want that 'task_id' stored in 'task.csv'. But the time taken to upload a file with "&" and without "&" is the same. Is there a way to make the submission parallel/faster? Any suggestions, please?
anubhava suggests using xargs with -P option:
find "$1" -type f -print0 |
xargs -0 -P 5 curl -s -F file=@- http://X.X.X.X:8080/api/abc/v1/upload >> task.csv
However, appending to the same file in parallel is generally a bad idea: you really need to know a lot about how your OS buffers output for that to be safe. This example shows why:
#!/bin/bash
size=3000
myfile=/tmp/myfile$$
rm $myfile
echo {a..z} | xargs -P26 -n1 perl -e 'print ((shift)x'$size')' >> $myfile
cat $myfile | perl -ne 'for(split//,$_){
if($_ eq $l) {
$c++
} else {
/\n/ and next;
print $l,1+$c," "; $l=$_; $c=0;
}
}'
echo
With size=10 you will always get (order may differ):
1 d10 i10 c10 n10 h10 x10 l10 b10 u10 w10 t10 o10 y10 z10 p10 j10 q10 s10 v10 r10 k10 e10 m10 f10 g10
Which means that the file contains 10 d's followed by 10 i's followed by 10 c's and so on. I.e. no mixing of the output from the 26 jobs.
But change it to size=30000 and you get something like:
1 c30000 d30000 l8192 g8192 t8192 g8192 t8192 g8192 t8192 g5424 t5424 a8192 i16384 s8192 i8192 s8192 i5424 s13616 f16384 k24576 p24576 n8192 l8192 n8192 l13616 n13616 r16384 u8192 r8192 u8192 r5424 u8192 o16384 b8192 j8192 b8192 j8192 b8192 j8192 b5424 a21808 v8192 o8192 v8192 o5424 v13616 j5424 u5424 h16384 p5424 h13616 x8192 m8192 k5424 m8192 q8192 f8192 m8192 f5424 m5424 q21808 x21808 y30000 e30000 w30000
First 30K c's, then 30K d's, then 8K l's, then 8K g's, 8K t's, then another 8K g's, and so on. I.e. the 26 outputs were mixed together. Very non-good.
For that reason I would advise against appending to the same file in parallel: there is a risk of a race condition, and it can often be avoided.
In your case you can simply use GNU Parallel instead of xargs, because GNU Parallel guards against this race condition:
find "$1" -type f -print0 |
parallel -0 -P 5 curl -s -F file=@{} http://X.X.X.X:8080/api/abc/v1/upload >> task.csv
You can use xargs with -P option:
find "$1" -type f -print0 |
xargs -0 -P 5 -I{} curl -s -F file='@{}' http://X.X.X.X:8080/api/abc/v1/upload >> task.csv
This will reduce total execution time by launching 5 curl process in parallel.
The command inside a command substitution, $(), runs in a subshell; so here you are putting the curl command in the background of that subshell, not of the parent shell.
Get rid of the command substitution and just do:
curl -s -F file=@$i http://X.X.X.X:8080/api/abc/v1/upload >task.csv &
You're telling the shell to parallelize inside of a command substitution ($()). That's not going to do what you want. Try this instead:
#!/bin/bash
for i in $(find $1 -type f);do
curl -s -F file=@$i http://X.X.X.X:8080/api/abc/v1/upload &
done > task.csv
#uncomment next line if you want the script to pause until the last curl is done
#wait
This puts the curl into the background and saves its output into task.csv.
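If you want to keep plain background jobs but still assemble task.csv safely, another option is to give every curl its own output file and concatenate after wait. A sketch, assuming the files sit directly in the given directory and each response is a single JSON object:
#!/bin/bash
outdir=$(mktemp -d)                 # one scratch file per upload avoids the append race
for i in "$1"/*; do
    [ -f "$i" ] || continue         # skip anything that is not a regular file
    curl -s -F file=@"$i" http://X.X.X.X:8080/api/abc/v1/upload > "$outdir/$(basename "$i").json" &
done
wait                                # block until every background upload has finished
cat "$outdir"/*.json > task.csv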

sed or awk replace after ignoring N matches

I have a repetitive text in which I want to carefully replace one label with another several times; I don't mind repeating sed, awk, or another method. I want to first replace the first two matches, then the next two (after the first 4, 6, etc.). I don't want a for loop; I just need something like the code below, where I skip the first two matches and can then increase that number.
sed 's/foo/bar/2g' fileX
awk '{ sub(/foo/,"bar"); print }' fileX
Here is an example, with two occurrences per line:
blastx -q specie.fa -db pep -num 6 -max 1 -o 6 > specie.x.outfmt6
blastp -q specie.pep -db pep -num 6 -max 1 -o 6 > specie.p.outfmt6
blastx -q specie.fa -db pep -num 6 -max 1 -o 6 > specie.x.outfmt6
blastp -q specie.pep -db pep -num 6 -max 1 -o 6 > specie.p.outfmt6
Desired output
blastx -q dog.fa -db pep -num 6 -max 1 -o 6 > dog.x.outfmt6
blastp -q dog.pep -db pep -num 6 -max 1 -o 6 > dog.p.outfmt6
blastx -q worm.fa -db pep -num 6 -max 1 -o 6 > worm.x.outfmt6
blastp -q worm.pep -db pep -num 6 -max 1 -o 6 > worm.p.outfmt6
Is this what you're trying to do?
$ awk -v animals='monkey worm dog' 'BEGIN{split(animals,a)} NR%2{c++} {$NF=a[c]} 1' file
here some text -t monkey
and then do something -t monkey
here some text -t worm
and then do something -t worm
here some text -t dog
and then do something -t dog
Given your new sample input/output maybe this is what you want:
$ awk -v animals='dog worm' 'BEGIN{split(animals,a)} NR%2{c++} {gsub(/specie/,a[c])} 1' file
blastx -q dog.fa -db pep -num 6 -max 1 -o 6 > dog.x.outfmt6
blastp -q dog.pep -db pep -num 6 -max 1 -o 6 > dog.p.outfmt6
blastx -q worm.fa -db pep -num 6 -max 1 -o 6 > worm.x.outfmt6
blastp -q worm.pep -db pep -num 6 -max 1 -o 6 > worm.p.outfmt6
Since you didn't include any regexp characters, backreference characters, or partial-match cases in your sample input/output (e.g. if the word "species" appeared somewhere and should NOT be changed), I assume they can't happen, so we don't need the script to guard against them.
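For reference, the same command with the moving parts annotated:
awk -v animals='dog worm' '
BEGIN { split(animals, a) }   # a[1]="dog", a[2]="worm"
NR % 2 { c++ }                # on every odd-numbered line, advance to the next name
{ gsub(/specie/, a[c]) }      # replace every occurrence on the current line
1                             # awk shorthand for: print the (possibly modified) line
' file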

How to properly use the grep command to grab and store integers?

I am currently building a bash script for class, and I am trying to use the grep command to grab the values from a simple calculator program and store them in the variables I assign, but I keep receiving a syntax error message when I try to run the script. Any advice on how to fix it? My script looks like this:
#!/bin/bash
addanwser=$(grep -o "num1 + num2" Lab9 -a 5 2)
echo "addanwser"
subanwser=$(grep -o "num1 - num2" Lab9 -s 10 15)
echo "subanwser"
multianwser=$(grep -o "num1 * num2" Lab9 -m 3 10)
echo "multianwser"
divanwser=$(grep -o "num1 / num2" Lab9 -d 100 4)
echo "divanwser"
modanwser=$(grep -o "num1 % num2" Lab9 -r 300 7)
echo "modawser"`
You want to grep the output of a command.
grep reads from either a file or standard input, so you can use any of these equivalent forms:
grep X file # 1. from a file
... things ... | grep X # 2. from stdin
grep X <<< "content" # 3. using here-strings
For this case, you want to use the last one, so that you execute the program and its output feeds grep directly:
grep <something> <<< "$(Lab9 -s 10 15)"
Which is the same as saying:
Lab9 -s 10 15 | grep <something>
So that grep will act on the output of your program. Since I don't know how Lab9 works, let's use a simple example with seq, which prints the numbers from 5 to 15:
$ grep 5 <<< "$(seq 5 15)"
5
15
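Note that 15 shows up as well, because grep matches the substring 5 anywhere in the line. If you need the line to be exactly 5, the -x option anchors the match to the whole line:
$ grep -x 5 <<< "$(seq 5 15)"
5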
grep is usually used for finding matching lines in a text file. To actually grab a part of the matched line, other tools such as awk are used.
Assuming the output looks like "num1 + num2 = 54" (i.e. fields are separated by space), this should do your job:
addanwser=$(Lab9 -a 5 2 | awk '{print $NF}')
echo "$addanwser"
Make sure you don't miss the '$' sign before addanwser when echo'ing it.
$NF selects the last field. You may select the nth field using $n.
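If you do want to stay with grep, its -o option prints only the matched part, so you can pull the number out directly. A sketch, assuming Lab9 prints something like "num1 + num2 = 7" with the result last:
addanwser=$(Lab9 -a 5 2 | grep -o '[0-9]*$')   # -o prints only the match: the trailing digits
echo "$addanwser"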
