how to use awk and a conditional pipe to submit qsub jobs? - bash

I have a FASTA file from which I am using awk to extract the needed fields (sequences with their headers). I then pipe them to a BLAST program, and finally pipe to qsub to submit a job.
The file:
>sequence_1
ACTGACTGACTGACTG
>sequence_2
ACTGGTCAGTCAGTAA
>sequence_3
CCGTTGAGTAGAAGAA
And the command (which works):
awk < fasta.fasta '/^>/ { print $0 } $0 !~ /^>/' | echo "/Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5 >> /User/blastresult.xml" | qsub -q S
What I would like to do is add a condition that checks the number of jobs I am currently running (using qstat); if it is below a certain threshold, the job will be submitted.
For example:
allowed_jobs=200 #for example
awk < fasta.fasta '/^>/ { print $0 } $0 !~ /^>/' | echo "/Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5 >> /User/blastresult.xml" | cmd=$(qstat -u User | grep -c ".") | if [ $cmd -lt $allowed_jobs ]; then qsub -q S
Unfortunately (for me, anyway) I have failed in all my attempts to do that.
I'd be grateful for any help.
EDIT: elaborating a bit:
What I am trying to do is extract from the FASTA file this:
>sequence_x
ACTATATATATA
or basically: >HEADER\nSEQUENCE
one by one, and pipe each to the BLAST program, which can take stdin. I want to create a unique job for each sequence, and this is the reason I want to pipe to qsub for each sequence.
To put it plainly, the qsub submission would have looked something like this:
qsub -q S /Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -query FASTA_SEQUENCE -outfmt 5 >> /User/blastresult.xml
Note that the -query flag is unnecessary if the sequence is piped to stdin.
However, the main problem for me is how to incorporate the condition I mentioned above, so that a sequence is piped to qsub only if the qstat result is below a threshold. Ideally, if the qstat result is above the threshold, it would sleep until the count goes below it and then pass the sequence forward.
Thanks.

Hello, I guess this has long since been answered.
I'll just provide one way to solve this, by counting the sequences to be processed before passing things over to awk; the awk piece would go where the "echo time to work" line is.
#!/bin/bash
ct=$(grep -c '^>' fasta.fasta)
if [ "$ct" -lt 201 ]; then
    echo time to work
else
    echo too much
fi

This bit of shell reads two lines at a time, prints them to stdout, and pipes them into your qsub command:
while IFS= read -r header; do
    IFS= read -r sequence
    printf "%s\n" "$header" "$sequence" |
        qsub -q S /Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5 >> /User/blastresult.xml
done < fasta.fasta
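To add the throttling the question asks for, here is a minimal sketch: before each submission, poll qstat and sleep while the job count is at or above the threshold. This assumes qstat -u "$USER" prints one line per job (plus a few header lines, so the effective threshold is slightly lower than allowed_jobs):
#!/bin/bash
allowed_jobs=200
while IFS= read -r header; do
    IFS= read -r sequence
    # Sleep while the number of lines reported by qstat is at or above the threshold.
    while [ "$(qstat -u "$USER" | grep -c .)" -ge "$allowed_jobs" ]; do
        sleep 30   # poll every 30 seconds; adjust to taste
    done
    printf "%s\n" "$header" "$sequence" |
        qsub -q S /Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5 >> /User/blastresult.xml
done < fasta.fasta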

Related

In a bash pipe, take the output of the previous command as a variable to the next command (Eg. if statement)

I wanted to write a command to compare the hash of a file, and wrote the single-line command below. I want to understand how I can take the output of the previous command as a variable for the current command, in a pipe.
E.g., in the command below I wanted to compare the calculated hash output by the first command to the original hash. In the last command, I wanted to refer to the output of the previous command. How do I do that in the if statement (instead of $0)?
sha256sum abc.txt | awk '{print $1}' | if [ "$0" = "8237491082roieuwr0r9812734iur" ]; then
echo "match"
fi
Following your narrow request, that looks like:
sha256sum abc.txt |
awk '{print $1}' |
if [ "$(cat)" = "8237491082roieuwr0r9812734iur" ]; then echo "match"; fi
...as cat with no arguments reads the command's stdin, and in a pipeline, content generated by prior stages is streamed into their successors.
Alternately:
sha256sum abc.txt |
awk '{print $1}' |
if read -r line && [ "$line" = "8237491082roieuwr0r9812734iur" ]; then echo "match"; fi
...wherein we read only a single line from stdin instead of using cat. (To instead loop over all lines given on stdin, see BashFAQ #1).
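For reference, a minimal sketch of the shape of that loop:
while IFS= read -r line; do
    # ...process each "$line" here...
    printf '%s\n' "$line"
done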
However, I would strongly suggest writing this instead as:
if [ "$(sha256sum abc.txt | awk '{print $1}')" = "8237491082roieuwr0r9812734iur" ]; then
echo "match"
fi
...which, among other things, keeps your logic outside the pipeline, so your if statement can set variables that remain set after the pipeline exits. See BashFAQ #24 for more details on the problems inherent in running code in pipelines.
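As a quick illustration of that pitfall: in bash's default configuration, each pipeline stage runs in a subshell, so a variable set by read inside a pipeline is gone once the pipeline exits:
printf 'abc\n' | read -r var
echo "${var:-empty}"    # prints "empty": read ran in a subshell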
Consider using sha256sum's check mode. If you save the output of sha256sum to a file, you can check it with sha256sum -c.
$ echo foo > file
$ sha256sum file > hash.txt
$ cat hash.txt
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c file
$ sha256sum -c hash.txt
file: OK
$ if sha256sum -c --quiet hash.txt; then echo "match"; fi
If you don't want to save the hashes to a file you could pass them in via a here-string:
if sha256sum -c --quiet <<< 'b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c file'; then
echo "match"
fi

Terminate tail command after timeout

I'm capturing stdout (a log) in a file using tail -f file_name, saving a specific string with grep, and using sed to exit the tail:
tail -f log.txt | sed /'INFO'/q | grep 'INFO' > info_file.txt
This works fine, but I want to terminate the command in case it does not find the pattern (INFO) in the log file after some time.
I want something like this (which does not work) to exit the script after a timeout (60 sec):
tail -f log.txt | sed /'INFO'/q | grep 'INFO' | read -t 60
Any suggestions?
This seems to work for me...
read -t 60 < <(tail -f log.txt | sed /'INFO'/q | grep 'INFO')
Since you only want to capture one line:
#!/bin/bash
IFS= read -r -t 60 line < <(tail -f log.txt | awk '/INFO/ { print; exit; }')
printf '%s\n' "$line" >info_file.txt
For a more general case, where you want to capture more than one line, the following uses no external commands other than tail:
#!/usr/bin/env bash
end_time=$(( SECONDS + 60 ))
while (( SECONDS < end_time )); do
    IFS= read -t 1 -r line && [[ $line = *INFO* ]] && printf '%s\n' "$line"
done < <(tail -f log.txt)
A few notes:
SECONDS is a built-in variable in bash which, when read, will retrieve the time in seconds since the shell was started. (Assigning to it resets the counter rather than breaking it, but unsetting it removes the special behavior entirely -- avoiding such mishaps is part of why the POSIX variable-naming convention reserving lowercase names for application use is valuable.)
(( )) creates an arithmetic context; all content within is treated as integer math.
<( ) is a process substitution; it evaluates to the name of a file-like object (named pipe, /dev/fd reference, or similar) which, when read from, will contain output from the command contained therein. See BashFAQ #24 for a discussion of why this is more suitable than piping to read.
The timeout command (part of the "coreutils" package on Debian/Ubuntu) seems suitable:
timeout 1m tail -f log.txt | grep 'INFO'
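Combined with the original pipeline (which already exits via sed on the first INFO match), that might look like:
timeout 60 tail -f log.txt | sed /'INFO'/q | grep 'INFO' > info_file.txt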

How to read a file for special lines in bash script

I just want to read the even-numbered lines from a file in a bash shell; how do I do that?
Also, I want to read the fifth line of a file; how do I do that?
awk 'NR % 2 == 0' <filename>
For the second one:
awk 'NR == 5' <filename>
You can also use sed to print lines in a specified range:
sed -ne '5,5p' <filename>
You could use the tail command: put it in a loop for the first case, and the second is trivial once you have the first.
Or maybe you could even use awk:
awk 'NR==5' file_name
To print the even-numbered lines using GNU sed:
sed -n "2~2 p" file
To print a specific line number from a file using sed:
sed '5q;d' file
(Lines 1-4 are deleted without printing; at line 5, q prints the line and quits before d runs.)
Awk is often the answer (or, nowadays, Perl, Python, etc. too).
If for some reason you must do it with only bash and the basic shell utilities:
i=0
while IFS= read -r line; do
    i=$(( (i + 1) % 2 ))
    if [[ $i -eq 0 ]]; then
        echo "$line"   # or whatever else you wanted to do with it
    fi
done < file
And to get a specific line:
head -5 file | tail -1
Try this, for example for lines between 3 and 6:
awk 'NR>=3 && NR<=6' file
Here is a sketch to improve upon (not complete):
#!/bin/bash
while read -r line; do
    # do stuff with "$line"
done < <(awk 'NR>=3 && NR<=6' input.txt)

Bash Script to save grep -c results

I am new to programming altogether and am trying to write my first bash script.
I have a file called NUMBERS.txt that has various numbers in it, like this:
1000
1001
1001
1000
1002
1001
etc..
I would like to write a script to count the occurrences of each number, save them as variables, and print them into a new text file like this:
1001= 3
1000= 2
etc..
I am completely stuck.
Here's what I have so far:
#!/bin/bash
for Count in `grep -c '1000' /NUMBERS.txt `
do
echo 'Count = '${Count}
done
for Count in `grep -c '1001' /NUMBERS.txt `
do
echo 'Count = '${Count}
done
Sort the file then count how many times each unique line occurs:
sort NUMBERS.txt | uniq -c
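For the sample input above, that prints (count first, then the value):
      2 1000
      3 1001
      1 1002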
Since your file already has one number on each line, it is simpler (grep -x matches whole lines only, so that e.g. 100 does not also count 1000):
for i in $(sort -u NUMBERS.txt); do count=$(grep -cx "$i" NUMBERS.txt); echo "$i=$count"; done > your_result.txt
Or, in a different format:
for i in $(sort -u NUMBERS.txt); do
    count=$(grep -cx "$i" NUMBERS.txt)
    echo "$i=$count"
done > your_result.txt
As noted, the performance of that is not very good. Here is a much better one:
sort NUMBERS.txt | uniq -c | awk '{print $2"=", $1}'
Basically, the loop above goes through NUMBERS.txt twice: the first pass gets the unique numbers, and the second pass counts the occurrences of each unique number.
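If you'd rather make only a single pass, awk can accumulate the counts as it reads; a minimal sketch:
awk '{ count[$0]++ } END { for (n in count) print n "= " count[n] }' NUMBERS.txt > your_result.txt
(The iteration order of for (n in count) is unspecified, so pipe the output through sort if you want it sorted.)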
I'm not the best at shell scripting, but here is a solution that works, using bash and grep -c:
#!/bin/bash
INPUT="./numbers.txt"
OUTPUT="./result.txt"
rm -f "${OUTPUT}"
# you might want to change the values; grep -x matches whole lines only
for i in {1000..2000}; do
    for Count in $(grep -cx "${i}" "${INPUT}"); do
        echo "${i} = ${Count}" >> "${OUTPUT}"
    done
done

AWK: execute CURL on each line and parse result

Given an input stream with the following lines:
123
456
789
098
...
I would like to call
curl -s http://foo.bar/some.php?id=xxx
with xxx being the number from each line, and every time let an awk script fetch some information from the curl output, which is then written to the output stream. I am wondering if this is possible without using awk's system() call in the following way:
cat lines | grep "^[0-9]*$" | awk '
{
system("curl -s " $0 \
" | awk \'{ #parsing; print }\'")
}'
You can use bash and avoid the awk system call:
grep "^[0-9]*$" lines | while IFS= read -r line; do
    curl -s "http://foo.bar/some.php?id=$line" | awk 'do your parsing ...'
done
A shell loop would achieve a similar result, as follows:
#!/bin/bash
for f in $(grep "^[0-9]*$" lines); do
    curl -s "http://foo.bar/some.php?id=$f" | awk '{....}'
done
Alternative methods for doing similar tasks include using Perl or Python with an HTTP client.
If the ids are dynamically appended to your file, you can daemonize a small while loop to keep checking the file for more data, like this:
while IFS= read -d $'\n' -r a || sleep 1; do
    [[ -n "$a" ]] && curl -s "http://foo.bar/some.php?id=${a}"
done < lines.txt
Otherwise, if the file is static, you can change the sleep 1 to break; the loop will then read the file once and quit when there is no data left, which is a useful pattern to know.
