how to use awk and a conditional pipe to submit qsub jobs? - bash

I have a file (fasta) that I am using awk to extract the needed fields from (sequences with their headers). I then pipe it to a BLAST program and finally I pipe it to qsub in order to submit a job.
the file:
and the command (which works):
awk < fasta.fasta '/^>/ { print $0 } $0 !~ /^>/' | echo "/Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5 >> /User/blastresult.xml" | qsub -q S
What I would like to do is add a condition that samples the number of jobs I am running (using qstat); if it is below a certain threshold, the job will be submitted.
for example:
allowed_jobs=200 #for example
awk < fasta.fasta '/^>/ { print $0 } $0 !~ /^>/' | echo "/Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5 >> /User/blastresult.xml" | cmd=$(qstat -u User | grep -c ".") | if [ $cmd -lt $allowed_jobs ]; then qsub -q S
unfortunately (for me anyway) I have failed in all my attempts to do that.
I'd be grateful for any help
EDIT: elaborating a bit:
what I am trying to do is to extract from the fasta file this:
or basically: >HEADER\nSEQUENCE
one by one and pipe it to the blast program which can take stdin. I want to create a unique job for each sequence and this is the reason I want to pipe to qsub for each sequence.
to put it plainly the qsub submission would have looked something like this:
qsub -q S /Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -query FASTA_SEQUENCE -outfmt 5 >> /User/blastresult.xml
note that the -query flag is unnecessary if stdin sequence is piped to it.
However, the main problem for me is how to incorporate the condition I mentioned above, so that the sequence is piped to qsub only if the qstat result is below a threshold. Ideally, if the qstat result is above the threshold, it will sleep until it goes below and then pass the sequence forward.

Hello, I guess this was answered long ago.
I'll just provide one way to solve this: count the records that should be processed (sequences) before passing them to awk. The awk piece would go where "echo time to work" is.
ct=`grep -c '^>' fasta.fasta`
if [ $ct -lt 201 ] ; then
    echo time to work
else
    echo too much
fi

This bit of shell reads two lines, prints them to stdout and pipes into your qsub command
while IFS= read -r header; do
    IFS= read -r sequence
    printf "%s\n" "$header" "$sequence" |
        qsub -q S /Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5 >> /User/blastresult.xml
done < fasta.fasta
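
The asker's remaining requirement, waiting until the qstat count drops below a threshold before each submission, could be sketched roughly as follows. This is not from either answer. It assumes each FASTA record is exactly two lines, that counting the non-empty lines of qstat output (as the question does) is an acceptable job count, and that the submit command reads the record on stdin. The function names count_jobs, wait_for_slot and submit_records are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions not taken from the original posts:
#   - every FASTA record is exactly two lines (header, then sequence)
#   - counting non-empty lines of "qstat -u USER" is a good enough job count
#     (this also counts any header lines qstat prints, as in the question)
#   - the submit command reads the record on stdin

# Print the number of non-empty lines that qstat reports for a user.
count_jobs() {
    qstat -u "$1" < /dev/null | grep -c .
}

# Sleep in a loop until the job count is below the limit.
wait_for_slot() {
    local user=$1 limit=$2 delay=$3
    while [ "$(count_jobs "$user")" -ge "$limit" ]; do
        sleep "$delay"
    done
}

# Usage: submit_records FASTA USER LIMIT DELAY SUBMIT_COMMAND...
# Reads header/sequence pairs and pipes each pair to the submit command,
# waiting for a free slot before every submission.
submit_records() {
    local fasta=$1 user=$2 limit=$3 delay=$4
    shift 4
    local header sequence
    while IFS= read -r header && IFS= read -r sequence; do
        wait_for_slot "$user" "$limit" "$delay"
        printf '%s\n' "$header" "$sequence" | "$@"
    done < "$fasta"
}
```

With the command from the answer above, a call might look like: submit_records fasta.fasta User 200 30 qsub -q S /Local/ncbi-blast-2.2.25+/bin/blastx -db blastdb.fa -outfmt 5. Passing the submit command as arguments keeps the throttling logic separate from the site-specific qsub line.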


In a bash pipe, take the output of the previous command as a variable to the next command (Eg. if statement)

I wanted to write a command to compare the hash of a file, and wrote the single-line command below. I would like to understand how to use the output of the previous command as a variable in the current command within a pipe.
For example, in the command below I want to compare the output of the first command (the calculated hash) to the original hash. In the last command, I want to refer to the output of the previous command. How do I do that in the if statement (instead of $0)?
sha256sum abc.txt | awk '{print $1}' | if [ "$0" = "8237491082roieuwr0r9812734iur" ]; then
    echo "match"
fi
Following your narrow request looks like:
sha256sum abc.txt |
awk '{print $1}' |
if [ "$(cat)" = "8237491082roieuwr0r9812734iur" ]; then echo "match"; fi
cat with no arguments reads the command's stdin, and in a pipeline, content generated by prior stages is streamed into their successors.
sha256sum abc.txt |
awk '{print $1}' |
if read -r line && [ "$line" = "8237491082roieuwr0r9812734iur" ]; then echo "match"; fi
...wherein we read only a single line from stdin instead of using cat. (To instead loop over all lines given on stdin, see BashFAQ #1).
However, I would strongly suggest writing this instead as:
if [ "$(sha256sum abc.txt | awk '{print $1}')" = "8237491082roieuwr0r9812734iur" ]; then
    echo "match"
fi
...which, among other things, keeps your logic outside the pipeline, so your if statement can set variables that remain set after the pipeline exits. See BashFAQ #24 for more details on the problems inherent in running code in pipelines.
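
The point about variables can be seen in a small experiment. The following is a minimal sketch assuming bash with its default options (the lastpipe option off); the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: a variable assigned inside a pipeline stage is lost afterwards,
# because bash (by default) runs each pipeline stage in a subshell.
in_pipeline=unset
printf 'abc123\n' | { read -r line; in_pipeline=$line; }

# Capturing the output first keeps the assignment in the current shell.
outside_pipeline=$(printf 'abc123\n')

echo "in_pipeline=$in_pipeline"
echo "outside_pipeline=$outside_pipeline"
```

Under those assumptions the first echo still reports "unset", while the second reports the captured value. Shells that run the last pipeline stage in the current shell (ksh, zsh, or bash with lastpipe enabled) behave differently.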
Consider using sha256sum's check mode. If you save the output of sha256sum to a file, you can check it with sha256sum -c.
$ echo foo > file
$ sha256sum file > hash.txt
$ cat hash.txt
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c  file
$ sha256sum -c hash.txt
file: OK
$ if sha256sum -c --quiet hash.txt; then echo "match"; fi
If you don't want to save the hashes to a file you could pass them in via a here-string:
if sha256sum -c --quiet <<< 'b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c  file'; then
    echo "match"
fi

Terminate tail command after timeout

I'm capturing stdout (log) in a file using tail -f file_name to save a specific string with grep and sed (to exit the tail) :
tail -f log.txt | sed /'INFO'/q | grep 'INFO' > info_file.txt
This works fine, but I want to terminate the command if it does not find the pattern (INFO) in the log file after some time.
I want something like this (which does not work) to exit the script after a timeout (60 seconds):
tail -f log.txt | sed /'INFO'/q | grep 'INFO' | read -t 60
Any suggestions?
This seems to work for me...
read -t 60 < <(tail -f log.txt | sed /'INFO'/q | grep 'INFO')
Since you only want to capture one line:
IFS= read -r -t 60 line < <(tail -f log.txt | awk '/INFO/ { print; exit; }')
printf '%s\n' "$line" >info_file.txt
For a more general case, where you want to capture more than one line, the following uses no external commands other than tail:
#!/usr/bin/env bash
end_time=$(( SECONDS + 60 ))
while (( SECONDS < end_time )); do
    IFS= read -t 1 -r line && [[ $line = *INFO* ]] && printf '%s\n' "$line"
done < <(tail -f log.txt)
A few notes:
SECONDS is a built-in variable in bash which, when read, will retrieve the time in seconds since the shell was started. (It loses this behavior after being the target of any assignment -- avoiding such mishaps is part of why POSIX variable-naming conventions reserving names with lowercase characters for application use are valuable).
(( )) creates an arithmetic context; all content within is treated as integer math.
<( ) is a process substitution; it evaluates to the name of a file-like object (named pipe, /dev/fd reference, or similar) which, when read from, will contain output from the command contained therein. See BashFAQ #24 for a discussion of why this is more suitable than piping to read.
The timeout command (part of the Debian/Ubuntu "coreutils" package) seems suitable:
timeout 1m tail -f log.txt | grep 'INFO'
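
To stop at the first match as well as on timeout, the timeout approach can be wrapped in a small helper. This is a rough sketch, assuming GNU timeout and a grep that supports -m (GNU and BSD grep do; it is not POSIX). The function name first_match_within is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print the first line matching PATTERN from COMMAND's output,
# giving up once SECONDS have elapsed.
# Usage: first_match_within SECONDS PATTERN COMMAND...
first_match_within() {
    local seconds=$1 pattern=$2
    shift 2
    # timeout kills COMMAND at the deadline; grep -m 1 stops after one match.
    # Without pipefail, the exit status is grep's: 0 on a match, 1 otherwise.
    timeout "$seconds" "$@" | grep -m 1 -e "$pattern"
}
```

For the log in the question this might be called as: first_match_within 60 INFO tail -f log.txt > info_file.txt. Note that after grep exits, tail may keep running until it next writes or until the timeout expires, so the call can take up to the full timeout to return.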

How to read a file for special lines in bash script

I want to read only the even-numbered lines of a file in a bash shell. How do I do it?
Also, how do I read just the fifth line of a file?
awk 'NR % 2 == 0' <filename>
For the second one:
awk 'NR == 5' <filename>
You can also use sed to get the lines in a specified range:
sed -ne '5,5p' <filename>
You could use the tail command. Put it in a for loop for the first case and the second is totally trivial if you get the first.
Or maybe you could even use awk:
awk NR==5 file_name
To read even-numbered lines using GNU sed:
sed -n "2~2 p" file
To print specific line # from a file using sed:
sed '5q;d' file
Awk is often the answer (or, nowadays, Perl, Python etc. too)
If for some reason you must do it with only bash and the basic shell utilities:
cat file | \
while IFS= read -r line; do
    i=$(( (i + 1) % 2 ))
    if [[ $i -eq 0 ]]; then
        echo "$line" # or whatever else you wanted to do with it
    fi
done
And to get a specific line:
cat file | head -5 | tail -1
Try this, for example, to get lines 3 through 6:
awk 'NR>=3 && NR<=6' input.txt
Here is a starting point for processing those lines one at a time (not complete):
test=`awk 'NR>=3 && NR<=6' input.txt`
while read line; do
    : # do stuff with "$line"
done <<< "$test"
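
The one-liners suggested in this thread can be compared on a small sample file. The following is a minimal sketch; the sample content and the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Build a six-line sample file: line1 .. line6.
sample=$(mktemp)
printf 'line%s\n' 1 2 3 4 5 6 > "$sample"

# Even-numbered lines: NR is awk's current line number.
even_lines=$(awk 'NR % 2 == 0' "$sample")

# Fifth line, two ways: sed prints only line 5; head/tail slice it out.
fifth_line=$(sed -n '5p' "$sample")
fifth_line_alt=$(head -n 5 "$sample" | tail -n 1)

# Lines 3 through 6.
lines_3_to_6=$(awk 'NR >= 3 && NR <= 6' "$sample")

printf 'even lines:\n%s\n' "$even_lines"
printf 'fifth line: %s\n' "$fifth_line"
rm -f "$sample"
```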

Bash Script to save grep -c results

I am new to programming altogether and am trying to write my first bash script.
I have a file called NUMBERS.txt that has various numbers in it, as such:
I would like to write a script to count the occurrence of each number, save it as a variable and print it into a new text file as such:
1001= 3
1000= 2
I am completely stuck.
Here's what I have so far:
for Count in `grep -c '1000' /NUMBERS.txt `
do
    echo 'Count = '${Count}
done
for Count in `grep -c '1001' /NUMBERS.txt `
do
    echo 'Count = '${Count}
done
Sort the file then count how many times each unique line occurs:
sort NUMBERS.txt | uniq -c
Since your file already has one number on each line, it is simpler (the -x option makes grep match whole lines only, so that 100 is not counted inside 1000):
for i in `sort -u NUMBERS.txt ` ; do count=`grep -cx "$i" NUMBERS.txt ` ; echo "$i=$count" ; done > your_result.txt
or in a different format
for i in `sort -u NUMBERS.txt `
do
    count=`grep -cx "$i" NUMBERS.txt `
    echo "$i=$count"
done > your_result.txt
As was pointed out, the performance is not very good. Here is a much better one (uniq -c prints the count first, so the fields are swapped to get the requested format):
sort NUMBERS.txt | uniq -c | awk '{print $2 "= " $1}'
Basically, the loop version goes through NUMBERS.txt in two passes: the first pass gets the unique numbers;
the second pass counts the occurrences of each unique number.
I'm not the best at shell script, but here is a solution that works, using bash and grep -c :
# INPUT and OUTPUT are assumed to hold the input and output file paths
rm -f "${OUTPUT}"
# you might want to change the values
for i in {1000..2000}; do
    for Count in `grep -c ${i} "${INPUT}"`; do
        echo "${i} = ${Count}" >> "${OUTPUT}"
    done
done
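
For comparison, the sort and uniq -c approach from the earlier answer can be wrapped in a function that prints the format requested in the question. This is a minimal sketch, assuming one number per line in the input file; the function name count_numbers is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: count how often each line occurs and print "NUMBER= COUNT".
# Usage: count_numbers FILE
count_numbers() {
    # uniq -c prints "COUNT VALUE", so awk swaps the two fields.
    sort -n "$1" | uniq -c | awk '{ print $2 "= " $1 }'
}
```

Redirecting the output, for example count_numbers NUMBERS.txt > your_result.txt, produces the result file in a single pass over the sorted data.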

AWK: execute CURL on each line and parse result

given an input stream with following lines:
I would like to call
curl -s
with xxx being the number from each line, and every time let an awk script fetch some information from the curl output, which is written to the output stream. I am wondering whether this is possible without using the awk "system()" call in the following way:
cat lines | grep "^[0-9]*$" | awk '
{
    system("curl -s " $0 \
        " | awk '\''{ print }'\''") # inner awk: parsing goes here
}'
You can use bash and avoid awk system call:
grep "^[0-9]*$" lines | while read line; do
    curl -s "$line" | awk 'do your parsing ...'
done
A shell loop would achieve a similar result, as follows:
for f in $(cat lines|grep "^[0-9]*$"); do
    curl -s "$f" | awk '{....}'
done
Alternative methods for doing similar tasks include using Perl or Python with an HTTP client.
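
The loop-based answers can be generalized a little by passing the fetch command in as arguments, which also makes the loop easy to try without network access. This is a rough sketch: the function name process_ids is made up, and the awk program is only a placeholder for the real parsing.

```shell
#!/usr/bin/env bash
# Sketch: run a fetch command once per numeric line and post-process its output.
# Usage: process_ids FILE FETCH_COMMAND...
process_ids() {
    local file=$1
    shift
    local id
    # -E '^[0-9]+$' keeps only non-empty, all-digit lines.
    grep -E '^[0-9]+$' "$file" | while IFS= read -r id; do
        # stdin is redirected so the fetch command cannot eat the loop's input.
        # Placeholder parsing: report how many lines the fetch produced.
        "$@" "$id" < /dev/null | awk -v id="$id" 'END { print id ": " NR " line(s)" }'
    done
}
```

A call such as process_ids lines curl -s runs curl -s once per ID, as in the answers above; substituting echo for the fetch command is a quick way to dry-run the loop.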
If IDs are appended to your file dynamically, you can daemonize a small while loop that keeps checking the file for more data, like this:
while IFS= read -d $'\n' -r a || sleep 1; do [[ -n "$a" ]] && curl -s "${a}"; done < lines.txt
Otherwise, if the file is static, you can change the sleep 1 to break; the loop will then read the file and quit when there is no data left.
