Exiting an AWK statement after printing a block of text - bash

My problem is that I have a very large database (10GB) and I want to save as much time as possible searching through it. I have an awk statement that is searching through the database and depending on the pattern, writes the data into another file.
I have an input file that will be fed into my script as a Terminal argument variable. There are several lines of data within it that will be used as the pattern for the awk statement.
Within the database, all the lines that match the pattern are sorted next to each other, so essentially, after printing, there is no need to search any further into the database because everything has already been found. Once the awk finds the first pattern-matching line, all the other pattern-matching lines follow it sequentially.
This problem is hard to explain with just words, so I've created a few examples of what my files, code, and the database look and operate like.
The input file via Terminal looks like this:
group_1
group_2
group_3
...
The 10GB database looks like this:
group_1 DATA ...
group_1 DATA ...
group_1 DATA ...
group_2 DATA ...
group_2 DATA ...
group_2 DATA ...
group_2 DATA ...
group_3 DATA ...
group_3 DATA ...
group_3 DATA ...
group_3 DATA ...
...
The script code with the awk statement in question looks like this:
IFS=$'\n'
set -f
for var in $(cat < "$1")
do
awk -v seq="$var" '{if (match($1, seq)) {print $0}}' filepath/database > pattern_matched.file
done
A brief explanation of what this code is doing: it takes in the Terminal argument variable, a filename in this case, and opens it up for the for loop to begin looping. The pattern group_1, for example, is placed in var and the search through the database begins. If the first column matches the pattern, the line is saved into the file pattern_matched.file.
Currently, it searches through the entire 10GB worth of data and prints the data into the file as intended, but it wastes a lot of time. After printing the lines that match the pattern, I want to stop the awk from continuing the search through the database and move on to the next pattern from the input file. An example behavior for group_2 would be: awk checks the first 3 lines of the database and sees that none of them match the pattern. However, line 4 contains the pattern, so it prints that line and the subsequent matching lines after it. When awk reaches line 8, it exits the awk statement and the for loop can then iterate to the next pattern to be searched for, group_3.
awk '{print $0; exit}' filename
Something like this does not work since it only prints the first instance and breaks out. I want something that can print all the matches and, as soon as it finds the next non-matching line, breaks out.
Thanks in advance.
UPDATE:
The current problem now is that the solution given below makes logical sense. If it enters the if-statement, it would print the line into the file and iterate to the next line. If the line did not match, it would enter the else-if statement and exit the awk. This makes a lot of sense to me, but for some reason, once the flag variable has been set to 1 by the if-statement for the first matched line, it enters the else-if statement. Since the else-if condition evaluates to true, it exits before even scanning the next line. I confirmed this behavior with print statements everywhere in the awk statement.
This is my code with print statements:
awk -v seq="$seqid" '{if(match($1, seq)) {print "matched" ; print $1 ; flag=1} else if (flag) {print "not matched" ; exit}}'
which outputs this:
[screenshot: weird behavior]

Can't you just read the input file (input_file) into awk:
$ cat input_file
group_1
group_3
Awk script:
$ awk 'NR==FNR{a[$0];next} $1 in a' input_file database
group_1 DATA ...
group_1 DATA ...
group_1 DATA ...
group_3 DATA ...
group_3 DATA ...
group_3 DATA ...
group_3 DATA ...

Your shell code:
for var in $(cat < "$1")
do
awk 'script' filepath/database > pattern_matched.file
done
is using an anti-pattern to read the input file stored in $1, see http://mywiki.wooledge.org/BashFAQ/001, and will overwrite pattern_matched.file on every iteration of the loop. You should, I suspect, have written it as:
while IFS= read -r var
do
awk 'script' filepath/database
done < "$1" > pattern_matched.file
Your awk code:
awk -v seq="$var" '{if (match($1, seq)) {print $0}}'
is using match() unnecessarily since you just want to do a regexp comparison and aren't using the variables that match() populates to help you isolate the matching string (RSTART/RLENGTH), and it's using a default null condition, putting the real condition in the action space, and hard-coding the default action of printing the current record. It's equivalent to just:
awk -v seq="$var" '$1 ~ seq'
but I'm not convinced you actually need a regexp comparison - given your example you should be doing a string comparison instead:
awk -v seq="$var" '$1 == seq'
Since your posted example may be misleading, just choose whichever of these is appropriate based on whether you want a regexp or string comparison, and a partial or full match on $1:
awk -v seq="$var" '$1 == seq' # full string
awk -v seq="$var" 'index($1,seq)' # partial string
awk -v seq="$var" '$1 ~ ("^"seq"$")' # full regexp
awk -v seq="$var" '$1 ~ seq' # partial regexp
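To see how the four variants differ in practice, here's a small sketch using made-up sample data (db.tmp and its contents are illustrative only) where group_1 and group_10 share a prefix, so the partial variants match both:

```shell
# Made-up sample: group_1 is a prefix of group_10
printf 'group_1 DATA\ngroup_10 DATA\n' > db.tmp

awk -v seq="group_1" '$1 == seq' db.tmp          # full string:    1 line
awk -v seq="group_1" 'index($1,seq)' db.tmp      # partial string: 2 lines
awk -v seq="group_1" '$1 ~ ("^"seq"$")' db.tmp   # full regexp:    1 line
awk -v seq="group_1" '$1 ~ seq' db.tmp           # partial regexp: 2 lines

rm -f db.tmp
```

With group keys like these, a partial match would silently pull in the wrong groups, which is why the full string comparison is usually the safe default.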
Let's say we go with the first one, the full string match; then to exit once the matching $1 has been processed would be:
awk -v seq="$var" '$1 == seq{print; f=1; next} f{exit}'
which would make your full code:
while IFS= read -r var
do
awk -v seq="$var" '$1 == seq{print; f=1; next} f{exit}' filepath/database
done < "$1" > pattern_matched.file
BUT I doubt if you need a shell loop at all and you could just do this instead:
awk 'NR==FNR{seqs[$1]; next} $1 in seqs' "$1" filepath/database > pattern_matched.file
or some other variant that just has awk (or maybe just join) read the input files once. You can make the above exit after all seqs[] have been processed by:
awk '
NR==FNR { seqs[$1]; numSeqs++; next }
$1 in seqs { print; if ($1 != prev) numSeqs--; prev = $1; next }
numSeqs == 0 { exit }
' "$1" filepath/database > pattern_matched.file
or similar.
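A minimal, self-contained sketch of this early-exit idea (the throwaway file names seqs.tmp and db.tmp stand in for "$1" and the real database):

```shell
# Throwaway sample inputs standing in for the real files
printf 'group_1\ngroup_2\n' > seqs.tmp
printf 'group_1 A\ngroup_2 B\ngroup_2 C\ngroup_3 D\n' > db.tmp

awk '
NR==FNR      { seqs[$1]; numSeqs++; next }
$1 in seqs   { print; if ($1 != prev) numSeqs--; prev = $1; next }
numSeqs == 0 { exit }   # every group has been seen; stop reading the big file
' seqs.tmp db.tmp

rm -f seqs.tmp db.tmp
```

This prints the group_1 and group_2 lines and exits on the first group_3 line, never scanning the rest of the file.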

I think this should do the trick:
awk -v seq="$var" '{if (match($1, seq)) {print $0; found=1} else if (found) { exit }}'
Similar to David C. Rankin's answer, but there is no need to pass the rd=0 argument to awk, since in awk any uninitialized variable is initialized to zero when it's first used.

As we do not really know what you intend to do with your program, I will just give you an awk solution:
awk -v seq="$var" '($1!=seq) { if(p) exit; next }($1==seq){p=1}p'
This uses the flag p to check if it already met the sequence seq. A simple if condition determines if it should exit awk or move to the next record. Exiting is done after seq is found, moving to the next record is done before.
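A quick way to convince yourself of that control flow, using a made-up db.tmp:

```shell
printf 'g1 A\ng2 B\ng2 C\ng3 D\n' > db.tmp

# g1: no match yet, p unset -> next
# g2 lines: p set to 1, trailing p pattern prints them
# g3: first non-match after the block, p is set -> exit
awk -v seq="g2" '($1!=seq) { if(p) exit; next }($1==seq){p=1}p' db.tmp

rm -f db.tmp
```

Only the two g2 lines are printed, and awk exits as soon as it reaches g3.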
However, since you place this in a loop, this will read the file over and over again. If you want to make a subselection, you could use the solution of James Brown.

Related

Prepend text to specific line numbers with variables

I have spent hours trying to solve this. There are a bunch of answers as to how to prepend to all lines or specific lines, but not with variable text and a variable line number.
while [ $FirstVariable -lt $NextVariable ]; do
#sed -i "$FirstVariables/.*/$FirstVariableText/" "$PWD/Inprocess/$InprocessFile"
cat "$PWD/Inprocess/$InprocessFile" | awk 'NR==${FirstVariable}{print "$FirstVariableText"}1' > "$PWD/Inprocess/Temp$InprocessFile"
FirstVariable=$[$FirstVariable+1]
done
Essentially I am looking for a particular string delimiter and then figuring out where the next one is and appending the first result back into the following lines... Note that I have already figured out the logic; I am just having issues prepending the line with the variables.
Example:
This >
Line1:
1
2
3
Line2:
1
2
3
Would turn into >
Line1:
Line1:1
Line1:2
Line1:3
Line2:
Line2:1
Line2:2
Line2:3
You can do all of that using the awk one-liner below.
Assuming your pattern starts with Line, the below script can be used.
> awk '{if ($1 ~ /Line/ ){var=$1;print $0;}else{ if ($1 !="")print var $1}}' $PWD/Inprocess/$InprocessFile
Line1:
Line1:1
Line1:2
Line1:3
Line2:
Line2:1
Line2:2
Line2:3
Here is how the above script works:
If the first field of a record contains the word Line, it is copied into the awk variable var and the line is printed as-is. For every following non-empty record, var is prepended to the record and the result is printed, producing the desired output.
If you need to pass the variables dynamically from the shell to awk, you can use the -v option, like below:
awk -v var1="$FirstVariable" -v var2="$FirstVariableText" 'NR==var1{print var2}1' "$PWD/Inprocess/$InprocessFile" > "$PWD/Inprocess/Temp$InprocessFile"
The way you addressed the problem is by using both bash and awk to process the file. You make use of bash to extract a line, and then use awk to manipulate that one line. The whole thing can actually be done with a single awk script:
awk '/^Line/{str=$1; print; next}{print (NF ? str $0 : "")}' inputfile > outputfile
or
awk 'BEGIN{RS="";ORS="\n\n";FS=OFS="\n"}{gsub(FS,OFS $1)}1' inputfile > outputfile
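Note the second one-liner runs awk in paragraph mode (RS=""), so it assumes the blocks are separated by blank lines. A small sketch with made-up input:

```shell
# Blocks must be separated by blank lines for paragraph mode to work
printf 'Line1:\n1\n2\n\nLine2:\n1\n2\n' > blocks.tmp

# Each newline inside a block ($1 is the block header) is replaced by
# newline + header, prepending the header to every following line.
awk 'BEGIN{RS="";ORS="\n\n";FS=OFS="\n"}{gsub(FS,OFS $1)}1' blocks.tmp

rm -f blocks.tmp
```

Each block comes out with its header prepended to every one of its lines, e.g. Line1: followed by Line1:1, Line1:2.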

Using a value from stored in a different file awk

I have a value stored in a file named cutoff1
If I cat cutoff1 it will look like
0.34722
I want to use the value stored in cutoff1 inside an awk script. Something like following
awk '{ if ($1 >= 'cat cutoff1' print $1 }' hist1.dat >hist_oc1.dat
I think I am making some mistakes. If I do manually it will look like
awk '{ if ($1 >= 0.34722) print $1 }' hist1.dat >hist_oc1.dat
How can I use the value stored in cutoff1 file inside the above mentioned awk script?
The easiest ways to achieve this are
awk -v cutoff="$(cat cutoff1)" '($1 >= cutoff){print $1}' hist.dat
awk -v cutoff="$(< cutoff1)" '($1 >= cutoff){print $1}' hist.dat
or
awk '(NR==FNR){cutoff=$1;next}($1 >= cutoff){print $1}' cutoff1 hist.dat
or
awk '($1 >= cutoff){print $1}' cutoff="$(cat cutoff1)" hist.dat
awk '($1 >= cutoff){print $1}' cutoff="$(< cutoff1)" hist.dat
Note: thanks to Glenn Jackman for pointing to man bash, Command Substitution:
Bash performs the expansion by executing command and replacing the command substitution with the standard output of the command, with any trailing newlines deleted. Embedded newlines are not deleted, but they may be removed during word splitting. The command substitution $(cat file) can be replaced by the equivalent but faster $(< file).
Since awk can read multiple files, just add the filename before your data file and treat the first line specially. No need for an external variable declaration.
awk 'NR==1{cutoff=$1; next} $1>=cutoff{print $1}' cutoff data
PS: Just noticed that it's similar to @kvantour's second answer, but keeping it here as a different flavor.
You could use getline to read a value from another file at your convenience. First the main file to process:
$ cat > file
wait
wait
did you see that
nothing more to see here
And cutoff:
$ cat cutoff
0.34722
An awk script that reads a line from cutoff when it meets the string see in a record:
$ awk '/see/{if((getline val < "cutoff") > 0) print val}1' file
wait
wait
0.34722
did you see that
nothing more to see here
Explained:
$ awk '
/see/ { # when string see is in the line
if((getline val < "cutoff") > 0) # read a value from cutoff if there are any available
print val # and output the value from cutoff
}1' file # output records from file
As there was only one value, it was printed only once even though see was seen twice.

bash grep for string and ignore above one line

One of my scripts returns output as below,
NameComponent=Apache
Fixed=False
NameComponent=MySQL
Fixed=True
So from the above output, I am trying to ignore the two lines below using grep -vB1 'False', which does not seem to work:
NameComponent=Apache
Fixed=False
Is it possible to do this using grep, or is there a better way with awk?
<some-command> |tac |sed -e '/False/ { N; d}' |tac
NameComponent=MySQL
Fixed=True
For every line that matches "False", the code in the {} gets executed. N pulls the next line into the pattern space as well, and then d deletes the whole thing before moving on to the next line. Note: using multiple pipes is not considered good practice.
@Karthi1234: If your Input_file is the same as the provided samples, then try:
awk -F' |=' '($2 != "Apache" && $2 != "False")' Input_file
First we make the field separator a space or =, then check that the 2nd field's value is not equal to the string Apache or the string False; since no action is mentioned, awk performs its default action of printing the line.
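To see how -F' |=' splits each line, here is a quick sketch on the sample input (in.tmp is a throwaway file name):

```shell
printf 'NameComponent=Apache\nFixed=False\n' > in.tmp

# FS is the regexp "space or =", so $2 is the value after the =
awk -F' |=' '{ print $1 "|" $2 }' in.tmp

rm -f in.tmp
```

This prints NameComponent|Apache and Fixed|False, showing that $2 carries exactly the value the condition compares against.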
EDIT: as per OP's request following is the code changed one, try:
awk '!/Apache/ && !/False/' Input_file
You could change strings too in case if these are not the ones which you want, logic should be same.
EDIT2: eg--> You could change values of string1 and string2 and increase the conditions if needed as per your requirement.
awk '!/string1/ && !/string2/' Input_file
If I understand the question correctly, you will always have a line before "Fixed=..." and you want to print both lines if and only if "Fixed=True".
The following awk should do the trick:
< command > | awk 'BEGIN {prev="NA"} {if ($0=="Fixed=True") {print prev; print $0;} prev=$0;}'
Note that if the first line is "Fixed=True" it will print the string "NA" as the first line.

retaining text after delimiter in fasta headers using awk

I have what should be a simple problem, but my lack of awk knowledge is holding me back.
I would like to clean up the headers of a fasta file that is in this format:
>HWGG454_Clocus2_Locus3443_allele1
ATTCTACTACTACTCT
>GHW757_clocus37_Locus555662_allele2
CTTCCCTACGATG
>TY45_clocus23_Locus800_allele0
TTCTACTTCATCT
I would like to clean up each header (line starting with ">") to retain only the informative part, which is the second "_Locus*" with or without the allele part.
I thought awk would be the easy way to do this, but I can't quite get it to work.
If I wanted to retain just the first column of text up to the "_" delimiter for the header, and the sequences below, I run this (assuming this toy example is in the file test.fasta):
cat test.fasta | awk -F '_' '{print $1}'
>HWGG454
ATTCTACTACTACTCT
>GHW757
CTTCCCTACGATG
>TY45
TTCTACTTCATCT
But what I want is to retain just the "Locus*" text, which is the third _-separated field; using this code I get this:
cat test.fasta | awk -F '_' '{print $3}'
Locus3443
Locus555662
Locus800
What am I doing wrong here?
thanks.
I understand this to mean that you want to pick the Locus field from the header lines and leave the others unchanged. Then:
awk -F _ '/^>/ { print $3; next } 1' filename
is perhaps the easiest way. This works as follows:
/^>/ { # in lines that begin with >
print $3 # print the third field
next # and go to the next line.
}
1 # print other lines unchanged. Here 1 means true, and the
# default action (unchanged printing) is performed.
The thing to understand here is awk's control flow: awk code consists of conditions with associated actions, and the actions are performed if the condition evaluates to true.
/^>/ is a regex match over the whole record (line by default); it is true if the line begins with > (because ^ matches the beginning), so
/^>/ { print $3; next }
will make awk execute print $3; next in lines that begin with >. The less straightforward part is
1
which prints lines unchanged. We only get here if the first action was not executed (because of the next in it), and this 1 is to be read as a condition that is always true -- nonzero values being true in awk.
Now, if either the condition or the action in an awk statement are omitted, a default is used. The default action is printing the line unchanged, and this takes advantage of it. It would be equally possible to write
1 { print }
or
{ print }
In the latter case, the condition is omitted and the default condition "true" is used. 1 is the shortest variant of this and idiomatic because of it.
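All three spellings behave identically, which is easy to check on a throwaway file:

```shell
printf 'a\nb\nc\n' > t.tmp

awk '1' t.tmp           # condition only; default action is print
awk '1 { print }' t.tmp # explicit condition and action
awk '{ print }' t.tmp   # action only; default condition is true

rm -f t.tmp
```

Each invocation prints the file unchanged.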
$ awk -F_ '{print (/^>/ ? $3 : $0)}' file
Locus3443
ATTCTACTACTACTCT
Locus555662
CTTCCCTACGATG
Locus800
TTCTACTTCATCT
You need a second awk match for the sequence rows below, e.g.:
cat test.fasta | awk -F _ '/^>/ { print $3"_"$4 } /^[A-Z]/ { print $1 }'
Output:
Locus3443_allele1
ATTCTACTACTACTCT
Locus555662_allele2
CTTCCCTACGATG
Locus800_allele0
TTCTACTTCATCT
If you don't want the _allele1 bit remove "_"$4 from the awk script.
You can just do a regex on each line:
$ awk '{ sub(/^.*_L/,"L"); print $0}' /tmp/fasta.txt
Locus3443_allele1
ATTCTACTACTACTCT
Locus555662_allele2
CTTCCCTACGATG
Locus800_allele0
TTCTACTTCATCT

searching multi-word patterns from one file in another using awk

patterns file:
wicked liquid
movie
guitar
balance transfer offer
drive car
bigfile file:
wickedliquidbrains
drivelicense
balanceofferings
using awk on command line:
awk '/balance/ && /offer/' bigfile
i get the result i want which is
balanceofferings
awk '/wicked/ && /liquid/' bigfile
gives me
wickedliquidbrains, which is also good..
awk '/drive/ && /car/' bigfile
does not give me drivelicense, which is also good, as I am using &&.
Now, when trying to pass a shell variable containing those '/regex1/ && /regex2/ ...' strings to awk:
awk -v search="$out" '$0 ~ search' "$bigfile"
awk does not run. What may be the problem?
Try this:
awk "$out" "$bigfile"
When you do $0 ~ search, the value of search has to be a regular expression. But you were setting it to a string containing a bunch of regexps with && between them -- that's not a valid regexp.
To perform an action on the lines that match, do:
awk "$out"' { /* do stuff */ }' "$bigfile"
I switched from double quotes to single quotes for the action in case the action uses awk variables with $.
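For instance, a hypothetical $out and big.tmp (both made up for illustration) could look like this:

```shell
# Hypothetical values standing in for the real script's variables
out='/balance/ && /offer/'
printf 'wickedliquidbrains\ndrivelicense\nbalanceofferings\n' > big.tmp

awk "$out" big.tmp                         # prints: balanceofferings
awk "$out"' { print "hit: " $0 }' big.tmp  # same match, custom action

rm -f big.tmp
```

The first call uses $out as a bare condition with the default print action; the second appends a single-quoted action block, so only the condition part is expanded by the shell.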
UPDATED
An alternative to Barmar's solution, with arguments passed with -v:
awk -v search="$out" 'match($0,search)' "$bigfile"
Test:
$ echo -e "one\ntwo"|awk -v luk=one 'match($0,luk)'
one
Passing two (real) regexs (EREs) to awk:
echo -e "one\ntwo\nnone"|awk -v re1=^o -v re2=e$ 'match($0,re1) && match($0,re2)'
Output:
one
If you want to read the patterns file and match against all of its rows, you could try something like this:
awk 'NR==FNR{N=NR;re[N,0]=split($0,a);for(i in a)re[N,i]=a[i];next}
{
for(i=1;i<=N;++i) {
#for(j=1;j<=re[i,0]&&match($0,re[i,j]);++j);
for(j=1;j<=re[i,0]&&$0~re[i,j];++j);
if(j>re[i,0]){print;break}
}
}' patterns_file bigfile
Output:
wickedliquidbrains
In the first line it reads and stores the patterns file in the 2D array re. Each row contains the split input string; the 0th element of each row holds the number of words in that row.
Then it reads bigfile. Each line of bigfile is tested against the rows of re. If all items in a row match, the line is printed and the inner loop breaks.
