setting the NR to 1 does not work (awk) - bash

I have the following script in bash.
awk -F ":" '{if($1 ~ "^fall")
{ NR==1
{{printf "\t<course id=\"%s\">\n",$1} } } }' file1.txt > container.xml
I have a small file. If ANY line starts with fall, then I want the first field of the VERY first line.
So I set NR==1 inside the if block. However, it does not do the job!

Try this:
awk -F: 'NR==1 {id=$1} $1~/^fall/ {printf "\t<course id=\"%s\">\n",id}' file1.txt > container.xml
Notes:
NR==1 {id=$1}
This saves the course ID from the first line
$1~/^fall/ {printf "\t<course id=\"%s\">\n",id}
If any line begins with fall, then the course ID is printed.
The above code illustrates that awk commands can be preceded by conditions. Thus, id=$1 is executed only if we are on the first line: NR==1. In this way, it is often unnecessary to have explicit if statements.
In awk, assignment is done with = while tests for equality are done with ==.
If this doesn't do what you want, then please add sample input and corresponding desired output to the question.
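As a minimal sketch of the condition/action form described above, the following feeds three made-up lines to the command on stdin. The sample data (CS101 and so on) is an assumption, since the question did not show file1.txt.

```shell
# Hypothetical sample input; the real file1.txt was not shown in the question.
out=$(printf '%s\n' 'CS101:intro:3' 'spring:2014' 'fall:2015' |
  awk -F: '
    NR == 1     { id = $1 }                                # condition NR==1, assignment id=$1
    $1 ~ /^fall/ { printf "\t<course id=\"%s\">\n", id }   # runs for each line starting with fall
  ')
printf '%s\n' "$out"
```

Here NR==1 is only ever used as a test, and the single = appears only inside the action, which is the distinction the original attempt mixed up.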

awk -F: 'NR==1{x=$1}/^fall/{printf "\t<course id=\"%s\">\n",x;exit}' file
Note:
if the file has any line beginning with fall, print the 1st field of the very first line in a certain format (xml tag).
no matter how many lines start with fall, it outputs the xml tag only once.
if the file has no line starting with fall, it outputs nothing.

#!awk -f
BEGIN {
FS = ":"
}
NR==1 {
foo = $1
}
/^fall/ {
printf "\t<course id=\"%s\">\n", foo
}
Also note
BUGS
The -F option is not necessary given the command line variable assignment
feature; it remains only for backwards compatibility.
awk man page

Related

Turning multi-line string into single comma-separated list in Bash

I have this format:
host1,app1
host1,app2
host1,app3
host2,app4
host2,app5
host2,app6
host3,app1
host4... and so on.
I need it like this format:
host1;app1,app2,app3
host2;app4,app5,app6
I have tried this: awk -vORS=, '{ print $2 }' data | sed 's/,$/\n/'
and it gives me this:
app1,app2,app3 without the host in front.
I do not want to show duplicates.
I do not want this:
host1;app1,app1,app1,app1...
host2;app1,app1,app1,app1...
I want this format:
host1;app1,app2,app3
host2;app2,app3,app4
host3;app2,app3
With input sorted on the first column (as in your example; otherwise just pipe it to sort), you can use the following awk command:
awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }'
Compared with the other solutions posted as of this edit, it avoids holding the whole data set in memory. The cost is that the input must be sorted by host; if it is not, sorting it first is the step that may need a lot of memory.
Explanation :
the first line initializes the currentHost and currentApps variables to the values of the first line of the input
the second line handles a line with the same host as the previous one: the app mentioned in the line is appended to the currentApps variable
the third line handles a line with a different host than the previous one: the information for the previous host is printed, then the variables are reinitialized to the values of the current line of input
the last line prints the information for the current host once the end of the input has been reached
It can probably be refined (so much redundancy!), but I'll leave that to someone more experienced with awk.
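One possible refinement, offered only as a sketch: the three NR-based rules can be folded into a single "new host" rule, and adjacent duplicate apps can be skipped by remembering the last app seen. This assumes the input has been sorted on the whole line (for example with plain sort), so that both equal hosts and equal apps are adjacent. The variable names host, apps and last are made up for this example.

```shell
# Sample data is hypothetical; it includes one duplicate pair on purpose.
out=$(printf '%s\n' 'host1,app1' 'host1,app2' 'host1,app2' 'host2,app4' 'host3,app1' |
  awk -F, '
    NR == 1 || $1 != host {                      # first line, or the host changed
        if (NR > 1) print host ";" apps          # flush the previous host
        host = $1; apps = $2; last = $2
        next
    }
    $2 != last { apps = apps "," $2; last = $2 } # same host: append unless it repeats the last app
    END { if (NR > 0) print host ";" apps }      # flush the final host
  ')
printf '%s\n' "$out"
```

Like the original command, this keeps only one host's worth of data in memory at a time.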
$ awk '
BEGIN { FS=","; ORS="" }
$1!=prev { print ors $1; prev=$1; ors=RS; OFS=";" }
{ print OFS $2; OFS=FS }
END { print ors }
' file
host1;app1,app2,app3
host2;app4,app5,app6
host3;app1
Maybe something like this:
#!/bin/bash
declare -A hosts
while IFS=, read host app
do
[ -z "${hosts["$host"]}" ] && hosts["$host"]="$host;"
hosts["$host"]+=$app,
done < testfile
printf "%s\n" "${hosts[@]%,}" | sort
The script reads the sample data from testfile and outputs to stdout.
You could try this awk script:
awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file
The script creates an entry in the array a for each unique value in the first column, and appends to that entry every value from the second column.
Once the whole file has been read, the END block prints the contents of the array. Note that for (i in a) visits the entries in an unspecified order.
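Since the question asks for no duplicates, here is a hedged variation on the array approach that also works on unsorted input. It is a sketch, not the answerer's code: the seen, apps and order arrays are names chosen for this example, and hosts are printed in the order they first appear rather than in for-in order.

```shell
# Hypothetical unsorted sample with one repeated host/app pair.
out=$(printf '%s\n' 'host1,app1' 'host2,app4' 'host1,app1' 'host1,app2' |
  awk -F, '
    !seen[$1, $2]++ {                          # only the first time this host/app pair appears
        if ($1 in apps) {
            apps[$1] = apps[$1] "," $2         # known host: append the app
        } else {
            order[++n] = $1                    # new host: remember first-seen order
            apps[$1] = $2
        }
    }
    END { for (i = 1; i <= n; i++) printf "%s;%s\n", order[i], apps[order[i]] }
  ')
printf '%s\n' "$out"
```

The trade-off is the one mentioned earlier in the thread: every distinct pair is held in memory until the END block runs.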

bash grep for string and ignore above one line

One of my scripts returns output like this:
NameComponent=Apache
Fixed=False
NameComponent=MySQL
Fixed=True
From the above output, I am trying to drop the lines below using grep -vB1 'False', which does not seem to work:
NameComponent=Apache
Fixed=False
Is it possible to do this with grep, or is there a better way with awk?
<some-command> |tac |sed -e '/False/ { N; d}' |tac
NameComponent=MySQL
Fixed=True
The first tac reverses the input, so each Fixed= line comes before its NameComponent= line. For every line that matches "False", the code in the {} gets executed: N appends the next line to the pattern space, and d then deletes the whole thing before moving on to the next line. The second tac restores the original order. Note: chaining this many pipes is generally not considered good practice.
@Karthi1234: If your Input_file is the same as the samples provided, then try:
awk -F' |=' '($2 != "Apache" && $2 != "False")' Input_file
This sets the field separator to either a space or =, then checks that the second field is neither the string Apache nor False. No action is given, so awk performs its default action and prints the line.
EDIT: As per the OP's request, here is the changed code; try:
awk '!/Apache/ && !/False/' Input_file
You can substitute other strings if these are not the ones you want; the logic stays the same.
EDIT2: For example, change the values of string1 and string2, and add more conditions as your requirement needs:
awk '!/string1/ && !/string2/' Input_file
If I understand the question correctly you will always have a line before "Fixed=..." and you want to print both lines if and only if "Fixed=True"
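The following awk should do the trick:
< command > | awk 'BEGIN {prev="NA"} {if ($0=="Fixed=True") {print prev; print $0;} prev=$0;}'
Note that prev must be set with double quotes: single quotes inside the single-quoted awk program would end the shell string and leave prev empty. If the first line is "Fixed=True", this prints the string "NA" as the first line.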
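A variation on the same idea, sketched under the assumption that the output strictly alternates NameComponent= and Fixed= lines as in the sample: hold each name line and print the pair only when the following line is Fixed=True. The variable name is arbitrary, and no "NA" placeholder is needed.

```shell
# Sample data copied from the question.
out=$(printf '%s\n' 'NameComponent=Apache' 'Fixed=False' 'NameComponent=MySQL' 'Fixed=True' |
  awk '
    /^NameComponent=/  { name = $0; next }     # hold the name line, print nothing yet
    $0 == "Fixed=True" { print name; print }   # emit the held name, then the current line
  ')
printf '%s\n' "$out"
```

Unlike the pattern-exclusion one-liners, this does not depend on knowing the component names in advance.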

retaining text after delimiter in fasta headers using awk

I have what should be a simple problem, but my lack of awk knowledge is holding me back.
I would like to clean up the headers of a fasta file that is in this format:
>HWGG454_Clocus2_Locus3443_allele1
ATTCTACTACTACTCT
>GHW757_clocus37_Locus555662_allele2
CTTCCCTACGATG
>TY45_clocus23_Locus800_allele0
TTCTACTTCATCT
I would like to clean up each header (line starting with ">") to retain only the informative part, which is the second "_Locus*" with or without the allele part.
I thought awk would be the easy way to do this, but I can't quite get it to work.
If I wanted to retain just the first column of text up to the "_" delimiter for the header, and the sequences below, I run this (assuming this toy example is in the file test.fasta):
cat test.fasta | awk -F '_' '{print $1}'
>HWGG454
ATTCTACTACTACTCT
>GHW757
CTTCCCTACGATG
>TY45
TTCTACTTCATCT
But what I want is to retain just the "Locus*" text, which is the third field (after the second delimiter), together with the sequences. Using this code, I lose the sequence lines:
cat test.fasta | awk -F '_' '{print $3}'
Locus3443
Locus555662
Locus800
What am I doing wrong here?
thanks.
I understand this to mean that you want to pick the Locus field from the header lines and leave the others unchanged. Then:
awk -F _ '/^>/ { print $3; next } 1' filename
is perhaps the easiest way. This works as follows:
/^>/ { # in lines that begin with >
print $3 # print the third field
next # and go to the next line.
}
1 # print other lines unchanged. Here 1 means true, and the
# default action (unchanged printing) is performed.
The thing to understand here is awk's control flow: awk code consists of conditions with associated actions, and the actions are performed if the condition evaluates to true.
/^>/ is a regex match over the whole record (line by default); it is true if the line begins with > (because ^ matches the beginning), so
/^>/ { print $3; next }
will make awk execute print $3; next in lines that begin with >. The less straightforward part is
1
which prints lines unchanged. We only get here if the first action was not executed (because of the next in it), and this 1 is to be read as a condition that is always true -- nonzero values being true in awk.
Now, if either the condition or the action in an awk statement is omitted, a default is used. The default action is printing the line unchanged, and this takes advantage of it. It would be equally possible to write
1 { print }
or
{ print }
In the latter case, the condition is omitted and the default condition "true" is used. 1 is the shortest variant of this and idiomatic because of it.
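To make the equivalence concrete, this small sketch runs the three spellings on the same two-line sample and compares the results. The header and sequence used here are made up for the example.

```shell
# Hypothetical two-line FASTA sample.
input=$(printf '%s\n' '>A_b_Locus1_allele0' 'ACGT')

a=$(printf '%s\n' "$input" | awk -F_ '/^>/ { print $3; next } 1')            # bare condition
b=$(printf '%s\n' "$input" | awk -F_ '/^>/ { print $3; next } 1 { print }')  # condition and action
c=$(printf '%s\n' "$input" | awk -F_ '/^>/ { print $3; next } { print }')    # bare action

printf '%s\n' "$a"
[ "$a" = "$b" ] && [ "$b" = "$c" ] && echo "all three forms agree"
```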
$ awk -F_ '{print (/^>/ ? $3 : $0)}' file
Locus3443
ATTCTACTACTACTCT
Locus555662
CTTCCCTACGATG
Locus800
TTCTACTTCATCT
You need a second awk pattern to handle the sequence line below each header, e.g.:
cat test.fasta | awk -F _ '/^>/ { print $3"_"$4 } /^[A-Z]/ {print $1}'
Output:
Locus3443_allele1
ATTCTACTACTACTCT
Locus555662_allele2
CTTCCCTACGATG
Locus800_allele0
TTCTACTTCATCT
If you don't want the _allele1 bit, remove "_"$4 from the awk script.
You can just do a regex on each line:
$ awk '{ sub(/^.*_L/,"L"); print $0}' /tmp/fasta.txt
Locus3443_allele1
ATTCTACTACTACTCT
Locus555662_allele2
CTTCCCTACGATG
Locus800_allele0
TTCTACTTCATCT

how can i make awk process the BEGIN block for each file it parses?

i have an awk script that i'm running against a pair of files. i'm calling it like this:
awk -f script.awk file1 file2
script.awk looks something like this:
BEGIN {FS=":"}
{ if( NR == 1 )
{
var=$2
FS=" "
}
else print var,"|",$0
}
the first line of each file is colon-delimited. for every other line, i want it to return to the default whitespace field separator.
this works fine for the first file, but fails because FS is not reset to : after each file, because the BEGIN block is only processed once.
tldr: is there a way to make awk process the BEGIN block once for each file i pass it?
i'm running this on cygwin bash, in case that matters.
If you're using gawk version 4 or later there's the BEGINFILE block. From the manual:
BEGINFILE and ENDFILE are additional special patterns whose bodies are executed before reading the first
record of each command line input file and after reading the last record of each file. Inside the BEGINFILE
rule, the value of ERRNO will be the empty string if the file could be opened successfully. Otherwise, there
is some problem with the file and the code should use nextfile to skip it. If that is not done, gawk produces
its usual fatal error for files that cannot be opened.
For example:
touch a b c
awk 'BEGINFILE { print "Processing: " FILENAME }' a b c
Output:
Processing: a
Processing: b
Processing: c
Edit - a more portable way
As noted by DennisWilliamson you can achieve a similar effect with FNR == 1 at the beginning of your script. In addition to this you could change FS from the command-line directly, e.g.:
awk -f script.awk FS=':' file1 FS=' ' file2
Each assignment takes effect when awk reaches it in the argument list, just before the next file is read, and FS keeps that value until something changes it again.
Instead of:
BEGIN {FS=":"}
use:
FNR == 1 {FS=":"}
The FNR variable should do the trick for you. It's the same as NR except it is scoped within the file, so it resets to 1 for every input file.
http://unstableme.blogspot.ca/2009/01/difference-between-awk-nr-and-fnr.html
http://www.unix.com/shell-programming-scripting/46931-awk-different-between-nr-fnr.html
When you want a POSIX-compliant version, the best approach is:
(FNR == 1) { FS=":"; $0=$0 }
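This says that when the per-file record number (FNR) equals one, the field separator FS is reset. Changing FS does not re-split the record that has already been read, so $0 is assigned to itself to force awk to re-parse it with the new separator, which resets all the fields and the NF built-in variable.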
This is equivalent to the GNU awk 4.x BEGINFILE if and only if the record separator (RS) stays unchanged.
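Putting the pieces together, here is a hedged sketch of how the original script might look with the FNR == 1 approach. The file names and contents are invented for the example; the logic (colon-split header line, whitespace-split body) follows the question.

```shell
# Two throwaway input files with hypothetical contents.
dir=$(mktemp -d)
printf '%s\n' 'id:alpha' 'one two'    > "$dir/file1"
printf '%s\n' 'id:beta'  'three four' > "$dir/file2"

out=$(awk '
  FNR == 1 {            # first record of each input file
      FS = ":"
      $0 = $0           # re-split this record using the new FS
      var = $2
      FS = " "          # back to whitespace splitting for the rest of the file
      next
  }
  { print var, "|", $0 }
' "$dir/file1" "$dir/file2")
printf '%s\n' "$out"
rm -rf "$dir"
```

Because the rule is keyed on FNR rather than NR, var is refreshed from the header of each file in turn.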

why this code is not working?

#! /bin/ksh
awk -F':' '{
if( match($0,":server:APBS") )
{
print x;
x=$0;
}
}' iws_config4.dat
You could write your program in idiomatic awk as:
awk -F':' '/:server:APBS/ { print x; x=$0; }' iws_config4.dat
An awk program consists of patterns and the actions to take when those patterns match. What you wrote is tantamount to abusing the built-in facilities of awk.
Given that you're only interested in $0, the field separator is redundant, so the -F':' argument could go.
What your program does is:
read a line.
if it matches the pattern
print the last line that matched
save the current line
So, if your input contains one match, you see nothing output (or, more precisely, a blank line). If your input contains two matches, you see the first; if three, the first two; and so on.
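The behaviour described above can be checked with a short run on made-up input containing three matching lines. The expected result is a blank line followed by the first two matches, because each match prints the previously saved match.

```shell
# Hypothetical input with three matching lines.
out=$(printf '%s\n' 'a' ':server:APBS 1' 'b' ':server:APBS 2' ':server:APBS 3' |
  awk '/:server:APBS/ { print x; x = $0 }')   # x is empty on the first match
printf '%s\n' "$out"
```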
Given the desire to emulate 'grep -B1 :server:APBS iws_config4.dat', you can do:
awk '/:server:APBS/ { print old }
{ old = $0 }' iws_config4.dat
If the line matches, print the old saved line.
Regardless, store the current line as the (new) old saved line.
It probably can all be flattened onto one line. It is crucial that the pattern match precede the unconditional save.
Given the script k.awk and data file iws_config4.dat shown, I get the output I expect. What do you get? What do you expect?
$ cat iws_config4.dat
One line of text followed by the marker
:server:APBS blah blah bah 1
:server:APBS blah blah bah 2
More text
Blah blah blah
$ cat k.awk
awk '/:server:APBS/ { print old }
{ old = $0 }' iws_config4.dat
$ sh k.awk
One line of text followed by the marker
:server:APBS blah blah bah 1
$
If blank lines could be the trouble, only save non-blank lines:
awk '/:server:APBS/ { print old }
/[^ ]/ { old = $0 }' iws_config4.dat
The second line now is only active on lines that contain at least one non-blank character, so only those lines will be saved.

Resources