Find which rows have fewer columns using awk - shell

I have a file where 4 fields are expected in each row. If a row has fewer fields, I want to write that information, along with the row number, to a logfile.
Field1line1| Field2line1| Field3line1| Field4line1
Field1line2| Field2line2|
Field1line3| Field2line3| Field3line3| Field4line3
Something like: "Row number 2 has 3 fields in file a.txt".
Can we achieve this using awk?
I am currently using the code snippet below. If the number of fields is not 4, I write the row to a bad file; that part works fine. But I am unable to write the NR value to the log.
awk -F'|' -v DGFNM="$IN_DIR$DGFNAME" -v DBFNM="$IN_DIR$DBFNAME" '
$1 == "DTL" {
    if (NF == 4) {
        print substr($0, 5) > DGFNM
    } else {
        print > DBFNM
        print NR >> $logfile
    }
}
' "$IN_DIR$IN_FILE"

Easy: NF is the number of fields in the record and NR is the record number.
Something like: awk '{ if (NF < 4) { print "Row " NR " has " NF " fields"; } }' - there are shorter ways, but I prefer longer code that is easier to read ;-)
See this question for some info on printing to different output files: is it possible to print different lines to different output files using awk
To answer your edited question: $logfile is inside the single quotes, so it is not expanded to your shell variable logfile, and it is not an awk variable either. Try print NR >> "some_file"; in the awk, and then rename some_file to $logfile afterwards.
Another option would be to generate the awk file with the expanded $logfile already in place instead of trying to do it inline.
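For example, here is a sketch of your snippet with the log path passed in as an awk variable (LOGF is just an illustrative name; any unused variable name works):
awk -F'|' -v DGFNM="$IN_DIR$DGFNAME" -v DBFNM="$IN_DIR$DBFNAME" -v LOGF="$logfile" '
$1 == "DTL" {
    if (NF == 4) {
        print substr($0, 5) > DGFNM
    } else {
        print > DBFNM
        print "Row number " NR " has " NF " fields" >> LOGF
    }
}
' "$IN_DIR$IN_FILE"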

Related

How to add an if statement before calculation in AWK

I have a series of files that I am looping through, calculating the mean of a column in each file after applying a series of filters. Each filter is piped into the next before calculating the mean on the final output. All of this is done within a subshell to assign the result to a variable for later use.
for example:
variable=$(filter1 | filter2 | filter3 | calculate mean)
to calculate the mean I use the following code
... | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
So, my problem is that depending on the file, the number of rows after the final filter may be reduced to 0, i.e. the pipe passes nothing to awk. I then end up with awk: fatal: division by zero attempted printed to the screen, and the variable remains empty. I later print the variable to a file, and in this case I end up with a blank in the text file. Instead, I want to say that if NR==0 then the variable is assigned 0, so that my final output in the text file is 0.
To do this I have tried to add an if statement at the start of my awk command
... | awk '{if (NR==0) print 0}BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
but this doesn't change the output or the error, and I am still left with blanks.
I also tried moving the BEGIN block, but that caused other errors (syntax and output errors).
Expected results:
Given that a file has 5 lines that look as follows, I would filter on apple and pipe the result into the calculation:
apple 10
apple 10
apple 10
apple 10
apple 10
code:
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /apple/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
then I would expect the variable to be set to 10 (10*5/5 = 10)
In the following scenario where I filter on banana
variable=$(awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}')
given that the pipe passes nothing to AWK I would want the variable to be 0
Is it just easier to accept the blank and change it later when printing to the file, i.e. replace the blank with 0?
The default value of a variable which you treat as a number in AWK is 0, so you don't need BEGIN {s=0}.
You should put the condition in the END block. NR is not the total number of rows but the number of the current record; it only holds the total row count once you reach the END block.
awk '{s += $5} END { if (NR == 0) { print 0 } else { print s/NR } }'
Or, using a ternary:
awk '{s += $5} END { print (NR == 0) ? 0 : s/NR }'
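In the original pipeline that would look like this (filter1 through filter3 being the question's placeholders):
variable=$(filter1 | filter2 | filter3 | awk '{s += $5} END { print (NR == 0) ? 0 : s/NR }')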
Also, a side note about your BEGIN{OFS='\t'} ($1 ~ /banana/) { print $0 } examples: most of that code is unnecessary. You can just pass the condition:
awk -F'\t' '$1 ~ /banana/'
When an awk program is only a condition, it uses that as a condition for whether or not to print a line. So you can use conditions as a quick way to filter through the text.
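For example, against a small tab-separated sample (the data here is purely illustrative):
$ printf 'apple\t10\nbanana\t20\n' | awk -F'\t' '$1 ~ /banana/'
banana	20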
The correct way to write:
awk -F"\t" '{OFS="\t"; if($1 ~ /banana/) print $0}' file.in | awk 'BEGIN{s=0;}{s=s+$5;}END{print s/NR;}'
is (assuming a regexp comparison for $1 really is appropriate, which it probably isn't):
awk 'BEGIN{FS=OFS="\t"} $1 ~ /banana/{ s+=$5; c++ } END{print (c ? s/c : 0)}' file.in
Is that what you're looking for?
Or are you trying to get the mean per column 1 like this:
awk 'BEGIN{FS=OFS="\t"} { s[$1]+=$5; c[$1]++ } END{ for (k in s) print k, s[k]/c[k] }' file.in
or something else?

Turning multi-line string into single comma-separated list in Bash

I have this format:
host1,app1
host1,app2
host1,app3
host2,app4
host2,app5
host2,app6
host3,app1
host4... and so on.
I need it like this format:
host1;app1,app2,app3
host2;app4,app5,app6
I have tried this: awk -vORS=, '{ print $2 }' data | sed 's/,$/\n/'
and it gives me this:
app1,app2,app3 without the host in front.
I do not want to show duplicates.
I do not want this:
host1;app1,app1,app1,app1...
host2;app1,app1,app1,app1...
I want this format:
host1;app1,app2,app3
host2;app2,app3,app4
host3;app2,app3
With input sorted on the first column (as in your example; otherwise just pipe it through sort first), you can use the following awk command:
awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }'
It has the advantage over the other solutions posted as of this edit that it avoids holding the whole data set in memory. This comes at the cost of requiring sorted input (and sorting is what would need lots of memory if the input weren't already sorted).
Explanation:
the first line initializes the currentHost and currentApps variables to the values of the first line of the input
the second line handles a line with the same host as the previous one: the app mentioned in the line is appended to the currentApps variable
the third line handles a line with a different host than the previous one: the info for the previous host is printed, then we reinitialize the variables to the values of the current input line
the END block prints the info of the current host once we have reached the end of the input
It probably can be refined (so much redundancy!), but I'll leave that to someone more experienced with awk.
See it in action!
$ awk '
BEGIN { FS=","; ORS="" }
$1!=prev { print ors $1; prev=$1; ors=RS; OFS=";" }
{ print OFS $2; OFS=FS }
END { print ors }
' file
host1;app1,app2,app3
host2;app4,app5,app6
host3;app1
Maybe something like this:
#!/bin/bash
declare -A hosts
while IFS=, read host app
do
[ -z "${hosts["$host"]}" ] && hosts["$host"]="$host;"
hosts["$host"]+=$app,
done < testfile
printf "%s\n" "${hosts[#]%,}" | sort
The script reads the sample data from testfile and outputs to stdout.
You could try this awk script:
awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file
The script creates an entry in the array a for each unique element in the first column and appends to that entry every element from the second column.
When the file is parsed, the content of the array is printed.
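With the sample input from the question, this should produce output like the following (note that for-in iteration order is unspecified in awk, so the lines may come out in any order):
$ awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file
host1;app1,app2,app3
host2;app4,app5,app6
host3;app1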

Bash script to print X lines of a file in sequence

I'd be very grateful for your help with something probably quite simple.
I have a table (table2.txt), which has a single column of randomly generated numbers, and is about a million lines long.
2655087
3721239
5728533
9082076
2016819
8983893
9446748
6607974
I want to create a loop that repeats 10,000 times, so that for iteration 1, I print lines 1 to 4 to a file (file0.txt), for iteration 2, I print lines 5 to 8 (file1.txt), and so on.
What I have so far is this:
#!/bin/bash
for i in {0..10000}
do
awk 'NR==((4 * "$i") +1)' table2.txt > file"$i".txt
awk 'NR==((4 * "$i") +2)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +3)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +4)' table2.txt >> file"$i".txt
done
Desired output for file0.txt:
2655087
3721239
5728533
9082076
Desired output for file1.txt:
2016819
8983893
9446748
6607974
Something is going wrong with this, because I am getting identical output in all my files (i.e. they all look like the desired output for file0.txt). Hopefully you can see from my script that during the second iteration, i.e. when i=1, I want the output to be the values of rows 5, 6, 7 and 8.
This is probably a very simple syntax error, and I would be grateful if you can tell me where I'm going wrong (or give me a less cumbersome solution!)
Thank you very much.
The beauty of awk is that you can do this in one awk call:
awk 'BEGIN { c=0 }
     { print > ("file" c ".txt") }
     (NR % 4 == 0) { ++c }
     (c == 10001) { exit }' <file>
This can be slightly more optimized and file-handling friendly (cf. @JamesBrown):
awk 'BEGIN { f="file0.txt" }
     { print > f }
     (NR % 4 == 0) { close(f); f = "file" (++c) ".txt" }
     (c == 10001) { exit }' <file>
Why did your script fail?
The reason your script is failing is that you used single quotes and tried to use a shell variable inside them. Your lines should read:
awk 'NR==((4 * '$i') +1)' table2.txt > file"$i".txt
but this is very ugly and should be improved with
awk -v i=$i 'NR==(4*i+1)' table2.txt > file"$i".txt
Why is your script slow?
You are processing your file in a loop of 10001 iterations, and per iteration you perform 4 awk calls. Each awk call reads the full file and writes out a single line, so in the end you read your file 40004 times.
To optimise your script step by step, I would do the following:
Tell awk to stop reading the file once the requested line has been printed:
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i 'NR==(4*i+1){print; exit}' table2.txt > file"$i".txt
awk -v i=$i 'NR==(4*i+2){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+3){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+4){print; exit}' table2.txt >> file"$i".txt
done
Merge the 4 awk calls into a single one. This prevents reading the first lines over and over per loop cycle.
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i '(NR<=4*i) {next} # skip line
(NR> 4*(i+1)}{exit} # exit awk
1' table2.txt > file"$i".txt # print line
done
Remove the shell loop entirely (see the top of this answer).
This is functionally the same as @JamesBrown's answer but just written more awk-ishly, so don't accept this; I just posted it to show the more idiomatic awk syntax, as you can't put formatted code in a comment.
awk '
(NR%4)==1 { close(out); out="file" c++ ".txt" }
c > 10000 { exit }
{ print > out }
' file
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why you should avoid shell loops for manipulating text.
With head and split you can do it very simply:
chunk=4
files=10000
head -n $(($chunk*$files)) table2.txt |
split -d -a 5 --additional-suffix=.txt -l $chunk - file
Basically, read the first 40000 lines (10000 chunks of 4) and split them into chunks of 4 consecutive lines, using file as the prefix and .txt as the suffix for the new files.
If you want a numeric identifier, you will need 5 digits (-a 5), as pointed out in the comments (credit: @kvantour).
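For instance, assuming GNU split and just a couple of chunks rather than the full 10000, the generated names look like this:
$ head -n 8 table2.txt | split -d -a 5 --additional-suffix=.txt -l 4 - file
$ ls file*
file00000.txt  file00001.txt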
Another awk:
$ awk '{if(NR%4==1){if(i==10000)exit;close(f);f="file" i++ ".txt"}print > f}' file
$ ls
file file0.txt file1.txt
Explained:
awk '{
    if (NR % 4 == 1) {            # use mod to recognize the first record of a group
        if (i == 10000)           # exit after 10000 files (test with 1 first)
            exit
        close(f)                  # close the previous file
        f = "file" i++ ".txt"     # make a new filename
    }
    print > f                     # output record to file
}' file

Replace special characters in variable in awk shell command

I am currently executing the following command:
awk 'BEGIN { FS="," ; getline ; H=$0 } N != $3 { N=$3 ; print H > "/Directory/FILE_"$3"_DOWNLOAD.csv" } { print > "/Directory/FILE_"$3"_DOWNLOAD.csv" }' /Directory/FILE_ALL_DOWNLOAD.csv
This takes the value from the third position in the CSV file and creates a CSV for each distinct $3 value. Works as desired.
The input file looks as follows:
Name, Amount, ID
"ABC", "100.00", "0000001"
"DEF", "50.00", "0000001"
"GHI", "25.00", "0000002"
Unfortunately I have no control over the value in the source (CSV) sheet, the $3 value, but I would like to eliminate special (non-alphanumeric) characters from it. I tried the following to accomplish this but failed...
awk 'BEGIN { FS="," ; getline ; H=$0 } N != $3 { N=$3 ; name=${$3//[^a-zA-Z_0-9]/}; print H > "/Directory/FILE_"$name"_DOWNLOAD.csv" } { print > "/Directory/FILE_"$name"_DOWNLOAD.csv" }' /Directory/FILE_ALL_DOWNLOAD.csv
Suggestions? I'm hoping to do this in a single command but if anyone has a bash script answer that would work.
This is definitely not a job you should be using getline for, see http://awk.info/?tip/getline
It looks like you just want to reproduce the first line of your input file in every $3-named file. That'd be:
awk -F, '
NR==1 { hdr=$0; next }
$3 != prev { prev=name=$3; gsub(/[^[:alnum:]_]/,"",name); $0 = hdr "\n" $0 }
{ print > ("/Directory/FILE_" name "_DOWNLOAD.csv") }
' /Directory/FILE_ALL_DOWNLOAD.csv
Note that you must always parenthesize expressions on the right side of output redirection (>) as it's ambiguous otherwise and different awks will behave differently if you don't.
Feel free to put it all back onto one line if you prefer.
If you always expect the number to be in the last field of your CSV and you know that each field is wrapped in quotes, you could use this awk to extract the value 456 from the input you have provided in the comment:
echo " 123.", "Company Name" " 456." | awk -F'[^a-zA-Z0-9]+' 'NF { print $(NF-1) }'
This defines the field separator as any number of non-alphanumeric characters and retrieves the second-last field.
If this is sufficient to reliably retrieve the value, you could construct your filename like this:
file = "/Directory/FILE_" $(NF-1) "_DOWNLOAD.csv"
and output to it as you're already doing.
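Put together, a minimal sketch of that approach (note it redefines the field separator for the whole line and omits the header handling from the earlier answer, so treat it as illustrative rather than a drop-in replacement):
awk -F'[^a-zA-Z0-9]+' 'NR > 1 && NF {
    file = "/Directory/FILE_" $(NF-1) "_DOWNLOAD.csv"   # e.g. FILE_0000001_DOWNLOAD.csv
    print > file
}' /Directory/FILE_ALL_DOWNLOAD.csv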
bash variable expansions do not occur inside single quotes, and they cannot be performed on awk variables anyway.
That being said, you don't need that to work: awk has string manipulation functions that can perform the same tasks. In this instance you likely want the gsub function.
Would this not work for what you asked?
awk -F, 'a=NR==1{x=$0;next}
!a{gsub(/[^[:alnum:]]/,"",$3);print x"\n"$0 >> "/Directory/FILE_"$3"_DOWNLOAD.csv"}' file

Get next field/column with awk

I have a dataset of the following structure:
1234 4334 8677 3753 3453 4554
4564 4834 3244 3656 2644 0474
...
I would like to:
1) search for a specific value, e.g. 4834
2) return the following field (3244)
I'm quite new to awk, but realize it is a simple operation. I have created a bash-script that asks the user for the input, and attempts to return the following field.
But I can't seem to get around scoping in awk. How do I pass the input value to awk?
#!/bin/bash
read input
cat data.txt | awk '{
    for (i=1; i<=NF; i++) {
        if ($i == input) {
            print $(i+1)
        }
    }
}'
Cheers and thanks in advance!
UPDATE Sept. 8th 2011
Thanks for all the replies.
1) It will never happen that the last number of a row is picked - still I appreciate you pointing this out.
2) I have a more general problem with awk. Often I want to "do something" with the result found. In this case I would like to output it to xclip, an application which reads from standard input and copies it to the clipboard. E.g.:
$ echo Hi | xclip
Unfortunately, echo doesn't exist for awk, so I need to return the value and echo it. How would you go about this?
#!/bin/bash
read input
cat data.txt | awk '{
for (i=1;i<=NF;i++) {
if ($i=='$input') {
print $(i+1)
}
}
}'
Don't overthink it!
You can create an array in awk with the split function:
split($0, ary)
This will split the line $0 into an array called ary. Now, you can use array syntax to find the particular fields:
awk '{
    size = split($0, ary)
    for (i = 1; i <= size; i++) {
        print ary[i]
    }
    print "---"
}' data.txt
Now, when you find ary[x] as the field, you can print out ary[x+1].
In your example:
awk -v input="$input" '{
    size = split($0, ary)
    for (i = 1; i <= size; i++) {
        if (ary[i] == input) {
            print ary[i+1]
        }
    }
}' data.txt
There is a way of doing this without creating an array, but it's simply much easier to work with arrays in situations like this.
By the way, you can eliminate the cat command by putting the file name after the awk statement and save creating an extraneous process. Everyone knows creating an extraneous process kills a kitten. Please don't kill a kitten.
You pass a shell variable to awk using the -v option. It's cleaner/nicer than having to splice quotes.
awk -v input="$input" '
for(i=1;i<=NF;i++){
if ($i == input ){
print "Next value: " $(i+1)
}
}
' data.txt
And lose the useless cat.
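As for the xclip part of your update: awk writes its matches to standard output, so the whole command can simply be piped straight into xclip. A sketch reusing the snippet above:
read input
awk -v input="$input" '{
    for (i=1; i<=NF; i++)
        if ($i == input)
            print $(i+1)
}' data.txt | xclip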
Here is my solution: delete everything up to (and including) the search field, then the field you want to print out is field #1 ($1):
awk '/4834/ {sub(/^.* *4834 +/, ""); print $1}' data.txt
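Against the sample data at the top of the question, this prints:
$ awk '/4834/ {sub(/^.* *4834 +/, ""); print $1}' data.txt
3244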