Print columns in for loop in awk wrt variable value - shell

I am trying to count number of occurrences of a word "is" using awk through sample program below:
awk '
BEGIN { count = 0; word="is"; out=$ }
/word/ {
for (i=1; i<=NR; i++) {
if ($(i) == word) count++;
}
}
END {print "Found word " word count " no of times"}
' data.txt
But here the problem is $(i) is not being interpreted as column number.
Can you please suggest what should be written in place of $(i) to reference the column number (dynamic) as per value of i in that line?

i is the field or column number (1, 2, 3, ...) and $i is the value in that field:
$ echo This is it|awk '{for(i=1; i<=NF; i++) print i" "$i}'
1 This
2 is
3 it
So your program:
$ cat test.awk
BEGIN {
count = 0
word="is"
}
{
for (i=1; i<=NF; i++)
if ($(i) == word)
count++
}
END {
print "Found word " word" "count " no of times"
}
$ echo This is it|awk -f test.awk
Found word is 1 no of times

Related

bash - select columns based on values

I am new to bash and have the below requirement:
I have a file as below:
col1,col2,col3....col25
s1,s2,s2..........s1
col1,col2,col3....col25
s3,s2,s2..........s2
If you notice the values of these columns can be of 3 types only: s1,s2,s3
I can extract the last 2rows from the given file which gives me:
col1,col2,col3....col25
s3,s1,s2..........s2
I want to further parse the above lines so that I get only the columns with say value s1.
Desired output:
say col3,col25 are the only columns with value s2, then say a comma separated value is also fine ex:
col3,col25
Can someone please help?
P.S. I found many examples where a file parsed based on the value of say 2nd (fixed) column, but how do we do it when the column number is not fixed?
Checked URLs:
awk one liner select only rows based on value of a column
Assumptions:
there are 2 input lines
each input line has the same number of comma-separated items
We can use a couple arrays to collect the input data, making sure to use the same array indexes. Once the data is loaded into arrays we loop through the array looking for our value match.
$ cat col.awk
/col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END {
sep=""
for (i=1; i<=n; i++)
{ if (arr_s[i]==smatch)
{ printf "%s%s" ,sep,arr_c[i]
sep=", "
}
}
}
/col1/ : for the line that contains col1, store the fields in array arr_c
n=NF : grab our max array index value (NF=number of fields)
! /col1/ : for line that does not contain col1, store the fields in array arr_s
END ... : executed once the arrays have been loaded
sep="" : set our initial output separator to a null string
for (...) : loop through our array indexes (1 to n)
if (arr_s[i]==smatch) : if the s array value matches our input parameter (smatch - see below example), then ...
printf "%s%s",sep,arr_c[i] : printf our sep and the matching c array item, then ...
sep=", " : set our separator for the next match in the loop
We use printf because without specifying '\n' (a new line), all output goes to one line.
Example:
$ cat col.out
col1,col2,col3,col4,col5
s3,s1,s2,s1,s3
$ awk -F, -f col.awk smatch=s1 col.out
col2, col4
-F, : define the input field separator as a comma
here we pass in our search pattern s1 in the array variable named smatch, which is referenced in the awk code (see col.awk - above)
If you want to do the whole thing at the command line:
$ awk -F, '
/col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END {
sep=""
for (i=1; i<=n; i++)
{ if (arr_s[i]==smatch)
{ printf "%s%s" ,sep,arr_c[i]
sep=", "
}
}
}
' smatch=s1 col.out
col2, col4
Or collapsing the END block to a single line:
awk -F, '
/col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END { sep="" ; for (i=1; i<=n; i++) { if (arr_s[i]==smatch) { printf "%s%s" ,sep,arr_c[i] ; sep=", " } } }
' smatch=s1 col.out
col2, col4
I'm not so good with awk, but here is something that seems to work, outputting only the column names whose corresponding values are s1 :
#<yourTwoLines> |
tac |
awk -F ',' 'NR == 1 { for (f=1; f<=NF; f++) { relevant[f]= ($f == "s1") } };
NR == 2 { for (f=1; f<=NF; f++) { if(relevant[f]) print($f) } }'
It works in the following way :
reverse the lines order with tac, so the value (criteria) are handled before the headers (which we will print based on the criteria).
when handling the first line (now values) with awk, store in an array which ones are s1
when handling the second line (now headers) with awk, print those who correspond to an s1 value thanks to the previously filled array.
solution in awk that prints a resulting row after parsing each set of 2 rows.
$ cat tst.awk
BEGIN {FS=","; p=0}
/s1|s2|s3/ {
for (i=1; i<NF; i++) {
if ($i=="s2") str = sprintf("%s%s", str?str ", ":str, c[i])
};
p=1
}
!p { for (i=1; i<NF; i++) { c[i] = $i } }
p { print str; p=0; str="" }
Rationale: build up your resultstring str when you're looping through the value-row.
whenever your input contains s1, s2 or s3, loop through the elements and - if value == s2 -, add column with index i to resultstring str; set the print var p to 1.
if p = 0 build up column array
if p = 1 print resultstring str
With input:
$ cat input.txt
col1,col2,col3,col4,col5
s1,s2,s2,s3,s1
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3
col1,col2,col3,col4,col5
s1,s1,s1,s3,s3
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3
The result is:
$ awk -f tst.awk input.txt
col2, col3
col3
col3
Notice the empty 3rd line: no s2's for that one.
Let's say you have this:
cat file
col1,col2,col3,..,col25
s3,s1,s2,........,s2
Then you can use this awk:
awk -F, -v val='s2' '{
s="";
for (i=1; i<=NF; i++)
if (NR==1)
hdr[i]=$i
else if ($i==val)
s=s hdr[i] FS;
if (s) {
sub(/,$/, "", s);
print s
}
}' file
col3,col25
If order of the columns returned is not a concern
awk -F"," 'NR==1{for(i=1;i<=NF;i++){a[i]=$i};next}{for(i=1;i<=NF;i++){if($i=="s2")b[i]=$i}}END{for( i in b) m=m a[i]","; gsub(/,$/,"", m); print m }'

Got stuck with multiple value validation against in particular columns in awk?

I have a text file where i'm trying to validate with particular column(5) if that column contains value like ACT,LFP,TST and EPO then file goes to further process else it should be exit.Here i'm if my text file contains these value in column number 5 means ACT,LFP,TST and EPO go for further process on other hand if column contains apart from that four value then script will terminate.
Code
cat test.txt \
| awk -F '~' -v ERR="/a/x/ERROR" -v NAME="/a/x/z/" -v WRKD="/a/x/b/" -v DATE="23_09_16" -v PD="234" -v FILE_NAME="FILENAME" \
'{ if ($5 != "ACT" || $5 != "LFP" || $5 != "EPO" || $5 != "TST")
system("mv "NAME" "ERR);
system("rm -f"" "WRKD);
print DATE" " PD " " "[" FILE_NAME "]" " ERROR: Panel status contains invalid value due to this file move to error folder";
print DATE" " PD " " "[" FILE_NAME "]" " INFO: Script is exited";
system("exit");
}' >>log.txt
Txt file: test.txt(Note:- File should be processed successfully)
161518~CHEM~ACT~IRPMR~ACT~UD
010282~CHEM~ACT~IRPMR~ACT~UD
162794~CHEM~ACT~IRPMR~LFP~UD
030767~CHEM~ACT~IRPMR~LFP~UD
Txt file: test1.txt(Note:- File should not be processed successfully.This file contains one invalid value)
161518~CHEM~ACT~IRPMR~**ACT1**~UD
010282~CHEM~ACT~IRPMR~ACT~UD
162794~CHEM~ACT~IRPMR~TST~UD
030767~CHEM~ACT~IRPMR~LFP~UD
awk to the rescue!
Lets assume the following input file:
010282~CHEM~ACT~IRPMR~ACT~UD
121212~CHEM~ACT~IRPMR~ZZZ~UD
162794~CHEM~ACT~IRPMR~TST~UD
020202~CHEM~ACT~IRPMR~YYY~UD
030767~CHEM~ACT~IRPMR~LFP~UD
987654~CHEM~ACT~IRPMR~EPO~UD
010101~CHEM~ACT~IRPMR~XXX~UD
123456~CHEM~ACT~IRPMR~TST~UD
1) This example illustrates how to check for invalid lines/records in the input file:
#!/bin/awk
BEGIN {
FS = "~"
s = "ACT,LFP,TST,EPO"
n = split( s, a, "," )
}
{
for( i = 1; i <= n; i++ )
if( a[i] == $5 )
next
print "Unexpected value # line " NR " [" $5 "]"
}
# eof #
Testing:
$ awk -f script.awk -- input.txt
Unexpected value # line 2 [ZZZ]
Unexpected value # line 4 [YYY]
Unexpected value # line 7 [XXX]
2) This example illustrates how to filter out (remove) invalid lines/records from the input file:
#!/bin/awk
BEGIN {
FS = "~"
s = "ACT,LFP,TST,EPO"
n = split( s, a, "," )
}
{
for( i = 1; i <= n; i++ )
{
if( a[i] == $5 )
{
print $0
next
}
}
}
# eof #
Testing:
$ awk -f script.awk -- input.txt
010282~CHEM~ACT~IRPMR~ACT~UD
162794~CHEM~ACT~IRPMR~TST~UD
030767~CHEM~ACT~IRPMR~LFP~UD
987654~CHEM~ACT~IRPMR~EPO~UD
123456~CHEM~ACT~IRPMR~TST~UD
3) This example illustrates how to display the invalid lines/records from the input file:
#!/bin/awk
BEGIN {
FS = "~"
s = "ACT,LFP,TST,EPO"
n = split( s, a, "," )
}
{
for( i = 1; i <= n; i++ )
if( a[i] == $5 )
next
print $0
}
# eof #
Testing:
$ awk -f script.awk -- input.txt
121212~CHEM~ACT~IRPMR~ZZZ~UD
020202~CHEM~ACT~IRPMR~YYY~UD
010101~CHEM~ACT~IRPMR~XXX~UD
Hope it Helps!
Without getting into the calls to system, this will show you an answer.
awk -F"~" '{ if (! ($5 == "ACT" || $5 == "LFP" || $5 == "EPO" || $5 == "TST")) print $0}' data.txt
output
161518~CHEM~ACT~IRPMR~**ACT1**~UD
This version is testing if $5 matches at least one item in the list. If it doesn't (the ! at the front of the || chain tests), then it prints the record as an error.
Of course, $5 will match only one from that list at a time, but that is all you need.
By contrast, when you say
if ($5 != "ACT" || $5 != "LFP" ...)
You're creating a logic test that can never be true. If $5 does not equal "ACT" because it is "LFP", you have already had the chained condition fail, and the remaining || will not be checked.
IHTH

Finding sum based on multiple columns from a file and display the highest value and the corresponding row using awk

I have a file with 5 columns in the below format :
$cat test.txt
id;section;name;val1;val2
11;10;John;50;15
12;20;Sam;40;20
13;30;Jeny;30;30
14;10;Ted;60;10
15;10;Mary;30;5
16;20;Tim;15;15
17;30;Pen;20;100
I want to process the data in the file based on the section_number(column 2) passed . And I want to display the id,Name,Total(column4+column5) for the section_id passed . At the end i want to print the row information that has the highest total .
I have already made a awk command like below :
section=10 ; awk -F";" -v var="$section" 'BEGIN { print "id Name Total" } { if ($2 == var) { sum = $4 + $5 ;print $1 " "$3 " " sum ;if (sum>newsum) {newsum=sum;name=$3;id=$1}}} END { print "Max sum for section "var" is "newsum " for Name: " name " and ID: " id }' test.txt;
And it is displaying the data as below :
id Name Total
11 John 65
14 Ted 70
15 Mary 35
Max sum for section 10 is 70 for Name: Ted and ID: 14
But how to handle the scenario if there are multiple records with the same highest value as Total ?
It all depends on how you would like to handle it i guess? You could say the first gets precedens >, the last >= or both by using arrays.
Assuming you want to show all having the same shared highest sum:
% cat script.awk
BEGIN {
FS=";";
print "id Name Total";
}
$2 != var {next} # If line doesn't match skip blocks
{
sum = $4 + $5;
print $1 " " $3 " " sum;
}
sum > max { # If sum > max we need to reset the arrays (names and ids)
max = sum; # because we get a new winner
delete names;
delete ids;
l = 0;
}
sum >= max { # If sum is same or higher than max we will need to add this
l++; # to the list of winners.
names[l] = $3;
ids[l] = $1;
}
END {
printf "Max sum for section %s is %d for\n", var, max;
# Iterate though all "winners" and print them
for ( i = 1; i <= l; i++ ) {
printf "Name: %s, ID: %s\n", names[i], ids[i];
}
}
Hope this gives you an idea of how to use arrays.
And running:
section=10;
awk -F";" -v var="$section" -f script.awk test.txt
# ^ Instead of having awk on command line use script.awk

Finding the minimum and maximum length of columns in a CSV file using shell script

I have several CSV files with multiple columns, and I want to get the max length, min length of individual columns and diff (max -min) for each column in the same CSV file. Example:
File:
abc 1234 4
bcd 23644 534
c 3232 6
Expected output:
abc 1234 4
bcd 23644 534
c 3232 6
Max Length 3 5 3
Min Length 1 4 1
Diff 2 1 2
The following script for computing the MAX column length is producing the expected output:
awk -F, '
{ for (i=1;i<=NF;i++)l[i]=((x=length($i))>l[i]?x:l[i])}
END {for(i=1;i<=NF;i++) print "Column"i":",l[i]} '
but there is problem with MIN Length script:
awk -F"," 'BEGIN {
for (i=1;i<=NF;i++) {
cur = length($i)
if ( (min == 0) || (cur < min) ) {
minlength = i
min = cur
}
} ;
for (i=1;i<=NF;i++) print $minlength}'
Any help would be greatly appreciated.
You just need to set the starting values for the min and max arrays based on the first line of the file:
awk '
NR==1 {for (i=1; i<=NF; i++) maxlen[i] = minlen[i] = length($i)}
{
for (i=1; i<=NF; i++) {
len = length($i)
if (len > maxlen[i]) maxlen[i] = len
if (len < minlen[i]) minlen[i] = len
}
}
END {
printf "Max Length"
for (i=1; i<=NF; i++) printf " %d", maxlen[i]
print ""
printf "Min Length"
for (i=1; i<=NF; i++) printf " %d", minlen[i]
print ""
printf "Diff"
for (i=1; i<=NF; i++) printf " %d", maxlen[i]-minlen[i]
print ""
}
' file

Improving a bash script for csv files

I have a bunch of CSV files in a folder. All of them on the same structure. more than 2k columns. The first column is ID.
I need to do the following for each file:
For each n odd column (except the first column), do the following:
If n value is 0, for all of the rows, then delete the n column and also n-1 column
If n value is 100, for all of the rows, then delete the n column
print the indexes of the removed columns
I have the following code:
for f in *.csv; do
awk 'BEGIN { FS=OFS="," }
NR==1 {
for (i=3; i<=NF; i+=2)
a[i]
}FNR==NR {
for (i=1; i<=NF; i++)
sums[i] += $i;
++r;
next
} {
for (i=1; i<=NF; i++)
if (sums[i] > 0 && sums[i+1]>0 && sums[i] != 100*r)
printf "%s%s", (i>1)?OFS:"", $i;
else print "removed index: " i > "removed.index"
print ""
}' "$f" "$f" > "new_$f"
done
For some reason the ID column (first column) is been removed.
Input:
23232,0,0,5,0,1,100,3,0,33,100
21232,0,0,5,0,1,100,3,0,33,100
23132,0,0,5,0,1,100,3,0,33,100
23212,0,0,5,0,1,100,3,0,33,100
24232,0,0,5,0,1,100,3,0,33,100
27232,0,0,5,0,1,100,3,0,33,100
Current output (bad):
,1,33
,1,33
,1,33
,1,33
,1,33
,1,33
Expected output:
23232,1,33
21232,1,33
23132,1,33
23212,1,33
24232,1,33
27232,1,33
Can anyone check what is the issue?
You need to skip first column from the logic to check for 0 in previous column:
awk 'BEGIN{FS=OFS=","; out=ARGV[1] ".removed.index"}
FNR==NR {
for (i=1; i<=NF; i++)
sums[i] += $i;
++r;
next
} FNR==1 {
for (i=3; i<=NF; i++) {
if (sums[i] == 0) {
if (i-1 in sums) {
delete sums[i-1];
print "removed index: " (i-1) > out
}
delete sums[i];
print "removed index: " i > out
} else if (sums[i] == 100*r) {
delete sums[i];
print "removed index: " i > out
}
}
} {
printf "%s", $1
for (i=2; i<=NF; i++)
if (i in sums)
printf "%s%s", OFS, $i;
printf "%s", ORS
} END{close(out)}' file file
Output:
23232,1,33
21232,1,33
23132,1,33
23212,1,33
24232,1,33
27232,1,33
Also removed indices is:
cat file.removed.index
cat removed.index
removed index: 2
removed index: 3
removed index: 4
removed index: 5
removed index: 7
removed index: 8
removed index: 9
removed index: 11

Resources