Awk script within shell script - shell

I wrote an awk script to be executed while looping over {a..z}.txt files. I've been staring at this code for 30 minutes, but I just can't find what's wrong. The terminal complains about a syntax error around >, but I don't think that's where the bug is.
Basically, what I'm trying to do is this:
Each line contains a string followed by a set of numbers. I want to reprint the numbers so that the smallest one comes first.
input: a 1125159 2554 290 47364290 47392510 48629708 68 60771
output: a 290 1125159 2554 47364290 47392510 48629708 68 60771
Could anyone help me find what is wrong with the below code?
for alphabet in {a..z}
do
awk -F$'\t' "NF>2{maxId=\$2;maxIndex=2;
for(i=2; i<=NF; i++){
if(maxId>\$i){maxId=\$i; maxIndex=i}
};
printf \"%s \t %s \t\",\$1, maxId;
for(i=2; i<=NF; i++){
if(i!=maxIndex)
printf \"%d \t\", \$i};
printf \"\n\";
}" $alphabet.merged > $alphabet.out
done

Here's how your script should really be written:
awk 'BEGIN { FS=OFS="\t" }
NF>2 {
    minIndex = 2
    for (i=3; i<=NF; i++) {
        if ( $minIndex > $i ) {
            minIndex = i
        }
    }
    printf "%s%s%s", $1, OFS, $minIndex
    for (i=2; i<=NF; i++) {
        if ( i != minIndex ) {
            printf "%s%s", OFS, $i
        }
    }
    print ""
}' file
a 68 1125159 2554 290 47364290 47392510 48629708 60771
Don't shy away from white space and brackets as they help readability. I don't understand the purpose of the surrounding shell loop in your question though - I suspect all you really need is:
awk 'BEGIN { FS=OFS="\t" }
FNR==1 { close(out); out=FILENAME; sub(/merged/,"out",out) }
NF>2 {
    minIndex = 2
    for (i=3; i<=NF; i++) {
        if ( $minIndex > $i ) {
            minIndex = i
        }
    }
    printf "%s%s%s", $1, OFS, $minIndex > out
    for (i=2; i<=NF; i++) {
        if ( i != minIndex ) {
            printf "%s%s", OFS, $i > out
        }
    }
    print "" > out
}' *.merged
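If the per-letter shell loop really is needed (say, to process only some of the letters), the quoting pain in the original can also be avoided by keeping the awk program in its own file and passing it with -f, so the awk $ variables need no shell escaping. A minimal sketch, assuming the first program above is saved as minfirst.awk (a hypothetical filename):
for letter in {a..z}
do
    # -f reads the program from a file, so nothing inside it is touched by the shell
    awk -f minfirst.awk "$letter.merged" > "$letter.out"
done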


bash - select columns based on values

I am new to bash and have the below requirement:
I have a file as below:
col1,col2,col3....col25
s1,s2,s2..........s1
col1,col2,col3....col25
s3,s2,s2..........s2
If you notice, the values of these columns can be of 3 types only: s1, s2, s3
I can extract the last 2 rows from the given file, which gives me:
col1,col2,col3....col25
s3,s1,s2..........s2
I want to further parse the above lines so that I get only the columns with, say, value s1.
Desired output:
say col3,col25 are the only columns with value s2; then a comma-separated list is also fine, e.g.:
col3,col25
Can someone please help?
P.S. I found many examples where a file is parsed based on the value of, say, the 2nd (fixed) column, but how do we do it when the column number is not fixed?
Checked URLs:
awk one liner select only rows based on value of a column
Assumptions:
there are 2 input lines
each input line has the same number of comma-separated items
We can use a couple of arrays to collect the input data, making sure to use the same array indexes. Once the data is loaded into the arrays, we loop through them looking for our value match.
$ cat col.awk
/col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END {
    sep=""
    for (i=1; i<=n; i++) {
        if (arr_s[i]==smatch) {
            printf "%s%s", sep, arr_c[i]
            sep=", "
        }
    }
}
/col1/ : for the line that contains col1, store the fields in array arr_c
n=NF : grab our max array index value (NF=number of fields)
! /col1/ : for line that does not contain col1, store the fields in array arr_s
END ... : executed once the arrays have been loaded
sep="" : set our initial output separator to a null string
for (...) : loop through our array indexes (1 to n)
if (arr_s[i]==smatch) : if the s array value matches our input parameter (smatch - see below example), then ...
printf "%s%s",sep,arr_c[i] : printf our sep and the matching c array item, then ...
sep=", " : set our separator for the next match in the loop
We use printf because it does not append a newline ('\n'), so all of the output stays on one line.
Example:
$ cat col.out
col1,col2,col3,col4,col5
s3,s1,s2,s1,s3
$ awk -F, -f col.awk smatch=s1 col.out
col2, col4
-F, : define the input field separator as a comma
here we pass in our search pattern s1 via the awk variable named smatch, which is referenced in the awk code (see col.awk above)
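The same value can also be passed with awk's -v option, which assigns the variable before any input is read (useful if smatch were ever needed in a BEGIN block); a small sketch with the same files:
$ awk -F, -v smatch=s1 -f col.awk col.out
col2, col4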
If you want to do the whole thing at the command line:
$ awk -F, '
/col1/   { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END {
    sep=""
    for (i=1; i<=n; i++) {
        if (arr_s[i]==smatch) {
            printf "%s%s", sep, arr_c[i]
            sep=", "
        }
    }
}
' smatch=s1 col.out
col2, col4
Or collapsing the END block to a single line:
awk -F, '
/col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i } }
END { sep="" ; for (i=1; i<=n; i++) { if (arr_s[i]==smatch) { printf "%s%s" ,sep,arr_c[i] ; sep=", " } } }
' smatch=s1 col.out
col2, col4
I'm not so good with awk, but here is something that seems to work, outputting only the column names whose corresponding values are s1:
#<yourTwoLines> |
tac |
awk -F ',' 'NR == 1 { for (f=1; f<=NF; f++) { relevant[f]= ($f == "s1") } };
NR == 2 { for (f=1; f<=NF; f++) { if(relevant[f]) print($f) } }'
It works in the following way:
reverse the line order with tac, so the values (the criteria) are handled before the headers (which we will print based on those criteria).
when handling the first line (now the values) with awk, store in an array which columns are s1
when handling the second line (now the headers) with awk, print those that correspond to an s1 value, using the previously filled array.
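As a concrete run on the two-line col.out sample shown earlier (the criterion value s1 is hard-coded here):
$ tac col.out | awk -F ',' 'NR == 1 { for (f=1; f<=NF; f++) { relevant[f]= ($f == "s1") } };
                            NR == 2 { for (f=1; f<=NF; f++) { if(relevant[f]) print($f) } }'
col2
col4
Each matching column name comes out on its own line; pipe the result through paste -s -d, - if a single comma-separated line is wanted.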
A solution in awk that prints a result row after parsing each set of 2 rows:
$ cat tst.awk
BEGIN { FS=","; p=0 }
/s1|s2|s3/ {
    for (i=1; i<=NF; i++) {
        if ($i=="s2") str = sprintf("%s%s", str ? str ", " : str, c[i])
    }
    p=1
}
!p { for (i=1; i<=NF; i++) { c[i] = $i } }
p  { print str; p=0; str="" }
Rationale: build up your result string str while looping through the value row.
Whenever the input line contains s1, s2 or s3, loop through the fields and, if the value is s2, add the column name c[i] to the result string str; then set the print flag p to 1.
If p = 0, build up the column-name array c.
If p = 1, print the result string str.
With input:
$ cat input.txt
col1,col2,col3,col4,col5
s1,s2,s2,s3,s1
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3
col1,col2,col3,col4,col5
s1,s1,s1,s3,s3
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3
The result is:
$ awk -f tst.awk input.txt
col2, col3
col3

col3
Notice the empty 3rd line: no s2's for that one.
Let's say you have this:
cat file
col1,col2,col3,..,col25
s3,s1,s2,........,s2
Then you can use this awk:
awk -F, -v val='s2' '{
    s="";
    for (i=1; i<=NF; i++)
        if (NR==1)
            hdr[i]=$i
        else if ($i==val)
            s=s hdr[i] FS;
    if (s) {
        sub(/,$/, "", s);
        print s
    }
}' file
col3,col25
If the order of the returned columns is not a concern:
awk -F"," 'NR==1{for(i=1;i<=NF;i++){a[i]=$i};next}{for(i=1;i<=NF;i++){if($i=="s2")b[i]=$i}}END{for( i in b) m=m a[i]","; gsub(/,$/,"", m); print m }'

How to reformat a text file - awk

I have a text file containing 3 columns:
broke banana 192
broke apple 175
broke avocado 20
fixed banana 117
fixed apple 89
I need the output below:
Issue,banana,apple,avocado
broke,192,175,20
fixed,117,90,0
I am new to this and have no idea how to get this result.
I appreciate any help,
Thanks
Input
$ cat file
broke banana 192
broke apple 175
fixed banana 117
fixed apple 89
I don't understand where you got fixed,117,90 in the expected output.
Output
$ awk -v OFS=, '{
    is_fr[$1,$2]=$3
    if (!($1 in i_tmp)) issue[++i]=$1;
    if (!($2 in f_tmp)) fruit[++f]=$2;
    i_tmp[$1]; f_tmp[$2]
}
END {
    printf("%s","issue");
    for (i=1; i in fruit; i++)
        printf("%s%s",OFS,fruit[i]);
    print "";
    for (i=1; i in issue; i++) {
        printf("%s",issue[i]);
        for (j=1; j in fruit; j++) {
            printf("%s%s",OFS,(issue[i],fruit[j]) in is_fr ? is_fr[issue[i],fruit[j]] : "")
        }
        print ""
    }
}' file
issue,banana,apple
broke,192,175
fixed,117,89
If order doesn't matter then you may try the awk below:
$ awk -v OFS=, '{
    issue[$1];
    fruit[$2];
    is_fr[$1,$2]=$3
}
END {
    printf("%s","issue");
    for (i in fruit)
        printf("%s%s",OFS,i);
    print "";
    for (i in issue) {
        printf("%s",i);
        for (j in fruit) {
            printf("%s%s",OFS,(i,j) in is_fr ? is_fr[i,j] : "")
        }
        print ""
    }
}' file
issue,apple,banana
fixed,89,117
broke,175,192
For the new input edited into the question by the OP:
akshay@db-3325:/tmp$ cat f
broke banana 192
broke apple 175
broke avocado 20
fixed banana 117
fixed apple 89
akshay@db-3325:/tmp$ awk -v OFS=, '{
    is_fr[$1,$2]=$3
    if (!($1 in i_tmp)) issue[++i]=$1;
    if (!($2 in f_tmp)) fruit[++f]=$2;
    i_tmp[$1]; f_tmp[$2]
}
END {
    printf("%s","issue");
    for (i=1; i in fruit; i++)
        printf("%s%s",OFS,fruit[i]);
    print "";
    for (i=1; i in issue; i++) {
        printf("%s",issue[i]);
        for (j=1; j in fruit; j++) {
            printf("%s%s",OFS,(issue[i],fruit[j]) in is_fr ? is_fr[issue[i],fruit[j]]+0 : 0)
        }
        print ""
    }
}' f
issue,banana,apple,avocado
broke,192,175,20
fixed,117,89,0
$ cat tst.awk
BEGIN { OFS="," }
{
    states[$1]
    fruits[$2]
    count[$1,$2] += $NF
}
END {
    printf "%s", "Issue"
    for (fruit in fruits) {
        printf "%s%s", OFS, fruit
    }
    print ""
    for (state in states) {
        printf "%s", state
        for (fruit in fruits) {
            printf "%s%s", OFS, count[state,fruit]+0
        }
        print ""
    }
}
$ awk -f tst.awk file
Issue,apple,banana,avocado
fixed,89,117,0
broke,175,192,20
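If a repeatable column order matters for this last script, GNU awk can force the order in which for-in visits array indices; this is gawk-only and not part of the original answer - a one-line sketch of the change to the BEGIN block:
BEGIN { OFS=","; PROCINFO["sorted_in"] = "@ind_str_asc" }   # gawk only: visit array indices in sorted order
With that, the fruit columns come out alphabetically on every run; to keep the original input order instead, record the fruits in an indexed array as the earlier answer above does.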

Improving a bash script for csv files

I have a bunch of CSV files in a folder. All of them have the same structure: more than 2k columns, and the first column is ID.
I need to do the following for each file:
For each odd column n (except the first column), do the following:
If the value of column n is 0 for all of the rows, then delete column n and also column n-1
If the value of column n is 100 for all of the rows, then delete column n
Print the indexes of the removed columns
I have the following code:
for f in *.csv; do
  awk 'BEGIN { FS=OFS="," }
  NR==1 {
      for (i=3; i<=NF; i+=2)
          a[i]
  }
  FNR==NR {
      for (i=1; i<=NF; i++)
          sums[i] += $i;
      ++r;
      next
  }
  {
      for (i=1; i<=NF; i++)
          if (sums[i] > 0 && sums[i+1]>0 && sums[i] != 100*r)
              printf "%s%s", (i>1)?OFS:"", $i;
          else print "removed index: " i > "removed.index"
      print ""
  }' "$f" "$f" > "new_$f"
done
For some reason the ID column (the first column) is being removed.
Input:
23232,0,0,5,0,1,100,3,0,33,100
21232,0,0,5,0,1,100,3,0,33,100
23132,0,0,5,0,1,100,3,0,33,100
23212,0,0,5,0,1,100,3,0,33,100
24232,0,0,5,0,1,100,3,0,33,100
27232,0,0,5,0,1,100,3,0,33,100
Current output (bad):
,1,33
,1,33
,1,33
,1,33
,1,33
,1,33
Expected output:
23232,1,33
21232,1,33
23132,1,33
23212,1,33
24232,1,33
27232,1,33
Can anyone check what is the issue?
You need to exclude the first column from the logic that checks for 0 in the previous column:
awk 'BEGIN { FS=OFS=","; out=ARGV[1] ".removed.index" }
FNR==NR {
    for (i=1; i<=NF; i++)
        sums[i] += $i;
    ++r;
    next
}
FNR==1 {
    for (i=3; i<=NF; i++) {
        if (sums[i] == 0) {
            if (i-1 in sums) {
                delete sums[i-1];
                print "removed index: " (i-1) > out
            }
            delete sums[i];
            print "removed index: " i > out
        } else if (sums[i] == 100*r) {
            delete sums[i];
            print "removed index: " i > out
        }
    }
}
{
    printf "%s", $1
    for (i=2; i<=NF; i++)
        if (i in sums)
            printf "%s%s", OFS, $i;
    printf "%s", ORS
}
END { close(out) }' file file
Output:
23232,1,33
21232,1,33
23132,1,33
23212,1,33
24232,1,33
27232,1,33
Also, the removed indices are:
cat file.removed.index
removed index: 2
removed index: 3
removed index: 4
removed index: 5
removed index: 7
removed index: 8
removed index: 9
removed index: 11
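To process every CSV in the folder, as in the original loop, the same program can be saved to a file and invoked once per input; each file is still passed twice so the first pass can accumulate the sums. A minimal sketch, assuming the program above is saved as prune.awk (a hypothetical filename):
for f in *.csv; do
    # first pass builds sums[], second pass prints the surviving columns;
    # the removed indices land in "$f.removed.index" via out=ARGV[1] ".removed.index"
    awk -f prune.awk "$f" "$f" > "new_$f"
done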

Rearranging a csv file

I have a file with contents similar to the below
Boy,Football
Boy,Football
Boy,Football
Boy,Squash
Boy,Tennis
Boy,Football
Girl,Tennis
Girl,Squash
Girl,Tennis
Girl,Tennis
Boy,Football
How can I use 'awk' or similar to rearrange this to the below:
Football Tennis Squash
Boy 5 1 1
Girl 0 3 1
I'm not even sure if this is possible, but any help would be great.
$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
{
    genders[$1]
    sports[$2]
    count[$1,$2]++
}
END {
    printf ""
    for (sport in sports) {
        printf "%s%s", OFS, sport
    }
    print ""
    for (gender in genders) {
        printf "%s", gender
        for (sport in sports) {
            printf "%s%s", OFS, count[gender,sport]+0
        }
        print ""
    }
}
$ awk -f tst.awk file
Squash Tennis Football
Boy 1 1 5
Girl 1 3 0
In general when you know the end point of the loop you put the OFS or ORS after each field:
for (i=1; i<=n; i++) {
printf "%s%s", $i, (i<n?OFS:ORS)
}
but if you don't then you put the OFS before the second and subsequent fields and print the ORS after the loop:
for (x in array) {
printf "%s%s", (++i>1?OFS:""), array[x]
}
print ""
I do like the:
n = length(array)
for (x in array) {
printf "%s%s", array[x], (++i<n?OFS:ORS)
}
idea to get the end of the loop too, but length(array) is gawk-specific.
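A portable way to get that count, since length(array) is not in POSIX awk, is to count the entries with an extra loop first - a small sketch:
n = 0
for (x in array) n++              # portable element count
i = 0
for (x in array) {
    printf "%s%s", array[x], (++i<n?OFS:ORS)
}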
Another approach to consider:
$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
{
    for (i=1; i<=NF; i++) {
        if (!seen[i,$i]++) {
            map[i,++num[i]] = $i
        }
    }
    count[$1,$2]++
}
END {
    for (i=0; i<=num[2]; i++) {
        printf "%s%s", map[2,i], (i<num[2]?OFS:ORS)
    }
    for (i=1; i<=num[1]; i++) {
        printf "%s%s", map[1,i], OFS
        for (j=1; j<=num[2]; j++) {
            printf "%s%s", count[map[1,i],map[2,j]]+0, (j<num[2]?OFS:ORS)
        }
    }
}
$ awk -f tst.awk file
Football Squash Tennis
Boy 5 1 1
Girl 0 1 3
That last one will print the rows and columns in the order they were read. It's not quite as obvious how it works, though :-).
I would just loop normally:
awk -F, -v OFS="\t" '
{names[$1]; sport[$2]; count[$1,$2]++}
END {
    printf "%s", OFS;
    for (i in sport)
        printf "%s%s", i, OFS;
    print "";
    for (n in names) {
        printf "%s%s", n, OFS
        for (s in sport)
            printf "%s%s", count[n,s]?count[n,s]:0, OFS;
        print ""
    }
}' file
This keeps track of three arrays: names[] for the first column, sport[] for the second column and count[name,sport] to count the occurrences of every combination.
Then, it is a matter of looping through the results and printing them in a fancy way and making sure 0 is printed if the count[a,b] does not exist.
Test
$ awk -F, -v OFS="\t" '{names[$1]; sport[$2]; count[$1,$2]++} END{printf "%s", OFS; for (i in sport) printf "%s%s", i, OFS; print ""; for (n in names) {printf "%s%s", n, OFS; for (s in sport) printf "%s%s", count[n,s]?count[n,s]:0, OFS; print ""}}' a
Squash Tennis Football
Boy 1 1 5
Girl 1 3 0
The format is a bit ugly; there are some trailing OFS characters.
To get rid of trailing OFS:
awk -F, -v OFS="\t" '{names[$1]; sport[$2]; count[$1,$2]++} END{printf "%s", OFS; for (i in sport) {cn++; printf "%s%s", i, (cn<length(sport)?OFS:ORS)} for (n in names) {cs=0; printf "%s%s", n, OFS; for (s in sport) {cs++; printf "%s%s", count[n,s]?count[n,s]:0, (cs<length(sport)?OFS:ORS)}}}' a
You can always pipe to column -t for a nice output.
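For example, telling column that the fields are tab-separated (a sketch; column is a separate utility and its options can differ slightly between implementations):
$ awk -f tst.awk file | column -t -s $'\t'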

Transpose CSV data with awk (pivot transformation)

my CSV data looks like this:
Indicator;Country;Value
no_of_people;USA;500
no_of_people;Germany;300
no_of_people;France;200
area_in_km;USA;18
area_in_km;Germany;16
area_in_km;France;17
proportion_males;USA;5.3
proportion_males;Germany;7.9
proportion_males;France;2.4
I want my data to look like this:
Country;no_of_people;area_in_km;proportion_males
USA;500;18;5.3
Germany;300;16;7.9
France;200;17;2.4
There are more Indicators and more countries than listed here.
Pretty large files (the number of rows is in the 5-digit range).
Looked around for some transpose threads, but nothing matched my situation (also I'm quite new to awk, so I couldn't change the code I found to fit my data).
Thanks for your help.
Regards
Ad
If the number of Ind fields is fixed, you can do:
awk 'BEGIN{FS=OFS=";"}
{a[$2,$1]=$3; count[$2]}
END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]}' file
Explanation
BEGIN{FS=OFS=";"} set input and output field separator as semicolon.
{a[$2,$1]=$3; count[$2]} get list of countries in count[] array and values of each Ind on a["country","Ind"] array.
END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]} print the summary of the values.
Output
$ awk 'BEGIN{FS=OFS=";"} {a[$2,$1]=$3; count[$2]} END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]}' file
France;200;17;2.4
Germany;300;16;7.9
USA;500;18;5.3
Update
unfortunately, the number of Indicators is not fixed. Also, they are not named like "Ind1", "Ind2" etc. but are just strings. I clarified my question.
$ awk -v FS=";" '{a[$2,$1]=$3; count[$2]; indic[$1]} END {for (j in indic) printf "%s ", j; printf "\n"; for (i in count) {printf "%s ", i; for (j in indic) printf "%s ", a[i,j]; printf "\n"}}' file
proportion_males no_of_people area_in_km
France 2.4 200 17
Germany 7.9 300 16
USA 5.3 500 18
To have the output ;-separated, replace each space with ;:
$ awk -v FS=";" '{a[$2,$1]=$3; count[$2]; indic[$1]} END {for (j in indic) printf "%s ", j; printf "\n"; for (i in count) {printf "%s ", i; for (j in indic) printf "%s ", a[i,j]; printf "\n"}}' file | tr ' ' ';'
proportion_males;no_of_people;area_in_km;
France;2.4;200;17;
Germany;7.9;300;16;
USA;5.3;500;18;
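The trailing ; on each line above comes from printing a separator after every field. A variant that skips the header line and prints separators only between fields, in the style of the earlier snippets (a sketch, not from the original answer; the column order is still whatever for-in yields):
awk -F';' '
NR>1 { a[$2,$1]=$3; countries[$2]; indic[$1] }
END {
    printf "%s", "Country"
    for (j in indic) printf ";%s", j
    print ""
    for (i in countries) {
        printf "%s", i
        for (j in indic) printf ";%s", a[i,j]
        print ""
    }
}' file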
Using awk and maintaining the order of output:
awk -F\; '
NR>1 {
    if (!($1 in indicators)) { indicator[++types] = $1 }; indicators[$1]++
    if (!($2 in countries))  { country[++num] = $2 };     countries[$2]++
    map[$1,$2] = $3
}
END {
    printf "%s;", "Country";
    for (ind=1; ind<=types; ind++) {
        printf "%s%s", sep, indicator[ind];
        sep = ";"
    }
    print "";
    for (coun=1; coun<=num; coun++) {
        printf "%s", country[coun]
        for (val=1; val<=types; val++) {
            printf "%s%s", sep, map[indicator[val], country[coun]];
        }
        print ""
    }
}' file
Country;no_of_people;area_in_km;proportion_males
USA;500;18;5.3
Germany;300;16;7.9
France;200;17;2.4
