Rearranging a csv file - bash

I have a file with contents similar to the below
Boy,Football
Boy,Football
Boy,Football
Boy,Squash
Boy,Tennis
Boy,Football
Girl,Tennis
Girl,Squash
Girl,Tennis
Girl,Tennis
Boy,Football
How can I use 'awk' or similar to rearrange this to the below:
Football Tennis Squash
Boy 5 1 1
Girl 0 3 1
I'm not even sure if this is possible, but any help would be great.

$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
{
    genders[$1]
    sports[$2]
    count[$1,$2]++
}
END {
    for (sport in sports) {
        printf "%s%s", OFS, sport
    }
    print ""
    for (gender in genders) {
        printf "%s", gender
        for (sport in sports) {
            printf "%s%s", OFS, count[gender,sport]+0
        }
        print ""
    }
}
$ awk -f tst.awk file
Squash Tennis Football
Boy 1 1 5
Girl 1 3 0
In general, when you know the end point of the loop, you put the OFS or ORS after each field:
for (i=1; i<=n; i++) {
    printf "%s%s", $i, (i<n?OFS:ORS)
}
but if you don't then you put the OFS before the second and subsequent fields and print the ORS after the loop:
for (x in array) {
    printf "%s%s", (++i>1?OFS:""), array[x]
}
print ""
I do like the:
n = length(array)
for (x in array) {
    printf "%s%s", array[x], (++i<n?OFS:ORS)
}
idea to get the end of the loop too, but length(array) is gawk-specific.
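A portable workaround (a sketch) is to count the entries yourself before looping, since POSIX awk only defines length() for strings:

n = 0
for (x in array) n++
for (x in array) {
    printf "%s%s", array[x], (++i<n?OFS:ORS)
}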
Another approach to consider:
$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
{
    for (i=1; i<=NF; i++) {
        if (!seen[i,$i]++) {       # first occurrence of this value in column i
            map[i,++num[i]] = $i   # remember it in input order
        }
    }
    count[$1,$2]++
}
END {
    for (i=0; i<=num[2]; i++) {    # starting at i=0 prints an empty corner cell
        printf "%s%s", map[2,i], (i<num[2]?OFS:ORS)
    }
    for (i=1; i<=num[1]; i++) {
        printf "%s%s", map[1,i], OFS
        for (j=1; j<=num[2]; j++) {
            printf "%s%s", count[map[1,i],map[2,j]]+0, (j<num[2]?OFS:ORS)
        }
    }
}
$ awk -f tst.awk file
Football Squash Tennis
Boy 5 1 1
Girl 0 1 3
That last will print the rows and columns in the order they were read. Not quite as obvious how it works though :-).
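The heart of it is the !seen[i,$i]++ idiom: the expression is true only the first time a given value appears, so each value is recorded exactly once, in input order. A minimal illustration on a single column:

$ printf 'b\na\nb\nc\na\n' | awk '!seen[$1]++ { map[++num] = $1 } END { for (i=1; i<=num; i++) print i, map[i] }'
1 b
2 a
3 c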

I would just loop normally:
awk -F, -v OFS="\t" '
{ names[$1]; sport[$2]; count[$1,$2]++ }
END {
    printf "%s", OFS
    for (i in sport)
        printf "%s%s", i, OFS
    print ""
    for (n in names) {
        printf "%s%s", n, OFS
        for (s in sport)
            printf "%s%s", count[n,s]?count[n,s]:0, OFS
        print ""
    }
}' file
This keeps track of three arrays: names[] for the first column, sport[] for the second column and count[name,sport] to count the occurrences of every combination.
Then it is a matter of looping through the results, printing them in a tabular way, and making sure a 0 is printed when count[a,b] does not exist.
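In awk, an unset variable or array element evaluates as an empty string, so it has to be forced to a numeric 0 for display, either with a ternary as here or by adding 0 (as count[gender,sport]+0 does in the earlier answer). A quick check:

$ awk 'BEGIN { print (x ? x : 0), x+0, "[" x "]" }'
0 0 []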
Test
$ awk -F, -v OFS="\t" '{names[$1]; sport[$2]; count[$1,$2]++} END{printf "%s", OFS; for (i in sport) printf "%s%s", i, OFS; print ""; for (n in names) {printf "%s%s", n, OFS; for (s in sport) printf "%s%s", count[n,s]?count[n,s]:0, OFS; print ""}}' a
Squash Tennis Football
Boy 1 1 5
Girl 1 3 0
The format is a bit ugly: there are some trailing OFS characters.
To get rid of trailing OFS:
awk -F, -v OFS="\t" '{names[$1]; sport[$2]; count[$1,$2]++} END{printf "%s", OFS; for (i in sport) {cn++; printf "%s%s", i, (cn<length(sport)?OFS:ORS)} for (n in names) {cs=0; printf "%s%s", n, OFS; for (s in sport) {cs++; printf "%s%s", count[n,s]?count[n,s]:0, (cs<length(sport)?OFS:ORS)}}}' a
You can always pipe to column -t for a nice output.
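For example:

$ awk -f tst.awk file | column -t

Note that the header row starts with an empty cell, so depending on your column implementation you may need -s with an explicit tab separator to keep the columns lined up.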

Related

Add line numbers to final nonempty lines using bash

Is there a way to add line numbers to a file, but only have it start from the line after the last blank line? For example
aaa
bbb

ccc

ddd
eee
would become
aaa
bbb

ccc

1. ddd
2. eee
since the line ddd is the first line after the last blank line.
Right now I'm going through the files one by one in vim (selecting the lines and running a quick command), but I have 1,000 files to run through and I'd rather not do each one by hand; I just can't think of a way around it.
You can do it simply with awk and three rules, using the END rule to number the final group of lines, e.g.
awk '
NF > 0 { a[++n]=$0 }
NF == 0 { for(i=1; i<=n; i++) print a[i]; print""; n=0 }
END { for(i=1; i<=n; i++) printf "%d. %s\n", i, a[i]}
' file
Explanation
For the first rule, NF > 0: if there is at least one field (the line is non-empty), store the line in array a, pre-incrementing the counter n (to stay consistent with awk's 1-to-NF indexing);
For the second rule, NF == 0: if the line is blank, output what you have stored in a, then output an empty line and reset n to zero;
Finally, in the END rule, number and output all lines still stored in a.
Example Use/Output
$ awk '
> NF > 0 { a[++n]=$0 }
> NF == 0 { for(i=1; i<=n; i++) print a[i]; print""; n=0 }
> END { for(i=1; i<=n; i++) printf "%d. %s\n", i, a[i]}
> ' file
aaa
bbb

ccc

1. ddd
2. eee
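Since the question mentions 1,000 files, a wrapper loop around this (or any of the other one-file solutions below) could look like the following sketch; the *.txt glob and the rename-in-place are assumptions, so test on copies first:

for f in *.txt; do
    awk '
    NF > 0 { a[++n]=$0 }
    NF == 0 { for(i=1; i<=n; i++) print a[i]; print ""; n=0 }
    END { for(i=1; i<=n; i++) printf "%d. %s\n", i, a[i] }
    ' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done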
Here is a solution with awk.
awk '!NF{x=NR} {r[NR]=$0}
END {for (i=1;i<=NR;i++) print (i>x? (++n)". "r[i]: r[i])}' file
We store the rows in an array. By the END block, x holds the line number of the last blank line, so we add numbering only to lines whose number is greater than x.
Here is a two-pass version that does not require reading the entire file into memory, which is worth considering for large files:
awk ' NR==FNR {if (/^[ \t]*$/) ll=FNR; next}
FNR>ll {c++}
{printf "%s%s\n", (c ? c "." OFS : ""), $0}' file file
Prints:
aaa
bbb

ccc

1. ddd
2. eee
With sed and tail you can calculate the last blank line like so:
$ sed -n '/^[[:blank:]]*$/=' file | tail -1
5
Which you can use to set a variable for awk as well:
awk -v ll=$(sed -n '/^[[:blank:]]*$/=' file | tail -1) '
FNR>ll {c++}
{printf "%s%s\n", (c ? c "." OFS : ""), $0}' file
In any case, you can only do one of two things: hold the entire file in memory and print after reading it all, or read the file twice. In most cases it is better to read it twice, since that does not impact memory and is usually just as fast (most OSs will cache the file).
Just to compare some timings of reading the whole file vs reading it twice, consider a 101-line file (likely too fast to measure accurately):
$ awk 'BEGIN {for (i=1; i<=101; i++) print i%3 ? i : ""}' > file
Now time that with the read all vs read twice solutions:
time awk -v ll=$(sed -n '/^[[:blank:]]*$/=' file | tail -1) '
FNR>ll {c++}
{printf "%s%s\n", (c ? c "." OFS : ""), $0}' file > /dev/null
time awk '
NF > 0 { a[++n]=$0 }
NF == 0 { for(i=1; i<=n; i++) print a[i]; print""; n=0 }
END { for(i=1; i<=n; i++) printf "%d. %s\n", i, a[i]}
' file > /dev/null
time awk ' NR==FNR {if (/^[ \t]*$/) ll=FNR; next}
FNR>ll {c++}
{printf "%s%s\n", (c ? c "." OFS : ""), $0}' file file > /dev/null
Prints:
real 0m0.012s
user 0m0.004s
sys 0m0.007s
real 0m0.007s
user 0m0.003s
sys 0m0.003s
real 0m0.007s
user 0m0.003s
sys 0m0.003s
Now try with 100,001 lines:
awk 'BEGIN { for (i=1; i<=100001; i++) print i%3 ? i : ""}' > file
time awk -v ll=$(sed -n '/^[[:blank:]]*$/=' file | tail -1) '
FNR>ll {c++}
{printf "%s%s\n", (c ? c "." OFS : ""), $0}' file > /dev/null
time awk '
NF > 0 { a[++n]=$0 }
NF == 0 { for(i=1; i<=n; i++) print a[i]; print""; n=0 }
END { for(i=1; i<=n; i++) printf "%d. %s\n", i, a[i]}
' file > /dev/null
time awk ' NR==FNR {if (/^[ \t]*$/) ll=FNR; next}
FNR>ll {c++}
{printf "%s%s\n", (c ? c "." OFS : ""), $0}' file file > /dev/null
Times:
real 0m0.081s
user 0m0.079s
sys 0m0.007s
real 0m0.047s
user 0m0.042s
sys 0m0.004s
real 0m0.058s
user 0m0.052s
sys 0m0.004s
Now 10,000,001 lines:
awk 'BEGIN { for (i=1; i<=10000001; i++) print i%3 ? i : ""}' > file
time awk -v ll=$(sed -n '/^[[:blank:]]*$/=' file | tail -1) '
FNR>ll {c++}
{printf "%s%s\n", (c ? c "." OFS : ""), $0}' file > /dev/null
time awk '
NF > 0 { a[++n]=$0 }
NF == 0 { for(i=1; i<=n; i++) print a[i]; print""; n=0 }
END { for(i=1; i<=n; i++) printf "%d. %s\n", i, a[i]}
' file > /dev/null
time awk ' NR==FNR {if (/^[ \t]*$/) ll=FNR; next}
FNR>ll {c++}
{printf "%s%s\n", (c ? c "." OFS : ""), $0}' file file > /dev/null
Times:
real 0m6.766s
user 0m7.671s
sys 0m0.063s
real 0m3.950s
user 0m3.921s
sys 0m0.026s
real 0m4.801s
user 0m4.754s
sys 0m0.041s
So, going by these numbers, reading the file twice is essentially as fast as holding it in memory for smaller files, while holding it in memory is slightly faster for larger ones. This is on a computer with 64GB RAM and an SSD; less memory or a slower disk would change this.

Compare two files using awk having many columns and get the column in which data is different

file 1:
field1|field2|field3|
abc|123|234
def|345|456
hij|567|678
file2:
field1|field2|field3|
abc|890|234
hij|567|658
desired output:
field1|field2|field3|
abc|N|Y
def|345|456
hij|Y|N
I need to compare them: if the fields match, it should put Y, else N, in the output file.
Using awk, you may try this:
awk -F '|' 'FNR == NR {
    p = $1
    sub(p, "")
    a[p] = $0
    next
}
{
    if (FNR > 1 && $1 in a) {
        split(a[$1], b, /\|/)
        printf "%s", $1 FS
        for (i=2; i<=NF; i++)
            printf "%s%s", ($i == b[i] ? "Y" : "N"), (i == NF ? ORS : FS)
    }
    else
        print
}' file2 file1
field1|field2|field3|
abc|N|Y
def|345|456
hij|Y|N
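One caveat worth noting: sub(p, "") treats the first field as a regular expression, so it can misbehave if keys ever contain regex metacharacters. A safer variant (a sketch) stores the whole line and splits it on demand:

awk -F '|' 'FNR == NR { a[$1] = $0; next }
FNR > 1 && $1 in a {
    split(a[$1], b, /\|/)   # b[1] is the key; b[2..] are the other file's values
    printf "%s", $1 FS
    for (i=2; i<=NF; i++)
        printf "%s%s", ($i == b[i] ? "Y" : "N"), (i == NF ? ORS : FS)
    next
}
{ print }' file2 file1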

Spreading cell values into columns using UNIX

Suppose we have this file:
head file
id,name,value
1,Je,1
2,Je,1
3,Ko,1
4,Ne,1
5,Ne,1
6,Je,1
7,Ko,1
8,Ne,1
9,Ne,1
And I'd like to get this out:
id,Je,Ko,Ne
1,1,0,0
2,1,0,0
3,0,1,0
4,0,0,1
5,0,0,1
6,1,0,0
7,0,1,0
8,0,0,1
9,0,0,1
Does someone know how to get this output, using awk or sed?
Assuming that the possible values of name are only Je or Ko or Ne, you can do:
awk -F, 'BEGIN{print "id,Je,Ko,Ne"}
NR==1{ next }
{
    je = ($2=="Je" ? "1" : "0")
    ko = ($2=="Ko" ? "1" : "0")
    ne = ($2=="Ne" ? "1" : "0")
    print $1","je","ko","ne
}' file
If you want something that will print the values in the same order they are read and not limited to your example fields, you could do:
awk -F, 'BEGIN{OFS=FS; x=1; y=1}
NR==1 { next }
!($2 in oa){ oa[$2]=1; ar[x++]=$2 }
{ lines[y++]=$0 }
END{
    s=""
    for (i=1; i<x; i++)
        s = (s=="" ? ar[i] : s OFS ar[i])
    print "id" OFS s
    for (j=1; j<y; j++) {
        split(lines[j], a)
        s = ""
        for (i=1; i<x; i++) {
            tt = (ar[i]==a[2] ? "1" : "0")
            s = (s=="" ? tt : s OFS tt)
        }
        print a[1] OFS s
    }
}
' file
Here's a "two-pass solution" (along the lines suggested by #Drakosha) implemented using a single invocation of awk. The implementation would be a little simpler if there was no requirement regarding the ordering of names.
awk -F, '
# global: n, array a
function println(ix, name, value,    i, line) {
    line = ix
    for (i=0; i<n; i++) {
        if (a[i]==name) { line = line OFS value } else { line = line OFS 0 }
    }
    print line
}
BEGIN { OFS=FS; n=0 }
FNR==1 { next }  # skip the header each time
NR==FNR { if (!mem[$2]) { mem[$2] = a[n++] = $2 }; next }
!s { s="id"; for (i=0; i<n; i++) { s = s OFS a[i] }; print s }
{ println($1, $2, $3) }
' file file
I suggest 2 passes.
The 1st will generate all the possible values of column 2 (Je, Ko, Ne, ...).
The 2nd will then trivially generate the output you are looking for.
awk -F, 'BEGIN{s="Je,Ko,Ne";print "id,"s}
NR>1 {m=s; sub($2,1,m); gsub("[^0-9,]+","0",m); print $1","m}' file
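The trick here: m starts as the full list of names, sub($2, 1, m) replaces the current row's name with a 1, and gsub("[^0-9,]+", "0", m) turns every remaining name into a 0. A quick demonstration of the substitution step on its own:

$ awk 'BEGIN { m="Je,Ko,Ne"; sub("Ko", 1, m); gsub("[^0-9,]+", "0", m); print m }'
0,1,0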

Awk script within shell script

I wrote some awk script to be executed while looping over {a..z}.txt files. I've been staring at this code for 30 minutes, but I just can't find what's wrong. The terminal complains that there is some syntax error around >, but I don't think that's where the bug is.
Basically, what I'm trying to do is this:
Each line contains a string and a following set of numbers. I want to re-print the numbers so that the first number is the smallest one of them.
input: a 1125159 2554 290 47364290 47392510 48629708 68 60771
output: a 290 1125159 2554 47364290 47392510 48629708 68 60771
Could anyone help me find what is wrong with the below code?
for alphabet in {a..z}
do
awk -F$'\t' "NF>2{maxId=\$2;maxIndex=2;
for(i=2; i<=NF; i++){
if(maxId>\$i){maxId=\$i; maxIndex=i}
};
printf \"%s \t %s \t\",\$1, maxId;
for(i=2; i<=NF; i++){
if(i!=maxIndex)
printf \"%d \t\", \$i};
printf \"\n\";
}" $alphabet.merged > $alphabet.out
done
Here's how your script should really be written:
awk 'BEGIN { FS=OFS="\t" }
NF>2 {
    minIndex = 2
    for (i=3; i<=NF; i++) {
        if ( $minIndex > $i ) {
            minIndex = i
        }
    }
    printf "%s%s%s", $1, OFS, $minIndex
    for (i=2; i<=NF; i++) {
        if ( i != minIndex ) {
            printf "%s%s", OFS, $i
        }
    }
    print ""
}' file
a 68 2554 290 47364290 47392510 48629708 1125159 60771
Don't shy away from white space and brackets as they help readability. I don't understand the purpose of the surrounding shell loop in your question though - I suspect all you really need is:
awk 'BEGIN { FS=OFS="\t" }
FNR==1 {  # new input file: derive its output name and close the previous one
    close(out)
    out = FILENAME
    sub(/merged/, "out", out)
}
NF>2 {
    minIndex = 2
    for (i=3; i<=NF; i++) {
        if ( $minIndex > $i ) {
            minIndex = i
        }
    }
    printf "%s%s%s", $1, OFS, $minIndex > out
    for (i=2; i<=NF; i++) {
        if ( i != minIndex ) {
            printf "%s%s", OFS, $i > out
        }
    }
    print "" > out
}' *.merged
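As an aside, if you ever do need a shell value inside an awk script, passing it in with -v avoids all of the backslash-escaping from the question; a minimal sketch (the extra tag column is just for illustration):

for alphabet in {a..z}; do
    awk -v tag="$alphabet" 'BEGIN { FS=OFS="\t" } { print tag, $0 }' "$alphabet.merged" > "$alphabet.out"
done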

Transpose CSV data with awk (pivot transformation)

my CSV data looks like this:
Indicator;Country;Value
no_of_people;USA;500
no_of_people;Germany;300
no_of_people;France;200
area_in_km;USA;18
area_in_km;Germany;16
area_in_km;France;17
proportion_males;USA;5.3
proportion_males;Germany;7.9
proportion_males;France;2.4
I want my data to look like this:
Country;no_of_people;area_in_km;proportion_males
USA;500;18;5.3
Germany;300;16;7.9
France;200;17;2.4
There are more Indicators and more countries than listed here.
Pretty large files (row counts in the five digits).
I looked around for some transpose threads, but nothing matched my situation (I'm also quite new to awk, so I couldn't adapt the code I found to fit my data).
Thanks for your help.
Regards
Ad
If the number of Indicator values is fixed and they are named like "Ind1", "Ind2", "Ind3", you can do:
awk 'BEGIN{FS=OFS=";"}
{a[$2,$1]=$3; count[$2]}
END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]}' file
Explanation
BEGIN{FS=OFS=";"} set input and output field separator as semicolon.
{a[$2,$1]=$3; count[$2]} get list of countries in count[] array and values of each Ind on a["country","Ind"] array.
END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]} print the summary of the values.
Output
$ awk 'BEGIN{FS=OFS=";"} {a[$2,$1]=$3; count[$2]} END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]}' file
France;200;17;2.4
Germany;300;16;7.9
USA;500;18;5.3
Update
unfortunately, the number of Indicators is not fixed. Also, they are not named like "Ind1", "Ind2" etc. but are just strings. I clarified my question.
$ awk -v FS=";" '{a[$2,$1]=$3; count[$2]; indic[$1]} END {for (j in indic) printf "%s ", j; printf "\n"; for (i in count) {printf "%s ", i; for (j in indic) printf "%s ", a[i,j]; printf "\n"}}' file
proportion_males no_of_people area_in_km
France 2.4 200 17
Germany 7.9 300 16
USA 5.3 500 18
To have it ;-separated, replace each space with ;:
$ awk -v FS=";" '{a[$2,$1]=$3; count[$2]; indic[$1]} END {for (j in indic) printf "%s ", j; printf "\n"; for (i in count) {printf "%s ", i; for (j in indic) printf "%s ", a[i,j]; printf "\n"}}' file | tr ' ' ';'
proportion_males;no_of_people;area_in_km;
France;2.4;200;17;
Germany;7.9;300;16;
USA;5.3;500;18;
Using awk and maintaining the order of output:
awk -F\; '
NR>1 {
    if (!($1 in indicators)) { indicator[++types] = $1 }; indicators[$1]++
    if (!($2 in countries))  { country[++num] = $2 };     countries[$2]++
    map[$1,$2] = $3
}
END {
    printf "%s;", "Country"
    for (ind=1; ind<=types; ind++) {
        printf "%s%s", sep, indicator[ind]
        sep = ";"
    }
    print ""
    for (coun=1; coun<=num; coun++) {
        printf "%s", country[coun]
        for (val=1; val<=types; val++) {
            printf "%s%s", sep, map[indicator[val], country[coun]]
        }
        print ""
    }
}' file
Country;no_of_people;area_in_km;proportion_males
USA;500;18;5.3
Germany;300;16;7.9
France;200;17;2.4
