transform csv file using awk script - bash

I have a csv file as below:
C1, C2, C3,Cv1,Cv2,Cv3,Cv4 ... this one can have more columns
x1, x2 ,x3.1, 1.1, 1.2, 1.3, 1.4
x1, x2, x3.2, 2.1, 2.2, 2.3, 2.4
x1, x2, x3.3, 3.1, 3.2, 3.3, 3.4
I would like to transform this csv file as below:
C1,C2, C3,CTEXT,XVALUE
x1, x2, x3.1, Cv1 , 1.1
x1, x2, x3.1, Cv2 , 1.2
x1, x2, x3.1, Cv3 , 1.3
x1, x2, x3.1, Cv4 , 1.4
x1, x2, x3.2, Cv1 , 2.1
x1, x2, x3.2, Cv2 , 2.2
x1, x2, x3.2, Cv3 , 2.3
x1, x2, x3.2, Cv4 , 2.4
x1, x2, x3.3, Cv1 , 3.1
x1,x2,x3.3, Cv2 , 3.2
x1,x2,x3.3, Cv3 , 3.3
x1,x2,x3.3, Cv4 , 3.4
Below is my code:
#!/bin/bash
awk -F, -v OFS=, '{ if (NR==1)
{ print $1,$2,$3, "CTEXT","XVALUE"
i=4; while (i < NF) {
a[i]=$i; i=i+1
}
am=NF; next
}
i=4 ; while (i < am) {
if (i > NF) {print "record "NR" insufficient value" >/dev/stderr
break}
print $1,$2,$3,a[i],$i
i=i+1
}
if (am <NF) print "record "NR" too many values for text" >/dev/stderr
}' input.csv
When I run the script, it shows this error:
awk: syntax error near line 2
awk: bailing out near line 2
Edit by Ed Morton - I just ran the script through a beautifier (gawk -o- '...') so it's much easier to read/understand:
{
    if (NR == 1) {
        print $1, $2, $3, "CTEXT", "XVALUE"
        i = 4
        while (i < NF) {
            a[i] = $i
            i = i + 1
        }
        am = NF
        next
    }
    i = 4
    while (i < am) {
        if (i > NF) {
            print("record " NR " insufficient value") > (/dev/) stderr
            break
        }
        print $1, $2, $3, a[i], $i
        i = i + 1
    }
    if (am < NF) {
        print("record " NR " too many values for text") > (/dev/) stderr
    }
}

Even if you switch your Solaris awk to gawk or nawk, there still
remain some problems. Would you please try the following:
awk -F, -v OFS=, '
NR==1 {
print $1,$2,$3, "CTEXT","XVALUE"
for (i = 4; i <= NF; i++) a[i]=$i
am=NF; next
}
{
if (am < NF) {
print "record "NR" too many values for text" > "/dev/stderr"
next
}
for (i = 4; i <= am; i++) {
if (i > NF) {
print "record "NR" insufficient value" > "/dev/stderr"
break
}
print $1,$2,$3,a[i],$i
}
}' input.csv
You need to increment i up to NF or am inclusive (<=, not <).
Enclose /dev/stderr in quotes (see the short illustration below).
It is better to use a for loop rather than a while loop.
Hope this helps.
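As a minimal illustration of the /dev/stderr point (a throwaway sketch, not part of the fix above): unquoted, awk parses /dev/ as a regexp constant and stderr as an uninitialized variable, so the redirection target is not the file you intend; quoted, it is just a file name, and gawk and nawk map /dev/stderr to the standard error stream.
# wrong: /dev/ is a regexp constant, stderr an empty variable
#   print "record " NR " insufficient value" > /dev/stderr
# right: quote it so the redirection target is the literal file name
awk '{ print "processing record " NR > "/dev/stderr" }' input.csv > /dev/null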

something like this
$ awk -F, 'BEGIN {OFS=FS}
NR==1 {n=split($0,h);
print $1,$2,$3,"CTEXT","XVALUE";
next}
n!=NF {print n<NF?"too many":"not enough";
exit}
{for(i=4;i<=NF;i++) print $1,$2,$3,h[i],$i}' file
C1,C2,C3,CTEXT,XVALUE
x1,x2,x3.1,Cv1,1.1
x1,x2,x3.1,Cv2,1.2
x1,x2,x3.1,Cv3,1.3
x1,x2,x3.1,Cv4,1.4
x1,x2,x3.2,Cv1,2.1
x1,x2,x3.2,Cv2,2.2
x1,x2,x3.2,Cv3,2.3
x1,x2,x3.2,Cv4,2.4
x1,x2,x3.3,Cv1,3.1
x1,x2,x3.3,Cv2,3.2
x1,x2,x3.3,Cv3,3.3
x1,x2,x3.3,Cv4,3.4
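In case the compressed one-liner above is hard to follow, here is the same logic spelled out with comments (a functionally equivalent sketch, not a different approach):
awk -F, 'BEGIN { OFS = FS }
NR == 1 {
    # remember the header names: h[4], h[5], ... hold Cv1, Cv2, ...
    n = split($0, h)
    print $1, $2, $3, "CTEXT", "XVALUE"
    next
}
n != NF {
    # a data row whose field count differs from the header is an error
    print (n < NF ? "too many" : "not enough")
    exit
}
{
    # emit one output row per value column
    for (i = 4; i <= NF; i++) print $1, $2, $3, h[i], $i
}' file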

Related

How to calculate the mean of row from csv file from nth column?

This may look like a duplicate but I could not solve the issue I'm having.
I'm trying to find the average of each row (from the 5th column onward) of a CSV/TSV file; the data looks like below:
input.tsv
ID source random text val1 val2 val3 val4 val330
1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89
2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29
output.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
or at least
ID Avg
1 0.606
2 0.404
I tried a suggestion from here
awk 'NR==1{next}
{printf("%s\t", $1
printf("%.2f\n", ($5 + $6 + $7)/3}' input.tsv
which threw an error.
and
awk '{ s = 4; for (i = 5; i <= NF; i++) s += $i; print $1, (NF > 1) ? s / (NF - 1) : 0; }' input.tsv
the below code also threw a syntax error
for i in `cat input.tsv` do; VALUES=`echo $i | tr '\t' '\t'`;COUNT=0;SUM=0;typeset -i j;IFS=' ';for j in $VALUES; do;SUM=`expr $SUM + $j`;COUNT=`expr $COUNT + 1`;done;AVG=`expr $SUM / $COUNT`;echo $AVG;done
Please help me resolve the issue and calculate the average of each row.
From your code reference:
awk 'NR==1{next}
{
# missing the last ). This prints the 1st column
#printf("%s\t", $1
printf("%s\t", $1 )
# missing the last ) and it averages only 3 columns
#printf("%.2f\n", ($5 + $6 + $7)/3
printf("%.2f\n", ($5 + $6 + $7 + $8 + $9) / 5 )
}' input.tsv
Your second snippet is not easy to work with: lots of subshells (backticks) and a shell loop, but most of all, I think it only handles integer values and averages the whole line of values (not columns 5 to 9). Forget it unless you really don't want awk for this.
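If you really must avoid awk, a rough pure-shell sketch (mine, not a fix of your loop) could let bc do the floating-point division that expr cannot; it assumes whitespace-separated fields and exactly five value columns, as in your sample:
#!/bin/sh
# skip the header, read the four text columns and the five values of each row,
# and let bc do the floating-point average that expr cannot
tail -n +2 input.tsv | while read -r id src rnd txt v1 v2 v3 v4 v5
do
    avg=$(printf 'scale=3; (%s+%s+%s+%s+%s)/5\n' "$v1" "$v2" "$v3" "$v4" "$v5" | bc)
    printf '%s\t%s\t%s\t%s\t%s\n' "$id" "$src" "$rnd" "$txt" "$avg"
done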
For fun:
awk 'NR==1{
    # Header
    print $0 OFS "Avg"
    Count = NF - 4
    next
}
{
    # print each element of the line + the average of the values after col 4
    Avg = 0
    for (i = 1; i <= NF; i++) {
        if (i >= 5) Avg += $i
        printf("%s ", $i)
    }
    # print average
    printf("%.2f\n", Avg/Count)
}
' input.tsv
This assumes every line carries the full set of values; if some lines have fewer values (and empty fields should not be counted), compute the divisor per line as (NF - 4) instead of taking it from the header, as sketched below.
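A minimal variant of the loop above with the divisor computed per record, as just described (still a sketch; it assumes every data row has at least one value column after the first four):
awk 'NR == 1 { print $0, "Avg"; next }
{
    Avg = 0
    for (i = 5; i <= NF; i++) Avg += $i
    # divide by the number of value fields actually present on this record
    printf "%s %.2f\n", $0, Avg / (NF - 4)
}' input.tsv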
You could use this awk script:
awk 'NR>1{
for(i=5;i<=NF;i++)
sum+=$i
}
{
print $1,$2,$3,$4,(NF>4&&sum!=""?sum/(NF-4):(NR==1?"Avg":""))
sum=0
}' file | column -t
The first block sums the values starting from the 5th field.
The second block prints the first four fields, followed by "Avg" on the header line or by the computed average on data lines.
column -t aligns the result in columns.
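Run against the sample input.tsv, this should print something along these lines (the spacing comes from column -t):
ID  source  random  text  Avg
1   atttt   eeeee   test  0.606
2   afdg    adfgrg  tf    0.404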
This works as expected:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ s=0; for(i=5;i<=NF;++i) s+=$i }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
or just for the fun of it, if you want to make the for-loop obfuscated:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ for(s=!(i=5);i<=NF;s+=$(i++)) {} }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
$ cat tst.awk
NR == 1 { avg = "Avg" }
NR > 1 {
sum = cnt = 0
for (i=5; i<=NF; i++) {
sum += $i
cnt++
}
avg = (cnt ? sum / cnt : 0)
}
{ print $1, $2, $3, $4, avg }
$ awk -f tst.awk file
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
Using a Perl one-liner:
> perl -lane '{ $s=0; foreach(@F[4..8]){$s+=$_} $F[4]=$s==0?"Avg":$s/5; print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]" }' input.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
>

Compare two files using awk having many columns and get the column in which data is different

file 1:
field1|field2|field3|
abc|123|234
def|345|456
hij|567|678
file2:
field1|field2|field3|
abc|890|234
hij|567|658
desired output:
field1|field2|field3|
abc|N|Y
def|345|456
hij|Y|N
I need to compare the two files: if the fields match, it should put Y, else N, in the output file.
Using awk, you may try this:
awk -F '|' 'FNR == NR {
    # first pass (file2): key each record by its first field and store the
    # rest of the record, with that first field stripped off
    p = $1
    sub(p, "")
    a[p] = $0
    next
}
{
    if (FNR > 1 && $1 in a) {
        # second pass (file1): split the stored file2 record and compare field by field
        split(a[$1], b, /\|/)
        printf "%s", $1 FS
        for (i=2; i<=NF; i++)
            printf "%s%s", ($i == b[i] ? "Y" : "N"), (i == NF ? ORS : FS)
    }
    else
        # header line, or a key present only in file1: print it unchanged
        print
}' file2 file1
field1|field2|field3|
abc|N|Y
def|345|456
hij|Y|N

Awk script within shell script

I wrote some awk script to be executed while looping over {a..z}.txt files. I've been staring at this code for 30 minutes, but I just can't find what's wrong. The terminal complains that there is some syntax error around >, but I don't think that's where the bug is.
Basically, what I'm trying to do is this:
Each line contains a string followed by a set of numbers. I want to re-print the numbers so that the first number is the smallest one of them.
input: a 1125159 2554 290 47364290 47392510 48629708 68 60771
output: a 290 1125159 2554 47364290 47392510 48629708 68 60771
Could anyone help me find what is wrong with the below code?
for alphabet in {a..z}
do
awk -F$'\t' "NF>2{maxId=\$2;maxIndex=2;
for(i=2; i<=NF; i++){
if(maxId>\$i){maxId=\$i; maxIndex=i}
};
printf \"%s \t %s \t\",\$1, maxId;
for(i=2; i<=NF; i++){
if(i!=maxIndex)
printf \"%d \t\", \$i};
printf \"\n\";
}" $alphabet.merged > $alphabet.out
done
Here's how your script should really be written:
awk 'BEGIN { FS=OFS="\t" }
NF>2 {
    minIndex = 2
    for (i=3; i<=NF; i++) {
        if ( $minIndex > $i ) {
            minIndex = i
        }
    }
    printf "%s%s%s", $1, OFS, $minIndex
    for (i=2; i<=NF; i++) {
        if ( i != minIndex ) {
            printf "%s%s", OFS, $i
        }
    }
    print ""
}' file
a 68 2554 290 47364290 47392510 48629708 1125159 60771
Don't shy away from white space and brackets as they help readability. I don't understand the purpose of the surrounding shell loop in your question though - I suspect all you really need is:
awk 'BEGIN { FS=OFS="\t" }
FNR==1 { close(out); out=FILENAME; sub(/merged/,"out",out) }
NF>2 {
    minIndex = 2
    for (i=3; i<=NF; i++) {
        if ( $minIndex > $i ) {
            minIndex = i
        }
    }
    printf "%s%s%s", $1, OFS, $minIndex > out
    for (i=2; i<=NF; i++) {
        if ( i != minIndex ) {
            printf "%s%s", OFS, $i > out
        }
    }
    print "" > out
}' *.merged

Transpose CSV data with awk (pivot transformation)

my CSV data looks like this:
Indicator;Country;Value
no_of_people;USA;500
no_of_people;Germany;300
no_of_people;France;200
area_in_km;USA;18
area_in_km;Germany;16
area_in_km;France;17
proportion_males;USA;5.3
proportion_males;Germany;7.9
proportion_males;France;2.4
I want my data to look like this:
Country;no_of_people;area_in_km;proportion_males
USA;500;18;5.3
Germany;300;16;7.9
France;200;17;2.4
There are more indicators and more countries than listed here.
The files are pretty large (the number of rows is in the five digits).
I looked around in some transpose threads, but nothing matched my situation (also, I'm quite new to awk, so I couldn't adapt the code I found to fit my data).
Thanks for your help.
Regards
Ad
If the number of Ind fields is fixed, you can do:
awk 'BEGIN{FS=OFS=";"}
{a[$2,$1]=$3; count[$2]}
END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]}' file
Explanation
BEGIN{FS=OFS=";"} set input and output field separator as semicolon.
{a[$2,$1]=$3; count[$2]} collects the list of countries in the count[] array and stores the value of each Ind in the a["country","Ind"] array.
END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]} print the summary of the values.
Output
$ awk 'BEGIN{FS=OFS=";"} {a[$2,$1]=$3; count[$2]} END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]}' file
France;200;17;2.4
Germany;300;16;7.9
USA;500;18;5.3
Update
unfortunately, the number of Indicators is not fixed. Also, they are not named like "Ind1", "Ind2" etc. but are just strings. I clarified my question.
$ awk -v FS=";" '{a[$2,$1]=$3; count[$2]; indic[$1]} END {for (j in indic) printf "%s ", j; printf "\n"; for (i in count) {printf "%s ", i; for (j in indic) printf "%s ", a[i,j]; printf "\n"}}' file
proportion_males no_of_people area_in_km
France 2.4 200 17
Germany 7.9 300 16
USA 5.3 500 18
To have it ;-separated, replace each space with ;:
$ awk -v FS=";" '{a[$2,$1]=$3; count[$2]; indic[$1]} END {for (j in indic) printf "%s ", j; printf "\n"; for (i in count) {printf "%s ", i; for (j in indic) printf "%s ", a[i,j]; printf "\n"}}' file | tr ' ' ';'
proportion_males;no_of_people;area_in_km;
France;2.4;200;17;
Germany;7.9;300;16;
USA;5.3;500;18;
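A small variation on the same idea (my own adjustment, not part of the answer above) builds each record with ; directly, skips the header line, and avoids both the tr pass and the trailing separator; the column order still follows awk's arbitrary for (j in indic) order:
awk -F';' '
NR > 1 { a[$2,$1] = $3; count[$2]; indic[$1] }
END {
    line = "Country"
    for (j in indic) line = line ";" j
    print line
    for (i in count) {
        line = i
        for (j in indic) line = line ";" a[i,j]
        print line
    }
}' file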
Using awk and maintaining the order of output:
awk -F\; '
NR>1 {
if(!($1 in indicators)) { indicator[++types] = $1 }; indicators[$1]++
if(!($2 in countries)) { country[++num] = $2 }; countries[$2]++
map[$1,$2] = $3
}
END {
printf "%s;" ,"Country";
for(ind=1; ind<=types; ind++) {
printf "%s%s", sep, indicator[ind];
sep = ";"
}
print "";
for(coun=1; coun<=num; coun++) {
printf "%s", country[coun]
for(val=1; val<=types; val++) {
printf "%s%s", sep, map[indicator[val], country[coun]];
}
print ""
}
}' file
Country;no_of_people;area_in_km;proportion_males
USA;500;18;5.3
Germany;300;16;7.9
France;200;17;2.4

awk manipulates csv file

I want to use awk to read a csv file. The csv file contains 5 columns: c1, c2, c3, c4, c5. I want to check that c1, c2 and c3 together are unique, like a database constraint.
Here is a sample csv file:
c1,c2,c3,c4,c5
1886,5141,11-2011,62242.57,52.71
1886,5140,11-2011,63763.75,52.22
23157666,4747,11-2011,71.07,83.33
1886,5141,11-2011,4645.45,2135.45
In this case, row 1 and row 4 violate the unique constraint, and an error message should be printed.
How can I implement this with awk? Thanks a lot in advance.
awk -F, 'line[$1,$2,$3] {printf "Error: lines %d and %d collide\n", line[$1,$2,$3], NR; next} {line[$1,$2,$3] = NR}'
This lists all the lines for each duplication. It only outputs the duplication message once for each set.
awk -F, '{count[$1,$2,$3]++; line[$1,$2,$3] = line[$1,$2,$3] ", " NR} END {for (i in count) {if (count[i] > 1) {v=i; gsub(SUBSEP, FS, v); print "Error: lines", substr(line[i], 3), "collide on value:", v}}}'
Broken out on multiple lines:
awk -F, '
{
    count[$1,$2,$3]++;
    line[$1,$2,$3] = line[$1,$2,$3] ", " NR
}
END {
    for (i in count) {
        if (count[i] > 1) {
            v = i;
            gsub(SUBSEP, FS, v);
            print "Error: lines", substr(line[i], 3), "collide on value:", v
        }
    }
}'
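For reference, a hedged usage sketch: if the program between the quotes is saved as dupes.awk (a throwaway name) and the sample is in input.csv, then, since NR counts the header as line 1, the colliding records from the question (data rows 1 and 4) should be reported as lines 2 and 5, roughly like this:
$ awk -F, -f dupes.awk input.csv
Error: lines 2, 5 collide on value: 1886,5141,11-2011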
This is a variation on Kevin's answer.
