I have a CSV file as below:
C1, C2, C3,Cv1,Cv2,Cv3,Cv4 ... (this row can have more Cv columns)
x1, x2 ,x3.1, 1.1, 1.2, 1.3, 1.4
x1, x2, x3.2, 2.1, 2.2, 2.3, 2.4
x1, x2, x3.3, 3.1, 3.2, 3.3, 3.4
I would like to transform this CSV file into the following:
C1,C2, C3,CTEXT,XVALUE
x1, x2, x3.1, Cv1 , 1.1
x1, x2, x3.1, Cv2 , 1.2
x1, x2, x3.1, Cv3 , 1.3
x1, x2, x3.1, Cv4 , 1.4
x1, x2, x3.2, Cv1 , 2.1
x1, x2, x3.2, Cv2 , 2.2
x1, x2, x3.2, Cv3 , 2.3
x1, x2, x3.2, Cv4 , 2.4
x1, x2, x3.3, Cv1 , 3.1
x1,x2,x3.3, Cv2 , 3.2
x1,x2,x3.3, Cv3 , 3.3
x1,x2,x3.3, Cv4 , 3.4
Below is my code:
#!/bin/bash
awk -F, -v OFS=, '{ if (NR==1)
{ print $1,$2,$3, "CTEXT","XVALUE"
i=4; while (i < NF) {
a[i]=$i; i=i+1
}
am=NF; next
}
i=4 ; while (i < am) {
if (i > NF) {print "record "NR" insufficient value" >/dev/stderr
break}
print $1,$2,$3,a[i],$i
i=i+1
}
if (am <NF) print "record "NR" too many values for text" >/dev/stderr
}' input.csv
When I run the script, it shows this error:
awk: syntax error near line 2
awk: bailing out near line 2
Edit by Ed Morton - I just ran the script through a beautifier (gawk -o- '...') so it's much easier to read/understand:
{
if (NR == 1) {
print $1, $2, $3, "CTEXT", "XVALUE"
i = 4
while (i < NF) {
a[i] = $i
i = i + 1
}
am = NF
next
}
i = 4
while (i < am) {
if (i > NF) {
print("record " NR " insufficient value") > (/dev/) stderr
break
}
print $1, $2, $3, a[i], $i
i = i + 1
}
if (am < NF) {
print("record " NR " too many values for text") > (/dev/) stderr
}
}
Even if you switch your Solaris awk to gawk or nawk, there still
remain some problems. Would you please try the following:
awk -F, -v OFS=, '
NR==1 {
print $1,$2,$3, "CTEXT","XVALUE"
for (i = 4; i <= NF; i++) a[i]=$i
am=NF; next
}
{
if (am < NF) {
print "record "NR" too many values for text" > "/dev/stderr"
next
}
for (i = 4; i <= am; i++) {
if (i > NF) {
print "record "NR" insufficient value" > "/dev/stderr"
break
}
print $1,$2,$3,a[i],$i
}
}' input.csv
You need to iterate i up to NF or am inclusive (<=, not <).
Enclose /dev/stderr in quotes.
It is better to use a for loop rather than a while loop.
Hope this helps.
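The quoting point is worth seeing in isolation. Unquoted, gawk parses /dev/stderr as the regex /dev/ (matched against $0) concatenated with an uninitialized variable named stderr, exactly as the beautified output above shows, so the message lands in a file named after that expression's value rather than on standard error. A minimal check of the quoted form:

```shell
# Quoted, "/dev/stderr" is an ordinary filename string that modern awks
# special-case, so the diagnostic really reaches standard error:
awk 'BEGIN { print "diagnostic" > "/dev/stderr" }'
```

Redirecting stderr (2>somefile) confirms the message arrives there; the unquoted form instead creates a file literally named 0 or 1 under gawk, and is a plain syntax error in older awks such as the Solaris one.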
Something like this:
$ awk -F, 'BEGIN {OFS=FS}
NR==1 {n=split($0,h);
print $1,$2,$3,"CTEXT","XVALUE";
next}
n!=NF {print n<NF?"too many":"not enough";
exit}
{for(i=4;i<=NF;i++) print $1,$2,$3,h[i],$i}' file
C1,C2,C3,CTEXT,XVALUE
x1,x2,x3.1,Cv1,1.1
x1,x2,x3.1,Cv2,1.2
x1,x2,x3.1,Cv3,1.3
x1,x2,x3.1,Cv4,1.4
x1,x2,x3.2,Cv1,2.1
x1,x2,x3.2,Cv2,2.2
x1,x2,x3.2,Cv3,2.3
x1,x2,x3.2,Cv4,2.4
x1,x2,x3.3,Cv1,3.1
x1,x2,x3.3,Cv2,3.2
x1,x2,x3.3,Cv3,3.3
x1,x2,x3.3,Cv4,3.4
This may look like a duplicate, but I could not solve the issue I'm having.
I'm trying to find the average of the value columns for each row of a CSV/TSV file. The data looks like below:
input.tsv
ID source random text val1 val2 val3 val4 val330
1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89
2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29
output.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
or at least
ID Avg
1 0.606
2 0.404
I tried a suggestion from here
awk 'NR==1{next}
{printf("%s\t", $1
printf("%.2f\n", ($5 + $6 + $7)/3}' input.tsv
which threw a syntax error.
and
awk '{ s = 4; for (i = 5; i <= NF; i++) s += $i; print $1, (NF > 1) ? s / (NF - 1) : 0; }' input.tsv
the below code also threw a syntax error
for i in `cat input.tsv` do; VALUES=`echo $i | tr '\t' '\t'`;COUNT=0;SUM=0;typeset -i j;IFS=' ';for j in $VALUES; do;SUM=`expr $SUM + $j`;COUNT=`expr $COUNT + 1`;done;AVG=`expr $SUM / $COUNT`;echo $AVG;done
Please help me resolve the issue and calculate the average for each row.
Starting from your code:
awk 'NR==1{next}
{
# missing the closing ). This prints the 1st column
#printf("%s\t", $1
printf("%s\t", $1 )
# missing the closing ), and it averages only 3 columns
#printf("%.2f\n", ($5 + $6 + $7)/3
printf("%.2f\n", ($5 + $6 + $7 + $8 + $9) / 5 )
}' input.tsv
Your second snippet is not easy to work with: lots of subshells (backticks) and a shell loop. Most of all, expr only handles integer values, and the loop walks the full line rather than fields 5 through 9. Forget it unless you want to avoid awk entirely.
For fun:
awk 'NR==1{
# Header
print $0 OFS "Avg"
Count = NF - 4
next
}
{
# print each element of the line + sum after col 4
Avg = 0
for( i=1;i<=NF;i++) {
if( i >=5 ) Avg+= $i
printf( "%s ", $i)
}
# print average
printf( "%.2f\n", Avg/Count )
}
' input.tsv
This assumes every line carries the full set of values; if some lines have fewer values and empty ones should not count, compute the count per line as (NF - 4) instead.
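A per-line variant along those lines (a sketch, with the question's sample rows inlined so it is self-contained; it averages fields 5 through NF, dividing by NF - 4 computed on each line):

```shell
printf '%s\n' \
  'ID source random text val1 val2 val3 val4 val330' \
  '1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89' \
  '2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29' |
awk 'NR==1 { print $0, "Avg"; next }   # header row
     { sum = 0
       for (i = 5; i <= NF; i++) sum += $i
       printf "%s %.3f\n", $0, (NF > 4 ? sum / (NF - 4) : 0) }'
```

The two data rows come out with averages 0.606 and 0.404, matching the desired output.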
You could use this awk script:
awk 'NR>1{
for(i=5;i<=NF;i++)
sum+=$i
}
{
print $1,$2,$3,$4,(NF>4&&sum!=""?sum/(NF-4):(NR==1?"Avg":""))
sum=0
}' file | column -t
The first block accumulates the sum of the values from the 5th field onward.
The second block prints the header line and the average value.
column -t displays the result in aligned columns.
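Piped through the script with the sample rows inlined (a quick self-contained check; column -t here is the util-linux tool):

```shell
printf '%s\n' \
  'ID source random text val1 val2 val3 val4 val330' \
  '1 atttt eeeee test 0.9 0.5 0.2 0.54 0.89' \
  '2 afdg adfgrg tf 0.6 0.23 0.5 0.4 0.29' |
awk 'NR>1{
  for(i=5;i<=NF;i++)
    sum+=$i
}
{
  print $1,$2,$3,$4,(NF>4&&sum!=""?sum/(NF-4):(NR==1?"Avg":""))
  sum=0
}' | column -t
```

The header row prints Avg (sum is still empty there), and the data rows print 0.606 and 0.404.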
This would be working as expected:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ s=0; for(i=5;i<=NF;++i) s+=$i }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
or just for the fun of it, if you want to make the for-loop obfuscated:
awk 'BEGIN{OFS="\t"}
(NR==1){ print $1,$2,$3,$4,"Avg:"; next }
{ for(s=!(i=5);i<=NF;s+=$(i++)) {} }
{ print $1,$2,$3,$4, (NF>4 ? s/(NF-4) : s) }' input.tsv
$ cat tst.awk
NR == 1 { avg = "Avg" }
NR > 1 {
sum = cnt = 0
for (i=5; i<=NF; i++) {
sum += $i
cnt++
}
avg = (cnt ? sum / cnt : 0)
}
{ print $1, $2, $3, $4, avg }
$ awk -f tst.awk file
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
Using a Perl one-liner:
> perl -lane '{ $s=0;foreach(@F[4..8]){$s+=$_} $F[4]=$s==0?"Avg":$s/5;print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]" } ' input.tsv
ID source random text Avg
1 atttt eeeee test 0.606
2 afdg adfgrg tf 0.404
file 1:
field1|field2|field3|
abc|123|234
def|345|456
hij|567|678
file2:
field1|field2|field3|
abc|890|234
hij|567|658
desired output:
field1|field2|field3|
abc|N|Y
def|345|456
hij|Y|N
I need to compare the two files: if the fields match, the output should contain Y, else N.
Using awk, you may try this:
awk -F '|' 'FNR == NR {
p = $1
sub(p, "")
a[p] = $0
next
}
{
if (FNR > 1 && $1 in a) {
split(a[$1], b, /\|/)
printf "%s", $1 FS
for (i=2; i<=NF; i++)
printf "%s%s", ($i == b[i] ? "Y" : "N"), (i == NF ? ORS : FS)
}
else
print
}' file2 file1
field1|field2|field3|
abc|N|Y
def|345|456
hij|Y|N
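In place of a live demo link, here is a self-contained run (the file names file1/file2 and the scratch directory are just for illustration):

```shell
# Recreate the two sample files in a scratch directory, then compare.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'field1|field2|field3|' 'abc|123|234' 'def|345|456' 'hij|567|678' > file1
printf '%s\n' 'field1|field2|field3|' 'abc|890|234' 'hij|567|658' > file2

awk -F '|' 'FNR == NR { p = $1; sub(p, ""); a[p] = $0; next }
{
  if (FNR > 1 && $1 in a) {
    split(a[$1], b, /\|/)
    printf "%s", $1 FS
    for (i = 2; i <= NF; i++)
      printf "%s%s", ($i == b[i] ? "Y" : "N"), (i == NF ? ORS : FS)
  }
  else print
}' file2 file1
```

This reproduces the four output lines shown above: matching fields become Y, differing ones N, and keys absent from file2 pass through unchanged.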
I wrote some awk script to be executed while looping over {a..z}.txt files. I've been staring at this code for 30 minutes, but I just can't find what's wrong. The terminal complains that there is some syntax error around >, but I don't think that's where the bug is.
Basically, what I'm trying to do is this:
Each line contains a string and a following set of numbers. I want to re-print the numbers so that the first number is the smallest one of them.
input:  a 1125159 2554 290 47364290 47392510 48629708 68 60771
output: a 290 1125159 2554 47364290 47392510 48629708 68 60771
Could anyone help me find what is wrong with the below code?
for alphabet in {a..z}
do
awk -F$'\t' "NF>2{maxId=\$2;maxIndex=2;
for(i=2; i<=NF; i++){
if(maxId>\$i){maxId=\$i; maxIndex=i}
};
printf \"%s \t %s \t\",\$1, maxId;
for(i=2; i<=NF; i++){
if(i!=maxIndex)
printf \"%d \t\", \$i};
printf \"\n\";
}" $alphabet.merged > $alphabet.out
done
Here's how your script should really be written:
awk 'BEGIN { FS=OFS="\t" }
NF>2 {
minIndex = 2
for (i=3; i<=NF; i++) {
if ( $minIndex > $i ) {
minIndex = i
}
}
printf "%s%s%s", $1, OFS, $minIndex
for (i=2; i<=NF; i++) {
if ( i != minIndex ) {
printf "%s%s", OFS, $i
}
}
print ""
}' file
a 68 1125159 2554 290 47364290 47392510 48629708 60771
Don't shy away from white space and brackets as they help readability. I don't understand the purpose of the surrounding shell loop in your question though - I suspect all you really need is:
awk 'BEGIN { FS=OFS="\t" }
FNR==1 { close(out); out=FILENAME; sub(/merged/,"out",out) }
NF>2 {
minIndex = 2
for (i=3; i<=NF; i++) {
if ( $minIndex > $i ) {
minIndex = i
}
}
printf "%s%s%s", $1, OFS, $minIndex > out
for (i=2; i<=NF; i++) {
if ( i != minIndex ) {
printf "%s%s", OFS, $i > out
}
}
print "" > out
}' *.merged
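As a quick self-contained check of the first script, the question's sample line can be fed in directly (tab-separated via printf); the true minimum, 68, moves to the second column:

```shell
printf 'a\t1125159\t2554\t290\t47364290\t47392510\t48629708\t68\t60771\n' |
awk 'BEGIN { FS=OFS="\t" }
NF>2 {
  minIndex = 2
  for (i=3; i<=NF; i++)
    if ( $minIndex > $i ) minIndex = i
  printf "%s%s%s", $1, OFS, $minIndex
  for (i=2; i<=NF; i++)
    if ( i != minIndex ) printf "%s%s", OFS, $i
  print ""
}'
```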
My CSV data looks like this:
Indicator;Country;Value
no_of_people;USA;500
no_of_people;Germany;300
no_of_people;France;200
area_in_km;USA;18
area_in_km;Germany;16
area_in_km;France;17
proportion_males;USA;5.3
proportion_males;Germany;7.9
proportion_males;France;2.4
I want my data to look like this:
Country;no_of_people;area_in_km;proportion_males
USA;500;18;5.3
Germany;300;16;7.9
France;200;17;2.4
There are more Indicators and more countries than listed here.
The files are pretty large (row counts in the five digits).
I looked around for some transpose threads, but nothing matched my situation (I'm also quite new to awk, so I couldn't adapt the code I found to fit my data).
Thanks for your help.
Regards
Ad
If the number of Ind fields is fixed, you can do:
awk 'BEGIN{FS=OFS=";"}
{a[$2,$1]=$3; count[$2]}
END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]}' file
Explanation
BEGIN{FS=OFS=";"} set input and output field separator as semicolon.
{a[$2,$1]=$3; count[$2]} get list of countries in count[] array and values of each Ind on a["country","Ind"] array.
END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]} print the summary of the values.
Output
$ awk 'BEGIN{FS=OFS=";"} {a[$2,$1]=$3; count[$2]} END {for (i in count) print i, a[i,"Ind1"], a[i, "Ind2"], a[i, "Ind3"]}' file
France;200;17;2.4
Germany;300;16;7.9
USA;500;18;5.3
Update
"unfortunately, the number of Indicators is not fixed. Also, they are
not named like "Ind1", "Ind2" etc. but are just strings. I clarified
my question."
$ awk -v FS=";" '{a[$2,$1]=$3; count[$2]; indic[$1]} END {for (j in indic) printf "%s ", j; printf "\n"; for (i in count) {printf "%s ", i; for (j in indic) printf "%s ", a[i,j]; printf "\n"}}' file
proportion_males no_of_people area_in_km
France 2.4 200 17
Germany 7.9 300 16
USA 5.3 500 18
To have it ;-separated, replace each space with ;:
$ awk -v FS=";" '{a[$2,$1]=$3; count[$2]; indic[$1]} END {for (j in indic) printf "%s ", j; printf "\n"; for (i in count) {printf "%s ", i; for (j in indic) printf "%s ", a[i,j]; printf "\n"}}' file | tr ' ' ';'
proportion_males;no_of_people;area_in_km;
France;2.4;200;17;
Germany;7.9;300;16;
USA;5.3;500;18;
Using awk and maintaining the order of output:
awk -F\; '
NR>1 {
if(!($1 in indicators)) { indicator[++types] = $1 }; indicators[$1]++
if(!($2 in countries)) { country[++num] = $2 }; countries[$2]++
map[$1,$2] = $3
}
END {
printf "%s;" ,"Country";
for(ind=1; ind<=types; ind++) {
printf "%s%s", sep, indicator[ind];
sep = ";"
}
print "";
for(coun=1; coun<=num; coun++) {
printf "%s", country[coun]
for(val=1; val<=types; val++) {
printf "%s%s", sep, map[indicator[val], country[coun]];
}
print ""
}
}' file
Country;no_of_people;area_in_km;proportion_males
USA;500;18;5.3
Germany;300;16;7.9
France;200;17;2.4
I want to use awk to read a CSV file. The file contains 5 columns: c1, c2, c3, c4, c5. I want to check that c1, c2, and c3 together are unique, like a database constraint.
Here is a sample csv file:
c1,c2,c3,c4,c5
1886,5141,11-2011,62242.57,52.71
1886,5140,11-2011,63763.75,52.22
23157666,4747,11-2011,71.07,83.33
1886,5141,11-2011,4645.45,2135.45
In this case, row 1 and row 4 violate the unique constraint, and an error message should be printed.
How can I implement this with awk? Thanks a lot in advance.
awk -F, 'line[$1,$2,$3] {printf "Error: lines %d and %d collide\n", line[$1,$2,$3], NR; next} {line[$1,$2,$3] = NR}'
The following variant lists all the lines for each duplicate set, and it outputs the duplication message only once per set:
awk -F, '{count[$1,$2,$3]++; line[$1,$2,$3] = line[$1,$2,$3] ", " NR} END {for (i in count) {if (count[i] > 1) {v=i; gsub(SUBSEP, FS, v); print "Error: lines", substr(line[i], 3), "collide on value:", v}}}'
Broken out on multiple lines:
awk -F, '
{
count[$1,$2,$3]++;
line[$1,$2,$3] = line[$1,$2,$3] ", " NR
}
END {
for (i in count) {
if (count[i] > 1) {
v = i;
gsub(SUBSEP, FS, v);
print "Error: lines", substr(line[i], 3), "collide on value:", v
}
}
}'
This is a variation on Kevin's answer.
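For a self-contained check, the first one-liner can be run with the sample rows inlined. Note that NR counts the header line, so the question's data rows 1 and 4 are reported as lines 2 and 5:

```shell
printf '%s\n' 'c1,c2,c3,c4,c5' \
  '1886,5141,11-2011,62242.57,52.71' \
  '1886,5140,11-2011,63763.75,52.22' \
  '23157666,4747,11-2011,71.07,83.33' \
  '1886,5141,11-2011,4645.45,2135.45' |
awk -F, 'line[$1,$2,$3] { printf "Error: lines %d and %d collide\n", line[$1,$2,$3], NR; next }
         { line[$1,$2,$3] = NR }'
```

This prints "Error: lines 2 and 5 collide". Skip the header with NR>1 if you want the message numbered by data row instead.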