AWK doesn't recognise more than one field when changing the value of an element - shell

I have a csv file, a simplified version of which is:
#data
1,2,3,4,normal
1,2,3,4,normal
1,2,3,4,normal
1,2,3,4,normal
1,2,3,4,normal
1,2,3,4,normal
1,2,3,4,normal
1,2,3,4,normal
1,2,3,4,normal
1,2,3,4,normal
When I do:
awk -F',' '{print NF}' myfile.csv
I get:
1
5
5
5
5
5
5
5
5
5
5
I am trying to change the 5th element of the 10th line in this dataset, but I noticed a strange behavior upon doing so. More specifically, when I give:
awk -F',' 'NR==10{$5="abnormal"}1' myfile.csv | awk -F',' '{print NF}'
I get:
1
5
5
5
5
5
5
5
5
1
5
Does anyone have an explanation or any thoughts on this?

Thanks to @EdMorton for the valuable comment that assigning a value to any field causes the record to be recompiled using the OFS value, which by default is a space.
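A quick way to see this recompilation in action: assigning any field to itself forces awk to rebuild $0 with OFS, even though nothing changed. A minimal sketch:
$ echo 'a,b,c' | awk -F',' '{$1=$1}1'
a b c
$ echo 'a,b,c' | awk -F',' -v OFS=',' '{$1=$1}1'
a,b,c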
Updating as per the updated question: @drDoom, you are missing `OFS=','`.
See the difference in the two outputs below with your sample data:
/home/amit/$ awk -F',' 'NR==10{$5="abnormal"}1' OFS=',' myfile.csv | awk -F',' '{print NF}'
1
5
5
5
5
5
5
5
5
5
5
/home/amit/$ awk -F',' 'NR==10{$5="abnormal"}1' myfile.csv | awk -F',' '{print NF}'
1
5
5
5
5
5
5
5
5
1
5
To change the 150th field on the 100th line, you can do as below:
awk -F',' 'NR==100{ $150 = "NewValue"}1' OFS=',' myfile.csv
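One caveat worth knowing: if the target line has fewer fields than the one being assigned, awk extends the record and pads the gap with empty fields. A small sketch:
$ echo 'a,b' | awk -F',' -v OFS=',' '{$5="x"}1'
a,b,,,x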

Any or all of these could be the issue:
a) Your csv file was created on Windows and so has extraneous control-M characters in it.
b) Your separator is not a comma [on every line].
c) You are miscounting which line is the 100th one.
Do this and update your question with the output:
dos2unix file
awk -F',' -v OFS=':' 'NR>98 && NR<102{print NR, NF, $1, $0}' file
Note that I said update your question with the output - do NOT post the output as a comment, as we will not be able to see the format.
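If dos2unix is not available, a common alternative is to strip the carriage returns inside awk itself (a sketch, assuming Windows line endings are the only problem):
awk '{sub(/\r$/,"")} 1' file > file.unix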

Optimizing grep -f piping commands [duplicate]

This question already has answers here: Inner join on two text files (5 answers). Closed 4 years ago.
I have two files.
file1 has some keys that have abc in the second column:
et1 abc
et2 abc
et55 abc
file2 has those column 1 values in its last column, plus some other numbers I need to add up:
1 2 3 4 5 et1
5 5 5 5 5 et100
3 3 3 3 3 et55
5 5 5 5 4 et1
6 6 6 6 3 et1
For the keys extracted from file1, I need to add up the corresponding column 5 if it matches. file2 itself is very large.
This command seems to be working but it is very slow:
egrep -isr "abc" file1.tcl | awk '{print $1}' | grep -vwf /dev/stdin file2.tcl | awk '{tl+=$5} END {print tl}'
How would I go about optimizing the pipe? Also, what am I doing wrong with grep -f? Is it generally not recommended to do something like this?
Edit: The expected output is the sum of column 5 in file2 whenever the column 6 key is present in file1.
Edit 2: Since file1 has the keys "et1, et2 and et55", adding up column 5 of the matching rows 1, 3, 4 and 5 in file2 gives the expected output [5+3+4+3=15].
Use a single awk to read file1 into the keys of an array. Then when reading file2, add $5 to a total variable when $6 is in the array.
awk 'NR==FNR {if ($2 == "abc") a[$1] = 0; next}
     $6 in a {total += $5}
     END     {print total}
' file1.tcl file2.tcl
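Run against the sample data above (assuming it is saved as file1.tcl and file2.tcl), a condensed version of the same script should print the expected total:
$ awk 'NR==FNR{if($2=="abc")a[$1]=0;next} $6 in a{total+=$5} END{print total}' file1.tcl file2.tcl
15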
Could you please try the following, which reads file2.tcl first and uses fewer loops. Since your expected output is not clear, I haven't completely tested it.
awk 'FNR==NR{a[$NF]+=$(NF-1);next} $2=="abc"{print $1,a[$1]+0}' file2.tcl file1.tcl
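Note that this prints per-key sums rather than a single total; on the sample data it should produce:
$ awk 'FNR==NR{a[$NF]+=$(NF-1);next} $2=="abc"{print $1,a[$1]+0}' file2.tcl file1.tcl
et1 12
et2 0
et55 3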

How to print the data that have the largest length using awk?

I have this input:
1 happy,t,c,d
2 t,c
3 e,fgh,k
4 yk,j,f
5 leti,j,f,g
First I want to print the length of each comma-delimited item in the 2nd column, which should yield:
1 5,1,1,1
2 1,1
3 1,3,1
4 2,1,1
5 4,1,1,1
And then I want to select the max length in the 2nd column, finally creating:
1 5
2 1
3 3
4 2
5 4
How can I do this in awk?
1) For the first task I have tried:
awk -v col=$2 -F',' '{OFS=","; for(i=1; i<=NF; i++) print length($i);}' test.data.txt
Which doesn't output correct data:
7
1
1
1
3
1
3
3
1
4
1
1
6
1
1
1
The problem is that I am not able to use the -v option properly to restrict it to that column. So I get all the data in one column, and the lengths of the first items include column 1 and the space between column 1 and column 2.
2) To select the max value, I am doing:
awk -F',' '{OFS="\t"; m=length($1); for(i=1; i<=NF; i++) if (length($i) > m) m=length($i); print m}' test.data.txt
This almost works, but the 1st column and the following space are included in the length of the first item, inflating the max values and giving me:
7
3
3
4
6
instead of:
5
1
3
2
4
Lastly, I want to merge these two processes in one go. Any suggestions on improvements?
awk -F'[, ]' -v OFS="," '{m=length($2);for (i=3;i<=NF;i++) if (length($i) > m) m=length($i)}{print $1,m}' file
1,5
2,1
3,3
4,2
5,4
For the first case:
awk -F'[, ]' -v OFS="," '{printf "%s",$1;for (i=2;i<=NF;i++) printf "%s%s",(i==2?" ":OFS),length($i)}{print ""}' file
1 5,1,1,1
2 1,1
3 1,3,1
4 2,1,1
5 4,1,1,1
Shorter alternative:
awk -F'[, ]' -v OFS="," '{printf "%s ",$1;for (i=2;i<=NF;i++) printf "%s%s",length($i),(i==NF?ORS:OFS)}' file
While print in awk prints data and then starts a new line, printf does not start a new line on its own.
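A minimal illustration of the difference:
$ awk 'BEGIN{print "a"; print "b"}'
a
b
$ awk 'BEGIN{printf "a"; printf "b"; print ""}'
ab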
PS: Thanks Ed Morton for the valuable comment.
We start with this data file:
$ cat data
1 happy,t,c,d
2 t,c
3 e,fgh,k
4 yk,j,f
5 leti,j,f,g
For the first task:
$ awk '{n=split($2,a,/,/); printf "%2s %s",$1,length(a[1]); for(i=2; i<=n; i++) printf ",%s",length(a[i]); print""}' data
1 5,1,1,1
2 1,1
3 1,3,1
4 2,1,1
5 4,1,1,1
For the second task:
$ awk '{n=split($2,a,/,/); max=length(a[1]); for(i=2; i<=n; i++) if (length(a[i])>max)max=length(a[i]); print $1,max}' data
1 5
2 1
3 3
4 2
5 4
How it works
For the second task:
n=split($2,a,/,/)
We split up the contents of field 2 into array a
max=length(a[1])
We assign the length of the first element of array a to the awk variable max.
for(i=2; i<=n; i++) if (length(a[i])>max)max=length(a[i])
If any succeeding element of array a is larger than max, we update max.
print $1,max
We print the first field and the value of max.
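To address the last part of the question, both tasks can be merged into a single pass; a sketch building on the same split approach (the combined output format here is my own choice):
$ awk '{n=split($2,a,/,/); m=0; s=""
        for(i=1;i<=n;i++){l=length(a[i]); s=s (i>1?",":"") l; if(l>m)m=l}
        print $1, s, m}' data
1 5,1,1,1 5
2 1,1 1
3 1,3,1 3
4 2,1,1 2
5 4,1,1,1 4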
Trying to golf this down:
gawk -F'[ ,]' '{m=0;for(i=2;i<=NF;i++){l=length($i);if(l>m)m=l}print$1,m}'
perl -MList::Util=max -F'\s+|,' -lne'$,=" ";print shift(@F),max map{length}@F'
perl -MList::Util=max -F'\s+|,' -lne'print"@{[shift(@F),max map{length}@F]}"'
perl -MList::Util=max -F'\s+|,' -lpe'$_="@{[shift(@F),max map{length}@F]}"'
ruby -F'[ ,]' -lape'$_="#{$F[0]} #{$F[1..-1].map{|e|e.size}.max}"'

Variable in commands in bash

I wrote a program that should print the words from example.txt from the longest to the shortest. I don't know how exactly '^.{$v}$' should look to make it work.
#!/bin/bash
v=30
while [ $v -gt 0 ] ; do
    grep -P '^.{$v}$' example.txt
    v=$(($v - 1))
done
I tried:
${v}
$v
"$v"
It is my first question, sorry for any mistake :)
What you're doing is not how you'd approach this problem in shell. Read why-is-using-a-shell-loop-to-process-text-considered-bad-practice to learn some of the issues and then this is how you'd really do what you're trying to do in a shell script:
$ cat file
now
is
the
winter
of
our
discontent
$ awk -v OFS='\t' '{print length($0), NR, $0}' file | sort -k1rn -k2n | cut -f3-
discontent
winter
now
the
our
is
of
To understand what that's doing, look at the awk output:
$ awk -v OFS='\t' '{print length($0), NR, $0}' file
3 1 now
2 2 is
3 3 the
6 4 winter
2 5 of
3 6 our
10 7 discontent
The first number is the length of each line and the second number is the order the lines appeared in the input file so when we come to sort it:
$ awk -v OFS='\t' '{print length($0), NR, $0}' file | sort -k1rn -k2n
10 7 discontent
6 4 winter
3 1 now
3 3 the
3 6 our
2 2 is
2 5 of
we can sort by length (longest first) with -k1rn but retain the order from the input file for lines that are the same length by adding -k2n. Then the cut just removes the 2 leading numbers that awk added for sort to use.
Use double quotes instead of single quotes, so the shell expands $v:
grep -P "^.{$v}$" example.txt

Remove comma using awk command with multiple records

Let's say I have records like this.
Input
1,1,1,1.213,1,1,1.23
2,2,2,2.345,2,2,2.33
3,3,3,3.456,3,3,3.44
I want to be like this
Output
1,1,1,1,1,1,1.23
2,2,2,2,2,2,2.33
3,3,3,3,3,3,3.44
How do I remove the decimal part only in the 4th column? I don't want to remove it from the last column.
You can use:
awk -F"," '{print $1,$2,$3,int($4),$5,$6,$7}'
The int() is what you are looking for I guess.
Example:
$ cat test
1,1,1,1.213,1,1,1.23
2,2,2,2.345,2,2,2.33
3,3,3,3.456,3,3,3.44
$ awk -F"," '{print $1,$2,$3,int($4),$5,$6,$7}' test
1 1 1 1 1 1 1.23
2 2 2 2 2 2 2.33
3 3 3 3 3 3 3.44
Edit (Good suggestion from ccf):
You could use this instead of the longer awk command above.
$ awk -F',' '{$4=int($4); print}'
1,1,1,1.213,1,1,1.23
1 1 1 1 1 1 1.23
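Note that both versions print with awk's default OFS, which is a space. To keep the commas shown in the desired output, set OFS as well; a sketch:
$ awk 'BEGIN{FS=OFS=","} {$4=int($4)} 1' test
1,1,1,1,1,1,1.23
2,2,2,2,2,2,2.33
3,3,3,3,3,3,3.44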
If temp.txt has the input, then
$ cat temp.txt | sed 's/\.[0-9]\+//1'
1,1,1,1,1,1,1.23
2,2,2,2,2,2,2.33
3,3,3,3,3,3,3.44
The 1 at the end means: replace only the first match.

awk space delimiter with empty content

I have a text file which is delimited by spaces:
1 dsfsdf 2
2  3
4 sdfsdf 4
5 sdfsdf 5
When I run
awk -F' ' '{s+=$3} END {print s}' test
It returns 11. It should return 14. I believe awk gets confused by the second line, where there is nothing between the two spaces. How should I modify my command?
Thanks
try
awk -F' {1}' '{s+=$3} END {print s}' test
you get
14
Note: if the test file contains
1 dsfsdf 2 1
2  3 1
4 sdfsdf 4 1
5 sdfsdf 5 1
it also works. I use gnu-awk.
Edit
As @Ed_Morton and @"(9 )*" say, it is better to use a literal space [ ]:
awk -F'[ ]' '{s+=$3} END {print s}' test
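The reason -F' ' returns 11 is that a lone space is a special FS value in awk: it means "split on runs of whitespace", so the empty field on the second line disappears. A bracketed [ ] (like ' {1}' above) forces a literal single-character separator. Comparing field counts shows the difference, assuming the second line really contains two adjacent spaces:
$ awk -F' ' '{print NF}' test
3
2
3
3
$ awk -F'[ ]' '{print NF}' test
3
3
3
3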
This should work too, if only the second column has missing values:
awk '{s+=$(NF-1)} END{print s}'
