Sum and replace in awk based on duplicate column - bash

I have a file that contains the following:
z,cat,7,9,bar
x,dog,9,9,bar
y,dog,3,4,foo
s,cat,3,4,bar
t,boat,21,1,foo
u,boat,19,3,bar
and I need to reach this result:
x,cat,10,13,x
x,dog,12,13,x
x,boat,40,4,x
I was trying something similar to
awk '{a[$NF]+=$1}END{for(x in a) printf "%s %s\n",x,a[x]}'
but what happens with this approach is that when you add more columns it breaks the whole thing, because columns 1, 2 and 5 can contain alphanumeric characters

This should do:
awk -F, '{arr1[$2]+=$3;arr2[$2]+=$4} END {for (i in arr1) print "x",i,arr1[i],arr2[i],"x"}' OFS=, file
x,cat,10,13,x
x,boat,40,4,x
x,dog,12,13,x

Perl solution:
perl -aF, -ne '$h{ $F[1] }[$_] += $F[ $_ + 2 ] for 0, 1
}{
$" = ",";
print "x,$k,@{ $h{$k} },x\n" while ($k, $v) = each %h'

Related

awk FS vs FPAT puzzle and counting words but not blank fields

Suppose I have the file:
$ cat file
This, that;
this-that or this.
(Punctuation at the line end is not always there...)
Now I want to count words (a word being one or more ASCII letters, case-insensitive). In typical POSIX *nix you could do:
sed -nE 's/[^[:alpha:]]+/ /g; s/ $//p' file | tr ' ' "\n" | tr '[:upper:]' '[:lower:]' | sort | uniq -c
1 or
2 that
3 this
With grep you can shorten that a bit to only match what you define as a word:
grep -oE '[[:alpha:]]+' file | tr '[:upper:]' '[:lower:]' | sort | uniq -c
# same output
With GNU awk, you can use FPAT to replicate matching only what you want (ignore sorting...):
gawk -v FPAT="[[:alpha:]]+" '
{for (i=1;i<=NF;i++) {seen[tolower($i)]++}}
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
3 this
1 or
2 that
Now trying to replicate in POSIX awk I tried:
awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
2
3 this
1 or
2 that
Note the 2 with blank at top. This is from having blank fields from ; at the end of line 1 and . at the end of line 2. If you delete the punctuation at line's end, this issue goes away.
You can partially fix it (for all but the last line) by setting RS="" in the awk, but still get a blank field with the last (only) line.
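For reference, the partial RS="" fix mentioned above would look like this sketch. With no blank lines in the input, the whole file becomes one record, so internal line ends no longer produce empty fields; only the trailing punctuation of the last line still yields one:

```shell
# Paragraph mode: RS="" joins the (blank-line-free) file into one
# record, so only the final "." still leaves one empty trailing field.
awk 'BEGIN{RS=""; FS="[^[:alpha:]]+"}
     {for (i=1; i<=NF; i++) seen[tolower($i)]++}
     END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
```

On the sample file this prints the three word counts plus a single blank-word entry with count 1.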
I can also fix it this way:
awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) if ($i) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
Which seems a little less than straightforward.
Is there an idiomatic fix I am missing to make POSIX awk act similarly to GNU awk's FPAT solution here?
This should work in POSIX/BSD or any version of awk:
awk -F '[^[:alpha:]]+' '
{for (i=1; i<=NF; ++i) ($i != "") && ++count[tolower($i)]}
END {for (e in count) printf "%4s %s\n", count[e], e}' file
1 or
3 this
2 that
By using -F '[^[:alpha:]]+' we are splitting fields on any non-alpha character.
The ($i != "") condition makes sure that only non-empty fields are counted.
With POSIX awk, I'd use match and the builtin RSTART and RLENGTH variables:
#!awk
{
s = $0
while (match(s, /[[:alpha:]]+/)) {
word = substr(s, RSTART, RLENGTH)
count[tolower(word)]++
s = substr(s, RSTART+RLENGTH)
}
}
END {
for (word in count) print count[word], word
}
$ awk -f countwords.awk file
1 or
3 this
2 that
Works with the default BSD awk on my Mac.
With your shown samples, please try the following awk code. Written and tested in GNU awk, in case you are OK with an RS-based approach.
awk -v RS='[[:alpha:]]+' '
RT{
val[tolower(RT)]++
}
END{
for(word in val){
print val[word], word
}
}
' Input_file
Explanation: set awk's record separator RS to [[:alpha:]]+, so each alphabetic run that matched the separator is available in the RT variable. In the main program, use the lower-cased RT as the index of array val and keep counting its occurrences. In the END block, traverse the array and print each index with its count.
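If RT is unfamiliar: when RS is a regex, GNU awk stores the text that actually matched the separator in RT, so with RS='[[:alpha:]]+' every alphabetic run lands in RT while the records themselves are the punctuation between words. A quick sanity check (gawk assumed):

```shell
# Each match of the record-separator regex is exposed in RT.
printf 'This, that;\n' |
gawk -v RS='[[:alpha:]]+' 'RT { print RT }'
```

This prints This and that on separate lines; the final record (";\n") has an empty RT, which is why the main program guards on RT.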
Using RS instead:
$ gawk -v RS="[^[:alpha:]]+" ' # [^a-zA-Z] or something for some awks
$0 { # remove possible leading null string
a[tolower($0)]++
}
END {
for(i in a)
print i,a[i]
}' file
Output:
this 3
or 1
that 2
Tested successfully on gawk and Mac awk (version 20200816) and on mawk and busybox awk using [^a-zA-Z]
With GNU awk using patsplit() and a second array for counting, you can try this:
awk 'patsplit($0, a, /[[:alpha:]]+/) {for (i in a) b[ tolower(a[i]) ]++} END {for (j in b) print b[j], j}' file
3 this
1 or
2 that

Use AWK to print FILENAME to CSV

I have a little script to compare some columns inside a bunch of CSV files.
It's working fine, but there are some things that are bugging me.
Here is the code:
FILES=./*
for f in $FILES
do
cat -v $f | sed "s/\^A/,/g" > op_tmp.csv
awk -F, -vOFS=, 'NR == 1{next} $9=="T"{t[$8]+=$7;n[$8]} $9=="A"{a[$8]+=$7;n[$8]} $9=="C"{c[$8]+=$7;n[$8]} $9=="R"{r[$8]+=$7;n[$8]} $9=="P"{p[$8]+=$7;n[$8]} END{ for (i in n){print i "|" "A" "|" a[i]; print i "|" "C" "|" c[i]; print i "|" "R" "|" r[i]; print i "|" "P" "|" p[i]; print i "|" "T" "|" t[i] "|" (t[i]==a[i]+c[i]+r[i]+p[i] ? "ERROR" : "MATCHED")} }' op_tmp.csv >> output.csv
rm op_tmp.csv
done
Just to explain:
I get all the files in the directory, then I use cat -v and sed to replace the ^A delimiter with a comma.
Then I use the awk one-liner to compare the columns I need and print the result to output.csv.
But now I want to print the filename before each file's output.
I tried combining the cat, sed and awk on the same line and printing FILENAME, but it doesn't work:
cat -v $f | sed "s/\^A/,/g" | awk -F, -vOFS=, 'NR == 1{next} $9=="T"{t[$8]+=$7;n[$8]} $9=="A"{a[$8]+=$7;n[$8]} $9=="C"{c[$8]+=$7;n[$8]} $9=="R"{r[$8]+=$7;n[$8]} $9=="P"{p[$8]+=$7;n[$8]} END{ for (i in n){print i "|" "A" "|" a[i]; print i "|" "C" "|" c[i]; print i "|" "R" "|" r[i]; print i "|" "P" "|" p[i]; print i "|" "T" "|" t[i] "|" (t[i]==a[i]+c[i]+r[i]+p[i] ? "ERROR" : "MATCHED")} }' > output.csv
Can anyone help?
You can rewrite the whole script better, but assuming it does what you want for now, just add
echo "$f" >> output.csv
before the awk call.
If you want to add filename in every awk output line, you have to pass it as an argument, i.e.
awk ... -v fname="$f" '{...; print fname... etc
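A minimal self-contained sketch of that -v approach (the file name sample.csv and the two-column input are made up for illustration):

```shell
# Pass the shell's file name into awk as the variable fname,
# then print it on every output line.
f="sample.csv"
printf 'a,1\nb,2\n' |
awk -F, -v fname="$f" -v OFS='|' '{print fname, $1, $2}'
```

This prints sample.csv|a|1 and sample.csv|b|2, one line per input record.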
A rewrite:
for f in ./*; do
awk -F '\x01' -v OFS="|" '
BEGIN {
letter[1]="A"; letter[2]="C"; letter[3]="R"; letter[4]="P"; letter[5]="T"
letters["A"] = letters["C"] = letters["R"] = letters["P"] = letters["T"] = 1
}
NR == 1 {next}
$9 in letters {
count[$9,$8] += $7
seen[$8]
}
END {
print FILENAME
for (i in seen) {
sum = 0
for (j=1; j<=4; j++) {
print i, letter[j], count[letter[j],i]
sum += count[letter[j],i]
}
print i, "T", count["T",i], (count["T",i] == sum ? "ERROR" : "MATCHED")
}
}
' "$f"
done > output.csv
Notes:
your method of iterating over files will break as soon as you have a filename with a space in it
try to reduce duplication as much as possible.
newlines are free, use them to improve readability
improve your variable names i, n, etc -- here "letter" and "letters" could use improvement to hold some meaning about those symbols.
awk has a FILENAME variable (here's the actual answer to your question)
awk understands \x01 to be a Ctrl-A -- I assume that's the field separator in your input files
define an Output Field Separator that you'll actually use
If you have GNU awk (version 4.0 or later) you can use the ENDFILE block and do away with the shell for loop altogether:
gawk -F '\x01' -v OFS="|" '
BEGIN {...}
FNR == 1 {next}
$9 in letters {...}
ENDFILE {
print FILENAME
for ...
# clean up the counters for the next file
delete count
delete seen
}
' ./* > output.csv
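If ENDFILE is new to you, this minimal sketch (throwaway files, gawk assumed) shows that the block fires once after each input file, which is what makes the per-file delete reset above work:

```shell
# ENDFILE runs after each input file finishes; per-file state can
# be reported and reset there.
printf 'a\nb\n' > /tmp/endfile_demo1
printf 'c\n'    > /tmp/endfile_demo2
gawk '{ n++ } ENDFILE { print FILENAME, n; n = 0 }' \
    /tmp/endfile_demo1 /tmp/endfile_demo2
rm /tmp/endfile_demo1 /tmp/endfile_demo2
```

This prints each file's name followed by its line count (2 and 1 respectively).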

Print a comma except on the last line in Awk

I have the following script
awk '{printf "%s", $1"-"$2", "}' $a >> positions;
where $a stores the name of the file. I am actually writing multiple column values into one row. However, I would like to print a comma only if I am not on the last line.
Single pass approach:
cat "$a" | # look, I can use this in a pipeline!
awk 'NR > 1 { printf(", ") } { printf("%s-%s", $1, $2) }'
Note that I've also simplified the string formatting.
Enjoy this one:
awk '{printf t $1"-"$2} {t=", "}' $a >> positions
Yeah, it looks a bit tricky at first sight. So I'll explain; first of all, let's change printf to print for clarity:
awk '{print t $1"-"$2} {t=", "}' file
and have a look what it does, for example, for file with this simple content:
1 A
2 B
3 C
4 D
so it will produce the following:
1-A
, 2-B
, 3-C
, 4-D
The trick is the leading t variable, which is empty at the beginning. It is only set {t=", "} after it has already been printed {print t ...} for the current record, so as awk keeps iterating we get the desired sequence.
I would do it by finding the number of lines before running the script, e.g. with coreutils and bash:
awk -v nlines=$(wc -l < $a) '{printf "%s", $1"-"$2} NR != nlines { printf ", " }' $a >>positions
If your file only has 2 columns, the following coreutils alternative also works. Example data:
paste <(seq 5) <(seq 5 -1 1) | tee testfile
Output:
1 5
2 4
3 3
4 2
5 1
Now, replacing tabs with newlines, paste easily assembles the data into the desired format:
<testfile tr '\t' '\n' | paste -sd-,
Output:
1-5,2-4,3-3,4-2,5-1
You might think that awk's ORS and OFS would be a reasonable way to handle this:
$ awk '{print $1,$2}' OFS="-" ORS=", " input.txt
But this results in a final ORS because the input contains a newline on the last line. The newline is a record separator, so from awk's perspective there is an empty last record in the input. You can work around this with a bit of hackery, but the resultant complexity eliminates the elegance of the one-liner.
So here's my take on this. Since you say you're "writing multiple column values", it's possible that mucking with ORS and OFS would cause problems. So we can achieve the desired output entirely with formatting.
$ cat input.txt
3 2
5 4
1 8
$ awk '{printf "%s%d-%d",t,$1,$2; t=", "} END{print ""}' input.txt
3-2, 5-4, 1-8
This is similar to Michael's and rook's single-pass approaches, but it uses a single printf and correctly uses the format string for formatting.
This will likely perform negligibly better than Michael's solution because an assignment should take less CPU than a test, and noticeably better than any of the multi-pass solutions because the file only needs to be read once.
Here's another way without resorting to coreutils; the file is read twice, and the first pass only counts the lines so that ORS can be chosen per record:
awk 'FNR==NR { c++; next } { ORS = (FNR==c ? "\n" : ", "); print $1, $2 }' OFS="-" file file
awk '{a[NR]=$1"-"$2} END{for(i=1;i<NR;i++) printf "%s, ", a[i]; print a[NR]}' $a > positions

Make awk consider a double-quoted string as one token and ignore the spaces in between

Data file - data.txt:
ABC "I am ABC" 35 DESC
DEF "I am not ABC" 42 DESC
cat data.txt | awk '{print $2}'
will print "I" instead of the whole quoted string.
How can I make awk ignore the spaces within the quotes and treat the quoted string as one single token?
Another alternative would be to use the FPAT variable, that defines a regular expression describing the contents of each field.
Save this AWK script as parse.awk:
#!/bin/awk -f
BEGIN {
FPAT = "([^ ]+)|(\"[^\"]+\")"
}
{
print $2
}
Make it executable with chmod +x ./parse.awk and parse your data file as ./parse.awk data.txt:
"I am ABC"
"I am not ABC"
Yes, this can be done nicely in awk. It's easy to get all the fields without any serious hacks.
(This example works in both The One True Awk and in gawk.)
{
split($0, a, "\"")
$2 = a[2]
$3 = $(NF - 1)
$4 = $NF
print "and the fields are ", $1, "+", $2, "+", $3, "+", $4
}
Try this:
$ cat data.txt | awk -F\" '{print $2}'
I am ABC
I am not ABC
The top answer for this question only works for lines with a single quoted field. When I found this question I needed something that could work for an arbitrary number of quoted fields.
Eventually I came upon an answer by Wintermute in another thread, and he provided a good generalized solution to this problem. I've just modified it to remove the quotes. Note that you need to invoke awk with -F\" when running the below program.
BEGIN { OFS = "" } {
for (i = 1; i <= NF; i += 2) {
gsub(/[ \t]+/, ",", $i)
}
print
}
This works by observing that when you split on the "-character, every other field will be the contents of a quote, so the program replaces the whitespace in the fields that are not inside quotes with a comma.
You can then easily chain another instance of awk to do whatever processing you need (just use the field separator switch again, -F,).
Note that this might break if the first field is quoted - I haven't tested it. If it does, though, it should be easy to fix by adding an if statement to start at 2 rather than 1 if the first character of the line is a ".
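For a concrete run of the program above (which, as noted, must be invoked with -F\") on the sample line from the question:

```shell
# Split on double quotes; odd-numbered fields are outside the quotes,
# so squeeze their whitespace into commas and reassemble with OFS="".
printf 'ABC "I am ABC" 35 DESC\n' |
awk -F'"' 'BEGIN { OFS = "" }
{
  for (i = 1; i <= NF; i += 2)
    gsub(/[ \t]+/, ",", $i)
  print
}'
```

This prints ABC,I am ABC,35,DESC: the quoted text survives intact while the surrounding fields become comma-separated.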
I've put together a function that re-splits $0 into an array called B. Spaces between double quotes are not acting as field separators. Works with any number of fields, a mix of quoted and unquoted ones. Here goes:
#!/usr/bin/gawk -f
# Resplit $0 into array B. Spaces between double quotes are not separators.
# Single quotes not handled. No escaping of double quotes.
function resplit( a, l, i, j, b, k, BNF) # all are local variables
{
l=split($0, a, "\"")
BNF=0
delete B
for (i=1;i<=l;++i)
{
if (i % 2)
{
k=split(a[i], b)
for (j=1;j<=k;++j)
B[++BNF] = b[j]
}
else
{
B[++BNF] = "\""a[i]"\""
}
}
}
{
resplit()
for (i=1;i<=length(B);++i)
print i ": " B[i]
}
Hope it helps.
Okay, if you really want all three fields, you can get them, but it takes a lot of piping:
$ cat data.txt | awk -F\" '{print $1 "," $2 "," $3}' | awk -F' ,' '{print $1 "," $2}' | awk -F', ' '{print $1 "," $2}' | awk -F, '{print $1 "," $2 "," $3}'
ABC,I am ABC,35
DEF,I am not ABC,42
By the last pipe you've got all three fields to do whatever you'd like with.
Here is something like what I finally got working that is more generic for my project.
Note it doesn't use awk.
someText="ABC \"I am ABC\" 35 DESC '1 23' testing 456"
putItemsInLines() {
local items=""
local firstItem="true"
while test $# -gt 0; do
if [ "$firstItem" == "true" ]; then
items="$1"
firstItem="false"
else
items="$items
$1"
fi
shift
done
echo "$items"
}
count=0
while read -r valueLine; do
echo "$count: $valueLine"
count=$(( $count + 1 ))
done <<< "$(eval putItemsInLines $someText)"
Which outputs:
0: ABC
1: I am ABC
2: 35
3: DESC
4: 1 23
5: testing
6: 456

Print last 10 rows of specific columns using awk

I have the awk command below, and it works, aside from the fact that it performs the print action on the entire file (as expected). I would like it to perform the formatting only on the last 10 lines of the file (or any arbitrary number). Any suggestions are greatly appreciated, thanks!
I know one solution would be to pipe it through tail, but I would like to stick with a pure awk solution.
awk '{print "<category label=\"" $13 " " $14 " " $15 "\"/>"}' foofile
There is no need to be orthodox with a language or tool on the Unix shell.
tail -10 foofile | awk '{print "<category label=\"" $13 " " $14 " " $15 "\"/>"}'
is a good solution. And, you already had it.
Your arbitrary number can still be used as an argument to tail, nothing is lost;
solution does not lose any elegance.
Using ring buffers, this one-liner prints last 10 lines;
awk '{a[NR%10]=$0}END{for(i=NR+1;i<=NR+10;i++)print a[i%10]}'
then, you can merge "print last 10 lines" and "print specific columns" like below;
{
arr_line[NR % 10] = $0;
}
END {
for (i = NR + 1; i <= NR + 10; i++) {
split(arr_line[i % 10], arr_field);
print "<category label=\"" arr_field[13] " " \
arr_field[14] " " \
arr_field[15] "\"/>";
}
}
I don't think this can be tidily done in awk. The only way you can do it is to buffer the last X lines, and then print them in the END block.
I think you'll be better off sticking with tail :-)
Just for the last 10 rows:
awk 'BEGIN{OFS="\n"}
{
a=b;b=c;c=d;d=e;e=f;f=g;g=h;h=i;i=j;j=$0
}END{
print a,b,c,d,e,f,g,h,i,j
}' file
In the case of a variable number of columns, I have worked out two solutions:
#cutlast [number] [[$1] [$2] [$3]...]
function cutlast {
  local length=${1:-1}; shift
  local list=( ${@:-$(cat)} )
  local output="${list[*]: -$length}"
  test -z "$output" && return 1
  echo $output
}
#example: cutlast 2 one two three print print # echoes print print
#example1: echo one two three four print print | cutlast 2 # echoes print print
or
function cutlast {
  local num=${1:-1}; shift
  echo ${@:-$(cat)} | rev | cut -d ' ' -f-"$num" | rev
}
#example: cutlast 2 one two three print print # echoes print print
There are loads of awk one-liners in this text document; not sure if any of those will help.
This specifically might be what you're after (something similar anyway):
# print the last 2 lines of a file (emulates "tail -2")
awk '{y=x "\n" $0; x=$0};END{print y}'
This is very inefficient: it reads the whole file line by line only to print the last two lines.
Because there is no seek() in awk, it is recommended to use tail to print the last lines of a file.
