Understanding AWK and CSV files - bash

How can I write an AWK program that analyses a given field in CSV files, counts the occurrences of each distinct string in that field, and prints the count for each string found? I have only coded in C and Java, so I am completely confused by AWK's syntax. I understand the simplest concepts, but AWK is structured very differently. Any help is appreciated, thank you!
BEGIN {
FS = ""
}
{
for(i = 1; i <= NF; i++)
freq[$i]++
PROCINFO ["sorted_in"] = "#val_num_desc" #this got the desired result
}
END {
for {this in freq)
printf "%s\t%d\n", this, freq[this]
}
On a CSV file containing:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
I am able to obtain the result:
A 2
B 1
C 3
Q 1
D 1
E 2
F 1
, 12
G 1
W 1
Field1,Field2,Field3,Field4 1
Z 2
This is the desired result:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
Field1 1
Field2 1
F 1
Field3 1
G 1
Field4 1
W 1
I have edited my code; the change is marked with a comment.

Fixed your code:
$ awk '
BEGIN { # you need BEGIN block for FS
FS = ", *" # your data had ", " and "," seps
} # ... based on your sample output
{
for(i = 1; i <= NF; i++)
freq[$i]++
}
END {
for(this in freq) # fixed a parenthesis
printf "%s\t%d\n", this, freq[this]
}' file
Output (using GNU awk; other awks print the lines in a different order, because the traversal order of for (key in array) is unspecified):
A 2
B 1
C 3
Q 1
D 2
Field1 1
E 2
Field2 1
F 1
Field3 1
G 1
Field4 1
W 1
Z 2
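For readers coming from C or Java, the same counting logic can be sketched in Python. This is only an illustration and not part of the answer above: the sample data is inlined as a string (SAMPLE and field_frequencies are made-up names), fields are split on the same ", *" pattern, and the result is sorted by descending count and then by name so the order is deterministic.

```python
import re
from collections import Counter

# Inlined copy of the sample CSV from the question (assumption: a real
# script would read the lines from a file instead).
SAMPLE = """Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
"""

def field_frequencies(text):
    """Count every field value, splitting on a comma plus optional spaces."""
    freq = Counter()
    for line in text.splitlines():
        freq.update(re.split(r", *", line))
    return freq

freq = field_frequencies(SAMPLE)

# Sort by count (descending), then by value, for a stable order.
for value, n in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{value}\t{n}")
```

Like the awk version, this counts the header names as ordinary values; skipping the first line would exclude them.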

AWK really isn't the right tool for this job. While AWK can split comma- or tab-separated data, it has no concept of field enclosures or escapes. So it could handle a simple example like:
Month,Day
January,Sunday
February,Monday
but would fail with this valid example:
Month,Day
January,"Sunday"
February,"Monday"
Because of that, I would recommend considering another language. Something like Python:
import csv
# newline='' lets the csv module handle line endings itself
with open('a.csv', newline='') as o:
    for m in csv.DictReader(o):
        print(m)
https://docs.python.org/library/csv.html
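To make the point about field enclosures concrete, here is a small sketch comparing a naive comma split with csv.reader on the quoted example above. The data is inlined as a string purely for illustration (DATA, naive and parsed are made-up names).

```python
import csv
import io

# Inlined copy of the quoted example (assumption: a real script would
# open a file instead of using a string).
DATA = 'Month,Day\nJanuary,"Sunday"\nFebruary,"Monday"\n'

# Naive split on commas: the enclosing quotes stay attached to the value.
naive = [line.split(",") for line in DATA.splitlines()]

# csv.reader treats the quotes as field enclosures and removes them.
parsed = list(csv.reader(io.StringIO(DATA)))

print(naive[1])
print(parsed[1])
```

The naive split keeps the quote characters as part of the second field, which is the same thing a plain FS="," split does in AWK.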
Or Ruby:
require 'csv'
CSV.table('a.csv').each do |m|
p m
end
https://ruby-doc.org/stdlib/libdoc/csv/rdoc/CSV.html
Or even PHP:
<?php
$r = fopen('a.csv', 'r');
$a_head = fgetcsv($r);
// fgetcsv returns false at end of file, so the last row is never dropped
while (($a_row = fgetcsv($r)) !== false) {
    $m_row = array_combine($a_head, $a_row);
    print_r($m_row);
}
fclose($r);
https://php.net/function.fgetcsv

Related

How to create a pivot table from a CSV file having subgroups and getting the count of the last values using shell script?

I want to group the columns then form subsequent group getting the count of last column values.
For example main Group A, Subgroup D, J , P and count of P in the subsequent groups as well as the total count of last column.
I am able to form groups but subgroup seems a little hard. Any help is appreciated like how to get this.
Input:
A,D,J,P
A,D,J,Q
A,D,K,P
A,D,K,P
A,E,J,Q
A,E,K,Q
A,E,J,Q
B,F,L,R
B,F,L,R
B,F,M,S
C,H,N,T
C,H,O,U
C,H,N,T
C,H,O,U
Output:
A D J P 1
      Q 1
    K P 2
A E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
    Total 14
Here's a different approach: a shell script that uses sqlite to calculate the group counts (requires SQLite 3.25 or newer because it uses window functions):
#!/bin/sh
file="$1"
sqlite3 -batch -noheader <<EOF
CREATE TABLE data(c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT);
.mode csv
.import "$file" data
.mode list
.separator " "
SELECT (CASE c1 WHEN lag(c1, 1) OVER (PARTITION BY c1 ORDER BY c1) THEN ' ' ELSE c1 END)
, (CASE c2 WHEN lag(c2, 1) OVER (PARTITION BY c1,c2 ORDER BY c1,c2) THEN ' ' ELSE c2 END)
, (CASE c3 WHEN lag(c3, 1) OVER (PARTITION BY c1,c2,c3 ORDER BY c1,c2,c3) THEN ' ' ELSE c3 END)
, c4
, count(*)
FROM data
GROUP BY c1, c2, c3, c4
ORDER BY c1, c2, c3, c4;
SELECT 'Total ' || count(*) FROM data;
EOF
Running this gives:
$ ./group.sh example.csv
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
Also a one-liner using datamash, though it doesn't include the fancy output format:
$ datamash -st, groupby 1,2,3,4 count 4 < example.csv | tr , ' '
A D J P 1
A D J Q 1
A D K P 2
A E J Q 2
A E K Q 1
B F L R 2
B F M S 1
C H N T 2
C H O U 2
Using Perl
Script
perl -0777 -lne '
s/^(.+?)$/$x++;$kv{$1}++/mge;
foreach my $k (sort keys %kv)
{ $q=$c=$k;
while(length($p) > 0)
{
last if $c=~/^$p/g;
$q=substr($c,length($p)-1);
$p=~s/(.$)//;
}
printf( "%9s\n", "$q $kv{$k}") ;
$p=$k;
}
print "Total $x";
' anurag.txt
Output:
A,D,J,P 1
      Q 1
    K,P 2
  E,J,Q 2
    K,Q 1
B,F,L,R 2
    M,S 1
C,H,N,T 2
    O,U 2
Total 14
$ cat tst.awk
BEGIN { FS="," }
!($0 in cnt) { recs[++numRecs] = $0 }
{ cnt[$0]++ }
END {
for (recNr=1; recNr<=numRecs; recNr++) {
rec = recs[recNr]
split(rec,f)
newVal = 0
for (i=1; i<=NF; i++) {
if (f[i] != p[i]) {
newVal = 1
}
printf "%s%s", (newVal ? f[i] : " "), OFS
p[i] = f[i]
}
print cnt[rec]
tot += cnt[rec]
}
print "Total", tot+0
}
$ awk -f tst.awk file
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
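The idea behind tst.awk (count identical records in first-seen order, then blank out the leading fields that repeat from the previous record) can also be sketched in Python. This is an illustrative sketch, not a translation endorsed by the answer's author; ROWS and pivot_lines are hypothetical names and the input is inlined.

```python
from collections import Counter

# Inlined copy of the question's input (assumption: normally read from a file).
ROWS = """A,D,J,P
A,D,J,Q
A,D,K,P
A,D,K,P
A,E,J,Q
A,E,K,Q
A,E,J,Q
B,F,L,R
B,F,L,R
B,F,M,S
C,H,N,T
C,H,O,U
C,H,N,T
C,H,O,U""".splitlines()

def pivot_lines(rows):
    """Return the pivot output as a list of strings."""
    counts = Counter(rows)          # keeps first-seen order (Python 3.7+)
    prev = []
    out = []
    for rec, n in counts.items():
        fields = rec.split(",")
        changed = False
        cells = []
        for i, value in enumerate(fields):
            # Once one field differs, print it and everything after it.
            if i >= len(prev) or value != prev[i]:
                changed = True
            cells.append(value if changed else " ")
        out.append(" ".join(cells + [str(n)]))
        prev = fields
    out.append(f"Total {sum(counts.values())}")
    return out

lines = pivot_lines(ROWS)
print("\n".join(lines))
```

As with the awk script, the grouping relies on related records first appearing next to each other; unsorted input would need a sort step first.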
I'll propose a multi-stage solution in the spirit of the Unix toolset.
First, create a sorted, counted, de-delimited data format:
$ sort file | uniq -c | awk '{print $2,$1}' | tr ',' ' '
A D J P 1
A D J Q 1
A D K P 2
A E J Q 2
A E K Q 1
B F L R 2
B F M S 1
C H N T 2
C H O U 2
Now the task is to remove the longest common left substring from consecutive lines:
... | awk 'NR==1 {p=$0}
NR>1 {k=0;
while(p~t=substr($0,1,++k));
gsub(/./," ",t); sub(/^ /,"",t);
p=$0; $0=t substr(p,k)}1'
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Whether this is easier to understand than a single script remains to be seen.
I have not exactly an answer that produces your example output but I'm close enough to dare posting an answer
Now I have an answer that produces exactly your example output... :-)
$ cat ABCD
A,D,J,P
A,D,J,Q
A,D,K,P
A,D,K,P
A,E,J,Q
A,E,K,Q
A,E,J,Q
B,F,L,R
B,F,L,R
B,F,M,S
C,H,N,T
C,H,O,U
C,H,N,T
C,H,O,U
$ awk '{a[$0]+=1}END{for(i in a) print i","a[i];print "Total",NR}' ABCD |\
sort | \
awk -F, '
/Total/{print;next}
{print a1==$1?" ":$1,a2==$2?" ":$2,a3==$3?" ":$3,a4==$4?" ":$4,$5
a1=$1;a2=$2;a3=$3;a4=$4}'
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K   1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
$
The first awk script increments, for every input line, an element of the array a indexed by the whole line. In the END block it loops over the indices of a and prints each index with its associated value, that is, the number of times that line occurs in the data. Finally it prints the total number of lines processed, which awk maintains automatically in the variable NR (number of records).
The second awk script either prints the total line and skips any further processing, or compares each field (split on commas) with the corresponding field of the previous line and outputs the new field or a space accordingly.

replacing specific value (from another file) using awk

I have the following files.
File1
a b 1
c d 2
e f 3
File2
x l
y m
z n
I want to replace 1 with x and save the result in file3, then replace 1 with y and save the result in file4.
The files should then look like this:
File3
a b x
c d 2
e f 3
File4
a b y
c d 2
e f 3
Once I have finished with x, y and z, I want to replace 2 with l, m and n in the same way.
I started with this, but it inserts the value instead of replacing it.
awk -v r=1 -v c=3 -v val=x -F, '
BEGIN{OFS=" "}; NR != r; NR == r {$c = val; print}
' file1 >file3
Here's a GNU awk script (because it uses multidimensional arrays and array ordering) that will do what you want:
#!/usr/bin/awk -f
BEGIN { fcnt=3 }
FNR==NR { for(i=1;i<=NF;i++) f2[i][NR]=$i; next }
{
fout[FNR][1] = $0
ff = $NF
if(ff in f2) {
for( r in f2[ff]) {
$NF = f2[ff][r]
fout[FNR][fcnt++] = $0
}
}
}
END {
for(f=fcnt-1;f>=3;f--) {
for( row in fout ) {
if( fout[row][f] != "" ) out = fout[row][f]
else out = fout[row][1]
print out > "file" f
}
}
}
I made at least one major assumption about your input data:
The field number in file2 corresponds exactly to the value that needs to be replaced in file1. For example, x is field 1 in file2, and 1 is what needs replacing in the output files.
Here's the breakdown:
Set fcnt=3 in the BEGIN block.
FNR==NR - store the contents of File2 in the f2 array by (field number, line number).
Store the original f1 line in fout as (line number,1) - where 1 is a special, available array position ( because fcnt starts at 3 ).
Save off $NF as ff because it's going to be reset
Whenever ff is a field number in the first subscript of the f2 array, then reset $NF to the value from file2 and then assign the result to fout at (line number, file number) as $0 ( recomputed ).
In the END, loop over the fcnt in reverse order, and either set out to a replaced line value or an original line value in row order, then print out to the desired filename.
It could be run like gawk -f script.awk file2 file1 ( notice the file order ). I get the following output:
$ cat file[3-8]
a b x
c d 2
e f 3
a b y
c d 2
e f 3
a b z
c d 2
e f 3
a b 1
c d l
e f 3
a b 1
c d m
e f 3
a b 1
c d n
e f 3
This could be made more efficient for memory by only performing the lookup in the END block, but I wanted to take advantage of the $0 recompute instead of needing calls to split in the END.
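The same assumption (column k of file2 holds the replacements for the value k in the last field of file1) can be sketched in Python. This is a hedged illustration only: FILE1, FILE2 and variants are made-up names, the files are inlined as lists of lines, and the result is returned as a dictionary instead of being written to disk.

```python
# Inlined copies of the question's files (assumption for illustration).
FILE1 = ["a b 1", "c d 2", "e f 3"]
FILE2 = ["x l", "y m", "z n"]

def variants(file1, file2, first_number=3):
    """Build one output 'file' per replacement value, numbered from file3."""
    rows = [line.split() for line in file1]
    # Transpose file2: columns[0] holds the replacements for the value "1",
    # columns[1] those for the value "2", and so on.
    columns = list(zip(*(line.split() for line in file2)))
    result = {}
    number = first_number
    for k, column in enumerate(columns, start=1):
        for replacement in column:
            new_rows = [
                r[:-1] + [replacement] if r[-1] == str(k) else r
                for r in rows
            ]
            result[f"file{number}"] = [" ".join(r) for r in new_rows]
            number += 1
    return result

files = variants(FILE1, FILE2)
for name, content in files.items():
    print(name, content)
```

Writing each entry of the dictionary to a file of the same name would reproduce file3 through file8.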

awk skipping records. getline command

This is a task related to data compression using the Fibonacci binary representation.
What I have is this text file:
result.txt
a 20
b 18
c 18
d 15
e 7
This file is the result of scanning a text file and counting the occurrences of each character using awk.
Now I need to give each character the length of its Fibonacci binary representation.
Since I'm new to Ubuntu and the terminal, I first wrote a Java program that receives a number and prints all the Fibonacci codeword lengths up to that number, and it works.
That is exactly what I'm trying to do here; the problem is that it doesn't work...
The number of codewords of each length also follows the Fibonacci sequence.
These are the rules:
f(1)=1 - there is 1 codeword of length 1.
f(2)=1 - there is 1 codeword of length 2.
f(3)=2 - there are 2 codewords of length 3.
f(4)=3 - there are 3 codewords of length 4.
and so on...
(I'm adding one more bit to each codeword, so the first two lengths will be 2 and 3.)
This is the code I've written; its name is scr5:
{
a=1;
b=1;
len=2
print $1 , $2, len;
getline;
print $1 ,$2, len+1;
getline;
len=4;
for(i=1; i< num; i++){
c= a+b;
g=c;
while (c >= 1){
print $1 ,$2, len ;
if (getline<=0){
print "EOF"
exit;
}
c--;
i++;
}
a=b;
b=c;
len++;
}}
Now I run this in the terminal:
n=5
awk -v num=$n -f scr5 a
There are two problems:
1. It skips the third letter, c.
2. For the fourth letter, d, it prints the length of the first letter, 2, instead of length 3.
I guess there is a problem with the getline command.
Thank you very much!
Search Google for getline and awk and you'll mostly find reasons to avoid getline completely! Often it's a sign you're not really doing things the "awk" way. Find an awk tutorial and work through the basics and I'm sure you'll see quickly why your attempt using getlines is not getting you off in the right direction.
In the script below, the BEGIN block is run once at the beginning before any input is read, and then the next block is automatically run once for each line of input --- without any need for getline.
Good luck!
$ cat fib.awk
BEGIN { prior_count = 0; count = 1; len = 1; remaining = count; }
{
if (remaining == 0) {
temp = count;
count += prior_count;
prior_count = temp;
remaining = count;
++len;
}
print $1, $2, len;
--remaining;
}
$ cat fib.txt
a 20
b 18
c 18
d 15
e 7
f 0
g 0
h 0
i 0
j 0
k 0
l 0
m 0
$ awk -f fib.awk fib.txt
a 20 1
b 18 2
c 18 3
d 15 3
e 7 4
f 0 4
g 0 4
h 0 5
i 0 5
j 0 5
k 0 5
l 0 5
m 0 6
The above solution in compressed form:
mawk 'BEGIN{ ___= __= _^=____=+_ } !_ { __+=(\
____=___+_*(_=___+=____))^!_ } $++NF = (_--<_)+__' fib.txt
a 20 1
b 18 2
c 18 3
d 15 3
e 7 4
f 0 4
g 0 4
h 0 5
i 0 5
j 0 5
k 0 5
l 0 5
m 0 6
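For readers more at home in other languages, the state machine in fib.awk (a running Fibonacci count of how many codewords share the current length) can be sketched in Python. The function name codeword_lengths is hypothetical; the logic mirrors the awk variables prior_count, count, len and remaining.

```python
def codeword_lengths(n):
    """Return the codeword length assigned to each of the first n symbols."""
    lengths = []
    prior, count, length = 0, 1, 1   # same starting state as the BEGIN block
    remaining = count                # symbols still to receive this length
    for _ in range(n):
        if remaining == 0:
            # Advance the Fibonacci pair and move on to the next length.
            prior, count = count, count + prior
            remaining = count
            length += 1
        lengths.append(length)
        remaining -= 1
    return lengths

print(codeword_lengths(13))
```

For 13 symbols this yields the same third column as the awk output above.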

How to transform genotypes into T/F -1/0/1 format using awk?

I have a very large dataset that I would like to transform from genotypes to a coded format. The genotypes should be represented as follows:
A A -> -1
A B -> 0
B B -> 1
I have thought about this using awk but I cannot seem to get a working solution that can read two columns and output a single code in place of the genotypes. The input file looks like this:
AnimalID Locus Allele1 Allele2
1 1 A B
1 2 A A
1 3 B B
2 1 B A
2 2 B A
2 3 A A
And should be coded to an output file to look like this:
AnimalID Locus1 Locus2 Locus3
1 0 -1 1
2 0 0 -1
I am assuming this can be done using boolean T/F? Any suggestions would be welcomed. Thanks.
Here is something to get you started:
I have stored the mapping in the BEGIN block. If a locus is missing for a particular ID, this will just print a blank for it. You didn't specify what B A would mean, so I took the liberty of mapping it to 0 based on your output.
awk '
BEGIN {
map["A","A"] = -1;
map["A","B"] = 0;
map["B","B"] = 1;
map["B","A"] = 0;
}
NR>1 {
idCount = (idCount<$1) ? $1 : idCount;
locusCount = (locusCount<$2) ? $2 : locusCount
code[$1,$2] = map[$3,$4]
}
END {
printf "%s ", "AnimalID";
for(cnt=1; cnt<=locusCount; cnt++) {
printf "%s%s", "Locus" cnt, ((cnt==locusCount) ? "\n" : " ")
}
for(cnt=1; cnt<=idCount; cnt++) {
printf "%s\t", cnt;
for(locus=1; locus<=locusCount; locus++) {
printf "%s%s", code[cnt,locus], ((locus==locusCount) ? "\n" : "\t")
}
}
}' inputFile
Output:
AnimalID Locus1 Locus2 Locus3
1 0 -1 1
2 0 0 -1
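The same lookup-and-pivot approach can be sketched in Python with a dictionary keyed by allele pairs. This is an illustration under the same assumption as the answer (B A is coded as 0); CODES, DATA and encode are made-up names and the input is inlined.

```python
# Genotype pair -> code; ("B", "A") -> 0 is an assumption taken from the
# expected output, as in the awk answer.
CODES = {("A", "A"): -1, ("A", "B"): 0, ("B", "A"): 0, ("B", "B"): 1}

DATA = """AnimalID Locus Allele1 Allele2
1 1 A B
1 2 A A
1 3 B B
2 1 B A
2 2 B A
2 3 A A
"""

def encode(text):
    """Return the pivoted table as a list of rows (lists of strings)."""
    table = {}      # animal -> {locus: code}, in first-seen order
    loci = set()
    for line in text.strip().splitlines()[1:]:   # skip the header line
        animal, locus, a1, a2 = line.split()
        table.setdefault(animal, {})[int(locus)] = CODES[(a1, a2)]
        loci.add(int(locus))
    ordered = sorted(loci)
    header = ["AnimalID"] + [f"Locus{n}" for n in ordered]
    # A locus missing for an animal is left blank, like the awk version.
    rows = [
        [animal] + [str(table[animal].get(n, "")) for n in ordered]
        for animal in table
    ]
    return [header] + rows

result = encode(DATA)
for row in result:
    print("\t".join(row))
```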

sum,count,distinct count fields with awk

I have a text file like this:
A B C D E
----------------------
x x e 2 10
y y g 1 8
z o e 2 9
o o q 1 10
p z e 3 22
x x e 1 11
z o a 1 24
y z b 1 25
I want to use awk to do the same thing as this SQL:
select A,
B,
count(distinct C),
sum(D),
sum(case when E>20 then E else 0 END)
from test
group by A,B
output:
A B count(distinct C) sum(D) sum(case when E>20 then E else 0 END)
-------------------------------------------------------
o o 1 1 0
p z 1 3 22
x x 1 3 0
y y 1 1 0
y z 1 1 25
z o 2 3 24
Here is my solution, but the distinct-count part is not complete:
awk '
{
idx4[$1"|"$2]=idx4[$1"|"$2]+$4;
idx5[$1"|"$2]=$5>20?idx5[$1"|"$2]+$5:idx5[$1"|"$2]
}
END {
for (i in idx4) print i, idx4[i], idx5[i]
}' OFS="\t" test
=============================================================================
After a few hours I completed it; here is my code:
awk '
{
    # mark each (A|B, C) combination so distinct C values can be counted later
    if (idx3[$1"|"$2, $3] == 0) {
        idx3[$1"|"$2, $3] += 1
    }
    idx4[$1"|"$2] = idx4[$1"|"$2] + $4
    idx5[$1"|"$2] = $5 > 20 ? idx5[$1"|"$2] + $5 : idx5[$1"|"$2]
}
END {
    for (j in idx3) {
        split(j, idx, SUBSEP)
        count[idx[1]]++
    }
    for (i in idx4) {
        print i, count[i], idx4[i], idx5[i]
    }
}' OFS="\t" test
Scrutinizer has given more readable code below; I think that is better.
Try this (similar to your own solution):
awk '
NR<3{
next
}
{
i=$1 OFS $2
D[i]+=$4
}
!A[i,$3]++{
C[i]++
}
$5>20{
E[i]+=$5
}
END{
for(i in D)print i, C[i], D[i], E[i]+0
}
' OFS='\t' infile
NR<3 is used to skip the two header lines. If they are not present in the input file you can leave that section out.
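Please try this script; I have tested it and it outputs the result you expect.
awk '
{
    if (NR < 3) { next }    # skip the two header lines
    idx4[$1"|"$2] = idx4[$1"|"$2] + $4
    idx5[$1"|"$2] = $5 > 20 ? idx5[$1"|"$2] + $5 : idx5[$1"|"$2]
    # TR holds the C values seen so far for this group, delimited by "|";
    # matching "|value|" avoids false hits when one value is a substring of another
    if (index(TR[$1"|"$2] "|", "|" $3 "|") == 0)
    {
        TR[$1"|"$2] = TR[$1"|"$2] "|" $3
        TRD[$1"|"$2] += 1
    }
}
END {
    for (i in idx4) print i, TRD[i], idx4[i], idx5[i]+0
}' OFS="\t" test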
$ gawk '{ a[$1,$2]["C"][$3]; a[$1,$2]["D"]+=$4; a[$1,$2]["E"]+=($5>20 ? $5:0) }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for ( i in a) {
split(i, b, SUBSEP)
print b[1], b[2], length(a[i]["C"]), a[i]["D"], a[i]["E"]
}
}' OFS='\t' file
o o 1 1 0
p z 1 3 22
x x 1 3 0
y y 1 1 0
y z 1 1 25
z o 2 3 24
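For comparison, the SQL semantics (group by A and B, count distinct C, sum D, sum E only where E > 20) can be sketched in Python with a set per group. This is an illustrative sketch; ROWS and summarize are made-up names and the data rows are inlined without the two header lines.

```python
from collections import defaultdict

# Data rows from the question, without the header and separator lines.
ROWS = [
    ("x", "x", "e", 2, 10),
    ("y", "y", "g", 1, 8),
    ("z", "o", "e", 2, 9),
    ("o", "o", "q", 1, 10),
    ("p", "z", "e", 3, 22),
    ("x", "x", "e", 1, 11),
    ("z", "o", "a", 1, 24),
    ("y", "z", "b", 1, 25),
]

def summarize(rows):
    """Return (A, B, count(distinct C), sum(D), sum(E where E > 20)) per group."""
    distinct_c = defaultdict(set)
    sum_d = defaultdict(int)
    sum_e = defaultdict(int)
    for a, b, c, d, e in rows:
        key = (a, b)
        distinct_c[key].add(c)      # a set keeps each C value only once
        sum_d[key] += d
        if e > 20:
            sum_e[key] += e
    return [
        (a, b, len(distinct_c[(a, b)]), sum_d[(a, b)], sum_e.get((a, b), 0))
        for a, b in sorted(distinct_c)
    ]

summary = summarize(ROWS)
for row in summary:
    print("\t".join(str(v) for v in row))
```

The set plays the role of the a[i]["C"][$3] subarray in the gawk version: its size is the distinct count.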
