UNIX: Getting count of occurrences of numbers from a CSV file - bash

I have a CSV file whose first and second columns are ID and domain.
#Input.txt
1,google.com
1,cnn.com
1,dropbox.com
2,bbc.com
3,twitter.com
3,hello.com
3,example.com
4,twitter.com
.............
Now, I would like to get the count of each ID. Yes, this can be done in Excel/Sheets, but the file contains about 1.5 million lines.
Expected Output:
1,3
2,1
3,3
4,1
I tried cat Input.txt | grep -c 1, which gives me a count of 3 for '1', but I would like to get the count for every individual ID at once. Can anyone help me with how to achieve this?

awk -F "," '{ ids[$1]++} END { for(id in ids) { print id, ids[id] } }' input
And input is the file with the data.
output:
1 3
2 1
3 3
4 1
Edit://
If you want comma-separated output, you need to set the output separator like this:
awk -F "," 'BEGIN { OFS=","} { ids[$1]++} END { for(id in ids) { print id, ids[id] } }' input
output:
1,3
2,1
3,3
4,1

Here's one way, though the count ends up in the first column:
$ zcat Input.txt.gz | cut -d , -f 1 | sort | uniq -c
3 1
1 2
3 3
1 4
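If the ID,count layout from the question is needed, a small awk at the end can swap the columns of uniq -c (shown here on the uncompressed Input.txt):
$ cut -d , -f 1 Input.txt | sort | uniq -c | awk '{print $2","$1}'
1,3
2,1
3,3
4,1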
Here's another way using awk:
$ awk -F , '{counter[$1]++};
END {for (id in counter) printf "%s,%d\n",id,counter[id];}' Input.txt |
sort
1,3
2,1
3,3
4,1

This will do the job in bash (the grep pattern is anchored to the start of the line so an ID is not matched inside a domain or a longer ID):
$ for i in {1..4}; do echo -n "$i," >> OUTPUT && grep -c "^$i," Input.txt >> OUTPUT; done
$ less OUTPUT
1,3
2,1
3,3
4,1
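The {1..4} range assumes the IDs are known in advance; if they are not, a rough sketch along the same lines could pull the distinct IDs from the file first (still one pass over the file per ID, so it stays slow for 1.5 million lines):
$ for i in $(cut -d, -f1 Input.txt | sort -un); do echo "$i,$(grep -c "^$i," Input.txt)"; done
1,3
2,1
3,3
4,1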

Provided equal IDs appear on consecutive lines (as in the sample), uniq -c can count them directly:
$ awk -F, '{ print $1 }' input.txt | uniq -c | awk '{ print $2 "," $1 }'
1,3
2,1
3,3
4,1

Here is a pure awk solution. It doesn't hold the entire file in memory, so it will probably use less memory than @Joda's answer, but it assumes that the file is sorted:
awk -F, -v OFS=, '$1==prev{c++;next} prev!=""{print prev,c} {prev=$1; c=1} END{print prev,c}' file
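If the file is not already grouped by ID, a sort can be prepended first (a minimal sketch, assuming purely numeric IDs in the first field):
sort -t, -k1,1n Input.txt |
awk -F, -v OFS=, '$1==prev{c++;next} prev!=""{print prev,c} {prev=$1; c=1} END{print prev,c}'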

Related

Finding difference of values between corresponding fields in two CSV files

I have been trying to find the difference between the values of corresponding fields in two CSV files.
$ cat f1.csv
A,B,25,35,50
C,D,30,40,36
$
$ cat f2.csv
E,F,20,40,50
G,H,22,40,40
$
Desired output:
5 -5 0
8 0 -4
I was able to achieve it like this:
$ paste -d "," f1.csv f2.csv
A,B,25,35,50,E,F,20,40,50
C,D,30,40,36,G,H,22,40,40
$
$ paste -d "," f1.csv f2.csv | awk -F, '{print $3-$8 " " $4-$9 " " $5-$10 }'
5 -5 0
8 0 -4
$
Is there any better way to achieve this with awk alone, without the paste command?
As a first step, replace only paste with awk:
awk -F ',' 'NR==FNR {file1[FNR]=$0; next} {print file1[FNR] FS $0}' f1.csv f2.csv
Output:
A,B,25,35,50,E,F,20,40,50
C,D,30,40,36,G,H,22,40,40
Then split file1[FNR] FS $0 into an array with , as the field separator:
awk -F ',' 'NR==FNR {file1[FNR]=$0; next} {split(file1[FNR] FS $0, arr, FS); print arr[3]-arr[8], arr[4]-arr[9], arr[5]-arr[10]}' f1.csv f2.csv
Output:
5 -5 0
8 0 -4
From man awk:
FNR: The input record number in the current input file.
NR: The total number of input records seen so far.
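A quick way to see the difference is to run awk over two small throw-away files:
$ printf 'a\nb\n' > x; printf 'c\n' > y
$ awk '{print FILENAME, NR, FNR}' x y
x 1 1
x 2 2
y 3 1
NR==FNR is therefore only true while the first file is being read, which is what the lookup scripts above rely on.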
Another way, using nl and awk:
$ (nl f1.csv;nl f2.csv) | sort | awk -F, ' {a1=$3;a2=$4;a3=$5; getline; print a1-$3,a2-$4,a3-$5 } '
5 -5 0
8 0 -4
$

Bash: compare 2 files and show the unique content of one file with 'hierarchy'

So basically, these are two files I need to compare
file1.txt
1 a
2 b
3 c
44 d
file2.txt
11 a
123 a
3 b
445 d
To show the lines unique to file 1, I use the 'comm -23' command after applying 'sort -u' to these 2 files. Additionally, I would like '11 a' and '123 a' in file 2 to become subsets of '1 a' in file 1, and similarly '445 d' a subset of '44 d'. These subsets are considered the same as their superset, so the desired output is
2 b
3 c
I'm a beginner and my loop is way too slow... So here is my code
comm -23 <( awk '{print $1,$2}' file1.txt | sort -u ) <( awk '{print $1,$2}' file2.txt | sort -u ) >output.txt
array=($( awk -F ',' '{print $1}' file1.txt ))
for i in "${array[@]}";do
awk -v pattern="$i" 'match($0, "^" pattern)' output.txt > repeat.txt
done
comm -23 <( cat output.txt | sort -u ) <( cat repeat.txt | sort -u )
Anyone got any good ideas?
Another question: Any ways I could show the row numbers from original file at output? For example,
(row num from file 1)
2 2 b
3 3 c
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR {                           # first file read (file2): remember its keys
    vals[$2][$1]                    # index file2's numbers by their letter
    next
}
$2 in vals {                        # file1 line whose letter also occurs in file2
    for (i in vals[$2]) {
        if ( index(i,$1) == 1 ) {   # a file2 number starts with this file1 number,
            next                    # so the file1 line has a "subset" and is dropped
        }
    }
}
{ print FNR, $0 }                   # otherwise print it, prefixed with its row number
$ awk -f tst.awk file2 file1
2 2 b
3 3 c

BASH: choosing and counting distinct values based on two columns

Hey guys, so I've got this dummy data:
115,IROM,1
125,FOLCOM,1
135,SE,1
111,ATLUZ,1
121,ATLUZ,2
121,ATLUZ,2
142,ATLUZ,2
142,ATLUZ,2
144,BLIZZARC,1
166,STEAD,3
166,STEAD,3
166,STEAD,3
168,BANDOI,1
179,FOX,1
199,C4,2
199,C4,2
Desired output:
IROM,1
FOLCOM,1
SE,1
ATLUZ,3
BLIZZARC,1
STEAD,1
BANDOI,1
FOX,1
C4,1
which comes from counting the distinct game IDs (the 115, 125, etc). So, for example, the
111,ATLUZ,1
121,ATLUZ,2
121,ATLUZ,2
142,ATLUZ,2
142,ATLUZ,2
Will be
ATLUZ,3
since it has 3 distinct game IDs.
I tried using
cut -d',' -f 2 game.csv|uniq -c
where I got the following output:
1 IROM
1 FOLCOM
1 SE
5 ATLUZ
1 BLIZZARC COMP
3 STEAD
1 BANDOI
1 FOX
2 C4
How do I fix this using bash?
Before executing the cut command, do a uniq. This removes the redundant (adjacent duplicate) lines; then follow with your command, i.e. apply cut to extract the 2nd field and uniq -c to count the occurrences:
uniq game.csv | cut -d',' -f 2 | uniq -c
Could you please try the following too, in a single awk:
awk -F, '
!a[$1,$2,$3]++{
    b[$1,$2,$3]++
}
!f[$2]++{
    g[++count]=$2
}
END{
    for(i in b){
        split(i,array,",")
        c[array[2]]++
    }
    for(q=1;q<=count;q++){
        print c[g[q]],g[q]
    }
}' SUBSEP="," Input_file
It will print the output in the same order in which each 2nd-field value first occurs in Input_file, as follows:
1 IROM
1 FOLCOM
1 SE
3 ATLUZ
1 BLIZZARC
1 STEAD
1 BANDOI
1 FOX
1 C4
Using GNU datamash:
datamash -t, --sort --group 2 countunique 1 < input
Using awk:
awk -F, '!a[$1,$2]++{b[$2]++}END{for(i in b)print i FS b[i]}' input
Using sort, cut, uniq:
sort -u -t, -k2,2 -k1,1 input | cut -d, -f2 | uniq -c
Test run:
$ cat input
111,ATLUZ,1
121,ATLUZ,1
121,ATLUZ,2
142,ATLUZ,2
115,IROM,1
142,ATLUZ,2
$ datamash -t, --sort --group 2 countunique 1 < input
ATLUZ,3
IROM,1
As you can see, 121,ATLUZ,1 and 121,ATLUZ,2 are correctly considered to be just one game ID.
Less elegant, but you may use awk as well. If it is not guaranteed that the same ID+NAME combos always come consecutively, you have to count them by reading the whole file before producing any output:
awk -F, '{c[$1,$2]+=1}END{for (ck in c){split(ck,ca,SUBSEP); g[ca[2]]+=1}for(gk in g){print gk,g[gk]}}' game.csv
This first counts every [COL1,COL2] pair, then for each COL2 counts how many distinct [COL1,COL2] pairs were seen.
This also does the trick. The only thing is that your output is not sorted.
awk 'BEGIN{ FS = OFS = "," }{ a[$2 FS $1] }END{ for ( i in a ){ split(i, b, "," ); c[b[1]]++ } for ( i in c ) print i, c[i] }' yourfile
Output:
BANDOI,1
C4,1
STEAD,1
BLIZZARC,1
FOLCOM,1
ATLUZ,3
SE,1
IROM,1
FOX,1

Subtract lengths of elements in two columns

I've a file from which I get two columns: cut -d $'\t' -f 4,5 file.txt
Now I would like to get the difference in length of each element between column 1 and 2.
Input from cut command
A T
AA T
AC TC
A CT
What I would expect
0
1
0
-1
Using awk.
awk ' {print length($1) - length($2)} ' cutoutput.txt
Or awk on the original file you can simply do:
awk ' {print length($4) - length($5)} ' file.txt
You can probably do this with awk alone, without using cut. Since I don't have the original input file, I would pipe your cut command into the following:
cut -d $'\t' -f 4,5 file.txt | \
awk '{for (i=1;i<NF;i++) s=length($i)-length($NF); printf s"\n"}'

Problems in mapping indices using awk

Hi all, I have these data files:
File1
1 The hero
2 Chainsaw and the gang
3 .........
4 .........
where the first field is the ID and the second field is the product name.
File 2
The hero 12
The hero 2
Chainsaw and the gang 2
.......................
From these two files I want to have a third file
File 3
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
.......................
As you can see, I am just appending the indices read from File 1.
I used this method
awk -F '\t' 'NR == FNR{a[$2]=$1; next}; {print $0, a[$1]}' File1 File2 > File3
where I am creating this associated array using File 1 and doing just lookup using product names from file 2
However, my files are huge: I have about 20 million product names, and this process is taking a lot of time. Any suggestions on how I can speed it up?
You can use this awk:
awk 'FNR==NR{p=$1; $1=""; sub(/^ +/, ""); a[$0]=p;next} {q=$NF; $NF=""; sub(/ +$/, "")}
($0 in a) {print $0, q, a[$0]}' f1 f2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
The script you posted won't produce the output you want from the input files you posted so let's fix that first:
$ cat file1
1 The hero
2 Chainsaw and the gang
$ cat file2
The hero 12
The hero 2
Chainsaw and the gang 2
$ awk -F'\t' 'NR==FNR{map[$2]=$1;next} {key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key]}' file1 file2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
Now, is that really too slow or were you doing some pre or post-processing and that was the real speed issue?
The obvious speed-up, if your "file2" is sorted, is to delete the corresponding map[] entry whenever the key changes, so your map[] gets smaller every time you use it, e.g. something like this (untested):
$ awk -F'\t' '
NR==FNR {map[$2]=$1; next}
{ key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key] }
key != prev { delete map[prev] }
{ prev = key }
' file1 file2
Alternative approach when populating map[] uses too much time/memory and file2 is sorted:
$ awk '
{ key=$0
sub(/[[:space:]]+[^[:space:]]+$/,"",key)
if (key != prev) {
cmd = "awk -F\"\t\" -v key=\"" key "\" \047$2 == key{print $1;exit}\047 file1"
cmd | getline val
close(cmd)
}
print $0, val
prev = key
}' file2
From comments you're having scaling problems with your lookups. The general fix for that is to merge sorted sequences:
join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 \
<( sort -t $'\t' -k2 file1) \
<( sort -t $'\t' -sk1,1 file2)
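On the sample files this produces the following tab-separated result (note that it comes back in key order rather than in file2's original order):
Chainsaw and the gang	2	2
The hero	12	1
The hero	2	1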
I gather Windows can't do process substitution, so you have to use temporary files:
sort -t $'\t' -k2 file1 >idlookup.bykey
sort -t $'\t' -sk1,1 file2 >values.bykey
join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 idlookup.bykey values.bykey
If you need to preserve the value lookup sequence use nl to put line numbers on the front and sort on those at the end.
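A sketch of that nl variant (the intermediate file names values.numbered, idlookup.bykey and values.bykey are only illustrative):
nl -ba file2 > values.numbered
sort -t $'\t' -k2 file1 > idlookup.bykey
sort -t $'\t' -sk2,2 values.numbered > values.bykey
join -t $'\t' -1 2 -2 2 -o 2.1,1.2,2.3,1.1 idlookup.bykey values.bykey |
sort -t $'\t' -k1,1n |
cut -f 2-
nl prefixes every file2 line with its line number, the final numeric sort restores that original order, and cut drops the helper column again.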
If your issue is performance then try this perl script:
#!/usr/bin/perl -l
use strict;
use warnings;

my %h;
open my $fh1, "<", "file1.txt";
open my $fh2, "<", "file2.txt";
open my $fh3, ">", "file3.txt";

# build the product-name -> id lookup from file1
while (<$fh1>) {
    my ($v, $k) = /(\d+)\s+(.*)/;
    $h{$k} = $v;
}

# append the id to every file2 line whose product name is known
while (<$fh2>) {
    my ($k, $v) = /(.*)\s+(\d+)$/;
    print $fh3 "$k $v $h{$k}" if exists $h{$k};
}
Save the above script as, say, script.pl and run it as perl script.pl. Make sure file1.txt and file2.txt are in the same directory as the script.
