Count number of nonempty entries in each column of, e.g., comm output - bash

The Unix command comm file1 file2 has a 3 column output with lines unique to file1 in the first column, lines unique to file2 in the second, and lines shared by both in the 3rd (assuming file1 and file2 are sorted). It ends up looking something like this:
$ echo -e "alpha\nbravo\ncharlie" > file1
$ echo -e "alpha\nbravo\ndelta" > file2
$ comm file1 file2
alpha
bravo
charlie
delta
If I want the number of nonempty lines in each column, is there a general way to parse the output of comm and count those?
I know that for comm in particular I could just run
for i in {12,23,31}; do comm -$i file1 file2 | wc -l; done
but I'm curious about solutions that take the comm output file as a starting point, for the sake of getting better at Unix command line. I added the awk tag because I have a hunch there's a good awk solution.

The other answer covers your question of using awk to do the job quite well, but it is also worth mentioning that the GNU version of comm has a --total option which will print the sum of each column in a similar manner.

You may use this awk:
comm file1 file2 |
awk -F '\t' -v OFS='\n' '{ if ($1=="") if ($2=="") c3++; else c2++; else c1++ }
END { print c3, c2, c1 }'
2
1
1
Note that output of comm is tab delimited with these cases:
1st and 2nd empty column in common lines
1st empty column in lines unique to file2
1st non-empty column in lines unique to file1

The question is interesting, but not as easy as one would imagine, especially if you do not have the --total option.
A couple of things about comm:
comm works on sorted files
if a line appears n times in file1 and m times n < m times in file2, comm will output n-m entries in column 2 and n entries in column 3.
$ comm <(echo -e "1\n2\n3") <(echo "2\n2\n3\n4")
1
2
2
3
4
comm uses <tab>-character as a default separator, processing its output becomes problematic if your input contains this character.
$ comm <(echo -e "1\t2\n3") <(echo "2\n3\n4")
1 2 << this is the weird line
2
3
4
Luckily it has an option to define the delimiter (--output-delimiter=STR)
comm only adds a delimiter if other non-empty fields are following
$ comm --output-delimiter=SEP <(echo -e "1\n2\n3") <(echo "2\n3\n4")
1 << NO SEP (1 field)
SEPSEP2 << TWO SEP (3 fields)
SEPSEP3 << TWO SEP (3 fields)
SEP4 << ONE SEP (2 fields)
How can we solve it now:
We should clearly not use an ASCII symbol as a delimiter, this is asking for problems when processing ASCII files, so what you can do is use a non-printable character as a delimiter. You could use for example <start-of-heading>-character with octal value \001 (it does not accept the <null>-character). This generally solves the issues you might have due to point (3)
$ comm --output-delimiter=$'\001' <(echo -e "1\t2\n3") <(echo "2\n3\n4")
this output can now be piped into an extremely simple awk
$ awk -F "\001" '{a[NF]++}END{print a[1],a[2],a[3] }'
the above works because of point (4).
So you can just do:
$ comm --output-delimiter=$'\001' file1 file2 \
| awk -F "\001" '{a[NF]++}END{print a[1],a[2],a[3] }'
But I don't have that --output-delimiter option: This calls for the pure awk solution. We keep track of 3 arrays. a for file1 b for file2 and c for the combination. (c keeps track of all the entries). We make sure to keep point (2) into account.
$ awk '(NR==FNR) { a[$0]++; c[$0]++ }
(NR!=FNR) { b[$0]++; c[$0]-- }
END { for(i in c) {
if (c[i] < 0) { countb+=-c[i]; countc+=a[i] }
else if (c[i] == 0) { countc+=a[i] }
else { counta+= c[i]; countc+=b[i] }
}
print counta, countb, countc
}' file1 file2
We could essentially get rid of the array b as it can be derived from a and c, but I wanted to make it a bit more clear how it works; the other version would be:
$ awk '(NR==FNR) { a[$0]++; c[$0]++; next } { c[$0]-- }
END { for(i in c) {
counta+=(c[i]>0 ? c[i] : 0)
countb-=(c[i]<0 ? c[i] : 0)
countc+=a[i] - (c[i]>0 ? c[i] : 0)
}
print counta, countb, countc
}' file1 file2

Using Perl
$ comm file1 file2 | perl -lne ' /^\t\t/ and $kv{2}++; /^\t\S+/ and $kv{1}++; /^\S+/ and $kv{3}++; END { print "col-$_:$kv{$_}" for(keys %kv) } '
col-3:1
col-1:1
col-2:2
$
or
$ comm file1 file2 | perl -lne ' /(^\t\t)|(^\t\S+)|(^.)/ and $x=$+[0]>2?3:$+[0]; $kv{$x}++; END { print "col-$_:$kv{$_}" for(keys %kv) } '
col-3:1
col-1:1
col-2:2
$
where
col-1 -> first file
col-3 -> second file
col-2 -> both file

obviously you can do all in awk without comm or requiring sorted inputs.
$ awk 'NR==FNR {a[$1]; next}
{if($1 in a) {c3++; delete a[$1]}
else c2++}
END {print length(a),c2,c3}' file1 file2
1 1 2
that's counts for file1 only, file2 only, and common.
Note, this requires that the records are unique in each file.

Related

Write specific columns of files into another files, Who can give me a more concise solution?

I have a troublesome problem about writing specific columns of the file into another file, more details are I have the file1 like below, I need to write the first columns exclude the first row to file2 with one line and separated with '|' sign. And now I have a solution by sed and awk, this missing last step inserts into the top of file2, even though I still believe there should be some more concise solution on account of powerful of awk、sed, etc. So, Who can offer me another more concise script?
sed '1d;s/ .//' ./file1 | awk '{printf "%s|", $1; }' | awk '{if (NR != 0) {print substr($1, 1, length($1) - 1)}}'
file1:
col_name data_type comment
aaa string null
bbb int null
ccc int null
file2:
xxx ccc(whatever is this)
The result of file2 should be this :
aaa|bbb|ccc
xxx ccc(whatever is this)
Assuming there's no whitespace in the column 1 data, in increasing length:
sed -i "1i$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')" file2
or
ed file2 <<END
1i
$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')
.
wq
END
or
{ awk 'NR > 1 {print $1}' file1 | paste -sd '|'; cat file2; } | sponge file2
or
mapfile -t lines < <(tail -n +2 file1)
col1=( "${lines[#]%%[[:blank:]]*}" )
new=$(IFS='|'; echo "${col1[*]}"; cat file2)
echo "$new" > file2
This might work for you (GNU sed):
sed -z 's/[^\n]*\n//;s/\(\S*\).*/\1/mg;y/\n/|/;s/|$/\n/;r file2' file1
Process file1 "wholemeal" by using the -z command line option.
Remove the first line.
Remove all columns other than the first.
Replace newlines by |'s
Replace the last | by a newline.
Append file2.
Alternative using just command line utils:
tail +2 file1 | cut -d' ' -f1 | paste -s -d'|' | cat - file2
Tail file1 from line 2 onwards.
Using the results from the tail command, isolate the first column using a space as the column delimiter.
Using the results from the cut command, serialize each line into one, delimited by |',s.
Using the results from the paste, append file2 using the cat command.
I'm learning awk at the moment.
awk 'BEGIN{a=""} {if(NR>1) a = a $1 "|"} END{a=substr(a, 1, length(a)-1); print a}' file1
Edit: Here's another version that uses an array:
awk 'NR > 1 {a[++n]=$1} END{for(i=1; i<=n; ++i){if(i>1) printf("|"); printf("%s", a[i])} printf("\n")}' file1
Here is a simple Awk script to merge the files as per your spec.
awk '# From the first file, merge all lines except the first
NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
# We are in the second file; add a newline after data from first file
FNR == 1 { printf "\n" }
# Simply print all lines from file2
1' file1 file2
The NR==FNR condition is true when we are reading the first input file: The overall line number NR is equal to the line number within the current file FNR. The final 1 is a common idiom for printing all input lines which make it this far into the script (the next in the first block prevent lines from the first file to reaching this far).
For conciseness, you can remove the comments.
awk 'NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
FNR == 1 { printf "\n" } 1' file1 file2
Generally speaking, Awk can do everything sed can do, so piping sed into Awk (or vice versa) is nearly always a useless use of sed.

using awk how to merge 2 files, say A & B and do a left outer join function and include all columns in both files

I have multiple files with different number of columns, i need to do a merge on first file and second file and do a left outer join in awk respective to first file and print all columns in both files matching the first column of both files.
I have tried below codes to get close to my output. But i can't print the ",', where no matching number is found in second file. Below is the code. Join needs sorting and takes more time than awk. My file sizes are big, like 30 million records.
awk -F ',' '{
if (NR==FNR){ r[$1]=$0}
else{ if($1 in r)
r[$1]=r[$1]gensub($1,"",1)}
}END{for(i in r){print r[i]}}' file1 file2
file1
number,column1,column2,..columnN
File2
numbr,column1,column2,..columnN
Output
number,file1.column1,file1.column2,..file1.columnN,file2.column1,file2.column3...,file2.columnN
file1
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
desired output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,,
5,a,b,c,x,y
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
tail = gensub(/[^,]*,/,"",1)
if ( FNR == 1 ) {
empty = gensub(/[^,]/,"","g",tail)
}
file2[$1] = tail
next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
The above uses GNU awk for gensub(), with other awks it's just one more step to do [g]sub() on the appropriate variable after initially assigning it.
An interesting (to me at least!) alternative you might want to test for a performance difference is:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
tail = gensub(/[^,]*,/,"",1)
idx[$1] = NR
file2[NR] = tail
if ( FNR == 1 ) {
file2[""] = gensub(/[^,]/,"","g",tail)
}
next
}
{ print $0, file2[idx[$1]] }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
but I don't really expect it to be any faster and it MAY even be slower.
you can try,
awk 'BEGIN{FS=OFS=","}
FNR==NR{d[$1]=substr($0,index($0,",")+1); next}
{print $0, ($1 in d?d[$1]:",")}' file2 file1
you get,
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
join to the rescue:
$ join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 file1.txt file2.txt
Explanation:
-t $',': Field separator token.
-a 1: Do not discard records from file 1 if not present in file 2.
-e '': Missing records will be treated as an empty field.
-o: Output format.
file1.txt
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2.txt
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
Output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y

Counting equal lines in two files

Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering is there a simpler or more elegant way to achieve the same result.
P.S. Outputting the percentage of common part to the number of lines in each file would also be nice, though is not necessary.
UPD: Files do not have duplicate lines
To find lines in common with your 2 files, using awk :
awk 'a[$0]++' file1 file2
Will output 3 10 15
Now, just pipe this to wc to get the number of common lines :
awk 'a[$0]++' file1 file2 | wc -l
Will output 3.
Explanation:
Here, a works like a dictionary with default value of 0. When you write a[$0]++, you will add 1 to a[$0], but this instruction returns the previous value of a[$0] (see difference between a++ and ++a). So you will have 0 ( = false) the first time you encounter a certain string and 1 ( or more, still = true) the next times.
By default, awk 'condition' file is a syntax for outputting all the lines where condition is true.
Be also aware that the a[] array will expand every time you encounter a new key. At the end of your script, the size of the array will be the number of unique values you have throughout all your input files (in OP's example, it would be 9).
Note: this solution counts duplicates, i.e if you have:
file1 | file2
1 | 3
2 | 3
3 | 3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code to filter out duplicates :
awk '++a[$0] == 2' file1 file2 | wc -l
with your input example, this works too. but if the files are huge, I prefer the awk solutions by others:
grep -cFwf file2 file1
with your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files. The arguments 1,2 suppresses unique lines found in both files.
The output is the lines they have in common, on separate lines. wc -l counts the number of lines.
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use comm command. Remember that you will have to first sort the files that you need to compare:
[gc#slave ~]$ sort a > sorted_1
[gc#slave ~]$ sort b > sorted_2
[gc#slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From man pages for comm command:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
In your solution, you can get rid of one cat:
sort file1 file2| uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only if they exist in both files, I believe this next version will work, but will only report duplicates in the second file that exist in the first file. (If the duplicates exist in the first file, only the those that also exist in file2 will be reported, so file order matters).
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1 : new version ensures intra-file duplicates are excluded from count, so only cross-file duplicates would show up in the final stats :
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
this is probably waaay overkill, but i wrote something similar to this to supplement uniq -c :
measuring the frequency of frequencies
it's like uniq -c | uniq -c without wasting time sorting. The summation and % parts are trivial from here, with 47 over-lapping lines in this example. It avoids spending any time performing per row processing, since the current setup only shows the summarized stats.
If you need to actual duplicated rows, they're also available right there serving as the hash key for the 1st array.
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
add another file, and the results reflect the changes (I added <( jot - 5 1295 5 ) ):
3 9
2 115
1 482

Problems in mapping indices using awk

Hi all I have this data files
File1
1 The hero
2 Chainsaw and the gang
3 .........
4 .........
where the first field is the id and the second field is the product name
File 2
The hero 12
The hero 2
Chainsaw and the gang 2
.......................
From these two files I want to have a third file
File 3
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
.......................
As you can see I am just adding the indices reading from file 1
I used this method
awk -F '\t' 'NR == FNR{a[$2]=$1; next}; {print $0, a[$1]}' File1 File2 > File 3
where I am creating this associated array using File 1 and doing just lookup using product names from file 2
However my files are huge, I have like 20 million product names and this process is taking a lot of time. Any suggestions, how I can speed it up?
You can use this awk:
awk 'FNR==NR{p=$1; $1=""; sub(/^ +/, ""); a[$0]=p;next} {q=$NF; $NF=""; sub(/ +$/, "")}
($0 in a) {print $0, q, a[$0]}' f1 f2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
The script you posted won't produce the output you want from the input files you posted so let's fix that first:
$ cat file1
1 The hero
2 Chainsaw and the gang
$ cat file2
The hero 12
The hero 2
Chainsaw and the gang 2
$ awk -F'\t' 'NR==FNR{map[$2]=$1;next} {key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key]}' file1 file2
The hero 12 1
The hero 2 1
Chainsaw and the gang 2 2
Now, is that really too slow or were you doing some pre or post-processing and that was the real speed issue?
The obvious speed up is if your "file2" is sorted then you can delete the corresponding map[] value whenever the key changes so your map[] gets smaller every time you use it. e.g. something like this (untested):
$ awk -F'\t' '
NR==FNR {map[$2]=$1; next}
{ key=$0; sub(/[[:space:]]+[^[:space:]]+$/,"",key); print $0, map[key] }
key != prev { delete map[prev] }
{ prev = key }
' file1 file2
Alternative approach when populating map[] uses too much time/memory and file2 is sorted:
$ awk '
{ key=$0
sub(/[[:space:]]+[^[:space:]]+$/,"",key)
if (key != prev) {
cmd = "awk -F\"\t\" -v key=\"" key "\" \047$2 == key{print $1;exit}\047 file1"
cmd | getline val
close(cmd)
}
print $0, val
prev = key
}' file2
From comments you're having scaling problems with your lookups. The general fix for that is to merge sorted sequences:
join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 \
<( sort -t $'\t' -k2 file1) \
<( sort -t $'\t' -sk1,1 file2)
I gather Windows can't do process substitution, so you have to use temporary files:
sort -t $'\t' -k2 file1 >idlookup.bykey
sort -t $'\t' -sk1,1 file2 >values.bykey
join -t $'\t' -1 2 -2 1 -o 1.2,2.2,1.1 idlookup.bykey values.bykey
If you need to preserve the value lookup sequence use nl to put line numbers on the front and sort on those at the end.
If your issue is performance then try this perl script:
#!/usr/bin/perl -l
use strict;
use warnings;
my %h;
open my $fh1 , "<", "file1.txt";
open my $fh2 , "<", "file2.txt";
open my $fh3 , ">", "file3.txt";
while (<$fh1>) {
my ($v, $k) = /(\d+)\s+(.*)/;
$h{$k} = $v;
}
while (<$fh2>) {
my ($k, $v) = /(.*)\s+(\d+)$/;
print $fh3 "$k $v $h{$k}" if exists $h{$k};
}
Save the above script in say script.pl and run it as perl script.pl. Make sure the file1.txt and file2.txt are in the same directory as the script.

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
if the data of file 1 is present in file 2 it should return 1 or else 0, in a tab seprated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
the above code is not giving me the output which I am looking for.
Kindly have a look and suggest correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
The following code should do it.
Take a close look to the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file1 file2 | sed $'s/$/\t1/'
grep -vxF -f file1 file2 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read; do
if [[ $REPLY = $'\t'* ]] ; then
printf "%s\t0\n" "${REPLY#?}"
else
printf "%s\t1\n" "${REPLY}"
fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have python installed.
If you're familiar with Python and are interested in the solution, you only need a bit of formatting.
#/bin/python
f1 = open('file1').readlines()
f2 = open('file2').readlines()
f1_in_f2 = [int(x in f2) for x in f1]
for n,c in zip(f1, f1_in_f2):
print n,c

Resources