How to find set difference of two files? - bash

I have two files A and B. I want to find all the lines in A that are not in B. What's the fastest way to do this in bash/using standard linux utilities? Here's what I tried so far:
for line in `cat file1`
do
    if [ `grep -c "^$line$" file2` -eq 0 ]; then
        echo $line
    fi
done
It works, but it's slow. Is there a faster way of doing this?

The BashFAQ describes doing exactly this with comm, which is the canonically correct method.
# Subtraction of file1 from file2
# (i.e., only the lines unique to file2)
comm -13 <(sort file1) <(sort file2)
diff is less appropriate for this task, as it tries to operate on blocks rather than individual lines; as such, the algorithms it has to use are more complex and less memory-efficient.
comm has been part of the Single Unix Specification since SUS2 (1997).
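For reference, a quick demonstration of both directions with the question's A/B naming (file contents are made up for illustration):
$ cat A
apple
banana
cherry
$ cat B
banana
date
$ comm -23 <(sort A) <(sort B)    # lines in A that are not in B
apple
cherry
$ comm -13 <(sort A) <(sort B)    # lines in B that are not in A
date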

If you simply want the lines that are in file A but not in B, you can sort the files and compare them with diff. Note that the "--- A.sorted" header of unified diff output also starts with a -, so filter it out:
sort A > A.sorted
sort B > B.sorted
diff -u A.sorted B.sorted | grep '^-' | grep -v '^---'

The 'diff' program is a standard Unix utility that reports the differences between files.
% cat A
a
b
c
d
% cat B
a
b
e
% diff A B
3,4c3
< c
< d
---
> e
With a simple grep and cut one can select the lines that are in A but not in B. Note that the cut is rather simplistic: it keeps only the second space-delimited field, so lines containing spaces would be truncated... but the concept is there.
% diff A B | grep '^<' | cut -f2 -d" "
c
d
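A slightly more robust variant of the same idea (not in the original answer) strips the marker with sed instead, so lines containing spaces survive intact:
% diff A B | sed -n 's/^< //p'
c
d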

Related

bash - how do you sort within the lines of a text file

Using the linux command sort, how do you sort the lines within a text file?
A normal sort reorders whole lines until the file is sorted, whereas I want to reorder the words within each line.
Example:
Input.txt
z y x v t
c b a
Output.txt
t v x y z
a b c
To sort words within lines using sort alone, you would need to read the file line by line and call sort once for each line. That gets quite tricky, and in any case running one sort process per line wouldn't be very efficient.
You could do better by using Perl (thanks @glenn-jackman for the awesome tip!):
perl -lape '$_ = qq/@{[sort @F]}/' file
If you have gnu awk then it can be done in a single command using asort function:
awk '{for(i=1; i<=NF; i++) c[i]=$i; n=asort(c);
for (i=1; i<=n; i++) printf "%s%s", c[i], (i<n?OFS:RS); delete c}' file
t v x y z
a b c
Here's a fun way that actually uses the linux sort command (plus xargs):
while read line; do xargs -n1 <<< $line | sort | xargs; done < input.txt
Now, this makes several assumptions (which are probably not always true), but the main idea is xargs -n1 takes all the tokens in a line and emits them on separate lines in stdout. This output gets piped through sort and then a final xargs with no arguments puts them all back into a single line.
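For reference, a run against the Input.txt from the question would look roughly like this (with the variable quoted and read -r added for safety):
$ cat input.txt
z y x v t
c b a
$ while read -r line; do xargs -n1 <<< "$line" | sort | xargs; done < input.txt
t v x y z
a b c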
I was looking for a magic switch but found my own solution more intuitive:
$ line="102 103 101 102 101"
$ echo $(echo "${line}"|sed 's/\W\+/\n/g'|sort -un)
101 102 103
Thank you!
It's a little awkward, but this uses only a basic sort command, so it's perhaps a little more portable than something that requires GNU sort:
while read -r -a line; do
    printf "%s " $(sort <<<"$(printf '%s\n' "${line[@]}")")
    echo
done < input.txt
The echo is included to insert a newline, which printf doesn't include by default.

Finding items that are common to all the input files

I have a series of files of the type-
f1.txt f2.txt f3.txt
A B A
B G B
C H C
D I E
E L G
F M J
I want to find out the entries that are common to all three files. In this case the expected output would be B, since that is the only letter that occurs in all three files.
If I had just two files, I could find out the common entries using comm -1 -2 f1.txt f2.txt.
But that doesn't work with multiple files. I thought about something like
sort -u f*.txt > index #to give me the total unique entries
while read i ; do *test if entry is present in all the files* ; done < index
I thought of iteratively doing the comm -12 f1.txt f2.txt | comm -12 - f3.txt but I have 100+ files so that's not practical. Performance does matter.
EDIT
I implemented the following-
sort -u f* > index
while read i
do
    echo -n "$i "
    grep -c "$i" f*.txt > temp
    awk -F ":" '{a+=$2} END {print a}' temp
done < index | sort -rnk2
This gives the output-
B 3
G 2
E 2
C 2
A 2
M 1
L 1
J 1
I 1
H 1
F 1
D 1
From here I can see that the number of files is 3 and the occurrence of B is 3. Hence it occurs in all the files. I'm still looking for a better solution though.
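A sketch of a small refinement along the same lines, printing only the entries whose total count equals the number of files (the -x and -F flags make grep match whole lines literally; filenames assumed to be f*.txt as above):
n=$(ls f*.txt | wc -l)
sort -u f*.txt | while read -r entry; do
    total=$(grep -cxF -- "$entry" f*.txt | awk -F: '{s += $2} END {print s}')
    [ "$total" -eq "$n" ] && echo "$entry"
done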
awk '{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' *.txt
The above assumes each value occurs no more than once in a given file, like in your example. If a value CAN occur multiple times in one file then:
awk '!seen[FILENAME,$0]++{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' *.txt
or with GNU awk for true multi-dimensional arrays and ARGIND:
awk '{cnt[$0][ARGIND]} END{for (i in cnt) if (length(cnt[i])==ARGIND) print i}' *.txt
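For reference, running the first of these against the three sample files (assuming they are the only files matching f*.txt) gives:
$ awk '{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' f*.txt
B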
Using python
This Python script will find the lines common to a large number of files:
#!/usr/bin/python
from glob import glob

fnames = glob('f*.txt')
with open(fnames[0]) as f:
    lines = set(f.readlines())
for fname in fnames[1:]:
    with open(fname) as f:
        lines = lines.intersection(f.readlines())
print(''.join(lines))
Sample run:
$ python script.py
B
How it works:
fnames = glob('f*.txt')
This collects the names of files of interest.
with open(fnames[0]) as f:
    lines = set(f.readlines())
This reads the first file and creates a set from its lines. This set is called lines.
for fname in fnames[1:]:
    with open(fname) as f:
        lines = lines.intersection(f.readlines())
For each subsequent file, this takes the intersection of lines with the lines of this file.
print(''.join(lines))
This prints out the resulting set of common lines.
Using grep and shell
Try:
$ grep -Ff f1.txt f2.txt | grep -Ff f3.txt
B
This works in two steps:
grep -Ff f1.txt f2.txt selects those lines from f2.txt that also occur in f1.txt. In other words, the output from this command consists of lines that f1.txt and f2.txt have in common.
grep -Ff f3.txt selects from its input all lines that are also in f3.txt.
Notes:
The -F option tells grep to treat its input as fixed strings, not regular expressions.
The -f option tells grep to get the patterns it is looking for from the file whose name follows.
As written, these greps match substrings: a line of the target file is selected if any pattern line occurs anywhere within it. Add the -x option (grep -Fxf ...) if you want only complete-line matches. Either way, leading or trailing white space in the pattern lines is significant. For the 100-plus-file case, see the sketch below.
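Since the question mentions 100+ files, the same grep idea can be folded over all of them in a loop (a sketch; -x restricts matching to complete lines, and common.txt / tmp.txt are just scratch names):
cp f1.txt common.txt
for f in f*.txt; do
    grep -Fxf "$f" common.txt > tmp.txt
    mv tmp.txt common.txt
done
cat common.txt    # prints B for the sample data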
Use join:
$ join f1.txt <(join f2.txt f3.txt)
B
join expects the files to be sorted, though. This seems to work too:
$ join <(sort f1.txt) <(join <(sort f2.txt) <(sort f3.txt))
B
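Likewise, for many files the join can be folded over the whole list (a sketch; assumes one entry per line and uses common.txt as a scratch name):
sort f1.txt > common.txt
for f in f2.txt f3.txt; do    # or every remaining f*.txt
    join common.txt <(sort "$f") > tmp.txt
    mv tmp.txt common.txt
done
cat common.txt    # prints B for the sample data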
Note that Ed's answer is considerably faster than my suggestion, but I'll leave it for posterity :-)
I used GNU Parallel to apply comm to the files in pairs in parallel (so it should be fast) and do that repeatedly, passing the output of each iteration as the input to the next.
It converges when there is only one file left to process. If there are an odd number of files at any stage, the odd file is promoted forward to the next round and processed later.
#!/bin/bash
shopt -s nullglob

# Get list of files
files=(f*.txt)
iter=0
while : ; do
    # Get number of files
    n=${#files[@]}
    echo DEBUG: Iter: $iter, Files: $n

    # If only one file left, we have converged, cat it and exit
    [ $n -eq 1 ] && { cat "${files[0]}"; break; }

    # Check if odd number of files, and promote and delete one if odd
    if (( n % 2 )); then
        mv "${files[0]}" s-$iter-odd
        files=( "${files[@]:1}" )
    fi

    parallel -n2 comm -1 -2 {1} {2} \> s-$iter-{#} ::: "${files[@]}"
    files=(s-$iter-*)
    (( iter=iter+1 ))
done
Sample Output
DEBUG: Iter: 0, Files: 110
DEBUG: Iter: 1, Files: 55
DEBUG: Iter: 2, Files: 28
DEBUG: Iter: 3, Files: 14
DEBUG: Iter: 4, Files: 7
DEBUG: Iter: 5, Files: 4
DEBUG: Iter: 6, Files: 2
DEBUG: Iter: 7, Files: 1
Basically, s-0-* is the output of the first pass, s-1-* is the output of the second pass...
If you would like to see the commands parallel would run, without it actually running any of them, use:
parallel --dry-run ...
If (but only if) all of your files have unique entries this should work too:
sort f*.txt | uniq -c \
| grep "^\s*$(ls f*.txt | wc -w)\s" \
| while read n content; do echo $content; done

How I can keep only the non repeated lines in a file?

What I want to do is simply keep the lines which are not repeated in a huge file like this:
..
a
b
b
c
d
d
..
The desired output is then:
..
a
c
..
Many thanks in advance.
uniq has the -u option for exactly this:
-u, --unique only print unique lines
Example:
$ printf 'a\nb\nb\nc\nd\nd\n' | uniq -u
a
c
If your data is not sorted, sort it first:
$ printf 'd\na\nb\nb\nc\nd\n' | sort | uniq -u
Preserve the order:
$ cat foo
d
c
b
b
a
d
$ grep -f <(sort foo | uniq -u) foo
c
a
This greps the file for the patterns obtained by the aforementioned uniq -u. I can imagine, though, that if your file is really huge it could take a long time.
The same without the somewhat ugly process substitution:
$ sort foo | uniq -u | grep -f- foo
c
a
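If the lines might contain regex metacharacters, or one line could be a substring of another, adding -F and -x keeps the grep match literal and whole-line (a sketch of the same approach):
$ grep -Fx -f <(sort foo | uniq -u) foo
c
a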
This awk should work to list only lines that are not repeated in file:
awk 'seen[$0]++{dup[$0]} END {for (i in seen) if (!(i in dup)) print i}' file
a
c
Just remember that the original order of the lines may change, due to the hashing of arrays in awk.
EDIT: To preserve the original order:
awk '$0 in seen{dup[$0]; next}
{seen[$0]++; a[++n]=$0}
END {for (i=1; i<=n; i++) if (!(a[i] in dup)) print a[i]}' file
a
c
This is a job that is tailor-made for awk: it doesn't require multiple processes, pipes, or process substitution, and it will be more efficient for bigger files.
When your file is sorted, it's simple (note that plain uniq would keep one copy of each duplicate; -u drops the repeated lines entirely, which is what is asked for here):
uniq -u file.txt > file2.txt
mv file2.txt file.txt

Simple diff/patch script for sorted unique file

How could I write a simple diff and patch script for applying additions and deletions to a list of lines in a file?
This could be a original file (it is sorted and each line is unique):
a
b
d
a simple patch file could look like this (or something similarly simple):
+ c
+ e
- b
The resulting file should look like this (or in any other order, since sort could be applied anyway):
a
c
d
e
The normal patch formats cannot be used, since they include context lines, which might have changed in this case.
Bash alternatives that read input files only once:
To generate patch you can:
comm -3 a.txt b.txt | sed 's/^\t/+ /;t;s/^/- /'
Because comm separates the output columns of the two files with a tab, we can use that tab to detect whether a line should be added or removed.
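Using the question's original and resulting lists as a.txt and b.txt, the generated patch looks like this (the ordering differs from the example patch, but the content is the same):
$ cat a.txt
a
b
d
$ cat b.txt
a
c
d
e
$ comm -3 a.txt b.txt | sed 's/^\t/+ /;t;s/^/- /'
- b
+ c
+ e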
To apply patch you can:
{ <patch.txt tee >(grep '^+ ' | cut -c3- >&5) |
grep '^- ' | cut -c3- | comm -13 - a.txt; } 5> >(cat)
The tee splits the input (the patch file) into two streams. The first branch keeps the + lines, strips the marker, and writes them to file descriptor 5; fd 5 is redirected to >(cat), so those lines simply appear on stdout. The second branch keeps the - lines, strips the marker, and comm -13 removes those lines from a.txt, printing the rest. Because the output should be line buffered, it should work.
A shell solution using comm, awk, and grep to apply such a patch would be:
A=a.txt B=b.txt P=patch.txt; { grep '^-' $P | cut -c 3- | comm -23 $A - ; grep '^+' $P | cut -c 3- ; } | sort -u > $B
and to generate the patch file:
A=a.txt B=b.txt P=patch.txt; { comm -13 $A $B | awk '{print "+ " $0}' ; comm -23 $A $B | awk '{print "- " $0}' ; } > $P
Since nobody could give me an answer, I've created a small python script, which does exactly this job. https://github.com/white-gecko/simplepatch
To apply such a patch call it with (where outfile.txt is generated)
./simplepatch.py -m patch -i infile.txt -p patchfile.txt -o outfile.txt
To generate a patch/diff call it with (where patchfile.txt is generated)
./simplepatch.py -m diff -i infile.txt -o outfile.txt -p patchfile.txt

Bash set subtraction

How to subtract a set from another in Bash?
This is similar to: Is there a "set" data structure in bash?, but different, as it asks how to perform the subtraction, with code:
set1: N lines as output by a filter
set2: M lines as output by a filter
how to get:
set3: with all lines in N which don't appear in M
comm -23 <(command_which_generate_N|sort) <(command_which_generate_M|sort)
comm without options displays 3 columns of output: 1: lines only in the first file, 2: lines only in the second file, 3: lines in both files. -23 removes the second and third columns.
$ cat > file1.list
A
B
C
$ cat > file2.list
A
C
D
$ comm file1.list file2.list
		A
B
		C
	D
$ comm -12 file1.list file2.list # In both
A
C
$ comm -23 file1.list file2.list # Only in set 1
B
$ comm -13 file1.list file2.list # Only in set 2
D
Input files must be sorted.
GNU sort and comm depend on the locale; for example, the output order may differ (but the content will be the same). To get a consistent byte-wise ordering, force the C locale:
(export LC_ALL=C; comm -23 <(command_which_generate_N|sort) <(command_which_generate_M|sort))
uniq -u (manpage) is often the simplest tool for list subtraction:
Usage
uniq [OPTION]... [INPUT [OUTPUT]]
[...]
-u, --unique
only print unique lines
Example: list files found in directory a but not in b
$ ls a
file1 file2 file3
$ ls b
file1 file3
$ echo "$(ls a ; ls b)" | sort | uniq -u
file2
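Note that uniq -u actually prints the symmetric difference, i.e. names that appear in exactly one of the two listings. If you want strictly the files in a that are missing from b, the comm approach shown above does that directly:
$ comm -23 <(ls a) <(ls b)
file2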
I've got a dead-simple 1-liner:
$ now=(ConfigQC DBScripts DRE DataUpload WFAdaptors.log)
$ later=(ConfigQC DBScripts DRE DataUpload WFAdaptors.log baz foo)
$ printf "%s\n" ${now[@]} ${later[@]} | sort | uniq -c | grep -vE '[ ]+2.*' | awk '{print $2}'
baz
foo
By definition, two sets intersect if they have elements in common. Here there are two sets, so any count of 2 marks an element of the intersection; simply "subtract" those elements away with grep.
I wrote a program recently called Setdown that does Set operations (like set difference) from the cli.
It can perform set operations by writing a definition similar to what you would write in a Makefile:
someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection
It's pretty cool and worth checking out. I personally don't recommend the "set operations in unix shell" post; that approach doesn't work well when you really need to do many set operations, or when some set operations depend on each other.
You can use diff
# you should sort the output
ls > t1
cp t1 t2
I used vi to remove some entries from t2
$ cat t1
AEDWIP.writeMappings.sam
createTmpFile.sh*
find.out
grepMappingRate.sh*
salmonUnmapped.sh*
selectUnmappedReadsFromFastq.sh*
$ cat t2
AEDWIP.writeMappings.sam
createTmpFile.sh*
salmonUnmapped.sh*
selectUnmappedReadsFromFastq.sh*
diff reports the lines that are in t1 but not in t2:
$ diff t1 t2
3,4d2
< find.out
< grepMappingRate.sh*
Putting it all together:
diff t1 t2 | grep "^<" | cut -d " " -f 2
find.out
grepMappingRate.sh*
