Bash set subtraction - bash

How to subtract a set from another in Bash?
This is similar to: Is there a "set" data structure in bash? but different as it asks how to perform the subtraction, with code
set1: N lines as output by a filter
set2: M lines as output by a filter
how to get:
set3: with all lines in N which don't appear in M

comm -23 <(command_which_generate_N|sort) <(command_which_generate_M|sort)
comm without options displays 3 columns of output: 1: lines only in the first file, 2: lines only in the second file, 3: lines in both files. -23 suppresses the second and third columns.
$ cat > file1.list
A
B
C
$ cat > file2.list
A
C
D
$ comm file1.list file2.list
                A
B
                C
        D
$ comm -12 file1.list file2.list # In both
A
C
$ comm -23 file1.list file2.list # Only in set 1
B
$ comm -13 file1.list file2.list # Only in set 2
D
Input files must be sorted.
GNU sort and comm depend on the locale; for example, the output order may differ (but the content will be the same). To force a consistent ordering, run both under LC_ALL=C:
(export LC_ALL=C; comm -23 <(command_which_generate_N|sort) <(command_which_generate_M|sort))

uniq -u (manpage) is often the simplest tool for list subtraction:
Usage
uniq [OPTION]... [INPUT [OUTPUT]]
[...]
-u, --unique
only print unique lines
Example: list files found in directory a but not in b
$ ls a
file1 file2 file3
$ ls b
file1 file3
$ echo "$(ls a ; ls b)" | sort | uniq -u
file2
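Note that this actually prints the symmetric difference (lines unique to either listing); it equals "in a but not in b" only when b has nothing that a lacks. One way to get a strict subtraction with the same tool is to list b twice, so anything unique to b appears twice and is dropped by uniq -u (a sketch on the same directories):
$ { ls a ; ls b ; ls b ; } | sort | uniq -u
file2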

I've got a dead-simple 1-liner:
$ now=(ConfigQC DBScripts DRE DataUpload WFAdaptors.log)
$ later=(ConfigQC DBScripts DRE DataUpload WFAdaptors.log baz foo)
$ printf "%s\n" "${now[@]}" "${later[@]}" | sort | uniq -c | grep -vE '[ ]+2.*' | awk '{print $2}'
baz
foo
By definition, 2 sets intersect if they have elements in common. Here any line with a count of 2 appears in both arrays, i.e. in the intersection, so grepping those counts out leaves only the lines unique to one array - effectively "subtracting" the intersection.
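If the data is already in arrays like this, a comm-based equivalent (just a sketch using the same now/later arrays) avoids the counting step:
$ comm -13 <(printf "%s\n" "${now[@]}" | sort) <(printf "%s\n" "${later[@]}" | sort)
baz
foo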

I wrote a program recently called Setdown that does Set operations (like set difference) from the cli.
It can perform set operations by writing a definition similar to what you would write in a Makefile:
someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection
It's pretty cool and you should check it out. I personally don't recommend the "set operations in unix shell" post; it won't work well when you really need to do many set operations or when some set operations depend on each other.

You can use diff
# you should sort the output
ls > t1
cp t1 t2
I used vi to remove some entries from t2
$ cat t1
AEDWIP.writeMappings.sam
createTmpFile.sh*
find.out
grepMappingRate.sh*
salmonUnmapped.sh*
selectUnmappedReadsFromFastq.sh*
$ cat t2
AEDWIP.writeMappings.sam
createTmpFile.sh*
salmonUnmapped.sh*
selectUnmappedReadsFromFastq.sh*
diff reports lines in t1 that are not in t2:
$ diff t1 t2
3,4d2
< find.out
< grepMappingRate.sh*
Putting it all together:
diff t1 t2 | grep "^<" | cut -d " " -f 2
find.out
grepMappingRate.sh*

Related

unix utility to compare lists and perform a set operation

I believe what I'm asking for is a sort of set operation. I need help trying to create a list of the following:
List1 contains:
1
2
3
A
B
C
List2 contains:
1
2
3
4
5
A
B
C
D
E
The final list I need would be these (4) items:
4
5
D
E
So obviously List2 contains more elements than List1.
The final list I need is the elements in List2 that are NOT in List1.
Which Linux utility can I use to accomplish this? I have looked at sort and comm, but I'm unsure how to do this correctly. Thanks for the help.
Using awk with straightforward logic:
awk 'FNR==NR{a[$0]; next}!($0 in a)' file1 file2
4
5
D
E
Using the GNU comm utility, where, according to the comm man page,
comm -3 file1 file2
Print lines in file1 not in file2, and vice versa.
Using it for your example
comm -3 file2 file1
4
5
D
E
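Note that comm needs sorted input, and -3 can still print two columns (the second one tab-indented) when both files have unique lines; a sketch that prints only the List2-only lines in all cases:
comm -13 <(sort file1) <(sort file2)
4
5
D
E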
You can do it with a simple grep command inverting the match with -v and reading the search terms from list1 with -f, e.g. grep -v -f list1 list2. Example use:
$ grep -v -f list1 list2
4
5
D
E
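Since -f alone treats each list1 line as a regular expression and matches substrings, a slightly safer sketch adds -F (fixed strings) and -x (whole-line matches):
$ grep -vxF -f list1 list2
4
5
D
E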
Linux provides a number of different ways to skin this cat.
You can try this:
$ diff list1.txt list2.txt | egrep '>|<' | awk '{ print $2 }' | sort -u
4
5
D
E
I hope this helps.

Finding items that are common to all the input files

I have a series of files of the type-
f1.txt f2.txt f3.txt
A B A
B G B
C H C
D I E
E L G
F M J
I want to find out the entries that are common to all three files. In this case the expected output would be B, since that is the only letter that occurs in all three files.
If I had just two files, I could find out the common entries using comm -1 -2 f1.txt f2.txt.
But that doesn't work with multiple files. I thought about something like
sort -u f*.txt > index #to give me the total unique entries
while read i ; do *test if entry is present in all the files* ; done < index
I thought of iteratively doing the comm -12 f1.txt f2.txt | comm -12 - f3.txt but I have 100+ files so that's not practical. Performance does matter.
EDIT
I implemented the following-
sort -u f* > index
while read i
do
    echo -n "$i "
    grep -c "$i" f*.txt > temp
    awk -F ":" '{a+=$2} END {print a}' temp
done < index | sort -rnk2
This gives the output-
B 3
G 2
E 2
C 2
A 2
M 1
L 1
J 1
I 1
H 1
F 1
D 1
From here I can see that the number of files is 3 and the occurrence of B is 3. Hence it occurs in all the files. I'm still looking for a better solution though.
awk '{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' *.txt
The above assumes each value occurs no more than once in a given file, like in your example. If a value CAN occur multiple times in one file then:
awk '!seen[FILENAME,$0]++{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' *.txt
or with GNU awk for true multi-dimensional arrays and ARGIND:
awk '{cnt[$0][ARGIND]} END{for (i in cnt) if (length(cnt[i])==ARGIND) print i}' *.txt
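For the three sample files shown in the question, the first command above should print just the common entry:
$ awk '{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' f1.txt f2.txt f3.txt
B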
Using python
This python script will find the common lines among a large number of files:
#!/usr/bin/python
from glob import glob

fnames = glob('f*.txt')
with open(fnames[0]) as f:
    lines = set(f.readlines())
for fname in fnames[1:]:
    with open(fname) as f:
        lines = lines.intersection(f.readlines())
print(''.join(lines))
Sample run:
$ python script.py
B
How it works:
fnames = glob('f*.txt')
This collects the names of files of interest.
with open(fnames[0]) as f:
    lines = set(f.readlines())
This reads the first file and creates a set from its lines. This set is called lines.
for fname in fnames[1:]:
    with open(fname) as f:
        lines = lines.intersection(f.readlines())
For each subsequent file, this takes the intersection of lines with the lines of this file.
print(''.join(lines))
This prints out the resulting set of common lines.
Using grep and shell
Try:
$ grep -Ff f1.txt f2.txt | grep -Ff f3.txt
B
This works in two steps:
grep -Ff f1.txt f2.txt selects those lines from f2.txt that also occur in f1.txt. In other words, the output from this command consists of lines that f1.txt and f2.txt have in common.
grep -Ff f3.txt selects from its input all lines that are also in f3.txt.
Notes:
The -F option tells grep to treat its input as fixed strings, not regular expressions.
The -f option tells grep to get the patterns it is looking for from the file whose name follows.
Note that without -x the commands above match substrings rather than only complete lines, and leading or trailing white space in the patterns is significant.
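A sketch of the same pipeline restricted to whole-line matches with -x:
$ grep -Fxf f1.txt f2.txt | grep -Fxf f3.txt
B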
Use join:
$ join f1.txt <(join f2.txt f3.txt)
B
join expects the files to be sorted, though. This seems to work too:
$ join <(sort f1.txt) <(join <(sort f2.txt) <(sort f3.txt))
B
Note that Ed's answer is considerably faster than my suggestion, but I'll leave it for posterity :-)
I used GNU Parallel to apply comm to the files in pairs in parallel (so it should be fast) and do that repeatedly, passing the output of each iteration as the input to the next.
It converges when there is only one file left to process. If there are an odd number of files at any stage, the odd file is promoted forward to the next round and processed later.
#!/bin/bash
shopt -s nullglob

# Get list of files
files=(f*.txt)

iter=0
while : ; do
    # Get number of files
    n=${#files[@]}
    echo DEBUG: Iter: $iter, Files: $n

    # If only one file left, we have converged, cat it and exit
    [ $n -eq 1 ] && { cat "${files[0]}"; break; }

    # Check if odd number of files, and promote and delete one if odd
    if (( n % 2 )); then
        mv "${files[0]}" s-$iter-odd
        files=( "${files[@]:1}" )
    fi

    parallel -n2 comm -1 -2 {1} {2} \> s-$iter-{#} ::: "${files[@]}"

    files=(s-$iter-*)
    (( iter=iter+1 ))
done
Sample Output
DEBUG: Iter: 0, Files: 110
DEBUG: Iter: 1, Files: 55
DEBUG: Iter: 2, Files: 28
DEBUG: Iter: 3, Files: 14
DEBUG: Iter: 4, Files: 7
DEBUG: Iter: 5, Files: 4
DEBUG: Iter: 6, Files: 2
DEBUG: Iter: 7, Files: 1
Basically, s-0-* is the output of the first pass, s-1-* is the output of the second pass...
If you would like to see the commands parallel would run, without it actually running any of them, use:
parallel --dry-run ...
If (but only if) all of your files have unique entries this should work too:
sort f*.txt | uniq -c \
| grep "^\s*$(ls f*.txt | wc -w)\s" \
| while read n content; do echo $content; done

How can I keep only the non-repeated lines in a file?

What I want to do is simply keep the lines which are not repeated in a huge file like this:
..
a
b
b
c
d
d
..
The desired output is then:
..
a
c
..
Many thanks in advance.
uniq has the -u option:
-u, --unique only print unique lines
Example:
$ printf 'a\nb\nb\nc\nd\nd\n' | uniq -u
a
c
If your data is not sorted, sort it first:
$ printf 'd\na\nb\nb\nc\nd\n' | sort | uniq -u
Preserve the order:
$ cat foo
d
c
b
b
a
d
$ grep -f <(sort foo | uniq -u) foo
c
a
This greps the file for the patterns obtained by the aforementioned uniq. I can imagine, though, that if your file is really huge it will take a long time.
The same without the somewhat ugly process substitution:
$ sort foo | uniq -u | grep -f- foo
c
a
This awk should work to list only the lines that are not repeated in the file:
awk 'seen[$0]++{dup[$0]} END {for (i in seen) if (!(i in dup)) print i}' file
a
c
Just remember that original order of lines may change due to hashing of arrays in awk.
EDIT: To preserve the original order:
awk '$0 in seen{dup[$0]; next}
{seen[$0]++; a[++n]=$0}
END {for (i=1; i<=n; i++) if (!(a[i] in dup)) print a[i]}' file
a
c
This is a job that is tailor-made for awk: it doesn't require multiple processes, pipes, or process substitution, and it will be more efficient for bigger files.
When your file is already sorted, it's simple:
uniq -u file.txt > file2.txt
mv file2.txt file.txt

How to find set difference of two files?

I have two files A and B. I want to find all the lines in A that are not in B. What's the fastest way to do this in bash/using standard linux utilities? Here's what I tried so far:
for line in `cat file1`
do
    if [ `grep -c "^$line$" file2` -eq 0 ]; then
        echo $line
    fi
done
It works, but it's slow. Is there a faster way of doing this?
The BashFAQ describes doing exactly this with comm, which is the canonically correct method.
# Subtraction of file1 from file2
# (i.e., only the lines unique to file2)
comm -13 <(sort file1) <(sort file2)
diff is less appropriate for this task, as it tries to operate on blocks rather than individual lines; as such, the algorithms it has to use are more complex and less memory-efficient.
comm has been part of the Single Unix Specification since SUS2 (1997).
If you simply want lines that are in file A, but not in B, you can sort the files, and compare them with diff.
sort A > A.sorted
sort B > B.sorted
diff -u A.sorted B.sorted | grep '^-' | grep -v '^---'
The 'diff' program is a standard unix program that looks at the differences between files.
% cat A
a
b
c
d
% cat B
a
b
e
% diff A B
3,4c3
< c
< d
---
> e
With a simple grep and cut, one can select the lines in A that are not in B. Note that the cut is rather simplistic and spaces in the lines would throw it off... but the concept is there.
% diff A B | grep '^<' | cut -f2 -d" "
c
d
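A variant that sidesteps the space problem by stripping the diff marker with sed instead of cut (a sketch):
% diff A B | sed -n 's/^< //p'
c
d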

Deleting lines from one file which are in another file

I have a file f1:
line1
line2
line3
line4
..
..
I want to delete all the lines which are in another file f2:
line2
line8
..
..
I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?
grep -v -x -f f2 f1 should do the trick.
Explanation:
-v to select non-matching lines
-x to match whole lines only
-f f2 to get patterns from f2
One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).
Try comm instead (assuming f1 and f2 are "already sorted")
comm -2 -3 f1 f2
If the exclude file isn't too huge, you can use AWK's associative arrays.
awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt
The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.
The algorithmic complexity will probably be O(n) (exclude-these.txt size) + O(n) (from-this.txt size)
Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):
awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt
Accessing r[$0] creates the entry for that line, no need to set a value.
Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
if you have Ruby (1.9+)
#!/usr/bin/env ruby
b=File.read("file2").split
open("file1").each do |x|
x.chomp!
puts x if !b.include?(x)
end
Which has O(N^2) complexity. If you want to care about performance, here's another version
b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}
which uses a hash to effect the subtraction, so is complexity O(n) (size of a) + O(n) (size of b)
Here's a little benchmark of the above, courtesy of user576875, but with 100K lines:
$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test
real 0m0.639s
user 0m0.554s
sys 0m0.021s
$ time sort file1 file2 | uniq -u > sort.test
real 0m2.311s
user 0m1.959s
sys 0m0.040s
$ diff <(sort -n ruby.test) <(sort -n sort.test)
$
diff was used to show there are no differences between the 2 files generated.
Some timing comparisons between various other answers:
$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null
real 0m0.019s
user 0m0.023s
sys 0m0.012s
$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null
real 0m0.026s
user 0m0.018s
sys 0m0.007s
$ time grep -xvf f2 f1 > /dev/null
real 0m43.197s
user 0m43.155s
sys 0m0.040s
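Much of that grep time goes to treating every f2 line as a regular expression; adding -F so the patterns are matched as fixed strings is usually dramatically faster (a suggestion, not re-benchmarked here):
$ time grep -Fxvf f2 f1 > /dev/null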
sort f1 f2 | uniq -u isn't even a symmetrical difference, because it removes lines that appear multiple times in either file.
comm can also be used with stdin and here strings:
echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a
Seems to be a job suitable for the SQLite shell:
create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify ".separator ××any_improbable_string××"
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file2 where line not in (select line from file1);
.q
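Assuming the statements above are saved in a file named, say, diff.sql (a hypothetical name), they can be fed to the sqlite3 shell non-interactively:
sqlite3 temp.db < diff.sql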
Did you try this with sed?
# turn every f2 line into the start of a "sed -i 's%<line>" command
sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh
# close each command with "%%g' f1" (replace the matched text with nothing in f1)
sed -i 's#$#%%g'"'"' f1#g' f2.sh
# add a shebang and run the generated script
sed -i '1i#!/bin/bash' f2.sh
sh f2.sh
Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.
Obviously won't work for huge files but it did the trick for me. A few notes:
I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data
A Python way of filtering one list using another list.
Load files:
>>> f1 = open('f1').readlines()
>>> f2 = open('f2.txt').readlines()
Remove '\n' string at the end of each line:
>>> f1 = [i.replace('\n', '') for i in f1]
>>> f2 = [i.replace('\n', '') for i in f2]
Print only the f1 lines that do not contain any of the f2 lines:
>>> [a for a in f1 if all(b not in a for b in f2)]
$ cat values.txt
apple
banana
car
taxi
$ cat source.txt
fruits
mango
king
queen
number
23
43
sentence is long
so what
...
...
I made a small shell script to "weed out" the values in the source file which are present in the values.txt file.
$ cat weed_out.sh
from=$1
cp -p $from $from.final

for x in `cat values.txt`;
do
    grep -v $x $from.final > $from.final.tmp
    mv $from.final.tmp $from.final
done
executing...
$ ./weed_out source.txt
and you get a nicely cleaned up file....
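The same weeding can usually be done in a single pass with grep, reading all of the values as fixed-string patterns (a sketch that keeps the script's substring-matching behaviour):
$ grep -vFf values.txt source.txt > source.txt.final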
