Finding items that are common to all the input files - bash

I have a series of files of this type:
f1.txt f2.txt f3.txt
A B A
B G B
C H C
D I E
E L G
F M J
I want to find out the entries that are common to all three files. In this case the expected output would be B, since that is the only letter that occurs in all three files.
If I had just two files, I could find out the common entries using comm -1 -2 f1.txt f2.txt.
But that doesn't work with multiple files. I thought about something like
sort -u f*.txt > index #to give me the total unique entries
while read i ; do *test if entry is present in all the files* ; done < index
I thought of iteratively doing comm -12 f1.txt f2.txt | comm -12 - f3.txt, but I have 100+ files, so that's not practical. Performance does matter.
EDIT
I implemented the following-
sort -u f* > index
while read i
do
    echo -n "$i "
    grep -c "$i" f*.txt > temp
    awk -F ":" '{a+=$2} END {print a}' temp
done < index | sort -rnk2
This gives the output-
B 3
G 2
E 2
C 2
A 2
M 1
L 1
J 1
I 1
H 1
F 1
D 1
From here I can see that the number of files is 3 and the count for B is 3, so B occurs in all the files. I'm still looking for a better solution though.
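One small refinement of this approach (my suggestion, not part of the original post): grep -c "$i" counts lines that merely contain $i and treats it as a regex, so B would also be counted inside a line like AB. Adding -F (fixed string) and -x (whole-line match) avoids that:
sort -u f*.txt > index
while read -r i
do
    # sum the per-file counts of whole-line, fixed-string matches
    echo "$i $(grep -Fxc -- "$i" f*.txt | awk -F ':' '{s+=$2} END {print s}')"
done < index | sort -rnk2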

awk '{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' *.txt
The above assumes each value occurs no more than once in a given file, like in your example. If a value CAN occur multiple times in one file then:
awk '!seen[FILENAME,$0]++{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' *.txt
or with GNU awk for true multi-dimensional arrays and ARGIND:
awk '{cnt[$0][ARGIND]} END{for (i in cnt) if (length(cnt[i])==ARGIND) print i}' *.txt
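A quick check against the three sample files from the question (the expected output, per the question, is just B):
$ awk '{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' f1.txt f2.txt f3.txt
B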

Using python
This Python script will find the common lines among a large number of files:
#!/usr/bin/python
from glob import glob
fnames = glob('f*.txt')
with open(fnames[0]) as f:
    lines = set(f.readlines())
for fname in fnames[1:]:
    with open(fname) as f:
        lines = lines.intersection(f.readlines())
print(''.join(lines))
Sample run:
$ python script.py
B
How it works:
fnames = glob('f*.txt')
This collects the names of files of interest.
with open(fnames[0]) as f:
    lines = set(f.readlines())
This reads the first file and creates a set from its lines. This set is called lines.
for fname in fnames[1:]:
    with open(fname) as f:
        lines = lines.intersection(f.readlines())
For each subsequent file, this takes the intersection of lines with the lines of this file.
print(''.join(lines))
This prints out the resulting set of common lines.
Using grep and shell
Try:
$ grep -Ff f1.txt f2.txt | grep -Ff f3.txt
B
This works in two steps:
grep -Ff f1.txt f2.txt selects those lines from f2.txt that also occur in f1.txt. In other words, the output from this command consists of lines that f1.txt and f2.txt have in common.
grep -Ff f3.txt selects from its input all lines that are also in f3.txt.
Notes:
The -F option tells grep to treat its input as fixed strings, not regular expressions.
The -f option tells grep to get the patterns it is looking for from the file whose name follows.
The command above does substring matching: a line is selected if it contains one of the patterns anywhere within it. To require whole-line matches, add the -x option; with -x, leading or trailing white space becomes significant.
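For the 100+ files in the question, one way to chain this without writing out every pipeline stage by hand (a sketch of mine, assuming the f*.txt naming; -x is added so only whole lines are selected):
cp f1.txt common.tmp
for f in f2.txt f3.txt; do                 # extend the list, or loop over f*.txt and skip f1.txt
    grep -Fxf common.tmp "$f" > next.tmp
    mv next.tmp common.tmp
done
sort -u common.tmp                         # the lines common to every file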

Use join:
$ join f1.txt <(join f2.txt f3.txt)
B
join expects the files to be sorted, though. This seems to work too:
$ join <(sort f1.txt) <(join <(sort f2.txt) <(sort f3.txt))
B
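The same idea can be folded over many files with a loop (my sketch, not from the answer; it assumes the f*.txt naming from the question and sorts each input first):
sort f1.txt > common.tmp
for f in f2.txt f3.txt; do                 # extend the list, or loop over f*.txt and skip f1.txt
    join common.tmp <(sort "$f") > next.tmp
    mv next.tmp common.tmp
done
cat common.tmp                             # lines common to every file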

Note that Ed's answer is considerably faster than my suggestion, but I'll leave it for posterity :-)
I used GNU Parallel to apply comm to the files in pairs in parallel (so it should be fast) and do that repeatedly, passing the output of each iteration as the input to the next.
It converges when there is only one file left to process. If there are an odd number of files at any stage, the odd file is promoted forward to the next round and processed later.
#!/bin/bash
shopt -s nullglob
# Get list of files
files=(f*.txt)
iter=0
while : ; do
    # Get number of files
    n=${#files[@]}
    echo DEBUG: Iter: $iter, Files: $n
    # If only one file left, we have converged, cat it and exit
    [ $n -eq 1 ] && { cat "${files[0]}"; break; }
    # Check if odd number of files, and promote and delete one if odd
    if (( n % 2 )); then
        mv "${files[0]}" s-$iter-odd
        files=( "${files[@]:1}" )
    fi
    parallel -n2 comm -1 -2 {1} {2} \> s-$iter-{#} ::: "${files[@]}"
    files=(s-$iter-*)
    (( iter=iter+1 ))
done
Sample Output
DEBUG: Iter: 0, Files: 110
DEBUG: Iter: 1, Files: 55
DEBUG: Iter: 2, Files: 28
DEBUG: Iter: 3, Files: 14
DEBUG: Iter: 4, Files: 7
DEBUG: Iter: 5, Files: 4
DEBUG: Iter: 6, Files: 2
DEBUG: Iter: 7, Files: 1
Basically, s-0-* is the output of the first pass, s-1-* is the output of the second pass...
If you would like to see the commands parallel would run, without it actually running any of them, use:
parallel --dry-run ...

If (but only if) all of your files have unique entries this should work too:
sort f*.txt | uniq -c \
| grep "^\s*$(ls f*.txt | wc -w)\s" \
| while read n content; do echo $content; done
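A variant of the same idea (my sketch, with the same "unique entries per file" caveat) that lets awk compare the counts instead of grepping for them:
n=$(ls f*.txt | wc -l)
sort f*.txt | uniq -c | awk -v n="$n" '$1 == n { print $2 }'   # prints B for the sample files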

Related

cat multiple files into one using same amount of rows as file B from A B C

This is a strange question; I have been looking around and I wasn't able to find anything matching what I wish to do.
What I'm trying to do is;
File A, File B, File C
5 Lines, 3 Lines, 2 Lines.
Join all files into one file, matching the number of lines of file B.
The output should be
File A, File B, File C
3 Lines, 3 Lines, 3 Lines.
So in file A I have to remove two lines, and in file C I have to duplicate 1 line, so that I match the same number of lines as file B.
I was thinking of first doing a count to see how many lines each file has:
count1=`wc -l FileA| awk '{print $1}'`
count2=`wc -l FileB| awk '{print $1}'`
count3=`wc -l FileC| awk '{print $1}'`
Then, if a count is greater than file B's, remove lines; otherwise add lines.
But I have got lost, as I'm not sure how to continue with this; I have never seen anyone trying to do this.
Can anyone point me to an idea?
thanks.
Could you please try the following. I have used # as a separator; you could change it as per your need.
paste -d'#' file1 file2 file3 |
awk -v file2_lines="$(wc -l < file2)" '
BEGIN{
  FS=OFS="#"
}
FNR<=file2_lines{
  $1=$1?$1:prev_first
  $3=$3?$3:prev_third
  print
  prev_first=$1
  prev_third=$3
}'
Example of running the above code:
Let's say the following are the Input_file(s):
cat file1
File1_line1
File1_line2
File1_line3
File1_line4
File1_line5
cat file2
File2_line1
File2_line2
File2_line3
cat file3
File3_line1
File3_line2
When I run the above code in the form of a script, the following is the output:
./script.ksh
File1_line1#File2_line1#File3_line1
File1_line2#File2_line2#File3_line2
File1_line3#File2_line3#File3_line2
You can get the first n lines of a file with the head command (or with sed).
You can generate new lines with echo.
Putting that together in a small bash function:
#!/bin/bash
fix_numlines() {
  local filename=$1
  local wantlines=$2
  local havelines=$(grep -c . "${filename}")
  head -${wantlines} "${filename}"
  if [ $havelines -lt $wantlines ]; then
    for i in $(seq $((wantlines-havelines))); do echo; done
  fi
}
lines=$(grep -c . fileB)
fix_numlines fileA ${lines}
fix_numlines fileB ${lines}
fix_numlines fileC ${lines}
If you want columnar output, it's even simpler:
paste fileA fileB fileC | head -$(grep -c . fileB)
Another for GNU awk that outputs in columns:
$ gawk -v seed=$RANDOM -v n=2 ' # n parameter is the file index number
BEGIN { # ... which defines the record count
srand(seed) # random record is printed when not enough records
}
{
a[ARGIND][c[ARGIND]=FNR]=$0 # hash all data to a first
}
END {
for(r=1;r<=c[n];r++) # loop records
for(f=1;f<=ARGIND;f++) # and fields for below output
printf "%s%s",((r in a[f])?a[f][r]:a[f][int(rand()*c[f])+1]),(f==ARGIND?ORS:OFS)
}' a b c # -v n=2 means the second file ie. b
Output:
a1 b1 c1
a2 b2 c2
a3 b3 c1
If you don't like the random pick of a record, replace int(rand()*c[f])+1 with c[f].
$ gawk ' # remember GNU awk only
NR==FNR { # count given files records
bnr=FNR
next
}
{
print # output records of a b c
if(FNR==bnr) # ... up to bnr records
nextfile # and skip to next file
}
ENDFILE { # if you get to the end of the file
if(bnr>FNR) # but bnr not big enough
for(i=FNR;i<bnr;i++) # loop some
print # and duplicate the last record of the file
}' b a b c # first the file to count then all the files to print
To make a file have n lines you can use the following function (usage: toLength n file). This omits lines at the end if the file is too long and repeats the last line if the file is too short.
toLength() {
{ head -n"$1" "$2"; yes "$(tail -n1 "$2")"; } | head -n"$1"
}
To set all files to the length of FileB and show them side by side use
n="$(wc -l < FileB)"
paste <(toLength "$n" FileA) FileB <(toLength "$n" FileC) | column -ts$'\t'
As observed by the user umläute, the side-by-side output makes things even easier. However, they used empty lines to pad out short files. The following solution instead repeats the last line to make short files longer.
stretch() {
cat "$1"
yes "$(tail -n1 "$1")"
}
paste <(stretch FileA) FileB <(stretch FileC) | column -ts$'\t' |
head -n"$(wc -l < FileB)"
This is a clean way using awk where we read each file only a single time:
awk -v n=2 '
BEGIN{ while(1) {
    for(i=1;i<ARGC;++i) {
        if (b[i]=(getline tmp < ARGV[i])) a[i] = tmp
    }
    if (b[n]) for(i=1;i<ARGC;++i) print a[i] > ARGV[i]".new"
    else { break }
}
}' f1 f2 f3 f4 f5 f6
This works in the following way:
the lead file is defined by the index n. Here we choose the lead file to be f2.
We do not process the files with awk's standard record-by-record reading; instead we use the BEGIN block, where we read the files in parallel.
We run an infinite loop, while(1), which we break out of when the lead file has no more input.
In each cycle we read a new line from each file using getline. If file i has a new line, we store it in a[i] and record the return value of getline in b[i]. If file i has reached its end, we keep its last line.
We check the return value for the lead file with b[n]. If we could still read a line, we print all the stored lines to the files f1.new, f2.new, ...; otherwise we break out of the infinite loop.
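A quick sanity check (my own suggestion, assuming the f1..f6 names used above): every generated .new file should contain exactly as many lines as the lead file f2:
wc -l f2 f*.new    # each .new file should report the same line count as f2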

Subtracting row-values from two different text files

I have two text files, and each file has one column with several rows:
FILE1
a
b
c
FILE2
d
e
f
I want to create a file that has the following output:
a - d
b - e
c - f
All the entries are meant to be numbers (decimals). I am completely stuck and do not know how to proceed.
Using paste seems like the obvious choice, but unfortunately you can't specify a multi-character delimiter. To get around this, you can pipe the output to sed:
$ paste -d- file1 file2 | sed 's/-/ - /'
a - d
b - e
c - f
Paste joins the two files together and sed adds the spaces around the -.
If your desired output is the result of the subtraction, then you could use awk:
paste file1 file2 | awk '{ print $1 - $2 }'
given:
$ cat /tmp/a.txt
1
2
3
$ cat /tmp/b.txt
4
5
6
awk is a good bet to process the two files and do arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""]+$1 }' /tmp/a.txt /tmp/b.txt
5
7
9
Or, if you want the strings rather than arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""] " - "$0 }' /tmp/a.txt /tmp/b.txt
1 - 4
2 - 5
3 - 6
Another solution, using while and file descriptors:
while read -r line1 <&3 && read -r line2 <&4
do
    #printf '%s - %s\n' "$line1" "$line2"
    printf '%s\n' $(($line1 - $line2))
done 3<f1.txt 4<f2.txt
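For example, run against the numeric /tmp/a.txt and /tmp/b.txt shown above (my check, not part of the answer):
while read -r line1 <&3 && read -r line2 <&4
do
    printf '%s\n' $((line1 - line2))
done 3</tmp/a.txt 4</tmp/b.txt
# prints -3 three times (1-4, 2-5, 3-6)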

Counting equal lines in two files

Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering whether there is a simpler or more elegant way to achieve the same result.
P.S. Outputting the percentage of the common part relative to the number of lines in each file would also be nice, though it is not necessary.
UPD: Files do not have duplicate lines
To find the lines common to your 2 files, using awk:
awk 'a[$0]++' file1 file2
This will output 3, 10 and 5.
Now, just pipe this to wc to get the number of common lines :
awk 'a[$0]++' file1 file2 | wc -l
Will output 3.
Explanation:
Here, a works like a dictionary with a default value of 0. When you write a[$0]++, you add 1 to a[$0], but this expression returns the previous value of a[$0] (see the difference between a++ and ++a). So you get 0 (= false) the first time you encounter a certain string and 1 (or more, still = true) the following times.
By default, awk 'condition' file is a syntax for outputting all the lines where condition is true.
Be also aware that the a[] array will expand every time you encounter a new key. At the end of your script, the size of the array will be the number of unique values you have throughout all your input files (in OP's example, it would be 9).
Note: this solution counts duplicates, i.e if you have:
file1 | file2
1 | 3
2 | 3
3 | 3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code to filter out duplicates :
awk '++a[$0] == 2' file1 file2 | wc -l
With your input example, this works too. But if the files are huge, I prefer the awk solutions by others:
grep -cFwf file2 file1
with your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files. The -1 and -2 options suppress the lines that are unique to file1 and file2 respectively, leaving only the lines the two files have in common, printed one per line. wc -l counts those lines.
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use comm command. Remember that you will have to first sort the files that you need to compare:
[gc@slave ~]$ sort a > sorted_1
[gc@slave ~]$ sort b > sorted_2
[gc@slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From man pages for comm command:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do it all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
In your solution, you can get rid of one cat:
sort file1 file2| uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
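If you also want the percentages from the question's P.S., a hedged extension of the same idea (the names f1, n1 and c are mine; it assumes no duplicate lines, as stated in the question):
awk 'NR==FNR { a[$0]; f1=FILENAME; n1=FNR; next }
     $0 in a { c++ }
     END     { print c; print f1, c/n1; print FILENAME, c/FNR }' file1 file2
# for the sample data this prints 3, then file1 0.428571 and file2 0.6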
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only if they exist in both files, I believe this next version will work, but it will only report duplicates in the second file that exist in the first file. (If the duplicates exist in the first file, only those that also exist in file2 will be reported, so file order matters.)
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1: the new version ensures intra-file duplicates are excluded from the count, so only cross-file duplicates show up in the final stats:
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
This is probably waaay overkill, but I wrote something similar to this to supplement uniq -c:
measuring the frequency of frequencies
It's like uniq -c | uniq -c without wasting time sorting. The summation and % parts are trivial from here, with 47 overlapping lines in this example. It avoids spending any time on per-row processing, since the current setup only shows the summarized stats.
If you need the actual duplicated rows, they're also available right there, serving as the hash key for the 1st array.
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
add another file, and the results reflect the changes (I added <( jot - 5 1295 5 ) ):
3 9
2 115
1 482

Bash set subtraction

How to subtract a set from another in Bash?
This is similar to: Is there a "set" data structure in bash? but different as it asks how to perform the subtraction, with code
set1: N lines as output by a filter
set2: M lines as output by a filter
how to get:
set3: all the lines of set1 that don't appear in set2
comm -23 <(command_which_generate_N|sort) <(command_which_generate_M|sort)
comm without options displays 3 columns of output: 1: lines only in the first file, 2: lines only in the second file, 3: lines common to both files. -23 removes the second and third columns.
$ cat > file1.list
A
B
C
$ cat > file2.list
A
C
D
$ comm file1.list file2.list
A
B
C
D
$ comm -12 file1.list file2.list # In both
A
C
$ comm -23 file1.list file2.list # Only in set 1
B
$ comm -13 file1.list file2.list # Only in set 2
D
Input files must be sorted.
GNU sort and comm depend on the locale; for example, the output order may differ (but the content must be the same):
(export LC_ALL=C; comm -23 <(command_which_generate_N|sort) <(command_which_generate_M|sort))
uniq -u (manpage) is often the simplest tool for list subtraction:
Usage
uniq [OPTION]... [INPUT [OUTPUT]]
[...]
-u, --unique
only print unique lines
Example: list files found in directory a but not in b
$ ls a
file1 file2 file3
$ ls b
file1 file3
$ echo "$(ls a ; ls b)" | sort | uniq -u
file2
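Note that this actually gives the symmetric difference: a file present only in b would be listed as well. A common trick for one-sided subtraction (a minus b), assuming each listing is itself free of duplicates, is to list b twice, so that everything from b occurs at least twice and is dropped by uniq -u (my sketch):
$ { ls a; ls b; ls b; } | sort | uniq -u
file2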
I've got a dead-simple 1-liner:
$ now=(ConfigQC DBScripts DRE DataUpload WFAdaptors.log)
$ later=(ConfigQC DBScripts DRE DataUpload WFAdaptors.log baz foo)
$ printf "%s\n" ${now[@]} ${later[@]} | sort | uniq -c | grep -vE '[ ]+2.*' | awk '{print $2}'
baz
foo
By definition, 2 sets intersect if they have elements in common. In this case, there are 2 sets, so any count of 2 is an intersection - simply "subtract" them with grep
I wrote a program recently called Setdown that does Set operations (like set difference) from the cli.
It can perform set operations by writing a definition similar to what you would write in a Makefile:
someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection
It's pretty cool and you should check it out. I personally don't recommend the "set operations in unix shell" post. It won't work well when you really need to do many set operations or if you have any set operations that depend on each other.
At any rate, I think that it's pretty cool and you should totally check it out.
You can use diff
# you should sort the output
ls > t1
cp t1 t2
I used vi to remove some entries from t2
$ cat t1
AEDWIP.writeMappings.sam
createTmpFile.sh*
find.out
grepMappingRate.sh*
salmonUnmapped.sh*
selectUnmappedReadsFromFastq.sh*
$ cat t2
AEDWIP.writeMappings.sam
createTmpFile.sh*
salmonUnmapped.sh*
selectUnmappedReadsFromFastq.sh*
diff reports lines in t1 that are not in t2
diff t1 t2
$ diff t1 t2
3,4d2
< find.out
< grepMappingRate.sh*
Putting it all together:
diff t1 t2 | grep "^<" | cut -d " " -f 2
find.out
grepMappingRate.sh*

How to find set difference of two files?

I have two files A and B. I want to find all the lines in A that are not in B. What's the fastest way to do this in bash/using standard linux utilities? Here's what I tried so far:
for line in `cat file1`
do
    if [ `grep -c "^$line$" file2` -eq 0 ]; then
        echo $line
    fi
done
It works, but it's slow. Is there a faster way of doing this?
The BashFAQ describes doing exactly this with comm, which is the canonically correct method.
# Subtraction of file1 from file2
# (i.e., only the lines unique to file2)
comm -13 <(sort file1) <(sort file2)
diff is less appropriate for this task, as it tries to operate on blocks rather than individual lines; as such, the algorithms it has to use are more complex and less memory-efficient.
comm has been part of the Single Unix Specification since SUS2 (1997).
If you simply want lines that are in file A, but not in B, you can sort the files, and compare them with diff.
sort A > A.sorted
sort B > B.sorted
diff -u A.sorted B.sorted | grep '^-'
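One caveat (my note, not part of the answer): grep '^-' also matches the "--- A.sorted" header that diff -u prints, so you may want to filter it out and strip the leading marker:
diff -u A.sorted B.sorted | grep '^-' | grep -v '^---' | sed 's/^-//'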
The 'diff' program is a standard Unix program that looks at the differences between files.
% cat A
a
b
c
d
% cat B
a
b
e
% diff A B
3,4c3
< c
< d
---
> e
With a simple grep and cut one can select the lines that are in A but not in B. Note that the cut is rather simplistic and spaces in the lines would throw it off... but the concept is there.
% diff A B | grep '^<' | cut -f2 -d" "
c
d
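A slightly more robust variant (my sketch): stripping diff's "< " prefix with sed instead of cut avoids the trouble with embedded spaces:
% diff A B | grep '^<' | sed 's/^< //'
c
d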
