Intersection of N files - bash

I've got a little problem in bash.
I've got N files containing filenames, and I would like to find the list of filenames which are contained in all the file (the intersection of the files).
When there is 2 files, I found this solution: sort file1 file2 | uniq -d, and this is actually doing what I want.
But how to generalize it to N files in the same folder ?
File1
1
2
3
4
File2
1
4
File3
2
3
4
Output expected:
4
Thanks in advance,
Best Regards.

I`m not Marc B, but still, here`s the implementation of his idea:
intersect() {
sort "$#" | uniq -cd | grep "^[^0-9]*$# "
}
# usage example
intersect file1 file2 file3
[EDIT:] To overcome the problem of duplicate lines within the same file, I`d do something like this:
intersect() {
for file in "$#"; do
sort -u "$file"
done | sort | uniq -cd | grep "^[^0-9]*$# "
}

Related

Merge unsorted lines from two files based on similar part

I am wondering if is it possible to merge information from two files together based on a similar part. file1 is ID with sequence after the blast, and file2 contains taxonomic names corresponding to two first numbers in name of sequences.
file 1:
>301-89_IDNAGNDJ_171582
>301-88_ALPEKDJF_119660
>301-88_ALPEKDJF_112039
...
file2:
301-89--sample1
301-88--sample2
...
output:
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
The files are unsorted and file1 contains more lines where is first two numbers similar to the first two numbers in one line in file2. I am looking for some tips/help on how to do that, it is possible to do that like this? which command or language should I use?
(mawk/nawk/gawk -e/-ce/-Pe) '
FNR == !_ {
_ = ! ( ___=match(FS=FNR==NR ? "[-][-]" : "[>_]", "[>-]"))
$_ = $_
} FNR == NR { __[$!_]="--"$NF; next } sub("$", __[$___])' file2.txt file1.txt
———————————————————————————
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2
Using awk
$ awk -F"[_-]" 'BEGIN{OFS="-"}NR==FNR{a[$2]=$4;next}{print $0,a[$2]}' file2 OFS="--" file1
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
I am wondering if is it possible to merge information from two files together based on a similar part
Yes ...
The files are unsorted
... but only if they're sorted.
It's easier if we transform them so the delimiters are consistent, and then format it back together later:
sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 produces
301-88 ALPEKDJF_112039
301-88 ALPEKDJF_119660
301-89 IDNAGNDJ_171582
...
which we can just pipe through sort -k1
sed 's/--/ /' f2 produces
301-89 sample1
301-88 sample2
...
which we can sort the same way
join sorted1 sorted2 (with the sorted results of the previous steps) produces
301-88 ALPEKDJF_112039 sample2
301-88 ALPEKDJF_119660 sample2
301-89 IDNAGNDJ_171582 sample1
...
and finally we can format those 3 fields as you originally wanted, by piping through
sed 's/\(.*\) \(.*\) \(.*\)$/\1_\2--\3/'
If it's reasonable to sort them on the fly, we can just do that using process substitution:
$ join \
<( sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' f1 | sort -k1 ) \
<( sed 's/--/ /' f2 | sort -k1 ) \
| sed 's/\(.*\) \(.*\) \(.*\)$/\1_\2--\3/'
301-88_ALPEKDJF_112039--sample2
301-88_ALPEKDJF_119660--sample2
301-89_IDNAGNDJ_171582--sample1
...
If it's not reasonable to sort the files - on the fly or otherwise - you're going to end up building a hash in memory, like the awk answer is doing. Give them both a try and see which is faster.

Counting occurrences of a pattern in a file, include zero for missing patterns

I am trying to count occurrences of a pattern (listed in pattern.txt) in a file (file.txt) using:
grep -o -w -f pattern.txt file.txt | sort | uniq -c > output.txt
This works great, but I would like the output to also include 0 for patterns that do not occur in the file.
How might I accomplish this?
You could add the patterns to the sort, then substract 1 after the uniq:
grep -o -w -F -f pattern.txt file.txt |\
sort - pattern.txt |\
uniq -c |\
awk '{ $1 = sprintf("%7d", $1-1) } 1' > output.txt
Note that this only makes sense if the patterns are fixed strings, so I have added the -F option to grep.
Also, this particular awk script will compact whitespace in the patterns. You'll need more complicated code to avoid that.
Let's face it, what are you doing?
You are looking in a file for some patterns. If they are present, you show them. Afterwards you count the whole thing.
Now you say that for the ones that are not present, you want the number also to be visible.
As a one-liner, I'm afraid this makes no sense.
Therefore I opt for another way of working: grep contains a switch (-v) for patterns which are NOT present. You can add yourself the number 0 in front of these patterns and append those to your list.
As a starting point: grep -v -f pattern.txt file.txt (not tested).
Assuming patterns do not contain white space ...
$ cat pattern.txt
abc
def
ghi
$ cat file.txt
line 1 no match
line 2 one abc match
line 3 two ghi matches abc
OP's current code:
$ grep -o -w -f pattern.txt file.txt | sort | uniq -c
2 abc
1 ghi
One idea using join and OP's current grep/sort/uniq:
join -1 1 -2 2 -a 1 -e 0 -o 2.1,1.1 pattern.txt <(grep -o -w -f pattern.txt file.txt| sort | uniq -c)
Where:
-1 1 -2 2 - join on file #1 (-1 == pattern.txt) field #1 and file #2 (-2 == grep/sort/uniq) field #2
-a 1 - include all rows from file #1 even if there is no match
-e 0 - fill in empty fields with the character 0
-o 1.1,2.1 - output format: file #2.field #1 + file #1.field #1
This generates:
2 abc
0 def
1 ghi
Personally, I'd probably opt for a more robust solution based on awk, but for simple strings (with no white space) this may be sufficient, ymmv ...

Counting equal lines in two files

Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering is there a simpler or more elegant way to achieve the same result.
P.S. Outputting the percentage of common part to the number of lines in each file would also be nice, though is not necessary.
UPD: Files do not have duplicate lines
To find lines in common with your 2 files, using awk :
awk 'a[$0]++' file1 file2
Will output 3 10 15
Now, just pipe this to wc to get the number of common lines :
awk 'a[$0]++' file1 file2 | wc -l
Will output 3.
Explanation:
Here, a works like a dictionary with default value of 0. When you write a[$0]++, you will add 1 to a[$0], but this instruction returns the previous value of a[$0] (see difference between a++ and ++a). So you will have 0 ( = false) the first time you encounter a certain string and 1 ( or more, still = true) the next times.
By default, awk 'condition' file is a syntax for outputting all the lines where condition is true.
Be also aware that the a[] array will expand every time you encounter a new key. At the end of your script, the size of the array will be the number of unique values you have throughout all your input files (in OP's example, it would be 9).
Note: this solution counts duplicates, i.e if you have:
file1 | file2
1 | 3
2 | 3
3 | 3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code to filter out duplicates :
awk '++a[$0] == 2' file1 file2 | wc -l
with your input example, this works too. but if the files are huge, I prefer the awk solutions by others:
grep -cFwf file2 file1
with your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files. The arguments 1,2 suppresses unique lines found in both files.
The output is the lines they have in common, on separate lines. wc -l counts the number of lines.
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use comm command. Remember that you will have to first sort the files that you need to compare:
[gc#slave ~]$ sort a > sorted_1
[gc#slave ~]$ sort b > sorted_2
[gc#slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From man pages for comm command:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
In your solution, you can get rid of one cat:
sort file1 file2| uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only if they exist in both files, I believe this next version will work, but will only report duplicates in the second file that exist in the first file. (If the duplicates exist in the first file, only the those that also exist in file2 will be reported, so file order matters).
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1 : new version ensures intra-file duplicates are excluded from count, so only cross-file duplicates would show up in the final stats :
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
this is probably waaay overkill, but i wrote something similar to this to supplement uniq -c :
measuring the frequency of frequencies
it's like uniq -c | uniq -c without wasting time sorting. The summation and % parts are trivial from here, with 47 over-lapping lines in this example. It avoids spending any time performing per row processing, since the current setup only shows the summarized stats.
If you need to actual duplicated rows, they're also available right there serving as the hash key for the 1st array.
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
add another file, and the results reflect the changes (I added <( jot - 5 1295 5 ) ):
3 9
2 115
1 482

how to find matching records from 3 different files in unix

I have 3 different files.
Test1.txt , Test2.txt & Test3.txt
Test1.txt contains
JJTP#yahoo.com
BBMU#ssc.com
HK#glb.com
Test2.txt contains
SFTY#gmail.com
JJTP#yahoo.com
Test3.txt contains
JJTP#yahoo.com
HK#glb.com
I would like to see only matching records in these 3 files.
so the matching records in above example will be JJTP#yahoo.com
The output should be
JJTP#yahoo.com
If you don't have duplicate lines in each file then:
$ awk '++a[$1]==3' test[1-3]
JJTP#yahoo.com
Here is an awk that has a mix of jaypal and sudo_o solution.
It will not give false positive since it test for uniqueness of the lines.
awk '!a[$1 FS FILENAME]++ && ++b[$1]==3' test*
JJTP#yahoo.com
If you have a unknown number of files, this could be an option
awk '!a[$1 FS FILENAME]++ && ++b[$1]==ARGC-1' test*
The ARGC store the number of files read by awk + 1
comm lists common lines for two files. Just find the common lines in the first two files, then pipe the output to comm again and find the common lines with the third file.
comm -12 <(sort Test1.txt) <(sort Test2.txt) | comm -12 - <(sort Test3.txt)
Here is how you'd do with awk:
awk '
FILENAME == ARGV[1] { a[$0]++ }
FILENAME == ARGV[2] && ($0 in a) { b[$0]++ }
FILENAME == ARGV[3] && ($0 in b)' file1 file2 file3
Output:
JJTP#yahoo.com
To find the common lines in two files, you can use:
sort Test1.txt Test2.txt | uniq -d
Or, if you wish to preserve the order found in Test1.txt, you may use:
while read x; do grep -w "$x" Test2.txt; done < Test1.txt
For three files, repeat this:
sort Test1.txt Test2.txt | uniq -d | sort - Test3.txt | uniq -d
Or:
cat Test1.txt |\
while read x; do grep -w "$x" Test2.txt; done |\
while read x; do grep -w "$x" Test3.txt; done
The sort method assumes that the files themselves don't have duplicate lines; if so, you may need create temporary files.
If you wish to use sed rather than grep, try sed -n "/^$x$/p".

Shell command to find lines common in two files

I'm sure I once found a shell command which could print the common lines from two or more files. What is its name?
It was much simpler than diff.
The command you are seeking is comm. eg:-
comm -12 1.sorted.txt 2.sorted.txt
Here:
-1 : suppress column 1 (lines unique to 1.sorted.txt)
-2 : suppress column 2 (lines unique to 2.sorted.txt)
To easily apply the comm command to unsorted files, use Bash's process substitution:
$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321
So the files abc and def have one line in common, the one with "132".
Using comm on unsorted files:
$ comm abc def
123
132
567
132
777
321
$ comm -12 abc def # No output! The common line is not found
$
The last line produced no output, the common line was not discovered.
Now use comm on sorted files, sorting the files with process substitution:
$ comm <( sort abc ) <( sort def )
123
132
321
567
777
$ comm -12 <( sort abc ) <( sort def )
132
Now we got the 132 line!
To complement the Perl one-liner, here's its awk equivalent:
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This will read all lines from file1 into the array arr[], and then check for each line in file2 if it already exists within the array (i.e. file1). The lines that are found will be printed in the order in which they appear in file2.
Note that the comparison in arr uses the entire line from file2 as index to the array, so it will only report exact matches on entire lines.
Maybe you mean comm ?
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one
contains lines unique to FILE1, column
two contains lines unique to
FILE2, and column three contains lines common to both files.
The secret in finding these information are the info pages. For GNU programs, they are much more detailed than their man-pages. Try info coreutils and it will list you all the small useful utils.
While
fgrep -v -f 1.txt 2.txt > 3.txt
gives you the differences of two files (what is in 2.txt and not in 1.txt), you could easily do a
fgrep -f 1.txt 2.txt > 3.txt
to collect all common lines, which should provide an easy solution to your problem. If you have sorted files, you should take comm nonetheless. Regards!
Note: You can use grep -F instead of fgrep.
If the two files are not sorted yet, you can use:
comm -12 <(sort a.txt) <(sort b.txt)
and it will work, avoiding the error message comm: file 2 is not in sorted order
when doing comm -12 a.txt b.txt.
perl -ne 'print if ($seen{$_} .= #ARGV) =~ /10$/' file1 file2
awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2
On limited version of Linux (like a QNAP (NAS) I was working on):
comm did not exist
grep -f file1 file2 can cause some problems as said by #ChristopherSchultz and using grep -F -f file1 file2 was really slow (more than 5 minutes - not finished it - over 2-3 seconds with the method below on files over 20 MB)
So here is what I did:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | grep "<" | sed 's/^< *//' > files.diff
diff file1.sorted files.diff | grep "<" | sed 's/^< *//' > files.same.sorted
If files.same.sorted shall be in the same order as the original ones, then add this line for same order than file1:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same
Or, for the same order than file2:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
For how to do this for multiple files, see the linked answer to Finding matching lines across many files.
Combining these two answers (answer 1 and answer 2), I think you can get the result you are needing without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
for file2 in *
do
if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ] ; then
echo "Comparing: $file1 $file2 ..." >> $ans
perl -ne 'print if ($seen{$_} .= #ARGV) =~ /10$/' $file1 $file2 >> $ans
fi
done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files present in the current working directory and will make an all-vs-all comparison leaving in the "matching_lines" file the result.
Things to be improved:
Skip directories
Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
Maybe add the line number next to the matching string
Not exactly what you were asking, but something I think still may be useful to cover a slightly different scenario
If you just want to quickly have certainty of whether there is any repeated line between a bunch of files, you can use this quick solution:
cat a_bunch_of_files* | sort | uniq | wc
If the number of lines you get is less than the one you get from
cat a_bunch_of_files* | wc
then there is some repeated line.
rm file3.txt
cat file1.out | while read line1
do
cat file2.out | while read line2
do
if [[ $line1 == $line2 ]]; then
echo $line1 >>file3.out
fi
done
done
This should do it.

Resources