Shell command to find lines common in two files - shell

I'm sure I once found a shell command which could print the common lines from two or more files. What is its name?
It was much simpler than diff.

The command you are seeking is comm. For example:
comm -12 1.sorted.txt 2.sorted.txt
Here:
-1 : suppress column 1 (lines unique to 1.sorted.txt)
-2 : suppress column 2 (lines unique to 2.sorted.txt)

To easily apply the comm command to unsorted files, use Bash's process substitution:
$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321
So the files abc and def have one line in common, the one with "132".
Using comm on unsorted files:
$ comm abc def
123
132
567
132
777
321
$ comm -12 abc def # No output! The common line is not found
$
The last command produced no output; the common line was not found.
Now use comm on sorted files, sorting the files with process substitution:
$ comm <( sort abc ) <( sort def )
123
132
321
567
777
$ comm -12 <( sort abc ) <( sort def )
132
Now we got the 132 line!

To complement the Perl one-liner, here's its awk equivalent:
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This reads all lines from file1 into the array arr[], and then checks, for each line in file2, whether it already exists in the array (i.e. in file1). Lines that are found are printed in the order in which they appear in file2.
Note that the lookup uses the entire line from file2 as the array index, so it only reports exact matches on whole lines.
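As a quick check, using the unsorted abc and def files from the earlier answer (no sorting is needed here):
$ awk 'NR==FNR{arr[$0];next} $0 in arr' abc def
132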

Maybe you mean comm ?
Compare sorted files FILE1 and FILE2 line by line. With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
The secret to finding this kind of information is the info pages. For GNU programs, they are much more detailed than the man pages. Try info coreutils and it will list all the small useful utilities.

While
fgrep -v -f 1.txt 2.txt > 3.txt
gives you the difference between two files (what is in 2.txt but not in 1.txt), you can just as easily do
fgrep -f 1.txt 2.txt > 3.txt
to collect all common lines, which should provide an easy solution to your problem. If you have sorted files, you should still use comm.
Note: You can use grep -F instead of fgrep.
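One caveat worth noting: without -x, grep -f matches substrings, so a line 12 in 1.txt would also match a line 123 in 2.txt. To restrict the match to whole lines, add -x:
grep -Fxf 1.txt 2.txt > 3.txt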

If the two files are not sorted yet, you can use:
comm -12 <(sort a.txt) <(sort b.txt)
and it will work, avoiding the error message comm: file 2 is not in sorted order
when doing comm -12 a.txt b.txt.

perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' file1 file2

awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2

On a limited version of Linux (like the QNAP NAS I was working on):
comm did not exist
grep -f file1 file2 can cause some problems, as noted by @ChristopherSchultz, and grep -F -f file1 file2 was really slow (more than 5 minutes without finishing, versus 2-3 seconds with the method below, on files over 20 MB)
So here is what I did:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | grep "<" | sed 's/^< *//' > files.diff
diff file1.sorted files.diff | grep "<" | sed 's/^< *//' > files.same.sorted
If files.same.sorted should be in the same order as the originals, add this line for the same order as file1:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same
Or, for the same order as file2:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
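If sort and uniq are available and neither file contains duplicate lines, a shorter alternative sketch on such limited systems (not part of the original steps) is:
sort file1 file2 | uniq -d > files.same.sorted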

For how to do this for multiple files, see the linked answer to Finding matching lines across many files.
Combining these two answers (answer 1 and answer 2), I think you can get the result you need without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
    for file2 in *
    do
        if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ]; then
            echo "Comparing: $file1 $file2 ..." >> "$ans"
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
        fi
    done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files present in the current working directory, make an all-vs-all comparison, and leave the result in the "matching_lines" file.
Things to be improved (a sketch addressing the first two points follows below):
Skip directories
Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
Maybe add the line number next to the matching string
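A possible sketch for the first two improvements (skipping directories and avoiding the reverse comparison), keeping the same hypothetical matching_lines output file:
#!/bin/bash
ans="matching_lines"
files=( * )
for (( i = 0; i < ${#files[@]}; i++ )); do
    for (( j = i + 1; j < ${#files[@]}; j++ )); do
        f1=${files[i]}
        f2=${files[j]}
        # skip directories and the output file itself
        if [ ! -f "$f1" ] || [ ! -f "$f2" ] || [ "$f1" = "$ans" ] || [ "$f2" = "$ans" ]; then
            continue
        fi
        echo "Comparing: $f1 $f2 ..." >> "$ans"
        # same Perl one-liner as above: print lines that occur in both files
        perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$f1" "$f2" >> "$ans"
    done
done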

Not exactly what you were asking, but it may still be useful for a slightly different scenario.
If you just want to quickly check whether there is any repeated line among a bunch of files, you can use this quick solution:
cat a_bunch_of_files* | sort | uniq | wc -l
If the number of lines you get is less than the one you get from
cat a_bunch_of_files* | wc -l
then there is some repeated line.
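To also see which lines are repeated, rather than just whether any exist, uniq -d prints only the duplicated lines:
cat a_bunch_of_files* | sort | uniq -d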

rm -f file3.out
while read -r line1
do
    while read -r line2
    do
        if [[ "$line1" == "$line2" ]]; then
            echo "$line1" >> file3.out
        fi
    done < file2.out
done < file1.out
This should do it, although it rereads file2.out once for every line of file1.out, so comm or grep -Fxf (shown above) will be much faster on large files.

Related

How to merge in one file, two files in bash line by line [duplicate]

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure, it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?
paste -d '\n' file1 file2
Here's a solution using awk:
awk '{print; if(getline < "file2") print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if(getline < "file2") print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 is at least as long as file2.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if(getline < "file2") print; else print ""}' file1
or
awk '{print "1: "$0; if(getline < "file2") print "2: "$0; else print"2: "}' file1
@Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers:
(cat -n file1 ; cat -n file2 ) | sort -n | cut -f2-
Note (of interest to me) this needs a little more work to get the ordering right if instead of static files you use the output of commands that may run slower or faster than one another. In that case you need to add/sort/remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-
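A small note, not part of the original answer: with GNU sort you can add -s (stable sort) so that lines carrying the same number keep their input order, which makes the simple form deterministic (file1's line always comes before file2's):
(cat -n file1 ; cat -n file2 ) | sort -s -n | cut -f2-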
With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.
cat file1 file2 | sort -t. -k 2.1
Here it is specified that the separator is "." and that we are sorting on the first character of the second field. With the sample files above, this groups line1.1 with line2.1, and so on.

How to combine two text files using bash

I have two text files that I wish to combine in bash so that every line in one file is combined with every line in the other file.
file1.txt
abc123
def346
ghj098
file2.txt
PSYC1001
PSYC1002
PSYC1003
I want to combine them so that line 1 of file1 is added to every line of file2, with a pipe de-limiter | in between them.
e.g.
PSYC1001|abc123
PSYC1002|abc123
PSYC1003|abc123
Then the same for the other lines in file1 so I would end up with
PSYC1001|abc123
PSYC1002|abc123
PSYC1003|abc123
PSYC1001|def346
PSYC1002|def346
PSYC1003|def346
PSYC1001|ghj098
PSYC1002|ghj098
PSYC1003|ghj098
I've been doing similar, simpler text manipulations in bash by copying examples from this site, but I haven't found an example that can do this. I'd love to hear your suggestions. I know it must be simple, but I haven't worked it out yet.
The shortest one: the join command:
join -j2 -t'|' -o2.1,1.1 file1 file2
-j2 : join on field 2; since each line has only one field, field 2 is empty on every line, so every line of file1 gets paired with every line of file2
-t'|' : input/output field separator
-o FORMAT : FORMAT is one or more comma or blank separated specifications, each being FILENUM.FIELD or 0
The output:
PSYC1001|abc123
PSYC1002|abc123
PSYC1003|abc123
PSYC1001|def346
PSYC1002|def346
PSYC1003|def346
PSYC1001|ghj098
PSYC1002|ghj098
PSYC1003|ghj098
This awk one-liner should help you:
awk -v OFS="|" 'NR==FNR{a[NR]=$0;c=NR;next}{for(i=1;i<=c;i++){print a[i],$0}}' file2 file1
Test with your data:
kent$ awk -v OFS="|" 'NR==FNR{a[NR]=$0;c=NR;next}{for(i=1;i<=c;i++){print a[i],$0}}' f2 f1
PSYC1001|abc123
PSYC1002|abc123
PSYC1003|abc123
PSYC1001|def346
PSYC1002|def346
PSYC1003|def346
PSYC1001|ghj098
PSYC1002|ghj098
PSYC1003|ghj098
Here are 2 ways to do it in plain bash:
while IFS= read -u3 -r elem1; do
    while IFS= read -u4 -r elem2; do
        echo "$elem2|$elem1"
    done 4<file2.txt
done 3<file1.txt

mapfile -t f1 < file1.txt
mapfile -t f2 < file2.txt
for elem1 in "${f1[@]}"; do
    for elem2 in "${f2[@]}"; do
        echo "$elem2|$elem1"
    done
done
bash only
a1=( $(<f1) )
a2=( $(<f2) )
for i in "${a2[@]}"
do
    for j in "${a1[@]}"
    do
        echo "${j}|${i}"
    done
done
Note that, unlike mapfile above, $(<f1) splits on any whitespace, so this assumes one word per line.

Combine two lines from different files when the same word is found in those lines

I'm new to bash, and I want to combine two lines from different files when the same word is found in those lines.
E.g.:
File 1:
organism 1
1 NC_001350
4 NC_001403
organism 2
1 NC_001461
1 NC_001499
File 2:
NC_001499 » Abelson murine leukemia virus
NC_001461 » Bovine viral diarrhea virus 1
NC_001403 » Fujinami sarcoma virus
NC_001350 » Saimiriine herpesvirus 2 complete genome
NC_022266 » Simian adenovirus 18
NC_028107 » Simian adenovirus 19 strain AA153
I wanted an output like:
File 3:
organism 1
1 NC_001350 » Saimiriine herpesvirus 2 complete genome
4 NC_001403 » Fujinami sarcoma virus
organism 2
1 NC_001461 » Bovine viral diarrhea virus 1
1 NC_001499 » Abelson murine leukemia virus
Is there any way to get anything like that output?
You can get something pretty similar to your desired output like this:
awk 'NR == FNR { a[$1] = $0; next }
{ print $1, ($2 in a ? a[$2] : $2) }' file2 file1
This reads in each line of file2 into an array a, using the first field as the key. Then for each line in file1 it prints the first field followed by the matching line in a if one is found, else the second field.
If the spacing is important, it takes a little more effort but is totally possible (see the sketch below).
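One possible sketch for keeping file1's original spacing (not part of the original answer, and assuming the accession IDs contain no regex metacharacters) is to substitute the matched ID in place rather than rebuilding the line:
awk 'NR == FNR { a[$1] = $0; next }
     $2 in a   { sub($2, a[$2]) } 1' file2 file1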
For a more Bash 4-ish solution:
declare -A descriptions
while read -r line; do
    name=$(echo "$line" | cut -d '»' -f 1 | xargs echo)
    description=$(echo "$line" | cut -d '»' -f 2)
    descriptions[$name]=" »$description"
done < file2

while read -r line; do
    name=$(echo "$line" | cut -d ' ' -f 2)
    if [[ -n "$name" && -n "${descriptions[$name]}" ]]; then
        echo "${line}${descriptions[$name]}"
    else
        echo "$line"
    fi
done < file1
We could create a sed script from the second file and apply it to the first file. It is straightforward: we use the sed s command to construct another sed s command from each line and store it in a variable for later use:
sc=$(sed -rn 's#^\s+(\w+)([^\w]+)(.*)$#s/\1/\1\2\3/g;#g; p;' file2 )
sed "$sc" file1
The first command looks odd because we use # as the delimiter for the outer sed s command and the more common / for the inner one.
Run echo "$sc" to study the generated script. It captures the parts of each line of file2 in separate groups and then combines them into a s/find/replace/g; where
find is \1
replace is \1\2\3
You want to rebuild file2 into a sed-command file.
sed 's# \(\w\+\) \(.*\)#s/\1/\1 \2/#' File2
You can use process substitution to use the result without storing it in a temp file.
sed -f <(sed 's# \(\w\+\) \(.*\)#s/\1/\1 \2/#' File2) File1

how to find matching records from 3 different files in unix

I have 3 different files.
Test1.txt , Test2.txt & Test3.txt
Test1.txt contains
JJTP#yahoo.com
BBMU#ssc.com
HK#glb.com
Test2.txt contains
SFTY#gmail.com
JJTP#yahoo.com
Test3.txt contains
JJTP#yahoo.com
HK#glb.com
I would like to see only matching records in these 3 files.
so the matching record in the above example is JJTP#yahoo.com
The output should be
JJTP#yahoo.com
If you don't have duplicate lines in each file then:
$ awk '++a[$1]==3' test[1-3]
JJTP#yahoo.com
Here is an awk solution that mixes jaypal's and sudo_o's approaches.
It will not give false positives since it tests for uniqueness of the lines.
awk '!a[$1 FS FILENAME]++ && ++b[$1]==3' test*
JJTP#yahoo.com
If you have an unknown number of files, this could be an option:
awk '!a[$1 FS FILENAME]++ && ++b[$1]==ARGC-1' test*
ARGC stores the number of files read by awk plus 1.
comm lists common lines for two files. Just find the common lines in the first two files, then pipe the output to comm again and find the common lines with the third file.
comm -12 <(sort Test1.txt) <(sort Test2.txt) | comm -12 - <(sort Test3.txt)
Here is how you'd do with awk:
awk '
FILENAME == ARGV[1] { a[$0]++ }
FILENAME == ARGV[2] && ($0 in a) { b[$0]++ }
FILENAME == ARGV[3] && ($0 in b)' file1 file2 file3
Output:
JJTP#yahoo.com
To find the common lines in two files, you can use:
sort Test1.txt Test2.txt | uniq -d
Or, if you wish to preserve the order found in Test1.txt, you may use:
while read x; do grep -w "$x" Test2.txt; done < Test1.txt
For three files, repeat this:
sort Test1.txt Test2.txt | uniq -d | sort - Test3.txt | uniq -d
Or:
cat Test1.txt |\
while read x; do grep -w "$x" Test2.txt; done |\
while read x; do grep -w "$x" Test3.txt; done
The sort method assumes that the files themselves don't have duplicate lines; if they do, you may need to create temporary files.
If you wish to use sed rather than grep, try sed -n "/^$x$/p".

find difference between two text files with one item per line [duplicate]

This question already has answers here:
How to remove the lines which appear on file B from another file A?
(12 answers)
Closed 6 years ago.
I have two files:
file 1
dsf
sdfsd
dsfsdf
file 2
ljljlj
lkklk
dsf
sdfsd
dsfsdf
I want to display what is in file 2 but not in file 1, so file 3 should look like
ljljlj
lkklk
grep -Fxvf file1 file2
What the flags mean:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
-x, --line-regexp
Select only those matches that exactly match the whole line.
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing.
You can try
grep -f file1 file2
or
grep -v -F -x -f file1 file2
You can use the comm command to compare two sorted files:
comm -13 <(sort file1) <(sort file2)
Here -13 suppresses column 1 (lines unique to file1) and column 3 (lines common to both), leaving only the lines unique to file2.
I successfully used
diff "${file1}" "${file2}" | grep "<" | sed 's/^<//g' > "${diff_file}"
Outputting the difference to a file.
If you are expecting them in a certain order, you can just use diff:
diff file1 file2 | grep ">"
join -v 2 <(sort file1) <(sort file2)
I tried a slight variation on Luca's answer and it worked for me.
diff file1 file2 | grep ">" | sed 's/^> //g' > diff_file
Note that the searched pattern in sed is a > followed by a space.
file1
m1
m2
m3
file2
m2
m4
m5
$ awk 'NR == FNR {file1[$0]++; next} !($0 in file1)' file1 file2
m4
m5
$ awk 'NR == FNR {file1[$0]++; next} ($0 in file1)' file1 file2
m2
What's the awk command to get m1 and m3, i.e. lines that are in file1 but not in file2?
m1
m3
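One way (not from the original answer) is simply to swap which file is loaded into the array first:
$ awk 'NR == FNR {file2[$0]++; next} !($0 in file2)' file2 file1
m1
m3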
If you want to use loops, you can try something like this (diff and comm are much more efficient):
while read -r line
do
    flag=0
    while read -r line2
    do
        if [ "$line" = "$line2" ]; then
            flag=1
        fi
    done < file1
    if [ "$flag" -eq 0 ]; then
        echo "$line" >> file3
    fi
done < file2
Note: the program is only meant to provide basic insight into what can be done if you don't want to use external commands such as diff and comm.
an awk answer:
awk 'NR == FNR {file1[$0]++; next} !($0 in file1)' file1 file2
With GNU sed:
sed 's#[^^]#[&]#g;s#\^#\\^#g;s#^#/^#;s#$#$/d#' file1 | sed -f- file2
How it works:
The first sed produces an output like this:
/^[d][s][f]$/d
/^[s][d][f][s][d]$/d
/^[d][s][f][s][d][f]$/d
Then it is used as a sed script by the second sed.
