How to find matching records from 3 different files in Unix shell

I have 3 different files.
Test1.txt , Test2.txt & Test3.txt
Test1.txt contains
JJTP#yahoo.com
BBMU#ssc.com
HK#glb.com
Test2.txt contains
SFTY#gmail.com
JJTP#yahoo.com
Test3.txt contains
JJTP#yahoo.com
HK#glb.com
I would like to see only the matching records across these 3 files,
so the matching record in the above example would be JJTP#yahoo.com.
The output should be
JJTP#yahoo.com

If you don't have duplicate lines in each file then:
$ awk '++a[$1]==3' test[1-3]
JJTP#yahoo.com

Here is an awk that mixes jaypal's and sudo_o's solutions.
It will not give false positives since it tests for uniqueness of the lines.
awk '!a[$1 FS FILENAME]++ && ++b[$1]==3' test*
JJTP#yahoo.com
If you have an unknown number of files, this could be an option:
awk '!a[$1 FS FILENAME]++ && ++b[$1]==ARGC-1' test*
ARGC stores the number of files read by awk, plus 1.
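ARGC counts the awk program name plus every file argument, which is why the comparison is against ARGC-1. A quick illustrative check (assuming files named test1, test2 and test3, as in the answer above):
$ awk 'BEGIN { print ARGC }' test1 test2 test3
4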

comm lists common lines for two files. Just find the common lines in the first two files, then pipe the output to comm again and find the common lines with the third file.
comm -12 <(sort Test1.txt) <(sort Test2.txt) | comm -12 - <(sort Test3.txt)
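With the sample data, the first comm already narrows things down to the Test1/Test2 intersection, and the second pass keeps only what is also in Test3:
$ comm -12 <(sort Test1.txt) <(sort Test2.txt)
JJTP#yahoo.com
$ comm -12 <(sort Test1.txt) <(sort Test2.txt) | comm -12 - <(sort Test3.txt)
JJTP#yahoo.com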

Here is how you'd do it with awk:
awk '
FILENAME == ARGV[1] { a[$0]++ }
FILENAME == ARGV[2] && ($0 in a) { b[$0]++ }
FILENAME == ARGV[3] && ($0 in b)' file1 file2 file3
Output:
JJTP#yahoo.com

To find the common lines in two files, you can use:
sort Test1.txt Test2.txt | uniq -d
Or, if you wish to preserve the order found in Test1.txt, you may use:
while read x; do grep -w "$x" Test2.txt; done < Test1.txt
For three files, repeat this:
sort Test1.txt Test2.txt | uniq -d | sort - Test3.txt | uniq -d
Or:
cat Test1.txt |\
while read x; do grep -w "$x" Test2.txt; done |\
while read x; do grep -w "$x" Test3.txt; done
The sort method assumes that the files themselves don't have duplicate lines; if they do, you may need to create de-duplicated temporary files first.
If you wish to use sed rather than grep, try sed -n "/^$x$/p".
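A minimal sketch of that temporary-file step, de-duplicating each input with sort -u before looking for common lines (tmp1 and tmp2 are just scratch names):
sort -u Test1.txt > tmp1
sort -u Test2.txt > tmp2
sort tmp1 tmp2 | uniq -d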

Related

How to merge two files into one, line by line, in bash [duplicate]

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure, it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?
paste -d '\n' file1 file2
Here's a solution using awk:
awk '{print; if(getline < "file2") print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if(getline < "file2") print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 is at least as long as file2.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if(getline < "file2") print; else print ""}' file1
or
awk '{print "1: "$0; if(getline < "file2") print "2: "$0; else print"2: "}' file1
@Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers:
(cat -n file1 ; cat -n file2 ) | sort -n | cut -f2-
Note (of interest to me): this needs a little more work to get the ordering right if, instead of static files, you use the output of commands that may run faster or slower than one another. In that case you need to add/sort/remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-
With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.
cat file1 file2 |sort -t. -k 2.1
Here it's specified that the separator is "." and that we are sorting on the first character of the second field.
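For the sample files above, the sort key starts at the digit after the dot, so lines pair up by line number, and ties fall back to whole-line comparison, which puts file1's line first:
$ cat file1 file2 | sort -t. -k 2.1
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Note that this is a lexicographic sort, so it would misorder two-digit line numbers; -k 2.1n would compare them numerically.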

Split numbers and store them in different files using a Unix shell script

I have a file called "list.txt" which contains the following rows of numbers.
31056780
31909020
31092320
61093190
61094592
45090280
45902902
I now need to take all the rows starting with "31" and store them in another file called file31.txt, take all the rows starting with "61" and store them in file61.txt, and take all rows starting with "45" and store them in file45.txt.
file31.txt will contain.
31056780
31909020
31092320
file61.txt will contain.
61093190
61094592
file45.txt will contain.
45090280
45902902
I tried this command for all 3 but it does not do what I want it to do.
awk -F\" '/31*/ {print $0}' list.txt > file31
awk -F\" '/61*/ {print $0}' list.txt > file61
awk -F\" '/45*/ {print $0}' list.txt > file45
You can use output redirection inside a single awk script. It can construct the filename by concatenating the first two characters of the line.
awk '{ fn = "file" substr($0, 1, 2) ".txt"; print > fn }' list.txt
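Run against the list.txt above, this should produce file31.txt, file61.txt and file45.txt with the expected contents, e.g.:
$ cat file61.txt
61093190
61094592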
You could use grep or sed to filter the lines with a matching pattern, for example:
sed '/^31/!d' list.txt > list31.txt
Or in a for loop for every number you want:
for n in "31" "45" "61"; do sed '/^'"$n"'/!d' list.txt > list$n.txt; done
Hope it helps.
You can use:
awk '/^31/{print > "file31"} /^45/{print > "file45"} /^61/{print > "file61"}' file
for i in `cat list.txt | cut -c1-2 | uniq`; do cat list.txt | grep -P ^${i} > file${i}.txt; done
This command works fine and is generic enough to work for all cases.
Now let's understand how it works.
cat list.txt | cut -c1-2 | uniq
31
61
45
Next we loop over these unique identifiers to create the new files using
cat list.txt | grep -P ^${i}
grep -P enables Perl-compatible regular expressions; the ^ anchor means we only look for the string at the beginning of the line.
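Since the pattern is just a literal two-character prefix anchored at the start of the line, plain grep (without -P) works as well, and sorting the identifiers first avoids relying on them being grouped in the input; a roughly equivalent variant:
for i in $(cut -c1-2 list.txt | sort -u); do grep "^${i}" list.txt > "file${i}.txt"; done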

Extract lines from a file in bash

I have a file like this
I would like to extract the lines with the 0s and 1s (all such lines in the file) into a separate file. The sequence does not have to start with a 0 but could also start with a 1. However, that line always comes directly after the SITE: line. Moreover, I would like to extract the SITE: line itself into a separate file. Could somebody tell me how that is doable in bash?
Moreover, I would like to extract the SITE: line itself into a separate file.
That’s the easy part:
grep '^SITE:' infile > outfile.site
Extracting the line after that is slightly harder:
grep --after-context=1 '^SITE:' infile \
| grep '^[01]*$' \
> outfile.nr
--after-context (or -A) specifies how many lines after the matching line to print as well. We then use the second grep to print only that following line, and not the actual matching line (nor the delimiter which grep puts between groups of matches when an after-context is specified).
Alternatively, you could use the following to match the numeric lines:
grep '^[01]*$' infile > outfile.nr
That’s much easier, but it will find all lines consisting solely of 0s and 1s, regardless of whether they come after a line which starts with SITE:.
You could try something like:
$ egrep -o "^(0|1)+$" test.txt > test2.txt
$ cat test2.txt
0000000000001010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
0000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000
0011010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
$ grep "^SITE:" test.txt > test3.txt
$ cat test3.txt
SITE: 0 0.000340988542 0.0357651018
SITE: 1 0.000529755514 0.00324293642
SITE: 2 0.000577745511 0.052214098
Another solution, using bash:
$ while read; do [[ $REPLY =~ ^(0|1)+$ ]] && echo "$REPLY"; done < test.txt > test2.txt
$ cat test2.txt
0000000000001010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
0000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000
0011010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
To remove the leading 0 characters at the beginning of the line:
$ egrep "^(0|1)+$" test.txt | sed "s/^0\{1,\}//g" > test2.txt
$ cat test2.txt
1010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000
11010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
UPDATE: new file format provided in comments:
$ egrep "^SITE:" test.txt|egrep -o "(0|1)+$"|sed "s/^0\{1,\}//g" > test2.txt
$ cat test2.txt
100000000000000000000001000001000000000000000000000000000000000000
1010010010000000000111101000010000001001010111111100000000000010010001101010100011101011110011100
10000000000
$ egrep "^SITE:" test.txt|sed "s/[01\ ]\{1,\}$//g" > test3.txt
$ cat test3.txt
SITE: 967 0.189021866 0.0169990123
SITE: 968 0.189149593 0.246619149
SITE: 969 0.189172266 6.84752689e-05
Here's a simple awk solution that matches all lines starting with SITE: and outputs the respective next line:
awk '/^SITE:/ { if (getline) print }' infile > outfile
Simply omit the { ... } block part to extract all lines starting with SITE: themselves to a separate file:
awk '/^SITE:/' infile > outfile
If you wanted to combine both operations:
outfile1 and outfile2 are the names of the 2 output files, passed to awk as variables f1 and f2:
awk -v f1=outfile1 -v f2=outfile2 \
'/^SITE:/ { print > f1; if (getline) print > f2 }' infile

Find the difference between two text files with one item per line [duplicate]

This question already has answers here:
How to remove the lines which appear on file B from another file A?
I have two files:
file 1
dsf
sdfsd
dsfsdf
file 2
ljljlj
lkklk
dsf
sdfsd
dsfsdf
I want to display what is in file 2 but not in file 1, so file 3 should look like
ljljlj
lkklk
grep -Fxvf file1 file2
What the flags mean:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
-x, --line-regexp
Select only those matches that exactly match the whole line.
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing.
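With the two example files above (saved as file1 and file2), this should print exactly the lines that are in file 2 but not in file 1:
$ grep -Fxvf file1 file2
ljljlj
lkklk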
You can try
grep -f file1 file2
to see the lines of file2 that match lines of file1 (as patterns), or
grep -v -F -x -f file1 file2
to get the lines of file2 that do not appear in file1.
You can use the comm command to compare two sorted files
comm -13 <(sort file1) <(sort file2)
I successfully used
diff "${file1}" "${file2}" | grep "<" | sed 's/^<//g' > "${diff_file}"
Outputting the difference to a file.
If you are expecting them in a certain order, you can just use diff:
diff file1 file2 | grep ">"
join -v 2 <(sort file1) <(sort file2)
I tried a slight variation on Luca's answer and it worked for me.
diff file1 file2 | grep ">" | sed 's/^> //g' > diff_file
Note that the searched pattern in sed is a > followed by a space.
file1
m1
m2
m3
file2
m2
m4
m5
$ awk 'NR == FNR {file1[$0]++; next} !($0 in file1)' file1 file2
m4
m5
$ awk 'NR == FNR {file1[$0]++; next} ($0 in file1)' file1 file2
m2
What's the awk command to get m1 and m3, i.e. the lines that are in file1 and not in file2?
m1
m3
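One way to get them is simply to swap the roles of the two files in the same idiom, reading file2 into the array first:
awk 'NR == FNR {file2[$0]++; next} !($0 in file2)' file2 file1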
If you want to use loops, you can try something like this (diff and comm are much more efficient):
while read -r line
do
    flag=0
    while read -r line2
    do
        if [ "$line" = "$line2" ]
        then
            flag=1
        fi
    done < file1
    if [ "$flag" -eq 0 ]
    then
        echo "$line" >> file3
    fi
done < file2
Note: the program is only meant to provide basic insight into what can be done if you don't want to use tools such as diff and comm.
An awk answer:
awk 'NR == FNR {file1[$0]++; next} !($0 in file1)' file1 file2
With GNU sed:
sed 's#[^^]#[&]#g;s#\^#\\^#g;s#^#/^#;s#$#$/d#' file1 | sed -f- file2
How it works:
The first sed produces an output like this:
/^[d][s][f]$/d
/^[s][d][f][s][d]$/d
/^[d][s][f][s][d][f]$/d
Then it is used as a sed script by the second sed.
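Wrapping every character in brackets makes regex metacharacters in file1's lines literal. For instance, a hypothetical line containing a.b* would become a pattern that deletes only that exact line:
$ echo 'a.b*' | sed 's#[^^]#[&]#g;s#\^#\\^#g;s#^#/^#;s#$#$/d#'
/^[a][.][b][*]$/d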

Shell command to find lines common in two files

I'm sure I once found a shell command which could print the common lines from two or more files. What is its name?
It was much simpler than diff.
The command you are seeking is comm, e.g.:
comm -12 1.sorted.txt 2.sorted.txt
Here:
-1 : suppress column 1 (lines unique to 1.sorted.txt)
-2 : suppress column 2 (lines unique to 2.sorted.txt)
To easily apply the comm command to unsorted files, use Bash's process substitution:
$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321
So the files abc and def have one line in common, the one with "132".
Using comm on unsorted files:
$ comm abc def
123
132
567
132
777
321
$ comm -12 abc def # No output! The common line is not found
$
The last command produced no output; the common line was not discovered.
Now use comm on sorted files, sorting the files with process substitution:
$ comm <( sort abc ) <( sort def )
123
132
321
567
777
$ comm -12 <( sort abc ) <( sort def )
132
Now we got the 132 line!
To complement the Perl one-liner, here's its awk equivalent:
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This will read all lines from file1 into the array arr[], and then check for each line in file2 if it already exists within the array (i.e. file1). The lines that are found will be printed in the order in which they appear in file2.
Note that the comparison in arr uses the entire line from file2 as index to the array, so it will only report exact matches on entire lines.
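Using the abc and def files from the process-substitution example above, it should print the shared line:
$ awk 'NR==FNR{arr[$0];next} $0 in arr' abc def
132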
Maybe you mean comm?
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
The secret to finding this information is the info pages. For GNU programs, they are much more detailed than the man pages. Try info coreutils and it will list all the small useful utilities.
While
fgrep -v -f 1.txt 2.txt > 3.txt
gives you the differences of two files (what is in 2.txt and not in 1.txt), you could easily do a
fgrep -f 1.txt 2.txt > 3.txt
to collect all common lines, which should provide an easy solution to your problem. If you have sorted files, you should use comm instead. Regards!
Note: You can use grep -F instead of fgrep.
If the two files are not sorted yet, you can use:
comm -12 <(sort a.txt) <(sort b.txt)
and it will work, avoiding the error message comm: file 2 is not in sorted order
when doing comm -12 a.txt b.txt.
perl -ne 'print if ($seen{$_} .= #ARGV) =~ /10$/' file1 file2
awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2
On a limited version of Linux (like the QNAP NAS I was working on):
comm did not exist
grep -f file1 file2 can cause some problems, as noted by @ChristopherSchultz, and grep -F -f file1 file2 was really slow (more than 5 minutes and never finished, versus 2-3 seconds with the method below, on files over 20 MB)
So here is what I did:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | grep "<" | sed 's/^< *//' > files.diff
diff file1.sorted files.diff | grep "<" | sed 's/^< *//' > files.same.sorted
If files.same.sorted should be in the same order as the original files, then add this line for the same order as file1:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same
Or, for the same order as file2:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
For how to do this for multiple files, see the linked answer to Finding matching lines across many files.
Combining these two answers (answer 1 and answer 2), I think you can get the result you need without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
for file2 in *
do
if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ] ; then
echo "Comparing: $file1 $file2 ..." >> $ans
perl -ne 'print if ($seen{$_} .= #ARGV) =~ /10$/' $file1 $file2 >> $ans
fi
done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files present in the current working directory and will make an all-vs-all comparison, leaving the result in the "matching_lines" file.
Things to be improved:
Skip directories
Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
Maybe add the line number next to the matching string
Not exactly what you were asking, but something that may still be useful for a slightly different scenario.
If you just want to quickly check whether there is any repeated line among a bunch of files, you can use this quick solution:
cat a_bunch_of_files* | sort | uniq | wc
If the number of lines you get is less than the one you get from
cat a_bunch_of_files* | wc
then there is some repeated line.
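A slightly more direct variant of the same idea prints any repeated lines themselves rather than just counting them (still assuming no single file repeats a line internally):
cat a_bunch_of_files* | sort | uniq -d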
rm -f file3.out
cat file1.out | while read -r line1
do
    cat file2.out | while read -r line2
    do
        if [[ $line1 == "$line2" ]]; then
            echo "$line1" >> file3.out
        fi
    done
done
This should do it.
