bash delete lines in file containing lines from another file - bash

file1 contains:
someword0
someword2
someword4
someword6
someword8
someword9
someword1
file2 contains:
someword2
someword3
someword4
someword5
someword7
someword11
someword1
So I want only the lines from file1 that file2 doesn't contain. How can I do this in bash?
That's the answer:
grep -v -x -f file2 file1
-v selects non-matching lines
-x matches whole lines only
-f file2 reads the patterns from file2
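To see why -x matters, here is a small sketch with made-up sample lines. The temporary directory and the variable names are assumptions for illustration only. The -F option (treat patterns as fixed strings) is an extra that the answer above does not use; it is added here so that characters such as "." are not read as regular expression syntax.

```shell
# Build small sample files in a temporary directory (illustrative data).
tmp=$(mktemp -d)
printf '%s\n' someword0 someword1 someword11 > "$tmp/file1"
printf '%s\n' someword1 > "$tmp/file2"

# Without -x the pattern "someword1" also matches inside "someword11",
# so that line is dropped as well.
loose=$(grep -v -f "$tmp/file2" "$tmp/file1")

# With -x only lines that equal a whole pattern are removed.
exact=$(grep -v -x -F -f "$tmp/file2" "$tmp/file1")

printf 'without -x:\n%s\n' "$loose"
printf 'with -x:\n%s\n' "$exact"
rm -rf "$tmp"
```

Without -x only someword0 survives; with -x both someword0 and someword11 are kept.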

You can use grep with -v, -w, -F and -f:
grep -vwFf file2 file1
someword0
someword6
someword8
someword9
Check man grep for detailed info on all the grep options used here.

You could use the comm command as well:
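comm -23 <(sort file1) <(sort file2)
Explanation:
comm compares two sorted files and prints three columns: lines unique to file1, lines unique to file2, and lines common to both.
The options -2 and -3 (or simply -23) suppress the second and third columns, so you get only the lines unique to file1. comm requires sorted input, and the sample files are not sorted, so both are sorted on the fly with process substitution.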
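As a rough sketch of the comm approach on the question's data: the temporary directory, the ".sorted" file names and the variable name are assumptions for illustration. Temporary sorted files are used here so the snippet does not depend on bash process substitution, and LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
# Recreate the two sample files from the question in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' someword0 someword2 someword4 someword6 someword8 someword9 someword1 > "$tmp/file1"
printf '%s\n' someword2 someword3 someword4 someword5 someword7 someword11 someword1 > "$tmp/file2"

# comm expects both inputs sorted in the same collation order.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# -2 hides lines unique to file2, -3 hides lines common to both,
# leaving only the lines unique to file1.
only_in_file1=$(LC_ALL=C comm -23 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$only_in_file1"
rm -rf "$tmp"
```

Note that the result comes out in sorted order, not in the original order of file1.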

If the lines within each file are unique, do a left join and filter out the lines that exist in both files.
join <(sort file1) <(sort file2) -o0,2.1 -a1 | awk '!$2'

Related

Finding common lines for two files using bash

I am trying to compare two files and output a file which consists of common names for both.
File1
1990.A.BHT.s_fil 4.70
1991.H.BHT.s_fil 2.34
1992.O.BHT.s_fil 3.67
1993.C.BHT.s_fil -1.50
1994.I.BHT.s_fil -3.29
1995.K.BHT.s_fil -4.01
File2
1990.A.BHT_ScS.dat 1537 -2.21
1993.C.BHT_ScS.dat 1494 1.13
1994.I.BHT_ScS.dat 1545 0.15
1995.K.BHT_ScS.dat 1624 1.15
I want to compare the first part of each name (e.g. 1990.A.BHT) in both files and write to file3 the lines of file1 whose names occur in both files, keeping the values from the second column of file1.
ex: file3 (output)
1990.A.BHT.s_fil 4.70
1993.C.BHT.s_fil -1.50
1994.I.BHT.s_fil -3.29
1995.K.BHT.s_fil -4.01
I tried the following code, which uses grep:
while read line
do
grep $line file1 >> file3
done < file2
and
grep -wf file1 file2 > file3
I sort the files before using this script.
But I get an empty file3. Can someone help me with this please?
You need to remove everything starting from _ScS.dat from the lines in file2. Then you can use what remains as patterns to match lines in file1.
grep -F -f <(sed 's/_ScS\.dat.*//' file2) file1 > file3
The -F option matches fixed strings rather than treating them as regular expressions.
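A small sketch of this approach on the question's data follows. The temporary directory, the intermediate "keys" file and the variable name are assumptions for illustration; the keys file stands in for the process substitution so the snippet also runs in a plain POSIX shell. The sed pattern is spelled _ScS to match the case used in the sample data.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
1990.A.BHT.s_fil 4.70
1991.H.BHT.s_fil 2.34
1992.O.BHT.s_fil 3.67
1993.C.BHT.s_fil -1.50
1994.I.BHT.s_fil -3.29
1995.K.BHT.s_fil -4.01
EOF
cat > "$tmp/file2" <<'EOF'
1990.A.BHT_ScS.dat 1537 -2.21
1993.C.BHT_ScS.dat 1494 1.13
1994.I.BHT_ScS.dat 1545 0.15
1995.K.BHT_ScS.dat 1624 1.15
EOF

# Strip the suffix and trailing columns so only the name prefix remains,
# e.g. "1990.A.BHT".
sed 's/_ScS\.dat.*//' "$tmp/file2" > "$tmp/keys"

# -F treats each key as a fixed string, so "." is not a wildcard.
common=$(grep -F -f "$tmp/keys" "$tmp/file1")
printf '%s\n' "$common"
rm -rf "$tmp"
```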
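In your example data, the lines appear to be in sorted order. If you can guarantee that they always are, comm -1 -2 file1 file2 prints the lines common to both files. Note that comm compares whole lines, so this only helps when the lines are identical in both files; with the sample data above, where the names carry different suffixes, the common prefix would have to be extracted first. If the files can be unsorted, sort them on the fly: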
comm -1 -2 <(sort file1) <(sort file2)

Diff to get changed line from second file

I have two files file1 and file2. I want to print the new line added to file2 using diff.
file1
/root/a
/root/b
/root/c
/root/d
file2
/root/new
/root/new_new
/root/a
/root/b
/root/c
/root/d
Expected output
/root/new
/root/new_new
I looked at the man page but found no information on this.
If you don't need to preserve the order, you could use the comm command like:
comm -13 <(sort file1) <(sort file2)
comm compares two sorted files and prints three columns of output: first the lines unique to file1, then the lines unique to file2, then the lines common to both. You can suppress any column, so in this example we turn off columns 1 and 3 with -13 to see only the lines unique to the second file.
or you could use grep:
grep -wvFf file1 file2
Here we use -f to have grep get its patterns from file1. We then tell it to treat them as fixed strings with -F instead of as patterns, match whole words with -w, and print only lines with no matches with -v
The following awk command may help. It prints all lines that are present in Input_file2 but not in Input_file1.
awk 'FNR==NR{a[$0];next} !($0 in a)' Input_file1 Input_file2
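Here is a sketch of the same awk idiom run on the question's data. The temporary directory and the names "seen" and "added" are assumptions for illustration. While awk reads the first file, FNR equals NR, so each line is stored as an array key; for the second file only lines that are not stored keys are printed, in their original order.

```shell
tmp=$(mktemp -d)
printf '%s\n' /root/a /root/b /root/c /root/d > "$tmp/file1"
printf '%s\n' /root/new /root/new_new /root/a /root/b /root/c /root/d > "$tmp/file2"

# First pass (FNR==NR): remember every line of file1.
# Second pass: print the lines of file2 that were not seen in file1.
added=$(awk 'FNR==NR { seen[$0]; next } !($0 in seen)' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$added"
rm -rf "$tmp"
```

One caveat: if the first file is empty, FNR==NR is also true for the second file, so nothing is printed.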
Try using a combination of diff and sed.
The raw diff output is:
$ diff file1 file2
0a1,2
> /root/new
> /root/new_new
Add sed to strip out everything but the lines beginning with ">":
$ diff file1 file2 | sed -n -e 's/^> //p'
/root/new
/root/new_new
This preserves the order. Note that it also assumes you are only adding lines to the second file.

Finding common lines in two files that have some blank lines

I have two almost identical files containing code, with the same number of lines.
I'm trying to create a file of the lines common to both files, with blank lines where the lines differ.
I tried comm, and it works well, but it doesn't give me blank lines for the differing lines; it simply drops them, so the resulting file has fewer lines.
This is what I tried:
comm -1 -2 file1 file2
comm needs sorted files. So, you could use process substitution like this:
comm -12 <(sort file1) <(sort file2)
If you want to skip lines that consist only of spaces, then:
comm -12 <(grep -Ev '^[ ]+$' file1 | sort) <(grep -Ev '^[ ]+$' file2 | sort)
To skip lines that consist only of spaces or tabs:
comm -12 <(grep -Ev $'^[ \t]+$' file1 | sort) <(grep -Ev $'^[ \t]+$' file2 | sort)
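The comm commands above drop the differing lines instead of leaving blanks. If the goal is the one stated in the question (same line count, blank line wherever the two files differ), a line-by-line comparison in awk is one possible sketch. This is an alternative technique, not part of the answer above; the sample lines, the temporary directory and the names "first" and "common" are assumptions for illustration, and it assumes both files have the same number of lines.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'int a = 1;' 'int b = 2;' 'return a;' > "$tmp/file1"
printf '%s\n' 'int a = 1;' 'int b = 3;' 'return a;' > "$tmp/file2"

# First pass: store each line of file1 under its line number.
# Second pass: print the line if it is identical at the same position,
# otherwise print an empty line. Appending "" forces a string comparison.
common=$(awk 'NR==FNR { first[FNR] = $0; next }
              { print ((($0 "") == (first[FNR] "")) ? $0 : "") }' \
         "$tmp/file1" "$tmp/file2")
printf '%s\n' "$common"
rm -rf "$tmp"
```

No sorting is involved, so the original order and line count are preserved.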

Difference between two files without sorting

I have the files file1 and file2, where file2 is a subset of file1. That means, if I iterate over file1, there are some lines that are in file2, and some that aren't, but there is no line in file2 that is not in file1. There may be several lines with the same content in a file. Now I want to get the difference between them, that is, all lines of file1 that aren't in file2.
According to this well received answer
diff(1) isn't the answer, comm(1) is.
(For whatever reason)
But as I understand, for comm the files need to be sorted first. The problem: Both files are ordered (not sorted!), and this order needs to be kept. So what I really want is to iterate over file1, and check for every line, if it is also in file2. If not, write it to file3. If the same content occurs more than once, it should be kept more than once!
Is there any way to do this with the command line?
Try this with GNU grep:
grep -vFf file2 file1 > file3
Update:
grep -vxFf file2 file1 > file3
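A quick sketch of how the updated command treats order and duplicates, using made-up lines (the temporary directory and the variable name are assumptions for illustration):

```shell
tmp=$(mktemp -d)
printf '%s\n' pear apple pear fig apple > "$tmp/file1"
printf '%s\n' apple > "$tmp/file2"

# -x: whole-line matches only, -F: fixed strings, -v: invert the match.
# grep reads file1 top to bottom, so the surviving lines keep their order
# and repeated lines are printed as often as they occur.
remaining=$(grep -vxFf "$tmp/file2" "$tmp/file1")
printf '%s\n' "$remaining"
rm -rf "$tmp"
```

Every occurrence of a line that appears in file2 is removed from the output, however many times it occurs in file1.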
I assume you want to avoid sorting because it would require temporary files. With process substitution no temporary files are needed (the output is in sorted order, though, not the original order):
diff <(sort file1) <(sort file2)
# or
comm <(sort file1) <(sort file2)
Edit: Using https://stackoverflow.com/a/4544925/3220113 I found another alternative (for text files with short lines):
diff -a --suppress-common-lines -y file2 file1 | sed 's/\s*>.//'

Join two files based on a matching column, but preserve non-matching lines?

I have two text files that I would like to join based upon matches in the second column. File1 is larger than file2, containing matches for all entries in file2, plus many non-matches.
Join works, and the output file joined the matching entries as expected. However, I would like to preserve the non-matching entries in file1, such that they still appear in the output file.
Both files are tab-delimited. File1 has 13 columns, and file2 has 4. I am matching between column 2 of file1 and column 1 of file2.
How can I do this such that the non-matching lines from file1 still appear in the output file (file3)?
I have been using the following code (bash):
join -t $'\t' -1 2 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,1.13,2.2,2.3,2.4 <(sort -k2,2 file1) <(sort -k1,1 file2) > file3
Thanks in advance for your help, I really appreciate it! Apologies for a novice question, I am a biologist who is attempting to improve his computational proficiency, and the learning curve has been steep.
Regards,
Anthony
From man join:
-a file_number
In addition to the default output, produce a line for each unpairable line in file file_number.
This seems to be exactly what you want.
So since you want the lines from file1 to also appear in the output, add -a 1 in your command.
join -a 1 -t $'\t' -1 2 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,1.13,2.2,2.3,2.4 <(sort -k2,2 file1) <(sort -k1,1 file2) > file3
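A scaled-down sketch of the same idea, with three columns in file1 and two in file2 instead of 13 and 4. The data, the temporary directory and the variable names are made up for illustration. Two things differ from the command above and are my assumptions about what is useful: -e NA fills the fields that are missing for unpaired lines, and sort is given -t so that it splits fields on tabs just like join does.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')

# file1: id, key, value. file2: key, extra. Key k3 has no partner in file2.
printf 'r1\tk2\tfoo\nr2\tk1\tbar\nr3\tk3\tbaz\n' > "$tmp/file1"
printf 'k1\tX\nk2\tY\n' > "$tmp/file2"

# join needs both inputs sorted on their join fields.
LC_ALL=C sort -t "$tab" -k2,2 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# -a 1 also prints unpairable lines from file1,
# -e NA replaces the fields that are missing for those lines.
joined=$(LC_ALL=C join -a 1 -e NA -t "$tab" -1 2 -2 1 -o 1.1,1.2,1.3,2.2 \
         "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

The line with key k3 appears in the output with NA in place of the column that would have come from file2.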
