Deleting lines from one file which are in another file - bash

I have a file f1:
line1
line2
line3
line4
..
..
I want to delete all the lines which are in another file f2:
line2
line8
..
..
I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?

grep -v -x -f f2 f1 should do the trick.
Explanation:
-v to select non-matching lines
-x to match whole lines only
-f f2 to get patterns from f2
One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want remove the lines in a "what you see if what you get" manner rather than treating the lines in f2 as regex patterns).

Try comm instead (assuming f1 and f2 are "already sorted")
comm -2 -3 f1 f2

For exclude files that aren't too huge, you can use AWK's associative arrays.
awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt
The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.
The algorithmic complexity will probably be O(n) (exclude-these.txt size) + O(n) (from-this.txt size)

Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):
awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt
Accessing r[$0] creates the entry for that line, no need to set a value.
Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.

if you have Ruby (1.9+)
#!/usr/bin/env ruby
b=File.read("file2").split
open("file1").each do |x|
x.chomp!
puts x if !b.include?(x)
end
Which has O(N^2) complexity. If you want to care about performance, here's another version
b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}
which uses a hash to effect the subtraction, so is complexity O(n) (size of a) + O(n) (size of b)
here's a little benchmark, courtesy of user576875, but with 100K lines, of the above:
$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test
real 0m0.639s
user 0m0.554s
sys 0m0.021s
$time sort file1 file2|uniq -u > sort.test
real 0m2.311s
user 0m1.959s
sys 0m0.040s
$ diff <(sort -n ruby.test) <(sort -n sort.test)
$
diff was used to show there are no differences between the 2 files generated.

Some timing comparisons between various other answers:
$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null
real 0m0.019s
user 0m0.023s
sys 0m0.012s
$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null
real 0m0.026s
user 0m0.018s
sys 0m0.007s
$ time grep -xvf f2 f1 > /dev/null
real 0m43.197s
user 0m43.155s
sys 0m0.040s
sort f1 f2 | uniq -u isn't even a symmetrical difference, because it removes lines that appear multiple times in either file.
comm can also be used with stdin and here strings:
echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a

Seems to be a job suitable for the SQLite shell:
create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify “ .separator ××any_improbable_string×× ”
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file2 where line not in (select line from file1);
.q

Did you try this with sed?
sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh
sed -i 's#$#%%g'"'"' f1#g' f2.sh
sed -i '1i#!/bin/bash' f2.sh
sh f2.sh

Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.
Obviously won't work for huge files but it did the trick for me. A few notes:
I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data

A Python way of filtering one list using another list.
Load files:
>>> f1 = open('f1').readlines()
>>> f2 = open('f2.txt').readlines()
Remove '\n' string at the end of each line:
>>> f1 = [i.replace('\n', '') for i in f1]
>>> f2 = [i.replace('\n', '') for i in f2]
Print only the f1 lines that are also in the f2 file:
>>> [a for a in f1 if all(b not in a for b in f2)]

$ cat values.txt
apple
banana
car
taxi
$ cat source.txt
fruits
mango
king
queen
number
23
43
sentence is long
so what
...
...
I made a small shell scrip to "weed" out the values in source file which are present in values.txt file.
$cat weed_out.sh
from=$1
cp -p $from $from.final
for x in `cat values.txt`;
do
grep -v $x $from.final > $from.final.tmp
mv $from.final.tmp $from.final
done
executing...
$ ./weed_out source.txt
and you get a nicely cleaned up file....

Related

cat multiple files into one using same amount of rows as file B from A B C

This is a strange question, I have been looking around and I wasn't able to find anything to match with what I wish to do.
What I'm trying to do is;
File A, File B, File C
5 Lines, 3 Lines, 2 Lines.
Join all files in one file matching the same amount of the file B
The output should be
File A, File B, File C
3 Lines, 3 Lines, 3 Lines.
So in file A I have to remove two lines, in File C i have to duplicate 1 line so I can match the same lines as file B.
I was thinking to do a count to see how many lines each file has first
count1=`wc -l FileA| awk '{print $1}'`
count2=`wc -l FileB| awk '{print $1}'`
count3=`wc -l FileC| awk '{print $1}'`
Then to do a gt then file B remove lines, else add lines.
But I have got lost as I'm not sure how to continue with this, I never seen anyone trying to do this.
Can anyone point me to an idea?
the output should be as per attached picture below;
Output
thanks.
Could you please try following. I have made # as a separator you could change it as per your need too.
paste -d'#' file1 file2 file3 |
awk -v file2_lines="$(wc -l < file2)" '
BEGIN{
FS=OFS="#"
}
FNR<=file2_lines{
$1=$1?$1:prev_first
$3=$3?$3:prev_third
print
prev_first=$1
prev_third=$3
}'
Example of running above code:
Lets say following are Input_file(s):
cat file1
File1_line1
File1_line2
File1_line3
File1_line4
File1_line5
cat file2
File2_line1
File2_line2
File2_line3
cat file3
File3_line1
File3_line2
When I run above code in form of script following will be the output:
./script.ksh
File1_line1#File2_line1#File3_line1
File1_line2#File2_line2#File3_line2
File1_line3#File2_line3#File3_line2
you can get the first n lines of a files with the head command resp sed.
you can generate new lines with echo.
i'm going to use sed, as it allows in-place editing of a file (so you don't have to deal with temporary files):
#!/bin/bash
fix_numlines() {
local filename=$1
local wantlines=$2
local havelines=$(grep -c . "${filename}")
head -${wantlines} "${filename}"
if [ $havelines -lt $wantlines ]; then
for i in $(seq $((wantlines-havelines))); do echo; done
fi
}
lines=$(grep -c . fileB)
fix_numlines fileA ${lines}
fix_numlines fileB ${lines}
fix_numlines fileC ${lines}
if you want columnated output, it's even simpler:
paste fileA fileB fileC | head -$(grep -c . fileB)
Another for GNU awk that outputs in columns:
$ gawk -v seed=$RANDOM -v n=2 ' # n parameter is the file index number
BEGIN { # ... which defines the record count
srand(seed) # random record is printed when not enough records
}
{
a[ARGIND][c[ARGIND]=FNR]=$0 # hash all data to a first
}
END {
for(r=1;r<=c[n];r++) # loop records
for(f=1;f<=ARGIND;f++) # and fields for below output
printf "%s%s",((r in a[f])?a[f][r]:a[f][int(rand()*c[f])+1]),(f==ARGIND?ORS:OFS)
}' a b c # -v n=2 means the second file ie. b
Output:
a1 b1 c1
a2 b2 c2
a3 b3 c1
If you don't like the random pick of a record, replace int(rand()*c[f])+1] with c[f].
$ gawk ' # remember GNU awk only
NR==FNR { # count given files records
bnr=FNR
next
}
{
print # output records of a b c
if(FNR==bnr) # ... up to bnr records
nextfile # and skip to next file
}
ENDFILE { # if you get to the end of the file
if(bnr>FNR) # but bnr not big enough
for(i=FNR;i<bnr;i++) # loop some
print # and duplicate the last record of the file
}' b a b c # first the file to count then all the files to print
To make a file have n lines you can use the following function (usage: toLength n file). This omits lines at the end if the file is too long and repeats the last line if the file is too short.
toLength() {
{ head -n"$1" "$2"; yes "$(tail -n1 "$2")"; } | head -n"$1"
}
To set all files to the length of FileB and show them side by side use
n="$(wc -l < FileB)"
paste <(toLength "$n" FileA) FileB <(toLength "$n" FileC) | column -ts$'\t'
As observed by the user umläute the side-by-side output makes things even easier. However, they used empty lines to pad out short files. The following solution repeats the last line to make short files longer.
stretch() {
cat "$1"
yes "$(tail -n1 "$1")"
}
paste <(stretch FileA) FileB <(stretch FileC) | column -ts$'\t' |
head -n"$(wc -l < FileB)"
This is a clean way using awk where we read each file only a single time:
awk -v n=2 '
BEGIN{ while(1) {
for(i=1;i<ARGC;++i) {
if (b[i]=(getline tmp < ARGV[i])) a[i] = tmp
}
if (b[n]) for(i=1;i<ARGC;++i) print a[i] > ARGV[i]".new"
else {break}
}
}' f1 f2 f3 f4 f5 f6
This works in the following way:
the lead file is defined by the index n. Here we choose the lead file to be f2.
We do not process the files in the standard read record, fields sequentially, but we use the BEGIN block where we read the files in parallel.
We do an infinite loop while(1) where we will break out if the lead-file has no more input.
Per cycle, we read a new line of each file using getline. If the file i has a new line, store it in a[i], and set the outcome of getline into b[i]. If file i has reached its end, keep the last line in mind.
Check the outcome of the lead file with b[n]. If we still read a line, print all the lines to the files f1.new, f2.new, ..., otherwise, break out of the infinite loop.

How to use grep -c to count ocurrences of various strings in a file?

i have a bunch files with data from a company and i need to count, let's say, how many people from a certain cities there are. Initially i was doing it manually with
grep -c 'Chicago' file.csv
But now i have to look for a lot cities and it would be time consuming to do this manually every time. So i did some reaserch and found this:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
grep -c '$p' 'file.csv'
done
But it doenst work. It keeps giving me 0s as output and im not sure what is wrong. Anyways, basically what i need is for an output with every result (just the values) given by grep in a column so i can copy directly to a spreadsheet. Ex.:
132
407
523
Thanks in advance.
You should use sort + uniq for that:
$ awk '{print $<N>}' file.csv | sort | uniq -c
where N is the column number of cities (I assume it structured, as it's CSV file).
For example, which shell how often used on my system:
$ awk -F: '{print $7}' /etc/passwd | sort | uniq -c
1 /bin/bash
1 /bin/sync
1 /bin/zsh
1 /sbin/halt
41 /sbin/nologin
1 /sbin/shutdown
$
From the title, it sounds like you want to count the number of occurrences of the string rather than the number of lines on which the string appears, but since you accept the grep -c answer I'll assume you actually only care about the latter. Do not use grep and read the file multiple times. Count everything in one pass:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' input-file
Note that this will print a blank line instead of "0" for any string that does not appear, so you migt want to initialize. There are several ways to do that. I like:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' c=0 w=0 n=0 input-file

concatenating multiple files

I have multiple files, and in each file is the following:
>HM001
ATGCT...
>HM002
ATGTC...
>HM003
ATGCC...
That is, each file contains one gene sequence for species HM001 to HM050. I would like to concatenate all these files, so I have a single file that contains the genome for species HM001 to HM050:
>HM001
ATGCT...ATGAA...ATGTT
>HM002
ATGTC...ATGCT...ATGCT
>HM003
ATGCC...ATGC...ATGAT
The ellipses are not actually required in the final file. I suppose cat should be used, but I'm not sure how. Any ideas would be appreciated.
Data parsing and formatting will be alot easier with awk. Try this:
awk -v RS=">" 'FNR>1{a[$1]=a[$1]?a[$1] FS $2:$2}END{for(x in a) print RS x ORS a[x]}' f1 f2 f3
For files like:
==> f1 <==
>HM001
ATGCT...
>HM002
ATGTC...
>HM003
ATGCC...
==> f2 <==
>HM001
ATGDD...
>HM002
ATGDD...
>HM003
ATGDD...
==> f3 <==
>HM001
ATGEE...
>HM002
ATGEE...
>HM003
ATGEE...
awk -v RS=">" 'FNR>1{a[$1]=a[$1]?a[$1] FS $2:$2}END{for(x in a) print RS x ORS a[x]}' f1 f2 f3
>HM001
ATGCT... ATGDD... ATGEE...
>HM002
ATGTC... ATGDD... ATGEE...
>HM003
ATGCC... ATGDD... ATGEE...
Might I suggest converting your group of files into a CSV? It's almost
exactly what you're suggesting, and is easily incorporated into just
about any application for processing (e.g., Excel, R, python).
Up front, I'll assume that all species and gene sequences are simply
alpha-numeric, no spaces or quote-like characters. I'm also assuming
access to sed, sort, and uniq, which are all standard in *nix,
MacOSX, and easily accessible for windows via
msys or
cygwin, to name two.
First, generate an array of file names and species. I'm assuming the
files are named file1, file2, etc. Just adjust the first line
accordingly; it's just a glob, not an executed command.
FILES=($(file*))
SPECIES=($(sed -ne 's/^>//gp' file* | sort | uniq))
This gives us one line per species, sorted, with no repeats. This
ensures that our columns are independent and the set is complete.
Next, create a CSV header row with named columns, dumping it into a
CSV file named csvfile:
echo -n "\"Species\"" > csvfile
for fn in ${FILES[#]} ; do echo -n ",\"${fn}\"" ; done >> csvfile
echo >> csvfile
Now iterate through each gene sequence and extract it from all files:
for sp in ${SPECIES[#]} ; do
echo -n "\"${sp}\""
for fn in ${FILES[#]}; do
ANS=$(sed -ne '/>'${sp}'/,/^/ { /^[^>]/p }' ${fn})
echo -n ",\"${ANS}\""
done
echo
done >> csvfile
This works but is inefficient for larger data sets (i.e., large
numbers of files and/or species). Better implementations (e.g, python,
ruby, perl, even R) would read each file once, forming an
internally-maintained matrix, dictionary, or associative array, and
write out the CSV in one chunk.
What about appending them using echo - along these lines?:
find . -type f -exec bash -c 'echo "append this" >> "$0"' {} \;
Source: https://stackoverflow.com/a/15604608/1662973
I would do it using "type", but that is MSDOS. The above should work for you.
The simplest way I can think of is to use cat. For example (assuming you're on a *nix-type system):
cat file1 file2 file3 > outfile
Another awk implementation:
awk '
{key=$0; getline; value[key] = value[key] $0}
END {for (key in value) {print key; print value[key]}}
' file ...
Now, this will probably not output the keys in sorted order: array keys are inherently unsorted. To ensure sorted output, use gawk and
awk '
{key=$0; getline; val[key] = val[key] $0}
END {
n = asorti(val, keys)
for (i=1; i<=n; i++) {print keys[i]; print val[keys[i]]}
}
' file ...

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
if the data of file 1 is present in file 2 it should return 1 or else 0, in a tab seprated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
the above code is not giving me the output which I am looking for.
Kindly have a look and suggest correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
The following code should do it.
Take a close look to the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file1 file2 | sed $'s/$/\t1/'
grep -vxF -f file1 file2 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read; do
if [[ $REPLY = $'\t'* ]] ; then
printf "%s\t0\n" "${REPLY#?}"
else
printf "%s\t1\n" "${REPLY}"
fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have python installed.
If you're familiar with Python and are interested in the solution, you only need a bit of formatting.
#/bin/python
f1 = open('file1').readlines()
f2 = open('file2').readlines()
f1_in_f2 = [int(x in f2) for x in f1]
for n,c in zip(f1, f1_in_f2):
print n,c

Shell command to find lines common in two files

I'm sure I once found a shell command which could print the common lines from two or more files. What is its name?
It was much simpler than diff.
The command you are seeking is comm. eg:-
comm -12 1.sorted.txt 2.sorted.txt
Here:
-1 : suppress column 1 (lines unique to 1.sorted.txt)
-2 : suppress column 2 (lines unique to 2.sorted.txt)
To easily apply the comm command to unsorted files, use Bash's process substitution:
$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321
So the files abc and def have one line in common, the one with "132".
Using comm on unsorted files:
$ comm abc def
123
132
567
132
777
321
$ comm -12 abc def # No output! The common line is not found
$
The last line produced no output, the common line was not discovered.
Now use comm on sorted files, sorting the files with process substitution:
$ comm <( sort abc ) <( sort def )
123
132
321
567
777
$ comm -12 <( sort abc ) <( sort def )
132
Now we got the 132 line!
To complement the Perl one-liner, here's its awk equivalent:
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This will read all lines from file1 into the array arr[], and then check for each line in file2 if it already exists within the array (i.e. file1). The lines that are found will be printed in the order in which they appear in file2.
Note that the comparison in arr uses the entire line from file2 as index to the array, so it will only report exact matches on entire lines.
Maybe you mean comm ?
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one
contains lines unique to FILE1, column
two contains lines unique to
FILE2, and column three contains lines common to both files.
The secret in finding these information are the info pages. For GNU programs, they are much more detailed than their man-pages. Try info coreutils and it will list you all the small useful utils.
While
fgrep -v -f 1.txt 2.txt > 3.txt
gives you the differences of two files (what is in 2.txt and not in 1.txt), you could easily do a
fgrep -f 1.txt 2.txt > 3.txt
to collect all common lines, which should provide an easy solution to your problem. If you have sorted files, you should take comm nonetheless. Regards!
Note: You can use grep -F instead of fgrep.
If the two files are not sorted yet, you can use:
comm -12 <(sort a.txt) <(sort b.txt)
and it will work, avoiding the error message comm: file 2 is not in sorted order
when doing comm -12 a.txt b.txt.
perl -ne 'print if ($seen{$_} .= #ARGV) =~ /10$/' file1 file2
awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2
On limited version of Linux (like a QNAP (NAS) I was working on):
comm did not exist
grep -f file1 file2 can cause some problems as said by #ChristopherSchultz and using grep -F -f file1 file2 was really slow (more than 5 minutes - not finished it - over 2-3 seconds with the method below on files over 20 MB)
So here is what I did:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | grep "<" | sed 's/^< *//' > files.diff
diff file1.sorted files.diff | grep "<" | sed 's/^< *//' > files.same.sorted
If files.same.sorted shall be in the same order as the original ones, then add this line for same order than file1:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same
Or, for the same order than file2:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
For how to do this for multiple files, see the linked answer to Finding matching lines across many files.
Combining these two answers (answer 1 and answer 2), I think you can get the result you are needing without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
for file2 in *
do
if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ] ; then
echo "Comparing: $file1 $file2 ..." >> $ans
perl -ne 'print if ($seen{$_} .= #ARGV) =~ /10$/' $file1 $file2 >> $ans
fi
done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files present in the current working directory and will make an all-vs-all comparison leaving in the "matching_lines" file the result.
Things to be improved:
Skip directories
Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
Maybe add the line number next to the matching string
Not exactly what you were asking, but something I think still may be useful to cover a slightly different scenario
If you just want to quickly have certainty of whether there is any repeated line between a bunch of files, you can use this quick solution:
cat a_bunch_of_files* | sort | uniq | wc
If the number of lines you get is less than the one you get from
cat a_bunch_of_files* | wc
then there is some repeated line.
rm file3.txt
cat file1.out | while read line1
do
cat file2.out | while read line2
do
if [[ $line1 == $line2 ]]; then
echo $line1 >>file3.out
fi
done
done
This should do it.

Resources