I have 2 large files (F1 and F2) with 200k+ rows each, and currently I am comparing each record in F1 against F2 to look for records unique only to F1, then comparing F2 to F1 to look for records unique only to F2.
I am doing this by reading in each line of the file using a 'while' loop then using 'grep' on the line against the file to see if a match is found.
This process takes about 3 hours to complete if there are no mismatches, and can take 6+ hours if there are a large number of mismatches (files barely matching, so 200k+ mismatches).
Is there any way I can rewrite this script to accomplish the same function but in a faster time?
I have tried rewriting the script to use sed to delete a line from F2 whenever a match is found, so that when comparing F2 to F1 only the values unique to F2 remain; however, calling sed for every iteration of F1's lines does not improve performance much.
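(For reference, a reconstruction of the kind of loop described; this is not the poster's actual script:)
while IFS= read -r line; do
    grep -qFx "$line" F2 || echo "$line"
done < F1
Each pass of grep rescans all of F2, so the total work is quadratic in the file sizes, which matches the multi-hour runtimes described.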
Example:
F1 contains:
A
B
E
F
F2 contains:
A
Y
B
Z
The output I'm expecting when comparing F1 to F2 is:
E
F
And then comparing F2 to F1:
Y
Z
You want comm:
$ cat f1
A
B
E
F
$ cat f2
A
Y
B
Z
$ comm <(sort f1) <(sort f2)
		A
		B
E
F
	Y
	Z
Column 1 of comm's output contains the lines unique to f1. Column 2 contains the lines unique to f2. Column 3 contains the lines found in both f1 and f2.
The options -1, -2, and -3 suppress the corresponding column. For example, if you want only the lines unique to f1, you can filter out the other columns:
$ comm -23 <(sort f1) <(sort f2)
E
F
Note that comm requires sorted input, which I supply in these examples using the bash process substitution syntax (<()). If you're not using bash, pre-sort into temporary files.
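For instance, a plain POSIX sh equivalent with temporary files would be something like:
sort f1 > f1.sorted
sort f2 > f2.sorted
comm -23 f1.sorted f2.sorted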
Have you tried Linux's diff?
Some useful options are -i, -w, -u, and -y.
Though, in that case, the files would have to be in the same order (you could sort them first).
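For example, to eyeball just the differing lines side by side after sorting (a sketch using bash process substitution, as in the comm answer above):
diff -y --suppress-common-lines <(sort f1) <(sort f2)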
If the sort order of the output is not important and you are only interested in the sorted set of lines that appear exactly once across both files, you can do:
sort F1 F2 | uniq -u
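With the example files from the question, this gives both unique sets in one sorted list:
$ sort F1 F2 | uniq -u
E
F
Y
Z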
Grep is going to use compiled code to do the entirety of what you want if you simply treat one or the other of your files as a pattern file.
grep -vFx -f F1.txt F2.txt:
Y
Z
grep -vFx -f F2.txt F1.txt:
E
F
Explanation:
-v - print lines not matching those in the "pattern file" specified with -f
-F - interpret patterns as fixed strings and not regexes (gleaned from another question, which I was reading to see if there was a practical limit to this; I am curious whether it will work with large line counts in both files)
-x - match entire lines
Sorting is not required: you get the resulting unique lines in the order they appear. This method takes longer because it cannot assume the inputs are sorted, but if you are looking at multi-line records, sorting really trashes the context. The performance is fine when the files are similar, because grep -v skips a line as soon as it matches any line in the "pattern" file. If the files are highly dissimilar, the performance is very slow, because every line is checked against every pattern before it is finally printed.
Related
I want to compute the difference between two directories - but not in the sense of diff, i.e. not of file and subdirectory contents, but rather just in terms of the list of items. Thus if the directories have the following files:
dir1        dir2
f1 f2 f4    f2 f3
I want to get f1 and f4.
You can use comm to compare two listings:
comm -23 <(ls dir1) <(ls dir2)
process substitution with <(cmd) passes the output of cmd as if it were a file name. It's similar to $(cmd) but instead of capturing the output as a string it generates a dynamic file name (usually /dev/fd/###).
comm prints three columns of information: lines unique to file 1, lines unique to file 2, and lines that appear in both. -23 hides the second and third columns and shows only lines unique to file 1.
You could extend this to do a recursive diff using find. If you do that you'll need to suppress the leading directories from the output, which can be done with a couple of strategic cds.
comm -23 <(cd dir1; find) <(cd dir2; find)
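One caveat: comm expects sorted input, and find does not guarantee any particular output order, so it is safer to sort explicitly (the sort is my addition):
comm -23 <(cd dir1; find | sort) <(cd dir2; find | sort)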
Edit: A naive diff-based solution + improvement due to @JohnKugelman!:
diff --suppress-common-lines <(\ls dir1) <(\ls dir2) | egrep "^<" | cut -c3-
Instead of working on the directories themselves, we work on their listings as files; then we use regular diff, taking only lines appearing in the first file, which diff marks with <, and finally removing that marking.
Naturally one could beautify the above by checking for errors, verifying we've gotten two arguments, printing usage information otherwise etc.
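A minimal sketch of such a beautified version, using the comm variant from above (the function name dirdiff and the usage message are my own invention; assumes bash for the process substitution):
dirdiff() {
    if [ "$#" -ne 2 ] || [ ! -d "$1" ] || [ ! -d "$2" ]; then
        echo "usage: dirdiff DIR1 DIR2" >&2
        return 1
    fi
    comm -23 <(ls "$1") <(ls "$2")
}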
I have a large file A (consisting of emails), one line for each mail. I also have another file B that contains another set of mails.
Which command would I use to remove all the addresses that appear in file B from the file A.
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
Now I know this question has probably been asked before, but the one command I found online gave me a bad-delimiter error.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not the shell expert.
If the files are sorted (they are in your example):
comm -23 file1 file2
-23 suppresses the lines that are in both files, or only in file 2. If the files are not sorted, pipe them through sort first...
See the comm man page for details.
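In bash, the sorting can be done inline with process substitution, for example:
comm -23 <(sort file1) <(sort file2)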
grep -Fvxf <lines-to-remove> <all-lines>
works on non-sorted files (unlike comm)
maintains the order
is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print non-matching
-f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
Here's a quick bash automation for in-line operation:
remove-lines() (
remove_lines="$1"
all_lines="$2"
tmp_file="$(mktemp)"
grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
mv "$tmp_file" "$all_lines"
)
usage:
remove-lines lines-to-remove remove-from-this-file
See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
The NR==FNR{a[$0];next} idiom stores the first file in an associative array, as keys for a later "contains" test.
NR==FNR checks whether we're scanning the first file, where the global line counter (NR) equals the current file's line counter (FNR).
a[$0] adds the current line to the associative array as a key; note that this behaves like a set, where there won't be any duplicate keys.
!($0 in a): we're now in the next file(s). in is a containment test; here it checks whether the current line is in the set we populated from the first file, and ! negates the condition. What is missing here is the action, which by default is {print} and is usually not written explicitly.
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
With a slight change it can clean multiple lists and create cleaned versions:
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > FILENAME".clean"}' bad file1 file2 file3 ...
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
You can do this even if your files are not sorted (diff matches lines positionally, so results are most reliable when common lines appear in the same relative order):
diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > file-a.tmp && mv file-a.tmp file-a
(The temporary file matters: redirecting straight back to file-a would truncate it before diff gets to read it.)
--new-line-format is for lines that are in file b but not in a
--old-.. is for lines that are in file a but not in b
--unchanged-.. is for lines that are in both.
%L makes it so the line is printed exactly.
See man diff for more details.
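For example, with the files from the earlier question (fileA holding A, B, C and fileB holding B, D, E), printing to stdout first to check the result:
$ diff fileA fileB --new-line-format="" --old-line-format="%L" --unchanged-line-format=""
A
C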
This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, but speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.
This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.
# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.
awk -v N="$N" -v lookup="$LOOKUP" '
  # read the lookup file once; (getline < f) > 0 stops on EOF *or* error
  BEGIN { while ( (getline < lookup) > 0 ) { dictionary[$0]=$0 } }
  !($N in dictionary) { print }'
(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
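For instance, a sketch of the white-space-trimming variant mentioned above (the trim helper is my own addition):
awk -v N="$N" -v lookup="$LOOKUP" '
  function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
  BEGIN { while ( (getline line < lookup) > 0 ) dictionary[trim(line)] = 1 }
  !(trim($N) in dictionary) { print }'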
You can use Python:
python -c '
lines_to_remove = set()
with open("file B", "r") as f:
for line in f.readlines():
lines_to_remove.add(line.strip())
with open("file A", "r") as f:
for line in [line.strip() for line in f.readlines()]:
if line not in lines_to_remove:
print(line)
'
You can use:
diff fileA fileB | grep "^<" | cut -c3- > fileA.tmp && mv fileA.tmp fileA
This will work for files that are not sorted as well. (Lines that diff marks with < are the ones present only in fileA; note the temporary file, because redirecting straight back onto fileA would truncate it before diff reads it.)
Just to add to the Python answer above, here is a faster solution (note that the set subtraction does not preserve the original line order or duplicate lines):
python -c '
lines_to_remove = None
with open("partial file") as f:
lines_to_remove = {line.rstrip() for line in f.readlines()}
remaining_lines = None
with open("full file") as f:
remaining_lines = {line.rstrip() for line in f.readlines()} - lines_to_remove
with open("output file", "w") as f:
for line in remaining_lines:
f.write(line + "\n")
'
Raising the power of set subtraction.
To get the file that remains after removing the lines which appear in another file:
comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt
Here is a one-liner that pipes the dump of a website through grep to remove unwanted navigation elements, using lynx! You can replace lynx with cat FileA and unwanted-elements.txt with FileB.
lynx -dump -accept_all_cookies -nolist -width 1000 https://stackoverflow.com/ | grep -Fxvf unwanted-elements.txt
To remove common lines between two files you can use grep, comm or join command.
grep works well only when the pattern file is small. Use -v along with -f:
grep -vf file2 file1
This displays lines from file1 that do not match any line in file2. Note that each line of file2 is treated as a regular expression that can match anywhere within a line; add -F and -x for literal whole-line matching.
comm is a utility command that works on lexically sorted files. It takes two files as input and produces three text columns as output: lines only in the first file; lines only in the second file; and lines in both files. You can suppress printing of any column by using the -1, -2 or -3 option accordingly.
comm -1 -3 file2 file1
This displays lines from file1 that do not match any line in file2.
Finally, there is join, a utility command that performs an equality join on the specified files. Like comm, it requires sorted input. Its -v option allows removing common lines between two files.
join -v1 -v2 file1 file2
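If the inputs aren't already sorted, in bash you can sort them inline with process substitution, for example:
join -v1 -v2 <(sort file1) <(sort file2)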
I need to search 2 dirs for pairs of files having identical titles (but not extensions!) and merge their titles within some new command.
First, how to print only the name of the files:
1)Typically I use the following command within the for loop to select the full name of the file which is looped
for file in ./files/* do;
title=$(base name "file")
print title
done
What should I change in the above script to print as the title only the name of the file, without its extension?
2) How is it possible to add some condition to check whether two files have the same name, performing double looping over them, e.g.
# counter for the detected equal files
i=0
for file in ./files1/* do;
title=$(base name "file") #change it to avoid extension within the title
for file2 in ./files2/* do;
title2=$(basename "file2") #change it to avoid extension within the title2
if title1==title2
echo $title1 and $title2 'has been found!'
i=i+1
done
Thanks for help!
Gleb
You could start by fixing the syntax errors in your script, such as do followed by ; when it should be the other way round.
Then, the shell has operators to remove sub-strings from the start (##, #) and end (%%, %) in a variable. Here's how to list files without extensions, i.e. removing the shortest part that matches the glob .* from the right:
for file in *; do
printf '%s\n' "${file%.*}"
done
Read your shell manual to find out about these operators. It will pay for itself many times over in your programming career :-)
Do not believe anyone telling you to use ugly and expensive piping and forking with basename, cut, awk and such. That's all overkill.
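Applied to your double loop, a corrected sketch using only these operators (variable names kept close to your script; the echo message and final count line are illustrative):
i=0
for file1 in ./files1/*; do
    title1=${file1##*/}      # strip the directory part
    title1=${title1%.*}      # strip the extension
    for file2 in ./files2/*; do
        title2=${file2##*/}
        title2=${title2%.*}
        if [ "$title1" = "$title2" ]; then
            echo "$title1 and $title2 have been found!"
            i=$((i+1))
        fi
    done
done
echo "$i matching pairs"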
On the other hand, maybe there's a better way to achieve your goal. Suppose you have files like this:
$ find files1 files2
files1
files1/file1.x
files1/file3.z
files1/file2.y
files2
files2/file1.x
files2/file4.b
files2/file3.a
Now create two lists of file names, extensions stripped:
ls files1 | sed -e 's/\.[^.]*$//' | sort > f1
ls files2 | sed -e 's/\.[^.]*$//' | sort > f2
The comm utility tests for lines common in two files:
$ comm f1 f2
		file1
file2
		file3
	file4
The first column lists lines only in f1, the second only in f2, and the third those common to both. Using the -1 -2 -3 options you can suppress unwanted columns. If you need to count only the common files (third column), run
$ comm -1 -2 f1 f2 | wc -l
2
Say I have a file a.txt containing a word, followed by a number, followed by a newline, on each line:
and 3
now 2
for 2
something 7
completely 8
different 6
I need to select the nth char from every word (specified by the number next to the word)
cat a.txt | cut -d' ' -f2 | xargs -i -n1 cut a.txt -c {}
I tried this command, which selects the numbers and uses xargs to put them into the -c option of cut, but each cut invocation gets executed on all of a.txt instead of only the corresponding line (which is what I had expected to happen). How can I resolve this problem?
EDIT: Since it seems to be unclear: I want to select a character from a word. The character I need to select is at the position given by the number next to the word. For example, and 3 will give me d. I want to do this for the entire file, which will then form a word :)
A pure shell solution:
$ while read word num; do echo ${word:$((num-1)):1}; done < a.txt
d
o
o
i
e
r
This is using a classic while; do ... ; done shell loop and the read builtin. The general format is
while read variable1 variable2 ... variableN; do something; done < input_file
This will iterate over each line of your input file, splitting it into as many variables as you've given. By default, it will split at whitespace, but you can change that by changing the $IFS variable. If you give a single variable, the entire line will be saved; if you give more, it will populate as many variables as you give it and save the rest in the last one.
In this particular loop, we're reading the word into $word and the number into $num. Once we have the word, we can use the shell's string manipulation capabilities to extract a substring. The general format is
${string:start:length}
So, ${string:0:2} would extract the first two characters from the variable $string. Here, the variable is $word, the start is the number minus one (this starts counting at 0) and the length is one. The result is the single letter at the position given by the number.
I would suggest that you used awk:
awk '{print substr($1,$2,1)}' file
substr takes a substring of the first field starting from the number contained in the second field and of length 1.
Testing it out (using the original input from your question):
$ cat file
and 3
now 2
for 2
something 7
completely 8
different 6
$ awk '{print substr($1,$2,1)}' file
d
o
o
i
e
r
Suppose I have two lists of numbers in files f1, f2, each number one per line. I want to see how many numbers in the first list are not in the second and vice versa. Currently I am using grep -f f2 -v f1 and then repeating this using a shell script. This is pretty slow (quadratic time hurts). Is there a nicer way of doing this?
I like 'comm' for this sort of thing.
(files need to be sorted.)
$ cat f1
1
2
3
$ cat f2
1
4
5
$ comm f1 f2
		1
2
3
	4
	5
$ comm -12 f1 f2
1
$ comm -23 f1 f2
2
3
$ comm -13 f1 f2
4
5
$
Couldn't you just put each number on a single line and then diff(1) them? You might need to sort the lists beforehand for that to work properly, though.
In the special case where one file is a subset of the other, the following:
cat f1 f2 | sort | uniq -u
would list the lines only in the larger file. And of course piping to wc -l will show the count.
However, that isn't exactly what you described.
This one-liner serves my particular needs often, but I'd love to see a more general solution.
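For the general case, the comm approach from above gives both directions in roughly linear time once the inputs are sorted; for instance, to get just the two counts (a sketch, assuming bash):
comm -23 <(sort f1) <(sort f2) | wc -l    # count of numbers only in f1
comm -13 <(sort f1) <(sort f2) | wc -l    # count of numbers only in f2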