Comparing two lists with a shell script - shell

Suppose I have two lists of numbers in files f1, f2, each number one per line. I want to see how many numbers in the first list are not in the second and vice versa. Currently I am using grep -f f2 -v f1 and then repeating this using a shell script. This is pretty slow (quadratic time hurts). Is there a nicer way of doing this?
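For reference, the approach described above amounts to roughly the following two full passes, where every line of one file is checked against every line of the other (hence the quadratic behaviour):
grep -v -f f2 f1    # numbers in f1 that are not in f2
grep -v -f f1 f2    # numbers in f2 that are not in f1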

I like 'comm' for this sort of thing.
(The files need to be sorted.)
$ cat f1
1
2
3
$ cat f2
1
4
5
$ comm f1 f2
                1
2
3
        4
        5
$ comm -12 f1 f2
1
$ comm -23 f1 f2
2
3
$ comm -13 f1 f2
4
5
$
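Since the question asks how many numbers differ, the filtered comm output can simply be piped to wc -l:
$ comm -23 f1 f2 | wc -l
2
$ comm -13 f1 f2 | wc -l
2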

Couldn't you just put each number on its own line and then diff(1) them? You might need to sort the lists beforehand, though, for that to work properly.
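For example, on the sorted f1 and f2 shown above, a plain diff run would look something like this (the exact output format can differ slightly between diff implementations):
$ diff f1 f2
2,3c2,3
< 2
< 3
---
> 4
> 5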

In the special case where one file is a subset of the other, the following:
cat f1 f2 | sort | uniq -u
would list the lines only in the larger file. And of course piping to wc -l will show the count.
However, that isn't exactly what you described.
This one-liner serves my particular needs often, but I'd love to see a more general solution.
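For instance, with a hypothetical pair of files where small is a subset of big:
$ cat small
1
2
3
$ cat big
1
2
3
4
5
$ cat small big | sort | uniq -u
4
5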

Related

How do I get the list of all items in dir1 which don't exist in dir2?

I want to compute the difference between two directories - but not in the sense of diff, i.e. not of file and subdirectory contents, but rather just in terms of the list of items. Thus if the directories have the following files:
dir1:  f1 f2 f4
dir2:  f2 f3
I want to get f1 and f4.
You can use comm to compare two listings:
comm -23 <(ls dir1) <(ls dir2)
process substitution with <(cmd) passes the output of cmd as if it were a file name. It's similar to $(cmd) but instead of capturing the output as a string it generates a dynamic file name (usually /dev/fd/###).
comm prints three columns of information: lines unique to file 1, lines unique to file 2, and lines that appear in both. -23 hides the second and third columns and shows only lines unique to file 1.
You could extend this to do a recursive diff using find. If you do that you'll need to suppress the leading directories from the output, which can be done with a couple of strategic cds.
comm -23 <(cd dir1; find) <(cd dir2; find)
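One caveat: comm expects its inputs to be sorted, and find's output order is not guaranteed, so a slightly more careful version of the recursive variant might be:
comm -23 <(cd dir1 && find . | sort) <(cd dir2 && find . | sort)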
Edit: a naive diff-based solution, plus an improvement due to @JohnKugelman:
diff --suppress-common-lines <(\ls dir1) <(\ls dir2) | egrep "^<" | cut -c3-
Instead of working on the directories directly, we work on their listings; then we use regular diff, take only the lines appearing in the first file (which diff marks with <), and finally strip that marker.
Naturally one could beautify the above by checking for errors, verifying we've gotten two arguments, printing usage information otherwise etc.
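A minimal wrapper along those lines might look like this (a sketch only; the script name and messages are made up for illustration):
#!/bin/bash
# lsdiff.sh: list entries that appear in the first directory but not the second
if [ "$#" -ne 2 ] || [ ! -d "$1" ] || [ ! -d "$2" ]; then
    echo "usage: $0 dir1 dir2" >&2
    exit 1
fi
comm -23 <(ls "$1") <(ls "$2")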

How can I get the output of 2 files with no duplicate lines [within either file]?

Given two files (either of which may contain duplicate lines) in the following format:
file1 (file that contains only numbers) for example:
10
40
20
10
10
file2 (file that contains only numbers) for example:
30
40
10
30
0
How can I print the contents of the two files so that, within each file, the duplicates are removed?
For example, given the two files above, the output needs to be:
10
40
20
30
40
10
0
Note: the output can still contain duplicates (at most, a number appears twice, once from each file), but from each individual file we take the content without duplicates!
How can I do it with sort, uniq, and cat using only one command?
Namely, something like cat file1 file2 | sort | uniq (of course this command is not right; it doesn't solve the problem, it's only to explain what I mean by "using only one command").
I will be happy to hear your ideas on how to do it :)
If I understood the question correctly, this awk should do it while preserving the order:
awk 'FNR==1{delete a}!a[$0]++' file1 file2
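Spelled out, that one-liner does roughly this (the same logic, just expanded with comments):
awk '
    FNR == 1 { delete a }   # starting a new input file: forget what the previous file contained
    !a[$0]++                # print a line only the first time it is seen in the current file
' file1 file2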
If you don't need to preserve the order, it can be as simple as:
sort -u file1; sort -u file2
If you don't want to use a list (;), something like this is also an option:
cat <(sort -u file1) <(sort -u file2)
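With the sample files from the question, either unordered variant prints (each file deduplicated, order not preserved):
$ cat <(sort -u file1) <(sort -u file2)
10
20
40
0
10
30
40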

Fast diff of 2 large text files in shell?

I have 2 large files (F1 and F2) with 200k+ rows each, and currently I am comparing each record in F1 against F2 to look for records unique only to F1, then comparing F2 to F1 to look for records unique only to F2.
I am doing this by reading in each line of the file using a 'while' loop then using 'grep' on the line against the file to see if a match is found.
This process takes about 3 hours to complete if there are no mismatches, and can be 6+ hours if there are a large number of mismatches (files barely matching so 200k+ mismatches).
Is there any way I can rewrite this script to accomplish the same function but in a faster time?
I have tried to rewrite the script using sed to try to delete the line in F2 if a match is found so that when comparing F2 to F1, only the values unique to F2 remain, however calling sed for every iteration of F1's lines does not seem to improve the performance much.
Example:
F1 contains:
A
B
E
F
F2 contains:
A
Y
B
Z
The output I'm expecting when comparing F1 to F2 is:
E
F
And then comparing F2 to F1:
Y
Z
You want comm:
$ cat f1
A
B
E
F
$ cat f2
A
Y
B
Z
$ comm <(sort f1) <(sort f2)
                A
                B
E
F
        Y
        Z
Column 1 of the comm output contains the lines unique to f1, column 2 the lines unique to f2, and column 3 the lines found in both f1 and f2.
The parameters -1, -2, and -3 suppress the corresponding output. For example, if you want only the lines unique to f1, you can filter out the other columns:
$ comm -23 <(sort f1) <(sort f2)
E
F
Note that comm requires sorted input, which I supply in these examples using the bash process substitution syntax (<()). If you're not using bash, pre-sort the files into temporary files.
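For a shell without process substitution, that pre-sorting step might look like this (a sketch; the .sorted names are just placeholders):
sort F1 > F1.sorted
sort F2 > F2.sorted
comm -23 F1.sorted F2.sorted    # lines unique to F1
comm -13 F1.sorted F2.sorted    # lines unique to F2
rm F1.sorted F2.sorted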
Have you tried Linux's diff?
Some useful options are -i, -w, -u, -y.
Though, in that case, the files would have to be in the same order (you could sort them first).
If sort order of the output is not important and you are only interested in the sorted set of lines that are unique in the set of all lines from both files, you can do:
sort F1 F2 | uniq -u
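With the example files above (and since neither F1 nor F2 contains internal duplicates), that produces the combined set of lines unique to either file:
$ sort F1 F2 | uniq -u
E
F
Y
Z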
Grep is going to use compiled code to do the entirety of what you want if you simply treat one or the other of your files as a pattern file.
grep -vFx -f F1.txt F2.txt:
Y
Z
grep -vFx -f F2.txt F1.txt:
E
F
Explanation:
-v prints lines not matching those in the "pattern file" specified with -f
-F interprets the patterns as fixed strings rather than regexes (gleaned from another question I was reading to see whether there is a practical limit to this; I am curious whether it will work with large line counts in both files)
-x matches entire lines
Sorting is not required: you get the resulting unique lines in the order they appear in the file. This method takes longer because it cannot assume the inputs are sorted, but if you are looking at multi-line records, sorting destroys the context. Performance is fine when the files are similar, because grep -v skips a line as soon as it matches any line in the "pattern" file. If the files are highly dissimilar, performance is very slow, because every line is checked against every pattern before finally being printed.
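Put together, both directions could be wrapped in a tiny script like this (a sketch; the script name is made up):
#!/bin/sh
# only_in.sh: print the lines unique to each of two files, in their original order
echo "Only in $1:"
grep -vFx -f "$2" "$1"
echo "Only in $2:"
grep -vFx -f "$1" "$2"
Run it as: sh only_in.sh F1.txt F2.txt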

Seeking pairs of identically named files in two directories

I need to search two directories for pairs of files having identical names (but not extensions!) and merge their names within some new command.
First, how do I print only the name of each file?
1) Typically I use the following loop to get the full name of each file:
for file in ./files/* do;
title=$(base name "file")
print title
done
What should I change in the above script so that the title is only the name of the file, without its extension?
2) How can I add a condition that checks whether two files have the same name, using a double loop over them, e.g.
# counter for the detected equal files
i=0
for file in ./files1/* do;
title=$(base name "file") #change it to avoid extension within the title
for file2 in ./files2/* do;
title2=$(basename "file2") #change it to avoid extension within the title2
if title1==title2
echo $title1 and $title2 'has been found!'
i=i+1
done
Thanks for the help!
Gleb
You could start by fixing the syntax errors in your script, such as do followed by ; when it should be the other way round.
Then, the shell has operators to remove sub-strings from the start (##, #) and end (%%, %) in a variable. Here's how to list files without extensions, i.e. removing the shortest part that matches the glob .* from the right:
for file in *; do
printf '%s\n' "${file%.*}"
done
Read your shell manual to find out about these operators. It will pay for itself many times over in your programming career :-)
Do not believe anyone telling you to use ugly and expensive piping and forking with basename, cut, awk and such. That's all overkill.
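For the second part of the question, a corrected double loop using only those operators might look roughly like this (a sketch assuming the ./files1 and ./files2 layout from the question):
i=0    # counter for the detected equal files
for file1 in ./files1/*; do
    title1=${file1##*/}    # drop the leading directory
    title1=${title1%.*}    # drop the extension
    for file2 in ./files2/*; do
        title2=${file2##*/}
        title2=${title2%.*}
        if [ "$title1" = "$title2" ]; then
            echo "$file1 and $file2 have been found!"
            i=$((i + 1))
        fi
    done
done
echo "$i matching pairs"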
On the other hand, maybe there's a better way to achieve your goal. Suppose you have files like this:
$ find files1 files2
files1
files1/file1.x
files1/file3.z
files1/file2.y
files2
files2/file1.x
files2/file4.b
files2/file3.a
Now create two lists of file names, extensions stripped:
ls files1 | sed -e 's/\.[^.]*$//' | sort > f1
ls files2 | sed -e 's/\.[^.]*$//' | sort > f2
The comm utility tests for lines common in two files:
$ comm f1 f2
                file1
file2
                file3
        file4
The first column lists lines only in f1, the second only in f2 and the third common to both. Using the -1 -2 -3 options you can suppress unwanted columns. If you need to count only the common files (third column), run
$ comm -1 -2 f1 f2 | wc -l
2
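If the matching names should actually be fed to some merging command, the common column can be read back in a loop (a sketch; merge_command stands in for whatever the real command is, and the globs assume each name matches exactly one file per directory):
comm -1 -2 f1 f2 | while read -r name; do
    merge_command files1/"$name".* files2/"$name".*
done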

How to merge two files with unique IDs?

I have two files. File1 and File2.
File1:
1 a
2 b
File2:
1 a
2 c
3 d
I would like to generate a file that has the following:
1 a
2 c
3 d
That is, the lines of File2 either update existing lines in File1 or are inserted into it, sort of how the UPSERT feature works in SQL.
I'm guessing here, since the question is a bit vague. Anyway, here's something in awk that uses the first field as a key and stores the whole line under it; a later line always overwrites the array entry if the same key is seen again:
$ awk '{a[$1]=$0}END{for (i in a) print a[i]}' f1 f2
1 a
2 c
3 d
EDIT: The new version takes an arbitrarily wide file instead of being tied to two fields.
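One caveat: for (i in a) iterates in no particular order, so the sorted-looking output above is just luck with this small example. If the output should be ordered by ID (numeric IDs, as in the example), it can be sorted afterwards:
awk '{a[$1]=$0} END{for (i in a) print a[i]}' f1 f2 | sort -k1,1n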
