See only the first line change using `diff` command [duplicate] - shell

Can I use the diff command to find out how many lines two files differ by?
I don't want the contextual difference, just the total number of lines that are different between two files. Best if the result is just a single integer.

diff can do the first part of the job but not the counting; wc -l does the rest:
diff -y --suppress-common-lines file1 file2 | wc -l

Yes you can, and in true Linux fashion you can use a number of commands piped together to perform the task.
First you need to use the diff command, to get the differences in the files.
diff file1 file2
This will give you a list of changes. The ones you're interested in are the lines prefixed with a '>' symbol (lines present only in file2; lines present only in file1 are prefixed with '<').
Use the grep tool to filter these out as follows:
diff file1 file2 | grep "^>"
Finally, once you have a list of the changes you're interested in, you simply use the wc command in line mode to count the number of changes.
diff file1 file2 | grep "^>" | wc -l
and you have a perfect example of the philosophy that Linux is all about.
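As a rough sketch of the pipelines above, the following builds two small throwaway files (made-up sample data, not from the question) and counts the differing lines on each side. It assumes diff's default "normal" output format, where lines only in the first file start with '<' and lines only in the second start with '>'.

```shell
# Sketch only: the sample files are hypothetical.
tmp=$(mktemp -d)
printf 'a\nb\nc\n'    > "$tmp/file1"
printf 'a\nx\nc\nd\n' > "$tmp/file2"

# grep -c exits non-zero when it counts no lines (and diff exits 1 when the
# files differ, which matters under "set -o pipefail"), so "|| true" keeps
# the sketch usable in scripts that run under "set -e".
only_in_file1=$(diff "$tmp/file1" "$tmp/file2" | grep -c '^<' || true)
only_in_file2=$(diff "$tmp/file1" "$tmp/file2" | grep -c '^>' || true)

echo "only in file1: $only_in_file1"
echo "only in file2: $only_in_file2"
```

grep -c counts matching lines itself, so the extra wc -l stage is optional.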

Related

How to compare two file in bash script and find lines in one file not in another? [duplicate]

I have two large files (sets of filenames). Roughly 30.000 lines in each file. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if this is file1:
line1
line2
line3
And this is file2:
line1
line4
line5
Then my result/output should be:
line2
line3
This works:
grep -v -f file2 file1
But it is very, very slow when used on my large files.
I suspect there is a good way to do this using diff(), but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.
Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff():
diff file2 file1 | grep '^>' | sed 's/^>\ //'
Surely, there must be a better way?
The comm command (short for "common") may be useful. From its man page: comm - compare two sorted files line by line
#find lines only in file1
comm -23 file1 file2
#find lines only in file2
comm -13 file1 file2
#find lines common to both files
comm -12 file1 file2
The man page is actually quite readable for this.
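A minimal sketch of the comm approach on unsorted input, using the sample lines from the question in shuffled order. The temporary file names are assumptions for illustration; the key point is that both inputs are sorted under the same locale before comm sees them.

```shell
# Sketch only: temporary paths are hypothetical.
tmp=$(mktemp -d)
printf 'line3\nline1\nline2\n' > "$tmp/file1"
printf 'line5\nline1\nline4\n' > "$tmp/file2"

# comm needs both inputs sorted in the same collation order,
# so sort them under one locale first.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# -23 suppresses column 2 (only in file2) and column 3 (in both),
# leaving the lines that appear only in file1.
only_in_file1=$(LC_ALL=C comm -23 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$only_in_file1"
```

The result comes out in sorted order, not in the original order of file1.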
You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:
diff --new-line-format="" --unchanged-line-format="" file1 file2
The input files should be sorted for this to work. With bash (and zsh) you can sort on the fly with process substitution <( ):
diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)
In the above, new and unchanged lines are suppressed, so only old lines (i.e. the lines removed from file1, which is what you want here) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w etc.) for less strict matching.
Explanation
The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.
If you are familiar with unified diff format, you can partly recreate it with:
diff --old-line-format="-%L" --unchanged-line-format=" %L" \
--new-line-format="+%L" file1 file2
The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u (note that it only outputs differences; it lacks the ---, +++ and @@ lines at the top of each grouped change).
You can also use this to do other useful things like number each line with %dn.
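As an illustrative sketch of the line-format options, the following prints each line that exists only in file1 together with its line number. It assumes GNU diff (these options are not in POSIX diff) and uses the question's sample data in temporary files whose paths are made up here.

```shell
# Sketch only: requires GNU diff; temporary paths are hypothetical.
tmp=$(mktemp -d)
printf 'line1\nline2\nline3\n' > "$tmp/file1"
printf 'line1\nline4\nline5\n' > "$tmp/file2"

# Old (file1-only) lines are printed as "<lineno>: <line>"; new and
# unchanged lines are suppressed. diff exits 1 when the files differ,
# so "|| true" keeps this usable under "set -e".
numbered=$(diff --old-line-format='%dn: %L' \
                --new-line-format='' \
                --unchanged-line-format='' \
                "$tmp/file1" "$tmp/file2" || true)
printf '%s\n' "$numbered"
```

Here %dn is the line number formatted as a decimal integer and %L is the line including its trailing newline.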
The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort on the fly. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files, and outputs the missing lines in the order they occur in file1.
# output lines in file1 that are not in file2
BEGIN { FS="" } # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; } # file1, index by lineno
(NR!=FNR) { ss2[$0]++; } # file2, index by string
END {
for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
This stores the entire contents of file1 line by line in a line-number indexed array ll1[], and the entire contents of file2 line by line in a line-content indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line in file1 is present in file2. (This will have different output from the diff method if there are duplicates.)
In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.
BEGIN { FS="" }
(NR==FNR) { # file1, index by lineno and string
ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) { # file2
if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}
The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.
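For comparison, a commonly used compact variant (not the scripts above) reads file2 first and then streams file1, so only file2's lines are held in memory. This is a sketch under the assumption that file2 is non-empty; with an empty file2 the NR==FNR test would also be true for file1 and nothing would be printed.

```shell
# Sketch only: temporary paths and sample data are hypothetical.
tmp=$(mktemp -d)
printf 'line3\nline1\nline2\n' > "$tmp/file1"
printf 'line5\nline1\nline4\n' > "$tmp/file2"

# First pass (NR == FNR): remember every line of file2 as an array key.
# Second pass: print each file1 line that was never seen in file2.
missing=$(awk 'NR == FNR { seen[$0]; next } !($0 in seen)' \
              "$tmp/file2" "$tmp/file1")
printf '%s\n' "$missing"
```

Note the argument order: file2 comes first on the command line. The output keeps the original order of file1, and no sorting is needed.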
In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), repeated runs with chunks of file1 and reading file2 completely each time:
split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1
Note the use and placement of -, meaning stdin, on the gawk command line. This is provided by split from file1 in chunks of 20000 lines per invocation.
For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools which provides GNU diff, awk, though only a POSIX/BSD split rather than a GNU version.
As konsolebox suggested, the poster's grep solution
grep -v -f file2 file1
actually works great (faster) if you simply add the -F option, to treat the patterns as fixed strings instead of regular expressions. I verified this on a pair of ~1000 line file lists I had to compare. With -F it took 0.031 s (real), while without it took 2.278 s (real), when redirecting grep output to wc -l.
These tests also included the -x switch, which is a necessary part of the solution in order to ensure total accuracy in cases where file2 contains lines which match part of, but not all of, one or more lines in file1.
So a solution that does not require the inputs to be sorted, and is fast and flexible (case sensitivity, etc.), is:
grep -F -x -v -f file2 file1
This doesn't work with all versions of grep. For example, it fails on macOS, where a line in file1 will be shown as not present in file2, even though it is, if it matches another line that is a substring of it. Alternatively, you can install GNU grep on macOS in order to use this solution.
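To see why -x matters, here is a small sketch with made-up data where one line of file1 ("foobar") contains a whole line of file2 ("foo") as a substring.

```shell
# Sketch only: sample data and temporary paths are hypothetical.
tmp=$(mktemp -d)
printf 'foo\nfoobar\n' > "$tmp/file1"
printf 'foo\n'         > "$tmp/file2"

# grep exits 1 when it prints nothing, hence "|| true".
# Without -x, the pattern "foo" also matches inside "foobar",
# so -v wrongly drops both lines of file1.
without_x=$(grep -F -v -f "$tmp/file2" "$tmp/file1" || true)
# With -x, only whole-line matches count, so "foobar" survives.
with_x=$(grep -F -x -v -f "$tmp/file2" "$tmp/file1" || true)

echo "without -x: '$without_x'"
echo "with -x:    '$with_x'"
```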
If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq:
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique
Test:
seq 1 1 7 | sort --random-sort > includes.txt
seq 3 1 9 | sort --random-sort > excludes.txt
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique
# Output:
1
2
This is also relatively fast, compared to grep.
What's the speed of sort and diff?
sort file1 -u > file1.sorted
sort file2 -u > file2.sorted
diff file1.sorted file2.sorted
Use combine from moreutils package, a sets utility that supports not, and, or, xor operations
combine file1 not file2
i.e. give me lines that are in file1 but not in file2,
or, put differently, give me lines in file1 minus lines in file2.
Note: combine sorts and finds unique lines in both files before performing any operation but diff does not. So you might find differences between output of diff and combine.
So in effect you are saying
Find distinct lines in file1 and file2 and then give me lines in file1 minus lines in file2
In my experience, it's much faster than other options
This seems quick for me :
comm -2 -3 <(sort file1.txt) <(sort file2.txt) > output.txt
$ join -v 1 -t '' file1 file2
line2
line3
The -t makes sure that it compares the whole line, if you had a space in some of the lines.
You can use Python:
python -c '
lines_to_remove = set()
with open("file2", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())
with open("file1", "r") as f:
    for line in f.readlines():
        if line.strip() not in lines_to_remove:
            print(line.strip())
'
Using fgrep or adding the -F option to grep could help. But for faster calculations you could use Awk.
You could try one of these Awk methods:
http://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/#post4066219
The way I usually do this is using the --suppress-common-lines flag, though note that this only works if you do it in side-by-side format.
diff -y --suppress-common-lines file1.txt file2.txt

How to print out the number of lines in an output command

For example:
cat /etc/passwd
What is the easiest way to count and display the number of lines the command outputs?
wc is the unix utility which counts characters, words, lines etc. Try man wc to learn more about it. The -l option makes it print only the number of lines (and not characters and other stuff).
So, wc -l <filename> will print the number of lines in the file <filename>.
You asked about how to count the number of lines output by a command line program in general. To do that, you can use pipes in unix: you can pipe the output of any command to wc -l. In your example, cat /etc/passwd is the command whose output you want to count. For that you should do:
cat /etc/passwd | wc -l
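A small sketch of capturing that count in a variable. Here printf stands in for an arbitrary command, and the tr stage is an assumption for portability, since some wc implementations pad the number with leading spaces.

```shell
# Count the lines a command prints by piping it into wc -l.
# printf stands in for any command; tr strips the padding that
# some wc implementations put before the number.
line_count=$(printf 'alpha\nbeta\ngamma\n' | wc -l | tr -d ' ')
echo "$line_count"
```

wc -l counts newline characters, so a final line without a trailing newline is not counted.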

grep like command to find matching lines plus neighbourhood lines

The grep command is really powerful and I use it a lot.
Sometimes I need to search many files for a string I barely remember, helping myself with the -i (ignore case), -r (recursive) and -v (invert match) options.
But what I really need is a special output from grep which highlights the matching line(s) plus the neighbouring lines (given the matching line, I'd like to see, let's say, the 2 preceding and the 2 subsequent lines).
Is there a way to get this result using bash?
Grep itself will do this
grep -A 2 -B 2 foo myfile.txt
most greps allow the "context" flag making it a bit more readable:
grep --context=3 foo myfile.txt
You can omit -C
grep -2 foo myfile.txt
is equal to
grep -C 2 foo myfile.txt
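A quick sketch of what the context flags return, using a made-up seven-line file in a temporary directory.

```shell
# Sketch only: the file contents and path are hypothetical.
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\nfoo\nfive\nsix\nseven\n' > "$tmp/myfile.txt"

# -C 2 prints the matching line plus 2 lines before and 2 lines after it.
context=$(grep -C 2 foo "$tmp/myfile.txt")
printf '%s\n' "$context"
```

This prints the lines from "two" through "six". With GNU grep, when several matches are far apart, each group of context lines is separated by a line containing "--".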

Get the newest file based on timestamp

I am new to shell scripting, so I need some help with how to go about this problem.
I have a directory which contains files in the following format. The files are in a directory called /incoming/external/data
AA_20100806.dat
AA_20100807.dat
AA_20100808.dat
AA_20100809.dat
AA_20100810.dat
AA_20100811.dat
AA_20100812.dat
As you can see, the filename includes a timestamp, i.e. [RANGE]_[YYYYMMDD].dat
What I need to do is find out which of these files has the newest date, using the timestamp in the filename rather than the system timestamp, store that filename in a variable, move it to another directory, and move the rest to a different directory.
For those who just want an answer, here it is:
ls | sort -n -t _ -k 2 | tail -1
Here's the thought process that led me here.
I'm going to assume the [RANGE] portion could be anything.
Start with what we know.
Working Directory: /incoming/external/data
Format of the Files: [RANGE]_[YYYYMMDD].dat
We need to find the most recent [YYYYMMDD] file in the directory, and we need to store that filename.
Available tools (I'm only listing the relevant tools for this problem ... identifying them becomes easier with practice):
ls
sed
awk (or nawk)
sort
tail
I guess we don't need sed, since we can work with the entire output of ls command. Using ls, awk, sort, and tail we can get the correct file like so (bear in mind that you'll have to check the syntax against what your OS will accept):
NEWESTFILE=`ls | awk -F_ '{print $1 $2}' | sort -n -k 2,2 | tail -1`
Then it's just a matter of putting the underscore back in, which shouldn't be too hard.
EDIT: I had a little time, so I got around to fixing the command, at least for use in Solaris.
Here's the convoluted first pass (this assumes that ALL files in the directory are in the same format: [RANGE]_[yyyymmdd].dat). I'm betting there are better ways to do this, but this works with my own test data (in fact, I found a better way just now; see below):
ls | awk -F_ '{print $1 " " $2}' | sort -n -k 2 | tail -1 | sed 's/ /_/'
... while writing this out, I discovered that you can just do this:
ls | sort -n -t _ -k 2 | tail -1
I'll break it down into parts.
ls
Simple enough ... gets the directory listing, just filenames. Now I can pipe that into the next command.
awk -F_ '{print $1 " " $2}'
This is the AWK command. It allows you to take an input line and modify it in a specific way. Here, all I'm doing is specifying that awk should break the input wherever there is an underscore (_). I do this with the -F option. This gives me two halves of each filename. I then tell awk to output the first half ($1), followed by a space (" "), followed by the second half ($2). Note that the space was the part that was missing from my initial suggestion. Also, this is unnecessary, since you can specify a separator in the sort command below.
Now the output is split into [RANGE] [yyyymmdd].dat on each line. Now we can sort this:
sort -n -k 2
This takes the input and sorts it based on the 2nd field. The sort command uses whitespace as a separator by default. While writing this update, I found the documentation for sort, which allows you to specify the separator, so AWK and SED are unnecessary. Take the ls and pipe it through the following sort:
sort -n -t _ -k 2
This achieves the same result. Now you only want the last file, so:
tail -1
If you used awk to separate the file (which just adds extra complexity, so don't do it), you can replace the space with an underscore again with sed:
sed 's/ /_/'
Some good info here, but I'm sure most people aren't going to read down to the bottom like this.
This should work:
newest=$(ls | sort -t _ -k 2,2 | tail -n 1)
others=($(ls | sort -t _ -k 2,2 | head -n -1))
mv "$newest" newdir
mv "${others[@]}" otherdir
It won't work if there are spaces in the filenames although you could modify the IFS variable to affect that.
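One way to sidestep the word-splitting problem is to avoid parsing ls output altogether and compare the date stamps in a loop. The following is only a sketch: the directory names (incoming, latest, archive) and the sample files are made up, and it assumes every file matches [RANGE]_[YYYYMMDD].dat with a purely numeric date.

```shell
# Sketch only: directory names and sample files are hypothetical.
workdir=$(mktemp -d)
mkdir "$workdir/incoming" "$workdir/latest" "$workdir/archive"
for stamp in 20100806 20100812 20100809; do
    : > "$workdir/incoming/AA_$stamp.dat"
done
: > "$workdir/incoming/BB_20100810.dat"

newest=""
newest_stamp=0
for path in "$workdir"/incoming/*_*.dat; do
    [ -e "$path" ] || continue      # glob matched nothing
    name=${path##*/}                # strip the directory part
    stamp=${name##*_}               # keep the text after the last "_"
    stamp=${stamp%.dat}             # drop the extension -> YYYYMMDD
    if [ "$stamp" -gt "$newest_stamp" ]; then
        newest_stamp=$stamp
        newest=$name
    fi
done

# Move the newest file first, then everything that is left.
mv "$workdir/incoming/$newest" "$workdir/latest/"
mv "$workdir"/incoming/*_*.dat "$workdir/archive/"
echo "newest: $newest"
```

Because every expansion is quoted and the glob is expanded by the shell, file names containing spaces are handled without touching IFS.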
Try:
$ ls -lr
Hope it helps.
Use:
ls -r -1 AA_*.dat | head -n 1
(assuming there are no other files matching AA_*.dat)
ls -1 AA* | sort -r | head -1
Due to the naming convention of the files, alphabetical order is the same as date order. I'm pretty sure that in bash '*' expands out alphabetically (but can not find any evidence in the manual page), ls certainly does, so the file with the newest date, would be the last one alphabetically.
Therefore, in bash
mv $(ls | tail -1) first-directory
mv * second-directory
Should do the trick.
If you want to be more specific about the choice of file, then replace * with something else - for example AA_*.dat
My solution to this is similar to others, but a little simpler.
ls -tr | tail -1
What it actually does is rely on ls to sort the output by modification time, then use tail to get the last listed file name.
This solution will not work if the filename you require has a leading dot (e.g. .profile).
This solution does work if the file name contains a space.
