I have two files:
$ cat xx
aaa
bbb
ccc
ddd
eee
$ cat zz
aaa
bbb
ccc
#ddd
eee
I want to diff them, while ignoring comments.
I tried all possible permutations, but nothing works:
diff --ignore-matching-lines='#' -u xx zz
diff --ignore-matching-lines='#.*' -u xx zz
diff --ignore-matching-lines='^#.*' -u xx zz
How can I diff two files while ignoring lines that match a given regex, such as anything starting with #?
That's not how the -I option in diff works; see Gilles's comment on Unix.SE and also the manual - 1.4 Suppressing Differences Whose Lines All Match a Regular Expression.
In short, the -I option ignores a difference (an insertion/deletion or change) only if every inserted or deleted line in it matches the given RE. In your case, the diff between your two files, as seen in the output
diff f1 f2
4c4
< ddd
---
> #ddd
i.e. a change at the 4th line of both files. Here ddd and #ddd together form the "hunk" as defined in the manual, and since ddd matches none of your REs #, #.* or ^#.*, the hunk is not ignorable. When such a non-ignorable change exists, diff prints both the matching and the non-matching lines. Quoting the manual,
for each nonignorable change, diff prints the complete set of changes in its vicinity, including the ignorable ones.
The -I option would have worked if the file f1 did not contain the line ddd, i.e.
f1
aaa
bbb
ccc
eee
f2
aaa
bbb
ccc
#ddd
eee
where doing
diff f1 f2
3a4
> #ddd
would result in just one "hunk", #ddd, which can be marked for ignoring with a pattern like ^#, i.e. ignore any lines starting with a #. As you can see, this produces the desired output (no lines):
diff -u -I '^#' f1 f2
So, given that your input contains the uncommented line ddd in f1, it is not straightforward to define an RE that matches both a commented and an uncommented line. But diff does support multiple -I flags, as in
diff -I '^#' -I 'ddd' f1 f2
but that does not generalize, as you cannot know beforehand which non-comment lines will differ, and so cannot include them in the ignore patterns.
As a workaround, you can simply strip the lines starting with # from both files before passing them to diff, i.e.
diff <(grep -v '^#' f1) <(grep -v '^#' f2)
4d3
< ddd
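If you do this often, you could wrap the workaround in a small shell function (a sketch; the name diffnc and the hard-coded comment pattern are illustrative assumptions, not a standard tool):

# diffnc: diff two files while ignoring lines that start with # (hypothetical helper)
diffnc() {
    diff -u <(grep -v '^#' "$1") <(grep -v '^#' "$2")
}
diffnc xx zz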
Related
I want to insert multiple lines into a file using a shell script. Let us consider my original file, original.txt:
aaa
bbb
ccc
aaa
bbb
ccc
aaa
bbb
ccc
.
.
.
and my insert file, toinsert.txt:
111
222
333
Now I have to insert the three lines from the 'toinsert.txt' file ONLY after the line 'ccc' appears for the FIRST time in the 'original.txt' file. Note: the 'ccc' pattern appears more than once in my 'original.txt' file. After inserting ONLY after its first appearance, my file should change like this:
aaa
bbb
ccc
111
222
333
aaa
bbb
ccc
aaa
bbb
ccc
.
.
.
I should do the above insertion using a shell script. Can someone help me?
Note 2: I found a similar case with a partial solution:
sed -i -e '/ccc/r toinsert.txt' original.txt
which actually does the insertion multiple times (for every time the ccc pattern shows up).
Use ed, not sed, to edit files:
printf "%s\n" "/ccc/r toinsert.txt" w | ed -s original.txt
It inserts the contents of the other file after the first line containing ccc, but unlike your sed version, only after the first.
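If the printf form is hard to read, the same ed script can equivalently be supplied as a here-document:

# /ccc/ addresses the first matching line; r reads the file in after it; w saves
ed -s original.txt <<'EOF'
/ccc/r toinsert.txt
w
EOF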
This might work for you (GNU sed):
sed '0,/ccc/!b;/ccc/r insertFile' file
Use a range:
If the current line is in the range following the first occurrence of ccc, break from further processing and implicitly print as usual.
Otherwise, if the current line does contain ccc, insert lines from insertFile.
N.B. This uses the address 0 which allows the regexp to occur on line 1 and is specific to GNU sed.
or:
sed -e '/ccc/!b;r insertFile' -e ':a;n;ba' file
Use a loop:
If a line does not contain ccc, no further processing and print as usual.
Otherwise, insert lines from insertFile and then using a loop, fetch/print the remaining lines until the end of the file.
N.B. The r command insists on being delimited from other sed commands by a newline. The -e option simulates this effect and thus the sed commands are split across two -e options.
or:
sed 'x;/./{x;b};x;/ccc/!b;h;r insertFile' file
Use a flag:
If the hold space is not empty (the flag has already been set), no further processing and print as usual.
Otherwise, if the line does not contain ccc, no further processing and print as usual.
Otherwise, copy the current line to the hold space (set the flag) and insert lines from insertFile.
N.B. In all cases the r command inserts lines from insertFile after the current line is printed.
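If you prefer awk, here is a sketch equivalent to the flag version (assuming the insert file is named toinsert.txt, as in the question):

awk '{ print }                          # print every input line as-is
     !done && /ccc/ {                   # only at the first ccc...
         while ((getline l < "toinsert.txt") > 0) print l   # ...emit the insert file once
         done = 1                       # set the flag so later ccc lines do nothing
     }' original.txt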
This question already has answers here: Replace newlines with literal \n
I'm writing a bash script which calls vim to modify another file and then joins all the lines in the file using '\n'.
Code I tried in script:
vi filefff (then I modify the text in filefff)
cat filefff
new=$(cat filefff | sed 'N;N;s/\n/\\n/g')
echo $new
Here is the problem:
for example, if there are two lines in the file, first line aa and second line bb:
aa
bb
then I change the file to:
aa
bb
cc
dd
ee
the result of echo $new is aa"\n"bb cc"\n"dd ee"\n". The command only joined some of the lines.
And then I append some more lines:
aa
bb
cc
dd
ee
ff
gg
hh
the result is aa"\n"bb cc"\n"dd ee"\n"ff, and the 'hh' is gone.
So I'd like to know why and how to join all the lines with '\n', no matter how many lines I'm going to append to the file.
As an enhancement to the 'sed' or 'tr' solutions suggested in the comments, which can produce a VERY long line, consider the following options, which produce more human-friendly output by capping the maximum line length (200 in the examples below):
# Use fold to limit line length
cat filefff | tr '\n' ' ' | fold -w200
# Use fmt to combine lines
cat filefff | fmt -w200
# Use xargs to format
cat filefff | xargs -s200
Note that 'fmt' will assume a line break is required wherever an empty line appears (it treats blank lines as paragraph breaks).
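As for why your original command joined only some of the lines: N;N appends just the next two input lines, so sed works on the file in groups of three. To join every line with a literal \n no matter how many lines there are, a common approach is the classic sed slurp loop (a sketch, GNU sed):

# :a sets a label, N appends the next line, $!ba loops until the last line,
# then a single global substitution replaces each newline with a literal \n
new=$(sed ':a;N;$!ba;s/\n/\\n/g' filefff)
echo "$new"

Quoting "$new" in the echo also stops the shell from word-splitting the result, which is where the stray spaces in your output came from.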
I have 2 large files (F1 and F2) with 200k+ rows each, and currently I am comparing each record in F1 against F2 to look for records unique only to F1, then comparing F2 to F1 to look for records unique only to F2.
I am doing this by reading each line of one file in a 'while' loop and then running 'grep' with that line against the other file to see if a match is found.
This process takes about 3 hours to complete if there are no mismatches, and can be 6+ hours if there are a large number of mismatches (files barely matching so 200k+ mismatches).
Is there any way I can rewrite this script to accomplish the same function but in a faster time?
I have tried to rewrite the script using sed to try to delete the line in F2 if a match is found so that when comparing F2 to F1, only the values unique to F2 remain, however calling sed for every iteration of F1's lines does not seem to improve the performance much.
Example:
F1 contains:
A
B
E
F
F2 contains:
A
Y
B
Z
The output I'm expecting when comparing F1 to F2 is:
E
F
And then comparing F2 to F1:
Y
Z
You want comm:
$ cat f1
A
B
E
F
$ cat f2
A
Y
B
Z
$ comm <(sort f1) <(sort f2)
		A
		B
E
F
	Y
	Z
Column 1 of the comm output contains the lines unique to f1, column 2 those unique to f2, and column 3 the lines found in both f1 and f2.
The options -1, -2, and -3 suppress the corresponding column. For example, if you want only the lines unique to f1, you can filter out the other columns:
$ comm -23 <(sort f1) <(sort f2)
E
F
Note that comm requires sorted input, which I supply in these examples using bash process substitution (<()). If you're not using bash, pre-sort into temporary files.
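For example (a sketch with arbitrary temporary file names):

sort f1 > f1.sorted
sort f2 > f2.sorted
comm -23 f1.sorted f2.sorted    # lines unique to f1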
Have you tried Linux's diff?
Some useful options are -i, -w, -u, -y
Though, in that case, the files would have to be in the same order (you could sort them first).
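For example (a sketch: sort both files, then keep only the marker lines of a plain diff):

diff <(sort F1) <(sort F2) | grep '^<' | cut -c3-    # unique to F1
diff <(sort F1) <(sort F2) | grep '^>' | cut -c3-    # unique to F2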
If sort order of the output is not important and you are only interested in the sorted set of lines that are unique in the set of all lines from both files, you can do:
sort F1 F2 | uniq -u
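With the sample files this gives:

$ sort F1 F2 | uniq -u
E
F
Y
Z

Note that both directions are merged into one sorted list, so you lose the information of which file each unique line came from.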
grep will do the entirety of what you want in compiled code if you simply treat one or the other of your files as a pattern file.
grep -vFx -f F1.txt F2.txt:
Y
Z
grep -vFx -f F2.txt F1.txt:
E
F
Explanation:
-v to print lines not matching those in the "pattern file" specified with -f
-F - interpret patterns as fixed strings and not regexes, gleaned from this question, which I was reading to see if there was a practical limit to this. I am curious whether it will work with large line counts in both files.
-x - match entire lines
Sorting is not required: you get the resulting unique lines in the order they appear. This method takes longer because it cannot assume the inputs are sorted, but if you are looking at multiline records, sorting really trashes the context. The performance is okay if the files are similar, because grep -v skips a line as soon as it matches any line in the "pattern" file. If the files are highly dissimilar, the performance is very slow, because every pattern is checked against every line before it is finally printed.
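A hash-based alternative that likewise needs no sorting is awk (a sketch; it prints the lines of the second file that are absent from the first):

# Pass 1 (NR==FNR): remember every line of F1. Pass 2: print F2 lines not seen.
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' F1.txt F2.txt

Swap the two file arguments to get the lines unique to F1.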
I have a file that has the following format:
EDouble entry for scenario XX AAA 70337262003 Line 000000003350
EDouble entry for scenario XX AAA 70337262003 Line 000000003347
EDouble entry for scenario XX AAA 71375201001 Line 000000003353
EDouble entry for scenario XX AAA 71375201001 Line 000000003351
EDouble entry (different date/time) for scenario YY AAA 10722963407 Line 000000000447
EDouble entry for scenario YY AAA 55173006602 Line 000000002868
EDouble entry (different date/time) for scenario YY AAA 60404822801 Line 000000003285
What I want to do is basically strip away all the alphabet characters and output a file that contains:
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
I've thought of a couple of ways that could get me there; I'm simply listing some ideas since I don't have a ready solution. I could strip all alphabetic characters with:
tr -d '[[:alpha:]]'
but that would still mean I would need to process the file further to separate the first number from the second. Sed could perhaps provide a simpler solution since the second number will always start with 0.
sed -n 's/.*\([1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9]\).*/\1/p'
to find the pattern and print only the match, but the above command doesn't output anything. Could someone help me, please? It's not necessary to accomplish this with sed; I imagine awk with gsub, or grep, has something similar?
Print the third-to-last column (in both line formats of the sample, the number of interest is always the third field from the end):
awk '{print $(NF-2)}' file
Output:
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
If you prefer sed, use this:
sed -rn "s#.*([1-9][0-9]{10}).*#\1#p" file.txt
With grep you can do this:
grep -o '[1-9][0-9]\{10\}' file
With sed:
sed -n 's/.*\([1-9][0-9]\{10\}\).*/\1/p' file
There's a narrow margin of error in targeting exactly 11 digits, as the other numbers (the ones starting with 0) are 12 digits long. A more robust solution taking that into account would be:
sed -n 's/.*[[:blank:]]\([1-9][0-9]\{10\}\).*/\1/p' file
i.e. make sure to match a [[:blank:]] before the number.
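An awk sketch that sidesteps the digit-count fragility by matching whole fields only (interval expressions like {10} need a reasonably modern, POSIX-conformant awk):

# Print any whitespace-delimited field that is exactly 11 digits starting with 1-9
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^[1-9][0-9]{10}$/) print $i }' file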
I see that AAA is constant in all rows, right before the number.
Therefore you can use this:
$ grep -oP '(?<=AAA\s)\s*\d+' data
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
This one extracts a group of digits followed by a word boundary, but not followed by the end of the line:
$ grep -Po '\d+\b(?!$)' infile
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
-P enables Perl regular expressions
-o retains only the match
\d+\b greedily matches digits followed by a word boundary
(?!$) is a "negative look-ahead": if the next character is the end of the line, don't match
Sample Text:
This is a test
This is aaaa test
This is aaa test
This is test a
This aa is test
I have just started learning unix commands like grep, awk and sed and have a quick question. If my text file contains the above text, how can I print out just the lines that use the letter ‘a’ 2 or fewer times?
I tried using awk, but I don't understand the syntax to add up all the instances of ‘a’ and only print the lines that have ‘a’ 2 or fewer times. I understand comparing numbers based on columns, like awk '$1 <= 2', but I don't know how to do that with characters. Any help would be appreciated.
Essentially it should print out:
This is a test
This is test a
This aa is test
For clarity: I don't want to remove the extra a's, but rather only print the lines that contain two or fewer a's.
Using awk
awk '!/aaa+/' file
This is a test
This is test a
This aa is test
Do not print lines with three or more consecutive a's.
Same with sed
sed '/aaa\+/d' file
This is a test
This is test a
This aa is test
The default for sed is to print all lines; /aaa\+/d tells sed to delete lines with 3 or more consecutive a's.
like this?
kent$ grep -v 'aaa\+' file
This is a test
This is test a
This aa is test
Update
I just saw the comment: if your requirement is counting a's anywhere on the line, consecutive or not, see this example (with awk):
kent$ cat f
1a a
2a
3
4a a a aa
5aaaaaaaaaa
kent$ awk 'gsub(/a/,"a")<3' f
1a a
2a
3
without gsub (splitting on a: a line with n a's has n+1 fields, so NF<4 keeps lines with at most 2 a's):
kent$ awk -F'a' 'NF<4' f
1a a
2a
3
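The same "anywhere on the line" requirement also works with grep (a sketch: three a's in any positions must match a.*a.*a, so invert that):

kent$ grep -vE 'a.*a.*a' f
1a a
2a
3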