Compare two text files line by line, finding differences but ignoring numerical differences - bash

I'm working on a bash script to compare two similar text files line by line and find any differences between corresponding lines. I should point out each difference and report which line it is on, but I should ignore numerical values in the comparison.
Example:
Process is running; process found : 12603 process is listening on port 1200
Process is running; process found : 43023 process is listening on port 1200
In the example above, the script shouldn't report any difference, since the only change is the process ID, which changes all the time.
But otherwise I want it to notify me of the differences between the lines.
Example:
Process is running; process found : 12603 process is listening on port 1200
Process is not running; process found : 43023 process is not listening on port 1200
I already have a working script to find the differences, and I've used the following function to find the differences while ignoring the numerical values, but it's not working perfectly. Any suggestions?
COMPARE_FILES()
{
    awk 'NR==FNR{a[FNR]=$0;next}$0!~a[FNR]{print $0}' "$1" "$2"
}
Where $1 and $2 are the two files to compare.

Would you please try the following:
COMPARE_FILES() {
    awk '
        NR==FNR {a[FNR]=$0; next}
        {
            b=$0;     gsub(/[0-9]+/,"",b)
            c=a[FNR]; gsub(/[0-9]+/,"",c)
            if (b != c) {printf "< %s\n> %s\n", $0, a[FNR]}
        }' "$1" "$2"
}
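As a quick check, assuming the function above is defined in your shell and using hypothetical file names, a run over the two differing sample lines from the question might look like:
$ printf 'Process is running; process found : 12603 process is listening on port 1200\n' > file1.txt
$ printf 'Process is not running; process found : 43023 process is not listening on port 1200\n' > file2.txt
$ COMPARE_FILES file1.txt file2.txt
< Process is not running; process found : 43023 process is not listening on port 1200
> Process is running; process found : 12603 process is listening on port 1200
The < line is from the second file, the > line from the first; lines that are identical after digit removal produce no output.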

Jettison the digits before making the comparison. I would improve your code by replacing
NR==FNR{a[FNR]=$0;next}$0!~a[FNR]{print $0}
with
NR==FNR{a[FNR]=$0;next}gensub(/[[:digit:]]/,"","g",$0)!~gensub(/[[:digit:]]/,"","g",a[FNR]){print $0}
Explanation: I harness the gensub string function, as it returns a new string (gsub alters the value of the selected variable in place). I replace every [:digit:] character with an empty string (i.e. delete it) globally. Note that gensub is a GNU awk (gawk) extension.
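A quick illustration of the difference (run with gawk, since gensub is GNU-specific):
$ echo 'port 1200 open' | gawk '{print gensub(/[[:digit:]]/, "", "g", $0); print $0}'
port  open
port 1200 open
gensub returns the modified copy and leaves $0 untouched, whereas gsub would have rewritten $0 itself.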

Using any awk:
compare_files() {
    awk '{key=$0; gsub(/[0-9]+(\.[0-9]+)?/,0,key)} NR==FNR{a[FNR]=key; next} key != a[FNR]' "$@"
}
The above doesn't just remove the digits, it replaces every set of numbers, whether they're integers like 17 or decimals like 17.31, with the number 0 to avoid false matches.
For example, given input like:
file1: foo 1234 bar
file2: foo bar
If you just remove the digits then those 2 lines incorrectly become identical:
file1: foo bar
file2: foo bar
whereas if you replace all numbers with a 0 then they correctly remain different:
file1: foo 0 bar
file2: foo bar
Note that although the above compares the lines after converting numbers to 0, it doesn't modify the original lines, so the output shows the original lines, not the modified ones, for ease of further investigating the differences.
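For example (hypothetical file names):
$ printf 'foo 1234 bar\n' > file1
$ printf 'foo bar\n' > file2
$ compare_files file1 file2
foo bar
The differing line is printed exactly as it appears in the second file.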

Related

bash: subtract constant number after prefix

I have a large text file with many entries like this:
/locus_tag="PREFIX_05485"
including the leading spaces. Unfortunately, the first identifier does not start with 00001.
The only part in this line that is changing is the number.
I would like to change the PREFIX (this I can do easily with sed), but I also want to decrease the number so it looks like this:
/locus_tag="myNewPrefix_00001"
(the next entry should be ..."myNewPrefix_00002" and so on). Alternatively, the entry could also be without leading zeros.
As far as I know, sed cannot do arithmetic (like subtracting a constant number). Any ideas how I can solve this?
Thank you very much. If the question is unclear, please let me know and I will try to improve it.
EDIT: Sometimes the same number occurs twice (this should also be the case in the modified file); for instance,
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12346"
/locus_tag="PREFIX_12347"
should be in the end
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00002"
/locus_tag="myNewPrefix_00003"
You may use awk:
awk -v pf='myNewPrefix' 'BEGIN{FS=OFS="="}
$1 ~ /\/locus_tag$/ && split($2, a, /_/) == 2 {
    $2 = sprintf("\"%s_%05d\"", pf, ((a[2] in seen) ? seen[a[2]] : (seen[a[2]] = ++i)))
} 1' file
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00002"
/locus_tag="myNewPrefix_00003"
Check this Perl one-liner:
/tmp> cat littlebird.txt
abcdef
/locus_tag="PREFIX_12345"
hello hai
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12346"
/locus_tag="PREFIX_12347"
123 456
end
/tmp> perl -pe 'BEGIN{$r=qr/PREFIX_(\d+)"/} if (/$r/) { $kv2{$1} //= sprintf("%05d", ++$i); s/$r/PREFIX_$kv2{$1}"/ }' littlebird.txt
abcdef
/locus_tag="PREFIX_00001"
hello hai
/locus_tag="PREFIX_00001"
/locus_tag="PREFIX_00002"
/locus_tag="PREFIX_00003"
123 456
end
/tmp>

Replace column after performing actions using awk

Here I'm trying to remove the commas from the msg field only.
Input file ("abc.txt" has many entries as below):
alert tcp any any -> any [10,112,34] (msg:"Its an Test, Rule"; reference:url,view/Main; sid:1234; rev:1;)
Expected Output:
alert tcp any any -> any [10,112,34] (msg:"Its an Test Rule"; reference:url,view/Main; sid:1234; rev:1;)
This is what I have tried using awk:
awk -F ';' '{for(i=1;i<=NF;i++){if(match($i,"msg:")>0){split($i, array, "\"");tmessage=array[2];gsub("[',']","",tmessage);message=tmessage; }}print message}' abc.txt
The problem with having awk rewrite your fields is that output for modified lines will be field-separated by OFS, which is static.
The way around this is to avoid dealing with fields, and just handle the string replacement on $0. You could piece together the parts of the line manually, like this:
awk '{x=index($0,"msg:"); y=index(substr($0,x),";"); s=substr($0,x,y); gsub(/,/,"",s); print substr($0,1,x-1) s substr($0,x+y)}' input.txt
Or spelled out for easier reading:
{
    x = index($0, "msg:")          # find the offset of the interesting bit
    y = index(substr($0, x), ";")  # find the length of that bit
    s = substr($0, x, y)           # clip the bit
    gsub(/,/, "", s)               # remove the commas
    print substr($0, 1, x-1) s substr($0, x+y)   # print the result
}

bash: using awk to print lines if certain characters satisfy a condition

Before starting to explain my issue, I have to say that this is the first time I'm using bash and the awk command.
I have a file containing a lot of lines, and I am interested in printing some of these lines if certain characters of the line satisfy a condition. I already have a simple method that works, but I want to try awk to see if it can be faster. The command I'm trying was inspired by a colleague at work, but I don't fully understand it.
My file looks like :
# 15247.479
1 23775U 96005A 18088.90328565 -.00000293 +00000-0 +00000-0 0 9992
2 23775 014.2616 019.1859 0018427 174.9850 255.8427 00.99889926081074
# 15250.479
1 23775U 96005A 18088.35358271 -.00000295 +00000-0 +00000-0 0 9990
2 23775 014.2614 019.1913 0018425 174.9634 058.1812 00.99890136081067
The 4th field is a date, and I want to print the lines starting with 1 and 2 if that number (18088.90328565 and 18088.35358271 above) is greater than startDate and less than endDate.
I am trying with:
< $file awk ' BEGIN {ok=0}
{date=substring($0,19,10) if ($date>='$firstTime' && $date<= '$lastTime' ) {print; ok=1} else ok=0;next}
{if (ok) print}'
This returns a syntax error but I fear it is not the only problem. I don't really understand what the $0 in substring refers to.
Thanks everyone for the help!
Per the question about $0:
Awk is a language built for processing tables and has language features specific to both filtering and manipulating tabular data. One language feature is automatic field splitting.
If you see a $ in front of a variable or constant, it is referring to a "field." When awk sees $field_number being used in a variable context, awk splits the current record buffer based upon what is in the FS variable and allows you to work on that just as you would any other variable -- just that the backing store for that variable is the record buffer.
$0 is a special field referring to the whole of the record buffer. There are some interesting notes in the awk documentation about the side effects on $0 of assigning $field_number variables, FS and OFS that are worth an in depth read.
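For example, with the default whitespace field separator:
$ echo 'alpha beta gamma' | awk '{ print $2 }'
beta
$ echo 'alpha beta gamma' | awk '{ $2 = "BETA"; print $0 }'
alpha BETA gamma
Assigning to $2 makes awk rebuild $0 from the fields, joined by OFS -- one of the side effects mentioned above.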
Here is my answer to your application:
(1) First, LC_ALL may help us for speed. I'm using ll/ul for lower and upper limits -- the reason for which will be apparent later. Specifying them as variables outside the script helps our readability. It is good practice to properly quote shell variables.
(2) It is good practice to use BEGIN { ... }, as you did in your attempt, to formally initialize variables. If using gawk, we can use LINT = 1 to test things like this.
(3) /^#/ is probably the simplest (and fastest) pattern for our reset. We use next because we never want to apply the limits to this line and we never want to see this line in our output (even if ll = ul = "").
(4) It is surprisingly easy to make a mistake on limits. Implement limits consistently one way, and our readers will thank us. We remember to check corner cases where ll and/or ul are blank. One corner case is where we have already triggered our limits and we are waiting for /^#/ -- we don't want to rescan the limits again while ok.
(5) The default action of a pattern is to print.
(6) Remembering to quote our filename variable will save us someday when we inevitably encounter the stray "$file" with spaces in the name.
LC_ALL=C awk -v ll="$firstTime" -v ul="$lastTime" '                   # (1)
    BEGIN { ok = 0 }                                                  # (2)
    /^#/ { ok = 0; next }                                             # (3)
    !ok  { ok = (ll == "" || ll <= $4) && (ul == "" || $4 <= ul) }    # (4)
    ok   # <- print if ok                                             # (5)
' "$file"                                                             # (6)
You're missing a ; between the variable assignment and the if (also, awk's function is named substr(), not substring()). Instead of concatenating shell variables into the script, assign them to awk variables with -v. There's no need to initialize ok=0; uninitialized variables are automatically treated as false. And if you want to access a field of the input, use $n, where n is the field number, rather than substr().
You need to set ok=0 when you get to the next line beginning with #, otherwise you'll just keep printing the rest of the file.
awk -v firstTime="$firstTime" -v lastTime="$lastTime" '
    NF > 3 && $4 > firstTime && $4 <= lastTime { ok = 1 }
    $1 == "#" { ok = 0 }
    ok { print }' "$file"
This answer is based upon my original, but takes into account some new information that #clem sent us in a comment -- namely, that the line we need to test always immediately follows the line matching /^#/. Therefore, when we match in this new solution, we immediately do a getline to grab the next line and set ok based upon that next line's data. We now only check against the limits on the line immediately following our match, and we do not check against the limits on lines where we shouldn't.
LC_ALL=C awk -v ll="$firstTime" -v ul="$lastTime" '
    BEGIN { ok = 0 }
    /^#/ {
        getline
        ok = (ll == "" || ll <= $4) && (ul == "" || $4 <= ul)
    }
    ok   # <- print if ok
' "$file"

Removing lines between tags in a text file

I have many text files containing annotations. The original text is marked with lines containing the words:
START OF TEXT OF PASSAGE 1
END OF TEXT OF PASSAGE 1
Obviously I can search each document for the phrase START OF TEXT and delete everything up to it. Then search for END OF TEXT and start selecting text for deletion until I get to the next START OF TEXT.
I have come up with this design so far:
#!/bin/bash
a="START OF PROJECT"
b="END OF PROJECT"
capture=0
while read -r line; do
    if [[ $line == *"$a"* ]]; then
        capture=1
    elif [[ $line == *"$b"* ]]; then
        capture=0
    elif (( capture )); then
        printf '%s\n' "$line" >> output.txt
    fi
done < input.txt   # one of the marked-up documents
Perhaps there is an easier way using sed, awk, grep and pipes?
'for every document' 'loop through it doing this' ('find the original text between START and END' | >> output.txt)
Unfortunately I am poor at bash and ignorant of sed/awk.
The reason for this is that I am assembling a huge text document that is a concatenation of thousands of marked up documents – each of which contains some annotated passages.
In Python:
import re
with open('in.txt') as f, open('out.txt', 'w') as output:
    output.write('\n'.join(re.findall(r'START OF TEXT(.*?)END OF TEXT', f.read(), re.S)))
This reads the input, searches for all matches that begin and end with the necessary markers, captures the text of interest in a group, joins all those groups on a linefeed, and writes that to the result file. The re.S (DOTALL) flag lets . match newlines, so each captured passage can span multiple lines.
Pretty easy to do with awk. You would create a script (I'll call it yank.awk) containing this:
#!/usr/bin/awk -f
/START OF PROJECT/ { capture = 1; next }
/END OF PROJECT/ { capture = 0 }
capture == 1 { print }
and then make it executable and run it like so:
chmod +x yank.awk
./yank.awk in.txt > output.txt
Could also do with sed and grep:
sed -ne '/START OF PROJECT/,/END OF PROJECT/p' in.txt | grep -vE '(START|END) OF PROJECT' > output.txt
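For instance, with a small hypothetical in.txt using the asker's markers:
$ cat in.txt
notes
START OF PROJECT
keep this line
END OF PROJECT
more notes
$ sed -ne '/START OF PROJECT/,/END OF PROJECT/p' in.txt | grep -vE '(START|END) OF PROJECT'
keep this line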
(Another Python solution)
You can have itertools.groupby group lines together based on a boolean value - just use a global flag to keep track of whether you are in a block or not, and then use groupby to group the lines that are in or out of blocks. Then just discard the ones that are not blocks:
sample_lines = """
lskdjflsdkjf
sldkjfsdlkjf
START OF TEXT
Asdlkfjlsdkfj
Bsldkjf
Clsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
START OF TEXT
Dsdlkfjlsdkfj
Esldkjf
Flsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
""".splitlines()
from itertools import groupby
in_block = False
def is_in_block(line):
    global in_block
    if line.startswith("END OF TEXT"):
        in_block = False
    ret = in_block
    if line.startswith("START OF TEXT"):
        in_block = True
    return ret

for lines_are_text, lines in groupby(sample_lines, key=is_in_block):
    if lines_are_text:
        print(list(lines))
gives:
['Asdlkfjlsdkfj', 'Bsldkjf', 'Clsdkjf']
['Dsdlkfjlsdkfj', 'Esldkjf', 'Flsdkjf']
See that first group has the lines that start with A, B, and C, and the second group is made up of those lines starting with D, E, and F.
It sounds like the specific solution you need is:
awk '/END OF TEXT OF PASSAGE/{f=0} f; /START OF TEXT OF PASSAGE/{f=1}' file
See https://stackoverflow.com/a/18409469/1745001 for other ways to select text from files.
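A quick demo of the one-liner above (hypothetical file):
$ printf 'START OF TEXT OF PASSAGE 1\nfoo\nEND OF TEXT OF PASSAGE 1\n' > file
$ awk '/END OF TEXT OF PASSAGE/{f=0} f; /START OF TEXT OF PASSAGE/{f=1}' file
foo
Because the END rule runs before the bare f pattern and the START rule runs after it, the marker lines themselves are never printed.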
Use Perl's Flip-Flop Operator to Print Text Between Markers
Given a corpus like:
START OF TEXT OF PASSAGE 1
foo
END OF TEXT OF PASSAGE 1
START OF TEXT OF PASSAGE 2
bar
END OF TEXT OF PASSAGE 2
you can use the Perl flip-flop operator to process within a range of lines. For example, from the shell prompt:
$ perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
      next if /^(?:START|END)/;
      print;
  }' /tmp/corpus
foo
bar
Basically, this short Perl script loops through your input. When it finds your start and end tags, it throws away the tags themselves and prints everything else in between.
Usage Notes
The line breaks between passages in the corpus are for readability. It doesn't matter if your real corpus has no line breaks between passages, so long as the text markers always start at the beginning of the line as shown in your original post. If that assumption doesn't hold true, then you will need to adjust the regular expressions used to identify the start and end of your passages.
You can pass multiple files to the Perl script. Again, it makes no practical difference as long as you don't exceed your shell's command-line length limit.
If you want the final output to go to somewhere other than standard output, just use shell redirection. For example:
perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
    next if /^(?:START|END)/;
    print;
}' /tmp/file1 /tmp/file2 /tmp/file3 > /tmp/output
You can use sed as follows:
sed -n '/^START OF TEXT/,/^END OF TEXT/{/^\(START\|END\) OF TEXT/!p}' infile
or, with extended regular expressions (-r):
sed -rn '/^START OF TEXT/,/^END OF TEXT/{/^(START|END) OF TEXT/!p}' infile
-n prevents sed from printing by default. The rest works as follows:
/^START OF TEXT/,/^END OF TEXT/ {    # for lines between these two matches
    /^\(START\|END\) OF TEXT/!p      # if the line does NOT match a marker, print it
}
This works with GNU sed and might require some tweaking to run with other seds.

Bash script replace two fields in a text file using variables

This should be a simple fix but I cannot wrap my head around it at the moment.
I have a comma-delimited file called my_course that contains a list of courses with some information about them.
I need to get user input about the last two fields and change them accordingly.
Each line is constructed like:
CourseNumber,CourseTitle,CreditHours,Status,Grade
Example file:
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
I get the user input for 3 things: Course Number, Status (0 or 1), and Grade (A,B,C,N/A)
So far I have tried matching the line containing the course number and changing the last two fields. I haven't been able to figure out how to modify the last two fields using sed, so I'm using this horrible jumble of awk and sed:
temporary=$(awk -v status=$status -v grade=$grade '
BEGIN { FS="," }; $(NF)=""; $(NF-1)="";
/'$cNum'/ {printf $0","status","grade;}' my_course)
sed -i "s/CSC$cNum.*/$temporary/g" my_course
The issue that I'm running into here is that the number of words in the course title can range from 1 to 4, so I can't just print the first n fields. I've tried removing the last two fields and appending the new values for status and grade, but that isn't working for me.
Note: I have already done checks to ensure that the user inputs valid data.
Use a simple awk script:
BEGIN {
    FS = ","
    OFS = FS
}
$0 ~ course {
    $(NF-1) = status
    $NF = grade
}
{ print }
and on the command line, pass the three parameters course, status and grade as awk variables.
in action:
$ cat input
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
$ awk -vcourse="CSC3210" -vstatus="1" -vgrade="A" -f grades.awk input
CSC3210,COMPUTER ORG & PROGRAMMING,3,1,A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
$ awk -vcourse="CSC1010" -vstatus="1" -vgrade="B" -f grades.awk input
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,1,B
It doesn't matter how many commas you have in the course name as long as you only look at the last two commas:
sed -i "/CSC$cNum/ s/.,[^,]*\$/$status,$grade/" my_course
The trick is to use $ in the pattern to match the end of the line; it is written \$ here so the shell doesn't try to expand it inside the double quotes.
And don't bother building the "temporary" line - apply the substitution only to the line that matches the course number.
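For example, with hypothetical user input (dropping -i so the result goes to standard output instead of rewriting the file):
$ cNum=3210 status=1 grade=A
$ sed "/CSC$cNum/ s/.,[^,]*\$/$status,$grade/" my_course
CSC3210,COMPUTER ORG & PROGRAMMING,3,1,A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A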
