Merging Two, Nearly Similar Text Files - shell

Suppose we have ~/file1:
line1
line2
line3
...and ~/file2:
line1
lineNEW
line3
Notice that these two files are nearly identical, except that line2 differs from lineNEW.
Question: How can I merge these two files to produce one that reads as follows:
line1
line2
lineNEW
line3
That is, how can I merge the two files so that all unique lines are captured (without overlap) into a third file? Note that the order of the lines doesn't matter (as long as all unique lines are being captured).

awk '{
print
getline line < second
if ($0 != line) print line
}' second=file2 file1
will do it
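A minimal sketch of how this could be tried out, assuming the two sample files from the question are recreated in a scratch directory (the directory and the variable name merged are only illustrative). The approach compares the files line by line, so it assumes both files have the same number of lines; the extra check on the return value of getline is an addition that avoids printing a stale line if file2 is shorter.

```shell
# Recreate the two sample files in a throwaway directory (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' line1 line2 line3   > "$tmp/file1"
printf '%s\n' line1 lineNEW line3 > "$tmp/file2"

# Print each line of file1; read the matching line of file2 and
# print it as well when the two differ.
merged=$(awk '{
    print
    if ((getline line < second) > 0 && $0 != line) print line
}' second="$tmp/file2" "$tmp/file1")

printf '%s\n' "$merged"
rm -rf "$tmp"
```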

Consider the command below. It is more robust since it also works for files where a new line has been added instead of replaced (see f1 and f2 below).
First, I executed it using your files. I divided the command(s) into two lines so that it fits nicely in the "code block":
$ (awk '{ print NR, $0 }' file1; awk '{ print NR, $0 }' file2) |\
sort -k 2 | uniq -f 1 | sort -n | cut -d " " -f 2-
It produces your expected output:
line1
line2
lineNEW
line3
I also used these two extra files to test it:
f1:
line1 stuff after a tab
line2 line2
line3
line4
line5
line6
f2:
line1 stuff after a tab
lineNEW
line2 line2
line3
line4
line5
line6
Here is the command:
$ (awk '{ print NR, $0 }' f1; awk '{ print NR, $0 }' f2) |\
sort -k 2 | uniq -f 1 | sort -n | cut -d " " -f 2-
It produces this output:
line1 stuff after a tab
line2 line2
lineNEW
line3
line4
line5
line6

When you do not care about the order, just sort them:
sort -u ~/file1 ~/file2 > ~/file3
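If the original order should be kept while still dropping duplicates, a common awk idiom can be used instead of sorting. This is a different technique from the sort -u answer, shown here only as a sketch; the scratch files and the merged variable are illustrative.

```shell
# Recreate the sample files in a throwaway directory (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' line1 line2 line3   > "$tmp/file1"
printf '%s\n' line1 lineNEW line3 > "$tmp/file2"

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true exactly once per distinct line, and that line is printed.
merged=$(awk '!seen[$0]++' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

With the sample files this keeps the lines of file1 in order and appends lineNEW at the end, which is acceptable here because the question states that order does not matter.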

Related

Move all lines between header and footer into new file in unix

I have file records like below, header, data & footer records.
I need to move only data part to another file. New file should only contain lines between Header2 and Footer1.
I have tried head -n 30 filename | tail -n 10 > newfile,
but that does not work reliably because the number of data records may vary.
Example records from the source file:
Header1
Header2
Header3
SEQ++1
line1
line2
SEQ++2
line1
SEQ++3
line1
line2
line3
Footer1
Footer2
Footer3
Output file should have:
SEQ++1
line1
line2
SEQ++2
line1
SEQ++3
line1
line2
line3
There are different ways.
grep:
grep -v -E "Header|Footer" source.txt
awk:
awk '! /Header.|Footer./ { print }' source.txt
You can replace the "Header" and "Footer" values with whatever you use to identify those lines.
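The patterns above drop every line that mentions Header or Footer anywhere in the line. A slightly stricter sketch, assuming that header and footer lines start with those words and that all footers come after the data, anchors the match and stops reading at the first footer. The file name, its shortened contents, and the data variable are illustrative.

```shell
# Build a shortened version of the sample file in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' Header1 Header2 Header3 'SEQ++1' line1 line2 \
              'SEQ++2' line1 Footer1 Footer2 Footer3 > "$tmp/source.txt"

# Stop reading at the first footer line; before that, print every
# line that does not start with "Header".
data=$(awk '/^Footer/ { exit } !/^Header/' "$tmp/source.txt")

printf '%s\n' "$data"
rm -rf "$tmp"
```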

can you print a record in awk [duplicate]

Probably a simple question but I haven't found an answer. I have a file with multiple records separated by a blank line. Each field in the file is separated by a newline. I simply want to print out the entire first record or the entire third record.
awk 'BEGIN{FS="";} {print $1}' output.txt
The above prints out the first letter of each line of the first record
awk 'BEGIN{FS="\n"; RS=""} {print $1}' output.txt
The above prints out the first field of the first record.
It seems a simple enough problem but I can't seem to solve it. Records have an indeterminate number of fields (lines). They are simply separated by a blank line.
OK, here is a sample:
line1 record1
line2 record1
line3 record1
line4 record1

line1 record2
line2 record2
line3 record2
line4 record2
line5 record2

line1 record3

line1 record4
line2 record4
Now I want the entire first record and the entire 3rd record.
awk 'NR==1 || NR==3 {print $0}' output.txt
line1 record1
line3 record1
First and third lines of first record. no good
awk 'NR==1 || NR==3' output.txt
line1 record1
line3 record1
First and third lines of first record. no good
awk 'NR==1 || NR==3 {print $0}' output.txt
line1 record1
line3 record1
First and 3rd line of the first record. no good.
awk 'BEGIN{FS="\n"; RS=""} NR==1' output.txt
line1 record1
line2 record1
line3 record1
line4 record1
line1 record2
line2 record2
line3 record2
line4 record2
line5 record2
line1 record3
line1 record4
line2 record4
All printed out. no good
I simply want the first and third records.
The first being:
line1 record1
line2 record1
line3 record1
line4 record1
and the third being:
line1 record3
Ok so nothing spelled out seems to work for me and I'm well confused. Here is the shell output:
$ awk -v RS= -v ORS='\n\n' 'NR ~ /^(1|3)$/' output.txt
line1 record1
line2 record1
line3 record1
line4 record1
line1 record2
line2 record2
line3 record2
line4 record2
line5 record2
line1 record3
line1 record4
line2 record4
$ cat output.txt
line1 record1
line2 record1
line3 record1
line4 record1

line1 record2
line2 record2
line3 record2
line4 record2
line5 record2

line1 record3

line1 record4
line2 record4
$
I am very confused as to why this isn't working.
here is my system and the awk I'm using:
$ awk -V | head -1
GNU Awk 4.0.1
$ uname -a
Linux IEDUB2TJ5262 3.13.0-68-generic #111-Ubuntu SMP Fri Nov 6 18:17:06 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$
Is there something I could be missing here?
A
This is THE idiomatic awk way to do what you want and it works in all awks, not just gawk:
$ awk -v RS= -v ORS='\n\n' 'NR ~ /^(1|3)$/' file
line1 record1
line2 record1
line3 record1
line4 record1

line1 record3
See http://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line and google awk paragraph mode.
If the above does not work for you then there is something wrong with your input file (or, far less likely, your awk is broken).
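One common way an input file can defeat paragraph mode is when the separator lines are not truly empty, for example because they contain spaces or a carriage return from Windows line endings. The sketch below assumes that this is the problem and strips trailing whitespace first; the sample file is illustrative, and whether this matches the actual file in the question is not known.

```shell
tmp=$(mktemp -d)
# The separator lines here contain a space or a carriage return, so they
# are not empty and paragraph mode does not treat them as separators.
printf 'a1\na2\n \nb1\n\r\nc1\n' > "$tmp/output.txt"

# Count the records awk sees before and after cleaning.
before=$(awk -v RS= 'END { print NR }' "$tmp/output.txt")
after=$(sed 's/[[:space:]]*$//' "$tmp/output.txt" |
        awk -v RS= 'END { print NR }')

# Print records 1 and 3 from the cleaned input.
records=$(sed 's/[[:space:]]*$//' "$tmp/output.txt" |
          awk -v RS= -v ORS='\n\n' 'NR == 1 || NR == 3')

printf 'records before: %s, after: %s\n' "$before" "$after"
printf '%s\n' "$records"
rm -rf "$tmp"
```

If the record count changes after cleaning, the separator lines contained invisible characters.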
awk 'NR==1 || NR==3 {print $0}' output.txt
Everything before the curly braces is the pattern, which acts as a restriction.
It determines whether the action (everything within the curly braces) is executed. NR is the number of the current record, so the output is restricted to the first and third records.
Every awk program is just a collection of patterns and actions.
EDIT:
Actually, I just realized that {print $0} is the default action when no action is given, which means that:
awk 'NR==1 || NR==3' output.txt
is sufficient.
EDIT:
After you've explained yourself a bit more, I suggest this:
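awk 'BEGIN {RS="\n\n"} NR==1 || NR==3' output.txt
It treats everything that is separated by two consecutive newlines as one record. Note that a multi-character RS is not supported by every awk; RS="" is the portable way to split on blank lines.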
awk 'BEGIN{FS="\n"; RS=""} NR==1' output.txt
prints the first record.

extract different lines from files using Bash

I have two files and I use the "comm -23 file1 file2" command to extract the lines that appear in file1 but not in file2.
I would also need something that extracts the different lines but also preserves the string "line_$NR".
Example:
file1:
line_1: This is line0
line_2: This is line1
line_3: This is line2
line_4: This is line3
file2:
line_1: This is line1
line_2: This is line2
line_3: This is line3
I need this output:
differences file1 file2:
line_1: This is line0.
In conclusion, I need to extract the differences as if the files did not have line_$NR at the beginning, but when I print the result I also need to print line_$NR.
Try using awk
awk -F: 'NR==FNR {a[$2]; next} !($2 in a)' file2 file1
Output:
line_1: This is line0
Short Description
awk -F: ' # Set the field separator to ":"; $1 holds "line_<n>" and $2 holds " This is line<m>"
NR==FNR { # True only while NR equals the per-file record number FNR, i.e. while the first file (file2) is being read
a[$2] # Store $2 as a key in the associative array a
next # Skip the remaining rules and go to the next record
}
!($2 in a) # Print a line of file1 if its $2 is not a key of array a
' file2 file1
Additional Requirement (In comment)
And if I have multiple ":" in a line : "line_1: This :is: line0"
doesn't work. How can I only take the line_x
In that case, try the following (GNU awk only)
awk -F'line_[0-9]+:' 'NR==FNR {a[$2]; next} !($2 in a)' file2 file1
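For lines that contain further colons, another option is to leave the field separator alone and build the comparison key by stripping the leading line_<n>: label from the whole line. This is a sketch of that alternative, not the method from the answer; the sample files and the names key, seen and diff_lines are illustrative.

```shell
# Sample files with extra colons in the text (hypothetical contents).
tmp=$(mktemp -d)
printf '%s\n' 'line_1: This :is: line0' 'line_2: This :is: line1' \
              'line_3: This :is: line2' > "$tmp/file1"
printf '%s\n' 'line_1: This :is: line1' 'line_2: This :is: line2' \
              > "$tmp/file2"

# key is the line without its leading "line_<n>:" label.
# While reading file2 (NR == FNR) remember every key; for file1,
# print the original line when its key was never seen.
diff_lines=$(awk '
    { key = $0; sub(/^line_[0-9]+:/, "", key) }
    NR == FNR { seen[key]; next }
    !(key in seen)
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$diff_lines"
rm -rf "$tmp"
```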
This awk line is longer, but it works no matter where the differences are located:
awk 'NR==FNR{a[$NF]=$0;next}a[$NF]{a[$NF]=0;next}7;END{for(x in a)if(a[x])print a[x]}' file1 file2
test:
kent$ head f*
==> f1 <==
line_1: This is line0
line_2: This is line1
line_3: This is line2
line_4: This is line3
==> f2 <==
line_1: This is line1
line_2: This is line2
line_3: This is line3
#test f1 f2
kent$ awk 'NR==FNR{a[$NF]=$0;next}a[$NF]{a[$NF]=0;next}7;END{for(x in a)if(a[x])print a[x]}' f1 f2
line_1: This is line0
#test f2 f1:
kent$ awk 'NR==FNR{a[$NF]=$0;next}a[$NF]{a[$NF]=0;next}7;END{for(x in a)if(a[x])print a[x]}' f2 f1
line_1: This is line0

Omit the last line with sed

I have the following file content.
2013-07-30 debug
line1
2013-07-30 info
line2
line3
2013-07-30 debug
line4
line5
I want to get the following output with sed.
2013-07-30 info
line2
line3
This command gives me nearly the output I want
sed -n '/info/I,/[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}/{p}' myfile.txt
2013-07-30 info
line2
line3
2013-07-30 debug
How do I omit the last line here?
IMO, sed starts to become unwieldy as soon as you have to add conditions into it. I realize you did not tag the question with awk, but here is an awk program to print only "info" sections.
awk -v type="info" '
$1 ~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/ {p = ($2 == type)}
p
' myfile.txt
2013-07-30 info
line2
line3
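The same flag-based idea can be written without interval expressions such as {4}, which some older awk implementations do not support. A minimal sketch, assuming the sample log from the question is recreated in a scratch file (the path and the section variable are illustrative):

```shell
tmp=$(mktemp -d)
printf '%s\n' '2013-07-30 debug' line1 '2013-07-30 info' line2 line3 \
              '2013-07-30 debug' line4 line5 > "$tmp/myfile.txt"

# A line whose first field looks like a date starts a new section;
# p records whether that section has the wanted type. Every line is
# printed while p is true.
section=$(awk -v type=info '
    $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { p = ($2 == type) }
    p
' "$tmp/myfile.txt")

printf '%s\n' "$section"
rm -rf "$tmp"
```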
Try:
sed -n '/info/I p; //,/[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}/{ //! p}' myfile.txt
It prints the first match, and within the range it omits both edges. Since the first edge has already been printed, only the second one is skipped. It yields:
2013-07-30 info
line2
line3
This might work for you (GNU sed):
sed -r '/info/I{:a;n;/^[0-9]{4}(-[0-9]{2}){2}/!ba;s/^/\n/;D};d' file
or if you prefer:
sed '/info/I{:a;n;/^....-..-.. /!ba;s/^/\n/;D};d' file
N.B. This caters for consecutive patterns

Separate by blank lines in bash

I have an input like this:
Block 1:
line1
line2
line3
line4

Block 2:
line1
line2

Block 3:
line1
line2
line3
This is just an example. Is there an elegant way to print only Block 2 and its lines without relying on the block names? Something like "separate the blocks by the blank lines and print the second block".
try this:
awk '!$0{i++;next;}i==1' yourFile
For better performance, you can also add an exit once the 2nd block has been processed:
awk '!$0{i++;next;}i==1;i>1{exit;}' yourFile
test:
kent$ cat t
Block 1:
line1
line2
line3
line4

Block 2:
line1
line2

Block 3:
line1
line2
line3
kent$ awk '!$0{i++;next;}i==1' t
Block 2:
line1
line2
kent$ awk '!$0{i++;next;}i==1;i>1{exit;}' t
Block 2:
line1
line2
Set the record separator to the empty string to separate on blank lines. To
print the second block:
$ awk -v RS= 'NR==2{ print }'
(Note that this only separates on lines that do not contain any whitespace.
A line containing only white space is not considered a blank line.)
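A small sketch of the paragraph-mode approach wrapped in a shell function, assuming the blocks are separated by truly empty lines. The function name print_block, the sample file, and the block variable are illustrative.

```shell
# print_block N FILE: print the N-th blank-line-separated block of FILE,
# then stop reading so later blocks are not processed.
print_block() {
    awk -v RS= -v n="$1" 'NR == n { print; exit }' "$2"
}

tmp=$(mktemp -d)
printf 'Block 1:\nline1\n\nBlock 2:\nline1\nline2\n\nBlock 3:\nline1\n' > "$tmp/t"

block=$(print_block 2 "$tmp/t")
printf '%s\n' "$block"
rm -rf "$tmp"
```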
