Extract different lines from files using Bash

I have two files, and I use the "comm -23 file1 file2" command to extract the lines that appear in file1 but not in file2.
I also need something that extracts the differing lines but preserves the "line_$NR" prefix.
Example:
file1:
line_1: This is line0
line_2: This is line1
line_3: This is line2
line_4: This is line3
file2:
line_1: This is line1
line_2: This is line2
line_3: This is line3
I need this output:
differences file1 file2:
line_1: This is line0.
In short, I need to compare the files as if the line_$NR prefix were not there, but the printed result must still include it.

Try using awk
awk -F: 'NR==FNR {a[$2]; next} !($2 in a)' file2 file1
Output:
line_1: This is line0
Short Description
awk -F: ' # Set the field separator to ":". $1 contains line_<n> and $2 contains " This is line<m>"
NR==FNR { # If the total record number equals the per-file record number, i.e. the first file (file2) is being parsed
a[$2]; # store $2 as a key in associative array a
next # Do not process further. Go to the next record.
}
!($2 in a) # Print a line of file1 if its $2 is not a key of array a
' file2 file1
Additional Requirement (In comment)
And what if a line contains multiple ":" characters, e.g. "line_1: This :is: line0"?
Then it doesn't work. How can I split off only the line_x part?
In that case, try the following (GNU awk only)
awk -F'line_[0-9]+:' 'NR==FNR {a[$2]; next} !($2 in a)' file2 file1
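If a regular-expression field separator is not an option, the same comparison can be sketched by stripping the label with sub() and comparing on what is left. This is only a sketch: the function name diff_ignoring_label is made up for illustration, and it assumes every label matches ^line_[0-9]+: exactly.

```shell
# Hypothetical helper: print the lines of FILE1 whose text after the
# "line_N:" label does not occur (after its own label) anywhere in FILE2.
# usage: diff_ignoring_label FILE1 FILE2
diff_ignoring_label() {
  awk '
    {
      key = $0
      sub(/^line_[0-9]+:/, "", key)   # compare on everything after the label
    }
    NR == FNR { seen[key]; next }     # first awk argument (FILE2): remember keys
    !(key in seen)                    # second awk argument (FILE1): print unseen lines
  ' "$2" "$1"
}
```

Because the key is everything after the label, extra ":" characters in the text do not matter.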

This awk line is longer, but it works no matter which of the two files contains the differing lines:
awk 'NR==FNR{a[$NF]=$0;next}a[$NF]{a[$NF]=0;next}7;END{for(x in a)if(a[x])print a[x]}' file1 file2
test:
kent$ head f*
==> f1 <==
line_1: This is line0
line_2: This is line1
line_3: This is line2
line_4: This is line3
==> f2 <==
line_1: This is line1
line_2: This is line2
line_3: This is line3
#test f1 f2
kent$ awk 'NR==FNR{a[$NF]=$0;next}a[$NF]{a[$NF]=0;next}7;END{for(x in a)if(a[x])print a[x]}' f1 f2
line_1: This is line0
#test f2 f1:
kent$ awk 'NR==FNR{a[$NF]=$0;next}a[$NF]{a[$NF]=0;next}7;END{for(x in a)if(a[x])print a[x]}' f2 f1
line_1: This is line0
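For readers who find the one-liner dense, here is a spelled-out sketch of the same idea. It is not the answer's exact code: it tracks matches in a second array instead of overwriting values with 0, and the function name symdiff_by_last_field is invented for illustration. Like the one-liner, it keys on the last whitespace-separated field.

```shell
# Hypothetical helper: print lines whose last field occurs in only one file.
# usage: symdiff_by_last_field FILE_A FILE_B
symdiff_by_last_field() {
  awk '
    NR == FNR { first[$NF] = $0; next }        # FILE_A: remember each line by its last field
    ($NF in first) { matched[$NF]; next }      # FILE_B: key exists in both files, skip it
    { print }                                  # FILE_B: key only in FILE_B
    END {
      for (k in first)                         # FILE_A keys never matched by FILE_B
        if (!(k in matched)) print first[k]    # (iteration order is unspecified)
    }
  ' "$1" "$2"
}
```

Lines unique to the second file come out first, in input order; lines unique to the first file follow in whatever order awk iterates the array.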

Related

Comparing the data from one file to another and print the output

I have three files of name - File1, File2 and File3. The data of the three files is shown below:
File1:
// Class of "A2" of type "ONE".
// Class of "A3" of type "ONE".
// Class of "D1" of type "TWO".
// Class of "D2" of type "TWO".
// Class of "D3" of type "FOUR".
// Class of "D6" of type "FIVE."
File2:
#CLASS_NAMES = ("one",
"two",
"three",);
#CLASS_LIST_NAMES = ("ONE.A1",
"ONE.A2",
"ONE.A3",
"TWO.D1",
"TWO.D2");
File3:
D3
D4
D5
I need to check whether a class from File1, e.g. "D3", is present in the #CLASS_LIST_NAMES list in File2.
If it is not present there, I need to check whether D3 is present in File3.
If D3 is present in File3, the output should be PASS; if it is present in neither File2 nor File3, the output should be FAIL.
Similarly, I need to check every class from File1 (A2, A3, D1, D2, ...) against #CLASS_LIST_NAMES in File2 and, if it is missing there, against File3, and print PASS or FAIL.
I tried the below code:
#!/bin/bash
sed -n '/#CLASS_LIST_NAMES =/,/)/p' File2
I'm stuck here. Can anyone tell me what needs to be done next?
Desired output: since D6 from File1 is found in neither File2 nor File3, it should be reported as FAIL. The output should look like this:
Fail: D6 is not found
You can achieve this with grep and awk.
Use GNU grep, which supports the -P option:
awk 'NR==FNR{a[$0]; next} !($0 in a){print "Fail: "$0 " is not found"}' <(cat file3 <(grep -Po '(?<=\.)[^"]+' file2)) <(grep -Po '(?<=of ")\w+' file1)
If you want to extract only the class names that appear in the #CLASS_LIST_NAMES statement, use this one:
awk 'NR==FNR{a[$0]; next} !($0 in a){print "Fail: "$0 " is not found"}' <(cat file3 <(sed -n '/#CLASS_LIST_NAMES/,/;$/p' file2 | grep -Po '(?<=\.)[^"]+')) <(grep -Po '(?<=of ")\w+' file1)
If the number of spaces in file1 is not consistent, you can process it with awk:
# expects the variable in the 4th column; the input format must not change
awk 'NR==FNR{a[$0]; next} {gsub("\"","",$4)} !($4 in a){print "Fail: "$4" is not found"}' <(cat file3 <(sed -n '/#CLASS_LIST_NAMES/,/;$/p' file2 | grep -Po '(?<=\.)[^"]+')) file1
# alternative using FPAT (GNU awk) if the position of the field can change but it is always the first double-quoted value
awk 'NR==FNR{a[$0]; next} {gsub("\"","",$1)} !($1 in a){print "Fail: "$1" is not found"}' <(cat file3 <(sed -n '/#CLASS_LIST_NAMES/,/;$/p' file2 | grep -Po '(?<=\.)[^"]+')) FPAT="\"[^ \"]+" file1
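The question also asks for a PASS line for each class that is found. A minimal awk-only sketch of that, without grep -P or process substitution, could look like the following. The function name check_classes and the "PASS: X" / "FAIL: X" output format are assumptions made for illustration, and the sketch assumes none of the three files is empty.

```shell
# Hypothetical helper: report PASS/FAIL for every class named in FILE1.
# usage: check_classes FILE1 FILE2 FILE3
check_classes() {
  awk '
    FNR == 1 { idx++ }                          # idx counts which input file we are in
    idx == 1 {                                  # FILE2: collect X from "TYPE.X" in the list
      if ($0 ~ /#CLASS_LIST_NAMES/) inlist = 1
      if (inlist) {
        line = $0
        while (match(line, /"[^".]+\.[^"]+"/)) {
          entry = substr(line, RSTART + 1, RLENGTH - 2)   # drop the quotes
          sub(/^[^.]*\./, "", entry)                      # drop the TYPE. prefix
          known[entry]
          line = substr(line, RSTART + RLENGTH)
        }
        if ($0 ~ /\);/) inlist = 0              # the list ends at ");"
      }
      next
    }
    idx == 2 { if ($0 != "") known[$0]; next }  # FILE3: one class name per line
    match($0, /of "[^"]+"/) {                   # FILE1: first quoted value after "of"
      name = substr($0, RSTART + 4, RLENGTH - 5)
      status = (name in known) ? "PASS" : "FAIL"
      print status ": " name
    }
  ' "$2" "$3" "$1"
}
```

With the sample files from the question this reports PASS for A2, A3, D1, D2 and D3, and FAIL for D6.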

Add sequential number at the beginning of files

I have 5 files, and I want to add a sequential number and a tab at the beginning of each line. The numbering of the second file should continue from the last number of the first file, and so on. Here's an example:
file1
line1
line2
....
line13
file2
line1
line2
file5
line1
line2
Output file1
1 line1
........
13 line13
output file2
14 line1
15 line2
And so on
If you want to concatenate the files and number the lines, use cat:
cat -n file1 file2 file3 file4 file5
If you want to create a separate output file for each input file, use awk:
awk '{
printf "%d\t%s\n",NR,$0 > ("output_"FILENAME)
}' file1 file2 file3 file4 file5
This reads file1..5, numbers the lines, and writes them to output_file1..5. Note that with too many input files the awk command above will fail with an error like "too many open file descriptors". In that case use the following, which closes the previous output file whenever the input file changes.
awk '
FILENAME!=f{close("output_"f);f=FILENAME}
{printf "%d\t%s\n",NR,$0 > ("output_"f)}
' file1 file2 file3 file4 file5
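One caveat with building the output name as "output_"FILENAME: if an input is given with a directory part (say dir/file1), the result is output_dir/file1, which is probably not what is wanted. A hedged sketch that strips the directory and writes into a chosen output directory follows; the function name number_across_files and the OUTDIR argument are made up for illustration.

```shell
# Hypothetical helper: number lines continuously across files, writing
# OUTDIR/output_<basename> for each input file.
# usage: number_across_files OUTDIR FILE...
number_across_files() {
  outdir=$1
  shift
  awk -v outdir="$outdir" '
    FNR == 1 {                          # a new input file starts here
      if (out != "") close(out)         # close the previous output file
      base = FILENAME
      sub(/.*\//, "", base)             # keep only the part after the last slash
      out = outdir "/output_" base
    }
    { printf "%d\t%s\n", NR, $0 > out } # NR keeps counting across files
  ' "$@"
}
```

For example, number_across_files out in/file1 in/file2 would write out/output_file1 and out/output_file2, with the numbering of the second file continuing where the first stopped.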

Awk syntax error in loop

cat file1
xi=zaoshui jiao=#E0488_5#
chi=fan da qiu=#E0488_3#
gong=zuo you xi #E0977_5#
cat file2
#E0488_3# #E21562_3#
#E0488_5# #E21562_5#
#E0977_3# #E21630_3#
#E0977_5# #E21630_5#
#E0977_6# #E21631_1#
Purpose: if $NF of a line in file1 is found in $1 of file2, then replace $NF in file1 with $2 of file2. Otherwise, make no change.
My Code:
awk 'NR==FNR{a[$1]=$2;next}
{split($NF,a,"=");for($NF in a){$NF=a[$NF]}}1' test2.txt test1.txt
But it produces an error:
awk: cmd. line:1: NR==FNR{a[$1]=$2;next}{split($NF,a,"=");for($NF in a){$NF=a[$NF]}}1
awk: cmd. line:1: ^ syntax error
Does my code look right? It seems to be a syntax issue. How can I fix it?
My expect output:
xi=zaoshui jiao=#E21562_5#
chi=fan da qiu=#E21562_3#
gong=zuo you xi #E21630_5#
for($NF in a) is not valid syntax: $NF yields a value, but the loop needs a variable name.
It should look like
for (var in array)
body
Read more: GNU AWK manual, Scanning an Array
I used sub($NF,a[$NF]) to retain your original field separators, since the last field is preceded by a space in the last record but by = in the other lines. This assumes the value does not occur anywhere in the line other than in the last field.
Test Results:
$ cat file1
xi=zaoshui jiao=#E0488_5#
chi=fan da qiu=#E0488_3#
gong=zuo you xi #E0977_5#
$ cat file2
#E0488_3# #E21562_3#
#E0488_5# #E21562_5#
#E0977_3# #E21630_3#
#E0977_5# #E21630_5#
#E0977_6# #E21631_1#
$ awk 'FNR==NR{a[$1]=$NF;next}($NF in a){sub($NF,a[$NF])}1' file2 FS='[ =]' file1
xi=zaoshui jiao=#E21562_5#
chi=fan da qiu=#E21562_3#
gong=zuo you xi #E21630_5#
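Note that sub($NF, ...) treats the last field as a regular expression, so a value containing characters such as "." or "+" could match more than intended. A sketch of a literal replacement that cuts the last field off by length instead is shown below. The function name replace_last_field is hypothetical, and the sketch assumes lines carry no trailing whitespace.

```shell
# Hypothetical helper: replace the last field of each DATA line (fields are
# separated by a space or "=") with its mapping from MAP, if one exists.
# usage: replace_last_field MAP DATA
replace_last_field() {
  awk '
    NR == FNR { map[$1] = $2; next }          # MAP: old value -> new value
    {
      n = split($0, parts, "[ =]")            # split on a space or "="
      last = parts[n]
      if (last in map)                        # swap the tail without using a regex
        $0 = substr($0, 1, length($0) - length(last)) map[last]
      print
    }
  ' "$1" "$2"
}
```

Called as replace_last_field file2 file1, it leaves every separator in the line untouched because only the tail of the record is rewritten.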
I am not completely sure, but could you please try the following and let me know if it helps.
awk 'FNR==NR{a[$1]=$2;next} ($NF in a){sub($NF,a[$NF])} 1' file2 FS='[= ]' file1
The output will be as follows.
xi=zaoshui jiao=#E21562_5#
chi=fan da qiu=#E21562_3#
gong=zuo you xi #E21630_5#
EDIT: Adding an explanation as well.
awk '
FNR==NR{ ##Checking condition FNR==NR, which is TRUE while the first Input_file, file2, is being read.
a[$1]=$2; ##Creating an array named a whose index is $1 of the current line and whose value is $2 of the current line.
next ##next will skip all the further statements now.
}
($NF in a){ ##Checking whether the last field of the current line of Input_file file1 is present in array a; if yes, do the following.
sub($NF,a[$NF]) ##Substituting the last field in the line with the value stored in array a under that index; sub keeps the original separators, whereas assigning to $NF would rebuild the line with OFS.
}
1 ##1 will print the lines of Input_file file1.
' file2 FS='[= ]' file1 ##file2 is read with the default whitespace FS so that $1 and $2 are its two columns; FS is set to either = or space for file1.
My code is below. I hope it is helpful, even if it is not the most efficient answer.
awk '$NF ~ /=/ {gsub("="," # ",$NF)}{print $0}' file1 > file3
cat file3
xi=zaoshui jiao # #E0488_5#
chi=fan da qiu # #E0488_3#
gong=zuo you xi #E0977_5#
As you said, with file3 in place of file1: if $NF of file3 is found in $1 of file2, then replace $NF of file3 with $2 of file2.
awk 'NR==FNR {a[$1]=$2;next}($NF in a){$NF=a[$NF]}1' file2 file3 | sed 's/ # /=/g'
xi=zaoshui jiao=#E21562_5#
chi=fan da qiu=#E21562_3#
gong=zuo you xi #E21630_5#

awk: two files are queried

I have two files
file1:
>string1<TAB>Name1
>string2<TAB>Name2
>string3<TAB>Name3
file2:
>string1<TAB>sequence1
>string2<TAB>sequence2
I want to use awk to compare column 1 of respective files. If both files share a column 1 value I want to print column 2 of file1 followed by column 2 of file2. For example, for the above files my expected output is:
Name1<TAB>sequence1
Name2<TAB>sequence2
this is my code:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$2], $2 }' file1 file2 >out
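But all I get is an empty first column followed by the sequence.
Where is the error?
Your assignment is not right: a[$1] = $1 stores the key instead of the name in column 2, and a[$2] looks the array up by the wrong column. Store $2 and look it up by $1: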
$ awk 'BEGIN {FS=OFS="\t"}
NR==FNR {a[$1]=$2; next}
$1 in a {print a[$1],$2}' file1 file2
Name1 sequence1
Name2 sequence2
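As an alternative to awk, the standard join utility can express the same lookup. This is a different technique from the answer above, sketched under the assumption that both files are tab-separated with unique keys in column 1. The function name join_second_columns is invented, and the output comes out in key order rather than in file2 order.

```shell
# Hypothetical helper: for keys present in both files, print column 2 of
# FILE1 and column 2 of FILE2, separated by a tab.
# usage: join_second_columns FILE1 FILE2
join_second_columns() {
  tab=$(printf '\t')
  tmpdir=$(mktemp -d) || return 1
  # join needs both inputs sorted on the join field in the same collation
  LC_ALL=C sort -t "$tab" -k 1,1 "$1" > "$tmpdir/a"
  LC_ALL=C sort -t "$tab" -k 1,1 "$2" > "$tmpdir/b"
  # -o 1.2,2.2 selects field 2 of the first file, then field 2 of the second
  LC_ALL=C join -t "$tab" -o 1.2,2.2 "$tmpdir/a" "$tmpdir/b"
  rm -rf "$tmpdir"
}
```

The awk version is usually simpler for unsorted input; join is mostly worth considering when the files are already sorted.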

Merging Two, Nearly Similar Text Files

Suppose we have ~/file1:
line1
line2
line3
...and ~/file2:
line1
lineNEW
line3
Notice that these two files are nearly identical, except that line2 differs from lineNEW.
Question: How can I merge these two files to produce one that reads as follows:
line1
line2
lineNEW
line3
That is, how can I merge the two files so that all unique lines are captured (without overlap) into a third file? Note that the order of the lines doesn't matter (as long as all unique lines are being captured).
awk '{
print
getline line < second
if ($0 != line) print line
}' second=file2 file1
This will do it.
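The line-by-line getline approach assumes both files have the same number of lines. A small sketch that checks the return value of getline and prints whatever is left over in the second file might look like this; the function name merge_pairwise is hypothetical.

```shell
# Hypothetical helper: print each line of FILE1, followed by the line at the
# same position in FILE2 when it differs; extra FILE2 lines are appended.
# usage: merge_pairwise FILE1 FILE2
merge_pairwise() {
  awk -v second="$2" '
    {
      print
      # getline returns 1 on success, 0 at end of file, -1 on error
      if ((getline line < second) > 0 && line != $0) print line
    }
    END {
      while ((getline line < second) > 0) print line   # FILE2 was longer
    }
  ' "$1"
}
```

It still compares lines by position only, so an inserted line shifts every later comparison.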
Consider the command below. It is more robust since it also works for files where a new line has been added instead of replaced (see f1 and f2 below).
First, I ran it on your files. I split the command into two lines so that it fits nicely in the code block:
$ (awk '{ print NR, $0 }' file1; awk '{ print NR, $0 }' file2) |\
sort -k 2 | uniq -f 1 | sort -n | cut -d " " -f 2-
It produces your expected output:
line1
line2
lineNEW
line3
I also used these two extra files to test it:
f1:
line1 stuff after a tab
line2 line2
line3
line4
line5
line6
f2:
line1 stuff after a tab
lineNEW
line2 line2
line3
line4
line5
line6
Here is the command:
$ (awk '{ print NR, $0 }' f1; awk '{ print NR, $0 }' f2) |\
sort -k 2 | uniq -f 1 | sort -n | cut -d " " -f 2-
It produces this output:
line1 stuff after a tab
line2 line2
lineNEW
line3
line4
line5
line6
When you do not care about the order, just sort them:
cat ~/file1 ~/file2 | sort -u > ~/file3
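If the order of first appearance should be kept rather than sorted order, a common awk idiom can be used instead. A minimal sketch, with merge_unique as an invented function name:

```shell
# Hypothetical helper: print every distinct line from the given files once,
# in order of first appearance.
# usage: merge_unique FILE...
merge_unique() {
  # seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
  # is true exactly once per distinct line
  awk '!seen[$0]++' "$@"
}
```

For the two sample files this prints line1, line2, line3 and then lineNEW, since lineNEW is first seen in the second file.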

Resources