awk to compare two files [duplicate] - bash

This question already has answers here:
Fast way of finding lines in one file that are not in another?
(11 answers)
Closed 7 years ago.
I am trying to compare two files and want to print the matching lines... The lines present in the files will be unique
File1.txt
GERMANY
FRANCE
UK
POLLAND
File2.txt
POLLAND
GERMANY
I tried with below command
awk 'BEGIN { FS="\n" } ; NR==FNR{A[$1]++;NEXT}A[$1]' File1.txt File2.txt
but it is printing the matching record twice, I want them to be printed once...
UPDATE
expected output
POLLAND
GERMANY
Current Output
POLLAND
GERMANY
POLLAND
GERMANY

grep together with -f (for file) is best for this:
$ grep -f f1 f2
POLLAND
GERMANY
And in fact, to get exact matches and no regex, use respectively -w and -F:
$ grep -wFf f1 f2
POLLAND
GERMANY
If you really have to do it with awk, then you can use:
$ awk 'FNR==NR {a[$1]; next} $1 in a' f1 f2
POLLAND
GERMANY
FNR==NR is performed when reading the first file.
{a[$1]; next} stores in a[] the lines of the first file and goes to the next line.
$1 in a is evaluated when looping through the second file. It checks if the current line is within the a[] array.
Why wasn't your script working?
Because you used NEXT instead of next. So it was treated as a constant instead of a command.
Also, because the BEGIN { FS="\n" } was wrong, as the default FS is a space and it is ok to be like that. Setting it as a new line was making it misbehave.

Your command should maybe be:
awk 'NR==FNR{A[$1]++;next}A[$1]' file1 file2
You have a stray semi-colon after the closing brace of BEGIN{} and also have "NEXT" in capital letters and have mis-spelled your filename.

Try this one-liner:
awk 'NR==FNR{name[$1]++;next}$1 in name' file1.txt file2.txt
You iterate through first file NR==FNR storing the names in an array called names.
You use next to prevent the second action from happneing until first file is completely stored in array.
Once the first file is complete, you start the next file by checking if it is present in the array. It will print out the name if it exits.
FS is field separator. You don't need to set that to new line. You need RS which is Record Separator to be new line. But we don't do that here because that it the default value.

If you don't have to use awk, a better alternative might be the GNU coreutil, comm. From the man page:
comm -12 file1 file2 Print only lines present in both file1 and file2.

Related

File grep for specific column

How to search each line in first file against a specific column in a second comma separated file so that whole line in the first file matches the whole column in the second file.
grep -Ff file1 file2, will search the entire line in the second file, but i want to search on a specific column.
Eg.
file1.txt
20
300
file2.txt
200,10
220,2
300,5
I want the result to only match 300,5 and not the first 2 rows.
$ awk -F, 'NR==FNR{a[$1]; next} $1 in a' file{1,2}
there are many answers already on this site with explanation of how this works, please refer to them.

egrep -v match lines containing some same text on each line

So I have two files.
Example of file 1 content.
/n01/mysqldata1/mysql-bin.000001
/n01/mysqldata1/mysql-bin.000002
/n01/mysqldata1/mysql-bin.000003
/n01/mysqldata1/mysql-bin.000004
/n01/mysqldata1/mysql-bin.000005
/n01/mysqldata1/mysql-bin.000006
Example of file 2 content.
/n01/mysqlarch1/mysql-bin.000004
/n01/mysqlarch1/mysql-bin.000001
/n01/mysqlarch2/mysql-bin.000005
So I want to match based only on mysql-bin.00000X and not the rest of the file path in each file as they differ between file1 and file2.
Here's the command I'm trying to run
cat file1 | egrep -v file2
The output I'm hoping for here would be...
/n01/mysqldata1/mysql-bin.000002
/n01/mysqldata1/mysql-bin.000003
/n01/mysqldata1/mysql-bin.000006
Any help would be much appreciated.
Just compare based on everything from /:
$ awk -F/ 'FNR==NR {a[$NF]; next} !($NF in a)' f2 f1
/n01/mysqldata1/mysql-bin.000002
/n01/mysqldata1/mysql-bin.000003
/n01/mysqldata1/mysql-bin.000006
Explanation
This reads file2 in memory and then compares with file1.
-F/ set the field separator to /.
FNR==NR {a[$NF]; next} while reading the first file (file2), store every last piece into an array a[]. Since we set the field separator to /, this is the mysql-bin.00000X part.
!($NF in a) when reading the second file (file1) check if the last field (mysql-bin.00000X part) is in the array a[]. If it does not, print the line.
I'm having one problem that I've noticed when testing. If file2 is
empty nothing is returned at all where as I would expected every line
in file1 to be returned. Is this something you could help me with
please? – user2841861.
Then the problem is that FNR==NR matches when reading the second file. To prevent this, just cross check that the "reading into a[] array" action is done on the first file:
awk -F/ 'FNR==NR && argv[1]==FILENAME {a[$NF]; next} !($NF in a)' f2 f1
^^^^^^^^^^^^^^^^^^^^
From man awk:
ARGV
The command-line arguments available to awk programs are stored in an
array called ARGV. ARGC is the number of command-line arguments
present. See section Other Command Line Arguments. Unlike most awk
arrays, ARGV is indexed from zero to ARGC - 1

grep matching specific position in lines using words from other file

I have 2 file
file1:
12342015010198765hello
12342015010188765hello
12342015010178765hello
whose each line contains fields at fixed positions, for example, position 13 - 17 is for account_id
file2:
98765
88765
which contains a list of account_ids.
In Korn Shell, I want to print lines from file1 whose position 13 - 17 match one of account_id in file2.
I can't do
grep -f file2 file1
because account_id in file2 can match other fields at other positions.
I have tried using pattern in file2:
^.{12}98765.*
but did not work.
Using awk
$ awk 'NR==FNR{a[$1]=1;next;} substr($0,13,5) in a' file2 file1
12342015010198765hello
12342015010188765hello
How it works
NR==FNR{a[$1]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read so far. Thus, if FNR==NR, we are reading the first file which is file2.
Each ID in in file2 is saved in array a. Then, we skip the rest of the commands and jump to the next line.
substr($0,13,5) in a
If we reach this command, we are working on the second file, file1.
This condition is true if the 5 character long substring that starts at position 13 is in array a. If the condition is true, then awk performs the default action which is to print the line.
Using grep
You mentioned trying
grep '^.{12}98765.*' file2
That uses extended regex syntax which means that -E is required. Also, there is no value in matching .* at the end: it will always match. Thus, try:
$ grep -E '^.{12}98765' file1
12342015010198765hello
To get both lines:
$ grep -E '^.{12}[89]8765' file1
12342015010198765hello
12342015010188765hello
This works because [89]8765 just happens to match the IDs of interest in file2. The awk solution, of course, provides more flexibility in what IDs to match.
Using sed with extended regex:
sed -r 's#.*#/^.{12}&/p#' file2 |sed -nr -f- file1
Using Basic regex:
sed 's#.*#/^.\\{12\\}&/p#' file1 |sed -n -f- file
Explanation:
sed -r 's#.*#/^.{12}&/p#' file2
will generate an output:
/.{12}98765/p
/.{12}88765/p
which is then used as a sed script for the next sed after pipe, which outputs:
12342015010198765hello
12342015010188765hello
Using Grep
The most convenient is to put each alternative in a separate line of the file.
You can look at this question:
grep multiple patterns single file argument list too long

Using cut and grep commands in unix

I have a file (file1.txt) with text as:
aaa,,,,,
aaa,10001781,,,,
aaa,10001782,,,,
bbb,10001783,,,,
My file2 contents are:
11111111
10001781
11111222
I need to search second field of file1 in file2 and delete the line from file1 if pattern is matching.So output will be:
aaa,,,,,
aaa,10001782,,,,
bbb,10001783,,,,
Can I use grep and cut commands for this?
This prints lines from file1.txt only if the second field is not in file2:
$ awk -F, 'FNR==NR{a[$1]=1; next;} !a[$2]' file2 file1.txt
aaa,,,,,
aaa,10001782,,,,
bbb,10001783,,,,
How it works
This works by reading file2 and keeping track of all lines seen in an associative array a. Then, lines in file1.txt are printed only if its column 2 is not in a. In more detail:
FNR==NR{a[$1]=1; next;}
When reading file2, set a[$1] to 1 to signal that we have seen the value on this line. We then instruct awk to skip the rest of the commands and start over on the next line.
This section is only run for file2 because file2 is listed first on the command line and FNR==NR only when we are reading the first file listed on the command line. This is because FNR is the number of lines read from the current file and NR is the total number of lines read so far. These two are equal only for the first file.
!a[$2]
When reading file1.txt, a[$2] evaluates to true if column 2 was seen in file2. Since ! is negation, !a[$2] evaluates to true when column 2 was not seen. When this evaluates to true, the line is printed.
Alternative
This is the same logic, expressed in a slightly different style, as suggested in the comments by Tom Fenech:
$ awk -F, 'FNR==NR{a[$1]; next;} !($2 in a)' file2 file1.txt
aaa,,,,,
aaa,10001782,,,,
bbb,10001783,,,,
Soulution with grep
$ grep -vf file2 file1.txt
aaa,,,,,
aaa,10001782,,,,
bbb,10001783,,,,
John1024's awk soulution would be faster for large files though.

Remove first columns then leave remaining line untouched in awk

I am trying to use awk to remove first three fields in a text file. Removing the first three fields is easy. But the rest of the line gets messed up by awk: the delimiters are changed from tab to space
Here is what I have tried:
head pivot.threeb.tsv | awk 'BEGIN {IFS="\t"} {$1=$2=$3=""; print }'
The first three columns are properly removed. The Problem is the output ends up with the tabs between columns $4 $5 $6 etc converted to spaces.
Update: The other question for which this was marked as duplicate was created later than this one : look at the dates.
first as ED commented, you have to use FS as field separator in awk.
tab becomes space in your output, because you didn't define OFS.
awk 'BEGIN{FS=OFS="\t"}{$1=$2=$3="";print}' file
this will remove the first 3 fields, and leave rest text "untouched"( you will see the leading 3 tabs). also in output the <tab> would be kept.
awk 'BEGIN{FS=OFS="\t"}{print $4,$5,$6}' file
will output without leading spaces/tabs. but If you have 500 columns you have to do it in a loop, or use sub function or consider other tools, cut, for example.
Actually this can be done in a very simple cut command like this:
cut -f4- inFile
If you don't want the field separation altered then use sed to remove the first 3 columns instead:
sed -r 's/(\S+\s+){3}//' file
To store the changes back to the file you can use the -i option:
sed -ri 's/(\S+\s+){3}//' file
awk '{for (i=4; i<NF; i++) printf $i " "; print $NF}'

Resources