Joining two text files based on a common field (ip address) - bash

File1
abcd-efg|random1||abcd|10.10.1.1||
bcde-ab|random2||bc|10.1.2.2||
efgh-bd|ramdom3||fgh|10.2.1.1||
ijkl|random4||mno|10.3.2.3||
File2
10.10.1.1| yes
10.1.2.2| no
10.2.1.1| yes
10.3.2.3| no
Output should be
abcd-efg|random1||abcd|10.10.1.1||yes
bcde-ab|random2||bc|10.1.2.2||no
efgh-bd|ramdom3||fgh|10.2.1.1||yes
ijkl|random4||mno|10.3.2.3||no
I was trying to join both text files on the IP address using awk and join, but somehow I am not able to get the right output.
Could you help me get the right output? Thanks in advance.

$ awk -F'|' 'FNR==NR{a[$1]=$2; next} {print $0 a[$5]}' file2 file1
abcd-efg|random1||abcd|10.10.1.1|| yes
bcde-ab|random2||bc|10.1.2.2|| no
efgh-bd|ramdom3||fgh|10.2.1.1|| yes
ijkl|random4||mno|10.3.2.3|| no
This approach will work even if the IPs are in the files in different orders.
How it works
-F'|'
Set the field separator on input to |.
FNR==NR{a[$1]=$2; next}
When reading the first file, file2, save the second field as a value in associative array a under the key of the first field. Skip remaining commands and jump to the next line.
print $0 a[$5]
If we get here, we are working on the second file, file1. Print the line followed by the value of a for this IP.
BSD/OSX
On BSD (OSX) awk, try:
awk -F'|' 'FNR==NR{a[$1]=$2; next;} {print $0 a[$5];}' file2 file1
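The output above keeps the space that precedes yes/no in file2, whereas the requested output has none. A minimal sketch, assuming that space should simply be dropped: strip leading blanks from the value before storing it. The variable v and the scratch directory are illustrative names, not part of the original answer.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
abcd-efg|random1||abcd|10.10.1.1||
bcde-ab|random2||bc|10.1.2.2||
efgh-bd|ramdom3||fgh|10.2.1.1||
ijkl|random4||mno|10.3.2.3||
EOF
cat > "$tmp/file2" <<'EOF'
10.10.1.1| yes
10.1.2.2| no
10.2.1.1| yes
10.3.2.3| no
EOF

# v (hypothetical helper) holds the second field with leading blanks removed.
out=$(awk -F'|' '
FNR==NR { v = $2; sub(/^[ \t]+/, "", v); a[$1] = v; next }
{ print $0 a[$5] }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```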

Unix join command can be used for this
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 file1 file2
Explanation of options:
-t\| : Field separator is '|' (escaped)
-j1 5 -j2 1 : Join based on 5th field of file1 and 1st field of file2
-o1.1,1.2,1.3,1.4,1.5,1.6,2.2 : Output the 6 fields from file1 and 2nd field from file2
join expects both inputs to be sorted on the join field. If they are not, sort them first, as below:
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 <(sort -t'|' -k5,5 file1) <(sort -t'|' -k1,1 file2)
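For reference, the same join can be sketched without process substitution, using the POSIX spelling -1 5 -2 1 for the join fields. This is a sketch under assumptions: the intermediate file names are made up for illustration, and LC_ALL=C is set so that sort and join agree on the collation order.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
abcd-efg|random1||abcd|10.10.1.1||
bcde-ab|random2||bc|10.1.2.2||
efgh-bd|ramdom3||fgh|10.2.1.1||
ijkl|random4||mno|10.3.2.3||
EOF
cat > "$tmp/file2" <<'EOF'
10.10.1.1| yes
10.1.2.2| no
10.2.1.1| yes
10.3.2.3| no
EOF

# Sort each file on its join field only (field 5 of file1, field 1 of file2).
LC_ALL=C sort -t'|' -k5,5 "$tmp/file1" > "$tmp/f1.sorted"
LC_ALL=C sort -t'|' -k1,1 "$tmp/file2" > "$tmp/f2.sorted"

# Join field 5 of the first input with field 1 of the second.
out=$(LC_ALL=C join -t'|' -1 5 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,2.2 \
      "$tmp/f1.sorted" "$tmp/f2.sorted")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The result comes out in sorted IP order rather than in the original order of file1.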

Assuming both files list the IP addresses in the same order, as in the OP's example:
paste -d'\0' file1 <(cut -d' ' -f2 file2)
cut -d' ' -f2 file2 selects the second column of file2; -d' ' sets the column delimiter to a space.
Process substitution passes the output of cut to paste as if it were a file.
paste then combines file1 and the cut output column-wise, with no character in between (reference: paste without delimiter).

Related

Extracting unique values between 2 files with awk

I need to get the unique lines when comparing 2 files. The files use ":" as a field separator, and only the text before it should be considered when comparing strings.
The file1 contains these lines
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
The file2 contains these lines
orange:nice
banana:long
The output file should be (2 occurrences of orange and 2 occurrences of banana deleted)
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
So only the string before : should be compared.
Is it possible to complete this task in one command?
I tried the following, but the field separator has no effect here:
awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile
You basically had it, but $0 refers to the whole line when you want to deal with only the first field, which is $1.
Also you need to take care with the order of the input files. To use the values from file2 for deciding which lines to include from file1, process file2 first:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
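A small variation on the command above, sketched against the sample data: since only membership matters, the keys can be recorded without counting and tested with in. The array name skip is an arbitrary choice for this illustration.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
EOF
cat > "$tmp/file2" <<'EOF'
orange:nice
banana:long
EOF

# First pass (file2): remember each key. Second pass (file1): print lines
# whose key was never seen in file2.
out=$(awk -F: 'FNR==NR { skip[$1]; next } !($1 in skip)' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```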
One comment: awk can be inefficient with large arrays. For big files, something like this may work better (f1 and f2 stand for file1 and file2):
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | grep -h -f /dev/stdin f1 f2
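Another grep-based route, offered only as a sketch: turn each key of file2 into a pattern anchored at the start of the line and ending at the colon, then let grep -v drop the matching lines of file1. It assumes the keys contain no regular-expression metacharacters; the patterns file name is illustrative.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
EOF
cat > "$tmp/file2" <<'EOF'
orange:nice
banana:long
EOF

# Build one anchored pattern per distinct key, e.g. "^orange:".
cut -d: -f1 "$tmp/file2" | sort -u | sed 's/.*/^&:/' > "$tmp/patterns"

# Keep only the lines of file1 that match none of the patterns.
out=$(grep -v -f "$tmp/patterns" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```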

Filter records from one file based on a values present in another file using Unix

I have an input CSV file.
Input feed
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
An error CSV file, which contains the primary key, is generated from this input file.
Error File
Pk,Error_Reason
D,Failure
E, Failure
F, Failure
I want to extract every record from the input file that has a primary key entry in the error file, and save those records into a new file.
Basically my new file should look like this:
New Input feed
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
I am a beginner in Unix and I have tried the awk command.
The approach I have tried is to get all the primary key values into a file.
awk -F"," '{print $2}' error.csv >> error_pk.csv
Now I need to filter the records in input.csv down to the primary key values present in error_pk.csv.
Using awk. As there is a leading space in the error file, it needs to be trimmed off first; I'm using sub for that. Then, since the titles of the first column are not identical (PK vs Pk), the header needs to be handled separately with FNR==1:
$ awk -F, ' # set separator
NR==FNR { # process the first file
sub(/^ */,"") # trim leading space
a[$1] # hash the first column
next
}
FNR==1 || ($1 in a)' error input # output the header record and any record whose key was hashed
Output:
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
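A self-contained sketch of the same idea, run against the sample data from the question. The file names error.csv and input.csv follow the question; the variable k and the array name keys are illustrative. Here any blanks around the key itself are trimmed, which is an extra precaution rather than something the sample data strictly requires.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/input.csv" <<'EOF'
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
EOF
cat > "$tmp/error.csv" <<'EOF'
Pk,Error_Reason
D,Failure
E, Failure
F, Failure
EOF

out=$(awk -F, '
NR==FNR {                              # first file: the error file
    k = $1
    gsub(/^[ \t]+|[ \t]+$/, "", k)     # trim blanks around the key
    keys[k]                            # remember the key
    next
}
FNR==1 || ($1 in keys)                 # second file: header plus matching rows
' "$tmp/error.csv" "$tmp/input.csv")

printf '%s\n' "$out"
rm -rf "$tmp"
```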
You can use join:
First, remove everything after the comma from the second file.
Then join on the first field of both files.
cat <<EOF >file1
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
EOF
cat <<EOF >file2
PK,Error_Reason
D,Failure
E,Failure
F,Failure
EOF
join -t, -11 -21 <(sort -k1 file1) <(cut -d, -f1 file2 | sort -k1)
If you need the file to be sorted according to file1, you can number the lines in first file, join the files, re-sort using the line numbers and then remove the numbers from the output:
join -t, -12 -21 <(nl -w1 -s, file1 | sort -t, -k2,2) <(cut -d, -f1 file2 | sort -k1) |
sort -t, -k2,2n | cut -d, -f1,3-
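The number, join, re-sort, strip sequence can be sketched step by step with temporary files instead of process substitution. This is an illustration under assumptions: the intermediate file names are invented, LC_ALL=C keeps sort and join consistent, and the re-sort on the line number is numeric so that inputs with ten or more lines keep their order.

```shell
# Recreate the two files used in this answer in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
EOF
cat > "$tmp/file2" <<'EOF'
PK,Error_Reason
D,Failure
E,Failure
F,Failure
EOF

# 1. Prefix every line of file1 with its line number, then sort on the key.
nl -w1 -s, "$tmp/file1" | LC_ALL=C sort -t, -k2,2 > "$tmp/f1.numbered"
# 2. Keep only the keys of file2, sorted.
cut -d, -f1 "$tmp/file2" | LC_ALL=C sort > "$tmp/keys"
# 3. Join on the key, restore the original order by line number,
#    and drop the line-number column again.
out=$(LC_ALL=C join -t, -1 2 -2 1 "$tmp/f1.numbered" "$tmp/keys" |
      sort -t, -k2,2n | cut -d, -f1,3-)

printf '%s\n' "$out"
rm -rf "$tmp"
```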
You can use grep -f with a file of search patterns. Each pattern is the key from file2, cut off at the comma and anchored to the start of the line.
grep -Ef <(sed -r 's/([^,]*).*/^\1,/' file2) file1
When you want a header in the output,

Vlookup using awk command

I have two files on my Linux server.
File 1
9190784
9197256
9170546
9184139
9196854
File 2
S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP
I want to write a script or a one-line bash command using awk that matches each entry in File 1 against Column1 of File 2 and prints Column1, Column2 and Column3. If an entry is not found, it should print the entry from File 1 with NA for Column2 and Column3.
Output: it should be redirected to a new file, as below.
new_file
9190784,TGM,AP
9197256,NA,NA
9170546,NA,NA
9184139,AGM,KN
9196854,TGM,AP
I hope the question is clear. Can anyone please help me with this?
A standard join operation with awk:
$ awk 'BEGIN {FS=OFS=","}
NR==FNR {a[$2]=$3 OFS $4; next}
{print $1, (($1 in a)?a[$1]:"NA" OFS "NA")}' file2 file1
substring variation (not tested)
$ awk 'BEGIN {FS=OFS=","}
NR==FNR {a[substr($2,1,7)]=$3 OFS $4; next}
{key=substr($1,1,7);
print $1, ((key in a)?a[key]:"NA" OFS "NA")}' file2 file1
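As a runnable sketch of the exact-match lookup on the sample data (file1 is the list of IDs and file2 the CSV, as in the question): note that with exact matching 9190784 comes out as NA,NA, because that value does not appear in Column1 of the sample File 2.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
9190784
9197256
9170546
9184139
9196854
EOF
cat > "$tmp/file2" <<'EOF'
S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP
EOF

out=$(awk 'BEGIN { FS = OFS = "," }
# First file (the CSV): map Column1 to "Column2,Column3".
NR==FNR { a[$2] = $3 OFS $4; next }
# Second file (the ID list): print the mapped values, or NA,NA if absent.
{ print $1, (($1 in a) ? a[$1] : "NA" OFS "NA") }' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

To write the result to a new file, as the question asks, redirect the awk command with > new_file instead of capturing it in a variable.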
Does it have to be awk? This can be done with join.
Given these two files (note that the names file1 and file2 are swapped relative to the question):
echo '9190784
9197256
9170546
9184139
9196854' >file2
echo 'S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP' > file1
Join on the second field of file1 (-12) and the first field of file2 (-21), using , as the separator. file1 has its header line removed (tail -n +2) and is sorted on its second field (sort -t, -k2); file2 is sorted with a plain sort.
join -t, -12 -21 -o1.2,1.3,1.4 <(tail -n +2 file1 | sort -t, -k2) <(sort file2)
will output:
9184139,AGM,KN
9196854,TGM,AP
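The join above prints only the IDs that were found. A possible extension, sketched here rather than taken from the answer: -a 2 also prints unpaired lines from the ID list, -e NA fills their missing fields, and 0 in the -o list stands for the join field. File naming follows this answer (file1 is the CSV, file2 the ID list); the intermediate file names and LC_ALL=C are assumptions of the sketch.

```shell
# Recreate the two files used in this answer in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP
EOF
cat > "$tmp/file2" <<'EOF'
9190784
9197256
9170546
9184139
9196854
EOF

# Drop the CSV header and sort on the join field; sort the ID list as well.
tail -n +2 "$tmp/file1" | LC_ALL=C sort -t, -k2,2 > "$tmp/csv.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/list.sorted"

# -a 2 keeps IDs without a match; -e NA fills the fields they lack.
out=$(LC_ALL=C join -t, -1 2 -2 1 -a 2 -e NA -o 0,1.3,1.4 \
      "$tmp/csv.sorted" "$tmp/list.sorted")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The rows come out in sorted ID order, not in the order of the original list.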

shell - compare files and update matching string awk/sed/diff/grep/csv

I need to compare 2 CSV files and make modifications to the second column. I originally wrote out the logic of how I wanted to achieve this, but it seemed to confuse the thread more than I intended, so I'll just give an example.
Any help would be appreciated. Thanks in advance.
file1
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName3
user4,distinguishedName4
user5,distinguishedName5
file2
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
desired outcome:
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName13
user4,distinguishedName4
user5,distinguishedName12
user6,distinguishedName4
The solution using join command combined with awk command:
join -t',' -j1 -a1 -a2 file1 file2 | awk -F',' '{if(NF==3) $0=$1FS$3}1'
The output:
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName13
user4,distinguishedName4
user5,distinguishedName12
user6,distinguishedName4
Explanation:
-- for join command:
-t',' - defines field separator
-j1 - join on the first field of both files
-a FILENUM - print unpairable lines coming from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2
-- for awk command:
NF - contains a total number of fields
FS - field separator(i.e. ,)
if(NF==3) $0=$1FS$3 - checks whether the line has a third field (present when both files contain the same first field) and, if so, replaces the line with its first and third fields
https://linux.die.net/man/1/join
awk to the rescue!
awk -F, '!a[$1]++' file2 file1
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
user2,distinguishedName2
user4,distinguishedName4
This order follows the record order of file2 and then file1; if you want sorted output, just pipe to sort:
awk ... | sort
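Spelled out on the sample data, the deduplicate-then-sort pipeline might look like the sketch below. The array name seen and the use of LC_ALL=C are choices made for this illustration. The sort is lexical, so a key such as user10 would sort before user2.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName3
user4,distinguishedName4
user5,distinguishedName5
EOF
cat > "$tmp/file2" <<'EOF'
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
EOF

# file2 is read first, so its version of each user wins; file1 only
# contributes users that file2 does not mention.
out=$(awk -F, '!seen[$1]++' "$tmp/file2" "$tmp/file1" | LC_ALL=C sort -t, -k1,1)

printf '%s\n' "$out"
rm -rf "$tmp"
```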

Compare columns in two text files and match lines

I want to compare the second column (delimited by a whitespace) in file1:
n01443537/n01443537_481.JPEG n01443537
n01629819/n01629819_420.JPEG n01629819
n02883205/n02883205_461.JPEG n02883205
With the second column (delimited by a whitespace) in file2:
val_8447.JPEG n09256479
val_68.JPEG n01443537
val_1054.JPEG n01629819
val_1542.JPEG n02883205
val_8480.JPEG n03089624
If there is a match, I would like to print out the corresponding line of file2.
Desired output in this example:
val_68.JPEG n01443537
val_1054.JPEG n01629819
val_1542.JPEG n02883205
I tried the following, but the output file is empty:
awk -F' ' 'NR==FNR{c[$2]++;next};c[$2] > 0' file1.txt file2.txt > file3.txt
Also tried this, but the result was the same (empty output file):
awk 'NR==FNR{a[$2];next}$2 in a' file1 file2 > file3.txt
GNU join exists for this purpose.
join -o "2.1 2.2" -j 2 <(sort -k 2 file1) <(sort -k 2 file2)
Using awk:
awk 'FNR==NR{a[$NF]; next} $NF in a' file1 file2
val_68.JPEG n01443537
val_1054.JPEG n01629819
val_1542.JPEG n02883205
Here is a grep alternative with process substitution:
grep -f <(awk '{print " " $NF "$"}' file1) file2
Using print " " $NF "$" to create a regex like " n01443537$" so that we match only last column in grep.
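To make the generated patterns visible, the same approach can be sketched with an intermediate patterns file (an illustrative name) in place of process substitution. It assumes the IDs contain no regular-expression metacharacters, which holds for the sample data.

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
n01443537/n01443537_481.JPEG n01443537
n01629819/n01629819_420.JPEG n01629819
n02883205/n02883205_461.JPEG n02883205
EOF
cat > "$tmp/file2" <<'EOF'
val_8447.JPEG n09256479
val_68.JPEG n01443537
val_1054.JPEG n01629819
val_1542.JPEG n02883205
val_8480.JPEG n03089624
EOF

# One pattern per line of file1: a space, the last column, end of line.
awk '{ print " " $NF "$" }' "$tmp/file1" > "$tmp/patterns"
cat "$tmp/patterns"

# Print the lines of file2 whose last column matches one of the patterns.
out=$(grep -f "$tmp/patterns" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```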

Resources