Compare columns in two text files and match lines - bash

I want to compare the second column (delimited by a whitespace) in file1:
n01443537/n01443537_481.JPEG n01443537
n01629819/n01629819_420.JPEG n01629819
n02883205/n02883205_461.JPEG n02883205
With the second column (delimited by a whitespace) in file2:
val_8447.JPEG n09256479
val_68.JPEG n01443537
val_1054.JPEG n01629819
val_1542.JPEG n02883205
val_8480.JPEG n03089624
If there is a match, I would like to print out the corresponding line of file2.
Desired output in this example:
val_68.JPEG n01443537
val_1054.JPEG n01629819
val_1542.JPEG n02883205
I tried the following, but the output file is empty:
awk -F' ' 'NR==FNR{c[$2]++;next};c[$2] > 0' file1.txt file2.txt > file3.txt
Also tried this, but the result was the same (empty output file):
awk 'NR==FNR{a[$2];next}$2 in a' file1 file2 > file3.txt

GNU join exists for this purpose.
join -o "2.1 2.2" -j 2 <(sort -k 2 file1) <(sort -k 2 file2)

Using awk:
awk 'FNR==NR{a[$NF]; next} $NF in a' file1 file2
val_68.JPEG n01443537
val_1054.JPEG n01629819
val_1542.JPEG n02883205
Here is a grep alternative with process substitution:
grep -f <(awk '{print " " $NF "$"}' file1) file2
Using print " " $NF "$" to create a regex like " n01443537$" so that we match only last column in grep.

Related

Write specific columns of files into another files, Who can give me a more concise solution?

I have a troublesome problem about writing specific columns of the file into another file, more details are I have the file1 like below, I need to write the first columns exclude the first row to file2 with one line and separated with '|' sign. And now I have a solution by sed and awk, this missing last step inserts into the top of file2, even though I still believe there should be some more concise solution on account of powerful of awk、sed, etc. So, Who can offer me another more concise script?
sed '1d;s/ .//' ./file1 | awk '{printf "%s|", $1; }' | awk '{if (NR != 0) {print substr($1, 1, length($1) - 1)}}'
file1:
col_name data_type comment
aaa string null
bbb int null
ccc int null
file2:
xxx ccc(whatever is this)
The result of file2 should be this :
aaa|bbb|ccc
xxx ccc(whatever is this)
Assuming there's no whitespace in the column 1 data, in increasing length:
sed -i "1i$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')" file2
or
ed file2 <<END
1i
$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')
.
wq
END
or
{ awk 'NR > 1 {print $1}' file1 | paste -sd '|'; cat file2; } | sponge file2
or
mapfile -t lines < <(tail -n +2 file1)
col1=( "${lines[#]%%[[:blank:]]*}" )
new=$(IFS='|'; echo "${col1[*]}"; cat file2)
echo "$new" > file2
This might work for you (GNU sed):
sed -z 's/[^\n]*\n//;s/\(\S*\).*/\1/mg;y/\n/|/;s/|$/\n/;r file2' file1
Process file1 "wholemeal" by using the -z command line option.
Remove the first line.
Remove all columns other than the first.
Replace newlines by |'s
Replace the last | by a newline.
Append file2.
Alternative using just command line utils:
tail +2 file1 | cut -d' ' -f1 | paste -s -d'|' | cat - file2
Tail file1 from line 2 onwards.
Using the results from the tail command, isolate the first column using a space as the column delimiter.
Using the results from the cut command, serialize each line into one, delimited by |',s.
Using the results from the paste, append file2 using the cat command.
I'm learning awk at the moment.
awk 'BEGIN{a=""} {if(NR>1) a = a $1 "|"} END{a=substr(a, 1, length(a)-1); print a}' file1
Edit: Here's another version that uses an array:
awk 'NR > 1 {a[++n]=$1} END{for(i=1; i<=n; ++i){if(i>1) printf("|"); printf("%s", a[i])} printf("\n")}' file1
Here is a simple Awk script to merge the files as per your spec.
awk '# From the first file, merge all lines except the first
NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
# We are in the second file; add a newline after data from first file
FNR == 1 { printf "\n" }
# Simply print all lines from file2
1' file1 file2
The NR==FNR condition is true when we are reading the first input file: The overall line number NR is equal to the line number within the current file FNR. The final 1 is a common idiom for printing all input lines which make it this far into the script (the next in the first block prevent lines from the first file to reaching this far).
For conciseness, you can remove the comments.
awk 'NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
FNR == 1 { printf "\n" } 1' file1 file2
Generally speaking, Awk can do everything sed can do, so piping sed into Awk (or vice versa) is nearly always a useless use of sed.

Extracting unique values between 2 files with awk

I need to get uniq lines when comparing 2 files. These files containing field separator ":" which should be treated as the end of line while comparing strings.
The file1 contains these lines
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
The file2 contains these lines
orange:nice
banana:long
The output file should be (2 occurrences of orange and 2 occurrences of banana deleted)
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
So the only strings before : should be compared
Is it possible to complete this task in 1 command ?
I tried to complete the task in such way but field separator does not work in that situation.
awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile
You basically had it, but $0 refers to the whole line when you want to deal with only the first field, which is $1.
Also you need to take care with the order of the input files. To use the values from file2 for deciding which lines to include from file1, process file2 first:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
One comment: awk is very ineffective with arrays. In real life with big files, better use something like:
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | grep -h -f /dev/stdin f1 f2

Vlookup using awk command

I have two files in my linux server.
File 1
9190784
9197256
9170546
9184139
9196854
File 2
S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP
I want to write a script or a single line command in bash using awk command, so as whatever the element in File -1 should match the same with column 1 in File -2 and print Column 1, Column2 and Column3. Also if any entry is not found it should print entry from file 1 and print NA in Column 2 and Column 3
Output : it should redirect the output to a new file as below.
new_file
9190784,TGM,AP
9197256,NA,NA
9170546,NA,NA
9184139,AGM,KN
9196854,TGM,AP
I hope the query is understandable. Anyone please help me on the same.
standard join operation with awk
$ awk 'BEGIN {FS=OFS=","}
NR==FNR {a[$2]=$3 OFS $4; next}
{print $1, (($1 in a)?a[$1]:"NA" OFS "NA")} file2 file1
substring variation (not tested)
$ awk 'BEGIN {FS=OFS=","}
NR==FNR {a[substr($2,1,7)]=$3 OFS $4; next}
{key=substr($1,1,7);
print $1, ((key in a)?a[key]:"NA" OFS "NA")} file2 file1
Does it have to be awk? It's done with join:
Having two files:
echo '9190784
9197256
9170546
9184139
9196854' >file2
echo 'S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP' > file1
One can join the on , as separator on the second field from the first file1 -12 with removed the first header line tail -n +2 and sorted using the second field sort -t, -k2 with the first field from the second file -21 sorted sort.
join -t, -12 -21 -o1.2,1.3,1.4 <(tail -n +2 file1 | sort -t, -k2) <(sort file2)
will output:
9184139,AGM,KN
9196854,TGM,AP

Joining two text files based on a common field (ip address)

File1
abcd-efg|random1||abcd|10.10.1.1||
bcde-ab|random2||bc|10.1.2.2||
efgh-bd|ramdom3||fgh|10.2.1.1||
ijkl|random4||mno|10.3.2.3||
File2
10.10.1.1| yes
10.1.2.2| no
10.2.1.1| yes
10.3.2.3| no
Output should be
abcd-efg|random1||abcd|10.10.1.1||yes
bcde-ab|random2||bc|10.1.2.2||no
efgh-bd|ramdom3||fgh|10.2.1.1||yes
ijkl|random4||mno|10.3.2.3||no
I was trying to join both text files based on ip address using awk and joins but some how not able to get the right output.
Could you help me get through the right output.Thanks in advance
$ awk -F'|' 'FNR==NR{a[$1]=$2; next} {print $0 a[$5]}' file2 file1
abcd-efg|random1||abcd|10.10.1.1|| yes
bcde-ab|random2||bc|10.1.2.2|| no
efgh-bd|ramdom3||fgh|10.2.1.1|| yes
ijkl|random4||mno|10.3.2.3|| no
This approach will work even if the IPs are in the files in different orders.
How it works
-F'|'
Set the field separator on input to |.
FNR==NR{a[$1]=$2; next}
When reading the first file, file2, save the second field as a value in associative array a under the key of the first field. Skip remaining commands and jump to the next line.
print $0 a[$5]
If we get here, we are working on the second file, file1. Print the line followed by the value of a for this IP.
BSD/OSX
On BSD (OSX) awk, try:
awk -F'|' 'FNR==NR{a[$1]=$2; next;} {print $0 a[$5];}' file2 file1
Unix join command can be used for this
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 file1 file2
Explanation of options:
-t\| : Field separator is '|' (escaped)
-j1 5 -j2 1 : Join based on 5th field of file1 and 1st field of file2
-o1.1,1.2,1.3,1.4,1.5,1.6,2.2 : Output the 6 fields from file1 and 2nd field from file2
If the input files are not sorted, they need to be sorted first, like below
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 <(sort -t'|' -k5 file1) <(sort -t'|' -k1 file2)
Assuming both files have IP address in same order as shown in OP's example
paste -d'\0' file1 <(cut -d' ' -f2 file2)
cut -d' ' -f2 file2 select second column of file2, column separation is space character specified by delimiter -d' '
Using process substitution, output of cut command is passed as file input to paste command
paste command then combines file1 and output of cut column wise without any character in between (reference: paste without delimiter)

Compare Joining different fields

I have 2 files
file1.txt [syntax field1:filed2:field3:...]
123456:07102015174037:100 --> this should be matched
123457:03102015174037:354
123456:03102015174037:1
1234556:03102015174037:0
file2.txt [syntax field3:filed4:field1:...]
100:03102015174037:123456 --> this should be matched
101:03145415174037:1234556
I wanted to check if
From file1.txt combination of field1 & field2 exists in file2.txt at field3 & field1
At End I want to print only file1.txt content which are matched
so I ended up doing [To get only matched columns]
awk -F ':' '{print $1,$2}' file1.txt >> tmpfile1.txt
awk -F ':' '{print $1,$3}' file2.txt >> tmpfile2.txt
grep -f tmpfile1.txt tmpfile2.txt> match.txt
grep -f match.txt file1.txt >> file1updated.txt
cat fil1updated
100:03102015174037:123456
Is there are one step & efficient way of doing this [Its like joins on columns in SQL]
You can run this inline of course. But I prefer to create an awk script as such (join.awk)
NR==FNR {
a[$1 FS $3] = $0; next
}
$3 FS $1 in a {print a[$3 FS $1]}
Then you can get your result by running
awk -F: -f join.awk file2.txt file1.txt

Resources