How to check whether rows of a file are within the rows of another file - bash

I am new to Shell/Bash. I have file1 with one column and about 5000 rows, and file2 with five columns and 240k rows. How can I check whether the values of the 5000 rows in file1 appear in the second column of file2?
$ wc -l file1
5188 file1
$ wc -l file2
240888 file2

You can do this with awk, something like this:
awk 'NR == FNR {a[$1]; next} $2 in a {print}' file1 file2
Basically you read the first file in and store its contents in an array "a". Then you read the second file and check if the second field of each line is contained within array "a" and print it if it is.
My answer assumes your fields are separated by white space; if they are not, you will have to change the separator. So, if your fields are separated by commas, you will need:
awk -F, .....

The above approach works, and it can be simplified by reading file2 first and keeping its second column as the array keys:
awk 'FNR==NR{a[$2]; next} $1 in a' file2 file1
This prints every file1 row whose value appears in the second column of file2.
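As a quick sanity check, the membership lookup can be tried on tiny stand-in files (the file names and data below are invented for illustration):

```shell
# Toy stand-ins: file1 has one column of values, file2 has five columns.
printf 'alpha\ngamma\n' > /tmp/file1
printf '1 alpha x y z\n2 beta x y z\n3 gamma x y z\n' > /tmp/file2

# Load file1's values as array keys, then print each file2 line whose
# second column is one of those keys.
awk 'NR == FNR { a[$1]; next } $2 in a' /tmp/file1 /tmp/file2
# prints:
# 1 alpha x y z
# 3 gamma x y z
```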

Related

awk for string comparison with multiple delimiters

I have a file with multiple delimiters. I'm looking to compare the value after the first / when read from right to left (i.e. the last path component) with another file.
code :-
awk -F'[/|]' 'NR==FNR{a[$3]; next} ($1 in a)' file1 file2 > output
cat file1
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
BXC/DEF/OTTA|fed|92374|
AVD/customer|FST|8736481|
FFS/T6TT/BOSTON|money|18922|
GTS/trust/YYYY|opt|62376|
XXY/IJSH/trust|opt|62376|
cat file2
customer
trust
expected output :-
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|
$ awk -F\| '               # just use one FS ...
NR==FNR {
    a[$1]
    next
}
{
    n=split($1,t,/\//)     # ... and split the 1st field on /
    if(t[n] in a)          # then compare the last part against the keys
        print
}' file2 file1
Output:
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|
If you use [/|] you have two delimiters, and you cannot tell whether a given field was terminated by a slash or a pipe.
Reading your question, you want to compare the value after the last slash, without the pipe characters.
If there has to be a / present in the string, you can set that as the field separator and check if there are at least 2 fields using NF > 1
Then take the last field using $NF, split on | and check if the first part is present in one of the values of file2 which are stored in array a
$ cat file1
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
BXC/DEF/OTTA|fed|92374|
AVD/customer|FST|8736481|
FFS/T6TT/BOSTON|money|18922|
GTS/trust/YYYY|opt|62376|
XXY/IJSH/trust|opt|62376|
$ cat file2
customer
trust
Example code
awk -F/ '
NR==FNR {a[$1]; next}
NF > 1 {
    split($NF, t, "|")
    if(t[1] in a) print
}
' file2 file1
Output
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|

Compare two files with awk - check if values from two columns in file1 are included somewhere in two columns in file2

I want to check (each line) if values from file1 (in column $2 && $3) are somewhere included in file2 (in column $1 && $2). If yes, then I would like to print $1, $2, $3 from file1 as well as $3 from file2 (as a 4th column).
File1:
# 139.51 -62.48
# 137.36 -63.36
# 135.44 -64.09
File2:
137.35 -63.36 6.349
137.36 -63.36 6.348
137.37 -63.36 6.346
I've got so far:
awk 'NR == FNR {a[$1$2];c[FNR] =$3;next} $2$3 in a {print $1, $2, $3, c[FNR]}' $file2 $file1 > $output
But somehow, the resulting values in $4 are not equal to the 3rd column of file2. Could someone help me out? Thank you so much! :)
I am new in programming, and use awk and shell so far, so I am always happy about explanations! Thank you!
Since you haven't shown your expected output, this code is based on your statements only.
awk 'FNR==NR{a[$2,$3]=$0;next} (($1,$2) in a){print a[$1,$2],$NF}' file1 file2
Output will be as follows.
# 137.36 -63.36 6.348
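The behaviour can be verified with throwaway copies of the two files (the /tmp paths are illustrative):

```shell
# Recreate the sample data from the question.
printf '# 139.51 -62.48\n# 137.36 -63.36\n# 135.44 -64.09\n' > /tmp/f1
printf '137.35 -63.36 6.349\n137.36 -63.36 6.348\n137.37 -63.36 6.346\n' > /tmp/f2

# Key each file1 row by its columns 2 and 3, then look up each file2 row
# by its columns 1 and 2; on a match, print the file1 row plus file2's last field.
awk 'FNR==NR{a[$2,$3]=$0; next} (($1,$2) in a){print a[$1,$2], $NF}' /tmp/f1 /tmp/f2
# prints: # 137.36 -63.36 6.348
```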

search 2 fields in a file in another huge file, passing 2nd file only once

file1 has 100,000 lines. Each line has 2 fields such as:
test 12345678
test2 43213423
Another file has millions of lines. Here is an example of how the above file entries look in file2:
'99' 'databases' '**test**' '**12345678**'
'1002' 'exchange' '**test2**' '**43213423**'
I would like a way to grep these 2 fields from file1 so that I can find any line in file2 that contains both. The gotcha is that I would like to pass through the 2nd file only once, as looping a grep is very slow: it could mean 100,000 x 10,000,000 comparisons.
Is that at all possible?
You can do this in awk:
awk -F"['[:blank:]]+" 'NR == FNR { a[$1,$2]; next } $4 SUBSEP $5 in a' file1 file2
First set the field separator so that the quotes around the fields in the second file are consumed.
The first block applies to the first file and sets keys in the array a. The comma in the array index translates to the control character SUBSEP in the key.
Lines are printed in the second file when the third and fourth fields (with the SUBSEP in between) match one of the keys. Due to the ' at the start of the line, the first field $1 is actually an empty string, so the fields you want are $4 and $5.
If your fields are always quoted in the second file, then you can do this instead:
awk -v q="'" 'NR == FNR { a[q $1 q,q $2 q]; next } $3 SUBSEP $4 in a' file file2
This inserts the quotes into the array a, so the fields in the second file match without having to consume the quotes.
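The two-field SUBSEP lookup itself is easy to demonstrate in isolation (the data below is simplified and unquoted so awk's default whitespace splitting applies; file names are invented):

```shell
# Toy stand-ins: key pairs in one file, candidate rows in another.
printf 'test 12345678\ntest2 43213423\n' > /tmp/keys
printf '99 databases test 12345678\n1002 exchange test9 99999999\n' > /tmp/big

# Store each file1 pair as one key ("field1 SUBSEP field2"), then test
# file2's third and fourth fields against those keys.
awk 'NR == FNR { a[$1,$2]; next } ($3,$4) in a' /tmp/keys /tmp/big
# prints: 99 databases test 12345678
```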
fgrep and sed method:
sed "s/\b/'/g;s/\b/**/g" file1 | fgrep -f - file2
Modify a stream from file1 with sed to match the format of the second file, (i.e. surround the fields with single quotes and asterisks), and send the stream to standard output. The fgrep -f - inputs that stream as a list of fixed strings (but no regexps) and finds every matching line in file2.
Output:
'99' 'databases' '**test**' '**12345678**'
'1002' 'exchange' '**test2**' '**43213423**'

Compare two csv files, use the first three columns as identifier, then print common lines

I have two csv files. File 1 has 9861 rows and 4 columns, while File 2 has 6037 rows and 5 columns. Here are the files.
Link of File 1
Link of File 2
The first three columns are years, months, days respectively.
I want to get the lines in File 2 with the same identifier in File 1 and print this to File 3.
I found this command from some posts here but this only works using one column as identifier:
awk -F, 'NR==FNR {a[$1]=$0;next}; $1 in a {print a[$1]; print}' file1 file2
Is there a way to do this using awk or any simpler commands where I can use the first three columns as identifier?
I'll appreciate any help.
Just use more columns to make the uniqueness you need:
$ awk -F, 'NR==FNR {a[$1, $2, $3] = $0; next}
$1 SUBSEP $2 SUBSEP $3 in a' file1 file2
SUBSEP is the subscript separator. It has the default value of "\034" and is used to separate the parts of the indices of a multi-dimensional array. Thus, the expression foo["A", "B"] really accesses foo["A\034B"].
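A one-liner makes that equivalence concrete (the array name and values here are arbitrary):

```shell
# foo["A","B"] and foo["A" SUBSEP "B"] address the same array element,
# so the membership test prints 1 (true).
awk 'BEGIN { foo["A","B"] = 1; print (("A" SUBSEP "B") in foo) }'
# prints: 1
```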
awk -F, '{k=$1 FS $2 FS $3} NR==FNR{a[k];next} k in a' file1 file2
Untested of course since you didn't provide any sample input/output.

Merge two files in linux with different column

I have two files in linux, the first file has 4 columns and the second has 2 columns. I want to merge these files into a new file that has the first 3 columns from file 1 and the first column from file 2. I tried awk, but my data from file 2 was placed under file 1.
paste file1 file2 | awk '{print $1,$2,$3,$5}'
Not sure which columns you want from each file, but something like this should work:
paste <file1> <file2> | awk '{print $1,$2,$3,$5}'
The first three columns ($1, $2, $3) are picked from file1; the fourth ($4) is skipped, and $5 is the first column of the second file.
If the files have the same number of rows, you can do something like:
awk '{ getline v < "file2"; split( v, a ); print a[2], $1, $3 }' file1
to print columns 1 and 3 from file1 and column 2 from file2.
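Assuming equal row counts, the getline approach can be checked with small made-up files (the /tmp names are placeholders):

```shell
# Two stand-in files with the same number of rows.
printf 'a1 a2 a3 a4\nb1 b2 b3 b4\n' > /tmp/m1
printf 'x1 x2\ny1 y2\n' > /tmp/m2

# For each line of the first file, read the matching line of the second
# via getline, split it into array a, and print that line's column 2
# followed by columns 1 and 3 of the first file.
awk '{ getline v < "/tmp/m2"; split(v, a); print a[2], $1, $3 }' /tmp/m1
# prints:
# x2 a1 a3
# y2 b1 b3
```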
You can also try this without the paste command, reading file1 into an array keyed by line number first:
awk 'NR==FNR{a[FNR]=$1 OFS $2 OFS $3; next} {print a[FNR], $1}' file1 file2 > mergedfile
