Using sed to index files [duplicate] - bash

file1 contains multiple alphabetic sequences:
AETYUIOOILAKSJ
EAYEURIOPOSIDK
RYXURIAJSKDMAO
URITORIEJAHSJD
YWQIAKSJDHFKCM
HAJSUDIDSJSIAJ
AJDHDPFDIXSIBJ
JAQIAUXCNCVUFO
while file2 contains indexes of the sequences which I want to pull out and transfer to another file. For example, 3T means I want the sequence with a T at position 3 from within file1.
In reality both files are very large with thousands of indexes and sequences.
file2:
3T
10K
14D
1J
Desired output:
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO
Ideally the output should match the order of indexes in file2. In other words the first index "3T" matches sequence "AETYUIOOILAKSJ" and thus this is the first sequence in the new file.
Things I have tried:
grep -f file2 file1
grep -fov file2 file1 # possibly to filter for those non-matching entries
I have also used the command line tool sift but am still having difficulty.
Thanks

$ cat tst.awk
NR==FNR {
lgth = length($0)
pos2char[substr($0,1,lgth-1)] = substr($0,lgth,1)
next
}
{
for (pos in pos2char) {
if ( substr($0,pos,1) == pos2char[pos] ) {
print
next
}
}
}
$ awk -f tst.awk file2 file1
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO

With awk + grep pipeline:
awk '{ pat=sprintf("%*s", int($0)-1, ""); gsub(" ", ".", pat);
printf "^%s%s\n", pat, substr($0, length) }' file2 | grep -f- file1
The output:
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO

Here you go:
awk 'NR==FNR {b[$0]++;next} {for (i in b) {a=match($0,"[A-Z]");n=substr($0,1,(a-1));s=substr($0,a);t=substr(i,n,1);if (t==s) print i}}' file1 file2
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO
Some more readable:
awk '
NR==FNR {
b[$0]++;
next
}
{
for (i in b) {
a=match($0,"[A-Z]");
n=substr($0,1,(a-1));
s=substr($0,a);
t=substr(i,n,1);
if (t==s)
print i
}
}
' file1 file2
With comments:
awk '
NR==FNR { # For the first file
b[$0]++; # Store file1 in in array b
next
}
{
for (i in b) { # Loop trough elements in array b
a=match($0,"[A-Z]"); # For file2 find where letters starts
n=substr($0,1,(a-1)); # Store the number part of file2 in n
s=substr($0,a); # Store the letters part of file2 in s
t=substr(i,n,1); # from file1 find string at position n
if (t==s) # test if string found is equal to letter to find s
print i # if yes, print the line
}
}
' file1 file2

awk '(NR==FNR){a[$0]=substr($0,length);next}
{ for(key in a) if (a[key] == substr($0,key+0,1)) { print; break }
}' file2 file1
Here, the array a[key] is a associative array with the following key-value pairs:
key: value
3T T
10K K
... ...
When processing file2 with the line: (NR==FNR){a[$0]=substr($0,length);next}: we extract the value beforehand so we don't have to do it later on. The index is easily extracted with a math operation. Eg. "10K"+0=10 in Awk.
Processing file1 is done with the next line. Here we just check if the character matches for any of the entries in the associative array.

With GNU awk and grep:
awk -v FPAT='[0-9]+|[A-Z]+' '{ print "^.{" $1-1 "}" $2 }' file1 | grep -Ef - file2
Output:
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO

Related

awk for string comparison with multiple delimiters

I have a file with multiple delimiters, I m looking to compare the value after the first / when read from right left with another file.
code :-
awk -F'[/|]' NR==FNR{a[$3]; next} ($1 in a )' file1 file2 > output
cat file1
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
BXC/DEF/OTTA|fed|92374|
AVD/customer|FST|8736481|
FFS/T6TT/BOSTON|money|18922|
GTS/trust/YYYY|opt|62376|
XXY/IJSH/trust|opt|62376|
cat file2
customer
trust
expected output :-
AAB/BBC/customer|fed|12931|
/customer|fed|12931|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|
$ awk -F\| ' # just use one FS
NR==FNR {
a[$1]
next
}
{
n=split($1,t,/\//) # ... and use split to the 1st field
if(t[n] in a) # and compare the last split part
print
}' file2 file1
Output:
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|
If you use this [/|] you will have 2 delimiters and you will not know what the value after the last pipe was.
Reading your question, you want to compare the first value after the last slash without pipe chars.
If there has to be a / present in the string, you can set that as the field separator and check if there are at least 2 fields using NF > 1
Then take the last field using $NF, split on | and check if the first part is present in one of the values of file2 which are stored in array a
$cat file1
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
BXC/DEF/OTTA|fed|92374|
AVD/customer|FST|8736481|
FFS/T6TT/BOSTON|money|18922|
GTS/trust/YYYY|opt|62376|
XXY/IJSH/trust|opt|62376|
customer
Example code
awk -F/ '
NR==FNR {a[$1];next}
NF > 1 {
split($NF, t, "|")
if(t[1] in a) print
}
' file2 file1
Output
AAB/BBC/customer|fed|12931|
/customer|fed|982311|
AVD/customer|FST|8736481|
XXY/IJSH/trust|opt|62376|

match index notation of file1 to the index of file2 and pull out matching rows

file1 contains multiple alphabetic sequences:
AETYUIOOILAKSJ
EAYEURIOPOSIDK
RYXURIAJSKDMAO
URITORIEJAHSJD
YWQIAKSJDHFKCM
HAJSUDIDSJSIAJ
AJDHDPFDIXSIBJ
JAQIAUXCNCVUFO
while file2 contains indexes of the sequences which I want to pull out and transfer to another file. For example, 3T means I want the sequence with a T at position 3 from within file1.
In reality both files are very large with thousands of indexes and sequences.
file2:
3T
10K
14D
1J
Desired output:
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO
Ideally the output should match the order of indexes in file2. In other words the first index "3T" matches sequence "AETYUIOOILAKSJ" and thus this is the first sequence in the new file.
Things I have tried:
grep -f file2 file1
grep -fov file2 file1 # possibly to filter for those non-matching entries
I have also used the command line tool sift but am still having difficulty.
Thanks
$ cat tst.awk
NR==FNR {
lgth = length($0)
pos2char[substr($0,1,lgth-1)] = substr($0,lgth,1)
next
}
{
for (pos in pos2char) {
if ( substr($0,pos,1) == pos2char[pos] ) {
print
next
}
}
}
$ awk -f tst.awk file2 file1
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO
With awk + grep pipeline:
awk '{ pat=sprintf("%*s", int($0)-1, ""); gsub(" ", ".", pat);
printf "^%s%s\n", pat, substr($0, length) }' file2 | grep -f- file1
The output:
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO
Here you go:
awk 'NR==FNR {b[$0]++;next} {for (i in b) {a=match($0,"[A-Z]");n=substr($0,1,(a-1));s=substr($0,a);t=substr(i,n,1);if (t==s) print i}}' file1 file2
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO
Some more readable:
awk '
NR==FNR {
b[$0]++;
next
}
{
for (i in b) {
a=match($0,"[A-Z]");
n=substr($0,1,(a-1));
s=substr($0,a);
t=substr(i,n,1);
if (t==s)
print i
}
}
' file1 file2
With comments:
awk '
NR==FNR { # For the first file
b[$0]++; # Store file1 in in array b
next
}
{
for (i in b) { # Loop trough elements in array b
a=match($0,"[A-Z]"); # For file2 find where letters starts
n=substr($0,1,(a-1)); # Store the number part of file2 in n
s=substr($0,a); # Store the letters part of file2 in s
t=substr(i,n,1); # from file1 find string at position n
if (t==s) # test if string found is equal to letter to find s
print i # if yes, print the line
}
}
' file1 file2
awk '(NR==FNR){a[$0]=substr($0,length);next}
{ for(key in a) if (a[key] == substr($0,key+0,1)) { print; break }
}' file2 file1
Here, the array a[key] is a associative array with the following key-value pairs:
key: value
3T T
10K K
... ...
When processing file2 with the line: (NR==FNR){a[$0]=substr($0,length);next}: we extract the value beforehand so we don't have to do it later on. The index is easily extracted with a math operation. Eg. "10K"+0=10 in Awk.
Processing file1 is done with the next line. Here we just check if the character matches for any of the entries in the associative array.
With GNU awk and grep:
awk -v FPAT='[0-9]+|[A-Z]+' '{ print "^.{" $1-1 "}" $2 }' file1 | grep -Ef - file2
Output:
AETYUIOOILAKSJ
RYXURIAJSKDMAO
URITORIEJAHSJD
JAQIAUXCNCVUFO

awk - Compare columns from two files and replace text in first file

I have two files. The first has 1 column and the second has 3 columns. I want to compare first columns of both files. If there is a coincidence, replace column 2 and 3 for specific values; if not, print the same line.
File 1:
$ cat file1
26
28
30
File 2:
$ cat file2
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,r,1510139756
27,a,0
28,r,1510244156
29,a,0
30,r,1510157364
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
Desired output:
$ cat file2
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,a,0
27,a,0
28,a,0
29,a,0
30,a,0
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
I am using gawk to do this (it's inside a shell script and I am using solaris) but I can't get the output right. It only prints the lines that matches:
$fuente="file2"
gawk -v fuente="$fuente" 'FNR==NR{a[FNR]=$1; next}{print $1,$2="a",$3="0" }' $fuente file1 > file3
The output I got:
$ cat file3
26 a 0
28 a 0
30 a 0
awk one-liner:
awk 'NR==FNR{ a[$1]; next }$1 in a{ $2="a"; $3=0 }1' file1 FS=',' OFS=',' file2
The output:
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,a,0
27,a,0
28,a,0
29,a,0
30,a,0
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
Really spread out for clarity; called (fuente.awk) like so:
awk -F \, -v fuente=file1 -f fuente.awk file2 # -F == IFS
BEGIN {
OFS="," # set OFS to make printing easier
while (getline x < fuente > 0) # safe way; read file into array
{
a[++i]=x # stuff indexed array
}
}
{ # For each line in file2
for (k=1 ; k<=i ; k++) # Lop over array (elements in file1)
{
if (($1==a[k]) && (! flag))
{
print($1,"a",0) # Found print new line
flag=1 # print only once
}
}
if (! flag) # Not found
{
print($0) # print original
}
flag=0 # reset flag
}
END { }

Fuse two csv files

I am trying to fuse two csv files this way using BASH.
files1.csv :
Col1;Col2
a;b
b:c
file2.csv
Col3;Col4
1;2
3;4
result.csv
Col1;Col2;Col3;Col4
a;b;0;0
b;c;0;0
0;0;1;2
0;0;3;4
The '0's in the result files are just empty cells.
I tried using paste command but it doesn't fuse it the way I want.
paste -d';' file1 file2
Is there a way to do it using BASH?
Thanks.
One in awk:
$ awk -v OFS=";" '
FNR==1 { a[1]=a[1] (a[1]==""?"":OFS) $0; next } # mind headers
FNR==NR { a[NR]=$0 OFS 0 OFS 0; next } # hash file1
{ a[NR]=0 OFS 0 OFS $0 } # hash file2
END { for(i=1;i<=NR;i++)if(i in a)print a[i] } # output
' file1 file2
Col1;Col2;Col3;Col4
a;b;0;0
b:c;0;0
0;0;1;2
0;0;3;4

using awk how to merge 2 files, say A & B and do a left outer join function and include all columns in both files

I have multiple files with different number of columns, i need to do a merge on first file and second file and do a left outer join in awk respective to first file and print all columns in both files matching the first column of both files.
I have tried below codes to get close to my output. But i can't print the ",', where no matching number is found in second file. Below is the code. Join needs sorting and takes more time than awk. My file sizes are big, like 30 million records.
awk -F ',' '{
if (NR==FNR){ r[$1]=$0}
else{ if($1 in r)
r[$1]=r[$1]gensub($1,"",1)}
}END{for(i in r){print r[i]}}' file1 file2
file1
number,column1,column2,..columnN
File2
numbr,column1,column2,..columnN
Output
number,file1.column1,file1.column2,..file1.columnN,file2.column1,file2.column3...,file2.columnN
file1
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
desired output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,,
5,a,b,c,x,y
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
tail = gensub(/[^,]*,/,"",1)
if ( FNR == 1 ) {
empty = gensub(/[^,]/,"","g",tail)
}
file2[$1] = tail
next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
The above uses GNU awk for gensub(), with other awks it's just one more step to do [g]sub() on the appropriate variable after initially assigning it.
An interesting (to me at least!) alternative you might want to test for a performance difference is:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
tail = gensub(/[^,]*,/,"",1)
idx[$1] = NR
file2[NR] = tail
if ( FNR == 1 ) {
file2[""] = gensub(/[^,]/,"","g",tail)
}
next
}
{ print $0, file2[idx[$1]] }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
but I don't really expect it to be any faster and it MAY even be slower.
you can try,
awk 'BEGIN{FS=OFS=","}
FNR==NR{d[$1]=substr($0,index($0,",")+1); next}
{print $0, ($1 in d?d[$1]:",")}' file2 file1
you get,
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
join to the rescue:
$ join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 file1.txt file2.txt
Explanation:
-t $',': Field separator token.
-a 1: Do not discard records from file 1 if not present in file 2.
-e '': Missing records will be treated as an empty field.
-o: Output format.
file1.txt
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2.txt
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
Output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y

Resources