Using a command-line utility to perform the following map-updates - shell

I'm a complete newbie to command-line utilities and am wondering how to process information like the following:
mapping.txt:
80 001 002
81 011 012 013 014
82 021 022
...
input.txt:
81 103823044
80 103823054
81 103823064
...
Desired output.txt:
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823054|001|
103823054|002|
103823064|011|
103823064|012|
103823064|013|
103823064|014|
I've done simple mapping where the column numbers are fixed, but I'm unsure how to map a dynamic number of columns to the desired output.

If order is not important, join and awk can do the job easily.
$ join <(sort input.txt) <(sort mapping.txt) | awk -v OFS="|" '{for (i=3;i<=NF;i++) print $2, $i OFS}'
103823054|001|
103823054|002|
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823064|011|
103823064|012|
103823064|013|
103823064|014|
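For reference, the intermediate join output looks like this (field 1 is the join key, field 2 the id from input.txt, fields 3 and up the mapping values), which is why the awk loop starts at $3:
$ join <(sort input.txt) <(sort mapping.txt)
80 103823054 001 002
81 103823044 011 012 013 014
81 103823064 011 012 013 014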

Here's a GNU awk script that uses arrays of arrays (gawk 4.0+) to do what you want:
#!/usr/bin/awk -f
BEGIN { OFS="|" }
FNR==NR { for(i=2;i<=NF;i++) a[$1][$i]; next }
$1 in a { for(k in a[$1]) print $2, k, "" }
If you save that to a file like script.awk and then chmod +x script.awk, you can run it like:
$ ./script.awk mapping.txt input.txt
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823054|002|
103823054|001|
103823064|011|
103823064|012|
103823064|013|
103823064|014|
Here's a breakdown of the script:
BEGIN - set the output field separator to |
FNR==NR - process the first file (mapping.txt) and store the data in an array of arrays, indexed by $1 first, then by each remaining field. next skips any further processing for those lines.
$1 in a - test whether the line has a mapping. If so, print the corresponding mappings (note that for (k in ...) traversal order is unspecified, even in GNU awk, which is why 002 printed before 001 above; a sketch for forcing a sorted order follows). The commas in the print command are converted to the OFS value.
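If you do need a deterministic order, GNU awk can sort for-in traversal via PROCINFO["sorted_in"]; a minimal sketch of the adjusted BEGIN block:
BEGIN {
    OFS = "|"
    PROCINFO["sorted_in"] = "@ind_str_asc"  # gawk only: traverse arrays by index, ascending as strings
}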
It can be condensed into a "one-liner":
awk -v OFS="|" 'FNR==NR {for(i=2;i<=NF;i++) a[$1][$i]; next} $1 in a {for(k in a[$1]) print $2, k, ""}' mapping.txt input.txt
Here's a version of the script that uses a single-dimensional array to store $0 and split()s it later, which preserves the original column order:
#!/usr/bin/awk -f
BEGIN { OFS="|" }
FNR==NR { a[$1]=$0; next }
$1 in a { c=split(a[$1], b); for(i=2;i<=c;i++) print $2, b[i], "" }
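Run the same way (assuming you save this variant as script.awk and chmod +x it as before), it emits the rows in the original mapping order, matching the desired output exactly:
$ ./script.awk mapping.txt input.txt
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823054|001|
103823054|002|
103823064|011|
103823064|012|
103823064|013|
103823064|014|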

Related

awk - Compare columns from two files and replace text in first file

I have two files. The first has 1 column and the second has 3 columns. I want to compare the first columns of both files. If there is a match, replace columns 2 and 3 with specific values; if not, print the line unchanged.
File 1:
$ cat file1
26
28
30
File 2:
$ cat file2
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,r,1510139756
27,a,0
28,r,1510244156
29,a,0
30,r,1510157364
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
Desired output:
$ cat file2
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,a,0
27,a,0
28,a,0
29,a,0
30,a,0
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
I am using gawk to do this (it's inside a shell script and I am using Solaris) but I can't get the output right. It only prints the lines that match:
fuente="file2"
gawk -v fuente="$fuente" 'FNR==NR{a[FNR]=$1; next}{print $1,$2="a",$3="0" }' $fuente file1 > file3
The output I got (the script as written prints one rewritten line per record of file1, the second file argument, which is why only those three keys appear):
$ cat file3
26 a 0
28 a 0
30 a 0
awk one-liner (a commented expansion follows the output):
awk 'NR==FNR{ a[$1]; next }$1 in a{ $2="a"; $3=0 }1' file1 FS=',' OFS=',' file2
The output:
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,a,0
27,a,0
28,a,0
29,a,0
30,a,0
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
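Spelled out with comments, the one-liner reads as follows; note that FS=',' and OFS=',' sit between the file arguments, so file1 is still read with default whitespace splitting while file2 is split on commas, and assigning $2 forces awk to rebuild the record with OFS:
awk '
NR==FNR { a[$1]; next }    # file1: remember each key as an array index
$1 in a { $2="a"; $3=0 }   # file2: on a key match, overwrite columns 2 and 3
1                          # always-true pattern: print every (possibly modified) record
' file1 FS=',' OFS=',' file2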
Really spread out for clarity; save it as fuente.awk and call it like so:
awk -F \, -v fuente=file1 -f fuente.awk file2 # -F sets the input field separator
BEGIN {
    OFS=","                              # set OFS to make printing easier
    while ((getline x < fuente) > 0)     # safe way: read the file into an array
    {
        a[++i]=x                         # stuff indexed array
    }
}
{                                        # For each line in file2
    for (k=1 ; k<=i ; k++)               # Loop over array (elements of file1)
    {
        if (($1==a[k]) && (! flag))
        {
            print($1,"a",0)              # Match found: print the new line
            flag=1                       # print only once
        }
    }
    if (! flag)                          # Not found
    {
        print($0)                        # print original
    }
    flag=0                               # reset flag
}
END { }

awk match substring in column from 2 files

I have the following two files (the real data is tab-delimited rather than semicolon-delimited):
input.txt
Astring|2042;MAR0303;foo1;B
Dstring|2929;MAR0283;foo2;C
db.txt (updated):
TG9284;Astring|2042|morefoohere_foo_foo
TG9281;Cstring|2742|foofoofoofoofoo Dstring|2929|foofoofoo
So, column 1 of input.txt is a substring of column 2 of db.txt. Only the first two "fields" separated by | matter here.
I want to use awk to match these two columns and print the following (again in tab-delimited form):
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
This is my code:
awk -F'[\t]' 'NR==FNR{a[$1]=$1}$1 in a {print $0"\t"$1}' input.txt db.txt
EDIT
Column 2 of db.txt contains strings from column 1 of input.txt, delimited by a space. There are many more strings in the real data than shown in this short excerpt.
You can use this awk:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{
split($2, b, "|"); a[b[1] "|" b[2]]=$1; next}
$1 in a {print $0, a[$1]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
EDIT:
As per your comment you can use:
awk 'BEGIN{FS=OFS="\t"} NR==FNR {
a[$2]=$1; next} {for (i in a) if (index(i, $1)) print $0, a[i]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
Going with the semicolons (you can replace them with tabs):
$ awk -F\; '
NR==FNR {                      # hash the db file
    a[$2]=$1
    next
}
{
    for(i in a)                # for each record in input file
        if($1~i) {             # see if $1 matches a key in a
            print $0 ";" a[i]  # output
            # delete a[i]      # delete entry from a for speed (if possible?)
            break              # on match, break from for loop for speed
        }
}' db input                    # note the argument order: db first, then input
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
For each record in input, the script matches $1 against every entry in a (the hashed db), so it's slow. The break exits the loop after the first match, and deleting the matched entry from a (the commented-out delete) can speed things up further, if your data allows it.
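If, as in the excerpt, the keys always appear as the first two |-separated fields of each space-separated word in db's column 2, you can index them up front and replace the per-record scan with a single hash lookup; a sketch under that assumption:
awk -F'\t' '
NR==FNR {                           # db file: index every word in column 2
    n = split($2, w, " ")
    for (j=1; j<=n; j++) {
        split(w[j], p, "|")         # key = first two |-separated fields
        a[p[1] "|" p[2]] = $1
    }
    next
}
$1 in a { print $0 "\t" a[$1] }     # input file: one O(1) lookup per record
' db.txt input.txt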

Fuse two csv files

I am trying to fuse two CSV files this way using Bash.
file1.csv:
Col1;Col2
a;b
b;c
file2.csv:
Col3;Col4
1;2
3;4
result.csv:
Col1;Col2;Col3;Col4
a;b;0;0
b;c;0;0
0;0;1;2
0;0;3;4
The '0's in the result file just represent empty cells.
I tried using the paste command, but it doesn't fuse the files the way I want:
paste -d';' file1 file2
Is there a way to do it using Bash?
Thanks.
One in awk:
$ awk -v OFS=";" '
FNR==1 { a[1]=a[1] (a[1]==""?"":OFS) $0; next } # mind headers
FNR==NR { a[NR]=$0 OFS 0 OFS 0; next } # hash file1
{ a[NR]=0 OFS 0 OFS $0 } # hash file2
END { for(i=1;i<=NR;i++)if(i in a)print a[i] } # output
' file1 file2
Col1;Col2;Col3;Col4
a;b;0;0
b;c;0;0
0;0;1;2
0;0;3;4
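Since paste was the original instinct, a shell-only alternative also works; a sketch assuming one header line and two data columns per file, as in the sample:
{
  printf '%s;%s\n' "$(head -n1 file1)" "$(head -n1 file2)"  # merged header line
  tail -n +2 file1 | sed 's/$/;0;0/'                        # file1 rows, padded on the right
  tail -n +2 file2 | sed 's/^/0;0;/'                        # file2 rows, padded on the left
} > result.csv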

Using awk, how to merge 2 files (say A & B), do a left outer join, and include all columns from both files

I have multiple files with different numbers of columns. I need to merge the first and second files, doing a left outer join in awk relative to the first file, and print all columns of both files for records matching on the first column.
I have tried the code below to get close to my output, but I can't print the empty fields (the trailing ",,") where no matching number is found in the second file. join needs sorted input and takes more time than awk, and my files are big, around 30 million records. Below is the code.
awk -F ',' '{
    if (NR==FNR) { r[$1]=$0 }
    else { if ($1 in r)
        r[$1]=r[$1] gensub($1,"",1) }
}END{ for (i in r) { print r[i] } }' file1 file2
file1
number,column1,column2,..columnN
file2
number,column1,column2,..columnN
Output
number,file1.column1,file1.column2,..file1.columnN,file2.column1,file2.column3...,file2.columnN
file1
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
desired output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,,
5,a,b,c,x,y
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    tail = gensub(/[^,]*,/,"",1)
    if ( FNR == 1 ) {
        empty = gensub(/[^,]/,"","g",tail)
    }
    file2[$1] = tail
    next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
The above uses GNU awk for gensub(); with other awks it's just one more step to do [g]sub() on the appropriate variable after initially assigning it.
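For reference, a portable rendering of that first block with [g]sub() (a sketch, same logic as above):
NR==FNR {
    tail = $0
    sub(/[^,]*,/, "", tail)        # portable stand-in for gensub(): strip the key field
    if ( FNR == 1 ) {
        empty = tail
        gsub(/[^,]/, "", empty)    # keep only the commas as the empty-fields template
    }
    file2[$1] = tail
    next
}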
An interesting (to me at least!) alternative you might want to test for a performance difference is:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    tail = gensub(/[^,]*,/,"",1)
    idx[$1] = NR
    file2[NR] = tail
    if ( FNR == 1 ) {
        file2[""] = gensub(/[^,]/,"","g",tail)
    }
    next
}
{ print $0, file2[idx[$1]] }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
but I don't really expect it to be any faster and it MAY even be slower.
You can try (substr($0, index($0,",")+1) grabs everything after the first comma, i.e. the same tail the gensub() versions build, and the "," fallback, joined with OFS, yields the two empty columns):
awk 'BEGIN{FS=OFS=","}
FNR==NR{d[$1]=substr($0,index($0,",")+1); next}
{print $0, ($1 in d?d[$1]:",")}' file2 file1
You get:
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
join to the rescue:
$ join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 file1.txt file2.txt
Explanation:
-t $',': field separator.
-a 1: do not discard records from file 1 when there is no match in file 2.
-e '': replace missing fields with an empty string.
-o: output format (0 is the join field; 1.2 means file 1, field 2; and so on). Note that join requires both inputs to be sorted on the join field, which these sample files already are.
file1.txt
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2.txt
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
Output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
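If your join is GNU coreutils, -o auto can replace the hand-written field list (it infers the field count from the first line of each file); that's an assumption about your platform, so check join --version first:
$ join -t ',' -a 1 -e '' -o auto file1.txt file2.txt
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y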

Iterate through list in bash and run multiple grep commands

I would like to iterate through a list and grep for the items, then use awk to pull out important information from each grep result. (This is the way I thought to do it, but awk and grep aren't necessary if there is a better way).
The input file contains a number of lines that looks similar to this:
chr1 12345 . A G 3e-12 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;
I have a number of locations that should match some part of the second column.
locList="123, 789"
And for each matching location I would like to get the information from columns 4 and 5 and write them to an output file with the corresponding location.
So the output for the above list should be:
123 A G
Something like this is what I'm thinking:
for i in locList; do
grep i inputFile.txt | awk '{print $2,$4,$5}'
done
Invoking grep/awk once per location will be highly inefficient. You want to invoke a single command that will do your parsing. For example, awk:
awk -v locList="12345 789" '
BEGIN {
# parse the location list, and create an array where
# the locations are the array indexes
n = split(locList, a)
for (i=1; i<=n; i++) locations[a[i]] = 1
}
$2 in locations {print $2, $4, $5}
' file
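Run against the sample line (note the list here must contain the full value 12345, since matching is exact), this version prints:
12345 A G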
For the revised requirements (matching on a prefix of column 2):
awk -v locList="123 789" '
BEGIN { n = split(locList, patterns) }
{
for (i=1; i<=n; i++) {
if ($2 ~ "^" patterns[i]) {
print $2, $4, $5
break
}
}
}
' file
The ~ operator is the regular expression matching operator.
That will output 12345 A G from your sample input. If you just want to output 123 A G then print patterns[i] instead of $2.
awk -v locList='123|789' '$2~"^("locList")" {print $2,$4,$5}' file
or if you prefer:
locList='123, 789'
awk -v locList="^(${locList//, /|})" '$2~locList {print $2,$4,$5}' file
or whatever other permutation you like. The point is you don't need a loop at all - just create a regexp from the list of numbers in locList and test that regexp once.
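For reference, the parameter expansion ${locList//, /|} replaces every ", " with "|", producing the anchored alternation; a quick check in bash:
$ locList='123, 789'
$ echo "^(${locList//, /|})"
^(123|789)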
What I would do:
locList="123 789"
for i in $locList; do awk -vvar=$i '$2 ~ var{print $4, $5}' file; done
