AWK - Processing multiple files through a for loop and conditional check - bash

File 1: myfilename_WEEK.csv
w27_2018,257,1,26.20,0.00,24.26
w28_2018,257,1,7.97,0.00,24.26
w29_2018,257,1,34.86,0.00,24.26
w30_2018,257,1,3.29,0.00,24.26
File 2: myfilename_MONTH.csv
m07_2018,257,1,94.78,0.00,121.31
m08_2018,257,1,719.60,0.00,262.47
m09_2018,257,1,14925.60,0.00,13903.24
m10_2018,257,1,51099.66,0.00,81600.69
File 3: myfilename_HALF.csv
h02_2018,257,1,155345.19,480029.21,235802.91
h01_2019,257,1,273961.84,552545.36,140706.27
h02_2018,258,1,3250552.06,1299785.91,3697749.57
h01_2019,258,1,3582585.66,2670427.72,4009391.28
calendar_file:
20180805,08/05/2018,w27_2018,WK27 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,27,WEEK 27,01,SUNDAY
20180806,08/06/2018,w27_2018,WK27 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,27,WEEK 27,02,MONDAY
...
20180811,08/11/2018,w27_2018,WK27 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,27,WEEK 27,07,SATURDAY
20180812,08/12/2018,w28_2018,WK28 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,28,WEEK 28,01,SUNDAY
..
20180816,08/16/2018,w28_2018,WK28 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,28,WEEK 28,05,THURSDAY
Expected output (newlines added for readability):
2018,w27_2018,WK27 2018,257,1,26.20,0.00,24.26
2018,w27_2018,WK27 2018,258,1,97192.07,9028.38,52130.32
2018,w27_2018,WK27 2018,300,1,181.44,0.00,-69.72
2018,m07_2018,AUG 2018,257,1,94.78,0.00,121.31
2018,m07_2018,AUG 2018,258,1,509253.46,45141.91,399648.71
2018,m07_2018,AUG 2018,300,1,409.10,0.00,-348.60
2018,h02_2018,H02 2018,257,1,155345.19,480029.21,235802.91
2018,h02_2018,H02 2018,258,1,3250552.06,1299785.91,3697749.57
2018,h02_2018,H02 2018,300,1,1112.93,0.00,-1164.35
I would like to join all myfilename_* files with calendar_file to add a label and the fiscal year.
The individual commands are:
awk -F, 'NR==FNR {a[$3]=substr($12,3,4) FS $3 FS $4; next} {print a[$1] FS $2 FS $3 FS $4 FS $5 FS $6}' calendar_file myfilename_WEEK.csv >> my_report.csv
awk -F, 'NR==FNR {a[$5]=substr($12,3,4) FS $5 FS $6; next} {print a[$1] FS $2 FS $3 FS $4 FS $5 FS $6}' calendar_file myfilename_MONTH.csv >> my_report.csv
awk -F, 'NR==FNR {a[$9]=substr($12,3,4) FS $9 FS $10; next} {print a[$1] FS $2 FS $3 FS $4 FS $5 FS $6}' calendar_file myfilename_HALF.csv >> my_report.csv
I am trying to combine these into a single loop. I tried the following, but it doesn't work:
for exp_file in `ls myfilename_*.csv`
do
awk -F, '\
{ \
if(NR==FNR && FILENAME ~ /WEEK/) {a[$3]=substr($12,3,4) FS $3 FS $4; next} ;\
if(NR==FNR && FILENAME ~ /MONTH/) {a[$5]=substr($12,3,4) FS $5 FS $6; next} ;\
if(NR==FNR && FILENAME ~ /HALF/) {a[$9]=substr($12,3,4) FS $9 FS $10; next} ;\
{print a[$1] FS $2 FS $3 FS $4 FS $5 FS $6} \
}' calendar_file $exp_file >> my_report.csv
done
How can I achieve this? Thanks for your help in advance!

First way (GNU awk; if you don't have GNU awk, please leave a comment):
awk -F, 'NR==FNR{y=substr($12,3,4); a[$3]=y FS $3 FS $4; b[$5]=y FS $5 FS $6; c[$9]=y FS $9 FS $10; next} FNR==1{printf nl;nl="\n"} match(FILENAME, /myfilename_([A-Z]*)/, f){NF=6;switch(f[1]){case "WEEK": $1=a[$1];break; case "MONTH": $1=b[$1];break; case "HALF": $1=c[$1];}}1' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
Multiple lines for readability:
awk -F, '
NR==FNR{
y=substr($12,3,4);
a[$3]=y FS $3 FS $4;
b[$5]=y FS $5 FS $6;
c[$9]=y FS $9 FS $10;
next
}
FNR==1{printf nl;nl=ORS} ## Prints a blank line between files; remove this line if you do not need it.
match(FILENAME, /myfilename_([A-Z]*)/, f){
NF=6; ## Keeps only the first 6 input fields; can be removed.
switch(f[1]){
case "WEEK":
$1=a[$1];
break;
case "MONTH":
$1=b[$1];
break;
case "HALF":
$1=c[$1];
}
}1' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
An update to it:
awk -F, '
NR==FNR{
y=substr($12,3,4);
a[$3]=y FS $3 FS $4;
b[$5]=y FS $5 FS $6;
c[$9]=y FS $9 FS $10;
next
}
FNR==1{printf nl;nl=ORS} ## Prints a blank line between files; remove this line if you do not need it.
match(FILENAME, /myfilename_([A-Z]*)/, f){
NF=6; ## Keeps only the first 6 input fields; can be removed in your case.
$1 = f[1]=="WEEK" ? a[$1] : ( f[1]=="MONTH" ? b[$1] : (f[1]=="HALF" ? c[$1] : $1) )
}1' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
Second Way, more concise and without using switch (also GNU awk):
awk -F, '
NR==FNR{
y=substr($12,3,4);
a[$3 "WEEK"]=y FS $3 FS $4;
a[$5 "MONTH"]=y FS $5 FS $6;
a[$9 "HALF"]=y FS $9 FS $10;
next
}
FNR==1{printf nl;nl=ORS} ## Prints a blank line between files; remove this line if you do not need it.
match(FILENAME, /myfilename_([A-Z]*)/, f){
$1=a[$1 f[1]];
}1' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
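Third way: if the first field of every row corresponds to its filename (w... rows in the WEEK file, m... in MONTH, h... in HALF), as in your samples, there is a third way that drops the match() call, so it no longer depends on GNU awk's three-argument match(). Note that the optional check below uses {2} and {4} interval expressions, which some older awks do not support: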
awk -F, '
NR==FNR{
y=substr($12,3,4);
a[$3 "w"]=y FS $3 FS $4;
a[$5 "m"]=y FS $5 FS $6;
a[$9 "h"]=y FS $9 FS $10;
next
}
FNR==1{printf nl;nl=ORS} ## Prints a blank line between files; remove this line if you do not need it.
$1~/^([wmh])[0-9]{2}_[0-9]{4}/{ ## Optional check that the first field has the expected shape; not needed if all your data looks like the samples.
$1=a[$1 substr($1,1,1)]
print
}' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
On second thought, given how your data relates to the filenames, there is actually no need to check the first letter (or the filename) at all:
awk -F, '
NR==FNR{
y=substr($12,3,4);
a[$3]=y FS $3 FS $4;
a[$5]=y FS $5 FS $6;
a[$9]=y FS $9 FS $10;
next
}
FNR==1{printf nl;nl=ORS} ## Prints a blank line between files; remove this line if you do not need it.
{ ## Prefix this line with $1~/^([wmh])[0-9]{2}_[0-9]{4}/ if you want to check the first column.
$1=a[$1]
}1' OFS=, calendar_file myfilename_{WEEK,MONTH,HALF}.csv
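For reference, the loop in the question fails because FILENAME is calendar_file for as long as NR==FNR holds, so none of the FILENAME ~ /WEEK/ style tests can be true during the calendar pass and the lookup array is never filled. A minimal sketch of a loop that does work, assuming it is acceptable to inspect ARGV[2] (the data file's name) instead; the temporary directory and the one-row sample files are made up for illustration:

```shell
# One-row sample files, abbreviated from the question (illustration only).
dir=$(mktemp -d)
cat > "$dir/calendar_file" <<'EOF'
20180805,08/05/2018,w27_2018,WK27 2018,m07_2018,AUG 2018,q03_2018,Q03 2018,h02_2018,H02 2018,a2018,FY2018,27,WEEK 27,01,SUNDAY
EOF
printf '%s\n' 'w27_2018,257,1,26.20,0.00,24.26' > "$dir/myfilename_WEEK.csv"
printf '%s\n' 'm07_2018,257,1,94.78,0.00,121.31' > "$dir/myfilename_MONTH.csv"
printf '%s\n' 'h02_2018,257,1,155345.19,480029.21,235802.91' > "$dir/myfilename_HALF.csv"

report=$(
  for exp_file in "$dir"/myfilename_*.csv; do
    awk -F, -v OFS=, '
      # While the calendar is being read, FILENAME is the calendar itself,
      # so the key column is chosen from the name of the second operand.
      NR == FNR {
        k = (ARGV[2] ~ /_WEEK\.csv$/) ? 3 : ((ARGV[2] ~ /_MONTH\.csv$/) ? 5 : 9)
        a[$k] = substr($12, 3, 4) OFS $k OFS $(k + 1)
        next
      }
      # Data file: replace the period code with "year,code,label".
      { $1 = a[$1]; print }
    ' "$dir/calendar_file" "$exp_file"
  done
)
printf '%s\n' "$report"
rm -rf "$dir"
```

The glob expands in sorted order, so the HALF rows come first here; the single-invocation answers above avoid re-reading the calendar once per file.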

Here is another awk solution that is portable and efficient, and that relies on the order in which the files are given on the command line rather than on their names.
awk -F ',' -v OFS=',' '
NR==FNR {
y=substr($12,3,4)
a[ARGV[2],$3]=y OFS $3 OFS $4 # week
a[ARGV[3],$5]=y OFS $5 OFS $6 # month
a[ARGV[4],$9]=y OFS $9 OFS $10 # half
next
}
{
$1=a[FILENAME,$1]
} 1' calendar.csv week.csv month.csv half.csv
Note that if your calendar file is sorted, there is no need to parse the fiscal year field again for every line. Something like this avoids the repeated substr() call in that case:
if(p!=$12) y=substr(p=$12,3,4)
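A small sketch of what that caching line does, using three made-up 12-field rows (f2..f11 are placeholders, not real calendar data). The counter named recomputed is added only to make the effect visible:

```shell
result=$(
  printf '%s\n' \
    'd1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,FY2018' \
    'd2,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,FY2018' \
    'd3,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,FY2019' |
  awk -F, '
    {
      # Recompute y only when field 12 differs from the previous line.
      if (p != $12) { y = substr(p = $12, 3, 4); recomputed++ }
      years = years (NR > 1 ? " " : "") y
    }
    END { print recomputed, years }
  '
)
printf '%s\n' "$result"
```

Three rows are read but substr() runs only twice, once per distinct fiscal year.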

Related

Error with awk (newline or end of string)

I am having an issue with the following command:
awk ‘{if ($1 ~ /^##contig/) {next}else if ($1 ~ /^#/) {print $0; next}else {print $0 | “sort -k1,1V -k2,2n”}’ file.vcf > out.vcf
It gives the following error:
^ unexpected newline or end of string
Your command contains "fancy quotes" instead of normal ones, in addition to a missing }.
awk '{if ($1 ~ /^##contig/) {next} else if ($1 ~ /^#/) {print $0; next} else {print $0 | "sort -k1,1V -k2,2n"} }' file.vcf > out.vcf
Changing your command to the above should work as expected.
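Curly quotes are easy to miss by eye. One way to check a pasted command for them is to count the bytes that are not plain ASCII; this is only a sketch, and the sample command is built with printf octal escapes so the two curly quotes (three UTF-8 bytes each) are unambiguous:

```shell
# A shortened command wrapped in U+2018 / U+2019 instead of plain single quotes.
cmd=$(printf 'awk \342\200\230{print $1}\342\200\231 file.vcf')

# Delete every ASCII byte and count what is left.
non_ascii=$(printf '%s' "$cmd" | LC_ALL=C tr -d '\000-\177' | wc -c)
non_ascii=$((non_ascii))   # normalise any padding some wc implementations add

if [ "$non_ascii" -gt 0 ]; then
  echo "found $non_ascii non-ASCII bytes: retype the quotes as plain ' and \""
fi
```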

How to use bash to split column by strings?

Given the tab delimited file with eight columns:
22 51244237 rs575160859 C T 100 PASS AC=19;AF=0.00379393;AN=5008;NS=2504;DP=13345;EAS_AF=0;AMR_AF=0.0043;AFR_AF=0;EUR_AF=0.0099;SAS_AF=0.0061;AA=.|||;VT=SNP
How can I use bash to create a new tab delimited file from information in the eighth column with the columns: AF; EAS_AF; AMR_AF; AFR_AF; EUR_AF; SAS_AF and the corresponding numeric value?
ie:
#AF EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF
0.00379393 0 0.0043 0 0.0099 0.0061
I understand I could split the eighth column by ";" (https://unix.stackexchange.com/questions/156919/splitting-a-column-using-awk) and then remove the unwanted text columns and text strings (i.e. "AF="), but is there a more efficient way to do this?
Thanks
Could you please try the following.
awk '
{
match($0,/AF[^;]*/)
af=substr($0,RSTART,RLENGTH)
match($0,/EAS_AF[^;]*/)
eas=substr($0,RSTART,RLENGTH)
match($0,/AMR_AF[^;]*/)
amr=substr($0,RSTART,RLENGTH)
match($0,/AFR_AF[^;]*/)
afr=substr($0,RSTART,RLENGTH)
match($0,/EUR_AF[^;]*/)
eur=substr($0,RSTART,RLENGTH)
match($0,/SAS_AF[^;]*/)
sas=substr($0,RSTART,RLENGTH)
VAL=af OFS ac OFS eas OFS amr OFS afr OFS eur OFS sas
split(VAL,array,"[= ]")
print array[1],array[4],array[6],array[8],array[10],array[12] ORS array[2],array[5],array[7],array[9],array[11],array[13]
}' Input_file | column -t
Explanation: the same code with comments added.
awk '
{
match($0,/AF[^;]*/) ## Find the first "AF" and everything after it up to the next semicolon.
af=substr($0,RSTART,RLENGTH) ## Save the matched text (e.g. AF=0.00379393) in af.
match($0,/EAS_AF[^;]*/) ## Same for EAS_AF.
eas=substr($0,RSTART,RLENGTH) ## Save the matched text in eas.
match($0,/AMR_AF[^;]*/) ## Same for AMR_AF.
amr=substr($0,RSTART,RLENGTH) ## Save the matched text in amr.
match($0,/AFR_AF[^;]*/) ## Same for AFR_AF.
afr=substr($0,RSTART,RLENGTH) ## Save the matched text in afr.
match($0,/EUR_AF[^;]*/) ## Same for EUR_AF.
eur=substr($0,RSTART,RLENGTH) ## Save the matched text in eur.
match($0,/SAS_AF[^;]*/) ## Same for SAS_AF.
sas=substr($0,RSTART,RLENGTH) ## Save the matched text in sas.
VAL=af OFS ac OFS eas OFS amr OFS afr OFS eur OFS sas ## Join the saved strings with OFS; ac is never set, so it adds an empty slot that the indexes below account for.
split(VAL,array,"[= ]") ## Split VAL on "=" or space into array.
print array[1],array[4],array[6],array[8],array[10],array[12] ORS array[2],array[5],array[7],array[9],array[11],array[13] ## Print the names on one line and the values on the next.
af=ac=eas=amr=afr=eur=sas="" ## Reset the variables for the next record.
}' Input_file | column -t ## Read Input_file; column -t aligns the output into columns.
Split column by ";"
awk -F";" '$1=$1' OFS="\t" file.temp > tmp && mv tmp file.temp
Remove unwanted columns (new header: CHROM POS ID REF ALT QUAL FILTER AC AF EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF)
awk '{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $13, $14, $15, $16, $17}' file.temp > tmp && mv tmp file.temp
Remove unwanted strings
awk '{ gsub("SAS_AF=", "", $14); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("EUR_AF=", "", $13); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("AFR_AF=", "", $12); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("AMR_AF=", "", $11); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("EAS_AF=", "", $10); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("AF=", "", $9); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("AC=", "", $8); print }' file.temp > tmp && mv tmp file.temp
This is how to really approach this task:
$ cat tst.awk
BEGIN {
FS=OFS="\t"
numFlds = split("AF EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF",fldNames,/ /)
printf "#"
for (i=1; i<=numFlds; i++) {
printf "%s%s", fldNames[i], (i<numFlds ? OFS : ORS)
}
}
{
delete name2val # forget the previous line so a missing key cannot inherit an old value
nf = split($8,tmp,/[;=]/)
for (i=1; i<nf; i+=2) {
fldName = tmp[i]
fldVal = tmp[i+1]
name2val[fldName] = fldVal
}
for (i=1; i<=numFlds; i++) {
fldName = fldNames[i]
fldVal = name2val[fldName]
printf "%s%s", fldVal, (i<numFlds ? OFS : ORS)
}
}
$ awk -f tst.awk file
#AF EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF
0.00379393 0 0.0043 0 0.0099 0.0061
The alignment in the output only looks off because it's tab-separated as required.
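As a further sketch of the same name-to-value idea as tst.awk above, the variant below splits each key=value pair at its first "=" so values are looked up by exact key name, and prints a placeholder when a key is absent from a line. The "." placeholder and the want variable are assumptions made for illustration; they are not part of the answer above.

```shell
info='AC=19;AF=0.00379393;AN=5008;NS=2504;DP=13345;EAS_AF=0;AMR_AF=0.0043;AFR_AF=0;EUR_AF=0.0099;SAS_AF=0.0061;AA=.|||;VT=SNP'

vals=$(
  printf '22\t51244237\trs575160859\tC\tT\t100\tPASS\t%s\n' "$info" |
  awk -F'\t' -v OFS='\t' -v want='AF EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF' '
    BEGIN { n = split(want, names, " ") }
    {
      split("", val)                       # portable way to empty the array per record
      m = split($8, kv, ";")               # one element per key=value pair
      for (i = 1; i <= m; i++) {
        eq = index(kv[i], "=")             # position of the first "="
        if (eq) val[substr(kv[i], 1, eq - 1)] = substr(kv[i], eq + 1)
      }
      line = ""
      for (i = 1; i <= n; i++)
        line = line (i > 1 ? OFS : "") ((names[i] in val) ? val[names[i]] : ".")
      print line
    }
  '
)
printf '%s\n' "$vals"
```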

Remove duplicate from csv using bash / awk

I have a csv file with the format :
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
I want to group by first column unique id's and concat types in a single row like this:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
I found awk does a great job in handling such scenarios. But all I could achieve is this:
"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"
I used this command:
awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
How can I remove the duplicates and also handle the formatting of the second column types?
Quick fix:
$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
!seen[$0]++ is true only the first time a given line is seen.
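The idiom works because the post-increment returns the old count: 0 (false) the first time a line appears, so the negation is true, and non-zero on every repeat. A tiny standalone illustration on made-up input:

```shell
# Five lines in, three distinct lines out, in first-seen order.
deduped=$(printf '%s\n' a b a c b | awk '!seen[$0]++')
printf '%s\n' "$deduped"
```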
If the second column should be wrapped in a single pair of double quotes:
$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
!seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"
With GNU awk for true multi-dimensional arrays and gensub() and sorted_in:
$ awk -F'|' '
{ a[$1][gensub(/"/,"","g",$2)] }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in a) {
c = 0
for (j in a[i]) {
printf "%s%s", (c++ ? ":" : i "|\""), j
}
print "\""
}
}
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
The output rows and columns will both be string-sorted (i.e. alphabetically by characters) in ascending order.
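Where GNU awk is not available, a roughly equivalent sorted result can be sketched with sort -u feeding a plain awk that emits a group whenever the key changes. This is a sketch under assumptions: it relies on LC_ALL=C byte ordering and on no field containing the | delimiter.

```shell
input='"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"'

grouped=$(
  printf '%s\n' "$input" | LC_ALL=C sort -u | awk -F'|' '
    { v = $2; gsub(/"/, "", v) }                     # value without its quotes
    $1 != key {                                      # first row of a new id
      if (NR > 1) printf "%s|\"%s\"\n", key, vals    # flush the previous id
      key = $1; vals = v; next
    }
    { vals = vals ":" v }                            # same id: append
    END { if (NR) printf "%s|\"%s\"\n", key, vals }  # flush the last id
  '
)
printf '%s\n' "$grouped"
```

Because the input is already sorted and de-duplicated, no array is needed and the output order is deterministic.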
Short GNU datamash + tr solution:
datamash -st'|' -g1 unique 2 <file | tr ',' ':'
The output:
"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"
----------
If the double quotes between items should be removed, use the following alternative:
datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
The one-liners below work for the sample input, but the output order is not sorted.
One-liners:
# using two arrays (recommended)
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
# using regexp
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
Test Results:
$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
More readable:
Using regexp
awk 'BEGIN{
FS=OFS="|"
}
{
a[$1] =$1 in a ?(a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2):$2
}
END{
for(i in a)
print i,a[i]
}
' infile
Using two arrays
awk 'BEGIN{
FS=OFS="|"
}
!seen[$1,$2]++{
a[$1] = ($1 in a ? a[$1] ":" : "") $2
}
END{
for(i in a)
print i,a[i]
}' infile
Note: you can also use !seen[$0]++, which uses the entire line as the index. If your real data
has more columns and only some of them should define a duplicate, prefer !seen[$1,$2]++,
where only column 1 and column 2 are used as the index.
awk + sort solution:
awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

Print a column in awk if it matches, if not then still print the line (without that column)

I'm trying to do some filtering with awk but I'm currently running into an issue. I can make awk match a regex and print the line with the column but I cannot make it print the line without the column.
awk -v OFS='\t' '$6 ~ /^07/ {print $3, $4, $5, $6}' file
Is currently what I have. Can I make awk print the line without the sixth column if it doesn't match the regex?
Set $6 to the empty string if the regex doesn't match. As simple as that. This should do it:
awk -v OFS='\t' '{ if ($6 ~ /^07/) { print $3, $4, $5, $6 } else { $6 = ""; print $0; } }' file
Note that $0 is the entire line, including $2 (which you didn't seem to use). Assigning to $6 rebuilds $0 with OFS, so every other column is printed and the 6th is left as an empty field.
If you just want to print $3, $4 and $5 when there isn't a match, use this instead:
awk -v OFS='\t' '{ if ($6 ~ /^07/) print $3, $4, $5, $6; else print $3, $4, $5 }' file
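If the sixth column should disappear completely rather than be left as an empty field, one option is to rebuild the line from the remaining fields. A minimal sketch, assuming tab-separated input (the question does not say) and a made-up two-line sample:

```shell
filtered=$(
  printf 'a\tb\tc\td\te\t0712\tg\na\tb\tc\td\te\t0812\tg\n' |
  awk -F'\t' -v OFS='\t' '
    $6 ~ /^07/ { print $3, $4, $5, $6; next }   # match: selected columns only
    {                                           # no match: every column except the sixth
      line = ""; sep = ""
      for (i = 1; i <= NF; i++) {
        if (i == 6) continue
        line = line sep $i
        sep = OFS
      }
      print line
    }
  '
)
printf '%s\n' "$filtered"
```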

awk match and merge two files on basis of key values

I have two files in which $3,$4 = $3,$2.
file1:
1211,A2,ittp,1,IPSG,a2,PA,3000,3000
1311,A4,iztv,1,IPSG,a4,PA,240,250
1411,B4,iztq,0,IPSG,b4,PA,230,250
file2:
TP,0,nttp,0.865556,0.866667
TP,1,ittp,50.7956,50.65
TP,1,iztv,5.42444,13.8467
TP,0,iztq,645.194,490.609
I want to merge these files so that, where file1's $3,$4 equals file2's $3,$2, a merged line is printed, like this:
TP,1211,A2,ittp,1,IPSG,a2,PA,3000,3000,0.865556,0.866667
TP,1311,A4,iztv,1,IPSG,a4,PA,240,250,50.7956,50.65
TP,1411,B4,iztq,0,IPSG,b4,PA,230,250,5.42444,13.8467
Both files are CSV files.
I tried using awk but I'm not getting the desired output. It's printing only file1.
$ awk -F, 'NR==FNR{a[$3,$4]=$3$2;next}{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 a[$1] }' OFS=, 1.csv 2.csv
awk -F, 'BEGIN {OFS=",";}
NR == FNR {a[$3,$4] = $0;}
NR != FNR && a[$3,$2] {print $1, a[$3,$2], $4, $5;}' 1.csv 2.csv
One way with awk:
awk 'NR==FNR{a[$4,$3]=$0;next}($2,$3) in a{print $1,a[$2,$3],$4,$5}' FS=, OFS=, f1 f2
TP,1211,A2,ittp,1,IPSG,a2,PA,3000,3000,50.7956,50.65
TP,1311,A4,iztv,1,IPSG,a4,PA,240,250,5.42444,13.8467
TP,1411,B4,iztq,0,IPSG,b4,PA,230,250,645.194,490.609
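To check the one-liner against the question's data, a self-contained run might look like the following; the temporary file names are made up. Rows of the second file with no partner in the first (nttp here) are skipped by the in test:

```shell
dir=$(mktemp -d)
cat > "$dir/f1" <<'EOF'
1211,A2,ittp,1,IPSG,a2,PA,3000,3000
1311,A4,iztv,1,IPSG,a4,PA,240,250
1411,B4,iztq,0,IPSG,b4,PA,230,250
EOF
cat > "$dir/f2" <<'EOF'
TP,0,nttp,0.865556,0.866667
TP,1,ittp,50.7956,50.65
TP,1,iztv,5.42444,13.8467
TP,0,iztq,645.194,490.609
EOF

merged=$(awk -F, -v OFS=, '
  NR == FNR { a[$4, $3] = $0; next }              # file 1: index the whole line by (flag, name)
  ($2, $3) in a { print $1, a[$2, $3], $4, $5 }   # file 2: print only rows with a partner
' "$dir/f1" "$dir/f2")
printf '%s\n' "$merged"
rm -rf "$dir"
```

Using in for the lookup, rather than testing a[$2,$3] directly, avoids creating empty array entries for unmatched keys.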
Using join
If i1.txt and i2.txt are the input files:
cat i1.txt | awk -F',' '{print $3 "-" $4 "," $1 "," $2 "," $5 "," $6 "," $7 "," $8 "," $9}' | sort > s1.txt
cat i2.txt | awk -F',' '{print $3 "-" $2 "," $1 "," $4 "," $5 }' | sort > s2.txt
join -t',' s1.txt s2.txt | tr '-' ',' > t12.txt
cat t12.txt | awk -F ',' '{print $10 "," $3 "," $4 "," $1 "," $2 "," $5 "," $6 "," $7 "," $8 "," $9 "," $11 "," $12 }'

Resources