How to use bash to split column by strings?

Given the tab delimited file with eight columns:
22 51244237 rs575160859 C T 100 PASS AC=19;AF=0.00379393;AN=5008;NS=2504;DP=13345;EAS_AF=0;AMR_AF=0.0043;AFR_AF=0;EUR_AF=0.0099;SAS_AF=0.0061;AA=.|||;VT=SNP
How can I use bash to create a new tab delimited file from information in the eighth column with the columns: AF; EAS_AF; AMR_AF; AFR_AF; EUR_AF; SAS_AF and the corresponding numeric value?
ie:
#AF EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF
0.00379393 0 0.0043 0 0.0099 0.0061
I understand I could split the eighth column by ";" (https://unix.stackexchange.com/questions/156919/splitting-a-column-using-awk) and then remove the unwanted text columns and text strings (i.e. "AF="), but is there a more efficient way to do this?
Thanks

Could you please try the following.
awk '
{
match($0,/AF[^;]*/)
af=substr($0,RSTART,RLENGTH)
match($0,/EAS_AF[^;]*/)
eas=substr($0,RSTART,RLENGTH)
match($0,/AMR_AF[^;]*/)
amr=substr($0,RSTART,RLENGTH)
match($0,/AFR_AF[^;]*/)
afr=substr($0,RSTART,RLENGTH)
match($0,/EUR_AF[^;]*/)
eur=substr($0,RSTART,RLENGTH)
match($0,/SAS_AF[^;]*/)
sas=substr($0,RSTART,RLENGTH)
VAL=af OFS ac OFS eas OFS amr OFS afr OFS eur OFS sas
split(VAL,array,"[= ]")
print array[1],array[4],array[6],array[8],array[10],array[12] ORS array[2],array[5],array[7],array[9],array[11],array[13]
}' Input_file | column -t
Explanation: the same code with comments.
awk '
{
match($0,/AF[^;]*/) ##Using match out of the box awk function for matching AF string till semi colon.
af=substr($0,RSTART,RLENGTH) ##creating variable named af whose value is substring of indexes of RSTART to till value of RLENGTH.
match($0,/EAS_AF[^;]*/) ##Using match out of the box awk function for matching EAS_AF string till semi colon.
eas=substr($0,RSTART,RLENGTH) ##creating variable named eas whose value is substring of indexes of RSTART to till value of RLENGTH.
match($0,/AMR_AF[^;]*/) ##Using match out of the box awk function for matching AMR_AF string till semi colon.
amr=substr($0,RSTART,RLENGTH) ##creating variable named amr whose value is substring of indexes of RSTART to till value of RLENGTH.
match($0,/AFR_AF[^;]*/) ##Using match out of the box awk function for matching AFR_AF string till semi colon.
afr=substr($0,RSTART,RLENGTH) ##creating variable named afr whose value is substring of indexes of RSTART to till value of RLENGTH.
match($0,/EUR_AF[^;]*/) ##Using match out of the box awk function for matching EUR_AF string till semi colon.
eur=substr($0,RSTART,RLENGTH) ##creating variable named eur whose value is substring of indexes of RSTART to till value of RLENGTH.
match($0,/SAS_AF[^;]*/) ##Using match out of the box awk function for matching SAS_AF string till semi colon.
sas=substr($0,RSTART,RLENGTH) ##creating variable named sas whose value is substring of indexes of RSTART to till value of RLENGTH.
VAL=af OFS ac OFS eas OFS amr OFS afr OFS eur OFS sas ##Creating variable VAL whose value is values of all above mentioned variables.
split(VAL,array,"[= ]") ##Using split function of awk to split it into array named array with delimiter space OR =.
print array[1],array[4],array[6],array[8],array[10],array[12] ORS array[2],array[5],array[7],array[9],array[11],array[13] ##Printing all array values as per OP.
af=ac=eas=amr=afr=eur=sas="" ##Nullifying all variables mentioned above.
}' Input_file | column -t ##Reading Input_file and piping the awk output to column, which aligns the columns with spaces (not tabs).
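One caveat with unanchored patterns such as /AF[^;]*/: match() returns the leftmost match, so the result depends on AF= appearing before EAS_AF= in the record. A minimal sketch that anchors each key on the preceding semicolon instead; the helper name info is hypothetical, and it assumes the keys contain no regex metacharacters:

```shell
# Sample record from the question, written with printf so the tabs are explicit.
printf '22\t51244237\trs575160859\tC\tT\t100\tPASS\tAC=19;AF=0.00379393;AN=5008;NS=2504;DP=13345;EAS_AF=0;AMR_AF=0.0043;AFR_AF=0;EUR_AF=0.0099;SAS_AF=0.0061;AA=.|||;VT=SNP\n' > Input_file

out=$(awk '
# info(): hypothetical helper; returns the value of key k in the INFO string s,
# or an empty string if the key is absent.
function info(s, k,    t) {
    t = ";" s ";"                          # pad so every key is preceded by ";"
    if (match(t, ";" k "=[^;]*"))
        return substr(t, RSTART + length(k) + 2, RLENGTH - length(k) - 2)
    return ""
}
BEGIN { FS = OFS = "\t" }
{
    print info($8, "AF"), info($8, "EAS_AF"), info($8, "AMR_AF"), info($8, "AFR_AF"), info($8, "EUR_AF"), info($8, "SAS_AF")
}' Input_file)
printf '%s\n' "$out"
```

Because the key is matched as ";AF=", it cannot match inside "EAS_AF=" regardless of the order of the entries.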

Split column by ";"
awk -F";" '$1=$1' OFS="\t" file.temp > tmp && mv tmp file.temp
Remove unwanted columns (new header: CHROM POS ID REF ALT QUAL FILTER AC AF EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF)
awk '{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $13, $14, $15, $16, $17}' file.temp > tmp && mv tmp file.temp
Remove unwanted strings
awk '{ gsub("SAS_AF=", "", $14); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("EUR_AF=", "", $13); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("AFR_AF=", "", $12); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("AMR_AF=", "", $11); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("EAS_AF=", "", $10); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("AF=", "", $9); print }' file.temp > tmp && mv tmp file.temp
awk '{ gsub("AC=", "", $8); print }' file.temp > tmp && mv tmp file.temp

This is how to really approach this task:
$ cat tst.awk
BEGIN {
FS=OFS="\t"
numFlds = split("AF EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF",fldNames,/ /)
printf "#"
for (i=1; i<=numFlds; i++) {
printf "%s%s", fldNames[i], (i<numFlds ? OFS : ORS)
}
}
{
nf = split($8,tmp,/[;=]/)
for (i=1; i<nf; i+=2) {
fldName = tmp[i]
fldVal = tmp[i+1]
name2val[fldName] = fldVal
}
for (i=1; i<=numFlds; i++) {
fldName = fldNames[i]
fldVal = name2val[fldName]
printf "%s%s", fldVal, (i<numFlds ? OFS : ORS)
}
}
$ awk -f tst.awk file
#AF EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF
0.00379393 0 0.0043 0 0.0099 0.0061
The alignment in the output only looks off because it's tab-separated as required.
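One thing to watch for with multi-record input: the script above keeps name2val across records, so a key missing from a later record would silently reuse the value from an earlier one. A small sketch of resetting the array per record; the two-record file and the "." placeholder for a missing key are assumptions made for illustration:

```shell
# Two made-up records; the second one has no EAS_AF entry.
printf '1\t100\trs1\tC\tT\t100\tPASS\tAF=0.1;EAS_AF=0.2\n1\t200\trs2\tG\tA\t100\tPASS\tAF=0.3\n' > two_records.tsv

out=$(awk '
BEGIN { FS = OFS = "\t" }
{
    split("", name2val)                    # portable way to empty an array
    nf = split($8, tmp, /[;=]/)
    for (i = 1; i < nf; i += 2)
        name2val[tmp[i]] = tmp[i+1]
    # "." for a missing key is an assumption of this sketch
    af  = ("AF" in name2val)     ? name2val["AF"]     : "."
    eas = ("EAS_AF" in name2val) ? name2val["EAS_AF"] : "."
    print af, eas
}' two_records.tsv)
printf '%s\n' "$out"
```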

Related

How to run a bash script in a loop

I wrote a bash script to pull substrings from two input files and save them to an output file. The input files look like this:
input file 1
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
input file 2
gene1 10 20
gene2 40 50
genen x y
my script
>output_file
cat input_file2 | while read row; do
echo $row > temp
geneName=`awk '{print $1}' temp`
startPos=`awk '{print $2}' temp`
endPos=`awk '{print $3}' temp`
length=$(expr $endPos - $startPos)
for i in temp; do
echo ">${geneName}" >> genes_fasta
awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' input_file1 >> output file
done
done
How can I make it work in a loop for more than one sequence in input file 1?
new input file looks like this:
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>genotype2
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
>genotypen...
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
I would like to have a different out file for every genotype and that the file name would be the genotype name.
thank you!

If I'm understanding correctly, would you try the following:
awk '
FNR==NR {
name[NR] = $1
start[NR] = $2
len[NR] = $3 - $2
count = NR
next
}
/^>/ {
sub(/^>/,"")
genotype=$0
next
}
{
for (i = 1; i <= count; i++) {
print ">" name[i] > genotype
print substr($0, start[i], len[i]) >> genotype
}
close(genotype)
}' input_file2 input_file1
input_file1:
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>genotype2
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
>genotype3
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Input_file2:
gene1 10 20
gene2 40 50
gene3 20 25
[Results]
genotype1:
>gene1
aaaaaaaaaa
>gene2
aaaaaaaaaa
>gene3
aaaaa
genotype2:
>gene1
bbbbbbbbbb
>gene2
bbbbbbbbbb
>gene3
bbbbb
genotype3:
>gene1
nnnnnnnnnn
>gene2
nnnnnnnnnn
>gene3
nnnnn
[EDIT]
If you want to store the output files to a different directory,
please try the following instead:
dir="./outdir" # directory name to store the output files
# you can modify the name as you want
mkdir -p "$dir"
awk -v dir="$dir" '
FNR==NR {
name[NR] = $1
start[NR] = $2
len[NR] = $3 - $2
count = NR
next
}
/^>/ {
sub(/^>/,"")
genotype=$0
next
}
{
for (i = 1; i <= count; i++) {
print ">" name[i] > dir"/"genotype
print substr($0, start[i], len[i]) >> dir"/"genotype
}
close(dir"/"genotype)
}' input_file2 input_file1
The 1st two lines are executed in bash to define and mkdir the destination directory.
Then the directory name is passed to awk via -v option
Hope this helps.

Could you please try the following? I am assuming that the header lines in Input_file1 (those starting with >) should be matched against the first column of Input_file2 (the samples are ambiguous, so this follows the OP's attempt).
awk '
FNR==NR{
start_point[$1]=$2
end_point[$1]=$3
next
}
/^>/{
sub(/^>/,"")
val=$0
next
}
{
print val ORS substr($0,start_point[val],end_point[val]-start_point[val])
val=""
}
' Input_file2 Input_file1
Explanation: Adding explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file named Input_file2 is being read.
start_point[$1]=$2 ##Creating an array named start_point with index $1 of current line and its value is $2.
end_point[$1]=$3 ##Creating an array named end_point with index $1 of current line and its value is $3.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if a line starts from > then do following.
sub(/^>/,"") ##Substituting starting > with NULL.
val=$0 ##Creating a variable val whose value is $0.
next ##next will skip all further statements from here.
}
{
print val ORS substr($0,start_point[val],end_point[val]-start_point[val]) ##Printing val, a newline (ORS) and the sub-string of the current line that starts at start_point[val] and whose length is end_point[val] minus start_point[val] (substr takes a length, not an end position).
val="" ##Nullifying variable val here.
}
' Input_file2 Input_file1 ##Mentioning Input_file names here.
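For reference, awk's substr(s, m, n) takes a start position and a length, not an end position, so a start/end pair from Input_file2 has to be converted. A small sketch with made-up values (the alphabet string and the positions 10 and 20 are only for illustration):

```shell
out=$(printf 'abcdefghijklmnopqrstuvwxyz\n' | awk -v start=10 -v end=20 '
{
    print substr($0, start, end - start)       # 10 characters starting at position 10
    print substr($0, start, end - start + 1)   # inclusive end: positions 10 through 20
}')
printf '%s\n' "$out"
```

The first form matches the OP's `expr $endPos - $startPos`; the second is the one to use if the end coordinate is meant to be inclusive.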

Match two files and print the matched strings based on the second file using awk

I have two files below named InputFile and Ref
InputFile
1234~code1=yyy:code2=fff:code3=vvv
1256~code2=ttt:code1=yyy:code4=zzz
4567~code4=uuu
8907~code8=ooo:code7=rrr
Ref
code2
code3
code8
code7
I have to match all the records in Ref against InputFile's second column (the file is ~ delimited, and the second column is split by colon (:)). If a record in Ref is found in InputFile, the script should print the value following the = sign; otherwise it should print nothing.
Desired output
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr
I'm about to load it to a table having the Ref records as the columns.
Here's my script so far:
awk '
BEGIN{
FS=OFS="~"
}
FNR==NR{
a[$0]
next
}
FNR==1 && FNR!=NR{
print
next
}
{
num=split($2,array,"[=:]")
for(i=1;i<=num;i+=2){
if(array[i] in a){
val=val?val OFS array[i+1]:array[i+1]
}
else{
val=val?val OFS "~":"~"
}
}
print $1,val
val=""
}
' Ref InputFile
It prints the array (code1,code2,etc) in InputFile that is present in Ref but it doesn't print in Ref's order.
Script's output
1234~~fff~vvv
1256~ttt
4567~
8907~ooo~rrr

something similar to yours
$ awk -F~ 'NR==FNR {c[NR]=$1; cs=NR; next}
{n=split($2,f,"[=:]");
delete k;
for(i=1;i<n;i+=2) k[f[i]]=f[i+1];
printf "%s", $1;
for(i=1;i<=cs;i++) printf "%s", FS k[c[i]];
print ""}' ref input
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr
Since you want to keep the order of the ref file, don't insert its records as keys of the array; instead store them as values indexed by their order number (here the line number). Otherwise you lose the order, which I think is the (only?) issue with your script.
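The point about ordering can be seen in isolation. A minimal sketch, using the Ref contents from the question: a `for (k in a)` loop visits keys in an unspecified order, whereas values stored under their line number can be read back with a counting loop.

```shell
printf 'code2\ncode3\ncode8\ncode7\n' > Ref

out=$(awk '
{ order[NR] = $0 }                 # value indexed by line number, so the order is recoverable
END {
    for (i = 1; i <= NR; i++)      # counting loop visits the entries in file order
        printf "%s%s", order[i], (i < NR ? " " : "\n")
}' Ref)
printf '%s\n' "$out"
```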

$ cat tst.awk
BEGIN {
FS = "[~:=]"
OFS = "~"
}
NR == FNR {
refs[++numRefs] = $0
next
}
{
delete ref2val
for (fldNr=2; fldNr<NF; fldNr+=2) {
ref2val[$fldNr] = $(fldNr+1)
}
printf "%s%s", $1, OFS
for (refNr=1; refNr<=numRefs; refNr++) {
ref = refs[refNr]
printf "%s%s", ref2val[ref], (refNr<numRefs ? OFS : ORS)
}
}
$ awk -f tst.awk refs file
1234~fff~vvv~~
1256~ttt~~~
4567~~~~
8907~~~ooo~rrr

Remove duplicate from csv using bash / awk

I have a csv file with the format :
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
I want to group by first column unique id's and concat types in a single row like this:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
I found awk does a great job in handling such scenarios. But all I could achieve is this:
"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"
I used this command:
awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
How can I remove the duplicates and also handle the formatting of the second column types?

quick fix:
$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
!seen[$0]++ will be true only if line was not already seen
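The idiom is easy to try on its own. In this sketch (the input letters are made up), seen[$0]++ returns the count before incrementing, so it is 0 the first time a line appears and the negation is true only then:

```shell
out=$(printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++')
printf '%s\n' "$out"
```

A pattern with no action prints the line, so only the first occurrence of each line is kept, in input order.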
If second column should all be within double quotes
$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
!seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"

With GNU awk for true multi-dimensional arrays and gensub() and sorted_in:
$ awk -F'|' '
{ a[$1][gensub(/"/,"","g",$2)] }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in a) {
c = 0
for (j in a[i]) {
printf "%s%s", (c++ ? ":" : i "|\""), j
}
print "\""
}
}
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
The output rows and columns will both be string-sorted (i.e. alphabetically by characters) in ascending order.

Short GNU datamash + tr solution:
datamash -st'|' -g1 unique 2 <file | tr ',' ':'
The output:
"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"
----------
If the double quotes between items should be removed, use the following alternative:
datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

For the sample input, the commands below will work, but the output is unsorted.
One-liner
# using two arrays (recommended)
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
# using regexp
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
Test Results:
$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
Better Readable:
Using regexp
awk 'BEGIN{
FS=OFS="|"
}
{
a[$1] =$1 in a ?(a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2):$2
}
END{
for(i in a)
print i,a[i]
}
' infile
Using two arrays
awk 'BEGIN{
FS=OFS="|"
}
!seen[$1,$2]++{
a[$1] = ($1 in a ? a[$1] ":" : "") $2
}
END{
for(i in a)
print i,a[i]
}' infile
Note: you can also use !seen[$0]++, which uses the entire line as the index. If your real data
has more columns and only some of them should define a duplicate, prefer !seen[$1,$2]++,
where column 1 and column 2 together form the index.

awk + sort solution:
awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
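Since `for (i in a)` does not guarantee any particular order, the row order of the array-based variants can differ between awk implementations. A sketch of a streaming alternative that relies only on the input being sorted and therefore prints the groups in sorted order; the variable name prev is an assumption of this example:

```shell
cat > file <<'EOF'
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
EOF

out=$(LC_ALL=C sort -u file | awk -F'|' '
{
    gsub(/"/, "", $2)                  # strip the quotes around the type
    if ($1 != prev) {                  # first row of a new id
        if (NR > 1) print "\""         # close the previous group
        printf "%s|\"%s", $1, $2
        prev = $1
    } else {
        printf ":%s", $2               # same id: append the next type
    }
}
END { if (NR > 0) print "\"" }')
printf '%s\n' "$out"
```

No array is needed because sort -u has already removed duplicate lines and placed rows with the same id next to each other.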

print multiple fields if multiple pattern matches

I have a comma delimited file like below
0,category=a,type=b,value=1
1,category=c,type=b,.....,original_value=0
2,category=b,type=c,....,original_value=1,....,corrected_value=3
A line in the file can contain
(1)only 'value'
(2)only 'original_value'
(3)both 'original_value' and 'corrected_value'
The values can be in any column.
The following awk command I wrote can only print one field after pattern match.
cat file | awk -F, 'BEGIN{OFS=","} /value/ { for (x=1;x<=NF;x++) if ($x~"value") {print $2,$3,$(x)} }' | sort -u
Current Output:
category=a,type=b,value=1
category=b,type=c,corrected_value=3
category=b,type=c,original_value=1
category=c,type=b,original_value=0
How do I print two fields (columns) of a line if two pattern matches occur? In this case, if both original_value and corrected_value exist.
Expected Output:
category=a,type=b,value=1
category=b,type=c,original_value=1,corrected_value=3
category=c,type=b,original_value=0
Bash Version: 4.3.11

You can use this awk command:
awk 'BEGIN{FS=OFS=","} {printf "%s%s%s", $2,OFS,$3; for(i=4; i<=NF; i++)
if ($i ~ /value/) printf "%s%s", OFS,$i; print ""}' file
category=a,type=b,value=1
category=c,type=b,original_value=0
category=b,type=c,original_value=1,corrected_value=3

Similar to @anubhava's answer, but does not rely on the category or type being in a particular column:
awk -F, '
BEGIN { pattern = "^(category|type|value|original_value|corrected_value)" }
{
sep = ""
for (i=1; i<=NF; i++) {
if ($i ~ pattern) {
printf "%s%s", sep, $i
sep = ","
}
}
print ""
}
' file

Unix AWK field separator finding sum of one field group by other

I am using the awk command below, which returns each unique value of field $11 and the number of times it occurs in the file, separated by a comma. Along with that, I also need the sum of field $14 (the last value) in the output. Please help me with it.
sample string in file
EXSTAT|BNK|2014|11|05|15|29|46|23169|E582754245|QABD|S|000|351
$14 is last value 351
bash-3.2$ grep 'EXSTAT|' abc.log|grep '|S|' |
awk -F"|" '{ a[$11]++ } END { for (b in a) { print b"," a[b] ; } }'
QDER,3
QCOL,1
QASM,36
QBEND,23
QAST,3
QGLBE,30
QCD,30
TBENO,1
QABD,9
QABE,5
QDCD,5
TESUB,1
QFDE,12
QCPA,3
QADT,80
QLSMR,6
bash-3.2$ grep 'EXSTAT|' abc.log
EXSTAT|BNK|2014|11|05|15|29|03|23146|E582754222|QGLBE|S|000|424
EXSTAT|BNK|2014|11|05|15|29|05|23147|E582754223|QCD|S|000|373
EXSTAT|BNK|2014|11|05|15|29|12|23148|E582754224|QASM|S|000|1592
EXSTAT|BNK|2014|11|05|15|29|13|23149|E582754225|QADT|S|000|660
EXSTAT|BNK|2014|11|05|15|29|14|23150|E582754226|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|14|23151|E582754227|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|15|23152|E582754228|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|15|23153|E582754229|QADT|S|000|258
EXSTAT|BNK|2014|11|05|15|29|17|23154|E582754230|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|18|23155|E582754231|QADT|S|000|263
EXSTAT|BNK|2014|11|05|15|29|18|23156|E582754232|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|19|23157|E582754233|QADT|S|000|270
EXSTAT|BNK|2014|11|05|15|29|19|23158|E582754234|QADT|S|000|264
EXSTAT|BNK|2014|11|05|15|29|20|23159|E582754235|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|20|23160|E582754236|QADT|S|000|241
EXSTAT|BNK|2014|11|05|15|29|21|23161|E582754237|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|21|23162|E582754238|QADT|S|000|229
EXSTAT|BNK|2014|11|05|15|29|22|23163|E582754239|QADT|S|000|234
EXSTAT|BNK|2014|11|05|15|29|22|23164|E582754240|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|23|23165|E582754241|QADT|S|000|254
EXSTAT|BNK|2014|11|05|15|29|23|23166|E582754242|QADT|S|000|402
EXSTAT|BNK|2014|11|05|15|29|24|23167|E582754243|QADT|S|000|223
EXSTAT|BNK|2014|11|05|15|29|24|23168|E582754244|QADT|S|000|226

Just add another associative array:
awk -F"|" '{a[$11]++;c[$11]+=$14}END{for(b in a){print b"," a[b]","c[b]}}'
tested below:
> cat temp
EXSTAT|BNK|2014|11|05|15|29|03|23146|E582754222|QGLBE|S|000|424
EXSTAT|BNK|2014|11|05|15|29|05|23147|E582754223|QCD|S|000|373
EXSTAT|BNK|2014|11|05|15|29|12|23148|E582754224|QASM|S|000|1592
EXSTAT|BNK|2014|11|05|15|29|13|23149|E582754225|QADT|S|000|660
EXSTAT|BNK|2014|11|05|15|29|14|23150|E582754226|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|14|23151|E582754227|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|15|23152|E582754228|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|15|23153|E582754229|QADT|S|000|258
EXSTAT|BNK|2014|11|05|15|29|17|23154|E582754230|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|18|23155|E582754231|QADT|S|000|263
EXSTAT|BNK|2014|11|05|15|29|18|23156|E582754232|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|19|23157|E582754233|QADT|S|000|270
EXSTAT|BNK|2014|11|05|15|29|19|23158|E582754234|QADT|S|000|264
EXSTAT|BNK|2014|11|05|15|29|20|23159|E582754235|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|20|23160|E582754236|QADT|S|000|241
EXSTAT|BNK|2014|11|05|15|29|21|23161|E582754237|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|21|23162|E582754238|QADT|S|000|229
EXSTAT|BNK|2014|11|05|15|29|22|23163|E582754239|QADT|S|000|234
EXSTAT|BNK|2014|11|05|15|29|22|23164|E582754240|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|23|23165|E582754241|QADT|S|000|254
EXSTAT|BNK|2014|11|05|15|29|23|23166|E582754242|QADT|S|000|402
EXSTAT|BNK|2014|11|05|15|29|24|23167|E582754243|QADT|S|000|223
EXSTAT|BNK|2014|11|05|15|29|24|23168|E582754244|QADT|S|000|226
> awk -F"|" '{a[$11]++;c[$11]+=$14}END{for(b in a){print b"," a[b]","c[b]}}' temp
QGLBE,1,424
QADT,20,5510
QASM,1,1592
QCD,1,373

You need not use grep to filter the file for EXSTAT; awk can do that as well.
For example:
awk 'BEGIN{FS="|"; OFS=","} $1=="EXSTAT" && $12=="S" {sum[$11]+=$14; count[$11]++}END{for (i in sum) print i,count[i],sum[i]}' abc.log
for the input file abc.log with contents
EXSTAT|BNK|2014|11|05|15|29|03|23146|E582754222|QGLBE|S|000|424
EXSTAT|BNK|2014|11|05|15|29|05|23147|E582754223|QCD|S|000|373
EXSTAT|BNK|2014|11|05|15|29|12|23148|E582754224|QASM|S|000|1592
EXSTAT|BNK|2014|11|05|15|29|13|23149|E582754225|QADT|S|000|660
EXSTAT|BNK|2014|11|05|15|29|14|23150|E582754226|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|14|23151|E582754227|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|15|23152|E582754228|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|15|23153|E582754229|QADT|S|000|258
EXSTAT|BNK|2014|11|05|15|29|17|23154|E582754230|QADT|S|000|261
EXSTAT|BNK|2014|11|05|15|29|18|23155|E582754231|QADT|S|000|263
EXSTAT|BNK|2014|11|05|15|29|18|23156|E582754232|QADT|S|000|250
EXSTAT|BNK|2014|11|05|15|29|19|23157|E582754233|QADT|S|000|270
EXSTAT|BNK|2014|11|05|15|29|19|23158|E582754234|QADT|S|000|264
EXSTAT|BNK|2014|11|05|15|29|20|23159|E582754235|QADT|S|000|245
EXSTAT|BNK|2014|11|05|15|29|20|23160|E582754236|QADT|S|000|241
EXSTAT|BNK|2014|11|05|15|29|21|23161|E582754237|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|21|23162|E582754238|QADT|S|000|229
EXSTAT|BNK|2014|11|05|15|29|22|23163|E582754239|QADT|S|000|234
EXSTAT|BNK|2014|11|05|15|29|22|23164|E582754240|QADT|S|000|237
EXSTAT|BNK|2014|11|05|15|29|23|23165|E582754241|QADT|S|000|254
EXSTAT|BNK|2014|11|05|15|29|23|23166|E582754242|QADT|S|000|402
EXSTAT|BNK|2014|11|05|15|29|24|23167|E582754243|QADT|S|000|223
EXSTAT|BNK|2014|11|05|15|29|24|23168|E582754244|QADT|S|000|226
it will give an output as
QASM,1,1592
QGLBE,1,424
QADT,20,5510
QCD,1,373
What does it do?
BEGIN{FS="|"; OFS=","} is executed before the input file is processed. It sets FS, the input field separator, to | and OFS, the output field separator, to ,
$1=="EXSTAT" && $12=="S" {sum[$11]+=$14; count[$11]++} is the action applied to each matching line
$1=="EXSTAT" && $12=="S" checks whether the first field is EXSTAT and the 12th field is S (the strings must be quoted; unquoted, EXSTAT and S are uninitialized awk variables and the test matches every line)
sum[$11]+=$14 accumulates field $14 in the array sum, indexed by $11
count[$11]++ counts the lines in the array count, indexed by $11
END{for (i in sum) print i,count[i],sum[i]} is executed at the end of the file and prints the contents of the arrays

You can use a second array.
awk -F"|" '/EXSTAT\|/&&/\|S\|/{a[$11]++}/EXSTAT\|/{s[$11]+=$14}\
END{for(b in a)print b","a[b]","s[b];}' abc.log
Explanation
/EXSTAT\|/&&/\|S\|/{a[$11]++} on lines that contain both EXSTAT| and |S|, increment a[$11].
/EXSTAT\|/ on lines containing EXSTAT| add $14 to s[$11]
END{for(b in a)print b","a[b]","s[b];} print out all keys in array a, values of array a, and values of array s, separated by commas.

#!awk -f
BEGIN {
FS = "|"
}
$1 == "EXSTAT" && $12 == "S" {
foo[$11] += $14
}
END {
for (bar in foo)
printf "%s,%s\n", bar, foo[bar]
}
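The question asks for the count as well as the sum. A minimal sketch that reports both, using exact string comparisons and four of the sample lines from the question; piping through sort is an addition here to make the row order predictable:

```shell
cat > abc.log <<'EOF'
EXSTAT|BNK|2014|11|05|15|29|03|23146|E582754222|QGLBE|S|000|424
EXSTAT|BNK|2014|11|05|15|29|05|23147|E582754223|QCD|S|000|373
EXSTAT|BNK|2014|11|05|15|29|13|23149|E582754225|QADT|S|000|660
EXSTAT|BNK|2014|11|05|15|29|14|23150|E582754226|QADT|S|000|261
EOF

out=$(awk -F'|' '
$1 == "EXSTAT" && $12 == "S" {         # exact matches on field 1 and field 12
    count[$11]++                       # occurrences per value of field 11
    sum[$11] += $14                    # running total of field 14 per value of field 11
}
END {
    for (key in count)
        printf "%s,%d,%d\n", key, count[key], sum[key]
}' abc.log | LC_ALL=C sort)
printf '%s\n' "$out"
```

Each output row is key,count,sum, for example QADT,2,921 for the two QADT lines above.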
