File into table awk - shell

I am trying to build a table by reading a file.
Here is an example of the file I am trying to process:
FHEAD|1|PRMPC|20200216020532|1037|S
TMBPE|2|MOD
TPDTL|3|72810|1995019|11049-|11049-|Dcto 20|0|5226468|20200216000001|20200222235959|2||1||||
TPGRP|4|5403307
TGLIST|5|5031472|1|||
TLITM|6|101055590
TPDSC|7|0|||-20||2|1|
TPGRP|8|5403308
TGLIST|9|5031473|0|||
TPDTL|13|10728|1995021|11049-|11049-|Dcto 30|0|5226469|20200216000001|20200222235959|2||1||||
TPGRP|14|5403310
TGLIST|15|5031475|1|||
TLITM|16|210000041
TLITM|17|101004522
TPDSC|113|0|||-30||2|1|
TPGRP|114|5403309
TGLIST|115|5031474|0|||
TLITM|116|101047933
TLITM|117|101004681
TLITM|118|101028161
TPDSC|119|0|||-25||2|1|
TPISR|214|101004225|2350|EA|20200216000000|COP|
TTAIL|1135
FTAIL|1136|1134
I tried to develop the code below, but it returns all of the values on one line:
for filename in "$input"*.dat; do
    echo "$filename" >> "$files"
    a=$(awk -F'|' '$1=="FHEAD" && $5!=""{print $5}' "$filename")
    b=$(awk -F'|' '$1=="TPDTL" && $3!=""{print $3}' "$filename")
    c=$(awk -F'|' '$1=="TPDTL" && $4!=""{print $4}' "$filename")
    d=$(awk -F'|' '$1=="TPDTL" && $10!=""{print $10}' "$filename")
    e=$(awk -F'|' '$1=="TPDTL" && $11!=""{print $11}' "$filename")
    f=$(awk -F'|' '$1=="TPDSC" && $6!=""{print $6}' "$filename")
    g=$(awk -F'|' '$1=="TLITM" && $3!=""{print $3}' "$filename")
done
For example:
echo -e ${d}
20200216000001 20200216000001
I wanted something like the table in the picture.
Can someone help me?
Thanks in advance.

Assuming:
The keywords such as FHEAD, TPDTL, etc. do not appear with uniform frequency; the most recent value is used when needed.
The number of output rows equals the count of TLITM records.
The table is emitted each time TPDSC appears.
then would you please try the following:
awk 'BEGIN {FS = "|"; OFS = ","}
     $1 ~ /FHEAD/ {a = $5}
     $1 ~ /TPDTL/ {b = $3; c = $4; d = $10; e = $11}
     $1 ~ /TLITM/ {f[++tlitm_count] = $3}
     $1 ~ /TPDSC/ {
         g = $6
         for (i = 1; i <= tlitm_count; i++)
             print a, b, c, d, e, f[i], g
         tlitm_count = 0
     }
' *.dat
Output:
1037,72810,1995019,20200216000001,20200222235959,101055590,-20
1037,10728,1995021,20200216000001,20200222235959,210000041,-30
1037,10728,1995021,20200216000001,20200222235959,101004522,-30
1037,10728,1995021,20200216000001,20200222235959,101047933,-25
1037,10728,1995021,20200216000001,20200222235959,101004681,-25
1037,10728,1995021,20200216000001,20200222235959,101028161,-25
If you want the output delimiter to be a whitespace, please modify the value of OFS.
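If you also want to keep the original per-file loop, a minimal sketch that writes one CSV next to each input file (the output naming is an assumption):
for filename in "$input"*.dat; do
    awk 'BEGIN {FS = "|"; OFS = ","}
         $1 ~ /FHEAD/ {a = $5}
         $1 ~ /TPDTL/ {b = $3; c = $4; d = $10; e = $11}
         $1 ~ /TLITM/ {f[++tlitm_count] = $3}
         $1 ~ /TPDSC/ {
             g = $6
             for (i = 1; i <= tlitm_count; i++)
                 print a, b, c, d, e, f[i], g
             tlitm_count = 0
         }' "$filename" > "${filename%.dat}.csv"
done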

Related

Remove duplicate from csv using bash / awk

I have a csv file with the format:
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
I want to group by the unique ids in the first column and concatenate the types into a single row, like this:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
I found that awk does a great job handling such scenarios, but all I could achieve is this:
"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"
I used this command:
awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
How can I remove the duplicates and also handle the formatting of the second column types?
quick fix:
$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
!seen[$0]++ will be true only if the line was not already seen.
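For instance, the idiom on its own deduplicates whole lines:
$ printf 'a\na\nb\n' | awk '!seen[$0]++'
a
b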
If the second column should be entirely within double quotes:
$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
!seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"
With GNU awk for true multi-dimensional arrays and gensub() and sorted_in:
$ awk -F'|' '
{ a[$1][gensub(/"/,"","g",$2)] }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in a) {
c = 0
for (j in a[i]) {
printf "%s%s", (c++ ? ":" : i "|\""), j
}
print "\""
}
}
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
The output rows and columns will both be string-sorted (i.e. alphabetically by characters) in ascending order.
Short GNU datamash + tr solution:
datamash -st'|' -g1 unique 2 <file | tr ',' ':'
The output:
"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"
----------
If the between-item double quotes should be eliminated, use the following alternative:
datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
For the sample input below, the following will work, but the output is unsorted.
One-liner
# using two arrays (recommended)
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
# using regexp
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
Test Results:
$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
More readable:
Using regexp
awk 'BEGIN {
    FS = OFS = "|"
}
{
    a[$1] = $1 in a ? (a[$1] ~ ("(^|:)" $2 "(:|$)") ? a[$1] : a[$1] ":" $2) : $2
}
END {
    for (i in a)
        print i, a[i]
}' infile
Using two arrays
awk 'BEGIN {
    FS = OFS = "|"
}
!seen[$1, $2]++ {
    a[$1] = ($1 in a ? a[$1] ":" : "") $2
}
END {
    for (i in a)
        print i, a[i]
}' infile
Note: you can also use !seen[$0]++, which uses the entire line as the index. If, in your real data, you prefer to key on specific columns, use !seen[$1,$2]++; here columns 1 and 2 form the index.
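For example, if the real data had a third column (hypothetical here), !seen[$1,$2]++ would treat these two lines as duplicates, while !seen[$0]++ would keep both:
$ printf '"id-1"|"A"|"x"\n"id-1"|"A"|"y"\n' | awk -F'|' '!seen[$1,$2]++'
"id-1"|"A"|"x"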
awk + sort solution:
awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

AWK - using element on next record GETLINE?

I have a problem with this basic data:
DP;DG
67;
;10
;14
;14
;18
;18
;22
;65
68;
;0
;9
;25
;25
70;
that I'd like to transform on this kind of output:
DP;DG
67;
;10
;14
;14
;18
;18
;22
;65;x
68;
;0
;9
;25
;25;x
70;
The "x" value comes if on the next line $1 exists or if $2 is null. From my understanding, I've to use getline but I don't get the way!
I've tried the following code:
#!/bin/bash
file2=tmp.csv
file3=fin.csv
awk 'BEGIN {FS=OFS=";"}
{
print $0;
getline;
if($2="") {print $0";x"}
else {print $0}
}' $file2 > $file3
It seemed easy. I won't show the result; it was totally different from what I expected.
Any clue? Is getline even necessary for this problem?
OK, I continued testing some code:
#!/bin/bash
file2=tmp.csv
file3=fin.csv
awk 'BEGIN {FS=OFS=";"}
{
getline var
if (var ~ /.*;$/) {
print $0";x";
print var;
}
else {
print $0;
print var;
}
}' $file2 > $file3
It's getting better, but still, not all the lines that should be marked are... I don't get why.
An alternative one-pass version (each record is printed without its newline; while reading the next record, awk decides whether to append ";x" before terminating the previous line):
$ awk -F\; 'NR>1 {printf "%s\n", (f && $1 != "" ? ";x" : "")}
            {f = ($1 == ""); printf "%s", $0}
            END {print ""}' file
give this one-liner a try:
awk -F';' 'NR==FNR{if($1>0||!$2)a[NR-1];next}FNR in a{$0=$0";x"}7' file file
or
awk -F';' 'NR==FNR{if($1~/\S/||$2).....

Subtract single largest number from multiple specific columns in awk

I have a comma delimited file that looks like
R,F,TE,K,G,R
1,0,12,f,1,18
2,1,17,t, ,17
3,1, , ,1,
4,0,15, ,0,16
There are some items missing; also, the first row is a header, which I want to ignore. I want to compute the second smallest number in specific columns and subtract it from every element in that column, unless the element is the column's minimum value. In this example, I want to subtract the second minimum value from columns 3 and 6. My final values would then be:
R,F,TE,K,G,R
1,0,12,f,1,1
2,1, 2,t, ,0
3,1, , ,0,
4,0, 0, ,0,16
I tried working on single columns individually, hand-coding a threshold so that the second smallest is picked:
awk 'BEGIN {FS = OFS = ","}
     {
         min = 1000000
         if ($3 < min && $3 != "" && $3 > 12) min = $3
         if ($3 > 0) $3 = $3 - min + 1
         print
     }
     END {print min}
' try1.txt
It finds the min alright but the output is not as expected. There should be an easier way in awk.
I'd loop over the file twice (pass the file to awk twice and use NR == FNR to detect the first pass): once to find the minima, once to adjust the values. It's a trade-off of time versus memory.
awk -F, -v OFS=, '
NR == 1 {min3 = $3; min6 = $6}
NR == FNR {if ($3 < min3) min3 = $3; if ($6 < min6) min6 = $6; next}
$3 != min3 {$3 -= min3}
$6 != min6 {$6 -= min6}
{print}
' try1.txt try1.txt
For prettier output:
awk -F, -v OFS=, '
NR == 1 {min3 = $3; min6 = $6; next}
NR == FNR {if ($3 < min3) min3 = $3; if ($6 < min6) min6 = $6; next}
FNR == 1 {len3 = length("" min3); len6 = length("" min6)}
$3 != min3 {$3 = sprintf("%*d", len3, $3-min3)}
$6 != min6 {$6 = sprintf("%*d", len6, $6-min6)}
{print}
' try1.txt try1.txt
Given the new requirement (subtract the second smallest value, skipping blanks), compute the two thresholds first and pass them to awk:
min2_3=$(cut -d, -f3 try1.txt | tail -n +2 | sort -n | grep -v '^ *$' | sed -n '2p')
min2_6=$(cut -d, -f6 try1.txt | tail -n +2 | sort -n | grep -v '^ *$' | sed -n '2p')
awk -F, -v OFS=, -v min2_3=$min2_3 -v min2_6=$min2_6 '
NR==1 {print; next}
$3 !~ /^ *$/ && $3 >= min2_3 {$3 -= min2_3}
$6 !~ /^ *$/ && $6 >= min2_6 {$6 -= min2_6}
{print}
' try1.txt
R,F,TE,K,G,R
1,0,12,f,1,1
2,1,2,t, ,0
3,1, , ,1,
4,0,0, ,0,16
# needs GNU awk for asort()
BEGIN {
    FS = OFS = ","
}
{
    if (NR == 1) { print; next }
    if (+$3) a[NR] = $3   # collect non-empty numeric values of column 3
    if (+$6) b[NR] = $6   # ... and of column 6
    s[NR] = $0            # remember each line
}
END {
    asort(a, c)           # c[1] = minimum, c[2] = second minimum
    asort(b, d)
    for (i = 2; i <= NR; i++) {
        split(s[i], t)
        if (t[3] != c[1] && +t[3] != 0) t[3] = t[3] - c[2]
        if (t[6] != d[1] && +t[6] != 0) t[6] = t[6] - d[2]
        print t[1], t[2], t[3], t[4], t[5], t[6]
    }
}
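The above is a program body rather than a one-liner; assuming it is saved as sub2.awk (an illustrative name), it would be run with:
gawk -f sub2.awk try1.txt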

Putting awk in a function

I have 4 awk invocations that are really similar, so I want to put them in a function. My awk code is:
awk -v MYPATH="$MYPATH" -v FILE_EXT="$FILE_EXT" -v NAME_OF_FILE="$NAME_OF_FILE" -v DATE="$DATE" -v pattern="$STORED_PROCS_BEGIN" '
$0 ~ pattern {
rec = $1 OFS $2 OFS $4 OFS $7
for (i=9; i<=NF; i++) {
rec = rec OFS $i
if ($i ~ /\([01]\)/) {
break
}
}
print rec >> "'$MYPATH''$NAME_OF_FILE''$DATE'.'$FILE_EXT'"
}
' "$FILE_LOCATION"
So only the pattern and the regular expression differ. How can I put this awk in a function where I replace the pattern with $1 and /\([01]\)/ with $2, given that I already use $1 and $2 as fields inside my awk?
EDIT:
I was thinking I could do...
printFmt() {
    awk -v pattern="$1" -v search="$2" '
        $0 ~ pattern {
            rec = $1 OFS $2 OFS $4 OFS $7
            for (i = 9; i <= NF; i++) {
                rec = rec OFS $i
                if ($i ~ search) break
            }
            print rec
        }' "$FILE_LOCATION"
}
and then call printFmt with the pattern and search set?
Not sure where the problem is, since you already have exactly what you need in your code, but maybe this will help by simplifying it a bit:
$ cat tst.sh
function prtStuff() {
awk -v x="$1" 'BEGIN{ print x }'
}
prtStuff "foo"
prtStuff "---"
prtStuff "bar"
$ ./tst.sh
foo
---
bar
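With the function sketched in the edit above, the calls would then look like this (the second argument mirrors the question's /\([01]\)/; both values are passed as dynamic regexps, which is why the parentheses are double-escaped):
printFmt "$STORED_PROCS_BEGIN" '\\([01]\\)'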

Creating an array with awk and passing it to a second awk operation

I have a column file, and I want to print all the lines that do not contain the string SOL, plus the lines that do contain SOL but whose 5th column is <1.2 or >4.8.
The file is structured as: MOLECULENAME ATOMNAME X Y Z
Example:
151SOL OW 6554 5.160 2.323 4.956
151SOL HW1 6555 5.188 2.254 4.690 ----> as you can see, this atom is out of the threshold, but it needs to be printed
151SOL HW2 6556 5.115 2.279 5.034
What I thought is to save a vector with all the MOLECULENAMEs that I want, and then tell awk to match all the MOLECULENAMEs saved in vector "a" against the file, and print the complete output. (If I only do the first awk, I end up having bad atom linkage near the threshold.)
The problem is that I have to pass the vector from the first awk to the second... I tried it like this with a[], but of course it doesn't work.
How can I do this?
Here is the code I have so far:
a[] = (awk 'BEGIN{i=0} $1 !~ /SOL/{a[i]=$1;i++}; /SOL/ && $5 > 4.8 {a[i]=$1;i++};/SOL/ &&$5<1.2 {a[i]=$1;i++}')
awk -v a="$a[$i]" 'BEGIN{i=0} $1 ~ $a[i] {if (NR>6540) {for (j=0;j<3;j++) {print $0}} else {print $0}
You can put all lines of the same molecule name on one row by using sort on the file and then running this AWK, which uses printf to keep printing on the same line until a different molecule name is found; then a new line starts. The second AWK script detects which molecule names have 3 valid lines in the original file. I hope this can help you solve your problem.
sort your_file | awk 'BEGIN{ molname=""; } ( $0 !~ "SOL" || ( $0 ~ "SOL" && ( $5<1.2 || $5>4.8 ) ) ){ if($1!=molname){printf("\n");molname=$1}for(i=1;i<=NF;i++){printf("%s ",$i);}}' | awk 'NF>12 {print $0}'
awk '!/SOL/ || $5 < 1.2 || $5 > 4.8' inputfile.txt
Print (default behaviour) lines where:
"SOL" is not found
SOL is found and fifth column < 1.2
SOL is found and fifth column > 4.8
SOLVED! Thanks to all; here is how I solved it.
#!/bin/bash
file=$1
awk 'BEGIN {molecola = ""; i = 0; j = 1}
{
    if ($1 !~ /SOL/) { print $0 }
    else if ($1 != molecola && $1 ~ /SOL/) {
        for (j in arr_comp) {
            if (arr_comp[j] < 1.2 || arr_comp[j] > 5) {
                for (j in arr_comp) { print arr_mol[j] }
                break
            }
        }
        delete arr_comp
        delete arr_mol
        arr_mol[0] = $0
        arr_comp[0] = $5
        molecola = $1
        j = 1
    }
    else { arr_mol[j] = $0; arr_comp[j] = $5; j++ }
}' "$file"
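For reference, the same molecule-level filtering can also be done in one awk program with two passes over the file; a minimal sketch, assuming the molecule name is in $1 and the tested coordinate in $5, as in the question:
awk 'NR == FNR { if ($1 ~ /SOL/ && ($5 < 1.2 || $5 > 4.8)) flag[$1]; next }
     !/SOL/ || ($1 in flag)' "$file" "$file"
The first pass records every SOL molecule with at least one atom outside the range; the second pass prints all non-SOL lines plus every line of the flagged molecules.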