Subtract second smallest number from multiple specific columns in awk - bash

I have a comma delimited file that looks like
R,F,TE,K,G,R
1,0,12,f,1,18
2,1,17,t, ,17
3,1, , ,1,
4,0,15, ,0,16
Some items are missing, and the first row is a header, which I want to ignore. I want to find the second smallest number in specific columns and subtract it from every element in that column, except where the element is the column's minimum. In this example, the columns are 3 and 6. So, my final values would be:
R,F,TE,K,G,R
1,0,12,f,1,1
2,1, 2,t, ,0
3,1, , ,1,
4,0, 0, ,0,16
I tried handling single columns individually, hand-coding a threshold so that the value found is the second smallest:
awk 'BEGIN {FS=OFS=",";
};
{ min=1000000;
if($3<min && $3 != "" && $3>12) min = $3;
if($3>0) $3 = $3-min+1;
print}
END{print min}
' try1.txt
It finds the min alright but the output is not as expected. There should be an easier way in awk.

I'd loop over the file twice, once to find the minima, once to adjust the values. It's a trade-off of time versus memory.
awk -F, -v OFS=, '
NR == 1 {min3 = $3; min6 = $6}
NR == FNR {if ($3 < min3) min3 = $3; if ($6 < min6) min6 = $6; next}
$3 != min3 {$3 -= min3}
$6 != min6 {$6 -= min6}
{print}
' try1.txt try1.txt
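As an aside, here is a one-pass sketch of the other side of that trade-off: buffer the whole file in memory and adjust everything in END. The blank-field guards are my addition, an assumption about the missing items:
awk -F, -v OFS=, '
{ line[NR] = $0 }                        # buffer every line
NR > 1 && $3 ~ /[0-9]/ { if (min3 == "" || $3+0 < min3) min3 = $3+0 }
NR > 1 && $6 ~ /[0-9]/ { if (min6 == "" || $6+0 < min6) min6 = $6+0 }
END {
  for (i = 1; i <= NR; i++) {
    $0 = line[i]                         # re-split with FS=","
    if (i > 1) {                         # leave the header alone
      if ($3 ~ /[0-9]/ && $3+0 != min3) $3 -= min3
      if ($6 ~ /[0-9]/ && $6+0 != min6) $6 -= min6
    }
    print
  }
}' try1.txt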
For prettier output:
awk -F, -v OFS=, '
NR == 1 {min3 = $3; min6 = $6; next}
NR == FNR {if ($3 < min3) min3 = $3; if ($6 < min6) min6 = $6; next}
FNR == 1 {len3 = length("" min3); len6 = length("" min6)}
$3 != min3 {$3 = sprintf("%*d", len3, $3-min3)}
$6 != min6 {$6 = sprintf("%*d", len6, $6-min6)}
{print}
' try1.txt try1.txt
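The %*d format takes the field width from the preceding argument, which is what right-aligns the adjusted values to the width of the minimum. A quick illustration (gawk and most modern awks support the * width):
$ awk 'BEGIN{printf "[%*d]\n", 5, 42}'
[   42]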
Given the new requirements:
min2_3=$(cut -d, -f3 try1.txt | tail -n +2 | sort -n | grep -v '^ *$' | sed -n '2p')
min2_6=$(cut -d, -f6 try1.txt | tail -n +2 | sort -n | grep -v '^ *$' | sed -n '2p')
awk -F, -v OFS=, -v min2_3=$min2_3 -v min2_6=$min2_6 '
NR==1 {print; next}
$3 !~ /^ *$/ && $3 >= min2_3 {$3 -= min2_3}
$6 !~ /^ *$/ && $6 >= min2_6 {$6 -= min2_6}
{print}
' try1.txt
R,F,TE,K,G,R
1,0,12,f,1,1
2,1,2,t, ,0
3,1, , ,1,
4,0,0, ,0,16
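For the sample file, the helper pipelines evaluate to min2_3=15 and min2_6=17, which is why 12 and 16 pass through untouched. Tracing column 3, the blank entry is dropped by the grep before sed picks the second line:
$ cut -d, -f3 try1.txt | tail -n +2 | sort -n | grep -v '^ *$'
12
15
17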

Another answer: with GNU awk, asort() can sort the collected column values, so c[1]/d[1] hold the minima and c[2]/d[2] the second minima:
awk 'BEGIN{
FS=OFS=","
}
{
if(NR==1){print;next}      # pass the header through
if(+$3)a[NR]=$3            # collect non-blank numeric values of column 3
if(+$6)b[NR]=$6            # ... and of column 6
s[NR]=$0                   # buffer the whole line
}
END{
asort(a,c)                 # GNU awk: sorted values end up in c[1..n]
asort(b,d)
for(i=2;i<=NR;i++){
split(s[i],t)              # re-split the buffered line on FS (",")
if(t[3]!=c[1]&&+t[3]!=0)t[3]=t[3]-c[2]    # subtract 2nd minimum unless value is the minimum or blank
if(t[6]!=d[1]&&+t[6]!=0)t[6]=t[6]-d[2]
print t[1],t[2],t[3],t[4],t[5],t[6]
}
}' try1.txt
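Against try1.txt this should print the same result as the pipeline above (a sketch of the expected output, assuming GNU awk):
R,F,TE,K,G,R
1,0,12,f,1,1
2,1,2,t, ,0
3,1, , ,1,
4,0,0, ,0,16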

Related

File into table awk

I am trying to make a table by reading a file.
Here is an example of the file I am trying to process:
FHEAD|1|PRMPC|20200216020532|1037|S
TMBPE|2|MOD
TPDTL|3|72810|1995019|11049-|11049-|Dcto 20|0|5226468|20200216000001|20200222235959|2||1||||
TPGRP|4|5403307
TGLIST|5|5031472|1|||
TLITM|6|101055590
TPDSC|7|0|||-20||2|1|
TPGRP|8|5403308
TGLIST|9|5031473|0|||
TPDTL|13|10728|1995021|11049-|11049-|Dcto 30|0|5226469|20200216000001|20200222235959|2||1||||
TPGRP|14|5403310
TGLIST|15|5031475|1|||
TLITM|16|210000041
TLITM|17|101004522
TPDSC|113|0|||-30||2|1|
TPGRP|114|5403309
TGLIST|115|5031474|0|||
TLITM|116|101047933
TLITM|117|101004681
TLITM|118|101028161
TPDSC|119|0|||-25||2|1|
TPISR|214|101004225|2350|EA|20200216000000|COP|
TTAIL|1135
FTAIL|1136|1134
I tried to develop the code below, but it returns all the values on one line:
for filename in "$input"*.dat;
do
echo "$filename">>"$files"
a=`awk -F'|' '$1=="FHEAD" && $5!=""{print $5}' "$filename"`
b=`awk -F'|' '$1=="TPDTL" && $3!=""{print $3}' "$filename"`
c=`awk -F'|' '$1=="TPDTL" && $4!=""{print $4}' "$filename"`
d=`awk -F'|' '$1=="TPDTL" && $10!=""{print $10}' "$filename"`
e=`awk -F'|' '$1=="TPDTL" && $11!=""{print $11}' "$filename"`
f=`awk -F'|' '$1=="TPDSC" && $6!=""{print $6}' "$filename"`
g=`awk -F'|' '$1=="TLITM" && $3!=""{print $3}' "$filename"`
For example:
echo -e ${d}
20200216000001 20200216000001
I want something like the attached picture.
Can someone help me?
Thanks in advance
Assuming:
The keywords FHEAD, TPDTL, etc. do not appear with uniform frequency; the most recently seen value is used when needed.
The number of output rows equals the count of TLITM records.
The table rows are emitted whenever TPDSC appears.
then would you please try the following:
awk 'BEGIN {FS = "|"; OFS = ","}
$1 ~ /FHEAD/ {a = $5}
$1 ~ /TPDTL/ {b = $3; c = $4; d = $10; e = $11}
$1 ~ /TLITM/ {f[++tlitm_count] = $3}
$1 ~ /TPDSC/ {g = $6;
for (i=1; i<=tlitm_count; i++) {
print a, b, c, d, e, f[i], g
}
tlitm_count = 0;
}
' *.dat
Output:
1037,72810,1995019,20200216000001,20200222235959,101055590,-20
1037,10728,1995021,20200216000001,20200222235959,210000041,-30
1037,10728,1995021,20200216000001,20200222235959,101004522,-30
1037,10728,1995021,20200216000001,20200222235959,101047933,-25
1037,10728,1995021,20200216000001,20200222235959,101004681,-25
1037,10728,1995021,20200216000001,20200222235959,101028161,-25
If you want the output delimiter to be a whitespace, please modify the value of OFS.
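One optional hardening, not in the original answer: if a new TPDTL could ever start before the previous group's TPDSC arrives, resetting the TLITM buffer in the TPDTL rule keeps items from leaking across promotions:
$1 ~ /TPDTL/ {b = $3; c = $4; d = $10; e = $11; tlitm_count = 0}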

Remove duplicate from csv using bash / awk

I have a csv file with the format:
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
I want to group by the unique ids in the first column and concatenate the types into a single row, like this:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
I found that awk does a great job handling such scenarios, but all I could achieve was this:
"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"
I used this command:
awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
How can I remove the duplicates and also handle the formatting of the second column types?
quick fix:
$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
!seen[$0]++ is true only if the line has not been seen already.
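A minimal demonstration of the idiom on its own:
$ printf 'a\nb\na\n' | awk '!seen[$0]++'
a
b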
If the second column's values should all sit within a single pair of double quotes:
$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
!seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"
With GNU awk for true multi-dimensional arrays and gensub() and sorted_in:
$ awk -F'|' '
{ a[$1][gensub(/"/,"","g",$2)] }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in a) {
c = 0
for (j in a[i]) {
printf "%s%s", (c++ ? ":" : i "|\""), j
}
print "\""
}
}
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
The output rows and columns will both be string-sorted (i.e. alphabetically by characters) in ascending order.
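Unlike sub() and gsub(), gensub() returns the modified string instead of editing a variable in place, which is what lets it be used directly inside the array subscript. For example:
$ gawk 'BEGIN{print gensub(/"/, "", "g", "\"A\"")}'
A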
Short GNU datamash + tr solution:
datamash -st'|' -g1 unique 2 <file | tr ',' ':'
The output:
"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"
----------
If the between-item double quotes should be eliminated, use the following alternative:
datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
For the sample input, the following will work, though the output is unsorted.
One-liner
# using two arrays (recommended)
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
# using a regexp
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
Test Results:
$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
More readable:
Using a regexp
awk 'BEGIN{
FS=OFS="|"
}
{
a[$1] =$1 in a ?(a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2):$2
}
END{
for(i in a)
print i,a[i]
}
' infile
Using two arrays
awk 'BEGIN{
FS=OFS="|"
}
!seen[$1,$2]++{
a[$1] = ($1 in a ? a[$1] ":" : "") $2
}
END{
for(i in a)
print i,a[i]
}' infile
Note: you can also use !seen[$0]++, which uses the entire line as the index. If, in your real data, you want to key on particular columns instead, use !seen[$1,$2]++; here columns 1 and 2 are used as the index.
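A quick illustration of the difference, with a hypothetical third field added so the lines differ as a whole:
$ printf '"id-1"|"A"|x\n"id-1"|"A"|y\n' | awk -F'|' '!seen[$1,$2]++'
"id-1"|"A"|x
!seen[$0]++ would keep both lines, since they are byte-different; keying on ($1,$2) dedupes them.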
awk + sort solution:
awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

Multiple pattern matching

I have an input file with columns separated by | as follows.
[3yu23yuoi]|$name
!$fjkdjl|[kkklkl]
$hjhj|$mmkj
I want the output as
0 $name
!$fjkdjl 0
$hjhj $mmkj
Whenever the string begins with $ or !$ or "any", I want it printed as is; otherwise, 0.
I have tried the following command. It just prints everything the same as the input file:
awk -F="|" '{if (($1 ~ /^.*\$/) || ($1 ~ /^.*\!$/) || ($1 ~ /^any/)) {print $1} else if ($1 ~ /^\[.*/){print "0"} else if (($2 ~ /^.*\$/) || ($2 ~ /^.*\!$/) || ($2 ~ /^any/)) {print $2} else if($2 ~ /^\[.*/){print "0"}}' input > output
This should do:
awk -F\| '{$1=$1;for (i=1;i<=NF;i++) if ($i!~/^(\$|!\$|any)/) $i=0}1' file
0 $name
!$fjkdjl 0
$hjhj $mmkj
If a field does not start with $, !$, or any, set it to 0.
Or if you like tab as separator:
awk -F\| '{$1=$1;for (i=1;i<=NF;i++) if ($i!~/^(\$|!\$|any)/) $i=0}1' OFS="\t" file
0 $name
!$fjkdjl 0
$hjhj $mmkj
$1=$1 makes sure every line is rebuilt with the output separator, even if no data was changed.
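A quick way to test the pattern on its own (the sample strings are made up for illustration):
$ printf '$name\n!$fjkdjl\nany12\n[kkk]\n' | awk '{print ($0 ~ /^(\$|!\$|any)/ ? $0 : 0)}'
$name
!$fjkdjl
any12
0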

Awk & Sort-Output as Comma Delimited?

I am trying to get this to output as comma delimited. The current version doesn't work at all (I get a blank file as output), and previous versions (where I keep the awk BEGIN statements but don't have the sort delimiter) output as tab delimited, not comma delimited. Those previous versions, without the attempt at comma delimiters, do give the expected answer (with the complicated filters, etc.), so I'm not asking for help with that portion. I realize this is a very ugly way to filter, and the numbers are also ugly/very large.
The background of the question: Find the regions in the file lamina.bed that overlap with the region chr12:5000000-6000000, and to sort descending by column 4, output as comma delimited. Chromosome is the first column, start position of the region is column 2, end position is column 3, value is column 4. We are supposed to use awk (in Unix bash shell). Thank you in advance for your help!
awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000)' /vol1/opt/data/lamina.bed | awk 'BEGIN{FS=","; OFS=","} ($1 == "chr12") ' | sort -t$"," -k4rn > ~/MOLB7621/PS_2/results/2015_02_05/PS2_p3_n1.csv
cat ~/MOLB7621/PS_2/results/2015_02_05/PS2_p3_n1.csv
sample lines of input (tab delimited, including the lines on chr12 that should work):
#chrom start end value
chr1 11323785 11617177 0.86217008797654
chr1 12645605 13926923 0.934891485809683
chr1 14750216 15119039 0.945945945945946
chr12 3306736 5048326 0.913561847988077
chr12 5294045 5393088 0.923076923076923
chr12 5505370 6006665 0.791318864774624
chr12 7214638 7827375 0.8562874251497
chr12 8139885 10173149 0.884353741496599
To get comma-separated output, use the following:
$ awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000) {$1=$1;print}' file | awk 'BEGIN{FS=","; OFS=","} ($1 == "chr12") ' | sort -t$"," -k4rn
chr12,5294045,5393088,0.923076923076923
chr12,3306736,5048326,0.913561847988077
chr12,5505370,6006665,0.791318864774624
The only change above is the addition of the action:
{$1=$1;print}
awk only rewrites a line with the new field separator if one or more of the fields on the line has been changed in some way. $1=$1 is sufficient to indicate that field 1 has been changed. Consequently, the new field separators are inserted.
Also, the two calls to awk can be combined into a single call:
awk 'BEGIN{FS="\t"; OFS=","} ($2 <= 5000000 && $3 >= 5000000) || ($2 >= 5000000 && $3 <= 6000000) || ($2 <= 6000000 && $3 >= 6000000) || ($2 <= 5000000 && $3 >= 6000000) {$1=$1; if($1 == "chr12") print}' file | sort -t$"," -k4rn
Simpler Example
In the following, the input is tab-separated and the output field separator, OFS, is set to a comma. In this first example, the awk command print is used:
$ echo $'a\tb\tc' | awk -v OFS=, '{print}'
a b c
Despite OFS=",", the output retains the tab separator.
Now, we add the simple statement $1=$1 and observe the output:
$ echo $'a\tb\tc' | awk -v OFS=, '{$1=$1;print}'
a,b,c
The output is now comma-separated. Again, that is because awk only reformats a line with the new OFS if it thinks that a field on the line has been changed in some way. The assignment of $1 to itself is sufficient to trigger that reformat.
Note that it is not sufficient to make a change that affects the line as a whole. For example, the following does not trigger a reformat:
$ echo $'a\tb\tc' | awk -v OFS=, '{$0=$0;print}'
a b c
It is necessary to change one or more fields of the line individually. In the following, sub operates on $0 as a whole and, consequently, no reformat is triggered:
$ echo $'a\tb\tc' | awk -v OFS=, '{sub($1,"NEW");print}'
NEW b c
In the example below, however, sub operates specifically on field $1 and hence triggers a reformat:
$ echo $'a\tb\tc' | awk -v OFS=, '{sub($1,"NEW", $1);print}'
NEW,b,c

Creating an array with awk and passing it to a second awk operation

I have a column file, and I want to print all the lines that do not contain the string SOL, and, of the lines that do contain SOL, only those whose 5th column is <1.2 or >4.8.
The file is structured as: MOLECULENAME ATOMNAME X Y Z
Example:
151SOL OW 6554 5.160 2.323 4.956
151SOL HW1 6555 5.188 2.254 4.690 ----> as you can see, this atom is out of the
151SOL HW2 6556 5.115 2.279 5.034       threshold, but it needs to be printed
What I thought was to save a vector with all the MOLECULENAMEs that I want, and then tell awk to match every MOLECULENAME saved in vector "a" against the file and print the complete output. (If I only do the first awk, I end up with bad atom linkage near the threshold.)
The problem is that I have to pass the vector from the first awk to the second... I tried it like this with a[], but of course it doesn't work.
How can I do this?
Here is the code I have so far:
a[] = (awk 'BEGIN{i=0} $1 !~ /SOL/{a[i]=$1;i++}; /SOL/ && $5 > 4.8 {a[i]=$1;i++};/SOL/ &&$5<1.2 {a[i]=$1;i++}')
awk -v a="$a[$i]" 'BEGIN{i=0} $1 ~ $a[i] {if (NR>6540) {for (j=0;j<3;j++) {print $0}} else {print $0}
You can put all lines of the same molecule on one row by running sort on the file and then this awk, which uses printf to keep printing on the same line until a different molecule name is found; then a new line starts. The second awk script detects which molecule names have 3 valid lines in the original file. I hope this helps you solve your problem:
sort your_file | awk 'BEGIN{ molname=""; } ( $0 !~ "SOL" || ( $0 ~ "SOL" && ( $5<1.2 || $5>4.8 ) ) ){ if($1!=molname){printf("\n");molname=$1}for(i=1;i<=NF;i++){printf("%s ",$i);}}' | awk 'NF>12 {print $0}'
awk '!/SOL/ || $5 < 1.2 || $5 > 4.8' inputfile.txt
Print (default behaviour) lines where:
"SOL" is not found
SOL is found and fifth column < 1.2
SOL is found and fifth column > 4.8
SOLVED! Thanks to all, here is how I solved it.
#!/bin/bash
file=$1
awk 'BEGIN {molecola=""; j=1}
{
if ($1 !~ /SOL/) {print $0}                  # non-SOL lines pass through untouched
else if ($1 != molecola) {                   # a new SOL molecule starts:
  # print the buffered molecule if any of its atoms was outside the thresholds
  for (j in arr_comp) {if (arr_comp[j] < 1.2 || arr_comp[j] > 5) {for (j in arr_mol) {print arr_mol[j]}; break}}
  delete arr_comp
  delete arr_mol
  arr_mol[0]=$0
  arr_comp[0]=$5
  molecola=$1
  j=1
}
else {arr_mol[j]=$0; arr_comp[j]=$5; j++}    # same molecule: keep buffering
}
END {
  # flush the last buffered molecule, which the main loop never reaches
  for (j in arr_comp) {if (arr_comp[j] < 1.2 || arr_comp[j] > 5) {for (j in arr_mol) {print arr_mol[j]}; break}}
}' "$file"
