Divide each row by max value in awk - shell

I am trying to divide every value in a row by the maximum value of that row (some rows have NA in all columns). For example, from
r1 r2 r3 r4
a 0 2.3 1.2 0.1
b 0.1 4.5 9.1 3.1
c 9.1 8.4 0 5
I get
r1 r2 r3 r4
a 0 1 0.52173913 0.043478261
b 0.010989011 0.494505495 1 0.340659341
c 1 0.923076923 0 0.549450549
I tried to calculate max of each row by executing
awk '{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print m}' file.txt > max.txt
then pasted it as the last column to the file.txt as
paste file.txt max.txt > file1.txt
Next I want to divide every column in a line by the last column, but I first need to format each line, and I am stuck at
awk '{for(i=1;i<NF;i++) printf "%s " $i,$NF}' file1.txt
I am trying to print each combination for that line and then print the next lines combinations on new line. But I want to know if there is a better way to do this.

awk to the rescue!
$ awk 'NR>1 {m=$2; for(i=3;i<=NF;i++) if($i>m) m=$i;
for(i=2;i<=NF;i++) $i/=m}1' file
r1 r2 r3 r4
a 0 1 0.521739 0.0434783
b 0.010989 0.494505 1 0.340659
c 1 0.923077 0 0.549451

Following awk may help you on same:
awk '
FNR==1{
print;
next
}
{
len=""
for(i=2;i<=NF;i++){
len=len>$i?len:$i};
printf("%s%s", $1, OFS)
}
{
for(i=2;i<=NF;i++){
printf("%s%s",$i>0?$i/len:0,i==NF?RS:FS)}
}
' Input_file
Explanation: Adding explanation too here with solution now:
awk '
FNR==1{ ##FNR==1 is a condition where it will check if it is first line of Input_file then do following:
print; ##printing the current line then.
next ##next is awk out of the box keyword which will skip all further statements now.
}
{
len="" ##Resetting variable len, which will hold the greatest value of the current line.
for(i=2;i<=NF;i++){ ##Starting a for loop here from the 2nd field up to the value of NF, so it covers all the value fields on a line.
len=len>$i?len:$i}; ##Keeping len if it is already greater than the current field $i; otherwise (including when len is still empty) setting it to $i.
printf("%s%s", $1, OFS) ##Printing the 1st column value here along with space.
}
{
for(i=2;i<=NF;i++){ ##Starting a for loop here whose value starts from 2 to till the value of NF it covers all the field of current line.
printf("%s%s",$i>0?$i/len:0,i==NF?RS:FS)} ##Printing current field divided by value of len variable(which has maximum value of current line), it also checks a condition: if value of i equals NF then print new line else print space.
}
' Input_file ##mentioning the Input_file name here.
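The question also mentions rows whose columns are all NA, which neither answer above handles explicitly. Here is a minimal sketch, assuming that missing cells are the literal string `NA` and that a row without a positive maximum should be printed unchanged (both are assumptions, not stated in the question); `normalize_rows` is a hypothetical helper name:

```shell
# normalize_rows: divide every numeric cell by its row maximum.
# Assumptions: line 1 is a header, column 1 is a label, "NA" marks missing data.
normalize_rows() {
  awk '
    NR == 1 { print; next }                  # header passes through untouched
    {
      have = 0
      for (i = 2; i <= NF; i++)              # maximum over the numeric cells only
        if ($i != "NA" && (!have || $i + 0 > max)) { max = $i + 0; have = 1 }
      if (!have || max <= 0) { print; next } # all NA, or no positive maximum: keep the row
      printf "%s", $1
      for (i = 2; i <= NF; i++) {
        if ($i == "NA") printf " NA"
        else            printf " %.6g", $i / max
      }
      print ""
    }'
}

printf '%s\n' 'r1 r2 r3 r4' 'a 0 2.3 1.2 0.1' 'b NA NA NA NA' 'c 9.1 8.4 0 5' | normalize_rows
```

With this sample, rows a and c are scaled as in the first answer, and row b comes out as `b NA NA NA NA`.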

Related

How to print the row number and starting location of a pattern when multiple matches per row are present?

I want to use awk to match all the occurrences of a pattern within a large file. For each match, I would like to print the row number and the starting position of the pattern along the row (sort of xy coordinates). There are several occurrences of the pattern in each line.
I found this somewhat related question.
So far, I managed to do it only for the first (leftmost) occurrence in each line. As an example:
echo xyzABCdefghiABCdefghiABCdef | awk 'match($0, /ABC/) {print NR, RSTART } '
The resulting output is :
1 4
But what I would expect is something like this:
1 4
1 13
1 22
I tried using split instead of match. I manage to identify all the occurrences, but the RSTART is lost and printed as "0".
echo xyzABCdefghiABCdefghiABCdef | awk ' { split($0,t, /ABC/,m) ; for (i=1; i in m; i++) print (NR, RSTART) } '
Output:
1 0
1 0
1 0
Any advice would be appreciated. I am not limited to using awk, but an awk solution would be preferred.
Also, in my case the pattern to match would be a regex (/A.C/).
Thank you
This may be what you're trying to do:
echo xyzABCdefghiABCdefghiABCdef |
awk '{ begpos=1
while (match(substr($0, begpos), /ABC/)) {
print NR, begpos + RSTART - 1
begpos += RLENGTH + RSTART - 1
}
}'
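The same loop carries over to the regex case mentioned in the question (`/A.C/`), because `RSTART` and `RLENGTH` describe whatever was actually matched. A sketch, assuming the pattern is passed as a dynamic regex string in a hypothetical variable `re` and that it never matches the empty string:

```shell
# match_positions RE: print "line column" for every non-overlapping match of RE.
# Assumption: RE never matches the empty string (that would loop forever).
match_positions() {
  awk -v re="$1" '{
    begpos = 1
    while (match(substr($0, begpos), re)) {
      print NR, begpos + RSTART - 1        # column in the original line
      begpos += RSTART + RLENGTH - 1       # resume right after this match
    }
  }'
}

echo xyzABCdefghiAXCdefghiAYCdef | match_positions 'A.C'
```

For this input the three matches (ABC, AXC, AYC) are reported at columns 4, 13 and 22 of line 1.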
Another option, using GNU awk, is the four-argument form of split.
In that form, the 3rd argument is the field-separator regex and the 4th argument is the seps array, which receives the text matched by each separator. The lengths of the pieces and of the separators together give the positions.
echo xyzABCdefghiABCdefghiABCdef |
awk ' {
n=split($0, a, /ABC/, seps); pos=1
for(i=1; i<n; i++){
pos += length(a[i])
print NR, pos
pos += length(seps[i])
}
}'
Output
1 4
1 13
1 22
With your shown samples, please try following awk code.
awk '
{
prev=0
while(match($0,/ABC/)){
$0=substr($0,RSTART+RLENGTH)
print FNR,prev+RSTART
prev+=RSTART+RLENGTH-1
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
prev=0 ##Setting prev variable to 0 here.
while(match($0,/ABC/)){ ##Using while loop to match ABC string and it runs till ABC match is true in current line.
$0=substr($0,RSTART+RLENGTH) ##Re-creating current line by assigning value of rest of line(which starts after match of ABC).
print FNR,prev+RSTART ##Printing line number along with prev+RSTART value here.
prev+=RSTART+RLENGTH-1 ##Adding the number of characters consumed so far (RSTART+RLENGTH-1) to prev, so it also works for matches that are not 3 characters long.
}
}
' Input_file ##Mentioning Input_file name here.
Determination of the coordinates of a string with awk:
echo "xyzABCdefghiABCdefghiABCdef" \
| awk -v s="ABC" 'BEGIN{ len=length(s) }
{
for(i=1; i<=length($0); i++){
if(substr($0, i, len)==s){
print NR, i
}
}
}'
Output:
1 4
1 13
1 22
As one line:
echo xyzABCdefghiABCdefghiABCdef | awk -v s="ABC" 'BEGIN{ len=length(s) } { for(i=1; i<=length($0); i++){ if(substr($0,i,len)==s) { print NR,i } } }'
Source: Find position of character with awk
One awk idea using split() and some slicing-n-dicing of length() results:
ptn='ABC'
echo xyzABCdefghiABCdefghiABCdef |
awk -v ptn="${ptn}" '
{ pos=-(length(ptn)-1)
n=split($0,arr,ptn)
for (i=1;i<n;i++) {
pos+=length(arr[i] ptn)
print NR,pos
}
}'
This generates:
1 4
1 13
1 22

Match columns between files and generate file with combination of data in terminal/powershell/command line Bash

I have two .txt files of different lengths and would like to do the following:
If a value in column 1 of file 1 is present in column 1 of file 2, print column 2 of file 2 and then the whole corresponding line from file 1.
I have tried various awk commands but have been unsuccessful so far.
Thank you!
File 1:
MARKERNAME EA NEA BETA SE
10:1000706 T C -0.021786390809225 0.519667838651725
1:715265 G C 0.0310128798578049 0.0403763946716293
10:1002042 CCTT C 0.0337857775471699 0.0403300629299562
File 2:
CHR:BP SNP CHR BP GENPOS ALLELE1 ALLELE0 A1FREQ INFO
1:715265 rs12184267 1 715265 0.0039411 G C 0.964671
1:715367 rs12184277 1 715367 0.00394384 A G 0.964588
Desired File 3:
SNP MARKERNAME EA NEA BETA SE
rs12184267 1:715265 G C 0.0310128798578049 0.0403763946716293
Attempted:
awk -F'|' 'NR==FNR { a[$1]=1; next } ($1 in a) { print $3, $0 }' file1 file2
awk 'NR==FNR{A[$1]=$2;next}$0 in A{$0=A[$0]}1' file1 file2
With your shown samples, could you please try following.
awk '
FNR==1{
if(++count==1){ col=$0 }
else{ print $2,col }
next
}
FNR==NR{
arr[$1]=$0
next
}
($1 in arr){
print $2,arr[$1]
}
' file1 file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if this is first line of file(s).
if(++count==1){ col=$0 } ##Checking if count is 1 then set col as current line.
else{ print $2,col } ##Checking if above is not true then print 2nd field and col here.
next ##next will skip all further statements from here.
}
FNR==NR{ ##This will be TRUE when file1 is being read.
arr[$1]=$0 ##Creating arr with 1st field index and value is current line.
next ##next will skip all further statements from here.
}
($1 in arr){ ##Checking condition if 1st field present in arr then do following.
print $2,arr[$1] ##Printing 2nd field, arr value here.
}
' file1 file2 ##Mentioning Input_files name here.
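To try the idea without touching real data, here is a sketch that recreates the two sample files in a scratch directory (the `mktemp -d` location, the file names and the variable names `row`, `hdr` and `joined` are arbitrary choices) and runs a condensed variant of the program above:

```shell
# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
MARKERNAME EA NEA BETA SE
10:1000706 T C -0.021786390809225 0.519667838651725
1:715265 G C 0.0310128798578049 0.0403763946716293
10:1002042 CCTT C 0.0337857775471699 0.0403300629299562
EOF
cat > "$tmp/file2" <<'EOF'
CHR:BP SNP CHR BP GENPOS ALLELE1 ALLELE0 A1FREQ INFO
1:715265 rs12184267 1 715265 0.0039411 G C 0.964671
1:715367 rs12184277 1 715367 0.00394384 A G 0.964588
EOF

# Pass 1 (file1): remember each line by its first column, and keep the header.
# Pass 2 (file2): print the combined header, then SNP + stored line for shared keys.
joined=$(awk '
  NR == FNR { row[$1] = $0; if (FNR == 1) hdr = $0; next }
  FNR == 1  { print $2, hdr; next }
  $1 in row { print $2, row[$1] }
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

Only `1:715265` appears in both files, so the result is the header line followed by the single `rs12184267` row from the desired output.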

divide each column by max value/last value

I have a matrix like this:
A 25 27 50
B 35 37 475
C 75 78 80
D 99 88 76
0 234 230 681
The last row is the sum of all elements in the column - and it is also the maximum value.
What I would like to get is the matrix in which each value is divided by the last value in the column (e.g. for the first number in column 2, I would want "25/234="):
A 0.106837606837607 0.117391304347826 0.073421439060206
B 0.14957264957265 0.160869565217391 0.697503671071953
C 0.320512820512821 0.339130434782609 0.117474302496329
D 0.423076923076923 0.382608695652174 0.11160058737151
An answer in another thread gives an acceptable result for one column, but I was not able to loop it over all columns.
$ awk 'FNR==NR{max=($2+0>max)?$2:max;next} {print $1,$2/max}' file file
(this answer was provided here: normalize column data with maximum value of that column)
I would be grateful for any help!
In addition to the great approaches by @RavinderSingh13, you can also isolate the last line in the input file with, e.g. tail -n1 Input_file and then use the split() command in the BEGIN rule to separate the values. You can then make a single-pass through the file with awk to update the values as you indicate. In the end, you can pipe the output to head -n-1 to remove the unneeded final row, e.g.
awk -v lline="$(tail -n1 Input_file)" '
BEGIN { split(lline,a," ") }
{
printf "%s", $1
for(i=2; i<=NF; i++)
printf " %.15f", $i/a[i]
print ""
}
' Input_file | head -n-1
Example Use/Output
$ awk -v lline="$(tail -n1 Input_file)" '
> BEGIN { split(lline,a," ") }
> {
> printf "%s", $1
> for(i=2; i<=NF; i++)
> printf " %.15f", $i/a[i]
> print ""
> }
> ' Input_file | head -n-1
A 0.106837606837607 0.117391304347826 0.073421439060206
B 0.149572649572650 0.160869565217391 0.697503671071953
C 0.320512820512821 0.339130434782609 0.117474302496329
D 0.423076923076923 0.382608695652174 0.111600587371512
(note: this presumes you don't have trailing blank lines in your file and you really don't have blank lines between every row. If you do, let me know)
The differences between the approaches are largely negligible. In each case you are making a total of 3-passes through the file. Here with tail, awk and then head. In the other case with wc and then two-passes with awk.
Let either of us know if you have questions.
1st solution: Could you please try the following, written and tested with the shown samples in GNU awk. It prints exactly 15 decimal places, as in the OP's shown samples:
awk -v lines=$(wc -l < Input_file) '
FNR==NR{
if(FNR==lines){
for(i=2;i<=NF;i++){ arr[i]=$i }
}
next
}
FNR<lines{
for(i=2;i<=NF;i++){ $i=sprintf("%0.15f",(arr[i]?$i/arr[i]:"NaN")) }
print
}
' Input_file Input_file
2nd solution: If you don't need a fixed number of decimal places, then try the following.
awk -v lines=$(wc -l < Input_file) '
FNR==NR && FNR==lines{
for(i=2;i<=NF;i++){ arr[i]=$i }
next
}
FNR<lines && FNR!=NR{
for(i=2;i<=NF;i++){ $i=(arr[i]?$i/arr[i]:"NaN") }
print
}
' Input_file Input_file
OR(placing condition of FNR==lines inside FNR==NR condition):
awk -v lines=$(wc -l < Input_file) '
FNR==NR{
if(FNR==lines){
for(i=2;i<=NF;i++){ arr[i]=$i }
}
next
}
FNR<lines{
for(i=2;i<=NF;i++){ $i=(arr[i]?$i/arr[i]:"NaN") }
print
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk -v lines=$(wc -l < Input_file) ' ##Starting awk program from here, creating the variable lines, which holds the total number of lines in Input_file.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
if(FNR==lines){ ##Checking if FNR is equal to lines then do following.
for(i=2;i<=NF;i++){ arr[i]=$i } ##Traversing through all fields here of current line and creating an array arr with index of i and value of current field value.
}
next ##next will skip all further statements from here.
}
FNR<lines{ ##Checking condition if current line number is lesser than lines, this will execute when 2nd time Input_file is being read.
for(i=2;i<=NF;i++){ $i=sprintf("%0.15f",(arr[i]?$i/arr[i]:"NaN")) } ##Traversing through all fields here and saving value of divide of current field with arr current field value with 15 floating points into current field.
print ##Printing current line here.
}
' Input_file Input_file ##Mentioning Input_file names here.
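If the file is small enough to hold in memory, the separate passes can be collapsed into a single awk process by buffering the rows and using the last one as the divisor in `END`. A sketch under that assumption; `divide_by_last_row` is a hypothetical name, and columns whose total is zero are printed as `NaN`, mirroring the answer above:

```shell
# divide_by_last_row: divide each value by the value in the last row of its column.
# Assumption: the whole input fits in memory.
divide_by_last_row() {
  awk '
    { line[NR] = $0 }                      # buffer every row
    END {
      nf = split(line[NR], total, " ")     # the last row holds the column totals
      for (r = 1; r < NR; r++) {           # every row except the totals row
        split(line[r], cell, " ")
        out = cell[1]
        for (i = 2; i <= nf; i++)
          out = out " " (total[i] + 0 ? sprintf("%.15f", cell[i] / total[i]) : "NaN")
        print out
      }
    }'
}

printf '%s\n' 'A 25 27 50' 'B 35 37 475' 'C 75 78 80' 'D 99 88 76' '0 234 230 681' |
  divide_by_last_row
```

The trade-off is memory for simplicity: no `wc`, `tail` or `head` is needed, but every line is kept until the end of input.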

Append delimiters for implied blank fields

I am looking for a simple solution to have for each line the same number of commas in file (CSV file)
e.g.
example of file:
1,1
A,B,C,D,E,F
2,2,
3,3,3,
4,4,4,4
expected:
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
the line with the largest number of commas has 5 commas in this case (line #2). so, I want to add other commas in all lines to have the same number for each line (i.e. 5 commas)
Using awk:
$ awk 'BEGIN{FS=OFS=","} {$6=$6} 1' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
As you can see above, in this approach the max. number of fields must be hardcoded in the command.
Another take on making all lines in a CSV file have the same number of fields. The number of fields need not be known in advance: the maximum is calculated and a string of the needed commas is appended to each record, e.g.
awk -F, -v max=0 '{
lines[n++] = $0 # store lines indexed by line number
fields[lines[n-1]] = NF # store number of field indexed by $0
if (NF > max) # find max NF value
max = NF
}
END {
for(i=0;i<max;i++) # form string with max commas
commastr=commastr","
for(i=0;i<n;i++) # loop appended substring of commas
printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
}' file
Example Use/Output
Pasting at the command-line, you would receive:
$ awk -F, -v max=0 '{
> lines[n++] = $0 # store lines indexed by line number
> fields[lines[n-1]] = NF # store number of field indexed by $0
> if (NF > max) # find max NF value
> max = NF
> }
> END {
> for(i=0;i<max;i++) # form string with max commas
> commastr=commastr","
> for(i=0;i<n;i++) # loop appended substring of commas
> printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
> }' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
Could you please try the following, a more generic way. This code works even when the number of fields differs between lines of your Input_file: it first reads the whole file to get the maximum number of fields, and on the second read it resets the fields (because OFS is set to a comma, any line with fewer fields than nf gets the missing commas appended). Enhanced version of @oguz ismail's answer.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
nf=nf>NF?nf:NF
next
}
{
$nf=$nf
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of awk program from here.
FS=OFS="," ##Setting FS and OFS as comma for all lines here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
nf=nf>NF?nf:NF ##Creating variable nf, which tracks the maximum field count: if nf is greater than NF then keep it as it is, else set it to NF.
next ##next will skip all further statements from here.
}
{
$nf=$nf ##Mentioning $nf=$nf will reset current lines value and will add comma(s) at last of line if NF is lesser than nf.
}
1 ##1 will print edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names here.
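The padding relies on standard awk behaviour: assigning to a field beyond `NF` creates the intervening empty fields and rebuilds `$0` using `OFS`. A small sketch of just that step, with the target width passed in through a hypothetical variable `n` (the function name `pad_fields` is also made up for illustration):

```shell
# pad_fields N: pad every comma-separated line to at least N fields.
pad_fields() {
  awk -v n="$1" 'BEGIN { FS = OFS = "," }
    NF < n { $n = "" }   # touching field n extends the record and rebuilds $0 with OFS
    { print }'
}

printf '%s\n' '1,1' '2,2,' 'A,B,C,D,E,F' | pad_fields 6
```

The first two lines come out as `1,1,,,,` and `2,2,,,,`, while the six-field line is left untouched.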

Add unique value from first column before each group

I have following file contents:
T12 19/11/19 2000
T12 18/12/19 2040
T15 19/11/19 2000
T15 18/12/19 2080
How can I get the following output with awk, bash, etc.? I searched for similar examples but have not found any so far:
T12
19/11/19 2000
18/12/19 2040
T15
19/11/19 2000
18/12/19 2080
Thanks,
S
Could you please try following. This code will print output in same order in which first field is occurring in Input_file.
awk '
!a[$1]++ && NF{
b[++count]=$1
}
NF{
val=$1
$1=""
sub(/^ +/,"")
c[val]=(c[val]?c[val] ORS:"")$0
}
END{
for(i=1;i<=count;i++){
print b[i] ORS c[b[i]]
}
}
' Input_file
Output will be as follows.
T12
19/11/19 2000
18/12/19 2040
T15
19/11/19 2000
18/12/19 2080
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
!a[$1]++ && NF{ ##Checking condition if $1 is NOT present in array a and line is NOT NULL then do following.
b[++count]=$1 ##Creating an array named b whose index is variable count(incremented each time this block runs) and its value is first field of current line.
} ##Closing BLOCK for this condition now.
NF{ ##Checking condition if a line is NOT NULL then do following.
val=$1 ##Creating variable named val whose value is $1 of current line.
$1="" ##Nullifying $1 here of current line.
sub(/^ +/,"") ##Substituting initial space with NULL now in line.
c[val]=(c[val]?c[val] ORS:"")$0 ##Creating an array c whose index is variable val and its value is keep concatenating to its own value with ORS value.
} ##Closing BLOCK for this condition here.
END{ ##Starting END block for this awk program here.
for(i=1;i<=count;i++){ ##Starting a for loop which runs from i=1 to till value of variable count.
print b[i] ORS c[b[i]] ##Printing array b whose index is i and array c whose index is array b value with index i.
}
} ##Closing this program END block here.
' Input_file ##Mentioning Input_file name here.
Here is a quick awk:
$ awk 'BEGIN{RS="";ORS="\n\n"}{printf "%s\n",$1; gsub($1" +",""); print}' file
How does it work?
Awk knows the concepts of records and fields.
Files are split in records where consecutive records are split by the record separator RS. Each record is split in fields, where consecutive fields are split by the field separator FS.
By default, the record separator RS is set to be the <newline> character (\n) and thus each record is a line. The record separator has the following definition:
RS:
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So with the file format you give, we can define the records based on RS="".
By default, the field separator is set to be any sequence of blanks. So $1 will point to that particular word we want on the separate line. So we print it with printf, and then we remove any reference to it with gsub.
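To see paragraph mode on its own, here is a sketch that assumes the groups in the input are separated by blank lines (which the one-liner above relies on) and reports how awk splits them; `describe_records` is a hypothetical name:

```shell
# With RS="" each blank-line-separated block is one record,
# and newlines separate fields just like blanks do.
describe_records() {
  awk 'BEGIN { RS = "" }
       { printf "record %d: first field %s, %d fields\n", NR, $1, NF }'
}

printf '%s\n' 'T12 19/11/19 2000' 'T12 18/12/19 2040' '' \
              'T15 19/11/19 2000' 'T15 18/12/19 2080' | describe_records
```

Each group of two lines becomes a single record with six fields, which is why `$1` is the group name (T12, then T15) in the one-liner.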
awk is very flexible and provides a number of ways to solve the same problem. The answers you have already are excellent. Another way to approach the problem is to simply keep a single variable that holds the current field 1 as its value. (unset by default) When the first field changes, you simply output the first field as the current heading. Otherwise you output the 2nd and 3rd fields. If a blank-line is encountered, simply output the newline.
awk -v h= '
NF < 3 {print ""; next}
$1 != h {h=$1; print $1}
{printf "%s %s\n", $2, $3}
' file
The above consists of three rules. The first checks whether the line is empty (tested as fewer than three fields, NF < 3); if so, it outputs a newline and skips to the next record. The second checks whether the first field differs from the current heading variable h; if it does, it sets h to the new heading and outputs it. The third outputs the 2nd and 3rd fields of every non-empty record.
Result
Just paste the command above at the command line and you will get the desired result, e.g.
awk -v h= '
> NF < 3 {print ""; next}
> $1 != h {h=$1; print $1}
> {printf "%s %s\n", $2, $3}
> ' file
T12
19/11/19 2000
18/12/19 2040
T15
19/11/19 2000
18/12/19 2080
