Detect increment made in any column - bash

I have the following data as input, and I am trying to find the increment per group.
col1 col2 col3 group
1 2 100 alpha
1 2 100 alpha
1 2 100 alpha
3 4 200 beta
3 4 200 beta
3 4 200 beta
3 4 300 beta
5 6 700 charlie
7 8 400 tango
7 8 300 tango
7 8 700 tango
Example output:
tango: 300
charlie:0
beta:100
alpha:0
I am trying this approach, but the answers are incorrect because the values sometimes increase in between the samples:
awk 'NR>1{print $NF}' foo |while read line;do grep -w $line foo|sort -k3n ;done |awk '!a[$4]++' |sort -k4
1 2 100 alpha
3 4 200 beta
5 6 700 charlie
7 8 300 tango
awk 'NR>1{print $NF}' foo |while read line;do grep -w $line foo|sort -k3n ;done |tac|awk '!a[$4]++' |sort -k4
1 2 100 alpha
3 4 300 beta
5 6 700 charlie
7 8 700 tango

Awk solution:
awk 'NR==1{ next }
g && $4 != g{ print g":"(v - gr[g]) }
!($4 in gr){ gr[$4]=$3 }{ g=$4; v=$3 }
END{ print g":"(v - gr[g]) }' file
NR==1{ next } - skip the 1st record
g - variable that holds the group name of the previous record
v - variable that holds the value ($3) of the previous record
!($4 in gr){ gr[$4]=$3 } - on the 1st occurrence of a distinct group name $4 - save its first value $3 into array gr
g && $4 != g{ print g":"(v - gr[g]) } - if the current group name $4 differs from the previous one g - print the delta between the last and 1st values of the previous group
The output:
alpha:0
beta:100
charlie:0
tango:300

The following should do the trick. This solution does not require the file to be sorted by group name.
awk '(NR==1){next}
{groupc[$4]++}
(groupc[$4]==1){groupv[$4]=$3}
{groupl[$4]=$3}
END{for(i in groupc) { print i":",groupl[i]-groupv[i]} }
' foo
The following things happen:
skip the first line: (NR==1){next}
count how many times each group occurs: {groupc[$4]++}
if the group count equals 1, store its first value in groupv
store the last seen value in groupl
at the END, loop over all array keys (which are the groups) and print the last value minus the first value.
output :
tango: 300
alpha: 0
beta: 100
charlie: 0
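The order of a for (i in array) loop is unspecified in awk, which is why the groups above come out in a different order from the input. A minimal sketch of one way to keep first-seen order while still not requiring sorted input follows; the array names first, last and order are illustrative, and the sample rows are a shuffled subset of the question's data.

```shell
result=$(printf '%s\n' \
  'col1 col2 col3 group' \
  '1 2 100 alpha' \
  '3 4 200 beta' \
  '1 2 100 alpha' \
  '3 4 300 beta' \
  '5 6 700 charlie' \
  '7 8 400 tango' \
  '7 8 300 tango' \
  '7 8 700 tango' |
awk '
NR == 1 { next }                                      # skip the header line
!($4 in first) { first[$4] = $3; order[++n] = $4 }    # first value per group, plus first-seen order
{ last[$4] = $3 }                                     # overwritten on every row, so it ends as the last value
END {
    for (i = 1; i <= n; i++)                          # walk the groups in the order they first appeared
        print order[i] ":" (last[order[i]] - first[order[i]])
}')
printf '%s\n' "$result"
```

With this input the groups print as alpha, beta, charlie, tango, even though the alpha and beta rows are interleaved.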

The following awk may help as well. It prints the output in the same order as the groups appear in the last column of your Input_file.
awk '
FNR==1{
next}
prev!=$NF && prev{
val=prev_val!=a[prev]?prev_val-a[prev]:0;
printf("%s %d\n",prev,val>0?val:0)}
!a[$NF]{
a[$NF]=$(NF-1)}
{
prev=$NF;
prev_val=$(NF-1)}
END{
val=prev_val!=a[prev]?prev_val-a[prev]:0;
printf("%s %d\n",prev,val>0?val:0)}
' Input_file
The output will be as follows.
alpha 0
beta 100
charlie 0
tango 300
Explanation: an annotated version of the code above.
awk '
FNR==1{ ##Skip the first line of Input_file, which is the heading: if FNR==1 then run next, which skips all further statements for this line.
next}
prev!=$NF && prev{ ##If prev is set and differs from the group on the current line ($NF), the previous group has ended, so do the following:
val=prev_val!=a[prev]?prev_val-a[prev]:0;##Set val to prev_val minus a[prev] (the group's last value minus its first value) if they differ, else to zero.
printf("%s %d\n",prev,val>0?val:0)} ##Print prev (the group name from the last column), then val if it is greater than 0, else 0.
!a[$NF]{ ##If array a has no value yet for index $NF, store the current second-to-last field there; this captures the first value of each group so that it can later be subtracted from the group's last value, as the OP requested.
a[$NF]=$(NF-1)}
{
prev=$NF; ##Set prev to the last column of the current line (the group name).
prev_val=$(NF-1)} ##Set prev_val to the second-to-last column of the current line.
END{ ##The END block runs once Input_file has been read completely.
val=prev_val!=a[prev]?prev_val-a[prev]:0;##Set val to prev_val minus a[prev] if they differ, else to zero (this handles the final group).
printf("%s %d\n",prev,val>0?val:0)} ##Print prev (the group name from the last column), then val if it is greater than 0, else 0.
' Input_file ##Mentioning the Input_file name here.

$ cat tst.awk
NR==1 { next }
!($4 in beg) { beg[$4] = $3 }
{ end[$4] = $3 }
END {
for (grp in beg) {
print grp, end[grp] - beg[grp]
}
}
$ awk -f tst.awk file
tango 300
alpha 0
beta 100
charlie 0

Related

Awk if else with conditions

I am trying to make a script (and a loop) to extract matching lines to print them into a new file. There are 2 conditions: 1st is that I need to print the value of the 2nd and 4th columns of the map file if the 2nd column of the map file matches with the 4th column of the test file. The 2nd condition is that when there is no match, I want to print the value in the 2nd column of the test file and a zero in the second column.
My test file is made this way:
8 8:190568 0 190568
8 8:194947 0 194947
8 8:197042 0 197042
8 8:212894 0 212894
My map file is made this way:
8 190568 0.431475 0.009489
8 194947 0.434984 0.009707
8 19056880 0.395066 112.871160
8 101908687 0.643861 112.872348
1st attempt:
for chr in {21..22};
do
awk 'NR==FNR{a[$2]; next} {if ($4 in a) print $2, $4 in a; else print $2, $4 == "0"}' map_chr$chr.txt test_chr$chr.bim > position.$chr;
done
Result:
8:190568 1
8:194947 1
8:197042 0
8:212894 0
My second script is:
for chr in {21..22}; do
awk 'NR == FNR { ++a[$4]; next }
$4 in a { print a[$2], $4; ++found[$2] }
END { for(k in a) if (!found[k]) print a[k], 0 }' \
"test_chr$chr.bim" "map_chr$chr.txt" >> "position.$chr"
done
And the result is:
1 0
1 0
1 0
1 0
The result I need is:
8:190568 0.009489
8:194947 0.009707
8:197042 0
8:212894 0
This awk should work for you:
awk 'FNR==NR {map[$2]=$4; next} {print $4, map[$4]+0}' mapfile testfile
190568 0.009489
194947 0.009707
197042 0
212894 0
This awk command processes mapfile first and stores $4 as the value in an associative array named map, keyed by $2.
Then, when it processes testfile in the 2nd block, it prints $4 from that file along with the value stored in map under the key $4. Adding 0 to the stored value ensures that we get 0 when $4 is not present in map.
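The requested result shows the ID from the 2nd column of the test file (8:190568) rather than the bare position. A minimal sketch of that variant prints $2 instead of $4; the temporary directory and the file names inside it are only there to make the example self-contained.

```shell
tmp=$(mktemp -d)
printf '%s\n' '8 190568 0.431475 0.009489' '8 194947 0.434984 0.009707' \
              '8 19056880 0.395066 112.871160' > "$tmp/mapfile"
printf '%s\n' '8 8:190568 0 190568' '8 8:194947 0 194947' \
              '8 8:197042 0 197042' > "$tmp/testfile"
result=$(awk 'FNR == NR { map[$2] = $4; next }   # first file: remember value by position
              { print $2, map[$4] + 0 }          # second file: ID, then stored value or 0
             ' "$tmp/mapfile" "$tmp/testfile")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that the + 0 turns the stored value into a number, so print formats it with OFMT (%.6g by default); a value such as 112.871160 would be shown as 112.871.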

How to print the row number and starting location of a pattern when multiple matches per row are present?

I want to use awk to match all the occurrences of a pattern within a large file. For each match, I would like to print the row number and the starting position of the pattern along the row (sort of xy coordinates). There are several occurrences of the pattern in each line.
I found this somewhat related question.
So far, I managed to do it only for the first (leftmost) occurrence in each line. As an example:
echo xyzABCdefghiABCdefghiABCdef | awk 'match($0, /ABC/) {print NR, RSTART } '
The resulting output is :
1 4
But what I would expect is something like this:
1 4
1 13
1 22
I tried using split instead of match. I managed to identify all the occurrences, but RSTART is lost and printed as "0".
echo xyzABCdefghiABCdefghiABCdef | awk ' { split($0,t, /ABC/,m) ; for (i=1; i in m; i++) print (NR, RSTART) } '
Output:
1 0
1 0
1 0
Any advice would be appreciated. I am not limited to using awk, but an awk solution would be preferred.
Also, in my case the pattern to match would be a regex (/A.C/).
Thank you
This may be what you're trying to do:
echo xyzABCdefghiABCdefghiABCdef |
awk '{ begpos=1
while (match(substr($0, begpos), /ABC/)) {
print NR, begpos + RSTART - 1
begpos += RLENGTH + RSTART - 1
}
}'
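Since the question mentions that the real pattern is a regex such as /A.C/, here is a sketch of the same loop with the pattern passed in as a variable. The variable name re and the modified sample string (with AXC in the middle) are assumptions for illustration, and the RLENGTH > 0 check is an added guard against patterns that can match the empty string.

```shell
result=$(echo xyzABCdefghiAXCdefghiABCdef |
awk -v re='A.C' '{
    begpos = 1                                   # where the unsearched part of the line starts
    while (match(substr($0, begpos), re) && RLENGTH > 0) {
        print NR, begpos + RSTART - 1            # translate RSTART back to a position in $0
        begpos += RSTART + RLENGTH - 1           # continue right after this match
    }
}')
printf '%s\n' "$result"
```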
Another option, using GNU awk, is split with a regex.
In gawk's split function, the 3rd argument is the field separator (fieldsep) and the 4th argument is the seps array, which receives the matched separators. The lengths of the pieces and of the separators can be used together to calculate the positions.
echo xyzABCdefghiABCdefghiABCdef |
awk ' {
n=split($0, a, /ABC/, seps); pos=1
for(i=1; i<n; i++){
pos += length(a[i])
print NR, pos
pos += length(seps[i])
}
}'
Output
1 4
1 13
1 22
With your shown samples, please try the following awk code.
awk '
{
prev=0
while(match($0,/ABC/)){
$0=substr($0,RSTART+RLENGTH)
print FNR,prev+RSTART
prev+=RSTART+2
}
}
' Input_file
Explanation: a detailed, annotated version of the code above.
awk ' ##Starting awk program from here.
{
prev=0 ##Setting prev variable to 0 here.
while(match($0,/ABC/)){ ##Loop for as long as ABC still matches in the remaining part of the current line.
$0=substr($0,RSTART+RLENGTH) ##Replace the current line with the rest of the line that follows the match.
print FNR,prev+RSTART ##Print the line number along with prev+RSTART, the position of the match in the original line.
prev+=RSTART+2 ##Add RSTART+2 to prev, i.e. the number of characters just cut off (2 is RLENGTH-1 for this 3-character match).
}
}
' Input_file ##Mentioning Input_file name here.
Determination of the coordinates of a string with awk:
echo "xyzABCdefghiABCdefghiABCdef" \
| awk -v s="ABC" 'BEGIN{ len=length(s) }
{
for(i=1; i<=length($0); i++){
if(substr($0, i, len)==s){
print NR, i
}
}
}'
Output:
1 4
1 13
1 22
As one line:
echo xyzABCdefghiABCdefghiABCdef | awk -v s="ABC" 'BEGIN{ len=length(s) } { for(i=1; i<=length($0); i++){ if(substr($0,i,len)==s) { print NR,i } } }'
Source: Find position of character with awk
One awk idea using split() and some slicing-n-dicing of length() results:
ptn='ABC'
echo xyzABCdefghiABCdefghiABCdef |
awk -v ptn="${ptn}" '
{ pos=-(length(ptn)-1)
n=split($0,arr,ptn)
for (i=1;i<n;i++) {
pos+=length(arr[i] ptn)
print NR,pos
}
}'
This generates:
1 4
1 13
1 22

Divide each row by max value in awk

I am trying to divide each row by the maximum value in that row (the file also has rows in which all columns are NA):
r1 r2 r3 r4
a 0 2.3 1.2 0.1
b 0.1 4.5 9.1 3.1
c 9.1 8.4 0 5
I want to get
r1 r2 r3 r4
a 0 1 0.52173913 0.043478261
b 0.010989011 0.494505495 1 0.340659341
c 1 0.923076923 0 0.549450549
I tried to calculate max of each row by executing
awk '{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print m}' file.txt > max.txt
then pasted it as the last column to the file.txt as
paste file.txt max.txt > file1.txt
I want the last column to divide all the other columns in that line, but I first need to format each line, and I am stuck at
awk '{for(i=1;i<NF;i++) printf "%s " $i,$NF}' file1.txt
I am trying to print each combination for that line and then print the next line's combinations on a new line. But I want to know if there is a better way to do this.
awk to the rescue!
$ awk 'NR>1 {m=$2; for(i=3;i<=NF;i++) if($i>m) m=$i;
for(i=2;i<=NF;i++) $i/=m}1' file
r1 r2 r3 r4
a 0 1 0.521739 0.0434783
b 0.010989 0.494505 1 0.340659
c 1 0.923077 0 0.549451
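The question also mentions rows in which every column is NA, and a row whose maximum is 0 would cause a division by zero. A hedged sketch of one way to guard against both follows; it assumes missing values are spelled exactly NA and that such rows should be passed through unchanged, neither of which is confirmed by the question.

```shell
result=$(printf '%s\n' \
  'r1 r2 r3 r4' \
  'a 0 2.3 1.2 0.1' \
  'd NA NA NA NA' \
  'z 0 0 0 0' |
awk '
NR == 1 { print; next }                  # keep the header line unchanged
{
    max = ""                             # empty string means: no numeric value seen yet
    for (i = 2; i <= NF; i++)
        if ($i != "NA" && (max == "" || $i + 0 > max)) max = $i + 0
    if (max != "" && max != 0)           # divide only when a non-zero maximum exists
        for (i = 2; i <= NF; i++)
            if ($i != "NA") $i = $i / max
    print
}')
printf '%s\n' "$result"
```

Rows that are all NA or all zero are printed as they came in; only rows with a usable maximum are rescaled.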
The following awk may help as well:
awk '
FNR==1{
print;
next
}
{
len=""
for(i=2;i<=NF;i++){
len=len>$i?len:$i};
printf("%s%s", $1, OFS)
}
{
for(i=2;i<=NF;i++){
printf("%s%s",$i>0?$i/len:0,i==NF?RS:FS)}
}
' Input_file
Explanation: an annotated version of the solution above.
awk '
FNR==1{ ##FNR==1 is true only for the first line of Input_file; in that case do the following:
print; ##Print the current (header) line.
next ##next is a built-in awk keyword that skips all further statements for this line.
}
{
len="" ##Reset len, which will hold the greatest value in the current line.
for(i=2;i<=NF;i++){ ##Loop from the 2nd field up to NF, which covers all the value fields on the line.
len=len>$i?len:$i}; ##Set len to $i if len is still empty or smaller than $i; otherwise len keeps its value.
printf("%s%s", $1, OFS) ##Print the 1st column followed by the output field separator (a space).
}
{
for(i=2;i<=NF;i++){ ##Loop again from the 2nd field up to NF to cover all the value fields of the current line.
printf("%s%s",$i>0?$i/len:0,i==NF?RS:FS)} ##Print the current field divided by len (the maximum value of the current line), or 0 if the field is not greater than 0; after the last field (i equals NF) print a newline, otherwise a space.
}
' Input_file ##mentioning the Input_file name here.

not getting array value in awk

I want to insert array values, together with all the other contents of testfile.ps, into result.ps, but the array values are not getting printed. Please help.
My requirement is that every time the condition is met, the next array element is printed along with the other contents of testfile.ps into result.ps.
In my project arr[0] and arr[1] are actually big strings, but I have shortened them here for simplicity.
#!/bin/bash
a[0]=""lineto""\n""stroke""
a[1]=""476.00"" ""26.00""
awk '{ if($1 == "(Page" ){for (i=0; i<2; i++){print $arr[i]; print $0; }}
else print }' testfile.ps > result.ps
testfile.ps
(Page 1 of 2 )
move
(Page 1 of 3 )
"gsave""\n""2.00"" ""setlinewidth""\n"
result.ps should be
(Page 1 of 2 )
lineto
stroke
move
(Page 1 of 3 )
476.00 26.00
gsave
2.00
setlinewidth
That is, the second time the condition is met, the array index should be incremented to 1 and a[1] should be printed.
I also tried this approach, with only a single array element, but I am not getting any output:
awk -v "a0=$a[0]" 'BEGIN {a[0]=""lineto""stroke""; if($1 == "move" ){for (i in a){ print a0;print $0; }} else print }' testfile.txt
Edited:
I have resolved the issue to some extent but am stuck at one place: how can I compare two strings like "a=476.00 1.00 lineto\nstroke\ngrestore\n" and "b=26.00 moveto\n368.00 1.00 lineto\n" in an awk command? I am trying:
awk -v "a=476.00 1.00 lineto\nstroke\ngrestore\n" -v "b=26.00 moveto\n368.00 1.00 lineto\n" -v "i=$a" '{
if ($1 == "(Page" && ($2%2==0 || $2==1) && $3 == "of"){
print i;
if [ i == a ];then
i=b; print $0;
fi
else if [ i == b ];then
i=c; print $0;
fi
else print $0;
}'testfile.txt
Your awk program uses a variable arr that is never initialized.
In your case, you want to pass a variable from the shell to awk. From the awk man page:
-v var=val
--assign var=val
Assign the value val to the variable var, before execution of the program begins. Such
variable values are available to the BEGIN rule of an AWK program.
Hence, you need something like
awk -v "a0=${a[0]}" -v "a1=${a[1]}" .....
and in a BEGIN block, you can set up your array arr from the variables a0 and a1 in any way you want.
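As a rough sketch of that BEGIN-block setup: the two strings are held in plain shell variables here (the names s0 and s1 are hypothetical) to keep the example portable; with a bash array they would be passed as "${a[0]}" and "${a[1]}". The \n inside s0 is left as a backslash escape on purpose, because awk expands escape sequences in -v assignments.

```shell
s0='lineto\nstroke'          # awk turns \n into a newline when it processes -v
s1='476.00 26.00'
result=$(printf '%s\n' '(Page 1 of 2 )' 'move' '(Page 1 of 3 )' |
awk -v "a0=$s0" -v "a1=$s1" '
BEGIN { arr[0] = a0; arr[1] = a1; n = 0 }      # build the awk array from the passed-in variables
{ print }                                      # copy every input line to the output
$1 == "(Page" && n < 2 { print arr[n++] }      # after each (Page line, emit the next array element
')
printf '%s\n' "$result"
```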
Gather the data to a single var using a separator:
$ awk -v s="lineto\nstroke;476.00 26.00" ' # ; as separator
BEGIN{ n=split(s,a,";") } # split s var to a array
1 # output record
/\(Page/ && i<n { print a[++i] } # if (Page and still data in a
' file
(Page 1 of 2 )
lineto
stroke
move
(Page 1 of 3 )
476.00 26.00
"gsave""\n""2.00"" ""setlinewidth""\n"

Bash - only printing certain parts of a matrix using awk

I want to read a matrix of numbers
1 3 4 5
2 4 9 0
And only want my awk statement to print out the first and last, so 1 and 0. I have this so far, but nothing will print. What is wrong with my logic?
awk 'BEGIN {for(i=1;i<NF;i++)
if(i==1)printf("%d ", $i);
else if(i==NF && i==NR)printf("%d ", $i);}'
$ awk '{ if (NR==1) { print $1}} END{print $NF}' matrix
1
0
The above awk program has two parts. The first is:
{ if (NR==1) { print $1}}
This prints the first field (column) of the first record (line) of the file.
The second part is:
END{print $NF}
This part runs only at the end, after the last record (line) has been read. It prints the last field (column) of that line.
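Current awk implementations keep the fields of the last record available inside END, but some older ones do not. A sketch that avoids relying on that saves both values while the lines are being read; the variable names first and last are illustrative.

```shell
result=$(printf '%s\n' '1 3 4 5' '2 4 9 0' |
awk 'NR == 1 { first = $1 }      # first field of the first line
     { last = $NF }              # overwritten on every line, so it ends as the last field of the last line
     END { print first, last }')
printf '%s\n' "$result"
```

This prints both values on one line, separated by a space.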
Borrowing from unix.com, you can use the following:
awk 'NR == 1 {print $1} END { print $NF }'
This will print the first column of the first line (NR == 1) and, once input has finished (END), the final column of the last line.
If I understand the output format you're looking for, this code should capture those values and print them:
awk 'NR == 1 {F = $1} END { L = $NF ; printf("%d %d", F, L) }'
awk is line based; NR is the current record (line) number.
awk is essentially match => action:
echo "1 3 4 5
2 4 9 0" |
awk 'NR == 1 {print $1;}
END {print $NF;}'
for the first record print the first field;
for the last record print the last field.
Since there are already so many awk solutions, here is another way with sed.
sed -r ':a;$!{N;ba};s/\s+.*\s+/ /' file
Yet another sed variant:
$ echo $'1 3 4 5\n2 4 9 0' | sed -n '1s/ .*//p;$s/.* //p'
awk 'NR==1{print $1;} END{print $NF;}'
