How to subtract every nth from (n+3)th line in awk? - bash

I have 4-column data files of approximately 100 lines each. I'd like to subtract every nth line's fourth column from the (n+3)th line's and print the result in a new column ($5). The column values don't follow any regular pattern.
My sample file:
cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
Output should be:
1 2 3 20 0 #(20-20)
1 2 3 10 20 #(30-10)
1 2 3 5 35 #(40-5)
1 2 3 20 ? #(. - 20)
1 2 3 30 ? #(. - 30)
1 2 3 40 ? #(. - 40)
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
How can I do this in awk?
Thank you

For this I think the easiest approach is to read through the file twice. The first time (the NR==FNR block) we save all the 4th-column values in an array indexed by line number. The next block executes on the second pass and creates a 5th column with the desired calculation (checking first to make sure we don't go past the end of the file).
$ cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
$ awk 'NR==FNR{a[NR]=$4; last=NR; next} {$5 = (FNR+3 <= last ? a[FNR+3] - $4 : "")}1' input input
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20
1 2 3 30
1 2 3 40
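If reading the input twice isn't an option (for example, the data arrives on a pipe), a single pass works by buffering each line until the line 3 positions ahead has arrived. A minimal sketch that, like the desired output above, prints ? where the later value is not numeric:
awk '{
    line[NR] = $0; val[NR] = $4            # buffer each line and its 4th field
    if (NR > 3) {                          # line NR-3 can now be completed
        n = NR - 3
        print line[n], ($4 ~ /^[0-9]+$/ ? $4 - val[n] : "?")
        delete line[n]; delete val[n]
    }
}
END { for (i = NR - 2; i <= NR; i++) if (i in line) print line[i] }' input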

You can do this using tac + awk + tac:
tac input |
awk '{a[NR]=$4} NR>3 { $5 = (a[NR-3] ~ /^[0-9]+$/ ? a[NR-3] - $4 : "?") } 1' |
tac | column -t
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20 ?
1 2 3 30 ?
1 2 3 40 ?
1 2 3 .
1 2 3 .
1 2 3 .
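Reversing the input first means that by the time awk reads a line, the value that sat 3 lines below it in the original order is already stored in a[], so a single pass suffices; the final tac restores the original order, and column -t merely aligns the columns.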

Related

How to sort data based on the value of a column for part (multiple lines) of a file?

My data in the file file1 looks like
3
0
2 0.5
1 0.8
3 0.2
3
1
2 0.1
3 0.8
1 0.4
3
2
1 0.8
2 0.4
3 0.3
Each block has the same number of rows (here it is 3+2 = 5). In each block, the first two lines are headers; the next 3 rows have two columns, where the first column is a label, one of the numbers from 1 to 3. I want to sort the rows in each block based on the value of the first column (excluding the first two rows). So the expected result is
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
I thought sort -k 1 -n file1 would be good for the whole file, but it gives me the wrong result:
0
1
2
3
3
3
2 0.1
3 0.2
3 0.3
1 0.4
2 0.4
2 0.5
1 0.8
1 0.8
3 0.8
This is not the expected result.
How to sort each block is still a problem for me. I think awk can handle this. Please give some suggestions.
Apply the DSU (Decorate/Sort/Undecorate) idiom using any awk+sort+cut, regardless of how many lines are in each block:
$ awk -v OFS='\t' '
NF<pNF || NR==1 { blockNr++ }
{ print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }
' file |
sort -n -k1,1 -k2,2 -k4,4 -k3,3 |
cut -f5-
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
To understand what that's doing, just look at the first 2 steps:
$ awk -v OFS='\t' 'NF<pNF || NR==1{ blockNr++ } { print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }' file
1 1 1 1 3
1 1 2 2 0
1 2 3 2 2 0.5
1 2 4 1 1 0.8
1 2 5 3 3 0.2
2 1 6 6 3
2 1 7 7 1
2 2 8 2 2 0.1
2 2 9 3 3 0.8
2 2 10 1 1 0.4
3 1 11 11 3
3 1 12 12 2
3 2 13 1 1 0.8
3 2 14 2 2 0.4
3 2 15 3 3 0.3
$ awk -v OFS='\t' 'NF<pNF || NR==1{ blockNr++ } { print blockNr, NF, NR, (NF>1 ? $1 : NR), $0; pNF=NF }' file |
sort -n -k1,1 -k2,2 -k4,4 -k3,3
1 1 1 1 3
1 1 2 2 0
1 2 4 1 1 0.8
1 2 3 2 2 0.5
1 2 5 3 3 0.2
2 1 6 6 3
2 1 7 7 1
2 2 10 1 1 0.4
2 2 8 2 2 0.1
2 2 9 3 3 0.8
3 1 11 11 3
3 1 12 12 2
3 2 13 1 1 0.8
3 2 14 2 2 0.4
3 2 15 3 3 0.3
Notice that the awk command is just prepending the key values that sort needs (block number, line number, $1, etc.). So awk Decorates the input, sort Sorts it, and cut Undecorates it by removing the decoration values that the awk script added.
You can use asort() and arrays in gawk:
awk 'BEGIN{i=1}
NF==1 && a[1]{                 # a header after buffered rows: flush the block
    n=asort(a)                 # sort the buffered data rows
    for(k=1; k<=n; k++) print a[k]
    delete a; i=1
}
NF==1{print}                   # print header lines as-is
NF==2{a[i]=$0; ++i}            # buffer data rows
END{n=asort(a); for(k=1; k<=n; k++) print a[k]}   # flush the last block
' file1
you get
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
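Note that asort() is a gawk extension, so this answer needs GNU awk; it will not run under POSIX awk or mawk.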
This is similar to Ed Morton's solution, but without variable assignments; it uses only built-in variables instead:
λ cat input.txt
3
0
2 0.5
1 0.8
3 0.2
3
1
2 0.1
3 0.8
1 0.4
3
2
1 0.8
2 0.4
3 0.3
awk '{ print int((NR-1)/5), ((NR-1)%5<2) ? 0 : 1, (NF>1 ? $1 : NR), NR, $0 }' input.txt |
sort -n -k1,1 -k2,2 -k3,3 -k4,4 | cut -d ' ' -f5-
3
0
1 0.8
2 0.5
3 0.2
3
1
1 0.4
2 0.1
3 0.8
3
2
1 0.8
2 0.4
3 0.3
How it works
awk '{ print int((NR-1)/5), ((NR-1)%5<2) ? 0 : 1, (NF>1 ? $1 : NR), NR, $0 }' input.txt
0 0 1 1 3
0 0 2 2 0
0 1 2 3 2 0.5
0 1 1 4 1 0.8
0 1 3 5 3 0.2
1 0 6 6 3
1 0 7 7 1
1 1 2 8 2 0.1
1 1 3 9 3 0.8
1 1 1 10 1 0.4
2 0 11 11 3
2 0 12 12 2
2 1 1 13 1 0.8
2 1 2 14 2 0.4
2 1 3 15 3 0.3
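Note the trade-off: this version derives the block number and the header/data flag purely from arithmetic on NR, so it only works when every block is exactly 5 lines (2 header lines plus 3 data rows), whereas the NF<pNF test in Ed Morton's answer detects block boundaries whatever the block length.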
A Ruby version:
ruby -e '$<.read.split(/\n/).map(&:split).
slice_when { |a, b| b.length == 1 && b.length < a.length }.
flat_map { |blk| head, data = blk.partition { |r| r.length == 1 }
           head + data.sort_by { |r| r[0].to_f } }.
each { |row| puts row.join(" ") }' file
Or, a DSU-style Ruby:
ruby -lane 'BEGIN{lines=[]; block=0; lnf=0}
block+=1 if $F.length>1 && lnf==1     # first data row after a header opens a new block
lnf=$F.length
lines << [block, ($F.length>1 ? $F[0].to_f : (1.0/0)), $.] + $F
END{lines.sort.each{|sl| puts sl[3..].join(" ")}}
' file
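The trick in the DSU version is that header rows are keyed with Infinity while the block counter only advances at the first data row, so each pair of headers sorts after the previous block's data and just before its own block's data rows, which in turn are ordered by their first column.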

Generation of a counter variable for episodes in panel data in stata [duplicate]

This question already has an answer here:
Calculating consecutive ones
(1 answer)
Closed 1 year ago.
I am trying to generate a counter variable that describes the duration of a temporal episode in panel data.
I am using long format data that looks something like this:
clear
input byte id int time byte var1 int aim1
1 1 0 .
1 2 0 .
1 3 1 1
1 4 1 2
1 5 0 .
1 6 0 .
1 7 0 .
2 1 0 .
2 2 1 1
2 3 1 2
2 4 1 3
2 5 0 .
2 6 1 1
2 7 1 2
end
I want to generate a variable like aim1 that starts with a value of 1 when var1==1, and counts up one unit with each subsequent observation per ID where var1 is still equal to 1. For each observation where var1!=1, aim1 should contain missing values.
I already tried using rangestat (count) to solve the problem; however, the created variable does not restart the count with each episode:
ssc install rangestat
gen var2=1 if var1==1
rangestat (count) aim2=var2, interval(time -7 0) by (id)
Here are two ways to do it: (1) from first principles (but see this paper for more) and (2) using tsspell from SSC.
clear
input byte id int time byte var1 int aim1
1 1 0 .
1 2 0 .
1 3 1 1
1 4 1 2
1 5 0 .
1 6 0 .
1 7 0 .
2 1 0 .
2 2 1 1
2 3 1 2
2 4 1 3
2 5 0 .
2 6 1 1
2 7 1 2
end
bysort id (time) : gen wanted = 1 if var1 == 1 & var1[_n-1] != 1
by id: replace wanted = wanted[_n-1] + 1 if var1 == 1 & missing(wanted)
tsset id time
ssc inst tsspell
tsspell, cond(var1 == 1)
list, sepby(id _spell)
+---------------------------------------------------------+
| id time var1 aim1 wanted _seq _spell _end |
|---------------------------------------------------------|
1. | 1 1 0 . . 0 0 0 |
2. | 1 2 0 . . 0 0 0 |
|---------------------------------------------------------|
3. | 1 3 1 1 1 1 1 0 |
4. | 1 4 1 2 2 2 1 1 |
|---------------------------------------------------------|
5. | 1 5 0 . . 0 0 0 |
6. | 1 6 0 . . 0 0 0 |
7. | 1 7 0 . . 0 0 0 |
|---------------------------------------------------------|
8. | 2 1 0 . . 0 0 0 |
|---------------------------------------------------------|
9. | 2 2 1 1 1 1 1 0 |
10. | 2 3 1 2 2 2 1 0 |
11. | 2 4 1 3 3 3 1 1 |
|---------------------------------------------------------|
12. | 2 5 0 . . 0 0 0 |
|---------------------------------------------------------|
13. | 2 6 1 1 1 1 2 0 |
14. | 2 7 1 2 2 2 2 1 |
+---------------------------------------------------------+
The approach of tsspell is very close to what you ask for, except that (a) its counter _seq is by default 0 when out of a spell, although replace _seq = . if _seq == 0 gets what you ask for, and (b) its auxiliary variables (by default _spell and _end) are useful in many problems. You must install tsspell before you can use it, with ssc install tsspell.

AWK: Add number to the column for specific line

I have a data file of:
1 2 3
1 5 7
2 5 9
11 21 110
6 17 -2
10 2 8
6 4 3
5 1 8
6 1 5
7 3 1
I want to add the number 1 to the third column, but only for lines 1, 3, 6, 8, 9 and 10, and add 2 to the second column for lines 6 through 9.
I know how to add 2 to the entire second column and 1 to the entire third column using awk:
awk '{print $1, $2+2, $3+1}' data > data2
But how can I modify this code to touch only specific lines of the second and third columns?
Thanks
Best,
awk to the rescue! You can check NR in the condition, but for 6 values it gets tedious; alternatively, you can check for a string match against an anchored NR.
$ awk 'BEGIN{lines=",1,3,6,8,9,10,"}
match(lines,","NR","){$3++}
NR>=6 && NR<=9{$2+=2}1' nums
1 2 4
1 5 7
2 5 10
11 21 110
6 17 -2
10 4 9
6 6 3
5 3 9
6 3 6
7 3 2
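A variation on the same idea, if building the comma-delimited string feels fragile: load the line numbers into an array once in BEGIN and test membership with the in operator. A sketch (the array names are just illustrative):
awk 'BEGIN { split("1 3 6 8 9 10", t)      # lines whose 3rd column gets +1
             for (i in t) bump3[t[i]] }    # turn the list into a lookup set
     NR in bump3    { $3++ }
     NR>=6 && NR<=9 { $2+=2 }
     1' nums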
$ cat tst.awk
BEGIN {
    for (i=6; i<=9; i++) {
        d[2,i] = 2                # lines 6-9: add 2 to column 2
    }
    split("1 3 6 8 9 10",t)
    for (i in t) {
        d[3,t[i]] = 1             # listed lines: add 1 to column 3
    }
}
{ $2 += d[2,NR]; $3 += d[3,NR]; print }
$ awk -f tst.awk file
1 2 4
1 5 7
2 5 10
11 21 110
6 17 -2
10 4 9
6 6 3
5 3 9
6 3 6
7 3 2
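The design difference from the first answer: all of the line/column arithmetic is precomputed into a delta table d[column, line] in the BEGIN block, so the main body is a single unconditional rule, and changing which lines get adjusted means editing only the BEGIN block.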

Ranking aggregated values in panel data

I have an unbalanced panel data set with daily data similar to this, for n countries:
quarter date id trade trade_quarterly rank i
1 1 1 1 2 1 10
1 2 1 1 2 1 17
1 1 2 1 1 2 12
2 1 1 0 1 1 5
2 2 1 1 1 1 9
2 1 2 0 1 1 14
2 2 2 1 1 1 8
2 2 3 0 0 3 6
The first 4 columns are given. Interested in the information in column i, I would now like to keep only the 2 most-traded ids for each quarter. I aggregated quarterly trades with
bysort quarter id: egen trade_quarterly =sum(trade)
to get column 5.
To calculate column 6, I tried using
bysort quarter id : egen xx =rank(trade_quarterly), "option"
which does not appear to produce the correct solution.
(Note that since the values are aggregated within ids, ranking with rank(xx), field would produce a wrong rank for the following id.)
The last line of syntax
bysort quarter id : egen xx =rank(trade_quarterly), option
is not legal, as the literal text option is itself not an option. More generally, egen, rank() cannot help here with your present data structure.
But consider this: it is just a matter of collapsing to sums (totals) and then keeping only the largest two (the last two after sorting) within each cross-combination:
clear
input quarter date id trade
1 1 1 1
1 2 1 1
1 1 2 1
2 1 1 0
2 2 1 1
2 1 2 0
2 2 2 1
2 2 3 0
end
collapse (sum) trade, by(quarter id)
bysort quarter (trade) : keep if (_N - _n) < 2
list, sepby(id quarter)
+----------------------+
| quarter id trade |
|----------------------|
1. | 1 2 1 |
|----------------------|
2. | 1 1 2 |
|----------------------|
3. | 2 1 1 |
|----------------------|
4. | 2 2 1 |
+----------------------+
If you don't want to collapse, then an extra technique is to tag each id-quarter pair just once when ranking.
clear
input quarter date id trade
1 1 1 1
1 2 1 1
1 1 2 1
2 1 1 0
2 2 1 1
2 1 2 0
2 2 2 1
2 2 3 0
end
egen sum = total(trade), by(quarter id)
egen tag = tag(quarter id)
bysort tag quarter (trade) : gen tokeep = tag & (_N - _n) < 2
bysort quarter id (tokeep) : replace tokeep = tokeep[_N]
list if tokeep, sepby(quarter)
+--------------------------------------------------+
| quarter date id trade sum tag tokeep |
|--------------------------------------------------|
1. | 1 2 1 1 2 0 1 |
2. | 1 1 1 1 2 1 1 |
3. | 1 1 2 1 1 1 1 |
|--------------------------------------------------|
4. | 2 2 1 1 1 0 1 |
5. | 2 1 1 0 1 1 1 |
6. | 2 2 2 1 1 0 1 |
7. | 2 1 2 0 1 1 1 |
+--------------------------------------------------+
Note, in agreement with @William Lisowski's comment, that the largest two may not be uniquely identifiable in the presence of ties.

Matrix addition over multiple files using e.g. awk

I have a bunch of files from simulation output, all with the same number of rows and fields.
What I need to do is combine them so that I get a single file with the numbers summed up, which basically amounts to adding several matrices.
Example:
File1.txt
1 1 1
1 1 1
1 1 1
File2.txt
2 2 2
2 2 2
2 2 2
File3.txt
3 3 3
3 3 3
3 3 3
required output
6 6 6
6 6 6
6 6 6
I'm going to integrate this into a larger shell script, so I would prefer a solution in awk, though other languages are welcome as well.
awk '{for(i=1;i<=NF;i++)a[FNR,i]+=$i}        # accumulate each cell across files
END{for(i=1;i<=FNR;i++)                      # FNR = row count of the last file
for(j=1;j<=NF;j++)printf "%s%s", a[i,j],(j==NF?"\n":FS)}' f1 f2 f3
There can be more than 3 input files.
test with your data:
kent$ head f[1-3]
==> f1 <==
1 1 1
1 1 1
1 1 1
==> f2 <==
2 2 2
2 2 2
2 2 2
==> f3 <==
3 3 3
3 3 3
3 3 3
kent$ awk '{for(i=1;i<=NF;i++)a[FNR,i]=$i+a[FNR,i]}END{for(i=1;i<=FNR;i++)for(j=1;j<=NF;j++)printf "%s%s", a[i,j],(j==NF?"\n":FS)}' f1 f2 f3
6 6 6
6 6 6
6 6 6
Quick hack:
paste f1 f2 f3 | awk '{for(i=1;i<=m;i++)printf "%d%s",$i+$(i+m)+$(i+2*m),i==m?ORS:OFS}' m=3
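Here m=3 is the number of columns per input file, the sum is hard-coded for exactly three files, and %d truncates non-integer totals. A sketch that handles however many files you paste, still assuming each file has m columns:
paste f1 f2 f3 | awk -v m=3 '{
    for (i = 1; i <= m; i++) {             # one output column at a time
        s = 0
        for (j = i; j <= NF; j += m)       # column i of each pasted file
            s += $j
        printf "%s%s", s, (i == m ? ORS : OFS)
    }
}'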
