pandas: time difference in groupby

How can I calculate, for each id, the time difference between the current row and the next, for
the dataset below:
time id
2012-03-16 23:50:00 1
2012-03-16 23:56:00 1
2012-03-17 00:08:00 1
2012-03-17 00:10:00 2
2012-03-17 00:12:00 2
2012-03-17 00:20:00 2
2012-03-20 00:43:00 3
and get the following result:
time id tdiff
2012-03-16 23:50:00 1 6
2012-03-16 23:56:00 1 12
2012-03-17 00:08:00 1 NA
2012-03-17 00:10:00 2 2
2012-03-17 00:12:00 2 8
2012-03-17 00:20:00 2 NA
2012-03-20 00:43:00 3 NA

I see that you need the result in minutes, per id. Here is how to do it, using diff() within a groupby:
# first convert to datetime with the right format
data['time'] = pd.to_datetime(data['time'], format='%Y-%m-%d %H:%M:%S')
# per-id difference from the previous row, converted to minutes (NaN on each id's first row)
data['tdiff'] = data.groupby('id')['time'].diff().dt.total_seconds() / 60
print(data)
output
time id tdiff
0 2012-03-16 23:50:00 1 NaN
1 2012-03-16 23:56:00 1 6.0
2 2012-03-17 00:08:00 1 12.0
3 2012-03-17 00:10:00 2 NaN
4 2012-03-17 00:12:00 2 2.0
5 2012-03-17 00:20:00 2 8.0
6 2012-03-20 00:43:00 3 NaN
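This gives the gap from the previous row, with NaN on the first row of each id. If you instead want the gap to the next row, as in the output requested above, a small variant using shift(-1) inside the same groupby does it:
# difference to the *next* row within each id, in minutes (NaN on each id's last row)
data['tdiff'] = (data.groupby('id')['time'].shift(-1) - data['time']).dt.total_seconds() / 60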

How to fix query-tool: Query failed ERROR: syntax error at or near while inserting data

I have this command to populate the created tables with geographic data:
COPY public.adonis_schema (id, name, batch, migration_time) FROM stdin;
1 database/migrations/1607548129188_users 1 2021-04-02 14:14:27.470863+00
2 database/migrations/1607548416832_conversations 1 2021-04-02 14:14:28.070888+00
3 database/migrations/1607548444586_participations 1 2021-04-02 14:14:28.480253+00
4 database/migrations/1607548494088_messages 1 2021-04-02 14:14:29.050234+00
5 database/migrations/1609020554140_vcard_shares 1 2021-04-02 14:14:29.520909+00
6 database/migrations/1609024583459_add_conversation_names 1 2021-04-02 14:14:29.905367+00
7 database/migrations/1609467289494_meetings 1 2021-04-02 14:14:30.300248+00
8 database/migrations/1609467351706_notes 1 2021-04-02 14:14:30.960852+00
9 database/migrations/1609976010374_meetings_lengths 1 2021-04-02 14:14:31.640233+00
10 database/migrations/1610498049695_conversations_event_ids 1 2021-04-02 14:14:32.020247+00
11 database/migrations/1611099138751_cache_users 1 2021-04-02 14:14:32.405294+00
12 database/migrations/1616628109445_conversations_ownerships 1 2021-04-02 14:14:32.800258+00
13 database/migrations/1617362496376_conversations_types 1 2021-04-02 14:14:33.185207+00
14 database/migrations/1617805023298_conversations_timestamps 2 2021-04-07 14:18:42.957427+00
47 database/migrations/1622085675952_user_is_busies 3 2021-05-27 03:22:31.964783+00
\.
As indicated, I put the entire code in a SQL file and executed it with psql, but I get an error:
ERROR: syntax error at or near "1"
LINE 2: 1 database/migrations/1607548129188_users 1 2021-04-02 14:14...
^
SQL state: 42601
Character: 73
Do you have any idea, please?
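The error points at the first data row (LINE 2), which suggests the rows after COPY ... FROM stdin; were parsed as ordinary SQL: that inline-data form is a psql/pg_dump convention, and graphical query tools generally cannot feed stdin to COPY. As a purely illustrative sketch (the connection parameters and the tab-separated file data.tsv are assumptions, not part of the question), the same rows could be streamed from a client program, for example with psycopg2:
import psycopg2

# hypothetical connection parameters; adjust to your setup
conn = psycopg2.connect(dbname="mydb", user="postgres")
cur = conn.cursor()

# COPY FROM STDIN expects one row per line with tab-separated columns
with open("data.tsv") as f:
    cur.copy_expert(
        "COPY public.adonis_schema (id, name, batch, migration_time) FROM STDIN",
        f,
    )
conn.commit()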

How to interpolate values in panel data using a loop

I have a panel dataset. My variable identifiers are cc for country codes and Year for years:
clear
input long cc float(sch Year)
2 0 1960
2 0 1961
2 0 1962
2 0 1963
2 0 1964
2 0 1965
2 0 1966
2 0 1967
2 0 1968
2 0 1969
2 0 1970
2 0 1971
2 0 1972
2 0 1973
2 0 1974
2 0 1975
2 0 1976
2 0 1977
2 .733902 1978
2 .7566 1979
2 .78 1980
2 .875 1981
2 .9225 1982
2 1.0174999 1983
2 1.0649999 1984
2 1.16 1985
2 1.2425 1986
2 1.28375 1987
2 1.36625 1988
2 1.4075 1989
2 1.49 1990
2 1.5825 1991
2 1.62875 1992
2 1.72125 1993
2 1.7675 1994
2 1.86 1995
2 1.935 1996
2 1.9725 1997
2 2.0475001 1998
2 2.085 1999
2 2.16 2000
2 2.27 2001
2 2.325 2002
2 2.435 2003
2 2.49 2004
2 2.6 2005
2 2.7575 2006
2 2.83625 2007
2 2.99375 2008
2 3.0725 2009
2 3.23 2010
2 3.15125 2011
2 3.190625 2012
2 3.1709375 2013
2 3.1807814 2014
2 3.1758595 2015
2 3.1783204 2016
2 3.17709 2017
2 3.177705 2018
4 0 1960
4 0 1961
4 0 1962
4 0 1963
4 0 1964
4 0 1965
4 0 1966
4 0 1967
4 0 1968
4 0 1969
4 0 1970
4 0 1971
4 0 1972
4 0 1973
4 0 1974
4 0 1975
4 0 1976
4 0 1977
4 4.657455 1978
4 4.8015 1979
4 4.95 1980
4 5.4 1981
4 5.625 1982
4 6.075 1983
4 6.3 1984
4 6.75 1985
4 7.02 1986
4 7.155 1987
4 7.425 1988
4 7.56 1989
4 7.83 1990
4 7.8275 1991
4 7.82625 1992
4 7.82375 1993
4 7.8225 1994
4 7.82 1995
4 8.195 1996
4 8.3825 1997
4 8.7575 1998
4 8.945 1999
4 9.32 2000
4 9.412499 2001
4 9.45875 2002
4 9.55125 2003
4 9.5975 2004
4 9.69 2005
4 9.73 2006
4 9.75 2007
4 9.79 2008
4 9.81 2009
4 9.85 2010
4 9.83 2011
4 9.84 2012
4 9.835 2013
4 9.8375 2014
4 9.83625 2015
4 9.836875 2016
4 9.836563 2017
4 9.83672 2018
end
I would like to interpolate the sch variable backwards over earlier years. Variable sch has observations over the years 1978-2018. Using the observation for 1978, I would like to interpolate the value for 1977:
sch_1977 = 0.97 * sch_1978
The code I have tried is the following:
forvalues y = 1977 1976 1975{
local i = `y' - 1958
bysort cc (Year): generate sch`y' = 0.97*sch[`i']
replace sch`y' = 0 if Year != `y'
replace sch = sch + sch`y'
}
Here i corresponds to the row where the year 1978 is placed for each cc. By using a forvalues loop, in every iteration I wanted to create a new variable (sch1977, sch1976, sch1975) with an interpolated observation in the corresponding year and zeros for all other observations. Next, I would like to add this new variable to sch. However, Stata complains that the code is invalid.
The following works for me (the original loop fails because forvalues expects a number range such as 1977(-1)1975, so a loop over an explicit list of values needs foreach):
foreach x in 1977 1976 1975 {
local i = (2018 - 1960) - (2018 - `x') + 2
bysort cc (Year): generate sch_`x' = 0.97 * sch[`i']
replace sch_`x' = 0 if Year != `x'
replace sch = sch + sch_`x'
}
Results:
bysort cc (Year): list if inrange(Year, 1970, 1980), sepby(cc)
-> cc = 2
+-------------------------------------------------------+
| cc sch Year sch_1977 sch_1976 sch_1975 |
|-------------------------------------------------------|
11. | 2 0 1970 0 0 0 |
12. | 2 0 1971 0 0 0 |
13. | 2 0 1972 0 0 0 |
14. | 2 0 1973 0 0 0 |
15. | 2 0 1974 0 0 0 |
16. | 2 .6698126 1975 0 0 .6698126 |
17. | 2 .6905284 1976 0 .6905284 0 |
18. | 2 .7118849 1977 .7118849 0 0 |
19. | 2 .733902 1978 0 0 0 |
20. | 2 .7566 1979 0 0 0 |
21. | 2 .78 1980 0 0 0 |
+-------------------------------------------------------+
-> cc = 4
+-------------------------------------------------------+
| cc sch Year sch_1977 sch_1976 sch_1975 |
|-------------------------------------------------------|
11. | 4 0 1970 0 0 0 |
12. | 4 0 1971 0 0 0 |
13. | 4 0 1972 0 0 0 |
14. | 4 0 1973 0 0 0 |
15. | 4 0 1974 0 0 0 |
16. | 4 4.250733 1975 0 0 4.250733 |
17. | 4 4.382199 1976 0 4.382199 0 |
18. | 4 4.517731 1977 4.517731 0 0 |
19. | 4 4.657455 1978 0 0 0 |
20. | 4 4.8015 1979 0 0 0 |
21. | 4 4.95 1980 0 0 0 |
+-------------------------------------------------------+
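For comparison only, the same backward chaining can be sketched in pandas (an illustrative toy frame truncated to the relevant years; it is not part of the Stata answer):
import pandas as pd

# toy panel in the same shape as the Stata data
df = pd.DataFrame({
    "cc":   [2, 2, 2, 2, 4, 4, 4, 4],
    "Year": [1975, 1976, 1977, 1978, 1975, 1976, 1977, 1978],
    "sch":  [0.0, 0.0, 0.0, 0.733902, 0.0, 0.0, 0.0, 4.657455],
})

# walk backwards: each target year gets 0.97 x the following year's value,
# so 1976 already sees the freshly filled 1977, and so on
for y in (1977, 1976, 1975):
    next_year = df.loc[df.Year == y + 1].set_index("cc")["sch"]
    mask = df.Year == y
    df.loc[mask, "sch"] = 0.97 * df.loc[mask, "cc"].map(next_year).to_numpy()

print(df)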

How to find a sequence of numbers

I have a data file formatted like this (the header lines start in the first column, while the data lines are indented by a space):
0.00 0.00 0.00
 1 10 1.0
 2 12 1.0
 3 15 1.0
 4 20 0.0
 5 23 0.0
0.20 0.15 0.6
 1 12 1.0
 2 15 1.0
 3 20 0.0
 4 18 0.0
 5 20 0.0
0.001 0.33 0.15
 1 8 1.0
 2 14 1.0
 3 17 0.0
 4 25 0.0
 5 15 0.0
I need to remove some data and reorder the lines like this:
1 10
1 12
1 8
2 12
2 15
2 14
3 15
3 20
3 17
4 20
4 18
4 25
5 23
5 20
5 15
My code does not show anything. The problem might be in the grep command. Could you please help me out?
touch extract_file.txt
for (( i=1; i<=band; i++))
do
sed -e '1, 7d' data_file | grep -w " '$(echo $i)' " | awk '{print $2}' > extract(echo $i).txt
paste -s extract_file.txt extract$(echo $i).txt > data
done
#rm eigen*.txt
The following pipeline, with comments:
cat <<EOF |
0.00 0.00 0.00
 1 10 1.0
 2 12 1.0
 3 15 1.0
 4 20 0.0
 5 23 0.0
0.20 0.15 0.6
 1 12 1.0
 2 15 1.0
 3 20 0.0
 4 18 0.0
 5 20 0.0
0.001 0.33 0.15
 1 8 1.0
 2 14 1.0
 3 17 0.0
 4 25 0.0
 5 15 0.0
EOF
# remove lines not starting with a space
grep -v '^[^ ]' |
# remove leading space
sed 's/^[[:space:]]*//' |
# remove third arg
sed 's/[[:space:]]*[^[:space:]]*$//' |
# stable sort on first number
sort -s -n -k1 |
# each time first number changes, print additional newline
awk '{ if(length(last) != 0 && last != $1) printf "\n"; print; last=$1}'
outputs:
1 10
1 12
1 8

2 12
2 15
2 14

3 15
3 20
3 17

4 20
4 18
4 25

5 23
5 20
5 15
Tested on repl.
perl one-liner:
$ perl -lane 'push #{$nums{$F[0]}}, "#F[0,1]" if /^ /;
END { for $n (sort { $a <=> $b } keys %nums) {
print for #{$nums{$n}};
print "" }}' input.txt
1 10
1 12
1 8

2 12
2 15
2 14

3 15
3 20
3 17

4 20
4 18
4 25

5 23
5 20
5 15
Basically, for each line starting with a space, use the first number as a key to a hash table that stores lists of the first two numbers, and print them out sorted by first number.
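The same idea fits in a short Python sketch, for comparison (the file name input.txt and the leading-space convention are carried over from above):
from collections import defaultdict

groups = defaultdict(list)
with open("input.txt") as f:
    for line in f:
        if line.startswith(" "):             # keep only the indented data lines
            index, value = line.split()[:2]  # drop the third column
            groups[int(index)].append(f"{index} {value}")

# print each group in index order, with a blank line between groups
for i, key in enumerate(sorted(groups)):
    if i:
        print()
    print("\n".join(groups[key]))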

How to subtract every nth from (n+3)th line in awk?

I have 4-column data files which have approximately 100 lines. I'd like to subtract every nth from the (n+3)th line and print the values in a new column ($5). The data in each column does not follow a regular pattern.
My sample file:
cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
Output should be:
1 2 3 20 0 #(20-20)
1 2 3 10 20 #(30-10)
1 2 3 5 35 #(40-5)
1 2 3 20 ? #(. - 20)
1 2 3 30 ? #(. - 30)
1 2 3 40 ? #(. - 40)
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
How can I do this in awk?
Thank you.
For this I think the easiest thing is to read through the file twice. The first time (the NR==FNR block) we save all the 4th-column values in an array indexed by line number. The next block is executed on the second pass and creates a 5th column with the desired calculation (checking first to make sure that we wouldn't go past the end of the file).
$ cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
$ awk 'NR==FNR{a[NR]=$4; last=NR; next} {$5 = (FNR+3 <= last ? a[FNR+3] - $4 : "")}1' input input
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20
1 2 3 30
1 2 3 40
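For comparison, the same look-ahead can be written as a minimal Python sketch (reading the same input file; rows are left with four columns when either value of the pair is not numeric):
with open("input") as f:
    rows = [line.split() for line in f]

col4 = [row[3] for row in rows]
for i, row in enumerate(rows):
    # append a 5th column only when both values three lines apart are numeric
    if i + 3 < len(rows) and col4[i].isdigit() and col4[i + 3].isdigit():
        row.append(str(int(col4[i + 3]) - int(col4[i])))
    print(" ".join(row))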
You can do this using tac + awk + tac:
tac input |
awk '{a[NR]=$4} NR>3 { $5 = (a[NR-3] ~ /^[0-9]+$/ ? a[NR-3] - $4 : "?") } 1' |
tac | column -t
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20 ?
1 2 3 30 ?
1 2 3 40 ?
1 2 3 .
1 2 3 .
1 2 3 .

Ranking aggregated values in panel data

I have an unbalanced panel data set with daily data similar to this, for n countries:
quarter date id trade trade_quarterly rank i
1 1 1 1 2 1 10
1 2 1 1 2 1 17
1 1 2 1 1 2 12
2 1 1 0 1 1 5
2 2 1 1 1 1 9
2 1 2 0 1 1 14
2 2 2 1 1 1 8
2 2 3 0 0 3 6
The first 4 columns are given.
As I am interested in the information in i, I would now like to keep only the 2 most traded ids for each quarter. I aggregated the quarterly trades with
bysort quarter id: egen trade_quarterly = sum(trade)
to get column 5.
To calculate column 6, I tried using
bysort quarter id : egen xx =rank(trade_quarterly), "option"
which does not appear to produce the correct solution.
(Note that since the values are aggregated within ids, ranking with rank(xx), field would produce a wrong rank for the following id.)
The last line of syntax
bysort quarter id : egen xx =rank(trade_quarterly), option
is not legal, as the literal text option is itself not an option. More generally, egen, rank() cannot help here with your present data structure.
But consider this, just a matter of a collapse to sums (totals) and then keeping only the largest two (the last two after sorting) within cross-combinations:
clear
input quarter date id trade
1 1 1 1
1 2 1 1
1 1 2 1
2 1 1 0
2 2 1 1
2 1 2 0
2 2 2 1
2 2 3 0
end
collapse (sum) trade, by(quarter id)
bysort quarter (trade) : keep if (_N - _n) < 2
list, sepby(id quarter)
+----------------------+
| quarter id trade |
|----------------------|
1. | 1 2 1 |
|----------------------|
2. | 1 1 2 |
|----------------------|
3. | 2 1 1 |
|----------------------|
4. | 2 2 1 |
+----------------------+
If you don't want to collapse, then an extra technique is to tag each id-quarter pair just once when ranking.
clear
input quarter date id trade
1 1 1 1
1 2 1 1
1 1 2 1
2 1 1 0
2 2 1 1
2 1 2 0
2 2 2 1
2 2 3 0
end
egen sum = total(trade), by(quarter id)
egen tag = tag(quarter id)
bysort tag quarter (trade) : gen tokeep = tag & (_N - _n) < 2
bysort quarter id (tokeep) : replace tokeep = tokeep[_N]
list if tokeep, sepby(quarter)
+--------------------------------------------------+
| quarter date id trade sum tag tokeep |
|--------------------------------------------------|
1. | 1 2 1 1 2 0 1 |
2. | 1 1 1 1 2 1 1 |
3. | 1 1 2 1 1 1 1 |
|--------------------------------------------------|
4. | 2 2 1 1 1 0 1 |
5. | 2 1 1 0 1 1 1 |
6. | 2 2 2 1 1 0 1 |
7. | 2 1 2 0 1 1 1 |
+--------------------------------------------------+
Note, in agreement with @William Lisowski's comment, that the largest two may not be uniquely identifiable in the presence of ties.
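For comparison only, the collapse-then-keep-two logic translates directly to pandas (an illustrative sketch with the same toy data; head(2) shares the tie caveat above, since it simply keeps the first two rows per group):
import pandas as pd

df = pd.DataFrame({
    "quarter": [1, 1, 1, 2, 2, 2, 2, 2],
    "date":    [1, 2, 1, 1, 2, 1, 2, 2],
    "id":      [1, 1, 2, 1, 1, 2, 2, 3],
    "trade":   [1, 1, 1, 0, 1, 0, 1, 0],
})

# sum trades per quarter-id pair, then keep the two largest per quarter
totals = df.groupby(["quarter", "id"], as_index=False)["trade"].sum()
top2 = (totals.sort_values(["quarter", "trade"], ascending=[True, False])
              .groupby("quarter")
              .head(2))
print(top2)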
