How do I order an event study (panel data) dataframe?

I have a big panel database ordered by month and subject, with lots of time series variables. In this database, several events occur to some of the subjects, indicated by a dummy-like variable that holds the number of the event.
So I have:
Month       Subject_id  event  Variable_1  Variable_2
01-01-1970  A           0      8%          13%
02-01-1970  A           1      9%          5%
...         ...         ...    ...         ...
12-01-1984  B           0      -2%         1%
01-01-1985  B           2      10%         7%
02-01-1985  B           3      26%         3%
I want to construct another database where the months are ordered relative to the event, like t-12, t-11, t-10, ..., t, t+1, t+2, ...
Month  Event  Subject_id  Variable_1  Variable_2
t-1    1      A           8%          13%
t      1      A           9%          5%
...    ...    ...         ...         ...
t-1    2      B           -2%         1%
t      2      B           10%         7%
...    ...    ...         ...         ...
t-1    3      B           10%         7%
t      3      B           26%         3%
Note that January 1985 is, at the same time, t for event 2 of subject B and t-1 for event 3 of the same subject. For this reason, I haven't been able to merge by subject and a t±x column. Some subjects have more than one overlapping event.
How can I transform my data into this new dataframe? (I don't care about losing the subjects that have no events.)
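One way around the overlap is to build the new frame event by event instead of subject by subject: each event occurrence gets its own copy of the surrounding months, so the same month can legitimately appear once per event. Below is a minimal pandas sketch of that idea; the column names match the tables above, while build_event_windows and the 12-month window are hypothetical choices, and Month is assumed to be a real datetime (or at least already in chronological order):

import pandas as pd

def build_event_windows(df, window=12):
    # event > 0 marks the event month and carries the event's number
    df = df.sort_values(['Subject_id', 'Month'])
    pieces = []
    for _, ev in df[df['event'] > 0].iterrows():
        subj = df[df['Subject_id'] == ev['Subject_id']].reset_index(drop=True)
        pos = subj.index[subj['Month'] == ev['Month']][0]  # row of the event month
        chunk = subj.iloc[max(0, pos - window):pos + window + 1].copy()
        chunk['Event'] = ev['event']
        chunk['t'] = chunk.index - pos  # ..., -1, 0, +1, ... relative to the event
        pieces.append(chunk)
    return pd.concat(pieces, ignore_index=True)

Because every event produces its own chunk, January 1985 would come out twice: once with Event = 2, t = 0 and once with Event = 3, t = -1. The integer t column can then be rendered as t-12 ... t+12 labels if you prefer.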

Related

I have a Matrix that is calculating % and it works properly for 1 row but not multiples

It's calculating the individual departments in the row by item number to equal 100%.
When using multiple rows it calculates all the rows together for a total of 100%.
This is not what I want.
I want all rows to act like the first picture, with one row calculating across the row, like this:

        dept 1  dept 2  dept 3  total
item 1  71%     14%     14%     100%
item 2  50%     25%     25%     100%
I have figured this out. This is how I needed to write my SQL:
SUM(B.RDCQTY) OVER (PARTITION BY RDICDE) AS SMDSTRDCQTY
RDCQTY / SUM(B.RDCQTY) OVER (PARTITION BY RDICDE) AS PER
and in the last CTE:
SUM(PER) OVER (PARTITION BY RDICDE) AS TTLPER
Then in SSRS I set the percentage column to Sum(PER) and the total % column to =(Fields!TTLPER.Value). Now the report is calculating properly per row.

Profiling shell commands in Emacs

Is there a way to profile the amount of time blocking on shell commands in emacs? Consider the following program:
(profiler-start 'cpu)
(shell-command "sleep 3")
(profiler-report)
(profiler-stop)
The profiler report will look something like this:
- command-execute 371 95%
- call-interactively 371 95%
- funcall-interactively 329 84%
- execute-extended-command 175 44%
- execute-extended-command--shorter 157 40%
- completion-try-completion 149 38%
- completion--nth-completion 149 38%
- completion--some 143 36%
- #<compiled 0x438307f1> 143 36%
- completion-pcm-try-completion 102 26%
- completion-pcm--find-all-completions 98 25%
completion-pcm--all-completions 98 25%
+ completion-pcm--merge-try 4 1%
completion-basic-try-completion 41 10%
+ sit-for 16 4%
- eval-expression 154 39%
- eval 154 39%
- profiler-start 154 39%
- debug 154 39%
- recursive-edit 141 36%
- command-execute 114 29%
- call-interactively 114 29%
- byte-code 107 27%
+ read--expression 64 16%
+ read-extended-command 43 11%
+ funcall-interactively 7 1%
+ byte-code 42 10%
+ ... 19 4%
As you can see, the time spent is more or less evenly distributed. I'm interested in seeing output that tells me I'm spending the significant part of the program blocking on the shell command sleep 3. Is this possible somehow? I am aware that sleep 3 is not heavy on my CPU, but I'm trying to figure out which shell commands called from magit are taking such a long time, so I'll also be interested in stuff that's IO-bound.
Note that profiler.el is a sampling profiler. You might want to try an instrumenting profiler such as elp.el if you are interested in the wall time.
In your case you may want to instrument magit by using M-x elp-instrument-package RET magit RET. After running your magit commands you can then take a look at the results using M-x elp-results RET.
For magit you would probably find that the function magit-process-file is taking up a lot of time. To further investigate the specific function calls, you could then instrument that (or any other) function by adding an advice that logs the runtime, together with the function's arguments, to the *Messages* buffer for each individual call, as follows.
(defun log-function-time (f &rest args)
  (let ((time-start (current-time)))
    (prog1
        (apply f args)
      (message "%s seconds used in (magit-process-file %s)"
               (time-to-seconds (time-subtract (current-time) time-start))
               args))))
(advice-add 'magit-process-file :around 'log-function-time)

How do I rank this data by percentage and total possible?

Given this set in Excel:
Group Enrolled Eligible Percent
A 0 76 0%
B 10 92 11%
C 0 38 0%
D 2 50 4%
E 0 111 0%
F 4 86 5%
G 3 97 3%
H 4 178 2%
I 2 77 3%
J 0 64 0%
K 0 37 0%
L 11 54 20%
Is there a way to sort (for charting) to achieve the following order?
Group Enrolled Eligible Percent
L 11 54 20%
B 10 92 11%
F 4 86 5%
D 2 50 4%
G 3 97 3%
I 2 77 3%
H 4 178 2%
K 0 37 0%
C 0 38 0%
J 0 64 0%
A 0 76 0%
E 0 111 0%
My goal is to rank/visualize using these criteria:
Percent desc (when Enrolled > 0)
Eligible asc (when Enrolled = 0)
After writing this question, the answer looks obvious: sort by Percent descending, then by Eligible ascending for the rows where Percent (equivalently, Enrolled) is 0. But I feel like I'm missing an obvious method/term to achieve the results I'm looking for.
Thanks.
With Google Sheets, QUERY is the easy way. Goal 1:
=QUERY(A1:D13,"Select A,B,C,D Where B>0 Order By D desc",1)
Goal 2:
=QUERY(A1:D13,"Select A,B,C,D Where B=0 Order By C",1)
The term you're missing is SORT
Here's the formula you are looking for:
=SORT(A1:D13,4,0,3,1)
Note:
Numbers should be formatted as Numbers.
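For what it's worth, the same two-key ordering is a one-line sort in pandas as well; here is a minimal sketch, assuming the data above lives in a hypothetical frame df. Since every Enrolled = 0 row has Percent = 0, sorting Percent descending with Eligible ascending as the tiebreaker reproduces the desired order exactly:

import pandas as pd

df = pd.DataFrame({
    'Group':    list('ABCDEFGHIJKL'),
    'Enrolled': [0, 10, 0, 2, 0, 4, 3, 4, 2, 0, 0, 11],
    'Eligible': [76, 92, 38, 50, 111, 86, 97, 178, 77, 64, 37, 54],
})
df['Percent'] = df['Enrolled'] / df['Eligible']
# Percent descending; ties (all the 0% rows) fall back to Eligible ascending
ranked = df.sort_values(['Percent', 'Eligible'], ascending=[False, True])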

Prize distribution algorithm

I have written an algorithm for prize distribution for my tournaments. I just want to know if anyone can see any bug or edge case that I haven't figured out or even write something better and more efficient.
So, assuming that at the end of a tournament players get final scores, they will be sorted and ranked based on their score, for example:
Rank Score %total prize
1 50 40%
2 50 25%
3 40 15%
4 20 10%
5 20 5%
6 16 3%
7 10 2%
I want it to work so that in case of a tie between rank A and rank B, the prizes of these ranks are summed and divided by 2, or in case of a tie between ranks A, B and C, the sum of these ranks' prizes is divided by 3, etc. If there is no tie between ranks, they get their predefined prize. Here is the pseudocode for what I have written so far:
rank = 0
while (rank < Max # prizes)
{
    prize = %prizeForRank(rank)
    offset = 1
    while (scoreList[rank+offset] != null && scoreList[rank] == scoreList[rank+offset])
    {
        prize += %prizeForRank(rank+offset)
        offset++
    }
    prize = prize / offset
    for (int k = 0; k < offset; k++)
    {
        prizeOfPlayer[k+rank] = prize
    }
    rank += offset
}
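One edge case worth noting: if a tie group straddles the last paid rank, the inner loop calls %prizeForRank past the end of the prize table. A minimal Python sketch of the same logic that pools only the prizes that actually exist and still splits them across the whole tie group (scores is assumed sorted descending; prize_pct holds the share per paid rank):

def distribute_prizes(scores, prize_pct):
    # scores: per-player final scores, sorted descending
    # prize_pct: prize share per paid rank, e.g. [0.40, 0.25, 0.15, ...]
    payouts = [0.0] * len(scores)
    rank = 0
    while rank < min(len(prize_pct), len(scores)):
        # find the run of players tied with the player at `rank`
        end = rank + 1
        while end < len(scores) and scores[end] == scores[rank]:
            end += 1
        # pool only the prizes that exist for these ranks, split across the tie
        pool = sum(prize_pct[rank:min(end, len(prize_pct))])
        share = pool / (end - rank)
        for k in range(rank, end):
            payouts[k] = share
        rank = end
    return payouts

With the table above, distribute_prizes([50, 50, 40, 20, 20, 16, 10], [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02]) pays ranks 1 and 2 0.325 each and ranks 4 and 5 0.075 each, leaving the untied ranks at their predefined shares.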

python average of random sample many times

I am working with pandas and I wish to sample 2 stocks from each trade date and store, as part of the dataset, the average Stock_Change and the average Vol_Change for the day in question, based on the sample taken (in this case, 2 stocks per day). The actual data is much larger, spanning years and hundreds of names. My real sample will be 100 names; I just use 2 for the purposes of this question.
Sample data set:
In [3]:
df
Out[3]:
Date Symbol Stock_Change Vol_Change
0 1/1/2008 A -0.05 0.07
1 1/1/2008 B -0.06 0.17
2 1/1/2008 C -0.05 0.07
3 1/1/2008 D 0.05 0.13
4 1/1/2008 E -0.03 -0.10
5 1/2/2008 A 0.03 -0.17
6 1/2/2008 B 0.08 0.34
7 1/2/2008 C 0.03 0.17
8 1/2/2008 D 0.06 0.24
9 1/2/2008 E 0.02 0.16
10 1/3/2008 A 0.02 0.05
11 1/3/2008 B 0.01 0.39
12 1/3/2008 C 0.05 -0.17
13 1/3/2008 D -0.01 0.37
14 1/3/2008 E -0.06 0.23
15 1/4/2008 A 0.03 0.31
16 1/4/2008 B -0.07 0.16
17 1/4/2008 C -0.06 0.29
18 1/4/2008 D 0.00 0.09
19 1/4/2008 E 0.00 -0.02
20 1/5/2008 A 0.04 -0.04
21 1/5/2008 B -0.06 0.16
22 1/5/2008 C -0.08 0.07
23 1/5/2008 D 0.09 0.16
24 1/5/2008 E 0.06 0.18
25 1/6/2008 A 0.00 0.22
26 1/6/2008 B 0.08 -0.13
27 1/6/2008 C 0.07 0.18
28 1/6/2008 D 0.03 0.32
29 1/6/2008 E 0.01 0.29
30 1/7/2008 A -0.08 -0.10
31 1/7/2008 B -0.09 0.23
32 1/7/2008 C -0.09 0.26
33 1/7/2008 D 0.02 -0.01
34 1/7/2008 E -0.05 0.11
35 1/8/2008 A -0.02 0.36
36 1/8/2008 B 0.03 0.17
37 1/8/2008 C 0.00 -0.05
38 1/8/2008 D 0.08 -0.13
39 1/8/2008 E 0.07 0.18
One other point: the samples cannot contain the same security more than once (sampling without replacement). My guess is that this is a good R question, but I don't know the first thing about R.
I have no idea how to even start on this.
Thanks in advance for any help.
Edit by OP
I tried this, but I can't seem to get it to work on the grouped dataframe (grouped by Symbol and Date):
In [35]:
import numpy as np
import pandas as pd
from random import sample

# create a random index
rindex = np.array(sample(range(len(df)), 10))

# get 10 random rows from df (.ix has been removed from pandas; use .iloc)
dfr = df.iloc[rindex]
In [36]:
dfr
Out[36]:
Date Symbol Stock_Change Vol_Change
6 1/2/2008 B 8% 34%
1 1/2/2008 B -6% 17%
37 1/3/2008 C 0% -5%
25 1/1/2008 A 0% 22%
3 1/4/2008 D 5% 13%
12 1/3/2008 C 5% -17%
10 1/1/2008 A 2% 5%
2 1/3/2008 C -5% 7%
26 1/2/2008 B 8% -13%
17 1/3/2008 C -6% 29%
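As an aside, recent pandas versions can take this draw directly, with no hand-built index (sampling is without replacement by default):

dfr = df.sample(10)  # 10 random rows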
OP Edit #2
As I read the question, I realize that I may not have been very clear. What I want to do is sample the data many times (call it X) for each day, ending up with X times the number of dates as my new dataset. This may not look like it makes sense with the data I am showing, but my actual data has 500 names and 2 years (2 x 365 = 730) of dates, and I wish to sample 50 random names for each day for a total of 50 x 730 = 36,500 data points.
My first attempt gave this:
In [10]:
# do sampling: get a random subsample with size 3 out of 5 symbols for each date
# ==============================
def get_subsample(group, sample_size=3):
    symbols = group.Symbol.values
    symbols_selected = np.random.choice(symbols, size=sample_size, replace=False)
    return group.loc[group.Symbol.isin(symbols_selected)]

df.groupby(['Date']).apply(get_subsample).reset_index(drop=True)
Out[10]:
Date Symbol Stock_Change Vol_Change
0 1/1/2008 A -5% 7%
1 1/1/2008 A 3% -17%
2 1/1/2008 A 2% 5%
3 1/1/2008 A 3% 31%
4 1/1/2008 A 4% -4%
5 1/1/2008 A 0% 22%
6 1/1/2008 A -8% -10%
7 1/1/2008 A -2% 36%
8 1/2/2008 B -6% 17%
9 1/2/2008 B 8% 34%
10 1/2/2008 B 1% 39%
11 1/2/2008 B -7% 16%
12 1/2/2008 B -6% 16%
13 1/2/2008 B 8% -13%
14 1/2/2008 B -9% 23%
15 1/2/2008 B 3% 17%
16 1/3/2008 C -5% 7%
17 1/3/2008 C 3% 17%
18 1/3/2008 C 5% -17%
19 1/3/2008 C -6% 29%
20 1/3/2008 C -8% 7%
21 1/3/2008 C 7% 18%
22 1/3/2008 C -9% 26%
23 1/3/2008 C 0% -5%
24 1/4/2008 D 5% 13%
25 1/4/2008 D 6% 24%
26 1/4/2008 D -1% 37%
27 1/4/2008 D 0% 9%
28 1/4/2008 D 9% 16%
29 1/4/2008 D 3% 32%
30 1/4/2008 D 2% -1%
31 1/4/2008 D 8% -13%
32 1/5/2008 E -3% -10%
33 1/5/2008 E 2% 16%
34 1/5/2008 E -6% 23%
35 1/5/2008 E 0% -2%
36 1/5/2008 E 6% 18%
37 1/5/2008 E 1% 29%
38 1/5/2008 E -5% 11%
39 1/5/2008 E 7% 18%
import pandas as pd
import numpy as np

# replicate your data structure
# ==============================
np.random.seed(0)
dates = pd.date_range('2008-01-01', periods=100, freq='B')
symbols = 'A B C D E'.split()
multi_index = pd.MultiIndex.from_product([dates, symbols], names=['Date', 'Symbol'])
stock_change = np.random.randn(500)
vol_change = np.random.randn(500)
df = pd.DataFrame({'Stock_Change': stock_change, 'Vol_Change': vol_change}, index=multi_index).reset_index()

# do sampling: get a random subsample with size 3 out of 5 symbols for each date,
# repeated X times, then average across the X sample means
# ==============================
def get_subsample(group, X=100, sample_size=3):
    frame = pd.DataFrame(columns=['sample_{}'.format(x) for x in range(1, X+1)])
    for col in frame.columns.values:
        frame[col] = group.loc[group.Symbol.isin(np.random.choice(symbols, size=sample_size, replace=False)), ['Stock_Change', 'Vol_Change']].mean()
    return frame.mean(axis=1)

result = df.groupby(['Date']).apply(get_subsample)
Out[169]:
Stock_Change Vol_Change
Date
2008-01-01 1.3937 0.2005
2008-01-02 0.0406 -0.7280
2008-01-03 0.6073 -0.2699
2008-01-04 0.2310 0.7415
2008-01-07 0.0718 -0.7269
2008-01-08 0.3808 -0.0584
2008-01-09 -0.5595 -0.2968
2008-01-10 0.3919 -0.2741
2008-01-11 -0.4856 0.0386
2008-01-14 -0.4700 -0.4090
... ... ...
2008-05-06 0.1510 0.1628
2008-05-07 -0.1452 0.2824
2008-05-08 -0.4626 0.2173
2008-05-09 -0.2984 0.6324
2008-05-12 -0.3817 0.7698
2008-05-13 0.5796 -0.4318
2008-05-14 0.2875 0.0067
2008-05-15 0.0269 0.3559
2008-05-16 0.7374 0.1065
2008-05-19 -0.4428 -0.2014
[100 rows x 2 columns]
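A slightly more idiomatic variant of the same idea, sketched with DataFrame.sample (available in pandas 0.16.1 and later); get_subsample_v2 is a hypothetical name, and each of the X draws per day is taken without replacement:

def get_subsample_v2(group, X=100, sample_size=3):
    # X independent draws of `sample_size` rows, each averaged, then averaged again
    draws = [group.sample(n=sample_size)[['Stock_Change', 'Vol_Change']].mean()
             for _ in range(X)]
    return pd.concat(draws, axis=1).mean(axis=1)

result = df.groupby('Date').apply(get_subsample_v2)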
