Faster way to create a variable that aggregates a column by id [duplicate]

This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 5 years ago.
Is there a faster way to do this? I guess this is unnecessarily slow and that a task like this can be accomplished with base functions.
df <- ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc)))
I'm quite new to R. I have looked at by(), aggregate() and tapply(), but didn't get them to work at all or in the way I wanted. Rather than returning a shorter vector, I want to attach the sum to the original dataframe. What is the best way to do this?
Edit: Here is a speed comparison of the answers applied to my data.
> # My original solution
> system.time( ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc))) )
user system elapsed
14.405 0.000 14.479
> # Paul Hiemstra
> system.time( ddply(df, "id", transform, perc.total = sum(cand.perc)) )
user system elapsed
15.973 0.000 15.992
> # Richie Cotton
> system.time( with(df, tapply(df$cand.perc, df$id, sum))[df$id] )
user system elapsed
0.048 0.000 0.048
> # John
> system.time( with(df, ave(cand.perc, id, FUN = sum)) )
user system elapsed
0.032 0.000 0.030
> # Christoph_J
> system.time( df[ , list(perc.total = sum(cand.perc)), by="id"][df])
user system elapsed
0.028 0.000 0.028
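The original df is not shown here; for anyone who wants to reproduce the comparison, a data frame of roughly comparable shape can be built like this (the number of ids and rows per id are assumptions, not the asker's actual data; for the data.table timing, df would additionally need to be converted with data.table() and keyed by id):
set.seed(1)
n.id   <- 50000                                   # assumed number of ids
per.id <- 20                                      # assumed rows per id
df <- data.frame(id        = rep(seq_len(n.id), each = per.id),
                 cand.perc = runif(n.id * per.id))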

Since you are quite new to R and speed is apparently an issue for you, I recommend the data.table package, which is really fast. One way to solve your problem in one line is as follows:
library(data.table)
DT <- data.table(ID = rep(c(1:3), each = 3),
                 cand.perc = 1:9,
                 key = "ID")
DT <- DT[ , perc.total := sum(cand.perc), by = ID]
DT
     ID perc.total cand.perc
[1,]  1          6         1
[2,]  1          6         2
[3,]  1          6         3
[4,]  2         15         4
[5,]  2         15         5
[6,]  2         15         6
[7,]  3         24         7
[8,]  3         24         8
[9,]  3         24         9
Disclaimer: I'm not a data.table expert (yet ;-), so there might be faster ways to do that. Check out the package site to get started if you are interested in using the package: http://datatable.r-forge.r-project.org/

For any kind of aggregation where you want a result vector of the same length as the input vector, with the group summary replicated across each group, ave is what you want.
df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
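A tiny reproducible illustration (toy data, not the asker's) of how ave() repeats the group sum across every row of each group:
d <- data.frame(id = c(1, 1, 2, 2, 2), cand.perc = 1:5)
d$perc.total <- ave(d$cand.perc, d$id, FUN = sum)
d
#   id cand.perc perc.total
# 1  1         1          3
# 2  1         2          3
# 3  2         3         12
# 4  2         4         12
# 5  2         5         12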

Use tapply to get the group stats, then add them back into your dataset afterwards.
Reproducible example:
means_by_wool <- with(warpbreaks, tapply(breaks, wool, mean))
warpbreaks$means.by.wool <- means_by_wool[warpbreaks$wool]
Untested solution for your scenario:
sum_by_id <- with(df, tapply(cand.perc, id, sum))
df$perc.total <- sum_by_id[df$id]

@ilprincipe, if none of the above fits your needs, you could try transposing your data:
dft <- t(df)
Then use aggregate:
dfta <- aggregate(dft, by = list(rownames(dft)), FUN = sum)
Next, restore your row names:
rownames(dfta) <- dfta[, 1]
dfta <- dfta[, 2:ncol(dfta)]
Transpose back to the original orientation:
df2 <- t(dfta)
and bind to the original data:
newdf <- cbind(df, df2)

Why are you using cbind(x, ...)? The output of ddply will be appended automatically. This should work:
ddply(df, "id", transform, perc.total = sum(cand.perc))
Getting rid of the superfluous cbind should speed things up.

You can also load up your favorite foreach backend and try the .parallel=TRUE argument for ddply.
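For example, with the doParallel backend (a sketch only; the core count is an assumption, and for a cheap function like sum the parallel overhead may well outweigh the gain):
library(doParallel)
registerDoParallel(cores = 2)   # assumed core count
ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc)), .parallel = TRUE)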

Related

Invalid syntax loop in Stata

I'm trying to run a for loop to make a balance table in Stata (comparing the demographics of my dataset with national-level statistics).
For this, I'm prepping my dataset and attempting to calculate the percentages/averages for some key demographics.
preserve
rename unearnedinc_wins95 unearninc_wins95
foreach var of varlist fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019 { //continuous or binary; to put categorical vars use kwallis test
    dis "for variable `var':"
    tabstat `var'
    summ `var'
    local `var'_samplemean=r(mean)
}
clear
set obs 11
gen var=""
gen sample=.
gen F=.
gen pvalue=.
replace var="% Female" if _n==1
replace var="Age" if _n==2
replace var="% Non-white" if _n==3
replace var="HH size" if _n==4
replace var="% Parent" if _n==5
replace var="% Employed" if _n==6
replace var="Savings stock ($)" if _n==7
replace var="Debt stock ($)" if _n==8
replace var="Earned income last mo. ($)" if _n==9
replace var="Unearned income last mo. ($)" if _n==10
replace var="% Under FPL 2019" if _n==11
foreach col of varlist sample {
    replace `col'=100*round(`fem_`col'mean', 0.01) if _n==1
    replace `col'=round(`age_`col'mean') if _n==2
    replace `col'=100*round(`nonwhite_`col'mean', 0.01) if _n==3
    replace `col'=round(`hhsize_`col'mean', 0.1) if _n==4
    replace `col'=100*round(`parent_`col'mean', 0.01) if _n==5
    replace `col'=100*round(`employed_`col'mean', 0.01) if _n==6
    replace `col'=round(`savings_wins95_`col'mean') if _n==7
    replace `col'=round(`debt_wins95_`col'mean') if _n==8
    replace `col'=round(`earnedinc_wins95_`col'mean') if _n==9
    replace `col'=round(`unearninc_wins95_`col'mean') if _n==10
    replace `col'=100*round(`underfpl2019_`col'mean', 0.01) if _n==11
}
I'm trying to run the loop above, but in the second half of it I keep getting an 'invalid syntax' error. For context, in the first half (before clearing the dataset), the code stores the average value of each variable in a macro (`var'_samplemean). Can someone help me out and mend this loop?
My sample data:
clear
input byte fem float(age nonwhite) byte(hhsize parent) float employed double(savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95) float underfpl2019
1 35 1 6 1 1 0 2500 0 0 0
0 40 0 4 1 1 0 10000 1043 0 0
0 40 0 4 1 1 0 20000 2400 0 0
0 40 0 4 1 1 .24 20000 2000 0 0
0 40 0 4 1 1 10 . 2600 0 0
Thanks!
Thanks for sharing the snippet of data. Apart from the fact that the variable unearninc_wins95 has already been renamed in your sample data, the code runs fine for me without returning an error.
That being said, the columns for your F-statistics and p-values are empty once the loop at the bottom of your code completes. As far as I can see, there is no local/varlist called sample which you're attempting to call with the line foreach col of varlist sample {. This could be because you haven't included it in your code, in which case please do, or it could be because you haven't created the local/varlist sample, in which case this could well be the source of your error message.
Taking a step back, there are more efficient ways of achieving what I think you're after. For example, you can get (part of) what you want using the package stat2data (if you don't have it installed already, run ssc install stat2data from the command prompt). You can then run the following code:
stat2data fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019, saving("~/yourstats.dta") stat(count mean)
*which returns:
preserve
use "~/yourstats.dta", clear
. list, sep(11)
+----------------------------+
| _name sN smean |
|----------------------------|
1. | fem 5 .2 |
2. | age 5 39 |
3. | nonwhite 5 .2 |
4. | hhsize 5 4.4 |
5. | parent 5 1 |
6. | employed 5 1 |
7. | savings_wins 5 2.048 |
8. | debt_wins95 4 13125 |
9. | earnedinc_wi 5 1608.6 |
10. | unearninc_wi 5 0 |
11. | underfpl2019 5 0 |
+----------------------------+
restore
This is missing the empty F-statistic and p-value variables you created in your code above, but you can always add them in the same way you have with gen F=. and gen pvalue=.. The presence of these variables though indicates you want to run some tests at some point and then fill the cells with values from them. I'd offer advice on how to do this but it's not obvious to me from your code what you want to test. If you can clarify this I will try and edit this answer to include that.
This doesn't answer your question directly; as others gently point out the question is hard to answer without a reproducible example. But I have several small comments on your code which are better presented in this form.
Assuming that all the variables needed are indeed present in the dataset, I would recommend something more like this:
local myvarlist fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019
local desc `" "% Female" "Age" "% Non-white" "HH size" "% Parent" "% Employed" "Savings stock ($)" "Debt stock ($)" "Earned income last mo. ($)" "Unearned income last mo. ($)" "% Under FPL 2019" "'
gen variable = ""
gen mean = ""
local i = 1
foreach var of local myvarlist {
    summ `var', meanonly
    local this : word `i' of `desc'
    replace variable = "`this'" in `i'
    if inlist(`i', 1, 3, 5, 6, 11) {
        replace mean = strofreal(100 * r(mean), "%2.0f") in `i'
    }
    else if `i' == 4 {
        replace mean = strofreal(r(mean), "%2.1f") in `i'
    }
    else replace mean = strofreal(r(mean), "%2.0f") in `i'
    local ++i
}
This has not been tested.
Points arising include:
Using in is preferable for what you want over testing the observation number with if.
round() is treacherous for rounding to so many decimal places. Most of the time you will get what you want, but occasionally you will get bizarre results arising from the fact that Stata works in binary, like any equivalent program. It is safer to treat rounding as a problem in string manipulation and use display formats as offering precisely what you want.
If the text you want to show is just the variable label for each variable, this code could be simplified further.
The code hints at intent to show other stuff, which is easily done compatibly with this design.

How to avoid row names in further analysis in R?

I'm just running the following example from the GGEBiplotGUI package and, of course, it works properly.
library(GGEBiplotGUI)
data("Ontario")
Ontario
GGEBiplot(Data = Ontario)
But it does not work as expected when I download the "Ontario" data and run the above script on my PC. See the example below.
Ontario <- read.csv("Book.csv")
library(GGEBiplotGUI)
GGEBiplot(Data = Ontario)
The result is the following table (columns 0 to 10), which takes the row numbers (1 to 17) as genotypes and "X" as another location.
See the result below please.
X BH93 EA93 HW93 ID93 KE93 NN93 OA93 RN93 WP93
1 ann 4.460 4.150 2.849 3.084 5.940 4.450 4.351 4.039 2.672
2 ari 4.417 4.771 2.912 3.506 5.699 5.152 4.956 4.386 2.938
3 aug 4.669 4.578 3.098 3.460 6.070 5.025 4.730 3.900 2.621
4 cas 4.732 4.745 3.375 3.904 6.224 5.340 4.226 4.893 3.451
5 del 4.390 4.603 3.511 3.848 5.773 5.421 5.147 4.098 2.832
6 dia 5.178 4.475 2.990 3.774 6.583 5.045 3.985 4.271 2.776
7 ena 3.375 4.175 2.741 3.157 5.342 4.267 4.162 4.063 2.032
8 fun 4.852 4.664 4.425 3.952 5.536 5.832 4.168 5.060 3.574
9 ham 5.038 4.741 3.508 3.437 5.960 4.859 4.977 4.514 2.859
10 har 5.195 4.662 3.596 3.759 5.937 5.345 3.895 4.450 3.300
11 kar 4.293 4.530 2.760 3.422 6.142 5.250 4.856 4.137 3.149
12 kat 3.151 3.040 2.388 2.350 4.229 4.257 3.384 4.071 2.103
13 luc 4.104 3.878 2.302 3.718 4.555 5.149 2.596 4.956 2.886
14 m12 3.340 3.854 2.419 2.783 4.629 5.090 3.281 3.918 2.561
15 reb 4.375 4.701 3.655 3.592 6.189 5.141 3.933 4.208 2.925
16 ron 4.940 4.698 2.950 3.898 6.063 5.326 4.302 4.299 3.031
17 rub 3.786 4.969 3.379 3.353 4.774 5.304 4.322 4.858 3.382
How can I fix this problem? I mean, how do I avoid "rownames" and "X" being treated as variables in the GGEBiplotGUI analysis?
I have also tried these, and they didn't work:
attributes(Ontario)$row.names <- NULL
print(Ontario, row.names = F)
row.names(Ontario) <- NULL
Ontario[, -1] ## It deletes the first column not the 0 one.
Many thanks in advance!
This code worked properly.
Ontario <- read.csv("Libro.csv")
rownames(Ontario)<-Ontario$X
Ontario1<-Ontario[,-1]
library(GGEBiplotGUI)
GGEBiplot(Data = Ontario1)
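Equivalently, read.csv() can assign the row names while reading, which makes the two intermediate steps unnecessary. A minimal sketch, assuming the genotype labels sit in the first column of the file (as in the output shown above):
Ontario <- read.csv("Book.csv", row.names = 1)   # first column becomes row names
library(GGEBiplotGUI)
GGEBiplot(Data = Ontario)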

SAS: Event study window with condition

I am conducting an event study and need the average value of return to generate abnormal returns. My benchmark window is [-60,-11] and my event window is [-5,-1] with 0 as announcement date. However, I have several announcements which could contaminate the benchmark and event window.
Still, I want to keep the 50 days of benchmark window intact, thus, if there is an announcement in the benchmark window, delete this day and extend the window by 1.
Right now I generate averages with proc expand:
proc expand; by stock;
convert logreturn = avg_logreturn / METHOD = none TRANSFORMOUT = (movave 60 lag 11);
run;
And then deduct the average from the actual returns.
My data set looks like this (10 years of data):
Stock Date Return Announcement
AAA 01/01/10 0.05
AAA 02/01/10 0.04
AAA 03/01/10 -0.02 03/01/10 (this one should be deleted as it spoils the coming announcement, but should still be counted as an announcement)
AAA 04/01/10 0.01
AAA 05/01/10 -0.03
AAA 06/01/10 0.05
AAA 07/01/10 0.04
AAA 08/01/10 -0.02 08/01/10
AAA 09/01/10 0.01
AAA 10/01/10 -0.03
AAB 01/01/10 0.01
etc
Basically, each announcement needs a window of -60 to -11 over which I calculate the average. The length should remain the same, but whenever there is another announcement in this window, that return should not be counted in the average.
The idea is simple but the realization seems complicated...
Pre process the data to be expanded.
Find the stock/dates that need to be culled
Create a view that excludes culled dates
Proc EXPAND
Sample code:
data have;
attrib
stock length=$3
date length=4 format=date9. informat=mmddyy8.
return length=8 format=6.2
announcement length=4 format=date9. informat=mmddyy8.
;
infile cards missover;
input stock date return announcement;
datalines;
AAA 01/01/10 0.05
AAA 02/01/10 0.04
AAA 03/01/10 -0.02 03/01/10 this one should be deleted as is spoils the coming announcement but still be counted as an announcement
AAA 04/01/10 0.01
AAA 05/01/10 -0.03
AAA 06/01/10 0.05
AAA 07/01/10 0.04
AAA 08/01/10 -0.02 08/01/10
AAA 09/01/10 0.01
AAA 10/01/10 -0.03
AAB 01/01/10 0.01
run;
%let CULL_GAP_LE_CRITERIA = 5 ;
data cull(keep=stock cull_date);
set have;
by stock date;
retain cull_date;
if first.stock then cull_date = .;
if announcement then do;
if cull_date then do;
gap = intck('month', cull_date, announcement);
if gap <= &CULL_GAP_LE_CRITERIA then
OUTPUT;
end;
cull_date = announcement; * setup for culling this announcement as well;
put cull_date=;
end;
run;
data DATA_FOR_EXPAND / view=DATA_FOR_EXPAND;
merge
have
cull(rename=cull_date=date in=culled)
;
by stock date;
if not culled;
run;
if not culled will remove the culled row. I think this is correct for you because you said the window was to increase by 1.
If you want the culled date to be used in rolling windows prior to itself you have a bit of a pickle.

R csv.bz2 Shell Windows counting number of lines

I'm having problems counting the number of lines in a messy csv.bz2 file.
Since this is a huge file I want to be able to preallocate a data frame before reading the bzip2 file with the read.csv() function.
As you can see in the following tests, my results vary widely, and none of them correspond to the actual number of rows in the csv.bz2 file.
> system.time(nrec1 <- as.numeric(shell('type "MyFile.csv" | find /c ","', intern=T)))
user system elapsed
0.02 0.00 53.50
> nrec1
[1] 1060906
> system.time(nrec2 <- as.numeric(shell('type "MyFile.csv.bz2" | find /c ","', intern=T)))
user system elapsed
0.00 0.02 10.15
> nrec2
[1] 126715
> system.time(nrec3 <- as.numeric(shell('type "MyFile.csv" | find /v /c ""', intern=T)))
user system elapsed
0.00 0.02 53.10
> nrec3
[1] 1232705
> system.time(nrec4 <- as.numeric(shell('type "MyFile.csv.bz2" | find /v /c ""', intern=T)))
user system elapsed
0.00 0.01 4.96
> nrec4
[1] 533062
The most interesting result is the one I called nrec4 since it takes no time, and it returns roughly half the number of rows of nrec1, but I'm totally unsure if the naive multiplication by 2 will be ok.
I have tried several other methods including fread() and hsTableReader(), but the former crashes and the latter is so slow that I won't even consider it further.
My questions are:
Which reliable method can I use for counting the number of rows in a csv.bz2 file?
Is it OK to use a formula to calculate the number of rows directly from a csv.bz2 file without decompressing it?
Thanks in advance,
Diego
Roland was right from the beginning.
Even when using the garbage collector, the illusion of improved performance remained; I had to close and restart R to do an accurate test.
Yes, the process is still a little faster, by a few seconds (red line), and the increase in RAM consumption is more uniform when using nrows.
But at least in this case it is not worth the effort of trying to optimize the read.csv() call.
It is slow, but it is what it is.
If someone knows of a faster approach, I'm interested.
(fread() crashes, in case you were wondering.)
Thanks.
Without nrows (Blue Line)
Sys.time()
system.time(storm.data <- read.csv(fileZip,
                                   header = TRUE,
                                   stringsAsFactors = F,
                                   comment.char = "",
                                   colClasses = "character"))
Sys.time()
rm(storm.data)
gc()
With nrows (Red Line)
Sys.time()
system.time(nrec12 <- as.numeric(
  shell('type "MyFile.csv.bz2" | find /v /c ""',
        intern = T)))
nrec12 <- nrec12 * 2
system.time(storm.data <- read.csv(fileZip,
                                   stringsAsFactors = F,
                                   comment.char = "",
                                   colClasses = "character",
                                   nrows = nrec12))
Sys.time()
rm(storm.data)
gc()
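For completeness, regarding the original question about a reliable row count: one way to get an exact line count without decompressing the file to disk is to stream it through a bzfile() connection and count lines in chunks. A minimal sketch (the file name is assumed; the count includes the header line and assumes no embedded newlines inside quoted fields):
con <- bzfile("MyFile.csv.bz2", open = "r")
n <- 0
while (length(chunk <- readLines(con, n = 100000)) > 0) {
  n <- n + length(chunk)        # running total of lines read so far
}
close(con)
n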

Rolling list over unequal times in XTS

I have stock data at the tick level and would like to create a rolling list of all ticks for the previous 10 seconds. The code below works, but takes a very long time for large amounts of data. I'd like to vectorize this process or otherwise make it faster, but I'm not coming up with anything. Any suggestions or nudges in the right direction would be appreciated.
library(quantmod)
set.seed(150)
# Create five minutes of xts example data at .1 second intervals
mins <- 5
ticks <- mins * 60 * 10 + 1
times <- xts(runif(seq_len(ticks),1,100), order.by=seq(as.POSIXct("1973-03-17 09:00:00"),
as.POSIXct("1973-03-17 09:05:00"), length = ticks))
# Randomly remove some ticks to create unequal intervals
times <- times[runif(seq_along(times))>.3]
# Number of seconds to look back
lookback <- 10
dist.list <- list(rep(NA, nrow(times)))
system.time(
for (i in 1:length(times)) {
dist.list[[i]] <- times[paste(strptime(index(times[i])-(lookback-1), format = "%Y-%m-%d %H:%M:%S"), "/",
strptime(index(times[i])-1, format = "%Y-%m-%d %H:%M:%S"), sep = "")]
}
)
> user system elapsed
6.12 0.00 5.85
You should check out the window function; it will make your subselection of dates a lot easier. The following code uses lapply to do the work of the for loop.
# Your code
system.time(
for (i in 1:length(times)) {
dist.list[[i]] <- times[paste(strptime(index(times[i])-(lookback-1), format = "%Y-%m-%d %H:%M:%S"), "/",
strptime(index(times[i])-1, format = "%Y-%m-%d %H:%M:%S"), sep = "")]
}
)
# user system elapsed
# 10.09 0.00 10.11
# My code
system.time(dist.list<-lapply(index(times),
function(x) window(times,start=x-lookback-1,end=x))
)
# user system elapsed
# 3.02 0.00 3.03
So, it runs in about a third of the time.
But, if you really want to speed things up, and you are willing to forgo millisecond accuracy (which I think your original method implicitly does), you could just run the loop on unique date-hour-second combinations, because they will all return the same time window. This should speed things up roughly twenty or thirty times:
dat.time=unique(as.POSIXct(as.character(index(times)))) # Cheesy method to drop the ms.
system.time(dist.list.2<-lapply(dat.time,function(x) window(times,start=x-lookback-1,end=x)))
# user system elapsed
# 0.37 0.00 0.39
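If you want to avoid date arithmetic altogether, here is a position-based sketch (untested beyond the toy data above; tt, first and dist.list.3 are names introduced only for illustration). It converts the index to numeric seconds and uses findInterval() to find where each window starts, so the subsetting is done by integer position:
tt    <- as.numeric(index(times))
first <- findInterval(tt - lookback, tt) + 1   # first position with time greater than t - lookback
dist.list.3 <- lapply(seq_along(tt), function(i) times[first[i]:i])
# Like the window() version above, each element includes the current tick.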
