How to insert new rows into a dataframe under a variety of conditions? - for-loop

I have data of around 60,000 events and 1,900 IDs. This is just an example of how my data is structured:
ID <- c(1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2)
behaviour <- c("stand", "eat", "run", "lie", "lick", "stand", "lick", "eat", "stand", "eat", "rum", "lick", "eat", "lie", "rum")
timediff <- c(NA, -100, 7, 45, 120, 85, 3, 15, NA, 4, 39, 173, 5, 50, 13)
dat <- data.frame(ID, behaviour, timediff)
The dataframe dat looks like this:
ID    behaviour   timediff
1.1   stand       NA
1.1   eat         -100
1.1   run         7
1.1   lie         45
1.1   lick        120
1.1   stand       85
1.1   lick        3
1.1   eat         15
1.2   stand       NA
1.2   eat         4
1.2   rum         39
1.2   lick        173
1.2   eat         5
1.2   lie         50
1.2   rum         13
What I am looking for is how to add new rows based on the following conditions:
Run the row-adding only within each ID (never across IDs).
Add a new row whenever timediff >= 5. The new row should carry the same ID as the row above, behaviour set to "still", and timediff set to NA.
Only add rows as long as there is no behaviour == "lie": do not apply the row-adding from the first behaviour following behaviour == "lie" until the first behaviour following behaviour == "stand".
Add the new row above the current row.
What my data should look like after adding the new rows:
ID    behaviour   timediff
1.1   stand       NA
1.1   eat         -100
1.1   still       NA
1.1   run         7
1.1   still       NA
1.1   lie         45
1.1   lick        120
1.1   stand       85
1.1   lick        3
1.1   still       NA
1.1   eat         15
1.2   stand       NA
1.2   eat         4
1.2   still       NA
1.2   rum         39
1.2   still       NA
1.2   lick        173
1.2   still       NA
1.2   eat         5
1.2   still       NA
1.2   lie         50
1.2   rum         13
I tried to handle the problem with a self-written function for inserting a new row, an unfinished for-loop and some if conditions, but was not able to find an appropriate approach. So I would appreciate any solution that gets me out of this problem.
library(dplyr)

insert_row <- function(data, new_row, r) {
  data[seq(r + 1, nrow(data) + 1), ] <- data[seq(r, nrow(data)), ]
  data[r, ] <- new_row
  data
}

newrow <- data.frame(matrix(data = NA, nrow = 1, ncol = 3))

mylist_3 <- list()
for (i in 1:length(unique(dat$ID))) {
  a1 <- filter(dat, ID == unique(dat$ID)[i])
  if (a1$behaviour != "lie") {
    insert_row(dat, newrow, i)
  } else if (a1$behaviour == "lie") {
    # from here on, I do not know how to get my conditions to the heart of the matter...
  }
}
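As a point of reference (not part of the original post), here is a minimal base-R sketch of one way the conditions above could be implemented. It assumes the column names from the example data, that behaviour is stored as character (the R >= 4.0 default), and the function name add_still_rows is made up:

# Sketch: build a new frame row by row instead of inserting in place
add_still_rows <- function(dat) {
  out <- dat[0, ]                          # empty frame with the same columns
  for (id in unique(dat$ID)) {
    sub <- dat[dat$ID == id, ]             # work within one ID only
    suspended <- FALSE                     # TRUE between "lie" and the row after "stand"
    for (r in seq_len(nrow(sub))) {
      row <- sub[r, ]
      if (!suspended && !is.na(row$timediff) && row$timediff >= 5) {
        out <- rbind(out, data.frame(ID = id, behaviour = "still", timediff = NA))
      }
      out <- rbind(out, row)
      # update the pause state only after the current row has been handled,
      # so the "lie" row itself can still receive a "still" row above it
      if (row$behaviour == "lie")   suspended <- TRUE
      if (row$behaviour == "stand") suspended <- FALSE
    }
  }
  rownames(out) <- NULL
  out
}

add_still_rows(dat)   # should reproduce the table with the "still" rows shown above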

Related

Mutate new column from random value in existing columns

I'm looking to mutate my data and create a new column which randomly selects a value from the existing data. My data looks something like:
individual   age_2010   age_2011   age_2012   age_2013
a            20         21         NA         21
b            33         34         35         36
c            76         NA         78         79
d            46         46         48         49
And I want it to look like:
individual   age_2010   age_2011   age_2012   age_2013   Random Sample
a            20         21         NA         21         21
b            33         34         35         36         36
c            76         NA         78         79         78
d            46         46         48         49         48
Is there any way to add a new column which includes a random figure from any of the previous age columns, and preferably keeping the data in wide form?
I think this is an easier approach:
d[, RandomSample:=sample(na.omit(t(.SD)),1),individual]
If dealing with the edge cases discussed above is desired, and one wanted to follow this approach, we could do this:
f <- function(df) {
  s <- na.omit(t(df))
  ifelse(length(s) > 0, sample(s, 1), NA_real_)
}
d[, RandomSample:=f(.SD),individual]
Or, we could just wrap the original approach in tryCatch:
d[, RandomSample:=tryCatch(sample(na.omit(t(.SD)),1),error=\(e) NA),individual]
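For reference, here is a minimal setup sketch (not part of the original answer) so the one-liners above can be run as-is; the data.table construction is an assumption based on the table in the question, and the next answer also includes an equivalent d built via dput().

library(data.table)

# Rebuild the example data shown in the question (assumed to be a data.table named d)
d <- data.table(
  individual = c("a", "b", "c", "d"),
  age_2010   = c(20, 33, 76, 46),
  age_2011   = c(21, 34, NA, 46),
  age_2012   = c(NA, 35, 78, 48),
  age_2013   = c(21, 36, 79, 49)
)

# The one-liner from the start of this answer, applied to the rebuilt data
d[, RandomSample := sample(na.omit(t(.SD)), 1), individual]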
You can reshape longer, then do grouped sampling:
library(data.table)
# Sample data
d <- structure(list(individual = c("a", "b", "c", "d"), age_2010 = c(20, 33, 76, 46), age_2011 = c(21, 34, NA, 46), age_2012 = c(NA, 35, 78, 48), age_2013 = c(21, 36, 79, 49)), row.names = c(NA, -4L), spec = structure(list(cols = list(individual = structure(list(), class = c("collector_character", "collector")), age_2010 = structure(list(), class = c("collector_double", "collector")), age_2011 = structure(list(), class = c("collector_double", "collector")), age_2012 = structure(list(), class = c("collector_double", "collector")), age_2013 = structure(list(), class = c("collector_double", "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 2L), class = "col_spec"), class = c("data.table", "data.frame"))
d
#> individual age_2010 age_2011 age_2012 age_2013
#> 1: a 20 21 NA 21
#> 2: b 33 34 35 36
#> 3: c 76 NA 78 79
#> 4: d 46 46 48 49
# Solution
d[, "Random Sample"] <- d |>
melt("individual") |> # go long
(`[`)(!is.na(value), # drop NAs
.(x = sample(value, 1)), # sampling
keyby = .(individual)) |> # Grouping variable
(`[[`)(2) # extract vector from frame
d
#> individual age_2010 age_2011 age_2012 age_2013 Random Sample
#> 1: a 20 21 NA 21 21
#> 2: b 33 34 35 36 33
#> 3: c 76 NA 78 79 76
#> 4: d 46 46 48 49 49
Alternatively, you can also use apply(), which is less verbose but much slower:
d[, "Random Sample"] <- apply(d[, -1], 1, \(x) x |> na.omit() |> sample(1))
See the benchmark below for a speed comparison. On just 40k observations, apply() needs 59 times longer and allocates 8 times the memory.
# Make large sample data set
d_large <- d |>
list() |>
rep(1e4) |>
rbindlist()
bench::mark(
base = apply(d_large[, -1], 1, \(x) x |> na.omit() |> sample(1)),
dt = d_large |>
melt("individual") |>
(`[`)(!is.na(value),
.(x = sample(value, 1)),
keyby = .(individual)) |>
(`[[`)(2),
check = F
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 617.86ms 617.9ms 1.62 103.3MB 12.9
#> 2 dt 6.96ms 10.5ms 80.9 13.1MB 47.3
Created on 2022-07-27 by the reprex package (v2.0.1)
Edit:
Here are versions that work with the edge case where all years are NA. In the first case I went for a join with the original table, which is a bit more expensive than the other version.
# Solution with Data Table
d <- d |>
melt("individual") |> # go long
(`[`)(!is.na(value), # drop NAs
.(`Random Sample` = sample(value, 1)), # sampling
keyby = .(individual)) |> # Grouping variable
(`[`)(d) # right join with original frame
Here I simply used purrr::possibly() to return NA when sampling a zero length vector.
# Solution with apply
d[, "Random Sample"] <- apply(d[, -1], 1,
\(x) x |> na.omit() |> purrr::possibly(sample, NA)(1))

AMPL: Syntax for sets?

I'm spinning up on a high-level language for mixed-integer linear programs (MILPs). The language is AMPL, A Mathematical Programming Language.
Chapter 4, page 65, Figure 4-7 shows the following syntax:
set PROD := bands coils plate ;
However, Chapter 5, page 74, shows the following syntax:
set PROD = {"bands", "coils", "plate"};
Can anyone please explain this difference in syntax?
I put the latter into a *.dat file, and AMPL complains "expected ; ( : or symbol" where the { is. I am wondering if it is just a mistake in the manual.
Thanks.
The syntax in Chapter 4 --
set PROD := bands coils plate;
-- is used in data files, while the syntax in Chapter 5 --
set PROD = {"bands", "coils", "plate"};
-- is used in model files. It's a little weird (IMO) that the syntax for sets is different in model and data files, but it is. For another example of this difference, see this question and answer.
Complete working example code modified from AMPL manual
Added by the original poster of the question.
dietu.mod:
# dietu.mod
#----------
# set MINREQ; # nutrients with minimum requirements
# set MAXREQ; # nutrients with maximum requirements
set MINREQ = {"A", "B1", "B2", "C", "CAL"};
set MAXREQ = {"A", "NA", "CAL"};
set NUTR = MINREQ union MAXREQ; # nutrients
set FOOD; # foods
param cost {FOOD} > 0;
param f_min {FOOD} >= 0;
param f_max {j in FOOD} >= f_min[j];
param n_min {MINREQ} >= 0;
param n_max {MAXREQ} >= 0;
param amt {NUTR,FOOD} >= 0;
var Buy {j in FOOD} >= f_min[j], <= f_max[j];
minimize Total_Cost: sum {j in FOOD} cost[j] * Buy[j];
subject to Diet_Min {i in MINREQ}:
sum {j in FOOD} amt[i,j] * Buy[j] >= n_min[i];
subject to Diet_Max {i in MAXREQ}:
sum {j in FOOD} amt[i,j] * Buy[j] <= n_max[i];
The explicit definitions of sets MINREQ and MAXREQ and their members are taken from the *.dat file below (where their definitions have been commented out). MATLAB users, observe above and beware that you need commas between members in a set.
dietu.dat:
# dietu.dat
#----------
data;
# set MINREQ := A B1 B2 C CAL ;
# set MAXREQ := A NA CAL ;
set FOOD := BEEF CHK FISH HAM MCH MTL SPG TUR ;
param: cost f_min f_max :=
BEEF 3.19 2 10
CHK 2.59 2 10
FISH 2.29 2 10
HAM 2.89 2 10
MCH 1.89 2 10
MTL 1.99 2 10
SPG 1.99 2 10
TUR 2.49 2 10 ;
param: n_min n_max :=
A 700 20000
C 700 .
B1 0 .
B2 0 .
NA . 50000
CAL 16000 24000 ;
param amt (tr): A C B1 B2 NA CAL :=
BEEF 60 20 10 15 938 295
CHK 8 0 20 20 2180 770
FISH 8 10 15 10 945 440
HAM 40 40 35 10 278 430
MCH 15 35 15 15 1182 315
MTL 70 30 15 15 896 400
SPG 25 50 25 15 1329 370
TUR 60 20 15 10 1397 450 ;
Solve the model using the following at the AMPL prompt:
reset data;
reset;
model dietu.mod;
data dietu.dat;
solve;

How to add multiple columns in Apache Spark

Here is my input data, with four columns separated by spaces. I want to add the second and third columns and print the result:
sachin 200 10 2
sachin 900 20 2
sachin 500 30 3
Raju 400 40 4
Mike 100 50 5
Raju 50 60 6
My code is only partway there:
from pyspark import SparkContext
sc = SparkContext()
def getLineInfo(lines):
    spLine = lines.split(' ')
    name = str(spLine[0])
    cash = int(spLine[1])
    cash2 = int(spLine[2])
    cash3 = int(spLine[3])
    return (name, cash, cash2)
myFile = sc.textFile("D:\PYSK\cash.txt")
rdd = myFile.map(getLineInfo)
print rdd.collect()
From here I got the result as
[('sachin', 200, 10), ('sachin', 900, 20), ('sachin', 500, 30), ('Raju', 400, 40), ('Mike', 100, 50), ('Raju', 50, 60)]
Now the final result I need is as below: add the 2nd and 3rd columns and display the remaining fields.
sachin 210 2
sachin 920 2
sachin 530 3
Raju 440 4
Mike 150 5
Raju 110 6
Use this:
def getLineInfo(lines):
    spLine = lines.split(' ')
    name = str(spLine[0])
    cash = int(spLine[1])
    cash2 = int(spLine[2])
    cash3 = int(spLine[3])
    return (name, cash + cash2, cash3)

How to speed up Pandas multilevel dataframe shift by group?

I am trying to shift a Pandas dataframe's columns within groups defined by the first index level. Here is the demo code:
In [8]: df = mul_df(5,4,3)
In [9]: df
Out[9]:
COL000 COL001 COL002
STK_ID RPT_Date
A0000 B000 -0.5505 0.7445 -0.3645
B001 0.9129 -1.0473 -0.5478
B002 0.8016 0.0292 0.9002
B003 2.0744 -0.2942 -0.7117
A0001 B000 0.7064 0.9636 0.2805
B001 0.4763 0.2741 -1.2437
B002 1.1563 0.0525 -0.7603
B003 -0.4334 0.2510 -0.0105
A0002 B000 -0.6443 0.1723 0.2657
B001 1.0719 0.0538 -0.0641
B002 0.6787 -0.3386 0.6757
B003 -0.3940 -1.2927 0.3892
A0003 B000 -0.5862 -0.6320 0.6196
B001 -0.1129 -0.9774 0.7112
B002 0.6303 -1.2849 -0.4777
B003 0.5046 -0.4717 -0.2133
A0004 B000 1.6420 -0.9441 1.7167
B001 0.1487 0.1239 0.6848
B002 0.6139 -1.9085 -1.9508
B003 0.3408 -1.3891 0.6739
In [10]: grp = df.groupby(level=df.index.names[0])
In [11]: grp.shift(1)
Out[11]:
COL000 COL001 COL002
STK_ID RPT_Date
A0000 B000 NaN NaN NaN
B001 -0.5505 0.7445 -0.3645
B002 0.9129 -1.0473 -0.5478
B003 0.8016 0.0292 0.9002
A0001 B000 NaN NaN NaN
B001 0.7064 0.9636 0.2805
B002 0.4763 0.2741 -1.2437
B003 1.1563 0.0525 -0.7603
A0002 B000 NaN NaN NaN
B001 -0.6443 0.1723 0.2657
B002 1.0719 0.0538 -0.0641
B003 0.6787 -0.3386 0.6757
A0003 B000 NaN NaN NaN
B001 -0.5862 -0.6320 0.6196
B002 -0.1129 -0.9774 0.7112
B003 0.6303 -1.2849 -0.4777
A0004 B000 NaN NaN NaN
B001 1.6420 -0.9441 1.7167
B002 0.1487 0.1239 0.6848
B003 0.6139 -1.9085 -1.9508
The mul_df() code is attached here: How to speed up Pandas multilevel dataframe sum?
Now I want to run grp.shift(1) on a big dataframe.
In [1]: df = mul_df(5000,30,400)
In [2]: grp = df.groupby(level=df.index.names[0])
In [3]: timeit grp.shift(1)
1 loops, best of 3: 5.23 s per loop
5.23 s is too slow. How can I speed it up?
(My computer configuration: Pentium Dual-Core T4200 @ 2.00GHz, 3.00GB RAM, Windows XP, Python 2.7.4, Numpy 1.7.1, Pandas 0.11.0, numexpr 2.0.1, Anaconda 1.5.0 (32-bit))
How about shifting the whole DataFrame object and then setting the first row of every group to NaN?
import numpy as np

dfs = df.shift(1)
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
The problem is that the shift operation is not cython-optimized, so it involves a callback to Python. Compare this with:
In [84]: %timeit grp.shift(1)
1 loops, best of 3: 1.77 s per loop
In [85]: %timeit grp.sum()
1 loops, best of 3: 202 ms per loop
I added an issue for this: https://github.com/pydata/pandas/issues/4095
Similar question, with an added answer that works for shifts in either direction and of any magnitude: pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift
Code (including test setup) is:
import numpy as np
import pandas as pd

#
# the function to use in apply
#
def replace_shift_overlap(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)
shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
#
# the apply
#
df = df.groupby(level=0).apply(replace_shift_overlap, 'tmpShift', shiftBy, np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
EDIT: Note that the initial sort really eats into the effectiveness of this. So in some cases the original answer is more effective.
Try this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 15, 30, 45, 43, 67, 22, 12, 14, 54],
                   'B': [13, 23, 18, 33, 48, 1, 7, 56, 66, 45, 32],
                   'C': [17, 27, 22, 37, 52, 77, 34, 21, 22, 90, 8],
                   'D': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']})
df
#> A B C D
#> 0 10 13 17 a
#> 1 20 23 27 a
#> 2 15 18 22 a
#> 3 30 33 37 a
#> 4 45 48 52 b
#> 5 43 1 77 b
#> 6 67 7 34 b
#> 7 22 56 21 c
#> 8 12 66 22 c
#> 9 14 45 90 c
#> 10 54 32 8 c
def groupby_shift(df, col, groupcol, shift_n, fill_na=np.nan):
    '''df: dataframe
    col: column need to be shifted
    groupcol: group variable
    shift_n: how much need to shift
    fill_na: how to fill nan value, default is np.nan
    '''
    rowno = list(df.groupby(groupcol).size().cumsum())
    lagged_col = df[col].shift(shift_n)
    na_rows = [i for i in range(shift_n)]
    for i in rowno:
        if i == rowno[len(rowno) - 1]:
            continue
        else:
            new = [i + j for j in range(shift_n)]
            na_rows.extend(new)
    na_rows = list(set(na_rows))
    na_rows = [i for i in na_rows if i <= len(lagged_col) - 1]
    lagged_col.iloc[na_rows] = fill_na
    return lagged_col
df['A_lag_1'] = groupby_shift(df, 'A', 'D', 1)
df
#> A B C D A_lag_1
#> 0 10 13 17 a NaN
#> 1 20 23 27 a 10.0
#> 2 15 18 22 a 20.0
#> 3 30 33 37 a 15.0
#> 4 45 48 52 b NaN
#> 5 43 1 77 b 45.0
#> 6 67 7 34 b 43.0
#> 7 22 56 21 c NaN
#> 8 12 66 22 c 22.0
#> 9 14 45 90 c 12.0
#> 10 54 32 8 c 14.0

Data Frame Subset Performance

I have a couple of large data frames (1 million+ rows x 6-10 columns) that I need to subset repeatedly. The subsetting is the slowest part of my code, and I am curious whether there is a way to do it faster.
load("https://dl.dropbox.com/u/4131944/Temp/DF_IOSTAT_ALL.rda")
start_in <- strptime("2012-08-20 13:00", "%Y-%m-%d %H:%M")
end_in<- strptime("2012-08-20 17:00", "%Y-%m-%d %H:%M")
system.time(DF_IOSTAT_INT <- DF_IOSTAT_ALL[DF_IOSTAT_ALL$date_stamp >= start_in & DF_IOSTAT_ALL$date_stamp <= end_in,])
> system.time(DF_IOSTAT_INT <- DF_IOSTAT_ALL[DF_IOSTAT_ALL$date_stamp >= start_in & DF_IOSTAT_ALL$date_stamp <= end_in,])
user system elapsed
16.59 0.00 16.60
dput(head(DF_IOSTAT_ALL))
structure(list(date_stamp = structure(list(sec = c(14, 24, 34,
44, 54, 4), min = c(0L, 0L, 0L, 0L, 0L, 1L), hour = c(0L, 0L,
0L, 0L, 0L, 0L), mday = c(20L, 20L, 20L, 20L, 20L, 20L), mon = c(7L,
7L, 7L, 7L, 7L, 7L), year = c(112L, 112L, 112L, 112L, 112L, 112L
), wday = c(1L, 1L, 1L, 1L, 1L, 1L), yday = c(232L, 232L, 232L,
232L, 232L, 232L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt")), cpu = c(0.9, 0.2, 0.2, 0.1,
0.2, 0.1), rsec_s = c(0, 0, 0, 0, 0, 0), wsec_s = c(0, 3.8, 0,
0.4, 0.2, 0.2), util_pct = c(0, 0.1, 0, 0, 0, 0), node = c("bda101",
"bda101", "bda101", "bda101", "bda101", "bda101")), .Names = c("date_stamp",
"cpu", "rsec_s", "wsec_s", "util_pct", "node"), row.names = c(NA,
6L), class = "data.frame")
I would use xts for this. The only potential hiccup is that xts is a matrix with an ordered index attribute, so you can't mix types like you can in a data.frame.
If the node column is invariant, you can just exclude it from your xts object:
library(xts)
x <- xts(DF_IOSTAT_ALL[,2:5], as.POSIXct(DF_IOSTAT_ALL$date_stamp))
x["2012-08-20 00:00:24/2012-08-20 00:00:54"]
Update using the OP's actual data:
Data <- DF_IOSTAT_ALL
# change node from character to numeric,
# so it can exist in the xts object too.
Data$node <- as.numeric(gsub("^bda", "", Data$node))
# create the xts object
x <- xts(Data[,-1], as.POSIXct(Data$date_stamp))
# subset one day
system.time(x['2012-08-20 13:00/2012-08-20 17:00'])
# user system elapsed
# 0 0 0
# subset 13:00-17:00 for all days
system.time(x['T13:00/T17:00'])
# user system elapsed
# 2.64 0.00 2.66
Here are my experiments with data.table. Interestingly, just the conversion to data.table makes the lookups faster, possibly through more efficient evaluation of the logical vectors. I compared four things: the original data frame lookup; a lookup after converting from POSIXlt to POSIXct (thanks to Matthew Dowle); the data.table lookup; and the data.table lookup including the setup cost of copying and converting. Even with the additional setup, the data.table lookup wins. With multiple lookups, you save even more time.
library(data.table)
library(rbenchmark)
load("DF_IOSTAT_ALL.rda")
DF_IOSTAT_ALL.original <- DF_IOSTAT_ALL
start_in <- strptime("2012-08-20 13:00", "%Y-%m-%d %H:%M")
end_in<- strptime("2012-08-20 17:00", "%Y-%m-%d %H:%M")
#function to test: original
fun <- function() DF_IOSTAT_INT <<- DF_IOSTAT_ALL.original[DF_IOSTAT_ALL.original$date_stamp >= start_in & DF_IOSTAT_ALL.original$date_stamp <= end_in,]
#function to test: changing to POSIXct
DF_IOSTAT_ALL.ct <- within(DF_IOSTAT_ALL.original,date_stamp <- as.POSIXct(date_stamp))
fun.ct <- function() DF_IOSTAT_INT <<- DF_IOSTAT_ALL.ct[with(DF_IOSTAT_ALL.ct,date_stamp >= start_in & date_stamp <= end_in),]
#function to test: with data.table and POSIXct
DF_IOSTAT_ALL.dt <- as.data.table(DF_IOSTAT_ALL.ct);
fun.dt <- function() DF_IOSTAT_INT <<- DF_IOSTAT_ALL.dt[date_stamp >= start_in & date_stamp <= end_in,]
#function to test: with data table and POSIXct, with setup steps
newfun <- function() {
  DF_IOSTAT_ALL <- DF_IOSTAT_ALL.original
  # data.table doesn't play well with POSIXlt, so convert to POSIXct
  DF_IOSTAT_ALL$date_stamp <- as.POSIXct(DF_IOSTAT_ALL$date_stamp)
  DF_IOSTAT_ALL <- data.table(DF_IOSTAT_ALL)
  DF_IOSTAT_INT <<- DF_IOSTAT_ALL[date_stamp >= start_in & date_stamp <= end_in, ]
}
benchmark(fun(), fun.ct(), fun.dt(), newfun(), replications=3,order="relative")
# test replications elapsed relative user.self sys.self user.child sys.child
#3 fun.dt() 3 0.18 1.000000 0.11 0.08 NA NA
#2 fun.ct() 3 0.52 2.888889 0.44 0.08 NA NA
#4 newfun() 3 35.49 197.166667 34.88 0.58 NA NA
#1 fun() 3 66.68 370.444444 66.42 0.15 NA NA
If you know what your time intervals are beforehand, you can probably make it even faster by splitting with findInterval or cut and keying/indexing the table.
# Note: DF_IOSTAT_ALL.new, strptime.d() and fun2() are helpers from the answer author's session and are not defined in this excerpt.
DF_IOSTAT_ALL <- copy(DF_IOSTAT_ALL.new)
time.breaks <- strptime.d("2012-08-19 19:00:00") + 0:178 * 60 * 60  # by hour
DF_IOSTAT_ALL[,interval := findInterval(date_stamp,time.breaks)]
setkey(DF_IOSTAT_ALL,interval)
start_in <- time.breaks[60]
end_in <- time.breaks[61]
benchmark(a <- DF_IOSTAT_ALL[J(60)],b <- fun2(DF_IOSTAT_ALL))
# test replications elapsed relative user.self sys.self user.child sys.child
#1 DF_IOSTAT_ALL[J(60)] 100 0.78 1.000000 0.64 0.14 NA NA
#2 fun2(DF_IOSTAT_ALL) 100 6.69 8.576923 5.76 0.91 NA NA
all.equal(a,b[,.SD,.SDcols=c(12,1:11,13)]) #test for equality (rearranging columns to match)
#TRUE
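Since the snippet above relies on objects from the answer author's session, here is a self-contained sketch of the same idea on made-up data, with hypothetical object names: bucket the timestamps into hourly intervals with findInterval(), key the table on the bucket, and subset by key.

library(data.table)

# Toy data purely for illustration: one row every 10 seconds over two days
n <- 2 * 24 * 360
DT <- data.table(
  date_stamp = as.POSIXct("2012-08-20 00:00:00") + seq_len(n) * 10,
  cpu        = runif(n)
)

# Hourly breaks spanning the data, and an interval id for every row
time.breaks <- seq(as.POSIXct("2012-08-20 00:00:00"), by = "1 hour", length.out = 49)
DT[, interval := findInterval(date_stamp, time.breaks)]
setkey(DT, interval)

# Keyed lookup: all rows falling in the 14th hourly bucket (13:00-14:00 on the first day)
DT[J(14L)]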
