I have stock data at the tick level and would like to create a rolling list of all ticks for the previous 10 seconds. The code below works, but takes a very long time for large amounts of data. I'd like to vectorize this process or otherwise make it faster, but I'm not coming up with anything. Any suggestions or nudges in the right direction would be appreciated.
library(quantmod)
set.seed(150)
# Create five minutes of xts example data at .1 second intervals
mins <- 5
ticks <- mins * 60 * 10 + 1
times <- xts(runif(ticks, 1, 100),
             order.by = seq(as.POSIXct("1973-03-17 09:00:00"),
                            as.POSIXct("1973-03-17 09:05:00"), length = ticks))
# Randomly remove some ticks to create unequal intervals
times <- times[runif(seq_along(times))>.3]
# Number of seconds to look back
lookback <- 10
dist.list <- vector("list", nrow(times))  # preallocate one slot per tick
system.time(
  for (i in seq_along(times)) {
    dist.list[[i]] <- times[paste(strptime(index(times[i]) - (lookback - 1), format = "%Y-%m-%d %H:%M:%S"), "/",
                                  strptime(index(times[i]) - 1, format = "%Y-%m-%d %H:%M:%S"), sep = "")]
  }
)
# user system elapsed
# 6.12 0.00 5.85
You should check out the window() function; it will make your subselection of dates a lot easier. The following code uses lapply to do the work of the for loop.
# Your code
system.time(
  for (i in seq_along(times)) {
    dist.list[[i]] <- times[paste(strptime(index(times[i]) - (lookback - 1), format = "%Y-%m-%d %H:%M:%S"), "/",
                                  strptime(index(times[i]) - 1, format = "%Y-%m-%d %H:%M:%S"), sep = "")]
  }
)
# user system elapsed
# 10.09 0.00 10.11
# My code
system.time(
  dist.list <- lapply(index(times),
                      function(x) window(times, start = x - lookback - 1, end = x))
)
# user system elapsed
# 3.02 0.00 3.03
So, it runs in about a third of the time.
But if you really want to speed things up, and you are willing to forgo millisecond accuracy (which I think your original method implicitly does anyway), you can run the loop over unique date-hour-second combinations only, because all ticks within the same second return the same time window. This should speed things up roughly twenty to thirty times:
dat.time <- unique(as.POSIXct(as.character(index(times))))  # cheesy method to drop the milliseconds
system.time(dist.list.2 <- lapply(dat.time, function(x) window(times, start = x - lookback - 1, end = x)))
# user system elapsed
# 0.37 0.00 0.39
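If you want to push the vectorized idea further, here is a sketch (untested against your data): since the xts index is sorted, you can locate each window's start with findInterval() on the numeric timestamps and subset by integer position, avoiding repeated date parsing entirely. The boundaries below keep everything in (t - lookback, t], including the current tick, which differs slightly from the string-based version, so adjust to taste.
idx <- as.numeric(index(times))
# For each tick, find the first position whose timestamp falls inside (t - lookback, t]
starts <- findInterval(idx - lookback, idx) + 1
dist.list.3 <- lapply(seq_along(idx), function(i) times[starts[i]:i])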
I'm using MicroPython on an ESP32 microcontroller, flashed with the latest firmware at the time of writing (v1.18).
I'm making a (sort-of) alarm system where I get multiple time values ("13:15", for example) from my website, and then I have to ring an alarm bell at those times.
I've done the website and I can do the ringing part, but I don't know how to actually create time objects from the aforementioned strings ("13:15") and then check whether any of the times entered match the current time; the date is irrelevant.
From reading the documentation, I'm getting the sense that this can't be done: I've looked through the MicroPython modules on GitHub, and you apparently can't get datetime in MicroPython, though I know that in regular Python my problem could be solved with datetime.
import ntptime
import time
import network
ssid = "my-ssid"        # placeholder: your WiFi credentials
passwd = "my-password"  # placeholder
# Set up the ESP32 as a WiFi station
station = network.WLAN(network.STA_IF)
# Activate the WiFi station
station.active(True)
# Connect to the WiFi access point
station.connect(ssid, passwd)
while not station.isconnected():
    print('.')
    time.sleep(1)
print(station.ifconfig())
try:
    print("Local time before synchronization: %s" % str(time.localtime()))
    # Sync the on-board RTC from an NTP server (the result is UTC)
    ntptime.settime()
    print("Local time after synchronization: %s" % str(time.localtime()))
except OSError:
    print("Error syncing time, exiting...")
This is the shortened code from my project, with only the time parts; what I still don't know how to do is the time comparison.
I use ntptime to get the time from a server ("time.google.com"). Then I transform it into seconds since midnight (st), which is easier to compare against, and express my target hours in seconds (1 hour = 3600 s).
import utime
import ntptime
def server_time():
    try:
        # Ask the time.google.com server for the current time.
        ntptime.host = "time.google.com"
        ntptime.settime()
        t = utime.localtime()
        # Transform the time tuple 't' into seconds since midnight:
        # t[3] = hour, t[4] = minute, t[5] = second.
        st = t[3] * 3600 + t[4] * 60 + t[5]
        return st
    except OSError:
        # NTP request failed (e.g. network timeout); return a sentinel.
        st = -1
        return st
period = utime.ticks_ms()  # reference point (missing from the original snippet)
while True:
    # utime.ticks_ms() returns an increasing millisecond counter since board reset.
    # (For long uptimes, utime.ticks_diff() is the wrap-safe way to compare ticks.)
    now = utime.ticks_ms()
    # Check the current time every 5000 ms (5 s) without sleeping or blocking other work.
    if now >= period + 5000:
        period += 5000
        # Call the server_time() function defined above.
        st = server_time()
        if ((st > 0) and (st < 39600)) or (st > 82800):  # turn on at 17:00 Mexico time
            pass  # something will be ON between 17:00 and 06:00
        elif (st < 82800) and (st > 39600):  # turn off at 06:00
            pass  # something will be OFF between 06:00 and 17:00
        else:
            pass
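To tie this back to the original question, the "13:15"-style strings can be put on the same seconds-since-midnight scale and compared against st. A minimal sketch (alarm_times and ring_bell() are assumed placeholders; if your alarm strings are in local time, fold your UTC offset into the comparison first):
alarm_times = ["13:15", "18:30"]  # assumed: values fetched from the website
def hhmm_to_seconds(s):
    # "13:15" -> 13*3600 + 15*60 = 47700
    h, m = s.split(":")
    return int(h) * 3600 + int(m) * 60
# Ring if the current time falls within the alarm's minute
for a in alarm_times:
    target = hhmm_to_seconds(a)
    if target <= st < target + 60:
        ring_bell()  # assumed: your existing ring routine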
After running ntptime.settime() you can do the following to retrieve the time; keep in mind this is in UTC:
import machine
rtc = machine.RTC()
# rtc.datetime() -> (year, month, day, weekday, hour, minute, second, subsecond)
hour = str(rtc.datetime()[4]) if rtc.datetime()[4] > 9 else "0%s" % rtc.datetime()[4]
minute = str(rtc.datetime()[5]) if rtc.datetime()[5] > 9 else "0%s" % rtc.datetime()[5]
The if/else makes sure that numbers less than or equal to 9 are padded with a leading zero, so the result is always a two-character string.
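Building on that, the comparison against the question's "13:15"-style strings can then be a plain string test, since both sides are zero-padded "HH:MM". A minimal sketch (alarm_times and ring_bell() are assumed placeholders; remember the RTC holds UTC):
alarm_times = ["13:15", "18:30"]  # assumed: values fetched from the website
now_hhmm = "%s:%s" % (hour, minute)  # e.g. "09:05", using the padded values above
if now_hhmm in alarm_times:
    ring_bell()  # assumed: your existing ring routine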
Here's my task: I need to get the number of minutes between two times, for example between "8:15" and "7:45". I have the following code:
(Time.parse("8:15") - Time.parse("7:45")).minute
But I get the result as "108000.0 seconds".
How can I fix it?
The result you get back is a float of the number of seconds, not a Time object. So, to get the number of minutes and seconds between the two times:
require 'time'
t1 = Time.parse("8:15")
t2 = Time.parse("7:45")
total_seconds = (t1 - t2) # => 1800.0
minutes = (total_seconds / 60).floor # => 30
seconds = total_seconds.to_i % 60 # => 0
puts "difference is #{minutes} minute(s) and #{seconds} second(s)"
Using floor and modulus (%) lets you split the result into minutes and seconds so it's more human-readable, rather than ending up with "6.57 minutes".
You can avoid weird time parsing gotchas (Daylight Saving, running the code around midnight) by simply doing some math on the hours and minutes instead of parsing them into Time objects. Something along these lines (I'd verify the math with tests):
one = "8:15"
two = "7:45"
h1, m1 = one.split(":").map(&:to_i)
h2, m2 = two.split(":").map(&:to_i)
puts (h1 - h2) * 60 + m1 - m2
If you do want to take Daylight Saving into account (e.g. you sometimes want an extra hour added or subtracted depending on today's date) then you will need to involve Time, of course.
Time subtraction returns the value in seconds. So divide by 60 to get the answer in minutes:
(Time.parse("8:15") - Time.parse("7:45")) / 60
# => 30.0
I'm having problems counting the number of lines in a messy csv.bz2 file.
Since this is a huge file, I want to preallocate a data frame before reading the bzip2 file with the read.csv() function.
As you can see in the following tests, my results vary widely, and none of them matches the actual number of rows in the csv.bz2 file.
> system.time(nrec1 <- as.numeric(shell('type "MyFile.csv" | find /c ","', intern=T)))
user system elapsed
0.02 0.00 53.50
> nrec1
[1] 1060906
> system.time(nrec2 <- as.numeric(shell('type "MyFile.csv.bz2" | find /c ","', intern=T)))
user system elapsed
0.00 0.02 10.15
> nrec2
[1] 126715
> system.time(nrec3 <- as.numeric(shell('type "MyFile.csv" | find /v /c ""', intern=T)))
user system elapsed
0.00 0.02 53.10
> nrec3
[1] 1232705
> system.time(nrec4 <- as.numeric(shell('type "MyFile.csv.bz2" | find /v /c ""', intern=T)))
user system elapsed
0.00 0.01 4.96
> nrec4
[1] 533062
The most interesting result is the one I called nrec4, since it takes almost no time and returns roughly half the number of rows of nrec1, but I'm totally unsure whether the naive multiplication by 2 is reliable.
I have tried several other methods, including fread() and hsTableReader(), but the former crashes and the latter is so slow that I won't consider it further.
My questions are:
What reliable method can I use to count the number of rows in a csv.bz2 file?
Is it OK to use a formula to estimate the number of rows of a csv.bz2 file directly, without decompressing it?
Thanks in advance,
Diego
Roland was right from the beginning.
Even after running the garbage collector, the illusion of improved performance remained.
I had to close and restart R to do an accurate test.
Yes, the process is still a little faster, by a few seconds (red line), and the increase in RAM consumption is more uniform when using nrows.
But at least in this case it is not worth the effort of trying to optimize the read.csv() call.
It is slow, but it is what it is.
If someone knows of a faster approach, I'm interested.
(fread() crashes, in case you were going to suggest it.)
Thanks.
Without nrows (Blue Line)
Sys.time()
system.time(storm.data <- read.csv(fileZip,
header = TRUE,
stringsAsFactors = F,
comment.char = "",
colClasses = "character"))
Sys.time()
rm(storm.data)
gc()
With nrows (Red Line)
Sys.time()
system.time(nrec12 <- as.numeric(
shell('type "MyFile.csv.bz2" | find /v /c ""',
intern=T)))
nrec12 <- nrec12 * 2
system.time(storm.data <- read.csv(fileZip,
stringsAsFactors = F,
comment.char = "",
colClasses = "character",
nrows = nrec12))
Sys.time()
rm(storm.data)
gc()
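As a follow-up, one reliable way to count the rows without decompressing the file to disk is to stream it through a bzfile() connection and count lines in chunks. A minimal sketch (the function name and chunk size are illustrative; it still reads the whole file once, so it is not instant):
count_bz2_lines <- function(path, chunk = 100000L) {
  con <- bzfile(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    n <- n + length(lines)
  }
  n
}
nrec <- count_bz2_lines("MyFile.csv.bz2")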
Is there a faster way to do this? I guess this is unnecessarily slow and that a task like this can be accomplished with base functions.
df <- ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc)))
I'm quite new to R. I have looked at by(), aggregate(), and tapply(), but didn't get them to work at all, or at least not in the way I wanted. Rather than returning a shorter vector, I want to attach the sum to the original data frame. What is the best way to do this?
Edit: Here is a speed comparison of the answers applied to my data.
> # My original solution
> system.time( ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc))) )
user system elapsed
14.405 0.000 14.479
> # Paul Hiemstra
> system.time( ddply(df, "id", transform, perc.total = sum(cand.perc)) )
user system elapsed
15.973 0.000 15.992
> # Richie Cotton
> system.time( with(df, tapply(df$cand.perc, df$id, sum))[df$id] )
user system elapsed
0.048 0.000 0.048
> # John
> system.time( with(df, ave(cand.perc, id, FUN = sum)) )
user system elapsed
0.032 0.000 0.030
> # Christoph_J
> system.time( df[ , list(perc.total = sum(cand.perc)), by="id"][df])
user system elapsed
0.028 0.000 0.028
Since you are quite new to R and speed is apparently an issue for you, I recommend the data.table package, which is really fast. One way to solve your problem in one line is as follows:
library(data.table)
DT <- data.table(ID = rep(1:3, each = 3),
                 cand.perc = 1:9,
                 key = "ID")
DT <- DT[ , perc.total := sum(cand.perc), by = ID]
DT
     ID cand.perc perc.total
[1,]  1         1          6
[2,]  1         2          6
[3,]  1         3          6
[4,]  2         4         15
[5,]  2         5         15
[6,]  2         6         15
[7,]  3         7         24
[8,]  3         8         24
[9,]  3         9         24
Disclaimer: I'm not a data.table expert (yet ;-), so there might be faster ways to do that. Check out the package site to get you started if you are interested in using the package: http://datatable.r-forge.r-project.org/
For any kind of aggregation where you want a result vector the same length as the input, with the group statistic replicated across each group, ave() is what you want.
df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
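A quick reproducible check on a built-in dataset (warpbreaks is used purely for illustration):
# Append the per-group sum to every row of its group
warpbreaks$sum.by.wool <- ave(warpbreaks$breaks, warpbreaks$wool, FUN = sum)
head(warpbreaks, 3)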
Use tapply to get the group stats, then add them back into your dataset afterwards.
Reproducible example:
means_by_wool <- with(warpbreaks, tapply(breaks, wool, mean))
warpbreaks$means.by.wool <- means_by_wool[warpbreaks$wool]
Untested solution for your scenario:
sum_by_id <- with(df, tapply(cand.perc, id, sum))
df$perc.total <- sum_by_id[df$id]
@ilprincipe, if none of the above fits your needs, you could try transposing your data:
dft <- t(df)
Then use aggregate:
dfta <- aggregate(dft, by = list(rownames(dft)), FUN = sum)
Then restore your row names:
rownames(dfta) <- dfta[, 1]
dfta <- dfta[, 2:ncol(dfta)]
Transpose back to the original orientation:
df2 <- t(dfta)
And bind to the original data:
newdf <- cbind(df, df2)
Why are you using cbind(x, ...)? The output of ddply will be appended automatically. This should work:
ddply(df, "id", transform, perc.total = sum(cand.perc))
Getting rid of the superfluous cbind should speed things up.
You can also load up your favorite foreach backend and try the .parallel=TRUE argument for ddply.
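For example, a minimal sketch with the doParallel backend (package choice and core count are illustrative assumptions):
library(plyr)
library(doParallel)
# Register a foreach backend; ddply's .parallel = TRUE runs the groups via foreach
registerDoParallel(cores = 4)
df <- ddply(df, "id", transform, perc.total = sum(cand.perc), .parallel = TRUE)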
If I have @time = Time.now.strftime("%Y-%m-%d %H:%M:%S"),
how can I reduce this time by 15 minutes?
I already tried @reducetime = @time - 15.minutes; it works fine in the console but gives errors during execution. Other than this, is there any way to resolve the issue?
Thanks
Your problem is that you're formatting your time into a string before you're done treating it as a time. This would make more sense:
@time = Time.now
@reducetime = @time - 15.minutes  # 15.minutes requires ActiveSupport (Rails)
# And then later, when you're ready to display @time...
formatted_time = @time.strftime("%Y-%m-%d %H:%M:%S")
You shouldn't format your data until right before you're ready to display it.
If you must have @time as the formatted string, then you're going to have to parse it back before computing @reducetime:
@reducetime = (DateTime.strptime(@time, "%Y-%m-%d %H:%M:%S") - 15.minutes).to_time