https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/internal/scraper/processscraper/documentation.md
I have been using the library linked above, which gives me three values for a single process:
user time, system time, and wait time.
One example set of values is: 0.05, 0.01, 0.00.
How can I calculate the CPU percentage of a particular process?
To calculate the CPU load/utilization percent of a process over a period, you need ("total system CPU time during the period" + "total user CPU time during the period") / "period", multiplied by 100 to get a percentage.
In your case, suppose you take a sample every 2 seconds. Then for every sample you need to calculate:
( (process.cpu.time.sys - previous_process.cpu.time.sys) + (process.cpu.time.user - previous_process.cpu.time.user) ) / 2
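As a minimal sketch of that calculation (in R, with made-up counter values; prev and curr stand in for two consecutive scrapes of the cumulative process.cpu.time metrics):
# Hypothetical cumulative CPU-time counters from two samples taken 2 seconds apart
prev <- list(user = 10.00, sys = 2.00)
curr <- list(user = 10.05, sys = 2.01)
period <- 2  # seconds between samples
# Fraction of one CPU used over the period, scaled to a percentage
cpu_pct <- ((curr$sys - prev$sys) + (curr$user - prev$user)) / period * 100
cpu_pct
## [1] 3
So this process used 3% of one CPU over that 2-second window.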
In fact, I'd like to use a regex in shell to extract the seconds from the "Total execution time:" line. Could someone help me with this?
Here is the target string:
Val Loss: 20.032490309035197
Val Accuracy: 0.13
SystemML Statistics:
Total elapsed time: 80.698 sec.
Total compilation time: 1.325 sec.
Total execution time: 79.373 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 141449/0/0/2.
Cache writes (WB, FS, HDFS): 22097/0/0.
Cache times (ACQr/m, RLS, EXP): 0.151/0.024/0.285/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/1802.
HOP DAGs recompile time: 1.649 sec.
Functions recompiled: 1.
Functions recompile time: 0.006 sec.
Paramserv func number of workers: 1.
Paramserv func total gradients compute time: 38.000 secs.
Paramserv func total aggregation time: 29.604 secs.
Paramserv func model broadcasting time: 0.008 secs.
Paramserv func total batch slicing time: 0.000 secs.
Total JIT compile time: 20.714 sec.
Total JVM GC count: 228.
Total JVM GC time: 3.195 sec.
You may use this awk command, which matches the line and prints its second-to-last field, $(NF-1) (on that line the last field is "sec." and the field before it is the number):
awk '/Total execution time:/{print $(NF-1)}' file
79.373
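If you'd rather do the extraction with an explicit regex, here is a sketch in R (the file name "file" and the exact pattern are assumptions; it expects the statistics output saved to that file):
txt <- readLines("file")  # read the SystemML statistics output
line <- grep("Total execution time:", txt, value = TRUE)
secs <- as.numeric(sub(".*Total execution time: *([0-9.]+) sec.*", "\\1", line))
secs
## [1] 79.373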
I had a weird experience with Matlab. I wanted to measure the time needed for matrix multiplication, as follows:
tic;
for i = 1:10^-3:10^4
    [1 1 1]*[1 1 1; 1 1 1; 1 1 1];
end
toc;
and
tic;
for i = 1:10^-3:10^4
    [1 1 1; 1 1 1; 1 1 1]*[1 1 1; 1 1 1; 1 1 1];
end
toc;
Now, the result for the first one was:
Elapsed time is 7.707570 seconds.
so I expected the second one to take about 23 seconds (three times as long, since the first needs n^2 = 9 scalar multiplications per iteration while the second needs n^3 = 27), but the result was:
Elapsed time is 10.558797 seconds.
Can someone explain to me what happened here?
Thanks
I know there are many questions here on SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this question is about understanding the inner workings of both methods, and about figuring out why I can't get either to work for concatenating a list of almost 1 million data.frames with the same structure, the same field names, etc. into a single data.frame. Each data.frame has one row and 21 columns.
The data started out as a JSON file, which I converted to lists using fromJSON. I then ran another lapply to extract part of each list, converted those parts to data.frames, and ended up with a list of data.frames.
I've tried:
df <- do.call("rbind", list)
df <- ldply(list)
but I've had to kill the process after letting it run for up to 3 hours without getting anything back.
Is there a more efficient method of doing this? How can I troubleshoot what is happening, and why is it taking so long?
FYI: I'm using RStudio Server on a 72 GB quad-core server running RHEL, so I don't think memory is the problem. sessionInfo() below:
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] multicore_0.1-7 plyr_1.7.1 rjson_0.2.6
loaded via a namespace (and not attached):
[1] tools_2.14.1
>
Given that you are looking for performance, a data.table solution should be suggested.
There is a function rbindlist, which does the same thing as do.call(rbind, list) but is much faster:
library(data.table)
X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
## user system elapsed
## 0.00 0.01 0.02
It is also very fast for a list of data.frames:
Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.frame <- rbindlist(Xdf))
## user system elapsed
## 0.03 0.00 0.03
For comparison
system.time(docall <- do.call(rbind, Xdf))
## user system elapsed
## 50.72 9.89 60.88
And some proper benchmarking
library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X),
          rbindlist.data.frame = rbindlist(Xdf),
          docall = do.call(rbind, Xdf),
          replications = 5)
## test replications elapsed relative user.self sys.self
## 3 docall 5 276.61 3073.444445 264.08 11.4
## 2 rbindlist.data.frame 5 0.11 1.222222 0.11 0.0
## 1 rbindlist.data.table 5 0.09 1.000000 0.09 0.0
and against @JoshuaUlrich's solutions:
benchmark(use.rbl.dt = rbl.dt(X),
          use.rbl.ju = rbl.ju(Xdf),
          use.rbindlist = rbindlist(X),
          replications = 5)
## test replications elapsed relative user.self
## 3 use.rbindlist 5 0.10 1.0 0.09
## 1 use.rbl.dt 5 0.10 1.0 0.09
## 2 use.rbl.ju 5 0.33 3.3 0.31
I'm not sure you really need to use as.data.frame at all, because a data.table inherits from class data.frame.
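A quick check of that claim (a one-off illustration, not from the original answer):
library(data.table)
dt <- data.table(a = 1:3)
class(dt)
## [1] "data.table" "data.frame"
is.data.frame(dt)
## [1] TRUE
So anything that dispatches on data.frame methods will usually accept a data.table directly.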
rbind.data.frame does a lot of checking you don't need. This should be a pretty quick transformation if you only do exactly what you want.
# Use data from Josh O'Brien's post.
set.seed(21)
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time({
    Names <- names(X[[1]])  # Get data.frame names from the first list element.
    # For each name, extract its values from each data.frame in the list.
    # This provides a list with an element for each name.
    Xb <- lapply(Names, function(x) unlist(lapply(X, `[[`, x)))
    names(Xb) <- Names          # Give Xb the correct names.
    Xb.df <- as.data.frame(Xb)  # Convert Xb to a data.frame.
})
# user system elapsed
# 3.356 0.024 3.388
system.time(X1 <- do.call(rbind, X))
# user system elapsed
# 169.627 6.680 179.675
identical(X1,Xb.df)
# [1] TRUE
Inspired by the data.table answer, I decided to try and make this even faster. Here's my updated solution, to try and keep the check mark. ;-)
# My "rbind list" function
rbl.ju <- function(x) {
    u <- unlist(x, recursive = FALSE)  # one element per column per data.frame
    n <- names(u)
    un <- unique(n)                    # the original column names
    # For each column name, concatenate that column's values across all data.frames
    l <- lapply(un, function(N) unlist(u[N == n], FALSE, FALSE))
    names(l) <- un
    as.data.frame(l)                   # assemble the columns into a single data.frame
}
# Simple wrapper around rbindlist that returns a data.frame
rbl.dt <- function(x) {
    as.data.frame(rbindlist(x))
}
library(data.table)
if (packageVersion("data.table") >= "1.8.2") {
    system.time(dt <- rbl.dt(X))  # rbindlist only exists in recent versions
}
# user system elapsed
# 0.02 0.00 0.02
system.time(ju <- rbl.ju(X))
# user system elapsed
# 0.05 0.00 0.05
identical(dt,ju)
# [1] TRUE
Your observation that the time taken increases much faster than linearly with the number of data.frames suggests that breaking the rbinding into two stages could speed things up.
This simple experiment seems to confirm that that's a very fruitful path to take:
## Make a list of 50,000 data.frames
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
## First, rbind together all 50,000 data.frames in a single step
system.time({
    X1 <- do.call(rbind, X)
})
# user system elapsed
# 137.08 57.98 200.08
## Doing it in two stages cuts the processing time by >95%
## - In Stage 1, 100 groups of 500 data.frames are rbind'ed together
## - In Stage 2, the resultant 100 data.frames are rbind'ed
system.time({
    X2 <- lapply(1:100, function(i) do.call(rbind, X[((i * 500) - 499):(i * 500)]))
    X3 <- do.call(rbind, X2)
})
# user system elapsed
# 6.14 0.05 6.21
## Checking that the results are the same
identical(X1, X3)
# [1] TRUE
You have a list of data.frames that each have a single row. If it is possible to convert each of those to a vector, I think that would speed things up a lot.
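As a minimal sketch of that idea, assuming all 21 columns are numeric (if the columns are mixed types this shortcut doesn't apply):
# Hypothetical stand-in for the real data: one-row, all-numeric data.frames
X <- replicate(10000, data.frame(a = rnorm(1), b = rnorm(1)), simplify = FALSE)
vecs <- lapply(X, unlist)  # collapse each one-row data.frame to a named numeric vector
m <- do.call(rbind, vecs)  # rbind-ing vectors builds a matrix, far cheaper than rbind.data.frame
Xv <- as.data.frame(m)     # convert back to a data.frame once, at the end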
However, assuming that they need to be data.frames, I'll create a function with code borrowed from Dominik's answer to "Can rbind be parallelized in R?":
do.call.rbind <- function(lst) {
    # Repeatedly rbind adjacent pairs, halving the list on each pass,
    # so each row is copied O(log n) times rather than O(n).
    while (length(lst) > 1) {
        idxlst <- seq(from = 1, to = length(lst), by = 2)
        lst <- lapply(idxlst, function(i) {
            if (i == length(lst)) {
                return(lst[[i]])  # odd element left over; carry it forward unchanged
            }
            rbind(lst[[i]], lst[[i + 1]])
        })
    }
    lst[[1]]
}
I have been using this function for several months, and have found it to be faster and to use less memory than do.call(rbind, ...) (the disclaimer is that I've pretty much only used it on xts objects).
The more rows each data.frame has, and the more elements the list has, the more beneficial this function will be.
If you have a list of 100,000 numeric vectors, do.call(rbind, ...) will be better. If you have a list of length one billion, this will be better.
> df <- lapply(1:10000, function(x) data.frame(x = sample(21, 21)))
> library(rbenchmark)
> benchmark(a=do.call(rbind, df), b=do.call.rbind(df))
test replications elapsed relative user.self sys.self user.child sys.child
1 a 100 327.728 1.755965 248.620 79.099 0 0
2 b 100 186.637 1.000000 181.874 4.751 0 0
The relative speed-up grows quickly as you increase the length of the list, since the pairwise approach copies each row far fewer times.
How would I, for example, find out that 6pm is 50% between 4pm and 8pm?
Or that 12am Wednesday is 50% between 12pm Tuesday and 12pm Wednesday?
Convert the times to seconds, calculate the span in seconds, calculate the difference between your desired time and the first time in seconds, divide that difference by the whole span to get a fraction, and then multiply by 100 to get a percentage.
Example:
12 AM = 0 seconds (of day)
12 PM = 43200 seconds (of day)
Your desired time = 3 AM = 10800 seconds of day
Total time span = 43200 - 0 = 43200 seconds
Time difference of your desired time from first time = 10800 - 0 = 10800 seconds
Fraction = 10800 / 43200 = 0.25
Percentage = 0.25 * 100% = 25%
(Sorry don't know Ruby but there's the idea.)
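For completeness, here is the original 4 PM / 6 PM / 8 PM example worked the same way (a quick sketch in R, since the question doesn't name a language):
# Times expressed as seconds of the day
t_start  <- 16 * 3600  # 4 PM
t_target <- 18 * 3600  # 6 PM
t_end    <- 20 * 3600  # 8 PM
(t_target - t_start) / (t_end - t_start) * 100
## [1] 50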
require 'date'
start = Date.new(2008, 4, 10)
middle = Date.new(2008, 12, 12)  # a date between start and enddate
enddate = Date.new(2009, 4, 10)
duration = enddate - start  # duration of the whole span, in days
desired = middle - start    # difference between the desired date and start
fraction = desired / duration
percentage = fraction * 100
puts percentage.to_i        # => 67
Thanks to 'John W' for the math.