Why doesn't this data.table function modify the argument? [duplicate]

I'm writing a function that, among other things, coerces the input into a data.table.
library(data.table)
df <- data.frame(id = 1:10)
f <- function(df){setDT(df)}
f(df)
df[, temp := 1]
However, the last command outputs the following warning:
Warning message: In [.data.table(df, , :=(temp, 1)) : Invalid
.internal.selfref detected and fixed by taking a copy of the whole
table so that := can add this new column by reference. At an earlier
point, this data.table has been copied by R (or been created manually
using structure() or similar). Avoid key<-, names<- and attr<- which
in R currently (and oddly) may copy the whole data.table. Use set*
syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also,
in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list()
used to copy named objects); please upgrade to R>v3.0.2 if that is
biting. If this message doesn't help, please report to datatable-help
so the root cause can be fixed.
I'm using v1.9.3 of data.table and R 3.1.1. Does this mean df is copied at some point? How can I avoid this warning?
The code of setDT actually uses NSE. So this seems to work:
df1 <- data.frame(id = 1:10)
f <- function(df){eval(substitute(setDT(df)),parent.frame())}
f(df1)
df1[, temp := 1]
It seems I can do other things with df within the function f, like
df1 <- data.frame(id = 1:10)
f <- function(df){
  eval(substitute(setDT(df)),parent.frame())
  df[, temp := 1]
}
f(df1)
Is this the right way to do it?

Great question! The warning message should say: ... and fixed by taking a shallow copy of the whole table .... Will fix this.
setDT does two things:
1. set the class to data.table from data.frame/list
2. use alloc.col to over-allocate columns (so that := can be used directly)
The 2nd step requires a shallow copy if the input is not already a data.table. This is why we assign the value back to the symbol in its environment (setDT's parent frame). But the parent frame for setDT is your function f(). Therefore the setDT(df) within your function goes through smoothly, but the df that resides in the global environment only has its class changed, not the over-allocation (as the shallow copy severed the link).
In the next step, := detects that and shallow copies once again to over-allocate.
The idea so far is to use setDT to convert to data.tables before providing it to a function. But I'd like these cases to be resolved (will take a look).
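A minimal sketch of that workaround, assuming the data.table package is installed; add_temp is a hypothetical helper name, not something from the question. Converting with setDT in the frame that owns df means the over-allocation applies to the caller's object, so := inside the function should be able to add the column by reference:

```r
library(data.table)

df <- data.frame(id = 1:10)
setDT(df)  # convert in the frame that owns df, before any function call

# Hypothetical helper: adds a column by reference to the data.table it receives.
add_temp <- function(dt) {
  dt[, temp := 1]
  invisible(dt)
}

add_temp(df)
names(df)
```

The function never calls setDT itself, so no shallow copy is made inside it and the link to the caller's object stays intact.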
Thanks a bunch!


Input f into play3d() and movie3d() in the rgl package in R

I don't understand the input f expected by play3d and movie3d in the rgl package.
plot3d(df[,1],df[,2],df[,3], type = "n", radius = .2 )
myplotfunction<-function(x) {
  rgl.spheres(x=x$x,y=x$y,z=x$z, type="s", r=0.025)
}
When executing the 2 lines below, the animation does play but both lines (play3d() and movie3d()) trigger the error displayed below:
play3d(f=lapply(listofobs,myplotfunction), fps=1 )
movie3d(f=lapply(listofobs,myplotfunction), fps=1 , duration=20)
I am hoping someone can correct my code and help me understand the f input to play3d and movie3d.
Question 1: Why is the play3d line above correct enough that the animation does display correctly?
Question 2: Why is the play3d line above incorrect enough that it triggers the error?
Question 3: What is wrong with the movie3d line that it does not produce a video output?
As the docs say, f is "A function returning a list that may be passed to par3d". It's not a list, which is what your usage passes.
To answer the questions:
1. R evaluates the lapply call, which does the animation; then play3d looks at the result and dies because it's not a function.
2. f needs to be a function, as described in the help page.
3. It dies when it looks at f, because it's not a function.
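The difference between passing a function and passing the result of a call can be sketched in base R, without rgl. run_frames below is a hypothetical stand-in for an animation driver (it is not how play3d is implemented); it only illustrates why an already evaluated lapply() result fails where a function of time works:

```r
# Hypothetical stand-in for an animation driver: calls f once per time step.
run_frames <- function(f, times) {
  if (!is.function(f)) stop("f must be a function of time")
  lapply(times, f)
}

# A function of time that returns a list, as f is expected to.
draw <- function(time) list(frame = round(time))

ok <- run_frames(draw, times = 1:3)

# lapply() is evaluated first, so a list (not a function) arrives as f.
bad <- tryCatch(run_frames(lapply(1:3, draw), times = 1:3),
                error = function(e) conditionMessage(e))
```

In the second call all the work happens while the argument is being evaluated, which matches the symptom in the question: the animation plays, then the error appears.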
This looks like it will do what you want:
plot3d(df, type = "n" )
id <- NA
myplotfunction<-function(time) {
  index <- round(time)
  # For a 3x faster display, use index <- round(3*time)
  # To cycle through the points several times, use
  # index <- round(3*time) %% nobs + 1
  if (!is.na(id))
    pop3d(id = id) # Delete previous item
  id <<- spheres3d(df[index,], r=0.025)
  list() # f must return a list that may be passed to par3d
}
play3d(myplotfunction, startTime = 1, duration = nobs - 1)
movie3d(myplotfunction, startTime = 1, duration = nobs - 1, fps = 1)
This will leave a GIF in file.path(tempdir(), "movie.gif").
Some other notes:
Don't call rgl.spheres. It will cause you immense pain later. Use spheres3d; or never call any *3d function and never upgrade rgl, because you're living in the past using the rgl.* functions. The *3d functions and the rgl.* functions don't play nicely together.
To construct a dataframe, just use the data.frame() function; don't convert a matrix.
You don't need all those contortions to extract points from the dataframe. Most rgl functions can handle a dataframe with x, y, and z columns.
You might notice the plot3d frame move a little: spheres are bigger than points, so it will adjust to accommodate them. You could use xlim, ylim and zlim to set the original frame a little bigger if you don't like this.

Is there a way to write single band raster from multiple raster stacks

I have 4 subfolders that each contain 5 rasters with continuous values. So I built a "for" loop to:
1. list these raster files
2. stack these files per folder, i.e. 4 RasterStack objects (each containing 5 rasters)
3. apply a threshold to transform the continuous rasters into binary rasters
4. write the binary rasters using the writeRaster function
My issue is in step 4. Even though I use the argument "bylayer = T" in the writeRaster function, the rasters saved were a RasterStack with the 5 binary rasters. I want to write it per raster, per file, per band.
I would be really grateful if anyone could give me any insights.
sub <- list.dirs(full.names=FALSE, recursive=FALSE)
for(j in 1:length(sub)) {
h <- list.files(path=sub[j], recursive=TRUE, full.names=TRUE, pattern='.tif')
stack_present <- stack(h)
binary_0.2 <- stack_present >=0.2
writeRaster(binary_0.2, filename=paste0(sub[j], bylayer = T, suffix = "_bin.tif"), overwrite=TRUE)
}
This is wrong because the argument "bylayer" is lost, as it becomes part of the filename:
writeRaster(binary_0.2, filename=paste0(sub[j], bylayer = T, suffix = "_bin.tif"), overwrite=TRUE)
It should be something like this (and it helps to do it in two steps)
f <- paste0(sub[j], "bin.tif")
writeRaster(binary_0.2, filename=f, bylayer=TRUE, overwrite=TRUE)
Illustrated here
b <- brick(system.file("external/rlogo.grd", package="raster"))
writeRaster(b, filename="abc.tif", bylayer=T)
#[1] "abc_1.tif" "abc_2.tif" "abc_3.tif"
writeRaster(b, filename="bin.tif", bylayer=T, suffix = paste0("f", 1:3))
#[1] "bin_f1.tif" "bin_f2.tif" "bin_f3.tif"
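The filename problem can be reproduced in base R alone. In this small sketch, folder is a made-up name standing in for sub[j]; it shows that paste0() treats named arguments as more strings to concatenate, so bylayer and suffix never reach writeRaster:

```r
folder <- "sub1"  # hypothetical folder name, standing in for sub[j]

# Named arguments to paste0() are simply concatenated into the result.
wrong <- paste0(folder, bylayer = TRUE, suffix = "_bin.tif")

# Build the filename on its own; bylayer then belongs in the writeRaster() call.
right <- paste0(folder, "_bin.tif")

wrong  # "sub1TRUE_bin.tif"
right  # "sub1_bin.tif"
```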
Alternatively, you can loop over the files within each folder

add columns to data frame using foreach and %dopar%

In Revolution R 2.12.2 on Windows 7 and Ubuntu 64-bit 11.04 I have a data frame with over 100K rows and over 100 columns, and I derive ~5 columns (sqrt, log, log10, etc) for each of the original columns and add them to the same data frame. Without parallelism using foreach and %do%, this works fine, but it's slow. When I try to parallelize it with foreach and %dopar%, it will not access the global environment (to prevent race conditions or something like that), so I cannot modify the data frame because the data frame object is 'not found.'
My question is how can I make this faster? In other words, how to parallelize either the columns or the transformations?
Simplified example:
w <- startWorkers()
transform_features <- function()
{
  cols<-c(1,2,3,4) # in my real code I select certain columns (not all)
  foreach(thiscol=cols, mydata) %dopar% {
    name <- names(mydata)[thiscol]
    print(paste('transforming variable ', name))
    mydata[,paste(name, 'sqrt', sep='_')] <<- sqrt(mydata[,thiscol])
    mydata[,paste(name, 'log', sep='_')] <<- log(mydata[,thiscol])
  }
}
n<-10 # I often have 100K-1M rows
mydata <- data.frame(
  a=runif(n,1,100), b=runif(n,1,100),
  c=runif(n,1,100), d=runif(n,1,100)
)
ncol(mydata) # 4 columns
transform_features()
ncol(mydata) # if it works, there should be 8
Notice if you change %dopar% to %do% it works fine
Try the := operator in data.table to add the columns by reference. You'll need with=FALSE so you can put the call to paste on the LHS of :=.
See When should I use the := operator in data.table?
Might it be easier if you did something like
mydata <- data.frame(
  a=runif(n,1,100), b=runif(n,1,100),
  c=runif(n,1,100), d=runif(n,1,100)
)
mydata_sqrt <- sqrt(mydata)
colnames(mydata_sqrt) <- paste(colnames(mydata), 'sqrt', sep='_')
mydata <- cbind(mydata, mydata_sqrt)
producing something like
> mydata
a b c d a_sqrt b_sqrt c_sqrt d_sqrt
1 29.344088 47.232144 57.218271 58.11698 5.417018 6.872565 7.564276 7.623449
2 5.037735 12.282458 3.767464 40.50163 2.244490 3.504634 1.940996 6.364089
3 80.452595 76.756839 62.128892 43.84214 8.969537 8.761098 7.882188 6.621340
4 39.250277 11.488680 38.625132 23.52483 6.265004 3.389496 6.214912 4.850240
5 11.459075 8.126104 29.048527 76.17067 3.385126 2.850632 5.389669 8.727581
6 26.729365 50.140679 49.705432 57.69455 5.170045 7.081008 7.050208 7.595693
7 42.533937 7.481240 59.977556 11.80717 6.521805 2.735186 7.744518 3.436157
8 41.673752 89.043099 68.839051 96.15577 6.455521 9.436265 8.296930 9.805905
9 59.122106 74.308573 69.883037 61.85404 7.689090 8.620242 8.359607 7.864734
10 24.191878 94.059012 46.804937 89.07993 4.918524 9.698403 6.841413 9.438217
There are two ways you can handle this:
Loop over each column (or, better yet, a subset of the columns) and apply the transformations to create a temporary data frame, return that, and then do cbind of the list of data frames, as @Henry suggested.
Loop over the transformations, apply each to the data frame, and then return the transformation data frames, cbind, and proceed.
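Both options amount to having each task return its new columns and combining the results afterwards, instead of assigning into the global data frame. A minimal base R sketch of the first option, assuming a small made-up data frame; derive_cols is a hypothetical helper, and lapply stands in for whatever parallel map is used (for example foreach with .combine = cbind):

```r
set.seed(1)
mydata <- data.frame(a = runif(10, 1, 100), b = runif(10, 1, 100))

# Hypothetical helper: returns only the derived columns for one input column.
derive_cols <- function(name, data) {
  out <- data.frame(sqrt(data[[name]]), log(data[[name]]))
  names(out) <- paste(name, c("sqrt", "log"), sep = "_")
  out
}

# Each task returns a small data frame; nothing is modified globally.
pieces <- lapply(names(mydata), derive_cols, data = mydata)
mydata <- cbind(mydata, do.call(cbind, pieces))
names(mydata)
```

Because every task only reads its input and returns a value, the same helper should work unchanged when the loop runs on separate workers.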
Personally, the way I tend to do things like this is to create a big.matrix object (either in memory or on disk, using the bigmemory package), so that all of the columns can be accessed in shared memory. Just pre-allocate the columns you will fill in, and you won't need to do a post hoc cbind. I tend to do it on disk. Just be sure to run flush(), to make sure everything is written to disk.

Listing functions with debug flag set in R

I am trying to find a global counterpart to isdebugged() in R. My scenario is that I have functions that make calls to other functions, all of which I've written, and I am turning debug() on and off for different functions during my debugging. However, I may lose track of which functions are set to be debugged. When I forget and start a loop, I may get a lot more output (nuisance, but not terrible) or I may get no output when some is desired (bad).
My current approach is to use a function similar to the one below, and I can call it with listDebugged(ls()) or list the items in a loaded library (examples below). This could suffice, but it requires that I call it with the list of every function in the workspace or in the packages that are loaded. I can wrap another function that obtains these. It seems like there should be an easier way to just directly "ask" the debug function or to query some obscure part of the environment where it is stashing the list of functions with the debug flag set.
So, a two part question:
Is there a simpler call that exists to query the functions with the debug flag set?
If not, then is there any trickery that I've overlooked? For instance, if a function in one package masks another, I suspect I may return a misleading result.
I realize that there is another method I could try and that is to wrap debug and undebug within functions that also maintain a hidden list of debugged function names. I'm not yet convinced that's a safe thing to do.
UPDATE (8/5/11): I searched SO and didn't find earlier questions. However, SO's "related questions" list has shown an earlier question that is similar, though the function in the answer to that question is both more verbose and slower than the function offered by @cbeleites. The older question also doesn't provide any code, while I did. :)
The code:
listDebugged <- function(items){
    isFunction <- vector(length = length(items))
    isDebugged <- vector(length = length(items))
    for(ix in seq_along(items)){
        isFunction[ix] <- is.function(eval(parse(text = items[ix])))
    }
    for(ix in which(isFunction == 1)){
        isDebugged[ix] <- isdebugged(eval(parse(text = items[ix])))
    }
    names(isDebugged) <- items
    return(isDebugged)
}
# Example usage
listDebugged(ls())
Here's my throw at the listDebugged function:
ls.deb <- function(items = search ()){
  .ls.deb <- function (i){
    f <- ls (i)
    f <- mget (f, as.environment (i), mode = "function",
               ## return a function that is not debugged
               ifnotfound = list (function (x) function () NULL)
               )
    if (length (f) == 0)
      return (NULL)
    f <- f [sapply (f, isdebugged)]
    f <- names (f)
    ## now check whether the debugged function is masked by a not debugged one
    masked <- !sapply (f, function (f) isdebugged (get (f)))
    ## generate pretty output format:
    ## "package::function" and "(package::function)" for masked debugged functions
    if (length (f) > 0) {
      if (grepl ('^package:', i)) {
        i <- gsub ('^package:', '', i)
        f <- paste (i, f, sep = "::")
      }
      f [masked] <- paste ("(", f [masked], ")", sep = "")
      f
    } else {
      NULL
    }
  }
  functions <- lapply (items, .ls.deb)
  unlist (functions)
}
I chose a different name, as the output format is only the debugged functions (otherwise I easily get thousands of functions)
the output has the form package::function (or rather namespace::function but packages will have namespaces pretty soon anyways).
if the debugged function is masked, output is "(package::function)"
the default is looking through the whole search path
This is a simple one-liner using lsf.str:
which(sapply(lsf.str(), isdebugged))
You can change environments within the function, see ?lsf.str for more arguments.
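For a lookup that does not depend on which environment is currently the global one, the same idea can be sketched with an explicit environment argument. list_debugged is a hypothetical helper name, and the environment e with its contents is made up for the demonstration:

```r
# Hypothetical helper: names of the functions in `env` whose debug flag is set.
list_debugged <- function(env = globalenv()) {
  funs <- Filter(is.function, mget(ls(env), envir = env))
  names(funs)[vapply(funs, isdebugged, logical(1))]
}

# Made-up environment holding two functions and one non-function.
e <- new.env()
e$a <- function() 1
e$b <- function() 2
e$x <- 42

debug(e$a)
flagged <- list_debugged(e)   # only "a" carries the flag
undebug(e$a)
remaining <- list_debugged(e) # nothing left after undebug()
```

This only inspects one environment; looping over search(), as in the longer answers above, would be needed to cover attached packages.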
Since the original question, I've been looking more and more at Mark Bravington's debug package. If using that package, then check.for.traces() is the appropriate command to list those functions that are being debugged via mtrace.
The debug package is worth a look if one is spending much time with the R debugger and various trace options.
@cbeleites I like your answer, but it didn't work for me. I got this to work, but it is less functional than yours above (no recursive checks, no pretty print)
debug.ls <- function(items = search()){
  .debug.ls <- function(package){
    f <- ls(package)
    # aaply() comes from the plyr package
    active <- f[which(aaply(f, 1, function(x){
      tryCatch(isdebugged(x), error = function(e){FALSE}, finally=FALSE)
    }))]
    active
  }
  functions <- lapply (items, .debug.ls)
  unlist (functions)
}
I constantly get caught in the browser window frame because of failing to undebug functions. So I have created two functions and added them to my .Rprofile. The helper functions are pretty straightforward.
# Returns a vector of functions on which the debug flag is set
debuggedFuns <- function() {
  envs <- search()
  debug_vars <- sapply(envs, function(each_env) {
    funs <- names(Filter(is.function, sapply(ls(each_env), get, each_env)))
    debug_funs <- Filter(isdebugged, funs)
    debug_funs
  })
  return(as.vector(unlist(debug_vars)))
}
# Removes the debug flag from all the functions returned by `debuggedFuns`
# (loginfo() comes from the logging package)
unDebugAll <- function(verbose = TRUE) {
  toUnDebug <- debuggedFuns()
  if (length(toUnDebug) == 0) {
    if (verbose) loginfo('no Functions to `undebug`')
  } else {
    if (verbose) loginfo('undebugging [%s]', paste0(toUnDebug, collapse = ', '))
    for (each_fn in toUnDebug) {
      undebug(each_fn)
    }
  }
}
I have tested them out, and it works pretty well. Hope this helps!

Tricks to manage the available memory in an R session

What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list in 2004] to list (and/or sort) the largest objects and to occasionally rm() some of them. But by far the most effective solution was ... to run under 64-bit Linux with ample memory.
Any other nice tricks folks want to share? One per post, please.
# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
decreasing=FALSE, head=FALSE, n=5) {
napply <- function(names, fn) sapply(names, function(x)
fn(get(x, pos = pos)))
names <- ls(pos = pos, pattern = pattern)
obj.class <- napply(names, function(x) as.character(class(x))[1])
obj.mode <- napply(names, mode)
obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
obj.size <- napply(names, object.size)
obj.dim <- t(napply(names, function(x)
                    as.numeric(dim(x))[1:2]))
vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
obj.dim[vec, 1] <- napply(names, length)[vec]
out <- data.frame(obj.type, obj.size, obj.dim)
names(out) <- c("Type", "Size", "Rows", "Columns")
if (!missing(order.by))
    out <- out[order(out[[order.by]], decreasing=decreasing), ]
if (head)
    out <- head(out, n)
out
}
# shorthand
lsos <- function(..., n=10) {
.ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}
Ensure you record your work in a reproducible script. From time-to-time, reopen R, then source() your script. You'll clean out anything you're no longer using, and as an added benefit will have tested your code.
I use the data.table package. With its := operator you can :
Add columns by reference
Modify subsets of existing columns by reference, and by group by reference
Delete columns by reference
None of these operations copy the (potentially large) data.table at all, not even once.
Aggregation is also particularly fast because data.table uses much less working memory.
Related links :
News from data.table, London R presentation, 2012
When should I use the := operator in data.table?
Saw this on a twitter post and think it's an awesome function by Dirk! Following on from JD Long's answer, I would do this for user friendly reading:
# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
decreasing=FALSE, head=FALSE, n=5) {
napply <- function(names, fn) sapply(names, function(x)
fn(get(x, pos = pos)))
names <- ls(pos = pos, pattern = pattern)
obj.class <- napply(names, function(x) as.character(class(x))[1])
obj.mode <- napply(names, mode)
obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
obj.prettysize <- napply(names, function(x) {
format(utils::object.size(x), units = "auto") })
obj.size <- napply(names, object.size)
obj.dim <- t(napply(names, function(x)
                    as.numeric(dim(x))[1:2]))
vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
obj.dim[vec, 1] <- napply(names, length)[vec]
out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
names(out) <- c("Type", "Size", "PrettySize", "Length/Rows", "Columns")
if (!missing(order.by))
    out <- out[order(out[[order.by]], decreasing=decreasing), ]
if (head)
    out <- head(out, n)
out
}
# shorthand
lsos <- function(..., n=10) {
.ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}
Which results in something like the following:
Type Size PrettySize Length/Rows Columns
pca.res PCA 790128 771.6 Kb 7 NA
DF data.frame 271040 264.7 Kb 669 50
factor.AgeGender factanal 12888 12.6 Kb 12 NA
dates data.frame 9016 8.8 Kb 669 2
sd. numeric 3808 3.7 Kb 51 NA
napply function 2256 2.2 Kb NA NA
lsos function 1944 1.9 Kb NA NA
load loadings 1768 1.7 Kb 12 2
ind.sup integer 448 448 bytes 102 NA
x character 96 96 bytes 1 NA
NOTE: The main part I added was (again, adapted from JD's answer):
obj.prettysize <- napply(names, function(x) {
format(utils::object.size(x), units = "auto") })
I make aggressive use of the subset parameter with selection of only the required variables when passing dataframes to the data= argument of regression functions. It does result in some errors if I forget to add variables to both the formula and the select= vector, but it still saves a lot of time due to decreased copying of objects and reduces the memory footprint significantly. Say I have 4 million records with 110 variables (and I do.) Example:
# library(rms); library(Hmisc) for the cph,and rcs functions
Mayo.PrCr.rbc.mdl <-
cph(formula = Surv(surv.yr, death) ~ age + Sex + nsmkr + rcs(Mayo, 4) +
rcs(PrCr.rat, 3) + rbc.cat * Sex,
data = subset(set1HLI, gdlab2 & HIVfinal == "Negative",
select = c("surv.yr", "death", "PrCr.rat", "Mayo",
"age", "Sex", "nsmkr", "rbc.cat")
) )
By way of setting context and the strategy: the gdlab2 variable is a logical vector that was constructed for subjects in a dataset that had all normal or almost normal values for a bunch of laboratory tests and HIVfinal was a character vector that summarized preliminary and confirmatory testing for HIV.
I love Dirk's .ls.objects() script but I kept squinting to count characters in the size column. So I did some ugly hacks to make it present with pretty formatting for the size:
.ls.objects <- function (pos = 1, pattern, order.by,
decreasing=FALSE, head=FALSE, n=5) {
napply <- function(names, fn) sapply(names, function(x)
fn(get(x, pos = pos)))
names <- ls(pos = pos, pattern = pattern)
obj.class <- napply(names, function(x) as.character(class(x))[1])
obj.mode <- napply(names, mode)
obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
obj.size <- napply(names, object.size)
obj.prettysize <- sapply(obj.size, function(r) prettyNum(r, big.mark = ",") )
obj.dim <- t(napply(names, function(x)
                    as.numeric(dim(x))[1:2]))
vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
obj.dim[vec, 1] <- napply(names, length)[vec]
out <- data.frame(obj.type, obj.size,obj.prettysize, obj.dim)
names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
if (!missing(order.by))
    out <- out[order(out[[order.by]], decreasing=decreasing), ]
out <- out[c("Type", "PrettySize", "Rows", "Columns")]
names(out) <- c("Type", "Size", "Rows", "Columns")
if (head)
    out <- head(out, n)
out
}
That's a good trick.
One other suggestion is to use memory efficient objects wherever possible: for instance, use a matrix instead of a data.frame.
This doesn't really address memory management, but one important function that isn't widely known is memory.limit(). You can increase the default using this command, memory.limit(size=2500), where the size is in MB. As Dirk mentioned, you need to be using 64-bit in order to take real advantage of this.
I quite like the improved objects function developed by Dirk. Much of the time though, a more basic output with the object name and size is sufficient for me. Here's a simpler function with a similar objective. Memory use can be ordered alphabetically or by size, can be limited to a certain number of objects, and can be ordered ascending or descending. Also, I often work with data that are 1GB+, so the function changes units accordingly.
showMemoryUse <- function(sort="size", decreasing=FALSE, limit) {
objectList <- ls(parent.frame())
oneKB <- 1024
oneMB <- 1048576
oneGB <- 1073741824
memoryUse <- sapply(objectList, function(x) as.numeric(object.size(eval(parse(text=x)))))
memListing <- sapply(memoryUse, function(size) {
    if (size >= oneGB) return(paste(round(size/oneGB,2), "GB"))
    else if (size >= oneMB) return(paste(round(size/oneMB,2), "MB"))
    else if (size >= oneKB) return(paste(round(size/oneKB,2), "kB"))
    else return(paste(size, "bytes"))
})
memListing <- data.frame(objectName=names(memListing),memorySize=memListing,row.names=NULL)
if (sort=="alphabetical") memListing <- memListing[order(memListing$objectName,decreasing=decreasing),]
else memListing <- memListing[order(memoryUse,decreasing=decreasing),] #will run if sort not specified or "size"
if(!missing(limit)) memListing <- memListing[1:limit,]
print(memListing, row.names=FALSE)
}
And here is some example output:
> showMemoryUse(decreasing=TRUE, limit=5)
objectName memorySize
coherData 713.75 MB
spec.pgram_mine 149.63 kB
stoch.reg 145.88 kB
describeBy 82.5 kB
lmBandpass 68.41 kB
I never save an R workspace. I use import scripts and data scripts and output any especially large data objects that I don't want to recreate often to files. This way I always start with a fresh workspace and don't need to clean out large objects. That is a very nice function though.
Unfortunately I did not have time to test it extensively, but here is a memory tip that I have not seen before. For me the required memory was reduced by more than 50%.
When you read data into R with, for example, read.csv, the resulting objects require a certain amount of memory.
After this you can save them with save(list=ls(), file="Destinationfile")
The next time you open R you can use load("Destinationfile")
Now the memory usage might have decreased.
It would be nice if anyone could confirm whether this produces similar results with a different dataset.
To further illustrate the common strategy of frequent restarts, we can use littler which allows us to run simple expressions directly from the command-line. Here is an example I sometimes use to time different BLAS for a simple crossprod.
r -e'N<-3*10^3; M<-matrix(rnorm(N*N),ncol=N); print(system.time(crossprod(M)))'
r -lMatrix -e'example(spMatrix)'
loads the Matrix package (via the --packages | -l switch) and runs the examples of the spMatrix function. As r always starts 'fresh', this method is also a good test during package development.
Last but not least, r also works great for automated batch mode in scripts using the '#!/usr/bin/r' shebang-header. Rscript is an alternative where littler is unavailable (e.g. on Windows).
For both speed and memory purposes, when building a large data frame via some complex series of steps, I'll periodically flush it (the in-progress data set being built) to disk, appending to anything that came before, and then restart it. This way the intermediate steps are only working on smallish data frames (which is good as, e.g., rbind slows down considerably with larger objects). The entire data set can be read back in at the end of the process, when all the intermediate objects have been removed.
dfinal <- NULL
first <- TRUE
tempfile <- "dfinal_temp.csv"
for( i in bigloop ) {
    if( !i %% 10000 ) {
        cat( i, "; flushing to disk...\n" )
        # row.names=FALSE avoids duplicated row names when the chunks are re-read
        write.table( dfinal, file=tempfile, append=!first, col.names=first, row.names=FALSE )
        first <- FALSE
        dfinal <- NULL # nuke it
    }
    # ... complex operations here that add data to 'dfinal' data frame
}
print( "Loop done; flushing to disk and re-reading entire data set..." )
write.table( dfinal, file=tempfile, append=!first, col.names=first, row.names=FALSE )
dfinal <- read.table( tempfile, header=TRUE )
Just to note that data.table package's tables() seems to be a pretty good replacement for Dirk's .ls.objects() custom function (detailed in earlier answers), although just for data.frames/tables and not e.g. matrices, arrays, lists.
I'm fortunate and my large data sets are saved by the instrument in "chunks" (subsets) of roughly 100 MB (32bit binary). Thus I can do pre-processing steps (deleting uninformative parts, downsampling) sequentially before fusing the data set.
Calling gc () "by hand" can help if the size of the data get close to available memory.
Sometimes a different algorithm needs much less memory.
Sometimes there's a trade off between vectorization and memory use.
compare: split & lapply vs. a for loop.
For the sake of fast & easy data analysis, I often work first with a small random subset (sample ()) of the data. Once the data analysis script/.Rnw is finished, the data analysis code and the complete data go to the calculation server for overnight / over-weekend / ... calculation.
The use of environments instead of lists to handle collections of objects which occupy a significant amount of working memory.
The reason: each time an element of a list structure is modified, the whole list is temporarily duplicated. This becomes an issue if the storage requirement of the list is about half the available working memory, because then data has to be swapped to the slow hard disk. Environments, on the other hand, aren't subject to this behaviour and they can be treated similar to lists.
Here is an example:
get.data <- function(x)
{
  # get some data based on x
  return(paste("data from",x))
}
collect.data <- function(i,x,env)
{
  # get some data
  data <- get.data(x[[i]])
  # store data into environment
  element.name <- paste("V",i,sep="")
  env[[element.name]] <- data
  return(NULL)
}
better.list <- new.env()
filenames <- c("file1","file2","file3")
lapply(seq_along(filenames),collect.data,x=filenames,env=better.list)
# read/write access
print(better.list[["V1"]])
better.list[["V2"]] <- "testdata"
# number of list elements
length(ls(better.list))
In conjunction with structures such as big.matrix or data.table which allow for altering their content in-place, very efficient memory usage can be achieved.
The ll function in the gdata package can show the memory usage of each object as well.
If you really want to avoid the leaks, you should avoid creating any big objects in the global environment.
What I usually do is to have a function that does the job and returns NULL — all data is read and manipulated in this function or others that it calls.
With only 4GB of RAM (running Windows 10, so make that about 2 or more realistically 1GB) I've had to be real careful with the allocation.
I use data.table almost exclusively.
The 'fread' function allows you to subset information by field names on import; only import the fields that are actually needed to begin with. If you're using base R read, null the spurious columns immediately after import.
As 42- suggests, wherever possible I will then subset within the columns immediately after importing the information.
I frequently rm() objects from the environment as soon as they're no longer needed, e.g. on the next line after using them to subset something else, and call gc().
'fread' and 'fwrite' from data.table can be very fast by comparison with base R reads and writes.
As kpierce8 suggests, I almost always fwrite everything out of the environment and fread it back in, even with thousand / hundreds of thousands of tiny files to get through. This not only keeps the environment 'clean' and keeps the memory allocation low but, possibly due to the severe lack of RAM available, R has a propensity for frequently crashing on my computer; really frequently. Having the information backed up on the drive itself as the code progresses through various stages means I don't have to start right from the beginning if it crashes.
As of 2017, I think the fastest SSDs are running around a few GB per second through the M2 port. I have a really basic 50GB Kingston V300 (550MB/s) SSD that I use as my primary disk (has Windows and R on it). I keep all the bulk information on a cheap 500GB WD platter. I move the data sets to the SSD when I start working on them. This, combined with 'fread'ing and 'fwrite'ing everything has been working out great. I've tried using 'ff' but prefer the former. 4K read/write speeds can create issues with this though; backing up a quarter of a million 1k files (250MBs worth) from the SSD to the platter can take hours. As far as I'm aware, there isn't any R package available yet that can automatically optimise the 'chunkification' process; e.g. look at how much RAM a user has, test the read/write speeds of the RAM / all the drives connected and then suggest an optimal 'chunkification' protocol. This could produce some significant workflow improvements / resource optimisations; e.g. split it to ... MB for the ram -> split it to ... MB for the SSD -> split it to ... MB on the platter -> split it to ... MB on the tape. It could sample data sets beforehand to give it a more realistic gauge stick to work from.
A lot of the problems I've worked on in R involve forming combination and permutation pairs, triples etc, which only makes having limited RAM more of a limitation as they will often at least exponentially expand at some point. This has made me focus a lot of attention on the quality as opposed to quantity of information going into them to begin with, rather than trying to clean it up afterwards, and on the sequence of operations in preparing the information to begin with (starting with the simplest operation and increasing the complexity); e.g. subset, then merge / join, then form combinations / permutations etc.
There do seem to be some benefits to using base R read and write in some instances. For instance, the error detection within 'fread' is so good it can be difficult trying to get really messy information into R to begin with to clean it up. Base R also seems to be a lot easier if you're using Linux. Base R seems to work fine in Linux; Windows 10 uses ~20GB of disc space whereas Ubuntu only needs a few GB, and the RAM needed with Ubuntu is slightly lower. But I've noticed large quantities of warnings and errors when installing third party packages in (L)Ubuntu. I wouldn't recommend drifting too far away from (L)Ubuntu or other stock distributions with Linux, as you can lose so much overall compatibility it renders the process almost pointless (I think 'unity' is due to be cancelled in Ubuntu as of 2017). I realise this won't go down well with some Linux users, but some of the custom distributions are borderline pointless beyond novelty (I've spent years using Linux alone).
Hopefully some of that might help others out.
This is a newer answer to this excellent old question. From Hadley's Advanced R:
## 88 B
## 832 B
## 6.74 kB
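A minimal sketch of measuring object sizes like those above. It uses base R's utils::object.size(), which is an assumption on my part and may not be the function the book used, so the byte counts it reports can differ from the figures quoted.

```r
# Estimated size in bytes of a few built-in objects
sizes <- c(
  vector_1_to_10 = as.numeric(object.size(1:10)),
  function_mean  = as.numeric(object.size(mean)),
  data_mtcars    = as.numeric(object.size(mtcars))
)

# Largest first; exact values vary by R version and platform
print(sort(sizes, decreasing = TRUE))

# Human-readable form of a single size
pretty_mtcars <- format(object.size(mtcars), units = "auto")
print(pretty_mtcars)
```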
This adds nothing to the above, but is written in the simple and heavily commented style that I like. It yields a table with the objects ordered by size, but without some of the detail given in the examples above:
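#(Code lines other than ls() are filled in from the comments; the names are illustrative)
#Find the objects
MemoryObjects = ls()
#Create an array
MemoryAssessmentTable = array(NA, dim = c(length(MemoryObjects), 2))
#Name the columns
colnames(MemoryAssessmentTable) = c("object", "bytes")
#Define the first column as the objects
MemoryAssessmentTable[, 1] = MemoryObjects
#Define a function to determine size
MemoryAssessmentFunction = function(x) { object.size(get(x, envir = globalenv())) }
#Apply the function to the objects
MemoryAssessmentTable[, 2] = sapply(MemoryObjects, MemoryAssessmentFunction)
#Produce a table with the largest objects first
noquote(MemoryAssessmentTable[order(as.numeric(MemoryAssessmentTable[, 2]), decreasing = TRUE), , drop = FALSE])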
As well as the more general memory management techniques given in the answers above, I always try to reduce the size of my objects as far as possible. For example, I work with very large but very sparse matrices, in other words matrices where most values are zero. Using the 'Matrix' package (capitalisation important) I was able to reduce my average object sizes from ~2GB to ~200MB as simply as:
my.matrix <- Matrix(my.matrix)
The Matrix package includes data formats that can be used exactly like a regular matrix (no need to change your other code) but are able to store sparse data much more efficiently, whether loaded into memory or saved to disk.
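To give a feel for the saving, here is a small sketch with made-up data: a 1000 x 1000 matrix in which roughly 1% of the entries are non-zero. The sizes and density are illustrative assumptions, and the actual saving depends entirely on how sparse your own data are.

```r
library(Matrix)  # a recommended package, normally installed alongside R

set.seed(1)
# Illustrative dense matrix that is almost entirely zeros
dense <- matrix(0, nrow = 1000, ncol = 1000)
idx <- sample(length(dense), 10000)
dense[idx] <- runif(10000)

# Convert to a sparse representation
sparse <- Matrix(dense, sparse = TRUE)

print(object.size(dense), units = "MB")
print(object.size(sparse), units = "MB")

# Ordinary matrix operations still work on the sparse object
col_totals <- colSums(sparse)
```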
Additionally, the raw files I receive are in 'long' format, where each data point has the variables x, y, z and i. It is much more efficient to transform the data into an array of dimension x * y * z that holds only the variable i.
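A minimal sketch of that transformation, assuming x, y and z are positive integer indices and the grid is fully populated (the data here are invented). If many (x, y, z) cells are empty, the array can end up larger than the long table, so check before converting.

```r
# Illustrative long-format data: one row per (x, y, z) point with value i
long <- expand.grid(x = 1:3, y = 1:2, z = 1:2)
long$i <- seq_len(nrow(long)) * 1.5

# Allocate the array, then fill it using a three-column index matrix
arr <- array(NA_real_, dim = c(max(long$x), max(long$y), max(long$z)))
arr[cbind(long$x, long$y, long$z)] <- long$i

# Look up the value stored at x = 2, y = 1, z = 2
arr[2, 1, 2]
```

The coordinates are now implied by position in the array, so only the i values need to be stored.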
Know your data and use a bit of common sense.
If you are working on Linux, want to use several processes, and only need read access to one or more large objects, use makeForkCluster instead of makePSOCKcluster. Forked workers share the parent's memory, so this also saves you the time spent sending the large object to the other processes.
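A minimal sketch of the fork approach, with an invented object `big` standing in for the large read-only data. The serial fallback is only there so the snippet also runs where forking is not available.

```r
library(parallel)

big <- rnorm(1e6)  # illustrative large object that workers only read

if (.Platform$OS.type == "unix") {
  cl <- makeForkCluster(2)  # workers are forks, so they already see `big`
  res <- parLapply(cl, 1:2, function(k) k * sum(big))
  stopCluster(cl)
} else {
  # Forking is not available on Windows; run serially instead
  res <- lapply(1:2, function(k) k * sum(big))
}
```

With makePSOCKcluster the workers start as empty R sessions, so `big` would first have to be shipped to each of them, for example with clusterExport(cl, "big").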
I really appreciate some of the answers above. Following @hadley and @Dirk, who suggest closing R, using source and working from the command line, I came up with a solution that worked very well for me. I had to deal with hundreds of mass spectra, each occupying around 20 MB of memory, so I used two R scripts, as follows:
First a wrapper:
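#!/usr/bin/Rscript --vanilla --default-packages=utils
# fdir and fds are assumed to be defined earlier in the script (not shown here)
for (l in seq_along(fdir)) {
  for (k in seq_along(fds)) {
    # each call starts a fresh R process, so its memory is released on exit
    system(paste("Rscript runConsensus.r", l, k))
  }
}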
With this wrapper I basically control what my main script, runConsensus.r, does, and I write the resulting data to the output. Each time the wrapper calls the script, R appears to be reopened and the memory is freed.
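A self-contained sketch of the same pattern, in case it helps. The worker script here is a hypothetical stand-in written to a temporary file (it is not runConsensus.r), and it simply reports the two indices it was given.

```r
# Hypothetical worker script, written to a temp file for demonstration
worker <- tempfile(fileext = ".R")
writeLines(c(
  "args <- commandArgs(trailingOnly = TRUE)",
  "l <- as.integer(args[1]); k <- as.integer(args[2])",
  "x <- numeric(1e6)  # memory used here is released when the process exits",
  "cat(sprintf('done %d %d\\n', l, k))"
), worker)

# Use the Rscript that belongs to the running R installation
rscript <- file.path(R.home("bin"), "Rscript")

out <- character(0)
for (l in 1:2) {
  for (k in 1:2) {
    # One fresh R process per (l, k) pair; stdout is captured as text
    out <- c(out, system2(rscript, c("--vanilla", shQuote(worker), l, k),
                          stdout = TRUE))
  }
}
print(out)
```

The worker reads its indices with commandArgs(trailingOnly = TRUE), which is how values pasted onto the Rscript command line arrive in the script.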
Hope it helps.
Tip for dealing with objects requiring heavy intermediate calculation: When using objects that require a lot of heavy calculation and intermediate steps to create, I often find it useful to write one chunk of code with the function that creates the object, and then a separate chunk of code that gives me the option either to generate the object and save it as an rds file, or to load it from an rds file I have previously saved. This is especially easy to do in R Markdown using the following code-chunk structure.
```{r Create OBJECT}
COMPLICATED.FUNCTION <- function(...) {
  # Do heavy calculations needing lots of memory
  # Output OBJECT
}
```
```{r Generate or load OBJECT}
#NOTE: Set LOAD to TRUE if you want to load saved file
#NOTE: Set LOAD to FALSE if you want to generate the object from scratch
#NOTE: Set SAVE to TRUE if you want to save the object externally
LOAD <- TRUE  # example value
SAVE <- TRUE  # example value
if (LOAD) {
  OBJECT <- readRDS(file = 'MySavedObject.rds')
} else {
  OBJECT <- COMPLICATED.FUNCTION()  # pass your own arguments here
  if (SAVE) { saveRDS(file = 'MySavedObject.rds', object = OBJECT) }
}
```
With this code structure, all I need to do is to change LOAD depending on whether I want to generate the object, or load it directly from an existing saved file. (Of course, I have to generate it and save it the first time, but after this I have the option of loading it.) Setting LOAD <- TRUE bypasses use of my complicated function and avoids all of the heavy computation therein. This method still requires enough memory to store the object of interest, but it saves you from having to calculate it each time you run your code. For objects that require a lot of heavy calculation of intermediate steps (e.g., for calculations involving loops over large arrays) this can save a substantial amount of time and computation.
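A runnable variation on the same idea, as a sketch only. Instead of a manual LOAD flag it checks whether the file already exists, which is a swapped-in technique rather than the structure above; `build_object` and `load_or_build` are hypothetical names.

```r
# Hypothetical stand-in for an expensive computation
build_object <- function() {
  Sys.sleep(0.1)
  cumsum(as.numeric(1:1000))
}

# Load the object from `path` if it exists, otherwise build and save it
load_or_build <- function(path, build) {
  if (file.exists(path)) {
    readRDS(path)
  } else {
    obj <- build()
    saveRDS(obj, file = path)
    obj
  }
}

cache_file <- file.path(tempdir(), "MySavedObject.rds")
unlink(cache_file)  # start from a clean state for the demonstration

first  <- load_or_build(cache_file, build_object)  # computes and saves
second <- load_or_build(cache_file, build_object)  # loads from disk
```

The manual flag gives more control when you want to force a rebuild; with the file check you delete the rds file to get the same effect.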
Running
for (i in 1:10) gc(reset = TRUE)
from time to time also helps R to free unused but still not released memory.
You can also get some benefit from using knitr and putting your script in Rmd chunks.
I usually divide the code into different chunks and select which ones will save a checkpoint to cache or to an RDS file.
There you can set a chunk to be saved to "cache", or decide whether or not to run a particular chunk. In this way, on a first run you can process only "part 1", on another execution you can select only "part 2", etc.
```{r corpus, warning=FALSE, cache=TRUE, message=FALSE, eval=TRUE}
corpusTw <- corpus(twitter) # build the corpus
```
```{r trigrams, warning=FALSE, cache=TRUE, message=FALSE, eval=FALSE}
dfmTw <- dfm(corpusTw, verbose=TRUE, removeTwitter=TRUE, ngrams=3)
```
As a side effect, this also could save you some headaches in terms of reproducibility :)
Based on @Dirk's and @Tony's answers I have made a slight update. The result was printing [1] before the pretty size values, so I took out the capture.output, which solved the problem:
.ls.objects <- function (pos = 1, pattern, order.by,
                         decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.prettysize <- napply(names, function(x) {
                             format(utils::object.size(x), units = "auto") })
    obj.size <- napply(names, utils::object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}
I try to keep the number of objects small when working on a larger project with a lot of intermediate steps. So instead of creating many unique objects called
dataframe-> step1 -> step2 -> step3 -> result
raster-> multipliedRast -> meanRastF -> sqrtRast -> resultRast
I work with temporary objects that I call temp.
dataframe -> temp -> temp -> temp -> result
This leaves me with fewer intermediate objects and a better overview.
library(raster)  # provides raster()
raster <- raster('file.tif')
temp <- raster * 10
temp <- mean(temp)
resultRast <- sqrt(temp)
To save more memory I can simply remove temp when no longer needed.
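As a small sketch of that last step, with a plain numeric matrix standing in for the raster (an assumption, so the snippet runs without any spatial packages):

```r
# Illustrative stand-in for a raster: a plain numeric matrix
input <- matrix(1:6, nrow = 2)

temp <- input * 10
temp <- mean(temp)
result <- sqrt(temp)

rm(temp)         # drop the intermediate once the result exists
invisible(gc())  # optionally prompt R to release the freed memory
```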
If I need several intermediate files, I use temp1, temp2, temp3.
For testing I use test, test2, ...
rm(list=ls()) is a great way to keep you honest and keep things reproducible.
