R performance with data reshaping

I am trying to reshape a data frame in R, and I am running into problems with the recommended ways of doing so. The data frame has the following structure:
ID DATE1 DATE2 VALTYPE VALUE
'abcd1233' 2009-11-12 2009-12-23 'TYPE1' 123.45
...
VALTYPE is a string stored as a factor with only two levels (say TYPE1 and TYPE2). I need to transform the data frame into the following "wide" form based on common ID and DATEs:
ID DATE1 DATE2 VALUE.TYPE1 VALUE.TYPE2
'abcd1233' 2009-11-12 2009-12-23 123.45 NA
...
The data frame has more than 4,500,000 observations (although about 70% of VALUEs are NA). The machine is an Intel-based Linux workstation with 4 GB of RAM. Loading the data (from a compressed Rdata file) into a fresh R process makes the process grow to about 250 MB, which clearly leaves plenty of room for reshaping.
These are my experiences so far:
Using vanilla reshape() method:
tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"),
timevar = "VALTYPE");
RESULT: Error: cannot allocate vector of size 4.8 Gb
Using cast() method of reshape package:
tbl2 <- cast(tbl, ID + DATE1 + DATE2 ~ VALTYPE);
RESULT: R process consumes all RAM with no end in sight. Had to kill the process eventually.
Using by() and merge():
sp <- by(tbl[c(1,2,3,5)], tbl$VALTYPE, function(x) x);
tbl <- merge(sp[["TYPE1"]], sp[["TYPE2"]],
by = c("ID", "DATE1", "DATE2"), all = TRUE, sort = TRUE);
RESULT: works fine, although this is neither elegant nor foolproof (i.e. it will break if more types are added).
To add insult to injury, the operation in question can be trivially achieved in about 3 lines of AWK or Perl (and with hardly any RAM used). So the question is: what is a better way to do this operation in R using recommended methods without consuming all available RAM?

A useful trick is to combine the id variables into a character vector and then do the reshape.
tbl$NEWID <- with(tbl, paste(ID, DATE1, DATE2, sep=";"))
tbl2 <- recast(tbl, NEWID ~ VALTYPE, measure.var="VALUE")
It's about 40% faster on a problem of similar size on my 2.2 GHz Intel Core 2 Duo MacBook.

What about doing this in a non-R-like manner? I assume you have a TYPE1 and a TYPE2 row for each value of ID,DATE1,DATE2? Then sort the dataframe by those variables, and write a big for loop. You can repeatedly do rbind() operations to build the table, or you could try to pre-allocate the table (maybe) and just assign the VALUE.TYPE1 and VALUE.TYPE2 slots with [<-, which should do the assignment in-place.
(Note that if you're using rbind(), I believe that it's inefficient if you have any factor variables, so make sure everything is a character instead!)
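As a rough sketch only (not the row-by-row loop itself), a vectorized variant of the pre-allocation idea could look like this; it assumes the column names from the question and exactly two VALTYPE levels:
# keys: one output row per unique (ID, DATE1, DATE2) combination
keys <- unique(tbl[c("ID", "DATE1", "DATE2")])
tbl2 <- cbind(keys, VALUE.TYPE1 = NA_real_, VALUE.TYPE2 = NA_real_)
# map each input row to its pre-allocated output row
idx <- match(interaction(tbl$ID, tbl$DATE1, tbl$DATE2, drop = TRUE),
             interaction(keys$ID, keys$DATE1, keys$DATE2, drop = TRUE))
# fill the two value columns in place with [<-
t1 <- tbl$VALTYPE == "TYPE1"
t2 <- tbl$VALTYPE == "TYPE2"
tbl2$VALUE.TYPE1[idx[t1]] <- tbl$VALUE[t1]
tbl2$VALUE.TYPE2[idx[t2]] <- tbl$VALUE[t2]
The point is the same as in the loop version: the result table is allocated once and the VALUE columns are filled by assignment rather than by repeated rbind().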

Maybe you could use the cat() function?

Related

Performance difference map() vs withColumn()

I have a table with over 100 columns. I need to remove double quotes from certain columns. I found two ways to do it: using withColumn() and using map().
Using withColumn()
cols_to_fix = ["col1", ..., "col20"]
for col in cols_to_fix:
df = df.withColumn(col, regexp_replace(df[col], "\"", ""))
Using map()
import re
from pyspark.sql import Row

def remove_quotes(row: Row) -> Row:
    row_as_dict = row.asDict()
    cols_to_fix = ["col1", ..., "col20"]
    for column in cols_to_fix:
        if row_as_dict[column]:
            row_as_dict[column] = re.sub("\"", "", str(row_as_dict[column]))
    return Row(**row_as_dict)

df = df.rdd.map(remove_quotes).toDF(df.schema)
Here is my question. I found that using map() takes about 4 times longer than withColumn() on a table with ~25M records. I would really appreciate it if a fellow Stack Overflow user could explain the reason for the performance difference, so that I can avoid similar pitfalls in the future.
First, one piece of advice: do not convert the DataFrame to an RDD; just do df.map(your function here). This alone may save a lot of time.
The following page
https://dzone.com/articles/apache-spark-3-reasons-why-you-should-not-use-rdds
can save us a lot of time. Its main conclusion is that an RDD is remarkably slower than a DataFrame/Dataset, not to mention the time spent on the conversion from DataFrame to RDD.
Now let's talk about map and withColumn without any conversion between DataFrame and RDD.
Conclusion first: map is usually about 5x slower than withColumn. The reason is that the map operation always involves deserialization and serialization, while withColumn can operate on the column of interest directly.
To be specific, the map operation has to deserialize each Row into the individual fields that the operation will work on.
Here is an example. Assume we have a DataFrame that looks like this:
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| Scala| 3000|
+--------+-----------+
Now we want to increment all the values in the users_count column by 1; we can do it like this:
df.map(row => {
  val usersCount = row.getInt(1) + 1
  (row.getString(0), usersCount)
}).toDF("language", "users_count_incremented_by_1")
In the code above, we first need to deserialize every row to extract the value in the second column; after that we output the modified values and save them as a DataFrame (this step requires serializing (a, b) back into Row(a, b), since a DataFrame is nothing but a Dataset of Rows).
For a more detailed explanation, check the following excellent article:
https://medium.com/@fqaiser94/udfs-vs-map-vs-custom-spark-native-functions-91ab2c154b44
map cannot operate on the column itself but has to operate on the values of the column; getting the values requires deserialization, and saving the result as a DataFrame requires serialization.
But map is still of great use: with it you can implement very sophisticated operations, whereas only built-in column operations are available if you stick to withColumn.
To sum up, map is slower but more flexible; withColumn is surely the most efficient, but its functionality is limited.

Slowdowns of performance of ets select

I did some tests on the performance of selecting from ETS tables and noticed some weird behaviour. For example, we have a simple ETS table (without any specific options) which stores key/value pairs - a random string and a number:
:ets.new(:table, [:named_table])
for _i <- 1..2000 do
  :ets.insert(:table, {:crypto.strong_rand_bytes(10)
                       |> Base.url_encode64
                       |> binary_part(0, 10), 100})
end
and one entry with known key:
:ets.insert(:table, {"test_string", 200})
Now here is a simple, naive benchmark function which tries to select test_string from the ETS table multiple times and measures the time of each selection:
test_fn = fn() ->
  Enum.map(Enum.to_list(1..10_000), fn(x) ->
    :timer.tc(fn() ->
      :ets.select(:table, [{{:'$1', :'$2'},
                            [{:'==', :'$1', "test_string"}],
                            [:'$_']}])
    end)
  end) |> Enum.unzip
end
Now if I take a look at the maximum time with Enum.max(timings), it returns a value which is approximately 10x greater than almost all of the other selections. So, for example:
iex(1)> {timings, _result} = test_fn.()
....
....
....
iex(2)> Enum.max(timings)
896
iex(3)> Enum.sum(timings) / length(timings)
96.8845
We can see here that the maximum value is almost 10x greater than the average value.
What's happening here? Is it somehow related to GC, time for memory allocation, or something like that? Do you have any ideas why selecting from an ETS table may sometimes show such slowdowns, or how to profile this?
UPDATE: here is the distribution of the timings (graph not reproduced here).
The match_spec, the second argument of select/2, is what makes it slower.
According to an answer on this question
Erlang: ets select and match performance
In trivial non-specific use-cases, select is just a lot of work around match.
In non-trivial more common use-cases, select will give you what you really want a lot quicker.
Also, if you are working with tables of type set or ordered_set and want to get a value based on a key, use lookup/2 instead, as it is a lot faster.
On my PC, the following code
def lookup() do
  {timings, _} = Enum.map(Enum.to_list(1..10_000), fn(_x) ->
    :timer.tc(fn() ->
      :ets.lookup(:table, "test_string")
    end)
  end) |> Enum.unzip
  IO.puts Enum.max(timings)
  IO.puts Enum.sum(timings) / length(timings)
end
printed
0
0.0
While yours printed
16000
157.9
In case you are interested, here you can find the NIF C code for ets:select.
https://github.com/erlang/otp/blob/9d1b3bb0db87cf95cb821af01189f6d6be072f79/erts/emulator/beam/erl_db.c

SAS: What's the optimal way to find the sum of a column by another column?

I want to find out the best way to perform a group-by in SAS so I can run some benchmarks. The two simplest ways I can think of are PROC SQL and PROC MEANS. Here is the example in PROC SQL:
proc sql noprint; /* took 6 mins */
  create table summ as
  select id,
         sum(val)
  from randint
  group by id;
quit;
I think there are a couple of ways to make this run faster:
use the sasfile command to load the data into memory first
create an index on id
Are there any other options I can use? Any SAS options I should turn on to make this run as fast as possible? I am not tied to PROC SQL or PROC MEANS, so if there are faster ways I would love to know about them!
My set up code is as below
options macrogen;
options obs=max sortsize=max source2 FULLSTIMER;
options minoperator SASTRACE=',,,d' SASTRACELOC=SASLOG;
options compress = binary NOSTSUFFIX;
options noxwait noxsync;
options LRECL=32767;
proc fcmp outlib=work.myfunc.sample;
  function RandBetween(min, max);
    return (min + floor((1 + max - min) * rand("uniform")));
  endsub;
run;

options cmplib=work.myfunc;

data RandInt;
  do i = 1 to 250000000;
    id = RandBetween(1, 2500000);
    val = rand("uniform");
    output;
  end;
  drop i;
run;
My SAS comparison macros are as below
%macro sasbench(dosql = N); %macro _; %mend;
  %if &dosql. = Y %then %do;
    proc sql noprint; /* took 6 mins */
      create table summ as
      select id,
             sum(val)
      from randint
      group by id;
    quit;
  %end;

  proc means data=randint sum noprint;
    var val;
    class id;
    output out=summmeans(drop=_type_ _freq_) sum= /autoname;
  run;
%mend;

%sasbench();

/**/
/*sasfile randint load;*/
/*%sasbench();*/
/*sasfile randint close;*/

proc datasets lib=work;
  modify randint;
  INDEX CREATE id / nomiss;
run;

%sasbench();
sasfile is only a benefit if the entire data set can fit into the session RAM limits and if the data set is going to be used more than once. I suppose this would make sense if your benchmark includes multiple runs / different techniques on the same sasfile.
An index on id would help if the data were unsorted by id. When the data set is pre-sorted by id, the id column metadata will have the sortedby flag set, which a procedure can use for its own internal optimization; however, there is no guarantee. As for indexes, use option msglevel=i to get informational messages in the log about index selection during processing.
The fastest way is direct addressing, but it requires enough RAM to handle the largest id value as an array index:
array ids(250000000) _temporary_;
ids(id) + value;
The next fastest way is probably hand coded array based hashing:
search SAS conference proceedings for papers by Paul Dorfman
The next fastest hash way is probably the hash component object with key suminc.
The DATA step below was edited to align with the comments.
data demo_data;
  do rownum = 1 to 1000;
    id = ceil(100*ranuni(123));     * NOTE: 100 different groups, disordered;
    value = ceil(1000*ranuni(123)); * NOTE: want to sum value over group, for demonstration the individual values are integers from 1..1000;
    output;
  end;
run;
data _null_;
  if 0 then set demo_data(keep=id value);  %* prep pdv ;
  length total 8;                          %* prep keysum variable ;
  call missing (total);                    %* prevent warnings ;

  declare hash ids (ordered:'a', suminc:'value', keysum:'total');  %* ordered ensures keys will be sorted ascending upon output ;
  ids.defineKey('id');
  *ids.defineData('id');  %* not having a defineData is an implicit way of adding only the keys as data - only data + keysum variables are .output ;
  ids.defineDone();

  * read all records and touch each hash key in order to perform tacit total+value summation;
  do until (end);
    set demo_data end=end;
    if ids.find() ne 0 then ids.add();
  end;

  ids.output(dataset:'sum_value_over_id');  * save the summation of each key combination;
  stop;
run;
Note: There can be only one keysum variable.
If the suminc variable was set to be always 1 instead of value, then the keysum would be the count instead of the total.
Obtaining both sum and count over group via hash would require an explicit defineData for a count and sum variable and slightly different statements, such as:
declare hash ids (ordered:'a');
...
ids.defineData('id', 'count', 'total');
...
if ids.find() ne 0 then do; count=0; total=0; end;
count+1;
total+value;
ids.replace();
...
However, if value is known to always be a natural number, and the group size is known to be < 10^(group size limit), you could numerically encode the count by using a suminc of value + 10^(-group size limit) and numerically decode the count by processing the output data with count = (total - int(total)) * 10^(group size limit).
For sorted data the fastest way is most likely a DOW loop with accumulation.
proc sort data=foo;
  by id;
run;

data sum_value_over_id_v2(keep=id total);
  do until (last.id);
    set foo;
    by id;
    total = sum(total, value);
  end;
run;
You will likely find that I/O is largest component of performance.
The best answer varies dramatically by the application. In your example, PROC SQL at least on my machine significantly outperforms PROC MEANS, but there are plenty of cases where it will not do so. It's able to in this case because it's building hash tables behind the scenes, more than likely, which are quite fast - a single pass through the data is all that's needed.
You certainly could speed things up by putting your full dataset into memory with SASFILE, if you have room to store the whole thing. You would have to have it in memory to begin with, though, more than likely; just reading it into memory for this purpose alone wouldn't really help since you're doing that read anyway.
As Richard notes, there are a bunch of ways to do this. I think PROC SQL will often be the fastest or similar to the fastest in simple cases, both because it's multithreaded (as opposed to data step being single threaded) and because it's got a fast hash table backend.
PROC MEANS is also usually going to be competitive; the case you show in the example is almost a worst case for it, since it has a huge number of class variable levels, so I think it may be creating a temporary table on disk. It's also multithreaded. Reduce the class variable to 2,500 categories instead of 2,500,000 and PROC MEANS comes out a bit faster than PROC SQL (but within the margin of error).
Data step accumulation, either in a hash table or a DoW loop, will sometimes outperform both of the above, and sometimes not, again depending on the data. Here it does outperform slightly. The code for data step accumulation tends to be a bit more complex, which is why I'd usually discourage it unless the savings is substantial (having more code to maintain is worse, typically). PROC MEANS and PROC SQL require less maintenance and less to understand. But in applications where performance is critical and these solutions happen to be superior, it may be worth it to go this route, especially if the data step is helpful. Of course, the hash table method is limited to fitting the results in memory, though usually that's manageable.
Ultimately, I would encourage you to use whatever method is easiest to maintain but still gives sufficient performance; and when possible try to be self consistent with other code. If most of your code is in SQL, that is probably fine. SASFILE and indexes probably won't be needed, unless you're doing more complicated things than you present above. Summation is actually more work than I/O in many cases. Don't overcomplicate it, ultimately: programmer hours and difficulty of QA is something that should trump basic performance, unless you're talking several hours' difference. And if you are, then just run tests on your actual use case and see what works best.
If you assume the data is sorted by id, then this is another solution:
data sum_value_over_id_v2(keep=id total);
  set a.randint(keep=id val);
  by id;
  total + val;
  if last.id then do;
    output;
    total = 0;
  end;
  drop val;
run;

add columns to data frame using foreach and %dopar%

In Revolution R 2.12.2 on Windows 7 and Ubuntu 64-bit 11.04, I have a data frame with over 100K rows and over 100 columns, and I derive ~5 columns (sqrt, log, log10, etc.) for each of the original columns and add them to the same data frame. Without parallelism, using foreach and %do%, this works fine, but it's slow. When I try to parallelize it with foreach and %dopar%, it will not access the global environment (to prevent race conditions or something like that), so I cannot modify the data frame because the data frame object is 'not found.'
My question is: how can I make this faster? In other words, how can I parallelize either over the columns or over the transformations?
Simplified example:
require(foreach)
require(doSMP)
w <- startWorkers()
registerDoSMP(w)
transform_features <- function()
{
  cols <- c(1,2,3,4) # in my real code I select certain columns (not all)
  foreach(thiscol = cols, mydata) %dopar% {
    name <- names(mydata)[thiscol]
    print(paste('transforming variable ', name))
    mydata[, paste(name, 'sqrt', sep='_')] <<- sqrt(mydata[, thiscol])
    mydata[, paste(name, 'log', sep='_')] <<- log(mydata[, thiscol])
  }
}
n<-10 # I often have 100K-1M rows
mydata <- data.frame(
a=runif(n,1,100),
b=runif(n,1,100),
c=runif(n,1,100),
d=runif(n,1,100)
)
ncol(mydata) # 4 columns
transform_features()
ncol(mydata) # if it works, there should be 8
Notice that if you change %dopar% to %do%, it works fine.
Try the := operator in data.table to add the columns by reference. You'll need with=FALSE so you can put the call to paste on the LHS of :=.
See When should I use the := operator in data.table?
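A minimal sketch of that idea, assuming the mydata from the question and using the (col) := form in place of with=FALSE (the derived columns are illustrative):
library(data.table)
dt <- as.data.table(mydata)

for (thiscol in c("a", "b", "c", "d")) {
  # the computed name goes on the LHS of :=, and the column is added by reference
  dt[, (paste(thiscol, "sqrt", sep = "_")) := sqrt(get(thiscol))]
  dt[, (paste(thiscol, "log",  sep = "_")) := log(get(thiscol))]
}
ncol(dt)  # 12: the 4 original columns plus 8 derived ones, with no copy of the table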
Might it be easier if you did something like
n<-10
mydata <- data.frame(
a=runif(n,1,100),
b=runif(n,1,100),
c=runif(n,1,100),
d=runif(n,1,100)
)
mydata_sqrt <- sqrt(mydata)
colnames(mydata_sqrt) <- paste(colnames(mydata), 'sqrt', sep='_')
mydata <- cbind(mydata, mydata_sqrt)
producing something like
> mydata
a b c d a_sqrt b_sqrt c_sqrt d_sqrt
1 29.344088 47.232144 57.218271 58.11698 5.417018 6.872565 7.564276 7.623449
2 5.037735 12.282458 3.767464 40.50163 2.244490 3.504634 1.940996 6.364089
3 80.452595 76.756839 62.128892 43.84214 8.969537 8.761098 7.882188 6.621340
4 39.250277 11.488680 38.625132 23.52483 6.265004 3.389496 6.214912 4.850240
5 11.459075 8.126104 29.048527 76.17067 3.385126 2.850632 5.389669 8.727581
6 26.729365 50.140679 49.705432 57.69455 5.170045 7.081008 7.050208 7.595693
7 42.533937 7.481240 59.977556 11.80717 6.521805 2.735186 7.744518 3.436157
8 41.673752 89.043099 68.839051 96.15577 6.455521 9.436265 8.296930 9.805905
9 59.122106 74.308573 69.883037 61.85404 7.689090 8.620242 8.359607 7.864734
10 24.191878 94.059012 46.804937 89.07993 4.918524 9.698403 6.841413 9.438217
There are two ways you can handle this:
Loop over each column (or, better yet, a subset of the columns) and apply the transformations to create a temporary data frame, return that, and then do a cbind of the list of data frames, as @Henry suggested.
Loop over the transformations, apply each to the data frame, and then return the transformation data frames, cbind, and proceed.
Personally, the way I tend to do things like this is create a bigmatrix object (either in memory or on disk, using the bigmemory package), and you can access all of the columns in shared memory. Just pre-allocate the columns you will fill in, and you won't need to do a post hoc cbind. I tend to do it on disk. Just be sure to run flush(), to make sure everything is written to disk.
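For what it's worth, here is a minimal sketch of the pre-allocated shared-matrix idea with bigmemory and foreach; the file names, dimensions, and doParallel backend are illustrative rather than part of the original answer:
library(bigmemory)
library(doParallel)
registerDoParallel(2)

n <- 10
# pre-allocate room for the 4 original columns plus 4 derived ones
bm <- filebacked.big.matrix(nrow = n, ncol = 8,
                            backingfile = "mydata.bin",
                            descriptorfile = "mydata.desc")
bm[, 1:4] <- as.matrix(mydata)   # original columns a, b, c, d
desc <- describe(bm)

foreach(j = 1:4, .packages = "bigmemory") %dopar% {
  m <- attach.big.matrix(desc)   # every worker attaches to the same backing file
  m[, j + 4] <- sqrt(m[, j])     # fill the pre-allocated derived column in place
  flush(m)                       # make sure everything is written to disk
  NULL
}
The derived columns end up in the shared matrix rather than in mydata, which is exactly the pre-allocation the answer describes, so no post hoc cbind is needed.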

Tricks to manage the available memory in an R session

What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list in 2004] to list (and/or sort) the largest objects and to occasionally rm() some of them. But by far the most effective solution was ... to run under 64-bit Linux with ample memory.
Any other nice tricks folks want to share? One per post, please.
# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                         decreasing = FALSE, head = FALSE, n = 5) {
  napply <- function(names, fn) sapply(names, function(x)
    fn(get(x, pos = pos)))
  names <- ls(pos = pos, pattern = pattern)
  obj.class <- napply(names, function(x) as.character(class(x))[1])
  obj.mode <- napply(names, mode)
  obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
  obj.size <- napply(names, object.size)
  obj.dim <- t(napply(names, function(x)
    as.numeric(dim(x))[1:2]))
  vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
  obj.dim[vec, 1] <- napply(names, length)[vec]
  out <- data.frame(obj.type, obj.size, obj.dim)
  names(out) <- c("Type", "Size", "Rows", "Columns")
  if (!missing(order.by))
    out <- out[order(out[[order.by]], decreasing = decreasing), ]
  if (head)
    out <- head(out, n)
  out
}
# shorthand
lsos <- function(..., n = 10) {
  .ls.objects(..., order.by = "Size", decreasing = TRUE, head = TRUE, n = n)
}
Ensure you record your work in a reproducible script. From time-to-time, reopen R, then source() your script. You'll clean out anything you're no longer using, and as an added benefit will have tested your code.
I use the data.table package. With its := operator you can:
Add columns by reference
Modify subsets of existing columns by reference, and by group by reference
Delete columns by reference
None of these operations copy the (potentially large) data.table at all, not even once.
Aggregation is also particularly fast because data.table uses much less working memory.
Related links :
News from data.table, London R presentation, 2012
When should I use the := operator in data.table?
Saw this on a Twitter post and think it's an awesome function by Dirk! Following on from JD Long's answer, I would do this for user-friendly reading:
# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                         decreasing = FALSE, head = FALSE, n = 5) {
  napply <- function(names, fn) sapply(names, function(x)
    fn(get(x, pos = pos)))
  names <- ls(pos = pos, pattern = pattern)
  obj.class <- napply(names, function(x) as.character(class(x))[1])
  obj.mode <- napply(names, mode)
  obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
  obj.prettysize <- napply(names, function(x) {
    format(utils::object.size(x), units = "auto") })
  obj.size <- napply(names, object.size)
  obj.dim <- t(napply(names, function(x)
    as.numeric(dim(x))[1:2]))
  vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
  obj.dim[vec, 1] <- napply(names, length)[vec]
  out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
  names(out) <- c("Type", "Size", "PrettySize", "Length/Rows", "Columns")
  if (!missing(order.by))
    out <- out[order(out[[order.by]], decreasing = decreasing), ]
  if (head)
    out <- head(out, n)
  out
}
# shorthand
lsos <- function(..., n = 10) {
  .ls.objects(..., order.by = "Size", decreasing = TRUE, head = TRUE, n = n)
}
lsos()
lsos()
Which results in something like the following:
Type Size PrettySize Length/Rows Columns
pca.res PCA 790128 771.6 Kb 7 NA
DF data.frame 271040 264.7 Kb 669 50
factor.AgeGender factanal 12888 12.6 Kb 12 NA
dates data.frame 9016 8.8 Kb 669 2
sd. numeric 3808 3.7 Kb 51 NA
napply function 2256 2.2 Kb NA NA
lsos function 1944 1.9 Kb NA NA
load loadings 1768 1.7 Kb 12 2
ind.sup integer 448 448 bytes 102 NA
x character 96 96 bytes 1 NA
NOTE: The main part I added was (again, adapted from JD's answer) :
obj.prettysize <- napply(names, function(x) {
print(object.size(x), units = "auto") })
I make aggressive use of the subset parameter, selecting only the required variables, when passing data frames to the data= argument of regression functions. It does result in some errors if I forget to add variables to both the formula and the select= vector, but it still saves a lot of time due to decreased copying of objects and reduces the memory footprint significantly. Say I have 4 million records with 110 variables (and I do). Example:
# library(rms); library(Hmisc) for the cph,and rcs functions
Mayo.PrCr.rbc.mdl <-
  cph(formula = Surv(surv.yr, death) ~ age + Sex + nsmkr + rcs(Mayo, 4) +
                rcs(PrCr.rat, 3) + rbc.cat * Sex,
      data = subset(set1HLI, gdlab2 & HIVfinal == "Negative",
                    select = c("surv.yr", "death", "PrCr.rat", "Mayo",
                               "age", "Sex", "nsmkr", "rbc.cat")))
By way of setting context and the strategy: the gdlab2 variable is a logical vector that was constructed for subjects in a dataset that had all normal or almost normal values for a bunch of laboratory tests and HIVfinal was a character vector that summarized preliminary and confirmatory testing for HIV.
I love Dirk's .ls.objects() script but I kept squinting to count characters in the size column. So I did some ugly hacks to make it present with pretty formatting for the size:
.ls.objects <- function (pos = 1, pattern, order.by,
                         decreasing = FALSE, head = FALSE, n = 5) {
  napply <- function(names, fn) sapply(names, function(x)
    fn(get(x, pos = pos)))
  names <- ls(pos = pos, pattern = pattern)
  obj.class <- napply(names, function(x) as.character(class(x))[1])
  obj.mode <- napply(names, mode)
  obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
  obj.size <- napply(names, object.size)
  obj.prettysize <- sapply(obj.size, function(r) prettyNum(r, big.mark = ","))
  obj.dim <- t(napply(names, function(x)
    as.numeric(dim(x))[1:2]))
  vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
  obj.dim[vec, 1] <- napply(names, length)[vec]
  out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
  names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
  if (!missing(order.by))
    out <- out[order(out[[order.by]], decreasing = decreasing), ]
  out <- out[c("Type", "PrettySize", "Rows", "Columns")]
  names(out) <- c("Type", "Size", "Rows", "Columns")
  if (head)
    out <- head(out, n)
  out
}
That's a good trick.
One other suggestion is to use memory-efficient objects wherever possible: for instance, use a matrix instead of a data.frame.
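A quick way to compare the two for your own data is object.size(); a rough, illustrative sketch (exact sizes vary by R version and platform):
m  <- matrix(runif(1e6), ncol = 10)   # 100,000 x 10 numeric matrix
df <- as.data.frame(m)
object.size(m)    # ~8 MB: one contiguous numeric block
object.size(df)   # slightly more: the same numbers plus per-column and attribute overhead
For all-numeric data the raw storage is similar; the bigger savings tend to come from how the object behaves downstream, since many data.frame operations copy more than their matrix equivalents.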
This doesn't really address memory management, but one important function that isn't widely known is memory.limit(). You can increase the default using this command, memory.limit(size=2500), where the size is in MB. As Dirk mentioned, you need to be using 64-bit in order to take real advantage of this.
I quite like the improved objects function developed by Dirk. Much of the time though, a more basic output with the object name and size is sufficient for me. Here's a simpler function with a similar objective. Memory use can be ordered alphabetically or by size, can be limited to a certain number of objects, and can be ordered ascending or descending. Also, I often work with data that are 1GB+, so the function changes units accordingly.
showMemoryUse <- function(sort="size", decreasing=FALSE, limit) {
objectList <- ls(parent.frame())
oneKB <- 1024
oneMB <- 1048576
oneGB <- 1073741824
memoryUse <- sapply(objectList, function(x) as.numeric(object.size(eval(parse(text=x)))))
memListing <- sapply(memoryUse, function(size) {
if (size >= oneGB) return(paste(round(size/oneGB,2), "GB"))
else if (size >= oneMB) return(paste(round(size/oneMB,2), "MB"))
else if (size >= oneKB) return(paste(round(size/oneKB,2), "kB"))
else return(paste(size, "bytes"))
})
memListing <- data.frame(objectName=names(memListing),memorySize=memListing,row.names=NULL)
if (sort=="alphabetical") memListing <- memListing[order(memListing$objectName,decreasing=decreasing),]
else memListing <- memListing[order(memoryUse,decreasing=decreasing),] #will run if sort not specified or "size"
if(!missing(limit)) memListing <- memListing[1:limit,]
print(memListing, row.names=FALSE)
return(invisible(memListing))
}
And here is some example output:
> showMemoryUse(decreasing=TRUE, limit=5)
objectName memorySize
coherData 713.75 MB
spec.pgram_mine 149.63 kB
stoch.reg 145.88 kB
describeBy 82.5 kB
lmBandpass 68.41 kB
I never save an R workspace. I use import scripts and data scripts and output any especially large data objects that I don't want to recreate often to files. This way I always start with a fresh workspace and don't need to clean out large objects. That is a very nice function though.
Unfortunately I did not have time to test it extensively, but here is a memory tip that I have not seen before. For me, the required memory was reduced by more than 50%.
When you read stuff into R with, for example, read.csv, it requires a certain amount of memory.
After this you can save the objects with save(list = ls(), file = "Destinationfile.RData").
The next time you open R you can use load("Destinationfile.RData").
Now the memory usage may have decreased.
It would be nice if anyone could confirm whether this produces similar results with a different dataset.
To further illustrate the common strategy of frequent restarts, we can use littler which allows us to run simple expressions directly from the command-line. Here is an example I sometimes use to time different BLAS for a simple crossprod.
r -e'N<-3*10^3; M<-matrix(rnorm(N*N),ncol=N); print(system.time(crossprod(M)))'
Likewise,
r -lMatrix -e'example(spMatrix)'
loads the Matrix package (via the --packages | -l switch) and runs the examples of the spMatrix function. As r always starts 'fresh', this method is also a good test during package development.
Last but not least, r also works great for automated batch mode in scripts using the '#!/usr/bin/r' shebang header. Rscript is an alternative where littler is unavailable (e.g. on Windows).
For both speed and memory purposes, when building a large data frame via some complex series of steps, I'll periodically flush it (the in-progress data set being built) to disk, appending to anything that came before, and then restart it. This way the intermediate steps are only working on smallish data frames (which is good as, e.g., rbind slows down considerably with larger objects). The entire data set can be read back in at the end of the process, when all the intermediate objects have been removed.
dfinal <- NULL
first <- TRUE
tempfile <- "dfinal_temp.csv"
for (i in bigloop) {
  if (!i %% 10000) {
    cat(i, "; flushing to disk...\n")
    write.table(dfinal, file = tempfile, append = !first, col.names = first)
    first <- FALSE
    dfinal <- NULL   # nuke it
  }
  # ... complex operations here that add data to 'dfinal' data frame
}
cat("Loop done; flushing to disk and re-reading entire data set...\n")
write.table(dfinal, file = tempfile, append = TRUE, col.names = FALSE)
dfinal <- read.table(tempfile, header = TRUE)
Just to note that data.table package's tables() seems to be a pretty good replacement for Dirk's .ls.objects() custom function (detailed in earlier answers), although just for data.frames/tables and not e.g. matrices, arrays, lists.
I'm fortunate and my large data sets are saved by the instrument in "chunks" (subsets) of roughly 100 MB (32bit binary). Thus I can do pre-processing steps (deleting uninformative parts, downsampling) sequentially before fusing the data set.
Calling gc () "by hand" can help if the size of the data get close to available memory.
Sometimes a different algorithm needs much less memory.
Sometimes there's a trade off between vectorization and memory use.
compare: split & lapply vs. a for loop.
For the sake of fast & easy data analysis, I often work first with a small random subset (sample ()) of the data. Once the data analysis script/.Rnw is finished data analysis code and the complete data go to the calculation server for over night / over weekend / ... calculation.
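For example (the data frame here is just a stand-in for a real data set):
set.seed(1)
full_data <- data.frame(id = 1:1e6, value = rnorm(1e6))   # stand-in for a large data set
dev_data  <- full_data[sample(nrow(full_data), 10000), ]  # develop the analysis on 10k rows first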
The use of environments instead of lists to handle collections of objects which occupy a significant amount of working memory.
The reason: each time an element of a list structure is modified, the whole list is temporarily duplicated. This becomes an issue if the storage requirement of the list is about half the available working memory, because then data has to be swapped to the slow hard disk. Environments, on the other hand, aren't subject to this behaviour and they can be treated similar to lists.
Here is an example:
get.data <- function(x)
{
  # get some data based on x
  return(paste("data from", x))
}

collect.data <- function(i, x, env)
{
  # get some data
  data <- get.data(x[[i]])

  # store data into environment
  element.name <- paste("V", i, sep = "")
  env[[element.name]] <- data
  return(NULL)
}

better.list <- new.env()
filenames <- c("file1", "file2", "file3")
lapply(seq_along(filenames), collect.data, x = filenames, env = better.list)

# read/write access
print(better.list[["V1"]])
better.list[["V2"]] <- "testdata"

# number of list elements
length(ls(better.list))
In conjunction with structures such as big.matrix or data.table which allow for altering their content in-place, very efficient memory usage can be achieved.
The ll() function in the gdata package can show the memory usage of each object as well.
gdata::ll(unit='MB')
If you really want to avoid the leaks, you should avoid creating any big objects in the global environment.
What I usually do is to have a function that does the job and returns NULL — all data is read and manipulated in this function or others that it calls.
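A minimal sketch of that pattern (the file names and the aggregation are illustrative):
process_file <- function(path, out_path) {
  big <- read.csv(path)                                  # the large object lives only inside this call
  result <- aggregate(value ~ id, data = big, FUN = sum)
  write.csv(result, out_path, row.names = FALSE)
  NULL                                                   # nothing large is returned to the global environment
}
Once the call returns, big and result go out of scope and can be garbage collected.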
With only 4GB of RAM (running Windows 10, so make that about 2GB, or more realistically 1GB) I've had to be really careful with the allocation.
I use data.table almost exclusively.
The 'fread' function allows you to subset information by field names on import; only import the fields that are actually needed to begin with. If you're using base R read, null the spurious columns immediately after import.
As 42- suggests, wherever possible I will then subset within the columns immediately after importing the information.
I frequently rm() objects from the environment as soon as they're no longer needed, e.g. on the next line after using them to subset something else, and call gc().
'fread' and 'fwrite' from data.table can be very fast by comparison with base R reads and writes.
As kpierce8 suggests, I almost always fwrite everything out of the environment and fread it back in, even with thousands / hundreds of thousands of tiny files to get through. This not only keeps the environment 'clean' and keeps the memory allocation low, but, possibly due to the severe lack of RAM available, R has a propensity for crashing frequently on my computer; really frequently. Having the information backed up on the drive itself as the code progresses through various stages means I don't have to start right from the beginning if it crashes.
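A minimal sketch of that workflow, with illustrative file and column names (fread's select argument does the import-time column subsetting mentioned above):
library(data.table)

dt <- fread("big_input.csv", select = c("id", "date", "value"))  # import only the needed fields
dt <- dt[value > 0]                                              # subset early, as suggested above

fwrite(dt, "stage1.csv")   # checkpoint the intermediate result to disk
rm(dt); gc()               # free the memory

dt <- fread("stage1.csv")  # pick the work back up from the checkpoint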
As of 2017, I think the fastest SSDs are running around a few GB per second through the M2 port. I have a really basic 50GB Kingston V300 (550MB/s) SSD that I use as my primary disk (has Windows and R on it). I keep all the bulk information on a cheap 500GB WD platter. I move the data sets to the SSD when I start working on them. This, combined with 'fread'ing and 'fwrite'ing everything has been working out great. I've tried using 'ff' but prefer the former. 4K read/write speeds can create issues with this though; backing up a quarter of a million 1k files (250MBs worth) from the SSD to the platter can take hours. As far as I'm aware, there isn't any R package available yet that can automatically optimise the 'chunkification' process; e.g. look at how much RAM a user has, test the read/write speeds of the RAM / all the drives connected and then suggest an optimal 'chunkification' protocol. This could produce some significant workflow improvements / resource optimisations; e.g. split it to ... MB for the ram -> split it to ... MB for the SSD -> split it to ... MB on the platter -> split it to ... MB on the tape. It could sample data sets beforehand to give it a more realistic gauge stick to work from.
A lot of the problems I've worked on in R involve forming combination and permutation pairs, triples etc, which only makes having limited RAM more of a limitation as they will often at least exponentially expand at some point. This has made me focus a lot of attention on the quality as opposed to quantity of information going into them to begin with, rather than trying to clean it up afterwards, and on the sequence of operations in preparing the information to begin with (starting with the simplest operation and increasing the complexity); e.g. subset, then merge / join, then form combinations / permutations etc.
There do seem to be some benefits to using base R read and write in some instances. For instance, the error detection within 'fread' is so good that it can be difficult to get really messy information into R in the first place in order to clean it up. Base R also seems to be a lot easier if you're using Linux. Base R seems to work fine on Linux; Windows 10 uses ~20GB of disc space whereas Ubuntu only needs a few GB, and the RAM needed with Ubuntu is slightly lower. But I've noticed large quantities of warnings and errors when installing third-party packages in (L)Ubuntu. I wouldn't recommend drifting too far away from (L)Ubuntu or other stock distributions of Linux, as you can lose so much overall compatibility that it renders the process almost pointless (I think 'unity' is due to be cancelled in Ubuntu as of 2017). I realise this won't go down well with some Linux users, but some of the custom distributions are borderline pointless beyond novelty (I've spent years using Linux alone).
Hopefully some of that might help others out.
This is a newer answer to this excellent old question. From Hadley's Advanced R:
install.packages("pryr")
library(pryr)
object_size(1:10)
## 88 B
object_size(mean)
## 832 B
object_size(mtcars)
## 6.74 kB
(http://adv-r.had.co.nz/memory.html)
This adds nothing to the above, but is written in the simple and heavily commented style that I like. It yields a table with the objects ordered by size, but without some of the detail given in the examples above:
#Find the objects
MemoryObjects = ls()
#Create an array
MemoryAssessmentTable=array(NA,dim=c(length(MemoryObjects),2))
#Name the columns
colnames(MemoryAssessmentTable)=c("object","bytes")
#Define the first column as the objects
MemoryAssessmentTable[,1]=MemoryObjects
#Define a function to determine size
MemoryAssessmentFunction=function(x){object.size(get(x))}
#Apply the function to the objects
MemoryAssessmentTable[,2]=t(t(sapply(MemoryAssessmentTable[,1],MemoryAssessmentFunction)))
#Produce a table with the largest objects first
noquote(MemoryAssessmentTable[rev(order(as.numeric(MemoryAssessmentTable[,2]))),])
As well as the more general memory management techniques given in the answers above, I always try to reduce the size of my objects as far as possible. For example, I work with very large but very sparse matrices, in other words matrices where most values are zero. Using the 'Matrix' package (capitalisation important) I was able to reduce my average object sizes from ~2GB to ~200MB as simply as:
my.matrix <- Matrix(my.matrix)
The Matrix package includes data formats that can be used exactly like a regular matrix (no need to change your other code) but are able to store sparse data much more efficiently, whether loaded into memory or saved to disk.
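As a rough illustration of the saving (sizes are indicative only):
library(Matrix)
dense <- matrix(0, nrow = 10000, ncol = 1000)
dense[sample(length(dense), 10000)] <- runif(10000)   # ~0.1% non-zero entries
sparse <- Matrix(dense)                               # stored as a sparse class such as dgCMatrix
object.size(dense)    # ~80 MB
object.size(sparse)   # a small fraction of that, since only the non-zeros are stored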
Additionally, the raw files I receive are in 'long' format where each data point has variables x, y, z, i. Much more efficient to transform the data into an x * y * z dimension array with only variable i.
Know your data and use a bit of common sense.
If you are working on Linux and want to use several processes and only have to do read operations on one or more large objects use makeForkCluster instead of a makePSOCKcluster. This also saves you the time sending the large object to the other processes.
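A minimal sketch of that idea, with an illustrative read-only object and worker function (forking is only available on Unix-alikes):
library(parallel)

big_df <- data.frame(id = 1:1e6, value = runif(1e6))   # large object the workers only read

# Forked workers share the parent's memory pages (copy-on-write), so big_df is not
# serialized and shipped to each worker as it would be with makePSOCKcluster().
cl <- makeForkCluster(4)
chunks <- split(seq_len(nrow(big_df)), cut(seq_len(nrow(big_df)), 4))
res <- parLapply(cl, chunks, function(rows) sum(big_df$value[rows]))
stopCluster(cl)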
I really appreciate some of the answers above. Following @hadley and @Dirk, who suggest closing R, issuing source(), and using the command line, I came up with a solution that worked very well for me. I had to deal with hundreds of mass spectra, each occupying around 20 MB of memory, so I used two R scripts, as follows:
First a wrapper:
#!/usr/bin/Rscript --vanilla --default-packages=utils
for (l in 1:length(fdir)) {
  for (k in 1:length(fds)) {
    system(paste("Rscript runConsensus.r", l, k))
  }
}
With this script I basically control what my main script (runConsensus.r) does, and I write the answer data to the output. With this, each time the wrapper calls the script, R is effectively reopened and the memory is freed.
Hope it helps.
Tip for dealing with objects requiring heavy intermediate calculation: When using objects that require a lot of heavy calculation and intermediate steps to create, I often find it useful to write a chunk of code with the function to create the object, and then a separate chunk of code that gives me the option either to generate and save the object as an .rds file, or to load it from an .rds file I have previously saved. This is especially easy to do in R Markdown using the following code-chunk structure.
```{r Create OBJECT}
COMPLICATED.FUNCTION <- function(...) {
  # Do heavy calculations needing lots of memory
  # Output OBJECT
}
```
```{r Generate or load OBJECT}
LOAD <- TRUE
SAVE <- TRUE
# NOTE: Set LOAD to TRUE if you want to load the saved file
# NOTE: Set LOAD to FALSE if you want to generate the object from scratch
# NOTE: Set SAVE to TRUE if you want to save the object externally

if (LOAD) {
  OBJECT <- readRDS(file = 'MySavedObject.rds')
} else {
  OBJECT <- COMPLICATED.FUNCTION(x, y, z)
  if (SAVE) { saveRDS(OBJECT, file = 'MySavedObject.rds') }
}
```
With this code structure, all I need to do is to change LOAD depending on whether I want to generate the object, or load it directly from an existing saved file. (Of course, I have to generate it and save it the first time, but after this I have the option of loading it.) Setting LOAD <- TRUE bypasses use of my complicated function and avoids all of the heavy computation therein. This method still requires enough memory to store the object of interest, but it saves you from having to calculate it each time you run your code. For objects that require a lot of heavy calculation of intermediate steps (e.g., for calculations involving loops over large arrays) this can save a substantial amount of time and computation.
Running
for (i in 1:10)
  gc(reset = TRUE)
from time to time also helps R to free unused but still not released memory.
You can also get some benefit from using knitr and putting your script in Rmd chunks.
I usually divide the code into different chunks and select which ones will save a checkpoint to the cache or to an RDS file.
There you can set a chunk to be saved to "cache", or decide whether or not to run a particular chunk. In this way, on a first run you can process only "part 1", and in another execution you can select only "part 2", etc.
Example:
part1
```{r corpus, warning=FALSE, cache=TRUE, message=FALSE, eval=TRUE}
corpusTw <- corpus(twitter) # build the corpus
```
part2
```{r trigrams, warning=FALSE, cache=TRUE, message=FALSE, eval=FALSE}
dfmTw <- dfm(corpusTw, verbose=TRUE, removeTwitter=TRUE, ngrams=3)
```
As a side effect, this also could save you some headaches in terms of reproducibility :)
Based on @Dirk's and @Tony's answers I have made a slight update. The result was outputting [1] before the pretty size values, so I took out the capture.output, which solved the problem:
.ls.objects <- function (pos = 1, pattern, order.by,
                         decreasing = FALSE, head = FALSE, n = 5) {
  napply <- function(names, fn) sapply(names, function(x)
    fn(get(x, pos = pos)))
  names <- ls(pos = pos, pattern = pattern)
  obj.class <- napply(names, function(x) as.character(class(x))[1])
  obj.mode <- napply(names, mode)
  obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
  obj.prettysize <- napply(names, function(x) {
    format(utils::object.size(x), units = "auto") })
  obj.size <- napply(names, utils::object.size)
  obj.dim <- t(napply(names, function(x)
    as.numeric(dim(x))[1:2]))
  vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
  obj.dim[vec, 1] <- napply(names, length)[vec]
  out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
  names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
  if (!missing(order.by))
    out <- out[order(out[[order.by]], decreasing = decreasing), ]
  if (head)
    out <- head(out, n)
  return(out)
}
# shorthand
lsos <- function(..., n = 10) {
  .ls.objects(..., order.by = "Size", decreasing = TRUE, head = TRUE, n = n)
}
lsos()
I try to keep the number of objects small when working on a larger project with a lot of intermediate steps. So instead of creating many unique objects called
dataframe-> step1 -> step2 -> step3 -> result
raster-> multipliedRast -> meanRastF -> sqrtRast -> resultRast
I work with temporary objects that I call temp.
dataframe -> temp -> temp -> temp -> result
This leaves me with fewer intermediate files and a better overview.
raster <- raster('file.tif')
temp <- raster * 10
temp <- mean(temp)
resultRast <- sqrt(temp)
To save more memory I can simply remove temp when no longer needed.
rm(temp)
If I need several intermediate files, I use temp1, temp2, temp3.
For testing I use test, test2, ...
rm(list=ls()) is a great way to keep you honest and keep things reproducible.
