GAE Datastore Performance (Column vs ListProperty)

After watching "Google IO 2009: Building scalable, complex apps on App Engine" I performed some tests to help understand the impact of list de-serialization, but the results are quite surprising. Below are the test descriptions.
All tests are run on the GAE server.
Each test is performed 5 times, with its time and CPU usage recorded.
The tests compare the speed of fetching (float) data stored in columns vs. in a list.
Both the Column and List tables contain an extra datetime column for querying.
The same query is used to fetch data from both the Column and List tables.
TEST 1
- Fetch Single Row
- Table size: 500 Columns vs List of 500 (both contain 500 rows)
Table:ChartTestDbRdFt500C500R <-- 500 Columns x 500 Rows
OneRowCol Result <-- Fetching one row
[0] 0.02 (52) <-- Test 0, time taken = 0.02, CPU usage = 52
[1] 0.02 (60)
[2] 0.02 (56)
[3] 0.01 (46)
[4] 0.02 (57)
Table:ChartTestDbRdFt500L500R <-- List of 500 x 500 Rows
OneRowLst Result
[0] 0.01 (40)
[1] 0.02 (38)
[2] 0.01 (42)
[3] 0.05 (154)
[4] 0.01 (41)
TEST 2
- Fetch All Rows
- Table size: 500 Columns vs List of 500 (both contain 500 rows)
Table:ChartTestDbRdFt500C500R
AllRowCol Result
[0] 11.54 (32753)
[1] 10.99 (31140)
[2] 11.07 (31245)
[3] 11.55 (37177)
[4] 10.96 (34300)
Table:ChartTestDbRdFt500L500R
AllRowLst Result
[0] 7.46 (20872)
[1] 7.02 (19632)
[2] 6.8 (18967)
[3] 6.33 (17709)
[4] 6.81 (19006)
TEST 3
- Fetch Single Row
- Table size: 4500 Columns vs List of 4500 (both contain 10 rows)
Table:ChartTestDbRdFt4500C10R
OneRowCol Result
[0] 0.15 (419)
[1] 0.15 (433)
[2] 0.15 (415)
[3] 0.23 (619)
[4] 0.14 (415)
Table:ChartTestDbRdFt4500L10R
OneRowLst Result
[0] 0.08 (212)
[1] 0.16 (476)
[2] 0.07 (215)
[3] 0.09 (242)
[4] 0.08 (217)
CONCLUSION
Fetching a list of N items is actually quicker than fetching N columns. Does anyone know why this is the case? I thought there was a performance hit on list de-serialization? Or did I perform my tests incorrectly? Any insight will be helpful, thanks!

BigTable is a column-oriented database.
That means that fetching a 'row' of N columns is in fact N different read operations, all on the same index.

Related

gnuplot give wrong results from stats matrix

Suppose that I have the file data.dat with the following content:
Days 1 2 4 6 10 15 20 30
Group 01 37.80 30.67 62.88 86.06 26.24 98.49 65.42 61.28
Group 02 38.96 72.99 38.24 74.11 39.54 91.59 81.14 91.22
Group 03 82.34 75.25 82.58 28.22 39.21 81.30 41.30 42.48
Group 04 75.52 42.83 66.80 20.50 94.08 74.78 95.09 53.16
Group 05 89.32 56.78 30.05 68.07 59.18 94.18 39.77 67.56
Group 06 70.03 78.71 37.59 60.55 46.40 82.73 67.34 93.38
Group 07 67.83 88.73 48.01 62.19 49.40 67.68 25.97 58.98
Group 08 61.15 96.06 59.62 39.42 60.06 94.18 76.06 32.02
Group 09 65.61 72.39 54.07 92.79 56.58 39.14 81.81 39.16
Group 10 59.65 77.81 40.51 68.49 66.15 80.33 87.31 42.07
The final intention is to create a plot using gnuplot's clustered histogram style.
Besides the graph, I need some values from data.dat, such as
size_x, size_y, min, max, and mean. To achieve the last task I used
set datafile separator tab
stats 'data.dat' skip 1 matrix
The summarized output was:
* MATRIX: [9 X 10]
Minimum: 0.0000 [ 0 0 ]
Maximum: 98.4900 [ 6 0 ]
Mean: 56.0549
The size_x and size_y values are correct – 9 columns and 10 rows – but the min is not.
This is due to the fact that the first column is string-type.
When I include every
set datafile separator tab
stats 'data.dat' skip 1 matrix every ::1
to skip the first column, the summarized output is:
* MATRIX: [9 X 8]
Minimum: 20.5000 [ 0 3 ]
Maximum: 98.4900 [ 5 0 ]
Mean: 63.0617
This time the min and max values are right, but the size_y (shown as 8, expected 9) and the index of the min (expected [ 3 3 ]) are not.
What is going on? Did I make a mistake? Am I missing something?
The program tries to read a value from the first field of each row, sees "Group xx" and ends up filling in 0 for that entry. You need to tell it to skip the first column.
Amended answer
I think there is a bug here, as well as confusion between documentation and the actual implementation. The matrix rows and columns as implemented by the every selector are indexed from 0 to N-1 as they would be for C language arrays. The documentation incorrectly states or at least implies that the first row and column is matrix[1][1] rather than [0][0]. So the full command needed for your case is
gnuplot> set datafile sep tab
gnuplot> stats 'data.dat' every 1:1:1:1 matrix
warning: matrix contains missing or undefined values
* FILE:
Records: 80
Out of range: 0
Invalid: 0
Header records: 0
Blank: 10
Data Blocks: 1
* MATRIX: [9 X 8]
Mean: 63.0617
Std Dev: 20.6729
Sample StdDev: 20.8033
Skewness: -0.1327
Kurtosis: 1.9515
Avg Dev: 17.4445
Sum: 5044.9400
Sum Sq.: 352332.2181
Mean Err.: 2.3113
Std Dev Err.: 1.6343
Skewness Err.: 0.2739
Kurtosis Err.: 0.5477
Minimum: 20.5000 [ 0 3 ]
Maximum: 98.4900 [ 5 0 ]
I.e. every 1:1:1:1 tells it for both rows and columns the index increment is 1 and the submatrix starts at [1][1] rather than at the origin [0][0].
The output values are all correct, but the indices shown for the size [9 x 8] and the min/max entries are wrong. I will file a bug report for both issues.
I got sidetracked trying to characterize the bug revealed by the original answer and forgot to mention a simpler alternative. For this specific case of one row of column headers and one column of row headers, gnuplot provides a special syntax that works without error:
set datafile separator tab
stats 'data.dat' matrix rowheaders columnheaders

Subtracting row 2 from row 1 repeatedly

I want to create in R a column in my data set where I subtract row 2 from row 1, row 4 from row 3, and so forth. Moreover, I want the subtraction result to be repeated for each row of the pair (e.g. if the result of the subtraction row2-row1 is -0.294803, I want this value to be present in both row 1 and row 2, and so forth for all subtractions).
Here is my data set.
I tried with the function aggregate but I didn't succeed.
Any hint?
Another possible solution can be:
x <- read.table("mydata.csv",header=T,sep=";")
x$diff <- rep(x$log[seq(2,nrow(x),by=2)] - x$log[seq(1,nrow(x),by=2)], each=2)
By using the function seq(), you can generate the sequences of row positions:
1, 3, 5, ... 9
2, 4, 6, ... 10
Afterwards, the code subtracts the values at the odd positions (1, 3, ..., 9) from those at the even positions (2, 4, ..., 10). Each result is replicated using rep() and assigned to the new column diff.
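For the 10-row example used here, those position vectors and the pairwise differences look like this (a small illustration reusing the x from above):
odd <- seq(1, nrow(x), by=2)    # 1 3 5 7 9
even <- seq(2, nrow(x), by=2)   # 2 4 6 8 10
x$log[even] - x$log[odd]        # one difference per pair; rep(..., each=2) then repeats each one twice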
solution 1
One way to do that is with a simple loop:
(download mydata.csv)
a = read.table("mydata.csv",header=T,sep=";")
a$delta= NA
for (i in seq(1, nrow(a), by=2)) {
  a[i, "delta"] = a[i+1, "delta"] = a[i+1, "log"] - a[i, "log"]
}
What is going on here is that the for loop iterates over every odd number (that's what seq(..., by=2) does). So for the first, third, fifth, etc. row we assign to that row AND the following one the computed difference.
which returns:
> a
su match log delta
1 1 match 5.80 0.30
2 1 mismatch 6.10 0.30
3 2 match 6.09 -0.04
4 2 mismatch 6.05 -0.04
5 3 match 6.42 -0.12
6 3 mismatch 6.30 -0.12
7 4 match 6.20 -0.20
8 4 mismatch 6.00 -0.20
9 5 match 5.90 0.19
10 5 mismatch 6.09 0.19
solution 2
If you have a lot of data this approach can be slow. Generally, R works better with another family of iterative functions: the apply family.
The same code as above can be optimized like this:
a$delta = rep(
  sapply(seq(1, nrow(a), by=2),
         function(i) { a[i+1, "log"] - a[i, "log"] }),
  each=2)
This gives the very same result as the first solution and should be faster, but it is also somewhat less intuitive.
solution 3
Finally, it looks to me like you're taking a convoluted approach by keeping the data frame in long format, given your kind of data.
I'd reshape it to wide, and then operate more naturally on separate columns, without the need for duplicated data.
Like this:
a = read.table("mydata.csv",header=T,sep=";")
a = reshape(a, idvar = "su", timevar = "match", direction = "wide")
# now creating what you want becomes very simple:
a$delta = a[[3]]-a[[2]]
Which returns:
>a
su log.match log.mismatch delta
1 1 5.80 6.10 0.30
3 2 6.09 6.05 -0.04
5 3 6.42 6.30 -0.12
7 4 6.20 6.00 -0.20
9 5 5.90 6.09 0.19
The delta column contains the values you need. If you really need the long format for further analysis you can always go back with:
a= reshape(a, idvar = "su", timevar = "match", direction = "long")
#sort to original order:
a = a[with(a, order(su)), ]

Reduce computing time for reshape

I have the following dataset, which I would like to reshape from wide to long format:
Name Code CURRENCY 01/01/1980 02/01/1980 03/01/1980 04/01/1980
Abengoa 4256 USD 1.53 1.54 1.51 1.52
Adidas 6783 USD 0.23 0.54 0.61 0.62
The data consists of stock prices for different firms on each day from 1980 to 2013. Therefore, I have 8,612 columns in my wide data (and about 3,000 rows). Now, I am using the following command to reshape the data into long format:
library(reshape)
data <- read.csv("data.csv")
data1 <- melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date")
However, for .csv files that are about 50 MB in size, it already takes about two hours. The computing time shouldn't be driven by weak hardware, since I am running this on a 2.7 GHz Intel Core i7 with 16 GB of RAM. Is there any other, more efficient way to do this?
Many thanks!
Benchmarks Summary:
Using Stack (as suggested by #AnandaMahto) is definitely the way to go for smaller data sets (N < 3,000).
As the data sets get larger, data.table begins to outperform stack.
Here is an option using data.table
dtt <- data.table(data)
# non value columns, ie, the columns to keep post reshape
nvc <- c("Name","Code", "CURRENCY")
# name of columns being transformed
dateCols <- setdiff(names(data), nvc)
# use rbind list to combine subsets
dtt2 <- rbindlist(lapply(dateCols, function(d) {
  dtt[, Date := d]
  cols <- c(nvc, "Date", d)
  setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))
}))
## Results:
dtt2
# Name Code CURRENCY Date value
# 1: Abengoa 4256 USD X_01_01_1980 1.53
# 2: Adidas 6783 USD X_01_01_1980 0.23
# 3: Abengoa 4256 USD X_02_01_1980 1.54
# 4: Adidas 6783 USD X_02_01_1980 0.54
# 5: ... <cropped>
Updated Benchmarks with larger sample data
As per the suggestion from #AnandaMahto, below are benchmarks using a larger sample data set.
Please feel free to improve any of the methods used below and/or add new methods.
Benchmarks
Resh <- quote(reshape::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
Resh2 <- quote(reshape2::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
DT <- quote({ nvc <- c("Name","Code", "CURRENCY"); dateCols <- setdiff(names(data), nvc); rbindlist(lapply(dateCols, function(d) { dtt[, Date := d]; cols <- c(nvc, "Date", d); setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))}))})
Stack <- quote(data.frame(data[1:3], stack(data[-c(1, 2, 3)])))
# SAMPLE SIZE: ROWS = 900; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(Resh=eval(Resh),Resh2=eval(Resh2),DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 Stack 0.813 0.623 0.192 5
# 2.530 DT 2.057 2.035 0.026 5
# 40.470 Resh 32.902 18.410 14.602 5
# 40.578 Resh2 32.990 18.419 14.728 5
# SAMPLE SIZE: ROWS = 3,500; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.00 DT 2.407 2.336 0.076 5
# 1.08 Stack 2.600 1.626 0.983 5
# SAMPLE SIZE: ROWS = 27,000; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 DT 10.450 7.418 3.058 5
# 2.232 Stack 23.329 14.180 9.266 5
Sample Data Creation
# rm(list=ls(all=TRUE))
set.seed(1)
LLLL <- apply(expand.grid(LETTERS, LETTERS[10:15], LETTERS[1:20], LETTERS[1:5], stringsAsFactors=FALSE), 1, paste0, collapse="")
size <- 900
dateSamples <- 380
startDate <- as.Date("1980-01-01")
Name <- apply(matrix(LLLL[1:(2*size)], ncol=2), 1, paste0, collapse="")
Code <- sample(1e3:max(1e4-1, size+1e3), length(Name))
CURRENCY <- sample(c("USD", "EUR", "YEN"), length(Name), TRUE)
Dates <- seq(startDate, length.out=dateSamples, by="mon")
Values <- sample(c(1:1e2, 1:5e2), size=size*dateSamples, TRUE) / 1e2
# Calling the sample dataframe `data` to keep consistency, but I don't like this practice
data <- data.frame(Name, Code, CURRENCY,
matrix(Values, ncol=length(Dates), dimnames=list(c(), as.character(Dates)))
)
data[1:6, 1:8]
# Name Code CURRENCY X1980.01.01 X1980.02.01 X1980.03.01 X1980.04.01 X1980.05.01
# 1 AJAAQNFA 3389 YEN 0.37 0.33 3.58 4.33 1.06
# 2 BJAARNFA 4348 YEN 1.14 2.69 2.57 0.27 3.02
# 3 CJAASNFA 6154 USD 2.47 3.72 3.32 0.36 4.85
# 4 DJAATNFA 9171 USD 2.22 2.48 0.71 0.79 2.85
# 5 EJAAUNFA 2814 USD 2.63 2.17 1.66 0.55 3.12
# 6 FJAAVNFA 9081 USD 1.92 1.47 3.51 3.23 3.68
From the question :
data <- read.csv("data.csv")
and
... for .csv files that are about 50MB big, it already takes about two
hours ...
So although stack/melt/reshape comes into play, I'm guessing (since this is your first ever S.O. question) that the biggest factor here is read.csv. I'm assuming you're including that in your timing as well as melt (it isn't clear).
Default arguments to read.csv are well known to be slow. A few quick searches should reveal hints and tips (e.g. stringsAsFactors, colClasses), such as:
http://cran.r-project.org/doc/manuals/R-data.html
Quickly reading very large tables as dataframes
But I'd suggest fread (available since data.table 1.8.7). To get a feel for fread, its manual page in raw text form is here:
https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/fread
The examples section there, as it happens, has a 50 MB example shown to be read in 3 seconds instead of up to 60. And benchmarks are starting to appear in other answers, which is great to see.
Then the stack/reshape/melt answers are next order, if I guessed correctly.
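As a minimal sketch of both suggestions (not from the original answer; the column layout follows the question, 3 id columns plus 8,609 date columns, and the column classes are guesses):
library(data.table)
# read.csv is much faster when it does not have to guess the column types
data <- read.csv("data.csv", stringsAsFactors = FALSE,
                 colClasses = c("character", "integer", "character",
                                rep("numeric", 8609)))
# or, with data.table >= 1.8.7, let fread detect the types itself
data <- as.data.frame(fread("data.csv"))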
While the testing is going on, I'll post my comment as an answer for you to consider. Try using stack as in:
data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))
In many cases, stack works really efficiently with these types of operations, and adding back in the first few columns also works quickly because of how vectors are recycled in R.
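As a toy illustration of that recycling (the data here is made up, not from the question), the two id rows are silently repeated to match the six stacked rows:
toy <- data.frame(Name = c("Abengoa", "Adidas"), v1 = 1:2, v2 = 3:4, v3 = 5:6)
data.frame(toy[1], stack(toy[-1]))   # 6 rows; the 2-row Name column is recycled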
For that matter, this might also be worth considering:
data.frame(data[1:3],
vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
date = rep(names(data)[-c(1, 2, 3)], each = nrow(data)))
I'm cautious about benchmarking on such a small sample of data, though, because I suspect the results won't be quite comparable to benchmarking on your actual dataset.
Update: Results of some more benchmarks
Using #RicardoSaporta's benchmarking procedure, I have benchmarked data.table against what I've called "Manual" data.frame creation. You can see the results of the benchmarks here, on datasets ranging from 1000 rows to 3000 rows, in 500 row increments, and all with 8003 columns (8000 data columns, plus the three initial columns).
The results can be seen here: http://rpubs.com/mrdwab/reduce-computing-time
Ricardo's correct: there seems to be something about 3000 rows that makes a huge difference with the base R approaches (it would be interesting if anyone has an explanation of what that might be). But this "Manual" approach is actually even faster than stack, if performance really is the primary concern.
Here are the results for just the last three runs:
data <- makeSomeData(2000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 0.908 0.696 0.108 1
## 1 3.963 DT 3.598 3.564 0.012 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(2500, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 2.841 1.044 0.296 1
## 1 1.694 DT 4.813 4.661 0.080 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(3000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 1 1.00 DT 7.223 5.769 0.112 1
## 2 29.27 Manual 211.416 1.560 0.952 1
Ouch! data.table really turns the tables on that last run!

Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

I know there are many questions here on SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this question is about understanding the inner workings of both methods and trying to figure out why I can't get either to work for concatenating a list of almost 1 million data.frames of the same structure, same field names, etc. into a single data.frame. Each data.frame is of one row and 21 columns.
The data started out as a JSON file, which I converted to lists using fromJSON, then ran another lapply to extract part of each list and convert it to a data.frame, ending up with a list of data.frames.
I've tried:
df <- do.call("rbind", list)
df <- ldply(list)
but I've had to kill the process after letting it run up to 3 hours and not getting anything back.
Is there a more efficient method of doing this? How can I troubleshoot what is happening and why is it taking so long?
FYI - I'm using RStudio server on a 72GB quad-core server with RHEL, so I don't think memory is the problem. sessionInfo below:
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] multicore_0.1-7 plyr_1.7.1 rjson_0.2.6
loaded via a namespace (and not attached):
[1] tools_2.14.1
>
Given that you are looking for performance, it appears that a data.table solution should be suggested.
There is a function rbindlist which does the same but is much faster than do.call(rbind, list).
library(data.table)
X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
## user system elapsed
## 0.00 0.01 0.02
It is also very fast for a list of data.frames:
Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.frame <- rbindlist(Xdf))
## user system elapsed
## 0.03 0.00 0.03
For comparison
system.time(docall <- do.call(rbind, Xdf))
## user system elapsed
## 50.72 9.89 60.88
And some proper benchmarking
library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X),
rbindlist.data.frame = rbindlist(Xdf),
docall = do.call(rbind, Xdf),
replications = 5)
## test replications elapsed relative user.self sys.self
## 3 docall 5 276.61 3073.444445 264.08 11.4
## 2 rbindlist.data.frame 5 0.11 1.222222 0.11 0.0
## 1 rbindlist.data.table 5 0.09 1.000000 0.09 0.0
and against #JoshuaUlrich's solutions
benchmark(use.rbl.dt = rbl.dt(X),
use.rbl.ju = rbl.ju (Xdf),
use.rbindlist =rbindlist(X) ,
replications = 5)
## test replications elapsed relative user.self
## 3 use.rbindlist 5 0.10 1.0 0.09
## 1 use.rbl.dt 5 0.10 1.0 0.09
## 2 use.rbl.ju 5 0.33 3.3 0.31
I'm not sure you really need to use as.data.frame, because a data.table inherits class data.frame
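For illustration (a throwaway example):
library(data.table)
class(data.table(a = 1:3))
# [1] "data.table" "data.frame"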
rbind.data.frame does a lot of checking you don't need. This should be a pretty quick transformation if you only do exactly what you want.
# Use data from Josh O'Brien's post.
set.seed(21)
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time({
Names <- names(X[[1]]) # Get data.frame names from first list element.
# For each name, extract its values from each data.frame in the list.
# This provides a list with an element for each name.
Xb <- lapply(Names, function(x) unlist(lapply(X, `[[`, x)))
names(Xb) <- Names # Give Xb the correct names.
Xb.df <- as.data.frame(Xb) # Convert Xb to a data.frame.
})
# user system elapsed
# 3.356 0.024 3.388
system.time(X1 <- do.call(rbind, X))
# user system elapsed
# 169.627 6.680 179.675
identical(X1,Xb.df)
# [1] TRUE
Inspired by the data.table answer, I decided to try and make this even faster. Here's my updated solution, to try and keep the check mark. ;-)
# My "rbind list" function
rbl.ju <- function(x) {
  u <- unlist(x, recursive=FALSE)   # flatten into one long list of columns
  n <- names(u)
  un <- unique(n)                   # the original column names
  # concatenate all the pieces belonging to each column, then rebuild the data.frame
  l <- lapply(un, function(N) unlist(u[N==n], FALSE, FALSE))
  names(l) <- un
  d <- as.data.frame(l)
}
# simple wrapper to rbindlist that returns a data.frame
rbl.dt <- function(x) {
  as.data.frame(rbindlist(x))
}
library(data.table)
if(packageVersion("data.table") >= '1.8.2') {
system.time(dt <- rbl.dt(X)) # rbindlist only exists in recent versions
}
# user system elapsed
# 0.02 0.00 0.02
system.time(ju <- rbl.ju(X))
# user system elapsed
# 0.05 0.00 0.05
identical(dt,ju)
# [1] TRUE
Your observation that the time taken increases exponentially with the number of data.frames suggests that breaking the rbinding into two stages could speed things up.
This simple experiment seems to confirm that that's a very fruitful path to take:
## Make a list of 50,000 data.frames
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
## First, rbind together all 50,000 data.frames in a single step
system.time({
X1 <- do.call(rbind, X)
})
# user system elapsed
# 137.08 57.98 200.08
## Doing it in two stages cuts the processing time by >95%
## - In Stage 1, 100 groups of 500 data.frames are rbind'ed together
## - In Stage 2, the resultant 100 data.frames are rbind'ed
system.time({
X2 <- lapply(1:100, function(i) do.call(rbind, X[((i*500)-499):(i*500)]))
X3 <- do.call(rbind, X2)
})
# user system elapsed
# 6.14 0.05 6.21
## Checking that the results are the same
identical(X1, X3)
# [1] TRUE
You have a list of data.frames that each have a single row. If it is possible to convert each of those to a vector, I think that would speed things up a lot.
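A minimal sketch of that idea, assuming every column is numeric (unlist would coerce mixed types to character, so this only applies to the all-numeric case); the list X1 below is a made-up stand-in for the real data:
X1 <- replicate(50000, data.frame(a=rnorm(1), b=runif(1)), simplify=FALSE)
# turn each one-row data.frame into a plain vector, rbind the vectors into a
# matrix (matrix rbind skips the data.frame bookkeeping), then convert once
m <- do.call(rbind, lapply(X1, function(d) unlist(d, use.names = FALSE)))
colnames(m) <- names(X1[[1]])
Xdf <- as.data.frame(m)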
However, assuming that they need to be data.frames, I'll create a function with code borrowed from Dominik's answer at Can rbind be parallelized in R?
do.call.rbind <- function(lst) {
  # Repeatedly rbind adjacent pairs, halving the length of the list on each
  # pass, until a single object remains.
  while (length(lst) > 1) {
    idxlst <- seq(from = 1, to = length(lst), by = 2)
    lst <- lapply(idxlst, function(i) {
      if (i == length(lst)) {
        return(lst[[i]])   # odd element left over: carry it forward unchanged
      }
      return(rbind(lst[[i]], lst[[i + 1]]))
    })
  }
  lst[[1]]
}
I have been using this function for several months, and have found it to be faster and use less memory than do.call(rbind, ...) [the disclaimer is that I've pretty much only used it on xts objects]
The more rows that each data.frame has, and the more elements that the list has, the more beneficial this function will be.
If you have a list of 100,000 numeric vectors, do.call(rbind, ...) will be better. If you have a list of length one billion, this will be better.
> df <- lapply(1:10000, function(x) data.frame(x = sample(21, 21)))
> library(rbenchmark)
> benchmark(a=do.call(rbind, df), b=do.call.rbind(df))
test replications elapsed relative user.self sys.self user.child sys.child
1 a 100 327.728 1.755965 248.620 79.099 0 0
2 b 100 186.637 1.000000 181.874 4.751 0 0
The relative speed-up only grows as you increase the length of the list.

How to use both binary and continuous features in the k-Nearest-Neighbor algorithm?

My feature vector has both continuous (or widely ranging) and binary components. If I simply use Euclidean distance, the continuous components will have a much greater impact:
With symmetric vs. asymmetric represented as 0 and 1, and some less important ratio ranging from 0 to 100, changing from symmetric to asymmetric has a tiny distance impact compared to changing the ratio by 25.
I can add more weight to the symmetry (by making it 0 or 100 for example), but is there a better way to do this?
You could try using the normalized Euclidean distance, described, for example, at the end of the first section here.
It simply scales every feature (continuous or discrete) by its standard deviation. This is more robust than, say, scaling by the range (max-min) as suggested by another poster.
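A minimal sketch of that in R (the feature matrix here is made up for illustration):
# two features on very different scales: a 0-100 ratio and a 0/1 flag
features <- cbind(ratio = runif(20, 0, 100), symmetric = rbinom(20, 1, 0.5))
# divide each column by its standard deviation so both contribute comparably
features_scaled <- scale(features, center = FALSE, scale = apply(features, 2, sd))
dist(features_scaled[1:2, ])   # Euclidean distance between the first two points, after rescaling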
If I correctly understand your question, normalizing (aka 'rescaling') each dimension or column in the data set is the conventional technique for dealing with over-weighted dimensions, e.g.,
ev_scaled = (ev_raw - ev_min) / (ev_max - ev_min)
In R, for instance, you can write this function:
ev_scaled = function(x) {
  (x - min(x)) / (max(x) - min(x))
}
which works like this:
# generate some data:
# v1, v2 are two expectation variables in the same dataset
# but have very different 'scale':
> v1 = seq(100, 550, 50)
> v1
[1] 100 150 200 250 300 350 400 450 500 550
> v2 = sort(sample(seq(.1, 20, .1), 10))
> v2
[1] 0.2 3.5 5.1 5.6 8.0 8.3 9.9 11.3 15.5 19.4
> mean(v1)
[1] 325
> mean(v2)
[1] 8.68
# now normalize v1 & v2 using the function above:
> v1_scaled = ev_scaled(v1)
> v1_scaled
[1] 0.000 0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889 1.000
> v2_scaled = ev_scaled(v2)
> v2_scaled
[1] 0.000 0.172 0.255 0.281 0.406 0.422 0.505 0.578 0.797 1.000
> mean(v1_scaled)
[1] 0.5
> mean(v2_scaled)
[1] 0.442
> range(v1_scaled)
[1] 0 1
> range(v2_scaled)
[1] 0 1
You can also try Mahalanobis distance instead of Euclidean.
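A minimal sketch of that option, again with a made-up feature matrix; the Mahalanobis distance rescales by the full covariance matrix, so correlated features are accounted for as well:
features <- cbind(ratio = runif(50, 0, 100), symmetric = rbinom(50, 1, 0.5))
# squared Mahalanobis distance of every observation from the sample mean
mahalanobis(features, center = colMeans(features), cov = cov(features))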
