How to make computing/inserting a difference-of-dates column faster? - performance

Can you make this R code faster? I can't see how to vectorize it.
I have a data frame as follows (sample rows below):
> str(tt)
'data.frame': 1008142 obs. of 4 variables:
 $ customer_id: int ...
 $ visit_date : Date, format: "2010-04-04" ...
 ...
I want to compute the difference between visit_dates for each customer.
So I do diff(tt$visit_date), but I have to force a discontinuity (NA) everywhere customer_id changes, where the diff is meaningless, e.g. row 74 below.
The code at the bottom does this, but takes >15 min on the 1M-row dataset.
I also tried piecewise computing and cbind'ing the sub-result per customer_id (using which()), but that was also slow.
Any suggestions? Thanks. I did search SO, R-intro, the R man pages, etc.
customer_id visit_date visit_spend ivi
72 40 2011-03-15 18.38 5
73 40 2011-03-20 23.45 5
74 79 2010-04-07 150.87 NA
75 79 2010-04-17 101.90 10
76 79 2010-05-02 111.90 15
Code:
all_tt_cids <- unique(tt$customer_id)
# Append ivi (intervisit interval) column
tt$ivi <- c(NA, diff(tt$visit_date))
for (cid in all_tt_cids) {
  # ivi has a discontinuity when customer_id changes
  tt$ivi[min(which(tt$customer_id == cid))] <- NA
}
(Wondering if we can create a logical index where customer_id differs from the row above?)

To set NA in the appropriate places, you can again use diff() and a one-line trick:
> tt$ivi[c(1,diff(tt$customer_id)) != 0] <- NA
Explanation: let's take some vector x
x <- c(1,1,1,1,2,2,2,4,4,4,5,3,3,3)
We want to extract the indexes at which a new number starts, i.e. (1, 5, 8, 11, 12). We can use diff() for that.
y <- c(1,diff(x))
# y = 1 0 0 0 1 0 0 2 0 0 1 -2 0 0
and take the indexes where y is not equal to zero:
x[y!=0] <- NA
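Putting the two pieces together on a small made-up tt (a sketch using just the sample rows from the question, not the full dataset; the first row is NA here because the sample has no earlier visit for customer 40):
# Minimal self-contained sketch using the sample rows from the question
tt <- data.frame(customer_id = c(40, 40, 79, 79, 79),
                 visit_date  = as.Date(c("2011-03-15", "2011-03-20",
                                         "2010-04-07", "2010-04-17", "2010-05-02")))
tt$ivi <- c(NA, diff(tt$visit_date))           # day gaps; first row is NA
tt$ivi[c(1, diff(tt$customer_id)) != 0] <- NA  # NA wherever customer_id changes
tt$ivi                                         # NA 5 NA 10 15
Both steps are fully vectorized, so this avoids the per-customer loop entirely.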

Related

Quickly compute `dot(a(n:end), b(1:end-n))`

Suppose we have two one-dimensional arrays of values a and b, both of length N. I want to create a new array c such that c(n) = dot(a(n:N), b(1:N-n+1)). I can of course do this using a simple loop:
for n = 1:N
    c(n) = dot(a(n:N), b(1:N-n+1));
end
but given that this is such a simple operation, which resembles a convolution, I was wondering if there isn't a more efficient way to do it (using MATLAB).
A solution using 1D convolution conv:
out = conv(a, flip(b));
c = out(ceil(numel(out)/2):end);
In conv, the first vector is multiplied by a reversed, sliding copy of the second vector, so we compute the convolution of a with the flipped b and trim the unnecessary part.
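The same identity can be sanity-checked in R (a sketch only, using the example vectors from the next answer): R's convolve() with type = "open" reverses its second argument internally, which makes it equivalent to MATLAB's conv(a, flip(b)).
# R sketch verifying the trick; convolve(..., type = "open") reverses
# its second argument, so it matches MATLAB's conv(a, flip(b))
a <- c(9, 10, 2, 10, 7)
b <- c(1, 3, 6, 10, 10)
out <- convolve(a, b, type = "open")
round(out[ceiling(length(out)/2):length(out)])  # 221 146 74 31 7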
This is an interesting problem!
I am going to assume that a and b are column vectors of the same length. Let us consider a simple example:
a = [9;10;2;10;7];
b = [1;3;6;10;10];
% yields:
c = [221;146;74;31;7];
Now let's see what happens when we compute the convolution of these vectors:
>> conv(a,b)
ans =
9
37
86
166
239
201
162
170
70
>> conv2(a, b.')
ans =
9 27 54 90 90
10 30 60 100 100
2 6 12 20 20
10 30 60 100 100
7 21 42 70 70
We notice that c contains the sums of elements along the lower diagonals of the conv2 result. To show this more clearly, we transpose so the diagonals appear in the same order as the values in c:
>> triu(conv2(a.', b))
ans =
9 10 2 10 7
0 30 6 30 21
0 0 12 60 42
0 0 0 100 70
0 0 0 0 70
So now it becomes a question of summing the diagonals of a matrix, which is a more common problem with existing solutions, for example this one by Andrei Bobrov:
C = conv2(a.', b);
p = sum( spdiags(C, 0:size(C,2)-1) ).'; % This gives the same result as the loop.

R - how to pick a random sample with specific percentages

This is a snapshot of my dataset:
A B
1 34
1 33
1 66
0 54
0 77
0 98
0 39
0 12
I am trying to create a random sample that contains two 1s and three 0s from column A, along with their respective B values. Is there a way to do that? Basically, I am trying to see how to draw a sample with specific percentages of a particular column. Thanks.
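One straightforward way (a sketch, assuming the data frame is called df and has the columns A and B shown above) is to split the rows by A and sample each group separately:
# Sketch: sample 2 rows with A == 1 and 3 rows with A == 0, keeping their B values
ones  <- df[df$A == 1, ]
zeros <- df[df$A == 0, ]
samp  <- rbind(ones[sample(nrow(ones), 2), ],
               zeros[sample(nrow(zeros), 3), ])
samp
The same idea generalizes to arbitrary per-group counts, or to percentages after converting them to counts using the group sizes.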

extracting values from maps and inserting into a summary table

I have several maps that I am working with. I want to extract the values (1, 0 and NA) from the maps and place them all into a summary matrix. Since I have so many maps, I think it's best to do this as a for loop. This is the code I have so far; my maps and empty summary matrix are uploaded to my Dropbox here: DATASET here
setwd('C:/Users/Israel/Dropbox/')
require(raster)
require(rgdal)
require(plyr)
# load in the empty matrix to be filled
range.summary <- read.csv('range_sizes.csv', header = TRUE)
# load in maps and count pixels
G1.total <- raster('Group1/Summary/PA_current_G1.tif')
G1.total.df <- as.data.frame(G1.total)
# these are the values I need placed into the empty matrix (range.summary)
count(G1.total.df)
PA_current_G1 freq
1 0 227193
2 1 136871
3 NA 561188
Try this. I downloaded 3 images:
library(raster)
wd <- 'D:\\Programacao\\R\\Stackoverflow\\raster'
allfiles <- list.files(file.path(wd), all.files = FALSE)
# list of TIF files in the wd folder
tifs <- grep(".tif$", allfiles, ignore.case = TRUE, value = TRUE)
# stack the raster layers
mystack <- stack(file.path(wd, tifs))
# calculate per-layer value frequencies
freqs <- freq(mystack, useNA = 'ifany')
# rbind list to get a data.frame
freqsdf <- do.call(rbind.data.frame, freqs)
freqsdf
value count
PA_2050_26_G1.1 0 256157
PA_2050_26_G1.2 1 193942
PA_2050_26_G1.3 NA 475153
PA_2050_26_G2.1 0 350928
PA_2050_26_G2.2 1 99171
PA_2050_26_G2.3 NA 475153
PA_2050_26_sub_G1.1 0 112528
PA_2050_26_sub_G1.2 1 90800
PA_2050_26_sub_G1.3 NA 721924
str(freqsdf)
'data.frame': 9 obs. of 2 variables:
$ value: num 0 1 NA 0 1 NA 0 1 NA
$ count: num 256157 193942 475153 350928 99171 ...
Now it is just a matter of reshaping the output.
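For example, one way to get one row per map and one column per value (a sketch, assuming freqsdf exactly as printed above) is base R's tapply:
# Sketch: pivot freqsdf into a layer-by-value table of pixel counts
freqsdf$layer <- sub("\\.\\d+$", "", rownames(freqsdf))  # strip the .1/.2/.3 suffix
tapply(freqsdf$count,
       list(freqsdf$layer, addNA(factor(freqsdf$value))),
       sum)
The resulting matrix has the maps as rows and the counts of 0, 1 and NA as columns, which can then be written into range.summary.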

Sum up custom grand total on crosstab in BIRT

I have a crosstab and created a custom grand total for the row level in each column dimension by using a data element expression.
Crosstab Example:
            Cat 1                    Cat 2
ITEM    C    F    %  VALUE      C    F    %  VALUE     GT
A     101    0  0.9     10    112  105 93.8     10     20
B     294    8  2.7      6     69   66 95.7     10     16
C     211    7  3.3      4    212  161 75.9      6     10
---------------------------------------------------------
GT    606   15 2.47      6    393  332 84.5      8 **14**
Explanation for the GT row:
1. The C and F columns are sums of the values above them, but the % column is the result of dividing F by C.
2. A data element fills the VALUE column; it comes from a range-of-value definition that varies per Cat (category). For instance, in Cat 1, a % value between 0 and 1 gives a VALUE of 10, between 1 and 2 gives 8, etc.; in Cat 2, between 85 and 100 gives 10, between 80 and 85 gives 8, etc.
3. The GT value (14 in the example) is obtained by adding the VALUE of Cat 1 and Cat 2.
I am able to get points 1 and 2 above working, but I can't seem to make it work for the GT row. I don't know the code/expression to sum up the VALUE data element across the 2 categories, because both VALUE fields come from a single data element in design mode.
I have found the solution to my problem. I can produce the result by using report variables: I assign 2 report variables in the % field expression, based on the category in the data cube dimension (using an if statement), and then, in the data element expression for GT, I call both variables and add them.

Find and Replace first NA in each column without for loops

I am trying to do this without a for loop but can't figure it out.
I want to replace the first NA in each column with a default value of 0.0000001.
I am doing Last Observation Carried Forward (LOCF) imputation, but want to give the leading observation a default value.
If I have the following data.frame:
Col1 Col2 Col3 Col4
   1   NA   10   99
  NA   NA   11   99
   1   NA   12   99
   1   NA   13   NA
I want it to look like this:
     Col1      Col2 Col3      Col4
        1 0.0000001   10        99
0.0000001        NA   11        99
        1        NA   12        99
        1        NA   13 0.0000001
This is the code I have; it works but is very slow...
# Temporarily change leading NAs to 0.0000001
for (u in 1:ncol(temp_equity_df_merge2)) {
  for (v in 1:nrow(temp_equity_df_merge2)) {
    # walk down the column, replacing NAs until a value
    # that isn't NA is encountered
    if (is.na(temp_equity_df_merge2[v, u])) {
      temp_equity_df_merge2[v, u] <- 0.0000001
    }
    else break
  }
}
I want to use apply or some variant that will be faster. I am looping over 20 columns and 1 million rows.
Thanks ahead of time for the help.
You can apply a function to each column:
myfun <- function(x) {
  # replace the first NA in x (0.1 is used here so it stands out in the output)
  x[which(is.na(x))[1]] <- 0.1
  return(x)
}
> data.frame(apply(dat, 2, myfun))
v1 v2 v3 v4
1 1.0 0.1 10 99.0
2 0.1 NA 11 99.0
3 1.0 NA 12 99.0
4 1.0 NA 13 0.1
Based on the comments, you can use apply to apply a function to each column. The function replaces the first NA with 0.0000001 and returns a matrix. Then you can use na.locf (from the zoo package) to fill in the remaining NAs. Finally, I wrapped it all in data.frame since you asked for a data.frame instead of a matrix.
library(zoo)  # provides na.locf
data.frame(na.locf(apply(dat, 2, function(x) {
  firstNA <- head(which(is.na(x)), 1) # position of the first NA
  x[firstNA] <- 0.0000001
  x
})))
Col1 Col2 Col3 Col4
1 1e+00 1e-07 10 9.9e+01
2 1e-07 1e-07 11 9.9e+01
3 1e+00 1e-07 12 9.9e+01
4 1e+00 1e-07 13 1.0e-07
Given that you have such a large data set, I would use data.table and set to avoid copying the data; both of the apply solutions copy the data at least once.
This solution involves a for loop, but an efficient one: it performs length(valid_replace) assignments, each of which is effectively instantaneous.
library(data.table)
DT <- as.data.table(dat)
# position of the first NA in each column (NA if the column has none)
replacing <- lapply(DT, function(x) which(is.na(x))[1])
valid_replace <- Filter(Negate(is.na), replacing)
replace_with <- 0.0000001
for (i in seq_along(valid_replace)) {
  set(DT, i = valid_replace[[i]], j = names(valid_replace)[i], value = replace_with)
}
