Can you make this R code faster? Can't see how to vectorize it.
I have a data frame as follows (sample rows below):
> str(tt)
'data.frame': 1008142 obs. of 4 variables:
 $ customer_id: int  ...
 $ visit_date : Date, format: "2010-04-04", ...
I want to compute the diff between successive visit_dates for each customer.
So I do diff(tt$visit_date), but I have to force a discontinuity (NA) wherever customer_id changes and the diff is meaningless, e.g. row 74 below.
The code at the bottom does this, but takes >15 min on the 1M-row dataset.
I also tried computing the result piecewise per customer_id (using which()) and cbind'ing the sub-results, but that was also slow.
Any suggestions? Thanks. I did search SO, R-intro, R manpages, etc.
   customer_id visit_date visit_spend ivi
72          40 2011-03-15       18.38   5
73          40 2011-03-20       23.45   5
74          79 2010-04-07      150.87  NA
75          79 2010-04-17      101.90  10
76          79 2010-05-02      111.90  15
Code:
all_tt_cids <- unique(tt$customer_id)

# Append ivi (intervisit interval) column
tt$ivi <- c(NA, diff(tt$visit_date))

for (cid in all_tt_cids) {
  # ivi has a discontinuity when customer_id changes
  tt$ivi[min(which(tt$customer_id == cid))] <- NA
}
(Wondering if we can create a logical index marking where customer_id differs from the row above?)
To set NA in the appropriate places, you can again use diff() and a one-line trick:
> tt$ivi[c(1,diff(tt$customer_id)) != 0] <- NA
Explanation:
Let's take some vector x:
x <- c(1,1,1,1,2,2,2,4,4,4,5,3,3,3)
We want to extract the indexes at which a new number starts, i.e. (1, 5, 8, 11, 12). We can use diff() for that.
y <- c(1,diff(x))
# y = 1 0 0 0 1 0 0 2 0 0 1 -2 0 0
and take those indexes where y is not equal to zero:
x[y!=0] <- NA
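Putting it all together for your data frame (just a recap of the above; it assumes tt is already sorted by customer_id and then visit_date):

tt$ivi <- c(NA, diff(tt$visit_date))           # gaps between consecutive rows
tt$ivi[c(1, diff(tt$customer_id)) != 0] <- NA  # break wherever customer_id changes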
Suppose we have two one-dimensional arrays of values, a and b, both of length N. I want to create a new array c such that c(n) = dot(a(n:N), b(1:N-n+1)). I can of course do this using a simple loop:
for n = 1:N
    c(n) = dot(a(n:N), b(1:N-n+1));
end
But given that this is such a simple operation, which resembles a convolution, I was wondering if there isn't a more efficient way to do this (in MATLAB).
A solution using 1D convolution conv:
out = conv(a, flip(b));
c = out(ceil(numel(out)/2):end);
In conv, the first vector is multiplied against a reversed version of the second vector, so we compute the convolution of a and the flipped b and then trim the unnecessary leading part.
This is an interesting problem!
I am going to assume that a and b are column vectors of the same length. Let us consider a simple example:
a = [9;10;2;10;7];
b = [1;3;6;10;10];
% yields:
c = [221;146;74;31;7];
Now let's see what happens when we compute the convolution of these vectors:
>> conv(a,b)
ans =
9
37
86
166
239
201
162
170
70
>> conv2(a, b.')
ans =
9 27 54 90 90
10 30 60 100 100
2 6 12 20 20
10 30 60 100 100
7 21 42 70 70
We notice that c is the sum of elements along the lower diagonals of the result of conv2. To show this more clearly, we transpose the arguments to get the diagonals in the same order as the values in c:
>> triu(conv2(a.', b))
ans =
9 10 2 10 7
0 30 6 30 21
0 0 12 60 42
0 0 0 100 70
0 0 0 0 70
So now it becomes a question of summing the diagonals of a matrix, which is a more common problem with existing solutions, for example this one by Andrei Bobrov:
C = conv2(a.', b);
p = sum( spdiags(C, 0:size(C,2)-1) ).'; % This gives the same result as the loop.
This is a snapshot of my dataset:
A B
1 34
1 33
1 66
0 54
0 77
0 98
0 39
0 12
I am trying to create a random sample that contains two 1s and three 0s from column A, along with their respective B values. Is there a way to do that? Basically I'm trying to see how to get a sample with specific proportions of a particular column. Thanks.
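One possible approach (a minimal sketch, not an official answer; it assumes your data frame is called df): split the rows by the value of A, sample the required number from each group, and bind the pieces back together.

# sample 2 rows with A == 1 and 3 rows with A == 0 (df is assumed)
ones  <- df[df$A == 1, ]
zeros <- df[df$A == 0, ]
samp  <- rbind(ones[sample(nrow(ones), 2), ],
               zeros[sample(nrow(zeros), 3), ])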
I have several maps that I am working with. I want to extract the values (1, 0, and NA) from the maps and place them all into a summary matrix. Since I have so many maps, I think it's best to do this as a for loop. This is the code I have so far; my maps and empty summary matrix are uploaded to my Dropbox here: DATASET here
setwd('C:/Users/Israel/Dropbox/')
require(raster)
require(rgdal)
require(plyr)

# load in the empty matrix to be filled
range.summary <- read.csv('range_sizes.csv', header = TRUE)

# load in maps and count pixels
G1.total <- raster('Group1/Summary/PA_current_G1.tif')
G1.total.df <- as.data.frame(G1.total)

# these are the values I need placed into the empty matrix (range.summary)
count(G1.total.df)
  PA_current_G1   freq
1             0 227193
2             1 136871
3            NA 561188
Try this. I downloaded three images:
library(raster)
wd <- 'D:\\Programacao\\R\\Stackoverflow\\raster'
allfiles <- list.files(file.path(wd), all.files = FALSE)
# keep only the TIF files in that folder
tifs <- grep(".tif$", allfiles, ignore.case = TRUE, value = TRUE)
# stack the raster layers
mystack <- stack(file.path(wd, tifs))
# calculate frequencies
freqs <- freq(mystack, useNA='ifany')
# rbind list to get a data.frame
freqsdf <- do.call(rbind.data.frame, freqs)
freqsdf
                    value  count
PA_2050_26_G1.1         0 256157
PA_2050_26_G1.2         1 193942
PA_2050_26_G1.3        NA 475153
PA_2050_26_G2.1         0 350928
PA_2050_26_G2.2         1  99171
PA_2050_26_G2.3        NA 475153
PA_2050_26_sub_G1.1     0 112528
PA_2050_26_sub_G1.2     1  90800
PA_2050_26_sub_G1.3    NA 721924
str(freqsdf)
'data.frame': 9 obs. of 2 variables:
$ value: num 0 1 NA 0 1 NA 0 1 NA
$ count: num 256157 193942 475153 350928 99171 ...
Now it is just a matter of working the output into the shape you need.
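For example, one possible next step (a sketch, not part of the original answer; it assumes the row names keep the layer.index pattern shown above) is to get one row per layer and one column per value:

# derive the layer name from the row names, then tabulate count by layer and value
freqsdf$layer <- sub("\\.\\d+$", "", rownames(freqsdf))
with(freqsdf, tapply(count, list(layer, factor(value, exclude = NULL)), sum))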
I have a crosstab and I create a custom grand total for the row level in each column dimension by using a data element expression.
Crosstab Example:
              Cat 1                      Cat 2                GT
ITEM     C    F     %  VALUE      C     F     %  VALUE
A      101    0   0.9     10    112   105  93.8     10       20
B      294    8   2.7      6     69    66  95.7     10       16
C      211    7   3.3      4    212   161  75.9      6       10
-----------------------------------------------------------------
GT     606   15  2.47      6    393   332  84.5      8   **14**
Explanation of the GT row: the C and F columns are sums of the values above them, but the % column is the result of dividing F by C. For Cat 1, for example, C = 101 + 294 + 211 = 606, F = 0 + 8 + 7 = 15, and % = 15 / 606 ≈ 2.47.
I create a data element to fill the VALUE column, which comes from a range-of-values definition that varies for each Cat (category). For instance, in Cat 1, if the % value is between 0 and 1 the VALUE will be 10, between 1 and 2 it will be 8, etc. The condition for Cat 2 is: between 85 and 100 gives 10, between 80 and 85 gives 8, etc.
The VALUE in the GT row (14) is obtained by adding the VALUE of Cat 1 and the VALUE of Cat 2 (6 + 8 = 14).
I am able to get points 1 and 2 above working, but I can't seem to make it work for the GT row. I don't know the code/expression to sum up the VALUE data element across these two categories, because those VALUE fields come from a single data element in design mode.
I have found the solution to my problem. I can show the result by using report variables. I assign two report variables in the % field expression, based on the category in the data cube dimension (using an if statement). Then, in the data element expression, I call both of them and add them together.
I'm trying to do this without a for loop but can't figure it out.
I want to replace the first NA in a column with a default value of 0.0000001.
I am doing Last Observation Carried Forward (LOCF) imputation but want to give it a default value.
If I have the following data.frame:
Col1 Col2 Col3 Col4
   1   NA   10   99
  NA   NA   11   99
   1   NA   12   99
   1   NA   13   NA
I want it to look like this:
     Col1      Col2 Col3      Col4
        1 0.0000001   10        99
0.0000001        NA   11        99
        1        NA   12        99
        1        NA   13 0.0000001
This is the code I have; it works but is very slow...
# Temporary change for missing first observations
for (u in 1:ncol(temp_equity_df_merge2)) {
  for (v in 1:nrow(temp_equity_df_merge2)) {
    # Temporarily change the leading observations in a column to 0.0000001
    # until a value that isn't NA is encountered
    if (is.na(temp_equity_df_merge2[v, u])) {
      temp_equity_df_merge2[v, u] <- 0.0000001
    } else {
      break
    }
  }
}
I want to use apply or some variant that will be faster. I am looping over 20 columns and 1 million rows.
Thanks ahead of time for the help.
You can apply a function to each column:
myfun <- function(x) {
  x[which(is.na(x))[1]] <- 0.1
  return(x)
}
> data.frame(apply(dat, 2, myfun))
v1 v2 v3 v4
1 1.0 0.1 10 99.0
2 0.1 NA 11 99.0
3 1.0 NA 12 99.0
4 1.0 NA 13 0.1
Based on the comments, you can use apply to apply a function to each column. The function replaces the first NA in that column with 0.0000001, and apply returns a matrix. Then you can use na.locf from the zoo package to fill in the remaining NAs. Finally, I wrapped it all in data.frame since you asked for a data.frame instead of a matrix.
library(zoo)  # for na.locf

data.frame(na.locf(apply(dat, 2, function(x) {
  firstNA <- head(which(is.na(x)), 1)  # position of first NA, if any
  x[firstNA] <- 0.0000001
  x
})))
Col1 Col2 Col3 Col4
1 1e+00 1e-07 10 9.9e+01
2 1e-07 1e-07 11 9.9e+01
3 1e+00 1e-07 12 9.9e+01
4 1e+00 1e-07 13 1.0e-07
Given that you have such a large data set, I would use data.table and set to avoid copying the data. Both of the apply solutions copy the data at least once.
The solution involves a for loop, but an efficient one: it does length(valid_replace) operations, each of which is essentially instantaneous.
library(data.table)
DT <- as.data.table(dat)

# position of the first NA in each column (NA if the column has none)
replacing <- lapply(DT, function(x) which(is.na(x))[1])
# keep only the columns that actually contain an NA
valid_replace <- Filter(Negate(is.na), replacing)
replace_with <- 0.0000001

for (i in seq_along(valid_replace)) {
  set(DT, i = valid_replace[[i]], j = names(valid_replace)[i], value = replace_with)
}
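If you then want to carry the values forward as in your LOCF step, a possible follow-up (a sketch, assuming the zoo package for na.locf) is to fill the remaining NAs in place as well:

library(zoo)
# overwrite each column with its last-observation-carried-forward version
DT[, (names(DT)) := lapply(.SD, na.locf, na.rm = FALSE)]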