Find and Replace first NA in each column without for loops - performance

I'm trying to do this without a for loop but can't figure it out.
I want to replace the first NA in each column with a default value of 0.0000001.
I am doing Last Observation Carried Forward (LOCF) imputation, but want to give it a default starting value.
If I have the following data.frame:
Col1 Col2 Col3 Col4
1    NA   10   99
NA   NA   11   99
1    NA   12   99
1    NA   13   NA
I want it to look like this:
Col1      Col2      Col3 Col4
1         0.0000001 10   99
0.0000001 NA        11   99
1         NA        12   99
1         NA        13   0.0000001
This is the code I have that works, but it is very slow...
# Temporarily fill in missing first observations
for (u in 1:ncol(temp_equity_df_merge2)) {
  for (v in 1:nrow(temp_equity_df_merge2)) {
    # Change the leading NAs in each column to 0.0000001
    # until a value that isn't NA is encountered
    if (is.na(temp_equity_df_merge2[v, u])) {
      temp_equity_df_merge2[v, u] <- 0.0000001
    } else {
      break
    }
  }
}
I want to use apply or some variant that will be faster. I am looping over 20 columns and 1 million rows.
Thanks ahead of time for the help.

You can apply a function to each column:
myfun <- function(x) {
  x[which(is.na(x))[1]] <- 0.1
  return(x)
}
> data.frame(apply(dat, 2, myfun))
   v1  v2 v3   v4
1 1.0 0.1 10 99.0
2 0.1  NA 11 99.0
3 1.0  NA 12 99.0
4 1.0  NA 13  0.1
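One caveat, not from the original answer: if a column contains no NAs at all, which(is.na(x))[1] returns NA, and assigning with an NA index is an error in R ("NAs are not allowed in subscripted assignments"). A guarded variant, as a sketch:
myfun_safe <- function(x) {
  idx <- which(is.na(x))[1]      # position of the first NA, or NA if none
  if (!is.na(idx)) x[idx] <- 0.1 # only replace when an NA actually exists
  x
}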

Based on the comments, you can use apply to apply a function to each column. The function replaces the first NA in each column with 0.0000001 and returns a matrix. Then you can use na.locf from the zoo package to fill in the remaining NAs. Finally, I wrapped it all in data.frame, since you asked for a data.frame instead of a matrix:
library(zoo)
data.frame(na.locf(apply(dat, 2, function(x) {
  firstNA <- head(which(is.na(x)), 1)  # position of first NA (empty if none)
  x[firstNA] <- 0.0000001
  x
})))
   Col1  Col2 Col3    Col4
1 1e+00 1e-07   10 9.9e+01
2 1e-07 1e-07   11 9.9e+01
3 1e+00 1e-07   12 9.9e+01
4 1e+00 1e-07   13 1.0e-07

Given that you have such a large data set, I would use data.table and set to avoid copying the data; both apply solutions copy the data at least once.
The solution involves a for loop, but an efficient one: it performs length(valid_replace) assignments, each of which is essentially instantaneous.
library(data.table)
DT <- as.data.table(dat)
# Position of the first NA in each column (NA if the column contains none)
replacing <- lapply(DT, function(x) which(is.na(x))[1])
# Keep only the columns that actually contain an NA
valid_replace <- Filter(Negate(is.na), replacing)
replace_with <- 0.0000001
for (i in seq_along(valid_replace)) {
  set(DT, i = valid_replace[[i]], j = names(valid_replace)[i], value = replace_with)
}
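To finish the LOCF step by reference as well, one possibility (a sketch, assuming the zoo package) is to run na.locf over all columns:
library(zoo)
cols <- names(DT)
# Fill the remaining NAs downward; na.rm = FALSE keeps any leading NAs
DT[, (cols) := lapply(.SD, na.locf, na.rm = FALSE)]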

Related

Group columns and separate in R

I have a dataset where I want to group two columns together (Proc Right and Days to Proc Right) and separate them from the next group of two columns (Proc Left and Days to Proc Left). During separation, I want to split based on the chronology of days to procedure, assigning 0 and NA to the two columns that are chronologically later. I then want to create a new column holding only the days to procedure.
To summarise:
Have this:
ID  Proc ID  Proc Right  Days to Proc Right  Proc Left  Days to Proc Left
1   108      4           41                  4          168
1   105      4           169                 4          42
1   101      3           270                 0          NA
Want this:
ID  Proc ID  Proc Right  Days to Proc Right  Proc Left  Days to Proc Left  Days to Proc
1   108      4           41                  0          NA                 41
1   108      0           NA                  4          168                168
1   105      0           NA                  4          42                 42
1   105      4           169                 0          NA                 169
1   101      3           270                 0          NA                 270
I have tried unite and cSplit, which separate the column groups but don't help me assign 0 and NA to the other columns.
Would appreciate any help. Thanks
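One possible approach (a sketch using dplyr; the dotted column names such as Proc.Right are stand-ins for the real ones): build one row per side that has a procedure, blank out the other side, and sort chronologically.

library(dplyr)

# Hypothetical reconstruction of the example data (column names assumed)
df <- data.frame(
  ID = c(1, 1, 1),
  Proc.ID = c(108, 105, 101),
  Proc.Right = c(4, 4, 3),
  Days.to.Proc.Right = c(41, 169, 270),
  Proc.Left = c(4, 4, 0),
  Days.to.Proc.Left = c(168, 42, NA)
)

# Rows where the right side has a procedure; blank out the left side
right <- df %>%
  filter(Proc.Right > 0) %>%
  mutate(Proc.Left = 0, Days.to.Proc.Left = NA_real_,
         Days.to.Proc = Days.to.Proc.Right)

# Same for the left side
left <- df %>%
  filter(Proc.Left > 0) %>%
  mutate(Proc.Right = 0, Days.to.Proc.Right = NA_real_,
         Days.to.Proc = Days.to.Proc.Left)

# One row per performed procedure, ordered as in the desired output
result <- bind_rows(right, left) %>%
  arrange(ID, desc(Proc.ID), Days.to.Proc)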

Make a matrix B of the first, fourth and fifth row and the first and fifth column from matrix A in OCTAVE

I have matrix A
A =
5 10 15 20 25
10 9 8 7 6
-5 -15 -25 -35 -45
1 2 3 4 5
28 91 154 217 280
And I need to make a matrix B of the first, fourth and fifth rows and the first and fifth columns of matrix A.
How can I do it?
>> B = A([1,4,5],[1,5])
B =
5 25
1 5
28 280
You should look up how to use index expressions in the Matlab and Octave languages to extract and work with submatrices.
See the Octave help on Index expressions: https://octave.org/doc/latest/Index-Expressions.html

extracting values from maps and inserting into a summary table

I have several maps that I am working with. I want to extract the values (1, 0 and NA) from the maps and place them all into a summary matrix. Since I have so many maps, I think it's best to do this as a for loop. This is the code I have so far; my maps and empty summary matrix are uploaded to my Dropbox here: DATASET here
setwd('C:/Users/Israel/Dropbox/')
require(raster)
require(rgdal)
require(plyr)
# Load in the empty matrix to be filled
range.summary <- read.csv('range_sizes.csv', header = TRUE)
# Load in maps and count pixels
G1.total <- raster('Group1/Summary/PA_current_G1.tif')
G1.total.df <- as.data.frame(G1.total)
# These are the values I need placed into the empty matrix (range.summary)
count(G1.total.df)
  PA_current_G1   freq
1             0 227193
2             1 136871
3            NA 561188
Try this
I downloaded 3 images
library(raster)
wd <- 'D:\\Programacao\\R\\Stackoverflow\\raster'
allfiles <- list.files(file.path(wd), all.files = FALSE)
# List of TIF files in the folder
tifs <- grep("\\.tif$", allfiles, ignore.case = TRUE, value = TRUE)
# Stack the raster layers
mystack <- stack(file.path(wd, tifs))
# Calculate frequencies, keeping NA counts
freqs <- freq(mystack, useNA = 'ifany')
# rbind the list to get a data.frame
freqsdf <- do.call(rbind.data.frame, freqs)
freqsdf
                    value  count
PA_2050_26_G1.1         0 256157
PA_2050_26_G1.2         1 193942
PA_2050_26_G1.3        NA 475153
PA_2050_26_G2.1         0 350928
PA_2050_26_G2.2         1  99171
PA_2050_26_G2.3        NA 475153
PA_2050_26_sub_G1.1     0 112528
PA_2050_26_sub_G1.2     1  90800
PA_2050_26_sub_G1.3    NA 721924
str(freqsdf)
'data.frame': 9 obs. of 2 variables:
 $ value: num 0 1 NA 0 1 NA 0 1 NA
 $ count: num 256157 193942 475153 350928 99171 ...
Now it is just a matter of reshaping the output into the form you need.
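For example (a sketch; the regular expression and the target layout of range.summary are assumptions), you could strip the layer suffix from the row names and cross-tabulate counts per map:
# One row per map, one column per pixel value (0, 1, NA)
freqsdf$map   <- sub("\\.\\d+$", "", rownames(freqsdf))            # drop the trailing .1/.2/.3
freqsdf$value <- ifelse(is.na(freqsdf$value), "NA", freqsdf$value) # label NA explicitly
summary_tab   <- xtabs(count ~ map + value, data = freqsdf)
summary_tab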

Sum up custom grand total on crosstab in BIRT

I have a crosstab and create a custom grand total for the row level in each column dimension, using a data element expression.
Crosstab Example:
             Cat 1                  Cat 2              GT
ITEM    C   F    %   VALUE     C    F    %   VALUE
A     101   0  0.9      10   112  105 93.8      10    20
B     294   8  2.7       6    69   66 95.7      10    16
C     211   7  3.3       4   212  161 75.9       6    10
---------------------------------------------------------
GT    606  15 2.47       6   393  332 84.5       8    14
Explanation for the GT row:
1. The C and F columns are summed from the values above, but the % column is the division F/C.
2. I create a data element to fill the VALUE column, which comes from a range-of-values definition that varies per Cat (category). For instance, in Cat 1, if the value is between 0 and 1 the VALUE is 10, between 1 and 2 it is 8, etc.; in Cat 2, between 85 and 100 it is 10, and between 80 and 85 it is 8, etc.
3. The GT row (with the value of 14) is obtained by adding the VALUE of Cat 1 + Cat 2.
I am able to get points 1 and 2 above working, but I can't seem to make the GT row work. I don't know the code/expression to sum up the VALUE data element across the two categories, because those VALUE fields come from a single data element in design mode.
I have found the solution to my problem. I can show the result by using report variables: I assign two report variables in the % field expression, based on the category in the data cube dimension (using an if statement). Then, in the data element expression, I call both variables and add them together.

How to make computing/inserting a difference-of-dates column faster?

Can you make this R code faster? Can't see how to vectorize it.
I have a data-frame as follows (sample rows below):
> str(tt)
'data.frame': 1008142 obs. of 4 variables:
 $ customer_id: int ...
 $ visit_date : Date, format: "2010-04-04" ...
 ...
I want to compute the diff between visit_dates for a customer.
So I do diff(tt$visit_date), but have to enforce a discontinuity (NA) everywhere customer_id changes, where the diff is meaningless, e.g. row 74 below.
The code at the bottom does this, but takes >15 min on the 1M-row dataset.
I also tried computing the sub-results per customer_id (using which()) and cbind'ing them; that was also slow.
Any suggestions? Thanks. I did search SO, R-intro, R manpages, etc.
   customer_id visit_date visit_spend ivi
72          40 2011-03-15       18.38   5
73          40 2011-03-20       23.45   5
74          79 2010-04-07      150.87  NA
75          79 2010-04-17      101.90  10
76          79 2010-05-02      111.90  15
Code:
all_tt_cids <- unique(tt$customer_id)
# Append ivi (intervisit interval) column
tt$ivi <- c(NA, diff(tt$visit_date))
for (cid in all_tt_cids) {
  # ivi has a discontinuity when customer_id changes
  tt$ivi[min(which(tt$customer_id == cid))] <- NA
}
(Wondering if we can create a logical index where customer_id differs from the row above?)
To set NA in the appropriate places, you can again use diff() and a one-line trick:
> tt$ivi[c(1,diff(tt$customer_id)) != 0] <- NA
Explanation: let's take some vector x
x <- c(1,1,1,1,2,2,2,4,4,4,5,3,3,3)
We want to extract the indexes at which a new number starts, i.e. positions (1,5,8,11,12). We can use diff() for that:
y <- c(1, diff(x))
# y = 1 0 0 0 1 0 0 2 0 0 1 -2 0 0
and take those indexes that are not equal to zero:
x[y!=0] <- NA
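Putting the two pieces together, the whole loop from the question reduces to two vectorized lines (a sketch over the tt frame above):
# Day gaps between consecutive visits
tt$ivi <- c(NA, diff(tt$visit_date))
# NA wherever customer_id changes from the previous row (and for the first row)
tt$ivi[c(1, diff(tt$customer_id)) != 0] <- NA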
