How to count the number of missing values for each column of a data frame with SparkR?

I am processing a 2.5 GB CSV file containing 1.1 million lines and 1,000 numeric columns that seem to be sparsely populated. I currently run Spark on a single-core VM with 8 GB of RAM, and the data has been split into 16 partitions.
I tried something like the following, but it takes ages:
ldf <- dapplyCollect(
  df,
  function(df.partition) {
    apply(df.partition, 2, function(col) { sum(is.na(col)) })
  })

Here's one way to do it, using sparklyr and dplyr. For the sake of a reproducible example, I am using the flights data from the nycflights13 package (336,776 obs. of 19 variables):
library(nycflights13)
library(sparklyr)
library(dplyr)
sc <- sparklyr::spark_connect(master = "local", version = "2.1.0", hadoop_version = "2.7")
flights_spark <- sparklyr::copy_to(sc, flights)
src_tbls(sc)
flights_spark_isna_count <- flights_spark %>%
  dplyr::mutate_all(is.na) %>%
  dplyr::mutate_all(as.numeric) %>%
  dplyr::summarise_all(sum)
And you get the results:
> collect(flights_spark_isna_count)
# A tibble: 1 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 8255 0 8255 8713 0 9430 0 0 2512 0 0 9430
# ... with 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>
On my old laptop, all of this code took around 30 seconds (i.e. including starting the Spark session, reading the data into Spark, and then counting the NAs; that last step took less than 10 seconds, I think).
Of course your dataset is larger, but perhaps it works. (I also tried it on a larger dataset I am working on, about 2 million obs. and 146 variables, and it takes only a couple of minutes.)
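For a SparkR-only route, a rough sketch (untested, and my own suggestion rather than part of the answer above) would be to build one aggregation expression per column so the counting stays inside Spark instead of collecting partitions into R. Here df is the SparkDataFrame from the question; missing CSV values typically land in Spark as NULL, which is what isNull() counts:
library(SparkR)
# one sum(isNull(col)) expression per column, each aliased to the column name
exprs <- lapply(columns(df), function(colname) {
  alias(sum(cast(isNull(df[[colname]]), "integer")), colname)
})
na_counts <- collect(do.call(agg, c(list(df), exprs)))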

Related

dplyr - sort one table by a COL from another table

I have a table with cars, and another table with rankings for each car.
How do I count (and get the %) of cars in the CAR table, sort by the RANK table, and produce the desired output?
Next time, please provide a reproducible example, as mentioned in the comments.
Anyway, this might help:
library(tidyverse)
# example datasets
df1 <- tibble(car = c("Prius", "Tesla", "Prius", "BMW"),
              name = NA_character_)
df2 <- tibble(car = c("BMW", "Prius", "Tesla"),
              rank = 1:3)
# joining
left_join(df2, df1 %>% count(car)) %>%
  select(rank, car, count = n) %>%
  mutate(pct = count / sum(count) * 100)
# A tibble: 3 × 4
rank car count pct
<int> <chr> <int> <dbl>
1 1 BMW 1 25
2 2 Prius 2 50
3 3 Tesla 1 25
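If you prefer to be explicit about the join key (and silence dplyr's "Joining, by" message), the same pipeline can name it directly; a minor variation on the code above:
left_join(df2, count(df1, car), by = "car") %>%
  select(rank, car, count = n) %>%
  mutate(pct = count / sum(count) * 100)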

How can we use filter to find the county (a column in the counties dataset) with the minimum population found by the summarize function in R?

counties_selected <- counties %>%
  select(county, region, state, population, citizens, unemployment, income)
counties_selected %>%
  # Summarize to find minimum population, maximum unemployment, and average income
  summarize(min_population = min(population),
            max_unemployment = max(unemployment),
            average_income = mean(income))
# A tibble: 1 x 3
  min_population max_unemployment average_income
1             85             29.4         46832.
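One way to get the county itself, rather than just the minimum value, is to skip summarize and use filter directly; a minimal sketch, assuming the counties_selected data frame from the question:
counties_selected %>%
  filter(population == min(population, na.rm = TRUE))
This keeps the whole row (county, region, state, and so on) for the county with the smallest population, whereas summarize collapses everything down to a single aggregated row.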

Reduce computing time for reshape

I have the following dataset, which I would like to reshape from wide to long format:
Name     Code  CURRENCY  01/01/1980  02/01/1980  03/01/1980  04/01/1980
Abengoa  4256  USD             1.53        1.54        1.51        1.52
Adidas   6783  USD             0.23        0.54        0.61        0.62
The data consists of stock prices for different firms on each day from 1980 to 2013. Therefore, I have 8,612 columns in my wide data (and about 3,000 rows). Now, I am using the following command to reshape the data into long format:
library(reshape)
data <- read.csv("data.csv")
data1 <- melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date")
However, for .csv files that are about 50 MB in size, it already takes about two hours. The computing time shouldn't be limited by weak hardware, since I am running this on a 2.7 GHz Intel Core i7 with 16 GB of RAM. Is there a more efficient way to do this?
Many thanks!
Benchmarks Summary:
Using stack (as suggested by @AnandaMahto) is definitely the way to go for smaller data sets (N < 3,000).
As the data sets get larger, data.table begins to outperform stack.
Here is an option using data.table
library(data.table)
dtt <- data.table(data)
# non-value columns, i.e. the columns to keep after the reshape
nvc <- c("Name", "Code", "CURRENCY")
# names of the columns being transformed
dateCols <- setdiff(names(data), nvc)
# use rbindlist to combine the per-date subsets
dtt2 <- rbindlist(lapply(dateCols, function(d) {
  # := overwrites the Date column of dtt by reference on each iteration
  dtt[, Date := d]
  cols <- c(nvc, "Date", d)
  setnames(dtt[, cols, with = FALSE], cols, c(nvc, "Date", "value"))
}))
## Results:
dtt2
# Name Code CURRENCY Date value
# 1: Abengoa 4256 USD X_01_01_1980 1.53
# 2: Adidas 6783 USD X_01_01_1980 0.23
# 3: Abengoa 4256 USD X_02_01_1980 1.54
# 4: Adidas 6783 USD X_02_01_1980 0.54
# 5: ... <cropped>
Updated Benchmarks with larger sample data
As per the suggestion from @AnandaMahto, below are benchmarks using larger sample data.
Please feel free to improve any of the methods used below and/or add new methods.
Benchmarks
library(rbenchmark)  # provides benchmark()
Resh <- quote(reshape::melt(data, id=c("Name","Code", "CURRENCY"), variable_name="Date"))
Resh2 <- quote(reshape2::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
DT <- quote({
  nvc <- c("Name", "Code", "CURRENCY")
  dateCols <- setdiff(names(data), nvc)
  rbindlist(lapply(dateCols, function(d) {
    dtt[, Date := d]
    cols <- c(nvc, "Date", d)
    setnames(dtt[, cols, with = FALSE], cols, c(nvc, "Date", "value"))
  }))
})
Stack <- quote(data.frame(data[1:3], stack(data[-c(1, 2, 3)])))
# SAMPLE SIZE: ROWS = 900; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(Resh=eval(Resh),Resh2=eval(Resh2),DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 Stack 0.813 0.623 0.192 5
# 2.530 DT 2.057 2.035 0.026 5
# 40.470 Resh 32.902 18.410 14.602 5
# 40.578 Resh2 32.990 18.419 14.728 5
# SAMPLE SIZE: ROWS = 3,500; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.00 DT 2.407 2.336 0.076 5
# 1.08 Stack 2.600 1.626 0.983 5
# SAMPLE SIZE: ROWS = 27,000; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 DT 10.450 7.418 3.058 5
# 2.232 Stack 23.329 14.180 9.266 5
Sample Data Creation
# rm(list=ls(all=TRUE))
set.seed(1)
LLLL <- apply(expand.grid(LETTERS, LETTERS[10:15], LETTERS[1:20], LETTERS[1:5], stringsAsFactors=FALSE), 1, paste0, collapse="")
size <- 900
dateSamples <- 380
startDate <- as.Date("1980-01-01")
Name <- apply(matrix(LLLL[1:(2*size)], ncol=2), 1, paste0, collapse="")
Code <- sample(1e3:max(1e4-1, size+1e3), length(Name))
CURRENCY <- sample(c("USD", "EUR", "YEN"), length(Name), TRUE)
Dates <- seq(startDate, length.out=dateSamples, by="mon")
Values <- sample(c(1:1e2, 1:5e2), size=size*dateSamples, TRUE) / 1e2
# Calling the sample data frame `data` to keep consistency, but I don't like this practice
data <- data.frame(Name, Code, CURRENCY,
matrix(Values, ncol=length(Dates), dimnames=list(c(), as.character(Dates)))
)
data[1:6, 1:8]
# Name Code CURRENCY X1980.01.01 X1980.02.01 X1980.03.01 X1980.04.01 X1980.05.01
# 1 AJAAQNFA 3389 YEN 0.37 0.33 3.58 4.33 1.06
# 2 BJAARNFA 4348 YEN 1.14 2.69 2.57 0.27 3.02
# 3 CJAASNFA 6154 USD 2.47 3.72 3.32 0.36 4.85
# 4 DJAATNFA 9171 USD 2.22 2.48 0.71 0.79 2.85
# 5 EJAAUNFA 2814 USD 2.63 2.17 1.66 0.55 3.12
# 6 FJAAVNFA 9081 USD 1.92 1.47 3.51 3.23 3.68
From the question :
data <- read.csv("data.csv")
and
... for .csv files that are about 50MB big, it already takes about two
hours ...
So although stack/melt/reshape comes into play, I'm guessing (since this is your first-ever S.O. question) that the biggest factor here is read.csv, assuming you're including that in your timing as well as melt (it isn't clear).
The default arguments to read.csv are well known to be slow. A few quick searches should reveal hints and tips (e.g. stringsAsFactors, colClasses), such as:
http://cran.r-project.org/doc/manuals/R-data.html
Quickly reading very large tables as dataframes
But I'd suggest fread (available since data.table 1.8.7). To get a feel for fread, its manual page is here:
https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/fread
The examples section there, as it happens, has a 50 MB example shown to be read in 3 seconds instead of up to 60. And benchmarks are starting to appear in other answers, which is great to see.
Then the stack/reshape/melt answers are next order, if I guessed correctly.
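Putting the two ideas together, a rough sketch (my own suggestion, not something benchmarked in the answers here) would be to use fread for the read and data.table's own melt for the reshape, which in recent data.table versions is far faster than reshape::melt:
library(data.table)
# fread is typically much faster than read.csv with default arguments
DT <- fread("data.csv")
# data.table's melt keeps the reshape inside data.table
long <- melt(DT,
             id.vars = c("Name", "Code", "CURRENCY"),
             variable.name = "Date",
             value.name = "value")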
While the testing is going on, I'll post my comment as an answer for you to consider. Try using stack as in:
data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))
In many cases, stack works really efficiently with these types of operations, and adding back in the first few columns also works quickly because of how vectors are recycled in R.
For that matter, this might also be worth considering:
data.frame(data[1:3],
           vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
           date = rep(names(data)[-c(1, 2, 3)], each = nrow(data)))
I'm cautious to benchmark on such a small sample of data though, because I suspect the results won't be quite comparable to benchmarking on your actual dataset.
Update: Results of some more benchmarks
Using @RicardoSaporta's benchmarking procedure, I have benchmarked data.table against what I've called "Manual" data.frame creation, on datasets ranging from 1,000 to 3,000 rows in 500-row increments, all with 8,003 columns (8,000 data columns plus the three initial columns).
The results can be seen here: http://rpubs.com/mrdwab/reduce-computing-time
Ricardo is correct: there seems to be something about 3,000 rows that makes a huge difference with the base R approaches (and it would be interesting if anyone has an explanation for what that might be). But this "Manual" approach is actually even faster than stack, if performance really is the primary concern.
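For reference, the Manual expression benchmarked below is presumably a quoted version of the data.frame construction shown earlier; something along these lines (my reconstruction, since the original post does not show it):
Manual <- quote(data.frame(data[1:3],
                           vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
                           date = rep(names(data)[-c(1, 2, 3)], each = nrow(data))))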
Here are the results for just the last three runs:
data <- makeSomeData(2000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 0.908 0.696 0.108 1
## 1 3.963 DT 3.598 3.564 0.012 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(2500, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 2.841 1.044 0.296 1
## 1 1.694 DT 4.813 4.661 0.080 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(3000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 1 1.00 DT 7.223 5.769 0.112 1
## 2 29.27 Manual 211.416 1.560 0.952 1
Ouch! data.table really turns the tables on that last run!

Avoiding row-wise processing of data.frame in R

I was wondering what the best way is to avoid row-wise processing in R; most row-wise work is ultimately done in internal C routines anyway. For example, I have a data frame a:
chromosome_name start_position end_position strand
1 15 35574797 35575181 1
2 15 35590448 35591641 -1
3 15 35688422 35688645 1
4 13 75402690 75404217 1
5 15 35692892 35693969 1
What I want is: depending on whether strand is positive or negative, set startOFgene to start_position or end_position. One way to avoid a for loop would be to split the data.frame into +1-strand and -1-strand subsets and perform the selection on each. What other ways are there to speed this up? That method does not scale if there is other, more complicated per-row processing to be done.
Maybe this is fast enough...
transform(a, startOFgene = ifelse(strand == 1, start_position, end_position))
chromosome_name start_position end_position strand startOFgene
1 15 35574797 35575181 1 35574797
2 15 35590448 35591641 -1 35591641
3 15 35688422 35688645 1 35688422
4 13 75402690 75404217 1 75402690
5 15 35692892 35693969 1 35692892
First, since all your columns are integer/numeric, you could use a matrix instead of a data.frame. Many operations on a matrix are a lot faster than the same operation on a data.frame, even though they're not very different in this case. Then you can use logical subsetting to create the startOFgene column.
# Create some large-ish data
M <- do.call(rbind,replicate(1e3,as.matrix(a),simplify=FALSE))
M <- do.call(rbind,replicate(1e3,M,simplify=FALSE))
A <- as.data.frame(M)
# Create startOFgene column in a matrix
m <- function() {
  M <- cbind(M, startOFgene = M[, "start_position"])
  negStrand <- sign(M[, "strand"]) < 0
  M[negStrand, "startOFgene"] <- M[negStrand, "end_position"]
}
# Create startOFgene column in a data.frame
d <- function() {
  A$startOFgene <- A$start_position
  negStrand <- sign(A$strand) < 0
  A$startOFgene[negStrand] <- A$end_position[negStrand]
}
library(rbenchmark)
benchmark(m(), d(), replications=10)[,1:6]
# test replications elapsed relative user.self sys.self
# 2 d() 10 18.804 1.000 16.501 2.224
# 1 m() 10 19.713 1.048 16.457 3.152

How to make computing/inserting a difference-of-dates column faster?

Can you make this R code faster? Can't see how to vectorize it.
I have a data-frame as follows (sample rows below):
> str(tt)
'data.frame': 1008142 obs. of 4 variables:
 $ customer_id: int ...
 $ visit_date : Date, format: "2010-04-04", ...
 ...
I want to compute the diff between visit_dates for a customer.
So I do diff(tt$visit_date), but I have to enforce a discontinuity (NA) wherever customer_id changes, since the diff is meaningless there, e.g. row 74 below.
The code at the bottom does this, but takes more than 15 minutes on the 1M-row dataset.
I also tried computing piecewise and cbind'ing the sub-result per customer_id (using which()); that was also slow.
Any suggestions? Thanks. I did search SO, R-intro, R manpages, etc.
customer_id visit_date visit_spend ivi
72 40 2011-03-15 18.38 5
73 40 2011-03-20 23.45 5
74 79 2010-04-07 150.87 NA
75 79 2010-04-17 101.90 10
76 79 2010-05-02 111.90 15
Code:
all_tt_cids <- unique(tt$customer_id)
# Append ivi (intervisit interval) column
tt$ivi <- c(NA, diff(tt$visit_date))
for (cid in all_tt_cids) {
  # ivi has a discontinuity when customer_id changes
  tt$ivi[min(which(tt$customer_id == cid))] <- NA
}
(Wondering if we can create a logical index where customer_id differs to the row above?)
To set NA in the appropriate places, you can again use diff() and a one-line trick:
> tt$ivi[c(1,diff(tt$customer_id)) != 0] <- NA
Explanation
Let's take some vector x:
x <- c(1,1,1,1,2,2,2,4,4,4,5,3,3,3)
We want to extract the indexes at which a new number starts, i.e. (1, 5, 8, 11, 12). We can use diff() for that.
y <- c(1,diff(x))
# y = 1 0 0 0 1 0 0 2 0 0 1 -2 0 0
and set NA at the indexes where y is not equal to zero:
x[y!=0] <- NA
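Applied to the original problem, the whole loop collapses to two vectorized lines; a sketch, assuming tt is sorted by customer_id and then visit_date as in the question:
# intervisit interval, with NA wherever the customer changes
tt$ivi <- c(NA, diff(tt$visit_date))
tt$ivi[c(1, diff(tt$customer_id)) != 0] <- NA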
