Measuring accuracy in SparkR - sparkr

I am building a simple Random forest model on iris data in spark, I was hoping for some method of accuracy measure.
I thought of a simple column matching option too, however this did not work
Code:
library("SparkR")
sc = sparkR.session("local[*]")
iris_data <- as.DataFrame(iris)
train <- sample(iris_data, withReplacement=FALSE, fraction=0.5, seed=42)
test <- except(iris_data, train)
model_rf <- spark.randomForest(train, Species ~., "classification", numTrees = 10)
summary(model_rf)
Problem:
predictions <- predict(model_rf, test)
total_rows <- NROW(test)
predictions$correct <- (test$Species == test$prediction)
accuracy <- correct/total_rows
print(accuracy)
Error:
Error in column(callJMethod(x#sdf, "col", c)) :
P.S:
Using data bricks to run spark, don't mind running locally either

So this is how I did it,
total_rows <- NROW(test)
predictions$result <- ifelse((predictions$Species == predictions$prediction),
"TRUE", "FALSE")
correct <- NROW(predictions[predictions$result == "TRUE",])
accuracy <- correct/total_rows
cat(accuracy, "%")

Related

Error in (function (classes, fdef, mtable) unable to find an inherited method for function ‘krige’ for signature ‘"formula", "tbl_df"’

I have a strange Error and actually don't know how to solve it, even after checking other posts. Everything runs until the Kriging and then I receive the error: Error in (function (classes, fdef, mtable) unable to find an inherited method for function ‘krige’ for signature ‘"formula", "tbl_df"’
The strange thing is that everything worked a few days ago, I did not change anything in the code and now it doesn't run anymore. Some other posts related the problem with the Raster, but I could not find any discrepances. Is there something because of recent updates? I use for example the sp package.
Unfortunately I cannot provide the data I use, hopefully it can be solved without.
How can I solve the issue? Thank you in advance for the help.
homeDir = "D:/Folder/DataXYyear/"
y = 1992
Source = paste("Year", y, ".csv")
File = file.path(homeDir,Source)
GWMeas <- read_csv(File)
GWMeasX <- na.omit(GWMeas)
ggplot(
data = GWMeasX,
mapping = aes(x = X, y = Y, color = level)
) +
geom_point(size = 3) +
scale_color_viridis(option = "B") +
theme_classic()
GWMX_sf <- st_as_sf(GWMeasX, coords = c("X", "Y"), crs = 25832) %>%
cbind(st_coordinates(.))
v_emp_OK <- gstat::variogram(
level~1,
as(GWMX_sf, "Spatial") # switch from {sf} to {sp}
)
v_mod_OK <- automap::autofitVariogram(level~1, as(GWMX_sf, "Spatial"), model = "Sph")$var_model
GWMeasX %>% as.data.frame %>% glimpse
GW.vgm <- variogram(level~1, locations = ~X+Y, data = GWMeasX) # calculates sample variogram values
GW.fit <- fit.variogram(GW.vgm, model=vgm(model = "Gau")) # fit model
sf_GWlevel <- st_as_sf(GWMeasX, coords = c("X", "Y"), crs = 25833)
grd_sf <- sf_GWlevel %>%
st_bbox() %>%
st_as_sfc() %>%
st_make_grid(
cellsize = c(5000, 5000), # 5000m pixel size
what = "centers"
) %>%
st_as_sf() %>%
cbind(., st_coordinates(.))
grid <- as(grd_sf, "Spatial")
gridded(grid) <- TRUE
grid <- as(grid, "SpatialPixels")
createGrid <- function(XY.Spacing)
crs(grid) <- crs(GWMX_sf)
OK3 <- krige(formula = level~1, # variable to interpolate
data = GWMX_sf, # gauge data
newdata = grid, # grid to interpolate on
model = v_mod_OK, # variogram model to use
nmin = 4, # minimum number of points to use for the interpolation
nmax = 20, # maximum number of points to use for the interpolation
maxdist = 120e3 # maximum distance of points to use for the interpolation
)

How to distribute and fit h2o model by group in sparklyr

I would like to fit a model by group in h2o using some type of distributed apply function.
I tried the following but it doesn't work. Probably due to the fact I cannot pipe the sc object through.
df%>%
spark_apply(function(e)
h2o.coxph(x = predictors,
event_column = "event",
stop_column = "time_to_next",
training_frame = as_h2o_frame(sc, e, strict_version_check = FALSE))
group_by = "id"
)
I receive a pretty generic spark error like this:
error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 :
I'm not sure if you can return an entire H2OCoxPH model from sparklyr::spark_apply(): Errors are no method for coercing this S4 class to a vector if you set the fetch_result_as_sdf argument to FALSE and cannot coerce class ‘structure("H2OCoxPHModel", package = "h2o")’ to a data.frame if set to TRUE.
But if you can make your own vector or dataframe from the relevant parts of the model, I think you can do it.
Here I'll use a sample Cox Proportional Hazards file from H2O Docs Cox Proportional Hazards (CoxPH) and I'll use group_by = "surgery".
heart_hf <- h2o::h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
##### Convert to Spark DataFrame since I assume that is the use case
heart_sf <- sparklyr::copy_to(sc, heart_hf %>% as.data.frame())
##### Use sparklyr::spark_apply() on Spark DataFrame to "distribute and fit h2o model by group"
sparklyr::spark_apply(
x = heart_sf,
f = function(x) {
h2o::h2o.init()
heart_coxph <- h2o::h2o.coxph(x = c("age", "year"),
event_column = "event",
start_column = "start",
stop_column = "stop",
ties = "breslow",
training_frame = h2o::as.h2o(x, strict_version_check = FALSE))
return(data.frame(conc = heart_coxph#model$model_summary$concordance))
},
columns = list(surgery = "integer", conc = "numeric"),
group_by = c("surgery"))
# Source: spark<?> [?? x 2]
surgery conc
<int> <dbl>
1 1 0.588
2 0 0.614

Can facebook Prophet be applied to sparklyr via spark_apply

I'm trying to test if I can run prophet with sparklyr to make forecast for data in cluster. But when I use spark_apply the program is stuck.
Running sparklyr on an edgenode connected to a yarn-client with spark 2.2.0.
The data is sales by locations spanning last 4 years.
The plan is to create a dataframe with all the data and partition the data by locations then call prophet on each location and get prediction for the next 7 days.
Here I tried to pull data for one location and apply prophet but sparklyr was stuck.
library("sparklyr")
library("prophet")
sc <- spark_connect(master = "yarn-client",version = "2.2.0"))
query = "select * from saletable"
df <- sdf_sql(sc,query) %>%
filter(locationid=="1111") %>%
select(date,sales) %>%
sdf_repartition(partitions=1) %>%
select(ds=date,y=sales)
## try to predict sales the next 7 days and get the predictions
sparkly_prophet <- function(df){
m <- prophet::prophet(df)
future <- prophet::make_future_dataframe(m,periods=7,freq='day')
forecast <- predict(m,future)
return (dplyr::select(forecast,yhat) %>% tail(7))
}
Then I run but it gets stuck
spark_apply(df,sparkly_prophet)
When I've used spark_apply(), I've had better success including the function definition within the call to spark_apply(). I'm not sure why this is, but it may be worth a short to restructure your code as
spark_apply(
df,
function(df) {
m <- prophet::prophet(df)
future <- prophet::make_future_dataframe(m, periods = 7, freq = "day")
forecast <- predict(m, future)
yhat <- dplyr::select(forecast, yhat)
return(tail(yhat, 7))
}
)

how to delete a dataframe wirth BigR?

I have a dataframe created as such (from the BigR tutorials):
airfile <- system.file("extdata", "airline.zip", package="bigr")
airfile <- unzip(airfile, exdir = tempdir())
airR <- read.csv(airfile, stringsAsFactors=F)
# Upload the data to the BigInsights server. This may take 15-20 seconds
air <- as.bigr.frame(airR)
air <- bigr.persist(air, dataSource="DEL",
dataPath="/user/bigr/examples/airline_demo.csv",
header=T, delimiter=",", useMapReduce=F)
Is it possible to delete the dataframe using the bigr library? If so, how?
Ah, I wasn't looking in the right place. I was looking for an operation on the dataset, but a file system operation is required:
bigr.rmfs("/user/bigr/examples/airline_demo.csv", force = F, recursive = T)

idata.frame: Why error "is.data.frame(df) is not TRUE"?

I'm working with a large data frame called exp (file here) in R. In the interests of performance, it was suggested that I check out the idata.frame() function from plyr. But I think I'm using it wrong.
My original call, slow but it works:
df.median<-ddply(exp,
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
With idata.frame, Error: is.data.frame(df) is not TRUE
library(plyr)
df.median<-ddply(idata.frame(exp),
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
So, I thought, perhaps it is my data. So I tried the baseball dataset. The idata.frame example works fine: dlply(idata.frame(baseball), "id", nrow) But if I try something similar to my desired call using baseball, it doesn't work:
bb.median<-ddply(idata.frame(baseball),
.(id,year,team),
numcolwise(median),
na.rm=TRUE)
>Error: is.data.frame(df) is not TRUE
Perhaps my error is in how I'm specifying the groupings? Anyone know how to make my example work?
ETA:
I also tried:
groupVars <- c("groupname","starttime","fPhase","fCycle")
voi<-c('inadist','smldist','lardist')
i<-idata.frame(exp)
ag.median <- aggregate(i[,voi], i[,groupVars], median)
Error in i[, voi] : object of type 'environment' is not subsettable
which uses a faster way of getting the medians, but gives a different error. I don't think I understand how to use idata.frame at all.
Given you are working with 'big' data and looking for perfomance, this seems a perfect fit for data.table.
Specifically the lapply(.SD,FUN) and .SDcols arguments with by
Setup the data.table
library(data.table)
DT <- as.data.table(exp)
iexp <- idata.frame(exp)
Which columns are numeric
numeric_columns <- names(which(unlist(lapply(DT, is.numeric))))
dt.median <- DT[, lapply(.SD, median), by = list(groupname, starttime, fPhase,
fCycle), .SDcols = numeric_columns]
some benchmarking
library(rbenchmark)
benchmark(data.table = DT[, lapply(.SD, median), by = list(groupname, starttime,
fPhase, fCycle), .SDcols = numeric_columns],
plyr = ddply(exp, .(groupname, starttime, fPhase, fCycle), numcolwise(median), na.rm = TRUE),
idataframe = ddply(exp, .(groupname, starttime, fPhase, fCycle), function(x) data.frame(inadist = median(x$inadist),
smldist = median(x$smldist), lardist = median(x$lardist), inadur = median(x$inadur),
smldur = median(x$smldur), lardur = median(x$lardur), emptyct = median(x$emptyct),
entct = median(x$entct), inact = median(x$inact), smlct = median(x$smlct),
larct = median(x$larct), na.rm = TRUE)),
aggregate = aggregate(exp[, numeric_columns],
exp[, c("groupname", "starttime", "fPhase", "fCycle")],
median),
replications = 5)
## test replications elapsed relative user.self
## 4 aggregate 5 5.42 1.789 5.30
## 1 data.table 5 3.03 1.000 3.03
## 3 idataframe 5 11.81 3.898 11.77
## 2 plyr 5 9.47 3.125 9.45
Strange behaviour, but even in the docs it says that idata.frame is experimental. You probably found a bug. Perhaps you could rewrite the check at the top of ddply that tests is.data.frame().
In any case, this cuts about 20% off the time (on my system):
system.time(df.median<-ddply(exp, .(groupname,starttime,fPhase,fCycle), function(x) data.frame(
inadist=median(x$inadist),
smldist=median(x$smldist),
lardist=median(x$lardist),
inadur=median(x$inadur),
smldur=median(x$smldur),
lardur=median(x$lardur),
emptyct=median(x$emptyct),
entct=median(x$entct),
inact=median(x$inact),
smlct=median(x$smlct),
larct=median(x$larct),
na.rm=TRUE))
)
Shane asked you in another post if you could cache the results of your script. I don't really have an idea of your workflow, but it may be best to setup a chron to run this and store the results, daily/hourly whatever.

Resources