how to delete a dataframe wirth BigR? - biginsights

I have a dataframe created as such (from the BigR tutorials):
airfile <- system.file("extdata", "airline.zip", package="bigr")
airfile <- unzip(airfile, exdir = tempdir())
airR <- read.csv(airfile, stringsAsFactors=F)
# Upload the data to the BigInsights server. This may take 15-20 seconds
air <- as.bigr.frame(airR)
air <- bigr.persist(air, dataSource="DEL",
dataPath="/user/bigr/examples/airline_demo.csv",
header=T, delimiter=",", useMapReduce=F)
Is it possible to delete the dataframe using the bigr library? If so, how?

Ah, I wasn't looking in the right place. I was looking for an operation on the dataset, but a file system operation is required:
bigr.rmfs("/user/bigr/examples/airline_demo.csv", force = F, recursive = T)

Related

Can facebook Prophet be applied to sparklyr via spark_apply

I'm trying to test if I can run prophet with sparklyr to make forecast for data in cluster. But when I use spark_apply the program is stuck.
Running sparklyr on an edgenode connected to a yarn-client with spark 2.2.0.
The data is sales by locations spanning last 4 years.
The plan is to create a dataframe with all the data and partition the data by locations then call prophet on each location and get prediction for the next 7 days.
Here I tried to pull data for one location and apply prophet but sparklyr was stuck.
library("sparklyr")
library("prophet")
sc <- spark_connect(master = "yarn-client",version = "2.2.0"))
query = "select * from saletable"
df <- sdf_sql(sc,query) %>%
filter(locationid=="1111") %>%
select(date,sales) %>%
sdf_repartition(partitions=1) %>%
select(ds=date,y=sales)
## try to predict sales the next 7 days and get the predictions
sparkly_prophet <- function(df){
m <- prophet::prophet(df)
future <- prophet::make_future_dataframe(m,periods=7,freq='day')
forecast <- predict(m,future)
return (dplyr::select(forecast,yhat) %>% tail(7))
}
Then I run but it gets stuck
spark_apply(df,sparkly_prophet)
When I've used spark_apply(), I've had better success including the function definition within the call to spark_apply(). I'm not sure why this is, but it may be worth a short to restructure your code as
spark_apply(
df,
function(df) {
m <- prophet::prophet(df)
future <- prophet::make_future_dataframe(m, periods = 7, freq = "day")
forecast <- predict(m, future)
yhat <- dplyr::select(forecast, yhat)
return(tail(yhat, 7))
}
)

Measuring accuracy in SparkR

I am building a simple Random forest model on iris data in spark, I was hoping for some method of accuracy measure.
I thought of a simple column matching option too, however this did not work
Code:
library("SparkR")
sc = sparkR.session("local[*]")
iris_data <- as.DataFrame(iris)
train <- sample(iris_data, withReplacement=FALSE, fraction=0.5, seed=42)
test <- except(iris_data, train)
model_rf <- spark.randomForest(train, Species ~., "classification", numTrees = 10)
summary(model_rf)
Problem:
predictions <- predict(model_rf, test)
total_rows <- NROW(test)
predictions$correct <- (test$Species == test$prediction)
accuracy <- correct/total_rows
print(accuracy)
Error:
Error in column(callJMethod(x#sdf, "col", c)) :
P.S:
Using data bricks to run spark, don't mind running locally either
So this is how I did it,
total_rows <- NROW(test)
predictions$result <- ifelse((predictions$Species == predictions$prediction),
"TRUE", "FALSE")
correct <- NROW(predictions[predictions$result == "TRUE",])
accuracy <- correct/total_rows
cat(accuracy, "%")

How i can send data from mnesia via websocket

I have mnesia DB with table artists:
(gw#gw)227> lookup:for_test().
{atomic,["Baltic Baroque","Anna Luca","Karel Boehlee Trio",
"Bill Evans","Dino Saluzzi and Anja Lechner",
"Bill Evans Trio with Stan Getz","Duke Pearson",
"The John Butler Trio"]}
(gw#gw)228>
I want send this list to client by websocket in YAWS, but how i can do it? I broke my mind ... and nothing working. Please help by any info.
Best regards!
You have to convert this list of string to binaries and the pass is across over web socket, by the way what kind of data handling are you doing at the receiving end I mean you want to send this as JSON or just Comma separated values?
Yeeeees ... i solve my problem!
My YAWS Handle
handle_message({text, <<"q2">>}) ->
Var555 = unicode:characters_to_binary(my_json4:handle()),
{reply, {text, <<Var555/binary>>}};
And my mnesia select + convert to JSON. my_json4.erl
Data = [{obj,
[{art_id, Line#tmp.artist_id},
{art, unicode:characters_to_binary(Line#tmp.artist)},
{alb_id, Line#tmp.album_id},
{alb, unicode:characters_to_binary(Line#tmp.album)},
{path, unicode:characters_to_binary(Line#tmp.albumpath)},
{image, unicode:characters_to_binary(Line#tmp.image)},
{tracks, [unicode:characters_to_binary(X) ||X <- Line#tmp.tracks]}]}
|| Line <- Lines],
JsonData = {obj, [{data, Data}]},
rfc4627:encode(JsonData).
handle() ->
QHandle = qlc:q( [ X ||
X <- mnesia:table(artists),
X#artists.artist_id == 2]
),
Records = do(qlc:q([{tmp, X#artists.artist_id, X#artists.artist, A#albums.album_id, A#albums.album, A#albums.albumpath, A#albums.image, A#albums.tracklist} ||
X <- QHandle,
A <- mnesia:table(albums),
A#albums.artist_id == X#artists.artist_id])
),
Json = convert_to_json(Records),
Json.
do(Q) ->
F = fun() ->
qlc:e(Q)
end,
{atomic, Value} = mnesia:transaction(F),
Value.

r: dprint: size of image of table alteration

I am using the dprint package with knitr , mainly so that I can highlight rows from a table, which I have got working, but the output image leaves a fairly large space for a footnote, and it is taking up unnecessary space.
Is there away to get rid of it?
Also since I am fairly new to dprint, if anybody has better ideas/suggestions as to how to highlight tables and make them look pretty without any footnotes... or ways to tidy up my code that would be great!
An example of the Rmd file code is below...
```{r fig.height=10, fig.width=10, dev='jpeg'}
library("dprint")
k <- data.frame(matrix(1:100, 10,10))
CBs <- style(frmt.bdy=frmt(fontfamily="HersheySans"), frmt.tbl=frmt(bty="o", lwd=1),
frmt.col=frmt(fontfamily="HersheySans", bg="khaki", fontface="bold", lwd=2, bty="_"),
frmt.grp=frmt(fontfamily="HersheySans",bg="khaki", fontface="bold"),
frmt.main=frmt(fontfamily="HersheySans", fontface="bold", fontsize=12),
frmt.ftn=frmt(fontfamily="HersheySans"),
justify="right", tbl.buf=0)
x <- dprint(~., data=k,footnote=NA, pg.dim=c(10,10), margins=c(0.2,0.2,0.2,0.2),
style=CBs, row.hl=row.hl(which(k[,1]==5), col='red'),
fit.width=TRUE, fit.height=TRUE,
showmargins=TRUE, newpage=TRUE, main="TABLE TITLE")
```
Thanks in advance!
I haven't used dprint before, but I see a couple of different things that might be causing problems:
The start of your code chunk has defined the image width and height, which dprint seems to be trying to use.
You are setting both fit.height and fit.width. I think only one of those is used (in other words, the resulting image isn't stretched to fit both height and width, but only the one that seems to make most sense, in this case, width).
After tinkering around for a minute, here's what I did that minimizes the footnote. However, I don't know if there is a more efficient way to do this.
```{r dev='jpeg'}
library("dprint")
k <- data.frame(matrix(1:100, 10,10))
CBs <- style(frmt.bdy=frmt(fontfamily="HersheySans"),
frmt.tbl=frmt(bty="o", lwd=1),
frmt.col=frmt(fontfamily="HersheySans", bg="khaki",
fontface="bold", lwd=2, bty="_"),
frmt.grp=frmt(fontfamily="HersheySans",bg="khaki",
fontface="bold"),
frmt.main=frmt(fontfamily="HersheySans", fontface="bold",
fontsize=12),
frmt.ftn=frmt(fontfamily="HersheySans"),
justify="right", tbl.buf=0)
x <- dprint(~., data=k, style=CBs, pg.dim = c(7, 4.5),
showmargins=TRUE, newpage=TRUE,
main="TABLE TITLE", fit.width=TRUE)
```
Update
Playing around to determine the sizes of the images is a total drag. But, if you run the code in R and look at the structure of x, you'll find the following:
str(x)
# List of 3
# $ cord1 : num [1:2] 0.2 6.8
# $ cord2 : Named num [1:2] 3.42 4.78
# ..- attr(*, "names")= chr [1:2] "" ""
# $ pagenum: num 2
Or, simply:
x$cord2
# 3.420247 4.782485
These are the dimensions of your resulting image, and this information can probably easily be plugged into a function to make your plots better.
Good luck!
So here's my solution...with some examples...
I've just copied and pasted my Rmd file to demonstrate how to use it.
you should be able to just copy and paste it into a blank Rmd file and then knit to HTML to see the results...
Ideally what I would have liked would have been to make it all one nice neat function rather than splitting it up into two (i.e. setup.table & print.table) but since chunk options can't be changed mid chunk as suggested by Yihui, it had to be split up into two functions...
`dprint` + `knitr` Examples to create table images
===========
```{r}
library(dprint)
# creating the sytle object to be used
CBs <- style(frmt.bdy=frmt(fontfamily="HersheySans"),
frmt.tbl=frmt(bty="o", lwd=1),
frmt.col=frmt(fontfamily="HersheySans", bg="khaki",
fontface="bold", lwd=2, bty="_"),
frmt.grp=frmt(fontfamily="HersheySans",bg="khaki",
fontface="bold"),
frmt.main=frmt(fontfamily="HersheySans", fontface="bold",
fontsize=12),
frmt.ftn=frmt(fontfamily="HersheySans"),
justify="right", tbl.buf=0)
# creating a setup function to setup printing a table (will probably put this function into my .Rprofile file)
setup.table <- function(df,width=10, style.obj='CBs'){
require(dprint)
table.style <- get(style.obj)
a <- tbl.struct(~., df)
b <- char.dim(a, style=table.style)
p <- pagelayout(dtype = "rgraphics", pg.dim = NULL, margins = NULL)
f <- size.simp(a[[1]], char.dim.obj=b, loc.y=0, pagelayout=p)
# now to work out the natural table width to height ratio (w.2.h.r) GIVEN the style
w.2.h.r <- as.numeric(f$tbl.width/(f$tbl.height +b$linespace.col+ b$linespace.main))
height <- width/w.2.h.r
table.width <- width
table.height <- height
# Setting chunk options to have right fig dimensions for the next chunk
opts_chunk$set('fig.width'=as.numeric(width+0.1))
opts_chunk$set('fig.height'=as.numeric(height+0.1))
# assigning relevant variables to be used when printing
assign("table.width",table.width, envir=.GlobalEnv)
assign("table.height",table.height, envir=.GlobalEnv)
assign("table.style", table.style, envir=.GlobalEnv)
}
# function to print the table (will probably put this function into my .Rprofile file as well)
print.table <- function(df, row.2.hl='2012-04-30', colour='lightblue',...) {
x <-dprint(~., data=df, style=table.style, pg.dim=c(table.width,table.height), ..., newpage=TRUE,fit.width=TRUE, row.hl=row.hl(which(df[,1]==row.2.hl), col=colour))
}
```
```{r}
# Giving it a go!
# Setting up two differnt size tables
small.df <- data.frame(matrix(1:100, 10,10))
big.df <- data.frame(matrix(1:800,40,20))
```
```{r}
# Using the created setup.table function
setup.table(df=small.df, width=10, style.obj='CBs')
```
```{r}
# Using the print.table function
print.table(small.df,4,'lightblue',main='table title string') # highlighting row 4
```
```{r}
setup.table(big.df,13,'CBs') # now setting up a large table
```
```{r}
print.table(big.df,38,'orange', main='the big table!') # highlighting row 38 in orange
```
```{r}
d <- style() # the default style this time will be used
setup.table(big.df,15,'d')
```
```{r}
print.table(big.df, 23, 'indianred1') # this time higlihting row 23
```

idata.frame: Why error "is.data.frame(df) is not TRUE"?

I'm working with a large data frame called exp (file here) in R. In the interests of performance, it was suggested that I check out the idata.frame() function from plyr. But I think I'm using it wrong.
My original call, slow but it works:
df.median<-ddply(exp,
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
With idata.frame, Error: is.data.frame(df) is not TRUE
library(plyr)
df.median<-ddply(idata.frame(exp),
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
So, I thought, perhaps it is my data. So I tried the baseball dataset. The idata.frame example works fine: dlply(idata.frame(baseball), "id", nrow) But if I try something similar to my desired call using baseball, it doesn't work:
bb.median<-ddply(idata.frame(baseball),
.(id,year,team),
numcolwise(median),
na.rm=TRUE)
>Error: is.data.frame(df) is not TRUE
Perhaps my error is in how I'm specifying the groupings? Anyone know how to make my example work?
ETA:
I also tried:
groupVars <- c("groupname","starttime","fPhase","fCycle")
voi<-c('inadist','smldist','lardist')
i<-idata.frame(exp)
ag.median <- aggregate(i[,voi], i[,groupVars], median)
Error in i[, voi] : object of type 'environment' is not subsettable
which uses a faster way of getting the medians, but gives a different error. I don't think I understand how to use idata.frame at all.
Given you are working with 'big' data and looking for perfomance, this seems a perfect fit for data.table.
Specifically the lapply(.SD,FUN) and .SDcols arguments with by
Setup the data.table
library(data.table)
DT <- as.data.table(exp)
iexp <- idata.frame(exp)
Which columns are numeric
numeric_columns <- names(which(unlist(lapply(DT, is.numeric))))
dt.median <- DT[, lapply(.SD, median), by = list(groupname, starttime, fPhase,
fCycle), .SDcols = numeric_columns]
some benchmarking
library(rbenchmark)
benchmark(data.table = DT[, lapply(.SD, median), by = list(groupname, starttime,
fPhase, fCycle), .SDcols = numeric_columns],
plyr = ddply(exp, .(groupname, starttime, fPhase, fCycle), numcolwise(median), na.rm = TRUE),
idataframe = ddply(exp, .(groupname, starttime, fPhase, fCycle), function(x) data.frame(inadist = median(x$inadist),
smldist = median(x$smldist), lardist = median(x$lardist), inadur = median(x$inadur),
smldur = median(x$smldur), lardur = median(x$lardur), emptyct = median(x$emptyct),
entct = median(x$entct), inact = median(x$inact), smlct = median(x$smlct),
larct = median(x$larct), na.rm = TRUE)),
aggregate = aggregate(exp[, numeric_columns],
exp[, c("groupname", "starttime", "fPhase", "fCycle")],
median),
replications = 5)
## test replications elapsed relative user.self
## 4 aggregate 5 5.42 1.789 5.30
## 1 data.table 5 3.03 1.000 3.03
## 3 idataframe 5 11.81 3.898 11.77
## 2 plyr 5 9.47 3.125 9.45
Strange behaviour, but even in the docs it says that idata.frame is experimental. You probably found a bug. Perhaps you could rewrite the check at the top of ddply that tests is.data.frame().
In any case, this cuts about 20% off the time (on my system):
system.time(df.median<-ddply(exp, .(groupname,starttime,fPhase,fCycle), function(x) data.frame(
inadist=median(x$inadist),
smldist=median(x$smldist),
lardist=median(x$lardist),
inadur=median(x$inadur),
smldur=median(x$smldur),
lardur=median(x$lardur),
emptyct=median(x$emptyct),
entct=median(x$entct),
inact=median(x$inact),
smlct=median(x$smlct),
larct=median(x$larct),
na.rm=TRUE))
)
Shane asked you in another post if you could cache the results of your script. I don't really have an idea of your workflow, but it may be best to setup a chron to run this and store the results, daily/hourly whatever.

Resources