How to do a fct_drop within a function (using tidy eval)?

Using the diamonds dataset...
I'm trying to create a function that will allow me to plot either cut or color on the x-axis, but first I want to filter the selected column to show only a certain number of levels.
I've got the filter working, but the unused levels are still present and will show up in the chart, so I need to do a fct_drop() on the selected column.
Please see the code below for a reproducible example:
library(tidyverse)

diamonds <- diamonds %>%
  mutate(cut = factor(cut),
         color = factor(color))

reduce_for_plot <- function(data, column, how_many_levels) {
  column2 <- enquo(column)
  of_interest <- unique(data[[deparse(substitute(column))]])[1:how_many_levels]
  data %>%
    filter(!!column2 %in% of_interest)
  # here is where I then do some kind of mutate... to fct_drop the selected column
  # this line seems to work
  # value_to_put_in <- fct_drop(data[[deparse(substitute(column))]])
  # but this line doesn't
  # data <- data %>%
  #   mutate(!!column = value_to_put_in)
}

diamonds %>%
  reduce_for_plot(color, 1)

You were almost there! The problem in your code is that R doesn't allow ! on the LHS of =. So you need to use the fake operator := instead.
reduce_for_plot <- function(data, column, how_many_levels) {
  col_expr <- enquo(column)
  col_name <- rlang::as_name(col_expr)
  of_interest <- unique(data[[col_name]])[1:how_many_levels]
  data <- data %>%
    filter(!!col_expr %in% of_interest)
  # drop the now-unused levels of the filtered column
  value_to_put_in <- fct_drop(data[[col_name]])
  data %>%
    mutate(!!col_name := value_to_put_in)
}
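To see the operator difference in isolation, here is a minimal illustration of my own (not from the original answer), using a string column name and the diamonds setup from the question:

nm <- "cut"
diamonds %>% mutate(!!nm := fct_drop(.data[[nm]]))   # := accepts an unquoted name on the left-hand side
# diamonds %>% mutate(!!nm = fct_drop(.data[[nm]]))  # parse error: = cannot follow !!nm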
In the function above, I have replaced every deparse(substitute(column)) with as_name(enquo(column)). However, you can avoid these entirely by doing the computations in the data context, which I think yields nicer code:
reduce_for_plot <- function(data, column, how_many_levels) {
  column <- enquo(column)
  data %>%
    filter(!!column %in% unique(!!column)[1:how_many_levels]) %>%
    mutate(!!column := fct_drop(!!column))
}
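A quick sanity check of this second version (my own example, assuming the diamonds setup from the question):

reduced <- diamonds %>% reduce_for_plot(color, 2)
levels(reduced$color)   # should now list only the two colour levels that were kept
nrow(reduced)           # only rows with those two colours remain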


Sort Range By Column Automatically

What I want to do: sort a range of data by the active column via shortcut key(s). I do this regularly and want to automate it. However, the size and location of the range changes each time I do this.
My approach:
1. Select a cell in the range of interest that's in the first row and the specific column I want to sort by; call it C4
2. Select down to the last row of the range
3. Select left and/or right to all columns of the range
4. Get the column number / address of C4 (which should be 3, since C = column 3)
5. Sort the range by the column of C4
I've been able to accomplish steps 1 to 3, but I'm struggling with step 4. Below is what I've got. I suspect that if I can get step 4 sorted, then the last line will accomplish step 5.
function sortCol() {
  // KGF - 22,49
  // Purpose: automatically select a range of data and sort by column of active cell
  // Status - wip; haven't determined how to get col # of active cell
  // Last updated: 22-12-4
  var spreadsheet = SpreadsheetApp.getActive();
  var currentCell = spreadsheet.getCurrentCell();
  let cc1 = currentCell;
  spreadsheet.getSelection().getNextDataRange(SpreadsheetApp.Direction.DOWN).activate();
  spreadsheet.getActiveRange().getDataRegion(SpreadsheetApp.Dimension.COLUMNS).activate();
  // how to get col # of active cell?? 22,49
  let cc1_col = cc1.getColumn
  console.log(cc1_col)
  Logger.log(cc1_col)
  //
  cc1.activateAsCurrentCell();
  spreadsheet.getActiveRange().sort({column: cc1_col, ascending: true});
}

filter on named vector in dplyr (R)

I'm trying to find a quick way to convert a binary numerical variable into a factor using dplyr.
I have a dataset with this structure:
library(dplyr)
f <- as_tibble(data.frame(col1 = c(1, 1, 0),
                          col2 = c("ham", "spam", "spam"),
                          col3 = c(1, 2, 8),
                          col4 = c(1, 0, 0)))
For now, I have tried using n_distinct
g <- f %>% select_if(is.numeric) %>% sapply(n_distinct)
But I don't know how to proceed by filtering out only those columns with n_distinct == 2. To be clear, my final output should be:
names(g[g==2])
[1] "col1" "col4"
Any idea?
Thank you
How about using select_if with a function that checks whether the column is numeric and whether the number of distinct values is exactly 2? Try:
f %>%
  select_if(~ n_distinct(.) == 2 & is.numeric(.)) %>%
  names()
Which gives you:
[1] "col1" "col4"

list of values for selectInput

I want to create a "selectInput" widget for which the choices of values are the names of columns in a dataset imported by a "fileInput" widget.
I tried to "tableOutput" the names of the dataset, as an argument to the "choices" parameters of the "selectInput" widget, but it doesn't work. The only choices I get in the widget are "name", "id" and "class".
Here's the code I used:
library(shiny)

ui <- fluidPage(
  # Widget for loading data set
  fileInput("file", label = h4("Input csv data set")),
  # Widget for selecting the variable among names of columns of the data set
  selectInput("select.variable", label = h4("Select variable from data set"),
              choices = tableOutput("list.var"), selected = 1) # This approach doesn't work
)

server <- function(input, output) {
  # The goal was to get the list of names of columns to use it as "choices"
  # in the "selectInput" widget
  output$list.var <- renderTable({
    inFile <- input$file
    if (is.null(inFile)) # To avoid error messages when the file is not yet loaded
      return(NULL)
    # After the file is loaded
    data.fr <- read.csv(inFile$datapath)
    list.var <- names(data.fr[1,]) # Get the names of the columns of the dataset
  })
}

shinyApp(ui = ui, server = server)
Is there a way to use the names of columns of an imported dataset as choices for a "selectInput" widget?
Something like this should do the trick. I used renderUI to build the select input widget from your dataset.
library(shiny)

ui <- fluidPage(
  # Widget for loading data set
  fileInput("file", label = h4("Input csv data set")),
  uiOutput("myslider")
)

server <- function(input, output) {
  # The goal is to get the list of column names to use as "choices"
  # in the "selectInput" widget
  output$myslider <- renderUI({
    # Widget for selecting the variable among the column names of the data set
    selectInput("select.variable", label = h4("Select variable from data set"),
                choices = mydata(), selected = 1) # mydata() already returns the column names
  })
  mydata <- reactive({
    inFile <- input$file
    if (is.null(inFile)) # To avoid error messages when the file is not yet loaded
      return(NULL)
    # After the file is loaded
    data.fr <- read.csv(inFile$datapath)
    names(data.fr[1,]) # Get the names of the columns of the dataset
  })
  output$list.var <- renderTable({
    mydata()
  })
}

shinyApp(ui = ui, server = server)
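A small side note: in more recent Shiny code the explicit is.null() guard is often written with req(), which quietly stops the reactive until a file has been uploaded. A minimal sketch of the mydata reactive with that change (my adaptation, not part of the original answer):

mydata <- reactive({
  req(input$file)                        # wait until a file has been uploaded
  data.fr <- read.csv(input$file$datapath)
  names(data.fr)                         # column names to offer as choices
})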

How can I improve the performance when completing a table with statistical methods in Apache-Spark?

I have a dataset with 10 fields and 5000 rows. I want to complete this dataset with some statistical methods in Spark with Scala. I fill the empty cells in a field with the mean value of that field if it contains continuous values, and with the most frequent value if it contains discrete values. Here is my code:
for (col <- cols) {
  val datacount = table.select(col).rdd.map(r => r(0)).filter(_ == null).count()
  if (datacount > 0) {
    if (continuous_lst contains col) { // put mean of data to null values
      var avg = table.select(mean(col)).first()(0).asInstanceOf[Double]
      df = df.na.fill(avg, Seq(col))
    }
    else if (discrete_lst contains col) { // put most frequent categorical value to null values
      val group_df = df.groupBy(col).count()
      val sorted = group_df.orderBy(desc("count")).take(1)
      val most_frequent = sorted.map(t => t(0))
      val most_frequent_ = most_frequent(0).toString.toDouble.toInt
      val type__ = ctype.filter(t => t._1 == col)
      val type_ = type__.map(t => t._2)
      df = df.na.fill(most_frequent_, Seq(col))
    }
  }
}
The problem is that this code runs very slowly on this data. I use spark-submit with an executor memory of 8 GB, and I call repartition(4) before passing the data to this function.
I will have to work with bigger datasets, so how can I speed up this code?
Thanks for your help.
Here is a suggestion:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Column, DataFrame, Row}

// most frequent value of a column, computed with map/reduce on the underlying RDD
def most_frequent(df: DataFrame, col: Column) = {
  df.select(col).rdd.map { case Row(colVal) => (colVal, 1) }
    .reduceByKey(_ + _)
    .reduce { case ((val1, cnt1), (val2, cnt2)) => if (cnt1 > cnt2) (val1, cnt1) else (val2, cnt2) }._1
}

val new_continuous_cols = continuous_lst.map {
  col => coalesce(col, mean(col)).as(col.toString)
}.toArray

val new_discrete_cols = discrete_lst.map {
  col => coalesce(col, lit(most_frequent(table, col))).as(col.toString)
}.toArray

val all_new_cols = new_continuous_cols ++ new_discrete_cols

val newDF = table.select(all_new_cols: _*)
Considerations:
- I assumed that continuous_lst and discrete_lst are lists of Column. If they are lists of String, the idea is the same, but some adjustments are necessary.
- Note that I used map and reduce to calculate the most frequent value of a column. That can be better than grouping and aggregating in some cases. (Maybe there is room for improvement here, by calculating the most frequent values for all discrete columns at once?)
- Additionally, I used coalesce to replace the null values instead of fill, which may result in some improvement as well. (More info about the coalesce function in the Scaladoc API.)
- I cannot test at the moment, so there may be something missing that I didn't see.

How to make the filter function on dates in sparkR

u is a DataFrame containing ID = 1, 2, 3, ... and time = "2010-01-01", "2012-04-06", ...
ID and time both have type string, so I convert the type of time to Date:
u$time <- cast(u[[2]], "Date")
I now want the first time in u.
first <- first(u$time)
I now make a new time by adding 150 days to the first time:
cluster <- first + 150
I now want to make a subset: a new version of u containing only the times from the first 150 days.
ucluster <- filter(u, u$time < cluster)
but this doesn't run in SparkR; I get the message "returnstatus==0 is not TRUE".
The problem with your approach is that first(u$time) gives you a Column rather than an actual date, so cluster is also a Column and the comparison fails. If you take the first row and store its time in first, everything works fine:
df <- data.frame(ID=c(1,2,3,4),time=c("2010-01-01", "2012-04-06", "2010-04-12", "2012-04-09"))
u <- createDataFrame(sqlContext,df)
u$time <- cast(u[[2]], "Date")
first <- take(u,1)$time
cluster <- first + 150
ucluster <- filter(u, u$time < cluster)
collect(ucluster)
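For reference, the cutoff implied by the toy data above can be checked with plain R date arithmetic (my own check, not SparkR output):

as.Date("2010-01-01") + 150
# [1] "2010-05-31"
# so only the rows with time "2010-01-01" and "2010-04-12" (IDs 1 and 3)
# should remain in ucluster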
