GA API filter failure: "Invalid value 'ga:userGender == female' for filters parameter"

I tried to add a filter to extract data only from female users, but I encountered the following error:
Invalid value 'ga:userGender == female' for filters parameter.
This is my code:
query.list <- Init(start.date = "2015-10-09",
                   end.date = "2016-10-09",
                   dimensions = "ga:userAgeBracket",
                   metrics = "ga:users,ga:pageviews,ga:sessions",
                   filter = "ga:userGender == female",
                   max.results = 10000,
                   table.id = "ga:xxxxxxxx")
ga.query <- QueryBuilder(query.list)
ga.data <- GetReportData(ga.query, oauth_token)
I do not understand what I did wrong -- I thought my code was correct.
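For what it's worth, one common cause of this exact error is whitespace inside the filter expression: the Core Reporting API expects the operator to follow the dimension name with no surrounding spaces. A sketch of the same query with only the filter string changed (an untested guess; everything else is as in the question):
query.list <- Init(start.date = "2015-10-09",
                   end.date = "2016-10-09",
                   dimensions = "ga:userAgeBracket",
                   metrics = "ga:users,ga:pageviews,ga:sessions",
                   filter = "ga:userGender==female",  # no spaces around ==
                   max.results = 10000,
                   table.id = "ga:xxxxxxxx")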

applyInPandas() aggregation runs slowly on big delta table

I'm trying to create a gold table notebook in Databricks, but it would take 9 days to fully reprocess the historical data (43 GB, 35k Parquet files). I tried scaling up the cluster, but throughput doesn't go above 5,000 records/second. The bottleneck seems to be the applyInPandas() function. I'm wondering if I could replace pandas with something else to make the gold notebook execute faster.
The silver table has 60 columns (read_id, reader_id, tracker_timestamp, event_type, ebook_id, page_id, agent_ip, agent_device_type, ...). Each row represents a read event of an ebook, e.g. 'page turn', 'click on image', 'click on link', ... All events that occurred in a single session share the same read_id. In the gold table I'm trying to group those events into sessions and count how many times each event occurred within the session, so instead of 100+ rows per read session in the silver table I end up with a single aggregated row in the gold table.
Input is the silver delta table:
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pandas as pd
from pyspark.sql.functions import pandas_udf
input = (spark
    .readStream
    .format("delta")
    .option("withEventTimeOrder", "true")
    .option("maxFilesPerTrigger", 100)
    .load(f"path_to_silver_bucket")
)
I use the withWatermark and session_window functions to make sure I group all of the events from a single read session (a read session automatically ends 30 minutes after the last reader activity).
group = (input
    .withWatermark("tracker_timestamp", "10 minutes")
    .groupBy("read_id", F.session_window(input.tracker_timestamp, "30 minutes")))
In the next step I use the applyInPandas function like so:
sessions = group.applyInPandas(processing_function, schema=processing_function_output_schema)
Definition of the processing_function used in applyInPandas:
def processing_function(df):
    surf_time_ms = df.query('event_type == "surf"')['duration'].sum()
    immerse_time_ms = df.query('event_type == "immersion"')['duration'].sum()
    min_timestamp = df['tracker_timestamp'].min()
    max_timestamp = df['tracker_timestamp'].max()
    shares = len(df.query('event_type == "share"'))
    leads = len(df.query('event_type == "lead_store"'))
    is_read = len(df.query('event_type == "surf"')) > 0
    distinct_pages = df['page_id'].nunique()
    data = {
        "read_id": df['read_id'].values[0],
        "surf_time_ms": surf_time_ms,
        "immerse_time_ms": immerse_time_ms,
        "min_timestamp": min_timestamp,
        "max_timestamp": max_timestamp,
        "shares": shares,
        "leads": leads,
        "is_read": is_read,
        "number_of_events": len(df),
        "distinct_pages": distinct_pages
    }
    for field in not_calculated_string_fields:
        data[field] = df[field].values[0]
    new_df = pd.DataFrame(data=data, index=['read_id'])
    for x in all_events:
        new_df[f"count_{x}"] = df.query(f"type == '{x}'").count()
    for x in duration_events:
        duration = df.query(f"event_type == '{x}'")['duration']
        duration_sum = duration.sum()
        new_df[f"duration_{x}_ms"] = duration_sum
        if duration_sum > 0:
            new_df[f"mean_duration_{x}_ms"] = duration.mean()
        else:
            new_df[f"mean_duration_{x}_ms"] = 0
    return new_df
And finally, I'm writing the calculated row to the gold table like so:
for_partitioning = (sessions
    .withColumn("tenant", F.col("story_tenant"))
    .withColumn("year", F.year(F.col("min_timestamp")))
    .withColumn("month", F.month(F.col("min_timestamp"))))
checkpoint_path = "checkpoint-path"
gold_path = f"gold-bucket"
(for_partitioning
    .writeStream
    .format('delta')
    .partitionBy('year', 'month', 'tenant')
    .option("mergeSchema", "true")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .start(gold_path))
Can anybody think of a more efficient way to do a UDF in PySpark than applyInPandas for the above example? I simply cannot afford to wait 9 days to reprocess 43GB of data...
I've tried playing around with different input and output options (e.g. .option("maxFilesPerTrigger", 100)) but the real problem seems to be applyInPandas.
You could rewrite your processing_function into native Spark if you really wanted. Below, each pandas expression from your function is followed by its native Spark equivalent.
"read_id": df['read_id'].values[0]
F.first('read_id').alias('read_id')
"surf_time_ms": df.query('event_type == "surf"')['duration'].sum()
F.sum(F.when(F.col('event_type') == 'surf', F.col('duration'))).alias('surf_time_ms')
"immerse_time_ms": df.query('event_type == "immersion"')['duration'].sum()
F.sum(F.when(F.col('event_type') == 'immersion', F.col('duration'))).alias('immerse_time_ms')
"min_timestamp": df['tracker_timestamp'].min()
F.min('tracker_timestamp').alias('min_timestamp')
"max_timestamp": df['tracker_timestamp'].max()
F.max('tracker_timestamp').alias('max_timestamp')
"shares": len(df.query('event_type == "share"'))
F.count(F.when(F.col('event_type') == 'share', F.lit(1))).alias('shares')
"leads": len(df.query('event_type == "lead_store"'))
F.count(F.when(F.col('event_type') == 'lead_store', F.lit(1))).alias('leads')
"is_read": len(df.query('event_type == "surf"')) > 0
(F.count(F.when(F.col('event_type') == 'surf', F.lit(1))) > 0).alias('is_read')
"number_of_events": len(df)
F.count(F.lit(1)).alias('number_of_events')
"distinct_pages": df['page_id'].nunique()
F.countDistinct('page_id').alias('distinct_pages')
for field in not_calculated_string_fields:
data[field] = df[field].values[0]
*[F.first(field).alias(field) for field in not_calculated_string_fields]
for x in all_events:
new_df[f"count_{x}"] = df.query(f"type == '{x}'").count()
The above can probably be skipped: as far as my tests go, the new columns get NaN values, because .count() returns a Series object instead of a single value.
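If you do end up needing those per-event counts, the native equivalent would be along these lines (a sketch; note that the pandas loop filters on type, which I assume is meant to be event_type):
*[F.count(F.when(F.col('event_type') == x, F.lit(1))).alias(f"count_{x}") for x in all_events]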
for x in duration_events:
duration = df.query(f"event_type == '{x}'")['duration']
duration_sum = duration.sum()
new_df[f"duration_{x}_ms"] = duration_sum
if duration_sum > 0:
new_df[f"mean_duration_{x}_ms"] = duration.mean()
else:
new_df[f"mean_duration_{x}_ms"] = 0
*[F.sum(F.when(F.col('event_type') == x, F.col('duration'))).alias(f"duration_{x}_ms") for x in duration_events]
*[F.mean(F.when(F.col('event_type') == x, F.col('duration'))).alias(f"mean_duration_{x}_ms") for x in duration_events]
So, instead of
def processing_function(df):
...
...
sessions = group.applyInPandas(processing_function, schema=processing_function_output_schema)
you could use efficient native Spark:
sessions = group.agg(
    F.first('read_id').alias('read_id'),
    F.sum(F.when(F.col('event_type') == 'surf', F.col('duration'))).alias('surf_time_ms'),
    F.sum(F.when(F.col('event_type') == 'immersion', F.col('duration'))).alias('immerse_time_ms'),
    F.min('tracker_timestamp').alias('min_timestamp'),
    F.max('tracker_timestamp').alias('max_timestamp'),
    F.count(F.when(F.col('event_type') == 'share', F.lit(1))).alias('shares'),
    F.count(F.when(F.col('event_type') == 'lead_store', F.lit(1))).alias('leads'),
    (F.count(F.when(F.col('event_type') == 'surf', F.lit(1))) > 0).alias('is_read'),
    F.count(F.lit(1)).alias('number_of_events'),
    F.countDistinct('page_id').alias('distinct_pages'),
    *[F.first(field).alias(field) for field in not_calculated_string_fields],
    # skipped count_{x}
    *[F.sum(F.when(F.col('event_type') == x, F.col('duration'))).alias(f"duration_{x}_ms") for x in duration_events],
    *[F.mean(F.when(F.col('event_type') == x, F.col('duration'))).alias(f"mean_duration_{x}_ms") for x in duration_events],
)
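One extra note in case it helps: because the grouping key includes F.session_window(...), the aggregated output should also carry a session_window struct column, so the session bounds can be read from it directly rather than re-derived from timestamps (a small untested sketch):
sessions = (sessions
    .withColumn("session_start", F.col("session_window.start"))
    .withColumn("session_end", F.col("session_window.end")))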

Fine-tune a pre-trained model

I am new to transformer-based models. I am trying to fine-tune the following model (https://huggingface.co/Chramer/remote-sensing-distilbert-cased) on my dataset. The code:
(model-building code posted as a screenshot)
and I got the following error:
(error message posted as a screenshot)
I would be thankful if anyone could help.
The preprocessing steps I followed:
import numpy as np
import tensorflow as tf
# tokenizer is the tokenizer loaded earlier for the model above (not shown in the question)
input_ids_t = []
attention_masks_t = []
for sent in df_train['text_a']:
    encoded_dict = tokenizer.encode_plus(
        sent,
        add_special_tokens = True,
        max_length = 128,
        pad_to_max_length = True,
        return_attention_mask = True,
        return_tensors = 'tf',
    )
    input_ids_t.append(encoded_dict['input_ids'])
    attention_masks_t.append(encoded_dict['attention_mask'])
# Convert the lists into tensors.
input_ids_t = tf.concat(input_ids_t, axis=0)
attention_masks_t = tf.concat(attention_masks_t, axis=0)
labels_t = np.asarray(df_train['label'])
I did the same for the testing data. Then:
train_data = tf.data.Dataset.from_tensor_slices((input_ids_t, attention_masks_t, labels_t))
and the same for the testing data.
It sounds like you are feeding the transformer_model 1 input instead of 3. Try removing the square brackets around transformer_model([input_ids, input_mask, segment_ids])[0] so that it reads transformer_model(input_ids, input_mask, segment_ids)[0]. That way, the function will have 3 arguments and not just 1.
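A minimal before/after sketch of that change (the variable names come from the screenshot in the question, so treat them as assumptions):
# before: the list is passed as a single positional input
x = transformer_model([input_ids, input_mask, segment_ids])[0]
# after: the three tensors are passed as separate arguments
x = transformer_model(input_ids, input_mask, segment_ids)[0]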

Editing a datatable in Shiny throws an error: Can't subset with `[` using an object of class NULL

I am trying to edit a data table (DT::datatable) on the server side, and I am using a reactive data table which is rendered on the front end. Now I want to edit the data table and retrieve the information from the edited data table. Here is a reproducible example of what I am trying to achieve, as described in the comments in the reprex:
library(shiny)
library(DT)
library(tidyverse)
d <- iris
ui <- fluidPage(
  dataTableOutput("table1"),
  dataTableOutput("table2")
)
server <- function(input, output, session) {
  # This is the main table I would want to display in its full context
  output$table1 <- renderDataTable(
    datatable(d)
  )
  get_row <- reactive({
    d %>% slice(input$table1_rows_selected)
  })
  # Here the table with row selected from table 1 is displayed
  output$table2 <- renderDataTable({
    datatable(get_row(),
              editable = TRUE)
  })
  # Now as the cell gets edited in table 2, I want the edited value to show and make the last column values = NA
  proxy <- dataTableProxy("table2")
  observeEvent(eventExpr = input$table2_cell_edit, handlerExpr = {
    x <- isolate(get_row())
    info = input$table2_cell_edit
    i = info$row
    j = info$column
    v = info$value
    x[i, j] <<- DT::coerceValue(v, x[i, j])
    x <- x %>% mutate(Species = NA)
    replaceData(proxy, x, resetPaging = FALSE)
  })
}
shinyApp(ui = ui, server = server)
I am getting the error Error in <<-: object 'x' not found. I am not sure where I am wrong.
The solution came down to your input$table2_cell_edit names: it uses col, not column. The error created an empty x that couldn't accept your assignment of the new value. I used a simpler assignment of the new value.
In the future, use the print() function to print out variables in your app to figure out what's being passed or not passed downstream. That's how I figured out this error.
library(shiny)
library(DT)
library(tidyverse)
d <- iris
ui <- fluidPage(
  dataTableOutput("table1"),
  dataTableOutput("table2")
)
server <- function(input, output, session) {
  # This is the main table I would want to display in its full context
  output$table1 <- renderDataTable(
    datatable(d)
  )
  get_row <- reactive({
    req(input$table1_rows_selected)
    d %>% slice(input$table1_rows_selected)
  })
  # Here the table with row selected from table 1 is displayed
  output$table2 <- renderDataTable({
    req(get_row)
    datatable(get_row(),
              editable = TRUE)
  })
  # Now as the cell gets edited in table 2, I want the edited value to show and make the last column values = NA
  proxy <- dataTableProxy("table2")
  observeEvent(eventExpr = input$table2_cell_edit, handlerExpr = {
    x <- isolate(get_row())
    info = input$table2_cell_edit
    i = info$row
    ### info uses 'col' not 'column'
    j = info$col
    v = info$value
    ### used a base R subset assignment
    x[i, j] <- v
    x <- x %>% mutate(Species = NA)
    replaceData(proxy, x, resetPaging = FALSE)
  })
}
shinyApp(ui = ui, server = server)
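A possible alternative, assuming a reasonably recent DT version: DT::editData() applies a cell-edit event to a data frame for you, so the manual assignment could be replaced with something like this sketch:
observeEvent(input$table2_cell_edit, {
  # editData() uses the row/col/value in the edit event to update the data frame
  x <- DT::editData(isolate(get_row()), input$table2_cell_edit)
  x <- x %>% mutate(Species = NA)
  replaceData(proxy, x, resetPaging = FALSE)
})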

How to connect leaflet map clicks (events) with plot creation in a shiny app

Hello, I am creating an environmental Shiny app in which I want to use a leaflet map to create some simple plots based on the openair package (https://rpubs.com/NateByers/Openair).
aq_measurements() general form:
AQ <- aq_measurements(country = "country", city = "city", location = "location", parameter = "pollutant choice", date_from = "YYYY-MM-DD", date_to = "YYYY-MM-DD")
All parameters are available in the locations dataframe.
worldmet::importNOAA() general form:
met <- importNOAA(code = "12345-12345", year = YYYY:YYYY)
The NOAA code is available in the locations dataframe.
Below I create a sample of my initial data frame:
location = c("100 ail","16th and Whitmore","40AB01 - ANTWERPEN")
lastUpdated = c("2018-02-01 09:30:00", "2018-02-01 03:00:00", "2017-03-07 10:00:00")
firstUpdated = c("2015-09-01 00:00:00","2016-03-06 19:00:00","2016-11-22 15:00:00")
pm25=c("FALSE","FALSE","FALSE")
pm10=c("TRUE","FALSE","FALSE")
no2=c("TRUE","FALSE","FALSE")
latitude=c(47.932907,41.322470,36.809700)
longitude=c(106.92139000,-95.93799000,-107.65170000)
df = data.frame(location, lastUpdated, firstUpdated,latitude,longitude,pm25,pm10,no2)
As a general idea, I want to be able to click on a certain location in the map based on this dataframe. I have one selectInput() and two dateInput()s. The two dateInput()s should take as inputs df$firstUpdated and df$lastUpdated respectively, and the selectInput() should offer the pollutants that exist in the df based on their "TRUE"/"FALSE" values. Then the plots should be created. All of this should be triggered by clicking on the map.
Up to now I have not been able to achieve this, so to help you understand, I connected the selectInput() and the dateInput()s to input$loc, which is a selectInput() with locations in the first tab; I will not need this once I find the solution.
library(shiny)
library(leaflet)
library(plotly)
library(shinythemes)
library(htmltools)
library(DT)
library(utilr)
library(openair)
library(dplyr)
library(ggplot2)
library(gissr)
library(ropenaq)
library(worldmet)
# Define UI for application that draws a histogram
ui = navbarPage("ROPENAQ",
tabPanel("CREATE DATAFRAME",
sidebarLayout(
# Sidebar panel for inputs ----
sidebarPanel(
wellPanel(
uiOutput("loc"),
helpText("Choose a Location to create the dataframe.")
)
),
mainPanel(
)
)
),
tabPanel("LEAFLET MAP",
leafletOutput("map"),
wellPanel(
uiOutput("dt"),
uiOutput("dt2"),
helpText("Choose a start and end date for the dataframe creation. Select up to 2 dates")
),
"Select your Pollutant",
uiOutput("pollutant"),
helpText("While all pollutants are listed here, not all pollutants are measured at all locations and all times.
Results may not be available; this will be corrected in further revisions of the app. Please refer to the measurement availability
in the 'popup' on the map."),
hr(),
fluidRow(column(8, plotOutput("tim")),
column(4,plotOutput("polv"))),
hr(),
fluidRow(column(4, plotOutput("win")),
column(8,plotOutput("cal"))),
hr(),
fluidRow(column(12, plotOutput("ser"))
)
)
)
#server.r
# load data
# veh_data_full <- readRDS("veh_data_full.RDS")
# veh_data_time_var_type <- readRDS("veh_data_time_var_type.RDS")
df$location <- gsub( " " , "+" , df$location)
server = function(input, output, session) {
output$pollutant<-renderUI({
selectInput("pollutant", label = h4("Choose Pollutant"),
choices = colnames(df[,6:8]),
selected = 1)
})
#Stores the value of the pollutant selection to pass to openAQ request
###################################
#output$OALpollutant <- renderUI({OALpollutant})
##################################
# create the map, using dataframe 'locations' which is polled daily (using ropenaq)
#MOD TO CONSIDER: addd all available measurements to the popup - true/false for each pollutant, and dates of operation.
output$map <- renderLeaflet({
leaflet(subset(df,(df[,input$pollutant]=="TRUE")))%>% addTiles() %>%
addMarkers(lng = subset(df,(df[,input$pollutant]=="TRUE"))$longitude, lat = subset(df,(df[,input$pollutant]=="TRUE"))$latitude,
popup = paste("Location:", subset(df,(df[,input$pollutant]=="TRUE"))$location, "<br>",
"Pollutant:", input$pollutant, "<br>",
"First Update:", subset(df,(df[,input$pollutant]=="TRUE"))$firstUpdated, "<br>",
"Last Update:", subset(df,(df[,input$pollutant]=="TRUE"))$lastUpdated
))
})
#Process Tab
OAL_site <- reactive({
req(input$map_marker_click)
location %>%
filter(latitude == input$map_marker_click$lat,
longitude == input$map_marker_click$lng)
###########
#call Functions for data retrieval and processing. Might be best to put all data request
#functions into a seperate single function. Need to:
# call importNOAA() to retrieve meteorology data into temporary data frame
# call aq_measurements() to retrieve air quality into a temporary data frame
# merge meteorology and air quality datasets into one working dataset for computations; temporary
# meteorology and air quality datasets to be removed.
# call openAir() functions to create plots from merged file. Pass output to a dashboard to assemble
# into appealing output.
# produce output, either as direct download, or as an emailable PDF.
# delete all temporary files and reset for next run.
})
#fun
output$loc<-renderUI({
selectInput("loc", label = h4("Choose location"),
choices = df$location ,selected = 1
)
})
output$dt<-renderUI({
dateInput('date',
label = 'First Available Date',
value = subset(df$firstUpdated,(df[,1]==input$loc))
)
})
output$dt2<-renderUI({
dateInput('date2',
label = 'Last available Date',
value = subset(df$lastUpdated,(df[,1]==input$loc))
)
})
rt<-reactive({
AQ<- aq_measurements(location = input$loc, date_from = input$dt,date_to = input$dt2,parameter = input$pollutant)
met <- importNOAA(year = 2014:2018)
colnames(AQ)[9] <- "date"
merged<-merge(AQ, met, by="date")
# date output -- reports user-selected state & stop dates in UI
merged$location <- gsub( " " , "+" , merged$location)
merged
})
#DT
output$tim = renderPlot({
timeVariation(rt(), pollutant = "value")
})
}
shinyApp(ui = ui, server = server)
The part of my code where I believe input$MAPID_click should be applied is:
output$map <- renderLeaflet({
leaflet(subset(locations,(locations[,input$pollutant]=="TRUE")))%>% addTiles() %>%
addMarkers(lng = subset(locations,(locations[,input$pollutant]=="TRUE"))$longitude, lat = subset(locations,(locations[,input$pollutant]=="TRUE"))$latitude,
popup = paste("Location:", subset(locations,(locations[,input$pollutant]=="TRUE"))$location, "<br>",
"Pollutant:", input$pollutant, "<br>",
"First Update:", subset(locations,(locations[,input$pollutant]=="TRUE"))$firstUpdated, "<br>",
"Last Update:", subset(locations,(locations[,input$pollutant]=="TRUE"))$lastUpdated
))
})
output$dt<-renderUI({
dateInput('date',
label = 'First Available Date',
value = subset(locations$firstUpdated,(locations[,1]==input$loc))
)
})
output$dt2<-renderUI({
dateInput('date2',
label = 'Last available Date',
value = subset(locations$lastUpdated,(locations[,1]==input$loc))
)
})
rt<-reactive({
AQ<- aq_measurements(location = input$loc, date_from = input$dt,date_to = input$dt2)
met <- importNOAA(year = 2014:2018)
colnames(AQ)[9] <- "date"
merged<-merge(AQ, met, by="date")
# date output -- reports user-selected state & stop dates in UI
merged$location <- gsub( " " , "+" , merged$location)
merged
})
#DT
output$tim = renderPlot({
timeVariation(rt(), pollutant = "value")
})
Here is a minimal example. You click on your marker and you get a plot.
ui = fluidPage(
  leafletOutput("map"),
  textOutput("temp"),
  plotOutput('tim')
)
#server.r
#df$location <- gsub( " " , "+" , df$location)
server = function(input, output, session) {
  output$map <- renderLeaflet({
    leaflet(df) %>% addTiles() %>% addMarkers(lng = longitude, lat = latitude)
  })
  output$temp <- renderPrint({
    input$map_marker_click$lng
  })
  output$tim <- renderPlot({
    temp <- df %>% filter(longitude == input$map_marker_click$lng)
    # timeVariation(temp, pollutant = "value")
    print(ggplot(data = temp, aes(longitude, latitude)) + geom_point())
  })
}
shinyApp(ui = ui, server = server)
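If you also want the click to drive the selectInput()/dateInput()s described in the question, one untested sketch (using the df from the question and the input ids pollutant, date and date2 created by the renderUI calls) would be:
observeEvent(input$map_marker_click, {
  click <- input$map_marker_click
  row <- df %>% filter(latitude == click$lat, longitude == click$lng)
  # columns 6:8 of df are pm25, pm10, no2; keep the ones flagged "TRUE" for this location
  available <- names(row)[6:8][row[1, 6:8] == "TRUE"]
  updateSelectInput(session, "pollutant", choices = available)
  updateDateInput(session, "date", value = as.Date(row$firstUpdated[1]))
  updateDateInput(session, "date2", value = as.Date(row$lastUpdated[1]))
})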

Cross validation in R

I have a problem cross validating a dataset in R.
mypredict.rpart <- function(object, newdata) {
  predict(object, newdata, type = "class")
}
res <- errorest(win ~ ., data = df, model = rpart, predict = mypredict.rpart)
I get this error.
Error in predict.rpart(object, newdata, type = "class") :
Invalid prediction for rpart object
My dataset is made up of 16 numerical attributes, and win has two levels, 0 and 1.
You can download the dataset at the link.
If you're doing classification, win should be a factor.
df$win = factor(df$win)
Then your code works for me:
> res
Call:
errorest.data.frame(formula = win ~ ., data = df, model = rpart,
predict = mypredict.rpart)
10-fold cross-validation estimator of misclassification error
Misclassification error: 0.4844
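For completeness, a self-contained sketch of the whole thing (it assumes df has already been read in, and that errorest() comes from the ipred package):
library(ipred)   # errorest()
library(rpart)   # rpart()
# the response must be a factor for rpart to treat this as classification
df$win <- factor(df$win)
mypredict.rpart <- function(object, newdata) {
  predict(object, newdata, type = "class")
}
set.seed(1)  # the cross-validation folds are random
res <- errorest(win ~ ., data = df, model = rpart, predict = mypredict.rpart)
res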
