I am new to transformer-based models. I am trying to fine-tune the following model (https://huggingface.co/Chramer/remote-sensing-distilbert-cased) on my dataset. The model-building code (posted as a screenshot) builds a Keras model around the call transformer_model([input_ids, input_mask, segment_ids])[0], and running it produced the error shown in a second screenshot.
I would be thankful if anyone could help.
The preprocessing steps I followed:
import numpy as np
import tensorflow as tf

input_ids_t = []
attention_masks_t = []
for sent in df_train['text_a']:
    encoded_dict = tokenizer.encode_plus(
        sent,
        add_special_tokens = True,     # add [CLS] and [SEP]
        max_length = 128,
        pad_to_max_length = True,      # pad/truncate to max_length
        return_attention_mask = True,
        return_tensors = 'tf',
    )
    input_ids_t.append(encoded_dict['input_ids'])
    attention_masks_t.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids_t = tf.concat(input_ids_t, axis=0)
attention_masks_t = tf.concat(attention_masks_t, axis=0)
labels_t = np.asarray(df_train['label'])
and I did the same for the testing data. Then:
train_data = tf.data.Dataset.from_tensor_slices((input_ids_t, attention_masks_t, labels_t))
and the same for the testing data.
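A note for anyone following along: if this three-tuple dataset is later passed straight to Keras' model.fit, it is easy to trip over, since fit expects (features, labels) pairs. A minimal sketch of the usual packing, not taken from the original post (the batch size is a placeholder):

train_data = tf.data.Dataset.from_tensor_slices((
    {'input_ids': input_ids_t, 'attention_mask': attention_masks_t},  # features as a dict
    labels_t,                                                         # labels
)).batch(16)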
It sounds like you are feeding transformer_model one input instead of three. Try removing the square brackets around transformer_model([input_ids, input_mask, segment_ids])[0] so that it reads transformer_model(input_ids, input_mask, segment_ids)[0]. That way the call receives three arguments rather than a single list.
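As a minimal sketch of the suggested change (the layer definitions below are assumptions, since the original model code was only posted as a screenshot; note also that DistilBERT does not use segment_ids, so only two inputs are shown):

import tensorflow as tf
from transformers import TFAutoModel

# Hypothetical reconstruction of the screenshotted model-building step
# (may need from_pt=True if the checkpoint only ships PyTorch weights).
transformer_model = TFAutoModel.from_pretrained(
    "Chramer/remote-sensing-distilbert-cased")

input_ids = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="input_ids")
input_mask = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="attention_mask")

# Before (a single list argument):
#   embedding = transformer_model([input_ids, input_mask])[0]
# After (separate arguments):
embedding = transformer_model(input_ids, input_mask)[0]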
I trained a machine translation model using the huggingface library:
import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
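Note that compute_metrics assumes a metric object is already in scope; judging by the "score" key it reads, this is presumably sacrebleu, loaded along these lines:

from datasets import load_metric
metric = load_metric('sacrebleu')  # assumption: the notebook's metric; sacrebleu reports BLEU under "score"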
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()

model_dir = './models/'
trainer.save_model(model_dir)
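For context, args is a Seq2SeqTrainingArguments instance; a minimal sketch with placeholder values, not the notebook's exact settings:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir='./checkpoints',      # hypothetical path
    evaluation_strategy='epoch',
    predict_with_generate=True,      # hand compute_metrics generated token ids rather than logits
    per_device_train_batch_size=16,  # placeholder
)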
The training code above is taken from this Google Colab notebook. After training, I can see that the trained model is saved to the models folder and the metric is calculated. Now I want to load the trained model and run prediction on a new dataset; here is what I tried:
dataset = load_dataset('csv', data_files='data/training_data.csv')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Tokenize the test dataset
tokenized_datasets = train_test.map(preprocess_function_v2, batched=True)
test_dataset = tokenized_datasets['test']
model = AutoModelForSeq2SeqLM.from_pretrained('models')
model(test_dataset)
It threw the following error:
*** AttributeError: 'Dataset' object has no attribute 'size'
I tried the evaluate() function as well, but it said:
*** torch.nn.modules.module.ModuleAttributeError: 'MarianMTModel' object has no attribute 'evaluate'
And calling eval() only prints the configuration of the model.
What is the proper way to evaluate the performance of the trained model on a new dataset?
It turned out that predictions can be produced using the following code:
inputs = tokenizer(
    questions,
    max_length=max_input_length,
    truncation=True,
    return_tensors='pt',
    padding=True,
).to('cuda')

translation = model.generate(**inputs)
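To turn the generated ids back into text (and, if desired, score it with the same metric used during training), decode them with the tokenizer:

decoded = tokenizer.batch_decode(translation, skip_special_tokens=True)
print(decoded[:3])  # first few translated sentences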
I have a swarmplot:
sns.swarmplot(y="age gap corr", x="cluster",
              data=scatter_data, hue='group', dodge=True)
and I would like to adjust the transparency of the dots:
sns.swarmplot(y="age gap corr", x="cluster",
              data=scatter_data, hue='group', dodge=True,
              scatter_kws={'alpha': 0.1})
sns.swarmplot(y="age gap corr", x="cluster",
              data=scatter_data, hue='group', dodge=True,
              plot_kws={'scatter_kws': {'alpha': 0.1}})
but neither of the above methods works.
Any help is appreciated.
You can simply pass the alpha argument directly to the swarmplot function:
import seaborn as sns
df = sns.load_dataset('diamonds').sample(1000)
sns.swarmplot(data=df, x='cut', y='carat', hue='color', alpha=0.5)
The documentation for swarmplot states
kwargs : key, value mappings
Other keyword arguments are passed through to matplotlib.axes.Axes.scatter().
Thus, you don't need to use scatter_kws={...}.
Compare this to, e.g., sns.lmplot, which states
{scatter,line}_kws : dictionaries
Additional keyword arguments to pass to plt.scatter and plt.plot.
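For example, a minimal sketch of the lmplot case for contrast, where the alpha has to travel through scatter_kws (using the same diamonds sample as above):

import seaborn as sns
df = sns.load_dataset('diamonds').sample(1000)
sns.lmplot(data=df, x='carat', y='price', scatter_kws={'alpha': 0.5})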
Hello, I am creating an environmental Shiny app in which I want to use a leaflet map to create some simple plots based on the openair package (https://rpubs.com/NateByers/Openair).
aq_measurements() general form:
AQ <- aq_measurements(country = "country", city = "city", location = "location", parameter = "pollutant choice", date_from = "YYYY-MM-DD", date_to = "YYYY-MM-DD")
All parameters are available in the locations dataframe.
worldmet::importNOAA() general form:
met <- importNOAA(code = "12345-12345", year = YYYY:YYYY)
The NOAA code is available in the locations dataframe.
Below I create a sample of my initial data frame:
location = c("100 ail", "16th and Whitmore", "40AB01 - ANTWERPEN")
lastUpdated = c("2018-02-01 09:30:00", "2018-02-01 03:00:00", "2017-03-07 10:00:00")
firstUpdated = c("2015-09-01 00:00:00", "2016-03-06 19:00:00", "2016-11-22 15:00:00")
pm25 = c("FALSE", "FALSE", "FALSE")
pm10 = c("TRUE", "FALSE", "FALSE")
no2 = c("TRUE", "FALSE", "FALSE")
latitude = c(47.932907, 41.322470, 36.809700)
longitude = c(106.92139000, -95.93799000, -107.65170000)
df = data.frame(location, lastUpdated, firstUpdated, latitude, longitude, pm25, pm10, no2)
The general idea: I want to be able to click on a location on the map built from this dataframe. There is one selectInput() and two dateInput()s. The two dateInput()s should take df$firstUpdated and df$lastUpdated respectively as their values, the selectInput() should offer only the pollutants that exist in the df (based on the "TRUE"/"FALSE" values), and then the plots should be created. All of this should be triggered by clicking on the map.
So far I have not been able to achieve this, so to help you understand, I have connected the selectInput() and the dateInput()s to input$loc, which is a selectInput() of locations in the first tab; I will not need it once I find the solution.
library(shiny)
library(leaflet)
library(plotly)
library(shinythemes)
library(htmltools)
library(DT)
library(utilr)
library(openair)
library(dplyr)
library(ggplot2)
library(gissr)
library(ropenaq)
library(worldmet)
# Define UI for the application
ui = navbarPage("ROPENAQ",
  tabPanel("CREATE DATAFRAME",
    sidebarLayout(
      # Sidebar panel for inputs ----
      sidebarPanel(
        wellPanel(
          uiOutput("loc"),
          helpText("Choose a Location to create the dataframe.")
        )
      ),
      mainPanel()
    )
  ),
  tabPanel("LEAFLET MAP",
    leafletOutput("map"),
    wellPanel(
      uiOutput("dt"),
      uiOutput("dt2"),
      helpText("Choose a start and end date for the dataframe creation. Select up to 2 dates.")
    ),
    "Select your Pollutant",
    uiOutput("pollutant"),
    helpText("While all pollutants are listed here, not all pollutants are measured at all locations
              and all times. Results may not be available; this will be corrected in further revisions
              of the app. Please refer to the measurement availability in the 'popup' on the map."),
    hr(),
    fluidRow(column(8, plotOutput("tim")),
             column(4, plotOutput("polv"))),
    hr(),
    fluidRow(column(4, plotOutput("win")),
             column(8, plotOutput("cal"))),
    hr(),
    fluidRow(column(12, plotOutput("ser")))
  )
)
# server.r
# load data
# veh_data_full <- readRDS("veh_data_full.RDS")
# veh_data_time_var_type <- readRDS("veh_data_time_var_type.RDS")
df$location <- gsub(" ", "+", df$location)

server = function(input, output, session) {

  output$pollutant <- renderUI({
    selectInput("pollutant", label = h4("Choose Pollutant"),
                choices = colnames(df[, 6:8]),
                selected = 1)
  })

  # Stores the value of the pollutant selection to pass to the openAQ request
  # output$OALpollutant <- renderUI({OALpollutant})

  # Create the map, using dataframe 'locations' which is polled daily (using ropenaq).
  # MOD TO CONSIDER: add all available measurements to the popup - true/false for each
  # pollutant, and dates of operation.
  output$map <- renderLeaflet({
    req(input$pollutant)  # wait until the pollutant selectInput has been rendered
    df_sub <- subset(df, df[, input$pollutant] == "TRUE")
    leaflet(df_sub) %>% addTiles() %>%
      addMarkers(lng = df_sub$longitude, lat = df_sub$latitude,
                 popup = paste("Location:", df_sub$location, "<br>",
                               "Pollutant:", input$pollutant, "<br>",
                               "First Update:", df_sub$firstUpdated, "<br>",
                               "Last Update:", df_sub$lastUpdated))
  })

  # Process tab
  OAL_site <- reactive({
    req(input$map_marker_click)
    location %>%
      filter(latitude == input$map_marker_click$lat,
             longitude == input$map_marker_click$lng)
    # Call functions for data retrieval and processing. Might be best to put all data
    # request functions into a separate single function. Need to:
    # - call importNOAA() to retrieve meteorology data into a temporary data frame
    # - call aq_measurements() to retrieve air quality into a temporary data frame
    # - merge meteorology and air quality datasets into one working dataset for
    #   computations; temporary meteorology and air quality datasets to be removed.
    # - call openair functions to create plots from the merged file. Pass output to a
    #   dashboard to assemble into appealing output.
    # - produce output, either as a direct download or as an emailable PDF.
    # - delete all temporary files and reset for the next run.
  })

  output$loc <- renderUI({
    selectInput("loc", label = h4("Choose location"),
                choices = df$location, selected = 1)
  })

  output$dt <- renderUI({
    dateInput('date',
              label = 'First Available Date',
              value = subset(df$firstUpdated, df[, 1] == input$loc))
  })

  output$dt2 <- renderUI({
    dateInput('date2',
              label = 'Last Available Date',
              value = subset(df$lastUpdated, df[, 1] == input$loc))
  })

  rt <- reactive({
    # Note: the dateInput ids are 'date' and 'date2', so their values arrive as
    # input$date and input$date2 (not input$dt/input$dt2, which are the uiOutput ids).
    AQ <- aq_measurements(location = input$loc, date_from = input$date,
                          date_to = input$date2, parameter = input$pollutant)
    met <- importNOAA(year = 2014:2018)
    colnames(AQ)[9] <- "date"
    merged <- merge(AQ, met, by = "date")
    # date output -- reports user-selected start & stop dates in UI
    merged$location <- gsub(" ", "+", merged$location)
    merged
  })

  output$tim <- renderPlot({
    timeVariation(rt(), pollutant = "value")
  })
}

shinyApp(ui = ui, server = server)
The part of my code where I believe input$MAPID_click should be applied is:
output$map <- renderLeaflet({
  req(input$pollutant)
  loc_sub <- subset(locations, locations[, input$pollutant] == "TRUE")
  leaflet(loc_sub) %>% addTiles() %>%
    addMarkers(lng = loc_sub$longitude, lat = loc_sub$latitude,
               popup = paste("Location:", loc_sub$location, "<br>",
                             "Pollutant:", input$pollutant, "<br>",
                             "First Update:", loc_sub$firstUpdated, "<br>",
                             "Last Update:", loc_sub$lastUpdated))
})

output$dt <- renderUI({
  dateInput('date',
            label = 'First Available Date',
            value = subset(locations$firstUpdated, locations[, 1] == input$loc))
})

output$dt2 <- renderUI({
  dateInput('date2',
            label = 'Last Available Date',
            value = subset(locations$lastUpdated, locations[, 1] == input$loc))
})

rt <- reactive({
  AQ <- aq_measurements(location = input$loc, date_from = input$date, date_to = input$date2)
  met <- importNOAA(year = 2014:2018)
  colnames(AQ)[9] <- "date"
  merged <- merge(AQ, met, by = "date")
  # date output -- reports user-selected start & stop dates in UI
  merged$location <- gsub(" ", "+", merged$location)
  merged
})

output$tim <- renderPlot({
  timeVariation(rt(), pollutant = "value")
})
Here is a minimal example. You click on your marker and you get a plot.
ui = fluidPage(
  leafletOutput("map"),
  textOutput("temp"),
  plotOutput('tim')
)

# server.r
# df$location <- gsub(" ", "+", df$location)
server = function(input, output, session) {
  output$map <- renderLeaflet({
    # ~longitude and ~latitude pull the coordinates out of df
    leaflet(df) %>% addTiles() %>% addMarkers(lng = ~longitude, lat = ~latitude)
  })

  output$temp <- renderPrint({
    input$map_marker_click$lng
  })

  output$tim <- renderPlot({
    temp <- df %>% filter(longitude == input$map_marker_click$lng)
    # timeVariation(temp, pollutant = "value")
    print(ggplot(data = temp, aes(longitude, latitude)) + geom_point())
  })
}

shinyApp(ui = ui, server = server)
I have a SlickGrid with 25,000+ rows. I have set up column filtering (see example), which works fine and is very fast.
I have now added the CheckboxSelectColumn plugin (see example), and while this works, it has crippled the speed of the filtering. Everything still functions, just much more slowly.
I have tried optimising the filtering by supplying RefreshHints (see example), but no joy.
Is it just the combination of filtering plus checkboxes plus a large row count, or am I doing something wrong?
Here are the relevant bits of code (CoffeeScript).
Set Up the Column Filters
setupColumnFilters: () ->
  $(grid.getHeaderRow()).delegate(':input', 'change keyup', (e) ->
    columnId = $(this).data('columnId')
    if columnId?
      newVal = $.trim($(this).val())
      # Trying to optimise using RefreshHints. The previous filter length must be
      # read *before* columnFilters is overwritten, otherwise isNarrowing and
      # isExpanding are always false.
      oldLen = columnFilters[columnId]?.length ? 0
      columnFilters[columnId] = newVal
      newLen = newVal?.length
      isNarrowing = newLen > oldLen
      isExpanding = newLen < oldLen
      renderedRange = grid.getRenderedRange()
      dataView.setRefreshHints({
        ignoreDiffsBefore: renderedRange.top,
        ignoreDiffsAfter: renderedRange.bottom + 1,
        isFilterNarrowing: isNarrowing,
        isFilterExpanding: isExpanding
      })
      dataView.refresh()
  )

  grid.onHeaderRowCellRendered.subscribe((e, args) ->
    node = $(args.node)
    node.empty()
    id = args.column.id
    if id == '_checkbox_selector'
      node.hide()
      return
    placeholder = 'filter by ' + id
    html = '<input type="text" placeholder="' + placeholder + '">'
    $(html)
      .data('columnId', id)
      .val(columnFilters[id])
      .appendTo(node)
      .focus(() -> $(this).attr('placeholder', ''))
      .blur(() -> $(this).attr('placeholder', placeholder) if $(this).val()?)
  )
Set Up the CheckboxSelectColumn Plugin
setupCheckboxSelect: () ->
  checkboxPlugin = new Slick.CheckboxSelectColumn({ cssClass: "slick-cell-checkboxsel" })
  columns.unshift(checkboxPlugin.getColumnDefinition())
  grid.setColumns(columns)
  grid.registerPlugin(checkboxPlugin)
The Filter Function
filter: (item) =>
  grid.setSelectedRows([])
  columns = grid.getColumns()
  for columnId, filter of columnFilters
    if filter?
      column = columns[grid.getColumnIndex(columnId)]
      field = item[column.field]
      return false unless (field? && field.toLowerCase().indexOf(filter.toLowerCase()) > -1)
  return true
Whoa, why are you calling grid.setSelectedRows([]) in your filter?!?
It gets called 25,000+ times every time the data is refreshed.
Besides being completely pointless there, it slows things down even more when you use the checkbox select column, since that plugin has to synchronize its state with the selection on every call. If you need to clear the selection, do it once before dataView.refresh(), not inside the per-row filter.