I need to group existing error codes into categories (for example, codes 101 to 118 belong to one service, 201 to 213 to another) and to count the number of error codes per category.
I have used the case() function, and inside it I used range() for the error codes. It works well, except that for some reason it puts error code 004 in the "Other" case. Why is this happening?
| extend codeRange = case(Code in (range(001, 004, 1)), "GeneralMessages",
Code in (range(101, 118, 1)), "TransactionProcessing",
Code in (range(201, 213, 1)), "RulesExecution",
Code in (range(301, 335, 1)), "MerchantRefData",
Code in (range(401, 403, 1)), "BinProcessing",
Code in (range(501, 505, 1)), "ExchangeRateProcessing",
Code in (range(601, 603, 1)), "DecisionRouting",
Code in (range(701, 709, 1)), "TransactionRegistry",
Code in (range(801, 805, 1)), "ClientScore",
Code in (range(901, 903, 1)), "PayEngineConfig",
Code in (range(1001, 1003, 1)), "SecureService",
Code in (range(1101, 1108, 1)), "ProxyAPI",
"Other")
| project ErrorFrom, Message, Code, operation_Id, codeRange
Screenshot of the result
The expected result is that the codeRange for Code 004 will be set to GeneralMessages, not Other.
Are you comparing it to the string "004"? The range() function creates an array of numbers with the values from 1 to 4 (including 4), so if you compare them to the string "004" it will not find a match.
Try to cast the error code to int:
print range(001, 004, 1)
| mv-expand print_0
| where print_0 == toint("004")
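Applied to your query, that means casting Code once and reusing it in each branch. A minimal sketch (assuming Code is a string column; the final summarize just matches your stated goal of counting error codes per category):
| extend CodeInt = toint(Code)
| extend codeRange = case(CodeInt in (range(1, 4, 1)), "GeneralMessages",
CodeInt in (range(101, 118, 1)), "TransactionProcessing",
// ...the remaining branches follow the same pattern as the original query...
"Other")
| summarize ErrorCount = count() by codeRange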
Every example I've looked at so far seems to use a shared vocabulary between source and target languages, and I'm wondering if that is a hard-coded constraint of the Huggingface models, or my misunderstanding, or I've just not looked in the right place yet?
To take a random example, when I look at the files here, https://huggingface.co/Helsinki-NLP/opus-mt-en-zls/tree/main, I see separate "spm" (SentencePiece model) files for the source and target languages, and they are of different sizes (792 KB vs. 850 KB). But there is only a single "vocab.json" file, and the config.json file only mentions a single "vocab_size": 57680.
I've also been experimenting, e.g. tokenizer(inputs, text_target=inputs, return_tensors="pt"). If source and target used different vocabularies, I would expect the returned input_ids and labels to use different numbers. But for every model I've tried so far the numbers have been identical (no, my mistake - see the update below).
Can a Huggingface tokenizer even support two vocabularies? If not, then a model would need two tokenizers, which seems to clash with the way AutoTokenizer works.
UPDATE
Here is a test script to show that the above model is actually using two spm vocabs with AutoTokenizer.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = 'Helsinki-NLP/opus-mt-en-zls'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
inputs = ['Filter all items from same host']
targets = ['Filtriraj sve stavke s istog hosta']
x=tokenizer(inputs, text_target=targets, return_tensors="pt")
print(x)
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))
print("\nGiving inputs on both sides")
x=tokenizer(inputs, text_target=inputs, return_tensors="pt")
print(x) ## Expecting to see different numbers if they use different vocabs
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))
print("\nGiving targets on both sides")
x=tokenizer(targets, text_target=targets, return_tensors="pt") ## Expecting to see different numbers if they use different vocabs
print(x)
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))
print(model)
The output is:
{'input_ids': tensor([[10373, 90, 8255, 98, 605, 6276, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 1392, 7636, 386, 35861, 95, 2130, 218, 6276, 27,
0]])}
▁Filter all▁items from same host</s>
Filtriraj sve stavke s istog hosta</s>
Giving inputs on both sides
{'input_ids': tensor([[10373, 90, 8255, 98, 605, 6276, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 911, 90, 3188, 7, 98, 605, 6276, 0]])}
▁Filter all▁items from same host</s>
Filter all items from same host</s>
Giving targets on both sides
{'input_ids': tensor([[11638, 1392, 7636, 95, 120, 914, 465, 478, 95, 29,
25, 897, 6276, 27, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 1392, 7636, 386, 35861, 95, 2130, 218, 6276, 27,
0]])}
Filtriraj sve stavke s istog hosta</s>
Filtriraj sve stavke s istog hosta</s>
When I choose identical strings in English or Croatian it gives slightly different numbers, showing that different tokenizers are involved. You can then see that the different ids sometimes map back to an identical string, sometimes not.
But when I print out the model we see it actually uses a shared vocabulary, which makes the two spm models seem a bit pointless.
(encoder): MarianEncoder(
(embed_tokens): Embedding(57680, 512, padding_idx=57679)
...
(decoder): MarianDecoder(
(embed_tokens): Embedding(57680, 512, padding_idx=57679)
...
(lm_head): Linear(in_features=512, out_features=57680, bias=False)
I haven't got as far as finding out whether a non-shared vocabulary is possible, but I have yet to see evidence of one.
For Marian-based models, HuggingFace now supports separate vocabularies for source and target, but some models may not use them, especially older models.
(As you know, OPUS-MT models are based on MarianMT. The MarianMT framework supports it.)
Before https://github.com/huggingface/transformers/pull/15831, HuggingFace used a shared vocabulary file for Marian.
This PR updates the Marian model:
To allow not sharing embeddings between encoder and decoder.
Allow tying only decoder embeddings with lm_head.
Separate two vocabs in tokenizer for src and tgt language
...
share_encoder_decoder_embeddings: to indicate if emb should be shared or not
So models trained with earlier versions of the framework, or with that parameter set to false, have only one shared vocabulary file for source and target.
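To check which regime a particular checkpoint was exported with, a small sketch along these lines can help (assuming a transformers version that includes the PR; the share_encoder_decoder_embeddings config attribute and the tokenizer's separate_vocabs flag both come from it, so older checkpoints may simply not define them):
from transformers import AutoConfig, AutoTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-zls'

# Fall back gracefully: checkpoints converted before the PR may not set this.
config = AutoConfig.from_pretrained(model_name)
print(getattr(config, 'share_encoder_decoder_embeddings', 'not set (pre-PR checkpoint)'))

# MarianTokenizer loads both SentencePiece models either way; separate_vocabs
# indicates whether source and target ids come from separate vocab files.
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(getattr(tokenizer, 'separate_vocabs', 'not set (pre-PR tokenizer)'))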
The primary variable is AgeGroup, which has 2 levels. I am trying to get the sample size to output, but for some reason the app either gives an error or won't output anything. Can anyone help? There are some comments in the code to help avoid confusion.
Code:
library(shiny)
library(shinyWidgets)
library(survival)
library(shinyjs)
library(survminer)
# Define UI for the application
ui <- fluidPage(
# Application title
titlePanel("ProMBA Haslam Ad Sample Size"),
#Put in all key inputs as numeric inputs that the user will type in, and choose starting default values for these inputs
tabPanel("Inputs",
div( id ="form",
column(4,
numericInput("power", label=h6("Power"), value = .9),
numericInput("alpha", label=h6("Alpha"), value = .05),
numericInput("precision", label=h6("Precision"), value =0.05),
numericInput("Delta", label=h6("Delta"), value=.3),
column(4,
numericInput("sample", label=h6("Starting Sample Size"), value = 40),
numericInput("reps", label=h6("Number of Replications"), value=1000)),
),
column(4,
#title of output
h4("Calculated Sample Size"),
verbatimTextOutput("n", placeholder=TRUE)),
#create action buttons for users to run the form and to reset the form
textOutput("Sample Size (n)"),
column(4,
actionButton("action","Calculate"))
)))
server = function(input,output,session){
buttonGo = eventReactive(input$action, {withProgress(message = "Running", {
#relist the key inputs and save them to be able to be used in the rest of the code
n<-input$sample/2
alpha<-input$alpha
power <- input$power
beta<-1-input$power
precision<-input$precision
delta <- input$Delta +1
rep <- input$reps
nincrease<-10
#manually load in the data from the baseline data .xlsx file
Reporting <- c("12/13/21","12/14/21","12/15/21","12/16/21","12/17/21","12/18/21","12/19/21","12/20/21","12/21/21","12/22/21","12/23/21","12/24/21","12/25/21","12/26/21","12/27/21","12/28/21","12/29/21","12/30/21","12/31/21","1/1/22")
AdSet <- "Status Quo"
Results <- c(70,52, 33, 84, 37, 41, 22, 53, 78, 66, 100, 110, 52, 43, 63, 84, 16, 64, 21, 69)
ResultIndicator <- "actions:link_click"
Budget <- 100
CostPerClick<- c(1.43, 1.92, 3.03, 1.19, 2.70, 2.44, 4.55, 1.89, 1.28, 1.52, 1.00, 0.91, 1.92, 2.33, 1.59, 1.19, 6.25, 1.56, 4.76, 1.45)
Impressions <- c(7020, 8430, 5850, 7920, 6890, 7150, 6150, 7370, 8440, 6590, 6750,8720, 6410,7720, 6940, 8010, 7520, 7190, 6540, 6020)
df <- data.frame(Reporting, AdSet, Results, ResultIndicator,Budget,CostPerClick,Impressions)
#define the standard deviation of the results as well as the mean for group 1 of the 2 level variable and the mean for group 2
mean1 = mean(df$Results)
sd1 = sd(df$Results)
mean2 = delta*mean1
click=rep(0,n)
#Create 2 level variable
AgeGroup <- rep(c("Age21-35","Age36-50"),each=n)
#create new data frame with 2 level variable and click repetitions
DataFrame2 <- data.frame(AgeGroup,click)
#create new data frame binding all of the input variables together
DataFrame3 <- data.frame(cbind(n,alpha,power,precision,delta,rep))
#create for loop to find the pvalue of the ttest run with click~AgeGroup
trials=function(){
for(i in 1:nrow(DataFrame2)){
if(any(DataFrame2$AgeGroup[i]=="Age21-35")){DataFrame2$click[i] =rnorm(1,mean1,sd1)}else{DataFrame2$click[i] =rnorm(1,mean2,sd1)}
}
pvalttest=t.test(click~AgeGroup, data=DataFrame2)
return(pvalttest$p.value)
}
p_values=replicate(input$reps,trials())
#keep the p values that are significant
significance=p_values[p_values<alpha]
#estimate power as the proportion of significant p values
power <- length(significance)/length(p_values)
print(c(power,n))
#run a while loop to find the n within the goal power limits
goalpower<-1-beta
lowergoal<-goalpower-input$precision
uppergoal<-goalpower+input$precision
while (power<lowergoal||power>uppergoal){
if (power<lowergoal){
n=n+nincrease
AgeGroup=c()
click=c()
AgeGroup=rep(c("Age21-35","Age36-50"), each=n)
click=rep(NA,2*n)
DataFrame2=data.frame(AgeGroup,click)
p_values=replicate(input$reps, trials())
significance=p_values[p_values<alpha]
power=length(significance)/length(p_values)
print(c(n, power))
}else{
nincrease=nincrease%/%(10/9) #%/% fixes issue of rounding
n=n-nincrease
AgeGroup=c()
click=c()
AgeGroup=rep(c("Age21-35","Age36-50"), each=n)
click=rep(NA,2*n)
DataFrame2=data.frame(AgeGroup,click)
p_values=replicate(input$reps, trials())
significance=p_values[p_values<alpha]
power=length(significance)/length(p_values)
print(c(n, power))
}
}
#n is defined as the sample size of one level of the 2 level variable, so multiply by 2 to get the full sample size
n*2
})})
#render the reactive result so that clicking Calculate fills the "n" output
output$n <- renderText({ buttonGo() })
}
shinyApp(ui, server)
I don't need the app to be pretty. I just want it to run whenever someone clicks the Calculate button.
I have implemented a Keras LSTM network in R; I can train it and the fit function works correctly. But when I use the "predict" function, I get the following errors:
pred <- base_model %>% predict(matchsetarray)
WARNING:tensorflow:Model was constructed with shape (96, 2, 26) for input KerasTensor(type_spec=TensorSpec(shape=(96, 2, 26), dtype=tf.float32, name='lstm_3_input'), name='lstm_3_input', description="created by layer 'lstm_3_input'"), but it was called on an input with incompatible shape (32, 2, 26).
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: in user code:
File "C:\Users\Utente\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\keras\engine\training.py", line 1801, in predict_function *
return step_function(self, iterator)
File "C:\Users\Utente\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\keras\engine\training.py", line 1790, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\Utente\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\keras\engine\training.py", line 1783, in run_step **
outputs = model.predict_step(data)
File "C:\Users\Utente\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\keras\engine\training.py", line 1751, in predict_step
return self(x, training=False)
File "C:\Users\Utente\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\
However, the dimension of the predict input is (96, 2, 26), yet "predict" treats the input "matchsetarray" as if it had dimension (32, 2, 26). Can I force predict to read the correct format?
I tried to verify the dimension of the input "matchsetarray":
dim(matchsetarray)
[1] 96 2 26
It's the correct dimension expected by the "predict" function.
I am trying to perform a multiple sample comparison and Tukey HSD using the statsmodels module, but I keep getting this error message: "ValueError: v must be > 1 when p >= .9". I have tried looking this up on the internet for a possible solution, but to no avail. Could anyone familiar with this module help me decipher what I am doing wrong to prompt this error? I am using Python 2.7 and Spyder. Below is a sample of my data and the print statement. Thanks!
import numpy as np
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,MultiComparison)
###--- Here are the data I am using:
data1 = np.array([ 1, 1, 1, 1, 976, 24, 1, 1, 15, 15780])
data2 = np.array(['lau15', 'gr17', 'fri26', 'bays29', 'dantzig4', 'KAT38','HARV50', 'HARV10', 'HARV20', 'HARV41'], dtype='|S8')
####--- Here's my print statement code:
print pairwise_tukeyhsd(data1, data2, alpha=0.05)
It seems you have to provide more than a single observation per group for the test to work.
Minimal example:
from statsmodels.stats.multicomp import pairwise_tukeyhsd,MultiComparison
data=[1,2,3]
groups=['a','b','c']
print("1st try:")
try:
print(pairwise_tukeyhsd(data,groups, alpha=0.05))
except ValueError as ve:
print("whoops!", ve)
data.append(2)
groups.append('a')
print("2nd try:")
try:
print( pairwise_tukeyhsd(data, groups, alpha=0.05))
except ValueError as ve:
print("whoops!", ve)
Output:
1st try:
/home/user/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3367: RuntimeWarning: Degrees of freedom <= 0 for slice
**kwargs)
/home/user/.local/lib/python3.7/site-packages/numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
whoops! v must be > 1 when p >= .9
2nd try:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
====================================================
group1 group2 meandiff p-adj lower upper reject
----------------------------------------------------
a b 0.5 0.1 -16.045 17.045 False
a c 1.5 0.1 -15.045 18.045 False
b c 1.0 0.1 -18.1046 20.1046 False
----------------------------------------------------
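Translated back to the original data: each group label ('lau15', 'gr17', and so on) needs at least two observations before pairwise_tukeyhsd can estimate a within-group variance. A sketch with purely illustrative duplicate values (not real measurements):
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Two observations per group instead of one; the values are made up for illustration.
data1 = np.array([1, 2, 976, 1001, 24, 30])
data2 = np.array(['lau15', 'lau15', 'dantzig4', 'dantzig4', 'KAT38', 'KAT38'])
print(pairwise_tukeyhsd(data1, data2, alpha=0.05))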
I have a DataFrame 'data' in SparkR which contains ID = 1, 2, ... and amount = 232, 303, 444, 10, ...
I want to check if the sum of amount is greater than 5000.
sum(data$amount ) > 5000
Now SparkR should return TRUE if it is true and FALSE otherwise, but all I get is this message:
Column (SUM(amount)>5000)
How can I check if it's true?
It might not be the best possible solution, but it works. You did create a column with one item, but I did not find a way to get the result stored in that item, therefore I applied a different approach:
df <- data.frame(ID=c(1,2,3,4),amount=c(232, 303, 444, 10))
data <- createDataFrame(sqlContext,df)
data <- withColumn(data, "constant", data$ID * 0)
sumFrame <- agg(groupBy(data, data$constant), sumAmount = sum(data$amount))
localResult <- collect(sumFrame)
localResult$sumAmount > 5000
With this approach I create a DataFrame of one row, but a DataFrame can be collected to obtain the result.
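Depending on the SparkR version, the constant column may not even be needed: agg can also aggregate a whole DataFrame without a groupBy, which yields the same one-row result. An untested sketch under that assumption:
# Sketch, assuming a SparkR version where agg accepts a DataFrame directly.
sumFrame <- agg(data, sumAmount = sum(data$amount))
localResult <- collect(sumFrame)
localResult$sumAmount > 5000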