Naive Bayes and the zero-frequency problem

I think I've implemented most of it correctly. One part confused me:
The zero-frequency problem:
Add 1 to the count for every attribute value-class combination (Laplace estimator) when an attribute value doesn’t occur with every class value.
Here's some of my client code:
// Classify
string text = "Claim your free Macbook now!";
double posteriorProbSpam = classifier.Classify(text, "spam");
Console.WriteLine("-------------------------");
double posteriorProbHam = classifier.Classify(text, "ham");
Now say the word 'free' is present in the training data somewhere
//Training
classifier.Train("ham", "Attention: Collect your Macbook from store.");
*Lot more here*
classifier.Train("spam", "Free macbook offer expiring.");
But the word is present in my training data for category 'spam' only, not in 'ham'. So when I calculate posteriorProbHam, what do I do when I come across the word 'free'?

Still add one. The reason: Naive Bayes models P("free" | spam) and P("free" | ham) as completely independent, so you want to estimate each probability completely independently. The Laplace estimator you're using for P("free" | spam) is (count("free" | spam) + 1) / (count(spam) + |V|), where |V| is the vocabulary size; P("free" | ham) is estimated the same way.
If you think about what it would mean to not add one, it wouldn't really make sense: seeing "free" one time in ham would make it less likely to see "free" in spam.
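A minimal sketch of that estimator in Python (the class and method names here are my own for illustration, not the classifier from the question):

```python
from collections import defaultdict

class NaiveBayesSketch:
    """Sketch of Laplace-smoothed per-class word likelihoods."""

    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))  # class -> word -> count
        self.class_totals = defaultdict(int)                      # class -> total word count
        self.vocabulary = set()

    def train(self, label, text):
        for word in text.lower().split():
            self.word_counts[label][word] += 1
            self.class_totals[label] += 1
            self.vocabulary.add(word)

    def word_likelihood(self, word, label):
        # Laplace estimator: add 1 to the count and |V| to the denominator,
        # so a word never seen with this class still gets a small non-zero
        # probability instead of zeroing out the whole product.
        return (self.word_counts[label][word] + 1) / (
            self.class_totals[label] + len(self.vocabulary)
        )
```

With this, P("free" | ham) comes out small but positive even though 'free' never occurs in the ham training data.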


How to include an array of weights to adjust importance of observed data in sm.tsa.UnobservedComponents?

I have used the following 5 lines to achieve a Kalman filter with your work for a smoothed pricing model, and it worked great.
mod = sm.tsa.UnobservedComponents(obs, 'local level')
lm = sm.OLS(obs, xlm, missing='drop').fit()
obs_noise = abs(lm.resid).mean()
params = [obs_noise, obs_noise / obs_noise_level]
mod_filter, mod_smooth = mod.filter(params), mod.smooth(params)
However, I would now like to adjust the filtering smoothness at certain times. For example, when the unemployment rate or an interest rate makes a big surge, I would like the output (Kalman filtered/smoothed) value to be closer to the observed value, while at most other times I keep what the model produces. So I have created an array in which a few items are greater than 1 and the others are exactly 1.
e.g.: ir_coeff = np.array([1,1,1,1,1.345,1.23,1.78,1,1,1])
What could be the best approach to achieve this? Thank you a lot in advance.
I have tried applying it to the output with a dot product operation, but that is not a reasonable way to do it.
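As far as I can tell there is no built-in per-observation weight argument in sm.tsa.UnobservedComponents, but the intended effect can be sketched with a hand-rolled local-level Kalman filter in which a weight array (like the ir_coeff above) divides the observation variance at chosen times, so weights above 1 pull the filtered value toward the observation. Everything below is my own illustration, not statsmodels API:

```python
import numpy as np

def local_level_filter(obs, state_var, obs_var, obs_weights=None):
    """Local-level Kalman filter with optional per-time observation weights.

    obs_weights[t] > 1 shrinks the effective measurement noise at time t,
    making the filtered state track the observation more closely there.
    """
    n = len(obs)
    if obs_weights is None:
        obs_weights = np.ones(n)
    state = obs[0]       # initialize the level at the first observation
    P = obs_var          # initial state uncertainty
    filtered = np.empty(n)
    for t in range(n):
        # Predict: random-walk state, so uncertainty grows by state_var
        P = P + state_var
        # Dividing by the weight shrinks the effective observation noise
        R_t = obs_var / obs_weights[t]
        K = P / (P + R_t)                 # Kalman gain
        state = state + K * (obs[t] - state)
        P = (1 - K) * P
        filtered[t] = state
    return filtered
```

At a time step where the weight is large, the gain approaches 1 and the filtered value sits almost on top of the observed value; where the weight is 1, you get ordinary local-level smoothing.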

ANTLR4 Parser (flat parse vs structured parse) for a language translator

Over the last couple of months, with the help of members from this site, I have been able to write (Phase 1) a Lexer and Parser to translate Lang X to Java. Because I was new to this topic, I opted for a simple line-by-line parser, and it is now able to parse around 1000 language files (circa 1M lines of code) in 15 minutes with a small number of errors/exceptions, the problems being isolated to the source files, not the parser. I will refer to this as flat parsing, for want of a better expression.
Now for Phase 2, the translation to Java. Like any language, mine has data structures, procedures, sub-routines, etc., and I thought it best to alter the parser from the grammar below (for simplicity I have focused on the data structure, called TABLE):
// Main entry point of the program
program
: executableUnit+ EOF
;
// Execution units (line by line)
executableUnit:
itemBlockStart
| itemBlockEnd
| itemStatement
| tableHeader
;
itemBlockStart: BEGIN;
itemBlockEnd: END;
tableHeader: // A TABLE declaration statement
TABLE atom LETTER (atom)*
;
// Item statement
itemStatement:
// Tables with Item statements
ITEM atom+
;
// Base atom lowest of the low
atom:
MINUS? INT #IntegerAtom
| REAL_FORMAT #RealAtom
| FIX_POINT #FixPointAtom
| (MINUS | EQUALS)? NAME DOT? #NameAtom
| LETTER #LetterAtom
| keywords DOT? #KeywordAtom
| DOLLAR atom DOLLAR #DollarAtom
| hex_assign #HexItem
;
to this:
// Execution units (by structure)
executableUnit:
tableStatement
| itemStatement
;
// Table statement, header and body
tableStatement:
tableHeader (itemBlockStart | itemBlockEnd | itemStatement)*;
Before we go any further: TABLE and individual ITEM statements can occur anywhere in the code, on their own (Java output would be public) or inside a Procedure (Java output would be private).
Imagine my dismay (if you will) when the parser produced the same number of errors, but took 10 times longer to parse the input. I kind of understand the increased time, in terms of selecting the right path. My questions for the group are:
Is there a way to force the parser down the TABLE structure early to reduce the time period?
Is having this logical tree-structure grouping worth the increased time?
My desire to move in this direction was to have a Listener callback with a mini tree with all the relevant items accessible to walk. I.e., if the mini tree wasn't inside a Procedure statement, it was public in Java.
It's not entirely clear to me what performance difference you are referring to (presumably, the difference between the "line by line" parser and this full-file parser?).
A few things that "jump out" about your grammar, and could have some performance impact:
1 - itemBlockStart: BEGIN; and itemBlockEnd: END;. There's no point in having a rule that is a single token. Just use the token in the rule definition.
2 - You are, probably unintentionally, being VERY relaxed in the acceptance of itemBlockStart and itemBlockEnd in this rule (tableStatement: tableHeader (itemBlockStart | itemBlockEnd | itemStatement)*;). This could also have performance implications. I'm assuming in the rest of this response that BEGIN should appear at the beginning of an itemStatement and END should appear at the end (not that the three can appear in any order willy-nilly).
Try this refactoring:
// Main entry point of the program
program
: executableUnit+ EOF
;
// Execution units (line by line)
executableUnit:
itemStatement # ItemStmt
| tableHeader # TableHeader
;
tableHeader: // A TABLE declaration statement
TABLE atom LETTER atom*
;
// Item statement
itemStatement: // Tables with Item statements
BEGIN ITEM atom+ END
;
// Base atom lowest of the low
atom: MINUS? INT #IntegerAtom
| REAL_FORMAT #RealAtom
| FIX_POINT #FixPointAtom
| (MINUS | EQUALS)? NAME DOT? #NameAtom
| LETTER #LetterAtom
| keywords DOT? #KeywordAtom
| DOLLAR atom DOLLAR #DollarAtom
| hex_assign #HexItem
;
Admittedly, I can't quite make out what your intention is, but this should be a step in the right direction.
As Kaby76 points out, the greedy operator at the end of tableHeader is quite likely to "gobble up" a lot of input. This is partly because of the lack of a terminator token, which would stop the token consumption much earlier. Moreover, your atom rule seems to be something of a "kitchen sink" rule that can match all manner of input. Couple that with the use of atom+ and atom*, and there's quite a likelihood of consuming a long stream of tokens. Is it really your intention that any of the atoms can appear one after another with no structure? They appear to be pieces/parts of expressions. If that's the case, you will want to define your grammar for expressions. This added structure will both help performance and give you a MUCH more useful parse tree to act upon.
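For example, if the atoms really are pieces of expressions, the kind of structure I mean might be sketched like this (PLUS, MULT, DIV and COMMA are assumed token names; only MINUS appears in your grammar, so treat this purely as a shape to aim for):

```antlr
// Hypothetical sketch only: token names other than MINUS are assumptions.
expression
    : expression (MULT | DIV) expression
    | expression (PLUS | MINUS) expression
    | atom
    ;

itemStatement
    : BEGIN ITEM expression (COMMA expression)* END
    ;
```

With expressions structured this way, ANTLR can rule out bad token orders early instead of greedily consuming a flat run of atoms.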
Much like the structure for tableStatement in your question's grammar, it doesn't really represent any structure (see my recommendation to change it to BEGIN ITEM atom+ END rather than accepting any combination in any order). The same thought process needs to be applied to atom. Both of these loose approaches let ANTLR march through your code consuming a LOT of tokens without any clue as to whether the order is actually correct (which is then very expensive to "back out of" when a problem is encountered).

Limiting BART HuggingFace Model to complete sentences of maximum length

I'm implementing BART on HuggingFace, see reference: https://huggingface.co/transformers/model_doc/bart.html
Here is the code from their documentation that works in creating a generated summary:
from transformers import BartModel, BartTokenizer, BartForConditionalGeneration
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
def baseBart(ARTICLE_TO_SUMMARIZE):
    inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
    # Generate Summary
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=25, early_stopping=True)
    return [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
I need to impose conciseness with my summaries so I am setting max_length=25. In doing so though, I'm getting incomplete sentences such as these two examples:
EX1: The opacity at the left lung base appears stable from prior exam.
There is elevation of the left hemidi
EX 2: There is normal mineralization and alignment. No fracture or
osseous lesion is identified. The ankle mort
How do I make sure that the predicted summary is only coherent sentences with complete thoughts and remains concise? If possible, I'd prefer not to perform a regex on the summarized output and cut off any text after the last period, but to actually have the BART model produce sentences within the maximum length.
I tried setting truncation=True in the model but that didn't work.
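Since max_length counts tokens, not sentences, BART itself offers no sentence-boundary guarantee that I know of. As a fallback (acknowledging the preference above for a model-side fix), a minimal post-processing sketch that cuts a summary back to its last complete sentence could look like this (the function name and character limit are my own):

```python
def trim_to_complete_sentences(summary, max_chars=200):
    """Trim a generated summary to the last sentence-ending period
    within max_chars, so no dangling fragment is returned."""
    text = summary[:max_chars]
    last_period = text.rfind('.')
    if last_period == -1:
        return text  # no complete sentence fits; return the fragment as-is
    return text[:last_period + 1]
```

Applied to EX 2 above, this would drop the dangling "The ankle mort" and keep the two complete sentences.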

I cannot understand this lmer error

I tried to solve the problem reading other answers but did not get the solution.
I am performing a lmer model:
MODHET <- lmer(PERC ~ SITE + TREAT + HET + TREAT*HET + (1|PINE), data = PRESU)
PERC is the percentage of predation. SITE is a categorical variable that I am using as a blocking factor; it is the identity of the site where I performed the experiment. TREAT is a categorical variable with 2 levels. HET is a continuous variable. The number of observations is 56, divided among 7 sites.
Maybe the problem is how I expressed the random factor. In every site I selected 8 pines among 15 to perform the experiment. I included the pine identity as a categorical random factor. For instance, in site 1 the pines are called a1, a3, a7, etc., while in site 2 they are called b1, b4, b12, etc.
The output of the model is
Error: number of levels of each grouping factor must be < number of observations
I don't understand where the mistake is. Could it be how I named the pines?
I tried also
MODHET <- lmer(PERC ~ SITE + TREAT + HET + TREAT*HET + (1|SITE:PINE), data = PRESU)
but the output is the same.
I hope I have explained my problem well. I have read similar questions on this forum but I still do not get the solution.
Thank you for your help
Use the argument control = lmerControl(check.nobs.vs.nRE = "ignore") in your lmer call to suppress this error. However, I guess this does not solve the actual problem. The check fires because your grouping factor has as many levels as observations: with 8 pines in each of 7 sites you have 56 pines for 56 observations, so (1|PINE) cannot separate pine-level variance from residual variance. Probably "SITE" is your random intercept?
If you consider PINES nested as "subjects" within SITES, then I would suggest the following formula:
MODHET <- lmer(PERC ~ TREAT*HET + (1|SITE), data = PRESU)
or,
MODHET <- lmer(PERC ~ TREAT*HET + (1 | SITE / PINE), data = PRESU)
But my answer may be wrong, I'm not sure whether I have enough information to fully understand what you're aiming at.
edit:
Sorry, the nesting was not correctly specified; I fixed it in the formula above. See also this answer.

Need an algorithm that detects diffs between two files for additions and reorders

I am trying to figure out if there are existing algorithms that can detect changes between two files in terms of additions but also reorders. I have an example below:
1 - User1 commit
processes = 1
a = 0
allactive = []
2 - User2 commit
processes = 2
a = 0
allrecords = range(10)
allactive = []
3 - User3 commit
a = 0
allrecords = range(10)
allactive = []
processes = 2
I need to be able to say that, for example, user1's code is the three initial lines, user2 added the "allrecords = range(10)" part (as well as a number change), and user3 did not change anything, since he/she just reordered the code.
Ideally, at commit 3, I want to be able to look at the code and say that from character 0 to 20 (this is user1's code), 21-25 user2's code, 26-30 user1's code etc.
I know there are two popular algorithms, longest common subsequence and longest common substring, but I am not sure which one can correctly count additions of new code while also identifying reorders.
Of course this still leaves out the question of having the same substring existing twice in a text. Are there any other algorithms that are better suited to this problem?
Each "diff" algorithm defines a set of possible code-change edit types, and then (typically) tries to find the smallest set of such changes that explains how the new file resulted from the old. Usually such algorithms are defined purely syntactically; semantics are not taken into account.
So what you want, based on your example, is an algorithm that allows "change line", "insert line", "move line" (and presumably "delete line" [not in your example, but necessary for a practical set of edits]). Given this, you ought to be able to define a dynamic programming algorithm to find a smallest set of edits explaining how one file differs from another. Note that this set is defined in terms of edits to whole lines, rather like classical "diff"; of course, classical diff does not have "change line" or "move line", which is why you are looking for something else.
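A rough sketch of that idea in Python, using difflib for the whole-line alignment plus a trivial post-pass that pairs identical deleted and inserted lines as moves (the function name and the edit tuples are my own; a real tool would find a minimal edit set, which this greedy sketch does not guarantee):

```python
import difflib

def line_edits(old_lines, new_lines):
    """Classify whole-line edits as change/insert/delete, then reclassify
    matching deleted+inserted lines as moves."""
    edits = []
    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'delete':
            edits.extend(('delete', line) for line in old_lines[i1:i2])
        elif op == 'insert':
            edits.extend(('insert', line) for line in new_lines[j1:j2])
        elif op == 'replace':
            # Pair lines up as "change line"; leftovers become delete/insert.
            k = min(i2 - i1, j2 - j1)
            edits.extend(('change', o, n)
                         for o, n in zip(old_lines[i1:i1 + k], new_lines[j1:j1 + k]))
            edits.extend(('delete', line) for line in old_lines[i1 + k:i2])
            edits.extend(('insert', line) for line in new_lines[j1 + k:j2])
    # A line that was both deleted and inserted elsewhere is a move.
    deleted = {e[1] for e in edits if e[0] == 'delete'}
    inserted = {e[1] for e in edits if e[0] == 'insert'}
    moved = deleted & inserted
    edits = [e for e in edits if not (e[0] in ('delete', 'insert') and e[1] in moved)]
    edits.extend(('move', line) for line in sorted(moved))
    return edits
```

On the commits in the question, this reports user2's edits as a change plus an insert, and user3's entire commit as a single move of "processes = 2".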
You could pick different types of deltas. Your example explicitly noted "number change"; if narrowly interpreted, this is NOT an edit on lines, but rather within lines. Once you start to allow partial line edits, you need to define how much of a partial line edit is allowed ("unit of change"). (Will your edit set allow "change of digit"?)
Our Smart Differencer family of tools defines the set of edits over well-defined sub-phrases of the targeted language; we use formal language grammar (non)terminals as the unit of change. [This makes each member of the family specific to the grammar of some language] Deltas include programmer-centric concepts such as "replace phrase by phrase", "delete listmember", "move listmember", "copy listmember", "rename identifier"; the algorithm operates by computing a minimal tree difference in terms of these operations. To do this, the SmartDifferencer needs (and has) a full parser (producing ASTs) for the language.
You didn't identify the language for your example. But in general, for a language looking like that, the SmartDifferencer would typically report that User2 commit changes were:
Replaced (numeric literal) "1" in line 1 column 13 by "2"
Inserted (statement) "allrecords = range(10)" after line 2
and that User3 commit changes were:
Move (statement) at line 1 after line 4
If you know who contributed the original code, with the edits you can straightforwardly determine who contributed which part of the final answer. You have to decide the unit-of-reporting; e.g., if you want report such contributions on a line by line basis for easy readability, or if you really want to track that Mary wrote the code, but Joe modified the number.
Detecting that User3's change is semantically null can't be done with a purely syntax-driven diff tool of any kind. To do this, the tool has to compute the syntactic deltas somehow, and then compute the side effects of all statements (well, "phrases"), which requires a full static analyzer of the language to interpret the deltas and see whether they have such null effects. Such a static analyzer requires a parser anyway, so it makes sense to do a tree-based differencer, but it also requires a lot more than just a parser. [We have such language front ends and have considered building such tools, but haven't gotten there yet.]
Bottom line: there is no simple algorithm for determining "that user3 did not change anything". There is reasonable hope that such tools can be built.
