Limiting BART HuggingFace Model to complete sentences of maximum length - huggingface-transformers

I'm implementing BART on HuggingFace, see reference: https://huggingface.co/transformers/model_doc/bart.html
Here is the code from their documentation that works in creating a generated summary:
from transformers import BartModel, BartTokenizer, BartForConditionalGeneration
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
def baseBart(ARTICLE_TO_SUMMARIZE):
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
# Generate Summary
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=25, early_stopping=True)
return [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
I need to impose conciseness with my summaries so I am setting max_length=25. In doing so though, I'm getting incomplete sentences such as these two examples:
EX1: The opacity at the left lung base appears stable from prior exam.
There is elevation of the left hemidi
EX 2: There is normal mineralization and alignment. No fracture or
osseous lesion is identified. The ankle mort
How do I make sure that the predicted summary is only coherent sentences with complete thoughts and remains concise. If possible, I'd prefer to not perform a regex on the summarized output and cut off any text after the last period, but actually have the BART model produce sentences within the the maximum length.
I tried setting truncation=True in the model but that didn't work.

Related

'Duplicate' NGram values in topic list created using bertopic

I've set the CountVectorizer to examine bi and trigrams (ngram_range=(1, 3)) . This seems very useful. However, I'm seeing "duplicate" terms e.g.:
The terms "justice," "India," "gate," and "along" appear to overlap significantly. I'm utilising these vocabularies to choose documents for further processing, and it appears that we have one phrase "pushing out" other terms that could otherwise surface. In fact, I'm conducting a broad search across all of these terms to pick target documents for additional processing, so I'm not sure what I'm "missing" otherwise. Is this something I'm thinking about correctly? In this case, would it be a "good thing" if "india gate" and "justice khanna" were combined into a single term?
also how can I combine these into a single term in bertopic so that these overlaps don't occur
In BERTopic, there is the diversity parameter that allows you to fine-tune the topic representations. The underlying algorithm for this is called MaximalMarginalRelevance. It is a value between 0 and 1 that indicates how diverse keywords in a single topic should be compared to one another. A value of 1 indicates high diversity and 0 indicates little diversity. It works as follows:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Get documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Train BERTopic and apply MMR
topic_model = BERTopic(diversity=0.4)
topics, probs = topic_model.fit_transform(docs)
Do note that in the upcoming version, the diversity parameter is removed and will be replaced as follows:
from bertopic.representation import MaximalMarginalRelevance
from bertopic import BERTopic
# Create your representation model
representation_model = MaximalMarginalRelevance(diversity=0.3)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)

How to include an array of weights to adjust importance of observed data in sm.tsa.UnobservedComponents?

I have used the following 5 lines to achieve a kalman filter with your work for a smoothed pricing model, and it worked great.
mod = sm.tsa.UnobservedComponents(obs, 'local level')
lm = sm.OLS(obs, xlm, missing='drop').fit()
obs_noise = abs(lm.resid).mean()
params = [obs_noise, obs_noise / obs_noise_level]
mod_filter, mod_smooth = mod.filter(params), mod.smooth(params)
However currently I would like to adjust the filtering smoothness at certain time, for example, when unemployment rate or interest rate made a big surge, I would like to make the output (Kalman filtered/smoothed) value closer to the observed value, while in most other time I will keep the what it is from the model. So, I have created an array, while a few items greater than 1, and the others will be exactly 1.
e.g.: ir_coeff = np.array([1,1,1,1,1.345,1.23,1.78,1,1,1])
What could be the best approach to achieve this? Thank you a lot in advance.
I have tried to include it in the output file with a dot product operation, however it is not reasonable to do this.

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with Fasttext Word vectors for return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun", in english) return a series of words with a dot in it (like sole., sole.Ma, ecc...). Where is the problem? Why most_similar return this meaningless word?
EDIT
I tried with english word vector and the word "sun" return this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if choosing to use FastText, you may be interesting it using it "fully".)
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

How to set up training and feature template files for NER? - CRF++

For the problem of named entity recognition,
After tokenizing the sentences, how do you set up the columns? it looks like one column in the documentation is POS tag, but where do these come from? Am I supposed to tag the POS myself or is there a tool to generate these?
What is the next column represent? A class like PERSON, LOCATION, etc? and does it have to be in any particular format?
Is there any example of a completed training file and template for NER?
You can find example training and test data in the crf++ repo here. The training data for noun phrase chunking looks like this:
Confidence NN B
in IN O
the DT B
pound NN I
is VBZ O
widely RB O
expected VBN O
... etc ...
The columns are arbitrary in that they can be anything. CRF++ requires that every line have the same number of columns (or be blank, to separate sentences), not all CRF packages require that. You will have to provide the data values yourself; they are the data the classifier learns from.
While anything can go in the various columns, one convention you should know is IOB Format. To deal with potentially multi-token entities, you mark them as Inside/Outside/Beginning. It may be useful to give an example. Pretend we are training a classifier to detect names - for compactness I'll write this on one line:
John/B Smith/I ate/O an/O apple/O ./O
In columnar format it would look like this:
John B
Smith I
ate O
an O
apple O
. O
With these tags, B (beginning) means the word is the first in an entity, I means a word is inside an entity (it comes after a B tag), and O means the word is not an entity. If you have more than one type of entity it's typical to use labels like B-PERSON or I-PLACE.
The reason for using IOB tags is so that the classifier can learn different transition probabilities for starting, continuing, and ending entities. So if you're learning company names It'll learn that Inc./I-COMPANY usually transitions to an O label because Inc. is usually the last part of a company name.
Templates are another problem and CRF++ uses its own special format, but again, there are examples in the source distribution you can look at. Also see this question.
To answer the comment on my answer, you can generate POS tags using any POS tagger. You don't even have to provide POS tags at all, though they're usually helpful. The other labels can be added by hand or automatically; for example, you can use a list of known nouns as a starting point. Here's an example using spaCy for a simple name detector:
import spacy
nlp = spacy.load('en')
names = ['John', 'Jane', etc...]
text = nlp("John ate an apple.")
for word in text:
person = 'O' # default not a person
if str(word) in names:
person = 'B-PERSON'
print(str(word), word.pos_, person)

Naive Bayesian and zero-frequency issue

I think I've implemented most of it correctly. One part confused me:
The zero-frequency problem:
Add 1 to the count for every attribute value-class combination (Laplace estimator) when an attribute value doesn’t occur with every class value.
Here's some of my client code:
//Clasify
string text = "Claim your free Macbook now!";
double posteriorProbSpam = classifier.Classify(text, "spam");
Console.WriteLine("-------------------------");
double posteriorProbHam = classifier.Classify(text, "ham");
Now say the word 'free' is present in the training data somewhere
//Training
classifier.Train("ham", "Attention: Collect your Macbook from store.");
*Lot more here*
classifier.Train("spam", "Free macbook offer expiring.");
But the word is present in my training data for category 'spam' only not in 'ham'. So when I go to calculate posteriorProbHam what do i do when I come across the word 'free'.
Still add one. The reason: Naive Bayes models P("free" | spam) and P("free" | ham) as being completely independent, so you want to estimate the probability of each completely independently. The Laplace estimator you're using for P("free" | spam) is (count("free" | spam) + 1) / count(spam); P("ham" | spam) is the same.
If you think about what it would mean to not add one, it wouldn't really make sense: seeing "free" one time in ham would make it less likely to see "free" in spam.

Resources