'Duplicate' NGram values in topic list created using bertopic - huggingface-transformers

I've set the CountVectorizer to examine bi- and trigrams (ngram_range=(1, 3)). This seems very useful. However, I'm seeing "duplicate" terms, e.g.:
The terms "justice," "India," "gate," and "along" appear to overlap significantly across the extracted n-grams. I'm using these vocabularies to choose documents for further processing, and it appears that one phrase is "pushing out" other terms that could otherwise surface. Since I'm conducting a broad search across all of these terms to pick target documents, I'm not sure what I'm actually "missing". Am I thinking about this correctly? In this case, would it be a "good thing" if "india gate" and "justice khanna" were each combined into a single term?
Also, how can I combine these into a single term in BERTopic so that these overlaps don't occur?
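For context, the setup being described presumably looks something like the sketch below; the vectorizer_model argument is how a custom CountVectorizer is passed into BERTopic, and docs stands in for the document list:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
# Custom vectorizer so topic words can be unigrams, bigrams, or trigrams
vectorizer_model = CountVectorizer(ngram_range=(1, 3))
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)  # docs: list of raw text strings (placeholder)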

In BERTopic, there is a diversity parameter that allows you to fine-tune the topic representations. The underlying algorithm is Maximal Marginal Relevance (implemented as MaximalMarginalRelevance). The parameter is a value between 0 and 1 that indicates how diverse the keywords within a single topic should be relative to one another: a value of 1 indicates high diversity, and 0 indicates little diversity. It works as follows:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Get documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Train BERTopic and apply MMR
topic_model = BERTopic(diversity=0.4)
topics, probs = topic_model.fit_transform(docs)
Do note that in the upcoming version the diversity parameter will be removed and replaced as follows:
from bertopic.representation import MaximalMarginalRelevance
from bertopic import BERTopic
# Create your representation model
representation_model = MaximalMarginalRelevance(diversity=0.3)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
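You would then fit this model on your documents exactly as in the earlier snippet:
topics, probs = topic_model.fit_transform(docs)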

Related

Doc2Vec: How to find most similar document

I am using Gensim's Doc2Vec, and was wondering if there is a way to get the most similar document to another document that is outside the list of TaggedDocuments used to train the Doc2Vec model.
Right now I can infer a vector from a document not in the training set:
# 'model' here is an instance of the Doc2Vec class that has been trained
# Inferring a vector
doc_not_in_training_set = "Foo Foo Foo Foo Foo Foo Fie"
v1 = model.infer_vector(word_tokenize(doc_not_in_training_set.lower()))
print("V1_infer", v1)
This prints out a vector representation of the 'doc_not_in_training_set' string. However, is there a way to use this vector to find the n most similar documents to the 'doc_not_in_training_set' string (among the TaggedDocuments in this Doc2Vec model's training set)?
Looking through the documentation, the closest I could find was the model.docvecs.most_similar() method:
# Finding most similar to first
similar_doc = model.docvecs.most_similar('0')
This returns the document in the training set most similar to the document in the training set with tag '0'.
In the documentation of this method, it looks like there is not yet the functionality I am looking for:
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Is there another method I can use to find documents similar to a document not in the training set?
The .most_similar() method will also take a raw vector as the target position.
It helps to explicitly name the positive parameter, to prevent other logic of that method, which tries to intuit what strings or other arguments might mean, from misinterpreting a single raw vector.
So try:
similar_docs = model.docvecs.most_similar(positive=[v1])
You should get back a list of nearest-neighbors to the v1 vector that you'd previously inferred.
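Putting the two steps together, a minimal sketch (this uses the Gensim 3.x attribute name docvecs from the question; in Gensim 4.x the same lookup lives under model.dv):
from nltk.tokenize import word_tokenize
# Infer a vector for the unseen document, then query the trained document vectors
v1 = model.infer_vector(word_tokenize(doc_not_in_training_set.lower()))
similar_docs = model.docvecs.most_similar(positive=[v1], topn=10)
print(similar_docs)  # list of (tag, cosine similarity) pairs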

Limiting BART HuggingFace Model to complete sentences of maximum length

I'm implementing BART on HuggingFace; see the reference: https://huggingface.co/transformers/model_doc/bart.html
Here is the code from their documentation that works in creating a generated summary:
from transformers import BartModel, BartTokenizer, BartForConditionalGeneration
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
def baseBart(ARTICLE_TO_SUMMARIZE):
    inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
    # Generate Summary
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=25, early_stopping=True)
    return [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
I need to impose conciseness on my summaries, so I am setting max_length=25. In doing so, though, I'm getting incomplete sentences such as these two examples:
EX1: The opacity at the left lung base appears stable from prior exam.
There is elevation of the left hemidi
EX 2: There is normal mineralization and alignment. No fracture or
osseous lesion is identified. The ankle mort
How do I make sure that the predicted summary consists only of coherent sentences with complete thoughts while remaining concise? If possible, I'd prefer not to run a regex on the summarized output and cut off any text after the last period, but instead have the BART model produce sentences within the maximum length.
I tried setting truncation=True in the model but that didn't work.

How can I use more than one additional regressor with DeepAREstimator in gluon-ts?

When creating training or test data in gluon-ts, we can pass an additional real-valued regressor to the DeepAREstimator by specifying a feat_dynamic_real field. Is there support for multiple real-valued regressors?
There is a one_dim_target flag in gluonts.dataset.common.ListDataset which is used to create the training/test data objects. This seems like it could be needed to support multiple additional regressors, however I couldn't find a good example on the intended usage.
Here is the set up for creating training data with one additional regressor:
training_data = ListDataset(
    [{"start": df.index[0], "target": df.values, "feat_dynamic_real": df['randomColumn'].values}],
    freq="5min", one_dim_target=False
)
and the Estimator:
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer
estimator = DeepAREstimator(freq="5min", prediction_length=12, trainer=Trainer(epochs=10))
predictor = estimator.train(training_data=training_data)
I'm looking for the syntax/configuration needed for multiple regressors.
Yes, there is support for that. First off, GluonTS refers to regressors as features and to the signal we are trying to predict as the target. Thus, the one_dim_target flag you mention relates to the dimension of the output, not the input.
Below is the code I use to associate a multi-dimensional feature (input) with each target signal (I use a one-dimensional target):
from gluonts.dataset.common import ListDataset
from gluonts.dataset.field_names import FieldName

train_ds = ListDataset(
    [{FieldName.TARGET: target,
      FieldName.START: start,
      FieldName.FEAT_DYNAMIC_REAL: fdr}
     for (target, start, fdr) in zip(target,
                                     custom_ds_metadata['start'],
                                     feat_dynamic_real)],
    freq=custom_ds_metadata['freq'])  # assumes the metadata dict also holds the frequency, e.g. "5min"
In the zip function above:
target: a numpy array containing the target signal; here it has shape (1, # of time steps), i.e. a single one-dimensional target
custom_ds_metadata['start']: a pandas date indicating the beginning of the data
feat_dynamic_real: a 2-dimensional numpy array containing two feature signals, i.e. feat_dynamic_real has shape (# of features, # of time steps)
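To have DeepAR actually use these features, the estimator also has to be configured for them. A minimal sketch, reusing the question's 5-minute setup and assuming the use_feat_dynamic_real flag of DeepAREstimator is what switches the dynamic features on (check the signature in your GluonTS version):
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer
estimator = DeepAREstimator(
    freq="5min",                 # same frequency as the ListDataset
    prediction_length=12,
    use_feat_dynamic_real=True,  # assumption: enables the FEAT_DYNAMIC_REAL inputs
    trainer=Trainer(epochs=10),
)
predictor = estimator.train(training_data=train_ds)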

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with FastText word vectors to return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun" in English) returns a series of words with a dot in them (like sole., sole.Ma, etc.). Where is the problem? Why does most_similar return these meaningless words?
EDIT
I tried with the English word vectors and the word "sun" returns this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather use a Facebook-FastText-specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if you're choosing to use FastText, you may be interested in using it "fully".)
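A minimal sketch of that loading path, assuming the Facebook-distributed cc.it.300.bin file has been downloaded (the filename/path is an assumption):
from gensim.models.fasttext import load_facebook_vectors
# Loads the FastText vectors including subword information, so even
# out-of-vocabulary words can get synthesized vectors
wv = load_facebook_vectors('cc.it.300.bin')  # path is an assumption
print(wv.most_similar(positive=['sole'], topn=10))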
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legitimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

Is there a way to set min_df and max_df in gensim's tfidf model?

I am using gensim's tfidf model like so:
from gensim import corpora, models
dictionary = corpora.Dictionary(some_corpus)
mapped_corpus = [dictionary.doc2bow(text) for text in some_corpus]
tfidf = models.TfidfModel(mapped_corpus)
Now I'd like to apply thresholds to remove terms that appear too frequently (max_df) and too infrequently (min_df). I know that scikit's CountVectorizer allows you to do this, but I can't seem to find how to set these thresholds in gensim's tfidf. Could someone please help?
You can filter your dictionary with
dictionary.filter_extremes(no_below=min_df, no_above=rel_max_df)
Note that no_below expects the minimum number of documents in which tokens must appear, whereas no_above expects a maximum relative frequency, e.g. 0.5. Afterwards you can construct your corpus with the filtered dictionary. According to the gensim docs, it is also possible to construct a TfidfModel from a dictionary alone.
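Putting it together, a minimal sketch of the full flow, assuming min_df is an absolute document count (e.g. 5) and max_df a relative frequency (e.g. 0.5):
from gensim import corpora, models
dictionary = corpora.Dictionary(some_corpus)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # example thresholds (assumed values)
mapped_corpus = [dictionary.doc2bow(text) for text in some_corpus]
tfidf = models.TfidfModel(mapped_corpus)
# Or, per the gensim docs, build the model directly from the filtered dictionary:
# tfidf = models.TfidfModel(dictionary=dictionary)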
