After training an LDA model with the gensim Mallet wrapper, I converted it to a gensim LDA model via the malletmodel2ldamodel function provided with the wrapper. The topic-word distributions before and after the conversion are quite different: after conversion, the Mallet-trained model returns very unusual topic-word distributions.
import pickle
import gensim
import pyLDAvis.gensim

# Train with the Mallet wrapper, convert to a gensim LDA model, and save it.
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=13, id2word=dictionary)
model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)
model.save('ldamallet.gensim')

# Reload the dictionary, corpus, and converted model, then visualize.
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda_mallet = gensim.models.wrappers.LdaMallet.load('ldamallet.gensim')
lda_display = pyLDAvis.gensim.prepare(lda_mallet, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
Here is the output from the original gensim implementation:
I can see there was a bug related to this issue that was fixed in an earlier version of gensim. I am using gensim 3.7.1.
Here is an alternative function to use instead of malletmodel2ldamodel (which is reported to have bugs):
from gensim.models.ldamodel import LdaModel
import numpy

def ldaMalletConvertToldaGen(mallet_model):
    # Build an empty gensim model with the same vocabulary, topic count, and alpha.
    model_gensim = LdaModel(id2word=mallet_model.id2word, num_topics=mallet_model.num_topics, alpha=mallet_model.alpha, eta=0, iterations=1000, gamma_threshold=0.001, dtype=numpy.float32)
    # Copy Mallet's word-topic counts into the gensim state and refresh the model.
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

converted_model = ldaMalletConvertToldaGen(mallet_model)
I used it and it worked perfectly.
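One way to check a conversion like this is to compare topic-word probabilities before and after. The sketch below assumes that, with eta=0, the converted model's topic-word distribution should equal each row of the Mallet word-topic count matrix divided by its row sum. The helper names normalize_rows and max_abs_diff are hypothetical, and the small matrices are toy stand-ins rather than real model output.

```python
def normalize_rows(counts):
    """Turn each row of non-negative counts into a probability distribution."""
    normalized = []
    for row in counts:
        total = float(sum(row))
        if total == 0:
            raise ValueError("a topic row has no counts")
        normalized.append([value / total for value in row])
    return normalized


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two same-shaped matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Toy stand-in for mallet_model.wordtopics (2 topics x 3 words); not real data.
word_topic_counts = [[8, 1, 1], [2, 2, 6]]
expected = normalize_rows(word_topic_counts)

# Toy stand-in for the converted model's topic-word matrix.
converted = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]

# A small difference suggests the conversion preserved the distributions.
print(max_abs_diff(expected, converted) < 1e-9)
```

With real models, the first matrix would presumably come from mallet_model.wordtopics and the second from the converted model's topic-word matrix; a large difference would point at the conversion step rather than at the visualization.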
Related
I am using PyCaret for a classification problem and I want to get a list of all the categorical and numerical variables inferred by setup() for EDA. Is there a way to do this?
I have looked through the documentation but couldn't find a function for this.
Currently, the only way I have found to do this in PyCaret 3.x is to access a private attribute of the Experiment object.
from pycaret.datasets import get_data
from pycaret.classification import *
data = get_data('bank', verbose=False)
exp = setup(data = data, target = 'deposit', session_id=123, verbose=False);
print(f'Ordinal features: {exp._fxs["Ordinal"]}')
print(f'Numeric features: {exp._fxs["Numeric"]}')
print(f'Date features: {exp._fxs["Date"]}')
print(f'Text features: {exp._fxs["Text"]}')
print(f'Categorical features: {exp._fxs["Categorical"]}')
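Because the attribute above is private and may change between releases, it can be useful to cross-check its result against a rough, independent grouping. The sketch below is not PyCaret's inference logic; it is a plain-Python approximation that treats a column as numeric when all of its non-missing values are numbers. The function name split_columns_by_type and the sample rows are made up for illustration.

```python
from numbers import Number


def split_columns_by_type(rows):
    """Group column names into numeric and categorical by inspecting their values."""
    numeric, categorical = [], []
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] is not None]
        # bool is a Number subclass in Python, so exclude it explicitly.
        if values and all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
            numeric.append(name)
        else:
            categorical.append(name)
    return numeric, categorical


# Made-up rows for illustration only; not the 'bank' dataset.
rows = [
    {"age": 58, "job": "management", "balance": 2143.0},
    {"age": 44, "job": "technician", "balance": 29.0},
]
numeric, categorical = split_columns_by_type(rows)
print("Numeric:", numeric)
print("Categorical:", categorical)
```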
I'm using optuna.integration.lightgbm to train a LightGBM model.
The issue is that there is a ton of output, and frankly I just want to disable it (or at least find a way to reduce it).
I have tried the following, among many other things:
import optuna
import optuna.integration.lightgbm as lgb
from lightgbm import early_stopping, log_evaluation

optuna.logging.set_verbosity(optuna.logging.ERROR)  # Ignore outputs from Optuna when training

params = {
    "objective": "softmax",
    "metric": "multi_logloss",
    "boosting_type": "gbdt",
    "is_unbalance": True,
    "num_classes": 4,
    "num_boost_round": 10,
    "verbosity": -1,
}
model = lgb.train(
    params,
    dtrain,
    valid_sets=[dtrain, dval],
    callbacks=[early_stopping(100, verbose=False), log_evaluation(0)],
)
but I still get "early_stopping" outputs, validation outputs from each round, and so on.
There is even a suggestion to use log_evaluation(), which I have also passed.
I can't think of any other way to suppress the output.
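As a last resort, Python-level output can be silenced around the training call with the standard library. The sketch below is an assumption-laden workaround, not an Optuna or LightGBM feature: the quiet() context manager and the noisy_training_step() stand-in are hypothetical names, and the approach only affects text written through sys.stdout and sys.stderr and warnings raised in Python. Output written directly by native code, or by logging handlers that captured the original stream earlier, would not be caught.

```python
import contextlib
import os
import warnings


@contextlib.contextmanager
def quiet():
    """Silence Python-level stdout, stderr, and warnings inside the block."""
    with open(os.devnull, "w") as devnull:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            with contextlib.redirect_stdout(devnull), contextlib.redirect_stderr(devnull):
                yield


def noisy_training_step():
    """Hypothetical stand-in for a chatty call such as lgb.train(...)."""
    print("a noisy progress line")
    warnings.warn("a noisy warning")
    return "model"


with quiet():
    result = noisy_training_step()

# Only this line is printed; the output inside the block was discarded.
print(result)
```

Wrapping only the training call keeps later diagnostics visible, which is why the context manager is preferable to redirecting output for the whole script.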
As per the Roberta-long docs, the way to load the roberta long model for sequence classification is
from transformers import RobertaForSequenceClassification
from transformers.models.longformer.modeling_longformer import LongformerSelfAttention

class RobertaLongSelfAttention(LongformerSelfAttention):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False
    ):
        return super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)

class RobertaLongForSequenceClassification(RobertaForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)
        print("Config.........")
        config
        # Swap each layer's self-attention for the long-attention variant.
        for i, layer in enumerate(self.roberta.encoder.layer):
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)
I have my own pre-trained long model and tokenizer, but when I try to call trainer.train(), this gives an error saying
TypeError: forward() takes from 2 to 7 positional arguments but 8 were given
I also tried pulling the simonlevine/bioclinical-roberta-long model and tokenizer, just to check whether something is wrong with my pre-trained model and tokenizer, but I still get the same error.
Any idea how to get rid of this error?
Note: I used transformers 4.17.0 for pre-training and for converting the pre-trained model to the long version, and I am using the same version for fine-tuning.
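The error message says that forward() received one more positional argument than the override accepts. A plausible cause, assuming the installed transformers version passes an extra positional argument such as past_key_value to the self-attention module, is a signature mismatch. The sketch below reproduces that kind of mismatch in plain Python; the two classes are hypothetical stand-ins, not transformers code.

```python
class NarrowSelfAttention:
    """Mimics an override whose forward() accepts at most 6 arguments besides self."""

    def forward(self, hidden_states, attention_mask=None, head_mask=None,
                encoder_hidden_states=None, encoder_attention_mask=None,
                output_attentions=False):
        return hidden_states


class TolerantSelfAttention(NarrowSelfAttention):
    """Same override, but it also accepts (and ignores) a past_key_value slot."""

    def forward(self, hidden_states, attention_mask=None, head_mask=None,
                encoder_hidden_states=None, encoder_attention_mask=None,
                past_key_value=None, output_attentions=False):
        return super().forward(hidden_states, attention_mask=attention_mask,
                               output_attentions=output_attentions)


# Assumed call pattern: the caller passes seven positional arguments.
args = ("h", None, None, None, None, None, False)

try:
    NarrowSelfAttention().forward(*args)
except TypeError as err:
    # Same shape of error as in the question: one positional argument too many.
    message = str(err)
    print(message)

# The tolerant signature absorbs the extra argument and the call succeeds.
print(TolerantSelfAttention().forward(*args))
```

If this is indeed the cause, adding the missing parameter to RobertaLongSelfAttention.forward would address the TypeError, although the parent forward() may expect further arguments depending on the library version.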
I recently discovered (reading the question below) that I could obtain German dependencies with the Stanford parser, using the NNDependencyParser.
Dependencies are null with the German Parser from Stanford CoreNLP
My problem is that my parsed dependencies are always just adjacent words in the sentence, with no real tree structure. Parsing "Die Sonne scheint am Himmel." gives me pairs such as ("Die", "Sonne"), ("Sonne", "scheint"), ("scheint", "am") etc. as dependencies, even when using collapsed dependencies.
String modelPath = "edu/stanford/nlp/models/parser/nndep/UD_German.gz";
String taggerPath = "edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger";
String text = "Ich sehe den Mann mit dem Fernglas.";
MaxentTagger tagger = new MaxentTagger(taggerPath);
DependencyParser parser = DependencyParser.loadFromModelFile(modelPath);
DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(text));
// Tag each sentence, parse it, and print the collapsed dependencies.
for (List<HasWord> sentence : tokenizer) {
    List<TaggedWord> tagged = tagger.tagSentence(sentence);
    GrammaticalStructure gs = parser.predict(tagged);
    for (TypedDependency td : gs.typedDependenciesCollapsed()) {
        System.out.println(td.toString());
    }
}
Yes, our German dependency parsing model is currently broken (somehow the French model was included in the release and we currently don't seem to have a working German model).
However, you could train your own model using the data from the Universal Dependencies project. You can find some information on how to train the parser on its project page.
I would like to import a CSV file into Python with a FileChooser dialog. Then, using rpy2, I can perform statistical analyses with R, which I know much better than Python. Below is a piece of my code:
import pygtk
pygtk.require("2.0")
import gtk
from rpy2.robjects.vectors import DataFrame

def get_open_filename(self):
    filename = None
    chooser = gtk.FileChooserDialog("Open File...", self.window,
                                    gtk.FILE_CHOOSER_ACTION_OPEN,
                                    (gtk.STOCK_CANCEL, gtk.RESPONSE_CANCEL,
                                     gtk.STOCK_OPEN, gtk.RESPONSE_OK))
    response = chooser.run()
    if response == gtk.RESPONSE_OK:
        # Remember the chosen path so it is actually returned to the caller.
        filename = chooser.get_filename()
        don = DataFrame.from_csvfile(filename)
        print(don)
    chooser.destroy()
    return filename
When running the code, don is printed. My question is: don contains two columns, X and Y, that I can't access to perform analyses. Thanks for your kind help.
Did you check the documentation about extracting elements from a DataFrame?