I'm currently comparing various pre-trained NMT models and can't help but wonder what the difference between MarianMT and OpusMT is. According to OpusMT's GitHub repository, it is based on Marian. However, in the Huggingface Transformers implementation, all pretrained MarianMT models start with "Helsinki-NLP/opus-mt". So I thought they were the same, but even though they're roughly the same size, they yield different translation results.
If someone could please shed some light on what the differences are I would be very thankful.
Marian is an open-source tool for training and serving neural machine translation, mostly developed at the University of Edinburgh, Adam Mickiewicz University in Poznań and at Microsoft. It is implemented in C++ and is heavily optimized for MT, unlike PyTorch-based Huggingface Transformers that aim for generality rather than efficiency in a specific use case.
The NLP group at the University of Helsinki trained many translation models using Marian on parallel data collected in OPUS, and open-sourced those models. Later, they also converted the trained models into the Huggingface Transformers format and made them available via the Huggingface Hub.
MarianMT is a class in Huggingface Transformers for imported Marian models. You can train a model in Marian and convert it yourself. OpusMT models are the Marian models trained in Helsinki on OPUS data and then converted to PyTorch. If you search the Huggingface Hub for Marian, you will find MarianMT models other than those from Helsinki.
Currently, BERT base uncased clinical NER predicts the clinical entities (Problem, Test, Treatment).
I want to train on a different clinical dataset to get entities like (Disease, Medicine, Problem).
How can I achieve that?
Model
There are several models on Huggingface which are trained on medical-domain text; those will definitely perform better than plain bert-base-uncased. BioELECTRA is one of them, and it managed to outperform existing biomedical NLP models in several benchmark tests.
There are three different versions of those models, depending on their pretraining dataset, but I think these two will be the best to start with:
bioelectra-base-discriminator-pubmed: pretrained on PubMed
bioelectra-base-discriminator-pubmed-pmc: pretrained on PubMed and PMC
NER Datasets:
Now, coming to NER datasets: there are several datasets you might like, or you might want to create a composite dataset. Some of these are:
BC5-disease, NCBI-disease, BC5CDR-disease from BLUE benchmark
[Let me know if you need any help with model creation or setting up the fine-tuning. Also, please use proper metrics to evaluate them, and do share the metrics dashboard once it's finished.]
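As a starting point, here is a hedged sketch of loading a BioELECTRA checkpoint with a token-classification head. The Hub id and the label set below are assumptions for illustration; substitute the checkpoint and the entity tags your dataset actually uses:

```python
# Sketch: BioELECTRA + token-classification head for clinical NER.
# The Hub id and label list are assumptions - adapt to your dataset.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "kamalkraj/bioelectra-base-discriminator-pubmed"
labels = ["O", "B-Disease", "I-Disease", "B-Medicine", "I-Medicine",
          "B-Problem", "I-Problem"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
print(model.config.num_labels)
```

From here you would fine-tune on your tokenized, BIO-tagged dataset (e.g. with the Trainer API) and evaluate with entity-level precision/recall/F1.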
We are trying to understand the underlying model of Rasa (the forums there still didn't get us an answer) on two main questions:
1. We understand that the Rasa model is a transformer-based architecture. Was it pre-trained on any dataset (e.g. Wikipedia, etc.)?
2. If we understand correctly, intent classification is a fine-tuning task on top of that transformer. How come it works with such small training sets?
Appreciate any insights!
Thanks,
Lior
The transformer model is not pre-trained on any dataset. We use quite a shallow stack of transformer layers, which is not as data-hungry as the deeper stacks used in large pre-trained language models.
Having said that, there isn't an exact number of data points that will be sufficient for training your assistant, as it varies by domain and problem. Usually a good estimate is 30-40 examples per intent.
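For context, the depth of that transformer is exposed in the NLU pipeline configuration. A hedged sketch of a config.yml (the component names are real Rasa ones; the parameter values shown are approximately the documented defaults, adjust for your assistant):

```yaml
# config.yml (sketch) - DIETClassifier is the shallow transformer
# described above, trained from scratch on your NLU data.
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
    number_of_transformer_layers: 2   # shallow stack, not a deep LM
    transformer_size: 256
```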
Is there a pre-trained doc2vec model with a large data set, like Wikipedia or similar?
I don't know of any good one. There's one linked from this project, but:
it's based on a custom fork of an older gensim, so it won't load in recent code
it's not clear what parameters or data it was trained with, and the associated paper may have made uninformed choices about the effects of parameters
it doesn't appear to be the right size to include actual doc-vectors for either Wikipedia articles (4-million-plus) or article paragraphs (tens-of-millions), or a significant number of word-vectors, so it's unclear what's been discarded
While it takes a long time and a significant amount of working RAM, there is a Jupyter notebook included in gensim demonstrating the creation of a Doc2Vec model from Wikipedia:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
So, I would recommend fixing the mistakes in your attempt. (And, if you succeed in creating a model, and want to document it for others, you could upload it somewhere for others to re-use.)
Yes!
I could find two pre-trained doc2vec models at this link,
but I still could not find any pre-trained doc2vec model trained on tweets.
For gensim (1.0.1) doc2vec, I am trying to load the Google pre-trained word vectors instead of using Doc2Vec.build_vocab:
wordVec_google = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model0 = Doc2Vec(size=300, alpha=0.05, min_alpha=0.05, window=8, min_count=5, workers=4, dm=0, hs=1)
model0.wv = wordVec_google
##some other code
model0.build_vocab(sentences=allEmails, max_vocab_size = 20000)
But this model0 object cannot be further trained with labeled docs, and it can't infer vectors for documents.
Does anyone know how to use doc2vec with the Google pretrained word vectors?
I tried this post: http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/
but it does not work for loading into a gensim.models.Word2Vec object; perhaps it is a different gensim version.
The GoogleNews vectors are just raw vectors - not a full Word2Vec model.
Also, the gensim Doc2Vec class does not have general support for loading pretrained word-vectors. The Doc2Vec algorithm doesn't need pre-trained word-vectors – only some modes even use such vectors, and when they do, they're trained simultaneously as needed alongside the doc-vectors.
Specifically, the mode your code is using, dm=0, is the 'Paragraph Vectors' PV-DBOW mode, and it does not use word-vectors at all. So even if there were a function to load them, they'd be loaded and then ignored during training and inference. (You would need to use PV-DM (dm=1), or add skip-gram word-training to PV-DBOW (dm=0, dbow_words=1), for such reused vectors to have any relevance to your training.)
Why do you think you want/need to use pre-trained vectors? (Especially, a set of 3 million word-vectors, from another kind of data, when a later step suggests you only care about a vocabulary of 20,000 words?)
If for some reason you feel sure you want to initialize Doc2Vec with word-vectors from elsewhere, and you use a training mode where that would have some effect, you can look into the intersect_word2vec_format() method that gensim's Doc2Vec inherits from Word2Vec.
That method specifically needs to be called after build_vocab() has already learned the corpus-specific vocabulary, and it only brings in the words from the outside source that are locally relevant. It's at best an advanced, experimental feature – see its source code, doc-comments, and discussion on the gensim list to understand its side-effects and limitations.
I am planning to use the Google Prediction API for sentiment analysis. How can I generate the training model for this? Or where can I find a standard training model available for commercial use? I have already tried the Sentiment Predictor provided in the Prediction Gallery of the Google Prediction API, but it does not seem to work properly.
From my understanding, the "model" for the Google Prediction API is actually not a model, but a suite of models for regression as well as classification. That being said, it's not clear how the Prediction API decides what kind of regression or classification model is used when you present it with training data. You may want to look at how to train a model on the Google Prediction API if you haven't already done so.
If you're not happy with the results of the Prediction API, it might be an issue with your training data. You may want to think about adding more examples to the training file to see if the model comes up with better results. I don't know how many examples you used, but generally, the more you can add, the better.
However, if you want to look at creating one yourself, NLTK is a Python library that you can use to train your own model. Another Python library you can use is scikit-learn.
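As a minimal sketch of the scikit-learn route: a bag-of-words logistic-regression sentiment classifier on a tiny invented corpus (real use needs a proper labelled dataset, e.g. movie reviews):

```python
# Bag-of-words sentiment classifier sketch with scikit-learn.
# The six example texts are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, loved it", "terrible, waste of money",
         "absolutely fantastic", "awful experience",
         "loved the quality", "terrible support, awful"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["loved it, fantastic"])[0])
```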
Hope this helps.
The Google Prediction API is great, BUT to train a model you will need... a LOT of data.
You can use the sentiment model that is already trained.