What is the algorithm behind Rasa NLU?

I see that Rasa NLU uses MITIE and spaCy, but can anyone explain how it uses them and the algorithms behind them?

There is a post by Alan on the Rasa blog here that covers the basic approach used:
https://medium.com/rasa-blog/do-it-yourself-nlp-for-bot-developers-2e2da2817f3d
This should give a good idea of roughly what it's doing, but if you are keen to find out more, you can easily look over the actual code used (which is the great advantage of open-source solutions!): https://github.com/RasaHQ/rasa_nlu/tree/master/rasa_nlu

It depends on what kind of NER you want to use for your bot. Basically, you define a pipeline in your configuration file. spaCy is the most popular choice, since its models are updated regularly and widely used; MITIE does not perform as well as spaCy and is older.
language: "en"
pipeline: "spacy_sklearn"
You can read about this in more detail here:
choosing rasa nlu pipeline
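If it helps to see how such a config is consumed, here is a minimal sketch using the legacy rasa_nlu Python package (pre-1.0 API); the file names are placeholders for your own config and training data:

from rasa_nlu.training_data import load_data
from rasa_nlu.model import Trainer
from rasa_nlu import config

# "nlu_config.yml" stands in for the two-line config above;
# the spacy_sklearn pipeline also needs a spaCy model installed
training_data = load_data("data/nlu_examples.json")
trainer = Trainer(config.load("nlu_config.yml"))
interpreter = trainer.train(training_data)
print(interpreter.parse("hello there"))  # returns intent + entities for one utterance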

Related

Dutch pre-trained model not working in gensim

When trying to load the fastText model (cc.nl.300.bin) in gensim, I get the following error:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.bin.gz
!gunzip cc.nl.300.bin.gz
from gensim.models import FastText as FastText_gensim  # import implied by the alias below
model = FastText_gensim.load_fasttext_format('cc.nl.300.bin')
model.build_vocab(cleaned_text, update=True)  # cleaned_text: my own tokenized dataset
AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
The code fails when building the vocab with my own dataset. The format of that dataset is fine, as I have already used it to build and train other (not pre-trained) Word2Vec and FastText models.
I saw others hit the same error in this issue thread; however, their solution did not work for me: https://github.com/RaRe-Technologies/gensim/issues/2588
Also, I read somewhere that I should use 'load_facebook_model', but I was not able to import load_facebook_model at all. Is this even a good way to solve this problem?
Any other suggestions?
Are you sure you're using the latest version of Gensim, 4.0.1, which includes many improvements to the FastText implementation?
And there you will definitely want to use .load_facebook_model() to load a full .bin Facebook-format model:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
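For example, a minimal sketch, assuming gensim 4.x and the decompressed .bin file in the working directory:

from gensim.models.fasttext import load_facebook_model

model = load_facebook_model("cc.nl.300.bin")   # full Facebook-format .bin model
print(model.wv["fiets"][:5])                   # vector for an in-vocabulary word
print(model.wv["fietsenwinkeltje"][:5])        # OOV word, synthesized from subword n-grams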
But also note: the post-training expansion of the vocabulary is best considered an advanced & experimental function. It may not offer any improvement on typical tasks; indeed, without careful consideration of tradeoffs & balancing the influence of later training against earlier training, it can make things worse.
A FastText model trained on a large, diverse corpus may already be able to synthesize better-than-nothing guess vectors for out-of-vocabulary words, via its subword vectors.
If there's some data with very-different words & word-senses you need to integrate, it will often be better to re-train from scratch, using an equal combination of all desired text influences. Then you'll be doing things in a standard and balanced way, without harder-to-tune and harder-to-evaluate improvised changes to usual practice.
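As a rough sketch of that from-scratch approach with gensim 4.x (the corpus variables are hypothetical and assumed to be iterables of token lists):

from gensim.models import FastText

# general_sentences: text covering the original, general domain (assumed)
# cleaned_text: your new domain-specific sentences (assumed)
combined_corpus = list(general_sentences) + list(cleaned_text)

model = FastText(vector_size=300, window=5, min_count=5)
model.build_vocab(corpus_iterable=combined_corpus)
model.train(corpus_iterable=combined_corpus,
            total_examples=model.corpus_count,
            epochs=5)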

Convention for creating good data set for RASA NER_CRF

I am trying to create a dataset for training Rasa's ner_crf for one type of entity. Please let me know the minimum number of sentences/variations in sentence formation needed for a good result. When I have only one example of each possible sentence form, ner_crf does not give good results.
Rasa entity extraction depends heavily on the pipeline you have defined. It also depends on the language model and tokenizer, so make sure you use a good tokenizer. If you are working with normal English utterances, try using tokenizer_spacy before ner_crf. Also try ner_spacy.
In my experience, 5 to 10 variations of utterances for each case gave a decent result to start with.
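As a hypothetical illustration of those "5 to 10 variations", here is a sketch that generates examples in the old Rasa NLU JSON schema; the templates, intent, and entity names are made up:

# hypothetical templates: the same entity in several different sentence contexts
templates = [
    "book me a flight to {city}",
    "I want to fly to {city}",
    "any flights going to {city}?",
    "find a ticket to {city} please",
    "is there a plane to {city} tomorrow?",
]
cities = ["Amsterdam", "Berlin"]

examples = []
for template in templates:
    for city in cities:
        text = template.format(city=city)
        start = text.index(city)
        examples.append({
            "text": text,
            "intent": "book_flight",
            "entities": [{"start": start, "end": start + len(city),
                          "value": city, "entity": "city"}],
        })
# 'examples' would go under rasa_nlu_data -> common_examples in the training file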

Training Stanford CoreNLP co-reference

I would like to use the Stanford CoreNLP library to do coreference resolution in Dutch.
My question is: how do I train CoreNLP to handle Dutch coreference resolution?
We've already created a Dutch NER model based on the 'conll2002' set (https://github.com/WillemJan/Stanford_ner_bugreport/raw/master/dutch.gz), but we would also like to use the co-referencing module in the same way.
Look at the class edu.stanford.nlp.scoref.StatisticalCorefTrainer.
The appropriate properties file for English is in:
edu/stanford/nlp/scoref/properties/scoref-train-conll.properties
You may have to get the latest code base from GitHub:
https://github.com/stanfordnlp/CoreNLP
While we are not currently supporting training of the statistical coreference models in the toolkit, I do believe the code for training them is included and it is certainly possible it works right now. I have yet to verify if it is functioning properly.
Please let me know if you need any more assistance. If you encounter bugs I can try to fix them...we would definitely like to get the statistical coreference training operational for future releases!

In Stanford's NLP core API, how do I get a temporal expression range?

I want to use the Stanford NLP API to parse text and extract temporal expressions. The Core NLP package comes with SUTime, a library for recognizing and normalizing time expressions. Following the example on their site, I have easily found the expressions I want.
However, the online demo has a checkbox for 'include range', which is very useful to me. How can I pass this flag to the library API? I can't seem to find it in their documentation.
After combing through the Java NLP mailing list archives, I found this page which explains the issue. The way to pass options into the TimeAnnotator is to add properties, in this case:
// enable the demo's "include range" behavior when constructing the pipeline
props.setProperty("sutime.includeRange", "true");
I hope this helps someone in the future, maybe even myself :-)

Stemming - code examples or open source projects?

Stemming is something that's needed in tagging systems. I use Delicious, and I don't have time to manage and prune my tags. I'm a bit more careful with my blog, but it isn't perfect. I write software for embedded systems that would be much more functional (helpful to the user) if they included stemming.
For instance:
Parse
Parser
Parsing
Should all mean the same thing to whatever system I'm putting them into.
Ideally there's a BSD licensed stemmer somewhere, but if not, where do I look to learn the common algorithms and techniques for this?
Aside from BSD stemmers, what other open source licensed stemmers are out there?
-Adam
Snowball stemmer (C & Java)
I've used its Python binding, PyStemmer.
Check out the NLTK toolkit, written in Python. It has a very functional stemmer.
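For instance, a minimal sketch with NLTK's Porter stemmer (assumes nltk is installed); note that stems are truncated root forms, not necessarily dictionary words:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["parse", "parser", "parsing"]:
    print(word, "->", stemmer.stem(word))
# e.g. "parse" and "parsing" both reduce to the stem "pars"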
Another option for stemming would be WordNet, along with one of its APIs. Some basic information on stemming and lemmatization, including a description of the Porter stemming algorithm, can be found online in Introduction to Information Retrieval.
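To contrast lemmatization with stemming, a quick sketch using NLTK's WordNet interface (assumes the wordnet corpus has been downloaded):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time corpus download
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("parsing", pos="v"))  # -> "parse", an actual dictionary word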
Lucene has a stemmer built in, I believe (and IIRC it lets you use your own if you want).
EDIT: Just checked, and Lucene refers to the Snowball site, which is an open-source stemming library as far as I can tell.
