Convention for creating a good data set for Rasa ner_crf - rasa-nlu

I am trying to create a dataset for training Rasa's ner_crf on a single entity type. What is the minimum number of sentences, and how much variation in sentence structure, needed for a good result? When my training data contains only one example of each possible sentence form, ner_crf does not give good results.

Rasa entity extraction depends heavily on the pipeline you have defined, and also on the language model and tokenizer, so make sure you use a good tokenizer. If you are handling normal English utterances, try using tokenizer_spacy before ner_crf. Also try ner_spacy.
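A sketch of such a pipeline in the old-style Rasa NLU config format (the component names assume a pre-1.0 Rasa NLU with the spaCy backend installed):

    language: "en"
    pipeline:
    - name: "nlp_spacy"                 # loads the spaCy language model
    - name: "tokenizer_spacy"           # spaCy tokenization, placed before ner_crf
    - name: "intent_featurizer_spacy"
    - name: "intent_classifier_sklearn"
    - name: "ner_crf"                   # CRF entity extractor (swap in ner_spacy to compare)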
In my experience, 5 to 10 variations of utterances for each case gave a decent result to start with.
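To illustrate what a "variation" means here, a hypothetical entity could be labeled in Rasa NLU's Markdown training format like this (the intent and the city entity are made up); each line phrases the same kind of request differently:

    ## intent:ask_weather
    - what's the weather in [London](city)
    - will it rain tomorrow in [Paris](city)
    - [Berlin](city) forecast for the weekend
    - how hot is it in [Madrid](city) right now
    - is it snowing in [Oslo](city) today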

Related

How to implement nested entities in new LUIS version: CLU

I've been trying to carry over some of my previous LUIS knowledge to the new Microsoft Language portal (Conversational Language Understanding), and I'm getting stuck on one thing we used to do so frequently and can no longer do: nested entities.
To clarify my question, suppose we have just asked the user to provide a desired price range for a product, and the user says: 'price range from $233 to $400'. With LUIS, we could create a machine-learning entity with two sub-entities, minimumValue and maximumValue, like this:
[image: nested entity in LUIS]
and when we train and test, we get a result like this:
[image: training and results in LUIS]
My question is: how can we implement something similar in CLU?
I have already tried the Quantity.NumberRange prebuilt entity, and it does not cover all the possible scenarios: I tested it with many different ways of phrasing a range, and it failed on many of them. I also tried combining the prebuilt entity with manually labeled (learned) training. When the prebuilt entity failed to find the minimum and maximum, the learned entity still matched, but I couldn't tell the minimum from the maximum, because there are no nested entities in CLU. I would really appreciate any help.
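One workaround in the spirit of what's described above, until CLU supports nested entities, is to let a single learned entity capture the whole range span and split it in post-processing. This is only a sketch of that idea; the function and regex below are illustrative, not part of CLU:

    import re

    def split_price_range(span: str):
        """Split a price-range span extracted by a learned entity into min/max."""
        numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", span)]
        if len(numbers) >= 2:
            return {"minimumValue": min(numbers), "maximumValue": max(numbers)}
        return None

    print(split_price_range("price range from $233 to $400"))
    # -> {'minimumValue': 233.0, 'maximumValue': 400.0}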

Dutch pre-trained model not working in gensim

When trying to load the fastText model (cc.nl.300.bin) in gensim I get the following error:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.bin.gz
!gunzip cc.nl.300.bin.gz
from gensim.models import FastText as FastText_gensim  # import implied by the alias below
model = FastText_gensim.load_fasttext_format('cc.nl.300.bin')
model.build_vocab(cleaned_text, update=True)
AttributeError: 'FastTextTrainables' object has no attribute 'syn1neg'
The code goes wrong when building the vocab with my own dataset. The format of that dataset is all right, as I already used it to build and train other (not pre-trained) Word2Vec and FastText models.
I saw others hit the same error in this thread; however, their solution did not work for me: https://github.com/RaRe-Technologies/gensim/issues/2588
Also, I read somewhere that I should use 'load_facebook_model', but I was not able to import load_facebook_model at all. Is this even a good way to solve this problem?
Any other suggestions?
Are you sure you're using the latest version of Gensim, 4.0.1, with many improvements to the FastText implementation?
And there you will definitely want to use .load_facebook_model() to load a full Facebook-format .bin model:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
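A minimal sketch of that approach, assuming gensim 4.x and that cleaned_text is a list of tokenized sentences as in the question; the continued-training step carries the caveats below:

    from gensim.models.fasttext import load_facebook_model

    # Load the full Facebook-format .bin model (vectors, subword info, training state).
    model = load_facebook_model('cc.nl.300.bin')

    # Vocabulary expansion plus continued training -- advanced & experimental (see below).
    model.build_vocab(cleaned_text, update=True)
    model.train(cleaned_text, total_examples=len(cleaned_text), epochs=model.epochs)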
But also note: post-training expansion of the vocabulary is best considered an advanced & experimental function. It may not offer any improvement on typical tasks; indeed, without careful consideration of the tradeoffs, and of balancing the influence of later training against earlier training, it can make things worse.
A FastText model trained on a large, diverse corpus may already be able to synthesize better-than-nothing guess vectors for out-of-vocabulary words, via its subword vectors.
If there's some data with very-different words & word-senses you need to integrate, it will often be better to re-train from scratch, using an equal combination of all desired text influences. Then you'll be doing things in a standard and balanced way, without harder-to-tune and harder-to-evaluate improvised changes to usual practice.

Google Cloud Natural Language API: adding your own classifier

I have been searching for a way to create a new entity in the Google Natural Language API, and found nothing. Can anybody explain how to create a new classifier so that if I pass a sentence, I can detect, say, 'python' as a programming language? Currently the API labels 'python' as 'other'.
I have also looked into the Cloud AutoML API for my solution and tried to create and train a model, but it was only able to do sentiment analysis, not entity detection. It was giving me a score rather than telling me that Java is a programming language.
Thanks in advance. Your help will be appreciated.
AutoML content classification classifies your data into the labels specified in the training set. It does not do entity detection, but what you need seems closer to content classification than to entity detection. My understanding from the description you provided is that you have content (words, phrases, or short sentences) and you want to classify it into labels (e.g. programmingLanguage). If you put together a good training set, the AutoML model should be able to do this.
The number it provides in eval is not sentiment; it's the probability of the predicted label. As you can see in the eval page you posted, it's telling you that Java is a programmingLanguage with a probability of 1 (so it's very certain about it).
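For example, with a trained AutoML text-classification model, a prediction call along these lines returns each label together with that probability (the project and model IDs here are placeholders; the pattern follows the google-cloud-automl Python client):

    from google.cloud import automl

    client = automl.PredictionServiceClient()
    # Placeholder project/model IDs -- substitute your own.
    model_name = automl.AutoMlClient.model_path("my-project", "us-central1", "TCN1234567890")

    payload = automl.ExamplePayload(
        text_snippet=automl.TextSnippet(content="Java", mime_type="text/plain")
    )
    response = client.predict(name=model_name, payload=payload)

    for annotation in response.payload:
        # display_name is the predicted label; score is its probability.
        print(annotation.display_name, annotation.classification.score)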

How to improve the accuracy of NER in StanfordCoreNLP?

I used the NER in StanfordCoreNLP to recognize entities, including organizations, locations, and persons, but something odd happens. For example, if I input a phrase like "Cleveland Cavaliers", it recognizes 'Cleveland' as a location rather than 'Cleveland Cavaliers' as an organization.
I am not very familiar with NER and I don't know how it works internally. My task is to extract all the company names in a text, and the results I have got so far are not satisfactory. Two ways of solving the problem occur to me: the first is to modify the dictionary and insert the correct data; the second is to train the model. But there are still some questions.
Will the first way work effectively?
If the answer to question 1 is yes, how do I modify the dictionary?
Furthermore, the FAQ at https://nlp.stanford.edu/software/crf-faq.shtml#a describes how to train the NER model, but what confuses me most is what I will get once I have trained my own model.
If I create a dataset containing entries like "organization 'Cleveland Cavaliers'" to train the model, what will happen in the model? Will the dictionary inside the CRFClassifier change?
Will the CRFClassifier correct the error and recognize 'Cleveland Cavaliers' as an organization entity when I input it?
These are all my puzzles, and I am preparing the dataset to try the second way. Can anybody answer the four questions above?
Thanks
I think the first solution is not very technical: every time you want to tag a new company, you need to update your dictionary.
I prefer your second solution; I have done this before and trained a new model to tag my sentences.
If you have a corpus that is big enough and tagged properly, training may take some time, but it is worth the effort.
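For reference, the training data the FAQ above expects is one token per line with a tab-separated label, and a blank line between sentences; the snippet and file names below are illustrative:

    Cleveland	ORGANIZATION
    Cavaliers	ORGANIZATION
    won	O
    the	O
    game	O

Training then runs with something like java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop company.prop, where the properties file points trainFile at this data and serializeTo at the output model file. Note that this produces a new serialized model with learned feature weights; it does not patch a dictionary inside the existing classifier.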

OpenNLP, Training Named Entity Recognition on unsupported languages: clarifications needed

I want to experiment with NER in a specific domain: extracting location names from travel offers written in Italian.
So far I have gathered that I need to prepare the training set myself, so I'm going to put the
<START:something> ... <END>
tags in some offers from my training set.
But looking at OpenNLP documentation on how to train for NER, I ended up in having a couple of questions:
1) When defining the START/END tags, am I free to use whatever name inside the tags (where I wrote "something" a few lines above), or is there a restricted set to be bound by?
2) I noticed that the call to the training tool
opennlp TokenNameFinderTrainer
takes a string representing the language as the first argument. What is that for? Considering I want to train a model for Italian, which is NOT supported, is there any additional task to be done before I can train for NER?
1) Yes, you are free to pick your own type names, and you can specify multiple types. If the training file contains multiple types, the created model will also be able to detect these multiple types.
2) I think the "lang" parameter has the same meaning/use as in the other commands (e.g. opennlp TokenizerTrainer -lang it ...).
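Putting both points together: a hypothetical Italian training file uses one sentence per line with whitespace-separated tokens and tags around the entity tokens (the sample offers below are made up):

    Offerta speciale per un weekend a <START:location> Venezia <END> .
    Volo e hotel inclusi per <START:location> Roma <END> e <START:location> Firenze <END> .

The corresponding trainer call would then look something like:

    opennlp TokenNameFinderTrainer -model it-ner-location.bin -lang it -data it-ner-location.train -encoding UTF-8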