I've been trying to carry over some of my LUIS knowledge to the new Microsoft Language portal (Conversational Language Understanding, CLU), and I'm stuck on one thing we used to do frequently in LUIS and can no longer do: nested entities.
To clarify my question, suppose we have just asked the user for a desired price range for a product, and the user says: 'price range from $233 to $400'. With LUIS, we could create a machine-learning entity with two sub-entities, minimumValue and maximumValue, like this:
[Image: nested entity in LUIS]
and when we train and test, we get a result like this:
[Image: training and results in LUIS]
My question is: how can we implement something similar in CLU?
I have already tried the Quantity.NumberRange prebuilt entity, and it does not cover all the possible scenarios: I tested it with many different ways of stating a range and it failed on many of them. I also tried combining the prebuilt component with manually labeled (learned) training data. When the prebuilt component failed to find the minimum and maximum, the learned component still detected the range, but I couldn't label the minimum and maximum separately because CLU has no nested entities. I would really appreciate any help.
I'm using h2o's xgboost implementation in Python. I've saved a model to disk and I'm trying to load it later for analysis and prediction. I want to access the list of input features or, even better, the list of features the model actually used, excluding the ones it decided not to use. The usual advice is to call the varimp function to get the variable importance. While this does drop features that the model doesn't use, it reports the importance of the intermediate features created by one-hot encoding (OHE) the categorical features, not the original categorical feature names.
I've searched for how to do this and so far I've found the following, but no concrete solution:
Someone asking something very similar to this and being told the feature has been requested in Jira
The Jira ticket in question, which has been marked resolved; as far as I can tell, it says the feature was implemented but is not customer-visible.
A similar ticket requesting this feature (original categorical feature importance) for variable importance heatmaps but it is still open.
Someone else who found an unofficial way to access the columns with model._model_json['output']['names'] but that doesn't give the features that weren't used by the model and they are told to use a different method that doesn't work if you have saved the model to disk and reloaded it (which I am doing).
The only option I see is to take the varimp feature names, split each one on the period character to undo the OHE naming, keep the first part of each split, and build a set of the results to get the unique column names. But I'm hoping there's a better way to do this.
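The workaround described above can be sketched as follows. This is only an illustration: the helper name original_columns and the sample feature names are made up, and it assumes the original column names contain no period themselves, since everything after the first period is discarded. With a loaded model, the input list would be the first element of each row returned by varimp.

```python
def original_columns(varimp_names):
    """Map OHE feature names such as 'color.red' back to 'color'.

    Assumes original column names contain no '.'; order of first
    appearance is preserved.
    """
    # dict.fromkeys de-duplicates while keeping insertion order
    return list(dict.fromkeys(name.split(".", 1)[0] for name in varimp_names))


# Hypothetical varimp feature names, for illustration only
sample = ["age", "color.red", "color.blue", "city.missing(NA)", "income"]
print(original_columns(sample))
```

Using dict.fromkeys instead of a plain set keeps the columns in importance order, which is usually more useful for analysis.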
What is the best-practice approach to handling typos and misspellings in LUIS list entities?
I have intents in LUIS which use a list entity (specifically Company Department: HR, Finance, etc.). Users often misspell the department name in their utterances. LUIS expects an exact match rather than a "smart" match, so it doesn't pick up the misspelled entity.
a) Using Bing Spell Check is not necessarily a good solution. For example, certain departments are acronyms such as VRPA, and Bing won't correct a typo there.
b) When I used LUIS a year ago, I would pre-process the utterance and use a Levenshtein distance algorithm to fix typos on list entities before feeding them to LUIS.
I would imagine that by now LUIS has some better out of the box way of handling this very common use case.
I'd appreciate input on what the best practice approach is to handle this.
@acambitsis and I exchanged messages via his UserVoice ticket, but I'm going to post the answer here for others.
A combination of Bing and Simple Entities might be what you're looking for, then (they're machine-learned).
I was able to accomplish something close and attached images.
In entities, I created a Simple entity with the role, VRPA. In intents, I created the Show Me intent and added sample utterances "Show me the VRPA" and "Show me the VPRA". I clicked on V**A and selected the Simple Entity:VRPA role. After training, I tried "show me the varp" and it correctly guessed "varp" was the "Simple:VRPA" entity.
You may also find RegEx entities useful. For acronyms, you could do something like /[vrpa]{4}/i, and then any four-letter combination such as VRPA/VPRA/VARP/ARVP would match. (A bare /[vrpa]/i would only match one character at a time.)
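To illustrate the regex idea: a character class on its own matches a single character, so a quantifier and word boundaries are needed to match a whole four-letter token. The sketch below uses Python's re module rather than LUIS itself, so the exact syntax LUIS accepts is an assumption, and the function name find_acronym_candidates is made up.

```python
import re

# Four letters drawn from v, r, p, a, as a whole word, case-insensitive.
# Note this also accepts strings like "papa" or "aaaa", so it is
# deliberately loose.
ACRONYM_PATTERN = re.compile(r"\b[vrpa]{4}\b", re.IGNORECASE)


def find_acronym_candidates(text):
    """Return every whole-word token that could be a scrambled VRPA."""
    return ACRONYM_PATTERN.findall(text)


print(find_acronym_candidates("Show me the VPRA and the varp report"))
# → ['VPRA', 'varp']
```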
I highly recommend reading through the Entity Types and Improve App Performance to see if anything jumps out to solve your particular issues.
This may not do exactly what you're looking for. If not, I'd recommend implementing a fuzzy-matching algo of your choice.
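As one possible shape for such a fuzzy-matching step, here is a minimal sketch that corrects tokens against the list-entity values before the utterance is sent to LUIS. The function names, the thresholds, and the sample vocabulary are all illustrative assumptions. Short terms such as HR are only matched exactly, because a one-edit tolerance would turn ordinary words like "he" into "HR". Plain Levenshtein distance also counts a swapped pair of letters as two edits, so "VPRA" is not corrected at max_distance=1; a Damerau-Levenshtein variant would handle that case.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def correct_utterance(utterance, vocabulary, max_distance=1, min_length=4):
    """Replace tokens that are within max_distance edits of a vocabulary term.

    Terms shorter than min_length are skipped to avoid false positives.
    Tokens are split on whitespace only, so punctuation is not handled.
    """
    corrected = []
    for token in utterance.split():
        replacement = token
        for term in vocabulary:
            if (len(term) >= min_length
                    and levenshtein(token.lower(), term.lower()) <= max_distance):
                replacement = term
                break
        corrected.append(replacement)
    return " ".join(corrected)


departments = ["HR", "Finance", "VRPA"]  # hypothetical list-entity values
print(correct_utterance("who is in Finanse", departments))
```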
[Image: entities]
[Image: intents]
I am trying to create a dataset for training Rasa's ner_crf on a single entity type. What is the minimum number of sentences, and how much variation in sentence structure, is needed for good results? With only one example of each possible sentence form, ner_crf does not give good results.
Rasa entity extraction depends heavily on the pipeline you have defined, and also on the language model and tokenizer, so make sure you use a good tokenizer. For ordinary English utterances, try using tokenizer_spacy before ner_crf. Also try ner_spacy.
In my experience, 5 to 10 variations of utterances for each case gave a decent result to start with.
I'm trying to set up LUIS to connect to my Azure WebApp Bot. My IT Director has asked me to test the bot on a "Simple" staff directory lookup (hosted on Azure SQL VMs).
I was trying to configure LUIS to understand intents such as 'Who is in Hospitality', or 'Who is Joe Bloggs', but I'm struggling with how to do this.
Do I use entities for departments and people? Are there Pre-Built Intents for 'Greetings' and other commonly used intents?
Any help would be appreciated.
You have several questions, so I have split my answer into two parts.
Information detection (department, names)
[I want to] understand intents such as 'Who is in Hospitality', or 'Who is Joe Bloggs', but I'm struggling with how to do this.
Do I use entities for departments and people?
Department:
If you have a limited, known list of departments, you can create an entity of type List. It performs an exact text match against the items of this list (see doc here).
If you don't have such a list, use an entity of type Simple (see doc here) and label this entity in several varied example utterances. In that case you can also improve detection by adding a Phrase list: it helps recognition without requiring an exact match against the list. You should keep improving it over time.
People:
Detecting people's names is a little trickier. You can have a look at the Communication.ContactName pre-built entity. If that doesn't work, create your own simple entity, but I'm not sure the results will be relevant.
"Small talk" part
Are there Pre-Built Intents for 'Greetings' and other commonly used intents?
There are no pre-built intents, but there is a Lab Project called Personality Chat that is designed to manage such cases (in English only for the moment): https://labs.cognitive.microsoft.com/en-us/project-personality-chat
It is still a lab version, so you should not use it in production, but it is mostly open-source, so you can give it a try and see if it fits your needs.
I have been searching for how to create a new entity in the Google Natural Language API and have found nothing. Can anybody help me create a new classifier so that, if I pass a sentence containing 'python', it is detected as a programming language? Currently the API labels 'python' as 'other'.
I have also looked into the Cloud AutoML API and tried to create and train a model, but it was only able to do sentiment analysis, not entity detection. It gave me a score rather than telling me that Java is a programming language.
Thanks in advance. Your help will be appreciated.
AutoML content classification classifies your data into the labels specified in the training set. It does not do entity detection. But it seems like what you need is closer to content classification than entity detection. My understanding from your description is that you have content (maybe words, phrases, or short sentences) and you want to classify it into some labels (e.g. programmingLanguage). If you put together a good training set, the AutoML model should be able to do this.
The number it provides in eval is not sentiment, it's the probability of the predicted label. As you can see in the eval page you posted, it's telling you that java is a programmingLanguage with probability of 1 (so, it's very certain about it).