What is the best practice approach to handle typos / misspelling on LUIS List Entities?
I have intents on LUIS which use a list entity (specifically Company Department: HR, Finance, etc). It is common for users to misspell this when putting forward their utterance. LUIS expects an exact match; it doesn't do a "smart" match, and therefore doesn't pick up the misspelled entity.
a) Using Bing Spell Check is not necessarily a good solution, e.g. certain departments are acronyms such as VRPA, and Bing won't correct a typo there.
b) When I used LUIS a year ago, I would pre-process the utterance and use a Levenshtein distance algorithm to fix typos on list entities before feeding them to LUIS.
I would imagine that by now LUIS has some better out of the box way of handling this very common use case.
I'd appreciate input on what the best practice approach is to handle this.
#acambitsis and I exchanged messages via his UserVoice ticket, but I'm going to post the answer here for others.
A combination of Bing Spell Check and Simple entities (which are machine-learned) might be what you're looking for, then.
I was able to accomplish something close and attached images.
In entities, I created a Simple entity with the role, VRPA. In intents, I created the Show Me intent and added sample utterances "Show me the VRPA" and "Show me the VPRA". I clicked on V**A and selected the Simple Entity:VRPA role. After training, I tried "show me the varp" and it correctly guessed "varp" was the "Simple:VRPA" entity.
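You may also find RegEx entities useful. For acronyms, you could do something like /\b[vrpa]{4}\b/i, and then any four-letter combination of those letters (VRPA/VPRA/VARP/ARVP) would match. Note that a bare /[vrpa]/i only matches a single one of those letters, so it would fire on almost any word.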
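If it helps, an acronym pattern can be sanity-checked locally before it goes into a regex entity. This is only a sketch using Python's re module (LUIS's own regex dialect may differ), and the names `loose`, `acronym` and `find_acronyms` are made up for illustration. It contrasts a bare character class, which matches one character, with a four-letter, word-bounded variant.

```python
import re

# A bare character class matches ONE character, so it hits almost any word.
loose = re.compile(r"[vrpa]", re.IGNORECASE)

# Exactly four letters drawn from the set, bounded on both sides.
# Caveat: repeated letters such as "papa" are accepted as well.
acronym = re.compile(r"\b[vrpa]{4}\b", re.IGNORECASE)

def find_acronyms(utterance):
    """Return every four-letter token built only from v, r, p, a."""
    return acronym.findall(utterance)

print(find_acronyms("show me the VPRA"))   # ['VPRA']
print(find_acronyms("show me the map"))    # []
print(bool(loose.search("help")))          # True: the 'p' alone is enough
```

Because the bounded pattern still accepts words like "papa", it is probably best treated as a candidate filter rather than a final match.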
I highly recommend reading through the Entity Types and Improve App Performance documentation pages to see if anything jumps out to solve your particular issues.
This may not do exactly what you're looking for. If not, I'd recommend implementing a fuzzy-matching algo of your choice.
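For what it's worth, here is a minimal sketch of the kind of fuzzy-matching pre-processing step the question describes: correct tokens against the list entity's values with Levenshtein distance before sending the utterance to LUIS. The department list, the function names and the thresholds (`max_distance`, `min_length`) are all assumptions for illustration, not anything LUIS provides.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical list-entity values.
DEPARTMENTS = ["HR", "Finance", "VRPA", "Marketing"]

def correct_utterance(utterance, vocabulary=DEPARTMENTS,
                      max_distance=1, min_length=4):
    """Replace tokens that are within max_distance of a known value."""
    corrected = []
    for token in utterance.split():
        # Skip very short tokens: they match short values far too easily.
        if len(token) >= min_length:
            best = min(vocabulary,
                       key=lambda v: levenshtein(token.lower(), v.lower()))
            if levenshtein(token.lower(), best.lower()) <= max_distance:
                token = best
        corrected.append(token)
    return " ".join(corrected)

print(correct_utterance("who runs Finanse"))  # who runs Finance
```

One thing to watch: plain Levenshtein counts a swapped pair of letters as two edits, so "VPRA" is at distance 2 from "VRPA" and would not be corrected at `max_distance=1`. Raising the threshold for four-letter values also pulls in ordinary words, so a transposition-aware variant (Damerau-Levenshtein) may suit acronyms better.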
I've gone through the documentation and tried to understand the Phrase List feature. Although I'm sure of the purpose of the Phrase List feature, I couldn't quite get the purpose of the "interchangeable" option intuitively.
Any thorough explanation would be appreciated.
#Srichakradhar at your suggestion, I'm posting my answer to your Gitter question here on StackOverflow as well, to benefit the community as a whole:
"...regarding your question on phrase lists, happy to speak high-levelly on what the feature does :)
So ultimately the goal with LUIS is to understand the meaning of the user’s input (utterance), and through calculations, it returns to you a score for how confident it is about the meaning of the input. Using phrase lists is one of the ways to improve the accuracy of determining the meaning of the user’s utterance. More specifically, adding features to a phrase list can put more weight on the score of an intent or entity.
Here are a couple of examples to illustrate, at a high level, how features help determine the intent/entity score and, in turn, predict the meaning of the user’s utterance:
For example, if I wanted to describe a class called Tablet, features I could use to describe it could include screen, size, battery, color, etc. If an utterance mentions any of the features, it’ll add points/weight to the score of predicting that the utterance’s meaning is describing Tablet. However, features that would be good to include in a phrase list are words that are maybe foreign, proprietary, or perhaps just rare. For example, maybe I would add “SurfacePro”, “iPad”, or “Wugz” (a made-up tablet brand) to the phrase list of Tablet. Then if a user’s utterance includes “Wugz”, more points/weight would be put onto predicting that Tablet is the right entity for the utterance.
Or maybe the intent is Book.Flight and features include “Book”, “Flight”, “Cairo”, “Seattle”, etc. If the utterance is “Book me a flight to Cairo”, points/weight would be added to the score of the Book.Flight intent for “Book”, “flight”, and “Cairo”.
Now, regarding interchangeable vs. non-interchangeable phrase lists.
Maybe I had a Cities phrase list that included “Seattle”, “Cairo”, “L.A.”, etc. I would make sure that the phrase list is non-interchangeable, because that indicates that yes, “Seattle” and “Cairo” are somehow similar to one another, but they are not synonyms: I can’t use one in place of the other. (“book flight to Cairo” is different from “book flight to Seattle”)
But if I had a phrase list of Coffee that included features “Coffee”, “Starbucks”, “Joe”, and marked the list as interchangeable, I’m specifying that the features in the list are interchangeable. (“I’d like a cup of coffee” means the same as “I’d like a cup of Joe”)
For more on Phrase Lists - Phrase List features in LUIS
For more on improving prediction - Tutorial: Add phrase list to improve predictions"
Taken from documentation (here):
A phrase list may be interchangeable or non-interchangeable. An
interchangeable phrase list is for values that are synonyms, and a
non-interchangeable phrase list is intended for values that aren't
synonyms but are similar in another way.
There is also a great reply here on MSDN:
Choose "Exchangeable" when the list of words or phrases in your feature
form a class or group -- for example, months like "January",
"February", "March"; or names like "John", "Mary", "Frank". These
features are "exchangeable" in the sense that an utterance where one
word/phrase appears would be labeled similarly if the word/phrase were
exchanged with another. For example, if "show the calendar for January" has the same intent as "show the calendar for February", this
suggests choosing "exchangeable".
Choose "Not exchangeable" for words/phrases that are useful in your
domain, but which do not form a class or group. For example, the
words "calendar", "email", "show", and "send" might be relevant to
your domain, but might all be associated with different intents, like
"show my calendar" or "send an email".
If you're not sure, you can try either and see if there's any
difference in performance.
I am using luis.ai, which is offered as part of Microsoft Cognitive Services, in my project. I need to detect names using LUIS, and for this I have been using the phrase list feature. I have added some names to the list. But as we all know, a list of names is never exhaustive. So, no matter how many names I add, since they don't follow a specific pattern, entity detection fails when I test with new names. I want to know if there's any other way to have LUIS detect names of people.
Please let me know if you have a solution to this problem.
LUIS can be used to recognize and extract intents and entities from utterances, but in my experience it will not reliably identify a person’s name, because a person’s name could be anything.
As you did, adding names that are not well recognized to a phrase list is one solution. Besides, the GitHub issue "Identifying the Names from the sentence using LUIS" discussed a similar question, and as cahann mentioned there, you can add and label more example utterances that contain poorly recognized names to make your LUIS app recognize names better.
I trained my LUIS model to recognize an intent called "getDefinition" with example utterances such as: "What does BLANK mean" or "Can you explain BLANK to me?". It recognizes the intent correctly. I also added an entity called "topic" and trained it to recognize what topic the user is asking about. The problem is that LUIS only recognizes the exact topic the user is asking about if I used that specific term in one of the utterances before.
Does this mean I have to train it with all the possible terms a user can ask about or is there some way to have it recognize it anyway?
For example when I ask "What does blockchain mean" it correctly identifies the entity (topic) as blockchain because the word blockchain is in the utterance. But if I ask the same version of the question about another topic such as "what does mining mean", it doesn't recognize that as the entity.
Using a list or phrase list doesn't seem to solve the problem. I want to eventually have thousands of topics the bot responds to, and entering each topic in a list is tedious and inconvenient. Is there a way LUIS can recognize that it's a topic just from the context?
What is the best way to go about this?
Same Doubt, Bit Modified. Sorry for Reposting this here.
At the moment LUIS cannot extract an entity based on the intent alone. Phrase lists will help LUIS extract tokens that don't have explicit training data. For example, training LUIS with the utterance "What does blockchain mean?" does not mean that it will extract "mining" from "What does mining mean?" unless "mining" was included in either a phrase list or a list entity. In addition to what Nicolas R said about tagging different values, another thing to consider is that words not commonly found (or not found at all) in the corpora that LUIS uses for each culture will likely not be extracted without assistance (via either a phrase list or a list entity).
For example, if you created a LUIS application that dealt with units of measurement, you might not be required to train it with units such as inch, meter, kilometer or ounce, but you would probably have to train it with words like milliradian and parsec, and even other cultural spellings like kilometre. Otherwise these words would most likely not be extracted by LUIS. If a user provided the tokens "Planck unit", LUIS might provide a faulty extraction where it returns "unit" as the measurement entity instead of "Planck unit".
Let's suppose it is a movie bot. I added an entity MovieName and a phrase list containing movies. One of the movie names is "Star Wars"; if a user misspells it as "Stra Wra", how can I tackle this? I'm also not sure whether the Bing spell check service will help with non-English movie names.
LUIS will not be able to capture misspelled entities by itself unless you provide examples containing the misspellings, which is not practical.
So you need to feed corrected utterances to LUIS.
As for the Bing spelling correction service, you will have to try it yourself, but I guess it will handle your case.
If there are common misspellings that you expect to recur, you could add them to an exchangeable phrase list feature. That will help with the prediction of these misspelled entities.
There are multiple ways to solve this:
Use synonyms covering the most common mistakes
Add another step to your pipeline (before going to LUIS) which matches user input against the possible options and corrects it (even a self-made solution would do great, but you can also try adding Elasticsearch with fuzzy queries)
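As a rough idea of what a self-made correction step could look like, the sketch below slides a window over the utterance and swaps in the closest known title using Python's standard difflib. The catalogue, the function name and the 0.65 cutoff are assumptions picked for illustration; they would need tuning against real data.

```python
import difflib

# Hypothetical catalogue of entity values.
MOVIES = ["Star Wars", "Star Trek", "Casablanca"]

def correct_phrases(utterance, catalogue=MOVIES, cutoff=0.65):
    """Replace word windows that closely resemble a catalogue entry."""
    lookup = {title.lower(): title for title in catalogue}
    # Try the longest window sizes first (sizes come from the catalogue).
    sizes = sorted({len(t.split()) for t in catalogue}, reverse=True)
    words = utterance.split()
    out, i = [], 0
    while i < len(words):
        for size in sizes:
            window = words[i:i + size]
            if len(window) < size:
                continue
            match = difflib.get_close_matches(
                " ".join(window).lower(), list(lookup), n=1, cutoff=cutoff)
            if match:
                out.append(lookup[match[0]])
                i += size
                break
        else:
            # No window starting here resembled a title; keep the word.
            out.append(words[i])
            i += 1
    return " ".join(out)

print(correct_phrases("play Stra Wra tonight"))  # play Star Wars tonight
```

The corrected string is what would then be sent to LUIS. A lower cutoff catches more typos but also starts rewriting ordinary words, which is the trade-off any such step has to manage.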
This is silly, but I haven't found this information. If you have names of concepts and suitable references, just let me know.
I'd like to understand how I should validate a given named id for a generic entity, like, say, an email login, just like Yahoo, Google and Microsoft do.
I mean... If you have a user named foo, trying to create foo2 will be denied, as it is likely to be someone trying to mislead users by using a fake id.
Coming to mind:
Levenshtein Distance
Hamming Distance
You're going to have to take a two pass approach.
The first is a regular expression to validate that the entity name meets your specifications as much as possible. For example, disallowing certain characters.
The second is to perform some type of fuzzy search during the name creation. This could be as simple as a LIKE '%value%' where clause or as complicated as using some type of full-text search and limiting hits to a certain relevance rating.
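To make the two passes concrete, here is a small sketch. It swaps the database-side search for an in-memory similarity check with Python's difflib, purely for illustration. The naming rule, the 0.8 cutoff and all function names are assumptions, not anything those providers are known to use.

```python
import difflib
import re

# Pass 1 (assumed rule): 3-20 chars, starts with a letter,
# then lowercase letters, digits, dots or underscores.
VALID_NAME = re.compile(r"[a-z][a-z0-9._]{2,19}")

def is_well_formed(name):
    return VALID_NAME.fullmatch(name) is not None

# Pass 2: look for existing names that are suspiciously similar.
def similar_existing(name, existing, cutoff=0.8):
    return difflib.get_close_matches(name, existing, n=3, cutoff=cutoff)

def can_register(name, existing):
    name = name.lower()
    return is_well_formed(name) and not similar_existing(name, existing)

users = ["foo", "alice"]
print(can_register("foo2", users))    # False: too close to "foo"
print(can_register("bob", users))     # True
print(can_register("9lives", users))  # False: fails the format rule
```

In a real system the second pass would run against the user store (a LIKE query or full-text search, as described above) rather than an in-memory list.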
That said, I would guess the failure rate (both false positives and false negatives) would be high enough to justify not doing this.
Good luck.