I am building a LUIS intent for the submission of blood pressure. Blood pressure consists of systolic and diastolic blood pressure. I am therefore trying to group the two into a compound entity for overall blood pressure. In spoken language, the word "over" is used to separate them. In written language, a forward slash (/) is used. I have tried to train LUIS on both formats with example utterances. But whereas it easily picks up the version with "over", it doesn't work with the slash. Instead it only picks up the entity before the slash, but not the one following it.
Good:
my bloodpressure is { [ $SystolicBPQuantity ] over [ $DiastolicBPQuantity ] } .
Bad:
bloodpressure { [ $SystolicBPQuantity ] / 90 } .
I have tried numerous times to mark the second entity as DiastolicBPQuantity. It even lets me do it, but then forgets about it again.
Thanks for any suggestions! I think as a last resort I might put the word "slash" into my utterance and have my input parser replace any forward slashes with "slash" before passing it on to LUIS.
I found the ModifyLuisRequest method in LuisDialog and am now using it to replace the forward slash with "over". Works like a charm.
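The substitution itself can be sketched independently of LuisDialog. The snippet below is only an illustration in JavaScript of the text replacement described above; the function name normalizeBloodPressure is hypothetical, and it assumes a slash between two numbers always means a blood pressure reading.

```javascript
// Hypothetical helper: rewrite "120/90" as "120 over 90" before the text reaches LUIS.
// Assumption: a slash between two runs of digits is always a blood pressure reading.
function normalizeBloodPressure(utterance) {
  return utterance.replace(/(\d+)\s*\/\s*(\d+)/g, "$1 over $2");
}

console.log(normalizeBloodPressure("bloodpressure 120/90."));
```

Because the pattern only looks at digits around the slash, inputs such as dates written with slashes would be rewritten too, so a real bot may need a narrower rule.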
- regex: regex features for intent classification
examples: |
- \bon road pric/i
- \bonroad pric/i
I have tested the above regexes and they work fine, so I am sure there is no issue with the regex expressions themselves.
Example:
training-row-1] Please tell me on road price now.
training-row-2] Please tell me price now.
Based on the above regex patterns, the regex features that should get added are:
training-row-1] Please tell me on road price now. ==> TRUE (because the regex matches)
training-row-2] Please tell me price now. ==> FALSE (the regex doesn't match)
My question is: in RegexFeaturizer, does the regex match happen on the whole sentence or on each token?
It makes sense to have it on the whole sentence.
Is the featurization I have assumed above correct?
I've found the following docstring in the code for the RegexFeaturizer.
"""
Given a sentence, returns a vector of {1,0} values indicating which
regexes did match. Furthermore, if the message is tokenized, the
function will mark all tokens with a dict relating the name of the
regex to whether it was matched.
"""
So I think it's taking the entire sentence as input. It's hard to see inside of the feature space in Rasa but I've confirmed that the correct entity is picked up across tokens when using the RegexEntityExtractor. This is easily verified by temporarily adding entity examples in your NLU data (make sure it appears at least twice in intents) and running rasa interactive.
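The difference between the two readings can be shown with a small sketch. This is not Rasa's implementation; it is a JavaScript illustration in which the trailing /i from the question is read as a case-insensitive flag (an assumption about the intent), and the names sentenceFeatures and tokenFeatures are hypothetical.

```javascript
// Patterns from the question, with the trailing "/i" assumed to mean case-insensitive.
const patterns = [/\bon road pric/i, /\bonroad pric/i];

// One 0/1 value per pattern, matched against the whole sentence.
function sentenceFeatures(sentence) {
  return patterns.map((p) => (p.test(sentence) ? 1 : 0));
}

// One 0/1 vector per whitespace-separated token, matched against each token alone.
function tokenFeatures(sentence) {
  return sentence
    .split(/\s+/)
    .map((token) => patterns.map((p) => (p.test(token) ? 1 : 0)));
}

console.log(sentenceFeatures("Please tell me on road price now."));
console.log(tokenFeatures("Please tell me on road price now."));
```

A pattern that contains a space, like the first one, can never match inside a single whitespace-delimited token, so only sentence-level matching gives the TRUE/FALSE features assumed in the question.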
I am building a German-language bot which should understand Swiss number formats:
English format for 1Mio: 1,000,000
German format for 1Mio: 1.000.000
Swiss format for 1Mio: 1'000'000
Unfortunately LUIS has no Swiss culture and will therefore not correctly understand 1'000'000 with the builtin number entity. So my idea is to pre-process the user utterances before forwarding them to LUIS as follows: if I see a Swiss thousand separator (i.e. ') with at least one digit on the left and three digits on the right, then remove the separator from the utterance before forwarding it to LUIS. LUIS will then recognize the number correctly because it no longer contains thousand separators.
Does anyone have an idea how to do this in the bot? Or better, in the middleware? I am new to BotFramework and pretty much lost.
Thanks!
Yes, you can modify the activity before you pass it to LUIS. You just need to come up with the appropriate regex to find and replace the '. For example, here's a bot where I'm updating this as part of the onTurn function, updated with a regex replace that I think will work for you (in nodejs):
async onTurn(context) {
    if (context.activity.type === ActivityTypes.Message) {
        // Strip Swiss thousand separators (1'000'000 -> 1000000) before LUIS sees the text.
        context.activity.text = context.activity.text.replace(/(?<=\d{1})'(?=\d{3})/g, '');
        const dc = await this.dialogs.createContext(context);
        const results = await this.luisRecognizer.recognize(context);
        // ... rest of the turn handler
    }
}
The regex here looks for the ' character preceded by one digit (it's OK if there is more than one, as in the middle of the number) and followed by three digits. You would probably be fine with just /'(?=\d{3})/g, which is a ' followed by three digits.
Same applies if you are using C# or a different turn handler, you just need to modify the activity.text before you pass it to LUIS.
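As a standalone sketch, the replacement can be pulled into a function and checked without a bot. The name stripSwissSeparators is hypothetical, and the pattern is slightly stricter than the one above: it assumes a separator is followed by exactly three digits, not three or more.

```javascript
// Hypothetical helper: remove ' only when it sits between a digit and a group of exactly three digits.
function stripSwissSeparators(text) {
  return text.replace(/(?<=\d)'(?=\d{3}(?!\d))/g, "");
}

console.log(stripSwissSeparators("Ich habe 1'000'000 Franken"));
console.log(stripSwissSeparators("it's 5 o'clock"));
```

Apostrophes in ordinary words are left alone because they are not surrounded by digits in this way. The lookbehind needs a reasonably recent Node.js version.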
In my LUIS application I have a 'Greeting' intent. The intent identified for 'hi' is 'Greeting' but for 'hi.......' some other intent is identified.
After training the 'hi.......' as 'Greeting' it gets identified as 'Greeting' correctly. There are some other variants too with special characters which need to be trained to make it work.
How do I get these variants identified as Greeting without training on every special character?
This is being used in Microsoft Bot Framework v3 in C#
You can either train your LUIS model with all possible variations that include special characters, or you can strip out all of the special characters before you send the text to LUIS. I would recommend the latter. Here is an example of how you would do that in Node. Note that this character class keeps only ASCII letters and spaces, so digits and accented letters are removed as well.
turnContext.activity.text = turnContext.activity.text.replace(/[^a-zA-Z ]/g, "");
Hope this helps!
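If digits or non-English letters matter to the model, a gentler variant of the same idea might look like the sketch below. The function name normalizeUtterance is hypothetical, and the choice to keep all Unicode letters and digits is an assumption, not something the answer above prescribes.

```javascript
// Hypothetical helper: drop punctuation and symbols, keep letters, digits and spaces.
function normalizeUtterance(text) {
  return text
    .replace(/[^\p{L}\p{N}\s]/gu, "") // remove everything that is not a letter, digit or whitespace
    .replace(/\s+/g, " ")             // collapse runs of whitespace
    .trim();
}

console.log(normalizeUtterance("hi......."));
```

Stripping punctuation this aggressively also removes characters inside email addresses or decimal numbers, so it may need exceptions if the LUIS app extracts such entities.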
I've been debugging a site to find the source of long page loading times, and I've narrowed it down to a regex that's used to extract URLs from text:
/(?:([\w+.-]+):\/\/|(?:www\.))[^\s<]+/g
This takes about 3 seconds to run on a large block of text. I found out that if I add the inverse of the first clause to the start of the regex ((?:[^\w+.-]|^)), it runs almost instantly:
/(?:[^\w+.-]|^)(?:([\w+.-]++):\/\/|(?:www\.))[^\s<]+/gx
It seems to me like the added clause shouldn't affect the regex at all, since nothing could cause that clause to fail (as those characters would be matched by the "[\w+.-]++" clause). Why does this make the regex run so much faster?
Edit
Some people have asked for an example of what I'm trying to do. To simplify things and to address the concerns people had in the comments, I'll be using the following two regexes:
# slow one
/(?:([\w+.-]+):\/\/|(?:www\.))[^\s<]+/g
# fast one
/[^\w+.-](?:([\w+.-]+):\/\/|(?:www\.))[^\s<]+/g
Fire up IRB/Pry and throw some text in a variable (this is a scrubbed version of what is actually searched against):
text = <<END_OF_TEXT
Unable to deliver message to email#example.com. Error message: request: <soap:Envelope xmlns:soap=";http://schemas.xmlsoap.org/soap/envelope/" xmlns:t=";http://schemas.microsoft.com/exchange/services/year/types" xmlns:m=";http://schemas.microsoft.com/exchange/services/year/messages"><soap:Header><t:RequestServerVersion Version="ExchangeYear"/></soap:Header><soap:Body><m:CreateItem MessageDisposition="SendAndSaveCopy"><m:SavedItemFolderId><t:DistinguishedFolderId Id="stuff"/></m:SavedItemFolderId><m:Items><t:Message><t:MimeContent>
END_OF_TEXT
Use the slow regex on it and note how slow it is:
text.gsub(/(?:([\w+.-]+):\/\/|(?:www\.))[^\s<]+/).to_a
Use the fast regex and note how fast it is:
text.gsub(/[^\w+.-](?:([\w+.-]+):\/\/|(?:www\.))[^\s<]+/).to_a
I figured out that this problem is specific to the type of data I used in the example (not a lot of spaces). If you run it against RFC 3986, which is much longer, both versions are equally fast.
The first pattern is slow because it starts with an alternation whose first branch is very permissive: it allows any number of word characters, plus signs, dots or hyphens. As a consequence, this alternation takes a lot of time (steps) before failing at each position.
The second pattern is faster because (?:[^\w+.-]|^), although it is also an alternation, works like a kind of anchor. It is tested quickly because the first branch matches only one character and the second is a zero-width assertion, so it takes far fewer steps to fail (in particular because it must then be followed by a word character, a dot or a hyphen, which is a restrictive condition).
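The effect can be approximated with a simplified cost model. The sketch below does not run a regex engine; it only counts how many characters the [\w+.-]+ branch would scan from each allowed start position, ignoring backtracking and any engine optimizations. The names runLength and scanCost are hypothetical.

```javascript
// Characters accepted by the permissive first branch of the alternation.
const CLASS = /[\w+.-]/;

// Length of the run of class characters starting at `start`.
function runLength(text, start) {
  let end = start;
  while (end < text.length && CLASS.test(text[end])) end++;
  return end - start;
}

// Total characters scanned. With `anchored`, a run may only begin at the start
// of the text or right after a character outside the class.
function scanCost(text, anchored) {
  let cost = 0;
  for (let i = 0; i < text.length; i++) {
    if (anchored && i > 0 && CLASS.test(text[i - 1])) continue;
    cost += runLength(text, i);
  }
  return cost;
}

const sample = "a".repeat(1000);
console.log(scanCost(sample, false), scanCost(sample, true));
```

In this model the cost grows quadratically with the length of a run when every position is a candidate start, and only linearly when the prefix rules out positions inside a run. That matches the observation that text with few spaces is where the slow pattern suffers.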
But you can write this pattern in a better way. Since you are looking for URLs, you can be more precise about the beginning: a URL can begin with, let's say, "http", "ftp", "sftp", "gopher" or "www" (feel free to add other schemes if needed).
So you can describe the start with:
(?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)
To limit the cost of the alternation (five branches to test at each position in the string), you can use two tricks:
You can use a word boundary to quickly skip positions that are not the start or the end of a word:
\b(?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)
You can add a lookahead with the first letter of each branch, to quickly skip unneeded positions in the string without testing all five branches:
\b(?=[fghsw])(?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)
So you can write a more efficient pattern like this:
/\b(?=[fghsw])(?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)[^\s<]+/
In short: a pattern is efficient when it fails fast at bad positions in the string.
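The word-boundary and lookahead pattern uses no Ruby-specific syntax, so it can be tried as-is in JavaScript. The sketch below is only a usage illustration; the function name extractUrls and the sample text are made up.

```javascript
// Same pattern as above, with the global flag so that all matches are returned.
const URL_PATTERN =
  /\b(?=[fghsw])(?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)[^\s<]+/g;

// Hypothetical helper: return every URL-like substring in the text.
function extractUrls(text) {
  return text.match(URL_PATTERN) || [];
}

console.log(extractUrls("see http://example.com and www.foo.org<br> for details"));
```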
Another possible design, which uses more memory and needs to check whether the capture group exists for each match, but is faster:
/[^ghsfw]*+(?:\B[ghsfw][^ghsfw]*)*+|\b((?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)[^\s<"&]+)/
(The idea is to divide the pattern into two main branches: the first describes everything you want to skip, and the second describes the URLs. The effect is quick jumps to key positions in the string.)
Note: when patterns start to get long, you can use free-spacing mode (also called comment mode) for readability and maintainability:
/(?x)
  \b (?=[fghsw])
  (?:
    https?:\/\/ |
    ftp:\/\/    |
    sftp:\/\/   |
    gopher:\/\/ |
    www\.
  )
  [^\s<]+/
or you can use a formatted string and a join as suggested by Cary Swoveland in comments.
I'm building an application that returns results based on a movie input from a user. If the user messes up and forgets to space out the title of the movie is there a way I can still take the input and return the correct data? For example "outofsight" will still be interpreted as "out of sight".
There is no regex that can do this in a good and reliable way. You could try a search server like Solr.
Alternatively, you could offer auto-complete in the GUI (if you have one) on the user's input, and this way mitigate some of the common errors users end up making.
Example:
User wants to search for "outofsight"
Starts typing "out"
Sees "out of sight" as suggestion
Selects "out of sight" from suggestions
????
PROFIT!!!
There's no regex that can tell you where the word breaks were supposed to be. For example, if the input is "offlight", is it supposed to return "Off Light" or "Of Flight"?
This is impossible without a dictionary and some kind of fuzzy-search algorithm. For the latter see How can I do fuzzy substring matching in Ruby?.
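The ambiguity is easy to reproduce with a tiny dictionary-based splitter. This is a sketch only: the four-word dictionary is made up for the example, the function name segment is hypothetical, and a real system would need a full word list plus some ranking of the candidates.

```javascript
// Return every way to split `input` into words from `dictionary` (a Set of strings).
function segment(input, dictionary) {
  if (input === "") return [[]]; // one way to split nothing: no words
  const results = [];
  for (let i = 1; i <= input.length; i++) {
    const head = input.slice(0, i);
    if (!dictionary.has(head)) continue;
    for (const tail of segment(input.slice(i), dictionary)) {
      results.push([head, ...tail]);
    }
  }
  return results;
}

const words = new Set(["of", "off", "light", "flight"]);
const splits = segment("offlight", words).map((parts) => parts.join(" "));
console.log(splits);
```

Both "of flight" and "off light" come back as valid splits, which is why the dictionary alone is not enough and the candidates still have to be compared against the known titles.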
You could take a string and put \s* in between each character.
So outofsight would be converted to:
o\s*u\s*t\s*o\s*f\s*s\s*i\s*g\s*h\s*t
... and match out of sight.
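A minimal sketch of that construction, assuming the whole title should match and case should be ignored. The names escapeRegExp and spacedPattern are hypothetical.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a pattern that allows optional whitespace between every pair of characters.
function spacedPattern(input) {
  const chars = Array.from(input.replace(/\s+/g, ""));
  return new RegExp("^" + chars.map(escapeRegExp).join("\\s*") + "$", "i");
}

console.log(spacedPattern("outofsight").test("Out of Sight"));
```

This handles missing spaces but not misspellings, and the pattern has to be tested against each stored title in turn.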
You can't do this with regular expressions, unless you want to store one or more patterns to match for each movie record. That would be silly.
A better approach for catching minor misspellings would be to calculate Levenshtein distances between what the user is typing and your movie titles. However, when your list of movies is large, this will become a rather slow operation, so you're better off using a dedicated search engine like Lucene/Solr that excels at this sort of thing.
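For a small list of titles, the Levenshtein approach can be sketched as follows. The names levenshtein and closestTitle and the sample titles are made up for the example; as noted above, a linear scan like this will not scale to a large catalogue.

```javascript
// Classic dynamic-programming edit distance (insertions, deletions, substitutions).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Return the title with the smallest case-insensitive distance to the query.
function closestTitle(query, titles) {
  let best = null;
  let bestDistance = Infinity;
  for (const title of titles) {
    const distance = levenshtein(query.toLowerCase(), title.toLowerCase());
    if (distance < bestDistance) {
      best = title;
      bestDistance = distance;
    }
  }
  return best;
}

console.log(closestTitle("outofsight", ["Out of Sight", "Insight", "Out of Africa"]));
```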