Specific question on improving LUIS speech recognition - azure-language-understanding

I am just testing LUIS with intents and entities and have a couple of things I need to fix.
When I say the word "patent", it is recognized as "patern". How can I improve this?
I am attempting to recognize individual letters, for example a, b, c, etc. I am Australian, so I pronounce "z" as "zed" and not "zee" (US). I set SpeechConfig.SpeechRecognitionLanguage to "en-AU"; however, when I say my "z" the recognized text is "zed". If I say "zee", the recognized text is correct with "z".
When I say the number "five four eight nine", it is recognized as "five 489", but I am expecting "5489". I have a regex entity pattern of [0-9]{4} because I want to know when a 4-digit number is spoken; however, this does not match when the recognizer returns "five 489". How am I supposed to fix this?
I want to be able to distinguish between "show me record 12345" and "show me the record". How should I do this? Should I have different intents, or one intent where I check whether an entity such as 12345 is present?

Related

Pre-process user utterances in bot before forwarding them to LUIS

I am building a bot in German that should understand Swiss number formats:
English format for 1 million: 1,000,000
German format for 1 million: 1.000.000
Swiss format for 1 million: 1'000'000
Unfortunately, LUIS has no Swiss culture and therefore does not correctly understand 1'000'000 with the built-in number entity. So my idea is to pre-process the user utterances before forwarding them to LUIS as follows: if I see a Swiss thousand separator (i.e. ') with at least one digit on the left and three digits on the right, remove it from the utterance. LUIS should then recognize the number correctly because it no longer contains thousand separators.
Does anyone have an idea how to do this in the bot, or better, in middleware? I am new to Bot Framework and pretty much lost.
Thanks!
Yes, you can modify the activity before you pass it to LUIS. You just need an appropriate regex to find and remove the '. For example, here's a bot where I do this as part of the onTurn function, with a regex replace that I think will work for you (in Node.js):
async onTurn(context) {
    if (context.activity.type === ActivityTypes.Message) {
        // Remove every ' that sits between a digit and a run of three digits
        context.activity.text = context.activity.text.replace(/(?<=\d)'(?=\d{3})/g, '');
        const dc = await this.dialogs.createContext(context);
        const results = await this.luisRecognizer.recognize(context);
        // ... rest of the turn handling (omitted in the original snippet)
    }
}
The regex looks for a ' character preceded by a digit (more than one is fine, as in the middle of a number) and followed by three digits. You would probably be OK with just /'(?=\d{3})/g, which is a ' followed by three digits.
The same applies if you are using C# or a different turn handler: you just need to modify activity.text before you pass it to LUIS.
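To try the replacement outside the bot, it can be pulled into a small standalone function. This is only a minimal sketch, assuming a Node.js version with regex lookbehind support; the name stripSwissThousandSeparators is made up for illustration and is not part of the Bot Framework SDK.

```javascript
// Hypothetical helper, not part of any SDK: removes Swiss thousand separators.
function stripSwissThousandSeparators(text) {
    // Drop a ' only when a digit precedes it and three digits follow it,
    // so apostrophes in ordinary words are left alone.
    return text.replace(/(?<=\d)'(?=\d{3})/g, '');
}

console.log(stripSwissThousandSeparators("Ich brauche 1'000'000 Franken"));
console.log(stripSwissThousandSeparators("Wie geht's?"));
```

The lookbehind is what keeps a leading quote such as '123 untouched; the shorter /'(?=\d{3})/g variant would strip that quote as well.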

ServiceNow "Greater than or is", "Less than or is", and "Between" String compare operators

In a condition-type field, or in something like a list filter, if I select a string-type field, the list of available operators includes three compare operators whose function I do not understand when it comes to strings. I was hoping someone could help me understand them:
Less than or is
Greater than or is
Between
Also, what does "Matches Pattern" refer to? I at least understand the other three in other contexts. Is it something to do with regex, which I am familiar with?
Here is a screenshot so that you can see exactly what I am talking about, just to make sure I am being clear with what I am asking. Thank you in advance.
String operator screenshot
String operators
With strings, "less than" and "greater than" compare by alphabetical order: the letter "B" is greater than "A", and so on. The more letters you provide, the more you narrow the results.
Having records:
Albania
Andorra
Armenia
Austria
Azerbaijan
Belarus
Belgium
Bosnia & Herzegovina
Bulgaria
When you filter with greater than or is "Belg", you will receive the following result:
Belgium
Bosnia & Herzegovina
Bulgaria
When you filter with greater than or is "Belg" AND less than or is "Bo", you will see:
Belgium
Bosnia & Herzegovina
You will get the same results using the between operator.
Matches Pattern
Matches Pattern is described fairly well in the documentation; see the doc link for details.
In short, a "pattern" is like a regex but only has the asterisk (*) and the question mark (?). An asterisk means any number (including zero) of any characters, whereas a question mark means exactly one character.
the story matches the story but not that story.
*story matches the story and that story, but not that story is the best.
st?ry matches story and stxry, but not my story or stairy.
*b?gus* matches bogus, my bogus story, and His bagus machine, but not my bgus story or my baigus story.
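Since the asker is familiar with regex, one way to reason about these wildcard rules is to translate a pattern into an equivalent anchored regular expression. The sketch below is only an illustration of the rules as described above, not ServiceNow's implementation; the helper name wildcardToRegExp is invented here, and it assumes matching is case-sensitive and must cover the whole string.

```javascript
// Hypothetical helper: turns a "*" / "?" wildcard pattern into an anchored RegExp.
// Assumption: case-sensitive, whole-string matching.
function wildcardToRegExp(pattern) {
    // Escape regex metacharacters other than * and ? so they match literally.
    const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
    // * -> any run of characters (possibly empty); ? -> exactly one character.
    const body = escaped.replace(/\*/g, '.*').replace(/\?/g, '.');
    return new RegExp('^' + body + '$', 's');
}

const matches = (pattern, text) => wildcardToRegExp(pattern).test(text);

console.log(matches('st?ry', 'stxry'));            // true
console.log(matches('*story', 'that story'));      // true
console.log(matches('*b?gus*', 'my bgus story'));  // false
```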

Where is the ruby alphabet with a visualization of edge cases?

I want to see a printout of the character layout in Ruby (sorry, I don't know the correct term for what I am asking).
It is something like
abcdefghijklmnopqrstuvwxyzaabbcc..zz
then it goes to
ABCDE....ZAABBCC
etc
It sounds like you want to generate a pattern similar to how most spreadsheet software (such as Excel) names its columns.
This question asks exactly that:
How to convert a column number (eg. 127) into an excel column (eg. AA)
It has some great answers you can check out.
What makes your question different from the excel pattern is that you want to account for lower case letters as well.
Modifying any of those solutions to account for 52 letters instead of just 26 should be trivial.
You can print from a to zz with:
('a'..'zz').to_a.join
=> 'abc ... xyzaaabac ... zxzyzz'
And from A to ZZ with:
('A'..'ZZ').to_a.join
=> "ABC ... XYZAAABAC ... ZXZYZZ"
Also interesting is converting numeric codes to characters:
(0..255).each {|e| p e.chr}

Phone regex is not completely working

In my country, phone numbers follow a format like (XX)XXXX-XXXX. But not everyone enters phone numbers in text inputs according to that pattern: some people follow it, and some don't. I'd like to write a regex that catches all the possible cases. So far it looks like this:
/^[\(]?\d{2}?[\)]?\d{4}[. -]?\d{4}$/
And I prepared some test cases to check the regex:
# GOOD PHONES #
8432115262
843211 5262
843211.5262
843211-5262
32115262
3211.5262
3211 5262
3211-5262
(84)32115262
(84)3211.5262
(84)3211 5262
(84)3211-5262
# BAD PHONES #
!##$%*()
()32115262
()1231 3213
()1231.3213
()1231-3213
().3213
()-3213
()3213.
()3213-
3211-5a62
sakdiihbnmwlzi
Unfortunately, the bad case ()32115262 gets past the regex, although it is clear why: the part [\(]?\d{2}?[\)]? is responsible for the mistake. From left to right, you can enter zero or one (, then zero or two digits, then zero or one ).
I'd like that part to work like this: if you enter (, you must also enter two digits and ); otherwise you can enter zero or two digits. Is something like this, or something with similar semantics, possible in the regex world?
Thanks in advance
Something like this perhaps:
/^(?:\(\d{2}\)|\d{2}?)\d{4}[. -]?\d{4}$/
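I used a non-capturing group (?: ... ) and alternation to provide two possible options for the first part of the phone number.
Either it is \(\d{2}\), which means parentheses around exactly two digits, or it is \d{2}?, which (as in your original pattern) is meant as two digits or an empty string. Be aware that this reading depends on the regex engine: in JavaScript, PCRE, and Python, \d{2}? is a lazy \d{2} that still requires exactly two digits, so the portable spelling is (?:\d{2})?.
Combine these two options with | (which means OR) and you get the first part of the regex above: (?:\(\d{2}\)|\d{2}?)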
It seemed to work for all your test cases!
Try this: ^(?:\(\d\d\)|\d\d)?\d{4}[. -]?\d{4}$
If the input has ( and ), the pattern must also match the two digits inside them; otherwise the two leading digits are optional.
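To check a candidate pattern against the whole list of good and bad phones at once, a small harness helps. The sketch below runs the last regex above in JavaScript; the variable names are arbitrary, and the test data is copied from the question.

```javascript
// Regex from the answer above; the optional group (...)? is portable across engines.
const PHONE = /^(?:\(\d\d\)|\d\d)?\d{4}[. -]?\d{4}$/;

const good = [
    '8432115262', '843211 5262', '843211.5262', '843211-5262',
    '32115262', '3211.5262', '3211 5262', '3211-5262',
    '(84)32115262', '(84)3211.5262', '(84)3211 5262', '(84)3211-5262',
];
const bad = [
    '!##$%*()', '()32115262', '()1231 3213', '()1231.3213', '()1231-3213',
    '().3213', '()-3213', '()3213.', '()3213-', '3211-5a62', 'sakdiihbnmwlzi',
];

// Collect every case whose result disagrees with its expected category.
const failures = [
    ...good.filter((p) => !PHONE.test(p)),
    ...bad.filter((p) => PHONE.test(p)),
];
console.log(failures.length === 0 ? 'all test cases behave as expected' : failures);
```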

Algorithm for re-wrapping hard-wrapped text?

Let's say that I have written a custom e-mail management application for the company that I work for. It reads e-mails from the company's support account and stores cleaned-up, plain text versions of them in a database, doing other neat things like associating it with customer accounts and orders in the process. When an employee replies to a message, my program generates an e-mail that is sent to the customer with a formatted version of the discussion thread. If the customer responds, the app looks for a unique number in the subject line to read the incoming message, strip out the previous discussion, and add it as a new item in the thread. For example:
This is a message from Contoso customer service.
Recently, you requested customer support. Below is a summary of your
request and our reply.
--------------------------------------------------------------------
Contoso (Fred) on Tuesday, December 30, 2008 at 9:04 a.m.
--------------------------------------------------------------------
John:
I've modified your address. You can confirm my work by logging into
"Your Account" on our Web site. Your order should ship out today.
Thanks for shopping at Contoso.
--------------------------------------------------------------------
You on Tuesday, December 30, 2008 at 8:03 a.m.
--------------------------------------------------------------------
Oops, I entered my address incorrectly. Can you change it to
Fred Smith
123 Main St
Anytown, VA 12345
Thanks!
--
Fred Smith
Contoso Product Lover
Generally, this all works great, but there's one area that I've kind of been putting off cleaning up for a while now, and it deals with text wrapping. In order to generate the pretty e-mail format like the one above, I need to re-wrap the text that the customer originally sent.
I've written an algorithm that does this (though looking at the code, I'm not entirely sure how it works anymore; it could use some refactoring). But it can't distinguish between a hard-wrap newline, an "end of paragraph" newline, and a "semantic" newline. For example, a hard-wrap newline is one that the e-mail client inserted within a paragraph to wrap a long line of text, say, at 79 columns. An end-of-paragraph newline is one that the user added after the last sentence in a paragraph. And a semantic newline would be something like the br tag, such as in the address that Fred typed above.
My algorithm instead only sees two newlines in a row as indicating a new paragraph, so it would make the customer's e-mail be formatted something like the following:
Oops, I entered my address incorrectly. Can you change it to
Fred Smith 123 Main St Anytown, VA 12345
Thanks!
-- Fred Smith Contoso Product Lover
Whenever I try to write a version that would re-wrap this text as intended, I basically hit a wall in that I need to know the semantics of the text, the difference between a "hard-wrap" newline and a "I really meant it like a br"-type newline, such as in the customer's address. (I use two newlines in a row to determine when to start a new paragraph, which coincides with how the majority of people seem to actually type e-mails.)
Anyone have an algorithm that can re-wrap the text as intended? Or is this implementation "good enough" when weighing the complexity of any given solution?
Thanks.
You could try to check if a newline has been inserted to keep the line length below a maximum (aka hard wrap): Just check for the longest line in the text. Then, for any given line, you append the first word of the following line to it. If the resulting line exceeds the maximum length, the line break probably was a hard wrap.
Even simpler you might just consider all breaks in (maxlength - 15) <= length <= maxlength as being hardwraps (with 15 just being an educated guess). This would certainly filter out intentional breaks as in addresses and stuff, and any missed break in this range wouldn't influence the result too badly.
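The first check described above (would the next line's first word still have fit?) can be sketched as follows. This is only an illustration in JavaScript, not code from any of the answers; the function name isLikelyHardWrap and the sample lines are made up.

```javascript
// Hypothetical helper: guesses whether the break after `line` is a hard wrap.
// If the first word of the next line would not have fit within maxLineLength,
// the break was probably inserted by a wrapping algorithm rather than the user.
function isLikelyHardWrap(line, nextLine, maxLineLength) {
    const firstWord = nextLine.trim().split(/\s+/)[0];
    if (firstWord === '') return false; // blank next line: paragraph break
    return line.length + 1 + firstWord.length > maxLineLength;
}

const lines = [
    'Please update the shipping address on my',
    'latest order to the following:',
    'Fred Smith',
    '123 Main St',
];
// Use the longest line in the text as the assumed wrap width.
const maxLineLength = Math.max(...lines.map((l) => l.length));

for (let i = 0; i < lines.length - 1; i++) {
    console.log(JSON.stringify(lines[i]), isLikelyHardWrap(lines[i], lines[i + 1], maxLineLength));
}
```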
I have two suggestions, as follows.
Pay attention to punctuation: this will help you to distinguish between a "hard-wrap" newline and an "end of paragraph" newline (because, if the line ends with a full stop, it's more likely that the user intended it to be an end of paragraph).
Pay attention to whether a line is much shorter than the maximum line length: in the example above, you might have text that's being "hard-wrapped" at 79 characters, plus you have address lines which are only 30 characters long; because 30 is much less than 79, you know that the address lines were broken by the user and not by the e-mail client's text-wrap algorithm.
Also, pay attention to indents: lines which are indented with whitespace from the left may be supposed to be new paragraphs, broken from the previous lines, as they are on this forum.
Following Ole's advice above, I re-worked my implementation to look at a threshold. It seems to handle most scenarios I throw at it well enough without me having to go nuts and write code that actually understands the English language.
Basically, I first scan through the input string and record the longest line length in the variable inputMaxLineLength. Then, as I'm rewrapping, if I encounter a newline at a column between 85% of inputMaxLineLength and inputMaxLineLength, I replace that newline with a space because I think it's a hard-wrap newline, unless it's immediately followed by another newline, because then I assume that it's just a one-line paragraph that happens to fall within that range. This can happen if someone types out a short bulleted list, for example.
Certainly not perfect, but "good enough" for my scenario, considering the text is usually half-mangled by a previous e-mail client to begin with.
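The threshold idea described above can be sketched roughly as follows. This is a JavaScript approximation written for illustration, not the linked C# source; the names unwrapHardBreaks and ratio are invented here, and only the 85% figure comes from the description.

```javascript
// Hypothetical sketch of the threshold heuristic (not the original C# code).
function unwrapHardBreaks(text, ratio = 0.85) {
    const lines = text.split('\n');
    // Longest line in the input, i.e. inputMaxLineLength in the description.
    const inputMaxLineLength = Math.max(...lines.map((l) => l.length));
    const threshold = inputMaxLineLength * ratio;

    let out = '';
    lines.forEach((line, i) => {
        out += line;
        if (i === lines.length - 1) return; // no newline after the last line
        const followedByBlankLine = lines[i + 1] === '';
        // A break near the maximum width is treated as a hard wrap, unless the
        // next line is blank (then it is a one-line paragraph).
        const isHardWrap = line.length > 0 && line.length >= threshold && !followedByBlankLine;
        out += isHardWrap ? ' ' : '\n';
    });
    return out;
}

const sample = [
    'a'.repeat(40) + ' ' + 'b'.repeat(30), // long line: treated as hard-wrapped
    'c'.repeat(10),                        // short lines keep their newlines
    'Fred Smith',
    '123 Main St',
    '',
    'Thanks!',
].join('\n');

console.log(unwrapHardBreaks(sample));
```

The result would still need to be re-wrapped to the target width afterwards; this sketch only decides which newlines to keep.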
Here's some code, my a-few-hours-old implementation that probably still underwraps in a few edge cases (using C#). It's a lot less complicated than my previous solution, which is nice.
Source Code
And here's some unit tests that exercise that code (using MSTest):
Test Code
If anyone has a better implementation (and no doubt a better implementation exists), I'll be happy to read your thoughts! Thanks.
