LUIS - proper nouns mixed with adjective(s) and partial entity matches

I have two hierarchical entities like this (simplified): "Order::OpenOrder, Order::AnyOrder, Job::OpenJob, Job::AnyJob" for a search application. I'm trying to train LUIS to correctly understand inputs like (a)"acme open orders", (b)"open acme orders", (c)"acme open jobs", (d)"open acme jobs" using utterances.
If I just use the two simplest utterances, "open orders" -> Order::OpenOrder and "open jobs" -> Job::OpenJob, then inputs (a) and (c) work fine. But input (b) finds Order::OpenOrder with the string "acme" included in the entity's character range, and input (d) is unable to resolve any entities.
Complicating things, it's also legal to input just "acme orders" or "acme jobs"; for those I've trained LUIS with utterances like "blah orders" and "blah jobs", where "orders" and "jobs" are mapped to Order::AnyOrder and Job::AnyJob, respectively. And then you can also input just "orders", "open Orders", etc.
Anyway, none of this is working consistently, and I'm wondering if I'm taking the wrong approach to training LUIS to understand adjective-noun pairs where proper nouns can appear between them. Has anyone else built a model like this who could share some advice?
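To make the setup concrete, here is the training data and the observed behaviour summarized as plain data (illustrative only; this is not the LUIS batch-labeling schema):

# Current training utterances and the entity each one is labeled with
labeled_utterances = [
    {"text": "open orders", "entity": "Order::OpenOrder"},
    {"text": "open jobs",   "entity": "Job::OpenJob"},
    {"text": "blah orders", "entity": "Order::AnyOrder"},
    {"text": "blah jobs",   "entity": "Job::AnyJob"},
]

# What happens for the four test inputs above
observed_results = {
    "acme open orders": "Order::OpenOrder resolved correctly",                        # (a)
    "open acme orders": "Order::OpenOrder, but 'acme' falls inside the entity span",  # (b)
    "acme open jobs":   "Job::OpenJob resolved correctly",                            # (c)
    "open acme jobs":   "no entity resolved",                                         # (d)
}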
Thanks,
-Erik

Related

How to execute search for FHIR patient with multiple given names?

We've implemented the $match operation for Patient, which takes a FHIR Parameters resource with the search criteria. How should this search work when the Patient resource in the parameters contains multiple given names? We don't see anything in FHIR that speaks to this. Our best guess is that we treat it as an OR when trying to match on given names in our system.
We do see that composite parameters can be used in the query string as AND or OR, but not sure how this equates when using the $match operation.
$match is intrinsically a 'fuzzy' search. Different servers will implement it differently. Many will allow for alternate spellings, common short names (e.g. 'Dick' for 'Richard'), etc. They may also allow for transposition of month and day and all sorts of similar data-entry errors. The 'closeness' of the match is reflected in the score the match is given. It's entirely possible to get back a match candidate that doesn't match any of the given names exactly if the score on other elements is high enough.
So technically, I think SEARCH works this way:
AND
/Patient?given=John&given=Jacob&given=Jingerheimer
The above is an AND clause: there is (can be) a person named with the multiple given names "John", "Jacob", "Jingerheimer".
Now, I realize SEARCH and MATCH are two different operations, but they are loosely related.
Patient matching is an "art". Be careful: a "false positive" (with a high "score") is, or could be, a very big deal.
But as mentioned by Lloyd, you have a little more flexibility with your implementation of $match.
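For reference, a minimal sketch of what a $match request with multiple given names could look like (the endpoint here is a placeholder; parameter names follow the standard Patient/$match operation, but check your server's documentation):

import requests

base_url = "https://example.org/fhir"  # placeholder FHIR endpoint

parameters = {
    "resourceType": "Parameters",
    "parameter": [
        {
            "name": "resource",
            "resource": {
                "resourceType": "Patient",
                "name": [{"family": "Kirk", "given": ["James", "Tiberious"]}],
            },
        },
        {"name": "onlyCertainMatches", "valueBoolean": False},
        {"name": "count", "valueInteger": 5},
    ],
}

response = requests.post(
    base_url + "/Patient/$match",
    json=parameters,
    headers={"Content-Type": "application/fhir+json"},
)
# Each entry in the returned Bundle carries a match score; whether multiple
# given names behave as AND or OR is up to the server's matching rules.
print(response.status_code, response.json())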
I have worked on two different "teams".
On one team, we never let "out the door" anything that was below an 80% match score. (How you determine a match score is a deeper discussion.)
On another team, we made $match work as "IF you give me enough information to find a SINGLE match, I'll give it to you"... but if not, tell people "not enough info to find a single match".
Patient matching is HARD. Do not let anyone tell you different.
At HIMSS and other events, when people show a demo of moving data, I always ask "how did you match this single person on this side... as being that person on the other side?"
As in, "without patient matching, a lot of workflows fall apart at the get-go".
Side note: I actually reported a bug (for SEARCH) with the MS-FHIR-Server, which the team fixed very quickly, here:
https://github.com/microsoft/fhir-server/issues/760
"name": [
{
"use": "official",
"family": "Kirk",
"given": [
"James",
"Tiberious"
]
},
Sidenote:
The Hapi-Fhir object to represent this is "ca.uhn.fhir.rest.param.TokenAndListParam"
Sidenote:
There is a feature request for Patient $match on the MS-FHIR-Server GitHub page:
https://github.com/microsoft/fhir-server/issues/943

Problems in LUIS

1) In patterns, LUIS does not let you have more than 3 alternatives with the OR operator; e.g. (a|b|c|d) is illegal (why?).
2) In patterns, is there any way to specify optional free text, something like "I want to [text] {entity}", so that the user can type whatever they like between "to" and {entity}? (See the sketch after this list.)
3) In patterns, I cannot make a word optionally plural; e.g. "How to contact the supplier[s]" doesn't work when the user types "How to contact the suppliers", and I had to add "suppliers" to my entities list, which I find inconvenient.
4) When you delete an intent, all of its utterances automatically go to the None intent. I think there should be an option: "Do you want to move the utterances to the None intent?"
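A rough sketch of the pattern variants in question (this is my reading of LUIS pattern syntax; whether each form is accepted should be verified against the current LUIS documentation):

# Illustrative LUIS patterns (assumed syntax; verify against the docs)
patterns = [
    "I want to (a|b) {entity}",        # 1) one possible workaround: split (a|b|c|d) into two patterns
    "I want to (c|d) {entity}",
    "I want to {AnyText} {entity}",    # 2) {AnyText} here stands for a Pattern.any entity (assumed)
    "How to contact the supplier[s]",  # 3) [s] marks the trailing "s" as optional text
]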

Search Console API: Impressions don't add up comparing totals to contains / not contains keywords

We are using the Search Console (Webmaster Tools) API to download search performance results for our site, to compare the performance of searches that use our company name against searches that don't. We have found a problem where the impressions don't add up when comparing "all search results" to "search results via specific keywords".
For example, if we do a report to show all web results for all devices for our site on a specific date, we get 189,491 impressions. If we then report to show results with the keyword "Our Name" we get 61,046. If we report on "OurName" (same keyword but without spaces) we get 1,086. If we then report not contains "Our Name" and not contains "OurName" we get 65,827, which adds up to 127,959, meaning somewhere we have 61,532 impressions missing.
Interestingly, if we change the filter on not contains to also include device equals DESKTOP, it increases to 65,997, yet I would have expected this to be equal to or less than all device impressions.
From the data we have, this seems to have stopped adding up on 27 November 2015 (before that date the three figures always added up to the total; on and after that date they don't). The impressions add up fine if we only do one contains and one not contains. Clicks always seem to add up correctly, so I'm wondering if one of these queries is excluding data with zero clicks?
We are using the .Net library to access the Search Console data, but we get the same results when using the API Explorer. It is hard to replicate using the Search Console itself, as it doesn't allow you to include multiple "not contains" keywords. The total figures and the contains "our name" / "ourname" figures match between the API and the Search Console.
I've found a few other posts on here where people are having similar problems, but they are dated over a year ago, and we've only just noticed the problem in the last 3 weeks, so I don't know if this is a new problem.
The query for the not contains is as follows:
POST https://www.googleapis.com/webmasters/v3/sites/{YOUR_SITE_URL}/searchAnalytics/query?fields=rows&key={YOUR_API_KEY}
{
  "startDate": "2015-12-07",
  "endDate": "2015-12-07",
  "searchType": "web",
  "dimensionFilterGroups": [
    {
      "filters": [
        {
          "dimension": "query",
          "expression": "our name",
          "operator": "notContains"
        },
        {
          "dimension": "query",
          "expression": "ourname",
          "operator": "notContains"
        }
      ]
    }
  ]
}
Many thanks in advance for any help
cross posted from Google Search Console Forum
From the API reference, there is no OR operation available for multiple filter expressions:
"Whether all filters in this group must return true ("and"), or one or more must return true (not yet supported)."
BOTH filters must pass for a row to be included in the total:
does not contain "our name" AND does not contain "ourname".
https://developers.google.com/webmaster-tools/v3/searchanalytics/query
Having said that, you're probably even more at a loss to explain some of your results... maybe you have a number of queries that contain both "our name" AND "ourname"?
I'm working on the same topic at the moment (excluding brand searches). As Google says, they exclude search queries that may contain private information:
To protect user privacy, Search Analytics doesn't show all data. For example, we might not track some queries that are made a very small number of times or those that contain personal or sensitive information.
https://support.google.com/webmasters/answer/6155685?hl=en#tablegone
With this in mind, you have a big block of data with no query information, so if you filter on the query in any way, that whole block isn't included.
For example, we had about 325,000 total impressions on 1 July, but if I run two separate queries, one including and one excluding a term, and add the clicks and impressions together, I get roughly the total for the block of data that my queries live in.
In our case that is around 180,000 impressions, so about 145,000 impressions came from queries I can't see and therefore can't filter.
In your case, the 127,959 could be your total of reported-query impressions (depending on your keywords). So your non-brand traffic of 65,827 impressions is more like 50% than 30%.
I hope it's more or less understandable.
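To put numbers on that, a quick sketch of the reconciliation using the figures from the question (the "anonymized" number is derived here; the API never reports it directly):

total_impressions = 189_491   # no query filter
contains_our_name = 61_046    # query contains "Our Name"
contains_ourname  = 1_086     # query contains "OurName"
not_contains_both = 65_827    # notContains "Our Name" AND notContains "OurName"

reported_query_impressions = contains_our_name + contains_ourname + not_contains_both
anonymized_impressions = total_impressions - reported_query_impressions

print(reported_query_impressions)  # 127959 - impressions that carry a reported query
print(anonymized_impressions)      # 61532  - hidden for privacy, dropped by any query filter

# Caveat: queries containing both "Our Name" and "OurName" would be counted twice
# in the sum above, so treat this as an approximation.
# Non-brand share: 65,827 / 127,959 is about 51%, versus 65,827 / 189,491 at about 35%.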

Google Place API street type list

I am using Google Places to retrieve addresses, and we want the street ("route" in Google terminology) to be separated into street name and street type. We also want the street type to match an existing column in our database.
But things get difficult because Google Places sometimes uses "XXXX Street" and sometimes "XXXX St".
For instance, this is a typical Google address:
{
  administrative_area_level_1: ['short_name', 'VIC'],
  locality: ['long_name', 'Carlton'],
  postal_code: ['long_name', '3053'],
  route: ['long_name', 'Canada Ln'],
  street_number: ['short_name', '12'],
  subpremise: ['short_name', '13']
}
But it always shows Canada Lane in the suggestion box.
And sometimes it's even worse, when the abbreviation does not match my local data model; for instance, we use "la" instead of "ln" as the abbreviation for lane.
It would be appreciated if anyone could tell me where to find the list of street types (and abbreviations) used by the Google API. Or is there a way to disable the abbreviation option?
Sounds like you're after "street suffixes". These are complicated.
Not only do they change across countries and languages; even within the same country and language they can be used in different ways. Abbreviations can have multiple meanings: "St" can be "Street" or "Saint", and abbreviations are used or not depending on subtle rules that also change from place to place.
The same goes for cardinal points (North, South, East, West) that are part of road / street names: "North St" or "N 11th Street"? It's complicated.
If you already have a good amount of addresses, and you only care about addresses in English, you could take the last word from each street name as the suffix. When matching to your own data, allow for abbreviations when matching, rather than trying to expand them.
For instance, don't try to expand "Canada La" into "Canada Lane" so that it matches "Lane". Instead, expand "Lane" into ["Lane", "La", "Ln"] and match suffixes to all values.
Then you'd need a strategy for "collisions", abbreviations that can mean 2+ suffixes. These seem to be rare; I can't remember any ("St" isn't one, because "Saint" isn't a suffix), and USPS's list at http://pe.usps.gov/text/pub28/28apc_002.htm doesn't seem to have any.
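As a rough illustration of that approach (the suffix lists here are examples only; a fuller set can be built from the USPS table above):

SUFFIXES = {
    "Lane":   {"lane", "la", "ln"},
    "Street": {"street", "st"},
    "Avenue": {"avenue", "ave", "av"},
}

def split_route(route):
    """Split a Google 'route' into (street name, canonical suffix or None)."""
    words = route.split()
    last = words[-1].rstrip(".").lower()
    for canonical, abbreviations in SUFFIXES.items():
        if last in abbreviations:
            return " ".join(words[:-1]), canonical
    return route, None  # no recognised suffix; keep the full route

print(split_route("Canada Ln"))    # ('Canada', 'Lane')
print(split_route("Canada La"))    # ('Canada', 'Lane')
print(split_route("Canada Lane"))  # ('Canada', 'Lane')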

tip needed to find semantic value of blocks inside strings

I have a problem, and although it sounds trivial, it's not simple (for me) to find a straightforward, scalable and performant solution. I have one text input where the website user can search for locations.
Today the location can be a city, an address in a city, or a neighborhood in a city, and the user must separate the address or neighborhood from the city with a comma; then it's easy for me to split the string and determine whether the first block is an address, a neighborhood or a city. If the user fails to fill in all the needed information, for example an address without a city, and I match more than one street with the same name, we show all the matching locations so the user can choose the correct one.
Using the search log we found out that most users don't use the comma, even with all the tooltips showing how to use the location search (thx google :p).
So, a new requirement for the location search is needed, to accept non comma separated addresses, like:
1. "5th Avenue"
2. "Manhattan"
3. "New York"
4. "5th Avenue Manhattan"
5. "5th Avenue Manhattan New York"
6. "Manhattan New York"
7. "5th Avenue New York"
But I can't find a way to determine the meaning of each block, or a dynamic way to make this work. I.e., if I get a string like "New York", "new" could be an address and "york" could be a city.
My question is: is there some kind of technique or framework to achieve what I need, or will I need to work out an algorithm (based on the number of words, commas, etc.) to do this specifically?
Edit 1:
Because I use SQL Server, I'm thinking about full-text search across multiple columns, doing an exact match first and a non-exact match later. But I think some incomplete addresses will return thousands of rows.
Isn't the key that specificity decreases from left to right? That is, the right-most semantic element (whether "New York" or "Manhattan") is always the least-specific (if it's a Borough, then we don't have to worry about City, if it's a Street, we don't have to worry about Borough, etc.)
So reverse the tokens and recurse through, seeking either a complete hit ("Manhattan") or a keyword ("Avenue", "Street", "New") that indicates either the beginning or end of a semantic element. So after a pass, you might have:
"5th Avenue" -> TOKEN STREET_END_TOKEN
"Manhattan" -> BOROUGH
"New York" -> COMPOUND_BEGIN_TOKEN TOKEN
"5th Avenue Manhattan" -> TOKEN STREET_END_TOKEN BOROUGH
"5th Avenue Manhattan New York" -> TOKEN STREET_END_TOKEN BOROUGH COMPOUND_BEGIN_TOKEN TOKEN
"Manhattan New York" -> BOROUGH COMPOUND_BEGIN_TOKEN TOKEN
"5th Avenue New York" -> TOKEN STREET_END_TOKEN COMPOUND_BEGIN_TOKEN TOKEN
Which ought to give you enough to pattern-match against.
UPDATE:
OK, to expand on the general strategy:
Step 1: Generate a pattern of the query structure by identifying keywords ("Manhattan") and semantically meaningful ("Street", "Avenue") or grammatically significant ("New", "Saint") tokens.
Step 2: Match the generated pattern against a set of templates: "* BOROUGH *" -> (Street) (Borough) (City), "* STREET_END_TOKEN" -> (Street name) (Street type), etc.
Step 3: The result of Step 2 ought to give you a sense of what kind of query you're dealing with. You'll have to apply domain rules at that point: if you know the complete query is TOKEN STREET_END_TOKEN, then you know "well, this is a query that just specifies a street" and you apply whatever rule is appropriate (grab the locale of their browser? use their query history to guess which neighborhood and city? etc.).
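A small sketch of Step 1 (and the shape of Step 2), assuming you maintain lists of known keywords; the names and token sets here are illustrative only:

STREET_END_TOKENS     = {"avenue", "ave", "street", "st", "road", "rd", "lane", "ln"}
COMPOUND_BEGIN_TOKENS = {"new", "saint"}
KNOWN_BOROUGHS        = {"manhattan", "brooklyn", "queens", "bronx"}

def query_pattern(query):
    """Turn a free-text location query into a token pattern (Step 1)."""
    pattern = []
    for word in query.lower().split():
        if word in KNOWN_BOROUGHS:
            pattern.append("BOROUGH")
        elif word in STREET_END_TOKENS:
            pattern.append("STREET_END_TOKEN")
        elif word in COMPOUND_BEGIN_TOKENS:
            pattern.append("COMPOUND_BEGIN_TOKEN")
        else:
            pattern.append("TOKEN")
    return " ".join(pattern)

print(query_pattern("5th Avenue Manhattan New York"))
# TOKEN STREET_END_TOKEN BOROUGH COMPOUND_BEGIN_TOKEN TOKEN

# Step 2 would then match this pattern string against templates such as
# "TOKEN STREET_END_TOKEN" -> (street) and
# "TOKEN STREET_END_TOKEN BOROUGH COMPOUND_BEGIN_TOKEN TOKEN" -> (street)(borough)(city).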
