tip needed to find semantic value of blocks inside strings - algorithm

I have a problem that sounds trivial, but I can't find a straightforward, scalable and performant solution. I have one text input where the website user can search for locations.
Today the location can be a city, an address in a city or a neighborhood in a city, and the user must separate the address or the neighborhood from the city with a comma. That makes it easy for me to split the string and work out whether the first block is an address, a neighborhood or a city. If the user leaves out needed information, for example entering an address without a city, and I match more than one street with the same name, we show all the matching locations so the user can choose the correct one.
From the search log we found that most users don't use the comma, even with all the tooltips explaining how to use the location search (thx google :p).
So there is a new requirement for the location search: it must accept addresses without comma separators, like:
1. "5th Avenue"
2. "Manhattan"
3. "New York"
4. "5th Avenue Manhattan"
5. "5th Avenue Manhattan New York"
6. "Manhattan New York"
7. "5th Avenue New York"
But I can't find a way to work out the meaning of each block, or a dynamic way to make this work. E.g., if I get a string like "New York", "new" could be an address and "york" could be a city.
My question: is there some technique or framework that does what I need, or will I have to write my own algorithm (based on the number of words, commas, etc.) specifically for this?
Edit1:
Because I use SQL Server, I'm thinking about a full-text search across multiple columns, trying an exact match first and a non-exact match afterwards. But I think some incomplete addresses will return thousands of rows.

Isn't the key that specificity decreases from left to right? That is, the right-most semantic element (whether "New York" or "Manhattan") is always the least-specific (if it's a Borough, then we don't have to worry about City, if it's a Street, we don't have to worry about Borough, etc.)
So reverse the tokens and recurse through, seeking either a complete hit ("Manhattan") or a keyword ("Avenue", "Street", "New") that indicates either the beginning or end of a semantic element. So after a pass, you might have:
"5th Avenue" -> TOKEN STREET_END_TOKEN
"Manhattan" -> BOROUGH
"New York" -> COMPOUND_BEGIN_TOKEN TOKEN
"5th Avenue Manhattan" -> TOKEN STREET_END_TOKEN BOROUGH
"5th Avenue Manhattan New York" -> TOKEN STREET_END_TOKEN BOROUGH COMPOUND_BEGIN_TOKEN TOKEN
"Manhattan New York" -> BOROUGH COMPOUND_BEGIN_TOKEN TOKEN
"5th Avenue New York" -> TOKEN STREET_END_TOKEN COMPOUND_BEGIN_TOKEN TOKEN
Which ought to give you enough to pattern-match against.
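A minimal sketch of that tagging pass, in Python. The keyword sets below are tiny illustrative assumptions, not a real gazetteer; in practice they would come from the location database. Each tag here depends only on the word itself, so the sketch does not model the right-to-left scan, which only matters once a rule depends on what has already been recognised.

```python
# Illustrative keyword sets (assumptions for this sketch, not real data).
BOROUGHS = {"manhattan", "brooklyn", "queens"}
STREET_END = {"avenue", "street", "road"}
COMPOUND_BEGIN = {"new", "saint", "san"}


def tag_tokens(query):
    """Return one tag per whitespace-separated word of the query."""
    tags = []
    for word in query.lower().split():
        if word in BOROUGHS:
            tags.append("BOROUGH")
        elif word in STREET_END:
            tags.append("STREET_END_TOKEN")
        elif word in COMPOUND_BEGIN:
            tags.append("COMPOUND_BEGIN_TOKEN")
        else:
            tags.append("TOKEN")  # anything not recognised as a keyword
    return tags


print(" ".join(tag_tokens("5th Avenue Manhattan New York")))
```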
UPDATE:
OK, to expand on the general strategy:
Step 1: Generate a pattern of the query structure by identifying keywords ("Manhattan"), and semantically meaningful ("Street", "Avenue") or grammatically significant ("New", "Saint") tokens.
Step 2: Match the generated pattern against a set of templates -- "* BOROUGH *" -> "(Street) (BOROUGH) (City)", "* STREET_END_TOKEN" -> "(Street name) (Street type)", etc.
Step 3: The result of Step 2 ought to give you a sense of what kind of query you're dealing with. You'll have to apply domain rules at that point. If you know the complete query is TOKEN STREET_END_TOKEN, then you know "Well, this is a query that just specifies a street" and you have to apply whatever rule is appropriate (grab the locale of their browser? Use their query history to guess which neighborhood and city? etc.).
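Step 2 could be sketched like this, assuming Step 1 has already produced a list of tags. The templates and the labels they map to are hypothetical; a real set would be tuned against the search log.

```python
import re

# Hypothetical templates: a regex over the space-joined tags -> kind of query.
TEMPLATES = [
    (r"(TOKEN )+STREET_END_TOKEN", "street only"),
    (r"BOROUGH", "borough only"),
    (r"COMPOUND_BEGIN_TOKEN TOKEN", "city only"),
    (r"(TOKEN )+STREET_END_TOKEN BOROUGH", "street + borough"),
    (r"BOROUGH COMPOUND_BEGIN_TOKEN TOKEN", "borough + city"),
    (r"(TOKEN )+STREET_END_TOKEN COMPOUND_BEGIN_TOKEN TOKEN", "street + city"),
    (r"(TOKEN )+STREET_END_TOKEN BOROUGH COMPOUND_BEGIN_TOKEN TOKEN",
     "street + borough + city"),
]


def classify(tags):
    """Return the label of the first template matching the whole tag pattern."""
    pattern = " ".join(tags)
    for regex, kind in TEMPLATES:
        if re.fullmatch(regex, pattern):
            return kind
    return "unknown"  # no template matched: fall back to domain rules


print(classify(["TOKEN", "STREET_END_TOKEN", "BOROUGH"]))
```

Queries that come back as "unknown", or as "street only", are the ones where the domain rules of Step 3 have to take over.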


How do I use a for loop with the options I gave under the print command?

I know how to use a for loop and an if statement separately, but I am stuck when using them together.
print(''' a. mumbai
\n b.delhi
\n c. chennai''')
city=(input('enter your city:'))
for options in city:
    print('choose you budget:')
    print('''a.2k
\n b. 5k
\n c. 8k''')
else:
    print('no hotels found')
I wanted it to show me the budget menu when I enter one of the options I gave under 'city', and to print 'no hotels found' when I enter anything else, then ask for the city again until the input matches one of the given options. Instead, whatever I enter, it shows me the budget and then shows 'no hotels found'.
Can a for loop not be used with something written inside print, or am I doing it all wrong?
You need to follow this logic flow for what you want to achieve:
1. Display the city options
2. Get the user’s city option input
3. Display the budget options
4. Get the user’s budget option input
5. Check if there are any offerings in the selected city (matching the value retrieved in Step 2) with the selected budget (matching the value retrieved in Step 4)
6. Display any results from the matches of Step 5, or display a message stating that no matches were found.
From my understanding you’re missing Step 4: you’re skipping the retrieval of the user’s budget option input.
You’re also missing the if statement (Step 5). The else in your code is attached to the for loop, so it executes once the loop finishes.
You need to write the code which searches for hotels matching the chosen city and budget. You can store the result of this match, then write an if statement on it to display the hotels which are in the chosen city and offer bookings matching the chosen budget, followed by an else statement displaying “No hotels found”.
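To see why the original code prints the budget menu and then 'no hotels found' for any input, here is a small sketch of how for/else behaves in Python. Iterating over a string visits its characters one by one, and the else block of a for loop runs whenever the loop ends without hitting break. The function name loop_events is only illustrative.

```python
def loop_events(text):
    """Record what a for/else over a string does when there is no break."""
    events = []
    for ch in text:  # iterates over the characters, not over menu options
        events.append("body:" + ch)
    else:  # runs because the loop finished without `break`
        events.append("else")
    return events


print(loop_events("ab"))
```

So the loop body runs once per typed character, and the else branch always follows.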
Here’s a fix to your code which was posted in your original post:
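cities = {'a': 'mumbai', 'b': 'delhi', 'c': 'chennai'}
budgets = {'a': '2k', 'b': '5k', 'c': '8k'}
# Placeholder data, (city, budget) -> hotel names. Replace with your own.
hotels = {('mumbai', '2k'): ['Example Inn']}

# Steps 1-2: keep asking until the city is one of the options
while True:
    print('a. mumbai\nb. delhi\nc. chennai')
    city = input('enter your city: ').strip().lower()
    if city in cities:
        break
    print('no hotels found')

# Steps 3-4: show the budget options and read the choice
print('a. 2k\nb. 5k\nc. 8k')
budget = input('choose your budget: ').strip().lower()

# Steps 5-6: look up hotels matching both choices and show the result
matches = hotels.get((cities[city], budgets.get(budget)), [])
if matches:
    for hotel in matches:
        print(hotel)
else:
    print('no hotels found')
The hotels dictionary above is only placeholder data. You still need to supply the real list of hotels for each city and budget.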

How to execute search for FHIR patient with multiple given names?

We've implemented the $match operation for patient that takes FHIR parameters with the search criteria. How should this search work when the patient resource in the parameters contains multiple given names? We don't see anything in FHIR that speaks to this. Our best guess is that we treat it as an OR when trying to match on given names in our system.
We do see that composite parameters can be used in the query string as AND or OR, but not sure how this equates when using the $match operation.
$match is intrinsically a 'fuzzy' search. Different servers will implement it differently. Many will allow for alternate spellings, common short names (e.g. 'Dick' for 'Richard'), etc. They may also allow for transposition of month and day and all sorts of similar data entry errors. The 'closeness' of the match is reflected in the score the match is given. It's entirely possible to get back a match candidate that doesn't match any of the given names exactly if the score on other elements is high enough.
So technically, I think SEARCH works this way:
AND
/Patient?given=John&given=Jacob&given=Jingerheimer
The above is an AND clause. There can be a person with multiple given names: "John", "Jacob", "Jingerheimer".
Now, I realize SEARCH and MATCH are two different operations, but they are loosely related.
Patient matching is an "art". Be careful: a "false positive" (with a high "score") could be a very big deal.
But as Lloyd mentioned, you have a little more flexibility in your implementation of $match.
I have worked on two different "teams".
On one team, we never let anything "out the door" that was below an 80% match score. (How you determine a match score is a deeper discussion.)
On the other team, we made $match work as "IF you give me enough information to find a SINGLE match, I'll give it to you"; if not, we told people "not enough info to match a single patient".
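The two policies could be sketched roughly as below. This is only an illustration: the candidates are assumed to be (patient id, score) pairs with scores between 0 and 1 that some matching step has already computed, and the function names and ids are made up.

```python
def above_threshold(candidates, threshold=0.80):
    """Policy 1: release only candidates whose score meets the threshold."""
    return [c for c in candidates if c[1] >= threshold]


def single_match_or_none(candidates):
    """Policy 2: return a candidate only if exactly one was found."""
    if len(candidates) == 1:
        return candidates[0]
    return None  # caller reports "not enough info to match a single patient"


# Hypothetical scored candidates
scored = [("pat-1", 0.93), ("pat-2", 0.61)]
print(above_threshold(scored))
print(single_match_or_none(scored))
```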
Patient matching is HARD. Do not let anyone tell you different.
At HIMSS and other events, when people show a demo of moving data, I always ask: "How did you match this single person on this side as being that person on the other side?"
As in: without patient matching, a lot of workflows fall apart at the get-go.
Side note, I actually reported a bug with the MS-FHIR-Server (which the team fixed very quickly) (for SEARCH) here:
https://github.com/microsoft/fhir-server/issues/760
"name": [
    {
        "use": "official",
        "family": "Kirk",
        "given": [
            "James",
            "Tiberious"
        ]
    },
Sidenote:
The Hapi-Fhir object to represent this is "ca.uhn.fhir.rest.param.TokenAndListParam"
Sidenote:
There is a feature request for Patient Match on the Ms-Fhir-Server github page:
https://github.com/microsoft/fhir-server/issues/943

Algorithm for translating MLB play-by-play records into descriptive text

I'm trying to collect a dataset that could be used for automatically generating baseball articles.
I have play-by-play records of MLB games from retrosheet.org that I would like to be written out to plain text, as those that could possibly appear as part of a recap news article.
Here are some examples of the play-by-play records:
play,2,0,semim001,32,.CBFFFBBX,9/F
play,2,0,phegj001,01,FX,S7/G
play,2,0,martn003,01,CX,3/G
play,2,1,youne003,00,,NP
The following is what I would like to achieve:
For the first example,
play,2,0,semim001,32,.CBFFFBBX,9/F
I want it to be written out as something like:
"semim001 (Marcus Semien) was on three balls and two strikes in the second inning as the away player. He hit the ball into play after one called strike, one ball, three fouls, and another two balls. The fly ball was caught by the right fielder."
The plays are formatted in the following way:
The first field is the inning, an integer starting at 1.
The second field is either 0 (for visiting team) or 1 (for home team).
The third field is the Retrosheet player id of the player at the plate.
The fourth field is the count on the batter when this particular event (play) occurred. Most Retrosheet games do not have this information, and in such cases, "??" appears in this field.
The fifth field is of variable length and contains all pitches to this batter in this plate appearance and is described below. If pitches are unknown, this field is left empty, nothing is between the commas.
The sixth field describes the play or event that occurred.
Explanations for all the symbols in the fifth and sixth field can be found on this Retrosheet page.
With Python 3, I've been able to format all the info of invariable length into a formatted sentence, which is all but the last two fields. I'm having difficulty in thinking of an efficient way to unparse (correct me if this is the wrong term to use here) the fifth and sixth fields, the pitches and the events that occurred, due to their variable length and wide variety of things that can occur.
I think I could write out all the rules based on the info on the Retrosheet website, but I'm looking for suggestions for a smarter way to do this. I wrote natural language processing as tags, hoping this could be a trivial problem in that field. Any pointers will be greatly appreciated!
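As a starting point for the pitches field, a table-driven approach might look like the sketch below. It groups consecutive identical pitch codes so that "FFF" becomes "three fouls". The code table is deliberately partial (the codes from the example above, plus S for a swinging strike); the full list is on the Retrosheet page, and anything not in the table is simply skipped here.

```python
from itertools import groupby

# Partial code table (assumption: extend it from the Retrosheet documentation).
PITCH_NAMES = {
    "B": "ball",
    "C": "called strike",
    "F": "foul",
    "S": "swinging strike",
    "X": "ball put into play",
}


def describe_pitches(pitches):
    """Turn a pitch string into a list of (count, description) runs."""
    runs = []
    for code, group in groupby(pitches):
        if code not in PITCH_NAMES:
            continue  # skip markers such as "." and codes not in the table
        runs.append((len(list(group)), PITCH_NAMES[code]))
    return runs


print(describe_pitches(".CBFFFBBX"))
```

The resulting runs can then be fed into the same sentence templates used for the fixed-length fields.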

Possible algorithms to solve this problem

I have a list of names extracted for one hotel, taken from n websites about the same hotel. The list contains m names for that one hotel. I have to select one name from the list based on correctness, similarity and fewest mistakes. How can I achieve this?
Any direction is helpful.
Example: List of names for hotelId 1 {"ABC Hotel", "CDE hotel", "Hotel ABC", "AB Hotel", "Hotel BCA" ...}
From my initial research it looks like a graph-related problem.
This is not going to work. You will not get reliable similarities based on the names alone, especially if almost every hotel has the keyword "hotel" in its name.
You need more information to match similarities.
Address, geolocation and attributes of the hotel could also help (wifi, parking, close to beach, pool), as could whether it belongs to a chain, and so on. The more information you have, the better the matching result you can get.
You can try to leverage some of the Bing or Google APIs, i.e. search for the hotel name plus some details from the address in the Search APIs or in some Map APIs (e.g. search for ["ABC Hotel 5AV Philadelphia", "CDE hotel 5AV Philadelphia", "Hotel ABC 5AV Philadelphia", ...]), then compare your data with the API response.
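If names really are all you have, one crude baseline is to pick the name that is most similar, on average, to all the others. The sketch below does this with difflib from the Python standard library. The normalisation step (lower-casing and sorting the words) is an assumption for this example, and as noted above, name similarity alone may not be reliable.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case and sort the words, so 'Hotel ABC' equals 'ABC Hotel'."""
    return " ".join(sorted(name.lower().split()))


def most_representative(names):
    """Return the name with the highest total similarity to all names."""
    keys = [normalize(n) for n in names]

    def total_similarity(i):
        return sum(SequenceMatcher(None, keys[i], k).ratio() for k in keys)

    best = max(range(len(names)), key=total_similarity)
    return names[best]  # ties go to the earliest name in the list


print(most_representative(["ABC Hotel", "Hotel ABC", "AB Hotel"]))
```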

Google Place API street type list

I am using Google Place to retrieve addresses, and we want the street ("route" in Google terminology) to be separated into street name and street type. We also want the street type to match an existing column in our database.
But things get difficult because Google Place sometimes uses "XXXX Street" and sometimes "XXXX St".
For instance, this is a typical google address
{
    administrative_area_level_1: ['short_name', 'VIC'],
    locality: ['long_name', 'Carlton'],
    postal_code: ['long_name', '3053'],
    route: ['long_name', 'Canada Ln'],
    street_number: ['short_name', '12'],
    subpremise: ['short_name', '13']
}
But it always shows "Canada Lane" in the suggestion box.
It gets worse when the abbreviation does not match my local data model. For instance, we use "la" rather than "ln" as the short form of "lane".
I would appreciate it if anyone could tell me where to find a list of the street types (and abbreviations) used by the Google API. Or is there a way to disable the abbreviation?
Sounds like you're after "street suffixes". These are complicated.
Not only do they change across countries and languages; even within the same country and language they can be used in different ways. Abbreviations can have multiple meanings: "St" can be "Street" or "Saint". And whether an abbreviation is used at all depends on subtle rules that also change from place to place.
The same goes for cardinal points (North, South, East, West) that are part of road or street names: "North St" or "N 11th Street"? It's complicated.
If you already have a good number of addresses, and you only care about addresses in English, you could take the last word of each street name as the suffix. When matching against your own data, allow for abbreviations rather than trying to expand them.
For instance, don't try to expand "Canada La" into "Canada Lane" so that it matches "Lane". Instead, expand "Lane" into ["Lane", "La", "Ln"] and match suffixes against all of those values.
Then you'd need a strategy for "collisions": abbreviations that can mean two or more suffixes. These seem to be rare; I can't remember any ("St" isn't one, because "Saint" isn't a suffix) and USPS' http://pe.usps.gov/text/pub28/28apc_002.htm doesn't seem to have any.
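A minimal sketch of that matching approach in Python. The variants table is a small illustrative assumption; it would need to be extended from your own data and a list such as the USPS one.

```python
# Canonical street type -> accepted spellings (illustrative, not complete).
SUFFIX_VARIANTS = {
    "Lane": ["lane", "la", "ln"],
    "Street": ["street", "st"],
    "Avenue": ["avenue", "ave", "av"],
}

# Reverse lookup: any accepted spelling -> canonical street type.
LOOKUP = {v: canon for canon, variants in SUFFIX_VARIANTS.items() for v in variants}


def split_route(route):
    """Split 'Canada Ln' into ('Canada', 'Lane'); type is None if unrecognised."""
    route = route.strip()
    name, _, last = route.rpartition(" ")
    canon = LOOKUP.get(last.lower().rstrip("."))
    if canon is None or not name:
        return route, None  # no known suffix, keep the route as the name
    return name, canon


print(split_route("Canada Ln"))
print(split_route("Canada La"))
```

Building LOOKUP from the variants table also makes collisions easy to detect: a spelling listed under two canonical types would silently overwrite the earlier entry, so a check for duplicates while building the dictionary would surface them.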
