For example, let's say I have text like this:
LOCATION, Text goes on
LOCATION U.S. investigators goes on
LOCATION/LOCATION First Last's goes on
LOCATION (AB) -- The Stack Overflow goes on
WHITE PLAINS, N.Y. (AB) -- Text goes on
PUEBLO, Colo. (AB) -- Text goes on
How would I make an algorithm to determine the boundary between the LOCATION and the article text?
The algorithm should be flexible, since there are many formats of datelines in use.
I know this can be done by algorithm, since the boundary is clearly distinguishable even after randomising the characters.
SYAEIDUA, Tuqw gzce ox
QOZHANEPAD G.L. qisuxhen aodien
ADFD/QOIEYTYE Qidne Opaidh's wien aidnen
QIUEHN (XC) -- Ehd Towneyd Apenaid goeis he
IQUEN AOIEND, B.I. (OG) -- Qien oane px
OIQHNED, Qien. (PA) -- Nwne oaien pdxdaf
The only confusion I can see is in the second case, where G.L. could either stand for something like U.S. and be part of the main text, or be a city/state abbreviation and NOT part of the main text, e.g.
WASHINGTON D.C. Government officials on Monday...
(Government officials on Monday...)
NEW YORK U.S. Government's latest statement...
(U.S. Government's latest statement...)
NEW YORK A.B. Conglomerate's CEO said on Monday...
(A.B. Conglomerate's CEO said on Monday...)
This is where regex falls short, since it can't use a lookup table or the like to differentiate between the two cases. I can't just hard-code U.S. as a special case, after all (see the third case).
Any ideas?
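One hedged sketch, not a complete solution: scan tokens left to right, treating the dateline as a run of ALL-CAPS tokens, an optional state abbreviation following a comma ("N.Y.", "Colo."), and an optional wire-service tag ending in "--". The ambiguous abbreviation case is handled with a small lookup table of terms known to open body text; in practice that table would be built from your corpus, and a case like "A.B. Conglomerate's" would still need corpus statistics to resolve.

```python
import re

BODY_STARTERS = {"U.S.", "U.N."}                    # known body-text openers (build from corpus)
CAPS_TOKEN = re.compile(r"^\(?[A-Z][A-Z./)-]*,?$")  # NEW, YORK, PLAINS,, (AB), A/B
STATE_ABBR = re.compile(r"^[A-Z][A-Za-z.]*\.$")     # N.Y., Colo.

def split_dateline(line):
    """Return (dateline, body) for one article opening line."""
    tokens = line.split()
    boundary = 0
    for i, tok in enumerate(tokens):
        if tok in BODY_STARTERS:
            break                                   # body starts here, e.g. "NEW YORK U.S. ..."
        if tok == "--":
            boundary = i + 1                        # wire tag closes the dateline
            break
        if CAPS_TOKEN.match(tok):
            boundary = i + 1
        elif i > 0 and tokens[i - 1].endswith(",") and STATE_ABBR.match(tok):
            boundary = i + 1                        # state abbreviation: "PUEBLO, Colo."
        else:
            break                                   # first mixed-case word = body text
    return " ".join(tokens[:boundary]), " ".join(tokens[boundary:])

print(split_dateline("WHITE PLAINS, N.Y. (AB) -- Text goes on"))
# ('WHITE PLAINS, N.Y. (AB) --', 'Text goes on')
```

The token names and the contents of BODY_STARTERS are assumptions for illustration; the point is that a scanner plus a lookup table can do what a single regex can't.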
I'm trying to collect a dataset that could be used for automatically generating baseball articles.
I have play-by-play records of MLB games from retrosheet.org that I would like to write out as plain text, like sentences that might appear in a recap news article.
Here are some examples of the play-by-play records:
play,2,0,semim001,32,.CBFFFBBX,9/F
play,2,0,phegj001,01,FX,S7/G
play,2,0,martn003,01,CX,3/G
play,2,1,youne003,00,,NP
The following is what I would like to achieve:
For the first example
play,2,0,semim001,32,.CBFFFBBX,9/F
I want it to be written out as something like:
"semim001 (Marcus Semien) was on three balls and two strikes in the second inning as the away player. He hit the ball into play after one called strike, one ball, three fouls, and another two balls. The fly ball was caught by the right outfielder."
The plays are formatted in the following way:
The first field is the inning, an integer starting at 1.
The second field is either 0 (for visiting team) or 1 (for home team).
The third field is the Retrosheet player id of the player at the plate.
The fourth field is the count on the batter when this particular event (play) occurred. Most Retrosheet games do not have this information, and in such cases, "??" appears in this field.
The fifth field is of variable length and contains all pitches to this batter in this plate appearance; it is described below. If the pitches are unknown, this field is left empty: nothing appears between the commas.
The sixth field describes the play or event that occurred.
Explanations for all the symbols in the fifth and sixth field can be found on this Retrosheet page.
With Python 3, I've been able to turn all the fixed-length info (everything but the last two fields) into a formatted sentence. I'm having difficulty thinking of an efficient way to unparse (correct me if that's the wrong term here) the fifth and sixth fields, the pitches and the events that occurred, due to their variable length and the wide variety of things that can occur.
I think I could write out all the rules based on the info on the Retrosheet website, but I'm looking for suggestions for a smarter way to do this. I added natural language processing as a tag, hoping this could be a trivial problem in that field. Any pointers will be greatly appreciated!
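One table-driven direction for the fifth field, sketched under assumptions: keep a dictionary of pitch symbols (the subset below is illustrative; the full list is on the Retrosheet page) and render the pitch string by counting translations. The function name and wording are hypothetical.

```python
from collections import Counter

# Illustrative subset of Retrosheet pitch codes; see the Retrosheet
# event-file documentation for the complete set.
PITCH_WORDS = {
    "B": "ball",
    "C": "called strike",
    "S": "swinging strike",
    "F": "foul",
    "X": "ball put into play",
    ".": None,  # marker for a play not involving the batter
}

def describe_pitches(pitches):
    """Turn a pitch string like '.CBFFFBBX' into a readable phrase."""
    words = [PITCH_WORDS[p] for p in pitches if PITCH_WORDS.get(p)]
    counts = Counter(words)          # Counter keeps first-seen order
    return ", ".join(
        f"{n} {w}{'s' if n > 1 else ''}" for w, n in counts.items()
    )

print(describe_pitches(".CBFFFBBX"))
# 1 called strike, 3 balls, 3 fouls, 1 ball put into play
```

The sixth (event) field would get the same treatment with its own symbol table; the wide variety of plays then becomes a data problem rather than a code problem.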
I need to scrape the first three sentences from a paragraph, if they exist, using XPath.
I've already isolated the paragraph I want using:
//h3[contains(., 'Synopsis')]/following-sibling::p[1]
Which returns a plain, unformatted paragraph:
What do we do when the world's walls - its family structures, its value-systems, its political forms - crumble? The central character of this novel, 'Moor' Zogoiby, only son of a wealthy, artistic-bohemian Bombay family, finds himself in such a moment of crisis. His mother, a famous painter and an emotional despot, worships beauty, but Moor is ugly, he has a deformed hand. Moor falls in love, with a married woman; when their secret is revealed, both are expelled; a suicide pact is proposed, but only the woman dies. Moor chooses to accept his fate, plunges into a life of depravity in Bombay, then becomes embroiled in a major financial scandal. The novel ends in Spain, in the studio of a painter who was a lover of Moor's mother: in a violent climax Moor has, once more, to decide whether to save the life of his lover by sacrificing his own.
I only want the first three sentences. I'm willing to be lenient and ignore that first question mark; I just want whatever comes before the first three periods.
concat(
substring-before(//h3[contains(., 'Synopsis')]/following-sibling::p[1]/text(), '.'),
'.',
substring-before(substring-after(//h3[contains(., 'Synopsis')]/following-sibling::p[1]/text(), '.'), '.'),
'.',
substring-before(substring-after(substring-after(//h3[contains(., 'Synopsis')]/following-sibling::p[1]/text(), '.'), '.'), '.'),
'.'
)
(It's fun to do crazy things with XPath, but in real-life scenarios I wouldn't use it for tasks like this unless forced to it by absolute lack of other possibilities.)
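If a host language is available, the "first three periods" rule is simpler to express there. A hedged Python sketch that takes the already-extracted paragraph string (it assumes sentences end in a plain '.', matching the question's leniency about the '?', and at least three periods exist):

```python
def first_three_sentences(paragraph, n=3):
    """Return everything before the first n periods, periods included."""
    pieces = paragraph.split(".")[:n]
    return ".".join(pieces) + "."

print(first_three_sentences("One. Two. Three. Four."))
# One. Two. Three.
```

This mirrors the nested substring-before/substring-after calls above in two lines, and generalizes to any n.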
This call https://maps.googleapis.com/maps/api/place/autocomplete/xml?input=qqqqqqq (plus your key) returns addresses like 'qqqqqqqqqq, Florida, USA' and 'qqqqqqqqqqqqqqqqqqqqqqqq - Luizote de Freitas, Uberlândia - State of Minas Gerais, Brazil'. I understand that QQQ might be a valid name, but qqqqqqqqqqqqqqqqqqqqqqqq? And it works the same way for any sequence of repeating letters or numbers.
OK, let's say this is Google having bad data. But how to explain the results for 'www': 'Best Buy, Middlesex Turnpike, Burlington, MA, USA', 'Acton Toyota of Littleton, Great Road, Littleton, MA, USA'? I don't see any sane correlation between 'www' and the results.
You can see similar behaviour in google maps, so it's not just autocomplete API.
Any theories?
When I execute the request https://maps.googleapis.com/maps/api/place/autocomplete/json?input=www&key=MY_API_KEY from my location, I get really weird predictions as well:
Montpellier, France (place ID ChIJsZ3dJQevthIRAuiUKHRWh60, type locality)
Berlin, Germany (place ID ChIJAVkDPzdOqEcRcDteW0YgIQQ, type locality)
Hamburg, Germany (place ID ChIJuRMYfoNhsUcRoDrWe_I9JgQ, type locality)
Munich, Germany (place ID ChIJ2V-Mo_l1nkcRfZixfUq4DAE, type locality)
Vienna, Austria (place ID ChIJn8o2UZ4HbUcRRluiUYrlwv0, type locality)
Note that all of them have the locality type, and indeed it smells like a bug: I cannot see how on earth the text 'www' might match these predictions. Apparently something is broken on the Google backend, leading to the strange behavior in Places autocomplete.
I can confirm that I can see this problem on the Google Maps web site as well.
At this point I believe the best option is to send feedback to the Google Maps team and hope they fix it soon.
I am using Google Places to retrieve addresses, and we want the street (route, in Google terminology) to be separated into street name and street type. We also want the street type to match an existing column in our database.
But things get difficult because Google Places sometimes uses "XXXX Street" and sometimes "XXXX St".
For instance, this is a typical google address
{
administrative_area_level_1: ['short_name', 'VIC'],
locality: ['long_name', 'Carlton'],
postal_code: ['long_name', '3053'],
route: ['long_name', 'Canada Ln'],
street_number: ['short_name', '12'],
subpremise: ['short_name', '13']
}
But it always shows Canada Lane in the suggestion box.
And sometimes it's even worse, when the abbreviation does not match my local data model. For instance, we use "La" instead of "Ln" as the abbreviation for Lane.
It would be appreciated if anyone could tell me where to find a list of street types (and abbreviations) used by the Google API. Or is there a way to disable the abbreviation option?
Sounds like you're after "street suffixes". These are complicated.
Not only do they change across countries and languages; even within the same country and language they can be used in different ways. Abbreviations can have multiple meanings ("St" can be "Street" or "Saint"), and abbreviations are used or not depending on subtle rules that also change from place to place.
The same goes for cardinal points (North, South, East, West) that are part of road/street names: "North St" or "N 11th Street"? It's complicated.
If you already have a good amount of addresses, and you only care about addresses in English, you could take the last word from each street name as the suffix. When matching to your own data, allow for abbreviations when matching, rather than trying to expand them.
For instance, don't try to expand "Canada La" into "Canada Lane" so that it matches "Lane". Instead, expand "Lane" into ["Lane", "La", "Ln"] and match suffixes to all values.
Then you'd need a strategy for "collisions": abbreviations that can mean two or more suffixes. These seem to be rare; I can't think of any ("St" isn't one, because "Saint" isn't a suffix), and USPS's http://pe.usps.gov/text/pub28/28apc_002.htm doesn't seem to list any.
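An illustrative sketch of that matching strategy. The suffix table below is hypothetical and tiny; the real one would be built from the USPS Publication 28 appendix linked above.

```python
# Map each canonical suffix to every spelling you accept for it,
# per the advice above: expand your own data, not Google's.
SUFFIXES = {
    "Lane":   {"Lane", "La", "Ln"},
    "Street": {"Street", "St"},
    "Road":   {"Road", "Rd"},
}

# Invert the table so any abbreviation maps to its canonical suffix.
CANONICAL = {abbr.lower(): full
             for full, abbrs in SUFFIXES.items()
             for abbr in abbrs}

def split_route(route):
    """Split a Google 'route' such as 'Canada Ln' into (name, suffix)."""
    name, _, last = route.rpartition(" ")
    suffix = CANONICAL.get(last.rstrip(".").lower())
    return (name, suffix) if suffix else (route, None)

print(split_route("Canada Ln"))  # ('Canada', 'Lane')
```

Whether Google returns "Canada Lane", "Canada La" or "Canada Ln", the lookup lands on the same canonical "Lane", which is what the database column wants.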
iPhone has a pretty good telephone number splitting function, for example:
Singapore mobile: +65 9852 4135
Singapore resident line: +65 6325 6524
China mobile: +86 135-6952-3685
China resident line: +86 10-65236528
Hong Kong: +852 956-238-82
USA: +1 (732) 865-3286
Notice the nice features here:
- the splitting of country code, area code, and the rest is automatic;
- the delimiter is also nicely adapted to different countries, e.g. "()", "-" and space.
The parsing logic is doable for me; however, I don't know where to get the knowledge of most countries' telephone number formats.
Where could I find such knowledge, or open-source code that implements it?
You can get similar functionality with the libphonenumber code library.
Interestingly enough, you cannot use an NSNumberFormatter for this, but you can write your own custom class for it. Just create a new class, set properties such as countryCode, areaCode and number, and then create a method that formats the number based on the countryCode.
Here's a great example: http://the-lost-beauty.blogspot.com/2010/01/locale-sensitive-phone-number.html
As an aside: a friend told me about a gigantic regular expression he had to maintain that could pick telephone numbers out of intercepted communications from hundreds of countries around the world. It was very non-trivial.
Thankfully your problem is easier, as you can just have a table with the per-country formats:
format[usa] = "+d (ddd) ddd-dddd";
format[hk] = "+ddd ddd-ddd-dd";
format[china_mobile] = "+dd ddd-dddd-dddd";
...
Then when you're printing, you simply output one digit from the phone number string in each d spot as needed. This assumes you know the country, which is a safe enough assumption for telephone devices -- pick "default" formats for the few surrounding countries.
Since some countries have different formats with different lengths you might need to store your table with additional information:
format[germany][10] = "..."
format[germany][11] = "....."
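A minimal sketch of that table-driven approach, keyed on (country, digit count) as suggested above. The formats and keys are illustrative, not a complete dataset; libphonenumber ships real metadata for every country.

```python
# Templates use 'd' as a digit slot; every other character is copied
# through literally. Keyed on (country, number of digits) so countries
# with several lengths can carry several templates.
FORMATS = {
    ("us", 11): "+d (ddd) ddd-dddd",   # e.g. +1 (732) 865-3286
    ("cn", 13): "+dd ddd-dddd-dddd",   # e.g. +86 135-6952-3685
}

def format_number(digits, country):
    """Fill one digit into each 'd' slot of the matching template."""
    template = FORMATS[(country, len(digits))]
    it = iter(digits)
    return "".join(next(it) if ch == "d" else ch for ch in template)

print(format_number("17328653286", "us"))
# +1 (732) 865-3286
```

Printing is then just the one-digit-per-slot walk described above; the hard part is curating the table, which is exactly what libphonenumber's metadata solves.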