How to localize country names in phrases? - cocoa

I want to localize phrases like "Places in {country name}" where the country name is dynamically obtained (e.g. by -[NSLocale localizedStringForCountryCode:]).
The problem is that for some country names, an article must be prepended:
Places in the United States (en)
Sometimes in plural form, male or female, capitalized or not:
Lugares en los Estados Unidos (es)
Lugares en las Maldivas (es)
The article might even have to be declined according to the grammatical case (nominative, dative, etc.) of the country name in the phrase:
Orte in der Schweiz (de)
Wir fahren in die Schweiz (de for "we go to Switzerland")
Or we might even need a different preposition:
Orte auf den Malediven (de, using "auf" instead of "in" because the Maldives are islands)
Is there either a library or a good set of rules (e.g. regular expression based) that one could use to accomplish this?
While I'm primarily searching for an iOS solution, I'm open to porting a solution from other platforms.

This problem is best solved by avoiding it altogether: localise the entire phrases, and key them by the ISO country code.
The reason is that the rules are complex, not necessarily regular, and differ from language to language. Take Russian, for example, where the country name itself (as opposed to just a preposition) has to be modified according to the grammatical case in which it appears in the sentence. I assume the same holds for various other Slavic languages.
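A minimal sketch of the "localize the entire phrase" approach, assuming a hand-maintained table. The name PLACES_IN, the function places_in, and the fallback behaviour are hypothetical; the phrases are the ones from the question. In a real app each entry would live in the platform's string tables.

```python
# Hypothetical phrase table: one fully translated phrase per
# (language, ISO 3166-1 alpha-2 country code). No articles or
# prepositions are computed; translators supply the whole phrase.
PLACES_IN = {
    "en": {"US": "Places in the United States"},
    "es": {"MV": "Lugares en las Maldivas"},
    "de": {"CH": "Orte in der Schweiz", "MV": "Orte auf den Malediven"},
}


def places_in(language: str, country_code: str) -> str:
    """Return the whole localized phrase for a country.

    Falls back to English, then to a generic label (an assumption made
    for this sketch) when no translation exists.
    """
    code = country_code.upper()
    for lang in (language, "en"):
        phrase = PLACES_IN.get(lang, {}).get(code)
        if phrase is not None:
            return phrase
    return f"Places: {code}"
```

The cost is one translated string per country and phrase, but no grammar rules have to be encoded anywhere.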

Related

Strange results from Google places autocomplete for sequence of repeating letters

This call https://maps.googleapis.com/maps/api/place/autocomplete/xml?input=qqqqqqq (plus your key) returns addresses like 'qqqqqqqqqq, Florida, USA' and 'qqqqqqqqqqqqqqqqqqqqqqqq - Luizote de Freitas, Uberlândia - State of Minas Gerais, Brazil'. I understand that QQQ might be a valid name, but qqqqqqqqqqqqqqqqqqqqqqqq? And it works the same way for any sequence of repeating letters or numbers.
OK, let's say this is Google having bad data. But how do you explain the results for 'www': 'Best Buy, Middlesex Turnpike, Burlington, MA, USA', 'Acton Toyota of Littleton, Great Road, Littleton, MA, USA'? I do not see any sane correlation between 'www' and the results.
You can see similar behaviour in Google Maps, so it's not just the autocomplete API.
Any theories?
When I execute the request https://maps.googleapis.com/maps/api/place/autocomplete/json?input=www&key=MY_API_KEY from my location, I get really weird predictions as well:
Montpellier, France (place ID ChIJsZ3dJQevthIRAuiUKHRWh60, type locality)
Berlin, Germany (place ID ChIJAVkDPzdOqEcRcDteW0YgIQQ, type locality)
Hamburg, Germany (place ID ChIJuRMYfoNhsUcRoDrWe_I9JgQ, type locality)
Munich, Germany (place ID ChIJ2V-Mo_l1nkcRfZixfUq4DAE, type locality)
Vienna, Austria (place ID ChIJn8o2UZ4HbUcRRluiUYrlwv0, type locality)
Note that all of them have the locality type. It does smell like a bug, because I cannot see how on earth the text 'www' might match these predictions. Apparently something is broken on Google's backend and leads to the strange behavior in Places autocomplete.
I can confirm that I see this problem on the Google Maps web site as well.
At this point I believe the best option is to send feedback to the Google Maps team and hope they fix it soon.

Google Place API street type list

I am using Google Places to retrieve addresses, and we want the street (route in Google terminology) to be separated into street name and street type. We also want the street type to match an existing column in our database.
But things get difficult because Google Places sometimes uses XXXX Street and sometimes XXXX St.
For instance, this is a typical Google address:
{
administrative_area_level_1: ['short_name', 'VIC'],
locality: ['long_name', 'Carlton'],
postal_code: ['long_name', '3053'],
route: ['long_name', 'Canada Ln'],
street_number: ['short_name', '12'],
subpremise: ['short_name', '13']
}
But it always shows Canada Lane in the suggestion box.
It gets even worse when the abbreviation does not match my local data model. For instance, we use la instead of ln as the abbreviation for lane.
I would appreciate it if anyone could tell me where to find a list of the street types (and abbreviations) used by the Google API. Or is there a way to disable abbreviation?
Sounds like you're after "street suffixes". These are complicated.
Not only do they change across countries and languages; even within the same country and language they can be used in different ways. Abbreviations can have multiple meanings: "St" can be "Street" or "Saint". Whether abbreviations are used at all depends on subtle rules that also change from place to place.
Same goes for cardinal points (North, South, East, West) that are parts of road / street names: "North St" or "N 11st Street"? It's complicated.
If you already have a good amount of addresses, and you only care about addresses in English, you could take the last word from each street name as the suffix. When matching to your own data, allow for abbreviations when matching, rather than trying to expand them.
For instance, don't try to expand "Canada La" into "Canada Lane" so that it matches "Lane". Instead, expand "Lane" into ["Lane", "La", "Ln"] and match suffixes to all values.
Then you'd need a strategy for "collisions", abbreviations that can mean 2+ suffixes. These seem to be rare, I can't remember any ("St" isn't, because "Saint" isn't a suffix) and USPS' http://pe.usps.gov/text/pub28/28apc_002.htm doesn't seem to have any.
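The matching strategy above could be sketched as follows. The table SUFFIX_VARIANTS and the function split_route are hypothetical names, and the listed variants are illustrative only; a real table would come from your own data model plus a source such as the USPS list.

```python
# Hypothetical table: canonical suffix -> accepted spellings (lowercase).
SUFFIX_VARIANTS = {
    "Lane": {"lane", "la", "ln"},
    "Street": {"street", "st"},
    "Road": {"road", "rd"},
}

# Invert the table for lookup. A variant that maps to two or more
# canonical suffixes is a "collision" and is treated as ambiguous below.
VARIANT_TO_SUFFIXES = {}
for canonical, variants in SUFFIX_VARIANTS.items():
    for variant in variants:
        VARIANT_TO_SUFFIXES.setdefault(variant, set()).add(canonical)


def split_route(route):
    """Split a route such as 'Canada Ln' into ('Canada', 'Lane').

    Takes the last word as the suffix candidate. Returns (route, None)
    when the last word is unknown, ambiguous, or the only word.
    """
    cleaned = route.strip()
    name, _, last = cleaned.rpartition(" ")
    candidates = VARIANT_TO_SUFFIXES.get(last.rstrip(".").lower(), set())
    if name and len(candidates) == 1:
        return name, next(iter(candidates))
    return cleaned, None
```

Matching against all variants keeps the stored data canonical, so "Canada Ln", "Canada La" and "Canada Lane" all resolve to the same street type.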

Missing relations in new stanford-corenlp-3.2.0-models.jar

I previously used stanford-parser-2.0.4-models.jar in my application. Now I want to port it to stanford-corenlp-3.2.0-models.jar. I used edu.stanford.nlp.trees.EnglishGrammaticalRelations.PURPOSE_CLAUSE_MODIFIER and edu.stanford.nlp.trees.EnglishGrammaticalRelations.COMPLEMENTIZER to identify purpose clause modifier and complementizer relations from semantic graph edges, but I cannot find them in stanford-corenlp-3.2.0-models.jar. Could someone suggest how to do this with the new jar, and explain why these relations were removed?
I found the details in stanford-corenlp-3.2.0-sources.jar. These relations were removed and are now treated as special cases of existing relations.
Here are the relevant comments from the source code:
The "purpose clause modifier" grammatical relation has been discontinued
It is now just seen as a special case of an advcl. A purpose clause
modifier of a VP is a clause headed by "(in order) to" specifying a
purpose. Note: at present we only recognize ones that have
"in order to" or are fronted. Otherwise we can't use our surface representations to
distinguish these from xcomp's. We can also recognize "to" clauses
introduced by "be VBN".
<p/>
Example: <br/>
"He talked to the president in order to secure the account" →
<code>purpcl</code>(talked, secure)
The "complementizer" grammatical relation is a discontinued grammatical relation. A
A complementizer of a clausal complement was the word introducing it.
It only matched "that" or "whether". We've now merged this in with "mark" which plays a similar
role with other clausal modifiers.
<p/>
<p/>
Example: <br/>
"He says that you like to swim" →
<code>complm</code>(like, that)

iphone's phone number splitting algorithm?

iPhone has a pretty good telephone number splitting function, for example:
Singapore mobile: +65 9852 4135
Singapore resident line: +65 6325 6524
China mobile: +86 135-6952-3685
China resident line: +86 10-65236528
HongKong: +886 956-238-82
USA: +1 (732) 865-3286
Notice the nice features here:
- the splitting of country code, area code, and the rest is automatic;
- the delimiter is also nicely adapted to different countries, e.g. "()", "-" and space.
The parsing logic itself is doable for me; however, I don't know where to find the telephone number formats of most countries.
Where could I find this information, or open source code that implements it?
You can get similar functionality with the libphonenumber code library.
Interestingly enough, you cannot use an NSNumberFormatter for this, but you can write your own custom class for it. Just create a new class, set properties such as countryCode, areaCode and number, and then create a method that formats the number based on the countryCode.
Here's a great example: http://the-lost-beauty.blogspot.com/2010/01/locale-sensitive-phone-number.html
As an aside: a friend told me about a gigantic regular expression he had to maintain that could pick telephone numbers out of intercepted communications from hundreds of countries around the world. It was very non-trivial.
Thankfully your problem is easier, as you can just have a table with the per-country formats:
format[usa] = "+d (ddd) ddd-dddd";
format[hk] = "+ddd ddd-ddd-dd";
format[china_mobile] = "+dd ddd-dddd-dddd";
...
Then when you're printing, you simply output one digit from the phone number string in each d spot as needed. This assumes you know the country, which is a safe enough assumption for telephone devices -- pick "default" formats for the few surrounding countries.
Since some countries have different formats with different lengths you might need to store your table with additional information:
format[germany][10] = "..."
format[germany][11] = "....."
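A rough sketch of that table-driven approach, keyed by country and digit count. The FORMATS entries are derived from the example numbers in the question and are illustrative, not an authoritative list; the key names and the function format_phone are hypothetical.

```python
# Hypothetical format table keyed by (country, number of digits).
# Each 'd' is a slot for one digit; every other character is literal.
FORMATS = {
    ("usa", 11): "+d (ddd) ddd-dddd",
    ("china_mobile", 13): "+dd ddd-dddd-dddd",
    ("singapore", 10): "+dd dddd dddd",
}


def format_phone(country, number):
    """Format a phone number using the pattern for its country and length.

    Non-digit characters in the input are ignored. If no pattern is
    known, the digits are returned with a leading '+' (an assumption
    made for this sketch).
    """
    digits = "".join(ch for ch in number if ch.isdigit())
    pattern = FORMATS.get((country, len(digits)))
    if pattern is None:
        return "+" + digits
    remaining = iter(digits)
    # The key guarantees the pattern has exactly len(digits) 'd' slots.
    return "".join(next(remaining) if ch == "d" else ch for ch in pattern)
```

Keying on the digit count is what makes the per-length variants (as in the Germany example) fall out naturally.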

Localized exponential notation?

I'm trying to convert numbers into localized strings.
For integers and money values it's pretty simple, since the string is just a series of digits and digit grouping separators. E.g.:
12 345 678 901 (Bulgarian)
12.345.678.901 (Catalan)
12,345,678,901 (English)
12,34,56,78,901 (Hindi)
12.345.678.901 (Frisian)
12٬345٬678٬901 (Pashto)
12'345'678'901 (German)
I use the Windows GetNumberFormat function to format integers (and GetCurrencyFormat to format money values).
But some numbers cannot be reasonably represented in fixed notation, and require scientific notation:
6.0221417930×10^23
or more specifically E notation:
6.0221417930E23
How can I get the localized version of scientific notation?
I suppose I could construct it using localized numbers:
6.0221417930E23
6,0221417930E23
6.0221417930e23
6·0221417930E23
6·0221417930e23
6,0221417930e23
6,,0221417930e23
6.0221417930E+23
6,0221417930E+23
6.0221417930e+23
6,0221417930e+23
6·0221417930E+23
6·0221417930e+23
6,,0221417930e+23
6.0221417930E23
6,0221417930E23
6.0221417930e23
6,0221417930e23
6·0221417930E23
6·0221417930e23
6,,0221417930e23
6.0221417930X10^23
6,0221417930X10^23
6.0221417930x10^23
6,0221417930x10^23
6·0221417930X10^23
6·0221417930x10^23
6,,0221417930x10^23
6.0221417930·10^23
6,0221417930·10^23
6.0221417930.10^23
6,0221417930.10^23
6·0221417930·10^23
6·0221417930.10^23
6,,0221417930.10^23
But I don't know whether other cultures (cultures besides mine) use an E for exponentiation.
To the best of my knowledge, exponentiation notation is not part of Windows or .NET locale data. However, the Unicode CLDR can help once again: its <numbers> section contains what you are looking for:
/numbers/symbols/exponential says E or its equivalent in the given culture.
/numbers/scientificFormats/ shows the exponentiation pattern.
You'll need to download the zipped core CLDR data and extract the file for each culture you're interested in from the common/main directory.
If you want to support all cultures, you'll have to gather the relevant info from all culture files and pack it into your own database. That is not trivial work, but it's possible.
I took a quick look at the data for a few very different cultures (en, fr, zh, ru, vi, ar), and they all contain the same pattern: #E0. So either the data is inaccurate (which I seriously doubt) or you don't really have to care, because everybody does it the same way.
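A minimal sketch of building E notation from per-culture symbols. The SYMBOLS table is a hand-copied, illustrative stand-in for the CLDR decimal and exponential symbols (only two locales are shown), and format_scientific is a hypothetical helper; it only approximates the #E0 pattern and does not read CLDR files.

```python
from decimal import Decimal

# Illustrative per-locale symbols, assumed to be extracted from CLDR's
# numbers/symbols data beforehand.
SYMBOLS = {
    "en": {"decimal": ".", "exponential": "E"},
    "de": {"decimal": ",", "exponential": "E"},
}


def format_scientific(value, locale, digits=10):
    """Format a number as mantissa + exponent symbol + exponent.

    `value` is a string parsed with Decimal so that no binary
    floating-point rounding creeps into the mantissa.
    """
    # Decimal's 'E' presentation yields e.g. '6.0221417930E+23'.
    mantissa, exponent = f"{Decimal(value):.{digits}E}".split("E")
    symbols = SYMBOLS[locale]
    localized_mantissa = mantissa.replace(".", symbols["decimal"])
    # int() drops the explicit '+' sign but keeps '-'.
    return localized_mantissa + symbols["exponential"] + str(int(exponent))
```

With the two locales above, Avogadro's number comes out as 6.0221417930E23 for "en" and 6,0221417930E23 for "de".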
For Polish it should be 6,0221417930·10^23.
I don't think the CLDR data mentioned by Serge (great answer, BTW) is valid here. However, it is still the best source of information. Otherwise you would need to ask your translators to translate the pattern for you (which would require a comment with a good explanation of what you are up to).
