Localized exponential notation? - windows

i'm trying to convert numbers into localized strings.
For integers and money values it's pretty simple, since the string is just a series of digits and digit grouping separators. E.g.:
12 345 678 901 (Bulgarian)
12.345.678.901 (Catalan)
12,345,678,901 (English)
12,34,56,78,901 (Hindi)
12.345.678.901 (Frisian)
12?345?678?901 (Pashto)
12'345'678'901 (German)
i use the Windows GetNumberFormat function to format integers (and GetCurrencyFormat to format money values).
But some numbers cannot be reasonably represented in fixed notation, and require scientific notation:
6.0221417930×1023
or more specifically E notation:
6.0221417930E23
How can i get the localized version of scientific notation?
i suppose i could construct it using localized numbers:
6.0221417930E23
6,0221417930E23
6.0221417930e23
6·0221417930E23
6·0221417930e23
6,0221417930e23
6,,0221417930e23
6.0221417930E+23
6,0221417930E+23
6.0221417930e+23
6,0221417930e+23
6·0221417930E+23
6·0221417930e+23
6,,0221417930e+23
6.0221417930E23
6,0221417930E23
6.0221417930e23
6,0221417930e23
6·0221417930E23
6·0221417930e23
6,,0221417930e23
6.0221417930X10^23
6,0221417930X10^23
6.0221417930x10^23
6,0221417930x10^23
6·0221417930X10^23
6·0221417930x10^23
6,,0221417930x10^23
6.0221417930·10^23
6,0221417930·^23
6.0221417930.10^23
6,0221417930.10^23
6·0221417930·^23
6·0221417930.10^23
6,,0221417930.10^23
but i don't know if other cultures (cultures besides mine) use an E for exponentiation.

To the best of my knowledge, exponentiation notation is not part of Windows or .NET locale data. However, the Unicode CLDR can help once again: Its <numbers> sections contains what you are looking for:
/numbers/symbols/exponential says E or its equivalent in the given culture.
/numbers/scientificFormats/ shows the exponentiation pattern.
You'll need to download the zipped core CLDR data and extract the file for each culture you're interested in from the common/main directory.
If you want to be able to support all cultures, you'll have to gather the relevant info from all culture files and pack it into your own specific DB. Not quite a trivial work but it's possible.
I gave a quick look to the data in a few very different cultures such as en, fr, zh, ru, vi, ar: They all contain the same pattern: #E0. It looks like either the data is not accurate (I seriously doubt.) or you don't have to care really: Everybody does it the same way and you shouldn't actually care.

For Polish it should be 6,0221417930·1023.
I don't think CLDR mentioned by Serge (great answer BTW) is valid here. However, it is still the best source of information. Otherwise you would need to ask your translators to translate the pattern for you (which would require a comment with good explanation what you are up to).

Related

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with Fasttext Word vectors for return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun", in english) return a series of words with a dot in it (like sole., sole.Ma, ecc...). Where is the problem? Why most_similar return this meaningless word?
EDIT
I tried with english word vector and the word "sun" return this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if choosing to use FastText, you may be interesting it using it "fully".)
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

Convert to E164 only if possible?

Can I determine if the user entered a phone number that can be safely formatted into E164?
For Germany, this requires that the user started his entry with a local area code. For example, 123456 may be a subscriber number in his city, but it cannot be formatted into E164, because we don't know his local area code. Then I would like to keep the entry as it is. In contrast, the input 089123456 is independent of the area code and could be formatted into E164, because we know he's from Germany and we could convert this into +4989123456.
You can simply convert your number into E164 using libphonenumber
and after conversion checks if both the strings are same or not. If they're same means a number can not be formatted, otherwise the number you'll get from library will be formatted in E164.
Here's how you can convert
PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
String formattedNumber = phoneUtil.format(inputNumber, PhoneNumberFormat.E164);
Finally compare formattedNumber with inputNumber
It looks as though you'll need to play with isValidNumber and isPossibleNumber for your case. format is certainly not guaranteed to give you something actually dialable, see the javadocs. This is suggested by the demo as well, where formatting is not displayed when isValidNumber is false.
I also am dealing with this FWIW. In the context of US numbers: The issue is I'd like to parse using isPossibleNumber in order to be as lenient as possible, and store the number in E164. However then we accept, e.g. +15551212. This string itself even passes isPossibleNumber despite clearly (I think) not being dialable anywhere.

Encrypt passwords in human-readable format

I'm looking for one or more encryption algorithms that can encrypt passwords in a way that the encrypted text must be readable by humans.
For example if I give as password:
StackOverflow
The algorithm should gives me:
G0aThiR4i s0ieFer
So I want to get an encrypted password easily readable by humans (not strange strings with special chars or thousand of characters).
There are algorithms that fit this need?
RFC 1751, which defines a "Convention for Human-Readable 128-bit Keys" – basically just a mapping of blocks of bits to strings of English words.
For example, the 128-bit key of:
CCAC 2AED 5910 56BE 4F90 FD44 1C53 4766
would become
RASH BUSH MILK LOOK BAD BRIM AVID GAFF BAIT ROT POD LOVE
Algorithm is used for fixed-length 128-bit keys, that's a base for data size. Source data can be truncated or expanded to match the base.
Spec & implementation in C # https://www.rfc-editor.org/rfc/rfc1751
It ain't well known. I couldn't find implementation mention apart from one in spec & references to lost python library.
Your question puzzled me in many ways, leading to the hare-brained idea of creating a completely human-readable paragraph from a base 16 hash result. What's more readable than English sentences?
To maintain the security of a hashed password, the algorithm is as follows:
Hash a user's password using any of the basic techniques.
Run the hash through my special HashViewer
Enjoy the Anglicized goodness.
For example, if this Base16 hash is the input:
c0143d440bd61b33a65bfa31ac35fe525f7b625f10cd53b7cac8824d02b61dfc
the output from HashViewer would be this:
Smogs enjoy dogs. Logs unearth backlogs. Logs devour frogs. Grogs decapitate clogs and dogs conceptualize fogs. Fogs greet clogs. Cogs conceptualize warthogs. Bogs unearth dogs despite bogs squeeze fogs. Cogs patronize catalogs. Cogs juggle cogs. Warthogs debilitate grogs; unfortunately, clogs juggle cogs. Warthogs detest frogs; conversely, smogs decapitate cogs. Fogs conceptualize balrogs. Smogs greet smogs whenever polywogs accost eggnogs. Logs decapitate frogs. Eggnogs conceptualize clogs. Dogs decapitate warthogs. (smogs )
(The last words in parenthesis were words that were left over)
In this glorious paragraph form, we can look at two separate hashes and easily compare them to see if they are different.
As an extra feature, there is a function to convert the English text back in to the hash string.
Enjoy!
Step 1) Apply regular encryption algorithm.
Step 2) Base 26-encode it using letters a thru z
Step 3) ???
Step 4) Profit!
Even cooler would be to get a letter-bigraph distribution for the english language. Pick a simple pseudo-random-number algorithm and seed it with the non-human-readable password hash. Then use the algorithm to make up a word according to how likely a letter is to follow the letter before it in the english language. You'll get really english-sounding words, and consistently from the same seed. The chance for collisions might be unseasonably high, though.
EncryptedPassword = Base64(SHA1(UTF8(Password) + Salt))
Password is string
Salt is cryptographically strong random byte[]
UTF8: string -> byte[]
SHA1: byte[] -> byte[]
Base64: byte[] -> string
EncryptedPassword is string
Store (EncryptedPassword, Salt).
Throw in whatever encryption algorithm you want instead of SHA1.
For example, if password is StackOverflow you get the encrypted password LlijGz/xNSRXQXtbgNs+UIdOCwU.

iphone's phone number splitting algorithm?

iPhone has a pretty good telephone number splitting function, for example:
Singapore mobile: +65 9852 4135
Singapore resident line: +65 6325 6524
China mobile: +86 135-6952-3685
China resident line: +86 10-65236528
HongKong: +886 956-238-82
USA: +1 (732) 865-3286
Notice the nice features here:
- the splitting of country code, area code, and the rest is automatic;
- the delimiter is also nicely adopted to different countries, e.g. "()", "-" and space.
Note the parsing logic is doable to me, however, I don't know where to get the knowledge of most countries' telephone number format.
where could i found such knowledge, or an open source code that implemented it?
You can get similar functionality with the libphonenumber code library.
Interestingly enough, you cannot use an NSNumberFormatter for this, but you can write your own custom class for it. Just create a new class, set properties such as countryCode, areaCode and number, and then create a method that formats the number based on the countryCode.
Here's a great example: http://the-lost-beauty.blogspot.com/2010/01/locale-sensitive-phone-number.html
As an aside: a friend told me about a gigantic regular expression he had to maintain that could pick telephone numbers out of intercepted communications from hundreds of countries around the world. It was very non-trivial.
Thankfully your problem is easier, as you can just have a table with the per-country formats:
format[usa] = "+d (ddd) ddd-dddd";
format[hk] = "+ddd ddd-ddd-dd";
format[china_mobile] = "+dd ddd-dddd-dddd";
...
Then when you're printing, you simply output one digit from the phone number string in each d spot as needed. This assumes you know the country, which is a safe enough assumption for telephone devices -- pick "default" formats for the few surrounding countries.
Since some countries have different formats with different lengths you might need to store your table with additional information:
format[germany][10] = "..."
format[germany][11] = "....."

Ruby on Rails - generating bit.ly style identifiers

I'm trying to generate UUIDs with the same style as bit.ly urls like:
http://bit [dot] ly/aUekJP
or cloudapp ones:
http://cl [dot] ly/1hVU
which are even smaller
how can I do it?
I'm now using UUID gem for ruby but I'm not sure if it's possible to limitate the length and get something like this.
I am currently using this:
UUID.generate.split("-")[0] => b9386070
But I would like to have even smaller and knowing that it will be unique.
Any help would be pretty much appreciated :)
edit note: replaced dot letters with [dot] for workaround of banned short link
You are confusing two different things here. A UUID is a universally unique identifier. It has a very high probability of being unique even if millions of them were being created all over the world at the same time. It is generally displayed as a 36 digit string. You can not chop off the first 8 characters and expect it to be unique.
Bitly, tinyurl et-al store links and generate a short code to represent that link. They do not reconstruct the URL from the code they look it up in a data-store and return the corresponding URL. These are not UUIDS.
Without knowing your application it is hard to advise on what method you should use, however you could store whatever you are pointing at in a data-store with a numeric key and then rebase the key to base32 using the 10 digits and 22 lowercase letters, perhaps avoiding the obvious typo problems like 'o' 'i' 'l' etc
EDIT
On further investigation there is a Ruby base32 gem available that implements Douglas Crockford's Base 32 implementation
A 5 character Base32 string can represent over 33 million integers and a 6 digit string over a billion.
If you are working with numbers, you can use the built in ruby methods
6175601989.to_s(30)
=> "8e45ttj"
to go back
"8e45ttj".to_i(30)
=>6175601989
So you don't have to store anything, you can always decode an incoming short_code.
This works ok for proof of concept, but you aren't able to avoid ambiguous characters like: 1lji0o. If you are just looking to use the code to obfuscate database record IDs, this will work fine. In general, short codes are supposed to be easy to remember and transfer from one medium to another, like reading it on someone's presentation slide, or hearing it over the phone. If you need to avoid characters that are hard to read or hard to 'hear', you might need to switch to a process where you generate an acceptable code, and store it.
I found this to be short and reliable:
def create_uuid(prefix=nil)
time = (Time.now.to_f * 10_000_000).to_i
jitter = rand(10_000_000)
key = "#{jitter}#{time}".to_i.to_s(36)
[prefix, key].compact.join('_')
end
This spits out unique keys that look like this: '3qaishe3gpp07w2m'
Reduce the 'jitter' size to reduce the key size.
Caveat:
This is not guaranteed unique (use SecureRandom.uuid for that), but it is highly reliable:
10_000_000.times.map {create_uuid}.uniq.length == 10_000_000
The only way to guarantee uniqueness is to keep a global count and increment it for each use: 0000, 0001, etc.

Resources