pg_trgm behaves differently on Ubuntu and Mac OS X

I am using pg_trgm to perform fuzzy string matching on text that may contain Chinese characters. Strangely, on my Ubuntu server everything is fine, as follows:
SELECT show_trgm('原作者');
> {0xa09182,0xcdfdbb,0x183afe,leD}
However, on my Mac, it does not work:
SELECT show_trgm('原作者');
> {}
I guess it is due to some strange encoding stuff, but I examined all the settings I could think of, including:
SHOW SERVER_VERSION;
SHOW SERVER_ENCODING;
SHOW LC_COLLATE;
SHOW LC_CTYPE;
Where on Ubuntu it shows:
9.5.1
UTF8
en_US.UTF-8
en_US.UTF-8
and on Mac it shows:
9.5.3
UTF8
en_US.UTF-8
en_US.UTF-8
Also, the pg_trgm versions are both 1.1, according to SELECT * FROM pg_extension.
Could anyone help me find out why pg_trgm does not work with Unicode on my Mac?

The reason is that pg_trgm depends on libc (the system C library shipped with the OS) routines to classify which characters are alphabetic and which aren't, and this unfortunately differs between OSes. Apple's Mac OS X is known for interpreting UTF-8 locales differently from other Unix/Unix-like systems. Character classification is locale-dependent and is driven by the LC_CTYPE category (and the environment variable of the same name).
Check the output of \l in psql; the Ctype column tells you how characters are classified in your database.
If it is C (I have seen that on Apple Mac OS X before), try creating the database again with an explicit ctype: CREATE DATABASE foo ... LC_CTYPE="en_US.UTF-8"
If it is already en_US.UTF-8, it is very likely that Mac OS X does not classify UTF-8 Chinese characters as alphabetic in that locale (not surprising). Try LC_CTYPE="zh_CN.UTF-8" instead; that should work.
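To make the classification dependency concrete, here is a rough Python sketch of how pg_trgm extracts trigrams once characters are classified as alphanumeric (the real extension additionally hashes trigrams containing multibyte characters, which this sketch skips for readability; the regex `\w` stands in for libc's locale-driven classification):

```python
import re

def trigrams(text):
    """Rough sketch of pg_trgm's show_trgm(): split the input into
    "words" of alphanumeric characters, pad each word with two leading
    blanks and one trailing blank, then collect every 3-character
    window. Which characters count as alphanumeric is exactly where
    libc's LC_CTYPE classification enters in the real extension."""
    grams = set()
    for word in re.findall(r'\w+', text):  # Python's \w includes CJK characters
        padded = '  ' + word + ' '
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return sorted(grams)

# Because \w classifies Chinese characters as word characters, the
# trigrams come out here regardless of OS -- unlike pg_trgm on macOS
# under en_US.UTF-8, where the classification step drops them.
print(trigrams('原作者'))
print(trigrams('cat'))
```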

On macOS this is a character-classification issue: you have to set the locale explicitly for the language in question, and the default en_US.UTF-8 will not work. So:
Chinese: LC_CTYPE="zh_CN.UTF-8"
Likewise, the locale should be changed to match the language of the data; after all, there is little point in classifying Chinese text under a US English locale.
You can then create the database:
CREATE DATABASE mydb WITH ENCODING='UTF8' LC_CTYPE='zh_CN.UTF-8' LC_COLLATE='zh_CN.UTF-8' OWNER=postgres TEMPLATE=template0 CONNECTION LIMIT=-1;

Related

Oracle server encoding and file encoding

I have an Oracle server with a DAD defined with PlsqlNLSLanguage DANISH_DENMARK.WE8ISO8859P1.
I also have a JavaScript file that is loaded in the browser. The file contains the Danish letters æøå. When the file is saved as UTF-8 the Danish letters are garbled; when I save it as UTF-8 with BOM or as ANSI, the letters are shown correctly.
I am not sure what is wrong.
Try to set your DAD
PlsqlNLSLanguage DANISH_DENMARK.UTF8
or even better
PlsqlNLSLanguage DANISH_DENMARK.AL32UTF8
When you save your file as ANSI it typically means "Windows Codepage 1252" on Western Windows, see column "ANSI codepage" at National Language Support (NLS) API Reference. CP1252 is very similar to ISO-8859-1, see ISO 8859-1 vs. Windows-1252 (it is the German Wikipedia, however that table shows the differences much better than the English Wikipedia). Hence for a 100% correct setting you would have to set PlsqlNLSLanguage DANISH_DENMARK.WE8MSWIN1252.
Now, why do you get correct characters when you save your file as UTF8-BOM, although there is a mismatch with .WE8ISO8859P1?
When the browser opens the file it first reads the BOM 0xEF,0xBB,0xBF and assumes the file is encoded as UTF-8. However, this may fail in some circumstances, e.g. when you insert text from an input field into the database.
With PlsqlNLSLanguage DANISH_DENMARK.AL32UTF8 you tell the Oracle database: "The web server uses UTF-8." No more, no less (in terms of character set encoding). So, when your database uses character set WE8ISO8859P1, the Oracle driver knows it has to convert ISO-8859-1 characters coming from the database to UTF-8 for the browser, and vice versa.
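To see why the mismatch garbles æøå in the first place, here is a small Python sketch of the failure mode: the file is saved as UTF-8, but a server configured for ISO-8859-1 reinterprets those bytes one at a time.

```python
# The Danish letters as saved in the JavaScript file:
text = 'æøå'
utf8_bytes = text.encode('utf-8')         # two bytes per letter
latin1_bytes = text.encode('iso-8859-1')  # one byte per letter -- what WE8ISO8859P1 expects

# A server that believes the file is ISO-8859-1 misreads the UTF-8 bytes:
mojibake = utf8_bytes.decode('iso-8859-1')
print(mojibake)  # each letter becomes two characters: 'Ã¦Ã¸Ã¥'
```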

Fixing RuntimeError: Invalid NLS_LANG format: AMERICAN

I'm trying to hit Oracle from Ruby and getting an error on the first line. (I'm actually doing this in pry, but that probably doesn't matter.)
[1] pry(main)> require 'oci8'
RuntimeError: Invalid NLS_LANG format: AMERICAN
What's the problem and how do I fix it?
Googling the error message didn't turn up anything promising. (It now turns up this question.) The only other question resembling this one on stackoverflow is dealing with a different problem (the variable not having any value at all even though the user set one) and the answer there did not work for me (the value proposed is also invalid, and $LANG is not set in my environment, so setting it to that did not work.)
NLS_LANG should have the format <language>_<territory>.<characterset>
Straight from the doc there is an example corresponding to your exact use case:
The NLS_LANG environment variable is set as a local environment variable for the shell on all UNIX-based platforms. For example, if the operating system locale setting is en_US.UTF-8, then the corresponding NLS_LANG environment variable should be set to AMERICAN_AMERICA.AL32UTF8.
Please note that AL32UTF8 is a superset of UTF8 (no hyphen) accepting all Unicode characters, whereas UTF8 only supports Unicode 3.1 and earlier. I would strongly recommend AL32UTF8 as your default "UTF-8" character set unless you have very specific needs.
In Oracle 12.1, AL32UTF8 supports Unicode up to version 6.1. One advantage is that AL32UTF8 supports the supplementary characters introduced in Unicode 4.0 (code points U+10000 to U+10FFFF).
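The practical difference shows up with supplementary characters: Oracle's older UTF8 charset behaves like CESU-8, spending six bytes where standard UTF-8 (AL32UTF8) spends four. A Python sketch (the CESU-8 form is built by hand here, since the stdlib has no cesu-8 codec):

```python
import struct

ch = '\U00010400'  # DESERET CAPITAL LETTER LONG I, a supplementary character

# AL32UTF8 is standard UTF-8: one four-byte sequence.
al32utf8 = ch.encode('utf-8')
print(len(al32utf8))  # 4

# Oracle's older "UTF8" charset is CESU-8-like: the two UTF-16
# surrogate code units are each encoded separately, three bytes apiece.
hi, lo = struct.unpack('>2H', ch.encode('utf-16-be'))
cesu8 = (chr(hi).encode('utf-8', 'surrogatepass')
         + chr(lo).encode('utf-8', 'surrogatepass'))
print(len(cesu8))  # 6
```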
I don't know where the value "AMERICAN" came from, but it turns out a better option, which the ruby-oci8 gem will accept, is NLS_LANG=AMERICAN_AMERICA.UTF8.

How to sort words with accents?

I was wondering how to sort alphabetically a list of Spanish words [with accents].
Excerpt from the word list:
Chocó
Cundinamarca
Córdoba
Cygwin uses GNU utilities, which are usually well-behaved when it comes to locales; a notable and regrettable exception is awk (gawk).
The following is based on Cygwin 1.7.31-3, current as of this writing.
Cygwin by default uses the locale implied by the current Windows user's UI language, combined with UTF-8 character encoding.
Note that it's NOT based on the setting for date/time/number/currency formats, and changing that makes no difference. The limitation of basing the locale on the UI language is that it invariably uses that language's "home" region; e.g., if your UI language is Spanish, Cygwin will invariably use es_ES, i.e., Spain's locale. The only way to change that is to explicitly override the default - see below.
You can override this in a variety of ways, preferably by defining a persistent Windows environment variable named LANG (see below; for an overview of all methods, see https://superuser.com/a/271423/139307)
To see what locale is in effect in Cygwin, run locale and inspect the value of the LANG variable.
If that doesn't show es_*.utf8 (where * represents your region in the Spanish-speaking world, e.g., CO for Colombia, ES for Spain, ...), set the locale as follows:
In Windows, open the Start menu and search for 'environment', then select Edit environment variables for your account, which opens the Environment Variables dialog.
Edit or create a variable named LANG with the desired locale, e.g., es_CO.utf8 -- UTF-8 character encoding is usually the best choice.
Any Cygwin bash shell you open from then on should reflect the new locale - verify by running locale and ensuring that the LC_* values match the LANG value and that no warnings are reported.
At that point, the following:
sort <<<$'Chocó\nCundinamarca\nCórdoba'
should produce (i.e., ó will sort directly after o, as desired):
Chocó
Córdoba
Cundinamarca
Note: locale en_US.utf8 would produce the same output - apparently, it generically sorts accented characters directly after their base characters - which may or may not be what a specific non-US locale actually does.
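If no suitable locale is available at all, the same "accents sort right after their base letter" behavior can be approximated in a locale-independent way: decompose each word (NFD), strip the combining marks to get a base-letter sort key, and keep the original word as a tie-breaker. This is only a rough stand-in for real collation (it ignores language-specific rules such as the traditional Spanish treatment of "ch"), but it reproduces the desired ordering here:

```python
import unicodedata

def collation_key(word):
    # NFD splits 'ó' into 'o' + combining acute; dropping the combining
    # marks yields the base letters, so 'Córdoba' keys as 'cordoba'.
    decomposed = unicodedata.normalize('NFD', word)
    base = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), word)  # original word breaks ties among equal bases

words = ['Chocó', 'Cundinamarca', 'Córdoba']
print(sorted(words, key=collation_key))  # ['Chocó', 'Córdoba', 'Cundinamarca']
```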

Converting Legacy Mac OS Japanese Encoding to Unicode in Windows

Several years ago, Apple released a document that outlines the mappings between Apple's "Mac OS Japanese" Character Set and Unicode code points. (ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT)
Microsoft provides the function, MultiByteToWideChar, to assist with mapping characters to a UTF-16 string.
MultiByteToWideChar works correctly for some Japanese characters in Apple's legacy character set (see FTP link, above), but returns "no mapping available" for others (For example, 0x85BE is supposed to map to Unicode 0x217B (SMALL ROMAN NUMERAL TWELVE), however it fails.)
I am using code page 10001 (Japanese-Mac).
Am I overlooking something obvious or is the code page for mapping Japanese-Mac to UTF-16 simply incomplete on Windows?
Windows usually treats x-mac-japanese as plain Shift_JIS -- and the problem is that x-mac-japanese is a superset of Shift_JIS, so some mappings will be missing. For instance, there is nothing in the 0x85xx range in Shift_JIS.
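The gap is easy to demonstrate from another Shift_JIS implementation. Python's stdlib has no x-mac-japanese codec, but its shift_jis codec (also based on JIS X 0208) shows the same hole: 0x85BE, which Apple's table maps to U+217B, simply has no mapping in plain Shift_JIS.

```python
# 0x85BE maps to U+217B (SMALL ROMAN NUMERAL TWELVE) in Apple's
# x-mac-japanese table, but the 0x85xx lead-byte rows are unassigned
# in JIS X 0208-based Shift_JIS, so a strict decode fails -- the same
# "no mapping available" result MultiByteToWideChar reports.
data = bytes([0x85, 0xBE])
try:
    data.decode('shift_jis')
except UnicodeDecodeError as e:
    print('no mapping available:', e.reason)
```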

Mac vs PC Text Encoding

I've noticed this difference specifically with iTunes when you export your music library.
I have a song with é (that is, a small Latin E with an acute accent); when I export the library on Windows it gets encoded as %C3%A9, but when I export the library from a Mac, a normal 'e' is printed, followed by %CC%81.
Example:
Song Name: Héllo World
Windows Export: H%C3%A9llo World
Mac Export: He%CC%81llo World
This is important to me for a program I'm making: in the Windows version I decode these strings, but that no longer works if the file comes from a Mac.
So why is there this difference? Is there a place where I can see what the Mac encodings are? Is there maybe an Objective-C routine to decode these strings?
Thanks.
C3A9 is the UTF-8 encoding for the character é.
CC81 is the UTF-8 encoding for the COMBINING ACUTE ACCENT character (U+0301).
An "e" followed by a COMBINING ACUTE ACCENT combines to the character "é".
The two are simply different Unicode normalization forms: %C3%A9 is the precomposed (NFC) form, while e plus %CC%81 is the decomposed (NFD) form.
Why one iTunes build prefers one over the other I don't know; there is no inherent reason to do so, though Apple's HFS+ file system does store file names in a decomposed form.
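Both exports can be reproduced, and reconciled, with the stdlib: normalize to NFC or NFD before percent-encoding, and normalize after decoding to treat the two forms as the same name. A minimal Python sketch (spaces left unescaped to match the exports shown above):

```python
import unicodedata
from urllib.parse import quote, unquote

name = 'Héllo World'

nfc = unicodedata.normalize('NFC', name)  # precomposed: é is one code point
nfd = unicodedata.normalize('NFD', name)  # decomposed: 'e' + combining acute

print(quote(nfc, safe=' '))  # H%C3%A9llo World  -- the Windows export
print(quote(nfd, safe=' '))  # He%CC%81llo World -- the Mac export

# To treat both exports as the same string, normalize after decoding:
assert unicodedata.normalize('NFC', unquote('He%CC%81llo World')) == nfc
```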
