I'm trying to hit Oracle from Ruby and getting an error on the first line. (I'm actually doing this in pry, but that probably doesn't matter.)
[1] pry(main)> require 'oci8'
RuntimeError: Invalid NLS_LANG format: AMERICAN
What's the problem and how do I fix it?
Googling the error message didn't turn up anything promising. (It now turns up this question.) The only other question on Stack Overflow resembling this one deals with a different problem (the variable having no value at all even though the user set one), and the answer there did not work for me (the value proposed is also invalid, and $LANG is not set in my environment, so setting NLS_LANG to it did not help).
NLS_LANG should have the format <language>_<territory>.<characterset>
Straight from the documentation, there is an example corresponding to your exact use case:
The NLS_LANG environment variable is set as a local environment variable for the shell on all UNIX-based platforms. For example, if the operating system locale setting is en_US.UTF-8, then the corresponding NLS_LANG environment variable should be set to AMERICAN_AMERICA.AL32UTF8.
Please note that AL32UTF8 is a superset of UTF8 (without a hyphen) and accepts all Unicode characters, while UTF8 only supports Unicode 3.1 and earlier. I would strongly recommend using AL32UTF8 as your default "UTF-8" character set unless you have very specific needs.
In Oracle 12.1, AL32UTF8 supports Unicode up to version 6.1. One advantage is that AL32UTF8 supports the supplementary characters introduced by Unicode 4.0 (code points U+10000 to U+10FFFF).
I don't know where the value "AMERICAN" came from, but it turns out a better option, which the ruby-oci8 gem will accept, is NLS_LANG=AMERICAN_AMERICA.UTF8.
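As a minimal sketch (the character set shown is only an assumption, following the AL32UTF8 recommendation above), set the variable before the OCI environment is created, i.e. before the require:
# in the shell: export NLS_LANG=AMERICAN_AMERICA.AL32UTF8
# or from Ruby itself, before loading the gem:
ENV['NLS_LANG'] = 'AMERICAN_AMERICA.AL32UTF8'
require 'oci8'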
Related
I have an Oracle server with a DAD defined with PlsqlNLSLanguage DANISH_DENMARK.WE8ISO8859P1.
I also have a JavaScript file that is loaded in the browser. The JavaScript file contains the Danish letters æøå. When the JS file is saved as UTF-8, the Danish letters are mis-encoded. When I save the JS file as UTF8-BOM or ANSI, the letters are shown correctly.
I am not sure what is wrong.
Try setting your DAD to
PlsqlNLSLanguage DANISH_DENMARK.UTF8
or even better
PlsqlNLSLanguage DANISH_DENMARK.AL32UTF8
When you save your file as ANSI, that typically means "Windows Codepage 1252" on Western Windows; see the column "ANSI codepage" in the National Language Support (NLS) API Reference. CP1252 is very similar to ISO-8859-1; see ISO 8859-1 vs. Windows-1252 (it is the German Wikipedia, but that table shows the differences much better than the English one). Hence, for a 100% correct setting you would have to set PlsqlNLSLanguage DANISH_DENMARK.WE8MSWIN1252.
Now, why do you get correct characters when you save your file as UTF8-BOM, although there is a mismatch with .WE8ISO8859P1?
When the browser opens the file, it first reads the BOM 0xEF,0xBB,0xBF and assumes the file is encoded as UTF-8. However, this may fail in some circumstances, e.g. when you insert text from an input field into the database.
With PlsqlNLSLanguage DANISH_DENMARK.AL32UTF8 you tell the Oracle database: "The web server uses UTF-8." No more, no less (in terms of character set encoding). So, when your database uses the character set WE8ISO8859P1, the Oracle driver knows it has to convert ISO-8859-1 characters coming from the database to UTF-8 for the browser, and vice versa.
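For orientation, a sketch of where the directive lives in a mod_plsql dads.conf; the DAD location and connect string below are hypothetical placeholders, the PlsqlNLSLanguage line is the only point:
<Location /pls/mydad>
  SetHandler pls_handler
  # placeholder connect string; keep whatever your DAD already uses
  PlsqlDatabaseConnectString mydbhost:1521:ORCL
  PlsqlNLSLanguage DANISH_DENMARK.AL32UTF8
</Location>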
I am using pg_trgm to perform fuzzy string matching where characters can be Chinese. Strangely, on my Ubuntu server everything is fine, as follows:
SELECT show_trgm('原作者');
> {0xa09182,0xcdfdbb,0x183afe,leD}
However, on my Mac, it does not work:
SELECT show_trgm('原作者');
> {}
I guess it is due to some strange encoding stuff, but I have examined all the settings I can possibly imagine, including:
SHOW SERVER_VERSION;
SHOW SERVER_ENCODING;
SHOW LC_COLLATE;
SHOW LC_CTYPE;
Where on Ubuntu it shows:
9.5.1
UTF8
en_US.UTF-8
en_US.UTF-8
and on Mac it shows:
9.5.3
UTF8
en_US.UTF-8
en_US.UTF-8
Also, the pg_trgm versions are both 1.1, according to SELECT * FROM pg_extension.
Could anyone help me figure out why pg_trgm does not work on Unicode on my Mac?
The reason for this is that pg_trgm depends on libc (the system library shipped with the OS) routines to classify which characters are alphabetic and which aren't, and this (unfortunately) differs between OSes. Apple Mac OS X is known for interpreting UTF-8 differently from other Unix and Unix-like systems. Character classification is locale-dependent and is driven by the LC_CTYPE category (and the environment variable of the same name).
Check the output of postgres=# \l and you should see a Ctype column, which tells you how characters are classified in your database.
If this is C (I have seen that on Apple Mac OS X before), try creating the database again, specifying CREATE DATABASE foo ... LC_CTYPE="en_US.UTF-8"
If it is already en_US.UTF-8, it is very likely that Mac OS X doesn't classify UTF-8 Chinese characters as alphabetic in this locale (not surprising). Try LC_CTYPE="zh_CN.UTF-8" instead and that should work.
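As a quick sketch of the check, run from a psql session on the database in question:
-- which collation/ctype was the current database created with?
SELECT datname, datcollate, datctype FROM pg_database WHERE datname = current_database();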
On macOS, this is an issue of character encoding. You have to set the encoding explicitly to match the language; the default en_US.UTF-8 definitely won't work. So:
Chinese : LC_CTYPE="zh_CN.UTF-8"
Likewise, the locale should be changed according to the language. After all, there is no point in encoding/decoding Chinese with a US English locale.
You can create the database like this:
CREATE DATABASE mydb WITH ENCODING='UTF8' LC_CTYPE='zh_CN.UTF-8' LC_COLLATE='zh_CN.UTF-8' OWNER=postgres TEMPLATE=template0 CONNECTION LIMIT=-1;
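Then, as a sanity check (a sketch; it assumes you connect to the new mydb and re-create the extension there):
\c mydb
CREATE EXTENSION pg_trgm;
SELECT show_trgm('原作者');  -- should now return trigrams instead of {}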
Several years ago, Apple released a document that outlines the mappings between Apple's "Mac OS Japanese" Character Set and Unicode code points. (ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT)
Microsoft provides the function, MultiByteToWideChar, to assist with mapping characters to a UTF-16 string.
MultiByteToWideChar works correctly for some Japanese characters in Apple's legacy character set (see the FTP link above), but returns "no mapping available" for others. (For example, 0x85BE is supposed to map to U+217B, SMALL ROMAN NUMERAL TWELVE, but it fails.)
I am using code page 10001 (Japanese-Mac).
Am I overlooking something obvious or is the code page for mapping Japanese-Mac to UTF-16 simply incomplete on Windows?
x-mac-japanese is usually treated as SHIFT_JIS by Windows, and the problem is that x-mac-japanese is a superset of SHIFT_JIS, so stuff will be missing. For instance, there is nothing in the 0x85xx range in SHIFT_JIS.
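A minimal sketch of reproducing the failure with MB_ERR_INVALID_CHARS (the byte pair and buffer size are just the example from the question; without that flag Windows would substitute a default character instead of failing):
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* 0x85BE maps to U+217B in Apple's Mac OS Japanese table,
       but code page 10001 on Windows may have no mapping for it. */
    const char src[] = "\x85\xBE";
    wchar_t dst[8];

    int n = MultiByteToWideChar(10001,                /* x-mac-japanese */
                                MB_ERR_INVALID_CHARS, /* fail instead of substituting */
                                src, 2, dst, 8);
    if (n == 0)
        printf("no mapping, error %lu\n", (unsigned long)GetLastError());
    else
        printf("mapped to U+%04X\n", (unsigned)dst[0]);
    return 0;
}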
I'm looking at some X11 code that uses XmbTextListToTextProperty to set the WM_NAME property, with encoding style XTextStyle.
http://tronche.com/gui/x/xlib/ICC/client-to-window-manager/XmbTextListToTextProperty.html suggests XTextStyle means the type/encoding of the property will depend on the current locale.
I'm not sure how to interpret http://tronche.com/gui/x/icccm/sec-4.html#s-4.1.2.1; it seems to allow the type of WM_NAME to depend on the current locale.
My current locale is 'en_US.UTF-8'. Everything I've seen so far suggests that the type of WM_NAME should be of type STRING, COMPOUND_STRING or UTF8_STRING.
However, xprop reports UTF-8, and xwininfo reports 'name in unsupported encoding UTF-8'. Checking the code, indeed it has support for UTF8_STRING but not UTF-8.
I'm at a loss as to where this UTF-8 comes from. Any ideas?
It looks like, besides the standard types STRING, COMPOUND_STRING and UTF8_STRING (the latter is an XFree86 extension), it is also acceptable to have any multibyte encoding.
When passing XTextStyle, XmbTextListToTextProperty will simply take the encoding from the current locale. In the en_US.UTF-8 locale, that is UTF-8. To get the standardized (by XFree86) UTF8_STRING type for the property, we need to pass XUTF8StringStyle to XmbTextListToTextProperty instead of XTextStyle.
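A minimal sketch of that change, assuming an already-open Display dpy and an existing Window win (the helper name is made up):
#include <X11/Xlib.h>
#include <X11/Xutil.h>

/* Set WM_NAME with the standardized UTF8_STRING type instead of the
   locale-dependent encoding. */
void set_utf8_title(Display *dpy, Window win, char *title)
{
    XTextProperty prop;
    char *list[] = { title };

    /* XUTF8StringStyle (an XFree86/X.Org extension) forces the property
       type to UTF8_STRING regardless of the current locale. */
    if (XmbTextListToTextProperty(dpy, list, 1, XUTF8StringStyle, &prop) == Success) {
        XSetWMName(dpy, win, &prop);
        XFree(prop.value);
    }
}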
File#path is giving me Latin-1 characters -- is there a way to get it to give me utf8 characters, or should I just convert what it returns? If so, what's the best/easiest way to convert?
Elaboration
So, I know I can do this:
Iconv.new('UTF-8','LATIN1').iconv(File.basename(file.path))
But I'm wondering if there is a more elegant way to tell File to give me utf8 to begin with.
This is especially important because for some reason I get back a different charset on different systems. On my OS X dev machine, it looks like I get back utf8. On my linux server, latin-1.
Use a magic comment in the first line of your file:
#encoding: UTF-8
See $LANG and $LC_CTYPE (environment variables).
These variables also determine the encoding defaults in 1.9, so changes you make today will also work if you later port your code to 1.9.
N.B. Windows is a slightly different beast in this regard, so you may need further information to tackle that.
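For completeness, a sketch of the explicit conversion on Ruby 1.9+, assuming the path really does come back as Latin-1 (as on the Linux server described above) and that file is the File object from the question:
name = File.basename(file.path)
# tag the bytes with the encoding the OS actually returned, then transcode
name = name.force_encoding('ISO-8859-1').encode('UTF-8')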