Oracle, utf-8, NVARCHAR2, and a lot of confusion

I have the following "translation" table in an oracle 10g database:
ID VARCHAR2(100 BYTE)
LANGUAGE CHAR(2 BYTE)
COUNTRY CHAR(2 BYTE)
TRANSLATION NVARCHAR2(2000 CHAR)
TRACK_TIMESTAMP DATE
TRACK_USER VARCHAR2(2000 BYTE)
When I try to do this:
update translation set translation = 'œ' where id = 'MY_ID' and language = 'fr';
Then I run this:
select * from translation where id = 'MY_ID' and language = 'fr';
and the translation column shows S instead of œ, and I have no idea why.
Due to legacy issues I cannot convert the whole database to use UTF-8; are there any other options?
Currently the national character set is AL16UTF16. The ordinary character set is WE8ISO8859P1.
I am currently using Java 1.6.
The above is a simplified example. Here is what the query looks like in my actual application:
UPDATE TRANSLATION SET TRANSLATION=? WHERE TRANSLATION.COUNTRY=? and TRANSLATION.ID=? and TRANSLATION.LANGUAGE=?
with bind values:
1 = 1,800 - 2,500 œufs par heure
2 = CA
3 = 3_XT_FE_ECS18
4 = fr
The problem here is that instead of storing œufs it stores ¿ufs.

Since you are using bind variables rather than hard-coded literals, you should be able to pass Unicode strings to your UPDATE statement.
If you were using straight JDBC to write to the database, there is an example in the JDBC Developer's Guide on writing data to an NVARCHAR2 column. If you are using a 1.5 JVM, it is necessary to call OraclePreparedStatement.setFormOfUse for each NVARCHAR2 column. In a 1.6 JVM, life gets easier because JDBC 4.0 added native support for the NCHAR and NVARCHAR types (e.g. PreparedStatement.setNString). If you are using a 1.5 JVM, getting an ORM framework like Spring to use the Oracle extensions to JDBC may be a non-trivial undertaking. I'm not familiar enough with Spring to know what steps would be necessary for that to happen.
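For reference, here is a minimal sketch of what that looks like on a 1.6 JVM with the ojdbc6 (JDBC 4.0) driver; the connection URL, credentials, and bind values are placeholders, and the commented lines show the 1.5-era Oracle extension instead:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class NVarchar2UpdateSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "someuser", "somepassword");
        try {
            PreparedStatement ps = conn.prepareStatement(
                    "UPDATE translation SET translation = ? WHERE id = ? AND language = ?");
            // JDBC 4.0 (Java 6): setNString marks the bind as national-character data,
            // so the driver converts it using the national character set (AL16UTF16)
            // instead of the database character set (WE8ISO8859P1).
            ps.setNString(1, "1,800 - 2,500 œufs par heure");
            ps.setString(2, "MY_ID");
            ps.setString(3, "fr");
            ps.executeUpdate();   // auto-commit is on by default for this sketch
            ps.close();

            // On a 1.5 JVM you would instead cast to oracle.jdbc.OraclePreparedStatement
            // and call setFormOfUse(1, OraclePreparedStatement.FORM_NCHAR) before setString(1, ...).
        } finally {
            conn.close();
        }
    }
}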
Potentially, you may be able to set the connection property defaultNChar=true (or the system property oracle.jdbc.defaultNChar=true). That forces the driver to treat all character binds as national-character-set data, which may be enough to resolve your problem without getting Spring to use the OraclePreparedStatement extensions.
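A rough sketch of that approach, assuming the Oracle thin driver and placeholder connection details; the same property can usually be set on a pooled DataSource instead:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class DefaultNCharConnectionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "someuser");          // placeholder credentials
        props.setProperty("password", "somepassword");
        // Treat every character bind as national-character data, so NVARCHAR2 columns
        // receive correctly converted Unicode without per-statement code changes.
        props.setProperty("defaultNChar", "true");

        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", props);
        // ... hand this connection (or the equivalent DataSource configuration) to Spring.
        conn.close();
    }
}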

Related

Implementing Japanese localization

I've been tasked with determining if our web platform can be 'localized' to Japanese, and how to do so. The platform is PL/SQL based in an Oracle 10g database. We have localized it for French Canadian and Brazilian Portuguese in the past, but I'm wondering what issues I may run into with Japanese (Kanji, I believe). Am I correct that Japanese is a double-byte char set while the others we've used are single-byte? How will this impact code and/or database table structure and access?
The various sentences/phrases/statements are stored in a database table and are looked up as needed based on the user's id and language setting. The table field that stores the 'text' is defined as a CLOB. It's often read into a VARCHAR2 variable.
I tried to copy/paste some Japanese characters into the table via a direct paste to the field in a TOAD schema browser. That resulted in '??' being displayed.
Is there anything I have to do in order to be able to store Japanese characters in that table? Or access/display them from that table?
Check your database character set by
SELECT *
FROM V$NLS_PARAMETERS
WHERE PARAMETER IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');
If the character set supports Japanese (e.g. AL32UTF8), it should be no big deal to localize your application to Japanese as well. Changing the character set of an existing database is also possible but requires some effort; see Character Set Migration.
Also check this answer for topics related to the database character set vs. the client character set, i.e. the NLS_LANG setting.
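If a Java client is available, one quick sanity check is to query those parameters over JDBC and round-trip a Japanese string through a bind variable; if the value reads back intact, storage is fine and any '??' shown by a GUI tool is a client-side (NLS_LANG or font) display issue. This is only a sketch; the connection details and the test table are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class JapaneseRoundTripCheck {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "someuser", "somepassword");
        try {
            // 1. Report the database and national character sets.
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery(
                    "SELECT parameter, value FROM v$nls_parameters " +
                    "WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET')");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " = " + rs.getString(2));
            }
            rs.close();

            // 2. Round-trip a Japanese string ("日本語") through a bind variable.
            String original = "\u65e5\u672c\u8a9e";
            PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO i18n_test (id, text_value) VALUES (?, ?)");
            ins.setInt(1, 1);
            ins.setString(2, original);
            ins.executeUpdate();
            ins.close();

            ResultSet back = st.executeQuery("SELECT text_value FROM i18n_test WHERE id = 1");
            if (back.next()) {
                System.out.println("round-trip ok: " + original.equals(back.getString(1)));
            }
            back.close();
            st.close();
        } finally {
            conn.close();
        }
    }
}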

How to change both charset (NLS_CHARACTERSET) and National Charset (NLS_NCHAR_CHARACTERSET) to UTF8 on a RDS Oracle database?

Our application is designed to work with an Oracle 11g database with Charset (NLS_CHARACTERSET) and National Charset (NLS_NCHAR_CHARACTERSET) both set to UTF8.
While launching an Oracle database instance on Amazon Relational Database Service (RDS), I'm prompted to choose a Charset, which I set to UTF8.
However, I was unable to find a way to set the National Charset, and this parameter is set to AL16UTF16 during database creation.
I tried the following:
- Created a new Parameter Group to set the NLS_NCHAR_CHARACTERSET, but the parameter isn't listed. I also tried unsuccessfully to force the creation of a new Parameter Group with this parameter using the AWS CLI.
- Tried to ALTER the database, but the SYSDBA role is not available on Amazon managed database instances.
- Created different Oracle RDS databases with different Charset parameters to check whether the National Charset is affected, but it is still set to AL16UTF16.
Is there any way to do it?
The parameter can be set by specifying --character-set-name during instance creation with the AWS CLI. So far I have not found any way to change it for an existing instance.
In my testing, I set it using --character-set-name KO16MSWIN949, which supports Korean according to the AWS documentation:
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.OracleCharacterSets.html
AWS RDS has no way to set the Oracle Database National Charset (NLS_NCHAR_CHARACTERSET) to UTF8. NLS_NCHAR_CHARACTERSET will always be AL16UTF16. Data types NVARCHAR2, NCHAR, and NCLOB are affected. For the sake of discussion, I will refer to these data types as NCHAR.
Use of AL16UTF16 has space consequences in a migration. As the name implies, every character is stored in at least 16 bits (2 bytes). For example, the Western letter 'A' will be stored zero-padded as '\0','A'.
Because of this, the space requirement at the migration target could be higher than at the source. How much higher depends on the prevalence of NCHAR columns. 25% higher is an actual example from experience. An 8 TB schema on conventional hardware required 10 TB on AWS RDS.
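To get a feel for that overhead, here is a small Java sketch (the sample value is made up) that compares how many bytes the same mostly-ASCII string needs under a 2-byte encoding, which approximates AL16UTF16 NCHAR storage, versus a single-byte character set and UTF-8:

import java.nio.charset.Charset;

public class NcharSizeDemo {
    public static void main(String[] args) {
        String value = "BUSINESS_UNIT_0001";  // hypothetical, mostly-ASCII business data

        // UTF-16BE approximates AL16UTF16: 2 bytes per (BMP) character.
        int utf16Bytes = value.getBytes(Charset.forName("UTF-16BE")).length;
        // ISO-8859-1 approximates a single-byte database character set such as WE8ISO8859P1.
        int singleByteBytes = value.getBytes(Charset.forName("ISO-8859-1")).length;
        // UTF-8 (AL32UTF8) stores ASCII characters in 1 byte each.
        int utf8Bytes = value.getBytes(Charset.forName("UTF-8")).length;

        // Prints: UTF-16: 36 bytes, ISO-8859-1: 18 bytes, UTF-8: 18 bytes
        System.out.printf("UTF-16: %d bytes, ISO-8859-1: %d bytes, UTF-8: %d bytes%n",
                utf16Bytes, singleByteBytes, utf8Bytes);
    }
}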
If your NLS_CHARACTERSET is AL32UTF8, then one way to prevent migration to the space-wasting AL16UTF16 character set is to migrate your NCHAR columns to the corresponding CHAR types (NVARCHAR2 to VARCHAR2, and so on). Example:
from:
CREATE TABLE ...
( "BUSINESS_UNIT" NVARCHAR2(5) NOT NULL ENABLE,
to:
alter session set NLS_LENGTH_SEMANTICS = 'CHAR';
CREATE TABLE ...
( "BUSINESS_UNIT" VARCHAR2(5) NOT NULL ENABLE,
etc.
Using CloudFormation you can also set it, via the CharacterSetName property of AWS::RDS::DBInstance.

Migrating an oracle database from a non unicode server to a unicode server

I want to move an oracle database from a non-unicode server (EL8ISO8859P7 character set and AL16UTF16 NCHAR character set) to a unicode server. Specifically to an Oracle Express server with AL32UTF8 character set.
Simply exporting (exp) and importing (imp) the data fails: we have a lot of VARCHAR2 columns whose length is specified in bytes, and when their contents are converted to Unicode they need more bytes and get truncated.
I tried the following:
- doubling the length of all varchar2 columns of the original database with a script (varchar2(10) becomes varchar2(20))
- exporting
- importing to the new server
And it worked. Apparently doubling is arbitrary; I probably should have changed them to the same size with CHAR semantics.
I also tried the following:
- changing all varchar2 columns to nvarchar2 (same size - varchar2(10) becomes nvarchar2(10))
- exporting
- importing to the new server
It also worked.
Somehow the latter (converting to nvarchar2) seems "cleaner". Then again, you end up with a Unicode database that also uses the Unicode data types, which seems weird.
So the question is: is there a suggested way to go about moving the database between the two servers? Is there any serious problem with either of the two approaches I mentioned above?
Don't use NVARCHAR2 data types unless that is your only option. The national character set exists to deal with cases where you have an existing, legacy application that does not support Unicode and you want to add a handful of columns to the system that do support Unicode without touching those legacy applications. Using NVARCHAR2 columns is great for those cases but it creates all sorts of issues in application development. Plenty of tools, APIs, and applications either don't support NVARCHAR2 columns or require additional configuration to do so. And since NVARCHAR2 columns are relatively uncommon in the Oracle world, it's very easy to spend gobs of time trying to resolve the particular issues you encounter. Less critically, since AL16UTF16 requires at least 2 bytes per character, you're likely to require quite a bit more space since much of your data is likely to consist of English characters.
I would strongly prefer migrating to the new database with character-length semantics (i.e. VARCHAR2(10 BYTE) becomes VARCHAR2(10 CHAR)). That avoids doubling the allowed length. It also makes it much easier to explain to users what the length limits are (or to code those validations in front-ends). It's terribly confusing to most users to explain that a particular column can sometimes hold 20 characters (when only English characters are used), can sometimes hold 10 characters (when only non-English characters are used), and can sometimes hold something in the middle (when there is a mixture of characters). Character length semantics make all those issues drastically easier.
Migrating to Unicode databases is a four-step process:
1. Use exp[dp] to export the data and generate the DDL for the tables.
2. Alter the DDL to change the byte-length varchar2 fields to character-length fields.
3. Create the tables using the modified DDL script.
4. Import the data using imp[dp].
Skipping steps 2 and 3 leaves you with byte-length fields again and probably a lot of errors during import, because data doesn't fit into the defined columns. If the source database contains only US-ASCII characters it won't be a big problem, but accented Latin characters, for example, will cause problems because a single character can need more than one byte.
Following the listed procedure prevents the length problems. There are obviously more ways to do this, but the rule is to get the DDL definitions right first and load the data afterwards.

Defining a Character Set for a column For oracle database tables

I am running the following query in SQL*Plus:
CREATE TABLE tbl_audit_trail (
id NUMBER(11) NOT NULL,
old_value varchar2(255) NOT NULL,
new_value varchar2(255) NOT NULL,
action varchar2(20) CHARACTER SET latin1 NOT NULL,
model varchar2(255) CHARACTER SET latin1 NOT NULL,
field varchar2(64) CHARACTER SET latin1 NOT NULL,
stamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
user_id NUMBER(11) NOT NULL,
model_id varchar2(65) CHARACTER SET latin1 NOT NULL,
PRIMARY KEY (id),
KEY idx_action (action)
);
I am getting following error:
action varchar2(20) CHARACTER SET latin1 NOT NULL,
*
ERROR at line 5:
ORA-00907: missing right parenthesis
Can you suggest what am I missing?
The simple answer is that, unlike MySQL, character sets can't be defined at column (or table) level. Latin1 is not a valid Oracle character set either.
Character sets are consistent across the database and will have been specified when you created the database. You can find your character set by querying NLS_DATABASE_PARAMETERS:
select value
from nls_database_parameters
where parameter = 'NLS_CHARACTERSET'
The full list of possible character sets is available in the documentation for 11g R2 and for 9i, or you can query V$NLS_VALID_VALUES.
It is possible to use the ALTER SESSION statement to set the NLS_LANGUAGE or the NLS_TERRITORY, but unfortunately you can't do this for the character set. I believe this is because altering the language changes how Oracle would display the stored data whereas changing the character set would change how Oracle stores the data.
When displaying the data, you can of course specify the required character set in whichever client you're using.
Character set migration is not a trivial task and should not be done lightly.
On a slight side note, why are you trying to use Latin1? It would be more normal to set up a new database in something like UTF-8 (otherwise known as AL32UTF8 - don't use UTF8) or UTF-16 so that you can store multi-byte data effectively. Even if you don't need it now, it's wise to attempt - no guarantees in life - to future-proof your database so there is no need to migrate later.
If you're looking to specify differing character sets for different columns in a database, then the better option would be to determine whether this requirement is really necessary and to try to remove it. If it is definitely necessary [1], then your best bet might be to use a character set that is a superset of all potential character sets. Then, have some sort of check constraint that limits the column to specific hex values. I would not recommend doing this at all: the potential for mistakes to creep in is massive and it's extremely complex. Furthermore, different character sets render the same hex values differently. This, in turn, means that you would need to enforce that a column is rendered in a specific character set, which is impossible as it falls outside the scope of the database.
[1] I'd be interested to know the situation.
According to the provided DDL statement, there seems to be a need for two character sets. Oracle implements this functionality differently from MySQL: it is done with the N* data types such as NVARCHAR2 and NCHAR. Latin1 is similar to the Western European character sets (WE**), one of which might be your default. So you are able to define, for example, "Latin1" (WE**) for the ordinary columns and some Unicode character set (UTF8, ...) for the N* columns.
The NVARCHAR2 datatype was introduced by Oracle for databases that want to use Unicode for some columns while keeping another character set for the rest of the database (which uses VARCHAR2). NVARCHAR2 is a Unicode-only datatype.
The reason you might want to use NVARCHAR2 is that your DB uses a non-Unicode character set and you still want to be able to store Unicode data in some columns.
The columns in your example would be able to store the same data; however, the byte storage will be different.

set locale on Oracle connection

In my company's product, we retrieve results a page at a time from the database. Because of this, all filtering and sorting must be done in the database. One of the problems is coded values: for filtering and sorting to work properly, the code values need to be translated to locale-specific labels in the query.
My plan is to use a table similar to the following:
t_code_to_label (
type varchar2(10),
locale varchar2(10),
code varchar2(10),
label varchar2(50)
)
The first three columns are the primary (unique) key.
When using this table, you would see something like this
select ent.name, ent.ent_type, entlabel.label as ent_type_label
from t_entitlements ent
join t_code_to_label entlabel on entlabel.type='entlabel' and entlabel.locale=currentLocale() and entlabel.code=ent.ent_type
The problem is that currentLocale() is something that I made up. How can I, on the Java side of a JDBC connection, set the locale on the Connection object in a way that a simple function can read on the Oracle side? Ideally this would be true locale support by Oracle, but I can't find such a thing.
I am using Oracle 10g and 11g.
Are you talking about the NLS_LANGUAGE setting of the Oracle database? Would you (from the client side) like to dictate the use of a particular NLS_LANGUAGE by the Oracle database?
Then maybe this would work: Set Oracle NLS_LANGUAGE from java in a webapp
If you want an "all American" session, you could do something like:
ALTER SESSION SET NLS_LANGUAGE= 'AMERICAN' NLS_TERRITORY= 'AMERICA'
NLS_CURRENCY= '$' NLS_ISO_CURRENCY= 'AMERICA'
NLS_NUMERIC_CHARACTERS= '.,' NLS_CALENDAR= 'GREGORIAN'
NLS_DATE_FORMAT= 'DD-MON-RR' NLS_DATE_LANGUAGE= 'AMERICAN'
NLS_SORT= 'BINARY'
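Tying that back to the question, one way to make a hand-written currentLocale() function work is to run such an ALTER SESSION from the Java side whenever a connection is handed out, and have the database-side function read the session's NLS state. A rough sketch under those assumptions (connection details are placeholders, and currentLocale() remains a function you would write yourself):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SessionLocaleSketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "someuser", "somepassword");
        try {
            // Run once per connection (e.g. right after checkout from the pool) to push
            // the user's locale into the Oracle session.
            Statement st = conn.createStatement();
            st.execute("ALTER SESSION SET NLS_LANGUAGE = 'FRENCH' NLS_TERRITORY = 'CANADA'");
            st.close();

            // A database-side currentLocale() function could then derive the locale from the
            // session state, for example from SYS_CONTEXT('USERENV', 'LANG') or
            // SYS_CONTEXT('USERENV', 'NLS_TERRITORY'), and use it in the join shown above.
        } finally {
            conn.close();
        }
    }
}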
