Migrating an Oracle database from a non-Unicode server to a Unicode server - oracle

I want to move an Oracle database from a non-Unicode server (EL8ISO8859P7 character set and AL16UTF16 NCHAR character set) to a Unicode server, specifically to an Oracle Express server with the AL32UTF8 character set.
Simply exporting (exp) and importing (imp) the data fails. We have a lot of varchar2 columns with their length specified in bytes. When their contents are mapped to Unicode they take more bytes and are truncated.
I tried the following:
- doubling the length of all varchar2 columns of the original database with a script (varchar2(10) becomes varchar2(20))
- exporting
- importing to the new server
And it worked. Admittedly, doubling is arbitrary; I probably should have changed them to the same size with CHAR semantics.
I also tried the following:
- change all varchar2 columns to nvarchar2 (same size - varchar2(10) becomes nvarchar2(10))
- exporting
- importing to the new server
It also worked.
Somehow the latter (converting to nvarchar2) seems "cleaner". Then again, ending up with a Unicode database that still uses the national Unicode data types seems weird.
So the question is: is there a suggested way to go about moving the database between the two servers? Is there any serious problem with either of the two approaches I mentioned above?

Don't use NVARCHAR2 data types unless that is your only option. The national character set exists to deal with cases where you have an existing, legacy application that does not support Unicode and you want to add a handful of columns to the system that do support Unicode without touching those legacy applications. Using NVARCHAR2 columns is great for those cases but it creates all sorts of issues in application development. Plenty of tools, APIs, and applications either don't support NVARCHAR2 columns or require additional configuration to do so. And since NVARCHAR2 columns are relatively uncommon in the Oracle world, it's very easy to spend gobs of time trying to resolve the particular issues you encounter. Less critically, since AL16UTF16 requires at least 2 bytes per character, you're likely to require quite a bit more space since much of your data is likely to consist of English characters.
I would strongly prefer migrating to the new database with character-length semantics (i.e. VARCHAR2(10 BYTE) becomes VARCHAR2(10 CHAR)). That avoids doubling the allowed length. It also makes it much easier to explain to users what the length limits are (or to code those validations in front-ends). It's terribly confusing to most users to explain that a particular column can sometimes hold 20 characters (when only English characters are used), can sometimes hold 10 characters (when only non-English characters are used), and can sometimes hold something in the middle (when there is a mixture of characters). Character length semantics make all those issues drastically easier.
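For example, a minimal sketch (the table and column names are hypothetical) of switching an existing column to character-length semantics, or making CHAR the session default before running the DDL:
ALTER TABLE people MODIFY (short_name VARCHAR2(100 CHAR));   -- 100 characters, up to 400 bytes under AL32UTF8
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;               -- subsequent DDL without BYTE/CHAR defaults to CHAR semantics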

Migrating to a Unicode database is a 4-step process.
1. Use exp[dp] to export the data and generate the DDL for the tables.
2. Alter the DDL to change the byte-length VARCHAR2 columns to character-length columns.
3. Create the tables using the modified DDL script.
4. Import the data using imp[dp].
Skipping steps 2 and 3 leaves you with byte-length fields again and probably a lot of errors during import because data doesn't fit in the defined columns. If there are only US-ASCII characters in the source database it won't be a big problem, but accented Latin characters, for example, will cause problems because a single character can need more than one byte.
Following the listed procedure prevents the length problems. There are obviously more ways to do this, but the rule is to get the DDL definitions right first and load the data afterwards.
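A hedged sketch of those four steps using the Data Pump variants (the schema, directory object, and file names are placeholders):
-- 1. export the source schema
expdp system@source schemas=app directory=dp_dir dumpfile=app.dmp logfile=app_exp.log
-- 2. extract only the DDL from the dump (SQLFILE writes the statements to a file, nothing is imported)
impdp system@target directory=dp_dir dumpfile=app.dmp sqlfile=app_ddl.sql
-- 3. edit app_ddl.sql, changing VARCHAR2(n)/VARCHAR2(n BYTE) to VARCHAR2(n CHAR), and run it on the target
-- 4. load only the data into the pre-created tables
impdp system@target directory=dp_dir dumpfile=app.dmp content=data_only table_exists_action=append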

Related

Can N function cause problems with existing queries?

We use Oracle 10g and Oracle 11g.
We also have a layer that automatically composes queries from pseudo-SQL code written in .NET (something like SQLAlchemy for Python).
Our layer currently wraps any string in single quotes and, if it contains non-ANSI characters, automatically composes a UNISTR call with the special characters written as Unicode escapes (like \00E0).
Now we created a method for doing multiple inserts with the following construct:
INSERT INTO ... (...)
SELECT ... FROM DUAL
UNION ALL SELECT ... FROM DUAL
...
This algorithm could compose queries where the same string field is sometimes passed as 'my simple string' and sometimes wrapped as UNISTR('my string with special chars like \00E0').
The described condition causes an ORA-12704: character set mismatch.
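A minimal sketch that reproduces it (the table and column names are made up):
INSERT INTO test_tab (val)
SELECT 'my simple string' FROM DUAL
UNION ALL
SELECT UNISTR('my string with special chars like \00E0') FROM DUAL;
-- fails with ORA-12704 because one branch is a plain CHAR literal and the other returns NCHAR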
One solution is to use the INSERT ALL construct but it is very slow compared to the one used now.
Another solution is to instruct our layer to put N in front of any string (except for the ones already wrapped with UNISTR). This is simple.
I just want to know if this could cause any side-effect on existing queries.
Note: all our fields in the DB are either NCHAR or NVARCHAR2.
Oracle ref: http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch7progrunicode.htm
Basically what you are asking is: is there a difference in how a string is stored with or without the N function?
You can check for yourself; consider:
SQL> create table test (val nvarchar2(20));
Table TEST created.
SQL> insert into test select n'test' from dual;
1 row inserted.
SQL> insert into test select 'test' from dual;
1 row inserted.
SQL> select dump(val) from test;
DUMP(VAL)
--------------------------------------------------------------------------------
Typ=1 Len=8: 0,116,0,101,0,115,0,116
Typ=1 Len=8: 0,116,0,101,0,115,0,116
As you can see, they are identical, so there is no side effect.
The reason this works so well is the design of Unicode.
If you are interested, here is a nice video explaining it:
https://www.youtube.com/watch?v=MijmeoH9LT4
I assume you get the error "ORA-12704: character set mismatch" because data inside plain quotes is treated as CHAR while your columns are NCHAR, so the two sides use different character sets: one uses NLS_CHARACTERSET, the other NLS_NCHAR_CHARACTERSET.
When you use the UNISTR function, it converts data from CHAR to NCHAR (and also decodes escaped values into characters), as the Oracle docs say:
"UNISTR takes as its argument a text literal or an expression that
resolves to character data and returns it in the national character
set."
When you convert values explicitly using N or TO_NCHAR you just get values in NLS_NCHAR_CHARACTERSET without any decoding. If you have values encoded like "\00E0" they will not be decoded and will pass through unchanged.
So if you have an insert such as:
insert into ... select N'my string with special chars like \00E0',
UNISTR('my string with special chars like \00E0') from dual ...
then the value inserted into the first column will be 'my string with special chars like \00E0', not 'my string with special chars like à'. This is the only side effect I'm aware of. Other queries should already be using the NLS_NCHAR_CHARACTERSET encoding, so an explicit conversion shouldn't cause any problem.
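A quick way to see that difference (the column aliases are just for illustration):
SELECT N'like \00E0' AS with_n,
       UNISTR('like \00E0') AS with_unistr
FROM DUAL;
-- with_n keeps the literal text 'like \00E0'; with_unistr decodes it to 'like à'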
And by the way, why not just insert all values as N'my string with special chars like à'? Just encode them into UTF-16 (I assume you use UTF-16 for nchars) first if the 'upper level' software uses a different encoding.
Use of the N function - you already have answers above.
If you have any chance to change the character set of the database, that would really make your life easier. I have worked on huge production systems, and the trend is that, because storage space is cheap, practically everyone moves to AL32UTF8 and the hassle of internationalization slowly becomes a painful memory of the past.
I found the easiest thing is to use AL32UTF8 as the character set of the database instance and simply use varchar2 everywhere. We're reading and writing standard Java Unicode strings via JDBC as bind variables without any harm or fiddling.
Your idea of constructing a huge SQL text of inserts may not scale well, for multiple reasons:
- there is a maximum allowed length for a SQL statement, so it won't work with 10,000 inserts
- it is advised to use bind variables anyway (and then you don't have the n'xxx' vs UNISTR mess either); see the sketch after this list
- building a new SQL statement dynamically for every call is very resource-unfriendly: it does not allow Oracle to cache any execution plan, and forces Oracle to hard-parse your long statement on each call
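As a hedged sketch of the bind-variable point (the table and column names are hypothetical), the statement text stays constant and can be parsed once, while the Unicode value travels as a bind:
DECLARE
  v NVARCHAR2(100) := UNISTR('my string with special chars like \00E0');
BEGIN
  EXECUTE IMMEDIATE 'insert into my_table (val) values (:1)' USING v;
  COMMIT;
END;
/
The same idea applies to a JDBC PreparedStatement: one parameterized statement reused for every row.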
What you're trying to achieve is a mass insert. Use the JDBC batch mode of the Oracle driver to perform that at light-speed, see e.g.: http://viralpatel.net/blogs/batch-insert-in-java-jdbc/
Note that insert speed is also affected by triggers (which have to be executed) and foreign key constraints (which have to be validated). So if you're about to insert more than a few thousand rows, consider disabling the triggers and foreign key constraints and re-enabling them after the insert. (You'll lose the trigger calls, but re-enabling the constraints will still validate the inserted rows, which has a cost of its own.)
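For example (the object names are made up):
ALTER TRIGGER my_table_trg DISABLE;
ALTER TABLE my_table DISABLE CONSTRAINT my_table_fk;
-- ... perform the batched inserts ...
ALTER TABLE my_table ENABLE CONSTRAINT my_table_fk;
ALTER TRIGGER my_table_trg ENABLE;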
Also consider the rollback segment size. If you're inserting a million records, that will need a huge rollback segment, which will likely cause serious swapping on the storage media. A good rule of thumb is to commit after every 1,000 records.
(Oracle uses versioning instead of shared locks, therefore a table with uncommitted changes is still consistently readable. A commit every 1,000 records means roughly one commit per second - slow enough to benefit from write buffers, but frequent enough not to interfere with other users who want to update the same table.)

DB2, special character occupies 2 bytes

I have a problem inserting special characters (á é í ú or ñ) in a char(1) field.
CREATE TABLE sgc2."tabtest2"(field1 CHAR(1), field2 VARCHAR(1));
INSERT INTO sgc2."tabtest2" values('á', 'á');
ERROR:
Value "á" is too long.. SQLCODE=-433, SQLSTATE=22001, DRIVER=4.13.111
Apparently inserting these characters takes two bytes, and since the field only accepts one byte the insert cannot complete.
Is there any way to configure the database to support these special characters using only 1 byte?
Apparently your database was created with the Unicode codeset, where special characters are represented by multiple bytes. If you only need to represent a limited range of accented characters you can choose one of the supported codesets, specified by ISO-8859, for the corresponding language -- details in the manual. You will have to re-create the database using an appropriate CODESET option, as you cannot change the codeset of an existing database.
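A hedged sketch of that option from the DB2 command line (the database name and territory are placeholders; pick the codeset that matches your language):
CREATE DATABASE sgc2db USING CODESET ISO8859-1 TERRITORY ES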
However, you should consider changing your tables instead, as Unicode gives you more flexibility. A Unicode database can also be a requirement for certain DB2 features, for example BLU Acceleration.

How did the Unicode characters end up in the database table column?

Recently I came across a unicode character (\u2019) in a database table column while parsing using Python.
Question: What are the reasons that can result in Unicode characters showing up in a database table? Is it a data entry issue?
Appreciate any input.
When you set up your Oracle Database you choose a character set which will be used in the SQL char datatypes (char, varchar2 etc).
Suppose you chose your character set and you have a table with a column of VARCHAR2 type. Suddenly you need to store a string with non-ASCII symbols not supported by your database (i.e. by the chosen character set). You may convert this string into an ASCII string by calling the ASCIISTR function, for example, and store it in your VARCHAR2 column (but it's not a good idea, because many SQL built-in functions don't understand the escaped form '\u2019' - they treat it as just 6 ordinary characters). That's how Unicode escapes may appear in your table column: ASCIISTR converts non-ASCII symbols into a Unicode escape representation such as '\u2019'.
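A small sketch of that round trip (the é below is built with UNISTR just to keep the example ASCII-only):
SELECT ASCIISTR(UNISTR('caf\00E9')) FROM DUAL;   -- returns 'caf\00E9', i.e. ASCIISTR('café')
SELECT UNISTR('caf\00E9') FROM DUAL;             -- returns 'café'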
Another option is the special Oracle NCHAR data types, which were designed to store Unicode without altering the global database settings.
Here is the link with Oracle documentation: https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch6unicode.htm

Why is it necessary to specify the length of a column in a table

I have always wondered why we should limit a column's length in a database table to something other than the default.
E.g. I have a column short_name in my table People; the default length for the column is 255 characters, but I restrict it to 100 characters. What difference will it make?
The string will be truncated to the maximum length (usually measured in characters).
The way it is actually implemented is up to the database engine you use.
For example:
- CHAR(30) will always use up 30 characters of storage in MySQL, which allows MySQL to speed up access because it can predict the value length without parsing anything;
- VARCHAR(30) in MySQL will reject over-long strings when strict mode is on; otherwise they are silently truncated to 30 characters (see the sketch after this list);
- in SQLite, you can store strings in any type of column, ignoring the declared type.
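A hedged sketch of the MySQL behavior described above (the table name is made up; the exact error and warning texts vary by version):
CREATE TABLE t_demo (v VARCHAR(5));
SET SESSION sql_mode = 'STRICT_ALL_TABLES';
INSERT INTO t_demo VALUES ('abcdefgh');   -- rejected: Data too long for column 'v'
SET SESSION sql_mode = '';
INSERT INTO t_demo VALUES ('abcdefgh');   -- accepted with a warning; stored as 'abcde'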
The reason many SQL features are supported in those database engines even though they are not used, or are used in different ways, is to maintain compliance with the SQL standard.

Difference between VARCHAR2(10 CHAR) and NVARCHAR2(10)

I've installed Oracle Database 10g Express Edition (Universal) with the default settings:
SELECT * FROM NLS_DATABASE_PARAMETERS;
NLS_CHARACTERSET AL32UTF8
NLS_NCHAR_CHARACTERSET AL16UTF16
Given that both CHAR and NCHAR data types seem to accept multi-byte strings, what is the exact difference between these two column definitions?
VARCHAR2(10 CHAR)
NVARCHAR2(10)
The NVARCHAR2 datatype was introduced by Oracle for databases that want to use Unicode for some columns while keeping another character set for the rest of the database (which uses VARCHAR2). The NVARCHAR2 is a Unicode-only datatype.
One reason you may want to use NVARCHAR2 might be that your DB uses a non-Unicode character set and you still want to be able to store Unicode data for some columns without changing the primary character set. Another reason might be that you want to use two Unicode character sets (AL32UTF8 for data that comes mostly from Western Europe, AL16UTF16 for data that comes mostly from Asia, for example) because different character sets won't store the same data equally efficiently.
Both columns in your example (Unicode VARCHAR2(10 CHAR) and NVARCHAR2(10)) would be able to store the same data, however the byte storage will be different. Some strings may be stored more efficiently in one or the other.
Note also that some features won't work with NVARCHAR2, see this SO question:
Oracle Text will not work with NVARCHAR2. What else might be unavailable?
I don't think the answer from Vincent Malgrat is correct. When NVARCHAR2 was introduced a long time ago, nobody was even talking about Unicode.
Initially Oracle provided VARCHAR2 and NVARCHAR2 to support localization. Common data (including PL/SQL) was held in VARCHAR2, most likely US7ASCII in those days. Then you could apply NLS_NCHAR_CHARACTERSET individually (e.g. WE8ISO8859P1) for each of your customers in any country without touching the common part of your application.
Nowadays the character set AL32UTF8, which fully supports Unicode, is the default. In my opinion there is no reason anymore to use NLS_NCHAR_CHARACTERSET, i.e. NVARCHAR2, NCHAR, NCLOB. Note that there are more and more Oracle native functions which do not support NVARCHAR2, so you should really avoid it. Maybe the only reason left is when you have to support mainly Asian characters, where AL16UTF16 consumes less storage compared to AL32UTF8.
The NVARCHAR2 data type stores variable-length character data. When you create a table with an NVARCHAR2 column, the maximum size is always in character length semantics, which is also the default and only length semantics for the NVARCHAR2 data type.
The NVARCHAR2 data type uses the AL16UTF16 character set, which encodes Unicode data in the UTF-16 encoding. AL16UTF16 uses 2 bytes to store a character. In addition, the maximum byte length of an NVARCHAR2 depends on the configured national character set.
VARCHAR2: the maximum size can be specified in either bytes or characters. A VARCHAR2 column can only store characters in the database's default character set, while NVARCHAR2 can store virtually any character. A single character may require up to 4 bytes.
By defining the field as:
- VARCHAR2(10 CHAR), you tell Oracle it can use enough space to store 10 characters, no matter how many bytes it takes to store each one. A single character may require up to 4 bytes.
- NVARCHAR2(10), you tell Oracle it can store 10 characters with 2 bytes per character.
In summary:
- VARCHAR2(10 CHAR) can store a maximum of 10 characters and up to 40 bytes (depending on the configured database character set).
- NVARCHAR2(10) can store a maximum of 10 characters and up to 20 bytes (depending on the configured national character set).
Note: the character sets involved can be UTF-8, UTF-16, etc.
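To see the byte difference in practice, here is a small sketch in the same style as the earlier DUMP example (the table name is made up; the output assumes AL32UTF8 / AL16UTF16):
CREATE TABLE len_test (v VARCHAR2(10 CHAR), nv NVARCHAR2(10));
INSERT INTO len_test VALUES ('abc', 'abc');
SELECT DUMP(v) AS v_dump, DUMP(nv) AS nv_dump FROM len_test;
-- v_dump : Typ=1 Len=3: 97,98,99          (AL32UTF8: 1 byte per ASCII character)
-- nv_dump: Typ=1 Len=6: 0,97,0,98,0,99    (AL16UTF16: 2 bytes per character)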
Please have a look at this tutorial for more detail.
Have a good day!
NVARCHAR2 is Unicode-only storage.
Though both are variable-length string data types, you can notice the difference in how they store values.
Each character is stored as one or more bytes. As we know, not all alphabets need the same amount of storage per character: English needs 1 byte per character, while languages like Japanese or Chinese need more than 1 byte per character.
When you specify varchar2(10) (with the default byte semantics), you are telling the DB that only 10 bytes of data will be stored. But when you say nvarchar2(10), it means 10 characters will be stored. In that case, you don't have to worry about the number of bytes each character takes.
