How did the Unicode characters end up in the database table column? - oracle

Recently I came across a Unicode character (\u2019) in a database table column while parsing it with Python.
Question: What can cause Unicode characters to show up in a database table like this? Is it a data entry issue?
Appreciate any input.

When you set up your Oracle Database you choose a character set which will be used in the SQL char datatypes (char, varchar2 etc).
Suppose you chose your character set and you have a table with a column of type VARCHAR2. Suddenly you need to store a string with non-ASCII symbols that your chosen character set does not support. You may convert that string into an ASCII string, for example by calling the ASCIISTR function, and store the result in your VARCHAR2 column (not a good idea, though, because many SQL built-in functions don't understand '\u2019'; to them it is just 6 separate symbols). That is how Unicode escapes may appear in your table column: ASCIISTR converts non-ASCII symbols into an escaped representation of their code points, such as '\2019' for the right single quotation mark the question mentions.
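To illustrate, here is a minimal sketch of that round trip against DUAL (no real table needed); note that ASCIISTR uses the \XXXX escape form:
-- Build a string containing the right single quotation mark (U+2019) with UNISTR,
-- then escape it again with ASCIISTR:
SELECT ASCIISTR(UNISTR('It\2019s a test')) AS escaped FROM dual;
-- ESCAPED
-- ---------------
-- It\2019s a test
-- Once stored as escaped text, built-in functions treat the escape as plain symbols:
SELECT LENGTH('It\2019s a test') AS len FROM dual;   -- 15: the backslash escape is just text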
Another option is the special Oracle NCHAR datatypes (NCHAR, NVARCHAR2, NCLOB), which were designed to store Unicode without altering the global database settings.
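A small sketch of that route (the table and column names are made up for the example):
-- NVARCHAR2 stores Unicode in the national character set (usually AL16UTF16),
-- independently of the database character set:
CREATE TABLE quotes_demo (txt NVARCHAR2(50));
INSERT INTO quotes_demo (txt) VALUES (UNISTR('It\2019s a test'));
-- The actual character is stored, not a 6-symbol escape sequence:
SELECT LENGTH(txt) AS len FROM quotes_demo;   -- 11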
Here is the link with Oracle documentation: https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch6unicode.htm

Related

Special character conversion issue in Datastage

In Datastage, our source system is Oracle and our target system is Netezza. In Oracle the column datatype is varchar, whereas in Netezza it is nvarchar. Most of the characters are Latin and Dutch.
We are getting a character in our table rows that is the mirror image of the one shown in brackets (`), i.e. it leans towards the right and slants to the left (mostly in the Dutch text). We believe it is the character used in Dutch to represent an apostrophe. The table consists of millions of records and many values in the table contain this special character. We want to process the value as it is, but we are getting a garbage value instead. Can anyone tell us which conversion function we should try?
I have tried ISO-8859-1 and ISO-8859-15.

Declaring a CLOB in an Oracle database with a custom charset

Is it possible to declare a UTF-8 CLOB if the database is set up with the following character sets?
PARAMETER               VALUE
NLS_CHARACTERSET        CL8ISO8859P5
NLS_NCHAR_CHARACTERSET  AL16UTF16
I tried passing a charset name to the declaration, but it looks like it can only accept references to character sets of other objects.
declare
  clob_1 clob character set "AL32UTF8";
begin
  null;
end;
/
I don't think this is possible; see PL/SQL Language Fundamentals:
PL/SQL uses the database character set to represent:
Stored source text of PL/SQL units
Character values of data types CHAR, VARCHAR2, CLOB, and LONG
So, in your case you have to use NCLOB (which uses AL16UTF16) or try a workaround with BLOB. However, this might become cumbersome.
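A minimal sketch of the NCLOB route (the variable name is mine):
declare
  l_text nclob;  -- stored in the national character set (AL16UTF16 here)
begin
  -- a national character literal can hold text outside CL8ISO8859P5
  l_text := n'Unicode text, independent of the database character set';
  dbms_output.put_line(length(l_text));
end;
/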
As far as I can tell, you can't do that.
The database character set is defined during database creation (and can't be changed unless you recreate the database), and all character datatype columns store data in that character set.
Perhaps you could try the NCLOB data type, where "N" stands for "national character set"; it will store Unicode character data.
Unicode is a universal encoded character set that can store information in any language using a single character set

DB2, special character occupies 2 bytes

I have a problem inserting special characters (á é í ú or ñ) into a CHAR(1) field.
CREATE TABLE sgc2."tabtest2"(field1 CHAR(1), field2 VARCHAR(1));
INSERT INTO sgc2."tabtest2" values('á', 'á');
ERROR:
Value "á" is too long.. SQLCODE=-433, SQLSTATE=22001, DRIVER=4.13.111
Apparently inserting these characters takes two bytes, and since the field only accepts one byte, the insert cannot complete.
Is there any way to configure the database so that it supports these special characters using only 1 byte?
Apparently your database was created with the Unicode codeset, where special characters are represented by multiple bytes. If you only need to represent a limited range of accented characters you can choose one of the supported codesets, specified by ISO-8859, for the corresponding language -- details in the manual. You will have to re-create the database using an appropriate CODESET option, as you cannot change the codeset of an existing database.
However, you should consider changing your tables instead, as Unicode gives you more flexibility. A Unicode database can also be a requirement for certain DB2 features, for example BLU Acceleration.
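As a rough sketch of both options (the database name, territory and new column length are placeholders; CREATE DATABASE is run from the DB2 command line processor):
-- Option 1: re-create the database with a single-byte codeset
CREATE DATABASE sgc2db USING CODESET ISO8859-1 TERRITORY ES
-- Option 2: keep the Unicode database and size the column in bytes,
-- so that a 2-byte UTF-8 character such as 'á' still fits
ALTER TABLE sgc2."tabtest2" ALTER COLUMN field2 SET DATA TYPE VARCHAR(4);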

Migrating an Oracle database from a non-Unicode server to a Unicode server

I want to move an Oracle database from a non-Unicode server (EL8ISO8859P7 character set and AL16UTF16 NCHAR character set) to a Unicode server, specifically to an Oracle Express server with the AL32UTF8 character set.
Simply exporting (exp) and importing (imp) the data fails. We have a lot of VARCHAR2 columns with their length specified in bytes. When their contents are mapped to Unicode they take up more bytes and get truncated.
I tried the following:
- doubling the length of all varchar2 columns of the original database with a script (varchar2(10) becomes varchar2(20))
- exporting
- importing to the new server
And it worked. Admittedly, doubling is arbitrary; I probably should have changed them to the same size using CHAR semantics instead.
I also tried the following:
- change all varchar2 columns to nvarchar2 (same size: varchar2(10) becomes nvarchar2(10))
- exporting
- importing to the new server
It also worked.
Somehow the latter (converting to NVARCHAR2) seems "cleaner". Then again, you end up with a Unicode database that also uses the national Unicode data types, which seems weird.
So the question is: is there a suggested way to go about moving the database between the two servers? Is there any serious problem with either of the two approaches I mentioned above?
Don't use NVARCHAR2 data types unless that is your only option. The national character set exists to deal with cases where you have an existing, legacy application that does not support Unicode and you want to add a handful of columns to the system that do support Unicode without touching those legacy applications. Using NVARCHAR2 columns is great for those cases but it creates all sorts of issues in application development. Plenty of tools, APIs, and applications either don't support NVARCHAR2 columns or require additional configuration to do so. And since NVARCHAR2 columns are relatively uncommon in the Oracle world, it's very easy to spend gobs of time trying to resolve the particular issues you encounter. Less critically, since AL16UTF16 requires at least 2 bytes per character, you're likely to require quite a bit more space since much of your data is likely to consist of English characters.
I would strongly prefer migrating to the new database with character-length semantics (i.e. VARCHAR2(10 BYTE) becomes VARCHAR2(10 CHAR)). That avoids doubling the allowed length. It also makes it much easier to explain to users what the length limits are (or to code those validations in front-ends). It's terribly confusing to most users to explain that a particular column can sometimes hold 20 characters (when only English characters are used), can sometimes hold 10 characters (when only non-English characters are used), and can sometimes hold something in the middle (when there is a mixture of characters). Character length semantics make all those issues drastically easier.
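For instance, a minimal sketch of character-length semantics (the table name is hypothetical):
-- explicit character semantics: room for 10 characters, whatever their byte length
CREATE TABLE customer_demo (name VARCHAR2(10 CHAR));
-- or make CHAR the session default before running the migration DDL
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;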
Migrating to a Unicode database is a 4-step process:
1. Use exp[dp] to export the data and generate the DDL for the tables.
2. Alter the DDL to change the byte-length VARCHAR2 fields to character-length fields.
3. Create the tables using the modified DDL script.
4. Import the data using imp[dp].
Skipping steps 2 and 3 leaves you with byte-length-defined fields again, and probably with a lot of errors during the import because the data doesn't fit in the defined columns. If there are only US-ASCII characters in the source database it won't be a big problem, but accented Latin characters, for example, will cause problems because a single character can need more than one byte.
Following the listed procedure prevents the length problems. There are obviously more ways to do this, but the rule is to get the DDL definition right first and insert the data afterwards.
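To illustrate step 2 on a hypothetical table taken from the generated DDL:
-- before (as exported from the EL8ISO8859P7 source, byte semantics):
--   last_name VARCHAR2(30)
-- after (pre-created on the AL32UTF8 target before the import, character semantics):
CREATE TABLE customers (
  last_name VARCHAR2(30 CHAR)
);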

Char Vs Byte in Oracle

I am comparing two databases which have a similar schema. Both should support Unicode characters.
When I describe the same table in both databases, database 1 shows all the varchar fields with CHAR semantics (e.g. VARCHAR2(20 CHAR)), but database 2 shows them without it (VARCHAR2(20)), so the second schema supports only one byte per character.
When I compare NLS_DATABASE_PARAMETERS and V$NLS_PARAMETERS in both databases they are all the same.
Could someone let me know what the difference might be here?
Have you checked NLS_LENGTH_SEMANTICS? You can set the default to BYTE or CHAR for the CHAR/VARCHAR2 types.
If these parameters are the same on both databases, then maybe the table was created by explicitly specifying it that way.
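A quick sketch of how to check (the demo table name is made up):
-- default length semantics in effect for the current session
SELECT value FROM nls_session_parameters
 WHERE parameter = 'NLS_LENGTH_SEMANTICS';
-- an explicit declaration overrides that default either way
CREATE TABLE semantics_demo (
  col_char VARCHAR2(20 CHAR),   -- room for 20 characters
  col_byte VARCHAR2(20 BYTE)    -- room for 20 bytes
);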
