Oracle CLOB to XMLTYPE Errors

We are running an older Oracle Server, 10.1.0.5... yes, we will upgrade soon.
Some relevant NLS Settings are as follows...
NLS_CHARACTERSET IS 'US7ASCII'
NLS_LENGTH_SEMANTICS IS 'BYTE'
Onto the question...
We have well formed XML stored in CLOB columns. When trying to pull XML Element data I am using syntax like
select XMLTYPE(I.CLOBFIELD).EXTRACT('/Record/RecordID/text()') as Record_ID
from iTable I
Where I.CLOBFIELD is the CLOB containing some XML.
This works great, usually.
We sometimes get an error when the CLOB data contains non-ascii data that has been encoded using "&#xxxx;".
For example, if the following text
... “violation” ...
were found anywhere in the CLOB, I would get that error when running the query above.
The left quote is U+201C and the right is U+201D; these are stored as plain ASCII in the XML as &#x201C; and &#x201D;, respectively.
Punctuation like this has crept into our CLOB fields (mostly from users cutting and pasting from ms-word). At some point we will clean them up, probably when we migrate, but for now we would like the above query to always work, even when these Unicode equivalents are found in the CLOB.
Note: I would use XMLTABLE(..) if I could but it is not available in this edition of Oracle.
Suggestions or alternatives to XMLTYPE would be welcome.
Thank you,
sse
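One possible workaround, sketched here only for the two character references from the example above (a real fix would need to cover every reference outside US7ASCII, and the query shape is just the one from the question), is to substitute the offending references with plain-ASCII equivalents before handing the CLOB to XMLTYPE; REPLACE accepts CLOB arguments in 10g:

select XMLTYPE(
         replace(replace(I.CLOBFIELD, '&#x201C;', '"'),
                 '&#x201D;', '"')
       ).EXTRACT('/Record/RecordID/text()') as Record_ID
from iTable I

The obvious drawback is that the substitution throws away the typographic punctuation, so it only makes sense if those characters are disposable until the clean-up at migration time.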

Related

Migrating an Oracle database from a non-Unicode server to a Unicode server

I want to move an oracle database from a non-unicode server (EL8ISO8859P7 character set and AL16UTF16 NCHAR character set) to a unicode server. Specifically to an Oracle Express server with AL32UTF8 character set.
Simply exporting (exp) and importing (imp) the data fails. We have a lot of varchar2 columns with their length specified in bytes. When their contents are mapped to Unicode they take more bytes and get truncated.
I tried the following:
- doubling the length of all varchar2 columns of the original database with a script (varchar2(10) becomes varchar2(20))
- exporting
- importing to the new server
And it worked. Admittedly, doubling is arbitrary; I probably should have changed them to the same size with CHAR semantics.
I also tried the following:
- change all varchar2 columns to nvarchar2 (same size - varchar2(10) becomes nvarchar2(10))
- exporting
- importing to the new server
It also worked.
Somehow the latter (converting to nvarchar2) seems "cleaner". Then again, you end up with a Unicode database that still uses the national (N) data types, which seems weird.
So the question is: is there a suggested way to go about moving the database between the two servers? Is there any serious problem with either of the two approaches I mentioned above?
Don't use NVARCHAR2 data types unless that is your only option. The national character set exists to deal with cases where you have an existing, legacy application that does not support Unicode and you want to add a handful of columns to the system that do support Unicode without touching those legacy applications. Using NVARCHAR2 columns is great for those cases but it creates all sorts of issues in application development. Plenty of tools, APIs, and applications either don't support NVARCHAR2 columns or require additional configuration to do so. And since NVARCHAR2 columns are relatively uncommon in the Oracle world, it's very easy to spend gobs of time trying to resolve the particular issues you encounter. Less critically, since AL16UTF16 requires at least 2 bytes per character, you're likely to require quite a bit more space since much of your data is likely to consist of English characters.
I would strongly prefer migrating to the new database with character-length semantics (i.e. VARCHAR2(10 BYTE) becomes VARCHAR2(10 CHAR)). That avoids doubling the allowed length. It also makes it much easier to explain to users what the length limits are (or to code those validations in front-ends). It's terribly confusing to most users to explain that a particular column can sometimes hold 20 characters (when only English characters are used), can sometimes hold 10 characters (when only non-English characters are used), and can sometimes hold something in the middle (when there is a mixture of characters). Character length semantics make all those issues drastically easier.
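For illustration, a minimal sketch of the difference, assuming an AL32UTF8 target database (the table and data are made up):

CREATE TABLE semantics_demo (
  name_byte VARCHAR2(10 BYTE),  -- limit is 10 bytes
  name_char VARCHAR2(10 CHAR)   -- limit is 10 characters, whatever their byte length
);
-- Ten accented characters need 20 bytes in AL32UTF8:
INSERT INTO semantics_demo (name_char) VALUES ('éééééééééé');  -- fits
INSERT INTO semantics_demo (name_byte) VALUES ('éééééééééé');  -- fails with ORA-12899: value too large for column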
Migrating to a Unicode database is a four-step process:
1. Use exp[dp] to export the data and generate the DDL for the tables.
2. Alter the DDL to change the byte-length varchar2 fields to character-length fields.
3. Create the tables using the modified DDL script.
4. Import the data using imp[dp].
Skipping steps 2 and 3 leaves you with byte-length fields again, and probably a lot of errors during import because the data no longer fits in the defined columns. If the source database contains only US-ASCII characters it won't be a big problem, but accented Latin characters, for example, will cause trouble because a single character can need more than one byte.
Following this procedure prevents the length problems. There are obviously more ways to do this, but the rule is to get the DDL definition right first and insert the data afterwards (a rough sketch of step 2 follows below).
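As an illustration of step 2 (table and column names invented), the exported DDL would be edited along these lines before running it on the target; alternatively, the session that runs the unmodified DDL can switch its default length semantics:

-- Exported from the byte-semantics source:
--   CREATE TABLE customers (name VARCHAR2(50 BYTE), city VARCHAR2(30 BYTE));
-- Edited for the Unicode target:
CREATE TABLE customers (
  name VARCHAR2(50 CHAR),
  city VARCHAR2(30 CHAR)
);
-- Or, instead of editing every column, set the default for the session running the DDL:
ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;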

Insert Unicode string to DB using Linq

When I try to execute this:
INSERT INTO [DB_NAME].[dbo].[Table]
([Column])
VALUES('some_hebrew_characters')
I get only question marks in the column. If I change it to N'some_hebrew_characters' then it's OK. Why is this happening? How can I translate it to LINQ?
How can I make this table treat all data as Unicode by default? My column collation is Hebrew_CS_AI, and the server is SQL Server 2008 R2.
Thanks!
---EDIT----
something I just noticed:
even if I run this
SELECT 'some_hebrew_characters'
I'm getting question marks in my results grid.
Did you perhaps forget to declare your column as NVARCHAR as well?
Probably your editor's default encoding is not Unicode.
To be sure, save your query as a Unicode file in SQL Server Management Studio and re-run it.
I think if you fetch the results through LINQ they will be right.
You need to prefix the string literal with the letter N.
When inserting a value that contains Unicode characters, you need to do this:
insert into table_name(unicode_field) values (N'会意字')
Without the N prefix, the literal is treated as a non-Unicode string in the current code page, so any character that code page cannot represent is lost.
Also, be sure that the column you're inserting into supports Unicode characters - i.e. nchar, nvarchar, or ntext.
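A small illustration of both points together (the table and column names are made up):

-- The column must be a Unicode type...
CREATE TABLE dbo.Messages (Body NVARCHAR(100));

-- ...and the literal must be a Unicode literal:
INSERT INTO dbo.Messages (Body) VALUES (N'שלום');  -- stored correctly
INSERT INTO dbo.Messages (Body) VALUES ('שלום');   -- may arrive as '????' if the default code page has no Hebrew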

Oracle SQL Developer environment encoding

I have Oracle SQL Developer (3.1.07) and I'm trying to work with a database that uses WE8ISO8859P1 encoding:
SELECT * FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';
I have problems with saving packages that contain Unicode symbols. When I open a previously saved package, all the Unicode symbols have turned into '¿'.
What settings do I have to change to make SQL Developer keep those symbols?
I've tried setting the environment encoding to 'ISO-8859-15' and some other encodings, but it doesn't help.
If your database encodes text in a non-Unicode single-byte encoding (e.g. ISO-8859), any symbol not present in that character table is treated as invalid and replaced by a placeholder. You can't go back from that; the information is lost.
That can usually be worked around when storing data, but for source code you cannot control how Oracle encodes your strings.
If your database is configured to use such encoding scheme you're probably not supposed to write code that violates its rules.
Maybe you need the character set migration described in Oracle's documentation:
http://docs.oracle.com/cd/B10501_01/server.920/a96529/ch10.htm#1656
At least to open the package in SQL Developer, you can give this a quick try and see if it works:
Change SQL Developer's 'encoding' preference to 'unicode-utf-8', which is the default in later versions.
You would, eventually, need to migrate the database character set to 'AL32UTF8' to avoid other issues (with the data as well) caused by the current character set.
If you look at USER_SOURCE you'll see that the source code, as stored/interpreted by the database, sits in a VARCHAR2 column and so uses the database character set. As such, your source code will need to be in WE8ISO8859P1.
In theory, if the client and database are using the same character set, then the database won't try to do any character set translation and you may be able to sneak in a sequence of bytes that the database thinks are WE8ISO8859P1 but will make sense in unicode. However, at some point, someone will use the wrong client and it will break.
You don't need unicode for identifiers etc in the code, so I assume it is in string literals. You are better off storing these in a table (NVARCHAR2 column) and selecting them into the code rather than hard-coding them. If that isn't possible, you could use UNISTR and hard-code the relevant hex values.
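As a hedged sketch of the UNISTR approach (the function name and literal here are invented): UNISTR takes backslash-escaped UTF-16 code units, so the source file itself stays within WE8ISO8859P1.

CREATE OR REPLACE FUNCTION euro_label RETURN NVARCHAR2 IS
BEGIN
  -- U+20AC is the euro sign; the escape keeps the source pure single-byte text
  RETURN UNISTR('Price in \20AC');
END euro_label;
/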

Using ANSI SQL syntax for formatting numeric values

I am using two different databases for my project,
Oracle and Apache Derby, and am trying as much as possible to use the ANSI SQL syntax supported by both of the databases.
I have a table with a column amount_paid NUMERIC(26,2).
My old code, which was written for the Oracle db, needed to retrieve the value in this format:
SELECT LTRIM(TO_CHAR(amount_paid,'9,999,999,999,999.99'))
How can I convert a numeric value to such a string in the format '9,999,999,999,999.99' using ANSI sql syntax?
I think this is the wrong approach. The format mask is for display purposes, so it really ought to be the concern of the presentation layer. All your data access layer should do is merely execute:
select amount_paid
from your_table
where ....
This syntax will obviously work whatever database your app attaches to.
Then put the formatting code in the front-end, where it belongs.
My knowledge is not encyclopedic, but as far as I know there isn't an ANSI function to do what you want (although I'd be glad to find out I'm wrong :-). CONVERT converts between character sets but does not, as best I can see, do the formatting work you want. CAST converts values between data types but, again, doesn't do formatting.
If Derby doesn't support the Oracle-style TO_CHAR function you may have to roll your own function, let's call it MY_TO_CHAR. In Oracle the implementation might be
FUNCTION MY_TO_CHAR(nValue           IN NUMBER,
                    strOracle_format IN VARCHAR2,
                    strDerby_format  IN VARCHAR2)
  RETURN VARCHAR2
IS
BEGIN
  RETURN TO_CHAR(nValue, strOracle_format);
END MY_TO_CHAR;
In Derby you'd want to define this function in a similar manner, taking the appropriate value and format and invoking Derby's equivalent of TO_CHAR with the Derby formatting string.
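A possible call site, assuming the function above has been created in both databases (the Derby-side format string is only a placeholder):

SELECT MY_TO_CHAR(amount_paid,
                  '9,999,999,999,999.99',  -- Oracle format mask
                  '#,##0.00')              -- placeholder for whatever the Derby implementation expects
FROM your_table;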
EDIT: I agree with @APC - a lot of these issues disappear if you don't require the backend to do what is basically front-end work.
Share and enjoy.

UTF-8 from Oracle tables

The client has asked for a number of tables to be extracted into CSVs; all done, no problem. They've just asked that we make sure the files are always in UTF-8 format.
How do I check that this is actually the case? Or, even better, force it to be so - is it something I can set in a procedure before running a query, perhaps?
The data is extracted from an Oracle 10g database.
What should I be checking?
Thanks
You can check the database character set with the following query:
select value from nls_database_parameters
where parameter='NLS_CHARACTERSET'
If it says AL32UTF8 then your database is already in the format you need, and if the export does not mangle the data then you are done.
You can read about Oracle globalization support, and about NLS parameters like the one above, in the Oracle documentation.
How, exactly, are you generating the CSV files? Depending on the exact architecture, there will be different answers.
If you are, for example, using SQL*Plus to extract the data, you would need to set the NLS_LANG on the client machine to something appropriate (i.e. AMERICAN_AMERICA.AL32UTF8) to force the data to be sent to the client machine in UTF-8. If you are using other approaches, NLS_LANG may or may not be important.
What you have to look for is that the eight-bit characters in the input (if any) are translated into two-byte UTF-8 sequences.
This is highly dependent on your local code page, but typically:
"£", which is x'A3' in an eight-bit Latin code page, becomes x'C2A3' in UTF-8.
Ok it wasn't as simple as I first hoped. The query above returns AL32UTF8.
I am using a stored proc compiled on the database to loop through a list of table names held in an array inside the stored procedure.
I use the DBMS_SQL package to build the SQL and UTL_FILE.PUT_NCHAR to write the data into a text file.
I believed my resulting output would then be in UTF-8; however, opening it in TextPad says it's ANSI and the data is garbled in places :)
Cheers
It might be important that NLS_CHARACTERSET is AL32UTF8 and NLS_NCHAR_CHARACTERSET is AL16UTF16
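A hedged sketch of one way UTL_FILE is sometimes pointed at UTF-8 output (the directory object and file name here are invented): open the handle with FOPEN_NCHAR, which reads and writes the file contents in UTF-8, and write NVARCHAR2 buffers with the *_NCHAR calls.

DECLARE
  f UTL_FILE.FILE_TYPE;
BEGIN
  -- DATA_DIR is an assumed Oracle directory object
  f := UTL_FILE.FOPEN_NCHAR('DATA_DIR', 'extract.csv', 'w', 32767);
  -- Files opened with FOPEN_NCHAR are written in UTF-8, whatever the NCHAR character set is
  UTL_FILE.PUT_LINE_NCHAR(f, N'col1,col2,col3');
  UTL_FILE.FCLOSE(f);
END;
/

Whether this matches the existing procedure depends on how the file handle is currently opened; mixing PUT_NCHAR with a handle from plain FOPEN can produce exactly the kind of garbled output described above.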
