We are having problems with text that is encoded in some different ways but kept in a single column in a table. Long story. On MySQL, I can do "select hex(str) from table where" and I see the bytes of the string exactly as I set them.
On Oracle, I have a string which starts with the Turkish character İ, which is the Unicode character 0x0130 "LATIN CAPITAL LETTER I WITH DOT ABOVE". This is in my printed copy of the Unicode Version 2.0 book. In UTF-8, this character is 0xc4b0.
We have very old client apps we need to support. They would send us this text in "windows-1254". We used to just close our eyes, store it, and hand it back later. Now we need the Unicode, or are being given the Unicode.
So I have:
SQL> select id, name from table where that thing;
ID NAME
------ ------------------------
746 Ý
This makes sense because the "İ" is 0xdd in windows-1254 and 0xdd in windows-1252 is "Ý". My terminal is presumably set to the usual windows-1252.
But:
SQL> select id, rawtohex(name) from table where that thing;
ID RAWTOHEX(NAME)
------ ------------------------
746 C39D
There seems to be no equivalent to the hex(name) function in MySQL. But I must be missing something. What am I missing here?
My java code has to take the utf8 that I am supplied and save a utf8 copy and a windows-1252 copy. The java code gives me:
bytes (utf8): c4 b0
bytes (1254): dd
Yet, when I save it, the client does not get the correct character. And when I try to see what Oracle has actually stored, I get the garbage seen above. I have no idea where the C39D is coming from. Any suggestions?
We have ojdbc14.jar built into all of our applications and we are connecting to a database that says it is "Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production".
Use the dump function to see how Oracle stores data internally.
You seem to have a misunderstanding of how Oracle treats VARCHAR2 character set conversions: you can't influence how Oracle stores its data physically. (Also, if you haven't already, it's helpful to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets).
Your client speaks to Oracle only in binary. In fact all systems exchange information in binary only. To understand each other, both systems must know what language (character set) is being used.
In your case we can reconstruct what happens:
Your client sends the byte dd to Oracle and says it is windows-1252 (instead of 1254).
Oracle looks up its character set table and sees that this data is translated to the symbol Ý in this character set.
Oracle logically stores this information in its table.
Since Oracle is set up in UTF-8, it converts this data to the UTF-8 binary representation of Ý:
SQL> SELECT rawtohex('Ý') FROM dual;
RAWTOHEX('Ý')
--------------
C39D
Oracle stores C39D internally.
As you can see, the problem comes from the first step: the client declares the wrong character set. As long as you don't fix this, the two systems won't be able to communicate correctly.
The conversion is automatic when you use VARCHAR2 because this datatype is a logical text symbol interface (you have next to no control over forcing the actual binary data being stored).
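The reconstruction above can be reproduced outside the database. This is a minimal sketch, assuming Java's windows-1252 and windows-1254 charsets behave like the client character sets involved; the class and method names are made up for illustration and are not part of any Oracle API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MislabelledByteDemo {
    // Decode the client bytes with the character set the client declared,
    // then re-encode the resulting text as UTF-8 (the database encoding here).
    static String storedHex(byte[] clientBytes, String declaredCharset) {
        String text = new String(clientBytes, Charset.forName(declaredCharset));
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] fromClient = { (byte) 0xDD };
        // Declared as windows-1252: 0xDD is U+00DD, stored as C39D.
        System.out.println(storedHex(fromClient, "windows-1252"));
        // Declared as windows-1254: 0xDD is U+0130, stored as C4B0.
        System.out.println(storedHex(fromClient, "windows-1254"));
    }
}
```

The same byte produces two different stored values depending only on the declared character set, which is why the fix belongs in the client configuration rather than in the data.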
I have bytes in UTF-8 to begin.
String strFromUTF8 = new String(bytes, "UTF8");
byte[] strInOldStyle = strFromUTF8.getBytes("Cp1254");
With MySQL, I am done. I take these bytes, turn them into a hex string and do an update with unhex(hexStr). This allows me to put the legacy bytes into a varchar column.
With Oracle, I must do:
String again = new String(strInOldStyle, "Cp1252");
byte[] nextOldBytes = again.getBytes("UTF8");
Now, I can do an update and get the bytes into a varchar2 column with:
update table set colName = UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW('hexStr')) where ...
Strange, no? I am sure I have made this more complex than it needed to be.
What we see is this, though,
"İ" in UTF-8 == 0xc4b0
"İ" in Cp1254 == 0xdd == "Ý" in Cp1252
"Ý" in UTF-8 == 0xc39d
So, if I get the string "İ" and do:
update table set name = UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW('C39D')) where ...
Then our legacy client gives us a "İ". Yep. It works.
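For reference, the whole chain can be written as one helper that produces the hex string passed to HEXTORAW. This is only a sketch of the workaround described in this post, with a hypothetical class and method name; it assumes the legacy client always uses windows-1254 and that the connection labels its bytes as windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LegacyHexWorkaround {
    // Returns the hex of the UTF-8 bytes Oracle must hold so that a client
    // mislabelled as windows-1252 receives the original windows-1254 bytes.
    static String hexForLegacyClient(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        // Bytes the legacy client expects to receive.
        byte[] legacy = text.getBytes(Charset.forName("windows-1254"));
        // The same bytes read the way the mislabelled connection reads them.
        String misread = new String(legacy, Charset.forName("windows-1252"));
        StringBuilder hex = new StringBuilder();
        for (byte b : misread.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] dottedCapitalI = { (byte) 0xC4, (byte) 0xB0 }; // U+0130 in UTF-8
        System.out.println(hexForLegacyClient(dottedCapitalI));
    }
}
```

Characters that windows-1254 cannot represent are replaced by "?" in the getBytes step, so this only round-trips text the legacy client could display in the first place.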
I have an issue with invalid characters appearing in an Oracle database: the "¿", or upside-down question mark. I know it's caused by UTF-8 encoded characters being put into the encoding of the Oracle database. Examples of characters I am expecting are ' – ', which looks like a normal hyphen but isn't, and ' ’ ', which should be a normal single quote.
Input: "2020-08-31 – 2020-12-31" - looks like normal hyphen but not
Output: "2020-08-31 ¿ 2020-12-31"
I know the primary source of the characters is copy and paste from office programs like Word and its conversion to smart quotes and similar. Getting the users to turn off this feature is not a viable solution.
There was a good article about dealing with this and they offered a number of solutions:
switch database to UTF8 - can't do it due to reasons
change varchar2 to nvarchar2 - I tried this solution as it would be the best for my needs, but when tested I was still ending up with invalid characters, so it's not the only issue.
use blob/clob - tried and got page crashes
filter - not attractive as there are a lot of places that would need to be updated.
So the technologies being used are Eclipse, Spring, JDBC version 4.2, and Oracle 12.
The information is entered into the form, the form gets saved, and it passes from the controller into the DAO; when it's checked there, the information is correct. It's when it passes from there into JDBC that I lose sight of it, and once it enters the database I can't tell if it has already changed or if that's where the change happens, but the database stores the invalid character.
So this is happening somewhere between the insert statement and the database. It's either JDBC or the database, but how do I tell, and how do I fix it? I have changed the field where the information is being stored: it was originally a varchar2, but I tried nvarchar2 and it's still invalid.
I know that this question is asking about the same topic, but the answers do not help me.
SELECT * FROM V$NLS_PARAMETERS WHERE PARAMETER IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');
PARAMETER VALUE CON_ID
---------------------------------------------------------------- ---------------------------------------------------------------- ----------
NLS_CHARACTERSET WE8ISO8859P15 0
NLS_NCHAR_CHARACTERSET AL16UTF16 0
So in the comments I was given a link to something that might help, but it's giving me an error.
@Override
public Long createComment(Comment comment) {
    return jdbcTemplate.execute((Connection connection) -> {
        // Ask the driver to return the generated value of column 1 (ID).
        PreparedStatement ps = connection.prepareStatement("INSERT INTO COMMENTS (ID, CREATOR, CREATED, TEXT) VALUES (?,?,?,?)", new int[] { 1 });
        ps.setLong(1, comment.getFileNumber());
        ps.setString(2, comment.getCreator());
        ps.setTimestamp(3, Timestamp.from(comment.getCreated()));
        // Mark parameter 4 as national character data (NVARCHAR2 column).
        ((OraclePreparedStatement) ps).setFormOfUse(4, OraclePreparedStatement.FORM_NCHAR);
        ps.setString(4, comment.getText());
        if (ps.executeUpdate() != 1) return null;
        ResultSet rs = ps.getGeneratedKeys();
        return rs.next() ? rs.getLong(1) : null;
    });
}
An exception occurred: org.apache.tomcat.dbcp.dbcp2.DelegatingPreparedStatement cannot be cast to oracle.jdbc.OraclePreparedStatement
Character set WE8ISO8859P15 (i.e. ISO 8859-15) does not contain – or ’, so you cannot store them in a VARCHAR2 data field.
Either you migrate the database to Unicode (typically AL32UTF8) or use NVARCHAR2 data type for this column.
I am not familiar with Java, but I guess you have to use setNString when writing (and getNString when reading). The linked document should provide some hints to solve the issue.
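To see in advance which characters a VARCHAR2 column in this database would lose, the check can be approximated on the Java side. This is a sketch under the assumption that Java's ISO-8859-15 charset matches Oracle's WE8ISO8859P15 closely enough for the purpose; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class EncodableCheck {
    // Returns the characters of text that the named charset cannot represent.
    static String unencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder missing = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String single = new String(Character.toChars(cp));
            if (!encoder.canEncode(single)) {
                missing.appendCodePoint(cp);
            }
        });
        return missing.toString();
    }

    public static void main(String[] args) {
        // U+2013 is the en dash and U+2019 the right single quotation mark.
        String input = "2020-08-31 \u2013 2020-12-31 \u2019";
        String missing = unencodable(input, "ISO-8859-15");
        missing.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Any character this reports is one that would most likely arrive in a VARCHAR2 column as "¿", so it has to go through the national character set path or be replaced before the insert.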
Recently I came across a unicode character (\u2019) in a database table column while parsing using Python.
Question: What are the reasons that can result in unicode characters showing up in the database table? Is it data entry issue?
Appreciate any input.
When you set up your Oracle Database you choose a character set which will be used in the SQL char datatypes (char, varchar2 etc).
Suppose you chose your character set and you have a table with a column of VARCHAR2 type. Suddenly you need to store a string with non-ASCII symbols that your database character set does not support. You may convert this string into an ASCII string, for example by calling the ASCIISTR function, and store it in your VARCHAR2 column (but it's not a good idea, because many SQL built-in functions don't understand the escaped form and treat it as a run of ordinary ASCII symbols). That's how escaped Unicode may appear in your table column (ASCIISTR converts non-ASCII symbols into a backslash escape such as '\2019'; other tools write the same code point as '\u2019').
Another option is the special Oracle nchar datatypes, which were designed to store Unicode without altering global database settings.
Here is the link with Oracle documentation: https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch6unicode.htm
I want to move an oracle database from a non-unicode server (EL8ISO8859P7 character set and AL16UTF16 NCHAR character set) to a unicode server. Specifically to an Oracle Express server with AL32UTF8 character set.
Simply exporting (exp) and importing (imp) the data fails. We have a lot of the varchar2 columns with their length specified in bytes. When their contents are mapped in unicode they take more bytes and are truncated.
I tried the following:
- doubling the length of all varchar2 columns of the original database with a script (varchar2(10) becomes varchar2(20))
- exporting
- importing to the new server
And it worked. Apparently doubling is arbitrary, I probably should have changed them to the same size with CHAR semantics.
I also tried the following:
- change all varchar2 columns to nvarchar2 (same size - varchar2(10) becomes nvarchar2(10))
- exporting
- importing to the new server
It also worked.
Somehow the latter (converting to nvarchar) seems "cleaner". Then again you have a unicode database with unicode data types which seems weird.
So the question is: is there a suggested way to go about moving the database between the two servers? Is there any serious problem with either of the two approaches I mentioned above?
Don't use NVARCHAR2 data types unless that is your only option. The national character set exists to deal with cases where you have an existing, legacy application that does not support Unicode and you want to add a handful of columns to the system that do support Unicode without touching those legacy applications. Using NVARCHAR2 columns is great for those cases but it creates all sorts of issues in application development. Plenty of tools, APIs, and applications either don't support NVARCHAR2 columns or require additional configuration to do so. And since NVARCHAR2 columns are relatively uncommon in the Oracle world, it's very easy to spend gobs of time trying to resolve the particular issues you encounter. Less critically, since AL16UTF16 requires at least 2 bytes per character, you're likely to require quite a bit more space since much of your data is likely to consist of English characters.
I would strongly prefer migrating to the new database with character-length semantics (i.e. VARCHAR2(10 BYTE) becomes VARCHAR2(10 CHAR)). That avoids doubling the allowed length. It also makes it much easier to explain to users what the length limits are (or to code those validations in front-ends). It's terribly confusing to most users to explain that a particular column can sometimes hold 20 characters (when only English characters are used), can sometimes hold 10 characters (when only non-English characters are used), and can sometimes hold something in the middle (when there is a mixture of characters). Character length semantics make all those issues drastically easier.
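The difference between byte and character length is easy to measure. The sketch below assumes Java's ISO-8859-7 charset as a stand-in for EL8ISO8859P7 and UTF-8 for AL32UTF8; the class name and the sample word are only illustrative.

```java
import java.nio.charset.Charset;

final class ByteLengthDemo {
    // Number of bytes the text occupies in the named character set.
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // The Greek word for Athens, five characters, written with escapes.
        String athens = "\u0391\u03B8\u03AE\u03BD\u03B1";
        System.out.println(athens.length());                   // characters
        System.out.println(byteLength(athens, "ISO-8859-7"));  // bytes in the old database
        System.out.println(byteLength(athens, "UTF-8"));       // bytes in the new database
    }
}
```

The word fits a VARCHAR2(5 BYTE) column in the Greek database but needs 10 bytes in UTF-8, whereas VARCHAR2(5 CHAR) holds it in both.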
Migrating to unicode databases is a 4 step process.
Use exp[dp] to export the data and generate ddl for the tables.
Alter the ddl to change the byte length varchar2 fields to character length fields.
Create the tables using the modified ddl script.
Import the data using imp[dp].
Skipping steps 2 and 3 leaves you with the byte length defined fields again, and probably with a lot of errors during import because the data doesn't fit in the defined columns. If there are only US-ASCII characters in the source database it won't be a big problem, but accented Latin characters, for example, will cause problems because a single character can need more than one byte.
Following the listed procedure prevents the length problems. There are obviously more ways to do this but rule is to first have the ddl definition ok and insert the data later.
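Step 2 can be scripted. The following is a rough sketch, not a full DDL parser: it assumes the generated DDL spells the columns as VARCHAR2(n) or VARCHAR2(n BYTE), and the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class CharSemanticsDdl {
    // Matches VARCHAR2(n) and VARCHAR2(n BYTE); the word boundary keeps
    // NVARCHAR2 out, and columns already declared with CHAR do not match.
    private static final Pattern BYTE_VARCHAR2 = Pattern.compile(
            "\\bVARCHAR2\\s*\\(\\s*(\\d+)(?:\\s+BYTE)?\\s*\\)",
            Pattern.CASE_INSENSITIVE);

    static String toCharSemantics(String ddl) {
        return BYTE_VARCHAR2.matcher(ddl).replaceAll("VARCHAR2($1 CHAR)");
    }

    public static void main(String[] args) {
        String ddl = "CREATE TABLE T (NAME VARCHAR2(10 BYTE), CODE varchar2(5))";
        System.out.println(toCharSemantics(ddl));
    }
}
```

It treats an unqualified VARCHAR2(n) as byte length, which is the default unless NLS_LENGTH_SEMANTICS has been changed. CHAR columns would need the same treatment, and the result should be reviewed by hand before the tables are created.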
When using Oracle ascii function:
select ascii('A') from dual;
It returns 65, which is right.
But when I use:
select ascii('周') from dual;
It returns 55004. Can ASCII represent values greater than 255?
How can this be explained?
Help!
My Oracle version: Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production
My character set: NLS_CHARACTERSET ZHS16GBK
ASCII in the name is a holdover from when Oracle only supported ASCII. It does not mean it only returns ASCII values.
From the docs:
ASCII returns the decimal representation in the database character set of the first character of char.
http://docs.oracle.com/cd/E11882_01/server.112/e41084/functions013.htm#sthref933
So the result depends on the database character set, which can be greater than 255.
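The number 55004 can be reproduced by hand. This sketch assumes Java's GBK charset encodes the character the same way as Oracle's ZHS16GBK; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

final class MultibyteCodeDemo {
    // Encodes the first character of text in the named charset and reads
    // the resulting bytes as a single big-endian number.
    static int codeInCharset(String text, String charsetName) {
        String first = new String(Character.toChars(text.codePointAt(0)));
        int value = 0;
        for (byte b : first.getBytes(Charset.forName(charsetName))) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(codeInCharset("A", "GBK"));      // one byte, 0x41
        System.out.println(codeInCharset("\u5468", "GBK")); // two bytes, 0xD6 0xDC
    }
}
```

The character U+5468 is the one from the question; its two GBK bytes 0xD6 and 0xDC read as one number give 0xD6DC, which is 55004 in decimal.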
This may vary with your version of Oracle, but it is probably trying to do you the favor of gracefully handling the non-7bit ASCII value that you are passing (but should not be). The doc in at least one version discusses some handling of non-ASCII inputs (http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions007.htm) though if you are using a different version of oracle you may want to refer to the appropriate docs.
If your docs don't say anything more about how it handles non-7bit characters then the answer is probably not well defined (ie no guarantee from Oracle on behavior) and you may want to consider cleansing your input so you only try calling the ASCII function on values that you know to be in the proper input set.
The client has asked for a number of tables to be extracted into csv's, all done no problem. They've just asked we make sure the files are always in UTF 8 format.
How do I check this is actually the case? Or even better, force it to be so: is it something I can set in a procedure before running a query, perhaps?
The data is extracted from an Oracle 10g database.
What should I be checking?
Thanks
You can check the database character set with the following query:
select value from nls_database_parameters
where parameter='NLS_CHARACTERSET'
If it says AL32UTF8 then your database is in the format you need, and if the export does not alter it then you are done.
You may read about Oracle globalization support here, and here about NLS parameters like the above.
How, exactly, are you generating the CSV files? Depending on the exact architecture, there will be different answers.
If you are, for example, using SQL*Plus to extract the data, you would need to set the NLS_LANG on the client machine to something appropriate (i.e. AMERICAN_AMERICA.AL32UTF8) to force the data to be sent to the client machine in UTF-8. If you are using other approaches, NLS_LANG may or may not be important.
What you have to look for is whether the eight-bit characters in the input (if any) are translated into two-byte UTF-8 sequences.
This is highly dependent on your local code page, but typically:
"£" is x'A3' in an eight-bit code page such as ISO 8859-1 and becomes x'C2A3' in UTF-8.
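One way to check an exported file mechanically is to decode its bytes with a strict UTF-8 decoder and see whether that fails. This is a minimal sketch with a hypothetical class name; in practice the byte array would come from the CSV file, for example via Files.readAllBytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // True when data decodes as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] poundInUtf8 = { (byte) 0xC2, (byte) 0xA3 };
        byte[] poundInLatin1 = { (byte) 0xA3 };
        System.out.println(isValidUtf8(poundInUtf8));
        System.out.println(isValidUtf8(poundInLatin1));
    }
}
```

A file containing only 7-bit ASCII passes this check as well, since ASCII is a subset of UTF-8; an editor may still label such a file as "ANSI" because nothing in it distinguishes the two.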
Ok it wasn't as simple as I first hoped. The query above returns AL32UTF8.
I am using a stored proc compiled on the database to loop through a list of table names held in an array inside the stored procedure.
I use DBMS_SQL package to build the SQL and UTL_FILE.PUT_NCHAR to insert data into a text file.
I believed then my resultant output would be in UTF 8 however opening in Textpad says it's in ANSI and the data is garbled in places :)
Cheers
It might be important that NLS_CHARACTERSET is AL32UTF8 and NLS_NCHAR_CHARACTERSET is AL16UTF16