CLOB size in bytes - Oracle

I have a database with the below NLS settings
NLS_NCHAR_CHARACTERSET - AL16UTF16
NLS_CHARACTERSET - AL32UTF8
There's a table with a CLOB column storing base64-encoded data.
Since the characters are mostly English letters and digits, I assumed each character would take up only 1 byte, as a CLOB uses the NLS_CHARACTERSET for its encoding.
With an inline-enabled CLOB column, the CLOB is stored inline unless it grows beyond 4096 bytes. However, when I tried to store a value of 2048 characters, I found that it was not stored inline (by checking DBA_TABLES). So does that mean each character is not using only 1 byte? Can anyone elaborate on this?
Another test I added:
Create a table with a CLOB column using an 8 KB chunk size, so that the initial LOB segment size is 65,536 bytes.
After inserting a row with 32,768 characters in the CLOB column, the creation of a second extent can be seen by querying DBA_SEGMENTS.
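A rough sketch of that test, with made-up names (clob_test / data) and assuming access to the DBA_* views:

create table clob_test (
  id   number,
  data clob
) lob (data) store as (enable storage in row chunk 8192);

-- Build a 32,768-character ASCII value in PL/SQL (a SQL literal would be too long)
declare
  v_lob clob;
  v_buf varchar2(16384) := rpad('A', 16384, 'A');
begin
  insert into clob_test (id, data) values (1, empty_clob())
  returning data into v_lob;
  dbms_lob.writeappend(v_lob, 16384, v_buf);
  dbms_lob.writeappend(v_lob, 16384, v_buf);
  commit;
end;
/

-- Check how many extents the LOB segment now has
select s.segment_name, s.bytes, s.extents
from dba_segments s
join dba_lobs l on l.segment_name = s.segment_name
where l.table_name = 'CLOB_TEST' and l.column_name = 'DATA';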

http://docs.oracle.com/cd/E11882_01/server.112/e10729/ch6unicode.htm#r2c1-t12
It says:
Data in CLOB columns is stored in a format that is compatible with
UCS-2 when the database character set is multibyte, such as UTF8 or
AL32UTF8. This means that the storage space required for an English
document doubles when the data is converted
So it looks like CLOB data is stored internally as UCS-2 (Unicode), i.e. a fixed 2 bytes per character. Consequently, only 4096/2 = 2048 characters fit inline.

Related

Thai characters not allowing more than 1333 characters from Java code

Thai characters are not allowing more than 1333 characters from Java code. Is there any possible way except using the CLOB data type in the DB? We are using Oracle 11g.
Simply, no (I assume you use the VARCHAR2 data type), except Oracle 12c with the EXTENDED string size (MAX_STRING_SIZE = EXTENDED).
VARCHAR2 columns allow 4000 bytes in normal mode and up to 32767 in extended.
Thai requires multi-byte characters, which is why more than 1333 characters can take more than 4000 bytes.
NVARCHAR2 columns allow 2000 characters in normal mode and up to 16383 in extended.
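As a rough sketch (12c and later only, assuming the database has already been switched to extended data types; the table name is made up):

select value from v$parameter where name = 'max_string_size';  -- STANDARD or EXTENDED

-- Only possible when MAX_STRING_SIZE = EXTENDED:
create table thai_text_demo (txt varchar2(8000 char));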
What is the DB character set?
I suspect your scenario is as follows:
al32utf8 is the db character set.
the varchar2 column(s) in your table(s) have byte semantics.
The UTF-8 encoding represents each Thai character in up to 3 bytes; thus you hit the length limit at 1333 characters instead of 4000.
You can change the length semantics from byte to char with ALTER TABLE <table> MODIFY (<column> VARCHAR2(n CHAR)); (ref.: see here).
For the sake of completeness: in case you are operating with a single-byte DB character set like WE8ISO8859P11 (ISO 8859-11, Thai script), characters can be composed from base characters and diacritical marks. In that case you might have success changing the encoding in the data source to use the code points for composite characters. However, I feel this scenario is unlikely, given that each of your test data characters would actually have to be composed from three parts to match the observation.
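A minimal sketch of that change, using made-up names (my_table / thai_text); CHAR_USED in USER_TAB_COLUMNS shows 'C' after the switch, and LENGTH vs. LENGTHB shows characters vs. bytes:

alter table my_table modify (thai_text varchar2(2000 char));

-- Verify the semantics ('C' = character) for the column
select column_name, char_used, char_length, data_length
from user_tab_columns
where table_name = 'MY_TABLE' and column_name = 'THAI_TEXT';

-- Compare character count with byte count for the stored data
select length(thai_text) as chars, lengthb(thai_text) as bytes
from my_table;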

perl DBIx::Class converting values with umlaut

I'm using DBIx::Class to fetch data from Oracle (11.2). When the data is fetched, for example "Alfred Kärcher", it returns the value as "Alfred Karcher". I tried to add the $ENV settings NLS_LANG and NLS_NCHAR, but still no change.
I also used the utf8 module to verify that the data is utf8 encoded.
This looks like the Oracle client library converting the data.
Make sure the database encoding is set to AL32UTF8 and the environment variable NLS_LANG to AMERICAN_AMERICA.AL32UTF8.
It might also be possible to fix this by setting the ora_(n)charset parameter instead.
The two links from DavidEG contain all the info that's needed to make it work.
You don't need use utf8; in your script, but make sure you set STDOUT to UTF-8 encoding: use encoding 'utf8';
Here the problem is with the column data type that you specified for storage.
If your column is defined as VARCHAR2(10), Oracle actually stores 10 bytes; for English, 10 bytes means 10 characters, but if the data you insert contains special characters like umlauts, those require 2 bytes each, and you end up with ORA-12899: value too large for column.
So if the data you are inserting is provided by users from different countries, use VARCHAR2(10 CHAR).
In bytes: VARCHAR2(10 BYTE). This will support up to 10 bytes of data, which could be as few as two characters in a multi-byte character set.
In characters: VARCHAR2(10 CHAR). This will support up to 10 characters of data, which could be as much as 40 bytes of information.
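A quick sketch of that point with a made-up table; in an AL32UTF8 database 'Kärcher' is 7 characters but 8 bytes ('ä' takes 2 bytes):

create table semantics_demo (
  name_b varchar2(7 byte),
  name_c varchar2(7 char)
);

insert into semantics_demo (name_c) values ('Kärcher');  -- succeeds: 7 characters fit
insert into semantics_demo (name_b) values ('Kärcher');  -- fails with ORA-12899: 8 bytes > 7 bytes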

ORA-01704: string literal too long using long UTF-8 Character set

I'm testing a database that was recently converted to UTF-8. If I insert long random UTF-8 strings into a VARCHAR2 field (4000 characters) I get:
[ORA-01704: string literal too long using long UTF-8 Character set]
If I cut the string down to about 3600 characters, it works. What gives? Is there a way to insert my 4000 characters?
Note that there are some pretty strange characters in the string.
Thanks.
From the documentation:
Independently of the maximum length in characters, the length of VARCHAR2 data cannot exceed 4000 bytes.
So a field declared as varchar2(4000 [char]) can hold 4000 single-byte characters, or a lower number of multi-byte characters. You can't get around that, at least until 12c when varchar2 supports up to 32k.
If you do actually need to allow 4000 multi-byte characters in 11g or earlier you will need to create the column as a CLOB, which can hold gigabytes of data. (You might want to read more on LOB storage as well).
A single UTF-8 character can be more than 1 byte long, and Oracle has a limit of 4000 bytes, so fewer than 4000 multi-byte UTF-8 characters will fit into a column declared with a length of 4000.
Better to change the data type of the column to CLOB.
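A rough sketch of the CLOB route, using a made-up table; note that passing the value from PL/SQL (i.e. as a bind variable) also sidesteps ORA-01704, which only applies to literals in the SQL text:

create table utf8_demo (txt clob);

declare
  -- 5000 characters, 10000 bytes in AL32UTF8 ('ü' takes 2 bytes): too big for VARCHAR2(4000),
  -- but fine for a CLOB column when passed as a bind rather than a literal
  v_txt varchar2(32767) := rpad('ü', 5000, 'ü');
begin
  insert into utf8_demo (txt) values (v_txt);
  commit;
end;
/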

SQL Loader with utf8

I am getting the following error while loading Japanese data using SQL*Loader. My database is UTF8 (NLS parameters) and my OS supports UTF8.
Record 5: Rejected - Error on table ACTIVITY_FACT, column METADATA.
ORA-12899: value too large for column METADATA (actual: 2624, maximum: 3500)
My Control file:
load data
characterset UTF8
infile '../tab_files/activity_fact.csv' "STR ';'"
APPEND
into table activity_fact
fields terminated by ',' optionally enclosed by '~'
TRAILING NULLCOLS
(metadata CHAR(3500))
My table
create table activity_fact (
  metadata varchar2(3500 char)
);
Why is SQL*Loader throwing the wrong exception (actual: 2624, maximum: 3500)? 2624 is less than 3500.
The default length semantics for all datafiles (except UTF-16) is byte. So in your case you have a CHAR of 3500 bytes rather than characters. You have some multi-byte characters in your file, and those 2624 characters therefore use more than 3500 bytes, hence the (misleading) message.
You can sort this out by using character length semantics instead.
Alter this line in your control file
characterset UTF8
to this
characterset UTF8 length semantics char
and it will work on characters for CHAR fields (and some others) - in the same way that you have set up your table, so 3500 characters of up to four bytes each.
See the Utilities Guide on Character Length Semantics for more information
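Putting the pieces together, the control file from the question with that one change applied would look like this:

load data
characterset UTF8 length semantics char
infile '../tab_files/activity_fact.csv' "STR ';'"
APPEND
into table activity_fact
fields terminated by ',' optionally enclosed by '~'
TRAILING NULLCOLS
(metadata CHAR(3500))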

Difference between VARCHAR2(10 CHAR) and NVARCHAR2(10)

I've installed Oracle Database 10g Express Edition (Universal) with the default settings:
SELECT * FROM NLS_DATABASE_PARAMETERS;
NLS_CHARACTERSET AL32UTF8
NLS_NCHAR_CHARACTERSET AL16UTF16
Given that both CHAR and NCHAR data types seem to accept multi-byte strings, what is the exact difference between these two column definitions?
VARCHAR2(10 CHAR)
NVARCHAR2(10)
The NVARCHAR2 datatype was introduced by Oracle for databases that want to use Unicode for some columns while keeping another character set for the rest of the database (which uses VARCHAR2). The NVARCHAR2 is a Unicode-only datatype.
One reason you may want to use NVARCHAR2 might be that your DB uses a non-Unicode character set and you still want to be able to store Unicode data for some columns without changing the primary character set. Another reason might be that you want to use two Unicode character sets (AL32UTF8 for data that comes mostly from western Europe, AL16UTF16 for data that comes mostly from Asia, for example) because different character sets won't store the same data equally efficiently.
Both columns in your example (Unicode VARCHAR2(10 CHAR) and NVARCHAR2(10)) would be able to store the same data, however the byte storage will be different. Some strings may be stored more efficiently in one or the other.
Note also that some features won't work with NVARCHAR2, see this SO question:
Oracle Text will not work with NVARCHAR2. What else might be unavailable?
I don't think the answer from Vincent Malgrat is correct. When NVARCHAR2 was introduced a long time ago, nobody was even talking about Unicode.
Initially Oracle provided VARCHAR2 and NVARCHAR2 to support localization. Common data (including PL/SQL) was held in VARCHAR2, most likely US7ASCII in those days. Then you could apply NLS_NCHAR_CHARACTERSET individually (e.g. WE8ISO8859P1) for each of your customers in any country without touching the common part of your application.
Nowadays the character set AL32UTF8, which fully supports Unicode, is the default. In my opinion there is no reason anymore to use NLS_NCHAR_CHARACTERSET, i.e. NVARCHAR2, NCHAR, NCLOB. Note that there are more and more Oracle native functions which do not support NVARCHAR2, so you should really avoid it. Maybe the only reason is when you have to support mainly Asian characters, where AL16UTF16 consumes less storage compared to AL32UTF8.
The NVARCHAR2 data type stores variable-length character data. When you create a table with an NVARCHAR2 column, the maximum size is always in character length semantics, which is also the default and only length semantics for the NVARCHAR2 data type.
The NVARCHAR2 data type uses the AL16UTF16 character set, which encodes Unicode data in the UTF-16 encoding; AL16UTF16 uses 2 bytes to store a character. In addition, the maximum byte length of an NVARCHAR2 depends on the configured national character set.
VARCHAR2: the maximum size of VARCHAR2 can be in either bytes or characters. A VARCHAR2 column can only store characters in the database character set, while NVARCHAR2 can store virtually any characters. A single character may require up to 4 bytes.
By defining the field as:
VARCHAR2(10 CHAR) you tell Oracle it can use enough space to store 10
characters, no matter how many bytes it takes to store each one. A single character may require up to 4 bytes.
NVARCHAR2(10) you tell Oracle it can store 10 characters with 2 bytes per character
In Summary:
VARCHAR2(10 CHAR) can store a maximum of 10 characters and a maximum of 40 bytes (depending on the configured database character set).
NVARCHAR2(10) can store a maximum of 10 characters and a maximum of 20 bytes (depending on the configured national character set).
Note: Character set can be UTF-8, UTF-16,....
Please have a look at this tutorial for more detail.
Have a good day!
NVARCHAR2 is Unicode-only storage.
Though both data types are variable-length string data types, you can notice the difference in how they store values.
Each character is stored in bytes. Not all languages' alphabets need the same number of bytes per character: English needs 1 byte per character, whereas languages like Japanese or Chinese need more than 1 byte per character.
When you specify VARCHAR2(10), you are telling the DB that only 10 bytes of data will be stored. But when you say NVARCHAR2(10), it means 10 characters will be stored. In this case, you don't have to worry about the number of bytes each character takes.
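A small sketch of the difference, with a made-up table and the NLS settings from the question:

create table nchar_demo (
  v varchar2(10 char),  -- stored in the database character set (AL32UTF8)
  n nvarchar2(10)       -- stored in the national character set (AL16UTF16)
);

insert into nchar_demo (v, n) values ('Kärcher', N'Kärcher');

-- Same number of characters, different number of bytes per encoding:
select length(v)  as v_chars,   -- 7
       lengthb(v) as v_bytes,   -- 8  ('ä' is 2 bytes in AL32UTF8)
       length(n)  as n_chars,   -- 7
       lengthb(n) as n_bytes    -- 14 (2 bytes per character in AL16UTF16)
from nchar_demo;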
