I'm facing a very strange issue where Oracle 12c is not handling 2-byte characters the same way Oracle 11g does, which leads to problems with functions like LPAD.
We have two databases, one 11g and one 12c, with identical NLS parameters. While 11g counts Cyrillic characters as 1 in functions like LPAD, 12c counts them as 2, which causes problems: if we need a value to be 40 characters long, every Cyrillic character in it counts as 2 while being padded but is displayed as 1 character, so a value containing 5 Cyrillic characters that is LPADded to 40 actually comes out 35 characters long.
This behaviour is described in the official Oracle documentation (https://docs.oracle.com/database/121/SQLRF/functions107.htm#SQLRF00663), but the documentation has read this way for several versions (including 11g), so it's unclear to me why these two versions should behave differently with the same settings, and, if this is expected, how to deal with it.
Important notes:
both databases handle European characters (including special characters from some Eastern European alphabets, Greek, etc.) and Russian (Cyrillic) characters, so switching the territory to "RUSSIA" is not really an option;
using nvarchar2 instead of varchar2 solves the issue (it switches to the national character set, which is AL16UTF16), but it would mean converting all varchar2 columns in a 4 TB database to nvarchar2, which is quite troublesome and might lead to a LOT of wasted space;
the problem occurs in stored procedures managing data already stored in the database, so this doesn't look like a client misconfiguration.
Database properties for NLS parameters (I've removed date and currency formats since they're not really relevant):
+------------------------+------------+------------+
| Parameter              | 12c        | 11g        |
+------------------------+------------+------------+
| NLS_CHARACTERSET       | AL32UTF8   | AL32UTF8   |
| NLS_COMP               | BINARY     | BINARY     |
| NLS_DATE_LANGUAGE      | AMERICAN   | AMERICAN   |
| NLS_ISO_CURRENCY       | AMERICA    | AMERICA    |
| NLS_LANGUAGE           | AMERICAN   | AMERICAN   |
| NLS_LENGTH_SEMANTICS   | BYTE       | BYTE       |
| NLS_NCHAR_CHARACTERSET | AL16UTF16  | AL16UTF16  |
| NLS_NCHAR_CONV_EXCP    | FALSE      | FALSE      |
| NLS_NUMERIC_CHARACTERS | .,         | .,         |
| NLS_RDBMS_VERSION      | 12.1.0.2.0 | 11.2.0.4.0 |
| NLS_SORT               | BINARY     | BINARY     |
| NLS_TERRITORY          | AMERICA    | AMERICA    |
+------------------------+------------+------------+
V$Parameter properties (same, removed dates):
+------------------------+----------------+----------------+
| Parameter              | 12c            | 11g            |
+------------------------+----------------+----------------+
| NLS_COMP               | BINARY         | BINARY         |
| NLS_DATE_LANGUAGE      | ENGLISH        | ENGLISH        |
| NLS_ISO_CURRENCY       | UNITED KINGDOM | UNITED KINGDOM |
| NLS_LANGUAGE           | ENGLISH        | ENGLISH        |
| NLS_LENGTH_SEMANTICS   | CHAR           | CHAR           |
| NLS_NCHAR_CONV_EXCP    | FALSE          | FALSE          |
| NLS_NUMERIC_CHARACTERS | .,             | .,             |
| NLS_SORT               | BINARY         | BINARY         |
| NLS_TERRITORY          | UNITED KINGDOM | UNITED KINGDOM |
+------------------------+----------------+----------------+
Example from the 12c database:
SELECT 'This is a test данные испытаний' as "Original",
       lpad(nvl('This is a test данные испытаний', ' '), 40) as "LPADded",
       lpad(nvl('данные испытаний', ' '), 40) as "Cyrillic only",
       lpad(nvl('This is a test', ' '), 40) as "Non-cyrillic only",
       lpad(nvl(to_nchar('данные испытаний'), ' '), 40) as "NChar cyrillic only",
       lpad(nvl(to_nchar('This is a test данные испытаний'), ' '), 40) as "NChar mixed"
FROM dual;
Results:
This is a test данные испытаний (original - 31 chars)
This is a test данные испыта (std lpad - 28 chars)
данные испытаний (std lpad cyrillic only - 25 chars)
This is a test (std lpad non-cyrillic only - 40 chars)
данные испытаний (nchar lpad cyrillic only - 40 chars)
This is a test данные испытаний (nchar lpad mixed - 40 chars)
In the 11g database, all the above (except, of course, the original) have a length of 40 chars.
Thanks
I think the problem is related to ambiguous-width characters in Unicode. You can find a description here:
http://unicode.org/reports/tr11/#Ambiguous
In Oracle, the lengthc function always returns the length of a string in characters, while the lengthb function returns its length in bytes.
A possible solution could be to use the following form:
I tried with UNISTR('\4F4F'), which LPAD counts as two positions:
select lpad('pippo'||UNISTR('\4F4F'),10+lengthc(UNISTR('\4F4F')),'x') from dual;
and the displayed length is the desired one
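Applying the same idea to the Cyrillic example from the question, one possible workaround (just a sketch, assuming every affected character is counted twice by LPAD and is stored as exactly two bytes in AL32UTF8) is to widen the target length by the byte/character difference:

-- Hypothetical compensation: one extra padding position per 2-byte Cyrillic
-- letter, so the result comes out 40 characters long instead of 25.
select lpad(nvl('данные испытаний', ' '),
            40 + (lengthb('данные испытаний') - lengthc('данные испытаний'))
           ) as "Cyrillic padded to 40"
from dual;

The same expression can of course wrap a column or PL/SQL variable instead of a literal.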
Related
I'm testing Greenplum (which is based on Postgres) with a table of this form:
CREATE TABLE whiteglove (
  bigint    BIGINT,
  varbinary bytea,
  boolean   BOOLEAN,
  date      DATE,
  decimal   DECIMAL,
  double    float,
  real      REAL,
  integer   INTEGER,
  smallint  SMALLINT,
  timestamp TIMESTAMP,
  tinyint   smallint,
  varchar   VARCHAR
)
Then I try to insert this row using the Postgres JDBC driver
INSERT INTO whiteglove VALUES (100000,'68656c6c6f',TRUE,'10/10/2020',0.5,1.234567,1.234,10,2,'4/14/2015 7:32:33PM',2,'hello')
which fails with the following error
org.postgresql.util.PSQLException: ERROR: date/time field value out of range: "10/10/2020"
Hint: Perhaps you need a different "datestyle" setting.
Position: 57
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2532)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2267)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:312)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:448)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:369)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:310)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:296)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:273)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:268)
If I take that same query and execute it from the terminal using psql, it passes without problems:
dev=# select * from whiteglove ;
bigint | varbinary | boolean | date | decimal | double | real | integer | smallint | timestamp | tinyint | varchar
--------+-----------+---------+------+---------+--------+------+---------+----------+-----------+---------+---------
(0 rows)
dev=# INSERT INTO whiteglove VALUES (100000,'68656c6c6f',TRUE,'10/10/2020',0.5,1.234567,1.234,10,2,'4/14/2015 7:32:33PM',2,'hello');
INSERT 0 1
dev=# select * from whiteglove ;
bigint | varbinary | boolean | date | decimal | double | real | integer | smallint | timestamp | tinyint | varchar
--------+------------+---------+------------+---------+----------+-------+---------+----------+---------------------+---------+---------
100000 | 68656c6c6f | t | 2020-10-10 | 0.5 | 1.234567 | 1.234 | 10 | 2 | 2015-04-14 19:32:33 | 2 | hello
(1 row)
Any pointers on why I'm getting this out of range error??
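The hint in the error points at the DateStyle session setting; a quick way to compare what the psql session and the JDBC connection are actually using (just a sketch) is:

-- Run this from both psql and through the JDBC connection
SHOW datestyle;              -- e.g. "ISO, MDY"

-- If the two sessions report different field orderings, forcing one explicitly
-- (or writing unambiguous literals such as DATE '2020-10-10') avoids the
-- ambiguous parse of '10/10/2020' and '4/14/2015 7:32:33PM'.
SET datestyle = 'ISO, MDY';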
Dynamically Identify Columns in External Tables
We have a process wherein we upload employee data for multiple legislations (e.g. US, Philippines, Latin America) via SQL*Loader.
This happens at least once a week; in the current process, a control file is created every time employee information is loaded, and the data is then loaded into staging tables using SQL*Loader.
I was hoping to simplify the process by creating an external table and running a concurrent request to put the data into our staging tables.
There are two stumbling blocks I'm encountering:
There are some columns which are not being used by some legislations.
Example: US uses the column "Veteran_Information", while the Philippines and Latin America don't.
Philippines uses "SSS_Number" while US and Latin America don't.
Latin America uses a "Medical_Insurance" Column while US and Philippines don't.
Something like below:
US: LEGISLATION, EMPLOYEE_NUMBER, DATE_OF_BIRTH, VETERAN_INFORMATION
PHL: LEGISLATION, EMPLOYEE_NUMBER, DATE_OF_BIRTH, SSS_NUMBER
LAT: LEGISLATION, EMPLOYEE_NUMBER, DATE_OF_BIRTH, MEDICAL_INSURANCE
Business Users don't use a Standard CSV Template/Format.
Since the File is being sent by Non-IT Business Users, they don't usually follow a prescribed format. (Training/User issue, probably).
they often don't follow the correct order of columns
they often don't follow the correct number of columns
they often don't follow the correct names of columns
Something like below:
US: LEGISLATION, EMPLOYEE_ID, VETERAN_INFORMATION, DATE_OF_BIRTH, EMAIL_ADD
PHL: EMP_NUM, LEGISLATION, DOB, SSS_NUMBER, EMAIL_ADDRESS
LAT: LEGISLATION, PS_ID, BIRTH_DATE, EMAIL, MEDICAL_INSURANCE
Is there a way for External Tables to identify the correct order and naming of columns even if they're not in the correct order/naming convention in the File?
Taking the Column Data from Problem 2:
US: LEGISLATION | EMPLOYEE_ID | VETERAN_INFORMATION | DATE_OF_BIRTH | EMAIL_ADD
US | 111 | No | 1967 | vet#gmail.com
PHL: EMP_NUM | LEGISLATION | DOB | SSS_NUMBER | EMAIL_ADDRESS
222 | PHL | 1898 | 456789 | pinoy#gmail.com
LAT: LEGISLATION | PS_ID | BIRTH_DATE | EMAIL | MEDICAL_INSURANCE
HON | 333 | 1956 | hon#gmail.com | Yes
I would like it to be like this when it appears in the External Table:
LEGISLATION | EMPLOYEE_NUMBER | DATE_OF_BIRTH | VETERAN_INFORMATION | SSS_NUMBER | MEDICAL_INSURANCE | EMAIL_ADDRESS
US | 111 | 1967 | Y | (NULL) | (NULL) | vet#gmail.com
PHL | 222 | 1898 | (NULL) | 456789 | (NULL) | pinoy#gmail.com
HON | 333 | 1956 | (NULL) | (NULL) | Yes | hon#gmail.com
Is there a way for External Tables to do something like above?
Thanks in advance!
The simplest approach would be to use three distinct load scripts, one for each type of input (US, PHL, HON). Each script just discards the other two record types, places the columns in the right position (possibly doing some transformation, like 'No' -> 'N'), and inserts NULL for the columns that are not present for that record type.
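As a rough sketch of that approach for the US layout from Problem 2 (the directory, file, external table and staging table names, as well as the comma delimiter, are assumptions; PHL and LAT would follow the same pattern):

-- One external table per input layout; rows of the other legislations are discarded
create table emp_ext_us (
  legislation          varchar2(3),
  employee_id          varchar2(30),
  veteran_information  varchar2(30),
  date_of_birth        varchar2(30),
  email_add            varchar2(100)
)
organization external (
  type oracle_loader
  default directory emp_load_dir
  access parameters (
    records delimited by newline
    load when (1:2) = 'US'          -- positional test: keep only US records
    fields terminated by ','
    missing field values are null
  )
  location ('employees_us.csv')
)
reject limit unlimited;

-- Map into the common staging layout, filling the unused columns with NULL
insert into emp_staging
       (legislation, employee_number, date_of_birth,
        veteran_information, sss_number, medical_insurance, email_address)
select legislation,
       employee_id,
       date_of_birth,
       decode(veteran_information, 'No', 'N', 'Yes', 'Y', veteran_information),  -- example transformation
       null,
       null,
       email_add
from   emp_ext_us;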
I installed the latest version of the free HP Vertica server on CentOS release 6.6 (Final). Next, I set up the server and created a database, IM_0609. Then I created a table with this command:
CREATE TABLE MARKS (
  SERIAL_NUM       varchar(30),
  PERIOD           smallint,
  MARK_NUM         decimal(20,0),
  END_MARK_NUM     decimal(20,0),
  OLD_MARK_NUM     decimal(20,0),
  DEVICE_NAME      varchar(256),
  DEVICE_MARK      varchar(256),
  CALIBRATION_DATE date
);
Next, I exported the data from the DB2 database to a txt file:
5465465|12|+5211.|+5211.||Комплексы компьютеризированные самостоятельного предрейсового экспресс-обследования функционального состояния машиниста, водителя и оператора|ЭкОЗ-01|2004-12-09
5465465|12|+5211.|+5211.||Спектрометры эмиссионные|Metal Lab|2004-12-09
б/н|12|+5207.|+5207.|+5205.|Спектрометры эмиссионные|Metal Lab|2004-12-09
б/н|12|+5207.|+5207.|+5205.|Спектрометры эмиссионные|Metal Test|2004-12-09
....
and I changed the file encoding to UTF-8.
I then imported the data from the text file into the database table using this HP Vertica command:
copy MARKS from '/home/dbadmin/result.txt' delimiter '|' null as '' exceptions '/home/dbadmin/copy-error.log' ABORT ON ERROR;
All the data loaded, but the Russian text is displayed as garbled characters; apparently this is due to a character-encoding problem with the COPY command.
5465465 12 5211 5211 (null) Êîìïëåêñû êîìïüşòåğèçèğîâàííûå ñàìîñòîÿòåëüíîãî ïğåäğåéñîâîãî ıêñïğåññ-îáñëåäîâàíèÿ ôóíêöèîíàëüíîãî ñîñòîÿíèÿ ìàøèíèñòà, âîäèòåëÿ è îï İêÎÇ-01 2004-12-09
5465465 12 5211 5211 (null) Ñïåêòğîìåòğû ıìèññèîííûå Metal Lab 2004-12-09
Question: How can I fix this problem?
Make sure your file encoding is UTF-8:
[dbadmin#DCG023 ~]$ file rus
rus: UTF-8 Unicode text
[dbadmin#DCG023 ~]$ cat rus
5465465|12|+5211.|+5211.||Комплексы компьютеризированные самостоятельного предрейсового экспресс-обследования функционального состояния машиниста, водителя и оператора|ЭкОЗ-01|2004-12-09
5465465|12|+5211.|+5211.||Спектрометры эмиссионные|Metal Lab|2004-12-09
б/н|12|+5207.|+5207.|+5205.|Спектрометры эмиссионные|Metal Lab|2004-12-09
б/н|12|+5207.|+5207.|+5205.|Спектрометры эмиссионные|Metal Test|2004-12-09
Load the data
[dbadmin#DCG023 ~]$ vsql
Password:
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
(dbadmin#:5433) [dbadmin] > copy MARKS from '/home/dbadmin/rus' delimiter '|' null as '' ABORT ON ERROR;
Rows Loaded
-------------
4
(1 row)
Query the data
(dbadmin#:5433) [dbadmin] > select * from Marks;
SERIAL_NUM | PERIOD | MARK_NUM | END_MARK_NUM | OLD_MARK_NUM | DEVICE_NAME | DEVICE_MARK | CALIBRATION_DATE
------------+--------+----------+--------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------+-------------+------------------
5465465 | 12 | 5211 | 5211 | | Комплексы компьютеризированные самостоятельного предрейсового экспресс-обследования функционального состояния машиниста, водителя и оп | ЭкОЗ-01 | 2004-12-09
5465465 | 12 | 5211 | 5211 | | Спектрометры эмиссионные | Metal Lab | 2004-12-09
б/н | 12 | 5207 | 5207 | 5205 | Спектрометры эмиссионные | Metal Lab | 2004-12-09
б/н | 12 | 5207 | 5207 | 5205 | Спектрометры эмиссионные | Metal Test | 2004-12-09
(4 rows)
Below is a portion of the statistics for one of my tables. I'm not sure how to interpret the width column. Are those values in bytes? If so, I know fname and lname contain values with higher ASCII character counts than 5 and 6, and there are some 1-character values in mname.
Update 1.
Below is the output of select * from statistics. I'm only showing the first 5 columns of the output.
+--------+---------+--------+---------+-------+
| schema | table   | column | type    | width |
+========+=========+========+=========+=======+
| abc    | targets | fname  | varchar |     5 |
| abc    | targets | mname  | varchar |     0 |
| abc    | targets | lname  | varchar |     6 |
+--------+---------+--------+---------+-------+
The width column shows the "byte-width of the atom array" (as defined in gdk.h). This is, however, not the entire story for string columns, because there the atom array only stores offsets into a string heap.
MonetDB uses variable-width columns, because if there are few distinct string values, 64-bit offsets would be a waste of memory. So in your case, the fname column needs string offsets with 5 bytes, or 40 bits, and lname needs 6 bytes (48 bits). This could change if new values are inserted.
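To see that width is not the character length of the stored values, it can be compared against the data itself (a sketch; the schema, table and column names are taken from the question, and the statistics view is assumed to expose the name columns shown in the output above):

-- Longest actual value in the column, in characters
select max(length(fname)) as max_fname_chars from abc.targets;

-- The reported width for the same column (an offset width, per the explanation above)
select "schema", "table", "column", width
from   statistics
where  "table" = 'targets' and "column" = 'fname';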
The zero value for mname is interesting, because the width is initialised to 1 for new columns. Which version are you using?
I am trying to add a constant with the nvarchar2 datatype in a package specification, but after compilation it is stored in the database as something like ???. For example, I am trying to add a constant for the Armenian word մեկ:
x constant nvarchar2(3) default 'մեկ';
Can anyone suggest a solution to this problem, or is it impossible to do?
I have tested your example on two different databases with different NLS_CHARACTERSET configurations.
Configurations (retrieved with this query):
select *
from v$nls_parameters
where parameter in ('NLS_NCHAR_CHARACTERSET','NLS_CHARACTERSET','NLS_LANGUAGE');
First:
+----+------------------------+-----------+
| id | PARAMETER              | VALUE     |
+----+------------------------+-----------+
|  1 | NLS_LANGUAGE           | AMERICAN  |
|  2 | NLS_CHARACTERSET       | AL32UTF8  |
|  3 | NLS_NCHAR_CHARACTERSET | AL16UTF16 |
+----+------------------------+-----------+
Second:
+----+------------------------+--------------+
| id | PARAMETER              | VALUE        |
+----+------------------------+--------------+
|  1 | NLS_LANGUAGE           | RUSSIAN      |
|  2 | NLS_CHARACTERSET       | CL8MSWIN1251 |
|  3 | NLS_NCHAR_CHARACTERSET | AL16UTF16    |
+----+------------------------+--------------+
And the result is as follows: on the DB with character set AL32UTF8 the constant displays correctly; on CL8MSWIN1251 it displays as question marks ('???').
I haven't changed the character sets on these databases to validate my suggestion, but I suggest you change NLS_CHARACTERSET to AL32UTF8; it should help.
My package for tests:
create or replace package question27577711 is
  x constant nvarchar2(3) default 'մեկ';
  function get_constant_x return nvarchar2;
end question27577711;
/

create or replace package body question27577711 is
  function get_constant_x return nvarchar2 is
  begin
    return x;
  end get_constant_x;
end question27577711;
/

select question27577711.get_constant_x from dual;
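If changing NLS_CHARACTERSET is not an option, a different technique (not tested above, just a sketch) is to spell the literal with UNISTR escape sequences, so the package source contains only ASCII and the constant no longer depends on the database character set; \0574\0565\056F are the Unicode code points of մեկ:

create or replace package question27577711 is
  -- same constant, written with Unicode escapes instead of the raw Armenian text
  x constant nvarchar2(3) default unistr('\0574\0565\056F');
  function get_constant_x return nvarchar2;
end question27577711;
/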