Loading Unicode Characters with Oracle SQL Loader (sqlldr) results in question marks

I'm trying to load localized strings from a Unicode (UTF-8 encoded) csv into an Oracle database using SQL*Loader. I've tried all sorts of combinations, but nothing seems to give me the result I'm looking for, which is to have special Greek characters like Δ stored correctly instead of being converted to garbled text or an inverted question mark (¿).
My table definition looks like this:
CREATE TABLE "GLOBALIZATIONRESOURCE"
(
"RESOURCETYPE" VARCHAR2(255 CHAR) NOT NULL ENABLE,
"CULTURE" VARCHAR2(20 CHAR) NOT NULL ENABLE,
"KEY" VARCHAR2(128 CHAR) NOT NULL ENABLE,
"VALUE" VARCHAR2(2048 CHAR),
"DESCRIPTION" VARCHAR2(512 CHAR),
CONSTRAINT "PK_GLOBALIZATIONRESOURCE" PRIMARY KEY ("RESOURCETYPE","CULTURE","KEY") USING INDEX TABLESPACE REPSPACE_IX ENABLE
)
TABLESPACE REPSPACE;
I have tried the following configurations in my control file (and actually every permutation I could think of):
load data
TRUNCATE
INTO TABLE "GLOBALIZATIONRESOURCE"
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
"RESOURCETYPE" CHAR(255),
"CULTURE" CHAR(20),
"KEY" CHAR(128),
"VALUE" CHAR(2048),
"DESCRIPTION" CHAR(512)
)

load data
CHARACTERSET UTF8
TRUNCATE
INTO TABLE "GLOBALIZATIONRESOURCE"
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
"RESOURCETYPE" CHAR(255),
"CULTURE" CHAR(20),
"KEY" CHAR(128),
"VALUE" CHAR(2048),
"DESCRIPTION" CHAR(512)
)

load data
CHARACTERSET UTF16
TRUNCATE
INTO TABLE "GLOBALIZATIONRESOURCE"
FIELDS TERMINATED BY X'002c' OPTIONALLY ENCLOSED BY X'0022'
TRAILING NULLCOLS
(
"RESOURCETYPE" CHAR(255),
"CULTURE" CHAR(20),
"KEY" CHAR(128),
"VALUE" CHAR(2048),
"DESCRIPTION" CHAR(512)
)
With the first two options, the Unicode characters don't get encoded correctly and just show up as upside-down question marks.
If I choose the last option, UTF16, then I get the following error, even though all the data in my fields is much shorter than the specified length:
Field in data file exceeds maximum length
It seems as though no combination of control file settings (even setting the byte order to little or big endian) works correctly. Can someone please give an example of a configuration (table structure and CTL file) that correctly loads Unicode data from a csv? Any help would be greatly appreciated.
Note: I've already been to http://docs.oracle.com/cd/B19306_01/server.102/b14215/ldr_concepts.htm and http://docs.oracle.com/cd/B10501_01/server.920/a96652/ch10.htm.

I had the same issue and resolved it with the steps below:
Open the data file in Notepad++, go to the "Encoding" menu, select "Encode in UTF-8", and save the file.
Use CHARACTERSET UTF8 in the CTL file and then load the data.

You have two problems:
Character set.
Answer: You can solve this problem by finding the character set of your text file (most of the time Notepad++ can detect this). After finding the character set, you have to find the corresponding sqlldr character set name. You can find this info at https://docs.oracle.com/cd/B10501_01/server.920/a96529/appa.htm#975313
After all of these steps, the character set problem should be solved.
Despite your actual data length, sqlldr says "Field in data file exceeds maximum length".
Answer: You can solve this problem by adding CHAR(4000) (or whatever the actual length is) to the problematic column. In my case, the problematic column is the "E" column. An example is below. This is how I solved my problem; hope it helps.
LOAD DATA
CHARACTERSET UTF8
-- This line is comment
-- Turkish charset (for ÜĞİŞ etc.)
-- CHARACTERSET WE8ISO8859P9
-- Character list is here.
-- https://docs.oracle.com/cd/B10501_01/server.920/a96529/appa.htm#975313
INFILE 'data.txt' "STR '~|~\n'"
TRUNCATE
INTO TABLE SILTAB
FIELDS TERMINATED BY '#'
TRAILING NULLCOLS
(
a,
b,
c,
d,
e CHAR(4000)
)

You must ensure that the following character sets are the same:
db character set
dump file character set
the client from which you are doing the import (NLS_LANG)
If the client-side character set is different, Oracle will attempt to perform character conversion to the native db character set, and this might not always provide the desired result.
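For example, you can check the database character set with a query like the one below (the same query appears in an answer further down) and then set the client's NLS_LANG environment variable to a matching value such as AMERICAN_AMERICA.AL32UTF8:
select value
from nls_database_parameters
where parameter = 'NLS_CHARACTERSET';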

Don't use MS Office to save the spreadsheet as a Unicode .csv.
Instead, use OpenOffice to save it as a UTF-8 encoded .csv file.
Then, in the loader control file, add "CHARACTERSET UTF8".
Run Oracle SQL*Loader; this gives me correct results.

There is a range of character set encodings that you can use in the control file when loading data with SQL*Loader.
For Greek characters, I believe a Western European character set should do the trick.
LOAD DATA
CHARACTERSET WE8ISO8859P1
or, in the case of MS Word input files with smart (curly) characters, try this in the control file:
LOAD DATA
CHARACTERSET WE8MSWIN1252

Related

Multiple Line - Multiple CLOB per line SQL Loader

I'm finding myself in a predicament.
I am parsing a log file with multiple entries of SOAP calls. Each SOAP call can contain payloads of 4000+ characters, preventing me from using VARCHAR2, so I must use a CLOB.
I have to load those payloads into an Oracle DB (12c).
I successfully split the log into single fields and got the payloads and the headers of the calls into two separate files.
How do I create a CTL file that loads from an infile (that contains data for other fields) and reads CLOB files in pairs?
Ideally:
LOAD DATA
INFILE 'load.ldr'
BADFILE 'load.bad'
APPEND
INTO TABLE ops_payload_tracker
FIELDS TERMINATED BY '§'
TRAILING NULLCOLS
( id,
direction,
http_info,
payload CLOB(),
header CLOB(),
host_name)
but I don't know, and can't find anywhere on the internet, how to do this for more than one record and how to reference a single record with two CLOBs.
Worth mentioning that these are JBoss logs, in a bash environment.
Check the size of the VARCHAR2 type in 12c. I thought it was increased to 32K:
https://oracle-base.com/articles/12c/extended-data-types-12cR1
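A minimal sketch of what that looks like, assuming the instance's MAX_STRING_SIZE initialization parameter has been set to EXTENDED (the table and column names here are just placeholders):
-- Check whether extended data types are enabled (STANDARD or EXTENDED)
select value from v$parameter where name = 'max_string_size';
-- With EXTENDED, a VARCHAR2 column can hold up to 32767 bytes
create table payload_test (payload varchar2(32767));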
Also see this sample: SQL Loader, CLOB, delimited fields.
"I already create two separate files for payload and headers. How
should I specify that the two files are there for the same ID?"
see example here:
https://oracle-base.com/articles/10g/load-lob-data-using-sql-loader
roughly:
sample data file
1,one,01-JAN-2006,1_clob_header.txt,1_clob_details.txt
2,two,02-JAN-2006,2_clob_header.txt,2_clob_details.txt
ctl
LOAD DATA
INFILE 'lob_test_data.txt'
INTO TABLE lob_tab
FIELDS TERMINATED BY ','
(number_content CHAR(10),
varchar2_content CHAR(100),
date_content DATE "DD-MON-YYYY" ":date_content",
clob_filename FILLER CHAR(100),
clob_content LOBFILE(clob_filename) TERMINATED BY EOF,
blob_filename FILLER CHAR(100),
blob_content LOBFILE(blob_filename) TERMINATED BY EOF)
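Adapting that pattern to the table from the question might look roughly like the sketch below; the two FILLER column names and the per-record file name fields in load.ldr are assumptions, not something from the original post:
LOAD DATA
INFILE 'load.ldr'
BADFILE 'load.bad'
APPEND
INTO TABLE ops_payload_tracker
FIELDS TERMINATED BY '§'
TRAILING NULLCOLS
( id,
direction,
http_info,
payload_filename FILLER CHAR(255),
payload LOBFILE(payload_filename) TERMINATED BY EOF,
header_filename FILLER CHAR(255),
header LOBFILE(header_filename) TERMINATED BY EOF,
host_name)
Each line of load.ldr then carries the two file names for that record, and SQL*Loader reads each referenced file into the matching CLOB column.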

perl DBIx::Class converting values with umlaut

I'm using DBIx::Class to fetch data from Oracle (11.2). When the data is fetched, for example "Alfred Kärcher", it returns the value as "Alfred Karcher". I tried to add the NLS_LANG and NLS_NCHAR environment variables, but there is still no change.
I also used the utf8 module to verify that the data is UTF-8 encoded.
This looks like the Oracle client library converting the data.
Make sure the database encoding is set to AL32UTF8 and the environment variable NLS_LANG to AMERICAN_AMERICA.AL32UTF8.
It might also be possible by setting the ora_(n)charset parameter instead.
The two links from DavidEG contain all the info that's needed to make it work.
You don't need "use utf8;" in your script, but make sure you set STDOUT to UTF-8 encoding: use encoding 'utf8';
Here the problem is with the column data type that you specified.
If your column is specified as VARCHAR2(10), Oracle actually stores 10 bytes. For English text, 10 bytes means 10 characters, but if the data you insert into the column contains special characters like umlauts, each of those requires 2 bytes, and you end up with ORA-12899: value too large for column.
So if the data being inserted into the column is provided by users from different countries, use VARCHAR2(10 CHAR).
In bytes: VARCHAR2(10 BYTE). This will support up to 10 bytes of data, which could be as few as two characters in a multi-byte character set.
In characters: VARCHAR2(10 CHAR). This will support up to 10 characters of data, which could be as much as 40 bytes of information.
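A minimal illustration of the two declarations (hypothetical table and column names):
-- Byte semantics: at most 10 bytes, which may be fewer than 10 characters with multi-byte data
CREATE TABLE t_byte_semantics (name VARCHAR2(10 BYTE));
-- Character semantics: 10 characters fit regardless of how many bytes each character needs
CREATE TABLE t_char_semantics (name VARCHAR2(10 CHAR));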

Loading currency symbols using SQL*Loader

I am trying to load a csv file with currency symbols, using SQL*Loader. The symbol '£' gets replaced by '#', and the symbol '€' gets replaced by NULL.
Not sure if I should tweak some settings in my control file?
Here are the values from NLS_DATABASE_PARAMETERS:
NLS_NCHAR_CHARACTERSET = AL16UTF16
NLS_CHARACTERSET = AL32UTF8
Any pointers would be of great help.
Extract of csv file -
id,currency
1234,£
5678,€
The datatype of the currency column is NVARCHAR2(10).
Here's the CTL file:
OPTIONS(skip=1)
LOAD DATA
TRUNCATE
INTO TABLE schema_name.table_name
FIELDS TERMINATED BY ","
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
filler1 filler,
filler2 filler,
Id INTEGER EXTERNAL,
Currency CHAR "TRIM(:Currency)"
)
I guess this is a character set problem.
Did you set the character set in the SQL*Loader control file to UTF8?
CHARACTERSET UTF8
Also, is the file itself UTF-8 encoded?
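For reference, a minimal sketch of where that clause would go in the posted control file (assuming the csv itself is saved as UTF-8):
OPTIONS(skip=1)
LOAD DATA
CHARACTERSET UTF8
TRUNCATE
INTO TABLE schema_name.table_name
FIELDS TERMINATED BY ","
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
filler1 filler,
filler2 filler,
Id INTEGER EXTERNAL,
Currency CHAR "TRIM(:Currency)"
)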
Thanks Justin and Patrick for the pointer!
The file was not UTF-8 encoded. I converted the file to UTF-8 encoding and it worked!
For those who don't know how to convert a file's encoding using Notepad++ (like me, I just learned how to do it):
Create a new file in Notepad++ -> go to Encoding -> Encode in UTF-8 -> copy-paste the contents -> save the file as .csv

Defining a Character Set for a column for Oracle database tables

I am running the following statement in SQL*Plus:
CREATE TABLE tbl_audit_trail (
id NUMBER(11) NOT NULL,
old_value varchar2(255) NOT NULL,
new_value varchar2(255) NOT NULL,
action varchar2(20) CHARACTER SET latin1 NOT NULL,
model varchar2(255) CHARACTER SET latin1 NOT NULL,
field varchar2(64) CHARACTER SET latin1 NOT NULL,
stamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
user_id NUMBER(11) NOT NULL,
model_id varchar2(65) CHARACTER SET latin1 NOT NULL,
PRIMARY KEY (id),
KEY idx_action (action)
);
I am getting the following error:
action varchar2(20) CHARACTER SET latin1 NOT NULL,
*
ERROR at line 5:
ORA-00907: missing right parenthesis
Can you suggest what I am missing?
The simple answer is that, unlike MySQL, character sets can't be defined at column (or table) level. Latin1 is not a valid Oracle character set either.
Character sets are consistent across the database and will have been specified when you created the database. You can find your character set by querying NLS_DATABASE_PARAMETERS:
select value
from nls_database_parameters
where parameter = 'NLS_CHARACTERSET'
The full list of possible character sets is available in the documentation for 11g R2 and for 9i, or you can query V$NLS_VALID_VALUES.
It is possible to use the ALTER SESSION statement to set the NLS_LANGUAGE or the NLS_TERRITORY, but unfortunately you can't do this for the character set. I believe this is because altering the language changes how Oracle would display the stored data whereas changing the character set would change how Oracle stores the data.
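For example, a minimal sketch of both the dictionary query and a session-level NLS change (the territory value is just an illustration):
select value from v$nls_valid_values where parameter = 'CHARACTERSET';
ALTER SESSION SET NLS_TERRITORY = 'GERMANY';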
When displaying the data, you can of course specify the required character set in whichever client you're using.
Character set migration is not a trivial task and should not be done lightly.
On a slight side note, why are you trying to use Latin-1? It would be more normal to set up a new database in something like UTF-8 (otherwise known as AL32UTF8 - don't use UTF8) or UTF-16 so that you can store multi-byte data effectively. Even if you don't need it now, it's wise to attempt - no guarantees in life - to future-proof your database so there is no need to migrate later.
If you're looking to specify differing character sets for different columns in a database, then the better option would be to determine whether this requirement is really necessary and to try to remove it. If it is definitely necessary [1], then your best bet might be to use a character set that is a superset of all potential character sets. Then, have some sort of check constraint that limits the column to specific hex values. I would not recommend doing this at all; the potential for mistakes to creep in is massive and it's extremely complex. Furthermore, different character sets render different hex values differently. This, in turn, means that you need to enforce that a column is rendered in a specific character set, which is impossible as it falls outside the scope of the database.
[1] I'd be interested to know the situation.
According to the provided DDL statement, there seems to be a need to use two character sets. Oracle implements this functionality differently from MySQL, with the N* data types such as NVARCHAR2 and NCHAR. Latin1 is similar to a Western European character set, which might be the default. So you are able to define, for example, "Latin1" (WE**) as the database character set and some Unicode character set (UTF8, ...) as the national one.
The NVARCHAR2 datatype was introduced by Oracle for databases that want to use Unicode for some columns while keeping another character set for the rest of the database (which uses VARCHAR2). NVARCHAR2 is a Unicode-only datatype.
The reason you would want to use NVARCHAR2 might be that your DB uses a non-Unicode character set and you still want to be able to store Unicode data in some columns.
Columns in your example would be able to store the same data; however, the byte storage will be different.
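As an illustration only, here is a hypothetical Oracle-valid rewrite of the posted DDL: the per-column CHARACTER SET clauses are dropped (Oracle does not support them), NVARCHAR2 is used where a separate Unicode character set was wanted, and MySQL's inline KEY becomes a separate CREATE INDEX:
CREATE TABLE tbl_audit_trail (
id NUMBER(11) NOT NULL,
old_value VARCHAR2(255) NOT NULL,
new_value VARCHAR2(255) NOT NULL,
action NVARCHAR2(20) NOT NULL,
model NVARCHAR2(255) NOT NULL,
field NVARCHAR2(64) NOT NULL,
stamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
user_id NUMBER(11) NOT NULL,
model_id NVARCHAR2(65) NOT NULL,
PRIMARY KEY (id)
);
-- MySQL's inline "KEY idx_action (action)" becomes a separate index in Oracle
CREATE INDEX idx_action ON tbl_audit_trail (action);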

Handling UTF-8 characters in Oracle external tables

I have an external table that reads from a fixed-length file. The file is expected to contain special characters. In my case, the word containing a special character is "Göteborg". Because "ö" is a special character, it looks like Oracle is counting it as 2 bytes. That causes the trouble: the subsequent fields in the file get shifted by 1 byte, messing up the data. Has anyone faced this issue before? So far we have tried the following solutions:
Changed the value of NLS_LANG to AMERICAN_AMERICA.WE8ISO8859P1
Tried setting the database character set to UTF-8
Tried changing NLS_LENGTH_SEMANTICS to CHAR instead of BYTE using ALTER SYSTEM
Tried changing the external table character set to AL32UTF8
Tried changing the external table character set to UTF-8
Nothing works.
Other details include:
File is UTF-8 encoded
Operating System : RHEL
Database: Oracle 11g
Anything else that I might be missing? Any help will be appreciated. Thanks!
The nls_length_semantics only pertains to the creation of new tables.
Below is what I did to fix this very problem. The key access parameters are:
records delimited by newline
CHARACTERSET AL32UTF8
STRING SIZES ARE IN CHARACTERS
For example:
ALTER SESSION SET nls_length_semantics = CHAR
/
CREATE TABLE TDW_OWNER.SDP_TST_EXT
(
COST_CENTER_CODE VARCHAR2(10) NULL,
COST_CENTER_DESC VARCHAR2(40) NULL,
SOURCE_CLIENT VARCHAR2(3) NULL,
NAME1 VARCHAR2(35) NULL
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY DBA_DATA_DIR
ACCESS PARAMETERS
( records delimited by newline
CHARACTERSET AL32UTF8
STRING SIZES ARE IN CHARACTERS
logfile DBA_DATA_DIR:'sdp_tst_ext_%p.log'
badfile DBA_DATA_DIR:'sdp_tst_ext_%p.bad'
discardfile DBA_DATA_DIR:'sdp_tst_ext_%p.dsc'
fields
notrim
(
COST_CENTER_CODE CHAR(10)
,COST_CENTER_DESC CHAR(40)
,SOURCE_CLIENT CHAR(3)
,NAME1 CHAR(35)
)
)
LOCATION (DBA_DATA_DIR:'sdp_tst.dat')
)
REJECT LIMIT UNLIMITED
NOPARALLEL
NOROWDEPENDENCIES
/
