I'm trying to load a UTF-8 CSV file containing Chinese characters, only to discover that the correct encoding is lost in my table. The table is configured with a UTF-8 charset.
I'm using a bash script on RHEL 5 with the MySQL command-line client, and my statement is
LOAD DATA LOCAL INFILE 'file' INTO TABLE `table`
CHARACTER SET "UTF8"
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
Is there something I can do to overcome this?
Recently I managed to do exactly this. I loaded a text file containing lots of Chinese characters into MySQL: the text file was encoded in UTF-8, the table was encoded with utf8, and I used your statement. It worked.
I think you should first convert your file to UTF-8 and make sure your table is encoded with utf8.
BTW, adding charset=utf8 to the DDL, e.g. CREATE TABLE test (column_a varchar(100)) charset=utf8, will make the table use utf8.
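If the table already exists, here is a minimal sketch of how you might check and fix its charset before loading; the table name and file path are placeholders, not from the original post:
-- see which charset the table and its columns actually use
SHOW CREATE TABLE my_table;
-- convert an existing table (and its text columns) to utf8
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8;
-- then load, telling MySQL that the file itself is UTF-8 encoded
LOAD DATA LOCAL INFILE '/path/to/file.csv' INTO TABLE my_table
CHARACTER SET utf8
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
-- if the result still looks mangled, check the connection charset as well, e.g. SET NAMES utf8;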
Related
I have files where the columns are delimited by char(30) and the lines are delimited by char(31). I'm using these delimiters mainly because the columns may contain newlines (\n), so the default line delimiter for Hive is not useful for us.
I have tried to change the line delimiter in Hive but get the error below:
LINES TERMINATED BY only supports newline '\n' right now.
Any suggestions?
Would writing a custom SerDe work?
Is there any plan to enhance this functionality in Hive in a future release?
Thanks.
Not sure if this helps, or is the best answer, but when faced with this issue, what we ended up doing was setting the 'textinputformat.record.delimiter' MapReduce Java property to the delimiter being used. In our case it was the string "{EOL}", but it could be any unique string for all practical purposes.
We set this in our Beeline shell, which allowed us to pull back the fields correctly. It should be noted that once we did this, we converted the data to Avro as fast as possible so we didn't need to explain to every user, and the user's baby brother, to set the {EOL} line delimiter.
set textinputformat.record.delimiter={EOL};
Here is the full example.
#example CSV data (fields broken by '^' and ends of lines broken by the string '{EOL}')
ID^TEXT
11111^Some THings WIth
New Lines in THem{EOL}11112^Some Other THings..,?{EOL}
111113^Some crazy thin
gs
just crazy{EOL}11114^And Some Normal THings.
#here is the CSV table we laid on top of the data ('\136' is the octal escape for the '^' field delimiter)
CREATE EXTERNAL TABLE CRAZY_DATA_CSV
(
ID STRING,
TEXT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\136'
STORED AS TEXTFILE
LOCATION '/archive/CRAZY_DATA_CSV'
TBLPROPERTIES('skip.header.line.count'='1');
#here is the Avro table which we'll migrate into below.
CREATE EXTERNAL TABLE CRAZY_DATA_AVRO
(
ID STRING,
TEXT STRING
)
STORED AS AVRO
LOCATION '/archive/CRAZY_DATA_AVRO'
TBLPROPERTIES ('avro.schema.url'='hdfs://nameservice/archive/avro_schemas/CRAZY_DATA.avsc');
#And finally, the magic is here. We set the custom delimiter and import into our Avro table.
set textinputformat.record.delimiter={EOL};
INSERT INTO TABLE CRAZY_DATA_AVRO SELECT * from CRAZY_DATA_CSV;
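As a quick sanity check after the insert (a sketch using the hypothetical table names from the example above), you can confirm that records with embedded newlines landed as single rows:
-- row count should match the logical {EOL}-delimited records, not the physical file lines
SELECT COUNT(*) FROM CRAZY_DATA_AVRO;
SELECT * FROM CRAZY_DATA_AVRO LIMIT 10;
-- before querying ordinary newline-delimited text tables in the same session,
-- put the record delimiter back to its default (e.g. with RESET;)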
I worked it out by using the --hive-delims-replacement ' ' option during the Sqoop extract, so the characters \n, \001 and \r are replaced with a space in the columns.
I'm using DBIx::Class to fetch data from Oracle (11.2). When the data is fetched, a value such as "Alfred Kärcher" is returned as "Alfred Karcher". I tried setting the NLS_LANG and NLS_NCHAR environment variables, but still no change.
I also used the utf8 module to verify that the data is UTF-8 encoded.
This looks like the Oracle client library converting the data.
Make sure the database encoding is set to AL32UTF8 and the environment variable NLS_LANG to AMERICAN_AMERICA.AL32UTF8.
It might also be possible to fix this by setting the ora_(n)charset parameter instead.
The two links from DavidEG contain all the info that's needed to make it work.
You don't need use utf8; in your script, but make sure you set STDOUT to UTF-8 encoding, e.g. with use encoding 'utf8';
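To verify the first point, a plain SQL check (independent of Perl) shows what the database character sets actually are:
-- NLS_CHARACTERSET should be AL32UTF8 for full Unicode support in VARCHAR2 columns
SELECT parameter, value
FROM nls_database_parameters
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');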
Here the problem is with the column datatype that you specified for storage.
If the column is declared as VARCHAR2(10), Oracle actually stores 10 bytes. For plain ASCII text, 10 bytes means 10 characters, but if the data you insert contains special characters such as umlauts, each of those needs 2 bytes, and you end up with ORA-12899: value too large for column.
So if the data inserted into the column is provided by users from different countries, use VARCHAR2(10 CHAR).
In bytes: VARCHAR2(10 BYTE). This will support up to 10 bytes of data, which could be as few as two characters in a multi-byte character set.
In characters: VARCHAR2(10 CHAR). This will support up to 10 characters of data, which could be as much as 40 bytes of information.
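A small sketch of the difference (the table names are made up for illustration):
-- byte semantics: 10 bytes, so a value with multi-byte characters can overflow
CREATE TABLE t_byte (name VARCHAR2(10 BYTE));
-- character semantics: 10 characters, however many bytes each one needs
CREATE TABLE t_char (name VARCHAR2(10 CHAR));
INSERT INTO t_char (name) VALUES ('Kärcherrrr'); -- 10 characters, 11 bytes in AL32UTF8: fits
INSERT INTO t_byte (name) VALUES ('Kärcherrrr'); -- same value: fails with ORA-12899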
I am trying to load a CSV file with currency symbols using SQL*Loader. The symbol '£' gets replaced by '#', and the symbol '€' gets replaced by NULL.
Not sure if I should tweak some settings in my control file?
Here are the values from NLS_DATABASE_PARAMETERS:
NLS_NCHAR_CHARACTERSET = AL16UTF16
NLS_CHARACTERSET = AL32UTF8
Any pointers would be of great help.
Extract of csv file -
id,currency
1234,£
5678,€
Datatype of the column for currency is NVARCHAR2(10).
Here's the ctl file -
OPTIONS(skip=1)
LOAD DATA
TRUNCATE
INTO TABLE schema_name.table_name
FIELDS TERMINATED BY ","
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
filler1 filler,
filler2 filler,
Id INTEGER EXTERNAL,
Currency CHAR "TRIM(:Currency)"
)
I guess this is a character set problem.
Did you set the character set in the SQL*Loader control file to UTF8?
CHARACTERSET UTF8
Also, is the file itself UTF8 encoded?
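For reference, CHARACTERSET goes right after the LOAD DATA keyword, so the top of the control file above would look roughly like this (a sketch keeping the original options; the rest is unchanged):
OPTIONS(skip=1)
LOAD DATA
CHARACTERSET UTF8
TRUNCATE
INTO TABLE schema_name.table_name
-- ... rest of the control file as before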
Thanks Justin and Patrick for the pointer!
The file was not UTF-8 encoded. I converted the file to UTF-8 encoding and it worked!
For those who don't know how to convert a file's encoding using Notepad++ (like me; I just learned how to do it):
Create a new file in Notepad++ -> go to Encoding -> Encode in UTF-8 -> copy-paste the contents -> save the file as .csv
I have an Oracle database that I can use to store and display Chinese characters.
But when I use SQL*Plus to query the database (e.g. running a script from SQL*Plus), the Chinese characters can't be displayed (all question marks). My goal is eventually to export the data to CSV files containing the Chinese characters. Any comment on how I can do it?
I am using Oracle 10g. The character sets for the DB are as below: NLS_NCHAR_CHARACTERSET AL16UTF16, NLS_CHARACTERSET AL32UTF8.
I have the ® (circled "R") symbol coming in a .txt file in one of the fields, and when that file is loaded into an external table, the symbol is converted to a '?'.
Please suggest.
Where do you see the ® being converted to a question mark? It may be a problem with the encoding of what you're using to view the table rather than with the table itself. I'd also check what you're using to load the database. UTF-8 should support the character.
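One way to tell whether the stored data or only the display is wrong (a sketch; the table and column names are placeholders) is Oracle's DUMP function, which shows the raw bytes actually stored, bypassing any client-side conversion:
-- 1016 = show the bytes in hex along with the character set name
SELECT my_column, DUMP(my_column, 1016) AS stored_bytes
FROM my_external_table
WHERE ROWNUM <= 10;
-- an ® stored correctly in AL32UTF8 appears as the byte pair c2,ae;
-- a literal '?' (3f) means the value was already converted before or during the load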