I'm trying to import a CSV file via Oracle SQL*Loader, but I have a problem: some data contains line breaks within the double quotes. For example
(test.csv)
John,123,New York
Tom,456,Paris
Park,789,"Europe
London, City"
I think SQL*Loader uses the line break character to separate records.
This data generates the error "second enclosure string not present".
I use this control file:
(control.txt)
OPTIONS(LOAD=-1, ERRORS=-1)
LOAD DATA
INFILE 'test.csv'
TRUNCATE
INTO TABLE TMP
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
field1,
field2,
field3
)
and I execute a command like this
sqlldr 'scott/tiger' control='control.txt' log='result.txt'
I want to import 3 records, not 4 records...
Can SQL*Loader ignore line breaks within double quotes?
Seems you need to get rid of carriage return and/or line feed characters.
So replace
field3
with
field3 CHAR(4000) "REPLACE(TRIM(:field3),CHR(13)||CHR(10))"
or
field3 CHAR(4000) "REPLACE(REPLACE(TRIM(:field3),CHR(13)),CHR(10))"
where TRIM() is useful for removing leading and trailing whitespace.
In case you would like to preserve the embedded carriage returns, construct the control file using the "str" (stream) clause on the INFILE option line to set the end-of-record character. It tells sqlldr that hex 0D (carriage return, or ^M) is the record separator, so it will ignore the line feeds inside the double quotes:
INFILE 'test.csv' "str x'0D'"
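Putting it together, the control file from the question would become something like this (an untested sketch; since x'0D' is now the record separator, the line feed that remains at the start of each subsequent record is trimmed off field1 with LTRIM):

```
OPTIONS(LOAD=-1, ERRORS=-1)
LOAD DATA
INFILE 'test.csv' "str x'0D'"
TRUNCATE
INTO TABLE TMP
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
field1 "LTRIM(:field1, CHR(10))",
field2,
field3
)
```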
Related
Table contains 3 columns
ID -integer
Name-varchar
Description-varchar
A file with the .FILE extension has data with | as the delimiter.
E.g.: 12|Ramu|"Ramu" is an architect
The command I am using to load data into DB2:
db2 "Load CLIENT FROM ABC.FILE of DEL MODIFIED BY coldel0x7x keepblanks REPLACE INTO tablename(ID,Name,Description) nonrecoverable"
Data is loaded as follows:
12 Ramu Ramu
but I want it as:
12 Ramu "Ramu" is an architect
Take a look at how the format of delimited ASCII files is defined. The double quote (") is an optional delimiter for character data, so you would need to escape it. I have not tested it, but I would assume that you double the quote as you would in SQL:
|12|Ramu|"""Ramu"" is an architect"
Delimited files (CSV) are defined in RFC 4180. You need to either quote the entire field or not at all. Only fields that begin and end with a quote can contain other quotes, and those must be escaped as shown.
Use the nochardel modifier.
If you use '|' as a column delimiter, you must use 0x7C and not 0x7x:
MODIFIED BY coldel0x7C keepblanks nochardel
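Putting both corrections together, the load command from the question would look something like this (untested sketch; table and file names as in the question):

```
db2 "LOAD CLIENT FROM ABC.FILE OF DEL
     MODIFIED BY coldel0x7C keepblanks nochardel
     REPLACE INTO tablename(ID,Name,Description) NONRECOVERABLE"
```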
I'm trying to run a COPY command that populates the db from the CSV, but one column needs to be hardcoded.
Table columns names are:
col1,col2,col3
File content is (just the numbers, names are the db column names):
1234,5678,5436
What I need, based on my example, is to put this in the db:
col1 col2 col3
1234 5678 10
Note: 10 is hardcoded, ignoring the real value of col3 in the file.
Should I use FILLER? If so, what is the command?
My starting point is:
COPY SAMPLE.MYTABLE (col1,col2,col3)
FROM LOCAL 'c:\\1\\test.CSV'
UNCOMPRESSED DELIMITER ',' NULL AS 'NULL' ESCAPE AS '\' RECORD TERMINATOR ' ' ENCLOSED BY '"' DIRECT STREAM NAME 'Identifier_0' EXCEPTIONS 'c:\\1\\test.exceptions'
REJECTED DATA 'c:\\1\\test.rejections' ABORT ON ERROR NO COMMIT;
Can you help with how to load those columns (basically col3)?
Thanks
You just need a dummy FILLER field to parse (but ignore) the 3rd value in your csv. Then use AS with an expression to assign a literal to the third table column.
I've added it to your COPY below. However, I'm not sure I understand your RECORD TERMINATOR setting. I'd look at that a little closer. Perhaps you had a copy/paste issue or something.
COPY SAMPLE.MYTABLE (col1, col2, dummy FILLER VARCHAR, col3 AS 10)
FROM LOCAL 'c:\1\test.CSV' UNCOMPRESSED DELIMITER ','
NULL AS 'NULL' ESCAPE AS '\' RECORD TERMINATOR ' '
ENCLOSED BY '"' DIRECT STREAM NAME 'Identifier_0'
EXCEPTIONS 'c:\1\test.exceptions' REJECTED DATA 'c:\1\test.rejections'
ABORT ON ERROR NO COMMIT;
How do we handle data in Hive when \t is in the value and the delimiter is also \t? Suppose, for example, there is a column Street of type string with the value XXX\tYYY, and while creating the table we used \t as the field delimiter. How will the delimiter work? Will the \t in the value also be treated as a delimiter?
If your columns with \t values are enclosed by a quote character like ", then you could use csv-serde to parse the data like this:
Here is a sample dataset that I have loaded:
R1Col1 R1Col2 "R1Col3 MoreData" R1Col4
R2Col1 R2Col2 "R2Col3 MoreData" R2Col4
Register the jar from hive console
hive> add jar /path/to/csv-serde-1.1.2-0.11.0-all.jar;
Create a table with the specified serde and custom properties
hive> create table test_table(c1 string, c2 string, c3 string, c4 string)
> row format serde 'com.bizo.hive.serde.csv.CSVSerde'
> with serdeproperties(
> "separatorChar" = "\t",
> "quoteChar" = "\"",
> "escapeChar" = "\\"
> )
> stored as textfile;
Load your dataset into the table:
hive> load data inpath '/path/to/file/in/hdfs' into table test_table;
Do a select * from test_table to check the results
You could download the csv-serde from here.
It will treat it as a delimiter, yes, the same as if you had a semicolon (;) in the value and told it to split on semicolons: when the text is scanned, it will see the character and interpret it as the edge of the field.
To get around this, I used sed to find and replace characters before loading the data into Hive. Alternatively, I created the Hive table with different delimiters, or left it at the default ^A (\001) and then, when I extracted the data, used sed on the output to replace the \001 with commas, tabs, or whatever I needed. Running sed -i 's/oldval/newval/g' file on the command line will replace the characters in your file in place.
Is there a reason you chose \t as the delimiter instead of Hive's default field delimiter, ^A? Since tab is a fairly common character in text, and Hadoop/Hive is used a lot for handling text, it is tough to find a good delimiter character.
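As an illustration of the extract-then-convert approach, this replaces Hive's default ^A (\001) delimiter with tabs using sed (a sketch with a made-up file name; GNU sed syntax):

```shell
# Create a small ^A-delimited sample file (\001 is Hive's default delimiter).
printf '12\001Main Street\001NY\n' > extract.txt

# Replace every \001 with a tab, in place.
sed -i 's/\x01/\t/g' extract.txt

cat extract.txt
```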
We have faced the same issue loading data into our Hadoop clusters. What we did: we escaped the delimiter as \\t whenever it appeared within a data field, and added the below to the table definition:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n'
I have my data in this format.
"123";"mybook1";"2002";"publisher1";
"456";"mybook2;the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";
The fields are enclosed in "" and delimited by ;. Also, the book name may contain ';' in between.
Can you tell me how to load this data from the file into a Hive table?
The query below, which I am using now, is obviously not working:
create table books (isbn string,title string,year string,publisher string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
If possible, I want the isbn and year fields to be stored as int. Please help.
Also, I don't want to use RegexSerDe.
How can I use the sed command from Unix to clean the data and get my output?
I tried to learn about the sed command and found the replace option, so I can remove the " double quotations. But how can I handle the extra semicolon that comes in the middle of the data?
Please help.
I think you can preprocess with sed and then use the MetadataTypedColumnsetSerDe WITH SERDEPROPERTIES
sed -r ':a; s/^([^"]*("[^"]*"[^"]*)*);/\1XXXXX/g; t a; s/;/ /g; s/XXXXX/;/g' file
This sed matches the quote pairs to avoid processing what is between quotes, putting a placeholder on the semicolons outside of quoted text. Afterward it removes the ;'s from the book title text, replacing them with a space, and puts back the semicolons that are outside quotes.
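Running that sed on the problem line from the question shows the effect: the semicolon inside the quoted title becomes a space, while the field separators survive:

```shell
# The second record from the question, with a ';' inside the quoted title.
printf '"456";"mybook2;the best seller";"2004";"publisher2";\n' > books.csv

# Protect the separators, blank the embedded semicolon, restore the separators.
sed -r ':a; s/^([^"]*("[^"]*"[^"]*)*);/\1XXXXX/g; t a; s/;/ /g; s/XXXXX/;/g' books.csv
```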
See here for more how to load data using Hive including an example of MetadataTypedColumnsetSerDe WITH SERDEPROPERTIES:
https://svn.apache.org/repos/asf/hive/trunk/serde/README.txt
create external table books (isbn int, title string, year int, publisher string)
row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ';', 'quoteChar' = '\"')
location 'S3 path/HDFS path for the file';
The question: how (and where) can I specify the record terminator string of the DAT file when I pass the name of the DAT file on the command line using the "data" parameter rather than in the CTL file? I am using Oracle 11.2 SQL*Loader.
The goal: I need to quickly load a huge amount of data from a CSV file into Oracle 11.2 (or above). The field (column) separator is hex 1F (the US character, unit separator), the string delimiter is the double quote, and the record (row) separator is hex 1E (the RS character, record separator).
The problem: using "stream record format" with the "str terminator_string" clause of SQL*Loader works fine, but only when I can specify the name of the DAT file with the "infile" directive inside the CTL file. The name of my DAT file varies, however, so I pass it on the command line as the "data" parameter, and in this case I do not know how (or where) to specify the record terminator string of the DAT file.
Remark: The problem is the same as in the unsolved problem in this question.
Admittedly more a workaround than a proper solution, but it should work: keep a fixed name in the control file, then copy/rename/symlink each file to that fixed name and process it. Or, have a control file with an INFILE entry of "THE_DAT_FILE", run "sed" to change this to the required file name, and then invoke sqlldr using the sed'd file.
So, something like:
1. Get the data file F1
2. Copy/symlink F1 to the_file.dat (symlink, assuming Unix/Linux/Cygwin)
3. Run sqlldr with STR, which refers to INFILE as "the_file.dat"
4. When complete, delete/unlink the_file.dat
5. Repeat 1-4 for the next file(s) F2, F3, ... Fn
E.g.
for DAT_FILE in *.dat
do
ln -s $DAT_FILE /tmp/the_file.dat
sqlldr .....
rm /tmp/the_file.dat
done
Or
for DAT_FILE in *.dat
do
cat the_ctl_file | \
sed "s/THE_DAT_FILE/$DAT_FILE/" > /tmp/ctl_$DAT_FILE.cf
sqlldr ..... control=/tmp/ctl_$DAT_FILE.cf
done
I just ran into a similar situation, where I needed to use the same control file for a set of files, all with the Windows EOL character as the EOR and with embedded newlines in text fields.
Rather than coding a specific control file for each, with the name on the INFILE directive, I coded the name as /dev/null with the STR as:
INFILE '/dev/null' "STR '\r\n'"
And then on the sqlldr command line I use the DATA option to specify the actual flat file.
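So each run ends up looking something like this (the control and data file names here are placeholders for whatever the current file happens to be):

```
sqlldr scott/tiger control=fixed.ctl data=current_extract.dat log=current_extract.log
```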