How do I skip the first character?
Here is the CSV file that I want to load:
H
B"01","Mosco"
B"02","Delhi"
T
Here is the control file:
LOAD DATA
INFILE 'capital.csv'
APPEND
INTO TABLE CAPITALS
WHEN (01)='B'
FIELDS TERMINATED BY ","
OPTIONALLY ENCLOSED BY '"'
(
ID,
CAPITAL
)
When I run this, the leading 'B' ends up in the loaded data.
The table should look like
[![Table view][1]][1]
How do I skip the 'B'?
[1]: https://i.stack.imgur.com/2U3Vo.png
Disregard the first character. Can you have the source put a comma after the record type indicator?
If so, do this to ignore it:
(
RECORD_IND FILLER,
ID,
CAPITAL
)
If not, this should take care of it in your situation:
ID "SUBSTR(:ID, 2)",
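To make the intended result concrete, here is a minimal Python sketch (not SQL*Loader) of the same idea: keep only the records that start with 'B', drop that first character, and read the rest as ordinary CSV. The function name `load_capitals` is hypothetical and only serves the illustration.

```python
import csv
import io


def load_capitals(text):
    """Return (id, capital) tuples for the detail records, i.e. lines starting with 'B'."""
    rows = []
    for line in text.splitlines():
        # Mirrors WHEN (01)='B': the header 'H' and trailer 'T' lines are skipped.
        if not line.startswith("B"):
            continue
        # Drop the record type indicator, then parse the remainder as quoted CSV.
        record = next(csv.reader(io.StringIO(line[1:])))
        rows.append((record[0], record[1]))
    return rows


sample = 'H\nB"01","Mosco"\nB"02","Delhi"\nT\n'
print(load_capitals(sample))
```

Running it against the sample file is a quick way to check what the table should contain before comparing with what sqlldr actually loaded.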
My searches for complex sqlldr parsing of key-value pairs turned up little, so I am posting an example that worked for my needs and that you may be able to adapt.
The issue: millions of lines of Tomcat access log, e.g.
time='[01/Jan/2001:00:00:03 +0000]' srcip='192.168.0.1' localip='10.0.0.1' referer='-' url='/limsM/SamplesGet-SampleMaster?samplefilters=%5B%22parent_sample%20%3D%208504571%22%2C%22status%20%3D%20'D'%22%5D&depthfilters=%5B%22scale_id%20%3D%2011311%22%5D' servername='yo.yo.dyne.org' rspms='218' rspbytes='2198'
are to be parsed into this Oracle table for convenience of analysis of selected parameters.
create table transfer.loganal (
time date
, timestr varchar2(30)
, srcip varchar2(75)
, localip varchar2(15)
, referer clob
, uri clob
, servername varchar2(50)
, rspms number
, rspbytes number
, logsource varchar2(50)
);
What does a sqlldr control script look like that will accomplish this?
This is my first working solution. Refinements, suggestions, improvements always welcome.
Given Tomcat access logs in a directory, e.g.
yoyotomcat/
combined.20010101
combined.20010102
...
Save this file as combined.ctl, as a sibling of the yoyotomcat directory:
-- Load an Apache common log format
-- essentially key-value pairs
-- example line of source data
-- time='[01/Jan/2001:00:00:03 +0000]' srcip='192.168.0.1' localip='10.0.0.1' referer='-' url='/limsM/SamplesGet-SampleMaster?samplefilters=%5B%22parent_sample%20%3D%208504571%22%2C%22status%20%3D%20'D'%22%5D&depthfilters=%5B%22scale_id%20%3D%2011311%22%5D' servername='yo.yo.dyne.org' rspms='218' rspbytes='2198'
--
LOAD DATA
INFILE 'yoyotomcat/combined.2001*' "STR '\n'"
TRUNCATE INTO TABLE transfer.loganal
TRAILING NULLCOLS
(
time enclosed by "time='[" and "+0000]' " "to_date(:time, 'dd/Mon/yyyy:hh24:mi:ss')"
, srcip enclosed by "srcip='" and "' "
, localip enclosed by "localip='" and "' "
, referer char(10000) enclosed by "referer='" and "' "
, uri char(10000) enclosed by "url='" and "' "
, servername enclosed by "servername='" and "' "
, rspms enclosed by "rspms='" and "' " "decode(:rspms, '-', null, to_number(:rspms))"
, rspbytes enclosed by "rspbytes='" and "'" "decode(:rspbytes, '-', null, to_number(:rspbytes))"
, logsource "'munchausen'"
)
Load the hypothetical example content by running this from a command prompt
sqlldr userid=buckaroo#banzai direct=true control=combined.ctl
Your mileage may vary. I'm on Oracle 12. There may be features used here that are relatively new. Not sure.
Illumination
This variant of the "enclosed by" functionality works well for key-value pairs. It's not a regular expression, but it performs well.
The ability to treat the column name as a bind variable and apply any available SQL function to it adds a lot of flexibility.
Some logs have really long GET requests, hence the unreasonably long string lengths. The default of 255 wasn't enough.
rspms and rspbytes sometimes contained '-'. I used decode() in SQL to work around the frequent "not a number" errors.
The control file as written presumes that all fields are present, which is not a good assumption over time. I am still looking for a configuration that yields a null column when an enclosure is not matched.
Cheers.
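For anyone who wants to reason about the enclosure matching outside of sqlldr, here is a rough Python sketch of the same scan: each field is whatever sits between its start and end strings, searched left to right. It is an emulation for illustration, not how SQL*Loader is implemented. The names `extract_enclosed`, `to_number` and `parse_line` are made up, the sample line uses a shortened URL, and returning None for a missing key is an assumption about the desired behaviour, not something the control file above does.

```python
# (column, start enclosure, end enclosure), in the order used by the control file
FIELDS = [
    ("time", "time='[", "+0000]' "),
    ("srcip", "srcip='", "' "),
    ("localip", "localip='", "' "),
    ("referer", "referer='", "' "),
    ("uri", "url='", "' "),
    ("servername", "servername='", "' "),
    ("rspms", "rspms='", "' "),
    ("rspbytes", "rspbytes='", "'"),
]


def extract_enclosed(line, start, end, pos=0):
    """Return (value, next_pos) for the text between start and end, scanning from pos.

    If either enclosure is missing, return (None, pos) so later fields can still match.
    """
    begin = line.find(start, pos)
    if begin == -1:
        return None, pos
    begin += len(start)
    finish = line.find(end, begin)
    if finish == -1:
        return None, pos
    return line[begin:finish], finish + len(end)


def to_number(text):
    """Same intent as decode(:x, '-', null, to_number(:x))."""
    return None if text in (None, "-") else int(text)


def parse_line(line):
    row, pos = {}, 0
    for name, start, end in FIELDS:
        row[name], pos = extract_enclosed(line, start, end, pos)
    row["time"] = row["time"].strip() if row["time"] else None
    row["rspms"] = to_number(row["rspms"])
    row["rspbytes"] = to_number(row["rspbytes"])
    return row


sample = (
    "time='[01/Jan/2001:00:00:03 +0000]' srcip='192.168.0.1' localip='10.0.0.1' "
    "referer='-' url='/x?status%20%3D%20'D'%22' servername='yo.yo.dyne.org' "
    "rspms='218' rspbytes='-'"
)
print(parse_line(sample))
```

Note that the quotes around D inside the URL do not end the field, because the end enclosure is a quote followed by a space.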
I have a question similar to this one:
Hive table source delimited by multiple spaces
My data looks like this:
AL, 01, 2016010700, , BEST, 0, 266N, 753W
AL, 01, 2016010706, , BEST, 0, 276N, 747W
AL, 01, 2016010712, , BEST, 0, 287N, 738W
AL, 01, 2016010712, , BEST, 0, 287N, 738W
That means my column delimiter is "a comma plus a variable number of spaces".
I tried to simply modify field.delim by adding this comma to the regex, but it doesn't work.
The result is that all data gets put into the first column (basin) and all other columns are NULL.
CREATE EXTERNAL TABLE IF NOT EXISTS default.myTable1
(
basin string
,cy string
,yyyymmddhh int
,technum_min string
,tech string
,tau string
,lat_n_s string
,lon_e_w string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"=",\\s+")
LOCATION '/data';
I am running HDP 2.5 (Hive 1.2.1).
Thanks for any help and suggestions.
There are two approaches to solve your problem.
Create a table 'rawTbl' using the option below:
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
and use trim() to remove the spaces:
Insert into baseTbl select trim(basin), trim(cy), ... from rawTbl
Or you can use a regex.
I have updated the answer with a regex that splits each line of the input file into the requested fields. The regex contains 7 capture groups, one for each populated field on a line; the empty fourth field is matched but not captured.
CREATE EXTERNAL TABLE tableex(basin string
,cy string
,yyyymmddhh int
,technum_min string
,tech string
,tau string
,lat_n_s string
,lon_e_w string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '^([A-Za-z]{2}),\\s+(\\d{2}),\\s(\\d{10}),\\s+,\\s([A-Z]{4}),\\s+(\\d{1}),\\s+(\\d{3}[A-Z]{1}),\\s+(\\d+[A-Z]{1})'
)
LOCATION '/data';
How about this:
(\S+),\s+(\S+),\s(\S+),\s+,\s(\S+),\s+(\S+),\s+(\S+),\s+(\S*)
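Before putting a pattern into the table definition, it can help to try it against a few sample lines. The sketch below does that in Python with a variant that has eight capture groups, one per column of the table, including the empty technum_min field. The pattern and the names `HIVE_LINE`, `COLUMNS` and `parse` are my own illustration, not taken from the answers above, and Python's `re` is only standing in for the Java regex engine Hive uses (the constructs used here behave the same in both, as far as I know).

```python
import re

# Eight groups, one per table column; (\S*) lets the fourth field be empty.
HIVE_LINE = re.compile(
    r"^(\S+),\s+(\S+),\s+(\S+),\s+(\S*),\s+(\S+),\s+(\S+),\s+(\S+),\s+(\S+)"
)

COLUMNS = [
    "basin", "cy", "yyyymmddhh", "technum_min",
    "tech", "tau", "lat_n_s", "lon_e_w",
]


def parse(line):
    """Return a column -> value dict, or None if the line does not match."""
    match = HIVE_LINE.match(line)
    return dict(zip(COLUMNS, match.groups())) if match else None


print(parse("AL, 01, 2016010700, , BEST, 0, 266N, 753W"))
# The delimiter may contain any number of spaces after the comma.
print(parse("AL,  01, 2016010712,    , BEST,   0,  287N,  738W"))
```

If such a pattern is used in SERDEPROPERTIES, remember that the backslashes have to be doubled there.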
I use Oracle 11g.
My data file looks like below:
1|"\a\ab\"|"do not "clean" needles"|"#"
2|"\b\bg\"|"wall "69" side to end"|"#"
My control file is:
load data
infile 'short.txt'
CONTINUEIF LAST <> '"'
into table "PORTAL"."US_FULL"
fields terminated by "|" OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
u_hlevel,
u_fullname NULLIF u_fullname=BLANKS,
u_name char(2000) NULLIF u_name=BLANKS ,
u_no NULLIF u_no=BLANKS
)
While loading data through sqlldr, a .bad file is created and the .log file contains an error message stating "No terminator found after terminated and enclosed field".
The leading and trailing double quotes are not part of my data; however, I need to keep the double quotes within the data, like those surrounding clean and 69 in the example above. After loading, my data should look like:
1, \a\ab\, do not "clean" needles, #
2, \b\bg\ , wall "69" side to end , #
How can I accomplish this?
Asking your provider to correct the data file may not be an option, but I ultimately found a solution that requires you to update your control file slightly to specify your "enclosed by" character for each field instead of for all fields.
In my case, the [first_name] field would not load when it came in with double quotes wrapping a nickname (e.g. Jonathon "Jon"). In the data file the name was shown as "Jonathon "Jon"". The "enclosed by" clause was throwing an error because there were double quotes around the whole value and also around part of the value ("Jon"). So instead of specifying that the value is enclosed by double quotes, I omitted that for this field and removed the quotes from the string manually.
Load Data
APPEND
INTO TABLE MyDataTable
fields terminated by "," -- Note: "enclosed by" is omitted at this level
TRAILING NULLCOLS
(
column1 enclosed by '"', -- "enclosed by" is specified per column instead
column2 enclosed by '"',
FIRST_NAME "replace(substr(:FIRST_NAME,2, length(:FIRST_NAME)-2), chr(34) || chr(34), chr(34))", -- No "enclosed by" here. substr strips the outer double quotes, replace collapses doubled double quotes into one. chr(34) is the character code for a double quote
column4 enclosed by '"',
column5 enclosed by '"'
)
I'm afraid that, since the fields are surrounded by double quotes, the double quotes you want to preserve need to be escaped by adding another double quote in front of each, like this:
1|"\a\ab\"|"do not ""clean"" needles"|"#"
Alternately if you can get the data without the fields being surrounded by double-quotes, this would work too:
1|\a\ab\|do not "clean" needles|#
If you can't get the data provider to format the data as needed (i.e. search for double-quotes and replace with 2 double-quotes before extracting to the file), you will have to pre-process the file to set up double quotes one of these ways so the data will load as you expect.
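As a rough idea of what such a pre-processing step could look like, here is a small Python sketch that doubles the double quotes inside each enclosed field and leaves the outer pair alone. It assumes the separator never occurs inside a field and that the file has not been escaped already (running it twice would double the quotes again). The function name `escape_inner_quotes` is invented for this example.

```python
def escape_inner_quotes(line, sep="|", quote='"'):
    """Double every quote character found inside an enclosed field of one record."""
    fields = []
    for field in line.rstrip("\r\n").split(sep):
        # Only touch fields that are wrapped in an opening and a closing quote.
        if len(field) >= 2 and field.startswith(quote) and field.endswith(quote):
            inner = field[1:-1].replace(quote, quote * 2)
            field = quote + inner + quote
        fields.append(field)
    return sep.join(fields)


raw = r'1|"\a\ab\"|"do not "clean" needles"|"#"'
print(escape_inner_quotes(raw))
```

Applied to every line of the data file, this yields the first of the two formats shown above.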
I need to load data from a CSV file into an Oracle table.
The problem I am facing is that the DESCRIPTION field itself contains multiple lines.
The solution I am using for this is an enclosure string of " (double quotes).
I am using ksh to call sqlldr.
I am getting the following two problems:
The row whose description spans multiple lines is not loaded: the record ends at the first line break, so the values of the remaining fields/columns are not visible to the loader. ERROR: second enclosure string not present (obviously, the closing " is not found).
The second line (and the lines after it) of the DESCRIPTION field is treated as a new row in its own right and gets loaded. It is garbage data.
CONTROL File:
OPTIONS(SKIP=1)
LOAD DATA
BADFILE '/home/fimsctl/datafiles/inbound/core_po/logs/core_po_data.bad'
DISCARDFILE '/home/fimsctl/datafiles/inbound/core_po/logs/core_po_data.dsc'
APPEND INTO TABLE FIMS_OWNER.FINANCE_PO_INBOUND_T
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
PO_NUM,
CREATED_DATE "to_Date(:CREATED_DATE,'mm/dd/yyyy hh24:mi:ss')",
PO_TYPE,
PO_STATUS,
NOTREQ1 FILLER,
NOTREQ2 FILLER,
PO_VALUE,
LINE_ITEM_NUMBER,
QUANTITY,
LINE_ITEM_DESCRIPTION,
RATE_VALUE,
CURRENCY_CODE,
UOM_ID,
PO_REQUESTER_WWID,
QUANTITY_ORDERED,
QUANTITY_RECEIVED,
QUANTITY_BILLED terminated by whitespace
)
CSV File Data:
COL1,8/4/2014 5:52,COL3,COL4,COL5,,,COL8,COL9,"Description Data",COL11,COL12,COL13,COL14,COL15,COL16,COL17
COL1,8/4/2014 8:07,COL3,COL4,COL5,,,COL8,COL9,,"GE MAKE 1X250 WATT HPSV NON INTEGRAL IP-65 **[NEWLINE HERE]**
DIE-CAST ALUMINIUM FIXTURE COMPLETE SET **[NEWLINE HERE]**
WITH SEPRATE CONTROL GEAR BOX WITH CHOKE, **[NEWLINE HERE]**
IGNITOR, CAPACITOR & LAMP-T",COL11,COL12,COL13,COL14,COL15,COL16,COL17
COL1,8/4/2014 8:13,COL3,COL4,COL5,,,COL8,COL9,"Description Data",COL11,COL12,COL13,COL14,COL15,COL16,COL17
I am loading data from a .csv file into an Oracle table through SQL*Loader. One of the fields has a newline (CRLF) in its data, so I am getting the error below:
second enclosure string not present
This is my control file
load data
characterset UTF8
infile 'C:\Users\lab.csv'
truncate
into table test_labinal
fields terminated by ";" optionally enclosed by '"'
TRAILING NULLCOLS
(
STATEMENT_STATUS ,
MANDATORY_TASK ,
COMMENTS CHAR(9999) "SubStr(:Comments, 0, 1000)"
)
The field COMMENTS has a newline character in one of its records. Can anyone suggest a solution for this?
Thanks
If your last field is always present (though trailing nullcols suggests it isn't) and you have some control over the formatting, you can use the CONTINUEIF directive to treat the second line as part of the same logical record.
If the comments field is always present and enclosed in double-quotes then you can do:
...
truncate
continueif last != x'22'
into table ...
Which would handle data records like:
S;Y;"Test 1"
F;N;"Test 2"
P;Y;"Test with
new line"
P;N;""
Or if you always have a delimiter after the comments field, whether it is populated or not:
...
truncate
continueif last != ';'
into table ...
Which would handle:
S;Y;Test 1;
F;N;"Test 2";
P;Y;Test with
new line;
P;N;;
Both ways will load the data as:
S M COMMENTS
- - ------------------------------
S Y Test 1
F N Test 2
P Y Test withnew line
P N
But this loses the new line from the data. To keep that you need the terminating field delimiter to be present, and instead of CONTINUEIF you can change the record separator using the stream record format:
...
infile 'C:\Users\lab.csv' "str ';\n'"
truncate
into table ...
The "str ';\n'" defines the terminator as the combination of the field terminator and a new line character. Your split comment only has that combination on the final line. With the same data file as the previous version, this gives:
S M COMMENTS
- - ------------------------------
S Y Test 1
F N Test 2
P Y Test with
new line
P N
4 rows selected.
Since you're on Windows you might have to include \r in the format as well, e.g. "str ';\r\n'", but I'm not able to check that.
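To see the difference between the two record-building strategies without running a load, here is a small Python emulation. It is only a sketch of the behaviour described above, not of how SQL*Loader works internally, and the function names `join_continueif_last` and `split_stream` are invented for it.

```python
def join_continueif_last(text, last_char):
    """Emulate CONTINUEIF LAST != last_char: physical lines are glued together
    (without the line break) until one ends with last_char."""
    records, pending = [], ""
    for physical in text.splitlines():
        pending += physical
        if physical.rstrip().endswith(last_char):
            records.append(pending)
            pending = ""
    if pending:
        records.append(pending)
    return records


def split_stream(text, terminator):
    """Emulate the stream record format "str '<terminator>'": records end only
    at the terminator, so a bare line break stays inside the record."""
    records = text.split(terminator)
    if records and records[-1] == "":
        records.pop()  # nothing follows the final terminator
    return records


data = 'S;Y;Test 1;\nF;N;"Test 2";\nP;Y;Test with\nnew line;\nP;N;;\n'
print(join_continueif_last(data, ";"))
print(split_stream(data, ";\n"))
```

The first call loses the line break inside the comment, the second keeps it, which matches the two result sets shown above.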
load data
characterset UTF8
infile 'C:\Users\lab.csv'
truncate
into table test_labinal
fields terminated by ";" optionally enclosed by '"'
TRAILING NULLCOLS
(
STATEMENT_STATUS ,
MANDATORY_TASK ,
COMMENTS CHAR(9999) "SubStr(REPLACE(REPLACE(:Comments,CHR(13)),CHR(10)), 0, 1000)"
)
Note: CHR(13) is the ASCII character for "carriage return" and CHR(10) is the ASCII character for "new line". Calling the Oracle SQL REPLACE function without a replacement value removes any "carriage return" and/or "new line" characters embedded in your data. That is probably what you are running into, because the comment field is the last field in your CSV file.
You can use replace(replace(column_name, chr(10)), chr(13)) to remove newline characters during loading, or regexp_replace(column_name, '\s+') to remove all whitespace (which also removes the spaces between words).
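The two expressions do not behave the same, which is easy to check on a sample value. The Python lines below only mimic the effect of the Oracle functions (they are not Oracle code), and the sample comment is made up.

```python
import re

comment = "Test with\r\nnew line"

# Same effect as replace(replace(x, chr(10)), chr(13)): only CR and LF are removed.
without_newlines = comment.replace("\n", "").replace("\r", "")

# Same effect as regexp_replace(x, '\s+'): every run of whitespace is removed.
without_whitespace = re.sub(r"\s+", "", comment)

print(repr(without_newlines))
print(repr(without_whitespace))
```

If the words should stay separated, replacing the line break with a space instead of nothing is probably the safer choice.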
The best way I found to load .csv files whose fields contain newlines and commas is to run the macro below over the .csv file first and then load it with SQL*Loader.
Sub remove()
Dim row As Long
Dim oneCell As Object
Dim oxcel As Excel.Application
Dim wbk As Excel.Workbook
Set oxcel = New Excel.Application
Set wbk = oxcel.Workbooks.Open("filename.csv", 0, True)
row = 0
With oxcel
.ActiveSheet.Select
Do
row = row + 1
'Assume the first column is the PK, so the first empty PK cell marks the end of the data
Loop Until IsEmpty(.Cells(row, 1)) Or IsNull(.Cells(row, 1))
'Column 24 is the column being cleaned
.Range(.Cells(1, 24), .Cells(row - 1, 24)).Select
For Each oneCell In .Selection
'Turn line feeds into carriage returns, then replace carriage returns and commas with "-"
oneCell.Value = .WorksheetFunction.Substitute( _
.WorksheetFunction.Substitute( _
.WorksheetFunction.Substitute(CStr(oneCell.Value), vbLf, vbCr), vbCr, "-"), ",", "-")
Next oneCell
End With
End Sub
It runs perfectly for me.