SQL*Loader: Dealing with delimiter characters in data - oracle

I am loading some data to Oracle via SQLLDR. The source file is "pipe delimited".
FIELDS TERMINATED BY '|'
But some records contain a pipe character in the data itself, not as a separator. This breaks the load, because SQL*Loader treats those in-data pipes as field terminators.
Can you point me in the right direction to solve this issue?
The data file is about 9 GB, so editing it manually is not practical.
For example,
Loaded row:
ABC|1234567|STR 9 R 25|98734959,32|28.12.2011
Rejected Row:
DE4|2346543|WE| 454|956584,84|28.11.2011
Error:
Rejected - Error on table HSX, column DATE_N.
ORA-01847: day of month must be between 1 and last day of month
DATE_N column is the last one.

You could use no separator at all, read the whole line into one filler field, and do something like:
field FILLER,
col1 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\1')",
col2 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\2')",
col3 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\3')",
col4 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\4')",
col5 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\5')",
col6 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\6')"
This regexp has six capture groups (the parenthesized parts) separated by vertical bars (escaped, because an unescaped | means OR in a regexp). No group except the third may contain a vertical bar ([^|]*); the third group may contain anything (.*); and the regexp must span the whole line, from beginning to end (^ and $).
This way we are sure that the third group will eat all superfluous separators. It only works because you have just one field that may contain separators. Note that the pattern as written expects six fields; the sample rows in the question have five, so for data shaped like the sample, drop one \\|([^|]*) group and the col6 line. As a sanity check you can, for example, require that the fourth group start with a digit (put \d at the beginning of the fourth parenthesized block).
I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.
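To sanity-check the greedy-middle-group idea outside Oracle, here is a minimal Python sketch. It assumes the five-field layout of the two sample rows (so five capture groups), and the names `PATTERN` and `split_row` are purely illustrative. Python's `re` syntax is close to Oracle's but not identical, so this checks the logic of the pattern, not the control file itself.

```python
import re

# Assumption: five fields per row, and only the third may contain '|'.
PATTERN = re.compile(r'^([^|]*)\|([^|]*)\|(.*)\|([^|]*)\|([^|]*)$')

def split_row(line):
    """Return the five fields of a row, or None if the row does not match."""
    m = PATTERN.match(line)
    return list(m.groups()) if m else None

# The row that loaded and the row that was rejected in the question.
print(split_row('ABC|1234567|STR 9 R 25|98734959,32|28.12.2011'))
print(split_row('DE4|2346543|WE| 454|956584,84|28.11.2011'))
```

The greedy `(.*)` first grabs as much as it can and then backtracks just far enough for the last two pipe-free groups to match, which is why the stray pipe ends up inside the third field.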

It looks to me as though SQL*Loader cannot really handle your file, because the third field can contain the delimiter, is not surrounded by quotes, and is of variable length. If the data you provided is an accurate sample, though, I can offer a workaround. First, create a table with a single VARCHAR2 column as long as the longest line in your file. Then load the entire file into this table. From there you can extract each column with a query such as:
with CTE as
 (select 'ABC|1234567|STR 9 R 25|98734959,32|28.12.2011' as CTETXT
    from dual
  union all
  select 'DE4|2346543|WE| 454|956584,84|28.11.2011' from dual)
select substr(CTETXT, 1, instr(CTETXT, '|') - 1) as COL1
      ,substr(CTETXT
             ,instr(CTETXT, '|', 1, 1) + 1
             ,instr(CTETXT, '|', 1, 2) - instr(CTETXT, '|', 1, 1) - 1)
         as COL2
      ,substr(CTETXT
             ,instr(CTETXT, '|', 1, 2) + 1
             ,instr(CTETXT, '|', -1, 1) - instr(CTETXT, '|', 1, 2) - 1)
         as COL3
      ,substr(CTETXT, instr(CTETXT, '|', -1, 1) + 1) as COL4
  from CTE
It's not perfect (though it may be adaptable to SQL*Loader), and it would need a bit of work if you have more columns or if your third field is not what I think it is. But it's a start.

OK, I recommend that you parse the file and replace the delimiter.
On the command line in Unix/Linux you could do:
cat current_file | awk -F'|' '{printf( "%s,%s,", $1, $2); for(k=3;k<NF-2;k++) printf("%s|", $k); printf("%s,%s,%s", $(NF-2),$(NF-1),$NF);print "";}' > new_file
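This command does not change your current file.
It creates a new, comma-delimited file with five fields.
It splits each input line on "|" and takes the first chunk, the second chunk, everything from the third chunk up to the third-to-last (rejoined with "|"), the second-to-last chunk, and the last chunk.
You can then try to load new_file with sqlldr using "," as the delimiter. Be aware that the sample amounts (98734959,32) use a comma as the decimal separator, so on that data pick an output separator that never occurs in the file, such as a tab, in place of the commas in the printf calls.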
UPDATE:
The command can also be put in a script (named parse.awk) like this:
#!/usr/bin/awk -f
# parse.awk
BEGIN {FS="|"}
{
printf("%s,%s,", $1, $2);
for(k=3;k<NF-2;k++)
printf("%s|", $k);
printf("%s,%s,%s\n", $(NF-2),$(NF-1),$NF);
}
and you can run it this way:
cat current_file | awk -f parse.awk > new_file
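For anyone who would rather not use awk, the same re-joining step can be sketched in Python. This is only a sketch under stated assumptions: five fields per row with the third being the only one that can contain a pipe, a tab as the output separator (the sample amounts contain commas), and UTF-8 files. The function names `rewrite_lines` and `rewrite_file` are hypothetical.

```python
def rewrite_lines(lines, sep='|', out_sep='\t'):
    """Yield each row as exactly five fields joined by out_sep."""
    for line in lines:
        row = line.rstrip('\r\n')
        parts = row.split(sep)
        if len(parts) < 5:
            # Too few separators: pass the row through unchanged so that
            # the loader rejects it instead of loading shifted columns.
            yield row
            continue
        # Everything between the second field and the last two fields
        # belongs to the third field; glue it back together with sep.
        third = sep.join(parts[2:-2])
        yield out_sep.join(parts[:2] + [third] + parts[-2:])

def rewrite_file(src_path, dst_path):
    """Stream src_path to dst_path one line at a time (assumes UTF-8)."""
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for row in rewrite_lines(src):
            dst.write(row + '\n')
```

Because the file is processed line by line, memory use stays flat even for a 9 GB input. With a tab as the output separator, the control file would use something like FIELDS TERMINATED BY X'09' instead of ','.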

Related

REGEXP_LIKE Oracle equivalent to count characters in Snowflake

I am trying to come up with an equivalent of the below Oracle statement in Snowflake. This would check if the different parts of the string separated by '.' matches the number of characters in the REGEXP_LIKE expression. I have come up with a rudimentary version to perform the check in Snowflake but I am sure there's a better and cleaner way to do it. I am looking to come up with a one-liner regular expression check in Snowflake similar to Oracle. Appreciate your help!
-- Oracle
SELECT -- would return True
CASE
WHEN REGEXP_LIKE('AB.XYX.12.34.5670.89', '^\w{2}\.\w{3}\.\w{2}') THEN 'True'
ELSE NULL
END AS abc
FROM DUAL
-- Snowflake
SELECT -- would return True
REGEXP_LIKE(SPLIT_PART('AB.XYX.12.34.5670.89', '.', 1), '[A-Z0-9]{2}') AND
REGEXP_LIKE(SPLIT_PART('AB.XYX.12.34.5670.89', '.', 2), '[A-Z0-9]{3}') AND
REGEXP_LIKE(SPLIT_PART('AB.XYX.12.34.5670.89', '.', 3), '[A-Z0-9]{2}') AS abc
You need to add a .* at the end, because REGEXP_LIKE in Snowflake implicitly anchors the pattern with ^ and $:
The function implicitly anchors a pattern at both ends (i.e. '' automatically becomes '^$', and 'ABC' automatically becomes '^ABC$'). To match any string starting with ABC, the pattern would be 'ABC.*'.
select
column1 as str,
REGEXP_LIKE(str, '\\w{2}\\.\\w{3}\\.\\w{2}.*') as oracle_way
FROM VALUES
('AB.XYX.12.34.5670.89')
;
gives:
STR                  ORACLE_WAY
-------------------- ----------
AB.XYX.12.34.5670.89 TRUE
Or in the context of your question:
SELECT IFF(REGEXP_LIKE('AB.XYX.12.34.5670.89', '\\w{2}\\.\\w{3}\\.\\w{2}.*'), 'True', null) AS abc;
Your use of \w suggests you don't need the delimited parts to be strictly [A-Z0-9], since word characters also include lowercase letters and the underscore (but not the period). If all bets were off and the only requirement was to have a . at the 3rd, 7th and 10th positions, you could use LIKE this way:
select 'AB.XGH.12.34.5670.89' like '__.___.__.%' ;
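The difference between the two anchoring behaviours can be illustrated in Python, where `re.search` with an explicit `^` only needs a matching prefix and `re.fullmatch` must consume the whole string. This is an analogy, not Snowflake or Oracle code; the variable names are illustrative.

```python
import re

value = 'AB.XYX.12.34.5670.89'
prefix = r'\w{2}\.\w{3}\.\w{2}'

# Prefix match: comparable to the Oracle pattern that starts with ^.
prefix_only = re.search('^' + prefix, value) is not None

# Whole-string match: comparable to a pattern anchored at both ends.
anchored_without_tail = re.fullmatch(prefix, value) is not None
anchored_with_tail = re.fullmatch(prefix + '.*', value) is not None

print(prefix_only, anchored_without_tail, anchored_with_tail)
```

Only the variant with the trailing `.*` succeeds under whole-string matching, which is the reason the extra `.*` is needed.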

Escape Pipe in SQL Loader

I have a pipe delimited file which has to be loaded via SQL*Loader in Oracle.
My control file looks like this:
LOAD DATA
REPLACE
INTO TABLE TABLE1
FIELDS TERMINATED BY '|'
TRAILING NULLCOLS
(
ID "TRIM(:ID)",
TEXT "NVL(TRIM(:TEXT),' ')"
)
The TEXT column in the data file can contain text with "|"- i.e., delimiter too.
How can I accept pipe in the TEXT column?
You can't escape the delimiter; but if you want everything up to the first pipe to be the ID and everything after the first pipe to be TEXT, you could treat the record in the data file as a single field and split it using SQL functions, e.g.:
LOAD DATA
INFILE ...
REPLACE
INTO TABLE TABLE1
TRAILING NULLCOLS
(
ID CHAR(4000) "regexp_replace(:ID, '^(.*?)(\\|(.*))?$', '\\1')",
TEXT EXPRESSION "regexp_replace(:ID, '^(.*?)(\\|(.*))?$', '\\3')"
)
There is no FIELDS clause.
The ID is initially up to 4000 characters from the line (just a large value to hopefully capture any data you have). A regex replace is then applied to that; the pattern defines a first group as any characters (non-greedy), optionally followed by a second group comprising a pipe and a third inner group of zero or more characters after that pipe. The original value is replaced by group 1.
The TEXT is defined as an EXPRESSION, meaning it isn't obtained directly from the file; instead the same regex pattern is applied to the original ID value, but now that is replaced by the third group, which is everything after the first pipe (if there is one).
An equivalent in plain SQL as a demo would be:
with data (id) as (
select '123|test 1' from dual
union all
select '234|test 2|with pipe' from dual
union all
select '345|test 3|with|multiple|pipes|' from dual
union all
select null from dual
union all
select '678' from dual
union all
select '789|' from dual
)
select id as original,
regexp_replace(ID, '^(.*?)(\|(.*))?$', '\1') as id,
regexp_replace(ID, '^(.*?)(\|(.*))?$', '\3') as text
from data;
which gives:
ORIGINAL                        ID   TEXT
------------------------------- ---- ------------------------------
123|test 1                      123  test 1
234|test 2|with pipe            234  test 2|with pipe
345|test 3|with|multiple|pipes| 345  test 3|with|multiple|pipes|

678                             678
789|                            789
If you don't need to worry about records without that first pipe, or with that first pipe but followed by nothing, then the regex could be simpler:
(
ID CHAR(4000) "regexp_replace(:ID, '^(.*?)\\|(.*)$', '\\1')",
TEXT EXPRESSION "regexp_replace(:ID, '^(.*?)\\|(.*)$', '\\2')"
)
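The practical difference between the two patterns shows up on records without a pipe. The Python sketch below mimics the pair of regexp_replace calls with `re.sub`; like REGEXP_REPLACE, `re.sub` returns the input unchanged when the pattern does not match. The helper name `split_record` is hypothetical, and Python returns an empty string where Oracle would return NULL.

```python
import re

FULL = r'^(.*?)(\|(.*))?$'    # pipe and trailing text are optional
SIMPLE = r'^(.*?)\|(.*)$'     # a pipe is required

def split_record(pattern, text_group, record):
    """Return (ID, TEXT) the way the two regexp_replace calls would."""
    id_part = re.sub(pattern, lambda m: m.group(1) or '', record)
    text_part = re.sub(pattern, lambda m: m.group(text_group) or '', record)
    return id_part, text_part

for rec in ('234|test 2|with pipe', '678', '789|'):
    print(rec, split_record(FULL, 3, rec), split_record(SIMPLE, 2, rec))
```

With the simpler pattern, a record such as 678 does not match at all, so both columns receive the whole record; that is the case the longer pattern guards against.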

How to insert multiple rows from a flow

I have to insert multiple rows into a table from a file structured like this:
BANAC2C100017701007_X75 _CA 4X2 CT MLCR DR SX EP 160 E4
where 4X2, MLCR and 160 E4 have to be inserted into the same column for the same code, BANAC2C100017701007. As an example, the table should be structured like this:
After splitting the elements from the file, how can I put them into the table? Any suggestions?
It can be done with sqlldr. I have made some assumptions, but if the data is one row per line as you describe above, with the same number of elements a line, a properly constructed control file with multiple "into table" statements can write different parts of one row of data as multiple rows to the same table.
The control file:
LOAD DATA
infile "file.dat"
TRUNCATE
INTO TABLE data_table
(entirerow BOUNDFILLER char(4000)
,code expression "regexp_substr(:entirerow, '(.*?)(_)', 1, 1, NULL, 1)"
,desc expression "regexp_substr(:entirerow, '(.*?)( +)', 1, 3, NULL, 1)"
)
INTO TABLE data_table
(entirerow BOUNDFILLER position(1) char(4000)
,code expression "regexp_substr(:entirerow, '(.*?)(_)', 1, 1, NULL, 1)"
,desc expression "regexp_substr(:entirerow, '(.*?)( +)', 1, 5, NULL, 1)"
)
INTO TABLE data_table
(entirerow BOUNDFILLER position(1) char(4000)
,code expression "regexp_substr(:entirerow, '(.*?)(_)', 1, 1, NULL, 1)"
,desc expression "regexp_substr(:entirerow, '(.*?)( +)', 1, 9, NULL, 1) || ' ' ||
regexp_substr(:entirerow, '(.*?)( +|$)', 1, 10, NULL, 1)"
)
A couple of things to note:
Since there are no delimiters, and none are specified, the entire row will be read into the first field "entirerow". Since it is not a column in the table, and it is defined as BOUNDFILLER, it is "remembered" for use later.
The next field "code" is found in the control file. No data field exists to match it with, but sqlldr finds it matches a column in the table and sees it is an expression so it applies the expression with the intention of putting the result into the column. The expression uses REGEXP_SUBSTR against the remembered BOUNDFILLER to cut out the parts we need. For code, get the characters up to but not including the first underscore. For desc, get the 3rd set of characters that are followed by one or more spaces (but not the spaces).
For the second logical row, we need to re-position the logical pointer back to the beginning of the row read in so sqlldr can re-process. Otherwise the logical pointer is at the end of the data and nothing will be returned. This is done with the "position" parameter seen in the "entirerow" definition of the 2nd and 3rd "into table" statements. The last "into table" follows the previous paradigm of just getting the 9th and 10th fields and concatenating them together. I chose to do this rather than come up with another regex to do it as it keeps consistency with the other fields, plus if you want to change it in the future it will be easier to follow.
As you can see it works and is reusable:
SQL> select code, desc
from data_table;
CODE DESC
------------------------- -------------
BANAC2C100017701007 4X2
BANAC2C100017701007 MLCR
BANAC2C100017701007 160 E4
Possible caveat: each row is being scanned 3 times, and the regexp calls are expensive so depending on the amount of data you need to load this may not be a feasible solution for your situation.
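The occurrence-based extraction can be tried outside the database with a short Python sketch. The helper below is a rough stand-in for REGEXP_SUBSTR(source, pattern, 1, occurrence, NULL, group), not a faithful reimplementation, and it assumes the sample row is separated by runs of spaces as shown in the question.

```python
import re

def regexp_substr(source, pattern, occurrence, group):
    """Return the given group of the n-th match, or None if there is none."""
    matches = list(re.finditer(pattern, source))
    if occurrence > len(matches):
        return None
    return matches[occurrence - 1].group(group)

row = 'BANAC2C100017701007_X75 _CA 4X2 CT MLCR DR SX EP 160 E4'

# Everything before the first underscore.
code = regexp_substr(row, r'(.*?)(_)', 1, 1)

# 3rd and 5th space-terminated tokens, then the 9th and 10th glued together.
descs = [
    regexp_substr(row, r'(.*?)( +)', 3, 1),
    regexp_substr(row, r'(.*?)( +)', 5, 1),
    regexp_substr(row, r'(.*?)( +)', 9, 1) + ' '
    + regexp_substr(row, r'(.*?)( +|$)', 10, 1),
]

for desc in descs:
    print(code, desc)
```

The last token has no trailing space, which is why the tenth lookup needs the `( +|$)` alternative.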

Multiple lines in a column in oracle to a single row

My oracle table is as follows ( Address column having multiple lines):
ID Address
--------------------
1456897 No 61
11th Street
Tatabad Coimbatore - 641012
How to get the desired result as (with Address column as a single line) ?
ID Address
-------------------------
1456897 No 61 , 11th Street, Tatabad Coimbatore - 641012
I don't know whether your database stores its newlines as \x0a, \x0d or \x0d\x0a. I therefore propose the following solution, which handles all three kinds of newline. It will, however, replace multiple consecutive newlines with a single ', '. This might be what you want, or it might not.
select
id,
regexp_replace(
address,
'('||chr(10)||'|'||chr(13)||')+',
', ') as address,
....
from
....
Remove the newline characters in the column, with something like:
SELECT REPLACE(Address_column, CHR(10), ' ') -- the line break might also be CHR(13)||CHR(10) or even CHR(13)
FROM table_name
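The effect of collapsing any run of line-feed and carriage-return characters into a single separator can be checked quickly in Python. This is only an illustration of the character-class idea; the function name `flatten_address` is hypothetical.

```python
import re

def flatten_address(address):
    """Replace each run of LF/CR characters with a single ', '."""
    return re.sub(r'[\n\r]+', ', ', address)

# Mixed line endings on purpose: CRLF after the first line, LF after the second.
print(flatten_address('No 61\r\n11th Street\nTatabad Coimbatore - 641012'))
```

Because the quantifier applies to the whole character class, a Windows-style CRLF pair and several blank lines in a row both collapse to one separator.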

Oracle Regexp to replace \n,\r and \t with space

I am trying to select a column from a table that contains newline (NL) characters (and possibly others \n, \r, \t). I would like to use the REGEXP to select the data and replace (only these three) characters with a space, " ".
No need for regex. This can be done easily with the ASCII codes and boring old TRANSLATE()
select translate(your_column, chr(10)||chr(9)||chr(13), '   ')
from your_table;
This replaces newline, tab and carriage return with space.
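TRANSLATE() is much more efficient than its regex equivalent. However, if your heart is set on that approach, you should know that you can build the character class from the same CHR() calls. Oracle's documented regex syntax does not include \x hexadecimal escapes, so a pattern such as '[\x0A\x09\x0D]' is not a safe bet. So this statement is the regex version of the above.
select regexp_replace(your_column, '['||chr(10)||chr(9)||chr(13)||']', ' ')
from your_table;
The tweak is to concatenate the characters into the bracket expression rather than write them as escape sequences.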
select translate(your_column, chr(10)||chr(9)||chr(13), '   ') from your_table;
To remove the characters rather than replace them, it is essential to pass a non-null value for every parameter.
(An Oracle function generally returns NULL as soon as one parameter is NULL; there are a few exceptions, such as REPLACE.)
select translate(your_column, ' '||chr(10)||chr(9)||chr(13), ' ') from your_table;
This example maps ' ' to ' ' as a dummy translation so that parameter 3 is not NULL; the other three characters have no counterpart in parameter 3 and are therefore removed.
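The positional pairing that TRANSLATE performs can be modelled in a few lines of Python. This sketch assumes the documented behaviour: each character of the second argument maps to the character at the same position in the third, characters with no counterpart are removed, and a NULL (empty) argument yields NULL. The function name `oracle_translate` is hypothetical.

```python
def oracle_translate(expr, from_chars, to_chars):
    """Approximate Oracle TRANSLATE(expr, from_chars, to_chars)."""
    if not expr or not from_chars or not to_chars:
        return None  # Oracle treats '' as NULL, and NULL in gives NULL out
    table = {}
    for i, ch in enumerate(from_chars):
        if ord(ch) in table:
            continue  # the first occurrence in from_chars wins
        # No counterpart in to_chars means the character is deleted.
        table[ord(ch)] = to_chars[i] if i < len(to_chars) else None
    return expr.translate(table)

sample = 'a\nb\tc\rd'
print(repr(oracle_translate(sample, '\n\t\r', '   ')))   # three spaces
print(repr(oracle_translate(sample, '\n\t\r', ' ')))     # one space
print(repr(oracle_translate(sample, ' \n\t\r', ' ')))    # dummy space mapping
```

With three spaces every control character becomes a space; with a single space only the first one is replaced and the rest are dropped; with the dummy space mapping all three are dropped.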

Resources