Replacing Text which does not match a pattern in Oracle - oracle

I have below text in a CLOB in a table
Table Name: tbl1
Columns
col1 - number (Primary Key)
col2 - clob (as below)
Row#1
-----
Col1 = 1
Col2 =
1331882981,ab123456,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqr123223,Some more text...
Row#2
-----
Col1 = 2
Col2 =
1331882981,abc333,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqrs23,Some more text...
Now I need to know how to get the output below:
Col1 Value
---- ---------------------
1 1331882981,ab123456
1 1331890329,pqr123223
2 1331882981,abc333
2 1331890329,pqrs23
([0-9]{10},[a-z 0-9]+.), ==> This is the regular expression that matches "1331890329,pqrs23". I need to know how to remove the text that does not match this regex and then split the matches into multiple rows.
EDIT#1
I am on Oracle 10.2.0.5.0 and hence cannot use REGEXP_COUNT function :-( Also, the col2 is a CLOB which is massive
EDIT#2
I've tried the query below and it works fine for some records (i.e. if I add a "where" clause). But when I remove the "where", it never returns any result. I've tried putting this into a view, inserting into a table and leaving it running overnight, but it still had not completed :(
with t as (select col1, col2 from temp_table)
select col1,
cast(substr(regexp_substr(col2, '[^~]+', 1, level), 1, 50) as
varchar2(50)) data
from t
connect by level <= length(col2) - length(replace(col2, '~')) + 1
EDIT#3
# of Chars in Clob Total
----------- -----
0 - 1k 3196
1k - 5k 2865
5k - 25k 661
25k - 100k 36
> 100k 2
----------- -----
Grand Total 6760
I have ~7k rows of clobs which have the distribution as shown above...

Well, you could try something like:
with v as
(
select 1 col1, '1331882981,ab123456,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqr123223,Some more text...' col2 from dual
union all
select 2 col1, '133188298777,abc333,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqrs23,Some more text...' col2 from dual
)
select distinct col1, regexp_substr(col2, '([0-9]{10},[a-z 0-9]+)', 1, level) split
from v
connect by level <= REGEXP_COUNT(col2, '([0-9]{10},[a-z0-9]+)')
order by col1
;
This gives:
1 1331882981,ab123456
1 1331890329,pqr123223
2 1331890329,pqrs23
2 3188298777,abc333
EDIT : on 10g, REGEXP_COUNT does not exist, but there are workarounds. Here I replace each match with a marker that I hope never appears in the text (XYZXYZ here, but you can choose something much more obscure to be safe), subtract the length of the same text with the matches replaced by the empty string, and divide the difference by the marker length (here, 6):
with v as
(
select 1 col1, '1331882981,ab123456,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqr123223,Some more text...' col2 from dual
union all
select 2 col1, '133188298777,abc333,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqrs23,Some more text...' col2 from dual
)
select distinct col1, regexp_substr(col2, '([0-9]{10},[a-z 0-9]+)', 1, level) split
from v
connect by level <= (length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', 'XYZXYZ')) - length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', ''))) / 6
order by col1
;
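To see the counting arithmetic on its own, here is a minimal sketch against a made-up one-line literal (the sample string and the match_count alias are mine, not from the question). The NVL guard is also my addition: it covers the case where removing every match leaves an empty string, whose length Oracle reports as NULL.

```sql
-- Count regex matches without REGEXP_COUNT (works on 10g).
-- With marker 'XYZXYZ':  'XYZXYZ,text~XYZXYZ,more'  -> length 23
-- With matches removed:  ',text~,more'              -> length 11
-- (23 - 11) / 6 = 2 matches
select (length(regexp_replace(s, '[0-9]{10},[a-z0-9]+', 'XYZXYZ'))
        - nvl(length(regexp_replace(s, '[0-9]{10},[a-z0-9]+', '')), 0)) / 6 as match_count
from (select '1331882981,ab123456,text~1331890329,pqr123223,more' s from dual);
```

The marker only has to be absent from the data for the result to be exact; its length is what the difference is divided by.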
EDIT 2 : CLOBs (and LOBs in general) and regexp don't seem to fit well together:
ORA-00932: inconsistent datatypes: expected - got CLOB
Converting the CLOB to a string, as in regexp_substr(to_char(col2), ...), seems to fix the issue.
EDIT 3 : CLOBs don't like DISTINCT either, so converting the split result to a string in an inline view and then applying DISTINCT in the outer query succeeds!
select distinct col1, split from
(
select col1, to_char(regexp_substr(col2, '([0-9]{10},[a-z 0-9]+)', 1, level)) split
from temp_epn
connect by level <= (length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', 'XYZXYZ')) - length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', ''))) / 6
order by col1
);

The above solutions didn't work for me, so below is what I did instead.
update temp_table set col2=regexp_replace(col2,'([0-9]{10},[a-z0-9]+)','(\1)') ;
update temp_table set col2=regexp_replace(col2,'\),[\s\S]*~\(','(\1)$');
update temp_table set col2=regexp_replace(col2,'\).*?\(','$');
update temp_table set col2=replace(regexp_replace(col2,'\).*',''),'(','');
After these 4 update commands, the col2 will have something like
1 1331882981,ab123456$1331890329,pqr123223
2 1331882981,abc333$1331890329,pqrs23
Then I wrote a function to split this. I went for a function because I need to split by "$" and because col2 can still hold more than 10k characters.
create or replace function parse( p_clob in clob ) return sys.odciVarchar2List
pipelined
as
  l_offset number := 1;
  -- Map CR to a space, drop LF and tab, and append a trailing "$" so the last token is terminated
  l_clob clob := translate( p_clob, chr(13)|| chr(10) || chr(9), ' ' ) || '$';
  l_hit number;
begin
  loop
    -- Find the next occurrence of "$" starting from l_offset
    l_hit := instr( l_clob, '$', l_offset );
    exit when nvl(l_hit,0) = 0;
    -- Extract the string between l_offset and l_hit
    pipe row ( substr(l_clob, l_offset , (l_hit - l_offset)) );
    -- Move the offset past the delimiter
    l_offset := l_hit+1;
  end loop;
  -- A pipelined function ends with a bare RETURN
  return;
end;
I then called
select col1,
REGEXP_SUBSTR(column_value, '[^,]+', 1, 1) col3,
REGEXP_SUBSTR(column_value, '[^,]+', 1, 2) col4
from temp_table, table(parse(temp_table.col2));

Related

Remove comma separated string from another comma separated string in oracle

Column1 =A,B,C,D,E,F
Column2 =C,D,A,F,C,B (It can have duplicates)
I need to remove column2 values from column1 and get the missing value.
Desired output
(Column1)-(Column2) = E
Split columns' contents into rows, use MINUS set operator. Sample data in lines #1 - 3; query begins at line #4.
SQL> with test (col1, col2) as
2 (select 'A,B,C,D,E,F', 'C,D,A,F,C,B' from dual
3 )
4 select regexp_substr(col1, '[^,]+', 1, level) val
5 from test
6 connect by level <= regexp_count(col1, ',') + 1
7 minus
8 select regexp_substr(col2, '[^,]+', 1, level) val
9 from test
10 connect by level <= regexp_count(col2, ',') + 1
11 /
VAL
--------------------------------------------
E
SQL>
If you're comparing columns in a multi-row table, the above approach won't work well: it will return duplicates and be slow. In that case, rewrite it as
with test (id, col1, col2) as
  (select 1, 'A,B,C,D,E,F', 'C,D,A,F,C,B' from dual union all
   select 2, 'A,B,C,D,E,F', 'A,B,B,B' from dual
  )
select id, listagg(val, ',') within group (order by val) missing_letters
from
  (
   select id,
          regexp_substr(col1, '[^,]+', 1, column_value) val
   from test cross join
        table(cast(multiset(select level from dual
                            connect by level <= regexp_count(col1, ',') + 1
                           ) as sys.odcinumberlist))
   minus
   select id,
          regexp_substr(col2, '[^,]+', 1, column_value) val
   from test cross join
        table(cast(multiset(select level from dual
                            connect by level <= regexp_count(col2, ',') + 1
                           ) as sys.odcinumberlist))
  )
group by id;
ID MISSING_LETTERS
---------- --------------------
1 E
2 C,D,E,F
You may use the TRANSLATE function with additional cleanup logic to remove the leftover commas. This works only for single-character values (one character between commas), but it doesn't require splitting the string into tokens and uses only simple string functions.
with a(col1, col2) as (
select 'A,B,C,D,E,F', 'C,D,A,F,C,B' from dual
)
select
/*Then remove leading and trailing commas*/
trim(',' from
/*Then condense all intermediate commas and spaces*/
regexp_replace(
/*Do actual replacement*/
translate(col1, replace(col2, ','), ' '),
'[, ]+', ','
)
) as res
from a
| RES |
| :-- |
| E |
You do not need to split the string.
If your delimited values do not contain any characters with special meaning in regular expressions, you can double up the delimiters in col1, convert col2 to a regular expression, replace the matches with an empty string, and then remove the excess delimiters:
SELECT col1,
col2,
TRIM(
BOTH ',' FROM
REPLACE(
REGEXP_REPLACE(
',' || REPLACE(col1, ',', ',,') || ',',
',(' || REPLACE(col2, ',', '|') || '),'
),
',,',
','
)
) AS missing
FROM table_name;
Which, for the sample data:
CREATE TABLE table_name ( col1, col2 ) AS
SELECT 'A,B,C,D,E,F', 'C,D,A,F,C,B' FROM DUAL UNION ALL
SELECT 'A,AB,BA,B,', 'A,B' FROM DUAL;
Outputs:
| COL1 | COL2 | MISSING |
| :-- | :-- | :-- |
| A,B,C,D,E,F | C,D,A,F,C,B | E |
| A,AB,BA,B, | A,B | AB,BA |
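To make the doubling trick easier to follow, this small sketch prints the two intermediate strings that the REGEXP_REPLACE call works on for the first sample row. The aliases doubled_col1 and pattern are names I made up for illustration.

```sql
-- Intermediate strings for col1 = 'A,B,C,D,E,F' and col2 = 'C,D,A,F,C,B':
--   doubled_col1 = ',A,,B,,C,,D,,E,,F,'
--   pattern      = ',(C|D|A|F|C|B),'
select ',' || replace(col1, ',', ',,') || ','  as doubled_col1,
       ',(' || replace(col2, ',', '|') || '),' as pattern
from (select 'A,B,C,D,E,F' col1, 'C,D,A,F,C,B' col2 from dual);
```

Because every value in doubled_col1 carries its own leading and trailing comma, a match on one value never consumes the delimiter that the next value needs, which is why adjacent values can both be removed in a single pass.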
If you do have characters with special meaning then you can do a similar replacement using a recursive sub-query:
WITH replacements ( col1, col2 ) AS (
SELECT ',' || REPLACE( col1, ',', ',,') || ',',
col2 || ','
FROM table_name
UNION ALL
SELECT REPLACE(col1, ',' || SUBSTR(col2, 1, INSTR(col2, ','))),
SUBSTR(col2, INSTR(col2, ',') + 1)
FROM replacements
WHERE col2 IS NOT NULL
)
SELECT TRIM(BOTH ',' FROM REPLACE(col1, ',,', ',')) AS missing
FROM replacements
WHERE col2 IS NULL
Which outputs:
MISSING
AB,BA
E
Note: both of these queries only require a single table scan.
Using ora:tokenize you could do something like this (including a few test cases in the with clause; you should remove it, and use your actual table and column names in the main query):
with
inputs (col1, col2) as (
select 'A,B,C,D,E,F', 'C,D,A,F,C,B' from dual union all
select 'D,,F' , 'F,A' from dual union all
select 'A,B,E,F' , 'E' from dual union all
select 'ABC' , 'A,B,ABC' from dual
)
-- END OF TEST DATA; QUERY BEGINS **BELOW THIS LINE**
select i.col1, i.col2, l.diff
from inputs i cross join lateral
( select listagg(token, ',') within group (order by null) as diff
from xmltable('ora:tokenize(.,",")' passing i.col1 || ','
columns token varchar2(10) path '.')
where not ',' || col2 || ',' like '%,' || token || ',%' ) l
;
COL1 COL2 DIFF
----------- ----------- --------------------
A,B,C,D,E,F C,D,A,F,C,B E
D,,F F,A D
A,B,E,F E A,B,F
ABC A,B,ABC

How to select second split of column data from oracle database

I want to select data from an Oracle table whose column contains comma-separated values (e.g. key,value). I want to select the second part, i.e. the value.
The column data looks like this:
column_data
++++++++++++++
asper,worse
tincher,good
golder
null -- null values need to be excluded from the selection
www,ewe
From the above data, the desired output is:
column_data
+++++++++++++
worse
good
golder
ewe
Please help me with the query
According to data you provided, here are two options:
result1: regular expressions one (get the 2nd word if it exists; otherwise, get the 1st one)
result2: SUBSTR + INSTR combination
with test (col) as
  (select 'asper,worse' from dual union all
   select 'tincher,good' from dual union all
   select 'golder' from dual union all
   select null from dual union all
   select 'www,ewe' from dual
  )
select col,
       nvl(regexp_substr(col, '\w+', 1, 2), regexp_substr(col, '\w+', 1,1 )) result1,
       --
       nvl(substr(col, instr(col, ',') + 1), col) result2
from test
where col is not null;
COL RESULT1 RESULT2
------------ -------------------- --------------------
asper,worse worse worse
tincher,good good good
golder golder golder
www,ewe ewe ewe
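As a possible third variant, and only as a sketch under the assumption that each value holds at most one comma and never ends with one, anchoring the pattern at the end of the string picks the last token directly. The result3 alias is mine.

```sql
-- '[^,]+$' matches the run of non-comma characters at the end of the string,
-- so 'asper,worse' gives 'worse' and 'golder' gives 'golder'.
with test (col) as
  (select 'asper,worse' from dual union all
   select 'tincher,good' from dual union all
   select 'golder' from dual union all
   select null from dual union all
   select 'www,ewe' from dual
  )
select col,
       regexp_substr(col, '[^,]+$') result3
from test
where col is not null;
```

If a value could end with a comma, this expression would return NULL for that row, so the NVL-based options above are the safer choice for messier data.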

REGEXP to capture values delimited by a set of delimiters

My column value looks something like below: [Just an example i created]
{BASICINFOxxxFyyy100x} {CONTACTxxx12345yyy20202x}
It can contain 0 or more blocks of data... I have created the below query to split the blocks
with x as
(select
'{BASICINFOxxxFyyy100x}{CONTACTxxx12345yyy20202x}' a from dual)
select REGEXP_SUBSTR(a,'({.*?x})',1,rownum,null,1)
from x
connect by rownum <= REGEXP_COUNT(a,'x}')
However I would like to further split the output into 3 columns like below:
ColumnA | ColumnB | ColumnC
------------------------------
BASICINFO | F |100
CONTACT | 12345 |20202
The delimiters are always standard. I failed to create a pretty query which gives me the desired output.
Thanks in advance.
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE your_table ( str ) AS
SELECT '{BASICINFOxxxFyyy100x}{CONTACTxxx12345yyy20202x}' from dual
/
Query 1:
select REGEXP_SUBSTR(
t.str,
'\{([^}]*?)xxx([^}]*?)yyy([^}]*?)x\}',
1,
l.COLUMN_VALUE,
NULL,
1
) AS col1,
REGEXP_SUBSTR(
str,
'\{([^}]*?)xxx([^}]*?)yyy([^}]*?)x\}',
1,
l.COLUMN_VALUE,
NULL,
2
) AS col2,
REGEXP_SUBSTR(
str,
'\{([^}]*?)xxx([^}]*?)yyy([^}]*?)x\}',
1,
l.COLUMN_VALUE,
NULL,
3
) AS col3
FROM your_table t
CROSS JOIN
TABLE(
CAST(
MULTISET(
SELECT LEVEL
FROM DUAL
CONNECT BY LEVEL <= REGEXP_COUNT( t.str,'\{([^}]*?)xxx([^}]*?)yyy([^}]*?)x\}')
) AS SYS.ODCINUMBERLIST
)
) l
Results:
| COL1 | COL2 | COL3 |
|-----------|-------|-------|
| BASICINFO | F | 100 |
| CONTACT | 12345 | 20202 |
Note:
Your query:
select REGEXP_SUBSTR(a,'({.*?x})',1,rownum,null,1)
from x
connect by rownum <= REGEXP_COUNT(a,'x}')
Will not work when you have multiple rows of input. In the CONNECT BY clause, the hierarchical query has nothing to stop it from connecting Row1-Level2 to Row2-Level1 as well as to Row1-Level1, so it connects to both, and as the hierarchies get deeper it creates exponentially more duplicate copies of the output rows. There are hacks you can use to stop this, but if you are going to use hierarchical queries it is much more efficient to put the row generator into a correlated sub-query which can then be CROSS JOINed back to the original table (it is correlated, so it won't join to the wrong rows).
Better yet would be to fix your data structure so you are not storing multiple values in delimited strings.
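For reference, one commonly used form of the hack mentioned above is sketched here. It assumes the table has a unique key; the id column and the second sample row are hypothetical additions of mine so that there is more than one input row.

```sql
-- Keep each hierarchy inside its own source row:
--   PRIOR id = id                 only connects a row to itself
--   PRIOR SYS_GUID() IS NOT NULL  gives every step a fresh value, which avoids
--                                 the "CONNECT BY loop" error (ORA-01436)
with x (id, a) as (
  select 1, '{BASICINFOxxxFyyy100x}{CONTACTxxx12345yyy20202x}' from dual union all
  select 2, '{ADDRESSxxxNYyyy10001x}' from dual
)
select id,
       regexp_substr(a, '\{[^}]*x\}', 1, level) as block
from x
connect by level <= regexp_count(a, 'x}')
       and prior id = id
       and prior sys_guid() is not null
order by id, level;
```

This should return three rows (two blocks for id 1, one for id 2) with no duplicates, although the correlated row generator shown in Query 1 remains the more efficient option on larger tables.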
with x as
  (select '{BASICINFOxxxFyyy100x}{CONTACTxxx12345yyy20202x}' a from dual
  ),
y as (
  select REGEXP_SUBSTR(a,'({.*?x})',1,rownum,null,1) c1
  from x
  connect by rownum <= REGEXP_COUNT(a,'x}')
  )
select
  substr(c1,2,instr(c1,'xxx')-2) z1,
  substr(c1,instr(c1,'xxx')+3,instr(c1,'yyy')-instr(c1,'xxx')-3) z2,
  rtrim(substr(c1,instr(c1,'yyy')+3),'x}') z3
from y;
Z1 Z2 Z3
--------------- --------------- ---------------
BASICINFO F 100
CONTACT 12345 20202
Here is another solution that picks up where you left off. Your query already splits the row into 2 rows; the query below splits each of those into 3 columns:
WITH x
AS (SELECT '{BASICINFOxxxFyyy100x}{CONTACTxxx12345yyy20202x}' a
FROM DUAL),
-- Your query result here
tbl
AS ( SELECT REGEXP_SUBSTR (a,
'({.*?x})',
1,
ROWNUM,
NULL,
1)
Col
FROM x
CONNECT BY ROWNUM <= REGEXP_COUNT (a, 'x}'))
--- Actual Query
SELECT col,
REGEXP_SUBSTR (col,
'(.*?{)([^x]+)',
1,
1,
'',
2)
AS COL1,
REGEXP_SUBSTR (REGEXP_SUBSTR (col,
'(.*?)([^x]+)',
1,
2,
'',
2),
'[^y]+',
1,
1)
AS COL2,
REGEXP_SUBSTR (REGEXP_SUBSTR (col,
'[^y]+x',
1,
2),
'[^x]+',
1,
1)
AS COL3
FROM tbl;
Output:
SQL> /
COL COL1 COL2 COL3
------------------------------------------------ ------------------------------------------------ ------------------------------------------------ ------------------------------------------------
{BASICINFOxxxFyyy100x} BASICINFO F 100
{CONTACTxxx12345yyy20202x} CONTACT 12345 20202

Oracle instr position

I have a 15-character string and need to loop through it, pulling the position of each occurrence of the letter 'a'. I was going to use a cursor to loop through the string, but wasn't sure how to save the position of each occurrence.
Something like this to break the string into each character and then filter on your desired value?
-- data setup to create a single value to test
WITH dat as (select 'ABCDEACDFA' val from DUAL)
--
SELECT lvl, strchr
from (
-- query to break the string into individual characters, returning a row for each
SELECT level lvl, substr(dat.val,level,1) strchr
FROM dat
CONNECT BY level <= length(val)
) WHERE strchr = 'A';
returns:
LVL STRCHR
1 A
6 A
10 A
Here's a different method using one less select and a regex. I don't believe it will help your performance issue though. Please try it and let us know:
SQL> with tbl(str) as (
select 'Aabjggaklkjha' from dual
)
select level as position
from tbl
where upper(REGEXP_SUBSTR(str, '.', 1, level)) = 'A'
connect by level <= length(str);
POSITION
----------
1
2
7
13
SQL>
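Since the question title mentions INSTR, here is a sketch that asks INSTR for the n-th occurrence directly instead of cutting the string into single characters. It reuses the sample string from the answer above; the occurrence and position aliases are mine, and it assumes a single input row (with several rows the CONNECT BY would need to be restricted per row).

```sql
-- INSTR(string, substring, start, nth) returns 0 once there is no nth occurrence,
-- so the hierarchy stops after the last 'a'. The WHERE clause drops the
-- level-1 row when the string contains no 'a' at all.
with tbl (str) as (
  select 'Aabjggaklkjha' from dual
)
select level as occurrence,
       instr(lower(str), 'a', 1, level) as position
from tbl
where instr(lower(str), 'a', 1, level) > 0
connect by instr(lower(str), 'a', 1, level) > 0;
```

For this string the positions come back as 1, 2, 7 and 13, the same as the regex version, while generating one row per match rather than one row per character.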

oracle query to split the example#gmail.com into columns whenever a special char is encountered

Here is the code I have written, but its output still contains the special characters. My requirement is to ask the user for an email address dynamically and split that email wherever a special character occurs; I need the output without the special characters.
col1 col2 col3
------------------
example123 gmail com
select substr('exapmle123#gmail.com',instr('example123#gmail.com','#'),instr('example123#gmail.com','.')) as col1 ,
substr('exapmle123#gmail.com',1,instr('example123#gmail.com','#')) as col2,
substr('exapmle123#gmail.com',instr('example123#gmail.com','.'),length('example123#gmail.com')) as col3
from dual;
I suggest using REGEXP_SUBSTR to split strings.
Approach 1
In the example below, there is a row for every word, and the row and column numbers are part of the result set. I suggest this approach since you cannot know the number of words/columns beforehand.
Query1
with MyString as
( select 'exapmle123#gmail.com' Str, 1 rnum from dual
)
,pivot as (
Select Rownum Pnum
From dual
Connect By Rownum <= 100
)
SELECT REGEXP_SUBSTR (ms.Str,'([[:alnum:]])+',1,pv.pnum), ms.rnum, pv.pnum colnum
FROM MyString ms
,pivot pv
where REGEXP_SUBSTR (ms.Str,'([[:alnum:]])+',1,pv.pnum) is not null
Result1
REGEXP_SUBSTR(MS.STR RNUM COLNUM
-------------------- ---------- ----------
exapmle123 1 1
gmail 1 2
com 1 3
Approach 2
If you know how many words/columns you'll have, then you can use
Query2
with MyString as
( select 'exapmle123#gmail.com' Str, 1 rnum from dual
)
SELECT REGEXP_SUBSTR (ms.Str,'([[:alnum:]])+',1,1) col1, REGEXP_SUBSTR (ms.Str,'([[:alnum:]])+',1,2) col2, REGEXP_SUBSTR (ms.Str,'([[:alnum:]])+',1,3) col3
FROM MyString ms
Result2
COL1 COL2 COL
---------- ----- ---
exapmle123 gmail com
