Oracle fuzzy text search with wildcards - oracle

I've got a SAP Oracle database full of customer data.
In our custom CRM it is quite common to search for customers using wildcards. In addition to the SAP standard search, we would like to do some fuzzy text searching for names that are similar to the entered name.
Currently we're using the UTL_MATCH.EDIT_DISTANCE function to search for similar names. The only disadvantage is that it does not support wildcard patterns.
Is there any possibility to use wildcards in combination with the UTL_MATCH.EDIT_DISTANCE function, or are there different (or even better) approaches to do that?
Let's say, there are the following names in the database:
PATRICK NOR
ORVILLE ALEX
OWEN TRISTAN
OKEN TRIST
The query could look like OKEN*IST*, and both OWEN TRISTAN and OKEN TRIST should be returned. OKEN would be a 100% match and OWEN less.
My current test-query looks like:
SELECT gp.partner, gp.bu_sort1, UTL_MATCH.edit_distance(gp.bu_sort1, ?) as edit_distance
FROM but000 gp
WHERE UTL_MATCH.edit_distance(gp.bu_sort1, ?) < 4
This query works fine except if wildcards * are used within the search string (which is quite common).

Beware of the performance implications of your approach. Even if it worked functionally, UTL_MATCH can only filter rows that a full table scan has already read.
What you likely need is an index on that data.
Look at Oracle Text, Oracle's text indexing feature. Bear in mind that it takes some effort to put to work.
You might experiment with the fuzzy operator, but handle it with care. Most Oracle Text features are language dependent (they take into account the English dictionary, the German one, and so on).
For instance:
-- create and populate the table
create table xxx_names (name varchar2(100));
insert into xxx_names(name) values('PATRICK NOR');
insert into xxx_names(name) values('ORVILLE ALEX');
insert into xxx_names(name) values('OWEN TRISTAN');
insert into xxx_names(name) values('OKEN TRIST');
insert into xxx_names(name) values('OKENOR SAD');
insert into xxx_names(name) values('OKENEAR TRUST');
--create the domain index
create index xxx_names_ctx on xxx_names(name) indextype is ctxsys.context;
This query would return results that you'd probably like (input is the string "TRST")
select
SCORE(1), name
from
xxx_names n
where
CONTAINS(n.name, 'definescore(fuzzy(TRST, 1, 6, weight),relevance)', 1) > 0
;
  SCORE(1) NAME
---------- --------------------
         1 OWEN TRISTAN
        22 OKEN TRIST
But with the input string "IST" it would likely return nothing (in my case that is what happens).
Also note that, in general, inputs shorter than 3 characters are considered non-matching by default.
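If the goal is a plain substring match such as "IST", the CONTAINS wildcard operator % may be closer to what you need than fuzzy. A minimal sketch, assuming the xxx_names table and the context index created above. Combining the wildcard with a fuzzy term in the second query is only my assumption about what the search should do, and whether fuzzy(OKEN) also picks up OWEN depends on the language settings of the index:

```sql
-- substring match against the indexed tokens, using the CONTAINS wildcard %
select score(1), name
from xxx_names n
where contains(n.name, '%IST%', 1) > 0;

-- the same wildcard term combined with a fuzzy term (illustrative only)
select score(1), name
from xxx_names n
where contains(n.name, 'fuzzy(OKEN, 1, 6, weight) and %IST%', 1) > 0;
```

A leading % makes Oracle Text expand the pattern against every token in the index, so expect it to be slow on large indexes.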
You'll possibly get a more predictable outcome if you drop the fuzzy requirement and stick to finding rows that simply contain the exact sequence you passed in.
In that case try a ctxcat index, which, by the way, supports some wildcards (warning: it supports multiple columns, but a column cannot exceed 30 chars in size!).
-- create and populate the table
--max length is 30 chars, otherwise the catsearch index can't be created
create table xxx_names (name varchar2(30));
insert into xxx_names(name) values('PATRICK NOR');
insert into xxx_names(name) values('ORVILLE ALEX');
insert into xxx_names(name) values('OWEN TRISTAN');
insert into xxx_names(name) values('OKEN TRIST');
insert into xxx_names(name) values('OKENOR SAD');
insert into xxx_names(name) values('OKENEAR TRUST');
begin
ctx_ddl.create_index_set('xxx_names_set');
ctx_ddl.add_index('xxx_names_set', 'name');
end;
/
-- drop any earlier index of that name (fails with ORA-01418 on the very first run)
drop index xxx_names_cat;
CREATE INDEX xxx_names_cat ON xxx_names(name) INDEXTYPE IS CTXSYS.CTXCAT
PARAMETERS ('index set xxx_names_set');
With that index, the following query works nicely (input is "*TRIST*"):
select
UTL_MATCH.edit_distance(name, 'TRIST') dist,
name
from
xxx_names
where
catsearch(name, '*TRIST*', 'order by name desc') > 0
;
      DIST NAME
---------- --------------------
         7 OWEN TRISTAN
         5 OKEN TRIST
But the input "*O*TRIST*" returns nothing (for some reason).
Bottom line: text indexes are probably the only way to go (for performance), but you have to fiddle quite a bit to understand all the intricacies.
References:
fuzzy search: Oracle Text CONTAINS Query Operators
catsearch : Oracle Text SQL Statements and Operators

Assuming "wildcard" means an asterisk, you want a name that matches all specified letters to rank highest, with more specified letters ranking better than fewer; otherwise rank by edit distance similarity.
Using the placeholder ? for your search term, try this:
select *
from mytable
order by case
when name like '%' || replace(?, '*', '%') || '%' then 0 - length(replace(?, '*', ''))
else 100 - UTL_MATCH.edit_distance_similarity(?, name) end
fetch first 10 rows only
FYI: all LIKE matches get a negative ordering number whose magnitude is the number of letters specified. All LIKE misses get a non-negative ordering number equal to 100 minus the similarity percentage. In all cases, a lower number is a better match.
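To see the ordering numbers for the names from the question, the same CASE expression can be run against inline data. This is only a sketch: the names and search CTEs and the sort_key alias are made up for illustration, and the search term is inlined instead of bound:

```sql
with names (name) as (
  select 'PATRICK NOR'  from dual union all
  select 'ORVILLE ALEX' from dual union all
  select 'OWEN TRISTAN' from dual union all
  select 'OKEN TRIST'   from dual
),
search (term) as (
  select 'OKEN*IST*' from dual
)
select n.name,
       case
         -- LIKE hit: negative key, magnitude = number of non-wildcard characters
         when n.name like '%' || replace(s.term, '*', '%') || '%'
           then 0 - length(replace(s.term, '*', ''))
         -- LIKE miss: 100 minus the similarity percentage
         else 100 - utl_match.edit_distance_similarity(s.term, n.name)
       end as sort_key
from names n
cross join search s
order by sort_key
fetch first 10 rows only;
```

Here OKEN TRIST is the only LIKE hit, so it sorts first with a key of -7 (the seven characters of OKENIST); the other names are ordered by their similarity to the raw search term, asterisks included.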

Related

Bound Oracle Text "near" operator to the same sentences

I have a column that stores paragraphs with multiple sentences, and I am using the NEAR operator to look for the right record. Is it possible to restrict NEAR so that it only matches words within the same sentence?
For example:
Paragraph
"An elderly man has died as a result of coronavirus in the Royal
Hobart Hospital overnight. It follows the death of a woman in her 80s
in the North West Regional Hospital in Burnie on Monday morning, and
brings the national toll to 19. "
The index type is ctxsys.context, and the query is:
select
score(1)
from
tbl
where
contains(Paragraph,
'NEAR((coronavirus, death), 20, FALSE)', 1) > 0
The result I want is nothing, as the two words are in different sentences. However, it currently returns a positive score because the words are less than 20 words apart.
Can you share some ideas on how to do this?
Thanks in advance!
The query should look like this:
select score(1)
from tbl
where contains(Paragraph, 'NEAR((coronavirus, death), 20, FALSE) WITHIN SENTENCE', 1) > 0
;
That is, use the WITHIN operator.
Note that you must tell the index to recognize sentences first. If you created the index with a statement like this:
create index ctxidx on tbl(Paragraph)
indextype is ctxsys.context
-- parameters(' ... ')
;
where the parameters (if you used that clause) don't say anything about sentences, you will get an error if you try the query above, something along the lines of
DRG-10837: section sentence does not exist
First you will have to define "special" sections for sentences:
begin
ctx_ddl.create_section_group('my_section_group', 'AUTO_SECTION_GROUP');
ctx_ddl.add_special_section('my_section_group', 'SENTENCE');
end;
/
With this in hand:
drop index ctxidx;
create index ctxidx on tbl(Paragraph)
indextype is ctxsys.context
parameters ('section group my_section_group')
;
Now you are ready to run the query at the top of this answer successfully.
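As far as I know, the Oracle Text documentation describes NULL_SECTION_GROUP as the group type intended for indexes that define only SENTENCE or PARAGRAPH sections, so for plain text it may be a closer fit. A sketch of that variant, where the name sentence_group is made up and the existing ctxidx index is assumed:

```sql
begin
  ctx_ddl.create_section_group('sentence_group', 'NULL_SECTION_GROUP');
  ctx_ddl.add_special_section('sentence_group', 'SENTENCE');
end;
/
drop index ctxidx;
create index ctxidx on tbl(Paragraph)
  indextype is ctxsys.context
  parameters ('section group sentence_group');

-- by default a context index is not maintained on DML;
-- after inserting or updating rows, synchronize it before querying
begin
  ctx_ddl.sync_index('ctxidx');
end;
/
```

The sync step only matters if rows change after the index is created; creating the index indexes the rows that already exist.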

In query missing expressions of Oracle SQL Developer

SELECT b.*
FROM buses b,
bus_stations bs,
starts st,
stops_at sa
WHERE st.station_no = ( SELECT station_id
FROM bus_stations
WHERE station_name = "golden mile_Regina"
)
AND sa.station_no = ( SELECT station_id
FROM bus_stations
WHERE station_name = 'westmount_edmonton'
)
ORDER BY DATE;
You can't use double quotes for strings; use single quotes, i.e.
WHERE station_name = 'golden mile_Regina'
By the way, are you sure of the spelling and letter case? Is it really mixed case, with underscores? Just asking.
Furthermore, you're ordering by DATE. That won't work either: DATE is a reserved word, so you can't use it as a column name (unless you enclose it in double quotes, but I certainly wouldn't recommend that). Have a look at the following example (stupid, yes, setting date to be a number, but I used it just to emphasize that DATE can't be used as a column name):
SQL> create table test (date number);
create table test (date number)
*
ERROR at line 1:
ORA-00904: : invalid identifier
Once you fix that, you'll get an unexpected result: there are 4 tables in the FROM clause, but they aren't joined with one another, so the result will be a nice Cartesian product.
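The question does not show the table definitions, so the following is only a sketch of how the joined query might look. The columns bus_no and departure_date are hypothetical; replace them with whatever actually links buses to starts and stops_at and with the real date column:

```sql
-- bus_no and departure_date are hypothetical column names
select b.*
from buses b
join starts st   on st.bus_no = b.bus_no
join stops_at sa on sa.bus_no = b.bus_no
where st.station_no = (select station_id
                       from bus_stations
                       where station_name = 'golden mile_Regina')
  and sa.station_no = (select station_id
                       from bus_stations
                       where station_name = 'westmount_edmonton')
order by b.departure_date;
```

The bus_stations alias from the original FROM clause is dropped here because the table is only needed inside the subqueries. Each subquery must return exactly one row, otherwise the = comparison fails.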

Function index does not work in oracle where it is used with other operator

Consider this simple query:
select name, code
from item
where length(code) > 5
To avoid a full table scan, there is a function-based index on length(code), created with the following command:
create index index_len_code on item(length(code));
The optimizer detects the index and uses it (INDEX RANGE SCAN). However, the optimizer does not use that index for the query below:
select i.name, i.code
from item i, item ii
where length(i.code) - length(ii.code) > 0
When I look at the execution plan, it shows a full table access rather than an index range scan, even though the index on length(code) exists.
What is wrong, and where?
If you have an EMP table with a column HIREDATE, and that column is indexed, then the optimizer may choose to use the index for accessing the table in a query with a condition like
... HIREDATE >= ADD_MONTHS(SYSDATE, -12)
to find employees hired in the last 12 months.
However, HIREDATE has to be alone on the left-hand side. If you add or subtract months or days to it, or if you wrap it within a function call like ADD_MONTHS, the index can't be used. The optimizer will not perform trivial arithmetic manipulations to convert the condition into one where HIREDATE by itself must satisfy an inequality.
The same happened in your second query. If you change the condition to
... length(i.code) > length(ii.code)
then the optimizer can use the function-based index on length(code). But even in your first query, if you change the condition to
... length(code) - 5 > 0
the index will NOT be used, because this is not an inequality condition on length(code). Again, the optimizer is not smart enough to perform trivial algebraic manipulations to rewrite this in a form where it's an inequality condition on length(code) itself.
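One way to check this on your own data is to compare the plans of the two forms of the condition. A minimal sketch, assuming the item table and the index_len_code index from the question; whether the first plan really shows an INDEX RANGE SCAN also depends on statistics and on how selective the condition is:

```sql
-- condition directly on the indexed expression
explain plan for
select name, code
from item
where length(code) > 5;

select * from table(dbms_xplan.display);

-- algebraically equivalent, but not an inequality on length(code) itself
explain plan for
select name, code
from item
where length(code) - 5 > 0;

select * from table(dbms_xplan.display);
```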

Oracle SQL Query Performance, Function based Indexes

I have been trying to fine tune a SQL Query that takes 1.5 Hrs to process approx 4,000 error records. The run time increases along with the number of rows.
I figured out there is one condition in my SQL that is actually causing the issue
AND (DECODE (aia.doc_sequence_value,
NULL, DECODE(aia.voucher_num,
NULL, SUBSTR(aia.invoice_num, 1, 10),
aia.voucher_num) ,
aia.doc_sequence_value) ||'_' ||
aila.line_number ||'_' ||
aida.distribution_line_number ||'_' ||
DECODE (aca.doc_sequence_value,
NULL, DECODE(aca.check_voucher_num,
NULL, SUBSTR(aca.check_number, 1, 10),
aca.check_voucher_num) ,
aca.doc_sequence_value)) = " P_ID"
(P_ID - a value from the first cursor sql)
(Note that these are standard Oracle Applications(ERP) Invoice tables)
P_ID is a column from the staging table; it is derived the same way as the expression above and is compared here again in the second SQL to get the latest data for that record. (Basically this reprocesses the error records; the value of P_ID is something like "999703_1_1_9995248".)
Q1) Can I create a function-based index on the whole left-side derivation? If so, what is the syntax?
Q2) Would it be okay, or against the Oracle standard rules, to create a function-based index on standard Oracle tables? (Not creating directly on the table itself)
Q3) If not, what is the best approach to solve this issue?
Briefly: no, you can't place a function-based index on that expression, because the input values are derived from four different tables (or table aliases).
What you might look into is a materialised view, but that's a big and potentially difficult solution for a single query optimisation problem.
You might investigate decomposing that string "999703_1_1_9995248" and applying the relevant parts to the separate expressions:
DECODE(aia.doc_sequence_value,
NULL,
DECODE(aia.voucher_num,
NULL, SUBSTR(aia.invoice_num, 1, 10),
aia.voucher_num) ,
aia.doc_sequence_value) = '999703' and
aila.line_number = '1' and
aida.distribution_line_number = '1' and
DECODE (aca.doc_sequence_value,
NULL,
DECODE(aca.check_voucher_num,
NULL, SUBSTR(aca.check_number, 1, 10),
aca.check_voucher_num) ,
aca.doc_sequence_value) = '9995248'
Then you can use indexes on the expressions and columns.
You could separate the four components of the P_ID value using regular expressions, or a combination of InStr() and SubStr()
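A sketch of the regular-expression variant, using the sample value from the question. The column aliases are made up, and the approach assumes that none of the four components can itself contain an underscore (an invoice number could, in which case the split would be wrong):

```sql
select regexp_substr(p_id, '[^_]+', 1, 1) as invoice_key,
       regexp_substr(p_id, '[^_]+', 1, 2) as line_number,
       regexp_substr(p_id, '[^_]+', 1, 3) as distribution_line_number,
       regexp_substr(p_id, '[^_]+', 1, 4) as check_key
from (select '999703_1_1_9995248' as p_id from dual);
```

For the sample value this yields 999703, 1, 1 and 9995248, which can then be compared against the four separate expressions.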
Ad 1) Based on the SQL you've posted, you cannot create a function-based index on that. The reason is that a function-based index must be:
1. Deterministic, i.e. the function used in the index definition has to always return the same result for given input arguments, and
2. Restricted to columns of the table the index is created on. In your case, judging by the aliases you're using, you have four tables (aia, aila, aida, aca).
Requirement 2 makes it impossible to build a function-based index for that expression.
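A function-based index is possible on the part of the expression that touches a single table. As a syntax sketch only: my_invoices is a hypothetical table standing in for whatever is behind the aia alias, with the three columns used in the first DECODE:

```sql
-- my_invoices is a hypothetical stand-in for the table behind alias aia
create index my_invoices_key_fbi on my_invoices (
  decode(doc_sequence_value,
         null, decode(voucher_num,
                      null, substr(invoice_num, 1, 10),
                      voucher_num),
         doc_sequence_value)
);
```

The optimizer can only consider such an index when the query's predicate uses the same expression, compared on its own against a value.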

oracle not using defined indexes

As seen below there is a simple join between my Tables A And B.
In addition, there is a condition on each table which is combined with Or operator.
SELECT /*+ NO_EXPAND */
* FROM IIndustrialCaseHistory B ,
IIndustrialCaseProduct A
where (
A.ProductId IN ('9_2') OR
contains(B.KeyWords,'%some text goes here%' ) <=0
)
and ( B.Id = A.IIndustrialCaseHistoryId)
A b-tree index is defined on ProductId, and there is a function index on KeyWords.
But I don't know why my execution plan does not use these indexes and performs a full table access.
As I found at this URL, the NO_EXPAND optimization hint could cause indexes to be used in the execution plan (the NO_EXPAND hint prevents the cost-based optimizer from considering OR-expansion for queries having OR conditions or IN-lists in the WHERE clause). But I didn't see any use of the defined indexes.
What is Oracle's problem with my query?
Unless there is something magical about the contains() function that I don't know about, Oracle cannot use an index to find a value that leads with a wildcard, i.e. a text string that occurs within a varchar2 column but does not start in the first position. [OR B.KeyWords LIKE '%some text goes here%' cannot use the index, as opposed to OR B.KeyWords LIKE 'Some text starts here%', which is optimizable via an index.] The optimizer will fall back to a full table scan in that case.
Also, although it may not be material, why use IN() if there is only one value in the list? Why not A.ProductId = '9_2' ?
