I have a column that stores paragraphs with multiple sentences, and I am using the NEAR operator to find the right record. However, is it possible to restrict NEAR so that it only matches words within the same sentence?
For example:
Paragraph
"An elderly man has died as a result of coronavirus in the Royal
Hobart Hospital overnight. It follows the death of a woman in her 80s
in the North West Regional Hospital in Burnie on Monday morning, and
brings the national toll to 19. "
indextype is ctxsys.context
select
score(1)
from
tbl
where
contains(Paragraph, 'NEAR((coronavirus, death), 20, FALSE)', 1) > 0
The result I want is nothing, as the two words are in different sentences. However, the query currently returns a positive score because the words are fewer than 20 words apart.
Can you share some ideas on how to do this?
Thanks in advance!
The query should look like this:
select score(1)
from tbl
where contains(Paragraph, 'NEAR((coronavirus, death), 20, FALSE) WITHIN SENTENCE', 1) > 0
;
That is - use the WITHIN operator.
Note that you must tell the index to recognize sentences first. That is: if you created the index with a statement like this:
create index ctxidx on tbl(Paragraph)
indextype is ctxsys.context
-- parameters(' ... ')
;
where the parameters (if you used that clause) don't say anything about "sentences", you will get an error if you try the query above - something along the lines of
DRG-10837: section sentence does not exist
First you will have to define "special" sections for sentences:
begin
ctx_ddl.create_section_group('my_section_group', 'AUTO_SECTION_GROUP');
ctx_ddl.add_special_section('my_section_group', 'SENTENCE');
end;
/
With this in hand:
drop index ctxidx;
create index ctxidx on tbl(Paragraph)
indextype is ctxsys.context
parameters ('section group my_section_group')
;
Now you are ready to successfully run the query at the top of this Answer.
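To make the intent of NEAR combined with WITHIN SENTENCE concrete, here is a minimal Python sketch of the rule "both words in one sentence, at most N words apart". It is only an illustration: the function name near_within_sentence, the naive splitter that ends a sentence at '.', '!' or '?', and the simple position-difference distance are all assumptions, and Oracle Text's own sentence detection and span counting may differ.

```python
import re

def near_within_sentence(text, word_a, word_b, max_span=20):
    """True if both words occur in the same sentence, at most max_span positions apart."""
    # Assumption: a sentence ends at '.', '!' or '?'. Real sentence detection is subtler.
    for sentence in re.split(r"[.!?]+", text):
        words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", sentence)]
        pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
        pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
        if any(abs(i - j) <= max_span for i in pos_a for j in pos_b):
            return True
    return False

paragraph = (
    "An elderly man has died as a result of coronavirus in the Royal "
    "Hobart Hospital overnight. It follows the death of a woman in her 80s "
    "in the North West Regional Hospital in Burnie on Monday morning, and "
    "brings the national toll to 19. "
)

# The two words sit in different sentences, so the bounded search finds nothing.
print(near_within_sentence(paragraph, "coronavirus", "death"))
# These two words share a sentence, so the bounded search succeeds.
print(near_within_sentence(paragraph, "death", "woman"))
```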
Consider this simple query:
select name, code
from item
where length(code) > 5
To avoid a full table scan, there is a function-based index on length(code), created with the following command:
create index index_len_code on item(length(code));
The optimizer detects the index and uses it (INDEX RANGE SCAN). However, the optimizer does not use the index for the query below:
select i.name, i.code
from item i, item ii
where length(i.code) - length(ii.code) > 0
When I look at the execution plan, it shows a full table access rather than an index range scan, even though the index on length(code) exists.
What is wrong, and where?
If you have an EMP table with a column HIREDATE, and that column is indexed, then the optimizer may choose to use the index for accessing the table in a query with a condition like
... HIREDATE >= ADD_MONTHS(SYSDATE, -12)
to find employees hired in the last 12 months.
However, HIREDATE has to be alone on the left-hand side. If you add or subtract months or days to it, or if you wrap it within a function call like ADD_MONTHS, the index can't be used. The optimizer will not perform trivial arithmetic manipulations to convert the condition into one where HIREDATE by itself must satisfy an inequality.
The same happened in your second query. If you change the condition to
... length(i.code) > length(ii.code)
then the optimizer can use the function-based index on length(code). But even in your first query, if you change the condition to
... length(code) - 5 > 0
the index will NOT be used, because this is not an inequality condition on length(code). Again, the optimizer is not smart enough to perform trivial algebraic manipulations to rewrite this in a form where it's an inequality condition on length(code) itself.
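The same effect can be observed outside Oracle. The sketch below uses SQLite through Python's sqlite3 module, purely as a stand-in (an assumption; Oracle's optimizer and plan output are different), to compare the plan for a bare comparison on the indexed expression with the plan for the arithmetic variant. The helper name plan is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table item (name text, code text)")
# Expression index, analogous to the function-based index on length(code).
conn.execute("create index index_len_code on item(length(code))")

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# The indexed expression stands alone on one side of the comparison.
plan_bare = plan("select name, code from item where length(code) > 5")
# Logically equivalent, but the indexed expression is buried in arithmetic.
plan_arith = plan("select name, code from item where length(code) - 5 > 0")

print(plan_bare)
print(plan_arith)
```

In this stand-in, the first plan is expected to mention index_len_code and the second to fall back to a table scan, which mirrors the behaviour described above.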
I've got a SAP Oracle database full with customer data.
In our custom CRM it is quite common to search for customers using wildcards. In addition to the SAP standard search, we would like to do some fuzzy text searching for names that are similar to the entered name.
Currently we're using the UTL_MATCH.EDIT_DISTANCE function to search for similar names. The only disadvantage is that it is not possible to use some wildcard patterns.
Is there any possibility of using wildcards in combination with the UTL_MATCH.EDIT_DISTANCE function, or are there different (or even better) approaches to do that?
Let's say, there are the following names in the database:
PATRICK NOR
ORVILLE ALEX
OWEN TRISTAN
OKEN TRIST
The query could look like OKEN*IST*, and both OWEN TRISTAN and OKEN TRIST should be returned. OKEN would be a 100% match and OWEN a weaker one.
My current test-query looks like:
SELECT gp.partner, gp.bu_sort1, UTL_MATCH.edit_distance(gp.bu_sort1, ?) as edit_distance
FROM but000 gp
WHERE UTL_MATCH.edit_distance(gp.bu_sort1, ?) < 4
This query works fine except if wildcards * are used within the search string (which is quite common).
Beware of the performance implications of your approach. Even if it worked "functionally", with UTL_MATCH you can only filter the results obtained by a full table scan.
What you likely need is an index on such data.
Head to Oracle Text, the text indexing capabilities of Oracle. Bear in mind that they require some effort to put to work.
You might experiment with the fuzzy operator, but handle it with care. Most Oracle Text features are language dependent (they take into account the English dictionary, the German one, etc.).
For instance
-- create and populate the table
create table xxx_names (name varchar2(100));
insert into xxx_names(name) values('PATRICK NOR');
insert into xxx_names(name) values('ORVILLE ALEX');
insert into xxx_names(name) values('OWEN TRISTAN');
insert into xxx_names(name) values('OKEN TRIST');
insert into xxx_names(name) values('OKENOR SAD');
insert into xxx_names(name) values('OKENEAR TRUST');
--create the domain index
create index xxx_names_ctx on xxx_names(name) indextype is ctxsys.context;
This query would return results that you'd probably like (input is the string "TRST")
select
SCORE(1), name
from
xxx_names n
where
CONTAINS(n.name, 'definescore(fuzzy(TRST, 1, 6, weight),relevance)', 1) > 0
;
SCORE(1) NAME
---------- --------------------
1 OWEN TRISTAN
22 OKEN TRIST
But with the input string "IST" it would likely return nothing (in my case this is what it does).
Also note that in general, inputs of less than 3 characters are considered non-matching by default.
You'll possibly get a more "predictable" outcome if you take off the "fuzzy" requirement and stick to finding rows that just "contains" the exact sequence you passed in.
In this case try using a ctxcat index, which, by the way, supports some wildcards (warning: it supports multiple columns, but a column cannot exceed 30 chars in size!)
-- create and populate the table
--max length is 30 chars, otherwise the catsearch index can't be created
create table xxx_names (name varchar2(30));
insert into xxx_names(name) values('PATRICK NOR');
insert into xxx_names(name) values('ORVILLE ALEX');
insert into xxx_names(name) values('OWEN TRISTAN');
insert into xxx_names(name) values('OKEN TRIST');
insert into xxx_names(name) values('OKENOR SAD');
insert into xxx_names(name) values('OKENEAR TRUST');
begin
ctx_ddl.create_index_set('xxx_names_set');
ctx_ddl.add_index('xxx_names_set', 'name');
end;
/
drop index xxx_names_cat;
CREATE INDEX xxx_names_cat ON xxx_names(name) INDEXTYPE IS CTXSYS.CTXCAT
PARAMETERS ('index set xxx_names_set');
The latter, with this query would work nicely (input is "*TRIST*")
select
UTL_MATCH.edit_distance(name, 'TRIST') dist,
name
from
xxx_names
where
catsearch(name, '*TRIST*', 'order by name desc') > 0
;
DIST NAME
---------- --------------------
7 OWEN TRISTAN
5 OKEN TRIST
But the input "*O*TRIST*" wouldn't return anything (for some reason).
Bottom line: text indexes are probably the only way to go (for performance) but you have to fiddle quite a bit to understand all the intricacies.
References:
fuzzy search: Oracle Text CONTAINS Query Operators
catsearch : Oracle Text SQL Statements and Operators
Assuming "wildcard" means an asterisk, you want a name that matches all specified letters to rank highest, with more specified letters matching better than less, otherwise rank by edit distance similarity.
Using the placeholder ? for your search term, try this:
select *
from mytable
order by case
when name like '%' || replace(?, '*', '%') || '%' then 0 - length(replace(?, '*', ''))
else 100 - UTL_MATCH.edit_distance_similarity(?, name) end
fetch first 10 rows only
FYI all "like" matches have a negative number for their ordering with magnitude the number of letters specified. All like misses have a non-negative ordering number with magnitude of the percentage difference. In all cases, a lower number is a better match.
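As a rough illustration of that ordering rule, here is a Python sketch. Everything in it is an assumption for demonstration purposes: fnmatch stands in for LIKE with '*' as the wildcard (so '?' and '[' in a term would also act as wildcards here), and similarity is a hand-rolled normalized Levenshtein score that only approximates UTL_MATCH.EDIT_DISTANCE_SIMILARITY.

```python
import fnmatch

def edit_distance(a, b):
    """Classic Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Score from 0 (nothing alike) to 100 (identical); an assumed stand-in."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - edit_distance(a, b) / longest))

def sort_key(name, term):
    # Wildcard hit: negative key whose magnitude is the number of literal characters.
    if fnmatch.fnmatchcase(name, "*" + term + "*"):
        return -len(term.replace("*", ""))
    # Miss: non-negative key, where a smaller value means a closer name.
    return 100 - similarity(term, name)

names = ["PATRICK NOR", "ORVILLE ALEX", "OWEN TRISTAN", "OKEN TRIST"]
ranked = sorted(names, key=lambda n: sort_key(n, "OKEN*IST*"))
print(ranked)
```

With the sample names from the question, the wildcard hit OKEN TRIST sorts first and the near miss OWEN TRISTAN follows it.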
I'm a user of Oracle BI (v. 11.1.1.7.141014). I have a text column "description" and would like to create a new table with the word count for all words in that column. So for instance:
Source:
Description
___________
This is a test
Just a test
Result:
Word Count
_____________
a 2
test 2
is 1
just 1
this 1
Would it be possible? I have a user account, (no administration features), but I can work on reports (tables, pivot tables, etc.), data structures, custom SQL queries (limited to reports and data structures) and so on...
Thanks in advance
Defining "word" as any sequence of one or more consecutive English letters (upper or lower case), and assuming that "this" and "This" are the same, here is one possible solution. The first line of the code ends in "... from a),"; substitute your table name in place of "a" (for my own testing purposes, I created a table with your input data and called it a).
with b (d, ct) as (select Description, regexp_count(Description, '[a-zA-Z]+') from a),
h (pos) as (select level from dual connect by level <= 100),
prep (word) as (select lower(regexp_substr(d, '[a-zA-Z]+', 1, pos)) from b, h where pos <= ct)
select word, count(word) as word_count
from prep
group by word
order by word_count desc, word
/
The solution needs to know beforehand the maximum number of words per input string; I used 100, that can be increased (in the definition of h in the second line of code).
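For comparison, the same counting rule (a word is a run of English letters, compared case-insensitively, ordered by count descending and then alphabetically) can be sketched in a few lines of Python. This is only a way to check the expected result on sample data, not something that runs inside Oracle BI; the function name word_counts is hypothetical.

```python
import re
from collections import Counter

def word_counts(descriptions):
    """Count lower-cased letter-only words across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(w.lower() for w in re.findall(r"[a-zA-Z]+", text))
    # Same ordering as the SQL: count descending, then word ascending.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

rows = word_counts(["This is a test", "Just a test"])
for word, count in rows:
    print(word, count)
```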
I'm new to database administration, but I need to create a database view, while the db admin requires it to run in 5 mins or less. My database is PostgreSQL 9.1.1 on RedHat4.4 linux 64-bit. I'm unsure about the hardware specifications. One of the tables is 40million rows. From the table, I have a column of directory paths from which I must group by about 20 string patterns and count its occurrences. The string pattern requires infix search, as it could be somewhere in the middle or end of the path. The string pattern also has a priority, as in when %str1% then 'str1', when %str2% then 'str2', and both str1, str2, str3, etc can occur on the same path, i.e.
path
/usr/myblock/str1/str2
/usr/myblock/something/str2
/usr/myblock/str1/something/str3
What I did so far was to build a table out of CASE statements then join it back to the original table by LIKE, then SELECT id, pattern, count(pattern). The query runtime was terrible, taking 5mins to retrieve from 5.5K rows. My query looks like this:
WITH a AS (
SELECT CASE
WHEN path ~ '^/usr/myblock/(.*)str1(.*)' THEN 'str1'
WHEN path ~ '^/usr/myblock/(.*)str2$' THEN 'str2'
WHEN path ~ '^/usr/myblock/(.*)str3$' THEN 'str3'
.... --multiple other case conditions
WHEN path ~ '^/usr/myblock/' THEN 'others'
ELSE 'n/a'
END as flow
FROM mega_t WHERE left(path,13)='/usr/myblock/' limit 5)
SELECT id, a.flow, count(*) AS flow_count FROM a
JOIN mega_t ON path LIKE '%' || a.flow || '%'
WHERE (some_conditions) AND to_timestamp(test_runs.created_at::double precision)
> ('now'::text::date - '1 mon'::interval) --collect last 1 month's results only
GROUP BY id, a.flow;
My expected output for that simple case would be:
id | flow | flow_count
1 | str1 | 2
2 | str2 | 1
What is a better way to search for substrings like this and count occurrences? I can't use ts_stat, nor 'SELECT count(path) WHERE path LIKE %str1%' because of the if-else priority it needs. I read about creating trigram indexes, but I think that is overkill for my patterns. I hope this question is clear and useful. Another thing I should add is that the 40million rows table is updated frequently every few seconds or minutes while the view will be accessed every eight hours daily.
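One way to read the priority requirement is "classify each path once, first matching pattern wins, then count per label". The Python sketch below shows only that logic on the three sample paths; it is not a tuned PostgreSQL solution. The pattern list, the classify helper, and the plain infix test (the query above anchors some patterns at the end of the path) are assumptions. In SQL, a single CASE expression evaluated once per row plays the same role, since its WHEN branches are tested in order.

```python
from collections import Counter

# Patterns in priority order: the first one found in the path wins (assumed list).
PATTERNS = ["str1", "str2", "str3"]
PREFIX = "/usr/myblock/"

def classify(path):
    """Return the highest-priority pattern contained in the path."""
    if not path.startswith(PREFIX):
        return "n/a"
    rest = path[len(PREFIX):]
    for pattern in PATTERNS:
        if pattern in rest:
            return pattern
    return "others"

paths = [
    "/usr/myblock/str1/str2",
    "/usr/myblock/something/str2",
    "/usr/myblock/str1/something/str3",
]

# One pass over the data: classify each row, then count per label.
flow_counts = Counter(classify(p) for p in paths)
print(flow_counts)
```

Because each row is classified exactly once, there is no need to join the labels back to the table with LIKE.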
I have a MAH_KERESES_MV table with 3 columns OBJEKTUM_NEV, KERESES_SZOVEG_1, KERESES_SZOVEG_2. I create the following multi column Oracle Text index:
exec ctx_ddl.create_preference( 'MAH_SEARCH', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute('MAH_SEARCH', 'COLUMNS', 'OBJEKTUM_NEV, KERESES_SZOVEG_1, KERESES_SZOVEG_2');
create index MAX_KERES_CTX on MAH_KERESES_MV(OBJEKTUM_NEV)
indextype is ctxsys.context
parameters ('DATASTORE MAH_SEARCH');
But the query does not return any rows, although if I formulate the query with the like operator, then I get the results as expected:
SELECT id, OBJEKTUM_NEV
FROM MAH_KERESES_MV
WHERE CONTAINS(OBJEKTUM_NEV, 'C')>0;
Can somebody please help? TIA,
Tamas
Just in case anybody is interested later on: the problem was that the above CONTAINS clause searches for the C character as a standalone entity (i.e. a word). The correct where clause would have been:
WHERE CONTAINS(OBJEKTUM_NEV, 'C%')>0;