Oracle Text Search doesn't work on some words - oracle

I am using Oracle' Text Search for my project. I created a ctxsys.context index on my column and inserted one entry "Would you like some wine???". I executed the query
select guid, text, score(10) from triplet where contains (text, 'Would', 10) > 0
it gave me no results. Querying 'you' and 'some' also return zero results. Only 'like' and 'wine' matches the record. Does Oracle consider you, would, some as stop words?? How can I let Oracle match these words? Thank you.

so,
i found that the query's output is perfect according to the stop word lists that is in the oracle.
those words can be found in the ctxsys package, and you could query for the stoplist and the stop words using
SELECT * FROM CTX_STOPLISTS;
SELECT * FROM ctx_stopwords;
and yes, the oracle consider 'you', 'would' in your query as stop words.
The following lists are the default stop words.
a did in only then where
all do into onto there whether
almost does is or therefore which
also either it our these while
although for its ours they who
an from just s this whose
and had ll shall those why
any has me she though will
are have might should through with
as having Mr since thus would
at he Mrs so to yet
be her Ms some too you
because here my still until your
been hers no such ve yours
both him non t very
but his nor than was
by how not that we
can however of the were
could i on their what
d if one them when
if you need to remove some specified words (or add stop words),
(you need **GRANT EXECUTE ON CTXSYS.CTX_DDL to you **)
then, you've to execute a procedure,
example:
begin
ctx_ddl.remove_stopword('mystop_list','some');
ctx_ddl.remove_stopword('mystop_list','you');
end;
refer link for various functions in ctx_ddl package
you could get full description about the created ctx index by querying,
select ctx_report.describe_index('yourindex_name') from dual;

Look at the docs
In paragraph "4.1.5 Querying Stopwords" you can get some useful info :)

Related

oracle - can I use contain and near with a clob? Need to speed up query

We have a query that takes 48 minutes to run a search on a clob. The query is written as if it is not a clob column and uses contains and near. This search for 3 words within a certain word distance from each other is important. I'm needing to speed this up and want to do an index on the clob, but don't know if that would work and don't fully understand how to do it. I found this from Tom Burleson
http://www.dba-oracle.com/t_clob_search_query.htm OR https://asktom.oracle.com/pls/apex/asktom.search?tag=oracle-text-contains-search-with-near-is-very-slow
, but can't figure out how to do it with contains and near to enable the search of 3 words withing a certain distance from each other.
current script:
SELECT clob_field
FROM clob_table
WHERE contains(clob_field,'NEAR (((QUICK),(FOX),(LAZY)),5)') > 0;
Want to use something like this if it will act like indexing:
SELECT clob_field
FROM clob_table
WHERE contains(dbms_lob.substr(clob_field,'near(((QUICK),(FOX),(LAZY)),5)')) > 0;
If not, I need to do indexing, but I don't quite understand how to use CTXCAT and CONTEXT (https://docs.oracle.com/cd/A91202_01/901_doc/text.901/a90122/ind4.htm). I also don't like what I read here that says that if one uses CTXCAT for indexing a clob you have to use CONTEXT, or something like that. It can't affect the other queries that are done on this field.
Thanks in advance!
Contains won't work unless it is globally indexed, so I had to index the field and then could get the original query working.

Indexing for the columns in ORACLE

I have below query, because of the huge data in the MATTER Table, it is taking huge time for LIKE statement to execute, so I was thinking of using the CONTEXT Index and using CONTAIN.
Shall I do indexing only on Matter_title or some other column as well,. Based on the below select query
Inputs highly appreciated
SELECT DISTINCT dm.MATTER_SEQ
FROM MATTER dm
,MATTER_TYPE dmt
,MATTER_SUBTYPE dms
,STATUS ds
,FILING df
WHERE dm.MATTER_TYPE_SEQ=dmt.MATTER_TYPE_SEQ
AND dm.MATTER_SUBTYPE_SEQ=dms.MATTER_SUBTYPE_SEQ
AND dm.STATUS_CODE NOT IN ('abc','jkl','xyz')
AND dm.STATUS_CODE = DS.STATUS_CODE
AND dm.IS_EXTERNAL='1'
AND dm.IS_DELETED='0'
AND dm.MATTER_SEQ = df.MATTER_SEQ
AND trunc(dm.CREATED_DATE) between '01-NOV-95' AND '02-OCT-18'
AND upper(dm.MATTER_TITLE) like(upper (q'{%jdasuidhajsndjahs%}'))
It sounds like you're already aware that LIKE with a leading wildcard ('%ABC') is notoriously inefficient since it typically can't use indexes and does a full table scan.
If the other optimizing suggestions don't help much, you probably would see better performance with a Context index. Be sure to set the SUBSTRING_INDEX preference so it'll specifically prepare the index for infix searches like yours. See this Ask Tom for more details. (If you will also have wildcards in the middle of strings ('ABC%DEF'), you might also want to set the PREFIX options.)
begin
ctx_ddl.create_preference('SUBSTRING_PREF','BASIC_WORDLIST');
ctx_ddl.set_attribute('SUBSTRING_PREF','SUBSTRING_INDEX','TRUE');
end;
create index matter_title_idx on MATTER(MATTER_TITLE)
indextype is ctxsys.context
parameters ('wordlist SUBSTRING_PREF');
Also note that Context indexes are case-insensitive by default, so you don't need to do UPPER(). I haven't tried using q'' literals with contains, so I'm not sure how this'll work.
AND CONTAINS(dm.MATTER_TITLE, q'{%jdasuidhajsndjahs%}') > 0
Try creating function Indexes upper(dm.MATTER_TITLE) and second trunc(dm.CREATED_DATE).
Also I am considering that the columns in the Join conditions already have indexes. If not have them indexed.

Reversing a string using an index in Oracle

I have a table that has IDs and Strings and I need to be able to properly index for searching for the end of the strings. How we are currently handling it is copying the information into another table and reversing each string and indexing it normally. What I would like to do is use some kind of index that allows to search in reverse.
Example
Data:
F7421kFSD1234
d7421kFSD1235
F7541kFSD1236
d7421kFSD1234
F7421kFSD1235
b8765kFSD1235
d7421kFSD1234
The way our users usually input thier search is something along the lines of...
*1234
By reversing the strings (and the search string: 4321*) I could find what I am looking for without completely scanning the whole table. My question is: Is making a second table the best way of doing this?
Is there a way to reverse index?
Ive tried an index like this...
create index REVERSE_STR_IDX on TABLE(STRING) REVERSE;
but oracle doesn't seem to be using it according to the Explain Plan.
Thanks in advance for the help.
Update:
I did have a problem with unicode characters not being reversed correctly. The solution to this was casting them.
Example:
select REVERSE(cast(string AS varchar2(2000)))
from tbl
where id = 1
There is the myth that a reverse key index can be used for that, however, I've never seen that in action.
I would try a "manual" function based index.
CREATE INDEX REVERSE_STR_IDX on TBL(reverse(string));
SELECT *
FROM TBL
WHERE reverse(string) LIKE '4321%';

Full-text search in Postgres or CouchDB?

I took geonames.org and imported all their data of German cities with all districts.
If I enter "Hamburg", it lists "Hamburg Center, Hamburg Airport" and so on. The application is in a closed network with no access to the internet, so I can't access the geonames.org web services and have to import the data. :(
The city with all of its districts works as an auto complete. So each key hit results in an XHR request and so on.
Now my customer asked whether it is possible to have all data of the world in it. Finally, about 5.000.000 rows with 45.000.000 alternative names etc.
Postgres needs about 3 seconds per query which makes the auto complete unusable.
Now I thought of CouchDb, have already worked with it. My question:
I would like to post "Ham" and I want CouchDB to get all documents starting with "Ham". If I enter "Hamburg" I want it to return Hamburg and so forth.
Is CouchDB the right database for it? Which other DBs can you recommend that respond with low latency (may be in-memory) and millions of datasets? The dataset doesn't change regularly, it's rather static!
If I understand your problem right, probably all you need is already built in the CouchDB.
To get a range of documents with names beginning with e.g. "Ham". You may use a request with a string range: startkey="Ham"&endkey="Ham\ufff0"
If you need a more comprehensive search, you may create a view containing names of other places as keys. So you again can query ranges using the technique above.
Here is a view function to make this:
function(doc) {
for (var name in doc.places) {
emit(name, doc._id);
}
}
Also see the CouchOne blog post about CouchDB typeahead and autocomplete search and this discussion on the mailing list about CouchDB autocomplete.
Optimized search with PostgreSQL
Your search is anchored at the start and no fuzzy search logic is required. This is not the typical use case for full text search.
If it gets more fuzzy or your search is not anchored at the start, look here for more:
Similar UTF-8 strings for autocomplete field
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
In PostgreSQL you can make use of advanced index features that should make the query very fast. In particular look at operator classes and indexes on expressions.
1) text_pattern_ops
Assuming your column is of type text, you would use a special index for text pattern operators like this:
CREATE INDEX name_text_pattern_ops_idx
ON tbl (name text_pattern_ops);
SELECT name
FROM tbl
WHERE name ~~ ('Hambu' || '%');
This is assuming that you operate with a database locale other than C - most likely de_DE.UTF-8 in your case. You could also set up a database with locale 'C'. I quote the manual here:
If you do use the C locale, you do not need the xxx_pattern_ops
operator classes, because an index with the default operator class is
usable for pattern-matching queries in the C locale.
2) Index on expression
I'd imagine you would also want to make that search case insensitive. so let's take another step and make that an index on an expression:
CREATE INDEX lower_name_text_pattern_ops_idx
ON tbl (lower(name) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(name) ~~ (lower('Hambu') || '%');
To make use of the index, the WHERE clause has to match the the index expression.
3) Optimize index size and speed
Finally, you might also want to impose a limit on the number of leading characters to minimize the size of your index and speed things up even further:
CREATE INDEX lower_left_name_text_pattern_ops_idx
ON tbl (lower(left(name,10)) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%');
left() was introduced with Postgres 9.1. Use substring(name, 1,10) in older versions.
4) Cover all possible requests
What about strings with more than 10 characters?
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~ (lower(left('Hambu678910',10)) || '%');
AND lower(name) ~~ (lower('Hambu678910') || '%');
This looks redundant, but you need to spell it out this way to actually use the index. Index search will narrow it down to a few entries, the additional clause filters the rest. Experiment to find the sweet spot. Depends on data distribution and typical use cases. 10 characters seem like a good starting point. For more than 10 characters, left() effectively turns into a very fast and simple hashing algorithm that's good enough for many (but not all) use cases.
5) Optimize disc representation with CLUSTER
So, the predominant access pattern will be to retrieve a bunch of adjacent rows according to our index lower_left_name_text_pattern_ops_idx. And you mostly read and hardly ever write. This is a textbook case for CLUSTER. The manual:
When a table is clustered, it is physically reordered based on the index information.
With a huge table like yours, this can dramatically improve response time because all rows to be fetched are in the same or adjacent blocks on disk.
First call:
CLUSTER tbl USING lower_left_name_text_pattern_ops_idx;
Information which index to use will be saved and successive calls will re-cluster the table:
CLUSTER tbl;
CLUSTER; -- cluster all tables in the db that have previously been clustered.
If you don't want to repeat it:
ALTER TABLE tbl SET WITHOUT CLUSTER;
However, CLUSTER takes an exclusive lock on the table. If that's a problem, look into pg_repack or pg_squeeze, which can do the same without exclusive lock on the table.
6) Prevent too many rows in the result
Demand a minimum of, say, 3 or 4 characters for the search string. I add this for completeness, you probably do it anyway.
And LIMIT the number of rows returned:
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%')
LIMIT 501;
If your query returns more than 500 rows, tell the user to narrow down his search.
7) Optimize filter method (operators)
If you absolutely must squeeze out every last microsecond, you can utilize operators of the text_pattern_ops family. Like this:
SELECT name
FROM tbl
WHERE lower(left(name, 10)) ~>=~ lower('Hambu')
AND lower(left(name, 10)) ~<=~ (lower('Hambu') || chr(2097151));
You gain very little with this last stunt. Normally, standard operators are the better choice.
If you do all that, search time will be reduced to a matter of milliseconds.
I think a better approach is keep your data on your database (Postgres or CouchDB) and index it with a full-text search engine, like Lucene, Solr or ElasticSearch.
Having said that, there's a project integrating CouchDB with Lucene.

Where can I find a list of 'Stop' words for Oracle fulltext search?

I've a client testing the full text (example below) search on a new Oracle UCM site.
The random text string they chose to test was 'test only'. Which failed; from my testing it seems 'only' is a reserved word, as it is never returned from a full text search (it is returned from metadata searches).
I've spent the morning searching oracle.com and found this which seems pretty comprehensive, yet does not have 'only'.
So my question is thus, is 'only' a reserved word. Where can I find a complete list of reserved words for Oracle full text search (10g)?
Full text search string example;
(<ftx>test only</ftx>)
Update.
I have done some more testing. Seems it ignores words that indicate places or times;
only, some, until, when, while, where, there, here, near, that, who, about, this, them.
Can anyone confirm this? I can't find this in on Oracle anywhere.
Update 2. Post Answer
I should have been looking for 'stop' words not 'reserved'.
Updated the question title and tags to reflect.
Additional answers:
See default Oracle (11g) stopword lists here: http://download.oracle.com/docs/cd/B28359_01/text.111/b28304/astopsup.htm#i634475
The following query allows to list stopwords from all stoplists (to be run on CTXSYS schema):
SELECT *
FROM DR$STOPWORD
LEFT JOIN DR$STOPLIST ON DR$STOPWORD.SPW_SPL_ID = DR$STOPLIST.SPL_ID
In the results, the SPL_* fields come from the DR$STOPLIST system table, and the SPW_* fields from the DR$STOPWORD table
From a user schema, user defined stoplists and stopwords can be retrieved through
SELECT * FROM CTX_USER_STOPLISTS;
SELECT * FROM CTX_USER_STOPWORDS;
I bet the system is trying to automatically ignore frequently occurring words. That would explain why you cannot find 'only' but 'onnly' can be found. Can you search for 'a', 'an', ...
The list you gave of words that do not work looks like some very common words that frequently are not the primary words in a sentence. Given this, they are not likely to be words you are searching for on a full text search.
What are the odds that you are looking for an article that includes the word 'that' and the inclusion of that word is the only fact you have on the article?
I think I found your list.... Ironically from the wiki page of the last company I started..: http://www.sugarcrm.com/wiki/index.php?title=Overview_of_Full_Text_Stop_Words#Default_Stop_Words_.28for_English.29
2.10.3 Modifying the Default Stoplist The default stoplist is always named CTXSYS.DEFAULT_STOPLIST. You can use the following procedures to modify this stoplist:
• CTX_DDL.ADD_STOPWORD
• CTX_DDL.REMOVE_STOPWORD
• CTX_DDL.ADD_STOPTHEME
• CTX_DDL.ADD_STOPCLASS
When you modify CTXSYS.DEFAULT_STOPLIST with the CTX_DDL package, you must re-create your index for the changes to take effect.
Default stopword list:
a he out up
be more their at
had one will from
it than and is
only when corp not
she also in says
was by ms to
about her over
because most there
has or with
its that are
of which could
some an inc
we can mz
after his s
been mr they
have other would
last the as
on who for
such any into
were co no
all if so
but mrs this
Update - A nice whitepaper from Oracle that includes how full text searching works can be downloaded from: http://www.oracle.com/technology/products/text/pdf/text_techwp.pdf. They mention the stopwords and the fact that there is a default list, but don't mention the words themselves.
Keywords reserved:
http://www.toadworld.com/KNOWLEDGE/KnowledgeXpertforOracle/tabid/648/TopicID/SQL15/Default.aspx
click on "Keyword reserved words" on left.
"Only" is in the list.
I am not sure what is going on in your case, but I cannot imaging that Oracle will not support the word only in full text search. In many full text cases, you have to search for one word. Could that be the problem you are encountering?

Resources