Oracle Text search default stop words for non-English locales

The Oracle documentation lists the following default stop words:
http://docs.oracle.com/cd/B28359_01/text.111/b28304/astopsup.htm#CCREF1400
This includes stop words for all languages. But when I query my database to view the list of stop words, it only shows English words.
SELECT * FROM ctx_stopwords;
It doesn't list non-English stop words. I may be missing something here. I am looking for a query that returns all the default Oracle stop words in all languages. Is there a way to achieve this?
Thanks!

Look for scripts in $ORACLE_HOME/ctx/admin/defaults/
drdefus.sql - american
drdefd.sql - german

Related

Oracle Contains Function Returning False BLOB Positives

I'm using the Contains function to search for strings in BLOB fields containing PDFs or Word documents. Recently I did the following search:
SELECT doc_id
FROM table_of_documents
WHERE CONTAINS (BLOB_FIELD, 'SDS.IF.00005') > 0
Most of the records returned were correct, but a few had PDFs in them that did not have "SDS.IF.00005" in them but did have "SDS.EL.00005" in them.
When I say the PDFs did not have the search term, I mean I opened them in Adobe reader and searched them using the search function and my own eyeballs, and also people extremely familiar with the documents insist that the term is not there and should not be there.
I tried treating the dots as escape characters: SDS\\.IF\\.00005 and {SDS.IF.00005}. However, I am still getting the same results.
I also tried setting CONTAINS (BLOB_FIELD, 'SDS.IF.00005') = 100, but I'm still getting documents with SDS.EL.00005 in them and not SDS.IF.00005.
Do the dots in the search term mean something like SDS.%.00005 to Oracle? Or should I be researching how to find deep hidden text in Adobe documents that's not visible to the naked eye or to the Adobe text search function?
Thanks for your help.
As far as I know, CONTAINS is an Oracle Text function that performs full text search, so Oracle is tokenizing your string, probably according to its BASIC_LEXER. This lexer uses . as a word separator. So Oracle understands your query as "return anything that matches at least one of the words 'SDS', 'IF' or '00005'". As your PDF will probably have been indexed using that same lexer, from Oracle Text's point of view your PDF contains the words 'SDS', 'EL' and '00005', so it matches 2 of 3 words and Oracle returns that row.
Actually, 'IF' is included in the Oracle Text default stopword list (words that are ignored because they are so common that they mostly introduce "noise"), so your query actually is "return anything that matches at least one of 'SDS' or '00005'". Therefore I am not surprised that a PDF that contains the literal text "SDS.EL.00005" will give you CONTAINS(BLOB_FIELD, 'SDS.IF.00005') = 100 (a "perfect" match), as you wrote.
If you want to search for a verbatim string, I think you should rather not use Oracle Text and just implement a solution using plain old DBMS_LOB.INSTR. If that is not viable, then you will have to find a way to make Oracle Text index those strings without tokenizing them.
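To make the effect described above concrete, here is a minimal Python sketch. It does not call Oracle Text; it only imitates, under simplifying assumptions, a lexer that splits on non-alphanumeric characters and drops stopwords. The names tokenize and STOPWORDS and the tiny stopword set are hypothetical, chosen for illustration only.

```python
import re

# Hypothetical, tiny stand-in for a default English stoplist (assumption).
STOPWORDS = {"if", "in", "the", "is"}

def tokenize(text):
    """Split on anything that is not a letter or digit, then drop stopwords."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [w.upper() for w in words if w.lower() not in STOPWORDS]

document = "Reference: SDS.EL.00005"

query_tokens = tokenize("SDS.IF.00005")   # 'IF' is dropped as a stopword
doc_tokens = tokenize(document)

# Every token that survives from the query also appears in the document.
token_match = all(t in doc_tokens for t in query_tokens)

# A verbatim substring test (the idea behind DBMS_LOB.INSTR) does not match.
verbatim_match = "SDS.IF.00005" in document
```

In this toy model the token-based comparison reports a match while the verbatim comparison does not, which is the same discrepancy the question describes.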

Web search algorithm with multiple words

I want to add database search to my website, so I am thinking about an effective algorithm to use.
For example, if I search for "Hello my name is xxx", I want to see these results:
Hello my name is John
Hello my name is Peter
Hello mr. xxx
His name is Peter
He is here
So I want to find all rows in the database that contain part of this text and sort the results by the number of matching words.
I wrote an algorithm, but I am afraid it is too complicated and slow:
I split the search text into words and run an SQL SELECT with multiple LIKE conditions combined with OR. Then I save the results into a list, count the number of matched words in each result, and sort by this count.
The problem arises when I try to search for a long text.
Should I use a better algorithm, or should I learn something about things like Sphinx?
For the first two results, a simple regex search should be able to retrieve results like that.
For the latter ones, you might consider using an existing search library or product, such as Google Search Appliance, which can be used to search database information.
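The approach described in the question (split the query into words, count how many of them each row contains, sort by that count) can be sketched in a few lines of Python. This is only an in-memory illustration; the function names words and rank are made up for this example, and the rows are the sample results from the question.

```python
import re

def words(text):
    """Lower-case word set of a text."""
    return set(re.findall(r"\w+", text.lower()))

def rank(query, rows):
    """Return (row, score) pairs with score > 0, best match first."""
    wanted = words(query)
    scored = [(row, len(wanted & words(row))) for row in rows]
    hits = [pair for pair in scored if pair[1] > 0]
    # sorted() is stable, so rows with equal scores keep their original order.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

rows = [
    "He is here",
    "Hello mr. xxx",
    "Hello my name is Peter",
    "His name is Peter",
    "Hello my name is John",
]
results = rank("Hello my name is xxx", rows)
```

Scanning every row like this is linear in the table size. A full-text engine such as Sphinx avoids that by keeping an inverted index from words to rows, which is why it tends to scale better for long queries.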

Sphinx reverse search - when new item is added, execute searches on existing stored keywords

I have an app where people can list stuff to sell/swap/give away, with 200-character descriptions. Let's call them sellers.
Other users can search for things - let's call them buyers.
I have a system set up using Django, MySQL and Sphinx for text search.
Let's say a buyer is looking for "t-shirts". They don't get any results they want. I want the app to give the buyer the option to check a box to say "Tell me if something comes up".
Then when a seller lists a "Quicksilver t-shirt", this would trigger a sort of reverse search on all saved searches to notify those buyers that a new item matching their query has been listed.
Obviously I could trigger Sphinx searches on every saved search every time any new item is listed (in a loop) to look for matches - but this would be insane and intensive. This is the effect I want to achieve in a sane way - how can I do it?
You literally build a reverse index!
Store the 'searches' in the database, and build an index on them.
So 't-shirts' would be a document in this index.
Then when a new product is submitted, you run a query against this index. Use 'Quorum' syntax or even match-any - to get matches that only match one keyword.
So in your example, the query would be "Quicksilver t-shirt"/1 which means match Quicksilver OR t-shirt. But the same holds with much longer titles, or even the whole description.
The result of that query would be a list of (single word*) original searches that matched. Note this also assumes you have your index set up to treat - as a word character.
*Note it's slightly more complicated if you allow more complex queries: multiple keywords, negations, OR, brackets, phrases, etc. But in that case the reverse search just gives you POTENTIAL matches, so you need to confirm that each one still matches. That is still a number of queries, but you don't need to run them on all of them.
By the way, I think the technical term for these 'reverse' searches is Prospective Search:
http://en.wikipedia.org/wiki/Prospective_search
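As a rough illustration of the idea, here is a small in-memory Python sketch of a reverse index over saved searches. It is not Sphinx code; the class SavedSearchIndex, the tokens helper and the sample searches are all hypothetical. It follows the two steps from the answer: a match-any lookup to get potential matches, then a confirmation that every word of the saved search appears in the new item.

```python
import re
from collections import defaultdict

def tokens(text):
    # Keep '-' inside words so "t-shirt" stays one token (as the answer assumes).
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))

class SavedSearchIndex:
    def __init__(self):
        self.terms = {}                   # search id -> set of required words
        self.by_word = defaultdict(set)   # word -> ids of searches using it

    def add(self, search_id, query):
        self.terms[search_id] = tokens(query)
        for word in self.terms[search_id]:
            self.by_word[word].add(search_id)

    def matches(self, item_text):
        item_words = tokens(item_text)
        # Step 1: any saved search sharing at least one word is a candidate.
        candidates = set()
        for word in item_words:
            candidates |= self.by_word.get(word, set())
        # Step 2: confirm that every word of the saved search is in the item.
        return sorted(s for s in candidates if self.terms[s] <= item_words)

index = SavedSearchIndex()
index.add(1, "t-shirt")
index.add(2, "quicksilver hoodie")
index.add(3, "bike")
notify = index.matches("Quicksilver t-shirt")
```

Search 2 is a candidate because it shares the word "quicksilver", but it is rejected in the confirmation step. The sketch does no stemming, so a saved search for "t-shirts" would not match an item titled "t-shirt" unless both are normalized the same way.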

Searching a column with comma separated values in Oracle

I am using Oracle 11g and Oracle Text for a web search engine.
I have now created and text-indexed a CLOB column Keywords which contains space-separated words. This allowed me to extend the search, as Oracle Text will return the rows that have one or more keywords stored in that column. The contents of the column are hidden from the user and are used only to "extend" the search. This is working as intended.
But now I need to support multiple words or even complete sentences. With the current configuration, Oracle Text will search only for individual keywords. How do I need to store the phrases and configure Oracle Text so that it will search for whole phrases (exact match is preferred, but fuzzy matching is fine too)?
Example column content for two rows (semicolon-separated values):
"hello, hello; is there anybody out there?; nope;"
"just the; basic facts;"
I found a similar question, "Searching a column with comma separated values", except that I need a solution for Oracle 11g with its free-text search functionality.
Possible solutions:
1st solution: I was thinking of redesigning the DB as follows. I'd make a new table Keywords(pkID NUMBER, nonUniqueID NUMBER, singlePhrase VARCHAR2(100 BYTE)). And I'd change the previous column Keyword to KeywordNonUniqueID, which would hold the ID (instead of a list of values). At search time I'd INNER JOIN with the new Keyword table. The problem with this solution is that I'll get multiple rows that contain the same data except for the phrase. I assume this will destroy the ranking?
2nd solution: Is it possible to store phrases as XML in the original Keyword column, and somehow tell Oracle Text to search within the XML?
3rd solution: ?
Note that, generally, there won't be a lot of phrases (less than 100), nor will they be long (a single phrase will have up to 5 words).
Also note that I am currently using CONTAINS, and a few of its operators, for my full-text searching needs.
EDIT: This discussion, https://forums.oracle.com/forums/thread.jspa?messageID=10791361, almost solves my problem, but it also matches the individual words, not the whole phrase (exact matching).
Oracle supports searching of phrases by default.
The docs say:
4.1.4.1 CONTAINS Phrase Queries
If multiple words are contained in a query expression, separated only
by blank spaces (no operators), the string of words is considered a
phrase and Oracle Text searches for the entire string during a query.
For example, to find all documents that contain the phrase
international law, enter your query with the phrase international law.
Did I answer your question or misunderstand you?
P.S. It seems to me that the solution is to convert
"hello, hello; is there anybody out there?; nope;"
"just the; basic facts;"
to
"hello, hello aa is there anybody out there? aa nope aa"
"just the aa basic facts aa"
and search with CONTAINS for the phrase "is there anybody out there? aa".
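The conversion suggested above could be done before the rows are stored, for example with a small helper like the following Python sketch. The function name delimit_phrases is hypothetical, and the sentinel "aa" is simply the marker used in the answer; the sketch assumes the sentinel never occurs inside a phrase and is not a stopword in the index.

```python
def delimit_phrases(raw, sentinel="aa"):
    """Turn 'p1; p2;' into 'p1 aa p2 aa' so each phrase ends with the sentinel."""
    phrases = [p.strip() for p in raw.split(";") if p.strip()]
    return " ".join(f"{p} {sentinel}" for p in phrases)

row1 = delimit_phrases("hello, hello; is there anybody out there?; nope;")
row2 = delimit_phrases("just the; basic facts;")
```

Note that the sentinel only marks the end of each phrase. A query for the tail of a phrase followed by the sentinel (for example "out there? aa") would probably still match, so a sentinel at both ends of each phrase may be needed if truly exact matching is required.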

How to search for the word 'IN' in an XML CLOB using Oracle Full Text Search?

I am trying to track down a bug in some existing code and I'm new to Oracle full text searches...
I'm having trouble with a query that returns rows that contain the word 'IN' from within an xml document.
For example, let's say I have this snippet stored in a CLOB in my Oracle 10 database:
<a>
<b>IN</b>
</a>
I want to find all entries that have b with a value of 'IN' so I create a where clause that looks like this:
where contains(column, '(IN) INPATH (//b)') > 0;
But this returns no results. I have tried searching for other terms such as 'AA' and 'BS' and it works fine.
I have searched Google and the Oracle documentation for reserved words, but I don't see 'IN' listed there. Obviously, it is a reserved word in SQL, but I don't see a reference to that in the full text search or XPath docs.
I have also tried escaping the term by surrounding it with {}, "", and any crazy thing I could think of but that doesn't help.
Any advice would be greatly appreciated!
Your assumption is probably right. In full text search the term is 'stop word'. If you don't define your own stoplist then a default is used: CTXSYS.DEFAULT_STOPLIST. See this article for the default English stoplist. The word 'in' is in it.
You can either remove stop words or create your own stoplist using the CTX_DDL package, specifically CTX_DDL.REMOVE_STOPWORD or CTX_DDL.CREATE_STOPLIST.
