Searching a column with comma separated values in Oracle - oracle

I am using Oracle 11g and Oracle Text for a web search engine.
I have now created and text-indexed a CLOB column, Keywords, which contains space-separated words. This allowed me to extend the search, as Oracle Text will return the rows that have one or more keywords stored in that column. The contents of the column are hidden from the user and are used only to "extend" the search. This is working as intended.
But now I need to support multiple words or even complete sentences. With the current configuration, Oracle Text will search only for individual keywords. How do I need to store the phrases and configure Oracle Text so that it will search for whole phrases (exact match is preferred, but fuzzy matching is fine too)?
Column content example of two rows (semicolon-separated values):
"hello, hello; is there anybody out there?; nope;"
"just the; basic facts;"
I found a similar question, "Searching a column with comma separated values", except that I need a solution for Oracle 11g with its free-text search functionality.
Possible solutions:
1st solution: I was thinking of redesigning the DB as follows. I'd make a new table Keywords(pkID NUMBER, nonUniqueID NUMBER, singlePhrase VARCHAR2(100 BYTE)). And I'd change the previous column Keywords to KeywordNonUniqueID, which would hold the ID (instead of a list of values). At search time I'd INNER JOIN with the new Keywords table. The problem with this solution is that I'll get multiple rows that contain the same data except for the phrase. I assume this will destroy the ranking?
2nd solution: Is it possible to store the phrases as XML in the original Keywords column, and somehow tell Oracle Text to search within the XML?
3rd solution: ?
Note that, generally, there won't be a lot of phrases (fewer than 100), nor will they be long (a single phrase will have up to 5 words).
Also note that I am currently using CONTAINS, and a few of its operators, for my full-text searching needs.
EDIT: This discussion, https://forums.oracle.com/forums/thread.jspa?messageID=10791361, almost solves my problem, but it also matches the individual words, not only the whole phrase (exact matching).

Oracle supports searching of phrases by default.
In the docs we can see this:
4.1.4.1 CONTAINS Phrase Queries
If multiple words are contained in a query expression, separated only
by blank spaces (no operators), the string of words is considered a
phrase and Oracle Text searches for the entire string during a query.
For example, to find all documents that contain the phrase
international law, enter your query with the phrase international law.
Did I answer your question or misunderstand you?
P.S. It seems to me that the solution is to convert
"hello, hello; is there anybody out there?; nope;"
"just the; basic facts;"
to
"hello, hello aa is there anybody out there? aa nope aa"
"just the aa basic facts aa"
and search with CONTAINS for the phrase "is there anybody out there? aa"
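The conversion suggested above can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, and the "aa" sentinel is simply the token used in the answer's example, not anything defined by Oracle Text.

```python
# Hypothetical helpers; the "aa" sentinel follows the answer's example
# and is not part of Oracle Text.
SENTINEL = "aa"


def to_sentinel_form(raw: str, sentinel: str = SENTINEL) -> str:
    """Replace each ';' terminator with a sentinel token."""
    phrases = [p.strip() for p in raw.split(";") if p.strip()]
    return " ".join(f"{p} {sentinel}" for p in phrases)


def phrase_query(phrase: str, sentinel: str = SENTINEL) -> str:
    """Build the search phrase, which must end at a sentinel token."""
    return f"{phrase.strip()} {sentinel}"


print(to_sentinel_form("hello, hello; is there anybody out there?; nope;"))
print(phrase_query("is there anybody out there?"))
```

Note that this anchors only the end of each phrase; putting a sentinel in front of every phrase as well would be needed to anchor the start. Characters that CONTAINS treats as operators (the comma and the question mark in these examples) would probably need escaping in the actual query string.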

Related

Elasticsearch - Match long query text to short field

Standard match involves providing a small query term or phrase and matching it against a larger blob of text stored in a document. I want to do the reverse - my query will be a large blob of text, like a paragraph, and I want to rank the relevance of documents that contain a full name (i.e. John Smith). I want to supply a paragraph and determine which document's full name is most likely to be contained in that paragraph. What's the best way to do this?
Answering my own question: The Percolate API allows you to store queries and run documents against them: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
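As a rough sketch of how the percolate approach could look for the full-name case, the snippet below only builds the JSON request bodies; it does not talk to a cluster. The field names "query" and "body" and the function names are assumptions made for illustration; only the "percolator" field type and the "percolate" query come from the linked documentation.

```python
import json

# Hypothetical index mapping: "query" holds the stored queries, "body" is the
# text field those queries are written against.
MAPPING = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "body": {"type": "text"},
        }
    }
}


def stored_name_query(full_name: str) -> dict:
    """Document to index: a phrase query that looks for one full name."""
    return {"query": {"match_phrase": {"body": full_name}}}


def percolate_request(paragraph: str) -> dict:
    """Search body: find the stored queries that match the given paragraph."""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"body": paragraph},
            }
        }
    }


print(json.dumps(stored_name_query("John Smith")))
print(json.dumps(percolate_request("Yesterday John Smith spoke at the event.")))
```

Each full name is stored once as a query; the paragraph is then sent as the document to percolate, and the hits are the names whose queries matched.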

Oracle Contains Function Returning False BLOB Positives

I'm using the Contains function to search for strings in BLOB fields containing PDFs or Word documents. Recently I did the following search:
SELECT doc_id
FROM table_of_documents
WHERE CONTAINS (BLOB_FIELD, 'SDS.IF.00005') > 0
Most of the records returned were correct, but a few had PDFs in them that did not have "SDS.IF.00005" in them but did have "SDS.EL.00005" in them.
When I say the PDFs did not have the search term, I mean I opened them in Adobe reader and searched them using the search function and my own eyeballs, and also people extremely familiar with the documents insist that the term is not there and should not be there.
I tried escaping the dots: SDS\\.IF\\.00005 and {SDS.IF.00005}. However, I am still getting the same results.
I also tried setting CONTAINS (BLOB_FIELD, 'SDS.IF.00005') = 100, but I'm still getting documents with SDS.EL.00005 in them and not SDS.IF.00005.
Do the dots in the search term mean something like SDS.%.00005 to Oracle? Or should I be researching how to find deep hidden text in Adobe documents that's not visible to the naked eye or to the Adobe text search function?
Thanks for your help.
As far as I know, CONTAINS is an Oracle Text function that performs full-text search, so Oracle tokenizes your string, probably with its BASIC_LEXER. By default this lexer treats . as a word separator, so Oracle reads your query as the phrase made of the words 'SDS', 'IF' and '00005'. As your PDF will probably have been indexed using that same lexer, from Oracle Text's point of view your PDF contains the words 'SDS', 'EL' and '00005' in sequence.
Moreover, 'IF' is included in Oracle Text's default stopword list (words that are ignored because they are so common that they mostly introduce "noise"), and a stopword inside a query phrase matches any word in that position. So your query effectively is "return anything that contains 'SDS', then any word, then '00005'". Therefore I am not surprised that a PDF that contains the literal text "SDS.EL.00005" gives you CONTAINS(BLOB_FIELD, 'SDS.IF.00005') = 100 (a "perfect" match), as you wrote.
If you want to search for a verbatim string, I think you should rather not use Oracle Text and just implement a solution using plain old DBMS_LOB.INSTR. If that is not viable, then you will have to find a way to make Oracle Text index those strings without tokenizing them.
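A toy model may make the false positive easier to see. This is not Oracle's lexer: the tokenizer, the tiny STOPWORDS set and the function names are all assumptions for illustration. It splits on punctuation and, as an assumption about phrase queries, lets a stopword in the query stand for any single word.

```python
import re

# Toy model only, not Oracle's BASIC_LEXER. STOPWORDS is a tiny hypothetical
# subset of a stoplist.
STOPWORDS = {"if", "in", "the"}


def tokenize(text: str) -> list:
    """Split on anything that is not a letter or digit and lowercase the tokens."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def phrase_matches(query: str, document: str) -> bool:
    """Phrase match in which a stopword in the query matches any single word."""
    q, d = tokenize(query), tokenize(document)
    for start in range(len(d) - len(q) + 1):
        if all(qt in STOPWORDS or qt == d[start + i] for i, qt in enumerate(q)):
            return True
    return False


print(tokenize("SDS.IF.00005"))
print(phrase_matches("SDS.IF.00005", "Ref: SDS.EL.00005 rev 2"))
print(phrase_matches("SDS.XY.00005", "Ref: SDS.EL.00005 rev 2"))
```

Under these assumptions the first query matches the document even though "IF" never appears in it, while a query whose middle token is not a stopword does not.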

Web search algorithm with multiple words

I want to add database-backed search to my website, so I am looking for an efficient algorithm to use.
For example, if I search for "Hello my name is xxx" I want to see these results:
Hello my name is John
Hello my name is Peter
Hello mr. xxx
His name is Peter
He is here
So I want to find all rows in the database that contain part of this text and sort the results by the number of matching words.
I wrote an algorithm, but I am worried that it is too complicated and slow:
I split the search text into words and run an SQL SELECT with multiple LIKE conditions joined by OR. Then I save the results into a list, count the number of matched words in each result, and sort by that count.
The problem comes when I search for a long text.
Should I use a better algorithm, or should I learn something about tools like Sphinx?
For the first two results, a simple regex search should be able to retrieve results like that.
For the latter ones, you might consider using an existing search product, like Google Search Appliance, which can be used to search database information.
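The split-and-count ranking described in the question can be sketched in memory as follows. The function name is hypothetical, and a real site would push the filtering into the database (ideally a full-text index) instead of loading every row.

```python
def rank_by_matching_words(query: str, rows: list) -> list:
    """Return rows that share at least one word with the query, best first."""
    query_words = set(query.lower().split())
    scored = []
    for row in rows:
        # Treat '.' as a separator so "mr." still counts as the word "mr".
        row_words = set(row.lower().replace(".", " ").split())
        score = len(query_words & row_words)
        if score:
            scored.append((score, row))
    # Sort by descending score; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored]


rows = [
    "He is here",
    "Hello mr. xxx",
    "Hello my name is John",
    "Goodbye",
    "His name is Peter",
    "Hello my name is Peter",
]
for row in rank_by_matching_words("Hello my name is xxx", rows):
    print(row)
```

Each row is scored once with a set intersection, so the cost grows with the number of rows rather than with rows times query words.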

How to search for the word 'IN' in an XML CLOB using Oracle Full Text Search?

I am trying to track down a bug in some existing code and I'm new to Oracle full text searches...
I'm having trouble with a query that returns rows that contain the word 'IN' from within an xml document.
For example, let's say I have this snippet stored in a CLOB in my Oracle 10 database:
<a>
<b>IN</b>
</a>
I want to find all entries that have b with a value of 'IN' so I create a where clause that looks like this:
where contains(column, '(IN) INPATH (//b)') > 0;
But this returns no results. I have tried searching for other terms such as 'AA' and 'BS' and it works fine.
I have searched Google and the Oracle documentation for reserved words, but I don't see 'IN' listed there. Obviously, it is a reserved word in SQL, but I don't see a reference to that in the full-text search or XPath docs.
I have also tried escaping the term by surrounding it with {}, "", and any crazy thing I could think of but that doesn't help.
Any advice would be greatly appreciated!
Your assumption is probably right. In full-text search the term for this is 'stop word'. If you don't define your own stoplist, then a default is used: CTXSYS.DEFAULT_STOPLIST. See this article for the default English stoplist. The word 'in' is in it.
You can either remove stop words or create your own stoplist using the CTX_DDL package, specifically CTX_DDL.REMOVE_STOPWORD or CTX_DDL.CREATE_STOPLIST.

SQLite - how to return rows containing a text field that contains one or more strings?

I need to query a table in an SQLite database to return all the rows in a table that match a given set of words.
To be more precise: I have a database with ~80,000 records in it. One of the fields is a text field with around 100-200 words per record. What I want to be able to do is take a list of 200 single word keywords {"apple", "orange", "pear", ... } and retrieve a set of all the records in the table that contain at least one of the keyword terms in the description column.
The immediately obvious way to do this is with something like this:
SELECT stuff FROM table
WHERE (description LIKE '% apple %') or (description LIKE '% orange %') or ...
If I have 200 terms, I end up with a big and nasty looking SQL statement that seems to me to be clumsy, smacks of bad practice, and not surprisingly takes a long time to process - more than a second per 1000 records.
The answer to "Better performance for SQLite Select Statement" seemed close to what I need, and as a result I created an index, but according to http://www.sqlite.org/optoverview.html SQLite doesn't use any optimisations if the LIKE operator is used with a beginning % wildcard.
Not being an SQL expert, I am assuming I'm doing this the dumb way. I was wondering if someone with more experience could suggest a more sensible and perhaps more efficient way of doing this?
Alternatively, is there a better approach I could use to the problem?
Using the SQLite full-text search would be faster than a LIKE '%...%' query. I don't think an ordinary index can be used for a pattern beginning with %, because if the database doesn't know what the value starts with, it can't use the index to look it up.
An alternative approach is putting the keywords in a separate table instead, and making an intermediate table that has the information about which row in your main table has which keywords. If you indexed all the relevant columns that way, it could be queried very quickly.
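A minimal sketch of the separate-keyword-table idea, using Python's built-in sqlite3 module. The table and column names (docs, keywords, doc_keywords) and the helper functions are made up for illustration, and splitting on whitespace is a deliberately naive way to extract words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE doc_keywords (
        doc_id INTEGER REFERENCES docs(id),
        keyword_id INTEGER REFERENCES keywords(id),
        PRIMARY KEY (keyword_id, doc_id)
    );
""")


def add_doc(doc_id: int, description: str) -> None:
    """Store a row and link it to every distinct word it contains."""
    conn.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, description))
    for word in set(description.lower().split()):
        conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        conn.execute(
            "INSERT INTO doc_keywords (doc_id, keyword_id) "
            "SELECT ?, id FROM keywords WHERE word = ?",
            (doc_id, word),
        )


def docs_with_any(words: list) -> list:
    """Ids of rows that contain at least one of the given words."""
    marks = ",".join("?" * len(words))
    sql = (
        "SELECT DISTINCT dk.doc_id FROM doc_keywords dk "
        "JOIN keywords k ON k.id = dk.keyword_id "
        f"WHERE k.word IN ({marks}) ORDER BY dk.doc_id"
    )
    return [row[0] for row in conn.execute(sql, words)]


add_doc(1, "a ripe apple and a pear")
add_doc(2, "nothing fruity here")
add_doc(3, "orange juice")
print(docs_with_any(["apple", "orange", "kiwi"]))
```

The UNIQUE constraint on keywords.word and the primary key on doc_keywords give SQLite indexes to use, so the 200-term lookup becomes a single IN query instead of 200 LIKE scans.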
Sounds like you might want to have a look at Full Text Search. It was contributed to SQLite by someone from Google. The description:
allows the user to efficiently query the database for all rows that contain one or more words (hereafter "tokens"), even if the table contains many large documents.
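For illustration, here is a small sketch of the full-text approach with Python's built-in sqlite3 module. It assumes the bundled SQLite was compiled with the FTS5 extension; the table name and helper function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumes the SQLite build bundled with Python has the FTS5 extension enabled.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(description)")
conn.executemany(
    "INSERT INTO docs (rowid, description) VALUES (?, ?)",
    [
        (1, "a ripe apple and a pear"),
        (2, "nothing fruity here"),
        (3, "fresh orange juice"),
    ],
)


def rows_matching_any(keywords: list) -> list:
    """Rowids of documents containing at least one keyword, via a single MATCH."""
    # Quote each term so it is read as a plain string, then OR them together.
    expr = " OR ".join('"' + k.replace('"', '""') + '"' for k in keywords)
    cur = conn.execute(
        "SELECT rowid FROM docs WHERE docs MATCH ? ORDER BY rowid", (expr,)
    )
    return [row[0] for row in cur]


print(rows_matching_any(["apple", "orange", "kiwi"]))
```

The whole keyword list goes into one MATCH expression, so the index is consulted per term instead of scanning every description with LIKE.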
This is the same problem as full-text search, right? In which case, you need some help from the DB to construct indexes into these fields if you want to do this efficiently. A quick search for SQLite full text search yields this page.
The solution you correctly identify as clumsy is probably going to do up to 200 LIKE pattern matches per document in the worst case (i.e. when a document doesn't match), where each match has to traverse the entire field. Using the index approach will mean that your search speed will be independent of the size of each document.

Resources