I'm using the Contains function to search for strings in BLOB fields containing PDFs or Word documents. Recently I did the following search:
SELECT doc_id
FROM table_of_documents
WHERE CONTAINS (BLOB_FIELD, 'SDS.IF.00005') > 0
Most of the records returned were correct, but a few contained PDFs that had "SDS.EL.00005" in them and not "SDS.IF.00005".
When I say the PDFs did not contain the search term, I mean that I opened them in Adobe Reader and looked for it with both the search function and my own eyes. People who are extremely familiar with the documents also insist that the term is not there and should not be there.
I tried escaping the dots, with SDS\\.IF\\.00005 and with {SDS.IF.00005}. However, I am still getting the same results.
I also tried setting CONTAINS (BLOB_FIELD, 'SDS.IF.00005') = 100, but I'm still getting documents with SDS.EL.00005 in them and not SDS.IF.00005.
Do the dots in the search term mean something like SDS.%.00005 to Oracle? Or should I be researching how to find deep hidden text in Adobe documents that's not visible to the naked eye or to the Adobe text search function?
Thanks for your help.
As far as I know, CONTAINS is an Oracle Text function that performs full-text search, so Oracle tokenizes your string, probably with its BASIC_LEXER. This lexer treats . as a word separator, so Oracle reads your query as the sequence of words 'SDS', 'IF', '00005' rather than as one literal string. Your PDF will probably have been indexed with that same lexer, so from Oracle Text's point of view it contains the words 'SDS', 'EL' and '00005'.
On top of that, 'IF' is on Oracle Text's default stoplist (words that are ignored because they are so common that they mostly introduce "noise"), and as far as I know a stopword inside a query phrase matches any word in that position. So your query effectively becomes "'SDS', followed by any word, followed by '00005'". I am therefore not surprised that a PDF containing the literal text "SDS.EL.00005" gives you CONTAINS(BLOB_FIELD, 'SDS.IF.00005') = 100 (a "perfect" match), as you wrote.
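To make the effect concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the lexer is approximated by splitting on every non-alphanumeric character, the stoplist is a tiny made-up subset, and a stopword in the query is treated as a wildcard for one word. It is not Oracle's implementation, and the function names are mine.

```python
import re

# Assumption: a tiny stand-in for the default English stoplist (the real one is longer).
STOPWORDS = {"if", "in", "the", "of", "and"}

def tokenize(text):
    """Split on anything that is not a letter or digit and lower-case each token."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def phrase_matches(query, document):
    """True if the query tokens occur consecutively in the document,
    where a stopword in the query stands for any single document token."""
    q = tokenize(query)
    d = tokenize(document)
    for start in range(len(d) - len(q) + 1):
        window = d[start:start + len(q)]
        if all(qt in STOPWORDS or qt == dt for qt, dt in zip(q, window)):
            return True
    return False

print(tokenize("SDS.IF.00005"))
print(phrase_matches("SDS.IF.00005", "Refer to SDS.EL.00005 for details"))
print(phrase_matches("SDS.XX.00005", "Refer to SDS.EL.00005 for details"))
```

In this model the first query matches the "EL" document because 'if' is swallowed by the stoplist, while the made-up term "SDS.XX.00005" does not match because 'xx' is an ordinary word.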
If you want to search for a verbatim string, I think you are better off not using Oracle Text and implementing the search with plain old DBMS_LOB.INSTR instead. If that is not viable, you will have to find a way to make Oracle Text index those strings without tokenizing them.
We use ElasticSearch in a reverse manner from what I usually see. We store lots of small documents, usually 1 or 2 words, for example, Job Titles like "software engineering", "car mechanics", "architect", etc.
Then we query with a longer string, for example a 1000 word Job Spec. This way we get all Job Titles present in the text of the Job Spec.
It works well. But I was wondering whether I could get ElasticSearch to highlight the matching Job Titles in the Job Spec, i.e. highlight the results in the query. I have tried the highlight keyword, but it doesn't highlight the query text, it highlights the results. I'm not sure how to get the query to be returned in the ElasticSearch response, let alone whether it can be highlighted.
You might wonder why I need ElasticSearch to highlight the query: can't I just pick out all the results from the text and highlight them myself? Yes, I can, but several things make that hard, such as stemming and stopword removal. For example, "jquery" is stemmed to "jqueri" when ElasticSearch tokenises it, so it is found as a result, but to highlight it myself I would have to unstem it so that it matches the original text. ElasticSearch also removes symbols, so "terms & conditions" becomes "terms conditions", which is a problem for manual highlighting because I would have to add the "&" back. There are a hundred other problem cases, hence the question of whether ElasticSearch can do it for me.
I'm quite sure highlighting the query string isn't possible - only highlighting parts of documents in an index.
What you might try is indexing the query string itself in its own index, and then using the results of the first query as the query terms for a second query against the query string (in the second index). You could then have highlighting on the query string. You'll have to make an extra request to ES each time, but I think it'll get you what you want.
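A rough sketch of what the second request body could look like, written as a Python dict. The field name "text" and the sample titles are assumptions of mine, and the job spec is assumed to be already indexed (with the same analyzer as the titles, so that stemming agrees) in a second index. The snippet only builds the body; sending it is left to whatever client you use.

```python
import json

def build_highlight_request(matched_titles, field="text"):
    """Build a search body that looks for each matched title as a phrase in
    `field` and asks for the field to be returned with highlights."""
    return {
        "query": {
            "bool": {
                "should": [{"match_phrase": {field: title}} for title in matched_titles]
            }
        },
        "highlight": {
            # number_of_fragments = 0 asks for the whole field, not short snippets
            "fields": {field: {"number_of_fragments": 0}}
        },
    }

# Titles returned by the first query against the job-title index (made-up values).
titles = ["software engineering", "architect"]
body = build_highlight_request(titles)
print(json.dumps(body, indent=2))
```

Because the highlighting is done by ES on analyzed text, stemmed and symbol-stripped matches should be marked in the original job spec wording without any manual "unstemming".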
I am using Oracle 11g and Oracle Text for a web search engine.
I have now created and text-indexed a CLOB column Keywords which contains space-separated words. This allowed me to extend the search, as Oracle Text will return the rows that have one or more keywords stored in that column. The contents of the column are hidden from the user and are used only to "extend" the search. This is working as intended.
But now I need to support multiple words or even complete sentences. With the current configuration, Oracle Text searches only for individual keywords. How do I need to store the phrases and configure Oracle Text so that it will search for whole phrases (exact match is preferred, but fuzzy matching is fine too)?
Example column content for two rows (semicolon-separated values):
"hello, hello; is there anybody out there?; nope;"
"just the; basic facts;"
I found a similar question: Searching a column with comma separated values, except that I need a solution for Oracle 11g with its free-text search functionality.
Possible solutions:
1st solution: I was thinking of redesigning the DB as follows. I'd make a new table Keywords(pkID NUMBER, nonUniqueID NUMBER, singlePhrase VARCHAR2(100 BYTE)). And I'd change the previous column Keyword to KeywordNonUniqueID, which would hold the ID (instead of a list of values). At search time I'd INNER JOIN with the new Keyword table. The problem with this solution is that I'll get multiple rows that contain the same data except for the phrase. I assume this will destroy the ranking?
2nd solution: Is it possible to store phrases as XML in the original Keyword column, and somehow tell Oracle Text to search within the XML?
3rd solution: ?
Note that, generally, there won't be a lot of phrases (less than 100), nor will they be long (a single phrase will have up to 5 words).
Also note that I am currently using CONTAINS, and a few of its operators, for my full-text searching needs.
EDIT: This discussion, https://forums.oracle.com/forums/thread.jspa?messageID=10791361, almost solves my problem, but it also matches the individual words, not the whole phrase (exact matching).
Oracle supports searching of phrases by default.
In the docs we can see this:
4.1.4.1 CONTAINS Phrase Queries
If multiple words are contained in a query expression, separated only
by blank spaces (no operators), the string of words is considered a
phrase and Oracle Text searches for the entire string during a query.
For example, to find all documents that contain the phrase
international law, enter your query with the phrase international law.
Did I answer your question or misunderstand you?
P.S. It seems to me that the solution is to convert
"hello, hello; is there anybody out there?; nope;"
"just the; basic facts;"
to
"hello, hello aa is there anybody out there? aa nope aa"
"just the aa basic facts aa"
and search with CONTAINS for the phrase "is there anybody out there? aa"
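The conversion itself is easy to script. Here is a minimal sketch in Python; the function name is mine, and the marker "aa" is simply the one used above (whatever marker you pick should be a token that is neither a stopword nor a word that appears in real phrases).

```python
def to_delimited(keywords, marker="aa"):
    """Turn 'phrase one; phrase two;' into 'phrase one aa phrase two aa'."""
    phrases = [p.strip() for p in keywords.split(";") if p.strip()]
    return "".join(f"{p} {marker} " for p in phrases).rstrip()

print(to_delimited("hello, hello; is there anybody out there?; nope;"))
print(to_delimited("just the; basic facts;"))
```

Keep in mind that characters such as , and ? act as operators inside a CONTAINS query, so the search string would probably need escaping (for example with {}) before it is passed to Oracle Text.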
I am trying to track down a bug in some existing code and I'm new to Oracle full text searches...
I'm having trouble with a query that returns rows that contain the word 'IN' from within an xml document.
For example, let's say I have this snippet stored in a CLOB in my Oracle 10 database:
<a>
<b>IN</b>
</a>
I want to find all entries that have b with a value of 'IN' so I create a where clause that looks like this:
where contains(column, '(IN) INPATH (//b)') > 0;
But this returns no results. I have tried searching for other terms such as 'AA' and 'BS' and it works fine.
I have searched Google and the Oracle documentation for reserved words but I don't see 'IN' listed there. Obviously, it is a reserved word in SQL, but I don't see a reference to that in the full text search or XPATH docs.
I have also tried escaping the term by surrounding it with {}, "", and any crazy thing I could think of but that doesn't help.
Any advice would be greatly appreciated!
Your assumption is probably right. In full text search the term is 'stop word'. If you don't define your own stoplist then a default is used: CTXSYS.DEFAULT_STOPLIST. See this article for the default English stoplist. The word 'in' is in it.
You can either remove stop words or create your own stoplist using the CTX_DDL package, specifically CTX_DDL.REMOVE_STOPWORD or CTX_DDL.CREATE_STOPLIST.
For a phrase search, we want to bring up results only if there's an exact match (without ignoring stopwords). If it's a non-phrase search, we are fine displaying results even if the root form of the word matches etc.
We currently pass our data through StandardTokenizer, StopFilter, PorterStemFilter and LowerCaseFilter. Because of this, when a user searches for "password management", the search brings up results containing "password manager".
If I remove the stem filter, I will no longer be able to match the root form of a word for non-phrase queries. I was wondering whether I should index the same data in two fields of the document.
I have asked the same question at "Different indexing and search strategies on same field without doubling index size?". However, folks at the office are not happy about indexing the same data in two fields (we currently have around 20 text fields in a Lucene document). Is there any way to support both of the cases I listed above using TokenFilters?
Say, for a StopFilter, make changes so that it emits both the input token and ? (for ignored word) with same position increments. Similarly for StemFilter, it emits both the input token and stemmed token with same position increments. Basically input and output tokens (even ignored ones) have same positions.
Is it safe to go ahead with this approach? Has anyone else faced the requirements listed here? Are there any Filters readily available which do something similar to what I mentioned in my approach?
Thanks
I don't understand what you mean by "input and output tokens." Are you storing the data twice - once as stemmed and once non-stemmed?
If you aren't storing it twice, I don't think your method will work. Suppose the stored word is jumping and they search for jumped. Your query parser can emit jump and jumped but it still won't match jumping unless you have a value stored as jump.
And if you're going to store the value once as stemmed and once as non-stemmed, then why not just store it in two fields? Then you won't have to deal with weird tokenizer changes.
I need to query a table in an SQLite database to return all the rows in a table that match a given set of words.
To be more precise: I have a database with ~80,000 records in it. One of the fields is a text field with around 100-200 words per record. What I want to be able to do is take a list of 200 single word keywords {"apple", "orange", "pear", ... } and retrieve a set of all the records in the table that contain at least one of the keyword terms in the description column.
The immediately obvious way to do this is with something like this:
SELECT stuff FROM table
WHERE (description LIKE '% apple %') or (description LIKE '% orange %') or ...
If I have 200 terms, I end up with a big and nasty looking SQL statement that seems to me to be clumsy, smacks of bad practice, and not surprisingly takes a long time to process - more than a second per 1000 records.
This answer Better performance for SQLite Select Statement seemed close to what I need, and as a result I created an index, but according to http://www.sqlite.org/optoverview.html sqlite doesn't use any optimisations if the LIKE operator is used with a beginning % wildcard.
Not being an SQL expert, I am assuming I'm doing this the dumb way. I was wondering if someone with more experience could suggest a more sensible and perhaps more efficient way of doing this?
Alternatively, is there a better approach I could use to the problem?
Using SQLite full-text search would be faster than a LIKE '%...%' query. I don't think there's any database that can use an index for a pattern beginning with %: if the database doesn't know what the value starts with, it can't use the index to look it up.
An alternative approach is putting the keywords in a separate table instead, and making an intermediate table that has the information about which row in your main table has which keywords. If you indexed all the relevant columns that way, it could be queried very quickly.
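A minimal sketch of that layout, using Python's sqlite3 module with an in-memory database. All table, column and function names are made up for illustration, and splitting the description into words with a simple regex is an assumption about what counts as a keyword.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    -- Intermediate table: which record contains which keyword.
    CREATE TABLE record_keywords (
        keyword_id INTEGER NOT NULL REFERENCES keywords(id),
        record_id  INTEGER NOT NULL REFERENCES records(id),
        PRIMARY KEY (keyword_id, record_id)
    );
""")

def add_record(description):
    """Store a record and link it to every distinct word in its description."""
    cur = conn.execute("INSERT INTO records (description) VALUES (?)", (description,))
    record_id = cur.lastrowid
    for word in set(re.findall(r"[a-z0-9]+", description.lower())):
        conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        conn.execute(
            "INSERT INTO record_keywords (keyword_id, record_id) "
            "SELECT id, ? FROM keywords WHERE word = ?",
            (record_id, word),
        )
    return record_id

def find_any(words):
    """Return ids of records containing at least one of the given words."""
    placeholders = ",".join("?" for _ in words)
    rows = conn.execute(
        "SELECT DISTINCT rk.record_id FROM record_keywords rk "
        "JOIN keywords k ON k.id = rk.keyword_id "
        f"WHERE k.word IN ({placeholders}) ORDER BY rk.record_id",
        [w.lower() for w in words],
    )
    return [r[0] for r in rows]

for text in ["Fresh apple pie", "A bowl of cherries", "Pear and orange salad"]:
    add_record(text)

print(find_any(["apple", "orange", "pear"]))  # [1, 3]
```

The UNIQUE constraint on keywords.word and the primary key on the intermediate table give SQLite indexes to work with, so the lookup does not have to scan the description text at all.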
Sounds like you might want to have a look at Full Text Search. It was contributed to SQLite by someone from Google. The description:
allows the user to efficiently query
the database for all rows that contain
one or more words (hereafter
"tokens"), even if the table contains
many large documents.
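For what it's worth, a small sketch of how this could look from Python's sqlite3 module. It assumes your SQLite build includes the FTS5 extension (most recent builds do, but that is worth checking; older setups would use FTS3/FTS4 instead), and the table and function names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the FTS5 extension is compiled into this SQLite build.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(description)")
conn.executemany(
    "INSERT INTO docs (description) VALUES (?)",
    [("Fresh apple pie",), ("A bowl of cherries",), ("Pear and orange salad",)],
)

def find_any(keywords):
    """Return rowids of documents containing at least one of the keywords."""
    # Quote each term so it is read as a plain string, then OR the terms together.
    match_expr = " OR ".join('"' + k.replace('"', '""') + '"' for k in keywords)
    rows = conn.execute(
        "SELECT rowid FROM docs WHERE docs MATCH ? ORDER BY rowid", (match_expr,)
    )
    return [r[0] for r in rows]

print(find_any(["apple", "orange", "pear"]))  # [1, 3]
```

With 200 keywords the MATCH expression is still a single bound parameter, so the SQL statement itself stays short.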
This is the same problem as full-text search, right? In which case, you need some help from the DB to construct indexes into these fields if you want to do this efficiently. A quick search for SQLite full text search yields this page.
The solution you correctly identify as clumsy will probably perform up to 200 pattern matches per document in the worst case (i.e. when a document doesn't match), and each match has to traverse the entire field. With the index approach, your search speed will be independent of the size of each document.