Oracle string search performance issue

I have a simple search stored procedure in Oracle 11gR2 on a table with over 1.6 million records. I am puzzled by the fact that if I search for a word inside a column, such as "%boston%", it takes 12 seconds. I have an index on the name column.
select description from travel_websites where name like '%boston%';
If I only search for a word starting with boston, like "boston%", it only takes 0.15 seconds.
select description from travel_websites where name like 'boston%';
I added an index hint to try to force the optimizer to use my index on the name column, but it did not help either.
select description /*+ index name_idx */ from travel_websites where name like '%boston%';
Any advice would be greatly appreciated.

You cannot use an index range scan for a predicate that has a leading wildcard (i.e. like '%boston%'). This makes sense if you think about how an index is stored on disk: if you don't know what the first character of the string you are searching for is, you can't traverse the index to look for index entries that match that string. You may be able to do a full scan of the index, where you read every leaf block and check the name there to see if it contains the string you want. But that requires a full scan of the index, plus you then have to visit the table for every ROWID you get from the index in order to fetch any columns that are not part of the index you just full-scanned. Depending on the relative sizes of the table and the index, and on how selective the predicate is, the optimizer may easily decide that it is quicker to just do a table scan when you're searching for a leading wildcard.
Oracle does support full-text search, but you have to use Oracle Text, which requires that you build an Oracle Text index on the name column and use the CONTAINS operator to do the search rather than a LIKE query. Oracle Text is a very robust product, so there are quite a few options to consider in building the index, refreshing the index, and writing the query, depending on how sophisticated you want to get.
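A minimal sketch of that approach, using the table and column from the question (the index name is hypothetical):
-- Build a basic Oracle Text CONTEXT index on the name column
CREATE INDEX name_txt_idx ON travel_websites (name)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- Search with CONTAINS instead of LIKE
SELECT description
  FROM travel_websites
 WHERE CONTAINS(name, 'boston') > 0;

-- CONTEXT indexes are not maintained transactionally; resynchronize after DML
EXEC CTX_DDL.SYNC_INDEX('name_txt_idx');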
Your index hint is not correctly specified. Assuming there is an index on name, that the name of that index is name_idx, and that you want to force a full scan of the index (just to reiterate, a range scan on the index is not a valid option if there is a leading wildcard), you would need something like
select /*+ index(travel_websites name_idx) */ description
from travel_websites
where name like '%boston%'
There is no guarantee, however, that a full index scan is going to be any more efficient than a full table scan. And it is entirely possible that the optimizer is choosing the index full scan already without the hint (you don't specify what the query plans are for the three queries).

Oracle (and as far as I know most other databases) by default indexes strings so that the index can only be used to look up string matches from the start of the string. That means a LIKE 'boston%' (starts-with) will be able to use the index, while a LIKE '%boston' (ends-with) or LIKE '%boston%' (contains) will not.
If you really need indexes that can find substrings fast, you can't use the regular index types for strings, but you can use TEXT indexes, which sadly may require slightly different query syntax.

Related

Faster search for a substring through a large document

I have a CSV file of more than 1M records written in English + another language. I have to make a UI that takes a keyword, searches through the document, and returns the records where that keyword appears. I look for the keyword in two columns only.
Here is how I implemented it:
First, I made a Postgres database for the data stored in the CSV file. Then I made a classic website where the user can enter a keyword. This is the SQL query that I use (in Spring Boot):
SELECT * FROM table WHERE col1 LIKE '%' || :keyword || '%' OR col2 LIKE '%' || :keyword || '%';
Right now it is working perfectly fine, but I was wondering how to make the search faster. Was using SQL instead of classic document search the better choice?
If the document is only searched once and thrown away, then it's overhead to load it into a database. Instead you can search the file directly using NIO's Files.lines() with a parallel stream, which uses multiple threads to search the file concurrently:
List<Record> result = Files.lines(Paths.get("some/path"))
        .parallel()                      // search the lines on multiple threads
        .unordered()                     // no need to preserve file order
        .map(l -> lineToRecord(l))
        .filter(r -> r.getCol1().contains(keyword) || r.getCol2().contains(keyword))
        .collect(Collectors.toList());
NOTE: you need to provide the lineToRecord() method and the Record class, and handle the IOException that Files.lines() can throw.
If the document is going to be searched over and over again, then you can think about indexing it. This means pre-processing the document to suit the search requirements; in this case, the keywords of col1 and col2. An index is like a map in Java, e.g.:
Map<String, Record> col1Index
But since you have LIKE semantics, this is not so easy to do: it's not as simple as splitting the string on whitespace, because the keyword could match a substring of a word. So in this case it might be best to look for a tool to help. Typically this would be something like Solr/Lucene.
Databases can also provide similar functionality, e.g. https://www.postgresql.org/docs/current/pgtrgm.html
For LIKE queries, you should look at the pg_trgm index type with the gin_trgm_ops operator class. You shouldn't need to change the query at all; just build the index on each column, or maybe one multi-column index.
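A minimal sketch of that setup, assuming the pg_trgm extension is available (mytable stands in for the question's table):
-- Enable trigram matching, then index both searched columns
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX col1_trgm_idx ON mytable USING gin (col1 gin_trgm_ops);
CREATE INDEX col2_trgm_idx ON mytable USING gin (col2 gin_trgm_ops);
The planner can then use these indexes for the existing LIKE '%keyword%' predicates without any change to the query itself.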

Elasticsearch query on string representation of number

Good day:
I have an indexed field called amount, which is of string type. The value of amount can be either "one" or "1". Say in this example we have amount=1 as an indexed document; if I try to search for "one", Elasticsearch will not return the document unless I use "1" in the search query. Thoughts on how I can get this to work? I'm thinking a tokenizer is what's needed.
Thanks.
You probably don't want this for sevenmillionfourhundredfifteenthousandtwohundredfourteen and the like, but only for a small number of values.
At index time I would convert everything to a proper number and store it in a numerical field, which then even allows sorting, if you need it. Apart from this, I would use synonyms at index and at query time and map everything to the digit strings, but in a general text field that is searched by default.

how to optimize query with contains clause

When I use the CONTAINS() function on the title field with a parameter of length 1 (for example ... where contains(title, '%x%') > 0), my query response time slows down.
As far as I know, it's because of the length of the parameter (i.e. length('x') = 1).
Is there any solution to optimize such query?
Are you using CONTAINS just to check whether title contains the letter x? That is a very inefficient way. Use this instead:
... where title like '%x%'
"like" simply returns yes or no, it doesn't calculate a score like contains does. And by not wrapping "title" within a function, the Oracle optimizer is free to use an index on "title" if you defined one. (If you didn't you may consider adding one.)
Good luck!
The problem is not the length of the search string; a search for %xxxxxxxxxxx% will take a similar time. The problem is the leading %, which disables the index range scan. To match %x%, the whole index must be scanned.
You should try to limit the search to x% only, i.e. without the leading %. In general this is impossible, as you would get only a subset of the results.
In some cases it is possible, and I'll illustrate it with a simple example.
Suppose you are searching for a prefix of a table name, and the table name can be both unqualified and qualified.
You must search for %x% to get a match on both of the strings:
XTABLE
OWNER.XTABLE2
What you can do is define the dot as whitespace, which will split the owner from the table name and index three tokens:
XTABLE
OWNER
XTABLE2
That will enable searching for x% only, which ends in an index range scan with better performance.
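One way to set this up with Oracle Text is a BASIC_LEXER whose whitespace attribute includes the dot; a hedged sketch with hypothetical object and preference names:
-- Treat '.' as whitespace so OWNER.XTABLE2 is indexed as the tokens OWNER and XTABLE2
BEGIN
  ctx_ddl.create_preference('dot_lexer', 'BASIC_LEXER');
  ctx_ddl.set_attribute('dot_lexer', 'whitespace', '.');
END;
/

CREATE INDEX obj_name_txt_idx ON objects (obj_name)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER dot_lexer');

-- A prefix search can now hit all three tokens via the index
SELECT obj_name FROM objects WHERE CONTAINS(obj_name, 'x%') > 0;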

Advantage Database: Full Text Search not returning results that start with the search string

I have a problem with a Full Text Search index. I have a table with a character field of size 30. I have created a full text search index on this field in order to have fast, case-insensitive search operations on it. Now, when I do a query like:
SELECT fieldname FROM tablename WHERE CONTAINS(fieldname, '123')
I get a few records that contain 123 in the specified fieldname. However, there are records in this table that start with 123, and these are not shown in the query result. In fact, it looks like the query only shows results that contain 123 after a preceding space character.
My full text search index looks like this:
CREATE INDEX idxname ON tablename (fieldname) CONTENT MIN WORD 1
I am using Advantage Database Server 9.1
Any ideas what could be wrong?
Thanks,
Jurgen
A search of the form CONTAINS(fieldname, '123') will find entries that have 123 as an individual "word". It won't find 1234 or 4123. You can use CONTAINS(fieldname, '123*') to find records that have "words" that begin with 123. And CONTAINS(fieldname, '*123*') will find records that contain words with 123 anywhere in them.
Note that a search of the form *123* is less efficient, because it must scan the entire index.
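Putting the three forms side by side, using the poster's table and field names:
-- matches '123' as a standalone word, but not '1234' or '4123'
SELECT fieldname FROM tablename WHERE CONTAINS(fieldname, '123');
-- also matches '1234' (words beginning with 123)
SELECT fieldname FROM tablename WHERE CONTAINS(fieldname, '123*');
-- also matches '4123' (words containing 123 anywhere)
SELECT fieldname FROM tablename WHERE CONTAINS(fieldname, '*123*');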

SQLite - how to return rows containing a text field that contains one or more strings?

I need to query a table in an SQLite database to return all the rows that match a given set of words.
To be more precise: I have a database with ~80,000 records in it. One of the fields is a text field with around 100-200 words per record. What I want to be able to do is take a list of 200 single-word keywords {"apple", "orange", "pear", ...} and retrieve the set of all records that contain at least one of the keyword terms in the description column.
The immediately obvious way to do this is with something like this:
SELECT stuff FROM table
WHERE (description LIKE '% apple %') OR (description LIKE '% orange %') OR ...
If I have 200 terms, I end up with a big and nasty-looking SQL statement that seems clumsy to me, smacks of bad practice, and, not surprisingly, takes a long time to process: more than a second per 1000 records.
This answer, Better performance for SQLite Select Statement, seemed close to what I need, and as a result I created an index, but according to http://www.sqlite.org/optoverview.html SQLite doesn't apply any optimisations if the LIKE operator is used with a leading % wildcard.
Not being an SQL expert, I am assuming I'm doing this the dumb way. I was wondering if someone with more experience could suggest a more sensible and perhaps more efficient way of doing this?
Alternatively, is there a better approach I could use to the problem?
Using the SQLite full-text search would be faster than a LIKE '%...%' query. I don't think there's any database that can use an index for a query beginning with %: if the database doesn't know what the string starts with, it can't use the index to look it up.
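A minimal sketch of the full-text approach, assuming a SQLite build with the FTS5 module enabled (older builds ship FTS3/FTS4 with similar syntax); the source table name is hypothetical:
-- Index the description text in a virtual table, keyed by the source rowid
CREATE VIRTUAL TABLE docs USING fts5(description);
INSERT INTO docs(rowid, description) SELECT rowid, description FROM original_table;

-- rowids of rows containing at least one of the keywords
SELECT rowid FROM docs WHERE docs MATCH 'apple OR orange OR pear';
The returned rowids can then be used to fetch the remaining columns from the source table.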
An alternative approach is to put the keywords in a separate table instead, and to make an intermediate table that records which row in your main table has which keywords. If you indexed all the relevant columns that way, it could be queried very quickly.
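A sketch of that layout (again with hypothetical table and column names):
-- One row per distinct keyword, plus a junction table linking keywords to records
CREATE TABLE keyword (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE record_keyword (
  record_id  INTEGER,
  keyword_id INTEGER,
  PRIMARY KEY (record_id, keyword_id)
);
CREATE INDEX record_keyword_kw ON record_keyword(keyword_id);

-- Records whose description contains at least one of the keywords
SELECT DISTINCT r.stuff
FROM record_keyword rk
JOIN keyword k ON k.id = rk.keyword_id
JOIN records r ON r.id = rk.record_id
WHERE k.word IN ('apple', 'orange', 'pear');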
Sounds like you might want to have a look at Full Text Search. It was contributed to SQLite by someone from Google. The description:
allows the user to efficiently query the database for all rows that contain one or more words (hereafter "tokens"), even if the table contains many large documents.
This is the same problem as full-text search, right? In which case, you need some help from the DB to construct indexes into these fields if you want to do this efficiently. A quick search for SQLite full text search yields this page.
The solution you correctly identify as clumsy is probably going to do up to 200 pattern matches per record in the worst case (i.e. when a record doesn't match), where each match has to traverse the entire field. Using the index approach means that your search speed becomes independent of the size of each document.
