CHARINDEX vs LIKE search gives very different performance, why? - performance

We use Entity Framework for DB access, and when we write what we think is a LIKE statement, it actually generates CHARINDEX-based SQL. So here are two simple queries, after I simplified them to prove the point, run on one particular server of ours:
-- Runs about 2 seconds
SELECT * FROM LOCAddress WHERE Address1 LIKE '%1124%'
-- Runs about 16 seconds
SELECT * FROM LOCAddress WHERE ( CAST(CHARINDEX(LOWER(N'1124'), LOWER([Address1])) AS int)) > 0
The table contains about 100k records right now. Address1 is a VarChar(100) column, nothing special.
Here is a snippet of the two plans side by side. It doesn't make any sense: the plans show 50% and 50%, but the actual execution times are more like 1:8.
I searched online, and the general advice is to use CHARINDEX instead of LIKE. In our experience it's the opposite. My question is: what is causing this, and how can we fix it without a code change?

I will answer my own question, since the correct answer was hard to find and I was pointed to the problem by the SQL Server 2012 execution plan output. As you can see in the original question, everything looks OK on the surface. This is SQL Server 2008.
When I ran the same query on 2012, I got a warning on the CHARINDEX query. The problem is that SQL Server had to do a type conversion: Address1 is VarChar, while the query contains N'1124', which is Unicode (NVarChar). If I change the query like so:
SELECT *
FROM LOCAddress
WHERE (CAST(CHARINDEX(LOWER('1124'), LOWER([Address1])) AS int)) > 0
It then runs the same as the LIKE query. So the implicit type conversion introduced by the Entity Framework query generator was causing this horrible performance hit.
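If changing the generated SQL or the application code is not an option, one possible workaround (a sketch only, assuming it is acceptable to widen the column) is to make the column's type match the Unicode literal that Entity Framework emits, so no conversion is needed at query time:
-- Assumption: widening Address1 to NVARCHAR(100) is acceptable here;
-- re-state NULL/NOT NULL and rebuild any affected indexes as appropriate.
ALTER TABLE LOCAddress ALTER COLUMN Address1 NVARCHAR(100);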

First, as you can see, both queries are equivalent and neither can use an index: CHARINDEX and LIKE perform the same with a leading wildcard, e.g. '%YourValue%'. However, their performance differs when you use a pattern like 'YourValue%'. There, the LIKE operator will likely perform faster than CHARINDEX, because it may allow a partial scan of the index.
Now, in your case, both queries are the same, but their performance differs for the following possible reason:
Statistics: SQL Server maintains statistics on substrings in string columns, which are used by the LIKE operator but are not fully usable for CHARINDEX. In that case, the LIKE operator will work faster than CHARINDEX.
You can force SQL Server to use an index for the CHARINDEX query with a proper table hint.
Ex: FROM LOCAddress WITH (INDEX (index_name))
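As a complete statement, that would look something like this (the index name is hypothetical; substitute whichever index you actually want the optimizer to use):
SELECT *
FROM LOCAddress WITH (INDEX (IX_LOCAddress_Address1))
WHERE CHARINDEX('1124', Address1) > 0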
Read more here; the section "String Summary Statistics" says:
SQL Server 2008 includes patented technology for estimating the selectivity of LIKE conditions. It builds a statistical summary of substring frequency distribution for character columns (a string summary). This includes columns of type text, ntext, char, varchar, and nvarchar. Using the string summary, SQL Server can accurately estimate the selectivity of LIKE conditions where the pattern may have any number of wildcards in any combination.

Related

Sqlite view vs plain select statement performance

I have a simple table (with about 8 columns and a LOT of rows) in a SQLite database. There is a single program that runs as a service and performs selects, updates and inserts on the table quite often (approximately every 5 minutes). The selects are used only to determine which rows are to be updated, and they are based on a column that holds boolean values (probably translated to integer internally by SQLite).
There is also a web application that performs selects (always with a GROUP BY clause) whenever a web user wishes to view part of the data.
There are two ways to ask for data through the web application: (a) predefined filters (i.e. the WHERE clause has specific conditions on 3 specific columns) and (b) custom filters (i.e. the user chooses the values for the conditions, but the columns participating in the WHERE clause are the same as in (a)). As mentioned, in both cases there is a GROUP BY operation.
I am wondering whether using a view or a custom function might increase the performance. Currently, a "custom" select may take more than 30 seconds to complete - and that's before any data has been sent back to the user.
EDIT:
Using EXPLAIN QUERY PLAN on a "predefined" select statement yields only one row:
0|0|TABLE mytable
Using EXPLAIN on the same query yields the following:
0|OpenVirtual|1|4|keyinfo(2,-BINARY,BINARY)
1|OpenVirtual|2|3|keyinfo(1,BINARY)
2|MemInt|0|5|
3|MemInt|0|4|
4|Goto|0|27|
5|MemInt|1|5|
6|Return|0|0|
7|IfMemPos|4|9|
8|Return|0|0|
9|AggFinal|0|0|count(0)
10|AggFinal|2|1|sum(1)
11|MemLoad|0|0|
12|MemLoad|1|0|
13|MemLoad|2|0|
14|MakeRecord|3|0|
15|MemLoad|0|0|
16|MemLoad|1|0|
17|Sequence|1|0|
18|Pull|3|0|
19|MakeRecord|4|0|
20|IdxInsert|1|0|
21|Return|0|0|
22|MemNull|1|0|
23|MemNull|3|0|
24|MemNull|0|0|
25|MemNull|2|0|
26|Return|0|0|
27|Gosub|0|22|
28|Goto|0|82|
29|Integer|0|0|
30|OpenRead|0|2|
31|SetNumColumns|0|9|
32|Rewind|0|48|
33|Column|0|8|
34|String8|0|0|123456789
35|Le|356|39|collseq(BINARY)
36|Column|0|3|
37|Integer|180|0|
38|Gt|100|42|collseq(BINARY)
39|Column|0|7|
40|Integer|1|0|
41|Ne|356|47|collseq(BINARY)
42|Column|0|6|
43|Sequence|2|0|
44|Column|0|3|
45|MakeRecord|3|0|
46|IdxInsert|2|0|
47|Next|0|33|
48|Close|0|0|
49|Sort|2|69|
50|Column|2|0|
51|MemStore|7|0|
52|MemLoad|6|0|
53|Eq|512|58|collseq(BINARY)
54|MemMove|6|7|
55|Gosub|0|7|
56|IfMemPos|5|69|
57|Gosub|0|22|
58|AggStep|0|0|count(0)
59|Column|2|2|
60|Integer|30|0|
61|Add|0|0|
62|ToReal|0|0|
63|AggStep|2|1|sum(1)
64|Column|2|0|
65|MemStore|1|1|
66|MemInt|1|4|
67|Next|2|50|
68|Gosub|0|7|
69|OpenPseudo|3|0|
70|SetNumColumns|3|3|
71|Sort|1|80|
72|Integer|1|0|
73|Column|1|3|
74|Insert|3|0|
75|Column|3|0|
76|Column|3|1|
77|Column|3|2|
78|Callback|3|0|
79|Next|1|72|
80|Close|3|0|
81|Halt|0|0|
82|Transaction|0|0|
83|VerifyCookie|0|1|
84|Goto|0|29|
85|Noop|0|0|
The select I used was the following:
SELECT
COUNT(*) as number,
field1,
SUM(CAST(filter2 +30 AS float)) as column2
FROM
mytable
WHERE
(filter1 > '123456789' AND filter2 > 180)
OR filter3=1
GROUP BY
field1
ORDER BY
number DESC, field1;
Whenever you're going to be doing comparisons on a non-primary-key field, it's a good design idea to add an index on the field(s). Too many indexes, however, can cause INSERTs to crawl, so plan accordingly.
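For example, a minimal sketch based on the column names in the query above (which columns are worth indexing, and whether composite indexes pay off, depends on your actual query mix):
-- Helps the (filter1 AND filter2) branch of the WHERE clause
CREATE INDEX idx_mytable_filter1_filter2 ON mytable (filter1, filter2);
-- Helps the OR filter3 = 1 branch
CREATE INDEX idx_mytable_filter3 ON mytable (filter3);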
Also, if you have simple fields such as ones that only hold a boolean value, you may want to consider declaring it as an INTEGER instead of whatever you declared it as. Declaring it as any type not specifically defined by SQLite will cause it to default to a NUMERIC type which will take longer to compare values because it will store it internally as a double and will use the floating-point math processor instead of the integer math processor.
IMO, the GROUP BY sorting directive is sometimes a dead giveaway to an unoptimized query; its methodology involves eliminating redundant data which could have been eliminated beforehand if it hadn't been pulled out of the database to begin with.
EDIT:
I saw your query, and there are some simple things you can do to optimize it (a sketch of both changes follows below):
SUM(CAST(filter2 + 30 AS float)) is inefficient; why are you casting to float? Why not just SUM the column and then add 30 * the COUNT?
filter1 > '123456789' - why the string comparison? Why not just use an integer comparison?
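A minimal sketch of the rewritten query under those assumptions (i.e. that filter2 is numeric and that filter1 holds values which can be compared as integers):
SELECT
  COUNT(*) AS number,
  field1,
  SUM(filter2) + 30 * COUNT(*) AS column2
FROM mytable
WHERE (filter1 > 123456789 AND filter2 > 180)
   OR filter3 = 1
GROUP BY field1
ORDER BY number DESC, field1;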

Full-text search in Postgres or CouchDB?

I took geonames.org and imported all their data of German cities with all districts.
If I enter "Hamburg", it lists "Hamburg Center, Hamburg Airport" and so on. The application is in a closed network with no access to the internet, so I can't access the geonames.org web services and have to import the data. :(
The city, together with all of its districts, works as an autocomplete, so each keystroke results in an XHR request and so on.
Now my customer has asked whether it is possible to have the data for the whole world in it: in total, about 5,000,000 rows with 45,000,000 alternative names, etc.
Postgres needs about 3 seconds per query, which makes the autocomplete unusable.
Now I'm thinking of CouchDB, which I have already worked with. My question:
I would like to post "Ham" and I want CouchDB to return all documents starting with "Ham". If I enter "Hamburg", I want it to return Hamburg and so forth.
Is CouchDB the right database for this? Which other DBs can you recommend that respond with low latency (perhaps in-memory) over millions of records? The dataset doesn't change regularly; it's rather static!
If I understand your problem right, probably all you need is already built into CouchDB.
To get a range of documents with names beginning with, e.g., "Ham", you can use a request with a key range: startkey="Ham"&endkey="Ham\ufff0"
If you need a more comprehensive search, you may create a view containing names of other places as keys. So you again can query ranges using the technique above.
Here is a view function to make this:
function(doc) {
  // Assumes doc.places is an object whose keys are place names:
  // each name becomes a view key pointing back to this document.
  for (var name in doc.places) {
    emit(name, doc._id);
  }
}
Also see the CouchOne blog post about CouchDB typeahead and autocomplete search and this discussion on the mailing list about CouchDB autocomplete.
Optimized search with PostgreSQL
Your search is anchored at the start and no fuzzy search logic is required. This is not the typical use case for full text search.
If it gets more fuzzy or your search is not anchored at the start, look here for more:
Similar UTF-8 strings for autocomplete field
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
In PostgreSQL you can make use of advanced index features that should make the query very fast. In particular look at operator classes and indexes on expressions.
1) text_pattern_ops
Assuming your column is of type text, you would use a special index for text pattern operators like this:
CREATE INDEX name_text_pattern_ops_idx
ON tbl (name text_pattern_ops);
SELECT name
FROM tbl
WHERE name ~~ ('Hambu' || '%');
This is assuming that you operate with a database locale other than C - most likely de_DE.UTF-8 in your case. You could also set up a database with locale 'C'. I quote the manual here:
If you do use the C locale, you do not need the xxx_pattern_ops operator classes, because an index with the default operator class is usable for pattern-matching queries in the C locale.
2) Index on expression
I'd imagine you would also want to make that search case-insensitive, so let's take another step and make it an index on an expression:
CREATE INDEX lower_name_text_pattern_ops_idx
ON tbl (lower(name) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(name) ~~ (lower('Hambu') || '%');
To make use of the index, the WHERE clause has to match the index expression.
3) Optimize index size and speed
Finally, you might also want to impose a limit on the number of leading characters to minimize the size of your index and speed things up even further:
CREATE INDEX lower_left_name_text_pattern_ops_idx
ON tbl (lower(left(name,10)) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%');
left() was introduced with Postgres 9.1. Use substring(name, 1,10) in older versions.
4) Cover all possible requests
What about strings with more than 10 characters?
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower(left('Hambu678910',10)) || '%')
AND lower(name) ~~ (lower('Hambu678910') || '%');
This looks redundant, but you need to spell it out this way to actually use the index. Index search will narrow it down to a few entries, the additional clause filters the rest. Experiment to find the sweet spot. Depends on data distribution and typical use cases. 10 characters seem like a good starting point. For more than 10 characters, left() effectively turns into a very fast and simple hashing algorithm that's good enough for many (but not all) use cases.
5) Optimize disc representation with CLUSTER
So, the predominant access pattern will be to retrieve a bunch of adjacent rows according to our index lower_left_name_text_pattern_ops_idx. And you mostly read and hardly ever write. This is a textbook case for CLUSTER. The manual:
When a table is clustered, it is physically reordered based on the index information.
With a huge table like yours, this can dramatically improve response time because all rows to be fetched are in the same or adjacent blocks on disk.
First call:
CLUSTER tbl USING lower_left_name_text_pattern_ops_idx;
Information which index to use will be saved and successive calls will re-cluster the table:
CLUSTER tbl;
CLUSTER; -- cluster all tables in the db that have previously been clustered.
If you don't want to repeat it:
ALTER TABLE tbl SET WITHOUT CLUSTER;
However, CLUSTER takes an exclusive lock on the table. If that's a problem, look into pg_repack or pg_squeeze, which can do the same without exclusive lock on the table.
6) Prevent too many rows in the result
Demand a minimum of, say, 3 or 4 characters for the search string. I add this for completeness, you probably do it anyway.
And LIMIT the number of rows returned:
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%')
LIMIT 501;
If your query returns more than 500 rows, tell the user to narrow down his search.
7) Optimize filter method (operators)
If you absolutely must squeeze out every last microsecond, you can utilize operators of the text_pattern_ops family. Like this:
SELECT name
FROM tbl
WHERE lower(left(name, 10)) ~>=~ lower('Hambu')
AND lower(left(name, 10)) ~<=~ (lower('Hambu') || chr(2097151));
You gain very little with this last stunt. Normally, standard operators are the better choice.
If you do all that, search time will be reduced to a matter of milliseconds.
I think a better approach is to keep your data in your database (Postgres or CouchDB) and index it with a full-text search engine, like Lucene, Solr or ElasticSearch.
Having said that, there's a project integrating CouchDB with Lucene.

How to inline a variable in PL/SQL?

The Situation
I have some trouble with my query execution plan for a medium-sized query over a large amount of data in Oracle 11.2.0.2.0. In order to speed things up, I introduced a range filter that does roughly something like this:
PROCEDURE DO_STUFF(
org_from VARCHAR2 := NULL,
org_to VARCHAR2 := NULL)
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((org_from IS NULL) OR (org_from <= org.no))
AND ((org_to IS NULL) OR (org_to >= org.no)))
-- [...]
As you can see, I want to restrict the JOIN of organisations using an optional range of organisation numbers. Client code can call DO_STUFF with (supposed to be fast) or without (very slow) the restriction.
The Trouble
The trouble is, PL/SQL will create bind variables for the above org_from and org_to parameters, which is what I would expect in most cases:
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((:B1 IS NULL) OR (:B1 <= org.no))
AND ((:B2 IS NULL) OR (:B2 >= org.no)))
-- [...]
The Workaround
Only in this case, I measured the query execution plan to be a lot better when I just inline the values, i.e. when the query executed by Oracle is actually something like
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((10 IS NULL) OR (10 <= org.no))
AND ((20 IS NULL) OR (20 >= org.no)))
-- [...]
By "a lot", I mean 5-10x faster. Note that the query is executed very rarely, i.e. once a month. So I don't need to cache the execution plan.
My questions
How can I inline values in PL/SQL? I know about EXECUTE IMMEDIATE, but I would prefer to have PL/SQL compile my query, and not do string concatenation.
Did I just measure something that happened by coincidence or can I assume that inlining variables is indeed better (in this case)? The reason why I ask is because I think that bind variables force Oracle to devise a general execution plan, whereas inlined values would allow for analysing very specific column and index statistics. So I can imagine that this is not just a coincidence.
Am I missing something? Maybe there is an entirely other way to achieve query execution plan improvement, other than variable inlining (note I have tried quite a few hints as well but I'm not an expert on that field)?
In one of your comments you said:
"Also I checked various bind values.
With bind variables I get some FULL
TABLE SCANS, whereas with hard-coded
values, the plan looks a lot better."
There are two paths. If you pass in NULL for the parameters then you are selecting all records. Under those circumstances a Full Table Scan is the most efficient way of retrieving data. If you pass in values then indexed reads may be more efficient, because you're only selecting a small subset of the information.
When you formulate the query using bind variables the optimizer has to take a decision: should it presume that most of the time you'll pass in values or that you'll pass in nulls? Difficult. So look at it another way: is it more inefficient to do a full table scan when you only need to select a sub-set of records, or to do indexed reads when you need to select all records?
It seems as though the optimizer has plumped for full table scans as being the least inefficient operation to cover all eventualities.
Whereas when you hard-code the values, the optimizer knows immediately that 10 IS NULL evaluates to FALSE, and so it can weigh the merits of using indexed reads to find the desired subset of records.
So, what to do? As you say this query is only run once a month, I think it would only require a small change to business processes to have separate queries: one for all organisations and one for a subset of organisations.
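A minimal sketch of that split, keeping the parameter names from the question; the customers table name and the select list are placeholders, and the sketch assumes the range is either fully supplied or fully omitted:
PROCEDURE DO_STUFF(
    org_from VARCHAR2 := NULL,
    org_to   VARCHAR2 := NULL)
IS
BEGIN
  IF org_from IS NULL AND org_to IS NULL THEN
    -- All organisations: a full table scan is the appropriate plan here.
    FOR rec IN (SELECT cust.id
                  FROM customers cust
                  JOIN organisations org ON cust.org_id = org.id)
    LOOP
      NULL; -- process rec
    END LOOP;
  ELSE
    -- Restricted range: a plain range predicate lets the optimizer use an index on org.no.
    FOR rec IN (SELECT cust.id
                  FROM customers cust
                  JOIN organisations org ON cust.org_id = org.id
                 WHERE org.no BETWEEN org_from AND org_to)
    LOOP
      NULL; -- process rec
    END LOOP;
  END IF;
END DO_STUFF;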
"Btw, removing the :R1 IS NULL clause
doesn't change the execution plan
much, which leaves me with the other
side of the OR condition, :R1 <=
org.no where NULL wouldn't make sense
anyway, as org.no is NOT NULL"
Okay, so the thing is you have a pair of bind variables which specify a range. Depending on the distribution of values, different ranges might suit different execution plans. That is, this range would (probably) suit an indexed range scan...
WHERE org.id BETWEEN 10 AND 11
...whereas this is likely to be more fitted to a full table scan...
WHERE org.id BETWEEN 10 AND 1199999
That is where Bind Variable Peeking comes into play (depending on the distribution of values, of course).
Since the query plans are actually consistently different, that implies that the optimizer's cardinality estimates are off for some reason. Can you confirm from the query plans that the optimizer expects the conditions to be insufficiently selective when bind variables are used? Since you're using 11.2, Oracle should be using adaptive cursor sharing, so it shouldn't be a bind variable peeking issue (assuming you are calling the version with bind variables many times with different NO values in your testing).
Are the cardinality estimates on the good plan actually correct? I know you said that the statistics on the NO column are accurate but I would be suspicious of a stray histogram that may not be updated by your regular statistics gathering process, for example.
You could always use a hint in the query to force a particular index to be used (though using a stored outline or optimizer plan stability would be preferable from a long-term maintenance perspective). Any of those options would be preferable to resorting to dynamic SQL.
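As a sketch of the hint option (the index name, the customers table name and the select list are hypothetical; use the actual index on org.no):
SELECT /*+ INDEX(org organisations_no_idx) */
       cust.*
  FROM customers cust
  JOIN organisations org
    ON cust.org_id = org.id
 WHERE org.no BETWEEN :org_from AND :org_to;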
One additional test to try, however, would be to replace the SQL 99 join syntax with Oracle's old syntax, i.e.
SELECT <<something>>
FROM <<some other table>> cust,
organization org
WHERE cust.org_id = org.id
AND ( ((org_from IS NULL) OR (org_from <= org.no))
AND ((org_to IS NULL) OR (org_to >= org.no)))
That obviously shouldn't change anything, but there have been parser issues with the SQL 99 syntax so that's something to check.
It smells like Bind Peeking, but I am only on Oracle 10, so I can't claim the same issue exists in 11.
This looks a lot like a need for Adaptive Cursor Sharing, combined with SQLPlan stability.
I think what is happening is that the optimizer_capture_sql_plan_baselines parameter is TRUE, and the same for optimizer_use_sql_plan_baselines. If this is true, the following is happening:
The first time a query is parsed, it gets a new plan.
The second time, this plan is stored in the sql_plan_baselines as an accepted plan.
All following runs of this query use this plan, regardless of what the bind variables are.
If Adaptive Cursor Sharing is already active, the optimizer will generate a new/better plan and store it in the SQL plan baselines, but it is not able to use it until someone accepts this newer plan as an acceptable alternative plan. Check dba_sql_plan_baselines and see if your query has entries with accepted = 'NO' and verified = null.
You can use dbms_spm.evolve_sql_plan_baseline to evolve the new plan and have it automatically accepted if the performance of the plan is at least 1.5 times better than without the new plan.
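A minimal sketch of such a call (the SQL handle is a made-up placeholder; take the real one from dba_sql_plan_baselines):
DECLARE
  l_report CLOB;
BEGIN
  -- Verifies the unaccepted plan and accepts it if it performs well enough.
  l_report := DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE(
                sql_handle => 'SQL_1234567890abcdef');
  DBMS_OUTPUT.PUT_LINE(l_report);
END;
/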
I hope this helps.
I added this as a comment, but will offer up here as well. Hope this isn't overly simplistic, and looking at the detailed responses I may be misunderstanding the exact problem, but anyway...
It seems your organisations table has a column no (org.no) that is defined as a NUMBER. In your hard-coded example, you use numbers to do the comparisons.
JOIN organisations org
ON (cust.org_id = org.id
AND ((10 IS NULL) OR (10 <= org.no))
AND ((20 IS NULL) OR (20 >= org.no)))
In your procedure, you are passing in varchar2:
PROCEDURE DO_STUFF(
org_from VARCHAR2 := NULL,
org_to VARCHAR2 := NULL)
So to compare VARCHAR2 to NUMBER, Oracle will have to do conversions, and this may cause the full scans.
Solution: change the proc to pass in numbers, as sketched below.
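A minimal sketch of that signature change (the body stays as in the original procedure):
PROCEDURE DO_STUFF(
    org_from NUMBER := NULL,
    org_to   NUMBER := NULL)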

Improve SQL Server 2005 Query Performance

I have a course search engine and when I try to do a search, it takes too long to show search results. You can try to do a search here
http://76.12.87.164/cpd/testperformance.cfm
At that page you can also see the database tables and indexes, if any.
I'm not using Stored Procedures - the queries are inline using Coldfusion.
I think I need to create some indexes but I'm not sure what kind (clustered, non-clustered) and on what columns.
Thanks
You need to create indexes on columns that appear in your WHERE clauses. There are a few exceptions to that rule:
If the column only has one or two unique values (the canonical example of this is "gender" - with only "Male" and "Female" the possible values, there is no point to an index here). Generally, you want an index that will be able to restrict the rows that need to be processed by a significant number (for example, an index that only reduces the search space by 50% is not worth it, but one that reduces it by 99% is).
If you are searching for x LIKE '%something', then there is no point in an index. If you think of an index as specifying a particular order for rows, then sorting by x when you're searching for '%something' is useless: you're going to have to scan all rows anyway.
So let's take a look at the case where you're searching for "keyword 'accounting'". According to your result page, the SQL that this generates is:
SELECT
*
FROM (
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY sq.name) AS Row,
sq.*
FROM (
SELECT
c.*,
p.providername,
p.school,
p.website,
p.type
FROM
cpd_COURSES c, cpd_PROVIDERS p
WHERE
c.providerid = p.providerid AND
c.activatedYN = 'Y' AND
(
c.name like '%accounting%' OR
c.title like '%accounting%' OR
c.keywords like '%accounting%'
)
) sq
) AS temp
WHERE
Row >= 1 AND Row <= 10
In this case, I will assume that cpd_COURSES.providerid is a foreign key to cpd_PROVIDERS.providerid in which case you don't need an index, because it'll already have one.
Additionally, the activatedYN column is a T/F column and (according to my rule above about restricting the possible values by only 50%) a T/F column should not be indexed, either.
Finally, because you're searching with an x LIKE '%accounting%' query, you don't need an index on name, title or keywords either - because it would never be used.
So the main thing you need to do in this case is make sure that cpd_COURSES.providerid actually is a foreign key for cpd_PROVIDERS.providerid.
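If that relationship is not declared yet, a sketch of adding it (the constraint name is illustrative, and this assumes cpd_PROVIDERS.providerid is that table's primary key):
ALTER TABLE cpd_COURSES
  ADD CONSTRAINT FK_cpd_COURSES_cpd_PROVIDERS
  FOREIGN KEY (providerid) REFERENCES cpd_PROVIDERS (providerid);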
SQL Server Specific
Because you're using SQL Server, Management Studio has a number of tools to help you decide where you need to put indexes. If you use the Index Tuning Wizard, it is actually usually pretty good at telling you what will give you good performance improvements. You just cut'n'paste your query into it, and it'll come back with recommendations for indexes to add.
You still need to be a little bit careful with the indexes that you add, because the more indexes you have, the slower INSERTs and UPDATEs will be. So sometimes you'll need to consolidate indexes, or just ignore them altogether if they don't give enough of a performance benefit. Some judgement is required.
Is this the real live database data? 52,000 records is a very small table, relatively speaking, for what SQL 2005 can deal with.
I wonder how much RAM is allocated to the SQL server, or what sort of disk the database is on. An IDE or even SATA hard disk can't give the same performance as a 15K RPM SAS disk, and it would be nice if there was sufficient RAM to cache the bulk of the frequently accessed data.
Having said all that, I feel the " (c.name like '%accounting%' OR c.title like '%accounting%' OR c.keywords like '%accounting%') " clause is problematic.
Could you create a separate Course_Keywords table, with two columns "courseid" and "keyword" (varchar(24) should be sufficient for the longest keyword?), with a composite clustered index on courseid + keyword?
Then, to make the UI even more friendly, use AJAX to apply keyword validation & auto-completion when people type words into the keywords input field. This gives you the behind-the-scenes benefit of having an exact keyword to search for, removing the need for pattern-matching with the LIKE operator...
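A sketch of that keyword table (names and sizes follow the suggestion above; the courseid type is assumed to be an integer key):
CREATE TABLE Course_Keywords (
    courseid INT         NOT NULL,
    keyword  VARCHAR(24) NOT NULL
);
CREATE CLUSTERED INDEX IX_Course_Keywords
    ON Course_Keywords (courseid, keyword);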
Using CF9? Try using Solr full text search instead of %xxx%?
You'll want to create indexes on the fields you search by. An index is a secondary list of your records presorted by the indexed fields.
Think of an old-fashioned printed yellow pages: if you want to look up a person by their last name, the phone book is already sorted that way, so Last Name is the clustered index field. If you wanted to find phone numbers for people named Jennifer, or the person with the phone number 867-5309, you'd have to search through every entry and it would take a long time. If there were an index in the back with all the phone numbers or first names listed in order, along with the page of the phone book where each person is listed, it would be a lot faster. These would be the nonclustered indexes.
I would try changing your IN statements to an EXISTS query to see if you get better performance on the zip code lookup. My experience is that IN statements work great for small lists, but the larger they get, the better performance you get out of EXISTS, as the query engine will stop searching for a specific value at the first instance it runs into.
<CFIF zipcodes is not "">
EXISTS (
SELECT zipcode
FROM cpd_CODES_ZIPCODES
WHERE zipcode = p.zipcode
AND 3963 * (ACOS((SIN(#getzipcodeinfo.latitude#/57.2958) * SIN(latitude/57.2958)) +
(COS(#getzipcodeinfo.latitude#/57.2958) * COS(latitude/57.2958) *
COS(longitude/57.2958 - #getzipcodeinfo.longitude#/57.2958)))) <= #radius#
)
</CFIF>

What is the best way to integrate Solr as an index with Oracle as a storage DB?

I have an Oracle database with all the "data", and a Solr index where all this data is indexed. Ideally, I want to be able to run queries like this:
select * from data_table where id in ([solr query results for 'search string']);
However, one key issue arises:
Oracle WILL NOT allow more than 1000 items in the array of items in the "in" clause (BIG DEAL, as the list of objects I find is very often > 1000 and will usually be around the 50-200k items)
I have tried to work around this using a "split" function that takes a string of comma-separated values and breaks them down into array items, but then I hit the 4000-character limit on the function parameter when using SQL (PL/SQL allows 32k characters, but that is still WAY too limiting for 80,000+ results in some cases).
I am also hitting performance issues using WHERE IN (....); I am told that this causes a very slow query, even when the field referenced is an indexed field?
I've tried making recursive "OR"s for the 1000-item limit (aka: id in (1...1000 or (id in (1001....2000) or id in (2001....3000))) - and this works, but is very slow.
I am thinking that I should load the Solr Client JARs into Oracle, and write an Oracle Function in Java that will call solr and pipeline back the results as a list, so that I can do something like:
select * from data_table where id in (select * from table(runSolrQuery('my query text')));
This is proving quite hard, and I am not sure it's even possible.
Things that I can't do:
Store the full data in Solr (security + storage limits)
Use Solr as the controller of pagination and ordering (this is why I am fetching data from the DB)
So I have to cook up a hybrid approach where Solr really acts as the full-text search provider for Oracle. Help! Has anyone faced this?
Check this out:
http://demo.scotas.com/search-sqlconsole.php
This product seems to do exactly what you need.
cheers
I'm not a Solr expert, but I assume that you can get the Solr query results into a Java collection. Once you have that, you should be able to use that collection with JDBC. That avoids the limit of 1000 literal items because your IN list would be the result of a query, not a list of literal values.
Dominic Brooks has an example of using object collections with JDBC. You would do something like
Create a couple of types in Oracle
CREATE TYPE data_table_id_typ AS OBJECT (
id NUMBER
);
CREATE TYPE data_table_id_arr AS TABLE OF data_table_id_typ;
In Java, you can then create an appropriate STRUCT array, populate this array from Solr, and then bind it to the SQL statement
SELECT *
FROM data_table
WHERE id IN (SELECT * FROM TABLE( CAST (? AS data_table_id_arr)))
Instead of using a long BooleanQuery, you can use TermsFilter (it works like RangeFilter, but the items don't have to be in sequence).
Like this (first fill your TermsFilter with terms):
TermsFilter termsFilter = new TermsFilter();
// Loop through terms and add them to filter
Term term = new Term("<field-name>", "<query>");
termsFilter.addTerm(term);
then search the index like this:
DocList parentsList = null;
parentsList = searcher.getDocList(new MatchAllDocsQuery(), searcher.convertFilter(termsFilter), null, 0, 1000);
Where searcher is SolrIndexSearcher (see java doc for more info on getDocList method):
http://lucene.apache.org/solr/api/org/apache/solr/search/SolrIndexSearcher.html
Two solutions come to mind.
First, look into using Oracle-specific Java extensions to JDBC. They allow you to pass in an actual array/list as an argument. You may need to create a stored proc (it has been a while since I had to do this), but if this is a focused use case, it shouldn't be overly burdensome.
Second, if you are still running into a boundary like the 1000-object limit, consider using the "rows" setting when querying Solr and leveraging its inherent pagination feature.
I've used this bulk fetching method with stored procs to fetch large quantities of data which needed to be put into Solr. Involve your DBA. If you have a good one, and use the Oracle specific extensions, I think you should attain very reasonable performance.
