I've got a 3GB SQLite database file with a single table with 40 million rows and 14 fields (mostly integers and very short strings and one longer string), no indexes or keys or other constraints -- so really nothing fancy. I want to check if there are entries where a specific integer field has a specific value. So of course I'm using
SELECT EXISTS(SELECT 1 FROM FooTable WHERE barField=?)
I haven't got much experience with SQLite and databases in general and on my first test query, I was shocked that this simple query took about 30 seconds. Subsequent tests showed that it is much faster if a matching row occurs at the beginning, which of course makes sense.
Now I'm thinking of doing an initial SELECT DISTINCT barField FROM FooTable at application startup and caching the results in software. But I'm sure there must be a cleaner SQLite way to do this. I mean, that should be part of a DBMS's job, right?
But so far, I've only created primary keys for speeding up queries, which doesn't work here because the field values are non-unique. So how can I speed up this query so that it runs in constant time? (It doesn't have to be lightning fast; I'd be completely fine if it took under one second.)
Thanks in advance for answering!
P.S. Oh, and there will be about 500K new rows every month for an indefinite period of time, and it would be great if that doesn't significantly increase query time.
Adding an index on barField should speed up the subquery inside the EXISTS clause:
CREATE INDEX barIdx ON FooTable (barField);
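To satisfy the query, SQLite only has to seek into the index once to detect whether at least one matching value exists. The lookup is logarithmic in the table size rather than strictly constant, so adding rows over time should barely affect it.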
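One way to confirm that the index is actually used is to ask SQLite for its query plan. The following is a minimal sketch using Python's sqlite3 module against a small in-memory stand-in for FooTable; the second column, the row count, and the looked-up values are made up for illustration.

```python
import sqlite3

# Small in-memory stand-in for the real table (sample data is illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FooTable (barField INTEGER, other TEXT)")
conn.executemany(
    "INSERT INTO FooTable VALUES (?, ?)",
    ((i % 1000, "x") for i in range(10000)),
)
conn.execute("CREATE INDEX barIdx ON FooTable (barField)")

query = "SELECT EXISTS(SELECT 1 FROM FooTable WHERE barField=?)"

# The last column of each EXPLAIN QUERY PLAN row is a readable description.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,))]
uses_index = any("barIdx" in step for step in plan)

found = conn.execute(query, (42,)).fetchone()[0]      # value is present
missing = conn.execute(query, (5000,)).fetchone()[0]  # value is absent
```

If a plan step mentions barIdx, the subquery is an index search rather than a full table scan, which is what removes the dependence on where the matching row happens to sit in the file.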
For querying an SQLite table based on a list of IDs (i.e. distinct primary keys) I am using the following statement (example based on the Chinook Database):
SELECT * FROM Customer WHERE CustomerId IN (1,2,3,8,20,35)
However, my actual list of IDs might become rather large (>1000). Thus, I was wondering if this approach using the IN statement is the most efficient or if there is a better/optimized way to query an sqlite table based on a list of primary keys.
If the number of elements in the IN is large enough, SQLite constructs a temporary index for them. This is likely to be more efficient than creating a temporary table manually.
The length of the IN list is limited only by the maximum length of an SQL statement, and by memory.
Because the statement you wrote does not include any instructions to SQLite about how to find the rows you want, the concept of "optimizing" doesn't really exist -- there's nothing to optimize. The job of planning the best algorithm to retrieve the data belongs to the SQLite query optimizer.
Some databases do have idiosyncrasies in their query optimizers which can lead to performance issues but I wouldn't expect SQLite to have any trouble finding the correct algorithm for this simple query, even with lots of values in the IN list. I would only worry about trying to guide the query optimizer to another execution plan if and when you find that there's a performance problem.
SQLite Optimizer Overview
IN (expression-list) does use an index if available.
Beyond that, I can't glean any guarantees from it, so the following is subject to a performance measurement.
Axis 1: how to pass the expression-list
Hardcode as string. Overhead for int-to-string conversion and string-to-int parsing.
Bind parameters (i.e. the statement is ... WHERE CustomerID in (?,?,?,?,?,?,?,?,?,?....), which is easier to build from a predefined string than hardcoded values). Prevents int → string → int conversion, but the default limit for the number of parameters is 999 (32766 since SQLite 3.32.0). The ceiling is set at compile time by SQLITE_MAX_VARIABLE_NUMBER; at run time, sqlite3_limit() with SQLITE_LIMIT_VARIABLE_NUMBER can only lower it. A very high limit might lead to excessive allocations.
Temporary table. Possibly less efficient than either of the above methods once the statement is prepared, but that advantage doesn't help if most of the time is spent preparing the statement.
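As a rough sketch of the bind-parameter option, the ID list can be split into chunks so that no single statement exceeds the parameter limit. The helper name fetch_by_ids, the chunk size of 900, and the simplified two-column Customer table are assumptions for illustration, not part of the Chinook schema.

```python
import sqlite3

def fetch_by_ids(conn, ids, chunk_size=900):
    """Fetch Customer rows, binding at most chunk_size parameters per statement."""
    ids = list(ids)
    rows = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # One "?" per ID in this chunk, e.g. "?,?,?".
        placeholders = ",".join("?" * len(chunk))
        sql = (
            "SELECT CustomerId, Name FROM Customer "
            "WHERE CustomerId IN (%s)" % placeholders
        )
        rows.extend(conn.execute(sql, chunk).fetchall())
    return rows

# Simplified stand-in table (hypothetical columns and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)",
    ((i, "c%d" % i) for i in range(1, 5001)),
)
result = fetch_by_ids(conn, range(1, 2501))
```

Note that an ID repeated in different chunks would be returned once per chunk, so the input list should be de-duplicated first if that matters.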
Axis 2: Statement optimization
If the same expression-list is used in multiple queries against changing CustomerIDs, one of the following may help:
reusing a prepared statement with hardcoded values (i.e. don't pass 1001 parameters)
create a temporary table for the CustomerIDs with index (so the index is created once, not on the fly for every query)
If the expression-list is different with every query, it is probably best to let SQLite do its job. The following might be an improvement:
create a temp table for the expression-list
bulk-insert expression-list elements using union all
use a sub query
(from my experience with SQLite, I'd expect it to be on par or slightly worse)
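The temp-table variant above could look roughly like the following sketch. It uses executemany for the bulk insert instead of a UNION ALL statement, and the table name WantedIds and the simplified Customer table are hypothetical. Whether it beats a plain IN list would still need measuring.

```python
import sqlite3

# Simplified stand-in table (hypothetical columns and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)",
    ((i, "c%d" % i) for i in range(1, 5001)),
)

wanted_ids = [1, 2, 3, 8, 20, 35]

# INTEGER PRIMARY KEY makes the temp table's key its rowid, so the table
# is stored in key order and needs no separate index.
conn.execute("CREATE TEMP TABLE WantedIds (CustomerId INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO WantedIds VALUES (?)", ((i,) for i in wanted_ids))

# The IN list is now a subquery against the temp table.
rows = conn.execute(
    "SELECT CustomerId, Name FROM Customer "
    "WHERE CustomerId IN (SELECT CustomerId FROM WantedIds) "
    "ORDER BY CustomerId"
).fetchall()
```

The temp table lives until the connection closes, so the same ID set can be reused by several queries without rebuilding it.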
Axis 3: Ask Richard
The sqlite mailing list (yeah I know, that technology is even older than rotary phones!) is pretty active, with often excellent advice, including from the author of SQLite. 90% chance someone will dismiss you as "Measure before asking such a question!", 10% chance someone gives you detailed insight.
I noticed when looking at dba_tab_col_statistics for my table (m_CURRENT) that the num_buckets value for column TNE is 75 on one database and 254 on another. DB is Oracle 10g.
This seems to be the main difference between the two tables. Is there a way to get both databases to match num_bucket values?
I have a delete statement that is fast on one database and very slow on another. I know there are several reasons a query plan can differ on two databases. After a lot of analysis, I suspect that getting the slow query database to have the same num_bucket setting could ensure my delete statement does a range_scan (fast) and not a fast_full_scan (slow in this case) on index TNE_idx.
How do you gather stats on both databases? Do you have a regular stats-gathering script? As a one-off, you can do this to gather histograms on that one column alone:
begin
dbms_stats.gather_table_stats(user,'M_CURRENT',
method_opt=>'for columns TNE size 254', cascade=>false,
granularity=>'ALL', degree=>8);
end;
/
The size parameter sets the number of buckets (if there are fewer distinct values than that, the result will be lower).
The above doesn't specify estimate_pct, so it will sample a smaller population of values rather than 100%. If you want 100 percent, then specify this in the estimate_pct parameter*, but if you have a regular script, this may get overwritten later on.
* you can check the current sample size by comparing sample_size on the dba_tab_col_statistics to the num_rows on dba_tables
I want to follow up, DazzaL's answer was very informative, but I was wrong, the buckets did not make a difference. I could have saved a lot of time by running
explain plan for delete blah blah blah
then
select * from table(dbms_xplan.display)
and focusing on the right step id.
I zeroed in on step id 5, where an index fast full scan was being used, when I should have started on step 2, where the two database plans were different. I saw the fast database doing a nested loop anti join and the slow database doing a merge join anti. Once I added the (undocumented?) hint
NL_AJ
with no parameters in the subquery, the desired fast index_range_scan was used.
As I write this I wonder if the bucket number matching was important, but I don't have the time to retest. It ain't broke, so I will not touch it!
I have a quite complex multi-join TSQL SELECT query that runs for about 8 seconds and returns about 300K records, which is currently acceptable. But I need to reuse the results of that query several times later, so I am inserting the results of the query into a temp table. The table is created in advance with columns that match the output of the SELECT query. But as soon as I do INSERT INTO ... SELECT, execution time more than doubles to over 20 seconds! The execution plan shows that 46% of the query cost goes to "Table Insert" and 38% to Table Spool (Eager Spool).
Any idea why this is happening and how to speed it up?
Thanks!
The "Why" of it is hard to say; we'd need a lot more information (though my SWAG would be that it has to do with logging...).
However, the solution, 9 times out of 10 is to use SELECT INTO to make your temp table.
I would start by looking at standard tuning items. Is the disk performing? Are there sufficient resources (IOs, RAM, CPU, etc.)? Is there a bottleneck in the RDBMS? It doesn't sound like the issue, but what is happening with locking? Does other code give similar results? Is other code performant?
A few things I can suggest based on the information you have provided. If you don't care about dirty reads, you could always change the transaction isolation level (if you're using MS T-SQL)
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
select ...
This may speed things up on your initial query, as locks will not need to be taken on the data you are querying. If you're not using SQL Server, do a Google search for how to do the same thing with the technology you are using.
For the insert portion, you said you are inserting into a temp table. Does your database support adding primary keys or indexes on your temp table? If it does, have a dummy column in there that is an indexed column. Also, have you tried to use a regular database table with this? Depending on your set up, it is possible that using that will speed up your insert times.
Because I am not familiar with ADO under the hood, I was wondering which of the two methods of finding a record generally yields quicker results using VB6.
Use a 'select' statement using 'where' as a qualifier. If the recordset count yields zero, the record was not found.
Select all records iterating through records with a client-side cursor until record is found, or not at all.
The recordset is in the range of 10,000 records and will grow. Also, I am open to anything that will yield shorter search times other than what was mentioned.
SELECT count(*) FROM foo WHERE some_column='some value'
If the result is greater than 0 the record satisfying your condition was found in the database. It is unlikely you would get any faster than this. Proper indexes on the columns you are using in the WHERE clause could considerably improve performance.
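To make the two approaches from the question concrete, here is a small sketch in Python with sqlite3 rather than VB6 and ADO; the function names, index name, and sample data are illustrative assumptions. Both functions give the same answer, but only the first lets the database (and its index) do the searching.

```python
import sqlite3

def exists_with_where(conn, value):
    # Method 1: the database filters; count(*) always returns exactly one row.
    sql = "SELECT count(*) FROM foo WHERE some_column = ?"
    return conn.execute(sql, (value,)).fetchone()[0] > 0

def exists_by_iterating(conn, value):
    # Method 2: pull every row to the client and compare there.
    for (candidate,) in conn.execute("SELECT some_column FROM foo"):
        if candidate == value:
            return True
    return False

# Illustrative data set of 10,000 rows with an index on the searched column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (some_column TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?)",
    (("value %d" % i,) for i in range(10000)),
)
conn.execute("CREATE INDEX foo_some_column ON foo (some_column)")
```

With the index in place, the first function touches only the matching index entries, while the second has to read rows until it finds a match or reaches the end of the table.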
In every case I can think of, selecting using the where clause is faster.
Even in situations where the client code will iterate through the whole database (file-based databases like Access, for example), you will have optimized code written in c or c++ doing the selection (in the database driver.) This is always faster than VB6.
For database engines (SQL Server, MySQL, etc.), the performance increase can be even more pronounced. By using the where clause, you limit the amount of data that must be transmitted over the network, vastly improving the response.
Some additional performance tips:
Select only the fields you want.
Build indexes on frequently used fields
Watch what kind of recordset you are returning. Use Forward-only cursors if you are just returning data from a database.
Lastly, I was shocked by VB.NET's database performance, it being several times faster than the fastest VB6 code.
We have a large table, with several indices (say, I1-I5).
The usage pattern is as follows:
Application A: all select queries 100% use indices I1-I4 (assume that they are designed well enough that they will never use I5).
Application B: has only one select query (fairly frequently run), which contains 6 fields and for which a fifth index I5 was created as a covered index.
The first 2 fields of the covered index are date, and a security ID.
The table contains rows for ~100 dates (in date order, enforced by a clustered index I1), and tens of thousands of security identifiers.
Question: does the order of columns in the covered index affect the performance of the select query in Application B?
I.e., would the query performance change if we switched around the first two fields of the index (date and security ID)?
Would the query performance change if we switch around one of the last fields?
I am assuming that the logical IOs would remain unaffected by any order of fields in the covered index (though I'm not 100% sure).
But will there be other performance effects? (Optimizer speed, caching, etc...)
The question is version-generic, but if it matters, we use Sybase 12.
Unfortunately, the table is so huge that actually changing the index in practice and quantitatively confirming the effects of the change is extremely difficult.
It depends. If you have a WHERE clause such as the following, you will get better performance out of an index on (security_ID, date_column) than the converse:
WHERE date_column BETWEEN DATE '2009-01-01' AND DATE '2009-08-31'
AND security_ID = 373239
If you have a WHERE clause such as the following, you will get better performance out of an index on (date_column, security_ID) than the converse:
WHERE date_column = DATE '2009-09-01'
AND security_ID > 499231
If you have a WHERE clause such as the following, it really won't matter very much which column appears first:
WHERE date_column = DATE '2009-09-13'
AND security_ID = 211930
We'd need to know about the selectivity and conditions on the other columns in the index to know if there are other ways of organizing your index to gain more performance.
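The first case (equality on security_ID, range on date_column) can be illustrated with SQLite's query planner, used here only as a convenient stand-in: Sybase's optimizer may behave differently, and the table name prices, the price column, and the helper plan_for are assumptions. The sketch builds the same table twice with the two column orders and compares the plans.

```python
import sqlite3

def plan_for(index_columns):
    """Return SQLite's plan text for the range-on-date, equality-on-ID query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE prices (date_column TEXT, security_ID INTEGER, price REAL)"
    )
    conn.execute("CREATE INDEX idx ON prices (%s)" % index_columns)
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT price FROM prices "
        "WHERE date_column BETWEEN ? AND ? AND security_ID = ?",
        ("2009-01-01", "2009-08-31", 373239),
    ).fetchall()
    conn.close()
    # The last column of each plan row is the readable description.
    return " ".join(row[-1] for row in rows)

id_first = plan_for("security_ID, date_column")
date_first = plan_for("date_column, security_ID")
```

With the ID first, the plan text lists the security_ID equality among the index constraints, so the search can be narrowed on both columns. With the date first, only the date range can narrow the index search, and the ID has to be checked entry by entry within that range.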
Just like your question is version generic, my answer is DBMS-generic.
"Unfortunately, the table is so huge that actually changing the index in practice and quantitatively confirming the effects of the change is extremely difficult."
The problem is not the size of the table. Millions of rows is nothing for Sybase.
The problem is an absence of a test system.