I am testing a PostgreSQL extension named TimescaleDB for time series data.
If I read the PostgreSQL documentation right, a query such as
WHERE x = 'somestring' and timestamp between 't1' and 't2'
will work best with an index on (x, timestamp), and running EXPLAIN on that query confirms that the index is used.
When I try the same query on a TimescaleDB hypertable that contains the same data but has no (x, timestamp) index, the performance is about the same (if not better). After creating the (x, timestamp) index, the performance does not improve.
I understand that the hypertable has a built-in timestamp index, so I should use a different indexing strategy for it, for example an index on just (x). Is that right?
A few things about how TimescaleDB handles queries:
The primary way that time-based queries get improved performance is through chunk exclusion. Data is partitioned by time into chunks, so that when a query for a particular time range is executed, the planner can ignore chunks whose data falls outside that range. Indexes are then applied only to the chunks that are actually being searched.

If you are searching a time range that includes all chunks, chunk exclusion does not apply, and so you get query times closer to standard PostgreSQL.

If your query matches a large number of the rows in the chunks being scanned, the query planner may choose a sequential scan instead of an index scan to save on I/O operations (see https://github.com/timescale/timescaledb/issues/317).
There is nothing inherently special about the built-in indexes: you can drop them after hypertable creation, or turn them off when running create_hypertable (see the TimescaleDB API docs).
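For example, a minimal sketch of that approach, assuming a hypothetical hypertable named conditions with columns x and timestamp (both names are placeholders, not from the question):

-- skip the default time index when creating the hypertable
select create_hypertable('conditions', 'timestamp', create_default_indexes => false);
-- add a composite index instead; every chunk gets its own copy of it
create index on conditions (x, timestamp desc);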
I have a high-level question. Say I have a SQL query that takes 30 ms to complete, running against an indexed column on a table with 1 million records. If the table grows to 5 million records, should I expect the query to take five times as long (since five times as much index data has to be searched), i.e. 150 ms? I apologise if this is too simplistic. I have a program that runs 10 (indexed) SQL queries against a table that is going to grow by this factor; the queries currently take 300 ms and I am concerned this would increase to 1.5 s. Any help would be appreciated!
You can think of an index lookup as a search through a binary tree followed by a fetch of the page with the appropriate data. Typically, the index fits in memory and the search through it is quite fast. Multiplying the data size by 10 would increase the depth of the tree by 3 or 4 levels. With in-memory comparison operations this would not be noticeable for most queries. (There are other types of indexes besides tree-based ones, but this is a convenient model for thinking about performance.)
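To put rough numbers on the question above: a balanced binary tree over 1 million keys is about log2(1,000,000) ≈ 20 levels deep, while 5 million keys give about log2(5,000,000) ≈ 22 levels, so growing the table by a factor of five adds only two or three extra in-memory comparisons per lookup rather than multiplying the search time by five.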
The data fetch then could incur the overhead of reading a page from disk. That should still be quite fast.
So, the easy answer to your question is: no. However, this assumes that the query is something like:
select t.*
from table t
where t.indexcol = CONSTANTVALUE
And, it assumes that the query only returns one row. Things that might affect the performance as the table size increases include:
The size of the returned data set increases with the size of the table. Returning more values necessarily takes longer. For some queries, the performance is more dependent on the mechanism for returning values than calculating/fetching the data.
The query contains a join or a group by.
The statistics on the table are out of date, so the optimizer accidentally chooses a full table scan rather than an index lookup.
You are in a memory-constrained environment where the index doesn't fit in memory. Or, the entire table fits in memory when it is smaller but incurs the overhead of cache misses as it gets larger.
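If you want to confirm which plan the optimizer actually picks as the table grows, a minimal check (PostgreSQL syntax here; mytable and indexcol are placeholder names, and other databases have equivalent commands):

-- refresh planner statistics so the optimizer sees current row counts
analyze mytable;
-- show whether the lookup uses the index or falls back to a sequential scan
explain analyze
select t.*
from mytable t
where t.indexcol = 'some-constant-value';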
A brand new application with Oracle as the data store is going to be pushed to production. The database uses the CBO, and I have identified some columns to index. I expect the total number of records in a particular table to reach 4 million after 6 months. After that, very few records will be added and there will not be any updates to the indexed columns; most updates will be on non-indexed columns.
Is it advisable to create the indexes now, or should I wait for a couple of months?
If the table requires indexes but doesn't have them, you will incur a lot of poor performance (full table scans plus actual I/O) once the number of rows goes beyond what might reasonably be kept in the cache. Assume that is 20,000 rows; we'll call it the magic number. You will hit 20,000 rows within a week of production. After that, queries and updates on the table will grow progressively slower, on average, as more rows are added.
You are probably worried about the overhead of inserting new rows into indexed columns. That is a one-time hit per row. You are trading that against dozens of queries and updates when you delay adding indexes.
The trade-off is largely in favor of adding the indexes right now, especially since we do not know what that magic number (20,000?) really is. It could be larger, or smaller.
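Concretely, that just means creating the indexes before go-live; a minimal sketch, assuming a hypothetical table orders and lookup column customer_id (both names are placeholders):

-- create the index up front; each new row maintains it automatically
create index orders_customer_id_idx on orders (customer_id);
-- gather statistics so the CBO knows the index is worth using (SQL*Plus syntax)
exec dbms_stats.gather_table_stats(user, 'ORDERS');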
I have a table in my Oracle DB with 100 columns. 50 of the columns are not used by the program accessing this table (i.e. the SELECT queries only select the relevant columns and do not use '*').
My question is this:
If I recreate the same table with only the columns I need, will query performance improve for the same queries I run against the original table (remember that only the relevant columns are selected)?
It is well worth mentioning that the program runs these queries a considerable number of times per second!
P.S.:
This is an existing project I am working on, and the table design was made a long time ago for other products as well (that's why we have unused columns now).
So the effect of this will be that the average row will be smaller, provided the extra columns hold data that will no longer be in the table. The table can therefore be smaller, and not only will it use less space on disk, it will also use less memory in the SGA, and caching will be more efficient.
Therefore, if you access the table via a full table scan then it will be faster to read the segment, but if you use index-based access mechanisms then the only performance improvement is likely to be through an improved chance of fetching the block from cache.
[Edited]
This SO thread suggests that "it always pulls a tuple...". Hence, you are likely to see some performance improvement, though whether it is major or minor is hard to say, as already mentioned.
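If you want to estimate how much difference dropping the columns would make, you could compare the row-size statistics before and after; a small sketch (assuming the table is called MY_TABLE and statistics have been gathered):

-- average row length in bytes and allocated blocks, from the gathered statistics
select avg_row_len, blocks
from user_tables
where table_name = 'MY_TABLE';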
We have about 10K rows in a table. We want a form with a select drop-down that contains the distinct values of a given column in this table. We have an index on the column in question.
To increase performance I created a little cache table that contains the distinct values, so we didn't need to run a select distinct field from table against 10K rows. Surprisingly, it seems doing select * from cachetable (10 rows) is no faster than doing the select distinct against 10K rows. Why is this? Is the index doing all the work? At what number of rows in our main table will there be a performance improvement from querying the cache table?
For a DB, 10K rows is nothing. You're not seeing much difference because the actual calculation time is minimal, with most of it consumed by other, constant, overhead.
It's difficult to predict when you'd start noticing a difference, but it would probably be at around a million rows.
If you've already set up caching and it's not detrimental, you may as well leave it in.
10k rows is not much... start caring when you reach 500k ~ 1 million rows.
Indexes do a great job, especially if you have just 10 different values for that indexed column.
This depends on numerous factors - the amount of memory your DB has, the size of the rows in the table, the use of parameterised queries and so forth - but generally 10K is not a lot of rows, and particularly if the table is well indexed it's not going to cause any modern RDBMS to break a sweat.
As a rule of thumb I would generally only start paying close attention to performance issues on a table when it passes the 100K row mark, and 500K doesn't usually cause much of a problem if the table is indexed correctly and accessed via those indexes. Performance tends to fall off catastrophically on large tables - you may be fine on 500K rows but crawling on 600K - but you have a long way to go before you are at all likely to hit such problems.
Is the index doing all the work?
You can tell how the query is being executed by viewing the execution plan.
For example, try this:
explain plan for select distinct field from table;
select * from table(dbms_xplan.display);
I notice that you didn't include an ORDER BY on that. If you do not include ORDER BY then the order of the result set may be random, particularly if Oracle uses the HASH algorithm for building the distinct list. You ought to check that.
So I'd look at the execution plans for the original query that you think is using an index, and at the one based on the cache table. Maybe post them and we can comment on what's really going on.
Incidentally, the cache table would usually be implemented as a materialised view, particularly if the master table is generally pretty static.
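As an illustration, the cache could look something like this (maintable and field are placeholder names); the database then keeps the distinct list for you instead of a hand-maintained cache table:

-- materialised view holding the distinct values, refreshed on demand
create materialized view mv_field_values
  build immediate
  refresh complete on demand
as
  select distinct field from maintable;

-- the drop-down query reads the tiny view instead of scanning 10K rows
select field from mv_field_values order by field;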
Serious premature optimization. Just let the database do its job, maybe with some tweaking to the configuration (especially if it's MySQL, which has several cache types and settings).
Your query against 10K rows most probably uses a HASH UNIQUE or SORT UNIQUE operation.
As 10K rows most probably fit into db_buffers and hash_area_size, all operations are performed in memory, and you won't notice any difference.
But if the query is used as part of a more complex query, or its data gets pushed out of the cache by other activity, you may need disk I/O to access the data, which will slow your query down.
Run your query in a loop in several sessions (as many sessions as there will be users connected), and see how it performs in that case.
For future plans and for scalability, you may want to look into an indexing service that uses pure memory or something faster than the TCP DB round-trip. A lot of people (including myself) use Lucene to achieve this by normalizing the data into flat files.
Lucene has a built-in RAM-based directory (RAMDirectory), which can hold the index entirely in memory - removing the dependency on the file system and greatly increasing speed.
Lately, I've architected systems that have a single RAM-based index wrapped by a web service. Then I have my Ajax-style drop-downs query that web service for high availability and high speed - no DB layer, no file system, just pure memory and, if remote, TCP packet speed.
If you have an index on the column, then all the values are in the index and the DBMS never has to look at the table; it just reads the index, which has only 10 distinct values to return. If this is mostly read-only data, then cache it in memory. Caching helps scalability a lot by relieving the database of work. A query that is quick on a database with no users might perform poorly when 30 queries are running at the same time.