How to make simple GROUP BY use index? - performance

I want to get hourly average temperatures from a table of thermometer readings with the row structure thermometer_id, timestamp (float, Julian days), value (float), plus an ascending index on timestamp.
To get the whole day from 4 days ago, I'm using this query:
SELECT
    ROUND(AVG(value), 2), -- average temperature
    COUNT(*)              -- count of readings
FROM reads
WHERE
    timestamp >= (julianday(date('now')) - 5) -- between 5 days
    AND
    timestamp < (julianday(date('now')) - 4)  -- ...and 4 days ago
GROUP BY CAST(timestamp * 24 as int)          -- make hours from floats, group by hours
It works correctly, but very slowly: for a 9 MB database with 355k rows it takes more than half a second to finish, which is confusingly long; it shouldn't take more than a few tens of milliseconds. That is on hardware that isn't particularly fast (no SSD, though), and I'm preparing it to run on a Raspberry Pi, which is much slower by comparison, while the table will grow by about 80k rows per day.
EXPLAIN QUERY PLAN shows the reason:
"USE TEMP B-TREE FOR GROUP BY"
I've tried adding day and hour columns with indexes on them just for quick access, but the GROUP BY still didn't use any of those indexes.
How can I tune this query or database to make this query faster?

If an index is used to optimize the GROUP BY, the timestamp search can no longer be optimized (except by using the skip-scan optimization, which your old SQLite might not have). And going through all rows in reads, only to throw most of them away because of a non-matching timestamp, would not be efficient.
If SQLite doesn't automatically do the right thing, even after running ANALYZE, you can try to force it to use a specific index:
CREATE INDEX rhv ON reads(hour, value);
SELECT ... FROM reads INDEXED BY rhv WHERE timestamp ... GROUP BY hour;
But this is unlikely to result in a query plan that is actually faster.

As colonel-thirty-two commented, the problem was the cast and multiplication in GROUP BY CAST(timestamp * 24 as int). Grouping on such an expression bypasses the index entirely, hence the slow query time. When I used the hour column for both the time comparison and the grouping, the query finished immediately.
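For reference, here is a minimal sketch of that fix, assuming the hour column stores the absolute hour number CAST(timestamp * 24 AS int) and is filled in when a reading is inserted (the column and index names are illustrative):
-- index the precomputed hour bucket
CREATE INDEX reads_hour ON reads(hour);
-- both the range filter and the grouping now use the same indexed column,
-- so SQLite can read the rows in hour order instead of building a temp b-tree
SELECT ROUND(AVG(value), 2), COUNT(*)
FROM reads
WHERE hour >= CAST((julianday(date('now')) - 5) * 24 AS int)
  AND hour <  CAST((julianday(date('now')) - 4) * 24 AS int)
GROUP BY hour;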

Related

How to set up CosmosDB when you need to search for "like" in string tags

I have a 3 tables structure, Customer, Invoice, InvoiceItem that I would like to try to move from the relational DB and store it in CosmosDB. Currently, there are quite intensive queries being run on the InvoiceItem table. This InvoiceItem table has up to 10 optional TagX columns that are basically text that might include the brand, group, type, or something that would group this InvoiceItem and make it searchable by saying (simplified):
SELECT * FROM InvoiceItem WHERE Tag1 LIKE '%shirt%' AND Tag2 LIKE '%training%'
A query like this on a multi-million-row table can take more than 8 minutes. We are working on an archiving strategy and on indexes to speed up the process, but it looked to me like CosmosDB could be worth trying in this case, since all of the data is a write-once-read-many scenario.
Back to CosmosDB: how do I deal with those string tags in CosmosDB? As a start, I thought about having Invoice and InvoiceItem in the same partition with a "type" property to distinguish them. But then I cannot put the tags anywhere that would make them easily searchable. Any ideas on how to set it up?
Thanks!
This is a textbook database performance issue caused by either a lack of indexing or inefficient indexing.
With that many rows, index cardinality becomes important. You don't want to index the entire field; you only want to index the first n characters of the columns you're indexing, and only index the columns you actually search on, whether via joins or direct WHERE clauses.
The idea is to keep the indexes as small as possible, while still giving you the query performance you need.
With 18 million rows you probably want to start with an index cardinality of the square root of 18m.
That means that to hit the index segment you need, you should have to search no more than about 5,000 index entries, each of which covers a segment of roughly 4,000-5,000 rows, at least for sub-second result times.
Indexing the first 3-4 letters would be a good starting point: the square root of 18,000,000 is about 4,243, and the nearest power of 26 above it, 26^3 = 17,576 (assuming alphabetic characters only), already overshoots that. Even if the tags are alphanumeric, 3 characters is still a good starting point.
If the queries then run super fast but the index takes forever to build, drop a character. This is called "index tuning": you pick a starting point and find the shortest prefix (and therefore the lowest index cardinality) that still gives you the performance you need.
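As a concrete illustration, a prefix index on the first four characters might look like this (MySQL-style prefix syntax; other engines, including Cosmos DB, handle this differently, so treat it as a sketch of the idea rather than a drop-in statement):
CREATE INDEX ix_invoiceitem_tag1 ON InvoiceItem (Tag1(4));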
If I'm way off because index behavior in this database is far from that of a typical relational DB, you'll need to experiment.
As far as I'm concerned, a select query that takes more than a few seconds is unacceptable, except in rare cases. I once worked for a security company. Their license management system took minutes to pull large customers.
After indexing the tables correctly the largest customer took less than 2 seconds. I had to sift through a table with billions of rows for number of downloads, and some of these queries had 7 joins.
If that database can't do this with 18m rows, I'd seriously consider a migration to a better architecture, hardware, software or otherwise.
As index cardinality approaches table cardinality, the performance gains shrink and can even become negative compared with having no index at all.
As in all things in life, moderation. At the other end of the spectrum, an index with a cardinality of 2 is just about useless: half of 8 minutes is 4 minutes, assuming a roughly even distribution, so indexing a boolean field usually isn't a great idea. There are few hard and fast rules, though; there are lots of edge cases, and experimentation is your friend.

Convert ms timestamp to sequential unique 32 bit number?

I have a table where each record has a field for the timestamp (in ms) from when it was created. This gives a unique ID for each record, as well as sequential ordering. Record 12345678 is different from and comes after 12222222.
There are not records every millisecond, or even every second (although the rate could increase).
My problem is that I have a client expecting unique 32-bit IDs. These IDs also need to be numeric, unique and sequential, but the timestamp above is currently ~43 bits.
I could hash them down, but then I lose the sequential and numeric properties. I could chop off the first 10-15 bits or the last, but then I might lose uniqueness. Someone suggested accepting that no record is from before 1 Jan 2010 and subtracting that offset (roughly 40 years) from the timestamp. I don't love it, and in any case a single year contains more milliseconds (about 3.15 * 10^10) than fit in 32 bits, so that wouldn't work.
Any good ways of dealing with this?
If you need to handle records that may be only milliseconds apart, there is no way to squeeze the timestamp down to 32 bits without the risk of collisions, simply because there might some day be more than 2^32 distinct values.
As I understand your problem, you need to be able to find the records later by this ID, and you are not able to store a separate 32-bit ID in the records.
Is that right?
I see the following possibilities:
If you can ensure that there is no more than one record roughly every 4 seconds, then you can simply drop the last 12 bits of your 43-bit timestamp (see the sketch after this list). But this will no longer work once your timestamp grows to 44 bits.
If you can modify the timestamps of your records, you can take the above approach and, whenever two records are too close together, adjust the timestamp of the later one so that the upper 32 bits remain unique. This will work as long as the average rate is less than one record every 4 seconds. [Disadvantage: the timestamps of the records are no longer exactly the creation time, but still more or less accurate.]
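A minimal sketch of the first option in SQL, assuming a table records with a millisecond timestamp column created_ms (both names are hypothetical) and an engine that supports the >> bit-shift operator, such as SQLite, MySQL or PostgreSQL:
-- dropping the low 12 bits gives roughly 4.1-second resolution,
-- so the result fits into 32 bits while the timestamp stays at 43 bits
SELECT created_ms >> 12 AS id32
FROM records;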

Oracle 11g - doing analytic functions on millions of rows

My application allows users to collect measurement data as part of an experiment, and needs to have the ability to report on all of the measurements ever taken.
Below is a very simplified version of the tables I have:
CREATE TABLE EXPERIMENTS(
    EXPT_ID   INT,
    EXPT_NAME VARCHAR2(255 CHAR)
);
CREATE TABLE USERS(
    USER_ID INT,
    EXPT_ID INT
);
CREATE TABLE SAMPLES(
    SAMPLE_ID INT,
    USER_ID   INT
);
CREATE TABLE MEASUREMENTS(
    MEASUREMENT_ID          INT,
    SAMPLE_ID               INT,
    MEASUREMENT_PARAMETER_1 NUMBER,
    MEASUREMENT_PARAMETER_2 NUMBER
);
In my database there are 2000 experiments, each of which has 18 users. Each user has 6 samples to measure, and would do 100 measurements per sample.
This means that there are 2000 * 18 * 6 * 100 = 21600000 measurements currently stored in the database.
I'm trying to write a query that will get the AVG() of measurement parameter 1 and 2 for each user - that would return about 36,000 rows.
The query I have is extremely slow - I've left it running for over 30 minutes and it doesn't come back with anything. My question is: is there an efficient way of getting the averages? And is it actually possible to get results back for this amount of data in a reasonable time, say 2 minutes? Or am I being unrealistic?
Here's (again a simplified version) the query I have:
SELECT
    E.EXPT_ID,
    U.USER_ID,
    AVG(MEASUREMENT_PARAMETER_1) AS AVG_1,
    AVG(MEASUREMENT_PARAMETER_2) AS AVG_2
FROM
    EXPERIMENTS E,
    USERS U,
    SAMPLES S,
    MEASUREMENTS M
WHERE
    U.EXPT_ID = E.EXPT_ID
    AND S.USER_ID = U.USER_ID
    AND M.SAMPLE_ID = S.SAMPLE_ID
GROUP BY E.EXPT_ID, U.USER_ID
This will return a row for each expt_id/user_id combination and the average of the 2 measurement parameters.
For your query, the DBMS needs to read the complete MEASUREMENTS table in any case. That is by far the biggest chunk of data to read, and the part that takes the most time if the query is well optimized (more on that below). This means the minimum runtime of your query is roughly the time it takes to read the complete MEASUREMENTS table from wherever it is stored. You can get a rough estimate by checking how much data that is (in MB or GB) and how long it takes to read that amount of data from the hard disk (or wherever the table is stored). If your query runs slower than that by a factor of 5 or more, you can be sure there is room for optimization.
There is a vast amount of information (tutorials, individual hints which can be invaluable, and lists of general practices) about how to optimize Oracle queries, and you will not get through all of it quickly. But if you provide the execution plan of your query (what Oracle's query optimizer thinks is the best way to execute it), we will be able to spot steps that can be optimized and suggest solutions.
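For reference, one common way to capture the execution plan in Oracle is EXPLAIN PLAN combined with DBMS_XPLAN (shown here for the simplified query from the question):
EXPLAIN PLAN FOR
SELECT E.EXPT_ID, U.USER_ID,
       AVG(MEASUREMENT_PARAMETER_1) AS AVG_1,
       AVG(MEASUREMENT_PARAMETER_2) AS AVG_2
FROM EXPERIMENTS E, USERS U, SAMPLES S, MEASUREMENTS M
WHERE U.EXPT_ID = E.EXPT_ID
  AND S.USER_ID = U.USER_ID
  AND M.SAMPLE_ID = S.SAMPLE_ID
GROUP BY E.EXPT_ID, U.USER_ID;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);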

Oracle Date Range Inconsistency

I am running a fairly large query on a specific range of dates. The query takes about 30 seconds EXCEPT when I do a range of 10/01/2011-10/31/2011. For some reason that range never finishes. For example 01/01/2011-01/31/2011, and pretty much every other range, finishes in the expected time.
Also, I noticed that doing smaller ranges, like a week, takes longer than larger ranges.
When Oracle gathers statistics on a table, it will record the low value and the high value in a date column and use that to estimate the cardinality of a predicate. If you create a histogram on the column, it will gather more detailed information about the distribution of data within the column. Otherwise, Oracle's cost based optimizer (CBO) will assume a uniform distribution.
For example, if you have a table with 1 million rows and a DATE column with a low value of January 1, 2001 and a high value of January 1, 2011, it will assume that approximately 10% of the data falls in the range January 1, 2001 - January 1, 2002, and that roughly 0.027% of the data comes from some time on March 3, 2008 (1 / (10 years * 365 days per year + leap days)).
So long as your queries use dates from within the known range, the optimizer's cardinality estimates are generally pretty good so its decisions about what plan to use are pretty good. If you go a bit beyond the upper or lower bound, the estimates are still pretty good because the optimizer assumes that there probably is data that is larger or smaller than it saw when it sampled the data to gather the statistics. But when you get too far from the range that the optimizer statistics expect to see, the optimizer's cardinality estimates get too far out of line and it eventually chooses a bad plan. In your case, prior to refreshing the statistics, the maximum value the optimizer was expecting was probably September 25 or 26, 2011. When your query looked for data for the month of October, 2011, the optimizer most likely expected that the query would return very few rows and chose a plan that was optimized for that scenario rather than for the larger number of rows that were actually returned. That caused the plan to be much worse given the actual volume of data that was returned.
In Oracle 10.2, when Oracle does a hard parse and generates a plan for a query that is loaded into the shared pool, it peeks at the bind variable values and uses those values to estimate the number of rows a query will return and thus the most efficient query plan. Once a query plan has been created and until the plan is aged out of the shared pool, subsequent executions of the same query will use the same query plan regardless of the values of the bind variables. Of course, the next time the query has to be hard parsed because the plan was aged out, Oracle will peek and will likely see new bind variable values.
Bind variable peeking is not a particularly well-loved feature (Adaptive Cursor Sharing in 11g is much better) because it makes it very difficult for a DBA or a developer to predict what plan is going to be used at any particular instant because you're never sure if the bind variable values that the optimizer saw during the hard parse are representative of the bind variable values you generally see. For example, if you are searching over a 1 day range, an index scan would almost certainly be more efficient. If you're searching over a 5 year range, a table scan would almost certainly be more efficient. But you end up using whatever plan was chosen during the hard parse.
Most likely, you can resolve the problem simply by ensuring that statistics are gathered more frequently on tables that are frequently queried based on ranges of monotonically increasing values (date columns being by far the most common such column). In your case, it had been roughly 6 weeks since statistics had been gathered before the problem arose so it would probably be safe to ensure that statistics are gathered on this table every month or every couple weeks depending on how costly it is to gather statistics.
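A minimal sketch of such a periodic refresh, assuming the table is MYSCHEMA.MY_TABLE (both names are placeholders for your own schema and table):
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'MYSCHEMA',
    tabname => 'MY_TABLE',
    cascade => TRUE   -- refresh index statistics as well
  );
END;
/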
You could also use the DBMS_STATS.SET_COLUMN_STATS procedure to explicitly set the statistics for this column on a more regular basis. That requires more coding and work but saves you the time of gathering statistics. That can be hugely beneficial in a data warehouse environment but it's probably overkill in a more normal OLTP environment.

Index needed for max(col)?

I'm currently doing some data loading for a kind of warehouse solution. I get a data export from production each night, which must then be loaded. There are no other updates on the warehouse tables. To load only new items for a certain table I'm currently doing the following steps:
get the current max value y for a specific column (id for journal tables and time for event tables)
load the data via a query like where x > y
To avoid performance issues (I load around 1 million rows per day) I removed most indexes from the tables (they are only needed in production, not in the warehouse). But that way the retrieval of the max value takes some time... so my question is:
What is the best way to get the current max value for a column without an index on that column? I just read about using the statistics, but I don't know how to handle columns of type 'timestamp with time zone'. Disabling the index before the load and recreating it afterwards takes much too long...
The minimum and maximum values that are computed as part of column-level statistics are estimates. The optimizer only needs them to be reasonably close, not completely accurate. I certainly wouldn't trust them as part of a load process.
Loading a million rows per day isn't terribly much. Do you have an extremely small load window? I'm a bit hard-pressed to believe that you can't afford the cost of maintaining the index needed to do a min/max index scan.
If you want to avoid indexes, however, you probably want to store the last max value in a separate table that you maintain as part of the load process. After you load rows 1-1000 in table A, you'd update the row in this summary table for table A to indicate that the last row you've processed is row 1000. The next time in, you would read the value from the summary table and start at 1001.
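A minimal sketch of that bookkeeping table (all names are illustrative, and :max_loaded_value stands for the high-water mark computed by the load job):
-- one row per warehouse table, holding the last value that was loaded
CREATE TABLE load_progress (
    table_name  VARCHAR2(30) PRIMARY KEY,
    last_loaded TIMESTAMP WITH TIME ZONE
);

-- at the start of a load: read the previous high-water mark
SELECT last_loaded FROM load_progress WHERE table_name = 'TABLE_A';

-- after loading everything newer than that: record the new high-water mark
UPDATE load_progress
   SET last_loaded = :max_loaded_value
 WHERE table_name = 'TABLE_A';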
If there is no index on the column, the only way for the DBMS to find the maximum value in the column is a complete table scan, which takes a long time for large tables.
I suppose a DBMS could try to keep track of the minimum and maximum values in the column (storing the values in the system catalog) as it does inserts, updates and deletes - but deletes are the problem, and as far as I know no DBMS tries to keep its statistics up to date with per-row operations. If you delete the maximum value, finding the new maximum requires a table scan if the column is not indexed (and if it is indexed, the index makes it trivial to find the maximum value, so the information does not have to be stored in the system catalog). This is why they're called 'statistics': they're an approximation of the actual values. But when you run SELECT MAX(somecol) FROM sometable, you aren't asking for a statistical maximum; you're asking for the actual current maximum.
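By contrast, a sketch of the indexed case, where Oracle can answer MAX() with a min/max index scan instead of a full table scan (index, table and column names are placeholders):
CREATE INDEX journal_id_ix ON journal_table (id);
-- with the index in place, this no longer needs to read the whole table
SELECT MAX(id) FROM journal_table;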
Have the process that creates the extract file also produce a single-row file with the min/max you want. I assume that piece is scripted on some cron job or scheduler, so it shouldn't be too much to ask to add the min/max calculation to that script ;)
If not, just do a full scan. A million rows isn't much really, especially in a data warehouse environment.
This code was written for Oracle, but should be compatible with most SQL dialects:
This gets the key of the max(high_val) in the table according to the range.
select high_val, my_key
from (select high_val, my_key
      from mytable
      where something = 'avalue'
      order by high_val desc)
where rownum <= 1
What this says is: sort mytable by high_val descending for the rows where something = 'avalue', and grab only the top row, which gives you the max(high_val) in the selected range along with the corresponding my_key.
