How do I not normalize continuous data (INTS, FLOATS, DATETIME, ....)? - performance

According to my understanding - and correct me if I'm wrong - "Normalization" is the process of removing the redundant data from the database-desing
However, when I was trying to learn about database optimizing/tuning for performance, I encountered that Mr. Rick James recommend against normalizing continuous values such as (INTS, FLOATS, DATETIME, ...)
"Normalize, but don't over-normalize." In particular, do not normalize
datetimes or floats or other "continuous" values.
source
Sure purists say normalize time. That is a big mistake. Generally,
"continuous" values should not be normalized because you generally
want to do range queries on them. If it is normalized, performance
will be orders of magnitude worse.
Normalization has several purposes; they don't really apply here:
Save space -- a timestamp is 4 bytes; a MEDIUMINT for normalizing is 3; not much savings
To allow for changing the common value (eg changing "International Business Machines" to "IBM" in one place) -- not relevent here; each
time was independently assigned, and you are not a Time Lord.
In the case of datetime, the normalization table could have extra columns like "day of week", "hour of day". Yeah, but performance still
sucks.
source
Do not normalize "continuous" values -- dates, floats, etc --
especially if you will do range queries.
source.
I tried to understand this point but I couldn't, can someone please explain this to me and give me an example of the worst case that applying this rule on will enhance the performance ?.
Note: I could have asked him in a comment or something, but I wanted to document and highlight this point alone because I believe this is very important note that affect almost my entire database performance

The Comments (so far) are discussing the misuse of the term "normalization". I accept that criticism. Is there a term for what is being discussed?
Let me elaborate on my 'claim' with this example... Some DBAs replace a DATE with a surrogate ID; this is likely to cause significant performance issues when a date range is used. Contrast these:
-- single table
SELECT ...
FROM t
WHERE x = ...
AND date BETWEEN ... AND ...; -- `date` is of datatype DATE/DATETIME/etc
-- extra table
SELECT ...
FROM t
JOIN Dates AS d ON t.date_id = d.date_id
WHERE t.x = ...
AND d.date BETWEEN ... AND ...; -- Range test is now in the other table
Moving the range test to a JOINed table causes the slowdown.
The first query is quite optimizable via
INDEX(x, date)
In the second query, the Optimizer will (for MySQL at least) pick one of the two tables to start with, then do a somewhat tedious back-and-forth to the other table to handle rest of the WHERE. (Other Engines use have other techniques, but there is still a significant cost.)
DATE is one of several datatypes where you are likely to have a "range" test. Hence my proclamations about it applying to any "continuous" datatypes (ints, dates, floats).
Even if you don't have a range test, there may be no performance benefit from the secondary table. I often see a 3-byte DATE being replaced by a 4-byte INT, thereby making the main table larger! A "composite" index almost always will lead to a more efficient query for the single-table approach.

Related

Bind variables results in full table scan in Oracle

Checking the query cost on a table with 1 million records results in full table scan while the same query in oracle with actual values results in significant lesser cost.
Is this expected behaviour from Oracle ?
Is there a way to tell Oracle not to scan the full table ?
The query is scanning the full table when bind variables are used:
The query cost reduces significantly with actual variables:
This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows. So it opts for a very high cardinality. The reason for this is that index reads perform very badly for retrieving large numbers of rows. Full table scans are the best way of handling big slices of our tables.
Is this expected behaviour from Oracle ?
Now you understand this you will can see that the answer to this question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table ?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).

Time series in Cassandra when measures can go "back in time"

this is related to cassandra time series modeling when time can go backward, but I think I have a better scenario to explain why the topic is important.
Imagine I have a simple table
CREATE TABLE measures(
key text,
measure_time timestamp,
value int,
PRIMARY KEY (key, measure_time))
WITH CLUSTERING ORDER BY (measure_time DESC);
The purpose of the clustering key is to have data arranged in a decreasing timestamp ordering. This leads to very efficient range-based queries, that for a given key lead to sequential disk reading (which are intrinsically fast).
Many times I have seen suggestions to use a generated timeuuid as timestamp value ( using now() ), and this is obviously intrinsically ordered. But you can't always do that. It seems to me a very common pattern, you can't use it if:
1) your user wants to query on the actual time when the measure has been taken, not the time where the measure has been written.
2) you use multiple writing threads
So, I want to understand what happens if I write data in an unordered fashion (with respect to measure_time column).
I have personally tested that if I insert timestamp-unordered values, Cassandra indeed reports them to me in a timestamp-ordered fashion when I run a select.
But what happens "under the hood"? In my opinion, it is impossible that data are still ordered on disk. At some point in fact data need to be flushed on disk. Imagine you flush a data set in the time range [0,10]. What if the next data set to flush has measures with timestamp=9? Are data re-arranged on disk? At what cost?
Hope I was clear, I couldn't find any explanation about this on Datastax site but I admit I'm quite a novice on Cassandra. Any pointers appreciated
Sure, once written a SSTable file is immutable, Your timestamp=9 will end up in another SSTable, and C* will have to merge and sort data from both SSTables, if you'll request both timestamp=10 and timestamp=9. And that would be less effective than reading from a single SSTable.
The Compaction process may merge those two SSTables into new single one. See http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
And try to avoid very wide rows/partitions, which will be the case if you have a lot measurements (i.e. a lot of measure_time values) for a single key.

Oracle Date Range Inconsistancy

I am running a fairly large query on a specific range of dates. The query takes about 30 seconds EXCEPT when I do a range of 10/01/2011-10/31/2011. For some reason that range never finishes. For example 01/01/2011-01/31/2011, and pretty much every other range, finishes in the expected time.
Also, I noticed that doing smaller ranges, like a week, takes longer than larger ranges.
When Oracle gathers statistics on a table, it will record the low value and the high value in a date column and use that to estimate the cardinality of a predicate. If you create a histogram on the column, it will gather more detailed information about the distribution of data within the column. Otherwise, Oracle's cost based optimizer (CBO) will assume a uniform distribution.
For example, if you have a table with 1 million rows and a DATE column with a low value of January 1, 2001 and a high value of January 1, 2011, it will assume that the approximately 10% of the data is in the range January 1, 2001 - January 1, 2002 and that roughly 0.027% of the data would come from some time on March 3, 2008 (1/(10 years * 365 days per year + leap days).
So long as your queries use dates from within the known range, the optimizer's cardinality estimates are generally pretty good so its decisions about what plan to use are pretty good. If you go a bit beyond the upper or lower bound, the estimates are still pretty good because the optimizer assumes that there probably is data that is larger or smaller than it saw when it sampled the data to gather the statistics. But when you get too far from the range that the optimizer statistics expect to see, the optimizer's cardinality estimates get too far out of line and it eventually chooses a bad plan. In your case, prior to refreshing the statistics, the maximum value the optimizer was expecting was probably September 25 or 26, 2011. When your query looked for data for the month of October, 2011, the optimizer most likely expected that the query would return very few rows and chose a plan that was optimized for that scenario rather than for the larger number of rows that were actually returned. That caused the plan to be much worse given the actual volume of data that was returned.
In Oracle 10.2, when Oracle does a hard parse and generates a plan for a query that is loaded into the shared pool, it peeks at the bind variable values and uses those values to estimate the number of rows a query will return and thus the most efficient query plan. Once a query plan has been created and until the plan is aged out of the shared pool, subsequent executions of the same query will use the same query plan regardless of the values of the bind variables. Of course, the next time the query has to be hard parsed because the plan was aged out, Oracle will peek and will likely see new bind variable values.
Bind variable peeking is not a particularly well-loved feature (Adaptive Cursor Sharing in 11g is much better) because it makes it very difficult for a DBA or a developer to predict what plan is going to be used at any particular instant because you're never sure if the bind variable values that the optimizer saw during the hard parse are representative of the bind variable values you generally see. For example, if you are searching over a 1 day range, an index scan would almost certainly be more efficient. If you're searching over a 5 year range, a table scan would almost certainly be more efficient. But you end up using whatever plan was chosen during the hard parse.
Most likely, you can resolve the problem simply by ensuring that statistics are gathered more frequently on tables that are frequently queried based on ranges of monotonically increasing values (date columns being by far the most common such column). In your case, it had been roughly 6 weeks since statistics had been gathered before the problem arose so it would probably be safe to ensure that statistics are gathered on this table every month or every couple weeks depending on how costly it is to gather statistics.
You could also use the DBMS_STATS.SET_COLUMN_STATS procedure to explicitly set the statistics for this column on a more regular basis. That requires more coding and work but saves you the time of gathering statistics. That can be hugely beneficial in a data warehouse environment but it's probably overkill in a more normal OLTP environment.

multicolumn index column order

I've be told and read it everywhere (but no one dared to explain why) that when composing an index on multiple columns I should put the most selective column first, for performance reasons.
Why is that?
Is it a myth?
I should put the most selective column first
According to Tom, column selectivity has no performance impact for queries that use all the columns in the index (it does affect Oracle's ability to compress the index).
it is not the first thing, it is not the most important thing. sure, it is something to consider but it is relatively far down there in the grand scheme of things.
In certain strange, very peculiar and abnormal cases (like the above with really utterly skewed data), the selectivity could easily matter HOWEVER, they are
a) pretty rare
b) truly dependent on the values used at runtime, as all skewed queries are
so in general, look at the questions you have, try to minimize the indexes you need based on that.
The number of distinct values in a column in a concatenated index is not relevant when considering
the position in the index.
However, these considerations should come second when deciding on index column order. More importantly is to ensure that the index can be useful to many queries, so the column order has to reflect the use of those columns (or the lack thereof) in the where clauses of your queries (for the reason illustrated by AndreKR).
HOW YOU USE the index -- that is what is relevant when deciding.
All other things being equal, I would still put the most selective column first. It just feels right...
Update: Another quote from Tom (thanks to milan for finding it).
In Oracle 5 (yes, version 5!), there was an argument for placing the most selective columns first
in an index.
Since then, it is not true that putting the most discriminating entries first in the index
will make the index smaller or more efficient. It seems like it will, but it will not.
With index
key compression, there is a compelling argument to go the other way since it can make the index
smaller. However, it should be driven by how you use the index, as previously stated.
You can omit columns from right to left when using an index, i.e. when you have an index on col_a, col_b you can use it in WHERE col_a = x but you can not use it in WHERE col_b = x.
Imagine to have a telephone book that is sorted by the first names and then by the last names.
At least in Europe and US first names have a much lower selectivity than last names, so looking up the first name wouldn't narrow the result set much, so there would still be many pages to check for the correct last name.
The ordering of the columns in the index should be determined by your queries and not be any selectivity considerations. If you have an index on (a,b,c), and most of your single column queries are against column c, followed by a, then put them in the order of c,a,b in the index definition for the best efficiency. Oracle prefers to use the leading edge of the index for the query, but can use other columns in the index in a less efficient access path known as skip-scan.
The more selective is your index, the fastest is the research.
Simply imagine a phonebook: you can find someone mostly fast by lastname. But if you have a lot of people with the same lastname, you will last more time on looking for the person by looking at the firstname everytime.
So you have to give the most selective columns firstly to avoid as much as possible this problem.
Additionally, you should then make sure that your queries are using correctly these "selectivity criterias".

Codeigniter WHERE on "AS" field

I have a query where I need to modify the selected data and I want to limit my results of that data. For instance:
SELECT table_id, radians( 25 ) AS rad FROM test_table WHERE rad < 5 ORDER BY rad ASC;
Where this gets hung up is the 'rad < 5', because according to codeigniter there is no 'rad' column. I've tried writing this as a custom query ($this->db->query(...)) but even that won't let me. I need to restrict my results based on this field. Oh, and the ORDER BY works perfect if I remove the WHERE filter. The results are order ASC by the rad field.
HELP!!!
With many DBMSes, we need to repeat the formula / expression in the where clause, i.e.
SELECT table_id, radians( 25 ) AS rad
FROM test_table
WHERE radians( 25 ) < 5
ORDER BY radians( 25 ) ASC
However, in this case, since the calculated column is a constant, the query itself doesn't make much sense. Was there maybe a missing part, as in say radians (25 * myColumn) or something like that ?
Edit (following info about true nature of formula etc.)
You seem disappointed, because the formula needs to be repeated... A few comments on that:
The fact that the formula needs to be explicitly spelled-out rather than aliased may make the query less readable, less fun to write etc. (more on this below), but the more important factor to consider is that the formula being used in the WHERE clause causes the DBMS to calculate this value for potentially all of the records in the underlying table!!!
This in turns hurts performance in several ways:
SQL may not be able use some indexes, and instead have to scan the table (or parts thereof)
if the formula is heavy, it both makes for slow response and for a less scalable server
The situation is not quite as bad if additional predicates in the WHERE clause allow SQL to filter out [a significant amount of] records that would otherwise be processed. Such additional search criteria may be driven by the application (for example in addition to this condition on radiant, the [unrelated] altitude of the location is required to be below 6,000 ft), or such criteria may be added "artificially" to help with the query (for example you may know of a rough heuristic which is insufficient to calculate the "radian" value within acceptable precision, but may yet be good enough to filter-out 70% of the records, only keeping these which have a chance of satisfying the exact range desired for the "radian".
Now a few tricks regarding the formula itself, in an attempt to make it faster:
remember that you may not need to run 100% of the textbook formula.
I'm not sure which part of the great circle math is relevant to this radian calculation, but speaking in generic terms, some formulas include an expensive step, such as a square root extraction, a call to a trig function etc. In some cases it may be possible to simplify the formula (which has to be run for many records/values), by applying the reverse step to the other side of the predicate (which, typically, only needs to be evaluated once). For example if say the search condition predicate is "WHERE SQRT((x1-x2)^2 + (y1-y2)^2) > 5". Since the calculation of distance involves finding the square root (of the sum of the squared differences), one may decide to remove the square root and instead compare the results of this modified formula with the square of the distance value origninally, i.e. "WHERE ((x1-x2)^2 + (y1-y2)^2) > (5^2)"
Depending on your SQL/DBMS system, it may be possible to implement the formula in a custom-defined function, which would make it both more efficient (because "pre-compiled", maybe written in a better language etc.) and shorter to reference, in the SQL query itself (event though it would require being listed twice, as said)
Depending on situation, it may also be possible to alter the database schema and underlying application, to have the formula (or parts thereof) but pre-computed, and indexed, saving the DBMS this lengthy resolution of the function-based predicate.

Resources