According to this:
Selectivity is a value between 0 and 1: the fraction of rows returned after applying a filter on the table.
For example if
a table has 10,000 rows and the query returns 2601 rows, the
selectivity would be 2601/10000 or .26 or 26 percent. Selectivity
enables you (or the optimizer, for that matter) to decide which data access
method is optimal in the execution plan.
I need some clarification: OK, that table has 10,000 rows and only 2,601 are returned by the query. But what if that query block contains three joined tables, or a subquery in the WHERE clause? Say there are three tables in the FROM clause, and a fourth table is part of a WHERE-clause subquery; how is the selectivity calculated then?
Selectivity = number of rows satisfying a condition (from which table?) / total number of rows (from all the four tables?)
Same question for cardinality (cardinality = selectivity * total number of rows).
I found many articles about this, but each of them exemplifies these concepts with simple select statements, based on a single table or a single where clause condition.
Can someone give me an example of how these measures are calculated for a somewhat more complex query (on the "hr" schema, or another training schema), meaning subqueries in the FROM clause or WHERE clause, and so on?
Thank you.
EDIT:
I need some clarification about the selectivity measure, computed by the Estimator (Cost-Based Optimizer).
http://gerardnico.com/wiki/database/oracle/selectivity
For example, for an equality predicate (last_name = 'Smith'), selectivity is set to the reciprocal of the number n of distinct values of last_name, because the query selects rows that all contain one out of n distinct values.
I don't know how to understand that "reciprocal of the number n of distinct values".
Assuming the employees table has 107 rows, and the query
select * from employees where last_name = 'Smith'
returns 2 rows, the selectivity is 2/107 ≈ 0.019? So it's the number of rows satisfying the predicate / total number of rows. So no "distinct" is involved in this equation.
Apart from this selectivity of the statement, there is also a column selectivity, which is represented by the NDV (number of distinct values in that column, which can be queried from dba_tab_col_statistics) / total number of rows (http://www.runningoracle.com/product_info.php?products_id=233). So if the NDV is 103, the selectivity of the last_name column is 103/107 ≈ 0.96.
This is what I understood.. is this correct? If I'm wrong, please correct me.
Thank you.
Selectivity is always based on whatever criteria are being applied at that time.
Exactly what this means for a particular table depends on the join order.
Oracle will always start executing a query by selecting rows from a particular table on its own. In this case the selectivity is straightforward, as in the examples you have read; there are no join conditions to take into account at this point.
Next, it joins in a second table. Oracle estimates how many rows will satisfy both the constant conditions directly on that table and any join conditions from the first table. The latter is called "join selectivity".
Then, when joining the third table it estimates based on joining to the first two tables as well as any constant conditions.
This is one of the reasons that join order is so important to a plan.
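The chaining described above can be sketched in a few lines. This is a deliberately simplified, hypothetical model (table names and statistics are made up, and real optimizers also use histograms and sanity bounds), just to show how per-table selectivities and join selectivities multiply along a join order:

```python
# Hedged sketch of how a cost-based optimizer chains selectivities
# along a join order. All statistics below are invented for illustration.

def eq_selectivity(ndv):
    # equality predicate: 1 / number of distinct values of the column
    return 1.0 / ndv

# assumed table sizes
t1_rows, t2_rows, t3_rows = 10_000, 50_000, 2_000

# step 1: filter on t1 alone, e.g. t1.status = 'OPEN' with 4 distinct statuses
card1 = t1_rows * eq_selectivity(4)        # 2500 estimated rows

# step 2: join t2; a common model for t1.id = t2.t1_id is
# join selectivity = 1 / max(NDV(t1.id), NDV(t2.t1_id))
join_sel_12 = 1.0 / max(10_000, 9_000)
card2 = card1 * t2_rows * join_sel_12      # 12500 estimated rows

# step 3: join t3 the same way
join_sel_23 = 1.0 / max(2_000, 1_800)
card3 = card2 * t3_rows * join_sel_23      # 12500 estimated rows

print(card1, card2, card3)
```

Each step's cardinality estimate feeds the next, which is why a different join order can produce very different intermediate estimates and therefore a different plan.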
Related
Is there database index type (or data structure in general, not just B-tree) that provides efficient enumeration of objects sorted in arbitrarily customizable order?
In order to execute query like below efficiently
select *
from sample
order by column1 asc, column2 desc, column3 asc
offset :offset rows fetch next :pagesize rows only
DBMSes usually require composite index with the fields mentioned in "order by" clause with the same order and asc/desc directions. I.e.
create index c1_asc_c2_desc_c3_asc on sample(column1 asc, column2 desc, column3 asc)
The order of index columns does matter, and the index can't be used if the order of columns in "order by" clause does not match.
To make queries with every possible "order by" clause efficient we could create indexes for every possible combination of sort columns. But this is not feasible, since the number of indexes grows factorially with the number of sort columns.
Namely, if k is the number of sort columns, k! is the number of permutations of the sort columns and 2^k the number of possible asc/desc combinations, so the number of indexes is k!·2^(k-1). (We use 2^(k-1) instead of 2^k because we assume the DBMS is smart enough to scan the same index in both forward and reverse direction, but unfortunately this doesn't help much.)
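As a quick sanity check of that formula, here is a tiny sketch computing k!·2^(k-1) for small k:

```python
# Number of composite indexes needed to cover every "order by"
# over k sort columns, assuming reverse scans halve the count.
from math import factorial

def index_count(k):
    return factorial(k) * 2 ** (k - 1)

print([index_count(k) for k in range(1, 6)])
# k=3 gives 24; k=5 already needs 1920 indexes
```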
So, I wish to have something like
create universal index c1_c2_c3 on sample(column1, column2, column3)
that would have the same effect as 24 (k=3) plain indexes (that cover every "order by"), but consume reasonable disk/memory space. As for reasonable disk/memory space I think that O(k·n) is ok, where k is the number of sort columns, n is the number of rows/entries and assuming that ordinary index consumes O(n). In other words, universal index with k sort columns should consume approximately as much as k ordinary indexes.
What I want looks to me like multidimensional indexes, but when I googled this term I found pages that relate to either
ordinary composite indexes - this is not what I need for obvious reason;
spatial structures like k-d tree, quad/octo- tree, R-tree and so on, which are more suitable for the nearest-neighbor search problem rather than sorting.
I wanted to use Google Sheets to do a competition ranking which can help me to rank or sort the ranking automatically when I key in the Points.
However, ties can happen. If a tie happens, I will take the Score Difference (SD) into consideration: in a tie, the team with the lower SD ranks higher.
See below table for illustration:
For example: currently Team A and Team D have the highest PTS, so both of them are currently Rank 1. However, Team D has a lower SD compared to Team A. So I want it to automatically rank Team D as Rank 1 and Team A as Rank 2.
Is this possible?
One solution might be to create a hidden column with a formula like:
=PTS * 10000 - SD
(Replacing PTS and SD with the actual cell references)
Multiplying PTS by 10000 ensures it has a higher priority than SD.
We want to reward low SDs, so we subtract instead of add.
Finally, in the rank column, we can use a formula like:
=RANK(HiddenScoreCell, HiddenScoreColumnRange, 0)
So, for example, if the HiddenScore column is column K, the actual formula for row 2 might look like
=RANK(K2, K:K, 0)
The third parameter is 0 because we want higher scores to get a lower rank number (rank 1 is best).
To sort, you can just apply a sort on the Rank column.
With sort() you can define multiple sorting criteria (see the documentation), e.g.
=sort(A2:I5,8,false,7,false)
This sorts your table (in A2:I5; change accordingly) first on PTS, descending, then on SD, descending. You can add more criteria with more pairs of parameters (a column index, then TRUE for ascending or FALSE for descending).
Then you need to compare your team name with the sorted table and find its rank in the sorted list:
=ArrayFormula(match(A2:I5,sort(A2:I5,8,false,7,false),0))
Paste that formula in I2 (assuming your table starts in A1 with its headers, otherwise adjust accordingly).
=ARRAYFORMULA(IF(LEN(A2:A), RANK(H2:H*9^9-G2:G, H2:H*9^9-G2:G), ))
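Both the `PTS * 10000 - SD` helper column and the `H2:H*9^9-G2:G` array formula rely on the same composite-key trick. Here is a plain-Python sketch of it (team names and scores are invented for illustration):

```python
# Composite-key ranking: scale PTS so it always dominates SD,
# then subtract SD so that, within a PTS tie, lower SD ranks higher.
teams = {"A": (10, 5), "B": (8, 2), "C": (10, 9), "D": (10, 3)}  # name: (PTS, SD)

def hidden_score(pts, sd):
    # 10000 (or 9^9 in the sheet formula) just needs to exceed any possible SD
    return pts * 10000 - sd

ranked = sorted(teams, key=lambda t: hidden_score(*teams[t]), reverse=True)
print(ranked)  # ['D', 'A', 'C', 'B'] -- D beats A on lower SD at equal PTS
```

The multiplier only has to be larger than the biggest possible SD; otherwise a huge SD could leak into the PTS "digit" and corrupt the ordering.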
Is there any way to subtract values from two different columns using RelaX (a relational algebra online calculator)? I have tried using projection and group by, as well as a few examples I saw here on SO. I am trying to subtract the average wage from the wage of each employee.
The RelaX projection operator takes a list of expressions giving the column values of each row returned. Those expressions can be just column names but they don't have to be. (As with an SQL select clause.)
From the help link:
projection
Expressions can be used to create more complex statements using one or more columns of a single row.
pi c.id, lower(username)->user, concat(firstname, concat(' ', lastname))->fullname (
ρ c ( Customer )
)
Value expressions
With most operators you can use a value-expression which connects one or more columns of a single row to calculate a new value. This is possible for:
the projection creating a new column (make sure to give the column a name)
the selection any expression evaluating to boolean can be used
for the joins any expression evaluating to boolean can be used; note that the rownum() expression always represents the index of the lefthand relation
PS RelaX is properly a query language, not an algebra. Its "value expressions" are not evaluated to a value before the call. That begs the question of how we would implement a language using an algebra.
From Is multiplication allowed in relational algebra?:
Some so-called "algebras" are really languages because the expressions don't only represent the results of operators being called on values. Although it is possible for an algebra to have operand values that represent expressions and/or relation values that contain names for themselves.
The projection that takes attribute expressions begs the question of its implementation given an algebra with projection only on a relation value and attribute names. This is important in an academic setting because a question may be wanting you to actually figure out how to do that, or because the difficulty of a question is dependent on the operators available. So find out what algebra you are supposed to use.
We can introduce an operator on attribute values when we only have basic relation operators taking attribute names and relation values. Each such operator can be associated with a relation value that has an attribute for each operand and an attribute for the result. The relation holds the tuples where the result value is equal to the result of the operator called on the operand values. (The result is functionally dependent on the operands.)
From Relational Algebra rule for column transformation:
Suppose you supply the division operator on values of a column in the form of a constant base relation called DIVIDE holding tuples where dividend/divisor=quotient. I'll use the simplest algebra, with headings that are sets of attribute names. Assume we have input relation R with column c & average A. We want the relation like R but with each column c value set to its original value divided by A.
/* rows where
EXISTS dividend [R(dividend) & DIVIDE(dividend, A, c)]
*/
PROJECT c (
RENAME c\dividend (R)
NATURAL JOIN
RENAME quotient\c (
PROJECT dividend, quotient (SELECT divisor=A (DIVIDE))))
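The algebra expression above can be simulated with sets of tuples. This is only a sketch of the idea: DIVIDE is conceptually an infinite constant relation, so the code below materializes just the finite fragment whose dividend values actually occur in R:

```python
# Simulating "division as a constant base relation" with Python sets.
# R has a single column c; A is the pre-computed average we divide by.
R = {(8,), (4,), (2,)}
A = 2

# Finite fragment of DIVIDE(dividend, divisor, quotient):
# only the tuples whose dividend appears in R and whose divisor is A.
DIVIDE = {(d, A, d / A) for (d,) in R}

# SELECT divisor=A (DIVIDE), then PROJECT dividend, quotient
pairs = {(d, q) for (d, s, q) in DIVIDE if s == A}

# RENAME c\dividend (R) NATURAL JOIN pairs, then PROJECT the quotient as c
result = {(q,) for (d,) in R for (d2, q) in pairs if d == d2}

print(sorted(result))  # each c value of R divided by A
```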
From Relational algebra - recode column values:
To introduce specific values into a relational algebra expression you have to have a way to write table literals. Usually the necessary operators are not made explicit, but on the other hand algebra exercises frequently use some kind of notation for example values.
I have a set of tables that are related (parent child relationships).
I need a solution wherein I can quickly find whether two tables are related.
Also, if they are related, I need to find out whether the relationship is parent-child or child-parent.
My solution:
Store the relationship details in the form of a matrix.
Say there are three tables T1, T2 and T3. T1 has two children T2 and T3.
Then I can represent the relationship as
{{0,1,1},
{-1,0,0},
{-1,0,0}}
The first row and first column represent T1.
The second row and second column represent T2.
The third row and third column represent T3.
To find the relationship between T1 and T2 you go to the first row and second column. The value is 1. This shows that T1 is the parent and T2 is the child.
A -1 would indicate that the first table is the child and the second table is the parent.
A 0 would indicate that the two tables are not related.
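The matrix encoding described above can be sketched in a few lines (the helper function and table names are just for illustration):

```python
# Adjacency-matrix encoding of parent-child relationships between tables.
tables = ["T1", "T2", "T3"]
idx = {t: i for i, t in enumerate(tables)}

# rel[i][j] ==  1 -> row table is the parent of the column table
# rel[i][j] == -1 -> row table is the child of the column table
# rel[i][j] ==  0 -> the two tables are not related
rel = [
    [ 0, 1, 1],
    [-1, 0, 0],
    [-1, 0, 0],
]

def relationship(a, b):
    v = rel[idx[a]][idx[b]]
    return {1: "parent-child", -1: "child-parent", 0: "unrelated"}[v]

print(relationship("T1", "T2"))  # parent-child
print(relationship("T2", "T1"))  # child-parent
print(relationship("T2", "T3"))  # unrelated
```

Lookups are O(1), at the cost of O(n²) space in the number of tables; note the matrix is antisymmetric (rel[i][j] == -rel[j][i]), so storing one triangle would suffice.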
Is there a better solution to this problem?
Let's say we have a large dataset that has to get into the SQLite database, 250 million items. Let's say the table is
create table foo (myInt integer, name text)
and myInt is indexed but is not unique. There's no primary key.
The values are between 1 and 250000000 and duplicates are very very rare but not impossible. That is intentional/by design.
Given the way the b-tree algorithms work (and ignoring other factors) which is the faster insert, and why?
(a) dataset is first sorted on myInt column (ascending or descending) and the data
rows are then inserted in their pre-sorted order into SQLite
(b) dataset is inserted in a totally random order
Absolutely (a).
Random insertion into a b-tree is much slower: with pre-sorted keys every insert lands on the right-most leaf, so page splits stay localized and the pages being written remain in cache, while random keys touch leaf pages scattered across the whole tree.
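A rough way to see this for yourself with Python's built-in sqlite3 module is sketched below. The row count is scaled far down from 250 million, and at this small, in-memory scale the gap is much narrower than it would be for an on-disk database larger than the page cache, so treat the timings as indicative only:

```python
# Compare inserting pre-sorted vs randomly ordered keys into an indexed table.
import random
import sqlite3
import time

def load(rows):
    con = sqlite3.connect(":memory:")
    con.execute("create table foo (myInt integer, name text)")
    con.execute("create index foo_ix on foo(myInt)")  # myInt indexed, not unique
    t0 = time.perf_counter()
    with con:  # single transaction around all inserts
        con.executemany("insert into foo values (?, ?)",
                        ((i, "x") for i in rows))
    return time.perf_counter() - t0

n = 100_000  # scaled down from 250M for a quick run
data = list(range(1, n + 1))
sorted_time = load(data)      # (a) ascending key order
random.shuffle(data)
random_time = load(data)      # (b) random key order
print(f"sorted: {sorted_time:.3f}s  random: {random_time:.3f}s")
```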