Procedural and non-procedural query language difference - relational-algebra

Going through the relational algebra, I encountered the term "procedural query language".
So what is the difference between a procedural query language and a non-procedural query language?

There is a myth that relational algebra notation is procedural and relational calculus notation is not. But every algebra expression corresponds to a calculus expression with the same tree structure, so the algebra cannot be procedural if the calculus is not. You can implement/execute a query in either notation per its expression tree--or not.
A (query) language is procedural when it has to use looping or otherwise relies on state. The alternative is often called declarative or functional.
Any database notation that updates the database is procedural, including SQL. But that's not "querying". Typically DBMSs have extensions to SQL that allow you to partially control query execution and/or data storage order in terms of implementation concepts; that's procedural. But that's not SQL.

In a procedural query language, like Relational Algebra, you write a query as an expression consisting of relations and algebra operators, like join, cross product, projection, restriction, etc. As in an arithmetical expression (e.g. 2 / (3 + 4)), the operators have an order (in the example, the addition is performed before the division). So, for instance, you join the results of two different projections, and then perform a restriction, etc. A language like this is called procedural since each expression establishes a certain order in which its operators are performed.
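A small worked example of such an expression (the relation and attribute names are invented purely for illustration):
σ_{total > 100}( π_{id, name}(Customers) ⋈ π_{id, total}(Orders) )
Here the two projections are computed first, then their join, then the restriction on total: the structure of the expression itself fixes the order of the operators.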
On the contrary, query languages like Relational Calculus, and the well-known SQL query language, are called "non-procedural" since they express the expected result only through its properties, and not through the order of the operators to be performed to produce it. For instance, with an SQL expression like:
SELECT t1.b
FROM t1
WHERE t1.b > 10
we specify that we want all the tuples of relation t1 for which t1.b > 10 is true, and from these we want the value of t1.b, but we do not specify whether the projection must be performed first and then the restriction, or the restriction first and then the projection. Imagine a complex SQL query, with many joins, conditions, restrictions, etc. Many different orders of executing the query could be devised (and in effect the task of the query optimizer is precisely to devise an efficient order in which to perform these operations, thus transforming this declarative query into a procedural one).
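To see that freedom concretely, here is a sketch of two logically equivalent rewritings of that query (same t1 and b as above); nothing in the original query forces the optimizer to pick one over the other:
SELECT b FROM (SELECT * FROM t1 WHERE b > 10) restricted   -- restriction first, then projection
SELECT b FROM (SELECT b FROM t1) projected WHERE b > 10    -- projection first, then restriction
Both describe exactly the same result set; only the evaluation order differs.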

Related

How does predicate pushdown work exactly?

Could anyone please explain with examples how exactly predicate pushdown works?
Say you want to execute a query
SELECT
SUM(price)
FROM sales
WHERE
purchase_date BETWEEN '2018-01-01' and '2018-01-31';
A very trivial implementation of a query engine is to iterate over all Parquet/ORC files, deserialize the price and purchase_date columns, apply the predicate on purchase_date, and sum the filtered rows.
Parquet (not sure about ORC) maintains statistics on the columns in each file, so if the execution engine is smart enough, it can look at the min/max of purchase_date within the statistics and determine whether any rows are going to match. For example, if purchase_date.min=2014-05-05 and purchase_date.max=2014-05-06, it can deduce that the predicate will always evaluate to false.
In other words, it can skip Parquet files by combining the statistics and the filter predicate. This can lead to massive performance gains, because IO (file or memory) is usually the bottleneck. The gain is inversely proportional to the selectivity (the percentage of matching rows).
The term predicate push-down comes from the fact that you're "hinting" the scan operator with the predicate that is then going to be used to filter the rows of interest. Or, pushing the predicate to the scan.
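To make the min/max overlap test concrete, here is a hedged sketch: if the per-file statistics were exposed as a table (file_stats and its columns are hypothetical, invented only for illustration), file skipping amounts to:
SELECT file_name
FROM file_stats                                  -- hypothetical per-file statistics
WHERE purchase_date_min <= DATE '2018-01-31'
  AND purchase_date_max >= DATE '2018-01-01'
Only the files this returns need to be opened, deserialized, and filtered row by row; all the others are skipped without touching their data at all.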

Order of multiple conditions in where clause in oracle [duplicate]

Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more selective (closer to unique), and FirstName is less so.
If I do two searches:
select * from PEOPLE where FirstName = 'F' and LastName = 'L'
select * from PEOPLE where LastName = 'L' and FirstName = 'F'
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
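For example, any of these would do (a minimal sketch against the PEOPLE table from the question, SQL Server syntax assumed; the index names are invented):
CREATE INDEX IX_People_LastName_FirstName ON PEOPLE (LastName, FirstName)
CREATE INDEX IX_People_FirstName_LastName ON PEOPLE (FirstName, LastName)
CREATE INDEX IX_People_LastName ON PEOPLE (LastName)
CREATE INDEX IX_People_FirstName ON PEOPLE (FirstName)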
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
The order of WHERE clauses should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were always evaluated first, then only numeric table names would be cast to integers. However, it can fail, providing a clear example that SQL Server (like other databases) does not guarantee the order in which the conditions in the WHERE clause are evaluated.
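If you need that short-circuit behaviour, one way to get it (a hedged sketch; TRY_CAST requires SQL Server 2012 or later) is to make the cast itself safe instead of relying on evaluation order:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and TRY_CAST(table_name as int) <> 0
TRY_CAST returns NULL instead of raising an error for non-numeric names, so the query no longer depends on which condition is evaluated first.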
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
No, all RDBMSs first analyse the query and optimize it, reordering your WHERE clause as needed.
Depending on which RDBMS you are using, you can display the result of that analysis (search for EXPLAIN PLAN in Oracle, for instance).
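For example, in Oracle (a minimal sketch using the query from the question):
EXPLAIN PLAN FOR
select * from PEOPLE where FirstName = 'F' and LastName = 'L';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
The displayed plan shows the access path the optimizer chose, regardless of the order in which the conditions were written.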
M.
It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong, though. To find out which way round to do it, which could differ every time, the DBMS would have to run a distinct-count query for each column and compare the numbers; that would cost more than just shrugging and getting on with it.
Original OP statement
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first sql.
I guess you are confusing this with choosing the order of columns when creating an index, where you have to put the most selective column first, then the next most selective, and so on.
BTW, for the above two queries the SQL Server optimizer will not do any optimization but will use a trivial plan, as long as the total cost of the plan is less than the cost threshold for parallelism.

ORA-01795 - why is the maximum number of expressions limited to 1000

SO is full of work-arounds, but I'm wondering about the historical reasons behind the 1000 limit for "maximum number of expressions" in an IN clause?
It might be because there is potential for abuse with tons of values, and every value in the list will be transformed into an equivalent OR condition.
For example, NAME IN ('JOHN', 'CHARLES'..) would be transformed into NAME = 'JOHN' OR NAME = 'CHARLES'
So it might impact performance.
But note Oracle still supports
SELECT ID FROM EMP WHERE NAME IN (SELECT NAME FROM ATTENDEES)
In this case, the optimizer doesn't convert the list into multiple OR conditions, but makes a JOIN instead.
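A rough sketch of equivalent formulations the optimizer can reach for in that case (same EMP and ATTENDEES tables as above):
SELECT ID FROM EMP WHERE EXISTS (SELECT 1 FROM ATTENDEES WHERE ATTENDEES.NAME = EMP.NAME)
or, as an explicit join over the distinct names:
SELECT EMP.ID FROM EMP JOIN (SELECT DISTINCT NAME FROM ATTENDEES) A ON A.NAME = EMP.NAME
Neither form runs into the 1000-expression limit, because there is no literal expression list at all.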
This restriction applies not only to IN lists, but to any expression list. The documentation says:
A comma-delimited list of expressions can contain no more than 1000 expressions.
Your question is WHY the limit is 1000. Why not 100 or 10000 or a million? I guess it relates to the limit on the number of columns in a table, which is 1000. Perhaps Oracle internally ties the two together, so that an expression list can always be matched against the columns of a DML statement.
But for a good design, the 1000 limit itself is big. Practically, you won't reach it.
And, a quote from the famous AskTom site on a similar topic,
We'll spend more time parsing queries than actually executing them!
Update: my own thoughts
I think Oracle is old enough in DB technology that these limits were set once and nobody ever had to think about them again. All expression lists have the 1000 limit. And a robust design never forces users to ask Oracle for an explanation. Tom's answer about parsing always makes me think that the purpose of this limit, back in the 70s or 80s, was more a computational issue: the C-based algorithms may have needed some limit, and Oracle came up with 1000.
Update 2: from the application/framework point of view
As a DBA, I have seen many developers approach me with performance issues which are actually issues with the application framework generating the queries that fetch the data from the database. The application lets users add filters, which eventually form the AND/OR logic within the IN list of the query. Internally, Oracle expands it as OR logic during query rewrite in the optimization stage. The query becomes huge, which increases the time to PARSE it, and most of the time it suppresses index usage. So this is one of the cases where a query with a huge IN list is generated via the application framework.

Is there a way to analyse how a particular Linq-to-objects query will execute?

In the past, I've written Linq to SQL queries that haven't performed well. Using SQL Profiler (or similar) I can look at how my query is translated to SQL by intercepting it at the database.
Is there a way to do this with Linq queries that operate solely on objects?
As an example, consider the following Linq query on a list of edges in a directed graph:
var outEdges = from e in Edges
               where e.StartNode.Equals(currentNode) &&
                     !(from d in deadEdges select d.StartNode).Contains(e.EndNode)
               select e;
That code is supposed to select all edges that start from the current node except for those that can lead to a dead edge.
Now, I have a suspicion that this code is inefficient, but I don't know how to prove it apart from analysing the MSIL that's generated. I'd prefer not to do that.
Does anyone know how I could do this without SQL?
Edit:
When I talk about inefficiency, I mean inefficiency in terms of "Big O" notation or asymptotic notation. In the example above, is the code executing the Linq in O(n) or O(n log m) or even O(n.m)? In other words, what's the complexity of the execution path?
With Linq to SQL, I might see (for example) that the second where clause is being translated as a subquery that runs for each edge rather than a more efficient join. I might decide not to use Linq in that case or at least change the Linq so it's more efficient with large data sets.
Edit 2:
Found this post - don't know how I missed it in the first place. Just searching for the wrong thing I guess :)
I don't think you need a profiler for that...
Linq to SQL (or Linq to Entities) queries are translated to another language (SQL) and then executed using an optimized execution plan, so it's hard to see exactly what happens; for that kind of scenario, a profiler can be helpful. On the other hand, Linq to Objects queries are not translated; they are executed "as is". A Linq to Objects query using the SQL-like syntax is just syntactic sugar for a series of method calls. In your case, the full form of the query would be:
var outEdges = Edges.Where(e => e.StartNode.Equals(currentNode) &&
                                !deadEdges.Select(d => d.StartNode).Contains(e.EndNode));
So, basically, you iterate over Edges, and for each item in Edges you iterate over deadEdges. So the complexity here is O(n.m), where n is the number of items in Edges and m is the number of items in deadEdges.

Database and EF performance concern?

I have a basic SQL select question that people have given me different answers to over the years. Say I have a couple of tables, each designed with over 40 columns, that will potentially hold tens of thousands of rows; I'm using SQL Server 2005.
On joining these tables, in the where clause if I have things like
select * from t1, t2
where t1.UserID = 5
and t1.SomeID = t2.SomeOtherID
some people say you should always have the constant "t1.UserID = 5" up front rather than after the "t1.SomeID = t2.SomeOtherID", because it boosts the select performance, while others say it doesn't matter.
What is the correct answer?
Also, if I use ADO.NET Entity Framework to implement my DAL, will modeling tables that have over 40 columns and doing CRUD operations be a performance issue to it?
Thank you,
Ray.
In general, with database optimization, you should write SQL which is conceptually correct first, and then tweak performance if profiling shows it to be necessary. When doing an inner join, it is better to use SQL-92, explicit INNER JOINs than Cartesian products. So I would begin by writing your SQL as follows:
SELECT *
FROM t1
INNER JOIN t2
ON t1.SomeID = t2.SomeOtherID
WHERE
t1.UserID = 5
The t1.SomeID = t2.SomeOtherID condition goes in the ON part of the INNER JOIN because it expresses the relationship between the two tables. The UserID condition goes in the WHERE clause because it is a filter that limits the result set. Writing your SQL in this way gives more information to the database optimizer, because it expresses your intentions about the join versus the filtering.
Now IF you are not getting acceptable performance with this syntax in a real-world database, then do feel free to experiment with moving bits around. But like I said, start with something which is conceptually correct.
With regards to the second part of your question, the most obvious performance implication is that when you select a collection of entities, the Entity Framework needs to bring back all properties for the entities it is materializing. So if you have 40 columns, then you will be pulling that data back over the wire, if you materialize them as entities. It is, however, possible to write LINQ queries which return anonymous types containing only the columns you need. However, to do full CRUD, you will need to return entities.
People's opinion on this will change over time because RDBMS query optimisation has evolved, and different RDBMSs have different approaches. I can't speak for every system out there, but it's really unlikely that in 2008 this is going to make any difference. YMMV if you are interested only in a specific system.
I can tell you that for any recent version of Oracle it makes no difference.
I know this answer is kind of trite, but I would suggest writing benchmarks. Whip up a console app and test it out yourself. Run the query a couple hundred times and see how long it takes for each way.
There is a lot of superstition when it comes to SQL query performance and optimization. Some people do things thinking it is faster but they don't actually check their facts. Also, the way EF or LinqToSql work and interact with the DB may introduce performance differences not evident in SQL.
If you're optimizing code you may also want to use a profiler like RedGate ANTS. It's not free, but it can help a lot in finding bottlenecks in your code. Then you can find places in your code to optimize much more easily. It's not always your database slowing your apps down. Or sometimes you're executing a fast query, but doing it a jillion times when you could actually be caching the result.
Firstly, construct the query using an explicit JOIN syntax, rather than the cartesian product. It probably won't make any difference performance-wise for any modern optimiser, but it does make the information on how the JOINs work more accessible for the programmers.
SELECT Player.Name, Game.Date
FROM Player
INNER JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
WHERE Game.WinnerFrags > Game.TotalFrags/2
ORDER BY Player.Name
This will give us all the players, sorted by name, who have taken more frags in a game than all the other players in the game put together, and the dates of the games. Putting both conditions in the JOIN probably won't affect performance either, since the optimiser is likely to do the filtering as part of the JOIN anyway. It does start to matter for LEFT JOINs, though. Let's say we're looking for how many games the week's top ten players have ever won by the margin described above. Since it is possible that some of them have never won this spectacularly, we'll need a LEFT JOIN.
SELECT Player.WeekRank, Player.Name, COUNT(Game.*) AS WhitewashCount
FROM Player
LEFT JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
WHERE Player.WeekRank <= 10
AND Game.WinnerFrags > Game.TotalFrags/2
GROUP BY Player.WeekRank, Player.Name
ORDER BY Player.WeekRank
Well, not quite. The JOIN will return records for each game played by a player, or the player data and NULL game data if the player has played no games. These results will get filtered, during or after the JOIN depending on the optimiser's decision, based on the frag criteria. This will eliminate all the records that don't meet the frag criteria. So there will be no records to group for players who have never had such a spectacular win. Effectively creating an INNER JOIN .... FAIL.
SELECT Player.WeekRank, Player.Name, COUNT(Game.*) AS WhitewashCount
FROM Player
LEFT JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
AND Game.WinnerFrags > Game.TotalFrags/2
WHERE Player.WeekRank <= 10
GROUP BY Player.WeekRank, Player.Name
ORDER BY Player.WeekRank
Once we move the frag criteria into the JOIN the query will behave correctly, returning records for all players in the week's top ten, irrespective of whether they've achieved a whitewash.
After all of that, the short answer is:
For INNER JOIN situations it probably doesn't make a performance difference where you put the conditions. The queries are more readable if you separate the join and filtering conditions, though. And getting a condition in the wrong place can seriously mess up the results of a LEFT JOIN.

Resources