Non aggregate functions, relational algebra - relational-algebra

How can we translate the non-aggregate functions of Structured Query Language into relational algebra expressions? I know how to express the aggregate functions, but what about the non-aggregate ones?
For example, how can we write the Year() function applied to a date column? Just Year(date)?
select e.name,year(e.dateOfEmployment) from Employees e
Thanks!

(This is a very reasonable question, I don't understand why it should get downvoted.)
The "Relational" in RA means expressing functions as mathematical relations -- using a set-theoretic approach. (It doesn't mean, as often thought, relating one table or datum to another as in "Entity-Relationship" modelling.) I can't grab a very succinct reference for this off the top of my head, but start here http://en.wikipedia.org/wiki/Binary_relation and follow the links.
How does this get to answer your question in context of a practical RA? Have a look at this:
http://www.dcs.warwick.ac.uk/~hugh/TTM/APPXA.pdf, especially the section Treating Operators as Relations.
See how the relations PLUS and SQRT can be 'applied' (using COMPOSE, which is a shorthand for Natural JOIN and PROJECT) to behave as a function.
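For your specific question, you need a relation with two attributes (of types Date and Year) that pairs each date with its year. Joining it to Employees on the date attribute and projecting the name and year then behaves like Year(date).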
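Here is a rough sketch of that idea in Python rather than in a real RA language. Everything in it is made up for illustration (the sample employees, the year_of relation, the natural_join and project helpers); a true YEAR relation would conceptually hold a tuple for every possible date, whereas this one only covers the dates that occur in the data.

```python
from datetime import date

# A relation is modelled as a list of dicts (attribute name -> value).
# The sample rows are invented for the example.
employees = [
    {"name": "Ann", "dateOfEmployment": date(2010, 3, 1)},
    {"name": "Bob", "dateOfEmployment": date(2015, 7, 15)},
]

# The function Year() written out as a relation with two attributes.
# Restricted here to the dates that actually appear in employees.
year_of = [
    {"dateOfEmployment": d, "year": d.year}
    for d in {e["dateOfEmployment"] for e in employees}
]


def natural_join(r, s):
    """Pair up tuples that agree on every attribute the two relations share."""
    result = []
    for a in r:
        for b in s:
            common = a.keys() & b.keys()
            if all(a[k] == b[k] for k in common):
                result.append({**a, **b})
    return result


def project(r, attrs):
    """Keep only the named attributes and drop duplicate tuples."""
    seen = []
    for t in r:
        row = {k: t[k] for k in attrs}
        if row not in seen:
            seen.append(row)
    return seen


# JOIN followed by PROJECT "applies" the relation like a function.
result = project(natural_join(employees, year_of), ["name", "year"])
```

The join matches each employee with the single year_of tuple for their date, and the projection discards the date, leaving name and year.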

Related

Linq to sql/xml , to object ,and to dataset comparison

Hello,
I have read a lot of LINQ articles, but one thing I still don't understand is the difference between the LINQ flavours (LINQ to SQL, LINQ to XML, LINQ to Objects, LINQ to DataSet).
I need a simple comparison that clarifies this, especially from a memory point of view.
Thanks
"Linq" stands for "Language Integrated Query" --- basically, it means that the query syntax keywords (from, where, select, etc.) are now officially part of the language.
Now, at the highest level, a query against an array or a database table is doing essentially the same thing --- but the actual mechanics of how the query happens are quite different.
Linq2Sql, Linq2Object et al, are different subsystems which allow very different queries to be expressed using a common syntax.

Compound "from" clause in Linq query in Entity Framework

I've been working with Entity Framework for a few weeks now. I have been working with Linq-Objects and Linq-SQL for years. A lot of times, I like to write linq statements like this:
from student in db.Students
from score in student.Scores
where score > 90
select student;
With other forms of linq, this returns distinct students who have at least one score greater than 90. However, in EF this query returns one student for every score greater than 90.
Does anyone know if this behavior can be replicated in unit tests? Is it possible this is a bug in EF?
I don't like that SQL-like syntax (I have no better name for it), especially when you start nesting them.
var students = db.Students
    .Where(student => student.Scores.Any(score => score > 90))
    .ToList();
This snippet, using the method syntax, does the same thing. I find it far more readable. It's more explicit in the order of operations used.
And as far as I have experienced, EF hasn't yet shown a bug with its selection using method syntax.
Edit
To actually answer your problem:
However, in EF this query returns one student for every score greater than 90.
I think this is due to a JOIN statement used in the final SQL that will be run. This is why I avoid SQL-like syntax, because it becomes very hard to differentiate between what you want to retrieve (students) and what you want to filter with (scores).
Much like you would do in SQL, you are joining the data from students and scores, and then running a filtering operation on that collection. It then becomes harder to separate that result back into a collection of students. I think this is the main cause of your issue. It's not a bug per se, but I think EF can only handle it one way.
Alternative solutions to the above:
If it returns one student per score over 90, take the distinct students returned. It should be the same result set.
Use more explicit parentheses () and formatting to nest separate SQL-like statements.
Note: I'm not saying it can't be done with SQL-like syntax. I am well aware that most of this answer is opinion based.
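To make the difference between the two query shapes concrete, here is a small Python analogue. It is not EF or LINQ; the Student class and the sample data are invented, and it only illustrates the flattening behaviour described above: iterating over (student, score) pairs yields one row per matching score, while an "any" test yields one row per student.

```python
from dataclasses import dataclass, field


@dataclass
class Student:
    name: str
    scores: list = field(default_factory=list)


# Invented sample data.
students = [
    Student("Ann", [95, 98, 70]),
    Student("Bob", [60, 85]),
    Student("Cy", [91]),
]

# Analogue of the two "from" clauses: one row per (student, score) pair
# that passes the filter, so Ann shows up twice.
per_score = [s.name for s in students for score in s.scores if score > 90]

# Analogue of Where(student => student.Scores.Any(...)): one row per student.
per_student = [s.name for s in students if any(score > 90 for score in s.scores)]

# Analogue of taking the distinct students from the flattened result.
distinct = list(dict.fromkeys(per_score))
```

With this data, the flattened form lists Ann twice, and both the "any" form and the de-duplicated form list each qualifying student once.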

(var)char as the type of the column for performance?

I have a column called "status" in PostgreSQL. First it used to be "status_id" of type integer. The values were kept on client, so there was no table on the server called statuses where I'd keep those statuses and then do inner join with the first table.
I used to send the ids of the statuses from the client (they had the names on the client). However, at some point I understood I'd better make the server hold those statuses. Not in a separate table but in the first one and I want to make them strings. So the initial table will have a status column of type string (varchar, to be more specific). I read it wouldn't be that slow.
In general, is it a good idea? I suppose it is because doing inner join (in case I'd keep statuses in the separate table) each time is expensive as well as sending ids from the client.
1) The only concern I have is that the column status should be of type char, not varchar. It should make it more effective I suppose. Is that so?
2) If the first case is correct then I'm not sure I'll be able to name all the statuses using exactly the same amount of characters, let's say, 5 characters. Some of them might be longer, some shorter. How can I solve this?
UPDATE:
It's not denormalization because I'm talking about one single table. There is not, and never has been, a second table called Statuses with the fields (id, status_name).
What I'm trying to convey is that I could use char(n) for status_name and also add an index on it. Then it should be fast enough. However, it may or may not be possible to name all the statuses with a fixed number (n) of characters, and that's the only concern.
I don't think using char or varchar instead of integer is a good idea. It is hard to predict how much slower it will be than an integer PK, but this design will be slower, and the impact will be worse when you join larger tables. If you can, use ENUM types instead.
http://www.postgresql.org/docs/9.2/static/datatype-enum.html
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE person (
    name text,
    current_mood mood
);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
 name | current_mood
------+--------------
 Moe  | happy
(1 row)
PostgreSQL's varchar and char types are very similar. The internal implementation is the same; paradoxically, char can be a little slower because of the padding with spaces.
I'd go one step further. Never use the outdated data type char(n), unless you know you have to (for compatibility or some rare exotic reason). The type is utterly useless in a modern database. Padding strings with blank characters is nonsense, and if you have to do it, you can do it in a cheaper fashion with rpad() on data retrieval.
SELECT rpad('short', 10) AS char_10_string;
varchar is basically the same as text and allows a length specifier: varchar(n). I generally use just text. If I need to limit the length, I use a CHECK constraint. Here's one example, why.
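As a rough illustration of limiting length with a CHECK constraint on a plain text column, here is a sketch that uses Python's built-in sqlite3 module, not PostgreSQL, so it runs without a server. The orders table, the 10-character limit and the status values are made up; the PostgreSQL equivalent would use the same CHECK (length(status) <= 10) clause, but engine behaviour should be verified there.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " status TEXT NOT NULL CHECK (length(status) <= 10)"
    ")"
)

# A short status satisfies the constraint.
conn.execute("INSERT INTO orders (status) VALUES ('shipped')")

# A status longer than 10 characters violates it and is refused.
try:
    conn.execute("INSERT INTO orders (status) VALUES ('awaiting_payment')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT status FROM orders").fetchall()
```

The column stays an unpadded text type, and the limit can later be changed by replacing the constraint rather than changing the column type.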
Whenever you can use a simple integer (or enum) instead, that's a bit smaller and faster in every respect. Consider @Pavel's answer for enum.
As for:
because doing inner join (...) each time is expensive
Well, it carries a small cost, but it's generally cheaper than redundantly saving text representation of the status instead of a much cheaper integer in the main table. That kind of rumor is spread by people having problems understanding the concept of database normalization. The enum type is a compromise here - for relatively static sets of values.

LINQ - Using where or join - Performance difference?

Based on this question:
What is difference between Where and Join in linq?
My question is following:
Is there a performance difference in the following two statements:
from order in myDB.OrdersSet
from person in myDB.PersonSet
from product in myDB.ProductSet
where order.Persons_Id==person.Id && order.Products_Id==product.Id
select new { order.Id, person.Name, person.SurName, product.Model,UrunAdı=product.Name };
and
from order in myDB.OrdersSet
join person in myDB.PersonSet on order.Persons_Id equals person.Id
join product in myDB.ProductSet on order.Products_Id equals product.Id
select new { order.Id, person.Name, person.SurName, product.Model,UrunAdı=product.Name };
I would always use the second one just because it's clearer.
My question now is: is the first one slower than the second one?
Does it build a Cartesian product and filter it afterwards with the where clause?
Thank you.
It entirely depends on the provider you're using.
With LINQ to Objects, it will absolutely build the Cartesian product and filter afterwards.
For out-of-process query providers such as LINQ to SQL, it depends on whether it's smart enough to realise that it can translate it into a SQL join. Even if LINQ to SQL doesn't, it's likely that the query engine actually performing the query will do so - you'd have to check with the relevant query plan tool for your database to see what's actually going to happen.
Side-note: multiple "from" clauses don't always result in a Cartesian product - the contents of one "from" can depend on the current element of earlier ones, e.g.
from file in files
from line in ReadLines(file)
...
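For the in-memory case, the cost difference can be sketched in Python. This is only an analogue of what an in-memory provider has to do, not LINQ itself; the sample orders and people and both helper functions are invented. The first function examines every (order, person) pair and then filters, the second builds a keyed lookup first, which is roughly how an in-memory join avoids the full product.

```python
# Invented sample data: 6 orders spread over 3 people.
orders = [{"id": i, "person_id": i % 3} for i in range(6)]
people = [{"id": p, "name": f"person{p}"} for p in range(3)]


def cross_then_filter(orders, people):
    """Multiple 'from' clauses plus 'where': visit every pair, keep matches."""
    pairs_examined = 0
    out = []
    for o in orders:
        for p in people:
            pairs_examined += 1
            if o["person_id"] == p["id"]:
                out.append((o["id"], p["name"]))
    return out, pairs_examined


def keyed_join(orders, people):
    """'join ... on ... equals': index one side once, then look each key up."""
    by_id = {p["id"]: p for p in people}
    lookups = 0
    out = []
    for o in orders:
        lookups += 1
        p = by_id.get(o["person_id"])
        if p is not None:
            out.append((o["id"], p["name"]))
    return out, lookups


slow_rows, slow_steps = cross_then_filter(orders, people)
fast_rows, fast_steps = keyed_join(orders, people)
```

Both produce the same rows, but the first does len(orders) * len(people) comparisons while the second does one lookup per order.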
My question now is: is the first one slower than the second one? Does it build a Cartesian product and filter it afterwards with the where clause?
If the collections are in memory, then yes. There is no query optimizer for LinqToObjects - it simply does what the programmer asks in the order that is asked.
If the collections are in a database (which is suspected due to the myDB variable), then no. The query is translated into sql and sent off to the database where there is a query optimizer. This optimizer will generate an execution plan. Since both queries are asking for the same logical result, it is reasonable to expect the same efficient plan will be generated for both. The only ways to be certain are to
inspect the execution plans
or measure the IO (SET STATISTICS IO ON).
Is there a performance difference
If you find yourself in a scenario where you have to ask, you should cultivate tools with which to measure and discover the truth for yourself. Measure - not ask.

Data structure to hold HQL or EJB QL

We need to produce a fairly complex dynamic query builder for retrieving reports on the fly. We're scratching our heads a little on what sort of data structure would be best.
It's really nothing more than holding a list of selectParts, a list of fromParts, a list of where criteria, order by, group by, that sort of thing, for persistence. When we start thinking about joins, especially outer joins, having clauses, and aggregate functions, things start getting a little fuzzy.
We're building it up interfaces first for now and trying to think as far ahead as we can, but definitely will go through a series of refactorings when we discover limitations with our structures.
I'm posting this question here in the hopes that someone has already come up with something that we can base it on. Or know of some library or some such. It would be nice to get some tips or heads-up on potential issues before we dive into implementations next week.
I've done something similar a couple of times in the past. A couple of the bigger things spring to mind.
The where clause is the hardest to get right. If you divide things up into what I would call "expressions" and "predicates" it makes it easier.
Expressions - column references, parameters, literals, functions, aggregates (count/sum)
Predicates - comparisons, like, between, in, is null (predicates have expressions as children, e.g. expr1 = expr2). Then you also have composites such as and/or/not.
The whole where clause, as you can imagine, is a tree with a predicate at the root, with maybe sub-predicates underneath eventually terminating with expressions at the leaves.
To construct the HQL you walk the model (depth first usually). I used a visitor as I need to walk my models for other reasons, but if you don't have multiple purposes you can build the rendering code right into the model.
e.g. If you had
"where upper(column1) = :param1 AND ( column2 is null OR column3 between :param2 and :param3)"
Then the tree is
Root
  - AND
    - Equal
      - Function(upper)
        - ColumnReference(column1)
      - Parameter(param1)
    - OR
      - IsNull
        - ColumnReference(column2)
      - Between
        - ColumnReference(column3)
        - Parameter(param2)
        - Parameter(param3)
Then you walk the tree depth first and merge rendered bits of HQL on the way back up. The upper function for example would expect one piece of child HQL to be rendered and it would then generate
"upper( " + childHql + " )"
and pass that up to its parent. Something like Between expects three child HQL pieces.
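A minimal sketch of that tree and its depth-first rendering, written in Python for brevity. The class names follow the node names in the tree above, but the code itself is only an illustration: the rendering is built straight into the model rather than done with a visitor, and the exact spacing and parenthesisation of the output are arbitrary choices.

```python
from dataclasses import dataclass
from typing import List


class Node:
    """Base class: every node can render itself as a piece of HQL."""

    def render(self) -> str:
        raise NotImplementedError


@dataclass
class ColumnReference(Node):
    name: str

    def render(self) -> str:
        return self.name


@dataclass
class Parameter(Node):
    name: str

    def render(self) -> str:
        return ":" + self.name


@dataclass
class Function(Node):
    name: str
    arg: Node

    def render(self) -> str:
        # Expects exactly one rendered child and wraps it.
        return f"{self.name}({self.arg.render()})"


@dataclass
class Equal(Node):
    left: Node
    right: Node

    def render(self) -> str:
        return f"{self.left.render()} = {self.right.render()}"


@dataclass
class IsNull(Node):
    operand: Node

    def render(self) -> str:
        return f"{self.operand.render()} is null"


@dataclass
class Between(Node):
    value: Node
    low: Node
    high: Node

    def render(self) -> str:
        # Expects three rendered children.
        return f"{self.value.render()} between {self.low.render()} and {self.high.render()}"


@dataclass
class Composite(Node):
    operator: str  # "AND" or "OR"
    children: List[Node]

    def render(self) -> str:
        # Children are rendered first, then merged on the way back up.
        return "(" + f" {self.operator} ".join(c.render() for c in self.children) + ")"


where = Composite("AND", [
    Equal(Function("upper", ColumnReference("column1")), Parameter("param1")),
    Composite("OR", [
        IsNull(ColumnReference("column2")),
        Between(ColumnReference("column3"), Parameter("param2"), Parameter("param3")),
    ]),
])
hql = "where " + where.render()
```

Each composite wraps its own output in parentheses so that nesting AND and OR never changes the meaning, at the cost of one redundant outer pair.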
You can then re-use the expression model in the select/group by/order by clauses
You can skip storing the group by if you wish by just storing the select and before query construction scan for aggregate. If there is one or more then just copy all the non-aggregate select expressions into the group by.
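The group-by shortcut could look something like the following Python sketch. The SelectExpr type, the helper names, and the entity and alias names in the example query are all hypothetical; the from clause is passed in as a ready-made string just to keep the example short.

```python
from typing import NamedTuple, Optional


class SelectExpr(NamedTuple):
    """One select item: a column, optionally wrapped in an aggregate."""
    column: str
    aggregate: Optional[str] = None  # e.g. "sum", "count"; None if plain

    def render(self) -> str:
        if self.aggregate:
            return f"{self.aggregate}({self.column})"
        return self.column


def derive_group_by(select):
    """If any select item is an aggregate, group by all the non-aggregate ones."""
    if not any(e.aggregate for e in select):
        return []
    return [e for e in select if not e.aggregate]


def render_query(select, from_clause):
    hql = "select " + ", ".join(e.render() for e in select) + " from " + from_clause
    group_by = derive_group_by(select)
    if group_by:
        hql += " group by " + ", ".join(e.render() for e in group_by)
    return hql


# Hypothetical report: total per country and city.
select = [
    SelectExpr("c.country"),
    SelectExpr("c.city"),
    SelectExpr("o.total", "sum"),
]
query = render_query(select, "Order o join o.customer c")
```

Because the group by is derived at construction time, only the select list needs to be persisted, and the two can never drift out of sync.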
From clause is just a list of table reference + zero or more join clauses. Each join clause has a type (inner/left/right) and a table reference. Table reference is a table name + optional alias.
Plus, if you ever get into wanting to parse a query language (or anything really) then I can highly recommend ANTLR. Learning curve is quite steep but there are plenty of example grammars to look at.
HTH.
If you need an EJB-QL parser and data structures, EclipseLink (well, several of its internal classes) has a good one:
JPQLParseTree tree = org.eclipse.persistence.internal.jpa.parsing.jpql.JPQLParser.buildParserFor(" _the_ejb_ql_string_ ").parse();
JPQLParseTree contains all the data.
But generating EJB-QL back from a modified JPQLParseTree is something you have to do yourself.
