Library to convert SQL to relational algebra - ruby

There are many relational algebra packages (arel, axiom, alf) which generate SQL from an abstract representation of a query.
Are there any libraries that allow you to go the other way - from SQL to a relational algebra?

No, I wouldn't count on it.
SQL is a horrendous language; parsing it is an immense task, and parsing it for the purpose of recovering the original algebraic intent is considered infeasible by just about everyone, as far as I know.
And that's before getting into the various ways in which vendors turn it into what is effectively a completely proprietary language, despite a superficial resemblance to what is supposed to be a standard.
And even if such a package existed, what would you do with the output you obtained from it?

Apache Calcite might be what you're looking for.
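For example, Calcite's planner can parse a SQL string, validate it, and hand back a relational algebra tree (a RelNode). A minimal sketch, assuming you register your tables in the schema first (the emps table referenced in the query is such an assumption; validation will fail until it exists in the schema):

import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class SqlToRel {
    public static void main(String[] args) throws Exception {
        SchemaPlus rootSchema = Frameworks.createRootSchema(true); // empty; register tables here
        FrameworkConfig config = Frameworks.newConfigBuilder()
            .defaultSchema(rootSchema)
            .build();
        Planner planner = Frameworks.getPlanner(config);

        SqlNode parsed = planner.parse("SELECT name FROM emps WHERE age > 30");
        SqlNode validated = planner.validate(parsed);
        RelRoot root = planner.rel(validated);

        // Prints a LogicalProject / LogicalFilter / TableScan operator tree.
        System.out.println(RelOptUtil.toString(root.rel));
    }
}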

Related

My Database Design skills stink. Where to seek remedy?

I have a web site that's been progressively expanding in both traffic and complexity of database design. I've always worked as a developer first & foremost, and never really been much of a DB administrator beyond what I need to do to get my code running. This needs to change - I need to improve efficiency on the database side of things.
To give a vague example, I'm looking for how to go about learning:
Optimising complex tables/relationships for performance/scaling
How to index efficiently. (At the moment I throw indexes on foreign keys, and that's about it)
General design principles for complex databases
Most of the resources I've found are either directed more towards the basics of SQL ("this is a SELECT query, a JOIN, etc") or focus primarily on performance issues outside the DB.
So, I know this is a little vague - but where should I look to ensure my database is designed in the most efficient & sound manner possible?
Learn about data modeling. Choosing the right data structure is always a crucial first step, for programming in general and databases in particular. Performance cannot be "bolted" on top of a bad data structure! The ERwin Methods Guide is probably not a bad way to start learning about data modeling.
Learn how DBMSes organize data at the physical level. This will help you immensely in understanding how to "shape" your data for performance and how to effectively leverage many of the performance mechanisms modern DBMSes put at your disposal. Use The Index, Luke! is an excellent tutorial on the topic.
Learn how to efficiently access the database and make sure you really understand the client API that will be called from your code. Different APIs have their own idiosyncrasies, but they all share some common themes, such as parameter binding, query preparation and fetching. Even if you are "shielded" by an ORM from ever having to, say, bind parameters manually, this is still taking place "under the covers", and understanding it raises your ability to write performant code (see the sketch after these points).
Measure, measure, measure. Modern information systems are immensely complex and even experts find themselves making incorrect assumptions, so don't rely on assumptions!
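To make the parameter-binding point above concrete, here is a minimal ADO.NET (SqlClient) sketch; the connection string and the Users table are invented for illustration, and an ORM does the equivalent of this under the covers:

using System;
using System.Data.SqlClient;

class BindingDemo
{
    static void Main()
    {
        var connectionString = "Server=.;Database=Demo;Integrated Security=true"; // illustrative only

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT Name FROM Users WHERE Age > @age", conn))
        {
            // Bound parameter: the value travels separately from the SQL text,
            // which enables execution plan reuse and prevents SQL injection.
            cmd.Parameters.AddWithValue("@age", 30);

            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
        }
    }
}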
I would suggest some reading on performance tuning. It is very specialized depending on the database backend you use. But here are some books to consider:
SQL Server
http://www.amazon.com/Server-Query-Performance-Tuning-Distilled/dp/1590594215/ref=sr_1_2?s=books&ie=UTF8&qid=1334154710&sr=1-2
http://www.amazon.com/Performance-Tuning-Server-Dynamic-Management/dp/1906434476/ref=sr_1_12?s=books&ie=UTF8&qid=1334154710&sr=1-12
MySQL
http://www.amazon.com/High-Performance-MySQL-Optimization-ebook/dp/B0028N4W7Y/ref=sr_1_3?ie=UTF8&qid=1334154504&sr=8-3
Oracle
http://www.amazon.com/Oracle-Database-Release-Performance-Techniques/dp/0071780262/ref=sr_1_2?s=books&ie=UTF8&qid=1334154909&sr=1-2
General Performance Tuning
http://www.amazon.com/SQL-Performance-Tuning-Peter-Gulutzan/dp/0201791692/ref=sr_1_18?s=books&ie=UTF8&qid=1334154964&sr=1-18
First and foremost, I'd recommend learning how to use EXPLAIN and what its output means (a short example follows these points). Run it on your most common queries and study the output. Are the queries using sensible indexes? Are they using indexes at all? Queries that look very simple at a glance might end up being quite costly.
Next, I'd suggest finding your slowest queries. Postgres (for example) has a feature that allows you to log the SQL source for all queries that take longer than N seconds to run. Are they slow because they're unindexed, very complex, or operating on a huge amount of data?
Third, I'd look at the number of times a particular query is run. Are you using the database to store static data, and hitting a table over and over again to grab a record that never changes? You could probably cache the result somewhere.
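For the first two suggestions, a minimal Postgres-flavoured sketch (the table and column names are invented):

-- Ask the planner how it will execute a query, and actually run it to get real timings:
EXPLAIN ANALYZE
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at > now() - interval '7 days';

-- Log every statement that takes longer than one second
-- (a superuser setting; usually configured in postgresql.conf):
SET log_min_duration_statement = 1000;  -- milliseconds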

LINQ, Entity Framework and their usage

I had to develop a system for my university which includes tracking of almost all data (lectures, lecturers, teaching assistants, students, etc.).
My database has around 30 tables, and it's pretty complex.
I used EF and LINQ to handle connecting to the database and querying it.
But the deeper I get into it, the harder my queries become to write, not to mention maintain.
Here is the example of one query: http://pastebin.com/Za1cYMPa
It is pretty much in chaos as you can see.
So, am I misusing LINQ (LINQ can solve this, but in a different way), or is LINQ just not suited to more complex systems like this one?
This is general question, not one to solve particular problem.
Do you think the query will look better or be more maintainable if written in native SQL? In that case you can hide the query in a stored procedure. Advanced queries always get messy. You can reduce some of the complexity by hiding subqueries either in database views or in EF query views (a small example follows below). But if you have a highly normalized OLTP database and you need to do complex reporting / analysis / data mining queries, they will always be big and badly maintainable. That is the reason why OLAP systems exist (I didn't check the content of your query - just its length - so don't take this as a reason to build an OLAP cube).
More important is the performance of the query...
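For instance, a messy aggregate subquery can be hidden behind a view, which EF can then map like an ordinary (read-only) table; the names here are invented:

CREATE VIEW StudentAverages AS
SELECT s.StudentId, s.Name, AVG(g.Score) AS AvgScore
FROM Students s
JOIN Grades g ON g.StudentId = s.StudentId
GROUP BY s.StudentId, s.Name;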
In general, complexity can be reduced by abstracting your repetitive code into reusable, easily-maintainable components.
In your particular case, the complexity can be reduced by implementing the Repository and Specification patterns to make your code more consistent, easier to read, and easier to maintain.
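A minimal sketch of the idea (the type and member names here are invented, not taken from the articles below): each specification captures one piece of query logic in a small, testable class, and the repository applies it.

using System;
using System.Data.Entity;   // EF 4.x era; with EF Core this would be Microsoft.EntityFrameworkCore
using System.Linq;
using System.Linq.Expressions;

public class Student
{
    public int Id { get; set; }
    public bool IsActive { get; set; }
}

public interface ISpecification<T>
{
    Expression<Func<T, bool>> Criteria { get; }
}

public class ActiveStudentsSpec : ISpecification<Student>
{
    public Expression<Func<Student, bool>> Criteria => s => s.IsActive;
}

public class Repository<T> where T : class
{
    private readonly DbContext _context;
    public Repository(DbContext context) { _context = context; }

    // One query method serves every specification, so query logic lives
    // in small spec classes instead of sprawling inline LINQ.
    public IQueryable<T> Find(ISpecification<T> spec) =>
        _context.Set<T>().Where(spec.Criteria);
}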
There is a very helpful sequence of articles by Huy Nguyen explaining the repository pattern, the specification pattern and how to effectively combine them with the Entity Framework for better code. These are available at:
Entity Framework 4 POCO, Repository and Specification Pattern
Specification Pattern In Entity Framework 4 Revisited
Entity Framework 4 POCO, Repository and Specification Pattern [Upgraded to CTP5]
Entity Framework 4 POCO, Repository and Specification Pattern [Upgraded to EF 4.1]
Ooof, that is pretty nasty.
You should think about your database design. I believe having so many tables is causing your queries to be overly complicated. A well placed set of views (or stored procedures) can simplify the querying logic. However, this will not simplify your overall system, because at some point you have to connect all these tables.

Most powerful and unexpected benefit of Linq in .NET OOP/D?

Since learning about Linq and gaining experience in it, I find myself leaning on it more and more. It’s changing how I think about my classes. Some of these changes were expected (ex. Using collections more) but often unexpected (ex. Getting initial data for a class as an XElement and sometimes just keeping it there, processing it lazily.)
What is the most powerful and unexpected benefit of Linq to .NET OOP/D? I am thinking of Linq-to-objects and Linq-to-xml in particular, but include Linq-to-Entities/SQL too in so far as it has changed your class strategy.
I've noticed a couple of significant benefits from using LINQ:
Maintainability - it's much easier to understand what code does when you read a semantic transformation using LINQ, rather than some confusing looping constructs hand-written by a developer.
Performance - Because of LINQ's deferred and streaming execution, you often end up with code that performs better - either by distributing the workload, or by allowing unnecessary transformations to be avoided (particularly when only a subset of results is consumed). In the future, as multicore processing becomes more significant, I expect that many LINQ methods may evolve to support native parallel processing (think multi-core sort) - which should help keep .NET applications scalable in the multi-core future.
There are a couple of other nice benefits:
Awareness of Iterator Generators: Once developers learn about LINQ, some of them go on to learn about how it works. This helps to generate awareness of the yield return syntax in C# - which is a powerful way of writing concise and correct sequence iterators (see the sketch after these points).
Focus on business problems: LINQ frees developers to focus on solving the underlying business problems, rather than trying to optimize loops and algorithms to run in the fewest cycles, or use the least number of lines of code. This goes beyond just the productivity of having a library of powerful sequence transformation routines.
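A small, self-contained illustration of both deferred streaming execution and a yield return iterator:

using System;
using System.Collections.Generic;
using System.Linq;

class DeferredDemo
{
    // A hand-written infinite iterator: elements are produced one at a
    // time, only when a consumer asks for them.
    static IEnumerable<int> Naturals()
    {
        for (int i = 1; ; i++)
            yield return i;
    }

    static void Main()
    {
        // Nothing executes here yet; the pipeline is merely composed.
        var firstFiveEvens = Naturals().Where(n => n % 2 == 0).Take(5);

        // Only as many naturals are generated as Take(5) needs.
        foreach (var n in firstFiveEvens)
            Console.WriteLine(n);   // 2, 4, 6, 8, 10
    }
}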
I feel the code is easier to maintain and easier to test compared to a solution in SQL stored procedures.
Combining LINQ with extension methods, I get something like (maybe it should use some kind of fluent interface...):
return source.Growth().ShareOfChangeDate();
where Growth and ShareOfChangeDate are extensions that I can easily write unit tests for,
and, as LBushkin says, the line above is something I can present to the customer when we discuss it.
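A hypothetical sketch of what such an extension might look like (Growth is the poster's name; the body is invented here) - small and pure, so it is easy to unit test in isolation:

using System.Collections.Generic;

public static class MeasureExtensions
{
    // Period-over-period change; yields one element fewer than the input.
    public static IEnumerable<decimal> Growth(this IEnumerable<decimal> source)
    {
        decimal? previous = null;
        foreach (var value in source)
        {
            if (previous.HasValue)
                yield return value - previous.Value;
            previous = value;
        }
    }
}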
Issues
I feel I get less control over the generated SQL, and finding performance problems involves a little bit of magic...

Is LINQ to Everything a good abstraction?

There is a proliferation of new LINQ providers. It is really quite astonishing: an elegant combination of lambda expressions, anonymous types and generics, with some syntactic sugar on top to make it easy to read. Everything is LINQed now, from SQL to web services like Amazon's to streaming sensor data to parallel processing. It seems like someone is creating an IQueryable<T> for everything, but these data sources can have radically different performance, latency, availability and reliability characteristics.
It gives me a little pause that LINQ makes those performance details invisible to the developer. Is LINQ a solid general purpose abstraction, a RAD tool, or both?
To me, LINQ is just a way to make code more readable, and hence more maintainable. LINQ does nothing more than take standard methods and integrate them into the language (hence the name - language integrated query).
It's nothing but a syntax element around normal interfaces and methods - there is no "magic" here, and LINQ-to-something really should (IMO) be treated like any other 3rd party API - you need to understand the costs/benefits of using it just like any other technology.
That being said, it's a very nice syntax helper - it does a lot for making code cleaner, simpler, and more maintainable, and I believe that's where its true strengths lie.
I see this as similar to the model of multiple storage engines in an RDBMS accepting a common(-ish) language of SQL, in its design... but with the added benefit of integration into the application language's semantics. Of course it is good!
I have not used it that much, but it looks sensible and clear when performance and layers of abstraction are not in a position to have a negative impact on the development process (and you trust that standards and models won't change wildly).
It is just an interface and implementation that may fit your needs; as with all interfaces, abstractions, libraries and implementations, the question is the same: does it fit?
I suppose - no.
LINQ is just a convenient syntax, not a general-purpose RAD tool. In big projects with complex logic, I have noticed that developers make more errors in LINQ than they would writing the same instructions in the .NET 2.0 manner. The code is produced faster and it is smaller, but it is harder to find bugs. Sometimes it is not obvious at first glance at what point the queried collection turns from IQueryable into IEnumerable... I would say that LINQ requires more skilled and disciplined developers.
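The classic trap looks like this; an in-memory IQueryable stands in for an EF table here, but with a real provider the first query would be translated to SQL while the second would not:

using System;
using System.Collections.Generic;
using System.Linq;

class QueryableTrap
{
    static void Main()
    {
        IQueryable<int> orders = Enumerable.Range(1, 1000).AsQueryable();

        // Still IQueryable: a query provider would see (and translate) the filter.
        IQueryable<int> composable = orders.Where(n => n > 100);

        // AsEnumerable() silently switches to LINQ-to-Objects: with a real
        // provider, every row would be pulled client-side and filtered in memory.
        IEnumerable<int> switched = orders.AsEnumerable().Where(n => n > 100);

        Console.WriteLine(composable.Count());
        Console.WriteLine(switched.Count());
    }
}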
Also, SQL-like syntax is fine for functional programming, but it is a sidestep from object-oriented thinking. Sometimes when you see two very similar LINQ queries, they look like copy-paste code, but refactoring is not always possible (or is possible only by sacrificing some performance).
I heard that MS is not going to develop LINQ to SQL further, and will give more priority to Entities. Is the ADO.NET Team Abandoning LINQ to SQL? Isn't this fact a signal to us that LINQ is not a panacea for everybody?
If you are thinking about building a connector to "something", you can build it without LINQ and, if you like, provide LINQ as an additional optional wrapper around it, like LINQ to Entities. Then your customers can decide whether to use LINQ or not, depending on their needs, required performance, etc.
p.s.
.NET 4.0 will come with dynamics, and I expect that everybody will also start to use them the way they use LINQ... without taking into consideration that code simplicity, quality and performance may suffer.

Is it stupid to write a large batch processing program entirely in PL/SQL?

I'm starting work on a program which is perhaps most naturally described as a batch of calculations on database tables, and will be executed once a month. All input is in Oracle database tables, and all output will be to Oracle database tables. The program should stay maintainable for many years to come.
It seems straightforward to implement this as a series of stored procedures, each performing a sensible transformation, for example distributing costs among departments according to some business rules. I can then write unit tests to check if the output of each transformation is as I expected.
Is it a bad idea to do this all in PL/SQL? Would you rather do heavy batch calculations in a typical object oriented programming language, such as C#? Isn't it more expressive to use a database centric programming language such as PL/SQL?
You describe the following requirements:
a) Must be able to implement Batch Processing
b) Result must be maintainable
My Response:
PL/SQL was designed to achieve just what you describe. It's also important to note that there are efficiencies in PL/SQL that are not available in other tools. A stored procedure language puts the processing next to the data - which is where batch processing ought to sit.
It is easy enough to write poorly maintainable code in any language.
Having said the above, your implementation will depend on the available skills, a proper design and adherence to good quality processes.
To be efficient your implementation must process data in batches ( select in batches and insert/update in batches ). The danger with an OO approach is that it is easy to be led towards a design that processes data row by row. This type of approach contains unnecessary overhead, and will be significantly less efficient than a design that processes data in batches of rows.
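A minimal PL/SQL sketch of set-oriented batching (the table and column names are invented): BULK COLLECT fetches the working set in one pass, and FORALL applies all the updates in a single round trip to the SQL engine, instead of looping row by row.

DECLARE
  TYPE t_id_list IS TABLE OF costs.id%TYPE;
  l_ids t_id_list;
BEGIN
  -- Fetch the ids to process in one batch rather than one at a time.
  SELECT id BULK COLLECT INTO l_ids
  FROM costs
  WHERE period = DATE '2012-01-01';

  -- One context switch for all the updates, not one per row.
  FORALL i IN 1 .. l_ids.COUNT
    UPDATE costs
       SET allocated = 1
     WHERE id = l_ids(i);

  COMMIT;
END;
/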
It is possible to use both approaches successfully.
Mathew Butler
Something for other commenters to note - the question is about PL/SQL, not about SQL. Some of the answers have obviously been about SQL, not PL/SQL. PL/SQL is a fully functional database language, and it's mature as well. There are some shortcomings, but for the type of thing the poster wants to do, it's very good.
No, it isn't necessarily a bad idea. If the solution seems straightforward to you and allows you to test and verify each process, it sounds like it could be a good idea. OO platforms can be (though they don't have to be) bad for large data sets, as object creation and overhead can kill performance.
Oracle designed PL/SQL with problems like yours in mind; if there is sufficient corporate knowledge of the database and PL/SQL, this seems like a reasonable solution. Keep large batch sets in mind, as each call from PL/SQL to the actual SQL engine is a context switch, so single-record operations should be batched together where possible to improve performance.
Just make sure you somehow log what is happening while it's working. Otherwise you'll have a black box and if it gets stuck somewhere for hours, you'll be wondering whether to stop it or let it work 'a little bit more'.
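A common way to do that in PL/SQL is a logging procedure declared as an autonomous transaction, so each progress row is committed and visible while the main batch is still running; the batch_log table here is invented:

CREATE OR REPLACE PROCEDURE log_progress(p_step IN VARCHAR2) AS
  PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
  INSERT INTO batch_log (logged_at, step)
  VALUES (SYSTIMESTAMP, p_step);
  COMMIT;  -- commits only this insert, not the caller's in-flight work
END;
/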
PL/SQL is a mature language that integrates well with SQL. With each version of Oracle it becomes more and more powerful.
Also, starting with Oracle 11g, PL/SQL can be compiled to native machine code without needing an external C compiler.
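Native compilation is controlled per session or per unit; a small sketch (the procedure name is invented):

ALTER SESSION SET plsql_code_type = 'NATIVE';
ALTER PROCEDURE monthly_allocation COMPILE;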
Normally I say put as little in PL/SQL as possible - it is typically a lot less maintainable - at one of my last jobs I really saw how messy and hard to work with it could get.
However, since it is batch processing - and since the input and output are both in the DB - it makes good sense to put the logic into PL/SQL, to minimize "moving parts". However, if it were business logic - or components used by other pieces of your system - I would say don't do it.
I wrote a huge amount of batch processing and report generation programs in both PL/SQL and Pro*C for one project. They generally preferred that I write in PL/SQL, as their own developers, who would maintain it in the future, found that easier to understand than Pro*C code.
Only the really funky processing or reports ended up being written in Pro*C.
It is not necessary to write these as stored procedures, as other people have alluded to; they can be just script files that are run as necessary, kind of like shell scripts. That makes source code revision control and migration between test and production systems a heck of a lot easier, too.
As long as the calculations you need to perform can be adequately AND readably captured in PL/SQL, then using only PL/SQL would make the most sense.
The real catch is maintainability - it's very easy to write unmaintainable SQL, if only because every RDBMS has a different syntax and a different function set once you step outside of simple SQL DML, and there are no real standards for formatting, commenting, etc.
I've created batch programs using C# and SQL.
Pros of C#:
You've got the full library of .NET and all the power of an OO language.
Cons of C#:
The batch program and the DB are separate - this means you'll have to manage your batch program separately from the database.
You need to escape all that dang SQL code.
Pros of SQL:
Integrates nicely with the DBMS. If this job only manipulates the database, it would make sense to include it with the database. You end up with a single db and all of its components in one package.
No need to escape SQL code.
Keeping it real - you are programming in your problem domain.
Cons of SQL:
It's SQL, and I personally just don't know it as well as C#.
In general, I would stick with using SQL because of the pros outlined above.
This is a loaded question :)
There's a couple of database programming architecture designs you should know of, and what their costs/benefits are.
2 Tier generally means you have a client connecting to a DB, issuing direct SQL calls.
3 Tier generally means you have an "application server" that is issuing direct SQL calls to the DB, but the client is talking to the app server. Generally, this affords "scaling out".
Finally, you have 2 1/2 tiered apps that employ a 2 Tier like format, only the work is compartmentalized within stored procedures.
Your process sounds like a "back office" kind of thing, and clients/processes just need results that are being aggregated and cached on a once a month basis.
That is, there is no agent that connects, and connects often, and says "do these calculations". Instead you allude to a process that happens once in a while, and you can get away with non-real time.
Therefore, given those requirements, I'd say that generally, it will be faster to be closer to the data, and let SQL server do all the calculations.
I think you'll find that proximity to the data will serve you well.
However, in performing these calculations, you may find that some are not amenable to SQL. Take for example calculating the accrued interest of a bond, or any fixed income instrument: not very pretty in SQL, and much more suited to a richer programming language. However, if you just have simple averages and other relatively sane aggregates, I'd stick to stored procedures, on the SQL side.
So again, there's not enough information as to the nature of your calculations, or what your house mandates in terms of SQL capabilities of devs for support, or what your boss says...but since I know my way around SQL, and like to stay close to the data, I'd stay pure SQL/Stored Procedures for a task like this.
YMMV :)
It's not usually more expressive because most stored procedure languages suck by design. But it will probably run faster than in an external app.
I guess it boils down to how familiar you are with PL/SQL, how much time you have to write this, how important performance is, and whether you can reasonably expect maintainers to be familiar enough with PL/SQL to maintain a big program written in it.
If speed is not relevant and maintainers will probably not be proficient in PL/SQL, you might be better off using a 'traditional' language.
You could also use a hybrid approach, where you use PL/SQL to generate intermediate data (say, table joins and sums or whatever) and a separate application to control flow and check values and errors.
