Linq, entity framework and their usage - linq

I had to develop system for my university which includes tracking of almost all data (lectures, lecturers, teaching assistants, students, etc..)
My database have like 30 tables, and it's pretty complex.
I used EF and linq to solve connecting to database and querying from it.
But the more I'm going into, the more my queries become to hard to write, and not to mention to maintain.
Here is the example of one query: http://pastebin.com/Za1cYMPa
It is pretty much in chaos as you can see.
So, am I misusing linq (linq can solve this but on different way) or linq just isn't for more complex systems like this one?
This is general question, not one to solve particular problem.

Do you think that query will look better or be better maintainable if written in native SQL? In such case you can hide the query in stored procedure. Once you came into advanced queries it always mess. You can reduce some complexities by hiding subqueries either into database views or into EF query views. But still if you have highly normalized OLTP database and you need to do complex reporting / analysis / data mining query it will always be big and badly maintainable. That is the reason why OLAP systems exist (I didn't check content of your query - just length so don't take it as the reason to build OLAP Cube).
More important is performance of the query ...

In general, Complexity can be reduced by abstracting your repetitive code into reusable, easily-maintainable components.
In your particular case, the complexity can be reduced by implementing the Repository and Specification patterns to make your code more consistent, easier to read, and easier to maintain.
There is a very helpful sequence of articles by Huy Nhuyen explaining the repository pattern, the specification pattern and how to effectively combine them with the Entity Framework for better code. These are available at:
Entity Framework 4 POCO, Repository and Specification Pattern
Specification Pattern In Entity Framework 4 Revisited
Entity Framework 4 POCO, Repository and Specification Pattern [Upgraded to CTP5]
Entity Framework 4 POCO, Repository and Specification Pattern [Upgraded to EF 4.1]

Ooof, that is pretty nasty.
You should think about your database design. I believe having so many tables is causing your queries to be overly complicated. I well placed set of views (or stored procedures) can simplify the querying logic. However this will not simplify your overall system because at some point you have to connect all these tables.

Related

ORM query performance vs RDBMS performance

I am developing an application with laravel framework, and I want to measure orm query (CRUD) performance and compare it with RDBMS query performance.
I need to show that orm query performance is better than RDBMS but somewhere I read that eloquent laravel has a slow query performance. I need to make the right decision in order to show the desired result. Is doctrine better than eloquent ORM? And also which benchmark do you suggest me to use. I need these results for my thesis.
You're comparing apples to oranges here.
By definition, an RDBMS is always going to be faster, because the RDBMS is your database (RDBMS = Relational Database Management System). IE -- MySQL, SQL Server, PostgreSQL, etc. And the database does one thing really, really well -- handle data (okay, two things, depending on how you look at it -- storing and retrieving data).
Since every way of accessing the database from any other language is at least one step removed from the database itself, everything is slower than the RDBMS itself, if for no other reason than the language's interpreter has to first connect to the database at least once, before it can do anything.
That said, there are a few different layers available when dealing with databases in PHP:
Raw queries using PHP's built in mysql_* functions.
Basic database abstraction layers (ie - PDO)
Basic query builders (ie - Laravel's Query Builder)
Active Record pattern ORMs (ie - Eloquent)
Stateless/transactional ORMs (ie - Doctrine)
Assuming perfectly optimized queries fed into the given method by the developer, raw queries will be fastest, followed by the basic DBAL, followed by anything built on top of the basic DBAL. Where query builders and the ORMs built on them fall will depend on whether the query builder is, itself, built on top of another DBAL (in this case, I think it puts Eloquent one more layer removed than Doctrine, because Eloquent is built on Query Builder, which is built on PDO). This is because each one is an abstraction layer over the previous, so the path of the code, when executed, has to run through the stack.
The question then becomes how much of a difference are we talking? That depends entirely on the queries you're feeding into the system, as well as the quality of the system, itself. What are you looking for to show differences? How fast it can do a basic SELECT? Or how well it can do some crazy multi-JOIN query? What determines "speed" for the purpose of your thesis? Only you can really decide that, because you have more information than anyone here. For the sake of thoroughness, you're probably looking at basic SELECTs, complex queries that include things like JOINs, ORDER BYs, and GROUP BYs, and INSERT and UPDATE commands.
I will tell you this, though -- any test to show speed differences will likely have be on thousands or tens of thousands of transactions, at least, in order to show any significant differences, because on an individual transaction level, we're talking microseconds and possibly even nanoseconds in differences.
In actual industry use, then, how do we decide what route to go? Speed and ease of writing and maintaining the code. In that aspect, ORMs or DBALs will very often beat out raw queries. The fractions of a second per script run lost to the abstraction overhead is recuperated thousands upon thousands of times over in developer costs for time spent writing and maintaining the code in question.
In fact, by the time you get to the point where ORM vs DBAL vs raw queries actually matters, odds are good that you're starting to question whether your original database, language interpreter, or server is up to par with your software's demands. This is actually the issue that Facebook started facing a couple of years ago, at which point they started offloading some of their PHP to C, because C is faster in certain cases. It's also why they've created a completely new interpreter for PHP code (HipHop Virtual Machine, or HHVM), which is quite a bit faster than the original PHP engine.

Entity Framework with a very large and complex dataset

I would appreciate it very much if you can help me with my questions:
Is EF5 reliable and efficient enough to deal with very large and complex dataset in the real world?
Comparing EF5 with ADO.NET, does EF5 requires significantly more resources such as memory?
For those who have tried EF5 on a real world project with very large and complex dataset, are you happy with the performance so far?
As EF creates an abstraction over the data access. Usage of EF introduces number of additional steps to execute any query. But there are workarounds to reduce the cost. As MS is promoting this ORM, i believe they are doing alot for performance improvement as well. EF 6 beta is already out.
There is good article on performance of EF5 available on MSDN for this.
http://msdn.microsoft.com/en-us/data/hh949853
I will not use EF if lot of complex manipulations and iterations are required on the DBsets after populating the DBSets from the EF query.
Hope this helps.
EF is more than capable of handling large amounts of data. Just as in plain ADO.NET, you need to understand how to use it correctly. Its just as easy to write code in ADO.NET that performs poorly. Its also important to remember that EF is built on top of ADO.NET.
DBSets will be much slower with large amounts of data than a code first EF approach. Plain Datareaders could be faster if done correctly.
The correct answer is 'profile'. Code some large objects and profile the differences.
I did some research into this on EF 4.1 and some details might still apply though there have been upgrades in performance to EF5 to keep in mind.
ADO vs EF performance
My conclusions:
-You won't match ADO performance with a framework that has to generate the actual SQL statement dynamically from C# and turn that sql statement into a strongly typed object, there is simply too much going on, even with pre compiled and 'warmed' statements (and performance tests conclude this). This is a happy trade off for many who find it much easier to write and maintain Linq in C# than stored procs and mapping code.
-You can still use stored procedures with equal performance to ADO which to me is the perfect situation. You can write linq to sql in situations where it works, and as soon as you would like more performance, write the sproc and import it to EF to enjoy the best performance possible.
-There is no technical reason to avoid EF accept for the requirements of your project and possibly your teams knowledge and comfort level. Be sure to make sure it fits into your selected design pattern.. EF Anti Patterns to avoid
I hope this helps.

My Database Design skills stink. Where to seek remedy?

I have a web site that's been progressivelly expanding in both traffic and complexity of database design. I've always worked as a developer first & foremost, and never really been much of a DB administrator beyond what I need to do to get my code running. This needs to change - I need to improve efficiency on the database side of things.
To give a vague example, I'm looking for how to go about learning:
Optimising complex tables/relationships for performance/scaling
How to index efficiently. (At the moment I throw indexes on foreign keys, and that's about it)
General design principles for complex databases
Most of the resources I've found are either directed more towards the basics of SQL ("this is a SELECT query, a JOIN, etc") or focus primarily on performance issues outside the DB.
So, I know this is a little vague - but where should I look to ensure my database is designed in the most most efficient & integral manner possible?
Learn about data modeling. Choosing the right data structure is always a crucial first step, for programming in general and databases in particular. Performance cannot be "bolted" on top of a bad data structure! The ERwin Methods Guide is probably not a bad way to start learning about data modeling.
Learn how DBMSes organize data at the physical level. This will help you immensely in understanding how to "shape" your data for performance and how to effectively leverage many of the performance mechanisms modern DBMSes put at your disposal. Use The Index, Luke! is an excellent tutorial on the topic.
Learn how to efficiently access the database and make sure you really understand the client API that will be called from your code. Different APIs have their own idiosyncrasies, but they all share some common themes, such as parameter binding, query preparation and fetching. Even if you are "shielded" by an ORM from ever having to, say, bind parameters manually, this is still taking place "under the covers" and understanding it raises your ability to write performant code.
Measure, measure, measure. Modern information systems are immensely complex and even experts find themselves making incorrect assumptions, so don't rely on assumptions!
I would suggest some reading in performance tuning. It is very specialized depending on the database backend you use. BUt here are some books to consider:
SQl Server
http://www.amazon.com/Server-Query-Performance-Tuning-Distilled/dp/1590594215/ref=sr_1_2?s=books&ie=UTF8&qid=1334154710&sr=1-2
http://www.amazon.com/Performance-Tuning-Server-Dynamic-Management/dp/1906434476/ref=sr_1_12?s=books&ie=UTF8&qid=1334154710&sr=1-12
MySQL
http://www.amazon.com/High-Performance-MySQL-Optimization-ebook/dp/B0028N4W7Y/ref=sr_1_3?ie=UTF8&qid=1334154504&sr=8-3
Oracle
http://www.amazon.com/Oracle-Database-Release-Performance-Techniques/dp/0071780262/ref=sr_1_2?s=books&ie=UTF8&qid=1334154909&sr=1-2
General performance Tuning
http://www.amazon.com/SQL-Performance-Tuning-Peter-Gulutzan/dp/0201791692/ref=sr_1_18?s=books&ie=UTF8&qid=1334154964&sr=1-18
First and foremost, I'd recommend learning how to use EXPLAIN and what its output means. Run it on your most common queries and study the output. Are the queries using sensible indexes? Are they using indexes at all? Queries that look very simple at a glance might end up being quite costly.
Next, I'd suggest finding your slowest queries. Postgres (for example) has a feature that allows you to log the SQL source for all queries that take longer than N seconds to run. Are they slow because they're unindexed, very complex, or operating on a huge amount of data?
Third, I'd look at the number of times a particular query is run. Are you using the database to store static data, and hitting a table over and over again to grab a record that never changes? You could probably cache the result somewhere.

What's the current state of ORMs?

Historically I've been completely against using ORMS for all but the most basics applications.
My reasoning has and always has been that it's a very leaky abstraction ... mostly because SQL provides a very powerful way to retreive data from a relational source which usually gets messed up by the ORM so that you lost a lot of performance to gain an appearance of not having a relational backend.
I've always thought the DATA should always be kept in the Data Base, not eat up application memory which won't scale anyway. In addition the performance hit of being to generic is harmful. For example, if I need the name and address of all the clients of my database SQL provides me with an easy way to get it, in one query. With an ORM I need to get all the clients and then each name and address, even if it's lazy loaded it's gonna take a LOT longer.
That's what I think but has any of the above changed? I'm seeing a lot of ORMS like the Entity Framework, NHibernate, etc. And they seem to have a lot of popularity lately... Are they worth it? Do they solve the problems I describe above??
Please read: All Abstractions Are Failed Abstractions It should put a lot of your questions in perspective.
Performance is usually not an issue with ORM - and if you really find yourself in a situation where it is, then there usually is always the option to handcraft the SQL statements the ORM uses.
IMHO ORM give you an instant and huge development speed increase. That's why they are so popular. And using them right does not make you paint yourself in a corner. There is always the option of hand tuning the performance.
Edit:
Even though Jeff focuses on Linq to SQL all he says about abstractions and performance are equally true for NHibernate (which I know from years of real world app development). IMHO one should use by default an ORM since they are more than fast enough for the notorious 90% of situations. Reading code written for an ORM usually is more maintainable and readable especially when your code is picked up by the next developer that inherits your code. Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live. Never forget about that guy!
In addition they give out of the box caching, lazy loading, unit of work, ... you name it. And I found that when I was not happy about the performance of the ORM it was MY fault. ORM do force you to adhere to good OO design practices and help you shape your Domain Model.
On the Ruby on Rails side, ActiveRecord -- essentially an ORM -- is the basis of 95% of Rails applications (made-up statistic, but it's around there). Actually, to get to that 95% we would probably need to include other ORMs for Rails, like DataMapper.
The abstraction is leaky, and a developer can always dip down to SQL as necessary. Even when you're not using SQL directly, you have to think about number of database hits, etc. For instance, in ActiveRecord, "eager loading" is used to avoid multiple database hits, so you see stuff like this (includes the related "author" field of each Post in the initial query... it does a join under the hood, I think)
for post in Post.find(:all, :include => :author)
The point is that the abstraction leaks as do all abstractions, but that's not really the point. To decide whether to use the abstraction or not, you have to consider whether it will add to or reduce your general workload. In other words, will you spend more time retrofitting your concepts to make the abstraction work, or is it ready to do what you need without much hacking (saving you time)?
I think that the abstractions that work are those that are mature: ActiveRecord has been around the block a ton (as has Hibernate), so it provides an abstract way to patch most of the leaks you would normally be worried about, without explicitly rolling your own lower-level solution (i.e., without writing SQL).
Beyond the learning curve, I think that ORMs are an amazing time-saver for most of your database access, and that most apps actually do make quite "normal" use of the DB. While it may not be your case whatsoever, eschewing an ORM for direct DB access is often a case of early, and unnecessary, optimization.
Edit: I hadn't seen this, but the Jeff quote is
Does this abstraction make our code at
least a little easier to write? To
understand? To troubleshoot? Are we
better off with this abstraction than
we were without it?
saying essentially the same thing.
Some of the more modern ORM's are really powerful tools that solve a lot of real world problems. The good ORM's don't try to hide the relational model from you, but actually leverage it to make OO programming more powerful. They really aren't abstractions in the sense that they let you ignore the "lowlevel" details of relational algebra, instead they are toolkits that let you build abstractions on the relational model and make it easier to bring in data into the imperative model, track the changes and push them back to the database. The SQL language really doesn't provide any good way to factor out common predicates into composable, reusable components to achieve businesstule level abstractions.
Sure there is a performance hit, but it's mostly a constant factor thing as you can make the ORM issue what ever SQL you would issue yourself. Like for your name and address example, in SQLAlchemy you'd just do
for name, address in session.query(Client.name, Client.address):
# process data
and you're done. But where the ORM helps you is when you have reusable relations and predicates. For instance, say you have defined a way to join to a client's favorited items, and a predicate to see if it is on sale. Then you can get the list of clients that have some of their favorite items on sale while also fetching the assigned salesperson with the following query:
potential_sales = (session.query(Client).join(Client.favorite_items)
.filter(Item.is_on_sale)
.options(eagerload(Client.assigned_salesperson)))
Atleast for me, the intent of the query is a lot faster to write, clearer and easier to understand when written like this, instead of a dozen lines of SQL.
As to any abstraction, you'll have to pay either in form of performance, or leaking. I agree with you in being against ORM's, since SQL is a clean and elegant language. I've sort of written my own little frameworks which do this things for me, but hey, then I sat there with my own ORM (but with a little more control over it than for example Hibernate). The people behind Hibernate states that it is fast. It should be able to do about 95% of the boring work against your database (simple queries, updates etc..) but gives you freedom to do the last 5% yourself if you want (you could always write your own mappings in special cases).
I think most of the popularity stems from that many programmers are lazy and want established frameworks to do the dirty boring persistence job for them (I can understand that), but the price of an abstraction will always be there. I would consider my options thoroughly before choosing to use an ORM in a serious project.

Is LINQ to Everything a good abstraction?

There is a proliferation of new LINQ providers. It is really quite astonishing and an elegant combination of lambda expressions, anonymous types and generics with some syntax sugar on top to make it easy reading. Everything is LINQed now from SQL to web services like Amazon to streaming sensor data to parallel processing. It seems like someone is creating an IQueryable<T> for everything but these data sources can have radically different performance, latency, availability and reliability characteristics.
It gives me a little pause that LINQ makes those performance details transparent to the developer. Is LINQ a solid general purpose abstraction or a RAD tool or both?
To me, LINQ is just a way to make code more readable, and hence more maintainable. LINQ does nothing more than takes standard methods and integrates them into the language (hence the name - language integrated query).
It's nothing but a syntax element around normal interfaces and methods - there is no "magic" here, and LINQ-to-something really should (IMO) be treated as any other 3rd party API - you need to understand the cost/benefits of using it just like any other technology.
That being said, it's a very nice syntax helper - it does a lot for making code cleaner, simpler, and more maintainable, and I believe that's where it's true strengths lie.
I see this as similar to the model of multiple storage engines in an RDBMS accepting a common(-ish) language of SQL, in it's design ... but with the added benefit of integreation into the application language semantics. Of course it is good!
I have not used it that much, but it looks sensible and clear when performance and layers of abstraction are not in a position to have a negative impact on the development process (and trust that standards and models wont change wildly).
It is just an interface and implementation that may fit your needs, like all interfaces, abstractions, libraries and implementations, does it fit?... it is all the same answers.
I suppose - no.
LINQ is just a convenient syntax, but not a common RAD tool. In the big projects with complex logic I noticed that developers do more errors in LINQ that in the same instructions they could do if they write the same thing in .NET 2.0 manner. The code is produced faster, it is smaller, but it is harder to find bugs. Sometimes it is not obvious from the first look, at what point the queried collection turns from IQueryable into IEnumerable... I would say that LINQ requires more skilled and disciplined developers.
Also SQL-like syntax is OK for a functional programming but it is a sidestep from object oriented thinking. Sometimes when you see 2 very similar LINQ queries, they look like copy-paste code, but not always any refactoring is possible (or it is possible only by sacrificing some performance).
I heard that MS is not going to further develop LINQ to SQL, and will give more priority to Entities. Is the ADO.NET Team Abandoning LINQ to SQL? Isn't this fact a signal for us that LINQ is not a panacea for everybody ?
If you are thinking about to build a connector to "something", you can build it without LINQ and, if you like, provide LINQ as an additional optional wrapper around it, like LINQ to Entities. So your customers will decide, whether to use LINQ or not, depending on their needs, required performance etc.
p.s.
.NET 4.0 will come with dynamics, and I expect that everybody will also start to use them as LINQ... without taking into considerations that code simplicity, quality and performance may suffer.

Resources