ORM query performance vs RDBMS performance

I am developing an application with the Laravel framework, and I want to measure ORM (CRUD) query performance and compare it with RDBMS query performance.
I need to show that ORM query performance is better than RDBMS performance, but somewhere I read that Laravel's Eloquent has slow query performance. I need to make the right choice in order to show the desired result. Is Doctrine better than Eloquent ORM? Also, which benchmark would you suggest I use? I need these results for my thesis.

You're comparing apples to oranges here.
By definition, an RDBMS is always going to be faster, because the RDBMS is your database (RDBMS = Relational Database Management System), i.e. MySQL, SQL Server, PostgreSQL, etc. And the database does one thing really, really well -- handle data (okay, two things, depending on how you look at it -- storing and retrieving data).
Since every way of accessing the database from any other language is at least one step removed from the database itself, everything is slower than the RDBMS itself, if for no other reason than the language's interpreter has to first connect to the database at least once, before it can do anything.
That said, there are a few different layers available when dealing with databases in PHP:
Raw queries using PHP's built-in mysql_* functions.
Basic database abstraction layers (e.g. PDO)
Basic query builders (e.g. Laravel's Query Builder)
Active Record pattern ORMs (e.g. Eloquent)
Stateless/transactional ORMs (e.g. Doctrine)
Assuming perfectly optimized queries fed into the given method by the developer, raw queries will be fastest, followed by the basic DBAL, followed by anything built on top of that DBAL. Where query builders and the ORMs built on them fall will depend on whether the query builder is itself built on top of another DBAL (in this case, I think that puts Eloquent one more layer removed than Doctrine, because Eloquent is built on the Query Builder, which is built on PDO). This is because each one is an abstraction layer over the previous one, so the code, when executed, has to run down through the whole stack.
The question then becomes: how much of a difference are we talking about? That depends entirely on the queries you're feeding into the system, as well as the quality of the system itself. What are you looking at to show differences? How fast it can do a basic SELECT? Or how well it can handle some crazy multi-JOIN query? What determines "speed" for the purposes of your thesis? Only you can really decide that, because you have more information than anyone here. For the sake of thoroughness, you're probably looking at basic SELECTs; complex queries that include things like JOINs, ORDER BYs, and GROUP BYs; and INSERT and UPDATE commands.
I will tell you this, though -- any test to show speed differences will likely have to run thousands or tens of thousands of transactions, at least, in order to show any significant differences, because at the level of an individual transaction we're talking about microseconds, and possibly even nanoseconds, of difference.
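To make that concrete, here's a minimal benchmark sketch of the methodology. It's written in Python against SQLite (the stdlib driver vs the SQLAlchemy ORM) rather than PHP's PDO vs Eloquent, and the database file, table, and iteration count are made up for illustration -- the point is simply to time the same operation at each abstraction layer, thousands of times:

import sqlite3
import time

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite:///bench.db")
Base.metadata.create_all(engine)

N = 10_000  # thousands of runs, so differences rise above the noise

# Layer 1: the raw driver.
conn = sqlite3.connect("bench.db")
start = time.perf_counter()
for i in range(N):
    conn.execute("SELECT id, name FROM users WHERE id = ?", (i,)).fetchone()
print("raw driver:", time.perf_counter() - start)

# Layer 4: the ORM -- each call runs down through the whole stack.
with Session(engine) as session:
    start = time.perf_counter()
    for i in range(N):
        session.get(User, i)
    print("ORM:", time.perf_counter() - start)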
In actual industry use, then, how do we decide which route to go? Speed and ease of writing and maintaining the code. In that respect, ORMs or DBALs will very often beat out raw queries. The fractions of a second per script run lost to the abstraction overhead are recouped thousands upon thousands of times over in developer costs for the time spent writing and maintaining the code in question.
In fact, by the time you get to the point where ORM vs DBAL vs raw queries actually matters, odds are good that you're starting to question whether your original database, language interpreter, or server is up to par with your software's demands. This is actually the issue that Facebook started facing a couple of years ago, at which point they started offloading some of their PHP to C, because C is faster in certain cases. It's also why they've created a completely new interpreter for PHP code (HipHop Virtual Machine, or HHVM), which is quite a bit faster than the original PHP engine.

Related

My Database Design skills stink. Where to seek remedy?

I have a web site that's been progressively expanding in both traffic and complexity of database design. I've always worked as a developer first and foremost, and have never really been much of a DB administrator beyond what I need to do to get my code running. This needs to change -- I need to improve efficiency on the database side of things.
To give a vague example, I'm looking for how to go about learning:
Optimising complex tables/relationships for performance/scaling
How to index efficiently. (At the moment I throw indexes on foreign keys, and that's about it)
General design principles for complex databases
Most of the resources I've found are either directed more towards the basics of SQL ("this is a SELECT query, a JOIN, etc") or focus primarily on performance issues outside the DB.
So, I know this is a little vague -- but where should I look to ensure my database is designed with the most efficiency and integrity possible?
Learn about data modeling. Choosing the right data structure is always a crucial first step, for programming in general and databases in particular. Performance cannot be "bolted" on top of a bad data structure! The ERwin Methods Guide is probably not a bad way to start learning about data modeling.
Learn how DBMSes organize data at the physical level. This will help you immensely in understanding how to "shape" your data for performance and how to effectively leverage many of the performance mechanisms modern DBMSes put at your disposal. Use The Index, Luke! is an excellent tutorial on the topic.
Learn how to efficiently access the database and make sure you really understand the client API that will be called from your code. Different APIs have their own idiosyncrasies, but they all share some common themes, such as parameter binding, query preparation and fetching. Even if you are "shielded" by an ORM from ever having to, say, bind parameters manually, this is still taking place "under the covers" and understanding it raises your ability to write performant code.
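To make the binding/preparing/fetching point concrete, here is a small sketch using Python's stdlib sqlite3 DB-API (the table and values are made up; every client API has its own spelling of the same ideas):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Parameter binding: the statement text stays constant and the values travel
# separately, which lets the driver/DB prepare the statement and reuse the plan.
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("alice",), ("bob",)])

# Fetching: iterate the cursor instead of materializing everything up front.
for user_id, name in conn.execute(
        "SELECT id, name FROM users WHERE name = ?", ("alice",)):
    print(user_id, name)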
Measure, measure, measure. Modern information systems are immensely complex and even experts find themselves making incorrect assumptions, so don't rely on assumptions!
I would suggest some reading on performance tuning. It is very specialized depending on the database backend you use. But here are some books to consider:
SQL Server
http://www.amazon.com/Server-Query-Performance-Tuning-Distilled/dp/1590594215/ref=sr_1_2?s=books&ie=UTF8&qid=1334154710&sr=1-2
http://www.amazon.com/Performance-Tuning-Server-Dynamic-Management/dp/1906434476/ref=sr_1_12?s=books&ie=UTF8&qid=1334154710&sr=1-12
MySQL
http://www.amazon.com/High-Performance-MySQL-Optimization-ebook/dp/B0028N4W7Y/ref=sr_1_3?ie=UTF8&qid=1334154504&sr=8-3
Oracle
http://www.amazon.com/Oracle-Database-Release-Performance-Techniques/dp/0071780262/ref=sr_1_2?s=books&ie=UTF8&qid=1334154909&sr=1-2
General Performance Tuning
http://www.amazon.com/SQL-Performance-Tuning-Peter-Gulutzan/dp/0201791692/ref=sr_1_18?s=books&ie=UTF8&qid=1334154964&sr=1-18
First and foremost, I'd recommend learning how to use EXPLAIN and what its output means (there's a short sketch at the end of this answer). Run it on your most common queries and study the output. Are the queries using sensible indexes? Are they using indexes at all? Queries that look very simple at a glance might end up being quite costly.
Next, I'd suggest finding your slowest queries. Postgres (for example) has a feature that allows you to log the SQL source for all queries that take longer than N seconds to run. Are they slow because they're unindexed, very complex, or operating on a huge amount of data?
Third, I'd look at the number of times a particular query is run. Are you using the database to store static data, and hitting a table over and over again to grab a record that never changes? You could probably cache the result somewhere.
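To give a feel for the EXPLAIN workflow mentioned above, here's a sketch against SQLite from Python (table and index names are invented; Postgres and MySQL have their own, richer EXPLAIN output, but the habit is the same):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")

query = "SELECT * FROM orders WHERE customer_id = ?"

# Before indexing: the plan reports a full table scan ("SCAN orders").
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (7,)):
    print(row)

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# After indexing: "SEARCH orders USING INDEX idx_orders_customer ...".
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (7,)):
    print(row)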

Optimizations employed by ORMs

I'm teaching Java EE, especially JPA, Spring, and Spring MVC. As I don't have much experience with large projects, it is difficult to know what to present to students about optimisation of ORMs.
At present, I cover some classic optimisation tricks:
prepared statements (most ORMs implicitly use them by default)
first- and second-level caches
"write first, optimize later"
switching the ORM off and sending SQL commands directly to the database for very frequent, specialized and costly requests
Are there any other ways to optimize ORM usage that the community can point out? I'm especially interested in DAO patterns...
From the developer's point of view, these are the optimization cases to deal with:
Reduce chattiness between the ORM and the DB. Low chattiness is important, since each roundtrip between the ORM and the database implies network interaction, and thus takes between 0.1 and 1 ms at the very least, independently of query complexity (note that maybe 90% of queries are normally fairly simple). A particular case is the SELECT N+1 problem: if processing each row of some query result requires an additional query to be executed (so 1 + count(...) queries are executed in total), the developer must try to rewrite the code in such a way that a nearly constant number of queries is executed. CRUD sequence batching and future queries are other examples of optimizations that reduce chattiness (described below).
Reduce query complexity. Usually the ORM is helpless here, so this is solely the developer's headache. But APIs that allow executing SQL commands directly are frequently intended for this case as well.
So I can list a few more optimizations:
Future queries: an API that allows delaying the execution of a query until the moment its result is actually needed. If several future queries are scheduled at that moment, they're executed all together as a single batch. The main benefit of this is a reduction in the number of roundtrips to the database (= a reduction of chattiness between the ORM and the DB). Many ORMs implement this, e.g. NHibernate.
CRUD sequence batching: nearly the same, but here INSERT, UPDATE and DELETE statements are batched together to reduce chattiness (see the sketch after this list). Again, implemented by many ORM tools.
A combination of the above two cases -- so-called "generalized batching". AFAIK, so far this is implemented only by DataObjects.Net (the ORM my team works on).
Asynchronous generalized batching: if a batch requires no immediate reply, it is executed asynchronously (but certainly in sync with other batches sent by the same session, i.e. the underlying connection is still used synchronously). This brings noticeable benefits when there are lots of CRUD statements: the code modifying persistent entities executes in parallel with the DB-side operation. AFAIK, no ORM implements this optimization so far.
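As a rough illustration of the batching idea, here is a sketch using Python's stdlib sqlite3 (an in-process engine hides the network roundtrip, which is where ORM batching really pays off, but the per-statement overhead is still visible; the table and sizes are made up):

import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
rows = [("v%d" % i,) for i in range(100_000)]

# One statement per row: pay the per-call overhead every time.
start = time.perf_counter()
for row in rows:
    conn.execute("INSERT INTO t (v) VALUES (?)", row)
print("one at a time:", time.perf_counter() - start)

# Batched: one call carries the whole sequence of INSERTs.
start = time.perf_counter()
conn.executemany("INSERT INTO t (v) VALUES (?)", rows)
print("batched:", time.perf_counter() - start)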
All these cases fit under the "write first, optimize later" rule (or "express intention first, optimize later").
Another well-known optimization-related API is the prefetch API ("prefetch paths"). The idea is to fetch a graph of objects that is expected to be processed further with a minimal count of queries (or, better, in minimal time). So this API addresses the "SELECT N+1" problem. Again, this part is normally expected to be implemented in any serious ORM product.
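Here is a sketch of the SELECT N+1 problem and a prefetch-path fix, written in Python with SQLAlchemy (the Author/Post models are invented; JPA's fetch joins and entity graphs express the same idea):

from sqlalchemy import create_engine, Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base, relationship, Session, selectinload

Base = declarative_base()

class Author(Base):
    __tablename__ = "authors"
    id = Column(Integer, primary_key=True)
    posts = relationship("Post")

class Post(Base):
    __tablename__ = "posts"
    id = Column(Integer, primary_key=True)
    author_id = Column(Integer, ForeignKey("authors.id"))

engine = create_engine("sqlite://", echo=True)  # echo=True logs every statement
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Author(id=1), Author(id=2)])
    session.commit()

    # SELECT N+1: one query for the authors, plus one lazy load per author.
    for author in session.query(Author):
        _ = author.posts

    # Prefetch path: a constant two queries, no matter how many authors.
    for author in session.query(Author).options(selectinload(Author.posts)):
        _ = author.posts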
All the above optimizations are safe from the point of view of transaction isolation -- i.e. they don't break it. Caching-related optimizations normally aren't safe in this respect: you must carefully configure caching to ensure you won't get stale objects when getting the actual content is important (e.g. in security checks or some real-time interaction). There are lots of techniques here, starting from the use of built-in caches and finishing with integration with distributed caches (memcached, etc.). Any approach solving the problem is good here; personally, I would expect an open API allowing me to integrate whatever cache I prefer.
P.S. I'm a .NET fanboy, as well as one of the DataObjects.Net and ORMeter.NET developers. So I don't know exactly how similar features are implemented in Java, but I'm familiar with the range of available solutions.
How about N+1 queries for collections? For example, see here: ORM Select n + 1 performance; join or no join
Regarding Spring and Spring MVC, you might find this interesting.
It's in C++, not Java, but it shows how to reduce UI source code w.r.t. Spring by an order of magnitude.
Lazy loading via proxies is probably one of the killer features in ORMs.
Additionally, Hibernate is also able to proxy requests like object.collection.count and optimize them, so that instead of the whole collection being retrieved, only a SELECT COUNT(*) is issued.
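The same trick can be sketched in Python with SQLAlchemy, where a "dynamic" relationship returns a query object, so asking for the count issues a SELECT COUNT(*) instead of pulling the whole collection (models invented for illustration; Hibernate's "extra lazy" collections behave similarly):

from sqlalchemy import create_engine, Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base, relationship, Session

Base = declarative_base()

class Author(Base):
    __tablename__ = "authors"
    id = Column(Integer, primary_key=True)
    posts = relationship("Post", lazy="dynamic")

class Post(Base):
    __tablename__ = "posts"
    id = Column(Integer, primary_key=True)
    author_id = Column(Integer, ForeignKey("authors.id"))

engine = create_engine("sqlite://", echo=True)
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Author(id=1))
    session.commit()
    author = session.get(Author, 1)
    print(author.posts.count())  # emits SELECT count(*), not the rows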
You mentioned the DAO pattern, but many in the JPA camp are saying the pattern is dead (I think the Hibernate guys have blogged on this, but I can't remember the link). Have a look at Spring Roo to see how they add the ORM-related functionality directly to the domain model via static methods.
For a set of JPA optimizations, and their resulting improvements, see,
http://java-persistence-performance.blogspot.com/2011/06/how-to-improve-jpa-performance-by-1825.html

What's the current state of ORMs?

Historically, I've been completely against using ORMs for all but the most basic applications.
My reasoning always has been that it's a very leaky abstraction... mostly because SQL provides a very powerful way to retrieve data from a relational source, which usually gets messed up by the ORM, so that you lose a lot of performance to gain the appearance of not having a relational backend.
I've always thought the DATA should always be kept in the database, not eat up application memory, which won't scale anyway. In addition, the performance hit of being too generic is harmful. For example, if I need the name and address of all the clients in my database, SQL provides me with an easy way to get it, in one query. With an ORM I need to get all the clients and then each name and address; even if it's lazy loaded, it's going to take a LOT longer.
That's what I think, but has any of the above changed? I'm seeing a lot of ORMs like Entity Framework, NHibernate, etc., and they seem to have a lot of popularity lately... Are they worth it? Do they solve the problems I describe above?
Please read All Abstractions Are Failed Abstractions. It should put a lot of your questions in perspective.
Performance is usually not an issue with an ORM -- and if you really find yourself in a situation where it is, then there is always the option to handcraft the SQL statements the ORM uses.
IMHO ORMs give you an instant and huge development speed increase. That's why they are so popular. And using them right does not make you paint yourself into a corner. There is always the option of hand-tuning the performance.
Edit:
Even though Jeff focuses on LINQ to SQL, everything he says about abstractions and performance is equally true of NHibernate (which I know from years of real-world app development). IMHO one should use an ORM by default, since they are more than fast enough for the notorious 90% of situations. Code written for an ORM is usually more maintainable and readable, especially when it is picked up by the next developer who inherits your code. Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live. Never forget about that guy!
In addition, they give you caching, lazy loading, unit of work... you name it, out of the box. And I found that when I was not happy with the performance of the ORM, it was MY fault. ORMs do force you to adhere to good OO design practices and help you shape your domain model.
On the Ruby on Rails side, ActiveRecord -- essentially an ORM -- is the basis of 95% of Rails applications (made-up statistic, but it's around there). Actually, to get to that 95% we would probably need to include other ORMs for Rails, like DataMapper.
The abstraction is leaky, and a developer can always drop down to SQL as necessary. Even when you're not using SQL directly, you have to think about the number of database hits, etc. For instance, in ActiveRecord, "eager loading" is used to avoid multiple database hits, so you see stuff like this (it includes the related "author" of each Post in the initial query... it does a join under the hood, I think):
for post in Post.find(:all, :include => :author)
  puts post.author.name  # author was eager-loaded; no extra query here
end
The point is that the abstraction leaks as do all abstractions, but that's not really the point. To decide whether to use the abstraction or not, you have to consider whether it will add to or reduce your general workload. In other words, will you spend more time retrofitting your concepts to make the abstraction work, or is it ready to do what you need without much hacking (saving you time)?
I think that the abstractions that work are those that are mature: ActiveRecord has been around the block a ton (as has Hibernate), so it provides an abstract way to patch most of the leaks you would normally be worried about, without explicitly rolling your own lower-level solution (i.e., without writing SQL).
Beyond the learning curve, I think that ORMs are an amazing time-saver for most of your database access, and that most apps actually do make quite "normal" use of the DB. While it may not be your case whatsoever, eschewing an ORM for direct DB access is often a case of early, and unnecessary, optimization.
Edit: I hadn't seen this, but the Jeff quote is:
"Does this abstraction make our code at least a little easier to write? To understand? To troubleshoot? Are we better off with this abstraction than we were without it?"
saying essentially the same thing.
Some of the more modern ORMs are really powerful tools that solve a lot of real-world problems. The good ORMs don't try to hide the relational model from you, but actually leverage it to make OO programming more powerful. They really aren't abstractions in the sense of letting you ignore the "low-level" details of relational algebra; instead, they are toolkits that let you build abstractions on the relational model, and that make it easier to bring data into the imperative model, track the changes, and push them back to the database. The SQL language really doesn't provide any good way to factor out common predicates into composable, reusable components to achieve business-rule-level abstractions.
Sure there is a performance hit, but it's mostly a constant-factor thing, as you can make the ORM issue whatever SQL you would issue yourself. For your name-and-address example, in SQLAlchemy you'd just do
for name, address in session.query(Client.name, Client.address):
    # process data
and you're done. But where the ORM helps you is when you have reusable relations and predicates. For instance, say you have defined a way to join to a client's favorited items, and a predicate to see if it is on sale. Then you can get the list of clients that have some of their favorite items on sale while also fetching the assigned salesperson with the following query:
potential_sales = (session.query(Client).join(Client.favorite_items)
                   .filter(Item.is_on_sale)
                   .options(eagerload(Client.assigned_salesperson)))
At least for me, the intent of the query is a lot faster to write, clearer, and easier to understand when written like this, instead of as a dozen lines of SQL.
As with any abstraction, you'll have to pay, either in the form of performance or of leaks. I agree with you in being against ORMs, since SQL is a clean and elegant language. I've sort of written my own little frameworks which do these things for me, but hey, then I sat there with my own ORM (though with a little more control over it than with, for example, Hibernate). The people behind Hibernate state that it is fast. It should be able to do about 95% of the boring work against your database (simple queries, updates, etc.) but gives you the freedom to do the last 5% yourself if you want (you can always write your own mappings in special cases).
I think most of the popularity stems from the fact that many programmers are lazy and want established frameworks to do the dirty, boring persistence job for them (and I can understand that), but the price of an abstraction will always be there. I would consider my options thoroughly before choosing to use an ORM in a serious project.

Performance gains using straight ADO.NET vs an ORM?

Would I get any performance gains if I replaced the data access part of my application, moving from NHibernate to straight ADO.NET?
I know that NHibernate uses ADO.NET at its core!
Short answer:
It depends on what kind of operations you perform. You would probably get a performance improvement if you write good SQL, but in some cases you might get worse performance, since you lose NHibernate's caching, etc.
Long answer:
As you mention, NHibernate sits on top of ADO.NET and provides a layer of abstraction. This makes your life easier in many ways, but as all layers of abstraction it has a certain performance cost.
The main case where you probably would see a performance benefit is when you are operating on many objects at once, such as updating many entities or fetching a large amount of entities. This is because of the work that the NHibernate session does to keep track of which objects are modified etc. My experience is that the performance of NHibernate degrades significantly as the amount of entities in the session grows.
NHibernate has a lot of ways to improve performance, and if you really know it well, you can get it to perform quite close to ADO.NET. However, if you are not that familiar with it, you can easily shoot yourself in the foot, performance-wise (the select N+1 problem, etc.).
There are some situations where you could actually get worse performance when switching from NHibernate to straight ADO.NET. This is because the NHibernate abstraction layer introduces features that can improve performance, such as caching. NHibernate also includes functionality for optimizing the generated SQL for the current database management system. For example, if you are using SQL Server, it might generate slightly different SQL than if you were using Oracle.
It is worth mentioning that it does not have to be an all-or-nothing situation. You could use NHibernate for the 90% of your database access for which it works great, and then use straight SQL for the 10% where you do complex queries, batch inserts/updates, etc.
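As a sketch of that hybrid approach (shown here in Python with SQLAlchemy for brevity -- NHibernate has an equivalent escape hatch for raw SQL -- with a made-up orders table):

from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

engine = create_engine("sqlite://")

with Session(engine) as session:
    session.execute(text("CREATE TABLE orders "
                         "(id INTEGER PRIMARY KEY, status TEXT, created_at TEXT)"))
    # ORM entities would handle the routine 90% here. For the complex 10%,
    # drop to raw SQL while staying inside the same session and transaction:
    session.execute(
        text("UPDATE orders SET status = 'archived' WHERE created_at < :cutoff"),
        {"cutoff": "2010-01-01"},
    )
    session.commit()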

SQLite subqueries: in one big query or in a for loop?

I was planning to benchmark this, but since it's a lot of work, I'd like to check whether I've missed any obvious answer first.
I have a huge query that gets some more details for each row with a subquery.
Each row is then used in a ListAdapter that is plugged into a ListView, so another loop takes each row one by one to make it a ListItem.
What do you think is more efficient:
Keeping the subqueries in the SQL mess, counting on the SQL engine to make optimizations.
Taking the subqueries out into the ListAdapter loop, so we lazy-load the details on display: much more readable, but I'm afraid too many hits would slow down the process.
Two important things:
I can't rewrite the big SQL chunk to get rid of the subqueries. I know it would be better, but I failed to do so.
As far as I can tell, a list won't contain more than 1000 items, and it's a desktop app, so there is no concurrency. Is it even relevant to care about performance in that case? If not, I'd still be interested in the answer for a high-traffic web site anyway. It's good to know...
SQLite is a surprisingly good little engine, but it's not really about extra-clever optimizations, and I wouldn't really consider it for a "high-traffic web site". One big plus (for uses within its limitations) is that it runs in-process, so the overhead of multiple queries is really small compared to one big query; if that's easiest to code for your specific use case, I would really consider it (and doing it in a "lazy load" way, as you hint, might actually make the first screen of data appear faster!). As you suspect, it's unlikely that this will be a performance bottleneck in your use case, so going for simpler and thus more reliable coding is an important plus.
If I were doing a high-traffic site, using a richer, "heavier" engine such as PostgreSQL, Oracle, SQL Server, or DB2, I would trust the optimizer much more. One thing I've noticed, however, is that I can often (alas, not always) change sub-queries into joins, and that often tends to improve performance (joins make it easier for the optimizer to use good indices, I think -- I have never coded a SQL optimizer myself, but that's my impression from staring at query execution plans from many engines for alternative forms of queries... and that, of course, DOES assume you have good indices!). This would have to be confirmed with a benchmark of the specific case in question, of course, but it would be my initial working assumption.
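If you do end up benchmarking, a sketch of the comparison in Python's stdlib sqlite3 might look like this (around the 1000-row scale mentioned in the question; table and column names are made up):

import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE details (item_id INTEGER, note TEXT);
    CREATE INDEX idx_details_item ON details (item_id);
""")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [("item%d" % i,) for i in range(1000)])
conn.executemany("INSERT INTO details VALUES (?, ?)",
                 [(i + 1, "note") for i in range(1000)])

# 1. One big query with a correlated subquery.
start = time.perf_counter()
conn.execute("""SELECT i.name,
                       (SELECT d.note FROM details d WHERE d.item_id = i.id)
                FROM items i""").fetchall()
print("subquery:", time.perf_counter() - start)

# 2. The same result rewritten as a join.
start = time.perf_counter()
conn.execute("""SELECT i.name, d.note
                FROM items i LEFT JOIN details d ON d.item_id = i.id""").fetchall()
print("join:", time.perf_counter() - start)

# 3. The "lazy" per-row loop -- cheap here because SQLite runs in-process.
start = time.perf_counter()
for item_id, _name in conn.execute("SELECT id, name FROM items").fetchall():
    conn.execute("SELECT note FROM details WHERE item_id = ?",
                 (item_id,)).fetchone()
print("per-row loop:", time.perf_counter() - start)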
What about using cursors?
I would prefer using one big query and letting the SQL engine optimize it.
Also, I can't think of an example where it's better to do a loop outside SQL instead of using a "big" query or using cursors.
But the best way to know what's better is to benchmark it.
Good luck!
