My Database Design skills stink. Where to seek remedy? - performance

I have a web site that's been progressively expanding in both traffic and complexity of database design. I've always worked as a developer first & foremost, and never really been much of a DB administrator beyond what I need to do to get my code running. This needs to change - I need to improve efficiency on the database side of things.
To give a vague example, I'm looking for how to go about learning:
Optimising complex tables/relationships for performance/scaling
How to index efficiently. (At the moment I throw indexes on foreign keys, and that's about it)
General design principles for complex databases
Most of the resources I've found are either directed more towards the basics of SQL ("this is a SELECT query, a JOIN, etc") or focus primarily on performance issues outside the DB.
So, I know this is a little vague - but where should I look to ensure my database is designed in the most efficient & integral manner possible?

Learn about data modeling. Choosing the right data structure is always a crucial first step, for programming in general and databases in particular. Performance cannot be "bolted" on top of a bad data structure! The ERwin Methods Guide is probably not a bad way to start learning about data modeling.
Learn how DBMSes organize data at the physical level. This will help you immensely in understanding how to "shape" your data for performance and how to effectively leverage many of the performance mechanisms modern DBMSes put at your disposal. Use The Index, Luke! is an excellent tutorial on the topic.
Learn how to efficiently access the database and make sure you really understand the client API that will be called from your code. Different APIs have their own idiosyncrasies, but they all share some common themes, such as parameter binding, query preparation and fetching. Even if you are "shielded" by an ORM from ever having to, say, bind parameters manually, this is still taking place "under the covers" and understanding it raises your ability to write performant code.
Measure, measure, measure. Modern information systems are immensely complex and even experts find themselves making incorrect assumptions, so don't rely on assumptions!
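To make the point about parameter binding and fetching concrete, here is a minimal sketch using Python's built-in sqlite3 module. The users table, its columns and the find_users_by_city helper are made up for illustration; other client APIs spell these things differently, but the themes are the same.

```python
import sqlite3

def find_users_by_city(conn, city, batch_size=2):
    """Yield (id, name) rows for one city, fetched in small batches."""
    # The "?" placeholder is bound by the driver; the value is never
    # spliced into the SQL text, so quotes in it cannot change the query.
    cursor = conn.execute(
        "SELECT id, name FROM users WHERE city = ? ORDER BY id", (city,)
    )
    while True:
        rows = cursor.fetchmany(batch_size)  # fetch a batch, not the whole result
        if not rows:
            break
        yield from rows

# Hypothetical schema and data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("Ann", "Oslo"), ("Bob", "Rome"), ("Cid", "Oslo"), ("O'Hara", "Oslo")],
)
oslo_users = list(find_users_by_city(conn, "Oslo"))
```

Because the value is bound rather than concatenated, a name like O'Hara or a hostile input like "Oslo' OR '1'='1" is treated as plain data.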

I would suggest some reading on performance tuning. It is very specialized, depending on the database backend you use, but here are some books to consider:
SQL Server
General performance tuning

First and foremost, I'd recommend learning how to use EXPLAIN and what its output means. Run it on your most common queries and study the output. Are the queries using sensible indexes? Are they using indexes at all? Queries that look very simple at a glance might end up being quite costly.
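As a rough illustration of what studying a plan looks like, here is a small sketch using SQLite through Python's sqlite3 module. SQLite's variant of the command is EXPLAIN QUERY PLAN, and other engines format their output differently. The orders table, the index name and the query_plan helper are assumptions for the example.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable plan lines SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the description

# Hypothetical table, only to have something to explain.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
sql = "SELECT * FROM orders WHERE customer_id = ?"

plan_before = query_plan(conn, sql, (42,))   # no usable index: a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))    # the planner can now search the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan lines varies between SQLite versions, so look for the change from a scan to an index search rather than for a particular string.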
Next, I'd suggest finding your slowest queries. Postgres (for example) can log the SQL source of every query that takes longer than a configurable threshold to run (the log_min_duration_statement setting, specified in milliseconds). Are they slow because they're unindexed, very complex, or operating on a huge amount of data?
Third, I'd look at the number of times a particular query is run. Are you using the database to store static data, and hitting a table over and over again to grab a record that never changes? You could probably cache the result somewhere.
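For the caching idea, a minimal sketch in Python might look like the following. The countries table and the country_name function are hypothetical, and functools.lru_cache is just one simple in-process way to cache; a shared cache would be a different design.

```python
import sqlite3
from functools import lru_cache

# Hypothetical static lookup table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?)", [("NO", "Norway"), ("IT", "Italy")]
)

query_count = 0  # counts how often we actually hit the database

@lru_cache(maxsize=None)
def country_name(code):
    """Look up a country name; repeated calls for the same code skip the database."""
    global query_count
    query_count += 1
    row = conn.execute(
        "SELECT name FROM countries WHERE code = ?", (code,)
    ).fetchone()
    return row[0] if row else None

names = [country_name("NO") for _ in range(1000)]
print(query_count, country_name.cache_info())
```

This only makes sense for data that really never changes; if it can change, you need to decide when to call country_name.cache_clear().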


NoSQL db performance testing

Let's assume you've got a NoSQL database: Redis, Cassandra, MongoDB. You need to check the overall performance of this database across various platforms, operating systems, and even the programming languages used for the tests. It's not tied to a specific application or schema.
What tests would you want to see? Can you please help me form the requirements?
How does the database operate in a cluster?
In a broken cluster?
In a cloud environment?
How does it perform queries with 10k connections open?
What tools would you use?
Is it something like JMeter -> HTTP server -> database?
JMeter -> TCP app -> database?
All the material I've found about database performance testing treats the database as part of some product (specific schema, specific environment).
Have you thought about database performance testing when the database is the product itself?
Looking forward to your help.
In NoSQL benchmarks and performance evaluations I've put together a list of the benchmarks that are correct in the sense that they clearly define the purpose of the benchmark and compare similar features (apples-to-apples comparisons); there are way too many benchmarks out there that fail at least one of these fundamental requirements. Going through those, you'll be able to extract the bits that are interesting for your own benchmark, learn what tools have been used, and get some benchmarking code too.
So far the most generic NoSQL benchmark is YCSB (Yahoo! Cloud Serving Benchmark). Recently the Cubrid blog posted the results of running this benchmark against some of the most popular NoSQL solutions, and that might give you an idea of how to interpret results.
"check the overall performance for this database"
Unless you need to do it for fun, or you just want to get a benchmark for the sake of getting a benchmark, I would recommend tailoring a performance benchmark to the actual problem/requirements.
For example, do you really need crazy fast writes? Are you OK with losing data? Do you mind spending time on configuring failover? Do you plan to scale up or out? Are you planning for TBs of data? Etc.
From the examples you gave => Redis, Cassandra and MongoDB are quite different:
Redis is mostly a cache, and it is really fast, but being just a cache it would not help you much with medium-complexity aggregation. However, it is currently the best cache out there (my opinion). "Redis + a killer DB" is an ideal combination. It also has a built-in benchmark tool, redis-benchmark, that you can try.
Cassandra is a solid product modelled after Google Bigtable (but I am sure you already know that). It scales writes well if you have lots of nodes, but if you reach TBs of data, for example, it can take days to add nodes. It is also not the simplest one to understand. But if you are OK with paying, there are excellent guys from DataStax who can take all the complexity away. I have a very simple Cassandra Bombardier that may help you get started.
MongoDB is a great DB for multiple reasons: very sexy and simple query language, good documentation, huge community, etc. It is not so great in other aspects: you need to spend time sharding it correctly, and then resharding it again [compare to e.g. Riak, where it is done automatically]. It is very fast (writes) if the data [not just the index] fits in RAM, but it starts to slow down very quickly if it does not. There is ongoing speculation that you may lose data (from one of the Basho engineers: "I had personally spent some time finding out ways to demonstrate that MongoDB will lose writes in the face of failure"), and aggregation queries may take a while even on a not so large dataset. I have a Mongo Performance Playground that you may find useful.

When is it too late to optimize for performance?

I know that you shouldn't optimize too early, and you should instead aim for maintainability. My question is, at what point is it too late?
I'm working on a website, similar to Yahoo Answers, and my database structure is exactly what I feel it should be. Table for users, questions, answers, question_comments, answer_comments, etc.
My question is, IF the site were to grow, how would this architecture scale? I'm thinking of putting both questions and answers in a single table (posts), separating them by type, and then putting both question_comments and answer_comments in the same table (comments). I believe this is similar to Stack Overflow's DB schema.
I know what you guys are gonna say, "Don't worry about it until it becomes an actual problem". But wouldn't it be a little too late to worry about it then?
The reason why it's a bad practice to optimize early is you don't know where your bottlenecks will be until your website sees a significant amount of traffic. How your users access and interact with your site is an unknown at this point.
It's almost always best to start with a 'good' architecture (normalized database, MVC architecture, DRY, well-written frontend code, etc) and go from there. It will be much easier to scale a clean, organized architecture than one that was prematurely optimized.
At best right now you can do some load testing via ab (ApacheBench) or another load testing tool to see where your current bottlenecks are. It certainly won't find all of them, but it will find some.
If you're really worried about this (and you shouldn't be yet), install Nagios or Munin on your server to monitor performance. Use a third party tool to measure page load time daily. Once you start seeing issues then you can profile and tune.
You absolutely should optimize if a fast service is a fundamental requirement of the application.
If sub-second responses are not a requirement, then you can write clean code and optimize later.
A good example of this was JavaScript before the latest version of browsers: people who wrote nice, clean, extensible JS for their pages had terrible performance and had to start from scratch.
One huge table is generally harder to maintain. People usually cut their tables into partitions and even their databases into shards.
I don't see how putting all comments into the same table would save you a join. Really, putting questions and answers into the same table won't save you a join either; you'll just be joining the table to itself.
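To illustrate the self-join point, here is a small sketch with a single posts table in SQLite via Python's sqlite3 module. The column names (type, parent_id, body) are assumptions made for the example, not the asker's or Stack Overflow's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical combined table: questions and answers distinguished by "type".
conn.execute("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        type TEXT NOT NULL,                      -- 'question' or 'answer'
        parent_id INTEGER REFERENCES posts(id),  -- NULL for questions
        body TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO posts (id, type, parent_id, body) VALUES (?, ?, ?, ?)",
    [
        (1, "question", None, "How do I index?"),
        (2, "answer", 1, "Start with EXPLAIN."),
        (3, "answer", 1, "Measure first."),
    ],
)

# Fetching a question together with its answers still needs a join;
# it is just a join of posts against posts.
pairs = conn.execute("""
    SELECT q.body, a.body
    FROM posts AS q
    JOIN posts AS a ON a.parent_id = q.id
    WHERE q.type = 'question' AND a.type = 'answer'
    ORDER BY a.id
""").fetchall()
```

So merging the tables changes what you join, not whether you join.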
If you want to save on joins, I'd expect you use a document-oriented NoSQL database, such as MongoDB. That's where you can store a question with all related answers and comments in a single 'record', fetchable with one operation.
Databases need to be designed with performance in mind; don't wait until you have a problem later. Premature optimization doesn't mean don't do it in design, it means don't get ridiculously excessive about it. However, there are known performance killers for every database backend, and it is foolish to design around one of those when a different technique will be faster and take the same amount of time to code if you are familiar with it. So before designing any database, read up on performance tuning and you will never write database code the same way again.

Testing an Oracle database for common bugs/performance issues?

Are there any good scripts that I could run against my Oracle database to test for SQL defects or maybe common performance issues?
Edit: Everything in an Oracle database can be queried, from the PL/SQL packages and indexes to the SQL execution statistics. The performance books say to look in a particular place, which shows some absolute values that the developer then has to interpret. Has anyone combined their knowledge to build this interpretation into the scripts?
Are you asking for the information in this book?
Are you asking about this wiki?
Or are you asking for this vendor information?
Edit. There is no magical set of queries that you simply run and set the various tuning options.
Oracle is very complicated. Changing a parameter to make one thing fast can make several other things faster or slower, or make the instance consume more real memory than you have installed. It's hard to generalize this into magical queries. You have tools, but even then, the tools give you tuning options and you may need to run different experiments.
Performance is a balance. You have to strike a balance between physical I/O time and CPU time. It's not possible to generalize this into a magical query. Your system may need faster physical I/O (data warehouses, for instance, often need this) because it can't effectively work from cache. My system may need faster processor time and will have to work in cache to achieve this.
Performance is a function of your application. No magical query of Oracle will reveal a single thing about how your application is designed to work.
Enterprise Manager and its associated performance tools are a good place to start looking for queries that are consuming the most resources. Here you can see the plans generated for your SQL, view traces of long-running queries, etc.
If you have a budget, there is Spotlight by Quest. I've only used the trial version, but I found it useful.
I would recommend checking out the book Optimizing Oracle Performance and any of Cary Millsap's other writings. It is a waste of time to think about optimizing every query. You really need an approach to finding out where your performance bottlenecks are. His Method R approach is a very good one to read up on. Also most of Tom Kyte's books go into detail about performance issues.

What's the current state of ORMs?

Historically I've been completely against using ORMs for all but the most basic applications.
My reasoning is, and always has been, that it's a very leaky abstraction ... mostly because SQL provides a very powerful way to retrieve data from a relational source, which usually gets messed up by the ORM so that you lose a lot of performance to gain an appearance of not having a relational backend.
I've always thought the DATA should be kept in the database, not eating up application memory, which won't scale anyway. In addition, the performance hit of being too generic is harmful. For example, if I need the name and address of all the clients in my database, SQL provides me with an easy way to get it, in one query. With an ORM I need to get all the clients and then each name and address; even if it's lazy loaded it's gonna take a LOT longer.
That's what I think, but has any of the above changed? I'm seeing a lot of ORMs like the Entity Framework, NHibernate, etc. And they seem to have a lot of popularity lately... Are they worth it? Do they solve the problems I describe above??
Please read: All Abstractions Are Failed Abstractions. It should put a lot of your questions in perspective.
Performance is usually not an issue with an ORM, and if you really find yourself in a situation where it is, there is usually the option to handcraft the SQL statements the ORM uses.
IMHO ORM give you an instant and huge development speed increase. That's why they are so popular. And using them right does not make you paint yourself in a corner. There is always the option of hand tuning the performance.
Even though Jeff focuses on Linq to SQL, everything he says about abstractions and performance is equally true for NHibernate (which I know from years of real world app development). IMHO one should use an ORM by default, since they are more than fast enough for the notorious 90% of situations. Code written for an ORM is usually more maintainable and readable, especially when it is picked up by the next developer who inherits it. Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live. Never forget about that guy!
In addition they give you out-of-the-box caching, lazy loading, unit of work, ... you name it. And I found that when I was not happy about the performance of the ORM, it was MY fault. ORMs do force you to adhere to good OO design practices and help you shape your Domain Model.
On the Ruby on Rails side, ActiveRecord -- essentially an ORM -- is the basis of 95% of Rails applications (made-up statistic, but it's around there). Actually, to get to that 95% we would probably need to include other ORMs for Rails, like DataMapper.
The abstraction is leaky, and a developer can always dip down to SQL as necessary. Even when you're not using SQL directly, you have to think about the number of database hits, etc. For instance, in ActiveRecord, "eager loading" is used to avoid multiple database hits, so you see stuff like this (it includes the related "author" field of each Post in the initial query... it does a join under the hood, I think):
for post in Post.find(:all, :include => :author)
The point is that the abstraction leaks as do all abstractions, but that's not really the point. To decide whether to use the abstraction or not, you have to consider whether it will add to or reduce your general workload. In other words, will you spend more time retrofitting your concepts to make the abstraction work, or is it ready to do what you need without much hacking (saving you time)?
I think that the abstractions that work are those that are mature: ActiveRecord has been around the block a ton (as has Hibernate), so it provides an abstract way to patch most of the leaks you would normally be worried about, without explicitly rolling your own lower-level solution (i.e., without writing SQL).
Beyond the learning curve, I think that ORMs are an amazing time-saver for most of your database access, and that most apps actually do make quite "normal" use of the DB. While it may not be your case whatsoever, eschewing an ORM for direct DB access is often a case of early, and unnecessary, optimization.
Edit: I hadn't seen this, but the Jeff quote is
"Does this abstraction make our code at least a little easier to write? To understand? To troubleshoot? Are we better off with this abstraction than we were without it?"
saying essentially the same thing.
Some of the more modern ORMs are really powerful tools that solve a lot of real world problems. The good ORMs don't try to hide the relational model from you, but actually leverage it to make OO programming more powerful. They really aren't abstractions in the sense that they let you ignore the "low-level" details of relational algebra; instead they are toolkits that let you build abstractions on the relational model and make it easier to bring data into the imperative model, track the changes and push them back to the database. The SQL language really doesn't provide any good way to factor out common predicates into composable, reusable components to achieve business-rule level abstractions.
Sure, there is a performance hit, but it's mostly a constant-factor thing, as you can make the ORM issue whatever SQL you would issue yourself. Like for your name and address example, in SQLAlchemy you'd just do
for name, address in session.query(Client.name, Client.address):
    pass  # process data
and you're done. But where the ORM helps you is when you have reusable relations and predicates. For instance, say you have defined a way to join to a client's favorited items, and a predicate to see if it is on sale. Then you can get the list of clients that have some of their favorite items on sale while also fetching the assigned salesperson with the following query:
potential_sales = (session.query(Client).join(Client.favorite_items)
At least for me, the query is a lot faster to write, and its intent is clearer and easier to understand, when written like this instead of as a dozen lines of SQL.
As with any abstraction, you'll have to pay either in the form of performance or in leaks. I agree with you in being against ORMs, since SQL is a clean and elegant language. I've sort of written my own little frameworks which do these things for me, but hey, then I sat there with my own ORM (but with a little more control over it than, for example, Hibernate). The people behind Hibernate state that it is fast. It should be able to do about 95% of the boring work against your database (simple queries, updates, etc.) but gives you the freedom to do the last 5% yourself if you want (you could always write your own mappings in special cases).
I think most of the popularity stems from the fact that many programmers are lazy and want established frameworks to do the dirty, boring persistence job for them (I can understand that), but the price of an abstraction will always be there. I would consider my options thoroughly before choosing to use an ORM in a serious project.

SQLite subqueries: in one big query or in a for loop?

I was planning to benchmark that but since it's a lot of work, I'd like to check if I didn't miss any obvious answer before.
I have a huge query that gets some more details for each row with a subquery.
Each row is then used in a ListAdapter that is plugged into a ListView, so another loop takes each row one by one to make it a ListItem.
What do you think is more efficient:
Keeping the subqueries in the SQL mess, counting on the SQL engine to make optimizations.
Taking the subqueries out into the ListAdapter loop, so we lazy load the details on display: much more readable, but I'm afraid too many hits would slow down the process.
Two important things:
I can't rewrite the big SQL chunk to get rid of the subqueries. I know it would be better, but I failed to do so.
As far as I can tell, a list won't contain more than 1000 items, and it's a desktop app so there is no concurrency. Is it even relevant to care about perf in that case? If not, I'd still be interested in the answer for a high traffic web site anyway. It's good to know...
SQLite is a surprisingly good little engine, but it's not really about extra clever optimizations, and I wouldn't really consider it for a "high traffic web site". One big plus (for uses within its limitations) is that it can run in-process, so the extra overhead of issuing many small queries instead of one big query is really small (there is no network round trip); if that's easiest to code, for your specific use case, I would really consider it (and doing it in a "lazy load" way, as you hint, might actually make the first screen of data appear faster!). As you suspect, it's unlikely that this will be a performance bottleneck in your use case, so going for simpler and thus more reliable coding is an important plus.
If I was doing a high-traffic site, and using a richer, "heavier" engine such as PostgreSQL, Oracle, SQL Server, or DB2, I would trust the optimizer much more. One thing I've noticed, however, is that I can often (alas, not always) change sub-queries into joins, and that often tends to improve performance (joins make it easier for the optimizer to use good indices, I think -- I have never coded a SQL optimizer myself, but that's my impression from staring at query execution plans from many engines for alternative forms of queries... that, of course, DOES assume you have good indices!-) -- this would have to be confirmed with a benchmark of the specific case in question, of course, but it would be my initial working assumption.
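If you do want to benchmark it, a rough sketch in Python with the built-in sqlite3 module could look like this. The items and details tables, the index and the 1000-row size are assumptions standing in for the real schema, and the timings only mean something when measured on your own data.

```python
import sqlite3
import time

# Hypothetical schema standing in for the real "big query" tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE details (item_id INTEGER, note TEXT)")
conn.execute("CREATE INDEX idx_details_item ON details (item_id)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", [(i, f"item {i}") for i in range(1, 1001)]
)
conn.executemany(
    "INSERT INTO details VALUES (?, ?)", [(i, f"note {i}") for i in range(1, 1001)]
)

def one_big_query():
    """One statement; the detail comes from a correlated subquery."""
    return conn.execute(
        "SELECT i.id, i.title, "
        "(SELECT d.note FROM details d WHERE d.item_id = i.id) "
        "FROM items i ORDER BY i.id"
    ).fetchall()

def query_per_row():
    """Main query first, then one small query per row (the 'for loop' option)."""
    rows = []
    for item_id, title in conn.execute(
        "SELECT id, title FROM items ORDER BY id"
    ).fetchall():
        note = conn.execute(
            "SELECT note FROM details WHERE item_id = ?", (item_id,)
        ).fetchone()
        rows.append((item_id, title, note[0] if note else None))
    return rows

def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

big_rows, big_seconds = timed(one_big_query)
loop_rows, loop_seconds = timed(query_per_row)
print(f"one query: {big_seconds:.4f}s, per-row queries: {loop_seconds:.4f}s")
```

Both functions return the same rows, so whichever is easier to maintain can be swapped in once you have seen the numbers for your case.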
What about using cursors?
I would prefer using a big query and let my SQL engine optimize my query.
Also I can't think of an example where it's better to do a loop outside SQL instead of using a "big" query or using cursors.
But the best way to know what's better is to benchmark it.
Good luck!
