When is it too late to optimize for performance?

I know that you shouldn't optimize too early, and that you should instead aim for maintainability. My question is, at what point is it too late?
I'm working on a website similar to Yahoo Answers, and my database structure is exactly what I feel it should be: tables for users, questions, answers, question_comments, answer_comments, etc.
My question is: IF the site were to grow, how would this architecture scale? I'm thinking of putting both questions and answers in a single table (posts), separating them by type, and then putting both question_comments and answer_comments in the same table (comments). I believe this is similar to Stack Overflow's DB schema.
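Just to make the idea concrete, here is roughly what I have in mind (SQLite syntax purely for illustration; the column names are placeholders, not a finished design):

import sqlite3

conn = sqlite3.connect("qa_site.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS posts (
    id        INTEGER PRIMARY KEY,
    type      TEXT NOT NULL CHECK (type IN ('question', 'answer')),
    parent_id INTEGER REFERENCES posts(id),  -- NULL for questions, the question's id for answers
    user_id   INTEGER NOT NULL,
    body      TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS comments (
    id      INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES posts(id),  -- works for both questions and answers
    user_id INTEGER NOT NULL,
    body    TEXT NOT NULL
);
""")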
I know what you guys are gonna say: "Don't worry about it until it becomes an actual problem." But wouldn't it be a little too late to worry about it then?
Thanks

The reason it's bad practice to optimize early is that you don't know where your bottlenecks will be until your website sees a significant amount of traffic. How your users access and interact with your site is an unknown at this point.
It's almost always best to start with a 'good' architecture (normalized database, MVC architecture, DRY, well-written frontend code, etc) and go from there. It will be much easier to scale a clean, organized architecture than one that was prematurely optimized.
At best, right now you can do some load testing via ab (Apache Bench) or another load-testing tool to see where your current bottlenecks are. It certainly won't find all of them, but it will find some.
If you're really worried about this (and you shouldn't be yet), install Nagios or Munin on your server to monitor performance. Use a third-party tool to measure page load time daily. Once you start seeing issues, you can profile and tune.

You absolutely should optimize if a fast service is a fundamental requirement of the application.
If sub-second responses are not a requirement, then you can write clean code and optimize later.
A good example of this was JavaScript before the latest generation of browsers: people who wrote nice, clean, extensible JS for their pages had terrible performance and had to start from scratch.

One huge table is generally harder to maintain. People usually cut their tables into partitions and even their databases into shards.
I don't see how putting all comments into the same table would save you a join. Really, putting questions and answers into the same table won't save you a join either; you'll just be joining the same table to itself.
If you want to save on joins, I'd expect you to use a document-oriented NoSQL database, such as MongoDB. That's where you can store a question with all related answers and comments in a single 'record', fetchable with one operation.
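For instance, with the Python driver a whole thread could live in one document, roughly like this (field names invented, and it assumes a MongoDB instance running locally):

from pymongo import MongoClient

db = MongoClient()["qa_site"]
db.questions.insert_one({
    "title": "When is it too late to optimize?",
    "body": "...",
    "comments": [{"user": "alice", "text": "good question"}],
    "answers": [{"user": "bob", "text": "measure first", "comments": []}],
})
# One round trip brings back the question with all its answers and comments:
thread = db.questions.find_one({"title": "When is it too late to optimize?"})

The trade-off is that you give up the ad-hoc joins a relational schema gives you, so it only pays off if you almost always read the thread as a unit.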

Databases need to be designed with performance in mind; don't wait until you have a problem later. Premature optimization doesn't mean don't consider performance in the design, it means don't get ridiculously excessive about it. However, there are known performance killers for every database backend, and it is foolish to design around one of those when a different technique would be faster and take the same amount of time to write code for if you are familiar with it. So before designing any database, read up on performance tuning and you will never write database code the same way again.

Related

Temporary Dashboard/Reporting Solution while Building a Data Warehouse

Our situation is that we are going to start to build a data warehouse. The data warehouse is going to take some time, if we are going to do it right. It will be built looking at individual processes and growing from there.
We only have three databases that we will be pulling data from. All three databases hold distinct information (financial info, scheduling and patient information - visits, diagnosis,etc).
I am thinking of using a dashboard/reporting tool (as an example, http://www.jedox.com/en/ or http://www.board.com/us/) to display the information to the business. It will slowly start incorporating the DW as it is being designed and pushed to production.
My question after all this is: what is the best way to present the data to the application (dashboard/reporter) on the backend that would be efficient, yet not so time-consuming that I'd rather just build the data warehouse? I.e. views, materialized views, a small separate DB containing subset data from the main DBs, etc.?
This may not be answering your question directly, but rather than find a temporary solution I would just build your warehouse faster.
First, if you can build it quickly then you don't need a temporary one; if you can't build it quickly then you won't be able to build a temporary solution quickly either. You even mentioned developing a "small separate DB containing subset data"; that's exactly what a reporting database is!
Second, any temporary solution will have to be maintained and supported too: if it's too useful then your temporary solution will become your permanent one anyway. That might actually be a good thing because if the 'temporary' solution meets your requirements then why not keep it?
Anyway, I would start by identifying one or two key reports that have high value for your users and commit to delivering them in 2 months (1 month would be even better). Develop the most basic, minimal database and ETL/reporting processes possible to deliver those reports, even if it seems like a horrible, hacked-together mess. Make sure the reports are internal ones that no one will send to an outside customer; that means you can avoid spending time on making them pretty.
After you've delivered those reports, you can now step back and look at what you did. Hopefully you will find yourself in a position where:
1. Your users got some useful reports very quickly
2. The reports are ugly but the numbers are correct
3. You've learned a lot about the users' needs and how they interpret and use the data
4. Your technical implementation is a mess, but you know that and you also know how to improve it
If #1 and #2 are true then you'll have delivered a lot of business value quickly while also setting the user expectation that correct is often more valuable than pretty (that's really helpful on a reporting project). If #3 and #4 are true then your second iteration will be a big improvement on the first one and even if you find yourself in the worst case scenario of having to re-develop the whole thing from scratch, you'll do it faster and better because you've learned so much.
This is simply agile development, of course: there's no reason you can't use rapid prototyping and incremental delivery in a data warehouse project. Like any IT solution the warehouse will continuously grow and be maintained over time so there's absolutely no reason to try to get everything complete and correct in the first version. It's highly likely that your users don't even really know what they want (in detail) so this approach helps to clarify their expectations and requirements more quickly too.

My Database Design skills stink. Where to seek remedy?

I have a web site that's been progressively expanding in both traffic and complexity of database design. I've always worked as a developer first and foremost, and never really been much of a DB administrator beyond what I need to do to get my code running. This needs to change - I need to improve efficiency on the database side of things.
To give a vague example, I'm looking for how to go about learning:
Optimising complex tables/relationships for performance/scaling
How to index efficiently. (At the moment I throw indexes on foreign keys, and that's about it)
General design principles for complex databases
Most of the resources I've found are either directed more towards the basics of SQL ("this is a SELECT query, a JOIN, etc") or focus primarily on performance issues outside the DB.
So, I know this is a little vague - but where should I look to ensure my database is designed in the most efficient and integral manner possible?
Learn about data modeling. Choosing the right data structure is always a crucial first step, for programming in general and databases in particular. Performance cannot be "bolted" on top of a bad data structure! The ERwin Methods Guide is probably not a bad way to start learning about data modeling.
Learn how DBMSes organize data at the physical level. This will help you immensely in understanding how to "shape" your data for performance and how to effectively leverage many of the performance mechanisms modern DBMSes put at your disposal. Use The Index, Luke! is an excellent tutorial on the topic.
Learn how to efficiently access the database and make sure you really understand the client API that will be called from your code. Different APIs have their own idiosyncrasies, but they all share some common themes, such as parameter binding, query preparation and fetching (there is a small sketch of this below). Even if you are "shielded" by an ORM from ever having to, say, bind parameters manually, this is still taking place "under the covers" and understanding it raises your ability to write performant code.
Measure, measure, measure. Modern information systems are immensely complex and even experts find themselves making incorrect assumptions, so don't rely on assumptions!
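As a tiny sketch of the parameter-binding point above, here is what binding and fetching look like with Python's built-in sqlite3 module (other client APIs differ in placeholder syntax, but the idea is the same; the clients table is made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (name TEXT, address TEXT, country TEXT)")
conn.execute("INSERT INTO clients VALUES (?, ?, ?)", ("Acme", "Main St 1", "FI"))
# The driver binds the parameter: no string concatenation, no injection risk,
# and the statement can be prepared once and reused with different values.
for name, address in conn.execute(
        "SELECT name, address FROM clients WHERE country = ?", ("FI",)):
    print(name, address)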
I would suggest some reading on performance tuning. It is very specialized depending on the database backend you use, but here are some books to consider:
SQL Server
http://www.amazon.com/Server-Query-Performance-Tuning-Distilled/dp/1590594215/ref=sr_1_2?s=books&ie=UTF8&qid=1334154710&sr=1-2
http://www.amazon.com/Performance-Tuning-Server-Dynamic-Management/dp/1906434476/ref=sr_1_12?s=books&ie=UTF8&qid=1334154710&sr=1-12
MySQL
http://www.amazon.com/High-Performance-MySQL-Optimization-ebook/dp/B0028N4W7Y/ref=sr_1_3?ie=UTF8&qid=1334154504&sr=8-3
Oracle
http://www.amazon.com/Oracle-Database-Release-Performance-Techniques/dp/0071780262/ref=sr_1_2?s=books&ie=UTF8&qid=1334154909&sr=1-2
General Performance Tuning
http://www.amazon.com/SQL-Performance-Tuning-Peter-Gulutzan/dp/0201791692/ref=sr_1_18?s=books&ie=UTF8&qid=1334154964&sr=1-18
First and foremost, I'd recommend learning how to use EXPLAIN and what its output means. Run it on your most common queries and study the output. Are the queries using sensible indexes? Are they using indexes at all? Queries that look very simple at a glance might end up being quite costly.
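For example, with SQLite you can inspect a plan straight from Python (Postgres has EXPLAIN and EXPLAIN ANALYZE for the same purpose); the posts table here is made up:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
# Without an index on user_id the plan reports a full table scan ("SCAN posts");
# after CREATE INDEX idx_posts_user ON posts(user_id) it becomes an index search.
for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM posts WHERE user_id = ?", (42,)):
    print(row)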
Next, I'd suggest finding your slowest queries. Postgres (for example) has a feature that allows you to log the SQL source for all queries that take longer than N seconds to run. Are they slow because they're unindexed, very complex, or operating on a huge amount of data?
Third, I'd look at the number of times a particular query is run. Are you using the database to store static data, and hitting a table over and over again to grab a record that never changes? You could probably cache the result somewhere.
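As a sketch of that last idea, even a simple memoized lookup in the application removes the repeated hits, assuming the data really is static (hypothetical countries table):

import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO countries (name) VALUES ('Finland')")

@lru_cache(maxsize=None)
def get_countries():
    # The first call hits the database; every later call returns the cached rows.
    return tuple(conn.execute("SELECT id, name FROM countries"))

get_countries()  # runs the query
get_countries()  # served from the cache, no database hit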

How to approach performance issues?

We are developing a client-server desktop application (WinForms with SQL Server 2008, using LINQ to SQL). We are now finding many issues related to performance. These relate to querying too much data with LINQ, bad database design, not much caching, etc. What do you suggest we should do - how should we go about solving these performance issues? One thing I am doing is SQL profiling and trying to fix some queries. As far as caching is concerned, we have static lists. But how do we keep them updated? We don't have any server-side implementation, so these lists can be stale if someone changes data.
regards
Performance analysis without tools is fruitless, with the wrong tools frustrating. SQL Profiler is the wrong tool to rely on for what you are looking at. I think it is at best giving you a hint of what is wrong.
You need to use a code profiler to determine why/when these queries are being executed. You should be able to find one by Googling and run it on an x-day trial.
The key questions are:
Are queries being run multiple times when there is no reason to at all? Is the data already in memory (even if not stored statically)? This happens a lot where data has already been retrieved, but because of some action in the code it gets loaded again. Class properties are a big culprit here.
Should certain data be stored statically across the application? How volatile is that data? Can you afford to show stale data?
The only way to decide on #2 is to have hard data on the cost of a particular transaction. For example, if I know it takes 1983 ms to create a new invoice, what will it be after I start caching data? Is that savings significant once the cache is in place? But recognize you can't answer that question until you know it takes 1983 ms to create an invoice.
When I profile an application transaction I focus on the big contributor and try to determine why it is so big. I look for individual methods that are slow and for any code that is executed frequently. It is often the latter, the death of a thousand cuts, that gets you.
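A minimal way to get those hard numbers is to time the transaction boundaries yourself before and after a change. A sketch in Python for brevity (in a WinForms app the same idea is a Stopwatch around the call); create_invoice is just a stand-in for your real transaction:

import time

def timed(label, fn, *args, **kwargs):
    # Wrap any transaction and report its wall-clock time.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return result

# invoice = timed("create invoice", create_invoice, customer_id=17)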
And I want to add this: it is also very important to know when to stop working on a performance issue.
I found Jeff Atwood's articles on this quite interesting:
Compiled Or Bust
All Abstractions Are Failed Abstractions
For updating, you can create a table; I called it ListVersions.
Just store list id, name and version.
When you do some changes to a list, just increment its version. In your application, you'll just need to compare version and update only if it has changed. Update lists that have version incremented, not all.
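A rough sketch of the client-side check, against SQLite in Python just to illustrate (load_rows stands in for whatever code already fetches the list):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ListVersions (list_id INTEGER PRIMARY KEY, name TEXT, version INTEGER)")

_cached_version = {}  # list_id -> version we last loaded
_cached_rows = {}     # list_id -> the cached list itself

def get_list(list_id, load_rows):
    row = conn.execute("SELECT version FROM ListVersions WHERE list_id = ?", (list_id,)).fetchone()
    current = row[0] if row else 0
    if _cached_version.get(list_id) != current:  # reload only lists whose version changed
        _cached_rows[list_id] = load_rows()
        _cached_version[list_id] = current
    return _cached_rows[list_id]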
I've described it in my answer to this question
What is the preferred method of refreshing a combo box when the data changes?
Good Luck!
A general recipe for performance issues:
Measure (wall clock time, CPU time, memory consumption etc.)
Design & implement an algorithm that you think could be faster than current code.
Measure again to assess the impact of your fix.
Many times the biggest bottlenecks aren't exactly where you thought they were. So, base your actions on measured data.
Try to keep the number of SQL queries small. You're more likely to get performance improvements by reducing the number of queries than by restructuring the SQL syntax of an individual query.
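For example, fetching related rows one item at a time (the classic N+1 pattern) versus a single query with an IN list; the orders table is made up for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
customer_ids = [1, 2, 3]

# N+1 style: one round trip per customer.
for cid in customer_ids:
    conn.execute("SELECT * FROM orders WHERE customer_id = ?", (cid,)).fetchall()

# One round trip for all of them.
placeholders = ",".join("?" for _ in customer_ids)
orders = conn.execute(
    "SELECT * FROM orders WHERE customer_id IN (%s)" % placeholders, customer_ids
).fetchall()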
I recommend adding some server-side logic instead of directly firing the SQL queries from the client. You could then implement caching shared by all clients on the server side.

What's the current state of ORMs?

Historically I've been completely against using ORMs for all but the most basic applications.
My reasoning has always been that it's a very leaky abstraction... mostly because SQL provides a very powerful way to retrieve data from a relational source, which usually gets messed up by the ORM, so that you lose a lot of performance to gain the appearance of not having a relational backend.
I've always thought the DATA should be kept in the database, not eat up application memory, which won't scale anyway. In addition, the performance hit of being too generic is harmful. For example, if I need the name and address of all the clients in my database, SQL provides me with an easy way to get it in one query. With an ORM I need to get all the clients and then each name and address; even if it's lazy loaded it's gonna take a LOT longer.
That's what I think, but has any of the above changed? I'm seeing a lot of ORMs like Entity Framework, NHibernate, etc., and they seem to have a lot of popularity lately... Are they worth it? Do they solve the problems I describe above?
Please read All Abstractions Are Failed Abstractions. It should put a lot of your questions in perspective.
Performance is usually not an issue with an ORM - and if you really find yourself in a situation where it is, there is always the option to handcraft the SQL statements the ORM uses.
IMHO ORMs give you an instant and huge development speed increase. That's why they are so popular. And using them right does not make you paint yourself into a corner. There is always the option of hand-tuning the performance.
Edit:
Even though Jeff focuses on Linq to SQL, what he says about abstractions and performance is equally true for NHibernate (which I know from years of real-world app development). IMHO one should use an ORM by default, since they are more than fast enough for the notorious 90% of situations. Code written for an ORM is usually more maintainable and readable, especially when it's picked up by the next developer who inherits it. Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live. Never forget about that guy!
In addition, they give you caching, lazy loading, unit of work... you name it, out of the box. And I found that when I was not happy about the performance of the ORM it was MY fault. ORMs force you to adhere to good OO design practices and help you shape your Domain Model.
On the Ruby on Rails side, ActiveRecord -- essentially an ORM -- is the basis of 95% of Rails applications (made-up statistic, but it's around there). Actually, to get to that 95% we would probably need to include other ORMs for Rails, like DataMapper.
The abstraction is leaky, and a developer can always dip down to SQL as necessary. Even when you're not using SQL directly, you have to think about the number of database hits, etc. For instance, in ActiveRecord, "eager loading" is used to avoid multiple database hits, so you see stuff like this (it includes the related "author" field of each Post in the initial query... it does a join under the hood, I think):
for post in Post.find(:all, :include => :author)
The point is that the abstraction leaks as do all abstractions, but that's not really the point. To decide whether to use the abstraction or not, you have to consider whether it will add to or reduce your general workload. In other words, will you spend more time retrofitting your concepts to make the abstraction work, or is it ready to do what you need without much hacking (saving you time)?
I think that the abstractions that work are those that are mature: ActiveRecord has been around the block a ton (as has Hibernate), so it provides an abstract way to patch most of the leaks you would normally be worried about, without explicitly rolling your own lower-level solution (i.e., without writing SQL).
Beyond the learning curve, I think that ORMs are an amazing time-saver for most of your database access, and that most apps actually do make quite "normal" use of the DB. While it may not be your case whatsoever, eschewing an ORM for direct DB access is often a case of early, and unnecessary, optimization.
Edit: I hadn't seen this, but the Jeff quote is
Does this abstraction make our code at least a little easier to write? To understand? To troubleshoot? Are we better off with this abstraction than we were without it?
saying essentially the same thing.
Some of the more modern ORMs are really powerful tools that solve a lot of real-world problems. The good ORMs don't try to hide the relational model from you, but actually leverage it to make OO programming more powerful. They really aren't abstractions in the sense that they let you ignore the "low-level" details of relational algebra; instead they are toolkits that let you build abstractions on the relational model, and make it easier to bring data into the imperative model, track the changes and push them back to the database. The SQL language really doesn't provide any good way to factor out common predicates into composable, reusable components to achieve business-rule level abstractions.
Sure there is a performance hit, but it's mostly a constant-factor thing, as you can make the ORM issue whatever SQL you would issue yourself. For your name-and-address example, in SQLAlchemy you'd just do
for name, address in session.query(Client.name, Client.address):
# process data
and you're done. But where the ORM helps you is when you have reusable relations and predicates. For instance, say you have defined a way to join to a client's favorited items, and a predicate to see if it is on sale. Then you can get the list of clients that have some of their favorite items on sale while also fetching the assigned salesperson with the following query:
potential_sales = (session.query(Client).join(Client.favorite_items)
.filter(Item.is_on_sale)
.options(eagerload(Client.assigned_salesperson)))
At least for me, the intent of the query is a lot faster to write, clearer and easier to understand when written like this, instead of as a dozen lines of SQL.
As with any abstraction, you'll have to pay either in performance or in leakiness. I agree with you in being against ORMs, since SQL is a clean and elegant language. I've sort of written my own little frameworks which do these things for me, but hey, then I sat there with my own ORM (though with a little more control over it than with, for example, Hibernate). The people behind Hibernate state that it is fast. It should be able to do about 95% of the boring work against your database (simple queries, updates etc.) but gives you the freedom to do the last 5% yourself if you want (you could always write your own mappings in special cases).
I think most of the popularity stems from that many programmers are lazy and want established frameworks to do the dirty boring persistence job for them (I can understand that), but the price of an abstraction will always be there. I would consider my options thoroughly before choosing to use an ORM in a serious project.

SQLite subqueries: in one big query or in a for loop?

I was planning to benchmark that but since it's a lot of work, I'd like to check if I didn't miss any obvious answer before.
I have a huge query that gets some more details for each row with a subquery.
Each row is then used in a ListAdapter that is plugged into a ListView, so another loop takes each row one by one to make it a ListItem.
What do you think is more efficient:
Keeping the subqueries in the SQL mess, counting on the SQL engine to make optimizations.
Taking the subqueries out into the ListAdapter loop, so we lazy-load the details on display: much more readable, but I'm afraid too many hits would slow down the process.
Two important things:
I can't rewrite the big SQL chunk to get rid of the subqueries. I know it would be better, but I failed to do so.
As far as I can tell, a list won't contain more than 1000 items, and it's a desktop app so there is no concurrency. Is it even relevant to care about perf in that case? If not, I'd still be interested in the answer for a high-traffic web site anyway. It's good to know...
SQlite is a surprisingly good little engine, but it's not really about extra clever optimizations, and I wouldn't really consider it for a "high traffic web site". One big plus (for uses within its limitations) is that it can run in-process, so that the overhead of multiple queries is really small compared to one big query; if that's easiest to code, for your specific use case, I would really consider it (and doing it in a "lazy load" way, as you hint, might actually make the first screen of data appear faster!). As you suspect, it's unlikely that this will be a performance bottleneck, in your use case, so going for simpler and thus more reliable coding is an important plus.
If I was doing a high-traffic site, and using a richer, "heavier" engine such as PostgreSQL, Oracle, SQL Server, or DB2, I would trust the optimizer much more. One thing I've noticed, however, is that I can often (alas, not always) change sub-queries into joins, and that often tends to improve performance (joins make it easier for the optimizer to use good indices, I think -- I have never coded a SQL optimizer myself, but that's my impression from staring at query execution plans from many engines for alternative forms of queries... that, of course, DOES assume you have good indices!-) -- this would have to be confirmed with a benchmark of the specific case in question, of course, but it would be my initial working assumption.
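To illustrate the kind of rewrite I mean (tables made up, and of course your real query may not factor this cleanly):

# Correlated subquery: conceptually runs once per row of the outer query.
subquery_version = """
SELECT q.id, q.title,
       (SELECT COUNT(*) FROM answers a WHERE a.question_id = q.id) AS answer_count
FROM questions q
"""

# Equivalent join + GROUP BY: usually easier for the optimizer to drive from an index.
join_version = """
SELECT q.id, q.title, COUNT(a.id) AS answer_count
FROM questions q
LEFT JOIN answers a ON a.question_id = q.id
GROUP BY q.id, q.title
"""

As always, only a benchmark on your data can confirm which form your engine actually prefers.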
What about using cursors?
I would prefer using a big query and letting the SQL engine optimize it.
Also I can't think of an example where it's better to do a loop outside SQL instead of using a "big" query or using cursors.
But the best way to know what's better is to benchmark it.
Good luck!

Resources