Pitfalls in prototype database design (for performance viability testing) - performance

Following on from my previous question, I'm looking to run some performance tests on various potential schema representations of an object model. However, the catch is that while the model is conceptually complete, it's not actually finalised yet - and so the exact number of tables, and numbers/types of attributes in each table aren't definite.
From my (possibly naive) perspective it seems like it should be possible to put together a representative prototype model for each approach, and test the performance of each of these to determine which is the fastest approach for each case.
And that's where the question comes in. I'm aware that the performance characteristics of databases can be very non-intuitive, such that a small (even "trivial") change can lead to an order of magnitude difference. Thus I'm wondering what common pitfalls there might be when setting up a dummy table structure and populating it with dummy data. Since the environment is likely to make a massive difference here, the target is Oracle 10.2.0.3.0 running on RHEL 3.
(In particular, I'm looking for examples such as "make sure that one of your tables has a much more selective index than the other"; "make sure you have more than x rows/columns because below this you won't hit page faults and the performance will be different"; "ensure you test with the DATETIME datatype if you're going to use it because it will change the query plan greatly", and so on. I tried Google, expecting there would be lots of pages/blog posts on best practices in this area, but couldn't find the trees for the wood (lots of pages about tuning performance of an existing DB instead).)
As a note, I'm willing to accept an answer along the lines of "it's not feasible to perform a test like this with any degree of confidence in the transitivity of the result", if that is indeed the case.

There are a few things that you can do to position yourself to meet performance objectives. I think they happen in this order:
be aware of architectures, best practices and patterns
be aware of how the database works
spot-test performance to get additional precision or determine impact of wacky design areas
More on each:
Architectures, best practices and patterns: one of the most common reasons for reporting databases to fail to perform is that those who build them are completely unfamiliar with the reporting domain. They may be experts on the transactional database domain - but the techniques from that domain do not translate to the warehouse/reporting domain. So, you need to know your domain well - and if you do you'll be able to quickly identify an appropriate approach that will work almost always - and that you can tweak from there.
How the database works: you need to understand in general what options the optimizer/planner has for your queries. What's the impact to different statements of adding indexes? What's the impact of indexing a 256 byte varchar? Will reporting queries even use your indexes? etc
Now that you've got the right approach, and generally understand how 90% of your model will perform - you're often done forecasting performance with most small to medium size databases. If you've got a huge one, there's a ton at stake, you've got to get more precise (might need to order more hardware), or have a few wacky spots in the design - then focus your tests on just this. Generate reasonable test data - and (important) stats that you'd see in production. And look to see what the database will do with that data. Unless you've got real data and real prod-sized servers you'll still have to extrapolate - but you should at least be able to get reasonably close.

Running performance tests against various putative implementation of a conceptual model is not naive so much as heroically forward thinking. Alas I suspect it will be a waste of your time.
Let's take one example: data. Presumably you are intending to generate random data to populate your tables. That might give you some feeling for how well a query might perform with large volumes. But often performance problems are a product of skew in the data; a random set of data will give you an averaged distribution of values.
Another example: code. Most performance problems are due to badly written SQL, especially inappropriate joins. You might be able to apply an index to tune an individual for SELECT * FROM my_table WHERE blah but that isn't going to help you forestall badly written queries.
The truism about premature optimization applies to databases as well as algorithms. The most important thing is to get the data model complete and correct. If you manage that you are already ahead of the game.
edit
Having read the question which you linked to I more clearly understand where you are coming from. I have a little experience of this Hibernate mapping problem from the database designer perspective. Taking the example you give at the end of the page ...
Animal > Vertebrate > Mammal > Carnivore > Canine > Dog type hierarchy,
... the key thing is to instantiate objects as far down the chain as possible. Instantiating a column of Animals will perform much slower than instantiating separate collections of Dogs, Cats, etc. (presuming you have tables for all or some of those sub-types).
This is more of an application design issue than a database one. What will make a difference is whether you only build tables at the concrete level (CATS, DOGS) or whether you replicate the hierarchy in tables (ANIMALS, VERTEBRATES, etc). Unfortunately there are no simple answers here. For instance, you have to consider not just the performance of data retrieval but also how Hibernate will handle inserts and updates: a design which performs well for queries might be a real nightmare when it comes to persisting data. Also relational integrity has an impact: if you have some entity which applies to all Mammals, it is comforting to be able to enforce a foreign key against a MAMMALS table.

Performance problems with databases do not scale linearly with data volume. A database with a million rows in it might show one hotspot, while a similar database with a billion rows in it might reveal an entirely different hotspot. Beware of tests conducted with sample data.
You need good sound database design practices in order to keep your design simple and sound. Worry about whether your database meets the data requirements, and whether your model is relevant, complete, correct and relational (provided you're building a relational database) before you even start worrying about speed.
Then, once you've got something that's simple, sound, and correct, start worrying about speed. You'd be amazed at how much you can speed things up by just tweaking the physical features of your database, without changing any app code. To do this, you need to learn a lot about your particular DBMS.
They never said database development would be easy. They just said it would be this much fun!

Related

What algorithms could i implement to improve the general design and performance of a database?

I'm working on a project for university. Do you guys know what kind of algorithms i could implement that would help with the proper design and general performance of a database? Until now i came up with an algorithm that can help the user pick candidate keys and also an algorithm for normalization up to 3NF. Do you have any other ideas or suggestions? Thanks.
This is like asking how you can figure out how to make a car be more efficient. It's such a broad question that it's essentially unanswerable. There are so many moving parts to a car, and each one has its own problems. You really need to understand what each component is doing. In the case of databases, you need to understand the data before you try and fix it. And if you want a good answer, you have to ask the right questions.
A good question should include context on what you are working with, and what you are trying to do. And when it comes to data manipulation, the details are extremely important. How is your data represented? What kind of infrastructure are you working with? What purpose does the data serve, and what processes use this data? If you are working with floating point numbers, are your processes tolerant of small rounding errors? Would your organization even let you make changes to how the data is stored?
In general, adding algorithms to improve data performance is probably largely unnecessary. Databases are designed out of the box to be simple and efficient. If there were a known method to increase efficiency in general without any drawbacks, there's no reason why the designers of the system wouldn't have implemented it already.
I am just putting an answer because I have no way to tell this in the comment section. You need to understand a basic principle in database design and data model construction. What your database is for ? That is the main question, and believe it or not, sometimes people with experience make the same mistake.
As you were saying, 3NF could be good for OLTP systems, but it would be horrendous for Data Warehouse or Reporting Databases where the queries are huge and they work on big batch operations. In those systems denormalization offers always better results.
Once you know what you're database is for, then you can start to apply some "Best Practices" , but even here there is a lot of room for interpretation, and even worse, same principles could be good in one place but very bad in another. I am just going to provide you an example of my real experience
8 years ago I started a project and we have to design a database for a financial application. After some analysis, we decided to use a start model, or dimension-fact model. We decided to create indexes ( including bitmap ) for some tables, even though we were rebuild them during batch to avoid performance degradation.
Funny thing is that after some months, I realised that the indexes were useless, as the users were running queries that were accessing the whole data, mostly analytics and aggregation. Consequence: I drop all indexes.
Is it a good thing to do ? No, it is not, but in my scenario it was the best thing and the performance increased a lot, both in batch and also in user experience.
Summary, like an old friend of mine that that was working in Oracle Support used to tell me: "Performance is an art my friend, not a science"
There are too many database algorithms to list, but below is a structured way of thinking about classes of algorithms that affect database performance.
Algorithm analysis is a helpful way of categorizing and thinking about many database performance problems. While most performance problems are solved with best practices and trial-and-error, we'll never truly understand why one solution is better than another without understanding the algorithms behind them. Below is a list of functions that describe the algorithmic complexity of different database operations, ordered from fastest to slowest.
O(1/N) – Batching to reduce overhead for bulk collect, sequences, fetching rows
O(1) - Hashing for hash partitioning, hash clusters, hash joins
O(LOG(N)) – Index access for b-trees
1/((1-P)+P/N) – Amdahl's Law and its implications for parallelizing large data warehouse workloads
O(N) - Full table scans, hash joins (in theory)
O(N*LOG(N)) – Full table scan versus repeated index reads, sorting, global versus local indexes, gathering statistics (distinct approximations and partition birthday problems)
O(N^2) – Cross joins, nested loops, parsing
O(N!) – Join order
O(∞) – The optimizer (satisficing and avoiding the halting problem)
One suggestion - based on the way you phrased your questions and comments, you're thinking of a database as merely a place to store data. But the most interesting parts of a database happen when you think of them as joining machines. There's not much to optimize about data sitting around, the real work happens when data is combined.
The above list is based on Chapter 16 of my book, Pro Oracle SQL Development. You can read an early version of the entire chapter for free here. While the chapter mostly stands alone, it requires an advanced understanding of Oracle. But each of the topics could be the basis for a lifetime of academic study, so you only need to pick one.

Mixing column and row oriented databases?

I am currently trying to improve the performance of a web application. The goal of the application is to provide (real time) analytics. We have a database model that is similiar to a star schema, few fact tables and many dimensional tables. The database is running with Mysql and MyIsam engine.
The Fact table size can easily go into the upper millions and some dimension tables can also reach the millions.
Now the point is, select queries can get awfully slow if the dimension tables get joined on the fact tables and also aggretations are done. First thing that comes in mind when hearing this is, why not precalculate the data? This is not possible because the users are allowed to use several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly it wasn't invented yet. So I came to the idea to combine 2 existing systems. Mixing a row oriented and a column oriented database (e.g. like infinidb or infobright). Keeping the mysql MyIsam solution (for fast inserts and row based queries) and add a column oriented database (for fast aggregation operations on few columns) to it and fill it periodically (nightly) via cronjob. Problem would be when the current data (it must be real time) is queried, therefore I maybe would need to get data from both databases which can complicate things.
First tests with infinidb showed really good performance on aggregation of a few columns, so I really think this could help me speed up the application.
So the question is, is this a good idea? Has somebody maybe already done this? Maybe there is are better ways to do it.
I have no experience in column oriented databases yet and I'm also not sure how the schema of it should look like. First tests showed good performance on the same star schema like structure but also in a big table like structure.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with high customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and quicklz-compresed, and all historical years column-oriented and bz2-compressed.
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.

what is a good schema design for a loan origination system?

I am designing a loan origination system which would allow it's users to create loans, draw repayment schedule of the loan depending on the loan product parameters. I should also be able to add penalty, fees etc. Rescheduling loan should be possibility. I also need a loan schedule to do fast reporting.
I have a loans table, loan product table, payment schedule table and loan history table etc. I am not able to understand how I can design ahead this schema to keep it from changing too much.
I am doing this in ruby, rails3 and datamapper.
Except in the most tightly specified applications, I'm not sure you can design a schema that won't change much. What you can do is to make schemas that are not brittle, schemas that allow for change to happen. For the most part, that means.
Include only the data you know you need to meet today's requirements
Normalize.
Write automated tests.
The first rule is akin to "do the simplest thing possible," or "you ain't gonna need it," the rule that programmers use to avoid code bloat. Smaller schemas, like smaller code bases, need less effort to change. The second (normalize) is analagous to the Don't Repeat Yourself (DRY) principle, also known as "once and only once," another rule used to make code cheaper to change. The third rule (tests) is how programmers make refactoring possible without worrying about breaking everything. By tests, I mean testing the code that uses the schema, but also testing the schema itself: triggers, rules, cascading deletes, &c. can be tested, and when tested, it is easier to change them in response to changing requirements.
There are excuses, in the database world, for breaking these rules. The reason to break rule 1 (do the simplest thing/YAGNI) is that some data will be easier to collect from the beginning, and difficult or perhaps even impossible to collect if you decide you do need it later. Still, think twice before giving in to this excuse. You can almost always deal without too much fuss with gaps in the data caused by adding columns or tables later, but if you include today data you might only need tomorrow, you will be paying for it every time you change the schema. Each bit of data you include that you end up not needing was nothing but cost with no benefit. Perhaps more significantly, extra data can have a terrible effect upon performance, since it reduces the number of records that can fit in memory. Even though databases go through great pains to give good performance when reading from disk, their best performance comes from having enough memory (or little enough data) so that all or most of the working set will fit in RAM.
The excuse for breaking rule 2 (normalization) is performance: "Data warehouse" applications sometimes require denormalization in cases where many-table joins make a database slow and cranky. I'd want to be certain it was needed before denormalizing, since it's not free: data that exists in more than once place makes the schema more difficult to change, and trades off speed of queries for more work when inserting & updating.
I don't know of an excuse for breaking rule 3 (testing), or at least not a good one, but that doesn't mean there isn't one.
Martin Fowler writes "Evolutionary Database Design". Scott Amber and Pramod Sadalage have a book on Refactoring Databases. See also a summary/cheat sheet of the book's refactorings.

Can 'moving business logic to application layer' increase performance?

In my current project, the business logic is implemented in stored procedures (a 1000+ of them) and now they want to scale it up as the business is growing. Architects have decided to move the business logic to application layer (.net) to boost performance and scalability. But they are not redesigning/rewriting anything. In short the same SQL queries which are fired from an SP will be fired from a .net function using ADO.Net. How can this yield any performance?
To the best of my understanding, we need to move business logic to application layer when we need DB independence or there is some business logic that can be better implemented in a OOP language than an RDBMS engine (like traversing a hierarchy or some image processing, etc..). In rest of the cases, if there is no complicated business logic to implement, I believe that it is better to keep the business logic in DB itself, at least the network delays between application layer and DB can be avoided this way.
Please let me know your views. I am a developer looking at some architecture decisions with a little hesitation, pardon my ignorance in the subject.
If your business logic is still in SQL statements, the database will be doing as much work as before, and you will not get better performance. (may be more work if it is not able to cache query plans as effectivily as when stored procedures were used)
To get better performance you need to move some work to the application layer, can you for example cache data on the application server, and do a lookup or a validation check without hitting the database?
Architectural arguments such as these often need to consider many trades-off, considering performance in isolation, or ideed considering only one aspect of performance such as response time tends to miss the larger picture.
There clearly some trade off between executing logic in the database layer and shipping the data back to the applciation layer and processing it there. Data-ship costs versus processing costs. As you indicate the cost and complexity of the business logic will be a significant factor, the size of the data to be shipped would be another.
It is conceivable, if the DB layer is getting busy, that offloading processing to another layer may allow greater overall throughput even if the individual responses time are increased. We could then scale the App tier in order to deal with some extra load. Would you now say that performance has been improved (greater overall throughput) or worsened (soem increase in response time).
Now consider whether the app tier might implement interesting caching strategies. Perhaps we get a very large performance win - no load on the DB at all for some requests!
I think those decisions should not be justified using architectural dogma. Data would make a great deal more sense.
Statements like "All business logic belongs in stored procedures" or "Everything should be on the middle tier" tend to be made by people whose knowledge is restricted to databases or objects, respectively. Better to combine both when you judge, and do it on the basis of measurements.
For example, if one of your procedures is crunching a lot of data and returning a handful of results, there's an argument that says it should remain on the database. There's little sense in bringing millions of rows into memory on the middle tier, crunching them, and then updating the database with another round trip.
Another consideration is whether or not the database is shared between apps. If so, the logic should stay in the database so all can use it.
Middle tiers tend to come and go, but data remains forever.
I'm an object guy myself, but I would tread lightly.
It's a complicated problem. I don't think that black and white statements will work in every case.
Well as others have already said, it depends on many factors. But from you question it seems the architects are proposing moving the stored procedures from inside DB to dynamic SQL inside the application. That sounds very dubious to me.
SQL is a set oriented language and business logic that requires massaging of large amount of data records would be better in SQL. Think complicated search and reporting type function. On the other hand line item edits with corresponding business rule validation is much better being done in a programming language. Caching of slow changing data in app tier is another advantage. This is even better if you have dedicated middle tier service that acts as a gateway to all the data. If data is shared directly among disparate applications then stored proc may be a good idea.
You also have to factor the availability/experience of SQL talent vs programming talent in the organisation.
There is realy no general answer to this question.

One complex query vs Multiple simple queries

What is actually better? Having classes with complex queries responsible to load for instance nested objects? Or classes with simple queries responsible to load simple objects?
With complex queries you have to go less to database but the class will have more responsibility.
Or simple queries where you will need to go more to database. In this case however each class will be responsible for loading one type of object.
The situation I'm in is that loaded objects will be sent to a Flex application (DTO's).
The general rule of thumb here is that server roundtrips are expensive (relative to how long a typical query takes) so the guiding principle is that you want to minimize them. Basically each one-to-many join will potentially multiply your result set so the way I approach this is to keep joining until the result set gets too large or the query execution time gets too long (roughly 1-5 seconds generally).
Depending on your platform you may or may not be able to execute queries in parallel. This is a key determinant in what you should do because if you can only execute one query at a time the barrier to breaking up a query is that much higher.
Sometimes it's worth keeping certain relatively constant data in memory (country information, for example) or doing them as a separately query but this is, in my experience, reasonably unusual.
Far more common is having to fix up systems with awful performance due in large part to doing separate queries (particularly correlated queries) instead of joins.
I don't think that any option is actually better. It depends on your application specific, architecture, used DBMS and other factors.
E.g. we used multiple simple queries with in our standalone solution. But when we evolved our product towards lightweight internet-accessible solution we discovered that our framework made huge number of request and that killed performance cause of network latency. So we sufficiently reworked our framework for using aggregated complex queries. Meanwhile, we still maintained our stand-alone solution and moved from Oracle Light to Apache Derby. And once more we found that some of our new complex queries should be simplified as Derby performed them too long.
So look at your real problem and solve it appropriately. I think that simple queries are good for beginning if there are no strong objectives against them.
From a gut feeling I would say:
Go with the simple way as long as there is no proven reason to optimize for performance. Otherwise I would put the "complex objects and query" approach in the basket of premature optimization.
If you find that there are real performance implications then you should in the next step optimize the roundtripping between flex and your backend. But as I said before: This is a gut feeling, you really should start out with a definition of "performant", start simple and measure the performance.

Resources