Data export performance issue - loops or joins? [closed]

I look after a system which uploads flat files generated by ABAP. Every day a large file (500,000 records) is generated from the SAP HR module, containing a record for every person for each day of the next year. A person gets a record for a given day if they are rostered on that day or have planned leave on it.
This job takes over 8 hours to run and it is becoming time critical. I am not an ABAP programmer, but I was concerned when discussing this with the programmers because they kept mentioning 'loops'.
Looking at the source, it's just a bunch of single-row selects inside nested loop after nested loop. Not only that, it has loads of SELECT *.
I suggested to the programmers that they use SQL more heavily, but they insist the SAP-approved way is to use loops instead of SQL and to use the provided SAP functions (e.g. to look up the work schedule rule), and that using SQL would be slower.
Being a database programmer I never use loops (cursors) because they are far slower than SQL, and cursors are usually a giveaway that a procedural programmer has been let loose on the database.
I just can't believe that changing an existing program to use SQL more heavily than loops will slow it down. Does anyone have any insight? I can provide more info if needed.
Looking at Google, I'm guessing I'll get people from both sides saying their approach is better.

I've read the question and I stopped when I read this:
"Looking at the source, it's just a bunch of single-row selects inside nested loop after nested loop. Not only that, it has loads of SELECT *."
Without knowing more about the issue, this looks like overkill, because every loop iteration executes a call to the database. Maybe it was written this way because the selected dataset is too big; however, you can load the data in chunks, process each chunk and repeat, or you can make one big JOIN and operate over that result set. This is a little tricky, but trust me, it does the job.
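As a minimal sketch of the "one big JOIN" idea in plain SQL (table and column names here are made up for illustration, not the real SAP tables), the per-person, per-day lookups collapse into a single set-oriented read:

    -- Instead of one SELECT per person per day inside nested loops,
    -- fetch the combined result set once and process it in memory.
    SELECT p.person_id,
           d.calendar_day,
           s.shift_code,
           a.absence_type
    FROM   persons p
    CROSS JOIN calendar_days d
    LEFT JOIN schedules s
           ON s.person_id = p.person_id AND s.calendar_day = d.calendar_day
    LEFT JOIN absences a
           ON a.person_id = p.person_id AND a.calendar_day = d.calendar_day
    WHERE  d.calendar_day BETWEEN CURRENT_DATE AND CURRENT_DATE + INTERVAL '1' YEAR
    AND    (s.person_id IS NOT NULL OR a.person_id IS NOT NULL);

Whether this beats well-tuned loops over internal tables depends on the indexes available, which is exactly the caveat below.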
In SAP you have to use these kinds of techniques when situations like this arise. Nothing is more efficient than handling datasets in memory; for that, I can recommend sorted and/or hashed internal tables and BINARY SEARCH.
On the other hand, using a JOIN does not necessarily improve performance; it depends on how well the indexes and foreign keys on the tables are understood and used. For example, if you join to a table just to get a description, I think it is better to load that data into an internal table and look the description up with a BINARY SEARCH.
I can't give you an exact formula; it depends on the case. Most of the time you have to tweak the code, debug, test, use transactions ST05 and SE30 to check performance, and repeat the process. Experience with these issues in SAP gives you a clear eye for these patterns.
My best advice for you is to make a copy of that program and correct it according to your experience. The code that you describe can definitely be improved. What can you lose?
Hope it helps

Sounds like the import as it stands is looping over single records and importing them into a DB one at a time. It's highly likely that there's a lot of redundancy there. It's a pattern I've seen many times and the general solution we've adopted is to import data in batches...
A SQL Server stored procedure can accept table-valued parameters, which on the client/C# side of the database connection are simply lists of a data structure corresponding to the table type.
A stored procedure can then receive and process multiple rows of your CSV file in one call, so any joins you need to do are done on sets of input data, which is how relational databases are designed to be used. This is especially beneficial if you're joining out to commonly used data or have lots of foreign keys (which essentially invoke a join in order to validate the keys you're trying to insert).
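A rough T-SQL sketch of that pattern; the type, procedure, table, and column names are invented for illustration:

    -- Table type describing one row of the import file
    CREATE TYPE dbo.ImportRow AS TABLE
    (
        employee_ref varchar(20) NOT NULL,
        work_date    date        NOT NULL,
        shift_code   varchar(10) NULL
    );
    GO

    -- The procedure receives a whole batch and inserts it set-based,
    -- so the lookup join runs once per batch instead of once per row.
    CREATE PROCEDURE dbo.ImportScheduleBatch
        @rows dbo.ImportRow READONLY
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.Schedule (EmployeeId, WorkDate, ShiftCode)
        SELECT e.EmployeeId, r.work_date, r.shift_code
        FROM   @rows r
        JOIN   dbo.Employee e ON e.EmployeeRef = r.employee_ref;
    END;
    GO

On the C#/ADO.NET side the parameter is passed as SqlDbType.Structured, e.g. as a DataTable or an IEnumerable of SqlDataRecord.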
We've found that the SQL Server CPU and IO load for a given amount of import data is much reduced by using this approach. It does however require consultation with DBAs and some tuning of indexes to get it to work well.

You are correct.
Without knowing the code, in most cases it is much faster to use views or joins instead of nested loops (there are exceptions, but they are very rare).
You can define views in SE11 or SE80, and they usually heavily reduce the communication overhead between the ABAP application server and the database server.
Often there are ready-made views from SAP for common cases.
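In generic SQL terms (in SAP you would define the equivalent as a dictionary view in SE11 rather than writing DDL by hand), the idea is simply to let the database do the join so one round trip replaces a select inside a loop; the names below are illustrative:

    CREATE VIEW v_person_schedule AS
    SELECT p.person_id,
           p.person_name,
           s.calendar_day,
           s.shift_code
    FROM   persons   p
    JOIN   schedules s ON s.person_id = p.person_id;

    -- One set-oriented read instead of a SELECT per loop iteration:
    SELECT * FROM v_person_schedule WHERE calendar_day = CURRENT_DATE;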
Edit:
You can check where your performance goes with the ABAP runtime trace (SE30): http://scn.sap.com/community/abap/testing-and-troubleshooting/blog/2007/11/13/the-abap-runtime-trace-se30--quick-and-easy
Badly written parts that are rarely executed don't matter.
With the statistics you know where it hurts and where your optimization effort pays off.

Related

Multiple small triggers vs single large trigger on table [duplicate]

Possible Duplicate:
Consolidate several Oracle triggers. Any performance impact?
Question:
Which is better for performance: multiple triggers (about 7-10), one for each situation/purpose, or one trigger that handles all the situations (using IF, etc.)?
Detail:
We're developing an enterprise application based on an Oracle database. We have one table with approximately 3M rows, which is the base table for our app, and there are several situations that we need to handle with triggers only. IMHO, for maintenance it's better to have multiple triggers. But what about performance?
In my case a single trigger was much quicker than multiple small triggers. Why, I don't know.
I can't see how multiple triggers would perform noticeably faster. The only trade-off is that they would not execute the conditional logic, saving very small fractions of a second.
This is really a design/maintenance question ... Let's say, you want to instrument the code to give you timing information: if you have one trigger, then you only need to go into one program, whereas with multiple triggers, you need to go to multiple places to instrument the code.
Also, consider using a CASE statement instead of IF (it's a lot cleaner and easier to read and maintain).
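A minimal Oracle sketch of the single-trigger approach with a CASE statement; the table, columns, and "situations" are hypothetical:

    CREATE OR REPLACE TRIGGER trg_orders_biu
    BEFORE INSERT OR UPDATE ON orders
    FOR EACH ROW
    BEGIN
      CASE
        WHEN INSERTING THEN
          :NEW.created_at := SYSDATE;          -- situation 1: stamp creation time
        WHEN UPDATING('STATUS') THEN
          :NEW.status_changed_at := SYSDATE;   -- situation 2: track status changes
        ELSE
          :NEW.updated_at := SYSDATE;          -- all other updates
      END CASE;
    END;
    /

Splitting the same logic across seven row-level triggers means Oracle fires seven triggers per affected row instead of one, which may explain why the single trigger was quicker in practice.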

Mixing column- and row-oriented databases?

I am currently trying to improve the performance of a web application. The goal of the application is to provide (real-time) analytics. We have a database model that is similar to a star schema: a few fact tables and many dimension tables. The database runs on MySQL with the MyISAM engine.
The fact tables can easily grow into the upper millions of rows, and some dimension tables can also reach the millions.
Now the point is, SELECT queries can get awfully slow when the dimension tables are joined to the fact tables and aggregations are performed. The first thing that comes to mind when hearing this is: why not precalculate the data? That is not possible, because users are allowed to apply several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly, it hasn't been invented yet, so I came up with the idea of combining two existing systems: a row-oriented and a column-oriented database (e.g. InfiniDB or Infobright). I would keep the MySQL/MyISAM solution (for fast inserts and row-based queries), add a column-oriented database (for fast aggregation over a few columns), and fill it periodically (nightly) via a cron job. The problem is when current data (it must be real time) is queried; I might then need to get data from both databases, which can complicate things.
First tests with InfiniDB showed really good performance on aggregations over a few columns, so I really think this could help me speed up the application.
So the question is: is this a good idea? Has somebody already done this? Maybe there are better ways to do it.
I have no experience with column-oriented databases yet, and I'm also not sure what their schema should look like. First tests showed good performance both with the same star-schema-like structure and with a single big, wide table.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with highly customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and quicklz-compressed, and all historical years column-oriented and bz2-compressed.
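Sketched very roughly, the DDL for such a mixed layout might look like the following; partition names, dates, and the exact storage options are illustrative and vary by Greenplum version:

    CREATE TABLE sales (
        sale_id   bigint,
        sale_date date,
        amount    numeric
    )
    DISTRIBUTED BY (sale_id)
    PARTITION BY RANGE (sale_date)
    (
        -- historical years: column-oriented, heavily compressed
        PARTITION hist START (date '2000-01-01') END (date '2011-01-01')
            WITH (appendonly = true, orientation = column,
                  compresstype = zlib, compresslevel = 9),
        -- prior year: column-oriented, lightly compressed
        PARTITION prev START (date '2011-01-01') END (date '2012-01-01')
            WITH (appendonly = true, orientation = column, compresstype = quicklz),
        -- most recent year: row-oriented and uncompressed for transactional load
        PARTITION cur  START (date '2012-01-01') END (date '2013-01-01')
    );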
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.

Allowing redundant data in the DB for the sake of performance

Let's say you're designing the DB schema for the next stack overflow and more specifically the part of the schema that handles question ratings.
I assume you'd use a table like:
ratings(question_id, user_id, rating)
... that will both record ratings and make sure no user votes twice on the same question.
That table alone could handle rating data but it might result in slow queries.
Taking performance into consideration, would you consider storing the sum of ratings for each question in the questions table, even though this data would be redundant since it's derivative from the data in the ratings table?
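For concreteness, a sketch of that normalized starting point (types and names are just illustrative):

    CREATE TABLE ratings (
        question_id int      NOT NULL,
        user_id     int      NOT NULL,
        rating      smallint NOT NULL,
        PRIMARY KEY (question_id, user_id)  -- one vote per user per question
    );

    -- The derived value in question, computed on the fly:
    SELECT question_id, SUM(rating) AS rating_sum
    FROM   ratings
    GROUP  BY question_id;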
I would generally first start with a normalized model, not de-normalizing the sum of ratings in the question table.
Then, when the application is working well enough, I would do some performance testing to determine whether the application handles the load well enough, compared to the load I expect to have in production.
If it doesn't handle the load well enough, I would check for bottlenecks and correct the most important ones until the application does well.
Once the application is in production, if the website has lots of users, it'll be time to make some additional optimizations.
To make things simple:
Don't over-optimize
Get your application working
Once it works, benchmark it
If / when needed, optimize
In the end, yes, maybe de-normalizing the sum of ratings into the questions table might help; but do you need to do it?
That is the real question ;-)
If you're planning to pre-aggregate tables it would be worth looking at materialised views (indexed views in T-SQL).
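For the schema in the question, a SQL Server sketch of such an indexed view (it needs SCHEMABINDING and a COUNT_BIG(*), and the rating column must be non-nullable for the SUM):

    CREATE VIEW dbo.QuestionRatingTotals
    WITH SCHEMABINDING
    AS
    SELECT question_id,
           SUM(rating)  AS rating_sum,
           COUNT_BIG(*) AS vote_count   -- required when the view uses aggregates
    FROM   dbo.ratings
    GROUP  BY question_id;
    GO

    -- The unique clustered index materializes the view; SQL Server then
    -- keeps the stored totals in sync with dbo.ratings automatically.
    CREATE UNIQUE CLUSTERED INDEX IX_QuestionRatingTotals
        ON dbo.QuestionRatingTotals (question_id);
    GO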
In general, it is a valid approach to store aggregate values if you know the data is read much more frequently than it is written.
In this specific case I would also consider designing the physical layout of the answers table in a way that makes aggregation cheap. To do so, I would define a clustered index on (query_id, answer_id).
As a result, only a few DB pages need to be read from disk to get all the answers for a specific query.
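In T-SQL that physical design is simply the following (keeping the answer's hypothetical query_id/answer_id names):

    -- Cluster the table so all answers for one query are stored contiguously;
    -- aggregating them then touches only a handful of pages.
    CREATE CLUSTERED INDEX IX_answers_by_query
        ON dbo.answers (query_id, answer_id);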

Pitfalls in prototype database design (for performance viability testing)

Following on from my previous question, I'm looking to run some performance tests on various potential schema representations of an object model. However, the catch is that while the model is conceptually complete, it's not actually finalised yet - and so the exact number of tables, and numbers/types of attributes in each table aren't definite.
From my (possibly naive) perspective it seems like it should be possible to put together a representative prototype model for each approach, and test the performance of each of these to determine which is the fastest approach for each case.
And that's where the question comes in. I'm aware that the performance characteristics of databases can be very non-intuitive, such that a small (even "trivial") change can lead to an order of magnitude difference. Thus I'm wondering what common pitfalls there might be when setting up a dummy table structure and populating it with dummy data. Since the environment is likely to make a massive difference here, the target is Oracle 10.2.0.3.0 running on RHEL 3.
(In particular, I'm looking for examples such as "make sure that one of your tables has a much more selective index than the other"; "make sure you have more than x rows/columns because below this you won't hit page faults and the performance will be different"; "ensure you test with the DATETIME datatype if you're going to use it because it will change the query plan greatly", and so on. I tried Google, expecting there would be lots of pages/blog posts on best practices in this area, but couldn't find the trees for the wood (lots of pages about tuning performance of an existing DB instead).)
As a note, I'm willing to accept an answer along the lines of "it's not feasible to perform a test like this with any degree of confidence in the transitivity of the result", if that is indeed the case.
There are a few things that you can do to position yourself to meet performance objectives. I think they happen in this order:
be aware of architectures, best practices and patterns
be aware of how the database works
spot-test performance to get additional precision or determine impact of wacky design areas
More on each:
Architectures, best practices and patterns: one of the most common reasons for reporting databases to fail to perform is that those who build them are completely unfamiliar with the reporting domain. They may be experts on the transactional database domain - but the techniques from that domain do not translate to the warehouse/reporting domain. So, you need to know your domain well - and if you do you'll be able to quickly identify an appropriate approach that will work almost always - and that you can tweak from there.
How the database works: you need to understand in general what options the optimizer/planner has for your queries. What's the impact on different statements of adding indexes? What's the impact of indexing a 256-byte varchar? Will reporting queries even use your indexes? Etc.
Now that you've got the right approach, and generally understand how 90% of your model will perform, you're often done forecasting performance for most small to medium size databases. If you've got a huge database, if there's a ton at stake and you have to get more precise (you might need to order more hardware), or if there are a few wacky spots in the design, then focus your tests on just those. Generate reasonable test data and, importantly, the statistics you'd see in production, and look at what the database will do with that data. Unless you've got real data and real prod-sized servers you'll still have to extrapolate, but you should at least be able to get reasonably close.
Running performance tests against various putative implementations of a conceptual model is not naive so much as heroically forward-thinking. Alas, I suspect it will be a waste of your time.
Let's take one example: data. Presumably you are intending to generate random data to populate your tables. That might give you some feeling for how well a query might perform with large volumes. But often performance problems are a product of skew in the data; a random set of data will give you an averaged distribution of values.
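For example, on the Oracle target mentioned in the question you could deliberately generate skewed rather than uniform test data along these lines (table, columns, and proportions are made up):

    -- 95% of rows share one status value and the rest are rare,
    -- a shape that uniform random data would never reproduce.
    INSERT INTO orders_test (order_id, status)
    SELECT LEVEL,
           CASE
             WHEN MOD(LEVEL, 100) < 95 THEN 'COMPLETE'
             WHEN MOD(LEVEL, 100) < 99 THEN 'OPEN'
             ELSE 'CANCELLED'
           END
    FROM   dual
    CONNECT BY LEVEL <= 1000000;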
Another example: code. Most performance problems are due to badly written SQL, especially inappropriate joins. You might be able to apply an index to tune an individual SELECT * FROM my_table WHERE blah, but that isn't going to help you forestall badly written queries.
The truism about premature optimization applies to databases as well as algorithms. The most important thing is to get the data model complete and correct. If you manage that you are already ahead of the game.
edit
Having read the question which you linked to I more clearly understand where you are coming from. I have a little experience of this Hibernate mapping problem from the database designer perspective. Taking the example you give at the end of the page ...
Animal > Vertebrate > Mammal > Carnivore > Canine > Dog type hierarchy,
... the key thing is to instantiate objects as far down the chain as possible. Instantiating a collection of Animals will perform much more slowly than instantiating separate collections of Dogs, Cats, etc. (presuming you have tables for all or some of those sub-types).
This is more of an application design issue than a database one. What will make a difference is whether you only build tables at the concrete level (CATS, DOGS) or whether you replicate the hierarchy in tables (ANIMALS, VERTEBRATES, etc). Unfortunately there are no simple answers here. For instance, you have to consider not just the performance of data retrieval but also how Hibernate will handle inserts and updates: a design which performs well for queries might be a real nightmare when it comes to persisting data. Also relational integrity has an impact: if you have some entity which applies to all Mammals, it is comforting to be able to enforce a foreign key against a MAMMALS table.
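A bare-bones sketch of the two extremes in generic DDL (names are illustrative only):

    -- Option 1: tables only at the concrete level
    CREATE TABLE dogs (
        dog_id int PRIMARY KEY,
        name   varchar(100),
        breed  varchar(100)
    );
    CREATE TABLE cats (
        cat_id int PRIMARY KEY,
        name   varchar(100),
        indoor char(1)
    );

    -- Option 2: the hierarchy replicated in tables, one row per level
    CREATE TABLE animals (
        animal_id int PRIMARY KEY,
        name      varchar(100)
    );
    CREATE TABLE mammals (
        animal_id int PRIMARY KEY REFERENCES animals,
        gestation_days int
    );
    CREATE TABLE canines (
        animal_id int PRIMARY KEY REFERENCES mammals,
        breed     varchar(100)
    );

Option 2 lets an entity that applies to all Mammals carry a foreign key to MAMMALS, but loading one Dog now means joining up the chain, which is the trade-off described above.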
Performance problems with databases do not scale linearly with data volume. A database with a million rows in it might show one hotspot, while a similar database with a billion rows in it might reveal an entirely different hotspot. Beware of tests conducted with sample data.
You need good sound database design practices in order to keep your design simple and sound. Worry about whether your database meets the data requirements, and whether your model is relevant, complete, correct and relational (provided you're building a relational database) before you even start worrying about speed.
Then, once you've got something that's simple, sound, and correct, start worrying about speed. You'd be amazed at how much you can speed things up by just tweaking the physical features of your database, without changing any app code. To do this, you need to learn a lot about your particular DBMS.
They never said database development would be easy. They just said it would be this much fun!

One complex query vs Multiple simple queries

What is actually better: classes with complex queries responsible for loading, for instance, nested objects, or classes with simple queries responsible for loading simple objects?
With complex queries you make fewer trips to the database, but each class carries more responsibility.
With simple queries you need to go to the database more often, but each class is responsible for loading only one type of object.
The situation I'm in is that the loaded objects will be sent to a Flex application (as DTOs).
The general rule of thumb here is that server round trips are expensive (relative to how long a typical query takes), so the guiding principle is to minimize them. Basically, each one-to-many join will potentially multiply your result set, so the way I approach this is to keep joining until the result set gets too large or the query execution time gets too long (roughly 1-5 seconds in general).
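To make the multiplication effect concrete, here is a small sketch with made-up order, line, and shipment tables:

    -- One round trip: each order row repeats once per (line x shipment)
    -- combination, so 10 lines and 5 shipments yield 50 rows for that order.
    SELECT o.order_id, l.line_no, l.sku, s.shipment_no
    FROM   orders o
    JOIN   order_lines l ON l.order_id = o.order_id
    JOIN   shipments   s ON s.order_id = o.order_id
    WHERE  o.order_id = 42;

    -- Two round trips: each child set keeps its natural size,
    -- at the cost of an extra trip and reassembly in application code.
    SELECT line_no, sku   FROM order_lines WHERE order_id = 42;
    SELECT shipment_no    FROM shipments   WHERE order_id = 42;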
Depending on your platform you may or may not be able to execute queries in parallel. This is a key determinant in what you should do because if you can only execute one query at a time the barrier to breaking up a query is that much higher.
Sometimes it's worth keeping certain relatively constant data in memory (country information, for example) or doing them as a separately query but this is, in my experience, reasonably unusual.
Far more common is having to fix up systems with awful performance due in large part to doing separate queries (particularly correlated queries) instead of joins.
I don't think either option is inherently better. It depends on your application specifics, architecture, the DBMS used, and other factors.
For example, we used multiple simple queries in our standalone solution. But when we evolved our product towards a lightweight internet-accessible solution, we discovered that our framework made a huge number of requests, and that killed performance because of network latency. So we substantially reworked our framework to use aggregated complex queries. Meanwhile, we still maintained our standalone solution and moved from Oracle Lite to Apache Derby, and once more we found that some of our new complex queries had to be simplified because Derby took too long to execute them.
So look at your real problem and solve it appropriately. I think simple queries are a good starting point if there are no strong objections against them.
From a gut feeling I would say:
Go the simple way as long as there is no proven reason to optimize for performance; otherwise I would put the "complex objects and queries" approach in the premature-optimization basket.
If you find that there are real performance implications, then as a next step you should optimize the round-tripping between Flex and your backend. But as I said before, this is a gut feeling: you really should start out with a definition of "performant", start simple, and measure the performance.
