I have a large query that joins around 20 tables (mostly outer joins). It uses the older comma-separated join syntax, with (+) in the WHERE conditions for the outer joins.
We noticed that it consumes a lot of server memory. We are trying several things, one of which is converting the query to the newer ANSI syntax, since the ANSI syntax gives better control over the order of the JOINs and states each JOIN predicate explicitly where it is applied.
Does converting the query from an older syntax to the newer ANSI syntax help in reducing the amount of data processed, for such large queries spanning a good number of tables?
In my experience, it does not - it generates identical execution plans. That said, the newer JOIN syntax does allow you to do things that you can't do with the old syntax. I would recommend converting it for that reason, and for clarity. The ANSI syntax is just so much easier to read (at least for me). Once converted, you can then compare the execution plans.
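For illustration, here is the same two-table outer join written in both styles (the table and column names are invented for the example):

```sql
-- Old Oracle-specific style: tables are comma-separated and the join
-- predicate lives in the WHERE clause, with (+) marking the optional side.
SELECT e.emp_id, e.emp_name, d.dept_name
FROM   employees e, departments d
WHERE  e.dept_id = d.dept_id (+)
AND    e.status  = 'ACTIVE';

-- Equivalent ANSI style: the join predicate is attached to the JOIN itself,
-- and only the filter stays in the WHERE clause.
SELECT e.emp_id, e.emp_name, d.dept_name
FROM   employees e
       LEFT OUTER JOIN departments d
              ON d.dept_id = e.dept_id
WHERE  e.status = 'ACTIVE';
```

Either way, the optimizer is still free to reorder the joins, which is why the plans usually come out identical.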
DCookie said all there is to say about ANSI syntax.
However, if you outer join 20 tables, it is no wonder the query consumes a lot of server memory. Breaking the query into smaller subqueries might improve things: right now all the tables are read into memory, joined in memory, then filtered, and only then are the columns you need selected.
Reversing that order - filtering and projecting early, then joining - will at least save memory, although it won't necessarily improve execution speed.
As DCookie mentioned, both versions should produce identical execution plans. I would start by looking at the current query's execution plan and figuring out what is actually taking up the memory. A quick look at DBMS_XPLAN.DISPLAY_CURSOR output should be a good start. Once you know exactly what part of the query you are trying to improve, then you can analyze if switching to ANSI style joins will do anything to help you reach your end goal.
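For example, immediately after running the statement in the same session you can pull its actual plan and per-step statistics (assuming statistics collection is enabled, e.g. with the GATHER_PLAN_STATISTICS hint or STATISTICS_LEVEL=ALL):

```sql
-- Show the plan for the last statement executed in this session, including
-- actual rows, buffers, and memory used per plan step.
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));
```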
I have a Hive efficiency question. I have two massive queries that need to be filtered, joined with mapping tables, and unioned. All the joins are identical for both queries. Would it be more efficient to union them first and then apply the joins to the combined dataset, or to apply the joins to each massive query individually and then union the results? Does it make a difference?
I tried the second way and the query ran for 24 hours before I killed it. I feel like I did everything I could to optimize it, except potentially rearranging the union statement. On the one hand, I feel like it should not matter, because the number of rows being joined with the mapping table is the same, and since everything is parallelized it should take roughly the same amount of time. On the other hand, maybe doing the union first would guarantee that the two big queries get full system resources before the joins run. Then again, that might mean there are only two jobs running at a time, so the system would not be fully utilized.
I simply do not know enough about how Hive and its parallelism work. Does anybody have any ideas?
There is no such best practice; both approaches are applicable. Subqueries in a UNION ALL run as parallel jobs, so joining before the union works as parallel tasks on smaller datasets; Tez can optimize the execution, and commonly joined tables are read only once in a single mapper stage for each table.
You can also avoid the join entirely for some subqueries, for example if their keys are not applicable to the join.
Joining the unioned, bigger dataset may also run with very high parallelism depending on your settings (bytes per reducer, for example), and the optimizer may rewrite the query plan. So I suggest you check both methods, measure the speed, study the plan, and see whether you can change something. Change, measure, study the plan... repeat.
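As a rough sketch of the two shapes being compared (table and column names here are hypothetical):

```sql
-- Option 1: join each big query to the mapping table first, then union.
SELECT a.id, m.label, a.val
FROM   big_table_a a
JOIN   mapping m ON m.id = a.id
WHERE  a.dt = '2023-01-01'
UNION ALL
SELECT b.id, m.label, b.val
FROM   big_table_b b
JOIN   mapping m ON m.id = b.id
WHERE  b.dt = '2023-01-01';

-- Option 2: union the filtered big queries first, then apply the join once.
SELECT u.id, m.label, u.val
FROM (
  SELECT id, val FROM big_table_a WHERE dt = '2023-01-01'
  UNION ALL
  SELECT id, val FROM big_table_b WHERE dt = '2023-01-01'
) u
JOIN mapping m ON m.id = u.id;
```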
A few more suggestions:
Try to limit the datasets before joining them. If your join multiplies rows, then analytics and aggregation will run more slowly on the bigger dataset, and the first approach may be preferable if you can apply the analytics/aggregation before the union.
I look after a system which uploads flat files generated by ABAP. Every day a large file (500,000 records) is generated from the HR module in SAP, containing a record for every person for the next year: a person gets a record for a given day if they are rostered on that day or have planned leave on that day.
This job takes over 8 hours to run and it is starting to get time-critical. I am not an ABAP programmer, but I was concerned when discussing this with the programmers as they kept mentioning 'loops'.
Looking at the source, it's just a bunch of single-row selects inside nested loop after nested loop. Not only that, it has loads of SELECT *.
I suggested to the programmers that they use SQL more heavily, but they insist the SAP-approved way is to use loops instead of SQL and to use the provided SAP functions (e.g. to look up the work schedule rule), and that using SQL would be slower.
Being a database programmer I never use loops (cursors) because they are far slower than SQL, and cursors are usually a giveaway that a procedural programmer has been let loose on the database.
I just can't believe that changing an existing program to use SQL more heavily than loops will slow it down. Does anyone have any insight? I can provide more info if needed.
Looking at Google, I'm guessing I'll get people from both sides saying their way is better.
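Purely to illustrate what "using SQL more heavily" means here, the nested single-row lookups would collapse into something like the following set-based query (all table names are made up; in ABAP this would be expressed in Open SQL or through a dictionary view rather than raw SQL):

```sql
-- One set-based statement replaces looping over every person and issuing a
-- single-row SELECT per person per day: the database joins roster and
-- planned-leave data and returns only the rows and columns actually needed.
SELECT p.person_id,
       c.cal_date,
       r.schedule_rule,
       l.leave_type
FROM   persons p
JOIN   calendar_days c
         ON c.cal_date BETWEEN p.valid_from AND p.valid_to
LEFT JOIN roster r
         ON r.person_id = p.person_id AND r.work_date  = c.cal_date
LEFT JOIN planned_leave l
         ON l.person_id = p.person_id AND l.leave_date = c.cal_date
WHERE  r.person_id IS NOT NULL
   OR  l.person_id IS NOT NULL;
```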
I've read the question and I stopped when I read this:
Looking at the source, it's just a bunch of single-row selects inside nested loop after nested loop. Not only that, it has loads of SELECT *.
Without knowing more about the issue, this looks like overkill, because every loop iteration executes a call to the database. Maybe it was done this way because the selected dataset is too big, but you can load chunks of data, process them, and repeat, or you can do one big JOIN and operate on that data. This is a little tricky, but trust me, it does the job.
In SAP you have to use these kinds of techniques when situations like this come up. Nothing is more efficient than handling datasets in memory; for that I can recommend sorted and/or hashed internal tables and BINARY SEARCH.
On the other hand, using a JOIN does not necessarily improve performance; it depends on knowing and using the indexes and foreign keys in the tables. For example, if you join to a table just to get a description, I think it is better to load that data into an internal table and fetch the description from it with a BINARY SEARCH.
I can't give you an exact formula; it depends on the case. Most of the time you have to tweak the code, debug and test, and use transactions ST05 and SE30 to check performance, then repeat the process. Experience with these issues in SAP gives you a clear eye for these patterns.
My best advice for you is to make a copy of that program and correct it according to your experience. The code that you describe can definitely be improved. What can you lose?
Hope it helps
Sounds like the import as it stands is looping over single records and importing them into a DB one at a time. It's highly likely that there's a lot of redundancy there. It's a pattern I've seen many times and the general solution we've adopted is to import data in batches...
A SQL Server stored procedure can accept table-valued parameters, which on the client/C# side of the database connection are simply lists of a data structure corresponding to the table type.
The stored procedure can then receive and process many rows of your CSV file in one call, so any joins you need to do are done on sets of input data, which is how relational databases are designed to be used. This is especially beneficial if you're joining out to commonly used data or have lots of foreign keys (which essentially invoke a join to validate the keys you're trying to insert).
We've found that the SQL Server CPU and IO load for a given amount of import data is much reduced by using this approach. It does however require consultation with DBAs and some tuning of indexes to get it to work well.
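A minimal sketch of that pattern (the type, procedure, and column names are made up for illustration):

```sql
-- Table type describing one row of the flat file.
CREATE TYPE dbo.RosterRow AS TABLE
(
    PersonId   INT          NOT NULL,
    WorkDate   DATE         NOT NULL,
    LeaveType  VARCHAR(10)  NULL
);
GO

-- The procedure receives a whole batch of rows in one call and inserts them
-- set-based, so key validation and lookups happen once per batch rather than
-- once per row.
CREATE PROCEDURE dbo.ImportRosterBatch
    @rows dbo.RosterRow READONLY
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.Roster (PersonId, WorkDate, LeaveType)
    SELECT r.PersonId, r.WorkDate, r.LeaveType
    FROM   @rows AS r
    JOIN   dbo.Person AS p ON p.PersonId = r.PersonId;  -- validate keys in bulk
END;
GO
```

On the C# side the batch is passed as a parameter with SqlDbType.Structured, typically built from a DataTable or a list of SqlDataRecord objects.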
You are correct.
Without knowing the code: in most cases it is much faster to use views or joins instead of nested loops (there are exceptions, but they are very rare).
You can define views in SE11 or SE80, and they usually heavily reduce the communication overhead between the ABAP server and the database server.
Often there are ready-made views from SAP for common cases.
edit:
You can check where your performance goes: http://scn.sap.com/community/abap/testing-and-troubleshooting/blog/2007/11/13/the-abap-runtime-trace-se30--quick-and-easy
Badly written parts that are rarely executed don't matter.
With the statistics you know where it hurts and where your optimization effort pays off.
Which conditional function performs better in Hive: IF or CASE?
I can speak from experience of working on optimizing complex queries with experts from Hortonworks. We worked on multi-hundred-line queries that included multiple IF/THEN and CASE expressions. The performance difference is so small as to be unmeasurable.
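For reference, the two forms being compared are interchangeable; a trivial example (column and table names are made up):

```sql
-- The IF() function and a CASE expression producing the same result.
SELECT IF(status = 'A', 'active', 'inactive')  AS state_via_if,
       CASE WHEN status = 'A' THEN 'active'
            ELSE 'inactive'
       END                                     AS state_via_case
FROM   users;
```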
Worry instead about your joins - i.e. map-side vs. side-data vs. reduce-side joins - and UDFs: those are where the performance improvements are to be found.
We did substantial tuning across a number of areas, including different join types and join skew, UDFs, and inline views. IF vs. CASE was never an area that surfaced.
Unsubstantiated, but it has been reported that the if/then is actually faster. http://www.oehive.org/node/985
I have some scripts which process my website's logs. I have loaded this data into multiple tables in Hive, and I run these scripts on a daily basis to analyze the traffic.
Lately, the Hive queries I have written in these scripts have been taking too much time. They used to take around 10-15 minutes to generate the reports; now they take hours.
I analyzed the data, and the dataset has only grown by around 5-10%.
One of my friends suggested that Hive is not good at joining multiple tables and that I should switch my scripts to Pig. Is Hive bad at joining tables compared to Pig?
Is Hive bad at joining tables
No. Hive is actually pretty good, but sometimes it takes a bit of playing around with the query optimizer.
Depending on which version of Hive you use, you may need to provide hints in your query to tell the optimizer to join the data using a certain algorithm. You can find some details about different hints here.
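For example, when one side of the join is small enough to fit in memory, you can ask for a map-side (broadcast) join explicitly; on recent Hive versions hive.auto.convert.join usually does this automatically (table names here are made up):

```sql
-- Broadcast the small dimension table to every mapper instead of shuffling
-- the large log table for a reduce-side join.
SELECT /*+ MAPJOIN(d) */
       l.user_id,
       d.page_category,
       COUNT(*) AS hits
FROM   web_logs l
JOIN   dim_pages d
  ON   d.page_id = l.page_id
GROUP  BY l.user_id, d.page_category;
```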
If you're thinking about using Pig, your choice should not be motivated only by performance considerations. In my experience there is no quantifiable gain from using Pig; I have used both over the past few years, and in terms of performance there is no clear winner.
What Pig gives you, however, is more transparency when defining what kind of join you want to use, instead of relying on some (sometimes obscure) optimizer hints.
In the end, Pig or Hive doesn't really matter; it just depends on how you decide to optimize your queries. If you're considering switching to Pig, I would first really analyze what your processing needs are, as you'll probably end up about even in terms of performance. Here is a good post if you want to compare the two.
I know that when joining across multiple tables, performance is dependent upon the order in which they are joined. What factors should I consider when joining tables?
Most modern RDBMSs optimize the query based on which tables are joined, the indexes used, table statistics, and so on. They rarely, if ever, differ in their final execution plan based on the order of the joins in the query.
SQL is designed to be declarative; you specify what you want, not (in most cases) how to get it. While there are things like index hints that can allow you to direct the optimizer to use or avoid specific indexes, by and large you can leave that work to the engine and be about the business of writing your queries.
In the end, running different versions of your queries within SQL Server Management Studio and viewing the actual execution plans is the only way to tell if order can truly make a difference.
As far as I know, the join order has no effect on query performance. The query engine will parse the query and execute it in the way it believes is the most efficient. If you want, try writing the query using different join orders and look at the execution plan. They should be the same.
See this article: http://sql-4-life.blogspot.com/2009/03/order-of-inner-joins.html
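A quick way to convince yourself: write the same inner join in both orders and compare the actual execution plans; with up-to-date statistics the optimizer should produce the same plan for both (hypothetical tables):

```sql
-- Version 1: orders listed first.
SELECT o.order_id, c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  o.order_date >= '2023-01-01';

-- Version 2: customers listed first. The optimizer is free to reorder
-- inner joins, so the execution plan should be identical.
SELECT o.order_id, c.customer_name
FROM   customers c
JOIN   orders o ON o.customer_id = c.customer_id
WHERE  o.order_date >= '2023-01-01';
```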