Hive union all efficiency and best practice - performance

I have a hive efficiency question. I have 2 massive queries that need to be filtered, joined with mapping tables, and unioned. All the joins are identical for both tables. Would it be more efficient to union them before applying the joins to the combined table or to apply the joins to each massive query individually then union the results? Does it make a difference?
I tried the second way and the query ran for 24 hours before I killed it. I feel like I did everything I could to optimize it except potentially rearrange the union statement. On the one hand, I feel like it should not matter, because the number of rows being joined against the mapping table is the same and, since everything is parallelized, it should take roughly the same amount of time. On the other hand, maybe doing the union first would guarantee that the two big queries get full system resources before the joins are run. Then again, that might mean there are only 2 jobs running at a time, so the system is not being fully used or something.
I simply do not know enough about how Hive and its parallelism work. Anybody have any ideas?

There is no such best practice. Both approaches are applicable. Subqueries in a UNION ALL run as parallel jobs, so joining before the union works as parallel tasks on smaller datasets; Tez can optimize the execution, and each commonly joined table is read only once, in a single mapper stage per table.
Also, you can skip the joins entirely for some subqueries, for example if their keys are not applicable to the join.
Joining the union-ed, bigger dataset may also work with very high parallelism depending on your settings (bytes per reducer, for example), and the optimizer may rewrite the query plan. So I suggest you test both methods (sketched below), measure speed, study the plan, and check whether you can change something. Change, measure, study the plan... repeat.
A few more suggestions:
Try to limit the datasets before joining them. If your join multiplies rows, then analytics and aggregation will run on bigger datasets and work more slowly, so the first approach may be preferable if you can apply the analytics/aggregation before the union.
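As a rough sketch of the two arrangements being compared (all table and column names are made up for illustration; some older Hive versions also require a top-level UNION ALL to be wrapped in a subquery):

-- the "bytes per reducer" setting mentioned above (the value is illustrative)
SET hive.exec.reducers.bytes.per.reducer=268435456;

-- Option 1: union the two big queries first, then join the combined dataset
SELECT u.id, m.label
FROM (
  SELECT id, val FROM big_table_a WHERE dt = '2020-01-01'
  UNION ALL
  SELECT id, val FROM big_table_b WHERE dt = '2020-01-01'
) u
JOIN mapping_table m ON u.id = m.id;

-- Option 2: join each big query individually, then union the results
SELECT a.id, m.label
FROM big_table_a a
JOIN mapping_table m ON a.id = m.id
WHERE a.dt = '2020-01-01'
UNION ALL
SELECT b.id, m.label
FROM big_table_b b
JOIN mapping_table m ON b.id = m.id
WHERE b.dt = '2020-01-01';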

Related

Oracle: Efficient query over date interval

I have tables representing events with start and end time (stored as DATE, TIMESTAMP or TIMESTAMP WITH TIME ZONE format), e.g.
MY_TABLE:
DATA  START            END
A     1/11/2012 10:00  1/11/2012 12:00
B     2/12/2012 08:00  2/12/2012 16:00
And we frequently run queries on such tables over time intervals, e.g.
SELECT data
FROM my_table
WHERE start BETWEEN t1 AND t2; -- usually we either use start or end time for every row.
Where t1 and t2 are DATE/TIMESTAMP values s.t. t1<= t2.
Since these queries are going to be run on large tables, is there a better, that is, more efficient, way of performing queries like the above?
Currently I don't know whether any table with the structure above has an index on either of the time columns, but it would hardly be a problem to add them. They have indexes on other, non-time-related columns.
I remember having read somewhere that using analytic functions (like partition) this kind of query could be made much more efficient than simply using BETWEEN..AND. Unfortunately I can't find the link anymore, and I don't know analytic functions, of which I have only read a few short introductions here and there on the net.
Since I have little time to investigate I'd like to ask you if you could confirm my theory and if you could lead me to an example related to my problem.
It goes without saying that I'm not asking you a quick answer to my problem, something to copy&paste, just a hint to understand if I'm looking in the right direction.
TIA
EDIT:
@jonearles: For the first statement I'd agree, but I'd like to know whether the use of analytic functions isn't actually able to provide a more efficient query.
For the latter, yes, I meant the PARTITION BY clause. It occurs to me that this is a silly clarification, since analytic functions are expected to be used with a PARTITION BY clause.
I apologize for the confusion; as I said before, I haven't looked much into the subject.
Your query is probably fine. The simplest way to write a predicate is usually the best. That's what Oracle expects, and is most likely what Oracle is optimized for.
You probably want to look into creating objects to improve the access methods. Specifically indexes (if you're selecting a small amount of data) and partitions (if you're selecting a large amount of data).
"Partition" can mean at least three different things in Oracle. Perhaps you've confused the "partition by" clause in analytic functions with partitioning?

HBase Inner join and coprocessors

I am planning to do a project implementing all aggregation operations in HBase, but I don't know how difficult it will be. I have only 6 months to complete the project. Should I go forward with it? I am planning to do it in Java. I know that there are already some aggregation functions, but there are no INNER JOIN-like queries at the moment. I am planning to implement that type of query. I don't know whether it's a blunder or a bluff.
I think technically we should distinguish two types of joins:
a) One small table + one big table. By small table I mean a table which can be cached in the memory of each node without seriously affecting cluster operation. In this case a join using a coprocessor should be possible by putting the small table in a hash map, iterating over the node-local part of the big table's data, and producing the join results that way. In Hive's terms this is called a "map" join (http://www.facebook.com/note.php?note_id=470667928919); see the sketch after this list.
b) Two big tables. I do not think it is viable to get this to production quality in a short time frame. I would say that such functionality is the realm of MPP databases and a serious part of their IP.
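For reference, the Hive-style map join mentioned in a) can be requested with a hint, roughly like this (table names are made up; newer Hive versions usually convert small-table joins automatically when hive.auto.convert.join is enabled):

SELECT /*+ MAPJOIN(small_dim) */ f.row_key, d.label
FROM big_fact f
JOIN small_dim d ON f.dim_key = d.dim_key;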
It is definitely harder in HBase than doing it in an RDBMS or a different Hadoop technology like PIG or Hive.

Large query, multiple tables, old vs new JOIN syntax

I have a large query that joins around 20 tables (mostly outer joins). It is using the older join syntax with commas and where conditions with (+) for outer joins.
We noticed that it is consuming a lot of server memory. We are trying several things, one of which is to convert this query to the newer ANSI syntax, since the ANSI syntax allows better control over the order of JOINs and also specifies the JOIN predicates explicitly where they are applied.
Does converting the query from an older syntax to the newer ANSI syntax help in reducing the amount of data processed, for such large queries spanning a good number of tables?
In my experience, it does not - it generates identical execution plans. That said, the newer JOIN syntax does allow you to do things that you can't do with the old syntax. I would recommend converting it for that reason, and for clarity; the ANSI syntax is just so much easier to read (at least for me). Once converted, you can then compare execution plans.
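As a small illustration of the conversion, with hypothetical tables and columns:

-- old Oracle comma-join syntax, (+) marking the outer-joined side
SELECT e.last_name, d.dept_name
FROM employees e, departments d
WHERE e.dept_id = d.dept_id(+);

-- equivalent ANSI syntax
SELECT e.last_name, d.dept_name
FROM employees e
LEFT OUTER JOIN departments d ON e.dept_id = d.dept_id;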
DCookie said all there is to say about ANSI syntax.
However, if you outer join 20 tables, it is no wonder you consume a lot of server memory. Maybe if you cut your query down into smaller subqueries it might improve performance. That way you avoid reading all the tables into memory, then joining them in memory, then filtering, and only then selecting the columns you need.
Reversing this order will at least save memory, although it won't necessarily improve execution speed.
As DCookie mentioned, both versions should produce identical execution plans. I would start by looking at the current query's execution plan and figuring out what is actually taking up the memory. A quick look at DBMS_XPLAN.DISPLAY_CURSOR output should be a good start. Once you know exactly what part of the query you are trying to improve, then you can analyze if switching to ANSI style joins will do anything to help you reach your end goal.
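For example, run the query and then, in the same session, something like this shows the last statement's plan (ALLSTATS LAST is one common format choice; actual row counts require the GATHER_PLAN_STATISTICS hint or STATISTICS_LEVEL = ALL):

SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));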

What are the deciding factors for the order of Tables when joining amongst them?

I know that when joining across multiple tables, performance is dependent upon the order in which they are joined. What factors should I consider when joining tables?
Most modern RDBMSs optimize the query based upon which tables are joined, the indexes used, table statistics, etc. They rarely, if ever, differ in their final execution plan based upon the order of the joins in the query.
SQL is designed to be declarative; you specify what you want, not (in most cases) how to get it. While there are things like index hints that can allow you to direct the optimizer to use or avoid specific indexes, by and large you can leave that work to the engine and be about the business of writing your queries.
In the end, running different versions of your queries within SQL Server Management Studio and viewing the actual execution plans is the only way to tell if order can truly make a difference.
As far as I know, the join order has no effect on query performance. The query engine will parse the query and execute it in the way it believes is the most efficient. If you want, try writing the query using different join orders and look at the execution plan. They should be the same.
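For example, with hypothetical tables, these two versions should produce the same plan:

SELECT o.order_id, c.name, p.title
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN products p ON p.product_id = o.product_id;

-- same query, joins listed in the opposite order
SELECT o.order_id, c.name, p.title
FROM orders o
JOIN products p ON p.product_id = o.product_id
JOIN customers c ON c.customer_id = o.customer_id;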
See this article: http://sql-4-life.blogspot.com/2009/03/order-of-inner-joins.html

How to increase Oracle CBO cost estimation for hash joins, group by's and order by's without hints

It seems that on some of the servers we have, the estimated cost of hash joins, group by's and order by's is too low compared to the actual cost. That is, execution plans with index range scans often outperform the former, yet in the explain plan their cost shows up as higher.
Some further notes:
I already set optimizer_index_cost_adj to 20 and it's still not good enough. I do NOT want to increase the cost for pure full table scans, in fact I wouldn't mind the optimizer decreasing the cost.
I've noticed that pga_aggregate_target makes an impact on CBO cost estimates, but I definitely do NOT want to lower this parameter as we have plenty of RAM.
As opposed to using optimizer hints in individual queries, I want the settings to be global.
Edit 1: I'm thinking about experimenting with dynamic sampling, but I don't have enough intimate knowledge to predict how this could affect overall performance, i.e. how frequently the execution plans could change. I would definitely prefer something very stable; in fact, for some of our largest clients we have a policy of locking all the stats (which will change with Oracle 11g SQL Plan Management).
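For reference, the global (non-hint) settings being discussed look like this; the dynamic sampling level is purely illustrative:

ALTER SYSTEM SET optimizer_index_cost_adj = 20;
ALTER SYSTEM SET optimizer_dynamic_sampling = 4;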
Quite often when execution plans with index range scans outperform those with full scans + sorts or hash joins, but the CBO is picking the full scans, it's because the optimiser believes it's going to find more matching results than it actually gets in real life.
In other words, if the optimiser thinks it's going to get 1M rows from table A and 1000 rows from table B, it may very well choose full scans + sort merge or hash join; if, however, when it actually runs the query, it only gets 1 row from table A, an index range scan may very well be better.
I'd first look at some poorly performing queries and analyse the selectivity of the predicates, determine whether the optimiser is making reasonable estimates of the number of rows for each table.
EDIT:
You've mentioned that the cardinality estimates are incorrect. This is the root cause of your problems; the costing of hash joins and sorts is probably quite OK. In some cases the optimiser may be using wrong estimates because it doesn't know how much the data is correlated. Histograms on some columns may help (if you haven't already got them), and in some cases you can create function-based indexes and gather statistics on the hidden columns to provide even better data to the optimiser.
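A sketch of the histogram and function-based-index statistics idea, with hypothetical table and column names:

CREATE INDEX orders_status_fbi ON orders (UPPER(status));

-- histogram (up to 254 buckets) on a skewed column
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'ORDERS',
    method_opt => 'FOR COLUMNS status SIZE 254',
    cascade    => TRUE);
END;
/

-- statistics on the hidden column created by the function-based index
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'ORDERS',
    method_opt => 'FOR ALL HIDDEN COLUMNS SIZE 254');
END;
/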
At the end of the day, your trick of specifying the cardinalities of various tables in the queries may very well be required to get satisfactory performance.
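One common (though undocumented) way of specifying a cardinality in the query itself is the CARDINALITY hint; names here are hypothetical:

SELECT /*+ CARDINALITY(t 10) */ t.id, t.payload
FROM some_table t
WHERE t.created_at > SYSDATE - 1;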
