Oracle Indexes on Left Outer Joins - oracle

So I'm having some issues with proper / any use of indexes in Oracle 11Gr2 and I'm trying to get a better understanding of how my explain plan ties back to my query so that I can apply indexing properly. When running the following query:
SELECT JLOG1.JLOG_KEY,
JLOG1.SRC_CD,
JLOG1.JRNL_AMT,
CASD.CONT_NO,
SUM (NVL (VJLOG.TDTL_AMT, 0)) TDTL_SUM
FROM GL_Journal_Logs JLOG1,
GL_JLOG_Details VJLOG,
CASE_DATA CASD
WHERE VJLOG.JLOG_KEY(+) = JLOG1.JLOG_KEY
AND CASD.CASE_KEY(+) = JLOG1.CASE_KEY
AND JLOG1.JRNL_CD = '0'
AND JLOG1.SRC_CD = '2'
AND JLOG1.ACCT_IF_CD = '0'
GROUP BY JLOG1.JLOG_KEY, JLOG1.SRC_CD,JLOG1.JRNL_AMT, CASD.CONT_NO
HAVING JLOG1.JRNL_AMT <> SUM (NVL (VJLOG.TDTL_AMT, 0));
I'm getting the following explain details:
I can understand that the indexes on my join "keys" (JLOG_KEY or CASE_KEY) wouldn't necessarily apply seeing as it's an outer join (or should they?), however when creating indexes on JLOG1 (JRNL_CD, SRC_CD, ACCT_IF_CD), technically would these take effect given my "where" clause?
Should I create any indexes at all given the circumstances or is there a better way of doing this?

Depending on the cardinality of the columns in your predicates, an appropriate index might be used on the GL_JLOG_DETAILS table, avoiding a full table scan. A covering index may avoid accessing the data pages at all:
ON GL_JOURNAL_LOGS (JRNL_CD,SRC_CD,ACCT_IF_CD,JLOG_KEY,CASE_KEY,JRNL_AMT)
(You probably want the column with the most selective predicate first in that index)
Also, your query may be able to make effective use of indexes
ON GL_JLOG_DETAILS (JLOG_KEY, TDTL_AMT)
and
ON CASE_DATA (CASE_KEY, CONT_NO)
Also, be sure that the statistics on the tables and indexes are up-to-date.
Also, that (+) notation for an OUTER JOIN may be limiting the optimizer.
Oracle now supports the ANSI style joins, which may allow the optimizer more latitude in coming up with an execution plan, e.g.
FROM GL_Journal_Logs JLOG1
LEFT
JOIN GL_JLOG_Details VJLOG ON VJLOG.JLOG_KEY = JLOG1.JLOG_KEY
LEFT
JOIN CASE_DATA CASD ON CASD.CASE_KEY = JLOG1.CASE_KEY
WHERE JLOG1.JRNL_CD = '0'
AND JLOG1.SRC_CD = '2'
AND JLOG1.ACCT_IF_CD = '0'

Related

Multiple consecutive join operations on PySpark

I am running a PySpark application where we are comparing two large datasets of 3GB each. There are some differences in the datasets, which we are filtering via outer join.
mismatch_ids_row = (sourceonedf.join(sourcetwodf, on=primary_key,how='outer').where(condition).select(primary_key)
mismatch_ids_row.count()
So the output of join on count is a small data of say 10 records. The shuffle partition at this point is about 30 which has been counted as amount of data/partition size(100Mb).
After the result of the join, the previous two datasets are joined with the resultant joined datasets to filter out data for each dataframe.
df_1 = sourceonedf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
df_2 = sourcetwodf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
Here we are dropping duplicates since the result of first join will be double via outer join where some values are null.
These two dataframes are further joined to find the column level comparison and getting the exact issue where the data is mismatched.
df = (df_1.join(df_2,on=some condition, how="full_outer"))
result_df = df.count()
The resultant dataset is then used to display as:
result_df.show()
The issue is that, the first join with more data is using merge sort join with partition size as 30 which is fine since the dataset is somewhat large.
After the result of the first join has been done, the mismatched rows are only 10 and when joining with 3Gb is a costly operation and using broadcast didn't help.
The major issue in my opinion comes when joining two small resultant datasets in second join to produce the result. Here too many shuffle partitions are killing the performance.
The application is running in client mode as spark run for testing purposes and the parameters are sufficient for it to be running on the driver node.
Here is the DAG for the last operation:
As an example:
data1 = [(335008138387,83165192,"yellow","2017-03-03",225,46),
(335008138384,83165189,"yellow","2017-03-03",220,4),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
data2 = [(335008138387,83165192,"yellow","2017-03-03",300,46),
(335008138384,83165189,"yellow","2017-03-03",220,10),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
field = [
StructField("row_num",LongType(),True),
StructField("tripid",IntegerType(),True),
StructField("car_type",StringType(),True),
StructField("dates", StringType(), True),
StructField("pickup_location_id", IntegerType(), True),
StructField("trips", IntegerType(), True)
]
schema = StructType(field)
sourceonedf = spark.createDataFrame(data=data1,schema=schema)
sourcetwodf = spark.createDataFrame(data=data2,schema=schema)
They have just two differences, on a larger dataset think of these as 10 or more differences.
df_1 will get rows from 1st sourceonedf based on mismatch_ids_row and so will the df_2. They are then joined to create another resultant dataframe which outputs the data.
How can we optimize this piece of code so that optimum partitions are there for it to perform faster that it does now.
At this point it takes ~500 secs to do whole activity, when it can take about 200 secs lesser and why does the show() takes time as well, there are only 10 records so it should print pretty fast if all are in 1 partition I guess.
Any suggestions are appreciated.
You should be able to go without df_1 and df_2. After the first 'outer' join you have all the data in that table already.
Cache the result of the first join (as you said, the dataframe is small):
# (Removed the select after the first join)
mismatch_ids_row = sourceonedf.join(sourcetwodf, on=primary_key, how='outer').where(condition)
mismatch_ids_row.cache()
mismatch_ids_row.count()
Then you should be able to create a self-join condition. When joining, use dataframe aliases for explicit control:
result_df = (
mismatch_ids_row.alias('a')
.join(mismatch_ids_row.alias('b'), on=some condition...)
.select(...)
)

How to perform subquery in FROM clause with Yii 1.1?

I have a fairly large and complex query being generated by Yii. To generate this query we're utilizing CDbCritera::with to eagerly load multiple related models, and we're using multiple scopes to limit the records returned. The query being generated is roughly 700 lines long, but looks something like this:
SELECT `t`.`column1` as `t0_c0`,
`t`.`column2` as `t0_c1`,
`related1`.`column1` as `t1_c0`,
...
`related9`.`column5` as `t9_c4`
FROM `model` `t`
LEFT OUTER JOIN `other_model` `related1`
ON ( `t`.`other_model_id` = `related1`.`id` )
...
LEFT OUTER JOIN `more_models` `related9`
ON ( `t`.`more_models_id` = `related9`.`id` )
WHERE
...big long WHERE clause using all of related1 - related9 to filter model...
LIMIT 10
Our database has a not insignificant amount of data, but not obscene, either. In this case the model table has about 126000 rows, every "related" model is a BELONGS_TO relationship and there is an index on t.XXX_id so the join is fairly trivial. The problem is the complexity of the WHERE clause, possessing multiple COALESCE and IF and CASE clauses. Performing the filter on our 126000 rows is taking 2.6 seconds -- far longer than we would like for an API endpoint.
The WHERE clause is divided into multiple different sections like so:
WHERE
( ... part 1 ... )
AND
( ... part 2 ... )
AND
( ... part 3 ... )
With each part corresponding to one of the scopes, and each part using one or more related models
One of the scopes filters on only a single related model, and in doing so filters our table down from 126000 rows to about 2000 rows. I found experimentally (in MySQL Workbench) that I could get our query from 2.6 seconds to 0.2 seconds by simply doing this:
SELECT `t`.`column1` as `t0_c0`,
`t`.`column2` as `t0_c1`,
`related1`.`column1` as `t1_c0`,
...
`related9`.`column5` as `t9_c4`
FROM
(
SELECT `model`.*
FROM `model`
LEFT OUTER JOIN `other_model`
ON ( `t`.`other_model_id` = `other_model`.`id` )
WHERE
( ... part 1 ... )
) `t`
LEFT OUTER JOIN `other_model` `related1`
ON ( `t`.`other_model_id` = `related1`.`id` )
...
LEFT OUTER JOIN `more_models` `related9`
ON ( `t`.`more_models_id` = `related9`.`id` )
WHERE
( ... part 2 ... )
AND
( ... part 3 ... )
LIMIT 10
This way instead of performing the very complex WHERE clause on all 126000 rows of the original model table, we perform the much simpler (and index-enhanced) WHERE clause on these 126000 rows and then perform the complex WHERE clause on only the 2000 relevant rows. The results of the two queries are identical, but using a subquery in the FROM clause causes it to run 13x faster.
The problem is, I have no idea how to do this in Yii. I know that I can use CDbCommand to build a query and even pass in raw SQL, but what I'll get back is an array of "rows" -- they won't be understood by Yii and properly converted to the right models.
Does Yii's ActiveRecord system have a way to say something like the following?
$criteria = new CDbCriteria;
$criteria->scopes = array("part1");
$subQuery = Model::model()->buildQuery($criteria);
$criteria = new CDbCriteria;
$criteria->scopes = array("part2", "part3");
$fullQuery = $subQuery->findAll($criteria);
Although not a perfect solution, I did find something that's almost as good. Break the original query into two:
Get the IDs or models you wish to select in the FROM subquery
Append a WHERE to the outer query with id in (...)
If anyone is interested I'll hunt down the code I wrote for this to post in the answer as an example, but so far this question has gotten very little attention and once I found a pseudo-decent solution I've moved on.

LEFT OUTER JOIN vs NOT EXISTS syntax with multiple tables

I am doing some performance troubleshooting on my SSRS native instance. I have what I hope is a simple syntax issue. I am troubleshooting execution plans when using LEFT OUTER JOIN and NOT EXISTS. I know the difference between the two and hope that maybe NOT EXISTS is my solution, however I have one problem. Here is my query.
SELECT [Facility]
,[CategoryDesc]
,[SubCategoryDesc]
,[ItemKey]
,[ItemDesc]
,[HeadCount]
,[Group]
,[Group Name]
,[CustomerKey]
,[Customer]
,[InvoiceNo]
,[InvoiceDate]
,[OrderNo]
,[OrderDate]
,[FiscalYear]
,[Quarter]
,[WeekNo]
,[SalesmanID]
,[Salesman]
,[ReasonCodeKey]
,[Weight]
,[Box]
,[Value]
,[OrderStatus]
,[PONumber]
,[SubCategoryKey]
,[DispatchCenterOrderKey]
,[PromotionFlag]
,[CategoryKey]
,b.UserID
FROM [FinancialData].[dbo].[FactSalesHistoryDetail] a
LEFT OUTER JOIN [FinancialData].[dbo].[DimSalesRepUserIDMap] b on b.SalesRepID = a.SalesmanID
I am hoping to use this instead:
SELECT [Facility]
,[CategoryDesc]
,[SubCategoryDesc]
,[ItemKey]
,[ItemDesc]
,[HeadCount]
,[Group]
,[Group Name]
,[CustomerKey]
,[Customer]
,[InvoiceNo]
,[InvoiceDate]
,[OrderNo]
,[OrderDate]
,[FiscalYear]
,[Quarter]
,[WeekNo]
,[SalesmanID]
,[Salesman]
,[ReasonCodeKey]
,[Weight]
,[Box]
,[Value]
,[OrderStatus]
,[PONumber]
,[SubCategoryKey]
,[DispatchCenterOrderKey]
,[PromotionFlag]
,[CategoryKey]
,b.UserID
FROM [FinancialData].[dbo].[FactSalesHistoryDetail] a
WHERE NOT EXISTS (SELECT 1 FROM FinancialData.dbo.DimSalesRepUserIDMap b WHERE b.SalesRepID = a.SalesmanID)
The problem is that the very last column "b.UserID" uses the LEFT OUTER JOIN to get it's alias. When use the last query, I get "the multi-part identifier "b.UserID" could not be bound. Obviously this is because I have removed the call to this table. If I include it this way... it takes far far too long and not what I am expecting to receive.
FROM [FinancialData].[dbo].[FactSalesHistoryDetail] a, FinancialData.dbo.DimSalesRepUserIDMap b
WHERE NOT EXISTS (SELECT 1 FROM FinancialData.dbo.DimSalesRepUserIDMap b WHERE b.SalesRepID = a.SalesmanID)
So the question is how do I format this so that I am optimizing the performance with NOT EXISTS or EXISTS, while also referencing multiple columns from other tables?
Left join is the best option at this juncture. The last query will produce results but will have performance impact. Btw, did you try cross join?

Pig Script: Null values after join and GENERATE

I have a strange dilemma in my Pig Script. I am joining multiple tables and one of the last joins is as follows:
a = JOIN O_1 by ((long)OpropID, (long)OAID) LEFT, property by ((long)GPropID, (long)prop_AID);
If I filter my result by specific data points, I get the proper results for those fields from the property table (right table in the join). Even without the filter, the resultset is correct, I'm only filtering it to test the results.
b = filter a by OpropID==12 and OAID==10;
dump b;
However, if create a subsequent GENERATE statement immediately after the join, the same fields (last two in the example below) return NULL results:
c = FOREACH a GENERATE gID, p_AID, OpropID, OAID, GPropID, prop_AID;
I've tried using $16, $17 instead of the field names; I've also used property::GPropID or property::prop_AID to no avail.
Any help at this point would be appreciated.

Oracle 9i Sub query

Hi Can any one help me out of this query forming logic
SELECT C.CPPID, c.CPP_AMT_MANUAL
FROM CPP_PRCNT CC,CPP_VIEW c
WHERE
CC.CPPYR IN (
SELECT C.YEAR FROM CPP_VIEW_VIEW C WHERE UPPER(C.CPPNO) = UPPER('123')
AND C.CPP_CODE ='CPP000000000053'
and TO_CHAR(c.CPP_DATE,'YYYY/Mon')='2012/Nov'
)
AND UPPER(C.CPPNO) = UPPER('123')
AND C.CPP_CODE ='CPP000000000053'
and TO_CHAR(c.CPP_DATE,'YYYY/Mon') = '2012/Nov';
Please Correct me if i formed wrong query structure, in terms of query Performance and Standards. Thanks in Advance
If you have some indexes or partitioned tables I would not use functions on columns but on variables, to be able to use indexes/select partitions.
Also I use ANSI 92 SQL syntax. You don't specify(or not directly) a join contition between cpp_prcnt and cpp_view so it is actually a cartesian product(cross join)
SELECT C.CPPID, c.CPP_AMT_MANUAL
FROM CPP_PRCNT CC
CROSS JOIN CPP_VIEW c
WHERE
CC.CPPYR IN (
SELECT C.YEAR
FROM CPP_VIEW_VIEW C
WHERE C.CPPNO = '123'
AND C.CPP_CODE ='CPP000000000053'
AND trunc(c.CPP_DATE,'MM')=to_date('2012/Nov','YYYY/Mon')
)
AND C.CPPNO = '123'
AND C.CPP_CODE ='CPP000000000053'
AND trunc(c.CPP_DATE,'MM')=to_date('2012/Nov','YYYY/Mon')
If you show us the definition of cpp_view_view(seems to be a view over cpp_view), the definition(if simple) of CPP_VIEW and what you're trying to achieve, I bet there are more things to be improved/fixed.
There are a couple of things you could improve:
if possible, get rid of the UPPER() in the comparison - this will render any indices useless. If that's not possible, consider a function-based index on UPPER(CPPNO)
do not convert your DATE column to a string to compare it with a string - do it the other way round (i.e. convert your string to a date => only one conversion needed instead of one per table row, use of indices possible)
play around with EXISTS instead of IN, as suggested by Dileep - might be faster

Resources