Reconcile two large tables efficiently in Power BI (DAX)?

As part of a migration project, I'm looking to reconcile two fact tables (high cardinality, approx 500k rows each; there are a lot of customer accounts and the reconciliation has to be done on a customer-account basis). There is a many-to-many relationship between the customer columns in the two tables.
I am struggling to find an efficient way to output the customers that appear in both tables but have a difference in the loan value.
I've tried a merge in Power Query, but it is extremely slow, perhaps due to the volume and the high cardinality.
I would welcome any advice on how to produce the desired output efficiently.
Input Table 1:
Customer | Type | Channel | Loan
Jones | A | Branch | 100
Taylor | B | Phone | 200
Taylor | B | Online | 60
Jerez | C | Online | 120
Murray | D | Phone | 90
Input Table 2:
Customer | Type | Loan
Jones | A | 81
Taylor | B | 285
Jerez | C | 80
Jerez | C | 40
Seinfeld | A | 140
Desired Output:
Customers that appear in both tables but whose loan totals differ:
Customer | Type1 | Loan1 | Loan2
Jones | A | 100 | 81
Taylor | B | 260 | 285
where Loan1 is the total loan stated in Table 1 and Loan2 is the total loan stated in Table 2.
Thanks for taking the time to look at this question.
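A minimal DAX sketch of one approach (untested; it assumes the tables are literally named Table1 and Table2, as in the sample): pre-aggregate each table to one row per customer, then keep the customers that appear on both sides with differing totals.
// Sketch only: Table1/Table2 and the column names come from the sample data.
EVALUATE
VAR Summary1 =
    ADDCOLUMNS (
        SUMMARIZE ( Table1, Table1[Customer], Table1[Type] ),
        "Loan1", CALCULATE ( SUM ( Table1[Loan] ) )
    )
VAR WithLoan2 =
    ADDCOLUMNS (
        Summary1,
        "Loan2",
        VAR CurrentCustomer = Table1[Customer]
        RETURN
            CALCULATE ( SUM ( Table2[Loan] ), Table2[Customer] = CurrentCustomer )
    )
RETURN
    // Keep customers present in both tables whose totals differ
    FILTER ( WithLoan2, NOT ISBLANK ( [Loan2] ) && [Loan1] <> [Loan2] )
The same pre-aggregation idea usually helps the slow Power Query merge too: Table.Group each query on Customer first, then merge the two small summary tables instead of the raw 500k-row tables.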

Related

How to pivot data in Hive?

First, I've checked other topics on the subject, like How to transpose/pivot data in hive?, but they don't match what I want.
So this is the table I have:
| ID | Day | Status |
| 1 | 1 | A |
| 2 | 10 | B |
| 3 | 101 | A |
| 3 | 322 | B |
| 3 | 102 | C |
| 3 | 354 | D |
And I'd like to concatenate the different Status values for each ID, ordered by Day, in order to get this:
| ID | Status |
| 1 | A |
| 2 | B |
| 3 | A,C,B,D |
The thing is that I don't know how many statuses there can be, so I can't create a fixed set of columns for the days, since I don't know how many day/status pairs I'll have. That's why I don't know how to adapt the answers from other topics (group_map and the like) to my problem.
Thanks for helping me ^^
Use collect_set (for distinct values) or collect_list to aggregate the values into an array, then concatenate it using concat_ws:
select ID, concat_ws(',', collect_list(Status)) as Status
from your_table   -- "table" is a reserved word; substitute your real table name
group by ID;
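Note that collect_list alone does not guarantee the order by Day that the desired output shows. One common workaround (a sketch, assuming Day is a non-negative integer, Status contains no digit-colon sequence, and your_table is the real table name) is to prefix each status with a zero-padded day, sort the array lexically, then strip the prefix:
select ID,
       regexp_replace(
           concat_ws(',',
               sort_array(
                   -- "0000000101:A" sorts lexically in Day order
                   collect_list(concat(lpad(cast(Day as string), 10, '0'), ':', Status))
               )
           ),
           '[0-9]+:', ''   -- strip the zero-padded day prefix again
       ) as Status
from your_table
group by ID;
For ID 3 this yields A,C,B,D, matching the desired output.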

Is there a way to cache and filter a table locally in PL/SQL?

I'm faced with having to process a table in Oracle 11g that contains around 5 million records. These records are speed limits along a divided highway. There is a SpeedLimitId, a HighwayId, and a from/to mile post to depict the area that the speed limit applies to. Currently all the records are only on one side of the divided highway, and they need to be processed to also apply them to the other side.
There is a measure equation table that tells us which range of measures on one side of the highway equals a range of measures on the other side. This allows us to calculate the measure that the speed limit event will have on the other side: calculate the percentage of the measure value within its range, then find that same percentage of the range on the opposing side. A speed limit record can be contained in one measure equation record or it can cross several of them. Based on the information in the speed limit table and the measure equation table, one or more records need to be inserted into a third table.
SPEED_LIMIT
+--------------+-----------+--------------+------------+------------+-------+
| SpeedLimitId | HighwayId | FromMilePost | ToMilePost | SpeedLimit | Lane |
+--------------+-----------+--------------+------------+------------+-------+
| 1 | 75N | 115 | 123 | 60 | South |
+--------------+-----------+--------------+------------+------------+-------+
MEASURE_EQUATION
+------------+----------------+-----------+---------+-------+----------------+-----------+---------+-------+------------------+
| EquationId | NorthHighwayId | NFromMile | NToMile | NGain | SouthHighwayId | SFromMile | SToMile | SGain | IsHighwayDivided |
+------------+----------------+-----------+---------+-------+----------------+-----------+---------+-------+------------------+
| 1 | 75N | 105 | 120 | 15 | 75S | 100 | 110 | 10 | No |
| 2 | 75N | 120 | 125 | 5 | 75S | 110 | 125 | 15 | Yes |
| 3 | 75N | 125 | 130 | 5 | 75S | 125 | 130 | 5 | No |
+------------+----------------+-----------+---------+-------+----------------+-----------+---------+-------+------------------+
Depending on information in the SPEED_LIMIT and MEASURE_EQUATION table there will be a need to insert at least one but can be as many as three records in a third table. There are a dozen or so different scenarios that can take place as a result of different values in the fields.
Using the above data you can see that SpeedLimitId 1 is noted as being on the south side of the highway, but it currently sits on the north side, and that it spans the two equation records with ids 1 and 2. In this case it spans two measure ranges because a single roadway splits off and becomes a divided highway. We need to split the original record into two events, add them to a third (processing) table, and calculate the new measure for the southbound lane.
SPEED_LIMIT_PROCESSING
+--------------+-----------+-------+----------+--------+
| SpeedLimitId | HighwayId | LANE | FromMile | ToMile |
+--------------+-----------+-------+----------+--------+
| 1 | 75N | North | 115 | 120 |
| 1 | 75S | South | 110 | 119 |
+--------------+-----------+-------+----------+--------+
The methodology to calculate the measure on the southbound lane is as follows:
+--------------------+----------------------------+-----------------------------+
| | From Measure Translation | To Measure Translation |
+--------------------+----------------------------+-----------------------------+
| Event Measure as % | ((120 - 120)/5) * 100 = 0% | ((123 - 120)/5) * 100 = 60% |
| Offset Measure | ((15 * 0) / 100) = 0 | ((15 * 60) / 100) = 9 |
| Translated Measure | 110 + 0 = 110 | 110 + 9 = 119 |
+--------------------+----------------------------+-----------------------------+
My concern is to do this in the most efficient way possible. The idea would be to loop through each record in the SPEED_LIMIT table, select the corresponding records in the measure equation table, and then, based on information from those two tables, insert records into a third table. In order to limit PL/SQL context switches I planned on using BULK COLLECT and FORALL statements to query the event table and to run the insert statements; this would allow me to do things in batches. The missing component is how to get the corresponding records from the MEASURE_EQUATION table without having to run a SQL query for every iteration of the SPEED_LIMIT loop. MEASURE_EQUATION only has about 700 records in it, so I was wondering if there is a way I can cache it in PL/SQL and then filter it down to the records appropriate for the current SPEED_LIMIT record.
As you have probably gleaned from my question, I'm fairly new to PL/SQL and Oracle in general, so maybe I'm going about this in completely the wrong way.
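One way to do what the last paragraph describes (a sketch only: column names are guessed from the ASCII tables above, and the overlap test plus the translation arithmetic would need to be extended to cover your dozen scenarios): BULK COLLECT the ~700 MEASURE_EQUATION rows into a collection once, scan that in-memory copy for each speed-limit record, and FORALL-insert each batch.
DECLARE
    TYPE t_equ_tab IS TABLE OF measure_equation%ROWTYPE;
    TYPE t_sl_tab  IS TABLE OF speed_limit%ROWTYPE;
    TYPE t_out_tab IS TABLE OF speed_limit_processing%ROWTYPE;

    l_equ t_equ_tab;
    l_sl  t_sl_tab;
    l_out t_out_tab := t_out_tab();

    CURSOR c_sl IS SELECT * FROM speed_limit;
BEGIN
    -- Cache the small table once; ~700 rows is cheap to hold in memory.
    SELECT * BULK COLLECT INTO l_equ FROM measure_equation;

    OPEN c_sl;
    LOOP
        FETCH c_sl BULK COLLECT INTO l_sl LIMIT 1000;  -- batch the 5M-row table
        EXIT WHEN l_sl.COUNT = 0;

        FOR i IN 1 .. l_sl.COUNT LOOP
            -- "Filter" the cached collection: equations whose north range
            -- overlaps this speed-limit record (column names are assumed).
            FOR j IN 1 .. l_equ.COUNT LOOP
                IF l_equ(j).northhighwayid = l_sl(i).highwayid
                   AND l_sl(i).frommilepost < l_equ(j).ntomile
                   AND l_sl(i).tomilepost   > l_equ(j).nfrommile
                THEN
                    l_out.EXTEND;
                    l_out(l_out.LAST).speedlimitid := l_sl(i).speedlimitid;
                    -- ... apply the percentage translation shown above and
                    -- fill the remaining columns for each of your scenarios ...
                END IF;
            END LOOP;
        END LOOP;

        -- One context switch per batch instead of one per row.
        FORALL k IN 1 .. l_out.COUNT
            INSERT INTO speed_limit_processing VALUES l_out(k);
        l_out.DELETE;
    END LOOP;
    CLOSE c_sl;

    COMMIT;
END;
/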

Exclude observations that do not exist for all quarters from data source in SAS Visual Analytics

I want to create a filter that will exclude observations that don't have a value for all 3 quarters (Q1, Q2, Q3 of 2016) in my dataset. Here's an example of my dataset:
Identifier | Quarter | Value
ABC3456 | 2016Q1 | 145
ABC3456 | 2016Q2 | 159
XYZ874 | 2016Q3 | 226
ABC3456 | 2016Q3 | 311
The outcome, after the application of the filter should be:
Identifier | Quarter | Value
ABC3456 | 2016Q1 | 145
ABC3456 | 2016Q2 | 159
ABC3456 | 2016Q3 | 311
because XYZ874 does not have values for Q1 and Q2 of 2016.
I have experience with Advanced filters, but only to include or exclude values based on one variable. Searching through this site and the internet, I can't find a solution.
Please provide an answer for SAS Visual Analytics and not SAS Enterprise Guide.
Thanks in advance.
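For clarity, the logic the filter needs to express is shown below as a SQL sketch (this is not SAS Visual Analytics syntax; in VA it is typically built as an aggregated advanced filter comparing a distinct count of Quarter per Identifier to 3, and the exact expression depends on your VA release). The table name my_data is hypothetical.
select Identifier, Quarter, Value
from my_data                          -- hypothetical table name
where Identifier in (
    select Identifier
    from my_data
    group by Identifier
    having count(distinct Quarter) = 3   -- present in all of 2016Q1-2016Q3
);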

Get duplicate rows based on one column using BIRT

I have one table in a BIRT report:
| Name | Amount |
| A | 200 |
| B | 100 |
| A | 150 |
| C | 80 |
| C | 100 |
I need to summarize this table into another table: where the Name is the same, add the corresponding values.
The summarized table would be:
| A | 350 |
| B | 100 |
| C | 180 |
Here A = 200 + 150, B = 100, C = 80 + 100.
How can I summarize one table from another table present in a BIRT report?
That is quite easy. Just add another table to your report and select the same data source as the first table (on the Binding tab).
Go to the Groups tab and add a group on your 'Name' column.
You'll see the table change: it added a group header row and a group footer row. The header also contains the element on which you grouped (in this case Name).
Now right-click next to Name in the Amount column and select Insert -> Aggregation.
Select the function SUM; the expression should be Amount, and Aggregate On should be your newly created group.
Now you can see the results but it will be something like:
| A | 350 |
| A | 200 |
| A | 150 |
| B | 100 |
| B | 100 |
| C | 180 |
| C | 100 |
| C | 80 |
If you delete the detail row from the table, you'll have the result you're after.
For your information:
Have a play with this; it's good exercise. Move the new aggregation to the group footer, add a top border to that cell, put a 'total' label in front of it, and you'll have something like this:
| A | |
| A | 200 |
| A | 150 |
----------
| total | 350 |
| B | |
| B | 100 |
----------
| total | 100 |
| C | |
| C | 100 |
| C | 80 |
----------
| total | 180 |
Also, you don't have to select the data source as the binding; you can also select your first table for the bindings:
select the table, open the Binding tab, select 'Report item' and pick your first table from the dropdown.
This can create very complex situations, therefore I usually try to work from the original dataset.
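If your data source is SQL-based, an alternative worth noting is to push the aggregation into the dataset query itself instead of doing it in the report layout. A sketch (the question does not name the table, so my_table is hypothetical):
select Name, sum(Amount) as Amount   -- A = 350, B = 100, C = 180
from my_table                        -- hypothetical table name
group by Name
order by Name;
Binding the second report table to this dataset then needs no group or aggregation elements at all.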

How to utilize TABLE ACCESS BY INDEX ROWID

I have a problem with a query that uses a FULL TABLE SCAN.
When this query runs on our UAT environment it uses TABLE ACCESS BY INDEX ROWID, but in prod
it uses a FULL TABLE SCAN. UAT runs much better than prod.
We have the same table and index structures in prod and UAT.
I already tried rebuilding and recreating the indexes, but the same explain plan is used.
Table and index statistics were also updated.
Can you help me make this query use INDEX access instead of FULL table access?
Please see below the explain plan of our prod and uat.
EXPLAIN PLAN PROD
=====================================================================
SQL> explain plan for
SELECT ASV_ODC_BRANCH.CODE, ASV_ODC_BRANCH.DESCRIPTION, ASV_ODC_BRANCH.BRSTN, DEB.VIEW_DORMANT.ACCTNO AS DORMANT_ACCT,
DEB.VIEW_DORMANT.SHORTNAME AS DORMANT_NAME, DEB.VIEW_DORMANT.OPID_ENTRY, DEB.CUSTOMER.CUSTOMERNO,
DEB.CUSTOMER.TIME_STAMP_ENTRY
FROM ASV_ODC_BRANCH, DEB.VIEW_DORMANT, DEB.CUSTOMER
WHERE trim(ASV_ODC_BRANCH.CODE) = decode(SUBSTR(DEB.VIEW_DORMANT.ACCTNO, 3, 1) || SUBSTR(DEB.VIEW_DORMANT.ACCTNO, 7, 1), '29',
SUBSTR(DEB.VIEW_DORMANT.ACCTNO, 4, 3), SUBSTR(DEB.VIEW_DORMANT.ACCTNO, 3, 3)) AND
DEB.VIEW_DORMANT.ACCTNO = DEB.CUSTOMER.CUSTOMERNO AND (DEB.VIEW_DORMANT.ACCTNO = :Xacct)
ORDER BY ASV_ODC_BRANCH.CODE, DORMANT_ACCT;
Explained.
PLAN_TABLE_OUTPUT
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
| 0 | SELECT STATEMENT | | 3 | 489 | 3601 (2)|
| 1 | SORT ORDER BY | | 3 | 489 | 3601 (2)|
| 2 | HASH JOIN | | 3 | 489 | 3600 (2)|
| 3 | MERGE JOIN CARTESIAN | | 1 | 90 | 3595 (2)|
| 4 | NESTED LOOPS | | 1 | 66 | 3592 (2)|
| 5 | **TABLE ACCESS FULL** | ACCOUNT | 1 | 56 | 3590 (2)|
| 6 | TABLE ACCESS BY INDEX ROWID| EXTENSION1 | 1 | 10 | 2 (0)|
| 7 | INDEX UNIQUE SCAN | PKEXT10 | 1 | | 1 (0)|
| 8 | BUFFER SORT | | 1 | 24 | 3593 (2)|
| 9 | TABLE ACCESS BY INDEX ROWID| CUSTOMER | 1 | 24 | 3 (0)|
| 10 | INDEX RANGE SCAN | UXCUST1 | 1 | | 2 (0)|
| 11 | TABLE ACCESS FULL | ASV_ODC_BRANCH | 334 | 24382 | 5 (0)|
**EXPLAIN PLAN UAT**
======================================================================================
SQL> explain plan for
SELECT ASV_ODC_BRANCH.CODE, ASV_ODC_BRANCH.DESCRIPTION, ASV_ODC_BRANCH.BRSTN, DEB.VIEW_DORMANT.ACCTNO AS DORMANT_ACCT,
DEB.VIEW_DORMANT.SHORTNAME AS DORMANT_NAME, DEB.VIEW_DORMANT.OPID_ENTRY, DEB.CUSTOMER.CUSTOMERNO,
DEB.CUSTOMER.TIME_STAMP_ENTRY
FROM ASV_ODC_BRANCH, DEB.VIEW_DORMANT, DEB.CUSTOMER
WHERE trim(ASV_ODC_BRANCH.CODE) = decode(SUBSTR(DEB.VIEW_DORMANT.ACCTNO, 3, 1) || SUBSTR(DEB.VIEW_DORMANT.ACCTNO, 7, 1), '29',
SUBSTR(DEB.VIEW_DORMANT.ACCTNO, 4, 3), SUBSTR(DEB.VIEW_DORMANT.ACCTNO, 3, 3)) AND
DEB.VIEW_DORMANT.ACCTNO = DEB.CUSTOMER.CUSTOMERNO AND (DEB.VIEW_DORMANT.ACCTNO = :Xacct)
ORDER BY ASV_ODC_BRANCH.CODE, DORMANT_ACCT;
Explained.
SQL> /
PLAN_TABLE_OUTPUT
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
| 0 | SELECT STATEMENT | | 5 | 5930 | 19 (11)|
| 1 | SORT ORDER BY | | 5 | 5930 | 19 (11)|
| 2 | HASH JOIN | | 5 | 5930 | 18 (6)|
| 3 | MERGE JOIN CARTESIAN | | 2 | 2220 | 12 (0)|
| 4 | NESTED LOOPS | | 1 | 1085 | 9 (0)|
| 5 | **TABLE ACCESS BY INDEX ROWID**| ACCOUNT | 1 | 57 | 7 (0)|
| 6 | INDEX SKIP SCAN | UXACCT2 | 1 | | 6 (0)|
| 7 | TABLE ACCESS BY INDEX ROWID| EXTENSION1 | 1 | 1028 | 2 (0)|
| 8 | INDEX UNIQUE SCAN | PKEXT10 | 1 | | 1 (0)|
| 9 | BUFFER SORT | | 1 | 25 | 10 (0)|
| 10 | TABLE ACCESS BY INDEX ROWID| CUSTOMER | 1 | 25 | 3 (0)|
| 11 | INDEX RANGE SCAN | UXCUST1 | 1 | | 2 (0)|
| 12 | TABLE ACCESS FULL | ASV_ODC_BRANCH | 336 | 25536 | 5 (0)|
The difference is in
EXPLAIN PLAN PROD
| 5 | **TABLE ACCESS FULL** | ACCOUNT | 1 | 56 | 3590 (2)|
EXPLAIN PLAN UAT
| 5 | **TABLE ACCESS BY INDEX ROWID**| ACCOUNT | 1 | 57 | 7 (0)|
| 6 | INDEX SKIP SCAN | UXACCT2 | 1 | | 6 (0)|
How does it work?
Rather than restricting the search path using a predicate from the statement, skip scans are initiated by probing the index for distinct values of the prefix column. Each of these distinct values is then used as a starting point for a regular index search. The result is several separate searches of a single index that, when combined, eliminate the effect of the prefix column. Essentially, the index has been searched from the second level down.
The optimizer uses statistics to decide if a skip scan would be more efficient than a full table scan.
The optimizer considers this an advantage over an FTS because:
It reduces the number of indexes needed to support a range of queries. This increases performance by reducing index maintenance and decreases wasted space associated with multiple indexes.
The prefix column should be the most discriminating and the most widely used in queries. These two conditions do not always go hand in hand, which makes the decision difficult. In these situations, skip scanning reduces the impact of making the "wrong" decision.
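For illustration, a hypothetical case where a skip scan pays off (demo names; the real UXACCT2 definition is not shown in the question):
-- Composite index whose leading column has very few distinct values:
create index demo_ix on emp_demo (gender, empno);

-- No predicate on the leading column: instead of a full table scan, the
-- optimizer can probe the index once per distinct GENDER value
-- (INDEX SKIP SCAN) and run a regular search within each subtree.
select * from emp_demo where empno = 1234;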
You can consider the following
Check the optimizer mode across the environments.
Gather stats on all the tables used in the query. For example, if a table has not been analyzed since it was created, and if it has fewer than DB_FILE_MULTIBLOCK_READ_COUNT blocks under the high water mark, then the optimizer thinks that the table is small and uses a full table scan. Review the LAST_ANALYZED and BLOCKS columns in the ALL_TABLES view to examine the statistics; a sketch of both checks follows below.
Though your environments are similar and the code is the same, the optimizer checks on the fly and chooses the best available method. So do your UAT with the same data setup. Since it is UAT (almost a preproduction environment in most companies), it should be the closest to production in terms of size.
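A sketch of both checks (DBMS_STATS is the supported API for gathering statistics; the assumption that ACCOUNT lives in the DEB schema, like VIEW_DORMANT, is not confirmed by the question):
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
        ownname => 'DEB',      -- assumed schema
        tabname => 'ACCOUNT',
        cascade => TRUE);      -- gather index statistics as well
END;
/

-- Compare when the optimizer last saw each table and how big it thinks it is:
SELECT table_name, last_analyzed, blocks, num_rows
FROM   all_tables
WHERE  table_name IN ('ACCOUNT', 'CUSTOMER', 'ASV_ODC_BRANCH');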
From what I can see, table DEB.VIEW_DORMANT is a view on tables ACCOUNT and EXTENSION1, and you'd like to use index UXACCT2 on the former. I guess a hint inside this query should allow you to do what you want, something like:
SELECT /*+ INDEX(D UXACCT2) */ ASV_ODC_BRANCH.CODE,
...
FROM ASV_ODC_BRANCH, DEB.VIEW_DORMANT D, DEB.CUSTOMER
...
PS: if this is a query you manage (not generated by some high-level software), I suggest you use aliases for your tables as I did; that makes queries so much more readable...
Can I help you make it an INDEX access instead of an FTS? Probably...
Do you really want that? Probably NOT.
Why is the database behaving differently? Because you operate on different data. As your sample shows, the query returns a different number of rows, so I don't even have to ask whether your production and test data are the same. If you have different data (different amounts, or different values in indexed columns), your database stats will be different, your indexes will look different, and so the optimizer will come to a different query plan!
What should you do? Make sure all your indexes are up to date, your partitions are sanely arranged, all your database stats are up to date, and no strange tuning settings are in place (query plans, environment settings...). Then the optimizer will in most cases find the best plan, and in many cases a full table scan is the FASTER alternative.
But if you measure the times and the optimizer clearly takes the wrong path although it has accurate table stats, you should file a bug report with Oracle.
If you have other reasons for wanting the optimizer to do indexed access:
If you cannot add an optimizer hint like Emmanuel suggested, you can try SQL Profiles or Baselines, which offer nice tuning possibilities. You can write your own statement with different WHERE predicates until you get a plan with index access, use this as a SQL Profile, and link that profile to the original statement, which will then use the same query plan.
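A sketch of the baseline route on 11g (DBMS_SPM is the documented package; &good_sql_id and &good_plan_hash are placeholders you would look up in V$SQL after running the hinted variant):
DECLARE
    l_loaded PLS_INTEGER;
BEGIN
    -- Capture the good (index-access) plan from the cursor cache as an
    -- accepted SQL plan baseline. To attach a hinted statement's plan to
    -- the *original* statement, the overload that also takes the original's
    -- sql_handle is the one to use.
    l_loaded := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
                    sql_id          => '&good_sql_id',
                    plan_hash_value => &good_plan_hash);
    DBMS_OUTPUT.PUT_LINE('plans loaded: ' || l_loaded);
END;
/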
