spool space issue due to inequality condition - performance

I am facing a spool space issue with one of my queries. Below is the query:
SEL * from (
SEL A.ICCUSNO,
B.ACACCNO,
D.PARTY_ID,
E.PARTY_IDENTIFICATION_NUM,
E.PARTY_IDENTIFICATION_TYPE_CD,
A.ICIDTY,
A.ICIDNO AS ICIDNO,
A.ICEXPD,
ROW_NUMBER() OVER(PARTITION BY ICCUSNO ORDER BY ICEXPD DESC ) AS ICEFFD2
FROM GE_SANDBX.GE_CMCUST A
INNER JOIN GE_SANDBX.GE_CMACCT B
ON A.ICCUSNO=B.ACCUSNO
INNER JOIN GE_VEW.ACCT C
ON B.ACACCNO=C.ACCT_NUM
AND C.DATA_SOURCE_TYPE_CD='ILL'
INNER JOIN GE_VEW.PARTY_ACCT_HIST D
ON C.ACCT_ID=D.ACCT_ID
LEFT OUTER JOIN GE_VEW.GE_PI E
ON D.PARTY_ID=E.PARTY_ID
AND A.ICIDTY=E.PARTY_IDENTIFICATION_TYPE_CD
AND E.DSTC NOT IN( 'SCRM', 'BCRM')
--WHERE B.ACACCNO='0657007129'
--WHERE A.ICIDNO<>E.PARTY_IDENTIFICATION_NUM
QUALIFY ICEFFD2=1) T
where t.PARTY_IDENTIFICATION_NUM<>t.ICIDNO;
I am trying to pick one record based on the expiry date ICEXPD. My inner query gives me one record per customer number ICCUSNO, as below:
ICCUSNO ACACCNO PARTY_ID Party_Identification_Num Party_Identification_Type_Cd ICIDNO ICEXPD ICEFFD2
100000013 500010207 5,862,640 1-0121-2073-7 S 1-0212-2073-4 9/20/2007 1
But I have to update the table only when PARTY_IDENTIFICATION_NUM does not match ICIDNO.
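One restructuring I have been considering (only a sketch, and I am not sure it actually reduces spool) is to materialize the inner QUALIFY result into a volatile table and apply the inequality check in a second step, so the final filter runs against a much smaller spool:
CREATE VOLATILE TABLE VT_LATEST_ID AS (
SEL A.ICCUSNO,
B.ACACCNO,
D.PARTY_ID,
E.PARTY_IDENTIFICATION_NUM,
E.PARTY_IDENTIFICATION_TYPE_CD,
A.ICIDTY,
A.ICIDNO AS ICIDNO,
A.ICEXPD
FROM GE_SANDBX.GE_CMCUST A
INNER JOIN GE_SANDBX.GE_CMACCT B
ON A.ICCUSNO=B.ACCUSNO
INNER JOIN GE_VEW.ACCT C
ON B.ACACCNO=C.ACCT_NUM
AND C.DATA_SOURCE_TYPE_CD='ILL'
INNER JOIN GE_VEW.PARTY_ACCT_HIST D
ON C.ACCT_ID=D.ACCT_ID
LEFT OUTER JOIN GE_VEW.GE_PI E
ON D.PARTY_ID=E.PARTY_ID
AND A.ICIDTY=E.PARTY_IDENTIFICATION_TYPE_CD
AND E.DSTC NOT IN ('SCRM', 'BCRM')
QUALIFY ROW_NUMBER() OVER(PARTITION BY ICCUSNO ORDER BY ICEXPD DESC) = 1
) WITH DATA PRIMARY INDEX (ICCUSNO) ON COMMIT PRESERVE ROWS;
SEL * FROM VT_LATEST_ID
WHERE PARTY_IDENTIFICATION_NUM <> ICIDNO;
The table name VT_LATEST_ID is just a placeholder; the joins are the same as in the query above.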
Below is the explain plan:
1) First, we lock GE_SANDBX.A for access, we lock
GE_SANDBX.B for access, we lock DP_TAB.PARTY_ACCT_HIST for
access, we lock DP_TAB.GE_PI for access, and we
lock DP_TAB.ACCT for access.
2) Next, we do an all-AMPs RETRIEVE step from DP_TAB.ACCT by way of
an all-rows scan with a condition of (
"DP_TAB.ACCT.DATA_SOURCE_TYPE_CD = 'ILL '") into Spool 3
(all_amps), which is built locally on the AMPs. The size of Spool
3 is estimated with no confidence to be 9,834,342 rows (
344,201,970 bytes). The estimated time for this step is 2.18
seconds.
3) We do an all-AMPs JOIN step from Spool 3 (Last Use) by way of a
RowHash match scan, which is joined to DP_TAB.PARTY_ACCT_HIST by
way of a RowHash match scan with no residual conditions. Spool 3
and DP_TAB.PARTY_ACCT_HIST are joined using a merge join, with a
join condition of ("Acct_Id = DP_TAB.PARTY_ACCT_HIST.ACCT_ID").
The result goes into Spool 4 (all_amps), which is redistributed by
the hash code of (DP_TAB.ACCT.Acct_Num) to all AMPs. Then we do a
SORT to order Spool 4 by row hash. The size of Spool 4 is
estimated with no confidence to be 13,915,265 rows (487,034,275
bytes). The estimated time for this step is 0.98 seconds.
4) We execute the following steps in parallel.
1) We do an all-AMPs JOIN step from GE_SANDBX.B by way of a
RowHash match scan with no residual conditions, which is
joined to Spool 4 (Last Use) by way of a RowHash match scan.
GE_SANDBX.B and Spool 4 are joined using a merge join,
with a join condition of ("GE_SANDBX.B.ACACCNO =
Acct_Num"). The result goes into Spool 5 (all_amps) fanned
out into 18 hash join partitions, which is redistributed by
the hash code of (GE_SANDBX.B.ACCUSNO) to all AMPs. The
size of Spool 5 is estimated with no confidence to be
13,915,265 rows (2,657,815,615 bytes). The estimated time
for this step is 1.33 seconds.
2) We do an all-AMPs RETRIEVE step from GE_SANDBX.A by way
of an all-rows scan with no residual conditions into Spool 6
(all_amps) fanned out into 18 hash join partitions, which is
redistributed by the hash code of (GE_SANDBX.A.ICCUSNO)
to all AMPs. The size of Spool 6 is estimated with high
confidence to be 12,169,929 rows (5,427,788,334 bytes). The
estimated time for this step is 52.24 seconds.
3) We do an all-AMPs RETRIEVE step from
DP_TAB.GE_PI by way of an all-rows scan with a
condition of (
"(DP_TAB.GE_PI.DSTC <> 'BSCRM')
AND (DP_TAB.GE_PI.DATA_SOURCE_TYPE_CD <>
'SCRM')") into Spool 7 (all_amps), which is built locally on
the AMPs. The size of Spool 7 is estimated with low
confidence to be 161,829 rows (19,419,480 bytes). The
estimated time for this step is 1.97 seconds.
5) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of an
all-rows scan, which is joined to Spool 6 (Last Use) by way of an
all-rows scan. Spool 5 and Spool 6 are joined using a hash join
of 18 partitions, with a join condition of ("ICCUSNO = ACCUSNO").
The result goes into Spool 8 (all_amps), which is redistributed by
the hash code of (DP_TAB.PARTY_ACCT_HIST.PARTY_ID,
TRANSLATE((GE_SANDBX.A.ICIDTY )USING
LATIN_TO_UNICODE)(VARCHAR(255), CHARACTER SET UNICODE, NOT
CASESPECIFIC)) to all AMPs. The size of Spool 8 is estimated with
no confidence to be 15,972,616 rows (8,593,267,408 bytes). The
estimated time for this step is 4.37 seconds.
6) We do an all-AMPs JOIN step from Spool 7 (Last Use) by way of an
all-rows scan, which is joined to Spool 8 (Last Use) by way of an
all-rows scan. Spool 7 and Spool 8 are right outer joined using a
single partition hash join, with condition(s) used for
non-matching on right table ("NOT (ICIDTY IS NULL)"), with a join
condition of ("(PARTY_ID = Party_Id) AND ((TRANSLATE((ICIDTY
)USING LATIN_TO_UNICODE))= Party_Identification_Type_Cd)"). The
result goes into Spool 2 (all_amps), which is built locally on the
AMPs. The size of Spool 2 is estimated with no confidence to be
16,053,773 rows (10,306,522,266 bytes). The estimated time for
this step is 2.11 seconds.
7) We do an all-AMPs STAT FUNCTION step from Spool 2 (Last Use) by
way of an all-rows scan into Spool 13 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 11 (all_amps), which is built locally on the AMPs. The
size is estimated with no confidence to be 16,053,773 rows (
18,558,161,588 bytes).
8) We do an all-AMPs RETRIEVE step from Spool 11 (Last Use) by way of
an all-rows scan with a condition of ("(Party_Identification_Num
<> ICIDNO) AND (Field_10 = 1)") into Spool 16 (group_amps), which
is built locally on the AMPs. The size of Spool 16 is estimated
with no confidence to be 10,488,598 rows (10,404,689,216 bytes).
The estimated time for this step is 2.20 seconds.
9) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 16 are sent back to the user as the result
of statement 1.
All the required stats are collected.
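For context, this is roughly how I check the spool allocation and skew for the user running the query (a sketch against the standard DBC.DiskSpaceV view; adjust to whatever your site exposes):
SEL DatabaseName,
SUM(MaxSpool)/1024**3 AS MaxSpoolGB,
SUM(CurrentSpool)/1024**3 AS CurrentSpoolGB,
MAX(CurrentSpool)/NULLIFZERO(AVG(CurrentSpool)) AS SpoolSkew
FROM DBC.DiskSpaceV
WHERE DatabaseName = USER
GROUP BY 1;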
Thanks for your help.

Related

How to Improve Cross Join Performance in Hive TEZ?

I have a Hive table with 5 billion records. I want each of these 5 billion records to be joined with 52 hardcoded records.
To achieve this I am doing a cross join like:
select *
from table1 join table2
ON 1 = 1;
This is taking 5 hours to run with the highest possible memory parameters.
Is there any shorter or easier way to achieve this in less time?
Turn on map-join:
set hive.auto.convert.join=true;
select *
from table1 cross join table2;
The second table is small (52 records) and should fit into memory. The map-join operator loads the small table into the distributed cache, and each container then processes the join in memory, which is much faster than a common join.
Your query is slow because a cross join (Cartesian product) is processed by a single reducer. The cure is to enforce higher parallelism. One way is to turn the query into an inner join, so as to utilize map-side join optimization.
with t1 as (
select col1, col2, ..., 0 as k from table1
)
, t2 as (
select col3, col4, ..., 0 as k from table2
)
select
*
from t1 join t2
on t1.k = t2.k
Now each table (CTE) has a fake column k with the identical value 0, so the query behaves just like a cross join while only a map-side join operation takes place.
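If automatic conversion does not kick in on your cluster, the MAPJOIN hint is another option. This is only a sketch; note that hive.ignore.mapjoin.hint must be set to false for the hint to be honoured:
set hive.ignore.mapjoin.hint=false;
select /*+ MAPJOIN(t2) */ *
from table1 t1
cross join table2 t2;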

simple random sampling while pulling data from warehouse(oracle engine) using proc sql in sas

I need to pull a humongous amount of data, say 600-700 variables, from different tables in a data warehouse. The dataset in its raw form will easily touch 150 GB (79 MM rows), and for my analysis I need only a million rows. How can I pull the data using proc sql directly from the warehouse while doing simple random sampling on the rows?
The code below won't work, as ranuni is not supported by Oracle:
proc sql outobs=1000000;
select * from connection to oracle (
select * from tbl1 order by ranuni(12345)
);
quit;
How do you propose I do it?
Use the DBMS_RANDOM Package to Sort Records and Then Use A Row Limiting Clause to Restrict to the Desired Sample Size
The dbms_random.value function generates a random number between 0 and 1 for each row in the table, and we sort in ascending order of that random value.
Here is how to produce the sample set you identified:
SELECT
*
FROM
(
SELECT
*
FROM
tbl1
ORDER BY dbms_random.value
)
FETCH FIRST 1000000 ROWS ONLY;
To demonstrate with the sample schema table, emp, we sample 4 records:
SCOTT@DEV> SELECT
2 empno,
3 rnd_val
4 FROM
5 (
6 SELECT
7 empno,
8 dbms_random.value rnd_val
9 FROM
10 emp
11 ORDER BY rnd_val
12 )
13 FETCH FIRST 4 ROWS ONLY;
EMPNO RND_VAL
7698 0.06857749035643605682648168347885993709
7934 0.07529612360785920635181751566833986766
7902 0.13618520865865754766175030040204331697
7654 0.14056380246495282237607922497308953768
SCOTT@DEV> SELECT
2 empno,
3 rnd_val
4 FROM
5 (
6 SELECT
7 empno,
8 dbms_random.value rnd_val
9 FROM
10 emp
11 ORDER BY rnd_val
12 )
13 FETCH FIRST 4 ROWS ONLY;
EMPNO RND_VAL
7839 0.00430658806761508024693197916281775492
7499 0.02188116061148367312927392115186317884
7782 0.10606515700372416131060633064729870016
7788 0.27865276349549877512032787966777990909
With the example above, notice that the empno values returned change significantly between executions of the same SQL*Plus statement.
The performance might be an issue with the row counts you are describing.
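A different option worth mentioning (my own addition, not part of the DBMS_RANDOM approach above) is Oracle's SAMPLE clause, which avoids sorting the whole table at the cost of returning an approximate rather than exact number of rows:
-- roughly 1.3% of ~79 MM rows is about 1 MM rows; the row count is approximate
SELECT *
FROM tbl1 SAMPLE (1.3);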
EDIT:
With table sizes on the order of 150 GB / 79 MM rows, any sorting would be painful.
If the table had a surrogate key based on a sequence incremented by 1, we could take the approach of selecting every nth record based on the key.
e.g.
-- scenario: n = 3000
SELECT *
FROM tbl1
WHERE mod(table_id, 3000) = 0;
This approach would not use an index (unless a function based index is created), but at least we are not performing a sort on a data set of this size.
I performed an explain plan with a table that has close to 80 million records and it does perform a full table scan (the condition forces this without a function based index) but this looks tenable.
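For completeness, such a function-based index might look roughly like this (a sketch; the indexed expression must match the one in the WHERE clause for the index to be considered):
CREATE INDEX ix_tbl1_mod3000 ON tbl1 (MOD(table_id, 3000));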
None of the answers or comments posted helped my cause; they might have, but we have 87 MM rows.
In the end I got what I wanted with the help of SAS. Here is what I did, and it works. Thanks all!
libname dwh oracle path=<path> user=<username> password=<pwd>;
proc sql;
create table sample as
select
<all the variables>, ranuni(<any arbitrary seed>) as rnd
from dwh.<all the tables>
<bunch of where conditions goes here>;
quit;
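As a follow-up sketch (assuming SAS/STAT is licensed), the extracted table could then be cut down to exactly one million rows with PROC SURVEYSELECT instead of filtering on the ranuni column by hand:
proc surveyselect data=sample out=final_sample
method=srs sampsize=1000000 seed=12345;
run;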

Split amount into multiple rows if amount>=$10M or <=$-10B

I have a table in an Oracle database which may contain amounts >= $10M or <= $-10B.
If the value is greater than or equal to $10M, I need to break it into one or more 99999999.99 chunks and also include the remainder.
If the value is less than or equal to $-10B, I need to break it into one or more 999999999.99 chunks and also include the remainder.
Your question is somewhat hard to read and you did not provide examples, but here is something for a start which may help you or someone with a similar problem.
Let's say you have this data and you want to divide the amounts into chunks no greater than 999:
id amount
-- ------
1 1500
2 800
3 2500
This query:
select id, amount,
case when level=floor(amount/999)+1 then mod(amount, 999) else 999 end chunk
from data
connect by level<=floor(amount/999)+1
and prior id = id and prior dbms_random.value is not null
...divides the amounts; the last row per id contains the remainder. The output is:
ID AMOUNT CHUNK
------ ---------- ----------
1 1500 999
1 1500 501
2 800 800
3 2500 999
3 2500 999
3 2500 502
SQLFiddle demo
Edit: full query according to additional explanations:
select id, amount,
case
when amount>=0 and level=floor(amount/9999999.99)+1 then mod(amount, 9999999.99)
when amount>=0 then 9999999.99
when level=floor(-amount/999999999.99)+1 then -mod(-amount, 999999999.99)
else -999999999.99
end chunk
from data
connect by ((amount>=0 and level<=floor(amount/9999999.99)+1)
or (amount<0 and level<=floor(-amount/999999999.99)+1))
and prior id = id and prior dbms_random.value is not null
SQLFiddle
Please adjust numbers for positive and negative borders (9999999.99 and 999999999.99) according to your needs.
There are other possible solutions (a recursive CTE query, a PL/SQL procedure, maybe others); this hierarchical query is one of them.
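For reference, here is a minimal sketch of the recursive CTE alternative, using the simplified 999 chunk size and the same data table as above (positive amounts only):
with chunks (id, amount, remaining) as (
select id, amount, amount from data
union all
select id, amount, remaining - 999 from chunks where remaining > 999
)
select id, amount,
case when remaining > 999 then 999 else remaining end chunk
from chunks
order by id, remaining desc;
Each recursion step peels off one 999 chunk and the final level carries the remainder.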

SQL stored procedure takes more time to execute as the record range increases - is there any way to optimize it?

I have 600,000 records and I want to fetch 10 of them, as I only display 10 records in the grid. My stored procedure works fine when I fetch records in the range 1-10,000, e.g. rows 500-510, but after that the execution time increases as the row number increases; e.g. if I fetch records between 100,000 and 100,010 it takes much longer to execute.
Can anyone please help me? I have used ROW_NUMBER() to get the row number and used BETWEEN to retrieve the data.
Please suggest an optimized way to get the records.
The stored procedure creates a SQL query as given below:
SELECT FuelClaimId from
( SELECT fc.FuelClaimId,ROW_NUMBER() OVER ( order by fc.FuelClaimId ) AS RowNum
from FuelClaims fc
INNER JOIN Vehicles v on fc.VehicleId =v.VehicleId
INNER JOIN Drivers d on d.DriverId =v.OfficialID
INNER JOIN Departments de on de.DepartmentId =d.DepartmentId
INNER JOIN Provinces p on de.ProvinceId =p.ProvinceId
INNER JOIN FuelRates f on f.FuelRateId =fc.FuelRateId
INNER JOIN FuelClaimStatuses fs on fs.FuelClaimStatusId= fc.statusid
INNER JOIN LogsheetMonths l on l.LogsheetMonthId =f.LogsheetMonthId
Where fc.IsDeleted = 0) AS MyDerivedTable WHERE MyDerivedTable.RowNum BETWEEN
600000 And 600010
Try this instead:
SELECT TOP 10 fc.FuelClaimId
FROM FuelClaims fc
INNER JOIN Vehicles v ON fc.VehicleId = v.VehicleId
INNER JOIN Drivers d ON d.DriverId = v.OfficialID
INNER JOIN Departments de ON de.DepartmentId = d.DepartmentId
INNER JOIN Provinces p ON de.ProvinceId = p.ProvinceId
INNER JOIN FuelRates f ON f.FuelRateId = fc.FuelRateId
INNER JOIN FuelClaimStatuses fs ON fs.FuelClaimStatusId = fc.statusid
INNER JOIN LogsheetMonths l ON l.LogsheetMonthId = f.LogsheetMonthId
WHERE fc.IsDeleted = 0 AND fc.FuelClaimId BETWEEN 600001 AND 600010
ORDER BY fc.FuelClaimId
Also, BETWEEN is inclusive, so BETWEEN 10 AND 20 actually returns 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 and 20 - that is 11 rows, not 10. As identity values usually start at 1, you really want BETWEEN 11 AND 20 (hence 600001 in the above).
The query above should fix the issue where performance degrades as you page into the larger ranges.
While it won't always return exactly 10 records, the fix for that is:
WHERE fc.IsDeleted = 0 AND fc.FuelClaimId > @LastMaxFuelClaimId
where @LastMaxFuelClaimId is the maximum FuelClaimId returned by the previous query execution.
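A sketch of how the procedure could accept that value as a parameter (the joins are omitted for brevity and the procedure name is made up):
CREATE PROCEDURE dbo.GetNextFuelClaims
@LastMaxFuelClaimId INT
AS
BEGIN
SELECT TOP 10 fc.FuelClaimId
FROM FuelClaims fc
WHERE fc.IsDeleted = 0
AND fc.FuelClaimId > @LastMaxFuelClaimId
ORDER BY fc.FuelClaimId;
END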
Edit: The reason it keeps getting slower is that it has to read more and more of the table to reach the next chunk. It doesn't skip the first 600,000 records; it reads them all and then returns only the next 10, so each time you query it reads all the previous records over again. The query above does not suffer from that problem.
You should post an execution plan but a probable cause of performance problems would be inadequate or lack of indexing.
Make sure you have
an index on all your foreign key relations
a covering index on the fields you retrieve and select from
Covering Index
CREATE INDEX IX_FUELCLAIMS_FUELCLAIMID_ISDELETED
ON dbo.FuelClaims (FuelClaimId, VehicleID, IsDeleted)
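For example, one such foreign-key index might look like this (a sketch only; create the ones that match your actual join columns):
CREATE INDEX IX_FUELCLAIMS_VEHICLEID
ON dbo.FuelClaims (VehicleId)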

Oracle disk space used by a table

I have a table in an Oracle db that gets a couple of million new rows every month. Each row has a column which states the date when it was created.
I'd like to run a query that gets the disk space growth over the last 6 months. In other words, the result would be a table with two columns where each row would have the month's name and disk space used during that month.
Thanks,
This article reports a method of getting the table growth: http://www.dba-oracle.com/t_table_growth_reports.htm
column "Percent of Total Disk Usage" justify right format 999.99
column "Space Used (MB)" justify right format 9,999,999.99
column "Total Object Size (MB)" justify right format 9,999,999.99
set linesize 150
set pages 80
set feedback off
select * from (select to_char(end_interval_time, 'MM/YY') mydate,
sum(space_used_delta) / 1024 / 1024 "Space Used (MB)",
avg(c.bytes) / 1024 / 1024 "Total Object Size (MB)",
round(sum(space_used_delta) / sum(c.bytes) * 100, 2) "Percent of Total Disk Usage"
from
dba_hist_snapshot sn,
dba_hist_seg_stat a,
dba_objects b,
dba_segments c
where begin_interval_time > trunc(sysdate) - &days_back
and sn.snap_id = a.snap_id
and b.object_id = a.obj#
and b.owner = c.owner
and b.object_name = c.segment_name
and c.segment_name = '&segment_name'
group by to_char(end_interval_time, 'MM/YY'))
order by to_date(mydate, 'MM/YY');
DBA_TABLES (or the equivalent) gives an AVG_ROW_LEN, so you could simply multiply that by the number of rows created per month.
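A minimal sketch of that calculation, using a hypothetical table MY_TABLE with a creation-date column CREATED_DT (swap in your own names; AVG_ROW_LEN comes from optimizer statistics, so they need to be reasonably current):
SELECT TO_CHAR(t.created_dt, 'MM/YYYY') AS month,
COUNT(*) * x.avg_row_len / 1024 / 1024 AS est_mb
FROM my_table t
CROSS JOIN (SELECT avg_row_len
FROM dba_tables
WHERE owner = 'MY_SCHEMA'
AND table_name = 'MY_TABLE') x
WHERE t.created_dt >= ADD_MONTHS(TRUNC(SYSDATE, 'MM'), -6)
GROUP BY TO_CHAR(t.created_dt, 'MM/YYYY'), x.avg_row_len
ORDER BY MIN(t.created_dt);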
The caveat to that is it assumes the row length of new rows is similar to that of existing rows. If you've got a bunch of historical rows that were 'small' (e.g. 50 bytes) but new rows are larger (150 bytes), then the estimates will be too low.
Also, how do updates figure into things ? If a row starts at 50 bytes and grows to 150 two months later, how do you account for those 100 bytes ?
Finally, tables don't grow for each row insert. Every so often the allocated space will fill up and it will go and allocate another chunk. Depending on the table settings, that next chunk may be, for example, 50% of the existing table size. So you might not physically grow for three months and then have a massive jump, then not grow for another six months.
