Best way to process 1400 million records in SQL

Best way to process 1400 million records in SQL - performance

I humbly request you to help me with the below scenario, In my project I have to process time series data from a table. We are using Azure SQL Server.
The table dbo.batch_events has 1400 million rows, please see the below screenshot for the table structure and sample data:
Source table structure and sample data
I have to pivot the equipment name from the equipment_name column in the dbo.batch_events table into column names and load the pivoted value into dbo.time_series table .
If I pivot the equipment name column, I am getting 685 columns to be created in the dbo.time_series table.
Please see the below screenshot showing the target table (dbo.time_series) structure and expected output for the sample data shown in the source table.
Target table structure and sample data
Please kindly advise what is the best approach and method to write the query in SQL.
The query I wrote is taking 25 hours to process the 1400 million records and load them into the target table.
I have created day wise partition on time_stamp column for the source table (dbo.batch_events) and created two nonclustered indexes - one on equipment name and the other on time stamp column.
I humbly request you to advise me on the best approach to write the query for this scenario.
The stored procedure I have created process one month of data at a time; for one month, we have around 120 million rows to deal with.
Used While loop to put entry for start_date and end_date for each month into dbo.Iteration_ctrl table, by getting the minimum and maximum date from the dbo.Batch_events table. So, I have 12 entries in this table, each entry representing one month.
Used while loop to iterate through the 12 start_date and end_date entries in the dbo.Iteration_ctrl table, and used pivot query inside the while loop to load the data into dbo.Time_series table.
Please see the stored procedure I have written, which is taking 25 hours (which seems to me inefficient). Any help would be appreciated please.
DECLARE #MIN_TIME DATETIME, #MIN_TIMESTAMP DATETIME;
DECLARE #MAX_TIME DATETIME, #MAX_TIMESTAMP DATETIME;
DECLARE #DATE DATETIME, #ROWCOUNT INT, #TOTALCOUNT INT;
SELECT #MIN_TIME = MIN(Time_stamp)
FROM [dbo].[BATCH_EVENTS]
SELECT #MAX_TIME = MAX(Time_stamp)
FROM [dbo].[BATCH_EVENTS]
PRINT 'INSERT INTO TABLE [dbo].[ITERATION_CTRL] HAS STARTED ' + CAST(GETDATE() AS nvarchar(30))
WHILE #MIN_TIME < #MAX_TIME
BEGIN
SELECT #DATE = DATEADD(MM, 01, #MIN_TIME)
SELECT #DATE = CASE WHEN #DATE > #MAX_TIME THEN #MAX_TIME ELSE #DATE END
INSERT INTO dbo.ITERATION_CTRL
SELECT #MIN_TIME, #DATE
PRINT 'INSERTION INTO TABLE [dbo].[ITERATION_CTRL] HAS ENDED FOR'+ CAST(#MIN_TIME AS nvarchar(30)) + ' -' + CAST( #DATE AS nvarchar(30)) +' NO OF ROWS INSERTED :'+ CAST( ##ROWCOUNT AS nvarchar(30)) +' ' + CAST(GETDATE() AS nvarchar(30))
SELECT #MIN_TIME = DATEADD(SS, 01, #DATE)
END
PRINT 'INSERT INTO TABLE [dbo].[Time_series_data] HAS STARTED ' + CAST(GETDATE() AS nvarchar(30))
SELECT #TOTALCOUNT = COUNT(*) FROM dbo.ITERATION_CTRL
SELECT #ROWCOUNT = 1
WHILE #ROWCOUNT <= #TOTALCOUNT
BEGIN
SELECT
#MIN_TIMESTAMP = MIN_DATE,
#MAX_TIMESTAMP = MAX_DATE
FROM dbo.ITERATION_CTRL
WHERE ID = #ROWCOUNT
BEGIN TRANSACTION
INSERT INTO dbo.Time_series_data
SELECT *
FROM
(SELECT
[Event_name], [Time_Stamp],
[Start_time], [End_time], [Duration],
[Value] AS [Sensor_Value],
Equipment_name
FROM
[dbo].[BATCH_EVENTS] BE
WHERE
Time_stamp >= [Start_time] AND Time_stamp <= [End_time]
AND Time_stamp BETWEEN #MIN_TIMESTAMP AND #MAX_TIMESTAMP) t
PIVOT
(MAX([Sensor_Value])
FOR Equipment_Name IN ([MY1102], [MY1138], [MY1180],
[MY1164], [MY1176], [MY204],
[MY324], [MY64B6])
ORDER BY
[Time_Stamp], [Event_name]
COMMIT TRANSACTION
SELECT #ROWCOUNT = #ROWCOUNT + 1
--PRINT #MIN_TIMESTAMP, #MAX_TIMESTAMP
PRINT 'INSERTION INTO TABLE [Time_series_data] HAS ENDED NO OF ROWS INSERTED :'+ CAST( ##ROWCOUNT AS nvarchar(30)) +' For duration ' + CAST( #MIN_TIMESTAMP AS nvarchar(30))+ ' '+ CAST( #MAX_TIMESTAMP AS nvarchar(30))+' Time '+ CAST(GETDATE() AS nvarchar(30));
END
END

Related

Transferring data to a test table

There is a table contact_history with 1.244.000.000 number of data (from 04.03.22-05.06.2022) and with fields contact_dt and contact_dttm. I tried to transfer all the data to test using contact_dt with script:
**DECLARE
dat date;
begin
dat:= TO_DATE('04.03.2022', 'dd.mm.yyyy');
while dat<= TO_DATE('05.06.2022', 'dd.mm.yyyy') loop
INSERT /*+ append enable_parallel_dml parallel(16)*/
INTO CONTACT_HISTORY_TEST ct
SELECT -- + parallel(16)
ch.sas_contact_id,
ch.contact_source,
ch.client_id,
ch.contact_dttm,
ch.contact_dt,
ch.sas_contact_error_desc,
ch.sas_contact_status
FROM CONTACT_HISTORY ch
WHERE ch.contact_dt = dat;
commit;
dat:= dat+1;
end loop;
end;**
There is such a problem that when SELECT COUNT(*) FROM CONTACT_HISTORY_TEST shows only 1.200.000.000 data in the test table, when in general table 1.244.000.000.
And there is such a moment that when checking
SELECT COUNT(*)
FROM CONTACT_HISTORY
WHERE CONTACT_DT>= TO_DATE('04.03.2021', 'dd.mm.yyyy')
AND CONTACT_DT<= TO_DATE('05.06.2022', 'dd.mm.yyyy');
SELECT COUNT(*)
FROM CONTACT_HISTORY_TEST
WHERE CONTACT_DT>= TO_DATE('04.03.2021', 'dd.mm.yyyy')
AND CONTACT_DT<= TO_DATE('05.06.2022', 'dd.mm.yyyy')
In both tables, there are 1.200.000.000 data, please tell me where the remaining 44 million data have gone and how can I completely transfer the data from the table or how to do it right?

I presume that contact_dt column contains date values that have time component; for example, it isn't just 04.03.2021, but 04.03.2021 13:23:45.
Code you posted handles "start" of the period correctly as 04.03.2021 actually represents 04.03.2021 00:00:00.
However, the last day of that period isn't handled correctly - you're missing (almost) the whole last day because you copied only rows whose contact_dt is equal to 05.06.2022 00:00:00. What about eg. 05.06.2022 08:32:13?
Therefore, modify something. If contact_dt column is indexed, you shouldn't truncate it, so the simplest option is to change this
while dat <= TO_DATE('05.06.2022', 'dd.mm.yyyy') loop
to
while dat < TO_DATE('06.06.2022', 'dd.mm.yyyy') loop
As #APC commented, where clause should then also be fixed to
where ch.contact_dt >= dat and ch.contact_dt < dat + 1
To verify number of rows and date values, run the following code in both schemas and then post the result (edit the question, not as a comment):
alter session set nls_date_format = 'dd.mm.yyyy hh24:mi:ss';
select min(contact_dt) min_dat, max(contact_dt) max_dat, count(*) cnt
from contact_history;

same query in procedure take different time

i have a same code in procedure with only 1 different (date variable).
in my code i have 27 p_dt variable. Code below is not full.
when i run procedure with p_dt it cost more than 10hours but when i write to_date('01.01.2020','dd.mm.yyyy') instead p_dt it cost 300 sec
create or replace ru.t_maha(p_dt in date default trunc(sysdate) -1) as
begin
delete from t_maha_1
where dt = to_date(add_months(trunc(p_dt,'MONTH'),-1),'dd.mm.yyyy');
commit;
insert into t_maha_1
with scheta_Snt as (
select
inn,
add_months(trunc(p_dt,'MONTH'),-1) || last_Day(add_months(trunc(p_dt,'MONTH'),-1)) interval,
sum(case when t.dt_open between add_months(trunc(p_dt,'MONTH'),-1) and last_Day(add_months(trunc(p_dt,'MONTH'),-1)) then value_nat end ) scheta_snt
from fct_Carry t
)
select * from scheta_snt
join scheta_pop (same subquery but for another calculate)
join dep_snt (same subquery but for another calculate)
join dep_pop (same subquery but for another calculate)

Fastest way of doing field comparisons in the same table with large amounts of data in oracle

I am recieving information from a csv file from one department to compare with the same inforation in a different department to check for discrepencies (About 3/4 of a million rows of data with 44 columns in each row). After I have the data in a table, I have a program that will take the data and send reports based on a HQ. I feel like the way I am going about this is not the most efficient. I am using oracle for this comparison.
Here is what I have:
I have a vb.net program that parses the data and inserts it into an extract table
I run a procedure to do a full outer join on the two tables into a new table with the fields in one department prefixed with '_c'
I run another procedure to compare the old/new data and update 2 different tables with detail and summary information. Here is code from inside the procedure:
DECLARE
CURSOR Cur_Comp IS SELECT * FROM T.AEC_CIS_COMP;
BEGIN
FOR compRow in Cur_Comp LOOP
--If service pipe exists in CIS but not in FM and the service pipe has status of retired in CIS, ignore the variance
If(compRow.pipe_num = '' AND cis_status_c = 'R')
continue
END IF
--If there is not a summary record for this HQ in the table for this run, create one
INSERT INTO t.AEC_CIS_SUM (HQ, RUN_DATE)
SELECT compRow.HQ, to_date(sysdate, 'DD/MM/YYYY') from dual WHERE NOT EXISTS
(SELECT null FROM t.AEC_CIS_SUM WHERE HQ = compRow.HQ AND RUN_DATE = to_date(sysdate, 'DD/MM/YYYY'))
-- Check fields and update the tables accordingly
If (compRow.cis_loop <> compRow.cis_loop_c) Then
--Insert information into the details table
INSERT INTO T.AEC_CIS_DET( Fac_id, Pipe_Num, Hq, Address, AutoUpdatedFl,
DateTime, Changed_Field, CIS_Value, FM_Value)
VALUES(compRow.Fac_ID, compRow.Pipe_Num, compRow.Hq, compRow.Street_Num || ' ' || compRow.Street_Name,
'Y', sysdate, 'Cis_Loop', compRow.cis_loop, compRow.cis_loop_c);
-- Update information into the summary table
UPDATE AEC_CIS_SUM
SET cis_loop = cis_loop + 1
WHERE Hq = compRow.Hq
AND Run_Date = to_date(sysdate, 'DD/MM/YYYY')
End If;
END LOOP;
END;
Any suggestions of an easier way of doing this rather than an if statement for all 44 columns of the table? (This is run once a week if it matters)
Update: Just to clarify, there are 88 columns of data (44 of duplicates to compare with one suffixed with _c). One table lists each field in a row that is different so one row can mean 30+ records written in that table. The other table keeps tally of the number of discrepencies for each week.

First of all I believe that your task can be implemented (and should be actually) with staight SQL. No fancy cursors, no loops, just selects, inserts and updates. I would start with unpivotting your source data (it is not clear if you have primary key to join two sets, I guess you do):
Col0_PK Col1 Col2 Col3 Col4
----------------------------------------
Row1_val A B C D
Row2_val E F G H
Above is your source data. Using UNPIVOT clause we convert it to:
Col0_PK Col_Name Col_Value
------------------------------
Row1_val Col1 A
Row1_val Col2 B
Row1_val Col3 C
Row1_val Col4 D
Row2_val Col1 E
Row2_val Col2 F
Row2_val Col3 G
Row2_val Col4 H
I think you get the idea. Say we have table1 with one set of data and the same structured table2 with the second set of data. It is good idea to use index-organized tables.
Next step is comparing rows to each other and storing difference details. Something like:
insert into diff_details(some_service_info_columns_here)
select some_service_info_columns_here_along_with_data_difference
from table1 t1 inner join table2 t2
on t1.Col0_PK = t2.Col0_PK
and t1.Col_name = t2.Col_name
and nvl(t1.Col_value, 'Dummy1') <> nvl(t2.Col_value, 'Dummy2');
And on the last step we update difference summary table:
insert into diff_summary(summary_columns_here)
select diff_row_id, count(*) as diff_count
from diff_details
group by diff_row_id;
It's just rough draft to show my approach, I'm sure there is much more details should be taken into account. To summarize I suggest two things:
UNPIVOT data
Use SQL statements instead of cursors

You have several issues in your code:
If(compRow.pipe_num = '' AND cis_status_c = 'R')
continue
END IF
"cis_status_c" is not declared. Is it a variable or a column in AEC_CIS_COMP?
In case it is a column, just put the condition into the cursor, i.e. SELECT * FROM T.AEC_CIS_COMP WHERE not (compRow.pipe_num = '' AND cis_status_c = 'R')
to_date(sysdate, 'DD/MM/YYYY')
That's nonsense, you convert a date into a date, simply use TRUNC(SYSDATE)
Anyway, I think you can use three single statements instead of a cursor:
INSERT INTO t.AEC_CIS_SUM (HQ, RUN_DATE)
SELECT comp.HQ, trunc(sysdate)
from AEC_CIS_COMP comp
WHERE NOT EXISTS
(SELECT null FROM t.AEC_CIS_SUM WHERE HQ = comp.HQ AND RUN_DATE = trunc(sysdate));
INSERT INTO T.AEC_CIS_DET( Fac_id, Pipe_Num, Hq, Address, AutoUpdatedFl, DateTime, Changed_Field, CIS_Value, FM_Value)
select comp.Fac_ID, comp.Pipe_Num, comp.Hq, comp.Street_Num || ' ' || comp.Street_Name, 'Y', sysdate, 'Cis_Loop', comp.cis_loop, comp.cis_loop_c
from T.AEC_CIS_COMP comp
where comp.cis_loop <> comp.cis_loop_c;
UPDATE AEC_CIS_SUM
SET cis_loop = cis_loop + 1
WHERE Hq IN (Select Hq from T.AEC_CIS_COMP)
AND trunc(Run_Date) = trunc(sysdate);
They are not tested but they should give you a hint how to do it.

"BETWEEN" SQL Keyword for Oracle Dates -- Getting an error in Oracle

I have dates in this format in my database "01-APR-12" and the column is a DATE type.
My SQL statement looks like this:
SELECT DISTINCT c.customerno, c.lname, c.fname
FROM customer c, sales s
WHERE c.customerno = s.customerno AND s.salestype = 1
AND (s.salesdate BETWEEN '01-APR-12' AND '31-APR-12');
When I try to do it that way, I get this error -- ORA-01839: date not valid for month specified.
Can I even use the BETWEEN keyword with how the date is setup in the database?
If not, is there another way I can get the output of data that is in that date range without having to fix the data in the database?
Thanks!

April has 30 days not 31.
Change
SELECT DISTINCT c.customerno, c.lname, c.fname
FROM customer c, sales s
WHERE c.customerno = s.customerno AND s.salestype = 1
AND (s.salesdate BETWEEN '01-APR-12' AND '31-APR-12');
to
SELECT DISTINCT c.customerno, c.lname, c.fname
FROM customer c, sales s
WHERE c.customerno = s.customerno AND s.salestype = 1
AND (s.salesdate BETWEEN '01-APR-12' AND '30-APR-12');
and you should be good to go.

In case the dates you are checking for range from 1st day of a month to the last day of a month then you may modify the query to avoid the case where you have to explicitly check the LAST day of the month
SELECT DISTINCT c.customerno, c.lname, c.fname
FROM customer c, sales s
WHERE c.customerno = s.customerno
AND s.salestype = 1 AND (s.salesdate BETWEEN '01-APR-12' AND LAST_DAY(TO_DATE('APR-12', 'MON-YY'));
The LAST_DAY function will provide the last day of the month.

The other answers are missing out on something important and will not return the correct results. Dates have date and time components. If your salesdate column is in fact a date that includes time, you will miss out on any sales that happened on April 30 unless they occurred exactly at midnight.
Here's an example:
create table date_temp (temp date);
insert into date_temp values(to_date('01-APR-2014 15:12:00', 'DD-MON-YYYY HH24:MI:SS'));
insert into date_temp values(to_date('30-APR-2014 15:12:00', 'DD-MON-YYYY HH24:MI:SS'));
table DATE_TEMP created.
1 rows inserted.
1 rows inserted.
select * from date_temp where temp between '01-APR-2014' and '30-APR-2014';
Query Result: 01-APR-14
If you want to get all records from April that includes those with time-components in the date fields, you should use the first day of the next month as the second side of the between clause:
select * from date_temp where temp between '01-APR-2014' and '01-MAY-2014';
01-APR-14
30-APR-14

modifying query resultset to include all dates in a range

I have a query which I run on a table TXN_DEC(id, resourceid, usersid, date, eventdesc) which return distinct count of users for a given date-range and resourceid, group by date and eventdesc (each resource can have 4 to 5 eventdesc)
if there is no value of distinct users count on a date in the range, for an eventdesc, then it skips that date row in the resultset.
I need to have all date rows in my resultset or collection such that if there is no value of count for a date,eventdesc combination, then its value is set to 0 but that date still exists in the collection..
How do I go about getting such a collection
I know getting the final dataset entirely from the query result would be too complicated,
but I can use collections in groovy to modify and populate my map/list to get the data in the required format
something similar to following: if
input date range = 5th Feb to 3 March 2011
DataMap = [dateval: '02/05/2011' eventdesc: 'Read' dist_ucnt: 23,
dateval: '02/06/2011' eventdesc: 'Read' dist_ucnt: 23,
dateval: '02/07/2011' eventdesc: 'Read' dist_ucnt: 0, -> this row was not present in query resultset, but row exists in the map with value 0
....and so on till 3 march 2011 and then whole range repeated for each eventdesc
]

If you want all dates (including those with no entries in your TXN_DEC table) for a given range, you could use Oracle to generate your date range and then use an outer join to your existing query. Then you would just need to fill in null values. Something like:
select
d.dateInRange as dateval,
'Read' as eventdesc,
nvl(td.dist_ucnt, 0) as dist_ucnt
from (
select
to_date('02-FEB-2011','dd-mon-yyyy') + rownum - 1 as dateInRange
from all_objects
where rownum <= to_date('03-MAR-2011','dd-mon-yyyy') - to_date('02-FEB-2011','dd-mon-yyyy') + 1
) d
left join (
select
date,
count(distinct usersid) as dist_ucnt
from
txn_dec
where eventDesc = 'Read'
group by date
) td on td.date = d.dateInRange
That's my purely Oracle solution since I'm not a Groovy guy (well, actually, I am a pretty groovy guy...)
EDIT: Here's the same version wrapped in a stored procedure. It should be easy to call if you know the API....
create or replace procedure getDateRange (
p_begin_date IN DATE,
p_end_date IN DATE,
p_event IN txn_dec.eventDesc%TYPE,
p_recordset OUT SYS_REFCURSOR)
AS
BEGIN
OPEN p_recordset FOR
select
d.dateInRange as dateval,
p_event as eventdesc,
nvl(td.dist_ucnt, 0) as dist_ucnt
from (
select
p_begin_date + rownum - 1 as dateInRange
from all_objects
where rownum <= p_end_date - p_begin_date + 1
) d
left join (
select
date,
count(distinct usersid) as dist_ucnt
from
txn_dec
where eventDesc = p_event
group by date
) td on td.date = d.dateInRange;
END getDateRange;

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Best way to process 1400 million records in SQL - performance

Related

Transferring data to a test table

same query in procedure take different time

Fastest way of doing field comparisons in the same table with large amounts of data in oracle

"BETWEEN" SQL Keyword for Oracle Dates -- Getting an error in Oracle

modifying query resultset to include all dates in a range

Categories

Resources