Best way to create tables for huge data using Oracle

Functional requirement
We work with devices. Each device, roughly speaking, has a unique identifier, an IP address, and a type.
I have a routine that pings every device that has an IP address.
This routine is nothing more than a C# console application that runs every 3 minutes, trying to ping the IP address of each device.
I need to store the result of each ping in the database, along with the date of the check (regardless of the result of the ping).
Technical part:
Assuming my ping routine and database structure are ready as of 01/06/2016, I need to do two things:
Daily extraction
Extraction in real time (last 24 hours)
Both should return the same thing:
Devices that are unavailable for more than 24 hours.
Devices that are unavailable for more than 7 days.
A device is considered unavailable if it was pinged AND did not respond.
A device is considered available if it was pinged AND responded successfully.
What I have today and works very badly:
A table with the following structure:
create table history (id_device number, response number, date date);
This table holds a large amount of data (currently 60 million rows, and the trend is exponential growth).
Here are the questions:
How to achieve these objectives without encountering problems of slowness in queries?
How to create a table structure that is prepared to receive millions / billions of records within my corporate world?

Partition the table based on date.
For the partitioning strategy, weigh performance against maintenance.
For easy maintenance, use automatic INTERVAL partitions by month or week.
You can even partition by day, or manually pre-define two-day intervals.
Your query only needs 2 calendar days.
select id_device,
min(case when response is null then 'N' else 'Y' end),
max(case when response is not null then date end)
from history
where date > sysdate - 1
group by id_device
having min(case when response is null then 'N' else 'Y' end) = 'N'
and sysdate - max(case when response is not null then date end) > ?;
If for missing responses you write a default value instead of NULL, you may try building it as an index-organized table.
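An index-organized table along those lines might look like this (a sketch only; the date column is renamed ping_date here because DATE is a reserved word in Oracle, and an IOT requires a primary key):

```sql
-- Rows are stored directly in the primary-key B-tree, so lookups
-- by (id_device, ping_date) need no separate table access.
create table history (
    id_device number,
    ping_date date,
    response  number not null,  -- default value written for missing responses
    constraint history_pk primary key (id_device, ping_date)
) organization index;
```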
You need to read about Oracle partitioning.
This statement will create your HISTORY table partitioned by calendar day.
create table history (id_device number, response number, date date)
PARTITION BY RANGE (date)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
( PARTITION p0 VALUES LESS THAN (TO_DATE('24-05-2016', 'DD-MM-YYYY')),
  PARTITION p1 VALUES LESS THAN (TO_DATE('25-05-2016', 'DD-MM-YYYY')) );
All your old data will be in the P0 partition.
Starting 24/05/2016, a new partition will be created automatically each day.
HISTORY is now a single logical object, but physically it is a collection of identically structured tables stacked on top of each other.
Because each partition's data is stored separately, a query that asks for one day's worth of data only needs to scan a single partition.
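For instance, a simple range predicate on the partition key lets the optimizer prune to one partition (a sketch; the date column is called ping_date here because DATE is a reserved word in Oracle):

```sql
-- Partition pruning: only the partition holding yesterday's rows is scanned
select id_device, response
  from history
 where ping_date >= trunc(sysdate) - 1
   and ping_date <  trunc(sysdate);
```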

Related

Very slow UPDATE with PutDatabaseRecord processor

I have 2 PutDatabaseRecord processors which are writing to an Oracle DB.
This is part of the schema:
The red square there isn't an error, just informational messages, since I set the processor to INFO mode. It reports fetching the schema, and "Commit because of batch size" with the SQL update command.
The problem is that the first of them, ml_task, processes several million records in a few minutes, while the other one, ml_sales, gets stuck on 1000 records for several hours. It gets stuck both at night, when the DB is heavily loaded, and during the day.
At the same time, update goods.ml_[name] set load_date = sysdate statements take the same time via the SQL Developer interface for both ml_task and ml_sales: around 10 minutes at night and several minutes during the day.
Both of them work with the same pooling service.
This is the configuration of the bottom part of the service:
Both processors have the same configuration, except for the table name and update keys.
I tried setting Max Batch Size to zero; it had no effect.
Both processors are configured to run in one thread, but I also tried configuring 10 threads, with no effect.
Also, there is no shortage of connections to the DB at all: I have around 10 processors, each using one thread, so 50 connections, which I think is enough.
They are processing around 5 million records.
This is JSON of ml_task:
[{"DOC_ID": 1799041400,"LINE_ID":694098344,"LOAD_DATE":"16-Jul-21"} ... ]
This is roughly the update NiFi runs for ml_task:
update goods.ml_task
set load_date = sysdate
where doc_id = ?
and line_id = ?;
This is the table:
Name Null? Type
------------- -------- ------
DOC_ID NOT NULL NUMBER
LINE_ID NOT NULL NUMBER
ORG_ID NOT NULL NUMBER
NMCL_ID NOT NULL NUMBER
ASSORTMENT_ID NOT NULL NUMBER
START_DATE NOT NULL DATE
END_DATE DATE
ITEMS_QNT NUMBER
MODIFY_DATE DATE
LOAD_DATE DATE
This is a JSON of ml_sales:
[{"REP_DATE":"06-Jul-21","NMCL_ID":336793,"ASSORT_ID":7,"RTT_ID":92,"LOAD_DATE":"16-Jul-21"} ... ]
Request for ml_sales is such as:
update goods.ml_sales set load_date = sysdate
where nmcl_id = ?
and assort_id = ?
and rtt_id = ?
and rep_date = ?;
And the table:
Name Null? Type
----------- -------- ----------
REP_DATE NOT NULL DATE
NMCL_ID NOT NULL NUMBER(38)
ASSORT_ID NOT NULL NUMBER
RTT_ID NOT NULL NUMBER(38)
OUT_ITEMS NUMBER
MODIFY_DATE DATE
LOAD_DATE DATE
What can be reason for so slow update of ml_sales?
UPDATE
I set all the flows to STOP except the problematic one, and I committed all sessions in SQL Developer... and it is the same result: still very slow.
UPDATE
As I mentioned above, I copied the ml_sales table to another schema, and this didn't change the result.
I suspect the problem was a lack of indexes in Oracle. I asked our DBAs to check the problem on the DB side, and they fixed it; the UPDATE now runs as fast as for the ml_task table. Unfortunately I still don't know exactly what the problem was, because our DBA left for vacation. So, again, it is very possible that it was an index problem.
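For reference, a likely fix is a composite index matching the update's WHERE clause; without one, each of the millions of single-row updates has to scan the table. A sketch (the index name is hypothetical):

```sql
-- Composite index covering all four update keys of ml_sales
create index goods.ml_sales_upd_ix
    on goods.ml_sales (nmcl_id, assort_id, rtt_id, rep_date);
```

This would match the symptom: ml_task updates by what is presumably its primary key (doc_id, line_id), while ml_sales had no index on its four-column key.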

Explanation of an SQL query

I recently got some help with an Oracle query and don't quite understand how it works, so I can't get it to work with my data. Is anyone able to explain, in logical steps, what is happening and which variables are actually taken from an existing table's columns?

I am looking to select data from a table of readings (column names: day, hour, volume) and find the average volume for each hour of each day (thus GROUP BY day, hour), by going back over all readings for that hour/day combination (as far back as my dataset goes) and writing out the average. Once that is done, the results are written to a different table with the same column names (day, hour, volume), except that when written back on a per-hour basis, 'volume' will be the average for that hour of the day in the past. For example, I want to find what the average was for all Wednesdays at 7pm in the past, and output the average to a new record.

Assuming these 3 columns are used, and in reference to the code below, I am not sure how "hours" differs from "hrs" and what the t1 variable represents. Any help is appreciated.
INSERT INTO avg_table (days, hours, avrg)
WITH xweek
AS (SELECT ds, LPAD (hrs, 2, '0') hrs
FROM ( SELECT LEVEL ds
FROM DUAL
CONNECT BY LEVEL <= 7),
( SELECT LEVEL - 1 hrs
FROM DUAL
CONNECT BY LEVEL <= 24))
SELECT t1.ds, t1.hrs, AVG (volume)
FROM xweek t1, tables t2
WHERE t1.ds = TO_CHAR (t2.day(+), 'D')
AND t1.hrs = t2.hour(+)
GROUP BY t1.ds, t1.hrs;
I'd re-write this slightly so it makes more sense (to me at least).
To break it down bit by bit, CONNECT BY is a hierarchical (recursive) query. This is a common "cheat" to generate rows. In this case 7 to represent each day of the week, numbered 1 to 7.
SELECT LEVEL ds
FROM DUAL
CONNECT BY LEVEL <= 7
The next one generates the hours 0 to 23 to represent midnight to 11pm. These are then joined together in the old style in a Cartesian or CROSS JOIN. This means that every possible combination of rows is returned, i.e. it generates every hour of every day for a single week.
The WITH clause is described in the documentation on the SELECT statement, it is commonly known as a Common Table Expression (CTE), or in Oracle the Subquery Factoring Clause. This enables you to assign a name to a sub-query and reference that single sub-query in multiple places. It can also be used to keep code clean or generate temporary tables in memory for ready access. It's not required in this case but it does help to separate the code nicely.
Lastly, the + is Oracle's old notation for outer joins. They are mostly equivalent but there are a few very small differences that are described in this question and answer.
As I said at the beginning, I would re-write this to conform to the ANSI standard because I find it more readable:
insert into avg_table (days, hours, avrg)
with xweek as (
select ds, lpad(hrs, 2, '0') hrs
from ( select level ds
from dual
connect by level <= 7 )
cross join ( select level - 1 hrs
from dual
connect by level <= 24 )
)
select t1.ds, t1.hrs, avg(volume)
from xweek t1
left outer join tables t2
on t1.ds = to_char(t2.day, 'd')
and t1.hrs = t2.hour
group by t1.ds, t1.hrs;
To go into slightly more detail, the t1 variable is an alias for the CTE xweek; it's there so you don't have to type the entire name each time. hrs is an alias for the generated expression; as you reference it explicitly, you need to call it something. HOURS is a column in your own table.
As to whether this is doing the correct thing, I'm not sure: you imply you only want it for a single day rather than the entire week, so only you can decide if this is correct. I also find it a little strange that you need the HOURS column in your table to be a character left-padded with zeros, lpad(hrs, 2, '0'); once again, only you know if this is correct.
I would highly recommend playing around with this yourself and working out how everything fits together. You also seem to be missing some of the basics; get a textbook or look around on the internet, or Stack Overflow; there are plenty of examples.

Querying a data warehouse data involving time dimension

I have two tables for time dimension
date (unique row for each day)
time of the day (unique row for each minute in a day)
Given this schema, what would a query look like to retrieve facts for the last X hours, where X can be any number greater than 0?
Things start to become tricky when the start time and end time happen to fall on two different days of the year.
EDIT: My Fact table does not have a time stamp column
Fact tables do have (and should have) an original timestamp, in order to avoid weird by-time queries that happen over the boundary of a day. Weird here means having some kind of complicated date-time function in the WHERE clause.
In most DWs these type of queries are very rare, but you seem to be streaming data into your DW and using it for reporting at the same time.
So I would suggest:
Introduce the full timestamp in the fact table.
For the old records, re-create the timestamp from the Date and Time keys.
DW queries are all about not having any functions in the WHERE clause, or if a function has to be used, make sure it is SARGABLE.
You would probably be better served by converting the Start Date and End Date columns to TIMESTAMP and populating them.
Slicing the table would require taking the appropriate interval BETWEEN Start Date AND End Date. In Oracle the interval would be something along the lines of SYSDATE - (4/24) or SYSDATE - NUMTODSINTERVAL(4, 'HOUR')
This could also be rewritten as:
Start Date <= (SYSDATE - (4/24)) AND End Date >= (SYSDATE - (4/24))
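Put together, a sargable "last 4 hours" filter on a populated timestamp column might look like this (table and column names here are hypothetical):

```sql
-- No function wraps the indexed column, so an index range scan is possible
select *
  from fact_events
 where event_ts >= sysdate - numtodsinterval(4, 'HOUR');
```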
It seems to me that given the current schema you have, that you will need to retrieve the appropriate time IDs from the time dimension table which meet your search criteria, and then search for matching rows in the fact table. Depending on the granularity of your time dimension, you might want to check the performance of doing either (SQL Server examples):
A subselect:
SELECT X FROM FOO
WHERE TIMEID IN (SELECT ID FROM DIMTIME WHERE HOUR >= DATEPART(HOUR, CURRENT_TIMESTAMP))
  AND DATEID IN (SELECT ID FROM DIMDATE WHERE DATE = GETDATE())
An inner join:
SELECT X FROM FOO
INNER JOIN DIMTIME ON FOO.TIMEID = DIMTIME.ID
INNER JOIN DIMDATE ON FOO.DATEID = DIMDATE.ID
WHERE DIMTIME.HOUR >= DATEPART(HOUR, CURRENT_TIMESTAMP)
  AND DIMDATE.DATE = GETDATE()
Neither of these are truly attractive options.
Have you considered that you may be querying against a cube that is intended for roll-up analysis and not necessarily for "last X" analysis?
If this is not a "roll-up" cube, I would agree with the other posters in that you should re-stamp your fact tables with better keys, and if you do in fact intend to search off of hour frequently, you should probably include that in the fact table as well, as any other attempt will probably make the query non-sargable (see What makes a SQL statement sargable?).
Microsoft recommends at http://msdn.microsoft.com/en-us/library/aa902672%28v=sql.80%29.aspx that:
In contrast to surrogate keys used in other dimension tables, date and time dimension keys should be "smart." A suggested key for a date dimension is of the form "yyyymmdd". This format is easy for users to remember and incorporate into queries. It is also a recommended surrogate key format for fact tables that are partitioned into multiple tables by date.
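For example, a smart yyyymmdd key can be derived directly from a date (Oracle syntax shown for consistency with the rest of this page; the alias is an assumption):

```sql
-- Build the numeric yyyymmdd surrogate key for today's date
select to_number(to_char(sysdate, 'YYYYMMDD')) as date_key
  from dual;
```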
Best of luck!

Oracle 10g - Determine the average of concurrent connections

Is it possible to determine the average of concurrent connections on a 10g large database installation?
Any ideas?
This is probably more of a ServerFault question.
On a basic level, you could do this by regularly querying v$session to count the number of current sessions, store that number somewhere, and average it over time.
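That manual approach might be as simple as the following (the sampling table is an assumption):

```sql
-- One-time setup: a table to hold the samples
create table session_samples (
    sample_time   date,
    session_count number
);

-- Run this on a schedule, e.g. every five minutes
insert into session_samples
select sysdate, count(*) from v$session;

-- Average concurrent sessions over the last day
select avg(session_count)
  from session_samples
 where sample_time > sysdate - 1;
```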
But there are already good utilities available to help with this. Look into STATSPACK. Then look at the scripts shown here to get you started.
Alternatively you could install a commercial monitoring application like Spotlight on Oracle.
If you have Oracle Enterprise Manager set up you can create a User Defined Metric which records SELECT COUNT(*) FROM V$SESSION. Select Related Links -> User Defined Metrics to set up a new User Defined Metric. Once it collects some data you can get the data out in raw form or it will do some basic graphing for you. As a bonus you can also set up alerting if you want to be e-mailed when the metric reaches a certain value.
The tricky bit is recording the connections. Oracle doesn't do this by default, so if you haven't got anything in place then you won't have a historical record.
The easiest way to start recording connections is with Oracle's built in audit functionality. It's as simple as
audit session
/
We can see the records of each connection in a view called dba_audit_session.
Now what? The following query uses a Common Table Expression to generate a range of datetime values spanning 8th July 2009 in five-minute chunks. The output of the CTE is joined to the audit view for that date; a count is calculated of the connections that span each five-minute increment.
with t as
 ( select to_date('08-JUL-2009','DD-MON-YYYY') + ((level-1) * (300/86400)) as five_mins
     from dual connect by level <= 288)
select to_char(t.five_mins, 'HH24:MI') as five_mins
     , sum(case when t.five_mins between timestamp and logoff_time
                then 1
                else 0 end) as connections
  from t
     , dba_audit_session ssn
 where trunc(ssn.timestamp) = to_date('08-JUL-2009','DD-MON-YYYY')
 group by to_char(t.five_mins, 'HH24:MI')
 order by 1
/
You can then use this query as the input into a query which calculates the average number of connections.
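For instance, the per-slot counts can be moved into another CTE and averaged (a sketch along the lines of the query above):

```sql
with t as
 ( select to_date('08-JUL-2009','DD-MON-YYYY') + ((level-1) * (300/86400)) as five_mins
     from dual connect by level <= 288),
per_slot as
 ( select t.five_mins,
          sum(case when t.five_mins between ssn.timestamp and ssn.logoff_time
                   then 1 else 0 end) as connections
     from t, dba_audit_session ssn
    where trunc(ssn.timestamp) = to_date('08-JUL-2009','DD-MON-YYYY')
    group by t.five_mins )
select avg(connections) as avg_concurrent_connections
  from per_slot;
```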
This is a fairly crude implementation: I chose five-minute increments for display purposes, but obviously the finer-grained the increment, the more accurate the measure. Be warned: if you make the increments too fine-grained and you have a lot of connections, the resulting cross join will take a long time to run!

History records, missing records, filling in the blanks

I have a table that contains a history of costs by location. These are updated on a monthly basis.
For example
Location1, $500, 01-JAN-2009
Location1, $650, 01-FEB-2009
Location1, $2000, 01-APR-2009
If I query for March 1, I want to return the value for Feb 1, since March 1 does not exist.
I've written a query using an Oracle analytic function, but it takes too much time (it would be fine for a report, but we are using this to let the user view the data through the front end and switch dates; re-querying takes too long, as the table has something like 1 million rows).
So, the next thought I had was to simply update the table with the missing data. In the case above, I'd simply add in a record identical to 01-FEB-2009 except set the date to 01-MAR-2009.
I was wondering if you all had thoughts on how to best do this.
My plan had been to simply create a cursor for a location, fetch the first record, then fetch the next, and if the next record was not for the next month, insert a record for the missing month.
A little more information:
CREATE TABLE MAXIMO.FCIHIST_BY_MONTH
(
LOCATION VARCHAR2(8 BYTE),
PARKALPHA VARCHAR2(4 BYTE),
LO2 VARCHAR2(6 BYTE),
FLO3 VARCHAR2(1 BYTE),
REGION VARCHAR2(4 BYTE),
AVG_DEFCOST NUMBER,
AVG_CRV NUMBER,
FCIDATE DATE
)
And then the query I'm using (the system will pass in the date and the parkalpha). The table is approx 1 million rows, and, again, while it takes a reasonable amount of time for a report, it takes way too long for an interactive display
select location, avg_defcost, avg_crv, fcimonth, fciyear,fcidate from
(select location, avg_defcost, avg_crv, fcimonth, fciyear, fcidate,
max(fcidate) over (partition by location) my_max_date
from FCIHIST_BY_MONTH
where fcidate <='01-DEC-2008'
and parkalpha='SAAN'
)
where fcidate=my_max_date;
The best way to do this is to create a PL/SQL stored procedure that works backwards from the present and runs queries that fail to return data. Each month that it fails to return data it inserts a row for the missing data.
create or replace PROCEDURE fill_in_missing_data IS
  cursor have_data_on_date is
    select location, trunc(date_field) have_date
      from the_table
     group by location, trunc(date_field)
     order by 2 desc;
  a_date date;
  n_days_to_insert number;
BEGIN
  a_date := trunc(sysdate);
  for r1 in have_data_on_date loop
    if r1.have_date < a_date then
      -- insert the missing dates in a loop
      n_days_to_insert := a_date - r1.have_date; -- Might be off by 1, need to test.
      for day_offset in 1 .. n_days_to_insert loop
        -- insert missing day
        insert into the_table ( location, the_date, amount )
        values ( r1.location, a_date - day_offset, 0 );
      end loop;
    end if;
    a_date := r1.have_date;
  end loop;
END;
Filling in the missing data will (if you are careful) make the queries much simpler and run faster.
I would also add a flag to the table to indicate that the row is filled-in missing data, so that if you need to remove it (or create a view without it) later, you can.
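The flag might be added like this (a sketch, reusing the hypothetical the_table name from the procedure above):

```sql
-- Mark synthetic rows so they can be excluded or removed later
alter table the_table add (is_filled char(1) default 'N');

-- A view that hides the filled-in rows
create or replace view the_table_real as
    select * from the_table where is_filled = 'N';
```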
I have filled in missing data, and also filled in dummy data so that outer joins were not necessary, in order to improve query performance, a number of times. It is not "clean" or "perfect", but I follow Leflar's #1 Law: always go with what works.
You can create a job in Oracle that will automatically run at off-peak times to fill in the missing data. Take a look at: This question on stackoverflow about creating jobs.
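A minimal DBMS_SCHEDULER sketch for an off-peak run (the job name and schedule are assumptions):

```sql
begin
    dbms_scheduler.create_job(
        job_name        => 'FILL_MISSING_DATA_JOB',
        job_type        => 'STORED_PROCEDURE',
        job_action      => 'FILL_IN_MISSING_DATA',
        start_date      => systimestamp,
        repeat_interval => 'FREQ=DAILY;BYHOUR=2',  -- 2 a.m. daily, off-peak
        enabled         => true);
end;
/
```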
What is your precise use case underlying this request?
In every system I have worked on, if there is supposed to be a record for MARCH and there isn't a record for MARCH the users would like to know that fact. Apart from anything they might want to investigate why the MARCH record is missing.
Now, if this is basically a performance issue, then you ought to tune the query. Or if it is a presentation issue - you want to generate a matrix of twelve rows, which is difficult if a month doesn't have a record for some reason - then that is a different matter, with a variety of possible solutions.
But seriously, I think it is a bad practice for the database to invent replacements for missing records.
edit
I see from your recent comment on your question that it did turn out to be a performance issue - indexes fixed the problem. So I feel vindicated.
