Performance issue on SQLite closure table implementation

Let me start by saying that I am fairly new to SQL and database systems, so please excuse any newbie mistakes I may be making.
I am using a closure table to insert hierarchical data in a SQLite database. I am using C# (.NET 4.6.1) and SQLite precompiled 32-bit DLL (x86) for SQLite version 3.26.0. The hierarchical data inserted contains ~240000 elements, and the max tree depth is not greater than 7.
My hierarchical element table is:
CREATE TABLE element (elementId INTEGER PRIMARY KEY, parentId INTEGER, elementName TEXT, FOREIGN KEY (parentId) REFERENCES element(elementId));
And my closure table is defined by:
CREATE TABLE hierarchy (parentId INTEGER, childId INTEGER, depth INTEGER, FOREIGN KEY(parentId) REFERENCES element(elementId), FOREIGN KEY(childId) REFERENCES element(elementId));
The elements are inserted using a classic stack that starts with the “root” element and has each element's children pushed onto it as the element is processed. The inserts happen inside a transaction, using:
INSERT INTO element VALUES (<ELEMENT_ID>, <PARENT_ID>, '<ELEMENT_NAME>');
And I initialize the closure table with the “self” relationship, using:
INSERT INTO hierarchy(parentId, childId, depth) VALUES (<ELEMENT_ID>, <ELEMENT_ID>, 0);
These inserts cause no issues, taking a few seconds to execute.
Next, I go through all the elements again using the same stack method to build the closure table (NOTE: I could probably do this at the same time as the previous inserts; however, I do it in a separate loop to isolate the performance issue), using the following query (inside another transaction):
INSERT INTO hierarchy SELECT p.parentId, c.childId, p.depth+c.depth+1 FROM hierarchy p, hierarchy c WHERE p.childId=<PARENT_ID> AND c.parentId=<ELEMENT_ID>;
However, this query takes HOURS, maybe even days, to execute, and its execution time keeps growing as more elements are processed. I know that it inserts a lot of rows into the closure table (one entry per relationship between the current element and each of its ancestors), but can anything be done to improve the performance here?
Thanks

You need indexes on child keys and parent keys. Also wrap everything in a transaction.
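For the query in the question, that means something like the sketch below (the index names are my own):
-- let the self-join seek on p.childId = ? and c.parentId = ?
CREATE INDEX IF NOT EXISTS idx_hierarchy_child ON hierarchy(childId);
CREATE INDEX IF NOT EXISTS idx_hierarchy_parent ON hierarchy(parentId);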
Better yet, use a single recursive CTE to generate the closure table, something like
with recursive
closure as (
select elementId as childId, elementId as parentId, 0 as depth from element
union all
select closure.childId, element.parentId, closure.depth + 1 as depth
from closure join element on closure.parentId = element.elementId
where element.parentId is not null
)
select parentId, childId, depth from closure
To actually create the table, use something like CREATE TABLE hierarchy AS ... to put the above result into a table, as in the sketch below.
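A sketch of that, assuming the element table from the question (note that CREATE TABLE ... AS will not carry over foreign keys or indexes, and assumes hierarchy does not already exist or was dropped first):
CREATE TABLE hierarchy AS
WITH RECURSIVE closure AS (
    SELECT elementId AS childId, elementId AS parentId, 0 AS depth FROM element
    UNION ALL
    SELECT closure.childId, element.parentId, closure.depth + 1
    FROM closure JOIN element ON closure.parentId = element.elementId
    WHERE element.parentId IS NOT NULL
)
SELECT parentId, childId, depth FROM closure;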

Related

Joining tables with table type slows down calculation?

I have a big calculation that joins together about 10 tables and calculates some values from the result. I want to write a function that allows me to replace one of the joined tables (let's call it Table A) with a table (type) that I pass as an input parameter.
I have defined row and table types for table A like
create or replace TYPE t_tableA_row AS OBJECT(*All Columns of Table A*);
create or replace TYPE t_tableA_table as TABLE OF t_tableA_row;
And the same for the types of the calculation I need as an output of the function.
My function looks like this:
create or replace FUNCTION calculation_VarInput (varTableA t_tableA_table)
RETURN t_calculationResult_table AS
result_ t_calculationResult_table;
BEGIN
SELECT t_calculationResult_row (*All Columns of Calculation Result*)
BULK COLLECT INTO result_
FROM (*The calculation*);
RETURN result_;
END;
If I test this function with the normal calculation that just uses Table A (ignoring the input parameter), it works fine and takes about 3 seconds. However, if I replace Table A with varTableA (the input parameter, which is a table type of Table A), the calculation takes so long I have never seen it finish.
When I use table A for the calculation it looks like this
/*Inside the calculation*/
*a bunch tables being joined*
JOIN TableA A ON A.Value = B.SomeOtherValue
JOIN *some other tables*
When I use varTableA it's:
/*Inside the calculation*/
*a bunch tables being joined*
JOIN TABLE(varTableA) A ON A.Value = B.SomeOtherValue
JOIN *some other tables*
Sorry for not posting the exact code but the calculation is huge and would really bloat this post.
Any ideas why using the table type when joining makes the calculation so much slower when compared to using the actual table?
Your function encapsulates some selection logic and so hides information from the optimizer. This may lead the optimizer to make bad or inefficient decisions.
Oracle has gathered statistics for TableA, so the optimizer knows how many rows it has, which columns are indexed, and so on. Consequently it can figure out the best access path for the table. It has no stats for TABLE(varTableA), so it assumes it will return 8192 (i.e. 8K) rows. This could change the execution plan significantly if, say, the original TableA returned 8 rows, or 80000. You can check this easily enough by running EXPLAIN PLAN for both versions of the query.
If that is the problem, add a /*+ CARDINALITY */ hint to the query that accurately reflects the number of rows in the function's result set. The hint (hint, not function) tells the optimizer the number of rows it should use in its calculations.
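For example, something like the sketch below; the figure 240 and the name someOtherTable are made up for illustration, and the real query would join the rest of the tables as before:
SELECT /*+ CARDINALITY(A 240) */ A.Value, B.SomeOtherValue
FROM someOtherTable B
JOIN TABLE(varTableA) A ON A.Value = B.SomeOtherValue;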
I don't want to actually change the values in my tables permanently, I just want to know what the calculation result would be if some values were different.
Why not use a view instead? A simple view which selects from TableA and applies the required modifications in its projection. Of course I know nothing about your data and how you want to manipulate it, so this may be impractical for all sorts of reasons. But it's where I would start.
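A hypothetical sketch, where SomeKey and Value are guesses at TableA's columns and the arithmetic stands in for whatever "what if" change you need:
CREATE OR REPLACE VIEW tableA_whatif AS
SELECT a.SomeKey,
       a.Value * 1.10 AS Value   -- e.g. simulate a 10% increase
FROM TableA a;
The big calculation then joins tableA_whatif wherever it joins TableA today, and the optimizer still sees the underlying table and its statistics.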

how to save a query result in a temporary table within a procedure

I'm quite new at Oracle, so I apologize in advance for the simple question.
So I have this procedure in which I run a query, and I want to save the query result for further use. Specifically, I want to run a FOR loop which takes my selection row by row and copies some of the values into another table. The purpose is to populate a child table (a weak entity) starting from a parent table.
For the purpose, let's imagine I have a query:
select *
from tab
where ...
Now I want to save the selection with a local scope, and therefore with a lifespan confined to the procedure itself (basically like a local variable in a C function). How can I achieve such a result?
Basically I have a class schedule table composed like this:
Schedule
--------------------------------------------------------
subject_code | subject_name | class_starting_date | starting hour | ending hour | day_of_week
So I made a query to get all the subjects scheduled for the current academic year, and I need to use the NEXT_DAY function on each row of the result set to populate a table of the actual classes scheduled for the next week.
My thought was:
I get the classes that need to be scheduled for the next week with a query, save the result somewhere, and then, through a FOR loop using NEXT_DAY (because I need the actual date on which the class takes place), populate the "class_occurence" table. I'm not sure this is the correct way of thinking; there could be something that performs this job without saving the result first, maybe a cursor, who knows...
Global temporary tables are a nice solution. As long as you know the structure of the data to be inserted (how many columns and what datatypes), you can insert into the global temp table. Data can only be seen by the session that does the inserts, and it can be deleted on commit or preserved, depending on the options you choose.
CREATE GLOBAL TEMPORARY TABLE my_temp_table (
column1 NUMBER,
column2 NUMBER
) ON COMMIT DELETE ROWS;
This has worked great for me where I need to have data aggregated but only for a short period of time.
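A usage sketch inside a procedure might look like this; source_tab, its columns, the predicate, and child_tab are placeholders for the poster's real schedule query and target table:
BEGIN
  -- stage the query result; the rows are visible only to this session
  INSERT INTO my_temp_table (column1, column2)
  SELECT col_a, col_b FROM source_tab WHERE col_a > 0;

  -- walk the staged rows and copy what you need into the child table
  FOR r IN (SELECT column1, column2 FROM my_temp_table) LOOP
    INSERT INTO child_tab (col_x, col_y) VALUES (r.column1, r.column2);
  END LOOP;
END;
/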
Edit: the data is local and temporary, but the temp table itself is a permanent object; it is always there.
If you want to have the table in memory in the procedure that is another solution but somewhat more sophisticated.
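A rough sketch of that in-memory alternative, using a PL/SQL collection filled with BULK COLLECT (the schedule table, the predicate, and the loop body are stand-ins for the poster's actual logic):
DECLARE
  TYPE t_rows IS TABLE OF schedule%ROWTYPE;
  l_rows t_rows;
BEGIN
  SELECT * BULK COLLECT INTO l_rows
  FROM schedule
  WHERE day_of_week IS NOT NULL;   -- placeholder predicate

  FOR i IN 1 .. l_rows.COUNT LOOP
    -- e.g. compute NEXT_DAY(SYSDATE, l_rows(i).day_of_week) here and
    -- insert the result into class_occurence
    NULL;
  END LOOP;
END;
/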

Should I use Oracle's sys_guid() to generate guids?

I have some inherited code that calls SELECT SYS_GUID() FROM DUAL each time an entity is created. This means that for each insertion there are two calls to Oracle, one to get the Guid, and another to insert the data.
I suppose that there may be a good reason for this; for example, Oracle's GUIDs may be optimized for high-volume insertions by being sequential, and thus they may be trying to avoid excessive index tree re-balancing.
Is there a reason to use SYS_GUID as opposed to building your own Guid on the client?
Why roll your own if you already have it provided to you? Also, you don't need to grab it first and then insert; you can just insert:
create table my_tab
(
val1 raw(16),
val2 varchar2(100)
);
insert into my_tab(val1, val2) values (sys_guid(), 'Some data');
commit;
You can also use it as a default value for a primary key:
drop table my_tab;
create table my_tab
(
val1 raw(16) default sys_guid(),
val2 varchar2(100),
primary key(val1)
);
Here there's no need to set up a before-insert trigger to use a sequence (or, in most cases, even to care about val1 or how it's populated in the code).
Sequences also require more maintenance, not to mention the portability issues when moving data between systems.
But sequences are more human-friendly, IMO (looking at and using a number is far easier than a 32-character hex rendering of a raw value). There may be other benefits to sequences; I haven't done any extensive comparisons, so you may wish to run some performance tests first.
If your concern is two database calls, you should be able to call SYS_GUID() within your INSERT statement. You could even use a RETURNING clause to get the value that Oracle generated, so that you have it in your application for further use.
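For example, reusing my_tab from the other answer; :generated_guid is whatever bind variable your client exposes:
INSERT INTO my_tab (val1, val2)
VALUES (SYS_GUID(), 'Some data')
RETURNING val1 INTO :generated_guid;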
SYS_GUID can be used as a default value for a primary key column, which is often more convenient than using a sequence, but note that the values will be more or less random and not sequential. On the plus side, that may reduce contention for hot blocks, but on the minus side your index inserts will be all over the place as well. We generally recommend against this practice.
I have found no reason to generate a GUID from Oracle. The round trip between Oracle and the client for every GUID is likely slower than the occasional index rebalancing that occurs with random-value inserts.

Why does the Execution Plan include a user-defined function call for a computed column that is persisted?

I have a table with 2 computed columns, both of which have "Is Persisted" set to true. However, when using them in a query, the Execution Plan shows the UDF used to compute the columns as part of the plan. Since the column data is calculated by the UDF when the row is added/updated, why would the plan include it?
The query is incredibly slow (>30s) when these columns are included in the query, and lightning fast (<1s) when they are excluded. This leads me to conclude that the query is actually calculating the column values at run time, which shouldn't be the case since they are set to persisted.
Am I missing something here?
UPDATE: Here's a little more info regarding our reasoning for using the computed column.
We are a sports company and have a customer who stores full player names in a single column. They require us to allow them to search player data by first and/or last name separately. Thankfully they use a consistent format for player names - LastName, FirstName (NickName) - so parsing them is relatively easy. I created a UDF that calls into a CLR function to parse the name parts using a regular expression. So obviously calling the UDF, which in turn calls a CLR function, is very expensive. But since it is only used on a persisted column I figured it would only be used during the few times a day that we import data into the database.
The reason is that the query optimizer does not do a very good job at costing user-defined functions. It decides, in some cases, that it would be cheaper to completely re-evaluate the function for each row, rather than incur the disk reads that might be necessary otherwise.
SQL Server's costing model does not inspect the structure of the function to see how expensive it really is, so the optimizer does not have accurate information in this regard. Your function could be arbitrarily complex, so it is perhaps understandable that costing is limited this way. The effect is worst for scalar and multi-statement table-valued functions, since these are extremely expensive to call per-row.
You can tell whether the optimizer has decided to re-evaluate the function (rather than using the persisted value) by inspecting the query plan. If there is a Compute Scalar iterator with an explicit reference to the function name in its Defined Values list, the function will be called once per row. If the Defined Values list references the column name instead, the function will not be called.
My advice is generally not to use functions in computed column definitions at all.
The reproduction script below demonstrates the issue. Notice that the PRIMARY KEY defined for the table is nonclustered, so fetching the persisted value would require a bookmark lookup from the index, or a table scan. The optimizer decides it is cheaper to read the source column for the function from the index and re-compute the function per row, rather than incur the cost of a bookmark lookup or table scan.
Indexing the persisted column speeds the query up in this case. In general, the optimizer tends to favour an access path that avoids re-computing the function, but the decision is cost-based so it is still possible to see a function re-computed for each row even when indexed. Nevertheless, providing an 'obvious' and efficient access path to the optimizer does help to avoid this.
Notice that the column does not have to be persisted in order to be indexed. This is a very common misconception; persisting the column is only required where it is imprecise (it uses floating-point arithmetic or values). Persisting the column in the present case adds no value and expands the base table's storage requirement.
Paul White
-- An expensive scalar function
CREATE FUNCTION dbo.fn_Expensive(@n INTEGER)
RETURNS BIGINT
WITH SCHEMABINDING
AS
BEGIN
DECLARE @sum_n BIGINT;
SET @sum_n = 0;
WHILE @n > 0
BEGIN
SET @sum_n = @sum_n + @n;
SET @n = @n - 1;
END;
RETURN @sum_n;
END;
GO
-- A table that references the expensive
-- function in a PERSISTED computed column
CREATE TABLE dbo.Demo
(
n INTEGER PRIMARY KEY NONCLUSTERED,
sum_n AS dbo.fn_Expensive(n) PERSISTED
);
GO
-- Add 8000 rows to the table
-- with n from 1 to 8000 inclusive
WITH Numbers AS
(
SELECT TOP (8000)
n = ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM master.sys.columns AS C1
CROSS JOIN master.sys.columns AS C2
CROSS JOIN master.sys.columns AS C3
)
INSERT dbo.Demo (n)
SELECT
N.n
FROM Numbers AS N
WHERE
N.n >= 1
AND N.n <= 8000
GO
-- This is slow
-- Plan includes a Compute Scalar with:
-- [dbo].[Demo].sum_n = Scalar Operator([dbo].[fn_Expensive]([dbo].[Demo].[n]))
-- QO estimates calling the function is cheaper than the bookmark lookup
SELECT
MAX(sum_n)
FROM dbo.Demo;
GO
-- Index the computed column
-- Notice the actual plan also calls the function for every row, and includes:
-- [dbo].[Demo].sum_n = Scalar Operator([dbo].[fn_Expensive]([dbo].[Demo].[n]))
CREATE UNIQUE INDEX uq1 ON dbo.Demo (sum_n);
GO
-- Query now uses the index, and is fast
SELECT
MAX(sum_n)
FROM dbo.Demo;
GO
-- Drop the index
DROP INDEX uq1 ON dbo.Demo;
GO
-- Don't persist the column
ALTER TABLE dbo.Demo
ALTER COLUMN sum_n DROP PERSISTED;
GO
-- Slow again, as you would expect
-- QO has no option but to call the function for each row
SELECT
MAX(sum_n)
FROM dbo.Demo;
GO
-- Index the non-persisted column
CREATE UNIQUE INDEX uq1 ON dbo.Demo (sum_n);
GO
-- Fast again
-- Persisting the column bought us nothing
-- and used extra space in the table
SELECT
MAX(sum_n)
FROM dbo.Demo;
GO
-- Clean up
DROP TABLE dbo.Demo;
DROP FUNCTION dbo.fn_Expensive;
GO

How to protect a running column within Oracle/PostgreSQL (kind of MAX-result locking or something)

I need advice on the following situation with Oracle/PostgreSQL:
I have a db table with a "running counter" and would like to protect it in the following situation with two concurrent transactions:
T1                                              T2
SELECT MAX(C) FROM TABLE WHERE CODE='xx'
-- C for new : result + 1
                                                SELECT MAX(C) FROM TABLE WHERE CODE='xx'
                                                -- C for new : result + 1
INSERT INTO TABLE...
                                                INSERT INTO TABLE...
So, in both cases, the column value for the INSERT is calculated from the old result plus one.
Given this, some running counter handled by the db would be fine, but that wouldn't work because:
the counter values of existing rows are sometimes changed
sometimes I'd like there to be multiple counter "value groups" (as with the CODE mentioned): with different values for CODE, the counters would be independent.
With some other databases this can be handled with the SERIALIZABLE isolation level, but at least with Oracle & PostgreSQL the phantom reads are prevented and yet the table still ends up with two distinct rows with the same counter value. This seems to have to do with predicate locking, locking "all the possible rows covered by the query"; some other DBs end up locking the whole table or something similar.
SELECT ... FOR UPDATE statements seem to be meant for other purposes, and don't even seem to work with the MAX() function.
Setting a UNIQUE constraint on the column would probably be the solution, but are there other ways to prevent this situation?
b.r. Touko
EDIT: One more option could probably be manual locking, even though it doesn't appear nice to me.
Both Oracle and PostgreSQL support what are called sequences, and they are a perfect fit for your problem. You can have a regular int column, but define one sequence per group, and do a single query like:
--PostgreSQL
insert into table (id, ... ) values (nextval(sequence_name_for_group_xx), ... )
--Oracle
insert into table (id, ... ) values (sequence_name_for_group_xx.nextval, ... )
Increments in sequences are atomic, so your problem just wouldn't exist. It's only a matter of creating the required sequences, one per group.
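Creating them is a one-liner per group; the names below are just illustrations matching the inserts above:
-- PostgreSQL and Oracle share this syntax
CREATE SEQUENCE sequence_name_for_group_xx;
CREATE SEQUENCE sequence_name_for_group_yy;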
the counter values of existing rows are sometimes changed
You should put a unique constraint on that column if this would be a problem for your app. Doing so would guarantee that a transaction at the SERIALIZABLE isolation level aborts if it tries to use the same id as another transaction.
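For instance (the table and constraint names are placeholders):
ALTER TABLE the_table ADD CONSTRAINT uq_code_c UNIQUE (code, c);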
One more option could probably be manual locking even though it doesn't appear nice to me..
Manual locking in this case is pretty easy: just take a SHARE UPDATE EXCLUSIVE or stronger lock on the table before selecting the maximum. This will kill concurrent performance, though.
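In PostgreSQL that protocol could look roughly like the sketch below; the_table is a placeholder, and every writer has to take the same lock for this to work:
BEGIN;
-- SHARE UPDATE EXCLUSIVE conflicts with itself, so concurrent writers taking it serialize
LOCK TABLE the_table IN SHARE UPDATE EXCLUSIVE MODE;
INSERT INTO the_table (code, c)
SELECT 'xx', COALESCE(MAX(c), 0) + 1 FROM the_table WHERE code = 'xx';
COMMIT;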
sometimes I'd like there to be multiple counter "value groups" (as with the CODE mentioned) : with different values for CODE the counters would be independent.
This leads me to the Right Solution for this problem: sequences. Set up several sequences, one for each "value group" for which you want an independent range of IDs. See Section 9.15 of The Manual for the details of sequences and how to use them; it looks like they're a perfect fit for you. Sequences will never give the same value twice, but might skip values: if a transaction gets the value '2' from a sequence and aborts, the next transaction will get the value '3' rather than '2'.
The sequence answer is common, but might not be right. The viability of this solution depends on what you actually need. If what you semantically want is "some guaranteed to be unique number" then that is what a sequence is for. However, if what you want is to make sure that your value increases by exactly one on each insert (as you have asked), then DO NOT USE A SEQUENCE! I have run into this trap before myself. Sequences are not guaranteed to be sequential! They can skip numbers. Depending on what sort of optimizations you have configured, they can skip LOTS of numbers. Even if you have things configured just right so that you shouldn't skip any numbers, that is not guaranteed, and is not what sequences are for. So, you are only asking for trouble if you (mis)use them like that.
A slightly better solution is to bundle the select into the insert, like so:
INSERT INTO table(code, c, ...)
VALUES ('XX', (SELECT MAX(c) + 1 AS c FROM table WHERE code = 'XX'), ...);
(I haven't test run that query, but I'm pretty sure it should work. My apologies if it doesn't.) But, doing something like that reflects the semantic intent of what you are trying to do. However, this is inefficient, because you have to do a scan for MAX, and the inference I am taking from your sample is that you have a small number of code values relative to the size of the table, so you are going to do an expensive, full table scan on every insert. That isn't good. Also, this doesn't even get you the ACID guarantee you are looking for. The select is not transactionally tied to the insert. You can't "lock" the result of the MAX() function. So, you could still have two transactions running this query and they both do the sub-select and get the same max, both add one, and then both try to insert. It's a much smaller window, but you may still technically have a race condition here.
Ultimately, I would challenge that you probably have the wrong data model if you are trying to increment on insert. You should insert with a unique key, most commonly a sequence value (at least as an easy, surrogate key for any natural key). That gets the data safely inserted. Then, if you need a count of things, then have one table that stores your counts.
CREATE TABLE code_counts (
code VARCHAR(2), --or whatever
count NUMBER
);
If you really want to store the code count of each item as it is inserted, the separate count table also allows you to do so correctly, transactionally, like so:
UPDATE code_counts SET count = count + 1 WHERE code = 'XX' RETURNING count INTO :count;
INSERT INTO table(code, c, ...) VALUES ('XX', :count, ...);
COMMIT;
The key is that the update locks the counter table and reserves that value for you. Then your insert uses that value. And all of that is committed as one transactional change. You have to do this in a transaction. Having a separate count table avoids the full table scan of doing SELECT MAX(). In essence, what this does is re-implement a sequence, but it also guarantees you sequential, ordered use.
Without knowing your whole problem domain and data model, it is hard to say, but abstracting your counts out to a separate table like this where you don't have to do a select max to get the right value is probably a good idea. Assuming, of course, that a count is what you really care about. If you are just doing logging or something where you want to make sure things are unique, then use a sequence, and a timestamp to sort by.
Note that I'm saying not to sort by a sequence either. Basically, never trust a sequence to be anything other than unique. Because when you get to caching sequence values on a multi-node system, your application might even consume them out of order.
This is why you should use the Serial datatype, which defers the lookup of C to the time of insert (which uses table locks, I presume). You would then not specify C; it would be generated automatically. If you need C for some intermediate calculation, you would need to save first, then read C, and finally update with the derived values.
Edit: Sorry, I didn't read your whole question. What about solving your other problems with normalization? Just create a second table for each specific type (for each x where A='x'), where you have another auto-increment. Manually edited sequences could be another column in the same table, which uses the generated sequence as a base (i.e. if pk = 34 you can have another column mypk='34Changed').
You can create a sequential column by using a sequence as the default value:
First, you have to create the sequence counter:
CREATE SEQUENCE SEQ_TABLE_1 START WITH 1 INCREMENT BY 1;
So, you can use it as default value:
CREATE TABLE T (
COD NUMERIC(10) DEFAULT NEXTVAL('SEQ_TABLE_1') NOT NULL,
column1...
column2...
);
Now you don't need to worry about sequence on inserting rows:
INSERT INTO T (collumn1, collumn2) VALUES (value1, value2);
Regards.
