Fastest access to a collection in PL/SQL for updating values associated with a record - Oracle

I'm trying to cache data in memory for fast access on 3 different keys in PL/SQL. The problem doesn't really exist in any language that has pointers, but I'm struggling with PL/SQL since it has none that I'm aware of. I need to do this because I have a very large looping function that updates data in a very fine-grained way and would otherwise take an eternity.
The basic idea is that I have a collection in memory sorted by key_1. I want to make changes to a value of the first record, which would influence the key_1 value of the record itself and several specific values of other records in the collection: those that have the same key_2 and share any key_3 values with the record I modified. After the modification I'd just bubble-sort the modified first row to its place instead of using a time-consuming query.
So basically a record looks like this:
create type t_num_tbl is table of number;
create type rec_type as object
(
  key_1 number,
  key_2 varchar2(30),
  key_3 t_num_tbl
);
and the collection is like this:
create type rec_typetbl is table of rec_type;
v_rectbl rec_typetbl := rec_typetbl();
If I modify a record, I have to issue a SELECT/UPDATE that looks something like this to be able to modify the associated records:
SELECT *
FROM   table(v_rectbl) t
WHERE  t.key_2 = modifiedrec.key_2
AND    (SELECT count(*)
        FROM   table(t.key_3)
        JOIN   table(modifiedrec.key_3) USING (column_value)) > 1;
The main problem here is that the data is not indexed in memory, so access is just not fast enough for my purpose.
Are there any solutions in PL/SQL that could compare to the performance of using a pointer array in a record pointing to the associated elements of the collection? The associations are known beforehand, since the key_2 and key_3 values don't change.

First, I recommend against your design; I would rather see you use the RDBMS the way it is designed (i.e. with indexed access).
Having said that, every Oracle table has a ROWID pseudo-column that is effectively a pointer to a row (it is how indexes internally reference specific rows in a table). If you have the record, you can save the ROWID in your data structure to get back to it quickly (don't persist a ROWID long term, as Oracle can change it when tables/rows are reorganized).
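For illustration, here is a minimal sketch of that idea in PL/SQL, caching ROWIDs in an associative array (the table my_table and key column key_2 are hypothetical):
DECLARE
  -- hypothetical cache: business key -> ROWID, for fast re-access
  TYPE t_rowid_map IS TABLE OF ROWID INDEX BY VARCHAR2(30);
  v_rowids t_rowid_map;
  v_row    my_table%ROWTYPE;
BEGIN
  -- build the cache once up front
  FOR r IN (SELECT ROWID rid, key_2 FROM my_table) LOOP
    v_rowids(r.key_2) := r.rid;
  END LOOP;
  -- later: jump straight to a row by its saved ROWID, Oracle's fastest access path
  SELECT * INTO v_row FROM my_table WHERE ROWID = v_rowids('SOME_KEY');
END;
/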

Related

PL/SQL: Looping through a list string

Please forgive me if I open a new thread about looping in PL/SQL, but after reading dozens of existing ones I'm still not able to do what I'd like.
I need to run a complex query on a view of a table, and the only way to shorten the running time is to filter through a WHERE clause based on a variable on which the table is indexed (otherwise the system ends up doing a full scan of the table, which runs endlessly).
The variable the table is indexed on is store_id (a string).
I can retrieve all the store_id I want to query from a separate table:
e.g select distinct store_id from store_anagraphy
Then I'd like to make a loop that iterates queries with the store_id values identified above
e.g select *complex query from view_of_sales where store_id = 'xxxxxx'
and append (union) all the results returned by each of these queries.
Thank you very much in advance.
Gianluca
In theory, you could write a pipelined table function that ran multiple queries in a loop and made a series of pipe row calls to return the results. That would be pretty unusual but it could be done.
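For what it's worth, a minimal sketch of that unusual approach, reusing view_of_sales and store_anagraphy from the question (the object and collection types, and the amount column, are hypothetical):
CREATE OR REPLACE TYPE sales_row AS OBJECT (store_id VARCHAR2(20), amount NUMBER);
/
CREATE OR REPLACE TYPE sales_tab AS TABLE OF sales_row;
/
CREATE OR REPLACE FUNCTION all_store_sales RETURN sales_tab PIPELINED IS
BEGIN
  -- one query per store, results streamed back as a single rowset
  FOR s IN (SELECT DISTINCT store_id FROM store_anagraphy) LOOP
    FOR r IN (SELECT store_id, amount FROM view_of_sales WHERE store_id = s.store_id) LOOP
      PIPE ROW (sales_row(r.store_id, r.amount));
    END LOOP;
  END LOOP;
  RETURN;
END;
/
-- usage: SELECT * FROM TABLE(all_store_sales);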
It would be far, far more common, however, to simply combine the two queries and run a single query that returns all the rows you want:
select something
  from your_view
 where store_id in (select distinct store_id
                      from store_anagraphy)
If you are saying that you have tried this query and Oracle is choosing to do a table scan rather than using the index, then what you really have is a tuning problem. Most likely, statistics on one or more objects are inaccurate, which leads Oracle to expect that this query will return more rows than it really will, thus favoring the table scan. You should be able to fix that by fixing the statistics on the objects. In a pinch, you could also use hints to force an index to be used.
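For example (a sketch; the owner, table, and index names are hypothetical):
-- refresh optimizer statistics so cardinality estimates are realistic
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'APP_OWNER', tabname => 'SALES');
END;
/
-- as a last resort, force the index with a hint
SELECT /*+ INDEX(s sales_store_id_idx) */ *
FROM sales s
WHERE store_id IN (SELECT store_id FROM store_anagraphy);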

Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single-column or composite:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table? What is the best way to solve this when using Cassandra?
Example Scenario
Let's say I have a simple table containing posts that users have written :
CREATE TABLE posts (
    username varchar,
    creation timestamp,
    content varchar,
    PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL; I just want to know the correct/durable/efficient way of doing this.
SELECT * FROM posts ORDER BY creation; will result in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement "I need to get all posts regardless of the username, in order of time" is very hard to achieve in a distributed system; it would require you to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1 requires a full table scan. Indeed, as long as you don't fetch all records, the ordering cannot be achieved, unless you use a Cassandra clustering column to order at insertion time. But in that case it means that all posts are stored in the same partition, and this partition will grow forever ...
The query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930).
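A sketch of what such a materialized view could look like (Cassandra 3.0+; the view name is illustrative):
CREATE MATERIALIZED VIEW posts_by_content AS
    SELECT username, content, creation
    FROM posts
    WHERE username IS NOT NULL AND content IS NOT NULL AND creation IS NOT NULL
    PRIMARY KEY ((username), content, creation);
-- SELECT * FROM posts_by_content WHERE username = 'luke';
-- returns luke's posts clustered (i.e. sorted) by content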
Question 1:
Depending on your use case, I bet you could model this with time buckets, based on the range of times you're interested in.
You can do this by making the primary key a year, year-month, or year-month-day, depending on your use case (or finer time intervals).
The basic idea is that you bucket changes into whatever suits your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
    creation_year int,
    creation_month int,
    creation_day int,
    creation timeuuid,
    username text,  -- using text instead of varchar; they're essentially the same
    content text,
    PRIMARY KEY ((creation_year, creation_month, creation_day), creation)
);
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can insert rows, deriving the partition key (PK) columns creation_year, creation_month, creation_day from the creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update1');
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update2');
now() is a CQL function that generates a timeUUID; you would probably want to generate this in the application instead, parse out the yyyy-mm-dd for the PK, and then insert the timeUUID into the clustering column.
For a usage example with this table, let's say you wanted to see all of the changes made today; your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 17:00-0600');
minTimeuuid() is another CQL function; it will create the smallest possible timeUUID for the given time, which guarantees that you get all of the changes from that time on.
Depending on the time spans, you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also, you would want to change the creation column of your other table to a timeuuid as well.
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly, if you're not on Cassandra 3.x+ or don't want to use materialized views, you can use atomic batches to ensure data consistency across your several denormalized tables (that's what they were designed for). So in your case it would be a BATCH statement with 3 inserts of the same data to 3 different tables that support your query patterns.
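A sketch of such a batch, reusing the tables from this thread plus a hypothetical content-ordered copy posts_by_content (toTimestamp() assumes Cassandra 2.2+):
BEGIN BATCH
    INSERT INTO posts (username, creation, content)
        VALUES ('luke', toTimestamp(now()), 'hello world');
    INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content)
        VALUES (2016, 4, 2, now(), 'luke', 'hello world');
    INSERT INTO posts_by_content (username, content, creation)
        VALUES ('luke', 'hello world', toTimestamp(now()));
APPLY BATCH;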
The solution is to create additional tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need a special column to group by, maybe by month and year, e.g. PRIMARY KEY ((year, month), timestamp). This way Cassandra will have better read performance because it doesn't need to scan the whole cluster to get all the data, and it saves on data transfer between nodes too.
The same goes for SELECT * FROM posts WHERE username='luke' ORDER BY content;: you must create another table for this query too. All the columns may be the same as in your first table, but with a different primary key, because you cannot order by a column that is not a clustering column.
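A sketch of such a content-ordered copy (the extra creation clustering column avoids overwriting two posts that happen to share the same content):
CREATE TABLE posts_by_content (
    username varchar,
    content varchar,
    creation timestamp,
    PRIMARY KEY ((username), content, creation)
);
-- SELECT * FROM posts_by_content WHERE username = 'luke';
-- rows come back sorted by content within the partition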

Best way to identify a handful of records expected to have a flag set to TRUE

I have a pretty wide table that I expect to receive 7 million records a month. A small portion of these records are expected to be flagged as "problem" records.
What is the best way to implement the table to locate these records in an efficient way?
I'm new to Oracle, but is a materialized view a valid option? Are there such things in Oracle as indexed views, or is this potentially really the same thing?
Most of the reporting is by month, so partitioning by month seems like an option, but a "problem" record may theoretically linger for several months. Otherwise, the reporting should be mostly for the current month. Would you expect that querying across all month partitions to locate any problem record would cause significant performance issues compared to using a single table?
Your general thoughts on where to start would be appreciated. I realize I need to read up and I'll do that, but I wanted to get the community's thoughts first to make sure I read the right stuff.
One more thought: the primary key is a GUID varchar2(36). In terms of order of magnitude, how much of a performance hit would you expect this to be relative to using a NUMBER data type PK? This worries me, but it is out of my control.
It depends what you mean by "flagged", but it sounds to me like you would benefit from a simple index, a function-based index, or an indexed virtual column.
In all cases you should be careful to ensure that all the index columns are NULL for rows that do not need to be flagged. This way your index will contain only the rows that are flagged (Oracle does not - by default - index rows in B-Tree indexes where all index column values are NULL).
Your primary key being a VARCHAR2 GUID should make no difference, at least with regards to the specific flagging of rows in this question, indexes will point to rows via Oracle internal ROWIDs.
Indexes support partitioning, so if your data is already partitioned, your index could be set to match.
Simple column index method
If you can dictate how the flagging works, or the column already exists, then I would simply add an index to it like so:
CREATE INDEX my_table_problems_idx ON my_table (problem_flag)
/
Function-based index method
If the data model is fixed / there is no flag column, then you can create a function-based index assuming that you have all the information you need in the target table. For example:
CREATE INDEX my_table_problems_fnidx ON my_table (
CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END
)
/
Now if you use the same logic in your SELECT statement, you should find that it uses the index to efficiently match rows.
SELECT *
FROM my_table
WHERE CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END IS NOT NULL
/
This is a bit clunky though, and it requires you to use the same logic in queries as in the index definition. Not great. You could use a view to mask this, but you're still duplicating logic in at least two places.
Indexed virtual column
In my opinion, this is the best way to do it if you are computing the value dynamically (available from 11g onwards):
ALTER TABLE my_table
ADD virtual_problem_flag VARCHAR2(1) AS (
CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END
)
/
CREATE INDEX my_table_problems_idx ON my_table (virtual_problem_flag)
/
Now you can just query the virtual column as if it were a real column, i.e.
SELECT *
FROM my_table
WHERE virtual_problem_flag = 'Y'
/
This will use the index and puts the function-based logic into a single place.
Create a new table with just the PKs of the problem rows.
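A minimal sketch of that approach, assuming the wide table is my_table with a guid primary key (all names hypothetical):
CREATE TABLE problem_rows (
  guid VARCHAR2(36) NOT NULL,
  CONSTRAINT problem_rows_pk PRIMARY KEY (guid),
  CONSTRAINT problem_rows_fk FOREIGN KEY (guid) REFERENCES my_table (guid)
)
/
-- reporting joins the small table back to the wide one
SELECT t.*
FROM my_table t
JOIN problem_rows p ON p.guid = t.guid
/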

T-SQL Performance dilemma when looping through a Big table (details inside)

Let's say I have a Big and a Bigger table.
I need to cycle through the Big table, that is indexed but not sequential (since it is a filter of a sequentially indexed Bigger table).
For this example, let's say I needed to cycle through about 20000 rows.
Should I do 20000 of these
set @currentID = (select min(ID) from myData where ID > @currentID)
or
Creating a (big) temporary, sequentially indexed table (a copy of the Big table) and doing 20000 of
set @Row = @Row + 1
?
I imagine that doing 20000 filters of the Bigger table just to fetch the next ID is heavy, but so must be filling a big (Big sized) temporary table just to add a dummy identity column.
Is the solution somewhere else?
For example, if I could loop through the results of the select statement (the filter of the Bigger table that originates "table" (actually a resultset) Big) without needing to create temporary tables, it would be ideal, but I seem to be unable to add something like an IDENTITY(1,1) dummy column to the results.
Thanks!
You may want to consider finding out how to do your work set-based instead of RBAR (row by agonizing row). That said, for very big tables you may want to avoid a temp table, so that you are sure you have live data if you suspect the proc may run for a while in production. If your proc fails, you'll be able to pick up where you left off; if you use a temp table, then if your proc crashes you could lose the work that hasn't been completed yet.
You need to provide more information on what your end result is. It is only very rarely necessary to do row-by-row processing, and it is almost always the worst possible choice from a performance perspective. This article will get you started on how to do many tasks in a set-based manner:
http://wiki.lessthandot.com/index.php/Cursors_and_How_to_Avoid_Them
If you just want a temp table with an identity, here are two methods:
-- method 1: create the temp table explicitly with an identity column
create table #temp (test varchar(10), id int identity)
insert #temp (test)
select test from mytable

-- method 2: SELECT ... INTO with the IDENTITY() function
select test, identity(int) as id into #temp from mytable
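Alternatively (a sketch; myData stands in for the filtered Big resultset), on SQL Server 2005+ a windowed ROW_NUMBER() produces the sequential dummy column directly, with no temp table at all:
-- number the filtered rows 1..N in ID order
select ROW_NUMBER() over (order by d.ID) as RowNum, d.*
from myData d
order by RowNum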
I think a join will serve your purposes better.
SELECT BIG.*, BIGGER.* -- add additional calcs here involving BIG and BIGGER
FROM TableBig BIG WITH (NOLOCK)
JOIN TableBigger BIGGER WITH (NOLOCK)
    ON BIG.ID = BIGGER.ID
This will limit the set you are working with. But again, it comes down to the specifics of your solution.
Remember that you can do bulk inserts and bulk updates in this manner too.
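For instance, a sketch of a set-based bulk update in the same join style (SomeColumn is hypothetical):
UPDATE BIG
SET BIG.SomeColumn = BIGGER.SomeColumn
FROM TableBig BIG
JOIN TableBigger BIGGER
    ON BIG.ID = BIGGER.ID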

Can Oracle types be updated like tables?

I am converting GTTs to Oracle types as explained in an excellent answer by APC. However, some GTTs are being updated based on a select query from another table. For example:
UPDATE my_gtt_1 c
SET (street, city, state, zip) = (SELECT src.unit_address,
                                         src.unit_city,
                                         src.unit_state,
                                         src.unit_zip_code
                                  FROM (SELECT mbr.ROWID row_id,
                                               unit_address,
                                               RTRIM(a.unit_city) unit_city,
                                               RTRIM(a.unit_state) unit_state,
                                               RTRIM(a.unit_zip_code) unit_zip_code
                                        FROM table_1 b,
                                             table_2 a,
                                             my_gtt_1 mbr
                                        WHERE type = 'ABC'
                                        AND id = b.ssn_head
                                        AND a.h_id = b.h_id
                                        AND row_id >= v_start_row
                                        AND row_id <= v_end_row) src
                                  WHERE c.ROWID = src.row_id)
WHERE state IS NULL
   OR state = ' ';
If my_gtt_1 were not a global temporary table but an Oracle collection type, would it be possible to do updates this complex? Or in these cases are we better off using the global temporary table?
You cannot perform set-based UPDATE operations on object types. You will have to do it row by row, as in:
FOR i IN l_tab.FIRST .. l_tab.LAST LOOP
  SELECT src.unit_address,
         src.unit_city,
         src.unit_state,
         src.unit_zip_code
    INTO l_tab(i).street,
         l_tab(i).city,
         l_tab(i).state,
         l_tab(i).zip
    FROM (your_query) src;
END LOOP;
You should therefore try to do all computations at creation time (when you can BULK COLLECT). Obviously, if your process needs many steps, you might find that a global temporary table outperforms an in-memory structure.
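For example, a minimal sketch of loading the whole collection in one round trip at creation time (the record field sizes are illustrative):
DECLARE
  TYPE t_addr_rec IS RECORD (
    street VARCHAR2(100),
    city   VARCHAR2(50),
    state  VARCHAR2(30),
    zip    VARCHAR2(10)
  );
  TYPE t_addr_tab IS TABLE OF t_addr_rec;
  l_tab t_addr_tab;
BEGIN
  -- one fetch instead of row-by-row round trips
  SELECT a.unit_address, RTRIM(a.unit_city), RTRIM(a.unit_state), RTRIM(a.unit_zip_code)
    BULK COLLECT INTO l_tab
    FROM table_2 a;
END;
/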
From the last questions you have asked, it seems you are trying to replace all global temporary tables with object tables. I would suggest caution, because in general they are not interchangeable:
Object tables are in-memory structures: you don't want to load a million+ row table into memory. They are mainly used as a buffer: you load a few rows (100, for example) into the structure, perform what you need to do with these rows, then load the next batch. You cannot easily treat this structure as a regular table: for example, you can only search this structure efficiently by the standard indexing key (you cannot search by rowid in your example unless you define the structure to be indexed by rowid).
Temporary tables, on the other hand, are very similar to ordinary tables. You can load millions of rows into them and perform joins and complex set operations. You can also index the temporary table for further optimization.
In my opinion, the change you are trying to conduct will take a massive overhaul of your logic, and it may not perform better. In general, you would not replace GTTs with object tables. You may be able to remove GTTs with a significant gain in performance by using set operations directly (performing massive UPDATE/DELETE/INSERT statements on your data directly, without a staging table).
I would suggest performing benchmarks before choosing a solution (this is probably what you are doing right now :)
I think this part of APC's answer to your previous question is relevant here:
Global temporary tables are also good if we have a lot of intermediate processing which is just too complicated to be solved with a single SQL query. Especially if that processing must be applied to subsets of the retrieved rows.
You cannot update the in-memory data with an UPDATE statement as you can with a GTT; you would need to write procedural code to locate and change the array elements in question.
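For example, continuing with the l_tab collection from the loop above, a sketch mirroring the WHERE clause of your UPDATE:
FOR i IN 1 .. l_tab.COUNT LOOP
  IF l_tab(i).state IS NULL OR l_tab(i).state = ' ' THEN
    l_tab(i).street := NULL;  -- assign the looked-up address fields here
  END IF;
END LOOP;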
