Delta Detection in Oracle table with 2 billion records using composite key

Delta Detection in Oracle table with 2 billion records using composite key - oracle

Initial Load on Day 1
id
key
fkid
1
0
100
1
1
200
2
0
300
Load on Day 2
id
key
fkid
1
0
100
1
1
200
2
0
300
3
1
400
4
0
500
Need to find delta records
Load on Day 2
id
key
address
3
1
400
4
0
500
Problem Statement
Need to find delta records in minimum time with following facts
1: I have to process around 2 billion records initially from a table as mentioned below
2: Also need to find delta with minimal time so that I can process it quickly
Questions :
1: Will it be a time consuming process to identify delta especially during production downtime ?
2: How long should it take to identify delta with 3 numeric columns in a table out of which
id & key forms a composite key.
Solution tried :
1: Use full join and extract delta with case nvl condition but looks to be costly.
nvl(node1.id, node2.id) id,
nvl(node1.key, node2.key) key,
nvl(node1.fkid, node2.fkid) fkid
FROM
TABLE_DAY_1 node1
FULL JOIN TABLE_DAY_2 node2 ON node2.id = node1.id
WHERE
node2.id IS NULL
OR node1.id IS NULL;```

You need two separate statements to handle this, one to detect new & changed rows, a separate one to detect deleted rows.
While it is cumberson to write, the fastest comparison is field-by-field, so:
SELECT /*+ parallel(8) full(node1) full(node2) USE_HASH(node1 node) */ *
FROM table_day_1 node1,
table_day_2 node2
WHERE node1.id = node2.id(+)
AND (node2.id IS NULL -- new rows
OR node1.col1 <> node2.col2 -- changed val on non-nullable col
OR NVL(node1.col3,' ') <> NVL(node2.col3,' ') -- changed val on nullable string
OR NVL(node1.col4,-1) <> NVL(node2.col4,-1) -- changed val on nullable numeric, etc..
)
Then for deleted rows:
SELECT /*+ parallel(8) full(node1) full(node2) USE_HASH(node1 node) */ node2.id
FROM table_day_1 node1,
table_day_2 node2
WHERE node1.id(+) = node2.id
AND node1.id IS NULL -- deleted rows
You will want to make sure Oracle does a full table scan. If you have lots of CPUs and parallel query is enabled on your database, make sure the query uses parallel query (hence the hint). And you want a hash join between them. Work with your DBA to ensure you have enough temporary space to pull this off, and enough PGA to at least handle this with a single pass workarea rather than multipass.

Related

Clickhouse query with a LIMIT clause inefficiently reads too many rows

I'm querying Clickhouse with a query that has ORDER BY and LIMIT 1, and the ORDER BY matches the table's sort order. The query returns 1 row as expected, however, 50+ rows were scanned to return the result.
I would expect ClickHouse to scan only 1 row as the ORDER BY is in the table's sort order. What's happening here and what can I do to fix this?
SELECT * FROM comp_intel_scrapes
order by
client_slug,
client_hotel_id,
argset_id,
scrape_datetime,
preferred_country,
preferred_currency,
adults,
children,
nights,
min_checkin_date,
max_checkin_date
limit 1
----
Elapsed: 0.004s
Read: 54 rows (8.84KB)
By the way, Clickhouse.com's cloud is being used here.

It depends on a table engine.
Primary index is sparse https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes/sparse-primary-indexes-design/
Because of this CH is unable to read less than one granule ~8192 rows.

Using oracle loop to concatanete strings

I have someting like this
id day descrition
1 1 hi
1 1 today
1 1 is a beautifull
1 1 day
1 2 exemplo
1 2 for
1 2 this case
I need to do a funtion that for each day concatenate the descrtiomn colunm and return the result like this
id day descrition
1 1 hi today is a beautifull thay
1 2 exemplo for this case
Anny ideia about how can i do this usisng a loop in a function in oracle

You need a way of determining which order the values should be aggregated. The snippet below will rely on the implicit order in which Oracle reads the rows from the datafiles - if you have row movement enabled then you may get inconsistent results as the rows can be read in different orders as they are relocated in the underlying datafiles.
SELECT LISTAGG( description, ' ' ) WITHIN GROUP ( ORDER BY ROWNUM ) AS description
FROM your_table
GROUP BY id, day
It would be better to have another column that stores the order within each day.

Cassandra Timing out because of TTL expiration

Im using a DataStax Community v 2.1.2-1 (AMI v 2.5) with preinstalled default settings+ increased read time out to 10sec here is the issue
create table simplenotification_ttl (
user_id varchar,
real_time timestamp,
insert_time timeuuid,
read boolean,
msg varchar, PRIMARY KEY (user_id, real_time, insert_time));
Insert Query:
insert into simplenotification_ttl (user_id, real_time, insert_time, read)
values ('test_3',14401440123, now(),false) using TTL 800;
For same 'test_3' I inserted 33,000 tuples. [This problem does not happen for 24,000 tuples]
Gradually i see
cqlsh:notificationstore> select count(*) from simplenotification_ttl where user_id = 'test_3';
count
-------
15681
(1 rows)
cqlsh:notificationstore> select count(*) from simplenotification_ttl where user_id = 'test_3';
count
-------
12737
(1 rows)
cqlsh:notificationstore> select count(*) from simplenotification_ttl where user_id = 'test_3';
**errors={}, last_host=127.0.0.1**
I have experimented this many times even on different tables. Once this happens, even if i insert with same user_id and do a retrieval with limit 1. It times out.
I require TTL to work properly ie give count 0 after speculated time. How to solve this issue?
Thanks
[My other node related setup is using m3.large with 2 nodes EC2Snitch]

You're running into a problem where the number of tombstones (deleted values) is passing a threshold, and then timing out.
You can see this if you turn on tracing and then try your select statement, for example:
cqlsh> tracing on;
cqlsh> select count(*) from test.simple;
activity | timestamp | source | source_elapsed
---------------------------------------------------------------------------------+--------------+--------------+----------------
...snip...
Scanned over 100000 tombstones; query aborted (see tombstone_failure_threshold) | 23:36:59,324 | 172.31.0.85 | 123932
Scanned 1 rows and matched 1 | 23:36:59,325 | 172.31.0.85 | 124575
Timed out; received 0 of 1 responses for range 2 of 4 | 23:37:09,200 | 172.31.13.33 | 10002216
You're kind of running into an anti-pattern for Cassandra where data is stored for just a short time before being deleted. There are a few options for handling this better, including revisiting your data model if needed. Here are some resources:
The cassandra.yaml configuration file - See section on tombstone settings
Cassandra anti-patterns: Queues and queue-like datasets
About deletes
For your sample problem, I tried lowering the gc_grace_seconds setting to 300 (5 minutes). That causes the tombstones to be cleaned up more frequently than the default 10 days, but that may or not be appropriate based on your application. Read up on the implications of deletes and you can adjust as needed for your application.

oracle connect by multiple parents

I am facing an issue using connect by.
I have a query through which I retrieve a few columns including these three:
ID
ParentID
ObjectID
Now for the same ID and parentID, there are multiple objects associated e.g.
ID ParentID ObjectID
1 0 112
1 0 113
2 0 111
2 0 112
3 1 111
4 1 112
I am trying to use connect by but I'm unable to get the result in a proper hierarchy. I need it the way it is showed below. Take an ID-parentID combo, display all rows with that ID-parentID and then all the children of this ID i.e. whose parentID=ID
ID ParentID ObjectID
1 0 112
1 0 113
3 1 111
4 1 112
2 0 111
2 0 112
select ID,parent_id, object_id from table start with parent_id=0
connect by prior id=parent_id order by id,parent_id
Above query is not resulting into proper hierarchy that i need.

Well, your problem appears to be that you are using a non-normalized table design. If a given ID always has the same ParentID, that relationship shouldn't be indicated separately in all these rows.
A better design would be to have a single table showing the parent child relationships, with ID as a primary key, and a second table showing the mappings of ID to ObjectID, where I presume both columns together would comprise the primary key. Then you would apply your hierarchical query against the first table, and join the results of that to the other table to get the relevant objects for each row.
You can emulate this with your current table structure ...
with parent_child as (select distinct id, parent_id from table),
tree as (select id, parent_id from parent_child
start with parent_id = 0
connect by prior id = parent_id )
select id, table.parent_id, table.object_id
from tree join table using (id)

Here's a script that runs. Not ideal but will work -
select * from (select distinct test.id,
parent_id,
object_id,
connect_by_root test.id root
from test
start with test.parent_id = 0
connect by prior test.id = parent_id)
order by root,id

First of all Thanks to all who tried helping me.
Finally i changed my approach as applying hierarchy CONNECT BY clause to inner queryw ith multiple joins was not working for me.
I took following approach
Get the hierarchical data from First table i.e. table with ID-ParentID. Select Query table1 using CONNECT BY. It will give the ID in proper sequence.
Join the retrieved List of ID.
Pass the above ID as comma seperated string in select query IN Clause to second table with ID-ObjectID.
select * from table2 where ID in (above Joined string of ID) order by
instr('above Joined string of ID',ID);
ORDER BY INSTR did the magic. It will give me the result ordered by the IN Clause data and IN Clause string is prepared using the hierarchical query. Hence it will obviously be in sequence.
Again Thanks all for the help!
Note: Above approach has one constraint : ID passed as comma separated string in IN Clause. IN Clause has a limit of characters inside it. I guess 1000 chars. Not sure.
But as i am sure on the data of First table that it will not be so much so as to cross limit of 1000 chars. Hence i chose above approach.

How to otimize select from several tables with millions of rows

Have the following tables (Oracle 10g):
catalog (
id NUMBER PRIMARY KEY,
name VARCHAR2(255),
owner NUMBER,
root NUMBER REFERENCES catalog(id)
...
)
university (
id NUMBER PRIMARY KEY,
...
)
securitygroup (
id NUMBER PRIMARY KEY
...
)
catalog_securitygroup (
catalog REFERENCES catalog(id),
securitygroup REFERENCES securitygroup(id)
)
catalog_university (
catalog REFERENCES catalog(id),
university REFERENCES university(id)
)
Catalog: 500 000 rows, catalog_university: 500 000, catalog_securitygroup: 1 500 000.
I need to select any 50 rows from catalog with specified root ordered by name for current university and current securitygroup. There is a query:
SELECT ccc.* FROM (
SELECT cc.*, ROWNUM AS n FROM (
SELECT c.id, c.name, c.owner
FROM catalog c, catalog_securitygroup cs, catalog_university cu
WHERE c.root = 100
AND cs.catalog = c.id
AND cs.securitygroup = 200
AND cu.catalog = c.id
AND cu.university = 300
ORDER BY name
) cc
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;
Where 100 - some catalog, 200 - some securitygroup, 300 - some university. This query return 50 rows from ~ 170 000 in 3 minutes.
But next query return this rows in 2 sec:
SELECT ccc.* FROM (
SELECT cc.*, ROWNUM AS n FROM (
SELECT c.id, c.name, c.owner
FROM catalog c
WHERE c.root = 100
ORDER BY name
) cc
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;
I build next indexes: (catalog.id, catalog.name, catalog.owner), (catalog_securitygroup.catalog, catalog_securitygroup.index), (catalog_university.catalog, catalog_university.university).
Plan for first query (using PLSQL Developer):
http://habreffect.ru/66c/f25faa5f8/plan2.jpg
Plan for second query:
http://habreffect.ru/f91/86e780cc7/plan1.jpg
What are the ways to optimize the query I have?

The indexes that can be useful and should be considered deal with
WHERE c.root = 100
AND cs.catalog = c.id
AND cs.securitygroup = 200
AND cu.catalog = c.id
AND cu.university = 300
So the following fields can be interesting for indexes
c: id, root
cs: catalog, securitygroup
cu: catalog, university
So, try creating
(catalog_securitygroup.catalog, catalog_securitygroup.securitygroup)
and
(catalog_university.catalog, catalog_university.university)
EDIT:
I missed the ORDER BY - these fields should also be considered, so
(catalog.name, catalog.id)
might be beneficial (or some other composite index that could be used for sorting and the conditions - possibly (catalog.root, catalog.name, catalog.id))
EDIT2
Although another question is accepted I'll provide some more food for thought.
I have created some test data and run some benchmarks.
The test cases are minimal in terms of record width (in catalog_securitygroup and catalog_university the primary keys are (catalog, securitygroup) and (catalog, university)). Here is the number of records per table:
test=# SELECT (SELECT COUNT(*) FROM catalog), (SELECT COUNT(*) FROM catalog_securitygroup), (SELECT COUNT(*) FROM catalog_university);
?column? | ?column? | ?column?
----------+----------+----------
500000 | 1497501 | 500000
(1 row)
Database is postgres 8.4, default ubuntu install, hardware i5, 4GRAM
First I rewrote the query to
SELECT c.id, c.name, c.owner
FROM catalog c, catalog_securitygroup cs, catalog_university cu
WHERE c.root < 50
AND cs.catalog = c.id
AND cu.catalog = c.id
AND cs.securitygroup < 200
AND cu.university < 200
ORDER BY c.name
LIMIT 50 OFFSET 100
note: the conditions are turned into less then to maintain comparable number of intermediate rows (the above query would return 198,801 rows without the LIMIT clause)
If run as above, without any extra indexes (save for PKs and foreign keys) it runs in 556 ms on a cold database (this is actually indication that I oversimplified the sample data somehow - I would be happier if I had 2-4s here without resorting to less then operators)
This bring me to my point - any straight query that only joins and filters (certain number of tables) and returns only a certain number of the records should run under 1s on any decent database without need to use cursors or to denormalize data (one of these days I'll have to write a post on that).
Furthermore, if a query is returning only 50 rows and does simple equality joins and restrictive equality conditions it should run even much faster.
Now let's see if I add some indexes, the biggest potential in queries like this is usually the sort order, so let me try that:
CREATE INDEX test1 ON catalog (name, id);
This makes execution time on the query - 22ms on a cold database.
And that's the point - if you are trying to get only a page of data, you should only get a page of data and execution times of queries such as this on normalized data with proper indexes should take less then 100ms on decent hardware.
I hope I didn't oversimplify the case to the point of no comparison (as I stated before some simplification is present as I don't know the cardinality of relationships between catalog and the many-to-many tables).
So, the conclusion is
if I were you I would not stop tweaking indexes (and the SQL) until I get the performance of the query to go below 200ms as rule of the thumb.
only if I would find an objective explanation why it can't go below such value I would resort to denormalisation and/or cursors, etc...

First I assume that your University and SecurityGroup tables are rather small. You posted the size of the large tables but it's really the other sizes that are part of the problem
Your problem is from the fact that you can't join the smallest tables first. Your join order should be from small to large. But because your mapping tables don't include a securitygroup-to-university table, you can't join the smallest ones first. So you wind up starting with one or the other, to a big table, to another big table and then with that large intermediate result you have to go to a small table.
If you always have current_univ and current_secgrp and root as inputs you want to use them to filter as soon as possible. The only way to do that is to change your schema some. In fact, you can leave the existing tables in place if you have to but you'll be adding to the space with this suggestion.
You've normalized the data very well. That's great for speed of update... not so great for querying. We denormalize to speed querying (that's the whole reason for datawarehouses (ok that and history)). Build a single mapping table with the following columns.
Univ_id, SecGrp_ID, Root, catalog_id. Make it an index organized table of the first 3 columns as pk.
Now when you query that index with all three PK values, you'll finish that index scan with a complete list of allowable catalog Id, now it's just a single join to the cat table to get the cat item details and you're off an running.

The Oracle cost-based optimizer makes use of all the information that it has to decide what the best access paths are for the data and what the least costly methods are for getting that data. So below are some random points related to your question.
The first three tables that you've listed all have primary keys. Do the other tables (catalog_university and catalog_securitygroup) also have primary keys on them?? A primary key defines a column or set of columns that are non-null and unique and are very important in a relational database.
Oracle generally enforces a primary key by generating a unique index on the given columns. The Oracle optimizer is more likely to make use of a unique index if it available as it is more likely to be more selective.
If possible an index that contains unique values should be defined as unique (CREATE UNIQUE INDEX...) and this will provide the optimizer with more information.
The additional indexes that you have provided are no more selective than the existing indexes. For example, the index on (catalog.id, catalog.name, catalog.owner) is unique but is less useful than the existing primary key index on (catalog.id). If a query is written to select on the catalog.name column, it is possible to do and index skip scan but this starts being costly (and most not even be possible in this case).
Since you are trying to select based in the catalog.root column, it might be worth adding an index on that column. This would mean that it could quickly find the relevant rows from the catalog table. The timing for the second query could be a bit misleading. It might be taking 2 seconds to find 50 matching rows from catalog, but these could easily be the first 50 rows from the catalog table..... finding 50 that match all your conditions might take longer, and not just because you need to join to other tables to get them. I would always use create table as select without restricting on rownum when trying to performance tune. With a complex query I would generally care about how long it take to get all the rows back... and a simple select with rownum can be misleading
Everything about Oracle performance tuning is about providing the optimizer enough information and the right tools (indexes, constraints, etc) to do its job properly. For this reason it's important to get optimizer statistics using something like DBMS_STATS.GATHER_TABLE_STATS(). Indexes should have stats gathered automatically in Oracle 10g or later.
Somehow this grew into quite a long answer about the Oracle optimizer. Hopefully some of it answers your question. Here is a summary of what is said above:
Give the optimizer as much information as possible, e.g if index is unique then declare it as such.
Add indexes on your access paths
Find the correct times for queries without limiting by rowwnum. It will always be quicker to find the first 50 M&Ms in a jar than finding the first 50 red M&Ms
Gather optimizer stats
Add unique/primary keys on all tables where they exist.

The use of rownum is wrong and causes all the rows to be processed. It will process all the rows, assigned them all a row number, and then find those between 0 and 50. When you want to look for in the explain plan is COUNT STOPKEY rather than just count
The query below should be an improvement as it will only get the first 50 rows... but there is still the issue of the joins to look at too:
SELECT ccc.* FROM (
SELECT cc.*, ROWNUM AS n FROM (
SELECT c.id, c.name, c.owner
FROM catalog c
WHERE c.root = 100
ORDER BY name
) cc
where rownum <= 50
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;
Also, assuming this for a web page or something similar, maybe there is a better way to handle this than just running the query again to get the data for the next page.

try to declare a cursor. I dont know oracle, but in SqlServer would look like this:
declare #result
table (
id numeric,
name varchar(255)
);
declare __dyn_select_cursor cursor LOCAL SCROLL DYNAMIC for
--Select
select distinct
c.id, c.name
From [catalog] c
inner join university u
on u.catalog = c.id
and u.university = 300
inner join catalog_securitygroup s
on s.catalog = c.id
and s.securitygroup = 200
Where
c.root = 100
Order by name
--Cursor
declare #id numeric;
declare #name varchar(255);
open __dyn_select_cursor;
fetch relative 1 from __dyn_select_cursor into #id,#name declare #maxrowscount int
set #maxrowscount = 50
while (##fetch_status = 0 and #maxrowscount <> 0)
begin
insert into #result values (#id, #name);
set #maxrowscount = #maxrowscount - 1;
fetch next from __dyn_select_cursor into #id, #name;
end
close __dyn_select_cursor;
deallocate __dyn_select_cursor;
--Select temp, final result
select
id,
name
from #result;

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio