I need to hold multiple pairs of 70,000 rows and perform a comparison difference like operation between them using a minus operator. At any time there could be comparisons (table scans).
I currently have one table with this sort of design:
primary key (sequenced)
foreign key to identify set
key to identify set #1 or set #2
then the data here i need to minus against
The data would look something like this
| PK | FK | Key | Data |
| 1 | 1 | Left | Some data |
| 1 | 1 | Left | Diff data |
| 1 | 1 | Right | Some data |
My query would be:
SELECT data
FROM diffTable
WHERE pk = 1
AND fk = 1
AND key = 'Left'
MINUS
SELECT data
FROM diffTable
WHERE pk = 1
AND fk = 1
AND key = 'Right'
I am fearing the full table scans will monopolise resources and subsequent inserts and minus' will suffer.
How should I design my tables and why?
create index PK_FK on diff_table
(PK, FK, Key);
The query you posted in your question would run very fast with this index.
Btw, the PK column is not, by itself, the primary key. See the other comments. Perhaps you want:
alter table diff_table
add constraint PK_FK primary key (PK, FK, Key);
maybe pick a better name...
Related
I have one to many relationships in parent and child but I want to insert in the child based on condition only but parent table always.
I think it would be helpful if you could provide us the data structure with example records like:
table 1
column1 | column 2 | column 3
value | value | value
table 2
column1 | column 2 | column 3
value | value | value
I'm running some memsql performance tests on sample data and have a very poor behavior while querying JSON data.
I have 2 tables looking very similar and containing exactly the same information (loaded from the same csv file). The difference is that the segments column is JSON vs varchar(255).
CREATE TABLE `test_events` (
`timestamp` datetime NOT NULL,
`user_id` int(20) NOT NULL,
`segments` JSON COLLATE utf8_bin NOT NULL,
KEY `timestamp` (`timestamp`) /*!90619 USING CLUSTERED COLUMNSTORE */,
/*!90618 SHARD */ KEY `user_id` (`user_id`)
CREATE TABLE `test_events_string` (
`timestamp` datetime NOT NULL,
`user_id` int(20) NOT NULL,
`segments` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
KEY `timestamp` (`timestamp`) /*!90619 USING CLUSTERED COLUMNSTORE */,
/*!90618 SHARD */ KEY `user_id` (`user_id`)
And an example data is (amount of items in array vary from 1 to 20):
memsql> select * from test_events limit 1;
+---------------------+---------+------------------------+
| timestamp | user_id | segments |
+---------------------+---------+------------------------+
| 2017-01-04 00:00:00 | 26834 | [19,18,9,6,7,22,34,43] |
+---------------------+---------+------------------------+
Below are 2 queries which fetch the same info, but speed is different. Both queries have been executed twice and I copy 2nd run:
memsql> select count(*) from test_events where json_array_contains_double(segments, 42);
+----------+
| count(*) |
+----------+
| 79312103 |
+----------+
1 row in set (15.86 sec)
memsql> select count(*) from test_events_string where segments like '%42%';
+----------+
| count(*) |
+----------+
| 79312103 |
+----------+
1 row in set (1.96 sec)
memsql> select count(*) from test_events;
+-----------+
| count(*) |
+-----------+
| 306939340 |
+-----------+
1 row in set (0.02 sec)
So the JSON scan is 8 times slower than a %x% LIKE. Is there something which can improve it?
Maybe you can advice how to solve that business logic problem with a different approach? Basically, we log events for users and for each event we want to attach an array of ids of some entities. That array is frequently changed during user's lifecycle. We want to run queries filtering by 1 or many ids, pretty much like an example above.
Just in case, some tech specs. 3 identical bare metal servers. 1 server is for aggregator, 2 for data. Each machine has NUMA, so 4 leaf nodes total. Fast SSDs, 32Cores (2 X E5-2650v2#2.60GHz), 32GB RAM.
I'm not surprised that this is slow.
MemSQL uses a parquet based compression for columnar json, and we don't have these sorts of fast lookups quite yet (but stay tuned!).
There are a few options.
One is, if you're always going to be searching for 42, you can use a persisted column (https://docs.memsql.com/docs/persistent-computed-columns).
This seems unlikely to be your use case.
The other option is, if you are always looking at the same array, you can create a normalized table (https://en.wikipedia.org/wiki/Database_normalization).
SOmething like
create table test_events_array (timestamp datetime not null, user_id bigint not null, segment bigint, shard(user_id), key(ts) using clustered columnstore)
then doing
select count(*) from test_events_array where segment=42 will be lightning fast.
It'll also compress down to almost nothing with that schema, probably.
I have scenario where i need to search & display records from huge tables with lots of rows. I have pre-defined search criteria for my tables for which user can provide the filter & click search .
Considering a sample table :
CREATE TABLE suppliers
( supplier_name varchar2(50) NOT NULL,
address varchar2(50),
city varchar2(50) NOT NULL,
state varchar2(25),
zip_code varchar2(10),
CONSTRAINT "suppliers_pk" PRIMARY KEY (supplier_name, city)
);
INSERT INTO suppliers VALUES ('ABCD','XXXX','YYYY','ZZZZ','95012');
INSERT INTO suppliers VALUES ('EFGH','MMMM','NNNN','OOOO','95010');
INSERT INTO suppliers VALUES ('IJKL','EEEE','FFFF','GGGG','95009');
I have provided the user with search fields as the primary key - supplier_name, city
If he enters both the fields, my query performance will be good since it goes for index scan
SELECT supplier_name, address, city, state, zip_code FROM suppliers where supplier_name = 'ABCD' and city = 'ZZZZ';
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 102 | 1 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| SUPPLIERS | 1 | 102 | 1 (0)| 00:00:01 |
|* 2 | INDEX UNIQUE SCAN | suppliers_pk | 1 | | 1 (0)| 00:00:01 |
However, if he enters only one of the search field, my query performance will go bad since it goes for full table scan
SELECT supplier_name, address, city, state, zip_code FROM suppliers where supplier_name = 'ABCD' ;
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 102 | 3 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| SUPPLIERS | 1 | 102 | 3 (0)| 00:00:01 |
Is there a way to force oracle to think it is a primary key search when i don't have all of the key fields in search , something like below ( which obviously is not working )
SELECT supplier_name, address, city, state, zip_code FROM suppliers where supplier_name = 'ABCD' and city = city;
Thanks.
You are thinking about this in the wrong way.
The query optimiser will choose what it thinks best execution plan for the query based on the information available at the time the query is parsed (or sometimes when the parameters changed). Generally - if you give it the right information in terms of stats etc, it usually will do a good job.
You might think that you know better than it, but remember that you won't be monitoring this for the life of the database. The data changes, you want the database to be able to react and change the execution plan when it needs to.
That said, if you are set on forcing it to use the index, you can use a hint:
SELECT /*+ INDEX(suppliers suppliers_pk) */
supplier_name, address, city, state, zip_code FROM suppliers where
supplier_name = 'ABCD' ;
A full table scan is not necessarily bad. You have only a few rows in your table, so the optimizer thinks it is better to do a FTS than an index range scan. It will start using the PK index a soon as the RDBMS thinks it is better, i.e. you have lots a rows and the restriction on a certain supplier reduces the result significantly. If you want to search on city only instead of supplier you need another index with city only (or at least starting with city). Keep in mind that you might have to update the table statistics after you have loaded your table with bulk data. It is always important to test query performance with somehow realistic amounts of data.
Index is organised first on supplier_name second on city so it is not possible to use that index for query based on city only.
Please create second index based only on city. This will help your query.
I have a table which has 70 columns, Where primary key is the combination of 15 columns (which includes number and varchar2) . Please see below query
select * from tab1 where k1=1234567889;
Plan hash value: 1179808636
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6044 | 2201K| 4585K (1)| 15:17:04 |
|* 1 | TABLE ACCESS FULL| tab1 | 6044 | 2201K| 4585K (1)| 15:17:04 |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 – filter ("K1"=30064825087)
Where tab1 is a table mentions above and k1 is a column which is part of primary key. Table is not partitioned. Table is also analyzed (table, index and columns) after data has been inserted. Output for above query returns like 100000 plus records. The problem is even after having PK on the k1 column, the query is doing full table scan, which is not acceptable. On the other hand using index hints does not really speed up the process.
Please advise what would be the possible solution.
For this query:
select *
from tab1
where k1 = 1234567889;
The best index is one that has k1 as the first key in the index. There can be a composite index, by k1 has to be the first key. It sounds like you have a composite primary key and k1 is not the first key.
I would recommend that you simply define another index:
create index idx_tab1_k1 on tab1(k1);
There are several ways to avoid a full-table scan
Indexes: Ensure that indexes exist on the key value and that the index has been analyzed with dbms_stats.
Use_nl hint: You can direct that the optimizer use a nested loops join (which requires indexes).
index hint: You can specify the indexes that you want to use.
I have many tables with large amount of data. The PK is the column (TAB_ID) which has data type RAW(16). I created the hash partitions with partition key having the TAB_ID column.
My issue is: the SQL statement (select * from my_table where tab_id = 'aas1df') does not use partition pruning. If I change the column datatype to varchar2(32), partition pruning works.
Why does not partition pruning work with partition key which have datatype RAW(16)?
I'm just guessing: try select * from my_table where 'aas1df' = tab_id.
Probably the datatype conversion works other way that expected. Anyway you should use the function UTL_RAW.CAST_TO_RAW.
Edited:
Is your table partitioned by TAB_ID? If yes, then there is something wrong with your design, you usually partition table by some useful business value, NOT by surrogate key.
If you know the PK value you do not need partition pruning at all. When Oracle traverses the PK index it gets ROWID value. This ROWID contains file-number, block ID and also row number within the block. So Oracle can access directly the row.
HEXTORAW enables partition pruning.
In the sample below the Pstart and Pstop are literal numbers, implying partition pruning occurs.
create table my_table
(
TAB_ID raw(16),
a number,
constraint my_table_pk primary key (tab_id)
)
partition by hash(tab_id) partitions 16;
explain plan for
select *
from my_table
where tab_id = hextoraw('1BCDB0E06E7C498CBE42B72A1758B432');
select * from table(dbms_xplan.display(format => 'basic +partition'));
Plan hash value: 1204448714
--------------------------------------------------------------------------
| Id | Operation | Name | Pstart| Pstop |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | |
| 1 | TABLE ACCESS BY GLOBAL INDEX ROWID| MY_TABLE | 2 | 2 |
| 2 | INDEX UNIQUE SCAN | MY_TABLE_PK | | |
--------------------------------------------------------------------------