Snowflake delete query scanning all partitions - performance

I have an ETL process that deletes a couple hundred thousand rows from a table with 18 billion rows, using a unique hashed surrogate key like 1801b08dd8731d35bb561943e708f7e3:
delete from CUSTOMER_CONFORM_PROD.c360.engagement
where engagement_surrogate_key in (
    select engagement_surrogate_key
    from CUSTOMER_CONFORM_PROD.c360.engagement__dbt_tmp
);
This is taking 4 to 6 minutes each time on a Small warehouse. I have added a clustering key on engagement_surrogate_key, but since it's unique with high cardinality it didn't help. I have also enabled the search optimization service, but that didn't help either and the query is still scanning all partitions. How can I speed up the deletion?
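For reference, the two tuning steps already tried would look roughly like the following (a sketch of the presumed commands, inferred from the description above, not a fix):

-- Presumed clustering-key setup; little pruning benefit on a unique, high-cardinality column
alter table CUSTOMER_CONFORM_PROD.c360.engagement
    cluster by (engagement_surrogate_key);

-- Presumed search optimization setup for equality lookups on the surrogate key
alter table CUSTOMER_CONFORM_PROD.c360.engagement
    add search optimization on equality(engagement_surrogate_key);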

Related

Truncating a table with many subpartitions taking too long

We have a job that loads some tables every night from our source DB to the target DB; many of them are partitioned by range or list. Before loading a table we truncate it first, and for some reason this process is taking too long for particular tables.
For instance, TABLE A has 62 million rows and is partitioned by list (column BRANCH_CODE). The number of partitions is 213. Truncating this table took 20 seconds.
TABLE B has 17 million rows and is range partitioned by the DAY column with a monthly interval; every partition has 213 subpartitions by list (column BRANCH_CODE). So in this case the number of partitions is 60 and the number of subpartitions is 12,780. Truncating this table took 15 minutes.
Is the reason for the long truncate process too many partitions? Or have we missed some table specs, or should we set specific storage parameters for the table?
Manually gathering fixed object and data dictionary statistics may improve the performance of metadata queries needed to support truncating 12,780 objects:
begin
    -- Gather statistics on the fixed (X$) objects and the data dictionary so the
    -- recursive metadata queries behind TRUNCATE can use better plans.
    dbms_stats.gather_fixed_objects_stats;
    dbms_stats.gather_dictionary_stats;
end;
/
The above command may take many minutes to complete, but you generally only need to run it once after a significant change to the number of objects in your system. Adding 12,780 subpartitions can cause weird issues like this. (While you're investigating these issues, you might also want to check the space overhead associated with so many subpartitions. It's easy to waste many gigabytes of space when creating so many partitions.)
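As a rough way to gauge that space overhead, a query against DBA_SEGMENTS along these lines shows how much space the partition and subpartition segments occupy (the segment name here is assumed):

select segment_type,
       count(*)                    as segment_count,
       round(sum(bytes)/1024/1024) as total_mb
from   dba_segments
where  segment_name = 'TABLE_B'   -- assumed name of the 12,780-subpartition table
group  by segment_type;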

Select query performance gets slow after table partitioning

I am using PostgreSQL 9.1 and I have a table consisting of 36 columns and almost 10.5 crore (about 105 million) records with a datetime stamp. On this table we have one composite primary key (DEVICE ID text and DT_DATETIME timestamp without time zone).
To improve query performance we have partitioned the table day-wise based on the DT_DATETIME field. After partitioning I see that data retrieval takes longer than on the unpartitioned table. I have turned on the constraint_exclusion parameter in the config file.
Is there any solution for this?
Let me explain a little further.
I have 45 days of GPS data in a table of size 40 GB. Every second we insert at least 27 new records (about 2.5 million records a day). To keep the table at a steady 45 days we delete the 45th day's data every night. This poses a problem with vacuuming the table due to locking. If we had a partitioned table we could simply drop the 45th day's child table.
So by partitioning we wanted to increase query performance as well as solve the locking problem. We have tried pg_repack, but twice the system load factor increased to 21 and we had to reboot the server.
Ours is a 24x7 system, so there is no downtime.
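For context, with PostgreSQL 9.1 inheritance-based partitioning the nightly retention step described above becomes a cheap DROP TABLE rather than a bulk DELETE plus VACUUM. A rough sketch, with table and date names made up for illustration:

-- One child table per day, constrained so constraint_exclusion can prune it
create table gps_data_2014_03_01 (
    check (dt_datetime >= date '2014-03-01' and dt_datetime < date '2014-03-02')
) inherits (gps_data);

-- Nightly retention: drop the oldest (45th) day's child table instead of deleting rows
drop table gps_data_2014_01_15;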
Try using PgBouncer for connection management and memory management, or increase the RAM in your server.

Cassandra Wide Vs Skinny Rows for large columns

I need to insert 60 GB of data into Cassandra per day.
This breaks down into:
100 sets of keys
150,000 keys per set
4KB of data per key
In terms of write performance, am I better off using:
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
Another variable to consider: my data expires after 24 hours, so I am using TTL=86400 to automate expiration.
More specific details about my configuration:
CREATE TABLE stuff (
stuff_id text,
stuff_column text,
value blob,
PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.100000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=39600 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
Access pattern details:
The 4 KB value is a set of 1000 4-byte floats packed into a string.
A typical request is going to need a random selection of 20 - 60 of those floats.
Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time if it were all written to one row with 150,000 columns.
As time passes, some of the data is updated: within a logical row, for some of the columns, a random set of levels within the packed string will change. Instead of updating in place, the new levels are written to a new logical row together with other new data, to avoid rewriting all of the data that is still valid. This leads to fragmentation, as multiple rows now need to be accessed to retrieve that set of 20-60 values: a request will typically read the same column across 1-5 different rows.
Test Method
I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as (Bytes_written / (time * 10^6)). Time was measured in seconds with millisecond precision. Pycassa was used as the Cassandra interface, and its batch insert operator was used: each insert writes multiple columns to a single row, and insert sizes are limited to 12 MB (the queue is flushed at 12 MB or less). Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network, on different systems.
Write results
Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.
1 row 150,000 keys per row: 14 MBps
10 rows 15,000 keys per row: 15 MBps
100 rows 1,500 keys per row: 18 MBps
1000 rows 150 keys per row: 11 MBps
The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:
Wide row (1 row per set): This could be the best solution as it prevents the request from hitting several nodes at once, and with secondary indexing or composite column names, you can quickly filter data to your needs. This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on nodes, and degrade performance.
Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that exists entirely in one wide row. In such a case, a skinny row will distribute your requests more uniformly throughout the cluster, and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
I would suggest analyzing your data access pattern and finalizing your data model based on that, rather than the other way around.
You'd be better off using 1 row per set with 150,000 columns per row. Using a TTL is a good idea for an automatic clean-up process.
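As an illustration of that wide-row model with automatic expiry, the writes and slice reads would look roughly like this in CQL against the stuff table above (the literal values are made up):

-- Write one 4 KB packed value into a wide row, expiring after 24 hours
INSERT INTO stuff (stuff_id, stuff_column, value)
VALUES ('set_042', 'key_000123', 0x0102030405060708)
USING TTL 86400;

-- A typical read slices a contiguous range of columns out of one wide row (partition)
SELECT stuff_column, value
FROM stuff
WHERE stuff_id = 'set_042'
  AND stuff_column >= 'key_000100'
  AND stuff_column <= 'key_000160';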

Oracle partitioned table is taking a long time to fetch

I have a partitioned table based on date in an Oracle DB, where each partition has crores (tens of millions) of records. The front-end application is built to search the data based on a date range (meaning it scans through multiple partitions). What is the best logic to get the data in the quickest time?
You should create local indexes, which work on individual partitions.
Normally we go for global indexes, which work on the whole table, while a local index is specific to a partition, which makes partition searches faster.
Check this link to see how local indexes work: http://docs.oracle.com/cd/E11882_01/server.112/e25523/partition.htm#i461446
If local indexes don't work then query tuning might help. If that doesn't help then you should look at redesigning the schema.
EDIT:
Having said all that, one basic check is to ensure that your query is not scanning all partitions. This can be achieved by including the partition criterion (the date, in your case) in the WHERE clause.
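A minimal sketch of a local (partition-aligned) index, with table and column names assumed for illustration:

-- Builds one index segment per partition of the (date-partitioned) table
create index sales_fact_date_lidx
    on sales_fact (sale_date, customer_id)
    local;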
Interval partitioning may help. It makes partition management much easier, which then makes it reasonable to have thousands of partitions instead of just dozens or hundreds.
For example, if the current table is partitioned by month, a query for a week will need to read a lot of extra data. But if the table is partitioned by day then almost no extra data will be scanned.
create table partition_test(a number primary key, b date)
partition by range (b) interval (interval '1' day)
(
partition p1 values less than (date '2000-01-01')
);
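With daily partitions like that, a one-week query should prune down to roughly seven partitions, for example (the dates are made up):

select *
from   partition_test
where  b >= date '2019-06-01'
  and  b <  date '2019-06-08';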
But even if this reduces the data per partition from crores to lakhs (tens of millions to hundreds of thousands of rows), that's still a lot of data for an application. Local indexes, as @loki suggested, may help.

When is the right time to create Indexes in Oracle?

A brand new application with Oracle as the data store is going to be pushed into production. The databases use the CBO and I have identified some columns for indexing. I am expecting the total number of records in a particular table to be 4 million after 6 months. After that, very few records will be added and there will not be any updates to the records in the indexed columns; most of the updates will be on non-indexed columns.
Is it advisable to create the indexes now, or do I need to wait a couple of months?
If the table requires indexes, you will incur a lot of poor performance (full table scans plus actual I/O) once the number of rows in the table goes beyond what might reasonably be kept in the cache. Assume that is 20,000 rows; we'll call it the magic number. You'll hit 20,000 rows in a week of production. After that, queries and updates on the table will grow progressively slower, on average, as more rows are added.
You are probably worried about the overhead of inserting new rows with indexed fields. That is a one-time hit. You are trading that against dozens of queries and updates that slow down when you delay adding indexes.
The trade-off is largely in favor of adding indexes right now, especially since we do not know what that magic number (20,000?) really is. It could be larger, or smaller.
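For completeness, creating the indexes up front is a single DDL statement per column; the table and column names here are purely hypothetical:

-- Hypothetical names; create before the table grows past what fits in cache
create index orders_customer_idx on orders (customer_id);
create index orders_status_idx   on orders (status);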
