Truncating a table with many subpartitions takes too long (Oracle)

We have a job that loads tables every night from our source database into our target database. Many of them are partitioned by range or list. Before loading a table we truncate it, and for some reason the truncate takes too long for particular tables.
For instance, TABLE A has 62 million rows and is partitioned by list (column BRANCH_CODE). It has 213 partitions. Truncating this table took 20 seconds.
TABLE B has 17 million rows and is range partitioned on the DAY column with a monthly interval; every partition has 213 list subpartitions (column BRANCH_CODE). So in this case there are 60 partitions and 12,780 subpartitions. Truncating this table took 15 minutes.
Is the long truncate caused by the large number of partitions? Or have we missed some table specification, or should we set specific storage parameters for the table?

Manually gathering fixed object and data dictionary statistics may improve the performance of metadata queries needed to support truncating 12,780 objects:
begin
dbms_stats.gather_fixed_objects_stats;
dbms_stats.gather_dictionary_stats;
end;
/
The above command may take many minutes to complete, but you generally only need to run it once after a significant change to the number of objects in your system. Adding 12,780 subpartitions can cause weird issues like this. (While you're investigating these issues, you might also want to check the space overhead associated with so many subpartitions. It's easy to waste many gigabytes of space when creating so many partitions.)
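To make the object counts in the question concrete, here is a small Python sketch of the arithmetic. The function name is purely illustrative and not part of any Oracle tooling; the figures (213, 60) come from the question.

```python
def total_subpartitions(partitions: int, subpartitions_per_partition: int) -> int:
    """Number of subpartitions in a composite-partitioned table."""
    return partitions * subpartitions_per_partition


# TABLE A: list-partitioned only, so there are 213 objects to truncate.
table_a_objects = 213

# TABLE B: 60 monthly range partitions, each with 213 list subpartitions.
table_b_objects = total_subpartitions(60, 213)

print(table_a_objects, table_b_objects)       # 213 12780
print(table_b_objects // table_a_objects)     # 60
```

TABLE B holds fewer rows than TABLE A but has 60 times as many objects, which is consistent with the truncate time being driven by metadata work rather than data volume.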

Related

Hive partition scenario and how it impacts performance

I want to ask about the number of Hive partitions and how it impacts performance.
Let me illustrate with a real example.
I have an external table that is expected to receive around 500M rows per day from multiple sources, and it will have 5 partition columns.
For one day, that resulted in 250 partitions. With 1 year of retention I expect around 75K partitions, which I suppose is a huge number: when I checked, Hive can go up to 10K, but beyond that performance gets bad (and someone told me that partitions should not exceed 1K per table).
The queries that will select from this table are mainly as follows:
50% of them will use all partition columns in their exact order.
25% will use only 1-3 partition columns and not the other 2.
25% will use only the 1st partition column.
So do you think this may work well even with 1 month of retention? Or would partitioning by start date alone be enough? Assume a normal distribution over the other 4 columns (say 500M rows / 250 partitions, which gives 2M rows per partition).

I would go with 3 partition columns, since that will a) exactly match ~50% of your query profiles, and b) substantially reduce (prune) the number of scanned partitions for the other 50%. At the same time, you won't be pressured to increase your Hive MetaStore (HMS) heap memory and beef up HMS backend database to work efficiently with 250 x 364 = 91,000 partitions.
Since the time a 10K limit was introduced, significant efforts have been made to improve partition-related operations in HMS. See for example JIRA HIVE-13884, that provides the motivation to keep that number low, and describes the way high numbers are being addressed:
The PartitionPruner requests either all partitions or partitions based
on filter expression. In either scenarios, if the number of partitions
accessed is large there can be significant memory pressure at the HMS
server end.
... PartitionPruner [can] first fetch the partition names (instead of
partition specs) and throw an exception if number of partitions
exceeds the configured value. Otherwise, fetch the partition specs.
Note that partition specs (mentioned above) and statistics gathered per partition (always recommended for efficient querying) constitute the bulk of the data HMS has to store and cache for good performance.
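The partition counts discussed above can be reproduced with a short Python sketch. The helper names are illustrative only; 250 partitions per day and 500M rows per day come from the question, 364 days is the figure used in the answer, and 30 days for "1 month" is an assumption.

```python
def partition_count(partitions_per_day: int, retention_days: int) -> int:
    """Total partitions HMS has to track for a given retention window."""
    return partitions_per_day * retention_days


def rows_per_partition(rows_per_day: int, partitions_per_day: int) -> int:
    """Average rows per partition, assuming an even spread across partitions."""
    return rows_per_day // partitions_per_day


five_columns_one_year = partition_count(250, 364)   # all 5 partition columns
five_columns_one_month = partition_count(250, 30)   # assumed 30-day month
date_only_one_year = partition_count(1, 364)        # partition by date alone

print(five_columns_one_year)    # 91000
print(five_columns_one_month)   # 7500
print(date_only_one_year)       # 364
print(rows_per_partition(500_000_000, 250))  # 2000000
```

With all 5 columns, even 1 month of retention lands close to the 10K figure mentioned in the question, while a date-only scheme stays in the hundreds.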

How to append the data to existing hive table without partition

I have created a Hive table that contains historical stock data for the past 10 years. From now on I have to append data on a daily basis.
I thought of partitioning by date, but that leads to many partitions: approximately 3000, plus a new partition for every new date. I think this is not feasible.
Can anyone suggest the best approach to store all the historical data in the table and append new data as it arrives?

As with every partitioned table, the decision on how to partition depends primarily on how you are going to query the table.
Another consideration is how much data you are going to have per partition, as partitions should not be too small. As an absolute minimum, each one should be at least as big as one HDFS block, since otherwise you end up with too many directories.
That said, I don't think 3000 partitions would be a problem. At a previous job we had a huge table with one partition per hour; each hour was about 20 GB, and we had 6 months of data, so about 4000 partitions, and it worked just fine.
In our case, most people cared mostly about the last week and the last day.
First, I suggest you research how the table is going to be used: will all 10 years be queried, or mostly the most recent data?
Second, study how big the data is, consider whether it may grow with the new loads, and see how big each partition is going to be.
Once you've determined these 2 points, you can make a decision: you could just use daily partitions (which could be fine; 3000 partitions is not bad), or you could go weekly or monthly.
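As a rough aid for that decision, the sketch below compares partition counts for daily, weekly, and monthly schemes over 10 years and checks a partition size against one HDFS block. The names are illustrative, the 128 MiB block size is an assumption (check `dfs.blocksize` on your cluster), and the 10 MiB daily volume is a made-up example, not a figure from the question.

```python
HDFS_BLOCK_BYTES = 128 * 1024 * 1024  # assumed block size; verify dfs.blocksize


def partitions_for(years: int, partitions_per_year: int) -> int:
    """Partition count for a retention period at a given granularity."""
    return years * partitions_per_year


def meets_block_minimum(partition_bytes: int, block_bytes: int = HDFS_BLOCK_BYTES) -> bool:
    """True if a partition is at least as large as one HDFS block."""
    return partition_bytes >= block_bytes


daily = partitions_for(10, 365)    # calendar days; trading days would be fewer
weekly = partitions_for(10, 52)
monthly = partitions_for(10, 12)
print(daily, weekly, monthly)      # 3650 520 120

# Hypothetical volume: 10 MiB of new data per day.
daily_bytes = 10 * 1024 * 1024
print(meets_block_minimum(daily_bytes))        # False: daily partitions too small
print(meets_block_minimum(daily_bytes * 30))   # True: roughly monthly partitions
```

Under these assumed numbers a monthly scheme would satisfy the one-block minimum while a daily one would not; with larger daily volumes the conclusion flips.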

You can use this command:
LOAD DATA LOCAL INPATH '<FILE_PATH>' INTO TABLE <TABLE_NAME>;
It will create new files under the HDFS directory mapped to the table. Even without many partitions, you will still run into the too-many-files issue.
Periodically, you need to do this:
Create a stage table.
Move the data from the target table to the stage table by running the LOAD command.
Run an insert command into the target table, selecting from the stage table.
The data is now loaded with a number of files equal to the number of reducers.
Delete the stage table.
You can run this process at regular intervals (probably once a month).
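The periodic compaction steps above could be sketched as a small Python helper that only builds the HiveQL statements, without running them. The table names and the HDFS directory are hypothetical placeholders, and whether the final file count matches the reducer count depends on how Hive plans the INSERT, so treat this as an outline rather than a tested procedure.

```python
from typing import List


def compaction_statements(target: str, stage: str, target_dir: str) -> List[str]:
    """Build the HiveQL statements for one compaction pass, in order."""
    return [
        # 1. Create an empty stage table with the same schema as the target.
        f"CREATE TABLE {stage} LIKE {target}",
        # 2. Move the small files from the target's directory into the stage table.
        f"LOAD DATA INPATH '{target_dir}' INTO TABLE {stage}",
        # 3. Rewrite the data back into the target table.
        f"INSERT INTO TABLE {target} SELECT * FROM {stage}",
        # 4. Drop the stage table once the rewrite has succeeded.
        f"DROP TABLE {stage}",
    ]


# Hypothetical names and path, for illustration only.
statements = compaction_statements(
    target="stock_history",
    stage="stock_history_stage",
    target_dir="/warehouse/stock_history",
)
for statement in statements:
    print(statement + ";")
```

Generating the statements first makes it easy to review them before running them through whatever Hive client you use.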

After table Partition Select query performance get slow

I am using PostgreSQL 9.1 and I have a table with 36 columns and almost 105 million (10 crore 50 lakh) records, each with a timestamp. The table has one composite primary key (DEVICE ID of type text and DT_DATETIME of type timestamp without time zone).
To improve query performance, we partitioned the table by day on the DT_DATETIME field. After partitioning, I see that queries take longer to retrieve data than on the unpartitioned table. I have turned on the constraint_exclusion parameter in the config file.
Is there any solution for this?
Let me explain a little further.
I have 45 days of GPS data in a table of size 40 GB. Every second we insert at least 27 new records (2.5 million records a day). To keep the table at a steady 45 days, we delete the 45th day's data every night. This causes problems with vacuum on this table due to locking. With a partitioned table we can simply drop the 45th day's child table.
So by partitioning we wanted to improve query performance as well as solve the locking problem. We tried pg_repack, but twice the system load factor rose to 21 and we had to reboot the server.
Ours is a 24x7 system, so there is no downtime.

Try using PgBouncer for connection management and memory management, or increase the RAM in your server.
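For scale, the figures in the question can be checked with a short Python sketch. The names are illustrative; 27 rows per second and 45 days of retention come from the question.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_day(rows_per_second: int) -> int:
    """Rows inserted in one day at a constant insert rate."""
    return rows_per_second * SECONDS_PER_DAY


min_daily_rows = rows_per_day(27)      # the stated minimum insert rate
retained_rows = min_daily_rows * 45    # 45 days kept in the table
daily_child_tables = 45                # one child table per retained day

print(min_daily_rows)   # 2332800
print(retained_rows)    # 104976000
```

The minimum rate gives about 2.3 million rows a day and roughly 105 million retained rows, which matches the table size quoted in the question; each daily child table would therefore hold a little over 2 million rows.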

When is the right time to create Indexes in Oracle?

A brand new application with Oracle as its data store is going to be pushed to production. The databases use the CBO, and I have identified some columns to index. I expect the total number of records in a particular table to reach 4 million after 6 months. After that, very few records will be added, and there will be no updates to the indexed columns; most of the updates will be on non-indexed columns.
Is it advisable to create the indexes now, or should I wait a couple of months?

If the table requires indexes, you will incur a lot of poor performance (full table scans plus actual I/O) once the number of rows in the table goes beyond what might reasonably be kept in the cache. Assume that is 20000 rows; we'll call it the magic number. Suppose you hit 20000 rows in a week of production. After that, queries and updates on the table will grow progressively slower, on average, as more rows are added.
You are probably worried about the overhead of inserting new rows with indexed fields. That is a one-time hit. You are trading that against dozens of queries and updates when you delay adding indexes.
The trade-off is largely in favor of adding indexes right now, especially since we do not know what that magic number (20000?) really is. It could be larger. Or smaller.
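The scan-versus-index difference can be seen in miniature with Python's built-in sqlite3 module. This is a stand-in, not Oracle: the table, column, and index names are hypothetical, and SQLite's planner is far simpler than Oracle's CBO, so it only illustrates the general idea that the same lookup switches from a full scan to an index search once the index exists.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, note TEXT)"
)
# 20000 rows, echoing the 'magic number' used in the answer.
conn.executemany(
    "INSERT INTO orders (customer_id, note) VALUES (?, ?)",
    ((i % 1000, "x") for i in range(20000)),
)

lookup = "SELECT * FROM orders WHERE customer_id = ?"

plan_before = query_plan(conn, lookup, (42,))   # no index yet: full scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, lookup, (42,))    # index available: index search

print(plan_before)
print(plan_after)
```

The exact plan wording varies between SQLite versions, but the first plan reports a scan of the table and the second names the new index.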

TSQL Merge Performance

Scenario:
I have a table with roughly 24 million records. The table holds pricing history related to individual customers and is computed daily. There are on average 6 million records for each day. Every morning the price list is generated and a MERGE statement is run to reflect the changes in their pricing.
The merge begins with the previous day's data being inserted into a table variable; that table is then merged into the actual table. The main problem is that the merge statement takes pretty long.
My real question centers on the performance of using a table variable vs. a physical table vs. a temp table. What is the best practice for large merges like this?

Thoughts
I'd consider a temp table: these have statistics, which will help. A table variable is always assumed to have one row. Also, the I/O can be shunted onto separate drives (assuming you have tempdb on separate drives).
If a single transaction is not required, I'd also split the MERGE into a DELETE, UPDATE, INSERT sequence to reduce the amount of work needed in each action (which reduces the amount of rollback info needed, the amount of locking, etc.).

Temp tables often perform better than table variables for large data sets. Additionally you can put the data into the temp table and then index it.

Check whether you have indexes on the tables. Indexes are updated every time you add or delete records in that table.
Try removing the indexes before merging the records and then re-creating them after the merge.
