In my application, we have multilple global temporary tables (5-6) in an oracle function. We have indexes on the temporary tables as these can hold significant amount of data. This function is called by different users at the same time (about 30 instances). After first few calls - we are seeing timeout issues - after deleting stats on these tmp table application perform fine.
I did some research on google and it seems that global temp tables shares the stats:
http://blog.go-faster.co.uk/2009/10/global-temporary-tables-shared.html
I like to know what could be the solution for this scenario, other than dropping indexes.
Related
Problem statement:
There is address table in Oracle which is having relationship with multiple tables like subscriber, member etc.
Currently design is in such a way that when there is any change in associated tables, it increments record version throughout all tables.
So new record is added in address table even if same address is already present, resulting into large number of duplicate copies.
We need to identify and remove duplicate records, and update foreign keys in associated tables while making sure it doesn't impact the running application.
Tried solution:
We have written a script for cleanup logic, where unique hash is generated for every address. If calculated hash is already present then it means address is duplicate, where we merge into single address record and update foreign keys in associated tables.
But the problem is there are around 300 billion records in address table, so this cleanup process is taking lot of time, and it will take several days to complete.
We have tried to have index for hash column, but process is still taking time.
Also we have updated the insertion/query logic to use addresses as per new structure (using hash, and without version), in order to take care of incoming requests in production.
We are planning to do processing in chunks, but it will be very long an on-going activity.
Questions:
Would like to if any further improvement can be made in above approach
Will distributed processing will help here? (may be using Hadoop Spark/hive/MR etc.)
Is there any some sort of tool that can be used here?
Suggestion 1
Use built-in delete parallel
delete /*+ parallel(t 8) */ mytable t where ...
Suggestion 2
Use distributed processing (Hadoop Spark/hive) - watch out for potential contention on indexes or table blocks. It is recommended to have each process to work on a logical isolated subset, e.g.
process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
...
Suggestion 3
If more than ~30% of the table need to be deleted - the fastest way would be to create an empty table, copy there all required rows, drop original table, rename new, create all indexes+constraints. Of course it requires downtime and it greatly depends on number of indexes - the more you have the longer it will take
P.S. There are no "magic" tools to do it. In the end they all run the same sql commands as you can.
It's possible use oracle merge instruction to insert data if you use clean sql.
I have found lots of answers of HOW to drop a DB and all its tables, but nothing around why it takes ~3-4 seconds per table to drop, seemingly in SERIAL (one after the other).
I have a database with 2,414 EXTERNAL Tables pointed at parquet locations, and DROP DATABASE <db> CASCADE; can take 1-2 HOURS just to drop the metadata for the DB.
In a separate session I can repeatedly SHOW TABLES IN <deleted DB>; and watch the count of tables go down at a rate of about 1 every 3-4 seconds. This takes upwards of 2 hours before the session releases the delete and allows us to replace the DB with a new one...
504 rows selected (0.29 seconds)
...
503 rows selected (0.17 seconds)
...
502 rows selected (0.29 seconds)
...
What is taking Hive so long?
Is there a configuration I can use to make it quicker?
Is there a way I can tell what it's doing during that time?
I would think others would have encountered this problem if it were more common, so that makes me think I have a setting somewhere I can tweak to fix this (?)...
The parquets don't seem to be deleted underneath the dropped database, so it doesn't seem to have anything to do with hdfs/parquet files unless dropping an external table checks them for any reason...
Any ideas why it would be so slow?
AFAIK, it has to drop all of its references. It can be external table, but in case there are lot of partitions, stats etc from metastore. Also, if it has lot of rows, it needs to acquire specific locks.
You may want to check the metastore (mysql or equivalent) and see if you can introduce any indexes or collect stats on a periodic basis.
Is it more efficient to create an index after loading data is complete or before, or does it not matter?
For example, say I have 500 files to load into a Postgres 8.4 DB. Here are the two index creation scenarios I could use:
Create index when table is created, then load each file into table; or
Create index after all files have been loaded into the table.
The table data itself is about 45 Gigabytes. The index is about 12 Gigabytes. I'm using a standard index. It is created like this:
CREATE INDEX idx_name ON table_name (column_name);
My data loading uses COPY FROM.
Once all the files are loaded, no updates, deletes or additional loads will occur on the table (it's a day's worth of data that will not change). So I wanted to ask which scenario would be most efficient? Initial testing seems to indicate that loading all the files and then creating the index (scenario 2) is faster, but I have not done a scientific comparison of the two approaches.
Your observation is correct - it is much more efficient to load data first and only then create index. Reason for this is that index updates during insert are expensive. If you create index after all data is there, it is much faster.
It goes even further - if you need to import large amount of data into existing indexed table, it is often more efficient to drop existing index first, import the data, and then re-create index again.
One downside of creating index after importing is that table must be locked, and that may take long time (it will not be locked in opposite scenario). But, in PostgreSQL 8.2 and later, you can use CREATE INDEX CONCURRENTLY, which does not lock table during indexing (with some caveats).
I have a data table from company which is of 250Gb having 35 columns. I need to delete around 215Gb of data which
is obviously large number of rows to delete from the table. This table has no primary key.
What could be the fastest method to delete data from this table? Are there any tools in Oracle for such large deletion processes?
Please suggest me the fastest way to do this with using Oracle.
As it is said in the answer above it's better to move the rows to be retained into a separate table and truncate the table because there's a thing called HIGH WATERMARK. More details can be found here http://sysdba.wordpress.com/2006/04/28/how-to-adjust-the-high-watermark-in-oracle-10g-alter-table-shrink/ . The delete operation will overwhelm your UNDO TABLESPACE it's called.
The recovery model term is rather applicable for mssql I believe :).
hope it clarifies the matter abit.
thanks.
Dou you know which records need to be retained ? How will you identify each record ?
A solution might be to move the records to be retained to a temp db, and then truncate the big table. Afterwards, move the retained records back.
Beware that the transaction log file might become very big because of this (but depends on your recovery model).
We had a similar problem a long time ago. Had a table with 1 billion rows in it but had to remove a very large proportion of the data based on certain rules. We solved it by writing a Pro*C job to extract the data that we wanted to keep and apply the rules, and sprintf the data to be kept to a csv file.
Then created a sqlldr control file to upload the data using direct path (which wont create undo/redo (but if you need to recover the table, you have the CSV file until you do your next backup anyway).
The sequence was
Run the Pro*C to create CSV files of data
generate DDL for the indexes
drop the indexes
run the sql*load using the CSV files
recreate indexes using parallel hint
analyse the table using degree(8)
The amount of parellelism depends on the CPUs and memory of the DB server - we had 16CPUs and a few gig of RAM to play with so not a problem.
The extract of the correct data was the longest part of this.
After a few trial runs, the SQL Loader was able to load the full 1 billion rows (thats a US Billion or 1000 million rows) in under an hour.
We have noticed that our queries are running slower on databases that had big chunks of data added (bulk insert) when compared with databases that had the data added on record per record basis, but with similar amounts of data.
We use Sql 2005 Express and we tried reindexing all indexes without any better results.
Do you know of some kind of structural problem on the database that can be caused by inserting data in big chunks instead of one by one?
Thanks
One tip I've seen is to turn off Auto-create stats and Auto-update stats before doing the bulk insert:
ALTER DATABASE databasename SET AUTO_CREATE_STATISTICS OFF WITH NO_WAIT
ALTER DATABASE databasename SET AUTO_UPDATE_STATISTICS OFF WITH NO_WAIT
Afterwards, manually creating statistics by one of 2 methods:
--generate statistics quickly using a sample of data from the table
exec sp_createstats
or
--generate statistics using a full scan of the table
exec sp_createstats #fullscan = 'fullscan'
You should probably also turn Auto-create and Auto-update stats back on when you're done.
Another option is to check and defrag the indexes after a bulk insert. Check out Pinal Dave's blog post.
Probably SQL Server allocated new disk space in many small chunks. When doing big transactions, it's better to pre-allocate much space in both the data and log files.
That's an interesting question.
I would have guessed that Express and non-Express have the same storage layout, so when you're Googling for other people with similar problems, don't restrict yourself to Googling for problems in the Express version. On the other hand though, bulk insert is a common-place operation and performance is important, so I wouldn't consider it likely that this is a previously-undetected bug.
One obvious question: which is the clustered index? Is the clustered index also the primary key? Is the primary key unassigned when you insert, and therefore initialized by the database? If so then maybe there's a difference (between the two insert methods) in the pattern or sequence of successive values assigned by the database, which affects the way in which the data is clustered, which then affects performance.
Something else: as well as indexes, people say that SQL uses statistics (which it created as a result of runing previous queries) to optimize its execution plan. I don't know any details of that, but as well as "reindexing all indexes", check the execution plans of your queries in the two test cases to ensure that the plans are identical (and/or check the associated statistics).