I have some very large tables (to me anyway), as in millions of rows. I am loading them from a legacy system and it is taking forever. Assume the hardware is OK and reasonably fast. How can I speed this up? I have tried exporting from one system into CSV and using SQL*Loader - slow. I have also tried a direct link from one system to the other, so there is no middle CSV file, just unload from one and load into the other.
One person mentioned something about pre-staging tables and how that could somehow make things faster. I don't know what that is or whether it could help. I was hoping for input. Thank you.
Oracle 11g is what is being used.
Update: my database is clustered, so I don't know if I can do anything to speed things up.
What you can try:
disabling all constraints and only enabling them after the load process (see the sketch after this list)
CTAS (create table as select)
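A minimal sketch of the constraint approach; the table and constraint names below are hypothetical, so list the real ones from the data dictionary first:
-- Find the constraints on the target table (name is a placeholder)
SELECT constraint_name, constraint_type
FROM   user_constraints
WHERE  table_name = 'TARGET_TBL';

-- Disable before the load (hypothetical constraint name)
ALTER TABLE target_tbl DISABLE CONSTRAINT fk_target_src;

-- ... run the load ...

-- Re-enable afterwards; NOVALIDATE skips re-checking existing rows if you trust the source
ALTER TABLE target_tbl ENABLE NOVALIDATE CONSTRAINT fk_target_src;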
What you really should do: understand what your bottleneck is. Is it the network, file I/O, checking constraints ... then fix that problem. For me, looking at the explain plan is most of the time the first step.
As Jens Schauder suggested, if you can connect to your source legacy system via DB link, CTAS would be the best compromise between performance and simplicity, as long as you don't need any joins on the source side.
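For illustration, a hedged sketch of what that CTAS could look like, assuming a database link named legacy_link and a source table src_big_table (both names are placeholders); NOLOGGING and PARALLEL are optional and only help if your archiving/standby setup allows them:
-- Pull the table straight across the DB link in one statement
CREATE TABLE big_table_local
  NOLOGGING
  PARALLEL 4
AS
  SELECT *
  FROM   src_big_table@legacy_link;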
Otherwise, you should consider using SQL*Loader and tweaking some settings. Using direct path I was able to load 100M records (~10GB) in 12 minutes on a 6 year old ProLiant.
EDIT: I used the data format defined for the Datamation sort benchmark. The generator for it is available in the Apache Hadoop distribution. It generates records with fixed-width fields, with 99 bytes of data plus a newline character per line of the file. The SQL*Loader control file I used for the numbers quoted above was:
OPTIONS (SILENT=FEEDBACK, DIRECT=TRUE, ROWS=1000)
LOAD DATA
INFILE 'rec100M.txt' "FIX 99"
INTO TABLE BENCH (
BENCH_KEY POSITION(1:10),
BENCH_REC_NBR POSITION(13:44),
BENCH_FILLER POSITION(47:98))
What is the configuration you are using?
Does the database into which the data is imported have something like a standby database coupled to it? If so, it is very likely to have force_logging enabled.
You can check this using
SELECT force_logging FROM v$database;
It can also be enabled at tablespace level:
SELECT tablespace_name, force_logging FROM dba_tablespaces;
If your database is running with force_logging, or your tablespace has force_logging, this will have an impact on the import speed.
If this is not the case, check if archivelog mode is enabled.
SELECT log_mode FROM v$database;
If so, it could be that the archives are not written fast enough. In that case, increase the size of the online redo log files.
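If the redo logs do turn out to be too small, something along these lines adds larger groups and retires a small one (sizes, group numbers, and file paths are placeholders; check V$LOG first):
-- Current redo log groups and sizes
SELECT group#, bytes/1024/1024 AS size_mb, status FROM v$log;

-- Add larger groups (paths and sizes are examples only)
ALTER DATABASE ADD LOGFILE GROUP 4 ('/u01/oradata/ORCL/redo04.log') SIZE 1G;
ALTER DATABASE ADD LOGFILE GROUP 5 ('/u01/oradata/ORCL/redo05.log') SIZE 1G;

-- A group can only be dropped once it is INACTIVE
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;
ALTER DATABASE DROP LOGFILE GROUP 1;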
If the database is not running in archivelog mode, it still has to write to the redo files unless you use direct path inserts. In that case, check how quickly the redo can be written. Normally, 200 GB/h is very well possible when indexes are not playing a role.
The important thing is to find which link is causing the lack of performance. It could be the input, it could be the output. Here I focused on the output.
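If the load itself runs as plain SQL (for example over a DB link), a direct-path insert keeps redo generation down when the table is NOLOGGING and force_logging is off; a minimal sketch with placeholder names:
-- NOLOGGING is ignored if force_logging is enabled at database or tablespace level
ALTER TABLE big_table_local NOLOGGING;

-- The APPEND hint requests a direct-path insert
INSERT /*+ APPEND */ INTO big_table_local
SELECT * FROM src_big_table@legacy_link;

COMMIT;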
I am in the process of creating an Oracle to Vertica process!
We are looking to create a Vertica DB that will run heavy reports. For now all is cool: Vertica is fast, space use is great, and everything is well and nice until we get to the main part, getting the data from Oracle to Vertica.
OK, the initial load is fine: dump to CSV from Oracle, load into Vertica, and the load times are a joke. No problem so far; everybody thinks it's a bad joke or that there's some magic going on. Well, it is simply fast.
Now the bad part: both databases are up and running (ORACLE/VERTICA), and I have data getting altered in ORACLE, so I need to replicate my data in VERTICA. What now?
From my tests and from what I can understand about Vertica, inserts and updates are not to be used beyond maybe a max of 20 per second, so real-time replication is out of the question.
So I was thinking of reading the archive log from Oracle and ETL-ing it to create CSV data with the new data, altered data, and deleted/changed values, and then applying it to VERTICA. But I cannot get a change list like that, and explicit data changes in VERTICA lead to slow performance anyway.
So I am looking for some ideas about how I can solve this issue, knowing I cannot:
Alter my ORACLE production structure.
Use ORACLE env resources for filtering the data.
Use insert, update, or delete statements in my VERTICA load process.
Things I depend on:
The use of the COPY command
Data consistency
A max 60 minute window (every 60 minutes, new/altered data needs to go to VERTICA).
I have seen the Continuent data replication product, but it seems that nobody wants to sell it; I cannot get in touch with them.
Will loading all the data into a new table and then swapping the tables be acceptable?
copy new() ...
-- you can swap tables in one command:
alter table old,new,swap rename to swap,old,new;
truncate new;
Extract data from Oracle (in .csv format) and load it using the Vertica COPY command. Write a simple shell script to automate this process.
I used to use Talend (ETL), but it was very slow, so I moved to this conventional process, and it has really worked for me. Currently processing 18M records, my entire process takes less than 2 minutes.
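A minimal sketch of such a load, assuming the extract has been written to /data/stage/projects.csv with a comma delimiter (path, delimiter, and table name are placeholders):
-- DIRECT writes straight to disk storage, which suits large batch loads
COPY projects_staging
FROM '/data/stage/projects.csv'
DELIMITER ','
NULL ''
DIRECT;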
This is my first question. I've searched a lot of info from different sites, but none of them were conclusive.
Problem:
Every day I'm loading a flat file with an SSIS package executed in a scheduled job in SQL Server 2005, but it's taking TOO MUCH TIME (like 2 1/2 hours), and the file just has like 300 rows and is approximately a 50 MB file. This is driving me crazy, because it is affecting the performance of my server.
This is the Scenario:
-My package is just a Data Flow Task that has a Flat File Source and an OLE DB Destination, that's all!!!
-The Data Access Mode is set to FAST LOAD.
-I just have 3 indexes on the table, and they are nonclustered.
-My destination table has 366,964,096 records so far and 32 columns
-I haven't set FastParse on any of the output columns yet (I want to try something else first).
So I've just started to make some tests:
-Rebuilt/reorganized the indexes on the destination table (they were way too fragmented), but this didn't help me much.
-Created another table with the same structure but without all the indexes and executed the job with the SSIS package loading to this new table and IT JUST TOOK LIKE 1 MINUTE!!!
So I'm confused; is there something I'm missing?
-Is the SSIS package writing the whole large table into a buffer and then writing it to disk? Or why the BIG difference in time?
-Are the indexes affecting the insertion time?
-Should I load the file into this new table as a staging table and then do a BULK INSERT into the destination table with the records ordered? 'Cause I thought that the Data Flow Task was much faster than BULK INSERT, but at this point I'm not sure anymore.
Thanks in advance.
One thing I might look at is whether the large table has any triggers which are causing it to be slower on insert. Also, if the clustered index is on a field that will require a good bit of rearranging of the data during the load, that could cause issues as well.
In SSIS packages, using a merge join (which requires sorting) can cause slowness, but from your description it doesn't appear you did that. I mention it only in case you were doing that and didn't mention it.
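To rule out the trigger point quickly, something like this lists the triggers on the destination table (the table name is a placeholder):
-- Any enabled triggers on the destination table will fire for every inserted row
SELECT t.name AS trigger_name, t.is_disabled
FROM   sys.triggers AS t
WHERE  t.parent_id = OBJECT_ID('dbo.BigDestTable');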
If it works fine without the indexes, perhaps you should look into those. What are the data types? How many are there? Maybe you could post their definitions?
You could also take a look at the fill factor of your indexes - especially the clustered index. Having a high fill factor could cause excessive IO on your inserts.
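For example, a rebuild that sets a lower fill factor to leave free space on each page (index and table names are placeholders; pick a fill factor that matches your insert pattern):
-- Rebuild with 80% fill factor, sorting in tempdb to reduce pressure on the user database
ALTER INDEX IX_BigDestTable_LoadDate
ON dbo.BigDestTable
REBUILD WITH (FILLFACTOR = 80, SORT_IN_TEMPDB = ON);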
Well, I rebuilt the indexes with another fill factor (80%) like Sam told me, and the time dropped significantly. It took 30 minutes instead of almost 3 hours!!!
I will keep testing to fine-tune the DB. Also, I didn't create a clustered index yet; I guess with a clustered index the time will drop a lot more.
Thanks to all; I hope this helps someone in the same situation.
I have seen developers using WITH (NOLOCK) in queries; are there any disadvantages to it?
Also, what is the default mode of execution of a query? My database does not have any indexes.
Is there any other way to increase database select statement performance?
The common misconception with NOLOCK is that it places no locks on the database whilst executing. Technically it does issue a schema-stability (Sch-S) lock, so the 'no' part of the hint relates to the data side of the query.
Most of the time that I see this, it is a premature optimization by a developer because they have heard it makes the query faster.
Unless you have instrumented proof and a valid reason for accepting a dirty read (and potentially reading the same row twice), it should not be used. It definitely should not be the default approach to queries, but an exception to the rule when it can be shown that it is required.
There are numerous articles on this on the net. The main risk is that with NOLOCK you can read uncommitted data from the table (dirty reads). See, for example, http://msdn.microsoft.com/en-us/library/aa259216(v=sql.80).aspx or http://www.techrepublic.com/article/using-nolock-and-readpast-table-hints-in-sql-server/6185492
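For reference, the hint is applied per table; a minimal sketch with invented table and column names:
-- Per-query, per-table hint
SELECT order_id, status
FROM   dbo.Orders WITH (NOLOCK)
WHERE  status = 'Archived';

-- Equivalent behaviour for the whole session
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;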
NOLOCK can be highly useful when you are reading old data from a frequently used table. Consider the following example,
You have a stored procedure to access data of inactive projects. You
don't want this stored procedure to lock the frequently used Projects
table while reading old data.
NOLOCK is also useful when dirty reads are not a problem and the data is not frequently modified, such as in the following cases:
Reading list of countries, currencies, etc... from a database to show
in the form. Here the data remains unchanged and a dirty read will
not cause a big problem as it will occur very rarely.
However, starting with SQL Server 2005, the benefit of NOLOCK is very small due to row versioning.
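If blocking rather than dirty reads is the real problem, enabling read committed snapshot isolation on the database gives readers a consistent, non-blocking view without the NOLOCK caveats; a hedged sketch (the database name is a placeholder, and the ALTER needs exclusive access unless you force it as shown):
-- Readers then see the last committed version of each row instead of blocking on writers
ALTER DATABASE MyDb
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;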
I'd like to figure out the best way to archive the data that is no longer needed, in order to improve application performance and also to save disk space. In your experience, what is the best way to implement this, and what kind of tools can I use? Is it better to develop a specific in-house application for that purpose?
One way to manage archiving:
Partition your tables on date range, e.g. monthly partitions
As partitions become obsolete, e.g. after 36 months, those partitions can be moved to your data warehouse, which could be another database or just text files, depending upon your access needs.
After moving, the obsolete partitions can be removed from your primary database, so always maintaining just (e.g.) 36 months of current data.
All this can be automated using a mix of SQL/Shell scripts.
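As a rough illustration of that scheme in Oracle (table, column, partition names, and dates are placeholders):
-- Range-partitioned table, one partition per month
CREATE TABLE sales_history (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER
)
PARTITION BY RANGE (sale_date) (
  PARTITION p_2010_01 VALUES LESS THAN (DATE '2010-02-01'),
  PARTITION p_2010_02 VALUES LESS THAN (DATE '2010-03-01')
);

-- Once a partition ages out of the retention window, copy it to the warehouse,
-- then drop it from the primary database:
ALTER TABLE sales_history DROP PARTITION p_2010_01 UPDATE GLOBAL INDEXES;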
The best way to archive old data in an ORACLE database is:
Define an archive and retention policy based on date or size.
Export archivable data to an external table (tablespace) based on a defined policy.
Compress the external table and store it in a cheaper storage medium.
Delete the archived data from your active database using SQL DELETE.
Then to clean up the space execute the below commands:
alter table T_XYZ enable row movement;
alter table T_XYZ shrink space;
If you still want to free up some disk space back to the OS (as Oracle will have kept the total space that it was previously using reserved), then you may have to resize the datafile itself:
SQL> alter database datafile '/home/oracle/database/../XYZ.dbf' resize 1m;
For more details, please refer to:
http://stage10.oaug.org/file/sroaug080229081203621527.pdf
I would export the data to a comma-delimited file so it can be imported into almost any database. That way, if you change versions of Oracle or go to something else years later, you can restore it without much concern.
Use the spool file feature of SQL*Plus to do this: http://cisnet.baruch.cuny.edu/holowczak/oracle/sqlplus/#savingoutput
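A minimal SQL*Plus sketch of that export (file name, table, and column names are placeholders):
SET HEADING OFF
SET FEEDBACK OFF
SET PAGESIZE 0
SET LINESIZE 1000
SET TRIMSPOOL ON
SPOOL /tmp/orders_archive.csv
-- Build each comma-delimited line by concatenation
SELECT order_id || ',' || TO_CHAR(order_date, 'YYYY-MM-DD') || ',' || amount
FROM   orders_archive;
SPOOL OFF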
I'm trying to test the utility of a new summary table for my data.
So I've created two procedures to fetch the data for a certain interval, each one using a different source table. From my C# console application I just call one or the other. The problem starts when I want to repeat this several times to get a good pattern of response times.
I got something like this: 1199,84,81,81,81,81,82,80,80,81,81,80,81,91,80,80,81,80
Probably my Oracle 10g is doing some inappropriate caching.
How can I solve this?
EDIT: See this thread on asktom, which describes how and why not to do this.
If you are in a test environment, you can put your tablespace offline and online again:
ALTER TABLESPACE <tablespace_name> OFFLINE;
ALTER TABLESPACE <tablespace_name> ONLINE;
Or you can try
ALTER SYSTEM FLUSH BUFFER_CACHE;
but again only on test environment.
When you test on your "real" system, the times you get after the first call (those using cached data) might be more interesting, as in normal operation you will have cached data anyway. Call the procedure twice, and only consider the performance results you get in subsequent executions.
Probably my Oracle 10g is doing some
inappropriate caching.
Actually it seems like Oracle is doing some entirely appropriate caching. If these tables are going to be used a lot then you would hope to have them in cache most of the time.
edit
In a comment on Peter's response Luis said
flushing before the call I got some
interesting results like:
1370,354,391,375,352,511,390,375,326,335,435,334,334,328,337,314,417,377,384,367,393.
These findings are "interesting" because the flush means the calls take a bit longer than when the rows are in the DB cache, but not as long as the first call. This is almost certainly because the server has stored the physical records in its physical cache. The only way to avoid that, to truly run against an empty cache, is to reboot the server before every test.
Alternatively, learn to tune queries properly. Understanding how the database works is a good start. And EXPLAIN PLAN is a better tuning aid than the wall clock. Find out more.
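For example (any query of interest goes in place of the placeholder SELECT, and the table and column names here are invented):
EXPLAIN PLAN FOR
SELECT *                       -- replace with the query you are tuning
FROM   summary_table
WHERE  period_start >= DATE '2010-01-01';

-- Show the plan that was just explained
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);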