mysqldump big MyISAM table starts fast and suddenly slows down - performance

I'm calling mysqldump for a database containing InnoDB and MyISAM tables.
The dump still runs very fast when it reaches a fat MyISAM table of about 11GB.
Fast means iotop shows me more than 70MB/s write throughput.
I watch the process in mytop, so I know it happens on that big table.
The dump file grows to about 8GB and then suddenly the I/O drops to only about 1MB/s.
Server load is OK and no other processes are running.
I tried changing my.cnf settings, but nothing worked.

Performance depends on a few factors.
I had to create an alternative to mysqldump for a client so they could load a 42GB dump file (with more than 1 billion rows).
For reference: originally, mysqldump took 3.9 days on a 16-core server with 64GB RAM and a 10-disk SSD array.
Using uniVocity we loaded the same data in 90 minutes on a 3-year-old laptop. You can use it with a 30-day evaluation license to load this.
Other than that, here are a few things that may impact performance:
Check if you have this on your dump file to disable constraints:
SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO';
If your dump file doesn't have them, add them, or alter the CREATE TABLE statements to remove all constraints. If constraints (primary keys, foreign keys, etc.) are enabled while the dump is being loaded, the process gets slower over time, because the database validates these constraints on every insert against a growing number of possibilities (more PKs and FKs).
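mysqldump normally restores the saved values at the end of the dump; if you add the statements above by hand, the matching footer looks roughly like this:
SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS;
SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS;
SET SQL_MODE=@OLD_SQL_MODE;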
If you are using InnoDB (not exactly your case, but it may help someone else), add this to your my.cnf file:
innodb_doublewrite = 0
innodb_buffer_pool_size = 8000M
# innodb_log_file_size = 512M - If I enable this one the server won't start. Couldn't identify why.
log-bin = 0
innodb_support_xa = 0
innodb_flush_log_at_trx_commit = 0
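After editing my.cnf the server needs a restart to pick up these values; as a quick sanity check (not part of the original answer) you can confirm what is actually in effect with:
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SHOW VARIABLES LIKE 'innodb_doublewrite';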

Related

How to maintain cleanup of the Sonar table PROJECT_MEASURES

We have SonarQube version 6.7.1 with an Oracle DB. We see that the table PROJECT_MEASURES holds a huge number of records: 130,216,515.
What is the best way to keep it cleaned up? Currently it is causing many job failures with SonarQube timeouts.
Example from today, 12:15 to 12:30:
430,574 rows were inserted into that table and 1,300,848 were deleted.
As we suspected, the issue came from the bad performance of PROJECT_MEASURES.
The steps we took to improve it:
A new index, ANALYSIS_UUID_CUSTOM_IDX2, was added to the table
Afterwards, the indexes were rebuilt
The DB cache was 300MB; we allocated a minimum of 2GB and then increased it to 4GB (the DB server has 16GB RAM)
The redo log files were 300MB; we increased them to 1GB
The sequence cache was increased from the default of 20 to 1000
The PROJECT_MEASURES table was shrunk with the COMPACT option
After that, scans worked much faster and all builds passed the SonarQube stage.
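For reference, a rough sketch of those steps in Oracle SQL; the indexed column and the sequence name are assumptions, not taken from the original post:
CREATE INDEX ANALYSIS_UUID_CUSTOM_IDX2 ON PROJECT_MEASURES (ANALYSIS_UUID); -- assumed column
ALTER INDEX ANALYSIS_UUID_CUSTOM_IDX2 REBUILD;
ALTER SEQUENCE PROJECT_MEASURES_SEQ CACHE 1000; -- hypothetical sequence name
ALTER TABLE PROJECT_MEASURES ENABLE ROW MOVEMENT; -- required before SHRINK SPACE
ALTER TABLE PROJECT_MEASURES SHRINK SPACE COMPACT;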

Hadoop vs Cassandra: Which is better for the following scenario?

There is a situation in our systems in which the user can view and "close" a report. After they close it, the report is moved to a temporary table inside the database where it is kept for 24 hours, and then moved to an archives table (where the report is stored for the next 7 years). At any point during those 7 years, a user can "reopen" the report and work on it. The problem is that the archive storage is getting large, and finding/reopening reports tends to be time-consuming. I also need to get statistics on the archives from time to time (i.e. report dates, clients, average length "opened", etc). I want to use a big data approach, but I am not sure whether to use Hadoop, Cassandra, or something else. Can someone provide me with some guidelines on how to get started and decide what to use?
If your archive is large and you'd like to get reports from it, you won't be able to use just Cassandra, as it has no easy means of aggregating the data. You'll end up colocating Hadoop and Cassandra on the same nodes.
From my experience, archives (write once, read many) are not the best use case for Cassandra if you have a lot of writes (we tried it as the backend for a backup system). Depending on your compaction strategy, you'll pay either in space or in IOPS for that. Added changes are propagated through the SSTable hierarchy, resulting in far more writes than the original change.
It is not possible to answer your question in full without knowing other variables: how much hardware (servers, their RAM/CPU/HDD/SSD) are you going to allocate? What is the size of each 'report' entry? How many reads/writes do you usually serve daily? How large is your archive storage now?
Cassandra might work fine. Keep two tables, reports and reports_archive. Define the schema using a TTL of 24 hours and 7 years:
CREATE TABLE reports (
...
) WITH default_time_to_live = 86400;
CREATE TABLE reports_archive (
...
) WITH default_time_to_live = 220752000; -- 86400 * 365 * 7 (7 years; CQL table options take a literal value)
Use the new Time Window Compaction Strategy (TWCS) to minimize write amplification. It could be advantageous to store the report metadata and report binary data in separate tables.
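A minimal sketch of enabling TWCS on the archive table; the 30-day window is an arbitrary example value, not from the original answer:
ALTER TABLE reports_archive
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 30};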
For roll-up analytics, use Spark with Cassandra. You don't mention the size of your data, but roughly speaking 1-3 TB per Cassandra node should work fine. Using RF=3 you'll need at least three nodes.

How to implement ORACLE to VERTICA replication?

I am in the process of creating an Oracle to Vertica pipeline!
We are looking to create a Vertica DB that will run heavy reports. For now all is well: Vertica is fast, space usage is great, and everything is fine until we get to the main part, getting the data from Oracle into Vertica.
OK, the initial load is fine: dump to CSV from Oracle, load into Vertica, and load times are so good that everybody thinks it's a bad joke or there's some magic going on. It is simply fast.
Now the bad part: both databases are up and running (Oracle/Vertica), data keeps being altered in Oracle, and I need to replicate it into Vertica. What now?
From my tests and from what I understand about Vertica, inserts and updates are not to be used beyond maybe 20 per second at most, so real-time replication is out of the question.
So I was thinking of reading the archive log from Oracle and ETL-ing it to create CSV files with the new data, altered data, and deleted/changed values, and then applying them to Vertica, but I cannot use a change list like this,
because explicit data changes in Vertica lead to slow performance.
So I am looking for some ideas about how I can solve this issue, knowing I cannot:
Alter my ORACLE production structure.
Use ORACLE env resources for filtering the data.
Use insert, update or delete statements in my VERTICA load process.
Things I depend on:
The use of copy command
Data consistency
A maximum 60-minute window (every 60 minutes, new/altered data needs to go to VERTICA).
I have looked at Continuent data replication, but it seems that nobody wants to sell their product; I cannot get in touch with them.
Would loading all of the data into a new table and then swapping the tables be acceptable?
copy new() ...
-- you can swap tables in one command:
alter table old,new,swap rename to swap,old,new;
truncate new;
Extract the data from Oracle (in .csv format) and load it using the Vertica COPY command. Write a simple shell script to automate this process.
I used to use Talend (ETL), but it was very slow, so I moved to this conventional process and it has really worked for me. Currently processing 18M records, my entire process takes less than 2 minutes.
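As a minimal sketch, each extract can be loaded with a single COPY; the table name and file path here are hypothetical, and DIRECT loads the data straight to disk storage, which suits bulk loads:
COPY report_data FROM '/data/oracle_extract.csv' DELIMITER ',' NULL '' DIRECT;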

How can I speed up loading data in Oracle tables?

I have some very large tables (to me anyway), as in millions of rows. I am loading them from a legacy system and it is taking forever. Assume the hardware is OK and fast. How can I speed this up? I have tried exporting from one system into CSV and using SQL*Loader - slow. I have also tried a direct link from one system to the other, so there is no intermediate CSV file, just unload from one and load into the other.
One person said something about pre-staging tables and that it somehow could make things faster. I don't know what that is or whether it could help. I was hoping for input. Thank you.
Oracle 11g is what is being used.
Update: my database is clustered, so I don't know if I can do anything to speed things up.
What you can try:
disabling all constraints and only enabling them after the load process
CTAS (create table as select)
What you really should do: understand what your bottleneck is. Is it the network, file I/O, checking constraints...? Then fix that problem. For me, looking at the explain plan is most of the time the first step.
As Jens Schauder suggested, if you can connect to your source legacy system via DB link, CTAS would be the best compromise between performance and simplicity, as long as you don't need any joins on the source side.
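A minimal sketch of that approach; the database link name and table are hypothetical:
-- NOLOGGING reduces redo generation for the initial copy
CREATE TABLE customers_stage NOLOGGING AS
  SELECT * FROM customers@legacy_link;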
Otherwise, you should consider using SQL*Loader and tweaking some settings. Using direct path, I was able to load 100M records (~10GB) in 12 minutes on a 6-year-old ProLiant.
EDIT: I used the data format defined for the Datamation sort benchmark. The generator for it is available in the Apache Hadoop distribution. It generates records with fixed width fields with 99 bytes of data plus a newline character per line of file. The SQL*Loader control file I used for the numbers quoted above was:
OPTIONS (SILENT=FEEDBACK, DIRECT=TRUE, ROWS=1000)
LOAD DATA
INFILE 'rec100M.txt' "FIX 99"
INTO TABLE BENCH (
BENCH_KEY POSITION(1:10),
BENCH_REC_NBR POSITION(13:44),
BENCH_FILLER POSITION(47:98))
What is the configuration you are using?
Does the database into which the data is imported have something like a standby database coupled to it? If so, it is very likely to have force_logging enabled.
You can check this using
SELECT force_logging FROM v$database;
It can also be enabled at tablespace level:
SELECT tablespace_name, force_logging FROM dba_tablespaces;
If your database is running with force_logging, or your tablespace has force_logging, this will have an impact on the import speed.
If this is not the case, check if archivelog mode is enabled.
SELECT log_mode FROM v$database;
If so, it could be that the archive logs are not being written fast enough. In that case, increase the size of the online redo log files.
If the database is not running in archivelog mode, it still has to write to the redo files when not using direct path inserts. In that case, check how quickly the redo logs can be written. Normally 200GB/h is quite possible when indexes are not playing a role.
The important thing is to find which link is causing the lack of performance. It could be the input, it could be the output. Here I focused on the output.
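As a sketch of the redo log resizing suggested above (group numbers, file paths, and sizes are hypothetical; drop an old group only once it is inactive):
ALTER DATABASE ADD LOGFILE GROUP 4 ('/u01/oradata/ORCL/redo04.log') SIZE 1G;
ALTER DATABASE ADD LOGFILE GROUP 5 ('/u01/oradata/ORCL/redo05.log') SIZE 1G;
ALTER DATABASE DROP LOGFILE GROUP 1;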

Oracle SQL*loader running in direct mode is much slower than conventional path load

In the past few days I've been playing around with Oracle's SQL*Loader in an attempt to bulk load data into Oracle. After trying out different combinations of options, I was surprised to find that conventional path load runs much quicker than direct path load.
A few facts about the problem:
Number of records to load is 60K.
Number of records in target table, before load, is 700 million.
Oracle version is 11g r2.
The data file contains date, character (ascii, no conversion required), integer, float. No blob/clob.
Table is partitioned by hash. Hash function is same as PK.
The table's parallel degree is set to 4, while the server has 16 CPUs.
The index is locally partitioned. The index's parallel degree (from ALL_INDEXES) is 1.
There is only 1 PK and 1 index on the target table. The PK constraint is built using the index.
A check on the index partitions revealed that record distribution among the partitions is pretty even.
Data file is delimited.
APPEND option is used.
Select and delete of the loaded data through SQL is pretty fast, almost instant response.
With conventional path, loading completes in around 6 seconds.
With direct path load, loading takes around 20 minutes. The worst run took 1.5 hours to complete, yet the server was not busy at all.
If skip_index_maintenance is enabled, direct path load completes in 2-3 seconds.
I've tried quite a number of options, but none of them gives a noticeable improvement: UNRECOVERABLE, SORTED INDEXES, MULTITHREADING (I am running SQL*Loader on a multi-CPU server). None of them improves the situation.
Here's the wait event I kept seeing during the time SQL*Loader runs in direct mode:
Event: db file sequential read
P1/2/3: file#, block#, blocks (check from dba_extents that it is an index block)
Wait class: User I/O
Does anyone have any idea what has gone wrong with the direct path load? Or is there anything further I can check to really dig into the root cause of the problem? Thanks in advance.
I guess you are falling foul of this:
"When loading a relatively small number of rows into a large indexed table
During a direct path load, the existing index is copied when it is merged with the new index keys. If the existing index is very large and the number of new keys is very small, then the index copy time can offset the time saved by a direct path load."
from When to Use a Conventional Path Load in: http://download.oracle.com/docs/cd/B14117_01/server.101/b10825/ldr_modes.htm
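Related to the skip_index_maintenance observation above: after a direct path load with index maintenance skipped, the affected local index partitions are left UNUSABLE and must be rebuilt. A quick way to see which ones (a sketch, not from the original answer):
SELECT index_name, partition_name, status
  FROM user_ind_partitions
 WHERE status = 'UNUSABLE';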
