What happened with MonetDB last release and COPY INTO?

I'm using version 11.39.11 (Oct2020 SP2) of MonetDB as part of a BI product. MonetDB feeds the fact table from a large flat file of 40 columns and 225 million rows. Everything works like a charm: 16 GB of RAM is enough to load that amount of data, and the amount of memory used by COPY INTO is really small.
I switched MonetDB to release 11.41.5 (Jul2021) and everything works OK, but when I have to feed the fact table with such a big flat file, MonetDB just eats all the memory!
What happened to the COPY INTO statement from release 11.39.11 to 11.41.5?
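For context, the load is a plain COPY INTO of the flat file; a minimal sketch of the kind of statement involved (the table name, file path and delimiters below are placeholders, not the real ones):

COPY 225000000 RECORDS INTO fact            -- declaring the record count up front helps MonetDB size the load in advance
FROM '/data/fact.csv'                       -- server-side path (placeholder)
USING DELIMITERS '|', '\n' NULL AS '';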

Related

How to maintain cleanup of the Sonar table PROJECT_MEASURES

We have SonarQube version 6.7.1 with an Oracle DB. The table PROJECT_MEASURES has grown to a huge number of records: 130,216,515.
What is the best way to keep it clean? Currently it is causing many job failures with SonarQube timeouts.
Example from today, 12:15 to 12:30:
430,574 rows were inserted into that table and 1,300,848 were deleted.
As we suspected, the issue came from the poor performance of PROJECT_MEASURES.
The steps we took to improve it (a rough Oracle sketch of the statements follows the list):
A new index, ANALYSIS_UUID_CUSTOM_IDX2, was added to the table
Afterwards, the existing indexes were rebuilt
The DB cache was 300 MB; we raised the minimum allocation to 2 GB and then increased it to 4 GB (the DB server has 16 GB of RAM)
The redo log files were 300 MB; we increased them to 1 GB
The sequence cache was increased from the default of 20 to 1000
The PROJECT_MEASURES table was shrunk with the COMPACT option
After that, scans ran much faster and all builds passed the SonarQube stage.
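A rough Oracle sketch of the statement-level steps above; only PROJECT_MEASURES and ANALYSIS_UUID_CUSTOM_IDX2 come from the post, while the indexed column, the existing index name and the sequence name are assumptions:

CREATE INDEX ANALYSIS_UUID_CUSTOM_IDX2 ON PROJECT_MEASURES (ANALYSIS_UUID);  -- new index; column assumed
ALTER INDEX PROJECT_MEASURES_PK REBUILD;                                     -- repeat for each existing index (name assumed)
ALTER SEQUENCE PROJECT_MEASURES_SEQ CACHE 1000;                              -- sequence cache raised from 20 to 1000 (name assumed)
ALTER TABLE PROJECT_MEASURES ENABLE ROW MOVEMENT;                            -- required before SHRINK SPACE
ALTER TABLE PROJECT_MEASURES SHRINK SPACE COMPACT;                           -- shrink with the COMPACT option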

mysqldump big MyISAM table starts fast and suddenly slows down

I'm calling mysqldump for a database containing InnoDB and MyISAM tables.
The dump still runs very fast when it reaches a fat MyISAM table of about 11 GB.
Fast means iotop shows more than 70 MB/s write throughput.
I watch the process in mytop, so I know it is working on that big table.
The dump file grows to about 8 GB, and then suddenly the I/O drops to only about 1 MB/s.
Server load is fine and no other processes are running.
I've tried changing my.cnf settings, but nothing has worked.
Performance depends on a few factors.
I had to create an alternative to mysqldump for a client so they could load a 42 GB dump file (more than 1 billion rows).
For reference: the original mysqldump load took 3.9 days on a 16-core server with 64 GB of RAM and a 10-disk SSD array.
Using uniVocity we loaded the same data in 90 minutes on a three-year-old laptop. You can use it with a 30-day evaluation license.
Other than that, here are a few things that may impact performance:
Check whether your dump file contains these statements to disable constraints:
SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO';
If it doesn't, add them, or alter the CREATE TABLE statements to remove all constraints. If constraints (primary keys, foreign keys, etc.) are enabled while your dump is loading, the process will get slower over time, because the database validates these constraints on every insert against a growing number of existing keys (more PKs and FKs).
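If you add those statements by hand, the usual counterpart at the end of the dump file restores the original settings, roughly:

SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS;
SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS;
SET SQL_MODE=@OLD_SQL_MODE;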
If you are using InnoDB (not exactly your case, but it may help someone else), add this to your my.cnf file:
innodb_doublewrite = 0
innodb_buffer_pool_size = 8000M
# innodb_log_file_size = 512M - If I enable this one the server won't start. Couldn't identify why.
log-bin = 0
innodb_support_xa = 0
innodb_flush_log_at_trx_commit = 0

How to implement ORACLE to VERTICA replication?

I am in the process of setting up Oracle to Vertica replication.
We are building a Vertica DB that will run heavy reports. So far everything is fine: Vertica is fast, space usage is great, and all is well, until we get to the main part, getting the data from Oracle into Vertica.
The initial load is fine: dump from Oracle to CSV, load into Vertica; load times are so short that everybody thinks it is a bad joke or that some magic is going on. It is simply fast.
Now the bad part: both databases are up and running (Oracle/Vertica), data keeps getting altered in Oracle, and I need to replicate those changes to Vertica. What now?
From my tests and from what I understand about Vertica, single-row inserts and updates should not be used (maybe 20 per second at most), so real-time replication is out of the question.
So I was thinking of reading the archive log from Oracle and ETL-ing it into CSV files containing the new, altered and deleted rows, and then applying them to Vertica. But I cannot apply a change list like this row by row, because explicit data changes in Vertica lead to slow performance.
So I am looking for ideas on how to solve this issue, knowing that I cannot:
Alter my Oracle production structure.
Use Oracle environment resources for filtering the data.
Use insert, update or delete statements in my Vertica load process.
Things I depend on:
The use of the COPY command.
Data consistency.
A maximum 60-minute window (every 60 minutes, new/altered data needs to go to Vertica).
I have looked at Continuent data replication, but it seems nobody wants to sell their product; I cannot get in touch with them.
Would loading the whole data set into a new table and then swapping the tables be acceptable?
copy new(...) ...                              -- load the fresh extract into "new"
-- you can swap the tables in one command:
alter table old, new, swap rename to swap, old, new;
truncate table new;
Extract the data from Oracle (in CSV format) and load it using the Vertica COPY command; write a simple shell script to automate the process.
I used to use Talend (ETL), but it was very slow, so I moved to this conventional process and it has really worked for me. Currently I process 18M records and the entire run takes less than 2 minutes.
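For illustration, the Vertica half of such a script is just a COPY from the extracted CSV; a minimal sketch, with the table name and file path as placeholders:

COPY fact_new FROM LOCAL '/tmp/oracle_extract.csv'
DELIMITER ',' NULL AS '' DIRECT;
-- then swap fact_new in as shown in the previous answer, or query it directly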

SSIS File Load WAY TOO SLOW in Large Destination Table

This is my first question. I've searched a lot of information on different sites, but none of it was conclusive.
Problem:
Daily I'm loading a flat file with an SSIS package executed by a scheduled job in SQL Server 2005, but it is taking far too long (about 2.5 hours), even though the file only has about 300 rows and is roughly 50 MB. This is driving me crazy because it is affecting the performance of my server.
This is the Scenario:
-My package is just a Data Flow Task with a Flat File Source and an OLE DB Destination, that's all.
-The Data Access Mode is set to FAST LOAD.
-The table has just 3 indexes, all nonclustered.
-My destination table has 366,964,096 records so far and 32 columns
-I haven't set FastParse on any of the output columns yet (I want to try something else first).
So I've just started to make some tests:
-Rebuilt/reorganized the indexes on the destination table (they were far too fragmented), but this didn't help much.
-Created another table with the same structure but without any of the indexes, and executed the job with the SSIS package loading into this new table: it took only about 1 minute!
So I'm confused; is there something I'm missing?
-Is the SSIS package writing the whole large table to a buffer and then writing it to disk? Otherwise, why the big difference in time?
-Are the indexes affecting the insertion time?
-Should I load the file into this new table as a staging table and then do a BULK INSERT into the destination table with the records ordered? I thought the Data Flow Task was much faster than BULK INSERT, but at this point I'm not sure anymore.
Thanks in advance.
One thing I might look at is whether the large table has any triggers that slow down inserts. Also, if the clustered index is on a field that requires a good bit of rearranging of the data during the load, that could cause issues as well.
In SSIS packages, using a merge join (which requires sorting) can cause slowness, but from your description it doesn't appear you did that. I mention it only in case you were doing that and didn't say so.
If it works fine without the indexes, perhaps you should look into those. What are the data types? How many are there? Maybe you could post their definitions?
You could also take a look at the fill factor of your indexes - especially the clustered index. Having a high fill factor could cause excessive IO on your inserts.
Well, I rebuilt the indexes with a different fill factor (80%) as Sam suggested, and the load time dropped significantly: 30 minutes instead of almost 3 hours!
I will keep testing to fine-tune the DB. I didn't even have to create a clustered index; I guess with a clustered index the time would drop a lot more.
Thanks to all; I hope this helps someone in the same situation.
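For reference, the kind of rebuild described above, with an explicit fill factor, looks roughly like this in SQL Server (the table name is a placeholder):

ALTER INDEX ALL ON dbo.FactDestination       -- rebuild every index on the destination table
REBUILD WITH (FILLFACTOR = 80);              -- leave 20% free space per page to reduce page splits on insert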

PSQL = fast, remote sql = v.slow

Okay, I appreciate that the question is a tad vague, but after a day of googling I'm getting nowhere. Any help will be appreciated, and I'm willing to try anything.
The issue is that we have a PostgreSQL DB with around 10-15 million rows in a particular table.
We're doing a select for all the columns, based on a DateTime field in the table. No joins, just a standard select with a where clause (time >= x AND time <= y). There is an index on the field as well...
When I run the SQL using psql on the local machine, it takes around 15-20 seconds and brings back 0.5 million rows; one of the columns is a text field holding a large amount of data per row (a program stack trace). When we run the same SQL through Npgsql or pgAdmin III on Windows, it takes around 2 minutes.
This leads me to think it's a network issue. I've checked the machine while the query is running: it isn't using a huge amount of memory or CPU, and the network traffic is negligible.
I've also gone through the recommendations on the PostgreSQL site for the memory settings, including updating shmmax and shmall.
It's Ubuntu 10.04, PostgreSQL 8.4, 4 GB RAM, 2.8 GHz quad-core Xeon (virtual but with dedicated resources). The machine also hosts its Windows counterpart (2008 R2, SQL Server 2008), but that is turned off. The query returns in around 10-15 seconds using SQL Server with the same schema and data; I know this isn't a direct comparison, but I wanted to show that it isn't a disk performance issue.
So the question is: any suggestions? Are there any network settings I should be changing? Anything I've missed? I can't give too much information about the database, but here is an obfuscated EXPLAIN ANALYZE:
Index Scan using "IDX_column1" on "table1" (cost=0.00..45416.20 rows=475130 width=148) (actual time=0.025..170.812 rows=482266 loops=1)
Index Cond: (("column1" >= '2011-03-14 00:00:00'::timestamp without time zone) AND ("column1" <= '2011-03-14 23:59:59'::timestamp without time zone))
Total runtime: 196.898 ms
Try setting cursor_tuple_fraction to 1 in psql and see if it changes the results. If so, it is likely that the optimiser is picking a better plan when it only has to return the first 10% or so of the results compared to fetching the whole lot. I seem to recall that psql uses a cursor to fetch results piece by piece rather than the "firehose" ExecuteQuery method.
If this is the case, it doesn't point directly to a solution, but you will need to tweak your planner settings, and at least if you can reproduce the behaviour in psql then it may be easier to see the differences and test changes.
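A quick way to test this from psql (cursor_tuple_fraction is a standard planner setting; the table and column names below are the obfuscated ones from the plan):

SET cursor_tuple_fraction = 1.0;   -- default is 0.1; plan for fetching the whole result set
EXPLAIN ANALYZE
SELECT * FROM "table1"
WHERE "column1" >= '2011-03-14 00:00:00'
  AND "column1" <= '2011-03-14 23:59:59';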
