MonetDB Full Disk: How To Manually Free Space

My question is similar to this one: essentially, I forgot a clause in a join when using MonetDB, which produced an enormous result that filled the disk on my computer. MonetDB didn't clean up after this, and despite freeing space and waiting 24 hours, the disk is still much fuller than it should be.
See below for the size of the database as reported by MonetDB (in GB):
sql>SELECT CAST(SUM(columnsize) / POWER(1024, 3) AS INT) columnSize FROM STORAGE();
+------------+
| columnsize |
+============+
|        851 |
+------------+
1 tuple
And the size of the farm on disk:
sudo du -hs ./*
3,2T ./data_warehouse
5,5M ./merovingian.log
The difference in size is unexplained and appeared suddenly after launching the query that generated an extremely large result.
I can track these files down to the merovingian.log file and the BAT directory inside the warehouse, where many large files named after integers with .tail or .theap extensions can be found:
sudo du -hs ./*
2,0T ./data_warehouse
1,3T ./merovingian.log
4,0K ./merovingian.pid
My question is: how can I manually free this disk space without corrupting the database? Can any of these files be safely deleted, or is there a command I can run to get MonetDB to free this space?
So far I've tried the following with no effect:
Restarting the database
Installing the latest version of the database (as I did the last time this happened); my current version is MonetDB Database Server Toolkit v11.37.11 (Jun2020-SP1)
Various VACUUM and FLUSH commands documented here (note that VACUUM doesn't run on my version)
Checking online and reading the mailing list
Many thanks in advance for any assistance.

Normally, during query execution, MonetDB will free up memory and files that are no longer needed. But if that doesn't happen, you can try the following manual clean-up.
First, lock and stop the database (it's called warehouse?):
monetdb lock warehouse
monetdb stop warehouse
You can fairly safely remove the merovingian.log to gain 1.3T (this log file can contain useful information for debugging, but at its current size, it's a bit difficult to use). The kill command below tells monetdbd to start a new log file:
rm /<path-to>/merovingian.log
kill -HUP `pgrep monetdbd`
Then restart the database:
monetdb release warehouse
monetdb start warehouse
During the start-up, the MonetDB server should clean up the left-over transient data files from the previous session.
Concerning the size difference between SUM(columnsize) and on-disk size:
there can be index files and string heap files. Their sizes are reported in separate columns returned by storage().
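For example, a query along these lines (just a sketch; the exact size columns, such as heapsize, hashes and imprints, can differ between MonetDB versions) adds those sizes up next to columnsize:
sql>SELECT CAST(SUM(columnsize) / POWER(1024, 3) AS INT) AS col_gb,
        CAST(SUM(heapsize) / POWER(1024, 3) AS INT) AS heap_gb,
        CAST(SUM(hashes) / POWER(1024, 3) AS INT) AS hash_gb,
        CAST(SUM(imprints) / POWER(1024, 3) AS INT) AS imprint_gb
    FROM STORAGE();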
In your case, the database directory probably contains a lot of intermediate data files generated for the computation of your query.

Related

Where can I see recent HDFS usage statistics (folders, files, timestamps)?

I have been seeing an intense amount of disk usage on HDFS over the last 10 days. Judging by the DataNode hosts on the Hosts tab in Cloudera Manager and the Disk Usage charts for the HDFS service, usage has almost tripled, from ~7TB to ~20TB. At first I thought the reason was something I did wrong in the upgrade of CM and CDH I performed on the 6th of those 10 days, but I realized the growth had started before that.
I checked the File Browser in Cloudera Manager first, but saw no difference between the size numbers there and before. I also have disk usage reports for the last 4 days; they say there has been no increase.
Running hdfs dfsadmin -report also returns the same.
The dfs folders on Linux confirm the increasing usage, but I can't tell what has changed because there are millions of files and I don't know how to find the most recently modified files across thousands of nested folders. Even if I find them, I can't tell which HDFS files they correspond to.
Then, just recently, I was informed that another user on HDFS has been splitting their large files. They own nearly 2/3 of all the data. Could splitting them into many more files that are smaller than the HDFS block size cause this much of an increase? If so, why can't I see it in the Browser/Reports?
Is there any way to check which folders and files have been modified recently in HDFS, or other things I can check or do? Any suggestion or comment is appreciated.
For checking HDFS activities, Cloudera Navigator provides excellent information about all the events logged in HDFS.
After logging into Navigator, check the Audits tab. It also allows you to filter activities by event type (such as delete), IP address, username, and many other things.
The normal search page also lets you filter by block size (whether < 256MB or > 256MB), whether it's a file or a directory, the source type, the path, the replication count, and more.
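Outside of Navigator, a rough command-line sketch for hunting down recently modified HDFS paths (this assumes GNU date and uses /user as the tree of interest; adjust the cutoff and path as needed):
cutoff=$(date -d '10 days ago' +%Y-%m-%d)
# field 6 of 'hdfs dfs -ls' output is the modification date (YYYY-MM-DD)
hdfs dfs -ls -R /user 2>/dev/null | awk -v c="$cutoff" '$6 >= c {print $6, $7, $5, $8}' | sort | tail -n 50
This prints the 50 most recently modified entries with their date, time, size and path, which at least narrows down where the new data is landing.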

Greenplum DCA - How to backup & restore Version V2 to V3

We have a small array of Greenplum DCA V1 and V3.
We are trying to work out the backup/restore process steps between them.
As a novice to DCA appliances, I'm banging my head against the wall trying to understand the parallel backup process in a logical way.
We tried to conduct a parallel backup using gpcrondump/gpdbrestore, but did not understand how the process executes:
on the master host
on the segment hosts
The question is: how does a parallel backup work in a master-segment DCA environment, from version to version?
gpcrondump executes a backup in parallel. It basically coordinates the backups across all segments. By default, each segment creates a db_dumps directory in its $PGDATA directory, with a date-named sub-directory under that.
For example, let's say you have 4 segments per host and hosts sdw1-4. The dumps will be created in:
/data1/gpseg0/db_dumps/20161111/
/data1/gpseg1/db_dumps/20161111/
/data2/gpseg2/db_dumps/20161111/
/data2/gpseg3/db_dumps/20161111/
This repeats across all segments.
Each segment dumps only its own data to this dump location. gpcrondump names the files, makes sure each dump completes successfully, etc., as each segment dumps its data independently of the other segments. Thus, it is done in parallel.
The master will also have a backup directory created but there isn't much data in this location. It is mainly metadata about the backup that was executed.
The metadata for each backup is pretty important. It contains the segment id and the content id for the backup.
gpdbrestore restores a backup created by gpcrondump. It reads the backup files and loads them into the database, checking that the segment id and content id match the target. So the number of segments in the backup must match the number of segments being restored to, and the mapping of segment id to content id has to be the same.
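As a rough illustration only (exact flags can vary between Greenplum/DCA releases, and the database name and timestamp key here are made up), a dump-and-restore cycle driven from the master hosts might look like:
# on the source master: dump the "sales" database across all segments, without prompting
gpcrondump -x sales -a
# on the target master (same segment count and segment-id/content-id mapping): restore by timestamp key
gpdbrestore -t 20161111083000 -a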
Migration from one cluster can be done multiple ways. One way is to do a backup and then restore. This requires the same configuration in both clusters. You have to copy all of the backup files from one cluster to the other as well. Alternatively, you could backup and restore from a backup device like DataDomain.
You can also use a built-in tool called gptransfer. This doesn't use a backup; instead, it uses external tables to transfer data from one cluster to another. The configurations of the two clusters don't have to be the same when using this tool, but if you are going from a larger cluster to a smaller cluster, the transfer will not be done in parallel.
I highly recommend you reach out to your Pivotal Account Rep to get some assistance. More than likely, you have already paid for services when buying the new DCA that will cover part or all of the migration work. You will have to configure networking between the two clusters which requires some help from EMC too.
Good luck!!

Backup COPY vs BACKUPSET

Oracle has two options for backing up a database, and the documentation on them is very brief.
To back up to disk as image copies, use BACKUP AS COPY, as shown in:
BACKUP AS COPY
DEVICE TYPE DISK
DATABASE;
To back up your data into backup sets, use the AS BACKUPSET clause. You can allow backup sets to be created on the configured default device, or direct them specifically to disk or tape:
BACKUP AS BACKUPSET
DATABASE;
BACKUP AS BACKUPSET
DEVICE TYPE DISK
DATABASE;
What is the difference between the two, and why are there these multiple options?
To put it simply, BACKUP AS COPY makes a straightforward copy of the database files (the same way the Linux cp command does), whereas a backup set is a logical entity made up of backup pieces, much as a tablespace is made up of data files. Backup pieces are stored in an RMAN-specific binary format.
Why are there these multiple options?
To give you the opportunity to perform backup and recovery more effectively and efficiently. For example, you can simply switch to an image copy of a data file, avoiding a possibly time-consuming restore process. But you cannot perform incremental backups with image copies the way you can with backup sets, etc.
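As a sketch of that switch-instead-of-restore idea (the datafile number is made up, and the exact syntax can vary between Oracle releases; typically the datafile is taken offline first and brought back online after recovery):
RMAN> SWITCH DATAFILE 4 TO COPY;
RMAN> RECOVER DATAFILE 4;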
The choice of options, of course, depends on your backup and recovery strategy.
Find out more

Restoring Incremental backups in Oracle 10g

In our application, we are planning to go for incremental backups because of the excess time a full backup takes. Now we have two dump files: one is a full backup and the other is an incremental backup taken since the previous full or incremental backup. My problem is that I need to merge these two dump files to get the latest data, which I can then import, but I am not able to work out how to merge the two backups (full backup and incremental backup). I have read about RMAN but did not get a clear idea of the syntax of the RESTORE command in RMAN. Please help me with this.
Exactly what do you mean when you say you have an "incremental backup"?
You talk about having "two dump files" which implies that you have the output of two different calls to the export utility. Potentially, the second export call could have used the INCTYPE parameter. That is not what most people would mean when they talk about a backup or an incremental backup. An incremental export will do a complete export of every table where any data changed between the last export and the "incremental" export. That is almost never what people want (or think they're getting) from an incremental export. If you have the output of two calls to the export utility, there is no way to merge them. You'd need to import the full export and then the incremental export (which would completely re-load all the data in most if not all of the tables). And dump files cannot be used with the RMAN utility.
When you talk about your "two dump files", it's also possible, I suppose, that you are referring to an actual RMAN full backup and a RMAN incremental backup. That would almost certainly involve more than two files and wouldn't normally be called a "dump file" but you would at least be able to restore the backups using RMAN. Can you post the RMAN backup command you used to create the backups (if you did, indeed, create physical backups using RMAN)?
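For reference, if you do move to real RMAN backups, an incremental strategy looks roughly like this (a sketch only; RMAN applies the incrementals itself during recovery, so there is no manual merge step):
RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE;
RMAN> BACKUP INCREMENTAL LEVEL 1 DATABASE;
RMAN> RESTORE DATABASE;
RMAN> RECOVER DATABASE;
The level 0 backup is the baseline, level 1 captures the changes since then, and RECOVER applies the level 1 backups (plus archived logs) on top of the restored level 0, so the "merge" happens automatically.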

MySQL Dump Limit? MySQL overall database size limit?

A client just had ~1000 rows of data (the most recent, of course) go missing from one of their tables. Doing some forensics, I found that the "last_updated_date" in all of the other rows of that table was also set to roughly the same time the deletion occurred. This is not one of their larger tables.
Some other oddities: the mysqldumps for the last week are all exactly the same size -- 10375605093 bytes. Previous dumps grew by about 0.5GB each. The mysqldump command is standard:
/path/to/mysqldump -S /path/to/mysqld2.sock --lock-all-tables -u username -ppassword database > /path-to-backup/$(date +%Y%m%d)_live_data.mysqldump
df -h on the box shows plenty of space (at least 50% free) on every filesystem.
The data loss, combined with the fact that their dumps are not increasing in size, has me worried that somehow we're hitting some hardcoded limit in MySQL and (God, I hope I'm wrong) data is getting corrupted. Has anyone ever heard of anything like this? How can we explain the mysqldump sizes?
50% free space doesn't mean much if you're doing multiple multi-gig dumps and run out of space halfway. Unless you're storing binary data in your dumps, they are quite compressible, so I'd suggest piping mysqldump's output through gzip before outputting to a file:
mysqldump .... | gzip -9 > /path_to_backup/....
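Building on that, here's a minimal wrapper sketch that also fails loudly when the dump is cut short, so a half-written file isn't mistaken for a good backup (the paths are the ones from your command; the .gz suffix is my addition):
#!/bin/bash
set -o pipefail
if ! /path/to/mysqldump -S /path/to/mysqld2.sock --lock-all-tables -u username -ppassword database \
    | gzip -9 > /path-to-backup/$(date +%Y%m%d)_live_data.mysqldump.gz; then
    echo "mysqldump failed; do not trust today's backup file" >&2
    exit 1
fi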
MySQL itself doesn't have any arbitrary limits that say "no more after X gigs", but there are limits imposed by the platform it's running on, detailed here.
There is no hardcoded limit to the amount of data MySQL can handle.
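To sanity-check how much data the server actually holds against what the dumps contain, a query along these lines can help (the sizes reported by information_schema are approximate, especially for InnoDB):
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / POWER(1024, 3), 1) AS approx_size_gb
FROM information_schema.tables
GROUP BY table_schema
ORDER BY approx_size_gb DESC;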
