MongoDB incremental backups - bash

I was given the task of setting up incremental backups for a MongoDB replica set. As a starting point I of course googled it and could not find anything in the MongoDB docs. I did, however, find this question, which encouraged me to develop my own solution, since Tayra didn't seem very active.
I read about the oplog and realized it would be easy to develop something to replay the log, but it turns out I didn't have to, as mongorestore does that for me.
Now I have a working solution with bash scripts, and it was quite easy, which is why I am asking here whether there is any flaw in my logic, or maybe something that will bite me in the future.
Below is how I implemented it:
Full backup procedure:
Lock writes on a secondary member: db.fsyncLock()
Take a snapshot
Record the last position from the oplog:
db.oplog.rs.find().sort({$natural:-1}).limit(1).next().ts
Unlock writes: db.fsyncUnlock()
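A minimal bash sketch of those full-backup steps (the host name, backup path, and last-position file are placeholders I made up; the snapshot command depends entirely on your storage and is only hinted at in a comment):
SECONDARY="secondary.example"
mongo --host "$SECONDARY" --eval 'db.fsyncLock()'
# take the snapshot of the dbpath here (LVM/ZFS/EBS/etc. - whatever your storage provides)
mongo --host "$SECONDARY" --quiet --eval 'db.getSiblingDB("local").oplog.rs.find().sort({$natural:-1}).limit(1).next().ts' > /mnt/mongo-test_backup/last_oplog_pos   # hypothetical file holding e.g. Timestamp(1437725201, 50)
mongo --host "$SECONDARY" --eval 'db.fsyncUnlock()'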
Incremental backup procedure:
Lock writes on a secondary member
Dump the oplog from the oplog position recorded during the full (or latest incremental) backup:
mongodump --host <secondary> -d local -c oplog.rs -o /mnt/mongo-test_backup/1 --query '{ "ts" : { $gt : Timestamp(1437725201, 50) } }'
Record the latest oplog position (same way as for full backups)
Unlock writes
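A matching sketch for the incremental run, reading the timestamp recorded by the previous backup (same made-up host and last-position file as in the sketch above):
SECONDARY="secondary.example"
LAST_TS=$(cat /mnt/mongo-test_backup/last_oplog_pos)   # e.g. Timestamp(1437725201, 50)
mongo --host "$SECONDARY" --eval 'db.fsyncLock()'
mongodump --host "$SECONDARY" -d local -c oplog.rs -o /mnt/mongo-test_backup/1 --query "{ \"ts\" : { \$gt : $LAST_TS } }"
mongo --host "$SECONDARY" --quiet --eval 'db.getSiblingDB("local").oplog.rs.find().sort({$natural:-1}).limit(1).next().ts' > /mnt/mongo-test_backup/last_oplog_pos
mongo --host "$SECONDARY" --eval 'db.fsyncUnlock()'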
Full backup restore procedure:
Stop all instances of mongod
Copy the snapshot to the data dir of the box that will become the primary, but make sure to exclude all local* files and mongod.lock (this restore technique is called "reconfigure by breaking the mirror")
Start the primary
Reconfigure the replica set
Start the secondaries without any data and let them perform an initial sync, or copy the data from the new primary along with a fresh local database
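A rough sketch of that restore on the box that will become the primary (dbpath, replica set name, log path, and snapshot location are placeholders; run it only after every mongod has been stopped):
rsync -a --exclude 'local*' --exclude 'mongod.lock' /mnt/snapshot/ /var/lib/mongodb/
mongod --dbpath /var/lib/mongodb --replSet rs0 --fork --logpath /var/log/mongodb/mongod.log
# the restored node has no replica set config (local was excluded), so initiate a fresh one
mongo --eval 'rs.initiate()'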
Restore incremental backup:
When we created the incremental backup, it was stored like this:
/mnt/mongo-test_backup/1/local/oplog.rs.bson
/mnt/mongo-test_backup/1/local/oplog.rs.metadata.json
We're interested in oplog.rs.bson, but we will have to rename it, so here are the steps:
Change directory to the backup: cd /mnt/mongo-test_backup/1/local
Delete the JSON file: rm *.json
Rename the BSON file: mv oplog.rs.bson oplog.bson
Restore it:
mongorestore -h <primary> --port <port> --oplogReplay /mnt/mongo-test_backup/1/local
I have it all scripted; I may push it to GitHub later.
The question is whether there is any flaw in the logic. I am a bit suspicious, as the procedure is quite straightforward and yet I couldn't find it documented anywhere.

Related

Change Source Directory in Clickhouse

I'm trying to change /var/lib/clickhouse to something like /mnt/sdc/clickhouse so that I can have ClickHouse on another hard disk. I've tried these steps:
1. Stop ClickHouse
2. Move the directory /var/lib/clickhouse to /mnt/sdc/clickhouse
3. Replace all /var/lib/ occurrences with /mnt/sdc/ in the file /etc/clickhouse-server/config.xml
4. Start ClickHouse
But the problem is that /var/lib/clickhouse contains hard links, so when I mv the directory, those hard links break.
Is this OK or not?
How should I change the ClickHouse directory?
To copy files while preserving hard links, you can use rsync with the --hard-links (-H) option. For your setup, you should be able to run the following:
rsync -a -H /var/lib/clickhouse/ /mnt/sdc/clickhouse
Note the trailing slash after the first directory to copy the directory contents rather than the directory itself.
Then, as you mentioned, update the /var/lib/ paths to /mnt/sdc/ in /etc/clickhouse-server/config.xml, and restart ClickHouse with systemctl restart clickhouse-server.
I was able to follow these steps to migrate ClickHouse data to a new disk mount using rsync, and ClickHouse restarted successfully using the new disk (ClickHouse v22.3 on Ubuntu 18.04).
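Putting it together, a minimal end-to-end sketch of the migration (assuming a systemd-managed server and the default clickhouse service user; adjust paths and ownership to your setup):
sudo systemctl stop clickhouse-server
sudo rsync -a -H /var/lib/clickhouse/ /mnt/sdc/clickhouse
sudo chown -R clickhouse:clickhouse /mnt/sdc/clickhouse   # assumes the default service user
# edit /etc/clickhouse-server/config.xml to point at /mnt/sdc/clickhouse, then:
sudo systemctl start clickhouse-server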

Is it possible to sync multiple clients over a central server using just rsync and POSIX shell scripting?

The scenario
I have a file server that acts as a master storage for the files to sync, and I have several clients that hold a local copy of the master storage. Each client may alter files from the master storage, add new ones, or delete existing ones. I would like all of them to stay in sync as well as possible by regularly performing a sync operation, yet the only tool I have available everywhere for that is rsync, and I can only run script code on the clients, not on the server.
The problem
rsync doesn't perform a bi-directional sync, so I have to sync from server to client as well as from client to server. This works okay for files that just changed by running two rsync operations but it fails when files have been added or deleted. If I don't use rsync with a delete option, clients cannot ever delete files as the sync from the server to the client restores them. If I use a delete option, then either the sync from server to client runs first and deletes all new files the client has added or the sync from client to server runs first and deletes all new files other clients have added to the server.
The question
Apparently rsync alone cannot handle that situation, since it is only supposed to bring one location in sync with another location. I surely need to write some code, but I can only rely on POSIX shell scripting, which seems to make achieving my goals impossible. So can it even be done with rsync?
What is required for this scenario are three sync operations and awareness of which files the local client has added/deleted since the last sync. This awareness is essential and establishes a state, which rsync doesn't have, as rsync is stateless; when it runs it knows nothing about previous or future operations. And yes, it can be done with some simple POSIX scripting.
We will assume three variables are set:
metaDir is a directory where the client can persistently store files related to the sync operations; the content itself is not synced.
localDir is the local copy of the files to be synced.
remoteStorage is any valid rsync source/target (can be a mounted directory or an rsync protocol endpoint, with or w/o SSH tunneling).
After every successful sync, we create a file in the meta dir that lists all files in the local dir; we need this to track files getting added or deleted between two syncs. In case no such file exists, we have never run a successful sync. In that case we just sync all files from remote storage, build such a file, and we are done:
filesAfterLastSync="$metaDir/files_after_last_sync.txt"
if [ ! -f "$metaDir/files_after_last_sync.txt" ]; then
rsync -a "$remoteStorage/" "$localDir"
( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"
exit 0
fi
Why ( cd "$localDir" && find . ) | sed "s/^\.//"? Files need to be rooted at $localDir for rsync later on. If a file $localDir/test.txt exists, the generated output file line must be /test.txt and nothing else. Without the cd and an absolute path for the find command, it would contain /..abspath../test.txt and without the sed it would contain ./test.txt. Why the explicit sort call? See further downwards.
If that isn't our initial sync, we should create a temporary directory that auto-deletes itself when the script terminates, no matter which way:
tmpDir=$( mktemp -d )
trap 'rm -rf "$tmpDir"' EXIT
Then we create a file list of all files currently in local dir:
filesForThisSync="$tmpDir/files_for_this_sync.txt"
( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesForThisSync"
Now why is there that sort call? The reason is that I need the file list to be sorted below. Okay, but then why not tell find to sort the list? That's because find does not guarantee to sort the same way as sort does (that is explicitly documented on the man page), and I need exactly the order that sort produces.
Now we need to create two special file lists: one containing all files that were added since the last sync and one containing all files that were deleted since the last sync. Doing so is a bit tricky with just POSIX, but various possibilities exist. Here's one of them:
newFiles="$tmpDir/files_added_since_last_sync.txt"
join -t "" -v 2 "$filesAfterLastSync" "$filesForThisSync" > "$newFiles"
deletedFiles="$tmpDir/files_removed_since_last_sync.txt"
join -t "" -v 1 "$filesAfterLastSync" "$filesForThisSync" > "$deletedFiles"
By setting the delimiter to an empty string, join compares whole lines. Usually the output would contain all lines that exist in both files, but we instruct join to only output lines of one file that cannot be matched with lines of the other file. Lines that only exist in the second file must be from files that have been added, and lines that only exist in the first file must be from files that have been deleted. And that's why I use sort above, as join can only work correctly if the lines were sorted by sort.
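A tiny illustration of that behavior with made-up file lists (this assumes your join accepts an empty -t delimiter, exactly as the commands above do):
printf '/a.txt\n/b.txt\n' > old.txt    # state after the last sync
printf '/b.txt\n/c.txt\n' > new.txt    # state now
join -t "" -v 2 old.txt new.txt        # prints /c.txt  (added since last sync)
join -t "" -v 1 old.txt new.txt        # prints /a.txt  (deleted since last sync)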
Finally we perform three sync operations. First we sync all new files to the remote storage to ensure these are not getting lost when we start working with delete operations:
rsync -aum --files-from="$newFiles" "$localDir/" "$remoteStorage"
What is -aum? -a means archive, which means sync recursively, keep symbolic links, keep file permissions, keep all timestamps, try to keep ownership and group, and some others (it's a shortcut for -rlptgoD). -u means update, which means if a file already exists at the destination, only sync if the source file has a newer last modification date. -m means prune empty directories (you can leave it out if that isn't desired).
Next we sync from remote storage to local with deletion, to pick up all changes and file deletions performed by other clients, yet we exclude the files that have been deleted locally, as otherwise those would get restored, which we don't want:
rsync -aum --delete --exclude-from="$deletedFiles" "$remoteStorage/" "$localDir"
And finally we sync from local to remote storage with deletion, to update files that were changed locally and delete files that were deleted locally.
rsync -aum --delete "$localDir/" "$remoteStorage"
Some people might think that this is too complicated and it can be done with just two syncs. First sync remote to local with deletion and exclude all files that were either added or deleted locally (that way we also only need to produce a single special file, which is even easier to produce). Then sync local to remote with deletion and exclude nothing. Yet this approach is faulty. It requires a third sync to be correct.
Consider this case: Client A created FileX but hasn't synced yet. Client B also creates FileX a bit later and syncs at once. When client A now performs the two syncs above, FileX on remote storage is newer and should replace FileX on client A, but that won't happen. The first sync explicitly excludes FileX; it was added to client A and thus must be excluded so it isn't deleted by the first sync (client A cannot know that FileX was also added and uploaded to the remote by client B). And the second one would only upload to the remote and exclude FileX, as the remote one is newer. After the sync, client A has an outdated FileX, despite the fact that an updated one exists on the remote.
To fix that, a third sync from remote to local without any exclusion is required. So you would also end up with three sync operations, and compared to the three I presented above, I think the ones above are always at least as fast and sometimes even faster, so I would prefer them; however, the choice is yours. Also, if you don't need to support that edge case, you can skip the last sync operation; the problem will then resolve itself automatically on the next sync.
Before the script quits, don't forget to update our file list for the next sync:
( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"
Finally, --delete implies --delete-before or --delete-during, depending on your version of rsync. You may prefer a different or explicitly specified delete option.
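For convenience, here is the whole flow stitched together as one sketch (the same three variables as above must be set beforehand; no error handling, and the same assumptions about sort and join behavior):
#!/bin/sh
# metaDir, localDir and remoteStorage must already be set

filesAfterLastSync="$metaDir/files_after_last_sync.txt"

# initial sync: pull everything and record the file list
if [ ! -f "$filesAfterLastSync" ]; then
    rsync -a "$remoteStorage/" "$localDir"
    ( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"
    exit 0
fi

tmpDir=$( mktemp -d )
trap 'rm -rf "$tmpDir"' EXIT

# current state and the added/deleted deltas since the last sync
filesForThisSync="$tmpDir/files_for_this_sync.txt"
( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesForThisSync"
newFiles="$tmpDir/files_added_since_last_sync.txt"
join -t "" -v 2 "$filesAfterLastSync" "$filesForThisSync" > "$newFiles"
deletedFiles="$tmpDir/files_removed_since_last_sync.txt"
join -t "" -v 1 "$filesAfterLastSync" "$filesForThisSync" > "$deletedFiles"

# 1) push new local files, 2) pull remote changes/deletions while keeping local deletions,
# 3) push local changes/deletions
rsync -aum --files-from="$newFiles" "$localDir/" "$remoteStorage"
rsync -aum --delete --exclude-from="$deletedFiles" "$remoteStorage/" "$localDir"
rsync -aum --delete "$localDir/" "$remoteStorage"

# remember the state for the next run
( cd "$localDir" && find . ) | sed "s/^\.//" | sort > "$filesAfterLastSync"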

Retrieving latest file in a directory from a remote server

I was hoping to crack this myself, but it seems I have fallen at the first hurdle because I can't make head nor tail of the other options I've read about.
I wish to access a database file hosted as follows (i.e. hhsuite_dbs is a folder containing several databases):
http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pdb70_08Oct15.tgz
Periodically, they update these databases, so I want to download the latest version. My plan is to run a bash script via cron, most likely monthly (though I've yet to even tackle the scheduling aspect of the task).
I believe the database is refreshed fortnightly, so if my script runs monthly I can expect there to be a new version. I'll then be running downstream programs that require the database.
My question is then: how do I go about retrieving this (and, for a little more finesse, I'd perhaps like to be able to check whether the remote file has changed in name or content, to avoid a large download if unnecessary)? Is the best approach to query the name of the file, or the file's last-modified date (given that they may change the naming syntax of the file too)? To my naive brain, some kind of globbing on pdb70 (something I think I can rely on to be in the filename) and then pulling it down with wget was all I had come up with so far.
EDIT: Another confounding issue that has just occurred to me is that the file I want won't necessarily be the newest in the folder (as there are other types of databases there too); rather, I need the newest version of, in this case, the pdb70 database.
Solutions I've looked at so far have mentioned weex, lftp and curlftpls, but all of these seem to require logins/passwords for the server, which I don't have/need if I just download it via the web. I've also seen mention of rsync, but on a cursory read it seems like people are steering clear of it for FTP uses.
Quite a few barriers in your way for this.
My first suggestion is that rather than getting the filename itself, you simply mirror the directory using wget, which should already be installed on your Ubuntu system, and let wget figure out what to download.
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
And new files will be created in the "safe" directory.
But that just gets you your mirror. What you're still after is the "newest" file.
Luckily, wget sets the datestamp of files it downloads, if it can. So after mirroring, you might be able to do something like:
newestfile=$(ls -t /some/place/safe/pdb70*gz | head -1)
Note that this fails if ever there are newlines in the filename.
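If you'd rather not parse ls at all, a hedged alternative is GNU find's -printf (fine on an Ubuntu system, though it still assumes no newlines in the filenames):
newestfile=$(find /some/place/safe/ -maxdepth 1 -name 'pdb70*gz' -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2-)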
Another possibility might be to check the difference between the current file list and the last one. Something like this:
#!/bin/bash
base="http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/"
cd /some/place/safe/
wget --mirror -nd "$base"
rm index.html* *.gif    # remove debris from mirroring an index
ls > /tmp/filelist.txt.$$
if [ -f /tmp/filelist.txt ]; then
    echo "Difference since last check:"
    diff /tmp/filelist.txt /tmp/filelist.txt.$$
fi
mv /tmp/filelist.txt.$$ /tmp/filelist.txt
You can parse the diff output (man diff for more options) to determine what file has been added.
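For example, a minimal way to pull just the added filenames out of that diff (a sketch; in diff's default output format, lines present only in the newer list are prefixed with "> "):
diff /tmp/filelist.txt /tmp/filelist.txt.$$ | sed -n 's/^> //p'    # run before the final mv, while both lists exist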
Of course, with a solution like this, you could run your script every day and hopefully download a new update within a day of it being ready, rather than a fortnight later. A nice thing about --mirror is that it won't download files that are already on hand.
Oh, and I haven't tested what I've written here. That's one monstrously large file.

`pg_tblspc` missing after installation of latest version of OS X (Yosemite or El Capitan)

I use postgres from Homebrew on OS X, but when I reboot my system, sometimes postgres doesn't start after the reboot, so I manually tried to start it with postgres -D /usr/local/var/postgres, but then it failed with the following message: FATAL: could not open directory "pg_tblspc": No such file or directory.
The last time it occurred, I couldn't get it back to its original state, so I decided to uninstall the whole postgres system, re-install it, and recreate users, tables, datasets, etc. It was really frustrating, and it happens frequently on my system, say once every few months.
So why does it keep losing the pg_tblspc directory? And is there anything I can do to avoid losing it?
I haven't upgraded my Homebrew or postgres to the latest version (i.e. I've been using the same version). Also, all I do on the postgres database is delete the table and populate new data every day. I haven't changed the user, password, etc.
EDIT (mbannert):
I felt the need to add this, since this thread is the top hit on Google for the issue and, for many, the symptom is different. Homebrew users will likely encounter this error message:
No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
So, if you just experienced this after the Yosemite upgrade, you're now covered by reading this thread.
Solved... in part.
Apparently, installing the latest versions of OS X (e.g. Yosemite or El Capitan) removes some directories in /usr/local/var/postgres.
To fix this you simply recreate the missing directories:
mkdir -p /usr/local/var/postgres/pg_commit_ts
mkdir -p /usr/local/var/postgres/pg_dynshmem
mkdir -p /usr/local/var/postgres/pg_logical/mappings
mkdir -p /usr/local/var/postgres/pg_logical/snapshots
mkdir -p /usr/local/var/postgres/pg_replslot
mkdir -p /usr/local/var/postgres/pg_serial
mkdir -p /usr/local/var/postgres/pg_snapshots
mkdir -p /usr/local/var/postgres/pg_stat
mkdir -p /usr/local/var/postgres/pg_stat_tmp
mkdir -p /usr/local/var/postgres/pg_tblspc
mkdir -p /usr/local/var/postgres/pg_twophase
Or, more concisely (thanks to Nate):
mkdir -p /usr/local/var/postgres/{{pg_commit_ts,pg_dynshmem,pg_replslot,pg_serial,pg_snapshots,pg_stat,pg_stat_tmp,pg_tblspc,pg_twophase},pg_logical/{mappings,snapshots}}
Rerunning pg_ctl start -D /usr/local/var/postgres now starts the server normally and, at least for me, without any data loss.
UPDATE
On my system, some of those directories are empty even when Postgres is running. Maybe, as part of some "cleaning" operation, Yosemite removes any empty directories? In any case, I went ahead and created a '.keep' file in each directory to prevent future deletion.
touch /usr/local/var/postgres/{{pg_commit_ts,pg_dynshmem,pg_replslot,pg_serial,pg_snapshots,pg_stat,pg_stat_tmp,pg_tblspc,pg_twophase},pg_logical/{mappings,snapshots}}/.keep
Note: Creating the .keep file in those directories will create some noise in your logfile, but doesn't appear to negatively affect anything else.
Donovan's answer is spot on. I just wanted to add that as I did different things with the database (e.g. rake db:test), it went looking for other directories that haven't been mentioned above and would choke when they weren't present (in my case pg_logical/mappings), so you may want to set up a terminal running:
tail -f /usr/local/var/postgres/server.log
and watch it for missing folders while you go through your typical database activities.
This is slightly off-topic but worth noting here as part of the PostgreSQL Yosemite recovery process. I had the same issue as above AND an issue with PostgreSQL "seemingly" running in the background, so even after adding the directories I couldn't restart. I tried using pg_ctl stop -m fast to kill the PostgreSQL server, but no luck. I also tried going after the process directly with kill PID, but as soon as I did that, a PostgreSQL process reappeared with a different PID.
The key ended up being a .plist file that Homebrew had loaded... The fix for me ended up being:
launchctl unload /Users/me/Library/LaunchAgents/homebrew.mxcl.postgresql92.plist
After that I was able to start PostgreSQL normally.
The missing directories need to be present in your PostgreSQL data directory. The default data directory is /usr/local/var/postgres/. If you have set up a different data directory, you need to re-create the missing directories there. If you modified the homebrew-recommended .plist file that starts PostgreSQL, you can find the data directory there:
cat ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist
(it's the -D option you started postgres with:)
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/postgres</string>
<string>-D</string>
<string>/usr/local/pgsql/data</string>
In the example above, you'd create the missing directories in /usr/local/pgsql/data, like so:
cd /usr/local/pgsql/data
mkdir {pg_tblspc,pg_twophase,pg_stat,pg_stat_tmp,pg_replslot,pg_snapshots,pg_logical}
mkdir pg_logical/{snapshots,mappings}
I was having this issue with a dockerized Rails application.
Instead of pg_tblspc and other directories missing from /usr/local/var/postgres, they were missing from myRailsApp/tmp/db.
You will want to use a similar version of Donovan's solution; you will just have to alter it to use the correct path for your Rails app...
mkdir /myRailsApp/tmp/db/{pg_tblspc,pg_twophase,pg_stat,pg_stat_tmp,pg_replslot,pg_snapshots}/
Also, you will want to add a .keep file to make sure git doesn't disregard the empty directories.
touch /myRailsApp/tmp/db/{pg_tblspc,pg_twophase,pg_stat,pg_stat_tmp,pg_replslot,pg_snapshots}/.keep
I noticed an error with the .keep in one of the directories, so just read the command-line output carefully and adjust as needed.
Creating the missing directories certainly works, but I fixed it by reinitializing the postgres db; this is a cleaner approach that avoids future problems.
NOTE: This approach will delete existing databases
$ rm -r /usr/local/var/postgres
$ initdb -D /usr/local/var/postgres

How to keep two folders automatically synchronized?

I would like to have a synchronized copy of one folder with all its subtree.
It should work automatically in this way: whenever I create, modify, or delete stuff in the original folder, those changes should be automatically applied to the sync folder.
Which is the best approach to this task?
BTW: I'm on Ubuntu 12.04
The final goal is to have a separate real-time backup copy, without using symlinks or mounts.
I used Ubuntu One to synchronize data between my computers, and after a while something went wrong and all my data was lost during a synchronization.
So I thought to add a further step to keep a backup copy of my data:
I keep my data stored in "folder A"
I need the answer to my current question to create a one-way sync from "folder A" to "folder B" (a cron'd script with rsync? could be?). It needs to be one-way only, from A to B; any changes to B must not be applied to A.
Then I simply keep "folder B" synchronized with Ubuntu One
In this manner any change in A will be applied to B, which will be detected by U1 and synchronized to the cloud. If anything goes wrong and U1 deletes my data on B, I still have it on A.
Inspired by lanzz's comments, another idea could be to run rsync at startup to back up the contents of a folder under Ubuntu One, and start Ubuntu One only after rsync has completed.
What do you think about that?
How to know when rsync ends?
You can use inotifywait (with the modify,create,delete,move flags enabled) and rsync.
while inotifywait -r -e modify,create,delete,move /directory; do
    rsync -avz /directory /target
done
If you don't have inotifywait on your system, run sudo apt-get install inotify-tools
You need something like this:
https://github.com/axkibe/lsyncd
It is a tool which combines rsync and inotify: the former is a tool that, with the correct options set, mirrors a directory down to the last bit; the latter tells the kernel to notify a program of changes to a directory or file.
It says:
It aggregates and combines events for a few seconds and then spawns one (or more) process(es) to synchronize the changes.
But - according to Digital Ocean at https://www.digitalocean.com/community/tutorials/how-to-mirror-local-and-remote-directories-on-a-vps-with-lsyncd - it ought to be in the Ubuntu repository!
I have similar requirements, and this tool, which I have yet to try, seems suitable for the task.
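If you want to give it a quick try on Ubuntu, a minimal sketch (the package name and the simple -rsync invocation come from lsyncd's own documentation; the paths are placeholders):
sudo apt-get install lsyncd
lsyncd -rsync /path/to/source /path/to/target    # watches the source and keeps the target mirrored via rsync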
Just a simple modification of #silgon's answer:
while true; do
    inotifywait -r -e modify,create,delete /directory
    rsync -avz /directory /target
done
(#silgon's version sometimes crashes on Ubuntu 16 if you run it in cron)
Using the cross-platform fswatch and rsync:
fswatch -o /src | xargs -n1 -I{} rsync -a /src /dest
You can take advantage of fschange. It's a Linux filesystem change notification mechanism. The source code is downloadable from the above link, and you can compile it yourself. fschange can be used to keep track of file changes by reading data from a proc file (/proc/fschange). When data is written to a file, fschange reports the exact interval that has been modified instead of just saying that the file has been changed.
If you are looking for the more advanced solution, I would suggest checking Resilio Connect.
It is cross-platform and provides extended options for use and monitoring. Since it's BitTorrent-based, it is faster than other existing sync tools. (Disclosure: this was written on their behalf.)
I use this free program to synchronize local files and directories: https://github.com/Fitus/Zaloha.sh. The repository contains a simple demo as well.
The good point: it is a bash shell script (one file only), not a black box like other programs, and the documentation is there as well. Also, with some technical talent, you can "bend" and "integrate" it to create the final solution you like.
