Consistent LevelDB backup - leveldb

What is a good way to take a consistent backup of a LevelDB folder? Currently I shut down the process that is using LevelDB, do a cp -r, and tar.gz that folder. I wonder if that is sufficient and whether there are better ways of doing it?
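For reference, a minimal sketch of that shutdown-and-copy approach, assuming the database lives in /var/lib/myapp/leveldb and the owning process runs as a systemd unit called myapp (both names are placeholders):

# stop the writer so no compaction or log write is in flight
sudo systemctl stop myapp
# archive the whole directory (CURRENT, MANIFEST-*, table and log files)
tar -czf leveldb-backup-$(date +%F).tar.gz -C /var/lib/myapp leveldb
sudo systemctl start myapp

As long as the process is fully stopped before the copy starts, the resulting archive should be internally consistent.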

Related

How can I load some specific files into the cache in advance, before actually using them?

I set up a NAS on Ubuntu servers and mount it with NFS and cachefilesd, so I want to use the NAS with a local cache.
The question I have is: before first access, if I want to load some files into the local cache in advance, what is the easiest way to do it?
I have tried executing cat on all the files, but I wonder if there is a better way to do this.
Thanks!
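For reference, the brute-force warm-up described above might look like the sketch below, assuming the NFS share is mounted at /mnt/nas (a placeholder path) with the fsc option, so that cachefilesd is actually in the read path:

# read every file once and discard the output; cachefilesd keeps a local copy
find /mnt/nas/dataset -type f -exec cat {} + > /dev/null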

Hadoop archive utility alternative for CEPH

I have some HAR files (Hadoop archive files) on my HDFS-based storage, holding archived data that is not frequently used.
Now we have a plan to move to CEPH-based storage, so I have 2 questions:
Can I somehow use my existing HAR files on CEPH?
Does CEPH have some archive utility like HDFS has Hadoop Archive utility?
Thanks
It's been a while since I have used Hadoop, but I can answer the following questions:
Can I somehow use my existing HAR files on CEPH?
Although I am sure there is no official support for HAR in Ceph, I think it's still possible, since the Ceph file system (CephFS) can be used as a drop-in replacement for the Hadoop Distributed File System (HDFS).
Does CEPH have some archive utility like HDFS has Hadoop Archive utility?
Although I use Ceph on a daily basis, I have not come across any archive utility in Ceph similar to HAR. Since a HAR file is essentially just an archive, what I have been doing instead is using compressed tarballs. For block devices I store the tarballs as Ceph RBD (RADOS block device) volumes, and when working with objects I archive the tarballs as RGW objects.
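As an illustration only, that workflow might look like the sketch below; the pool, bucket and file names are placeholders, and it assumes the standard rbd CLI plus an S3 client (here s3cmd) pointed at an RGW endpoint.

# pack the cold data into a compressed tarball
tar -czf archive-2019.tar.gz /data/cold/2019
# block route: keep the tarball inside an RBD image
rbd import archive-2019.tar.gz rbd/archive-2019
# object route: push the tarball to an RGW bucket via its S3 API
s3cmd put archive-2019.tar.gz s3://cold-archive/

Retrieval is the reverse: rbd export for the block route, or an S3 download for the object route.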
In order to help you further I am sharing some useful threads to dig deeper:
optimise small files performance: store small files in "superchunks" [feature]
A practical approach to efficiently store 100 billions small objects in Ceph
Storing 20 billions of immutable objects in Ceph, 75% <16KB

Move/copy millions of images from Macos to external drive to ubuntu server

I have created a dataset of millions (>15M, so far) of images for a machine-learning project, taking up over 500GB of storage. I created them on my MacBook Pro but want to get them onto our DGX1 (GPU cluster) somehow. I thought it would be faster to copy to a fast external SSD (2x NVMe in RAID 0), then plug that drive directly into a local terminal and copy it to the network scratch disk. I'm not so sure anymore, as I've been cp-ing to the external drive for over 24 hours now.
I tried using the Finder GUI to copy at first (bad idea!). For a smaller dataset (2M images), I used 7zip to create a few archives. I'm now using the terminal in macOS to copy the files with cp.
I tried "cp /path/to/dataset /path/to/external-ssd"
Finder was definitely not the best approach, as it took forever at the "preparing to copy" stage.
Using 7zip to archive the dataset increased the "file" transfer speed, but it took over 4 days(!) to extract the files, and that was for a dataset an order of magnitude smaller.
Using the command-line cp started off quickly but seems to have slowed down. Activity Monitor says I'm getting 6-8K IOs on the disk. It's been 24 hours and it isn't quite halfway done.
Is there a better way to do this?
rsync is the preferred tool for this kind of workload; it can be used for both local and network copies.
Its main benefits are (excerpted from the man page):
a delta-transfer algorithm, which reduces the amount of data sent over the network
if it is interrupted for any reason, you can restart it easily at very little cost; it can even restart partway through a large file
options that control every aspect of its behavior and permit very flexible specification of the set of files to be copied
Rsync is widely used for backups and mirroring and as an improved copy command for everyday use.
Regarding command usage and syntax, for local transfers it is almost the same as cp:
rsync -az /path/to/dataset /path/to/external-ssd
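As a variant for a huge local image tree (paths are placeholders; --info=progress2 needs rsync 3.1 or newer, and dropping -z is usually sensible for a purely local copy, since compression only costs CPU there):

rsync -a --partial --info=progress2 /path/to/dataset /path/to/external-ssd

If the transfer is interrupted, re-running the same command resumes from roughly where it stopped instead of starting over.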

recover deleted data from hdfs

We have a Hadoop cluster running v1.2.1. We deleted one of our HDFS folders by mistake, but we immediately shut down the cluster. Is there any way to get our data back?
Even if we can only get back part of our data, it would be better than nothing! Since the data size was so large, most probably only a little of it has actually been removed.
Thanks for your help.
This could be an easy fix if you have set fs.trash.interval to a value greater than 0. If so, HDFS's trash feature is enabled and your files should still be in the trash directory, which by default is located at /user/X/.Trash.
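If trash was enabled, recovery might look roughly like this (X stands for the user who issued the delete, the paths are placeholders, and on a 1.x cluster the client command is hadoop fs):

# see what is sitting in the trash directory
hadoop fs -ls /user/X/.Trash/Current
# move the folder back to its original location
hadoop fs -mv /user/X/.Trash/Current/path/to/deleted-folder /path/to/deleted-folder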
Otherwise, your best option is probably to find and use a data recovery tool. Some quick Googling turned up this cross-platform tool, available under GNU licensing, that runs from the terminal: http://www.cgsecurity.org/wiki/PhotoRec. It works on many different types of file systems, and it is possible it will work for HDFS.

Windows Batch Filesystem Backup

Update:
Ehh -- Even though this question isn't "answered", I've just emptied my pockets and purchased an SSD. My ramdisk software was going to cost just about as much anyway. I'm not particularly interested in an answer here anymore, so I'll just mark this as "answered" and go on with my life.
Thanks for the help.
I've got a program which writes files to a ramdisk (in Windows XP), and I need to copy its data from the ramdisk to a directory on my hard drive once it has finished execution. Obviously space on a ramdisk is limited, and I need to free up as much of it as I can between runs. The simple solution is to copy the data folder that my program generates on the ramdisk to a location on the hard disk and recursively delete the "data" folder from the ramdisk.
There is a problem with that solution, however: my program looks at the filesystem and filenames to ensure that it doesn't overwrite files (if the most recent data file in the directory is 006.dat, it will write 007.dat instead of overwriting anything). I can't just delete the files once I'm done writing data, because the program needs that file structure intact so that it keeps numbering new files correctly and nothing gets overwritten when I copy the data back to my hard drive.
I'd like a simple little Windows batch script which I can execute after my program has finished writing data files to the ramdisk. This batch script should copy the ramdisk "data" folder to my hard disk and delete all the files from the ramdisk, then re-create the file structure as it was, but with all zero-byte files.
How would I go about this?
Could you simply have it delete all files EXCEPT the most recent, then you would still have 006 and your logger would generate 007?
That seems safer than creating a zero-length file, because you would have to make sure it wasn't copied over the real 006 on the backup.
Edit: Sorry, I can't help with how to do this solely in batch, but there are a bunch of Unix utilities, specifically find and touch, that are perfect for this. There are numerous Windows ports of these; search on SO for options.
Robocopy.exe (a free download in the Windows Server Resource Kit) can copy from one directory to another AND has an option to watch a directory for new files and copy them when they are closed or changed.
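For completeness, a rough batch sketch of the copy-then-truncate idea asked about above; R:\data and D:\backup\data are placeholder paths, and this is only a sketch to test on throwaway data first.

@echo off
rem copy everything from the ramdisk folder to the hard drive (/E includes subfolders)
robocopy "R:\data" "D:\backup\data" /E
rem truncate every file on the ramdisk to zero bytes, keeping the names,
rem so the space is freed and the logger still sees 006.dat and writes 007.dat next time
for /r "R:\data" %%F in (*) do type nul > "%%F"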
