I have some HAR files (Hadoop archive files) on my HDFS-based storage; they contain archived data that is not frequently used.
We now plan to move to Ceph-based storage, so I have two questions:
Can I somehow use my existing HAR files on Ceph?
Does Ceph have an archive utility like HDFS's Hadoop Archive utility?
Thanks
It's been a while since I last used Hadoop, but I can answer the following questions:
Can I somehow use my existing HAR files on Ceph?
Although I am fairly sure there is no official HAR support in Ceph, I think it is still possible, since the Ceph file system (CephFS) can be used as a drop-in replacement for the Hadoop Distributed File System (HDFS).
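Something along these lines should work (a rough sketch I have not tested; it assumes the cephfs-hadoop bindings are installed, and the host names, ports, and paths are placeholders):

# Copy the existing archives from the old cluster to CephFS
hadoop distcp hdfs://namenode:8020/archive/data.har ceph://ceph-mon:6789/archive/data.har

# With fs.defaultFS pointing at the Ceph file system, read the archive through
# the har:// scheme just as you would on HDFS
hadoop fs -ls har:///archive/data.har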
Does Ceph have an archive utility like HDFS's Hadoop Archive utility?
Even though I use Ceph on a daily basis, I have not come across any archive utility in Ceph similar to HAR. Since a HAR is conceptually just an archive of many small files, what I have been doing instead is using compressed tarballs. For block-device workloads I store the tarballs on Ceph RBD (RADOS Block Device) volumes, and when I am working with objects, I archive the tarballs as RGW objects.
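If it helps, this is roughly what that looks like (a sketch; the pool, image, and path names are made up):

# Pack the cold data into a compressed tarball
tar czf archive-2016.tar.gz /data/cold/2016

# Block-device route: import the tarball into an RBD image
rbd import archive-2016.tar.gz rbd/archive-2016

# Object route: put the tarball into a RADOS pool directly
# (when going through RGW I use its S3 API with s3cmd/awscli instead)
rados -p archive put archive-2016.tar.gz archive-2016.tar.gz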
To help you dig deeper, here are some useful threads:
optimise small files performance: store small files in "superchunks" [feature]
A practical approach to efficiently store 100 billions small objects in Ceph
Storing 20 billions of immutable objects in Ceph, 75% <16KB
I have created a dataset of millions (>15M, so far) of images for a machine-learning project, taking up over 500GB of storage. I created them on my MacBook Pro but want to get them to our DGX-1 (GPU cluster) somehow. I thought it would be faster to copy them to a fast external SSD (2x NVMe in RAID 0), then plug that drive directly into a local terminal and copy the data to the network scratch disk. I'm not so sure anymore, as I've been cp-ing to the external drive for over 24 hours now.
I tried using the Finder GUI to copy at first (bad idea!). For a smaller dataset (2M images), I used 7zip to create a few archives. I'm now using the terminal in macOS to copy the files with cp.
I tried "cp /path/to/dataset /path/to/external-ssd"
Finder was definitely not the best approach, as it took forever at the "preparing to copy" stage.
Using 7zip to archive the dataset increased the "file" transfer speed, but it took over 4 days(!) to extract the files, and that was for a dataset an order of magnitude smaller.
Using the command-line cp started off quickly but seems to have slowed down. Activity Monitor says I'm getting 6-8k IOs on the disk. It's been 24 hours and it isn't quite halfway done.
Is there a better way to do this?
rsync is the preferred tool for this kind of workload. It can be used for both local and network copies.
Its main benefits are (excerpted from the man page):
delta-transfer algorithm, which reduces the amount of data sent
if it is interrupted for any reason, then you can restart it easily with very little cost. It can even restart part way through a large file
options that control every aspect of its behavior and permit very flexible specification of the set of files to be copied.
Rsync is widely used for backups and mirroring and as an improved copy command for everyday use.
Regarding command usage and syntax, a local transfer is almost the same as with cp:
rsync -az /path/to/dataset /path/to/external-ssd
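If the copy is interrupted, rerunning the same command picks up where it left off. For lots of already-compressed images you may want to drop -z (there is little left to compress) and add --partial --progress (or the shorthand -P) to keep partially transferred files and show progress, e.g.:

rsync -a --partial --progress /path/to/dataset /path/to/external-ssd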
We have a Hadoop cluster, v1.2.1. We deleted one of our HDFS folders by mistake, but we immediately shut down the cluster. Is there any way to get our data back?
Even if we can recover only part of our data, it would be better than none! As the data size was so large, most probably only a little of it has actually been removed.
Thanks for your help.
This could be an easy fix if you have set fs.trash.interval > 0. If so, HDFS's trash option is enabled and your files should be located in the trash directory, which by default is /user/X/.Trash.
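For example (X is the user who issued the delete; the paths are placeholders):

# See whether the deleted folder is still in the trash
hadoop fs -ls /user/X/.Trash/Current

# If it is, move it back to its original location
hadoop fs -mv /user/X/.Trash/Current/path/to/folder /path/to/folder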
Otherwise, your best option is probably to find and use a data recovery tool. Some quick Googling turned up this cross-platform tool, available under a GNU license, that runs from the terminal: http://www.cgsecurity.org/wiki/PhotoRec. It works on many different types of file systems, and it may be able to recover the deleted block files from the DataNodes' local disks.
During a routine log-pruning job in which logs older than 60 days were being removed, a system administrator upgraded CDH from 4.3 to 4.6 (I know, I know)...
Normally, the log-pruning job frees about 40% of HDFS's available storage. However, during the upgrade, DataNodes went down, were rebooted, and all sorts of madness ensued.
What's known is that HDFS received the delete commands, since the HDFS files/folders no longer exist, but disk utilization is still unchanged.
My question is, could HDFS have removed the files from the NameNode's metadata without actually fulfilling the file block deletes among the DataNodes, effectively orphaning the file blocks?
I think the NameNode tells the DataNodes to delete orphaned blocks once it receives their reports on the blocks they hold and notices that some of them don't belong to any file.
If you don't want these blocks to be deleted, you can put the system into safe mode and try to manually look through the disks and copy the data. There is no automated way of doing this, but a tool to list orphaned blocks may be added in the future (as suggested in this JIRA).
Additionally, you can check the health of the file system namespace using Hadoop's fsck.
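Something along these lines (a sketch; both commands should be run as the HDFS superuser):

# Keep the namespace read-only while you investigate
hdfs dfsadmin -safemode enter

# Report missing/corrupt blocks and, per file, the blocks and their locations
hdfs fsck / -files -blocks -locations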
The docs for NSDataReadingOptions state:
NSDataReadingUncached
A hint indicating the file should not be stored in the file-system caches. For data being read once and discarded, this option can improve performance.
That all makes sense. I'm curious if there's a way to know whether a file already resides in the file-system caches.
For example, if I need to perform a large amount of file reading, it might make sense to prioritise reading files which already reside in the cache.
I am afraid we can offer only suppositions here, as no official documentation is available on this.
The file system of iOS is supposed to be the Hierarchical File System (HFS), the same as that of OS X (see iOS filesystem HFS?).
HFS uses the Unified Buffer Cache (UBC).
The UBC caches file data in chunks rather than whole files; this is the first point: even if part of a file is cached, you cannot know whether the whole file is cached.
There are no APIs or kernel commands to inspect or control the contents of the UBC (so the answer to your question is no).
Some interesting links to read:
Testing the UBC
This guy tried to get some info about the contents of the UBC on a (jailbroken) iPhone
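Along the lines of the "Testing the UBC" link above, about the only thing you can do is infer cache residency indirectly by timing reads. This only works on OS X, where you have a shell (iOS gives you no such access), and purge may require sudo on newer releases:

# Flush the file-system cache, then compare a cold read with a warm read;
# a large difference suggests the file was not cached the first time
purge
time cat /path/to/bigfile > /dev/null
time cat /path/to/bigfile > /dev/null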
The only control over the file system cache mentioned in the documentation is the same flag you also found.
Your option for ensuring that a file can be accessed quickly is to map it into virtual memory, as described in the same Apple documentation.
I have written a backup tool for Windows that can back up files and images of volumes. To detect which files have changed, I use the Windows Change Journal. I already use the shadow-copy functionality to make consistent copies of both the files and the volume images.
To detect which blocks have changed, I currently use hashes. This means the whole volume has to be read once (because, to see which blocks have changed, hashes of all blocks have to be calculated).
The backup tool integrated into Windows 7 is able to create incremental volume images without checking all blocks. I wasn't able to find an API for any kind of block-level change journal.
Does anybody know how to access this information?
(I'm willing to dive deep into NTFS internals - even reading and parsing special files)
I don't think block-level change info is available anywhere. Most probably the Windows 7 integrated backup installs a file system filter driver, like some backup products and anti-virus software do. A filter driver can intercept all file system calls and thereby know which blocks changed. If you do this, you can basically build your own change journal that works at block level, but only for the files you are interested in.
I would really like to know a better answer myself here.
When you say Windows Change Journal, I take it you are referring to the NTFS USN journal? It looks very much like the Windows 7 backup uses a combination of VSC (Volume Shadow Copy) and the NTFS USN journal to detect changes and create incremental images, much like you are already doing.
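If it helps, you can peek at the USN journal from an elevated command prompt (still file-level rather than block-level information; the volume letter and path are just examples):

rem Show the journal's current state (first USN, next USN, maximum size, ...)
fsutil usn queryjournal C:

rem Dump the USN records recorded for a single file
fsutil usn readdata C:\path\to\file.txt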