How can I access MinIO files on the file system? - minio

On the underlying server filesystem, MinIO seems to store the content of an uploaded file (e.g. X) in a file called xl.meta in a directory bearing the original file name (e.g. X/xl.meta).
However, the file xl.meta is encoded. How can I access the original file content on the server file system itself (i.e. see the text inside a plain text file or being able to play a sound file with a respective application)?

It would not be possible since the object you are seeing on the backend fs is not the actual object, it is only the erasure coded part(s) that is split across all the disks in a given erasure set. So, you could do it if you were just using fs mode (single node, single disk) but in erasure coded environment you will need to have quorum to be able to download the object, and that via an S3 supported method and not directly from the backend. Technically not quorum, rather n/2 if you just want to read the object, but as a rule you should avoid doing anything in the backend fs.
If you happen to want to just see the contents of xl.meta, and not recover the file itself, you can use this something like mc support inspect myminio/test/syslog/xl.meta --export=json (or you can build a binary from https://github.com/minio/minio/tree/master/docs/debugging/xl-meta but using mc is probably easier).

Related

read/write to a disk without a file system

I would like to know if anybody has any experience writing data directly to disk without a file system - in a similar way that data would be written to a magnetic tape. In particular I would like to know if/how data is written in blocks, and whether a certain blocksize needs to be specified (like it does when writing to tape), and if there is a disk equivalent of a tape file mark, which separates the archives written to a tape.
We are creating a digital archive for over 1 PB of data, and we want redundancy built in to the system in as many levels as possible (by storing multiple copies using different storage media, and storage formats). Our current system works with tapes, and we have a mechanism for storing the block offset of each archive on each tape so we can restore it.
We'd like to extend the system to work with disk volumes without having to change much of the logic. Another advantage of not having a file system is that the solution would be portable across Operating Systems.
Note that the ability to browse the files on disk is not important in this application, since we are considering this for an archival copy of data which is not accessed independently. Also note that we would already have an index of the files stored in the application database, which we also write to the end of the tape/disk when it is almost full.
EDIT 27/05/2020: It seems that accessing the disk device as a raw/character device is what I'm looking for.

how does physical disk read work with volume shadow for ntfs?

my goal is to make a backup program reading a physical disk (with NTFS partitions) while using VSS for data consistency.
i use windows api's functions CreateFile with '\.\PhysicalDriveN'
as described here (basically, it allow me to access a disk as a big file)
https://support.microsoft.com/en-us/help/100027/info-direct-drive-access-under-win32
for tests i create volume shadows with this command
wmic shadowcopy call create Volume='C:\'
this is a temporary solution, i plan on using VSS via the program itself
My question is:
how are stored Volume shadows? does it stores data that have been modified since the volume shadow or does it store modification made since the last volume shadow?
in the first case:
when i read the disk, will i get consistent data (including ntfs metadata files)?
in the other case:
can i access a volume shadow the same way i would access a disk/partition? (in order to read hidden metadata files, etc)
-im am currenctly using windows 7 but planning on using it on differents version of windows server
-i've read a lot of microsoft doc about VSS but how it work seem really unclear for me (if you answer with one please explain a bit it meaning)
-i know that Volume shadows are stored in the folder "System Volume Information" as files with names like {3808876b-c176-4e48-b7ae-04046e6cc752}
"how are stored Volume shadows? does it stores data that have been modified since the volume shadow or does it store modification made since the last volume shadow?"
A hardware or software shadow copy provider uses one of the following methods for creating a shadow copy:(Answer by msdn doc)
Complete copy This method makes a complete copy (called a "full copy"
or "clone") of the original volume at a given point in time. This copy
is read-only.
Copy-on-write This method does not copy the original volume. Instead,
it makes a differential copy by copying all changes (completed write
I/O requests) that are made to the volume after a given point in time.
Redirect-on-write This method does not copy the original volume, and
it does not make any changes to the original volume after a given
point in time. Instead, it makes a differential copy by redirecting
all changes to a different volume.
"when i read the disk, will i get consistent data (including ntfs metadata files)?"
Even if an application does not have its files open in exclusive mode, it is possible—because of the finite time needed to open, back up, and close a file—that files copied to storage media may not all reflect the same application state.
"can i access a volume shadow the same way i would access a disk/partition? (in order to read hidden metadata files, etc)"
Requester Access to Shadow Copied Data
Paths on the shadow copied volume are obtained by replacing the root
of the original path with the device object. For example, given a path
on the original volume of "C:\DATABASE*.mdb" and a VSS_SNAPSHOT_PROP
instance of snapProp, you would obtain the path on the shadow copied
volume by concatenating snapProp.m_pwszSnapshotDeviceObject, "\",
and "\DATABASE*.mdb".
So i did more test and actually Shadow Volume are made at block level not file level. it mean that by using createfile with the path
\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy1 it would work in a similar way than using createfile with the path \\.\C:
So yeah you can access a shadow copy file system, it have it own boot sector, mft, etc.

fastest way to retrieve CIFS file metadata

Situation:
I am scanning a directory using NtQueryDirectoryFile(..., FileBothDirectoryInformation, ...). In addition to data returned by this call I need security data (typically returned by GetKernelObjectSecurity) and list of alternate streams (NtQueryInformationFile(..., FileStreamInformation)).
Problem:
To retrieve security and alternate stream info I need to open (and close) each file. In my tests it slows down the operation by factor of 3. Adding GetKernelObjectSecurity and NtQueryInformationFile slows it down by factor of 4 (making it 12x).
Question:
Is there a better/faster way to get this information (by either opening files faster or avoiding file open altogether)?
Ideas:
If target file system is local I could access it directly and (knowing NTFS/FAT/etc details extract info from raw data). But it isn't going to work for remote file systems.
Custom SMB client is the answer, it seems. Skipping Windows/NT API layer opens all doors.

Spark saveAsTextFile writes empty file - <directory>_$folder$ to S3

rdd.saveAsTextFile("s3n://bucket-name/path) is creating an empty file with folder name as - [folder-name]_$folder$
Seems like this empty file in used by hadoop-aws jar (of org.apache.hadoop) to mimick S3 filesystem as hadoop filesystem.
But, my application writes thousands of files to S3. As saveAsTextFile creates folder (from the given path) to write the data (from rdd) my application ends up creating thousands of these empty files - [directory-name]_$folder$.
Is there a way to make rdd.saveAsTextFile not to write these empty files?
Stop using s3n, switch to s3a. It's faster and actually supported. that will make this issue go away, along with the atrocious performance problems reading large Parquet/ORC files.
Also, if your app is creating thousands of small files in S3, you are creating future performance problems: listing and opening files on S3 is slow. Try to combine source data into larger columnar-formatted files & use whatever SELECT mechanism your framework has to only read the bits you want

How do I create a CFSTR_FILEDESCRIPTOR of unknown size?

I have an email client that allows the user to export a folder of email as a MBOX file. The UX is that they drag the folder from inside the application to an explorer folder (e.g. the Desktop) and a file copy commences via the application adding CFSTR_FILEDESCRIPTOR and CFSTR_FILECONTENTS to the data object being dropped. The issue arises when working out how to specify the size of the "file". Because internally I store the email in a database and it takes quite a while to fully encode the output MBOX, especially if the folder has many emails. Until that encoding is complete I don't have an exact size... just an estimate.
Currently I return an IStream pointer to windows, and over-specify the size in the file descriptor (estimate * 3 or something). Then when I hit the end of my data I return a IStream::Read length less then the input buffer size. Which causes Windows to give up on the copy. In Windows 7 it leaves the "partial" file there in the destination folder which is perfect, but in XP it fails the copy completely, leaving nothing in the destination folder. Other versions may exhibit different behaviour.
Is there a way of dropping a file of unknown size onto explorer that has to be generated by the source application?
Alternatively can I just get the destination folder path and do all the copy progress + output internally to my application? This would be great, I have all the code to do it already. Problem is I'm not the process accepting the drop.
Bonus round: This also needs to work on Linux/GTK and Mac/Carbon so any pointers there would be helpful too.
Windows Explorer use three methods to detect size of stream (in order of priority):
nFileSizeHigh/Low fields of FILEDESCRIPTOR structure if FD_FILESIZE flags is present.
Calling IStream.Seek(0, STREAM_SEEK_END, FileSize).
Calling IStream.Stat. cbSize field of STATSTG structure is used as MAX file size only.
To pass to Explorer a file with unknown size it is necessary:
Remove FD_FILESIZE flags from FILEDESCRIPTOR structure.
IStream.Seek must not be implemented (must return E_NOTIMPL).
IStream.Stat must set cbSize field to -1 (0xFFFFFFFFFFFFFFFF).
Is there a way of dropping a file of unknown size onto explorer that has to be generated by the source application?
When providing CFSTR_FILEDESCRIPTOR, you don't have to provide a file size at all if you don't know it ahead of time. The FD_FILESIZE flag in the FILEDESCRIPTOR::dwFlags field is optional. Provide an exact size only if you know it, otherwise don't provide a size at all, not even an estimate. The copy will still proceed, but the target won't know the final size until IStream::Read() returns S_FALSE to indicate the end of the stream has been reached.
Alternatively can I just get the destination folder path and do all the copy progress + output internally to my application?
A drop operation does not provide any information about the target at all. And for good reason - the source doesn't need to know. A drop target does not need to know where the source data is coming from, only how to access it. The drag source does not need to know how the drop target will use the data, only how to provide the data to it.
Think of what happens if you drop a file (virtual or otherwise) onto an Explorer folder that is implemented as a Shell Namespace Extension that saves the file to another database, or uploads it to a remote server. The filesystem is not involved, so you wouldn't be able to manually copy your data to the target even if you wanted to. Only the target knows how its data is stored.
That being said, the only way I know to get the path of a filesystem folder being dropped into is to drag&drop a dummy file, and then monitor the filesystem for where the drop target creates/copies the file to. Then you can replace the dummy file with your real file. But this is not very reliable, and not very friendly to the target.

Resources