Multi-device btrfs with single data mode and disk failure - data-recovery

I had a btrfs partition on a 6-disk array without raid (metadata in raid10, but data in single), and one of the disks just died.
So I lost some of my data; OK, I knew that.
But two questions:
Is it possible to know (using metadata I suppose) what data I have lost?
Is it possible to do some kind of "btrfs delete missing" on this kind of setup, in order to recover read-write access to my other data, or must I copy all my data to a new partition?
Edit: just to be clear, I can mount it read-only with mount -o recovery,ro,degraded.
And btrfs fi df /Data reports:
Data, single: total=6.65TiB, used=6.65TiB
System, RAID1: total=32.00MiB, used=768.00KiB
Metadata, RAID1: total=13.00GiB, used=10.99GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

I'm a very, very lucky guy, and I think I fixed my problem (thanks to the help of the btrfs mailing list).
In my situation "btrfs-debug-tree -t 3 /dev/sda6" does not mention the missing disk anywhere (data or metadata). So there was nothing at all on the missing device.
Thus, patching the kernel with this patch allowed me to mount the array read-write in degraded mode, and a simple btrfs device remove missing did the trick.
So my array is fixed and my data seems fine (scrub in progress).
One thing I learned, though, is that single mode should never ever be used.

Related

OpenZFS on Windows: less available space than capacity in single disk pool

Creating a new pool using the instructions from the readme, as follows:
zpool create -O casesensitivity=insensitive -O compression=lz4 -O atime=off -o ashift=12 tank PHYSICALDRIVE1
I get less available space showing up in File Explorer and zpool than the disk capacity itself: 1.76 TiB vs 1.81 TiB.
zpool list and zfs list -r poolname show the difference:
zpool list
NAME   SIZE   ALLOC  FREE   CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
tank   1,81T  360K   1,81T  -        -         0%    0%   1.00x  ONLINE  -
zfs list -r tank
NAME  USED  AVAIL  REFER  MOUNTPOINT
tank  300K  1,76T  96K    /tank
I'm not sure of the reason. Is there something that ZFS uses the space for?
Does it ever become available for use, or is it reserved, e.g. for root like on ext4?
Because it is copy on write, even deleting stuff requires using a tiny bit of extra storage in ZFS: until the old data has been marked as free (which requires writing newly-created metadata), you can’t start allocating the space it was using to store new data. A small amount of storage is reserved so that if you completely fill your pool it’s still possible to delete stuff to free up space. If it didn’t do this, you could get wedged, which wouldn’t be fixable unless you added more disks to your pool / made one of the disks larger if you are using virtualized storage.
There are also other small overheads (metadata storage, etc.) but I think most of the holdback you’re seeing is related to the above since it doesn’t look like you’ve written anything into the pool yet.
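As a rough sanity check (assuming the OpenZFS default spa_slop_shift=5, i.e. a slop reserve of 1/32 of the pool, and that nothing has been tuned): 1.81 TiB / 32 ≈ 0.057 TiB, which is in the same ballpark as the 1.81 TiB - 1.76 TiB = 0.05 TiB gap between the zpool list and zfs list figures above.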

What does "mds" use to iterate the mounted file systems?

I have been closely following the OpenZFS development scene for OS X for the last few years. Things have progressed significantly over the last several months since the sad problems that occurred with Greenbytes, etc., and I have been pleased to see that we're finally close to getting real Spotlight support in place. I noticed this email passing by the other day from Jorgen Lundman (who has put a great deal of personal time into getting this going and contributing to the community) and thought perhaps others here might be interested in chiming in on his topic regarding implementing Spotlight support for ZFS on OS X:
To summarize, I think the crux of this question boils down to this:
So then, what does "mds" use to iterate the mounted file systems? I do not
think the sources for "Spotlight-800.28" was ever released so we can't just
go look and learn, like we did for xnu, and IOkit.
It doesn't use the BSD getfsstat(), more likely it asks IOKit, and for some
reason rejects the lower mounts.
And the body of the email for convenience:
Hey guys,
So one of our long-term issues in OpenZFSonOSX is to play nice with Spotlight.
We have reached the point where everything sometimes pretends to work.
For example:
# mdfind helloworld4
/Volumes/hfs1/helloworld4.jpg
/Volumes/hfs2/helloworld4.jpg
/Volumes/zfs1/helloworld4.jpg
/Volumes/zfs2/helloworld4.jpg
Great, picks it up in our regular (control group) HFS mounted filesystems,
as well as the 2 ZFS mounts.
Mounted as:
/dev/disk2 on /Volumes/zfs1 (zfs, local, journaled)
/dev/disk2s1 on /Volumes/zfs2 (zfs, local, journaled)
# diskutil list
/dev/disk1
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *42.9 GB disk1
1: ZFS 42.9 GB disk1s1
2: 6A945A3B-1DD2-11B2-99A6-080020736631 8.4 MB disk1s9
/dev/disk2
#: TYPE NAME SIZE IDENTIFIER
0: zfs_pool_proxy FEST *64.5 MB disk2
1: zfs_filesystem_proxy ssss 64.5 MB disk2s1
So you can see, the actual pool disk is /dev/disk1, and the fake nodes we
create for mounting are /dev/disk2*, as that appears to be required for
Spotlight to work at all. We internally also let the volumes auto-mount,
by issuing "diskutil mount -mountPoint %s %s".
We are not a VOLFS, so there is no ".vol/" directory, nor will mdutil -t
work. But these two points are true for MS-DOS as well, and that does work
with Spotlight.
We correctly reply to zfs.fsbundle's zfs.util for "-p" (volume name) and
"-k" (get uuid), done pre-flight to mounting by DA.
Using FSMegaInfo tool, we can confirm that stat, statfs, readdir, and
similar tests appear to match that of HFS.
So then, the problem.
The problem comes from mounting zfs inside zfs, i.e. when we mount:
/Volumes/hfs1/
/Volumes/hfs1/hfs2/
/Volumes/zfs1/
/Volumes/zfs1/zfs2/
# mdfind helloworld4
/Volumes/hfs1/helloworld4.jpg
/Volumes/hfs1/hfs2/helloworld4.jpg
/Volumes/zfs1/helloworld4.jpg
Absent, of course, is "/Volumes/zfs1/zfs2/helloworld4.jpg".
Interestingly, this works:
# mdfind -onlyin /Volumes/zfs1/zfs2/ helloworld4
/Volumes/zfs1/zfs2/helloworld4.jpg
And additionally, mounting in reverse:
/Volumes/hfs2/
/Volumes/hfs2/hfs1/
/Volumes/zfs2/
/Volumes/zfs2/zfs1/
# mdfind helloworld4
/Volumes/hfs2/helloworld4.jpg
/Volumes/hfs2/hfs1/helloworld4.jpg
/Volumes/zfs2/helloworld4.jpg
So whichever ZFS filesystem was mounted first works, but not the second,
even though the two ZFS filesystems are otherwise equal. It is as if mds
doesn't realise the lower mount is its own device.
So then, what does "mds" use to iterate the mounted fileystems? I do not
think the sources for "Spotlight-800.28" was ever released so we can't just
go look and learn, like we did for xnu, and IOkit.
It doesn't use the BSD getfsstat(), more likely it asks IOKit, and for some
reason rejects the lower mounts.
Some observations:
# /System/Library/Filesystems/zfs.fs/zfs.util -k disk2
87F06909-B1F6-742F-7355-F0D597849138
# /System/Library/Filesystems/zfs.fs/zfs.util -k disk2s1
8F60C810-2D29-FCD5-2516-2D02EED4566B
# grep uu /Volumes/zfs1/.Spotlight-V100/VolumeConfiguration.plist
<key>uuid.87f06909-b1f6-742f-7355-f0d597849138</key>
# grep uu /Volumes/zfs1/zfs2/.Spotlight-V100/VolumeConfiguration.plist
<key>uuid.8f60c810-2d29-fcd5-2516-2d02eed4566b</key>
Any assistance is appreciated. The main issue tracking Spotlight is:
https://github.com/openzfsonosx/zfs/issues/116
The branch for it:
https://github.com/openzfsonosx/zfs/tree/issue116
vfs_getattr:
https://github.com/openzfsonosx/zfs/blob/issue116/module/zfs/zfs_vfsops.c#L2307
It appears to come down to some undocumented expectations in the vfs_vget method, which has to look up entries based entirely on the inode number, e.g. stat /.vol/16777222/1102011.
It is expected that vfs_vget sets the vnode name correctly here, using a call like vnode_update_identity() or similar.
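For comparison with whatever mds actually does, iterating the mounted file systems from the BSD layer is only a few lines of C with getfsstat(). This is just a sketch to show what the kernel itself reports, including the lower mounts that Spotlight appears to skip; it assumes nothing beyond the standard macOS/BSD headers:

/* lsmounts.c - list every mounted filesystem the kernel reports via getfsstat(). */
#include <stdio.h>
#include <stdlib.h>
#include <sys/param.h>
#include <sys/ucred.h>
#include <sys/mount.h>

int main(void)
{
    /* First call with a NULL buffer just returns the number of mounts. */
    int n = getfsstat(NULL, 0, MNT_NOWAIT);
    if (n < 0) { perror("getfsstat"); return 1; }

    struct statfs *fs = calloc(n, sizeof(*fs));
    n = getfsstat(fs, n * (int)sizeof(*fs), MNT_NOWAIT);
    if (n < 0) { perror("getfsstat"); return 1; }

    for (int i = 0; i < n; i++)
        printf("%-10s %-28s on %s\n",
               fs[i].f_fstypename, fs[i].f_mntfromname, fs[i].f_mntonname);

    free(fs);
    return 0;
}

With the nested zfs2 mounted, this should list /Volumes/zfs1/zfs2 like any other mount, which is what makes mds ignoring it puzzling.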

ext4 pointers from in-memory inode

I'm trying to retrieve, in a kernel module, the direct/indirect/etc. block addresses from an ext4 file system inode. I understand that I need to look into the ext4_inode_info struct (I do this via container_of using the relevant vfs_inode).
But which field am I supposed to look at?
Where can I find, for example, the first direct pointer? I thought it was stored in the i_data array (it is in ext3_inode_info).
But for an ext4 inode, when I examine the first entry in i_data, I get a sector address that is not remotely close to the real sector holding the first data block.
Any help will be appreciated.
==EDIT==
OK, so I seem to have understood the basic problem: I have an extent-based ext4 file system. I wasn't aware of this change, or that it is enabled by default. So is there a simple way to extract the physical addresses of blocks by offset? As verification I'm trying again to look at the first physical block (logical 0) by reading the first extent, but I get gibberish numbers (though consistent and unique for every inode/file, so some progress was made).
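If user-space verification is enough, one way to cross-check the kernel-module results is the FIEMAP ioctl (what filefrag -v uses), which maps a logical offset to a physical byte address regardless of whether the file uses extents or the old indirect blocks. A minimal sketch (the file path is whatever you want to test; error handling is trimmed):

/* fiemap_first.c - print where logical offset 0 of a file lives on disk. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Header plus room for a single extent record. */
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    fm->fm_start = 0;                      /* logical offset 0 */
    fm->fm_length = FIEMAP_MAX_OFFSET;     /* map as far as possible */
    fm->fm_flags = FIEMAP_FLAG_SYNC;       /* flush delayed allocation first */
    fm->fm_extent_count = 1;               /* we only care about the first extent */

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("FS_IOC_FIEMAP"); return 1; }

    if (fm->fm_mapped_extents == 0) {
        printf("no mapped extents (empty or fully sparse file)\n");
    } else {
        struct fiemap_extent *fe = &fm->fm_extents[0];
        /* fe_physical is a byte address; divide by the fs block size for a block number. */
        printf("logical %llu -> physical %llu, length %llu bytes\n",
               (unsigned long long)fe->fe_logical,
               (unsigned long long)fe->fe_physical,
               (unsigned long long)fe->fe_length);
    }
    free(fm);
    close(fd);
    return 0;
}

If the numbers from the kernel module still look like gibberish next to this, keep in mind that for extent-based inodes the i_data array holds a struct ext4_extent_header followed by struct ext4_extent entries rather than block pointers, and that ee_start_hi/ee_start_lo together give a filesystem-block number, not a sector number.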

How do they read clusters/cylinders/sectors from the disk?

I needed to recover the partition table I deleted accidentally. I used an application named TestDisk. It's simply mind-blowing: it reads each cylinder from the disk. I've seen similar applications which work with the MBR and partitioning.
I'm curious.
How do they read clusters/cylinders/sectors from the disk? Is there some kind of API for this?
Is it OS-dependent? If so, what's the way to do it for Linux and for Windows?
EDIT:
Well, I'm not just curious, I want hands-on experience. I want to write a simple application which displays each LBA.
Cylinders and sectors (wiki explanation) are largely obsoleted by the newer LBA (logical block addressing) scheme for addressing drives.
If you're curious about the history, use the Wikipedia article as a starting point. If you're just wondering how it works now, code is expected to simply use the LBA address (which works largely the same way as a file does: a linear array of bytes arranged in blocks).
It's easy due to the magic of *nix special device files. You can open and read /dev/sda the same way you'd read any other file.
Just use open, lseek, read, write (or pread, pwrite). If you want to make sure you're physically fetching data from a drive and not from kernel buffers you can open with the flag O_DIRECT (though you must perform aligned reads/writes of 512 byte chunks for this to work).
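A minimal sketch of that approach (the device path and LBA come from the command line, a 512-byte logical block size is assumed; needs root, and O_DIRECT requires the buffer, offset, and length to be aligned):

/* read_lba.c - read one logical block from a raw disk, bypassing the page cache. */
#define _GNU_SOURCE                      /* O_DIRECT on Linux */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s <device> <lba>\n", argv[0]); return 1; }

    const size_t sector = 512;           /* assumed logical block size */
    long long lba = atoll(argv[2]);

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, sector, sector) != 0) { fprintf(stderr, "alloc failed\n"); return 1; }

    /* pread at byte offset lba * 512; O_DIRECT means this really hits the disk. */
    ssize_t n = pread(fd, buf, sector, (off_t)lba * sector);
    if (n < 0) { perror("pread"); return 1; }

    /* Hex-dump the first 16 bytes as a quick sanity check. */
    for (int i = 0; i < 16; i++)
        printf("%02x ", ((unsigned char *)buf)[i]);
    printf("\n");

    free(buf);
    close(fd);
    return 0;
}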
For *nix, there are already answers (the /dev directory); for Windows, there are the special objects \\.\PhysicalDriveX, with X as the number of the drive, which can be opened using the normal CreateFile API. To actually perform reads or writes you then call ReadFile/WriteFile on that handle; DeviceIoControl is there for sending control codes to the device.
More info can be found in the "Physical Disks and Volumes" section of the CreateFile API documentation.
I'm the OP. I'm combining Eric Seppanen's & Matteo Italia's answers to make it complete.
*NIX Platforms:
It's easy due to the magic of *nix special device files. You can open and read /dev/sda the same way you'd read any other file.
Just use open, lseek, read, write (or pread, pwrite). If you want to make sure you're physically fetching data from a drive and not from kernel buffers you can open with the flag O_DIRECT (though you must perform aligned reads/writes of 512 byte chunks for this to work).
Windows Platform:
For Windows, there are the special objects \\.\PhysicalDriveX, with X as the number of the drive, which can be opened using the normal CreateFile API. To perform reads or writes simply call ReadFile and WriteFile (the buffer must be aligned on the sector size).
More info can be found in the "Physical Disks and Volumes" section of the CreateFile API documentation.
Alternatively you can also use the DeviceIoControl function, which sends a control code directly to a specified device driver, causing the corresponding device to perform the corresponding operation.
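A hedged sketch of the Windows variant (the drive number and the 512-byte sector size are assumptions; run it elevated, and keep offsets and sizes sector-aligned):

/* Read the first sector (the MBR) of PhysicalDrive0 with plain Win32 calls. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileA("\\\\.\\PhysicalDrive0",
                           GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    /* Seek to an absolute byte offset (must be sector-aligned), here LBA 0. */
    LARGE_INTEGER off;
    off.QuadPart = 0;
    if (!SetFilePointerEx(h, off, NULL, FILE_BEGIN)) {
        fprintf(stderr, "SetFilePointerEx failed: %lu\n", GetLastError());
        return 1;
    }

    BYTE buf[512];            /* assumes a 512-byte sector size */
    DWORD got = 0;
    if (!ReadFile(h, buf, sizeof(buf), &got, NULL)) {
        fprintf(stderr, "ReadFile failed: %lu\n", GetLastError());
        return 1;
    }

    /* A valid MBR ends with the 0x55 0xAA signature. */
    printf("read %lu bytes, signature: %02x %02x\n", got, buf[510], buf[511]);

    CloseHandle(h);
    return 0;
}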
On Linux, as root, you can save your MBR like this (assuming your drive is /dev/sda):
dd if=/dev/sda of=mbr bs=512 count=1
If you wanted to read 1 MiB from your drive, starting at the 10th MiB:
dd if=/dev/sda of=1Mb bs=1M count=1 skip=10

Deleting shared memory with ipcrm in Linux

I am working with a shared memory application, and to delete the segments I use the following command:
ipcrm -M 0x0000162e (this is the key)
But I do not know if I'm doing the right thing, because when I run ipcs I see the same segment but with the key 0x00000000. So is the memory segment really deleted? When I run my application several times I see different memory segments with the key 0x00000000, like this:
key         shmid   owner  perms  bytes  nattch  status
0x00000000  65538   me     666    27     2       dest
0x00000000  98307   me     666    5      2       dest
0x00000000  131076  me     666    5      1       dest
0x00000000  163845  me     666    5      0
What is actually happening? Is the memory segment really deleted?
Edit: The problem was, as said below in the accepted answer, that there were two processes using the shared memory; until all of those processes were closed, the memory segment was not going to disappear.
I vaguely remember from my UNIX (AIX and HPUX, I'll admit I've never used shared memory in Linux) days that deletion simply marks the block as no longer attachable by new clients.
It will only be physically deleted at some point after there are no more processes attached to it.
This is the same as with regular files that are deleted: their directory information is removed, but the contents of the file only disappear after the last process closes it. This sometimes leads to log files that take up more and more space on the file system even after they're deleted, because processes are still writing to them; a consequence of the "detachment" between a file's name (the zero or more directory entries pointing to an inode) and the file's content (the inode itself).
You can see from your ipcs output that 3 of the 4 still have attached processes so they won't be going anywhere until those processes detach from the shared memory blocks. The other's probably waiting for some 'sweep' function to clean it up but that would, of course, depend on the shared memory implementation.
A well-written client of shared memory (or log files for that matter) should periodically re-attach (or roll over) to ensure this situation is transient and doesn't affect the operation of the software.
You said that you used the following command
ipcrm -M 0x0000162e (this is the key)
From the man page for ipcrm
-M shmkey
Mark the shared memory segment associated with key shmkey for
removal. This marked segment will be destroyed after the
last detach.
So the -M option does exactly what you observed, i.e. it marks the segment to be destroyed only after the last detach.
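A small self-contained demo of exactly that behaviour (the key and size are arbitrary examples): it creates a segment, attaches it, marks it for removal, and only the final shmdt actually destroys it. Run ipcs -m from another shell while it waits and you should see the key 0x00000000 / "dest" rows from the question.

/* shm_rmid_demo.c - IPC_RMID only marks a segment; it dies on the last detach. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    key_t key = 0x162e;                       /* arbitrary example key */
    int id = shmget(key, 4096, IPC_CREAT | 0666);
    if (id < 0) { perror("shmget"); return 1; }

    char *p = shmat(id, NULL, 0);             /* attach: nattch goes to 1 */
    if (p == (void *)-1) { perror("shmat"); return 1; }
    strcpy(p, "hello");

    /* Same effect as ipcrm: the segment is now marked "dest" and its key
     * shows up as 0x00000000 in ipcs, but it still exists while attached. */
    if (shmctl(id, IPC_RMID, NULL) < 0) { perror("shmctl"); return 1; }

    printf("still readable after IPC_RMID: %s\n", p);
    printf("press Enter to detach (run 'ipcs -m' from another shell now)...\n");
    getchar();

    shmdt(p);    /* last detach: the kernel destroys the segment here */
    return 0;
}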
