OpenZFS on Windows: less available space than capacity in single disk pool

Creating a new pool using the instructions from the readme, as follows:
zpool create -O casesensitivity=insensitive -O compression=lz4 -O atime=off -o ashift=12 tank PHYSICALDRIVE1
I get less available space showing up in File Explorer and in the zpool/zfs output than the disk capacity itself: 1.76 TiB vs. 1.81 TiB.
zpool list and zfs list -r poolname show the difference:
zpool list
NAME   SIZE   ALLOC  FREE   CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
tank   1,81T   360K  1,81T        -         -    0%   0%  1.00x  ONLINE  -

zfs list -r tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank   300K  1,76T    96K  /tank
I'm not sure of the reason. Is there something that ZFS uses the space for?
Does it ever become available for use, or is it reserved, e.g. for root like on ext4?

Because it is copy-on-write, even deleting stuff requires a tiny bit of extra storage in ZFS: until the old data has been marked as free (which requires writing newly created metadata), you can't start allocating the space it was using to store new data. A small amount of storage is therefore reserved so that if you completely fill your pool it's still possible to delete stuff to free up space. If it didn't do this, you could get wedged, which wouldn't be fixable unless you added more disks to your pool (or made one of the disks larger, if you are using virtualized storage).
There are also other small overheads (metadata storage, etc.), but I think most of the holdback you're seeing is related to the above, since it doesn't look like you've written anything into the pool yet.
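As a rough sanity check, this reserved "slop" space in OpenZFS defaults to 1/2^spa_slop_shift of the pool, i.e. 1/32 or about 3%: 1.81 TiB / 32 ≈ 0.057 TiB, which is in the same ballpark as the 0.05 TiB gap between your two listings. A minimal sketch of how to compare the numbers, assuming a build that exposes the spa_slop_shift tunable (the /sys path below is the Linux location; OpenZFS on Windows exposes its tunables differently):
zpool list -o name,size,free tank
zfs list -o name,used,available tank
cat /sys/module/zfs/parameters/spa_slop_shift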

Related

FileNet: obtaining available space on a Storage Area

A particular object store in my FileNet environment uses a NAS as a Storage Area (this is a typical configuration). However, I do not have access to that NAS (the team that maintains the storage is far away from me), and I want to know, at any particular moment, the available space. If the NAS is getting close to saturation, I want to know in time, so that I can request that free space be added to it.
If I inspect the Storage Area's properties from FEM, I obtain this:
As you can see, it shows zero free bytes, which is not true. It is, however, accurate as far as the file count goes.
I also accessed the "Storage Areas" section of http://server:port/P8CE/Health, but it only shows their status:
Is there a way to know the available space of a Storage Area, via FEM or APIs?
You cannot get the size or free space of the underlying storage device in FileNet, but you can do either of the following two things:
Set the "Maximum size" parameter of the Storage Area
Set the "Maximum size" parameter of the Storage Area to the allocated/maximum available space on the NAS.
Once that is done, you can check and calculate the available free space using the API.
To get the values, something along the lines of the following code snippet should do the trick:
// Iterate over all storage areas of the object store and print used vs. configured maximum size
StorageAreaSet storageAreaSet = filenetConnection.getObjectStore().get_StorageAreas();
Iterator<StorageArea> iter = storageAreaSet.iterator();
while (iter.hasNext()) {
    StorageArea sa = iter.next();
    // free space can then be estimated as get_MaximumSizeKBytes() - get_ContentElementKBytes()
    System.out.printf("Storage Area %s is %s uses %f KB of %f KB available\n",
            sa.get_DisplayName(), sa.get_ResourceStatus().toString(),
            sa.get_ContentElementKBytes(), sa.get_MaximumSizeKBytes());
}
Use monitoring software
What we usually do is monitor the free space of our storage devices with our monitoring solution, which sends an alarm if the available storage drops below a certain percentage.

Creating many Sockets in ZMQ - too many files error

I am trying to create sockets with the inproc:// transport class from the same context in C.
I can create 2036 sockets; when I try to create more, zmq_socket() returns NULL and zmq_errno reports 24, 'Too many open files'.
How can I create more than 2036 sockets? Especially as inproc forces me to use only one context.
There are several things I don't understand:
- the sockets are eventually used over inproc, so why do they take up file descriptors?
- increasing ZMQ_MAX_SOCKETS does not help; the system file limit appears to be the limiting factor
- I am unable to increase the file limit with ulimit on my Mac; no workaround helped.
(The code is actually in Cython and can be found here: https://github.com/DavoudTaghawiNejad/ABsinthCE)
Use zmq_ctx_set():
zmq_ctx_set (context, ZMQ_MAX_SOCKETS, 256);
You can change these using sysctl (tried on Yosemite and El Capitan), but the problem is knowing what to change. Here is a post on this topic: Increasing the maximum number of tcp/ip connections in linux
That's for Linux, and the Mac is based on BSD 4.x, but man pages for sysctl on BSD are available online.
Note: sysctl is a private interface on iOS.
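For completeness, a minimal sketch of the usual macOS incantation (assuming recent OS X releases; the exact keys, defaults, and whether the system lets you raise them vary by version):
sysctl kern.maxfiles kern.maxfilesperproc
sudo sysctl -w kern.maxfiles=65536
sudo sysctl -w kern.maxfilesperproc=65536
ulimit -n 65536
The shell's ulimit -n can only be raised up to the kern.maxfilesperproc cap, which is why plain ulimit on its own often appears not to work.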
The solution is multi-fold:
inproc:// does not force you to have a common Context() instance, but it is handy to have one, as the signalling / messaging involves no data transfers, just zero-copy pointer manipulation on in-RAM blocks of memory, which is extremely fast.
I started to assemble ZeroMQ-related facts about having some 70,000 ~ 200,000 file descriptors available for "sockets", as supported by O/S kernel settings, but your published aims are higher. Much higher.
Given that your git-published multi-agent ABCE Project paper refers to nanosecond shaving and an HPC-domain grade solution to handle (cit. / emphasis added:)
the whopping number of 1,073,545,225 agents, many more agents than fit into the memory of even the most sophisticated supercomputer, some small hundreds of thousands of file descriptors are not worth spending much time on.
Your Project faces multiple troubles at the same time.
Let's peel the problem layers off, step by step:
File Descriptors (FD) -- Linux O/S level -- System-wide Limits:
To see the actual as-is state: # cat /proc/sys/fs/file-max
To raise the system-wide limit, edit the /etc/sysctl.conf file:
# vi /etc/sysctl.conf
Append a config directive as follows:
fs.file-max = 100000
Save and close the file.
Users need to log out and log back in again for the changes to take effect, or just type the following command:
# sysctl -p
Verify your settings with command:
# cat /proc/sys/fs/file-max
( Max ) User-specific File Descriptors (FD) Limits:
Each user additionally has a pair of limits (soft limit, hard limit):
# su - ABsinthCE
$ ulimit -Hn
$ ulimit -Sn
However, you can give your ABsinthCE user (or any other) specific limits by editing the /etc/security/limits.conf file; enter:
# vi /etc/security/limits.conf
and set the respective soft and hard limits for the ABsinthCE user as needed:
ABsinthCE soft nofile 123456
ABsinthCE hard nofile 234567
All that is not for free: each file descriptor takes up some kernel memory, so at some point you may, and will, exhaust it. A few hundred thousand file descriptors are no trouble for server deployments where event-based (epoll on Linux) server architectures are used. But simply forget about trying to grow this anywhere near the said 1,073,545,225 level.
Today, one can have a private HPC machine (not a Cloud illusion) with ~ 50-500 TB of RAM.
But still, the multi-agent Project application architecture ought to be re-defined so as not to fail on extreme resource allocations (just because of a forgiving syntax simplicity).
Professional multi-agent simulators are, precisely because of their extreme scaling, very, VERY CONSERVATIVE about per-agent instance resource locking.
So the best results are to be expected (both performance-wise and latency-wise) when using direct memory-mapped operations. The ZeroMQ inproc:// transport class is fine and does not require a Context() instance to allocate an I/O thread (as there is no data pump at all when using just the inproc:// transport class), which is very efficient for a fast prototyping phase. The same approach will become risky when growing the scales much higher towards the levels expected in production.
Latency shaving and accelerated-time simulator throughput scaling are the next set of targets, both for raising the static scales of the multi-agent-based simulations and for increasing the simulator performance.
For serious nanosecond hunting, follow the insights on HPC from Bloomberg's excellent guru, John Lakos.
Either pre-allocate (a common best practice in the RTOS domain) and do not allocate at all at run time, or follow John's fabulous testing-supported insights presented at ACCU 2017.

Multi-device btrfs with single data mode and disk failure

I had a btrfs partition on a 6 disk array without raid (metadata in raid10, but data in single), and one of the disks just died.
So I lost some of my data; OK, I knew that risk.
But two questions:
Is it possible to know (using metadata I suppose) what data I have lost?
Is it possible to do some kind of "btrfs delete missing" on this kind of setup, in order to regain rw access to the rest of my data, or must I copy all my data to a new filesystem?
Edit: just to be clear, I can mount it read-only with mount -o recovery,ro,degraded
And btrfs fi df /Data
Data, single: total=6.65TiB, used=6.65TiB
System, RAID1: total=32.00MiB, used=768.00KiB
Metadata, RAID1: total=13.00GiB, used=10.99GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
I'm a very, very lucky guy, and I think I fixed my problem (thanks to the help of the btrfs mailing list).
In my situation, "btrfs-debug-tree -t 3 /dev/sda6" does not mention the missing disk anywhere (data or metadata), so there was nothing at all on the missing device.
Thus, patching the kernel with this patch allowed me to mount the array rw and degraded, and a simple btrfs device remove missing did the trick.
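For reference, the recovery sequence looked roughly like this (a sketch using the device and mount point from above; the degraded rw mount only worked with the patched kernel):
mount -o degraded /dev/sda6 /Data
btrfs device remove missing /Data
btrfs scrub start /Data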
So my array is fixed and my data seems fine (scrub in progress).
One thing I learned, though, is that single mode should never, ever be used.

ext4 commit= mount option and dirty_writeback_centisecs

I'm trying to understand the way bytes go from write() to the physical disk platter, in order to tune my picture server's performance.
The thing I don't understand is the difference between these two: the commit= mount option and dirty_writeback_centisecs. They seem to be about the same process of writing changes to the storage device, yet they are different.
It is not clear to me which one fires first on my bytes' way to the disk.
Yeah, I just ran into this investigating mount options for an SDCard Ubuntu install on an ARM Chromebook. Here's what I can tell you...
Here's how to see the dirty and writeback amounts:
user@chrubuntu:~$ cat /proc/meminfo | grep "Dirty" -A1
Dirty: 14232 kB
Writeback: 4608 kB
(edit: This dirty and writeback is rather high, I had a compile running when I ran this.)
So data to be written out is dirty. Dirty data can still be eliminated (if, say, a temporary file is created, used, and deleted before it goes to writeback, it'll never have to be written out). As dirty data is moved into writeback, the kernel tries to combine smaller requests that may be in dirty into single larger I/O requests; this is one reason why dirty_expire_centisecs is usually not set too low. Dirty data is usually put into writeback when a) enough data is cached to get up to vm.dirty_background_ratio, or b) the data gets to be vm.dirty_expire_centisecs centiseconds old (the default of 3000 is 30 seconds). Per vm.dirty_writeback_centisecs, a writeback daemon runs by default every 500 centiseconds (5 seconds) to actually flush out anything in writeback.
fsync will flush out an individual file (force it from dirty into writeback and wait until it's flushed out of writeback), and sync does that with everything. As far as I know, it does this ASAP, bypassing any attempt to balance disk reads and writes; it stalls the device doing 100% writes until the sync completes.
The commit=5 default ext4 mount option actually forces a sync every 5 seconds on that filesystem. This is intended to ensure that writes are not unduly delayed if there's heavy read activity (ideally losing a maximum of 5 seconds of data if power is cut or whatever). What I found with an Ubuntu install on an SD card (in a Chromebook) is that this actually just leads to massive filesystem stalls every 5 seconds if you're writing much to the card; ChromeOS uses commit=600 and I applied that on the Ubuntu side to good effect.
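To see and tune both knobs side by side, something like the following should work (a sketch; / is a placeholder mount point, substitute the filesystem your pictures live on):
sysctl vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
sudo mount -o remount,commit=600 /
A remount is enough to change commit= on a live system; to make it permanent, add commit=600 to that filesystem's options in /etc/fstab.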
The dirty_writeback_centisecs setting configures the Linux kernel daemons related to virtual memory (that's why the vm prefix), which are in charge of writing back from RAM to all the storage devices. So if you configure dirty_writeback_centisecs and you have 25 different storage devices mounted on your system, they all share the same writeback interval.
The commit option, on the other hand, is set per storage device (actually per filesystem) and is related to the sync process rather than to the virtual memory daemons.
So you can see it as:
dirty_writeback_centisecs
writing from RAM to all filesystems
commit
each filesystem flushes its own data from RAM on its own schedule
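A quick way to see that split on a running system (a sketch; the grep only finds ext4 mounts where a non-default commit= interval was set explicitly):
cat /proc/sys/vm/dirty_writeback_centisecs
grep -E 'ext4.*commit=' /proc/mounts
The first value is a single global knob; the second shows the commit interval per mounted filesystem.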

redis bgsave failed because fork Cannot allocate memory

All:
Here is my server memory info from 'free -m':
             total       used       free     shared    buffers     cached
Mem:         64433      49259      15174          0          3         31
-/+ buffers/cache:      49224      15209
Swap:         8197        184       8012
My redis-server has used 46 GB of memory; there is almost 15 GB of memory left free.
As I understand it, fork is copy-on-write, so it should not fail when there is 15 GB of free memory, which is enough to malloc the necessary kernel structures.
Besides, when redis-server used 42 GB of memory, bgsave was OK and fork was OK too.
Is there any vm parameter I can tune to make fork succeed?
More specifically, from the Redis FAQ:
Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Setting overcommit_memory to 1 tells Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.
Redis doesn't need as much memory as the OS thinks it does in order to write to disk, so the OS may pre-emptively fail the fork.
Modify /etc/sysctl.conf and add:
vm.overcommit_memory=1
Then reload the sysctl settings:
On FreeBSD:
sudo /etc/rc.d/sysctl reload
On Linux:
sudo sysctl -p /etc/sysctl.conf
From the proc(5) man page:
/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE set are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed". Under Linux 2.4 any non-zero value implies mode 1. In mode 2 (available since Linux 2.6), the total virtual address space on the system is limited to (SS + RAM*(r/100)), where SS is the size of the swap space, RAM is the size of the physical memory, and r is the contents of the file /proc/sys/vm/overcommit_ratio.
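To see how those numbers work out on a live box (a sketch; in mode 2 the kernel computes roughly CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100, ignoring huge pages, and fork/malloc fails once Committed_AS would exceed it):
cat /proc/sys/vm/overcommit_memory /proc/sys/vm/overcommit_ratio
grep -E 'CommitLimit|Committed_AS' /proc/meminfo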
Redis's fork-based snapshotting method can effectively double physical memory usage and easily cause an OOM in cases like yours. Relying on Linux virtual memory for snapshotting is problematic, because Linux has no visibility into the Redis data structures.
Recently a new Redis-compatible project, Dragonfly, has been released. Among other things, it solves the OOM problem entirely. (Disclosure: I am the author of this project.)
