Trying to create a new policy for multi-disk operations - ClickHouse

I am using ClickHouse with just one disk, which is specified in the config.xml file under <path>.
Now I want to extend this storage, so I updated the ClickHouse version to enable multi-disk support.
What I want to do now is use the two disks together: read from both of them, but write new data only to the second one.
I have many tables. I thought changing the storage policy of the tables would do the trick, but I can't change it.
For example, I have a table called default_event which has the default policy. After this query:
alter table default_event modify setting storage_policy='newStorage_only';
I got this error: Exception: New storage policy default shall contain volumes of old one
My storage XML is like this:
<?xml version="1.0" encoding="UTF-8"?>
<yandex>
    <storage_configuration>
        <disks>
            <!--
                default disk is special, it always exists even if not explicitly
                configured here, but you can't change its path here
                (you should use <path> on the top level config instead)
            -->
            <default>
                <!--
                    You can reserve some amount of free space on any disk
                    (including default) by adding the keep_free_space_bytes tag
                -->
                <keep_free_space_bytes>1024</keep_free_space_bytes>
            </default>
            <test_disk>
                <!--
                    disk path must end with a slash,
                    folder should be writable for the clickhouse user
                -->
                <path>/DATA/newStorage/</path>
            </test_disk>
            <test_disk_2>
                <!--
                    disk path must end with a slash,
                    folder should be writable for the clickhouse user
                -->
                <path>/DATA/secondStorage/</path>
            </test_disk_2>
            <test_disk_3>
                <!--
                    disk path must end with a slash,
                    folder should be writable for the clickhouse user
                -->
                <path>/DATA/thirdStorage/</path>
            </test_disk_3>
        </disks>
        <policies>
            <newStorage_only>
                <!-- name for the new storage policy -->
                <volumes>
                    <newStorage_volume>
                        <!-- name of the volume -->
                        <!--
                            we have only one disk in that volume and we reference
                            here the name of the disk as configured above in the
                            <disks> section
                        -->
                        <disk>test_disk</disk>
                    </newStorage_volume>
                </volumes>
            </newStorage_only>
        </policies>
    </storage_configuration>
</yandex>
I tried adding the default volume to the new policy, but I can't start ClickHouse with that config.

Your main problem is that you did not explicitly specify a storage policy before, so the table implicitly uses the default policy with the default disk. A new policy must contain all of the old policy's disks and volumes, with the same names.
I put together a configuration based on yours, removing everything unnecessary. Keep in mind that, in addition to the disks listed, you have the disk specified in <path>, with the name default. All disks are listed in the volumes section of the new policy. Writing to the new disks will happen thanks to move_factor: the value 0.5 means that once the free space on a volume drops below 50%, data starts moving to the next volume, and so on.
As soon as the remaining disks fill up evenly, you can lower this value.
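Once the configuration below is loaded, a quick way to sanity-check it (assuming a ClickHouse version with multi-disk support, which exposes these standard system tables) is:

-- disks, their paths and free space
SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total
FROM system.disks;

-- policies, their volumes, disks and move_factor
SELECT policy_name, volume_name, disks, move_factor
FROM system.storage_policies;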
PS: If you do not want to use the old disks in the new policy, you need to execute ALTER TABLE ... MOVE PARTITION/PART ... to transfer the partitions/parts to the new disks. Then the table will no longer be tied to the old disk, and it will not be necessary to specify it in the new storage policy. The disks, of course, must already be configured in the settings.
<yandex>
    <storage_configuration>
        <disks>
            <test_disk>
                <path>/DATA/newStorage/</path>
            </test_disk>
            <test_disk_2>
                <path>/DATA/secondStorage/</path>
            </test_disk_2>
            <test_disk_3>
                <path>/DATA/thirdStorage/</path>
            </test_disk_3>
        </disks>
        <policies>
            <!-- ... old policy ... -->
            <new_storage_only> <!-- policy name -->
                <volumes>
                    <default>
                        <disk>default</disk>
                    </default>
                    <new_volume>
                        <disk>test_disk</disk>
                        <disk>test_disk_2</disk>
                        <disk>test_disk_3</disk>
                    </new_volume>
                </volumes>
                <move_factor>0.5</move_factor>
            </new_storage_only>
        </policies>
    </storage_configuration>
</yandex>
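Regarding the PS above, here is a minimal sketch of switching the default_event table from the question to the new policy and then moving its existing parts off the default disk (the partition value is only a placeholder; list the real ones from system.parts):

-- switch the table to the new policy; it contains the old default volume, so the check passes
ALTER TABLE default_event MODIFY SETTING storage_policy = 'new_storage_only';

-- see which disk each active part currently sits on
SELECT partition, name, disk_name
FROM system.parts
WHERE table = 'default_event' AND active;

-- move a partition from the default disk to one of the new disks (repeat per partition)
ALTER TABLE default_event MOVE PARTITION '2020-01-01' TO DISK 'test_disk';

Note that MOVE PARTITION only accepts a target disk or volume that is part of the table's current storage policy, which is why the policy is switched first.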

Related

FileNet: obtaining available space on a Storage Area

A particular Object Store in my FileNet environment is using a NAS as a Storage Area (this is a typical configuration). However, I do not have access to that NAS (the team that maintains the storage is far away from me), and I want to know, at any given moment, the available space. If the NAS is getting close to saturation, I want to know in time, so that I can request more free space for it.
If I inspect the Storage Area's properties from FEM, I obtain this:
As you can see, it shows no free bytes, but that is not true. It is, however, accurate with regard to the file count.
I also accessed the "Storage Areas" section of http://server:port/P8CE/Health, but it just shows their status:
Is there a way to know the available space of a Storage Area, via FEM or APIs?
You cannot get the size or free space of the underlying storage device in FileNet, but you can do either of the following two things.
Set the "Maximum size" parameter of the Storage Area
Set the "Maximum size" parameter of the Storage Area to the allocated/maximum available space on the NAS.
After having done that, you can check and calculate the available free space using the API.
To get the values, something along the lines of the following code snippet should do the trick:
// Iterate over all Storage Areas of the Object Store and print how much of the
// configured maximum size each one uses.
StorageAreaSet storageAreaSet = filenetConnection.getObjectStore().get_StorageAreas();
Iterator<StorageArea> iter = storageAreaSet.iterator();
while (iter.hasNext()) {
    StorageArea sa = iter.next();
    System.out.printf("Storage Area %s is %s and uses %f KB of %f KB available\n",
            sa.get_DisplayName(), sa.get_ResourceStatus().toString(),
            sa.get_ContentElementKBytes(), sa.get_MaximumSizeKBytes());
}
Use monitoring software
What we usually do is monitor the free space of our storage devices using our monitoring solution, which sends an alarm if the available storage drops below a certain percentage.

What part of the RAM is used by the system file cache in Windows?

According to general notions about the page cache and this answer, the system file cache essentially uses all the RAM not used by any other process. This is, as far as I know, the case for the page cache in Linux.
Since the notion of "free RAM" is a bit blurry in Windows, my question is: what part of the RAM does the system file cache use? For example, is it the same as "Available RAM" in the Task Manager?
Yes, the RAM used by the file cache is essentially the RAM displayed as available in the Task Manager. But not exactly. I'll go into details and explain how to measure it more precisely.
The file cache is not a process listed in the list of processes in the Task Manager. However, since Vista, its memory is managed like a process. Thus I'll explain a bit of memory management for processes, the file cache being a special case.
In Windows, the RAM used by a process has essentially two states: "Active" and "Standby":
"Active" RAM is displayed in the Task Manager and resource monitor as "In Use". It is also the RAM displayed for each process in the Task Manager.
"Standby" RAM is visible in the Resource monitor globally and for each process with RAMMap.
"Standby" + "Free" RAM is what is called "Available" in the task manager. "Free" RAM tends to be near 0 in Windows but you can meaningfully consider Standby RAM is free as well.
Standby RAM is considered as "not used for a while by the process". It is the part of the RAM that will be used to give new memory to processes needing it. But it still belongs to the process and could be used directly if the owning process suddenly access it (which is considered as unlikely by the system).
Thus the file cache has "Active" RAM and "Standby" RAM. "Active" RAM is somehow the cache for data recently accessed. "Standby" RAM is the cache for data accessed a while ago. The "Active" RAM of the file cache is usually relatively small. The Standby RAM of the file cache is most often all the RAM of your computer: Total RAM - Active RAM of all processes. Indeed, other processes rarely have Standby RAM because it tends to go to the file cache if you do disk I/O quite a bit.
This is the info displayed by RAMMap for a busy server doing a lot of I/O and computation:
The file cache is the second row called "Mapped file". See that most of the 32 GB is either in the Active part of other processes, or in the Standby part of the file cache.
So finally, yes, the RAM used by the file cache is essentially the RAM displayed as available in the Task Manager. If you want to measure with more certainty, you can use RAMMap.
Your answer is not entirely true.
The file cache, also called the system cache, describes a range of virtual addresses. It has a physical working set that is tracked by MmSystemCacheWs, and that working set is a subset of all the mapped file physical pages on the system.
The system cache is a range of virtual addresses, hence PTEs, that point to mapped file pages. The mapped file pages are brought in by a process creating a mapping or brought in by the system cache manager in response to a file read.
Existing pages that are needed by the file cache in response to a read become part of the system working set. If a page in a mapped file is not present then it is paged in and it becomes part of the system working set. When a page is in more than one working set (i.e. system and a process or process and another process), it is considered to be in a shared working set on programs like VMMap.
The actual mapped file pages themselves are controlled by a section object, one per file, a data control area (for the file) and subsection objects for the file, and a segment object for the file with prototype PTEs for the file. These get created the first time a process creates a mapping object for the file, or the first time the system cache manager creates the mapping object (section object) for the file due to it needing to access the file in response to a file IO operation performed by a process.
When the system cache manager needs to read from the file, it maps 256KiB views of the file at a time, and keeps track of each view in a VACB object. A process maps a variable-sized view of a file, typically the size of the whole file, and keeps track of this view in the process VAD. The act of mapping the view is simply filling in PTEs to point to the physical pages that contain the file and are already resident, by looking at the prototype PTE for that range of the file and seeing what it contains. If the prototype PTE does not point to a physical page, the PTE is initialised to point to the prototype PTE instead of the page it points to and is left invalid; this fault is then resolved on demand, page by page, when the read from the view is actually performed.
The VACBs keep track of the 256KiB views of files that the cache manager has opened and the virtual address range of that view, which describes the range of 64 PTEs that service that range of virtual addresses. There is no virtual external fragmentation or page table external fragmentation as all views are the same size, and there is no physical external fragmentation, because all pages in the view are 4KiB. 256KiB is the size chosen because if it were smaller, there would be too many VACB objects (64 times as many, taking up space), and if it were larger, there would effectively be a lot of internal fragmentation from reads and hence large virtual address pollution, and also, the VACB uses the lower bits of the virtual address to store the number of I/O operations that are currently being performed on that range, so the VACB size would have to be increased by a few bits or it would be able to handle fewer concurrent I/O operations.
If the view were the whole size of the file, there would quickly be a lot of virtual address pollution, because it would be mapping in the whole of every file that is read, and file mappings are supposed to be for user processes which knowingly map a whole file view into its virtual address space, expecting the whole of the file to be accessed. There would also be a lot of virtual external fragmentation, because the views wouldn't be the same size.
As for executable images, they are mapped in separately, with prototype PTEs, physical pages, a control area, and segment and subsection objects separate from those of the data-file mapping of the file. The process maps the image in, but the kernel also maps the images of ntoskrnl.exe and hal.dll in large pages, and driver images are in the system PTE working set.

Disk persistent cache in ehcache 3.4 is using (leaking?) direct memory

I am running a web application that makes use of Ehcache 3.4.0. I have a cache configuration that defines a simple default of 1000 in-memory objects:
<cache-template name="default">
    <key-type>java.lang.Object</key-type>
    <value-type>java.lang.Object</value-type>
    <heap unit="entries">1000</heap>
</cache-template>
I then have some disk-based caches that use this default template, but override all values (generated programmatically, so that's why they even use the default template at all) like so:
<cache alias='runViewCache' uses-template='default'>
    <key-type>java.lang.String</key-type>
    <value-type>java.lang.String</value-type>
    <resources>
        <heap unit='entries'>1</heap>
        <disk unit='GB' persistent='true'>1</disk>
    </resources>
</cache>
As data is written into my disk-based cache, direct/off-heap memory is used by the JVM, and never freed. Even clearing the cache does not free the memory. The memory used is directly related (nearly byte-for-byte as far as I can tell) to the data written to the disk-based cache.
The authoritative tier for this cache is an instance of org.ehcache.impl.internal.store.disk.OffHeapDiskStore.
This appears to be a memory leak (memory is consumed and never freed) but I am by no means an expert at configuring ehcache. Can anyone suggest a configuration change that will cause my disk tier to NOT use off-heap memory? Or, is there something else that I am just completely misunderstanding that someone else can point out?
Thank you!
How do you measure "used"?
TL;DR: No, the disk tier does not waste RAM.
As of v3.0.0, Ehcache uses memory-mapped files for disk persistence:
Replacement of the port of Ehcache 2.x open source disk store by one that leverages the offheap library and memory mapped files.
This means Ehcache maps the files on disk into the process's address space. This consumes 0 bytes of your RAM (at least directly; as @louis-jacomet already stated, the OS can decide to cache parts of the files in RAM).
When you're running on Linux, you should compare the VIRT and RES values of your process. VIRT is the amount of virtual bytes used by the process; RES is the amount of real RAM (RESident) bytes used by the process. VIRT should increase while the disk store cache is populated, but RES should remain pretty stable.

Ehcache 3.2.0: No Store.Provider found to handle configured resource types [offheap, disk] exception

I have recently switched from an older implementation of Ehcache to version 3.2, so I have the following XML configuration file for a project:
<eh:config xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
           xmlns:eh='http://www.ehcache.org/v3'
           xsi:schemaLocation="http://www.ehcache.org/v3
                               http://www.ehcache.org/schema/ehcache-core-3.0.xsd">
    <eh:persistence directory="C:\foo\bar\Cache-Persistence"/>
    <eh:thread-pools>
        <eh:thread-pool alias="defaultDiskPool" min-size="1" max-size="3"/>
    </eh:thread-pools>
    <eh:disk-store thread-pool="defaultDiskPool"/>
    <eh:cache-template name="PROC_REQTemplate">
        <eh:key-type>java.lang.String</eh:key-type>
        <eh:value-type>java.lang.String</eh:value-type>
        <eh:expiry>
            <eh:ttl>640</eh:ttl>
        </eh:expiry>
        <eh:resources>
            <eh:offheap unit="MB">500</eh:offheap>
            <eh:disk unit="GB" persistent="true">3</eh:disk>
        </eh:resources>
        <eh:disk-store-settings thread-pool="defaultDiskPool"/>
    </eh:cache-template>
    <eh:cache alias="proc_req_cache" uses-template="PROC_REQTemplate"/>
</eh:config>
With the configuration shown above I get the following exception trace, which I have truncated to conserve a bit of space but which clearly shows the error:
java.lang.IllegalStateException: No Store.Provider found to handle configured resource types [offheap, disk] from {org.ehcache.impl.internal.store.heap.OnHeapStore$Provider, org.ehcache.impl.internal.store.tiering.TieredStore$Provider, org.ehcache.impl.internal.store.offheap.OffHeapStore$Provider, org.ehcache.impl.internal.store.disk.OffHeapDiskStore$Provider}
at org.ehcache.core.internal.store.StoreSupport.selectStoreProvider(StoreSupport.java:80) ~[?:?]
at org.ehcache.core.EhcacheManager.getStore(EhcacheManager.java:440) ~[?:?]
at org.ehcache.core.EhcacheManager.createNewEhcache(EhcacheManager.java:311) ~[?:?]
at org.ehcache.core.EhcacheManager.createCache(EhcacheManager.java:260) ~[?:?]
at org.ehcache.core.EhcacheManager.init(EhcacheManager.java:567) ~[?:?]
I thought that, according to the current 3.2 documentation, you can use any combination of data storage tiers, but apparently this is not the case, as the above error shows. So...
I can only make the configuration shown above work if I comment out the offheap resource and leave only the disk, but not both. Is this normal? What am I missing?
As per the 2.7.8 documentation (see here: ehcache-2.8-storage-options), BigMemory was mentioned as the offheap store; however, in ehcache-3.2.0.jar, if I am seeing correctly, there is some kind of internal map for that purpose. Could the error reported above be related to the fact that I am not including BigMemory in the project? My guess is no, but it would be nice if someone could clarify.
Any help would be greatly appreciated. Thanks in advance.
In short, there is currently no support for having a disk tier with just an offheap tier. The current Ehcache 3.x support for tiering mandates a heap tier the moment you want to have multiple tiers.
Supported combinations as of today (Ehcache 3.1.x and above):
heap or offheap or disk or clustered (single tier)
heap + offheap
heap + disk
heap + offheap + disk
heap + clustered
heap + offheap + clustered
The error has nothing to do with BigMemory which was the commercial offering on top of Ehcache 2.x.
The problem is that the higher caching level (currently offheap) needs to be a caching tier (our terminology for near caching). Right now, offheap isn't. So you need an onheap level as soon as you start having layers. Here is a working configuration.
I've also set ehcache as the default namespace to make the XML more readable, and made defaultDiskPool the default thread pool so you don't have to set it everywhere (an alternative is to add <event-dispatch thread-pool="defaultDiskPool"/>, because the event dispatch needs a thread pool and there was no default).
<config xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
        xmlns='http://www.ehcache.org/v3'
        xsi:schemaLocation="http://www.ehcache.org/v3
                            http://www.ehcache.org/schema/ehcache-core-3.0.xsd">
    <persistence directory="C:\foo\bar\Cache-Persistence"/>
    <thread-pools>
        <thread-pool alias="defaultDiskPool" min-size="1" max-size="3" default="true"/>
    </thread-pools>
    <cache-template name="PROC_REQTemplate">
        <key-type>java.lang.String</key-type>
        <value-type>java.lang.String</value-type>
        <expiry>
            <ttl>640</ttl>
        </expiry>
        <resources>
            <heap unit="entries">1</heap>
            <offheap unit="MB">500</offheap>
            <disk unit="GB" persistent="true">3</disk>
        </resources>
    </cache-template>
    <cache alias="proc_req_cache" uses-template="PROC_REQTemplate"/>
</config>

File System Block Size while creating the File System using mkfs

I am trying to use BUSE (with NBD) to create a block device in user space. I do not clearly understand the block access patterns when creating a file system. As shown in the example, when I mount the NBD device and create an ext4 file system with a block size of 4096, I see that the reads and writes are in multiples of 1024, not 4096.
However, once the file system is created, when I mount the device and try to read/write files, the requests are sent in multiples of 4096.
So it looks like, while creating the file system using mkfs.ext4, the block device is accessed with a block size of 1024, and only after the file system is created is the user-specified block size used. Am I correct in making this inference? If so, can someone explain what happens behind the scenes and why 1024 is chosen initially?
Thanks and Regards,
Sharath
