When is storage allocated for a newly created ext2 inode? - linux-kernel

I'm reading Understanding the Linux Kernel, 3rd Edition, on how a new regular file is created on ext2.
(The book is available online in multiple places, though I'm not sure about their legality. A later version of just the relevant chapter is on O'Reilly's site, but it does not have all the relevant details.)
As I see it, the Linux kernel has to create a new inode and allocate the necessary blocks for it.
The book outlines the following procedures:
Creating inodes (p. 758)
Allocating a Data Block (p. 764)
What I don't understand is when the kernel allocates the new inode's data blocks.
Near the end (para. 14) of the Creating inodes procedure, I see the following:
Invokes ext2_preread_inode() to read from disk the block containing
the inode and to put the block in the page cache. This type of
read-ahead is done because it is likely that a recently created inode
will be written back soon.
So, just prior to that seems to me a logical place to allocate the inode's blocks. However, it may be that the ext2 architects decided to do the allocation at a different time.
Does anyone know when storage is allocated for a newly created ext2 inode?

IIRC, on modern kernels the answer is "when flushing the file from the disk cache to disk". That may appear quite late, but remember that ext2 tries to avoid fragmentation. If you can delay the allocation of blocks until the whole file is in the disk cache, you know exactly how big it is and can allocate one contiguous run of blocks.

In the book I found this:
Allocates the disk inode: sets the corresponding bit in the inode bitmap and marks the buffer containing the bitmap as dirty. Moreover,
if the filesystem has been mounted specifying the MS_SYNCHRONOUS flag
(see the section “Mounting a Generic Filesystem” in Chapter 12), the
function invokes sync_dirty_buffer( ) to start the I/O write operation
and waits until the operation terminates.
It means that, in the book's terms, allocating the disk inode just means setting the corresponding bit in the in-memory copy of the inode bitmap and marking that bitmap buffer dirty. The bitmap will then be written back to storage soon.
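To make that concrete, here is roughly what that step looks like in ext2_new_inode() (fs/ext2/ialloc.c). This is a condensed sketch from memory, not verbatim kernel source, so helper names and locking details may differ between kernel versions:

/*
 * Condensed sketch of the "allocate the disk inode" step: the allocation
 * is just flipping a bit in the in-memory copy of the group's inode
 * bitmap and dirtying the buffer that holds it.
 */
bitmap_bh = read_inode_bitmap(sb, group);        /* buffer_head holding the group's inode bitmap */

ext2_set_bit_atomic(sb_bgl_lock(EXT2_SB(sb), group),
                    ino, bitmap_bh->b_data);     /* claim inode number 'ino' */
mark_buffer_dirty(bitmap_bh);                    /* writeback will flush it later... */
if (sb->s_flags & MS_SYNCHRONOUS)
        sync_dirty_buffer(bitmap_bh);            /* ...or right now, for a sync mount */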
About ext2_preread_inode(), here is the code:
static void ext2_preread_inode(struct inode *inode)
{
	unsigned long block_group;
	unsigned long offset;
	unsigned long block;
	struct ext2_group_desc *gdp;
	struct backing_dev_info *bdi;

	bdi = inode->i_mapping->backing_dev_info;
	if (bdi_read_congested(bdi))
		return;
	if (bdi_write_congested(bdi))
		return;

	block_group = (inode->i_ino - 1) / EXT2_INODES_PER_GROUP(inode->i_sb);
	gdp = ext2_get_group_desc(inode->i_sb, block_group, NULL);
	if (gdp == NULL)
		return;

	/*
	 * Figure out the offset within the block group inode table
	 */
	offset = ((inode->i_ino - 1) % EXT2_INODES_PER_GROUP(inode->i_sb)) *
		EXT2_INODE_SIZE(inode->i_sb);
	block = le32_to_cpu(gdp->bg_inode_table) +
		(offset >> EXT2_BLOCK_SIZE_BITS(inode->i_sb));
	sb_breadahead(inode->i_sb, block);
}
I am no kernel master, but it seems that this function just prereads the block of the on-disk inode table that holds the newly created inode. This is done to improve performance, as mentioned in the comments.
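To make the arithmetic concrete with made-up numbers: on a filesystem with EXT2_INODES_PER_GROUP = 8192, EXT2_INODE_SIZE = 128 and a 4 KB block size, inode number 10000 gives block_group = (10000 - 1) / 8192 = 1, offset = ((10000 - 1) % 8192) * 128 = 1807 * 128 = 231296 bytes into that group's inode table, so the block read ahead is bg_inode_table + (231296 >> 12) = bg_inode_table + 56.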
So my understanding is that when they talk about the INODE BLOCK, they mean the block of filesystem metadata (the inode table block, found via the inode bitmap) where this particular inode is recorded. When is that block allocated? When you run mkfs.ext2.
Maybe I didn't catch the question, so here is a small addition:
If you are asking about the allocation of data blocks for the file linked with this inode, then the answer is the following:
The ext2_get_block() function ... invokes the ext2_alloc_block()
function to actually search for a free block in the Ext2 partition.
So the answer is then: ext2_create -> ... -> ext2_alloc_block.
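This isn't from the book, but if you want to watch it from user space, a rough probe like the one below (the mount point and sizes are made up) prints st_blocks right after write() and again after fsync(), so you can see on your own kernel at which point data blocks for the file actually show up:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	char buf[8192];
	struct stat st;
	int fd = open("/mnt/ext2/testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));
	write(fd, buf, sizeof(buf));            /* data lands in the page cache */
	fstat(fd, &st);
	printf("after write: st_blocks = %ld\n", (long)st.st_blocks);
	fsync(fd);                              /* force the data out to disk */
	fstat(fd, &st);
	printf("after fsync: st_blocks = %ld\n", (long)st.st_blocks);
	close(fd);
	return 0;
}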

Related

Using SetFilePointer to change the location to write in the sector doesn't work?

I'm using SetFilePointer to rewrite the second half of the MBR with something. It's a user-mode application, and I opened a handle to PhysicalDrive.
At first I tried to set the size parameter in WriteFile to 256, but WriteFile gave an INVALID_PARAMETER error. Based on some searching through other questions here, it seems this is because we are forced to write in multiples of the sector size when the handle is to PhysicalDrive, for some reason.
Then I tried to set the file pointer to 256 and write 512 bytes. Both calls return no error, but for some unknown reason it writes from the beginning of the sector, as if the SetFilePointer didn't work, even though the return value of SetFilePointer is OK and it returns 256.
So my questions are:
Why does the write size have to be a multiple of the sector size when the handle is to PhysicalDrive? Which other device handles are like this?
Why, when I set the file pointer to 256, does WriteFile still write from the start?
Isn't this really wasteful? Even if I want to change 1 byte, I have to read the entire sector, change the one byte and then write it back, instead of just writing 1 byte; it seems like 10 times more overhead! Isn't there a faster way to write a few bytes in a sector?
I think you are mixing up the file system and the storage (block device). The file system sits above the storage device stack. If your code obtains a handle to a file system device, you can write byte by byte. But if you are accessing the storage device stack directly, you can only write sector by sector (or in the device's block size).
Directly writing to the block device is definitely slow, as you discovered. However, in most cases people just talk to file systems. Most file system drivers maintain a cache and use read and write algorithms to improve performance.
I can't comment on the file-pointer-based offset before seeing the actual code, but I guess it is either not sector-aligned or not used at all.
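For the 1-byte (or 256-byte) case, the usual answer really is read-modify-write of a whole sector. Here is a rough sketch of that pattern; the drive number, offset and error handling are simplified, it must run elevated, and note the write is destructive on a real disk:

#include <windows.h>
#include <string.h>

int main(void)
{
	BYTE  sector[512];          /* one whole sector; adjust if your sector size differs */
	DWORD done = 0;
	HANDLE h = CreateFileA("\\\\.\\PhysicalDrive0",
	                       GENERIC_READ | GENERIC_WRITE,
	                       FILE_SHARE_READ | FILE_SHARE_WRITE,
	                       NULL, OPEN_EXISTING, 0, NULL);
	if (h == INVALID_HANDLE_VALUE)
		return 1;

	/* Offset and length must both be sector-aligned on a PhysicalDrive
	 * handle, so read the whole first sector (the MBR)... */
	SetFilePointer(h, 0, NULL, FILE_BEGIN);
	if (ReadFile(h, sector, sizeof(sector), &done, NULL) && done == sizeof(sector)) {
		/* ...patch only the bytes you care about in memory
		 * (example: clobbers the second half of the MBR!)... */
		memset(sector + 256, 0xAA, 256);

		/* ...then seek back and write the full, aligned sector in one go. */
		SetFilePointer(h, 0, NULL, FILE_BEGIN);
		WriteFile(h, sector, sizeof(sector), &done, NULL);
	}
	CloseHandle(h);
	return 0;
}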

How does the fork() process mark the parent's PTEs as read-only?

I've searched through a lot of resources, but found nothing concrete on the matter:
I know that on some Linux systems a fork() syscall works with copy-on-write; that is, the parent and the child share the same address space, but the PTEs are now marked read-only, to be used later for COW. When either tries to write to a page, a page fault occurs and the page is copied to another place, where it can be modified.
However, I cannot understand how the OS reaches the shared PTEs to mark them as read-only. I have hypothesized that when a fork() syscall occurs, the OS performs a "page walk" over the parent's page tables and marks them read-only, but I can find no confirmation of this, or any other information about the process.
Does anyone know how the pages come to be marked as read-only? I will appreciate any help. Thanks!
The Linux kernel implements the fork syscall by iterating over all memory ranges (mmaps, stack and heap) of the parent process. Copying of those ranges (VMAs, virtual memory areas) is done in the function copy_page_range() (mm/memory.c), which loops over the page table entries:
copy_page_range will iterate over pgd and call
copy_pud_range to iterate over pud and call
copy_pmd_range to iterate over pmd and call
copy_pte_range to iterate over pte and call
copy_one_pte, which does memory usage accounting (RSS) and has several code segments to handle the COW case:
/*
 * If it's a COW mapping, write protect it both
 * in the parent and the child
 */
if (is_cow_mapping(vm_flags)) {
	ptep_set_wrprotect(src_mm, addr, src_pte);
	pte = pte_wrprotect(pte);
}
where is_cow_mapping() will be true for private and potentially writable mappings (the flags bitfield is checked for the VM_SHARED and VM_MAYWRITE bits, and only VM_MAYWRITE should be set):
#define VM_SHARED	0x00000008
#define VM_MAYWRITE	0x00000020

static inline bool is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
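If it helps to see that flag test in isolation, here is a tiny user-space re-creation of the check (the flag values are copied from above; the labels in the comments are just illustrative):

#include <stdbool.h>
#include <stdio.h>

#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

static bool is_cow_mapping(unsigned long flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

int main(void)
{
	/* private, writable mapping: only VM_MAYWRITE set -> treated as COW */
	printf("private writable: %d\n", is_cow_mapping(VM_MAYWRITE));
	/* shared, writable mapping: VM_SHARED also set -> not COW */
	printf("shared writable:  %d\n", is_cow_mapping(VM_SHARED | VM_MAYWRITE));
	return 0;
}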
PUD, PMD, and PTE are described in books like https://www.kernel.org/doc/gorman/html/understand/understand006.html and in articles like LWN 2005: "Four-level page tables merged".
How the fork implementation calls copy_page_range:
The fork syscall implementation (sys_fork, or SYSCALL_DEFINE0(fork)) is do_fork (kernel/fork.c), which will call
copy_process, which calls many copy_* functions, including
copy_mm, which calls
dup_mm to allocate and fill the new mm struct, where most of the work is done by
dup_mmap (still in kernel/fork.c), which checks what was mmapped and how. (Here I was unable to get the exact path to the COW implementation, so I used an internet search for something like "fork+COW+dup_mm" to get hints like [1] or [2] or [3].) After checking the mmap types, the line retval = copy_page_range(mm, oldmm, mpnt); does the real work.
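None of this machinery is directly visible from user space, only its effect: after fork() each side keeps a logically private copy of every private page, and the first write is what triggers the page copy. A minimal demo of that behaviour:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 42;   /* after fork(), parent and child initially share the page holding this */

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* Child: this store hits a write-protected PTE, the kernel copies
		 * the page in the fault handler, and only the child's copy changes. */
		value = 1000;
		printf("child sees  %d\n", value);
		return 0;
	}
	wait(NULL);
	printf("parent sees %d\n", value);   /* still 42 */
	return 0;
}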

How does xv6 know where the last element of p->pgdir is?

In xv6 each process has a struct called proc which stores the process's state.
It has a field called pgdir which points to its page directory.
My question is: it should store the index of the last element in its page directory. I mean, if it wants to allocate a new page table, it should put a reference to it in pgdir.
So how does it know where the next element of the page directory is?
Thanks for your help.
I asked some people in the real world and understood that p->sz is what lets you find the first free element in the process's page directory.
The process always keeps track of how many bytes its program uses via the sz field in the proc struct. With this knowledge, it's easy to calculate the last page table entry and page directory entry used.
For example, if a program uses 8000 bytes at the moment (meaning sz = 8000):
In xv6, each memory page is 4096 bytes (PGSIZE) and there are 1024 page table entries under each page directory entry. Therefore, each page directory entry covers 4096 * 1024 bytes (4 MB) and each page table entry covers 4096 bytes (4 KB).
That means the process's last used page directory entry index is sz / 4 MB (rounded down),
and the last used page table entry within that page directory entry is sz / 4 KB (rounded down). In the sz = 8000 example, that means page directory entry 0 (the first) and page table entry 1 (the second).
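Here is a small sketch of that calculation, mirroring the PDX/PTX macros xv6 defines in mmu.h, and using the last byte in use (sz - 1) as the address to translate:

#include <stdio.h>

/* Constants as defined in xv6's mmu.h (x86 version). */
#define PGSIZE   4096
#define PTXSHIFT 12                              /* offset of PTX in a linear address */
#define PDXSHIFT 22                              /* offset of PDX in a linear address */

#define PDX(va)  (((va) >> PDXSHIFT) & 0x3FF)    /* page directory index */
#define PTX(va)  (((va) >> PTXSHIFT) & 0x3FF)    /* page table index */

int main(void)
{
	unsigned int sz = 8000;              /* bytes used by the process */
	unsigned int last_va = sz - 1;       /* last virtual address in use */

	printf("last PDE index: %u\n", PDX(last_va));  /* 0 (first entry)  */
	printf("last PTE index: %u\n", PTX(last_va));  /* 1 (second entry) */
	return 0;
}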

kzalloc() - Maximum size in a single call?

What is the maximum size that we can allocate using kzalloc() in a single call?
This is a very frequently asked question. Also, please let me know how I can verify that value.
The upper limit (the number of bytes that can be allocated in a single kmalloc/kzalloc request) is a function of:
the processor – really, the page size – and
the number of buddy system freelists (MAX_ORDER).
On both x86 and ARM, with a standard page size of 4 KB and MAX_ORDER of 11, the kmalloc upper limit on a single call is 4 MB!
Details, including explanations and code to test this, here:
http://kaiwantech.wordpress.com/2011/08/17/kmalloc-and-vmalloc-linux-kernel-memory-allocation-api-limits/
No different from kmalloc(). That's the question you should ask (or search for), because kzalloc() is just a thin wrapper that adds __GFP_ZERO.
Up to about PAGE_SIZE (at least 4k) is no problem :p. Beyond that... you're right to say lots of people have asked; it's definitely something you have to think about. Apparently it depends on the kernel version - there used to be a hard 128k limit, but it has been increased (or maybe dropped altogether) now. That's just the hard limit, though; what you can actually get depends on the given system (and very definitely on the kernel version).
Maybe read What is the difference between vmalloc and kmalloc?
You can always "verify" the allocation by checking the return value from kzalloc(), but by then you've probably already logged an allocation failure backtrace. Other than that, no - I don't think there's a good way to check in advance.
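If you do want to probe a large size without the automatic failure backtrace, one common pattern (a sketch, not from the answers above) is to pass __GFP_NOWARN and just look at the result:

#include <linux/kernel.h>
#include <linux/slab.h>

/* Sketch: attempt a large allocation without triggering the usual
 * page-allocation-failure backtrace, by adding __GFP_NOWARN. */
static void *try_big_kzalloc(size_t len)
{
	void *buf = kzalloc(len, GFP_KERNEL | __GFP_NOWARN);

	if (!buf)
		printk(KERN_INFO "kzalloc of %zu bytes failed\n", len);
	return buf;
}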
However, it depends on your kernel version and config. These limits are normally located in linux/slab.h, usually described as below (this example is from Linux 2.6.32):
#define KMALLOC_SHIFT_HIGH	((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
				(MAX_ORDER + PAGE_SHIFT - 1) : 25)
#define KMALLOC_MAX_SIZE	(1UL << KMALLOC_SHIFT_HIGH)
#define KMALLOC_MAX_ORDER	(KMALLOC_SHIFT_HIGH - PAGE_SHIFT)
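Plugging in typical 32-bit x86 values (PAGE_SHIFT = 12 for 4 KB pages, MAX_ORDER = 11): KMALLOC_SHIFT_HIGH = min(11 + 12 - 1, 25) = 22, so KMALLOC_MAX_SIZE = 1 << 22 = 4194304 bytes, i.e. 4 MB.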
And you can test them with the code below:
#include <linux/module.h>
#include <linux/slab.h>

int init_module(void)
{
	printk(KERN_INFO "KMALLOC_SHIFT_LOW:%d, KMALLOC_SHIFT_HIGH:%d, KMALLOC_MIN_SIZE:%d, KMALLOC_MAX_SIZE:%lu\n",
	       KMALLOC_SHIFT_LOW, KMALLOC_SHIFT_HIGH, KMALLOC_MIN_SIZE, KMALLOC_MAX_SIZE);
	return 0;
}

void cleanup_module(void)
{
}
Finally, the results under Linux 2.6.32 (32-bit) are: 3, 22, 8, 4194304, which means the minimum size is 8 bytes and the maximum size is 4 MB.
PS.
You can also check the actual size of the memory allocated by kmalloc; just use ksize(), e.g.
void *p = kmalloc(15, GFP_KERNEL);
printk(KERN_INFO "%zu\n", ksize(p)); /* this will print "16" under my kernel */

How is the Page File available calculated in Windows Task Manager?

In Vista Task Manager, I understand the available page file is listed like this:
Page File: <in use> M / <available> M
In XP it's listed as the Commit Charge Limit.
I had thought that:
Available Virtual Memory = Physical Memory Total + Sum of Page Files
But on my machine I've got Physical Memory = 2038M, Page Files = 4096M, Page File Available = 6051M. That leaves 83M unaccounted for. What's that used for? I thought it might have something to do with kernel memory, but the numbers don't seem to match up.
Info I've found so far:
See http://msdn.microsoft.com/en-us/library/aa965225(VS.85).aspx for more info.
Page file size can be found here: Computer Properties, advanced, performance settings, advanced.
I think you are correct in your guess that it has something to do with the kernel - kernel memory needs some physical backing as well.
However, I have to admit that when trying to verify this, the numbers still do not match well, and there is a significant amount of memory not accounted for by this.
I have:
Available Virtual Memory = 4 033 552 KB
Physical Memory Total = 2 096 148 KB
Sum of Page Files = 2048 MB
Kernel Non-Paged Memory = 28 264 KB
Kernel Paged Memory = 63 668 KB
