What is indirect buffer and indirect descriptor? - data-structures

Recently I'm reading virtio code in linux kernel. I came across "indirect buffer" and "indirect descriptor" in the source code. I wonder what is the meaning of "indirect"?

You've probably come across virtqueue concept. virtqueue ring has a certain (limited) capacity, and, since some devices may work way more effective by concurrently dispatching a large number of large requests, indirect descriptor feature comes in handy to allow this kind of operation. The idea is simply to increase ring capacity, and this is achieved by storing an indirect descriptor table anywhere in memory and inserting its desciptor in main virtqueue (INDIRECT flag is set on such a descriptor to show its nature). So, when such indirect descriptor is processed in the main ring, control is passed to the indirect descriptor table (which resides in memory buffer described by addr and len of that indirect descriptor in the main ring). Then descriptors from within such a table get processed, and the end of the table is determined by absence of NEXT flag in the current descriptor. When the flag is found unset, control is passed back to the virtqueue.
So, roughly, it's just a trick to "substitute" an ordinary descriptor in the ring by a handful of descriptors thus increasing overall capacity. So, indirect descriptor - is an ordinary descriptor in the ring which has INDIRECT flag set and which references an indirect buffer. The latter is a chunk of memory which contains a table of ordinary descriptors to be processed.
Also please note descriptors from within indirect descriptor table can't be indirect themselves.

Related

VirtualAlloc a writeable, unbacked "throw-away" garbage range?

Is it possible in Win32 to get a writeable (or write-only) range of "garbage" virtual address space (i.e., via VirtualAlloc, VirtualAlloc2, VirtualAllocEx, or other) that never needs to be persisted, and thus ideally is never backed by physical memory or pagefile?
This would be a "hole" in memory.
The scenario is for simulating a dry-run of a sequential memory writing operation just in order to obtain the size that it actually consumes. You would be able to use the exact same code used for actual writing, but instead pass in an un-backed "garbage" address range that essentially ignores or discards anything that's written to it. In this example, the size of the "void" address range could be 2⁶⁴ = 18.4ᴇʙ (why not? it's nothing, after all), and all you're interested in is the final value of an advancing pointer.
[edit:] see comments section for the most clever answer. Namely: map a single 4K page multiple times in sequence, tiling the entire "empty" range
This isn't possible. If you have code that attempts to write to memory then the virtual memory needs to be backed with something.
However, if you modified your code to use the stream pattern then you could provide a stream implementation that ignored the write and just tracked the size.

Overwriting data in memory

I've written a password manager in Ocaml. In order to make it as secure as possible, I'd like to store a string (an encryption key) in memory in such a way that it can be overwritten. Since Ocaml is pass by value , and there's a garbage collector, this has proven difficult. I encrypt all buffers and variables I can, but I still need a "session key" stored to do this. To prevent this from being detected by automated key searching programs or put into swap, it's assembled from a bunch of random data in a buffer using a random increment. So really, all I need is a single variable that can be overwritten for the assembled key for a few seconds before it's passed into the Nocrypto library... Would a reference work for this?
According to this cornell "Refs and Arrays" page, refs are mutable and work similarly to pointers in C. That being said, I also found a stack overflow answer discussing Ocaml refs, in which the answer mentions "they act like pointers to new allocated memory". Does this mean every time, it just allocates a new thing in memory instead of actually mutating the stuff in memory? If so, you couldn't really "overwrite" a ref.
Other possible solutions I've come across are Bigarrays, and "custom blocks". I'm not entirely sure if "custom blocks" are actually allocated outside of the scope of garbage collection or not. They seem like they're used to access external C code. Are they copied around by the garbage collector? Could they be "overwritten?" There's also this idea of "opaque bytes" and opaque objects in memory. I'm having a pretty hard time wrapping my head around how this all fits together. A useful but confusing (to me) discussion of custom blocks in memory on stack overflow is here: Are custom blocks ever copied in memory? Answer says they can be moved around. Even so, could they be overwritten?
The last possible solution is to store it using a Cstruct like the Nocrypto library seems to do. They discuss it in this github issue: Secret material erasure The asker states:
"Granted, most key material is Cstruct.t, which is a Bigarray.Array1.t, which is allocated outside of the GC heap"
Is this even correct? If so, I can't seem to find a source file that actually does this. I'm pretty new to Ocaml and functional programming in general. If you're curious, my program is located on github here: ocaml-pass
TL;DR;
You shall not store any secret information in OCaml heap. Thus you must never copy your secret into any OCaml heap-allocated value, consequently, neither Bytes, nor Strings or Arrays could be used, even temporary.
Introduction to the OCaml Memory Model
OCaml values are uniformly represented as tagged machine words. The least significant bit of a word is used as a tag, that distinguishes between pointers (tag=0) and immediate values (tag=1). Thus a value has always a fixed size, and is a pointer or an immediate.
Immediate values store their data in the most significant part of the word, that is 31-bits in 32 bit systems, and 63 bits in 64-bit systems. Pointers store their data in blocks, that are located in a so-called OCaml Heap. The OCaml Heap is a set of blocks managed by the Garbage Collector (GC). A block is a chunk of data prefixed with a header. The header specifies the size of data, and some other meta information, used by the GC. Block can contain OCaml values (pointers or immediate values) or opaque data.
To summarize. All OCaml values are represented as machine words, that either store data directly in the word or are pointers to heap-allocated blocks. Each pointer points to one and only one block. Several pointers may point to the same block. Such values are considered physically equal. Some blocks are not pointed by any pointers. Such blocks are called dead and are reclaimed by the GC.
Introduction to the OCaml Garbage Collector
The GC manages blocks, by allocating, moving, and deallocating them. The GC itself uses an arena, that is either obtained from the C memory allocator (malloc) or directly from a kernel via the memmap syscall (depends on a particular system and runtime).
The GC is generational, that means that values are first allocated in a special region of a heap called minor heap. The minor heap is a contiguous region of memory of fixed size, represented in the runtime with three pointers: the pointer beg to the beginning of the minor heap, a pointer end to the end of the minor heap, and the pointer cur to the beginning of the free area of the minor heap. When a block is allocated, cur is increased by the size of the block. Then the block is initialized with data. When there is no more free space in the minor heap (i.e., then end - cur is less than the required block size), then a minor GC cycle is triggered. The GC analyzes all blocks stored in the Minor Heap and copies all blocks that are referenced by at least one pointer to the Major Heap. After that, the cur pointer is set to beg.
In the Major Heap, a block may also be copied several times during a process called compaction. The compactor may try to rearrange blocks in its arena in order to achieve more compact representation of the heap.
Security Consequences
As the OCaml GC is a moving GC, it may copy the heap-allocated data arbitrarily. Although it is called moving, it is still in fact just copying. I.e., when a block is moved from the minor heap to the major heap, it is in fact just bit-copied, and thus is duplicated. The block phantom in the minor heap may live for an arbitrary amount of time until it is overwritten by some newly allocated value. When an object is moved during the compaction, it is also copied, and may or may not be overwritten during the process. And, of course, it goes without saying, that once a block becomes dead, it still may survive in a heap for an arbitrary amount of time until reused by the GC.
That all means, that if a secret ends up in the OCaml heap, it will go wild, as the GC can replicate it multiple times in an arbitrary and rather unpredictable ways. Thus, we can only store secrets either in immediate values or in regions that are not controlled by the GC. As it was said before, all OCaml values that are pointers, always point to a block in the OCaml heap. A block may contain data directly, or it could contain a pointer itself, that will point outside the memory heap. The so-called custom blocks, may or may not store their information in the OCaml heap, it depeds on a particular representation of each custom block. For example, the Bigarray library provides custom blocks that store their payload outside of the OCaml heap. Thus a Bigarray is a custom block, that has two fields: a pointer and size. It is an opaque block, i.e., the GC will never treat these two values as OCaml values, and will never follow neither the size nor the pointer. The data pointed by a pointer is located outside of the OCaml heap, and is either allocated by malloc or by memmap (in fact, it could be arbitrary integer, and even point to stack, or static data, it doesn't really matter, as long as we treat bigarrays just as a ptr,len pair).
This all makes Bigarrays ideal for storing secrets. We can be sure, that they are not moved by the GC, we can overwrite them to prevent the information leakage once they are freed.
Further considerations
We should be careful, and never allow a secret to be copied into the OCaml heap from our safe place. That means, that even if our main storage is a safe bigarray the information will still leak if we will copy its contents to an OCaml string. Consequently, if we first read the information into OCaml string, and then copy it into bigarray, the information will still leak. Thus, any interface that uses OCaml heap-allocated values is unsafe and shall not be used. For example, we can't use OCaml channels to read or write secrets (we should rely on memory mapping or unbuffered IO provided by the Unix module). And again, whenever you get a string data type from a Bigarray, you get your data copied, with all the ramifications.
I would use a value of type bytes, essentially a mutable array of bytes:
# let buffer = Bytes.make 16 'x';;
val buffer : bytes = "xxxxxxxxxxxxxxxx"
# Bytes.set buffer 0 'T';;
- : unit = ()
# buffer;;
- : bytes = "Txxxxxxxxxxxxxxx"
# Bytes.fill buffer 0 16 ' ';;
- : unit = ()
# buffer;;
- : bytes = " "
You can overwrite with Bytes.fill after you're done.

How does my OS know in which direction to flush if I use OpenMP flush function?

A thread that reads a shared variable has first to call flush, and a thread that writes to a shared variable has to call OpenMP flush afterwards, to keep the shared variable in main memory and cache synchronized. How does the flush function know in which direction to flush? It needs to know which of both variables (main memory or cache) is newer. I assume, but I am not sure, that the OS or CPU take care of this somehow. Does someone know?
flush is not a function - it is an OpenMP compiler directive. It affects the way the compiler generates the executable code and instructs it to synchronise the values of all optimised variables (stored in CPU registers or other explicitly programmable cache / thread-local memory) in the flush-set. This is similar to the effect that the volatile storage modifier has on code generation, but has more limited point-local effect.
How does it work? While parsing the source code, the compiler analyses the flow of statements and the data (variables) that gets affected by those statements. Consequently the compiler builds an execution graph and a data dependency graph from the code. It knows exactly where and how the value of each variable is being used and the execution of which code block affects which variables. Then the compiler tries to optimise the code by simplifying the graph and to reduce the number of expensive memory operations by either using CPU registers to store intermediate values or by using another for of faster thread-addressable local memory. The flush directive adds special points in the execution graph, where the compiler must explicitly synchronise the memory view of the thread (register variables and local-memory variables) with the global shared memory. Since the compiler has built the dependency graph in the first place, it knows exactly which variables in the flush-set were modified and hence have to be written to the shared memory; all other variables in the flush-set have to be read from the shared memory.
So the answer to your question is that it is usually the compiler who processes the flush directive, not the OS, although the compiler might call into the OS to actually implement the flush, e.g. on systems with explicitly programmable caches/local memories. But one should also note that OpenMP is an abstract standard, which can be implemented on many different hardware platforms and that some of those platforms provide certain hardware that can help with implementing the OpenMP abstractions more efficiently (e.g. the CPU ASIC in IBM's Blue Gene/Q provides many such features).
You don't need to call flush to keep shared variables synchronized.
The hardware (CPU) does keep track of cached memory and if there are conflicting accesses, they will slow down your program, because the cache will be flushed by CPU.
I understand the flush directive more like a conditional barrier.
A flush containing the same variable must be encountered by at least two threads to have an effect.
When this directive is met by two threads with say variable a in common, if they have modified it they will write back their modifications to memory (as opposed to keep it in a local variable or register), and then I suppose there is a barrier for both thread to get to that point before they continue.
If the variable a is used after the flush it is reread from memory.

Ptrace and memory allocation

I have been playing a while with ptrace. I followed some tutorials like this one or this one. So far, when I have a ptrace-d child process, I am able to:
Detect system calls and browse the registers.
Fetch the strings contained in addresses pointed by the registers, thanks to the PTRACE_PEEKDATA option of ptrace.
Change the values of those registers and change memory values in the user space of the child process thanks to the PTRACE_POKEDATA option of ptrace.
My problem is the following: let's say that for example I have just detected an open system call. I can modify the filename of the file to be opened thanks to the address stored in the ebx register. However, I wonder if I can just change the filename to anything I want, any size. If the name I am changing to is really large (let's say 50 times the original filename length), wouldn't I be messing with some memory I should not be writing on? Should I 'allocate' some memory in the child's memory space? If so, how would this be done?
Note that the child process is some program executed with execve, I cannot access its source code.
The pathname passed to open could be dynamically allocated by the program (so its on the heap or stack somewhere), or it could be in the read-only section if it was a compile-time constant. In either case, you don't know what other parts of the program might be using it, so its probably not a good idea to change its contents. You would definitely overwrite adjacent memory if you wrote past the current length (which would probably lead to subtle problems like corrupting heap meta-data or corrupting other random allocation objects).
Here are some random ideas (totally untested) on how to allocate memory in a child process:
invoke an mmap syscall on its behalf (this would probably be pretty tricky) but would get you a page (or more) of memory to play with
allocate some space in the current stack (don't change the child's registers, but use your knowledge of which part of the stack the child is using to put temporary objects in the unused section). Technically its legal for the child process to do this same thing (so you could end up corrupting that data), but its very unlikely.
hide stuff at the far end of the stack, (again assuming the child isn't also playing this trick).
I didn't think invoking malloc would be easy, but googling for 'ptrace child allocate memory' I found: http://www.hick.org/code/skape/papers/needle.txt (which finds the malloc routine used by the ELF dynamic linker and constructs a call out to there to allocate memory).

How is fseek() implemented in the filesystem?

This is not a pure programming question, however it impacts the performance of programs using fseek(), hence it is important to know how it works. A little disclaimer so that it doesn't get closed.
I am wondering how efficient it is to insert data in the middle of the file. Supposing I have a file with 1MB data and then I insert something at the 512KB offset. How efficient would that be compared to appending my data at the end of the file? Just to make the example complete lets say I want to insert 16KB of data.
I understand the answer varies depending on the filesystem, however I assume that the techniques used in common filesystems are quite similar and I just want to get the right notion of it.
(disclaimer: I want just to add some hints to this interesting discussion)
IMHO there are some things to take into account:
1) fseek is not a primary system service, but a library function. To evaluate its performance we must consider how the file stream library is implemented. In general, the file I/O library adds a layer of buffering in user space, so the performance of fseek may be quite different if the target position is inside or outside the current buffer. Also, the system services that the I/O libary uses may vary a lot. I.e. on some systems the library uses extensively the file memory mapping if possible.
2) As you said, different filesystems may behave in a very different way. In particular, I would expect that a transactional filesystem must do something very smart and perhaps expensive to be prepared to a possible rollback of an aborted write operation in the middle of a file.
3) Modern OS'es have very aggressive caching algorithms. An "fseeked" file is likely to be already present in cache, so operations become much faster. But they may degrade a lot if the overall filesystem activity produced by other processes become important.
Any comments?
fseek(...) is a library call, not an OS system call. It is the run-time library that takes care of the actual overhead involved in making a system call to the OS, technically speaking, fseek is indirectly making a call to the system but really it is not (this brings up a clear distinction between the differences between a library call and a system call). fseek(...) is a standard input-output function regardless of the underlying system...however...and this is a big however...
The OS will more than likely to have cached the file in its kernel memory, that is, the direct offset to the location on the disk on where the 1's and 0's are stored, it is through the OS's kernel layers, more than likely, a top-most layer within the kernel that would have the snapshot of what the file is composed of, i.e. data irrespectively of what it contains (it does not care either way, as long as the 'pointers' to the disk structure for that offset to the lcoation on the disk is valid!)...
When fseek(..) occurs, there would be a lot of over-head, indirectly, the kernel delegated the task of reading from the disk, depending on how fragmented the file is, it could be theoretically, "all over the place", that could be a significant over-head in terms of having to, from a user-land perspective, i.e. the C code doing an fseek(...), it could be scattering itself all over the place to gather the data into a "one contiguous view of the data" and henceforth, inserting into the middle of a file, (remember at this stage, the kernel would have to adjust the location/offsets into the actual disk platter for the data) would be deemed slower than appending to the end of the file.
The reason is quite simple, the kernel "knows" what was the last offset was, and simply wipe the EOF marker and insert more data, behind the scenes, the kernel, is having to allocate another block of memory for the disk-buffer with the adjusted offset to the location on the disk following an EOF marker, once the appending of data is completed.
Let us assume the ext2 FS and the Linux OS as an example. I don't think there will be a significant performance difference between a insert and an append. In both cases the files node and offset table must be read, the relevant disk sector mapped into memory, the data updated and at some later point the data written back to disk. What will make a big performance difference in this example is good temporal and spatial locality when accessing parts of the file since this will reduce the number of load/store combos.
As a previous answers says you may be able to speed up both operations if you deal with data writes that exact multiples of the FS block size, in this case you could skip the load stage and just insert the new blocks into the files inode datastrucure. This would not be practical, as you would need low level access to the FS driver, and using it would be very restrictive and not portable.
One observation I have made about fseek on Solaris, is that each call to it resets the read buffer of the FILE. The next read will then always read a full block (8K by default). So if you have a lot of random access with small reads it's a good idea to do it unbuffered (setvbuf with NULL buffer) or even use direct syscalls (lseek+read or even better pread which is only 1 syscall instead of 2). I suppose this behaviour will be similar on other OS.
You can insert data to the middle of file efficiently only if data size is a multiple of FS sector but OSes doesn't provide such functions so you have to use low-level interface to the FS driver.
Inserting data in the middle of the file is less efficient than appending to the end because when inserting you would have to move the data after the insertion point to make room for the data being inserted. Moving these data would involve reading them from disk, writing the data to be inserted and then writing the old data after the inserted data. So you have at least one extra read and write when inserting.

Resources