No direct packet access in BPF program with just CAP_BPF? - linux-kernel

Up until Linux 5.8 CAP_SYSADMIN was required to load any but the most basic BPF program. The recently introduced CAP_BPF is a welcome addition as it allows to run software leveraging BPF with less privileges.
Certain types of BPF programs can access packet data. The pre-4.7 way of doing it is via bpf_skb_load_bytes() helper. As the verifier got smarter, it became possible to perform "direct packet access", i.e. to access packet bytes by following pointers in the context structure. E.g:
static const struct bpf_insn prog[] = {
// BPF_PROG_TYPE_SK_REUSEPORT: gets a pointer to sk_reuseport_md (r1).
// Get packet data pointer (r2) and ensure length >= 2, goto Drop otherwise
BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
offsetof(struct sk_reuseport_md, data)),
BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
offsetof(struct sk_reuseport_md, data_end)),
BPF_MOV64_REG(BPF_REG_4, BPF_REG_2),
BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 2),
BPF_JMP_REG(BPF_JGT, BPF_REG_4, BPF_REG_3, /* Drop: */ +4),
// Ensure first 2 bytes are 0, goto Drop otherwise
BPF_LDX_MEM(BPF_H, BPF_REG_4, BPF_REG_2, 0),
BPF_JMP_IMM(BPF_JNE, BPF_REG_4, 0, /* Drop: */ +2),
// return SK_PASS
BPF_MOV32_IMM(BPF_REG_0, SK_PASS),
BPF_EXIT_INSN(),
// Drop: return SK_DROP
BPF_MOV32_IMM(BPF_REG_0, SK_DROP),
BPF_EXIT_INSN()
};
It is required to ensure that the accessed bytes are within bounds explicitly. The verifier will reject the program otherwise.
The program above loads successfully if the caller bears CAP_SYSADMIN. Supposedly, CAP_BPF should suffice as well, but it doesn't (Linux 5.13). Earlier kernels behave similarly. The verifier output follows:
Permission denied
0: (79) r2 = *(u64 *)(r1 +0)
1: (79) r3 = *(u64 *)(r1 +8)
2: (bf) r4 = r2
3: (07) r4 += 2
4: (2d) if r4 > r3 goto pc+4
R3 pointer comparison prohibited
processed 5 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
I understand that arbitrary pointer comparison is restricted as it reveals kernel memory layout. However, comparing a pointer to a packet data offset by a certain amount with a pointer to the packet end is safe.
I'd like to find a way to load the program without granting CAP_SYSADMIN.
Is there a way to write bounds checks in a way that doesn't trigger pointer comparison error?
The relevant code is in check_cond_jmp_op(). It looks like one can't get away with pointer comparison, even with the latest kernel version.
If there's no way to write bounds check in a way that keeps verifier happy, I wonder if lifting the limitation is on the roadmap.
As a workaround, I can grant CAP_PERFORM on top of CAP_BPF, removing the "embargo" on pointer comparison. The program loads successfully. I can probably restrict perf_event_open() and other superfluous bits with seccomp. Doesn't feel nice though.
Reproducer.

To make direct packet accesses in your program, you will need CAP_PERFMON in addition to CAP_BPF. I'm not aware of any way around it.
Why?
Because of Spectre vulnerabilities, someone able to perform arithmetic on unbounded pointers (i.e., all except stack and map value pointers) can read arbitrary memory via speculative out-of-bounds loads.
Such operations thus need to be forbidden for unprivileged users. Allowing CAP_BPF users to perform those operations would essentially give read access to arbitrary memory to CAP_BPF. For those reasons, I doubt this limitation will be lifted in the future.

Related

How do I retrieve high and low-order parts of a value from two registers in inline assembly?

I'm currently working on a little game that can run from the boot sector of a hard drive, just for something fun to do. This means my program runs in 16-bit real mode, and I have my compiler flags set up to emit pure i386 code. I'm writing the game in C++, but I do need a lot of inline assembly to talk to the BIOS via interrupt calls. Some of these calls return a 32-bit integer, but stored in two 16-bit registers. Currently I'm doing the following to get my number out of the assembly:
auto getTicks = [](){
uint16_t ticksL{ 0 }, ticksH{ 0 };
asm volatile("int $0x1a" : "=c"(ticksH), "=d"(ticksL) : "a"(0x0));
return static_cast<uint32_t>( (ticksH << 16) | ticksL );
};
This is a lambda function I use to call this interrupt function which returns a tick count. I'm aware that there are better methods to get time data, and that I haven't implemented a check for AL to see if midnight has passed, but that's another topic.
As you can see, I have to use two 16-bit values, get the register values separately, then combine them into a 32-bit number the way you see at the return statement.
Is there any way I could retrieve that data into a single 32-bit number in my code right away avoid the shift and bitwise-or? I know that those 16-bit registers I'm accessing are really just the higher and lower 16-bits of a 32-bit register in reality, but I have no idea how to access the original 32-bit register as a whole.
I know that those 16-bit registers I'm accessing are really just the higher and lower 16-bits of a 32-bit register in reality, but I have no idea how to access the original 32-bit register as a whole.
As Jester has already pointed out, these are in fact 2 separate registers, so there is no way to retrieve "the original 32-bit register."
One other point: That interrupt modifies the ax register (returning the 'past midnight' flag), however your asm doesn't inform gcc that you are changing ax. Might I suggest something like this:
asm volatile("int $0x1a" : "=c"(ticksH), "=d"(ticksL), "=a"(midnight) : "a"(0x0));
Note that midnight is also a uint16_t.
As other answers suggest you can't load DX and CX directly into a 32-bit register. You'd have to combine them as you suggest.
In this case there is an alternative. Rather than using INT 1Ah/AH=0h you can read the BIOS Data Area (BDA) in low memory for the 32-bit DWORD value and load it into a 32-bit register. This is allowed in real mode on i386 processors. Two memory addresses of interest:
40:6C dword Daily timer counter, equal to zero at midnight;
incremented by INT 8; read/set by INT 1A
40:70 byte Clock rollover flag, set when 40:6C exceeds 24hrs
These two memory addresses are in segment:offset format, but would be equivalent to physical address 0x0046C and 0x00470.
All you'd have to do is temporarily set the DS register to 0 (saving the previous value), turn off interrupts with CLI retrieve the values from lower memory using C/C++ pointers, re-enable interrupts with STI and restore DS to the previously saved value. This of course is added overhead in the boot sector compared to using INT 1Ah/AH=0h but would allow you direct access to the memory addresses the BIOS is reading/writing on your behalf.
Note: If DS is set to zero already no need to save/set/restore it. Since we don't see the code that sets up the environment before calling into the C++ code I don't know what your default segment values are. If you don't need to retrieve both the roll over and timer values and only wish to get them individually you can eliminate the CLI/STI.
You're looking for the 'A' constraint, which refers to the dx:ax register pair as a double-wide value. You can see the full set of defined constraints for x86 in the gcc documentation. Unfortunately there are no constraints for any other register pairs, so you have to get them as two values and reassemble them with shift and or, like you describe.

Checking for valid user memory in kernel mode with copy_to_user

So, I tried using this:
copy_to_user(p, q, 0)
I want to copy from q to p and if it doesn't work, then I want to know if p points to an invalid address.
copy_to_user returns the number of bytes that weren't copied successfully but in this case, there are 0 bytes and I can't know for sure if p points to an invalid address.
Is there another way to check if p points to a valid user memory?
Yes. You need to check passing size value manually each time before calling copy_to_user(). If it's 0 or not in valid range -- you shouldn't call copy_to_user() at all. This way you can rely on copy_to_user() return value.
the method copy_to_user defined at /usr/src/linux-3.0.6-gentoo/include/asm-generic/uaccess.h
static inline long copy_to_user(void __user *to,
const void *from, unsigned long n)
{
might_fault();
if (access_ok(VERIFY_WRITE, to, n))
return __copy_to_user(to, from, n);
else
return n;
}
the method access_ok checks the accessibility of to(user memory). So you can use the method access_ok to check memory is valid or not(to is not NULL / it's in user space)?
Argument VERIFY_READ or VERIFY_WRITE. VERIFY_READ: identifies whether memory region is readable, VERIFY_WRITE: identifies whether the memory region is readable as well as writable.
source of method access_ok
And what do you consider 'valid user memory'? What do you need this for?
Let's say we only care about the target buffer residing in userspace range (for archs with joint address spaces). From this alone we see that testing the address without the size is pointless - what if the address is the last byte of userspace? Appropriate /range/ check is done by access_ok.
Second part is whether there is a page there or a read/write can be performed without servicing a page fault. Is this of any concern for you? If you read copy_from/whatever you will see it performs the read/write and only catches the fault. There is definitely KPI to check whether the target page can be written to without a fault, but you would need to hold locks (mmap_sem and likely more) over your check and whatever you are going to do next, which is likely not what you wanted to do.
So far it seems you are trying

Atomic operations in ARM

I've been working on an embedded OS for ARM, However there are a few things i didn't understand about the architecture even after referring to ARMARM and linux source.
Atomic operations.
ARM ARM says that Load and Store instructions are atomic and it's execution is guaranteed to be complete before interrupt handler executes. Verified by looking at
arch/arm/include/asm/atomic.h :
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
However, the problem comes in when i want to manipulate this value atomically using the cpu instructions (atomic_inc, atomic_dec, atomic_cmpxchg etc..) which use LDREX and STREX for ARMv7 (my target).
ARMARM doesn't say anything about interrupts being blocked in this section so i assume an interrupt can occur in between the LDREX and STREX. The thing it does mention is about locking the memory bus which i guess is only helpful for MP systems where there can be more CPUs trying to access same location at same time. But for UP (and possibly MP), If a timer interrupt (or IPI for SMP) fires in this small window of LDREX and STREX, Exception handler executes possibly changes cpu context and returns to the new task, however the shocking part comes in now, it executes 'CLREX' and hence removing any exclusive lock held by previous thread. So how better is using LDREX and STREX than LDR and STR for atomicity on a UP system ?
I did read something about an Exclusive lock monitor, so I've a possible theory that when the thread resumes and executes the STREX, the os monitor causes this call to fail which can be detected and the loop can be re-executed using the new value in the process (branch back to LDREX), Am i right here ?
The idea behind the load-linked/store-exclusive paradigm is that if if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:
do
{
new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));
can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value required some significant computation, one should rewrite the loop as:
do
{
old_value = *dest;
new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);
... Assuming CompareAndStore is something like:
uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint_32 old_value)
{
do
{
if (__LDREXW(dest) != old_value) return 1; // Failure
} while(__STREXW(new_value, dest);
return 0;
}
This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason [which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and __STREXW]
Addendum
An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively a references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe/interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 255), one could use something like:
#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)
void do_update(void)
{
// The foo_pool_alloc() method should return a slot number in the lower bits and
// some sort of counter value in the upper bits so that once some particular
// uint32_t value is returned, that same value will not be returned again unless
// there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
// the possibility that while one task is performing its update, a second task
// changes the thing to a new one and releases the old one, and a third task gets
// given the newly-freed item and changes the thing to that, such that from the
// point of view of the first task, the thing never changed.)
uint32_t new_thing = foo_pool_alloc();
uint32_t old_thing;
do
{
// Capture old reference
old_thing = foo_current_thing;
// Compute new thing based on old one
update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
&foo_pool[old_thing & FOO_POOL_SIZE_MASK);
} while(CompareAndSwap(&foo_current_thing, new_thing, old_thing) != 0);
foo_pool_free(old_thing);
}
If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship will exist among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one was using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).
Okay, got the answer from their website.
If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.

Replace HW interrupt in flat memory mode with DOS32/A

I have a question about how to replace HW interrupt in flat memory mode...
about my application...
created by combining Watcom C and DOS32/A.
written for running on DOS mode( not on OS mode )
with DOS32/A now I can access >1M memory and allocate large memory to use...(running in flat memory mode !!!)
current issue...
I want to write an ISR(interrupt service routine) for one PCI card. Thus I need to "replace" the HW interrupt.
Ex. the PCI card's interrupt line = 0xE in DOS. That means this device will issue interrupt via 8259's IRQ 14.
But I did not how to achieve my goal to replace this interrupt in flat mode ?
# resource I found...
- in watcom C's library, there is one sample using _dos_getvect, _dos_setvect, and _chain_intr to hook INT 0x1C...
I tested this code and found OK. But when I apply it to my case: INT76 ( where IRQ 14 is "INT 0x76" <- (14-8) + 0x70 ) then nothing happened...
I checked HW interrupt is generated but my own ISR did not invoked...
Do I lose something ? or are there any functions I can use to achieve my goal ?
===============================================================
[20120809]
I tried to use DPMI calls 0x204 and 0x205 and found MyISR() is still not invoked. I described what I did as below and maybe you all can give me some suggestions !
1) Use inline assembly to implement DPMI calls 0x204 and 0x205 and test OK...
Ex. Use DPMI 0x204 to show the interrupt vectors of 16 IRQs and I get(selector:offset) following results: 8:1540(INT8),8:1544(INT9),.....,8:1560(INT70),8:1564(INT71),...,8:157C(INT77)
Ex. Use DPMI 0x205 to set the interrupt vector for IRQ14(INT76) and returned CF=0, indicating successful
2) Create my own ISR MyISR() as follows:
volatile int tick=0; // global and volatile...
void MyISR(void)
{
tick = 5; // simple code to change the value of tick...
}
3) Set new interrupt vector by DPMI call 0x205:
selector = FP_SEG(MyISR); // selector = 0x838 here
offset = FP_OFF(MyISR); // offset = 0x30100963 here
sts = DPMI_SetIntVector(0x76, selector, offset, &out_ax);
Then sts = 0(CF=0) indicating successful !
One strange thing here is:my app runs in flat memory model and I think the selector should be 0 for MyISR()... But if selector = 0 for DPMI call 0x205 then I got CF=1 and AX = 0x8022, indicating "invalid selector" !
4) Let HW interrupt be generated and the evidences are:
PCI device config register 0x5 bit2(Interrupt Disabled) = 0
PCI device config register 0x6 bit3(Interrupt status) = 1
PCI device config register 0x3C/0x3D (Interrupt line) = 0xE/0x2
In DOS the interrupt mode is PIC mode(8259 mode) and Pin-based(MSIE=0)
5) Display the value of tick and found it is still "0"...
Thus I think MyISR() is not invoked correctly...
Try using DPMI Function 0204h and 0205h instead of '_dos_getvect' and '_dos_setvect', respectively.
The runtime environment of your program is DOS32A or a DPMI Server/host. So use the api they provided instead of using DOS int21h facilities. But DOS32A does intercepts int21h interrupts, so your code should work fine, as far as real mode is concerned.
Actually what you did is you install only real mode interrupt handler for IRQ14 by using '_dos_getvect' and '_dos_setvect' functions.
By using the DPMI functions instead, you install protected mode interrupt handler for IRQ14, and DOS32a will autopassup IRQ14 interrupt to this protected mode handler.
Recall: A dos extender/DPMI server can be in protected mode or real mode while an IRQ is asserted.
This is becoz your application uses some DOS or BIOS API, so extender needs to switch to real mode to execute them and the return back to protected mode to transfer control to you protected mode application.
DOS32a does this by allocating a real-mode callback (at least for hardware interrupts) which calls your protected mode handler if IRQ14 is asserted while the Extender is in real-mode.
If the extender is in protected mode, while IRQ14 is asserted, it will automatically transfer control to your IRQ14 handler.
But if you didn't install protected mode handler for your IRQ, then DOS32a, will not allocate any real-mode callback, and your real-mode irq handler may not get control.
But it should recieve control AFAIK.
Anyway give the above two functions a try. And do chain to the previous int76h interrupt handler as Sean said.
In short:
In case of DOS32a, you need not use '_dos_getvect' and '_dos_setvect' functions. Instead use the DPMI functions 0204h and 0205h for installing your protected mode IRQ handler.
An advise : In your interrupt handler the first step should be to check whether your device actually generated interrupt or it is some other device sharing this irq(IRQ14 in your case). You can do this by checking a 'interrupt pending bit' in your device, if it is set, service your device and chain to next handler. If it is not set to 1, simply chain to next handler.
EDITED:
Use the latest version of DOS32a, instead of one that comes with OW.
Update on 2012-08-14:
Yes, you can use FP_SEG and FP_OFF macros for obtaining selector and offset respectively, just like you would use these macros in real modes to get segment and offset.
You can also use MK_FP macro to create far pointers from selector and offset. eg.
MK_FP(selector, offset).
You should declare your interrupt handler with ' __interrupt ', keyword when writing handlers in C.
Here is a snippet:
#include <i86.h> /* for FP_OFF, FP_SEG, and MK_FP in OW */
/* C Prototype for your IRQ handler */
void __interrupt __far irqHandler(void);
.
.
.
irq_selector = (unsigned short)FP_SEG( &irqHandler );
irq_offset = (unsigned long)FP_OFF( &irqHandler );
__dpmi_SetVect( intNum, irq_selector, irq_offset );
.
.
.
or, try this:
extern void sendEOItoMaster(void);
# pragma aux sendEOItoMaster = \
"mov al, 0x20" \
"out 0x20, al" \
modify [eax] ;
extern void sendEOItoSlave(void);
# pragma aux sendEOItoSlave = \
"mov al, 0x20" \
"out 0xA0, al" \
modify [eax] ;
unsigned int old76_selector, new76_selector;
unsigned long old76_offset, new76_offset;
volatile int chain = 1; /* Chain to the old handler */
volatile int tick=0; // global and volatile...
void (__interrupt __far *old76Handler)(void) = NULL; // function pointer declaration
void __interrupt __far new76Handler(void) {
tick = 5; // simple code to change the value of tick...
.
.
.
if( chain ){
// disable irqs if enabled above.
_chain_intr( old76Handler ); // 'jumping' to the old handler
// ( *old76Handler )(); // 'calling' the old handler
}else{
sendEOItoMaster();
sendEOItoSlave();
}
}
__dpmi_GetVect( 0x76, &old76_selector, &old76_offset );
old76Handler = ( void (__interrupt __far *)(void) ) MK_FP (old76_selector, old76_offset)
new76_selector = (unsigned int)FP_SEG( &new76Handler );
new76_offset = (unsigned long)FP_OFF( &new76Handler );
__dpmi_SetVect( 0x76, new76_selector, new76_offset );
.
.
NOTE:
You should first double check that the IRQ# you are hooking is really assigned/mapped to the interrupt pin of your concerned PCI device. IOWs, first read 'Interrupt Line register' (NOT Interrupt Pin register) from PCI configuration space, and hook only that irq#. The valid values for this register, in your case are: 0x00 through 0x0F inclusive, with 0x00 means IRQ0 and 0x01 means IRQ1 and so on.
POST/BIOS code writes a value in 'Interrupt Line register', while booting, and you MUST NOT modify this register at any cost.(of course, unless you are dealing with interrupt routing issues which an OS writer will deal with)
You should also get and save the selector and offset of the old handler by using DPMI call 0204h, in case you are chaining to old handler. If not, don't forget to send EOI(End-of-interrupt) to BOTH master and slave PICs in case you hooked an IRQ belonging to slave PIC(ie INT 70h through 77h, including INT 0Ah), and ONLY to master PIC in case you hooked an IRQ belonging to master PIC.
In flat model, the BASE address is 0 and Limit is 0xFFFFF, with G bit(ie Granularity bit) = 1.
The base and limit(along with attribute bits(e.g G bit) of a segment) reside in the descriptor corresponding to a particular segment. The descriptor itself, sits in the descriptor table.
Descriptor tables are an array with each entry being 8bytes.
The selector is merely a pointer(or an index) to the 8-byte descriptor entry, in the Descriptor table(either GDT or LDT). So a selector CAN'T be 0.
Note that lowest 3 bits of 16-bit selector have special meaning, and only the upper 13-bits are used to index a descriptor entry from a descriptor table.
GDT = Global Descriptor Table
LDT = Local Descriptor Table
A system can have only one GDT, but many LDTs.
As entry number 0 in GDT, is reserved and can't be used. AFAIK, DOS32A, does not create any LDT for its applications, instead it simply allocate and initalize descriptor entries corresponding to the application, in GDT itself.
Selector MUST not be 0, as x86 architecture regards 0 selector as invalid, when you try to access memory with that selector; though you can successfully place 0 in any segment register, it is only when you try to access(read/write/execute) that segment, the cpu generates an exception.
In case of interrupt handlers, the base address need not be 0, even in case of flat mode.
The DPMI environment must have valid reasons for doing this so.
After all, you still need to tackle segmentation at some level in x86 architecture.
PCI device config register 0x5 bit2(Interrupt Disabled) = 0
PCI device config register 0x6 bit3(Interrupt status) = 1
I think, you mean Bus master command and status registers respectively. They actually reside in either I/O space or memory space, but NOT in PCI configuration space.
So you can read/write them directly via IN/OUT or MOV, instructions.
For reading/writing, PCI configuration registers you must use configuration red/write methods or PCI BIOS routines.
NOTE:
Many PCI disk controllers, have a bit called 'Interrupt enable/disable' bit. The register
that contains this bit is usually in the PCI configuration space, and can be found from the datasheet.
Actually, this setting is for "forwarding" the interrupt generated by the device attached to the PCI controller, to the PCI bus.
If, interrupts are disabled via this bit, then even if your device(attached to PCI controller) is generating the interrupt, the interrupt will NOT be forwarded to the PCI bus(and hence cpu will never know if interrupt occurred), but the interrupt bit(This bit is different from 'Interrupt enable/disable' bit) in PCI controller is still set to notify that the device(attached to PCI controller, eg a hard disk) generated an interrupt, so that the program can read this bit and take appropriate actions. It is similar to polling, from programming perspective.
This usually apply only for non-bus master transfers.
But, it seems that you are using bus master transfers(ie DMA), so it should not apply in your case.
But anyway, I would suggest you do read the datasheet of the PCI controller carefully, especially looking for bits/registers related to interrupt handling
EDITED:
Well, as far as application level programming is concerned, you need not encounter/use _far pointers, as your program will not access anything outside to your code.
But this is not completely true, when you go to system-level programming, you need to access memory mapped device registers, external ROM, or implementing interrupt handlers, etc.
The story changes here. The creation of a segment ie allocating descriptor and getting its associated selector, ensures that even if there is a bug in code, it will not annoyingly change anything external to that particular segment from which current code is executing. If it tries to do so, cpu will generate a fault. So when accessing external devices(especially memory mapped device's registers), or accessing some rom data, eg BIOS etc., it is a good idea to have allocate a descriptor and set the base and segment limits according to the area you need to execute/read/write and proceed. But you are not bound to do so.
Some external code residing for eg in rom, assume that they will be invoked with a far call.
As I said earlier, in x86 architecture, at some level(the farther below you go) you need to deal with segmentation as there is no way to disable it completely.
But in flat model, segmentation is present as an aid to programmer, as I said above, when accessing external(wrt to your program) things. But you need not use if you don't desire to do so.
When an interrupt handler is invoked, it doesn't know the base and limits of program that was interrupted. It doesn't know the segment attributes, limits etc. of the interrupted program, we say except CS and EIP all registers are in undefined state wrt interrupt handler. So it is needed to be declared as far function to indicate that it resides somewhere external to currently executing program.
it's been a while since I fiddled with interrupts, but the table is a pointer to set where the processor should go to to process an interrupt. I can give you the process, but not code, as I only ever used 8086 code.
Pseudo code:
Initialize:
Get current vector - store value
Set vector to point to the entry point of your routine
next:
Process Interrupt:
Your code decides what to do with data
If it's your data:
process it, and return
If not:
jump to the stored vector that we got during initialize,
and let the chain of interrupts continue as they normally would
finally:
Program End:
check to see if interrupt still points to your code
if yes, set vector back to the saved value
if no, set beginning of your code to long jump to vector address you saved,
or set a flag that lets your program not process anything

Can address space be recycled for multiple calls to MapViewOfFileEx without chance of failure?

Consider a complex, memory hungry, multi threaded application running within a 32bit address space on windows XP.
Certain operations require n large buffers of fixed size, where only one buffer needs to be accessed at a time.
The application uses a pattern where some address space the size of one buffer is reserved early and is used to contain the currently needed buffer.
This follows the sequence:
(initial run) VirtualAlloc -> VirtualFree -> MapViewOfFileEx
(buffer changes) UnMapViewOfFile -> MapViewOfFileEx
Here the pointer to the buffer location is provided by the call to VirtualAlloc and then that same location is used on each call to MapViewOfFileEx.
The problem is that windows does not (as far as I know) provide any handshake type operation for passing the memory space between the different users.
Therefore there is a small opportunity (at each -> in my above sequence) where the memory is not locked and another thread can jump in and perform an allocation within the buffer.
The next call to MapViewOfFileEx is broken and the system can no longer guarantee that there will be a big enough space in the address space for a buffer.
Obviously refactoring to use smaller buffers reduces the rate of failures to reallocate space.
Some use of HeapLock has had some success but this still has issues - something still manages to steal some memory from within the address space.
(We tried Calling GetProcessHeaps then using HeapLock to lock all of the heaps)
What I'd like to know is there anyway to lock a specific block of address space that is compatible with MapViewOfFileEx?
Edit: I should add that ultimately this code lives in a library that gets called by an application outside of my control
You could brute force it; suspend every thread in the process that isn't the one performing the mapping, Unmap/Remap, unsuspend the suspended threads. It ain't elegant, but it's the only way I can think of off-hand to provide the kind of mutual exclusion you need.
Have you looked at creating your own private heap via HeapCreate? You could set the heap to your desired buffer size. The only remaining problem is then how to get MapViewOfFileto use your private heap instead of the default heap.
I'd assume that MapViewOfFile internally calls GetProcessHeap to get the default heap and then it requests a contiguous block of memory. You can surround the call to MapViewOfFile with a detour, i.e., you rewire the GetProcessHeap call by overwriting the method in memory effectively inserting a jump to your own code which can return your private heap.
Microsoft has published the Detour Library that I'm not directly familiar with however. I know that detouring is surprisingly common. Security software, virus scanners etc all use such frameworks. It's not pretty, but may work:
HANDLE g_hndPrivateHeap;
HANDLE WINAPI GetProcessHeapImpl() {
return g_hndPrivateHeap;
}
struct SDetourGetProcessHeap { // object for exception safety
SDetourGetProcessHeap() {
// put detour in place
}
~SDetourGetProcessHeap() {
// remove detour again
}
};
void MapFile() {
g_hndPrivateHeap = HeapCreate( ... );
{
SDetourGetProcessHeap d;
MapViewOfFile(...);
}
}
These may also help:
How to replace WinAPI functions calls in the MS VC++ project with my own implementation (name and parameters set are the same)?
How can I hook Windows functions in C/C++?
http://research.microsoft.com/pubs/68568/huntusenixnt99.pdf
Imagine if I came to you with a piece of code like this:
void *foo;
foo = malloc(n);
if (foo)
free(foo);
foo = malloc(n);
Then I came to you and said, help! foo does not have the same address on the second allocation!
I'd be crazy, right?
It seems to me like you've already demonstrated clear knowledge of why this doesn't work. There's a reason that the documention for any API that takes an explicit address to map into lets you know that the address is just a suggestion, and it can't be guaranteed. This also goes for mmap() on POSIX.
I would suggest you write the program in such a way that a change in address doesn't matter. That is, don't store too many pointers to quantities inside the buffer, or if you do, patch them up after reallocation. Similar to the way you'd treat a buffer that you were going to pass into realloc().
Even the documentation for MapViewOfFileEx() explicitly suggests this:
While it is possible to specify an address that is safe now (not used by the operating system), there is no guarantee that the address will remain safe over time. Therefore, it is better to let the operating system choose the address. In this case, you would not store pointers in the memory mapped file, you would store offsets from the base of the file mapping so that the mapping can be used at any address.
Update from your comments
In that case, I suppose you could:
Not map into contiguous blocks. Perhaps you could map in chunks and write some intermediate function to decide which to read from/write to?
Try porting to 64 bit.
As the earlier post suggests, you can suspend every thread in the process while you change the memory mappings. You can use SuspendThread()/ResumeThread() for that. This has the disadvantage that your code has to know about all the other threads and hold thread handles for them.
An alternative is to use the Windows debug API to suspend all threads. If a process has a debugger attached, then every time the process faults, Windows will suspend all of the process's threads until the debugger handles the fault and resumes the process.
Also see this question which is very similar, but phrased differently:
Replacing memory mappings atomically on Windows

Resources