How does the Linux kernel enter supervisor mode in x86? - linux-kernel

I'm trying to probe the event when the mode switch happens (user -> kernel mode), so I need to find which function is triggered when the transition happens.
It seems that SBI is the place where this transition is handled on RISC-V. I'm wondering where the code that handles this is for x86?

It's not that simple. In x86, there are 4 different privilege levels: 0 (operating system kernel), 1, 2, and 3 (applications). Privilege levels 1 and 2 aren't used in Linux: the kernel runs at privilege level 0 while user space code runs at privilege level 3. The current privilege level (CPL) is stored in bits 0 and 1 of the CS (code segment) register.
There are multiple ways in which the transition from user to kernel can happen:
Through exceptions and hardware interrupts: page faults, general protection faults, device interrupts, the hardware timer, and so on.
Through software interrupts: the int instruction raises a software interrupt. The most common in Linux is int 0x80, which is configured to be used for system calls from user space to kernel space.
Through specialized instructions like sysenter and syscall.
In any case, there is no actual code that does the transition: it is done by the processor itself, which switches from one privilege level to the other, and sets up segment selectors, instruction pointer, stack pointer and more according to the information that was set up by the kernel right after booting.
In the case of interrupts, the entries of the Interrupt Descriptor Table (IDT) are used. See this useful documentation page about interrupts in Linux which explains more about the IDT. If you want to get into the details, check out Chapter 5 of the Intel 64 and IA-32 architectures software developer's manual, Volume 3.
In short, each IDT entry specifies a descriptor privilege level (DPL) along with a new code segment and offset. In the case of software interrupts, the processor performs privilege checks (one of which is CPL <= DPL) to determine whether the code that issued the interrupt has the privilege to do so. Then the interrupt handler is executed, which implicitly loads a new CS register with the privilege level bits set to 0. This is how the canonical int 0x80 system call on 32-bit x86 works.
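For reference, a 64-bit IDT gate descriptor can be sketched as a plain C struct (field layout per the Intel SDM, Volume 3; this is an illustrative definition, not the kernel's own):

#include <stdint.h>

/* 64-bit IDT gate descriptor: the kernel fills one of these in for
 * each vector at boot; the CPU consults it on every interrupt. */
struct idt_gate {
    uint16_t offset_low;   /* handler address, bits 0..15 */
    uint16_t selector;     /* kernel code segment selector for the handler */
    uint8_t  ist;          /* bits 0..2: Interrupt Stack Table index */
    uint8_t  type_dpl_p;   /* gate type, DPL (bits 5..6), present (bit 7) */
    uint16_t offset_mid;   /* handler address, bits 16..31 */
    uint32_t offset_high;  /* handler address, bits 32..63 */
    uint32_t reserved;
};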
In case of specialized instructions like sysenter and syscall, the details differ, but the concept is similar: the CPU checks privileges and then retrieves the information from dedicated Model Specific Registers (MSR) that were previously set up by the kernel after boot.
For system calls the result is always the same: user code switches to privilege level 0 and starts executing kernel code, ending up right at the beginning of one of the different syscall entry points defined by the kernel.
Possible syscall entry points are:
entry_INT80_32 for 32-bit int 0x80
entry_INT80_compat for 32-bit int 0x80 on a 64-bit kernel
entry_SYSENTER_32 for 32-bit sysenter
entry_SYSENTER_compat for 32-bit sysenter on a 64-bit kernel
entry_SYSCALL_64 for 64-bit syscall
entry_SYSCALL_compat for 32-bit syscall on a 64-bit kernel (a special entry point which is not normally used by user code; in theory syscall is also a valid 32-bit instruction on AMD CPUs, but because of its weird semantics Linux primarily uses it for 64-bit)
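To watch one of these transitions from the user side, here's a minimal sketch (Linux x86-64, GCC inline assembly; the register constraints follow the 64-bit syscall ABI) that enters the kernel through the syscall instruction and thus lands in entry_SYSCALL_64. Swapping in int $0x80 (with the 32-bit syscall numbers) would exercise entry_INT80_compat instead:

/* Invoke write(2) directly via the syscall instruction. On a 64-bit
 * kernel this enters at entry_SYSCALL_64. */
static long raw_write(int fd, const void *buf, unsigned long len)
{
    long ret;
    asm volatile("syscall"
                 : "=a"(ret)                             /* return value in rax */
                 : "a"(1UL), "D"(fd), "S"(buf), "d"(len) /* __NR_write == 1 */
                 : "rcx", "r11", "memory");              /* syscall clobbers rcx/r11 */
    return ret;
}

int main(void)
{
    raw_write(1, "hello from ring 3\n", 18);
    return 0;
}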

Related

Directory Table Base should be divisible by 4k, but my Windows DTB / 4k leaves remainder 2

I did some experiments with memory analysis and ran into a problem.
I know that the Directory Table Base is normally divisible by 4k (4096).
But my process on Windows 10 (1909) has a DTB of 0x14695e002.
That is not divisible by 4k; 2 remains.
Why does my Windows have that value?
The dirBase / Directory Table Base is the value of the CR3 register for the current process. As you may know, CR3 is the base register which (indirectly) points to the base of the PML4 (or PDPT) table and is used when switching between processes, which essentially switches their entire virtual address space.
As you may have seen in the Intel manual, the 4 lower bits of CR3 should be ignored by the CPU (see the figure "Format of the CR3 Register with 4-Level Paging").
Now look closely at the Intel Manual (Chapter 4.5, 4-Level Paging):
A logical processor uses 4-level paging if CR0.PG = 1, CR4.PAE = 1, and IA32_EFER.LME = 1
Respectively: Paging; Physical Address Extension; Long Mode Enable.
Use of CR3 with 4-level paging depends on whether process context identifiers (PCIDs) have been enabled by setting CR4.PCIDE.
CR4.PCIDE
CR4.PCIDE is documented in the Intel Manual (Chapter 2.5 Control Registers):
PCID-Enable Bit (bit 17 of CR4) — Enables process-context identifiers (PCIDs) when set. See Section 4.10.1, “Process-Context Identifiers (PCIDs)”. Can be set only in IA-32e mode (if IA32_EFER.LMA = 1).
So when CR4.PCIDE is set, the 12 lower bits (0:11) of CR3 are used as the PCID, that is, a "Process-Context Identifier" (bits 12 to M-1, where M is the processor's maximum physical-address width, MAXPHYADDR, are used for the physical address of the base of the PML4 table).
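As a quick sanity check, the DTB from the question splits cleanly along those bit ranges (a minimal sketch; the masks simply implement the layout described above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t dtb = 0x14695e002ULL;         /* the DTB from the question */
    uint64_t pml4_base = dtb & ~0xfffULL;  /* bits 12..M-1: PML4 table base */
    uint64_t pcid = dtb & 0xfffULL;        /* bits 0..11: the PCID */
    printf("PML4 base = %#llx, PCID = %llu\n",
           (unsigned long long)pml4_base, (unsigned long long)pcid);
    /* prints: PML4 base = 0x14695e000, PCID = 2 */
    return 0;
}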
PCIDs
PCIDs are documented in the Intel Manual (Chapter 4.10.1; Process-Context Identifiers (PCIDs)):
Process-context identifiers (PCIDs) are a facility by which a logical processor may cache information for multiple linear-address spaces. The processor may retain cached information when software switches to a different linear address space with a different PCID.
And a little bit further in the same chapter:
When a logical processor creates entries in the TLBs [...] and paging-structure caches [...], it associates those entries with the current PCID.
So basically PCIDs (as far as I understand them) are a way to selectively control how the TLB and paging structure caches are preserved or flushed when a context switch happens.
Some of the instructions that operate on cacheability control (such as CLFLUSH, CLFLUSHOPT, CLWB, INVD, WBINVD, INVLPG, INVPCID, and memory instructions with a non-temporal hint) check the PCID, either flushing everything that concerns a particular PCID, or flushing only part of a cache (such as the TLB) while keeping everything related to a given PCID.
For example, the INVLPG instruction:
The INVLPG instruction normally flushes TLB entries only for the specified page; however, in some cases, it may flush more entries, even the entire TLB. The instruction invalidates TLB entries associated with the current PCID and may or may not do so for TLB entries associated with other PCIDs.
The INVPCID instruction specifically uses PCIDs:
Invalidates mappings in the translation lookaside buffers (TLBs) and paging-structure caches based on process-context identifier (PCID)
Why it is always 2 on Windows (as far as I can see, it's 2 for every process on the system), I don't know.

Does the CS and DS registers still affect instructions in x64 intel

AFAIK they are never used and CS=DS=SS nowadays. However, if I were to set these values, would anything change, or does the processor ignore them? I've found really conflicting information on this question, and I don't understand why the registers would still be there if they are ignored. Help pls
Yes, the segment registers do still affect code execution.
The question and some of the comments don't seem to distinguish between the selector value and the base address. To clearly understand some of the apparently conflicting information you're reading on this topic, you need to make sure you recognize which one is being discussed.
The CS selector cannot be 0. It must refer to a valid code segment descriptor in the GDT or LDT. The L bit of the code segment descriptor controls whether the current process runs in 64-bit mode or in 32-bit compatibility mode.
CS (the selector) cannot be equal to DS and SS. CS must refer to a code segment, whereas DS and SS must refer to data segments (possibly the same one). The DS and SS selectors are allowed to be 0 (which would cause a GP fault in 32-bit mode).
The aspects of the segment registers that no longer have an effect are the base address and the segment limit: the base addresses of CS, DS, ES, and SS are all treated as 0, and there are no segment-limit checks in 64-bit code.
This is the reason you see people saying that they are ignored.
As Margaret mentioned, the current privilege level (CPL) is in the low 2 bits of the CS and SS selector registers and also in the DPL bits of the descriptors in the GDT. These bits should be either 0 or 3, since no current operating systems use rings 1 and 2, as far as I know.
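For instance, a user-space program can read its own selector values and derive the CPL from the low 2 bits (a small sketch using GCC inline assembly on x86-64):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t cs, ss;
    asm("mov %%cs, %0" : "=r"(cs));  /* read the CS selector */
    asm("mov %%ss, %0" : "=r"(ss));  /* read the SS selector */
    /* in user space this prints CPL=3 */
    printf("CS=%#x SS=%#x CPL=%u\n", cs, ss, cs & 3);
    return 0;
}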
One other minor point is that certain faults caused by memory accesses are reported as stack faults instead of GP faults, if the memory access is performed using the SS segment (because RBP or RSP is used as a base register in the instruction operand).

Is x86 32-bit assembly code valid x86 64-bit assembly code?

Is all x86 32-bit assembly code valid x86 64-bit assembly code?
I've wondered whether 32-bit assembly code is a subset of 64-bit assembly code, i.e., every 32-bit assembly code can be run in a 64-bit environment?
I guess the answer is yes, because 64-bit Windows is capable of executing 32-bit programs, but then I've also seen that 64-bit processors have a 32-bit compatibility mode?
If not, please provide a small example of 32-bit assembly code that isn't valid 64-bit assembly code and explain how the 64-bit processor executes the 32-bit assembly code.
A modern x86 CPU has three main operation modes (this description is simplified):
In real mode, the CPU executes 16 bit code with paging and segmentation disabled. Memory addresses in your code refer to physical addresses: the content of the segment register is shifted and added to the offset to form the effective address.
In protected mode, the CPU executes 16 bit or 32 bit code depending on the segment selector in the CS (code segment) register. Segmentation is enabled, and paging can be (and usually is) enabled. Programs can switch between 16 bit and 32 bit code by far-jumping to an appropriate segment. The CPU can enter the sub-mode virtual 8086 mode to emulate real mode for individual processes from inside a protected mode operating system.
In long mode, the CPU executes 64 bit code. Segmentation is mostly disabled, paging is enabled. The CPU can enter the sub-mode compatibility mode to execute 16 bit and 32 bit protected mode code from within an operating system written for long mode. Compatibility mode is entered by far-jumping to a CS selector with the appropriate bits set. Virtual 8086 mode is unavailable.
Wikipedia has a nice table of x86-64 operating modes including legacy and real modes, and all 3 sub-modes of long mode. Under a mainstream x86-64 OS, after booting the CPU cores will always all be in long mode, switching between different sub-modes depending on 32 or 64-bit user-space. (Not counting System Management Mode interrupts...)
Now what is the difference between 16 bit, 32 bit, and 64 bit mode?
16-bit and 32-bit mode are basically the same thing except for the following differences:
In 16 bit mode, the default address and operand width is 16 bit. You can change these to 32 bit for a single instruction using the 0x67 and 0x66 prefixes, respectively. In 32 bit mode, it's the other way round.
In 16 bit mode, the instruction pointer is truncated to 16 bit, jumping to addresses higher than 65536 can lead to weird results.
VEX/EVEX encoded instructions (including those of the AVX, AVX2, BMI, BMI2 and AVX512 instruction sets) aren't decoded in real or Virtual 8086 mode (though they are available in 16 bit protected mode).
16 bit mode has fewer addressing modes than 32 bit mode, though it is possible to override to a 32 bit addressing mode on a per-instruction basis if the need arises.
Now, 64 bit mode is somewhat different. Most instructions behave just like in 32 bit mode, with the following differences:
There are eight additional registers named r8, r9, ..., r15. Each register can be used as a byte, word, dword, or qword register. The family of REX prefixes (0x40 to 0x4f) encode whether an operand refers to an old or new register. Eight additional SSE/AVX registers xmm8, xmm9, ..., xmm15 are also available.
you can only push/pop 64 bit and 16 bit quantities (though you shouldn't do the latter), 32 bit quantities cannot be pushed/popped.
The single-byte inc reg and dec reg instructions are unavailable, their instruction space has been repurposed for the REX prefixes. Two-byte inc r/m and dec r/m is still available, so inc reg and dec reg can still be encoded.
A new instruction-pointer relative addressing mode exists, using the shorter of the 2 redundant ways 32-bit mode had to encode a [disp32] absolute address.
The default address width is 64 bit, a 32 bit address width can be selected through the 0x67 prefix. 16 bit addressing is unavailable.
The default operand width is 32 bit. A width of 16 bit can be selected through the 0x66 prefix, a 64 bit width can be selected through an appropriate REX prefix independently of which registers you use.
It is not possible to use ah, bh, ch, and dh in an instruction that requires a REX prefix. A REX prefix causes those register numbers to refer instead to the low 8 bits of sp, bp, si, and di (spl, bpl, sil, dil).
writing to the low 32 bits of a 64 bit register clears the upper 32 bits, avoiding false dependencies for out-of-order exec. (Writing 8 or 16-bit partial registers still merges with the 64-bit old value.)
as segmentation is nonfunctional, segment overrides are meaningless no-ops except for the fs and gs overrides (0x64, 0x65) which serve to support thread-local storage (TLS).
also, many instructions that specifically deal with segmentation are unavailable. These are: push/pop seg (except push/pop fs/gs), arpl, call far (only the 0xff encoding is valid), les, lds, jmp far (only the 0xff encoding is valid),
instructions that deal with decimal arithmetic are unavailable, these are: daa, das, aaa, aas, aam, aad,
additionally, the following instructions are unavailable: bound (rarely used), pusha/popa (not useful with the additional registers), salc (undocumented),
the 0x82 instruction alias for 0x80 is invalid.
on early amd64 CPUs, lahf and sahf are unavailable.
And that's basically all of it!
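As one concrete illustration of the zero-extension rule from the list above, here is a sketch using GCC inline assembly (the %k0 operand modifier selects the 32-bit name of the chosen register):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t r;
    asm("movq $-1, %0\n\t"  /* r = 0xffffffffffffffff */
        "movl $1, %k0"      /* writing the low 32 bits clears the upper 32 */
        : "=r"(r));
    printf("%#llx\n", (unsigned long long)r);  /* prints 0x1 */
    return 0;
}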
No, it isn't.
While there is a large amount of overlap, 64-bit assembly code is not a superset of 32-bit assembly code and so 32-bit assembly is not in general valid in 64-bit mode.
This applies both to the mnemonic assembly source (which is assembled into binary format by an assembler) and to the binary machine code format itself.
This question covers in some detail instructions that were removed, but there are also many encoding forms whose meanings were changed.
For example, Jester in the comments gives the example of push eax not being valid in 64-bit code. Based on this reference you can see that the 32-bit push is marked N.E. meaning not encodable. In 64-bit mode, the encoding is used to represent push rax (an 8-byte push) instead. So the same sequence of bytes has a different meaning in 32-bit mode versus 64-bit mode.
In general, you can browse the list of instructions on that site and find many which are listed as invalid or not encodable in 64-bit.
If not, please provide a small example of 32-bit assembly code that
isn't valid 64-bit assembly code and explain how the 64-bit processor
executes the 32-bit assembly code.
As above, push eax is one such example. I think what is missing is that 64-bit CPUs support directly running 32-bit binaries. They don't do it via compatibility between 32-bit and 64-bit instructions at the machine language level, but simply by having a 32-bit mode where the decoders (in particular) interpret the instruction stream as 32-bit x86 rather than x86-64, as well as the so-called long mode for running 64-bit instructions. When such 64-bit chips were first released, it was common to run a 32-bit operating system, which pretty much means the chip is permanently in this mode (never goes into 64-bit mode).
More recently, it is typical to run a 64-bit operating system, which is aware of the modes, and which will put the CPU into 32-bit mode when the user launches a 32-bit process (which are still very common: until very recently my browser was still 32-bit).
All the details and proper terminology for the modes can be found in fuz's answer, which is really the one you should read.

How to enable alignment exceptions for my process on x64?

I'm curious to see if my 64-bit application suffers from alignment faults.
From Windows Data Alignment on IPF, x86, and x64 archive:
In Windows, an application program that generates an alignment fault will raise an exception, EXCEPTION_DATATYPE_MISALIGNMENT.
On the x64 architecture, the alignment exceptions are disabled by default, and the fix-ups are done by the hardware. The application can enable alignment exceptions by setting a couple of register bits, in which case the exceptions will be raised unless the user has the operating system mask the exceptions with SEM_NOALIGNMENTFAULTEXCEPT. (For details, see the AMD Architecture Programmer's Manual Volume 2: System Programming.)
[Ed. emphasis mine]
On the x86 architecture, the operating system does not make the alignment fault visible to the application. On these two platforms, you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
On the Itanium, by default, the operating system (OS) will make this exception visible to the application, and a termination handler might be useful in these cases. If you do not set up a handler, then your program will hang or crash. In Listing 3, we provide an example that shows how to catch the EXCEPTION_DATATYPE_MISALIGNMENT exception.
Ignoring the direction to consult the AMD Architecture Programmer's Manual, I will instead consult the Intel 64 and IA-32 Architectures Software Developer's Manual:
5.10.5 Checking Alignment
When the CPL is 3, alignment of memory references can be checked by setting the AM flag in the CR0 register and the AC flag in the EFLAGS register. Unaligned memory references generate alignment exceptions (#AC). The processor does not generate alignment exceptions when operating at privilege level 0, 1, or 2. See Table 6-7 for a description of the alignment requirements when alignment checking is enabled.
Excellent. I'm not sure what that means, but excellent.
Then there's also:
2.5 CONTROL REGISTERS
Control registers (CR0, CR1, CR2, CR3, and CR4; see Figure 2-6) determine operating mode of the processor and the characteristics of the currently executing task. These registers are 32 bits in all 32-bit modes and compatibility mode.
In 64-bit mode, control registers are expanded to 64 bits. The MOV CRn instructions are used to manipulate the register bits. Operand-size prefixes for these instructions are ignored.
The control registers are summarized below, and each architecturally defined control field in these control registers are described individually. In Figure 2-6, the width of the register in 64-bit mode is indicated in parenthesis (except for CR0).
CR0 — Contains system control flags that control operating mode and states of the processor
AM
Alignment Mask (bit 18 of CR0) — Enables automatic alignment checking when set; disables alignment checking when clear. Alignment checking is performed only when the AM flag is set, the AC flag in the EFLAGS register is set, CPL is 3, and the processor is operating in either protected or virtual-8086 mode.
I tried
The language I am actually using is Delphi, but pretend it's language-agnostic pseudocode:
void UnmaskAlignmentExceptions()
{
   asm
      mov rax, cr0;    //copy CR0 flags into RAX
      or rax, 0x40000; //set bit 18 (AM)
      mov cr0, rax;    //copy flags back
}
The first instruction
mov rax, cr0;
fails with a Privileged Instruction exception.
How to enable alignment exceptions for my process on x64?
PUSHF
I discovered that the x86 has the instruction:
PUSHF, POPF: Push/pop first 16-bits of EFLAGS on/off the stack
PUSHFD, POPFD: Push/pop all 32-bits of EFLAGS on/off the stack
That then led me to the x64 version:
PUSHFQ, POPFQ: Push/pop the RFLAGS quad on/off the stack
(In the 64-bit world, EFLAGS is extended to RFLAGS.)
So I wrote:
void EnableAlignmentExceptions()
{
   asm
      PUSHFQ;         //Push RFLAGS quadword onto the stack
      POP RAX;        //Pop the flags into RAX
      OR RAX, $40000; //set bit 18 (AC=Alignment Check) of the flags
      PUSH RAX;       //Push the modified flags back onto the stack
      POPFQ;          //Pop the stack back into RFLAGS
}
And it didn't crash or trigger a protection exception. I have no idea if it does what I want it to.
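One way to find out is to perform a deliberately misaligned access afterwards (a hedged sketch in C; this assumes the OS has set CR0.AM and the function above really set RFLAGS.AC — if alignment checking is in effect, the store below should raise EXCEPTION_DATATYPE_MISALIGNMENT):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    char buf[16];
    /* buf + 1 is not 4-byte aligned; volatile forces the real access */
    volatile uint32_t *p = (volatile uint32_t *)(buf + 1);
    *p = 42;            /* should fault with #AC if checking is enabled */
    printf("%u\n", *p); /* reached only if the hardware fixed up the access */
    return 0;
}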
Bonus Reading
How to catch data-alignment faults on x86 (aka SIGBUS on Sparc) (similar but distinct question: x86 not x64, Ubuntu not Windows, gcc not Delphi)
Applications running on x64 have access to a flag register (sometimes referred to as EFLAGS). Bit 18 in this register allows applications to get exceptions when alignment errors occur. So in theory, all a program has to do to enable exceptions for alignment errors is modify the flags register.
However
In order for that to actually work, the operating system kernel must set cr0's bit 18 to allow it. And the Windows operating system doesn't do that. Why not? Who knows?
Applications cannot set values in the control registers; only the kernel can do this. Device drivers run inside the kernel, so they can set it too.
It is possible to muck about and try to get this to work by creating a device driver, see:
Old New Thing - Disabling the program crash dialog archive
and the comments that follow. Note that this post is over a decade old, so some of the links are dead.
You might also find this comment (and some of the other answers in this question) to be useful:
Larry Osterman - 07-28-2004 2:22 AM
We actually built a version of NT with alignment exceptions turned on for x86 (you can do that as Skywing mentioned).
We quickly turned it off, because of the number of apps that broke :)
As an alternative to AC for finding slowdowns due to unaligned accesses, you can use hardware performance counter events on Intel CPUs for mem_inst_retired.split_loads and mem_inst_retired.split_stores to find loads/stores that split across a cache-line boundary.
perf record -c 10 -e mem_inst_retired.split_stores,mem_inst_retired.split_loads ./a.out should be useful on Linux. -c 10 records a sample every 10 HW events. If your program does a lot of unaligned accesses and you only want to find the real hotspots, leave it at the default. But -c 10 can get useful data even on a tiny binary that calls printf once. Other perf options like -g to record parent functions on each sample work as usual, and could be useful.
On Windows, use whatever tool you prefer for looking at perf counters. VTune is popular.
Modern Intel CPUs (P6 family and newer) have no penalty for misalignment within a cache line. https://agner.org/optimize/. In fact, such loads/stores are even guaranteed to be atomic (up to 8 bytes), on Intel CPUs. So AC is stricter than necessary, but it will help find potentially-risky accesses that could be page-splits or cache-line splits with differently-aligned data.
AMD CPUs may have penalties for crossing a 16-byte boundary within a 64-byte cache line. I'm not familiar with what hardware counters are available there. Beware that profiling on Intel HW won't necessarily find slowdowns that occur on AMD CPUs, if the offending access never crosses a cache line boundary.
See How can I accurately benchmark unaligned access speed on x86_64? for some details on the penalties, including my testing on 4k-split latency and throughput on Skylake.
See also http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ for possible penalties to store-forwarding efficiency for misaligned loads/stores on Intel/AMD.
Running normal binaries with AC set is not always practical. Compiler-generated code might choose to use an unaligned 8-byte load or store to copy multiple struct members, or to store some literal data.
gcc -O3 -mtune=generic (i.e. the default with optimization enabled) assumes that cache-line splits are cheap enough to be worth the risk of using unaligned accesses instead of multiple narrow accesses like the source does. Page-splits got much cheaper in Skylake, down from ~100-150 cycles in Haswell to ~10 cycles in Skylake (about the same penalty as CL splits), because apparently Intel found they were less rare than previously thought.
Many optimized library functions (like memcpy) use unaligned integer accesses. e.g. glibc's memcpy, for a 6-byte copy, would do 2 overlapping 4-byte loads from the start/end of the buffer, then 2 overlapping stores. (It doesn't have a special case for exactly 6 bytes to do a dword + word, just increasing powers of 2). This comment in the source explains its strategies.
So even if your OS would let you enable AC, you might need a special version of libraries to not trigger AC all over the place for stuff like small memcpy.
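For illustration, the overlapping-copy trick described above looks roughly like this (a sketch, not glibc's actual code; memcpy is used for the unaligned loads and stores to keep it well-defined C):

#include <stdint.h>
#include <string.h>

/* Copy exactly 6 bytes using two overlapping 4-byte loads and stores. */
static void copy6(void *dst, const void *src)
{
    uint32_t head, tail;
    memcpy(&head, src, 4);                   /* bytes 0..3 */
    memcpy(&tail, (const char *)src + 2, 4); /* bytes 2..5 */
    memcpy(dst, &head, 4);
    memcpy((char *)dst + 2, &tail, 4);
}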
SIMD
Alignment when looping sequentially over an array really matters for AVX512, where a vector is the same width as a cache line. If your pointers are misaligned, every access is a cache-line split, not just every other with AVX2. Aligned is always better, but for many algorithms with a decent amount of computation mixed with memory access, it only makes a significant difference with AVX512.
(So with AVX1/2, it's often good to just use unaligned loads, instead of always doing extra work to check alignment and go scalar until an alignment boundary. Especially if your data is usually aligned but you want the function to still work marginally slower in case it isn't.)
Scattered misaligned accesses that cross a cache-line boundary essentially have twice the cache footprint, since they touch both lines (assuming the lines aren't otherwise touched).
Checking for 16, 32 or 64 byte alignment with SIMD is simple in asm: just use [v]movdqa alignment-required loads/stores, or legacy-SSE memory source operands for instructions like paddb xmm0, [rdi], instead of vmovdqu or VEX-coded memory source operands like vpaddb xmm0, xmm1, [rdi], which let hardware handle the case of misalignment if/when it occurs.
But in C with intrinsics, some compilers (MSVC and ICC) compile alignment-required intrinsics like _mm_load_si128 into [v]movdqu, never using [v]movdqa, so that's annoying if you actually wanted to use alignment-required loads.
Of course, _mm256_load_si256 or 128 can fold into an AVX memory source operand for vpaddb ymm0, ymm1, [rdi] with any compiler including GCC/clang, same for 128-bit any time AVX and optimization are enabled. But store intrinsics that don't get optimized away entirely do get done with vmovdqa / vmovaps, so at least you can verify store alignment.
To verify load alignment with AVX, you can disable optimization so you'll get separate load / spill into __m256i temporary / reload.
This works on 64-bit Intel CPUs. It may fail on some AMD CPUs.
pushfq
bts qword ptr [rsp], 12h ; set AC bit of rflags
popfq
It will not work right away on 32-bit CPUs; these first require a kernel driver to change the AM bit of CR0, and then:
pushfd
bts dword ptr [esp], 12h
popfd

AVR ATmega64 using two 8-bit timers

I would like to use both 8-bit timers of an ATmega64 microcontroller.
I used the following code to declare their compare interrupts:
.org 0x0012 ; Timer2 8 bit counter
rjmp TIM2
.org 0x001E ; Timer0 8 bit counter
rjmp TIM1
I noticed that if I enter the first interrupt (0x0012) the second timer doesn't work... its interrupt is never generated.
Why does this happen and how do I solve it?
I also noticed something strange: if I reverse their order, I get the error:
Error 3 Overlap in .cseg: addr=0x1e conflicts with 0x1e:0x1f
On the ATmega, other interrupts are blocked during the execution of any interrupt vector.
This is a useful feature for various reasons: it prevents an interrupt from interrupting itself, prevents potential stack overflows due to recursion, allows special registers to be set aside specifically for use in low-latency interrupts without having to save them first, and ensures that each handler is atomic, among other things.
It is occasionally useful to explicitly allow reentrant interrupts, however, especially on the ATmega, which lacks interrupt priority levels. To do this, simply add an SEI instruction to set the interrupt enable flag at the start of the handler.
You must take great care to avoid the problems mentioned above when doing this, though. Generally this means that any registers used must be preserved on the stack and that the interrupt itself needs to be disabled before the re-entrant part starts.
As for your address overlap problem, I suspect the problem is that your assembler counts its program addresses in bytes whereas the interrupt vector addresses in the datasheet are specified in words (for example, the timer 2 compare interrupt would be at 0x24 instead of 0x12). You also need to take care to return to the main code segment after finishing the definition of the vectors or any subsequent code will simply run on into the other vectors.
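If you'd rather not maintain the vector table by hand, the C equivalent with avr-gcc/avr-libc places the vectors at the correct byte addresses automatically (a sketch; the vector names are what I'd expect avr-libc to define for the ATmega64's output-compare interrupts — check <avr/iom64.h> for your device):

#include <avr/interrupt.h>

ISR(TIMER2_COMP_vect)  /* Timer2 output compare */
{
    /* handle Timer2 here */
}

ISR(TIMER0_COMP_vect)  /* Timer0 output compare */
{
    /* handle Timer0 here */
}

int main(void)
{
    sei();             /* globally enable interrupts */
    for (;;) { }       /* timer modes and compare values set up elsewhere */
}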
