I'm trying to set up a very minimal aarch64 KVM-capable system.
My design requires a minimalist kernel with only a few drivers linked into the kernel image.
My objective is to bring up a virtual machine running a bare-metal application as quickly as possible.
Later, the same hypervisor must also be able to run a full-fledged Linux distribution.
What happens is that when this aarch64 hypervisor starts a Linux VM with qemu -M virt,accel=kvm, the VM executes the bootloader and the kernel's EFI stub, but then hangs in the kernel's arch-specific initialization.
To be more precise, by running qemu and peeping into the hung system, I found the PC to often be around this position:
U-Boot 2021.01 (Aug 20 2021 - 10:50:56 +0200)
DRAM: 128 MiB
Flash: 128 MiB
In: pl011#9000000
Out: pl011#9000000
Err: pl011#9000000
Net: No ethernet found.
Hit any key to stop autoboot: 0
Timer summary in microseconds (7 records):
Mark Elapsed Stage
0 0 reset
815,990,026 815,990,026 board_init_f
816,418,047 428,021 board_init_r
818,760,551 2,342,504 id=64
818,763,069 2,518 main_loop
Accumulated time:
10,017 dm_r
53,553 dm_f
10189312 bytes read in 111 ms (87.5 MiB/s)
Scanning disk virtio-blk#31...
Found 1 disks
Missing RNG device for EFI_RNG_PROTOCOL
No EFI system partition
Booting /\boot\Image
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
QEMU 5.2.0 monitor - type 'help' for more information
(qemu) info registers
PC=ffffffc010010a34 X00=1de7ec7edbadc0de X01=ffffffc01084ca38
X02=ffffffc01084ca38 X03=1de7ec7edbadc0de X04=ffffffc0108f0120
X05=0000000000000348 X06=ffffffc010679000 X07=0000000000000000
X08=ffffffc010a079b8 X09=ffffffc01008ef80 X10=ffffffc0108ee398
X11=ffffffc010000000 X12=ffffffc01099a3c8 X13=0000000300000101
X14=0000000000000000 X15=0000000046df02b8 X16=0000000047f6d968
X17=0000000000000000 X18=0000000000000000 X19=ffffffc0109006c0
X20=1de7ec3eec328b16 X21=ffffffc0108f02b0 X22=ffffffc01073fd40
X23=00000000200001c5 X24=0000000046df0368 X25=0000000000000001
X26=0000000000000000 X27=0000000000000000 X28=ffffffc0109006c0
X29=ffffffc0108f0090 X30=ffffffc01065abf4 SP=ffffffc01084c2b0
PSTATE=400003c5 -Z-- EL1h FPCR=00000000 FPSR=00000000
Q00=0000000000000000:0000000000000000 Q01=0000000000000000:0000000000000000
Q02=0000000000000000:0000000000000000 Q03=0000000000000000:0000000000000000
Q04=0000000000000000:0000000000000000 Q05=0000000000000000:0000000000000000
Q06=0000000000000000:0000000000000000 Q07=0000000000000000:0000000000000000
Q08=0000000000000000:0000000000000000 Q09=0000000000000000:0000000000000000
Q10=0000000000000000:0000000000000000 Q11=0000000000000000:0000000000000000
Q12=0000000000000000:0000000000000000 Q13=0000000000000000:0000000000000000
Q14=0000000000000000:0000000000000000 Q15=0000000000000000:0000000000000000
Q16=0000000000000000:0000000000000000 Q17=0000000000000000:0000000000000000
Q18=0000000000000000:0000000000000000 Q19=0000000000000000:0000000000000000
Q20=0000000000000000:0000000000000000 Q21=0000000000000000:0000000000000000
Q22=0000000000000000:0000000000000000 Q23=0000000000000000:0000000000000000
Q24=0000000000000000:0000000000000000 Q25=0000000000000000:0000000000000000
Q26=0000000000000000:0000000000000000 Q27=0000000000000000:0000000000000000
Q28=0000000000000000:0000000000000000 Q29=0000000000000000:0000000000000000
Q30=0000000000000000:0000000000000000 Q31=0000000000000000:0000000000000000
(qemu) x/10i 0xffffffc010010a34
0xffffffc010010a34: 8b2063ff add sp, sp, x0
0xffffffc010010a38: d53bd040 mrs x0, (unknown)
0xffffffc010010a3c: cb2063e0 sub x0, sp, x0
0xffffffc010010a40: f274cc1f tst x0, #0xfffffffffffff000
0xffffffc010010a44: 54002ca1 b.ne #+0x594 (addr -0x3feffef028)
0xffffffc010010a48: cb2063ff sub sp, sp, x0
0xffffffc010010a4c: d53bd060 mrs x0, (unknown)
0xffffffc010010a50: 140003fc b #+0xff0 (addr -0x3feffee5c0)
0xffffffc010010a54: d503201f nop
0xffffffc010010a58: d503201f nop
https://elixir.bootlin.com/linux/v5.12.10/source/arch/arm64/kernel/entry.S#L113
The guest kernel is the upstream 5.12, with no modifications in the arch initialization area.
There are just a few changes in start_kernel(), where a few printk() calls were added to provide timestamps, but the VM never reaches them.
Also, the same hypervisor can run the guest with the same qemu if I simply remove the KVM acceleration.
I have little experience with KVM, and even less with KVM on aarch64.
I feel like I am missing something in my hypervisor kernel.
I removed many drivers from the kernel to get the quickest boot time possible, so no USB or network drivers, but I made sure the kernel has the full virtualization configuration options enabled.
When the system starts, /dev/kvm is there, and qemu seems to swallow my command line, including the KVM option, without complaining.
But then the guest hangs.
I got these results on two different platforms, just to be sure it is not hardware dependent:
Pine64 with ATF and U-Boot (I'm aware the Allwinner BSP leaves the processor at EL1, so I implemented the boot chain using ATF)
Raspberry Pi 3
Currently, I have no clue where to investigate next. Any suggestion would be greatly appreciated.
In the end, I managed to make the KVM host run in the environment described above.
To summarize the issue:
Minimal Linux with KVM support
Minimal Linux system with its bootloader and kernel image to run in the virtualized context
Linux kernel slightly modified in start_kernel() by adding a few printk() calls to emit timestamps
The image is proven to work on qemu without the KVM extension
Testing with qemu and the KVM extension gives the following results:
qemu is started as:
qemu-system-aarch64 -machine virt -cpu host -enable-kvm -smp 1 -nographic -bios ./u-boot.bin -drive file=./rootfs.ext2,if=none,format=raw,id=hd0 -device virtio-blk-device,drive=hd0
The bootloader executes, control is handed to the Linux kernel, the kernel's EFI stub emits its messages, then the message flow stops and everything seems to hang.
Peeping into the execution at random times, I found the code to be consistently inside arch/arm64/kernel/entry.S.
A quick inspection of this code led me to the wrong conclusion that it was part of a loop controlled by some (to me) exotic aarch64-specific control register.
I posted my question on Stack Overflow.
On a second look, I realized that no loop was involved and that the phenomenon resulted from a chain of exceptions.
Apparently, my modification to the kernel, where I added a printk() as early as possible, introduced this code before the kernel had correctly set up its stack.
This modification was the root of the problem.
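For what it's worth, a less invasive way to take such early timestamps (my own sketch, not the fix I actually applied) is to read the ARMv8 generic timer directly instead of calling printk() before the kernel is ready for it:

/* Sketch only: read the architected counter instead of printk()'ing too early.
 * CNTVCT_EL0 ticks at the frequency reported by CNTFRQ_EL0. */
static inline unsigned long long early_timestamp_ticks(void)
{
        unsigned long long cnt;

        asm volatile("isb\n\tmrs %0, cntvct_el0" : "=r" (cnt));
        return cnt;
}

static inline unsigned long long counter_freq_hz(void)
{
        unsigned long long frq;

        asm volatile("mrs %0, cntfrq_el0" : "=r" (frq));
        return frq;
}

The raw ticks can be stored and then converted to microseconds and printed later, once the console is up.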
A few things I learned from this issue:
Peeping into the code at random times and finding it around the same instruction does not imply the code is in a loop
Having the code apparently working in one environment does not imply the code is correct
Emulation and virtualization aim to give the same results, but it is very unlikely they will ever match in 100% of the cases. Treat them as different if you don't want to be stuck with false assumptions.
For more than half of the tests, I used the Pine64 board. The Pine64 I own is based on the Allwinner A64 SoC. There is not a lot of documentation on this board and its main SoC, and the documentation that is available is old. I had concerns about its boot sequence, as there is a document online stating that, because a proprietary component of the boot chain releases the processor at EL1, KVM cannot work properly. This fact triggered me to dig into the Pine64 boot process.
What I learned on the Pine64 board and KVM:
The document refers to the early BSP provided by Allwinner. The offending component is NOT the boot ROM, and newer boot chains based on ATF do NOT suffer from this problem. Refer to the code box below for details.
The Allwinner A64 SoC is a Cortex-A53 and, like any other ARMv8 implementation, has no design issue running KVM or any other virtualization tool.
On aarch64, the kernel can start at EL2 or EL1, but if you need virtualization with KVM, it must start at EL2, because that is the exception level where hypervisors (KVM) do their job.
U-Boot starts the kernel at EL2; there is a specific configuration option to make it start the kernel at EL1, which is NOT the default.
The starting kernel installs its hypervisor stub (later used by KVM) and drops the CPU to EL1 before reaching the start_kernel() function. A minimal CurrentEL check is sketched after the boot log below.
U-Boot SPL 2021.01 (Aug 31 2021 - 10:14:46 +0200)
DRAM: 512 MiB
Trying to boot from MMC1
U-Boot 2021.01 (Aug 31 2021 - 10:13:59 +0200) Allwinner Technology
EL = 2
CPU: Allwinner A64 (SUN50I)
Model: Pine64
DRAM: 512 MiB
MMC: mmc#1c0f000: 0
Loading Environment from FAT... *** Warning - bad CRC, using default environment
In: serial
Out: serial
Err: serial
Net: phy interface6
eth0: ethernet#1c30000
starting USB...
Bus usb#1c1a000: USB EHCI 1.00
Bus usb#1c1a400: USB OHCI 1.0
Bus usb#1c1b000: USB EHCI 1.00
Bus usb#1c1b400: USB OHCI 1.0
scanning bus usb#1c1a000 for devices... 1 USB Device(s) found
scanning bus usb#1c1a400 for devices... 1 USB Device(s) found
scanning bus usb#1c1b000 for devices... 1 USB Device(s) found
scanning bus usb#1c1b400 for devices... 1 USB Device(s) found
scanning usb for storage devices... 0 Storage Device(s) found
Hit any key to stop autoboot: 0
switch to partitions #0, OK
mmc0 is current device
Scanning mmc 0:1...
Found U-Boot script /boot.scr
270 bytes read in 2 ms (131.8 KiB/s)
## Executing script at 4fc00000
32086528 bytes read in 1541 ms (19.9 MiB/s)
27817 bytes read in 4 ms (6.6 MiB/s)
Moving Image from 0x40080000 to 0x40200000, end=42120000
## Flattened Device Tree blob at 4fa00000
Booting using the fdt blob at 0x4fa00000
EHCI failed to shut down host controller.
Loading Device Tree to 0000000049ff6000, end 0000000049fffca8 ... OK
EL = 2
Starting kernel ...
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[ 0.000000] Linux version 5.12.19 (alessandro#x1) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.05.1) 9.4.0, GNU ld (GNU Binutils) 2.35.2) #2 SMP PREEMPT Mon Aug 30 20:06:45 CEST 2021
[ 0.000000] EL = 1
[ 0.000000] Machine model: Pine64
[ 0.000000] efi: UEFI not found.
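For reference, the "EL = 2" / "EL = 1" lines in the logs above come from reading the CurrentEL system register; a minimal sketch of such a check (my illustration, not the exact patch behind those prints) looks like this:

/* Sketch only: CurrentEL holds the current exception level in bits [3:2]. */
static inline unsigned int current_el(void)
{
        unsigned long el;

        asm volatile("mrs %0, CurrentEL" : "=r" (el));
        return (unsigned int)((el >> 2) & 0x3);
}

/* e.g. printf("EL = %u\n", current_el()); in U-Boot, or printk() once the
 * kernel console is up */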
I have a memory dump of a Linux system.
I am trying to find all the executable pages in memory belonging to the kernel. How can I do that using Volatility/Rekall?
Is it correct that the operating system will use my hard drive space if the active programs use up all of the RAM? And will that lead to a performance issue (all programs working more slowly, because reading from disk is slower than reading from RAM)?
Is it correct that the operating system will use my hard drive space if the active programs use up all of the RAM?
No. The operating system uses one or more paging files to support virtual memory. In a virtual memory system, all process memory is mapped to secondary storage; the OS uses hard drive space even when there is free RAM available.
And will that lead to a performance issue (all programs working more slowly, because reading from disk is slower than reading from RAM)?
If a page of memory has been paged out and has to be retrieved from disk (a major page fault), it is a slow process.
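To make the difference visible, here is a minimal sketch (my own illustration, assuming Linux and a file large enough not to already be in the page cache) that counts minor faults (served from RAM) versus major faults (served from disk) while touching a file-backed mapping:

/* Sketch only: touch every page of a file-backed mapping and report how many
 * page faults were served from RAM (minor) versus from disk (major). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    volatile long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];                      /* touching each page may fault */

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("minor faults (RAM):  %ld\n", ru.ru_minflt);
    printf("major faults (disk): %ld\n", ru.ru_majflt);
    return 0;
}

Run it twice on the same large file: the first run (cold cache) shows major faults and is noticeably slower; the second run (file already cached) shows mostly minor faults.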
I'm new to KVM. Can someone explain the process when a guest handles an external interrupt or an emulated device interrupt?
Thanks
Amos
On the x86 architecture (Intel in this case), most interrupts cause a CPU VM exit, which means control of the CPU returns from the guest to the host.
So the process is as follows (a rough user-space sketch of this loop is shown after the list):
1. The CPU is used by the guest OS in VMX non-root mode.
2. The CPU becomes aware of an incoming interrupt.
3. Control of the CPU returns to the host running in VMX root mode (a VM exit).
4. The host (KVM) handles the interrupt.
5. The host executes VMLAUNCH/VMRESUME to put the CPU back into VMX non-root mode and run guest code again.
6. Repeat from step 1.
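To make the cycle concrete, here is a rough sketch of the user-space side of that loop, as any /dev/kvm client (QEMU included) drives it; VM/vCPU creation, guest memory setup, and error handling are all omitted, and the exit handling is simplified:

/* Sketch only: the user-space run loop around the KVM_RUN ioctl. */
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

void run_loop(int kvm_fd, int vcpu_fd)
{
    /* kvm_run is the shared structure KVM fills in on every exit to user space */
    int sz = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu_fd, 0);

    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);     /* enter the guest (VMLAUNCH/VMRESUME) */

        switch (run->exit_reason) {     /* why did KVM_RUN come back to us? */
        case KVM_EXIT_INTR:
            /* KVM_RUN was interrupted (e.g. a pending signal); the hardware
             * interrupt itself was already handled inside the host kernel,
             * so there is nothing to emulate here */
            break;
        case KVM_EXIT_IO:
            /* the guest touched an emulated I/O port: decode run->io and
             * emulate the device access, then re-enter the guest */
            break;
        case KVM_EXIT_MMIO:
            /* the guest touched emulated MMIO memory: decode run->mmio and
             * emulate the device access, then re-enter the guest */
            break;
        default:
            printf("unhandled exit reason %u\n", run->exit_reason);
            return;
        }
    }
}

Note that steps 2-4 above (the external-interrupt exit) are usually handled entirely inside the kernel module; KVM_RUN only returns to user space when device emulation or other user-space attention is needed.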
If you are new to KVM, you should first read a few papers about how the KVM module works (I assume you know the basic idea of virtualization) and how it uses QEMU to do I/O emulation, etc.
I recommend you read these papers:
kvm: the Linux Virtual Machine Monitor: https://www.kernel.org/doc/mirror/ols2007v1.pdf#page=225
Kernel-based Virtual Machine Technology: http://www.fujitsu.com/downloads/MAG/vol47-3/paper18.pdf
KVM: Kernel-based Virtualization Driver: http://www.linuxinsight.com/files/kvm_whitepaper.pdf
These are papers written by the people who started KVM (they are short and sweet :) ).
After this, you should start looking at the KVM documentation in the kernel source tree, especially the file api.txt; it's very good.
Then I think you can jump into the source code to understand how things actually work.
Cheers
I ran the following benchmark in qemu and qemu-kvm, with this configuration:
CPU: AMD 4400 dual-core processor with SVM enabled, 2 GB RAM
Host OS: openSUSE 11.3 with the latest patches, running KDE 4
Guest OS: FreeDos
Emulated Memory: 256M
Network: Nil
Language: Turbo C 2.0
Benchmark Program: count from 0000000 to 9999999, displaying the counter on the screen by directly accessing the screen memory (i.e. 0xB800:xxxx); a rough reconstruction follows below.
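This is my reconstruction of the counting loop (not the original source), in Turbo C:

/* Sketch only: count 0000000..9999999 and write the digits directly into the
 * text-mode video buffer at segment 0xB800 (Turbo C 2.0, real-mode DOS). */
#include <dos.h>

int main(void)
{
    char far *screen = (char far *) MK_FP(0xB800, 0);
    long n, v;
    int i;

    for (n = 0L; n <= 9999999L; n++) {
        v = n;
        for (i = 6; i >= 0; i--) {                        /* seven digits */
            screen[i * 2]     = (char) ('0' + (v % 10));  /* character byte */
            screen[i * 2 + 1] = 0x07;                     /* attribute byte */
            v /= 10;
        }
    }
    return 0;
}

Every one of those video-memory writes is what makes the two setups behave so differently, as the answers below point out.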
It only takes 6 sec when running in qemu.
But it takes 89 sec when running in qemu-kvm.
I ran the benchmarks one at a time, not in parallel.
I scratched my head the whole night but still have no idea why this happens. Could somebody give me some hints?
KVM uses QEMU as its device emulator; any device operation is emulated by the user-space QEMU program. When you write to 0xB8000, the graphics display is accessed, which involves the guest doing a CPU `vmexit' from guest mode and returning to the KVM module, which in turn forwards the device emulation request to the user-space QEMU backend.
In contrast, QEMU without KVM does all the work in a single process, so apart from the usual system calls there are fewer CPU context switches. Meanwhile, your benchmark code is a simple loop that only requires code-block translation once. That costs nothing compared to the vmexit and kernel-user communication on every iteration in the KVM case.
This should be the most probable cause.
Your benchmark is an I/O-intensive benchmark, and the I/O devices are actually the same for qemu and qemu-kvm. In qemu's source code this can be seen in hw/*.
This explains why qemu-kvm cannot be much faster than qemu here. However, I have no definitive answer for the slowdown; I have the following explanation, which I think is correct to a large extent.
"qemu-kvm uses the kvm kernel module in the Linux kernel. This runs the guest in x86 guest mode, which causes a trap on every privileged instruction. In contrast, qemu uses a very efficient TCG, which translates the instructions it sees only the first time. I think the high cost of the traps is showing up in your benchmark." This isn't true for all I/O devices, though. An Apache benchmark would run better on qemu-kvm, because the library does the buffering and uses the smallest number of privileged instructions to do the I/O.
The reason is that too many VMEXITs take place.