aarch64 KVM guest hangs on early Linux boot - linux-kernel

I'm trying to set up a very minimal aarch64 KVM-capable system.
My design requires a minimalist kernel with few drivers linked to the kernel image.
My objective is to bring up a virtual machine running a bare-metal application as quickly as possible.
The same hypervisor is required later to be able to run a full-fledged Linux distribution.
What happens is that when this aarch64 hypervisor starts a Linux VM with qemu -M virt,accel=kvm, the VM executes the bootloader and the kernel's EFI stub, but then hangs in the kernel's arch-specific initialization.
To be more precise, running qemu and peeking into the hung system, I found the PC to be consistently around this position:
U-Boot 2021.01 (Aug 20 2021 - 10:50:56 +0200)
DRAM: 128 MiB
Flash: 128 MiB
In: pl011@9000000
Out: pl011@9000000
Err: pl011@9000000
Net: No ethernet found.
Hit any key to stop autoboot: 0
Timer summary in microseconds (7 records):
Mark Elapsed Stage
0 0 reset
815,990,026 815,990,026 board_init_f
816,418,047 428,021 board_init_r
818,760,551 2,342,504 id=64
818,763,069 2,518 main_loop
Accumulated time:
10,017 dm_r
53,553 dm_f
10189312 bytes read in 111 ms (87.5 MiB/s)
Scanning disk virtio-blk#31...
Found 1 disks
Missing RNG device for EFI_RNG_PROTOCOL
No EFI system partition
Booting /\boot\Image
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
QEMU 5.2.0 monitor - type 'help' for more information
(qemu) info registers
PC=ffffffc010010a34 X00=1de7ec7edbadc0de X01=ffffffc01084ca38
X02=ffffffc01084ca38 X03=1de7ec7edbadc0de X04=ffffffc0108f0120
X05=0000000000000348 X06=ffffffc010679000 X07=0000000000000000
X08=ffffffc010a079b8 X09=ffffffc01008ef80 X10=ffffffc0108ee398
X11=ffffffc010000000 X12=ffffffc01099a3c8 X13=0000000300000101
X14=0000000000000000 X15=0000000046df02b8 X16=0000000047f6d968
X17=0000000000000000 X18=0000000000000000 X19=ffffffc0109006c0
X20=1de7ec3eec328b16 X21=ffffffc0108f02b0 X22=ffffffc01073fd40
X23=00000000200001c5 X24=0000000046df0368 X25=0000000000000001
X26=0000000000000000 X27=0000000000000000 X28=ffffffc0109006c0
X29=ffffffc0108f0090 X30=ffffffc01065abf4 SP=ffffffc01084c2b0
PSTATE=400003c5 -Z-- EL1h FPCR=00000000 FPSR=00000000
Q00=0000000000000000:0000000000000000 Q01=0000000000000000:0000000000000000
Q02=0000000000000000:0000000000000000 Q03=0000000000000000:0000000000000000
Q04=0000000000000000:0000000000000000 Q05=0000000000000000:0000000000000000
Q06=0000000000000000:0000000000000000 Q07=0000000000000000:0000000000000000
Q08=0000000000000000:0000000000000000 Q09=0000000000000000:0000000000000000
Q10=0000000000000000:0000000000000000 Q11=0000000000000000:0000000000000000
Q12=0000000000000000:0000000000000000 Q13=0000000000000000:0000000000000000
Q14=0000000000000000:0000000000000000 Q15=0000000000000000:0000000000000000
Q16=0000000000000000:0000000000000000 Q17=0000000000000000:0000000000000000
Q18=0000000000000000:0000000000000000 Q19=0000000000000000:0000000000000000
Q20=0000000000000000:0000000000000000 Q21=0000000000000000:0000000000000000
Q22=0000000000000000:0000000000000000 Q23=0000000000000000:0000000000000000
Q24=0000000000000000:0000000000000000 Q25=0000000000000000:0000000000000000
Q26=0000000000000000:0000000000000000 Q27=0000000000000000:0000000000000000
Q28=0000000000000000:0000000000000000 Q29=0000000000000000:0000000000000000
Q30=0000000000000000:0000000000000000 Q31=0000000000000000:0000000000000000
(qemu) x/10i 0xffffffc010010a34
0xffffffc010010a34: 8b2063ff add sp, sp, x0
0xffffffc010010a38: d53bd040 mrs x0, (unknown)
0xffffffc010010a3c: cb2063e0 sub x0, sp, x0
0xffffffc010010a40: f274cc1f tst x0, #0xfffffffffffff000
0xffffffc010010a44: 54002ca1 b.ne #+0x594 (addr -0x3feffef028)
0xffffffc010010a48: cb2063ff sub sp, sp, x0
0xffffffc010010a4c: d53bd060 mrs x0, (unknown)
0xffffffc010010a50: 140003fc b #+0xff0 (addr -0x3feffee5c0)
0xffffffc010010a54: d503201f nop
0xffffffc010010a58: d503201f nop
https://elixir.bootlin.com/linux/v5.12.10/source/arch/arm64/kernel/entry.S#L113
The guest kernel is the upstream 5.12, with no modifications in the arch initialization area.
There are just a few changes in start_kernel(), where a few printk() calls are added to provide timestamps, but the VM never reaches them.
Also, the same hypervisor can run the guest with the same qemu if I simply remove the KVM acceleration.
I have little experience with KVM and even less with KVM on aarch64.
I feel like I am missing something in my hypervisor kernel.
I removed many drivers from the kernel to get the quickest boot time possible, so no USB or network drivers, but I made sure the kernel has the full set of virtualization-related configuration options enabled.
As the system starts, /dev/kvm is there, and qemu seems to swallow my command line, including the KVM option, without complaining.
But then the guest hangs.
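(As an aside, a minimal user-space check that /dev/kvm is functional, independent of qemu; this is an illustrative sketch, not a tool from my actual setup.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

int main(void)
{
        int kvm = open("/dev/kvm", O_RDWR);
        if (kvm < 0) {
                perror("open /dev/kvm");
                return 1;
        }
        /* KVM_GET_API_VERSION is expected to return 12 on any working KVM */
        printf("KVM API version: %d\n", ioctl(kvm, KVM_GET_API_VERSION, 0));
        /* Creating an empty VM exercises the arm64 KVM init path (the EL2 code) */
        int vm = ioctl(kvm, KVM_CREATE_VM, 0);
        printf("KVM_CREATE_VM: %s\n", vm >= 0 ? "ok" : "failed");
        if (vm >= 0)
                close(vm);
        close(kvm);
        return 0;
}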
I got these results on two different platforms, just to be sure it is not hardware dependent:
Pine64 with ATF and U-Boot (I'm aware the Allwinner BSP leaves the processor at EL1, so I implemented the boot chain using ATF)
Raspberry Pi 3
Currently, I have no clue where to investigate next. Any suggestion would be greatly appreciated.

At last, I managed to make KVM work in the environment described above.
To summarize the issue:
Minimal Linux host with KVM support
Minimal Linux guest system with its bootloader and kernel image to run in the virtualized context
Guest Linux kernel slightly modified in start_kernel() by adding a few printk() calls to emit timestamps
The image is proven to work on qemu without the KVM extension
Testing with qemu and the KVM extension gives the following results.
qemu is started as:
qemu-system-aarch64 -machine virt -cpu host -enable-kvm -smp 1 -nographic -bios ./u-boot.bin -drive file=./rootfs.ext2,if=none,format=raw,id=hd0 -device virtio-blk-device,drive=hd0
The bootloader gets executed, control is handed to the Linux kernel, the Linux kernel's EFI stub emits its messages, then the message flow stops and everything seems to hang.
Peeking into the execution at random times, I verified the code to be consistently inside arch/arm64/kernel/entry.S.
A quick inspection of this code led me to the wrong conclusion that it was part of a loop controlled by some (to me) exotic aarch64-specific control register.
I posted my question on StackOverflow.
On a second look, I realized that no loop was involved and that the phenomenon resulted from an exception chain.
Apparently, my modification to the kernel, where I added a printk() as early as possible, placed this code before the kernel had correctly set up its stack.
This modification was the root of the problem.
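For reference, a minimal sketch of where such a marker is safe to place, assuming an arm64 kernel around v5.12; boot_stamp() is an illustrative name, not my original patch:

#include <linux/init.h>
#include <linux/printk.h>

/*
 * Illustrative sketch, not the original patch. A timestamp marker like
 * this is safe once start_kernel() is running, because the arm64 entry
 * code (__primary_switched) has already switched to a valid kernel
 * stack by then; the same printk() placed in the pre-stack entry path
 * is what produced the exception chain shown above.
 */
static void __init boot_stamp(const char *what)         /* hypothetical name */
{
        pr_notice("boot-stamp: %s\n", what);            /* goes to the log buffer only this early */
}

/* call site: the first lines of start_kernel() in init/main.c, e.g.
 *      boot_stamp("start_kernel entered");
 */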
A few things I learned from this issue:
Peeking into the code at random times and finding it around the same instruction does not imply the code is in a loop
Having the code apparently working in one environment does not imply the code is correct
Emulation and virtualization aim to give the same results, but it is very unlikely they will ever cover 100% of the cases. Treat them as different if you don't want to get stuck with false assumptions.
For more than half of the tests, I used the Pine64 board. The Pine64 I own is based on the Allwinner A64 SoC. There is not a lot of documentation on this board and its main SoC, and what is available is old. I had concerns about its boot sequence, as there is a document online stating that, because a proprietary component of the boot chain releases the processor at EL1, KVM cannot work properly. This triggered me to dig into the Pine64 boot process.
What I learned about the Pine64 board and KVM:
The document refers to the early BSP provided by Allwinner. The component in question is NOT the boot ROM. Newer boot chains based on ATF do NOT suffer from this problem. Refer to the log below for details.
The Allwinner A64 SoC is a Cortex-A53 and, like any other ARMv8 part, has no design issue running KVM or any other virtualization tool.
On aarch64, the kernel can start at EL2 or EL1, but if you need virtualization with KVM, it has to start at EL2, because this is the exception level where hypervisors (KVM) do their job.
U-Boot starts the kernel at EL2; there is a specific configuration option to make it start the kernel at EL1, which is NOT the default.
The starting kernel installs the KVM hypervisor code and drops the CPU to EL1 before reaching the start_kernel() function; the "EL = n" lines in the log below come from debug prints, and a small sketch of how they can be produced follows the log.
U-Boot SPL 2021.01 (Aug 31 2021 - 10:14:46 +0200)
DRAM: 512 MiB
Trying to boot from MMC1
U-Boot 2021.01 (Aug 31 2021 - 10:13:59 +0200) Allwinner Technology
EL = 2
CPU: Allwinner A64 (SUN50I)
Model: Pine64
DRAM: 512 MiB
MMC: mmc@1c0f000: 0
Loading Environment from FAT... *** Warning - bad CRC, using default environment
In: serial
Out: serial
Err: serial
Net: phy interface6
eth0: ethernet@1c30000
starting USB...
Bus usb@1c1a000: USB EHCI 1.00
Bus usb@1c1a400: USB OHCI 1.0
Bus usb@1c1b000: USB EHCI 1.00
Bus usb@1c1b400: USB OHCI 1.0
scanning bus usb@1c1a000 for devices... 1 USB Device(s) found
scanning bus usb@1c1a400 for devices... 1 USB Device(s) found
scanning bus usb@1c1b000 for devices... 1 USB Device(s) found
scanning bus usb@1c1b400 for devices... 1 USB Device(s) found
scanning usb for storage devices... 0 Storage Device(s) found
Hit any key to stop autoboot: 0
switch to partitions #0, OK
mmc0 is current device
Scanning mmc 0:1...
Found U-Boot script /boot.scr
270 bytes read in 2 ms (131.8 KiB/s)
## Executing script at 4fc00000
32086528 bytes read in 1541 ms (19.9 MiB/s)
27817 bytes read in 4 ms (6.6 MiB/s)
Moving Image from 0x40080000 to 0x40200000, end=42120000
## Flattened Device Tree blob at 4fa00000
Booting using the fdt blob at 0x4fa00000
EHCI failed to shut down host controller.
Loading Device Tree to 0000000049ff6000, end 0000000049fffca8 ... OK
EL = 2
Starting kernel ...
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[ 0.000000] Linux version 5.12.19 (alessandro@x1) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2021.05.1) 9.4.0, GNU ld (GNU Binutils) 2.35.2) #2 SMP PREEMPT Mon Aug 30 20:06:45 CEST 2021
[ 0.000000] EL = 1
[ 0.000000] Machine model: Pine64
[ 0.000000] efi: UEFI not found.
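As noted above, the "EL = n" lines come from small debug prints added to U-Boot and to the kernel. For the kernel side, this is a hedged sketch of how the current exception level can be read on arm64; report_current_el() is an illustrative name, and CurrentEL keeps the level in bits [3:2]:

#include <linux/init.h>
#include <linux/printk.h>
#include <asm/sysreg.h>         /* read_sysreg() */

/* Illustrative sketch of the "EL = n" print above. Note that by the time
 * C code runs, a kernel entered at EL2 without VHE (as on the Cortex-A53)
 * has already dropped to EL1, so this prints 1 even when KVM is usable. */
static void __init report_current_el(void)
{
        unsigned long el = read_sysreg(CurrentEL) >> 2; /* level in bits [3:2] */

        pr_info("EL = %lu\n", el);
}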

Related

PCIe DMA problems in ARM Machines

I'm trying to write a PCIe driver for an ARM machine (Cavium ThunderX2). I'm working with Xilinx Alveo FPGAs. Our work involves migrating pages between heterogeneous nodes (x86 and ARM) and the driver takes care of the DMA between the host and the FPGA, and handles the device interrupts.
The DMA doesn't work (From Device/To Device) and I get "ARM SMMU v3.x 0x10 event occurred" errors. I tried disabling the SMMU (recommended by some threads in the NVIDIA community - https://forums.developer.nvidia.com/t/how-dma-works-in-arm-the-dma-stopped-working-with-our-pci-driver/53699), but that leads to a protection issue ("RAS Controller stopped"), and the system hangs.
I use dma_map_single APIs from dma-mapping.h to convert the virtual address to a DMA-capable bus address. Would dma_alloc_coherent make a difference? (Of course, I'll try this out)
I'm unable to figure out the problem. Is this a PCIe driver issue, an issue with the device, or is there a fix/patch available for ARM PCIe DMA ops? Any help would be appreciated!
Thanks,
Narayan
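For context on the two APIs mentioned above, a hedged sketch of the difference between a streaming mapping (dma_map_single()) and a coherent buffer (dma_alloc_coherent()); dev, buf, and len are placeholders, not names from the actual driver:

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>

/* Streaming DMA: map around each transfer; the API performs the cache
 * maintenance implied by the direction at map/unmap time. */
static int send_buffer_to_device(struct device *dev, void *buf, size_t len)
{
        dma_addr_t handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

        if (dma_mapping_error(dev, handle))
                return -ENOMEM;

        /* ... program the device with 'handle' and wait for completion ... */

        dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
        return 0;
}

/* Coherent DMA: one buffer mapped for the device's lifetime, visible to
 * CPU and device without per-transfer maintenance. */
static void *alloc_shared_buffer(struct device *dev, size_t len, dma_addr_t *handle)
{
        return dma_alloc_coherent(dev, len, handle, GFP_KERNEL);
}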

How to exclude GIC distributor initialization during Linux kernel booting

I am new to the Linux kernel, and I am now working with an AMP system on a 4-core Cortex-A53, where 2 cores run an RTOS and the other 2 run Linux.
The issue is that the RTOS and Linux share some hardware resources, like the GIC controller and the SMMU (IOMMU). I would like to initialize them (write to their registers) from the RTOS before booting the Linux kernel. Linux should only use these shared resources and skip their initialization (e.g. irqchip_init()) during kernel boot.
Is it possible? Can I do it through the kernel build configuration? If so, how should I describe them in the device tree?
I know this is a rare implementation case. Does anyone have experience with it?

NVMe PCIe Hard Disk on Freescale LS2080A not recognised

I have a Freescale LS2080 box for which I am developing a custom linux 4.1.8 kernel using the Freescale Yocto project.
I have an NVMe hard disk attached to the LS2080 via a PCIe card, but the disk is not recognised when I boot up the board with my custom linux kernel.
I plugged the same combination of NVMe disk and PCIe card into a linux 3.16.7 desktop PC and it was detected and mounted without problem.
When building the LS2080 kernel using the Yocto project, I have enabled the NVMe block device driver and I have verified that this module is present in the kernel when booting on the board.
The PCIe slot on the board is working fine because I have tried it with a PCIe Ethernet card and a PCIe SATA disk.
I suspect that I am missing something in the kernel configuration or device tree, but I'm not sure what. When I add the NVMe driver to the kernel using menuconfig, the NVMe driver's dependencies should be resolved automatically.
Can anyone provide insight into what I am missing?
First, make sure the PCIe device is recognized using lspci.
If the device does not show up in the lspci list, this is an enumeration problem; to track down the error you may need a PCIe analyzer.
If the device is shown in the list, then simply add the device's vendor ID and device ID to the NVMe driver and recompile, so the driver is loaded for your device.
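A hedged sketch of that last step, assuming the in-tree NVMe driver's PCI ID table; the 0xABCD/0x1234 values are placeholders for the IDs reported by lspci -nn:

#include <linux/module.h>
#include <linux/pci.h>

/* Sketch only: add an explicit vendor/device match alongside the driver's
 * existing class-based match, then rebuild the driver. */
static const struct pci_device_id nvme_id_table[] = {
        { PCI_DEVICE(0xABCD, 0x1234) },         /* placeholder vendor:device from lspci -nn */
        { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
        { 0, }
};
MODULE_DEVICE_TABLE(pci, nvme_id_table);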

Linux boot process

I'm playing a little bit with the Linux kernel and I got some errors during the boot process: Kernel panic - not syncing: Attempted to kill init!
I want to understand how the boot process of the Linux kernel works in general, especially during and after the start_kernel() function and the loading of the rootfs.
Thank you guys.
Let's take the example of porting Linux to a BeagleBone booting through MMC, so you get the idea of the boot process. It works like this:
First, when we power on the board, the boot ROM code (hard-coded in the ROM of the board) executes, initializes the CPU, and disables the MMU.
After executing, the boot ROM code loads and jumps to the MLO (an X-loader with a header; it is board specific).
The MLO executes and loads U-Boot, which is board specific; all peripherals are initialized here.
Now U-Boot executes and looks for bootcmd, which says where the kernel and rootfs are located (in MMC); this loads and calls the kernel.
The kernel is extracted and then mounts the initramfs (root file system).
The user cannot interact with the hardware through the kernel alone, so the rootfs gives the user an interface to the kernel to run applications.

Where is guest ring-3 code run in VM environment?

According to the white paper that VMware has published, binary translation technology is only used for kernel (ring 0) code; ring 3 code is "directly executed" on the CPU hardware.
As I observed, no matter how many processes run in the guest OS, there is always only one process in the host OS, so I assume all the guest ring 3 code runs in a single host process context (for VMware, it's vmware-vmx.exe).
So my question here is: how do you execute so much ring 3 code natively in a single process? Considering that most Windows exe files don't contain relocation information, they cannot be executed at a different address, and binary translation is not used for ring 3 code.
Thanks.
Let's talk about VMX, which is Intel VT-x's design.
Intel VT-x introduces two new modes to solve this problem: VMX root mode and VMX non-root mode, for the host and the guest respectively. Both modes have rings 0~3, which means the host and guest do not share the same ring levels.
A hypervisor runs in ring 0 of VMX root mode; when it decides to transfer CPU control to a guest, it issues the VMLAUNCH instruction, which switches from VMX root mode to VMX non-root mode. Guest ring 3 code is then executed natively in VMX non-root mode. All of this is supported by Intel VT-x; no binary translation or instruction emulation is needed to run the guest.
Of course, VMX non-root mode has less privilege and power. For example, when guest code encounters something it cannot handle natively, such as a physical device access, the CPU automatically detects this kind of restriction and transfers control back to the hypervisor in VMX root mode. After the hypervisor finishes the task, it enters the guest again (with VMRESUME, or VMLAUNCH for the first entry).

Resources