I am having trouble understanding which (if any) system calls cause a VM Exit to VMX root-mode under Intel VMX. I am specifically interested in network-related system calls (e.g. socket, accept, send, recv) as they require a "virtual" device. I understand the hypervisor will have to be invoked to actually open a socket, but could this be done in parallel (assuming a multi-core processor)?
Any clarification would be greatly appreciated.
According to the Intel 64 and IA-32 Architectures Software Developer's Manual (Volume 3, Chapter 22) none of int 0x80, sysenter and syscall, the three main instructions used under Linux to execute a system call, can cause VM exits per se. So in general there isn't a clear-cut way to tell which syscalls cause a VM exit and which ones don't.
VM exits can occur in a lot of scenarios, for example the host can configure an exception bitmap to decide which exceptions cause a VM exit, including page faults, so in theory almost any piece of code doing memory operations (kernel or user) could cause a VM exit.
Excluding such an extreme case and talking specifically about networking, as Peter Cordes suggests in the above comment, what you should be concerned about are operations that [may] send and receive data, since those will eventually require communication with the hardware (NIC). A short sketch after the following list illustrates the distinction:
Syscalls like socket, socketpair, {get,set}sockopt, bind, shutdown (etc.) should not cause VM exits since they do not require communication with the underlying hardware and they merely manipulate logical kernel data structures.
read, recv and write can cause VM exits unless the kernel already has data available to read or is waiting to accumulate enough data to write (e.g. as per Nagle's algorithm) before sending. Whether or not the kernel actually stops to read from HW or directly sends to HW depends on socket options, syscall flags and current state of the underlying socket/connection.
sendto, recvfrom, sendmsg, recvmsg (etc.), select, poll, epoll (etc.) on network sockets can all cause VM exits, again depending on the specific situation, pretty much the same reasoning as the previous point.
connect should not need to VM exit for datagram sockets (SOCK_DGRAM) as it merely sets a default address, but definitely can for connection-based protocols (e.g. SOCK_STREAM) as the kernel needs to send and receive packets to establish a connection.
accept also needs to send/receive data and therefore can cause VM exits.
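To make this concrete, here is a minimal sketch in C (error handling omitted; the address and port are placeholders) annotating where VM exits may plausibly occur:

#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(80);                       /* placeholder port */
        inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr); /* placeholder address */

        /* Manipulates kernel data structures only: no VM exit expected. */
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        /* TCP handshake: packets must reach the (emulated) NIC, so VM exits
         * are likely here. */
        connect(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* May cause VM exits when the kernel actually talks to the NIC; may
         * not, e.g. while data is still being buffered (Nagle's algorithm). */
        send(fd, "ping", 4, 0);

        /* May cause VM exits, or return right away if data already arrived. */
        char buf[128];
        recv(fd, buf, sizeof(buf), 0);

        close(fd);
        return 0;
}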
I understand the hypervisor will have to be invoked to actually open a socket, but could this be done in parallel (assuming a multi-core processor)?
"In parallel" is not the term that I would use, but network I/O operations can be handled by the OS asynchronously, e.g. packets are not necessarily received or sent exactly when requested through a syscall, but when needed. For example, one or more VM exits needed to receive data could have already been performed before the guest userspace program issues the relative syscall.
Is it always necessary for a VM Exit to occur (when one is needed) to send a packet on the NIC, if on a multi-core system there are available cores that could allow the VMM and a guest to run concurrently? I guess what I'm asking is whether increased parallelism could prevent VM Exits simply by allowing the hypervisor to run in parallel with a guest.
When a VM exit occurs, the guest CPU is stopped and cannot resume execution until the VMM issues a VMRESUME for it (see Intel SDM Vol 3, Chapter 23.1, "Virtual Machine Control Structures Overview"). It is not possible to "prevent" a VM exit from occurring; however, on a multi-processor system the VMM could theoretically run on multiple cores and delegate the handling of a VM exit to another VMM thread while resuming the stopped VM early.
So while increased parallelism cannot prevent VM exits, it could theoretically reduce their overhead. However, do note that this can only happen for VM exits that can be handled "lazily" while resuming the guest. As an example, if the guest page-faults and VM-exits, the VMM cannot really "delegate" the handling of the VM exit and resume the guest earlier, since the guest needs the page fault to be resolved before resuming execution.
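To give an idea of what "resuming" looks like from the VMM's side, here is a minimal sketch of a vCPU loop using the Linux KVM API rather than raw VMX (vcpu_fd and the mmap'd struct kvm_run are assumed to come from the usual KVM setup, which is omitted here). The guest vCPU stays stopped from the moment KVM_RUN returns until the loop re-enters it:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vcpu_fd: from KVM_CREATE_VCPU; run: the mmap'd struct kvm_run. */
void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                /* Enters the guest; returns only on the next VM exit. */
                if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
                        perror("KVM_RUN");
                        return;
                }
                switch (run->exit_reason) {
                case KVM_EXIT_IO:
                        /* Emulate the port I/O access here; the vCPU stays
                         * stopped until we loop back to KVM_RUN. */
                        break;
                case KVM_EXIT_HLT:
                        return;
                default:
                        fprintf(stderr, "unhandled exit %d\n", run->exit_reason);
                        return;
                }
        }
}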
All in all, whenever the guest kernel needs to communicate with hardware, that is a potential cause of VM exits. Access to emulated hardware for I/O operations requires the hypervisor to step in and therefore causes VM exits. There are however possible optimizations to consider:
Hardware passthrough can be used on systems which support IOMMU to make devices directly available to the guest OS and achieve very low overhead in HW communication with no need for VM exits. See Intel VT-d, Intel VT-c, SR-IOV, and also "PCI passthrough via OVMF" on ArchWiki.
Virtio is a standard for paravirtualization of network (NICs) and block devices (disks) which aims at reducing I/O overhead (i.e. overall number of needed VM exits), but needs support from both guest and host. The guest is "aware" of being a guest in this case. See also: Virtio for Linux/KVM.
Further reading:
x86 virtualization - Wikipedia
Virtual device passthrough for high speed VM networking - S. Garzarella, G. Lettieri, L. Rizzo
virtio: Towards a De-Facto Standard For Virtual I/O Devices - Rusty Russell
I am trying to troubleshoot a Windows 10 Pro system. This is complicated by the fact that the system event log is flooded with a set of identical information-level messages about the TPM every minute.
The messages are:
The TPM has been cleared. Reason: ByPPI.
The Ownership of the Trusted Platform Module (TPM) hardware on this computer was
successfully taken (TPM TakeOwnership command) by the system.
The TBS device identifier has been generated.
The TPM was successfully provisioned and is now ready for use.
I found a TPM-Maintenance task in the task scheduler that seems to run every minute.
Can someone explain why Windows is clearing the TPM once a minute? Is that necessary? Can I safely disable the TPM-Maintenance task to prevent this flood of messages?
Thanks!
I want to lock my development board's bootloader (which is U-Boot) so that no one can stop the autoboot process at all. (This way I can be sure that no one can alter the firmware of the board!)
So please guide me on locking down the firmware and blocking any "root" attempts. I am using an Amlogic S905X SoC.
Another question:
Is this process related to verified boot and signing kernel images?
If your SoC supports secure boot, then that is the way to secure its firmware against jailbreaking; consult the datasheet. Otherwise you cannot 'lock' the firmware.
You can use CONFIG_AUTOBOOT_KEYED (together with CONFIG_AUTOBOOT_STOP_STR) to require a password for stopping autoboot; see the sketch below.
You can use CONFIG_BOOTDELAY=-2 to autoboot with no delay and not check for abort.
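As a minimal sketch, the two options above could look like this in a board defconfig (the stop string is a placeholder; CONFIG_AUTOBOOT_ENCRYPTION with a hashed stop string is a hardened variant worth checking in the U-Boot docs):

# Variant 1: require a (placeholder) password to interrupt autoboot
CONFIG_AUTOBOOT=y
CONFIG_AUTOBOOT_KEYED=y
CONFIG_AUTOBOOT_STOP_STR="s3cr3t"
# Keep only a short window during which the stop string is accepted
CONFIG_BOOTDELAY=1
# Variant 2: no delay and no abort check at all
# CONFIG_BOOTDELAY=-2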
This will not stop users with physical access from altering the firmware. They could simply insert their own SD card or USB stick, or reprogram the eMMC.
CONFIG_AUTOBOOT_KEYED will make your life easier if you need to reflash a returned device.
I ran into trouble with bad network performance on CentOS. The issue was observed on the latest OpenVZ RHEL7 kernel (3.10 based) on a Dell server with 24 cores and a Broadcom 5720 NIC, no matter whether on the host system or in an OpenVZ container. The server receives RTMP connections and re-proxies RTMP streams to other consumers. Reads and writes were unstable and streams periodically froze for a few seconds.
I started to check the system with strace and perf. strace affects the system heavily, and it seems that only perf can help. I used an OpenVZ debug kernel with debugfs enabled. The system spends too much time in the swapper process (according to perf data). I built a flame graph for the system under load (100 Mbit/s of data in, 200 Mbit/s out) and noticed that the kernel spent too much time in tcp_write_xmit and tcp_ack. On top of these calls I see save_stack calls.
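For reference, flame graphs of this kind are typically produced with something like the following (assuming Brendan Gregg's FlameGraph scripts; the sampling rate and duration are arbitrary):

# Sample all CPUs with call graphs for 30 seconds, then render the SVG
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg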
On the other hand, I tested the same scenario on an Amazon EC2 instance (latest Amazon Linux AMI 2017.09) and perf doesn't show such issues. The total number of samples was 300000; the system spends 82% of its time in swapper according to the perf samples, but net_rx_action (and consequently tcp_write_xmit and tcp_ack) within swapper accounts for only 1797 samples (0.59% of the total). On top of the net_rx_action call in the flame graph I don't see any calls related to saving stack traces.
The output of the OpenVZ system looks different. Among 1833152 samples, 500892 (27%) were in the swapper process, and 194289 samples (10.5%) were in net_rx_action.
The full SVG of the calls on vzkernel7 is here, and the SVG of the EC2 instance's calls is here. You can download them and open them in a browser to interactively inspect the flame graphs.
So I want to ask for help, and I have a few questions.
Why doesn't the flame graph from the EC2 instance contain as many save_stack calls as my server's?
Does perf force the system to call save_stack, or is it some kernel setting? Can it be disabled, and how?
Does Xen on the EC2 guest process all the tcp_ack and other such calls? Is it possible that the host system on the EC2 server does some of the work and the guest system doesn't see it?
Thank you for your help.
I've read the kernel sources and have an answer to my questions.
The save_stack calls are caused by the Kernel Address Sanitizer feature, which was enabled in the OpenVZ debug kernel by the CONFIG_KASAN option. When this option is enabled, on each kmem_cache_free call the kernel calls __cache_free:
static inline void __cache_free(struct kmem_cache *cachep, void *objp,
                                unsigned long caller)
{
        /* Put the object into the quarantine, don't touch it for now. */
        if (kasan_slab_free(cachep, objp))
                return;
        ___cache_free(cachep, objp, caller);
}
With CONFIG_KASAN disabled, kasan_slab_free returns false (check include/linux/kasan.h). The OpenVZ debug kernel was built with CONFIG_KASAN=y; the Amazon AMI kernel wasn't.
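For reference, the disabled-KASAN stub (paraphrased from include/linux/kasan.h of that era; newer kernels changed the signature) simply declines to take the object, so the regular free path runs with no stack saving:

#ifndef CONFIG_KASAN
static inline bool kasan_slab_free(struct kmem_cache *s, void *object)
{
        /* No quarantine, no save_stack: the free proceeds normally. */
        return false;
}
#endif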
My Xen DomUs keep drifting their clocks. My Dom0 is using kernel 3.2.0 for AMD64. The DomUs are using 2.6.26. How do you keep the time from drifting?
I guess your DomUs are drifting with respect to your Dom0's time. In that case, you can do one of the following:
Configure NTP (Network Time Protocol) on your DomUs. It is a fairly simple process: edit the /etc/ntp.conf file (assuming it is a Linux DomU) to include Dom0 as an NTP server and then start the NTP daemon ("service ntpd start"). This way the DomUs can sync their time with Dom0; see the sketch at the end of this answer.
Configure Dom0 and all DomUs so that they sync their time with an external server. Change the NTP configuration file on all of them and then restart NTP on each. This way all of them will be in sync with the same external source, and hence time drift should not happen.
For an immediate time sync, you can use the "ntpdate" command on the DomU.
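For the first option, a minimal /etc/ntp.conf sketch on the DomU could look like this (192.168.0.1 is a placeholder for your Dom0's address):

# Sync against Dom0 (placeholder address)
server 192.168.0.1 iburst
driftfile /var/lib/ntp/drift

And for a one-shot sync before starting the daemon: "ntpdate 192.168.0.1".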