According to the white paper that VMware has published, binary translation technology is only used for kernel code (ring 0); ring 3 code is "directly executed" on the CPU hardware.
As I observed, no matter how many processes run in the guest OS, there is always only one process in the host OS. So I assume all the guest ring 3 code runs in a single host process context (for VMware, it's vmware-vmx.exe).
So my question is: how do you execute so much ring 3 code natively in a single process? Considering that most Windows exe files don't contain relocation information, they cannot be executed at a different base address, and binary translation is not used for ring 3 code.
Thanks.
Let's talk about VMX, which is Intel VT-x's design.
Intel VT-x introduces two new modes to solve this problem: VMX root mode and VMX non-root mode, which are for the host and the guest respectively. Both modes have rings 0-3, which means the host and guest do not have to share the same ring levels.
A hypervisor runs in ring 0 of VMX root mode. When it decides to transfer CPU control to a guest, it executes the VMLAUNCH instruction, which switches the CPU from VMX root mode to VMX non-root mode. Guest ring 3 code is then able to execute natively in VMX non-root mode. All of this is supported by Intel VT-x; no binary translation or instruction emulation is needed to run the guest.
Of course, ring 3 of VMX non-root mode has less privilege and power. For example, when guest ring 3 code attempts something it cannot handle itself, such as a physical device access, the CPU automatically detects this kind of restricted operation and transfers control back to the hypervisor in VMX root mode. After the hypervisor finishes the task, it executes VMLAUNCH again to resume the guest.
There are already several questions about how to enable virtualization on Macs (e.g. How to enable support of CPU virtualization on Macbook Pro?). It is often reported that sysctl -a | grep 'machdep.cpu.feature.*VMX' should match, but with a caveat: matching means that virtualization is supported by the cpu, not that it is enabled.
Is there a way to check that virtualization is enabled? I'm ready to compile and run a small program if that's what it takes to be able to answer, but I'd rather not.
There are three things which basically tell you whether Intel VMX is supported and enabled on a machine. This is not OS specific, but specific to Intel CPUs.
CPUID.1 will tell you in ECX.BIT[5] == 1 whether the CPU supports VMX.
IA32_FEATURE_CONTROL MSR BIT.2 == 1 will tell you if VMX is enabled in normal mode. If BIT.2 is 0 and BIT.0 is 1 in this MSR, this means VMX is disabled and locked in BIOS. You would need to reboot and enable that in BIOS.
Control Register CR4.BIT.13[VMXE] == 1 will tell you that VMX is enabled now on the machine. CPU will GPF if CR4.VMXE bit is cleared and you try to execute a VMXON instruction to enter VMX root mode.
You can write a small program to do this and check what you are missing.
What is the difference between KVM and Linux Containers (LXC)? To me it seems that LXC is also a way of creating multiple VMs within the same kernel, if we use both the "namespaces" and "control groups" features of the kernel.
Text from https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/7-Beta/html/Resource_Management_and_Linux_Containers_Guide/sec-Linux_Containers_Compared_to_KVM_Virtualization.html Copyright © 2014 Red Hat, Inc.:
Linux Containers Compared to KVM Virtualization
The main difference between the KVM virtualization and Linux
Containers is that virtual machines require a separate kernel instance
to run on, while containers can be deployed from the host operating
system. This significantly reduces the complexity of container
creation and maintenance. Also, the reduced overhead lets you create a
large number of containers with faster startup and shutdown speeds.
Both Linux Containers and KVM virtualization have certain advantages
and drawbacks that influence the use cases in which these technologies
are typically applied:
KVM virtualization
KVM virtualization lets you boot full operating systems of different
kinds, even non-Linux systems. However, a complex setup is sometimes
needed. Virtual machines are resource-intensive so you can run only a
limited number of them on your host machine.
Running separate kernel instances generally means better separation
and security. If one of the kernels terminates unexpectedly, it does
not disable the whole system. On the other hand, this isolation makes
it harder for virtual machines to communicate with the rest of the
system, and therefore several interpretation mechanisms must be used.
Guest virtual machine is isolated from host changes, which lets you
run different versions of the same application on the host and virtual
machine. KVM also provides many useful features such as live
migration. For more information on these capabilities, see Red Hat
Enterprise Linux 7 Virtualization Deployment and Administration Guide.
Linux Containers:
The current version of Linux Containers is designed primarily to
support isolation of one or more applications, with plans to implement
full OS containers in the near future. You can create or destroy
containers very easily and they are convenient to maintain.
System-wide changes are visible in each container. For example, if you
upgrade an application on the host machine, this change will apply to
all sandboxes that run instances of this application.
Since containers are lightweight, a large number of them can run
simultaneously on a host machine. The theoretical maximum is 6000
containers and 12,000 bind mounts of root file system directories.
Also, containers are faster to create and have low startup times.
LXC, or Linux Containers, are lightweight and portable OS-level virtualization units which share the base operating system's kernel, but at the same time act as isolated environments with their own filesystem, processes and TCP/IP stack. They can be compared to Solaris Zones or Jails on FreeBSD. As there is no virtualization overhead, they perform much better than virtual machines.
KVM represents the virtualization capabilities built into the Linux kernel itself. As already stated in the previous answers, it is a type 2 hypervisor, i.e. it does not run on bare metal.
This whitepaper explains the difference between hypervisors and Linux containers, and also gives some history behind containers:
http://sp.parallels.com/fileadmin/media/hcap/pcs/documents/ParCloudStorage_Mini_WP_EN_042014.pdf
An excerpt from the paper:
a hypervisor works by having the host operating system emulate machine hardware and then bringing up other virtual machines (VMs) as guest operating systems on top of that hardware. This means that the communication between guest and host operating systems must follow a hardware paradigm (anything that can be done in hardware can be done by the host to the guest).
On the other hand, container virtualization (shown in figure 2) is virtualization at the operating system level, instead of the hardware level. So each of the guest operating systems shares the same kernel, and sometimes parts of the operating system, with the host. This enhanced sharing gives containers a great advantage in that they are leaner and smaller than hypervisor guests, simply because they're sharing much more of the pieces with the host. It also gives them the huge advantage that the guest kernel is much more efficient about sharing resources between containers, because it sees the containers as simply resources to be managed.
An example:
Container 1 and Container 2 open the same file, the host kernel opens the file and
puts pages from it into the kernel page cache. These pages are then handed out to
Container 1 and Container 2 as they are needed, and if both want to read the same
position, they both get the same page.
In the case of VM1 and VM2 doing the same thing, the host opens the file (creating
pages in the host page cache) but then each of the kernels in VM1 and VM2 does
the same thing, meaning if VM1 and VM2 read the same file, there are now three
separate pages (one in the page caches of the host, VM1 and VM2 kernels) simply
because they cannot share the page in the same way a container can. This
advanced sharing of containers means that the density (number of containers or virtual machines you can run on the system) is up to three times higher in the container case than in the hypervisor case.
Summary:
KVM is a hypervisor based on emulating virtual hardware. Containers, on the other hand, are based on a shared operating system and are leaner. But this imposes a limitation on containers: since all of them use a single shared kernel, you can't run Windows and Linux guests side by side on the same host.
I'm new to KVM. Can someone explain what happens when a guest handles an external interrupt or an emulated device interrupt?
Thanks
Amos
In the x86 architecture (Intel in this case), most interrupts cause a CPU VM exit, which means control of the CPU returns from the guest to the host.
So the process is:

1. The CPU runs the guest OS in VMX non-root mode.
2. An interrupt arrives at the CPU.
3. Control of the CPU returns to the host running in VMX root mode (a VM exit).
4. The host (KVM) handles the interrupt.
5. The host executes the VMLAUNCH instruction to switch the CPU back to VMX non-root mode and resume the guest code.
6. Repeat from step 1.
If you are new to KVM, you should first read a few papers about how the KVM module works (I assume you know the basic idea of virtualization): how it uses QEMU to do I/O emulation, etc.
I recommend you read these papers:
kvm: the Linux Virtual Machine Monitor: https://www.kernel.org/doc/mirror/ols2007v1.pdf#page=225
Kernel-based Virtual Machine Technology: http://www.fujitsu.com/downloads/MAG/vol47-3/paper18.pdf
KVM: Kernel-based Virtualization Driver: http://www.linuxinsight.com/files/kvm_whitepaper.pdf
These are papers written by the guys who started KVM (they are short and sweet :) ).
After this you should start looking at the KVM documentation in the source code, especially the file api.txt; it's very good.
Then I think you can jump into the source code to understand how things actually work.
Cheers
I conducted the following benchmark in qemu and qemu-kvm, with the following configuration:
CPU: AMD 4400 dual-core processor with SVM enabled, 2G RAM
Host OS: OpenSUSE 11.3 with latest Patch, running with kde4
Guest OS: FreeDos
Emulated Memory: 256M
Network: Nil
Language: Turbo C 2.0
Benchmark Program: Count from 0000000 to 9999999, displaying the counter on the screen by directly accessing screen memory (i.e. 0xB800:xxxx)
It only takes 6 sec when running in qemu.
But it takes 89 sec when running in qemu-kvm.
I ran the benchmark one by one, not in parallel.
I scratched my head the whole night, but still have no idea why this happens. Would somebody give me some hints?
KVM uses QEMU as its device simulator; any device operation is simulated by the user-space QEMU program. When you write to 0xB8000, the graphic display is operated, which involves the guest doing a CPU `vmexit` from guest mode and returning to the KVM module, which in turn sends a device-simulation request to the user-space QEMU backend.
In contrast, QEMU without KVM does all the work in a single process, with nothing beyond the usual system calls, so there are far fewer CPU context switches. Meanwhile, your benchmark code is a simple loop which only requires code-block translation a single time. That costs nothing compared to the vmexit and kernel-user communication of every iteration in the KVM case.
This should be the most probable cause.
Your benchmark is an IO-intensive benchmark and all the io-devices are actually the same for qemu and qemu-kvm. In qemu's source code this can be found in hw/*.
This explains why qemu-kvm is not necessarily fast compared to qemu. However, I have no definitive answer for the slowdown. I have the following explanation, and I think it is correct to a large extent.
"The qemu-kvm module uses the kvm kernel module in linux kernel. This runs the guest in x86 guest mode which causes a trap on every privileged instruction. On the contrary, qemu uses a very efficient TCG which translates the instructions it sees at the first time. I think that the high-cost of trap is showing up in your benchmarks." This ain't true for all io-devices though. Apache benchmark would run better on qemu-kvm because the library does the buffering and uses least number of privileged instructions to do the IO.
The reason is that too many VM exits take place.
I read this blog:
After conducting a bit of research I discovered that this is caused by the fact that the default Linux kernel runs at a 1000 Hz internal clock frequency and that VMware is unable to deliver the clock interrupts on time without losing them. This means that some clock interrupts are lost without notice to the Linux kernel, which assumes each interrupt marks 1/1000th of a second. So each clock interrupt that gets lost makes the clock fall behind by 1/1000th of a second.
Now, my question is: how does the hypervisor keep the time in sync internally, if the hypervisor is the one handling the clock interrupts?
Because when, say (a scaled-up example, not real world), it's 19:10:22 on the host, by the time that propagates to the guest it will be 19:10:23 on the host.
I understand this is a hard problem, but I guess you need to slow down time from the VM's perspective. How is that achieved?
VMware timekeeping
The hypervisor does not synchronize the clocks. It is software running in the guest VM that keeps the clocks in sync.
From page 15 (with the explanation continuing on through page 19) of your linked PDF:
There are two main options available for guest operating system clock synchronization: VMware Tools periodic clock synchronization
or the native synchronization software that you would use with the guest operating system if you were running it directly on physical
hardware. Some examples of native synchronization software are Microsoft W32Time for Windows and NTP for Linux.
The VMware Tools clock sync tool just checks the guest clock against the host clock every so often (probably once per minute) and corrects the guest clock if it's wrong. If the guest clock is off by just a little bit the tool will speed up or slow down the guest clock until it has the correct time (using an API like SetSystemTimeAdjustment on Windows or adjtime on Unix). If you're wondering how the tool accesses the host's clock, there's just an API for it that the VMware tool knows how to use.