Introduction
Definition
A virtual machine is taken to be an efficient, isolated duplicate of the real machine.
- Duplicate: VM should behave identically to the real machine.
- Software cannot distinguish between real and virtual hardware.
- Exceptions: Fewer resources, some timing differences
- Isolated: Several VMs can share real machine without interfering with each other.
- Efficient: VM should execute at speed comparable to that of the real machine.
- Requires that most instructions are executed directly by real hardware.
Terminology:
- Simulation: Abstract model of a system is functionally simulated.
- Replicate the functions only
- Emulation: Hardware or software (or both) emulates the behavior of the guest in a host so that emulated behavior is close to behavior of real system.
- Simulators as high-level emulators.
- Virtualization: Virtualization involves simulating parts of a computer’s hardware. Enough for a guest operating system to run unmodified. But most operations still occur on the real hardware for efficiency reasons.
Types of virtualization
- Process-level
- running inside process
- e.g., JVM, .NET
- Type-1, Bare metal, system-level
- Virtual machines run its own OS. Multiple VM on hypervisor on top of hardware
- Type-2, Hosted, system-level
- Hypervisor running on top of host OS
- Hypervisor hosts guest OSs
- Significant overhead
- e.g., virtual box
- OS-level
- Applications running inside VM without OS
- VM on top of virtualization layer on host OS
- e.g., Docker
Hypervisor
Hypervisor: Program that runs on hardware or OS to implement the virtual machine(s). Also called “Virtual Machine Monitor”.
Controls resources:
- Partitions hardware
- Schedules guests
- Mediates access to shared resources
- Switches between “world” as seen by hypervisor and guests.
- Much more expensive than process switches
Requires:
- Hypervisor must run in privileged mode.
- Guests must run in non-privileged mode.
- Privileged instructions of guests must be intercepted and run by hypervisor.
Purpose
- Historically, used for easier sharing of expensive mainframes. e.g., IBM CP-40
- Ideal to host legacy systems, on new-generation hardware
- Other than for mainframes, largely disappeared in 80’s.
- Time sharing supported by other operating systems.
- Hardware too cheap to worry about consolidation.
- Improved Isolation helped virtualization become popular again in the early 2000’s.
- Cloud Computing is today’s main driver.
- Increased utilization by sharing hardware (resource over-allocation)
- Reduced maintenance through increased scale (pets vs. cattle)
- On-demand provisioning
- Dynamic load balancing through migration
Terminology
Isolation:
- improved Quality of Service
- improved security
- different concurrent OSs
Encapsulation of guest OS:
- replication
- migration/consolidation
- checkpointing
- debugging
Mechanisms
Issues
Critical instructions should not be executed directly by guest OS, such as halt. Hypervisor should find a way to solve it.
Techniques
- De-privileging (“Trap-and-Emulate”)
- All instructions that read/write privileged state trap when executed in unprivileged level.
- Execute guest OS directly, but at unprivileged level.
- Para-Virtualization
- “Modify guest operating system to provide higher-level information to VMM.”
- Interpretive Execution
- Add dedicated HW execution mode for running the guest OS.
- e.g. IBM 370 SIE (“start interpretive execution”) instruction.
- Reduces number of required traps.
- Binary Translation
- Used in VMWare.
De-privileging (Trap & Emulate)
Problematic instructions in VM will throw exception. Exception handler will execute desired code for the instructions.
Terminology:
- Privileged State: Part of system state that determines resource allocation, e.g., addressing context, processor mode.
- Sensitive Instruction: Control or behavior sensitive
- Control Sensitive Instruction: Changes privileged state.
- Example: manipulate status register, return from interrupt
- Behavior Sensitive Instruction: Exposes privileged state.
- Example: load physical address
- Control Sensitive Instruction: Changes privileged state.
- “Innocuous” Instruction: not sensitive.
- Privileged Instruction: Traps when executed in user rather than in kernel mode (NOOP not sufficient!)
Requirements:
- Theorem: For any conventional third generation [1974] computer, a virtual machine monitor may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions.
- In other words, A system is virtualizable by Trap & Emulate if all sensitive instructions are privileged.
Two obstacles:
- Visibility of Privileged State
- Current Privilege Level is stored in code segment register. Guest therefore can know that it runs in de-privileged mode.
- Interrupt descriptor table in virtual memory.
- Lack of Traps when Privileged Instructions run at User-Level
- Some control-sensitive instructions generate NOOP in user mode rather than generating a trap.
- e.g. POPF “pop flags”, which modifies ALU and system flags, must generate trap for VMM to intervene.
- x86 allows POPF in user mode, but simply does not update interrupt flag (IF) in the system flags if executed in user mode. (This protects OS from mis-behaving users programs.) If operating system runs in user mode, it has no way to know whether to enable interrupts or not.
Para-virtualization
- Present software interface (the “para-API”) to virtual machines that is similar but not identical to that of the underlying hardware.
- Provide specially defined ‘hooks’ to allow the guest(s) to hand over handling of problematic portions of code to VMM.
- Requires the guest operating system to be explicitly ported for the para-API.
- A conventional O/S distribution that is not paravirtualization-aware cannot be run on top of a paravirtualized VMM!
- Xen solution for closed-source O/Ss: paravirtualization-aware device drivers (e.g. XenWindowsGplPv project) to be installed in guest O/S.
Binary translation
Observation: Traditionally, software VMMs run very slow due to interpretation.
Binary Translation:
- Replace sensitive instructions in guest binary on-the-fly and replace by emulation code.
- Binaries as input, not source code.
- Dynamic translation at run-time.
- Input is full x86 instruction set. Output is safe subset.
Steps:
1. read prefixes, opcodes, operands
2. stop at 12 instructions or terminating instruction (control flow) -> Translation Unit (TU)
3. leave innocuous instructions alone
4. translate remaining instructions
5. generate compiled-code-fragment (CCF)
Instructions to be translated:
- Instructions that are affected by translation, because code layout changes:
- PC-relative addressing
- Direct control flow (direct calls, branches, jumps)
- Indirect control flow (jmp, call, ret)
- Sensitive instructions:
- Some instructions run faster in binary translation mode than native.
- e.g. cli (clear interrupts) on Pentium 4 takes 60 cycles; replaced by
vcpu.flags.IF:=0
.
- e.g. cli (clear interrupts) on Pentium 4 takes 60 cycles; replaced by
- Other operations (e.g. context switch) may need to call out to a runtime, with lots of overhead.
- Some instructions run faster in binary translation mode than native.
Observation:
- This approach scales well
- Translated code has good instruction-cache locality.
- Translator captures execution trace of guest code.
- Rarely-executed code (e.g. error handling) is placed off the “hot” execution path.
- Binary Translation is not needed for safe execution of “most user code” on “most guest
operating systems”.
Memory
Virtual, Physical and Machine memory
- Virtual memory: for process inside guest OS
- Physical memory: for guest OS
- Machine memory: for host OS
Shadow page tables
Hypervisor manage a shadow page table for each process, which is actually used by the real memory management. pointed by CR3 (Virtual Addr to Machine Addr).
Steps to create or switch shadow page table:
- Process loads virtual page table address into virtual CR3
- Translated or trapped,
handle_CR3
is executed - Hypervisor knows the address of virtual page table, then generate shadow page table
- Load virtual and real CR3 register
- Shadow page table is used for following memory operations
Page faults:
- Page fault occurred in shadow page table
- Page Fault Handler: check if the page fault is real or shadow
- real: indeed a page fault, guest page entry (virtual page table) not valid
- Direct fault to guest OS for handling
- shadow: guest page entry is valid
- guest process updates the virtual page table; hypervisor doesn’t know it
- VMM updates shadow page entry
- real: indeed a page fault, guest page entry (virtual page table) not valid
Pros & cons:
- Pros:
- Eliminates double-bookkeeping (VA->PA and PA->MA)
- Cons:
- Every page fault by virtual machine exits into the hypervisor.
- Every page table needs to be duplicated.
Solution:
- Hardware support for memory virtualization
- AMD nested paging, Intel extended page tables (EPT)
- TLB maps VA -> MA, is aware of VMs
- Such TLB is much larger
Page replacement: Double page faults
Memory Over-commitment: Move some “physical” memory to disk when memory requirement exceed available resources.
Issues:
- A page replacement algorithm now has to pick:
- victim machine (ok)
- victim page (huh!? What is a good page to replace?!)
- Double-Paging Problem:
- What can happen when we page out a physical page that is on disk?
- Some time later the guest picks physical page on disk as victim.
- Actually done in virtual page table of the guest process
- In order to page out victim, it must first be paged in by VMM beforehand.
- Some time later the guest picks physical page on disk as victim.
- This causes 2 page faults per fault.
- What can happen when we page out a physical page that is on disk?
Memory management: Ballooning
It’s best to have guest OSs to choose which page to page out.
Solution: Ballooning
- Have an indicator for the amount of memory pressure that the hypervisor experiences.
- Start virtual machines, each with a given amount of physical memory, the memory pressure for the hypervisor increases.
- Pressure increases further as we start more virtual machines.
- Deploy special device drivers that control memory balloons. Balloons just request physical memory from the guest OS and so make it inaccessible to the virtual machine (guest OS).
- This memory is then made available to the hypervisor to allocate to other virtual machines.
Process:
- If the hypervisor experiences memory pressure because it is running many virtual machines, it may inflate one ore more balloons.
- When the balloon in a virtual machine inflates, it requests more memory, thus causing memory pressure for the guest OS.
- In response to the increased memory pressure in the virtual machine, the virtual memory manager of the guest OS kicks in and starts paging out pages, either directly, or by swapping out processes.
- The memory pressure in the virtual machine is released, and the hypervisor memory pressure as well, until it reaches the target level.
Potential problem:
- Ballooning works fine as long as it works.
- Ballooning drivers may be uninstalled, disabled explicitly, unavailable during booting.
- Upper levels on balloon sizes may be imposed by guest operating system.
Reference
This is my class notes while taking CSCE 611 at TAMU. Credit to the instructor Dr. Bettati.