Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

TAMU Operating System 7 Virtualization

2018-04-29

Introduction

Definition

A virtual machine is taken to be an efficient, isolated duplicate of the real machine.

  • Duplicate: VM should behave identically to the real machine.
    • Software cannot distinguish between real and virtual hardware.
    • Exceptions: Fewer resources, some timing differences
  • Isolated: Several VMs can share real machine without interfering with each other.
  • Efficient: VM should execute at speed comparable to that of the real machine.
    • Requires that most instructions are executed directly by real hardware.

Terminology:

  • Simulation: Abstract model of a system is functionally simulated.
    • Replicate the functions only
  • Emulation: Hardware or software (or both) emulates the behavior of the guest in a host so that emulated behavior is close to behavior of real system.
    • Simulators as high-level emulators.
  • Virtualization: Virtualization involves simulating parts of a computer’s hardware. Enough for a guest operating system to run unmodified. But most operations still occur on the real hardware for efficiency reasons.

Types of virtualization

  • Process-level
    • running inside process
    • e.g., JVM, .NET
  • Type-1, Bare metal, system-level
    • Virtual machines run its own OS. Multiple VM on hypervisor on top of hardware
  • Type-2, Hosted, system-level
    • Hypervisor running on top of host OS
    • Hypervisor hosts guest OSs
    • Significant overhead
    • e.g., virtual box
  • OS-level
    • Applications running inside VM without OS
    • VM on top of virtualization layer on host OS
    • e.g., Docker

Hypervisor

Hypervisor: Program that runs on hardware or OS to implement the virtual machine(s). Also called “Virtual Machine Monitor”.

Controls resources:

  • Partitions hardware
  • Schedules guests
  • Mediates access to shared resources
  • Switches between “world” as seen by hypervisor and guests.
    • Much more expensive than process switches

Requires:

  • Hypervisor must run in privileged mode.
  • Guests must run in non-privileged mode.
  • Privileged instructions of guests must be intercepted and run by hypervisor.

Purpose

  • Historically, used for easier sharing of expensive mainframes. e.g., IBM CP-40
  • Ideal to host legacy systems, on new-generation hardware
  • Other than for mainframes, largely disappeared in 80’s.
    • Time sharing supported by other operating systems.
    • Hardware too cheap to worry about consolidation.
  • Improved Isolation helped virtualization become popular again in the early 2000’s.
  • Cloud Computing is today’s main driver.
    • Increased utilization by sharing hardware (resource over-allocation)
    • Reduced maintenance through increased scale (pets vs. cattle)
    • On-demand provisioning
    • Dynamic load balancing through migration

Terminology

Isolation:

  • improved Quality of Service
  • improved security
  • different concurrent OSs

Encapsulation of guest OS:

  • replication
  • migration/consolidation
  • checkpointing
  • debugging

Mechanisms

Issues

Critical instructions should not be executed directly by guest OS, such as halt. Hypervisor should find a way to solve it.

Techniques

  1. De-privileging (“Trap-and-Emulate”)
    • All instructions that read/write privileged state trap when executed in unprivileged level.
    • Execute guest OS directly, but at unprivileged level.
  2. Para-Virtualization
    • “Modify guest operating system to provide higher-level information to VMM.”
  3. Interpretive Execution
    • Add dedicated HW execution mode for running the guest OS.
    • e.g. IBM 370 SIE (“start interpretive execution”) instruction.
    • Reduces number of required traps.
  4. Binary Translation
    • Used in VMWare.

De-privileging (Trap & Emulate)

Problematic instructions in VM will throw exception. Exception handler will execute desired code for the instructions.

Terminology:

  • Privileged State: Part of system state that determines resource allocation, e.g., addressing context, processor mode.
  • Sensitive Instruction: Control or behavior sensitive
    • Control Sensitive Instruction: Changes privileged state.
      • Example: manipulate status register, return from interrupt
    • Behavior Sensitive Instruction: Exposes privileged state.
      • Example: load physical address
  • “Innocuous” Instruction: not sensitive.
  • Privileged Instruction: Traps when executed in user rather than in kernel mode (NOOP not sufficient!)

Requirements:

  • Theorem: For any conventional third generation [1974] computer, a virtual machine monitor may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions.
  • In other words, A system is virtualizable by Trap & Emulate if all sensitive instructions are privileged.

Two obstacles:

  1. Visibility of Privileged State
    • Current Privilege Level is stored in code segment register. Guest therefore can know that it runs in de-privileged mode.
    • Interrupt descriptor table in virtual memory.
  2. Lack of Traps when Privileged Instructions run at User-Level
    • Some control-sensitive instructions generate NOOP in user mode rather than generating a trap.
    • e.g. POPF “pop flags”, which modifies ALU and system flags, must generate trap for VMM to intervene.
      • x86 allows POPF in user mode, but simply does not update interrupt flag (IF) in the system flags if executed in user mode. (This protects OS from mis-behaving users programs.) If operating system runs in user mode, it has no way to know whether to enable interrupts or not.

Para-virtualization

  • Present software interface (the “para-API”) to virtual machines that is similar but not identical to that of the underlying hardware.
  • Provide specially defined ‘hooks’ to allow the guest(s) to hand over handling of problematic portions of code to VMM.
  • Requires the guest operating system to be explicitly ported for the para-API.
    • A conventional O/S distribution that is not paravirtualization-aware cannot be run on top of a paravirtualized VMM!
    • Xen solution for closed-source O/Ss: paravirtualization-aware device drivers (e.g. XenWindowsGplPv project) to be installed in guest O/S.

Binary translation

Observation: Traditionally, software VMMs run very slow due to interpretation.

Binary Translation:

  • Replace sensitive instructions in guest binary on-the-fly and replace by emulation code.
  • Binaries as input, not source code.
  • Dynamic translation at run-time.
  • Input is full x86 instruction set. Output is safe subset.

Steps:

1. read prefixes, opcodes, operands
2. stop at 12 instructions or terminating instruction (control flow) -> Translation Unit (TU)
3. leave innocuous instructions alone
4. translate remaining instructions
5. generate compiled-code-fragment (CCF)

Instructions to be translated:

  1. Instructions that are affected by translation, because code layout changes:
    • PC-relative addressing
    • Direct control flow (direct calls, branches, jumps)
    • Indirect control flow (jmp, call, ret)
  2. Sensitive instructions:
    • Some instructions run faster in binary translation mode than native.
      • e.g. cli (clear interrupts) on Pentium 4 takes 60 cycles; replaced by vcpu.flags.IF:=0.
    • Other operations (e.g. context switch) may need to call out to a runtime, with lots of overhead.

Observation:

  • This approach scales well
  • Translated code has good instruction-cache locality.
    • Translator captures execution trace of guest code.
    • Rarely-executed code (e.g. error handling) is placed off the “hot” execution path.
  • Binary Translation is not needed for safe execution of “most user code” on “most guest
    operating systems”.

Memory

Virtual, Physical and Machine memory

  • Virtual memory: for process inside guest OS
  • Physical memory: for guest OS
  • Machine memory: for host OS

Shadow page tables

Hypervisor manage a shadow page table for each process, which is actually used by the real memory management. pointed by CR3 (Virtual Addr to Machine Addr).

Steps to create or switch shadow page table:

  1. Process loads virtual page table address into virtual CR3
  2. Translated or trapped, handle_CR3 is executed
  3. Hypervisor knows the address of virtual page table, then generate shadow page table
  4. Load virtual and real CR3 register
  5. Shadow page table is used for following memory operations

Page faults:

  1. Page fault occurred in shadow page table
  2. Page Fault Handler: check if the page fault is real or shadow
    • real: indeed a page fault, guest page entry (virtual page table) not valid
      • Direct fault to guest OS for handling
    • shadow: guest page entry is valid
      • guest process updates the virtual page table; hypervisor doesn’t know it
      • VMM updates shadow page entry

Pros & cons:

  • Pros:
    • Eliminates double-bookkeeping (VA->PA and PA->MA)
  • Cons:
    • Every page fault by virtual machine exits into the hypervisor.
    • Every page table needs to be duplicated.

Solution:

  • Hardware support for memory virtualization
    • AMD nested paging, Intel extended page tables (EPT)
    • TLB maps VA -> MA, is aware of VMs
      • Such TLB is much larger

Page replacement: Double page faults

Memory Over-commitment: Move some “physical” memory to disk when memory requirement exceed available resources.

Issues:

  • A page replacement algorithm now has to pick:
    • victim machine (ok)
    • victim page (huh!? What is a good page to replace?!)
  • Double-Paging Problem:
    • What can happen when we page out a physical page that is on disk?
      • Some time later the guest picks physical page on disk as victim.
        • Actually done in virtual page table of the guest process
      • In order to page out victim, it must first be paged in by VMM beforehand.
    • This causes 2 page faults per fault.

Memory management: Ballooning

It’s best to have guest OSs to choose which page to page out.

Solution: Ballooning

  • Have an indicator for the amount of memory pressure that the hypervisor experiences.
  • Start virtual machines, each with a given amount of physical memory, the memory pressure for the hypervisor increases.
  • Pressure increases further as we start more virtual machines.
  • Deploy special device drivers that control memory balloons. Balloons just request physical memory from the guest OS and so make it inaccessible to the virtual machine (guest OS).
    • This memory is then made available to the hypervisor to allocate to other virtual machines.

Process:

  1. If the hypervisor experiences memory pressure because it is running many virtual machines, it may inflate one ore more balloons.
  2. When the balloon in a virtual machine inflates, it requests more memory, thus causing memory pressure for the guest OS.
  3. In response to the increased memory pressure in the virtual machine, the virtual memory manager of the guest OS kicks in and starts paging out pages, either directly, or by swapping out processes.
  4. The memory pressure in the virtual machine is released, and the hypervisor memory pressure as well, until it reaches the target level.

Potential problem:

  • Ballooning works fine as long as it works.
  • Ballooning drivers may be uninstalled, disabled explicitly, unavailable during booting.
  • Upper levels on balloon sizes may be imposed by guest operating system.

Reference

This is my class notes while taking CSCE 611 at TAMU. Credit to the instructor Dr. Bettati.


Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2019. All rights reserved by melonskin. Powered by Jekyll.