
TAMU Operating System 3 Concurrency and Threading

2018-04-27

User-level view

Why concurrency

  1. Latency reduction
    • Apply parallel algorithms
  2. Latency hiding
    • Perform useful work while another operation is pending (multiprogramming)
  3. Throughput increase
    • Multiple concurrent executions to allow more simultaneous work

How

  • Serial
    • Complete one task, then start the next
    • Pros:
      • No potential for conflict
      • Inter-task invariants cannot be violated
    • Cons:
      • No multiprogramming
      • No multiprocessor parallelism
  • Cooperative
    • Voluntarily yield CPU at well-defined points in execution
    • Pros:
      • Allows for some controlled multiprogramming
      • Invariants only need to be ensured at yield points
    • Cons:
      • Invariants are not automatically enforced
  • Preemptive
    • Execution of tasks can interleave
    • A scheduler is needed
    • Pros:
      • Allows for high-level of multiprogramming
      • Parallelism
    • Cons:
      • Invariants must hold at all times
      • Requires a mechanism to ensure this: synchronization

Invariant: predicate that has to be true at all times

Threads

  • Scheduler manages threads
  • Each thread has its own registers and stack
  • POSIX thread model is widely used
    • Thread attributes control the behavior of threads
    • Functions:
      • pthread_create()
      • pthread_cancel()
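
A minimal example of this API (a sketch, assuming a POSIX system; compile with -lpthread). It creates a thread with default attributes, then waits for it with pthread_join(), which is part of the same POSIX interface even though it is not listed above:

#include <pthread.h>
#include <stdio.h>

/* Thread function: receives the argument passed to pthread_create(). */
static void *worker(void *arg)
{
    printf("worker running with arg %d\n", *(int *) arg);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    int arg = 42;

    /* NULL attributes => default stack size, scheduling policy, etc. */
    if (pthread_create(&tid, NULL, worker, &arg) != 0)
        return 1;

    /* Wait for the thread to finish before exiting. */
    pthread_join(tid, NULL);
    return 0;
}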

Low-level thread dispatching on x86

A dispatch stores the state of the currently executing thread and loads the state of the target thread onto the CPU.

Dispatching in 3 steps:

  1. Save state of current thread
    • processor state, such as registers
    • higher-level state can be handled by the system
  2. Load state of new thread
  3. Continue execution of new thread

Thread dispatching can be handled the same way exceptions are handled.

Handling exceptions:

  1. Push error code and interrupt number onto stack
  2. Call a function that:
    • pushes the remaining registers onto the stack (saving state)
    • calls the high-level exception handler, which has access to those saved CPU registers
    • restores the previously saved state from the stack into the registers
    • returns with iret

Steps for dispatching:

  1. Thread A calls the function dispatch_to(B)
  2. The current state of A is saved on A's stack
  3. The state of B is loaded from B's stack into the registers
  4. iret returns, and B takes control
  5. B returns from its earlier call to dispatch_to()
  6. Execution of B continues

Structure of a thread

Thread control block

  • thread id
  • stack: pointer to the stack
  • stackptr: pointer to the current top of the stack
  • priority
  • bookkeeping
  • etc
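
A sketch of such a TCB in C; the field names follow the list above, while the exact types and the ready-queue link are illustrative assumptions:

/* Thread control block: one per thread, kept by the scheduler. */
typedef struct TCB {
    unsigned long  thread_id;   /* thread id */
    char          *stack;       /* base address of the thread's stack */
    unsigned long *stackptr;    /* current top of the stack (saved stack pointer) */
    int            priority;    /* scheduling priority */
    struct TCB    *next;        /* bookkeeping, e.g. link in the ready queue */
} TCB;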

Details of dispatch

Steps:

  1. Fix the stack
    • put return_addr, KERNEL_CS, eflags and _thread on the stack
  2. Save state
    • push a fake error code and interrupt number
    • save the general-purpose registers
    • save the stack pointer in the TCB
  3. Locate the state of the new thread (from the function argument)
    • make the new thread current by updating the current-thread pointer
  4. Load the state of the new thread

Thread creation

A new thread is created by pushing onto its stack (see the sketch after this list):

  1. tshutdown
  2. _tfunction, thread function
  3. eflags, instruction pointer
  4. fake error code and interrupt number
  5. general purpose registers
  6. segment registers
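
A hedged C sketch of building that initial stack image, assuming a downward-growing 32-bit x86 stack. The names tshutdown, _tfunction and thread_start come from these notes; the helper function, its arguments, the eflags value and the zeroed register slots are illustrative assumptions, not the course's actual layout:

#include <stddef.h>

#define INITIAL_EFLAGS 0x200UL  /* interrupts enabled; illustrative value */

/* Build the initial stack image for a new thread; the push order mirrors
 * the numbered list above. Returns the value to store in the TCB's stackptr. */
static unsigned long *init_thread_stack(char *stack_base, size_t stack_size,
                                        void (*tfunction)(void),
                                        void (*tshutdown)(void),
                                        void (*thread_start)(void))
{
    unsigned long *sp = (unsigned long *) (stack_base + stack_size);

    *--sp = (unsigned long) tshutdown;     /* 1. runs when the thread function returns */
    *--sp = (unsigned long) tfunction;     /* 2. the thread function */
    *--sp = INITIAL_EFLAGS;                /* 3. eflags ... */
    *--sp = (unsigned long) thread_start;  /* ... and the instruction pointer */
    *--sp = 0;                             /* 4. fake error code */
    *--sp = 0;                             /*    fake interrupt number */
    for (int i = 0; i < 8; i++)            /* 5. general-purpose registers */
        *--sp = 0;                         /*    (a zeroed pusha image) */
    for (int i = 0; i < 4; i++)            /* 6. segment registers */
        *--sp = 0;
    return sp;                             /* the first dispatch pops from here */
}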

Execution life cycle of a thread

  1. Create a new thread
  2. Someone dispatches to this thread
  3. Load the state of the new thread
  4. Execute thread_start() at the top of the stack
  5. Execute thread function _tfunction
  6. Execute tshutdown()

CPU scheduling

State of a thread/process

  1. starting: being created; moves to 2
  2. ready: ready to execute; switches to running
  3. running: can be switched back to ready
  4. blocked: switched from running while waiting for an event
  5. suspended ready: swapped out due to memory shortage
  6. suspended blocked: swapped out due to memory shortage
  7. terminated: reached from running

Scheduler:

  • long-term (admission) scheduler
    • between states 1 and 2
  • short-term (CPU) scheduler
    • between states 2 and 3
    • for CPU bursts, threads compete for the CPU
  • medium-term (memory) scheduler
    • between states 2 and 5, and between 4 and 6

Decision points for scheduler:

  • 3 to 4
  • 3 to 7
  • 4 to 2
  • 3 to 2

Scheduler

There is a ready queue containing the TCBs of ready threads. The scheduler is responsible for picking a thread from it and dispatching to that thread.

Criteria:

  • User oriented
    • Response time: time between the start of a request and the first response
    • Normalized response time: ratio of response time to execution time
    • Wait time: sum of the periods spent waiting in the ready queue
  • System oriented
    • CPU utilization: percentage of time the CPU is busy
    • Throughput: number of jobs completed per time unit

FIFO

  • First come, first serve
  • very simple
  • long average and worst-case waiting time
  • poor dynamic behavior

Shortest Job First

  • When the CPU is idle, pick the thread with the shortest next CPU burst
  • minimizes average waiting time (see the comparison sketch after this list)
  • starvation of threads with long CPU bursts
  • difficult to determine the length of the next CPU burst
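
A small worked comparison of FIFO and SJF, under the assumption that all jobs arrive at time 0 and burst lengths are known. For bursts {6, 8, 7, 3}, FIFO in arrival order gives waits 0, 6, 14, 21 (average 10.25), while SJF runs the jobs as 3, 6, 7, 8 and gives waits 0, 3, 9, 16 (average 7):

#include <stdio.h>
#include <stdlib.h>

/* Average waiting time when jobs run in the given order (all arrive at t=0). */
static double avg_wait(const int *burst, int n)
{
    double total_wait = 0.0;
    int t = 0;
    for (int i = 0; i < n; i++) {
        total_wait += t;   /* job i waits until all jobs before it are done */
        t += burst[i];
    }
    return total_wait / n;
}

static int cmp_int(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

int main(void)
{
    int fifo[] = {6, 8, 7, 3};              /* arrival order */
    int sjf[]  = {6, 8, 7, 3};
    int n = 4;

    qsort(sjf, n, sizeof sjf[0], cmp_int);  /* SJF: shortest burst first */

    printf("FIFO avg wait: %.2f\n", avg_wait(fifo, n));  /* 10.25 */
    printf("SJF  avg wait: %.2f\n", avg_wait(sjf, n));   /* 7.00 */
    return 0;
}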

Fixed-priority

  • When CPU is idle, pick thread with highest priority
  • priority: thread/process class or urgency
  • Starvation of low-priority threads
  • Solution: aging, i.e. increase priority over time
    • Use an array of FIFO queues
    • Age by moving threads between queues

Round-Robin

  • FIFO with preemption after a time quantum
  • A job may run continuously for at most one time quantum
  • After its quantum expires, a job is moved back to the tail of the FIFO queue
  • Choice of time quantum (see the simulation sketch after this list)
    • large: degenerates to FIFO
    • small: approximates processor sharing, but context-switching overhead grows
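
A toy simulation of round-robin (a sketch, assuming all jobs arrive at time 0 and ignoring context-switch cost); it prints each job's completion time for a quantum of 2:

#include <stdio.h>

int main(void)
{
    int remaining[] = {6, 8, 7, 3};   /* remaining burst of each job */
    int n = 4, quantum = 2, t = 0, left = n;

    while (left > 0) {
        for (int i = 0; i < n; i++) {
            if (remaining[i] <= 0)
                continue;
            int slice = remaining[i] < quantum ? remaining[i] : quantum;
            t += slice;                 /* job i runs one quantum (or less) */
            remaining[i] -= slice;
            if (remaining[i] == 0) {    /* job finished during this slice */
                printf("job %d completes at t=%d\n", i, t);
                left--;
            }
        }
    }
    return 0;
}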

MLFQ, Multilevel Feedback Queue

  • An array of Round-robin queues with different quantum values
  • Queues have different priority
  • Aging is implemented
  • Demotion: preempted thread is added at the end of the next lower-priority queue

Multiprocessor scheduling

  • Better to have a single scheduler per CPU
    • Reduces context-switching overhead
  • A global thread-partitioning module assigns threads to CPUs

User-level vs. kernel-level threads

Kernel-level threads

  • threads managed by the OS
    • Each thread in user space corresponds to (is backed by) one kernel-level thread in kernel space
  • Pro: avoids system-integration issues
  • Con: too heavyweight
    • Performance cost of thread-management operations
      • Cross a protection boundary on every thread operation
      • System calls
    • Cost of generality: a single implementation must serve all applications
      • user-level libraries can instead be tuned to specific applications

User-level threads

  • managed by a runtime library
  • Management operations require no kernel intervention
    • Multiple user-level threads can be mapped onto one kernel thread (which is what actually enters the kernel on a system call)
    • Library can allocate multiple threads in kernel space as well
    • Pros
      • Low cost
      • Flexible (various API, POSIX, Actors, Java)
      • Implementation requires no change to OS
    • Cons
      • Performance issues due to the mapping onto OS threads
      • Difficult to use and to integrate with system services
        • kernel events are invisible at user level
          • when a kernel thread blocks, the user level loses that processor without knowing it
        • kernel threads are scheduled obliviously to user-level thread state
          • so user-level threads cannot be scheduled accordingly
          • this causes problems because the OS does not know user-level priorities

Three relevant functions from the standard C library:

int getcontext(ucontext_t *ucp);
// Saves the current thread's execution context in the structure pointed to by ucp.

int setcontext(const ucontext_t *ucp);
// Makes a previously saved thread context the current thread context. The current
// context is lost and setcontext() does not return; instead, execution continues
// in the context specified by ucp.

void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
// Modifies the context pointed to by ucp so that it will continue execution by
// invoking func() with the arguments provided. The context ucp must have
// previously been initialized by a call to getcontext().
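
A minimal sketch of a user-level "dispatch" built from these functions; it also uses swapcontext(), which saves the current context and switches to another in a single call:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, thread_ctx;

static void tfunction(void)
{
    printf("in user-level thread\n");
    /* falling off the end resumes uc_link, set to main_ctx below */
}

int main(void)
{
    static char stack[16384];                /* stack for the new context */

    getcontext(&thread_ctx);                 /* initialize the context */
    thread_ctx.uc_stack.ss_sp = stack;       /* give it its own stack */
    thread_ctx.uc_stack.ss_size = sizeof stack;
    thread_ctx.uc_link = &main_ctx;          /* where to go when tfunction returns */
    makecontext(&thread_ctx, tfunction, 0);  /* it will run tfunction() */

    swapcontext(&main_ctx, &thread_ctx);     /* save main, dispatch to the thread */
    printf("back in main\n");
    return 0;
}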

Solution:

  • To balance kernel-level (KL) and user-level (UL) threads, we can provide kernel support to user-level threads
  • The kernel can tell the user library about kernel thread state changes via upcalls
  • The UL thread system can request and release processors
  • The UL thread system has complete control over which thread runs on the allocated processors

Example:

  1. User thread UA is running on kernel thread KA
  2. KA blocks
  3. The kernel creates a new kernel thread KB and informs the user library
  4. The user library resumes user thread UB on KB
  5. KA unblocks
  6. The kernel preempts KB and informs the user library
  7. The user library preempts UB
  8. UA is resumed

Event-based programming

Threads are hard

  • Deadlock
  • Modules cannot be designed independently
    • composing locking modules can cause deadlock
  • Achieving good performance is hard
    • coarse-grained locks: low concurrency
    • fine-grained locks: locking overhead reduces performance
    • OS limits: scheduling, context switching
  • Not well supported
    • Hard to port threaded code
    • libraries are often not thread-safe
    • kernel calls are often not multi-threaded

Event-driven programming is a good approach to cooperative task management; a toy event loop is sketched after the list below.

  • Handlers “register” for events
  • Event loop waits for events and then invokes handlers (through function calls)
  • Handlers execute to completion (no preemption)
  • Handlers are typically short-lived.
  • Events can be initiated by devices as well
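
A toy version of such a loop, with invented event names and a canned event sequence standing in for a real blocking wait (e.g. select/poll):

#include <stdio.h>

#define MAX_EVENTS 8

typedef void (*handler_t)(int data);

/* Handler table: handlers "register" for an event type by index. */
static handler_t handlers[MAX_EVENTS];

static void register_handler(int event, handler_t h) { handlers[event] = h; }

static void on_timer(int data) { printf("timer fired: %d\n", data); }
static void on_input(int data) { printf("input ready: %d\n", data); }

int main(void)
{
    register_handler(0, on_timer);
    register_handler(1, on_input);

    /* A real loop would block waiting for events; here we replay a
     * canned sequence of (event, data) pairs. */
    int events[][2] = { {0, 100}, {1, 42}, {0, 200} };
    for (int i = 0; i < 3; i++) {
        handler_t h = handlers[events[i][0]];
        if (h)
            h(events[i][1]);   /* handler runs to completion, no preemption */
    }
    return 0;
}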

Advantages

  • Easier to use:
    • No concurrency, no preemption, no synchronization, no deadlocks.
  • Easier to debug:
    • Timing dependencies all generated by events, not by internal scheduling.
    • Problems easier to trace, e.g. slow event handling vs. corrupted memory.
  • Fast:
    • No context switching, no scheduling overhead.
  • Portable:
    • It’s just function calls, after all

Difficulties

  • Manual control flow management:
    • Control flow for single task is broken up across multiple procedures.
  • Manual stack management:
    • Have to manually carry local state across these procedures (see the sketch after this list).
  • Unexpected blocking:
    • What if an event handler blocks, e.g. on a page fault?
    • The entire event loop waits
  • No parallelism:
    • How can events run on a multiprocessor?
    • There is no natural way; handlers run one at a time
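
An illustration of the manual stack management problem: one logical task (start a read, then process the result) split across two handlers, with the would-be local variables carried by hand in a heap-allocated struct. All names here are invented for illustration:

#include <stdio.h>
#include <stdlib.h>

/* In a thread, 'filename' and 'offset' would simply be locals across a
 * blocking read. Split into event handlers, they must be carried by hand. */
struct read_state {
    const char *filename;
    long        offset;
};

static void on_read_complete(void *arg)   /* second half of the task */
{
    struct read_state *st = arg;
    printf("processing %s at offset %ld\n", st->filename, st->offset);
    free(st);                             /* manual cleanup, too */
}

static void start_read(const char *filename, long offset,
                       void (*done)(void *))
{
    struct read_state *st = malloc(sizeof *st);
    if (st == NULL)
        return;
    st->filename = filename;
    st->offset = offset;
    /* A real system would issue an asynchronous read and invoke 'done'
     * from the event loop when it finishes; here we call it directly. */
    done(st);
}

int main(void)
{
    start_read("data.txt", 128, on_read_complete);
    return 0;
}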

Reference

These are my class notes from taking CSCE 611 at TAMU. Credit to the instructor, Dr. Bettati.

