Public speaking course notes Read "Dynamo, Amazon’s Highly Available Key-value Store" Read "Bigtable, A Distributed Storage System for Structured Data" Read "Streaming Systems" 3, Watermarks Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

TAMU Advanced Algorithms 2, B+ tree and Segment Intersection

2017-10-15

B+ tree 0919

  • A popular data structure used in database, dealing with data stored in disks.
  • Support fast search
  • Support dynamic changes
  • Support range search

Node

A node of order n has:

  • n keys: search-keys (values of the attributes in the index)
  • n+1 pointers (disk address)

Basically we want each B+ treenode to fit in a disk block so that a B+ treenode can be read/written by a single disk I/O. Typically, n ~ 100-200.

To keep the nodes not too empty, also for the operations to be applied efficiently: Basically, use at least one half of the pointers

  • Non-leaf:
    at least (n+1)/2 pointers (to children)
  • Leaf:
    at least (n+1)/2 pointers to data (plus a “sequence pointer” to the next leaf)

A sample non-leaf with n = 3 is as

sample non_leaf

A sample leaf with n = 3 is as

sample non_leaf

Example for full and min node with n = 3 is as

sample non_leaf

Rules

  1. All leaves are at same lowest level (balanced tree)
  2. Pointers in leaves point to records except for “sequence pointer”
  3. Number of keys/pointers in nodes:
Type Max. No. pointers Max. No. keys Min. No. pointers Min. No. keys
Non-leaf n+1 n (n+1)/2 (n+1)/2 - 1
Leaf n+1 n (n+1)/2 + 1 (n+1)/2
Root n+1 n 2 or 1 1

Search(ptr, k);

\\ search a record of key value in the subtree rooted at ptr
\\ assume the B+tree is a dense index of order
CASE 1. is a leaf \\ is the sequence pointer

IF () for a key in *ptr THEN RETURN ;
ELSE RETURN Null;

CASE 2. is not a leaf

find a key in *ptr such that ;
RETURN Search(, );

To research all records whose key values are between and :

Range-Search(, , )

  1. Call Search(, ) to find the leaf of value ;
  2. Follow the sequence pointers until the search key value is larger than .

Insert

  1. Find the leaf L where the record r should be inserted;
  2. If L has further room, then insert r into L, and return;
  3. If L is full, spilt L plus r into two leaves (each is about half full): this causes an additional child for the parent P of L, thus we need to add a child to P;
  4. If P is already full, then we have to split P and add an additional child to P’s parent … (recursively)

Leaf overflow

b-plus-insert-leaf-overflow

Non-leaf overflow

b-plus-insert-non-leaf-overflow

Pseudo Code

Leaf part:

b-plus-insert-leaf-code

Non-leaf part

b-plus-insert-non-leaf-code

Delete

  1. Find the leaf L where the record r should be deleted;
  2. If L remains at least half-full after deleting r, then delete r, and return;
  3. Else consider the sibling L’ of L;
    • If L’ is more than half-full, then move a record from L’ to L, and return;
    • Else combine L and L’ into a single leaf;
    • But now you need to consider the case of deleting a child from L’s parent … (recursively)

Key re-distribution at leaves

b-plus-delete-leaf-redist

Key re-distribution at nonleaves

b-plus-delete-non-leaf-redist

Leaf coalescence

b-plus-delete-leaf-coal

Non-leaf coalescence

b-plus-delete-non-leaf-coal

Pseudo Code

b-plus-delete-code

Practice

Often, coalescing is not implemented. Too hard and not worth it!

Elevator Algorithm 0921

To process disk request, but not the optimal algorithm.

The basic idea is keep going & process request until no request, then reverse direction.

Algorithm

  • dir: up, down, stop
  • UQ: upper queue
  • LQ: lower queue
  • pos: postion of the processor

PROCESS()

  1. IF UQ is empty and LQ is empty
    THEN dir = stop; RETURN
  2. IF dir = up
    THEN

    IF UQ is not empty
    THEN x = MIN(UQ); pos = x; DELETE(UQ, x);
    ELSE IF LQ is not empty
    THEN dir = down
    ELSE

    dir = stop; RETURN

    ELSE \\ dir = down
    IF LQ is not empty
    THEN x = MAX(LQ); pos = x; DELETE(LQ, x)
    ELSE IF UQ is not empty
    THEN dir = up
    ELSE

    dir = stop; RETURN;

REQUEST(x)

  1. IF x > pos
    THEN INSERT(UQ, x)
    ELSE INSERT(LQ, x)
  2. IF dir = stop
    THEN activate PROCESS

Segment intersection 0926

Geometric sweeping

In this notes, we illustrate a technique called geometric sweeping and use it to solve the segment intersection problem. Geometric sweeping is a generalization of a technique called plane sweeping, that is primarily used for 2-dimensional problems. In most cases, we will illustrate the technique for 2-dimensional cases. The generalization to higher dimensions is straightforward. This technique is also known as the scan-line method in computer graphics, and is used for a variety of applications, such as shading, polygon filling, among others.

The technique is intuitively simple. Suppose that we have a line in the plane. To collect the geometric information we are interested in, we slide the line in some way so that the whole plane will be “scanned” by the line. While the line is sweeping the plane, we stop at some points and update our recording. We continue this process until all interesting objects are collected.

There are two basic structures associated with this technique. One is for the sweeping line status, which is an appropriate description of the relevant information of the geometric objects at the sweeping line, and the other is for the event points, which are the places we should stop and update our recording. Note that the structures may be implemented in different data structures under various situations. In general, the data structures should support efficient operations that are necessary for updating the structures while the line is sweeping the plane.

Problem

Segment Intersection. Given n line segments in the plane, find all intersections.

Assume:

  1. No two stop points on the same vertical line
  2. No vertical segment

Idea

Suppose that we have a vertical line L. We sweep the plane from left to right. At every moment, the sweeping line status contains all segments intersecting the line L, sorted by the y-coordinates of their intersecting points with L. The sweeping line status is modi ed whenever one of the following three cases occurs:

  1. The line L hits the left-end of a segment S. In this case, the segment S was not seen before and it may have intersections with other segments on the right side of the line L, so the segment S should be added to the sweeping line status;
  2. The line L hits the right-end of a segment S. In this case, the segment S cannot have any intersections with other segments on the right side of the line L, so the segment S can be deleted from the sweeping line status;
  3. The line L hits an intersection of two segments S1 and S2. In this case, the relative positions of the segments S1 and S2 in the sweeping line status should be swapped, since the segments in the sweeping line status are sorted by the y-coordinates of their intersection points with the line L.

It is easy to see that the sweeping line status of the line L will not be changed when it moves from left to right unless it hits either an endpoint of a segment or an intersection of two segments. Therefore, the set of event points consists of the endpoints of the given segments and the intersection points of the segments. We sort the event points by their x-coordinates.

We use two data structures EVENT and STATUS to store the event points and the sweeping line status, respectively, such that the set operations MINIMUM, INSERT, and DELETE can be performed efficiently (for example, they can be 2-3 trees). At very beginning, we suppose that the line L is far enough to the left so that no segments intersect L. At this moment, the sweeping line status is an empty set. We sort all endpoints of the segments by their x- coordinates and store them in EVENT. These are the event points at which the line L should stop and update the sweeping line status. However, the list is not complete since an intersection point of two segments should also be an event point.

Unfortunately, these points are unknown to we at beginning. For this, we update the structure EVENT in the following way. Whenever we fi nd an intersection point of two segments while the line L is sweeping the plane, we add the intersection point to EVENT. But how do we find these intersection points?

Note that if the next event point to be hit by the sweeping line L is an intersection point of two segments S1 and S2, then the segments S1 and S2 should be adjacent in the sweeping line status. Therefore, whenever we change the adjacency relation in STATUS, we check for intersection points for new adjacent segments. When the line L reaches the right-most endpoint of the segments, all possible intersection points are collected.

Algorithm

Algorithm SEGMENT-INTERSECTION

Given: n line segments S1, S2, … , Sn

Output: all intersections of these segments

Implicitly, we use a vertical line L to sweep the plane. At any moment, the segments intersecting L are stored in STATUS, sorted by the y-coordinates of their intersection points with L. The event points stored in EVENT are sorted by their x-coordinates

  1. Sort the endpoints of the segments and put them in EV;
  2. ST = ;;
  3. while EV do

    3.1. p = MINIMUM(EV); DELETE(EV, p);
    3.2. if p is a right-end of some segment S then

    let Si and Sj be the two segments adjacent to S in ST;
    if p is an intersection point of S with Si or Sj then REPORT(p);
    DELETE(ST, S);
    if Si and Sj intersect at p1 and x(p1) x(p) then INSERT(EV, p1);

    3.3. else if p is a left-end of some segment S then

    INSERT(ST, S);
    let Si and Sj be the adjacent segments of S in ST;
    if p is an intersection point of S with Si or Sj then REPORT(p);
    if S intersects Si at p1 with x(p1) > x(p), INSERT(EV, p1);
    if S intersects Sj at p2 with x(p2) > x(p), INSERT(EV, p2);

    3.4. else if p is an intersection of segments Si and Sj with Si on the left of Sj in ST then

    REPORT(p);
    swap the positions of Si and Sj in ST;
    Let Sk be the segment left to Sj and let Sh be the segment right to Si in ST;
    if Sk and Sj intersect at p1 with x(p1) > x(p) then INSERT(EV, p1);
    if Sh and Si intersect at p2 with x(p2) > x(p) then INSERT(EV, p2).

Analysis

Step 1, sorting the endpoints of the segments, can be done in time . Step 2 takes constant time .

For step 3, suppose there are intersection points for these segments. In the while loop, each segment is inserted then deleted from the structure ST exactly once, and each event point is inserted then deleted from the structure EV exactly once. There are event points. If we suppose that the operations MINIMUM, INSERT, and DELETE can all be done in time on a set of elements, then processing each segment takes at most time, and processing each event point takes at most time. Therefore, the algorithm runs in time

Observe that is at most , so . Thus we conclude that the algorithm SEGMENT-INTERSECTION runs in time .

We remark that the time complexity of the above algorithm depends on the number of intersection points of the segments and the algorithm is not always efficient. For example, when the number m is of order , then the algorithm runs in time , which is even worse than the straightforward method that picks every pair of segments and computes their intersection point. On the other hand, if the number m is of order , then the algorithm runs efficiently in time .

Reference

This is my class note of CSCE 629 in Texas A&M taught by Dr. Jianer Chen. Credit to him.


Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2025. All rights reserved by melonskin. Powered by Jekyll.