Processing math: 100%
Public speaking course notes Read "Dynamo, Amazon’s Highly Available Key-value Store" Read "Bigtable, A Distributed Storage System for Structured Data" Read "Streaming Systems" 3, Watermarks Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap

2017-10-14

Largest elements 0831

For Big O notation:

O(n)cn=belong toO(n2)cn2

Therefore, we know:

O(n2)O(n)

Algorithm

An algorithm should be presented with 4 components:

  1. problem statelemt
  2. algorithm design
  3. algorthm analysis
  4. discussion

Example

e.g. 1

  1. find largest element in array A[1,n]
  2. design
    • a. max = A[1]
    • b. FOR i = 2 to n DO
      • IF max < A[i]
      • THEN max = A[i]
    • c. RETURN max
  3. make sure it’s currect, time complexity is O(n)
  4. discussion

e.g. 2

largest(i, j)

  1. IF i == j
    THEN return A[i]
  2. k = (i + j) / 2
  3. max1 = largest(i, k)
    max2 = largest(k+1, j)
  4. RETURN max(max1, max2)

main()

RETURN largest(1, n)

Analysis:

Let T(h) be the time for largest when it’s applied on a subarray of length h.

We know T(1)=O(1).

When h>1:

T(h)=O(1)step 1+O(1)step 2+2T(h2)step 3+O(1)step 3 assignment

Therefore,

T(h)=2T(h2)+O(1)

Equations can be converted to:

T(1)=O(1)T(1)C1 T(h)=2T(h2)+O(1)T(h)2T(h2)+C2

By recursion:

T(h)22T(h22)+2C2+C2 T(h)2loghT(1)+(1+2+22+...+2logh1)C2 T(h)hC1+(h1)C2

In conclusion:

T(h)=O(h)

We can also do “guess and verify”.

0905

To organize the data to speed up searching, we can sort the data in O(nlogn). Then do binary search in O(logn).

No searching algorithm can do better than O(logn).

Binary search tree 0907

A set is a collection of elements. All elements of a set are different, which means no set can contain two copies of the same element.

For a node in a BST, all values in its left subtree is less than its value; all values in its right subtree is no less than its value.

Operations

  1. Search(x, A)
  2. Insert(x, A)
  3. Delete(x, A)

All operations can not be done in time better than O(logn).

Balance BST

  1. Use rotation, AVL, Red-black tree
  2. split & merge, B+ tree

Rotation

  1. AVL tree: for each node, remember 2 heights of its two subtrees, so can know how to rotate
  2. Red-black: add color to the node (1 bit). For every path root to leaf
    • numbers of black are the same
    • no consecutive two red

Split & merge

2-3 tree

Definition

tree

A tree is

  • undirected, no cycles, connected, has a root
  • recursive definition

2-3 tree

A tree is a 2-3 tree if

  • each non-leaf has either 2 or 3 children
  • all leaves are at the same level / every path from the root to a leaf is of the same length

An empty set or a single root can be regarded as 2-3 tree

Data are stored in leaves, sorted in increasing order.

With n leaves, the tree has at most n-1 non-leaves.

A 2-3 tree of n leaves has a height bounded by logn.

Prove: remove a branch if having 3 children for each node, get a complete binary tree with n1n leaves. Meanwhile, height doesn’t change.

h=log2(n1)log2(n)

A linearly ordered set of elements can be represented by a 2-3 tree by assigning the elements to the leaves of the tree in such a way that for any non-leaf node of the tree, all elements stored in its first child are less than any elements stored in its second child, and all elements stored in its second child are less than any elements stored in its third child (if it has a third child).

Each non-leaf node v of a 2-3 tree has two or three children, which will be named child1(v), child2(v), child3(v), resp.. The node v also keeps three values for the corresponding subtrees:

  • l(v): the largest element stored in the subtree rooted at child1(v).
  • m(v): the largest element stored in the subtree rooted at child2(v).
  • h(v): the largest element stored in the subtree rooted at its child3(v) (if child3(v) exists).

Strictly speaking, the third value h(v) is not needed. All algorithms can be implemented without the value h(v), and without increasing the time complexity. However, we suggest to keep the third value in an implementation, which will simplify certain implementation details.

Operations

Search

  1. root: r
  2. element to search: x

SEARCH(r, x)

If (r is empty) report failure;
If (r is a leaf node) report (value(r) == x);
If (l(r) >= x) SEARCH(child1(r), x);
Else If (m(r) >= x) SEARCH(child2(r), x);
Else If (r has a third child) SEARCH(child3(r), x);
Else report failure.

Since the height of a 2-3 tree is O(logn), and the algorithm simply follows a path in the tree from the root to a leaf, and spends time O(1) on each level, the time complexity of the algorithm SEARCH is O(logn).

Insert

To insert a new element x into a 2-3 tree T rooted at r, we apply a recursive algorithm that dose two things:

  1. insert x into the tree T rooted at r;
  2. report whether this insertion splits the tree T rooted at r into two 2-3 trees.

If the 2-3 tree T has at most one leaf, then the job is easy:

  1. if T has no leaf (i.e., T represents an empty set), then we simply make a 2-3 tree that consists of a single node, which is both the root and the leaf of the tree, with a value x.
  2. if T has only one leaf of value y, then the tree T is a single-node tree, inserting x into T makes a two-leaf tree, whose values are x and y, respectively, and the leaves are ordered properly.

Now if the 2-3 tree T has a height at least 1 with at least two leaves, then we proceed at first as if we were searching x in the tree T. However, at the level just above the leaves, we start our insertion operation recursively. In general, suppose that we want to add a new child w to a node v in the 2-3 tree. If v has only two children, we simply make w a new child of v, placing the children in the proper order and updating the information of the node v.

Suppose, however, that v already has three children v1, v2, and v3. Then w would be the fourth child of v. We cannot have a node with four children in a 2-3 tree, so we split the node v into two nodes, which we call v and v’. With the new node v’, we can let the fi rst two of (v1, v2, v3, w) (in terms of the linear order) be children of v, and let the rest two be children of v’. Now, the node v’ is the root of a subtree and should be added as a new child to the parent of v. Thus, the operation now can be recursively done at the level of the parent of v.

One special case occurs when we wind up splitting the root. In that case we create a new root, whose two children are the two nodes into which the old root was split. This is how the number of levels in (i.e., the height of) a 2-3 tree increases.

INSERT(r, x)

  1. If (the tree rooted at r has < 2 leaves)

    process directly; return;

  2. Addson(r, x, r’);
  3. If (r’ != NULL)

    create a new node v; let r and r’ be children of v;

  4. return v.

The procedure Addson(r,x,r’) above is implemented by the following recursive algorithm, which inserts a new element x to the 2-3 tree rooted at r. Moreover, if this insertion causes splitting the node r due to exceeding the number of children, then a new node r’ is created to take two children from r. Therefore, if r’ is not empty when the procedure returns, then r and r’, respectively, are the roots of two 2-3 trees of the same height.

Addson(r, x, r’)

  1. r’ = NULL;
  2. If (r is a parent of leaves)

    If (r has 2 children) add x as a new child of r;
    Else \\ r has 3 children

    order x and the three children of r in the linear order;
    let the first two be children of r; and the rest two be children of r’;

    return;

  3. If (l(r) >= x) v = child1(r);
    Else If (m(r) >= x or child3(r)==NULL) v = child2(r);
    Else v = child3(r);
  4. Addson(v, x, v’);
  5. If (v’ == NULL) return;
  6. If (v’ != NULL and r has 2 children) add v’ as a new child of r;
    Else \\ r has 3 children and v’ is not NULL

    order v’ and the three children of r in the linear order;
    let the first two be children of r; and the rest two be children of r’;

Minimum and maximum

Given a 2-3 tree T we want to find out the minimum element stored in the tree. Recall that in a 2-3 tree the numbers are stored in leaf nodes in ascending order from left to right. Therefore the problem is reduced to going down the tree, always selecting the left most link, until a leaf node is reached. This leaf node should contain the minimum element stored in the tree. Evidently, the time complexity of this algorithm is O(logn) for a 2-3 tree with n leaves.

MINIMUM(r)

If (r is empty) return failure;
If (r is a leaf) return value(r);
Else return MINIMUM(child1(r)).

The operation MAXIMUM can be similarly implemented to run in time O(logn).

Delete

When we delete a leaf from a 2-3 tree, we may leave its parent v with only one child. If v is the root, delete v and let its lone child be the new root. Otherwise, let p be the parent of v. If p has another child, adjacent to v on either the right or the left, and that child of p has three children, we can transfer the proper one of those three to v. Then v has two children, and we are done.

If the children of p adjacent to v have only two children, transfer the lone child of v to an adjacent sibling of v, and delete v. Should p now have only one child, repeat all the above, recursively, with p in place of v.

Summarizing these discussions together, we get the algorithm DELETE, as shown below, where procedure DELETE() is merely a driver for sub-procedure DEL() in which the actual work is done.

The variables done and 1son in DEL() are boolean flags used to indicate successful deletion and to detect the case when a node in the tree has only one child, respectively.

In the worst case we need to traverse a path in the tree from root to a leaf to locate the node to be deleted, then from that leaf node to the root, in case that every non-leaf node on the path has only two children in the original 2-3 tree T. Thus the time complexity of DELETE algorithm for a 2-3 tree with n leaves is O(logn).

Algorithm DELETE(r, x)

  1. If (r == Null) return failure;
  2. If (r is a leaf)

    If (x == l(r)) return Null;
    Else return failure;

  3. DEL(r, x, done, 1son);
  4. If (done == false) return failure;
  5. If (1son == true) return child1(r);
    ELSE return r.

Algorithm DEL(r, x, done, 1son)

  1. done = true;
  2. If (r is a parent of leaves) process properly and return;
    \\ i.e., delete x if it is in the tree; update done and 1son;
  3. If (x <= l(r)) r’ = child1(r);
    Else if (x <= m(r)) or (child3(r) == Null) r’ = child2(r);
    Else r’ = child3(r);
  4. DEL(r’, x, done’, 1son’);
  5. If (done’ == false) done = false; return;
  6. If (1son’ == true)

    If (r has at least 4 grandchildren)

    reorganize the grandchildren of r so that each of r and its children has either 2 or 3 children
    1son = false

    Else make r a 1-child node (with 3 grandchildren);

    1son = true; return.

Splice

Splice(T1, T2) will merge two trees into one. all values in T1 is no greater than those in T2. It can be done in O(logn).

Suppose the number of nodes on T1 and T2 are n1 and n2, then the height of the two trees are h1 and h2, where the h1, h2 are in the order of logn1 and logn2. There are three possible cases:

  1. h1 = h2:
    we can simply add one node as the root of the new tree, whose left child is T1, right child is T2. We choose the key values of root. The operation needed is O(1)
  2. h1 < h2:
    The basic idea is to insert T1 as the subtree of T2. We take the leftest path from the root of T2 to the node N1. The height of N1 in T2 is h1+1.
    • If N1 has only two children, we can insert T1 as the leftest child of N1. Then done. It takes O(h2-h1) operation.
    • If N1 has already three children, N1 will have four children after the insertion of T1. We need to split the N1 into two new nodes. If N1 has no parent, the tree grows in height by one, the height of new tree is h2+1 now; if N1’s parent becomes full, we split the parent; otherwise we are done. This split takes O(h2-h1). So totally it’s still O(h2-h1).
  3. h1 > h2:
    Similar to case 2.

We need first get the height of T1 and T2 in MAX(O(logn1),O(logn2)). Then merging takes no more than MAX(O(logn1),O(logn2)).

Split

Split(x, T) will split tree T into 2 trees according to x value. So that all values in left tree is less than or equal to x; all values in right tree is no less than x. It can be done in O(logn). Basically, it’s search and split along path.

We march from the root T to the key x, the node we pass are n1, …, nm. We delete the edge bewtween ni and ni+1. So every time deleting an edge, we split one tree into 2 or 3 parts. One is a 2-3 tree, we keep marching on this part. The others are trees whose root’s degree is 1, with a 2-3 tree below the root. We throw away the root, keep the rest as R[i] if that part is on the right of the original tree, keep the rest as L[i] if that part is on the left of original tree. So finally we we come down to the x, we have a sequence of logn trees, L[1],L[2],...,L[k] and R[1],...,R[m]. L and R are in decreasing order of height. This marching takes O(logn).

Furthermore, we can merge all trees in L and R in increasing order of height. Each merging/splicing takes O(h2h1), the difference of the height. Then merge two L and R trees. Total time is O(logn).

Find first k largest

DFS, time is O(logn+k).

Node data structure

  1. key values: k1, k2, k3
  2. pointers to subtree: p1, p2, p3
    • if p1 is null, it means this is a leaf node.

Why do we prefer a 2-3 tree instead of 2-4, 2-5?

  1. 2-5 tree can waste space. More than half space can be empty. We want fill as many as possible in a node
    • 2-3 tree: A little bit more than half
  2. 2-4 tree cannot hold a set with 3 or 5 … leaves

Heap 0919

A heap is a complete binary tree with possibly some rightmost leaves removed.

  • At most one with only one child
  • Father’s value cannot be larger than its children’s - min-heap
    • The largest element must be a leaf

Operations

Insert

Insert to the right of the rightmost leaf, then swim it up (swap larger father with it). It takes O(logn).

Delete

Give the pointer of the node to delete, put the rightmost leaf at that position, swim it up or sink it down.

Search

Heap cannot do search in O(logn).

Implementation

Store the heap in an array; maintain the position of end, lastIndex.

  • If index starting with 1, two sons of father i are 2i and 2i+1.
  • If index starting with 0, two sons of father i are 2i+1 and 2i+2.

Reference

This is my class note of CSCE 629 in Texas A&M taught by Dr. Jianer Chen. Credit to him.