
awk, sed and cut: content-manipulation techniques in the shell



awk is a powerful tool. It can deal with rows and columns at the same time, and it provides many C-like functions (printf, substr, and so on).

Its basic pattern is awk 'BEGIN {print "start"} pattern {commands} END {print "end"}' file. BEGIN and END are optional; they are actions run before and after the main processing, respectively.

Built-in variables

  • NR: number of the current row (record)
  • NF: number of fields in the current row; the default field delimiter is whitespace
  • $0: content of the current row
  • $1: content of the first field
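A quick demonstration of these variables (the input lines here are made up for illustration):

```shell
# For each row: row number, field count, first field, then the whole row.
printf 'Tom 90\nAmy 85\n' | awk '{print NR, NF, $1, "|", $0}'
# 1 2 Tom | Tom 90
# 2 2 Amy | Amy 85
```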


  1. Execute the BEGIN { } block.
  2. Process the input:
    • Read a row of content (from a file or stdin).
    • If the row matches pattern, execute {commands}; otherwise skip it. If no pattern is given, execute {commands} for every row.
    • Repeat until the input is exhausted.
  3. Execute the END { } block.
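All three phases can be seen in a single command; this sketch feeds awk a few invented lines and matches those starting with a:

```shell
printf 'apple\nbanana\napricot\n' |
awk 'BEGIN {print "start"} /^a/ {print "match:", $0} END {print "end"}'
# start
# match: apple
# match: apricot
# end
```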


  • awk '{print $1}' student.csv: Print the first field
  • awk '/Tom/ {print $2}' student.csv: If the line contains Tom, print the second field
  • awk -F ',' '{print $NF}' student.csv: Set the delimiter to ,; print the last field
  • awk '{s+=$3} END {print s}' student.csv: Calculate the sum of column 3 (a non-numeric header simply contributes 0)
  • awk 'BEGIN {getline; print $0} {s+=$3} END {print s}' student.csv: Skip the header line (consume it with getline); calculate the sum of column 3
  • awk 'END{print NR}' file: Get how many lines
  • awk -F"," 'BEGIN{getline} max < $3 {max = $3; maxline=$0} END{print maxline}' student.csv: Calculate the max of column 3; print this line
  • awk -F"," 'BEGIN{OFS=","} {tmp=$3; $3=$4; $4=tmp; print $0}' student.csv: Swap column 3 and column 4. OFS is Output Field Separator, space by default.
  • awk 'BEGIN {getline; print "id," $0} {print NR-1 "," $0}' student.csv: Add a column showing row number
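Combining several of these ideas, one way to compute the average of column 3 while skipping the header (the CSV content below is invented; adjust the column to your file):

```shell
# Skip the header with NR > 1, accumulate column 3, print the mean at EOF.
printf 'name,dept,score\nTom,CS,90\nAmy,EE,80\n' |
awk -F',' 'NR > 1 {s += $3; n++} END {if (n) print s / n}'
# 85
```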


sed is a stream editor. It can print, delete and substitute text. Its basic format is sed [options] commands [file-to-edit]. commands is the key component; each command follows the pattern [addr]X[options]. file-to-edit is the file to be edited; sed can also read stdin.

  • addr specifies the range of rows to modify, e.g. the 1st row, or rows 3 to 100. It can also be given as a regular expression.
  • X is a single-character sed command, e.g. p (print), d (delete), s (substitute).
  • options are options for X, e.g. g with the s command means global.
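Reading one concrete command through this anatomy: in 2,3s/o/0/g, the address 2,3 restricts the edit to lines 2 through 3, X is the s command, and the option g makes it global within each line:

```shell
printf 'foo\nfoo\nfoo\nfoo\n' | sed '2,3s/o/0/g'
# foo
# f00
# f00
# foo
```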


By default sed echoes every input line after processing it; -n suppresses this, so only lines printed explicitly (e.g. with p) appear.

  • sed '' filename: Like cat
  • sed -n '1p' filename: Print the first line
  • sed -n '10,20p' filename: Print lines 10 to 20
  • sed -n '10,+9p' filename: Print 10 lines starting from line 10
  • sed -n '1~2p' filename: Print every second line starting from line 1 (lines 1, 3, 5, ...)
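The first~step address picks every step-th line starting at first (a GNU sed extension), so 1~2p prints the odd-numbered lines:

```shell
printf 'a\nb\nc\nd\ne\n' | sed -n '1~2p'
# a
# c
# e
```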


  • sed '1d' filename: Delete the first line
  • sed -i '1d' filename: In-place, modify the file directly
  • sed -i.bak '1d' filename: In-place but do backup first
  • sed '2,10d' filename: Delete lines 2 to 10
  • sed '/^$/d' filename: Delete blank lines
  • sed '/^foo/d' filename: Delete lines starting with foo
  • sed '/ERROR/!d' filename: Delete lines not containing ERROR; ! negates the address (quote the script so the shell leaves ! alone)
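A small check of the negated address (input invented): only the line containing ERROR survives /ERROR/!d:

```shell
printf 'ok 1\nERROR: disk full\nok 2\n' | sed '/ERROR/!d'
# ERROR: disk full
```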


The format is sed 's/regex/replacement/' filename. We can specify range before s as well.

  • sed 's/this/This/' filename: Substitute only the first occurrence on each line
  • sed 's/this/This/g' filename: g, global
  • sed 's/this/This/2' filename: Substitute the second occurrence in each matched row
    • echo "thisthisthis" | sed 's/this/This/2'
  • sed -n 's/this/This/2p' filename: Print only the lines where a substitution happened
  • sed 's/this/This/i' filename: i, case insensitive
  • sed -e 's/this/This/' -e 's/that/That/' filename: Multiple sed commands
  • sed -e 's/this/This/' -e 's/that/That/' filename: Multiple sed
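An address range composes with s just as it does with p and d; this sketch substitutes only on the first two lines:

```shell
printf 'this cat\nthis dog\nthis bird\n' | sed '1,2s/this/This/'
# This cat
# This dog
# this bird
```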


See the tutorials below.

  1. The Basics of Using the Sed Stream Editor to Manipulate Text in Linux
  2. sed, a stream editor


cut can do some simple manipulations on csv files.


  • -d: field delimiter
  • -f: fields


  • cut -d ',' -f1 filename: Get the first column
  • cut -d ',' -f1,3 filename: Get the first and the third columns
  • cut -d':' -f2-4 filename: Get the second to the fourth columns with delimiter :
  • cut -d ',' -f3 --complement filename: Get all columns other than the third
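cut fits naturally into pipelines; here it pulls the user-name field out of /etc/passwd-style records (data inlined for illustration):

```shell
printf 'root:x:0:0\ndaemon:x:1:1\n' | cut -d':' -f1
# root
# daemon
```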

Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2019. All rights reserved by melonskin. Powered by Jekyll.