Public speaking course notes Read "Dynamo, Amazon’s Highly Available Key-value Store" Read "Bigtable, A Distributed Storage System for Structured Data" Read "Streaming Systems" 3, Watermarks Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

A multi-threading high-performance web crawler

2018-04-17

Introduction

This is a multi-threading high-performance web crawler developed with C++, Socket, etc. Multiple network protocols such as TCP, HTTP/1.0, HTTP/1.1 are utilized. 1 million URLs can be crawled and parsed in a minute.

The code can be found at https://github.com/melonskin/multi-thread-web-crawler.

Design

main() is the entry point of our program, it will estimate whether it’s a single request or multiple requests based on number of input. We will focus on the case of multiple requests for the following description.

Useful functions are stored in Util. First, program will read the file and store urls in a queue. Then a SharedData object is created with the queue and other sharing info among threads.

There are two type of threads in this program. One is stat thread for printing stats every 2 seconds. The other is crawling threads. The crawling threads will keep getting URLs from the queue, parse it with our UrlParser tool. Then send robots request and get request, which are simply done by calling functions. The stats will be updated during the process. Returned HTML content will be parsed with HTMLParser to obtain link information.

After all URLs in queue are consumed, crawling threads without running job will be closed. Stat thread will wait for all crawling threads to be closed. Finally, a summary will be printed based on the result collected from entire process.

The core functions is done by MySock class. It’s designed to have multiple methods like Connect() and Send() and so on to process a HTTP request.

This program is designed to be multi-layered in order to avoid tedious and unreadable codes. Programmers just need to focus on high-level functions with the convenience provided by wrapped basic methods.

Also, this implementation follows the principle of “Do not repeat yourself”.

Note

This program is capable of dechunking response in the case of HTTP/1.1. Test it with a single URL as the command line argument. A test run is as below:

PS C:\...\Release> .\p1-web-client.exe http://www.eecs.mit.edu
URL: http://www.eecs.mit.edu
          Parsing URL... host www.eecs.mit.edu, port 80, request /
          Doing DNS... done in 20 ms, found 18.62.0.96
        * Connecting on page... done in 73 ms
          Loading... done in 306 ms with 42657 bytes
          Verifying header... status code 200
          Dechunking... body size was 42134, now 42121
        + Parsing page... done in 0 ms with 126 links

--------------------------------------------------
HTTP/1.1 200 OK
Date: Fri, 09 Feb 2018 06:30:59 GMT
Server: Apache/2.2.31 (Unix) mod_ssl/2.2.31 OpenSSL/1.0.1e-fips PHP/5.3.29
X-Content-Type-Options: nosniff
X-Powered-By: PHP/5.3.29
X-Drupal-Cache: HIT
Etag: "1518156591-0"
Content-Language: en
X-Generator: Drupal 7 (http://drupal.org)
Cache-Control: public, max-age=3600
Last-Modified: Fri, 09 Feb 2018 06:09:51 GMT
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: Cookie
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8

Trace

Here is a sample trace processing 1M URLs with 5000 threads.

PS C:\U...\x64\Release> .\p1-web-client.exe 5000 .\MURL-input-1M.txt
Opened .\MURL-input-1M.txt with size 66152005
[  2] 4607 Q 974850 E   25154 H   4606 D      0 I     0 R     0 C     0 L    0K
      *** crawling  0.0 pps @  0.0 Mbps
[  4] 5000 Q 906885 E   93119 H  14104 D  12594 I  9328 R  2560 C  1327 L   28K
      *** crawling 664.0 pps @ 62.9 Mbps
[  6] 5000 Q 838380 E  161624 H  24282 D  22528 I 17242 R  5576 C  4497 L  153K
      *** crawling 1584.5 pps @ 267.3 Mbps
[  8] 5000 Q 756405 E  243599 H  35185 D  32993 I 25434 R  8732 C  7647 L  309K
      *** crawling 1575.0 pps @ 326.0 Mbps
[ 10] 5000 Q 696888 E  303116 H  45352 D  42551 I 32761 R 11729 C 10707 L  462K
      *** crawling 1530.5 pps @ 341.2 Mbps
[ 12] 5000 Q 654731 E  345273 H  52377 D  49019 I 37634 R 13825 C 12933 L  574K
      *** crawling 1112.5 pps @ 236.8 Mbps
[ 14] 5000 Q 605692 E  394312 H  58835 D  55153 I 42185 R 15691 C 14840 L  680K
      *** crawling 953.5 pps @ 206.6 Mbps
[ 16] 5000 Q 553339 E  446665 H  65485 D  61491 I 46798 R 17544 C 16765 L  767K
      *** crawling 963.5 pps @ 191.5 Mbps
[ 18] 5000 Q 498188 E  501816 H  71903 D  67600 I 51238 R 19294 C 18534 L  854K
      *** crawling 883.5 pps @ 191.1 Mbps
[ 20] 5000 Q 445940 E  554064 H  77964 D  73306 I 55199 R 20857 C 20144 L  924K
      *** crawling 805.0 pps @ 160.7 Mbps
[ 22] 5000 Q 401189 E  598815 H  84589 D  79563 I 59021 R 22426 C 21751 L 1010K
      *** crawling 803.5 pps @ 161.2 Mbps
[ 24] 5000 Q 347486 E  652518 H  91911 D  86287 I 62202 R 23692 C 23078 L 1070K
      *** crawling 663.5 pps @ 128.9 Mbps
[ 26] 5000 Q 299165 E  700839 H  99064 D  92603 I 65133 R 24840 C 24273 L 1131K
      *** crawling 597.5 pps @ 121.4 Mbps
[ 28] 5000 Q 240205 E  759799 H 107349 D 100383 I 68476 R 26076 C 25544 L 1180K
      *** crawling 635.5 pps @ 109.8 Mbps
[ 30] 5000 Q 177613 E  822391 H 116634 D 109165 I 72147 R 27513 C 26930 L 1234K
      *** crawling 693.0 pps @ 125.3 Mbps
[ 32] 5000 Q 105144 E  894860 H 126443 D 118464 I 76058 R 29002 C 28396 L 1294K
      *** crawling 733.0 pps @ 140.7 Mbps
[ 34] 5000 Q  32757 E  967247 H 136733 D 128401 I 80301 R 30577 C 29894 L 1368K
      *** crawling 749.0 pps @ 154.8 Mbps
[ 36] 3262 Q      0 E 1000004 H 139300 D 131362 I 81628 R 31314 C 30879 L 1423K
      *** crawling 492.5 pps @ 112.0 Mbps
[ 38] 2624 Q      0 E 1000004 H 139300 D 131381 I 81630 R 31420 C 31041 L 1439K
      *** crawling 81.0 pps @ 29.6 Mbps
[ 40] 2203 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31432 C 31175 L 1445K
      *** crawling 67.0 pps @ 15.1 Mbps
[ 42] 1859 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31434 C 31256 L 1447K
      *** crawling 40.5 pps @  7.4 Mbps
[ 44] 1607 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31264 L 1449K
      *** crawling  4.0 pps @  3.7 Mbps
[ 46] 1353 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31266 L 1449K
      *** crawling  1.0 pps @  0.1 Mbps
[ 48] 1115 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31267 L 1449K
      *** crawling  0.5 pps @  0.0 Mbps
[ 50]  863 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31268 L 1449K
      *** crawling  0.5 pps @  0.0 Mbps
[ 52]  591 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31269 L 1449K
      *** crawling  0.5 pps @  0.1 Mbps
[ 54]  305 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31270 L 1449K
      *** crawling  0.5 pps @  0.0 Mbps
[ 56]    4 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31270 L 1449K
      *** crawling  0.0 pps @  0.0 Mbps
[ 58]    1 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31270 L 1449K
      *** crawling  0.0 pps @  0.0 Mbps
[ 60]    0 Q      0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31270 L 1449K
      *** crawling  0.0 pps @  0.0 Mbps

Extracted 1000004 URLs @ 16485/s
Looked up 139300 DNS names @ 2296/s
Downloaded 81634 robots @ 1345/s
Crawled 31270 pages @ 515/s (773.55 MB)
Parsed 1449288 links @ 23892/s
HTTP codes: 2xx = 21987, 3xx = 3784, 4xx = 5305, 5xx = 194, other = 0

Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2024. All rights reserved by melonskin. Powered by Jekyll.