Introduction
This is a high-performance, multi-threaded web crawler written in C++ on top of raw sockets. It speaks TCP and both HTTP/1.0 and HTTP/1.1, and it can crawl and parse roughly one million URLs in a minute.
The code can be found at https://github.com/melonskin/multi-thread-web-crawler.
Design
main() is the entry point of the program. It decides whether the run is a single request or multiple requests based on the number of command-line arguments. The rest of this description focuses on the multiple-request case.
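As a rough illustration of that dispatch (the helper functions here are made-up stand-ins, not the repository's actual code), main() might look like this:

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-ins for the single-URL and multi-threaded paths.
static int crawlSingleUrl(const char* /*url*/)                   { return 0; }
static int crawlFromFile(int /*threads*/, const char* /*path*/)  { return 0; }

int main(int argc, char* argv[]) {
    if (argc == 2) {
        // One argument: a single URL to fetch and parse.
        return crawlSingleUrl(argv[1]);
    }
    if (argc == 3) {
        // Two arguments: thread count and input file, i.e. the multi-request case.
        return crawlFromFile(atoi(argv[1]), argv[2]);
    }
    fprintf(stderr, "usage: p1-web-client.exe <url> | <threads> <input-file>\n");
    return 1;
}
```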
Utility functions live in Util. First, the program reads the input file and stores the URLs in a queue. Then a SharedData object is created to hold the queue and the other information shared among threads.
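The exact layout of SharedData is in the repository; as a sketch, it could be as simple as the queue plus shared counters and the locks that guard them (the field names below are illustrative guesses):

```cpp
#include <mutex>
#include <queue>
#include <string>
#include <unordered_set>

// Illustrative sketch of a SharedData object; the real class may differ.
struct SharedData {
    std::queue<std::string> urlQueue;            // URLs read from the input file
    std::mutex queueMutex;                       // guards urlQueue
    std::unordered_set<std::string> seenHosts;   // host de-duplication
    std::mutex statsMutex;                       // guards the counters below
    long extracted = 0, crawled = 0, linksFound = 0;
    int activeThreads = 0;                       // crawling threads still running
};
```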
There are two types of threads in this program. One is the stat thread, which prints statistics every 2 seconds; the others are crawling threads. Each crawling thread repeatedly takes a URL from the queue and parses it with the UrlParser tool, then issues the robots request and the GET request, each of which is a single function call. Statistics are updated along the way, and the returned HTML content is parsed with HTMLParser to extract link information.
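A simplified version of one crawling thread's loop, with the UrlParser / MySock / HTMLParser calls replaced by stub helpers so the sketch stands on its own, could look like this:

```cpp
#include <mutex>
#include <queue>
#include <string>

// Stubs standing in for UrlParser, MySock, and HTMLParser; the real classes
// do the actual parsing and network I/O.
static bool sendRobotsRequest(const std::string& /*url*/)                 { return true; }
static bool sendGetRequest(const std::string& /*url*/, std::string& html) { html = "<html></html>"; return true; }
static int  parseLinks(const std::string& /*html*/)                       { return 0; }

// One crawling thread: pull URLs until the queue is empty, then return.
void crawlerLoop(std::queue<std::string>& urls, std::mutex& queueMutex,
                 std::mutex& statsMutex, long& pagesCrawled, long& linksFound) {
    for (;;) {
        std::string url;
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            if (urls.empty()) break;                  // queue drained: thread exits
            url = urls.front();
            urls.pop();
        }
        // The real code parses the URL with UrlParser here before connecting.
        if (!sendRobotsRequest(url)) continue;        // robots check must pass first
        std::string html;
        if (!sendGetRequest(url, html)) continue;     // then fetch the page itself
        int nLinks = parseLinks(html);                // HTMLParser extracts links

        std::lock_guard<std::mutex> lock(statsMutex); // update shared statistics
        ++pagesCrawled;
        linksFound += nLinks;
    }
}
```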
Once every URL in the queue has been consumed, crawling threads with no remaining work exit. The stat thread waits until all crawling threads have finished, and a summary of the whole run is then printed.
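The shutdown sequence can be sketched with standard C++ threads (the actual project targets Windows and may use Win32 threading primitives instead):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// The stat thread keeps reporting every 2 seconds until no crawlers remain.
void statLoop(std::atomic<int>& activeThreads, std::atomic<long>& pagesCrawled) {
    while (activeThreads.load() > 0) {
        std::this_thread::sleep_for(std::chrono::seconds(2));
        printf("active threads %d, crawled %ld\n", activeThreads.load(), pagesCrawled.load());
    }
}

int main() {
    std::atomic<int>  activeThreads(0);
    std::atomic<long> pagesCrawled(0);

    std::vector<std::thread> crawlers;
    for (int i = 0; i < 4; ++i) {
        ++activeThreads;
        crawlers.emplace_back([&] {
            // ... crawl until the queue is empty (see the loop sketch above) ...
            --activeThreads;                  // signal that this thread is done
        });
    }

    std::thread stat(statLoop, std::ref(activeThreads), std::ref(pagesCrawled));
    for (auto& t : crawlers) t.join();        // wait for every crawling thread
    stat.join();                              // stat thread exits once the count hits 0
    printf("Crawled %ld pages in total\n", pagesCrawled.load()); // final summary
    return 0;
}
```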
The core work is done by the MySock class, which provides methods such as Connect() and Send() for carrying out an HTTP request.
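The exact interface of MySock is in the repository; from the description, its shape is roughly the following (only Connect() and Send() are named in the text, the rest is guessed for illustration):

```cpp
#include <string>

// Rough sketch of the MySock wrapper around a raw TCP socket.
class MySock {
public:
    bool Connect(const std::string& host, int port);  // DNS lookup + TCP connect
    bool Send(const std::string& request);            // write the HTTP request
    bool Read(std::string& response);                  // read the full response
    void Close();                                      // tear down the connection
private:
    int socketHandle_ = -1;   // SOCKET on Windows, plain fd elsewhere
    std::string buffer_;      // accumulated response bytes
};
```

A caller would Connect() to the host from the parsed URL, Send() the GET (or robots) request, Read() the response, and finally Close().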
The program is layered to avoid tedious, unreadable code: low-level operations are wrapped in basic methods, so programmers only need to work with the high-level functions built on top of them.
The implementation also follows the "don't repeat yourself" principle.
Note
This program can dechunk responses sent with HTTP/1.1 chunked transfer encoding (a sketch of the idea follows the sample run below). Test it with a single URL as the command-line argument. A test run looks like this:
PS C:\...\Release> .\p1-web-client.exe http://www.eecs.mit.edu
URL: http://www.eecs.mit.edu
Parsing URL... host www.eecs.mit.edu, port 80, request /
Doing DNS... done in 20 ms, found 18.62.0.96
* Connecting on page... done in 73 ms
Loading... done in 306 ms with 42657 bytes
Verifying header... status code 200
Dechunking... body size was 42134, now 42121
+ Parsing page... done in 0 ms with 126 links
--------------------------------------------------
HTTP/1.1 200 OK
Date: Fri, 09 Feb 2018 06:30:59 GMT
Server: Apache/2.2.31 (Unix) mod_ssl/2.2.31 OpenSSL/1.0.1e-fips PHP/5.3.29
X-Content-Type-Options: nosniff
X-Powered-By: PHP/5.3.29
X-Drupal-Cache: HIT
Etag: "1518156591-0"
Content-Language: en
X-Generator: Drupal 7 (http://drupal.org)
Cache-Control: public, max-age=3600
Last-Modified: Fri, 09 Feb 2018 06:09:51 GMT
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: Cookie
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
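The "Dechunking" step in the run above strips the hexadecimal chunk-size lines out of a Transfer-Encoding: chunked body, which is why the body shrinks from 42134 to 42121 bytes. A minimal sketch of that idea (not the project's actual code, and it ignores trailer headers) looks like this:

```cpp
#include <cstdlib>
#include <string>

// Decode an HTTP/1.1 chunked body: each chunk is "<hex size>\r\n<data>\r\n",
// terminated by a zero-size chunk. Trailers are ignored in this sketch.
std::string dechunk(const std::string& body) {
    std::string out;
    size_t pos = 0;
    while (pos < body.size()) {
        size_t lineEnd = body.find("\r\n", pos);
        if (lineEnd == std::string::npos) break;                 // malformed input
        unsigned long size =
            strtoul(body.substr(pos, lineEnd - pos).c_str(), nullptr, 16);
        if (size == 0) break;                                    // last chunk reached
        pos = lineEnd + 2;                                       // skip size line CRLF
        if (pos + size > body.size()) break;                     // truncated chunk
        out.append(body, pos, size);                             // keep the chunk data
        pos += size + 2;                                         // skip data + CRLF
    }
    return out;
}
```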
Trace
Here is a sample trace of processing 1M URLs with 5000 threads.
PS C:\U...\x64\Release> .\p1-web-client.exe 5000 .\MURL-input-1M.txt
Opened .\MURL-input-1M.txt with size 66152005
[ 2] 4607 Q 974850 E 25154 H 4606 D 0 I 0 R 0 C 0 L 0K
*** crawling 0.0 pps @ 0.0 Mbps
[ 4] 5000 Q 906885 E 93119 H 14104 D 12594 I 9328 R 2560 C 1327 L 28K
*** crawling 664.0 pps @ 62.9 Mbps
[ 6] 5000 Q 838380 E 161624 H 24282 D 22528 I 17242 R 5576 C 4497 L 153K
*** crawling 1584.5 pps @ 267.3 Mbps
[ 8] 5000 Q 756405 E 243599 H 35185 D 32993 I 25434 R 8732 C 7647 L 309K
*** crawling 1575.0 pps @ 326.0 Mbps
[ 10] 5000 Q 696888 E 303116 H 45352 D 42551 I 32761 R 11729 C 10707 L 462K
*** crawling 1530.5 pps @ 341.2 Mbps
[ 12] 5000 Q 654731 E 345273 H 52377 D 49019 I 37634 R 13825 C 12933 L 574K
*** crawling 1112.5 pps @ 236.8 Mbps
[ 14] 5000 Q 605692 E 394312 H 58835 D 55153 I 42185 R 15691 C 14840 L 680K
*** crawling 953.5 pps @ 206.6 Mbps
[ 16] 5000 Q 553339 E 446665 H 65485 D 61491 I 46798 R 17544 C 16765 L 767K
*** crawling 963.5 pps @ 191.5 Mbps
[ 18] 5000 Q 498188 E 501816 H 71903 D 67600 I 51238 R 19294 C 18534 L 854K
*** crawling 883.5 pps @ 191.1 Mbps
[ 20] 5000 Q 445940 E 554064 H 77964 D 73306 I 55199 R 20857 C 20144 L 924K
*** crawling 805.0 pps @ 160.7 Mbps
[ 22] 5000 Q 401189 E 598815 H 84589 D 79563 I 59021 R 22426 C 21751 L 1010K
*** crawling 803.5 pps @ 161.2 Mbps
[ 24] 5000 Q 347486 E 652518 H 91911 D 86287 I 62202 R 23692 C 23078 L 1070K
*** crawling 663.5 pps @ 128.9 Mbps
[ 26] 5000 Q 299165 E 700839 H 99064 D 92603 I 65133 R 24840 C 24273 L 1131K
*** crawling 597.5 pps @ 121.4 Mbps
[ 28] 5000 Q 240205 E 759799 H 107349 D 100383 I 68476 R 26076 C 25544 L 1180K
*** crawling 635.5 pps @ 109.8 Mbps
[ 30] 5000 Q 177613 E 822391 H 116634 D 109165 I 72147 R 27513 C 26930 L 1234K
*** crawling 693.0 pps @ 125.3 Mbps
[ 32] 5000 Q 105144 E 894860 H 126443 D 118464 I 76058 R 29002 C 28396 L 1294K
*** crawling 733.0 pps @ 140.7 Mbps
[ 34] 5000 Q 32757 E 967247 H 136733 D 128401 I 80301 R 30577 C 29894 L 1368K
*** crawling 749.0 pps @ 154.8 Mbps
[ 36] 3262 Q 0 E 1000004 H 139300 D 131362 I 81628 R 31314 C 30879 L 1423K
*** crawling 492.5 pps @ 112.0 Mbps
[ 38] 2624 Q 0 E 1000004 H 139300 D 131381 I 81630 R 31420 C 31041 L 1439K
*** crawling 81.0 pps @ 29.6 Mbps
[ 40] 2203 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31432 C 31175 L 1445K
*** crawling 67.0 pps @ 15.1 Mbps
[ 42] 1859 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31434 C 31256 L 1447K
*** crawling 40.5 pps @ 7.4 Mbps
[ 44] 1607 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31264 L 1449K
*** crawling 4.0 pps @ 3.7 Mbps
[ 46] 1353 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31266 L 1449K
*** crawling 1.0 pps @ 0.1 Mbps
[ 48] 1115 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31267 L 1449K
*** crawling 0.5 pps @ 0.0 Mbps
[ 50] 863 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31268 L 1449K
*** crawling 0.5 pps @ 0.0 Mbps
[ 52] 591 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31269 L 1449K
*** crawling 0.5 pps @ 0.1 Mbps
[ 54] 305 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31270 L 1449K
*** crawling 0.5 pps @ 0.0 Mbps
[ 56] 4 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31270 L 1449K
*** crawling 0.0 pps @ 0.0 Mbps
[ 58] 1 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31270 L 1449K
*** crawling 0.0 pps @ 0.0 Mbps
[ 60] 0 Q 0 E 1000004 H 139300 D 131387 I 81634 R 31435 C 31270 L 1449K
*** crawling 0.0 pps @ 0.0 Mbps
Extracted 1000004 URLs @ 16485/s
Looked up 139300 DNS names @ 2296/s
Downloaded 81634 robots @ 1345/s
Crawled 31270 pages @ 515/s (773.55 MB)
Parsed 1449288 links @ 23892/s
HTTP codes: 2xx = 21987, 3xx = 3784, 4xx = 5305, 5xx = 194, other = 0