January 15, 2024

Python Crawler

Web Crawling Ethics

Always check a website's crawler protocol (robots.txt) before crawling it. To view it, append /robots.txt to the domain name.

An excerpt from https://www.google.com/robots.txt:

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static

# AdsBot
User-agent: AdsBot-Google
Disallow: /maps/api/js/
Allow: /maps/api/js
Disallow: /maps/api/place/js/

# Crawlers of certain social media sites are allowed to access page markup when google.com/imgres* links are shared. To learn more, please contact [email protected].
User-agent: Twitterbot
Allow: /imgres
Allow: /search
Disallow: /groups

User-agent: facebookexternalhit
Allow: /imgres
Allow: /search
Disallow: /groups

Sitemap: https://www.google.com/sitemap.xml

The most common directives:

User-agent: *        # applies to all crawlers
Disallow: /search    # don't crawl URLs whose path starts with /search
Allow: /search/about # but /search/about may be crawled
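
The same check can also be done programmatically. Below is a minimal sketch using Python's standard-library urllib.robotparser; the user-agent name "MyCrawler" is only a placeholder.

from urllib.robotparser import RobotFileParser

# Parse the robots.txt excerpted above and ask whether a path may be crawled.
parser = RobotFileParser()
parser.set_url("https://www.google.com/robots.txt")
parser.read()

# "MyCrawler" is a hypothetical user-agent name; /search is disallowed for
# all crawlers, so this prints False.
print(parser.can_fetch("MyCrawler", "https://www.google.com/search"))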


Web Crawling Steps

  1. Acquire data
  2. Parse data
  3. Extract data
  4. Store data

Libraries
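
The sketches in the steps below assume the commonly used requests and BeautifulSoup (bs4) packages, plus the standard-library csv module for storing results; other choices such as lxml, Scrapy, or pandas work just as well.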


Step 1. Acquire data
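
No particular library is specified here; as a sketch, the widely used requests package can download a page. The URL and User-Agent string below are placeholders.

import requests

# Fetch the page; example.com stands in for whatever site you want to crawl.
headers = {"User-Agent": "MyCrawler/0.1"}   # identify your crawler politely
response = requests.get("https://example.com", headers=headers, timeout=10)
response.raise_for_status()   # stop here on 4xx/5xx responses
html = response.text          # raw HTML handed to the parsing step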

Step 2. Parse data
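
A minimal sketch, assuming the html string from Step 1 and the BeautifulSoup (bs4) package:

from bs4 import BeautifulSoup

# Parse the HTML string from Step 1 into a searchable tree.
# "html.parser" is Python's built-in parser; lxml is a faster alternative.
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string if soup.title else "no <title> found")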


Step 3. Extract data
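
A sketch of extraction, assuming the soup tree from Step 2; which tags and attributes to extract depends entirely on the target site, so the selectors below are only illustrative.

# Pull out the pieces you care about.
links = [a.get("href") for a in soup.find_all("a", href=True)]       # hyperlink URLs
headings = [h.get_text(strip=True) for h in soup.select("h1, h2")]   # heading text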


Step 4. Store data
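
A sketch of storage using the standard-library csv module, assuming the links list from Step 3; the filename and column name are placeholders.

import csv

# Write the extracted links to a CSV file.
with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    writer.writerows([link] for link in links)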



About this Post

This post is written by Andy, licensed under CC BY-NC 4.0.

#Python #Crawler