The simplest Python web scraper code

新星源码网, February 24

python
import requests
from bs4 import BeautifulSoup

# Define the URL of the page to scrape
url = 'http://example.com'

# Send an HTTP request to fetch the page content
response = requests.get(url)

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Print the parsed result
print(soup.prettify())

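This code calls requests.get() to send an HTTP GET request and fetch the page at the given URL. It then parses the response body with the BeautifulSoup class from the bs4 library. Finally, it prints the parsed result; the prettify() method formats the HTML with indentation so that it is easier to read.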

Extracting specific elements

python
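# Extract the title text (soup.title is None when the page has no <title>)
title = soup.title.text if soup.title else ''
print("Title:", title)

# Extract all links
links = soup.find_all('a')
for link in links:
    print("Link:", link.get('href'))

# Extract paragraph text
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print("Paragraph:", paragraph.text)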
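
The href values returned by link.get('href') are often relative paths, and the value is None when an <a> tag has no href attribute. The following is a minimal sketch of one way to clean them up with the standard library. The helper name absolute_links and the sample values are made up for illustration and are not part of the original example.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, hrefs):
    """Resolve raw href values against base_url, dropping empty and non-HTTP ones."""
    resolved = []
    for href in hrefs:
        if not href:  # link.get('href') returns None when the attribute is missing
            continue
        full_url = urljoin(base_url, href)
        if urlparse(full_url).scheme in ("http", "https"):
            resolved.append(full_url)
    return resolved


# Hypothetical href values, as link.get('href') might return them
sample_hrefs = ["/about", "news/today.html", None,
                "mailto:admin@example.com", "https://example.org/"]
for link_url in absolute_links("http://example.com/blog/", sample_hrefs):
    print("Link:", link_url)
```

In the scraper above, you could pass [link.get('href') for link in links] as the second argument and the page URL as the first.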

Storing data

You can save the scraped data to a file or a database, for example:

python
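# Save the data to a file (set the encoding explicitly so non-ASCII text is written correctly)
with open('output.html', 'w', encoding='utf-8') as file:
    file.write(soup.prettify())

# Save the data to a database
# This requires a suitable database connection and its operations, for example SQLite or MongoDB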
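
For the database case, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The table name links, its two columns, and the helper save_links are assumptions made for this illustration; adapt the schema to whatever you actually scrape.

```python
import sqlite3


def save_links(conn, page_url, hrefs):
    """Insert one row per non-empty link into a hypothetical 'links' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links (page_url TEXT NOT NULL, href TEXT NOT NULL)"
    )
    rows = [(page_url, href) for href in hrefs if href]
    # Parameterized queries keep scraped text from being interpreted as SQL
    conn.executemany("INSERT INTO links (page_url, href) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# ':memory:' keeps the demo self-contained; pass a file name such as 'crawl.db' to persist
conn = sqlite3.connect(":memory:")
saved = save_links(conn, "http://example.com", ["/about", None, "https://example.org/"])
print("Saved rows:", saved)
for row in conn.execute("SELECT page_url, href FROM links"):
    print(row)
conn.close()
```

Creating the table with IF NOT EXISTS lets the same function be called on every page of a crawl without a separate setup step.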

Handling dynamically loaded content

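The requests library only downloads the raw HTML and does not execute JavaScript, so it cannot see content that is loaded dynamically. If you need that content, consider using the Selenium library to drive a real browser.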
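
As a rough sketch of the Selenium approach, the function below loads a page in headless Chrome and returns the HTML after scripts have run. It assumes Selenium 4 and a recent Chrome are installed; the function name fetch_rendered_html, the 10-second wait, and waiting on the <body> tag are illustrative choices rather than requirements.

```python
def fetch_rendered_html(url, wait_seconds=10):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so the rest of a script still loads without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until <body> exists; a real page usually needs a more specific selector
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The returned string can be handed to BeautifulSoup(html, 'html.parser') in the same way as response.text, so the extraction code stays unchanged.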

Setting request headers and proxies

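Some websites expect certain request headers and may block clients that crawl too frequently. You can set request headers to mimic a browser, and you can route requests through a proxy to hide your IP address.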

python
# Set request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}
response = requests.get(url, headers=headers)

# Use a proxy (yourproxy.com is a placeholder; replace it with your proxy's address)
proxies = {
    'http': 'http://yourproxy.com',
    'https': 'https://yourproxy.com',
}
response = requests.get(url, proxies=proxies)
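
Since sites may block clients that request pages too frequently, spacing requests out can also help. The sketch below shows one simple way to enforce a minimum interval between requests using only the standard library. The Throttle class and the 0.2-second interval are assumptions for illustration; an appropriate delay depends on the site.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect min_interval; return seconds slept."""
        now = time.monotonic()
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


throttle = Throttle(min_interval=0.2)  # 0.2 s is only for the demo
for page in ("page-1", "page-2", "page-3"):
    throttle.wait()
    print("Would request", page)  # call requests.get(...) here in a real crawler
```

The first call to wait() returns immediately, and later calls sleep only for whatever part of the interval has not already elapsed.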

Exception handling

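In real-world scraping you may run into network problems, changes in page structure, and other unexpected conditions. You can use exception handling to make the program more robust.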

python
try:
    response = requests.get(url)
    response.raise_for_status()  # Raises HTTPError for 4xx and 5xx status codes
    soup = BeautifulSoup(response.text, 'html.parser')
except requests.exceptions.RequestException as e:
    print("Request failed:", e)
except Exception as e:
    print("Parsing failed:", e)
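
Many network failures are temporary, so beyond catching the exception it can be useful to retry the request a few times. Below is a minimal sketch of a generic retry helper with exponential backoff. The names retry and flaky_fetch, the number of attempts, and the delays are all illustrative assumptions; flaky_fetch merely simulates a request that fails twice.

```python
import time


def retry(func, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func() until it succeeds or attempts run out, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)


# Demo with a stand-in function that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"


print(retry(flaky_fetch, attempts=3, base_delay=0.1, exceptions=(ConnectionError,)))
```

With requests, the same helper could wrap a real call, for example retry(lambda: requests.get(url, timeout=10), exceptions=(requests.exceptions.RequestException,)).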
