Python Web Crawler Code

新星源码网, February 24

python
import requests
from bs4 import BeautifulSoup

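# URL of the page to crawl
url = 'http://example.com'

# Send a GET request to fetch the page; the timeout keeps the call from hanging indefinitely
response = requests.get(url, timeout=10)

# Check the response status code
if response.status_code == 200:
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract whatever you need based on the page structure here
    # For example, find all <a> tags
    links = soup.find_all('a')

    # Print every link
    for link in links:
        print(link.get('href'))

else:
    print("Failed to retrieve the webpage")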

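In this example, we first use the requests library to send a GET request and fetch the content of the specified page. We then parse the HTML with BeautifulSoup so that the information we need is easy to extract. Here, we find all <a> tags and print their href attributes, that is, the link addresses.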
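The href values printed by the example are often relative paths, and an <a> tag without an href yields None. As a minimal sketch, assuming you want absolute URLs, the standard library's urllib.parse.urljoin can resolve each value against the page URL. The helper name absolute_links and the sample hrefs are hypothetical and not part of the original example.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve href values against base_url, skipping missing ones.

    Hypothetical helper for illustration; hrefs is a list of values such as
    those returned by link.get('href'), which may contain None.
    """
    resolved = []
    for href in hrefs:
        if not href:  # <a> tags without an href yield None
            continue
        resolved.append(urljoin(base_url, href))
    return resolved


if __name__ == "__main__":
    # Made-up href values standing in for a real page's links
    sample = ["/about", "news/today.html", None, "https://www.iana.org/"]
    for link in absolute_links("http://example.com/docs/", sample):
        print(link)
```

In the crawler itself, the same helper could be called as absolute_links(url, [link.get('href') for link in links]).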

When using a web crawler, please make sure …

python
import requests
from bs4 import BeautifulSoup

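# URL of the page to crawl
url = 'http://example.com'

# Send a GET request to fetch the page; the timeout keeps the call from hanging indefinitely
response = requests.get(url, timeout=10)

# Check the response status code
if response.status_code == 200:
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the title (soup.title is None when the page has no <title> tag)
    title = soup.title.text if soup.title else ''
    print("Title:", title)

    # Extract the content: append the text of every <p> tag, one paragraph per line
    content_paragraphs = soup.find_all('p')
    content = ''
    for paragraph in content_paragraphs:
        content += paragraph.text.strip() + '\n'

    print("Content:", content)

else:
    print("Failed to retrieve the webpage")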

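In this example, we extract the page title with soup.title.text, then use soup.find_all('p') to find all <p> tags, which represent paragraphs. We then concatenate these paragraphs to form the text content of the page, and finally print the title and the content.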

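Note that page structures differ from site to site, so you need to adjust the code to the target site's HTML structure to make sure it extracts the information you need.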
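To make the point about adapting to a site's structure concrete, here is a sketch that parses an inline HTML string instead of a live page. The markup, the class names (sidebar, article, post-title, post-body) and the function name extract_article are all made up for illustration; a real site will need different selectors.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a real page (assumption for illustration)
SAMPLE_HTML = """
<html><body>
  <div class="sidebar"><p>Advertisement</p></div>
  <div class="article">
    <h1 class="post-title">Hello</h1>
    <div class="post-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </div>
</body></html>
"""


def extract_article(html):
    """Return (title, paragraphs) using selectors written for SAMPLE_HTML."""
    soup = BeautifulSoup(html, 'html.parser')

    # select_one returns None when nothing matches, so guard before reading text
    title_tag = soup.select_one('div.article h1.post-title')
    title = title_tag.get_text(strip=True) if title_tag else ''

    # Only paragraphs inside the article body, not the sidebar
    body_paragraphs = soup.select('div.article div.post-body p')
    paragraphs = [p.get_text(strip=True) for p in body_paragraphs]
    return title, paragraphs


if __name__ == "__main__":
    title, paragraphs = extract_article(SAMPLE_HTML)
    print("Title:", title)
    print("Content:", "\n".join(paragraphs))
```

Compared with soup.find_all('p'), the narrower selector leaves out the sidebar paragraph, which is the kind of adjustment a specific site usually calls for.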
