A Simple Python Web Scraper Example

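Scraping data from web pages with Python is one of the most common forms of crawling. Web pages are generally written in HTML and contain a large number of tags, and the content we want sits inside those tags. So besides knowing basic Python syntax, you also need a basic understanding of HTML structure and of how to select tags. The example below scrapes the fl novel site to introduce you to the world of web scraping.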

I. Implementation Steps

1.1 Installing dependencies

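Dependency for fetching page content

Use import requests. If the package is not installed yet, run pip install requests in the terminal and pip will download and install it.

Dependencies for parsing content

Common choices are BeautifulSoup, parsel, re, and so on. re is part of the Python standard library and needs no installation.

As in the step above, install any missing package from the terminal.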

Install BeautifulSoup

pip install bs4

Install parsel

pip install parsel
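Before running the scraper, it can help to check which of these packages are still missing. The snippet below is a minimal sketch using only the standard library; the `required` list is an assumption based on the packages named in this article.

```python
import importlib.util

# Import names used in this article; "bs4" is the import name of BeautifulSoup
required = ['requests', 'bs4', 'parsel']

# find_spec returns None when a top-level module cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print('Missing packages, install with: pip install ' + ' '.join(missing))
else:
    print('All dependencies are installed.')
```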

Importing the packages

from bs4 import BeautifulSoup
import requests
import parsel
import re

1.2 Fetching data

The simplest case: fetch a page and read its text

response = requests.get(url).text

Many sites require a logged-in user. In that case, disguise the request with headers; the values can be copied from the browser's developer tools (F12). A short version carries only the cookie and the browser identifier:

headers = {
    'Cookie': 'cookie, not a real one',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

A fuller version copies the other request headers as well:

headers = {
    'Host': 'www.qidian.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate'
}

Fetch the page with the disguised headers

response = requests.get(url=url, headers=headers).text
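When many pages are fetched with the same headers, a `requests.Session` can carry them for every request. The following is a minimal sketch, not part of the original article: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration, and the header value is a placeholder.

```python
import requests

# Placeholder value; copy real ones from the browser's developer tools
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36',
}

# A Session sends the same headers on every request and keeps cookies between them
session = requests.Session()
session.headers.update(headers)


def fetch_html(url, timeout=10):
    """Return the page text, raising requests.HTTPError on a 4xx/5xx status."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `raise_for_status()` makes a blocked or missing page fail loudly instead of silently saving an error page as chapter text.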

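Some sites also run into SSL certificate problems, in which case you need to set proxies as well. The address below points at a local proxy on port 9000; replace it with your own.

proxies = {
    'http': 'http://127.0.0.1:9000',
    'https': 'http://127.0.0.1:9000'
}
response = requests.get(url=url, headers=headers, proxies=proxies).text

1.3 Parsing data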

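There are several ways to parse the data, such as XPath, CSS selectors, and re.

XPath locates nodes by their path in the document tree.

CSS selectors, as the name suggests, locate content through the HTML tags, together with their classes and ids.

re parses the text with regular expressions.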

1.4 Writing to a file

with open(titleName + '.txt', mode='w', encoding='utf-8') as f:
    f.write(content)

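The open function opens the file for I/O. The with statement closes the file for you, so you do not have to close the stream manually; it is similar to try-with-resources in Java, where the stream is declared in try().

The first argument is the file name, mode is the open mode ('w' writes and overwrites an existing file), and encoding is the text encoding. open accepts more parameters, which you can explore on your own.

write writes the content to the file.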
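Chapter titles scraped from a page may contain characters that are not valid in file names, such as `?` or `/`. The sketch below wraps the write step in two helper functions; `safe_filename`, `save_chapter`, and the default folder name `novel` are hypothetical names introduced for this illustration.

```python
import re
from pathlib import Path


def safe_filename(title):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return cleaned or 'untitled'


def save_chapter(title, content, folder='novel'):
    """Write one chapter to <folder>/<title>.txt and return the path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / (safe_filename(title) + '.txt')
    # mode='w' overwrites an existing file; mode='a' would append instead
    with open(path, mode='w', encoding='utf-8') as f:
        f.write(content)
    return path
```

A call such as `save_chapter('Chapter 1: What?', text)` would then write to `novel/Chapter 1_ What_.txt`.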

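II. Complete Example

import requests
import parsel

# Start URL of the novel; the real address is not given for legal reasons
link = 'novel start URL, not given here for legal reasons'
# Browser-like headers, as described in section 1.2
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'}

# Fetch the table of contents and collect the chapter links
link_data = requests.get(url=link).text
link_selector = parsel.Selector(link_data)
href = link_selector.css('.DivTr a::attr(href)').getall()

for index in href:
    # The links are protocol-relative, so prepend the scheme
    url = f'https:{index}'
    print(url)
    response = requests.get(url, headers=headers)
    html_data = response.text
    selector = parsel.Selector(html_data)
    # Extract the chapter title and the paragraphs of the chapter body
    title = selector.css('.c_l_title h1::text').get()
    content_list = selector.css('div.noveContent p::text').getall()
    content = '\n'.join(content_list)
    # Save each chapter to its own text file
    with open(title + '.txt', mode='w', encoding='utf-8') as f:
        f.write(content)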

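The example above can fetch the free chapters of the fl novel site. What about the paid chapters?

Paid chapters are served as images. In principle, you would locate the image and use Baidu Cloud to recognize the text in it. Scraping paid content is illegal, so the code for this part cannot be provided.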

Author: 天道佩恩
Link: https://juejin.cn/post/7385350484411056154
