Time: 2023-05-08 05:54:01 | Source: 网站运营 (Website Operations)
A hands-on guide to scraping and storing web page data with Python: https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3
Now launch a Jupyter notebook and run the following code:

```python
import requests

url = 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3'
res = requests.get(url)  # pass the url variable, not the string 'url'
print(res.status_code)
# 200
```
In the code above we did three things: imported the requests library, defined the target URL, and sent a GET request, confirming that the server responded with status code 200. Next, parse the returned HTML:

```python
from bs4 import BeautifulSoup

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.title.text
print(title)
# 热门视频排行榜 - 哔哩哔哩 (゜-゜)つロ 干杯~-bilibili
```
In the code above, we used the BeautifulSoup class from bs4 to convert the HTML string from the previous step into a BeautifulSoup object. Note that you must specify a parser when constructing it; here we use html.parser. Now extract the fields we want from each ranking-list item:

```python
all_products = []

products = soup.select('li.rank-item')
for product in products:
    rank = product.select('div.num')[0].text
    name = product.select('div.info > a')[0].text.strip()
    play = product.select('span.data-box')[0].text
    comment = product.select('span.data-box')[1].text
    up = product.select('span.data-box')[2].text
    url = product.select('div.info > a')[0].attrs['href']

    all_products.append({
        "视频排名": rank,
        "视频名": name,
        "播放量": play,
        "弹幕量": comment,
        "up主": up,
        "视频链接": url
    })
```
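The `select` calls above work the same way on any HTML. Here is a minimal, self-contained sketch; the snippet below is invented to mimic the ranking-list structure, it is not bilibili's actual markup:

```python
from bs4 import BeautifulSoup

# A hypothetical fragment shaped like one ranking-list item
html = """
<ul>
  <li class="rank-item">
    <div class="num">1</div>
    <div class="info"><a href="//example.com/v1">Demo video</a></div>
    <span class="data-box">1000</span>
    <span class="data-box">50</span>
    <span class="data-box">some_up</span>
  </li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
item = soup.select('li.rank-item')[0]   # CSS class selector
rank = item.select('div.num')[0].text
name = item.select('div.info > a')[0].text.strip()  # '>' = direct child
print(rank, name)  # 1 Demo video
```

Running it offline like this is a quick way to debug selectors before pointing them at a live page.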
In the code above, we first called soup.select('li.rank-item'), which returns a list with one entry per video. We then iterated over that list, again using CSS selectors to extract the fields we want, and stored each video as a dictionary in the empty list defined at the start. Finally, write the results to a CSV file:

```python
import csv

keys = all_products[0].keys()
with open('B站视频热榜TOP100.csv', 'w', newline='', encoding='utf-8-sig') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)
```
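csv.DictWriter works identically on any list of dictionaries. A self-contained sketch using an in-memory buffer instead of a file (the sample rows are made up for illustration):

```python
import csv
import io

rows = [
    {"rank": "1", "name": "video A"},
    {"rank": "2", "name": "video B"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, rows[0].keys())
writer.writeheader()          # first line: rank,name
writer.writerows(rows)        # one line per dictionary

# Read it back to verify the round trip
buf.seek(0)
restored = list(csv.DictReader(buf))
print(restored[0]["name"])  # video A
```

The utf-8-sig encoding used in the article matters when the file will be opened in Excel: the BOM it writes keeps Chinese column headers from being garbled.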
If you are familiar with pandas, you can just as easily convert the list of dictionaries into a DataFrame and write it out in a single line:

```python
import pandas as pd

keys = all_products[0].keys()
pd.DataFrame(all_products, columns=keys).to_csv('B站视频热榜TOP100.csv', encoding='utf-8-sig')
```
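A self-contained sketch of the same pattern (sample rows invented for illustration). One detail worth knowing: without `index=False`, `to_csv` also writes pandas' row index as an extra first column:

```python
import io
import pandas as pd

rows = [
    {"rank": "1", "name": "video A"},
    {"rank": "2", "name": "video B"},
]

buf = io.StringIO()
# index=False keeps the row index out of the CSV
pd.DataFrame(rows).to_csv(buf, index=False)

buf.seek(0)
df = pd.read_csv(buf, dtype=str)
print(df.shape)  # (2, 2)
```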
The complete script:

```python
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

url = 'https://www.bilibili.com/ranking?spm_id_from=333.851.b_7072696d61727950616765546162.3'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

all_products = []

products = soup.select('li.rank-item')
for product in products:
    rank = product.select('div.num')[0].text
    name = product.select('div.info > a')[0].text.strip()
    play = product.select('span.data-box')[0].text
    comment = product.select('span.data-box')[1].text
    up = product.select('span.data-box')[2].text
    url = product.select('div.info > a')[0].attrs['href']

    all_products.append({
        "视频排名": rank,
        "视频名": name,
        "播放量": play,
        "弹幕量": comment,
        "up主": up,
        "视频链接": url
    })

keys = all_products[0].keys()
with open('B站视频热榜TOP100.csv', 'w', newline='', encoding='utf-8-sig') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)

### Write the data with pandas
pd.DataFrame(all_products, columns=keys).to_csv('B站视频热榜TOP100.csv', encoding='utf-8-sig')
```