Time: 2023-05-20 10:36:02 | Source: Website Operations
Python crawler: scraping and analyzing Weibo data. The spider:

import json
import scrapy
from weibo.items import WeiboItem
from bs4 import BeautifulSoup

class weibo_spider(scrapy.Spider):
    name = "weibo"
    start_urls = ["https://m.weibo.cn/api/container/getIndex?uid=1927305954&t=0&luicode=10000011&lfid=100103type%3D1%26q%3D%E6%88%90%E6%9E%9C&type=uid&value=1927305954&containerid=1076031927305954"]
    url = "https://m.weibo.cn/api/container/getIndex?uid=1927305954&t=0&luicode=10000011&lfid=100103type%3D1%26q%3D%E6%88%90%E6%9E%9C&type=uid&value=1927305954&containerid=1076031927305954&since_id="
    # start_urls = ["https://m.weibo.cn/"]
    allowed_domains = ["weibo.com", "weibo.cn"]

    since_id = ""          # cursor id used to fetch the next page
    created_at = ""        # post date
    text = ""              # post content
    source = ""            # device the post was sent from
    scheme = ""            # link to the original post
    reposts_count = 0      # number of reposts
    textLength = 0         # length of the post text
    comments_count = 0     # number of comments
    attitudes_count = 0    # number of likes

    def parse(self, response):
        text_json = json.loads(response.text)
        self.since_id = text_json.get('data').get('cardlistInfo').get('since_id')
        cards = text_json.get('data').get('cards')
        for it in cards:
            it_son = it.get('mblog')
            if it_son:
                self.created_at = it_son['created_at']
                self.text = it_son['text']
                self.source = it_son['source']
                self.scheme = it['scheme']
                self.reposts_count = it_son['reposts_count']
                self.comments_count = it_son['comments_count']
                self.attitudes_count = it_son['attitudes_count']
                # The scraped text still contains HTML tags; strip them with BeautifulSoup
                soup = BeautifulSoup(str(self.text), "html.parser")
                self.text = soup.get_text()
                # Posts from the current year omit the year, so prepend it
                if len(self.created_at) < 6:
                    self.created_at = "%s%s" % ("2020-", self.created_at)
                self.textLength = len(self.text)
                # Pack the fields into a WeiboItem defined in items.py
                items = WeiboItem(created_at=self.created_at, text=self.text, source=self.source,
                                  scheme=self.scheme, reposts_count=self.reposts_count,
                                  comments_count=self.comments_count, attitudes_count=self.attitudes_count,
                                  textLength=self.textLength)
                yield items
        if not self.since_id:
            return
        # Build the URL of the next JSON page and follow it
        urls = "%s%s" % (self.url, str(self.since_id))
        yield scrapy.Request(urls, callback=self.parse)
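For reference, the JSON that parse() walks through looks roughly like the sketch below. Only the fields the spider actually reads are shown, and all values are invented; the real getIndex response contains many more fields.

# Illustrative shape of the getIndex response that parse() expects (values are made up).
sample_response = {
    "data": {
        "cardlistInfo": {
            "since_id": 4507982742397441          # cursor for the next page
        },
        "cards": [
            {
                "scheme": "https://m.weibo.cn/status/...",   # link to the original post
                "mblog": {
                    "created_at": "05-20",        # current-year posts omit the year
                    "text": "<span>post body with HTML tags</span>",
                    "source": "iPhone 11",
                    "reposts_count": 10,
                    "comments_count": 5,
                    "attitudes_count": 100
                }
            }
        ]
    }
}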
Scrapy's items.py file:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class WeiboItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    since_id = scrapy.Field()         # cursor id used to fetch the next page
    created_at = scrapy.Field()       # post date
    text = scrapy.Field()             # post content
    source = scrapy.Field()           # device the post was sent from
    scheme = scrapy.Field()           # link to the original post
    reposts_count = scrapy.Field()    # number of reposts
    textLength = scrapy.Field()       # length of the post text
    comments_count = scrapy.Field()   # number of comments
    attitudes_count = scrapy.Field()  # number of likes
Next comes writing the data into the database (pipelines.py):

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
import json

class WeiboPipeline(object):
    account = {
        'user': 'root',
        'password': '*******',
        'host': 'localhost',
        'database': 'python'
    }

    def mysqlConnect(self):
        connect = pymysql.connect(**self.account)
        return connect

    def __init__(self):
        self.connect = self.mysqlConnect()  # connect to the database
        self.cursor = self.connect.cursor(cursor=pymysql.cursors.DictCursor)
        #### To write to JSON instead:
        # self.fp = open("xiaofuren.json", 'w', encoding='utf-8')

    def insertMsg(self, scheme, text, source, reposts_count, comments_count,
                  attitudes_count, textLength, created_at):
        try:
            # Use a parameterized query so pymysql handles quoting and escaping
            self.cursor.execute(
                "INSERT INTO weibo VALUES(%s, %s, %s, %s, %s, %s, %s, %s)",
                (scheme, text, source, reposts_count, comments_count,
                 attitudes_count, textLength, created_at)
            )
            self.connect.commit()
        except Exception as e:
            print("insert_sql error: " + str(e))

    def open_spider(self, spider):
        print("Spider started ******************")

    def process_item(self, item, spider):
        self.insertMsg(
            item['scheme'], item['text'], item['source'], item['reposts_count'],
            item['comments_count'], item['attitudes_count'], item['textLength'],
            item['created_at'])
        return item
        #### To write to JSON instead:
        # item_json = json.dumps(dict(item), ensure_ascii=False)
        # self.fp.write(item_json + '\n')
        # return item

    def close_spider(self, spider):
        print("Spider finished ***************")
        print("Data written successfully")
        self.cursor.close()
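The pipeline assumes a weibo table with eight columns already exists and that WeiboPipeline is registered in settings.py. Here is a minimal setup sketch; the column types are my own assumption, since the original post does not show the table definition.

# One-off table setup, run once before crawling.
# The column order must match the INSERT in WeiboPipeline.insertMsg().
import pymysql

connect = pymysql.connect(user='root', password='*******', host='localhost', database='python')
with connect.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS weibo (
            scheme          VARCHAR(255),
            text            TEXT,
            source          VARCHAR(64),
            reposts_count   INT,
            comments_count  INT,
            attitudes_count INT,
            textLength      INT,
            created_at      VARCHAR(32)
        )
    """)
connect.commit()
connect.close()

# settings.py: enable the pipeline so Scrapy actually calls it.
# ITEM_PIPELINES = {'weibo.pipelines.WeiboPipeline': 300}

After that, the crawl can be started with "scrapy crawl weibo" from the project directory.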
The analysis script then reads the stored posts back out of MySQL:

import datetime
import pymysql

account = {
    'user': 'root',
    'password': 'zhaobo123..',
    'host': 'localhost',
    'database': 'python'
}

def mysqlConnect(account):
    connect = pymysql.connect(**account)
    return connect

def getMessage(cursor, month, day, year, phone, dianzan, zhuanfa, pinlun, textLength, dates):
    sql = 'select * from weibo ORDER BY created_at'
    cursor.execute(sql)
    row = cursor.fetchall()
    Day = {}    # dictionaries make it easy to count the posts sent per day/month/year
    Year = {}
    Month = {}
    for i in range(1, 32):
        Day[i] = 0
    for i in range(1, 13):
        Month[i] = 0
    for i in range(2013, 2021):
        Year[i] = 0
    for it in row:
        date = datetime.datetime.strptime(it['created_at'].strip(), "%Y-%m-%d")
        Year[date.year] += 1
        Day[date.day] += 1
        Month[date.month] += 1
        phone.append(it['source'])
        dianzan.append(it['attitudes_count'])
        zhuanfa.append(it['reposts_count'])
        pinlun.append(it['comments_count'])
        textLength.append(it['textLength'])
        dates.append(it['created_at'])
    for i in range(1, 32):
        day.append(Day[i])
    for i in range(1, 13):
        month.append(Month[i])
    for i in range(2013, 2021):
        year.append(Year[i])

if __name__ == '__main__':
    month = []       # posts per month
    year = []        # posts per year
    day = []         # posts per day of the month
    phone = []       # device types
    dianzan = []     # like counts
    zhuanfa = []     # repost counts
    pinlun = []      # comment counts
    textLength = []  # post lengths
    dates = []       # post dates
    connect = mysqlConnect(account)
    cursor = connect.cursor(cursor=pymysql.cursors.DictCursor)
    getMessage(cursor, month, day, year, phone, dianzan, zhuanfa, pinlun, textLength, dates)
The code is commented, so I won't explain it line by line. The charts are drawn with pyecharts inside the same __main__ block:

from pyecharts import options as opts
from pyecharts.charts import Bar

    # Posts per day of the month
    xday = []
    for i in range(1, 32):
        xday.append(i)
    bar = (
        Bar()
        .add_xaxis(xday)
        .add_yaxis("每天发送的微博", day)
        .set_global_opts(title_opts=opts.TitleOpts(title="狗哥发微博统计"))
    )
    bar.render(path='day.html')

    # Posts per month
    xmonth = []
    for i in range(1, 13):
        xmonth.append(i)
    bar = (
        Bar()
        .add_xaxis(xmonth)
        .add_yaxis("每月发送的微博", month)
        .set_global_opts(title_opts=opts.TitleOpts(title="狗哥发微博统计"))
    )
    bar.render(path='month.html')

    # Posts per year
    xyear = []
    for i in range(2013, 2021):
        xyear.append(i)
    bar = (
        Bar()
        .add_xaxis(xyear)
        .add_yaxis("每年发送的微博", year)
        .set_global_opts(title_opts=opts.TitleOpts(title="狗哥发微博统计"))
    )
    bar.render(path='year.html')
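The analysis script also collects the device list (phone), which can be charted the same way. A small sketch, assuming it runs in the same __main__ block after getMessage(); the Counter aggregation and the Pie chart are my addition, not part of the original post.

    from collections import Counter
    from pyecharts.charts import Pie

    # Aggregate the source/device strings collected by getMessage()
    # and render their distribution as a pie chart.
    phone_pairs = [list(kv) for kv in Counter(phone).items()]
    pie = (
        Pie()
        .add("发微博的设备", phone_pairs)
        .set_global_opts(title_opts=opts.TitleOpts(title="狗哥发微博统计"))
    )
    pie.render(path='phone.html')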
Keywords: analysis, data, crawler