Time: 2023-10-04 10:42:01 | Source: 网站运营 (Website Operations)
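Data analysis: the homepage of ZCOOL, China's largest design forum

ZCOOL (站酷) is the largest forum for designers in China, covering every kind of design as well as fine art. I scraped pages 1 through 100 of the ZCOOL homepage, because the site does not keep anything past page 100 and no data can be retrieved beyond it.

    import html
    import re
    import time

    import requests
    from lxml import etree

    domain = 'https://www.zcool.com.cn'
    baseurl = "/?p={}#tab_anchor"
    patten = re.compile('<.*?>')  # matches any HTML tag so it can be stripped


    def getpagenum(a):
        """Download homepage page number `a` and return its HTML."""
        url = domain + baseurl.format(a)
        res = requests.get(url)
        time.sleep(4)  # pause between requests to go easy on the server
        return res.text


    def gerpage(c):
        """Parse one page of HTML into a list of dicts, one per card."""
        tree = etree.HTML(c)
        table_row = tree.xpath('//div[@class="card-box"]')
        boards = []
        for row in table_row:
            board = {}
            try:
                # Keys: 类别 = category, 点赞 = likes, 评论 = comments, 作者 = author
                board['类别'] = row.xpath('div[@class="card-info"]/p[@class="card-info-type"]')[0].text
                board['点赞'] = row.xpath('div[2]/p[3]/span[3]')[0].text
                # Serialize the author link, decode its character references,
                # then strip the tags to leave only the visible name.
                name2 = row.xpath('div[3]/span[1]/a')[0]
                name2 = etree.tostring(name2).decode('utf-8')
                name2 = html.unescape(name2)
                name2 = patten.sub('', name2).strip()
                board['评论'] = row.xpath('div[2]/p[3]/span[2]')[0].text
                board['作者'] = name2
            except Exception as err:
                # A card that does not match the expected layout keeps
                # whatever fields were read before the failure.
                # print('error:', err)
                pass
            boards.append(board)
        return boards


    def main():
        n = []
        for i in range(1, 101):  # pages 1 through 100
            c = getpagenum(i)
            page = gerpage(c)
            n.append(page)  # one list of cards per page
        print(n)


    if __name__ == '__main__':
        main()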
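The script only prints the nested list (one list of cards per page), so some aggregation is needed before the data says anything. The sketch below is one possible way to do that and is not part of the original script: the helper names `flatten`, `to_int`, and `summarize` are hypothetical, the sample rows are made up for illustration, and it assumes the like counts arrive as plain integer strings (anything else is skipped).

```python
from collections import Counter, defaultdict


def flatten(pages):
    """Merge the per-page lists into one flat list, dropping empty dicts."""
    return [board for page in pages for board in page if board]


def to_int(text):
    """Parse a count such as ' 30 '; return None if missing or not an integer."""
    try:
        return int(text.strip())
    except (AttributeError, ValueError):
        return None


def summarize(pages, category_key="类别", likes_key="点赞"):
    """Return (cards per category, total likes per category)."""
    boards = flatten(pages)
    counts = Counter(b.get(category_key) for b in boards)
    likes = defaultdict(int)
    for b in boards:
        n = to_int(b.get(likes_key))
        if n is not None:
            likes[b.get(category_key)] += n
    return counts, dict(likes)


# Made-up sample in the same shape as the scraper's output (two pages).
sample_pages = [
    [
        {"类别": "UI", "点赞": "12", "评论": "3", "作者": "a"},
        {"类别": "Illustration", "点赞": " 30 ", "评论": "5", "作者": "b"},
    ],
    [
        {"类别": "UI", "点赞": None, "评论": "0", "作者": "c"},
        {},  # a card whose parsing failed completely
    ],
]

counts, likes = summarize(sample_pages)
print(counts.most_common())
print(likes)
```

Passing the key names as parameters keeps the helper usable if the dictionary keys in the scraper are ever renamed.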
Keywords: forum, design, data