Time: 2023-09-04 22:18:01 | Source: Website Operations
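kennethreitz releases requests-html, billed as an HTML parsing library designed for humans
kennethreitz, the author of the requests library, has designed a new library, requests-html. It currently has 9,195 stars.
Install it with:
pip install requests-html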
Page 1: https://book.douban.com/tag/小说
Page 2: https://book.douban.com/tag/小说?start=20&type=T
Page 3: https://book.douban.com/tag/小说?start=40&type=T
Page 4: https://book.douban.com/tag/小说?start=60&type=T
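The pattern in these URLs is that `start` grows by 20 with each page. A minimal sketch of generating them follows. It assumes 20 books per page (inferred from the offsets above) and that `start=0` serves the same content as the bare page 1 URL; `page_url` is a hypothetical helper name, not part of any library.

```python
BASE = 'https://book.douban.com/tag/小说'
PAGE_SIZE = 20  # assumption: 20 books per page, inferred from the start offsets


def page_url(page_number):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page_number < 1:
        raise ValueError('page_number must be >= 1')
    start = (page_number - 1) * PAGE_SIZE
    return f'{BASE}?start={start}&type=T'


# Print the first four page URLs
for n in range(1, 5):
    print(n, page_url(n))
```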
from bs4 import BeautifulSoup
import requests

# URL template: start advances by 20 for each page
base = 'https://book.douban.com/tag/小说?start={page}&type=T'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}

for i in range(100):
    url = base.format(page=i*20)
    resp = requests.get(url, headers=headers)
    # parse each page's HTML
    bsObj = BeautifulSoup(resp.text, 'html.parser')
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://book.douban.com/tag/小说')
for html in r.html:
    print(html)
<HTML url='https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'>
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://python.org/')
r
<Response [200]>
r.text[:50]
'<!doctype html>\n<!--[if lt IE 7]> <html class="n'
r.content[:50]
b'<!doctype html>\n<!--[if lt IE 7]> <html class="n'
r.html
<HTML url='https://www.python.org/'>
# a mix of absolute and relative URLs
print(len(r.html.links))
list(r.html.links)[:5]
119
['/success-stories/category/arts/', 'https://kivy.org/', 'https://www.python.org/psf/codeofconduct/', 'http://www.scipy.org', 'https://docs.python.org/3/license.html']
print(len(r.html.absolute_links))
list(r.html.absolute_links)[:5]
119
['https://kivy.org/', 'https://www.python.org/psf/codeofconduct/', 'http://www.scipy.org', 'https://jobs.python.org', 'https://docs.python.org/3/license.html']
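To see what separates the two outputs, here is a small standard-library sketch of resolving relative links against the page URL. It uses `urllib.parse.urljoin` purely for illustration and makes no claim about how requests-html builds `absolute_links` internally; the sample links are taken from the output above.

```python
from urllib.parse import urljoin

base_url = 'https://www.python.org/'
links = [
    '/success-stories/category/arts/',  # relative: needs the base URL
    'https://kivy.org/',                # already absolute: left unchanged
    'http://www.scipy.org',
]

# Resolve every link against the page it was found on
absolute = [urljoin(base_url, link) for link in links]
for link in absolute:
    print(link)
```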
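Open https://pythonclock.org/ and we see a countdown timer. The countdown is built into the page, but in the HTML returned by a plain request the clock element is still empty:
from requests_html import HTMLSession

session = HTMLSession()
r2 = session.get('https://pythonclock.org/')
r2.html.search('Python 2.7 will retire in...{}Enable Guido Mode')[0]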
'</h1>\n </div>\n <div class="python-27-clock"></div>\n <div class="center">\n <div class="guido-button-block">\n <button class="js-guido-mode guido-button">'
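The search call above takes a template in which `{}` marks the text to capture. As a rough illustration of that idea only, the following sketch approximates it with the standard `re` module. `simple_search` is a hypothetical helper that handles only anonymous `{}` fields, and the sample `page` string is a shortened, made-up stand-in for the real HTML.

```python
import re


def simple_search(template, text):
    """Rough stand-in: each '{}' captures the shortest run of text between the literals."""
    parts = template.split('{}')
    pattern = '(.*?)'.join(re.escape(part) for part in parts)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.groups() if match else None


# Shortened sample text for illustration, not the real page source
page = 'Python 2.7 will retire in...</h1><div class="python-27-clock"></div>Enable Guido Mode'
result = simple_search('Python 2.7 will retire in...{}Enable Guido Mode', page)
print(result[0])
```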
r2.html.render()
r2.html.search('Python 2 will retire in only {months} months!')
'</h1>\n </div>\n <div class="python-27-clock is-countdown"><span class="countdown-row countdown-show6"><span class="countdown-section"><span class="countdown-amount">1</span><span class="countdown-period">Year</span></span><span class="countdown-section"><span class="countdown-amount">2</span><span class="countdown-period">Months</span></span><span class="countdown-section"><span class="countdown-amount">28</span><span class="countdown-period">Days</span></span><span class="countdown-section"><span class="countdown-amount">16</span><span class="countdown-period">Hours</span></span><span class="countdown-section"><span class="countdown-amount">52</span><span class="countdown-period">Minutes</span></span><span class="countdown-section"><span class="countdown-amount">46</span><span class="countdown-period">Seconds</span></span></span></div>\n <div class="center">\n <div class="guido-button-block">\n <button class="js-guido-mode guido-button">'
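# the countdown elements live in the rendered pythonclock.org page (r2)
periods = [element.text for element in r2.html.find('.countdown-period')]
amounts = [element.text for element in r2.html.find('.countdown-amount')]
# pair each period label with its amount
countdown_data = dict(zip(periods, amounts))
countdown_data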
{'Year': '1', 'Months': '2', 'Days': '5', 'Hours': '23', 'Minutes': '34', 'Seconds': '37'}
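The amounts come back as strings because they are element text. If numbers are needed, one possible follow-up is to convert them while pairing, as in this small sketch. The two lists are copied from the output above so it runs without a network request; the variable name `countdown` is only illustrative.

```python
# Values copied from the output above, so no page fetch is needed
periods = ['Year', 'Months', 'Days', 'Hours', 'Minutes', 'Seconds']
amounts = ['1', '2', '5', '23', '34', '37']

# zip pairs items by position; int() turns each amount into a number
countdown = {period: int(amount) for period, amount in zip(periods, amounts)}
print(countdown)
```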
r.html.find('#about')
[<Element 'li' aria-haspopup='true' class=('tier-1', 'element-1') id='about'>]
about = r.html.find('#about', first=True)
about
<Element 'li' aria-haspopup='true' class=('tier-1', 'element-1') id='about'>
r = session.get('https://github.com/')
htmlObj = r.html
htmlObj.xpath('a', first=True)
<Element 'a' class=('btn', 'ml-2') href='https://help.github.com/articles/supported-browsers'>