Parsing HTML Pages with the Beautiful Soup Library
pip install beautifulsoup4
First use of the BeautifulSoup library:

```python
import requests
from bs4 import BeautifulSoup

url = "http://python123.io/ws/demo.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    demo = r.text
    soup = BeautifulSoup(demo, 'html.parser')
    print(soup.prettify())
except Exception:
    print("Scraping failed")
```
Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree". It is imported and used as follows:

```python
from bs4 import BeautifulSoup

soup1 = BeautifulSoup("<html>data</html>", 'html.parser')
soup2 = BeautifulSoup(open("D://demo.html"), 'html.parser')
```
BeautifulSoup parsers

| Parser | Usage | Requirement |
|---|---|---|
| bs4's HTML parser | `BeautifulSoup(mk, 'html.parser')` | install bs4 |
| lxml's HTML parser | `BeautifulSoup(mk, 'lxml')` | `pip install lxml` |
| lxml's XML parser | `BeautifulSoup(mk, 'xml')` | `pip install lxml` |
| html5lib's parser | `BeautifulSoup(mk, 'html5lib')` | `pip install html5lib` |
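To see why the parser choice matters, here is a minimal sketch using only the built-in `html.parser` backend; the malformed markup is invented for the example, and the alternative parsers are only named in comments since they need separate installs:

```python
from bs4 import BeautifulSoup

# Deliberately malformed HTML: neither <p> tag is explicitly closed
broken = "<html><body><p>one<p>two</body></html>"

# 'html.parser' ships with bs4 and the standard library; no extra install
soup = BeautifulSoup(broken, 'html.parser')
print(len(soup.find_all('p')))  # both <p> tags are still found

# After `pip install lxml` or `pip install html5lib`, pass 'lxml' or
# 'html5lib' as the second argument; each repairs broken markup differently.
```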
Basic elements of the BeautifulSoup class

| Element | Description |
|---|---|
| Tag | A tag, opened by `<>` and closed by `</>`; returns the tag together with its contents |
| Name | The tag's name, e.g. the name of a `<p>…</p>` tag is 'p' (a string). Access: `<tag>.name` |
| Attributes | The tag's attributes, organized as a dictionary. Access: `<tag>.attrs` |
| NavigableString | The non-attribute string inside a tag (the text between `<>` and `</>`). Access: `<tag>.string` |
| Comment | The comment portion of a tag's string, a special Comment type. Access: `<tag>.string` |
- Tag: any HTML tag can be accessed as `soup.<tag>`; when the document contains multiple tags with the same name, `soup.<tag>` returns the first one.
- NavigableString: can span multiple levels of the tree.
- NavigableString and Comment are both accessed via `.string`; use `type()` to tell a plain non-attribute string from a comment.
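A minimal offline illustration of these three points (no network needed; the markup is made up for the example):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup(
    "<b><!--this is a comment--></b><p>plain text</p><p>second</p>",
    'html.parser')

# soup.<tag> returns only the FIRST matching tag
print(soup.p)               # <p>plain text</p>

# Both .string values print alike, so use type() to tell them apart
print(type(soup.b.string))  # <class 'bs4.element.Comment'>
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>

assert isinstance(soup.b.string, Comment)
assert isinstance(soup.p.string, NavigableString)
assert not isinstance(soup.p.string, Comment)
```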
```python
import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    demo = r.text
    print("Page source:\n", demo)
    soup = BeautifulSoup(demo, 'html.parser')
    print("Formatted tree:\n", soup)
    print("title tag:\n", soup.title)
    print("a tag:\n", soup.a)
    print("Name of the a tag:\n", soup.a.name)
    print("Name of the a tag's parent:\n", soup.a.parent.name)
    print("Name of the a tag's grandparent:\n", soup.a.parent.parent.name)
    tag = soup.a
    print("All attributes of the tag:\n", tag.attrs)
    print("Type of the tag:\n", type(tag))
    print("Type of tag.attrs:\n", type(tag.attrs))
    print("class attribute:\n", tag.attrs['class'])
    print("href attribute:\n", tag.attrs['href'])
    print("a tag:\n", soup.a)
    print("Non-attribute string of the a tag:\n", soup.a.string)
    print("p tag:\n", soup.p)
    print("Non-attribute string of the p tag:\n", soup.p.string)
    print("Type:\n", type(soup.p.string))
    soup2 = BeautifulSoup("<b><!--A This is a comment--></b><p>B This is a comment</p>", 'html.parser')
    print("soup2.b.string:\n", soup2.b.string)
    print("Type:", type(soup2.b.string))
    print("soup2.p.string:\n", soup2.p.string)
    print("Type:", type(soup2.p.string))
except Exception:
    print("Scraping failed")
```
Traversing HTML content with bs4
Downward traversal of the tag tree (the BeautifulSoup object is the root of the tag tree)

| Attribute | Description |
|---|---|
| `.contents` | List of child nodes: all direct children stored in a list (including `'\n'` text nodes) |
| `.children` | Iterator over child nodes, similar to `.contents`; for looping over direct children |
| `.descendants` | Iterator over descendant nodes, containing all descendants; for looping over the whole subtree |
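The difference between the three downward attributes can be seen offline with a hand-written fragment (the markup is invented for the example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>first <b>bold</b></p><p>second</p></div>", 'html.parser')
div = soup.div

# .contents is a plain list of direct children
print(len(div.contents))                      # 2 (the two <p> tags)

# .children is an iterator over the same direct children
print([child.name for child in div.children])  # ['p', 'p']

# .descendants walks the whole subtree, including NavigableStrings
print(sum(1 for _ in div.descendants))
```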
Upward traversal of the tag tree

| Attribute | Description |
|---|---|
| `.parent` | The node's parent tag |
| `.parents` | Iterator over the node's ancestor tags; for looping over ancestors |
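A small offline sketch of upward traversal (the markup is invented for the example); note that the chain ends at the BeautifulSoup object itself, whose own `.parent` is `None` — hence the `None` check in traversal loops:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><div><p>text</p></div></body></html>", 'html.parser')

# .parent climbs one level
print(soup.p.parent.name)   # div

# .parents iterates all the way up; the BeautifulSoup object appears
# last, under the special name '[document]'
print([p.name for p in soup.p.parents])  # ['div', 'body', 'html', '[document]']
```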
Sibling traversal of the tag tree

| Attribute | Description |
|---|---|
| `.next_sibling` | The next sibling node in HTML document order |
| `.previous_sibling` | The previous sibling node in HTML document order |
| `.next_siblings` | Iterator over all following siblings in HTML document order |
| `.previous_siblings` | Iterator over all preceding siblings in HTML document order |
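Siblings are nodes, not necessarily tags: whitespace between tags shows up as a NavigableString sibling, which is why `.next_sibling` often returns `'\n'` first. A minimal offline sketch (markup invented for the example):

```python
from bs4 import BeautifulSoup

# Note the newline between the first two tags: it becomes a text-node sibling
soup = BeautifulSoup("<p>a</p>\n<p>b</p><p>c</p>", 'html.parser')
first = soup.p

print(repr(first.next_sibling))         # '\n'  (a text node, not a tag)
print(first.next_sibling.next_sibling)  # <p>b</p>

# Iterating the whole run of following siblings: '\n', <p>b</p>, <p>c</p>
print(sum(1 for _ in first.next_siblings))  # 3
```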
```python
import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, 'html.parser')
    print(soup)
    print("soup.head:\n", soup.head)
    print("Children of soup.head:\n", soup.head.contents)
    print("Children of soup.body:\n", soup.body.contents)
    print("Length of the child list above:\n", len(soup.body.contents))
    print("soup.body.contents[1]:\n", soup.body.contents[1])
    print("------------------------------------------------------")
    for child in soup.body.children:
        print(child)
    print("------------------------------------------------------")
    for child in soup.body.descendants:
        print("*HZB*", child)
    print("------------------------------------------------------")
    print("soup.title.parent:\n", soup.title.parent)
    print("soup.html.parent:\n", soup.html.parent)
    print("soup.parent:\n", soup.parent)
    for parent in soup.a.parents:
        if parent is None:
            print(parent)
        else:
            print(parent.name)
    print("soup.a.next_sibling:\n", soup.a.next_sibling)
    print("soup.a.next_sibling.next_sibling:\n", soup.a.next_sibling.next_sibling)
    print("soup.a.previous_sibling:\n", soup.a.previous_sibling)
    print("soup.a.previous_sibling.previous_sibling:\n", soup.a.previous_sibling.previous_sibling)
    print("soup.a.parent:\n", soup.a.parent)
    print("----")
    for sibling in soup.a.next_siblings:
        print(sibling)
    print("----")
    for sibling in soup.a.previous_siblings:
        print(sibling)
    print("----")
except Exception:
    print("Scraping failed")
```
bs4's `prettify()` method formats HTML code by inserting `'\n'` after each tag and its content. Python 3 supports UTF-8 by default, and bs4 converts any HTML input to UTF-8 encoding.
```python
import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, 'html.parser')
    print("Formatted code:\n", soup.prettify())
    print("----------")
    print("Formatted single tag:\n", soup.a.prettify())
    print("----------")
    soup2 = BeautifulSoup("<p>中文</p>", 'html.parser')
    print(soup2.p.prettify())
except Exception:
    print("Scraping failed")
```