Python Web Scraping (Part 4)

Three Forms of Information Markup

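XML: the earliest general-purpose information markup language. It is highly extensible but verbose, and is well suited to exchanging and transferring information over the Internet.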

<person>
    <firstName>Tian</firstName>
    <lastName>Song</lastName>
    <address>
        <streetAddr>No. 5 Zhongguancun South Street</streetAddr>
        <city>Beijing</city> <zipcode>100081</zipcode>
    </address>
    <prof>Computer System</prof><prof>Security</prof>
</person>
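As a minimal sketch of how a program might read such a record, Python's standard-library `xml.etree.ElementTree` can parse it. The text below is a trimmed copy of the record above, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <person> record, kept short for illustration
xml_text = """
<person>
    <firstName>Tian</firstName>
    <lastName>Song</lastName>
    <address>
        <zipcode>100081</zipcode>
    </address>
    <prof>Computer System</prof><prof>Security</prof>
</person>
"""

# fromstring parses the text and returns the root <person> element
person = ET.fromstring(xml_text)

first_name = person.findtext("firstName")
zipcode = person.findtext("address/zipcode")      # path into the nested element
profs = [p.text for p in person.findall("prof")]  # repeated tags become a list

print(first_name, zipcode, profs)
```

Every value comes back as a string: XML itself attaches no types to element text, so any conversion (for example `int(zipcode)`) is left to the program.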

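JSON: values are typed, which makes it well suited to processing by programs (it originates from JavaScript), and it is more concise than XML. It fits communication between an application's cloud side and its nodes, but it has no comment syntax.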

{
    "firstName": "Tian",
    "lastName": "Song",
    "address": {
        "streetAddr": "No. 5 Zhongguancun South Street",
        "city": "Beijing",
        "zipcode": "100081"
    },
    "prof": ["Computer System", "Security"]
}
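To illustrate what "typed" means in practice, here is a small sketch using the standard-library `json` module on a trimmed copy of the record. Note that `json.loads` only accepts straight double quotes around keys and strings; the variable names are illustrative.

```python
import json

# Trimmed copy of the person record, kept short for illustration
json_text = """
{
    "firstName": "Tian",
    "lastName": "Song",
    "address": {"zipcode": "100081"},
    "prof": ["Computer System", "Security"]
}
"""

person = json.loads(json_text)

# JSON objects become dicts, arrays become lists, strings become str
address_type = type(person["address"]).__name__
prof_type = type(person["prof"]).__name__
zipcode = person["address"]["zipcode"]

print(address_type, prof_type, zipcode)
```

Because the zipcode is quoted in the source, it stays a string; an unquoted `100081` would be parsed as an integer instead.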

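YAML: values carry no explicit type markers, the proportion of plain text is high, and readability is good. It supports comments and is well suited to configuration files for all kinds of systems.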

firstName: Tian
lastName: Song
address:
    streetAddr: No. 5 Zhongguancun South Street
    city: Beijing
    zipcode: 100081
prof:
    - Computer System
    - Security

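# Extract all URL links from an HTML page
from bs4 import BeautifulSoup
import requests

url = 'http://python123.io/ws/demo.html'
try:
    # A timeout keeps the script from hanging on an unresponsive server
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # raise an exception for 4xx/5xx responses
    soup = BeautifulSoup(r.text, 'html.parser')

    # Extract the href attribute of every <a> tag
    for link in soup.find_all('a'):
        print(link.get('href'))
except requests.RequestException:
    # Catch only network/HTTP errors instead of using a bare except
    print("Scraping failed")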

<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list-type object that holds the search results.

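name: the search string for tag names
attrs: the search string for tag attribute values; a specific attribute can be named as a keyword
recursive: whether to search all descendants; defaults to True. Set it to False to search only the direct children of the current node
string: the search string for the text region inside <>...</>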
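The parameters above can be tried offline. The sketch below uses a small hypothetical HTML snippet (not the demo page itself) so that it runs without network access; the URLs and variable names are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page, used so the example runs without network access
html = """
<html><body>
<p class="title"><b>Demo page</b></p>
<p class="course">Courses:
<a href="http://example.com/basic" id="link1">Basic Python</a> and
<a href="http://example.com/advanced" id="link2">Advanced Python</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# name: match by tag name
a_names = [tag.name for tag in soup.find_all("a")]

# attrs: a bare string as the second argument is matched against class
course_count = len(soup.find_all("p", "course"))

# keyword argument: match a named attribute, here with a regular expression
link_ids = [tag["id"] for tag in soup.find_all(id=re.compile("link"))]

# recursive=False: only the direct children of the node are examined
body = soup.body
direct_links = body.find_all("a", recursive=False)       # <a> sits inside <p>
direct_paragraphs = body.find_all("p", recursive=False)  # <p> is a direct child

# string: match the text inside tags; the result holds the strings themselves
python_strings = soup.find_all(string=re.compile("Python"))

print(a_names, course_count, link_ids)
print(len(direct_links), len(direct_paragraphs), python_strings)
```

The `recursive=False` case shows why such a search can come back empty: the `<a>` tags are grandchildren of `<body>`, so only a search over all descendants finds them.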

<tag>(...) is equivalent to <tag>.find_all(...)
soup(...) is equivalent to soup.find_all(...)
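A quick way to convince yourself of this equivalence is to compare both call forms on the same document. The snippet below is a sketch over a hypothetical one-line HTML string.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, just large enough to contain two links
html = '<p><a href="/one">one</a> <a href="/two">two</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object directly behaves like soup.find_all
same_on_soup = soup("a") == soup.find_all("a")

# The same shorthand works on any tag object
p = soup.p
same_on_tag = p("a") == p.find_all("a")

hrefs = [a.get("href") for a in soup("a")]
print(same_on_soup, same_on_tag, hrefs)
```

The shorthand is convenient in interactive sessions; in longer scripts the explicit `find_all` spelling is arguably easier to read.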

import requests
import re
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, 'html.parser')

    print("Find <a> tags:\n", soup.find_all('a'))
    print("Find <a> and <b> tags:\n", soup.find_all(['a', 'b']))

    print("Show all tags:\n", soup.find_all(True))
    print("----")
    for tag in soup.find_all(True):
        print(tag.name)
    print("----")
    # Tags whose name contains the letter 'b' (for example b and body)
    for tag in soup.find_all(re.compile('b')):
        print(tag.name)

    print("<p> tags with class 'course':\n", soup.find_all('p', 'course'))
    print("Tags with id='link1':\n", soup.find_all(id="link1"))
    print("Exact match on id='link':\n", soup.find_all(id="link"))
    print("Regex match on id containing 'link':\n", soup.find_all(id=re.compile('link')))

    print("Search direct children only, not all descendants:\n", soup.find_all('a', recursive=False))

    print("Exact match on string 'Basic Python':\n", soup.find_all(string="Basic Python"))
    print("Regex match on strings containing 'python':\n", soup.find_all(string=re.compile('python')))
except requests.RequestException:
    # Catch only network/HTTP errors instead of using a bare except
    print("Scraping failed")

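Search method	Description
<>.find_all()	Searches and returns all results as a list
<>.find()	Searches and returns only one result; takes the same parameters as .find_all()
<>.find_parents()	Searches among ancestor nodes; returns a list
<>.find_parent()	Returns one result from among the ancestor nodes
<>.find_next_siblings()	Searches among following sibling nodes; returns a list
<>.find_next_sibling()	Returns one result from among the following sibling nodes
<>.find_previous_siblings()	Searches among preceding sibling nodes; returns a list
<>.find_previous_sibling()	Returns one result from among the preceding sibling nodes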
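The methods in the table can be compared side by side on a small document. This is a sketch over a hypothetical HTML snippet; the ids and variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three sibling links nested inside <p> inside <div>
html = """
<div id="box"><p id="intro">
<a id="link1">first</a><a id="link2">second</a><a id="link3">third</a>
</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find: returns only the first match instead of a list
first_match = soup.find("a")["id"]

# Start from the middle link and search relative to it
middle = soup.find("a", id="link2")

# Ancestors are returned nearest first
parent_names = [t.name for t in middle.find_parents(["p", "div"])]
nearest_div = middle.find_parent("div")["id"]

# Siblings after and before the starting node
next_ids = [t["id"] for t in middle.find_next_siblings("a")]
prev_id = middle.find_previous_sibling("a")["id"]

print(first_match, parent_names, nearest_div, next_ids, prev_id)
```

Passing a tag name to the sibling methods also skips the whitespace strings that sit between tags, which would otherwise show up as siblings.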
