Basics of HTML parsing with BeautifulSoup (finding and inspecting tags)

Notes to self on how to parse HTML in Python.

Installing BeautifulSoup4

Both packages can be installed with pip:

pip install beautifulsoup4
pip install lxml

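lxml is an HTML/XML parser that BeautifulSoup can use as its parsing backend. It is optional (Python's built-in html.parser also works), but lxml is fast, so it is worth using unless you have a specific reason not to.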
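As a rough illustration of the parser choice described above, the sketch below tries lxml first and falls back to the standard-library html.parser when lxml is not installed. The helper name make_soup is hypothetical and only used for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html: str) -> BeautifulSoup:
    """Parse html with lxml when available, else with the built-in html.parser.

    make_soup is a hypothetical helper name, not part of bs4.
    """
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.get_text())  # hello
```

The two parsers can build slightly different trees for malformed HTML, so it may be safer to pin one parser per project rather than rely on the fallback in production code.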

Parsing

Parse an HTML string, for example one fetched with requests:

from bs4 import BeautifulSoup

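html_str = "<html><body><p>Hello</p></body></html>"  # placeholder: any HTML string, e.g. requests.get(url).text
soup = BeautifulSoup(html_str, "lxml")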

print(type(soup)) # <class 'bs4.BeautifulSoup'>

Extracting tags

BeautifulSoup can search for tags with find and find_all.
The following example runs against https://example.com:

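import requests

html = requests.get("https://example.com").text  # fetch the page first; the URL string itself is not HTML
soup = BeautifulSoup(html, "lxml")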

p_tag = soup.find("p")

p_tags = soup.find_all("p")

print("p_tag")
print(p_tag)

print("p_tags")
print(p_tags)

The output is as follows:

p_tag
<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>

p_tags
[<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>, <p><a href="https://www.iana.org/domains/example">More information...</a></p>]

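Use find for tags that appear only once in a document (such as body), and find_all in most other cases.
find returns the first match as a bs4.element.Tag (or None when nothing matches); find_all returns a list-like collection of Tag objects.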
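The difference between find and find_all, including what each returns when nothing matches, can be sketched as follows. The HTML string is made up for illustration, and html.parser is used here only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = "<html><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")            # first match only
all_p = soup.find_all("p")          # every match, in document order
missing = soup.find("table")        # no match: None
no_tables = soup.find_all("table")  # no match: empty result

print(first_p.get_text())               # first
print([p.get_text() for p in all_p])    # ['first', 'second']
print(missing is None, len(no_tables))  # True 0
```

Because find can return None, it is usually worth checking the result before accessing attributes or text on it.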

Inspecting tags

Attributes of an extracted tag can be accessed with dictionary-style indexing (although the Tag object itself is not a dict).

a_tag = soup.find("a")

print(a_tag["href"]) # 'https://www.iana.org/domains/example'

This indexing style cannot list all attributes at once the way dict methods such as keys do.
To inspect every attribute as a dictionary, use .attrs.

print(a_tag.attrs) # {'href': 'https://www.iana.org/domains/example'}
print(type(a_tag.attrs)) # <class 'bs4.element.XMLAttributeDict'>

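.attrs supports dict methods such as keys. The exact class printed above depends on the installed bs4 version; older releases return a plain dict.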
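A small sketch of working with attributes through .attrs and Tag.get is shown below. The HTML snippet is made up for illustration, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = '<a href="https://example.com/" class="btn primary" id="top">link</a>'
a_tag = BeautifulSoup(html, "html.parser").find("a")

# List every attribute name through the .attrs mapping.
print(sorted(a_tag.attrs.keys()))  # ['class', 'href', 'id']

# Tag.get returns None for a missing attribute; a_tag["title"] would raise KeyError.
print(a_tag.get("href"))   # https://example.com/
print(a_tag.get("title"))  # None

# class is a multi-valued attribute, so its value is a list of class names.
print(a_tag["class"])      # ['btn', 'primary']
```

Using get is a reasonable default when scraping pages where an attribute may or may not be present.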