Basics of HTML Parsing with BeautifulSoup (Finding and Inspecting Tags)

Notes to self on how to parse HTML in Python.

Installing BeautifulSoup4

pip install beautifulsoup4
pip install lxml
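
As a quick check after installing, the following sketch prints the installed bs4 version and picks a parser name. The fallback to "html.parser" (the parser bundled with Python) is an assumption added here for environments where lxml is missing; the notes themselves use "lxml" throughout.

```python
import importlib.util

import bs4

# Use lxml when it is importable; otherwise fall back to the standard library parser.
parser = "lxml" if importlib.util.find_spec("lxml") else "html.parser"

print(bs4.__version__)  # installed Beautiful Soup version string
print(parser)           # parser name to pass as the second argument of BeautifulSoup
```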
        

Parsing

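from bs4 import BeautifulSoup

# html_str stands in for any HTML source string; this sample value is only a placeholder.
html_str = "<html><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html_str, "lxml")

print(type(soup)) # <class 'bs4.BeautifulSoup'>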
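
The second argument selects the parser. As a minimal sketch, the example below parses a made-up HTML string with "html.parser" (swapped in here for "lxml" so it runs without extra packages) and reads the page title from the resulting object. The sample document and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html_str = """
<html>
  <head><title>Sample page</title></head>
  <body><p>Hello, world.</p></body>
</html>
"""

# "html.parser" ships with Python; "lxml" can be passed instead when it is installed.
soup = BeautifulSoup(html_str, "html.parser")

# Tags can be reached as attributes of the soup; .string is the text inside the tag.
title_text = soup.title.string
body_text = soup.p.get_text()

print(title_text)
print(body_text)
```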
        

Extracting tags

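from urllib.request import urlopen

# BeautifulSoup parses HTML text and does not fetch URLs, so download the page first.
html = urlopen("https://example.com").read()
soup = BeautifulSoup(html, "lxml")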

p_tag = soup.find("p")

p_tags = soup.find_all("p")

print("p_tag")
print(p_tag)

print("p_tags")
print(p_tags)
        
p_tag
<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>

p_tags
[<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>, <p><a href="https://www.iana.org/domains/example">More information...</a></p>]
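
To make the difference between find and find_all easier to see without a network request, here is a small sketch on a made-up HTML string. It also shows what each call returns when nothing matches, plus the class_ and limit arguments of find_all. The sample markup is an assumption, and "html.parser" is used instead of "lxml" only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html_str = """
<div>
  <p class="lead">First paragraph</p>
  <p>Second paragraph</p>
  <p>Third paragraph</p>
</div>
"""
soup = BeautifulSoup(html_str, "html.parser")

first_p = soup.find("p")                    # first matching tag only
all_p = soup.find_all("p")                  # every matching tag, in document order
lead_p = soup.find_all("p", class_="lead")  # filter by CSS class
two_p = soup.find_all("p", limit=2)         # stop after two matches

missing = soup.find("table")                # no match: find() returns None
missing_all = soup.find_all("table")        # no match: find_all() returns an empty list

print(first_p.get_text())
print(len(all_p), len(lead_p), len(two_p))
print(missing, len(missing_all))
```

Because find returns None when there is no match, it is safer to check the result before indexing into it or calling methods on it.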
        

Inspecting tags

a_tag = soup.find("a")

print(a_tag["href"]) # https://www.iana.org/domains/example
        
print(a_tag.attrs) # {'href': 'https://www.iana.org/domains/example'}
print(type(a_tag.attrs)) # <class 'bs4.element.XMLAttributeDict'>
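
A few related accessors, sketched on a made-up link: get returns None (or a default) for a missing attribute, whereas indexing with [] raises KeyError; a multi-valued attribute such as class comes back as a list; and get_text returns the text inside the tag. The class attribute on the sample link is an assumption added for illustration, and "html.parser" is used in place of "lxml" so the snippet runs without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the class attribute is added only for illustration.
html_str = (
    '<p><a class="external link" href="https://www.iana.org/domains/example">'
    "More information...</a></p>"
)
soup = BeautifulSoup(html_str, "html.parser")
a_tag = soup.find("a")

href = a_tag.get("href")                   # same value as a_tag["href"]
title = a_tag.get("title")                 # missing attribute: None (a_tag["title"] raises KeyError)
title_or_default = a_tag.get("title", "")  # missing attribute with a fallback value
classes = a_tag["class"]                   # multi-valued attribute: a list of class names
text = a_tag.get_text()                    # text content of the tag
name = a_tag.name                          # tag name

print(href, title, classes, text, name)
```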