In this tutorial, we're going to look at a way to use the BeautifulSoup module to parse HTML in Python.
Parse HTML in Python using BeautifulSoup
Assume that we want to parse a simple HTML file with some different tags and attributes like this:
ozenero ozenero.com
javasampleapproach.com
Programming Tutorials
Java, Javascript, Python Technology
Be bold in stating your key points. Put them in a list:
- The first item in your list
- The second item; italicize key words
Improve your image by including an image.
Add a link to your favorite Web site. Break up your page with a horizontal rule or two.
Finally, link to another page in your own Web site.
© ozenero 2019
BeautifulSoup is a module that allows us to extract data from an HTML page. You will find working with HTML easier with it than with regex. We will:
– be able to use simple methods and Pythonic idioms to search the parse tree, then extract what we need without boilerplate code.
– not have to think about encoding (or only have to specify the original encoding), because BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
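The encoding behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original tutorial: the Latin-1 byte string is a made-up input, and from_encoding is used to state the original encoding explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a page stored as Latin-1 bytes rather than UTF-8.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells BeautifulSoup which encoding the incoming bytes use.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.get_text()    # an ordinary Python (Unicode) str
utf8_bytes = soup.encode()  # serialized back out as bytes, UTF-8 by default

print(text)
print(utf8_bytes)
```

Inside the program we only ever handle str values; bytes appear only at the input and output edges.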
Use BeautifulSoup for parsing HTML data
Install BeautifulSoup module
Open cmd, then run:
pip install beautifulsoup4
Once the installation is successful, we can see the bs4 package folder (next to the beautifulsoup4 metadata folder) at Python\Python[version]\Lib\site-packages.
Now we can import the module by running import bs4.
Create BeautifulSoup object
From response of a website
When our PC is connected to the internet, we can use the requests module to download the HTML file.
Run cmd: pip install requests to install the module.
>>> import requests, bs4
>>> response = requests.get('https://ozenero.com/wp-content/uploads/2019/01/ozenero.html')
>>> response.raise_for_status()
>>> gkzSoup = bs4.BeautifulSoup(response.text, 'html.parser')
>>> type(gkzSoup)
<class 'bs4.BeautifulSoup'>
The raise_for_status() method raises an exception when the server answers with an error status (4xx or 5xx), so our program halts on a bad download instead of parsing an error page.
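To see what "halts" means in practice, here is a hedged sketch. The helper name fetch_soup and the 10-second timeout are assumptions for illustration. The 404 response is built by hand so that the example runs without a network connection; a real failed download would produce a similar Response.

```python
import requests
import bs4


def fetch_soup(url):
    """Download url and parse it (hypothetical helper).

    Raises requests.HTTPError if the server replies with a 4xx/5xx status.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return bs4.BeautifulSoup(response.text, "html.parser")


# Offline illustration: a hand-built Response standing in for a failed download.
fake = requests.Response()
fake.status_code = 404

try:
    fake.raise_for_status()
    halted = False
except requests.HTTPError as err:
    halted = True
    print("download failed:", err)
```

Catching requests.HTTPError lets a script report the failure or retry, rather than stopping with a traceback.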
From HTML file on PC
We can load an HTML file stored on our PC by passing an open File object (or the text read from it) to the bs4.BeautifulSoup() function.
>>> import bs4
>>> gkzFile = open('ozenero.html')
>>> gkzSoup = bs4.BeautifulSoup(gkzFile.read(), 'html.parser')
>>> gkzFile.close()
>>> type(gkzSoup)
<class 'bs4.BeautifulSoup'>
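In a script, a with block is a common way to make sure the file is closed after parsing. The sketch below assumes a UTF-8 file. It writes a tiny stand-in page to a temporary folder (the file name sample.html and its content are made up) so that it runs without the tutorial's ozenero.html.

```python
import os
import tempfile

import bs4

# Write a tiny stand-in page so the sketch runs anywhere.
html = "<html><body><h1 class='gkz-large'>ozenero.com</h1></body></html>"
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# Pass the open File object directly; the with block closes it afterwards.
with open(path, encoding="utf-8") as gkzFile:
    gkzSoup = bs4.BeautifulSoup(gkzFile, "html.parser")

print(type(gkzSoup))
print(gkzSoup.h1.get_text())
```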
Find HTML elements
Once we have a BeautifulSoup object, we can call its select('selector') method, passing a CSS selector string, to search for the elements we need. It returns a list of all matching elements.
Here are some useful selectors:
– select('certain-tag'): all elements whose HTML tag is certain-tag
– select('#certain-id'): the element with an id attribute of certain-id
– select('.certain-class'): all elements that use certain-class as a CSS class
– select('tag-a tag-b'): all tag-b elements that are inside tag-a elements
– select('tag-a > tag-b'): all tag-b elements that are directly inside tag-a elements (without any element in between)
– select('certain-tag[att]'): all certain-tag elements that have an att attribute
– select('certain-tag[att="val"]'): all certain-tag elements that have an att attribute with the value val
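The selectors in this list can be tried on a small document. The markup below is a made-up stand-in rather than the tutorial's ozenero.html, and texts() is a hypothetical helper that returns the text of each match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to exercise each selector form.
html = """
<div id="main">
  <h1 class="gkz-large" site="ozenero.com">ozenero.com</h1>
  <ul>
    <li class="gkz-large">first</li>
    <li>second</li>
  </ul>
  <p><a href="https://ozenero.com">home</a> <a>no href</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element matched by selector."""
    return [el.get_text() for el in soup.select(selector)]


by_tag = texts("li")                        # every <li>
by_id = soup.select("#main")                # the element with id="main"
by_class = texts(".gkz-large")              # every element with that class
descendants = texts("div li")               # <li> anywhere inside a <div>
children = texts("div > li")                # <li> directly inside a <div>
with_attr = texts("a[href]")                # <a> elements that have href
with_value = texts('h1[site="ozenero.com"]')

print(by_tag, by_class, descendants, children, with_attr, with_value)
```

Here 'div > li' matches nothing because each li sits inside the ul, not directly inside the div, while 'div li' still finds both items.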
# select('tag')
>>> gkzSoup.select('li')
[Finally, link to another page in your own Web site.]
# select('.class')
>>> gkzSoup.select('.gkz-large')
[ozenero.com,
Programming Tutorials
Java, Javascript, Python Technology
- The first item in your list
- The second item; italicize key words
ozenero.com,
javasampleapproach.com]
# select('tag[att="val"]')
>>> gkzSoup.select('h1[site="ozenero.com"]')
[ozenero.com]
In addition to the selectors above, we can also build more specific ones, such as select('.certain-class certain-tag'), select('tag-a tag-b tag-c'), select('.class-a .class-b'), and so on.
>>> gkzSoup.select('.gkz-large h2')
[Programming Tutorials]
>>> gkzSoup.select('div ul li')
[Programming Tutorials
Java, Javascript, Python Technology
BeautifulSoup also provides the select_one() method, which returns only the first element that matches the selector (or None if nothing matches).
>>> gkzSoup.select_one('li')
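The difference from select() is easiest to see side by side. This is a small sketch on made-up markup, not the tutorial's file:

```python
from bs4 import BeautifulSoup

html = "<ul><li>first</li><li>second</li></ul>"  # hypothetical markup
soup = BeautifulSoup(html, "html.parser")

all_items = soup.select("li")       # a list of every match
first = soup.select_one("li")       # a single element, not a list
missing = soup.select_one("table")  # nothing matches this selector

print(first.get_text())
print(missing)

# Guard against None before using the result.
label = missing.get_text() if missing is not None else "(no table)"
print(label)
```

Because select_one() can return None, checking the result before calling methods on it avoids an AttributeError on pages that lack the element.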
Parse data from HTML elements
On an HTML element, we:
– use getText() (an alias of get_text()) to get the element's text content.
– read attrs to get a dictionary of the element's attributes.
– use get('attr') to read the value of the element's attr attribute (None if the attribute is missing).
>>> els = gkzSoup.select('div h1')
# [
#   ozenero.com,
#   javasampleapproach.com
# ]
>>> els[0].getText()
'ozenero.com'
>>> els[0].attrs
{'class': ['gkz-large'], 'site': 'ozenero.com'}
>>> els[0].get('class')
['gkz-large']
>>> els[1].get('site')
'javasampleapproach.com'
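Putting these three accessors together, one possible way to collect the data from every matched element into plain dictionaries is sketched below. The markup is an assumed reconstruction modeled on the results shown above, and the key names text, site and classes are arbitrary.

```python
from bs4 import BeautifulSoup

# Assumed markup, modeled on the results shown in this tutorial.
html = """
<div>
  <h1 class="gkz-large" site="ozenero.com">ozenero.com</h1>
  <h1 site="javasampleapproach.com">javasampleapproach.com</h1>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for el in soup.select("div h1"):
    rows.append({
        "text": el.get_text(strip=True),  # text content, whitespace trimmed
        "site": el.get("site"),           # None if the attribute is absent
        "classes": el.get("class", []),   # class is multi-valued, so a list
    })

for row in rows:
    print(row)
```

Passing a default to get(), as with 'class' here, keeps the result shape the same for elements that lack the attribute.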