Python Program to Scrape Website Content

Q: Can BeautifulSoup scrape JavaScript websites?

BeautifulSoup can parse HTML returned by a server, but it cannot run JavaScript. To get content that exists only after JavaScript executes, read the API response that supplies it or use permitted browser automation.

A Python web scraper downloads a page's HTML and extracts selected elements. BeautifulSoup turns the markup into a searchable tree of tags, attributes, text, and links. The workflow is request, validate, parse, select, and save. In web.run, Python executes through Pyodide in the browser, so the scraper's network requests and package loading appear in Run Details.

Python Program to Scrape Website HTML

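from pyodide.http import pyfetch
from bs4 import BeautifulSoup

url = ("https://raw.githubusercontent.com/mdn/learning-area/main/html/"
       "introduction-to-html/document_and_website_structure/index.html")
# pyfetch is asynchronous, so the request is awaited at top level.
response = await pyfetch(url)
# Stop before parsing if the server did not return the page.
if response.status != 200:
    raise RuntimeError(f"HTTP {response.status}")
# Read the body as text and parse it with the built-in html.parser.
soup = BeautifulSoup(await response.string(), "html.parser")
# find() returns the first match; find_all() returns every match.
print("Title:", soup.find("h1").get_text(strip=True))
print("Links found:", len(soup.find_all("a")))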

Output:

Title: Header
Links found: 9

How This Example Works

pyfetch sends an asynchronous request, so the program uses top-level await.
The status check stops error pages from being parsed as the intended document.
response.string() reads the body, which BeautifulSoup parses with html.parser.
find("h1") returns the first heading, while find_all("a") returns every link.

Inspect the Scraper with Run Details

Open Run Details after the program finishes. It separates the page request made by the scraper from the package activity required to execute the code.

Run Details signal	What it means
Requests shows a GET to raw.githubusercontent.com	The scraper attempted to download the HTML fixture
The request status is 200	The browser received the page successfully
The request is failed or blocked	Check connectivity and whether the target permits cross-origin browser requests
Packages lists beautifulsoup4	web.run detected the bs4 import and loaded the beautifulsoup4 package that provides it
Status is 200 but extracted values are missing	The network worked; inspect the HTML and selectors

The duration beside the request measures the page fetch, while the total duration also includes package loading, parsing, and Python execution. BeautifulSoup is normally downloaded on the first run and then remains available in the current runtime, so later runs can complete faster and may show no new package activity.

How to Scrape a Website with Python

Choose a page that permits automated access and browser requests.
Fetch its HTML and reject non-success HTTP statuses before parsing.
Create one BeautifulSoup object from the response body.
Inspect the markup and select elements by semantic tag, attribute, or CSS selector.
Normalize extracted text and resolve relative URLs before saving the results.
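
The steps above can be sketched as one function. This is a minimal sketch rather than the page's own implementation: `extract_links`, the sample markup, and the `https://example.com/docs/index.html` URL are hypothetical, and the fetch is replaced by an inline string so the example runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(status, body, page_url):
    """Validate, parse, select, and normalize links from one HTML document."""
    # Reject non-success statuses before parsing.
    if not 200 <= status < 300:
        raise RuntimeError(f"HTTP {status}")
    # Create one BeautifulSoup object from the response body.
    soup = BeautifulSoup(body, "html.parser")
    results = []
    # Select by CSS selector; [href] skips anchors without a destination.
    for link in soup.select("nav a[href]"):
        # Normalize the text and resolve relative URLs against the page URL.
        results.append({
            "text": link.get_text(" ", strip=True),
            "url": urljoin(page_url, link["href"]),
        })
    return results


# Hypothetical stand-in for a fetched response body.
SAMPLE_HTML = """
<nav>
  <a href="guide.html"> Getting <b>started</b> </a>
  <a href="/about">About</a>
  <a>No destination</a>
</nav>
"""

links = extract_links(200, SAMPLE_HTML, "https://example.com/docs/index.html")
for item in links:
    print(item["text"], "->", item["url"])
```

In web.run the `status` and `body` arguments would come from the awaited pyfetch response; keeping the fetch outside the function makes the parsing logic easy to test on saved HTML.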

More Python BeautifulSoup Examples

Extract all section headings

Use one CSS selector to collect several heading levels:

for heading in soup.select("h2, h3"):
    print(heading.get_text(" ", strip=True))

The separator keeps text from nested tags apart, while strip=True removes surrounding whitespace.
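
To see what the separator changes, the following sketch parses a small, made-up fragment in which a heading contains a nested tag. The markup is an assumption chosen for illustration, not taken from the fixture page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a nested tag inside a heading.
html = "<h2>  Install<span>Optional</span>  </h2><h3>Usage</h3><p>Body</p>"
soup = BeautifulSoup(html, "html.parser")

heading = soup.select("h2, h3")[0]
# Without a separator, text from nested tags is joined directly.
joined = heading.get_text(strip=True)
# With a separator, each piece of text stays apart.
spaced = heading.get_text(" ", strip=True)

print(repr(joined))
print(repr(spaced))

# The same selector collects both heading levels in document order.
all_headings = [h.get_text(" ", strip=True) for h in soup.select("h2, h3")]
print(all_headings)
```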

Extract and normalize links

Resolve absolute, relative, and fragment links against the page URL:

from urllib.parse import urljoin

for link in soup.select("a[href]"):
    print(urljoin(url, link["href"]))

The [href] selector matches only anchors that have an href attribute, so link["href"] cannot raise KeyError.
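
The sketch below shows how urljoin treats each kind of link on its own, without BeautifulSoup. The page URL and href values are hypothetical and chosen only to cover the absolute, relative, root-relative, and fragment cases.

```python
from urllib.parse import urljoin

# Hypothetical page URL and href values, used only for illustration.
page_url = "https://example.com/docs/page.html"
hrefs = ["https://other.example/x", "about.html", "/img/logo.png", "#top"]

# An absolute href is returned unchanged; the others are resolved
# against the page URL.
resolved = [urljoin(page_url, href) for href in hrefs]
for href, absolute in zip(hrefs, resolved):
    print(f"{href!r:28} -> {absolute}")
```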

Choosing BeautifulSoup Selectors

Task	Selector	Result
First page heading	`soup.find("h1")`	One tag or `None`
Every link	`soup.find_all("a")`	A list of tags
Links with destinations	`soup.select("a[href]")`	Tags matching a CSS selector
Items inside an article	`soup.select("article .item")`	Matching descendants only

Use find when one element is expected, find_all for a tag-based collection, and select when relationships or attributes make a CSS selector clearer.
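
As a rough illustration of the selectors in the table, this sketch runs each one against a small invented document. The markup and variable names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical document covering each row of the table.
html = """
<h1>Catalog</h1>
<a href="/home">Home</a> <a>Placeholder</a>
<article><p class="item">First</p><p class="item">Second</p></article>
<p class="item">Outside</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_heading = soup.find("h1")               # one tag
missing = soup.find("h4")                     # None, because there is no h4
all_links = soup.find_all("a")                # every anchor, with or without href
links_with_href = soup.select("a[href]")      # only anchors that have href
article_items = soup.select("article .item")  # descendants of article only

print(first_heading.get_text(strip=True), missing)
print(len(all_links), len(links_with_href))
print([item.get_text(strip=True) for item in article_items])
```

The `.item` paragraph outside the article is not selected, because the space in `article .item` is a descendant combinator.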

Python Web Scraping Pitfalls

Not every URL is reachable. Pyodide runs in the browser, so a page can open in a tab yet reject a cross-origin request from web.run.
Missing elements need an explicit check. find and select_one return None when nothing matches, so calling get_text() on the result raises AttributeError.
JavaScript-generated content is absent from the downloaded HTML. BeautifulSoup parses the server response but does not execute scripts, so inspect the response body when a selector returns no matches.
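
One way to handle the last two pitfalls is a small guard around the lookup. In this sketch, `text_or_default` is a hypothetical helper, and the markup imitates a page whose visible content would be built by a script, so the server HTML holds only an empty container.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, selector, default=None):
    """Return the text of the first match, or a default when nothing matches."""
    tag = soup.select_one(selector)
    if tag is None:
        return default
    return tag.get_text(" ", strip=True)


# Hypothetical server response for a page that builds its content in the browser.
SHELL_HTML = """
<html><head><title>Shop</title></head>
<body><div id="app"></div><script src="/bundle.js"></script></body></html>
"""
soup = BeautifulSoup(SHELL_HTML, "html.parser")

title = text_or_default(soup, "title", "(no title)")
price = text_or_default(soup, ".price", "(not in server HTML)")
print(title)
print(price)

# An empty container next to script tags hints that content is rendered client-side.
app_is_empty = soup.select_one("#app").get_text(strip=True) == ""
has_scripts = len(soup.find_all("script")) > 0
print(app_is_empty and has_scripts)
```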

Respect the site's crawling rules, such as robots.txt, limit request frequency, cache unchanged pages, and back off when the server responds with HTTP 429 (Too Many Requests).
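
Two of these habits can be sketched with the standard library alone. The helper names `allowed_by_robots` and `retry_delay`, the robots.txt rules, and the `example-bot` user agent are hypothetical; the delay policy (exponential backoff, with a numeric Retry-After value taking priority) is one common choice, not a rule of web.run or BeautifulSoup. The sketch handles only the seconds form of Retry-After, not the HTTP-date form.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def retry_delay(status, attempt, retry_after=None, base=1.0, cap=60.0):
    """Return seconds to wait before retrying, or None if no retry is needed."""
    if status != 429:
        return None
    # Prefer the server's own hint when it is given in whole seconds.
    if retry_after is not None and retry_after.isdigit():
        return float(retry_after)
    # Otherwise double the wait on each attempt, up to a cap.
    return min(base * 2 ** attempt, cap)


# Hypothetical robots.txt content and user agent.
ROBOTS = ["User-agent: *", "Disallow: /private/"]
print(allowed_by_robots(ROBOTS, "example-bot", "https://example.com/docs/"))
print(allowed_by_robots(ROBOTS, "example-bot", "https://example.com/private/x"))

delays = [retry_delay(429, attempt) for attempt in range(4)]
print(delays)
print(retry_delay(429, 0, retry_after="30"))
print(retry_delay(200, 0))
```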

For the surrounding control flow, see Python exception handling and Python conditional checks.

FAQ

Which Python library is used for web scraping?

BeautifulSoup is the library used here to parse and query downloaded HTML. It does not fetch pages itself: a network client such as pyfetch retrieves the document, and BeautifulSoup then selects elements by tag, attribute, or CSS selector.

Can BeautifulSoup scrape JavaScript websites?

BeautifulSoup parses server-provided HTML but does not run JavaScript. Elements created only after scripts execute will not appear in its parsed document.