Python Program to Scrape Website Content
A Python program to scrape website content downloads HTML and extracts selected elements. BeautifulSoup turns the markup into a searchable tree of tags, attributes, text, and links. The workflow is request, validate, parse, select, and save. In web.run, Python executes through Pyodide, so its network and package activity also appears in Run Details.
Python Program to Scrape Website HTML
Output:
Title: Header
Links found: 9
How This Example Works
- pyfetch sends an asynchronous request, so the program uses top-level await.
- The status check stops error pages from being parsed as the intended document.
- response.string() reads the body, which BeautifulSoup parses with html.parser.
- find("h1") returns the first heading, while find_all("a") returns every link.
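The steps above can be sketched roughly as follows. This is a minimal sketch rather than the page's original listing: the helper names summarize and scrape are hypothetical, the inline sample markup stands in for the real fixture, and the pyfetch import sits inside scrape so that the parsing part also runs outside Pyodide.

```python
from bs4 import BeautifulSoup


def summarize(html: str) -> dict:
    """Parse HTML and report the first <h1> text and the number of <a> tags."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")  # first <h1>, or None when the page has none
    return {
        "title": heading.get_text(strip=True) if heading else None,
        "links": len(soup.find_all("a")),
    }


async def scrape(url: str) -> dict:
    """Fetch a page with pyfetch (Pyodide only) and summarize it."""
    from pyodide.http import pyfetch  # imported here so summarize() runs anywhere

    response = await pyfetch(url)
    if response.status != 200:  # stop before parsing an error page
        raise RuntimeError(f"Unexpected HTTP status: {response.status}")
    return summarize(await response.string())


# Offline check with inline sample markup (an assumption, not the real fixture).
sample = "<h1>Header</h1><p><a href='/a'>A</a> <a href='/b'>B</a></p>"
result = summarize(sample)
print("Title:", result["title"])
print("Links found:", result["links"])
```

Under Pyodide, the network path would be run with a top-level `await scrape(url)`, where url is whichever page you have permission to fetch.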
Inspect the Scraper with Run Details
Open Run Details after the program finishes. It separates the page request made by the scraper from the package activity required to execute the code.
| Run Details signal | What it means |
|---|---|
| Requests shows a GET to raw.githubusercontent.com | The scraper attempted to download the HTML fixture |
| The request status is 200 | The browser received the page successfully |
| The request is failed or blocked | Check connectivity and whether the target permits cross-origin browser requests |
| Packages lists beautifulsoup4 | web.run detected the bs4 import and loaded its supported package |
| Status is 200 but extracted values are missing | The network worked; inspect the HTML and selectors |
The duration beside the request measures the page fetch, while the total duration also includes package loading, parsing, and Python execution. BeautifulSoup is normally downloaded on the first run and then remains available in the current runtime, so later runs can complete faster and may show no new package activity.
How to Scrape a Website with Python
- Choose a page that permits automated access and browser requests.
- Fetch its HTML and reject non-success HTTP statuses before parsing.
- Create one BeautifulSoup object from the response body.
- Inspect the markup and select elements by semantic tag, attribute, or CSS selector.
- Normalize extracted text and resolve relative URLs before saving the results.
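The last three steps could look like the sketch below. It assumes the HTML has already been fetched and validated; extract_links and to_csv are hypothetical helper names, the example.com URL is a placeholder, and CSV is only one possible output format.

```python
import csv
import io
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html: str, page_url: str) -> list[dict]:
    """Return one row per link with normalized text and an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")  # one soup per response body
    rows = []
    for link in soup.select("a[href]"):
        text = link.get_text(" ", strip=True)
        rows.append({
            "text": " ".join(text.split()),  # collapse internal whitespace
            "url": urljoin(page_url, link["href"]),  # resolve relative hrefs
        })
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text; writing to a file works the same way."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "url"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


html = """
<article>
  <a href="guide.html">  Getting
     started </a>
  <a href="/pricing"><b>Pricing</b> plans</a>
</article>
"""
rows = extract_links(html, "https://example.com/docs/index.html")
print(to_csv(rows))
```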
More Python BeautifulSoup Examples
Extract all section headings
Use one CSS selector to collect several heading levels:
for heading in soup.select("h2, h3"):
    print(heading.get_text(" ", strip=True))
The separator keeps text from nested tags apart, while strip=True removes surrounding whitespace.
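To see what the two arguments change, the short comparison below parses a made-up heading with a nested span and reads its text both ways.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption): a heading whose text is split by a nested tag.
soup = BeautifulSoup("<h2> Install<span>Guide</span> </h2>", "html.parser")
heading = soup.find("h2")

plain = heading.get_text()                  # pieces concatenated as-is
spaced = heading.get_text(" ", strip=True)  # pieces stripped, then joined by " "

print(repr(plain))   # ' InstallGuide '
print(repr(spaced))  # 'Install Guide'
```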
Extract and normalize links
Resolve absolute, relative, and fragment links against the page URL:
from urllib.parse import urljoin
for link in soup.select("a[href]"):
    print(urljoin(url, link["href"]))
The [href] selector skips anchors that have no href attribute, so link["href"] cannot raise a KeyError.
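The standard-library urljoin handles each kind of href differently. The sketch below uses a placeholder page URL and made-up href values to show the resolved result for each.

```python
from urllib.parse import urljoin

page_url = "https://example.com/docs/page.html"  # placeholder page address

hrefs = [
    "https://other.example/x",  # already absolute: returned unchanged
    "/about",                   # root-relative: replaces the whole path
    "guide.html",               # relative: replaces the last path segment
    "../index.html",            # parent-relative: climbs one directory
    "#top",                     # fragment: appended to the page URL
]

resolved = {href: urljoin(page_url, href) for href in hrefs}
for href, absolute in resolved.items():
    print(f"{href!r:28} -> {absolute}")
```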
Choosing BeautifulSoup Selectors
| Task | Selector | Result |
|---|---|---|
| First page heading | soup.find("h1") | One tag or None |
| Every link | soup.find_all("a") | A list of tags |
| Links with destinations | soup.select("a[href]") | Tags matching a CSS selector |
| Items inside an article | soup.select("article .item") | Matching descendants only |
Use find when one element is expected, find_all for a tag-based collection, and select when relationships or attributes make a CSS selector clearer.
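The rows of the table can be compared side by side on a small document. The markup below is invented for illustration and is not taken from the page's fixture.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption) with items inside and outside an article.
html = """
<h1>Catalog</h1>
<article><p class="item">One</p><p class="item">Two</p></article>
<p class="item">Outside</p>
<a href="/a">A</a><a>No destination</a>
"""
soup = BeautifulSoup(html, "html.parser")

first_heading = soup.find("h1")            # one tag
missing = soup.find("h4")                  # None: no such tag in the document
all_links = soup.find_all("a")             # every <a>, with or without href
with_href = soup.select("a[href]")         # only anchors carrying an href
in_article = soup.select("article .item")  # descendants of <article> only

print(first_heading.get_text(), missing)
print(len(all_links), len(with_href))
print([tag.get_text() for tag in in_article])
```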
Python Web Scraping Pitfalls
- Not every URL is reachable. Pyodide runs in the browser, so a page can open in a tab yet reject a cross-origin request from web.run.
- Missing elements need an explicit check. find() and select_one() return None when nothing matches, so calling get_text() on the result without checking raises an AttributeError.
- JavaScript-generated content is absent from the downloaded HTML. BeautifulSoup parses the server response but does not execute scripts, so inspect the response body when a selector returns no matches.
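One way to handle the missing-element case from the list above is a small guard around the lookup. In this sketch, first_text is a hypothetical helper and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def first_text(soup, selector, default=""):
    """Return the text of the first match, or a default when nothing matches."""
    tag = soup.select_one(selector)  # None when the selector has no match
    if tag is None:
        return default
    return tag.get_text(" ", strip=True)


soup = BeautifulSoup("<h1>Header</h1>", "html.parser")

title = first_text(soup, "h1")
subtitle = first_text(soup, "h2", default="(no subtitle)")
print(title)
print(subtitle)
```

Returning a default keeps one missing element from stopping the whole run. Raising a descriptive error instead is equally reasonable when the element is required.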
Respect crawling rules, limit request frequency, cache unchanged pages, and back off when the server reports too many requests.
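Backing off on HTTP 429 can be sketched with exponential delays. Everything here is an assumption for illustration: fetch_with_backoff is a hypothetical helper, fetch stands for any callable that returns a (status, body) pair, and the demo replaces the network and the sleep with fakes so that it runs instantly.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry with doubling delays while the status is 429."""
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status, body  # still 429 after the final attempt


# Demo with a fake server that rejects the first two requests.
responses = iter([(429, ""), (429, ""), (200, "<h1>Header</h1>")])
waits = []  # records the delays instead of actually sleeping

status, body = fetch_with_backoff(
    lambda url: next(responses), "https://example.com/", sleep=waits.append
)
print(status, waits)
```

In asynchronous code such as a pyfetch-based scraper, the same loop would await asyncio.sleep rather than call the blocking time.sleep.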
For the surrounding control flow, see Python exception handling and Python conditional checks.
FAQ
Which Python library is used for web scraping?
BeautifulSoup parses and queries downloaded HTML. A network client such as pyfetch retrieves the document, while BeautifulSoup selects elements by tag, attribute, or CSS selector.
Can BeautifulSoup scrape JavaScript websites?
BeautifulSoup parses server-provided HTML but does not run JavaScript. Elements created only after scripts execute will not appear in its parsed document.
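The effect is easy to reproduce offline. In the invented page below, a script would insert a price element in a browser, but the parsed document contains only the empty container.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption): the price exists only after the script runs.
html = """
<div id="app"></div>
<script>
  document.getElementById("app").innerHTML = "<p class='price'>9.99</p>";
</script>
"""
soup = BeautifulSoup(html, "html.parser")

app = soup.find(id="app")
price = soup.select_one("p.price")  # the script body is text, not parsed tags

print(repr(app.get_text()))  # ''
print(price)                 # None
```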