
Using LLMs to Extract Structured Data from Messy Web Pages

Tags: LLM, AI, web scraping, data extraction, structured data

The fragility problem

Every web scraper has the same weakness: CSS selectors. You write document.querySelector('.product-price .amount') and it works perfectly — until the site redesigns, renames the class to price-value, wraps it in a new div, or switches from server-rendered HTML to a JavaScript-loaded component.

When your scraper monitors 50 sites, at least one of them changes something every week. You spend more time maintaining selectors than building new features. It’s a treadmill.

This is where LLMs change the game. Instead of telling a scraper exactly where to find the price (fragile CSS path), you tell an LLM “extract the price from this page” and it understands what a price looks like — regardless of how the HTML is structured.

How LLM-based extraction works

The basic pattern is straightforward:

  1. Fetch the page content (HTML or rendered text)
  2. Clean it up (remove navigation, footers, scripts — keep the main content)
  3. Send it to an LLM with a prompt defining what to extract and the expected output schema
  4. Validate the structured response against your schema
  5. Store the validated data
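The steps above can be sketched in a few lines of Python. This is a minimal sketch, not a full implementation: `call_llm` is a placeholder for whatever LLM SDK you use, and the field names mirror the product example discussed in this post.

```python
import json

# Hypothetical sketch of steps 3-5: prompt, validate, return.
# call_llm(prompt) -> str is a placeholder for your LLM SDK of choice.

EXPECTED_TYPES = {
    "name": str,
    "price": (int, float, type(None)),
    "currency": (str, type(None)),
    "in_stock": (bool, type(None)),
    "description": (str, type(None)),
}

def validate(record: dict) -> dict:
    """Step 4: reject responses that do not match the expected schema."""
    for field, types in EXPECTED_TYPES.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], types):
            raise TypeError(f"bad type for {field}")
    return record

def extract(cleaned_page: str, call_llm) -> dict:
    """Steps 3-4: send cleaned content plus schema, validate the reply."""
    prompt = (
        "Extract name, price, currency, in_stock, description as JSON. "
        "Use null for fields you cannot determine.\n\n" + cleaned_page
    )
    return validate(json.loads(call_llm(prompt)))
```

In a real pipeline, `validate` would be replaced by proper schema tooling, but even this thin check catches most malformed responses before they reach storage.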

In practice, say you need to extract product information from an e-commerce page. The prompt might be:

Extract the following information from this product page:
- name: the product name
- price: the current price as a number (not the original/crossed-out price)
- currency: the currency code (EUR, USD, GBP, etc.)
- in_stock: boolean, whether the product is available
- description: a brief product description (max 200 chars)

Return valid JSON matching this schema. If a field cannot be determined, use null.

The LLM reads the page content — messy HTML, inconsistent formatting, multiple prices (original, discounted, member-only) — and returns clean JSON. It understands that “19,99 EUR” and “$19.99” and “Price: 19.99$” all represent prices. It knows that “In Stock”, “Available”, “Ships in 2-3 days”, and a green checkmark icon all mean the product is available.

No CSS selectors. No XPath. No regex that breaks when someone adds a space.

When to use LLMs vs traditional parsing

LLMs aren’t a replacement for traditional scraping — they’re a complement. Here’s when each approach makes sense:

Use traditional CSS/XPath selectors when:

  • You’re scraping a single site with a stable structure
  • The data is in predictable, consistent locations
  • You need maximum speed and minimum cost
  • Volume is extremely high (millions of pages per day)

Use LLM extraction when:

  • You’re scraping many sites with different structures
  • Layouts change frequently
  • The data requires interpretation (not just location)
  • You need to extract from unstructured text, not just structured HTML
  • Volume is moderate (thousands to tens of thousands of pages per day)

Use both when:

  • You have a mix of stable and unstable sources
  • Some fields are easy to locate (CSS) and others require understanding (LLM)
  • You want CSS selectors for speed with LLM as a fallback when selectors fail

The hybrid approach is the most powerful — similar to choosing between APIs and scraping. Try fast, cheap CSS extraction first. If it fails or returns suspicious results, fall back to LLM extraction. Log the fallback so you can update your selectors later if you want.
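One way to sketch that fallback chain, with the caveat that the `price-value` pattern and the sanity bounds are illustrative assumptions, and `extract_with_llm` is a placeholder for an LLM-based extractor:

```python
import logging
import re
from typing import Optional

# Illustrative hybrid extractor: cheap site-specific pattern first, LLM
# fallback second. PRICE_RE stands in for a real CSS/XPath selector.

PRICE_RE = re.compile(r'class="price-value"[^>]*>\s*(\d[\d.,]*)')

def css_price(html: str) -> Optional[float]:
    m = PRICE_RE.search(html)
    if not m:
        return None
    try:
        return float(m.group(1).replace(",", "."))
    except ValueError:
        return None

def get_price(html: str, extract_with_llm) -> float:
    price = css_price(html)
    if price is not None and 0 < price < 100_000:   # sanity bounds
        return price
    # log the fallback so the selector can be fixed later
    logging.warning("selector miss, falling back to LLM extraction")
    return extract_with_llm(html)
```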

Making it cost-effective

The obvious concern with LLM extraction is cost. Sending full HTML pages to an LLM API isn’t cheap — and most of that HTML is irrelevant navigation, headers, footers, and scripts.

Strip before sending. Remove everything that isn’t the main content. Strip all <script>, <style>, <nav>, <header>, <footer> tags. Remove HTML attributes that don’t carry semantic meaning (class names, IDs, data attributes). Convert the remaining HTML to clean text or minimal markdown. This typically reduces token count by 80-90%.
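A stdlib-only sketch of that stripping step (a production pipeline would more likely use a full HTML-to-markdown converter, but the idea is the same):

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "header", "footer"}

class ContentStripper(HTMLParser):
    """Drops boilerplate tags and all attributes, keeping only visible text."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_page(html: str) -> str:
    parser = ContentStripper()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Because only text nodes survive, class names, IDs, and data attributes are discarded for free, which is where most of the token savings come from.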

Use the right model for the job. Not every extraction needs Claude Opus or GPT-4o. For straightforward extractions (name, price, availability), smaller models like Claude Haiku or GPT-4o mini deliver 95%+ accuracy at a fraction of the cost. Reserve the more capable models for complex extractions that require reasoning — interpreting ambiguous descriptions, handling multiple price tiers, or extracting from heavily unstructured content.

Batch similar pages. If you’re extracting the same fields from 100 pages on the same site, you can often send multiple page contents in a single API call with a shared prompt. This reduces the overhead of the system prompt and schema definition being repeated for every page.
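A sketch of how that shared prompt can be amortized. The `--- PAGE n ---` delimiter is an arbitrary convention of this example, not an API feature; your model's structured-output mode may have its own batching conventions.

```python
# Illustrative batching: one prompt, many pages, one JSON array back.

def batch_prompt(pages: list[str], schema_instructions: str) -> str:
    parts = [
        schema_instructions,
        "Return a JSON array with one object per page, in the same order.",
    ]
    for i, page in enumerate(pages, start=1):
        parts.append(f"--- PAGE {i} ---\n{page}")
    return "\n\n".join(parts)
```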

Cache intelligently. If a page hasn’t changed since last extraction (check with ETags or content hashing), don’t re-extract. This alone can reduce LLM API calls by 60-70% for sites that update infrequently.
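The content-hash variant fits in a few lines. Here the cache is an in-memory dict for illustration; a real system would back it with Redis or SQLite, and `extract_fn` is whatever performs the LLM call.

```python
import hashlib

# Hash-keyed extraction cache: re-run the LLM only when the cleaned
# page content actually changed.

def extract_cached(content: str, cache: dict, extract_fn):
    key = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if key not in cache:              # unchanged content: skip the LLM call
        cache[key] = extract_fn(content)
    return cache[key]
```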

With these optimizations, the cost of LLM extraction typically lands at $0.001-0.01 per page — comparable to the proxy costs you’re already paying for scraping.

Structured outputs: the reliability breakthrough

The single biggest improvement in LLM-based extraction over the past year is structured output support. Both Claude and GPT now support constrained generation — the model is forced to return valid JSON matching a specific schema. No more parsing free-text responses and hoping the format is correct.

This changes everything for production use. Instead of prompting the model and then writing brittle regex to extract values from its response, you define a JSON schema and the API guarantees the response matches it. Every field has the right type. Required fields are always present. No unexpected keys or missing brackets.

Combined with schema validation on your end (Zod, Pydantic), this creates a double safety net that makes LLM extraction as reliable as traditional parsing for most use cases.
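The post names Zod and Pydantic for that second check; the sketch below is a stdlib-only stand-in for the same idea, applied even though the API already constrained the output shape:

```python
import json
from dataclasses import dataclass
from typing import Optional

# Stdlib stand-in for Pydantic-style validation of the model's JSON reply.

@dataclass
class Product:
    name: str
    price: Optional[float]
    currency: Optional[str]
    in_stock: Optional[bool]
    description: Optional[str]

def parse_product(raw: str) -> Product:
    product = Product(**json.loads(raw))   # TypeError on missing/extra keys
    if not isinstance(product.name, str):
        raise TypeError("name must be a string")
    if product.price is not None and not isinstance(product.price, (int, float)):
        raise TypeError("price must be numeric or null")
    return product
```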

Handling the edge cases

LLMs are impressively good at extraction, but they’re not perfect. Here’s how to handle the cases where they stumble:

Ambiguous data. A page shows three prices — original, sale, and member-only. Which one should the LLM return? Be explicit in your prompt. “Extract the currently displayed sale price visible to non-logged-in users, not the original price or member price.”

Hallucinated values. Occasionally, an LLM will generate plausible-looking data that isn’t on the page. Cross-reference extracted values against the raw content. If the LLM returns a price of $29.99 but the string “29.99” doesn’t appear anywhere in the page content, flag it for review.
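For numeric fields, that cross-reference can be a one-liner. The assumption here is that a genuine price survives cleaning verbatim, with either "." or "," as the decimal separator:

```python
# Sketch of a hallucination guard for extracted prices.

def price_on_page(price: float, page_text: str) -> bool:
    dot = f"{price:.2f}"                      # e.g. "29.99"
    return dot in page_text or dot.replace(".", ",") in page_text
```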

Multi-language content. Pages in languages the model handles less well may have lower extraction accuracy. For critical use cases, test extraction quality per language and consider using language-specific prompts or pre-translation.

Large pages. Some pages exceed the model’s context window even after stripping. For these, extract the relevant section first (the product container, the article body) and only send that section to the LLM.
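A naive windowing sketch for that case: keep a bounded slice around the first occurrence of a marker string (for example, the product name from the URL slug). `MAX_CHARS` is a stand-in for your model's actual context budget, and real code would cut at a tag or paragraph boundary rather than mid-character.

```python
# Bounded-window fallback for pages that exceed the context budget.

MAX_CHARS = 12_000

def relevant_section(text: str, marker: str, budget: int = MAX_CHARS) -> str:
    idx = text.find(marker)
    if idx == -1:
        return text[:budget]                  # fallback: head of the page
    start = max(0, idx - budget // 2)
    return text[start:start + budget]
```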

A real-world example

We built a competitive monitoring system for a European retail company tracking prices across 120 different e-commerce sites in 8 countries. A traditional approach would have required 120 separate scraper configurations, each with site-specific CSS selectors, each maintained individually.

Instead, we built 3 scraper templates (one for single-product pages, one for listing pages, one for search results) powered by LLM extraction. The same prompt schema handles French, German, Italian, Spanish, and English product pages without modification. When a site redesigns, the extractor keeps working because it’s reading content, not DOM structure.

Maintenance dropped from roughly 15 hours per week (fixing broken selectors) to about 2 hours per week (mostly handling new site structures we hadn’t seen before). Extraction accuracy sits at 97.5% across all sites, validated against manual spot checks.

At SilentFlow, this combination of intelligent scraping and LLM-powered extraction is at the core of what we build. The web is messy, unstructured, and constantly changing. Traditional scrapers fight against that reality. LLMs embrace it — understanding content the way a human would, but at a scale no human team can match. The result is data extraction that’s more resilient, more adaptable, and ultimately more reliable than any selector-based approach.

Launch your scraping project

Need to automate data collection? Tell us what you need, and we'll get back to you within 24 hours.
