
The Lead Generation Playbook Nobody Talks About: Mining the Public Web

lead generation · B2B · web scraping · data enrichment · sales

The 50,000-contact lead list that was 40% garbage

A SaaS founder I know bought a lead database last year. Fifty thousand contacts, segmented by industry and company size, from a well-known data provider. Cost: around $12,000. He loaded the list into his CRM, launched an email campaign, and waited.

Bounce rate: 38%. Of the emails that actually landed, most went to people who had changed roles months ago. A handful triggered spam complaints because the recipients had never heard of the company and certainly hadn’t opted into anything.

Net result: 11 qualified leads from 50,000 contacts. That’s a 0.02% conversion rate on a $12,000 investment, plus the time his sales team wasted chasing dead ends.

The problem isn’t his sales pitch. The problem is that bought lead lists are snapshots of a world that no longer exists. By the time the data is packaged, sold, and loaded into your CRM, a significant chunk of it is already wrong — people changed jobs, companies pivoted, emails bounced, phone numbers disconnected.

The leads you actually want are sitting on public websites

Here’s what I find fascinating. While companies spend thousands on stale databases, the freshest, most accurate lead data is freely available on the public web. It just requires some effort to collect and structure.

Think about it:

  • Company websites list their team members, technologies they use, office locations, and often their current challenges (in blog posts, case studies, and job descriptions)
  • Job boards reveal which companies are hiring for specific roles — a company hiring 3 data engineers is probably building a data team and might need tools or services
  • LinkedIn profiles show real-time job titles, company affiliations, and career trajectories
  • Business directories aggregate company information with revenue ranges, employee counts, and industry classifications
  • Review platforms like G2 or Capterra show which tools companies use — and which ones they’re unhappy with

Each of these is a signal. Stacked together, they paint a portrait of a potential customer that no lead list vendor can match, because the data is current — not six months old.

Building a lead engine that doesn’t go stale

The difference between scraping random contact info and building a lead generation engine is intent. You’re not collecting emails for the sake of it. You’re identifying companies that match your ideal customer profile, finding the right people within those companies, and reaching out when the timing is right.

Here’s the architecture:

Layer 1: Company identification

Start with the question: “What signals indicate a company might need what I sell?”

If you sell data integration tools, a company posting job listings for “data engineer” or “ETL developer” is a strong signal. If you offer cybersecurity services, a company that just suffered a data breach (public news) or is hiring a CISO (job boards) is a warm prospect.

Build scrapers that monitor these signals continuously:

  • Job boards (Indeed, LinkedIn Jobs, Welcome to the Jungle) for specific role keywords
  • Company blogs and press releases for expansion announcements
  • Funding databases (Crunchbase, Dealroom) for recently funded startups
  • Industry directories for new company registrations
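As a minimal sketch of that monitoring layer, here is what matching job postings against signal keywords might look like. The posting feed, company names, and keyword list are all illustrative; in practice each source above needs its own scraper or API client feeding this step.

```python
from dataclasses import dataclass

# Role keywords that signal a company might need what you sell
# (illustrative — tune these to your own product).
SIGNAL_KEYWORDS = {"data engineer", "etl developer"}

@dataclass
class JobPosting:
    company: str
    title: str
    location: str

def companies_with_signals(postings):
    """Group matching job titles by company for any posting
    whose title contains a signal keyword."""
    hits = {}
    for post in postings:
        title = post.title.lower()
        if any(kw in title for kw in SIGNAL_KEYWORDS):
            hits.setdefault(post.company, []).append(post.title)
    return hits

# A hypothetical batch pulled from a job-board scraper:
postings = [
    JobPosting("TechCorp", "Senior Data Engineer", "Paris"),
    JobPosting("TechCorp", "ETL Developer", "Berlin"),
    JobPosting("OtherCo", "Office Manager", "Lyon"),
]
signals = companies_with_signals(postings)
# TechCorp surfaces with two matching roles; OtherCo is filtered out.
```

The output of this step is the input to everything that follows: a shortlist of companies with at least one live signal attached.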

Layer 2: Company enrichment

Once you’ve identified a company worth targeting, enrich the profile:

  • Company size, revenue range, and growth trajectory from business databases
  • Technology stack from tools like BuiltWith or Wappalyzer (or scrape it directly from their site headers and scripts)
  • Recent news and press mentions for conversation starters
  • Social media presence and engagement levels

This is where automated data pipelines become essential. You’re pulling data from 5-10 different sources per company, normalizing it into a single record, and keeping it fresh.
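The core of that normalization step can be sketched as a simple merge: each per-source lookup (all hypothetical here) contributes fields to one company record, and later sources only fill gaps the earlier, more trusted ones left empty.

```python
def merge_company_record(sources):
    """Fold per-source dicts into one normalized record.
    Earlier sources win; later ones only fill empty fields."""
    record = {}
    for source in sources:
        for key, value in source.items():
            if value and not record.get(key):
                record[key] = value
    return record

# Illustrative lookups from three of the sources listed above:
directory = {"name": "TechCorp", "employees": 180, "revenue_range": None}
tech_lookup = {"name": "TechCorp", "tech_stack": ["Snowflake", "dbt"]}
news = {"recent_news": "Raised Series B", "revenue_range": "€10-50M"}

profile = merge_company_record([directory, tech_lookup, news])
# One record: size from the directory, stack from the tech lookup,
# revenue range and news from press coverage.
```

Keeping it fresh is then just a matter of re-running the lookups on a schedule and re-merging.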

Layer 3: Contact discovery

Now find the right person. Not the CEO (too busy), not the intern (no authority) — the person who feels the pain your product solves.

Public sources for contact discovery:

  • Company “About” or “Team” pages often list key personnel with titles
  • LinkedIn profiles (public data) show who holds which role
  • Conference speaker lists and podcast appearances reveal thought leaders
  • GitHub contributions and technical blog authors identify technical decision-makers

The key here is respecting boundaries. Stick to publicly available information. Don’t scrape private profiles, don’t guess email formats to bypass opt-in requirements, and always give people a clear way to opt out. GDPR and similar regulations aren’t suggestions — they’re the law, and following them is also just good business practice.

Layer 4: Timing and scoring

Not every identified lead is worth pursuing right now. Build a scoring model based on the signals you’ve collected:

  • High intent: Company posted 3 relevant job listings this month + recently raised funding + uses a competitor’s product → reach out now
  • Medium intent: Company matches ICP + growing headcount + no competitor product detected → add to nurture sequence
  • Low intent: Company matches ICP but no active signals → monitor for changes

The automated pipeline re-scores leads as new data comes in. A company that was “low intent” last month might jump to “high intent” when they post a relevant job listing or announce a new initiative.
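The three tiers above translate directly into a scoring function. The thresholds and signal names here are illustrative, not a definitive model — the point is that re-scoring is just re-running this function whenever a company's signals change.

```python
def score_lead(company):
    """Assign an intent tier from the enriched signals.
    All field names and thresholds are illustrative."""
    if (company.get("relevant_jobs", 0) >= 3
            and company.get("recently_funded")
            and company.get("uses_competitor")):
        return "high"    # reach out now
    if company.get("matches_icp") and company.get("headcount_growing"):
        return "medium"  # add to nurture sequence
    if company.get("matches_icp"):
        return "low"     # monitor for changes
    return "ignore"

# A company that was "low" last month jumps tiers when new
# signals land in its record:
last_month = {"matches_icp": True}
this_month = {"matches_icp": True, "relevant_jobs": 3,
              "recently_funded": True, "uses_competitor": True}
```

A real model would likely weight and sum signals rather than use hard rules, but hard rules are a fine place to start.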

Real numbers from a real pipeline

Let me share what this looks like in practice for a B2B SaaS company selling to mid-market (50-500 employees) tech companies in Europe.

Monthly pipeline:

  • Job board monitoring identifies ~400 companies posting relevant roles
  • Filtering by geography, size, and industry narrows to ~120 qualified companies
  • Enrichment adds company details, tech stack, and recent news
  • Contact discovery finds 2-3 relevant decision-makers per company
  • Lead scoring prioritizes ~40 high-intent companies per month

Results after 6 months:

  • Average email response rate: 12% (vs. 2% from bought lists)
  • Qualified meetings booked per month: 8-12
  • Cost per qualified lead: ~$15 (infrastructure + tools)
  • Compared to: $80-150 per lead from traditional data providers

The response rate difference is the real story. When your outreach references the specific job listing they posted, mentions the technology stack they use, or acknowledges a recent company milestone — it doesn’t feel like spam. It feels like someone who did their homework.

The enrichment layer that changes everything

Raw lead data is useful. Enriched lead data is powerful. The difference is context.

Compare these two outreach approaches:

Without enrichment: “Hi, I see you’re the Head of Data at TechCorp. We help companies with data integration…”

With enrichment: “Hi, I noticed TechCorp just posted for two senior data engineers and recently migrated to Snowflake (spotted it in your job descriptions). We’ve helped three similar-sized companies in fintech reduce their integration overhead by 60% during exactly this kind of scaling phase…”

Same lead. Wildly different response rates. The second message is only possible because the pipeline collected, connected, and surfaced the right context automatically.

This enrichment step is where AI-powered extraction earns its keep. Job descriptions, company blog posts, and press releases are unstructured text. An LLM can read a job listing and extract “this company uses Snowflake, dbt, and Airflow” without anyone manually parsing keywords.
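A hedged sketch of that extraction step: the model call itself is stubbed out (any chat-completion API would do), so what's shown is the prompt shape and the part that actually needs care — parsing the model's response defensively, since LLMs sometimes wrap the answer in extra text.

```python
import json

PROMPT_TEMPLATE = (
    "Extract the names of data tools mentioned in this job listing. "
    "Answer with a JSON array of strings only.\n\nListing:\n{listing}"
)

def build_prompt(listing_text):
    return PROMPT_TEMPLATE.format(listing=listing_text)

def parse_tools(model_response):
    """Pull a JSON array out of the response, tolerating
    surrounding chatter; return [] on anything malformed."""
    start, end = model_response.find("["), model_response.rfind("]")
    if start == -1 or end == -1:
        return []
    try:
        return json.loads(model_response[start:end + 1])
    except json.JSONDecodeError:
        return []

# A canned response, as a model might return it:
response = 'Here are the tools: ["Snowflake", "dbt", "Airflow"]'
tools = parse_tools(response)
```

The extracted list then lands in the company record as the `tech_stack` field the scoring layer consumes.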

What not to do

I need to be direct about this because the line between smart lead generation and spam is thinner than people think.

Don’t mass-scrape personal email addresses. Even if they’re technically public, blasting thousands of people with unsolicited emails will get your domain blacklisted and possibly get you fined under GDPR. Use personal data to inform your outreach strategy, not as a bulk mailing list.

Don’t pretend automation is personalization. Inserting {first_name} and {company_name} into a template isn’t personalization. Real personalization uses the enriched data to write messages that demonstrate genuine understanding of the prospect’s situation.

Don’t ignore opt-out requests. If someone tells you to stop emailing them, stop. Immediately. No exceptions. It’s the law and it’s basic decency.

Don’t scrape behind login walls. Public web means public web. If you need to create an account to access the data, it’s not public, and scraping it likely violates the platform’s terms of service.

The companies that do lead generation well treat their data pipeline as a research tool, not a spam cannon. The goal is fewer, better conversations — not more, worse ones.

Start small, then compound

You don’t need to build this entire system at once. Start with one signal source — say, job board monitoring for your specific keywords — and pipe it into a simple spreadsheet or CRM. Run manual outreach for a month to validate that the signal quality is worth automating further.

Once you’ve proven the model, add enrichment layers one at a time. Company data. Tech stack. Recent news. Contact discovery. Each layer improves your conversion rate and reduces the time your sales team spends on research.

At SilentFlow, we build exactly these kinds of automated data pipelines — from scraping job boards, company directories, and public profiles to enriching and scoring leads with AI. Our clients typically see qualified lead costs drop by 70-80% compared to buying data, with response rates that make their sales teams actually enjoy outbound again.

The best leads aren’t in a database you can buy. They’re in signals scattered across the public web, waiting for someone to connect the dots. The companies that build the infrastructure to do this systematically don’t just generate more leads — they generate better ones.

Launch your scraping project

Need to automate data collection? Tell us what you need and we'll get back to you within 24 hours.
