
How to Build a Data Pipeline That Doesn't Break Every Monday Morning

Tags: data pipeline, data engineering, automation, ETL, reliability

The Monday morning panic

You know the feeling. You get to your desk on Monday, open your dashboard, and the data hasn’t updated since Friday afternoon. The scraper crashed because the target site changed a class name. The transformation script hit a null value nobody anticipated. The database connection pool was exhausted because three jobs ran simultaneously.

Nobody noticed until now because the alerting — if it exists at all — sends emails to a shared inbox nobody checks on weekends.

This is the default state of most data pipelines. They work fine 95% of the time, and the other 5% consumes 80% of your engineering time.

It doesn’t have to be this way.

The three layers that break independently

Every data pipeline has three layers, and understanding them separately is the key to reliability:

Collection — getting data from external sources (APIs, websites, databases, files). This is where most failures originate because you’re interacting with systems you don’t control.

Transformation — cleaning, normalizing, enriching, and structuring the raw data. This is where edge cases in the data create bugs in your logic.

Delivery — loading the processed data into its destination (database, data warehouse, dashboard, API). This is where infrastructure issues and schema mismatches cause failures.

Each layer needs its own error handling strategy because the failure modes are completely different.

Making collection resilient

Collection failures are the most common because external sources are unpredictable. Websites change layouts, as anyone who scrapes with a headless browser knows. APIs return unexpected responses. Rate limits kick in. Servers go down.

Retry with backoff, but not forever. When a request fails, retry it — but with exponential backoff (wait 1 second, then 2, then 4, then 8). Set a maximum retry count. If a source has been failing for 30 minutes, something is actually wrong, and retrying every few seconds is just hammering a server that’s already in trouble.
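A minimal sketch of that policy, assuming a `fetch` callable you supply (the `sleep` parameter is injectable so the backoff is testable):

```python
import time

def fetch_with_retry(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky fetch with exponential backoff: 1s, 2s, 4s, 8s."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # give up: something is actually wrong
            sleep(base_delay * (2 ** attempt))
```

In a real pipeline you would catch a narrower exception type than `Exception` and log each failed attempt.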

Checkpoint your progress. If you’re scraping 10,000 pages and it fails on page 7,342, you should be able to resume from page 7,342 — not restart from the beginning. Store checkpoints (the last successfully processed item) so that recovery doesn’t mean re-doing everything.
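A bare-bones version of that idea, persisting the next index to a JSON file after each successful item (the file path and single-integer checkpoint are simplifications):

```python
import json
import os

def process_pages(pages, process, checkpoint_path="checkpoint.json"):
    """Process pages in order, resuming from the last saved checkpoint."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next"]
    for i in range(start, len(pages)):
        process(pages[i])
        with open(checkpoint_path, "w") as f:
            json.dump({"next": i + 1}, f)  # persist only after success
```

If the run crashes on page 7,342, the checkpoint still says 7,342, so the next run picks up exactly there.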

Separate collection from transformation. Store the raw collected data before processing it. If your scraper grabs HTML and immediately tries to parse it, a parsing bug means you lose the raw data and have to re-scrape. If you store the raw HTML first, you can fix the parser and reprocess without re-collecting.

Monitor freshness, not just success. A scraper that “succeeds” but returns zero results because the site’s structure changed is worse than one that fails loudly. Track the number of items collected per run and alert when it deviates significantly from the baseline.
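A simple baseline check along those lines, flagging any run whose count strays more than a set fraction from the recent average (the 50% tolerance is an illustrative default, not a recommendation):

```python
def volume_anomaly(count, baseline_counts, tolerance=0.5):
    """Flag a run whose item count deviates too far from the baseline."""
    if not baseline_counts:
        return False  # no history yet; nothing to compare against
    baseline = sum(baseline_counts) / len(baseline_counts)
    return abs(count - baseline) > tolerance * baseline
```

A run that "succeeds" with 50 items against a 5,000-item baseline trips this check even though no exception was raised.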

Transformation patterns that survive edge cases

The transformation layer is where “it worked in testing” meets “production data is chaos.”

Validate inputs before processing. Before transforming a record, check that required fields exist and have expected types. A price field that’s suddenly a string (“Contact us for pricing”) instead of a number will cascade errors through your entire pipeline if you don’t catch it early.
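Here is one way to catch exactly that price case at the boundary, sketched as a small coercion helper (the cleanup rules are illustrative):

```python
def parse_price(raw):
    """Coerce an external 'price' field to a float, or reject it early."""
    if isinstance(raw, (int, float)) and not isinstance(raw, bool):
        return float(raw)
    if isinstance(raw, str):
        cleaned = raw.replace("$", "").replace(",", "").strip()
        try:
            return float(cleaned)
        except ValueError:
            return None  # "Contact us for pricing" stops here, not downstream
    return None
```

Returning `None` (and logging it) at the boundary is far cheaper than letting a string propagate into arithmetic three steps later.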

Handle nulls explicitly. Every field from an external source can be null. Every one. Design your transformation logic to handle missing data gracefully — either with sensible defaults, by flagging the record for review, or by skipping it with a log entry. Never let a null produce a silent wrong result.

Schema validation at boundaries. Define explicit schemas for your input data and output data. Tools like Zod (TypeScript), Pydantic (Python), or JSON Schema exist for exactly this. Validate every record at the entry point and exit point of your transformation. Records that don’t match the schema go to a “dead letter” queue for investigation, not into your production database.
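The dead-letter pattern looks like this, sketched with a hand-rolled type check standing in for Zod or Pydantic (the `SCHEMA` and field names are hypothetical):

```python
SCHEMA = {"name": str, "price": float}  # hypothetical output schema

def validate_batch(records, schema=SCHEMA):
    """Split records into (valid, dead_letter) at a pipeline boundary."""
    valid, dead_letter = [], []
    for rec in records:
        ok = isinstance(rec, dict) and all(
            isinstance(rec.get(field), typ) for field, typ in schema.items()
        )
        (valid if ok else dead_letter).append(rec)
    return valid, dead_letter
```

In practice a library gives you richer errors (which field, which constraint), but the routing decision is the same: bad records go somewhere visible, never into production tables.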

Make transformations idempotent. Running the same transformation twice on the same input should produce the same output without side effects. This seems obvious, but it’s surprisingly easy to violate — especially with operations that increment counters, append to lists, or generate timestamps. Idempotency means you can safely re-run any step without fear of duplicating or corrupting data.
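The difference is easy to see in miniature: a keyed write is idempotent, an append is not.

```python
def apply_updates(store, updates):
    """Idempotent: re-running the same updates leaves the store unchanged."""
    for rec in updates:
        store[rec["id"]] = rec  # keyed write, not an append
    return store

store = {}
updates = [{"id": 1, "qty": 3}, {"id": 2, "qty": 7}]
apply_updates(store, updates)
apply_updates(store, updates)  # safe to re-run after a partial failure
```

Had the function done `items.append(rec)` or `store[rec["id"]]["qty"] += rec["qty"]`, the second run would have doubled the data.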

Delivery that doesn’t corrupt your database

The delivery layer seems simple — just write the data to the destination. But this is where data integrity lives or dies.

Use upserts, not inserts. If you’re loading product data daily, and a product already exists, you want to update it — not create a duplicate. Upsert operations (insert if new, update if existing) based on a unique key prevent the slow accumulation of duplicate records that plagues most data pipelines.
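With SQLite's `ON CONFLICT` clause (Postgres has the same construct), an upsert on the unique key looks like this; the `products` table and `sku` key are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, price REAL)")

def upsert_product(conn, sku, price):
    """Insert a new product, or update the existing row for that SKU."""
    conn.execute(
        "INSERT INTO products (sku, price) VALUES (?, ?) "
        "ON CONFLICT(sku) DO UPDATE SET price = excluded.price",
        (sku, price),
    )

upsert_product(conn, "A-100", 9.99)
upsert_product(conn, "A-100", 8.49)  # same SKU: updates in place, no duplicate
```

Run the daily load twice by accident and you still have exactly one row per product.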

Batch writes with transactions. Don’t write records one at a time. Batch them into transactions of 100-1,000 records. This is faster (fewer round trips to the database) and safer (if the batch fails, the transaction rolls back cleanly instead of leaving your database in a half-updated state).
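A sketch of batched, transactional loading with `sqlite3`, where the connection's context manager commits a successful batch and rolls back a failed one (table name and batch size are illustrative):

```python
import sqlite3

def load_in_batches(conn, rows, batch_size=500):
    """Write rows in transactional batches; a failed batch rolls back cleanly."""
    sql = "INSERT INTO events (id, payload) VALUES (?, ?)"
    for i in range(0, len(rows), batch_size):
        with conn:  # one transaction per batch: commit on success, rollback on error
            conn.executemany(sql, rows[i:i + batch_size])
```

If batch three fails, batches one and two stay committed and batch three leaves no partial rows behind.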

Schema migrations are not optional. When your data model changes — a new field, a changed type, a renamed column — handle it with explicit migrations, not ad hoc ALTER TABLE statements run in production. Tools like Prisma Migrate, Alembic, or Flyway track schema changes as versioned files that can be applied consistently across environments.

Orchestration: the glue that holds it together

Individual pipeline steps can be reliable, but the orchestration — when things run, in what order, and what happens when something fails — is what determines overall system reliability.

Don’t use cron jobs for anything complex. Cron is fine for “run this script every hour.” It’s terrible for “run step A, then step B if A succeeded, then step C, but retry step B up to 3 times if it fails, and send an alert if the whole pipeline hasn’t completed within 2 hours.” For anything beyond simple scheduling, use a proper workflow orchestrator.
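To make the contrast concrete, here is the kind of "step A, then B with retries" logic an orchestrator gives you declaratively, sketched imperatively (step names and retry count are illustrative):

```python
def run_pipeline(steps, max_retries=3):
    """Run named steps in order; retry each up to max_retries before failing."""
    for name, step in steps:
        for attempt in range(max_retries + 1):
            try:
                step()
                break  # step succeeded; move on to the next one
            except Exception:
                if attempt == max_retries:
                    raise RuntimeError(
                        f"step {name!r} failed after {max_retries + 1} attempts"
                    )
```

Cron cannot express even this much; a workflow engine adds persistence, timeouts, and alerting on top of it.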

In 2026, the practical options are:

  • Temporal for complex, long-running workflows with sophisticated retry and compensation logic
  • Dagster or Prefect for data-specific pipelines with built-in data quality checks
  • Simple queue-based systems (BullMQ, SQS) for straightforward step-by-step processing

Most pipelines don’t need Airflow. Despite its popularity, Airflow’s complexity is overkill for 90% of data collection use cases. A well-designed queue-based system with proper error handling is simpler and more reliable.

Alerting that people actually respond to

The best error handling in the world is useless if nobody sees the alerts.

Alert on anomalies, not just errors. A pipeline that runs successfully but collects 50 records instead of the usual 5,000 is probably broken. Track baseline metrics and alert when values deviate by more than a reasonable threshold.

One channel, high signal. If your alerting sends 20 messages a day, people stop reading them. Reduce noise by deduplicating alerts, batching non-critical notifications, and having clear severity levels. Critical alerts (data pipeline completely stopped) should go to Slack or PagerDuty. Warnings (slower than usual, some records skipped) can go to a daily summary email.

Include context in alerts. “Pipeline failed” is useless. “Collection step failed for source X: HTTP 403 on page 234/10000, last successful run was 2 hours ago, 233 pages already collected and saved” — that’s actionable. Include what failed, why, what was already completed, and ideally a link to the relevant logs.
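Structuring that message is trivial once the pipeline tracks the right state; a sketch of a formatter fed from hypothetical run metadata:

```python
def format_alert(step, source, error, completed, total, last_success):
    """Build an alert that says what failed, why, and what already succeeded."""
    return (
        f"{step} failed for {source}: {error} on page {completed + 1}/{total}, "
        f"last successful run was {last_success}, "
        f"{completed} pages already collected and saved"
    )
```

The hard part isn't the string; it's designing the pipeline so that `completed` and `last_success` are known at alert time, which checkpointing gives you for free.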

Putting it all together

At SilentFlow, we’ve built data pipelines that process millions of records daily from hundreds of sources. The ones that run reliably all share the same DNA: separated concerns between collection, transformation, and delivery. Raw data stored before processing. Schema validation at every boundary. Idempotent operations throughout. Meaningful alerting with context.

None of this is rocket science. It’s just discipline — applying known patterns consistently instead of taking shortcuts that save time today and cost weekends later. The Monday morning panic is optional. You just have to build the pipeline that makes it unnecessary.
