Real-Time Data Pipelines: 6 Lessons from Building BorderBot

Real-time data sounds simple: collect data, process it, display it. In practice, every step hides complexity that only surfaces in production.

BorderBot was my crash course in real-time data. It scraped border wait times every fifteen minutes, processed them into predictions, and served the results to users who needed accurate, timely information for their daily commute. The stakes weren't life-or-death, but they were daily: a bad prediction meant sitting in traffic for three hours instead of thirty minutes.

Here's what real-time data actually looks like from the builder's side.

Data Freshness Is a Spectrum

"Real-time" is a marketing term, not a technical specification. Every data system has latency: the gap between when something happens and when your system knows about it.

BorderBot's data pipeline had multiple latency sources:

Source latency. U.S. Customs and Border Protection (CBP) updates wait times periodically, not continuously. The "current" wait time on their website might be five to thirty minutes old. I had no control over this latency.

Scraping latency. My scraper ran every fifteen minutes, so between scrapes the displayed data was up to fifteen minutes stale. A shorter interval would have improved freshness but risked rate limiting or blocking by CBP's servers.

Processing latency. After scraping, the data needed validation, cleaning, and insertion into the database. This took seconds, but seconds add up when freshness matters.

Delivery latency. Users saw data only when they opened the app or when it refreshed. A user who opened the app at minute 14 of a scrape cycle was looking at a scrape that was already 14 minutes old; add the source's own lag, and the data could be nearly thirty minutes old by the time they made a decision.

Total latency: anywhere from 5 to 45 minutes. For a prediction product, this was acceptable. For a real-time monitoring product, it would have been inadequate.
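The 5-to-45-minute figure is the sum of the source and scraping ranges above. Here is a minimal sketch of that arithmetic, assuming processing time (seconds) is small enough to ignore; the `LatencySource` type and `total_latency` helper are illustrative names, not part of BorderBot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySource:
    name: str
    min_minutes: float
    max_minutes: float


def total_latency(sources):
    """Return (best_case, worst_case) end-to-end latency in minutes."""
    best = sum(s.min_minutes for s in sources)
    worst = sum(s.max_minutes for s in sources)
    return best, worst


# Ranges taken from the text; processing (seconds) is deliberately left out.
STACK = [
    LatencySource("source (CBP update lag)", 5, 30),
    LatencySource("scraping (15-minute interval)", 0, 15),
]

best, worst = total_latency(STACK)
print(f"Total latency: {best} to {worst} minutes")
```

Writing the stack down this way makes it obvious which layer dominates: here the source lag, which the builder cannot change.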

The lesson: understand your total latency stack and communicate it to users. "Updated 12 minutes ago" is more honest and more useful than "real-time."
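A label like "Updated 12 minutes ago" can be generated from the scrape timestamp. The sketch below assumes timestamps are stored as timezone-aware UTC datetimes; `freshness_label` is a hypothetical helper. Note that it measures time since the scrape, so the source's own lag is not included.

```python
from datetime import datetime, timedelta, timezone


def freshness_label(scraped_at, now=None):
    """Describe how long ago the data was scraped, in plain language."""
    now = now or datetime.now(timezone.utc)
    # Clamp at zero so small clock skew never produces a negative age.
    minutes = max(0, int((now - scraped_at).total_seconds() // 60))
    if minutes < 1:
        return "Updated just now"
    if minutes == 1:
        return "Updated 1 minute ago"
    if minutes < 60:
        return f"Updated {minutes} minutes ago"
    hours = minutes // 60
    return f"Updated {hours} hour{'s' if hours != 1 else ''} ago"


now = datetime(2024, 1, 8, 15, 0, tzinfo=timezone.utc)
print(freshness_label(now - timedelta(minutes=12), now=now))
```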

Data Quality Is Never Guaranteed

Scraped data is dirty data. Even from authoritative sources like government websites, the data had problems:

Missing values. Sometimes the scraper returned empty results: the page loaded but the wait time wasn't displayed. Server error? Rendering issue? Genuine zero-minute wait? Without context, it was impossible to distinguish.

Impossible values. Occasionally, the reported wait time was clearly wrong: zero minutes during peak hours, or twelve hours during off-peak. These outliers needed to be detected and handled.

Format changes. The source website changed its HTML structure twice during BorderBot's operation. Each change broke the scraper silently. It continued running but returned garbage data until I noticed and updated the parsing logic.

Time zone inconsistencies. Wait times were reported in local time, but "local" wasn't always clear for border crossings that span two time zones (Tijuana is sometimes in a different time zone than San Diego).

Each data quality issue required a specific fix:

Missing values: use the last known value with a staleness indicator
Impossible values: define acceptable ranges and flag outliers for review
Format changes: monitor scraper output structure and alert on deviation
Timezone issues: normalize everything to UTC internally, display in local time
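Three of these fixes fit in one small cleaning step. The sketch below is an illustration under assumptions, not BorderBot's actual code: the `Reading` type, the `clean` function, and the 0-to-360-minute range are all hypothetical. A static range like this would catch the twelve-hour outlier but not a zero during peak hours; that would need a time-of-day check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional

# Assumed plausible range in minutes; tune per crossing.
MIN_WAIT, MAX_WAIT = 0, 360


@dataclass
class Reading:
    wait_minutes: int
    observed_at_utc: datetime
    is_stale: bool = False       # True when reusing the last known value
    needs_review: bool = False   # True when the value is outside the range


def to_utc(local_naive: datetime, tz: tzinfo) -> datetime:
    """Attach the crossing's time zone, then normalize to UTC."""
    return local_naive.replace(tzinfo=tz).astimezone(timezone.utc)


def clean(raw_wait: Optional[int], observed_local: datetime, tz: tzinfo,
          last_good: Optional[Reading]) -> Optional[Reading]:
    if raw_wait is None:
        # Missing value: fall back to the last known reading, marked stale.
        if last_good is None:
            return None
        return Reading(last_good.wait_minutes, last_good.observed_at_utc,
                       is_stale=True)
    out_of_range = not (MIN_WAIT <= raw_wait <= MAX_WAIT)
    return Reading(raw_wait, to_utc(observed_local, tz),
                   needs_review=out_of_range)


# A fixed UTC-8 offset keeps the example self-contained.
pacific_standard = timezone(timedelta(hours=-8))
good = clean(45, datetime(2024, 1, 8, 7, 0), pacific_standard, None)
print(clean(None, datetime(2024, 1, 8, 7, 15), pacific_standard, good))
```

In practice, passing `ZoneInfo("America/Tijuana")` or `ZoneInfo("America/Los_Angeles")` from the standard `zoneinfo` module instead of a fixed offset would handle daylight saving transitions.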

The meta-lesson: every data source will surprise you. Build monitoring that catches surprises quickly, and build fallbacks that degrade gracefully when data quality drops.

Predictions Need Confidence Intervals

BorderBot's core value was predictions: not just what the wait time is now, but what it will be in one hour. Early versions displayed predictions as single numbers: "Predicted wait at 8 AM: 90 minutes."

Users treated this as a guarantee. When the actual wait was 120 minutes, they felt betrayed. The prediction was presented with false precision.

I added confidence intervals: "Predicted wait at 8 AM: 75-105 minutes." This changed user behavior dramatically. Instead of expecting exactly 90 minutes, they planned for the range. Satisfaction improved even though the predictions didn't change. The presentation did.

Confidence intervals also built trust. A system that says "we're 80% confident the wait will be between X and Y" is more trustworthy than a system that says "the wait will be exactly Z." The first acknowledges uncertainty honestly. The second pretends certainty that doesn't exist.
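One simple way to produce an 80% range is to look at how far past predictions missed and take the 10th and 90th percentiles of those errors. This is a sketch of that empirical approach, not necessarily how BorderBot computed its intervals; `prediction_interval` and the sample errors are hypothetical.

```python
from statistics import quantiles


def prediction_interval(predicted_minutes, past_errors):
    """Return an approximate 80% interval around a point prediction.

    past_errors holds (actual - predicted) in minutes for earlier
    predictions made under similar conditions.
    """
    if len(past_errors) < 2:
        raise ValueError("need at least two past errors")
    # n=10 yields nine cut points: the first is the 10th percentile,
    # the last is the 90th.
    deciles = quantiles(past_errors, n=10, method="inclusive")
    low = predicted_minutes + deciles[0]
    high = predicted_minutes + deciles[-1]
    return max(0, round(low)), round(high)


# Illustrative errors only, not real BorderBot data.
errors = [-20, -15, -10, -5, 0, 5, 10, 15, 20]
low, high = prediction_interval(90, errors)
print(f"Predicted wait at 8 AM: {low}-{high} minutes")
```

The interval is only as good as the error history behind it, so errors from peak hours and off-peak hours are probably best kept separate.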

This principle applies to every data product: communicate uncertainty alongside predictions. Users handle uncertainty well when it's presented honestly. They handle false precision badly when it's disproven by reality.

Caching Strategy Matters More Than You Think

BorderBot's caching strategy went through three iterations:

v1: No cache. Every user request hit the database. Simple but slow, and the database was the bottleneck.

v2: Static TTL cache. Cache predictions for five minutes. Fast but inflexible. The cache didn't account for rapid changes during peak hours or the stability of off-peak periods.

v3: Dynamic TTL cache. Cache duration varies by time of day and day of week. During peak hours (when wait times change rapidly), cache for two minutes. During off-peak hours (when wait times are stable), cache for fifteen minutes.

The dynamic cache reduced database load by 80% while keeping peak-hour data fresh enough for useful predictions.
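The v3 idea can be sketched in a few lines. The two-minute and fifteen-minute durations come from the description above; the peak windows, `is_peak`, `cache_ttl_seconds`, and the tiny `TTLCache` class are assumptions for illustration.

```python
from datetime import datetime

PEAK_TTL_SECONDS = 2 * 60        # two minutes during peak hours
OFF_PEAK_TTL_SECONDS = 15 * 60   # fifteen minutes during off-peak hours

# Assumed peak windows (local hours, weekdays only); tune to real traffic.
PEAK_WINDOWS = [(5, 10), (15, 19)]


def is_peak(local_time: datetime) -> bool:
    if local_time.weekday() >= 5:  # Saturday or Sunday
        return False
    return any(start <= local_time.hour < end for start, end in PEAK_WINDOWS)


def cache_ttl_seconds(local_time: datetime) -> int:
    return PEAK_TTL_SECONDS if is_peak(local_time) else OFF_PEAK_TTL_SECONDS


class TTLCache:
    """Minimal in-memory cache; the caller supplies the clock."""

    def __init__(self):
        self._entries = {}  # key -> (value, expires_at_epoch_seconds)

    def get(self, key, now_epoch):
        entry = self._entries.get(key)
        if entry is None or entry[1] <= now_epoch:
            return None  # missing or expired
        return entry[0]

    def set(self, key, value, now_epoch, ttl_seconds):
        self._entries[key] = (value, now_epoch + ttl_seconds)


monday_rush = datetime(2024, 1, 8, 7, 30)
print(cache_ttl_seconds(monday_rush))
```

Passing the current time in explicitly, rather than reading the clock inside the cache, keeps expiry behavior easy to test.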

The broader lesson: caching isn't just a performance optimization. It's a product decision. The cache duration determines how fresh the data feels to users. Too short and you overload your infrastructure. Too long and users see stale data. The right duration depends on how quickly the underlying data changes and how sensitive users are to staleness.

Monitoring Changes Everything

The most important infrastructure investment for BorderBot wasn't the prediction model or the frontend. It was monitoring.

I built alerts for:

Scraper failures (no data returned for two consecutive cycles)
Data quality drops (values outside expected ranges)
Prediction accuracy degradation (predicted vs. actual divergence exceeding threshold)
User-reported inaccuracies (feedback that predictions were wrong)

Each alert triggered a specific response. Scraper failure? Check the source website. Data quality drop? Review recent values and check for source changes. Prediction divergence? Retrain the model with recent data.
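Two of these checks are simple enough to sketch directly. The two-cycle rule comes from the list above; the 20-minute error threshold and the function names are assumptions chosen for illustration.

```python
from typing import Optional, Sequence, Tuple


def scraper_failed(recent_waits: Sequence[Optional[int]],
                   cycles: int = 2) -> bool:
    """True when the last `cycles` scrapes all returned no data (None)."""
    if len(recent_waits) < cycles:
        return False
    return all(wait is None for wait in recent_waits[-cycles:])


def prediction_diverged(pairs: Sequence[Tuple[float, float]],
                        max_mean_abs_error: float = 20.0) -> bool:
    """True when mean |predicted - actual| exceeds the threshold (minutes)."""
    if not pairs:
        return False
    mae = sum(abs(predicted - actual) for predicted, actual in pairs) / len(pairs)
    return mae > max_mean_abs_error


def collect_alerts(recent_waits, pairs):
    """Pair each triggered alert with its first response step."""
    alerts = []
    if scraper_failed(recent_waits):
        alerts.append("scraper failure: check the source website")
    if prediction_diverged(pairs):
        alerts.append("prediction divergence: retrain the model with recent data")
    return alerts


print(collect_alerts([30, None, None], [(90, 120), (60, 75)]))
```

Running a check like this after every scrape cycle is what turns silent failures into problems caught within minutes.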

Without monitoring, problems accumulated silently. With monitoring, problems were caught within minutes and fixed within hours. The difference in user experience was substantial. A system that's occasionally wrong but quickly corrected is trusted. A system that's occasionally wrong and stays wrong is abandoned.

User Expectations Scale Faster Than Capability

When BorderBot launched with basic wait time display and simple predictions, users were impressed. "This is way better than checking CBP's website manually!"

Within weeks, users expected more. "Can you predict for next Tuesday?" "Can you show predictions for all three crossings simultaneously?" "Can you factor in special events?" "Can you integrate with Google Maps for route planning?"

Each request was reasonable. Collectively, they represented a tenfold increase in scope. The gap between what users want and what a solo developer can build grows faster than the developer can close it.

I learned to manage expectations through transparency. "Here's what the system can do. Here's what's planned. Here's what's beyond current scope." Users who understand the constraints are more patient than users who assume infinite capability.

This expectation management is critical for every data product. Data products create an illusion of omniscience ("the system knows things") that inflates user expectations beyond what any reasonable system can deliver.

Real-time data is never as real-time, as accurate, or as comprehensive as users expect. The builder's job isn't to meet those expectations. It's to set them correctly and exceed them consistently within achievable bounds.
