hb.dev

Not everything needs real-time inference

3/19/2026 · Mico Boje

There is a pattern we see often. A team trains a model, it works well in a notebook, and the next question is: how do we serve this? The default answer, increasingly, is "build an API." Wrap the model in FastAPI, deploy it behind a load balancer, add authentication, caching, monitoring, and you have a real-time inference service.

Sometimes that is exactly right. But most of the time, it is not.

The majority of ML predictions in production do not need to happen in real time. They can run overnight as a batch job, write results to a database, and let the application read from that table. The infrastructure is simpler, the failure modes are fewer, and the cost is a fraction of what a live API requires.

The question is not "can we serve this in real time?" You always can. The question is "does the prediction change based on something the user just did?" If the answer is no, batch is almost certainly the better choice.

When batch is enough

Most forecasting, scoring, and recommendation workloads fall into this category. The inputs are historical data that updates daily or weekly. The model runs once, produces predictions for every entity (every product, every customer, every sensor), and the results are valid until the next run.

A demand forecasting system for a fashion brand is a good example. The model predicts how many units of each product will sell over the coming weeks. The inputs are historical sales, inventory levels, seasonality patterns, and collection metadata. None of these change while a planner is looking at the dashboard. The model can run every night, write forecasts to PostgreSQL, and the application serves pre-computed results instantly.

The architecture for this is simple: a scheduled job (cron, Airflow, or a cloud scheduler) triggers the model, it reads from the data warehouse, computes predictions, and writes them back. The application never touches the model directly. It reads from a table.
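As a minimal sketch of that loop, here is a nightly batch job using SQLite as a stand-in for the warehouse and serving database. The table names, the `predict_units` placeholder model, and the schema are all illustrative, not a prescribed design:

```python
import sqlite3
from datetime import date

def predict_units(avg_daily_sales: float, horizon_days: int = 28) -> float:
    """Placeholder for the trained model: a naive rate * horizon forecast."""
    return avg_daily_sales * horizon_days

def run_batch(conn: sqlite3.Connection) -> int:
    """Score every product and append a fresh forecast row per product.

    Appending (rather than overwriting) keeps yesterday's predictions
    available as a fallback if tonight's run produces bad output.
    """
    rows = conn.execute(
        "SELECT product_id, avg_daily_sales FROM product_features"
    ).fetchall()
    today = date.today().isoformat()
    conn.executemany(
        "INSERT INTO forecasts (product_id, run_date, predicted_units) "
        "VALUES (?, ?, ?)",
        [(pid, today, predict_units(rate)) for pid, rate in rows],
    )
    conn.commit()
    return len(rows)

# Demo setup: in production the scheduler (cron, Airflow, etc.) would
# invoke run_batch against the real warehouse and serving database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_features (product_id TEXT, avg_daily_sales REAL)")
conn.execute("CREATE TABLE forecasts (product_id TEXT, run_date TEXT, predicted_units REAL)")
conn.executemany("INSERT INTO product_features VALUES (?, ?)",
                 [("sku-1", 3.5), ("sku-2", 1.2)])
scored = run_batch(conn)
```

The application never imports the model; it only queries the `forecasts` table.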

This has real advantages. The model can take as long as it needs. A batch job that runs for 20 minutes is fine if it runs at 3am. There is no latency budget, no connection pooling, no autoscaling concern. If it fails, you retry. If the output looks wrong, you still have yesterday's predictions in the table as a fallback. Debugging is straightforward because every run produces a complete, inspectable output.

When you actually need an API

Real-time inference is justified when the prediction depends on input that only exists at the moment the user (or system) asks for it. Two patterns come up consistently.

Sensor and streaming data. IoT devices, video feeds, network traffic monitors. The input is continuous and the value of the prediction degrades rapidly with delay. A drone detection system needs to classify objects as they appear in the sensor feed. A predictive maintenance system monitoring vibration data needs to flag anomalies as they happen. Batch is useless here because by the time the job runs, the moment has passed.

User-driven configuration. This is the less obvious case, but it comes up in any system where the prediction depends on a choice the user makes at query time. In demand forecasting, the batch model can handle existing products with known attributes. But when a planner is configuring a new product, they need to specify which seasonal collection it belongs to, what price point it sits at, and which channels it will sell through. These are inputs the model has never seen, and the prediction changes meaningfully depending on the choices made.

A pair of shorts assigned to a summer collection will have a fundamentally different demand curve than the same shorts assigned to winter. The planner needs to see that difference in real time as they make decisions. That requires a live API that accepts user input and returns a prediction on the fly.

This is where the architectural split matters. The same forecasting system can run batch predictions for thousands of known products overnight, while exposing a lightweight API for the interactive new-product configuration workflow. The batch path handles volume. The API path handles interactivity. They share the same model and the same business logic, but they serve different needs.
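The split can be sketched as one shared prediction function with two thin callers. Everything here is illustrative (the toy seasonality multipliers, the field names, the base rate); the point is only that the batch path and the request path call the same business logic:

```python
from dataclasses import dataclass, field

SEASONALITY = {"summer": 1.4, "winter": 0.6}  # toy multipliers, not real values

@dataclass
class ProductConfig:
    """Attributes a planner chooses at configuration time."""
    collection: str                 # e.g. "summer" or "winter"
    price: float
    channels: list = field(default_factory=list)

def predict_units(base_rate: float, config: ProductConfig,
                  horizon_days: int = 28) -> float:
    """Shared business logic: the single function behind both serving paths."""
    multiplier = SEASONALITY.get(config.collection, 1.0)
    return base_rate * multiplier * horizon_days

def score_catalog(products: dict) -> dict:
    """Batch path: called once per known product during the nightly run."""
    return {pid: predict_units(rate, cfg) for pid, (rate, cfg) in products.items()}

def handle_request(base_rate: float, config: ProductConfig) -> dict:
    """API path: called per request, e.g. from a FastAPI endpoint that
    parses ProductConfig out of the request body."""
    return {"predicted_units": predict_units(base_rate, config)}

# The same shorts, assigned to different collections, produce different curves:
summer = handle_request(2.0, ProductConfig("summer", 49.0, ["web"]))
winter = handle_request(2.0, ProductConfig("winter", 49.0, ["web"]))
```

Keeping `predict_units` as the single source of truth is what lets the two paths stay consistent as the model evolves.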

What a real-time inference API actually requires

Once you decide you need an API, the scope of the problem expands considerably. A model in a notebook takes input and returns output. A production API needs authentication, multi-tenancy, connection pooling, caching, timeouts, and observability, and it needs to fail gracefully.


This is where most teams underestimate the effort. Data scientists build excellent models, but a production inference API is a software engineering problem: authentication, input validation, error handling, connection management, observability, graceful degradation under load. These are not skills most data science teams have depth in, and they should not have to.

The compute profile is also different from batch. Batch can use large instances with high memory. An API needs to respond in hundreds of milliseconds. That means pre-loading data into memory, using fast dataframe libraries like Polars instead of Pandas, and being disciplined about what hits the database per request versus what can be cached.
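One way to enforce that discipline is to load reference data once at startup and keep the per-request path to in-memory lookups. A stdlib-only sketch (in a real service the in-memory store might be a Polars DataFrame rather than a dict, and `load_features` would read from the warehouse; all names here are illustrative):

```python
from functools import lru_cache

_FEATURES: dict = {}  # populated once at startup, never per request

def load_features() -> None:
    """Startup hook: pull the feature table into memory once.
    Stubbed here; production code would query the warehouse."""
    _FEATURES.update({
        "sku-1": {"avg_daily_sales": 3.5},
        "sku-2": {"avg_daily_sales": 1.2},
    })

@lru_cache(maxsize=1024)
def seasonal_multiplier(collection: str) -> float:
    """Cacheable pure lookup; a costlier version might hit the DB
    once per distinct collection instead of once per request."""
    return {"summer": 1.4, "winter": 0.6}.get(collection, 1.0)

def predict(product_id: str, collection: str, horizon_days: int = 28) -> float:
    """Per-request path: dictionary lookups and arithmetic only, no I/O."""
    rate = _FEATURES[product_id]["avg_daily_sales"]
    return rate * seasonal_multiplier(collection) * horizon_days

load_features()
```

The request handler never blocks on the database for data that was already known at startup.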

Connection management becomes critical. A FastAPI service handling concurrent requests needs async database connections with proper pooling. Without it, a burst of traffic can exhaust database connections and cascade into failures across the entire application. Query timeouts need to be set aggressively. A query that runs for two minutes is acceptable in a batch job. In an API, it is a denial of service.
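Both disciplines can be sketched with the standard library alone: a bounded semaphore standing in for a real connection pool (such as the one asyncpg provides), and `asyncio.wait_for` enforcing an aggressive per-query timeout. The delays and pool size are illustrative:

```python
import asyncio

QUERY_TIMEOUT_S = 0.5  # fail fast instead of letting slow queries pile up

async def run_query(pool: asyncio.Semaphore, delay: float) -> str:
    """Stand-in for a DB query that takes `delay` seconds."""
    async with pool:               # waits here if all connections are in use
        await asyncio.sleep(delay)
        return "rows"

async def handle_request(pool: asyncio.Semaphore, delay: float) -> str:
    try:
        return await asyncio.wait_for(run_query(pool, delay),
                                      timeout=QUERY_TIMEOUT_S)
    except asyncio.TimeoutError:
        return "timeout"           # degrade gracefully, don't hang the worker

async def main() -> tuple:
    pool = asyncio.Semaphore(10)   # cap concurrent "connections" at 10
    fast = await handle_request(pool, 0.01)  # well under the budget
    slow = await handle_request(pool, 2.0)   # would tie up a worker for 2s
    return fast, slow

results = asyncio.run(main())
```

The slow query is cut off at the timeout and returns a controlled error instead of holding a connection until the pool is exhausted.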

Start with batch, promote to real-time

Our recommendation for most teams: start every prediction workload as a batch job. Serve pre-computed results from a database. It is faster to build, easier to debug, and cheaper to run.

Then identify the specific use cases where the user experience genuinely requires a live prediction. Build an API only for those cases. Keep the batch path for everything else.

This is not a temporary compromise. It is the target architecture. Even at scale, the vast majority of predictions in a well-designed system will be pre-computed. The real-time API handles the interactive edge cases where user input changes the output.

The mistake we see most often is teams building a real-time inference API for predictions that could have been a nightly cron job. The API works, but it adds latency to a dashboard that could have been instant, introduces a live dependency that could have been a static table read, and costs 10x more in infrastructure for no improvement in the user experience.

Not every model needs to be an API. Most should not be. The ones that do should be the ones where the value of real-time genuinely justifies the complexity.