Automated simulation pipeline for defence data foundations
1 December 2025 • Defence sector client • Defence / Simulation
Before: manual, one run at a time
After: automated, thousands of runs per day
Training ML models to predict real-world behaviour requires large volumes of high-quality simulation data. When simulations are run manually, one at a time, with results gathered by hand, the data bottleneck blocks the entire ML initiative before it starts.
The challenge
Our client needed to run thousands of simulation scenarios across a wide range of input parameters to build a data foundation for ML models. The existing process was entirely manual: an engineer would configure a simulation, run it, wait for it to complete, and then collect and organise the output by hand.
This created several problems:
- A single engineer could only run a handful of simulations per day, far short of the volume needed for meaningful ML training data
- There was no traceability: it was impossible to reliably link a batch of output data back to the exact configuration that produced it
- When simulations failed, there was no visibility into what went wrong. A failed run meant starting over from scratch
- No automatic recovery: transient failures required manual intervention and re-runs
- The manual process could not scale, and it was blocking downstream ML work that depended on having a large, well-structured dataset
Our approach
We designed and built a fully automated simulation pipeline that could schedule, execute, monitor, and store simulation runs around the clock with no manual intervention.
Scheduling (Prefect · cron triggers) → Simulation execution (Monte Carlo scenarios) → Data collection (MinIO · batch capture) → Processing (validation · transforms) → Storage (PostgreSQL · MinIO) → ML-ready dataset (training · evaluation)
Pipeline orchestration
Prefect serves as the orchestration engine. The client's infrastructure spans multiple on-premise machines: dedicated servers for running simulations and separate servers for post-processing. Prefect's worker and work pool model was a natural fit, allowing us to assign workloads to the right machines without building custom distribution logic. Simulation runs are scheduled via configurable triggers, and Prefect manages the full lifecycle of each run: dispatching to the correct worker pool, monitoring progress, handling retries on failure, and recording outcomes. The pipeline runs autonomously 24/7. Failed simulations are automatically retried with configurable backoff, and operators have full visibility into run status, throughput, and failure rates through Prefect's built-in dashboards.
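The automatic retry behaviour described above can be illustrated in plain Python. This is a minimal stand-in for what Prefect provides via `@task(retries=..., retry_delay_seconds=...)`, not the client's actual code; the function names are illustrative:

```python
import time

def run_with_retries(task, max_retries=3, base_delay=1.0, backoff=2.0, sleep=time.sleep):
    """Retry a failing task with exponential backoff, as the orchestrator
    does for failed simulation runs."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the failure for alerting
            sleep(delay)
            delay *= backoff

# Example: a simulation stub that fails twice before succeeding.
attempts = {"n": 0}

def flaky_simulation():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "results"

result = run_with_retries(flaky_simulation, sleep=lambda _: None)
```

In the real pipeline the orchestrator also persists each attempt's state, so a run interrupted mid-simulation can be reassigned to another worker rather than merely retried in place.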
Traceability and metadata
Every simulation run is linked to its exact configuration parameters in PostgreSQL. This gives the team full lineage from input settings through to output dataset batches. Any data point in the resulting dataset can be traced back to the simulation run and configuration that produced it. This audit trail supports both reproducibility and compliance requirements.
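A lineage record of this shape can be sketched with `sqlite3` standing in for PostgreSQL. The table and column names, and the example configuration fields, are illustrative assumptions, not the client's actual schema:

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE run (
    run_id      INTEGER PRIMARY KEY,
    config_hash TEXT NOT NULL,
    config_json TEXT NOT NULL
);
CREATE TABLE output_batch (
    batch_id INTEGER PRIMARY KEY,
    run_id   INTEGER NOT NULL REFERENCES run(run_id),
    uri      TEXT NOT NULL
);
""")

def record_run(config: dict) -> int:
    """Store the exact configuration alongside every run so outputs stay traceable."""
    blob = json.dumps(config, sort_keys=True)
    digest = hashlib.sha256(blob.encode()).hexdigest()
    cur = conn.execute(
        "INSERT INTO run (config_hash, config_json) VALUES (?, ?)", (digest, blob)
    )
    return cur.lastrowid

run_id = record_run({"scenario": 7, "seed": 42})
conn.execute(
    "INSERT INTO output_batch (run_id, uri) VALUES (?, ?)",
    (run_id, "s3://sim-output/batch-0001"),
)

# Trace a batch back to the exact configuration that produced it.
(config_json,) = conn.execute(
    "SELECT r.config_json FROM output_batch b "
    "JOIN run r ON r.run_id = b.run_id WHERE b.batch_id = 1"
).fetchone()
```

Hashing the canonicalised configuration gives a stable identifier for "same inputs", which is also what makes re-runs and audits straightforward.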
Data storage and processing
MinIO provides S3-compatible object storage for simulation outputs, running entirely on-premise. Automated post-processing pipelines validate and transform raw simulation output into ML-ready datasets. Batch tracking gives clear visibility into which simulation batches completed successfully and which need attention, making it straightforward to identify and re-run problematic batches.
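The batch-tracking idea reduces to a small status ledger. A minimal sketch (class and status names are illustrative; the real pipeline keeps this state in PostgreSQL):

```python
from enum import Enum

class BatchStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"

class BatchTracker:
    """Record the outcome of each simulation batch, then query which ones
    need attention and should be re-run."""

    def __init__(self):
        self._status = {}

    def mark(self, batch_id: str, status: BatchStatus) -> None:
        self._status[batch_id] = status

    def needing_attention(self) -> list:
        return sorted(b for b, s in self._status.items() if s is BatchStatus.FAILED)

tracker = BatchTracker()
tracker.mark("batch-0001", BatchStatus.COMPLETED)
tracker.mark("batch-0002", BatchStatus.FAILED)
tracker.mark("batch-0003", BatchStatus.COMPLETED)
```

A single query over this ledger answers "which batches failed overnight?", which is what makes targeted re-runs straightforward instead of starting over from scratch.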
Security and reliability
The entire stack runs on-premise on Linux servers with no external network dependencies. PostgreSQL, MinIO, and Prefect are all self-hosted within the client's infrastructure. No simulation data leaves the network, meeting strict data sovereignty requirements.
Reliability was a core design constraint. The pipeline handles hardware failures and transient errors gracefully: every task is idempotent and safely retryable. Prefect tracks the state of every run, so if a machine goes down mid-simulation, the work is automatically reassigned and restarted. Structured logging and alerting ensure the team is notified of failures before they compound. The result is a system that runs unattended for weeks at a time without data loss or silent failures.
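One way to make a storage task safely retryable, sketched here as an assumption about the general technique rather than the client's implementation: derive the output location deterministically from the run configuration, so a retried run overwrites its own partial output instead of creating a duplicate.

```python
import hashlib
import json

store = {}  # stand-in for the object store

def output_key(config: dict) -> str:
    """Derive a deterministic key from the run configuration, so a retried
    run writes to the same location as the original attempt."""
    blob = json.dumps(config, sort_keys=True).encode()
    return "sim-output/" + hashlib.sha256(blob).hexdigest()[:16]

def store_results(config: dict, payload: bytes) -> str:
    key = output_key(config)
    store[key] = payload  # overwrite on retry: running the task twice is safe
    return key

cfg = {"scenario": 7, "seed": 42}
k1 = store_results(cfg, b"first attempt")
k2 = store_results(cfg, b"retry after transient failure")
```

Because the key depends only on the inputs, the two attempts collide on the same location and the dataset never accumulates duplicate or half-written batches.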
Results
The pipeline replaced a manual process that could produce a handful of simulation runs per day with a fully autonomous system.
- 1000s of simulations per day, up from single manual runs
- 24/7 autonomous operation, with no manual intervention
- Full data lineage: traceability and sovereignty
- Automatic failure recovery: retry, logging, and alerting
The entire system is deployed fully on-premise with zero external network dependencies. The data bottleneck was removed, and the client's ML team can now train models on a data foundation that was previously impossible to produce at the required scale, all running within their own infrastructure, with full traceability and complete data sovereignty.
Client feedback
“What used to take an engineer all day now runs overnight without anyone touching it.”
“We finally have the data volume to train models properly. The pipeline removed the bottleneck we had been stuck on for months.”
Working on a similar challenge?
We build AI systems for defence and critical infrastructure clients across Northern Europe. Let's talk about what's possible for your environment.
Let's talk