
Machine Learning in Production: Beyond the Jupyter Notebook

A working model is not a working product. The gap between a notebook with 95 percent accuracy and a deployed system that holds up in production is where most ML projects quietly die.

This is the production checklist we run for every machine learning engagement.

The model is 20 percent of the work

In a typical ML production engagement, training the model takes about a fifth of the timeline. The rest is data pipelines, deployment, monitoring, and the human processes that surround the model.

Teams that skip the other 80 percent ship something that works at launch and degrades silently over weeks.

Data pipelines come first

The first thing we build is the data pipeline that will feed the production model. Where does the data come from? How is it cleaned? How are features computed? What happens when an upstream source changes its schema?

If the pipeline is fragile, the model is fragile no matter how good the architecture is. We instrument every step with logs and data quality checks before we touch the model.
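
What that instrumentation looks like in practice varies, but the shape is consistent. Here is a minimal sketch in Python, assuming a pandas DataFrame of incoming events; the column names, dtypes, and null threshold are illustrative, not from any specific engagement:

import logging

import pandas as pd

logger = logging.getLogger("feature_pipeline")

# Hypothetical expectations for one incoming table; names and thresholds
# are placeholders for whatever your upstream contract actually is.
EXPECTED_DTYPES = {"user_id": "int64", "event_ts": "datetime64[ns]", "amount": "float64"}
MAX_NULL_FRACTION = 0.01

def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    # Fail fast on schema changes instead of silently computing garbage features.
    missing = set(EXPECTED_DTYPES) - set(df.columns)
    if missing:
        raise ValueError(f"upstream schema changed, missing columns: {missing}")
    for col, dtype in EXPECTED_DTYPES.items():
        if str(df[col].dtype) != dtype:
            raise ValueError(f"column {col} is {df[col].dtype}, expected {dtype}")
    # Log quality stats on every run so creeping nulls are visible before they hurt.
    for col, frac in df[list(EXPECTED_DTYPES)].isna().mean().items():
        logger.info("null fraction for %s: %.4f", col, frac)
        if frac > MAX_NULL_FRACTION:
            raise ValueError(f"column {col} is {frac:.1%} null, above {MAX_NULL_FRACTION:.0%}")
    return df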

Deployment patterns that work

For most production ML systems we use one of three patterns:

Batch inference. Predictions are precomputed nightly and served from a key-value store. Cheap, fast, and ideal for use cases like daily recommendations or weekly forecasts.

Real-time API. A model wrapped in a REST or gRPC service, autoscaled. The right fit when freshness matters more than cost: fraud detection, dynamic pricing, search ranking. A minimal serving sketch follows below.

Edge deployment. The model runs on the device. Mandatory for low-latency requirements like computer vision on cameras, or privacy-sensitive contexts like clinical decision support.

Each pattern has different operational costs. The right choice is rarely about model accuracy and almost always about latency, cost, and privacy constraints.
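
As an illustration of the real-time pattern, here is a minimal sketch of a model wrapped in a REST service with FastAPI. The artifact path, feature names, and version string are assumptions for the example, not a prescribed layout:

# Minimal real-time inference service. Assumes a scikit-learn style model
# serialized with joblib; every name below is an illustrative placeholder.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_VERSION = "2024-05-01"          # hypothetical version tag
model = joblib.load("model.joblib")   # hypothetical artifact path

app = FastAPI()

class Features(BaseModel):
    amount: float
    account_age_days: int

@app.post("/predict")
def predict(features: Features) -> dict:
    score = model.predict_proba([[features.amount, features.account_age_days]])[0][1]
    # Returning the version with every prediction is what makes rollback and
    # debugging tractable later (see the versioning section below).
    return {"score": float(score), "model_version": MODEL_VERSION}

Serve it with uvicorn and put autoscaling in front; the pattern stays the same regardless of what kind of model sits behind the endpoint.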

Monitoring beyond accuracy

Production ML monitoring covers four things, in order of importance:

Input distribution drift. Are the features the model sees today distributed differently from the training data? This is the earliest warning sign that performance will degrade; a minimal check is sketched below.

Output distribution drift. Are predictions shifting in suspicious ways? A recommender that suddenly recommends one category 80 percent of the time has a problem.

Latency and error rate. Operational basics. Most teams already track these.

Ground truth accuracy. When you can collect labels — clicks, purchases, manual review — measure whether the model is still right.

We dashboard all four. The first drift alert usually fires weeks before anything is visible to users.
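
To make the input drift check concrete, here is a minimal sketch that compares live feature values against a reference sample saved at training time, using a two-sample Kolmogorov-Smirnov test from scipy. The feature names, window, and alert threshold are assumptions for the example:

import numpy as np
from scipy.stats import ks_2samp

ALERT_P_VALUE = 0.01  # hypothetical cutoff; tune per feature and traffic volume

def drift_report(reference: dict[str, np.ndarray], live: dict[str, np.ndarray]) -> dict[str, bool]:
    # One KS test per feature: True means the live distribution looks different
    # enough from the training reference to warrant a look.
    report = {}
    for name, ref_values in reference.items():
        statistic, p_value = ks_2samp(ref_values, live[name])
        report[name] = p_value < ALERT_P_VALUE
    return report

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = {"amount": rng.normal(50, 10, 5000)}   # saved at training time
    live = {"amount": rng.normal(65, 10, 5000)}        # last 24 hours of traffic
    print(drift_report(reference, live))               # {'amount': True}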

Retraining on a schedule

Set a retraining cadence on day one. Weekly, monthly, quarterly — pick what your domain demands and automate it. ML teams that retrain when they get around to it never get around to it. The model rots.
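
What the automation looks like depends on your orchestrator. Here is a minimal sketch as an Apache Airflow DAG, where the two task functions are placeholders for your own pipeline steps (the argument is named schedule in Airflow 2.4 and later, schedule_interval in older releases):

# Scheduled retraining as code, not as a calendar reminder.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def retrain_model():
    ...  # placeholder: pull the latest window, fit, register a new model version

def evaluate_model():
    ...  # placeholder: compare against the current production model before promoting

with DAG(
    dag_id="weekly_retrain",
    schedule="@weekly",          # pick the cadence the domain demands
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    retrain = PythonOperator(task_id="retrain", python_callable=retrain_model)
    evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate_model)
    retrain >> evaluate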

Versioning is not optional

Every deployed model has a version. Every prediction logs the model version that produced it. Every dataset used for training has a version. When something goes wrong six months later, version history is the only path to debugging it.

We use MLflow or DVC depending on the team. The tool matters less than the discipline.
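
The logging side needs no special tooling at all. Here is a minimal sketch of a per-prediction record; the field names are illustrative:

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("predictions")

def log_prediction(request_id: str, features: dict, score: float,
                   model_version: str, feature_pipeline_version: str) -> None:
    record = {
        "request_id": request_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "score": score,
        "model_version": model_version,
        "feature_pipeline_version": feature_pipeline_version,
    }
    # Six months later, this record is what lets you replay the exact inputs
    # against the exact model and feature code that produced the call.
    logger.info(json.dumps(record))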

The human processes that decide outcomes

The last 20 percent that no one talks about: who owns the model? Who gets paged when accuracy drops? Who decides when to retrain? Who decides to roll back a bad release?

Production machine learning is a product, not a project. Without an owner, it slowly stops working.

If you are deploying ML in production and want a second opinion on the surrounding stack, book a free consultation; we have shipped this with healthcare, retail, finance, and logistics teams, and the patterns transfer.