

A closer look at the infrastructure needed for deployment and serving of ML applications.
Goku Mohandas



We've already covered the methods of deployment via our API, Docker, and CI/CD lessons, where the ML system is its own microservice as opposed to being tied to a monolithic general application. This way we're able to scale our ML workflows as needed and use them to deliver value to downstream applications. So in this lesson, we'll instead discuss the types of tasks, serving strategies and how to optimize, orchestrate and scale them. We highly recommend using a framework such as Metaflow to seamlessly interact with all the infrastructure that we'll be discussing.


Before we talk about the infrastructure needed for ML tasks, we need to talk about the fundamental types of ML tasks. A task can involve features that don't change over time. For example, if an API classifies uploaded images, all the input features come from the image the user just uploaded. If that same image is uploaded later and the same model is used, the prediction will remain unchanged. However, a task can also involve features that change over time. For example, if you want to predict whether a user would enjoy a movie, you'll want to retrieve the latest available data for that user's behavior. Using the exact same model, your prediction can change as the user's features change over time. This subtle difference can drive key architectural choices when it comes to how you store, process and retrieve your data (feature stores, data streams, etc.).


The first decision is whether to serve predictions in batches or in real-time, which is entirely based on the feature space (finite vs. unbounded).

Batch serving

We can make batch predictions on a finite set of inputs, which are then written to a database for low-latency inference. When a user or downstream process makes an inference request in real-time, cached results from the database are returned.

  • ✅  generate and cache predictions for very fast inference for users.
  • ✅  the model doesn't need to be spun up as its own service since it's never used in real-time.
  • ❌  predictions can become stale if the user develops new interests that aren't captured by the old data that the current predictions are based on.
  • ❌  input feature space must be finite because we need to generate all the predictions before they're needed for real-time.

Batch serving tasks

What are some tasks where batch serving is ideal?

Show answer

Recommend content that existing users will like based on their viewing history. However, new users may just receive some generic recommendations based on their explicit interests until we process their history the next day. And even if we're not doing batch serving, it might still be useful to cache very popular sets of input features (ex. combination of explicit interests > recommended content) so that we can serve those predictions faster.

Real-time serving

We can also serve live predictions, typically through an HTTPS call with the appropriate input data. This will involve spinning up our ML application as a microservice since users or downstream processes will interact directly with the model.

  • ✅  can yield more up-to-date predictions which may yield a more meaningful user experience, etc.
  • ❌  requires managed microservices to handle request traffic.
  • ❌  requires real-time monitoring since the input space is unbounded, which could yield erroneous predictions.

Real-time serving tasks

In our example task for batch serving above, how can real-time serving significantly improve content recommendations?

Show answer

With batch processing, we generate content recommendations for users offline using their history. These recommendations won't change until we process the batch the next day using the updated user features. But what if the user's taste significantly changes during the day (ex. the user is searching for horror movies to watch)? With real-time serving, we can use these recent features to recommend highly relevant content based on the immediate searches.

Besides wrapping our model(s) as separate, scalable microservices, we can also have a purpose-built model server to host our models. Model servers, such as MLflow, TorchServe, RedisAI or NVIDIA's Triton Inference Server, provide a common interface to interact with models for inspection, inference, etc. In fact, modules like RedisAI can even offer added benefits such as data locality for super fast inference.


We also have control over the features that we use to generate our real-time predictions.

Our use case doesn't necessarily involve entity features changing over time so it makes sense to just have one processing pipeline. However, not all entities in ML applications work this way. Using our content recommendation example, a given user can have certain features that are updated over time, such as favorite genres, click rate, etc. As we'll see below, we have the option to batch process features for users at a previous time or we could process features in a stream as they become available and use them to make relevant predictions.

Batch processing

Batch process features for a given entity at a previous point in time, which are later used for generating real-time predictions.

  • ✅  can perform heavy feature computations offline and have them ready for fast inference.
  • ❌  features can become stale since they were predetermined a while ago. This can be a huge disadvantage when your prediction depends on very recent events (ex. catching fraudulent transactions as quickly as possible).

We discuss different data management architectures, such as databases and data warehouses (DWH), in more detail in the pipelines lesson.

Stream processing

Perform inference on a given set of inputs with near real-time, streaming features for a given entity.

  • ✅  we can generate better predictions by providing real-time, streaming features to the model.
  • ❌  extra infrastructure needed for maintaining data streams (Kafka, Kinesis, etc.) and for stream processing (Apache Flink, Beam, etc.).

Recommend content based on the real-time history that the users have generated. Note that the same model is used but the input data can change and grow.

If we keep shrinking the time between batch processing events toward zero, we'll effectively have stream (real-time) processing since the features will always be up-to-date.
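To make the relationship between batch and stream processing concrete, here is a small sketch that computes a per-user click rate both ways. The event schema `(user_id, clicked)` and the names `UserStats`, `update` and `batch_features` are hypothetical; a production system would read from a stream such as Kafka rather than a Python list.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Iterable, Tuple

Event = Tuple[str, bool]  # (user_id, clicked), a hypothetical event schema


@dataclass
class UserStats:
    impressions: int = 0
    clicks: int = 0

    @property
    def click_rate(self) -> float:
        return self.clicks / self.impressions if self.impressions else 0.0


def update(stats: DefaultDict[str, UserStats], event: Event) -> None:
    """Stream path: fold a single event into the running feature state."""
    user_id, clicked = event
    stats[user_id].impressions += 1
    stats[user_id].clicks += int(clicked)


def batch_features(events: Iterable[Event]) -> DefaultDict[str, UserStats]:
    """Batch path: recompute features from the full event history."""
    stats: DefaultDict[str, UserStats] = defaultdict(UserStats)
    for event in events:
        update(stats, event)
    return stats


history = [("u1", True), ("u1", False), ("u2", False)]
live_event = ("u1", True)

stale = batch_features(history)                      # last batch run
streamed = batch_features(history)
update(streamed, live_event)                         # updated as the event arrives
recomputed = batch_features(history + [live_event])  # next batch run

print(stale["u1"].click_rate)              # 0.5
print(streamed["u1"] == recomputed["u1"])  # True
```

The streamed state matches what the next batch run would produce, but it is available immediately instead of after the next scheduled job.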


Even if our application requires stream processing, it's a good idea to implement the system with batch processing first if it's technically easier. If our task is high-stakes and requires stream processing even for the initial deployments, we can still experiment with batch processing for internal releases. This can allow us to start collecting feedback, generating more data to label, etc.


So far, while we have the option to use batch / streaming features and serve batch / real-time predictions, we've kept the model fixed. This, however, is another decision that we have to make depending on the use case and what our infrastructure allows for.

Offline learning

The traditional approach is to train our models offline and then deploy them for inference. We may periodically retrain them offline as new data becomes labeled, validated, etc. and deploy them after evaluation. We may also expedite retraining if we discover an issue during monitoring such as drift.

  • ✅  don't need to worry about provisioning resources for compute since it happens offline.
  • ✅  no urgency to get recent data immediately labeled and validated.
  • ❌  the model can become stale and may not adapt to recent changes until some monitoring alerts trigger retraining.

Learn more about executing MLOps pipeline tasks using a workflow orchestrator in the Pipelines lesson.

Online learning

In order to truly serve the most informed predictions, we should have a model trained on the most recent data. However, instead of using expensive stateless batch learning, a stateful and incremental learning approach is adopted. Here the model is trained offline, as usual, on the initial dataset but is then stochastically updated at a single instance or mini-batch level as new data becomes available. This removes the compute costs associated with traditional stateless, redundant training on the same past data.

  • ✅  model is aware of distributional shifts and can quickly adapt to provide highly informed predictions.
  • ✅  stateful training can significantly lower compute costs and provide faster convergence.
  • ✅  possible for tasks where the event that occurs is the label (user clicks, time-series, etc.)
  • ❌  may not be possible for tasks that involve explicit labeling or delayed outcomes.
  • ❌  prone to catastrophic inference where the model is learning from malicious live production data (mitigated with monitoring and rollbacks).
  • ❌  models may suffer from catastrophic forgetting as we continue to update them using new data.

What about new feature values?

With online learning, how can we encode new feature values without retraining from scratch?

Show answer

We can use clever tricks to represent out-of-vocabulary (OOV) feature values, such as encoding based on mapped feature values or hashing. For example, we may want to encode the name of a new restaurant but it's not mapped explicitly by our encoder. Instead, we could choose to represent restaurants based on their location, cuisine, etc. so that any new restaurant that has these feature values can be represented in a similar manner to the restaurants we had available during training. Similarly, hashing can map OOV values, but keep in mind that this is a one-way encoding (we can't reverse the hash to see what the value was) and we have to choose a hash size large enough to keep collisions low (<10%).
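A minimal sketch of the hashing approach, assuming a stable hash from `hashlib` (Python's built-in `hash` is salted per process, so it isn't suitable here). The function names and the way the collision rate is measured are illustrative choices, not a standard definition.

```python
import hashlib
from typing import Iterable


def hash_bucket(value: str, n_buckets: int) -> int:
    """Map any string, seen or unseen, to a fixed bucket index."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def collision_rate(values: Iterable[str], n_buckets: int) -> float:
    """Fraction of unique values that share a bucket with another value."""
    unique = set(values)
    if not unique:
        return 0.0
    buckets = {hash_bucket(value, n_buckets) for value in unique}
    return 1.0 - len(buckets) / len(unique)


names = ["Sushi Go", "Taco Town", "Pasta Place", "Brand New Bistro"]

# The encoding is deterministic, so a new restaurant gets a usable index.
print(hash_bucket("Brand New Bistro", 1000) == hash_bucket("Brand New Bistro", 1000))  # True

# With a single bucket every value collides: 1 - 1/4.
print(collision_rate(names, n_buckets=1))  # 0.75
```

Measuring the collision rate on the known vocabulary for a few candidate sizes is one way to pick a hash size that stays under the chosen threshold.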


Once our application is deployed after our offline tests, there are several types of online tests that we can run to determine the performance quality in real-time.

AB tests

AB tests involve sending production traffic to the different systems that we're evaluating and then using statistical hypothesis testing to decide which system is better. Common issues with AB testing include accounting for different sources of bias, such as the novelty effect of showing some users the new system. We also need to ensure that the same users continue to interact with the same systems so we can compare the results without contamination. In many cases, if we're simply trying to compare the different versions for a certain metric, multi-armed bandits will be a better approach.
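As one example of the hypothesis testing step, a two-proportion z-test can compare a binary metric (ex. click-through) between two systems. This is a sketch using only the standard library; the counts are made up for illustration, and a real analysis would also plan the sample size and significance level ahead of time.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value under the normal approximation: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: system A gets 100 clicks from 1000 users, system B gets 130.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(z > 0, p_value < 0.05)  # True True
```

A significant result here only says the difference is unlikely under the null hypothesis; biases such as the novelty effect still have to be handled in the experiment design.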

Canary tests

Canary tests involve sending most of the production traffic to the currently deployed system but sending traffic from a small cohort of users to the new system we're trying to evaluate. Again we need to make sure that the same users continue to interact with the same system as we gradually roll out the new system.
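One way to keep users pinned to the same system during a canary rollout is to derive the assignment from a hash of the user ID instead of a random draw per request. The sketch below assumes string user IDs; `assign_system`, the salt value and the 10,000-bucket resolution are hypothetical choices.

```python
import hashlib

N_BUCKETS = 10_000  # resolution of the rollout (0.01% steps)


def assign_system(user_id: str, canary_fraction: float, salt: str = "canary-v1") -> str:
    """Deterministically route a user to the 'new' or 'current' system."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % N_BUCKETS
    threshold = int(canary_fraction * N_BUCKETS)
    return "new" if bucket < threshold else "current"


users = [f"user-{i}" for i in range(1000)]

early = {u for u in users if assign_system(u, 0.05) == "new"}
later = {u for u in users if assign_system(u, 0.50) == "new"}

# Raising the fraction only adds users to the new system, nobody flips back.
print(early <= later)  # True
```

Because each user's bucket never changes, widening the rollout keeps the original canary cohort on the new system, which avoids contaminating the comparison.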

Shadow tests

Shadow testing involves sending the same production traffic to the different systems. We don't have to worry about system contamination and it's very safe compared to the previous approaches since the new system's results are not served. However, we do need to ensure that we're replicating as much of the production system as possible so we can catch issues that are unique to production early on. But overall, shadow testing makes it easy to monitor the new system, validate operational consistency, etc.
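A shadow test can be sketched as a wrapper that calls both systems on the same request but only returns the current system's result. The names `shadow_serve`, `current_model` and `candidate_model` are hypothetical, and in practice the shadow call would usually run asynchronously so it can't add latency to the served response.

```python
from typing import Callable, List, Tuple

LogEntry = Tuple[str, str, str]  # (request, served result, shadow result)


def shadow_serve(
    request: str,
    current: Callable[[str], str],
    candidate: Callable[[str], str],
    log: List[LogEntry],
) -> str:
    """Serve the current system's result and record the candidate's for comparison."""
    served = current(request)
    try:
        shadow = candidate(request)
    except Exception as exc:  # a failing candidate must never affect users
        shadow = f"error: {exc!r}"
    log.append((request, served, shadow))
    return served


def current_model(text: str) -> str:
    return "positive" if "good" in text else "negative"


def candidate_model(text: str) -> str:
    if not text:
        raise ValueError("empty input")
    return "positive" if "good" in text or "great" in text else "negative"


log: List[LogEntry] = []
print(shadow_serve("great movie", current_model, candidate_model, log))  # negative
print(shadow_serve("", current_model, candidate_model, log))             # negative
print(log[0])  # ('great movie', 'negative', 'positive')
```

The log captures both disagreements and candidate failures, which can be reviewed offline before any user ever sees the new system's output.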

What can go wrong?

If shadow tests allow us to test our updated system without having to actually serve the new results, why doesn't everyone adopt it?

Show answer

With shadow deployment, we'll miss out on any live feedback signals (explicit/implicit) from our users since users are not directly interacting with the product using our new version.

We also need to ensure that we're replicating as much of the production system as possible so we can catch issues that are unique to production early on. This is rarely possible because, while your ML system may be a standalone microservice, it ultimately interacts with an intricate production environment that has many dependencies.


The way we process our features and serve predictions dictates how we deploy our application. Depending on the pipeline components, scale, etc. we have several different options for how we deploy.

Compute engines

Compute engines such as AWS EC2, Google Compute, Azure VM, on-prem, etc. can launch our application across multiple workers.

  • Pros: easy to deploy and manage these single instances.
  • Cons: when we do need to scale, it's not easy to manage these instances individually.

Container orchestration

Container orchestration via Kubernetes (K8s) provides managed deployment, scaling, etc. There are several ML-specific platforms to help us self-manage K8s via control planes such as Seldon, KFServing, etc. However, there are also fully-managed solutions, such as SageMaker, Cortex, BentoML, etc. Many of these tools also come with additional features such as experiment tracking, monitoring, etc.

  • Pros: very easy to scale our services since it's all managed with the proper components (load balancers, control planes, etc.)
  • Cons: can introduce too much complexity overhead.


Serverless options such as AWS Lambda, Google Cloud Functions, etc.

  • Pros: no need to manage any servers and it all scales automatically depending on the request traffic.
  • Cons: size limits on function storage, payload, etc. based on provider and usually no accelerators (GPU, TPU, etc.)

Be sure to explore the CI/CD workflows that accompany many of these deployment and serving options so you can have a continuous training, validation and serving process.


To cite this lesson, please use:

    author       = {Goku Mohandas},
    title        = { Infrastructure - Made With ML },
    howpublished = {\url{}},
    year         = {2021}