
Designing Machine Learning Products

A template to guide the development cycle for machine learning systems that factors in product requirements, design docs and project considerations.
Goku Mohandas



In this course, we'll not only develop machine learning models but also cover the ML system and software design components required to put our models into production in a reproducible, reliable and robust manner. We'll start by setting the scene for the precise product we'll be building. While this is a technical course, this initial product design process is crucial and is what separates great products from mediocre ones. This lesson offers a structure for how to think about ML + product.


This template is designed to guide machine learning product development. While this template will initially be completed in sequential order, it will naturally involve nonlinear engagement based on iterative feedback. We should follow this template for every major release of our products so that all the decision making is transparent and documented.

Product (What & Why) → Engineering (How) → Project (Who & When)

While our documentation will be detailed, we can start the process by walking through a machine learning canvas:

machine learning canvas

👉  Download a PDF of the ML canvas to use for your own products → ml-canvas.pdf (right click the link and hit "Save Link As...")

From this high-level canvas, we can create detailed documentation for each release:

# Documentation
📂 project/
├── 📄 Overview
├── 📂 release-1
| ├── 📄 product requirements [Product]
| ├── 📄 design documentation [Engineering]
| ├── 📄 project planning     [Project]
├── ...
└── 📂 release-n

Throughout this lesson, we'll state and justify the assumptions we made to simplify the problem space.

Product Management

[What & Why]: motivate the need for the product and outline the objectives and key results.


Each section below has a dropdown component called "Our task", which will discuss the specific topic with respect to the specific product that we're trying to build.



Set the scene for what we're trying to do through a customer-centric approach:

  • customer: profile of the customer we want to address
  • goal: main goal for the customer
  • pains: obstacles in the way of the customer achieving the goal
  • gains: what would make the job easier for the customer?
Our task
  • customer: machine learning developers and researchers.
  • goal: stay up-to-date on ML content for work, knowledge, etc.
  • pains: too much uncategorized content scattered around the internet.
  • gains: a central location with categorized content from trusted 3rd party sources.

Assumption | Reality | Reason
Assume we have customers on our platform already. | Doesn't matter what you build if no one is there to use it. | This is a course on ML, not cold start growth.
Customers want to stay up-to-date with ML content. | Thorough customer studies required to confirm this. | We need a task that we can model in this course.

Value proposition

Propose the value we can create through a product-centric approach:

  • product: what needs to be built to help the customer reach their goal?
  • alleviates: how will the product reduce pains?
  • advantages: how will the product create gains?
Our task
  • product: service that discovers and categorizes ML content from popular sources.
  • alleviates: display categorized content in a timely manner for customers to discover.
  • advantages: customers only have to visit our product to stay up-to-date.

product mockup

Yes, we actually did build this before realizing it exacerbated noise and hype. And so, we pivoted into teaching the community how to responsibly develop, deploy & maintain ML.


Break down the product into key objectives that we want to focus on.

Our task
  • Allow customers to add and categorize their own projects.
  • Discover ML content from trusted sources to bring into our platform.
  • Classify incoming content (with high precision) for our customers to easily discover. [OUR FOCUS]
  • Display categorized content on our platform (recent, popular, recommended, etc.)

Assumption | Reality | Reason
Assume we have a pipeline that delivers ML content from popular sources (Reddit, Twitter, etc.). | We would have to build this as a batch service, which is not trivial. | This is a course on ML, not batch web scraping.


Describe the solution required to meet our objectives, including its core features, integrations, alternatives, constraints and what's out-of-scope.

May require separate documentation (wireframes, user stories, mock-ups, etc.).

Our task

Develop a model that can classify the incoming content so that it can be organized by category on our platform.

Core features:

  • ML service that will predict the correct categories for incoming content. [OUR FOCUS]
  • user feedback process for incorrectly classified content.
  • workflows to categorize content that the service was incorrect about or not as confident in.
  • duplicate screening for content that already exists on the platform.


Integrations:

  • categorized content will be sent to the UI service to be displayed.
  • classification feedback from users will be sent to labeling workflows.

Alternatives:

  • allow users to add content manually (bottleneck).

Constraints:

  • maintain low latency (<100ms) when classifying incoming content. [Latency]
  • only recommend tags from our list of approved tags. [Security]
  • prevent duplicate content from being added to the platform. [UI/UX]

Out-of-scope:

  • identify relevant tags beyond our approved list of tags.
  • using full-text HTML from content links to aid in classification.
  • interpretability for why we recommend certain tags.
  • identifying multiple categories (see dataset section for details).
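
The constraint of only recommending approved tags can be enforced with a small guard around the model's output. The following is a minimal sketch rather than the course's implementation: the function name is hypothetical, and apart from "natural-language-processing" (from the sample data point) and "other" (mentioned in the evaluation discussion), the tag list is a placeholder.

```python
# Hypothetical approved tag list. Only "natural-language-processing" and
# "other" are mentioned in this lesson; "computer-vision" is a placeholder.
APPROVED_TAGS = {"natural-language-processing", "computer-vision", "other"}


def restrict_to_approved(predicted_tag: str, fallback: str = "other") -> str:
    """Return the predicted tag if it is approved, otherwise the fallback tag."""
    return predicted_tag if predicted_tag in APPROVED_TAGS else fallback


print(restrict_to_approved("natural-language-processing"))
print(restrict_to_approved("quantum-computing"))  # not approved, falls back
```

Keeping this check outside the model means the approved list can change without retraining.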


How feasible is our solution and do we have the required resources to deliver it (data, $, team, etc.)?

Our task

We have a dataset of ML content that our users have manually added to the platform. We'll need to assess if it has the necessary signals to meet our objectives.

Sample data point
{
    "id": 443,
    "created_on": "2020-04-10 17:51:39",
    "title": "AllenNLP Interpret",
    "description": "A Framework for Explaining Predictions of NLP Models",
    "tag": "natural-language-processing"
}

Assumption | Reality | Reason
This dataset is of high quality because the entries were added by actual users. | Need to assess the quality of the labels, especially since it was created by users! | The dataset is of good quality but we've left some errors in there so we can discover them during the evaluation process.
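
A first pass at assessing whether the dataset has the necessary signal might check for missing fields and look at how the tags are distributed. This is a rough sketch: the helper names are hypothetical, and the second record is made up purely for illustration (only the first comes from the sample data point above).

```python
from collections import Counter

REQUIRED_FIELDS = ("id", "created_on", "title", "description", "tag")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def tag_distribution(records):
    """Return the fraction of records carrying each tag."""
    counts = Counter(r["tag"] for r in records)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


records = [
    {"id": 443, "created_on": "2020-04-10 17:51:39",
     "title": "AllenNLP Interpret",
     "description": "A Framework for Explaining Predictions of NLP Models",
     "tag": "natural-language-processing"},
    # Hypothetical record, invented to show an empty description.
    {"id": 1, "created_on": "2020-01-01 00:00:00",
     "title": "Example vision project", "description": "",
     "tag": "computer-vision"},
]

for record in records:
    print(record["id"], missing_fields(record))
print(tag_distribution(records))
```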


Engineering

[How]: how can we engineer our approach for building the product?


Describe the training and production (batches/streams) sources of data.

Our task
  • training:
    • access to input data and labels for training.
    • information on feature origins and schemas.
    • was there sampling of any kind applied to create this dataset?
    • are we introducing any data leaks?
  • production:
    • access to timely batches of ML content from scattered sources (Reddit, Twitter, etc.)
    • how can we trust that this stream only has data that is consistent with what we have historically seen?

Assumption | Reality | Reason
ML stream only has ML relevant content. | Filter to remove spam content from these 3rd party streams. | Would require us to source relevant data and build another model.


Describe the labeling process and how we settled on the features and labels.

Our task

Labeling: labeled using categories of machine learning (a subset of which our platform is interested in).

Features: text features (title and description) to provide signal for the classification task.

Labels: reflect the content categories we currently display on our platform:


Assumption | Reality | Reason
Content can only belong to one category (multiclass). | Content can belong to more than one category (multilabel). | For simplicity, and because many libraries either don't support multilabel scenarios or make them more complicated.
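
The difference between the two framings shows up in how a label is encoded. The sketch below assumes a hypothetical class list (only "natural-language-processing" and "other" appear in this lesson) and hypothetical helper names.

```python
# Hypothetical class list; "computer-vision" is a placeholder.
CLASSES = ["natural-language-processing", "computer-vision", "other"]


def encode_multiclass(tag):
    """Multiclass: exactly one class per item, encoded as a single index."""
    return CLASSES.index(tag)


def encode_multilabel(tags):
    """Multilabel: any number of classes per item, encoded as a 0/1 vector."""
    return [1 if c in tags else 0 for c in CLASSES]


print(encode_multiclass("computer-vision"))
print(encode_multilabel({"natural-language-processing", "computer-vision"}))
```

With the multiclass simplification, the target is a single index per item, which most classification tooling handles directly.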


Before we can model our objective, we need to be able to evaluate how we're performing.


One of the hardest challenges with evaluation is tying our core objectives, many of which may be qualitative, to quantitative metrics that our model can optimize for.

Our task

We want to be able to classify incoming data with high precision so we can display them properly. For the projects that we categorize as other, we can recall any misclassified content using manual labeling workflows. We may also want to evaluate performance for specific classes or slices of data.

What are our priorities

How do we decide which metrics to prioritize?


It entirely depends on the specific task. For example, in an email spam detector, precision is very important because it's better to let some spam through than to completely miss an important email. Over time, we need to iterate on our solution so that all evaluation metrics improve, but it's important to know from the get-go which ones we can't compromise on.
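
To make the trade-off concrete, precision and recall for one class can be computed directly from true and predicted labels. This is a small illustrative sketch with made-up labels ("nlp" and "cv" are shorthand placeholders), not results from the course dataset.

```python
def precision_recall(y_true, y_pred, positive):
    """Compute precision and recall for a single class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for illustration only.
y_true = ["nlp", "nlp", "cv", "other", "nlp"]
y_pred = ["nlp", "other", "cv", "other", "cv"]

precision, recall = precision_recall(y_true, y_pred, positive="nlp")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here every item predicted as "nlp" really is "nlp" (high precision), while two true "nlp" items were missed (low recall). That is the kind of behavior a precision-first objective tolerates.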

Offline evaluation

Offline evaluation requires a gold standard labeled dataset that we can use to benchmark all of our modeling.

Our task

We'll be using the historical dataset for offline evaluation. We'll also be creating slices of data that we want to evaluate in isolation.
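
Evaluating a slice amounts to filtering the evaluation set with a predicate and computing the metric on what remains. The sketch below assumes a hypothetical "short text" slice and made-up predictions; only the first record comes from the sample data point earlier in the lesson.

```python
def short_text(record, max_words=8):
    """Hypothetical slice: items whose title + description are very short."""
    text = f'{record["title"]} {record["description"]}'
    return len(text.split()) < max_words


def slice_accuracy(records, predictions, slice_fn):
    """Accuracy over only the records selected by slice_fn (None if empty)."""
    pairs = [(r["tag"], p) for r, p in zip(records, predictions) if slice_fn(r)]
    if not pairs:
        return None
    return sum(t == p for t, p in pairs) / len(pairs)


records = [
    {"title": "AllenNLP Interpret",
     "description": "A Framework for Explaining Predictions of NLP Models",
     "tag": "natural-language-processing"},
    # Hypothetical records, invented for illustration.
    {"title": "CV demo", "description": "Image classifier",
     "tag": "computer-vision"},
    {"title": "Tiny NLP", "description": "Tokenizer notes",
     "tag": "natural-language-processing"},
]
# Made-up model outputs.
predictions = ["natural-language-processing", "computer-vision", "other"]

print(slice_accuracy(records, predictions, lambda r: True))  # overall
print(slice_accuracy(records, predictions, short_text))      # short-text slice
```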

Online evaluation

Online evaluation ensures that our model continues to perform well in production and can be performed using labels or, in the event we don't readily have labels, proxy signals.

Our task
  • manually label a subset of incoming data to evaluate periodically.
  • ask the initial set of users viewing newly categorized content whether it's correctly classified.
  • allow users to report content misclassified by our model.

It's important that we measure real-time performance before committing to replace our existing version of the system.

  • Internal canary rollout, monitoring for proxy/actual performance, etc.
  • Rollout to the larger internal team for more feedback.
  • A/B rollout to a subset of the population to better understand UX, utility, etc.
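
One common way to expose a new version to a stable subset of users is to hash each user ID into a bucket. The following sketch assumes string user IDs and a hypothetical salt; it is one possible approach, not the rollout mechanism used in the course.

```python
import hashlib


def in_rollout(user_id: str, percent: int, salt: str = "v1-classifier") -> bool:
    """Deterministically assign a user to the rollout group.

    The same user always lands in the same bucket for a given salt, so the
    experience stays consistent across sessions.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in [0, 99]
    return bucket < percent


print(in_rollout("user-123", percent=10))
```

Changing the salt reshuffles the buckets, which is useful when a later experiment should not reuse the same subset of users.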

Not all releases have to be high stakes and external facing. We can always include internal releases, gather feedback and iterate until we're ready to increase the scope.


While the specific methodology we employ can differ based on the problem, there are core principles we always want to follow:

  • End-to-end utility: the end result from every iteration should deliver minimum end-to-end utility so that we can benchmark iterations against each other and plug-and-play with the system.
  • Manual before ML: incorporate deterministic components where we define the rules before using probabilistic ones that infer rules from data → baselines.
  • Augment vs. automate: allow the system to supplement the decision making process as opposed to making the final decision.
  • Internal vs. external: not all early releases have to be end-user facing. We can use early versions for internal validation, feedback, data collection, etc.
  • Thorough: every approach needs to be well tested (code, data + models) and evaluated, so we can objectively benchmark different approaches.
Our task
  • v1: creating a gold-standard labeled dataset that is representative of the problem space.
  • v2: rule-based text matching approaches to categorize content.
  • v3: probabilistically predicting labels from content title and description.
  • v4: ...
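
A rule-based baseline of the kind described for v2 could be as simple as keyword matching on the title and description. The keywords and tag names below are hypothetical placeholders, and the matching is deliberately naive (plain substring checks), so treat this as a sketch of the idea.

```python
# Hypothetical keyword rules; a real baseline would be curated with care.
RULES = {
    "natural-language-processing": ("nlp", "language", "text"),
    "computer-vision": ("image", "vision"),
}


def rule_based_tag(title, description, default="other"):
    """Return the first tag whose keywords appear in the text, else default."""
    text = f"{title} {description}".lower()
    for tag, keywords in RULES.items():
        if any(keyword in text for keyword in keywords):
            return tag
    return default


print(rule_based_tag("AllenNLP Interpret",
                     "A Framework for Explaining Predictions of NLP Models"))
print(rule_based_tag("Budget spreadsheet", "Quarterly numbers"))
```

Even a crude baseline like this gives later probabilistic versions something end-to-end to be benchmarked against.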

Assumption | Reality | Reason
Solution needs to involve ML due to unstructured data and open domain space. | An iterative approach where we start with simple rule-based solutions and slowly add complexity. | This course is about responsibly delivering value with ML, so we'll jump to it right away.

Decouple POCs and implementations

Each of these approaches would involve a proof-of-concept (POC) release and an implementation release after validating its utility over previous approaches. We should decouple POCs and implementations because if a POC doesn't prove successful, then we can't do the implementation and all the associated planning is no longer applicable.

Utility in starting simple

Some of the earlier, simpler, approaches may not deliver on a certain performance objective. What are some advantages of still starting simple?

  • get internal feedback on end-to-end utility.
  • perform A/B testing to understand UI/UX design.
  • deploy locally to start generating the data required for more complex approaches.


How do we receive feedback on our system and incorporate it into the next iteration? This can involve both human-in-the-loop feedback as well as automatic feedback via monitoring, etc.

Our task
  • enforce human-in-the-loop checks when there is low confidence in classifications.
  • allow users to report issues related to misclassification.
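
A low-confidence check can be sketched as a threshold on the model's top predicted probability. The threshold value, the function name and the choice to fall back to "other" while flagging the item for review are all assumptions made for illustration.

```python
def route_prediction(probabilities, threshold=0.8):
    """Accept confident predictions; flag the rest for human review.

    probabilities: dict mapping tag -> predicted probability.
    threshold: hypothetical cutoff, to be tuned on validation data.
    """
    tag, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return {"tag": tag, "needs_review": False}
    return {"tag": "other", "needs_review": True}


print(route_prediction({"natural-language-processing": 0.95, "other": 0.05}))
print(route_prediction({"natural-language-processing": 0.55,
                        "computer-vision": 0.45}))
```

Items flagged with `needs_review` would go to the manual labeling workflows, and the corrected labels can feed the next iteration.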

Always return to the value proposition

While it's important to iterate and optimize the internals of our workflows, it's even more important to ensure that our ML systems are actually making an impact. We need to constantly engage with stakeholders (management, users) to iterate on why our ML system exists.

product development cycle

Project Management

[Who & When]: organizing all the product requirements into manageable timelines so we can deliver on the vision.


Which teams and specific members from those teams need to be involved in this project? It's important to consider even the minor features so that everyone is aware of them and so we can properly scope and prioritize our timelines. Keep in mind that this isn't the only project that people might be working on.

Our task
  • Product: the members responsible for outlining the product requirements and approving them may involve product managers, executives, external stakeholders, etc.
  • System design:
    • Data engineering: responsible for the data dependencies, which include robust workflows to continually deliver the data and ensuring that it's properly validated and ready for downstream applications.
    • Machine learning: develop the probabilistic systems with appropriate evaluation.
    • DevOps: deploy the application and help autoscale based on traffic.
    • UI/UX: consume the system's outputs to deliver the new experience to the user.
    • Accessibility: help educate the community about the new rollouts and assist with decisions around sensitive issues.
    • Site reliability: maintain the application and potentially ensure that online evaluation/monitoring workflows are working as they should.
  • Project: the members responsible for iterative engagement with the product and engineering teams to ensure that the right product is being built and that it's being built appropriately may include project managers, engineering managers, etc.


We need to break down all the objectives for a particular release into clear deliverables that specify the deliverable, contributors, dependencies, acceptance criteria and status. This will become the granular checklist that our teams will use to decide what to prioritize.

Our task

Objective | Priority | Release | Status
Classify incoming content (with high precision) for our customers to easily discover. | High | v1 |

Deliverable | Contributors | Dependencies | Acceptance criteria | Status
Labeled dataset for training | Project DRI, labeling team, data engineer | Access to location of content with relevant metadata | Validation of ground-truth labels | Complete
Trained model with high precision | Data scientist | Labeled dataset | Versioned, reproducible, test coverage report and evaluation results | In-progress
Scalable service for inference | ML engineer, DevOps engineer | Versioned, reproducible, tested and evaluated model | Stress tests to ensure autoscaling capabilities | Pending
... | ... | ... | ... | ...


This is where the project scoping begins to take place. Often, the stakeholders will have a desired time for release and the functionality to be delivered. There will be a lot of back and forth on this based on the results from the feasibility studies, so it's very important to be thorough and transparent to set expectations.

Our task

v1: classify incoming content (with high precision) for our customers to easily discover.

  • Exploration studies conducted by XX
  • Pushed to dev for A/B testing by XX
  • Pushed to staging with on-boarding hooks by XX
  • Pushed to prod by XX

This is an extremely simplified timeline. An actual timeline would depict timelines from all the different teams stacked on top of each other with vertical lines at specified time-constraints or version releases.

To cite this content, please use:

    author       = {Goku Mohandas},
    title        = { Product - Made With ML },
    howpublished = {\url{}},
    year         = {2022}