Designing Machine Learning Products
In this course, we'll not only develop the machine learning models but also talk about all the important ML system and software design components required to put our models into production in a reproducible, reliable and robust manner. We'll start by setting the scene for the precise product we'll be building. While this is a technical course, this initial product design process is crucial and is what separates great products from mediocre ones. This lesson will offer the structure for how to think about ML + product.
This template is designed to guide machine learning product development. While this template will initially be completed in sequential order, it will naturally involve nonlinear engagement based on iterative feedback. We should follow this template for every major release of our products so that all the decision making is transparent and documented.
While our documentation will be detailed, we can start the process by walking through a machine learning canvas:
👉 Download a PDF of the ML canvas to use for your own products → ml-canvas.pdf (right click the link and hit "Save Link As...")
From this high-level canvas, we can create detailed documentation for each release:
# Documentation
📂 project/
├── 📄 Overview
├── 📂 release-1
|    ├── 📄 product requirements [Product]
|    ├── 📄 design documentation [Engineering]
|    ├── 📄 project planning [Project]
├── ...
└── 📂 release-n
Throughout this lesson, we'll state and justify the assumptions we made to simplify the problem space.
[What & Why]: motivate the need for the product and outline the objectives and key results.
Each section below has a dropdown component called "Our task", which will discuss the specific topic with respect to the specific product that we're trying to build.
Set the scene for what we're trying to do through a customer-centric approach:
customer: profile of the customer we want to address
goal: main goal for the customer
pains: obstacles in the way of the customer achieving the goal
gains: what would make the job easier for the customer?
customer: machine learning developers and researchers.
goal: stay up-to-date on ML content for work, knowledge, etc.
pains: too much uncategorized content scattered around the internet.
gains: a central location with categorized content from trusted 3rd party sources.
| Assumption | Reality | Reason |
| --- | --- | --- |
| Assume we have customers on our platform already. | Doesn't matter what you build if no one is there to use it. | This is a course on ML, not cold start growth. |
| Customers want to stay up-to-date with ML content. | Thorough customer studies required to confirm this. | We need a task that we can model in this course. |
Propose the value we can create through a product-centric approach:
product: what needs to be built to help the customer reach their goal?
alleviates: how will the product reduce pains?
advantages: how will the product create gains?
product: service that discovers and categorizes ML content from popular sources.
alleviates: timely display of categorized content for customers to discover.
advantages: customers only have to visit our product to stay up-to-date.
Yes, we actually did build this before realizing it exacerbated noise and hype. And so, we pivoted into teaching the community how to responsibly deliver value with ML.
Break down the product into key objectives that we want to focus on.
- Allow customers to add and categorize their own projects.
- Discover ML content from trusted sources to bring into our platform.
- Classify incoming content (with high precision) for our customers to easily discover. [OUR FOCUS]
- Display categorized content on our platform (recent, popular, recommended, etc.)
| Assumption | Reality | Reason |
| --- | --- | --- |
| Assume we have a pipeline that delivers ML content from popular sources (Reddit, Twitter, etc.). | We would have to build this as a batch service, which is not trivial. | This is a course on ML, not batch web scraping. |
Describe the solution required to meet our objectives, including its core features, integration, alternatives, constraints and what's out-of-scope.
May require separate documentation (wireframes, user stories, mock-ups, etc.).
Develop a model that can classify the incoming content so that it can be organized by category on our platform.
- ML service that will predict the correct categories for incoming content. [OUR FOCUS]
- user feedback process for incorrectly classified content.
- workflows to categorize content that the service was incorrect about or not as confident in.
- duplicate screening for content that already exists on the platform.
- categorized content will be sent to the UI service to be displayed.
- classification feedback from users will be sent to labeling workflows.
- allow users to add content manually (bottleneck).
- maintain low latency (<100ms) when classifying incoming content. [Latency]
- only recommend tags from our list of approved tags. [Security]
- avoid duplicate content from being added to the platform. [UI/UX]
- identify relevant tags beyond our approved list of tags.
- using full-text HTML from content links to aid in classification.
- interpretability for why we recommend certain tags.
- identifying multiple categories (see dataset section for details).
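The constraints listed above (approved tags only, no duplicates, a latency budget) can be sketched as small guard functions around the classifier. This is a minimal illustration under stated assumptions, not the course's implementation: the tag names in `APPROVED_TAGS` are hypothetical placeholders (only `other` appears in this lesson), duplicate screening here is a naive normalized-title match, and `timed` is a hypothetical helper for checking a call against the 100ms budget.

```python
import time
from typing import Callable, Set, Tuple

# Hypothetical approved tags; the real list would come from the product team.
APPROVED_TAGS = {"computer-vision", "natural-language-processing", "mlops", "other"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different titles match."""
    return " ".join(text.lower().split())


def is_duplicate(title: str, seen_titles: Set[str]) -> bool:
    """Naive duplicate screen: exact match on the normalized title."""
    return normalize(title) in seen_titles


def restrict_to_approved(tag: str, fallback: str = "other") -> str:
    """Never surface a tag outside the approved list; fall back instead."""
    return tag if tag in APPROVED_TAGS else fallback


def timed(fn: Callable, *args) -> Tuple[object, float]:
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


if __name__ == "__main__":
    seen = {normalize("Intro to  Transformers")}
    print(is_duplicate("intro to transformers", seen))
    print(restrict_to_approved("quantum-computing"))
```

A real duplicate screen would likely compare URLs or use fuzzy matching; the exact-match version only shows where the check sits in the flow.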
How feasible is our solution and do we have the required resources to deliver it (data, $, team, etc.)?
We have a dataset of ML content that our users have manually added to the platform. We'll need to assess if it has the necessary signals to meet our objectives.
Sample data point
| Assumption | Reality | Reason |
| --- | --- | --- |
| This dataset is of high quality because it was added by actual users. | Need to assess the quality of the labels, especially since it was created by users! | The dataset is of good quality but we've left some errors in there so we can discover them during the evaluation process. |
[How]: how can we engineer our approach for building the product?
Describe the training and production (batches/streams) sources of data.
- access to a labeled and validated dataset for training.
- information on feature origins and schemas.
- was there sampling of any kind applied to create this dataset?
- are we introducing any data leaks?
- access to timely batches of ML content from scattered sources (Reddit, Twitter, etc.)
- how can we trust that this stream only has data that is consistent with what we have historically seen?
| Assumption | Reality | Reason |
| --- | --- | --- |
| ML stream only has ML relevant content. | Filter to remove spam content from these 3rd party streams. | Would require us to source relevant data and build another model. |
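One simple way to start answering the question of whether incoming batches are consistent with historical data is a structural check before anything reaches the model. The sketch below assumes each record is a dictionary with `title` and `description` text fields (the features this lesson uses); the function names and the accept/reject split are hypothetical, and a production setup would add richer checks (distributions, language, length ranges).

```python
from typing import Dict, List, Tuple

# Assumed field names, based on the text features used in this lesson.
REQUIRED_TEXT_FIELDS = ("title", "description")


def validate_record(record: Dict) -> List[str]:
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    for field in REQUIRED_TEXT_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{field}'")
    return problems


def split_batch(batch: List[Dict]) -> Tuple[List[Dict], List[Tuple[Dict, List[str]]]]:
    """Separate a batch into accepted records and rejected (record, problems) pairs."""
    accepted, rejected = [], []
    for record in batch:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


if __name__ == "__main__":
    batch = [
        {"title": "A survey of attention", "description": "Overview of methods."},
        {"title": "   ", "description": "No usable title."},
    ]
    accepted, rejected = split_batch(batch)
    print(len(accepted), len(rejected))
```

Keeping the rejected records alongside their reasons makes it easier to notice when an upstream source changes its format.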
Describe the labeling process and how we settled on the features and labels.
Labeling: manually labeled historical data.
Features: text features (title and description) to provide signal for the classification task.
Labels: reflect the content categories we currently display on our platform.
| Assumption | Reality | Reason |
| --- | --- | --- |
| Content can only belong to one category (multiclass). | Content can belong to more than one category (multilabel). | For simplicity, and because many libraries either don't support multilabel scenarios or complicate them. |
Before we can model our objective, we need to be able to evaluate how we’re performing.
One of the hardest challenges with evaluation is tying our core objectives (which may be qualitative) to quantitative metrics that our model can optimize.
We want to be able to classify incoming data with high precision so we can display it properly. For the projects that we categorize as other, we can recall any misclassified content using manual labeling workflows. We may also want to evaluate performance for specific classes or slices of data.
What are our priorities?
How do we decide which metrics to prioritize?
It entirely depends on the specific task. For example, in an email spam detector, precision is very important because it's better that we let some spam through than completely miss an important email. Over time, we need to iterate on our solution so that all evaluation metrics improve, but it's important to know from the get-go which ones we can't compromise on.
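To make the precision versus recall trade-off concrete, the sketch below computes both metrics for a single class from lists of true and predicted labels. The label names in the usage example are hypothetical, and in practice a library routine would typically be used; this version only spells out the definitions.

```python
from typing import List, Tuple


def precision_recall(y_true: List[str], y_pred: List[str], label: str) -> Tuple[float, float]:
    """Compute (precision, recall) for one class.

    precision = TP / (TP + FP): of everything predicted as `label`, how much was right.
    recall    = TP / (TP + FN): of everything that truly is `label`, how much we found.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical labels for illustration only.
    y_true = ["nlp", "nlp", "cv", "other"]
    y_pred = ["nlp", "other", "cv", "other"]
    print(precision_recall(y_true, y_pred, "nlp"))    # (1.0, 0.5)
    print(precision_recall(y_true, y_pred, "other"))  # (0.5, 1.0)
```

In the example, the `nlp` class has perfect precision but misses one item, which lands in `other`; that is the kind of error the manual labeling workflows are meant to recover.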
- manually label a subset of incoming data to evaluate periodically.
- ask the initial set of users viewing newly categorized content if it's correctly classified.
- allow users to report content misclassified by our model.
It's important that we measure real-time performance before committing to replace our existing version of the system.
- Internal canary rollout, monitoring for proxy/actual performance, etc.
- Rollout to the larger internal team for more feedback.
- A/B rollout to a subset of the population to better understand UX, utility, etc.
Not all releases have to be high stakes and external facing. We can always include internal releases, gather feedback and iterate until we’re ready to increase the scope.
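One common way to expose a release to only a subset of users is deterministic bucketing: hash a stable user identifier and compare it to the rollout percentage. The sketch below is an assumption about how such a gate could look, not how this course implements rollouts; `in_rollout`, the `salt` value and the user ID format are all hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "release-1") -> bool:
    """Return True if this user falls inside the rollout.

    percent is in [0, 100]. The same user always gets the same answer for a
    given salt, and raising percent only ever adds users (never removes them).
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # stable bucket in 0..9999
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    exposed = sum(in_rollout(u, percent=10) for u in users)
    print(f"{exposed} of {len(users)} users see the new classifier")
```

Changing the salt per release reshuffles the buckets, so the same small group of users is not always the first to receive every change.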
While the specific methodology we employ can differ based on the problem, there are core principles we always want to follow:
- End-to-end utility: the end result from every iteration should deliver minimum end-to-end utility so that we can benchmark iterations against each other and plug-and-play with the system.
- Manual before ML: incorporate deterministic components where we define the rules before using probabilistic ones that infer rules from data → baselines.
- Augment vs. automate: allow the system to supplement the decision making process as opposed to making the final decision.
- Internal vs. external: not all early releases have to be end-user facing. We can use early versions for internal validation, feedback, data collection, etc.
- Thorough: every approach needs to be well tested (code, data + models) and evaluated, so we can objectively benchmark different approaches.
- v1: creating a gold-standard labeled dataset that is representative of the problem space.
- v2: rule-based text matching approaches to categorize content.
- v3: probabilistically predicting labels from content title and description.
- v4: ...
| Assumption | Reality | Reason |
| --- | --- | --- |
| Solution needs to involve ML due to unstructured data and open domain space. | An iterative approach where we start with simple rule-based solutions and slowly add complexity. | This course is about responsibly delivering value with ML, so we'll jump to it right away. |
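A rule-based text matching baseline of the kind described for v2 could look like the sketch below. The categories and keywords in `RULES` are hypothetical examples, not the course's actual rules; the point is that a deterministic baseline gives every later model something to be benchmarked against.

```python
import re
from typing import Dict, Tuple

# Hypothetical keyword rules; real rules would be written with domain experts.
RULES: Dict[str, Tuple[str, ...]] = {
    "computer-vision": ("image", "vision", "cnn", "segmentation"),
    "natural-language-processing": ("nlp", "text", "language", "transformer"),
}


def rule_based_category(title: str, description: str, fallback: str = "other") -> str:
    """Pick the category whose keywords match the most words; else the fallback."""
    tokens = set(re.findall(r"[a-z0-9]+", f"{title} {description}".lower()))
    scores = {
        category: sum(keyword in tokens for keyword in keywords)
        for category, keywords in RULES.items()
    }
    best = max(scores, key=scores.get)  # ties resolve to the first category listed
    return best if scores[best] > 0 else fallback


if __name__ == "__main__":
    print(rule_based_category("Image segmentation with a CNN", ""))
    print(rule_based_category("Weekend pasta recipes", "Nothing about ML here."))
```

Matching on whole tokens rather than substrings avoids false hits such as "text" inside "context", at the cost of missing plurals and other word variants.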
Decouple POCs and implementations
Each of these approaches would involve a proof-of-concept (POC) release and an implementation release after validating its utility over previous approaches. We should decouple POCs and implementations because if a POC doesn't prove successful, then we can't do the implementation and all the associated planning is no longer applicable.
Utility in starting simple
Some of the earlier, simpler, approaches may not deliver on a certain performance objective. What are some advantages of still starting simple?
- get internal feedback on end-to-end utility.
- perform A/B testing to understand UI/UX design.
- deploy locally to start generating more data required for more complex approaches.
How do we receive feedback on our system and incorporate it into the next iteration? This can involve both human-in-the-loop feedback as well as automatic feedback via monitoring, etc.
- enforce human-in-the-loop checks when there is low confidence in classifications.
- allow users to report issues related to misclassification.
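The low-confidence check above can be sketched as a routing step applied to the model's predicted probabilities. This is a minimal illustration: the `threshold` value of 0.75 is an arbitrary placeholder that would need tuning against the precision objective, and `route_prediction` and the `needs_review` flag are hypothetical names.

```python
from typing import Dict


def route_prediction(
    probabilities: Dict[str, float],
    threshold: float = 0.75,  # placeholder; tune on validation data
    fallback: str = "other",
) -> Dict[str, object]:
    """Accept confident predictions; send uncertain ones to human review."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return {"label": label, "confidence": confidence, "needs_review": False}
    # Below the threshold: show the fallback label and flag for manual labeling.
    return {"label": fallback, "confidence": confidence, "needs_review": True}


if __name__ == "__main__":
    print(route_prediction({"computer-vision": 0.91, "other": 0.09}))
    print(route_prediction({"computer-vision": 0.48, "other": 0.52}))
```

Raising the threshold trades coverage for precision: fewer items are auto-classified, but those that are should be wrong less often, and the rest feed the labeling workflows.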
Always return to the value proposition
While it's important to iterate and optimize the internals of our workflows, it's even more important to ensure that our ML systems are actually making an impact. We need to constantly engage with stakeholders (management, users) to iterate on why our ML system exists.
[Who & When]: organizing all the product requirements into manageable timelines so we can deliver on the vision.
Which teams and specific members from those teams need to be involved in this project? It’s important to consider even the minor features so that everyone is aware of it and so we can properly scope and prioritize our timelines. Keep in mind that this isn’t the only project that people might be working on.
- Product: the members responsible for outlining and approving the product requirements, which may include product managers, executives, external stakeholders, etc.
- System design:
  - Data engineering: responsible for the data dependencies, which include robust workflows to continually deliver the data and ensure that it’s properly validated and ready for downstream applications.
  - Machine learning: develop the probabilistic systems with appropriate evaluation.
  - DevOps: deploy the application and help autoscale based on traffic.
  - UI/UX: consume the system’s outputs to deliver the new experience to the user.
  - Accessibility: help educate the community about new rollouts and assist with decisions around sensitive issues.
  - Site reliability: maintain the application and potentially oversee that online evaluation/monitoring workflows are working as they should.
- Project: the members responsible for iterative engagement with the product and engineering teams to ensure that the right product is being built and that it’s being built appropriately, which may include project managers, engineering managers, etc.
We need to break down all the objectives for a particular release into clear deliverables that specify the deliverable, contributors, dependencies, acceptance criteria and status. This will become the granular checklist that our teams will use to decide what to prioritize.
| Objective | Priority | Release |
| --- | --- | --- |
| Classify incoming content (with high precision) for our customers to easily discover. | High | v1 |

| Deliverable | Contributors | Dependencies | Acceptance criteria | Status |
| --- | --- | --- | --- | --- |
| Labeled dataset for training | Project DRI, labeling team, data engineer | Access to location of content with relevant metadata | Validation of ground-truth labels | Complete |
| Trained model with high precision | Data scientist | Labeled dataset | Versioned, reproducible, test coverage report and evaluation results | In-progress |
| Scalable service for inference | ML engineer, DevOps engineer | Versioned, reproducible, tested and evaluated model | Stress tests to ensure autoscaling capabilities | Pending |
This is where the project scoping begins to take place. Often, the stakeholders will have a desired time for release and the functionality to be delivered. There will be a lot of back and forth on this based on the results from the feasibility studies, so it's very important to be thorough and transparent to set expectations.
v1: classify incoming content (with high precision) for our customers to easily discover.
- Exploration studies conducted by XX
- Pushed to dev for A/B testing by XX
- Pushed to staging with on-boarding hooks by XX
- Pushed to prod by XX
This is an extremely simplified timeline. An actual timeline would depict timelines from all the different teams stacked on top of each other with vertical lines at specified time-constraints or version releases.