
Testing ML Systems: Code, Data and Models


Testing code, data and models to ensure consistent behavior in ML systems.
Goku Mohandas · Repository


Intuition

Tests are a way for us to ensure that something works as intended. We're incentivized to implement tests and discover sources of error as early in the development cycle as possible so that we can reduce downstream costs and wasted time. Once we've designed our tests, we can automatically execute them every time we change our system, and continue to build on them.

Types of tests

There are several types of tests that are used at different points in the development cycle:

  1. Unit tests: tests on individual components that each have a single responsibility (ex. function that filters a list).
  2. Integration tests: tests on the combined functionality of individual components (ex. data processing).
  3. System tests: tests on the design of a system for expected outputs given inputs (ex. training, inference, etc.).
  4. Acceptance tests: tests to verify that requirements have been met, usually referred to as User Acceptance Testing (UAT).
  5. Regression tests: testing errors we've seen before to ensure new changes don't reintroduce them.

Note

There are many other types of functional and non-functional tests as well, such as smoke tests (quick health checks), performance tests (load, stress), security tests, etc. but we can generalize these under the system tests above.

How should we test?

The framework to use when composing tests is the Arrange-Act-Assert methodology (a minimal sketch follows the list below).

  • Arrange: set up the different inputs to test on.
  • Act: apply the inputs on the component we want to test.
  • Assert: confirm that we received the expected output.
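
As a concrete illustration, here's a minimal test that labels each of the three steps (clean_text is a made-up helper for this sketch, not a function from our application):

# A minimal Arrange-Act-Assert sketch (clean_text is hypothetical)
def clean_text(text: str) -> str:
    return text.lower().strip()


def test_clean_text():
    # Arrange: set up the input to test on
    text = "  Transfer Learning  "
    # Act: apply the input to the component we want to test
    result = clean_text(text)
    # Assert: confirm that we received the expected output
    assert result == "transfer learning"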

Tip

Cleaning is an unofficial fourth step to this methodology because it's important to not leave remnants of a previous state which may affect subsequent tests. We can use packages such as pytest-randomly to test against state dependency by executing tests in a random order.

In Python, there are many tools, such as unittest, pytest, etc., that allow us to easily implement our tests while adhering to the Arrange Act Assert framework above. These tools come with powerful built-in functionality such as parametrization, filters, and more, to test many conditions at scale.

Note

When arranging our inputs and asserting our expected outputs, it's important to test across the entire gamut of inputs and outputs:

  • inputs: data types, format, length, edge cases (min/max, small/large, etc.)
  • outputs: data types, formats, exceptions, intermediary and final outputs

Best practices

Regardless of the framework we use, it's important to strongly tie testing into the development process.

  • atomic: when creating unit components, we need to ensure that they have a single responsibility so that we can easily test them. If not, we'll need to split them into more granular units.
  • compose: when we create new components, we want to compose tests to validate their functionality. It's a great way to ensure reliability and catch errors early on.
  • regression: we want to account for new errors we come across with a regression test so we can ensure we don't reintroduce the same errors in the future.
  • coverage: we want to ensure that 100% of our codebase has been accounted for. This doesn't mean writing a test for every single line of code but rather accounting for every single line (more on this in the coverage section below).
  • automate: in the event we forget to run our tests before committing to a repository, we want to automatically run tests for every commit. We'll learn how to do this locally using pre-commit hooks and remotely (i.e., on the main branch) via GitHub Actions in subsequent lessons.

Test-driven development

Test-driven development (TDD) is the process of writing a test before completely writing the functionality, to ensure that tests are always written. This is in contrast to writing functionality first and composing tests afterwards. Here are my thoughts on this approach:

  • it's good to write tests as we progress, but it's not a representation of correctness.
  • initial time should be spent on design before ever getting into the code or tests.
  • using a test as a guide doesn't mean that our functionality is error free.

Perfect coverage doesn't mean that our application is error free if those tests aren't meaningful and don't encompass the space of possible inputs, intermediates and outputs. Therefore, we should work towards better design and agility when facing errors, quickly resolving them and writing test cases around them to avoid reintroducing them next time.

Warning

This topic is still highly debated and I'm only reflecting on my experience and what's worked well for me at a large company (Apple), very early stage startup and running a company of my own. What's most important is that the team is producing reliable systems that can be tested and improved upon.

Application

In our application, we'll be testing the code, data and models. Be sure to look inside each of the different testing scripts after reading through the components below.

great_expectations/           # data tests
|   ├── expectations/
|   |   ├── projects.json
|   |   └── tags.json
|   ├── ...
tagifai/
|   ├── eval.py               # model tests
tests/                        # code tests
├── app/
|   ├── test_api.py
|   └── test_cli.py
└── tagifai/
|   ├── test_config.py
|   ├── test_data.py
|   ├── test_eval.py
|   ├── test_models.py
|   ├── test_train.py
|   └── test_utils.py

Note

Alternatively, we could've organized our tests by type (unit, integration, etc.), but I find it more intuitive to organize tests by how our application is set up. We'll learn about markers below, which will allow us to run any subset of tests by specifying filters.

🧪  Pytest

We're going to be using pytest as our testing framework for its powerful built-in features such as parametrization, fixtures, markers, etc.

Configuration

Pytest expects tests to be organized under a tests directory by default. However, we can also use our pyproject.toml file to configure any other test path directories as well. Once in the directory, pytest looks for Python scripts starting with test_*.py, but we can configure it to read any other file patterns as well.

# Pytest
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"

Assertions

Let's see what a sample test and its results look like. Assume we have a simple function that determines whether a fruit is crisp or not (notice: single responsibility):

# food/fruits.py
def is_crisp(fruit):
    if fruit:
        fruit = fruit.lower()
    if fruit in ["apple", "watermelon", "cherries"]:
        return True
    elif fruit in ["orange", "mango", "strawberry"]:
        return False
    else:
        raise ValueError(f"{fruit} not in known list of fruits.")
    return False

To test this function, we can use assert statements to map inputs with expected outputs:

# tests/food/test_fruits.py
import pytest

from food.fruits import is_crisp


def test_is_crisp():
    assert is_crisp(fruit="apple")  #  or == True
    assert is_crisp(fruit="Apple")
    assert not is_crisp(fruit="orange")
    with pytest.raises(ValueError):
        is_crisp(fruit=None)
        is_crisp(fruit="pear")

Note

We can also make assertions about exceptions, as we do in the with pytest.raises(ValueError) block above, where the operations under the with statement are expected to raise the specified exception. Note that execution inside the block stops at the first raised exception, so if we want to check every call, each one should go in its own with block.

Execution

We can execute our tests above using several different levels of granularity:

pytest                                           # all tests
pytest tests/food                                # tests under a directory
pytest tests/food/test_fruits.py                 # tests for a single file
pytest tests/food/test_fruits.py::test_is_crisp  # tests for a single function

Running our specific test above would produce the following output:

tests/food/test_fruits.py::test_is_crisp PASSED      [100%]

Had any of our assertions in this test failed, we would see the failed assertions as well as the expected output and the output we received from our function.

Note

It's important to test for the variety of inputs and expected outputs that we outlined above and to never assume that a test is trivial. In our example above, it's important that we test for both "apple" and "Apple" in the event that our function didn't account for casing!

Classes

We can also test classes and their respective functions by creating test classes. Within our test class, we can optionally define functions which will automatically be executed when we set up or tear down a class instance or use a class method.

  • setup_class: set up the state for any class instance.
  • teardown_class: teardown the state created in setup_class.
  • setup_method: called before every method to setup any state.
  • teardown_method: called after every method to teardown any state.
class Fruit(object):
    def __init__(self, name):
        self.name = name


class TestFruit(object):
    @classmethod
    def setup_class(cls):
        """Set up the state for any class instance."""
        pass

    @classmethod
    def teardown_class(cls):
        """Teardown the state created in setup_class."""
        pass

    def setup_method(self):
        """Called before every method to setup any state."""
        self.fruit = Fruit(name="apple")

    def teardown_method(self):
        """Called after every method to teardown any state."""
        del self.fruit

    def test_init(self):
        assert self.fruit.name == "apple"

We can execute all the tests for our class by specifying the class name, ex. pytest tests/food/test_fruits.py::TestFruit, which produces:

tests/food/test_fruits.py::TestFruit .  [100%]

We use test classes to test all of our class modules such as LabelEncoder, Tokenizer, CNN, etc.

Parametrize

So far, in our tests, we've had to create individual assert statements to validate different combinations of inputs and expected outputs. However, there's a bit of redundancy here because the inputs always feed into our functions as arguments and the outputs are compared with our expected outputs. To remove this redundancy, pytest has the @pytest.mark.parametrize decorator which allows us to represent our inputs and outputs as parameters.

@pytest.mark.parametrize(
    "fruit, crisp",
    [
        ("apple", True),
        ("Apple", True),
        ("orange", False),
    ],
)
def test_is_crisp_parametrize(fruit, crisp):
    assert is_crisp(fruit=fruit) == crisp

Executing this test produces the following output (one dot per parameter combination):

pytest tests/food/test_is_crisp_parametrize.py ...   [100%]

  1. Define the names of the parameters under the decorator, ex. "fruit, crisp" (note that this is one string).
  2. Provide a list of combinations of values for the parameters from Step 1.
  3. Pass the parameter names into the test function's signature.
  4. Include the necessary assert statements, which will be executed for each of the combinations in the list from Step 2.

In our application, we use parametrization to test components that require varied sets of inputs and expected outputs such as preprocessing, filtering, etc.
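
For instance, a parametrized preprocessing test might look like the following sketch (the preprocess function and its lower argument are assumed for illustration and are not our application's exact signature):

@pytest.mark.parametrize(
    "text, lower, expected",
    [
        ("Transfer LEARNING", True, "transfer learning"),
        ("Transfer LEARNING", False, "Transfer LEARNING"),
    ],
)
def test_preprocess(text, lower, expected):
    assert preprocess(text=text, lower=lower) == expected  # preprocess is assumed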

Note

We could pass in an exception as the expected result as well:

@pytest.mark.parametrize(
    "fruit, exception",
    [
        ("pear", ValueError),
    ],
)
def test_is_crisp_exceptions(fruit, exception):
    with pytest.raises(exception):
        is_crisp(fruit=fruit)

Fixtures

Parametrization allows us to efficiently reduce redundancy inside test functions, but what about their inputs? Here, we can use pytest's built-in fixture, which is a function that is executed before the test function. This significantly reduces redundancy when multiple test functions require the same inputs.

@pytest.fixture
def my_fruit():
    fruit = Fruit(name="apple")
    return fruit


def test_fruit(my_fruit):
    assert my_fruit.name == "apple"

We can apply fixtures to classes as well where the fixture function will be invoked when any method in the class is called.

@pytest.mark.usefixtures("my_fruit")
class TestFruit:
    ...

We use fixtures to efficiently pass a set of inputs (ex. Pandas DataFrame) to different testing functions that require them (cleaning, splitting, etc.).

from pathlib import Path

import pandas as pd
import pytest

from tagifai import config, utils


@pytest.fixture
def df():
    projects_fp = Path(config.DATA_DIR, "projects.json")
    projects_dict = utils.load_dict(filepath=projects_fp)
    df = pd.DataFrame(projects_dict)
    return df


def test_split(df):
    splits = split_data(df=df)  # split_data comes from our application
    ...

Note

Typically, when we have too many fixtures in a particular test file, we can organize them all in a fixtures.py script and invoke them as needed.
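
Another common convention, if it suits your setup, is to place shared fixtures in a conftest.py file, which pytest discovers automatically so the fixtures can be used across test files without explicit imports:

# tests/conftest.py
import pytest


class Fruit:  # stand-in for the class under test, defined here only for illustration
    def __init__(self, name):
        self.name = name


@pytest.fixture
def my_fruit():
    return Fruit(name="apple")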

Markers

We've been able to execute our tests at various levels of granularity (all tests, script, function, etc.) but we can create custom granularity by using markers. We've already used one type of marker (parametrize) but there are several other builtin markers as well. For example, the skipif marker allows us to skip execution of a test if a condition is met.

@pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="Full training tests require a GPU."
)
def test_training():
    pass

We can also create our own custom markers with the exception of a few reserved marker names.

@pytest.mark.fruits
def test_fruit(my_fruit):
    assert my_fruit.name == "apple"

We can execute them by using the -m flag which requires a (case-sensitive) marker expression like below:

pytest -m "fruits"      #  runs all tests marked with `fruits`
pytest -m "not fruits"  #  runs all tests besides those marked with `fruits`

The proper way to use markers is to explicitly list the ones we've created in our pyproject.toml file. Here we can specify that all markers must be defined in this file with the --strict-markers flag and then declare our markers (with some info about them) in our markers list:

# Pytest
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"
addopts = "--strict-markers --disable-pytest-warnings"
markers = [
    "training: tests that involve training",
]

Once we do this, we can view our list of defined markers by executing pytest --markers, and we'll receive an error when trying to use a marker that's not defined here.

We use custom markers to label which of our test functions involve training so we can separate long running tests from everything else.

@pytest.mark.training
def test_train_model():
    experiment_name = "test_experiment"
    run_name = "test_run"
    result = runner.invoke()
    ...

Note

Another way to run a custom subset of tests is to use the -k flag when running pytest. The -k expression is much more flexible than the marker expression since it can filter tests based on their names.
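
For example (assuming the test names from this lesson):

pytest -k "fruit"                 # runs all tests whose names contain "fruit"
pytest -k "fruit and not crisp"   # boolean expressions over test names work as well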

Coverage

As we're developing tests for our application's components, it's important to know how well we're covering our code base and to know if we've missed anything. We can use the Coverage library to track and visualize how much of our codebase our tests account for. With pytest, it's even easier to use this package thanks to the pytest-cov plugin.

pytest --cov tagifai --cov app --cov-report html

Here we're asking for coverage for all the code in our tagifai and app directories and to generate the report in HTML format. When we run this, we'll see the tests from our tests directory executing while the coverage plugin keeps track of which lines in our application are being executed. Once our tests are complete, we can view the generated report (default is htmlcov/index.html) and click on individual files to see which parts were not covered by any tests. This is especially useful when we forget to test for certain conditions, exceptions, etc.

Warning

Though we have 100% coverage, this does not mean that our application is perfect. Coverage only indicates that a piece of code executed in a test, not necessarily that every part of it was tested, let alone thoroughly tested. Therefore, coverage should never be used as a representation of correctness. However, it is very useful to maintain coverage at 100% so we can know when new functionality has yet to be tested. In our CI/CD lesson, we'll see how to use GitHub Actions to make 100% coverage a requirement when pushing to specific branches.
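
If we want to enforce this locally as well, the pytest-cov plugin exposes a fail-under threshold that makes the test run fail when coverage drops below a target (a sketch; adapt the directories to your own setup):

pytest --cov tagifai --cov app --cov-fail-under=100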

Exclusions

Sometimes it doesn't make sense to write tests to cover every single line in our application yet we still want to account for these lines so we can maintain 100% coverage. We have two levels of purview when applying exclusions:

  1. Excluding lines by adding this comment: # pragma: no cover, <MESSAGE>

    if self.trial.should_prune():  # pragma: no cover, optuna pruning
        pass
    

  2. Excluding files by specifying them in our pyproject.toml configuration.

    # Pytest coverage
    [tool.coverage.run]
    omit = ["app/main.py"]  #  sample API calls
    

The key here is that we were able to add justification to these exclusions through comments so our team can follow our reasoning.

Machine learning

Now that we have a foundation for testing traditional software, let's dive into testing our data and models in the context of machine learning systems.

🔢  Data

We've already tested the functions that act on our data through unit and integration tests but we haven't tested the validity of the data itself. Once we define what our data should look like, we can use (and add to) these expectations as our dataset grows.

Expectations

There are many dimensions to what our data is expected to look like. We'll briefly talk about a few of them, including ones that may not directly be applicable to our task but, nonetheless, are very important to be aware of.

  • rows / cols: the most basic expectation is validating the presence of samples (rows) and features (columns). These can help identify mismatches caused by upstream changes such as backend database schema changes, UI form changes, etc.
    • presence of specific features
    • row count (exact or range) of samples
  • individual values: we can also have expectations about the individual values of specific features.
    • missing values
    • type adherence (ex. feature values are all float)
    • values must be unique or from a predefined set
    • list (categorical) / range (continuous) of allowed values
    • feature value relationships with other feature values (ex. column 1 values must always be greater than column 2 values)
  • aggregate values: we can also have expectations about all the values of specific features.
    • value statistics (mean, std, median, max, min, sum, etc.)
    • distribution shift by comparing current values to previous values (useful for detecting drift)

To implement these expectations, we could compose assert statements or we could leverage the open-source library called Great Expectations. It's a fantastic library that already has many of these expectations built in (map, aggregate, multi-column, distributional, etc.) and allows us to create custom expectations as well. It also provides modules to seamlessly connect with backend data sources such as local file systems, S3, databases and even DAG runners. Let's explore the library by implementing the expectations we'll need for our application.

Note

Though Great Expectations has all the data validation functionality we need, there are several other production-grade data validation options available as well, such as TFX, AWS Deequ, etc.

First we'll load the data we'd like to apply our expectations on. We can load our data from a variety of sources (filesystem, S3, DB, etc.) which we can then wrap around a Dataset module (Pandas / Spark DataFrame, SQLAlchemy).

from pathlib import Path
import great_expectations as ge
import pandas as pd
from tagifai import config, utils

# Create Pandas DataFrame
projects_fp = Path(config.DATA_DIR, "projects.json")
projects_dict = utils.load_dict(filepath=projects_fp)
df = ge.dataset.PandasDataset(projects_dict)

The first few samples in the DataFrame:

id created_on title description tags
0 1 2020-02-17 06:30:41 Machine Learning Basics A practical set of notebooks on machine learni... [code, tutorial, keras, pytorch, tensorflow, d...
1 2 2020-02-17 06:41:45 Deep Learning with Electronic Health Record (E... A comprehensive look at recent machine learnin... [article, tutorial, deep-learning, health, ehr]
2 3 2020-02-20 06:07:59 Automatic Parking Management using computer vi... Detecting empty and parked spaces in car parki... [code, tutorial, video, python, machine-learni...
3 4 2020-02-20 06:21:57 Easy street parking using region proposal netw... Get a text on your phone whenever a nearby par... [code, tutorial, python, pytorch, machine-lear...
4 5 2020-02-20 06:29:18 Deep Learning based parking management system ... Fastai provides easy to use wrappers to quickl... [code, tutorial, fastai, deep-learning, parkin...

Built-in

Once we have our data source wrapped in a Dataset module, we can compose and apply expectations on it. There are many built-in expectations to choose from:

# Presence of features
expected_columns = ["id", "title", "description", "tags"]
df.expect_table_columns_to_match_ordered_list(column_list=expected_columns)

# Unique
df.expect_column_values_to_be_unique(column="id")

# No null values
df.expect_column_values_to_not_be_null(column="created_on")
df.expect_column_values_to_not_be_null(column="title")
df.expect_column_values_to_not_be_null(column="description")
df.expect_column_values_to_not_be_null(column="tags")

# Type
df.expect_column_values_to_be_of_type(column="title", type_="str")
df.expect_column_values_to_be_of_type(column="description", type_="str")
df.expect_column_values_to_be_of_type(column="tags", type_="list")

# Format
df.expect_column_values_to_match_strftime_format(
    column="created_on", strftime_format="%Y-%m-%d %H:%M:%S")

# Data leaks
df.expect_compound_columns_to_be_unique(column_list=["title", "description"])

Each of these expectations will create an output with details about success or failure, expected and observed values, expectations raised, etc. For example, the expectation df.expect_column_values_to_be_of_type(column="title", type_="str") would produce the following if successful:

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {},
  "expectation_config": {
    "kwargs": {
      "column": "title",
      "type_": "str",
      "result_format": "BASIC"
    },
    "meta": {},
    "expectation_type": "_expect_column_values_to_be_of_type__map"
  },
  "result": {
    "element_count": 2032,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  }
}

and this output if it failed (notice the counts and examples for what caused the failure):

{
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "expectation_config": {
    "meta": {},
    "kwargs": {
      "column": "title",
      "type_": "int",
      "result_format": "BASIC"
    },
    "expectation_type": "_expect_column_values_to_be_of_type__map"
  },
  "result": {
    "element_count": 2032,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 2032,
    "unexpected_percent": 100.0,
    "unexpected_percent_nonmissing": 100.0,
    "partial_unexpected_list": [
      "How to Deal with Files in Google Colab: What You Need to Know",
      "Machine Learning Methods Explained (+ Examples)",
      "OpenMMLab Computer Vision",
      "...",
    ]
  },
  "meta": {}
}

We can group all the expectations together to create an Expectation Suite object which we can use to validate any Dataset module.

# Expectation suite
expectation_suite = df.get_expectation_suite()
print(df.validate(expectation_suite=expectation_suite, only_return_failures=True))
{
  "success": true,
  "results": [],
  "statistics": {
    "evaluated_expectations": 9,
    "successful_expectations": 9,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "evaluation_parameters": {}
}

Custom

Our tags feature column is a list of tags for each input. The Great Expectations library doesn't come equipped to process a list feature, but we can easily do so by creating a custom expectation.

  1. Create a custom Dataset module to wrap our data around.
  2. Define expectation functions that can map to each individual row of the feature column (map) or to the entire feature column (aggregate) by specifying the appropriate decorator.
    class CustomPandasDataset(ge.dataset.PandasDataset):
        _data_asset_type = "CustomPandasDataset"
    
        @ge.dataset.MetaPandasDataset.column_map_expectation
        def expect_column_list_values_to_be_not_null(self, column):
            return column.map(lambda x: None not in x)
    
        @ge.dataset.MetaPandasDataset.column_map_expectation
        def expect_column_list_values_to_be_unique(self, column):
            return column.map(lambda x: len(x) == len(set(x)))
    
  3. Wrap data with the custom Dataset module and use the custom expectations.
    df = CustomPandasDataset(projects_dict)
    df.expect_column_values_to_not_be_null(column="tags")
    df.expect_column_list_values_to_be_unique(column="tags")
    

Note

There are various levels of abstraction (following a template vs. completely from scratch) available when it comes to creating a custom expectation with Great Expectations.

Projects

So far we've worked with the Great Expectations library at the Python script level but we can organize our expectations even more by creating a Project.

  1. Initialize the Project using great_expectations init. This will interactively walk us through setting up data sources, naming, etc. and set up a great_expectations directory with the following structure:
    great_expectations/
    |   ├── checkpoints/
    |   ├── expectations/
    |   ├── notebooks/
    |   ├── plugins/
    |   ├── uncommitted/
    |   ├── .gitignore
    |   └── great_expectations.yml
    
  2. Define our custom module under the plugins directory and use it to define our data sources in our great_expectations.yml configuration file.
    datasources:
      data:
        class_name: PandasDatasource
        data_asset_type:
          module_name: custom_module.custom_dataset
          class_name: CustomPandasDataset
        module_name: great_expectations.datasource
        batch_kwargs_generators:
          subdir_reader:
            class_name: SubdirReaderBatchKwargsGenerator
            base_directory: ../assets/data
    
  3. Create expectations using the profiler, which creates automatic expectations based on the data, or we can also create our own expectations. All of this is done interactively via a launched Jupyter notebook and saved under our great_expectations/expectations directory.

    great_expectations suite scaffold SUITE_NAME  # uses profiler
    great_expectations suite new --suite  # no profiler
    great_expectations suite edit SUITE_NAME  # add your own custom expectations
    

    When using the automatic profiler, you can choose which feature columns to apply profiling to. Since our tags feature is a list feature, we'll leave it commented and create our own expectations using the suite edit command.

  4. Create Checkpoints where a Suite of Expectations are applied to a specific Data Asset. This is a great way of programmatically applying checkpoints on our existing and new data sources.

    great_expectations checkpoint new CHECKPOINT_NAME SUITE_NAME
    great_expectations checkpoint run CHECKPOINT_NAME
    

  5. Run checkpoints on new batches of incoming data by adding them to our testing pipeline via a Makefile or a workflow orchestrator like Airflow, KubeFlow Pipelines, etc. We can also use the Great Expectations GitHub Action to automate validating our data pipeline code when we push a change. More on using these Checkpoints with pipelines in our Pipelines lesson.

Data docs

When we create expectations using the CLI application, Great Expectations automatically generates documentation for our tests. It also stores information about validation runs and their results. We can generate and launch the data documentation with the following command: great_expectations docs build

Best practices

We've applied expectations on our source dataset, but there are many other key areas to test the data as well. Throughout the ML development pipeline, we should test the intermediate outputs from processes such as cleaning, augmentation, splitting, preprocessing, tokenization, etc. We'll also use these expectations to monitor new batches of data before combining them with our existing data assets.
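
For example, a hedged sketch of what expectations on an intermediate output might look like (the train_df split, the column names and num_classes below are assumptions for illustration):

# Expectations on a preprocessed training split (names are assumed)
train_ds = ge.dataset.PandasDataset(train_df)
train_ds.expect_column_values_to_not_be_null(column="text")  # post-cleaning
train_ds.expect_column_values_to_be_between(
    column="label", min_value=0, max_value=num_classes - 1)  # post-label-encoding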

Note

Currently, these data processing steps are tied with our application code but in future lessons, we'll separate these into individual pipelines and use Great Expectation Checkpoints in between to apply all these expectations in an orchestrated fashion.

🤖  Models

The other half of testing ML systems involves testing our models during training, evaluation, inference and deployment.

Training

We want to write tests iteratively while we're developing our training pipelines so we can catch errors quickly. This is especially important because, unlike traditional software, ML systems can run to completion without throwing any exceptions / errors but can still produce incorrect models. We also want to catch errors quickly to save on time and compute. Some examples are listed below, with one more sanity check sketched after the list.

  • Check shapes and values of model output
    assert model(inputs).shape == torch.Size([len(inputs), num_classes])
    
  • Check for decreasing loss after one batch of training
    assert epoch_loss < prev_epoch_loss
    
  • Overfit on a batch
    accuracy = train(model, inputs=batches[0])
    assert accuracy == pytest.approx(1.0, abs=0.05)  # 1.0 ± 0.05
    
  • Train to completion (tests early stopping, saving, etc.)
    train(model)
    assert learning_rate >= min_learning_rate
    assert artifacts
    
  • On different devices
    assert train(model, device=torch.device("cpu"))
    assert train(model, device=torch.device("cuda"))
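
One more sanity check we can add (not from the list above, just a common pattern) is verifying that parameters actually change after a single optimization step; model, optimizer, loss_fn and batch are assumed to be set up by the test:

# Parameters should update after one optimization step
before = [p.clone() for p in model.parameters()]
loss = loss_fn(model(batch["inputs"]), batch["targets"])
optimizer.zero_grad()
loss.backward()
optimizer.step()
assert any(not torch.equal(b, a) for b, a in zip(before, model.parameters()))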
    

Note

You can mark compute-intensive tests with a pytest marker and only execute them when a change is being made to a part of the system that affects the model.

@pytest.mark.training
def test_train_model():
    ...

Evaluation

When it comes to testing how well our model performs, we need to first have our priorities in line.

  • What metrics are important?
  • What tradeoffs are we willing to make?
  • Are there certain subsets (slices) of data that are important?

Metrics

Overall

We want to ensure that our key metrics on the overall dataset improve with each iteration of our model. Overall metrics include accuracy, precision, recall, f1, etc., and we should define what counts as a performance regression. For example, is a higher precision at the expense of recall an improvement or a regression? Usually, a team of developers and domain experts will establish what the key metric(s) are while also specifying the acceptable regression tolerance for other metrics.

assert precision > prev_precision  # most important, cannot regress
assert recall >= best_prev_recall - 0.03  # recall cannot regress > 3%

Per-class

We can perform similar assertions for class specific metrics as well.

assert metrics["class"]["data_augmentation"]["f1"] > prev_data_augmentation_f1  # priority class

Slices

In the same vein, we can create assertions for certain key slices of our data as well. These can be very simple tests to ensure that our high priority slices of data continue to improve in performance.

assert metrics["slices"]["class"]["cv_transformers"]["f1"] > prev_cv_transformers_f1  # priority slice

Behavioral testing

Besides just looking at metrics, we also want to conduct some behavioral sanity tests. Behavioral testing is the process of testing input data and expected outputs while treating the model as a black box. These tests don't necessarily have to be adversarial in nature but are more along the lines of the perturbations we'll see in the real world once our model is deployed. A landmark paper on this topic is Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, which breaks down behavioral testing into three types of tests (a simplified sketch of the comparison helper used below follows the list):

  • invariance: Changes should not affect outputs.
    # INVariance via verb injection (changes should not affect outputs)
    tokens = ["revolutionized", "disrupted"]
    tags = [["transformers"], ["transformers"]]
    texts = [f"Transformers have {token} the ML field." for token in tokens]
    compare_tags(texts=texts, tags=tags, artifacts=artifacts, test_type="INV")
    
  • directional: Change should affect outputs.
    # DIRectional expectations (changes with known outputs)
    tokens = ["PyTorch", "Huggingface"]
    tags = [
        ["pytorch", "transformers"],
        ["huggingface", "transformers"],
    ]
    texts = [f"A {token} implementation of transformers." for token in tokens]
    compare_tags(texts=texts, tags=tags, artifacts=artifacts, test_type="DIR")
    
  • minimum functionality: Simple combination of inputs and expected outputs.
    # Minimum Functionality Tests (simple input/output pairs)
    tokens = ["transformers", "graph neural networks"]
    tags = [["transformers"], ["graph-neural-networks"]]
    texts = [f"{token} have revolutionized machine learning." for token in tokens]
    compare_tags(texts=texts, tags=tags, artifacts=artifacts, test_type="MFT")
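
The compare_tags helper above comes from our application; a heavily simplified sketch of what such a helper might do (the predict_tags call and the exact bookkeeping are assumptions) looks like:

def compare_tags(texts, tags, artifacts, test_type):
    """Simplified sketch: check that the expected tags appear in the predictions."""
    results = {"passed": [], "failed": []}
    for text, expected in zip(texts, tags):
        predicted = predict_tags(text=text, artifacts=artifacts)  # assumed helper
        entry = {"input": {"text": text, "tags": expected}, "type": test_type}
        if all(tag in predicted for tag in expected):
            results["passed"].append(entry)
        else:
            results["failed"].append(entry)
    return results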
    

Note

Be sure to explore the NLP Checklist package which simplifies and augments the creation of these behavioral tests via functions, templates, pretrained models and interactive GUIs in Jupyter notebooks.

We combine all of these behavioral tests to create a behavioral report (tagifai.eval.get_behavioral_report()) which quantifies how many of these tests are passed by a particular instance of a trained model. This report is then saved with the run's performance artifact so we can use this information when choosing which model(s) to deploy to production.

"behavioral_report": {
    "score": 1.0,
    "results": {
        "passed": [
            {
                "input": {
                    "text": "Transformers have revolutionized the ML field.",
                    "tags": [
                        "transformers"
                    ]
                },
                "prediction": {
                    "input_text": "Transformers have revolutionized the ML field.",
                    "preprocessed_text": "transformers revolutionized ml field",
                    "predicted_tags": [
                        "natural-language-processing",
                        "transformers"
                    ]
                },
                "type": "INV"
            },
            ...
            {
                "input": {
                    "text": "graph neural networks have revolutionized machine learning.",
                    "tags": [
                        "graph-neural-networks"
                    ]
                },
                "prediction": {
                    "input_text": "graph neural networks have revolutionized machine learning.",
                    "preprocessed_text": "graph neural networks revolutionized machine learning",
                    "predicted_tags": [
                        "graph-neural-networks",
                        "graphs"
                    ]
                },
                "type": "MFT"
            }
        ],
        "failed": []
    }
}

Warning

When you create additional behavioral tests, be sure to reevaluate all the models you're considering on the new set of tests so their scores can be compared. We can do this since behavioral tests are not dependent on data or model versions and are simply tests that treat the model as a black box.

tagifai behavioral-reevaluation

Inference

When our model is deployed, most users will be using it for inference (directly / indirectly), so it's very important that we test all aspects of it.

Loading artifacts

This is the first time we're not loading our components from in-memory, so we want to ensure that the required artifacts (model weights, encoders, config, etc.) can all be loaded.

artifacts = main.load_artifacts(run_id=run_id, device=torch.device("cpu"))
assert isinstance(artifacts["model"], nn.Module)
...

Prediction

Once we have our artifacts loaded, we're ready to test our prediction pipelines. We should test samples with just one input as well as a batch of inputs (ex. padding can sometimes have unintended consequences).

# tests/app/test_api.py | test_best_predict()
data = {
    "run_id": "",
    "texts": [
        {"text": "Transfer learning with transformers for self-supervised learning."},
        {"text": "Generative adversarial networks in both PyTorch and TensorFlow."},
    ],
}
response = client.post("/predict", json=data)
assert response.status_code == HTTPStatus.OK
assert response.request.method == "POST"
assert len(response.json()["data"]["predictions"]) == len(data["texts"])
...

Deployment

There are also a whole class of model tests that go beyond metrics or behavioral testing and focus on the system as a whole. Many of them involve testing and benchmarking the tradeoffs (ex. latency, compute, etc.) we discussed in the baselines lesson. These tests also need to be performed across the different systems (ex. devices) that our model may run on. For example, development may happen on a CPU but the deployed model may be loaded on a GPU, and there may be incompatible components (ex. reparametrization) that cause errors. As a rule of thumb, we should test with the system specifications that our production environment utilizes.

Note

We'll automate tests on different devices in our CI/CD lesson where we'll use GitHub Actions to spin up our application with Docker Machine on cloud compute instances (we'll also use this for training).

Once we've tested our model's ability to perform in the production environment (offline tests), we can run several types of online tests to determine the quality of that performance.

AB tests

AB tests involve sending production traffic to the different systems that we're evaluating and then using statistical hypothesis testing to decide which system is better. There are several common issues with AB testing, such as accounting for different sources of bias (ex. the novelty effect of showing some users the new system). We also need to ensure that the same users continue to interact with the same systems so we can compare the results without contamination. In many cases, if we're simply trying to compare the different versions on a certain metric, multi-armed bandits will be a better approach.
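
As an illustration of the hypothesis-testing step (a sketch that assumes we're comparing a simple binary success metric, such as click-through, between the two systems), we could use a two-proportion z-test:

# Compare a binary success metric between systems A and B (counts are assumed)
from statsmodels.stats.proportion import proportions_ztest

successes = [a_successes, b_successes]
observations = [a_total, b_total]
stat, p_value = proportions_ztest(count=successes, nobs=observations)
significant = p_value < 0.05  # chosen significance level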

Canary tests

Canary tests involve sending most of the production traffic to the currently deployed system but sending traffic from a small cohort of users to the new system we're trying to evaluate. Again we need to make sure that the same users continue to interact with the same system as we gradually roll out the new system.

Shadow tests

Shadow testing involves sending the same production traffic to the different systems. We don't have to worry about system contamination and it's very safe compared to the previous approaches since the new system's results are not served. However, we do need to ensure that we're replicating as much of the production system as possible so we can catch issues that are unique to production early on. Overall, shadow testing makes it easy to monitor the new system, validate operational consistency, etc.

Testing vs. monitoring

We'll conclude by talking about the similarities and distinctions between testing and monitoring. They're both integral parts of the ML development pipeline and depend on each other for iteration. Testing is assuring that our system (code, data and models) passes the expectations that we've established at \(t_0\), whereas monitoring involves ensuring that these expectations continue to pass on live production data while also ensuring that the data distributions are comparable to the reference window (typically a subset of the training data) through \(t_n\). When these conditions no longer hold true, we need to inspect more closely (retraining may not always fix the root problem).

With monitoring, there are quite a few distinct concerns that we didn't have to consider during testing since it involves (live) data we have yet to see.

  • features and prediction distributions (drift), typing, schema mismatches, etc.
  • determining model performance (rolling and window metrics on overall and slices of data) using indirect signals (since labels may not be readily available).
  • in situations with large data, we need to know which data points to label and upsample for training.
  • identifying anomalies and outliers.

We'll cover all of these concepts in much more depth (and code) in our monitoring lesson.

Resources


To cite this lesson, please use:

@article{madewithml,
    title  = "Testing ML Systems: Code, Data and Models - Made With ML",
    author = "Goku Mohandas",
    url    = "https://madewithml.com/courses/mlops/testing/",
    year   = "2021",
}