
Testing Machine Learning Systems: Code, Data and Models


Learn how to test ML models (and their code and data) to ensure consistent behavior in our ML systems.
Goku Mohandas
Repository · Notebook


Intuition

In this lesson, we'll learn how to test code, data and models to construct a machine learning system that we can reliably iterate on. Tests are a way for us to ensure that something works as intended. We're incentivized to implement tests and discover sources of error as early in the development cycle as possible so that we can decrease downstream costs and wasted time. Once we've designed our tests, we can automatically execute them every time we change or add to our codebase.

Types of tests

There are five major types of tests which are utilized at different points in the development cycle:

  1. Unit tests: tests on individual components that each have a single responsibility (ex. function that filters a list).
  2. Integration tests: tests on the combined functionality of individual components (ex. data processing).
  3. System tests: tests on the design of a system for expected outputs given inputs (ex. training, inference, etc.).
  4. Acceptance tests: tests to verify that requirements have been met, usually referred to as User Acceptance Testing (UAT).
  5. Regression tests: tests based on errors we've seen before to ensure new changes don't reintroduce them.

While ML systems are probabilistic in nature, they are composed of many deterministic components that can be tested in a similar manner as traditional software systems. The distinction between testing ML systems and traditional software begins when we move from testing code to also testing the data and models.

(Figure: types of tests across the development cycle)

There are many other types of functional and non-functional tests as well, such as smoke tests (quick health checks), performance tests (load, stress), security tests, etc. but we can generalize all of these under the system tests above.

How should we test?

The framework to use when composing tests is the Arrange Act Assert methodology.

  • Arrange: set up the different inputs to test on.
  • Act: apply the inputs on the component we want to test.
  • Assert: confirm that we received the expected output.

Cleaning is an unofficial fourth step to this methodology because it's important not to leave behind remnants of a previous test that may affect subsequent tests. We can use packages such as pytest-randomly to guard against state dependencies by executing tests in a random order.
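
For instance, a minimal sketch of the Arrange Act Assert structure might look like the following (the shopping-cart list here is a hypothetical example, not part of our project):

def test_add_item_to_cart():
    # Arrange: set up the inputs
    cart = []
    item = "apple"

    # Act: apply the inputs on the component we want to test
    cart.append(item)

    # Assert: confirm that we received the expected output
    assert cart == ["apple"]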

In Python, there are many tools, such as unittest, pytest, etc. that allow us to easily implement our tests while adhering to the Arrange Act Assert framework. These tools come with powerful built-in functionality such as parametrization, filters, and more, to test many conditions at scale.

What should we be testing for?

When arranging our inputs and asserting our expected outputs, what are some aspects of our inputs and outputs that we should be testing for?

  • inputs: data types, format, length, edge cases (min/max, small/large, etc.)
  • outputs: data types, formats, exceptions, intermediary and final outputs
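
For example, a single test can touch several of these aspects at once. The clip_value helper below is a hypothetical function used only for illustration:

import pytest


def clip_value(x, low, high):
    """Hypothetical helper that clips a number into [low, high]."""
    if low > high:
        raise ValueError("low must be <= high")
    return max(low, min(x, high))


def test_clip_value():
    # output types and values
    assert isinstance(clip_value(0.5, low=0.0, high=1.0), float)
    # edge cases (min/max)
    assert clip_value(-10, low=0.0, high=1.0) == 0.0
    assert clip_value(10, low=0.0, high=1.0) == 1.0
    # exceptions
    with pytest.raises(ValueError):
        clip_value(0.5, low=1.0, high=0.0)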

Best practices

Regardless of the framework we use, it's important to strongly tie testing into the development process.

  • atomic: when creating functions and classes, we need to ensure that they have a single responsibility so that we can easily test them. If not, we'll need to split them into more granular components.
  • compose: when we create new components, we want to compose tests to validate their functionality. It's a great way to ensure reliability and catch errors early on.
  • reuse: we should maintain central repositories where core functionality is tested at the source and reused across many projects. This significantly reduces testing efforts for each new project's code base.
  • regression: we want to account for new errors we come across with a regression test so we can ensure we don't reintroduce the same errors in the future.
  • coverage: we want to ensure 100% coverage for our codebase. This doesn't mean writing a test for every single line of code but rather accounting for every single line.
  • automate: in the event we forget to run our tests before committing to a repository, we want to automatically run tests when we make changes to our codebase. We'll learn how to do this locally using pre-commit hooks and remotely via GitHub Actions in subsequent lessons.

Test-driven development

Test-driven development (TDD) is the process of writing a test before writing the functionality to ensure that tests are always written. This is in contrast to writing functionality first and then composing tests afterwards. Here are our thoughts on this:

  • it's good to write tests as we progress, but it doesn't signify 100% correctness.
  • initial time should be spent on design before ever getting into the code or tests.

Perfect coverage doesn't mean that our application is error free if those tests aren't meaningful and don't encompass the field of possible inputs, intermediates and outputs. Therefore, we should work towards better design and agility when facing errors, quickly resolving them and writing test cases around them to avoid reintroducing them next time.

Application

In our application, we'll be testing the code, data and models. We'll start by creating a separate tests directory with a code subdirectory for testing our tagifai scripts. We'll create subdirectories for testing data and models later in this lesson.

mkdir tests
cd tests
mkdir code
touch <SCRIPTS>
cd ../

tests/
└── code/
|   ├── test_data.py
|   ├── test_evaluate.py
|   ├── test_main.py
|   ├── test_predict.py
|   └── test_utils.py

Feel free to write the tests and organize them in these scripts after learning about all the concepts in this lesson. We suggest using our tests directory on GitHub as a reference.

Notice that our tagifai/train.py script does not have its respective tests/code/test_train.py. Some scripts have large functions (ex. train.train(), train.optimize(), predict.predict(), etc.) with dependencies (ex. artifacts), so it makes sense to test them via tests/code/test_main.py.

🧪  Pytest

We're going to be using pytest as our testing framework for its powerful built-in features such as parametrization, fixtures, markers and more.

pip install pytest==7.1.2

Since this testing package is not integral to the core machine learning operations, let's create a separate list in our setup.py and add it to our extras_require:

# setup.py
test_packages = [
    "pytest==7.1.2",
]

# Define our package
setup(
    ...
    extras_require={
        "dev": docs_packages + style_packages + test_packages,
        "docs": docs_packages,
        "test": test_packages,
    },
)

We created an explicit test option because a user may want to download only the testing packages. We'll see this in action when we use CI/CD workflows to run tests via GitHub Actions.

Configuration

Pytest expects tests to be organized under a tests directory by default. However, we can also add to our existing pyproject.toml file to configure other test directories as well. Once in the directory, pytest looks for Python scripts starting with test_*.py, but we can configure it to read other file patterns as well.

# Pytest
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"

Assertions

Let's see what a sample test and its results look like. Assume we have a simple function that determines whether a fruit is crisp or not:

# food/fruits.py
def is_crisp(fruit):
    if fruit:
        fruit = fruit.lower()
    if fruit in ["apple", "watermelon", "cherries"]:
        return True
    elif fruit in ["orange", "mango", "strawberry"]:
        return False
    else:
        raise ValueError(f"{fruit} not in known list of fruits.")
    return False

To test this function, we can use assert statements to map inputs with expected outputs. The expression following the assert keyword must evaluate to True.

# tests/food/test_fruits.py
import pytest

from food.fruits import is_crisp  # assuming food/ is an importable package


def test_is_crisp():
    assert is_crisp(fruit="apple")
    assert is_crisp(fruit="Apple")
    assert not is_crisp(fruit="orange")
    with pytest.raises(ValueError):
        is_crisp(fruit=None)
        is_crisp(fruit="pear")

We can also have assertions about exceptions, as we do in the with pytest.raises(ValueError) block above, where all the operations under the with statement are expected to raise the specified exception.

Example of using assert in our project
# tests/code/test_evaluate.py
import numpy as np

from tagifai import evaluate


def test_get_metrics():
    y_true = np.array([0, 0, 1, 1])
    y_pred = np.array([0, 1, 0, 1])
    classes = ["a", "b"]
    performance = evaluate.get_metrics(y_true=y_true, y_pred=y_pred, classes=classes, df=None)
    assert performance["overall"]["precision"] == 2/4
    assert performance["overall"]["recall"] == 2/4
    assert performance["class"]["a"]["precision"] == 1/2
    assert performance["class"]["a"]["recall"] == 1/2
    assert performance["class"]["b"]["precision"] == 1/2
    assert performance["class"]["b"]["recall"] == 1/2

Execution

We can execute our tests above using several different levels of granularity:

python3 -m pytest                                           # all tests
python3 -m pytest tests/food                                # tests under a directory
python3 -m pytest tests/food/test_fruits.py                 # tests for a single file
python3 -m pytest tests/food/test_fruits.py::test_is_crisp  # tests for a single function

Running our specific test above would produce the following output:

python3 -m pytest tests/food/test_fruits.py::test_is_crisp

tests/food/test_fruits.py::test_is_crisp .           [100%]

Had any of our assertions in this test failed, we would see the failed assertions, along with the expected and actual output from our function.

tests/food/test_fruits.py F                          [100%]

    def test_is_crisp():
>       assert is_crisp(fruit="orange")
E       AssertionError: assert False
E        +  where False = is_crisp(fruit='orange')

Tip

It's important to test for the variety of inputs and expected outputs that we outlined above and to never assume that a test is trivial. In our example above, it's important that we test for both "apple" and "Apple" in the event that our function didn't account for casing!

Classes

We can also test classes and their respective functions by creating test classes. Within our test class, we can optionally define functions that will automatically be executed when we set up or tear down a class instance or use a class method.

  • setup_class: set up the state for any class instance.
  • teardown_class: teardown the state created in setup_class.
  • setup_method: called before every method to setup any state.
  • teardown_method: called after every method to teardown any state.
class Fruit(object):
    def __init__(self, name):
        self.name = name

class TestFruit(object):
    @classmethod
    def setup_class(cls):
        """Set up the state for any class instance."""
        pass

    @classmethod
    def teardown_class(cls):
        """Teardown the state created in setup_class."""
        pass

    def setup_method(self):
        """Called before every method to setup any state."""
        self.fruit = Fruit(name="apple")

    def teardown_method(self):
        """Called after every method to teardown any state."""
        del self.fruit

    def test_init(self):
        assert self.fruit.name == "apple"

We can execute all the tests for our class by specifying the class name:

python3 -m pytest tests/food/test_fruits.py::TestFruit
tests/food/test_fruits.py::TestFruit .           [100%]
Example of testing a class in our project
# tests/code/test_data.py
import tempfile
from pathlib import Path

import numpy as np

from tagifai import data


class TestLabelEncoder:
    @classmethod
    def setup_class(cls):
        """Called before every class initialization."""
        pass

    @classmethod
    def teardown_class(cls):
        """Called after every class initialization."""
        pass

    def setup_method(self):
        """Called before every method."""
        self.label_encoder = data.LabelEncoder()

    def teardown_method(self):
        """Called after every method."""
        del self.label_encoder

    def test_empty_init(self):
        label_encoder = data.LabelEncoder()
        assert label_encoder.index_to_class == {}
        assert len(label_encoder.classes) == 0

    def test_dict_init(self):
        class_to_index = {"apple": 0, "banana": 1}
        label_encoder = data.LabelEncoder(class_to_index=class_to_index)
        assert label_encoder.index_to_class == {0: "apple", 1: "banana"}
        assert len(label_encoder.classes) == 2

    def test_len(self):
        assert len(self.label_encoder) == 0

    def test_save_and_load(self):
        with tempfile.TemporaryDirectory() as dp:
            fp = Path(dp, "label_encoder.json")
            self.label_encoder.save(fp=fp)
            label_encoder = data.LabelEncoder.load(fp=fp)
            assert len(label_encoder.classes) == 0

    def test_str(self):
        assert str(data.LabelEncoder()) == "<LabelEncoder(num_classes=0)>"

    def test_fit(self):
        label_encoder = data.LabelEncoder()
        label_encoder.fit(["apple", "apple", "banana"])
        assert "apple" in label_encoder.class_to_index
        assert "banana" in label_encoder.class_to_index
        assert len(label_encoder.classes) == 2

    def test_encode_decode(self):
        class_to_index = {"apple": 0, "banana": 1}
        y_encoded = [0, 0, 1]
        y_decoded = ["apple", "apple", "banana"]
        label_encoder = data.LabelEncoder(class_to_index=class_to_index)
        label_encoder.fit(["apple", "apple", "banana"])
        assert np.array_equal(label_encoder.encode(y_decoded), np.array(y_encoded))
        assert label_encoder.decode(y_encoded) == y_decoded

Parametrize

So far, in our tests, we've had to create individual assert statements to validate different combinations of inputs and expected outputs. However, there's a bit of redundancy here because the inputs always feed into our functions as arguments and the outputs are compared with our expected outputs. To remove this redundancy, pytest has the @pytest.mark.parametrize decorator which allows us to represent our inputs and outputs as parameters.

@pytest.mark.parametrize(
    "fruit, crisp",
    [
        ("apple", True),
        ("Apple", True),
        ("orange", False),
    ],
)
def test_is_crisp_parametrize(fruit, crisp):
    assert is_crisp(fruit=fruit) == crisp
python3 -m pytest tests/food/test_is_crisp_parametrize.py ...   [100%]
  1. Define the names of the parameters in the decorator, ex. "fruit, crisp" (note that this is one string).
  2. Provide a list of combinations of values for those parameters.
  3. Pass the parameter names into the test function's signature.
  4. Include the necessary assert statements, which will be executed for each combination in the list.

Similarly, we could pass in an exception as the expected result as well:

@pytest.mark.parametrize(
    "fruit, exception",
    [
        ("pear", ValueError),
    ],
)
def test_is_crisp_exceptions(fruit, exception):
    with pytest.raises(exception):
        is_crisp(fruit=fruit)
Example of parametrize in our project
# tests/code/test_data.py
import pytest

from tagifai import data

@pytest.mark.parametrize(
    "text, lower, stem, stopwords, cleaned_text",
    [
        ("Hello worlds", False, False, [], "Hello worlds"),
        ("Hello worlds", True, False, [], "hello worlds"),
        ...
    ],
)
def test_preprocess(text, lower, stem, stopwords, cleaned_text):
    assert (
        data.clean_text(
            text=text,
            lower=lower,
            stem=stem,
            stopwords=stopwords,
        )
        == cleaned_text
    )

Fixtures

Parametrization allows us to reduce redundancy inside test functions, but what about reducing redundancy across different test functions? For example, suppose that different test functions all need a dataframe as an input. Here, we can use pytest's built-in fixtures: a fixture is a function that is executed before the test function.

@pytest.fixture
def my_fruit():
    fruit = Fruit(name="apple")
    return fruit

def test_fruit(my_fruit):
    assert my_fruit.name == "apple"

Note that the name of the fixture and the argument to the test function are identical (my_fruit).

We can apply fixtures to classes as well, where the fixture will be invoked for every method in the class.

@pytest.mark.usefixtures("my_fruit")
class TestFruit:
    ...
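
One detail worth noting (standard pytest behavior, not specific to our project): usefixtures only guarantees that the fixture runs; it does not pass the fixture's return value into the test. To use the value, request the fixture as an argument instead:

@pytest.mark.usefixtures("my_fruit")
class TestFruitSetup:
    def test_fixture_ran(self):
        ...  # my_fruit executed, but its return value isn't accessible here


class TestFruitValue:
    def test_name(self, my_fruit):  # request the fixture directly to use its value
        assert my_fruit.name == "apple"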
Example of fixtures in our project

In our project, we use fixtures to efficiently pass a set of inputs (ex. Pandas DataFrame) to different testing functions that require them (cleaning, splitting, etc.).

# tests/code/test_data.py
@pytest.fixture(scope="module")
def df():
    data = [
        {"title": "a0", "description": "b0", "tag": "c0"},
        {"title": "a1", "description": "b1", "tag": "c1"},
        {"title": "a2", "description": "b2", "tag": "c1"},
        {"title": "a3", "description": "b3", "tag": "c2"},
        {"title": "a4", "description": "b4", "tag": "c2"},
        {"title": "a5", "description": "b5", "tag": "c2"},
    ]
    df = pd.DataFrame(data * 10)
    return df


@pytest.mark.parametrize(
    "labels, unique_labels",
    [
        ([], ["other"]),  # no set of approved labels
        (["c3"], ["other"]),  # no overlap b/w approved/actual labels
        (["c0"], ["c0", "other"]),  # partial overlap
        (["c0", "c1", "c2"], ["c0", "c1", "c2"]),  # complete overlap
    ],
)
def test_replace_oos_labels(df, labels, unique_labels):
    replaced_df = data.replace_oos_labels(
        df=df.copy(), labels=labels, label_col="tag", oos_label="other"
    )
    assert set(replaced_df.tag.unique()) == set(unique_labels)

Note that we don't use the df fixture directly inside our parametrized test function (we pass in df.copy() instead). If we did, we'd be mutating df after each parametrization, which would affect subsequent tests.

Tip

When creating fixtures around datasets, it's best practice to create a simplified version that still adheres to the same schema. For example, in our dataframe fixture above, we're creating a smaller dataframe that still has the same column names as our actual dataframe. While we could have loaded our actual dataset, it can cause issues as our dataset changes over time (new labels, removed labels, very large datasets, etc.).

Fixtures can have different scopes depending on how we want to use them. For example, our df fixture has the module scope because we don't want to keep recreating it for every test; instead, we want to create it just once for all the tests in our module (tests/code/test_data.py).

  • function: fixture is destroyed after every test. [default]
  • class: fixture is destroyed after the last test in the class.
  • module: fixture is destroyed after the last test in the module (script).
  • package: fixture is destroyed after the last test in the package.
  • session: fixture is destroyed after the last test of the session.

The function scope is the lowest level while the session scope is the highest. Higher-level scoped fixtures are instantiated first.

Typically, when we have many fixtures in a particular test file, we can organize them all in a fixtures.py script and invoke them as needed.
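
As a sketch of how shared fixtures can be organized (using pytest's conftest.py convention, which pytest discovers automatically; this is an alternative illustration, not the lesson's exact setup):

# tests/code/conftest.py (illustrative)
import pandas as pd
import pytest


@pytest.fixture(scope="module")
def df():
    # simplified dataframe that follows the real schema
    data = [
        {"title": "a0", "description": "b0", "tag": "c0"},
        {"title": "a1", "description": "b1", "tag": "c1"},
    ]
    return pd.DataFrame(data)

Any test in the same directory can then request df by name without importing it.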

Markers

We've been able to execute our tests at various levels of granularity (all tests, script, function, etc.), but we can create custom granularity by using markers. We've already used one type of marker (parametrize), but there are several other built-in markers as well. For example, the skipif marker allows us to skip execution of a test if a condition is met. Suppose we only wanted to test training our model if a GPU is available:

import pytest
import torch


@pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="Full training tests require a GPU."
)
def test_training():
    pass

We can also create our own custom markers with the exception of a few reserved marker names.

@pytest.mark.fruits
def test_fruit(my_fruit):
    assert my_fruit.name == "apple"

We can execute them by using the -m flag which requires a (case-sensitive) marker expression like below:

pytest -m "fruits"      #  runs all tests marked with `fruits`
pytest -m "not fruits"  #  runs all tests besides those marked with `fruits`

Tip

The proper way to use markers is to explicitly list the ones we've created in our pyproject.toml file. Here we can specify that all markers must be defined in this file with the --strict-markers flag and then declare our markers (with some info about them) in our markers list:

@pytest.mark.training
def test_train_model():
    assert ...

# Pytest
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"
addopts = "--strict-markers --disable-pytest-warnings"
markers = [
    "training: tests that involve training",
]

Once we do this, we can view all of our defined markers by executing pytest --markers, and we'll receive an error if we try to use a marker that's not defined here.

Coverage

As we're developing tests for our application's components, it's important to know how well we're covering our code base and to know if we've missed anything. We can use the Coverage library to track and visualize how much of our codebase our tests account for. With pytest, it's even easier to use this package thanks to the pytest-cov plugin.

pip install pytest-cov==2.10.1

And we'll add this to our setup.py script:

# setup.py
test_packages = [
    "pytest==7.1.2",
    "pytest-cov==2.10.1"
]

python3 -m pytest --cov tagifai --cov app --cov-report html

Here we're asking for coverage for all the code in our tagifai and app directories and to generate the report in HTML format. When we run this, we'll see the tests from our tests directory executing while the coverage plugin keeps track of which lines in our application are being executed. Once our tests are complete, we can view the generated report (default is htmlcov/index.html) and click on individual files to see which parts were not covered by any tests. This is especially useful when we forget to test for certain conditions, exceptions, etc.

Warning

Though we may have 100% coverage, this does not mean that our application is error free. Coverage only indicates that a piece of code executed in a test, not necessarily that every part of it was tested, let alone thoroughly tested. Therefore, coverage should never be used as a representation of correctness. However, it is very useful to maintain coverage at 100% so we can know when new functionality has yet to be tested. In our CI/CD lesson, we'll see how to use GitHub Actions to make 100% coverage a requirement when pushing to specific branches.

Exclusions

Sometimes it doesn't make sense to write tests to cover every single line in our application yet we still want to account for these lines so we can maintain 100% coverage. We have two levels of purview when applying exclusions:

  1. Excluding lines by adding this comment: # pragma: no cover, <MESSAGE>

    if trial:  # pragma: no cover, optuna pruning
        trial.report(val_loss, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    

  2. Excluding files by specifying them in our pyproject.toml configuration:

# Pytest coverage
[tool.coverage.run]
omit = ["app/gunicorn.py"]

The main point is that we're able to add justification for these exclusions through comments so our team can follow our reasoning.

Now that we have a foundation for testing traditional software, let's dive into testing our data and models in the context of machine learning systems.

🔢  Data

So far, we've used unit and integration tests to test the functions that interact with our data but we haven't tested the validity of the data itself. Once we define what our data should look like, we can use, expand and adapt these expectations as our dataset grows.

Expectations

Follow along with this notebook as we develop expectations for our dataset. We'll organize these expectations in our repository in the projects section.

There are many dimensions to consider for what our data is expected to look like. We'll briefly talk about a few of them, including ones that may not directly be applicable to our task but, nonetheless, are very important to be aware of.

  • rows / cols: the most basic expectation is validating the presence of samples (rows) and features (columns). These can help identify inconsistencies between upstream backend database schema changes, upstream UI form changes, etc.

    Rows/cols

    What are aspects of rows and cols in our dataset that we should test for?

    • presence of specific features
    • row count (exact or range) of samples
  • individual values: we can also have expectations about the individual values of specific features.

    Individual

    What are aspects of individual values that we should test for?

    • missing values
    • type adherence (ex. feature values are all float)
    • values must be unique or from a predefined set
    • list (categorical) / range (continuous) of allowed values
    • feature value relationships with other feature values (ex. column 1 values must always be greater than column 2)
  • aggregate values: we can also set expectations about all the values of specific features.

    Aggregate

    What are aspects of aggregate values that we should test for?

    • value statistics (mean, std, median, max, min, sum, etc.)
    • distribution shift by comparing current values to previous values (useful for detecting drift)

To implement these expectations, we could compose assert statements or we could leverage the open-source library called Great Expectations.

pip install great-expectations==0.15.7

And we'll add this to our setup.py script:

# setup.py
test_packages = [
    "pytest==7.1.2",
    "pytest-cov==2.10.1",
    "great-expectations==0.15.7"
]

It's a library that already has many of these expectations built-in (map, aggregate, multi-column, distributional, etc.) and also allows us to create custom expectations. It also provides modules to seamlessly connect with backend data sources such as local file systems, S3, databases and even DAG runners. Let's explore the library by implementing the expectations we'll need for our application.

Though Great Expectations has all the data validation functionality we need, there are several other production-grade data validation options available as well, such as TFX, AWS Deequ, etc.

First we'll load the data we'd like to apply our expectations on. We can load our data from a variety of sources (filesystem, database, cloud, etc.), which we can then wrap in a Dataset module (Pandas / Spark DataFrame, SQLAlchemy).

import great_expectations as ge
import json
import pandas as pd
from urllib.request import urlopen
# Load projects
url = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/projects.json"
projects = json.loads(urlopen(url).read())
df = ge.dataset.PandasDataset(projects)
print (f"{len(df)} projects")
df.head(5)
   id  created_on           title                                               description                                         tag
0  6   2020-02-20 06:43:18  Comparison between YOLO and RCNN on real world...  Bringing theory to experiment is cool. We can ...  computer-vision
1  7   2020-02-20 06:47:21  Show, Infer & Tell: Contextual Inference for C...  The beauty of the work lies in the way it arch...  computer-vision
2  9   2020-02-24 16:24:45  Awesome Graph Classification                       A collection of important graph embedding, cla...  graph-learning
3  15  2020-02-28 23:55:26  Awesome Monte Carlo Tree Search                     A curated list of Monte Carlo tree search pape...  reinforcement-learning
4  19  2020-03-03 13:54:31  Diffusion to Vector                                 Reference implementation of Diffusion2Vec (Com...  graph-learning

Built-in

Once we have our data source wrapped in a Dataset module, we can compose and apply expectations on it. There are many built-in expectations to choose from:

Table expectations
# columns
df.expect_table_columns_to_match_ordered_list(
    column_list=["id", "created_on", "title", "description", "tag"])

# data leak
df.expect_compound_columns_to_be_unique(column_list=["title", "description"])
Column expectations
# id
df.expect_column_values_to_be_unique(column="id")

# created_on
df.expect_column_values_to_not_be_null(column="created_on")
df.expect_column_values_to_match_strftime_format(
    column="created_on", strftime_format="%Y-%m-%d %H:%M:%S")

# title
df.expect_column_values_to_not_be_null(column="title")
df.expect_column_values_to_be_of_type(column="title", type_="str")

# description
df.expect_column_values_to_not_be_null(column="description")
df.expect_column_values_to_be_of_type(column="description", type_="str")

# tag
df.expect_column_values_to_not_be_null(column="tag")
df.expect_column_values_to_be_of_type(column="tag", type_="str")
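
We can also sketch a few expectations along the value and aggregate dimensions discussed earlier. The bounds below are hypothetical placeholders for illustration, not values derived from our dataset:

# Illustrative value / aggregate expectations (hypothetical bounds)
df.expect_column_value_lengths_to_be_between(column="title", min_value=1, max_value=500)
df.expect_column_unique_value_count_to_be_between(column="tag", min_value=1, max_value=100)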

Each of these expectations will create an output with details about success or failure, expected and observed values, expectations raised, etc. For example, the expectation df.expect_column_values_to_be_of_type(column="title", type_="str") would produce the following if successful:

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {},
  "expectation_config": {
    "kwargs": {
      "column": "title",
      "type_": "str",
      "result_format": "BASIC"
    },
    "meta": {},
    "expectation_type": "_expect_column_values_to_be_of_type__map"
  },
  "result": {
    "element_count": 955,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  }
}

and if we have a failed expectation (ex. df.expect_column_values_to_be_of_type(column="title", type_="int")), we'd receive this output (notice the counts and examples for what caused the failure):

{
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "expectation_config": {
    "meta": {},
    "kwargs": {
      "column": "title",
      "type_": "int",
      "result_format": "BASIC"
    },
    "expectation_type": "_expect_column_values_to_be_of_type__map"
  },
  "result": {
    "element_count": 955,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 955,
    "unexpected_percent": 100.0,
    "unexpected_percent_nonmissing": 100.0,
    "partial_unexpected_list": [
      "How to Deal with Files in Google Colab: What You Need to Know",
      "Machine Learning Methods Explained (+ Examples)",
      "OpenMMLab Computer Vision",
      "...",
    ]
  },
  "meta": {}
}

We can group all the expectations together to create an Expectation Suite object which we can use to validate any Dataset module.

# Expectation suite
expectation_suite = df.get_expectation_suite(discard_failed_expectations=False)
print(df.validate(expectation_suite=expectation_suite, only_return_failures=True))
{
  "success": true,
  "results": [],
  "statistics": {
    "evaluated_expectations": 11,
    "successful_expectations": 11,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "evaluation_parameters": {}
}

We could also create custom expectations for our data.

Projects

So far we've worked with the Great Expectations library at the ad hoc script / notebook level, but we can further organize our expectations by creating a Project.

cd tests
great_expectations init

This will interactively walk us through setting up data sources, naming, etc. and set up a tests/great_expectations directory with the following structure:

tests/great_expectations/
|   ├── checkpoints/
|   ├── expectations/
|   ├── plugins/
|   ├── uncommitted/
|   ├── .gitignore
|   └── great_expectations.yml

Data source

great_expectations datasource new

What data would you like Great Expectations to connect to?
    1. Files on a filesystem (for processing with Pandas or Spark) 👈
    2. Relational database (SQL)

What are you processing your files with?
    1. Pandas 👈
    2. PySpark

Set the data path to ../data (since we're running this inside tests), run the cells in the generated notebook and change the datasource_name to data. After we run the cells, we can close the notebook (and end the process in the terminal with Ctrl + C) and we'll see the Datasource added to great_expectations.yml.

Suites

Create expectations manually, interactively or automatically and save them as suites (a set of expectations for a particular data asset).

great_expectations suite new

How would you like to create your Expectation Suite?
    1. Manually, without interacting with a sample batch of data (default)
    2. Interactively, with a sample batch of data 👈
    3. Automatically, using a profiler

Which data asset (accessible by data connector "default_inferred_data_connector_name") would you like to use?
    1. labeled_projects.json
    2. projects.json 👈
    3. tags.json

This will open up an interactive notebook where we can add expectations. Copy and paste the expectations below and run all the cells. Repeat this step for tags.json.

Expectations for projects.json

Table expectations

# Presence of features
df.expect_table_columns_to_match_ordered_list(
    column_list=["id", "created_on", "title", "description", "tag"])
df.expect_compound_columns_to_be_unique(column_list=["title", "description"])  # data leak

Column expectations:

# id
df.expect_column_values_to_be_unique(column="id")

# created_on
df.expect_column_values_to_not_be_null(column="created_on")
df.expect_column_values_to_match_strftime_format(
    column="created_on", strftime_format="%Y-%m-%d %H:%M:%S")

# title
df.expect_column_values_to_not_be_null(column="title")
df.expect_column_values_to_be_of_type(column="title", type_="str")

# description
df.expect_column_values_to_not_be_null(column="description")
df.expect_column_values_to_be_of_type(column="description", type_="str")

# tag
df.expect_column_values_to_not_be_null(column="tag")
df.expect_column_values_to_be_of_type(column="tag", type_="str")

Expectations for tags.json

Table expectations

validator.expect_table_columns_to_match_ordered_list(column_list=["tag", "aliases"])

Column expectations:

# tag
validator.expect_column_values_to_be_unique(column="tag")
validator.expect_column_values_to_not_be_null(column="tag")
validator.expect_column_values_to_be_of_type(column="tag", type_="str")

# aliases
validator.expect_column_values_to_be_of_type(column="aliases", type_="list")

All of these expectation suites will be saved under great_expectations/expectations:

great_expectations/
|   ├── expectations/
|   |   ├── labeled_projects.json
|   |   ├── projects.json
|   |   └── tags.json

To edit a suite, we can execute the following CLI command:

great_expectations suite edit <SUITE_NAME>

Checkpoints

Create checkpoints where a suite of expectations is applied to a specific data asset. This is a great way of programmatically applying checkpoints to our existing and new data sources.

cd tests
great_expectations checkpoint new CHECKPOINT_NAME

So for our project, it would be:

great_expectations checkpoint new projects
great_expectations checkpoint new tags
great_expectations checkpoint new labeled_projects

Each of these checkpoint creation calls will launch a notebook where we can define which suites to apply this checkpoint to. We have to change the lines for data_asset_name (which data asset to run the checkpoint suite on) and expectation_suite_name (the name of the suite to use). For example, the labeled_projects checkpoint would use the labeled_projects.json data asset but use the projects suite (the same suite as the projects checkpoint).
my_checkpoint_name = "labeled_projects"  # This was populated from your CLI command.

yaml_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
class_name: SimpleCheckpoint
run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
validations:
  - batch_request:
      datasource_name: data
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: labeled_projects.json
      data_connector_query:
        index: -1
    expectation_suite_name: projects
"""
print(yaml_config)

Once we've defined our checkpoints, we're ready to execute them:

great_expectations checkpoint run projects
great_expectations checkpoint run tags
great_expectations checkpoint run labeled_projects

At the end of this lesson, we'll create a target in our Makefile that runs all of these tests (code, data and models), and we'll automate their execution in our pre-commit lesson.

Note

We've applied expectations on our source dataset, but there are many other key areas to test the data as well, such as the intermediate outputs from processes like cleaning, augmentation, splitting, preprocessing, tokenization, etc.

Data docs

When we create expectations using the CLI application, Great Expectations automatically generates documentation for our tests. It also stores information about validation runs and their results. We can launch the generated data documentation with the following command:

great_expectations docs build

Production

By default, Great Expectations stores our expectations, results and metrics locally but for production, we'll want to set up remote metadata stores. This is typically something our data engineering team would help set up as validation should occur prior to downstream applications such as machine learning (though ML teams can certainly help craft more expectations to ensure data validity).

Many of these expectations will be executed when the data is extracted and loaded to different data platforms (ex. data warehouse) and decoupled from any downstream use cases (ex. ML). We can see below how Great Expectations checkpoint validations are used at every step of the way, including before and after applying data transformations (using tools like dbt) inside our data warehouse.

(Figure: ETL pipelines in production with validation checkpoints)

Learn more about different data management systems in our infrastructure lesson if you're not familiar with them.

🤖  Models

The final aspect of testing ML systems involves testing our models during training, evaluation, inference and deployment.

Training

We want to write tests iteratively while we're developing our training pipelines so we can catch errors quickly. This is especially important because, unlike traditional software, ML systems can run to completion without throwing any exceptions / errors but still produce incorrect results. We also want to catch errors quickly to save on time and compute.

  • Check shapes and values of model output

    assert model(inputs).shape == torch.Size([len(inputs), num_classes])

  • Check for decreasing loss after one batch of training

    assert epoch_loss < prev_epoch_loss

  • Overfit on a batch

    accuracy = train(model, inputs=batches[0])
    assert accuracy == pytest.approx(0.95, abs=0.05) # 0.95 ± 0.05

  • Train to completion (tests early stopping, saving, etc.)

    train(model)
    assert learning_rate >= min_learning_rate
    assert artifacts

  • On different devices

    assert train(model, device=torch.device("cpu"))
    assert train(model, device=torch.device("cuda"))

Note

You can mark compute-intensive tests with a pytest marker and only execute them when a change is made that affects the model.

@pytest.mark.training
def test_train_model():
    ...
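
Putting the checks above together, a minimal sketch of such marked training tests might look like this, where model, batches, num_classes and train are hypothetical stand-ins (ex. provided by fixtures) for our actual pipeline objects:

import pytest
import torch


@pytest.mark.training
def test_output_shape(model, batches, num_classes):
    inputs = batches[0]
    # hypothetical model/batches fixtures; checks output shape per sample
    assert model(inputs).shape == torch.Size([len(inputs), num_classes])


@pytest.mark.training
def test_overfit_batch(model, batches):
    accuracy = train(model, inputs=batches[0])  # hypothetical train() from our pipeline
    assert accuracy == pytest.approx(0.95, abs=0.05)  # 0.95 ± 0.05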

Behavioral testing

Behavioral testing is the process of testing input data and expected outputs while treating the model as a black box (model agnostic evaluation). They don't necessarily have to be adversarial in nature but more along the types of perturbations we may expect to see in the real world once our model is deployed. A landmark paper on this topic is Beyond Accuracy: Behavioral Testing of NLP Models with CheckList which breaks down behavioral testing into three types of tests:

  • invariance: Changes should not affect outputs.

    # INVariance via verb injection (changes should not affect outputs)
    tokens = ["revolutionized", "disrupted"]
    texts = [f"Transformers applied to NLP have {token} the ML field." for token in tokens]
    predict.predict(texts=texts, artifacts=artifacts)

    ['natural-language-processing', 'natural-language-processing']

  • directional: Change should affect outputs.

    # DIRectional expectations (changes with known outputs)
    tokens = ["text classification", "image classification"]
    texts = [f"ML applied to {token}." for token in tokens]
    predict.predict(texts=texts, artifacts=artifacts)

    ['natural-language-processing', 'computer-vision']

  • minimum functionality: Simple combination of inputs and expected outputs.

    # Minimum Functionality Tests (simple input/output pairs)
    tokens = ["natural language processing", "mlops"]
    texts = [f"{token} is the next big wave in machine learning." for token in tokens]
    predict.predict(texts=texts, artifacts=artifacts)

    ['natural-language-processing', 'mlops']

Tip

Each of these types of tests can also include adversarial tests such as testing with common biased tokens or noisy tokens, etc.

texts = [
    "CNNs for text classification.",  # CNNs are typically seen in computer-vision projects
    "This should not produce any relevant topics."  # should predict `other` label
]
predict.predict(texts=texts, artifacts=artifacts)
    ['natural-language-processing', 'other']

And we can convert these behavioral tests into systematic parameterized tests:

mkdir tests/model
touch tests/model/test_behavioral.py
# tests/model/test_behavioral.py
from pathlib import Path
import pytest
from config import config
from tagifai import main, predict

@pytest.fixture(scope="module")
def artifacts():
    run_id = open(Path(config.CONFIG_DIR, "run_id.txt")).read()
    artifacts = main.load_artifacts(run_id=run_id)
    return artifacts

@pytest.mark.parametrize(
    "text_a, text_b, tag",
    [
        (
            "Transformers applied to NLP have revolutionized machine learning.",
            "Transformers applied to NLP have disrupted machine learning.",
            "natural-language-processing",
        ),
    ],
)
def test_inv(text_a, text_b, tag, artifacts):
    """INVariance via verb injection (changes should not affect outputs)."""
    tag_a = predict.predict(texts=[text_a], artifacts=artifacts)[0]["predicted_tag"]
    tag_b = predict.predict(texts=[text_b], artifacts=artifacts)[0]["predicted_tag"]
    assert tag_a == tag_b == tag
View tests/model/test_behavioral.py
from pathlib import Path

import pytest

from config import config
from tagifai import main, predict


@pytest.fixture(scope="module")
def artifacts():
    run_id = open(Path(config.CONFIG_DIR, "run_id.txt")).read()
    artifacts = main.load_artifacts(run_id=run_id)
    return artifacts


@pytest.mark.parametrize(
    "text, tag",
    [
        (
            "Transformers applied to NLP have revolutionized machine learning.",
            "natural-language-processing",
        ),
        (
            "Transformers applied to NLP have disrupted machine learning.",
            "natural-language-processing",
        ),
    ],
)
def test_inv(text, tag, artifacts):
    """INVariance via verb injection (changes should not affect outputs)."""
    predicted_tag = predict.predict(texts=[text], artifacts=artifacts)[0]["predicted_tag"]
    assert tag == predicted_tag


@pytest.mark.parametrize(
    "text, tag",
    [
        (
            "ML applied to text classification.",
            "natural-language-processing",
        ),
        (
            "ML applied to image classification.",
            "computer-vision",
        ),
        (
            "CNNs for text classification.",
            "natural-language-processing",
        )
    ],
)
def test_dir(text, tag, artifacts):
    """DIRectional expectations (changes with known outputs)."""
    predicted_tag = predict.predict(texts=[text], artifacts=artifacts)[0]["predicted_tag"]
    assert tag == predicted_tag


@pytest.mark.parametrize(
    "text, tag",
    [
        (
            "Natural language processing is the next big wave in machine learning.",
            "natural-language-processing",
        ),
        (
            "MLOps is the next big wave in machine learning.",
            "mlops",
        ),
        (
            "This should not produce any relevant topics.",
            "other",
        ),
    ],
)
def test_mft(text, tag, artifacts):
    """Minimum Functionality Tests (simple input/output pairs)."""
    predicted_tag = predict.predict(texts=[text], artifacts=artifacts)[0]["predicted_tag"]
    assert tag == predicted_tag

Inference

When our model is deployed, most users will be using it for inference (directly / indirectly), so it's very important that we test all aspects of it.

Loading artifacts

This is the first time we're not loading our components from in-memory so we want to ensure that the required artifacts (model weights, encoders, config, etc.) are all able to be loaded.

artifacts = main.load_artifacts(run_id=run_id)
assert isinstance(artifacts["label_encoder"], data.LabelEncoder)
...

Prediction

Once we have our artifacts loaded, we're ready to test our prediction pipelines. We should test samples with just one input, as well as a batch of inputs (ex. padding can sometimes have unintended consequences).

# test our API call directly
data = {
    "texts": [
        {"text": "Transfer learning with transformers for text classification."},
        {"text": "Generative adversarial networks in both PyTorch and TensorFlow."},
    ]
}
response = client.post("/predict", json=data)
assert response.status_code == HTTPStatus.OK
assert response.request.method == "POST"
assert len(response.json()["data"]["predictions"]) == len(data["texts"])
...
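
We can also exercise the prediction function directly with a single input and a batch. This is a sketch that assumes the artifacts fixture and the predict.predict interface used elsewhere in this lesson:

def test_predict_single_and_batch(artifacts):
    single = predict.predict(texts=["Transfer learning with transformers."], artifacts=artifacts)
    batch = predict.predict(
        texts=[
            "Transfer learning with transformers.",
            "Generative adversarial networks in PyTorch.",
        ],
        artifacts=artifacts,
    )
    assert len(single) == 1
    assert len(batch) == 2
    assert "predicted_tag" in single[0]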

Makefile

Let's create a target in our Makefile that will allow us to execute all of our tests with one call:

# Test
.PHONY: test
test:
    pytest -m "not training"
    cd tests && great_expectations checkpoint run projects
    cd tests && great_expectations checkpoint run tags
    cd tests && great_expectations checkpoint run labeled_projects

make test

Testing vs. monitoring

We'll conclude by talking about the similarities and distinctions between testing and monitoring. They're both integral parts of the ML development pipeline and depend on each other for iteration. Testing ensures that our system (code, data and models) passes the expectations that we've established offline, whereas monitoring ensures that these expectations continue to hold online on live production data and that the data distributions remain comparable to the reference window (typically a subset of the training data) over time. When these conditions no longer hold, we need to inspect more closely (retraining may not always fix the root problem).

With monitoring, there are quite a few distinct concerns that we didn't have to consider during testing since it involves (live) data we have yet to see.

  • features and prediction distributions (drift), typing, schema mismatches, etc.
  • determining model performance (rolling and window metrics on overall and slices of data) using indirect signals (since labels may not be readily available).
  • in situations with large data, we need to know which data points to label and upsample for training.
  • identifying anomalies and outliers.

We'll cover all of these concepts in much more depth (and code) in our monitoring lesson.



To cite this lesson, please use:

@article{madewithml,
    author       = {Goku Mohandas},
    title        = { Code - Made With ML },
    howpublished = {\url{https://madewithml.com/}},
    year         = {2021}
}