Organizing Machine Learning Code
Intuition
To have organized code is to have readable, reproducible, robust code. Your team, manager and most importantly, your future self, will thank you for putting in the initial effort towards organizing your work. In this lesson, we'll discuss how to migrate and organize code from our notebook to Python scripts.
Editor
Before we can start coding, we need a space to do it. There are several options for code editors, such as VSCode, Atom, Sublime, PyCharm, Vim, etc. and they all offer unique features while providing the basic operations for code editing and execution. We will be using VSCode to edit and execute our code thanks to its simplicity, multi-language support, add-ons and growing industry adoption.
You are welcome to use any editor but we will be using some add-ons that may be specific to VSCode.
- Install VSCode for your system from https://code.visualstudio.com/
- Open the Command Palette (F1 or Cmd+Shift+P on Mac) → type in "Preferences: Open Settings (UI)" → hit Enter.
- Adjust any relevant settings you want to (spacing, font-size, etc.).
- Install VSCode extensions (use the lego blocks icon on the editor's left panel).
Recommended VSCode extensions
I recommend installing these extensions, which you can do by copying and pasting these commands:
code --install-extension 74th.monokai-charcoal-high-contrast
code --install-extension alefragnani.project-manager
code --install-extension bierner.markdown-preview-github-styles
code --install-extension bradgashler.htmltagwrap
code --install-extension christian-kohler.path-intellisense
code --install-extension euskadi31.json-pretty-printer
code --install-extension formulahendry.auto-close-tag
code --install-extension formulahendry.auto-rename-tag
code --install-extension kamikillerto.vscode-colorize
code --install-extension mechatroner.rainbow-csv
code --install-extension mikestead.dotenv
code --install-extension mohsen1.prettify-json
code --install-extension ms-azuretools.vscode-docker
code --install-extension ms-python.python
code --install-extension ms-python.vscode-pylance
code --install-extension ms-vscode.sublime-keybindings
code --install-extension njpwerner.autodocstring
code --install-extension PKief.material-icon-theme
code --install-extension redhat.vscode-yaml
code --install-extension ritwickdey.live-sass
code --install-extension ritwickdey.LiveServer
code --install-extension shardulm94.trailing-spaces
code --install-extension streetsidesoftware.code-spell-checker
code --install-extension zhuangtongfa.material-theme
If you add your own extensions and want to share them with others, just run this command to generate the list of install commands:
code --list-extensions | xargs -L 1 echo code --install-extension
Once we're all set up with VSCode, we can start by creating our project directory, which we'll use to organize all our scripts. There are many ways to start a project, but here's our recommended path:
- Use the terminal to create a directory (mkdir <PROJECT_NAME>).
- Change into the project directory you just made (cd <PROJECT_NAME>).
- Start VSCode from this directory by typing code .
  To open VSCode directly from the terminal with the code <PATH> command, open the Command Palette (F1 or Cmd+Shift+P on Mac) → type "Shell Command: Install 'code' command in PATH" → hit Enter → restart the terminal.
- Open a terminal within VSCode (View > Terminal) to continue creating scripts (touch <FILE_NAME>) or additional subdirectories (mkdir <SUBDIR>) as needed.

Setup
README
We'll start our organization with a README.md
file, which will provide information on the files in our directory, instructions to execute operations, etc. We'll constantly keep this file updated so that we can catalogue information for the future.
touch README.md
Let's start by adding the instructions we used for creating a virtual environment:
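A minimal sketch of those setup instructions, assuming the venv-based workflow used elsewhere in the course:

```bash
# Inside README.md: virtual environment setup (sketch)
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
```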
If you press the Preview button at the top right of the editor, you can see what the README.md will look like when we push it to our remote Git host.

Configurations
Next, we'll create a configuration directory called config where we can store components that will be required for our application. Inside this directory, we'll create a config.py and an args.json.
mkdir config
touch config/config.py config/args.json
config/
├── args.json - arguments
└── config.py - configuration setup
Inside config.py
, we'll add the code to define key directory locations (we'll add more configurations in later lessons as they're needed):
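A minimal sketch of what this could look like, using pathlib to anchor everything to the project root:

```python
# config/config.py
from pathlib import Path

# Key directory locations
BASE_DIR = Path(__file__).parent.parent.absolute()  # project root
CONFIG_DIR = Path(BASE_DIR, "config")                # configuration files
```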
and inside args.json
, we'll add the parameters that are relevant to data processing and model training.
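As a rough sketch, the arguments include preprocessing flags and the model hyperparameters we tune later in this lesson; the specific keys and values below are illustrative, not the exact file:

```json
{
    "shuffle": true,
    "subset": null,
    "min_freq": 75,
    "lower": true,
    "stem": false,
    "analyzer": "char_wb",
    "ngram_max_range": 6,
    "alpha": 0.0001,
    "learning_rate": 0.1,
    "power_t": 0.1
}
```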
Operations
We'll start by creating our package directory (tagifai
) inside our project directory (mlops
). Inside this package directory, we will create a main.py
file that will define the core operations we want to be able to execute.
mkdir tagifai
touch tagifai/main.py
tagifai/
└── main.py - training/optimization pipelines
We'll define these core operations inside main.py
as we move code from notebooks to the appropriate scripts below:
- elt_data: extract, load and transform data.
- optimize: tune hyperparameters to optimize for objective.
- train_model: train a model using best parameters from optimization study.
- load_artifacts: load trained artifacts from a given run.
- predict_tag: predict a tag for a given input.
Utilities
Before we start moving code from our notebook, we should be intentional about how we move functionality over to scripts. It's common to have ad hoc processes inside notebooks because the notebook maintains state for as long as it's running. For example, we may set seeds in our notebooks like so:
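Something along these lines (a sketch):

```python
import random

import numpy as np

np.random.seed(1234)
random.seed(1234)
```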
But in our scripts, we should wrap this functionality as a clean, reusable function with the appropriate parameters:
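For example, a sketch of the wrapped version:

```python
def set_seeds(seed=42):
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)
```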
We can store all of these inside a utils.py
file inside our tagifai
package directory.
touch tagifai/utils.py
tagifai/
├── main.py - training/optimization pipelines
└── utils.py - supplementary utilities
View utils.py
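A minimal sketch of utils.py with a few helpers we'll lean on later (loading/saving JSON dictionaries and setting seeds); the exact helper names are assumptions:

```python
# tagifai/utils.py
import json
import random

import numpy as np


def load_dict(filepath):
    """Load a dictionary from a JSON file."""
    with open(filepath) as fp:
        return json.load(fp)


def save_dict(d, filepath, cls=None, sortkeys=False):
    """Save a dictionary to a JSON file."""
    with open(filepath, "w") as fp:
        json.dump(d, indent=2, fp=fp, cls=cls, sort_keys=sortkeys)


def set_seeds(seed=42):
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)
```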
Don't worry about formatting our scripts just yet. We'll be automating all of it in our styling lesson.
Project
When it comes to migrating our code from notebooks to scripts, it's best to organize based on utility. For example, we can create scripts for the various stages of ML development, such as data processing, training, evaluation, prediction, etc. We'll create the different Python files to wrap our data and ML functionality:
cd tagifai
touch data.py train.py evaluate.py predict.py
tagifai/
├── data.py - data processing utilities
├── evaluate.py - evaluation components
├── main.py - training/optimization pipelines
├── predict.py - inference utilities
├── train.py - training utilities
└── utils.py - supplementary utilities
We may have additional scripts in other projects, as necessary. For example, we'd typically have a models.py script where we define explicit model architectures in PyTorch, TensorFlow, etc.
Organizing our code base this way also makes it easier for us to understand (or modify) the code base. We could've placed all the code into one main.py
script but as our project grows, it will be hard to navigate one monolithic file. On the other hand, we could've assumed a more granular stance by breaking down data.py
into split.py
, preprocess.py
, etc. This might make more sense if we have multiple ways of splitting, preprocessing, etc. (ex. a library for ML operations) but for our task, it's sufficient to be at this higher level of organization.
Principles
Through the migration process below, we'll be using several core software engineering principles repeatedly.
Wrapping functionality into functions
How do we decide when specific lines of code should be wrapped as a separate function? Functions should be atomic in that they each have a single responsibility so that we can easily test them. If not, we'll need to split them into more granular units. For example, we could replace tags in our projects with these lines:
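A sketch of both versions, the inline lines and the wrapped function (assuming a pandas DataFrame df with a tag column and an ACCEPTED_TAGS list in our config; both names are illustrative):

```python
# Inline: replace out-of-scope tags with "other"
oos_tags = [item for item in df.tag.unique() if item not in ACCEPTED_TAGS]
df.tag = df.tag.apply(lambda x: "other" if x in oos_tags else x)


# Wrapped as a function with a single responsibility
def replace_oos_tags(df, tags, oos_tag="other"):
    """Replace out-of-scope tags with an `other` tag."""
    oos_tags = [item for item in df.tag.unique() if item not in tags]
    df.tag = df.tag.apply(lambda x: oos_tag if x in oos_tags else x)
    return df
```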
It's better to wrap them as a separate function because we may want to:
- repeat this functionality in other parts of the project or in other projects.
- test that these tags are actually being replaced properly.
Composing generalized functions
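The idea is to contrast a version hard-coded to our specific column and label names with a generalized one. A sketch, reusing the tag-replacement example from above:

```python
# Specific: hard-coded to the `tag` column and the "other" label
def replace_oos_tags(df):
    oos_tags = [item for item in df.tag.unique() if item not in ACCEPTED_TAGS]
    df.tag = df.tag.apply(lambda x: "other" if x in oos_tags else x)
    return df


# Generalized: column name, accepted labels and replacement label are parameters
def replace_oos_labels(df, labels, label_col, oos_label="other"):
    oos_labels = [item for item in df[label_col].unique() if item not in labels]
    df[label_col] = df[label_col].apply(lambda x: oos_label if x in oos_labels else x)
    return df
```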
This way when the names of columns change or we want to replace with different labels, it's very easy to adjust our code. This also includes using generalized names in the functions such as label
instead of the name of the specific label column (ex. tag
). It also allows others to reuse this functionality for their use cases.
However, it's important not to force generalization if it involves a lot of effort. We can spend time later if we see the similar functionality reoccurring.
🔢 Data
Load
Load and save data
First, we'll name and create the directory to save our data assets to (raw data, labeled data, etc.):
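A sketch of that setup, extending the config.py sketch from earlier and creating the directory at import time so downstream code can rely on it existing:

```python
# config/config.py
from pathlib import Path

BASE_DIR = Path(__file__).parent.parent.absolute()
CONFIG_DIR = Path(BASE_DIR, "config")
DATA_DIR = Path(BASE_DIR, "data")

# Create dirs
DATA_DIR.mkdir(parents=True, exist_ok=True)
```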
Next, we'll add the location of our raw data assets to our config.py
. It's important that we store this information in our central configuration file so we can easily discover and update it if needed, as opposed to being deeply buried inside the code somewhere.
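A sketch, with placeholder URLs (the real project points at hosted CSV files for the projects and tags):

```python
# config/config.py
# Data assets (the URLs below are placeholders for the hosted CSV files)
PROJECTS_URL = "https://raw.githubusercontent.com/<USER>/<REPO>/main/datasets/projects.csv"
TAGS_URL = "https://raw.githubusercontent.com/<USER>/<REPO>/main/datasets/tags.csv"
```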
Since this is a main operation, we'll define it in main.py
:
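A minimal sketch of the elt_data() operation, which extracts the CSVs from their URLs and saves them locally; the logging and transformation details are simplified here:

```python
# tagifai/main.py (sketch)
from pathlib import Path

import pandas as pd

from config import config


def elt_data():
    """Extract, load and transform our data assets."""
    projects = pd.read_csv(config.PROJECTS_URL)
    tags = pd.read_csv(config.TAGS_URL)
    projects.to_csv(Path(config.DATA_DIR, "projects.csv"), index=False)
    tags.to_csv(Path(config.DATA_DIR, "tags.csv"), index=False)
    print("✅ Saved data!")
```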
Before we can use this operation, we need to make sure we have the necessary packages installed in our environment. Modules such as pathlib, json, etc. are part of Python's standard library, but packages like NumPy are not. Let's install the required packages and add them to our requirements.txt file.
pip install numpy==1.19.5 pandas==1.3.5 pretty-errors==1.2.19
# Add to requirements.txt
numpy==1.19.5
pandas==1.3.5
pretty-errors==1.2.19
We can fetch the exact version of the packages we used in our notebook by running
pip freeze
in a code cell.
Though we're not using the NumPy package for this elt_data() operation, our Python interpreter will still require it because we invoke the utils.py script with the line from tagifai import utils, which imports NumPy at the top of the file. So if we don't install the package in our virtual environment, we'll receive an error.
We'll run the operation using the Python interpreter via the terminal (type python in the terminal and type the commands below).
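For example, assuming the elt_data() sketch above:

```python
from tagifai import main
main.elt_data()
```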
We could also call this operation directly through the main.py
script but we'll have to change it every time we want to run a new operation.
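For example, with a small entry point at the bottom of main.py (sketch):

```python
# tagifai/main.py
if __name__ == "__main__":
    elt_data()
```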
python tagifai/main.py
We'll learn about a much easier way to execute these operations in our CLI lesson. But for now, either of the methods above will produce the same result.
✅ Saved data!
We should also see the data assets saved to our data
directory:
data/
├── projects.csv
└── tags.csv
Why save the raw data?
Why do we need to save our raw data? Why not just load it from the URL and save the downstream assets (labels, features, etc.)?
Show answer
We'll be using the raw data to generate labeled data and other downstream assets (ex. features). If the source of our raw data changes, then we'll no longer be able to produce our downstream assets. By saving it locally, we can always reproduce our results without any external dependencies. We'll also be executing data validation checks on the raw data before applying transformations on it.
However, as our dataset grows, it may not scale to save the raw data or even labels or features. We'll talk about more scalable alternatives in our versioning lesson where we aren't saving the physical data but the instructions to retrieve them from a specific point in time.
Preprocess
Preprocess features
Next, we're going to define the functions for preprocessing our input features. We'll be using these functions when we are preparing the data prior to training our model. We won't be saving the preprocessed data to a file because different experiments may preprocess the data differently.
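A sketch of preprocess(), assuming the dataset has title and description columns, an ACCEPTED_TAGS list in config.py, and the label-replacement helpers defined further below:

```python
# tagifai/data.py (sketch)
from config import config


def preprocess(df, lower, stem, min_freq):
    """Preprocess the data."""
    df["text"] = df.title + " " + df.description  # feature engineering
    df.text = df.text.apply(clean_text, lower=lower, stem=stem)  # clean text
    df = replace_oos_labels(df=df, labels=config.ACCEPTED_TAGS, label_col="tag", oos_label="other")
    df = replace_minority_labels(df=df, label_col="tag", min_freq=min_freq, new_label="other")
    return df
```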
This function uses the clean_text()
function which we can define right above it:
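A condensed sketch of clean_text() (the full version in the repository handles a few more cases):

```python
import re

from nltk.stem import PorterStemmer

from config import config

stemmer = PorterStemmer()


def clean_text(text, lower=True, stem=False, stopwords=config.STOPWORDS):
    """Clean raw text."""
    if lower:
        text = text.lower()

    # Remove stopwords
    pattern = re.compile(r"\b(" + r"|".join(stopwords) + r")\b\s*")
    text = pattern.sub("", text)

    # Keep only alphanumeric characters and collapse extra spaces
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    text = re.sub(" +", " ", text).strip()

    # Stemming
    if stem:
        text = " ".join([stemmer.stem(word) for word in text.split(" ")])

    return text
```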
Install required packages and add to requirements.txt
:
pip install nltk==3.7
# Add to requirements.txt
nltk==3.7
Notice that we're using an explicit set of stopwords instead of NLTK's default list:
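i.e. instead of relying on something like this:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
STOPWORDS = stopwords.words("english")
```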
This is because we want to have full visibility into exactly what words we're filtering. The general list may have some valuable terms we may wish to keep and vice versa.
# config/config.py
STOPWORDS = [
"i",
"me",
"my",
...
"won't",
"wouldn",
"wouldn't",
]
Next, we need to define the two functions we're calling from data.py
:
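Sketches of both: the generalized replace_oos_labels() from the principles section above, plus a helper that folds infrequent labels into the other class (exact implementations may differ):

```python
def replace_oos_labels(df, labels, label_col, oos_label="other"):
    """Replace out-of-scope (oos) labels."""
    oos_labels = [item for item in df[label_col].unique() if item not in labels]
    df[label_col] = df[label_col].apply(lambda x: oos_label if x in oos_labels else x)
    return df


def replace_minority_labels(df, label_col, min_freq, new_label="other"):
    """Replace labels that appear fewer than `min_freq` times."""
    counts = df[label_col].value_counts()
    df[label_col] = df[label_col].apply(lambda label: label if counts[label] >= min_freq else new_label)
    return df
```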
Encode
Encode labels
Now let's define the encoder for our labels, which we'll use prior to splitting our dataset:
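An abridged sketch of a custom LabelEncoder, with save/load methods so we can persist it as an artifact later:

```python
import json

import numpy as np


class LabelEncoder:
    """Encode labels into unique indices."""

    def __init__(self, class_to_index=None):
        self.class_to_index = class_to_index or {}
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())

    def __len__(self):
        return len(self.class_to_index)

    def fit(self, y):
        for i, class_ in enumerate(np.unique(y)):
            self.class_to_index[class_] = i
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())
        return self

    def encode(self, y):
        return np.array([self.class_to_index[class_] for class_ in y])

    def decode(self, y):
        return [self.index_to_class[index] for index in y]

    def save(self, fp):
        with open(fp, "w") as f:
            json.dump({"class_to_index": self.class_to_index}, f, indent=4)

    @classmethod
    def load(cls, fp):
        with open(fp) as f:
            kwargs = json.load(f)
        return cls(**kwargs)
```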
Split
Split dataset
And finally, we'll conclude our data operations with our split function:
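A sketch of get_data_splits(), producing stratified train/validation/test splits:

```python
from sklearn.model_selection import train_test_split


def get_data_splits(X, y, train_size=0.7):
    """Generate balanced data splits."""
    X_train, X_, y_train, y_ = train_test_split(X, y, train_size=train_size, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(X_, y_, train_size=0.5, stratify=y_)
    return X_train, X_val, X_test, y_train, y_val, y_test
```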
Install required packages and add to requirements.txt
:
pip install scikit-learn==0.24.2
# Add to requirements.txt
scikit-learn==0.24.2
📈 Modeling
Train
Train w/ default args
Now we're ready to kick off the training process. We'll start by defining the operation in our main.py
:
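A sketch of the (pre-experiment-tracking) train_model() operation; the labeled CSV filename and the keys in the returned artifacts dictionary are assumptions:

```python
# tagifai/main.py (sketch)
import json
from argparse import Namespace
from pathlib import Path

import pandas as pd

from config import config
from tagifai import train, utils


def train_model(args_fp="config/args.json"):
    """Train a model given arguments."""
    df = pd.read_csv(Path(config.DATA_DIR, "labeled_projects.csv"))  # placeholder filename
    args = Namespace(**utils.load_dict(filepath=args_fp))
    artifacts = train.train(df=df, args=args)
    performance = artifacts["performance"]
    print(json.dumps(performance, indent=2))
    return artifacts
```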
We'll be adding more to our train_model() operation when we factor in experiment tracking but, for now, it's quite simple. This function calls a train() function inside our train.py script:
(The full train() implementation is fairly long, so it isn't reproduced here; see tagifai/train.py in the repository.)
This train()
function calls two external functions (predict.custom_predict()
from predict.py
and evaluate.get_metrics()
from evaluate.py
):
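A sketch of custom_predict(), which only commits to a class when the model is confident enough and otherwise falls back to a default index (ex. the other class):

```python
# tagifai/predict.py (sketch)
import numpy as np


def custom_predict(y_prob, threshold, index):
    """Predict the argmax class unless its probability is below the
    threshold, in which case default to a specific index (ex. `other`)."""
    y_pred = [np.argmax(p) if max(p) > threshold else index for p in y_prob]
    return np.array(y_pred)
```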
(get_metrics() computes overall, per-class and slice-level metrics, as reflected in the output below; see tagifai/evaluate.py in the repository for the full implementation.)
Install required packages and add to requirements.txt
:
pip install imbalanced-learn==0.8.1 snorkel==0.9.8
# Add to requirements.txt
imbalanced-learn==0.8.1
snorkel==0.9.8
Commands to train a model:
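Assuming the sketches above, something like this from the Python interpreter:

```python
from pathlib import Path

from config import config
from tagifai import main

args_fp = Path(config.CONFIG_DIR, "args.json")
main.train_model(args_fp=args_fp)
```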
Epoch: 00 | train_loss: 1.16783, val_loss: 1.20177 Epoch: 10 | train_loss: 0.46262, val_loss: 0.62612 Epoch: 20 | train_loss: 0.31599, val_loss: 0.51986 Epoch: 30 | train_loss: 0.25191, val_loss: 0.47544 Epoch: 40 | train_loss: 0.21720, val_loss: 0.45176 Epoch: 50 | train_loss: 0.19610, val_loss: 0.43770 Epoch: 60 | train_loss: 0.18221, val_loss: 0.42857 Epoch: 70 | train_loss: 0.17291, val_loss: 0.42246 Epoch: 80 | train_loss: 0.16643, val_loss: 0.41818 Epoch: 90 | train_loss: 0.16160, val_loss: 0.41528 { "overall": { "precision": 0.8990934378802025, "recall": 0.8194444444444444, "f1": 0.838280325954406, "num_samples": 144.0 }, "class": { "computer-vision": { "precision": 0.975, "recall": 0.7222222222222222, "f1": 0.8297872340425532, "num_samples": 54.0 }, "mlops": { "precision": 0.9090909090909091, "recall": 0.8333333333333334, "f1": 0.8695652173913043, "num_samples": 12.0 }, "natural-language-processing": { "precision": 0.9803921568627451, "recall": 0.8620689655172413, "f1": 0.9174311926605505, "num_samples": 58.0 }, "other": { "precision": 0.4523809523809524, "recall": 0.95, "f1": 0.6129032258064516, "num_samples": 20.0 } }, "slices": { "nlp_cnn": { "precision": 1.0, "recall": 1.0, "f1": 1.0, "num_samples": 1 }, "short_text": { "precision": 0.8, "recall": 0.8, "f1": 0.8000000000000002, "num_samples": 5 } } }
Optimize
Optimize args
Now that we can train one model, we're ready to train many models to optimize our hyperparameters:
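An abridged sketch of the optimize() operation, using Optuna with the MLflowCallback; the signature, pruner settings and filename below are assumptions:

```python
# tagifai/main.py (sketch; uses the imports from the earlier main.py sketches)
import mlflow
import optuna
from optuna.integration.mlflow import MLflowCallback


def optimize(args_fp="config/args.json", study_name="optimization", num_trials=20):
    """Optimize hyperparameters."""
    args = Namespace(**utils.load_dict(filepath=args_fp))
    df = pd.read_csv(Path(config.DATA_DIR, "labeled_projects.csv"))  # placeholder filename

    pruner = optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=5)
    study = optuna.create_study(study_name=study_name, direction="maximize", pruner=pruner)
    mlflow_callback = MLflowCallback(tracking_uri=mlflow.get_tracking_uri(), metric_name="f1")
    study.optimize(
        lambda trial: train.objective(args, df, trial),
        n_trials=num_trials,
        callbacks=[mlflow_callback],
    )

    # Save the best parameters back to args.json
    utils.save_dict({**vars(args), **study.best_trial.params}, args_fp)
    print(f"Best value (f1): {study.best_trial.value}")
```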
We'll define the objective()
function inside train.py
:
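A sketch of objective(), which samples hyperparameters for a trial, trains, and returns the metric to optimize (the search ranges are illustrative):

```python
# tagifai/train.py (sketch)
def objective(args, df, trial):
    """Objective function for optimization trials."""
    # Parameters to tune
    args.analyzer = trial.suggest_categorical("analyzer", ["word", "char", "char_wb"])
    args.ngram_max_range = trial.suggest_int("ngram_max_range", 3, 10)
    args.learning_rate = trial.suggest_loguniform("learning_rate", 1e-2, 1e0)
    args.power_t = trial.suggest_uniform("power_t", 0.1, 0.5)

    # Train (passing the trial so intermediate results can be pruned)
    artifacts = train(args=args, df=df, trial=trial)

    # Record additional attributes and return the metric to maximize
    overall = artifacts["performance"]["overall"]
    trial.set_user_attr("precision", overall["precision"])
    trial.set_user_attr("recall", overall["recall"])
    trial.set_user_attr("f1", overall["f1"])
    return overall["f1"]
```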
Recall that in our notebook, we modified the train()
function to include information about trials during optimization for pruning:
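The key addition is inside the epoch loop, where trial is an optional parameter to train(); a sketch:

```python
# Inside train()'s epoch loop (sketch): report intermediate values and prune
if trial:  # only during hyperparameter optimization
    trial.report(val_loss, step=epoch)
    if trial.should_prune():
        raise optuna.TrialPruned()
```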
Since we're using the MLflowCallback
here with Optuna, we can either allow all our experiments to be stored under the default mlruns
directory that MLflow will create or we can configure that location:
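A sketch of configuring that location in config.py (the stores/model directory matches the model registry paths shown below):

```python
# config/config.py (extends the earlier config.py sketch)
import mlflow

STORES_DIR = Path(BASE_DIR, "stores")
MODEL_REGISTRY = Path(STORES_DIR, "model")
MODEL_REGISTRY.mkdir(parents=True, exist_ok=True)
mlflow.set_tracking_uri("file://" + str(MODEL_REGISTRY.absolute()))
```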
Install required packages and add to requirements.txt
:
pip install mlflow==1.23.1 optuna==2.10.0 numpyencoder==0.3.0
# Add to requirements.txt
mlflow==1.23.1
numpyencoder==0.3.0
optuna==2.10.0
Commands to optimize hyperparameters:
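Analogous to the training commands, assuming the optimize() signature sketched above:

```python
from pathlib import Path

from config import config
from tagifai import main

args_fp = Path(config.CONFIG_DIR, "args.json")
main.optimize(args_fp=args_fp, study_name="optimization", num_trials=20)
```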
A new study created in memory with name: optimization
...
Best value (f1): 0.8497010532479641
Best hyperparameters: {
    "analyzer": "char_wb",
    "ngram_max_range": 6,
    "learning_rate": 0.8616849162496086,
    "power_t": 0.21283622300887173
}
We should see our experiment in our model registry, located at stores/model/
:
stores/model/
└── 0/
Experiment tracking
Experiment tracking
Now that we have our optimized hyperparameters, we can train a model and store its artifacts via experiment tracking. We'll start by modifying the train_model() operation in our main.py script:
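An abridged sketch of the updated operation, wrapping training in an MLflow run and logging metrics, parameters and artifacts; the experiment/run names, artifact filenames and artifacts dictionary keys are assumptions:

```python
# tagifai/main.py (sketch; uses the imports from the earlier main.py sketches)
import tempfile

import joblib
import mlflow


def train_model(args_fp="config/args.json", experiment_name="baselines", run_name="sgd"):
    """Train a model given arguments."""
    df = pd.read_csv(Path(config.DATA_DIR, "labeled_projects.csv"))  # placeholder filename
    args = Namespace(**utils.load_dict(filepath=args_fp))

    mlflow.set_experiment(experiment_name=experiment_name)
    with mlflow.start_run(run_name=run_name):
        run_id = mlflow.active_run().info.run_id
        print(f"Run ID: {run_id}")
        artifacts = train.train(df=df, args=args)
        performance = artifacts["performance"]

        # Log metrics, parameters and artifacts
        mlflow.log_metrics({"f1": performance["overall"]["f1"]})
        mlflow.log_params(vars(artifacts["args"]))
        with tempfile.TemporaryDirectory() as dp:
            joblib.dump(artifacts["vectorizer"], Path(dp, "vectorizer.pkl"))
            joblib.dump(artifacts["model"], Path(dp, "model.pkl"))
            artifacts["label_encoder"].save(Path(dp, "label_encoder.json"))
            utils.save_dict(performance, Path(dp, "performance.json"))
            mlflow.log_artifacts(dp)

    # Save run metadata to the config directory
    with open(Path(config.CONFIG_DIR, "run_id.txt"), "w") as fp:
        fp.write(run_id)
    utils.save_dict(performance, Path(config.CONFIG_DIR, "performance.json"))
```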
There's a lot more happening inside our train_model()
function but it's necessary in order to store all the metrics, parameters and artifacts. We're also going to update the train()
function inside train.py
so that the intermediate metrics are captured:
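i.e. inside the epoch loop, something like:

```python
# Inside train()'s epoch loop (sketch): log intermediate metrics to MLflow
if not trial:  # only log when training directly (not during tuning trials)
    mlflow.log_metrics({"train_loss": train_loss, "val_loss": val_loss}, step=epoch)
```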
Commands to train a model with experiment tracking:
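Same pattern as before, assuming the train_model() signature sketched above:

```python
from pathlib import Path

from config import config
from tagifai import main

args_fp = Path(config.CONFIG_DIR, "args.json")
main.train_model(args_fp=args_fp, experiment_name="baselines", run_name="sgd")
```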
Run ID: d91d9760b2e14a5fbbae9f3762f0afaf Epoch: 00 | train_loss: 0.74266, val_loss: 0.83335 Epoch: 10 | train_loss: 0.21884, val_loss: 0.42853 Epoch: 20 | train_loss: 0.16632, val_loss: 0.39420 Epoch: 30 | train_loss: 0.15108, val_loss: 0.38396 Epoch: 40 | train_loss: 0.14589, val_loss: 0.38089 Epoch: 50 | train_loss: 0.14358, val_loss: 0.37992 Epoch: 60 | train_loss: 0.14084, val_loss: 0.37977 Epoch: 70 | train_loss: 0.14025, val_loss: 0.37828 Epoch: 80 | train_loss: 0.13983, val_loss: 0.37699 Epoch: 90 | train_loss: 0.13841, val_loss: 0.37772 { "overall": { "precision": 0.9026155077984347, "recall": 0.8333333333333334, "f1": 0.8497010532479641, "num_samples": 144.0 }, "class": { "computer-vision": { "precision": 0.975609756097561, "recall": 0.7407407407407407, "f1": 0.8421052631578947, "num_samples": 54.0 }, "mlops": { "precision": 0.9090909090909091, "recall": 0.8333333333333334, "f1": 0.8695652173913043, "num_samples": 12.0 }, "natural-language-processing": { "precision": 0.9807692307692307, "recall": 0.8793103448275862, "f1": 0.9272727272727272, "num_samples": 58.0 }, "other": { "precision": 0.475, "recall": 0.95, "f1": 0.6333333333333334, "num_samples": 20.0 } }, "slices": { "nlp_cnn": { "precision": 1.0, "recall": 1.0, "f1": 1.0, "num_samples": 1 }, "short_text": { "precision": 0.8, "recall": 0.8, "f1": 0.8000000000000002, "num_samples": 5 } } }
Our configuration directory should now have a performance.json and a run_id.txt file. We're saving these so we can quickly access the metadata of the latest successful training run. If we were considering several models at once, we could manually set the run_id of the run we want to deploy or programmatically identify the best run across experiments.
config/
├── args.json - arguments
├── config.py - configuration setup
├── performance.json - performance metrics
└── run_id.txt - ID of latest successful run
And we should see this specific experiment and run in our model registry:
stores/model/
├── 0/
└── 1/
Predict
Predict texts
We're finally ready to use our trained model for inference. We'll add the operation to predict a tag to main.py
:
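A sketch of the predict_tag() operation, assuming a load_artifacts() helper like the one sketched next:

```python
# tagifai/main.py (sketch; uses the imports from the earlier main.py sketches)
from tagifai import predict


def predict_tag(text="", run_id=None):
    """Predict the tag for a given text."""
    if not run_id:
        run_id = open(Path(config.CONFIG_DIR, "run_id.txt")).read().strip()
    artifacts = load_artifacts(run_id=run_id)
    prediction = predict.predict(texts=[text], artifacts=artifacts)
    print(json.dumps(prediction, indent=2))
    return prediction
```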
This involves creating the load_artifacts()
function inside our main.py
script:
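An abridged sketch of load_artifacts(), which downloads the run's logged artifacts from the MLflow store and loads them back into memory; the filenames match the ones assumed in the train_model() sketch above:

```python
# tagifai/main.py (sketch)
import tempfile
from pathlib import Path

import joblib
import mlflow

from config import config
from tagifai import data, utils


def load_artifacts(run_id):
    """Load artifacts for a given run_id."""
    with tempfile.TemporaryDirectory() as dp:
        client = mlflow.tracking.MlflowClient()
        client.download_artifacts(run_id=run_id, path="", dst_path=dp)
        vectorizer = joblib.load(Path(dp, "vectorizer.pkl"))
        model = joblib.load(Path(dp, "model.pkl"))
        label_encoder = data.LabelEncoder.load(fp=Path(dp, "label_encoder.json"))
        performance = utils.load_dict(filepath=Path(dp, "performance.json"))

    return {
        "vectorizer": vectorizer,
        "model": model,
        "label_encoder": label_encoder,
        "performance": performance,
    }
```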
and defining the predict()
function inside predict.py
:
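And a sketch of predict(), tying together the vectorizer, model, custom_predict() and label decoding (the threshold default is illustrative):

```python
# tagifai/predict.py (sketch)
def predict(texts, artifacts, threshold=0.5):
    """Predict tags for given texts."""
    x = artifacts["vectorizer"].transform(texts)
    y_prob = artifacts["model"].predict_proba(x)
    other_index = artifacts["label_encoder"].class_to_index["other"]
    y_pred = custom_predict(y_prob=y_prob, threshold=threshold, index=other_index)
    tags = artifacts["label_encoder"].decode(y_pred)
    predictions = [
        {"input_text": texts[i], "predicted_tag": tags[i]} for i in range(len(texts))
    ]
    return predictions
```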
Commands to predict the tag for text:
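For example:

```python
from tagifai import main

text = "Transfer learning with transformers for text classification."
main.predict_tag(text=text)
```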
[
    {
        "input_text": "Transfer learning with transformers for text classification.",
        "predicted_tag": "natural-language-processing"
    }
]
Don't worry about formatting our functions and classes just yet. We'll be covering how to properly do this in the documentation lesson.
So many functions and classes...
As we migrated from notebooks to scripts, we had to define so many functions and classes. How can we improve this?
Show answer
As we work on more projects, we may find it useful to contribute our generalized functions and classes to a central repository. Provided that all the code is tested and documented, this can reduce boilerplate code and redundant efforts. To make this central repository available for everyone, we can package it and share it publicly or keep it private with a PyPI mirror, etc.
# Ex. installing our public repo
pip install git+https://github.com/GokuMohandas/mlops-course#egg=tagifai