Creating an interactive dashboard to visually inspect our application using Streamlit.
When developing an application, we may know its nuances (preprocessing, performance, etc.) quite well, but how can we effectively communicate them to other developers and business stakeholders? One option is a Jupyter notebook, but it's often cluttered with code and isn't easy for non-technical team members to access and run. We need to create a dashboard that can be accessed without any technical prerequisites and that effectively communicates key findings. It would be even more useful if the dashboard were interactive, so that even technical developers could explore it prior to the next iteration.
With Streamlit, we can quickly create an empty application and as we develop, the UI will update as well.
```bash
# Setup
pip install streamlit
mkdir streamlit
touch streamlit/st_app.py
streamlit run streamlit/st_app.py
```

```
Local URL: http://localhost:8501
```
Before we create a dashboard for our specific application, we need to learn about the different Streamlit API components. Instead of going through them all in this lesson, take ten minutes and go through the entire documentation page. We normally don't suggest this, but it's quite short and we promise you'll be amazed at how many UI components (styled text, latex, tables, plots, etc.) you can create using just Python. We'll explore the different components in detail as they apply to creating the different interactions for our specific dashboard below.
We start by showing a sample of our different data sources because, for many people, this may be the first time they see the data, so it's a great way for them to understand all the different features, formats, etc. For displaying the tags, we don't want to just dump all of them on the dashboard; instead, we can use a selectbox to let the user view them one at a time.
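A minimal sketch of what this could look like (the `tags` dictionary here is a hypothetical stand-in for our real tags data source):

```python
import streamlit as st

# Hypothetical stand-in for our tags data source (loaded from a file/URL in the real app)
tags = {
    "natural-language-processing": {"aliases": ["nlp"], "parents": []},
    "computer-vision": {"aliases": ["cv"], "parents": []},
}

st.header("🔢 Data")

# Let viewers inspect one tag at a time instead of dumping them all on the page
selected_tag = st.selectbox("Choose a tag to view", options=list(tags.keys()))
st.write(tags[selected_tag])
```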
We can also show a snapshot of the loaded DataFrame, which has sortable columns that people can play with to explore the data.
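A sketch of how this might look (the projects file path is an assumption; point it at wherever your labeled data lives):

```python
import pandas as pd
import streamlit as st

# Assumed location of our labeled projects; adjust to your repository layout
df = pd.read_csv("data/projects.csv")

st.text(f"Projects (count: {len(df)})")
st.dataframe(df.head(10))  # interactive, sortable snapshot of the data
```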
We can essentially walk viewers through our entire data phase (EDA, preprocessing, etc.) and allow them (and ourselves) to explore key decisions. For example, we chose to introduce a minimum tag frequency constraint so that we can have enough samples. We can now interactively change that value with a slider widget and see which tags just made and missed the cut.
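A sketch of the slider interaction, continuing from the `df` loaded above (the frequency bounds and default value are assumptions):

```python
from collections import Counter

import streamlit as st

# Count how many projects carry each tag (assumes `df` from the snapshot above)
tag_counts = Counter(df.tag.values)

# Interactively choose the minimum number of samples a tag needs to be kept
min_tag_freq = st.slider("Minimum tag frequency", min_value=1, max_value=50, value=30)

# Split tags into those that just made the cut and those that just missed it
included = {tag: count for tag, count in tag_counts.items() if count >= min_tag_freq}
excluded = {tag: count for tag, count in tag_counts.items() if count < min_tag_freq}

col1, col2 = st.columns(2)  # st.beta_columns on older Streamlit versions
with col1:
    st.write("**Tags that made the cut**")
    st.write(included)
with col2:
    st.write("**Tags that missed the cut**")
    st.write(excluded)
```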
What makes this truly interactive is that when we alter the value here, all the downstream tables and plots will update to reflect that change immediately. This is a great way to explore what constraints to use because we can quickly visualize the impact it can have on our data.
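For example, a downstream view like this one (reusing the `df` and `included` names from the sketch above) recomputes every time the slider moves:

```python
import pandas as pd
import streamlit as st

# Everything below the slider reruns whenever `min_tag_freq` changes
df = df[df.tag.isin(included)]  # keep only projects whose tag made the cut
st.write(f"Remaining projects: {len(df)}")

# Tag distribution under the current constraint
st.bar_chart(pd.Series(included).sort_values(ascending=False))
```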
On a similar note, we can also interactively view how our preprocessing functions behave. We can alter any of the function's default input arguments, as well as the input text.
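A sketch of this interaction, using a simplified stand-in for our real preprocessing function (the widget defaults are assumptions):

```python
import re

import streamlit as st

def preprocess(text: str, lower: bool = True, min_len: int = 2) -> str:
    """Simplified stand-in for the preprocessing function in our data script."""
    if lower:
        text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # strip punctuation
    return " ".join(word for word in text.split() if len(word) >= min_len)

# Expose the function's default arguments as widgets so they can be altered interactively
lower = st.checkbox("Lowercase text", value=True)
min_len = st.number_input("Minimum token length", min_value=1, max_value=10, value=2)
text = st.text_area("Input text", value="Transfer learning with transformers for text classification.")

st.write("Preprocessed:", preprocess(text=text, lower=lower, min_len=min_len))
```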
In fact, we were able to discover and fix a bug here: the NLTK package automatically lowercases text when stemming, which we had to override with our Stemmer class in our data script.
This page allows us to quickly compare the improvements and regressions between current and previous deployments. Since our application's deployments are organized via tags, we need to be able to compare models across the different tags, as well as against what's available in the current workspace.
We want to provide two different performance views:
- View change in key metrics across all versions.
- Inspect detailed differences between two versions.
We can create some plots of key metric improvements (or regressions) across time (releases). We can fetch all the available tags (and current workspace) and extract their respective performance metadata and visualize them on a plot.
We can see that as our application matures, we'll have many tags and potentially different versions deployed simultaneously. At this point, we should invest in having a metadata or evaluation store where all parameter and performance artifacts are efficiently indexed by model IDs.
Now, when we want to zoom in on the difference between two versions, we can use a simple selectbox to choose which tags to compare. This could be the current workspace versus the currently deployed tag, two previously deployed tags, etc.
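Something along these lines, reusing the `performance_by_tag` mapping from the sketch above:

```python
import streamlit as st

# Choose any two versions to compare (the current workspace, the deployed tag, older tags, etc.)
versions = list(performance_by_tag.keys())
tag_a = st.selectbox("Tag A", options=versions, index=0)
tag_b = st.selectbox("Tag B", options=versions, index=len(versions) - 1)
```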
Once we've chosen which two tags to compare, we can view their differences in performance. With ML applications, we need to compare many key metrics and weigh different tradeoffs in order to decide whether one version is better than the other. Therefore, we want to create visualizations that help us draw insights quickly and make the decision. For example, rather than visualizing the key metric values for the two tags side by side, we can plot the differences directly to quickly surface the key metric improvements and regressions.
We also want to enable a closer analysis of the overall, per-class, and per-slice performance improvements and regressions.
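A sketch of the diff view, again assuming the `performance_by_tag`, `tag_a`, and `tag_b` names from the sketches above; the same pattern extends to per-class and per-slice metrics:

```python
import pandas as pd
import streamlit as st

# Plot the metric differences directly so improvements and regressions stand out
diffs = {
    metric: performance_by_tag[tag_b][metric] - performance_by_tag[tag_a][metric]
    for metric in performance_by_tag[tag_a]
}
st.bar_chart(pd.Series(diffs, name=f"{tag_b} vs. {tag_a}"))
```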
With the inference page, we want to be able to test our model with various inputs and receive predictions, as well as intermediate outputs (e.g. preprocessed text). This is a great way for our team members to quickly play with the latest deployed model.
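A sketch of the inference page, where `predict` is a hypothetical helper that wraps our deployed model (or calls our API endpoint):

```python
import streamlit as st

st.header("🚀 Inference")

# Free-form input that gets sent through the latest deployed model
text = st.text_input("Enter text to classify", value="Transfer learning with transformers for text classification.")

# `predict` is a hypothetical helper; in practice it loads our model or hits our API
prediction = predict(texts=[text])[0]
st.write("Preprocessed text:", prediction["preprocessed_text"])
st.write("Predicted tags:", prediction["predicted_tags"])
```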
Our last page enables a closer inspection of the test split's predictions to identify areas to improve, collect more data for, etc. First, we offer a quick view of each tag's performance, and we could do the same for specific slices of the data we may care about (high priority, minority classes, etc.).
We're also going to inspect the true positive (TP), false positive (FP) and false negative (FN) samples across our different tags. It's a great way to catch issues with labeling (FP), weaknesses (FN), etc.
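A sketch of the TP/FP/FN bucketing, assuming we have the test split's `texts`, true tag lists (`y_true`), and predicted tag lists (`y_pred`) in memory:

```python
import streamlit as st

# Pick a tag and bucket the test split's predictions for it
tag = st.selectbox("Choose a tag to inspect", options=sorted({t for tag_list in y_true for t in tag_list}))

tp, fp, fn = [], [], []
for i, (true_tags, pred_tags) in enumerate(zip(y_true, y_pred)):
    if tag in true_tags and tag in pred_tags:
        tp.append(i)
    elif tag not in true_tags and tag in pred_tags:
        fp.append(i)
    elif tag in true_tags and tag not in pred_tags:
        fn.append(i)

st.write("**True positives**:", [texts[i] for i in tp])
st.write("**False positives** (possible labeling issues):", [texts[i] for i in fp])
st.write("**False negatives** (model weaknesses):", [texts[i] for i in fn])
```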
Be careful not to make decisions based on predicted probabilities before calibrating them to reliably use as measures of confidence.
- Use false positives to identify potentially mislabeled data.
- Connect inspection pipelines with annotation systems so that changes to the data can be reviewed and incorporated.
- Inspect FP / FN samples by estimating training data influences (TracIn) on their predictions.
- Inspect the trained model's behavior under various conditions using the What-If Tool.
There are a few functions defined at the start of our st_app.py script which have a @st.cache decorator. This tells Streamlit to cache the function's output based on the combination of its inputs, which significantly improves performance for computation-heavy functions.
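For example, a cached loader might look like this (newer Streamlit versions replace `st.cache` with `st.cache_data`; the file path is an assumption):

```python
import pandas as pd
import streamlit as st

@st.cache()  # cache the result, keyed by the combination of input arguments
def load_data(projects_fp: str) -> pd.DataFrame:
    """Expensive load that should only rerun when `projects_fp` changes."""
    return pd.read_csv(projects_fp)

df = load_data(projects_fp="data/projects.csv")
```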
We have several different options for deploying and managing our Streamlit dashboard. We could use Streamlit's sharing feature (beta), which allows us to seamlessly deploy dashboards straight from GitHub. Our dashboard will continue to stay updated as we commit changes to our repository. Another option is to deploy the Streamlit dashboard along with our API service. We can use docker-compose to spin up a separate container or simply add it to the API service's Dockerfile's ENTRYPOINT with the appropriate ports exposed. The latter might be ideal, especially if your dashboard isn't meant to be public and you want added security, performance, etc.