Test Driven Machine Learning

In a previous post, we talked in detail about Test Driven Development (TDD): its main methodology, benefits, pitfalls, and best practices. Given the major differences between ML-based code and traditional programming, in this post we discuss how the TDD methodology applies to ML projects at three different levels.

TDD in Code Level

As a data scientist, you spend most of your day on data preparation, cleaning, feature processing and development, and so on. Such data-related tasks leave plenty of room for TDD.

Many times during this process, your code may not raise any error, yet the results are not what you expected.
You have probably also met the situation where something you calculated at one stage of your notebook does not work at the next stage of development. Maybe the fields did not match, maybe the data types differed, or maybe a seed was not fixed.

Here the question to ask is “How can I catch bugs in my code?”

This question is about finding functional issues in the code, and it is where TDD has the most value. At this level, the traditional, well-known testing strategies such as unit testing are applicable and useful.
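As a minimal sketch of what such unit tests might look like, assume two hypothetical helpers, `clean_ages` and `split_rows` (neither comes from a real library). The tests use plain `assert` statements in a style that pytest can also collect, and they pin down the three failure modes mentioned above: wrong values, wrong data types, and an unfixed seed.

```python
import random


def clean_ages(raw_ages):
    """Convert raw age values to ints, dropping missing or implausible entries."""
    cleaned = []
    for value in raw_ages:
        try:
            age = int(str(value).strip())
        except ValueError:
            continue  # skip values such as "n/a" or None
        if 0 <= age <= 120:
            cleaned.append(age)
    return cleaned


def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows with a fixed seed and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def test_clean_ages_drops_bad_values_and_returns_ints():
    result = clean_ages(["34", " 27 ", "n/a", None, "-5", "200", 61])
    assert result == [34, 27, 61]
    assert all(isinstance(age, int) for age in result)


def test_split_rows_is_reproducible_and_loses_nothing():
    rows = list(range(20))
    train_a, test_a = split_rows(rows, seed=7)
    train_b, test_b = split_rows(rows, seed=7)
    assert (train_a, test_a) == (train_b, test_b)  # same seed, same split
    assert len(test_a) == 5
    assert sorted(train_a + test_a) == rows  # no row lost or duplicated
```

In a TDD workflow the two `test_` functions would be written first and would fail until the helpers are implemented.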

TDD in Algorithm Level

Here the question is “How can I improve my algorithm’s performance?”

With TDD, you describe the expected behavior in the form of a test and then create the code to satisfy the test. While this can work well for some components of your machine learning model, as we discussed in the previous section, it won’t work well for the high-level behavior of a machine learning model, because the expected behavior is not precisely known in advance.

The process of developing a machine learning model often involves trying different approaches to see which one is most effective. The behavior is likely to be measured in terms of percentages, rather than absolutes.

At this point, you may be wondering what kind of TDD can be used here.

In every machine learning algorithm, there is a way to quantify the quality of what you’re doing. In linear regression, it’s the adjusted R² value; in classification problems, it’s a ROC curve or a confusion matrix; and so on. All of these are testable quantities. Of course, none of them has a built-in way of saying that the algorithm is good enough.

We can get around this by starting every problem with a completely naïve and ignorant algorithm. The scores we get for it basically represent plain old random chance. Once we have built an algorithm that beats those random-chance scores, we start iterating, each time attempting to beat the highest score achieved so far.
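The baseline-first idea can be sketched as a test. Everything below is illustrative: the toy dataset, the majority-class baseline (used here as a stand-in for the naïve algorithm), and the one-feature threshold model are assumptions, not part of any particular library.

```python
from collections import Counter


def make_toy_dataset():
    """Deterministic toy data: label is 1 when x >= 0.5, with some noisy labels."""
    rows = []
    for i in range(100):
        x = i / 100
        label = 1 if i >= 50 else 0
        if i % 7 == 0:  # flip every 7th label to simulate noise
            label = 1 - label
        rows.append((x, label))
    return rows[0::2], rows[1::2]  # (train, test): even and odd indices


def accuracy(predict, rows):
    """Fraction of rows whose label the predict function gets right."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def fit_majority_baseline(train):
    """Naive baseline: always predict the most common training label."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority


def fit_threshold_model(train):
    """Candidate model: pick the cut-off on x that maximizes training accuracy."""
    candidates = [x for x, _ in train]
    best = max(candidates, key=lambda t: accuracy(lambda x: int(x >= t), train))
    return lambda x: int(x >= best)


def test_model_beats_baseline():
    train, test = make_toy_dataset()
    baseline_score = accuracy(fit_majority_baseline(train), test)
    model_score = accuracy(fit_threshold_model(train), test)
    assert model_score > baseline_score
```

As iteration continues, `baseline_score` could be replaced by the best score recorded so far, so that the bar a new model must clear keeps rising.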

TDD in Project Level

TDD is about writing tests for each small part of the code before actually writing it. Because ML projects are costly in terms of money, time, and effort, the same TDD mindset should be applied to ML development at a higher level. Before you start to develop a model, you should have a well-thought-out plan for testing whether that model actually improves your product or creates business value.

Just as TDD drives the design of software by helping developers ask the right questions and anticipate potential problems, there are several questions you should ask yourself before any major work takes place, including:

  • What is the cost of failure?
  • How fast do you get new data?
  • And so on…

TDD Best Practices in ML

Machine learning is an iterative process; it’s about rapid experimentation and iteration. To iterate quickly and with confidence on cutting-edge machine learning models as a team, the following practices can help:

  • Machine learning models tend to have many parameters, often nested in complex ways. In order to simplify the testing and experimentation process, it’s recommended to separate hyperparameters from the business logic and model construction, using config files which will act as a higher-level way to design a model. Gin is a lightweight configuration framework for Python that can help in achieving this purpose.
  • Organize your models in a hierarchical structure with one HEAD model that represents the latest and most effective changes. Changing the HEAD model should require a pull request accompanied by reports containing stats that show the set of changes makes a real improvement over the HEAD model.
  • ML models are difficult to debug; even with a bug (or many), a model can still perform adequately. Unit tests on every pull request are therefore needed to ensure that the model can still mechanically train and that key functionality works as expected. Such tests can be run via tools like CircleCI.
  • Keep track of all of your experiments and their different components, including the dataset. Without a record of your experiment history, you won’t be able to learn much. DVC is a tool built on top of Git that can be used for this purpose. For a brief introduction to ML version control and DVC, you can refer to our previous post here.
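Two of these practices, keeping hyperparameters apart from model code and checking on every change that the model can still mechanically train, can be sketched together. The `TrainConfig` dataclass below is a hypothetical stand-in for a config file (it is not Gin's API), and the tiny gradient-descent model is only an assumption chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters kept apart from the model-building code."""
    learning_rate: float = 0.1
    steps: int = 50


def mean_squared_error(weight, bias, data):
    """Average squared error of the line y = weight * x + bias on data."""
    return sum((weight * x + bias - y) ** 2 for x, y in data) / len(data)


def train_linear_model(data, config):
    """Fit y = weight * x + bias by gradient descent; return params and loss history."""
    weight, bias = 0.0, 0.0
    losses = [mean_squared_error(weight, bias, data)]
    n = len(data)
    for _ in range(config.steps):
        grad_w = sum(2 * (weight * x + bias - y) * x for x, y in data) / n
        grad_b = sum(2 * (weight * x + bias - y) for x, y in data) / n
        weight -= config.learning_rate * grad_w
        bias -= config.learning_rate * grad_b
        losses.append(mean_squared_error(weight, bias, data))
    return weight, bias, losses


def test_training_smoke():
    # Points on the line y = 2x + 1; a tiny run keeps the test fast.
    data = [(x / 10, 2 * (x / 10) + 1) for x in range(10)]
    config = TrainConfig(learning_rate=0.1, steps=5)
    _, _, losses = train_linear_model(data, config)
    assert len(losses) == config.steps + 1  # one loss per step plus the initial one
    assert losses[-1] < losses[0]  # training made progress
```

A smoke test like this does not show that the model is good, only that training still runs and the loss moves in the right direction, which is the kind of check that is cheap enough to run on every pull request.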


In this post, we discussed how TDD can be applied in ML at three different levels: code level, algorithm level, and project level. We also covered best practices for applying TDD with ML.

