Have you ever run into Google’s ‘People also ask’ section and found the questions you were about to ask? Periodically, we are going to answer the most researched questions about AI and Machine Learning with our guests on Clearbox AI’s new interview series ‘People also Ask...and we answer’. Enjoy!
In our latest interview, we talked about how companies can harness the power of AI. We touched upon many points, such as data management, organisational needs, the importance of setting the right expectations and of understanding results. Today we’ll take a step further and talk about how to make AI processes work. In particular, we will introduce the concept of Continuous Integration/Continuous Delivery (CI/CD). This method comes from the software engineering world, and we will see how it applies to both software and Machine Learning. I’m glad to address the topic with our guests, Luca Gilli, CTO and co-founder of Clearbox AI, and Gaspare Vitta, CI/CD Engineer at Qarik.
Introducing our guest
Gaspare Vitta is an Italian CI/CD engineer. His primary focus is the developer experience. As an organisation grows, so does its number of developers, and he takes care of improving the whole development workflow: he manages the tooling and all the automation work, from build systems to CI/CD, code quality pipelines and security pipelines.
What is CI/CD? Can you give us some examples?
Gaspare: As you mentioned, CI/CD stands for Continuous Integration and Continuous Deployment, and it’s a process you embrace to deliver your software frequently, and not only software: it may also concern source code, configuration data, Machine Learning models, and more. It starts when a developer pushes code from their laptop to the central repository; when this happens, a Continuous Integration system triggers a new pipeline instance. In the first step, you want to make sure that the commit is good enough to be part of your codebase. You want to build the code, test it, run code quality checks, and ensure that the code follows all the linting standards you have in your organisation. At the same time, you may want to run some security checks: say a developer is adding a new third-party dependency, and you want to scan it to make sure it follows the standards of your organisation. The main goal of this stage is to give developers quick feedback, since they are waiting for the result of the CI. You want to tell them, “This commit is good” or “This commit is not good, go there and fix it”, and you have to do it quickly.

After this, you take care of other things, like running more in-depth testing, for example acceptance tests. You can have them automated, and when a commit makes it through to a certain point, you can call it a release candidate. In many organisations there is a need for some sort of manual testing, and you want to spend manual effort only on good commits, i.e. things that made it through the whole pipeline up to that point. Then you run more tests - stress tests, capacity testing, or whatever else your organisation needs - and finally you can promote the build to production and release it to your users.

There are a lot of steps, so my suggestion is to do it one step at a time. Don’t aim for the whole pipeline at once; it’s not a one-day job. If your code is on GitHub, just start with GitHub Actions to build your code and run your unit tests. This is already a massive productivity improvement. At the same time, try to provide error messages to your users in an actionable way: don’t say “Hey, your unit tests failed”, but rather “This unit test failed at this point, and this is the log message”, so it’s easier for developers to fix. Also try to measure data about your pipelines: how long they take, how long each step takes. This makes it easier to identify bottlenecks in the development process.
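To make that first stage a bit more concrete, here is a minimal, hypothetical sketch in Python of the kind of quick-feedback checks Gaspare describes: each step is timed and failures are reported with actionable output. The specific commands and project layout (flake8, pytest, pip-audit, a src/ and tests/ folder) are assumptions for illustration only, not a prescribed setup.

```python
# Hypothetical quick-feedback CI script: build/lint/test steps are timed
# and failures are reported with their logs, so developers know exactly
# what to fix. Commands and folders below are illustrative assumptions.
import subprocess
import sys
import time

STEPS = [
    ("lint", ["flake8", "src"]),                 # code quality / linting
    ("unit tests", ["pytest", "tests", "-q"]),   # fast tests first
    ("security scan", ["pip-audit"]),            # check third-party dependencies
]


def run_step(name, cmd):
    start = time.time()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.time() - start
    print(f"[{name}] finished in {elapsed:.1f}s")  # measure each step
    if result.returncode != 0:
        # Actionable feedback: say what failed and show the log message.
        print(f"[{name}] FAILED:\n{result.stdout}\n{result.stderr}")
        return False
    return True


if __name__ == "__main__":
    for name, cmd in STEPS:
        if not run_step(name, cmd):
            sys.exit(1)  # fail fast: quick feedback on the first broken step
    sys.exit(0)
```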
Alessandra: One question comes to my mind. I'm not a technical person, and I have some doubts about the "D" in CI/CD, because I've found that it can stand for Delivery, Development, Deployment… What's the difference between these?
Gaspare: You can find the different meanings in the literature, but it just means getting the software to your customers as soon as possible so they can validate it. It's just a process. Don't focus on the naming, focus on the process, and try to make it as quick and as reliable as possible.
How does CI/CD work in Machine Learning?
Luca: Thank you, Gaspare. As you said, Machine Learning can also be seen as software, because Machine Learning models are usually implemented inside tools, and you want to use these tools live. You want to be able to constantly update them while making sure you don't break anything, so the same CI/CD concepts apply to Machine Learning. The only difference is that with Machine Learning you have software, because you need software to write Machine Learning code, but you also have models and data, and these are two different entities that keep changing. Machine Learning in production means continuous exposure to new data. You have to improve your models constantly, so you need to implement many tests and version controls to make sure that these small changes don't break anything. And, in case they do break something, you should be able to trace back to the last working version of a model.

I don't have a software engineering background, and I had to learn the essential concepts of CI/CD like many practitioners in the Machine Learning field: my feeling is that there's a bit of a struggle to understand the best practices and the best tools. I usually say that CI/CD for data scientists should mean at least version control and testing, and you should start by implementing these two concepts. There are a lot of tools for CI/CD in Machine Learning specifically. For code, you can do version control using Git or Subversion, with platforms like GitHub. For data and models, there are now tools like Data Version Control (DVC), which is like Git for data, and there's MLflow; these are all very useful tools. The learning curve is sometimes slightly steep because you need to learn how to use them, but they save you a lot of time in the long term and prevent technical debt.

For example, sometimes we like to make the easy choice: "Oh, I'm working in a Jupyter notebook because life in a Jupyter notebook is so easy." I'm sure that 99% of data scientists have wanted to go back to a previous version of a model that they thought was working better and couldn't, because they didn't store it properly. In my opinion, implementing this kind of practice is sometimes as simple as saving a Word document every five minutes because you're afraid of losing your progress. I decided to treat pushing my code to GitHub as a save button, and to save my Machine Learning experiment with MLflow every time I stop working and do something else, because I know it is a checkpoint I can always go back to. The goal is to make sure that the model will behave the way it did last time if I come back to it after N hours.
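As an illustration of the "save button" habit Luca mentions, here is a minimal sketch that logs a training run with MLflow. The dataset, model and parameters are placeholders, and this is only one of many ways to set up experiment tracking.

```python
# Minimal "MLflow as a save button" sketch: each run records the parameters,
# the metric and the model artifact, giving a checkpoint to go back to.
# Dataset and model choice below are placeholders for illustration.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 5}

with mlflow.start_run():  # one "checkpoint" per run
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)                 # record how the model was trained
    mlflow.log_metric("accuracy", acc)        # record how well it did
    mlflow.sklearn.log_model(model, "model")  # store the model artifact itself
```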
Alessandra: So, is it feasible to do CI/CD even if you are not a software engineer?
Luca: Yes. You don't need to be a GitHub master to use GitHub; you can just do a commit, a push and a pull. Some people are experts on Git, but you don't have to be a Git wizard to start implementing these first best practices. You can simply use it as a save and load button.
Alessandra: Gaspare, I'll come back to you because I recently read an article you wrote, and one sentence struck me:
"Releasing software shouldn't be an art but a boring and monotonous engineering process". What does it mean?
Gaspare: If you're releasing your software correctly, it should be as simple as hitting one button or executing one command. Nevertheless, there are a lot of anti-patterns you can find. The most common is the manual deployment of software, and it's very easy to spot: if there is a ton of documentation just for releasing the software, it's probably a process that takes days and a whole team of engineers. When you release that way, you only do it once every few weeks or months. The delta between what is currently in production and what you're releasing is massive, which means there's a lot of room for errors, and if you find a bug later on, you will have to scan through a lot of changes to track it down. On the other hand, if you release at every commit, the difference is so tiny that it's very easy to spot errors. It also ensures that the scripts you're relying on are exercised repeatedly, so they work reliably on release day. Another painful anti-pattern is modifying your production environment manually whenever you need to: you cannot roll back to a previous version, which can be dangerous. Ideally, you should keep everything in code, including your OS configuration, VM configuration, and all your third-party dependencies, so that you can recreate your environment in an automated way. That said, there are two pillars you should follow if you want to do CI/CD in the best way:
- Speed: you want to provide rapid feedback loops, so you want to release software as fast as possible;
- Quality: you want to build your software with very high quality, so all your test suites should be automated, and you should move them as early as possible in your pipeline.
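To give a flavour of the "one button" release Gaspare describes, here is a hypothetical Python sketch: the release is a single command that records which commit is being deployed, so a rollback target always exists. The file name, deploy command and overall shape are assumptions for illustration, not a prescribed setup.

```python
# Hypothetical one-command release sketch: everything needed to release
# (and roll back) lives in code rather than in a manual runbook.
import json
import subprocess
import sys
from datetime import datetime, timezone

STATE_FILE = "releases.json"  # assumed location for the release history


def current_commit() -> str:
    # The commit being released; assumes the script runs inside a git repo.
    return subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()


def release(deploy_cmd: list) -> None:
    commit = current_commit()
    # Record what is about to be deployed so a rollback target always exists.
    try:
        with open(STATE_FILE) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = []
    history.append({"commit": commit, "time": datetime.now(timezone.utc).isoformat()})
    with open(STATE_FILE, "w") as f:
        json.dump(history, f, indent=2)
    subprocess.run(deploy_cmd, check=True)  # the actual deployment step


if __name__ == "__main__":
    # Placeholder deploy command; in practice this would call your
    # infrastructure-as-code or deployment tooling.
    release(["echo", "deploying", current_commit()])
    sys.exit(0)
```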
How do you test ML models according to CI/CD principles?
Luca: It's a complicated question because we can discuss it at several levels. For example, most ML models nowadays are written in Python, and one of the problems in Python is reproducibility. As Gaspare mentioned, models trained and tested locally in a developer's environment should give the same answers in production. Therefore, you should ensure that, given the same input, the output is always the same as the one you verified while developing the model. You can treat this packaging as a test: for example, set up a GitHub Action that runs the model with a specific version of Python, in a specific virtual environment, with the requirements you specified as dependencies, and check that the model is properly reproducible. In my opinion, this is a technical aspect very much related to Python.

The other aspect is that you should make sure that every single small change to the model either improves it or at least doesn't make it worse. You can write tests for that too, and a very good strategy is to define a baseline. It can be a straightforward dummy model, like a majority classifier that makes the same decision based on a very silly rule. Your model should always beat that kind of baseline, and this can be written as a test: if the accuracy or the precision of a new version of the model is worse, reject that version. A baseline also helps you check whether the metric you are using is good enough; it can reveal, for example, that you're using accuracy on a very imbalanced problem, which is a widespread mistake in Machine Learning because it gives you a false sense of security about performance.

Another aspect is that, even when you have a macro idea that your model is doing okay in terms of global metrics, it's crucial to test it on more local metrics and divide the data into segments to analyse the model's behaviour more in depth. This kind of testing requires a lot of hand-tuning and decisions about which segments to analyse; for example, if you have a model that deals with personal data, you need to understand whether the model is working better or worse than before for a given segment of people. This is also essential to detect issues like bias and unfairness. At Clearbox AI, we talk and think a lot about Synthetic Data, so you can also think about using this newly generated data as a testbed for models. It's a very new topic, and there's no framework yet for this kind of testing, but we're trying to show how synthetic data can help improve model testing. Creating data for specific population segments or data-point types can help to better understand models over time and monitor them through testing.
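As an illustration of the baseline strategy Luca describes, here is a minimal, hypothetical pytest-style test in which a dummy majority classifier sets the bar a new model version must clear. The dataset and candidate model are stand-ins for a real pipeline, and balanced accuracy is used to avoid the imbalanced-accuracy trap mentioned above.

```python
# Hypothetical "beat the baseline" test: a new model version is rejected
# if it does not outperform a dummy majority classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def test_model_beats_majority_baseline():
    # Stand-in dataset; in practice this would be your project's data.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # A very silly baseline: always predict the majority class.
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

    # Stand-in for the candidate model version under test.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    # Balanced accuracy is less forgiving than plain accuracy on imbalanced data.
    baseline_score = balanced_accuracy_score(y_test, baseline.predict(X_test))
    model_score = balanced_accuracy_score(y_test, model.predict(X_test))

    # Reject any model version that does not clearly beat the dummy baseline.
    assert model_score > baseline_score
```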
Would you like to suggest some readings about CI/CD?
Gaspare: Yes, sure. I will suggest two books about CI/CD. One is "Continuous Delivery" by Jez Humble and David Farley: it's a classic in this field. The other one is "Accelerate", co-authored by Jez Humble. These readings provide the backbone and a lot of use cases for implementing CI/CD.
Luca: I would like to suggest a paper called "Hidden Technical Debt in Machine Learning Systems", often recognised as the paper that laid the foundations of MLOps. It's an exciting read because it talks about going from DevOps to MLOps in Machine Learning. Then I'd like to share a post from Martin Fowler's blog about "Continuous Delivery for Machine Learning". It's fascinating for understanding what makes Machine Learning different from software when it comes to testing and versioning.
Watch the previous episodes: