7 Tips for Beginners to Future-Proof Your Machine Learning Project
An Introduction to Developing More Collaborative, Reproducible and Reusable ML Code
There is often a knowledge gap when moving from exploratory Machine Learning projects, typical in research and study, to industry-level projects. Industry projects generally have three additional objectives: they should be collaborative, reproducible, and reusable, which supports business continuity, increases efficiency and reduces cost. Although I am nowhere near finding a perfect solution, I would like to document some tips I have learned for transforming exploratory, notebook-based ML code into an industry-ready project designed for scalability and sustainability.
I have categorized these tips into three key strategies: modularization, versioning and consistency.
Improvement 1: Modularization - Break Down Code into Smaller Pieces
Improvement 2: Versioning - Data, Code and Model Versioning
Improvement 3: Consistency - Consistent Structure and Naming Convention
Improvement 1: Modularization - Break Down Code into Smaller Pieces
Problem
One struggle I have faced is keeping an entire data science project in a single notebook, which is common while learning data science. As you may have experienced, a data science project contains repeatable components; for instance, the same data preprocessing steps are applied to both training data and inference data. In a single notebook, different versions of the same function end up copied and reused in different places. Not only does this reduce the consistency of the code, it also makes troubleshooting the notebook more challenging.
Bad Example
## bad example
train_data = train_data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
train_data[numeric_cols] = train_data[numeric_cols].fillna(train_data[numeric_cols].mean())
train_data['Month'] = pd.to_datetime(train_data['Date']).dt.month.apply(str)
inference_data = inference_data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
inference_data[numeric_cols] = inference_data[numeric_cols].fillna(inference_data[numeric_cols].mean())
inference_data['Month'] = pd.to_datetime(inference_data['Date']).dt.month.apply(str)
Tip 1: Reuse code where possible by creating and importing functions, modules, packages
Good Example 1
import pandas as pd

def data_preparation(data):
    # Drop columns that are not used by the model
    data = data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
    numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
    # Fill missing numeric values with the column mean
    data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
    # Extract the month from the date as a string feature
    data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
    return data

train_preprocessed = data_preparation(train_data)
inference_preprocessed = data_preparation(inference_data)
In this example, we extract the common processing steps as a data_preparation function and apply it to train_data and inference_data. Breaking down a long script into self-contained components like this makes it easier to unit test and troubleshoot.
Good Example 2
Furthermore, we can store this function in a standalone Python module (e.g. 'preprocessing.py') and import the function from this file.
## file preprocessing.py ##
import pandas as pd

def data_preparation(data):
    data = data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
    numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
    data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
    data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
    return data

## in your notebook or main script ##
from preprocessing import data_preparation

train_preprocessed = data_preparation(train_data)
inference_preprocessed = data_preparation(inference_data)
This makes it readily accessible and reusable for applications in other projects or by other team members. Additionally, it enhances code consistency and reduces the risk of having multiple versions of the same function definition. For example, we may accidentally drop or misspell one variable when copying this code.
Tip 2: Keep parameters in a separate config file
To further improve the script, we can store parameters such as the dropped columns in another file (e.g. “parameters.py”) and import them.
Good Example 3
## parameters.py ##
DROP_COLS = ['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am']
NUM_COLS = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']

## preprocessing.py ##
import pandas as pd
from parameters import DROP_COLS, NUM_COLS

def data_preparation(data):
    data = data.drop(DROP_COLS, axis=1)
    data[NUM_COLS] = data[NUM_COLS].fillna(data[NUM_COLS].mean())
    data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
    return data
These parameters generally remain constant within one iteration of the ML pipeline, but can change as the pipeline evolves over time. While modularizing a simple script like the one above might seem unnecessary, it pays off as the script becomes more complicated.
This approach is also widely used for storing and parsing model hyperparameters in an MLOps pipeline, or for keeping API tokens and password credentials in a secure location without exposing them in the script.
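As a rough sketch of the idea above, hyperparameters could live in a JSON file and secrets in environment variables, so that neither appears in the script itself. The file name `config.json`, the hyperparameter keys and the `API_TOKEN` variable name are hypothetical placeholders chosen for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def load_hyperparameters(config_path):
    """Read model hyperparameters from a JSON config file."""
    with open(config_path, "r", encoding="utf-8") as f:
        return json.load(f)


def get_api_token(var_name="API_TOKEN"):
    """Fetch a secret from the environment instead of hard-coding it.

    `API_TOKEN` is a hypothetical variable name used for illustration.
    Returns None when the variable is not set.
    """
    return os.environ.get(var_name)


# Example: write a small (hypothetical) config to a temp folder, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    config_path.write_text(json.dumps({"n_estimators": 100, "max_depth": 5}))
    params = load_hyperparameters(config_path)

print(params["max_depth"])  # 5
```

Keeping the values outside the code means a new experiment only needs a new config file, and credentials never end up in version control.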
Improvement 2: Versioning - Data, Code and Model Versioning
Problem
An unexpected trend was identified in the model output, which requires revisiting the project. However, only a single output file was generated as the end result of the code. Since production data changes over time, regenerating the model output or tracing it back to the source has become nearly impossible. Furthermore, the model cannot be reused for future predictions.
Tip 3: Data Versioning
Industry and production data are rarely static and may change on a daily basis. Therefore, it is crucial to take a snapshot of the data at the point in time when it is used for model training or predictions. A common practice is to use a timestamp to version the data.
from datetime import datetime
timestamp = datetime.today().strftime('%Y%m%d')
train_data.to_csv(f'train_data_{timestamp}.csv')
## output
>>> train_data_20240101.csv
There are more elegant services and solutions in the industry. DVC is a good example if you are looking for tools that make the process more streamlined.
It is also important to save data snapshots throughout the project lifecycle, for example: raw data, processed data, train data, validation data, test data and inference data. This reduces the need to rerun the code from scratch each time. Besides, if data drift is detected in the final model output, keeping a record of the intermediate steps helps to identify where the changes occurred.
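To make the per-stage snapshots concrete, here is a minimal sketch that builds a dated file path for each stage. The helper name `snapshot_path`, the `data/<stage>/` layout and the stage names are assumptions for illustration; adapt them to your own project.

```python
from datetime import date
from pathlib import Path

# Hypothetical stage names, based on the list in the text above.
STAGES = ("raw", "processed", "train", "validation", "test", "inference")


def snapshot_path(stage, snapshot_date=None, base_dir="data"):
    """Return a dated path such as data/raw/raw_20240101.csv."""
    if stage not in STAGES:
        raise ValueError(f"Unknown stage: {stage!r}")
    snapshot_date = snapshot_date or date.today()
    stamp = snapshot_date.strftime("%Y%m%d")
    return Path(base_dir) / stage / f"{stage}_{stamp}.csv"


path = snapshot_path("raw", date(2024, 1, 1))
print(path.name)  # raw_20240101.csv
```

A DataFrame could then be written with something like `train_data.to_csv(snapshot_path('train'))`, once the parent directory exists.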
Tip 4: Model Versioning
Depending on the training data, preprocessing pipeline and hyperparameters, models developed from the same algorithm can vary significantly from each other, so it is essential to keep track of different model configurations during the model experimentation phase. Models also have a certain level of randomness: even when a model is trained on the same dataset with the same process, the output can differ. This extends beyond machine learning and deep learning models. PCA and other data transformations that require fitting on training data can also involve randomness, so setting a random seed (for example, the random_state parameter in scikit-learn) is important to limit variation in the output.
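The effect of fixing a seed can be illustrated with the standard library alone. This is only a sketch: `split_rows` is a hypothetical helper, and in practice the equivalent setting depends on the library you use.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=None):
    """Shuffle rows and split them into train and test lists.

    Passing the same seed gives the same split on every run.
    """
    rng = random.Random(seed)  # local generator, does not touch global state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_a, test_a = split_rows(rows, seed=42)
train_b, test_b = split_rows(rows, seed=42)
print(train_a == train_b and test_a == test_b)  # True: same seed, same split
```

Recording the seed alongside the model configuration makes it possible to reproduce the same split later.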
While learning experiment tracking is a long journey, the first thing you can do is to save the model. There are multiple ways to save a trained model in Python. For example, we can use the pickle library.
import pickle
model_filename = 'model.pkl'
# pickle.dump expects an open binary file object, not a filename
with open(model_filename, 'wb') as f:
    pickle.dump(model, f)
You may want to choose a more descriptive name for your filename, and it is always helpful to provide a brief description that explains the model variant.
To load the model:
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
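One way to act on the advice above about descriptive names and a brief description is to store a small metadata file next to the pickled model. The sketch below assumes a hypothetical `save_model` / `load_model` pair and made-up metadata fields; a plain dictionary stands in for a trained model.

```python
import json
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


def save_model(model, directory, name, description, params=None):
    """Pickle a model and write a JSON sidecar file describing it."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    model_path = directory / f"{name}.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    metadata = {
        "name": name,
        "description": description,
        "params": params or {},
        "saved_at": datetime.now().isoformat(timespec="seconds"),
    }
    (directory / f"{name}.json").write_text(json.dumps(metadata, indent=2))
    return model_path


def load_model(model_path):
    """Load a pickled model. Only unpickle files from sources you trust."""
    with open(model_path, "rb") as f:
        return pickle.load(f)


# A plain dict stands in for a trained model in this example.
dummy_model = {"weights": [0.1, 0.2, 0.3]}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_model(
        dummy_model,
        tmp_dir,
        "rain-rf-v1",
        "Random forest baseline, 2023 data",
        params={"n_estimators": 100},
    )
    restored = load_model(saved)
    sidecar = json.loads(saved.with_suffix(".json").read_text())

print(restored == dummy_model)  # True
```

The sidecar file is plain text, so it can be read without loading the model and compared across model variants.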
Tip 5: Code Versioning
The third recommendation is to save the queries that have been used to generate any output data, e.g. the SQL script for extracting the raw data. Furthermore, when executing batch inference, save the script with the precise date instead of a relative date. This helps to keep a record of the time snapshot for future reference.
-- use precise date
SELECT ID, Date, MinTemp, MaxTemp
FROM dim_temperature
WHERE Date <= '2024-01-01' AND Date >= '2023-01-01'
-- use relative date
SELECT ID, Date, MinTemp, MaxTemp
FROM dim_temperature
WHERE Date <= GETDATE() AND Date >= DATEADD(year, -1, GETDATE())
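One possible way to get precise dates without typing them by hand is to resolve the relative window once, at run time, and store the rendered query with the output. The function name `render_query` and the one-year window are assumptions for illustration; the table and column names are taken from the query above.

```python
from datetime import date


def render_query(run_date):
    """Return the extraction query with literal start and end dates."""
    try:
        start_date = run_date.replace(year=run_date.year - 1)
    except ValueError:
        # run_date is 29 February and the previous year is not a leap year
        start_date = run_date.replace(year=run_date.year - 1, day=28)
    return (
        "SELECT ID, Date, MinTemp, MaxTemp\n"
        "FROM dim_temperature\n"
        f"WHERE Date <= '{run_date.isoformat()}' "
        f"AND Date >= '{start_date.isoformat()}'"
    )


query = render_query(date(2024, 1, 1))
print(query)
```

Saving the rendered text next to the inference output keeps a record of exactly which time window was queried.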
Additionally, Git is an essential code versioning tool when collaborating on a data science project within a team. It helps to track changes and revert to a previous checkpoint when necessary.
Improvement 3: Consistency - Consistent Structure and Naming Convention
Problem
All the data, files, and scripts are stored in one flat directory. Every ML project is built from scratch without a consistent workflow. The code is crammed into one notebook. It becomes difficult to figure out dependencies, and there is a risk of accidentally executing a line of code that overwrites previous output data.
Tip 6: Consistent Directory Structure
As the field of Data Science and Machine Learning has matured over time, consistent frameworks and project lifecycles have gradually been developed, such as CRISP-DM and TDSP. We can therefore build the project directory structure to adhere to a standard framework. For instance, “cookiecutter data science” provides a logical, reasonably standardized, but flexible project structure for doing and sharing data science work.
Starting from the recommended directory structure, I have reduced it to the version below, which I also allow to evolve over time.
Feel free to develop a structure that best suits your workflow and can be used as a template for all future projects. In addition to the benefit of consistency, it is a powerful way to organize thoughts and create a high-level architecture during the development phase.
├── data
│   ├── output         <- The output data from the model.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
└── code               <- Source code for use in this project.
    ├── __init__.py    <- Makes code a Python module
    │
    ├── data           <- Scripts to generate and process data
    │   ├── data_preparation.py
    │   └── data_preprocessing.py
    │
    ├── models         <- Scripts to train models and then use trained models to make
    │   │                 predictions
    │   ├── inference_model.py
    │   └── train_model.py
    │
    └── analysis       <- Scripts to create exploratory and results oriented visualizations
        └── analysis.py
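A template like this can be created programmatically so that every new project starts from the same layout. The sketch below assumes a hypothetical helper `create_project_skeleton` and mirrors the folders in the tree above; it only creates empty directories and placeholder files.

```python
import tempfile
from pathlib import Path

# Directories from the structure above, relative to the project root.
PROJECT_DIRS = [
    "data/output",
    "data/processed",
    "data/raw",
    "models",
    "notebooks",
    "reports/figures",
    "code/data",
    "code/models",
    "code/analysis",
]


def create_project_skeleton(root):
    """Create the standard folders and empty placeholder files under root."""
    root = Path(root)
    for relative_dir in PROJECT_DIRS:
        (root / relative_dir).mkdir(parents=True, exist_ok=True)
    # touch() creates the file if missing and keeps existing content
    (root / "requirements.txt").touch()
    (root / "code" / "__init__.py").touch()
    return root


# Demonstration in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    project_root = create_project_skeleton(Path(tmp_dir) / "my_project")
    created = sorted(p.name for p in project_root.iterdir())

print(created)
# ['code', 'data', 'models', 'notebooks', 'reports', 'requirements.txt']
```

Because the folder list is a plain Python list, adjusting the template for your own workflow only means editing `PROJECT_DIRS`.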
Tip 7: Consistent Naming Convention
Another way to introduce more consistency and reduce friction in team collaboration is to keep a naming convention for your models, data and code. There isn't a single best practice; it's about finding a method that suits your specific use cases. You may draw some inspiration from the HuggingFace or Kaggle model hubs, for example <model-name>-<parameters>-<model-version> or <model-name>-<data-version>-<use-case>.
And of course, documentation is always a good way to add extra detail behind the names. However, this is often easier said than done: we may stick to the convention for the first few days and then forget about it completely. One tip I've learned is to create a template file that follows the naming convention and save it in the working directory. It then serves as a reminder and can easily be duplicated for new files.
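A naming template can also be enforced in code, which makes the convention harder to forget. This sketch assumes the <model-name>-<data-version>-<use-case> pattern mentioned above; the helper names and the set of allowed characters are hypothetical choices.

```python
import re

# Each part: lowercase letters and digits, with dots or underscores allowed inside.
_PART = r"[a-z0-9]+(?:[._][a-z0-9]+)*"
NAME_PATTERN = re.compile(rf"^{_PART}-{_PART}-{_PART}$")


def is_valid_name(name):
    """Return True when a name follows the three-part convention."""
    return bool(NAME_PATTERN.match(name))


def build_model_name(model_name, data_version, use_case):
    """Join the three parts with hyphens and check the result."""
    name = f"{model_name}-{data_version}-{use_case}".lower()
    if not is_valid_name(name):
        raise ValueError(f"Name does not follow the convention: {name!r}")
    return name


print(build_model_name("xgboost", "20240101", "rain_forecast"))
# xgboost-20240101-rain_forecast
print(is_valid_name("final_model_v2"))
# False
```

A check like this could run before a model file is saved, so that names which drift from the convention are caught early.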
Take Home Message
The article discussed how to future-proof machine learning projects with three key improvements: modularization, versioning, and maintaining consistency.
Improvement 1: Modularization - Break Down Code into Smaller Pieces
Tip 1: Reuse code when possible by importing functions, modules, packages
Tip 2: Keep parameters in a separate parameter or config file
Improvement 2: Versioning - Proper Data, Code and Model Versioning
Tip 3: Data versioning
Tip 4: Model versioning
Tip 5: Code versioning
Improvement 3: Consistency - Consistent Structure and Naming Convention
Tip 6: Consistent directory structure
Tip 7: Consistent naming convention