In recent years, I have worked with several ML startups in different roles, including ML engineer, MLOps engineer, team lead, consultant, and advisor. Across those engagements, I noticed four mistakes that are very common among engineers and executives. In this article, I describe these 4 common mistakes in building an ML product, within the context of a wearable technology company.
I once led the machine intelligence team at a wearable technology company. The company built a gesture control armband that lets users control their environment with hand gestures. Our team developed an ML-based gesture recognition engine that takes muscle signals as input and recognizes the corresponding hand gestures.
To build an industry-grade machine learning (ML) product, we had to build and evaluate a large number of ML models. Therefore, we built a pipeline that properly administers all the development steps required to build an ML product, including data collection, model training, model evaluation, and model selection.
In this article, I share 4 mistakes that we could have made while building our ML product.
An ML product can rarely be shipped with a single ML model.
One of the common mistakes in building an ML product is insisting on a single ML model to address all scenarios. In the early days, we aimed to build one gesture recognition model for 7 billion people around the world. You can guess: it didn't work that way. So, we soon started looking into the raw muscle signals to gain better insights.
We found that the patterns and features we expected for each gesture were not consistent across users. In other words, a single ML model may not work for everyone. Individuals differ more than we had assumed in the early days: they have different muscle anatomy, skin conductivity, perspiration rates, forearm perimeters, and gesture styles. But what is the alternative solution?
Our experiments showed that forearm perimeter has a major effect on model performance. So, it seemed helpful to build a separate model for each range of forearm perimeters, and assign the relevant model to each user based on the group they belong to. This approach significantly improved model performance, but it still could not deliver the user experience quality we wanted.
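The group-based assignment described above can be sketched in a few lines. The bucket boundaries and model names below are purely illustrative, not the values we actually used:

```python
# Sketch: assign each user the model trained on their forearm-perimeter group.
# Bucket edges (in cm) and model ids are hypothetical examples.
from bisect import bisect_right

BUCKET_EDGES = [22.0, 26.0, 30.0]       # boundaries between 4 perimeter groups
BUCKET_MODELS = ["s", "m", "l", "xl"]   # one model trained per group

def model_for_user(forearm_perimeter_cm: float) -> str:
    """Pick the model trained on users with a similar forearm perimeter."""
    return BUCKET_MODELS[bisect_right(BUCKET_EDGES, forearm_perimeter_cm)]
```

`bisect_right` keeps the lookup O(log n) even if the number of groups grows.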
Then, we took another pivot. We decided to offer our users two options: a) a "population model" built in-house using our entire training dataset, and b) a "personalized model" that each user builds using their own data.
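The two-option strategy reduces, at serving time, to a simple fallback rule: use the user's personalized model when one exists, otherwise fall back to the population model. A minimal sketch (function and argument names are mine, not from our actual codebase):

```python
def select_model(user_id, population_model, personalized_models):
    """Return the user's personalized model if they have trained one,
    otherwise fall back to the shared population model."""
    return personalized_models.get(user_id, population_model)
```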
An ML product must not be evaluated on a controlled test dataset alone.
One of the common mistakes in building an ML product is evaluating it with a controlled test dataset that does not reflect real usage. We had to thoroughly evaluate the ML model's performance before each release. The first step was to ensure the internal test data was a true representation of the user data. However, we did not know our users well enough, especially in the early days: we did not know how, where, or by whom our products were used. Therefore, we doubted whether the test data could truly represent user data.
Our analysis showed that parameters such as forearm perimeter and skin condition significantly affect performance. So, if we wanted a true representation of user data, we had to make certain that our test dataset had enough samples across each parameter's range. That also helped us create a balanced training dataset.
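A quick way to audit this is to bin the test set along such a parameter and flag underrepresented bins. A minimal sketch, with hypothetical bin edges and threshold:

```python
# Sketch: flag test-set bins that have too few samples along one parameter
# (e.g., forearm perimeter in cm). Edges and threshold are illustrative.
from bisect import bisect_right
from collections import Counter

def underrepresented_bins(values, edges, min_per_bin):
    """Return {bin_index: count} for bins with fewer than min_per_bin samples."""
    counts = Counter(bisect_right(edges, v) for v in values)
    n_bins = len(edges) + 1
    return {b: counts.get(b, 0)
            for b in range(n_bins) if counts.get(b, 0) < min_per_bin}
```

Any non-empty result tells you which parameter ranges need more data collection before the test set can claim to represent users in that range.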
Unlike skin condition, forearm perimeter was easy to measure. Therefore, we collected data with a proper distribution of forearm perimeters, while we could not do the same for skin condition. Nevertheless, we measured skin condition with a conductivity meter for a small group of users for further analysis.
Moreover, we did not know the distribution of forearm perimeters among our users. You may say it must be a normal distribution; however, we did not have many users at that time, so the distribution was almost certainly skewed. That is, we could only guesstimate how a new release would affect the user experience. Nevertheless, we did everything we could to ensure our test data was a true representation of user data.
An ML product must be evaluated by problem-specific metrics.
We can easily evaluate an academic 2-class ML problem with standard metrics such as accuracy, precision, and recall. However, we cannot evaluate an ML product that easily. An ML product is often a multi-class problem with problem-specific configurations, each of which adds complexity to the evaluation framework. One of the common mistakes in building an ML product is evaluating it with generic metrics.
We designed an ML model to identify 5 hand gestures: Left, Right, Fist, Spread, and Snap. We evaluated the model's performance on each class with 3 standard metrics: accuracy, precision, and recall. That gave us 15 (3×5) performance numbers to consider, and they were not equally important. For example, a false positive on Fist has different consequences for the user experience than a false positive on Snap, and those consequences varied further across use cases. In short, given all these complexities, we had to create problem-specific metrics before we could confidently release a new ML model.
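One simple way to fold many unequal per-class numbers into a single release score is a weighted average, where the weights encode how costly each kind of error is for the product. The weights and metric values below are made up for illustration; this is not the scoring scheme we actually shipped:

```python
def weighted_release_score(per_class_metrics, weights):
    """Combine per-class metrics into one problem-specific score.

    per_class_metrics: {gesture: {"precision": p, "recall": r}}
    weights: {(gesture, metric_name): weight}, where a higher weight means
             that metric matters more for the user experience.
    """
    total = sum(weights.values())
    score = sum(per_class_metrics[g][m] * w for (g, m), w in weights.items())
    return score / total

# Hypothetical example: precision on Fist and Snap matters 3x more than recall.
metrics = {"Fist": {"precision": 0.90, "recall": 0.80},
           "Snap": {"precision": 0.95, "recall": 0.70}}
weights = {("Fist", "precision"): 3, ("Fist", "recall"): 1,
           ("Snap", "precision"): 3, ("Snap", "recall"): 1}
score = weighted_release_score(metrics, weights)
```

The point is not this particular formula but that the weights are a product decision, set from user-experience consequences rather than ML convention.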
An ML product needs a library of ML models just in case.
To build an ML product, we must build ML models frequently. Model performance did not necessarily improve consistently over time, so we had to archive the models along with their performance reports in a library for future retrieval. One of the common mistakes in building an ML product is assuming that a model's performance increases consistently as you feed it more data.
We needed to retrieve past models to conduct comparative analyses or to release an ML model that met the required acceptance criteria. Therefore, we stored useful metadata, such as performance metrics and hyperparameters, alongside the models. This helped us run deeper analyses on the results when needed. We also stored the train and test data next to each ML model.
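The archiving step can be as simple as writing the serialized model next to a JSON metadata file. A minimal local-filesystem sketch (directory layout and field names are my own; a real setup would likely target S3 or an artifact store, as discussed below):

```python
import json
import pathlib
import pickle
import time

def archive_model(model, metrics, hyperparams, library_dir="model_library"):
    """Store a pickled model next to a JSON metadata file for later retrieval."""
    model_id = f"model_{int(time.time() * 1000)}"
    root = pathlib.Path(library_dir) / model_id
    root.mkdir(parents=True, exist_ok=True)
    (root / "model.pkl").write_bytes(pickle.dumps(model))
    (root / "meta.json").write_text(json.dumps(
        {"id": model_id, "metrics": metrics, "hyperparams": hyperparams}))
    return model_id
```

Keeping the metadata in plain JSON means later queries (e.g., "find all models with recall above 0.9") need no special tooling.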
You can create this library in a cloud storage service such as Amazon S3, a universal artifact management solution such as JFrog, or a storage compartment provided by a continuous integration service such as CircleCI. The best practice is to use an artifact management solution, which gives you better access to the list of stored ML models.
Related link: Storing Metadata from Machine Learning Experiments
Here is another example. Imagine you are a machine intelligence lead at a company. For the past several months, you have been working hard to design an ML model meeting acceptance criteria X. The deadline arrives and, unexpectedly, you are told the company needs an ML model meeting acceptance criteria Y. You do not have time to redo all the training and testing. What you can do is run a query on the model library and retrieve the model whose performance is closest to the required criteria.
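Such a query is straightforward once every archived model carries its metrics. A sketch that picks the archived model with the smallest squared distance to the target criteria (the library entries and target values are hypothetical):

```python
def closest_model(library, criteria):
    """Return the archived model whose metrics are closest to the criteria.

    library:  list of {"id": ..., "metrics": {metric_name: value}}
    criteria: {metric_name: target_value}
    """
    def distance(entry):
        return sum((entry["metrics"][k] - v) ** 2 for k, v in criteria.items())
    return min(library, key=distance)

# Hypothetical library of two archived models.
library = [{"id": "a", "metrics": {"precision": 0.90, "recall": 0.70}},
           {"id": "b", "metrics": {"precision": 0.85, "recall": 0.88}}]
best = closest_model(library, {"precision": 0.86, "recall": 0.90})
```

In practice you might also filter out models that fail a hard minimum on any metric before ranking by distance.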
- Have an alternative strategy in case a single ML model cannot address your needs
- Ensure the test data is a true representation of the user data
- Use problem-specific metrics to evaluate the end-to-end performance
- Build a library of ML models, and tag them with useful metadata