Machine learning model…. Cloud deployment strategies…. BANG ZOOM WOW!! OK now that I have your attention, let’s talk shop! Deploying a Machine Learning model to enhance the quality of your company’s analytics is going to take some effort:
- To clean data
- To clearly define objectives
- To build strong project management
Many articles have been written on the first two bullet points (garbage in, garbage out), so I will focus on the third: managing a machine learning project.
I’ve put together a list of my top recommendations to help managers along their journey to machine learning mastery. The recommendations, in no particular order, are:
- Clearly define your goals and success criteria
- Give your data science team time to succeed
- Redefine your use case as a machine learning problem statement
- Figure out what it would take to solve a problem without using machine learning
- Understand your correct ratio of data engineers to data scientists
- Architect with CI/CD in mind from the beginning
- Leverage data science code for data pipelining in production
Clearly Define Your Goals and Success Criteria
Sometimes companies view data science as a pure research department. Companies like Google and Amazon can afford to burn millions of dollars on research every year, but in the corporate world, more often than not, people will want to see value in the work. Yes, I agree it's OK to fail, but losing your departmental budget because your projects yield no value is not. My suggestion is to start small and identify a quick win.
Lay out your goals. Ask yourself:
- What are we trying to achieve?
- What questions/problems do we want to answer/solve?
Try to answer these questions at a high level first (e.g., determine naturally occurring customer segments, predict equipment failures with greater accuracy), then ask yourself: why? Why is it important to determine customer segments? Why bother predicting equipment failures? This seems like an obvious step, but it's one that is often overlooked. Data scientists tend to start working a theory before deciding whether it's a good investment of time.
Remember:
- Before digging a trench, figure out why you’re digging.
- If you cannot come up with a good reason for digging until you've dug a hole, odds are no one else needs that hole.
You can build the best model in the world and still find that no one ever needed it. Going through the process laid out above helps avoid the frustration of watching no one appreciate the hard work that went into digging the hole.
Now lay out your success criteria (SC): the metrics against which your goals will be measured. The SC should follow from whatever is driving your need for ML in the first place. For instance, if your company is looking for ways to increase gross margin, your success criteria should be a function of gross margin (e.g., we will be successful if the model can find 3% gross margin in lost deals).
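To make that concrete, here's a minimal sketch, assuming a hypothetical set of lost-deal records, of how the SC becomes a single measurable check. The field names and the 3% threshold are illustrative, not prescriptive.

```python
# Hypothetical success-criteria check: did the model "find" at least 3%
# gross margin in lost deals? All field names and data are illustrative.

def meets_success_criteria(lost_deals, threshold=0.03):
    """lost_deals: list of dicts with 'gross_margin' (dollars) and
    'model_flagged_winnable' (bool, set by the model)."""
    total_margin = sum(d["gross_margin"] for d in lost_deals)
    recovered_margin = sum(
        d["gross_margin"] for d in lost_deals if d["model_flagged_winnable"]
    )
    return recovered_margin / total_margin >= threshold

deals = [
    {"gross_margin": 120_000, "model_flagged_winnable": False},
    {"gross_margin": 45_000, "model_flagged_winnable": True},
    {"gross_margin": 300_000, "model_flagged_winnable": False},
]
print(meets_success_criteria(deals))  # True: 45k / 465k is about 9.7%, >= 3%
```

If your SC can't be reduced to a check this simple, it probably isn't measurable enough yet.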
Give Your Data Science Team Time to Succeed
This is an unpopular statement that no manager wants to hear: we cannot guarantee positive results. This is where managing expectations comes into play. Executive leadership needs to understand that data science projects require an initial investment of time and money. There are many ways to manage those expectations, most of which go beyond the scope of this article, but here are some tactics that have helped:
- Take on smaller, non-data science projects to keep the lights on while your DS team works towards a breakthrough
- Find someone in executive leadership to help champion your case
- Don’t over-invest in technology until you have a rock-solid use case
- Perform pilot or POC projects before investing months of development time (think before you dig!)
Redefine Your Use Case as a Machine Learning Problem Statement
This is the fun part, and the part we are all familiar with: translating business requirements into technical ones. First, frame the request as a problem that needs to be solved. Doing this will not only ensure you heard your users correctly but also highlight the value proposition (or lack thereof) in the request. Once you've done that, you have an analytics use case. Let's take a simple example. Suppose your users say, "We want to personalize product recommendations based on total sales, customer attributes, and like-product reviews." I would reformulate this as a machine learning problem statement: "Can we create a model that maximizes total sales based on customer attributes and like-product reviews?" This statement doesn't answer the question, but it's a perfect lead-in to a conversation about which attributes will be used as the feature set for the model.
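To show how that problem statement starts shaping the technical work, here's a minimal sketch of a model trained on the kinds of attributes mentioned above. The column names, toy data, and scikit-learn setup are my own assumptions for illustration; the point is that every business attribute has to become an explicit column in the feature set, and each column is a conversation with the business.

```python
# Hypothetical feature set derived from the problem statement above.
# Column names, data, and model choice are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

data = pd.DataFrame({
    "customer_tenure_months": [3, 48, 12, 30],        # customer attribute
    "avg_like_product_rating": [4.2, 3.1, 4.8, 2.5],  # like-product reviews
    "prior_purchases": [1, 22, 5, 9],                 # customer attribute
    "total_sales": [120.0, 980.0, 310.0, 450.0],      # the target we maximize
})

features = data.drop(columns="total_sales")
model = GradientBoostingRegressor().fit(features, data["total_sales"])

# Each column is a decision to revisit with the business: is "like-product
# rating" the right proxy for reviews? Is tenure in months or years?
print(dict(zip(features.columns, model.feature_importances_)))
```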
Figure Out What it Would Take to Solve a Problem Without Using Machine Learning
This might seem like a no-brainer, but it's a big one. It speaks to the practical issues that arise when you skip the previous step: if you cannot reframe your use case as a machine learning problem statement, you might not have a problem that needs ML.
Have you ever implemented something new at your company? Something radically different from the status quo? If you have, you know there will always be naysayers and disbelievers. Odds are some of them are the same people you're trying to convince to fund the project. I've seen ML projects stop dead in their tracks and turn into YART (yet another reporting task) because, during a presentation, someone says, "that looks like it could have been done with 5 if/else statements" (We hate that guy). We've all been swept up in the excitement of using the brand-new shiny technology stack, but ML is not a panacea for every problem.
Some quick questions to help sniff out ML-poser use cases:
- Can I deliver a solution (or part of the solution) deterministically?
- How long would it take to code a solution to the posed problem?
- Is there an out-of-the-box application that would work?
- Is someone else already doing this elsewhere in the company?
Take some time to diagram out the basic architecture of such a solution. It might turn out that your use case can be delivered as a simple dashboard or a quick Python script that spits out a one-time analysis. Don't waste time on use cases that can be handled by the BI team; remember, your job is to help take your company/department to the next level of analytics greatness.
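To illustrate the "5 if/else statements" trap, here's a hedged sketch of how a proposed ML use case (say, flagging at-risk equipment) might be handled deterministically. The thresholds and field names are invented for the example.

```python
# A deterministic stand-in for a proposed "equipment failure prediction"
# model. Thresholds and field names are hypothetical.

def flag_at_risk(reading):
    """Return True if a sensor reading suggests imminent failure."""
    if reading["vibration_mm_s"] > 7.1:        # vibration alarm threshold
        return True
    if reading["temp_c"] > 95:                 # running hot
        return True
    if reading["hours_since_service"] > 5000:  # overdue for maintenance
        return True
    return False

print(flag_at_risk({"vibration_mm_s": 8.0, "temp_c": 60, "hours_since_service": 1200}))
# True -- and no training data, GPUs, or model registry required.
```

If a handful of rules like this captures most of the value, the ML version has a much higher bar to clear.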
Understand the Correct Ratio of Data Engineers to Data Scientists
Not all data scientists are created equal!
Some data science (DS) folks are born out of the fire that is data engineering. Others are specialist PhD-holding gurus. Each has their own strengths and weaknesses, but all of them will have to manipulate data. To ease the burden on them and your systems, you'll need dedicated people to help them get the data they need, when they need it. This is especially important if your DS types are relatively new to your organization and don't know their way around the data landscape. Bring in some seasoned data engineers and your timeline will shrink dramatically. It helps if they have experience working with DS teams, since a lot of data science requires data in a very strict, predictable (pun intended) format that lends itself to feature engineering.
Architect with CI/CD in Mind From the Beginning
The background of this recommendation is outside the scope of this article, but I plan on doing a follow-up that elaborates on it. Data pipelines will have to be built to support the model that goes into production, but unlike batch reporting, machine learning (ML) use cases tend to target real-time needs. Even near-real-time models pose a challenge for engineers. Worse still are models with a shifting feature set based on the most recent data (think about how you would handle "Most Active Customers in the past 30 days" as a feature input to a model).
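To make the pipeline challenge concrete, here's a minimal pandas sketch of that "Most Active Customers in the past 30 days" feature. The order data and column names are assumptions; the point is that the feature is only valid as of a given date, so the pipeline, not the model, has to recompute it every time an inference is requested.

```python
import pandas as pd

# Hypothetical order history; columns are illustrative.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2024-05-01", "2024-05-20", "2024-03-02",
        "2024-05-05", "2024-05-18", "2024-05-25",
    ]),
})

def orders_last_30_days(orders, as_of):
    """Per-customer order counts in the 30 days before `as_of` --
    a feature that shifts every day and must be rebuilt at inference time."""
    window = orders[(orders["order_date"] > as_of - pd.Timedelta(days=30))
                    & (orders["order_date"] <= as_of)]
    return window.groupby("customer_id").size().rename("orders_30d")

print(orders_last_30_days(orders, pd.Timestamp("2024-05-26")))
# Run it a week later and customer 1's count may change -- the pipeline,
# not the model, owns that freshness problem.
```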
Go back to our earlier product recommendation example. Wouldn't you want your model to incorporate the most up-to-date reviews and customer attributes? Of course, this all depends on the nature of your industry: business-to-consumer (B2C) marketing is much more real-time than, say, pipeline routing, where the majority of contracts are negotiated long in advance. What about new customers? Will the model need to make recommendations the same day for a new customer? Understanding how your data changes over time is key to building sustainable pipelines that support model development and refinement.
Leverage Data Science Code for Data Pipelining in Production
I saved the best for last, and this is neither the first nor the last time I will write about it: leverage the work your data science team has already done. DS work is similar to data analytics in that both teams need ready access to multiple data sources, time to massage data, and direction on business objectives; they differ in process. DS is an R&D activity. Your DS team shouldn't be fixated on delivering day-to-day operational metrics. DS teams are like wild horses: they need room to stretch their legs and run. And like a wild horse, if you follow it long enough, you'll find water.
Your DS team will likely spend a lot of time in applications like Jupyter, R Studio, or PyCharm (the last of which offers more GUI tooling for CI/CD and version control). A notebook-style environment for importing, transforming, and modeling data appeals to the DS team because these tools operate as a GUI on top of a runtime language: they can selectively execute small blocks of code, check the results, iterate quickly, then move on. This makes the DS development process much smoother… for development.
Once a model is trained, productionizing notebook code can be a nightmare. We need to productionize this code because the work the DS team did to build the model involved a lot of data prep (bringing in data from multiple sources, joins, filters, etc.). A full treatment of why all of this needs to be captured and recreated for production pipelines is beyond the scope of this post, but here are some of the reasons:
- The feature set used to train the model will be needed when an inference is requested for any supervised learning models
- The calculations to obtain numerical features (10-day averages, running totals, etc.) might be tailored to fit the specific use case (net sales can mean ANYTHING)
- Writing these features to persistent storage in your warehouse might cause governance issues if these calculations are not properly documented with the data
Your engineering team will need access to the code base where the DS team keeps its work under revision (if you aren't using some sort of version control, shame on you). Also, don't keep your engineers in the dark until the model is ready to go to production (they need sunlight and fresh water to grow). Stand-ups should include engineering representatives as well as data scientists. These incremental updates will help everyone code toward a common goal: automating production delivery.
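One way to act on this, offered as a sketch rather than a prescription: pull the feature-prep logic out of the notebook into a plain module that both the training job and the production inference service import. Every name below is hypothetical.

```python
# features.py -- hypothetical shared module, extracted from the notebook.
# Both training and production import this, so "net sales" is computed
# exactly one way, in exactly one place.
import pandas as pd

def build_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Same joins/filters/calcs the DS team used to train the model."""
    df = raw.copy()
    df["net_sales"] = df["gross_sales"] - df["returns"]  # one documented definition
    df["avg_order_value"] = df["net_sales"] / df["order_count"].clip(lower=1)
    return df[["net_sales", "avg_order_value"]]

# notebook / training job:
#     X = build_features(training_extract)
# production inference service:
#     X = build_features(fetch_latest(customer_id))  # hypothetical fetch
```

The payoff is that the inference-time feature set (the first bullet above) and the tailored calculations (the second) live in one documented place, which also goes a long way toward the governance concern in the third.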