Machine Learning Deployment Options

So you’ve built your machine learning model…

It’s time to take the next step, often one of the least talked about: putting your model into production, also known as model deployment.

Deployment of your machine learning model means making your model available to your other business systems.

There are different ways to perform model deployment and we’ll discuss a few of them in this post.

Model Deployment Styles

There are two styles of deploying models, depending on how we want to run inference.

Batch Model Serving

Usually, in batch serving, we have a database where we store the new incoming data points before making predictions on them. A batch process runs periodically to fetch a set of those data points and pass them to the model. The model then iterates over all the data points and produces a prediction for each. An operational database is then updated with the latest predictions, and we can serve the scores to the different systems that want to consume them.

In application domains where data consumers must be updated with new data in real time, this approach is not suitable.
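The batch flow described above can be sketched roughly as follows. This is a minimal illustration, not a production job: the table names (incoming, predictions), the scored flag, and the predict function are all assumptions made for the example, and an in-memory SQLite database stands in for both the staging store and the operational database.

```python
import sqlite3


def predict(feature: float) -> float:
    """Stand-in for a trained model (hypothetical): scores one data point."""
    return 2.0 * feature + 1.0


def run_batch(conn: sqlite3.Connection, batch_size: int = 100) -> int:
    """Fetch up to batch_size unscored rows, score them, store the results."""
    rows = conn.execute(
        "SELECT id, feature FROM incoming WHERE scored = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, feature in rows:
        conn.execute(
            "INSERT INTO predictions (id, score) VALUES (?, ?)",
            (row_id, predict(feature)),
        )
        # Mark the row so the next periodic run does not score it again.
        conn.execute("UPDATE incoming SET scored = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with three incoming data points."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE incoming "
        "(id INTEGER PRIMARY KEY, feature REAL, scored INTEGER DEFAULT 0)"
    )
    conn.execute("CREATE TABLE predictions (id INTEGER PRIMARY KEY, score REAL)")
    conn.executemany(
        "INSERT INTO incoming (feature) VALUES (?)", [(1.0,), (2.0,), (3.0,)]
    )
    conn.commit()
    return conn
```

In practice, a scheduler such as cron would call run_batch periodically, and consumers would read from the predictions table.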

Real-time Model Serving

Here, predictions are provided in real time. We have two options:

Message Streaming

One way to improve on the operational database process described in the previous section is to continuously update the database as new data is generated. You can use a scalable messaging platform like Kafka to send newly acquired data to a long-running Spark Streaming process. The Spark process can then make a new prediction based on the new data and update the operational database.
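The shape of such a long-running consumer can be sketched without any external infrastructure. In the example below, an in-process queue stands in for the Kafka topic, a worker thread stands in for the Spark Streaming process, and a dictionary stands in for the operational database. All names (consume, STOP, the message fields) are assumptions for illustration only.

```python
import queue
import threading


def predict(feature: float) -> float:
    """Stand-in for a trained model (hypothetical)."""
    return 2.0 * feature + 1.0


STOP = object()  # sentinel used to shut the consumer down in this demo


def consume(messages: queue.Queue, operational_db: dict) -> None:
    """Long-running loop: score each message as soon as it arrives."""
    while True:
        msg = messages.get()  # blocks until a new data point is published
        if msg is STOP:
            break
        operational_db[msg["id"]] = predict(msg["feature"])


def run_demo() -> dict:
    messages: queue.Queue = queue.Queue()
    operational_db: dict = {}
    worker = threading.Thread(target=consume, args=(messages, operational_db))
    worker.start()
    # A producer publishes new data points as they are generated.
    for i, feature in enumerate([1.0, 2.0, 3.0]):
        messages.put({"id": i, "feature": feature})
    messages.put(STOP)
    worker.join()
    return operational_db
```

Unlike the batch job, nothing here waits for a schedule: each prediction is written as soon as its data point is consumed.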

Web Services

A common implementation pattern is to put a web service wrapper around the model: we deploy a container that includes the model and the libraries needed to expose a REST API. This API accepts a request with JSON data, passes the data to the function that contains the model, and returns the predicted response.

The one thing to look out for with this deployment pattern is managing the infrastructure to handle the load that comes with concurrency. If several clients call the service at the same time, multiple API calls hit the endpoint at once. If there is not sufficient capacity, calls may take a long time to respond or even fail, which will be an issue for our system.
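As a rough sketch of this wrapper, the example below uses only the Python standard library. The request shape ({"features": [...]}), the predict function, and the port are assumptions made for illustration; a real service would more likely use a web framework and a trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(features):
    """Stand-in for a trained model (hypothetical): averages the inputs."""
    return sum(features) / len(features)


def handle_json(body: bytes) -> bytes:
    """Parse the JSON request, call the model, and build the JSON response."""
    payload = json.loads(body)
    return json.dumps({"prediction": predict(payload["features"])}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_json(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port: int = 8080) -> None:
    """Start the service; one thread per request gives basic concurrency."""
    ThreadingHTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling serve() starts the service. Keeping handle_json separate from the HTTP handler makes the model call easy to test without opening a socket.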

Serving Patterns

There are two different patterns for putting models into production as web services. Let’s take a look:

1. Embedded Model

In this pattern, the model is embedded in the application. We treat the application artifact and version as being a combination of the application code and the chosen model.

When can we use this pattern?

When our model and the consuming application are both written in the same language (most cases; check the options below).

How to implement this pattern?

The model should be exported as a serialized object and pushed to storage.
When building our application, we pull it and embed it inside the same Docker container, for instance. Thus, the Docker image becomes our application and model artifact that gets versioned and deployed to production with the application itself.

Options to implement the pattern:

We have several options for implementing the embedded model pattern. Let’s name a few:

  • Serialize the model, for example with “Pickle”, a standard Python serialization library.
  • Share the model via a language-agnostic exchange format, like PMML, PFA, or ONNX.
  • Add the model as a dependency by using tools like H2O to export it as a POJO. The benefit of this approach is that you can train the models in a language familiar to Data Scientists, such as Python or R, and export the model as a compiled binary that runs in a different target environment (JVM), which can be faster at inference time.
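The first option, serializing with pickle, can be sketched as follows. To keep the example self-contained, the "model" is a tiny one-feature linear fit whose learned parameters are stored in a dictionary; the function names and the parameter layout are assumptions for illustration. In a real pipeline the bytes would be pushed to storage and pulled when the application image is built.

```python
import pickle


def train(xs, ys) -> dict:
    """Fit y = w * x + b by least squares and return the learned parameters."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    return {"w": w, "b": mean_y - w * mean_x}


def export_model(model: dict) -> bytes:
    """Training side: serialize the model so it can be pushed to storage."""
    return pickle.dumps(model)


def load_model(blob: bytes) -> dict:
    """Application side: load the embedded artifact.

    Only unpickle data from a trusted source, since unpickling can run code.
    """
    return pickle.loads(blob)


def predict(model: dict, x: float) -> float:
    return model["w"] * x + model["b"]
```

For example, predict(load_model(export_model(train([0, 1, 2], [1, 3, 5]))), 3) restores the model and scores a new point inside the application process, with no network call involved.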

2. Model Deployed as a Separate Service

In this pattern, the model is wrapped as a service and then can be deployed independently of the consuming applications.

One of the biggest advantages of this approach is the ability to update the model independently. On the other hand, some latency could be introduced to the inference time, as there will be some sort of remote invocation required for each prediction.
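From the consuming application's side, the remote invocation might look like the sketch below. The service URL and the JSON request/response shape are hypothetical; the point is only that every prediction now involves a network round trip, which is where the extra latency comes from.

```python
import json
import urllib.request

# Hypothetical address of the independently deployed model service.
MODEL_URL = "http://model-service.internal:8080/predict"


def build_request(features, url: str = MODEL_URL) -> urllib.request.Request:
    """Package the features as a JSON POST request for the model service."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def remote_predict(features, url: str = MODEL_URL, timeout: float = 2.0):
    """Call the model service; a timeout guards against a slow or absent model."""
    with urllib.request.urlopen(build_request(features, url), timeout=timeout) as resp:
        return json.loads(resp.read())["prediction"]
```

Because the application only depends on the URL and the JSON contract, the model behind the service can be retrained and redeployed without rebuilding the application.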

Serving Technologies

There are quite a few technologies that can be used to power a prediction web service.

1. Server-based

Deploying on a standalone server (a VM, as it is called in GCP, or an EC2 instance, as in AWS, …)

Pros:

  • Low latency: fast responses, since none of the files need to be created again.
  • Supports heterogeneous development environments, so you can work with any technology stack you want.
  • You can allocate computing resources according to your application’s requirements.

Cons:

  • Unnecessary charges for keeping the server(s) up even when we are not making requests.
  • Overall responsibility for the maintenance and uptime of the server is on us.
  • As our usage scales, we need to manage scaling our server up as well as scaling it down.

With so many issues mentioned above, it’s a lot of work for smaller companies and individuals to handle, which in turn affects the overall time to market and cost of delivery.

That is, “unless Kubernetes”: most of the cons above are mitigated by Kubernetes, the modern go-to for deploying servers, which makes availability and downtime far less of a worry. This is a huge topic, though, so we will leave it for future blog posts.

2. Serverless

With serverless computing, the cloud provider executes your code (or your stack in general) on dynamically allocated, mostly predefined resources. Pricing is based on the actual amount of resources consumed by an application, rather than on pre-purchased units of capacity.

Pros:

  • Lower compute cost compared with traditional servers. (Mostly; AWS Fargate, for instance, is an exception.)
  • Lower cost of managing server resources. (Mostly; AWS Fargate, for instance, is an exception.)
  • The backend of an application automatically and inherently scales up and down to meet demand.
  • Eliminates infrastructure complexity, so you can focus more on developing your product and on business outcomes.

Cons:

  • Latency issues, especially on cold start. Your stack is put to sleep when idle, which may cause a longer response time on the next call.
  • Resource allocation limitations.

An important difference between serverless and server-based deployment is that anything on serverless is stateless, and only stateless. While statelessness is a good design principle, it may not fit your case, and you may need a server for this reason.

Why is serverless mostly suitable for deploying ML models?

When a machine learning model goes into production, it is very likely to be idle most of the time. There are a lot of use cases where a model only needs to run inference when new data is available.

If we have such a use case and deploy the model on a server, it will eagerly keep checking for new data, only to be disappointed for most of its lifetime, and meanwhile you pay for the server’s uptime.

However, because of some limitations of serverless, not every model can be deployed this way. The size of the deployment package and RAM usage are just two of those limitations.
But when serverless is one of the options, it is advisable to use it.

There are two ways to serve your model with serverless:

1- Serverless Containers, such as:

2- Serverless Functions; as opposed to being packaged in a Docker container, you deploy code as functions, such as:

Functions place more constraints on how your code is deployed and support only a specific set of languages, but they offer the ability to trigger your code using events in your cloud environment.

Here, you can find a decision diagram made by Google to guide you through choosing the best serverless service that suits your needs. It’s designed for Google’s services but can be helpful in general.

Finally, on choosing between web services and message streaming: according to this source, REST APIs are best suited to request/response interactions in which the client application sends a request to the API backend over HTTP, while message streaming is best suited to notifications about new data or events that you may want to act upon.

In-App Machine Learning Inference

Here, inference is performed on the device itself. This allows models to remain usable in situations with limited network capacity and pushes the compute requirements onto the devices themselves. Moreover, when legal or privacy requirements do not allow data to be stored outside of an application, or when there are constraints such as having to upload a large number of files, running the model within the application tends to be the right approach.

On Android, ML Kit lets you use models within native applications, while TensorFlow.js and ONNX.js allow running models directly in the browser or in apps that use JavaScript.

Although these devices may lack the disk space, computing power, or battery life of a server, they can still handle complex machine learning models. However, this requires a handful of tricks that come at the cost of a few percentage points of accuracy.
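The post does not name the tricks it has in mind. One commonly used technique of this kind is weight quantization, where each weight is stored as a small integer instead of a 32-bit float. The sketch below is a simplified linear 8-bit scheme written for illustration only; mobile frameworks ship their own, more sophisticated implementations.

```python
def quantize(weights, bits: int = 8):
    """Map float weights to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale: float, lo: float):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + lo for c in codes]
```

Each restored weight differs from the original by at most half a quantization step, which is the source of the small accuracy loss, while the stored model shrinks to roughly a quarter of its 32-bit size.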

If you’re planning to go with this approach, or just curious about this topic, take a look at these helpful links:

1- How smartphones handle huge Neural Networks
2- SqueezeNet

ML Cloud Platforms to Serve Your Model

The trend of making everything-as-a-service has affected the Machine Learning industry too. Several companies, such as Amazon, Microsoft, and Google, now offer machine learning as a service on top of their existing cloud services.

We went through a few of the best machine learning platforms on the market in previous posts:

These platforms offer you training and deployment services.

If you are planning to use these services, here is a brief comparison of the three best-known ML services, Google AI Platform, AWS SageMaker, and Azure Machine Learning, in terms of deployment:

All these services can serve your model as a web service and let you choose between batch and real-time inference. The differences in the serving process are as follows:

  • Azure Machine Learning. Serving technology: serverless (container/function) or server-based. Serving pattern: embedded.
  • Google AI Platform. Serving technology: serverless container. Serving pattern: embedded.
  • Amazon SageMaker. Serving technology: server-based. Serving pattern: separate service.


In this post, we discussed the options available for deploying your own machine learning model. Here is our final decision diagram, which we hope will help you make your decision:

