Saving and Using Model Artefacts

One of the key requirements for creating apps that use AI/ML is the ability to save models after training, in a format that allows the model to be shared and deployed.

This is similar to software packages that can be shared and used through language-specific package managers such as Maven for Java.

When it comes to models, though, there are multiple pieces of data that need to be ‘saved’. These are:

1) the model architecture – provides the frame to fit the weights and biases into, as well as any layer-specific logic and custom layer definitions.

2) the weights and biases – the output of the training process

3) model configuration elements

4) post processing (e.g., beam search)

We need to ensure that the above items, as they exist in memory after training, can be serialised (“saved to disk”) in a safe and consistent way that allows them to be loaded and reused in a different environment.
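To make the distinction concrete, here is a minimal PyTorch sketch (the class name and layer sizes are illustrative, not from any particular model): the architecture is the class definition, while the weights and biases live in the state dict that training produces.

```python
import torch
import torch.nn as nn

# 1) The architecture: a class defining the layers and the forward pass.
class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(16, 8)
        self.out = nn.Linear(8, 2)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

model = TinyClassifier()

# 2) The weights and biases: the state dict that training updates.
print(list(model.state_dict().keys()))
# ['hidden.weight', 'hidden.bias', 'out.weight', 'out.bias']
```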

All the major ML libraries in Python provide a mechanism to save and load models. There are also packages that can take a model object created by these libraries and persist it to disk.

TLDR: Jump straight to the Conclusions section below.

Major Formats for Saving the Model

The simplest type of model is one that has an architecture with no custom layers or complex post-processing logic. Essentially, it is just a set of tensors (numbers arranged in multi-dimensional arrays). Examples include linear regression models and fully connected feed-forward neural networks.

More complex models may contain custom layers, multiple inputs and outputs, hidden states, and so on.

Let us look at the various options available for saving a model…

Pickle File

This is the Python-native way of persisting objects. It is also the default method supported by PyTorch and therefore by libraries built on top of it (e.g., the Transformers library).

The pickle process serialises the full model object to disk, which makes it very flexible. The pickle file records the classes and functions needed to rebuild the object along with all of its data. Therefore, as long as the libraries used to create the model are available in the target environment, the model object can be reconstructed and its state restored with the weights and biases. The model is then ready for inference.
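As a minimal sketch (using a toy PyTorch model and a placeholder file name), saving a whole model with torch.save relies on pickle under the hood; on recent PyTorch versions, weights_only=False is needed to unpickle the full object:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))

# Saving the whole object relies on pickle: the file records which classes to
# reconstruct and the tensors holding their state.
torch.save(model, "model.pkl")

# Loading unpickles the object, so the code that defines it (here torch.nn)
# must be importable in the target environment.
restored = torch.load("model.pkl", weights_only=False)
restored.eval()
```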

A key risk with pickle files is that the constructor code used to rebuild the model object can be replaced with arbitrary code. For example, a call to a layer constructor can easily be swapped for any other Python function call (e.g., one that scans the drive for important files and corrupts them).
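A minimal (and deliberately harmless) sketch of the mechanism: pickle lets an object specify what gets called when it is loaded, so a crafted file can substitute any callable.

```python
import pickle

class NotAModel:
    # __reduce__ tells pickle how to rebuild the object on load; a malicious
    # file can point this at any callable (os.system, shutil.rmtree, ...).
    def __reduce__(self):
        return (print, ("arbitrary code ran at load time!",))

payload = pickle.dumps(NotAModel())
pickle.loads(payload)  # prints the message; a real attack would do far worse
```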

Safetensors

With the popularity of generative AI and model-sharing platforms such as Hugging Face, it became risky to have all this serialised data flying around. Python libraries like Transformers allow one-line download and invocation of models, further increasing the risk of using pickle files.

To get around this, Hugging Face developed a new format for storing ML models called ‘safetensors’. This was a ‘from scratch’ development designed around how such artefacts are actually shared and used.

The safetensors library is written in Rust (i.e., really fast) and is not bound to the Python ecosystem. It is designed to store only tensor data, with deliberately restricted ‘execution’ capabilities. It is quite simple to use and has helper methods to save files from different ML libraries (e.g., PyTorch).
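A rough sketch of how saving and loading looks with the PyTorch helpers (the tensor names and file name are just placeholders):

```python
import torch
from safetensors.torch import save_file, load_file

# Safetensors stores a flat dictionary of named tensors -- no code, no pickle.
tensors = {
    "hidden.weight": torch.randn(8, 16),
    "hidden.bias": torch.zeros(8),
}
save_file(tensors, "model.safetensors")

# Loading returns plain tensors; the architecture is rebuilt separately and
# the weights applied via model.load_state_dict(...).
restored = load_file("model.safetensors")
print(restored["hidden.weight"].shape)  # torch.Size([8, 16])
```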

GGML

If you use a MacBook Pro with an Apple processor for your ML development, then you may be familiar with the GGML format. GGML is optimised for running on Apple hardware, has various tweaks (e.g., 16-bit representations to reduce memory footprint) that allow models to be run locally, and is written in C, which makes it very efficient. As an aside, GGML is also the name of the library used to run models stored in the GGML format.

A major drawback of GGML is that it is not a native format. Saved PyTorch models have to be converted into GGML with scripts, so each architecture type needs its own conversion script. This, along with other issues such as the lack of backward compatibility when the model structure changed, led to the rapid decline of GGML.

GGUF

This format was designed to overcome the issues with GGML (particularly the lack of backward compatibility) while preserving its key benefit: being able to run state-of-the-art models, through quantisation, in a resource-constrained environment such as your personal laptop. Quantisation reduces the precision of the floating point numbers used to represent the model’s weights. This can reduce the amount of storage and memory needed without adversely impacting model performance.
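To illustrate only the storage saving (GGUF uses its own quantisation schemes; this toy sketch just casts float32 to float16):

```python
import torch

weights = torch.randn(1024, 1024)   # float32: 4 bytes per value
half = weights.to(torch.float16)    # float16: 2 bytes per value

print(weights.element_size() * weights.nelement())  # 4194304 bytes
print(half.element_size() * half.nelement())        # 2097152 bytes
```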

If you have used Nomic’s GPT4All (based on llama.cpp) to run LLMs locally, you will have used a quantised model in the GGUF format. You will also have used GGUF if you run models on Apple hardware or use llama.cpp directly.
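As a rough usage sketch, assuming the llama-cpp-python bindings are installed and a quantised GGUF file has already been downloaded (the file path and prompt are placeholders):

```python
from llama_cpp import Llama

# Load a quantised GGUF model from disk; llama.cpp handles inference on the
# CPU (or Apple's Metal backend where available).
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")

output = llm("Q: Which file format is this model stored in? A:", max_tokens=32)
print(output["choices"][0]["text"])
```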

Conclusions

There are three main model formats to consider when it comes to consuming external models and distributing our own model:

Pickle: if you are consuming external models for experimentation without the intention of putting them into production.

Safetensors: if you are ready to distribute your model or are planning to consume an external model for production deployment.

GGUF: if you are on Apple hardware, want to run high-performance models in a resource-constrained environment, or want to use something like GPT4All to ‘host’ the model instead of, say, the Python Transformers library to access and run it.
