Web Server for LLMs

One thing that really bugs me when running larger LLMs locally is the load time for each model. The larger the on-disk size of the model, the longer it takes for the model to be ready to query.

One solution is to run the model in a Jupyter notebook so that it is loaded once and can then be queried as many times as we want in subsequent code blocks. But this is not ideal because we can still hit issues that force us to restart the notebook. These issues usually have little to do with the Gen AI model itself (which in most use cases is treated as a closed box).

Another requirement is being able to run your application and your LLM on different machines/instances. This is a common requirement if you want to build a scalable Gen AI-based app. It is also the main reason we have services like the Google Model Garden, which provide API-based access to models instead of requiring you to host them yourself.

To get around this, I developed a web server using Python Flask that can load any model accessible through the Hugging Face transformers library (using a plugin approach). It is mainly useful for testing and learning, but it can be made production-ready with little effort.

The code can be found here: https://github.com/amachwe/gen_ai_web_server

Key steps to wrap your model and set up the web server:

  1. Load the tokenizer and model using the correct class from the Hugging Face transformers Python library.
  2. Create the wrapped model using the selected LLM_Server_Wrapper and pass that to the server.
  3. Start the server.

The above steps are shown as code below. The code is also available in the llm_server.py file in the GitHub repo linked above.
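As a rough illustration, the three steps above could be sketched as follows. This is a minimal sketch, not the repo's actual implementation: the LLM_Server_Wrapper interface shown here (its constructor arguments, `generate` method, and `prompt_hint` field) and the `/generate` route are assumptions for illustration; only the `/info` path is mentioned in this post.

```python
from flask import Flask, jsonify, request


class LLM_Server_Wrapper:
    """Sketch of a wrapper pairing a tokenizer/model with a prompting hint.

    The real class in the repo may have a different interface; this is
    an assumed shape for illustration only.
    """

    def __init__(self, tokenizer, model, prompt_hint):
        self.tokenizer = tokenizer
        self.model = model
        self.prompt_hint = prompt_hint

    def generate(self, prompt, max_new_tokens=128):
        # Tokenize, generate, and decode using the standard transformers API.
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(output[0], skip_special_tokens=True)


def build_server(wrapper):
    """Build a Flask app exposing the wrapped model over HTTP."""
    app = Flask(__name__)

    @app.route("/info")
    def info():
        # Prompting hints let clients format prompts for whatever model is loaded.
        return jsonify({"prompt_hint": wrapper.prompt_hint})

    @app.route("/generate", methods=["POST"])
    def generate():
        prompt = request.get_json()["prompt"]
        return jsonify({"response": wrapper.generate(prompt)})

    return app


if __name__ == "__main__":
    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Step 1: load the tokenizer and model (gpt2 used here as a small example).
    name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # Step 2: create the wrapped model.
    wrapper = LLM_Server_Wrapper(tokenizer, model, "Plain completion; no special template.")

    # Step 3: start the server.
    build_server(wrapper).run(host="0.0.0.0", port=5000)
```

Because the model is loaded once at startup, every subsequent request skips the load time entirely, which is the whole point of the server.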

Will share further examples of how I have been using this…

… as promised I have now added a client to help you get started (see link below).

Simple Client

This client shows the power of abstracting the LLM away from its use. We can use the “/info” path to get the prompting hints from the wrapped model, which helps the client construct the prompt correctly. This is needed because each model can have its own prompting style.
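A minimal client along these lines might look like the sketch below. The `/info` path comes from this post, but the shape of its JSON response (a `prompt_template` key) and the helper names here are assumptions for illustration; the actual client is in the linked repo.

```python
import json
from urllib import request as urlreq


def get_info(base_url):
    """Fetch the prompting hints from the wrapped model via the /info path."""
    with urlreq.urlopen(f"{base_url}/info") as resp:
        return json.load(resp)


def build_prompt(info, user_text):
    """Apply the model's prompting style to the user's text.

    Assumes the hints include a 'prompt_template' with an {input}
    placeholder; falls back to the raw text if no template is given.
    """
    template = info.get("prompt_template", "{input}")
    return template.format(input=user_text)


if __name__ == "__main__":
    base_url = "http://localhost:5000"  # adjust to wherever the server runs
    info = get_info(base_url)
    print(build_prompt(info, "What is a tokenizer?"))
```

Because the template comes from the server, the same client code works unchanged no matter which model the server has wrapped.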
