Build a system for the future of AI research in an hour (Part 1, Dependencies)
Imagine a blank computing environment powered by LLMs. Let's build a cluster that closely resembles what that might look/feel like in the hopes of improvising from established patterns.

The LLM space moves fast. As of writing this, most of the good, viral research is about one of two things:
- Emulating aspects of OpenAI’s ChatGPT architecture/performance
- More efficient model paradigms
Increasingly, peers I’ve spoken to place an emphasis on the sheer amount of testing and toying around with different solutions that is required. It’s not just A/B’ing models, but controlling the underlying resources, fine-tuning from several versions of a dataset at once, managing library versions, comparing quantizations; the list goes on, and all of it needs to be modularized. And that doesn’t even account for the time required to read the aforementioned research.
As a data scientist-turned-engineer, my role is characterized by a focus on availability, monitoring, and observability in the most efficient and performant way(s). People in the more sales-heavy parts of the tech industry refer to this as “AIOps” (as “ML” becomes less used every day). If we dig a little deeper into the patterns we can see culturally, it’s very likely Karpathy is onto something (as he usually is) when he suggests LLMs may be a new computing paradigm at the kernel level. Let’s adopt this philosophy and think of what we are building as the bare minimum for quality research. This architecture will define a fluid LLM-based computing platform which you can use to create, host, understand, and interpret any downstream use case given what we understand about LLMs today (including the momentum of change in their research).
In this article, we’re going to look at what makes implementing large language models so time-consuming in the early stages, and how to tick the important boxes we’d usually reserve for production deployments in a way that keeps pace with the experimentation phase.
In short, it’s time to move away from doing everything in a Jupyter Notebook 😬 Sorry, but it’s slowing you down!
⚗️ Toolkit
To do that, we’re building a production-grade environment and cutting some corners so that the silhouette of our deployment overlaps with just our initial needs.
We’re getting help from several important architectural & design choices:
- A clustering configuration engine for infrastructure: we’ll use k3s, a lightweight version of Kubernetes that preserves production concepts
- GPU passthrough for Docker images: the recent ability to use quantization flags in prebuilt images with CUDA 12.x installed on an Ubuntu host makes this very straightforward
- An OSS, performant, scalable vector DB with great team(s) behind the product
- Tools we made at Prismadic to help with LLM-specific tasks & routines: similar to langchain, but with well thought-out design patterns for scalability
- GPUs with CUDA compute capability 8.x: the T4 found on AWS g4dn instances, as an example, only supports compute capability 7.5. g5.xlarge is the most affordable instance type on AWS with GPUs capable of modern LLM research tasks that is available in regular quantities.
Put together, these pieces create an extremely configurable and adaptable environment for quickly testing n models against most downstream tasks.
⚙️ Configuring your host for virtualizing GPU workloads
Assuming you have CUDA enabled on an Ubuntu 22 machine, let’s install some necessary components you might not have already as an ML researcher so that we can begin turning this computer into a zombie that exists just to serve LLMs.
$ sudo apt-get update
$ sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release -y
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
$ echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
$ sudo apt-get update
$ sudo apt-get install docker-ce docker-ce-cli containerd.io -y
$ sudo usermod -aG docker $USER
There, we’re just installing the container runtime tooling and adding ourselves to the docker group, so moving on!
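Note that the group change only takes effect in new login sessions; either log out and back in, or pick it up in your current shell. A quick sanity check that the daemon answers without sudo looks something like this:
$ newgrp docker
$ docker info | head -n 5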
Next, install nvidia-container-toolkit:
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker
(it’s always handy to do a quick nvidia-smi at times like this to make sure your drivers and CUDA libs are working as expected)
When this finishes installing, you’ll be running Docker with a slightly modified container runtime which allows you to take advantage of the host GPU inside the Docker image context. You can think of it as the container equivalent of hardware passthrough in bare-metal VM environments.
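Before moving on, it’s worth confirming that containers can actually see the GPU. A minimal smoke test, assuming the stock nvidia/cuda base image (any recent 12.x tag should do):
$ nvidia-smi
$ sudo docker run --rm --gpus=all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
If the second command complains that the NVIDIA runtime can’t be found, newer toolkit versions also ship sudo nvidia-ctk runtime configure --runtime=docker to register it, followed by another docker restart.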
🧪 Test your GPU with vLLM
Minor changes to existing tools go a long way to pave a path of least resistance.
vLLM has good documentation for building your own image, but that requires your CUDA version to match the one the image’s PyTorch was built against in order to build the vllm package from pip; this can present deployment hurdles on hosts where an existing CUDA installation is already relied on by other development. conda addresses some of this, but it’s not great for clusters like Kubernetes, which puts us back at square one.
Instead of building their image, we’re going to grab a pre-built one we contributed to prior to this article, so as to prepare for the quantization requirement of our deployment.
Since this is already addressed, it’s as simple as:
$ sudo docker run -d -p 8080:8080 --gpus=all -e MODEL=TheBloke/Mistral-7B-Instruct-v0.1-AWQ -e QUANTIZATION=awq -e DTYPE=half ghcr.io/substratusai/vllm
docker will then:
- open 8080 to the application
- use all of your GPUs which are discoverable by CUDA (distributed inference!)
- assign to the server a quantized model from TheBloke on Hugging Face
- use the image’s new flags to accommodate a non-standard model
(you may find some of this useful if you’d like to build a vllm image yourself instead, but there’s really no advantage to that, as you’ll find out later when we deploy the Python API)
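Before tearing the container down, it’s worth poking the server to confirm the model actually loaded; the first start can take a few minutes while the weights download. The /docs path is the same one we’ll point a readiness probe at later, and the completion route below assumes the image exposes vLLM’s OpenAI-compatible API on port 8080 (adjust if your image differs):
$ sudo docker logs -f <id>
$ curl -s http://localhost:8080/docs | head -n 5
$ curl -s http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "TheBloke/Mistral-7B-Instruct-v0.1-AWQ", "prompt": "Kubernetes is", "max_tokens": 32}'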
You can stop this container now with sudo docker container stop <id>, as we’re going to deploy this as a pod with a Kubernetes manifest, which will enable reliable availability with little effort. The image is useful to k8s too, though, so don’t remove it once built.
☸️ Kubernetes
Kubernetes eludes the research space, and it’s no wonder why: it’s a lot of difficult abstraction. This article isn’t the place to split attention between building and philosophy, though. Besides, for some reason Kubernetes garners pretty intense reactions (and I’m not totally sure why…)
In any case, let’s install k3s & kubectl so that we aren’t left behind.
$ curl -sfL https://get.k3s.io | sh -
$ curl -LO https://storage.googleapis.com/kubernetes-release/release/`curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl
$ chmod +x kubectl
$ sudo mv kubectl /usr/local/bin/
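The k3s install script brings the cluster up as a systemd service right away, so a quick health check looks something like this (the standalone kubectl binary needs to be pointed at k3s’s kubeconfig, which lives at /etc/rancher/k3s/k3s.yaml by default):
$ sudo k3s kubectl cluster-info
$ sudo k3s kubectl get nodes
$ mkdir -p ~/.kube && sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config && sudo chown $USER ~/.kube/config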
You should see a healthy Kubernetes cluster info dump.
Next, to quickly deploy a quantized version of the mistral-7b-instruct model in your new Kubernetes cluster, create a file called vllm.yaml and paste the following into it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: ghcr.io/substratusai/vllm:latest
        ports:
        - containerPort: 8080
        env:
        - name: MODEL
          value: "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"
        - name: QUANTIZATION
          value: "awq"
        - name: DTYPE
          value: "half"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        readinessProbe:
          httpGet:
            path: /docs
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        resources:
          limits:
            nvidia.com/gpu: "1"
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 1Gi
note: we put “1” as the number of GPUs in resources.limits under nvidia.com/gpu because each model will be designated the VRAM we have allotted (and 14GB is required for the model before inference or embeddings from a vector DB are even in memory!)
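One caveat worth checking before you apply the manifest: the nvidia.com/gpu resource is only schedulable once the node actually advertises it, which on k3s typically means NVIDIA’s k8s-device-plugin (or an equivalent GPU operator) is running on top of the container runtime we configured earlier; see their docs if the check below comes back empty:
$ k3s kubectl describe node | grep -i nvidia.com/gpu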
now, it’s as simple as:
$ k3s kubectl apply -f vllm.yaml
🎉 You should now have a highly scalable, virtualized, containerized setup for deploying not just quantized models, but almost any model on dedicated GPU compute (Nvidia has great documentation on time-slicing GPUs across different workloads, but that’s outside the scope of this series, and indeed probably beyond the needs of many individuals)
feel free to validate this with kubectl get deployments
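If you want to go a step further than kubectl get deployments, something like the following confirms the pod is pulling the model and becoming ready, then tunnels the API to your machine for the same smoke test we ran against the raw container (names here match the manifest above; run the curl from another terminal while port-forward is active):
$ k3s kubectl get pods -l app=vllm
$ k3s kubectl logs -f deployment/vllm
$ k3s kubectl port-forward deployment/vllm 8080:8080
$ curl -s http://localhost:8080/docs | head -n 5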
vllm takes care of queuing, cloning the OpenAI API, pretty much everything you need to get started and worry less about the compute-heavy parts of the work. So in that spirit, we aren’t going to worry about it so much (but their documentation is great, and the next tool we’re about to prepare is coupled nicely with vllm specifically!)
Okay, now remove it with:
$ k3s kubectl delete -f vllm.yaml
Because in part ✌️ of this series, we’re going to create the “Control Panel” for our new LLM-chauvinist OS concept.
(it’s literally a control plane for the cluster with some tweaks)