BYO-GPU — minimal Kubernetes path

One Deployment, one Service, one stable in-cluster DNS name. Keys stays outside the cluster or in your app tier—either way, you expose an OpenAI-compatible HTTP surface.

Not a production SRE guide: this is the smallest shape that works. Add an HPA, PDBs, network policies, and monitoring for real clusters.

1. Deployment sketch

  • Request nvidia.com/gpu: 1 (or your accelerator's resource name) so pods schedule onto nodes that advertise it.
  • Mount model weights from a PVC, or download them with an init container; follow your model license.
  • Listen on container port 8000 (example); add HTTP liveness/readiness probes if your server exposes them.
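The bullets above can be sketched as a single manifest. Everything here is illustrative: the image, probe path, PVC name, and mount path are placeholders for your own values.

```yaml
# Illustrative only — replace image, probe path, and volume names.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-openai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-openai
  template:
    metadata:
      labels:
        app: llm-openai
    spec:
      containers:
        - name: server
          image: your-registry/your-openai-server:tag  # replace
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1   # or your accelerator's resource name
          readinessProbe:
            httpGet:
              path: /health       # only if your server exposes one
              port: 8000
          volumeMounts:
            - name: weights
              mountPath: /models
      volumes:
        - name: weights
          persistentVolumeClaim:
            claimName: model-weights  # illustrative PVC name
```

The `app: llm-openai` label is what the Service selector in the next section matches on.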

2. Service

Use a ClusterIP Service and reach the API from your gateway or mesh. If Keys must resolve the endpoint from outside the cluster, front it with an Ingress or an internal load balancer.

# Illustrative only — replace images, probes, and resources.
apiVersion: v1
kind: Service
metadata:
  name: llm-openai
spec:
  type: ClusterIP
  selector:
    app: llm-openai
  ports:
    - port: 8000
      targetPort: 8000

3. Keys binding

Point your Keys private provider base URL at http://llm-openai.default.svc.cluster.local:8000/v1 (or HTTPS equivalent). Use the same logical model alias and policy steps as the VM guide.
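If your app tier reads the provider base URL from configuration rather than from Keys directly, one minimal sketch is a ConfigMap. The key names and alias here are illustrative, not a Keys schema:

```yaml
# Illustrative only — your app or gateway decides the actual key names.
apiVersion: v1
kind: ConfigMap
metadata:
  name: keys-provider
data:
  # Stable in-cluster DNS name for the Service above
  OPENAI_BASE_URL: http://llm-openai.default.svc.cluster.local:8000/v1
  MODEL_ALIAS: my-gpu-model   # illustrative; match your Keys policy alias
```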

4. Smoke in CI

Prefer a self-hosted runner with cluster access or a scheduled job inside the cluster—keep GPU CI costs explicit. Template: GPU route smoke (Testing).
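One way to run the smoke inside the cluster is a CronJob that curls the OpenAI-compatible endpoint. The schedule, image tag, and namespace in the URL are illustrative; adapt them to your template.

```yaml
# Illustrative only — adjust schedule, image, and endpoint for your cluster.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: llm-smoke
spec:
  schedule: "0 6 * * *"        # daily; keeps GPU-adjacent CI costs visible
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: smoke
              image: curlimages/curl:8.7.1   # illustrative pinned tag
              args:                          # fail the Job on any non-2xx
                - -fsS
                - http://llm-openai.default.svc.cluster.local:8000/v1/models
```

A failing curl fails the Job, which your alerting or CI reporting can pick up.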