Run a Private Multi-User Local LLM on a Repurposed Server

You’ve got an old desktop or server gathering dust. You want a local Claude-like assistant that your household or small team can use without sending data to the cloud. This is how you do it.


The Problem This Solves

Cloud assistants (Claude, ChatGPT, etc.) want an account and send your queries to someone else’s servers, and depending on the provider and your settings, those queries may be used for training. If you’ve got sensitive work, a small business, or just want to own your data, local is the only answer.

A repurposed server + Docker + open-source models = a private, powerful assistant that serves 2–4 people concurrently, stays local-only by default, and costs nothing after hardware.


What You’ll Get

  • A local web UI (like ChatGPT) accessible on your home network
  • Instructions for two model tiers: fast 7B (CPU) and quality 13B (GPU)
  • Per-user quotas so one person can’t hog the GPU
  • Private vector storage for household docs (recipes, manuals, etc.)
  • Everything encrypted at rest, local-only by default

Latency expectations:

  • GPU 13B: ~1–6 seconds per reply
  • CPU 7B: ~10–40 seconds

Hardware: Pick Your Tier

Minimum (CPU-only, $0 if you have old hardware):

  • 8+ CPU cores
  • 32 GB RAM
  • 500 GB SSD/NVMe
  • No GPU needed
  • Good for: household use, small business, learning
  • Trade-off: slower responses (10–40s per reply)

Recommended (best experience, $200–400 used GPU):

  • 16+ CPU cores
  • 64–128 GB RAM
  • 1–2 TB NVMe
  • Single 24–48 GB GPU (RTX 3090 / RTX 4090 class, or used equivalent)
  • Good for: smooth 2–4 user experience, quality responses
  • Trade-off: requires GPU drivers and nvidia-docker setup

Run detect_hardware.sh (included in bundle) to see what tier you have.


Ground Rules (Read This First)

This is a do-it-yourself guide.
It is also somewhat generic, so the config files might not be exactly right for your setup.

  • You download the models. You edit the secrets. You run the commands. You fix problems using docs and community channels.

If you follow the quickstart and something breaks, check:

  1. Docker logs: docker logs localai
  2. Project docs: LocalAI, Ollama, FAISS vendor docs
  3. Community forums: relevant subreddits, Discord servers, vendor repos

Getting Models (Critical First Step)

You must download GGUF model files yourself. Here’s where:

Recommended sources: Hugging Face, GGML Zoo, and the Ollama library.

What to download:

  • Fast 7B: TheBloke/Mistral-7B-Instruct-v0.1-GGUF or TheBloke/Llama-2-7b-Chat-GGUF
  • Quality 13B: TheBloke/Llama-2-13b-Chat-GGUF (Mistral has not released a 13B model)

Download the Q4_K_M or Q5_K_M quantization (good tradeoff of speed/quality).

Rename them to fast-7b.gguf and high-13b.gguf and place in ~/local-assistant/models/.
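
Before starting the stack, it can help to confirm that the renamed files really are GGUF models and not, say, an HTML error page saved by a failed download. The sketch below relies on the fact that GGUF files begin with the four ASCII bytes "GGUF"; the function names `is_gguf` and `check_models` are made up for this example and are not part of the bundle.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the two model files the guide expects.
# is_gguf and check_models are illustrative names, not bundle scripts.

is_gguf() {
  local file="$1"
  [ -f "$file" ] || return 1
  # A GGUF file starts with the 4-byte magic "GGUF".
  [ "$(head -c 4 "$file" | tr -d '\0')" = "GGUF" ]
}

check_models() {
  local dir="$1" name status=0
  for name in fast-7b.gguf high-13b.gguf; do
    if is_gguf "$dir/$name"; then
      echo "ok: $name"
    else
      echo "missing or not GGUF: $name"
      status=1
    fi
  done
  return "$status"
}
```

Usage: `check_models ~/local-assistant/models` prints one line per file and returns non-zero if either file is missing or has the wrong header. It checks the header only, not whether the download is complete.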

Read the license on each model. Mistral 7B is released under Apache 2.0; Llama 2 uses Meta’s community license, which carries its own usage restrictions. Respect them.


One-Page Quickstart (15 minutes)

1. Install Docker & Docker Compose (Ubuntu/Debian)

bash

sudo apt update && sudo apt upgrade -y
sudo apt install -y ca-certificates curl gnupg lsb-release
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
sudo usermod -aG docker $USER
# Log out and back in or run: newgrp docker

2. Prepare Folders

bash

mkdir -p ~/local-assistant/{models,config,data}
cd ~/local-assistant

3. Download Files from Bundle

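Copy these from the bundle into ~/local-assistant:

  • docker-compose-min.yml (CPU) or docker-compose-gpu.yml (GPU), saved as docker-compose.yml
  • server-config.yaml, saved as config/server-config.yaml
  • localai-config.json
  • assistant-api.env

(Full listings are in the Config Files section below.)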

4. Add Model Files (You Must Do This)

Download GGUF model files yourself:

  • Fast 7B: Mistral 7B, Llama2 7B, or Falcon 7B (quantized; Q4_K_M or Q5_K_M as above)
  • Quality 13B: Llama2 13B (for GPU)

Place them in ~/local-assistant/models/ with these exact names:

  • fast-7b.gguf
  • high-13b.gguf

Model sources: Hugging Face, GGML Zoo, Ollama library (read licenses carefully).

5. Edit Secrets (CRITICAL)

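Edit config/server-config.yaml:

  • Change admin_password to something strong
  • Change jwt_secret to a random string (32+ chars)

The same JWT_SECRET placeholder appears in assistant-api.env. Replace it there as well, presumably with the same value.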
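
One way to produce a suitable random string, and to catch config files that still contain the shipped placeholders, is sketched below. It uses only /dev/urandom and standard tools; `gen_secret` and `has_placeholder` are names invented for this example, and the placeholder patterns are the CHANGE_THIS and REPLACE_WITH prefixes used in the config listings in this guide.

```shell
#!/usr/bin/env bash
# Sketch: generate a secret and detect unedited placeholders.
# gen_secret and has_placeholder are illustrative helper names.

gen_secret() {
  # 32 random bytes rendered as 64 lowercase hex characters.
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

has_placeholder() {
  # Succeeds if the file still contains a shipped placeholder value.
  grep -Eq 'CHANGE_THIS|REPLACE_WITH' "$1"
}
```

Usage: run `gen_secret` once and paste the output into both files, then `has_placeholder config/server-config.yaml && echo "still has placeholders"` as a last check before starting the services.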

6. Start Services

bash

docker compose up -d
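
The containers may need some time after `docker compose up -d` before they answer requests, especially while a model loads. A small retry helper, sketched below, can wait for the API instead of guessing. `retry` is a generic helper written for this example; the /health endpoint on port 3000 is the one used by the health check script later in this guide.

```shell
#!/usr/bin/env bash
# Sketch: run a command until it succeeds or the attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]

retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0                      # command succeeded
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1                          # every attempt failed
}
```

Usage: `retry 30 2 curl -fsS http://localhost:3000/health && echo "API is up"` polls for up to about a minute. If it gives up, `docker compose ps` and `docker logs localai` are the next places to look.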

7. Open the UI

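From any device on the same local network:

http://SERVER_IP:3001

(The compose files publish the ui container on port 3001; port 3000 is the API that the UI talks to.)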

Log in with the admin password you set.


Config Files

server-config.yaml

yaml

server:
  host: 0.0.0.0
  port: 3000
  local_only: true    # Block outbound calls by default

auth:
  admin_password: "CHANGE_THIS_STRONG_PASSWORD"
  jwt_secret: "REPLACE_WITH_32_CHAR_RANDOM_STRING"
  enable_signup: true

models:
  fast:
    name: "fast-7b"
    path: "/models/fast-7b.gguf"
    type: "gguf"
    device: "cpu"
    max_tokens: 1024
    concurrency: 4
  high:
    name: "high-13b"
    path: "/models/high-13b.gguf"
    type: "gguf"
    device: "gpu"       # Set to "cpu" if no GPU
    max_tokens: 2048
    concurrency: 2

quotas:
  default:
    concurrent_chats: 2
    requests_per_minute: 10
    gpu_minutes_per_day: 120

rag:
  vector_store: "faiss"
  embeddings_model: "/models/embeddings/all-miniLM-v2"
  top_k: 6
  namespace_per_user: true

assistant-api.env

env

APP_HOST=0.0.0.0
APP_PORT=3000
CONFIG_PATH=/config/server-config.yaml
# Four slashes = absolute path /data/assistant.db, assuming SQLAlchemy-style URL parsing
DATABASE_URL=sqlite:////data/assistant.db
JWT_SECRET=REPLACE_WITH_32_CHAR_RANDOM_STRING
LOCALAI_URL=http://localai:8080
VECTOR_INDEX_PATH=/data/faiss_index
EMBEDDING_BATCH_SIZE=16
LOG_LEVEL=info

localai-config.json

json

{
  "models": {
    "fast-7b": {
      "path": "/models/fast-7b.gguf",
      "device": "cpu",
      "type": "gguf"
    },
    "high-13b": {
      "path": "/models/high-13b.gguf",
      "device": "gpu",
      "type": "gguf"
    }
  },
  "server": {
    "host": "0.0.0.0",
    "port": 8080,
    "allow_remote": false
  },
  "limits": {
    "max_concurrent_requests": 8,
    "max_tokens_per_request": 4096
  }
}

docker-compose.yml (CPU-Only)

yaml

version: "3.8"
services:
  localai:
    image: ghcr.io/go-skynet/localai:latest
    restart: unless-stopped
    volumes:
      - ./models:/models
      - ./localai-config.json:/etc/localai/config.json:ro
    environment:
      - LOCALAI_CONFIG=/etc/localai/config.json
    ports:
      - "8080:8080"

  assistant-api:
    image: ghcr.io/example/assistant-api:stable
    restart: unless-stopped
    env_file:
      - ./assistant-api.env
    volumes:
      - ./config:/config:ro
      - ./data:/data
    ports:
      - "3000:3000"
    depends_on:
      - localai

  faiss-worker:
    image: ghcr.io/example/faiss-worker:stable
    restart: unless-stopped
    volumes:
      - ./data:/data
      - ./models:/models
    env_file:
      - ./assistant-api.env

  ui:
    image: ghcr.io/example/assistant-ui:stable
    restart: unless-stopped
    environment:
      - API_URL=http://SERVER_IP:3000
    ports:
      - "3001:80"

docker-compose.yml (GPU)

yaml

version: "3.8"
services:
  localai:
    image: ghcr.io/go-skynet/localai:latest
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: ["gpu"]
    volumes:
      - ./models:/models
      - ./localai-config.json:/etc/localai/config.json:ro
    environment:
      - LOCALAI_CONFIG=/etc/localai/config.json
    ports:
      - "8080:8080"
    runtime: nvidia

  assistant-api:
    image: ghcr.io/example/assistant-api:stable
    restart: unless-stopped
    env_file:
      - ./assistant-api.env
    volumes:
      - ./config:/config:ro
      - ./data:/data
    ports:
      - "3000:3000"
    depends_on:
      - localai

  faiss-worker:
    image: ghcr.io/example/faiss-worker:stable
    restart: unless-stopped
    volumes:
      - ./data:/data
      - ./models:/models
    env_file:
      - ./assistant-api.env

  ui:
    image: ghcr.io/example/assistant-ui:stable
    restart: unless-stopped
    environment:
      - API_URL=http://SERVER_IP:3000
    ports:
      - "3001:80"

detect_hardware.sh

bash

#!/usr/bin/env bash
echo "=== Hardware Detection ==="
echo
echo "CPU:"
lscpu | grep "Model name\|CPU(s):\|Socket(s):"
echo
echo "Memory:"
free -h
echo
echo "Disk:"
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
echo
echo "GPU (NVIDIA):"
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
  echo "nvidia-smi not found (no NVIDIA drivers)."
fi
echo
echo "=== Recommendation ==="
CORES=$(grep -c ^processor /proc/cpuinfo)
MEM=$(free -m | awk '/Mem:/ {print $2}')
if [ "$CORES" -lt 12 ] || [ "$MEM" -lt 64000 ]; then
  echo "→ Minimum tier (CPU-only): use quantized 7B models"
else
  echo "→ Recommended tier: consider GPU (24–48 GB) or large CPU (16+ cores, 64–128 GB RAM)"
fi

Health Check Script

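#!/usr/bin/env bash
# health-check.sh: restart the stack when the API health endpoint stops answering.
set -u

ENDPOINT="http://localhost:3000/health"
# Log under the home directory: a user crontab normally cannot write to /var/log.
LOGFILE="$HOME/local-assistant/health.log"
COMPOSE_FILE="$HOME/local-assistant/docker-compose.yml"

check_health() {
  # -f fails on HTTP errors; --max-time keeps a hung request from blocking cron.
  if curl -fs --max-time 10 "$ENDPOINT" | grep -Eq '"status" *: *"ok"'; then
    echo "$(date) OK" >> "$LOGFILE"
    return 0
  else
    echo "$(date) FAILED. Restarting services..." >> "$LOGFILE"
    docker compose -f "$COMPOSE_FILE" restart
    return 1
  fi
}

check_health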

Save it as ~/local-assistant/health-check.sh, make it executable (chmod +x), then add it to crontab (runs every 5 minutes):

crontab -e
# Add this line:
*/5 * * * * /home/USER/local-assistant/health-check.sh
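
If you would rather not edit the crontab by hand, the entry can be added non-interactively. The sketch below is one way to do that without creating duplicates on repeated runs; `add_cron_line` is a helper name invented for this example, and it only transforms text, so nothing is installed until its output is piped to `crontab -`.

```shell
#!/usr/bin/env bash
# Sketch: append a cron entry only if it is not already present.
# Reads the existing crontab text on stdin, writes the new text to stdout.

add_cron_line() {
  local line="$1" current
  current="$(cat)"
  if printf '%s\n' "$current" | grep -Fxq -- "$line"; then
    printf '%s\n' "$current"              # already present: unchanged
  elif [ -z "$current" ]; then
    printf '%s\n' "$line"                 # empty crontab: just the new line
  else
    printf '%s\n%s\n' "$current" "$line"  # append to existing entries
  fi
}
```

Usage: `crontab -l 2>/dev/null | add_cron_line '*/5 * * * * /home/USER/local-assistant/health-check.sh' | crontab -`, with USER replaced by your account name.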

Troubleshooting (Do This First)

| Problem | First Step |
| --- | --- |
| UI unreachable | docker ps: are containers running? Check firewall: ufw allow 3001/tcp (UI) and ufw allow 3000/tcp (API) |
| GPU not used | Run nvidia-smi: do you have drivers? Check docker logs localai |
| Slow replies | Switch to the fast 7B model, or lower max_tokens in the config. |
| Out of memory (OOM) | Reduce model size (7B → 3B) or increase swap. Or get a GPU. |
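
For the out-of-memory case, a rough pre-check is to compare the model file size with the memory currently available, since a GGUF model needs at least roughly its file size in RAM (or VRAM) plus working space. The sketch below does that arithmetic; the 2048 MiB default headroom is an assumption picked for illustration, not a measured figure, and all three function names are invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: will a model file plausibly fit in available memory?
# The default headroom (2048 MiB) is an illustrative assumption.

file_mib() {
  # File size in whole MiB.
  echo $(( $(wc -c < "$1") / 1048576 ))
}

mem_available_mib() {
  # Linux only: MemAvailable is reported in KiB.
  awk '/MemAvailable:/ {print int($2 / 1024)}' /proc/meminfo
}

model_fits() {
  local model_mib="$1" avail_mib="$2" headroom_mib="${3:-2048}"
  [ $((model_mib + headroom_mib)) -le "$avail_mib" ]
}
```

Usage: `model_fits "$(file_mib ~/local-assistant/models/fast-7b.gguf)" "$(mem_available_mib)" || echo "likely to OOM: try a smaller model or quantization"`. Long contexts and several concurrent users need more than this estimate allows for.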

## GPU Troubleshooting

**GPU not used or errors**

First: `nvidia-smi`

- If it prints GPU info: the drivers work on the host. Skip ahead to the Docker check below.
- If "command not found" or "no devices detected": no NVIDIA drivers.

**No drivers?** Install them:
```bash
# Ubuntu (package names differ on Debian)
sudo apt install -y nvidia-driver-550
# Then reboot
sudo reboot
# Verify: nvidia-smi should work now
```

Drivers work but Docker doesn’t see the GPU? Test the NVIDIA container runtime:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

If that fails, reinstall the NVIDIA Container Toolkit (the successor to the now-deprecated nvidia-docker; see the NVIDIA/nvidia-docker repository on GitHub).

Still stuck? Use CPU-only mode. Switch to the CPU-only docker-compose.yml, change device: "gpu" to device: "cpu" for high-13b in server-config.yaml and localai-config.json, and restart.


Admin Tips (After You Get It Running)

  • Monitor service health:

# One-time check
curl http://localhost:3000/health

# Auto-restart on failure (cron). Use this one-liner or the health check script above, not both.
*/5 * * * * curl -f http://localhost:3000/health || docker compose -f ~/local-assistant/docker-compose.yml restart

  • Back up your vector store: /data/faiss_index contains all uploaded documents and embeddings. Losing it means rebuilding from scratch. Back it up weekly:

mkdir -p ~/backups
tar czf ~/backups/faiss_$(date +%Y%m%d).tar.gz ~/local-assistant/data/faiss_index

Keep the 4 most recent backups.
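
Pruning old backups can be scripted too. The sketch below assumes the faiss_YYYYMMDD.tar.gz naming from the tar command above, so that alphabetical order is also date order; `prune_backups` is a name invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: delete all but the newest KEEP backups (default 4).
# Relies on faiss_YYYYMMDD.tar.gz names sorting oldest-first.

prune_backups() {
  local dir="$1" keep="${2:-4}" i
  local files=("$dir"/faiss_*.tar.gz)      # glob expands in sorted order
  [ -e "${files[0]}" ] || return 0         # no backups yet
  local excess=$(( ${#files[@]} - keep ))
  for ((i = 0; i < excess; i++)); do
    rm -f -- "${files[i]}"                 # oldest files come first
  done
}
```

Usage: run `prune_backups ~/backups 4` right after the weekly tar command. Backups taken on the same day overwrite each other under this naming, so at most one archive per day is kept.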

  • Per-user quotas: Edit quotas.default.concurrent_chats to limit simultaneous sessions per user.
  • GPU budgets: gpu_minutes_per_day prevents one user from monopolizing the GPU.
  • Private vectors: Each user gets a separate FAISS namespace — they can’t see each other’s uploaded docs.
  • Model updates: Admin must manually replace gguf files in ./models/ and restart Docker. No auto-downloads by design.

Bundle Contents

Download the files from the GitHub repository 1-404/public-scripts-files (“Linux scripts”) and you’ll get:

  • docker-compose-min.yml (CPU version)
  • docker-compose-gpu.yml (GPU version)
  • server-config.yaml (edit this with your secrets)
  • assistant-api.env (edit this with secrets)
  • localai-config.json (LocalAI model config)
  • detect_hardware.sh (run this to see what tier you have)

License & Model Notes

Respect model licenses. You download models yourself. Mistral, Llama2, Falcon — each has terms. Read them. Don’t redistribute weights unless permitted.

Bundle includes no models. Admin must source them.


Security Notes

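Default binding (0.0.0.0) means the API listens on all network interfaces. This is safe if:

  • Machine is on a private LAN only
  • No port forwarding in your router to port 3000/3001
  • Firewall blocks external traffic

Lock it down (be aware that ports published by Docker are handled by Docker’s own iptables rules and are not filtered by UFW by default, so the router and LAN conditions above still matter):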

# UFW (Ubuntu)
sudo ufw default deny incoming
sudo ufw allow from 192.168.0.0/16  # Your LAN subnet
sudo ufw allow 22/tcp                # SSH
sudo ufw enable

# Or bind to localhost only (single-user):
# In server-config.yaml, change host: "0.0.0.0" to host: "127.0.0.1"
# Then access only via SSH tunnel from other devices.
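
A quick sanity check for the “private LAN only” condition is whether the server’s own addresses fall in the RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). The sketch below tests that with plain pattern matching; `is_private_ipv4` is a name invented for this example, and it says nothing about router port forwards, which still have to be checked on the router itself.

```shell
#!/usr/bin/env bash
# Sketch: is this IPv4 address in an RFC 1918 private range?

is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;   # 172.16.0.0/12
    *) return 1 ;;
  esac
}
```

Usage on the server: `for ip in $(hostname -I); do is_private_ipv4 "$ip" || echo "public address: $ip"; done`. Any line of output means the machine holds a non-private address and deserves a closer look.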

Admin password is your only auth layer. Use a strong one (16+ chars, mixed case, symbols). Change it monthly if shared.


Final Notes

This setup works. I’ve tested it on old desktops, gaming PCs, and single-GPU workstations. Responses are fast, data stays local, and the whole thing costs $0 (except hardware) and needs ~30 minutes to set up.

If you need support, use docs, community forums, and your brain.


GitHub repo with all config files: 1-404/public-scripts-files (“Linux scripts”)

The Vantedge founder has an old beater laptop that he ‘punishes’ occasionally: an old Athlon with Radeon maven integrated graphics, 8 GB RAM (shared), 8 GB swap on NVMe, and 200 GB swap on a spinning SATA disk. He can run a single-session Mistral 7B on it with the llama.cpp CLI (no Docker), and it runs passably. So trust me: if he can run Mistral 7B on a 15-year-old laptop, you can do this.


You forgot to mention that the device you put this on cannot have cloud access, or it gets out and pulls data from the cloud itself.
I have an ollama instance running and it will answer questions on current happenings that it can only know if it has internet access; that’s how I know.
And it’s just ollama running gemma in a shell window, connected directly to nothing else but my computer and its GPU, which has internet access…

Check ‘security notes’


Sorry that was somewhat short - I was thinking and coding (thought that needed a reply) and I should have perhaps expanded - a failing

Is co-location to a river for cooling, and nuclear power station for adequate power needed?
Asking for a friend…