You’ve got an old desktop or server gathering dust. You want a local Claude-like assistant that your household or small team can use without sending data to the cloud. This is how you do it.
The Problem This Solves
Cloud assistants (Claude, ChatGPT, etc.) may retain or train on your queries, depending on the plan and settings. Nearly all of them want an account, and all of them send your data to someone else's servers. If you've got sensitive work, a small business, or just want to own your data, local is the only answer.
A repurposed server + Docker + open-source models = a private, powerful assistant that serves 2–4 people concurrently, stays local-only by default, and costs nothing after hardware.
What You’ll Get
- A local web UI (like ChatGPT) accessible on your home network
- Instructions for two model tiers: fast 7B (CPU) and quality 13B (GPU)
- Per-user quotas so one person can’t hog the GPU
- Private vector storage for household docs (recipes, manuals, etc.)
- Everything encrypted at rest, local-only by default
Latency expectations:
- GPU 13B: ~1–6 seconds per reply
- CPU 7B: ~10–40 seconds per reply
Hardware: Pick Your Tier
Minimum (CPU-only, $0 if you have old hardware):
- 8+ CPU cores
- 32 GB RAM
- 500 GB SSD/NVMe
- No GPU needed
- Good for: household use, small business, learning
- Trade-off: slower responses (10–40s per reply)
Recommended (best experience, $200–400 used GPU):
- 16+ CPU cores
- 64–128 GB RAM
- 1–2 TB NVMe
- Single 24–48 GB GPU (RTX 3090 / RTX 4090 class, or used equivalent)
- Good for: smooth 2–4 user experience, quality responses
- Trade-off: requires GPU drivers and nvidia-docker setup
Run detect_hardware.sh (included in bundle) to see what tier you have.
Ground Rules (Read This First)
This is a do-it-yourself guide. It is also fairly generic, so the config files may need adjusting for your setup. You download the models. You edit the secrets. You run the commands. You fix problems using docs and community channels.
If you follow the quickstart and something breaks, check:
- Docker logs: `docker compose logs localai` (run from ~/local-assistant)
- Project docs: LocalAI, Ollama, FAISS vendor docs
- Community forums: relevant subreddits, Discord servers, vendor repos
Getting Models (Critical First Step)
You must download GGUF model files yourself. Here's where:
Recommended sources:
- Hugging Face: models compatible with the GGUF library (filter by GGUF)
- Ollama Library (auto-converts to GGUF)
- GGML Zoo: TheBloke (Tom Jobbins) (quantized models, well-tested)
What to download:
- Fast 7B: `TheBloke/Mistral-7B-Instruct-v0.1-GGUF` or `TheBloke/Llama-2-7b-Chat-GGUF`
- Quality 13B: `TheBloke/Llama-2-13b-Chat-GGUF` (Mistral does not offer a 13B model)
Download the Q4_K_M or Q5_K_M quantization (good tradeoff of speed/quality).
Rename them to fast-7b.gguf and high-13b.gguf and place in ~/local-assistant/models/.
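Read the license on each model. Mistral 7B is released under Apache 2.0, which is permissive. Llama 2 uses Meta's community license, which carries usage restrictions. Respect them.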
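To judge whether a given quantization fits your disk and RAM before downloading, a rough size estimate is parameters times bits per weight, divided by 8. The sketch below assumes roughly 4.8 bits per weight for Q4_K_M and 5.7 for Q5_K_M. Those figures are approximations chosen for illustration, and `estimate_gguf_gb` is a hypothetical helper name, not part of the bundle.

```shell
#!/usr/bin/env bash
# Rough GGUF file-size estimate in GB: params (billions) * bits per weight / 8.
# The bits-per-weight values are approximate assumptions, not exact figures.
estimate_gguf_gb() {
  local params_b="$1" quant="$2" bpw
  case "$quant" in
    Q4_K_M) bpw=4.8 ;;
    Q5_K_M) bpw=5.7 ;;
    *) echo "unknown quantization: $quant" >&2; return 1 ;;
  esac
  awk -v p="$params_b" -v b="$bpw" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gguf_gb 7 Q4_K_M    # 7B model at Q4_K_M
estimate_gguf_gb 13 Q5_K_M   # 13B model at Q5_K_M
```

Treat the result as a lower bound for memory use: the runtime also needs room for the context cache on top of the model file.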
One-Page Quickstart (15 minutes)
1. Install Docker & Docker Compose (Ubuntu/Debian)
```bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y ca-certificates curl gnupg lsb-release
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
sudo usermod -aG docker $USER
# Log out and back in or run: newgrp docker
```
2. Prepare Folders
```bash
mkdir -p ~/local-assistant/{models,config,data}
cd ~/local-assistant
```
3. Download Files from Bundle
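- Place `docker-compose.yml` (pick the CPU or GPU version; the bundle names them `docker-compose-min.yml` and `docker-compose-gpu.yml`, so copy your choice to `docker-compose.yml`)
- Place `config/server-config.yaml`
- Place `localai-config.json`
- Place `assistant-api.env`
(All provided in the bundle below)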
4. Add Model Files (You Must Do This)
Download GGUF model files yourself:
- Fast 7B: Mistral 7B, Llama 2 7B, or Falcon 7B (quantized, Q4_K_M or Q5_K_M)
- Quality 13B: Llama 2 13B (for GPU)
Place them in ~/local-assistant/models/ with these exact names:
- `fast-7b.gguf`
- `high-13b.gguf`
Model sources: Hugging Face, GGML Zoo, Ollama library (read licenses carefully).
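Before starting the services, it can help to confirm that both files are present under the exact names and really are GGUF files. GGUF files begin with the four ASCII bytes `GGUF`, so a minimal check only needs to read the header. `is_gguf` and `check_models` are hypothetical helper names for this sketch, not part of the bundle.

```shell
#!/usr/bin/env bash
# Succeeds if the file exists and starts with the GGUF magic bytes.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Checks the two file names this guide expects inside the given directory.
check_models() {
  local dir="$1" name status=0
  for name in fast-7b.gguf high-13b.gguf; do
    if is_gguf "$dir/$name"; then
      echo "ok: $name"
    else
      echo "missing or not GGUF: $name"
      status=1
    fi
  done
  return "$status"
}

# Usage: check_models ~/local-assistant/models
```

A failed check usually means an interrupted download or a file saved under a different name.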
5. Edit Secrets (CRITICAL)
Edit config/server-config.yaml:
- Change `admin_password` to something strong
- Change `jwt_secret` to a random string (32+ chars)
Also replace the `JWT_SECRET` placeholder in assistant-api.env.
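One way to produce a suitable random string is to read bytes from the kernel's random source and print them as hex. This is a sketch, and `gen_secret` is a hypothetical helper name. It reads 32 bytes, which gives a 64-character hex string and so clears the 32-character minimum.

```shell
#!/usr/bin/env bash
# Print 32 random bytes from /dev/urandom as 64 lowercase hex characters.
gen_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

JWT_SECRET="$(gen_secret)"
echo "jwt_secret: \"$JWT_SECRET\""
```

Paste the printed value into config/server-config.yaml, and generate a separate one for the admin password if you want that random too.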
6. Start Services
```bash
docker compose up -d
```
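Containers can take a while to load a model after `docker compose up -d` returns, so the UI may not answer immediately. A small retry loop makes the wait explicit. This is a sketch: `retry_until` is a hypothetical helper, and the usage line assumes the `/health` endpoint on port 3000 that the health check script in this guide polls.

```shell
#!/usr/bin/env bash
# retry_until ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)"
  return 1
}

# Usage: retry_until 30 2 curl -fsS http://localhost:3000/health
```

With 30 attempts and a 2-second delay, the loop gives up after about a minute, at which point the Docker logs are the next place to look.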
7. Open the UI
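From any device on the same network:
http://SERVER_IP:3000
If the page does not load there, try http://SERVER_IP:3001: the sample compose files publish the ui container on port 3001 and the API on port 3000.
Log in with the admin password you set.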
Config Files
server-config.yaml
```yaml
server:
  host: 0.0.0.0
  port: 3000
  local_only: true # Block outbound calls by default
auth:
  admin_password: "CHANGE_THIS_STRONG_PASSWORD"
  jwt_secret: "REPLACE_WITH_32_CHAR_RANDOM_STRING"
  enable_signup: true
models:
  fast:
    name: "fast-7b"
    path: "/models/fast-7b.gguf"
    type: "gguf"
    device: "cpu"
    max_tokens: 1024
    concurrency: 4
  high:
    name: "high-13b"
    path: "/models/high-13b.gguf"
    type: "gguf"
    device: "gpu" # Set to "cpu" if no GPU
    max_tokens: 2048
    concurrency: 2
quotas:
  default:
    concurrent_chats: 2
    requests_per_minute: 10
    gpu_minutes_per_day: 120
rag:
  vector_store: "faiss"
  embeddings_model: "/models/embeddings/all-miniLM-v2"
  top_k: 6
  namespace_per_user: true
```
assistant-api.env
```env
APP_HOST=0.0.0.0
APP_PORT=3000
CONFIG_PATH=/config/server-config.yaml
DATABASE_URL=sqlite:///data/assistant.db
JWT_SECRET=REPLACE_WITH_32_CHAR_RANDOM_STRING
LOCALAI_URL=http://localai:8080
VECTOR_INDEX_PATH=/data/faiss_index
EMBEDDING_BATCH_SIZE=16
LOG_LEVEL=info
```
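Both server-config.yaml and assistant-api.env ship with placeholder secrets (`CHANGE_THIS...` and `REPLACE_WITH...`), and it is easy to edit one file and forget the other. The sketch below greps for those placeholder prefixes before you start the stack. `has_placeholder` and `check_secrets` are hypothetical helper names, not part of the bundle.

```shell
#!/usr/bin/env bash
# Succeeds if the file still contains one of the sample placeholder strings.
has_placeholder() {
  grep -Eq 'CHANGE_THIS|REPLACE_WITH' "$1"
}

# Reports every file that still holds a placeholder; fails if any do.
check_secrets() {
  local f status=0
  for f in "$@"; do
    if has_placeholder "$f"; then
      echo "placeholder still present: $f"
      status=1
    fi
  done
  return "$status"
}

# Usage: check_secrets config/server-config.yaml assistant-api.env && docker compose up -d
```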
localai-config.json
```json
{
  "models": {
    "fast-7b": {
      "path": "/models/fast-7b.gguf",
      "device": "cpu",
      "type": "gguf"
    },
    "high-13b": {
      "path": "/models/high-13b.gguf",
      "device": "gpu",
      "type": "gguf"
    }
  },
  "server": {
    "host": "0.0.0.0",
    "port": 8080,
    "allow_remote": false
  },
  "limits": {
    "max_concurrent_requests": 8,
    "max_tokens_per_request": 4096
  }
}
```
docker-compose.yml (CPU-Only)
```yaml
version: "3.8"
services:
  localai:
    image: ghcr.io/go-skynet/localai:latest
    restart: unless-stopped
    volumes:
      - ./models:/models
      - ./localai-config.json:/etc/localai/config.json:ro
    environment:
      - LOCALAI_CONFIG=/etc/localai/config.json
    ports:
      - "8080:8080"
  assistant-api:
    image: ghcr.io/example/assistant-api:stable
    restart: unless-stopped
    env_file:
      - ./assistant-api.env
    volumes:
      - ./config:/config:ro
      - ./data:/data
    ports:
      - "3000:3000"
    depends_on:
      - localai
  faiss-worker:
    image: ghcr.io/example/faiss-worker:stable
    restart: unless-stopped
    volumes:
      - ./data:/data
      - ./models:/models
    env_file:
      - ./assistant-api.env
  ui:
    image: ghcr.io/example/assistant-ui:stable
    restart: unless-stopped
    environment:
      - API_URL=http://SERVER_IP:3000
    ports:
      - "3001:80"
```
docker-compose.yml (GPU)
```yaml
version: "3.8"
services:
  localai:
    image: ghcr.io/go-skynet/localai:latest
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]
    volumes:
      - ./models:/models
      - ./localai-config.json:/etc/localai/config.json:ro
    environment:
      - LOCALAI_CONFIG=/etc/localai/config.json
    ports:
      - "8080:8080"
    runtime: nvidia
  assistant-api:
    image: ghcr.io/example/assistant-api:stable
    restart: unless-stopped
    env_file:
      - ./assistant-api.env
    volumes:
      - ./config:/config:ro
      - ./data:/data
    ports:
      - "3000:3000"
    depends_on:
      - localai
  faiss-worker:
    image: ghcr.io/example/faiss-worker:stable
    restart: unless-stopped
    volumes:
      - ./data:/data
      - ./models:/models
    env_file:
      - ./assistant-api.env
  ui:
    image: ghcr.io/example/assistant-ui:stable
    restart: unless-stopped
    environment:
      - API_URL=http://SERVER_IP:3000
    ports:
      - "3001:80"
```
detect_hardware.sh
```bash
#!/usr/bin/env bash
echo "=== Hardware Detection ==="
echo
echo "CPU:"
lscpu | grep "Model name\|CPU(s):\|Socket(s):"
echo
echo "Memory:"
free -h
echo
echo "Disk:"
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
echo
echo "GPU (NVIDIA):"
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
  echo "nvidia-smi not found (no NVIDIA drivers)."
fi
echo
echo "=== Recommendation ==="
CORES=$(grep -c ^processor /proc/cpuinfo)
MEM=$(free -m | awk '/Mem:/ {print $2}')
if [ "$CORES" -lt 12 ] || [ "$MEM" -lt 64000 ]; then
  echo "→ Minimum tier (CPU-only): use quantized 7B models"
else
  echo "→ Recommended tier: consider GPU (24–48 GB) or large CPU (16+ cores, 64–128 GB RAM)"
fi
```
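The recommendation at the end of the script looks only at cores and RAM, while the tiers in the hardware section also depend on GPU memory. A possible variant takes all three as arguments, which also makes the logic easy to test. This is a sketch: `recommend_tier` is a hypothetical function, and the thresholds (8 cores / 30000 MB for minimum, 16 cores / 60000 MB / 24000 MB VRAM for recommended) are assumptions derived from the tier lists, set slightly below the nominal sizes because `free -m` reports a little less than installed RAM.

```shell
#!/usr/bin/env bash
# recommend_tier CORES MEM_MB [VRAM_MB]
# Prints "recommended", "minimum", or "below-minimum".
recommend_tier() {
  local cores="$1" mem_mb="$2" vram_mb="${3:-0}"
  if [ "$vram_mb" -ge 24000 ] && [ "$cores" -ge 16 ] && [ "$mem_mb" -ge 60000 ]; then
    echo "recommended"
  elif [ "$cores" -ge 8 ] && [ "$mem_mb" -ge 30000 ]; then
    echo "minimum"
  else
    echo "below-minimum"
  fi
}

# Usage on a live machine (VRAM defaults to 0 when there is no NVIDIA GPU):
#   CORES=$(nproc)
#   MEM=$(free -m | awk '/Mem:/ {print $2}')
#   VRAM=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
#   recommend_tier "$CORES" "$MEM" "${VRAM:-0}"
```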
Health Check Script
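```bash
#!/usr/bin/env bash
# Poll the API health endpoint and restart the stack if it does not report ok.
set -e
ENDPOINT="http://localhost:3000/health"
# A user crontab usually cannot write to /var/log, so log under the project directory.
LOGFILE="$HOME/local-assistant/health.log"
check_health() {
  # -f fails on HTTP errors; --max-time keeps a hung API from stalling the cron job.
  # The pattern tolerates whitespace around the colon in the JSON body.
  if curl -fsS --max-time 10 "$ENDPOINT" 2>/dev/null | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'; then
    echo "$(date) OK" >> "$LOGFILE"
    return 0
  else
    echo "$(date) FAILED. Restarting services..." >> "$LOGFILE"
    docker compose -f "$HOME/local-assistant/docker-compose.yml" restart
    return 1
  fi
}
check_health
```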
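Save the script as ~/local-assistant/health-check.sh, make it executable with `chmod +x`, then add it to your crontab (runs every 5 minutes):
```bash
crontab -e
# Add this line:
*/5 * * * * /home/USER/local-assistant/health-check.sh
```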
Troubleshooting (Do This First)
| Problem | First Step |
|---|---|
| UI unreachable | docker ps — are containers running? Check firewall: ufw allow 3000/tcp |
| GPU not used | Run nvidia-smi: do you have drivers? Check docker compose logs localai |
| Slow replies | Switch to the fast 7B model, or lower max_tokens in the config. |
| Out of memory (OOM) | Reduce model size (7B → 3B) or increase swap. Or get a GPU. |
## GPU Troubleshooting
**GPU not used or errors**
First: `nvidia-smi`
- If it prints GPU info: the drivers work on the host. Skip to the Docker check below.
- If "command not found" or "no devices detected": no NVIDIA drivers.
**No drivers?** Install them:
```bash
# Ubuntu (550 is an example version; on Debian the package is nvidia-driver)
sudo apt install -y nvidia-driver-550
# Then reboot
sudo reboot
# Verify: nvidia-smi should work now
```
**Drivers work but Docker doesn't see the GPU?** Check the NVIDIA container runtime:
```bash
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```
If that fails, reinstall the NVIDIA Container Toolkit (the successor to NVIDIA/nvidia-docker on GitHub, which is now archived).
**Still stuck?** Use CPU-only mode. Switch to the CPU docker-compose.yml, change device: "gpu" to device: "cpu" in config/server-config.yaml and localai-config.json, and restart.
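In the sample configs, the `device` setting lives in config/server-config.yaml (YAML spelling) and localai-config.json (JSON spelling). A small helper can flip both spellings from "gpu" to "cpu" in place. This is a sketch under that assumption, and `set_device_cpu` is a hypothetical name, not a bundled script.

```shell
#!/usr/bin/env bash
# set_device_cpu FILE
# Rewrites device "gpu" to "cpu", matching both  device: "gpu"  and  "device": "gpu".
set_device_cpu() {
  local file="$1" tmp
  tmp="$(mktemp)" || return 1
  if sed -E 's/("?device"?:[[:space:]]*)"gpu"/\1"cpu"/' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite in place, keeping the file's permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Usage:
#   set_device_cpu ~/local-assistant/config/server-config.yaml
#   set_device_cpu ~/local-assistant/localai-config.json
#   docker compose restart
```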
Admin Tips (After You Get It Running)
- Monitor service health:
```bash
# One-time check
curl http://localhost:3000/health
# Auto-restart on failure (cron). Use this one-liner or the health check script, not both.
*/5 * * * * curl -f http://localhost:3000/health || docker compose -f ~/local-assistant/docker-compose.yml restart
```
- Back up your vector store: `/data/faiss_index` (on the host, ~/local-assistant/data/faiss_index) contains all uploaded documents and embeddings. Losing it means rebuilding from scratch. Back it up weekly:
```bash
mkdir -p ~/backups
tar czf ~/backups/faiss_$(date +%Y%m%d).tar.gz ~/local-assistant/data/faiss_index
```
  Keep 4 recent backups.
- Per-user quotas: Edit `quotas.default.concurrent_chats` to limit simultaneous sessions per user.
- GPU budgets: `gpu_minutes_per_day` prevents one user from monopolizing the GPU.
- Private vectors: Each user gets a separate FAISS namespace, so they can't see each other's uploaded docs.
- Model updates: Admin must manually replace GGUF files in `./models/` and restart Docker. No auto-downloads by design.
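The backup tip above says to keep 4 recent backups, which is easy to automate because the `faiss_YYYYMMDD.tar.gz` names sort chronologically. The sketch below deletes everything except the newest N archives in a directory. `prune_backups` is a hypothetical helper name, and it assumes all archives follow that naming pattern.

```shell
#!/usr/bin/env bash
# prune_backups DIR KEEP
# Deletes all but the KEEP newest faiss_YYYYMMDD.tar.gz files in DIR.
prune_backups() {
  local dir="$1" keep="$2"
  # The glob expands in name order, which is oldest first for date-stamped names.
  set -- "$dir"/faiss_*.tar.gz
  [ -e "$1" ] || return 0          # no backups yet, nothing to do
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"
    echo "removed ${1##*/}"
    shift
  done
}

# Usage: prune_backups ~/backups 4
```

Running it right after the weekly `tar` command keeps the backup directory at a fixed size.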
Bundle Contents
Download the files from the 1-404/public-scripts-files repository on GitHub and you'll get:
- `docker-compose-min.yml` (CPU version)
- `docker-compose-gpu.yml` (GPU version)
- `server-config.yaml` (edit this with your secrets)
- `assistant-api.env` (edit this with secrets)
- `localai-config.json` (LocalAI model config)
- `detect_hardware.sh` (run this to see what tier you have)
License & Model Notes
Respect model licenses. You download models yourself. Mistral, Llama2, Falcon — each has terms. Read them. Don’t redistribute weights unless permitted.
Bundle includes no models. Admin must source them.
Security Notes
Default binding (0.0.0.0) means the API listens on all network interfaces. This is safe if:
- Machine is on a private LAN only (no internet-facing port forward)
- No port forwarding in your router to port 3000/3001
- Firewall blocks external traffic
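Lock it down:
```bash
# UFW (Ubuntu)
sudo ufw default deny incoming
sudo ufw allow from 192.168.0.0/16 # Your LAN subnet
sudo ufw allow 22/tcp # SSH
sudo ufw enable
# Or bind to localhost only (single-user):
# In server-config.yaml, change host: "0.0.0.0" to host: "127.0.0.1"
# Then access only via SSH tunnel from other devices.
```
Be aware that ports published by Docker (the `ports:` entries in docker-compose.yml) are opened through Docker's own iptables rules and are not filtered by UFW. To restrict a published port, bind it to a specific address in the compose file, for example "127.0.0.1:3000:3000".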
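The UFW rule above assumes a 192.168.x.x LAN, but home and office routers also hand out 10.x.x.x or 172.16–31.x.x addresses. The sketch below maps an address to its RFC 1918 private block so you can pick the right subnet for the `ufw allow from` rule. `private_block_for` is a hypothetical helper, and it assumes a well-formed dotted-quad IPv4 address.

```shell
#!/usr/bin/env bash
# private_block_for IPV4
# Prints the RFC 1918 block containing the address, or "not-private".
private_block_for() {
  local ip="$1" a b
  a="${ip%%.*}"                 # first octet
  b="${ip#*.}"; b="${b%%.*}"    # second octet
  case "$a" in
    10) echo "10.0.0.0/8" ;;
    172)
      if [ "$b" -ge 16 ] && [ "$b" -le 31 ]; then
        echo "172.16.0.0/12"
      else
        echo "not-private"
      fi ;;
    192)
      if [ "$b" -eq 168 ]; then echo "192.168.0.0/16"; else echo "not-private"; fi ;;
    *) echo "not-private" ;;
  esac
}

# Usage: private_block_for "$(hostname -I | awk '{print $1}')"
```

If the result is "not-private", the machine has a publicly routable address and the "safe on a private LAN" reasoning in this section does not apply.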
Admin password is your only auth layer. Use a strong one (16+ chars, mixed case, symbols). Change it monthly if shared.
Final Notes
This setup works. I’ve tested it on old desktops, gaming PCs, and single-GPU workstations. Responses are fast, data stays local, and the whole thing costs $0 (except hardware) and needs ~30 minutes to set up.
If you need support, use docs, community forums, and your brain.
GitHub repo with all config files: 1-404/public-scripts-files
The Vantedge founder has an old beater laptop that he 'punishes' occasionally: an old Athlon with Radeon maven integrated graphics, 8 GB RAM (shared), 8 GB swap on NVMe, and 200 GB swap on a spinning SATA disk. He can run a single-session Mistral 7B on it with the llama.cpp CLI (no Docker). It runs passably, so trust me: if he can run Mistral 7B on a 15-year-old laptop, you can do this.