mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-28 06:51:16 +08:00
Generates a full dedicated Docusaurus page for every one of the 132 skills
(73 bundled + 59 optional) under website/docs/user-guide/skills/{bundled,optional}/<category>/.
Each page carries the skill's description, metadata (version, author, license,
dependencies, platform gating, tags, related skills cross-linked to their own
pages), and the complete SKILL.md body that Hermes loads at runtime.
Previously the two catalog pages just listed skills with a one-line blurb and
no way to see what the skill actually did — users had to go read the source
repo. Now every skill has a browsable, searchable, cross-linked reference in
the docs.
- website/scripts/generate-skill-docs.py — generator that reads skills/ and
optional-skills/, writes per-skill pages, regenerates both catalog indexes,
and rewrites the Skills section of sidebars.ts. Handles MDX escaping
(outside fenced code blocks: curly braces, unsafe HTML-ish tags) and
rewrites relative references/*.md links to point at the GitHub source.
- website/docs/reference/skills-catalog.md — regenerated; each row links to
the new dedicated page.
- website/docs/reference/optional-skills-catalog.md — same.
- website/sidebars.ts — Skills section now has Bundled / Optional subtrees
with one nested category per skill folder.
- .github/workflows/{docs-site-checks,deploy-site}.yml — run the generator
before docusaurus build so CI stays in sync with the source SKILL.md files.
Build verified locally with `npx docusaurus build`. Only remaining warnings
are pre-existing broken link/anchor issues in unrelated pages.
382 lines
9.9 KiB
Markdown
382 lines
9.9 KiB
Markdown
---
|
|
title: "Serving Llms Vllm — Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching"
|
|
sidebar_label: "Serving Llms Vllm"
|
|
description: "Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching"
|
|
---
|
|
|
|
{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
|
|
|
|
# Serving Llms Vllm
|
|
|
|
Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
|
|
|
|
## Skill metadata
|
|
|
|
| | |
|
|
|---|---|
|
|
| Source | Bundled (installed by default) |
|
|
| Path | `skills/mlops/inference/vllm` |
|
|
| Version | `1.0.0` |
|
|
| Author | Orchestra Research |
|
|
| License | MIT |
|
|
| Dependencies | `vllm`, `torch`, `transformers` |
|
|
| Tags | `vLLM`, `Inference Serving`, `PagedAttention`, `Continuous Batching`, `High Throughput`, `Production`, `OpenAI API`, `Quantization`, `Tensor Parallelism` |
|
|
|
|
## Reference: full SKILL.md
|
|
|
|
:::info
|
|
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
|
|
:::
|
|
|
|
# vLLM - High-Performance LLM Serving
|
|
|
|
## Quick start
|
|
|
|
vLLM achieves 24x higher throughput than standard transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).
|
|
|
|
**Installation**:
|
|
```bash
|
|
pip install vllm
|
|
```
|
|
|
|
**Basic offline inference**:
|
|
```python
|
|
from vllm import LLM, SamplingParams
|
|
|
|
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
|
|
sampling = SamplingParams(temperature=0.7, max_tokens=256)
|
|
|
|
outputs = llm.generate(["Explain quantum computing"], sampling)
|
|
print(outputs[0].outputs[0].text)
|
|
```
|
|
|
|
**OpenAI-compatible server**:
|
|
```bash
|
|
vllm serve meta-llama/Llama-3-8B-Instruct
|
|
|
|
# Query with OpenAI SDK
|
|
python -c "
|
|
from openai import OpenAI
|
|
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
|
|
print(client.chat.completions.create(
|
|
model='meta-llama/Llama-3-8B-Instruct',
|
|
messages=[{'role': 'user', 'content': 'Hello!'}]
|
|
).choices[0].message.content)
|
|
"
|
|
```
|
|
|
|
## Common workflows
|
|
|
|
### Workflow 1: Production API deployment
|
|
|
|
Copy this checklist and track progress:
|
|
|
|
```
|
|
Deployment Progress:
|
|
- [ ] Step 1: Configure server settings
|
|
- [ ] Step 2: Test with limited traffic
|
|
- [ ] Step 3: Enable monitoring
|
|
- [ ] Step 4: Deploy to production
|
|
- [ ] Step 5: Verify performance metrics
|
|
```
|
|
|
|
**Step 1: Configure server settings**
|
|
|
|
Choose configuration based on your model size:
|
|
|
|
```bash
|
|
# For 7B-13B models on single GPU
|
|
vllm serve meta-llama/Llama-3-8B-Instruct \
|
|
--gpu-memory-utilization 0.9 \
|
|
--max-model-len 8192 \
|
|
--port 8000
|
|
|
|
# For 30B-70B models with tensor parallelism
|
|
vllm serve meta-llama/Llama-2-70b-hf \
|
|
--tensor-parallel-size 4 \
|
|
--gpu-memory-utilization 0.9 \
|
|
--quantization awq \
|
|
--port 8000
|
|
|
|
# For production with caching and metrics
|
|
vllm serve meta-llama/Llama-3-8B-Instruct \
|
|
--gpu-memory-utilization 0.9 \
|
|
--enable-prefix-caching \
|
|
--enable-metrics \
|
|
--metrics-port 9090 \
|
|
--port 8000 \
|
|
--host 0.0.0.0
|
|
```
|
|
|
|
**Step 2: Test with limited traffic**
|
|
|
|
Run load test before production:
|
|
|
|
```bash
|
|
# Install load testing tool
|
|
pip install locust
|
|
|
|
# Create test_load.py with sample requests
|
|
# Run: locust -f test_load.py --host http://localhost:8000
|
|
```
|
|
|
|
Verify TTFT (time to first token) < 500ms and throughput > 100 req/sec.
|
|
|
|
**Step 3: Enable monitoring**
|
|
|
|
vLLM exposes Prometheus metrics on port 9090:
|
|
|
|
```bash
|
|
curl http://localhost:9090/metrics | grep vllm
|
|
```
|
|
|
|
Key metrics to monitor:
|
|
- `vllm:time_to_first_token_seconds` - Latency
|
|
- `vllm:num_requests_running` - Active requests
|
|
- `vllm:gpu_cache_usage_perc` - KV cache utilization
|
|
|
|
**Step 4: Deploy to production**
|
|
|
|
Use Docker for consistent deployment:
|
|
|
|
```bash
|
|
# Run vLLM in Docker
|
|
docker run --gpus all -p 8000:8000 \
|
|
vllm/vllm-openai:latest \
|
|
--model meta-llama/Llama-3-8B-Instruct \
|
|
--gpu-memory-utilization 0.9 \
|
|
--enable-prefix-caching
|
|
```
|
|
|
|
**Step 5: Verify performance metrics**
|
|
|
|
Check that deployment meets targets:
|
|
- TTFT < 500ms (for short prompts)
|
|
- Throughput > target req/sec
|
|
- GPU utilization > 80%
|
|
- No OOM errors in logs
|
|
|
|
### Workflow 2: Offline batch inference
|
|
|
|
For processing large datasets without server overhead.
|
|
|
|
Copy this checklist:
|
|
|
|
```
|
|
Batch Processing:
|
|
- [ ] Step 1: Prepare input data
|
|
- [ ] Step 2: Configure LLM engine
|
|
- [ ] Step 3: Run batch inference
|
|
- [ ] Step 4: Process results
|
|
```
|
|
|
|
**Step 1: Prepare input data**
|
|
|
|
```python
|
|
# Load prompts from file
|
|
prompts = []
|
|
with open("prompts.txt") as f:
|
|
prompts = [line.strip() for line in f]
|
|
|
|
print(f"Loaded {len(prompts)} prompts")
|
|
```
|
|
|
|
**Step 2: Configure LLM engine**
|
|
|
|
```python
|
|
from vllm import LLM, SamplingParams
|
|
|
|
llm = LLM(
|
|
model="meta-llama/Llama-3-8B-Instruct",
|
|
tensor_parallel_size=2, # Use 2 GPUs
|
|
gpu_memory_utilization=0.9,
|
|
max_model_len=4096
|
|
)
|
|
|
|
sampling = SamplingParams(
|
|
temperature=0.7,
|
|
top_p=0.95,
|
|
max_tokens=512,
|
|
stop=["</s>", "\n\n"]
|
|
)
|
|
```
|
|
|
|
**Step 3: Run batch inference**
|
|
|
|
vLLM automatically batches requests for efficiency:
|
|
|
|
```python
|
|
# Process all prompts in one call
|
|
outputs = llm.generate(prompts, sampling)
|
|
|
|
# vLLM handles batching internally
|
|
# No need to manually chunk prompts
|
|
```
|
|
|
|
**Step 4: Process results**
|
|
|
|
```python
|
|
# Extract generated text
|
|
results = []
|
|
for output in outputs:
|
|
prompt = output.prompt
|
|
generated = output.outputs[0].text
|
|
results.append({
|
|
"prompt": prompt,
|
|
"generated": generated,
|
|
"tokens": len(output.outputs[0].token_ids)
|
|
})
|
|
|
|
# Save to file
|
|
import json
|
|
with open("results.jsonl", "w") as f:
|
|
for result in results:
|
|
f.write(json.dumps(result) + "\n")
|
|
|
|
print(f"Processed {len(results)} prompts")
|
|
```
|
|
|
|
### Workflow 3: Quantized model serving
|
|
|
|
Fit large models in limited GPU memory.
|
|
|
|
```
|
|
Quantization Setup:
|
|
- [ ] Step 1: Choose quantization method
|
|
- [ ] Step 2: Find or create quantized model
|
|
- [ ] Step 3: Launch with quantization flag
|
|
- [ ] Step 4: Verify accuracy
|
|
```
|
|
|
|
**Step 1: Choose quantization method**
|
|
|
|
- **AWQ**: Best for 70B models, minimal accuracy loss
|
|
- **GPTQ**: Wide model support, good compression
|
|
- **FP8**: Fastest on H100 GPUs
|
|
|
|
**Step 2: Find or create quantized model**
|
|
|
|
Use pre-quantized models from HuggingFace:
|
|
|
|
```bash
|
|
# Search for AWQ models
|
|
# Example: TheBloke/Llama-2-70B-AWQ
|
|
```
|
|
|
|
**Step 3: Launch with quantization flag**
|
|
|
|
```bash
|
|
# Using pre-quantized model
|
|
vllm serve TheBloke/Llama-2-70B-AWQ \
|
|
--quantization awq \
|
|
--tensor-parallel-size 1 \
|
|
--gpu-memory-utilization 0.95
|
|
|
|
# Results: 70B model in ~40GB VRAM
|
|
```
|
|
|
|
**Step 4: Verify accuracy**
|
|
|
|
Test outputs match expected quality:
|
|
|
|
```python
|
|
# Compare quantized vs non-quantized responses
|
|
# Verify task-specific performance unchanged
|
|
```
|
|
|
|
## When to use vs alternatives
|
|
|
|
**Use vLLM when:**
|
|
- Deploying production LLM APIs (100+ req/sec)
|
|
- Serving OpenAI-compatible endpoints
|
|
- Limited GPU memory but need large models
|
|
- Multi-user applications (chatbots, assistants)
|
|
- Need low latency with high throughput
|
|
|
|
**Use alternatives instead:**
|
|
- **llama.cpp**: CPU/edge inference, single-user
|
|
- **HuggingFace transformers**: Research, prototyping, one-off generation
|
|
- **TensorRT-LLM**: NVIDIA-only, need absolute maximum performance
|
|
- **Text-Generation-Inference**: Already in HuggingFace ecosystem
|
|
|
|
## Common issues
|
|
|
|
**Issue: Out of memory during model loading**
|
|
|
|
Reduce memory usage:
|
|
```bash
|
|
vllm serve MODEL \
|
|
--gpu-memory-utilization 0.7 \
|
|
--max-model-len 4096
|
|
```
|
|
|
|
Or use quantization:
|
|
```bash
|
|
vllm serve MODEL --quantization awq
|
|
```
|
|
|
|
**Issue: Slow first token (TTFT > 1 second)**
|
|
|
|
Enable prefix caching for repeated prompts:
|
|
```bash
|
|
vllm serve MODEL --enable-prefix-caching
|
|
```
|
|
|
|
For long prompts, enable chunked prefill:
|
|
```bash
|
|
vllm serve MODEL --enable-chunked-prefill
|
|
```
|
|
|
|
**Issue: Model not found error**
|
|
|
|
Use `--trust-remote-code` for custom models:
|
|
```bash
|
|
vllm serve MODEL --trust-remote-code
|
|
```
|
|
|
|
**Issue: Low throughput (<50 req/sec)**
|
|
|
|
Increase concurrent sequences:
|
|
```bash
|
|
vllm serve MODEL --max-num-seqs 512
|
|
```
|
|
|
|
Check GPU utilization with `nvidia-smi` - should be >80%.
|
|
|
|
**Issue: Inference slower than expected**
|
|
|
|
Verify tensor parallelism uses power of 2 GPUs:
|
|
```bash
|
|
vllm serve MODEL --tensor-parallel-size 4 # Not 3
|
|
```
|
|
|
|
Enable speculative decoding for faster generation:
|
|
```bash
|
|
vllm serve MODEL --speculative-model DRAFT_MODEL
|
|
```
|
|
|
|
## Advanced topics
|
|
|
|
**Server deployment patterns**: See [references/server-deployment.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/inference/vllm/references/server-deployment.md) for Docker, Kubernetes, and load balancing configurations.
|
|
|
|
**Performance optimization**: See [references/optimization.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/inference/vllm/references/optimization.md) for PagedAttention tuning, continuous batching details, and benchmark results.
|
|
|
|
**Quantization guide**: See [references/quantization.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/inference/vllm/references/quantization.md) for AWQ/GPTQ/FP8 setup, model preparation, and accuracy comparisons.
|
|
|
|
**Troubleshooting**: See [references/troubleshooting.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/inference/vllm/references/troubleshooting.md) for detailed error messages, debugging steps, and performance diagnostics.
|
|
|
|
## Hardware requirements
|
|
|
|
- **Small models (7B-13B)**: 1x A10 (24GB) or A100 (40GB)
|
|
- **Medium models (30B-40B)**: 2x A100 (40GB) with tensor parallelism
|
|
- **Large models (70B+)**: 4x A100 (40GB) or 2x A100 (80GB), use AWQ/GPTQ
|
|
|
|
Supported platforms: NVIDIA (primary), AMD ROCm, Intel GPUs, TPUs
|
|
|
|
## Resources
|
|
|
|
- Official docs: https://docs.vllm.ai
|
|
- GitHub: https://github.com/vllm-project/vllm
|
|
- Paper: "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
|
|
- Community: https://discuss.vllm.ai
|