hermes-agent/website/docs/user-guide/skills/optional/mlops/mlops-stable-diffusion.md

| title | sidebar_label | description |
|---|---|---|
| Stable Diffusion Image Generation | Stable Diffusion Image Generation | State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers |

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

Stable Diffusion Image Generation

State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use when generating images from text prompts, performing image-to-image translation, inpainting, or building custom diffusion pipelines.

Skill metadata

| | |
|---|---|
| Source | Optional; install with `hermes skills install official/mlops/stable-diffusion` |
| Path | `optional-skills/mlops/stable-diffusion` |
| Version | `1.0.0` |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | `diffusers>=0.30.0`, `transformers>=4.41.0`, `accelerate>=0.31.0`, `torch>=2.0.0` |
| Tags | `Image Generation`, `Stable Diffusion`, `Diffusers`, `Text-to-Image`, `Multimodal`, `Computer Vision` |

Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

Stable Diffusion Image Generation

Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.

When to use Stable Diffusion

Use Stable Diffusion when:

  • Generating images from text descriptions
  • Performing image-to-image translation (style transfer, enhancement)
  • Inpainting (filling in masked regions)
  • Outpainting (extending images beyond boundaries)
  • Creating variations of existing images
  • Building custom image generation workflows

Key features:

  • Text-to-Image: Generate images from natural language prompts
  • Image-to-Image: Transform existing images with text guidance
  • Inpainting: Fill masked regions with context-aware content
  • ControlNet: Add spatial conditioning (edges, poses, depth)
  • LoRA Support: Efficient fine-tuning and style adaptation
  • Multiple Models: SD 1.5, SDXL, SD 3.0, Flux support

Use alternatives instead:

  • DALL-E 3: For API-based generation without GPU
  • Midjourney: For artistic, stylized outputs
  • Imagen: For Google Cloud integration
  • Leonardo.ai: For web-based creative workflows

Quick start

Installation

pip install diffusers transformers accelerate torch
pip install xformers  # Optional: memory-efficient attention

Basic text-to-image

from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]

image.save("output.png")

Using SDXL (higher quality)

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
# Enable memory optimization. Model CPU offload manages device placement
# itself, so do not also call pipe.to("cuda").
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30
).images[0]

Architecture overview

Three-pillar design

Diffusers is built around three core components:

Pipeline (orchestration)
├── Model (neural networks)
│   ├── UNet / Transformer (noise prediction)
│   ├── VAE (latent encoding/decoding)
│   └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)

Pipeline inference flow

Text Prompt → Text Encoder → Text Embeddings
                                    ↓
Random Noise → [Denoising Loop] ← Scheduler
                      ↓
               Predicted Noise
                      ↓
              VAE Decoder → Final Image
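
To make the flow above concrete, the sketch below computes the shape of the latent tensor that the denoising loop operates on. It assumes the VAE downsamples each spatial side by 8 and uses 4 latent channels, which holds for SD 1.x and SDXL; other model families use different channel counts. `latent_shape` is a hypothetical helper for illustration, not a Diffusers API.

```python
def latent_shape(height, width, batch_size=1, channels=4, vae_scale_factor=8):
    """Return the (batch, channels, h, w) shape of the latent being denoised.

    Assumes an SD 1.x / SDXL-style VAE: 8x spatial downsampling, 4 channels.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError(
            f"height and width must be multiples of {vae_scale_factor}"
        )
    return (
        batch_size,
        channels,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


print(latent_shape(512, 512))    # (1, 4, 64, 64)
print(latent_shape(1024, 1024))  # (1, 4, 128, 128)
```

The denoising loop runs on this much smaller tensor, and the VAE decoder expands it back to pixel resolution only once, at the end.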

Core concepts

Pipelines

Pipelines orchestrate complete workflows:

| Pipeline | Purpose |
|---|---|
| `StableDiffusionPipeline` | Text-to-image (SD 1.x/2.x) |
| `StableDiffusionXLPipeline` | Text-to-image (SDXL) |
| `StableDiffusion3Pipeline` | Text-to-image (SD 3.0) |
| `FluxPipeline` | Text-to-image (Flux models) |
| `StableDiffusionImg2ImgPipeline` | Image-to-image |
| `StableDiffusionInpaintPipeline` | Inpainting |

Schedulers

Schedulers control the denoising process:

| Scheduler | Steps | Quality | Use Case |
|---|---|---|---|
| `EulerDiscreteScheduler` | 20-50 | Good | Default choice |
| `EulerAncestralDiscreteScheduler` | 20-50 | Good | More variation |
| `DPMSolverMultistepScheduler` | 15-25 | Excellent | Fast, high quality |
| `DDIMScheduler` | 50-100 | Good | Deterministic |
| `LCMScheduler` | 4-8 | Good | Very fast |
| `UniPCMultistepScheduler` | 15-25 | Excellent | Fast convergence |
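
As a rough illustration of how the step ranges above might be used in code, the sketch below picks a step count for a given scheduler class name. The ranges are copied from the table; `SCHEDULER_STEPS` and `suggest_steps` are hypothetical names, not part of Diffusers.

```python
# Step ranges (low, high) copied from the scheduler table.
SCHEDULER_STEPS = {
    "EulerDiscreteScheduler": (20, 50),
    "EulerAncestralDiscreteScheduler": (20, 50),
    "DPMSolverMultistepScheduler": (15, 25),
    "DDIMScheduler": (50, 100),
    "LCMScheduler": (4, 8),
    "UniPCMultistepScheduler": (15, 25),
}


def suggest_steps(scheduler_name, prefer="balanced"):
    """Pick a step count inside the table's range for a scheduler.

    prefer: "fast" -> low end, "quality" -> high end, "balanced" -> midpoint.
    """
    low, high = SCHEDULER_STEPS[scheduler_name]
    if prefer == "fast":
        return low
    if prefer == "quality":
        return high
    if prefer == "balanced":
        return (low + high) // 2
    raise ValueError(f"unknown preference: {prefer!r}")


print(suggest_steps("DPMSolverMultistepScheduler"))  # 20
print(suggest_steps("LCMScheduler", prefer="fast"))  # 4
```

With a loaded pipeline, the lookup key could be taken from `type(pipe.scheduler).__name__`.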

Swapping schedulers

from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

# Now generate with fewer steps
image = pipe(prompt, num_inference_steps=20).images[0]

Generation parameters

Key parameters

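| Parameter | Default | Description |
|---|---|---|
| `prompt` | Required | Text description of desired image |
| `negative_prompt` | None | What to avoid in the image |
| `num_inference_steps` | 50 | Denoising steps (more steps are slower and usually improve quality, with diminishing returns) |
| `guidance_scale` | 7.5 | Prompt adherence (7-12 typical) |
| `height`, `width` | 512/1024 | Output dimensions in pixels (multiples of 8); the default depends on the model |
| `generator` | None | Torch generator for reproducibility |
| `num_images_per_prompt` | 1 | Number of images generated per prompt |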
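
The effect of `guidance_scale` is easier to see in the classifier-free guidance formula: the model predicts noise with and without the prompt, and the result is pushed away from the unconditional prediction toward the prompt-conditioned one. The toy sketch below applies that formula to plain Python lists; in Diffusers the same arithmetic happens inside the pipeline on tensors, and `apply_guidance` is a hypothetical name used only here.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and prompt-conditioned noise predictions.

    Classifier-free guidance: uncond + scale * (text - uncond).
    """
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


uncond = [0.0, 1.0, -1.0]
text = [1.0, 1.0, 0.0]

print(apply_guidance(uncond, text, 1.0))  # [1.0, 1.0, 0.0]  (same as text)
print(apply_guidance(uncond, text, 7.5))  # [7.5, 1.0, 6.5]
```

A scale of 1.0 reproduces the prompt-conditioned prediction unchanged, while larger values exaggerate the difference between the two predictions, which is why very high scales tend to produce oversaturated or distorted images.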

Reproducible generation

import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50
).images[0]
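
When generating several images at once, one seed per image makes it possible to regenerate any single image later. The sketch below only derives the seeds; `per_image_seeds` is a hypothetical helper, and the commented lines assume that the pipeline accepts a list of generators (one per image), which Diffusers pipelines generally do.

```python
def per_image_seeds(base_seed, num_images):
    """Derive one seed per image so any single image can be regenerated alone."""
    return [base_seed + i for i in range(num_images)]


seeds = per_image_seeds(42, 4)
print(seeds)  # [42, 43, 44, 45]

# With torch installed and a pipeline loaded, the seeds would become
# one generator per image:
#   generators = [torch.Generator(device="cuda").manual_seed(s) for s in seeds]
#   images = pipe(prompt, num_images_per_prompt=4, generator=generators).images
```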

Negative prompts

image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5
).images[0]

Image-to-image

Transform existing images with text guidance:

import torch
from diffusers import AutoPipelineForImage2Image
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0-1)
    num_inference_steps=50
).images[0]
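
Note that `strength` also changes how many denoising steps actually run. The sketch below assumes the usual Stable Diffusion img2img behavior, where the schedule is truncated to roughly `num_inference_steps * strength` steps; `img2img_steps` is a hypothetical helper, not a Diffusers function.

```python
def img2img_steps(num_inference_steps, strength):
    """Number of denoising steps that actually run for a given strength.

    Assumes the schedule is truncated to its last
    int(num_inference_steps * strength) steps.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


print(img2img_steps(50, 0.75))  # 37
print(img2img_steps(50, 1.0))   # 50
print(img2img_steps(50, 0.2))   # 10
```

Under that assumption, a low strength with few steps leaves very little denoising, so raising `num_inference_steps` can help when the output barely differs from the input.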

Inpainting

Fill masked regions:

import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image

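# The original runwayml/stable-diffusion-inpainting repository is no longer
# hosted on the Hub; this mirror lives alongside the SD 1.5 mirror used above.
pipe = AutoPipelineForInpainting.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")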

image = Image.open("photo.jpg")
mask = Image.open("mask.png")  # White = inpaint region

result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]

ControlNet

Add spatial conditioning for precise control:

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

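# Use a Canny edge image as control.
# get_canny_image is not part of diffusers: it stands for a user-supplied
# helper (for example, one built on OpenCV's cv2.Canny) that returns a PIL
# edge map. input_image is the source PIL image.
control_image = get_canny_image(input_image)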

image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30
).images[0]

Available ControlNets

| ControlNet | Input Type | Use Case |
|---|---|---|
| `canny` | Edge maps | Preserve structure |
| `openpose` | Pose skeletons | Human poses |
| `depth` | Depth maps | 3D-aware generation |
| `normal` | Normal maps | Surface details |
| `mlsd` | Line segments | Architectural lines |
| `scribble` | Rough sketches | Sketch-to-image |

LoRA adapters

Load fine-tuned style adapters:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

# Generate with LoRA style
image = pipe("A portrait in the trained style").images[0]

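# Optionally fuse the LoRA into the base weights at a chosen strength
pipe.fuse_lora(lora_scale=0.8)

# Undo the fusion, then remove the LoRA weights
pipe.unfuse_lora()
pipe.unload_lora_weights()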

Multiple LoRAs

# Load multiple LoRAs
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")

# Set weights for each
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])

image = pipe("A portrait").images[0]

Memory optimization

Enable CPU offloading

# Choose one of the two offload modes, not both.

# Model CPU offload: moves whole models to the CPU when not in use
pipe.enable_model_cpu_offload()

# Sequential CPU offload: offloads submodule by submodule; saves more memory but is slower
pipe.enable_sequential_cpu_offload()

Attention slicing

# Reduce memory by computing attention in chunks
pipe.enable_attention_slicing()

# Or use the smallest slices for maximum memory savings (slower)
pipe.enable_attention_slicing("max")

xFormers memory-efficient attention

# Requires xformers package
pipe.enable_xformers_memory_efficient_attention()

VAE slicing and tiling

# Decode a batch of latents one image at a time
pipe.enable_vae_slicing()

# Decode each latent in tiles, for large images
pipe.enable_vae_tiling()

Model variants

Loading different precisions

# FP16 (recommended for GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.float16,
    variant="fp16"
)

# BF16 (wider dynamic range than FP16, fewer overflow issues; requires Ampere+ GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16
)
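
To see why half precision matters, the sketch below estimates the memory needed just to hold a model's weights at each precision. The parameter count is an assumed round figure for illustration (roughly the size of the SD 1.5 UNet), and `weight_memory_gib` is a hypothetical helper; activations, the VAE, and the text encoder add to the real total.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumed parameter count for illustration (roughly the SD 1.5 UNet).
unet_params = 860_000_000

for dtype in ("float32", "float16", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(unet_params, dtype):.2f} GiB")
```

FP16 and BF16 both halve the weight footprint relative to FP32; they differ in numerical behavior, not in size.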

Loading specific components

import torch
from diffusers import AutoencoderKL, DiffusionPipeline

# Load a custom VAE in the same dtype as the pipeline
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse",
    torch_dtype=torch.float16
)

# Use with pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    vae=vae,
    torch_dtype=torch.float16
)

Batch generation

Generate multiple images efficiently:

# Multiple prompts
prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture"
]

images = pipe(prompts, num_inference_steps=30).images

# Multiple images per prompt
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    num_inference_steps=30
).images
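
Passing a long prompt list in a single call can exhaust GPU memory, so one option is to split it into smaller batches. The sketch below shows only the chunking logic; `chunked` is a hypothetical helper, the extra prompts are made up for the example, and the pipeline call is left as a comment because it needs a loaded model.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture",
    "A fox writing a letter",
    "An owl fixing a clock",
]

for batch in chunked(prompts, batch_size=2):
    print(len(batch), batch[0])
    # With a loaded pipeline, each batch would be generated separately:
    #   images = pipe(batch, num_inference_steps=30).images
```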

Common workflows

Workflow 1: High-quality generation

from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# 1. Load SDXL with optimizations
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Model CPU offload handles device placement, so pipe.to("cuda") is not needed
pipe.enable_model_cpu_offload()

# 2. Generate with quality settings
image = pipe(
    prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
    negative_prompt="blurry, low quality, cartoon, anime, sketch",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024,
    width=1024
).images[0]

Workflow 2: Fast prototyping

from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch

# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

# Load LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()

# Generate in 4 steps (about a second on a fast GPU; timing depends on hardware)
image = pipe(
    "A beautiful landscape",
    num_inference_steps=4,
    guidance_scale=1.0
).images[0]

Common issues

CUDA out of memory:

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

Black/noise images:

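# The safety checker returns a black image when it flags an output;
# disable it only if that is acceptable for your use case
pipe.safety_checker = None

# Keep all components in one dtype; if fp16 still yields black images,
# check the VAE (it can overflow in half precision)
pipe = pipe.to(dtype=torch.float16)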

Slow generation:

# Use faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]
