diff --git a/skills/mlops/obliteratus/SKILL.md b/skills/mlops/obliteratus/SKILL.md
index 5d325b7ffb..d9525a347d 100644
--- a/skills/mlops/obliteratus/SKILL.md
+++ b/skills/mlops/obliteratus/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: obliteratus
-description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — SVD-based weight projection that excises guardrails while preserving reasoning. Supports 13 methods, 27 analysis modules, 116 model presets across 5 compute tiers. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
+description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods (+ 4 Python-API-only), 15 analysis modules, 116 model presets across 5 compute tiers. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
 version: 1.0.0
 author: Hermes Agent
 license: MIT
@@ -13,7 +13,7 @@ metadata:
 
 # OBLITERATUS Skill
 
-Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses SVD-based weight projection to surgically excise refusal directions from model weights while preserving reasoning capabilities.
+Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.
 
 **License warning:** OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (`obliteratus` command) or subprocess. This keeps Hermes Agent's MIT license clean.
 
@@ -100,25 +100,31 @@ obliteratus info meta-llama/Llama-3.1-8B-Instruct
 | Reasoning model (R1 distills) | `surgical` | CoT-aware, preserves chain-of-thought |
 | Stubborn refusals persist | `aggressive` | Whitened SVD + head surgery + jailbreak |
 | Want reversible changes | Use steering vectors (see Analysis section) |
-| Reproducing prior work | `failspy`, `gabliteration`, `heretic`, `rdo` |
 | Maximum quality, time no object | `optimized` | Bayesian search for best parameters |
 
-### All 13 Methods
+### 9 CLI Methods
+
+These can be passed to `--method` on the command line:
 
 - **basic** — Single refusal direction via diff-in-means. Fastest, simplest. (Arditi et al. 2024)
-- **failspy** — FailSpy/abliterator reproduction
-- **gabliteration** — Gabliteration reproduction
-- **heretic** — Heretic/p-e-w reproduction
-- **rdo** — Refusal Direction Optimization (ICML 2025)
 - **advanced** — Multiple SVD directions, norm-preserving projection. Good default.
-- **inverted** — Flips the refusal direction (model becomes eager to help, not just neutral)
 - **aggressive** — Whitened SVD + jailbreak contrast + attention head surgery
 - **spectral_cascade** — DCT frequency-domain decomposition
 - **informed** — Runs analysis DURING abliteration to auto-configure. Detects DPO/RLHF/CAI, maps refusal geometry, compensates for self-repair. Best quality.
 - **surgical** — SAE features + neuron masking + head surgery + per-expert. Maximum precision.
 - **optimized** — Bayesian hyperparameter search (Optuna TPE). Slowest but optimal.
+- **inverted** — Flips the refusal direction (model becomes eager to help, not just neutral)
 - **nuclear** — Maximum force combo for stubborn MoE models.
 
+### 4 Python-API-Only Methods
+
+These reproduce prior community/academic work but are NOT available via CLI — only via the Python API (`from obliteratus.abliterate import AbliterationPipeline`). **Do not use these in CLI commands.**
+
+- **failspy** — FailSpy/abliterator reproduction
+- **gabliteration** — Gabliteration reproduction
+- **heretic** — Heretic/p-e-w reproduction
+- **rdo** — Refusal Direction Optimization (ICML 2025)
+
 ## Step 5: Run Abliteration
 
 ### Basic Usage
@@ -226,7 +232,7 @@ huggingface-cli upload your-username/model-name-abliterated ./abliterated-models
 vllm serve ./abliterated-models/model-name --port 8000
 ```
 
-## Analysis Modules (Pre-Abliteration, Optional)
+## Analysis Modules (15 Modules, Pre-Abliteration, Optional)
 
 For understanding refusal geometry before committing to abliteration.
 
@@ -275,9 +281,9 @@ obliteratus run my_study.yaml
 
 ## Telemetry Notice
 
-- **Local installs**: Telemetry OFF by default. Opt-in via `OBLITERATUS_TELEMETRY=1`
-- **HuggingFace Spaces**: Telemetry ON by default
-- Collected: model ID, method, scores, hardware, timing (anonymous)
+- **CLI usage (local installs)**: Telemetry is OFF by default. Must explicitly opt in via `OBLITERATUS_TELEMETRY=1` env var or `--contribute` flag.
+- **HuggingFace Spaces**: Telemetry is ON by default (auto-enabled when `SPACE_ID` env var is detected).
+- Collected: model ID, method, benchmark scores, hardware info, timing (anonymous)
 - NOT collected: IP addresses, user identity, prompt content
 - Force off: `export OBLITERATUS_TELEMETRY=0`
 
diff --git a/skills/mlops/obliteratus/references/analysis-modules.md b/skills/mlops/obliteratus/references/analysis-modules.md
index 5088c3adbb..075148a008 100644
--- a/skills/mlops/obliteratus/references/analysis-modules.md
+++ b/skills/mlops/obliteratus/references/analysis-modules.md
@@ -1,8 +1,12 @@
 # OBLITERATUS Analysis Modules — Reference
 
-27 analysis modules for mechanistic interpretability of refusal in LLMs.
+15 analysis modules for mechanistic interpretability of refusal in LLMs.
 These help you understand HOW a model refuses before you decide to remove it.
 
+> **Note:** The `analysis/` directory contains additional utility files (utils.py,
+> visualization.py, etc.) and helper functions beyond the 15 core analysis modules
+> listed below. The module count matches the README's "15 deep analysis modules."
+
 ## Core Analysis (Run These First)
 
 ### Alignment Imprint Detection
diff --git a/skills/mlops/obliteratus/references/methods-guide.md b/skills/mlops/obliteratus/references/methods-guide.md
index 1b574015b8..5f7c501b00 100644
--- a/skills/mlops/obliteratus/references/methods-guide.md
+++ b/skills/mlops/obliteratus/references/methods-guide.md
@@ -1,5 +1,10 @@
 # OBLITERATUS Methods — Detailed Guide
 
+> **Important:** The CLI (`obliteratus obliterate --method`) accepts 9 methods:
+> basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized,
+> inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo)
+> are available only via the Python API and will be rejected by argparse if used on CLI.
+
 ## How Abliteration Works (Theory)
 
 When a model is trained with RLHF/DPO/CAI, it learns to represent "should I refuse?"
@@ -84,11 +89,13 @@ The informed pipeline runs these analysis modules during abliteration:
 **Best for:** When you want the model to be maximally helpful
 **Warning:** Can make the model too eager; may reduce safety-adjacent reasoning
 
-### failspy / gabliteration / heretic / rdo
+### failspy / gabliteration / heretic / rdo (PYTHON API ONLY)
 **Technique:** Faithful reproductions of prior community/academic work
 **Speed:** Varies
 **Quality:** Known baselines
 **Best for:** Reproducing published results, comparing methods
+**⚠️ NOT available via CLI** — these methods are only accessible via the Python API.
+Do not use `--method failspy` etc. in CLI commands; argparse will reject them.
 
 ## Method Selection Flowchart
 
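
The CLI/Python-API split this diff documents could be enforced in a wrapper before the command is ever launched. The following is a minimal sketch, not part of OBLITERATUS: `build_obliterate_command` is a hypothetical helper, and passing the model ID as a positional argument is an assumption that should be checked against the CLI's own help output. Only the method names and the `obliteratus obliterate --method` form come from the diff above.

```python
# Method names exactly as listed in the diff above.
CLI_METHODS = frozenset({
    "basic", "advanced", "aggressive", "spectral_cascade", "informed",
    "surgical", "optimized", "inverted", "nuclear",
})
API_ONLY_METHODS = frozenset({"failspy", "gabliteration", "heretic", "rdo"})


def build_obliterate_command(model: str, method: str = "advanced") -> list[str]:
    """Build an argv list for `obliteratus obliterate` (hypothetical helper).

    ASSUMPTION: the model ID is a positional argument; verify against the
    CLI's help output before relying on this.
    """
    if method in API_ONLY_METHODS:
        # Fail early with a clearer message than the argparse rejection.
        raise ValueError(f"{method!r} is Python-API-only and not accepted by the CLI")
    if method not in CLI_METHODS:
        raise ValueError(f"unknown method {method!r}")
    return ["obliteratus", "obliterate", model, "--method", method]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, which keeps the invocation out of process as the license warning requires and avoids shell quoting issues.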