fix: correct method count and analysis module count per creator review

Fixes based on feedback from OBLITERATUS creator:

- CLI only accepts 9 methods (basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized, inverted, nuclear). The 4 reproduction methods (failspy, gabliteration, heretic, rdo) are Python-API-only and will be rejected by argparse. Separated into 'CLI Methods' and 'Python-API-Only Methods' sections with clear warnings.
- Analysis module count corrected from 27 to 15, matching the README. The analysis/ directory has 24+ .py files but includes utilities, visualization helpers, and __init__.py beyond the 15 core modules.
- Description broadened from 'SVD-based weight projection' to 'mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, SAE decomposition, etc.)' to better represent the method diversity.
- Telemetry notice clarified: CLI defaults to OFF, opt-in via OBLITERATUS_TELEMETRY=1 or --contribute flag.
2026-04-28 06:51:16 +08:00 · 2026-03-04 18:07:18 -08:00
parent 5f85fe4be9
commit 58aa8c1846
3 changed files with 32 additions and 15 deletions
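
Reviewer note: the method split described in the commit message could be enforced by a caller before any CLI command is assembled. The following is a minimal sketch, not part of this commit or of OBLITERATUS itself. The method names and the `OBLITERATUS_TELEMETRY` variable come from the commit message; the helper names `check_cli_method` and `telemetry_off_env` are hypothetical.

```python
# Method names are copied from the commit message; helper names are hypothetical.
CLI_METHODS = frozenset({
    "basic", "advanced", "aggressive", "spectral_cascade", "informed",
    "surgical", "optimized", "inverted", "nuclear",
})
API_ONLY_METHODS = frozenset({"failspy", "gabliteration", "heretic", "rdo"})


def check_cli_method(name: str) -> str:
    """Return name if it may be passed to --method, otherwise raise ValueError."""
    if name in CLI_METHODS:
        return name
    if name in API_ONLY_METHODS:
        # These four are the reproduction methods that argparse rejects.
        raise ValueError(f"{name!r} is Python-API-only and is not accepted by the CLI")
    raise ValueError(f"unknown method {name!r}; expected one of {sorted(CLI_METHODS)}")


def telemetry_off_env(base: dict) -> dict:
    """Return a copy of an environment mapping with telemetry forced off."""
    env = dict(base)  # copy so the caller's mapping is left untouched
    env["OBLITERATUS_TELEMETRY"] = "0"
    return env
```

A caller would typically validate the method first and then pass `telemetry_off_env(os.environ)` as the `env` argument of `subprocess.run`. The remaining command-line arguments are not shown in this commit, so they are left out of the sketch.
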
--- a/skills/mlops/obliteratus/SKILL.md
+++ b/skills/mlops/obliteratus/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: obliteratus
-description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — SVD-based weight projection that excises guardrails while preserving reasoning. Supports 13 methods, 27 analysis modules, 116 model presets across 5 compute tiers. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
+description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods (+ 4 Python-API-only), 15 analysis modules, 116 model presets across 5 compute tiers. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
 version: 1.0.0
 author: Hermes Agent
 license: MIT
@@ -13,7 +13,7 @@ metadata:

 # OBLITERATUS Skill

-Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses SVD-based weight projection to surgically excise refusal directions from model weights while preserving reasoning capabilities.
+Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.

 **License warning:** OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (`obliteratus` command) or subprocess. This keeps Hermes Agent's MIT license clean.

@@ -100,25 +100,31 @@ obliteratus info meta-llama/Llama-3.1-8B-Instruct
 | Reasoning model (R1 distills)     | `surgical`         | CoT-aware, preserves chain-of-thought    |
 | Stubborn refusals persist         | `aggressive`       | Whitened SVD + head surgery + jailbreak   |
 | Want reversible changes           | Use steering vectors (see Analysis section) |
-| Reproducing prior work            | `failspy`, `gabliteration`, `heretic`, `rdo` |
 | Maximum quality, time no object   | `optimized`        | Bayesian search for best parameters      |

-### All 13 Methods
+### 9 CLI Methods
+
+These can be passed to `--method` on the command line:

 - **basic** — Single refusal direction via diff-in-means. Fastest, simplest. (Arditi et al. 2024)
- **failspy** — FailSpy/abliterator reproduction
- **gabliteration** — Gabliteration reproduction
- **heretic** — Heretic/p-e-w reproduction
- **rdo** — Refusal Direction Optimization (ICML 2025)
 - **advanced** — Multiple SVD directions, norm-preserving projection. Good default.
- **inverted** — Flips the refusal direction (model becomes eager to help, not just neutral)
 - **aggressive** — Whitened SVD + jailbreak contrast + attention head surgery
 - **spectral_cascade** — DCT frequency-domain decomposition
 - **informed** — Runs analysis DURING abliteration to auto-configure. Detects DPO/RLHF/CAI, maps refusal geometry, compensates for self-repair. Best quality.
 - **surgical** — SAE features + neuron masking + head surgery + per-expert. Maximum precision.
 - **optimized** — Bayesian hyperparameter search (Optuna TPE). Slowest but optimal.
+- **inverted** — Flips the refusal direction (model becomes eager to help, not just neutral)
 - **nuclear** — Maximum force combo for stubborn MoE models.

+### 4 Python-API-Only Methods
+
+These reproduce prior community/academic work but are NOT available via CLI — only via the Python API (`from obliteratus.abliterate import AbliterationPipeline`). **Do not use these in CLI commands.**
+
+- **failspy** — FailSpy/abliterator reproduction
+- **gabliteration** — Gabliteration reproduction
+- **heretic** — Heretic/p-e-w reproduction
+- **rdo** — Refusal Direction Optimization (ICML 2025)
+
 ## Step 5: Run Abliteration

 ### Basic Usage
@@ -226,7 +232,7 @@ huggingface-cli upload your-username/model-name-abliterated ./abliterated-models
 vllm serve ./abliterated-models/model-name --port 8000
 ```

-## Analysis Modules (Pre-Abliteration, Optional)
+## Analysis Modules (15 Modules, Pre-Abliteration, Optional)

 For understanding refusal geometry before committing to abliteration.

@@ -275,9 +281,9 @@ obliteratus run my_study.yaml

 ## Telemetry Notice

- **Local installs**: Telemetry OFF by default. Opt-in via `OBLITERATUS_TELEMETRY=1`
- **HuggingFace Spaces**: Telemetry ON by default
- Collected: model ID, method, scores, hardware, timing (anonymous)
+- **CLI usage (local installs)**: Telemetry is OFF by default. Must explicitly opt in via `OBLITERATUS_TELEMETRY=1` env var or `--contribute` flag.
+- **HuggingFace Spaces**: Telemetry is ON by default (auto-enabled when `SPACE_ID` env var is detected).
+- Collected: model ID, method, benchmark scores, hardware info, timing (anonymous)
 - NOT collected: IP addresses, user identity, prompt content
 - Force off: `export OBLITERATUS_TELEMETRY=0`

--- a/skills/mlops/obliteratus/references/analysis-modules.md
+++ b/skills/mlops/obliteratus/references/analysis-modules.md
@@ -1,8 +1,12 @@
 # OBLITERATUS Analysis Modules — Reference

-27 analysis modules for mechanistic interpretability of refusal in LLMs.
+15 analysis modules for mechanistic interpretability of refusal in LLMs.
 These help you understand HOW a model refuses before you decide to remove it.

+> **Note:** The `analysis/` directory contains additional utility files (utils.py,
+> visualization.py, etc.) and helper functions beyond the 15 core analysis modules
+> listed below. The module count matches the README's "15 deep analysis modules."
+
 ## Core Analysis (Run These First)

 ### Alignment Imprint Detection
--- a/skills/mlops/obliteratus/references/methods-guide.md
+++ b/skills/mlops/obliteratus/references/methods-guide.md
@@ -1,5 +1,10 @@
 # OBLITERATUS Methods — Detailed Guide

+> **Important:** The CLI (`obliteratus obliterate --method`) accepts 9 methods:
+> basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized,
+> inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo)
+> are available only via the Python API and will be rejected by argparse if used on CLI.
+
 ## How Abliteration Works (Theory)

 When a model is trained with RLHF/DPO/CAI, it learns to represent "should I refuse?"
@@ -84,11 +89,13 @@ The informed pipeline runs these analysis modules during abliteration:
 **Best for:** When you want the model to be maximally helpful
 **Warning:** Can make the model too eager; may reduce safety-adjacent reasoning

-### failspy / gabliteration / heretic / rdo
+### failspy / gabliteration / heretic / rdo (PYTHON API ONLY)
 **Technique:** Faithful reproductions of prior community/academic work
 **Speed:** Varies
 **Quality:** Known baselines
 **Best for:** Reproducing published results, comparing methods
+**⚠️ NOT available via CLI** — these methods are only accessible via the Python API.
+Do not use `--method failspy` etc. in CLI commands; argparse will reject them.

 ## Method Selection Flowchart