fix: correct method count and analysis module count per creator review

Fixes based on feedback from the OBLITERATUS creator:

- The CLI accepts only 9 methods (basic, advanced, aggressive,
  spectral_cascade, informed, surgical, optimized, inverted, nuclear).
  The 4 reproduction methods (failspy, gabliteration, heretic, rdo) are
  Python-API-only and are rejected by argparse. The method list is now
  split into 'CLI Methods' and 'Python-API-Only Methods' sections with
  clear warnings.

- Analysis module count corrected from 27 to 15, matching the README.
  The analysis/ directory has 24+ .py files, but that total includes
  utilities, visualization helpers, and __init__.py in addition to the
  15 core modules.

- Description broadened from 'SVD-based weight projection' to
  'mechanistic interpretability techniques (diff-in-means, SVD,
  whitened SVD, SAE decomposition, etc.)' to better represent
  the method diversity.

- Telemetry notice clarified: CLI defaults to OFF, opt-in via
  OBLITERATUS_TELEMETRY=1 or --contribute flag.
teknium1
2026-03-04 18:07:18 -08:00
parent 5f85fe4be9
commit 58aa8c1846
3 changed files with 32 additions and 15 deletions


@@ -1,6 +1,6 @@
---
name: obliteratus
description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — SVD-based weight projection that excises guardrails while preserving reasoning. Supports 13 methods, 27 analysis modules, 116 model presets across 5 compute tiers. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods (+ 4 Python-API-only), 15 analysis modules, 116 model presets across 5 compute tiers. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
version: 1.0.0
author: Hermes Agent
license: MIT
@@ -13,7 +13,7 @@ metadata:
# OBLITERATUS Skill
Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses SVD-based weight projection to surgically excise refusal directions from model weights while preserving reasoning capabilities.
Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.
**License warning:** OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (`obliteratus` command) or subprocess. This keeps Hermes Agent's MIT license clean.
@@ -100,25 +100,31 @@ obliteratus info meta-llama/Llama-3.1-8B-Instruct
| Reasoning model (R1 distills) | `surgical` | CoT-aware, preserves chain-of-thought |
| Stubborn refusals persist | `aggressive` | Whitened SVD + head surgery + jailbreak |
| Want reversible changes | Use steering vectors (see Analysis section) |
| Reproducing prior work | `failspy`, `gabliteration`, `heretic`, `rdo` |
| Maximum quality, time no object | `optimized` | Bayesian search for best parameters |
### All 13 Methods
### 9 CLI Methods
These can be passed to `--method` on the command line:
- **basic** — Single refusal direction via diff-in-means. Fastest, simplest. (Arditi et al. 2024)
- **failspy** — FailSpy/abliterator reproduction
- **gabliteration** — Gabliteration reproduction
- **heretic** — Heretic/p-e-w reproduction
- **rdo** — Refusal Direction Optimization (ICML 2025)
- **advanced** — Multiple SVD directions, norm-preserving projection. Good default.
- **inverted** — Flips the refusal direction (model becomes eager to help, not just neutral)
- **aggressive** — Whitened SVD + jailbreak contrast + attention head surgery
- **spectral_cascade** — DCT frequency-domain decomposition
- **informed** — Runs analysis DURING abliteration to auto-configure. Detects DPO/RLHF/CAI, maps refusal geometry, compensates for self-repair. Best quality.
- **surgical** — SAE features + neuron masking + head surgery + per-expert. Maximum precision.
- **optimized** — Bayesian hyperparameter search (Optuna TPE). Slowest but optimal.
- **inverted** — Flips the refusal direction (model becomes eager to help, not just neutral)
- **nuclear** — Maximum force combo for stubborn MoE models.
### 4 Python-API-Only Methods
These reproduce prior community/academic work but are NOT available via CLI — only via the Python API (`from obliteratus.abliterate import AbliterationPipeline`). **Do not use these in CLI commands.**
- **failspy** — FailSpy/abliterator reproduction
- **gabliteration** — Gabliteration reproduction
- **heretic** — Heretic/p-e-w reproduction
- **rdo** — Refusal Direction Optimization (ICML 2025)
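Because the skill invokes OBLITERATUS through a subprocess rather than importing it, one way to avoid an argparse rejection is to check the method name before building the command. The following is a minimal sketch, not part of OBLITERATUS itself: `build_obliterate_command` is a hypothetical helper, and passing the model ID as a positional argument to `obliteratus obliterate` is an assumption (only the `--method` flag and the method names come from the text above).

```python
# Method names taken from the two lists above.
CLI_METHODS = frozenset({
    "basic", "advanced", "aggressive", "spectral_cascade", "informed",
    "surgical", "optimized", "inverted", "nuclear",
})
PYTHON_API_ONLY_METHODS = frozenset({"failspy", "gabliteration", "heretic", "rdo"})


def build_obliterate_command(model: str, method: str) -> list[str]:
    """Return an argv list for the CLI, or raise if the method is not CLI-valid.

    Hypothetical helper for illustration; it only builds the command and
    never imports or runs OBLITERATUS.
    """
    if method in PYTHON_API_ONLY_METHODS:
        raise ValueError(
            f"{method!r} is only available via the Python API, not the CLI"
        )
    if method not in CLI_METHODS:
        raise ValueError(f"unknown method {method!r}")
    # Assumption: the model ID is a positional argument of `obliterate`.
    return ["obliteratus", "obliterate", model, "--method", method]


# Example: build the argv list, then hand it to subprocess.run(...) yourself.
command = build_obliterate_command("meta-llama/Llama-3.1-8B-Instruct", "advanced")
```

Returning an argv list instead of a shell string keeps the model ID from being interpreted by a shell when the list is later passed to `subprocess.run`.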
## Step 5: Run Abliteration
### Basic Usage
@@ -226,7 +232,7 @@ huggingface-cli upload your-username/model-name-abliterated ./abliterated-models
vllm serve ./abliterated-models/model-name --port 8000
```
## Analysis Modules (Pre-Abliteration, Optional)
## Analysis Modules (15 Modules, Pre-Abliteration, Optional)
For understanding refusal geometry before committing to abliteration.
@@ -275,9 +281,9 @@ obliteratus run my_study.yaml
## Telemetry Notice
- **Local installs**: Telemetry OFF by default. Opt-in via `OBLITERATUS_TELEMETRY=1`
- **HuggingFace Spaces**: Telemetry ON by default
- Collected: model ID, method, scores, hardware, timing (anonymous)
- **CLI usage (local installs)**: Telemetry is OFF by default. Must explicitly opt in via `OBLITERATUS_TELEMETRY=1` env var or `--contribute` flag.
- **HuggingFace Spaces**: Telemetry is ON by default (auto-enabled when `SPACE_ID` env var is detected).
- Collected: model ID, method, benchmark scores, hardware info, timing (anonymous)
- NOT collected: IP addresses, user identity, prompt content
- Force off: `export OBLITERATUS_TELEMETRY=0`
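When the CLI is launched from a subprocess, the telemetry setting can be pinned explicitly instead of relying on the defaults described above. This is a minimal sketch under the assumption that `OBLITERATUS_TELEMETRY` is read from the child process environment as the notice describes. `telemetry_env` is a hypothetical helper, not an OBLITERATUS function.

```python
import os
from typing import Mapping, Optional


def telemetry_env(
    contribute: bool = False,
    base: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Copy an environment and set OBLITERATUS_TELEMETRY explicitly.

    contribute=False writes "0" (force off); contribute=True writes "1"
    (opt in). The caller's environment is never mutated.
    """
    env = dict(os.environ if base is None else base)
    env["OBLITERATUS_TELEMETRY"] = "1" if contribute else "0"
    return env


# Example: an environment with telemetry forced off, suitable for
# subprocess.run([...], env=child_env).
child_env = telemetry_env()
```

Setting the variable on every launch makes the behavior the same on a local install and on a HuggingFace Space, where telemetry would otherwise be enabled automatically.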


@@ -1,8 +1,12 @@
# OBLITERATUS Analysis Modules — Reference
27 analysis modules for mechanistic interpretability of refusal in LLMs.
15 analysis modules for mechanistic interpretability of refusal in LLMs.
These help you understand HOW a model refuses before you decide to remove it.
> **Note:** The `analysis/` directory contains additional utility files (utils.py,
> visualization.py, etc.) and helper functions beyond the 15 core analysis modules
> listed below. The module count matches the README's "15 deep analysis modules."
## Core Analysis (Run These First)
### Alignment Imprint Detection


@@ -1,5 +1,10 @@
# OBLITERATUS Methods — Detailed Guide
> **Important:** The CLI (`obliteratus obliterate --method`) accepts 9 methods:
> basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized,
> inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo)
> are available only via the Python API and will be rejected by argparse if used on CLI.
## How Abliteration Works (Theory)
When a model is trained with RLHF/DPO/CAI, it learns to represent "should I refuse?"
@@ -84,11 +89,13 @@ The informed pipeline runs these analysis modules during abliteration:
**Best for:** When you want the model to be maximally helpful
**Warning:** Can make the model too eager; may reduce safety-adjacent reasoning
### failspy / gabliteration / heretic / rdo
### failspy / gabliteration / heretic / rdo (PYTHON API ONLY)
**Technique:** Faithful reproductions of prior community/academic work
**Speed:** Varies
**Quality:** Known baselines
**Best for:** Reproducing published results, comparing methods
**⚠️ NOT available via CLI** — these methods are only accessible via the Python API.
Do not use `--method failspy` etc. in CLI commands; argparse will reject them.
## Method Selection Flowchart