mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-28 06:51:16 +08:00
fix: correct method count and analysis module count per creator review
Fixes based on feedback from OBLITERATUS creator:

- CLI only accepts 9 methods (basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized, inverted, nuclear). The 4 reproduction methods (failspy, gabliteration, heretic, rdo) are Python-API-only and will be rejected by argparse. Separated into 'CLI Methods' and 'Python-API-Only Methods' sections with clear warnings.
- Analysis module count corrected from 27 to 15, matching the README. The analysis/ directory has 24+ .py files but includes utilities, visualization helpers, and __init__.py beyond the 15 core modules.
- Description broadened from 'SVD-based weight projection' to 'mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, SAE decomposition, etc.)' to better represent the method diversity.
- Telemetry notice clarified: CLI defaults to OFF, opt-in via OBLITERATUS_TELEMETRY=1 or --contribute flag.
@@ -1,6 +1,6 @@
 ---
 name: obliteratus
-description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — SVD-based weight projection that excises guardrails while preserving reasoning. Supports 13 methods, 27 analysis modules, 116 model presets across 5 compute tiers. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
+description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods (+ 4 Python-API-only), 15 analysis modules, 116 model presets across 5 compute tiers. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
 version: 1.0.0
 author: Hermes Agent
 license: MIT
@@ -13,7 +13,7 @@ metadata:
 
 # OBLITERATUS Skill
 
-Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses SVD-based weight projection to surgically excise refusal directions from model weights while preserving reasoning capabilities.
+Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.
 
 **License warning:** OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (`obliteratus` command) or subprocess. This keeps Hermes Agent's MIT license clean.
 
@@ -100,25 +100,31 @@ obliteratus info meta-llama/Llama-3.1-8B-Instruct
 | Reasoning model (R1 distills) | `surgical` | CoT-aware, preserves chain-of-thought |
 | Stubborn refusals persist | `aggressive` | Whitened SVD + head surgery + jailbreak |
 | Want reversible changes | Use steering vectors (see Analysis section) |
-| Reproducing prior work | `failspy`, `gabliteration`, `heretic`, `rdo` |
 | Maximum quality, time no object | `optimized` | Bayesian search for best parameters |
 
-### All 13 Methods
+### 9 CLI Methods
+
+These can be passed to `--method` on the command line:
 
 - **basic** — Single refusal direction via diff-in-means. Fastest, simplest. (Arditi et al. 2024)
-- **failspy** — FailSpy/abliterator reproduction
-- **gabliteration** — Gabliteration reproduction
-- **heretic** — Heretic/p-e-w reproduction
-- **rdo** — Refusal Direction Optimization (ICML 2025)
 - **advanced** — Multiple SVD directions, norm-preserving projection. Good default.
-- **inverted** — Flips the refusal direction (model becomes eager to help, not just neutral)
 - **aggressive** — Whitened SVD + jailbreak contrast + attention head surgery
 - **spectral_cascade** — DCT frequency-domain decomposition
 - **informed** — Runs analysis DURING abliteration to auto-configure. Detects DPO/RLHF/CAI, maps refusal geometry, compensates for self-repair. Best quality.
 - **surgical** — SAE features + neuron masking + head surgery + per-expert. Maximum precision.
 - **optimized** — Bayesian hyperparameter search (Optuna TPE). Slowest but optimal.
+- **inverted** — Flips the refusal direction (model becomes eager to help, not just neutral)
 - **nuclear** — Maximum force combo for stubborn MoE models.
 
+### 4 Python-API-Only Methods
+
+These reproduce prior community/academic work but are NOT available via CLI — only via the Python API (`from obliteratus.abliterate import AbliterationPipeline`). **Do not use these in CLI commands.**
+
+- **failspy** — FailSpy/abliterator reproduction
+- **gabliteration** — Gabliteration reproduction
+- **heretic** — Heretic/p-e-w reproduction
+- **rdo** — Refusal Direction Optimization (ICML 2025)
+
 ## Step 5: Run Abliteration
 
 ### Basic Usage
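The split between CLI methods and Python-API-only methods, combined with the license warning (invoke via the CLI or a subprocess, never by importing the library), suggests validating the method name before building a command. The following is a minimal sketch under stated assumptions: `build_obliterate_command` is a hypothetical helper, not part of OBLITERATUS, and passing the model ID as a positional argument is an assumption modeled on the `obliteratus info <model>` call in the hunk header. Only the `obliterate` subcommand, the `--method` flag, and the method names come from this commit.

```python
import shlex

# Method names copied from the lists in this commit.
CLI_METHODS = (
    "basic", "advanced", "aggressive", "spectral_cascade", "informed",
    "surgical", "optimized", "inverted", "nuclear",
)
PYTHON_API_ONLY_METHODS = ("failspy", "gabliteration", "heretic", "rdo")


def build_obliterate_command(model: str, method: str) -> list[str]:
    """Return an argv list for the CLI, or raise before anything is executed.

    Hypothetical helper: it only assembles arguments and never imports
    the AGPL-licensed package.
    """
    if method in PYTHON_API_ONLY_METHODS:
        raise ValueError(
            f"{method!r} is Python-API-only and is not accepted by the CLI"
        )
    if method not in CLI_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    # Assumption: the model ID is positional, as in `obliteratus info <model>`.
    return ["obliteratus", "obliterate", model, "--method", method]


if __name__ == "__main__":
    cmd = build_obliterate_command("meta-llama/Llama-3.1-8B-Instruct", "informed")
    # Print the command instead of running it; a caller would pass `cmd`
    # to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building an argv list rather than a shell string avoids quoting problems with model IDs, and rejecting the four reproduction methods up front gives a clearer error than a parser failure in the child process.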
@@ -226,7 +232,7 @@ huggingface-cli upload your-username/model-name-abliterated ./abliterated-models
 vllm serve ./abliterated-models/model-name --port 8000
 ```
 
-## Analysis Modules (Pre-Abliteration, Optional)
+## Analysis Modules (15 Modules, Pre-Abliteration, Optional)
 
 For understanding refusal geometry before committing to abliteration.
 
@@ -275,9 +281,9 @@ obliteratus run my_study.yaml
 
 ## Telemetry Notice
 
-- **Local installs**: Telemetry OFF by default. Opt-in via `OBLITERATUS_TELEMETRY=1`
-- **HuggingFace Spaces**: Telemetry ON by default
-- Collected: model ID, method, scores, hardware, timing (anonymous)
+- **CLI usage (local installs)**: Telemetry is OFF by default. Must explicitly opt in via `OBLITERATUS_TELEMETRY=1` env var or `--contribute` flag.
+- **HuggingFace Spaces**: Telemetry is ON by default (auto-enabled when `SPACE_ID` env var is detected).
+- Collected: model ID, method, benchmark scores, hardware info, timing (anonymous)
 - NOT collected: IP addresses, user identity, prompt content
 - Force off: `export OBLITERATUS_TELEMETRY=0`
 
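The telemetry rules in the revised notice can be read as a small decision function. The sketch below is an illustration of those rules only, not the project's implementation: `telemetry_enabled` is a hypothetical name, and the precedence order (an explicit `OBLITERATUS_TELEMETRY=0` overriding both the `--contribute` flag and `SPACE_ID` detection) is an assumption inferred from the "Force off" bullet.

```python
from typing import Mapping


def telemetry_enabled(env: Mapping[str, str], contribute_flag: bool = False) -> bool:
    """Hypothetical reading of the telemetry notice; precedence is assumed."""
    setting = env.get("OBLITERATUS_TELEMETRY")
    if setting == "0":
        # "Force off" is assumed to win over every other signal.
        return False
    if setting == "1" or contribute_flag:
        # Explicit opt-in via env var or the --contribute flag.
        return True
    # Otherwise ON only when running on HuggingFace Spaces (SPACE_ID set).
    return "SPACE_ID" in env


if __name__ == "__main__":
    import os

    print("telemetry would be enabled:", telemetry_enabled(os.environ))
```

Taking the environment as a parameter rather than reading `os.environ` inside the function keeps the rules easy to check against the bullets one case at a time.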
@@ -1,8 +1,12 @@
 # OBLITERATUS Analysis Modules — Reference
 
-27 analysis modules for mechanistic interpretability of refusal in LLMs.
+15 analysis modules for mechanistic interpretability of refusal in LLMs.
 These help you understand HOW a model refuses before you decide to remove it.
 
+> **Note:** The `analysis/` directory contains additional utility files (utils.py,
+> visualization.py, etc.) and helper functions beyond the 15 core analysis modules
+> listed below. The module count matches the README's "15 deep analysis modules."
+
 ## Core Analysis (Run These First)
 
 ### Alignment Imprint Detection
@@ -1,5 +1,10 @@
 # OBLITERATUS Methods — Detailed Guide
 
+> **Important:** The CLI (`obliteratus obliterate --method`) accepts 9 methods:
+> basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized,
+> inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo)
+> are available only via the Python API and will be rejected by argparse if used on CLI.
+
 ## How Abliteration Works (Theory)
 
 When a model is trained with RLHF/DPO/CAI, it learns to represent "should I refuse?"
@@ -84,11 +89,13 @@ The informed pipeline runs these analysis modules during abliteration:
 **Best for:** When you want the model to be maximally helpful
 **Warning:** Can make the model too eager; may reduce safety-adjacent reasoning
 
-### failspy / gabliteration / heretic / rdo
+### failspy / gabliteration / heretic / rdo (PYTHON API ONLY)
 **Technique:** Faithful reproductions of prior community/academic work
 **Speed:** Varies
 **Quality:** Known baselines
 **Best for:** Reproducing published results, comparing methods
+**⚠️ NOT available via CLI** — these methods are only accessible via the Python API.
+Do not use `--method failspy` etc. in CLI commands; argparse will reject them.
 
 ## Method Selection Flowchart
 
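The statement that argparse will reject the four reproduction methods matches how argparse handles a `choices` restriction in general. The sketch below shows that standard-library behavior with a stand-in parser. It is not OBLITERATUS's actual parser: the `prog` name and the helper `is_accepted` are hypothetical, and only the method names are taken from this commit.

```python
import argparse

# The 9 method names the commit lists as CLI-accepted.
CLI_METHODS = [
    "basic", "advanced", "aggressive", "spectral_cascade", "informed",
    "surgical", "optimized", "inverted", "nuclear",
]


def make_parser() -> argparse.ArgumentParser:
    # Stand-in parser for illustration; not the real `obliteratus` CLI.
    parser = argparse.ArgumentParser(prog="obliteratus-sketch")
    parser.add_argument("--method", choices=CLI_METHODS)
    return parser


def is_accepted(method: str) -> bool:
    """Return True if a parser restricted with `choices` accepts the method."""
    try:
        make_parser().parse_args(["--method", method])
    except SystemExit:
        # argparse prints a usage error to stderr and exits on an invalid choice.
        return False
    return True


if __name__ == "__main__":
    for name in ("informed", "failspy"):
        print(name, "accepted" if is_accepted(name) else "rejected")
```

Because the rejection happens at argument-parsing time, a command that names a Python-API-only method fails before any model is loaded.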