Broad drift audit against origin/main (b52b63396).
Reference pages (most user-visible drift):
- slash-commands: add /busy, /curator, /footer, /indicator, /redraw, /steer
that were missing; drop non-existent /terminal-setup; fix /q footnote
(resolves to /queue, not /quit); extend CLI-only list with all 24
CLI-only commands in the registry
- cli-commands: add dedicated sections for hermes curator / fallback /
hooks (new subcommands not previously documented); remove stale
hermes honcho standalone section (the plugin registers dynamically
via hermes memory); list curator/fallback/hooks in top-level table;
fix completion to include fish
- toolsets-reference: document the real 52-toolset count; split browser
vs browser-cdp; add discord / discord_admin / spotify / yuanbao;
correct hermes-cli tool count from 36 to 38; fix misleading claim
that hermes-homeassistant adds tools (it's identical to hermes-cli)
- tools-reference: bump tool count 55 -> 68; add 7 Spotify, 5 Yuanbao,
2 Discord toolsets; move browser_cdp/browser_dialog to their own
browser-cdp toolset section
- environment-variables: add 40+ user-facing HERMES_* vars that were
undocumented (--yolo, --accept-hooks, --ignore-*, inference model
override, agent/stream/checkpoint timeouts, OAuth trace, per-platform
batch tuning for Telegram/Discord/Matrix/Feishu/WeCom, cron knobs,
gateway restart/connect timeouts); dedupe the Cron Scheduler section;
replace stale QQ_SANDBOX with QQ_PORTAL_HOST
User-guide (top level):
- cli.md: compression preserves last 20 turns, not 4 (protect_last_n: 20)
- configuration.md: display.platforms is the canonical per-platform
override key; tool_progress_overrides is deprecated and auto-migrated
- profiles.md: model.default is the config key, not model.model
- sessions.md: CLI/TUI session IDs use 6-char hex, gateway uses 8
- checkpoints-and-rollback.md: destructive-command list now matches
_DESTRUCTIVE_PATTERNS (adds rmdir, cp, install, dd)
- docker.md: the container runs as non-root hermes (UID 10000) via
gosu; fix install command (uv pip); add missing --insecure on the
dashboard compose example (required for non-loopback bind)
- security.md: systemctl danger pattern also matches 'restart'
- index.md: built-in tool count 47 -> 68
- integrations/index.md: 6 STT providers, 8 memory providers
- integrations/providers.md: drop fictional dashscope/qwen aliases
Features:
- overview.md: 9 image models (not 8), 9 TTS providers (not 5),
8 memory providers (Supermemory was missing)
- tool-gateway.md: 9 image models
- tools.md: extend common-toolsets list with search / messaging /
spotify / discord / debugging / safe
- fallback-providers.md: add 6 real providers from PROVIDER_REGISTRY
(lmstudio, kimi-coding-cn, stepfun, alibaba-coding-plan,
tencent-tokenhub, azure-foundry)
- plugins.md: Available Hooks table now includes on_session_finalize,
on_session_reset, subagent_stop
- built-in-plugins.md: add the 7 bundled plugins the page didn't
mention (spotify, google_meet, three image_gen providers, two
dashboard examples)
- web-dashboard.md: add --insecure and --tui flags
- cron.md: hermes cron create takes positional schedule/prompt, not
flags
Messaging:
- telegram.md: TELEGRAM_WEBHOOK_SECRET is now REQUIRED when
TELEGRAM_WEBHOOK_URL is set (gateway refuses to start without it
per GHSA-3vpc-7q5r-276h). Biggest user-visible drift in the batch.
- discord.md: HERMES_DISCORD_TEXT_BATCH_SPLIT_DELAY_SECONDS default
is 2.0, not 0.1
- dingtalk.md: document DINGTALK_REQUIRE_MENTION /
FREE_RESPONSE_CHATS / MENTION_PATTERNS / HOME_CHANNEL /
ALLOW_ALL_USERS that the adapter supports
- bluebubbles.md: drop fictional BLUEBUBBLES_SEND_READ_RECEIPTS env
var; the setting lives in platforms.bluebubbles.extra only
- qqbot.md: drop dead QQ_SANDBOX; add real QQ_PORTAL_HOST and
QQ_GROUP_ALLOWED_USERS
- wecom-callback.md: replace 'hermes gateway start' (service-only)
with 'hermes gateway' for first-time setup
Developer-guide:
- architecture.md: refresh tool/toolset counts (61/52), terminal
backend count (7), line counts for run_agent.py (~13.7k), cli.py
(~11.5k), main.py (~10.4k), setup.py (~3.5k), gateway/run.py
(~12.2k), mcp_tool.py (~3.1k); add yuanbao adapter, bump platform
adapter count 18 -> 20
- agent-loop.md: run_agent.py line count 10.7k -> 13.7k
- tools-runtime.md: add vercel_sandbox backend
- adding-tools.md: remove stale 'Discovery import added to
model_tools.py' checklist item (registry auto-discovery)
- adding-platform-adapters.md: mark send_typing / get_chat_info as
concrete base methods; only connect/disconnect/send are abstract
- acp-internals.md: ACP sessions now persist to SessionDB
(~/.hermes/state.db); acp.run_agent call uses
use_unstable_protocol=True
- cron-internals.md: gateway runs scheduler in a dedicated background
thread via _start_cron_ticker, not on a maintenance cycle; locking
is cross-process via fcntl.flock (Unix) / msvcrt.locking (Windows)
- gateway-internals.md: gateway/run.py ~12k lines
- provider-runtime.md: cron DOES support fallback (run_job reads
fallback_providers from config)
- session-storage.md: SCHEMA_VERSION = 11 (not 9); add migrations
10 and 11 (trigram FTS, inline-mode FTS5 re-index); add
api_call_count column to Sessions DDL; document messages_fts_trigram
and state_meta in the architecture tree
- context-compression-and-caching.md: remove the obsolete 'context
pressure warnings' section (warnings were removed for causing
models to give up early)
- context-engine-plugin.md: compress() signature now includes
focus_topic param
- extending-the-cli.md: _build_tui_layout_children signature now
includes model_picker_widget; add to default layout
Also fixed three pre-existing broken links/anchors the build warned
about (docker.md -> api-server.md, yuanbao.md -> cron-jobs.md and
tips#background-tasks, nix-setup.md -> #container-aware-cli).
Regenerated per-skill pages via website/scripts/generate-skill-docs.py
so catalog tables and sidebar are consistent with current SKILL.md
frontmatter.
docusaurus build: clean, no broken links or anchors.
---
title: "Fine Tuning With Trl — TRL: SFT, DPO, PPO, GRPO, reward modeling for LLM RLHF"
sidebar_label: "Fine Tuning With Trl"
description: "TRL: SFT, DPO, PPO, GRPO, reward modeling for LLM RLHF"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Fine Tuning With Trl

TRL: SFT, DPO, PPO, GRPO, reward modeling for LLM RLHF.

## Skill metadata

| | |
|---|---|
| Source | Bundled (installed by default) |
| Path | `skills/mlops/training/trl-fine-tuning` |
| Version | `1.0.0` |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | `trl`, `transformers`, `datasets`, `peft`, `accelerate`, `torch` |
| Tags | `Post-Training`, `TRL`, `Reinforcement Learning`, `Fine-Tuning`, `SFT`, `DPO`, `PPO`, `GRPO`, `RLHF`, `Preference Alignment`, `HuggingFace` |

## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::
# TRL - Transformer Reinforcement Learning

## Quick start

TRL provides post-training methods for aligning language models with human preferences.

**Installation**:
```bash
pip install trl transformers datasets peft accelerate
```

**Supervised Fine-Tuning** (instruction tuning):
```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,  # Prompt-completion pairs
)
trainer.train()
```

**DPO** (align with preferences):
```python
from trl import DPOTrainer, DPOConfig

config = DPOConfig(output_dir="model-dpo", beta=0.1)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_dataset,  # chosen/rejected pairs
    processing_class=tokenizer
)
trainer.train()
```
## Common workflows

### Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)

Complete pipeline from base model to human-aligned model.

Copy this checklist:

```
RLHF Training:
- [ ] Step 1: Supervised fine-tuning (SFT)
- [ ] Step 2: Train reward model
- [ ] Step 3: PPO reinforcement learning
- [ ] Step 4: Evaluate aligned model
```
**Step 1: Supervised fine-tuning**

Train base model on instruction-following data:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Load instruction dataset
dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure training
training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=10,
    save_strategy="epoch"
)

# Train
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer
)
trainer.train()
trainer.save_model()
```
**Step 2: Train reward model**

Train model to predict human preferences:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

# Load SFT model as base
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen2.5-0.5B-SFT",
    num_labels=1  # Single reward score
)
tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-SFT")

# Load preference data (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Configure training
training_args = RewardConfig(
    output_dir="Qwen2.5-0.5B-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5
)

# Train reward model
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset
)
trainer.train()
trainer.save_model()
```
**Step 3: PPO reinforcement learning**

Optimize policy using reward model:

```bash
python -m trl.scripts.ppo \
    --model_name_or_path Qwen2.5-0.5B-SFT \
    --reward_model_path Qwen2.5-0.5B-Reward \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --output_dir Qwen2.5-0.5B-PPO \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 64 \
    --total_episodes 10000
```
**Step 4: Evaluate**

```python
from transformers import pipeline

# Load aligned model
generator = pipeline("text-generation", model="Qwen2.5-0.5B-PPO")

# Test
prompt = "Explain quantum computing to a 10-year-old"
output = generator(prompt, max_length=200)[0]["generated_text"]
print(output)
```
### Workflow 2: Simple preference alignment with DPO

Align model with preferences without reward model.

Copy this checklist:

```
DPO Training:
- [ ] Step 1: Prepare preference dataset
- [ ] Step 2: Configure DPO
- [ ] Step 3: Train with DPOTrainer
- [ ] Step 4: Evaluate alignment
```
**Step 1: Prepare preference dataset**

Dataset format:
```json
{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "I don't know."
}
```

Load dataset:
```python
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# Or load your own
# dataset = load_dataset("json", data_files="preferences.json")
```
**Step 2: Configure DPO**

```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="Qwen2.5-0.5B-DPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-7,
    beta=0.1,  # KL penalty strength
    max_prompt_length=512,
    max_length=1024,
    logging_steps=10
)
```
**Step 3: Train with DPOTrainer**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer
)

trainer.train()
trainer.save_model()
```
**CLI alternative**:
```bash
trl dpo \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name argilla/Capybara-Preferences \
    --output_dir Qwen2.5-0.5B-DPO \
    --per_device_train_batch_size 4 \
    --learning_rate 5e-7 \
    --beta 0.1
```
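**Step 4: Evaluate alignment**

The checklist above ends with an evaluation pass. A minimal sketch, mirroring the evaluation in Workflow 1 (the prompt is illustrative; compare outputs against the base model to gauge the effect of alignment):

```python
from transformers import pipeline

# Load the DPO-aligned model saved in Step 3
generator = pipeline("text-generation", model="Qwen2.5-0.5B-DPO")

prompt = "What is the capital of France?"
print(generator(prompt, max_new_tokens=100)[0]["generated_text"])
```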
### Workflow 3: Memory-efficient online RL with GRPO

Train with reinforcement learning using minimal memory.

For in-depth GRPO guidance — reward function design, critical training insights (loss behavior, mode collapse, tuning), and advanced multi-stage patterns — see **[references/grpo-training.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/training/trl-fine-tuning/references/grpo-training.md)**. A production-ready training script is in **[templates/basic_grpo_training.py](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/training/trl-fine-tuning/templates/basic_grpo_training.py)**.

Copy this checklist:

```
GRPO Training:
- [ ] Step 1: Define reward function
- [ ] Step 2: Configure GRPO
- [ ] Step 3: Train with GRPOTrainer
```
**Step 1: Define reward function**

```python
def reward_function(completions, **kwargs):
    """
    Compute rewards for completions.

    Args:
        completions: List of generated texts

    Returns:
        List of reward scores (floats)
    """
    rewards = []
    for completion in completions:
        # Example: reward based on length and unique words
        score = len(completion.split())  # Favor longer responses
        score += len(set(completion.lower().split()))  # Reward unique words
        rewards.append(score)
    return rewards
```

Or use a reward model:
```python
from transformers import pipeline

reward_model = pipeline("text-classification", model="reward-model-path")

def reward_from_model(completions, prompts, **kwargs):
    # Combine prompt + completion
    full_texts = [p + c for p, c in zip(prompts, completions)]
    # Get reward scores
    results = reward_model(full_texts)
    return [r["score"] for r in results]
```
**Step 2: Configure GRPO**

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2-GRPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=1e-5,
    num_generations=4,  # Generate 4 completions per prompt
    max_completion_length=128
)
```
**Step 3: Train with GRPOTrainer**

```python
from datasets import load_dataset
from trl import GRPOTrainer

# Load prompt-only dataset
dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_function,  # Your reward function
    args=config,
    train_dataset=dataset
)

trainer.train()
```
**CLI**:
```bash
trl grpo \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name trl-lib/tldr \
    --output_dir Qwen2-GRPO \
    --num_generations 4
```
## When to use vs alternatives

**Use TRL when:**
- Need to align model with human preferences
- Have preference data (chosen/rejected pairs)
- Want to use reinforcement learning (PPO, GRPO)
- Need reward model training
- Doing RLHF (full pipeline)

**Method selection**:
- **SFT**: Have prompt-completion pairs, want basic instruction following
- **DPO**: Have preferences, want simple alignment (no reward model needed)
- **PPO**: Have reward model, need maximum control over RL
- **GRPO**: Memory-constrained, want online RL
- **Reward Model**: Building RLHF pipeline, need to score generations

**Use alternatives instead:**
- **HuggingFace Trainer**: Basic fine-tuning without RL
- **Axolotl**: YAML-based training configuration
- **LitGPT**: Educational, minimal fine-tuning
- **Unsloth**: Fast LoRA training
## Common issues

**Issue: OOM during DPO training**

Reduce batch size and sequence length:
```python
config = DPOConfig(
    per_device_train_batch_size=1,  # Reduce from 4
    max_length=512,                 # Reduce from 1024
    gradient_accumulation_steps=8   # Maintain effective batch
)
```

Or use gradient checkpointing:
```python
model.gradient_checkpointing_enable()
```

**Issue: Poor alignment quality**

Tune beta parameter:
```python
# Higher beta = more conservative (stays closer to reference)
config = DPOConfig(beta=0.5)  # Default 0.1

# Lower beta = more aggressive alignment
config = DPOConfig(beta=0.01)
```

**Issue: Reward model not learning**

Check loss type and learning rate:
```python
config = RewardConfig(
    learning_rate=1e-5,  # Try different LR
    num_train_epochs=3   # Train longer
)
```

Ensure preference dataset has clear winners:
```python
# Verify dataset
print(dataset[0])
# Should have clear chosen > rejected
```

**Issue: PPO training unstable**

Adjust KL coefficient:
```python
config = PPOConfig(
    kl_coef=0.1,    # Increase from 0.05
    cliprange=0.1   # Reduce from 0.2
)
```
## Advanced topics

**SFT training guide**: See [references/sft-training.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/training/trl-fine-tuning/references/sft-training.md) for dataset formats, chat templates, packing strategies, and multi-GPU training.

**DPO variants**: See [references/dpo-variants.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/training/trl-fine-tuning/references/dpo-variants.md) for IPO, cDPO, RPO, and other DPO loss functions with recommended hyperparameters.

**Reward modeling**: See [references/reward-modeling.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/training/trl-fine-tuning/references/reward-modeling.md) for outcome vs process rewards, Bradley-Terry loss, and reward model evaluation.

**Online RL methods**: See [references/online-rl.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/training/trl-fine-tuning/references/online-rl.md) for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.

**GRPO deep dive**: See [references/grpo-training.md](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/training/trl-fine-tuning/references/grpo-training.md) for expert-level GRPO patterns — reward function design philosophy, training insights (why loss increases, mode collapse detection), hyperparameter tuning, multi-stage training, and troubleshooting. Production-ready template in [templates/basic_grpo_training.py](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/training/trl-fine-tuning/templates/basic_grpo_training.py).
## Hardware requirements

- **GPU**: NVIDIA (CUDA required)
- **VRAM**: Depends on model and method
  - SFT 7B: 16GB (with LoRA)
  - DPO 7B: 24GB (stores reference model)
  - PPO 7B: 40GB (policy + reward model)
  - GRPO 7B: 24GB (more memory efficient)
- **Multi-GPU**: Supported via `accelerate`
- **Mixed precision**: BF16 recommended (A100/H100)

**Memory optimization** (combined in the sketch below):
- Use LoRA/QLoRA for all methods
- Enable gradient checkpointing
- Use smaller batch sizes with gradient accumulation
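A minimal sketch of an SFT run combining these optimizations (assumes `peft` is installed; the model, LoRA rank, and batch sizes are illustrative, not recommendations):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT-LoRA",
    per_device_train_batch_size=1,   # small per-device batch
    gradient_accumulation_steps=8,   # keep effective batch size at 8
    gradient_checkpointing=True,     # recompute activations to save VRAM
    bf16=True,                       # mixed precision on A100/H100
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # train LoRA adapters only
)
trainer.train()
```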
## Resources

- Docs: https://huggingface.co/docs/trl/
- GitHub: https://github.com/huggingface/trl
- Papers:
  - "Training language models to follow instructions with human feedback" (InstructGPT, 2022)
  - "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (DPO, 2023)
  - "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (introduces GRPO, 2024)
- Examples: https://github.com/huggingface/trl/tree/main/examples/scripts