Generates a full dedicated Docusaurus page for every one of the 132 skills
(73 bundled + 59 optional) under website/docs/user-guide/skills/{bundled,optional}/<category>/.
Each page carries the skill's description, metadata (version, author, license,
dependencies, platform gating, tags, related skills cross-linked to their own
pages), and the complete SKILL.md body that Hermes loads at runtime.
Previously the two catalog pages just listed skills with a one-line blurb and
no way to see what the skill actually did — users had to go read the source
repo. Now every skill has a browsable, searchable, cross-linked reference in
the docs.
- website/scripts/generate-skill-docs.py — generator that reads skills/ and
optional-skills/, writes per-skill pages, regenerates both catalog indexes,
and rewrites the Skills section of sidebars.ts. Handles MDX escaping
(outside fenced code blocks: curly braces, unsafe HTML-ish tags; a sketch of
this escaping step follows the list) and rewrites relative references/*.md
links to point at the GitHub source.
- website/docs/reference/skills-catalog.md — regenerated; each row links to
the new dedicated page.
- website/docs/reference/optional-skills-catalog.md — same.
- website/sidebars.ts — Skills section now has Bundled / Optional subtrees
with one nested category per skill folder.
- .github/workflows/{docs-site-checks,deploy-site}.yml — run the generator
before docusaurus build so CI stays in sync with the source SKILL.md files.
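For reference, the escaping step works roughly like this (a simplified sketch, not the actual code in generate-skill-docs.py; the entity replacements and fence handling shown here are assumptions):

```python
def escape_mdx(markdown: str) -> str:
    """Escape curly braces outside fenced code blocks so MDX won't parse them as JSX."""
    out, in_fence = [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence  # toggle fence state; fenced lines pass through untouched
            out.append(line)
            continue
        if not in_fence:
            # Outside code fences, raw braces would be treated as JSX expressions by MDX.
            line = line.replace("{", "&#123;").replace("}", "&#125;")
        out.append(line)
    return "\n".join(out)
```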
Build verified locally with `npx docusaurus build`. Only remaining warnings
are pre-existing broken link/anchor issues in unrelated pages.
---
title: "Clip — OpenAI's model connecting vision and language"
sidebar_label: "Clip"
description: "OpenAI's model connecting vision and language"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Clip

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
## Skill metadata

| | |
|---|---|
| Source | Optional — install with `hermes skills install official/mlops/clip` |
| Path | `optional-skills/mlops/clip` |
| Version | `1.0.0` |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | `transformers`, `torch`, `pillow` |
| Tags | `Multimodal`, `CLIP`, `Vision-Language`, `Zero-Shot`, `Image Classification`, `OpenAI`, `Image Search`, `Cross-Modal Retrieval`, `Content Moderation` |
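The dependency list (`transformers`, `torch`, `pillow`) suggests the skill can drive CLIP through the Hugging Face port rather than the `clip` package used in the reference below. As a rough sketch only (the checkpoint name `openai/clip-vit-base-patch32` and the file `photo.jpg` are assumptions), zero-shot classification with that stack would look like:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the Hugging Face port of CLIP (checkpoint name is an assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a dog", "a cat", "a bird", "a car"]
inputs = processor(text=labels, images=Image.open("photo.jpg"),
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_labels); softmax gives per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.2%}")
```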
## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

# CLIP - Contrastive Language-Image Pre-Training

OpenAI's model that understands images from natural language.

## When to use CLIP

**Use when:**
- Zero-shot image classification (no training data needed)
- Image-text similarity/matching
- Semantic image search
- Content moderation (detect NSFW, violence)
- Visual question answering
- Cross-modal retrieval (image→text, text→image)

**Metrics**:
- **25,300+ GitHub stars**
- Trained on 400M image-text pairs
- Matches ResNet-50 on ImageNet (zero-shot)
- MIT License

**Use alternatives instead**:
- **BLIP-2**: Better captioning
- **LLaVA**: Vision-language chat
- **Segment Anything**: Image segmentation
## Quick start

### Installation

```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```

### Zero-shot classification

```python
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
labels = ["a dog", "a cat", "a bird", "a car"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```
## Available models

```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")
```

| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |
## Image-text similarity

```python
# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)

# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity (.item() assumes a single image and a single text prompt;
# with several prompts, index into the similarity matrix instead)
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
```
## Semantic image search

```python
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
    embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)

image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)

for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
```
## Content moderation

```python
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content"
]

text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()

print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
```
## Batch processing

```python
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
```
## Integration with vector databases

```python
# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings (one flat vector per image)
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path]
    )

# Query with text
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
results = collection.query(
    query_embeddings=[text_embedding[0].cpu().numpy().tolist()],
    n_results=5
)
```
## Best practices

1. **Use ViT-B/32 for most cases** - Good balance
2. **Normalize embeddings** - Required for cosine similarity
3. **Batch processing** - More efficient
4. **Cache embeddings** - Expensive to recompute
5. **Use descriptive labels** - Better zero-shot performance (see the sketch after this list)
6. **GPU recommended** - 10-50× faster
7. **Preprocess images** - Use provided preprocess function
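As an illustration of practice 5, descriptive prompt templates usually beat bare class names for zero-shot accuracy. This is a sketch that reuses `model`, `image`, and `device` from the quick start above; the "a photo of a ..." template is a common default and should be tuned to your domain:

```python
# Wrap bare class names in a descriptive template before tokenizing.
labels = ["dog", "cat", "bird", "car"]
prompts = [f"a photo of a {label}" for label in labels]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.2%}")
```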
## Performance

| Operation | CPU | GPU (V100) |
|-----------|-----|------------|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |

## Limitations

1. **Not for fine-grained tasks** - Best for broad categories
2. **Requires descriptive text** - Vague labels perform poorly
3. **Biased on web data** - May have dataset biases
4. **No bounding boxes** - Whole image only
5. **Limited spatial understanding** - Position/counting weak

## Resources

- **GitHub**: https://github.com/openai/CLIP ⭐ 25,300+
- **Paper**: https://arxiv.org/abs/2103.00020
- **Colab**: https://colab.research.google.com/github/openai/clip/
- **License**: MIT