diff --git a/skills/ocr-and-documents/DESCRIPTION.md b/skills/ocr-and-documents/DESCRIPTION.md
new file mode 100644
index 0000000000..b74c8a0c68
--- /dev/null
+++ b/skills/ocr-and-documents/DESCRIPTION.md
@@ -0,0 +1,3 @@
+---
+description: Skills for extracting text from PDFs, scanned documents, images, and other file formats using OCR and document parsing tools.
+---
diff --git a/skills/ocr-and-documents/SKILL.md b/skills/ocr-and-documents/SKILL.md
new file mode 100644
index 0000000000..cbbc07aadf
--- /dev/null
+++ b/skills/ocr-and-documents/SKILL.md
@@ -0,0 +1,133 @@
+---
+name: ocr-and-documents
+description: Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.
+version: 2.3.0
+author: Hermes Agent
+license: MIT
+metadata:
+  hermes:
+    tags: [PDF, Documents, Research, Arxiv, Text-Extraction, OCR]
+    related_skills: [powerpoint]
+---
+
+# PDF & Document Extraction
+
+For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
+For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
+This skill covers **PDFs and scanned documents**.
+
+## Step 1: Remote URL Available?
+
+If the document has a URL, **always try `web_extract` first**:
+
+```
+web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
+web_extract(urls=["https://example.com/report.pdf"])
+```
+
+This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.
+
+Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
+
+## Step 2: Choose Local Extractor
+
+| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
+|---------|-----------------|---------------------|
+| **Text-based PDF** | ✅ | ✅ |
+| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
+| **Tables** | ✅ (basic) | ✅ (high accuracy) |
+| **Equations / LaTeX** | ❌ | ✅ |
+| **Code blocks** | ❌ | ✅ |
+| **Forms** | ❌ | ✅ |
+| **Headers/footers removal** | ❌ | ✅ |
+| **Reading order detection** | ❌ | ✅ |
+| **Images extraction** | ✅ (embedded) | ✅ (with context) |
+| **Images → text (OCR)** | ❌ | ✅ |
+| **EPUB** | ✅ | ✅ |
+| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
+| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
+| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
+
+**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
+
+If the user needs marker capabilities but the system lacks ~5GB free disk:
+> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
+
+---
+
+## pymupdf (lightweight)
+
+```bash
+pip install pymupdf pymupdf4llm
+```
+
+**Via helper script**:
+```bash
+python scripts/extract_pymupdf.py document.pdf              # Plain text
+python scripts/extract_pymupdf.py document.pdf --markdown   # Markdown
+python scripts/extract_pymupdf.py document.pdf --tables     # Tables
+python scripts/extract_pymupdf.py document.pdf --images out/  # Extract images
+python scripts/extract_pymupdf.py document.pdf --metadata   # Title, author, pages
+python scripts/extract_pymupdf.py document.pdf --pages 0-4  # Specific pages
+```
+
+**Inline**:
+```bash
+python3 -c "
+import pymupdf
+doc = pymupdf.open('document.pdf')
+for page in doc:
+    print(page.get_text())
+"
+```
+
+---
+
+## marker-pdf (high-quality OCR)
+
+```bash
+# Check disk space first
+python scripts/extract_marker.py --check
+
+pip install marker-pdf
+```
+
+**Via helper script**:
+```bash
+python scripts/extract_marker.py document.pdf                    # Markdown
+python scripts/extract_marker.py document.pdf --json             # JSON with metadata
+python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
+python scripts/extract_marker.py scanned.pdf                     # Scanned PDF (OCR)
+python scripts/extract_marker.py document.pdf --use_llm          # LLM-boosted accuracy
+```
+
+**CLI** (installed with marker-pdf):
+```bash
+marker_single document.pdf --output_dir ./output
+marker /path/to/folder --workers 4    # Batch
+```
+
+---
+
+## Arxiv Papers
+
+```
+# Abstract only (fast)
+web_extract(urls=["https://arxiv.org/abs/2402.03300"])
+
+# Full paper
+web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
+
+# Search
+web_search(query="arxiv GRPO reinforcement learning 2026")
+```
+
+## Notes
+
+- `web_extract` is always first choice for URLs
+- pymupdf is the safe default — instant, no models, works everywhere
+- marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
+- Both helper scripts accept `--help` for full usage
+- marker-pdf downloads ~2.5GB of models to 
`~/.cache/huggingface/` on first use
+- For Word docs: `pip install python-docx` (better than OCR — parses actual structure)
+- For PowerPoint: see the `powerpoint` skill (uses python-pptx)
diff --git a/skills/ocr-and-documents/scripts/extract_marker.py b/skills/ocr-and-documents/scripts/extract_marker.py
new file mode 100644
index 0000000000..4f301aac7b
--- /dev/null
+++ b/skills/ocr-and-documents/scripts/extract_marker.py
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+"""Extract text from documents using marker-pdf. High-quality OCR + layout analysis.
+
+Requires ~3-5GB disk (PyTorch + models downloaded on first use).
+Supports: PDF, DOCX, PPTX, XLSX, HTML, EPUB, images.
+
+Usage:
+    python extract_marker.py document.pdf
+    python extract_marker.py document.pdf --output_dir ./output
+    python extract_marker.py presentation.pptx
+    python extract_marker.py spreadsheet.xlsx
+    python extract_marker.py scanned_doc.pdf  # OCR works here
+    python extract_marker.py document.pdf --json  # Structured output
+    python extract_marker.py document.pdf --use_llm  # LLM-boosted accuracy
+"""
+import sys
+import os
+
+def convert(path, output_dir=None, output_format="markdown", use_llm=False):
+    from marker.converters.pdf import PdfConverter
+    from marker.models import create_model_dict
+    from marker.config.parser import ConfigParser
+
+    config_dict = {}
+    if use_llm:
+        config_dict["use_llm"] = True
+
+    config_parser = ConfigParser(config_dict)
+    models = create_model_dict()
+    converter = PdfConverter(config=config_parser.generate_config_dict(), artifact_dict=models)
+    rendered = converter(path)
+
+    if output_format == "json":
+        import json
+        print(json.dumps({
+            "markdown": rendered.markdown,
+            "metadata": rendered.metadata if hasattr(rendered, "metadata") else {},
+        }, indent=2, ensure_ascii=False))
+    else:
+        print(rendered.markdown)
+
+    # Save images if output_dir specified
+    if output_dir and hasattr(rendered, "images") and rendered.images:
+        from pathlib import Path
+        
Path(output_dir).mkdir(parents=True, exist_ok=True)
+        for name, img_data in rendered.images.items():
+            img_path = os.path.join(output_dir, name)
+            # Assumption: marker yields PIL.Image objects here (not bytes), so save via PIL
+            img_data.save(img_path)
+        print(f"\nSaved {len(rendered.images)} image(s) to {output_dir}/", file=sys.stderr)
+
+
+def check_requirements():
+    """Check disk space before installing."""
+    import shutil
+    free_gb = shutil.disk_usage("/").free / (1024**3)
+    if free_gb < 5:
+        print(f"⚠️ Only {free_gb:.1f}GB free. marker-pdf needs ~5GB for PyTorch + models.")
+        print("Use pymupdf instead (scripts/extract_pymupdf.py) or free up disk space.")
+        sys.exit(1)
+    print(f"✓ {free_gb:.1f}GB free — sufficient for marker-pdf")
+
+
+if __name__ == "__main__":
+    args = sys.argv[1:]
+    if not args or args[0] in ("-h", "--help"):
+        print(__doc__)
+        sys.exit(0)
+
+    if args[0] == "--check":
+        check_requirements()
+        sys.exit(0)
+
+    path = args[0]
+    output_dir = None
+    output_format = "markdown"
+    use_llm = False
+
+    if "--output_dir" in args:
+        idx = args.index("--output_dir")
+        output_dir = args[idx + 1]
+    if "--json" in args:
+        output_format = "json"
+    if "--use_llm" in args:
+        use_llm = True
+
+    convert(path, output_dir=output_dir, output_format=output_format, use_llm=use_llm)
diff --git a/skills/ocr-and-documents/scripts/extract_pymupdf.py b/skills/ocr-and-documents/scripts/extract_pymupdf.py
new file mode 100644
index 0000000000..22063e7348
--- /dev/null
+++ b/skills/ocr-and-documents/scripts/extract_pymupdf.py
@@ -0,0 +1,98 @@
+#!/usr/bin/env python3
+"""Extract text from documents using pymupdf. Lightweight (~25MB), no models.
+
+Usage:
+    python extract_pymupdf.py document.pdf
+    python extract_pymupdf.py document.pdf --markdown
+    python extract_pymupdf.py document.pdf --pages 0-4
+    python extract_pymupdf.py document.pdf --images output_dir/
+    python extract_pymupdf.py document.pdf --tables
+    python extract_pymupdf.py document.pdf --metadata
+"""
+import sys
+import json
+
+def extract_text(path, pages=None):
+    import pymupdf
+    doc = pymupdf.open(path)
+    page_range = range(len(doc)) if pages is None else pages
+    for i in page_range:
+        if i < len(doc):
+            print(f"\n--- Page {i+1}/{len(doc)} ---\n")
+            print(doc[i].get_text())
+
+def extract_markdown(path, pages=None):
+    import pymupdf4llm
+    md = pymupdf4llm.to_markdown(path, pages=pages)
+    print(md)
+
+def extract_tables(path):
+    import pymupdf
+    doc = pymupdf.open(path)
+    for i, page in enumerate(doc):
+        tables = page.find_tables()
+        for j, table in enumerate(tables.tables):
+            print(f"\n--- Page {i+1}, Table {j+1} ---\n")
+            df = table.to_pandas()
+            print(df.to_markdown(index=False))
+
+def extract_images(path, output_dir):
+    import pymupdf
+    from pathlib import Path
+    Path(output_dir).mkdir(parents=True, exist_ok=True)
+    doc = pymupdf.open(path)
+    count = 0
+    for i, page in enumerate(doc):
+        for img_idx, img in enumerate(page.get_images(full=True)):
+            xref = img[0]
+            pix = pymupdf.Pixmap(doc, xref)
+            if pix.n - pix.alpha >= 4:  # CMYK, with or without alpha: convert to RGB before saving as PNG
+                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
+            out_path = f"{output_dir}/page{i+1}_img{img_idx+1}.png"
+            pix.save(out_path)
+            count += 1
+    print(f"Extracted {count} images to {output_dir}/")
+
+def show_metadata(path):
+    import pymupdf
+    doc = pymupdf.open(path)
+    print(json.dumps({
+        "pages": len(doc),
+        "title": doc.metadata.get("title", ""),
+        "author": doc.metadata.get("author", ""),
+        "subject": doc.metadata.get("subject", ""),
+        "creator": doc.metadata.get("creator", ""),
+        "producer": doc.metadata.get("producer", ""),
+        "format": doc.metadata.get("format", ""),
+    }, indent=2))
+
+if __name__ == "__main__":
+    args = 
sys.argv[1:]
+    if not args or args[0] in ("-h", "--help"):
+        print(__doc__)
+        sys.exit(0)
+
+    path = args[0]
+    pages = None
+
+    if "--pages" in args:
+        idx = args.index("--pages")
+        p = args[idx + 1]
+        if "-" in p:
+            start, end = p.split("-")
+            pages = list(range(int(start), int(end) + 1))
+        else:
+            pages = [int(p)]
+
+    if "--metadata" in args:
+        show_metadata(path)
+    elif "--tables" in args:
+        extract_tables(path)
+    elif "--images" in args:
+        idx = args.index("--images")
+        output_dir = args[idx + 1] if idx + 1 < len(args) else "./images"
+        extract_images(path, output_dir)
+    elif "--markdown" in args:
+        extract_markdown(path, pages=pages)
+    else:
+        extract_text(path, pages=pages)
diff --git a/tools/web_tools.py b/tools/web_tools.py
index 4e0772166b..0e5baaa29e 100644
--- a/tools/web_tools.py
+++ b/tools/web_tools.py
@@ -1240,7 +1240,7 @@ WEB_SEARCH_SCHEMA = {
 
 WEB_EXTRACT_SCHEMA = {
     "name": "web_extract",
-    "description": "Extract content from web page URLs. Returns page content in markdown format. Pages under 5000 chars return full markdown; larger pages are LLM-summarized and capped at ~5000 chars per page. Pages over 2M chars are refused. If a URL fails or times out, use the browser tool to access it instead.",
+    "description": "Extract content from web page URLs. Returns page content in markdown format. Also works with PDF URLs (arxiv papers, documents, etc.) — pass the PDF link directly and it converts to markdown text. Pages under 5000 chars return full markdown; larger pages are LLM-summarized and capped at ~5000 chars per page. Pages over 2M chars are refused. If a URL fails or times out, use the browser tool to access it instead.",
     "parameters": {
         "type": "object",
         "properties": {
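
Reviewer note: the routing that SKILL.md in this diff describes (try `web_extract` for URLs, default to pymupdf for local files, use marker-pdf only when OCR-level extraction is needed and roughly 5GB of disk is free) could be sketched as below. This is only an illustration of that decision flow under stated assumptions, not code from the patch: `choose_extractor`, its parameters, the `MARKER_MIN_FREE_GB` constant, and the returned labels are all hypothetical names.

```python
import shutil
from urllib.parse import urlparse

# Threshold taken from the ~5GB figure quoted in SKILL.md (hypothetical constant name).
MARKER_MIN_FREE_GB = 5.0


def choose_extractor(source, needs_ocr=False, free_gb=None):
    """Return a label for the extraction route suggested by SKILL.md.

    source    -- a URL or a local file path
    needs_ocr -- True for scanned docs, equations, forms, complex layouts
    free_gb   -- free disk space in GB; measured on "/" when omitted
    """
    # Step 1: anything with an http(s) URL goes to web_extract first.
    if urlparse(source).scheme in ("http", "https"):
        return "web_extract"

    # Step 2: pymupdf is the default for local, text-based documents.
    if not needs_ocr:
        return "pymupdf"

    # marker-pdf is only viable when enough disk is free for PyTorch + models.
    if free_gb is None:
        free_gb = shutil.disk_usage("/").free / (1024 ** 3)
    if free_gb < MARKER_MIN_FREE_GB:
        return "insufficient-disk"
    return "marker-pdf"


if __name__ == "__main__":
    print(choose_extractor("https://arxiv.org/pdf/2402.03300"))
    print(choose_extractor("report.pdf"))
    print(choose_extractor("scan.pdf", needs_ocr=True, free_gb=2.0))
    print(choose_extractor("scan.pdf", needs_ocr=True, free_gb=40.0))
```

The `insufficient-disk` label stands in for the case where SKILL.md asks the agent to tell the user about the disk requirement and offer alternatives; how that message is delivered is left out of the sketch.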