mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-28 06:51:16 +08:00
feat(ocr-and-documents): add OCR and document extraction skills
- Introduced new skills for extracting text from PDFs, scanned documents, and images using OCR and document parsing tools.
- Added detailed documentation for usage and installation of `pymupdf` and `marker-pdf` for local extraction.
- Implemented scripts for text extraction with both lightweight and high-quality options, including support for various document formats.
- Updated web extraction functionality to handle PDF URLs directly, enhancing usability for academic papers and documents.
skills/ocr-and-documents/DESCRIPTION.md (new file, 3 lines)
@@ -0,0 +1,3 @@
---
description: Skills for extracting text from PDFs, scanned documents, images, and other file formats using OCR and document parsing tools.
---
133
skills/ocr-and-documents/SKILL.md
Normal file
133
skills/ocr-and-documents/SKILL.md
Normal file
@@ -0,0 +1,133 @@
---
name: ocr-and-documents
description: Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.
version: 2.3.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [PDF, Documents, Research, Arxiv, Text-Extraction, OCR]
    related_skills: [powerpoint]
---

# PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR).

For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).

This skill covers **PDFs and scanned documents**.
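The structural advantage over OCR is concrete: a `.docx` file is a zip archive whose `word/document.xml` stores text in `<w:t>` elements — the same structure `python-docx` wraps with a friendlier API. A stdlib-only sketch (the in-memory one-paragraph document here is purely illustrative):

```python
# Sketch: read DOCX text from the zip/XML structure directly, no OCR needed.
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(data: bytes) -> str:
    """Pull all paragraph text straight out of a DOCX byte string."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        xml_bytes = zf.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    return "\n".join(
        "".join(t.text or "" for t in p.iter(f"{W}t"))
        for p in root.iter(f"{W}p")
    )

# Build a minimal one-paragraph DOCX in memory to demonstrate:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "word/document.xml",
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body><w:p><w:r><w:t>Hello DOCX</w:t></w:r></w:p></w:body></w:document>",
    )
print(docx_text(buf.getvalue()))  # → Hello DOCX
```

In practice just use `python-docx`; this only shows why structured parsing recovers exact text where OCR would approximate it.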
## Step 1: Remote URL Available?

If the document has a URL, **always try `web_extract` first**:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when the file is local, web_extract fails, or you need batch processing.
## Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---------|-----------------|---------------------|
| **Text-based PDF** | ✅ | ✅ |
| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
| **Tables** | ✅ (basic) | ✅ (high accuracy) |
| **Equations / LaTeX** | ❌ | ✅ |
| **Code blocks** | ❌ | ✅ |
| **Forms** | ❌ | ✅ |
| **Header/footer removal** | ❌ | ✅ |
| **Reading order detection** | ❌ | ✅ |
| **Image extraction** | ✅ (embedded) | ✅ (with context) |
| **Images → text (OCR)** | ❌ | ✅ |
| **EPUB** | ✅ | ✅ |
| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.

If the user needs marker capabilities but the system lacks ~5GB of free disk:

> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf, which works for text-based PDFs but not scanned documents or equations."
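The decision rule above can be sketched as a tiny helper (the function and flag names are illustrative, not part of the skill's API):

```python
def choose_extractor(scanned=False, needs_equations=False, needs_forms=False,
                     complex_layout=False):
    """Pick pymupdf unless a marker-only capability is required."""
    if scanned or needs_equations or needs_forms or complex_layout:
        return "marker-pdf"
    return "pymupdf"

print(choose_extractor())              # → pymupdf
print(choose_extractor(scanned=True))  # → marker-pdf
```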
---

## pymupdf (lightweight)

```bash
pip install pymupdf pymupdf4llm
```

**Via helper script**:

```bash
python scripts/extract_pymupdf.py document.pdf                # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown     # Markdown
python scripts/extract_pymupdf.py document.pdf --tables       # Tables
python scripts/extract_pymupdf.py document.pdf --images out/  # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata     # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4    # Specific pages
```

**Inline**:

```bash
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```
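The `--pages 0-4` flag above maps a range string to zero-based page indices. A minimal sketch of that parsing, mirroring the helper script's logic:

```python
def parse_pages(spec: str) -> list[int]:
    """Turn a --pages argument like "0-4" into [0, 1, 2, 3, 4], or "7" into [7]."""
    if "-" in spec:
        start, end = spec.split("-")
        return list(range(int(start), int(end) + 1))
    return [int(spec)]

print(parse_pages("0-4"))  # → [0, 1, 2, 3, 4]
print(parse_pages("7"))    # → [7]
```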
---

## marker-pdf (high-quality OCR)

```bash
# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf
```

**Via helper script**:

```bash
python scripts/extract_marker.py document.pdf                    # Markdown
python scripts/extract_marker.py document.pdf --json             # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                     # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm          # LLM-boosted accuracy
```

**CLI** (installed with marker-pdf):

```bash
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4   # Batch
```
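The `--check` step above reduces to a stdlib disk-space probe; a minimal sketch, assuming the ~5GB requirement stated in this skill:

```python
# Sketch: verify free disk before installing marker-pdf (~5GB for PyTorch + models).
import shutil

def enough_disk_for_marker(path="/", needed_gb=5.0):
    """Return (free_gb, ok) for the filesystem containing `path`."""
    free_gb = shutil.disk_usage(path).free / (1024 ** 3)
    return free_gb, free_gb >= needed_gb

free_gb, ok = enough_disk_for_marker()
print(f"{free_gb:.1f}GB free; marker-pdf install {'OK' if ok else 'blocked'}")
```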
---

## Arxiv Papers

```
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```

## Notes

- `web_extract` is always the first choice for URLs
- pymupdf is the safe default: instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, and complex layouts; install it only when needed
- Both helper scripts accept `--help` for full usage
- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
- For Word docs: `pip install python-docx` (better than OCR, since it parses the actual document structure)
- For PowerPoint: see the `powerpoint` skill (uses python-pptx)
skills/ocr-and-documents/scripts/extract_marker.py (new file, 87 lines)
@@ -0,0 +1,87 @@
#!/usr/bin/env python3
"""Extract text from documents using marker-pdf. High-quality OCR + layout analysis.

Requires ~3-5GB disk (PyTorch + models downloaded on first use).
Supports: PDF, DOCX, PPTX, XLSX, HTML, EPUB, images.

Usage:
    python extract_marker.py document.pdf
    python extract_marker.py document.pdf --output_dir ./output
    python extract_marker.py presentation.pptx
    python extract_marker.py spreadsheet.xlsx
    python extract_marker.py scanned_doc.pdf          # OCR works here
    python extract_marker.py document.pdf --json      # Structured output
    python extract_marker.py document.pdf --use_llm   # LLM-boosted accuracy
"""
import sys
import os


def convert(path, output_dir=None, output_format="markdown", use_llm=False):
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.config.parser import ConfigParser

    config_dict = {}
    if use_llm:
        config_dict["use_llm"] = True

    config_parser = ConfigParser(config_dict)
    models = create_model_dict()
    converter = PdfConverter(config=config_parser.generate_config_dict(), artifact_dict=models)
    rendered = converter(path)

    if output_format == "json":
        import json
        print(json.dumps({
            "markdown": rendered.markdown,
            "metadata": rendered.metadata if hasattr(rendered, "metadata") else {},
        }, indent=2, ensure_ascii=False))
    else:
        print(rendered.markdown)

    # Save images if output_dir specified
    if output_dir and hasattr(rendered, "images") and rendered.images:
        from pathlib import Path
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        for name, img_data in rendered.images.items():
            img_path = os.path.join(output_dir, name)
            with open(img_path, "wb") as f:
                f.write(img_data)
        print(f"\nSaved {len(rendered.images)} image(s) to {output_dir}/", file=sys.stderr)


def check_requirements():
    """Check disk space before installing."""
    import shutil
    free_gb = shutil.disk_usage("/").free / (1024**3)
    if free_gb < 5:
        print(f"⚠️ Only {free_gb:.1f}GB free. marker-pdf needs ~5GB for PyTorch + models.")
        print("Use pymupdf instead (scripts/extract_pymupdf.py) or free up disk space.")
        sys.exit(1)
    print(f"✓ {free_gb:.1f}GB free — sufficient for marker-pdf")


if __name__ == "__main__":
    args = sys.argv[1:]
    if not args or args[0] in ("-h", "--help"):
        print(__doc__)
        sys.exit(0)

    if args[0] == "--check":
        check_requirements()
        sys.exit(0)

    path = args[0]
    output_dir = None
    output_format = "markdown"
    use_llm = False

    if "--output_dir" in args:
        idx = args.index("--output_dir")
        output_dir = args[idx + 1]
    if "--json" in args:
        output_format = "json"
    if "--use_llm" in args:
        use_llm = True

    convert(path, output_dir=output_dir, output_format=output_format, use_llm=use_llm)
skills/ocr-and-documents/scripts/extract_pymupdf.py (new file, 98 lines)
@@ -0,0 +1,98 @@
#!/usr/bin/env python3
"""Extract text from documents using pymupdf. Lightweight (~25MB), no models.

Usage:
    python extract_pymupdf.py document.pdf
    python extract_pymupdf.py document.pdf --markdown
    python extract_pymupdf.py document.pdf --pages 0-4
    python extract_pymupdf.py document.pdf --images output_dir/
    python extract_pymupdf.py document.pdf --tables
    python extract_pymupdf.py document.pdf --metadata
"""
import sys
import json


def extract_text(path, pages=None):
    import pymupdf
    doc = pymupdf.open(path)
    page_range = range(len(doc)) if pages is None else pages
    for i in page_range:
        if i < len(doc):
            print(f"\n--- Page {i+1}/{len(doc)} ---\n")
            print(doc[i].get_text())


def extract_markdown(path, pages=None):
    import pymupdf4llm
    md = pymupdf4llm.to_markdown(path, pages=pages)
    print(md)


def extract_tables(path):
    import pymupdf
    doc = pymupdf.open(path)
    for i, page in enumerate(doc):
        tables = page.find_tables()
        for j, table in enumerate(tables.tables):
            print(f"\n--- Page {i+1}, Table {j+1} ---\n")
            df = table.to_pandas()
            print(df.to_markdown(index=False))


def extract_images(path, output_dir):
    import pymupdf
    from pathlib import Path
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    doc = pymupdf.open(path)
    count = 0
    for i, page in enumerate(doc):
        for img_idx, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            pix = pymupdf.Pixmap(doc, xref)
            if pix.n >= 5:  # CMYK or similar: convert to RGB before saving as PNG
                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
            out_path = f"{output_dir}/page{i+1}_img{img_idx+1}.png"
            pix.save(out_path)
            count += 1
    print(f"Extracted {count} images to {output_dir}/")


def show_metadata(path):
    import pymupdf
    doc = pymupdf.open(path)
    print(json.dumps({
        "pages": len(doc),
        "title": doc.metadata.get("title", ""),
        "author": doc.metadata.get("author", ""),
        "subject": doc.metadata.get("subject", ""),
        "creator": doc.metadata.get("creator", ""),
        "producer": doc.metadata.get("producer", ""),
        "format": doc.metadata.get("format", ""),
    }, indent=2))


if __name__ == "__main__":
    args = sys.argv[1:]
    if not args or args[0] in ("-h", "--help"):
        print(__doc__)
        sys.exit(0)

    path = args[0]
    pages = None

    if "--pages" in args:
        idx = args.index("--pages")
        p = args[idx + 1]
        if "-" in p:
            start, end = p.split("-")
            pages = list(range(int(start), int(end) + 1))
        else:
            pages = [int(p)]

    if "--metadata" in args:
        show_metadata(path)
    elif "--tables" in args:
        extract_tables(path)
    elif "--images" in args:
        idx = args.index("--images")
        output_dir = args[idx + 1] if idx + 1 < len(args) else "./images"
        extract_images(path, output_dir)
    elif "--markdown" in args:
        extract_markdown(path, pages=pages)
    else:
        extract_text(path, pages=pages)
@@ -1240,7 +1240,7 @@ WEB_SEARCH_SCHEMA = {
 
 WEB_EXTRACT_SCHEMA = {
     "name": "web_extract",
-    "description": "Extract content from web page URLs. Returns page content in markdown format. Pages under 5000 chars return full markdown; larger pages are LLM-summarized and capped at ~5000 chars per page. Pages over 2M chars are refused. If a URL fails or times out, use the browser tool to access it instead.",
+    "description": "Extract content from web page URLs. Returns page content in markdown format. Also works with PDF URLs (arxiv papers, documents, etc.) — pass the PDF link directly and it converts to markdown text. Pages under 5000 chars return full markdown; larger pages are LLM-summarized and capped at ~5000 chars per page. Pages over 2M chars are refused. If a URL fails or times out, use the browser tool to access it instead.",
     "parameters": {
         "type": "object",
         "properties": {