website/docs/user-guide/skills/bundled/productivity/productivity-ocr-and-documents.md

---
title: "Ocr And Documents — Extract text from PDFs and scanned documents"
sidebar_label: "Ocr And Documents"
description: "Extract text from PDFs and scanned documents"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Ocr And Documents

Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.

## Skill metadata

| | |
|---|---|
| Source | Bundled (installed by default) |
| Path | `skills/productivity/ocr-and-documents` |
| Version | `2.3.0` |
| Author | Hermes Agent |
| License | MIT |
| Tags | `PDF`, `Documents`, `Research`, `Arxiv`, `Text-Extraction`, `OCR` |
| Related skills | [`powerpoint`](/docs/user-guide/skills/bundled/productivity/productivity-powerpoint) |

## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

# PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
This skill covers **PDFs and scanned documents**.

## Step 1: Remote URL Available?

If the document has a URL, **always try `web_extract` first**:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.

## Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---------|-----------------|---------------------|
| **Text-based PDF** | ✅ | ✅ |
| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
| **Tables** | ✅ (basic) | ✅ (high accuracy) |
| **Equations / LaTeX** | ❌ | ✅ |
| **Code blocks** | ❌ | ✅ |
| **Forms** | ❌ | ✅ |
| **Headers/footers removal** | ❌ | ✅ |
| **Reading order detection** | ❌ | ✅ |
| **Images extraction** | ✅ (embedded) | ✅ (with context) |
| **Images → text (OCR)** | ❌ | ✅ |
| **EPUB** | ✅ | ✅ |
| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.

If the user needs marker capabilities but the system lacks ~5GB free disk:
> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

---

## pymupdf (lightweight)

```bash
pip install pymupdf pymupdf4llm
```

**Via helper script**:
```bash
python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages
```

**Inline**:
```bash
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```

---

## marker-pdf (high-quality OCR)

```bash
# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf
```

**Via helper script**:
```bash
python scripts/extract_marker.py document.pdf                # Markdown
python scripts/extract_marker.py document.pdf --json         # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                 # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm      # LLM-boosted accuracy
```

**CLI** (installed with marker-pdf):
```bash
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
```

---

## Arxiv Papers

```
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```

## Split, Merge & Search

pymupdf handles these natively — use `execute_code` or inline Python:

```python
# Split: extract pages 1-5 to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
for i in range(5):
    new.insert_pdf(doc, from_page=i, to_page=i)
new.save("pages_1-5.pdf")
```

```python
# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    result.insert_pdf(pymupdf.open(path))
result.save("merged.pdf")
```

```python
# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))
```

No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.

---

## Notes

- `web_extract` is always first choice for URLs
- pymupdf is the safe default — instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
- Both helper scripts accept `--help` for full usage
- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
- For Word docs: `pip install python-docx` (better than OCR — parses actual structure)
- For PowerPoint: see the `powerpoint` skill (uses python-pptx)
docs(website): dedicated page per bundled + optional skill (#14929) Generates a full dedicated Docusaurus page for every one of the 132 skills (73 bundled + 59 optional) under website/docs/user-guide/skills/{bundled,optional}/<category>/. Each page carries the skill's description, metadata (version, author, license, dependencies, platform gating, tags, related skills cross-linked to their own pages), and the complete SKILL.md body that Hermes loads at runtime. Previously the two catalog pages just listed skills with a one-line blurb and no way to see what the skill actually did — users had to go read the source repo. Now every skill has a browsable, searchable, cross-linked reference in the docs. - website/scripts/generate-skill-docs.py — generator that reads skills/ and optional-skills/, writes per-skill pages, regenerates both catalog indexes, and rewrites the Skills section of sidebars.ts. Handles MDX escaping (outside fenced code blocks: curly braces, unsafe HTML-ish tags) and rewrites relative references/*.md links to point at the GitHub source. - website/docs/reference/skills-catalog.md — regenerated; each row links to the new dedicated page. - website/docs/reference/optional-skills-catalog.md — same. - website/sidebars.ts — Skills section now has Bundled / Optional subtrees with one nested category per skill folder. - .github/workflows/{docs-site-checks,deploy-site}.yml — run the generator before docusaurus build so CI stays in sync with the source SKILL.md files. Build verified locally with `npx docusaurus build`. Only remaining warnings are pre-existing broken link/anchor issues in unrelated pages. 2026-04-23 22:22:11 -07:00			`---`
			`title: "Ocr And Documents — Extract text from PDFs and scanned documents"`
			`sidebar_label: "Ocr And Documents"`
			`description: "Extract text from PDFs and scanned documents"`
			`---`

			`{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}`

			`# Ocr And Documents`

			`Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.`

			`## Skill metadata`

			`\| \| \|`
			`\|---\|---\|`
			`\| Source \| Bundled (installed by default) \|`
			\| Path \| `skills/productivity/ocr-and-documents` \|
			\| Version \| `2.3.0` \|
			`\| Author \| Hermes Agent \|`
			`\| License \| MIT \|`
			\| Tags \| `PDF`, `Documents`, `Research`, `Arxiv`, `Text-Extraction`, `OCR` \|
			\| Related skills \| [`powerpoint`](/docs/user-guide/skills/bundled/productivity/productivity-powerpoint) \|

			`## Reference: full SKILL.md`

			`:::info`
			`The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.`
			`:::`

			`# PDF & Document Extraction`

			For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
			For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
			`This skill covers PDFs and scanned documents.`

			`## Step 1: Remote URL Available?`

			If the document has a URL, always try `web_extract` first:

			```
			`web_extract(urls=["https://arxiv.org/pdf/2402.03300"])`
			`web_extract(urls=["https://example.com/report.pdf"])`
			```

			`This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.`

			`Only use local extraction when: the file is local, web_extract fails, or you need batch processing.`

			`## Step 2: Choose Local Extractor`

			`\| Feature \| pymupdf (~25MB) \| marker-pdf (~3-5GB) \|`
			`\|---------\|-----------------\|---------------------\|`
			`\| Text-based PDF \| ✅ \| ✅ \|`
			`\| Scanned PDF (OCR) \| ❌ \| ✅ (90+ languages) \|`
			`\| Tables \| ✅ (basic) \| ✅ (high accuracy) \|`
			`\| Equations / LaTeX \| ❌ \| ✅ \|`
			`\| Code blocks \| ❌ \| ✅ \|`
			`\| Forms \| ❌ \| ✅ \|`
			`\| Headers/footers removal \| ❌ \| ✅ \|`
			`\| Reading order detection \| ❌ \| ✅ \|`
			`\| Images extraction \| ✅ (embedded) \| ✅ (with context) \|`
			`\| Images → text (OCR) \| ❌ \| ✅ \|`
			`\| EPUB \| ✅ \| ✅ \|`
			`\| Markdown output \| ✅ (via pymupdf4llm) \| ✅ (native, higher quality) \|`
			`\| Install size \| ~25MB \| ~3-5GB (PyTorch + models) \|`
			`\| Speed \| Instant \| ~1-14s/page (CPU), ~0.2s/page (GPU) \|`

			`Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.`

			`If the user needs marker capabilities but the system lacks ~5GB free disk:`
			`> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."`

			`---`

			`## pymupdf (lightweight)`

			```bash
			`pip install pymupdf pymupdf4llm`
			```

			`Via helper script:`
			```bash
			`python scripts/extract_pymupdf.py document.pdf # Plain text`
			`python scripts/extract_pymupdf.py document.pdf --markdown # Markdown`
			`python scripts/extract_pymupdf.py document.pdf --tables # Tables`
			`python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images`
			`python scripts/extract_pymupdf.py document.pdf --metadata # Title, author, pages`
			`python scripts/extract_pymupdf.py document.pdf --pages 0-4 # Specific pages`
			```

			`Inline:`
			```bash
			`python3 -c "`
			`import pymupdf`
			`doc = pymupdf.open('document.pdf')`
			`for page in doc:`
			`print(page.get_text())`
			`"`
			```

			`---`

			`## marker-pdf (high-quality OCR)`

			```bash
			`# Check disk space first`
			`python scripts/extract_marker.py --check`

			`pip install marker-pdf`
			```

			`Via helper script:`
			```bash
			`python scripts/extract_marker.py document.pdf # Markdown`
			`python scripts/extract_marker.py document.pdf --json # JSON with metadata`
			`python scripts/extract_marker.py document.pdf --output_dir out/ # Save images`
			`python scripts/extract_marker.py scanned.pdf # Scanned PDF (OCR)`
			`python scripts/extract_marker.py document.pdf --use_llm # LLM-boosted accuracy`
			```

			`CLI (installed with marker-pdf):`
			```bash
			`marker_single document.pdf --output_dir ./output`
			`marker /path/to/folder --workers 4 # Batch`
			```

			`---`

			`## Arxiv Papers`

			```
			`# Abstract only (fast)`
			`web_extract(urls=["https://arxiv.org/abs/2402.03300"])`

			`# Full paper`
			`web_extract(urls=["https://arxiv.org/pdf/2402.03300"])`

			`# Search`
			`web_search(query="arxiv GRPO reinforcement learning 2026")`
			```

			`## Split, Merge & Search`

			pymupdf handles these natively — use `execute_code` or inline Python:

			```python
			`# Split: extract pages 1-5 to a new PDF`
			`import pymupdf`
			`doc = pymupdf.open("report.pdf")`
			`new = pymupdf.open()`
			`for i in range(5):`
			`new.insert_pdf(doc, from_page=i, to_page=i)`
			`new.save("pages_1-5.pdf")`
			```

			```python
			`# Merge multiple PDFs`
			`import pymupdf`
			`result = pymupdf.open()`
			`for path in ["a.pdf", "b.pdf", "c.pdf"]:`
			`result.insert_pdf(pymupdf.open(path))`
			`result.save("merged.pdf")`
			```

			```python
			`# Search for text across all pages`
			`import pymupdf`
			`doc = pymupdf.open("report.pdf")`
			`for i, page in enumerate(doc):`
			`results = page.search_for("revenue")`
			`if results:`
			`print(f"Page {i+1}: {len(results)} match(es)")`
			`print(page.get_text("text"))`
			```

			`No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.`

			`---`

			`## Notes`

			- `web_extract` is always first choice for URLs
			`- pymupdf is the safe default — instant, no models, works everywhere`
			`- marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed`
			- Both helper scripts accept `--help` for full usage
			- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
			- For Word docs: `pip install python-docx` (better than OCR — parses actual structure)
			- For PowerPoint: see the `powerpoint` skill (uses python-pptx)