Step 3: Open a document and search — Paperless-ngx Walkthrough

Drop into a single document, see its OCR’d content, then run a full-text search

Walkthrough progress

Step 3 of 4 • 3 minutes

Open a document and search

See how OCR, Tika and Gotenberg made every document searchable, then try a keyword search.

Every document is OCR'd by Tesseract; .docx and .odt go via Gotenberg, .eml via Apache Tika — the full text becomes searchable
Full-text search combines Paperless's text index over OCR'd content with the AI-generated metadata

Expected outcome

  • Single document detail shows the OCR'd text alongside the original
  • Document notes panel shows the AI-generated summary
  • Full-text search returns matches across the AI metadata and OCR'd content

What to try

  1. Open a document

    From the Documents view, click any card — try one of the .eml emails (Highways East or the WREN grant decision are good) or one of the planning notice PDFs.

  2. Look at the Content tab

    The right-hand pane shows the document's OCR'd content. For PDFs this is Tesseract output. For .docx and .odt it's been converted via Gotenberg first. For .eml files it's the body extracted by Apache Tika. All three pipelines feed the same searchable text store.

  3. Look at the Notes tab

    The notes panel shows the AI-generated summary, one or two sentences long, that the post-consume hook stored against the document. This is useful when a clerk wants to scan an inbox at a glance.

  4. Look at the Details tab

    You'll see the AI-rewritten title, the AI-extracted correspondent, the AI-classified document type, and the AI tags — all editable if you want to override Bedrock's choices.

  5. Search the archive

    Use the search bar at the top. Try "planning" to find every document that mentions planning. Try "Mill Lane": Paperless runs a full-text search over the OCR'd content, so this matches the body of the planning notice for The Old Forge. Try "WREN grant" to pick up the email confirming the grant award.
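
The same searches can also be run against the Paperless-ngx REST API, which accepts a full-text `query` parameter on the documents endpoint. The following is a minimal sketch rather than part of the walkthrough itself: the host name, the API token and the helper names `build_search_url` and `search_documents` are all assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url: str, query: str) -> str:
    """Build the full-text search URL for the documents endpoint."""
    params = urllib.parse.urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/documents/?{params}"


def search_documents(base_url: str, token: str, query: str) -> list[dict]:
    """Run a full-text search and return the first page of results."""
    request = urllib.request.Request(
        build_search_url(base_url, query),
        headers={
            # Paperless-ngx token authentication header.
            "Authorization": f"Token {token}",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return payload["results"]


if __name__ == "__main__":
    # Hypothetical host; replace with your own deployment's URL.
    for term in ("planning", "Mill Lane", "WREN grant"):
        print(build_search_url("https://paperless.example.org", term))
```

Calling `search_documents` with a real base URL and token should return the same documents as the search bar, since both use the same index.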

The OCR pipeline:
  • Tesseract — OCRs scanned PDFs and image-only PDFs in the main Paperless container.
  • Apache Tika — runs as a sidecar on localhost:9998; extracts content from .eml, .odt, .rtf and dozens of other formats.
  • Gotenberg — runs as a sidecar on localhost:3000; converts .docx and other Office formats to PDF before they reach Tesseract.

Tika and Gotenberg run as sidecars in the same ECS task as the main Paperless container and talk to it over localhost, so there's no service-discovery DNS and no extra network surface, and all three components scale together.
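
Paperless-ngx is pointed at Tika and Gotenberg through environment variables on the main container. The sketch below shows values that would match the localhost ports listed above, in the name/value shape an ECS container definition expects. The helper names are hypothetical, and the walkthrough does not show the actual task definition, so treat this as an illustration under those assumptions.

```python
def sidecar_environment(tika_port: int = 9998, gotenberg_port: int = 3000) -> dict[str, str]:
    """Return Paperless-ngx settings that point at localhost sidecars."""
    return {
        "PAPERLESS_TIKA_ENABLED": "1",
        "PAPERLESS_TIKA_ENDPOINT": f"http://localhost:{tika_port}",
        "PAPERLESS_TIKA_GOTENBERG_ENDPOINT": f"http://localhost:{gotenberg_port}",
    }


def to_ecs_environment(env: dict[str, str]) -> list[dict[str, str]]:
    """Convert a plain mapping into the ECS `environment` list format."""
    return [{"name": name, "value": value} for name, value in env.items()]


if __name__ == "__main__":
    for entry in to_ecs_environment(sidecar_environment()):
        print(f"{entry['name']}={entry['value']}")
```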
