Step 3: Open a document and search — Paperless-ngx Walkthrough
Drop into a single document, see its OCR’d content, then run a full-text search
Step 3 of 4 • 3 minutes
Open a document and search
See how OCR, Tika and Gotenberg made every document searchable, then try a keyword search.
Expected outcome
- Single document detail shows the OCR'd text alongside the original
- Document notes panel shows the AI-generated summary
- Full-text search returns matches across the AI metadata and OCR'd content
What to try
- Open a document
  From the Documents view, click any card. Try one of the .eml emails (Highways East or the WREN grant decision are good) or one of the planning notice PDFs.
- Look at the Content tab
  The right-hand pane shows the document's OCR'd content. For PDFs this is Tesseract output. For .docx and .odt it's been converted via Gotenberg first. For .eml files it's the body extracted by Apache Tika. All three pipelines feed the same searchable text store.
- Look at the Notes tab
  The notes panel shows the AI-generated 1-2 sentence summary that the post-consume hook stored against the document. Useful when a clerk wants to scan an inbox at a glance.
- Look at the Details tab
  You'll see the AI-rewritten title, the AI-extracted correspondent, the AI-classified document type, and the AI tags, all editable if you want to override Bedrock's choices.
- Search the archive
  Use the search bar at the top. Try "planning" to find every document mentioning planning. Try "Mill Lane": Paperless full-text-searches the OCR'd content, so this hits the body of the planning notice for The Old Forge. Try "WREN grant": this picks up the email confirming the grant award.
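The same searches can also be run outside the UI. The sketch below assumes the Paperless-ngx REST API is reachable from where you run it and that you have an API token for your user. BASE_URL and TOKEN are placeholders, not values taken from this demo. It uses the documents endpoint with a query parameter, which is how Paperless-ngx exposes full-text search over its API.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with your demo's URL
TOKEN = "replace-me"                # assumption: an API token for your user


def search_url(base_url: str, query: str) -> str:
    """Build the full-text search URL for a query string."""
    params = urllib.parse.urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/documents/?{params}"


def search(query: str) -> list[dict]:
    """Run a full-text search and return the matching documents."""
    request = urllib.request.Request(
        search_url(BASE_URL, query),
        headers={
            "Authorization": f"Token {TOKEN}",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        # The API returns a paginated object; matches are under "results".
        return json.load(response)["results"]
```

Calling search("Mill Lane") should return the same documents the search bar finds for that phrase; each result carries fields such as the title and the extracted content.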
- Tesseract: OCRs scanned PDFs and image-only PDFs in the main Paperless container.
- Apache Tika: runs as a sidecar on localhost:9998; extracts content from .eml, .odt, .rtf and dozens of other formats.
- Gotenberg: runs as a sidecar on localhost:3000; converts .docx and other Office formats to PDF before they reach Tesseract.
All three components run in the same ECS task. Tika and Gotenberg are sidecars that the main Paperless container reaches via localhost, so there's no service-discovery DNS, no extra network surface, and everything scales together.
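As a rough illustration of the localhost wiring, the sketch below builds the Paperless-ngx settings that point the main container at Tika and Gotenberg, and probes each sidecar. The ports come from this walkthrough and the variable names are the standard Paperless-ngx Tika settings; whether this demo sets them in exactly this way is an assumption. The helper names paperless_env and is_up are hypothetical.

```python
import urllib.request

# Ports are the ones named in the walkthrough; the paths are the stock
# version and health routes of Tika server and Gotenberg respectively.
SIDECAR_CHECKS = {
    "tika": "http://localhost:9998/version",
    "gotenberg": "http://localhost:3000/health",
}


def paperless_env(tika_port: int = 9998, gotenberg_port: int = 3000) -> dict[str, str]:
    """Environment variables telling Paperless-ngx where the sidecars listen."""
    return {
        "PAPERLESS_TIKA_ENABLED": "1",
        "PAPERLESS_TIKA_ENDPOINT": f"http://localhost:{tika_port}",
        "PAPERLESS_TIKA_GOTENBERG_ENDPOINT": f"http://localhost:{gotenberg_port}",
    }


def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts and HTTP error responses.
        return False
```

Run from inside the task, {name: is_up(url) for name, url in SIDECAR_CHECKS.items()} gives a quick view of which sidecars are answering. Because everything is addressed via localhost, the same values work for every copy of the task.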