Extract (Text + OCR) from PDF

Description

The Extract (Text + OCR) from PDF activity converts native and scanned PDF files into readable and structured text using Optical Character Recognition (OCR). It processes selected page ranges and outputs content as plain text or structured data suitable for use in downstream workflow activities such as text parsing, document validation, or record generation.

This activity is useful in scenarios where PDFs are not machine-readable, such as scanned invoices, receipts, reports, or contracts. It enables you to automate content extraction and digitize workflows end-to-end.

Use case:
In a document digitization workflow, scanned PDFs of tax invoices are passed through this activity to extract customer names, invoice dates, and item details. The output is then cleaned using ExtractFields and uploaded into a centralized accounting system.


Input

| Field | Required | Description |
| --- | --- | --- |
| PDF Files | Yes | One or more PDF files to be processed. Can be uploaded or received from a prior file activity. |

Output

| Output Type | Format | Description |
| --- | --- | --- |
| Data | Text | Extracted content from the specified page range of the PDF. |
| Structured | Tabular | Extracted data that can be split or processed downstream using structured tools. |

Configuration Fields

| Field Name | Description |
| --- | --- |
| Start Page | Starting page number for OCR extraction. Set to 1 to begin from the first page. |
| End Page | Ending page number for extraction. Leave blank to process until the last page. |
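
The page-range rules above can be sketched as a small helper. This is an illustrative sketch, not the activity's actual implementation: the function name `resolve_page_range` and its parameters are hypothetical, and it assumes that page numbers are 1-based and inclusive and that an End Page beyond the last page is clamped to the last page (the documentation does not state this).

```python
from typing import Optional


def resolve_page_range(total_pages: int, start_page: int = 1,
                       end_page: Optional[int] = None) -> range:
    """Return the 1-based page numbers to process, inclusive on both ends."""
    if total_pages < 1:
        raise ValueError("the PDF must contain at least one page")
    if start_page < 1 or start_page > total_pages:
        raise ValueError(f"Start Page must be between 1 and {total_pages}")
    # A blank End Page (None) means "process until the last page".
    # Clamping an oversized End Page to the last page is an assumption.
    last = total_pages if end_page is None else min(end_page, total_pages)
    if last < start_page:
        raise ValueError("End Page must not be smaller than Start Page")
    return range(start_page, last + 1)


print(list(resolve_page_range(5, 1, 3)))  # [1, 2, 3]
print(list(resolve_page_range(5, 2)))     # [2, 3, 4, 5]
```

Under these assumptions, a one-page PDF processed with Start Page 1 and End Page 3 would yield page 1 only.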

Sample Input

Not Applicable


Sample Configuration

| Field | Value |
| --- | --- |
| Start Page | 1 |
| End Page | 3 |

Sample Output

| File Name | File Type | Size (bytes) | Contains |
| --- | --- | --- | --- |
| invoice-2025-a.csv | CSV | 1,234 | Table data from pages 1–3 |
| invoice-2025-b.csv | CSV | 980 | Table data from page 1 only |
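
If the CSV output is processed outside the workflow, a downstream step might load it along the following lines. This is a minimal sketch using only Python's standard `csv` module. The helper name `load_rows`, the column names, and the sample row are hypothetical placeholders, since the actual columns depend on the tables found in each PDF.

```python
import csv
import io
from typing import Dict, List


def load_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into dicts, trimming whitespace and skipping blank rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # Ignore overflow cells (key None); treat missing cells as empty strings.
        cleaned = {k.strip(): (v or "").strip()
                   for k, v in row.items() if k is not None}
        if any(cleaned.values()):
            rows.append(cleaned)
    return rows


# Hypothetical content standing in for an extracted file such as invoice-2025-a.csv.
sample = "customer,invoice_date,item\n Acme Ltd ,2025-01-15,Widget\n,,\n"
for record in load_rows(sample):
    print(record)
```

Trimming whitespace and dropping empty rows is a common first clean-up step for OCR-derived tables, where stray spaces and blank lines are frequent.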