Description
The Extract (Text + OCR) from PDF activity converts native and scanned PDF files into readable, structured text using Optical Character Recognition (OCR). It processes the selected page range and outputs the content as plain text or structured data for use in downstream workflow activities such as text parsing, document validation, or record generation.
This activity is useful in scenarios where PDFs are not machine-readable, such as scanned invoices, receipts, reports, or contracts. It enables you to automate content extraction and digitize workflows end-to-end.
Use case:
In a document digitization workflow, scanned PDFs of tax invoices are passed through this activity to extract customer names, invoice dates, and item details. The output is then cleaned using ExtractFields and uploaded into a centralized accounting system.
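As a rough illustration of the cleaning step in this use case, the sketch below pulls a customer name and an invoice date out of extracted text with regular expressions. It is not the ExtractFields activity itself. The sample text, the field labels, and the function name parse_invoice_fields are all assumptions made for the example, and real OCR output will vary with scan quality and invoice layout.

```python
import re

# Hypothetical OCR output for one invoice page (assumed layout and labels).
SAMPLE_TEXT = """\
TAX INVOICE
Customer Name: Acme Traders
Invoice Date: 2025-03-14
Item: Widget A  Qty: 2
"""

# Assumed label patterns; adjust them to the labels your invoices actually use.
PATTERNS = {
    "customer_name": re.compile(r"Customer Name:\s*(.+)"),
    "invoice_date": re.compile(r"Invoice Date:\s*(\d{4}-\d{2}-\d{2})"),
}


def parse_invoice_fields(text: str) -> dict:
    """Return the first match for each labelled field, or None if it is absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


if __name__ == "__main__":
    print(parse_invoice_fields(SAMPLE_TEXT))
```

Fields that are not found come back as None, which lets a later validation step flag incomplete invoices instead of failing silently.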
Input
Field | Required | Description |
---|---|---|
PDF Files | Yes | One or more PDF files to be processed. Can be uploaded or received from a prior file activity. |
Output
Output Type | Format | Description |
---|---|---|
Data | Text | Extracted content from the specified page range of the PDF. |
Structured | Tabular | Extracted data that can be split or processed downstream using structured tools. |
Configuration Fields
Field Name | Description |
---|---|
Start Page | Starting page number for OCR extraction. Set to 1 to begin from the first page. |
End Page | Ending page number for extraction. Leave blank to process until the last page. |
Sample Configuration
Field | Value |
---|---|
Start Page | 1 |
End Page | 3 |
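The Start Page and End Page semantics can be sketched as a small helper that turns the two fields into a concrete list of pages. This is only an illustration, not the activity's actual implementation. The function name resolve_page_range is hypothetical, and clamping End Page to the document length is an assumption about how a value past the last page is handled.

```python
from typing import Optional


def resolve_page_range(start_page: int, end_page: Optional[int], total_pages: int) -> range:
    """Return the 1-based page numbers to process.

    end_page=None models a blank End Page field (process to the last page).
    Assumption: an End Page beyond the document is clamped to the last page.
    """
    if total_pages < 1:
        raise ValueError("document has no pages")
    if not 1 <= start_page <= total_pages:
        raise ValueError(f"Start Page must be between 1 and {total_pages}")
    last = total_pages if end_page is None else min(end_page, total_pages)
    if last < start_page:
        raise ValueError("End Page must not be smaller than Start Page")
    return range(start_page, last + 1)


if __name__ == "__main__":
    # Sample configuration above: Start Page 1, End Page 3, on a 10-page file.
    print(list(resolve_page_range(1, 3, 10)))
    # Blank End Page on a 4-page file.
    print(list(resolve_page_range(1, None, 4)))
```

Under the clamping assumption, the same configuration applied to a one-page file yields only page 1.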
Sample Output
File Name | File Type | Size (bytes) | Contains |
---|---|---|---|
invoice-2025-a.csv | CSV | 1,234 | Table data from pages 1–3 |
invoice-2025-b.csv | CSV | 980 | Table data from page 1 only |
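If the CSV output is consumed outside the workflow, a downstream script might read it as shown in this minimal sketch. The column names (item, quantity, unit_price) and the sample rows are hypothetical, since the actual columns depend on the tables found in each PDF.

```python
import csv
import io

# Hypothetical contents of an output file such as invoice-2025-a.csv.
# The column names are assumed for illustration only.
SAMPLE_CSV = "item,quantity,unit_price\nWidget A,2,9.50\nWidget B,1,20.00\n"


def load_rows(csv_text: str) -> list:
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def invoice_total(rows: list) -> float:
    """Sum quantity * unit_price over all rows."""
    return sum(int(row["quantity"]) * float(row["unit_price"]) for row in rows)


if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    print(len(rows), invoice_total(rows))
```

To read a real output file, open it with open(path, newline="") and pass the file object to csv.DictReader in place of the StringIO wrapper.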