---
title: Extract (Text + OCR) from PDF
description: Extract machine-readable text or structured data from image-based or native PDF files using OCR.
category: Document Processing
tags: [pdf, ocr, text extraction, document, invoice, scanned file]
---

# Extract (Text + OCR) from PDF

## **Description**

The **Extract (Text + OCR) from PDF** activity converts native and scanned PDF files into machine-readable text using **Optical Character Recognition (OCR)**. It processes the configured page range and outputs the content as plain text or structured data for downstream workflow activities such as text parsing, document validation, or record generation.

This activity is useful when PDFs are not machine-readable, as with scanned invoices, receipts, reports, or contracts. It lets you automate content extraction and digitize workflows end to end.

> **Use case**:  
> In a document digitization workflow, scanned PDFs of tax invoices are passed through this activity to extract customer names, invoice dates, and item details. The output is then cleaned using `ExtractFields` and uploaded into a centralized accounting system.

---

## **Input**

| **Field**   | **Required** | **Description**                                                                                |
| ----------- | ------------ | ---------------------------------------------------------------------------------------------- |
| `PDF Files` | Yes          | One or more PDF files to be processed. Can be uploaded or received from a prior file activity. |

---

## **Output**

| **Output Type** | **Format** | **Description**                                                                  |
| --------------- | ---------- | -------------------------------------------------------------------------------- |
| `Data`          | Text       | Extracted content from the specified page range of the PDF.                      |
| `Structured`    | Tabular    | Extracted data in tabular form that downstream activities can split or process.  |

---

## **Configuration Fields**

| **Field Name** | **Description**                                                                   |
| -------------- | --------------------------------------------------------------------------------- |
| **Start Page** | Starting page number for OCR extraction. Set to `1` to begin from the first page. |
| **End Page**   | Ending page number for extraction. Leave blank to process until the last page.    |
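
The way **Start Page** and **End Page** combine can be sketched in Python. This is only an illustration of the rules in the table above, not the activity's implementation: the function name `resolve_page_range` is hypothetical, a blank **End Page** is modeled as `None`, and clamping an **End Page** that exceeds the file length to the last page is an assumption.

```python
from typing import Optional


def resolve_page_range(start_page: int, end_page: Optional[int], total_pages: int) -> range:
    """Return the 1-based page numbers that would be processed.

    Assumptions (not confirmed by the activity's documentation):
    - a blank End Page is passed as None and means "until the last page";
    - an End Page beyond the end of the file is clamped to the last page.
    """
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    if start_page < 1 or start_page > total_pages:
        raise ValueError("start_page must be between 1 and total_pages")

    # Blank End Page: run through the final page of the file.
    last_page = total_pages if end_page is None else min(end_page, total_pages)
    if last_page < start_page:
        raise ValueError("end_page must not be before start_page")

    # range() excludes its stop value, so add 1 to include last_page.
    return range(start_page, last_page + 1)
```

For example, `list(resolve_page_range(1, 3, 10))` yields pages 1 through 3, and `list(resolve_page_range(1, None, 4))` yields every page of a four-page file.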

---

## **Sample Input**

_Not Applicable_

---

## **Sample Configuration**

| Field        | Value |
| ------------ | ----- |
| `Start Page` | `1`   |
| `End Page`   | `3`   |

---

## **Sample Output**

| **File Name**        | **File Type** | **Size (bytes)** | **Contains**                |
| -------------------- | ------------- | ---------------- | --------------------------- |
| `invoice-2025-a.csv` | CSV           | 1,234            | Table data from pages 1–3   |
| `invoice-2025-b.csv` | CSV           | 980              | Table data from page 1 only |
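
If a downstream step needs to work with these CSV files in code, a minimal Python sketch might look like the following. It assumes each file has a header row. The helper name `read_structured_rows` and the column names and values in `sample` are invented placeholders for illustration, not real output of this activity.

```python
import csv
import io
from typing import Dict, List


def read_structured_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text with a header row into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Placeholder content standing in for a file such as invoice-2025-a.csv.
# The columns "item" and "quantity" are hypothetical.
sample = "item,quantity\nWidget,2\nGadget,5\n"
rows = read_structured_rows(sample)
```

Each entry in `rows` maps a column header to the cell value as a string, so numeric fields need explicit conversion before any arithmetic.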