---
title: Extract Text from PDF
description: Extract structured or unstructured text from machine-readable PDF files using markers, regex, and column mapping.
category: File Transformation
tags: [pdf, text extraction, document parsing, regex, markers]
---

# Extract Text from PDF

## **Description**

The **Extract Text from PDF** activity processes uploaded PDFs and extracts structured or free-form text using a combination of **markers**, **regex extractors**, and **column mapping**. This activity supports parsing structured documents such as invoices, contracts, or system-generated reports where field boundaries and patterns can be identified.

Unlike OCR-based activities, this activity assumes that each PDF is machine-readable, that is, its text layer is directly accessible.

> **Use case**:  
> A vendor's PDF invoice is parsed by configuring regex extractors for "Invoice Number", "Customer Name", and "Amount Due". The output is a structured table of extracted values, which is then processed by `FillEmptyCells`, validated, and sent to an ERP system.

---

## **Input**

| **Field** | **Required** | **Description**                                                              |
| --------- | ------------ | ---------------------------------------------------------------------------- |
| PDF Files | Yes          | One or more PDF documents to extract content from. Must be machine-readable. |

---

## **Output**

| **Output Type** | **Format** | **Description**                                                                         |
| --------------- | ---------- | --------------------------------------------------------------------------------------- |
| `Data`          | Table      | List of rows containing extracted key-value pairs. One row per document or data region. |

---

## **Configuration Fields**

| **Field Name**       | **Description**                                                                           |
| -------------------- | ----------------------------------------------------------------------------------------- |
| **Markers**          | List of start and end markers to segment sections of interest in the PDF.                 |
| **Column Map**       | Maps the extracted values to specific columns (e.g., map "Invoice Total" to `Amount`).    |
| **Regex Extractors** | List of regular expressions to extract text patterns like email, invoice number, or date. |

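The way a start/end marker pair narrows the text before any pattern is applied can be sketched in plain Python. This is a minimal illustration, not the activity's actual implementation: the helper name `extract_between_markers`, the sample page text, and the choice to ignore an unterminated region are all assumptions made for the example.

```python
def extract_between_markers(text, start_marker, end_marker):
    """Return every chunk of text found between a start and an end marker."""
    regions = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break  # no further start marker
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break  # unterminated region: ignored in this sketch
        regions.append(text[start:end].strip())
        pos = end + len(end_marker)
    return regions


# Hypothetical text layer of a machine-readable PDF page.
page_text = (
    "Header text\n"
    "StartInvoiceDetails\n"
    "Invoice No: INV-12345\n"
    "Amount Due: 500.00\n"
    "EndInvoiceDetails\n"
    "Footer text\n"
)

regions = extract_between_markers(
    page_text, "StartInvoiceDetails", "EndInvoiceDetails"
)
print(regions)
```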
---

## **Sample Input**

_Not Applicable_

---

## **Sample Configuration**

| Field              | Value                                                                               |
| ------------------ | ----------------------------------------------------------------------------------- |
| `Markers`          | `["StartInvoiceDetails", "EndInvoiceDetails"]`                                      |
| `Column Map`       | `{"Amount Due": "Total", "Invoice No": "Invoice Number"}`                           |
| `Regex Extractors` | `["(?i)Invoice\\s*No\\s*:\\s*([\\w-]+)", "(?i)Amount\\s*Due\\s*:\\s*(\\d+\\.\\d{2})"]` |

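To show how regex extractors and a column map might combine into one output row, here is a rough Python sketch using patterns modelled on the sample configuration. Several details are assumptions: the helper `extract_row` is hypothetical, each pattern is keyed by the field it extracts (the activity itself takes a plain list), and the first capture group is treated as the value. The invoice pattern uses `[\w-]+` because a plain `\w+` group stops at the hyphen in a value such as `INV-12345`. The patterns are written as Python raw strings, so the backslashes are not doubled as they are in the JSON-style configuration.

```python
import re

# Patterns modelled on the sample configuration (assumed layout: one
# capture group per pattern, keyed by the extracted field's name).
REGEX_EXTRACTORS = {
    "Invoice No": r"(?i)Invoice\s*No\s*:\s*([\w-]+)",
    "Amount Due": r"(?i)Amount\s*Due\s*:\s*(\d+\.\d{2})",
}

# Extracted field name -> output column name.
COLUMN_MAP = {"Amount Due": "Total", "Invoice No": "Invoice Number"}


def extract_row(text, extractors, column_map):
    """Apply each pattern to the text and build one row of output columns."""
    row = {}
    for field, pattern in extractors.items():
        match = re.search(pattern, text)
        # Fields without a mapping keep their own name as the column name.
        column = column_map.get(field, field)
        row[column] = match.group(1) if match else None
    return row


# Hypothetical text taken from one invoice region.
sample_text = "Invoice No: INV-12345\nCustomer: Acme Corp\nAmount Due: 500.00"

row = extract_row(sample_text, REGEX_EXTRACTORS, COLUMN_MAP)
print(row)
```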
---

## **Sample Output**

| Invoice Number | Customer Name | Total   | Date       |
| -------------- | ------------- | ------- | ---------- |
| INV-12345      | Acme Corp     | 500.00  | 2025-07-01 |
| INV-67890      | Beta Ltd.     | 1200.50 | 2025-07-02 |

> Output rows can be filtered, transformed, or routed in downstream activities such as `Filter`, `Send Email`, or `GoogleSheetUpload`.

---
