Extract Text from PDF

Description

The Extract Text from PDF activity processes uploaded PDFs and extracts structured or free-form text using a combination of markers, regex extractors, and column mapping. This activity supports parsing structured documents such as invoices, contracts, or system-generated reports where field boundaries and patterns can be identified.

Unlike OCR-based activities, this activity assumes that each PDF is machine-readable, so its text layer can be read directly.

Use case:
A vendor's PDF invoice is parsed by configuring regex extractors for “Invoice Number”, “Customer Name”, and “Amount Due”. The output is a structured table of extracted values, which is then processed by FillEmptyCells, validated, and sent to an ERP system.

Input

Field	Required	Description
PDF Files	Yes	One or more PDF documents to extract content from. Must be machine-readable.

Output

Output Type	Format	Description
`Data`	Table	List of rows containing extracted key-value pairs. One row per document or data region.

Configuration Fields

Field Name	Description
Markers	List of start and end markers that delimit the sections of interest in the PDF.
Column Map	Maps the extracted values to specific columns (e.g., map “Invoice Total” to `Amount`).
Regex Extractors	List of regular expressions to extract text patterns like email, invoice number, or date.
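
The sketch below illustrates one way the three configuration fields could work together: markers cut out a section, regex extractors pull values from it, and the column map renames the results. It is not the activity's actual implementation. The function name `extract_fields` and the pairing of each pattern with a label are assumptions made for illustration, and Python's `re` semantics are assumed for the patterns.

```python
import re

def extract_fields(text, markers, extractors, column_map):
    """Hypothetical pipeline: segment by markers, apply regexes, rename columns."""
    start_marker, end_marker = markers
    start = text.find(start_marker)
    if start == -1:
        return {}  # start marker not found: nothing to extract
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return {}  # end marker not found: nothing to extract
    section = text[start + len(start_marker):end]

    row = {}
    for label, pattern in extractors.items():
        match = re.search(pattern, section)
        if match:
            # Rename the label via the column map, falling back to the label itself.
            row[column_map.get(label, label)] = match.group(1)
    return row

# Plain text standing in for the text layer of a PDF (assumed input).
text = (
    "Header\n"
    "StartInvoiceDetails\n"
    "Invoice No : INV-12345\n"
    "Amount Due : 500.00\n"
    "EndInvoiceDetails\n"
    "Footer Amount Due : 9.99\n"
)
row = extract_fields(
    text,
    markers=["StartInvoiceDetails", "EndInvoiceDetails"],
    extractors={
        "Invoice No": r"(?i)Invoice\s+No\s*:\s*([\w-]+)",
        "Amount Due": r"(?i)Amount\s+Due\s*:\s*(\d+\.\d{2})",
    },
    column_map={"Amount Due": "Total", "Invoice No": "Invoice Number"},
)
print(row)  # {'Invoice Number': 'INV-12345', 'Total': '500.00'}
```

Because the regexes run only on the text between the markers, the "Amount Due" value in the footer is ignored.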

Sample Input

Not Applicable

Sample Configuration

Field	Value
`Markers`	`["StartInvoiceDetails", "EndInvoiceDetails"]`
`Column Map`	`{"Amount Due": "Total", "Invoice No": "Invoice Number"}`
`Regex Extractors`	`["(?i)Invoice\\sNo\\s:\\s([\\w-]+)", "(?i)Amount\\sDue\\s:\\s(\\d+\\.\\d{2})"]`
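
The doubled backslashes in the regex list are escaping for the surrounding quoted strings, not part of the patterns themselves. The sketch below assumes the value is parsed as JSON and that the patterns follow Python's `re` semantics; the regex engine the activity really uses is not stated here. The invoice pattern uses the class `[\w-]` because `\w` alone stops at the hyphen in an ID such as INV-12345. The sample line of text is invented for illustration.

```python
import json
import re

# Raw configuration text: inside JSON strings, each backslash is written twice.
raw = r'["(?i)Invoice\\sNo\\s:\\s([\\w-]+)", "(?i)Amount\\sDue\\s:\\s(\\d+\\.\\d{2})"]'
patterns = json.loads(raw)  # "\\s" in the JSON text decodes to "\s" in the pattern

# Hypothetical line of text taken from a machine-readable invoice.
line = "Invoice No : INV-12345   Amount Due : 500.00"

# Collect the first capture group of every pattern that matches.
values = [m.group(1) for p in patterns if (m := re.search(p, line))]
print(values)  # ['INV-12345', '500.00']

# For comparison: \w+ on its own captures only the part before the hyphen.
partial = re.search(r"(\w+)", "INV-12345").group(1)
print(partial)  # INV
```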

Sample Output

Invoice No	Customer Name	Amount Due	Date
INV-12345	Acme Corp	500.00	2025-07-01
INV-67890	Beta Ltd.	1200.50	2025-07-02

Output rows can be filtered, transformed, or routed in downstream activities such as Filter, Send Email, or GoogleSheetUpload.
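
As a rough illustration of downstream filtering, the sketch below treats the sample output as a list of dictionaries and keeps only rows at or above a threshold amount. It does not show how the Filter activity is implemented. The row representation and the helper `filter_by_min_amount` are assumptions made for this example.

```python
from decimal import Decimal

# Sample output rows, represented here as dictionaries (assumed structure).
rows = [
    {"Invoice No": "INV-12345", "Customer Name": "Acme Corp",
     "Amount Due": "500.00", "Date": "2025-07-01"},
    {"Invoice No": "INV-67890", "Customer Name": "Beta Ltd.",
     "Amount Due": "1200.50", "Date": "2025-07-02"},
]

def filter_by_min_amount(rows, threshold):
    """Keep rows whose "Amount Due" is at least the threshold."""
    # Extracted values are text, so convert them before comparing.
    return [row for row in rows if Decimal(row["Amount Due"]) >= threshold]

large = filter_by_min_amount(rows, Decimal("1000"))
print([row["Invoice No"] for row in large])  # ['INV-67890']
```

Decimal is used instead of float so that currency amounts compare exactly.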