Extract Text from PDF

Description

The Extract Text from PDF activity processes uploaded PDFs and extracts structured or free-form text using a combination of markers, regex extractors, and column mapping. This activity supports parsing structured documents such as invoices, contracts, or system-generated reports where field boundaries and patterns can be identified.

Unlike OCR-based activities, this activity assumes that PDFs are machine-readable, meaning the text content layer is directly accessible.

Use case:
A vendor's PDF invoice is parsed by configuring regex extractors for “Invoice Number”, “Customer Name”, and “Amount Due”. The activity outputs a structured table of extracted values, which is then processed by FillEmptyCells, validated, and sent to an ERP system.


Input

| Field | Required | Description |
| --- | --- | --- |
| PDF Files | Yes | One or more PDF documents to extract content from. Must be machine-readable. |

Output

| Output Type | Format | Description |
| --- | --- | --- |
| Data | Table | List of rows containing extracted key-value pairs. One row per document or data region. |

Configuration Fields

| Field Name | Description |
| --- | --- |
| Markers | List of start and end markers to segment sections of interest in the PDF. |
| Column Map | Maps the extracted values to specific columns (e.g., map “Invoice Total” to Amount). |
| Regex Extractors | List of regular expressions to extract text patterns like email, invoice number, or date. |
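To make the role of Markers concrete, here is a minimal Python sketch of marker-based segmentation. It is not the activity's actual implementation: the function name `segment_by_markers`, the sample text, and the assumption that the first marker opens a region and the second closes it are all illustrative.

```python
import re


def segment_by_markers(text, start_marker, end_marker):
    """Return each stretch of text found between a start marker and the next end marker."""
    # Markers are escaped so they are matched literally, not as regex syntax.
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    # DOTALL lets a region span several lines; the non-greedy group stops at the first end marker.
    return [section.strip() for section in re.findall(pattern, text, flags=re.DOTALL)]


# Made-up text with two marked regions, standing in for extracted PDF content.
SAMPLE_TEXT = (
    "Header text\n"
    "StartInvoiceDetails\nInvoice No: INV-12345\nEndInvoiceDetails\n"
    "Footer text\n"
    "StartInvoiceDetails\nInvoice No: INV-67890\nEndInvoiceDetails\n"
)

sections = segment_by_markers(SAMPLE_TEXT, "StartInvoiceDetails", "EndInvoiceDetails")
print(sections)
```

Under these assumptions each marked region becomes one candidate data region, which fits the "one row per document or data region" description of the output.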

Sample Input

Not Applicable


Sample Configuration

| Field | Value |
| --- | --- |
| Markers | `["StartInvoiceDetails", "EndInvoiceDetails"]` |
| Column Map | `{"Amount Due": "Total", "Invoice No": "Invoice Number"}` |
| Regex Extractors | `["(?i)Invoice\\s*No\\s*:\\s*([\\w-]+)", "(?i)Amount\\s*Due\\s*:\\s*(\\d+\\.\\d{2})"]` |
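The following Python sketch shows one way regex extractors and a column map could combine into a single output row. It rests on several assumptions that the configuration above does not confirm: each pattern is paired with a hypothetical field name, the column map is read as "extracted field name to output column name", and the sample text is invented. The invoice-number pattern uses a capture group that accepts hyphens so that values such as INV-12345 are captured whole.

```python
import re

# Hypothetical pairing of field names with patterns; the activity takes a plain
# list of regexes, so these names are an assumption made for illustration.
REGEX_EXTRACTORS = {
    "Invoice No": r"(?i)Invoice\s*No\s*:\s*([\w-]+)",
    "Amount Due": r"(?i)Amount\s*Due\s*:\s*(\d+\.\d{2})",
}

# Assumed direction: extracted field name -> output column name.
COLUMN_MAP = {"Amount Due": "Total", "Invoice No": "Invoice Number"}


def extract_row(text, extractors, column_map):
    """Search once per pattern and store its first capture group under the mapped column."""
    row = {}
    for field, pattern in extractors.items():
        match = re.search(pattern, text)
        column = column_map.get(field, field)  # unmapped fields keep their own name
        row[column] = match.group(1) if match else None
    return row


# Made-up page text standing in for a machine-readable PDF's content layer.
SAMPLE_TEXT = "Invoice No: INV-12345\nCustomer: Acme Corp\nAmount Due: 500.00\n"

row = extract_row(SAMPLE_TEXT, REGEX_EXTRACTORS, COLUMN_MAP)
print(row)
```

In this sketch a pattern that finds nothing yields `None` rather than raising an error, leaving the gap for a later step to handle.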

Sample Output

| Invoice No | Customer Name | Amount Due | Date |
| --- | --- | --- | --- |
| INV-12345 | Acme Corp | 500.00 | 2025-07-01 |
| INV-67890 | Beta Ltd. | 1200.50 | 2025-07-02 |

Output rows can be filtered, transformed, or routed in downstream activities such as Filter, Send Email, or GoogleSheetUpload.
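As a rough illustration of downstream filtering, the Python sketch below treats the sample output as a list of dictionaries and keeps rows at or above an amount threshold. It does not use the Filter activity's real interface; the `filter_rows` helper and the assumption that amounts arrive as strings are hypothetical.

```python
from decimal import Decimal

# Sample output rows, modelled here as dictionaries with string values (an assumption).
rows = [
    {"Invoice No": "INV-12345", "Customer Name": "Acme Corp", "Amount Due": "500.00", "Date": "2025-07-01"},
    {"Invoice No": "INV-67890", "Customer Name": "Beta Ltd.", "Amount Due": "1200.50", "Date": "2025-07-02"},
]


def filter_rows(rows, min_amount):
    """Keep rows whose Amount Due is at least min_amount."""
    # Decimal avoids floating-point rounding when comparing currency values.
    threshold = Decimal(min_amount)
    return [row for row in rows if Decimal(row["Amount Due"]) >= threshold]


large_invoices = filter_rows(rows, "1000.00")
print([row["Invoice No"] for row in large_invoices])
```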