Extract Text from PDF
Description
The Extract Text from PDF activity processes uploaded PDFs and extracts structured or free-form text using a combination of markers, regex extractors, and column mapping. This activity supports parsing structured documents such as invoices, contracts, or system-generated reports where field boundaries and patterns can be identified.
Unlike OCR-based activities, this activity assumes that the PDFs are machine-readable, with a directly accessible text layer.
Use case:
A PDF invoice from a vendor is parsed with this activity by setting regex extractors for “Invoice Number”, “Customer Name”, and “Amount Due”. The output is a structured table of extracted values, which is then processed by FillEmptyCells, validated, and sent to an ERP system.
Input
Field | Required | Description |
---|---|---|
PDF Files | Yes | One or more PDF documents to extract content from. Must be machine-readable. |
Output
Output Type | Format | Description |
---|---|---|
Data | Table | List of rows containing extracted key-value pairs. One row per document or data region. |
Configuration Fields
Field Name | Description |
---|---|
Markers | List of start and end markers to segment sections of interest in the PDF. |
Column Map | Maps the extracted values to specific columns (e.g., map “Invoice Total” to Amount). |
Regex Extractors | List of regular expressions to extract text patterns like email, invoice number, or date. |
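To make the role of markers concrete, here is a minimal Python sketch of segmenting text between a start and an end marker. It is an illustration only: the function name `segment_between_markers`, the sample text, and the marker strings are assumptions, and the activity's actual implementation may differ.

```python
import re
from typing import List


def segment_between_markers(text: str, start: str, end: str) -> List[str]:
    """Return the text found between each start/end marker pair (hypothetical helper)."""
    # Escape the markers so they are matched literally, and use DOTALL so
    # a section may span several lines. The lazy group stops at the first end marker.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    return [section.strip() for section in pattern.findall(text)]


# Hypothetical text layer of a machine-readable PDF page.
page_text = (
    "Header text\n"
    "StartInvoiceDetails\n"
    "Invoice No: INV-12345\n"
    "Amount Due: 500.00\n"
    "EndInvoiceDetails\n"
    "Footer text\n"
)

sections = segment_between_markers(page_text, "StartInvoiceDetails", "EndInvoiceDetails")
print(sections)
```

Only the text inside the marker pair is kept, so regex extractors applied afterwards do not match header or footer content by accident.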
Sample Input
Not Applicable
Sample Configuration
Field | Value |
---|---|
Markers | ["StartInvoiceDetails", "EndInvoiceDetails"] |
Column Map | {"Amount Due": "Total", "Invoice No": "Invoice Number"} |
Regex Extractors | ["(?i)Invoice\\s*No\\s*:\\s*([\\w-]+)", "(?i)Amount\\s*Due\\s*:\\s*(\\d+\\.\\d{2})"] |
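The following Python sketch shows how patterns modelled on this sample configuration might behave on a section of text. Several points are assumptions rather than documented behavior: the patterns are keyed by label here (the configuration lists them without labels), the Column Map is read as "extracted label to output column", and the invoice-number group is written as `[\w-]+`, since `\w+` alone stops at a hyphen and would capture only `INV` from `INV-12345`.

```python
import re
from typing import Dict, Optional

# Patterns modelled on the sample configuration, keyed by an assumed label.
REGEX_EXTRACTORS = {
    "Invoice No": r"(?i)Invoice\s*No\s*:\s*([\w-]+)",
    "Amount Due": r"(?i)Amount\s*Due\s*:\s*(\d+\.\d{2})",
}

# Assumed direction: extracted label -> output column name.
COLUMN_MAP = {"Amount Due": "Total", "Invoice No": "Invoice Number"}


def extract_row(section: str) -> Dict[str, Optional[str]]:
    """Apply each extractor to the section and rename columns (hypothetical helper)."""
    row: Dict[str, Optional[str]] = {}
    for label, pattern in REGEX_EXTRACTORS.items():
        match = re.search(pattern, section)
        column = COLUMN_MAP.get(label, label)
        # Keep the first capture group, or None when the pattern does not match.
        row[column] = match.group(1) if match else None
    return row


row = extract_row("Invoice No: INV-12345\nAmount Due: 500.00")
print(row)
```

A field whose pattern finds no match is returned as `None` in this sketch, which is the kind of gap a downstream step such as FillEmptyCells could then handle.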
Sample Output
Invoice No | Customer Name | Amount Due | Date |
---|---|---|---|
INV-12345 | Acme Corp | 500.00 | 2025-07-01 |
INV-67890 | Beta Ltd. | 1200.50 | 2025-07-02 |
Output rows can be filtered, transformed, or routed in downstream activities such as Filter, Send Email, or GoogleSheetUpload.