Description
The Tokenizing Text activity splits the text in one or more columns into individual tokens (words or phrases), making the content easier to analyze or manipulate. It can also apply the following Natural Language Processing (NLP) techniques:
- Stop word removal: Eliminate common filler words (e.g., “the”, “is”, “and”).
- Stemming: Reduce words to their root forms (e.g., “running” → “run”).
- Sorting: Sort words alphabetically for consistency or matching.
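As a rough illustration of how these three options combine, the following Python sketch applies them in sequence to one string. It is a minimal sketch, not the activity's implementation: the stop-word list, the `naive_stem` suffix stripper, and the `process` function with its parameter names are all hypothetical, and the activity's actual tokenizer, stop-word list, and stemming algorithm are not documented here.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the activity's real list.
STOP_WORDS = {"a", "an", "and", "in", "is", "the", "this"}


def tokenize(text):
    """Split text into word tokens on any non-alphanumeric character."""
    return re.findall(r"[A-Za-z0-9]+", text)


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def naive_stem(token):
    """Toy suffix stripper (hypothetical), NOT a real stemmer such as Porter."""
    word = token.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Collapse a doubled final consonant: "runn" -> "run".
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def process(text, clear_stop_words=False, stem_words=False, sort_words=False):
    """Tokenize text, then apply each enabled option in a fixed order."""
    tokens = tokenize(text)
    if clear_stop_words:
        tokens = remove_stop_words(tokens)
    if stem_words:
        tokens = [naive_stem(t) for t in tokens]
    if sort_words:
        tokens = sorted(tokens, key=str.lower)
    return tokens
```

With all three options enabled, `process("The dog is running and jumping", True, True, True)` yields `["dog", "jump", "run"]` under these toy rules. A production stemmer would behave differently on many words.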
The output can be structured in different formats based on the selected Option Mode:
- JSON array of tokens
- One token per row
- One token per column
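The three output structures can be pictured with a short Python sketch that reshapes the same token lists in each way. The function names are hypothetical, and two details are assumptions taken from the sample output on this page rather than from a specification: numbered column names (`TokenizedText1`, `TokenizedText2`, ...) and empty strings as padding for shorter rows.

```python
import json


def to_json_mode(rows, tokens_per_row, out_col):
    """One output row per input row; tokens stored as a JSON array string."""
    return [{**row, out_col: json.dumps(toks)}
            for row, toks in zip(rows, tokens_per_row)]


def to_row_mode(rows, tokens_per_row, out_col):
    """One output row per token; the other columns are repeated."""
    return [{**row, out_col: tok}
            for row, toks in zip(rows, tokens_per_row)
            for tok in toks]


def to_column_mode(rows, tokens_per_row, out_col):
    """One output column per token position, padded with empty strings."""
    width = max((len(toks) for toks in tokens_per_row), default=0)
    result = []
    for row, toks in zip(rows, tokens_per_row):
        new_row = dict(row)
        for i in range(width):
            # Assumption: columns are named out_col + 1-based position.
            new_row[f"{out_col}{i + 1}"] = toks[i] if i < len(toks) else ""
        result.append(new_row)
    return result
```

Row mode suits frequency counts and joins, column mode suits side-by-side comparison, and JSON mode keeps the row count unchanged.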
Use case:
A user wants to clean up a large block of product reviews to analyze key terms used by customers. This activity can tokenize the text, remove noise (stop words), and optionally stem or sort terms, preparing the data for visualizations like word clouds or frequency analysis.
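To make the frequency-analysis step concrete, here is a minimal Python sketch of what a downstream consumer might do with the tokenized output. It is not part of the activity; the `term_frequencies` helper and the review tokens are made up for illustration.

```python
from collections import Counter


def term_frequencies(token_lists):
    """Count case-insensitive token occurrences across all rows."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(t.lower() for t in tokens)
    return counts


# Hypothetical tokens from two product reviews, after stop-word removal.
review_tokens = [["Battery", "life", "great"], ["battery", "died", "fast"]]
top_terms = term_frequencies(review_tokens).most_common(1)
```

The resulting counts can feed a word cloud or a bar chart of the most frequent terms.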
Input
- Data – Required
Tabular data with one or more columns containing text content.
Output
Output Type | Format | Description |
---|---|---|
Data | Tabular | Tokenized version of the input text data, structured according to the selected Option Mode. |
Configuration Fields
Field Name | Description |
---|---|
Column Names | Required. List of one or more text columns to tokenize. |
Option Mode | Output format for the tokens. Options: JSON, One token per row, One token per column. |
Output Column | Required. Name of the column(s) that will store the tokenized output. |
Include Original | Optional. If enabled, retains the original column data in addition to the transformed output. |
Clear Stop Words | Optional. Removes common stop words like “is”, “and”, “the”, etc. |
Stem Words | Optional. Reduces words to their base/root form (e.g., “running” → “run”). |
Sort Words | Optional. Sorts tokens alphabetically to standardize formatting. |
Sample Input
ID | TextColumn |
---|---|
1 | This is a sample text. |
2 | Tokenizing helps in NLP tasks. |
Sample Configuration
Field | Value |
---|---|
Column Names | TextColumn |
Option Mode | One token per row |
Output Column | TokenizedText |
Include Original | true |
Clear Stop Words | true |
Stem Words | false |
Sort Words | false |
Sample Output: One token per row
ID | TextColumn | TokenizedText |
---|---|---|
1 | This is a sample text. | sample |
1 | This is a sample text. | text |
2 | Tokenizing helps in NLP tasks. | tokenizing |
2 | Tokenizing helps in NLP tasks. | helps |
2 | Tokenizing helps in NLP tasks. | NLP |
2 | Tokenizing helps in NLP tasks. | tasks |
Sample Output: One token per column
ID | TextColumn | TokenizedText1 | TokenizedText2 | TokenizedText3 | TokenizedText4 |
---|---|---|---|---|---|
1 | This is a sample text. | sample | text | | |
2 | Tokenizing helps in NLP tasks. | tokenizing | helps | NLP | tasks |