Tokenizing text

Description

The Tokenizing Text activity processes textual data from one or more columns and converts the values into individual tokens (words or phrases), making them easier to analyze or manipulate. It provides options to apply Natural Language Processing (NLP) techniques such as:

  • Stop word removal: Eliminate common filler words (e.g., “the”, “is”, “and”).
  • Stemming: Reduce words to their root forms (e.g., “running” → “run”).
  • Sorting: Sort words alphabetically for consistency or matching.
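The three techniques above can be sketched in a few lines of Python. This is only an illustration, not the activity's actual implementation: the stop-word list, the `naive_stem` helper, the token pattern, and the case-insensitive sort are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "in", "is", "the", "this"}


def naive_stem(word):
    """Crude suffix stripping for illustration; not a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if suffix == "ing" and len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def tokenize(text, clear_stop_words=False, stem_words=False, sort_words=False):
    """Split text into word tokens, then apply the optional steps in order."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    if clear_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem_words:
        tokens = [naive_stem(t) for t in tokens]
    if sort_words:
        # Case-insensitive alphabetical order (an assumption of this sketch).
        tokens = sorted(tokens, key=str.lower)
    return tokens


print(tokenize("This is a sample text.", clear_stop_words=True))
```

A production stemmer (for example, a Porter-style algorithm) handles far more suffixes and exceptions than this helper does.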

The output can be structured in different formats based on the selected Option Mode:

  • JSON array of tokens
  • One token per row
  • One token per column
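To make the difference between the three modes concrete, the sketch below shapes already-tokenized rows into each format. The `shape_output` function and its record layout are hypothetical; the numbered column names (`TokenizedText1`, `TokenizedText2`, ...) follow the sample output shown later on this page.

```python
import json


def shape_output(rows, mode, output_column="TokenizedText"):
    """rows is a list of (row_id, tokens) pairs; returns a list of dict records."""
    if mode == "JSON":
        # One record per input row, tokens serialized as a JSON array string.
        return [{"ID": rid, output_column: json.dumps(toks)} for rid, toks in rows]
    if mode == "One token per row":
        # One record per token; the row ID repeats for every token.
        return [{"ID": rid, output_column: tok} for rid, toks in rows for tok in toks]
    if mode == "One token per column":
        # One record per input row, with a numbered column for each token.
        return [
            {"ID": rid, **{f"{output_column}{i}": tok for i, tok in enumerate(toks, start=1)}}
            for rid, toks in rows
        ]
    raise ValueError(f"unknown mode: {mode}")


rows = [(1, ["sample", "text"])]
for mode in ("JSON", "One token per row", "One token per column"):
    print(mode, shape_output(rows, mode))
```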

Use case:
A user wants to clean up a large block of product reviews to analyze key terms used by customers. This activity can tokenize the text, remove noise (stop words), and optionally stem or sort terms, preparing the data for visualizations like word clouds or frequency analysis.

Input

  • Data – Required
    Tabular data with one or more columns containing text content.

Output

| Output Type | Format | Description |
| --- | --- | --- |
| Data | Tabular | Tokenized version of the input text data, structured according to the selected Option Mode. |

Configuration Fields

| Field Name | Description |
| --- | --- |
| Column Names | Required. List of one or more text columns to tokenize. |
| Option Mode | Token format output mode. Options: JSON, One token per row, One token per column. |
| Output Column | Required. Name of the column(s) that will store the tokenized output. |
| Include Original | Optional. If enabled, retains the original column data in addition to the transformed output. |
| Clear Stop Words | Optional. Removes common stop words like “is”, “and”, “the”, etc. |
| Stem Words | Optional. Reduces words to their base/root form (e.g., “running” → “run”). |
| Sort Words | Optional. Sorts tokens alphabetically to standardize formatting. |

Sample Input

| ID | TextColumn |
| --- | --- |
| 1 | This is a sample text. |
| 2 | Tokenizing helps in NLP tasks. |

Sample Configuration

| Field | Value |
| --- | --- |
| Column Names | TextColumn |
| Option Mode | One token per row |
| Output Column | TokenizedText |
| Include Original | true |
| Clear Stop Words | true |
| Stem Words | true |
| Sort Words | true |

Sample Output: One token per row

| ID | TextColumn | TokenizedText |
| --- | --- | --- |
| 1 | This is a sample text. | sample |
| 1 | This is a sample text. | text |
| 2 | Tokenizing helps in NLP tasks. | tokenizing |
| 2 | Tokenizing helps in NLP tasks. | helps |
| 2 | Tokenizing helps in NLP tasks. | NLP |
| 2 | Tokenizing helps in NLP tasks. | tasks |

Sample Output: One token per column

| ID | TextColumn | TokenizedText1 | TokenizedText2 | TokenizedText3 | TokenizedText4 |
| --- | --- | --- | --- | --- | --- |
| 1 | This is a sample text. | sample | text | | |
| 2 | Tokenizing helps in NLP tasks. | tokenizing | helps | NLP | tasks |
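In the one-token-per-column layout, rows with fewer tokens leave their trailing columns empty. A minimal sketch of that padding is shown below. It assumes the number of token columns equals the longest token list and that missing cells are filled with an empty string; the `to_columns` helper is hypothetical and may not match the activity's actual behavior.

```python
def to_columns(rows, output_column="TokenizedText", fill=""):
    """Build a rectangular table from (row_id, tokens) pairs of varying length."""
    # Assumption: the table is as wide as the longest token list.
    width = max((len(toks) for _, toks in rows), default=0)
    header = ["ID"] + [f"{output_column}{i}" for i in range(1, width + 1)]
    # Pad shorter rows on the right so every row has the same number of cells.
    body = [[rid] + list(toks) + [fill] * (width - len(toks)) for rid, toks in rows]
    return header, body


header, body = to_columns([
    (1, ["sample", "text"]),
    (2, ["tokenizing", "helps", "NLP", "tasks"]),
])
print(header)
for row in body:
    print(row)
```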