Tokenizing Text

Description

This activity splits the text in the selected columns into tokens and applies the transformations you enable. Available transformations are removing stop words, stemming words, and sorting words.

Input

Data only

Output

Transformed data

Configuration Fields

Field	Description
Column Names	The columns containing text data to tokenize.
Option Mode	The tokenization mode. Options: `JSON`, `One token per row`, `One token per column`.
Output Column	The column where the tokenized text will be stored.
Include Original	If enabled, the input columns will be retained along with the transformed column.
Clear Stop Words	If enabled, common stop words (such as “and”, “the”, “is”) will be removed.
Stem Words	If enabled, words will be reduced to their root form (e.g., “running” → “run”).
Sort Words	If enabled, words in the column will be sorted alphabetically.
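
To make the effect of the three transformation options concrete, here is a minimal Python sketch of the same idea. It is not the activity's implementation: the stop-word list, the `naive_stem` helper, the token pattern, and the order in which the options are applied are all assumptions made for illustration. The sketch also keeps the original casing of each token, which may differ from what the activity does.

```python
import re

# Hypothetical stop-word list; the activity's actual list is not documented here.
STOP_WORDS = {"a", "an", "and", "in", "is", "the", "this"}


def naive_stem(word):
    """Rough suffix stripping, standing in for a real stemming algorithm."""
    lower = word.lower()
    if lower.endswith("ing") and len(lower) > 5:
        stem = word[:-3]
        # Collapse a doubled final consonant, e.g. "runn" -> "run".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if lower.endswith("s") and not lower.endswith("ss") and len(lower) > 3:
        return word[:-1]
    return word


def tokenize(text, clear_stop_words=False, stem_words=False, sort_words=False):
    """Split text into word tokens, then apply the enabled transformations."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    if clear_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem_words:
        tokens = [naive_stem(t) for t in tokens]
    if sort_words:
        # Case-insensitive alphabetical order (an assumption).
        tokens = sorted(tokens, key=str.lower)
    return tokens


print(tokenize("This is a sample text.", clear_stop_words=True))
print(tokenize("Tokenizing helps in NLP tasks.", clear_stop_words=True, sort_words=True))
```

The parameter names mirror the Clear Stop Words, Stem Words, and Sort Words fields so that each flag can be toggled independently, as in the configuration.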

Sample Input

ID	TextColumn
1	This is a sample text.
2	Tokenizing helps in NLP tasks.

Sample Configuration

[Image: sample configuration]

Sample Output

With `One token per row` mode

ID	TextColumn	TokenizedText
1	This is a sample text.	sample
1	This is a sample text.	text
2	Tokenizing helps in NLP tasks.	tokenizing
2	Tokenizing helps in NLP tasks.	helps
2	Tokenizing helps in NLP tasks.	NLP
2	Tokenizing helps in NLP tasks.	tasks

With `One token per column` mode

ID	TextColumn	TokenizedText1	TokenizedText2	TokenizedText3	TokenizedText4
1	This is a sample text.	sample	text
2	Tokenizing helps in NLP tasks.	tokenizing	helps	NLP	tasks
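
The difference between the option modes comes down to how each record's token list is laid out. The following Python sketch reshapes already-tokenized rows into each layout. The function names and the `tokens` key are hypothetical. The `JSON` layout, shown here as a JSON array string in the output column, is an assumption, since no sample output for that mode is given above.

```python
import json

# Rows from the sample input, with tokens already extracted (stop words removed).
rows = [
    {"ID": 1, "TextColumn": "This is a sample text.",
     "tokens": ["sample", "text"]},
    {"ID": 2, "TextColumn": "Tokenizing helps in NLP tasks.",
     "tokens": ["tokenizing", "helps", "NLP", "tasks"]},
]


def _base(row):
    """Copy a row without the helper 'tokens' key."""
    return {k: v for k, v in row.items() if k != "tokens"}


def one_token_per_row(rows, output_column="TokenizedText"):
    """Repeat the original columns once for every token."""
    return [{**_base(row), output_column: token}
            for row in rows for token in row["tokens"]]


def one_token_per_column(rows, output_column="TokenizedText"):
    """Spread tokens across numbered columns, leaving unused cells empty."""
    width = max((len(row["tokens"]) for row in rows), default=0)
    out = []
    for row in rows:
        new_row = _base(row)
        for i in range(width):
            tokens = row["tokens"]
            new_row[f"{output_column}{i + 1}"] = tokens[i] if i < len(tokens) else ""
        out.append(new_row)
    return out


def as_json(rows, output_column="TokenizedText"):
    """Assumed JSON layout: one JSON array string per input row."""
    return [{**_base(row), output_column: json.dumps(row["tokens"])}
            for row in rows]


for record in one_token_per_row(rows):
    print(record)
```

In this sketch, the per-row layout yields one record per token (six for the sample data), while the per-column layout keeps one record per input row and pads shorter token lists with empty cells.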