Tokenizing Text
Description
This activity splits the text in the selected columns into tokens and applies the transformations you enable. Available transformations are clearing stop words, stemming words, and sorting words.
Input
Data only
Output
Transformed data
Configuration Fields
- Column Names The columns containing text data to tokenize.
- Option Mode The tokenization mode. Options: JSON, One token per row, One token per column.
- Output Column The column where the tokenized text will be stored.
- Include Original If enabled, the input columns will be retained along with the transformed column.
- Clear Stop Words If enabled, common stop words (such as “and”, “the”, “is”) will be removed.
- Stem Words If enabled, words will be reduced to their root form (e.g., “running” → “run”).
- Sort Words If enabled, words in the column will be sorted alphabetically.
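The three optional transformations can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the activity's implementation: the `tokenize` and `naive_stem` functions, the `STOP_WORDS` set, and the splitting pattern are all assumptions made for the example, and the activity's real stop-word list and stemmer are not documented here.

```python
import re

# Illustrative stop-word list (assumption); the activity's actual list may differ.
STOP_WORDS = {"a", "an", "and", "in", "is", "the", "this"}


def naive_stem(word):
    """Very rough suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "runn" -> "run".
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            return word
    return word


def tokenize(text, clear_stop_words=False, stem_words=False, sort_words=False):
    """Split text into word tokens, then apply the enabled transformations."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    if clear_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem_words:
        tokens = [naive_stem(t) for t in tokens]
    if sort_words:
        tokens = sorted(tokens, key=str.lower)
    return tokens
```

For example, `tokenize("This is a sample text.", clear_stop_words=True)` keeps only `sample` and `text`. The sketch applies the transformations in a fixed order (stop words, then stemming, then sorting); whether the activity uses the same order is not stated in this document.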
Sample Input
ID | TextColumn |
---|---|
1 | This is a sample text. |
2 | Tokenizing helps in NLP tasks. |
Sample Configuration
Sample Output
With One token per row mode
ID | TextColumn | TokenizedText |
---|---|---|
1 | This is a sample text. | sample |
1 | This is a sample text. | text |
2 | Tokenizing helps in NLP tasks. | tokenizing |
2 | Tokenizing helps in NLP tasks. | helps |
2 | Tokenizing helps in NLP tasks. | NLP |
2 | Tokenizing helps in NLP tasks. | tasks |
With One token per column mode
ID | TextColumn | TokenizedText1 | TokenizedText2 | TokenizedText3 | TokenizedText4 |
---|---|---|---|---|---|
1 | This is a sample text. | sample | text | | |
2 | Tokenizing helps in NLP tasks. | tokenizing | helps | NLP | tasks |
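The difference between the two modes shown above is only in how the tokens are laid out. The following Python sketch reproduces both layouts from the sample data; the function names, the row representation (a list of dictionaries), and the empty-string padding are assumptions made for illustration, not the activity's actual implementation.

```python
# Sample rows and their tokens, copied from the tables above.
SAMPLE_ROWS = [
    {"ID": 1, "TextColumn": "This is a sample text."},
    {"ID": 2, "TextColumn": "Tokenizing helps in NLP tasks."},
]
SAMPLE_TOKENS = [
    ["sample", "text"],
    ["tokenizing", "helps", "NLP", "tasks"],
]


def one_token_per_row(rows, token_lists, output_column="TokenizedText"):
    """Repeat each input row once per token."""
    out = []
    for row, tokens in zip(rows, token_lists):
        for token in tokens:
            out.append({**row, output_column: token})
    return out


def one_token_per_column(rows, token_lists, output_column="TokenizedText"):
    """Keep one row per input row and spread tokens across numbered columns."""
    width = max((len(tokens) for tokens in token_lists), default=0)
    out = []
    for row, tokens in zip(rows, token_lists):
        new_row = dict(row)
        for i in range(width):
            # Rows with fewer tokens are padded with empty cells (assumption).
            new_row[f"{output_column}{i + 1}"] = tokens[i] if i < len(tokens) else ""
        out.append(new_row)
    return out
```

With the sample data, `one_token_per_row` yields six rows and `one_token_per_column` yields two rows with columns `TokenizedText1` through `TokenizedText4`, matching the shapes of the tables above.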