Description
The Extract Ngrams activity processes text data by extracting sequences of words (n-grams) from a selected column. This is useful for tasks involving natural language processing (NLP), pattern recognition, or feature generation for machine learning.
N-grams are contiguous sequences of n items (typically words) in a sentence. For example:
- 2-gram (bigram): "great product"
- 3-gram (trigram): "fast delivery time"
This activity provides options to remove stop words, apply stemming, and sort terms before forming n-grams.
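As a minimal sketch of the n-gram definition above (not the activity's actual implementation), the following Python function slides a window of n words across a sentence. The function name `ngrams` and the whitespace tokenization are assumptions made for illustration.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return the contiguous word n-grams of text, in order of appearance."""
    words = text.split()  # assumption: words are separated by whitespace
    # Slide a window of n words across the word list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("fast delivery time", 2))  # ['fast delivery', 'delivery time']
print(ngrams("fast delivery time", 3))  # ['fast delivery time']
```

A text with fewer than n words yields no n-grams in this sketch.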
Use case:
Extracting bigrams or trigrams from customer reviews, survey responses, or feedback fields for sentiment analysis or topic modeling.
Input
Type | Description |
---|---|
Data | Textual data containing the column to process |
Output
Type | Description |
---|---|
Transformed Data | Table with extracted n-gram tokens |
Configuration Fields
Field Name | Required | Description |
---|---|---|
Column To Extract | Yes | Name of the column containing text to extract n-grams from. (Uses Previous Data Column Editor) |
Output Method | Yes | Format for outputting extracted n-grams: One per row, One per column, or JSON |
Output Column | Yes | Name of the column to store the resulting n-grams |
Include Original | No | Whether to include the original columns in the output alongside the n-grams |
Size | Yes | Number of words per n-gram (e.g., 2 = bigram, 3 = trigram) |
Clear Stop Words | No | Remove common stop words (e.g., “the”, “is”, “and”) before generating n-grams |
Stem Words | No | Reduce words to their root form before generating n-grams (e.g., “running” → “run”) |
Sort Words | No | Sort words alphabetically within each n-gram (e.g., “product great” → “great product”) |
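The three optional fields can be pictured as preprocessing steps around a sliding window. The Python sketch below is illustrative only: the stop-word list, the `crude_stem` helper, the function and parameter names, the lowercasing and punctuation handling, and the order in which the steps run are all assumptions, not the activity's documented behavior.

```python
import re

# Illustrative stop-word list; the activity's real list is not documented here.
STOP_WORDS = {"a", "an", "and", "but", "is", "of", "the", "was"}


def crude_stem(word: str) -> str:
    """Toy suffix stripper standing in for a real stemmer (hypothetical helper)."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def extract_ngrams(text, size, clear_stop_words=False,
                   stem_words=False, sort_words=False):
    """Return word n-grams of the given size after the optional steps."""
    # Assumption: lowercase the text and drop punctuation before tokenizing.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if clear_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if stem_words:
        words = [crude_stem(w) for w in words]
    grams = [words[i:i + size] for i in range(len(words) - size + 1)]
    if sort_words:
        # Sort alphabetically inside each n-gram, not across the whole text.
        grams = [sorted(g) for g in grams]
    return [" ".join(g) for g in grams]


print(extract_ngrams("The product quality is amazing", 2, clear_stop_words=True))
# ['product quality', 'quality amazing']
print(extract_ngrams("running shoes", 2, stem_words=True))
# ['run shoes']
print(extract_ngrams("product great", 2, sort_words=True))
# ['great product']
```

Removing stop words before building the window means that words separated only by a stop word become neighbors, which is why "quality amazing" appears as a bigram in the first call.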
Sample Input
ReviewID | ReviewText | Rating | Date | Reviewer |
---|---|---|---|---|
101 | The product quality is amazing | 5 | 2024-02-01 | Alice |
102 | Great service and fast delivery | 4 | 2024-02-02 | Bob |
103 | The material is poor and fragile | 2 | 2024-02-03 | Charlie |
104 | Excellent support and great help | 5 | 2024-02-04 | David |
105 | Delivery was slow, but good item | 3 | 2024-02-05 | Emma |
Sample Configuration
Field | Value |
---|---|
Column To Extract | ReviewText |
Output Method | One per row |
Output Column | ExtractedNgrams |
Include Original | No |
Size | 2 |
Clear Stop Words | Yes |
Stem Words | No |
Sort Words | No |
Sample Output
ExtractedNgrams |
---|
product quality |
quality amazing |
great service |
service fast |
fast delivery |
material poor |
poor fragile |
excellent support |
support great |
great help |
delivery slow |
slow good |
good item |
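The sample above uses the One per row output method. The exact layouts of One per column and JSON are not specified in this document, so the Python sketch below is only one plausible reading: the function names, the numbered column suffixes (`ExtractedNgrams_1`, ...), the padding with `None`, and the JSON-array encoding are all assumptions. The n-grams are those of reviews 101 and 102 from the sample.

```python
import json

# N-grams for reviews 101 and 102, taken from the sample output.
ngrams_by_review = {
    101: ["product quality", "quality amazing"],
    102: ["great service", "service fast", "fast delivery"],
}


def one_per_row(grams_by_id, column="ExtractedNgrams"):
    """One output row per n-gram."""
    return [{column: g} for grams in grams_by_id.values() for g in grams]


def one_per_column(grams_by_id, column="ExtractedNgrams"):
    """One output row per input row, one numbered column per n-gram."""
    width = max((len(g) for g in grams_by_id.values()), default=0)
    rows = []
    for grams in grams_by_id.values():
        # Pad shorter rows so every row has the same columns.
        padded = grams + [None] * (width - len(grams))
        rows.append({f"{column}_{i + 1}": g for i, g in enumerate(padded)})
    return rows


def as_json(grams_by_id, column="ExtractedNgrams"):
    """One output row per input row, n-grams encoded as a JSON array string."""
    return [{column: json.dumps(grams)} for grams in grams_by_id.values()]


for row in one_per_row(ngrams_by_review):
    print(row)
for row in one_per_column(ngrams_by_review):
    print(row)
for row in as_json(ngrams_by_review):
    print(row)
```

One per row suits counting and grouping n-grams, while the per-column and JSON shapes keep a one-to-one relationship with the input rows.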
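Use the Sort Words and Stem Words options when generating normalized features for text clustering or classification: stemming merges inflected forms of a word, and sorting makes n-grams that differ only in word order count as the same feature.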