
Extract Ngrams

Description

The Extract Ngrams activity processes text data by extracting sequences of words (n-grams) from a selected column. This is useful for tasks involving natural language processing (NLP), pattern recognition, or feature generation for machine learning.

N-grams are contiguous sequences of n items (typically words) in a sentence. For example:

  • 2-gram (bigram): "great product"
  • 3-gram (trigram): "fast delivery time"
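
As a rough illustration of the definition above, the sliding-window idea can be sketched in a few lines of Python. The function name `extract_ngrams` and the whitespace tokenization are assumptions made for this example; they are not taken from the activity itself.

```python
def extract_ngrams(text: str, size: int) -> list[str]:
    """Return every contiguous run of `size` words in `text` (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    words = text.lower().split()  # assumption: simple whitespace tokenization
    # Slide a window of `size` words across the sentence, one word at a time.
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


bigrams = extract_ngrams("great product fast delivery time", 2)
trigrams = extract_ngrams("great product fast delivery time", 3)
```

A sentence with fewer words than the requested size yields no n-grams in this sketch.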

This activity provides options to remove stop words, apply stemming, and sort terms before forming n-grams.

Use case:
Extracting bigrams or trigrams from customer reviews, survey responses, or feedback fields for sentiment analysis or topic modeling.


Input

| Type | Description |
| --- | --- |
| Data | Textual data containing the column to process |

Output

| Type | Description |
| --- | --- |
| Transformed Data | Table with extracted n-gram tokens |

Configuration Fields

| Field Name | Required | Description |
| --- | --- | --- |
| Column To Extract | Yes | Name of the column containing text to extract n-grams from. (Uses Previous Data Column Editor) |
| Output Method | Yes | Format for outputting extracted n-grams: One per row, One per column, or JSON |
| Output Column | Yes | Name of the column to store the resulting n-grams |
| Include Original | No | Whether to include the original columns in the output alongside the n-grams |
| Size | Yes | Number of words per n-gram (e.g., 2 = bigram, 3 = trigram) |
| Clear Stop Words | No | Remove common stop words (e.g., “the”, “is”, “and”) before generating n-grams |
| Stem Words | No | Reduce words to their root form before generating n-grams (e.g., “running” → “run”) |
| Sort Words | No | Sort words alphabetically within each n-gram (e.g., “product great” → “great product”) |
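
To make the three optional fields more concrete, here is a minimal Python sketch of how they might combine. Everything in it is an assumption for illustration: the stop-word list is a small stand-in, `naive_stem` is a toy suffix stripper rather than a real stemming algorithm, and the order of operations (stop words, then stemming, then sorting inside each n-gram) is a guess, not documented behavior of the activity.

```python
import re

# Assumption: tiny illustrative stop-word list; the activity's real list may differ.
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "was", "but"}


def naive_stem(word: str) -> str:
    """Toy suffix stripper for illustration only; not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def extract_ngrams(text, size, clear_stop_words=False, stem_words=False, sort_words=False):
    """Hypothetical pipeline mirroring the Size, Clear Stop Words, Stem Words and Sort Words fields."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if clear_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if stem_words:
        words = [naive_stem(w) for w in words]
    ngrams = []
    for i in range(len(words) - size + 1):
        gram = words[i:i + size]
        if sort_words:
            gram = sorted(gram)  # alphabetical order inside the n-gram
        ngrams.append(" ".join(gram))
    return ngrams


sorted_example = extract_ngrams("The product is great", 2, clear_stop_words=True, sort_words=True)
stemmed_example = extract_ngrams("running shoes", 2, stem_words=True)
```

With sorting enabled, "product great" and "great product" collapse to the same n-gram, which is the normalization effect the Sort Words field describes.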

Sample Input

| ReviewID | ReviewText | Rating | Date | Reviewer |
| --- | --- | --- | --- | --- |
| 101 | The product quality is amazing | 5 | 2024-02-01 | Alice |
| 102 | Great service and fast delivery | 4 | 2024-02-02 | Bob |
| 103 | The material is poor and fragile | 2 | 2024-02-03 | Charlie |
| 104 | Excellent support and great help | 5 | 2024-02-04 | David |
| 105 | Delivery was slow, but good item | 3 | 2024-02-05 | Emma |

Sample Configuration

| Field | Value |
| --- | --- |
| Column To Extract | ReviewText |
| Output Method | One per row |
| Output Column | ExtractedNgrams |
| Include Original | No |
| Size | 2 |
| Clear Stop Words | Yes |
| Stem Words | Yes |
| Sort Words | No |

Sample Output

ExtractedNgrams
product quality
quality amazing
great service
service fast
fast delivery
material poor
poor fragile
excellent support
support great
great help
delivery slow
slow good
good item
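
The rows above can be approximated outside the tool with a short Python sketch. It covers only lowercasing, stop-word removal, and bigram extraction; stemming is left out because its result depends on the stemmer used. The stop-word set and the JSON layout are assumptions chosen for this example, not the activity's documented behavior.

```python
import json
import re

# Assumption: just enough stop words to cover the sample reviews.
STOP_WORDS = {"the", "is", "and", "was", "but"}

reviews = [
    "The product quality is amazing",
    "Great service and fast delivery",
    "The material is poor and fragile",
    "Excellent support and great help",
    "Delivery was slow, but good item",
]


def bigrams_without_stop_words(text: str) -> list[str]:
    """Lowercase, drop stop words, then pair each remaining word with its successor."""
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


per_review = [bigrams_without_stop_words(text) for text in reviews]

# "One per row": flatten so every n-gram becomes its own output row.
one_per_row = [gram for grams in per_review for gram in grams]

# "JSON": one JSON array per input row (assumed layout).
as_json = [json.dumps(grams) for grams in per_review]

for gram in one_per_row:
    print(gram)
```

Under these assumptions the flattened list has 13 entries, one for each row of the ExtractedNgrams column.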

Tip: Enable the Sort Words and Stem Words options when generating normalized features for text clustering or classification, so that word-order and inflection variants of a phrase map to the same n-gram.