---
title: Extract Ngrams
description: Extract n-grams from a text column for linguistic or pattern analysis.
category: Data Transforms
tags: [nlp, ngram, text analysis, tokenization, bigram, trigram]
---

# Extract Ngrams

## Description

The **Extract Ngrams** activity processes text data by extracting sequences of words (n-grams) from a selected column. This is useful for tasks involving natural language processing (NLP), pattern recognition, or feature generation for machine learning.

N-grams are contiguous sequences of _n_ items (typically words) in a sentence. For example:

- 2-gram (bigram): `"great product"`
- 3-gram (trigram): `"fast delivery time"`
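
As a rough illustration of the definition above, word n-grams can be produced with a sliding window over a token list. This is a minimal sketch, not the activity's actual implementation; the function name `ngrams` and the whitespace tokenization are assumptions made for the example.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by single spaces."""
    # A window of width n fits len(tokens) - n + 1 times; fewer tokens give no n-grams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "fast delivery time".split()  # naive whitespace tokenization (assumption)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Here `bigrams` holds `"fast delivery"` and `"delivery time"`, while `trigrams` holds the single entry `"fast delivery time"`.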

This activity provides options to remove stop words, apply stemming, and sort terms before forming n-grams.

> **Use case**:  
> Extracting bigrams or trigrams from customer reviews, survey responses, or feedback fields for sentiment analysis or topic modeling.

---

## Input

| Type | Description                                   |
| ---- | --------------------------------------------- |
| Data | Textual data containing the column to process |

---

## Output

| Type             | Description                        |
| ---------------- | ---------------------------------- |
| Transformed Data | Table with extracted n-gram tokens |

---

## Configuration Fields

| Field Name            | Required | Description                                                                                               |
| --------------------- | -------- | --------------------------------------------------------------------------------------------------------- |
| **Column To Extract** | Yes      | Name of the column containing text to extract n-grams from. (Uses Previous Data Column Editor)            |
| **Output Method**     | Yes      | Format for outputting extracted n-grams:<ul><li>One per row</li><li>One per column</li><li>JSON</li></ul> |
| **Output Column**     | Yes      | Name of the column to store the resulting n-grams                                                         |
| **Include Original**  | No       | Whether to include the original columns in the output alongside the n-grams                               |
| **Size**              | Yes      | Number of words per n-gram (e.g., 2 = bigram, 3 = trigram)                                                |
| **Clear Stop Words**  | No       | Remove common stop words (e.g., “the”, “is”, “and”) before generating n-grams                             |
| **Stem Words**        | No       | Reduce words to their root form before generating n-grams (e.g., “running” → “run”)                       |
| **Sort Words**        | No       | Sort words alphabetically within each n-gram (e.g., “product great” → “great product”)                    |
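
The three **Output Method** choices differ only in how the n-grams from one input row are laid out. The sketch below shows one plausible shape for each; it is an assumption for illustration, not the activity's documented behavior. In particular, the helper `shape_output` and the `_1`, `_2` column suffixes are hypothetical.

```python
import json


def shape_output(ngrams, method, column="ExtractedNgrams"):
    """Lay out the n-grams of a single input row as a list of output rows."""
    if method == "One per row":
        # One output row per n-gram.
        return [{column: gram} for gram in ngrams]
    if method == "One per column":
        # One output row; hypothetical numbered columns (ExtractedNgrams_1, ...).
        return [{f"{column}_{i}": gram for i, gram in enumerate(ngrams, start=1)}]
    if method == "JSON":
        # One output row holding the whole list as a JSON array string.
        return [{column: json.dumps(ngrams)}]
    raise ValueError(f"unknown output method: {method}")


grams = ["great service", "service fast"]
per_row = shape_output(grams, "One per row")
per_column = shape_output(grams, "One per column")
as_json = shape_output(grams, "JSON")
```

"One per row" multiplies the row count, "One per column" widens the table, and "JSON" keeps one row per input record, which may be easier to pass to downstream steps that parse arrays.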

---

## Sample Input

| ReviewID | ReviewText                       | Rating | Date       | Reviewer |
| -------- | -------------------------------- | ------ | ---------- | -------- |
| 101      | The product quality is amazing   | 5      | 2024-02-01 | Alice    |
| 102      | Great service and fast delivery  | 4      | 2024-02-02 | Bob      |
| 103      | The material is poor and fragile | 2      | 2024-02-03 | Charlie  |
| 104      | Excellent support and great help | 5      | 2024-02-04 | David    |
| 105      | Delivery was slow, but good item | 3      | 2024-02-05 | Emma     |

---

## Sample Configuration

| Field             | Value           |
| ----------------- | --------------- |
| Column To Extract | ReviewText      |
| Output Method     | One per row     |
| Output Column     | ExtractedNgrams |
| Include Original  | No              |
| Size              | 2               |
| Clear Stop Words  | Yes             |
| Stem Words        | No              |
| Sort Words        | No              |

---

## Sample Output

| ExtractedNgrams   |
| ----------------- |
| product quality   |
| quality amazing   |
| great service     |
| service fast      |
| fast delivery     |
| material poor     |
| poor fragile      |
| excellent support |
| support great     |
| great help        |
| delivery slow     |
| slow good         |
| good item         |
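
The table above can be reproduced with a short script, which may help when checking how stop-word removal changes the bigrams. This is a sketch under stated assumptions: the stop-word list contains only the words needed for these five reviews, tokenization is a simple lowercase regex, and stemming is left out because the table shows unstemmed words. The activity's own stop-word list and tokenizer may differ.

```python
import re

# Assumed stop-word list, just large enough for the sample reviews.
STOP_WORDS = {"the", "is", "and", "was", "but"}


def extract_ngrams(text, size=2, stop_words=STOP_WORDS):
    """Lowercase, tokenize, drop stop words, then form n-grams of the given size."""
    tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in stop_words]
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


reviews = [
    "The product quality is amazing",
    "Great service and fast delivery",
    "The material is poor and fragile",
    "Excellent support and great help",
    "Delivery was slow, but good item",
]

# "One per row": flatten the bigrams of every review into a single column.
extracted = [gram for review in reviews for gram in extract_ngrams(review)]
```

Because stop words are removed before the window slides, words that were not adjacent in the original text (such as "quality" and "amazing") end up in the same bigram.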

---

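> Enable **Sort Words** and **Stem Words** when generating normalized features for text clustering or classification: stemming merges inflected forms of a word, and sorting makes n-grams that differ only in word order identical.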
