Group longtail values

Description

The Group Longtail Values activity consolidates lesser-used, low-frequency, or non-priority values in a column by replacing every value that is not on an allow list with a single replacement value.

It is often used to reduce category fragmentation and simplify downstream analysis by focusing on the most relevant or allowed values and grouping the remaining entries under a common label (e.g., Others).

Use this activity to:

  • Clean and normalize long-tail categorical data
  • Replace values not in an allow-list with a defined label
  • Focus analysis on key brands, categories, or terms
  • Minimize noise from low-frequency entries in visualization or reporting

Use case:
A dataset contains numerous product brands, many of which appear only once or twice. To improve chart readability, you can group all brands not in the top 3 (Apple, Samsung, Google) as Others, using this activity before visualizing brand performance.
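The use case above can be sketched in plain Python. This is only an illustration, not the activity's implementation: the activity takes an explicit allow list, so deriving the top 3 brands by frequency is shown here as a separate, assumed preparation step. The function names `top_n_allow_list` and `group_longtail` and the sample brand data are hypothetical.

```python
from collections import Counter


def top_n_allow_list(values, n):
    """Return the n most frequent values, most frequent first (hypothetical helper)."""
    return [value for value, _count in Counter(values).most_common(n)]


def group_longtail(values, allow_list, replacement="Others"):
    """Replace every value that is not in allow_list with the replacement label."""
    allowed = set(allow_list)
    return [value if value in allowed else replacement for value in values]


# Made-up brand column: three frequent brands plus two that appear once.
brands = [
    "Apple", "Samsung", "Apple", "Google", "Samsung",
    "Apple", "Google", "Nokia", "Sony",
]

allow_list = top_n_allow_list(brands, 3)
grouped = group_longtail(brands, allow_list)

print(allow_list)
print(grouped)
```

In practice the allow list may just as well be typed in by hand; computing it from counts is useful mainly when the set of top brands changes between runs.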

Input

| Input Type | Description |
|---|---|
| Data | Input dataset to transform |

Output

| Output Type | Format | Description |
|---|---|---|
| Data | Table | Transformed data with grouped values |

Configuration Fields

| Field Name | Description |
|---|---|
| Column Name | The name of the column where longtail values should be grouped. |
| Allow List | List of allowed values. Any value in the column not in this list will be replaced. |
| Replacement Value | The value used to replace entries not in the allow list (e.g., Others, Misc, Unknown). |

Sample Input

| product_id | product_name | brand_names |
|---|---|---|
| P001 | Smartphone | Apple, Samsung, Google |
| P002 | Laptop | Dell, HP, Lenovo |
| P003 | Headphones | Bose, Sony, Sennheiser |
| P004 | TV | LG, Samsung, Sony |
| P005 | Smartwatch | Fitbit, Garmin, Apple |

Sample Configuration

| Field | Value |
|---|---|
| columnName | product_name |
| allowList | Smartphone, Headphones |
| replacementValue | Others |

Sample Output

| product_id | product_name | brand_names |
|---|---|---|
| P001 | Smartphone | Apple, Samsung, Google |
| P002 | Others | Dell, HP, Lenovo |
| P003 | Headphones | Bose, Sony, Sennheiser |
| P004 | Others | LG, Samsung, Sony |
| P005 | Others | Fitbit, Garmin, Apple |

In the above example, only Smartphone and Headphones were part of the allow list. All other values in the product_name column (Laptop, TV, and Smartwatch) were replaced with Others. The product_id and brand_names columns were left unchanged.
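The sample above can be reproduced with a short Python sketch. It is a rough model of the behavior described on this page, not the activity's actual code: the function name `group_longtail_values` and the list-of-dictionaries row format are hypothetical, and exact, case-sensitive matching against the allow list is an assumption, since the page does not say how case, whitespace, or empty values are handled.

```python
def group_longtail_values(rows, column_name, allow_list, replacement_value):
    """Return copies of rows with non-allowed values in column_name replaced.

    Assumption: matching is exact and case-sensitive.
    """
    allowed = set(allow_list)
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        if new_row[column_name] not in allowed:
            new_row[column_name] = replacement_value
        result.append(new_row)
    return result


# Rows from the Sample Input table.
rows = [
    {"product_id": "P001", "product_name": "Smartphone", "brand_names": "Apple, Samsung, Google"},
    {"product_id": "P002", "product_name": "Laptop", "brand_names": "Dell, HP, Lenovo"},
    {"product_id": "P003", "product_name": "Headphones", "brand_names": "Bose, Sony, Sennheiser"},
    {"product_id": "P004", "product_name": "TV", "brand_names": "LG, Samsung, Sony"},
    {"product_id": "P005", "product_name": "Smartwatch", "brand_names": "Fitbit, Garmin, Apple"},
]

# Values from the Sample Configuration table.
output = group_longtail_values(
    rows,
    column_name="product_name",
    allow_list=["Smartphone", "Headphones"],
    replacement_value="Others",
)

for row in output:
    print(row["product_id"], row["product_name"], row["brand_names"], sep=" | ")
```

Only the configured column is touched, so values such as Samsung inside brand_names stay as they are even though they are not in the allow list.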