Automatic Classification of Products into Categories

Why Product Categorization Matters and How It Can Be Done.

Tim Gilbert · 12 October 2017

Why it matters

  • You'll need products in categories to structure your AdWords campaigns for efficient PPC. Products should be placed in the most specific Google Product Category possible to optimize SEO and display your product in the most relevant searches.
  • Categorizing products correctly allows automatic checking for required attributes. More significantly, it lets you optimize product titles to increase clicks by ensuring that titles include the product-type-specific attributes that buyers are searching for.
  • Incorrect categorization will lead to problems with other analyses and quality checks, and poor ecommerce performance.

How it can be done

There are different techniques to classify a product into the correct category of an existing taxonomy (e.g. Google Product Category, Amazon Taxonomy, custom Merchant Category). I'll briefly describe them here, and go into more detail below.

The first technique is to find a product type in the title or description that says exactly what the product is. It might be a highly specific phrase like "swimming pool float", or a single word like "bikini". Once we have identified the product type, we can look up which category it belongs to.

The second technique is to directly infer which category the product should be in using product attributes, brand, and other data. This is necessary when the text doesn't say exactly what the product is (which happens with surprising frequency in data feeds that originated from a retailer's website), or when the described product type is too generic to map to one category (e.g. "lock" could be a bicycle accessory, door hardware, or computer software).

Technique 1: Automatically discovering and classifying by product type

The first step in this technique is to identify the possible words/phrases in the product title and description that could be describing what the product type is.

Method A — Titles vs known product types


Usually we start with the title. First we try to exclude parts of the title we already know aren't the product type (e.g. brand or quantity). Then we go through and look for phrases that are known product types. If multiple known product types are found, such as in titles like "Charm, Bicycle, Sterling Silver" or "Basketball Shirt - XL", we need mechanisms to choose which is more likely to be correct: "Charm" or "Bicycle", "Basketball" or "Shirt".

One way is to calculate a confidence for each candidate product type: how often the word is the correct product type divided by how many titles it appears in. "Basketball" might only be the real product type in 10% of the titles it appears in, while "Shirt" might be the product type 99% of the time, so "Shirt" is a better guess.

Another way is to look for other words that provide indirect evidence that the word is the product type. For "Charm", the word "Sterling" is a very strong indication that it's the product type. For "Shirt", the size "XL" is similar evidence.
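
The confidence calculation above can be sketched in a few lines. This is only an illustration: the labeled titles, the tokenizer, and the function names are made up for the example, and a real feed would need far more data and cleaning.

```python
from collections import Counter

# Hypothetical labeled titles: (title, correct product type).
LABELED_TITLES = [
    ("Basketball Shirt - XL", "shirt"),
    ("Basketball Shorts", "shorts"),
    ("Basketball Hoop", "hoop"),
    ("Indoor Basketball", "basketball"),
    ("Cotton Shirt - M", "shirt"),
    ("Linen Shirt", "shirt"),
]


def tokenize(title):
    """Lowercase the title and drop bare punctuation tokens."""
    return [w.strip(",-").lower() for w in title.split() if w.strip(",-")]


def type_confidence(labeled_titles):
    """Share of titles containing a word in which that word is the product type."""
    appearances = Counter()
    correct = Counter()
    for title, product_type in labeled_titles:
        for word in set(tokenize(title)):
            appearances[word] += 1
            if word == product_type:
                correct[word] += 1
    return {word: correct[word] / appearances[word] for word in appearances}


def best_product_type(title, confidence, known_types):
    """Pick the known product type in the title with the highest confidence."""
    candidates = [w for w in tokenize(title) if w in known_types]
    return max(candidates, key=lambda w: confidence.get(w, 0.0), default=None)


confidence = type_confidence(LABELED_TITLES)
known_types = {"basketball", "shirt", "shorts", "hoop"}
guess = best_product_type("Basketball Shirt - L", confidence, known_types)
```

With this toy data "basketball" is the product type in 1 of the 4 titles it appears in, while "shirt" is correct every time it appears, so "shirt" is chosen.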

Method B — Descriptions and indicators


Next we look in the product description. Descriptions have more text, so they contain more words that could be mistaken for the product type, but they are also written in a more conversational style than titles. This means we can look for phrases like "this coverall is made of" or "this tote features", and see that "coverall" and "tote" are likely product types for their respective descriptions.

This method is highly effective when those phrases are present, but in many cases the description doesn't include one of these phrases.
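
A rough sketch of the indicator idea using a regular expression. The list of indicator phrases and the limit of three words per product type are assumptions made for the example, not a complete pattern set.

```python
import re

# Assumed indicator pattern: "this <product type> <indicator phrase>".
# The product type is captured as one to three words.
INDICATOR = re.compile(
    r"\bthis\s+((?:[a-z-]+\s+){0,2}?[a-z-]+)\s+"
    r"(?:is made of|is made from|features|comes with)\b",
    re.IGNORECASE,
)


def product_types_from_description(description):
    """Return candidate product types found next to an indicator phrase."""
    return [match.group(1).lower() for match in INDICATOR.finditer(description)]


coverall = product_types_from_description("This coverall is made of durable cotton.")
tote = product_types_from_description("Roomy and light. This tote features a zip pocket.")
nothing = product_types_from_description("Durable, comfortable and machine washable.")
```

The last call returns an empty list, which is the common case the paragraph above warns about: no indicator phrase, no candidate.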

Method C — Word Vectors


Word vectors are a way of capturing semantic similarity of words using machine learning. If the products are already divided into groups/categories (even if they aren't the categories that we need to classify to), we can use word vectors to help identify the product types in text. First we create word vectors (or use an existing vector model like the ones from GloVe) for each title. Then we find the average vector of the group's titles together with the vector of the category name to find the semantic center of the group.

[Figure: Word Vectors in Categories]

Finally, we go through the words in each title and the noun phrases in each description (identified by Natural Language Processing with NLTK or spaCy), and pick the words/phrases that are closest to the semantic center for the group.
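
Here is a small sketch of the semantic center idea. The three-dimensional vectors are invented stand-ins for a real embedding model such as GloVe, and the function names are hypothetical; only the averaging and the cosine comparison are the point.

```python
import math

# Toy vectors standing in for a real word vector model (values are made up).
TOY_VECTORS = {
    "tote": [0.9, 0.1, 0.0],
    "bag": [0.8, 0.2, 0.0],
    "handbag": [0.85, 0.15, 0.05],
    "leather": [0.4, 0.6, 0.1],
    "red": [0.1, 0.2, 0.9],
    "large": [0.2, 0.1, 0.8],
}


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def text_vector(text):
    """Average vector of the known words in a text, or None if none are known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    return mean_vector(vectors) if vectors else None


def semantic_center(titles, category_name):
    """Average of the title vectors together with the category name vector."""
    vectors = [text_vector(t) for t in titles + [category_name]]
    return mean_vector([v for v in vectors if v is not None])


def closest_word(title, center):
    """Word in the title whose vector is most similar to the group center."""
    words = [w for w in title.lower().split() if w in TOY_VECTORS]
    return max(words, key=lambda w: cosine(TOY_VECTORS[w], center), default=None)


center = semantic_center(["Red Leather Tote", "Large Bag", "Leather Handbag"], "Handbag")
candidate = closest_word("Large Red Tote", center)
```

In this toy group the center sits near the bag-like words, so "tote" is picked over "large" and "red".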

There are other word vector methods of classification that I have ideas for but haven't experimented with yet.


Once we have identified the possible product type using these methods, we can look up the associated category. The list of product types has to be supervised by a human: new product types discovered through paired indicators and word vectors need to be added and mapped to the correct category.

Technique 2: Inferring category from other data

For this technique, we don't directly try to find a phrase that describes what the product is (because it's missing, vague, or even incorrect). Instead, we need training data, and lots of it (preferably thousands of products for each individual category). Training data is information on products that we already know the correct categories for. This lets the computer learn to recognize more subtle indications of how a product should be classified.

Method A — Naive Bayes Classifier


We divide the training data titles and descriptions into n-grams (sequences of adjacent words of varying length) or skip-grams (groups of words that aren't necessarily adjacent), and use a Naive Bayes machine learning model to calculate the relationship between each product's n-grams and each category.

Note: Product data is much messier than normal prose text, and requires more pre-cleaning of punctuation, unicode characters, abbreviations, units of measurement, etc.

Then we take the data on the products that haven't been classified yet, and run the Naive Bayes classification model on them to predict the most likely category for each unknown product, along with the confidence of the suggestion. It's also good practice to take the 2nd best guess from the model and see how close it is to the 1st choice in probability. If it's close, then the product could be a set of products from multiple categories, or not fit well in any existing category, or be lacking information that would reveal the category it belongs in.
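
To make the "compare the 1st and 2nd guess" step concrete, here is a deliberately tiny Naive Bayes written from scratch. The class name, the training titles, and the 0.3 margin are all assumptions for the example; in practice you would more likely use an existing library and far more training data.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Unigrams and bigrams of a lowercased, whitespace-split text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            features = ngrams(text)
            self.feature_counts[label].update(features)
            self.vocab.update(features)
        return self

    def predict_proba(self, text):
        """Return (category, probability) pairs, most likely first."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, count in self.class_counts.items():
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for feature in ngrams(text):
                if feature in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((self.feature_counts[label][feature] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return sorted(((label, w / norm) for label, w in weights.items()),
                      key=lambda pair: pair[1], reverse=True)


def classify_with_margin(model, text, min_margin=0.3):
    """Best category, its probability, and whether the runner-up is too close."""
    ranked = model.predict_proba(text)  # assumes at least two categories
    (best, p1), (_, p2) = ranked[0], ranked[1]
    return best, p1, (p1 - p2) < min_margin


model = TinyNaiveBayes().fit(
    ["cotton crew neck shirt", "slim fit denim jeans",
     "dry dog food chicken", "dog chew toy rope"],
    ["Apparel & Accessories", "Apparel & Accessories",
     "Animals & Pet Supplies", "Animals & Pet Supplies"],
)
clear = classify_with_margin(model, "denim shirt slim fit")
ambiguous = classify_with_margin(model, "dog shirt")
```

The first title is classified with a wide margin. "dog shirt" pulls in both directions, so it is flagged for review even though a best guess is still returned.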

This method is very memory intensive, and works better on smaller taxonomies (several dozen to a hundred categories, rather than many thousands like the Google Product Category or Amazon Taxonomy).

Method B — Category-specific ngrams


This is similar to the Naive Bayes classifier, but much simpler. For this, we only look at the ngrams that are highly specific or even unique to a particular category. We measure how concentrated each ngram is across categories with the Herfindahl index (the sum of the squared shares of its occurrences in each category), and exclude ngrams with a low index, since those are spread over many categories. Then we look for the remaining trigger ngrams in the titles and descriptions of each product and classify accordingly.
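
A minimal sketch of the Herfindahl filter, assuming we already have (ngram, category) pairs extracted from training data. The thresholds and the sample pairs are illustrative.

```python
from collections import Counter, defaultdict


def herfindahl_by_ngram(labeled_ngrams):
    """Herfindahl index per ngram: sum of squared category shares (1.0 = one category)."""
    counts = defaultdict(Counter)
    for ngram, category in labeled_ngrams:
        counts[ngram][category] += 1
    index = {}
    for ngram, per_category in counts.items():
        total = sum(per_category.values())
        index[ngram] = sum((n / total) ** 2 for n in per_category.values())
    return index, counts


def trigger_ngrams(labeled_ngrams, min_index=0.9, min_count=2):
    """Map each sufficiently concentrated ngram to its dominant category."""
    index, counts = herfindahl_by_ngram(labeled_ngrams)
    return {
        ngram: counts[ngram].most_common(1)[0][0]
        for ngram, h in index.items()
        if h >= min_index and sum(counts[ngram].values()) >= min_count
    }


# Hypothetical observations of ngrams in already-categorized products.
PAIRS = (
    [("dog", "Animals & Pet Supplies")] * 3
    + [("cotton", "Apparel & Accessories")] * 2
    + [("cotton", "Animals & Pet Supplies")] * 2
    + [("sleeve", "Apparel & Accessories")] * 2
)
index, _ = herfindahl_by_ngram(PAIRS)
triggers = trigger_ngrams(PAIRS)
```

Here "cotton" is split evenly between two categories, giving an index of 0.5, so it is dropped, while "dog" and "sleeve" each score 1.0 and become triggers.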

This is much faster, requires less memory, and makes it easier to track why a particular product was classified to a category. It's only really suited for classifying to the root nodes (e.g. "Apparel & Accessories" vs "Animals & Pet Supplies"). It doesn't do well at distinguishing between very specific nodes like "Apparel & Accessories > Clothing > Activewear > Hunting Clothing > Ghillie Suits" and "Apparel & Accessories > Clothing > Activewear > Hunting Clothing > Hunting & Tactical Pants", because unique and meaningful ngrams are scarce at that level.

Method C - Hierarchical classifier


This is also similar in concept to the Naive Bayes classifier, but it doesn't try to classify products immediately to a specific node like "Health & Beauty > Personal Care > Cosmetics > Bath & Body > Bar Soap". Instead, it first classifies the product to "Health & Beauty", then from the nodes under it into "Personal Care", then into "Cosmetics", and so on.

This method still makes mistakes at the detail levels, but is less likely to classify products in the completely wrong root node. It also can handle larger/deeper taxonomies better, although it is slower and even more memory intensive.
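
The top-down walk can be sketched as follows. The taxonomy fragment is cut down to a few nodes, and the keyword sets are only a stand-in for the trained model you would have at each node, so treat every name here as hypothetical.

```python
# Hypothetical taxonomy fragment as nested dicts; leaves are empty dicts.
TAXONOMY = {
    "Health & Beauty": {
        "Personal Care": {
            "Cosmetics": {
                "Bath & Body": {"Bar Soap": {}, "Body Wash": {}},
            },
        },
    },
    "Home & Garden": {"Household Appliances": {}},
}

# Stand-in for per-node trained classifiers: keywords that vote for a node.
NODE_KEYWORDS = {
    "Health & Beauty": {"soap", "shampoo", "lotion"},
    "Home & Garden": {"fan", "vacuum"},
    "Bar Soap": {"bar"},
    "Body Wash": {"wash"},
}


def score(node, words):
    """Number of title words that vote for this node."""
    return len(NODE_KEYWORDS.get(node, set()) & words)


def classify_hierarchically(title):
    """Descend one level at a time, stopping where no child has any evidence."""
    words = set(title.lower().split())
    path, children = [], TAXONOMY
    while children:
        best = max(children, key=lambda node: score(node, words))
        # A node with a single child is followed without a decision;
        # otherwise we need at least some evidence to go deeper.
        if score(best, words) == 0 and len(children) > 1:
            break
        path.append(best)
        children = children[best]
    return " > ".join(path)


full = classify_hierarchically("lavender bar soap")
partial = classify_hierarchically("gentle soap")
```

"gentle soap" gets as far as "Bath & Body" and stops there, because nothing in the title separates "Bar Soap" from "Body Wash". Stopping at the deepest confident node is one way to limit the detail-level mistakes mentioned above.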

Method D — Product attributes


Specific attributes, or combinations of attributes, can also point to a category. They can come from structured data fields like "Color", or be extracted from the title and description. Many categories have products with a value in the color field, but only a few categories also have "Sleeve-Length" or "Button-Style". Sometimes a combination of attributes can be more specific than either is separately. For example, several types of products have "Fan Speed", and several types have "Number of Bulbs", but combine them together and it is almost certainly "Home & Garden > Household Appliances > Climate Control Appliances > Fans > Ceiling Fans".

For this method, it doesn't matter what the particular values for the attributes are, just whether or not the product has them.
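
As a sketch, attribute presence can be matched against a table of attribute "signatures". The signature table below is hypothetical (the "Sleeve-Length" mapping in particular is just an assumption for the example), and only the presence of a non-empty value is checked, never the value itself.

```python
# Hypothetical signatures: a set of attribute names mapped to a category.
CATEGORY_SIGNATURES = {
    frozenset({"Fan Speed", "Number of Bulbs"}):
        "Home & Garden > Household Appliances > Climate Control Appliances > Fans > Ceiling Fans",
    frozenset({"Sleeve-Length"}):
        "Apparel & Accessories > Clothing",
}


def category_from_attributes(product):
    """Return the category of the largest signature fully present on the product."""
    present = {name for name, value in product.items() if value not in (None, "")}
    matches = [(signature, category)
               for signature, category in CATEGORY_SIGNATURES.items()
               if signature <= present]
    if not matches:
        return None
    # Prefer the most specific (largest) matching combination of attributes.
    return max(matches, key=lambda match: len(match[0]))[1]


fan = category_from_attributes({"Color": "White", "Fan Speed": "3", "Number of Bulbs": "2"})
only_color = category_from_attributes({"Color": "Red"})
```

A product with only "Fan Speed" matches nothing here, which mirrors the point above: the single attribute is shared by several product types, and only the combination is decisive.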

Method E — Brand/Manufacturer


Some brands or manufacturers have a small product range tightly focused on a very specific category. "Liberty Drums" makes drums and drum kits, so a product with this brand will almost certainly be "Arts & Entertainment > Hobbies & Creative Arts > Musical Instruments > Percussion". Others may be broader, but still associated with a root node. "Levi's" makes mostly pants, but also shirts, jackets, sweaters, shoes, and belts. A "Levi's" product is certainly under "Apparel & Accessories", probably under "Apparel & Accessories > Clothing", and more likely than not under "Apparel & Accessories > Clothing > Pants".

This method is less useful for the specific nodes since so many manufacturers have a range of product types, and doesn't work when the brands have products widely spread across root nodes.


Validating Categorization

Once all the products are classified in categories, we still aren't finished. The next step is to check that the classifications are correct and identify which ones are mistaken. Here are some ways we can double-check the categorization.

Check 1 — Finding the same category by multiple methods


If the brand, ngram, and knowledge base methods all agree that a product is "Furniture > Chairs > Bean Bag Chairs", we probably don't need to have a human examine it. If the category suggested by brand is only the root node "Furniture", it's also probably a good classification.

But if the brand implies the category "Arts & Entertainment > Party & Celebration > Party Supplies > Chair Sashes", then it most likely needs some review.
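
One simple way to express "agree" is that two suggested paths are compatible when one is a prefix of the other. The sketch below assumes each method returns a category path string; the function and method names are made up.

```python
def split_path(category):
    """Split "A > B > C" into its node names."""
    return [part.strip() for part in category.split(">")]


def is_compatible(a, b):
    """True if one category path is a prefix of (or equal to) the other."""
    path_a, path_b = split_path(a), split_path(b)
    shorter = min(len(path_a), len(path_b))
    return path_a[:shorter] == path_b[:shorter]


def needs_review(suggestions):
    """Flag a product when any two methods suggest incompatible categories."""
    categories = list(suggestions.values())
    return not all(
        is_compatible(a, b)
        for i, a in enumerate(categories)
        for b in categories[i + 1:]
    )


agreeing = needs_review({
    "brand": "Furniture",
    "ngrams": "Furniture > Chairs > Bean Bag Chairs",
    "product_type": "Furniture > Chairs > Bean Bag Chairs",
})
conflicting = needs_review({
    "brand": "Arts & Entertainment > Party & Celebration > Party Supplies > Chair Sashes",
    "ngrams": "Furniture > Chairs > Bean Bag Chairs",
})
```

The first product passes without review because the brand's root-only suggestion is a prefix of the more specific one. The second is flagged.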

Check 2 — Check against another category


If you're classifying to a Google Product Category and you already have a Merchant Category, you can compare the two to see if they are similar in at least some words, or synonyms, or word vectors.

If the merchant category is "Food > Hot Drinks > Tea\Infusers", and the suggested Google Product Category is "Food, Beverages & Tobacco > Beverages > Tea & Infusions", they share several words in common (especially at the root and end nodes), which confirms the categorization.
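
A first-pass version of this comparison can use plain word overlap (Jaccard similarity) between the two category paths. This sketch only counts exact word matches; synonyms such as "Drinks"/"Beverages" and near matches such as "Infusers"/"Infusions" would need stemming or word vectors, which are not shown.

```python
import re


def category_words(category):
    """Lowercased alphabetic words in a category path."""
    return set(re.findall(r"[a-z]+", category.lower()))


def word_overlap(category_a, category_b):
    """Jaccard similarity: shared words divided by all distinct words."""
    words_a, words_b = category_words(category_a), category_words(category_b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


merchant = r"Food > Hot Drinks > Tea\Infusers"
suggested = "Food, Beverages & Tobacco > Beverages > Tea & Infusions"
overlap = word_overlap(merchant, suggested)
unrelated = word_overlap(merchant, "Electronics > Computers")
```

With exact matching only "food" and "tea" are shared, so the score is modest, but it is still clearly separated from the zero overlap of an unrelated category.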

Check 3 — Frequency of category in feed


If we classify a catalog and find that 999 products are in "Electronics" and 1 product is "Food, Beverages & Tobacco", that classification is highly suspicious. Maybe it detected a product type of "Blackberry" that mapped to "Food, Beverages & Tobacco > Food Items > Fruits & Vegetables", instead of "Electronics > Computers > Handheld Devices > PDAs".
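
A sketch of this check: count products per root node and flag the roots that hold only a tiny share of the feed. The 1% threshold is an arbitrary assumption.

```python
from collections import Counter


def rare_root_categories(categories, max_share=0.01):
    """Root nodes that account for at most max_share of the classified products."""
    roots = [category.split(">")[0].strip() for category in categories]
    counts = Counter(roots)
    total = len(roots)
    return {root for root, n in counts.items() if n / total <= max_share}


feed = (
    ["Electronics > Computers > Handheld Devices > PDAs"] * 999
    + ["Food, Beverages & Tobacco > Food Items > Fruits & Vegetables"]
)
suspicious = rare_root_categories(feed)
```

For the feed described above, only "Food, Beverages & Tobacco" is flagged.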

Check 4 — Word vector distance


We expect the average word vector of the title and the noun phrases from the description to be similar to those of the other products in the category. If they are very different, either the product is misclassified or its text needs to be rewritten.

For example, if the category "Arts & Entertainment > Party & Celebration > Gift Giving > Greeting & Note Cards" contained mostly holiday greeting cards, and a business card case accidentally got put in the same category, its word vectors would be very different from the rest of the category. We would then move it to "Apparel & Accessories > Handbags, Wallets & Cases > Business Card Cases".

Check 5 — Price


Products in a category usually vary in price, but not infinitely so. If most products in a category are between $250 and $750, and one product is $20,000, then it needs to be checked.

For each category, we can calculate the average price and standard deviation, and use them to highlight products that are outliers and need to be verified.
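
A minimal sketch of the price check for a single category, using the mean and sample standard deviation. The cutoff of 3 standard deviations and the sample prices are assumptions.

```python
from statistics import mean, stdev


def price_outliers(prices, max_z=3.0):
    """Prices more than max_z sample standard deviations from the category mean."""
    if len(prices) < 3:
        return []  # too few products to say anything meaningful
    mu, sigma = mean(prices), stdev(prices)
    if sigma == 0:
        return []  # all prices identical
    return [price for price in prices if abs(price - mu) / sigma > max_z]


# 21 hypothetical prices from $250 to $750, plus one suspicious $20,000 item.
category_prices = [250 + 25 * i for i in range(21)] + [20000]
flagged = price_outliers(category_prices)
```

One caveat: in a category with only a handful of products, a single extreme price inflates the standard deviation so much that it may never cross the cutoff, so small categories may need a lower cutoff or a median-based measure.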

Now that all the products have been classified into categories, we can:

  • check structured attributes to meet Google Shopping data feed requirements
  • optimize ecommerce titles for PPC and SEO performance
  • properly structure our AdWords campaigns to maximize revenue and minimize costs

If you like reading about how algorithms and AI can be used to analyze and automatically improve product data, follow me on Twitter @AlgorithmistTim



