The FindWAtt Blog
Google Shopping, Product Feed Management & Product Data Optimization
Crack Phrases in Product Titles to Unleash their Power
Problem: failure to correctly identify, analyze and manipulate key phrases in Product Titles
Why it matters: can block and impede optimum use of title and negatively impact performance of product in Google Shopping or other shopping campaign
Findwatt’s solution: automatic analysis and correction/management
Why do Phrases Matter?
Phrases in product data, especially in Product Titles, can be ambiguous. Cracking this ambiguity is critical to being able to manipulate Product Titles for maximum impact in your Google Shopping or other campaign. And word-by-word analysis can fail miserably in handling ambiguity because it overlooks the contextual linkage between adjacent words that we humans see without even realizing.
For example, with the Product Title:
Blue Lagoon Hot Springs T-Shirt, Safety Orange
Analyzing the title word-by-word would identify these possible attribute-value pairs:
- Colors: "Blue" and "Orange"
- Product Types: "Springs" and "T-shirt"
as well as other possible attribute values such as:
- “Safety” – Is this Safety Gear?
- “Hot” - Does this thing produce heat?
But analyzing the same title on a phrase basis shows that:
- "Blue Lagoon" is a brand.
- "Hot Springs" is a compound noun; likely a theme, style, or image.
- "T-shirt" is a product type.
- "Safety Orange" is a color.
These “cracked phrases,” each with an identified and discrete meaning, can now be manipulated in numerous ways to test the optimal Product Title for Shopping Campaigns. Note that the optimal Product Title order may vary by channel.
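As a minimal sketch of the difference, the snippet below segments a title with a greedy longest-match against a phrase lexicon. The lexicon `KNOWN_PHRASES`, its labels, and the function name `crack_title` are illustrative assumptions, not part of our System; in practice the lexicon would come from the phrase-discovery process described later in this post.

```python
import re

# Hypothetical phrase lexicon (assumption for illustration only).
KNOWN_PHRASES = {
    ("blue", "lagoon"): "brand",
    ("hot", "springs"): "theme",
    ("t-shirt",): "product type",
    ("safety", "orange"): "color",
}
MAX_LEN = max(len(phrase) for phrase in KNOWN_PHRASES)


def crack_title(title):
    """Segment a title into (text, label) pairs, preferring the longest known phrase."""
    # Keep hyphens and apostrophes inside words; drop commas and other punctuation.
    words = re.findall(r"[\w'’-]+", title)
    result = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            key = tuple(w.lower() for w in words[i:i + n])
            if key in KNOWN_PHRASES:
                result.append((" ".join(words[i:i + n]), KNOWN_PHRASES[key]))
                i += n
                break
        else:
            # No known phrase starts here, so keep the single word unlabeled.
            result.append((words[i], "unknown"))
            i += 1
    return result


cracked = crack_title("Blue Lagoon Hot Springs T-Shirt, Safety Orange")
```

Here `cracked` holds four labeled phrases rather than seven loose words, so "Blue" and "Orange" are never mistaken for two competing colors.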
The Power of Manipulating Phrases in Product Titles
Manipulations of Product Titles can vary from very minor to significant, for example:
Clearly Indicating Variant
- Blue Lagoon Hot Springs T-Shirt - Safety Orange
Setting off the color with a dash at the end of the Product Title marks it as a variant, signaling that the product is available in more than one color.
- Blue Lagoon, Hot Springs, T-Shirt - Safety Orange
Probably not necessary for search engines but helpful to the human eye.
- Blue Lagoon - Hot Springs - T-Shirt - Safety Orange
Removes any possible ambiguity, for both search engines and humans.
- Blue Lagoon T-Shirt - Hot Springs - Safety Orange
If “Hot Springs” represents a style, perhaps Blue Lagoon makes other styles, so it could be treated as a variant.
- Hot Springs T-Shirt by Blue Lagoon - Safety Orange
Also a possibility if name recognition of “Blue Lagoon” is much higher than “Hot Springs.”
This is a great example of the need to understand your Search Query Report to properly optimize product data.
Attribute Addition - Insertion
Unless this is a unisex T-Shirt, and perhaps even then, gender should be added. Typically, gender is too important an attribute to be tacked onto a Product Title at the end. Besides, gender is much less frequently a variant than attributes like size and color so, if a variant, it should be placed in a senior position to (i.e. left of) size and color.
Without knowing the phrases and their meaning it would be possible to insert a new word (in this case gender) and destroy a phrase – e.g. “Hot Men’s Springs.”
With phrases clearly identified, additional attributes can be inserted accurately at various experimental positions. For example:
- Blue Lagoon Men’s T-Shirt - Hot Springs - Safety Orange
- Men’s Blue Lagoon T-Shirt - Hot Springs - Safety Orange
- Blue Lagoon Men’s Hot Springs T-Shirt - Safety Orange
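One way to guarantee that an inserted attribute never lands inside a phrase is to store the title as a list of phrases and only allow insertion at list positions. The sketch below assumes that representation; the helper names `insert_attribute` and `render`, and the `variant_count` parameter, are hypothetical.

```python
def insert_attribute(phrases, value, position):
    """Return a new phrase list with `value` inserted at a phrase boundary."""
    if not 0 <= position <= len(phrases):
        raise ValueError("position must be a phrase boundary index")
    return phrases[:position] + [value] + phrases[position:]


def render(phrases, variant_count=1):
    """Join phrases with spaces, setting off the trailing variants with ' - '."""
    split = len(phrases) - variant_count
    head = " ".join(phrases[:split])
    return " - ".join([head] + phrases[split:])


# The title is held as whole phrases, so "Hot Springs" cannot be split.
title = ["Blue Lagoon", "Hot Springs", "T-Shirt", "Safety Orange"]

# Generate one experimental title per insertion point before the product type.
experiments = [render(insert_attribute(title, "Men's", pos)) for pos in range(3)]
```

Because positions index phrase boundaries rather than words, a result like “Hot Men’s Springs” cannot be produced. Raising `variant_count` to 2 would also set off the theme with a dash, as in the reordered examples above.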
How to do this yourself at scale
You might say:
- “Doing this work for one Product Title is OK manually.”
- “And, ten, yeah I could do that.”
- “100, hmmm, I don’t like the sound of that.”
- “And, 1000, please don’t make me!”
- “10,000 – fuggedaboutit.”
For us, 10,000 SKUs is a pretty small Data Feed. We frequently receive over 100,000 and sometimes get into the millions. So, we had no choice but to develop an automated tool for processing product data more efficiently – we call ours “The System.” Here’s the blueprint to build this aspect of our System if you’d like to be able to handle phrase identification at scale.
- Split up each product title and description into words and sentences.
- Many algorithms are freely available, but none are optimized for the peculiarities of product titles.
- Parsing sentences can be surprisingly tricky – here’s a primer.
- Generate sequential phrases of various lengths – these are known as n-grams.
- N-gram logic in normal sentences is simple and there’s lots of material online.
- Product categories (like Google product category or merchant category) can be a good source of product type phrases. However, parsing these into n-grams is more complicated because they can contain multiple nodes, and can contain slashes or ampersands that indicate different n-gram versions should be created. This means you’ll need to write your own custom n-gram extractor for your category field.
- Count n-gram frequencies across all products.
- Examine the frequency of each n-gram against the frequencies of the larger patterns it fits inside to exclude combinations that are unlikely to be phrases.
- If the pattern appears as many times as a larger pattern in which it is found, then the shorter pattern is probably not a specific phrase, so remove it from the list of possibilities.
- If the larger pattern appears less frequently than the shorter pattern inside it, then the longer pattern is probably not a very specific phrase, so remove it from the list of possibilities.
- Examine the frequency of each n-gram against the frequencies of the shorter phrases that form it.
- If the n-gram is 3 or more words, and one of the two-word combinations inside it has a higher frequency, it is unlikely to be a phrase and should be removed from the list of possibilities.
- If the pattern has a very low frequency relative to the frequencies of the individual words in it, and those words aren't very common either, it is also less likely to be a phrase.
- Now that you have a list of potential phrases based on their frequencies, validate them by examining how different their meanings are from the meanings of the words inside them.
- For example, "Hello Kitty" means something completely different than saying "hello" (a greeting) to a "kitty" (a domestic cat). "Golden Gate" is neither golden, nor an actual gate. But "glass bowl" is just a bowl made of glass, so doesn’t need to be treated as a phrase to extract the accurate meaning.
- We approximate semantic meaning by comparing the contexts that the words and the phrase appear in. This is the same idea used in word vectors for machine learning, but in a simplified form (no complicated cosine distance metrics or curse of dimensionality).
- Collect 4 sets of frequencies: the words that appear before the phrase, the words that appear after the phrase, the words that appear before the first word in the phrase, and the words that appear after the last word in the phrase.
- Measure the Jaccard similarity of the two word-before sets, and of the two word-after sets (this works best if you take the frequencies into account instead of just the unique words). The similarity measures how much the phrase and its words are used in the same contexts, which is our approximation of similar semantic meaning.
- The complement of that similarity (1 - Jaccard similarity) is therefore our confidence that the phrase is important to treat differently.
- We can exclude any phrase candidates with a confidence below a selected ratio.
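The frequency and context steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in our System: titles are assumed to be already split into lowercase word lists, only the first frequency rule (a shorter pattern that occurs exactly as often as a longer one containing it) is applied, the frequency-aware Jaccard is taken to be sum-of-minimums over sum-of-maximums, and the before and after similarities are simply averaged. All function names are hypothetical.

```python
from collections import Counter


def ngrams(words, n):
    """All contiguous n-word sequences in a list (or tuple) of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(titles, max_n=3):
    """Count every 1..max_n-gram across all tokenized titles."""
    counts = Counter()
    for words in titles:
        for n in range(1, max_n + 1):
            counts.update(ngrams(words, n))
    return counts


def drop_subsumed(counts):
    """Keep multi-word n-grams, minus those that occur exactly as often
    as a longer n-gram that contains them."""
    subsumed = set()
    for gram, freq in counts.items():
        for n in range(2, len(gram)):
            for sub in ngrams(gram, n):
                if counts[sub] == freq:
                    subsumed.add(sub)
    return {g: c for g, c in counts.items() if len(g) >= 2 and g not in subsumed}


def weighted_jaccard(a, b):
    """Frequency-aware Jaccard: sum of minimums over sum of maximums."""
    keys = set(a) | set(b)
    union = sum(max(a[k], b[k]) for k in keys)
    if union == 0:
        return 0.0  # assumption: no context at all counts as no similarity
    return sum(min(a[k], b[k]) for k in keys) / union


def phrase_confidence(titles, phrase):
    """1 - average context similarity between a phrase and its edge words."""
    n = len(phrase)
    before_phrase, after_phrase = Counter(), Counter()
    before_first, after_last = Counter(), Counter()
    for words in titles:
        for i, word in enumerate(words):
            # Context of the first and last word, wherever they occur.
            if word == phrase[0] and i > 0:
                before_first[words[i - 1]] += 1
            if word == phrase[-1] and i + 1 < len(words):
                after_last[words[i + 1]] += 1
            # Context of the whole phrase.
            if tuple(words[i:i + n]) == phrase:
                if i > 0:
                    before_phrase[words[i - 1]] += 1
                if i + n < len(words):
                    after_phrase[words[i + n]] += 1
    similarity = (weighted_jaccard(before_phrase, before_first)
                  + weighted_jaccard(after_phrase, after_last)) / 2
    return 1 - similarity
```

On a toy corpus where "hello" and "kitty" also appear on their own in other contexts, `phrase_confidence` for `("hello", "kitty")` comes out well above that for `("glass", "bowl")`, whose words only ever appear together. A real feed would need far more data, plus the remaining frequency rules, before the numbers mean much.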
Once we have found the important phrases, we can more accurately perform a variety of analyses on raw product data that allow us to do the following:
- Find complete product types to improve product titles
- Extract structured attributes from unstructured product data
- Improve spell-checking by recognizing trademark phrases, brands and product lines
- Automatically discover and optimize the order of elements in a title
- Create a semantic relationship between similar meaning words for suggesting related products
- Combine individual product variants into appropriate family groups
When you make maximum use of all the components in your product data, you avoid ambiguity but more importantly, you hold the power to make maximum use of that data in Product Titles to get the best performance.