What it is
One of the most common technical problems we see is duplicate and near-duplicate content. Often this arises from faceted navigation systems, for example the color, size, or style selection dropdowns or tick boxes on ecommerce sites.
Let’s say we have a range of 200 donut products. The user can filter this by 6 different specialities, 14 different flavors, and 3 different pack sizes. Assume that the user can select at most one speciality, one flavor, and one pack size at a time. Because not selecting anything from a group also counts as an option, the number of possible combinations (and therefore the number of potentially indexable pages created by this faceted navigation system) is 7 x 15 x 4 = 420 pages, or more than twice the number of actual products we have. If the user can select more than one option at a time, or the order of parameters in the resulting URL isn’t strictly enforced, the number of pages grows even faster.
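The arithmetic above can be checked with a short Python sketch. The facet names and function names here are illustrative only, and the second function assumes that every ordering of the selected parameters produces a distinct URL, which is the worst case when parameter order is not enforced.

```python
from itertools import combinations
from math import factorial, prod

# Option counts taken from the donut example; the key names are illustrative.
FACET_OPTIONS = {"speciality": 6, "flavor": 14, "pack_size": 3}


def count_filter_states(options):
    """Count filter states when each facet is either unset or has one value."""
    # The "+ 1" is the "nothing selected" option for each facet.
    return prod(n + 1 for n in options.values())


def count_urls_any_order(options):
    """Count distinct URLs if the selected parameters may appear in any order."""
    counts = list(options.values())
    total = 0
    for k in range(len(counts) + 1):
        # For every set of k facets in use, each value combination can be
        # written in k! different parameter orders.
        for chosen in combinations(counts, k):
            total += prod(chosen) * factorial(k)
    return total


print(count_filter_states(FACET_OPTIONS))   # 420
print(count_urls_any_order(FACET_OPTIONS))  # 1824
```

Under these assumptions, leaving parameter order unenforced takes the same single-select filters from 420 to 1,824 crawlable URLs.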
Why it matters
We don’t want all these pages to be indexed, because we have a limited amount of search engine attention (‘crawl budget’) and it’s unlikely that there will be worthwhile search traffic for ‘low-fat strawberry donuts 12 pack’. Even in the long tail, that combination is unlikely to warrant its own landing page. On the other hand, ‘vegan jelly donuts’ may well be a useful landing page.
So we need to decide which pages should and should not be indexed, and how to deal with the ones that shouldn’t be. As you may have guessed, this relates heavily to domain management and site organization.
What to do
- Consider where the content is likely to have the greatest value, e.g. which pages are the desired landing points for organic traffic.
- Enforce a fixed order for URL parameters, so that the same selection of filters always produces the same URL. This can dramatically reduce the number of pages produced.
- Decide which copy of any duplicate content is best suited to being the primary or ‘canonical’ copy.
- Where possible, rel=canonical markers should be used to indicate parent URLs for near-duplicate or similar child URLs with lower search value.
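
As a minimal sketch of the last three points, the Python below puts facet parameters into a fixed order and builds a rel=canonical link element pointing at a parent URL. The parameter names, the example.com URL, and the decision that only speciality and flavor are worth indexing are all assumptions made for illustration; a real site would choose these from its own traffic data.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical facet parameter names, in the one order we allow in URLs.
FACET_ORDER = ["speciality", "flavor", "pack_size"]
# Hypothetical choice: only these facets earn their own indexable page.
INDEXABLE_FACETS = {"speciality", "flavor"}


def _rebuild(url, allowed):
    """Rebuild a URL keeping only the allowed facets, in FACET_ORDER."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    # Parameters that are not known facets are dropped in this sketch.
    kept = [(k, params[k]) for k in FACET_ORDER if k in params and k in allowed]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def enforce_param_order(url):
    """Return the URL with all facet parameters in the fixed order."""
    return _rebuild(url, set(FACET_ORDER))


def canonical_url(url):
    """Return the parent URL that low-value child pages should point to."""
    return _rebuild(url, INDEXABLE_FACETS)


def canonical_link_tag(url):
    """Build the link element to place in the page's head."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


page = "https://example.com/donuts?pack_size=12&flavor=strawberry&speciality=low-fat"
print(enforce_param_order(page))
print(canonical_link_tag(page))
```

Here the ‘low-fat strawberry 12 pack’ page declares the ‘low-fat strawberry’ page as its canonical parent, while a page whose filters are all indexable simply points at itself.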