What's the problem with duplication of content? - Webcertain Multilingual SEO


Duplication And De-Duplication

What's the problem with duplication? What's the problem with duplication...? What's the...?

In order to present the most appropriate web pages to searchers, Google and other search engines assess the pages in their candidate list - the selection of pages which might be presented in response to a query or search - to eliminate pages that are very similar to other pages in the list.

These so-called ‘duplicate filters' look mainly at the text content of the page after stripping commonly repeated elements, such as the navigation systems, away. The result is that only the best page for a query (such as the one with the greatest number of good quality inbound links) will be shown - the rest will be hidden away in a link at the very bottom of the results with the words:

"In order to show you the most relevant results, we have omitted some entries very similar to the 617 already displayed. If you like, you can repeat the search with the omitted results included."

These duplicate filters are often described as the ‘duplicate penalty' but in fact they are no more a penalty than not ranking at the top of a set of results because your page wasn't the most relevant. Indeed, for many organisations, if the duplicate filter is hitting your results, it has relatively little impact because at least one of your pages is ranking successfully.

However, for others who have different pricing or product offerings per country, or who need to meet specific regulations that apply in one country and not another, the duplicate filters can make a real mess of their conversations with customers and prospects.

It is also true that a site's performance can suffer if there are LARGE amounts of duplication floating about. Ultimately, this can indeed result in some kind of penalty - though this is relatively rare. More commonly, on very large sites (in the 100,000s of pages), the crawl rate of the site can be affected by the number of duplicate pages presented to the search engines. If the crawl rate is limited, or search engine robots hit a crawler cap, a new problem emerges: not all your quality content may reach the index, because the crawler never actually reaches it within your site.

Additionally, if the duplication takes place across different sites - and many of those are sites you don't own - then you have a potentially serious problem. A few years ago, we were working on an international travel site where the same hotel content was being published, without any changes, across 18 different sites in multiple countries, all in English. Our client was losing the duplication battle: search engines were choosing another, more ‘relevant' site to display instead.

It is important to note that duplication arises only within the same language. The same page in English, Spanish and French - even if it expresses identical ideas - will never be considered identical. The duplication challenge is most common on sites which make heavy use of global languages such as English, Spanish, French and German. So an organisation targeting the US, UK, Australia, New Zealand and South Africa, or one aiming to reach Argentina, Chile, Peru and Mexico, is among the most likely victims of duplication filters.

How do you achieve ‘De-Duplication'?

De-duplication is an ugly word - but regrettably it's the best description of all the actions you can take to prevent duplication muddles. There are a number of ways you can deal with the problem and the solution for each site and company will vary depending on the circumstances. Let's start with the first area to look at which is site architecture.

Site architecture's role in de-duplication

Reviewing your site architecture is definitely one of the first places you should look to solve duplication issues. It's not the subject of this international SEO guide, but it is entirely possible to create duplicate content on a site through the structure of the site itself - such as presenting categories of product first by price and then by brand on an ecommerce site. These two ways of presenting the same products can easily result in a duplication problem that is entirely self-inflicted.

The best way to solve duplication in this instance is to change the site architecture if you can. Don't offer the same products with the same content in different ways.

On an international site this is actually more difficult. It's not easy to change the details about a product between different countries when it's the same product, with the same features and the same benefits. Or is it? You can present different information in a different order in each country, provided that you don't present essentially the same ‘shingles' - the overlapping sequences of words that search engines compare - to the search engines. Note that merely changing the order and placement of the same shingles will have no effect: if the page was regarded as a duplicate before, it still will be. So you need to change the shingles themselves to have an impact on your duplication issue. This may mean combining different blocks of text into a single shingle - and then presenting them in a different order on other sites. Or you can simply look at your copywriting methodology.
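To make the shingle idea concrete, here is a minimal sketch of how shingle-based similarity can be measured. The real algorithms search engines use are proprietary; this simply illustrates the general w-shingling idea with word-level 4-shingles and Jaccard similarity, using made-up example sentences:

```python
# Sketch of shingle-based duplicate detection (illustrative only).
# Word-level 4-shingles compared with Jaccard similarity.

def shingles(text, w=4):
    """Return the set of w-word shingles in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical near-duplicate product descriptions:
page_us = "Our hotel offers free wifi and a pool for all guests"
page_uk = "Our hotel offers free wifi and a pool for every visitor"

sim = jaccard(shingles(page_us), shingles(page_uk))
print(f"Similarity: {sim:.2f}")  # prints "Similarity: 0.60"
```

Because the comparison works on *sets* of shingles, reordering the same shingles leaves the similarity score unchanged - which is why the text inside the shingles, not their placement, has to differ.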

Changing the content or copywriting of your site is a very simple way to manage the impact of duplication filters. All you need to do is rewrite your content so it says the same things in a different way, and the impact of the duplication filters should disappear. If the content of your site varies little, or it's a small site, or your duplication issue is minor, this may also be the most cost-effective and simplest solution to effect. But if your site has thousands or tens of thousands of pages, it's not going to be a viable option. You should also consider the impact of domain names and geo-targeting on your issue.

If a search engine - especially Google - knows that a particular site is targeted at a specific geography, i.e. France rather than Canada, it helps the search engine to choose between two duplicates and pick the most relevant to that particular geography. The same rules apply here as described in Geo-location. The most important additional point to note in this context is that using a local domain (ccTLD) can help significantly to reduce the impact of duplication, as it is the strongest signal to search engines of which country a site is targeted at. The other option would be to adopt a form of crawler blocking to remove duplication issues.

Let's say you have 9 duplicate pages - US, Canada, UK, Ireland, South Africa, Australia, New Zealand, India and the Philippines. All these sites show the same content in English. In addition to localising some of your content (for the Philippines and India, say, which could reduce the issues as well as possibly improve conversion), you can also think about preventing some of these pages from appearing in the search engine indexes.

In certain circumstances, actually blocking the search engines from content can help to reduce your problems with duplication.

You would choose to block content from search engines where:

  • Local pricing or product feature variations are not an issue from country to country.
  • At least one form of your content is ranking for relevant terms.

There are various ways you can restrict the content contained in a search engine's index to achieve this:

  1. You can block that part of the site in the robots.txt file.
  2. You can add a robots meta tag with ‘noindex' to the pages you no longer wish to appear (best to leave the ‘follow' aspect of this tag open).
  3. You can construct some device so that part of the site isn't crawlable by search engines - such as using JavaScript or images - but we wouldn't recommend this!
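The first two options above might look something like this - a sketch only, with hypothetical paths and sites:

```
# Option 1 - robots.txt on, say, the South African site, blocking the
# crawl of duplicate English product pages (the path is a hypothetical
# example):
User-agent: *
Disallow: /en/products/

# Option 2 - a robots meta tag in the <head> of each affected page;
# 'noindex' keeps the page out of the index, while 'follow' still lets
# crawlers follow its links:
<meta name="robots" content="noindex, follow">
```

Note the difference: a robots.txt Disallow stops the page being crawled at all, while a noindex meta tag lets it be crawled but keeps it out of the index - the meta tag only works if the page is not also blocked in robots.txt.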

If you're going to allow the incorrect country's content to appear in a particular country, a wise move may be to add a link such as "Looking for our information for Austria?" on, say, a German page ranking in Google.at, so that visitors themselves re-navigate to the right location on your site.

In certain instances, the best solution to duplication relates to timing. If your problem relates to blogging and content being redistributed via RSS, for instance, then why not simply delay the publication of that content in RSS by a full 24 hours? This gives the search engine a clear clue - identifiable by the date and time - that the originator of the content was the first to publish it.
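One way such a delay could be implemented is to hold items back from the feed until they are 24 hours old. This is a minimal sketch, assuming each item carries a published timestamp; the names and the feed-generation details around it are hypothetical:

```python
# Sketch: embargo items from the RSS feed until 24 hours after
# publication, so the original page can be crawled and indexed first.
from datetime import datetime, timedelta, timezone

EMBARGO = timedelta(hours=24)

def items_ready_for_feed(items, now=None):
    """Return only the items published at least 24 hours ago."""
    now = now or datetime.now(timezone.utc)
    return [i for i in items if now - i["published"] >= EMBARGO]

# Hypothetical feed items:
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
items = [
    {"title": "Fresh post", "published": now - timedelta(hours=3)},
    {"title": "Older post", "published": now - timedelta(hours=30)},
]
print([i["title"] for i in items_ready_for_feed(items, now)])
# prints "['Older post']"
```

The same filter would run each time the feed is regenerated, so a held-back item appears automatically once its 24 hours are up.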

In summary, duplication is a common problem for international sites and, despite the search engines' best efforts to deal with it, the wise marketer will manage the repetition of content between sites - rather than leaving the search engines to do it for them according to their rules and not yours!

