The Basics of Finding and Fixing Duplicate Content
Duplicate content is probably the single most common problem we come across when auditing clients' sites. Given the frequency of this issue, I thought a quick and simple guide to what it is, why it's a problem and how to find it might be useful.
What Is Duplicate Content?
Duplicate content is simply when you have two or more URLs which serve up the same or very similar content. The URL part of this is important. A common cause of duplicate content is having two URLs pointing to the same file. To web developers this means that they are pointing to the same page, but the search engines treat each URL as a separate page.
Content can be duplicated both within a single domain and across domains. Either way, it should be avoided if possible.
Why Is Duplicate Content Problematic?
How serious the problem is depends on the scale of the duplication, but there are four issues that can arise as a result of duplicate content.
1. It can split authority. The equation here is pretty simple. The more links you have pointing to a page, the more authoritative it will be. If these links are split between two pages, then neither page will be as authoritative as it could have been.
2. It makes it harder for search engines to find and index new pages. Search engines will only crawl a certain number of pages on your website. How many pages they crawl depends on a range of factors including PageRank, how much traffic the page can support and so on. If you have a lot of duplicate content then it will take up a lot of your 'crawl budget' that could be better spent on new pages. Stone Temple has an interview with Matt Cutts that goes into this in more depth.
3. Panda. Part of the reason for this update was to penalise websites which tried to target a range of search terms by creating a lot of very similar pages, with the content duplicated apart from a few keywords swapped out. For example, having a page for each city in the UK with the same content on it, just with the city name changed.
4. Search engines ranking the wrong page. Simply put, if you have two or more pages that could be relevant for a query, it is harder for the search engines to identify which is the correct one to serve for that search term. How much of a concern this is for you may vary, but subtle differences between pages can make a big difference to conversions.
So that's why you should get rid of it. How do you find it?
Searching manually is the most time-consuming method, but if you know where to look for duplicate content then it can be a quick way of picking out some key pages. The classic example of this is duplicate home pages.
One of the first things I do when looking at a site is try the following combinations to see if there are any duplicates.
- http://www.example.co.uk (and other top level country domains)
While these duplicates will often be picked up by the methods below, the manual search makes for a quick way to check known trouble spots.
Other areas to check are products which fall into multiple categories such as:
Any pages where parameters are applied, such as tracking parameters, session IDs and so on, for example:
And paginated pages such as http://www.searchlaboratory.com/blog/category/seo/page/2/
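The trouble spots above mostly come down to several URLs that differ only in host prefix, default file name or query parameters. As a rough illustration, the sketch below reduces each URL to a comparison key and groups URLs that share one. The parameter names and index file names are assumptions for the example, not a definitive list, and a shared key only marks a candidate worth checking by hand.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed, illustrative parameter names; swap in whatever your site uses.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sid"}
# Assumed default document names that often duplicate the directory URL.
INDEX_FILES = ("index.html", "index.htm", "index.php")


def normalise(url):
    """Reduce a URL to a rough (host, path, query) key for comparison."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www and non-www as the same host
    path = parts.path or "/"
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]  # /index.html becomes /
            break
    # Drop tracking and session parameters, then sort what is left.
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in IGNORED_PARAMS]
    return (host, path, urlencode(sorted(query)))


def group_candidates(urls):
    """Return only the keys that more than one URL maps to."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

The scheme is ignored here, so http and https versions of a page land in the same group. Trailing slashes are left alone, which you may or may not want.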
There are a number of crawlers out there, but my favourite is Screaming Frog. SEOMoz's Crawl Test has the advantage of flagging pages with duplicate content. It only covers 3000 pages, but that is often more than enough to tell you how duplicate content is being generated. Google's Webmaster Tools isn't a crawler itself, but it will give you information gathered by Googlebot. All of these can help you find duplicates.
The first stage is to check for duplicate title tags. Duplicate title tags don't always result from duplicate pages, but checking for them is a quick way to find candidates. Once you have found duplicate tags you can then check the URLs to see if the content is duplicated or just the title tags.
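If your crawler lets you export its results, the title check can be scripted. The sketch below assumes you have already turned the export into (URL, title) pairs; that input shape and the function name are my own assumptions, not any particular tool's format.

```python
from collections import defaultdict


def duplicate_titles(pages):
    """pages: iterable of (url, title) pairs.

    Returns {normalised title: [urls]} for titles used by more than one URL.
    """
    by_title = defaultdict(list)
    for url, title in pages:
        # Collapse whitespace and case so trivial differences don't hide a match.
        key = " ".join(title.split()).lower()
        by_title[key].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}
```

Each group it returns is only a list of candidates. You still need to open the URLs to see whether the content is duplicated or just the title.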
I use http://www.urlopener.com/index.php along with the Snap Links Lite extension if I want to check a lot of pages at once. You should be able to spot a pattern fairly quickly which tells you which URLs are creating duplicate content and which aren't.
"Site:" Search Operator
The site search operator is another way to pick up on duplicate content. A quick and simple technique is to copy a block of text and then search your own site using the following syntax "site:www.example.com [paragraph of text copied from site]". If Google tells you that it has omitted some entries very similar to the ones already displayed, then you know that there are duplicate results.
You can find the duplicates just by repeating the search with the omitted results included.
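If you are checking a lot of paragraphs, building the query by hand gets tedious. Here is a small sketch that assembles the search string and a Google search URL for it. The function names and the word limit are assumptions for illustration. Long queries tend to get truncated, so the snippet is trimmed to a fixed number of words.

```python
from urllib.parse import quote_plus


def site_query(domain, snippet, max_words=32):
    """Build a site: query with the snippet as an exact phrase.

    max_words is an assumed cut-off to keep the query short.
    """
    phrase = " ".join(snippet.split()[:max_words])
    return 'site:{} "{}"'.format(domain, phrase)


def search_url(query):
    """URL-encode the query for use in a browser address bar."""
    return "https://www.google.com/search?q=" + quote_plus(query)
```

For example, site_query("www.example.com", "hello world") gives site:www.example.com "hello world", which you can paste straight into the search box.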
Fixing Duplicate Content
As this is a basics blog post and I want to avoid writing a dissertation, I'll just cover the two simplest ways of fixing duplicates. My colleague Tom has covered using robots.txt files to block duplicate content in another post.
Personally I would use 301 redirects when the content is identical and there is no reason for the duplicate page to exist independently. Examples of this include duplicate homepages (with some exceptions), product pages with different navigation paths and duplicate pages that have been created by mistake. With these pages, simply place a 301 redirect on the duplicate pointing to the original. This will pass most of the duplicate's authority on to the original page and make it easier to navigate the site.
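Redirects are normally set up in your web server or CMS configuration, and how you do that depends on your platform. Purely to show what a 301 looks like at the HTTP level, here is a minimal sketch as a Python WSGI application. The paths in the mapping are hypothetical examples.

```python
# Hypothetical mapping of duplicate paths to the original page they copy.
REDIRECTS = {
    "/index.html": "/",
    "/sale/red-shoes/": "/shoes/red-shoes/",
}


def app(environ, start_response):
    """Answer known duplicate paths with a permanent redirect."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 tells browsers and crawlers that the move is permanent.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    # Any other path is served as normal (stand-in response body).
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"original page"]
```

The important parts are the 301 status and the Location header naming the original page. A 302 (temporary) redirect would not send the same signal.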
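A canonical tag is simply a type of link you can add to the <head> section of the source code of the duplicate page. It has the following format, where the href is the URL of the original page (the address here is a placeholder):
<link rel="canonical" href="http://www.example.com/original-page/" />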
This tag tells the search engines that you know that the page is very similar to another page and that they should index the other page in preference to the page with the canonical tag on it.
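Once canonical tags are in place it is worth checking that each duplicate actually points where you expect. The sketch below pulls the canonical URL out of a page's HTML using Python's standard html.parser module. The class and function names are my own, and fetching the HTML is left out.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel can hold several space-separated values, so split it.
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Run it over each duplicate URL you found earlier and compare the result with the page you want indexed.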