BlackHatMoneyMaker.com - BlackHat SEO - BlackHat Forum

BlackHat – “Duplicate Content Penalty”

#1
01-06-2010, 05:20 PM
BlackHat Novice
Join Date: Dec 2009
Posts: 201

Duplicate Content Penalty – Does it Really Exist?

You've been hearing this phrase for a few years now. You've been told that it is bad: that it can kill your site's rankings, get your site deindexed by Google, drop pages, and penalize pages. The list goes on and on. What is truth? What is myth? Are there any facts?

I've got the answers for you.

Quick Summary: Duplicate content is detected fairly quickly and accurately by Google within a single domain. Mostly it is detected by the following:

* Repeated Page Titles on Multiple Pages
* Print Friendly Pages that are the same as HTML pages
* Inconsistent Linking (www and non-www) and not using a Server Redirect
* Circular Navigation: when using breadcrumbs on eCommerce sites, the same item can be reachable at both brand/category/item and category/brand/item
* Product Only Pages (for example, different pages featuring different colors but using the same content)
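The first item in the list, repeated page titles, is easy to check on your own site before Google does. Below is a minimal sketch, assuming you already have the HTML of your pages in hand; the function names and the sample pages are hypothetical and only for illustration. Pages with no title at all are grouped together under an empty title.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.finished = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.finished:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.finished = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title, lowercased and with whitespace collapsed."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split()).lower()


def find_repeated_titles(pages):
    """pages maps URL -> HTML source. Returns {title: [urls]} for titles used on 2+ pages."""
    by_title = defaultdict(list)
    for url, html in pages.items():
        by_title[extract_title(html)].append(url)
    return {title: sorted(urls) for title, urls in by_title.items() if len(urls) > 1}


# Hypothetical pages, for illustration only
sample_pages = {
    "/widgets/blue/": "<html><head><title>Blue Widgets</title></head></html>",
    "/print/widgets/blue/": "<html><head><title> blue  widgets </title></head></html>",
    "/widgets/red/": "<html><head><title>Red Widgets</title></head></html>",
}
print(find_repeated_titles(sample_pages))
```

Any title that comes back with more than one URL is a candidate for a rewrite or a redirect.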

However, duplicate content outside the domain isn't detected at the level that Google claims it can be. There are numerous examples where the same article listed on two different domains is ranked #1 and #2 in the SERPs.

In terms of an actual duplicate content penalty, there is none that I have detected. The page is either sent to the Supplemental Index or is deindexed. There is no drop in ranking that can be verified. From testing and research it is conclusive that the duplicate content penalty is more of an internal site issue than an external site issue. There are too many exceptions to believe that duplicate matches across domains are being picked up by the filters. Any filtration that is occurring across domains must be through manual filters or responses to spam reports, or because the duplicate pages are linked together, making them easy for Google to find.

What You Should Do: Stop stressing over duplicate content but ensure that you are following basic protocol for site building as explained below:

Uncovering the Duplicate Content Penalty

Let's first define duplicate content, as written by Adam Lasnik of Google on December 18, 2006:

What is duplicate content?
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Most of the time when we see this, it's unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and worse yet linked) via multiple distinct URLs, and so on. In some cases, content is duplicated across domains in an attempt to manipulate search engine rankings or garner more traffic via popular or long-tail queries.

Translation: Duplicate content is defined as a "substantive block of content" appearing on more than one page in Google's index.

Question: What is the definition of a "substantive block of content"? Google doesn't tell us that, and of course that is done for a reason. Let's look for a moment at a paragraph of content and test it to see if it is duplicate content:

There are many aspects that set us apart from our competitors-low prices, no minimum orders, years of experience, etc. Ultimately, however, we offer exactly what today's retail customer wants most: the best products at the best prices.

Come on in and visit our store now.

The above text is taken from a dynamic web builder that places auto-generated default text on the home page for new accounts. Doing a search for that exact paragraph produces 193,000 results in Google. At least we know there are 193,000 lazy website owners who couldn't even change the default text. So, we know that a single paragraph isn't a "substantive block of content".

So, let's look at more:

There are many aspects that set us apart from our competitors-low prices, no minimum orders, years of experience, etc. Ultimately, however, we offer exactly what today's retail customer wants most: the best products at the best prices.

Come on in and visit our store now

Our Commitment To You

We're committed to bringing you the products you want at the best prices every day, and to providing you with the convenient, secure shopping experience you deserve. If you have any ideas about how we can serve you better, please contact us and we'll do our best to help.

There are 687 pages in the Google index that have this exact wording. Even down to the ellipses.

So, that isn't duplicate content either. Let's try again. I took full articles and did searches. They were indexed in multiple places, the majority of them with PageRank. Hmmm. Okay, so that isn't duplicate content either. I give up. What exactly is "substantive content"? We may never know.

Now that we have established that duplicate content detection on external sites is not occurring on a consistent basis, let's go back and figure out where the phrase "duplicate content penalty" came from and what Google does about it. We again turn to Adam Lasnik for a breakdown:

What does Google do about Duplicate Content?

During our crawling and when serving search results, we try hard to index and show pages with distinct information. This filtering means, for instance, that if your site has articles in regular and printer versions and neither set is blocked in robots.txt or via a noindex meta tag, we'll choose one version to list. In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved. However, we prefer to focus on filtering rather than ranking adjustments, so in the vast majority of cases, the worst thing that'll befall webmasters is to see the less desired version of a page shown in our index.

Translation: If Google detects duplication, the pages will be removed via a filter (removal from index) rather than a ranking adjustment (drop in the SERPs).

History of the term Duplicate Content as it relates to Google Search

In September 2003, Google patented an update to its filters that helps detect query-specific duplicate documents. Later that year, in December 2003, it received another patent, which added a filter that assists in detecting duplicate or near-duplicate files. Google began assigning a number of fingerprints to a given document, and a document would be treated as duplicate content if one of its fingerprints matched another document.

Note: I have tested this and tested this and I cannot discover the existence of a fingerprint or multiple fingerprints in a document.
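To make the fingerprint idea concrete, here is a minimal sketch of one common textbook approach: hash overlapping word windows (shingles) and compare the resulting sets. This is NOT Google's patented method, which isn't spelled out in this post; the shingle size of 5 words, the use of SHA-256, and the function names are all my own assumptions for illustration.

```python
import hashlib
import re


def shingles(text, k=5):
    """Return the set of overlapping k-word windows in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprints(text, k=5):
    """Hash each shingle to a short fingerprint (assumed scheme, for illustration)."""
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()[:16] for s in shingles(text, k)}


def similarity(a, b, k=5):
    """Jaccard similarity of two documents' fingerprint sets, from 0.0 to 1.0."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


original = "the quick brown fox jumps over the lazy dog"
reprint = original + " near the river bank"
unrelated = "a completely different sentence about search engine rankings today"

print(similarity(original, original))   # identical text
print(similarity(original, reprint))    # partial overlap
print(similarity(original, unrelated))  # no shared shingles
```

A scheme like this scores two pages on a sliding scale rather than a yes/no match, which is one reason a "substantive block" is hard to pin down from the outside.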

At Webmaster World in 2004, Matt Cutts, Spam Czar of Google, stated that there is a waiting period before offenders of the duplicate content issue have their pages placed back in the Google index. While Matt didn't use the term "sandbox", that is essentially what was alluded to. No warning is emailed to the webmaster giving the reason for the removal. Here is the "sentence" that will be given:

First Offense – 30 day waiting time
Second Offense – 60 day waiting time
Third Offense – 90 day waiting time

Then in January 2005, Danny Sullivan reported that after he asked Matt Cutts about the 30-60-90 numbers, Matt backed down and said, "I mentioned concrete numbers [at PubCon], but as a 'for example' illustration. It's not a 30-60-90 day thing."

Danny added commentary, but it was pure speculation: "So, for example, a first offense might go two weeks, a second six weeks. Exactly what the timing would be between offenses would be up to Google and might even vary depending on the site involved, it sounds like."

Then in January 2006, Google announced another patent, this one for phrase identification, which essentially identifies documents as being unique.

In November 2006 at Webmaster World in Vegas, Brian White, who works with Matt Cutts, stated that if you use syndicated content on your site, you need to link to the original source with an absolute link or you will get hit with duplicate content.

On January 5, 2007, Adam Lasnik clarified many issues in a recent post on Google's blog. The issue of note is that he commented on a suggestion to add a "duplicate content meter" to Webmaster Central so webmasters would know if there was a problem. Adam thought it was a great idea, but also stated that it isn't as simple as "Page A is x% the same as Page B, so it is duplicate."

And Adam states, "Penalties in the context of duplicate content are rare. Ignoring duplicate content or just picking a canonical version is MUCH more typical."

Adam states that there isn't a set time period for new sites to be sandboxed, and further said that Google's algorithms look at the pages of a site and ask: What value is this site providing that users can't get from other sites or even the mother ship (originator of content)?

I highly doubt the above is happening. If the Google algorithm is really so good that it has complex human decision-making ability and can form opinions about whether a site provides value to the user beyond what is available elsewhere, they should be able to run the entire company with five people: CEO, Server Babysitter, Gofer, Receptionist, and a Janitor.

In October 2008, John Mueller, in his Myths and Misconceptions presentation, stated, "Don't worry too much about duplicates on your site" and "We're good at ignoring duplicate." Just the opposite is true. Internal duplicate content is extremely damaging. And let's face it: if Google is really saying that they are good at not finding external duplicate content, and thus ignoring it, they would be making a correct statement here.

Now, if you really look at the threads, posts, and transcripts of conferences, Matt, Adam and Brian rarely, if ever, say "duplicate content penalty"; they usually say just "duplicate content" or "duplicate content filter". The term "penalty" was added onto the end of the phrase by a poster in a forum.

It seems as if we have been paranoid over nothing all these years. But let's push forward.
The Following User Says Thank You to saad_sinpk For This Useful Post:
shadowpwner (01-25-2010)

#2
01-06-2010, 05:21 PM
BlackHat Novice
Join Date: Dec 2009
Posts: 201

Testing Setup and Results

For this test I used 32 domains spread across 6 of my 8 servers. I posted duplicate content from other sites on these domains, as well as duplicate content between the sites. I updated the XML feeds on all of these sites with the new pages. Google spidered the duplicate pages within a week and added them to its index within a week after spidering them. While the pages didn't rank well for the main keyword phrase for obvious reasons (no incoming links), they were listed in the index and traffic was being generated. It is clear that there was no duplicate content filter, let alone a penalty.


Duplicate Content Myths Debunked

1) Duplicate Content is a problem because Google says so. Just because they say so doesn't mean it is so. Duplicate content is rampant across the Google index, and in the past it has existed on Google's own site without a problem, just as Google has been caught cloaking as well.

2) Duplicate Content Confuses the Search Engine Spiders. Spiders are stupid by nature. They don't get confused by content; they get confused by bad code, circular navigation, session IDs, etc. Spiders don't care about content, they just grab what they can find so it can be indexed and cataloged at the mother ship.

3) If someone steals my content and they have a higher PageRank than me, they will get credit and my site will get penalized. This has been the fear of every Webmaster that has heard about the duplicate content penalty. Some Webmasters actually claim to have seen it happen to their site. However, on closer examination, other factors were revealed that led to the drop. Most notably, their marketing efforts (link building) to the page had basically stopped. The point is, someone can't rip off your content, post it on their higher PR domain and then get your content dropped from Google.

4) If my article gets distributed, only one copy will count as a link, all the others will be ignored. This is only true if the article is tagged as duplicate content. It is not something you should worry about, as the more your article gets distributed, the more potential traffic your site can receive.

Q&A on Duplicate Content

Q: What if I have two pages on my site that are very similar in content?

A: Is there a real reason to have both pages? If you need both, make sure each has a unique Title Tag and that the two are NOT even remotely similar. If you only need one version of the page, I recommend comparing both pages, keeping the one that has the higher PageRank (verify by doing an InLink check in Yahoo! Site Explorer), and doing a 301 redirect from the other to it.

Q: What internal linking strategies should I do to protect my site against duplicate content? Is this an absolute vs. relative linking issue?

A: No, this is not an issue between absolute linking and relative linking. What is recommended is that you choose a linking method and stick with it. While you won't get penalized for having a mix of relative links and absolute links on your site, for your sanity, you should stick to one, and the preferred one is absolute linking.

Other areas in which you should be consistent are the trailing slash and the use of index.html in your linking. For example, you can link to /page/, /page, and /page/index.html. Pick one and stick with it. The one I recommend is the first, with the trailing slash after the directory; this includes your domain too. Your links should be http://www.domain.com/, not http://www.domain.com. Consistency will benefit you in the long run.

For a permanent fix, use this rewrite in your .htaccess file. It will help overcome a bug in Google that can accidentally penalize your site for duplicate content.

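###### Begin Code ######
# FollowSymLinks must be enabled for mod_rewrite to work in .htaccess
Options +FollowSymLinks
RewriteEngine on
# Only act when the client actually requested a URL containing /index.html
RewriteCond %{THE_REQUEST} ^.*/index\.html
# Permanently redirect /anything/index.html to /anything/
RewriteRule ^(.*)index\.html$ http://www.yourdomain.com/$1 [R=301,L]
###### End Code ######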

Note: Don't forget to put in your own domain in place of yourdomain.com. ;-)
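If you want to audit the links already on your site against the consistency rules above (www host, trailing slash on directories, no index.html), a small script can compute the one preferred form of each URL. This is only a sketch under my own assumptions: the function name is hypothetical, www.domain.com stands in for your real host, and any last path segment containing a dot is treated as a file and left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url, preferred_host="www.domain.com"):
    """Return the preferred form of a URL: www host, trailing slash, no index.html."""
    parts = urlsplit(url)
    path = parts.path or "/"

    # /page/index.html -> /page/
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]

    # /page -> /page/ (assumption: a segment with a dot is a file, so leave it)
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment and "." not in last_segment:
        path += "/"

    # domain.com -> www.domain.com
    host = parts.netloc.lower()
    bare_host = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare_host, preferred_host):
        host = preferred_host

    # The fragment is dropped because it never reaches the server
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


for link in ["http://domain.com", "http://www.domain.com/page",
             "http://www.domain.com/page/index.html"]:
    print(link, "->", canonical_url(link))
```

Any link whose canonical form differs from what is in your HTML is one to fix at the source, rather than relying on the redirect alone.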

Q: What do I do if I want print versions of all of my pages on my site?

A: Make sure that they are in their own folder, such as print, and then exclude that folder from being indexed in your robots.txt like this:

User-agent: *
Disallow: /print/

You can also use the noindex meta tag, but my testing shows that it is better to use the robots.txt file: making changes to dozens or hundreds of pages is time consuming and error prone, whereas the robots.txt entry is just two lines of text.
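Before you rely on those two lines, it is worth confirming that they block what you think they block. Here is a quick sketch using Python's standard urllib.robotparser module; the helper name and the sample URLs are my own placeholders, and the parser follows the basic robots.txt rules rather than any one search engine's exact behavior.

```python
from urllib.robotparser import RobotFileParser

# The same two-line entry shown above
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable(ROBOTS_TXT, "http://www.yourdomain.com/print/page.html"))  # False
print(is_crawlable(ROBOTS_TXT, "http://www.yourdomain.com/page.html"))        # True
```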

Q: What if I have multiple languages? Will I be penalized if the same content is found on the web, but in different languages?

A: No, you will not be penalized for having the same content in different languages. What I have done for my domains is I place the translated files in a subdomain instead of getting a TLD (Top Level Domain). While this has worked for me well, Google suggests directly getting the TLD as it will help them serve the right version of a document as they know that .de will be in German.

While this is different from what I have done, I would follow Google's advice here, as having the TLD would be better long term.

Q: Our industry is heavily regulated and we must have a lengthy disclaimer at the bottom of each page. Will that get picked up as duplicate content?

A: Since it is on every page of your site, there is a chance it could cause problems, but nothing that I have been able to solidify in testing in terms of a penalty which affected ranking. However, if you are paranoid and would rather be safe than sorry, then there are two choices for you:

1) Move the disclaimer to a separate page and include a brief summary of the disclaimer with a link to the complete text.
2) Create a graphic out of the text and include that at the bottom of every page. Since it is a graphic, the text will not be indexed. Your ALT text could be "Company XYZ Disclaimer".

Q: I have a blog and it saves two copies of every post that I do. Will that cause problems?

A: Yes it will. Go into the settings and change it so only one copy is archived instead of two. This often happens with WordPress.

Q: Our company is taking our entire catalog online, over 6,000 products. What is our best option for ensuring that our pages get indexed and we don't run into issues with internal duplicate content? All the products will be served dynamically.

A: My answer will shock you, but you have the best chance of success if you follow this advice with a large online store. Most large catalog sites have a 1-2% conversion ratio, and those are the ones that are good. Others are well below 1% and they are struggling. I spoke today with three webmasters (good friends of mine) who all are in charge of large catalog sites. A few months ago, they asked the above question, and I came back with a suggestion. They implemented it (through a lot of swearing), and the results have been outstanding.

The Results: An average conversion ratio of 9%, more phone calls, more clicks on PPC for the same budget, more organic traffic, and, for the visitors that hit their landing pages, a conversion ratio in double digits.

So, what is it that they are doing? They built a custom landing page for every single product. Yes, every single one. They started with the products that had the highest sales volume and moved down the list. Some hired staff to help. One of the key components is that they surveyed their current customers and then they concentrated on focusing on a problem (obtained from the surveys) and offered a solution in the headline of the page. They also supplied an easy to find phone number, a Call to Action that is right where the customer expected it to be (right by the picture of the product), etc.

While the easy way is to have the dynamic pages indexed by the search engines, the hard way was to put forth an effort and create something that the prospect would love to see, that would speak to them, and that would drive a higher conversion. In fact, all of them reported that in AdWords their quality score on these landing pages was "Great", as compared to "OK" to "Poor" when sending PPC traffic to the dynamically generated product pages.

And their duplicate content? It is nonexistent on their sites, whereas on almost all large catalog sites, there is a ton of duplicate content.

Summary on Duplicate Content

1) If your site is an authority in the space (PR5 or higher and 1,000+ pages indexed), your chances of being hit with any type of duplicate content filter for your internal pages are minimal, but it can still happen. So, if you created thousands of pages regarding school supplies, but swapped out the name of the school at the top of the page, in the Title, and in the description, in all likelihood, the pages would all be indexed.

2) Make sure each of your pages has a unique Title and Description.

3) If you have a long disclaimer at the bottom of each page for legal reasons, look to either place a link to a page listing the disclaimer, or create a graphic with the text. This will ensure that the repeated text is not indexed since it is a graphic.

4) Most duplicate content is caused by dynamic product pages. Correct this issue by building specialized landing pages and denying the spiders access to your dynamic content. You will be better off in the long run with higher conversions and lower PPC costs.

5) Make sure you have the non-www URL rewrite for your domain. Below is an example for your .htaccess (Apache server). If you have a Windows server, you will need to go into the IIS Control.

# Begin non-www page protection #
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
# Redirect any host that is not www.yourdomain.com to the www version
RewriteCond %{HTTP_HOST} !^www\.yourdomain\.com [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [L,R=301]
</IfModule>
# End non-www page protection #
The Following User Says Thank You to saad_sinpk For This Useful Post:
yuviplays (05-06-2010)
#3
01-06-2010, 05:22 PM
BlackHat Novice
Join Date: Dec 2009
Posts: 201

If you do post duplicate content on your site, such as an article or a reprint of some type, make sure you run the page that you are taking through our Ultimate SEO Tool (login required; get it by logging into SEO Revolution - Tested SEO Advice from Jerry West). You would run the URL through the tool, ignore the output of Step One and continue to Step Two. This is the key information. Now, the best way to get nailed by the duplicate content filter/penalty, or whatever you want to call it these days, is to get a complaint submitted. And that can occur when the same article shows up at position #1 and #2 for a search. So, don't let that happen.

How do you do it?

Simple.

Step One scans the page and collects all the keyword phrases that appear on the page two or more times. Step Two then takes those phrases and tells you where in Google they are ranked. All you do is create a Title and Description and get a few incoming links for phrases the page ISN'T ranking for, and you may want to change the phrases that are ranking well, to stay on the safe side.

Make sense? Good.
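If you don't have access to the tool, the Step One idea (finding phrases that appear on a page two or more times) can be roughly approximated by hand. The sketch below is NOT the Ultimate SEO Tool, whose internals aren't described here; the function name, the choice of two- and three-word phrases, and the sample text are all assumptions for illustration. The ranking lookup in Step Two is not covered.

```python
import re
from collections import Counter


def repeated_phrases(text, min_words=2, max_words=3, min_count=2):
    """Return {phrase: count} for word sequences that appear at least min_count times."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in range(min_words, max_words + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return {phrase: count for phrase, count in counts.items() if count >= min_count}


# Hypothetical page text, for illustration only
page_text = "Blue widgets are cheap. Buy blue widgets today."
print(repeated_phrases(page_text))  # {'blue widgets': 2}
```

Note that this simple version counts phrases across sentence boundaries and does not filter out stop words, so expect some noise on a real page.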
The Following 2 Users Say Thank You to saad_sinpk For This Useful Post:
bbbrown73 (05-03-2010), omoxx (01-20-2010)
#4
01-25-2010, 10:30 PM
Banned
Join Date: Nov 2009
Posts: 123

Dude, you post awesome content. I've searched through a lot of your posts and read them, and they are pure gold. I was wondering, do you have a blog, or are you copy pasting this from somewhere? (Not insinuating anything, this is real legit stuff man)
#5
05-06-2010, 04:01 PM
BlackHat Newbie
Join Date: May 2010
Posts: 2

Great post. One of my sites has been hit by this penalty twice. I am pretty sure the second one was caused by Google reading through my javascript on the site.
#6
05-06-2010, 05:46 PM
Moderator
Join Date: Feb 2010
Posts: 33

waoo great post man..real info..
#7
05-13-2010, 10:46 AM
Banned
Join Date: Mar 2010
Posts: 6

nice post......................................