Daily cyber threats and internet security news: network security, online safety and latest security alerts
March 22nd, 2008

Google Exploit Removes Any Website From The Index

There is currently a working method to knock a website out of Google’s search engine results. To understand this exploit, you should first know (or read) about Google’s Duplicate Content Filter. Because Google now dumps duplicate pages into its Supplemental Index, an attacker can make Google think that YOUR site is the duplicate and should be removed from the main index.

If someone copies your website’s homepage and manages to convince Google that the copy is actually the original, your homepage will get tossed into the supplemental index and most likely will never get back into the Search Engine Results Pages again. If it does come back, it will take somewhere between 2 and 6 months.

When someone copies your website, Google will index that copy and correctly determine that it’s a duplicate. Google now has two pages that it knows are complete duplicates, and it has to decide which to dump in the supplemental index and which to keep in the main one. The problem is that Google can’t always tell which is the original and which is the copy. It has some clever algorithms to work it out, but even if they are 99% accurate, that leaves a lot of problems for the 1% of cases they get wrong.

Another problem is web proxies. Like Google, they send out spiders that crawl your page and take your content, and they then host a copy of your website on the proxy site so that, when their users request your page, they can serve the local copy quickly rather than having to retrieve it from your server. The big issue is that Google can sometimes decide that the proxy copy of your web page is the original and yours is not.

There is some evidence that people are deliberately and maliciously using proxy servers to cache copies of web pages, then using normal (white and black hat) Search Engine Optimization (SEO) techniques to make those proxy pages rank in the search engines, increasing the likelihood that your legitimate page will be the one dumped by the search engines’ duplicate content filters.

Most of the time, proxy spiders actively spoof their origins so that you don’t realize the visitor is a spider from a proxy. They pretend to be Googlebot, for example, or Yahoo’s spider, or regular human users with IE or Firefox.
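To see why spoofing is so easy, here is a minimal Python sketch of a check that trusts only the User-Agent header. The function name, the token list and the sample strings are assumptions made for illustration, not part of any real product. Because the header is supplied by the client, a proxy spider that sends a crawler’s string looks exactly like the real crawler, and one that sends a browser’s string is not recognised as a spider at all.

```python
# Hypothetical check that relies only on the User-Agent header.
# The token list is illustrative, not an authoritative list of crawlers.
KNOWN_CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")


def claims_to_be_crawler(user_agent: str) -> bool:
    """Return True if the header *claims* to come from a known crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_CRAWLER_TOKENS)


# The string a Googlebot request typically carries.
genuine = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# A proxy spider can send the very same string, so the check cannot
# tell the two visitors apart.
spoofed = genuine

# A proxy spider posing as an ordinary browser (illustrative string).
browser_like = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/2.0"

for label, ua in (("genuine", genuine), ("spoofed", spoofed), ("browser", browser_like)):
    print(label, claims_to_be_crawler(ua))
```

The header alone says what a visitor claims to be, not what it is, which is why a check based on it helps only against spiders that identify themselves honestly.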

There are a few possible defenses, depending on your web hosting technology and technical competence:

1 – Use PHP and the .htaccess file to check which search engine or spider is making each request, and block the spiders and proxies you don’t want. This works only against proxies, spiders and bots that identify themselves correctly. If you are using MS Windows and IIS on your server, or if you are on a shared hosting plan that doesn’t give you the ability to do anything clever, it’s an awful lot harder, and you should take the advice of a professional on how to defend yourself against this kind of attack.

2 – If you are running a PHP or ASP based website, set the robots meta tag on every page to noindex and nofollow, and implement a PHP or ASP script on each page that checks whether the visitor is a valid spider from one of the major search engines and, if so, resets the robots meta tag to index and follow. The important distinction here is that it’s easier to validate a real spider, and to discount one that’s trying to spoof you, than to recognise every bad one, because the major search engines publish processes and procedures to do this, including IP lookups.
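The second defense can be sketched roughly as follows. This is an illustration in Python rather than PHP or ASP, and it is not a drop-in implementation. It assumes the IP lookup in question is a reverse DNS lookup followed by a forward lookup that must return the same address. The function names and the list of trusted hostname suffixes are assumptions, so check each search engine’s own documentation for the real values. The lookups are passed in as parameters so the logic can be tried without touching the network.

```python
import socket

# Assumed hostname suffixes for crawlers we are willing to admit.
# Confirm the real values in each search engine's documentation.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    # Default to real DNS lookups (the forward lookup here is IPv4 only).
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        # The claimed hostname must resolve back to the same IP address;
        # otherwise anyone controlling their own reverse DNS could fake it.
        return ip in forward_lookup(host)
    except OSError:
        # A failed lookup means the visitor could not be verified.
        return False


def robots_meta_tag(ip, **lookups):
    """Default to noindex/nofollow; relax it only for verified crawlers."""
    if is_verified_crawler(ip, **lookups):
        return '<meta name="robots" content="index, follow">'
    return '<meta name="robots" content="noindex, nofollow">'


# Usage with made-up lookup tables standing in for DNS.
fake_reverse = {"192.0.2.10": "crawl-1.googlebot.com", "203.0.113.9": "proxy.example.net"}
fake_forward = {"crawl-1.googlebot.com": ["192.0.2.10"]}
for visitor in ("192.0.2.10", "203.0.113.9"):
    print(visitor, robots_meta_tag(visitor,
                                   reverse_lookup=fake_reverse.__getitem__,
                                   forward_lookup=fake_forward.__getitem__))
```

The design choice is to fail closed: any visitor that cannot be positively verified, including a proxy spider with a spoofed identity, gets the noindex and nofollow tag, so the proxy’s copy of the page carries the instruction not to index it.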
