Over the past few months we’ve noticed that more and more pages from inside the WordPress backend are finding their way into the Google index. This has always been a problem, but as Google seems to index more useless pages, and crackers get more sophisticated at finding vulnerabilities in WordPress modules, it is important to protect your site from both the crackers and Google.
So what is the real harm?
The most obvious, and urgent, harm comes from exposing your website to potential compromise. If a vulnerability is found in a WordPress plugin, it can take just a few seconds to find a host of websites to attack. Using Google’s inurl: operator, a simple search of inurl:wp-content/plugins returns more than 8 million results for a cracker to start his or her search for likely targets.
A dedicated cracker will compromise your site, but there is no reason to make it easy for them.
Another less obvious problem is created by Google itself. In just this one simple search we’ve seen more than 8 million web pages that have no reason to be in the index. They serve no useful purpose other than to show how invasive Google can be with its crawler. It also demonstrates a duplicate content issue that needs to be addressed.
The real problem, however, is the harm this can cause each website this happens to.
It is known that Google may not index all of the pages in a website for various reasons. Assume you have a website with 100 pages, yet Google has decided to index 30 pages from your /wp-content/ or /wp-admin/ folders. You have lost the potential for 30% of your pages to be indexed in favor of pages that should never have been in the index at all. I have seen sites with more than 50% of their indexed pages coming from the back end of WordPress.
What can you do about it?
There are two things that you should do to help secure your site from search engines exploring where they don’t belong.
1. Robots.txt: With every WordPress install I do these days I add this to my robots.txt file.
Be sure to adjust the URL for your site’s install folders.
2. Google’s Webmaster Tools: If you find these pages indexed for your site, first add the rules to your robots.txt file. Once that is done, sign in to your GWT account and remove those pages from the index. Once removed, the robots.txt rules should keep them from being re-indexed.
Unfortunately, from then on you will see an error message in your GWT account, because the blocked URLs can no longer be crawled. You can safely ignore this error.
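The exact rules I add in step 1 will depend on your install, but as a sketch, assuming a default WordPress installation in the site root, a typical robots.txt might look like this (adjust the folder paths if WordPress lives in a subdirectory):

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
```

These Disallow lines tell compliant crawlers, including Googlebot, to stay out of the backend folders, which keeps them from appearing in the index in the first place.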
What does all of this tell us? The biggest thing it tells us is that the Google spiders are not as smart as everyone, including Google, would like us to believe. Indexing these pages serves no purpose, and it shows that the bots can and will go places they really should not, so you must be proactive in protecting your website from them. A person would know that there is no reason to index more than 8 million of the exact same pages. An algorithm cannot make that decision.