How to control the way your site is indexed by search engines
Last time round, I started exploring how search engines index your site, with a look at some of the more basic aspects: adding keyword and meta tags to your pages, and using some of the tools in Google’s webmasters area, which let you see what queries people are using to reach your site, and where your pages rank.
This month, I’m following on from that with a look at two techniques that you can use to give yourself a little more control over which parts of your site are indexed.
If you have a busy site, it’s not beyond the realms of possibility to find that you have a huge number of simultaneous connections to your server from assorted search engine crawlers, potentially impacting upon the performance for other users.
While Google’s Dashboard allows you some control over the crawl rate for its own bot, even if other search engines were to provide such systems (and many don’t), it would be tedious to have to visit each one and tweak the settings. The obvious solution, then, is for your site itself to hold information that dictates how search engines will index it, and there are two established methods.
Robots.txt
The first of those is called robots.txt. It’s a simple text file that’s intended to be put at the root level of your site, so if the site is www.nigelwhitfield.com, for example, then the URL of the robots file would be www.nigelwhitfield.com/robots.txt. If the crawler for a search engine doesn’t find such a file, then it will index your site by reading all the pages and following all the links.
If, on the other hand, the file exists and is readable by the crawler, then it will interpret it according to the Robots Exclusion Protocol, about which you can find more information at www.robotstxt.org. Essentially, though, it’s a very simple text file.
Lines that are comments begin with a # symbol, and you control the reading of your site by robots using User-agent and Disallow couplets. Together, these specify which parts of your site should not be read by the web crawlers. Here’s a simple example, to stop Google indexing a site forum:
# A simple robots.txt file
User-agent: googlebot
Disallow: /forum/
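If you’d like to check how a parser reads those rules before you upload the file, Python’s standard library includes a robots.txt parser, urllib.robotparser. The sketch below is only an illustration: the page paths and the ‘SomeOtherBot’ name are made up for the test, and a real crawler may differ from this parser in the details.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, held in a string rather than fetched from a site.
ROBOTS_TXT = """\
# A simple robots.txt file
User-agent: googlebot
Disallow: /forum/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical paths and bot names, used only to exercise the rules.
google_forum = parser.can_fetch("Googlebot", "/forum/index.html")
google_home = parser.can_fetch("Googlebot", "/index.html")
other_forum = parser.can_fetch("SomeOtherBot", "/forum/index.html")

print("Googlebot, forum page:", google_forum)
print("Googlebot, home page:", google_home)
print("SomeOtherBot, forum page:", other_forum)
```

With this file, the parser refuses Googlebot the forum but allows it everywhere else, while a bot that isn’t named in the file is not restricted at all.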
A well-behaved search engine should be liberal in interpreting its name, so if a crawler is called SuperWebCrawler-3.07, it should still obey your rules if you refer to it as ‘superwebcrawler’.
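One plausible way for a crawler to be liberal about its name is a case-insensitive substring match. The function below is a minimal sketch of that idea, not the code of any particular search engine; rule_applies is a hypothetical helper name chosen for illustration.

```python
def rule_applies(rule_name: str, crawler_name: str) -> bool:
    """Return True if a User-agent value from robots.txt covers this crawler.

    Hypothetical helper: real crawlers may match their names differently.
    """
    if rule_name == "*":
        # The wildcard covers every crawler.
        return True
    # Ignore case, and accept the rule name anywhere in the crawler's full name.
    return rule_name.lower() in crawler_name.lower()


print(rule_applies("superwebcrawler", "SuperWebCrawler-3.07"))
print(rule_applies("googlebot", "SuperWebCrawler-3.07"))
```

The first call matches because the version suffix and the capital letters are ignored; the second does not, because the rule names a different bot.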
There are various pages around that list the common, and some not-so-common, search engine bot names, so you can exclude them specifically. In most cases, though, unless you’re nursing a grudge, you’ll probably want to control access by all search engines, and you can do that simply with:
User-agent: *
And what if you do want to be picky about who looks where? Well, the file is read from top to bottom, and a simple bot will use the first couplet that matches it, so list named bots before the catch-all. If you want only Google to index your forums, and no one else, you could say something like this:
# Let Google index my forums, but not anyone else
User-agent: googlebot
Disallow:
User-agent: *
Disallow: /forum/
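To confirm that the two couplets behave as intended, the same file can be run through Python’s standard urllib.robotparser module. This is a rough check under stated assumptions: the forum path and the ‘SomeOtherBot’ name are invented for the test, and this parser treats the * couplet as a fallback for any bot not named elsewhere, which may not be exactly how every crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, held in a string rather than fetched from a site.
ROBOTS_TXT = """\
# Let Google index my forums, but not anyone else
User-agent: googlebot
Disallow:
User-agent: *
Disallow: /forum/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical forum page and bot names, used only to exercise the rules.
FORUM_PAGE = "/forum/thread-1.html"
forum_access = {
    agent: parser.can_fetch(agent, FORUM_PAGE)
    for agent in ("Googlebot", "SomeOtherBot")
}
other_home = parser.can_fetch("SomeOtherBot", "/index.html")

for agent, allowed in forum_access.items():
    print(agent, "may read the forum:", allowed)
print("SomeOtherBot may read the home page:", other_home)
```

The empty Disallow line leaves Googlebot free to read everything, while every other bot is kept out of the forum but can still read the rest of the site.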