
Hands on: Indices and mapping

How to control the way your site is indexed by search engines

Last time round, I started exploring how search engines index your site, with a look at some of the more basic aspects: adding keyword and meta tags to your pages, and looking at some of the tools available in Google’s webmasters area, which will let you see what queries people are using to reach your site, and where they rank.

This month, I’m following on from that with a look at two techniques that you can use to give yourself a little more control over which parts of your site are indexed.

If you have a busy site, it’s not beyond the realms of possibility to find a huge number of simultaneous connections to your server from assorted search engine crawlers, which can slow the site down for other users.

Google’s Dashboard allows you some control over the crawl rate for its own bot. Even if other search engines were to provide such systems (and many don’t), it would be tedious to have to visit each one and tweak the settings. The obvious solution is for your site itself to hold information that dictates how search engines will index it, and there are two established methods.

Robots.txt
The first of those is called robots.txt. It’s a simple text file that’s intended to be put at the root level of your site, so if the site is www.nigelwhitfield.com, for example, then the URL of the robots file would be www.nigelwhitfield.com/robots.txt. If the crawler for a search engine doesn’t find such a file, then it will index your site by reading all the pages and following all the links.

If, on the other hand, the file exists and is readable by the crawler, then it will interpret it according to the Robots Exclusion Protocol, about which you can find more information at www.robotstxt.org. Essentially, though, it’s a very simple text file.
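The rule about where the file lives is easy to express in code. Here’s a minimal Python sketch that works out the robots.txt address for any page on a site; the function name robots_url and the sample page path are made up for illustration, and the http:// prefix is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host: robots.txt sits at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is hypothetical; only the host name comes from the text.
print(robots_url("http://www.nigelwhitfield.com/blog/post.html"))
# http://www.nigelwhitfield.com/robots.txt
```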

Lines that are comments begin with a # symbol, and you control the reading of your site by robots using User-agent and Disallow couplets. Together, these specify which parts of your site should not be read by the web crawlers. Here’s a simple example, to stop Google indexing a site forum, for instance:

# A simple robots.txt file
User-agent: googlebot
Disallow: /forum/
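
If you’d like to check what a file like this permits before you upload it, Python’s standard urllib.robotparser module can read the rules and answer yes or no for a given bot and page. This is only a sketch: the helper name allowed, the page paths and the bot name otherbot are invented for the example, and a real crawler may interpret the file a little differently from Python’s parser.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
# A simple robots.txt file
User-agent: googlebot
Disallow: /forum/
"""


def allowed(rules, agent, path):
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Page paths and 'otherbot' are hypothetical, chosen only for illustration.
print(allowed(RULES, "googlebot", "/forum/welcome.html"))  # False
print(allowed(RULES, "googlebot", "/index.html"))          # True
print(allowed(RULES, "otherbot", "/forum/welcome.html"))   # True
```

Only googlebot is named in the file, so any other bot remains free to read the forum.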

A well-behaved search engine should be liberal in interpreting its name, so if something is called SuperWebCrawler-3.07 then it should still obey your rules if you refer to it as ‘superwebcrawler’.

There are various pages around that list the common, and some not-so-common, search engine bot names, so you can exclude them specifically. In most cases, though, you’ll probably want to control access by all search engines, unless you’re nursing a grudge, and you can do that simply with

User-agent: *
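
Both ideas, liberal name matching and the * wildcard, can be tried out with Python’s standard urllib.robotparser module. In this sketch the Disallow: /forum/ line paired with each User-agent line is an assumption made for the example, as are the helper name allowed and the bot name AnotherBot; Python’s parser is just one implementation, so treat the results as indicative rather than a guarantee of how every crawler behaves.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules, agent, path):
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Assumed rules: the Disallow lines are added purely for illustration.
NAMED = "User-agent: superwebcrawler\nDisallow: /forum/\n"
EVERYONE = "User-agent: *\nDisallow: /forum/\n"

# The short name in the file still matches the bot's full name and version.
print(allowed(NAMED, "SuperWebCrawler-3.07", "/forum/"))  # False
# A bot that isn't named is unaffected by that rule...
print(allowed(NAMED, "AnotherBot", "/forum/"))            # True
# ...but the wildcard catches it.
print(allowed(EVERYONE, "AnotherBot", "/forum/"))         # False
```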

And what if you do want to be picky about who looks where? Well, a bot should obey the couplet that names it, and fall back on the * couplet only if no other one matches. Listing the named bots first keeps the file easy to follow from top to bottom, so if you only want Google to index your forums, and not anyone else, you could say something like this:

# Let Google index my forums, but not anyone else
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /forum/
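
To confirm that a file like this does what you expect, you can again lean on Python’s standard urllib.robotparser module. As before, this is a sketch under assumptions: the helper name allowed, the bot name otherbot and the page paths are invented for the example, and the blank line between the two couplets follows the convention of separating records with an empty line.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
# Let Google index my forums, but not anyone else
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /forum/
"""


def allowed(rules, agent, path):
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# An empty Disallow line means nothing is off limits for googlebot.
print(allowed(RULES, "googlebot", "/forum/welcome.html"))  # True
# Everyone else falls through to the * couplet and is kept out of the forum.
print(allowed(RULES, "otherbot", "/forum/welcome.html"))   # False
print(allowed(RULES, "otherbot", "/index.html"))           # True
```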
