Hands on: Indices and mapping

How to control the way your site is indexed by search engines

Google’s bot will see the first entry, and realise that it’s allowed to look everywhere, while others won’t match and will look at the next entry.
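If you'd like to see that matching order for yourself, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt it tests is an assumption: it's an illustrative file that lets Google's bot in and keeps everyone else out, and 'ExampleBot' is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt, written for illustration only: one entry that lets
# Google's bot look everywhere, then a catch-all entry that shuts others out.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

for agent in ("Googlebot", "ExampleBot"):  # ExampleBot is a made-up name
    print(agent, parser.can_fetch(agent, "/index.html"))
```

An empty Disallow line means nothing is off limits, so the first bot is allowed in, while the second falls through to the catch-all entry and is refused.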

Another key point to realise is that the text in the Disallow line is matched against the start of each file's path, so saying /forum/ will disallow any file whose path begins with that text, which would include the whole contents of the forum subdirectory on your site.

But if you said ‘/forum’ then it would also disallow files with names like /forumhelp.html or /forum.php. And though I said ‘couplet’ earlier, you can also have more than one Disallow line, for example:

User-agent: *
Disallow: /secrets/
Disallow: /personal/
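The difference that trailing slash makes is easy to check with a short script. This is only a sketch, using Python's standard urllib.robotparser module; the helper name 'allowed', the 'ExampleBot' crawler name and the test paths are all made up for the purpose.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


with_slash = "User-agent: *\nDisallow: /forum/\n"
without_slash = "User-agent: *\nDisallow: /forum\n"

# Print each path with its result under the two versions of the rule.
for path in ("/forum/index.php", "/forumhelp.html", "/forum.php", "/news.html"):
    print(path, allowed(with_slash, path), allowed(without_slash, path))
```

With the slash, only files inside the forum folder are blocked; without it, /forumhelp.html and /forum.php are caught as well.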

The RobotsTxt site has a database that lists many crawler bot names and the hostnames from which they usually crawl your site.

If you simply want to know which ones are visiting your site the most, then look through your server logs and see who’s been requesting the robots.txt file. If you don’t have one, depending on the server config, the requests may be in a separate error log. The User-agent information, if you log it, will tell you the name that the bots are using.
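Trawling through logs by hand soon gets tedious, so here is a rough sketch of a script that does the counting. It assumes your server writes the common 'combined' log format, with the request and the User-agent both in double quotes; the sample lines, addresses and bot names are invented, and you'd need to adjust the pattern for any other log layout.

```python
import re
from collections import Counter

# Assumes the "combined" access log format, where the request line and the
# User-agent are both double-quoted fields. Adjust the pattern otherwise.
LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_requesters(lines):
    """Count robots.txt requests per User-agent string."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and match.group("path").split("?")[0] == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


# Invented sample lines; in practice you would pass in an open log file.
sample = [
    '192.0.2.1 - - [01/Jan/2012:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.1 - - [01/Jan/2012:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.7 - - [01/Jan/2012:11:30:00 +0000] "GET /robots.txt HTTP/1.1" 404 209 "-" "OtherBot/2.3"',
    '192.0.2.1 - - [02/Jan/2012:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
]

for agent, hits in robots_requesters(sample).most_common():
    print(hits, agent)
```

To run it against a real log, open the file and hand the file object to robots_requesters in place of the sample list.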

Can’t upload?
In case you can’t upload to your real root folder (see box), you can use a meta tag to control indexing too. The name of the tag is ‘ROBOTS’ and it can have a value of NOINDEX if you don’t want a page indexed, or NOFOLLOW if you don’t want links from the page followed, or both. So, a line like this in the header of your first page will do much the same as a robots.txt that excludes all search engines:

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">

For those who are scratching their heads, one reason for doing this, of course, is if you’re working on a new site and you’d rather it doesn’t appear in search results until it’s finished.
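If you want to confirm that a page really carries the ROBOTS meta tag, a few lines of script will do it. This is just a sketch using Python's standard html.parser module; the class and function names are made up, and the page is a stand-in string rather than a real download.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # The parser lower-cases tag and attribute names, but not values.
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().upper() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots directives found in the HTML text."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# Stand-in page for illustration only.
page = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head><body></body></html>'
print(sorted(robots_directives(page)))
```

An empty result means the page has no such tag, so search engines will treat it as fair game.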

Sitemaps
The Robots protocol is fairly old, and only serves one purpose: it tells search engines to go away. While that can be handy for some parts of your site, it’s not exactly a flexible approach. Far better is to have some control over what you want indexed, and when. You may have pages on your site that change quite frequently, for example, with news or product updates, while other pages are more static or simply don’t need indexing so much.

What if, for example, you’d like changes to the new products area to be picked up really often, but you don’t want bots crawling all over your forum every day, which will generate extra database traffic and a much heavier load on the system?

The answer to this lies in the sitemaps protocol, which you’ll find detailed at www.sitemaps.org. Like lots of things on the internet these days, it’s an XML-based schema, so it’s a little more detailed than a robots.txt file, but not excessively so. You can pop a sitemap.xml file in the root folder of your site just as you would with a robots.txt file and you can also submit one to Google and some other search engines. You can also specify the location of a sitemap in your robots.txt, like this:

Sitemap: http://www.nigelwhitfield.com/sitemap.xml

But we’re getting ahead of ourselves; let’s look at a basic sitemap file first.




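<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.nigelwhitfield.com/v2/index.php</loc>
<changefreq>monthly</changefreq>
</url>
</urlset>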

Within the ‘urlset’ you can specify up to 50,000 separate URLs, which should be enough for most, and the only compulsory element for each one is the ‘loc’, which specifies the actual web address. You need to escape it for XML, so if there’s an ampersand, for a page parameter, you should use the form &amp;amp; in its place.

In the example above, we used an optional ‘changefreq’ element, which tells a search engine how often a page is likely to be changed. This isn’t considered an absolute command, but it serves to guide a search engine when it decides which pages on your site to index. You can specify always, hourly, daily, weekly, monthly, yearly or never.
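Rather than typing a sitemap by hand, you could generate one with a script. Here is a minimal sketch using Python's standard xml.etree.ElementTree module, which takes care of escaping ampersands for you. The function name and the second URL are made up for illustration, and it handles only the ‘loc’ and ‘changefreq’ elements discussed here.

```python
import xml.etree.ElementTree as ET

NAMESPACE = "http://www.sitemaps.org/schemas/sitemap/0.9"
FREQUENCIES = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def build_sitemap(pages):
    """Build sitemap XML from (url, changefreq) pairs; changefreq may be None."""
    urlset = ET.Element("urlset", xmlns=NAMESPACE)
    for address, changefreq in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address  # escaped when serialised
        if changefreq is not None:
            if changefreq not in FREQUENCIES:
                raise ValueError(f"unknown changefreq: {changefreq}")
            ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


sitemap = build_sitemap([
    ("http://www.nigelwhitfield.com/v2/index.php", "monthly"),
    ("http://www.example.com/products.php?cat=new&page=2", "daily"),  # made-up URL
])
print(sitemap)
```

The output has no XML declaration line, so add one at the top before saving the result as sitemap.xml.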
