How to control the way your site is indexed by search engines
Google’s bot will see the first entry, and realise that it’s allowed to look everywhere, while others won’t match and will look at the next entry.
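To see this first-match behaviour for yourself, you can feed a robots.txt to Python's standard urllib.robotparser module. This is only a sketch: the two-entry file below is an assumption, standing in for the kind of file described here, and SomeOtherBot and example.com are made-up names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may look everywhere, everyone else nowhere.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host.
    for agent in ("Googlebot", "SomeOtherBot"):
        allowed = can_crawl(ROBOTS_TXT, agent, "http://example.com/page.html")
        print(agent, "allowed" if allowed else "blocked")
```

An empty Disallow line means nothing is off limits, which is why the first entry lets Google's bot in while the catch-all entry turns everyone else away.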
Another key point to realise is that the text in the Disallow line is matched against the start of each path, so saying /forum/ will disallow any file whose path begins with it, which covers the whole contents of the forum subdirectory on your site.
But if you said ‘/forum’ then it would also disallow files with names like /forumhelp.html or /forum.php. And though I said ‘couplet’ earlier, you can also have more than one Disallow line, for example:
User-agent: *
Disallow: /secrets/
Disallow: /personal/
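The difference that a trailing slash makes is easy to check with a short script. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name is_blocked and the example.com host are assumptions for illustration, and real crawlers may interpret rules slightly differently.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(disallow: str, path: str) -> bool:
    """Return True if path is off limits to all bots under one Disallow value."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow}"])
    # example.com is a placeholder host; only the path matters here.
    return not parser.can_fetch("*", "http://example.com" + path)


if __name__ == "__main__":
    paths = ("/forum/index.html", "/forumhelp.html", "/forum.php", "/index.html")
    for rule in ("/forum/", "/forum"):
        for path in paths:
            print(f"Disallow: {rule:8} {path:20} blocked={is_blocked(rule, path)}")
```

Running it shows /forumhelp.html and /forum.php caught by the shorter rule but not by the one ending in a slash.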
The RobotsTxt site has a database that lists many crawler bot names and the hostnames from which they usually crawl your site.
If you simply want to know which ones are visiting your site the most, then look through your server logs and see who’s been requesting the robots.txt file. If you don’t have one then, depending on the server config, the requests may be in a separate error log. The User-agent information, if you log it, will tell you the name that the bots are using.
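If your logs are large, a few lines of script will do the counting for you. This sketch assumes the common 'combined' access log format used by Apache and nginx, so the pattern will need adjusting for other layouts; the function name and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Assumes the "combined" log format: request, status, size, referrer, user agent.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Made-up sample entries, not taken from a real server.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2011:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 404 209 "-" "ExampleBot/1.0"',
    '192.0.2.11 - - [10/Oct/2011:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [11/Oct/2011:09:12:44 +0000] "GET /robots.txt HTTP/1.1" 404 209 "-" "ExampleBot/1.0"',
]


def robots_txt_requesters(lines):
    """Count the User-agent strings that asked for /robots.txt."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        # Ignore any query string when comparing the requested path.
        if match and match.group("path").split("?")[0] == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


if __name__ == "__main__":
    for agent, hits in robots_txt_requesters(SAMPLE_LOG).most_common():
        print(hits, agent)
```

To run it against a real log, pass an open file object instead of SAMPLE_LOG.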
Can’t upload?
In case you can’t upload to your real root folder (see box), you can use a meta tag to control indexing too. The name of the tag is ‘ROBOTS’ and it can have a value of NOINDEX if you don’t want a page indexed, or NOFOLLOW if you don’t want links from the page followed, or both. So, a line like this in the header of your first page will do much the same as a robots.txt that excludes all search engines:
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
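If you want to confirm what a page is actually telling the bots, you can pull the directives out of its header with a short script. This is a rough sketch built on Python's standard html.parser module; the class and function names are my own, and the sample page is made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The tag name is matched case-insensitively, so ROBOTS and robots both count.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the lower-cased robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # Hypothetical page header for illustration.
    page = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'
    print(sorted(robots_directives(page)))
```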
For those who are scratching their heads, one reason for doing this, of course, is if you’re working on a new site and you’d rather it didn’t appear in search results until it’s finished.
Sitemaps
The Robots protocol is fairly old, and only serves one purpose: it tells search engines to go away. While that can be handy for some parts of your site, it’s not exactly a flexible approach. Far better is to have some control over what you want indexed, and when. You may have pages on your site that change quite frequently, for example, with news or product updates, while other pages are more static or simply don’t need indexing so much.
What if, for example, you’d like changes to the new products area to be picked up really often, but you don’t want bots crawling all over your forum every day, which will generate extra database traffic and a much heavier load on the system?
The answer to this lies in the sitemaps protocol, which you’ll find detailed at www.sitemaps.org. Like lots of things on the internet these days, it’s an XML-based schema, so it’s a little more detailed than a robots.txt file, but not excessively so. You can pop a sitemap.xml file in the root folder of your site just as you would with a robots.txt file and you can also submit one to Google and some other search engines. You can also specify the location of a sitemap in your robots.txt, like this:
Sitemap: http://www.nigelwhitfield.com/sitemap.xml
But we’re getting ahead of ourselves; let’s look at a basic sitemap file first.
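One way to produce a basic sitemap is to generate it with a short script rather than type it by hand. The sketch below uses Python's standard xml.etree.ElementTree (Python 3.9 or later for the indent call) to write a urlset containing url entries, each with a loc and an optional changefreq. The function name build_sitemap and the example.com addresses are assumptions for illustration, not part of the protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CHANGE_FREQS = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}
MAX_URLS = 50_000  # limit for a single sitemap file


def build_sitemap(pages):
    """Build sitemap XML from (url, changefreq-or-None) pairs."""
    pages = list(pages)
    if len(pages) > MAX_URLS:
        raise ValueError("a single sitemap holds at most 50,000 URLs")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq in pages:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes '&' as '&amp;' when the text is written out.
        ET.SubElement(url, "loc").text = loc
        if changefreq is not None:
            if changefreq not in CHANGE_FREQS:
                raise ValueError(f"unknown changefreq: {changefreq!r}")
            ET.SubElement(url, "changefreq").text = changefreq
    ET.indent(urlset)  # pretty-print for readability
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages: a busy products area and a quieter forum.
    print(build_sitemap([
        ("http://example.com/products/", "daily"),
        ("http://example.com/forum/view?topic=1&page=2", "monthly"),
        ("http://example.com/about.html", None),
    ]))
```

Save the output as sitemap.xml in the root folder of your site.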
Within the ‘urlset’ you can specify up to 50,000 separate URLs, which should be enough for most, and the only compulsory element for each one is the ‘loc’, which specifies the actual web address. You need to URL encode it and escape any XML special characters, so if there’s an ampersand, for a page parameter, you should use the form &amp;.
In the example above, we used an optional ‘changefreq’ element, which tells a search engine how often a page is likely to be changed. This isn’t considered an absolute command, but it serves to guide a search engine when it decides which pages on your site to index. You can specify always, hourly, daily, weekly, monthly, yearly or never.