Lesson (19): META Robots Tag and “robots.txt”

Robots

There are two ways you can restrict a spider from crawling all or part of your site. The first is to place the META Robots tag within the “head” section of your HTML file (which makes it effective only for the pages where the tag is inserted). The second is to write a special instruction file called “robots.txt” and put it in the root directory of your site.

Robot instructions matter for SEO because a search engine spider is generally understood to index only a limited number of pages within your domain. Whatever this limit might be, you don’t want to spend it on files which are not optimized or not meant to be seen by the search engines.

The “robots.txt” file was also introduced to stop robots from deluging servers with rapid-fire requests or re-indexing the same files repeatedly. If you have duplicate content on your site for any reason, you can use it to keep the duplicates from being indexed. This will help you avoid any duplicate content penalties.

Also, webmasters might want to exclude the contents of private or secret folders from indexing.

The META Robots tag

The Robots META tag is a tag within the HTML code of a page that tells search engine robots which pages of a site they should index and which pages they should avoid. Use it to specify any pages you want kept out of the search engine indices (e.g., order forms and guest books).

In the HTML code of a Web page, a sample Robots META Tag looks like this:

<meta name="robots" content="index, follow" />

“index” means that the search engine is allowed to index this page, and “follow” means it is allowed to follow the links and discover the new pages this one links to.

You can instruct a search engine not to index a page by changing the content of the tag to “noindex, follow”, or to “noindex, nofollow” if you don’t want it to follow links on the page either.

The Robots META tag must be placed in the “head” section of your HTML code. Some search engines do not support this tag and honor only the Robots Exclusion Protocol, which is supported by all major search engines.
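To make the tag's effect concrete, here is a minimal Python sketch of how a crawler might read it. The names RobotsMetaParser and robots_policy are hypothetical helpers invented for this illustration, and the rule that a missing tag means "index, follow" is an assumption about typical crawler behavior, not something a particular search engine guarantees.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (may_index, may_follow).

    Assumption: both default to True unless "noindex" / "nofollow" appears.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)
```

Calling robots_policy on a page whose head contains the tag with content "noindex, follow" yields a pair saying the page may not be indexed but its links may be followed.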

Googlebot and MSNBot tags

As you remember, the Google and MSN spiders are called GoogleBot and MSNBot respectively. When reading your HTML pages, these “bots” will look for special META tags called META GoogleBot and META MSNBot. These META tags give webmasters who do not have access to the root directory of their domain (for placing a “robots.txt” file, discussed later) a way to close parts of their sites to crawling by these two robots.

The syntax is as follows:

<meta name="googlebot" content="noindex" />

For the “content” attribute you may use “noindex”, “nofollow”, “noarchive”, “nosnippet”, or any combination of these values separated by commas. For instance, “nosnippet, noarchive” will tell Google not to display snippets of your page in its SERP and not to archive a copy of the document.

The same syntax can be used on your page for MSNBot:

<meta name="msnbot" content="noindex, nofollow" />

Please keep in mind that GoogleBot will only recognize the four commands mentioned above, while MSNBot recognizes only two of them (“noindex” and “nofollow”). Commands like “index” or “follow” will be ignored.
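The filtering described above can be sketched in a few lines of Python. The lookup table simply restates this lesson's lists of recognized commands; the names RECOGNIZED and effective_directives are hypothetical and exist only for this illustration.

```python
# Commands each bot-specific META tag recognizes, as listed in this lesson.
RECOGNIZED = {
    "googlebot": {"noindex", "nofollow", "noarchive", "nosnippet"},
    "msnbot": {"noindex", "nofollow"},
}


def effective_directives(bot, content):
    """Return the directives from a content attribute that the given bot honors.

    Unrecognized values (such as "index" or "follow") are silently dropped.
    """
    allowed = RECOGNIZED[bot.lower()]
    tokens = {token.strip().lower() for token in content.split(",")}
    return sorted(tokens & allowed)
```

Under these assumptions, effective_directives("msnbot", "noindex, noarchive") keeps only "noindex", because “noarchive” is not in the MSNBot list.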

Robots Exclusion Protocol (robots.txt File)

The Robots Exclusion Protocol, commonly referred to as the “robots.txt” file, is another way for website administrators to tell visiting robots which parts of their site should not be visited and indexed.

When a search robot visits a website, it first checks for the existence of the file called “robots.txt” in the root directory of the site (www.yoursite.com/robots.txt). If this document is detected, the spider will follow the instructions found within.
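Because the file always sits at the root of the site, a robot can work out its address from any page URL. The following Python sketch shows one way to do that; robots_txt_url is a hypothetical helper name, and the example URL is only a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

For a page such as http://www.yoursite.com/shop/order.html?id=3, the helper returns http://www.yoursite.com/robots.txt.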

The “robots.txt” file contains information in the following format:

User-agent: *
Disallow: /

Each record in the file contains two fields: the first names the robot it addresses, and the second names the directory (or directories) that robot may not browse.

The “Disallow” line specifies the URLs that the named robots may not access.

Here “*” means all robots and “/” means all URLs. When specifying URLs, you write everything that follows your root (home) URL, including the slash. A rule applies to every URL that begins with the path you give, so a lone slash covers your home directory and everything beneath it. This example is therefore read as “No access for any search engine to any URL”.
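You can check how such a rule is interpreted with Python's standard urllib.robotparser module. This is a minimal sketch: the rules are parsed from a string instead of being fetched from a server, and is_allowed, the bot names, and the example URLs are placeholders made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# The "block everything" example, held in a string for testing.
RULES = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_allowed(agent, url):
    """Ask the parsed rules whether the given robot may fetch the URL."""
    return parser.can_fetch(agent, url)
```

With these rules, is_allowed returns False for any robot name and any URL on the site, including the home page.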

In the following example, nothing is restricted from Googlebot so it may browse any file and directory:

# Guarantees access for Googlebot (characters after # and up to newline
# are considered comments).
User-agent: Googlebot
Disallow:

If you ever need to instruct multiple spiders about multiple directories, you may include several records, separated by a blank line:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /cgi-bin/

This will keep all other spiders out of your “cgi-bin” directory (where most webmasters keep their server-side scripts), while GoogleBot will still have access to it.
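The combined rules can be tried out with Python's standard urllib.robotparser module. This is a sketch under a few assumptions: the two records are written with a blank line between them, the host name is a placeholder, and may_visit and the file names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# One record for Googlebot, one for every other robot.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def may_visit(agent, path):
    """Check a path on the placeholder site against the parsed rules."""
    return parser.can_fetch(agent, "http://www.yoursite.com" + path)
```

Here may_visit("Googlebot", "/cgi-bin/form.pl") is True, while the same path is refused for a robot with any other name. Paths outside “cgi-bin”, such as "/index.html", remain open to everyone.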
