
Robots Exclusion Protocol

Preventing search engines from indexing pages or a site
1 Aug 2017


Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. It works like this: a robot wants to visit a URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds the file contains the following text:

User-agent: *
Disallow: /

“User-agent: *” means this section applies to all robots. “Disallow: /” tells the robot that it should not visit any pages on the site. The example above could be used to exclude an entire website from search engine indexing whilst it is under development.
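
A well-behaved robot performs exactly this check before requesting a page. As a minimal sketch of the same logic, Python’s standard urllib.robotparser module can fetch a site’s robots.txt and answer “may I visit this URL?” (“MyBot” and the example.com addresses are just the placeholders from the example above):

from urllib import robotparser

# Fetch and parse the site's robots.txt, as a polite robot would
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# With "User-agent: *" and "Disallow: /" in place, nothing may be visited
print(rp.can_fetch("MyBot", "http://www.example.com/welcome.html"))  # False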

Important Considerations

There are two important considerations when using robots.txt:

  • Robots can choose to ignore your robots.txt. This is usually the case with malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers.
  • The robots.txt file is publicly available. Anyone can see what sections of your site you don’t want robots to use. As such, don’t use robots.txt to hide information (see the sketch after this list).
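
To see the second point for yourself: robots.txt is served like any other file on the site, so a browser visit to /robots.txt, or a couple of lines of Python, will retrieve it (example.com is again a placeholder):

from urllib.request import urlopen

# robots.txt is an ordinary public file; anyone can read it
print(urlopen("http://www.example.com/robots.txt").read().decode())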

Creating & Saving

Location, Location, Location

robots.txt must be located in the top-level directory of your web site, i.e. the same location as your home page, so that it can be found at http://www.example.com/robots.txt.

File Naming

Name your robots file using all lower case characters, i.e. “robots.txt”, not “Robots.TXT”.

Rules & Structure

robots.txt is a plain text file containing one or more records.
It can be created in Dreamweaver, Notepad or any basic text editor. Do not use MS Word or similar word processors, as they add formatting that will corrupt the file.
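
A file with two records might look like this (note the blank line separating the records; “ExampleBot” is simply a placeholder for a real robot name):

User-agent: ExampleBot
Disallow: /test/

User-agent: *
Disallow: /cgi-bin/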

The Nitty Gritty of Exclusion – Examples

To exclude all robots from the entire server

User-agent: *
Disallow: /

To allow all robots complete access

User-agent: *
Allow: /

…or just create an empty robots.txt file, or don’t use one at all.

To exclude all robots from parts of the server

User-agent: *
Disallow: /cgi-bin/
Disallow: /source/
Disallow: /test/
Allow: /

To exclude a single robot

User-agent: BadBot
Disallow: /

To allow a single robot

User-agent: Google
Disallow:

User-agent: *
Disallow: /

An empty “Disallow:” value means nothing is disallowed for that robot; the second record then excludes every other robot.

Notes

  • Use a separate “Disallow” line for every URL prefix you want to exclude — you cannot say “Disallow: /cgi-bin/ /source/” on a single line.
  • Do not have blank lines in a record, as they are used to delimit multiple records.
  • Regular expressions are not supported in either the User-agent or Disallow lines. The “*” in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
  • Everything not explicitly disallowed is considered permissible to index (the sketch below demonstrates this).
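
The prefix-matching behaviour described in these notes can be verified with Python’s urllib.robotparser, feeding it the rules directly instead of fetching them from a server:

from urllib import robotparser

# Parse rules directly from a list of lines; no web server needed
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /source/",
])

# Disallow values are plain path prefixes, not patterns
print(rp.can_fetch("*", "/cgi-bin/script.pl"))  # False: matches the /cgi-bin/ prefix
print(rp.can_fetch("*", "/index.html"))         # True: not explicitly disallowed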
