Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. It works like this: a robot wants to visit a URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt and finds a file containing the following text:

User-agent: *
Disallow: /

“User-agent: *” means this section applies to all robots. “Disallow: /” tells the robot that it should not visit any pages on the site. The example above could be used to exclude an entire website from search engine indexing whilst it is under development.
There are two important considerations when using robots.txt:
- Robots can choose to ignore your robots.txt. This is usually the case with malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers.
- The robots.txt file is a publicly available file. Anyone can see what sections of your site you don’t want robots to use. As such, don’t use robots.txt to hide information.
Creating & Saving
Location, Location, Location
robots.txt should be located in the top-level directory of your web site, i.e. the same location as your home page.
Name your robots file using all lower-case characters, i.e. “robots.txt”, not “Robots.TXT”.
Rules & Structure
robots.txt is a plain text file containing one or more records; each record consists of a “User-agent” line followed by one or more “Disallow” lines.
It can be created in Dreamweaver, Notepad, or any basic text editor. Do not use MS Word or similar word-processing programs.
The Nitty Gritty of Exclusion – Examples
To exclude all robots from the entire server
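The record for this case is the same one shown in the introduction:

```
User-agent: *
Disallow: /
```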
To allow all robots complete access
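An empty “Disallow” line means nothing is excluded:

```
User-agent: *
Disallow:
```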
…or just create an empty robots.txt file, or don’t use one at all.
To exclude all robots from parts of the server
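List each excluded directory on its own “Disallow” line (the /cgi-bin/ and /source/ paths here are illustrative):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /source/
```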
To exclude a single robot
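Name the robot in the “User-agent” line (“BadBot” is a placeholder for the actual robot's user-agent name):

```
User-agent: BadBot
Disallow: /
```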
To allow a single robot:
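Use one record granting the named robot full access, then a second record excluding everyone else; the blank line separates the two records (“Google” here is a placeholder user-agent name):

```
User-agent: Google
Disallow:

User-agent: *
Disallow: /
```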
- Use a separate “Disallow” line for every URL prefix you want to exclude; you cannot say “Disallow: /cgi-bin/ /source/” on a single line.
- Do not have blank lines in a record, as they are used to delimit multiple records.
- Wildcards and regular expressions are not supported in either the User-agent or Disallow lines. The “*” in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*” or “Disallow: /tmp/*”.
- Everything not explicitly disallowed is considered permissible to index.
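As a sketch of how a well-behaved robot applies these rules, Python's standard urllib.robotparser module can parse a robots.txt record and answer fetch queries (the record contents and URLs below are illustrative):

```python
from urllib import robotparser

# Parse an in-memory robots.txt record. A real crawler would fetch it
# from http://<site>/robots.txt before visiting any other URL.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /source/",
])

# Matches a disallowed prefix: a polite robot must not fetch this URL.
print(rp.can_fetch("*", "http://www.example.com/source/page.html"))  # False

# Not explicitly disallowed, so fetching it is permissible.
print(rp.can_fetch("*", "http://www.example.com/welcome.html"))      # True
```

Note that this only reports what the rules say; honouring them is still up to the robot, which is why robots.txt offers no protection against ill-behaved crawlers.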