ROBOTS.TXT file - syntax, directives and examples
The ROBOTS.TXT file tells search engines which files in the site's directory may be indexed and which resources should not be indexed. It uses the Robots Exclusion Standard syntax to express these rules.
Since indexed web pages are what bring visitor traffic, robots.txt is an important tool for controlling indexing. A small error in its syntax can make many useful web pages invisible in search results, leading to less traffic, fewer sales and lower popularity.
Different robots interpret the same syntax differently
Although respectable web robots follow the directives in the robots.txt file, each robot can interpret the directives differently. You need to know the correct syntax for addressing different web robots, as some may not understand certain instructions.
The directives in ROBOTS.TXT do not have a prohibited action
Not all robots comply with the standard: robots harvesting email addresses, spammers, malware, and robots scanning for security vulnerabilities may even deliberately scan the parts of the website they are told not to scan.
ROBOTS.TXT is not a security tool
You should not use robots.txt to hide your web pages from Google search results. Other pages may link to a disallowed page, and the search engine can index it by following those links, bypassing the ban in robots.txt.
Under no circumstances should you consider robots.txt as a site security tool for two reasons:
- The file itself is freely accessible, so anyone can see which resources you do not want indexed.
- As noted above, the directives in the file are not mandatory.
Why should each site have a ROBOTS.TXT file?
First of all, the existence of this file has no negative consequences for the site.
Creating the file itself and its contents is a very simple task, as you will see later.
A correctly configured robots.txt can help you stay within certain hosting plan limits, such as CPU usage and included traffic. For example, in WordPress you can disable unnecessary indexing of the administrative directory wp-admin and the core files in the wp-includes directory.
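For example, a minimal robots.txt along these lines (wp-admin and wp-includes are the standard WordPress directories):

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
```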
Where do we put the ROBOTS.TXT file?
The file must be placed in the main folder of the site. If you want to protect a file in a subdirectory, you do not need to create a new robots.txt there; instead, specify the full path to the file in the main robots.txt file.
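For example, to block a single file inside a subdirectory (the path below is illustrative):

```
User-agent: *
Disallow: /subdirectory/page.html
```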
If the robot does not find robots.txt in the root folder, it will not follow the directives in other robots.txt files, located in subdirectories.
Correct syntax of ROBOTS.TXT file
The name of the robots.txt file is case-sensitive, and the only correct name is robots.txt (variants such as Robots.txt, robots.TXT and ROBOTS.TXT are wrong).
Each domain and subdomain uses its own robots.txt file.
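For example, assuming the site also has a blog subdomain, each host needs its own file:

```
http://example.com/robots.txt        applies only to example.com
http://blog.example.com/robots.txt   applies only to blog.example.com
```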
robots.txt lets you specify the path to the sitemap
How do I create a ROBOTS.TXT file?
robots.txt is a plain text file. You can create it with a plain text editor on your computer and upload it through FTP to the main folder on the site.
In cPanel you can create robots.txt in the file manager as follows:
Open the File Manager and go to the site's main folder.
Click the + File link in the upper left corner:
In the small window, enter the name of the robots.txt file and click the Create New File button:
The new file will appear in the list of directories:
Select the file and click Edit in the top menu.
Enter the directives, make sure the syntax is correct, and click the Save Changes button.
In the browser, open the URL http://example.com/robots.txt to see the contents of the file.
Examples of ROBOTS.TXT file usage
All robots can index all files, because the wildcard * means "all robots" and a Disallow directive with no value means "nothing is forbidden":
User-agent: *
Disallow:
The same result can be achieved with an empty or missing robots.txt file.
All robots are excluded from the entire website with the following code:
User-agent: *
Disallow: /
All robots are not allowed to visit these directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
All robots are not allowed to index this file:
User-agent: *
Disallow: /directory/file.html
Note that all other files in the specified directory will be processed.
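How a standards-compliant robot interprets these rules can be checked with Python's urllib.robotparser module (a sketch; the example.com URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above: block a single file for all robots.
rules = [
    "User-agent: *",
    "Disallow: /directory/file.html",
]

parser = RobotFileParser()
parser.parse(rules)

# The named file is blocked for every robot...
print(parser.can_fetch("*", "http://example.com/directory/file.html"))

# ...but other files in the same directory remain fetchable.
print(parser.can_fetch("*", "http://example.com/directory/other.html"))
```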
Only the specified robot is not allowed to visit the website:
User-agent: BadBot
Disallow: /
These robots are not allowed to visit the specified directory:
User-agent: BadBot
User-agent: Googlebot
Disallow: /private/
Note: Replace BadBot with the actual bot name.
Comments are written after the # symbol, either at the beginning of a line or after a directive:
User-agent: * # match all bots
Disallow: / # keep them out
It is also possible to give individual robots their own rules.
Example with several user agents:
User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory

User-agent: googlebot-news # only the news service
Disallow: / # disallow everything

User-agent: * # every other robot
Disallow: /something/ # disallow this directory
Non-standard extensions to the ROBOTS.TXT file
The Crawl-delay directive is not part of the standard protocol and is interpreted differently by different search engines:
User-agent: *
Crawl-delay: 10
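Python's urllib.robotparser can also read this non-standard directive; a minimal sketch using the rules above:

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Crawl-delay: 10",
]

parser = RobotFileParser()
parser.parse(rules)

# A polite crawler would wait this many seconds between requests.
print(parser.crawl_delay("*"))
```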
The Allow directive is useful when you want to tell robots to avoid an entire directory but still have some HTML documents in it crawled and indexed. To be compatible with all robots, when allowing individual files inside an otherwise disallowed directory, place the Allow directive before the Disallow directive, for example:
Allow: /directory1/myfile.html
Disallow: /directory1/
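The effect of this ordering can be checked with Python's urllib.robotparser, which applies the first rule that matches a path (the paths are the ones from the example above):

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Allow: /directory1/myfile.html",
    "Disallow: /directory1/",
]

parser = RobotFileParser()
parser.parse(rules)

# The single file listed in Allow stays reachable...
print(parser.can_fetch("*", "http://example.com/directory1/myfile.html"))

# ...while the rest of the directory is blocked.
print(parser.can_fetch("*", "http://example.com/directory1/page.html"))
```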
Some robots support the Sitemap directive, which lets you list multiple sitemaps in the same robots.txt in the following format:
Sitemap: http://www.example.com/dir/sitemaps/profiles-sitemap.xml
Sitemap: http://www.example.com/dir/sitemap_index.xml
Some robots support the Host directive, which allows sites with multiple mirrors to specify their preferred domain:
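A sketch of the directive as historically supported by Yandex (the domain is illustrative):

```
User-agent: *
Disallow:
Host: www.example.com
```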
Note: The Host directive is not supported by all robots and should be inserted at the bottom of the robots.txt file, after the Crawl-delay directive.