How important is the robots.txt file for a web site?
This file tells search engines which files in the site's directory they may index (save in the search engine's database) and which resources they should skip. It uses the syntax of the Robots Exclusion Standard protocol to mark which files are allowed for indexing and which are not.
Since indexed web pages are what bring visitor traffic, robots.txt is a very important tool for fine-tuning what gets indexed.
A tiny error in the syntax of robots.txt can hide many useful web pages from search engine results, and this has only negative consequences: less traffic, fewer sales, and less popularity.
Different crawlers interpret syntax differently
Although reputable web crawlers follow the directives in a robots.txt file, each crawler may interpret them differently. You should know the proper syntax for addressing different web crawlers, as some may not understand certain instructions.
The directives in robots.txt are not prohibitions but recommendations. Not all robots cooperate with the standard: email harvesters, spambots, malware, and robots that scan for security vulnerabilities may even start with the portions of the website they have been told to stay out of.
robots.txt is not a security tool
You should not use robots.txt as a means to hide your web pages from Google Search results. Other pages might link to your page, and your page could get indexed that way, bypassing the robots.txt directives entirely.
Under no circumstances treat
robots.txt as a site security tool for two reasons:
- The file itself is freely accessible, so anyone can see which resources you do not want indexed;
- As we said, its directives are not mandatory;
Why should each site have a robots.txt file?
First of all, the presence of this file does not have any negative consequences for the site.
Creating the file itself and its content is elementary, as you will see later.
A properly configured robots.txt can help you stay within hosting plan limits such as CPU usage and traffic quotas (for example, in WordPress you can disable unnecessary crawling of the administration files in the wp-admin folder and the program core files in the wp-includes folder).
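For instance, a minimal WordPress-oriented robots.txt implementing this might look like the following (the paths are the standard WordPress folders mentioned above):

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
```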
More robots.txt Details
Where to place robots.txt?
The file must be placed in the site's root folder. If you want to protect a file in a subdirectory, you do not have to create a new robots.txt there; instead, specify the full path to the file in the main robots.txt. If a robot does not find robots.txt in the root folder, it will not comply with directives in any robots.txt files found in subdirectories.
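To illustrate, crawlers only request the file at the site root (example.com is a placeholder domain):

```
https://example.com/robots.txt        <- read by crawlers
https://example.com/blog/robots.txt   <- ignored by crawlers
```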
The Correct Syntax
The name of the robots.txt file is case sensitive, and the only correct spelling is robots.txt (Robots.txt, robots.TXT, and ROBOTS.TXT are all wrong).
Each domain/subdomain uses its own robots.txt file:
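For example, rules for a blog hosted on a subdomain must live in that subdomain's own file (placeholder domains):

```
https://example.com/robots.txt        <- rules for example.com only
https://blog.example.com/robots.txt   <- rules for blog.example.com only
```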
robots.txt allows you to indicate the path to a site's sitemap:
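For example (assuming a sitemap file at the site root):

```
Sitemap: https://example.com/sitemap.xml
```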
How to create a robots.txt file?
robots.txt is a plain text file. You can create it with a plain text editor on your computer and upload it through FTP into the site's root folder.
In CPanel, you can create
robots.txt in the file manager as follows:
Open File Manager and navigate to the site's root folder.
Click on + File link in the upper left corner:
In the small window enter file name
robots.txt and click Create New File button:
The new file will appear in the directory listing:
Select the file and click Edit in the upper menu.
Enter some code, make sure the syntax is correct, and click the Save Changes button.
In your browser, enter the site's URL example.com/robots.txt to see the file's content.
All robots can index all files, because the wildcard sign (*) means all, and Disallow without any value means nothing is disallowed:
User-agent: *
Disallow:
All robots are excluded from the entire website:
User-agent: *
Disallow: /
All robots are not allowed to visit these directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
All robots are not allowed to index this file:
User-agent: *
Disallow: /directory/file.html
Note that all other files in the specified directory will still be crawled.
Only the specified robot is not allowed to visit the website:
User-agent: BadBot
Disallow: /
The specified robots are not allowed to visit this directory:
User-agent: BadBot
User-agent: Googlebot
Disallow: /private/
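If you want to verify how rules like these are interpreted, Python's standard-library urllib.robotparser can check them. A minimal sketch, using a rule group with the same bots and path as above (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# A rule group where two named bots share one Disallow rule.
rules = """\
User-agent: BadBot
User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Both listed bots are blocked from /private/ ...
print(parser.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
# ...but may fetch everything else.
print(parser.can_fetch("Googlebot", "https://example.com/index.html"))         # True
# A bot not listed in any group is unrestricted here (no * group exists).
print(parser.can_fetch("OtherBot", "https://example.com/private/page.html"))   # True
```

Such a check is a quick way to catch syntax mistakes before uploading the file.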
How to use comments: place them after the # symbol, either at the start of a line or after a directive:
User-agent: * # match all bots
Disallow: / # keep them out
It is also possible to list multiple robots with their own rules.
Multiple user-agents example:
User-agent: googlebot # all Google services
Disallow: /private/ # disallow this directory
User-agent: googlebot-news # only the news service
Disallow: / # disallow everything
User-agent: * # any robot
Disallow: /something/ # disallow this directory
Crawl-delay directive – it is not part of the standard protocol and is interpreted differently by different search engines:
User-agent: *
Crawl-delay: 10
Allow directive - this is useful when one tells robots to avoid an entire directory but still wants some HTML documents in that directory crawled and indexed.
To be compatible with all robots, if one wants to allow single files inside an otherwise disallowed directory, it is necessary to place the Allow directive(s) first, followed by the Disallow, for example:
Allow: /directory1/myfile.html
Disallow: /directory1/
Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same
robots.txt in the form:
Sitemap: http://www.example.com/dir/sitemaps/profiles-sitemap.xml
Sitemap: http://www.example.com/dir/sitemap_index.xml
Host directive – some crawlers support it, allowing websites with multiple mirrors to specify their preferred domain:
Note: this directive is not supported by all crawlers; if used, it should be placed at the bottom of the robots.txt file, after the Crawl-delay directive.
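A sketch of such a file (the Host directive was historically honored mainly by Yandex; the preferred mirror shown here is an assumption):

```
User-agent: *
Disallow:
Crawl-delay: 5
Host: www.example.com
```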