
Robots.txt File

How important is the robots.txt file for a website?

The robots.txt file tells search engines which files in the site directory they are allowed to index and which resources should not be indexed (saved in the search engine's database).

The file uses the syntax of the Robots Exclusion Standard protocol to show which files are allowed for indexing and which are not.

Since the connection between indexed web pages and visitor traffic is obvious, robots.txt is an important tool for fine-tuning how a site is indexed.

A tiny syntax error in robots.txt can hide many useful web pages from search engine results, with purely negative consequences: less traffic, fewer sales and lower popularity.

Different crawlers interpret syntax differently

Although reputable web crawlers follow the directives in a robots.txt file, each crawler might interpret them differently. You should know the proper syntax for addressing different web crawlers, as some might not understand certain instructions.

The directives in robots.txt are not prohibitive; they are only recommendations.

Not all robots cooperate with the standard - email harvesters, spambots, malware, and robots that scan for security vulnerabilities may even start with the portions of the website they have been told to stay out of.

robots.txt is not a security tool

You should not use robots.txt as a means to hide your web pages from Google Search results. Other pages might link to your page, and your page could get indexed that way, bypassing the robots.txt file.

Under no circumstances treat robots.txt as a site security tool for two reasons:

  • The file itself is freely accessible, and anyone can see which resources you do not want indexed (see the example below);
  • As we said, the file's directives are not mandatory.
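
For illustration, a hypothetical robots.txt like the sketch below (the paths are made up) openly reveals which locations the site owner is trying to keep out of search results - anyone can read it at example.com/robots.txt:

User-agent: *
Disallow: /secret-admin/    # hypothetical path - listing it here reveals that it exists
Disallow: /private-files/   # hypothetical path - listing it here reveals that it exists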

Why should each site have a robots.txt file?

First of all, the presence of this file does not have any negative consequences for the site.

Creating the file and filling in its content is elementary, as you will see later.

A properly configured robots.txt can also help you stay within hosting plan limits such as CPU usage and included traffic (for example, in WordPress you can disable unnecessary crawling of the administration files in the wp-admin folder and the program core files in the wp-includes folder), as shown in the sketch below.
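
A minimal sketch of such a WordPress-oriented robots.txt might look like this (the paths assume a default WordPress installation):

User-agent: *
Disallow: /wp-admin/      # administration files
Disallow: /wp-includes/   # program core files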

More robots.txt Details

Where to place robots.txt?

The file must be placed in the site's root folder. If you want to exclude a file in a subdirectory, you do not have to create a new robots.txt there; instead, specify the full path to the file in the main robots.txt:

Disallow: /sub-directory/file.html

If a robot does not find robots.txt in the root folder, it will not comply with the directives in other robots.txt found in subdirectories.

The Correct Syntax

The file name is case sensitive and the only correct name is robots.txt (Robots.txt, robots.TXT and ROBOTS.TXT are all wrong).

Each domain/subdomain uses its own robots.txt file:

  • blog.example.com/robots.txt
  • example.com/robots.txt

robots.txt allows you to indicate the path to a site's sitemap:

Sitemap: http://www.example.com/directory/sitemap_index.xml

How to create a robots.txt file?

robots.txt is a plain text file. You can create it with a plain text editor on your computer and upload it through FTP into the site's root folder.

In cPanel, you can create robots.txt in File Manager as follows:

Open File Manager and navigate to the site's root folder.

Click the + File link in the upper left corner.

In the small window, enter the file name robots.txt and click the Create New File button.

The new file will appear in the directory listing.

Select the file and click Edit in the upper menu.

Enter your directives, make sure the syntax is correct, and click the Save Changes button.

In your browser, open example.com/robots.txt to see the file's content.

robots.txt Examples

All robots can index all files, because the wildcard character (*) matches all robots and a Disallow directive without a value means nothing is disallowed:

User-agent: *
Disallow:

All robots are excluded from the entire website:

User-agent: *
Disallow: /

No robots are allowed to visit these directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

No robots are allowed to index this file:

User-agent: *
Disallow: /directory/file.html

Note that all other files in the specified directory will be processed.

Only the specified robot is not allowed to visit the website:

User-agent: BadBot
Disallow: /

The specified robots are not allowed to visit this directory:

User-agent: BadBot
User-agent: Googlebot
Disallow: /private/

Comments are placed after the # symbol, either at the start of a line or after a directive:

User-agent: *  # match all bots
Disallow: /    # keep them out

It is also possible to list multiple robots with their own rules.

Multiple user-agents example:

User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory

Nonstandard extensions

Crawl-delay directive – it is not part of the standard protocol and is interpreted differently by different search engines:

User-agent: *
Crawl-delay: 10

Allow directive - this is useful when one tells robots to avoid an entire directory but still wants some HTML documents in that directory crawled and indexed.

To be compatible with all robots, if one wants to allow single files inside an otherwise disallowed directory, it is necessary to place the Allow directive(s) first, followed by the Disallow, for example:

Allow: /directory1/myfile.html
Disallow: /directory1/

Sitemap directive

Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form:

Sitemap: http://www.example.com/dir/sitemaps/profiles-sitemap.xml
Sitemap: http://www.example.com/dir/sitemap_index.xml

Host directive

Some crawlers support a Host directive, allowing websites with multiple mirrors to specify their preferred domain:

Host: example.com

Or alternatively:

Host: www.example.com

Note: The Host directive is not supported by all crawlers; if used, it should be placed at the bottom of the robots.txt file, after the Crawl-delay directive, as in the sketch below.
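
A sketch of how these nonstandard directives might be combined at the end of a robots.txt file (the domain and paths are placeholders):

User-agent: *
Disallow: /private/    # standard exclusion rules come first
Crawl-delay: 10        # nonstandard: ask crawlers to pause between requests

Sitemap: http://www.example.com/sitemap_index.xml

Host: example.com      # nonstandard: preferred mirror, placed after Crawl-delay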

Reference: Robots exclusion standard, en.wikipedia.org

#settings #tools
