What is robots.txt? | Robots.txt file for SEO


What is robots.txt?

A web crawler, also known as a spider or spider bot, is an Internet bot that systematically browses the World Wide Web to find new and updated content, a process known as crawling. When a crawler finds new content, it adds those web pages to the search engine's index, a process called web indexing. Before accessing any website, crawlers first look at the site's robots.txt file.

Robots.txt, also known as the robots exclusion standard or robots exclusion protocol, is a standard websites use to communicate with these web crawlers and web robots.

What is robots.txt file?

A robots.txt file is a way to give instructions directly to search engine robots, telling them clearly which parts or pages of your site you do or do not want crawled and indexed.

Before accessing your website, search engine crawlers look at your robots.txt file for instructions on which pages they are allowed to crawl and index in search engine results.

In short, you can use this file to tell these web crawlers, “Index these pages, but don’t index these other ones.” A robots.txt file (sometimes mistakenly called a robot.txt file) is a must-have for every website.

Importance/Purpose of robots file

  • To maximize crawl budget. Crawl budget is the number of URLs a crawler can and wants to crawl on your site. If you are struggling to get all of your web pages indexed, you may have a crawl budget problem. By blocking unimportant, low-value pages in your robots.txt file, you let web crawlers spend more of your crawl budget on the pages that actually matter and add value to your website.
  • This file can also be used to block web bots from accessing your website. For example, if a website is in development, it may make sense to block robots from having access until it’s ready to be launched.
  • You may also want to exclude any pages that contain duplicate content. For example, if you offer “print versions” of some pages, you wouldn’t want web crawlers to crawl and index duplicate versions as duplicate content could affect your rankings.
  • Some pages contain sensitive information you don’t want to show, or pages you don’t want to rank on a SERP, such as a thank-you page or a login page.

Format of a robots.txt file


First, we need to understand some syntax used in a robots.txt file.

  • User-agent –> the name of the web robot the rules that follow apply to; in this file, crawlers are addressed as user-agents.
  • Disallow –> blocks access to a page or section of your website; state the URL path here.
  • Allow –> unblocks a URL path within a blocked parent directory; enter that subdirectory path here.

The asterisk (*) after “User-agent” means that the instructions apply to all web robots that visit the site. The slash (/) after “Disallow” tells a robot not to visit any pages on the site. The “Allow” directive is only used when you want a page to be crawled even though its parent directory is “Disallowed.”

Creating a robots.txt file

Creation of this file is an easy process.

Start with Notepad, Microsoft Word, or any simple text editor. Save the file as “robots” (all letters lowercase), then choose “.txt” as the file type extension (in Word, choose “Plain Text”).

Following are some examples of robots.txt file:

User-agent: *
Disallow:

“User-agent” is another word for robots, web crawlers, or search engine spiders. The asterisk (*) denotes that this line applies to all web crawlers. Since no file or folder is listed on the Disallow line, every directory on your site may be accessed. This is the most basic form of the file.

User-agent: *
Disallow: /

You can block search engine spiders from accessing your entire site. The slash (/) after Disallow tells web crawlers not to crawl any URL on your website.
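These two basic forms can be checked offline with Python's standard-library urllib.robotparser module. This is a quick sketch, not an official testing method; example.com is a placeholder domain.

```python
from urllib import robotparser

# Two minimal robots.txt files: one that allows everything,
# one that blocks everything.
allow_all = ["User-agent: *", "Disallow:"]
block_all = ["User-agent: *", "Disallow: /"]

rp = robotparser.RobotFileParser()
rp.parse(allow_all)
print(rp.can_fetch("*", "https://www.example.com/any-page"))  # True

rp = robotparser.RobotFileParser()
rp.parse(block_all)
print(rp.can_fetch("*", "https://www.example.com/any-page"))  # False
```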

User-agent: *
Disallow: /database
Disallow: /photos

You can block web spiders from accessing certain areas or specific pages of your website. In this case, crawlers are not allowed to access anything in the database and photos directories or their sub-directories.

NOTE: Remember, you can mention only one file or folder per Disallow line, and you may add as many Disallow lines as you need.

User-agent: *
Disallow: /database
Disallow: /photos
Allow: /photos/furniture.jpg

Suppose there is a photo named furniture.jpg in the photos directory that you want web crawlers to crawl and index, while keeping the rest of the directory blocked. You can do this with the Allow directive. In this case, web crawlers can access furniture.jpg but not the rest of the photos directory.
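You can verify this Allow/Disallow combination with Python's standard-library urllib.robotparser. One caveat: Python's parser applies rules in order (first match wins) rather than by longest match as Google does, so the Allow line is placed first here. Example.com is a placeholder domain.

```python
from urllib import robotparser

# The Allow line comes first because Python's robotparser uses
# first-match semantics; Google uses the most specific match.
rules = [
    "User-agent: *",
    "Allow: /photos/furniture.jpg",
    "Disallow: /database",
    "Disallow: /photos",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://www.example.com/photos/furniture.jpg"))  # True
print(rp.can_fetch("*", "https://www.example.com/photos/sofa.jpg"))       # False
print(rp.can_fetch("*", "https://www.example.com/index.html"))            # True
```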

The dollar symbol ($) signifies the end of the URL. Use it when blocking by file extension; without it, you may block a huge number of URLs by accident. For example:

User-agent: *
Disallow: /*.pdf$

This blocks every URL that ends in .pdf, but not a URL such as /file.pdf?id=123, because that URL does not end with the extension.

Don’t forget to add your sitemap.xml file link to the robots.txt file. This ensures that spiders can find your sitemap and easily crawl and index all of your site’s pages.

Use this syntax:
Sitemap: http://www.mydomain.com/sitemap.xml
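Python's urllib.robotparser can also read the Sitemap line back out of a file; the site_maps() method is available in Python 3.8 and later. The domain below comes from the example above.

```python
from urllib import robotparser

# Parse a robots.txt that ends with a Sitemap line, then read it back.
rules = [
    "User-agent: *",
    "Disallow:",
    "Sitemap: http://www.mydomain.com/sitemap.xml",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)
print(rp.site_maps())  # ['http://www.mydomain.com/sitemap.xml']
```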

A good example of a robots.txt file for reference is https://www.flipkart.com/robots.txt. Lines starting with a hash (#) in this file are just comments.

Tools for creating robots.txt file

Creating your robots.txt file by hand is an easy process. But if you are unable to make the file yourself, you can use one of several free tools.

You just have to enter the required information, such as which search engine robots you want to allow to crawl and index your site, your XML sitemap link, and your allowed or restricted directory paths.

Below are some free tools for creating your robots.txt file:

1. Internet Marketing Ninjas

Link: https://www.internetmarketingninjas.com/seo-tools/robots-txt-generator/

2. SEOptimer

Link: https://www.seoptimer.com/robots-txt-generator

3. Small SEO Tools

Link: https://smallseotools.com/robots-txt-generator/

4. Ryte

Link: https://en.ryte.com/free-tools/robots-txt-generator/#custom

5. SureOak

Link: https://www.sureoak.com/seo-tools/robots-txt-generator

Testing your robots.txt file

Finally, you should always test your robots.txt file to make sure it operates as you expect it to. It’s a good idea to do this even if you think it’s all correct.

Tools for testing your robots.txt file

1. Google provides a free robots.txt tester as part of its Webmaster tools. Steps as follows:

Link: https://www.google.com/webmasters/tools/robots-testing-tool

  • Sign in to your Google Webmasters account by clicking on “Sign In”.
  • Select your property (i.e., website) and click on “Crawl”
  • You’ll see “robots.txt Tester.” Click on that.
  • If there’s any code in the box already, delete it and replace it with your new robots.txt file.
  • Click “Test” 
  • If the “Test” text changes to “Allowed,” that means your robots.txt is valid.

2. Merkle also offers a free tool that allows you to test your file. Steps as follows:

Link: https://technicalseo.com/tools/robots-txt/

  • Click on the above link.
  • Enter your robots.txt URL in the URL section.
  • Choose your user-agent from the dropdown list beside the URL section. If your user-agent is the asterisk (*), i.e. all agents, select “All (robots.txt)”; you will find this option at the very bottom of the list.
  • Then click on the TEST button and wait a moment for the results.

3. Ryte also gives you a free tool for testing your robots.txt file, but it supports only a few user agents. Steps as follows:

Link: https://en.ryte.com/free-tools/robots-txt/

  • Click on the above link.
  • Enter your robots.txt URL in the URL section.
  • Choose your User Agent from the dropdown list beside the URL section. Here you will get a limited choice of User Agents to select.
  • Click on the Evaluate button and wait for your results.

4. Sitechecker also checks your robots.txt file for free. Steps as follows:

Link: https://sitechecker.pro/robots-tester/

  • Click on the above link.
  • Enter your robots.txt URL in the URL section.
  • Click on the side arrow and wait for your results.
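Besides these web-based testers, you can test a file locally with Python's standard-library urllib.robotparser, including checks for specific user-agents. This is a sketch under assumptions: Googlebot is just one example crawler name, and example.com is a placeholder domain.

```python
from urllib import robotparser

# A robots.txt with one group for Googlebot and a catch-all group
# that blocks every other crawler.
rules = [
    "User-agent: Googlebot",
    "Disallow: /photos",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://www.example.com/photos/a.jpg"))   # False
print(rp.can_fetch("Googlebot", "https://www.example.com/index.html"))     # True
print(rp.can_fetch("SomeOtherBot", "https://www.example.com/index.html"))  # False

# For a live site, fetch the file instead of parsing a list of lines:
# rp.set_url("https://www.example.com/robots.txt")
# rp.read()
```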

Where is the robots.txt file uploaded/located?

The robots.txt file must be located at the root of the website host. If you’re unsure how to access your website root, contact your web hosting service provider.

If you are able to access your website root, upload the robots.txt file to your site’s top-level directory; this can be done via cPanel.

Once you have completed, saved, and uploaded your robots.txt file to the root directory of your site, check it by adding /robots.txt after your domain. For example, if your domain is www.digitaldumdum.com, you would enter www.digitaldumdum.com/robots.txt.

What if you don’t have a robots.txt file?

Without a robots.txt file, web crawlers have access to everything and will crawl and index anything they find on your website. This is fine for most websites.

But it’s good practice to point out where your sitemap.xml lies, so search engines and web crawlers can quickly find new content on your website, optimizing their crawl budget.
