What is Robots.txt in SEO?


In this article, we will learn about Robots.txt and answer questions like: what is a Robots.txt file, why is it required, how do you create it, and most importantly, how do you implement it?

You first need to understand that not all pages of a website are equally important or hold equally high value. Pages like the home page and the products and services pages are more important to a business than its thank-you or gallery pages.

Also, some webpages might have content that adds immense value for readers, while other webpages might have content that is incomplete or outdated and hence needs to be hidden from readers.

For instance, will Amazon's sales increase if its thank-you page is visible on the search engine results page? Definitely not. Do you think its terms and conditions page or privacy policy page holds the same or more value than its home page? No. Well, legally yes, but from an individual or consumer standpoint, not that much.

So as a business, you would want only the important webpages to be visible in Google search results and the least important ones excluded from it. In other words, you want Google to index the important pages so they are visible in the search results and NOT index the least important ones so they are hidden from the search results.

So how do we instruct search engines which pages to index and which ones not to index, so that the latter stay hidden from the search results page?

Well, the answer is – by having a Robots.txt file.

What is Robots.txt?

As the name itself suggests, Robots.txt is a text file with a set of instructions for bots or, as we like to call them, crawlers. It basically instructs them about which sections or pages of your website should be indexed and which should not.

How does Robots.txt work?

To understand this, I want you to visualise a website, any website, as a huge mansion with numerous doors. These doors represent the various web pages. Out of these numerous doors, some are open and can be accessed while others are closed and cannot be accessed.

For the sake of simplicity, assume there are only 2 doors. One of them is open (the green door) and the other is closed (the red door).

When you arrive at the mansion, you can simply walk through the green door as it is open and all of its contents can be accessed quite easily.

On the other hand, when you try to enter the red door, you realise that it is closed, completely sealed and hence you do not know what’s on the other side. Its contents are not known to you and cannot be accessed.

The same applies to bots. You can keep doors to some pages closed and hence these pages cannot be crawled by bots.

A robots file typically looks somewhat like this –
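Here is a minimal sketch, using www.domain.com as a placeholder site and rules borrowed from a typical WordPress setup (your own rules will differ):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.domain.com/sitemap.xml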

There are 4 main commands –

  • User-agent:
  • Disallow:
  • Allow:
  • Sitemap:

Let us understand these commands with some examples.

Here is a sample robots file.

# Example: Block only Googlebot
User-agent: Googlebot
Disallow: /

The user-agent command contains the name of the recipient bot.

To block any particular bot, which in this case is Google's bot, simply type the bot name next to the user-agent command. This will block Google's bot from crawling, and thus indexing, the pages specified in the Disallow line below it.
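The same pattern works for any other crawler. For instance, Bingbot is the name Bing's crawler identifies itself with, so a sketch blocking only Bing would be:

# Example: Block only Bingbot
User-agent: Bingbot
Disallow: /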

Note that if you want to block bots from all the search engines, simply use * in place of the bot name. In that case, our robots file would look something like this.

# Example: Block all crawlers
User-agent: *
Disallow: /

Note that anything after the # is just a comment and not an actual command.

The next command, Disallow, indicates the pages the bot is not supposed to crawl.

You can see a forward slash right next to the Disallow command, indicating that the home page ‘www.domain.com/’ (in effect, the entire website) is not to be crawled.

Note that in a robots file, you are not required to enter the domain name, and hence ‘www.domain.com/’ is represented only with a forward slash.

Assume that one wants to block the crawlers from crawling and indexing the thank-you page with the URL www.domain.com/thank-you.

This is what their robots.txt will look like.

# Example: Block all crawlers from crawling the thank you page
User-agent: *
Disallow: /thank-you

Note how I have not mentioned the complete URL and have excluded the domain name.

A robots file to block multiple pages should look like this.

# Example: Block all crawlers from crawling the thank you page
User-agent: *
Disallow: /thank-you

# Example: Block all crawlers from crawling admin pages
User-agent: *
Disallow: /wp-admin

# Example: Block all crawlers from crawling the affiliate pages
User-agent: *
Disallow: /affiliate
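Since all three blocks are addressed to the same user-agent, the rules can also be written under a single group. Here is the same set of rules in that form (a sketch only; /affiliate is just a placeholder path like the others):

User-agent: *
Disallow: /thank-you
Disallow: /wp-admin
Disallow: /affiliate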

To summarize what we have learned, consider a website with the domain name www.courses.com, and consider digital marketing, data science and ecommerce to be the various courses listed on the website. Assume their respective URLs are –

  • www.courses.com/courses/digital-marketing-course
  • www.courses.com/courses/data-science-course
  • www.courses.com/courses/ecommerce-course

If I wish to block all the bots from crawling the data science course, the robots file will be –

User-agent: *
Disallow: /courses/data-science-course

Note how I have excluded the domain name from the URL.

If I wish to block all 3 courses, I can simply do this –
User-agent: *
Disallow: /courses/digital-marketing-course

User-agent: *
Disallow: /courses/data-science-course

User-agent: *
Disallow: /courses/ecommerce-course

OR I can simply block everything within the courses folder like this –

User-agent: *
Disallow: /courses/

So if I block /courses/, everything within or after /courses/ will be blocked and cannot be accessed by crawlers.
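For example, with the rule above, all of these course URLs sit under the disallowed /courses/ path and are therefore blocked:

  • www.courses.com/courses/digital-marketing-course
  • www.courses.com/courses/data-science-course
  • www.courses.com/courses/ecommerce-course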

Imagine this. If the front door of the mansion is locked, every door inside the mansion becomes inaccessible to me, as I cannot get past the first one.

Note – in order to create an exception, use the Allow command.

User-agent: *
Disallow: /courses/
Allow: /courses/data-science-course

In this example, I am instructing bots not to crawl any of the URLs under /courses/, except the data science course webpage, which they are allowed to crawl.
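The fourth command from our list, Sitemap, has not appeared in the examples yet. It simply tells crawlers where to find your XML sitemap, and unlike Disallow it takes the full URL. Adding it to our file would look something like this (the sitemap URL is a placeholder for our example domain):

Sitemap: https://www.courses.com/sitemap.xml
User-agent: *
Disallow: /courses/
Allow: /courses/data-science-course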
