What Is Robots.txt?
Robots.txt is a file that tells search engine spiders to not crawl certain pages or sections of a website. Most major search engines (including Google, Bing and Yahoo) recognize and honor Robots.txt requests.
Why Is Robots.txt Important?
Most websites don’t need a robots.txt file.
That’s because Google can usually find and index all of the important pages on your site.
And it will automatically NOT index pages that aren’t important or that are duplicate versions of other pages.
That said, there are 3 main reasons that you’d want to use a robots.txt file.
Block Non-Public Pages: Sometimes you have pages on your site that you don’t want indexed. For example, you might have a staging version of a page. Or a login page. These pages need to exist. But you don’t want random people landing on them. This is a case where you’d use robots.txt to block these pages from search engine crawlers and bots.
Maximize Crawl Budget: If you’re having a tough time getting all of your pages indexed, you might have a crawl budget problem. When you block unimportant pages with robots.txt, Googlebot can spend more of your crawl budget on the pages that actually matter.
Prevent Indexing of Resources: Using meta directives can work just as well as Robots.txt for preventing pages from getting indexed. However, meta directives don’t work well for multimedia resources, like PDFs and images. That’s where robots.txt comes into play.
The bottom line? Robots.txt tells search engine spiders not to crawl specific pages on your website.
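To make that concrete, here’s a minimal sketch that uses Python’s built-in urllib.robotparser to show how a crawler might read a rule set that blocks a staging area and a login page. The paths (/staging/ and /login), the example.com URLs and the is_crawlable helper are all made up for illustration, and real search engines may interpret edge cases differently than Python’s parser does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /staging/ and /login are placeholder paths.
NON_PUBLIC_RULES = """\
User-agent: *
Disallow: /staging/
Disallow: /login
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given bot may crawl the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The login page is blocked, while a regular blog post is not.
print(is_crawlable(NON_PUBLIC_RULES, "Googlebot", "https://example.com/login"))
print(is_crawlable(NON_PUBLIC_RULES, "Googlebot", "https://example.com/blog/post"))
```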
You can check how many pages you have indexed in the Google Search Console.
If the number matches the number of pages that you want indexed, you don’t need to bother with a Robots.txt file.
But if that number is higher than you expected (and you notice indexed URLs that shouldn’t be indexed), then it’s time to create a robots.txt file for your website.
Create a Robots.txt File
Your first step is to actually create your robots.txt file.
Because it’s a plain text file, you can create one using Windows Notepad.
And no matter how you ultimately make your robots.txt file, the format is exactly the same: a “User-agent:” line followed by one or more “Disallow:” lines.
User-agent is the specific bot that you’re talking to.
And everything that comes after “Disallow:” is a page or section that you want to block.
For example, a rule that pairs “User-agent: googlebot” with a Disallow line for your images folder (say, “Disallow: /images”) would tell Googlebot not to crawl that folder.
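Here’s a rough sketch of a Googlebot-only rule, checked with Python’s urllib.robotparser. The /images path and the example.com URL are assumptions for illustration. Notice that the rule only applies to the bot it names.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: only Googlebot is told to stay out of /images.
GOOGLEBOT_RULE = """\
User-agent: googlebot
Disallow: /images
"""

parser = RobotFileParser()
parser.parse(GOOGLEBOT_RULE.splitlines())

image_url = "https://example.com/images/logo.png"  # placeholder URL

# Googlebot matches the User-agent line, so it is blocked.
googlebot_allowed = parser.can_fetch("Googlebot", image_url)
# No rule names Bingbot, so it falls through to "allowed".
bingbot_allowed = parser.can_fetch("Bingbot", image_url)

print(googlebot_allowed, bingbot_allowed)
```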
You can also use an asterisk (*) to speak to any and all bots that stop by your website.
For example, “User-agent: *” followed by a Disallow line for your images folder tells any and all spiders to NOT crawl it.
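As a quick sketch of the wildcard version (again with a made-up /images path, placeholder URL and a hypothetical blocked_bots helper), you can check that every bot in a list is kept out:

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: the asterisk addresses every bot.
WILDCARD_RULE = """\
User-agent: *
Disallow: /images
"""


def blocked_bots(robots_txt: str, url: str, bots: list[str]) -> list[str]:
    """Return the bots from the list that are NOT allowed to crawl the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


bots = ["Googlebot", "Bingbot", "SomeOtherBot"]
print(blocked_bots(WILDCARD_RULE, "https://example.com/images/logo.png", bots))
```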
This is just one of many ways to use a robots.txt file. This helpful guide from Google has more info on the different rules you can use to block or allow bots from crawling different pages of your site.
Make Your Robots.txt File Easy to Find
Once you have your robots.txt file, it’s time to make it live.
Crawlers look for your robots.txt file in one place: the root of your host.
So to make sure that your robots.txt file gets found, place it at the top level of your site, so that it resolves at /robots.txt (for example, https://example.com/robots.txt).
(Note that the robots.txt filename is case sensitive. So make sure to use a lowercase “r” in the filename.)
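If you want to double-check where a crawler would look for the file for any given page, here’s a small sketch. The robots_txt_url helper is hypothetical, and example.com is a placeholder domain.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the root-level robots.txt URL for the host of any page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query/fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/some-post?ref=home"))
```

Each subdomain is its own host, so blog.example.com would get its own robots.txt under this logic.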
Check for Errors and Mistakes
It’s REALLY important that your robots.txt file is set up correctly. One mistake and your entire site could get deindexed.
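To see how small that mistake can be, here’s a sketch comparing two nearly identical files. With a single slash after “Disallow:”, every page is blocked. With nothing after it, nothing is blocked. This uses Python’s urllib.robotparser as a stand-in for a crawler, so treat it as an illustration and not a guarantee of how every search engine behaves. The allowed helper and the URL are made up.

```python
from urllib.robotparser import RobotFileParser

# One character apart: "/" blocks the whole site, an empty value blocks nothing.
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"
BLOCK_NOTHING = "User-agent: *\nDisallow:\n"


def allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the bot may crawl the URL under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


home = "https://example.com/"  # placeholder URL
print(allowed(BLOCK_EVERYTHING, home))
print(allowed(BLOCK_NOTHING, home))
```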
Fortunately, you don’t need to hope that your code is set up right. Google has a nifty Robots Testing Tool that you can use.
It shows you your robots.txt file… and any errors and warnings that it finds.
In our own file, for example, we block spiders from crawling our WP admin page.
We also use robots.txt to block crawling of WordPress auto-generated tag pages (to limit duplicate content).
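Rules along those lines might look something like the sketch below. To be clear, this is not our actual file: /wp-admin/ and /tag/ are assumed paths based on common WordPress defaults, and the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Assumed WordPress-style paths; adjust to match your own site structure.
WORDPRESS_RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /tag/
"""

wp_parser = RobotFileParser()
wp_parser.parse(WORDPRESS_RULES.splitlines())

checks = {
    url: wp_parser.can_fetch("Googlebot", url)
    for url in (
        "https://example.com/wp-admin/options.php",
        "https://example.com/tag/seo/",
        "https://example.com/blog/robots-txt-guide/",
    )
}

for url, is_allowed in checks.items():
    print(url, "->", "allowed" if is_allowed else "blocked")
```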
Robots.txt vs. Meta Directives
Why would you use robots.txt when you can block pages at the page-level with the “noindex” meta tag?
Like I mentioned earlier, the noindex tag is tricky to implement on multimedia resources, like videos and PDFs.
Also, if you have thousands of pages that you want to block, it’s sometimes easier to block the entire section of that site with robots.txt instead of manually adding a noindex tag to every single page.
There are also edge cases where you don’t want Google to spend any crawl budget landing on pages that carry the noindex tag.
Outside of those three edge cases, I recommend using meta directives instead of robots.txt. They’re easier to implement. And there’s less chance of a disaster happening (like blocking your entire site).
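For reference, the page-level approach is a robots meta tag in the page’s head. Here’s a minimal sketch that checks whether a chunk of HTML carries one, using Python’s built-in html.parser. The RobotsMetaParser class and has_noindex helper are hypothetical names, and the sketch only looks at the meta tag (not HTTP headers).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))
```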
Learn about robots.txt files: A helpful guide on how Google uses and interprets robots.txt.
What is a Robots.txt File? (An Overview for SEO + Key Insight): A no-fluff video on different use cases for robots.txt.