robots.txt Explained: Controlling What Search Engines Crawl
Imagine you run a busy shop and a polite visitor arrives at the front door every morning asking, "Which rooms am I allowed to walk through today?" You hand them a small note pinned by the entrance. They read it, nod, and wander only where you've said it's fine to go. That little note is, more or less, what a robots.txt file is for your website. It greets the automated visitors that search engines send out and quietly tells them which parts of your site they're welcome to explore.
It sounds modest, and it is: just a plain text file sitting at the root of your domain. Yet this unassuming file can either help search engines understand your site efficiently or, if you get it wrong, accidentally make large chunks of your website invisible. In this guide we'll unpack what robots.txt actually does, what it absolutely cannot do, the mistakes that quietly cost businesses traffic, and how to handle it without breaking anything.
What robots.txt actually is
Search engines discover pages using automated programs often called crawlers, bots, or spiders. They follow links from page to page, reading content and adding it to a giant index they later use to answer searches. Before a well-behaved crawler reads your site, it checks one specific location first: a file named robots.txt that lives at the very top of your domain, such as yourdomain.com/robots.txt.
That file contains a short set of instructions written in a simple format. It names which crawlers the rules apply to and lists which paths they should or shouldn't request. Think of it as a doorman's instruction sheet rather than a locked gate. The crawler reads the sheet and, if it's a reputable one from a major search engine, it follows the guidance. This is part of the wider world of technical SEO basics that quietly shape how well a site performs.
The crawl budget connection
Every site gets a rough allowance of attention from search engines, sometimes called crawl budget. It's the amount of crawling a search engine is willing to do on your site in a given window. For a small brochure site this almost never matters. But for a large store with thousands of pages, filters, and search results, you don't want crawlers wasting their visits on pointless URLs. A thoughtful robots.txt can steer them away from low-value corners so they spend more energy on the pages that earn you customers.
What the file looks like inside
You don't need to be a programmer to read a robots.txt file. It's built from a few repeating ingredients. The most common are User-agent, which names the crawler the rules target, and Disallow, which names a path that crawler should avoid. There's also Allow, which carves out an exception to a Disallow rule, and Sitemap, which points crawlers to your XML sitemap.
A simple example might say: for every crawler, please don't visit the admin folder or the internal search results, but everything else is fair game, and here's where to find the map of my important pages. That's genuinely most of what the file does. The art is in deciding what belongs on the "please avoid" list and what should stay open.
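To make that concrete, here is a minimal sketch of the example just described, checked with Python's standard-library `urllib.robotparser`. The domain `www.example.com`, the `/admin/` and `/search` paths, and the sitemap URL are placeholders invented for illustration, not recommendations for any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; every path and URL is a placeholder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser the same question a polite crawler would ask.
admin_allowed = parser.can_fetch("*", "https://www.example.com/admin/login")
search_allowed = parser.can_fetch("*", "https://www.example.com/search?q=shoes")
product_allowed = parser.can_fetch("*", "https://www.example.com/products/shoes")
sitemaps = parser.site_maps()  # available in Python 3.8 and later

print(admin_allowed, search_allowed, product_allowed)  # False False True
print(sitemaps)  # ['https://www.example.com/sitemap.xml']
```

The four lines inside `ROBOTS_TXT` are the whole file. Everything else is just a way of asking "would a rule-following crawler request this URL?" without waiting for a real crawler to visit.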
An asterisk and a dollar sign do a lot of work
Two symbols carry a lot of meaning. The asterisk acts as a wildcard, matching any sequence of characters, and the dollar sign marks the end of a URL. So a rule can target every URL containing a question mark, or every file ending in a particular extension. This is powerful and slightly dangerous: a wildcard placed carelessly can match far more than you intended, which is exactly how accidental site-wide blocks happen.
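The sketch below approximates how those two symbols behave. It is a hand-rolled illustration, not any search engine's actual matcher: `rule_matches` is a hypothetical helper, and the patterns and paths are made up. Python's built-in `urllib.robotparser` does plain prefix matching and does not interpret these symbols, which is why the example converts each pattern to a regular expression instead.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Approximate robots.txt matching: '*' matches any characters, a trailing '$' ends the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let '*' stand for "anything".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, mirroring how rules match from the start of the path.
    return re.match(regex, path) is not None

# Every URL containing a question mark.
print(rule_matches("/*?", "/shoes?color=red"))                # True
print(rule_matches("/*?", "/shoes"))                          # False

# Every URL ending in .pdf, and only those.
print(rule_matches("/*.pdf$", "/docs/guide.pdf"))             # True
print(rule_matches("/*.pdf$", "/docs/guide.pdf?download=1"))  # False

# A careless wildcard: meant for print-friendly pages, but it also catches this post.
print(rule_matches("/*print", "/blog/blueprint-guide"))       # True
```

The last line is the danger in miniature: a pattern written with one kind of URL in mind happily matches another that merely contains the same letters.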
The single most important thing to understand
Here is the misconception that causes the most damage, and it's worth reading twice: robots.txt controls crawling, not indexing. Those are two different things. Crawling is whether a search engine reads the page. Indexing is whether the page can appear in search results. Blocking a page in robots.txt stops the reading, but it does not reliably stop the page from showing up in results.
How can a page appear in results if the crawler never read it? Because search engines also learn about pages from links pointing to them elsewhere. If lots of sites link to a URL you've blocked, the search engine may list it anyway, often with a bare title and a note that no description is available because crawling was disallowed. So if your real goal is to keep a page out of search results entirely, robots.txt is the wrong tool. You'd want a noindex instruction instead, typically a robots meta tag in the page's HTML, which a crawler can only see if you let it read the page.
| Your goal | Right tool | Why |
|---|---|---|
| Stop wasting crawl effort on junk URLs | robots.txt Disallow | Asks compliant crawlers not to request the path at all. |
| Keep a page out of search results | noindex tag on the page | The crawler must read the page to see the instruction, so don't block it. |
| Hide private or sensitive data | Password protection | robots.txt is public; never rely on it for security. |
| Point crawlers to your key pages | Sitemap line in robots.txt | Helps discovery of important URLs efficiently. |
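
One practical consequence of the table is that a noindex page must stay crawlable. The sketch below, using placeholder rules and URLs and a hypothetical helper named `blocked_noindex_urls`, flags pages you intend to noindex that your robots.txt would stop a crawler from reading. It relies on Python's `urllib.robotparser`, which only understands simple path prefixes.

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules and URLs, invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /old-offers/
"""

# Pages that carry (or will carry) a noindex instruction.
NOINDEX_URLS = [
    "https://www.example.com/old-offers/summer",
    "https://www.example.com/thank-you",
]

def blocked_noindex_urls(robots_txt, urls, agent="*"):
    """Return the URLs whose noindex can never be seen because crawling is disallowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]

conflicts = blocked_noindex_urls(ROBOTS_TXT, NOINDEX_URLS)
print(conflicts)  # ['https://www.example.com/old-offers/summer']
```

Any URL in that list is working against itself: the Disallow rule hides the very instruction meant to keep the page out of results.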
What robots.txt cannot do
It's worth being blunt about the limits, because misunderstanding them leads to real problems. First, as we've covered, it does not guarantee a page stays out of search results. Second, it is not a security measure. The file is publicly readable by anyone who types the address, so listing your secret admin path actually advertises it to the curious. If something must stay private, protect it with a login, not a polite request.
Third, well-behaved crawlers obey it, but not every bot on the internet is well-behaved. Scrapers and malicious bots may ignore the file entirely. And finally, blocking a page in robots.txt can backfire when that page already drives traffic. If a crawler can no longer read it, the search engine slowly loses its understanding of what's there, which can quietly erode rankings. If you've ever wrestled with pages that were crawled but not indexed, robots.txt is often part of the diagnostic puzzle.
The mistakes that quietly cost businesses traffic
Most robots.txt disasters aren't dramatic. They're small, silent, and discovered weeks later when someone notices traffic has slipped. The most infamous is the leftover block from a website build. During development, teams often add a rule that disallows everything so the unfinished site stays out of search. The mistake is forgetting to remove it on launch day. The site goes live, looks perfect to visitors, and is utterly invisible to search engines because that one stubborn line is still telling every crawler to stay out.
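A launch checklist can catch this automatically. The following is a minimal sketch, assuming you can get the text of the robots.txt that is about to go live; `blocks_entire_site` is a hypothetical helper name, and the two sample files are invented.

```python
from urllib.robotparser import RobotFileParser

def blocks_entire_site(robots_txt: str, agent: str = "*") -> bool:
    """True if the given crawler is told to stay away from the home page itself."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")

# Typical leftover from a development build: everything is disallowed.
STAGING_RULES = "User-agent: *\nDisallow: /\n"

# What the live site probably wanted instead (placeholder path).
LIVE_RULES = "User-agent: *\nDisallow: /admin/\n"

print(blocks_entire_site(STAGING_RULES))  # True
print(blocks_entire_site(LIVE_RULES))     # False
```

Running a check like this as part of the launch routine turns a silent, weeks-long mistake into an immediate, obvious failure.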
Blocking your own resources
Another classic error is blocking the files that make your pages work, such as stylesheets and scripts. Years ago some people blocked these to "tidy up" crawling. Today, search engines render pages much like a browser does, so if you block the resources that control layout, the crawler sees a broken, half-built version of your page and may judge it harshly. Let crawlers reach the assets that make your pages look and behave correctly.
Conflicting and overly broad rules
Wildcards are wonderful until they swallow more than intended. A rule meant to block one type of URL can accidentally match your whole catalogue if the pattern is too loose. Conflicting Allow and Disallow lines also confuse things, since the way conflicts are resolved isn't always obvious to a non-expert: Google, for example, follows the most specific matching rule (the one with the longest path) and, when an Allow and a Disallow are equally specific, the less restrictive Allow. The safest habit is to keep rules few, specific, and easy to read, then test them rather than trust them. These are exactly the sort of issues a thorough SEO audit is designed to catch before they hurt you.
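For a feel of how conflicts between Allow and Disallow play out, here is a simplified sketch of the "most specific rule wins" approach described in Google's documentation and in RFC 9309: the longest matching path decides, and Allow wins a tie. It handles plain path prefixes only (no wildcards), `is_allowed` is a hypothetical helper, the rules are invented, and other crawlers may resolve conflicts differently.

```python
def is_allowed(rules, path):
    """rules: list of (directive, prefix) pairs, e.g. ("Disallow", "/shop/")."""
    best_length = -1
    allowed = True  # if no rule matches, the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_length
        tie_won_by_allow = len(prefix) == best_length and is_allow
        if longer or tie_won_by_allow:
            best_length = len(prefix)
            allowed = is_allow
    return allowed

# Placeholder rules: block the shop, except its sale section.
RULES = [
    ("Disallow", "/shop/"),
    ("Allow", "/shop/sale/"),
]

print(is_allowed(RULES, "/shop/cart"))        # False: only the Disallow matches
print(is_allowed(RULES, "/shop/sale/boots"))  # True: the longer Allow is more specific
print(is_allowed(RULES, "/about"))            # True: no rule matches
```

Notice that the order of the lines in `RULES` makes no difference under this approach. What matters is which matching path is longest.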
What you should usually leave open
It's tempting to think more blocking equals more control, but the opposite is usually true. For most websites, you want crawlers to read your pages freely. The pages that genuinely benefit from being blocked are narrow: internal search result pages that generate endless thin URLs, certain filtered or sorted versions of category pages that create near-duplicates, cart and checkout steps, and admin areas. Even then, blocking isn't always the best fix for duplicates. Sometimes a duplicate content problem is better solved with canonical tags so the search engine still understands the relationship between pages.
Always point to your sitemap
One genuinely helpful line to include is the location of your XML sitemap. It gives crawlers a tidy map of the URLs you care about, which speeds up discovery, especially for newer or larger sites. If you're setting up a brand-new project, this small step belongs in your launch routine, and pairs naturally with the broader checklist for SEO for new websites.
How to check and test it safely
Before you change anything, look at what you already have. Type your domain followed by /robots.txt into a browser and read it. If you don't have one, that's usually fine; an absent file simply means crawlers assume everything is allowed. If you do have one, read every line and ask, in plain language, "What does this stop a crawler from reading, and do I actually want that?"
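If you'd rather script that first look, the sketch below works out where a site's robots.txt lives from any page URL, and shows that a parser given no rules at all treats every page as allowed. The helper name `robots_url` and the example URLs are assumptions for illustration, and an absent file is modelled here simply as an empty rule set.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def robots_url(page_url: str) -> str:
    """Build the robots.txt address for whatever site the page belongs to."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

location = robots_url("https://www.example.com/blog/post?id=7")
print(location)  # https://www.example.com/robots.txt

# No rules at all (a stand-in for a missing file): nothing is off limits.
no_rules = RobotFileParser()
no_rules.parse([])
everything_allowed = no_rules.can_fetch("*", "https://www.example.com/any/page")
print(everything_allowed)  # True
```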
When you make changes, treat them with care. Major search engines offer testing tools that let you check whether a specific URL is allowed or blocked under your rules. Use them. It's far better to confirm a rule behaves as expected than to publish it and discover the consequences in your traffic reports a fortnight later. After changes, keep an eye on coverage reports for any sudden spike in blocked pages, and watch your overall site health and performance so nothing slips through unnoticed.
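Alongside the search engines' own tools, you can keep a short list of URLs with the outcome you expect for each and check a proposed file against it before publishing. The sketch below is one way to do that with Python's `urllib.robotparser`. The rules, the URLs, and the helper `find_surprises` are all hypothetical, and because this parser only understands simple path prefixes, rules that use wildcards still need checking in the search engines' testers.

```python
from urllib.robotparser import RobotFileParser

# The file you are about to publish (placeholder rules).
PROPOSED_RULES = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
"""

# URL -> should a crawler be allowed to read it?
EXPECTATIONS = {
    "https://www.example.com/": True,
    "https://www.example.com/products/boots": True,
    "https://www.example.com/cart/": False,
    "https://www.example.com/checkout/payment": False,
}

def find_surprises(robots_txt, expectations, agent="*"):
    """Return (url, expected, actual) for every URL that behaves differently than intended."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    surprises = []
    for url, expected in expectations.items():
        actual = parser.can_fetch(agent, url)
        if actual != expected:
            surprises.append((url, expected, actual))
    return surprises

surprises = find_surprises(PROPOSED_RULES, EXPECTATIONS)
print(surprises or "All URLs behave as expected")
```

Keeping the expectations list next to the file means every future edit gets the same check, including the pages you most need crawlers to keep reading.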
Coordinate with your wider strategy
Robots.txt doesn't live in isolation. It works alongside your sitemap, your internal links, and your indexing instructions. Strong internal linking helps crawlers find your important pages efficiently, which reduces the need for heavy blocking in the first place. When all these pieces agree with each other, crawlers spend their attention where it counts and your best content gets the visibility it deserves. If you're still getting comfortable with the fundamentals, it's worth revisiting how SEO works as a whole.
A calm approach to a small but powerful file
If there's one mindset to carry away, it's this: with robots.txt, restraint usually beats enthusiasm. The file is most useful when it's lean, deliberate, and easy for a human to understand at a glance. Block only what genuinely deserves blocking, never rely on it to hide secrets, remember it controls crawling rather than results, and always point to your sitemap. Then test before you trust.
Handled this way, that humble text file becomes a quiet ally, guiding search engines toward your best work and away from the clutter. Handled carelessly, it becomes one of the easiest ways to disappear from search without realising why. The good news is that paying it a little attention now means you'll rarely have to think about it again. If your site has grown complex or you've recently moved it, it's wise to fold a robots.txt review into a broader technical health check or a planned website migration, and to reach out for a hand if anything looks tangled.
Frequently asked questions
Does every website need a robots.txt file?
Will blocking a page in robots.txt remove it from Google?
Can I use robots.txt to hide private information?
I blocked something by mistake. How fast does fixing it work?