Crawl Budget Optimization for Sites Under 10,000 URLs (Yes, It Still Matters)
Google says crawl budget only matters for million-URL sites. In practice, it matters for any site with faceted navigation, parameter URLs, or auto-generated tag pages. Here's the small-site crawl audit.
Google's official line is that crawl budget is a concern only for sites with 1M+ URLs. The official line is wrong for most ecommerce, multi-location, and tag-heavy publishing sites. A 5,000-product Shopify store can easily expose 200,000 crawlable URLs through facets — and Googlebot will burn budget on the wrong ones unless you intervene.
Table of contents
1. When crawl budget actually matters · 2. The 4 small-site crawl-waste sources · 3. How to find waste in Search Console · 4. The robots.txt + noindex + canonical playbook · 5. Sitemaps as crawl prioritization · 6. Measuring the fix · 7. FAQ
When does crawl budget actually matter for small sites?
Crawl budget matters when the number of crawlable URLs significantly exceeds the number of canonical/indexable URLs. A 500-page editorial site with no facets has no crawl-budget problem; a 500-product Shopify store with 12 filter dimensions can expose tens of thousands of crawlable URLs and starve important pages of recrawl frequency.
The 4 small-site crawl-waste sources
1. Faceted navigation (filter combinations create combinatorial URLs).
2. Internal search results pages, indexable by default.
3. Tag and category archives auto-generated by the CMS.
4. Parameter URLs from session IDs, sort orders, and tracking.

**The combined effect is often 10–50× the canonical URL count being crawled.** Each wasted crawl is a missed recrawl on a page that matters.
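To see how quickly facets explode, here's a rough back-of-the-envelope sketch in Python. The product count, dimension count, and values per dimension are hypothetical stand-ins for the 500-product example above.

```python
from math import comb

# Hypothetical store: 500 products, 12 filter dimensions, ~6 values per dimension.
# If the faceted nav lets crawlers reach URLs combining up to 3 filters,
# the number of distinct filter-combination URLs grows combinatorially.
PRODUCTS = 500
DIMENSIONS = 12
VALUES_PER_DIMENSION = 6
MAX_COMBINED_FILTERS = 3

facet_urls = sum(
    comb(DIMENSIONS, k) * VALUES_PER_DIMENSION ** k  # choose k dimensions, one value each
    for k in range(1, MAX_COMBINED_FILTERS + 1)
)

print(f"Canonical product URLs:       {PRODUCTS}")
print(f"Crawlable facet-URL variants: {facet_urls}")  # ~50,000 with these assumptions
```

With these assumed numbers, the facet variants alone land in the tens of thousands, which is why a small catalog can still have a crawl-budget problem.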
How to find waste in Search Console
Use Search Console > Indexing > Pages to find 'Crawled — currently not indexed' and 'Discovered — currently not indexed'. Pull at least 1,000 URLs into a spreadsheet and classify them by pattern (?filter=, /tag/, /search?q=). A heavy concentration of one pattern marks your highest-value crawl-budget fix. Cross-check against server logs for actual crawl frequency per URL pattern.
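If the export is too large to eyeball, a short script can do the classification. A minimal sketch, assuming a CSV export named crawled_not_indexed.csv with a URL column; the regex patterns are illustrative and should be swapped for the parameters your own stack actually generates.

```python
import csv
import re
from collections import Counter

# Illustrative URL patterns; replace with your site's real facet/tag/search patterns.
PATTERNS = {
    "faceted_filter":  re.compile(r"[?&](filter|color|size|price)="),
    "internal_search": re.compile(r"/search\b|[?&]q="),
    "tag_archive":     re.compile(r"/tag/"),
    "sort_or_session": re.compile(r"[?&](sort|sessionid|sid)="),
}

def classify(url: str) -> str:
    for name, pattern in PATTERNS.items():
        if pattern.search(url):
            return name
    return "other"

counts = Counter()
with open("crawled_not_indexed.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[classify(row["URL"])] += 1

# The bucket with the biggest share is usually the highest-value fix.
for bucket, n in counts.most_common():
    print(f"{bucket:>16}: {n}")
```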
The robots.txt + noindex + canonical playbook
Robots.txt blocks crawling entirely (use for /search, session-ID URLs, internal admin paths). Noindex+follow keeps the page in the link graph but removes it from the index (use for tag archives, low-value filter combinations). Canonical consolidates near-duplicates onto a primary URL (use for sort orders, currency variants, A/B test URLs). The wrong tool produces the wrong result — match technique to intent.
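As a sketch of what matching technique to intent looks like in practice, here are illustrative examples of each tool; the paths, parameter names, and URLs are hypothetical and need to be adapted to your own URL structure.

```text
# robots.txt: block crawling entirely for URLs with no SEO value
User-agent: *
Disallow: /search
Disallow: /*?sessionid=
Disallow: /admin/
```

```html
<!-- Tag archive: stays in the link graph but drops out of the index -->
<meta name="robots" content="noindex, follow">

<!-- Sort-order variant: consolidates signals onto the primary category URL -->
<link rel="canonical" href="https://example.com/collections/shoes">
```

Note that robots.txt and noindex don't combine: a blocked URL can't be crawled, so Google never sees the noindex on it.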
Sitemaps as crawl prioritization
Submit XML sitemaps containing only the URLs you want crawled and indexed. Update lastmod accurately (not just on every deploy). Split into <50K-URL files grouped by content type (products, blog, locations) — Google reports indexation per sitemap, making diagnosis 10× faster than a single mega-sitemap.
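For illustration, a sitemap index split by content type might look like the following; the filenames, domain, and lastmod dates are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One child sitemap per content type, so Search Console reports indexation per group -->
  <sitemap>
    <loc>https://example.com/sitemaps/products.xml</loc>
    <lastmod>2026-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/blog.xml</loc>
    <lastmod>2026-01-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/locations.xml</loc>
    <lastmod>2025-12-02</lastmod>
  </sitemap>
</sitemapindex>
```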
Measuring the fix
Compare 30-day crawl stats in Search Console before and after the cleanup. Successful crawl-budget fixes show: total crawl requests stable or down, crawl frequency on important URLs up, 'crawled — not indexed' count down, and faster indexation of new pages. **A well-executed crawl-budget cleanup typically halves crawl waste and doubles recrawl frequency on canonical pages within 30–60 days.**
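Search Console's Crawl Stats report gives the aggregate picture; server logs give the per-pattern picture. A minimal sketch, assuming two access-log samples already filtered to verified Googlebot requests (hypothetical filenames and patterns), comparing canonical vs. waste crawl volume before and after:

```python
import re
from collections import Counter

# Standard combined-log request line, e.g. "GET /tag/sale?sort=price HTTP/1.1"
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP')

# Illustrative waste patterns; replace with the ones found in your GSC audit.
WASTE = re.compile(r"[?&](filter|sort|sessionid)=|/search\b|/tag/")

def crawl_profile(path: str) -> Counter:
    counts = Counter()
    with open(path) as f:
        for line in f:
            m = LOG_LINE.search(line)
            if not m:
                continue
            bucket = "waste" if WASTE.search(m.group("path")) else "canonical"
            counts[bucket] += 1
    return counts

before = crawl_profile("googlebot_before.log")
after = crawl_profile("googlebot_after.log")

for bucket in ("canonical", "waste"):
    print(f"{bucket:>9}: {before[bucket]:>7} -> {after[bucket]:>7}")
```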
Frequently asked
Will noindexing a large share of my site hurt rankings?
No — noindex is the correct tool for low-value pages. The risk is noindexing pages that should rank; audit before mass-applying.
Should I use noindex or robots.txt for crawl waste?
Noindex+follow if the page links to canonical content worth crawling; robots.txt if the URLs have no SEO value at all (internal search, session URLs).
Can a CDN or bot-protection firewall affect crawl budget?
Indirectly — aggressive bot protection can throttle Googlebot. Verify Googlebot isn't being challenged in your security logs; whitelist verified Googlebot IP ranges.
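For the verification step, a minimal sketch of Google's documented reverse-then-forward DNS check; the sample IP is for illustration only.

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Claimed Googlebot IP must reverse-resolve to a googlebot.com/google.com host,
    and that host must forward-resolve back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse DNS
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]  # forward DNS must match
    except (socket.herror, socket.gaierror):
        return False

print(is_verified_googlebot("66.249.66.1"))  # an IP in a known Googlebot range
```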
How often should I re-run a crawl-budget audit?
Quarterly for stable sites; monthly during major changes (new features, product launches, redesigns). Crawl-budget regressions are easy to introduce and easy to miss without active monitoring.
Do AI crawlers respect robots.txt?
Most do as of 2026 — OpenAI, Anthropic, Perplexity all respect robots.txt directives. Block them only if you don't want training inclusion; for AEO presence, allow.
