Log File Analysis Without an Enterprise Tool: A Working SEO's Workflow
Log files reveal what Googlebot actually crawls — not what tools guess. Here's the no-budget workflow that uses server logs, the command line, and a spreadsheet to surface real crawl-budget waste.
Log file analysis is the only way to see what Googlebot actually crawls — not what Search Console reports (sampled), not what crawl tools simulate (synthetic). Enterprise tools (Botify, OnCrawl) cost $1,500+/month. Most SEOs don't have that budget. Here's the no-budget workflow that produces 80% of the value.
Table of contents
1. What log file analysis answers · 2. Getting raw access logs (3 paths) · 3. Filtering verified Googlebot · 4. The 5 questions to answer first · 5. Spreadsheet workflow · 6. What to fix based on findings · 7. FAQ
What does log file analysis answer?
Log files answer: which URLs Googlebot actually crawls, how often, with what response codes, in what order, and how much budget is wasted on non-canonical URLs. Search Console crawl stats are sampled aggregates; logs are the source of truth. **No other data source tells you which of your 10,000 URLs are getting recrawled monthly vs annually.**
Getting raw access logs (3 paths)
1. Cloudflare Logpush (Enterprise plan, or paid add-on). 2. Origin server logs from cPanel, Plesk, or direct SSH. 3. CDN logs from Vercel, Netlify, AWS CloudFront — most modern hosts expose them in the dashboard. Pull at minimum 30 days. Format varies (Apache combined, NGINX, Cloudflare CSV) — all are parseable.
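If you go the SSH route, rotated logs usually need stitching together first. A minimal sketch, assuming the default Apache + logrotate layout (the path, filenames, and rotation scheme all vary by host):

```bash
# Combine the live log, the most recent rotation, and the older gzipped
# rotations into one working file covering roughly 30 days.
cat /var/log/apache2/access.log /var/log/apache2/access.log.1 > combined.log
zcat /var/log/apache2/access.log.*.gz >> combined.log

head -2 combined.log   # eyeball the format before writing any parser
wc -l combined.log     # sanity-check the volume
```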
Filtering verified Googlebot
Spoofed user agents are common. Verify by reverse-DNS lookup against googlebot.com or google.com. The clean shortcut: download Google's published IP ranges (googlebot.json), filter logs against them. Skip user-agent-only filtering — half the 'Googlebot' traffic in raw logs is fake.
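A minimal shell sketch of that verification, assuming Apache combined format (client IP in field 1) and the `host` utility from dnsutils. Note that Google's documented procedure also forward-resolves the returned hostname back to the original IP, which this spot-check omits; the filenames here are placeholders carried through the later examples:

```bash
# Pull requests whose user agent claims Googlebot, then reverse-DNS each
# unique IP and keep only those resolving to googlebot.com or google.com.
grep -i 'googlebot' access.log > googlebot-claimed.log

awk '{print $1}' googlebot-claimed.log | sort -u | while read -r ip; do
  # Genuine Googlebot IPs reverse-resolve to *.googlebot.com / *.google.com
  if host "$ip" | grep -qE '\.(googlebot|google)\.com\.?$'; then
    echo "$ip"
  fi
done > verified-ips.txt

# Keep only log lines whose field 1 exactly matches a verified IP
awk 'NR==FNR { ok[$1]; next } $1 in ok' verified-ips.txt googlebot-claimed.log \
  > googlebot-verified.log
```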
The 5 questions to answer first
1. What % of crawls hit canonical/indexable URLs? (target: >70%) 2. How many distinct URLs were crawled in 30 days vs total indexable URLs? (gap = recrawl-frequency problem) 3. Which URL patterns generate the most 4xx/5xx? 4. Which top-traffic pages haven't been crawled in 30+ days? 5. What % of crawls hit JS/CSS/image assets vs HTML? (high asset ratio = fix performance, not SEO).
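Assuming the same combined format (URL path in field 7, status code in field 9) and the `googlebot-verified.log` file from the previous step, a few one-liners rough out answers to questions 2, 3, and 5 before you touch a spreadsheet:

```bash
# Q2: distinct URLs crawled in the window, query strings included
# (compare against your indexable URL count)
awk '{print $7}' googlebot-verified.log | sort -u | wc -l

# Q3: response code distribution; follow up by grepping a specific code
awk '{print $9}' googlebot-verified.log | sort | uniq -c | sort -rn

# Q5: asset hits vs everything else (extend the extension list to your stack)
awk '$7 ~ /\.(js|css|png|jpe?g|gif|svg|webp|woff2?)(\?|$)/ { a++ }
     END { printf "assets: %d  other: %d\n", a, NR - a }' googlebot-verified.log
```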
Spreadsheet workflow
Open the log file in Excel or Google Sheets; if it's too large, pre-filter it to CSV with grep + awk first (sketch below). Columns: timestamp, IP, URL, status, user-agent. Add columns: URL pattern (extracted via regex), canonical Y/N, important Y/N. Build a pivot table by URL pattern to see crawl distribution. **For sites under 100K URLs, this workflow takes 2–3 hours and produces actionable findings comparable to enterprise tools.**
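Here's a sketch of that pre-filter step, under the same combined-format assumption; it reduces the verified log to exactly the five columns above as a CSV ready for import:

```bash
# Emit timestamp, IP, URL, status, user-agent as CSV. Splitting on double
# quotes isolates the request line ($2) and the user agent ($6).
awk -F'"' '{
  split($1, pre, " ")          # pre[1]=IP, pre[4]=[timestamp
  split($2, req, " ")          # req[1]=method, req[2]=URL path
  split($3, post, " ")         # post[1]=status, post[2]=bytes
  gsub(/^\[/, "", pre[4])      # strip the leading bracket from the timestamp
  print pre[4] "," pre[1] "," req[2] "," post[1] ",\"" $6 "\""
}' googlebot-verified.log > crawl-data.csv
```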
What to fix based on findings
Wasted crawl on non-canonical URLs → robots.txt block or canonical fix. Important pages with stale recrawl → improve internal linking, sitemap inclusion, freshness signals. High 4xx rate → fix or redirect. Asset crawl dominating HTML crawl → consolidate assets, add cache headers, reduce sitewide asset count. Each finding maps to a specific technical SEO fix.
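For the first of those fixes, a hypothetical robots.txt fragment; the patterns below are placeholders, so substitute whatever parameter or facet patterns your own pivot table flagged as wasted crawl:

```
User-agent: *
# Block the parameter URLs the logs showed Googlebot looping through
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=
```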
Frequently asked
**Can I do this on shared hosting?** Often yes — most cPanel hosts expose access logs. If your host doesn't, Cloudflare's free tier provides analytics (not full logs); paid Cloudflare plans expose Logpush.
**How many days of logs do I need?** Minimum 30 days for meaningful patterns. 90 days is better for seasonal sites. Beyond 90 days, returns diminish — crawl patterns shift fast.
**What if Googlebot barely crawls my site at all?** That's a symptom of low authority + thin internal linking. Improve both before optimizing crawl-budget allocation; you don't have a budget problem yet — you have a discovery problem.
**What cheap or free tools should I use?** Screaming Frog Log File Analyzer (£139/year, not free but cheap), GoAccess (free, terminal-based), basic Excel pivots (free, sufficient for <500K rows). Skip enterprise tools until your site exceeds 500K URLs.
**Can I track AI crawlers in logs too?** Yes — GPTBot, ClaudeBot, and PerplexityBot all appear in logs under their own user agents (Google-Extended is a robots.txt control token, not a separate crawler, so its fetches show up as regular Googlebot). Tracking their crawl patterns reveals which AI engines actually index your site, informing AEO strategy.
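A quick substring count per AI crawler, assuming the same raw `access.log`; user agents can be spoofed here too, and several of these vendors publish IP ranges if you need real verification:

```bash
# Count hits per AI crawler user agent across the raw log
grep -oE 'GPTBot|ClaudeBot|PerplexityBot' access.log | sort | uniq -c | sort -rn
```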
