Log File Analysis Without an Enterprise Tool: A Working SEO's Workflow
Log files reveal what Googlebot actually crawls — not what tools guess. Here's the no-budget workflow that uses server logs, the command line, and a spreadsheet to surface real crawl-budget waste.
Log file analysis is the only way to see what Googlebot actually crawls — not what Search Console reports (sampled), not what crawl tools simulate (synthetic). Enterprise tools (Botify, OnCrawl) cost $1,500+/month. Most SEOs don't have that budget. Here's the no-budget workflow that produces 80% of the value.
Table of contents
1. What log file analysis answers · 2. Getting raw access logs (3 paths) · 3. Filtering verified Googlebot · 4. The 5 questions to answer first · 5. Spreadsheet workflow · 6. What to fix based on findings · 7. FAQ
What does log file analysis answer?
Log files answer: which URLs Googlebot actually crawls, how often, with what response codes, in what order, and how much budget is wasted on non-canonical URLs. Search Console crawl stats are sampled aggregates; logs are the source of truth. **No other data source tells you which of your 10,000 URLs are getting recrawled monthly vs annually.**
Getting raw access logs (3 paths)
1. Cloudflare Logpush (Enterprise plan, or paid add-on). 2. Origin server logs from cPanel, Plesk, or direct SSH. 3. CDN logs from Vercel, Netlify, AWS CloudFront — most modern hosts expose them in the dashboard. Pull at minimum 30 days. Format varies (Apache combined, NGINX, Cloudflare CSV) — all are parseable.
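If you go the SSH route, rotated logs usually need stitching together first. A minimal sketch, assuming the default Apache + logrotate layout (the path, filenames, and rotation scheme all vary by host):

```bash
# Combine the live log, the most recent rotation, and the older gzipped
# rotations into one working file covering roughly 30 days.
cat /var/log/apache2/access.log /var/log/apache2/access.log.1 > combined.log
zcat /var/log/apache2/access.log.*.gz >> combined.log

head -2 combined.log   # eyeball the format before writing any parser
wc -l combined.log     # sanity-check the volume
```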
Filtering verified Googlebot
Spoofed user agents are common. Verify by reverse-DNS lookup against googlebot.com or google.com. The clean shortcut: download Google's published IP ranges (googlebot.json), filter logs against them. Skip user-agent-only filtering — half the 'Googlebot' traffic in raw logs is fake.
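A minimal shell sketch of that verification, assuming Apache combined format (client IP in field 1) and the `host` utility from dnsutils. Note that Google's documented procedure also forward-resolves the returned hostname back to the original IP, which this spot-check omits; the filenames here are placeholders carried through the later examples:

```bash
# Pull requests whose user agent claims Googlebot, then reverse-DNS each
# unique IP and keep only those resolving to googlebot.com or google.com.
grep -i 'googlebot' access.log > googlebot-claimed.log

awk '{print $1}' googlebot-claimed.log | sort -u | while read -r ip; do
  # Genuine Googlebot IPs reverse-resolve to *.googlebot.com / *.google.com
  if host "$ip" | grep -qE '\.(googlebot|google)\.com\.?$'; then
    echo "$ip"
  fi
done > verified-ips.txt

# Keep only log lines whose field 1 exactly matches a verified IP
awk 'NR==FNR { ok[$1]; next } $1 in ok' verified-ips.txt googlebot-claimed.log \
  > googlebot-verified.log
```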
The 5 questions to answer first
1. What % of crawls hit canonical/indexable URLs? (target: >70%) 2. How many distinct URLs were crawled in 30 days vs total indexable URLs? (gap = recrawl-frequency problem) 3. Which URL patterns generate the most 4xx/5xx? 4. Which top-traffic pages haven't been crawled in 30+ days? 5. What % of crawls hit JS/CSS/image assets vs HTML? (high asset ratio = fix performance, not SEO).
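Assuming the same combined format (URL path in field 7, status code in field 9) and the `googlebot-verified.log` file from the previous step, a few one-liners rough out answers to questions 2, 3, and 5 before you touch a spreadsheet:

```bash
# Q2: distinct URLs crawled in the window, query strings included
# (compare against your indexable URL count)
awk '{print $7}' googlebot-verified.log | sort -u | wc -l

# Q3: response code distribution; follow up by grepping a specific code
awk '{print $9}' googlebot-verified.log | sort | uniq -c | sort -rn

# Q5: asset hits vs everything else (extend the extension list to your stack)
awk '$7 ~ /\.(js|css|png|jpe?g|gif|svg|webp|woff2?)(\?|$)/ { a++ }
     END { printf "assets: %d  other: %d\n", a, NR - a }' googlebot-verified.log
```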
Spreadsheet workflow
Open the log file in Excel or Google Sheets; if it's too large, pre-filter it to CSV with grep + awk first (sketch below). Columns: timestamp, IP, URL, status, user-agent. Add columns: URL pattern (extracted via regex), canonical Y/N, important Y/N. Build a pivot table by URL pattern to see crawl distribution. **For sites under 100K URLs, this workflow takes 2–3 hours and produces actionable findings comparable to enterprise tools.**
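Here's a sketch of that pre-filter step, under the same combined-format assumption; it reduces the verified log to exactly the five columns above as a CSV ready for import:

```bash
# Emit timestamp, IP, URL, status, user-agent as CSV. Splitting on double
# quotes isolates the request line ($2) and the user agent ($6).
awk -F'"' '{
  split($1, pre, " ")          # pre[1]=IP, pre[4]=[timestamp
  split($2, req, " ")          # req[1]=method, req[2]=URL path
  split($3, post, " ")         # post[1]=status, post[2]=bytes
  gsub(/^\[/, "", pre[4])      # strip the leading bracket from the timestamp
  print pre[4] "," pre[1] "," req[2] "," post[1] ",\"" $6 "\""
}' googlebot-verified.log > crawl-data.csv
```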
What to fix based on findings
Wasted crawl on non-canonical URLs → robots.txt block or canonical fix. Important pages with stale recrawl → improve internal linking, sitemap inclusion, freshness signals. High 4xx rate → fix or redirect. Asset crawl dominating HTML crawl → consolidate assets, add cache headers, reduce sitewide asset count. Each finding maps to a specific technical SEO fix.
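For the first of those fixes, a hypothetical robots.txt fragment; the patterns below are placeholders, so substitute whatever parameter or facet patterns your own pivot table flagged as wasted crawl:

```
User-agent: *
# Block the parameter URLs the logs showed Googlebot looping through
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=
```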
Frequently asked
**Can I do this on shared hosting?** Often yes — most cPanel hosts expose access logs. If your host doesn't, Cloudflare's free tier provides analytics (not full logs); paid Cloudflare plans expose Logpush.
**How many days of logs do I need?** Minimum 30 days for meaningful patterns. 90 days is better for seasonal sites. Beyond 90 days, returns diminish — crawl patterns shift fast.
**What if Googlebot barely crawls my site at all?** That's a symptom of low authority + thin internal linking. Improve both before optimizing crawl-budget allocation; you don't have a budget problem yet — you have a discovery problem.
**What cheap or free tools should I use?** Screaming Frog Log File Analyzer (£139/year, not free but cheap), GoAccess (free, terminal-based), basic Excel pivots (free, sufficient for <500K rows). Skip enterprise tools until your site exceeds 500K URLs.
**Can I track AI crawlers in logs too?** Yes — GPTBot, ClaudeBot, and PerplexityBot all appear in logs under their own user agents (Google-Extended is a robots.txt control token, not a separate crawler, so its fetches show up as regular Googlebot). Tracking their crawl patterns reveals which AI engines actually index your site, informing AEO strategy.
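A quick substring count per AI crawler, assuming the same raw `access.log`; user agents can be spoofed here too, and several of these vendors publish IP ranges if you need real verification:

```bash
# Count hits per AI crawler user agent across the raw log
grep -oE 'GPTBot|ClaudeBot|PerplexityBot' access.log | sort | uniq -c | sort -rn
```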
