Server Log File Analysis For SEO: The Complete Guide
Log file analysis is often an overlooked element of SEO (probably because it seems really technical), yet it can provide huge value to your site and highlight SEO opportunities you would’ve never found before.
Log file analysis is one of the best things you could ever learn to improve your SEO.
It’s important to remember that log file data is more accurate than Google Analytics, which means you can pull some really interesting insights from your analysis.
It’s not even that technical of a job once you know what you’re doing, as you’ll find out below.
Table Of Contents
This guide will teach you:
What a server log file is and why they matter
The best log file analysis tools for SEO
How to access your log files (and clean the data)
Log file analysis checks that will improve your SEO
What Is A Server Log File?
A web server will keep a log of every action it takes and record all requests coming into the site.
These requests could be users visiting the site, Googlebot crawling the site or any kind of resource request whether it’s from a user or a machine.
These records are called log files and they contain some really useful information. You can essentially see exactly how users and bots are accessing your website as well as how your site is responding to visitors.
Whenever a user or Googlebot (or anything) accesses your website, a line is recorded in your server logs, before being stored away by the server.
Remember that your server will keep a log file of all activity. That means every hit and every Googlebot crawl. If you have a large website, that’s a lot of data to analyse.
The term log file refers to a collection of log entries that are usually packaged up by your server on a daily basis. You can export these log files and view all activity for a particular day, this could be thousands of lines if you have a large site.
Why Are Log Files Important For SEO?
Log files provide actionable data that’s much more accurate than your standard crawling and analytics tools.
It’s real data that provides an invaluable insight into how search engines are accessing your website. You won’t find crawl data this useful anywhere else.
There are so many SEO use cases for server log files. You can really gain a competitive advantage by learning the techniques below and not only for your websites, but for your career too.
Logs can’t lie!
There are also uses for server log files outside of SEO, including:
Quality assurance: Reviewing an application for errors or bugs that are causing poor performance
Website security: Whether it’s locating attempted hacking attempts or simply reviewing who is trying to access your website
System errors: Locate any errors within your IT infrastructure
UX: Review how users are interacting with your applications and locate the data behind errors reported by customers
Compliance: Ensuring your business keeps its data safe and follows best practices such as GDPR
Where To Find Your Server Log Files
Different servers and hosts will manage their log files differently so you may need to do a little digging.
All log files are found in the backend of your server so you’ll need access to this if you plan on pulling the files yourself.
The most common server types are Apache, NGINX and IIS, and each has its own documentation on accessing server log files.
But even with the info above, you may run into issues. For example, you may use CDNs like Cloudflare which means you’ll need to gain access to your Cloudflare log files and combine the data for a complete picture.
Some hosting providers will allow you to download log files directly from your cPanel which does make things easier.
Most servers package up log files based on a day’s worth of data, which obviously won’t provide a complete picture. Depending on the size of your site, you’ll want to download 1 to 4 weeks’ worth of data.
You may find that logging is disabled on your server which means no historical data will be available – enable it asap!
If you’re working with a client or you’re an in-house SEO, you should be able to request log file data from the dev team who will have access to the server.
Some clients or stakeholders may not understand why you require server logs and may hesitate to provide access due to data protection concerns. You may need to do some educating and reassuring here.
Also, some servers may be set up to wipe server logs after a short amount of time, sometimes on a daily basis. You’ll need to speak to the devs and see if a longer retention time can be implemented.
How To Read A Log File
Log files are scary to look at.
They contain lots of data, some of which you never need to look at, and it can get a little overwhelming when you’re dealing with a long list of entries.
You should know that not all log files use the same structure or include the same data. It’s all dependent on your website’s configuration and server setup.
You may even want to tweak your server logs so they pull more useful data points.
There are a number of common attributes found within a log file which I’ll explain below.
- Client/Visitor IP – The IP address of the user or bot submitting the request.
- Server IP – The IP address of your server.
- Timestamp – The date and time of the request.
- HTTP status code – The status code returned by the server (200, 404, 301 etc.)
- Requested URL – Which URL the visitor has requested.
- Method (GET/POST) – The method refers to the type of request from the visitor. GET could be a simple URL or resource request whereas POST would be an action such as submitting a blog comment.
- Referrer – Which external website a user has arrived from (similar to the referral channel in GA).
- Browser user-agent – What browser type and version is being used by the visitor.
- Bytes downloaded – The size (in bytes) of the requested resource.
- Time taken – The time taken to load the requested resource from server to client.
Not all server log configurations will be tracking every data point mentioned above. Talk to your devs if you need extra data points added to your logs.
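To make the attributes above concrete, here’s a minimal sketch (in Python) of parsing a log line in the common Apache/NGINX “combined” format. The sample line and field names are illustrative assumptions; your server’s format may differ:

```python
import re

# Apache/NGINX "combined" format: client IP, identity, user, timestamp,
# request line, status code, bytes, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of the attributes above, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# A made-up example entry for a Googlebot hit.
sample = ('66.249.66.1 - - [10/Mar/2023:10:15:32 +0000] '
          '"GET /blog/log-file-analysis/ HTTP/1.1" 200 5123 '
          '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
entry = parse_line(sample)
```

Every log analysis tool below is essentially doing this parsing step for you at scale.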
How Do Server Logs Work?
A URL contains three basic components that are processed by your browser.
https:// = Protocol
lewischaffey.co.uk = Server name
server-log-file-analysis.html = File location
The server name corresponds to an IP address, which the web browser uses to establish a connection with your web server and locate the requested file.
To serve the requested file, an HTTP GET request is sent from the browser to the web server, and the file is returned to the browser in HTML format. This is then processed and displayed by the browser, resulting in the final page appearing on your screen.
It’s these HTTP GET requests that are stored as a hit by your server.
What Is Crawl Budget Optimisation And How Does It Work?
You want Google to crawl your website regularly to ensure new pages are indexed quickly and any changes to old pages are being picked up and factored into the search results.
Crawl budget = The number of pages Googlebot crawls and indexes upon a visit to your website (before leaving and moving onto a different website).
Crawl rate = How many requests Googlebot makes to your site when it is crawling it: for example, 5 requests per second.
Essentially, the more established and high-authority your website is, the more time Google will spend crawling your site.
The goal of crawl budget optimisation is to ensure Google is crawling all of the important pages of your website regularly while avoiding old, outdated and irrelevant URLs.
You can then use crawl budget data to optimise your website from the ground up.
If a certain blog post is getting crawled regularly, for example, you might want to add some internal links to that post directing to your product pages.
Google specifically calls out faceted navigations as a common source of crawl budget waste. Here’s an example: the filterable menu on the left-hand side of Nike’s category page for football boots is a faceted navigation.
I once worked with a large ecommerce store that used a faceted navigation for their category pages, just like the above.
These filters created tonnes of duplicate URLs that Googlebot would spend all of its time crawling, before leaving the site.
This meant that important pages of the site were being missed entirely. Other categories, high-margin products etc.
Crawl budget optimisation will reveal issues like this and there are plenty of ways for you to fix them.
More on that stuff later…
Why You Can’t Trust Google Analytics And Crawler Data
The number one advantage of log file data is its accuracy when compared to analytics tools like GA.
Crawler data also doesn’t show the complete picture.
Crawlers will usually conduct a crawl from the homepage or sitemap down, following all internal links. Googlebot behaves differently.
Firstly, Googlebot will discover your pages from external sources. For example, I’ve seen Googlebot hit one of my blog post URLs that used a Twitter UTM tag because the post was shared multiple times on Twitter. A standard crawler would never discover that URL.
Secondly, Googlebot has its own priorities. Your blog will often be crawled much more frequently than your product or service pages, because Google understands where new content is published most often and therefore where it needs to crawl most often.
You’ll also see the real 404 and 302 response codes that Googlebot is hitting during a crawl. I’ve seen Googlebot crawl pages that haven’t existed for 6+ years. You may find 404 and 302 issues that a standard crawler will never pick up.
(By the way, I’m not dissing crawlers – they are essential SEO tools! It’s just important to understand their limits).
How To Conduct Log File Analysis (Best Log File Tools)
A breakdown of the best log file analysis tools you can use to generate insights, from Excel to Screaming Frog to ELK.
Once you’ve pulled a few days worth of log file data and combined them into a single file or folder, you’re ready to begin your analysis.
You can use Google Sheets or Excel to analyse log files but this is the more manual approach. Spreadsheets aren’t going to highlight any insights by themselves so you need to be comfortable with formulas and pivot tables if you go down this route.
If you would like to use Excel, it might be best to analyse only a few days’ worth of data; it depends on how many rows your machine can handle. Here’s a good article to help you with this approach.
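If your exports are too large for a spreadsheet, the same pivot-table-style counting can be done with a short script. A minimal sketch, assuming log lines have already been parsed into dicts (the entries here are made up for illustration):

```python
from collections import Counter

def status_pivot(entries):
    """Pivot-table-style count of hits per (URL, status code) pair."""
    return Counter((e["url"], e["status"]) for e in entries)

# Hypothetical pre-parsed log entries.
entries = [
    {"url": "/", "status": "200"},
    {"url": "/old-page/", "status": "404"},
    {"url": "/old-page/", "status": "404"},
    {"url": "/blog/", "status": "200"},
]
pivot = status_pivot(entries)

# Most-hit 404s first: the same view a spreadsheet pivot table would give you.
broken = sorted((k for k in pivot if k[1] == "404"), key=lambda k: -pivot[k])
```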
Listed below are some of the best tools you can use for log file analysis.
#1: Screaming Frog Log File Analyzer
In classic Screaming Frog fashion, this is one of the best and easiest-to-use log file analysers out there.
You will get instant log file insights from this tool and the data will be nicely condensed so even marketers with a less technical background can understand and make use of the data.
Simply drag and drop your log files into the tool and Screaming Frog will process the files (depending on what data is included) to provide insights such as:
- Which bots crawl your site and how often
- How many pages those bots crawl
- Which HTTP statuses are being returned
You can then dig deeper and begin exporting pages that require action. For example, you can sort the data by status code, filter the list down to 404 errors and then go away and implement redirects for these pages (where necessary).
You may want to delve into pages that are showing slow response times and identify the root cause of these issues.
The real magic comes when you play with URL data. Screaming Frog allows you to import a set of URL data and compare this against your log file data.
This allows you to:
- Identify URLs that appear in the URL data but not in the server log files. These pages aren’t being crawled, which could be a problem if they’re important resources.
- Identify URLs that appear in the server log files but not in the URL data. This could highlight long-lost orphaned pages or reveal pages that aren’t in your sitemap but should be.
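The comparison itself is just a set difference, so you can also sketch it yourself. The URL lists below are hypothetical:

```python
# Hypothetical URL lists: one from a site crawl / sitemap export,
# one extracted from the server log files.
crawl_urls = {"/", "/products/", "/blog/new-post/", "/about/"}
log_urls = {"/", "/products/", "/about/", "/2015-promo/"}

# In the crawl data but never requested: pages search engines may be ignoring.
not_crawled = crawl_urls - log_urls

# In the logs but not in the crawl data: possible orphaned pages.
orphan_candidates = log_urls - crawl_urls
```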
As you can imagine, this now opens the tool up to lots of potential uses:
- Identify how often important pages are being crawled
- Understand if all of your sitemap URLs are being crawled
- Reveal new pages that need to be added to the sitemap
- Review how long onsite changes will likely take to be crawled and later indexed
- Identify any links that impact crawl rate
- Reveal which pages of your site are orphaned and not getting crawled
- Lots more…
#2: SEMrush Log File Analyzer
This is a very handy tool if you’re looking for quick insights that don’t require hours of digging into data. You’ll find plenty of quick wins for your websites/clients using this tool.
Simply upload your log files and SEMrush will handle the analysis. You’ll see a list of all the pages included in the log file, filterable by response code, number of visits etc.
SEMrush will also show you the path that a bot has taken which can provide some really useful data. You might find bots are hitting backend plug-in pages or non-indexable URLs which would show new robots rules may be needed.
Look at the SEMrush Log Analyzer as a quick and easy way to organise your log file data. It makes everything filterable and easy to digest for you to work your SEO magic.
#3: JetOctopus Log File Analyzer
JetOctopus takes things to a new level.
Being a tool built with server log analysis in mind, JetOctopus is a real powerhouse in this space. The tool will combine multiple data sets including crawl data, log files, Search Console data and Google Analytics conversion data.
Something to keep in mind here is that Google probably knows more about your website than you do.
It holds historical data from your site going back years. By combining log files with a website crawl, you’ll have a more complete picture of your website than ever before.
What I love about this tool is how it visualises your data. This makes it much easier to explain to clients what value your server log file analysis has generated.
You’ll also see important stats that are always kept up to date for a high-level overview of your site whenever you log in, including:
- Bot crawl ratio
- Orphan page ratio
- Pages not visited by bot (%)
- Crawl budget allocated to non-indexable pages (%)
#4: OnCrawl Log File Analyzer
OnCrawl Log Analyzer sits between Screaming Frog and JetOctopus. It can take (and encourages) lots of log file data, making everything easily filterable (like Screaming Frog) while also providing some handy data visualisations (like JetOctopus).
OnCrawl actually encourages you to upload your log data on a regular basis so you end up with a tool which can paint a complete picture of how your site is being crawled, going back months and years if needed.
This results in some really handy charts and graphs which will showcase how your server logs have changed over time. This essentially shows crawl trends and allows you to spot positive and negative patterns early on.
Here is a handy video on how to set up OnCrawl and the kind of data you can expect back.
#5: Splunk

Now we’re getting technical.
Splunk (strange name I know) is a tool primarily used by devs to help structure complex, multi-line data.
It actually has a range of uses outside of SEO, including:
- IT operations
- System and application monitoring
- Business analytics
- Application security
- Compliance reviews
This is a more enterprise-level setup and won’t be necessary for most standard SEO work. Splunk essentially creates a centralised logging system that allows you to easily filter long lists of data via a simple UI.
You can configure the tool to automatically consolidate log data and this will be processed into:
- Searchable data strings
- Real-time monitoring
- Historical analytics and comparisons
- Custom alerts
- Dashboards and data visualisation
#6: The ELK stack
The ELK stack is an enterprise-level analytics tool that comprises three popular open-source projects:

- Elasticsearch
- Logstash
- Kibana
It’s all managed and kept up-to-date by Elastic (they’re definitely worth checking out).
ELK has become one of the world’s most popular log management systems thanks to its incredible capabilities. It will aggregate logs from all of your systems, analyse these logs, and create visualisations to support:
- Application and infrastructure monitoring
Some companies have actually built out dedicated services using the ELK platform as their backbone.
Update: Another tool, Beats, has been added to the ELK stack, making it an even more capable system.
It all works together like this: Beats and Logstash handle data collection, Elasticsearch indexes, stores and organises the data, and Kibana provides a clean UI for data visualisations and queries.
This is the Google Data Studio of log file analysis; it pulls in all of the data points you could ever need and allows you to manipulate the data to build custom reports for anything you need to assess.
The ELK stack deserves a whole article of its own and I couldn’t possibly do it justice here so I’ll send you over to this detailed guide from Logz.io.
#7: Google Sheets / Excel
If you’re comfortable with formulas and you really want to dig into the data, Google Sheets or Excel will be essential.
Even if you use one of the other tool options, you’re probably eventually going to want to import your data into a spreadsheet so you can compare logs against Analytics and Search Console data.
You may even have conversion metrics or website KPIs/goals that you want to align log file data with.
It will take time to filter, segment and organise all of your data but it’s usually worth the time investment.
Extra Tip: Always Check Your Log File Data Is Correct (It’s Easy To Be Tricked)
While server log data is probably the most accurate website data you could ever use, there are a few common mistakes to watch out for.
Pretend Googlebots – You may stumble across visits from bots that spoof Googlebot’s user agent, meaning their crawl data is not very useful. These bots may be sent from SEO tools by yourself or competitors, and you don’t want them polluting your data.
(The Screaming Frog Log Analyzer has a built-in function that verifies bots to ensure they are who they say they are, which is pretty handy).
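If you want to verify bots yourself, Google’s documented method is a reverse-DNS check: look up the hostname for the requesting IP, confirm it belongs to a Google domain, then forward-resolve that hostname back to the same IP. A minimal sketch (the network lookups will only succeed for genuine Googlebot IPs):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_hostname(hostname):
    """Check whether a reverse-DNS hostname belongs to a Google domain."""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip):
    """Reverse-DNS the IP, confirm the hostname is a Google domain,
    then forward-resolve it back to the same IP (requires network access)."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not is_google_hostname(hostname):
        return False
    return ip in socket.gethostbyname_ex(hostname)[2]
```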
Load balancing – Some websites will split traffic amongst a group of servers to maintain strong performance and reduce the load on one specific server. This is called load balancing. A setup like this means that log files will be split across multiple servers and you need to combine that data to get a full picture.
CDN data – A similar issue is the use of content delivery networks. A system like Cloudflare will serve your website across the globe using multiple server locations, so you’ll need to get hold of these logs as well.
Don’t make assumptions based on false data!
If you can, always verify the log data you’re analysing to ensure it’s 100% valid.
How To Use Server Log Files To Improve Your SEO
Below we’ll cover the different checks you can perform via log file analysis to improve a website’s SEO.
Uses of server log files for SEO can generally be whittled down to three categories:
- Crawl budget optimisation
- Crawl behaviour analysis
- Site-wide SEO health
Listed below are some of the most important log file checks you can perform from each of these categories.
Analyse Crawler Behaviour
Simply looking for patterns in your crawl data at a high level is a good way to get an overview of crawler behaviour. You can get an idea of how and why Google is crawling your site, complementing any of your future SEO efforts.
You can view whether your site is being crawled by the desktop or mobile version of Googlebot (although at this point, it’s probably going to be mobile).
Something I like to do is work out the total percentage of crawls per page type. So I’ll calculate how many crawls each subfolder is receiving and compare these stats to unearth the most valuable areas of my site (from Google’s point of view).
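Here’s a rough sketch of that subfolder calculation, using made-up URLs:

```python
from collections import Counter

def crawl_share(urls):
    """Percentage of crawl hits per top-level subfolder."""
    counts = Counter("/" + u.lstrip("/").split("/")[0] for u in urls)
    total = sum(counts.values())
    return {folder: round(100 * n / total, 1) for folder, n in counts.items()}

# Hypothetical Googlebot-requested URLs pulled from a day's logs.
hits = ["/blog/post-a/", "/blog/post-b/", "/products/boots/", "/blog/post-c/"]
share = crawl_share(hits)
```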
Understand Crawl Volume
Log files will show the total number of requests made by a search engine, whether it’s Google, Bing, Yahoo or others.
You can compare these totals to see which search engines are crawling your site most regularly and, more alarmingly, which ones are barely visiting you at all.
This can reveal some issues for international sites in particular. If you want to be found in China but Baidu is almost never crawling your website, that’s a problem that needs dealing with.
Understand What Pages Crawlers Are Prioritising
Log files will reveal what areas and subfolders of your site are being crawled most regularly and which areas are being a little bit neglected.
You may find that Google is not crawling really important pages of your site such as product/service pages or pages that are regularly updated and need to be re-indexed.
For example, you may find that a certain blog category is only crawled once every 3 months whereas your site’s news section is crawled every few days. If you have short-term traffic goals for a new page, it would be much better off in the news subfolder.
If important pages aren’t being crawled enough, you should review:
- Crawl depth – how many steps away from the homepage are these URLs?
- Internal linking – are these pages internally linked from high-performing pages of your site?
- Speed and rendering time – is it taking too long for Google to process these pages?
Backlinks can also help manipulate crawl priority.
Last crawl date will be a data point included in your log files; use this to understand how quickly Google crawls a page after it has been pushed live.
Identify Crawl Budget Waste
If Google is hitting its crawl limit before discovering new pages that need to be indexed or crawling important areas of your site, this is going to need addressing.
You’ll almost always find crawl budget waste during log file analysis (it’s pretty normal).
This could be caused by improper pagination, ecommerce sites using navigation and parameters that create lots of unnecessary URLs or any number of other factors.
You can fix crawl budget wastage by:
- Implementing robots.txt rules
- Adding the nofollow attribute to certain internal links
- Adding noindex tags where necessary
- Amending URL structure
- Amending page depth
Identify 302 And 404 Response Codes
Server logs will flag any 404s that Googlebot is trying to access. Google will maintain more historical data for your website than any crawler so you may end up finding 404s for pages you never knew once existed!
Any 302 (temporary) redirects that haven’t been found by your crawler will also show up and will need updating to 301s in most cases. When reviewing response codes, ask:
- Are any pages with 3xx, 4xx or 5xx responses being visited frequently?
- Are pages with 3xx, 4xx or 5xx responses being prioritised over important pages?
- Are there any patterns to the response code errors that will make fixing them easier?
Identify Inconsistent Response Codes
A slightly more tricky issue to rectify is inconsistent response codes. This is when the same URL returns two or more different response codes when crawled over a particular time period.
There are a number of reasons why a URL may display different response codes:
- A 2xx page becomes a 5xx server error when the site experiences high traffic and fails to load the page due to limited capacity
- A 4xx page becomes a 2xx page because the broken link which caused the original error has since been fixed
- A 4xx page returns a 3xx response code as a redirect has been implemented for that error page
It’s always good to review inconsistent response codes just to ensure there are no serious issues under the hood of your site.
It’s worth noting that Screaming Frog Log Analyzer allows you to filter your data to only show pages with an inconsistent response.
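If your tool doesn’t have that filter, flagging inconsistent responses yourself is straightforward: group log entries by URL and keep any URL that returned more than one status code. A sketch with illustrative entries:

```python
from collections import defaultdict

def inconsistent_responses(entries):
    """Return URLs that returned more than one status code over the period."""
    seen = defaultdict(set)
    for e in entries:
        seen[e["url"]].add(e["status"])
    return {url: sorted(codes) for url, codes in seen.items() if len(codes) > 1}

# Made-up entries: /checkout/ failed under load on one of its hits.
entries = [
    {"url": "/checkout/", "status": "200"},
    {"url": "/checkout/", "status": "503"},
    {"url": "/about/", "status": "200"},
]
flagged = inconsistent_responses(entries)
```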
Duplicate Content And URL Parameter Issues
You may find search engines are crawling multiple versions of the same URL. This may be URLs with lots of parameters or even PPC landing pages that haven’t been removed from the index correctly.
This is wasted time that could be better spent on the crawling and indexing of new pages.
To fix duplicate content issues:
- Add a robots.txt rule to stop search engines crawling duplicate pages in the future
- Add a page-level noindex attribute to the duplicate URLs
- Add a canonical tag that points to the master page
- If the duplicate URLs are errors or no longer needed, implement redirects
- Specify the parameters you want Google to ignore in Search Console
Review Slow Pages
Log files are a great way to identify any pages that are slow to load and therefore offering a poor UX to both viewers and crawlers.
Using the TTFB metric (time to first byte) you can sort your data by pages that were the slowest for crawlers to process.
By optimising these pages you will speed up the time it takes for Google to crawl your site (improving crawl budget) and offer users a better experience, improving ranking potential.
When reviewing pages that were slow to load, see if you can notice any patterns:
- Are images slow to load?
- Are video embeds being used?
- Is there lots of additional CSS to load?
- Are there any interactive on-page features that need optimising?
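As a rough sketch, averaging the time-taken field per URL and sorting gives you the slow-page shortlist (field names and values here are illustrative):

```python
from collections import defaultdict

def slowest_pages(entries, top=3):
    """Average the 'time taken' field per URL and return the slowest pages."""
    timings = defaultdict(list)
    for e in entries:
        timings[e["url"]].append(e["time_ms"])
    averages = {url: sum(t) / len(t) for url, t in timings.items()}
    return sorted(averages.items(), key=lambda kv: -kv[1])[:top]

# Hypothetical entries: an image-heavy gallery page vs a lightweight page.
entries = [
    {"url": "/gallery/", "time_ms": 2400},
    {"url": "/gallery/", "time_ms": 2600},
    {"url": "/about/", "time_ms": 300},
]
worst = slowest_pages(entries)
```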
For a quick page speed test, run your pages through GTmetrix.
View Your Traffic Metrics
Google Analytics data is good but not exact, whereas log files don’t lie. You can use monthly log files to determine the number of visits each page is receiving, and sorting your pages by most visited can lead to some pretty useful insights.
Obviously this data won’t be split by channel or demographic, it’s simply tracking raw hits.
Block Scrapers And Competitors
By reviewing the user agents of the bots that are visiting your site, you can pretty quickly identify scrapers and crawlers that are spamming your website. If you’re seeing an SEO crawler that you don’t recognise it might even be a competitor snooping at your site!
Look for suspicious activity and analyse the behavior of any offending bots. If they’re up to no good then it’s worth blocking them from accessing your site.
Unfortunately, robots.txt rules won’t be enough here as these bots aren’t going to follow them. You’re better off blocking them directly via the .htaccess file.
Check If Google Is Crawling Pages It Shouldn’t
I like to make a list of URLs that have been made non-indexable because we don’t want them appearing in search. I also include any disallowed URL groups from the robots file in this list.
You can then compare this data with the list of URLs from your logs that have been requested by a crawler. If any disallowed pages or subfolders are still being crawled, this is worth looking into.
Google may consider these pages valuable (warranting the removal of any blockers) or there may be crawl efficiency issues to resolve.
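One way to run this comparison is with Python’s built-in robots.txt parser: feed it your disallow rules, then check each URL from your logs against them. The rules and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for the site.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /cart/",
])

# Hypothetical URLs Googlebot actually requested, per the log files.
crawled = ["https://example.com/blog/", "https://example.com/admin/login/"]

# Any URL here is disallowed yet still being crawled: worth investigating.
should_not_be_crawled = [
    url for url in crawled if not rules.can_fetch("Googlebot", url)
]
```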
Review Your Website’s Security
Perhaps not strictly an SEO benefit, but still pretty useful so I thought I’d include this one anyway.
Server logs will allow you to view any malicious activity and even potential hacking attempts. A lot of the time they are the only evidence you’ll be able to find of an attack.
In fact, the analysis of server log files can reveal:
- Attempts to access a hidden file (that shouldn’t be accessible)
- Sharp increases in activity at unusual times (potential attacks or bot spam)
- Repeated attempts to access password protected areas of the site
- Attempts to perform remote code execution and injection
You might find an instance of one particular IP attempting to log in to your website multiple times. If you do spot any potential security risks, go ahead and block those offending IPs.
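A simple way to surface this pattern is to count hits per IP on your login endpoint and flag anything above a threshold. The path and threshold below are illustrative assumptions:

```python
from collections import Counter

def suspicious_ips(entries, path="/wp-login.php", threshold=5):
    """Flag IPs repeatedly hitting a login endpoint (hypothetical path)."""
    hits = Counter(e["ip"] for e in entries if e["url"] == path)
    return [ip for ip, n in hits.items() if n >= threshold]

# Made-up entries: one IP hammering the login page, one legitimate attempt.
entries = (
    [{"ip": "203.0.113.9", "url": "/wp-login.php"}] * 6
    + [{"ip": "198.51.100.4", "url": "/wp-login.php"}]
)
flagged = suspicious_ips(entries)
```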
Parameter URLs Receiving Crawls
Parameter URLs are often the cause of duplicate content issues (when not canonicalised) or can simply act as a drain on your crawl budget.
You can use log files to identify all the parameter URLs Google is hitting. If these are important, indexable URLs then that’s not a problem.
Most of the time, however, you’ll find Google attempting to crawl lots of parameter URLs that have been created via faceted navigations or product-related URL variables.
If parameter URLs are becoming an issue:
- Ensure you have correctly declared parameter URLs in Search Console
- Implement canonical tags
- Implement robots.txt disallow rules
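To see which parameters are eating your crawl budget, you can tally how often each query parameter appears in the URLs Googlebot requested. A sketch with made-up URLs:

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

def parameter_counts(urls):
    """Count how often each query parameter appears in crawled URLs."""
    counts = Counter()
    for url in urls:
        counts.update(parse_qs(urlsplit(url).query).keys())
    return counts

# Hypothetical crawled URLs generated by a faceted navigation.
crawled = [
    "/boots/?colour=red&size=9",
    "/boots/?colour=blue",
    "/boots/",
]
params = parameter_counts(crawled)
```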
Establish Internal Linking Priorities
Pages that are crawled the most should link the most!
Show Googlebot the areas of your site you really want to prioritise (I’m guessing product or service pages) and implement some internal links with optimised anchors.
A lot of times you will find an informational page receiving a tonne of Google visits, perhaps because it received lots of social shares or inbound links. These will make great internal linking opportunities.
For Advanced SEOs And Agencies: Set Up Custom Reporting
To take things to the next level, you can get pretty complex with custom reporting and automations using your log data. Some agencies will report using log file data on a monthly basis, with clever Google Data Studio setups.
The truth is, this may get a little too technical for your clients and it’s best to keep things as simple as possible. At the end of the day, revenue and sales are all your client is going to be interested in.
Data visualisation is probably your best bet to tell the story of why your log file data matters.
You could share log file data that shows which areas of the site Googlebot is missing, or perhaps month-on-month comparisons on the number of new pages Googlebot is crawling each month. Even the improvement of 4xx and 5xx errors over time could make a useful reporting point.
My Advice + Extra Resources
Download my free server log file audit template and watch the videos I found useful when learning log file analysis.
There are so many uses for log files.
My advice is to not let them overwhelm you. Even if you’re just starting out in SEO, you can run some logs through a tool like Screaming Frog and get actionable insights pretty easily.
It only needs to be as technical as you want to make it.
Generally I would recommend running a server log review at least every quarter. That will leave enough time for Googlebot’s behaviour to change and some new valuable data to emerge.
Below I’ve listed some extra resources and videos that I found really useful when discovering the power of log files, hopefully you will too!
- How to access log files on Cloudways
- How to access log files on Bluehost
- How to access log files on Hostgator
- How to access log files on Siteground
- How to access log files on Squarespace
- How to access log files on Azure
- How to access log files on Google Cloud
- Log file Analysis – What You Should Be Looking For
- Server Log Files & Technical SEO Audits: What You Need to Know Presented by Samuel Scott
- Server Log Analysis, How to analyse your server logs using SEMRush
So hopefully you now have a good overview of server log file analysis and how to approach it. I’m sure you’ll see plenty of benefits from learning this SEO tactic and implementing it across your sites.
If you have any questions about your server logs, feel free to reach out.