How Googlebot Crawls Your Pages. Logs Insights
Log files analysis shows what Google actually does on your site, revealing real problems instead of hypotheses.
By Ann Yaroshenko
You probably know log files are treasure troves. You can see how bots crawl your pages, which content is frequently visited, or vice-versa, ignored by bots. But often SEOs don’t know how to get the value of logs - it’s time-consuming to explore the sheer volume of data manually, and analytics tools may be costly. Still, you need to find a way to extract the data from your logs. Today we're going to demonstrate how log files analysis helps to reveal tag-related problems, indexability issues, and crawl budget waste.
We will cover:
1. Robots.txt. Are you sure you're sending correct directives?
2. AJAX. What content is better to hide?
3. Canonicalization. Why rel=tag is not the best solution.
1. How Googlebot follows robots.txt directives
The path value in robots.txt is used as a basis to determine whether or not a rule applies to a specific URL on a site. In the case of conflicting rules, the least restrictive rule is used. Here is the case study when an incorrect order of precedence in robots.txt opened access to duplicated content.
About the referenced website:
- 226K pages
- 4M monthly visits
The website contains a number of filters, like prices, sizes, brands, etc.
The content on those pages is similar, which is why search bots won’t show it in the SERPs. In order not to waste the crawl budget on duplicated pages, the webmasters closed these URLs in robots.txt.
Nevertheless, for some unknown reasons duplicate pages were still in Google’s index.
What did we do?
1. We crawled the website completely and filtered pages that were blocked by robots.txt.
There were no category pages on this list.
2. We analyzed the log files to see how bots process duplicated pages. For some reason, the bot ignored disallow directives. We needed to understand WHY.
To check the robots.txt file with a Validator tool to verify that URLs have been blocked properly.
Result: URL paths to products were allowed.
1. /*? in a Disallow field means “Block bots from accessing any URL that has a ? in it”
2. Allow: /product means “Crawl all URLs which begin with product combination of symbols”. But the problem is the URL paths contain both variants.
Thus, directives contradict each other. Which rule does Googlebot respect? Let’s look at Google’s Robots.txt Specifications:
So, a number of duplicated URL parameters were crawled.
Before submitting your directives, learn robots.txt guidelines for EACH search engine, because specifications can vary.
Always use a robots.txt tester tool that shows you whether your directives block search engine bots from specific URLs on your site. Here is Google’s robot’s txt checker. Note that this tool tests your directives ONLY with Google user-agents.
2. How Googlebot processes AJAX technique
Here is a case where Googlebot wasted 62% of the crawl budget on non-profitable AJAX content.
About the website:
- Real-estate website
- 1M+ pages
- 4.5M monthly visits
On the one hand, websites that use AJAX load faster and thus provide a better UX. On the other hand, these websites can be difficult for the search bot to crawl.
What was done?
We analyzed log files to see where crawl budget was spent. Logs showed that Googlebot sent 38M requests during the month. 23.5M (!) requests were wasted on AJAX content.
Check content that is provided on AJAX and evaluate whether it is worth being shown in the SERPs. Close content that provides no value in robots.txt for Googlebot.
Before closing AJAX content in robots.txt, consider carefully what content is there. You can accidentally block the part of your webpage and the content can become meaningless for bots.
3. How bot processes canonical tags
Rel=canonical tag is the most frequently used way to show Googlebot original/duplicated pages and the correlation between them. Does it mean that you should use this tag on your website? We answer, “It depends.” Here is a true story about how rel=canonical structure consumed crawl budget on e-commerce website.
About the website:
- Jewelry online shop
- 16+M pages
- 342K monthly visits
One part of the website was in the SERPs and took TOP organic positions, while the other wasn’t in the index. We aren’t talking about outdated or user credentials content, but about relevant and profitable pages that weren’t indexed for months.
What was done?
1. JetOctopus started the crawl and analyzed 2M pages.
2. We found almost 900 internal links on each crawled page and calculated that there were around 16M (!) pages in total. We decided to pause the crawl to figure out why the website is so huge.
3. Log files were analyzed to see how bots crawled the website, and where crawl budget was spent. We found out that Googlebot was spending resources on pages with rel=canonical tags. The thing is that webmasters inserted the canonical/self-canonical tag on each item in each category, filter and also on each paginated page.
What recommendation we gave:
To close combinations of category+filter+pagination+sort that all together generated a bunch of pages.
Result: Website size decreased to 730K pages so that search bot could crawl the most profitable pages.
Robots.txt is not a universal solution. Yes, instead of blocking pages in robots.txt, Google recommends closing pages with ‘noindex’ tag.
The problem is that on big ecommerce websites the bot still spends crawl budget while accessing the page code and finding directives.
On the other hand, robots.txt can ask bots not to crawl some categories at all. So that crawl budget will be used on different profitable pages, not on few huge categories.
To sum up
Log files give you unique insights about what is wrong on your website, and where your SEO opportunities are. Whatever technical updates you conduct on your website, logs can show you how search bots perceive changes. Thus you can start fixing problems based on real data, not on hypothesis. Why not try it, especially since log analysis can be much simpler and more affordable than you might think!