There is no guarantee that Googlebot will crawl every URL on your site. On the contrary, the vast majority of sites have a significant number of pages that are never crawled.
Google doesn’t have enough resources to crawl every page it finds. URLs that Googlebot has discovered but not yet crawled, along with URLs it plans to recrawl, are prioritized in a crawl queue.
Googlebot will only crawl URLs that have been given a high enough priority. Google processes new URLs constantly, so the crawl queue is always changing, and not all URLs join the queue at the back.
How can you make sure your URLs are VIPs?
SEO is all about crawling
Content must be crawled by Googlebot before it can gain visibility.
The benefits go further than that, because the faster a page is crawled after it is:
- Created, the sooner the new content can appear on Google. This is especially important for first-to-market or time-limited content strategies.
- Updated, the sooner the refreshed content can start to affect rankings. This is particularly important for both technical SEO strategies and content republishing strategies.
Crawling is therefore essential for all organic traffic. Yet it is often said that crawl optimization is only beneficial for large websites.
But it is not about how big your website is, how often it is updated, or whether there are “Discovered – currently not indexed” exclusions in Google Search Console.
Every website can benefit from crawl optimization. Its value is often misunderstood, largely because crawling gets judged by unhelpful measurements such as crawl budget.
Crawl budget doesn’t matter
Crawl budget is often used to evaluate crawling. This is the number of URLs Googlebot will crawl on a website in a given time.
Google says it is determined by two factors:
- Crawl rate limit: This is the speed at which Googlebot can retrieve website resources without affecting site performance. A responsive server will result in a higher crawl rate.
- Crawl demand: This is the number of URLs Googlebot visits in a single crawl. It is based on the demand for (re)indexing and is affected by the popularity and staleness of the site’s content.
Googlebot stops crawling a site once it has spent its crawl budget.
Google doesn’t give a crawl budget figure. The closest you can get is a rough idea from the total crawl requests shown in the Google Search Console crawl stats report.
Many SEOs, including me in the past, have gone to great pains to infer crawl budget.
The steps are often described as:
- Determine how many crawlable pages your site has. The usual recommendation is to count the URLs in your XML sitemap or to run an unlimited crawler.
- Calculate the average crawls per day by exporting the Google Search Console crawl stats report or by counting Googlebot requests in the log files.
- Divide the page count by the average number of crawls per day. It is often said that if the result is higher than 10, you should focus on crawl budget optimization.
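For illustration only, the arithmetic behind these steps can be sketched in a few lines of Python. The page count and daily crawl figure are made-up numbers, and `crawl_budget_ratio` is a hypothetical helper name, not part of any tool.

```python
def crawl_budget_ratio(crawlable_pages: int, avg_crawls_per_day: float) -> float:
    """Return crawlable pages divided by average daily crawls (the commonly quoted ratio)."""
    if avg_crawls_per_day <= 0:
        raise ValueError("avg_crawls_per_day must be positive")
    return crawlable_pages / avg_crawls_per_day


# Hypothetical site: 50,000 crawlable URLs and 4,000 Googlebot requests per day.
ratio = crawl_budget_ratio(50_000, 4_000)
print(f"ratio = {ratio:.1f}")  # prints "ratio = 12.5", above the often-quoted threshold of 10
```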
However, this process is problematic.
It assumes that every URL has been crawled once. In reality, some URLs are crawled multiple times and others not at all.
It assumes that one crawl equals one webpage. In reality, one page may need multiple URL crawls to fetch the resources (JS and CSS) needed to load it.
Most importantly, when it is reduced to a calculated metric like average crawls per day, crawl budget is nothing but a vanity metric.
Any tactic that aims to optimize crawl budget (a.k.a. aiming to increase the total amount of crawling) is a fool’s errand.
If the crawl is being used on URLs that aren’t of any value or pages that haven’t changed since the last crawl, why should you care about increasing the number of crawls? Such crawls will not improve SEO performance.
Anyone who has ever looked at crawl statistics knows they fluctuate, often from one day to the next, depending on many factors. These fluctuations may or may not be related to fast (re)indexing of SEO-relevant pages.
A rise or fall in the number of URLs crawled is neither inherently good nor bad.
Crawl efficacy can be a KPI in SEO
For the page(s) you want indexed, the focus should not be on whether they were crawled, but on how quickly they were crawled after being published or significantly modified.
The goal is to minimize the time between an SEO-relevant page being created or updated and the next Googlebot crawl. I call this time delay the crawl efficacy.
The best way to measure crawl efficacy is to calculate the difference between the database create or update datetime and the next Googlebot crawl of the URL in the server log files.
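A minimal sketch of that calculation, assuming you have already extracted a page’s create or update time from your database and a sorted list of Googlebot crawl times for the same URL from your logs. The function name `crawl_efficacy` and the sample timestamps are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from typing import Optional, Sequence


def crawl_efficacy(changed_at: datetime,
                   crawl_times: Sequence[datetime]) -> Optional[timedelta]:
    """Delay between a create/update time and the first Googlebot crawl at or after it.

    `crawl_times` must be sorted ascending. Returns None if the URL has not
    been crawled since the change.
    """
    i = bisect_left(crawl_times, changed_at)  # first crawl >= changed_at
    if i == len(crawl_times):
        return None
    return crawl_times[i] - changed_at


# Made-up data: page updated at 09:00, next Googlebot hit at 15:30 the same day.
updated = datetime(2023, 1, 10, 9, 0)
googlebot_hits = [
    datetime(2023, 1, 9, 22, 15),
    datetime(2023, 1, 10, 15, 30),
    datetime(2023, 1, 12, 8, 0),
]
print(crawl_efficacy(updated, googlebot_hits))  # 6:30:00
```

Returning `None` for pages that have not been crawled since the change keeps them visible in reporting rather than silently dropping them.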
If you have difficulty accessing these data points, you can use the XML sitemap lastmod date as a proxy and query URLs in the Google Search Console URL Inspection API for their last crawl status (up to 2,000 queries per day).
You can also use the URL Inspection API to track changes in indexing status and calculate indexing efficacy for newly created URLs. This is the difference between publication and successful indexing.
After all, crawling that has no flow-on effect on indexing status or on processing refreshed page content is a waste.
Crawl efficacy is an actionable metric: as it decreases, more SEO-critical content can be surfaced to your audience across Google.
It can also be used to diagnose SEO problems. Drill down into URL patterns to see how fast content from different sections of your site is being crawled and whether slow crawling is what is holding back organic performance.
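One possible way to drill down by URL pattern is to group per-URL crawl delays by the first path segment and compare medians. This is a sketch under the assumption that you already have a delay in hours for each URL; `efficacy_by_section` and the sample URLs are hypothetical.

```python
from collections import defaultdict
from statistics import median
from urllib.parse import urlparse


def efficacy_by_section(delays_hours: dict[str, float]) -> dict[str, float]:
    """Median crawl delay in hours, grouped by the first path segment of each URL."""
    groups = defaultdict(list)
    for url, hours in delays_hours.items():
        segments = [s for s in urlparse(url).path.split("/") if s]
        section = "/" + segments[0] if segments else "/"
        groups[section].append(hours)
    return {section: median(values) for section, values in groups.items()}


# Made-up delays: blog posts are crawled within hours, product pages take days.
sample_delays = {
    "https://example.com/blog/post-a": 2.0,
    "https://example.com/blog/post-b": 4.0,
    "https://example.com/products/widget-1": 30.0,
    "https://example.com/products/widget-2": 50.0,
    "https://example.com/products/widget-3": 70.0,
}
for section, hours in sorted(efficacy_by_section(sample_delays).items()):
    print(f"{section}: {hours:.1f} h")
```

The median is used here rather than the mean so that a handful of very late crawls does not dominate a section’s figure.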
What can you do if Googlebot takes hours, days, or weeks to crawl your newly created content and index it?
7 steps to optimize crawling
Crawl optimization refers to directing Googlebot to crawl important URLs fast once they are (re)published. Follow these seven steps.
1. Ensure a fast, healthy server response
It is crucial to have a highly performant server. Googlebot will slow down or stop crawling when:
- Crawling your site affects performance. For example, the more Googlebot crawls, the slower the server response time becomes.
- The server responds with a significant number of errors or connection timeouts.
Conversely, a faster page load speed lets Googlebot crawl more URLs in the same amount of time. This comes on top of page speed being a ranking and user experience factor.
If you do not support HTTP/2 yet, consider it. It allows more URLs to be requested with a similar load on your servers.
However, the correlation between crawl volume and performance holds only up to a point. Beyond that threshold, which varies from one site to the next, additional server performance gains are unlikely to be correlated with an increase in crawling.
How to check your server health
In the Google Search Console crawl stats report, look for:
- Host status: Shows green ticks.
- 5xx errors: Less than 1%.
- Server response time chart: Trending below 300 milliseconds.
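If you would rather check the same thresholds against your own log data, a rough sketch could look like the following. It assumes you have already parsed each Googlebot request into a status code and a response time; `Fetch` and `server_health` are hypothetical names, and a plain average stands in for the trend shown in the report’s chart.

```python
from dataclasses import dataclass


@dataclass
class Fetch:
    status: int          # HTTP status code returned to the crawler
    response_ms: float   # server response time in milliseconds


def server_health(fetches: list[Fetch],
                  max_5xx_share: float = 0.01,
                  max_avg_ms: float = 300.0) -> dict:
    """Compare the 5xx share and average response time against the thresholds above."""
    if not fetches:
        raise ValueError("no fetches to evaluate")
    share_5xx = sum(1 for f in fetches if 500 <= f.status <= 599) / len(fetches)
    avg_ms = sum(f.response_ms for f in fetches) / len(fetches)
    return {
        "share_5xx": share_5xx,
        "avg_response_ms": avg_ms,
        "healthy": share_5xx < max_5xx_share and avg_ms < max_avg_ms,
    }


# Made-up sample: 199 fast successful responses and one slow 503.
sample_fetches = [Fetch(200, 180.0)] * 199 + [Fetch(503, 900.0)]
report = server_health(sample_fetches)
print(report["healthy"], f"{report['share_5xx']:.1%}", f"{report['avg_response_ms']:.0f} ms")
```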
2. Get rid of low-value content
Site content that is duplicate, outdated, or low quality creates competition for crawl activity. This can delay the indexing of new content or the reindexing of updated content.
Cleaning it up regularly is an SEO must.
When another page is a clear replacement, merge the content with a 301 redirect. This costs double the crawling to process, but it is worth the sacrifice for link equity.
If there is no equivalent content, a 301 will only produce a soft 404. To signal that the URL is not to be crawled again, remove such content by using a 410 (best) or 404 (close second) status code.
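The decision rule above can be written as a tiny helper. This is only a sketch: `retirement_response` is a hypothetical name, and how the status code is actually served depends on your server or framework.

```python
from typing import Optional, Tuple


def retirement_response(replacement_url: Optional[str]) -> Tuple[int, Optional[str]]:
    """Return (status_code, redirect_target) for a URL that is being retired."""
    if replacement_url:
        # A clear replacement exists: merge the content and redirect permanently.
        return 301, replacement_url
    # No equivalent content: tell crawlers the page is gone for good.
    return 410, None


print(retirement_response("https://example.com/guides/crawling"))  # (301, 'https://example.com/guides/crawling')
print(retirement_response(None))                                   # (410, None)
```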
How to check for low-value content
Check the number of URLs under the ‘Crawled – currently not indexed’ exclusion in the Google Search Console pages report. If it is high, review the sample URLs provided for folder patterns and other issue indicators.
3. Review indexing controls
Rel=canonical links are a strong hint to avoid indexing issues. However, they are often over-relied upon and end up causing crawl problems, as each canonicalized URL requires at least two crawls: one for itself and one for its canonical target.
Noindex robots directives can also be useful in reducing index bloat. However, too many can negatively impact crawling, so only use them when necessary.
In both cases, ask yourself the following questions:
- Are these indexing directives the best way to tackle the SEO challenge?
- Can some URL routes be combined, removed, or blocked in robots.txt?
If you are currently using it, seriously reconsider AMP as a long-term technical solution.
The page experience update focuses on core web vitals and allows non-AMP pages to be included in all Google experiences, provided you meet the site speed requirements. Take a hard look at whether AMP is worth the double crawl.
How to check for over-reliance on indexing controls
Check the number of URLs in the Google Search Console coverage report that are categorized under the following exclusions without a clear reason:
- Alternate page with proper canonical tag
- Excluded by ‘noindex’ tag
- Duplicate, Google chose different canonical than user
- Duplicate, submitted URL not selected as canonical
4. Tell search engine spiders where and when to crawl
An XML sitemap is a vital tool to help Googlebot prioritize site URLs and communicate when they are updated.
To ensure effective crawler guidance, you should:
- Only include URLs that are both indexable and valuable for SEO. These are generally 200 status code, canonical, original content pages with an “index,follow” robots tag.
- Include accurate <lastmod> timestamp tags on each URL and on the sitemap itself.
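As a rough sketch of the second point, the snippet below builds a small sitemap with a per-URL lastmod value using only Python’s standard library. `build_sitemap` and the sample URL are hypothetical, and the function assumes you pass timezone-aware datetimes taken from your own content database. It does not cover the lastmod value on a sitemap index.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[tuple[str, datetime]]) -> str:
    """Return sitemap XML with one <url> entry (loc + lastmod) per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # W3C datetime with an explicit UTC offset, e.g. 2023-01-10T09:00:00+00:00
        stamp = modified.astimezone(timezone.utc).isoformat(timespec="seconds")
        ET.SubElement(url, "lastmod").text = stamp
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_sitemap([
    ("https://example.com/blog/post-a", datetime(2023, 1, 10, 9, 0, tzinfo=timezone.utc)),
])
print(sitemap_xml)
```

Deriving lastmod from the real update time, rather than from the time the sitemap was generated, is what keeps the timestamps accurate.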
Google does not check a sitemap every time it crawls a site, so it is best to ping the sitemap to Google whenever it changes. You can do this by sending a GET request, from your browser or the command line, to Google’s sitemap ping endpoint.
Also, specify the sitemap paths in the robots.txt file and submit the sitemap to Google Search Console via the sitemaps report.
As a rule, Google will crawl URLs within sitemaps more often than others. However, if even a small share of the URLs in your sitemap is low quality, it can discourage Googlebot from using the sitemap for crawl suggestions.
XML sitemaps add URLs to the regular crawl queue. There is also a priority crawl queue, which has two entry options.
First, if you have job postings or live videos, you can submit URLs via Google’s Indexing API.
If you want to attract the attention of Yandex or Microsoft Bing, you can also use the IndexNow API to submit any URL. In my testing, however, it had no effect on URL crawling. If you use IndexNow, make sure to monitor crawl efficacy for Bingbot.
Second, you can manually request indexing after inspecting the URL in Search Console. Keep in mind that there is a daily limit of 10 URLs and that crawling can still take several hours. Treat this as a temporary patch while you dig deeper to find the root cause of your crawling problem.
How to check for essential Googlebot do-crawl guidance
In Google Search Console, your XML sitemap shows the status “Success” and was read recently.
5. Tell search engine spiders what not to crawl
Some pages are important to users or site functionality, but you don’t want them to appear in search results. You can prevent such URL routes from distracting crawlers by using a robots.txt disallow. This could include:
- APIs and CDNs. For example, if you are a Cloudflare customer, make sure to disallow the folder /cdn-cgi/ that is added to your site.
- Unimportant images or scripts, if pages loaded without these resources are not significantly affected by the loss.
- Functional pages, such as a shopping basket.
- Infinite spaces, such as those created by calendar pages.
- Parameter pages. Particularly those that use faceted navigation to filter (e.g., ?price-range=20-50), reorder (e.g., ?sort=), or search (e.g., ?q=), since each combination is counted as a separate page by crawlers.
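A quick way to sanity-check disallow rules before deploying them is to parse a draft robots.txt with Python’s standard `urllib.robotparser`. The sketch below uses the Cloudflare folder mentioned above, while the `/basket/` and `/calendar/` paths are hypothetical placeholders for your own routes.

```python
from urllib.robotparser import RobotFileParser

# Draft rules; only /cdn-cgi/ comes from the text, the other paths are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cdn-cgi/
Disallow: /basket/
Disallow: /calendar/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in ("/cdn-cgi/trace", "/basket/", "/blog/crawl-efficacy"):
    allowed = parser.can_fetch("Googlebot", "https://example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

Note that the standard library parser only does simple prefix matching. Googlebot also understands `*` and `$` wildcards, which are typically what you need for parameter rules, so test those patterns with a tool that supports them.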
You should not block the pagination parameter completely. Googlebot often needs crawlable pagination to find content and process internal link equity. For more information on pagination, see this Semrush webinar.
For tracking, instead of using UTM tags powered by parameters (a.k.a. ‘?’), use anchors (a.k.a. ‘#’). This offers the same reporting benefits in Google Analytics without being crawlable.
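As an illustration, the helper below rewrites a tagged URL so that the utm_* values travel in the fragment instead of the query string. `utm_to_fragment` is a hypothetical name, and whether your analytics setup reads campaign values from the fragment is an assumption you should verify in your own configuration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def utm_to_fragment(url: str) -> str:
    """Move utm_* query parameters into the URL fragment, keeping other parameters."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    utm = [(k, v) for k, v in pairs if k.startswith("utm_")]
    rest = [(k, v) for k, v in pairs if not k.startswith("utm_")]
    # An existing fragment is replaced when UTM values are present.
    fragment = urlencode(utm) if utm else parts.fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(rest), fragment))


print(utm_to_fragment("https://example.com/page?utm_source=newsletter&utm_medium=email&page=2"))
# https://example.com/page?page=2#utm_source=newsletter&utm_medium=email
```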
How to check for Googlebot do-not-crawl guidance
Review the sample of ‘Indexed, not submitted in sitemap’ URLs in Google Search Console. Ignoring the first few pages of pagination, what other paths do you find? Should they be included in an XML sitemap, blocked from crawling, or left alone?
Also review the “Discovered – currently not indexed” list, and block in robots.txt any URL paths that offer low or no value to Google.
To take this to the next level, review all Googlebot smartphone crawls in the server log files and look for paths that offer no value.
6. Curate relevant links
Backlinks to a page are important for many aspects of SEO, and crawling is no exception. However, external links can be difficult to obtain for certain types of pages, such as deep pages like products, categories at lower levels of the site architecture, or even articles.
Relevant internal links, on the other hand, are:
- Technically scalable.
- Powerful signals to Googlebot to prioritize a page for crawling.
- Particularly impactful for deep page crawling.
Such internal links should also add real value for the user.
How to check for relevant links
Run a manual crawl of your entire site using a tool such as ScreamingFrog’s SEO spider. The goal is to find:
- Orphan URLs
- Internal links blocked by robots.txt
- Internal links to any non-200 status code
- The percentage of non-indexable URLs that are internally linked
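If you export the crawl data, the first and third checks reduce to simple set operations. The sketch below assumes three hypothetical inputs: the full set of known URLs (for example from your sitemap), a mapping of each page to the internal links found on it, and the status code of each URL. `audit_links` is an illustrative name, not a feature of any crawler.

```python
def audit_links(all_urls: set[str],
                links: dict[str, set[str]],
                status: dict[str, int],
                start_url: str) -> tuple[set[str], set[tuple[str, str]]]:
    """Return (orphan URLs, internal links pointing at a non-200 URL)."""
    linked = set().union(*links.values())
    # Orphans: known URLs that no crawled page links to (the start page is exempt).
    orphans = all_urls - linked - {start_url}
    # Links whose target did not return 200, or whose status is unknown.
    broken = {(src, dst)
              for src, targets in links.items()
              for dst in targets
              if status.get(dst) != 200}
    return orphans, broken


# Made-up crawl export.
known_urls = {"/", "/blog", "/blog/a", "/old-page", "/landing"}
internal_links = {"/": {"/blog"}, "/blog": {"/blog/a", "/old-page"}}
status_codes = {"/": 200, "/blog": 200, "/blog/a": 200, "/old-page": 404, "/landing": 200}

orphans, broken = audit_links(known_urls, internal_links, status_codes, start_url="/")
print(sorted(orphans))  # ['/landing']
print(sorted(broken))   # [('/blog', '/old-page')]
```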
7. Examine any remaining crawling issues
If all of the above optimizations are complete and your crawl efficacy is still suboptimal, it is time to conduct a deep-dive audit.
To identify crawl issues, you should first review the samples of any remaining Google Search Console exclusions.
After those issues are resolved, use a manual crawling tool to crawl all pages within the site structure as Googlebot would. Cross-reference this with the log files, narrowed down to Googlebot IPs, to determine which pages are and are not being crawled.
Finally, launch into log file analysis, narrowed down to Googlebot IPs, covering at least four weeks of data and ideally more.
If you are not familiar with the format of log files, a log analyzer tool can help. Ultimately, log files are the best source for learning how Google crawls your site.
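For readers who want to see what narrowing a log down to Googlebot IPs involves, here is a minimal sketch. It assumes access logs in the common/combined format, and the single network range in `GOOGLEBOT_NETWORKS` is illustrative only; in practice you would load the current ranges that Google publishes. `googlebot_hits` is a hypothetical helper name.

```python
import ipaddress
import re

# Illustrative range only; load Google's currently published ranges in production.
GOOGLEBOT_NETWORKS = [ipaddress.ip_network("66.249.64.0/19")]

# Common/combined log format: ip - - [time] "METHOD path PROTOCOL" status ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def googlebot_hits(lines):
    """Yield (path, status, time) for requests whose IP falls in a Googlebot range."""
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not match the expected format
        try:
            ip = ipaddress.ip_address(m["ip"])
        except ValueError:
            continue  # skip malformed IP addresses
        if any(ip in net for net in GOOGLEBOT_NETWORKS):
            yield m["path"], int(m["status"]), m["time"]


sample_log = [
    '66.249.66.1 - - [10/Jan/2023:15:30:00 +0000] "GET /blog/post-a HTTP/1.1" 200 5123 "-" "Googlebot/2.1"',
    # Same user agent, but the IP is outside the range, so it is ignored.
    '203.0.113.7 - - [10/Jan/2023:15:31:00 +0000] "GET /blog/post-a HTTP/1.1" 200 5123 "-" "Googlebot/2.1"',
]
print(list(googlebot_hits(sample_log)))
```

Filtering on IP rather than user agent matters because the user agent string is trivial to spoof.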
Once you have completed your audit and have a list of identified crawl issues, rank each issue according to its expected effort and impact on performance.
Note: Other SEO experts have stated that clicks from the SERPs increase crawling of the landing page URL. I have not yet been able to confirm this with testing.
Prioritize crawl efficacy over crawl budget
The goal of crawling is not to get the most crawling or to have every page of a site crawled repeatedly. It is to attract a crawl of SEO-relevant content as close as possible to when a page is created or updated.
Overall, budgets don’t matter. It’s how you invest that matters.
The article Crawl efficacy: How can you improve crawl optimization first appeared on Search Engine Land.