The Science of Optimizing Crawl Space

What is Crawl Space?

Crawl space is the totality of possible URLs for a website. Google started talking about 'infinite crawl space' in 2008, warning that its crawlers were wasting time and bandwidth on websites containing large sets of URLs, often with no new content for Google to index. Google wanted us to optimize the crawl space of our websites down to unique pages: reduce the duplicate pages to be crawled, provide fresh, content-rich pages and consolidate indexing signals to them. That was in 2008. What a clue.

Knowing your site architecture is the first step towards successful website optimization. Identifying your crawl space is the first step towards knowing your site architecture.

Size still matters

Google still warns you about potential issues impacting your crawl space within Webmaster Tools, and still outlines what it dislikes and the consequences of not addressing crawl space:

- Google wants to avoid unnecessarily crawling a large number of URLs that point to identical or similar content.
- Google hates having to crawl unimportant parts of your site (importance could be measured by depth of content and engagement).
- Googlebot may consume much more bandwidth than necessary.
- Google may be unable to completely index all of the content on your site as a result.

Are you feeling bloated?

The most common cause of a bloated crawl space is multiple URLs serving the same webpage. For example, a single web page can have multiple URL versions of itself:

- Domain.com
- Domain.com/
- Domain.com/home/
- Domain.com/?source=googleppc
- Domain.com/Home
- Domain.com/index

For successful SEO, there should be only one URL per page. To find out if this is a problem for your website, check Google's or Bing's index to see how many of your pages are already indexed. Ask yourself: is this a realistic number of pages for your website? Then check the URL structures of the pages that are indexed. Do you recognize the structures or formats being indexed? There could be pages in the search engine index providing you zero return in traffic or value. They could even be doing you harm, causing duplication and dilution of authority. Time to sort it out.
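One practical way to see how far the duplication has spread is to normalize your URLs to a single form and look for collisions. Below is a minimal sketch: the normalization rules (stripping tracking parameters, default documents and trailing slashes, lowercasing the host and path) are illustrative assumptions, not a universal standard, so adapt them to your own URL policy.

```python
# Sketch: collapse common URL variants of the same page to one form.
# TRACKING_PARAMS and DEFAULT_DOCS are illustrative assumptions.
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

TRACKING_PARAMS = {"source", "utm_source", "utm_medium", "utm_campaign"}
DEFAULT_DOCS = {"index", "index.html", "home"}

def normalize(url: str) -> str:
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    # Lowercase the path, drop trailing slashes and default documents.
    segments = [s for s in parts.path.lower().split("/")
                if s and s not in DEFAULT_DOCS]
    path = "/" + "/".join(segments) if segments else "/"
    # Keep only query parameters that actually change the page content.
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k.lower() not in TRACKING_PARAMS])
    return urlunsplit((parts.scheme, host, path, query, ""))

variants = ["Domain.com", "Domain.com/", "Domain.com/home/",
            "Domain.com/?source=googleppc", "Domain.com/Home", "Domain.com/index"]
print({normalize(u) for u in variants})  # all six collapse to a single URL
```

If a long list of crawled URLs collapses to far fewer normalized entries, the surplus is a rough measure of your duplication problem.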
Why Bother with Crawl Space Today?

Googlebot only spends a finite time crawling your site, so make sure it doesn't waste that time crawling low-value pages at the expense of the important ones. This is particularly important for large websites that change frequently, such as news or gambling sites. Focus Google's attention on the important, valuable URLs that drive traffic and conversions.

Duplicate content dilutes page value, weakening authority signals including backlinks and social shares. If you can consolidate all near-duplicate URLs so that a single canonical page is optimized to perform, you will be optimizing your valuable pages.

Maximizing your crawl efficiency often means reducing your crawl space, which in turn can reduce the volume of your indexed pages. Often you'll also be working on traffic growth strategies, requiring website (and therefore URL volume) expansion. An expansion strategy requires you to increase the overall volume of your unique indexable pages.

Scaling up websites can be risky. Issues grow rapidly in proportion to the size of your site: with 10x duplicate URLs per page already, a 10x growth in content results in 100x URLs. Understanding your crawl space provides the framework to work more efficiently and manage scale with minimal risk.

What's in Crawl Space?

To build up a picture of your full site architecture and discover your complete crawl space, you'll need to consider what contributes to your URL Universe - the place where all theoretically possible URLs exist. Here's our DeepCrawl URL Universe checklist:

- Invoked URLs: all of the URLs ever brought into existence, for any reason.
- Uninvoked URLs: URLs that haven't been invoked (but could be).
- Internally Linked URLs: all invoked URLs which are linked from other internal pages.
- Externally Linked URLs: all URLs linked from another site.
- URLs in Sitemaps: all of the invoked URLs which are currently included in a sitemap.
- Socially Shared URLs: all URLs being shared across social media platforms - Facebook posts, tweets, etc.
- Organic Landing Page URLs: all indexed URLs which have driven traffic from organic search results.
- Mobile URLs: all URLs from a mobile site or mobile property that are internally or externally linked to your main site.
- Translated/Regional URLs: URLs used on different language and regional copies of a website.
- Shortened URLs: external, but pointing to internal pages.
- Domain Duplicates: aliased domains, www/non-www, http/https.
- Crawlable URLs: all the URLs in the Universe which could be crawled by the search engine.
- Crawled URLs: all the invoked URLs which have been crawled by the search engine.
- Indexable URLs: all the URLs in the Universe which could be indexed by the search engine.
- Indexed URLs: all crawled pages which are now in the search engine's index.
- Canonical URLs: all the clean canonical URLs (these should be your crawlable unique pages).

The DeepCrawl Formula

Discovery, management and optimization of your crawl space is essential to optimizing website architecture and laying the foundation for strong performance.

Maximized + Minimized = Optimized: minimize your crawl space, maximize your indexable space and promote your optimized, clean site version.

- Minimize crawlable space: define your crawl space, identify and eliminate threats.
- Maximize indexable space: increase the volume of your valuable pages and increase crawl efficiency.
- Optimize canonical space: a clean version of your website URLs - your URL à la carte.

Carefully define and efficiently manage your crawl spaces for optimized website performance and successful growth initiatives.

Discover your Crawl Space

DeepCrawl supports a powerful new crawl type - Universal Crawl. You can crawl your website, XML Sitemaps and organic landing pages in one hit, giving you a significant head start on defining, managing and optimizing your crawl spaces. For example, you'll be able to discover:

- Sitemap URLs that aren't linked internally: you can then link them internally, minimizing your crawl space.
- Linked URLs that aren't in your sitemaps: you can then add these to your sitemaps, maximizing your indexable space.
- Linked URLs, or URLs in sitemaps, that don't generate traffic: you could then disallow or delete these, minimizing your crawl space.
- URLs that generate entry visits but aren't linked: you could then redirect or re-integrate these into your website, increasing your indexable space.
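Each of these discoveries is simply a set difference between URL sources. Here is a minimal sketch of the same comparisons, assuming you have exported plain-text URL lists from your crawl, your XML sitemaps and your analytics organic landing page report; the file names are placeholders.

```python
# Sketch: treat each URL source as a set and diff them to find gaps.
# File names are placeholder exports from a crawler, XML sitemaps and
# an analytics organic-landing-page report.
def load(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

linked   = load("internally_linked_urls.txt")
sitemaps = load("sitemap_urls.txt")
landing  = load("organic_landing_pages.txt")

print("In sitemaps but not linked internally:", sitemaps - linked)
print("Linked internally but not in sitemaps:", linked - sitemaps)
print("Linked or in sitemaps but driving no organic traffic:",
      (linked | sitemaps) - landing)
print("Driving organic entry visits but not linked (orphaned):",
      landing - linked)
```

Whatever tool produces the lists, thinking of each source as a set makes the gaps between them explicit and easy to re-check after every change.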
Manage your Crawl Space

Why manage it? A lack of crawl space management leaves you wide open to threats, and leads to a whole load of Webmaster Tools (GWT) warnings. Poor management has negative implications:

- Orphaned pages
- Incomplete sitemaps
- Dilution of backlinks, shares and therefore page authority
- Performance is harder and more complicated to track
- Limited volume of unique URLs in analytics
- Crawl inefficiency for SEO
- Traffic growth strategies just won't work

For example: your social share URLs don't match your canonical URLs. Why worry - are social signals and organic rankings even related? Twitter and Facebook: no. Google+: possibly. Even if social signals are not used in a ranking algorithm, social media and SEO are closely related in driving discovery at scale. Why not consolidate your quality signals to search engines as part of your crawl space URL management? It may not be required, but it is recommended.

Become the Boss (of URLs)

Take ownership of your website URLs within your organization. This will help you become a great crawl space manager.

1. Get the tools you need, such as DeepCrawl and Webmaster Tools (don't forget Bing).
2. Fully audit and benchmark your current URL landscape, using Universal Crawl to discover and audit your crawl space.
3. Create an organization-level URL roadmap and communicate it to all departments - development, marketing, PR and social - ensuring URL consistency and management.
4. Maintain awareness: communicate changes, hold workshops and congratulate successful adherence, e.g. on social media campaigns.

The Payoff

Effective crawl space management avoids dilution of your page authority and facilitates traffic growth. You'll be able to recover backlinks and traffic, improve crawl efficiency and get your valuable pages crawled first. You'll increase the volume of your indexable pages and consolidate the value of social shares and backlinks. You'll run traffic and website growth initiatives that will actually work.

Optimize your Crawl Space

Following the DeepCrawl formula, you'll want to start minimizing your crawl space and maximizing your indexable pages, to optimize your crawl space and promote your clean, canonicalized site version. Here's how to get started.

Regular DeepCrawl

Crawl your website frequently to understand how your URLs are changing over time. If you run a dynamic, frequently updated website, you're likely to be faced with multiple updates from multiple sources - website, database, social media and advertising - applied to multiple areas of your site at multiple points within a week, month or quarter. Running a single crawl won't help you understand how or where your site changes, or let you be the boss of your URLs.

You also need to include the extra dimension of time in your crawl space definition and management. DeepCrawl automatically shows you what's changed between crawls, and trend reports show you that time dimension. You might spot URL formats that are changing frequently and affecting the crawl efficiency of your website. Run repeat and scheduled crawls to understand how your website changes over time and find all the URLs you need to manage within your crawl space.
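The comparison between two crawls is again a set difference, this time across snapshots in time. A rough sketch, assuming you have URL lists exported from two successive crawls; the file names are placeholders.

```python
# Sketch: compare two crawl snapshots to see how the crawl space changes
# over time. The file names are placeholder crawl exports.
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def load(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

previous = load("crawl_previous.txt")
current = load("crawl_latest.txt")

added, removed = current - previous, previous - current
print(f"{len(added)} new URLs since the last crawl, {len(removed)} gone")

# A sudden jump in new URLs sharing one query parameter often points to a
# template or tracking change that is inflating the crawl space.
param_counts = Counter(k for u in added for k, _ in parse_qsl(urlsplit(u).query))
print(param_counts.most_common(5))
```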
Maximize your Indexable Space

Audit your indexable and non-indexable URL parameters

Use DeepCrawl Universal Crawl to understand your current URL configuration and canonical use.

1. Use the Webmaster Tools URL Parameters report to identify the URL parameters in existence on your domain(s). Download a list with the volumes of URLs monitored by Google and an indication of when and how they have been configured in Webmaster Tools, if applicable.
2. Use site: + inurl: queries to check the indexation of parameters.
3. Run a Universal Crawl on your website (including sitemaps and analytics).
4. Use the DeepCrawl indexation reports (All Crawled URLs, Unique Pages, Noindex Pages, Disallowed URLs, etc.) to identify URL parameters that you don't want indexed.
5. Check your robots.txt for the currently disallowed URL parameters. The DeepCrawl Disallowed URLs report shows you the parameters currently disallowed in robots.txt.
6. Use the Parameter Removal setting in DeepCrawl to run a new crawl and test the impact of stripping parameters from your URLs. This is a useful test before changing anything in Webmaster Tools.
7. Apply URL parameter settings in Webmaster Tools to get Google to ignore parameters that are bloating your crawl space. Reducing the number of unique URLs improves crawl efficiency and cleans up your Google Analytics data.

Check your list of disallowed URLs - is it as expected?

Identify opportunities for additional URLs to disallow, and check that what is disallowed should be, in case there's an opportunity to increase your unique indexable pages. Sitemaps should not contain disallowed URLs, and your robots.txt should not contain disallow rules that block URL parameters you need indexed.

1. Download and check your DeepCrawl Disallowed URLs report from a Universal or Website crawl.
2. Run a DeepCrawl Backlinks Crawl so you can see URLs linked to your site that are disallowed.
3. Compare all URLs against your Disallowed URLs report, including sitemap URLs and URLs with backlinks.
4. Use your findings to design and test changes to your robots.txt file, using the Robots Overwrite function in DeepCrawl (see the sketch after this list for a standalone check).
5. Write a new, optimized version of your robots.txt file that increases the number of disallowed URLs, or removes disallow rules that are causing unique URLs to be missed.
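Before deploying a robots.txt change, you can approximate this check with the standard library alone. A minimal sketch, assuming you have a draft robots.txt file plus plain-text exports of your sitemap URLs and backlinked URLs; the file names and the Googlebot user agent string are placeholders.

```python
# Sketch: test a draft robots.txt against sitemap and backlinked URLs to
# see what a crawler would be blocked from fetching.
from urllib.robotparser import RobotFileParser

def load(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

# Parse the draft rules locally, without publishing them anywhere.
with open("robots_draft.txt") as f:
    parser = RobotFileParser()
    parser.parse(f.read().splitlines())

must_stay_crawlable = load("sitemap_urls.txt") | load("backlinked_urls.txt")
blocked = {u for u in must_stay_crawlable if not parser.can_fetch("Googlebot", u)}

print(f"{len(blocked)} URLs you want crawled would be disallowed by the draft:")
for url in sorted(blocked)[:20]:
    print(" ", url)
```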
Compare your sitemaps to your linked URLs - anything missing?

Your sitemaps should promote only the valuable, unique pages that you want indexed. They should not include disallowed pages, noindex/nofollow pages, or thin pages that provide little or no value.

1. Use the DeepCrawl Universal Crawl Only in Sitemaps report to identify pages in sitemaps that are not linked internally, and pages linked internally that are not in sitemaps.
2. Use the DeepCrawl Universal Crawl Missing in Sitemaps report to find linked pages which are not in the sitemaps but could be.
3. You can also manually diff (in Excel) the list of URLs from a Universal DeepCrawl against the URLs included in your sitemap(s).

Compare your landing page URLs to your linked URLs - anything missing?

Organic landing pages that drive traffic are often valuable, but can become orphaned over time. You can use DeepCrawl Universal Crawl to identify orphaned pages that are unlinked, and link them back in for additional value and consolidation of your backlinks and authority.

Use the DeepCrawl Universal Crawl Only in Organic Landing Pages report to identify:

- Pages that are not linked internally but are included in your organic landing pages (orphaned). Find ways to link orphaned pages back into your site to consolidate backlinks, authority and value.
- Pages that are linked internally but are not in your organic landing pages. These pages could be missing analytics tracking altogether, or indicate indexation and crawl efficiency areas for investigation and optimization.

Use the Universal Crawl settings to adjust the value you assign to your organic landing pages, by choosing the time frame (7-30 days back) and the minimum visit volume (the default is 2 visits, but it can be set to any value) for analytics inclusion. This is useful when testing different page categories. You can also manually diff the list of URLs from a Universal or Website Crawl against the list of organic landing pages from analytics.

Minimize your Crawl Space

Check for low-value content (bloat)

A low ratio of body content to HTML can prevent a page from being indexed; in some instances it can cause a 'soft 404' classification in Webmaster Tools. It can also indicate pages with a high level of scripts, i.e. JavaScript, which could be affecting your page load times. Slow page load speed can also negatively impact page indexation.

Use the DeepCrawl Low Content/HTML Ratio report to:

- Highlight extensive use of JavaScript within the HTML that might be affecting your page load times.
- Identify pages that need more unique content to be of value, or a reduction of HTML/JavaScript that is unnecessary to their function.
- Address pages that need content additions to be of value, or optimization of their HTML. If neither is relevant or possible, remove these low-value pages from your crawl space.
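As a rough companion to that report, the ratio can be approximated directly. A minimal sketch, assuming regex-based tag stripping and an arbitrary 10% threshold for flagging thin pages; the URLs are placeholders.

```python
# Sketch: approximate a page's visible-text-to-HTML ratio to flag thin
# pages. Regex-based tag stripping and the 10% threshold are simplifying
# assumptions, not a reproduction of any crawler's exact calculation.
import re
import urllib.request

def content_ratio(url: str) -> float:
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    # Remove scripts, styles and tags, then measure what text is left.
    stripped = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", stripped)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) / max(len(html), 1)

for url in ["https://www.example.com/", "https://www.example.com/thin-page"]:
    ratio = content_ratio(url)
    flag = "LOW" if ratio < 0.10 else "ok"
    print(f"{flag:>3}  {ratio:5.1%}  {url}")
```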
Identify low-value linked URLs and disallow them

Pages that don't drive any entrance visits could be causing crawl space bloat, and might not need to be indexed or crawled.

1. Use the DeepCrawl Universal Crawl Missing in Organic Landing Pages report to identify internally linked pages that have not driven any traffic.
2. Use the Universal Crawl settings to adjust the value you assign to your organic landing pages, by choosing the time frame (7-30 days back) and the minimum visit volume (default 2 visits, settable to any value) for analytics inclusion.
3. Check what these pages are for; if they are not required and not driving visits, disallow them in robots.txt.

Identify domain aliases and redirect them / use cross-domain canonicals

Domain duplication can cause high-volume, easily detected duplicate content issues and massive crawl inefficiencies.

1. Get a list of all the registered domains relevant to you and check whether they return a duplicate of your site or redirect, and whether that redirect returns the correct status.
2. Use queries such as site:www.yourdomain.com with inurl:https, inurl:www or inurl:othersubdomains to check for http/https, www/non-www, and any testing/staging or subdomain versions of your site URLs in the search engine index.
3. Use automation tools such as Robotto to detect indexed subdomains in Google.
4. Search Google for unique text from your site, in quotes.
5. Use backlink data to identify redirecting URLs - Majestic has a useful report showing redirected domains.

Identify all linked URLs and get their response codes - as expected?

Check that your website's internal linking is working towards an optimized crawl space. Use the DeepCrawl internal broken links, redirected links, 4xx and 5xx error reports to identify internal links on your site that are broken, or that point to redirected URLs, and may be affecting your crawl efficiency.

Remove Duplicates

Page duplication can dilute page and backlink authority, and cause valuable pages to be devalued or not indexed. It can take the form of URL duplicates (trailing slash, case issues) or content duplicates (similar product variants), reducing your crawl efficiency.

1. Use the Duplicate Pages report in DeepCrawl Universal Crawl to identify causes of page duplication.
2. Check for URL duplicates (trailing slash, URL encoding, case issues) and content duplicates (similar product variants, i.e. sizes).
3. Check for product variations, such as sizing, or URL parameters from sorting and refining searches, which can often cause soft duplication (and soft 404s in Webmaster Tools).
4. Use the DeepCrawl canonical reports to check whether you are canonicalizing product variations and similar URL distinctions correctly, or as expected.
5. Sometimes your canonicalized size variations of a product (and other soft duplications) are not unique enough, in which case you might canonicalize away the parameter to remove the duplication.

Minimize Redirects

Redirection in volume is not necessarily a direct issue for site indexation, but it does cause crawl inefficiencies and a potentially bloated crawl space.

1. Use the DeepCrawl 301 Redirects and Non-301 Redirects reports to identify the number and type of redirected URLs on your site.
2. Look for internal links to redirected URLs and check whether they need to be redirected or should be direct links. For example, websites using segmented databases and site search functions can find dynamic parts of their websites out of sync. This can auto-generate internal links that pass through redirects, such as 302s, which aren't required for passing indexed page authority and could be replaced with direct internal links.
3. Check for redirecting URLs in sitemaps. Sitemaps should not contain redirected URLs; search engines may not follow them, which could prevent other pages from getting indexed.
4. Replace any redirected sitemap URLs with the redirect target URL, or remove them from the sitemaps.
5. Check the DeepCrawl Max Redirect Chains report to identify long redirect chains (or loops) that could be reducing crawl efficiency. A standalone way to trace a chain is sketched below.
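To trace a chain outside of any crawler, you can follow Location headers hop by hop. A minimal sketch using only the standard library; the 10-hop cap, the use of HEAD requests and the example URL are assumptions.

```python
# Sketch: follow redirects hop by hop to surface long chains and loops.
import http.client
from urllib.parse import urlsplit, urljoin

def redirect_chain(url: str, max_hops: int = 10):
    chain = [url]
    for _ in range(max_hops):
        parts = urlsplit(chain[-1])
        Conn = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
        conn = Conn(parts.netloc, timeout=10)
        target = (parts.path or "/") + (f"?{parts.query}" if parts.query else "")
        conn.request("HEAD", target)
        resp = conn.getresponse()
        location = resp.getheader("Location")
        conn.close()
        if resp.status in (301, 302, 303, 307, 308) and location:
            chain.append(urljoin(chain[-1], location))  # follow the hop
        else:
            return chain, resp.status  # reached the final destination
    return chain, None  # hop limit hit: likely a loop or a very long chain

chain, status = redirect_chain("http://www.example.com/old-page")
print(f"{len(chain) - 1} hop(s), final status {status}:", " -> ".join(chain))
```

Running this over your internally linked URLs makes chains of two or more hops easy to spot and replace with direct links.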
Optimize your Canonical Space

Test your canonical URL configuration

Creating canonical URLs requires rules, development work and ongoing consistency. There can be many problems in canonical URL implementation - e.g. duplication, character encoding differences or case differences in the canonical URL - that can render your canonicalized site version ineffective. Check your current canonical implementation.

1. Use the DeepCrawl Canonicalized Pages report to review whether your canonical tags have been implemented correctly with your preferred URL. Check that all your canonical URLs use absolute URLs, including the domain and http/https. If they don't, they should.
2. Identify canonical URLs which aren't linked internally as they should be, using the Unlinked Canonical Pages report in DeepCrawl. Check that every canonical URL references an internal URL that exists (has already been invoked). There can be legitimate reasons for canonicalizing to an uninvoked URL, but it is likely to flag a potential canonical issue and is worth investigating.
3. Use the DeepCrawl Pages without Canonical Tag report to find pages that are missing canonical tags altogether. It is now best practice to ensure every page on the site has a canonical tag, even if it canonicalizes to itself - particularly useful as a safeguard if you've had domain duplication issues.

Consistent canonical and social URLs

Driving discovery of your site at scale is vital to traffic growth strategies, and social media and SEO are your tactical tools. Consolidating your quality signals to search engines as part of your crawl space URL management may not be required, but it is recommended.

1. Check the consistency of your canonical URLs and your socially shared URLs (e.g. Open Graph URLs and Twitter Card URLs) - do they match?
2. Use the DeepCrawl Inconsistent Open Graph and Canonical URLs reports to identify any inconsistencies.
3. Ensure your Open Graph tags contain the same page URL as your page's canonical tag.
4. Ensure your socially shared URLs, e.g. Twitter Cards, promote your canonicalized site version (your page's canonical URL).
5. It's a good idea to ensure your canonical tags are correct and generated against a defined set of canonical URL rules before addressing your social media tag consistency, so you are confident you are prioritizing the correct canonicals.

Implement Pagination

Implementing pagination consistently may not help directly with crawl efficiency, but it will help consolidate authority signals across your website URLs.

1. Use the DeepCrawl Paginated Pages report to identify the volume of result-set pages and individual result pages on your website.
2. Check your paginated page setup. If you are using rel=prev/next, use the DeepCrawl Canonicalized Pages report to check that your canonical tags are correct, with each page canonicalizing to itself.
3. If you are not using rel=prev/next, apply it where relevant. Use the DeepCrawl Canonicalized Pages report to check that your canonical tags are not pointing to page one, and instead canonicalize each page to itself. Search engines will then consolidate your page authority across the paginated set.

Implement hreflang

If you have international duplicate sites, you must implement hreflang tags. This is also essential for same-language duplicates, e.g. US, UK, Canada and Australia. DeepCrawl now detects hreflang tags in sitemaps, HTTP headers and on-page, showing you a matrix of language alternatives for every page.

1. Use the DeepCrawl Pages without hreflang Tags report to identify pages that do not have hreflang tags.
2. Use the DeepCrawl hreflang detection within your All URLs report to identify any gaps or inconsistencies in your current implementation.
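hreflang annotations need to be reciprocal ("return tags"): if page X names Y as an alternate, Y must name X back, otherwise the annotations may be ignored. Below is a minimal sketch of that reciprocity check, assuming you have already extracted each page's hreflang map (from tags, headers or sitemaps) into a dictionary; the sample URLs are purely illustrative.

```python
# Sketch: check that hreflang annotations are reciprocal ("return tags").
# hreflang_map would normally be built from page tags, HTTP headers or
# XML sitemaps; the sample data here is illustrative only.
hreflang_map = {
    "https://www.example.com/":   {"en-us": "https://www.example.com/",
                                   "en-gb": "https://www.example.co.uk/"},
    "https://www.example.co.uk/": {"en-gb": "https://www.example.co.uk/"},
    # Missing en-us return tag above -> the pair will be flagged.
}

for page, alternates in hreflang_map.items():
    for lang, alt_url in alternates.items():
        if alt_url == page:
            continue  # a self-referencing entry is fine
        return_tags = hreflang_map.get(alt_url, {})
        if page not in return_tags.values():
            print(f"{alt_url} does not link back to {page} (declared as {lang})")
```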
We hope this guide has given you a good insight into crawl space and how DeepCrawl helps you monitor and optimize your crawl space effectively.

FREE REPORT: https://www.deepcrawl.com/report/
GET STARTED: https://www.deepcrawl.com/pricing/
LOGIN: https://tools.deepcrawl.co.uk/login