The Science of Optimizing Crawl Space
What is Crawl Space?
Crawl space is the totality of possible URLs for a website.
Google started talking about ‘infinite crawl space’ in 2008. They warned about their crawlers wasting
time and bandwidth on websites that contained large sets of URLs, often with no new content for
Google to index.
Google wanted us to optimize the 'crawl space' of our websites down to unique pages: reducing the number of duplicate pages to crawl, providing fresh, content-rich pages and consolidating indexing signals to them.
That was in 2008. What a clue.
Knowing your site architecture is the first step towards successful website optimization.
Identifying your crawl space is the first step towards knowing your site architecture.
Size still matters
Google still warns you about potential issues impacting your crawl space within Webmaster Tools.
Google still outlines what it dislikes and the consequences of not addressing crawl space:
Google wants to avoid unnecessarily crawling a large number of URLs that point to identical or similar content.
Google dislikes having to crawl unimportant parts of your site (importance could be measured by depth of content and engagement).
Googlebot may consume much more bandwidth than necessary.
Google may be unable to completely index all of the content on your site as a result.
Are you feeling bloated?
The most common cause of a bloated crawl space is multiple URLs serving the same webpage.
For example, a single web page can have multiple URL versions of itself.
Domain.com
Domain.com/
Domain.com/home/
Domain.com/?source=googleppc
Domain.com/Home
Domain.com/index
For successful SEO, there should only be one URL per page.
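As a rough illustration of collapsing these variants, here is a minimal Python sketch; the 'source' tracking parameter and the 'home'/'index' aliases are taken from the hypothetical list above, not from any real configuration:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"source"}          # hypothetical tracking parameter from the example above
INDEX_NAMES = {"home", "index"}       # hypothetical aliases of the homepage

def normalize(url):
    # Collapse common duplicate variants: case, trailing slash, index aliases, tracking parameters.
    parts = urlsplit(url.lower())
    path = parts.path.rstrip("/")
    if path.lstrip("/") in INDEX_NAMES:
        path = ""
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS])
    return urlunsplit((parts.scheme or "http", parts.netloc, path or "/", query, ""))

variants = [
    "http://domain.com",
    "http://domain.com/",
    "http://domain.com/home/",
    "http://domain.com/?source=googleppc",
    "http://domain.com/Home",
    "http://domain.com/index",
]
print({normalize(u) for u in variants})   # all six variants collapse to a single normalized URL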
To find out if this is a problem for your website, check Google's/Bing's index to identify how many of your pages are already in the index. Ask yourself: is this a realistic number of pages for your website?
Try checking the URL structures of those pages that are indexed. Do you recognize the structures or formats being indexed?
There could be pages in the search engine index providing you zero return in traffic or value. They
could even be doing you harm, causing duplication and dilution of authority.
Time to sort it out.
Why Bother with
Crawl Space Today?
Googlebot only spends a finite time crawling your site, so make sure it doesn't
waste time crawling low value pages at the expense of the important ones.
This is particularly important for large websites that change frequently, e.g. news
or gambling sites.
Focus Google’s attention on the important valuable URLs that drive traffic and conversions.
Duplicate content causes dilution of page value, weakening authority signals including backlinks and social shares. If you consolidate all near-duplicate URLs so that a single canonical page is optimized to perform, you will be optimizing your valuable pages.
Maximizing your crawl efficiency often means reducing your crawl space, which in turn can reduce the volume of your indexed pages. Often you'll also be working on traffic growth strategies, which require website (and therefore URL volume) expansion.
An expansion strategy requires you to increase the overall volume of your unique indexable pages.
Scaling up websites can be risky. Issues can grow rapidly in proportion to the size of your site: if each page already has 10 duplicate URL variants, a 10x growth in content results in 100x URLs to crawl.
Understanding your crawl space provides the framework to work more
efficiently and manage scale with minimal risk.
What’s in Crawl Space?
To build up a picture of your full site architecture and to discover your
complete crawl space, you’ll need to consider what contributes to your URL
Universe - the place where all theoretically possible URLs exist.
Here’s our DeepCrawl URL Universe checklist:
Invoked URLs
All of the URLs ever brought into existence,
for any reason
Socially Shared URLs
All URLs being shared across social media
platforms, Facebook posts, tweets, etc.
Uninvoked URLs
URLs that haven’t been invoked
(but could be?)
Organic Landing Page URLs
All indexed URLs which have driven traffic
from organic search results
Internally Linked URLs
All invoked URLs which are linked from other
internal pages
Externally Linked URLs
All URLs linked from another site.
URLs In Sitemaps
All of the Invoked URLs which are currently
included in a Sitemap.
Mobile URLs
All URLs from a mobile site or mobile property
that are internally or externally linked to your
main site
Crawlable URLs
All the URLs in the Universe which could be
crawled by the search engine
Translated/Regional URLs
URLs used on different language and
regional copies of a website
Crawled URLs
All the invoked URLs which have been
crawled by the search engine
Shortened URLs
External short URLs (e.g. from URL shorteners) that resolve to internal pages
Indexable URLs
All the URLs in the Universe which could be
indexed by the search engine
Domain Duplicates
Aliased domains www/non-www, http/https
Canonical URLs
All the clean canonical URLs (These pages
should be your crawlable unique pages)
Indexed URLs
All crawled pages which are now in the
search engine’s index
The DeepCrawl Formula
Discovery, management and optimization of your crawl space is essential to
optimizing website architecture and laying the foundation for strong performance.
Maximized + Minimized = Optimized
Minimize your crawl space, maximize your indexable space and promote your optimized, clean site
version. Carefully define and efficiently manage your crawl spaces for optimized website performance
and successful growth initiatives.
Maximize indexable space
Increase the volume of your valuable pages
Increase crawl efficiency
Minimize crawlable space
Define your crawl space
Identify and eliminate threats
Optimize canonical space
A clean version of your website (URLs)
Your URL à la carte
Discover your Crawl Space
DeepCrawl supports a powerful new crawl type - Universal Crawl.
You can crawl your website, XML Sitemaps and organic landing pages in one hit, giving you a significant
head start on defining, managing and optimizing your crawl spaces.
For example, you’ll be able to discover:
Sitemap URLs that aren't linked internally
You can then link them internally, minimizing your crawl space
Linked URLs that aren't in your Sitemaps
You can then add these to your sitemaps, maximizing your indexable space
Linked URLs or URLs in Sitemaps that don’t generate traffic
You could then disallow or delete these, minimizing your crawl space
URLs that generate entry visits but aren’t linked
You could then redirect or re-integrate these into your website, increasing your indexable space
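These comparisons boil down to set arithmetic. Here is a sketch in Python, with small hypothetical URL lists standing in for your crawl, sitemap and analytics exports:

# Hypothetical URL sets standing in for a crawl export, your sitemaps, and analytics landing pages.
linked = {"/", "/products", "/products/widget", "/about"}
in_sitemaps = {"/", "/products", "/old-campaign"}
landing_pages = {"/", "/products/widget", "/retired-guide"}

print("In sitemaps but not linked:", in_sitemaps - linked)                        # candidates to link internally
print("Linked but not in sitemaps:", linked - in_sitemaps)                        # candidates to add to sitemaps
print("Linked or in sitemaps, no traffic:", (linked | in_sitemaps) - landing_pages)  # candidates to disallow or delete
print("Driving visits but not linked:", landing_pages - linked)                   # orphans to redirect or re-integrate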
Manage your Crawl Space
Why manage it?
A lack of Crawl Space management can leave you wide open to threats, and lead to a whole load of
GWT warnings!
Poor management has negative implications:
Orphaned pages
Incomplete sitemaps
Dilution of backlinks, shares and therefore page authority
Performance is harder and more complicated to track
Limited volume of unique URLs in analytics
Crawl inefficiency for SEO
Traffic growth strategies just won’t work
For example - your social share URLs don’t match your canonical URLs
Why worry - are social signals and organic rankings even related?
Twitter & Facebook - No
Google+ - Possibly
Even if social signals are not used in a ranking algorithm, social media and SEO are close in terms of
driving discovery at scale. Why not consolidate your quality signals to search engines as part of your
crawl space URL management?
It may not be required, but it is recommended.
Become the Boss (of URLs)
Take ownership of your website URLs within your organization. This will help you become a great crawl
space manager.
1. Get the tools you need like DeepCrawl and Webmaster Tools (Don’t forget Bing)
2. Fully audit and benchmark your current URL landscape, using Universal Crawl to discover and audit your crawl space.
3. Create an organization-level URL roadmap to communicate to all departments, such as development, marketing, PR and social, ensuring URL consistency and management.
4. Maintain awareness: communicate changes, hold workshops and congratulate successful adherence, e.g. social media campaigns.
The Payoff
Effective crawl space management can avoid dilution of your page authority and facilitate traffic
growth.
You’ll be able to recover backlinks and traffic, improve crawl efficiency and get your valuable
pages crawled first.
You’ll increase the volume of your indexable pages and consolidate the value of social shares
and backlinks.
You’ll run traffic/website growth initiatives that will actually work!
Optimize your Crawl Space
Following the DeepCrawl formula, you’ll want to start minimizing your crawl
space and maximizing your indexable pages, to optimize your crawl space and
promote your clean canonicalized site version.
Here’s how to get started...
Regular DeepCrawl
Crawl your website frequently to understand how your URLs are changing
over time
If you run a dynamic, frequently updated website, you’re likely to be faced with multiple updates from
multiple sources, being applied to multiple areas of your website, database, social media and
advertising, at multiple points within a week/month/quarter.
Running a single crawl won't help you understand how or where your site changes, or make you the boss of your URLs. You also need to include the extra dimension of time in your crawl space definition and management.
DeepCrawl automatically shows you what’s changed between crawls. Trend reports will show you the
extra time dimension.
You might spot some URL formats that are changing frequently and affecting the crawl efficiency of
your website.
Run repeat and scheduled crawls to understand how your website changes over time and find all the
URLs you need to manage within your crawl space.
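If you want to reproduce a basic crawl-to-crawl comparison yourself, a sketch like the one below shows what appeared and disappeared between two crawls; the export file names are hypothetical and each file is assumed to contain one URL per line:

def load(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

previous = load("crawl_january.txt")   # hypothetical export from the earlier crawl
current = load("crawl_february.txt")   # hypothetical export from the latest crawl

print("New URLs since last crawl:", len(current - previous))
print("URLs that disappeared:", len(previous - current))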
Maximize your Indexable Space
Audit your indexable and non-indexable URL parameters
Use DeepCrawl Universal Crawl to understand your current URL configuration and canonical use
1. Use Webmaster Tools URL parameters report to identify the URL parameters in existence on
your domain(s). Download a list with volumes of URLs monitored by Google and an indication of
when and how they have been configured in Webmaster Tools, if applicable.
2. Use site: + inurl: to check indexation of parameters.
3. Run a Universal Crawl on your website (including sitemaps and analytics).
4. Use DeepCrawl indexation reports (All crawled URLs, unique pages, noindex pages, disallowed
URLs etc) to identify URL parameters that you don’t want indexed.
5. Check your robots.txt for the current disallowed URL parameters. Use DeepCrawl disallowed
URLs report to show you parameters currently disallowed in Robots.txt
6. Use the Parameter Removal setting in DeepCrawl to run a new crawl and test the impact of
stripping parameters from your URLs. This is a useful test prior to changing anything in
Webmaster Tools.
7. Apply URL parameter settings in Webmaster Tools to get Google to ignore parameters that are
causing you page bloat in your crawl space. Improve crawl efficiency and your Google Analytics
data, by reducing the number of unique URLs.
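As a rough companion to step 1 above, here is a Python sketch that counts which query parameters appear most often in a crawled URL list; the export file name is hypothetical:

from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Hypothetical export: one crawled URL per line.
counts = Counter()
with open("all_crawled_urls.txt") as f:
    for line in f:
        for key, _ in parse_qsl(urlsplit(line.strip()).query):
            counts[key] += 1

# The most frequent parameters are the first candidates for Webmaster Tools settings or robots.txt rules.
for param, n in counts.most_common(20):
    print(param, n)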
Check your list of disallowed URLs - as expected?
Identify opportunities to disallow additional URLs, and check that everything that is disallowed should be, in case there's an opportunity to increase your unique indexable pages.
Sitemaps should not contain disallowed URLs, and your robots.txt should not contain disallow rules that block URL parameters you need indexed.
1. Download and check your DeepCrawl Disallowed URLs report from a Universal or Website crawl.
2. Run a DeepCrawl Backlinks Crawl, so you can see linked URLs (to your site) that are disallowed.
3. Compare all URLs against your Disallowed URLs report, including Sitemap URLs and URLs with
backlinks.
4. Use your findings to design and test changes to your robots.txt file. Test it using the Robots
Overwrite function in DeepCrawl.
5. Write a new optimized version of your Robots.txt file that increases the number of disallowed
URLs or removes disallows causing unique URLs to be missed.
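For a quick local check, Python's standard robots.txt parser can test a list of URLs against your live rules before you change anything; the domain and URLs below are placeholders:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")   # placeholder domain
robots.read()

urls_to_check = [
    "https://www.example.com/products/widget",
    "https://www.example.com/search?q=widget&sort=price",
]
for url in urls_to_check:
    allowed = robots.can_fetch("Googlebot", url)
    print("allowed" if allowed else "disallowed", url)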
Compare your sitemaps to your linked URLs - anything missing?
Your sitemaps should promote (only) the valuable, unique pages that you want indexed. They should
not include disallowed pages, noindex/nofollow pages, or pages that provide little or low value (thin
content).
1. Use the DeepCrawl Universal Crawl Only in Sitemaps report to identify pages in sitemaps that are not linked internally.
2. Use DeepCrawl Universal Crawl Missing in Sitemaps report, to find linked pages which are not in
the sitemaps but could be.
3. You can also manually diff (in Excel) the list of URLs from a Universal DeepCrawl with the URLs
included in your sitemap(s).
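If you would rather script that diff than use Excel, here is a rough sketch; the sitemap URL and the crawl export file name are placeholders:

import urllib.request
import xml.etree.ElementTree as ET

LOC = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"

with urllib.request.urlopen("https://www.example.com/sitemap.xml") as resp:   # placeholder sitemap
    sitemap_urls = {loc.text.strip() for loc in ET.parse(resp).iter(LOC)}

with open("linked_urls.txt") as f:                                            # hypothetical crawl export
    linked_urls = {line.strip() for line in f if line.strip()}

print("Only in sitemaps (not linked):", sitemap_urls - linked_urls)
print("Missing in sitemaps (linked only):", linked_urls - sitemap_urls)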
Compare your landing page URLs to your linked URLs - anything missing?
Organic landing pages driving you traffic are often valuable but can become orphaned over time.
You can use DeepCrawl Universal Crawl to help you identify orphaned pages that are unlinked and link
them back in for additional value and consolidation of your backlinks/authority.
Use DeepCrawl Universal Crawl Only in Organic Landing Pages report to identify:
Pages that are not linked internally but are included in your organic landing pages (orphaned)
Identify ways to link orphaned pages back into your site and apply to consolidate backlinks,
authority and value
Pages that are linked internally but that are not in organic landing pages
These pages could be missing analytics tracking altogether or indicate indexation/crawl
efficiency areas for investigation and optimization
Use the Universal Crawl settings to adjust the value you assign to your organic landing pages by
choosing the time frame (7-30 days back) and minimum visit volume settings (default is 2 visits
but can be set at any value) for analytics inclusion.
Useful when testing different page categories.
You can also manually diff the list of URLs from a Universal or Website Crawl with the list of organic
landing pages from analytics.
Minimize your crawl space
Check for low value content (bloat)
A low ratio of body content to HTML can prevent a page from being indexed; in some instances it can
cause a 'soft 404' classification in Webmaster Tools.
It can also indicate pages with a high level of scripts, e.g. JavaScript, that could be affecting your page
load times. Slow page load speed can also negatively impact page indexation.
Use DeepCrawl Low Content/HTML Ratio report to:
Highlight extensive use of JavaScript within HTML that might be affecting your page
load times.
Identify pages that require more unique content to be of value, or a reduction of
HTML/JavaScript that is unnecessary to their function.
Address pages that need content additions to be of value, or optimization of their HTML.
If neither is relevant or possible, remove these low value pages from your crawl space.
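As a rough approximation of that ratio, here is a sketch using only the Python standard library; the example page and the 0.10 threshold are arbitrary illustrations, not Google figures:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collects visible text, ignoring script and style blocks.
    def __init__(self):
        super().__init__()
        self.text = []
        self.skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.text.append(data)

def content_ratio(html):
    parser = TextExtractor()
    parser.feed(html)
    visible = "".join(parser.text).strip()
    return len(visible) / max(len(html), 1)

html = "<html><head><script>var x=1;</script></head><body><p>Tiny page.</p></body></html>"
print("content/HTML ratio: %.2f" % content_ratio(html))   # flag pages below an arbitrary threshold, e.g. 0.10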
Identify low value linked URLs and disallow them
Pages that don’t drive any entrance visits could be causing crawl space bloat and might not need to be
indexed or crawled.
1. Use DeepCrawl Universal Crawl Missing in Organic Landing Pages to identify internally linked
pages that have not driven any traffic.
2. Use the Universal Crawl settings to adjust the value you assign to your organic landing pages by
choosing the time frame (7-30 days back) and minimum visit volume settings (default is 2 visits
but can be set at any value) for analytics inclusion.
3. Check these pages' function; if they are not required and are not driving visits, disallow them in robots.txt.
Identify domain aliases and redirect them / use cross-domain canonicals
Domain duplication can cause high-volume, easily detected duplicate content issues and massive
crawl inefficiencies.
1. Get a list of all the registered domains relevant to you and check to see if they return a duplicate
of your site or if they redirect and whether that redirect is returning the correct status.
2. Use e.g. site:www.yourdomain.com inurl:https or inurl:www or inurl:othersubdomains to check
http/https, www/non-www and any testing/staging and subdomain versions of your site URLs in
the search engine index.
3. You can use automation tools such as Robotto to detect indexed subdomains in Google.
4. Search for unique text from your site in Google in quotes.
5. Use backlink data to identify redirecting URLs - Majestic has a useful report showing
redirected domains.
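Here is a sketch of step 1's check, requesting each domain variant and recording its status code and redirect target; the alias list is hypothetical and the requests library is assumed to be installed:

import requests   # assumed installed; any HTTP client will do

# Hypothetical aliases of the same site.
aliases = [
    "http://example.com",
    "http://www.example.com",
    "https://example.com",
    "https://www.example.com",
    "http://staging.example.com",
]
for url in aliases:
    try:
        resp = requests.head(url, allow_redirects=False, timeout=10)
        print(url, resp.status_code, resp.headers.get("Location", ""))
    except requests.RequestException as exc:
        print(url, "error:", exc)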
Identify all linked URLs and get their response codes - as expected?
Check that your website's internal linking is working towards an optimized crawl space.
Use DeepCrawl internal broken links, redirected links, 4xx and 5xx error reports to identify internal
links on your site that are broken or redirected and may be affecting your crawl efficiency.
Remove Duplicates
Page duplication can dilute page and backlink authority and cause valuable pages to be devalued or
not indexed.
Page duplication can take the form of URL duplicates (trailing slash, case issues) and content duplicates
(similar product variants), reducing your crawl efficiency.
1. Use the Duplicate Pages report in DeepCrawl Universal Crawl to identify causes of page duplication.
2. Check for URL duplicates (trailing slash, URL encoding, case issues) and content duplicates
(similar product variants, e.g. sizes).
3. Check for product variations, such as sizing, or URL parameters from sorting and refining
searches, which can often cause soft duplication (and soft 404s in Webmaster Tools).
4. Use the DeepCrawl canonical report to check if you are canonicalizing product variations and
similar URL distinctions correctly or as expected.
5. Sometimes your canonicalized size variations of a product (and other soft duplications) are not
unique enough, in which case you might canonicalize the parameter to remove the duplication.
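For content duplicates (as opposed to URL duplicates), one rough approach is to group pages by a hash of their visible text; the page texts below are hypothetical stand-ins for a crawl extract:

import hashlib
from collections import defaultdict

# Hypothetical mapping of URL -> extracted body text (e.g. from a crawl export).
pages = {
    "/widget?size=s": "Blue widget. Free delivery.",
    "/widget?size=m": "Blue widget. Free delivery.",
    "/widget-guide": "How to choose the right widget for your project.",
}

groups = defaultdict(list)
for url, text in pages.items():
    groups[hashlib.sha1(text.strip().lower().encode()).hexdigest()].append(url)

for digest, urls in groups.items():
    if len(urls) > 1:
        print("Duplicate content group:", urls)   # candidates for canonicalization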
Minimize Redirects
Redirection in volume is not necessarily a direct threat to site indexation, but it does cause crawl
inefficiencies and a potentially bloated crawl space.
1. Use the DeepCrawl 301 Redirects and Non-301 Redirects reports to identify the amount and type
of redirected URLs on your site.
2. Look for internal links to redirected URLs and check whether they need to be redirected or
whether they should be direct links.
For example, websites using segmented databases and site search functions can find dynamic
parts of their websites out of sync. This can auto-generate internal links that go through redirects,
such as 302s, which aren't required for passing page authority and could be optimized as direct
internal links instead.
3. Check for redirecting URLs in Sitemaps. Sitemaps should not contain redirected URLs and
search engines may not follow them, which could prevent other pages from getting indexed.
4. Replace any redirected Sitemaps URLs with the redirect target URL or remove them
from Sitemaps.
5. Check the DeepCrawl Max Redirect Chains report to identify any long redirection chains or loops
that could be reducing crawl efficiency.
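Here is a sketch that follows one URL's redirect chain and reports each hop; requests is assumed to be installed and the starting URL is a placeholder:

import requests   # assumed installed

def redirect_chain(url):
    # Follows redirects and returns every hop; a genuine loop raises requests.TooManyRedirects.
    resp = requests.get(url, allow_redirects=True, timeout=10)
    return [r.url for r in resp.history] + [resp.url]

chain = redirect_chain("http://example.com/old-page")   # placeholder URL
print(" -> ".join(chain), "(%d redirect hops)" % (len(chain) - 1))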
Optimize your Canonical Space
Test your canonical URL configuration
Creating canonical URLs requires rules, development work and consistent maintenance.
There can be many problems in canonical URL implementation, e.g. duplication, character encoding
differences or case differences in the canonical URL, that can render your canonicalized site version
ineffective.
Check your current canonical implementation.
1. Use the DeepCrawl Canonicalized Pages report to review whether your canonical tags have been
implemented correctly with your preferred URL.
Check all your canonical URLs are using absolute URLs, including domain and http/https. If they
aren’t, they should be.
2. Identify canonical URLs which aren’t linked internally as they should be. Use the Unlinked
Canonical Pages report in DeepCrawl.
Check that every canonical URL references an internal URL that exists (has already been invoked)
and is linked internally.
There could be legitimate reasons why you aren't canonicalizing to an invoked URL, but this is likely
to flag potential canonical URL issues and is worth investigating.
3. Use the DeepCrawl Pages without Canonical Tag report to show you pages that are missing
canonical tags altogether.
It is now best practice to ensure every site page has a canonical tag, even if it canonicalizes to
itself; this is particularly useful as a safeguard if you've had domain duplication issues.
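Here is a sketch of the basic check: fetch a page, pull out its canonical tag, and confirm it is absolute and self-referencing. The URL is a placeholder, requests is assumed to be installed, and a simple regex (expecting rel before href) stands in for a proper HTML parser:

import re
import requests   # assumed installed

def canonical_of(url):
    html = requests.get(url, timeout=10).text
    match = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']', html, re.I)
    return match.group(1) if match else None

url = "https://www.example.com/products/widget"   # placeholder URL
canonical = canonical_of(url)
if canonical is None:
    print("No canonical tag")
elif not canonical.startswith(("http://", "https://")):
    print("Canonical is not an absolute URL:", canonical)
elif canonical != url:
    print("Canonicalizes elsewhere:", canonical)
else:
    print("Self-referencing canonical")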
Consistent Canonical and Social URLs
Driving discovery of your site at scale is vital to traffic growth strategies, and social media and SEO are
your tactical tools. Consolidating your quality signals to search engines as part of your crawl space
URL management may not be required, but it is recommended.
1. Check the consistency of your canonical URL and social media shared URLs
(e.g. OG URLs / Twitter Card URLs) - do they match?
2. Use DeepCrawl Inconsistent Open Graph and Canonical URLs reports to identify
any inconsistencies.
3. Ensure your Social Media OG tags contain the same page URL as your page’s canonical tag.
4. Ensure your socially shared URLs e.g. Twitter Cards, are promoting your canonicalized site
version (your page’s canonical URL)
5. It's a good idea to ensure your canonical tags are correct and generated against a set of
defined canonical URL rules, before addressing your social media tag consistency, so you are
confident you are prioritizing the correct canonicals.
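As a minimal extension of the same check, this sketch compares a page's canonical tag with its og:url tag; the URL is a placeholder and regex extraction again stands in for a proper parser:

import re
import requests   # assumed installed

html = requests.get("https://www.example.com/products/widget", timeout=10).text   # placeholder URL
canonical = re.search(r'rel=["\']canonical["\'][^>]*href=["\']([^"\']+)', html, re.I)
og_url = re.search(r'property=["\']og:url["\'][^>]*content=["\']([^"\']+)', html, re.I)

if canonical and og_url and canonical.group(1) != og_url.group(1):
    print("Mismatch:", canonical.group(1), "vs", og_url.group(1))
else:
    print("Canonical and og:url are consistent (or one is missing)")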
Implement Pagination
Implementing pagination consistently may not help directly with crawl efficiency, but it will help to
consolidate authority signals across your website URLs.
1. Use the DeepCrawl Paginated Pages report to identify the volume of result set pages and individual
result pages on your website.
2. Check your paginated page setup. If you are using rel=prev/next, use the DeepCrawl Canonicalized
Pages report to check your canonical tags are correct, with each page canonicalizing to itself.
3. If you are not using rel=prev/next, apply it if relevant. Use the DeepCrawl Canonicalized Pages report
to check your canonical tags are not pointing to page one; instead ensure they are canonicalizing
to themselves. Search engines will then consolidate your page authority across the set of pages.
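To make the intended setup concrete, here is a small sketch that builds the head tags for one page of a paginated series, with each page canonicalizing to itself and rel prev/next pointing at its neighbours; the URL pattern is hypothetical:

def pagination_head(base, page, last_page):
    # Each paginated page canonicalizes to itself; prev/next point to its neighbours.
    tags = ['<link rel="canonical" href="%s?page=%d">' % (base, page)]
    if page > 1:
        tags.append('<link rel="prev" href="%s?page=%d">' % (base, page - 1))
    if page < last_page:
        tags.append('<link rel="next" href="%s?page=%d">' % (base, page + 1))
    return "\n".join(tags)

print(pagination_head("https://www.example.com/category", page=2, last_page=5))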
Implement hreflang
If you have international duplicate sites, you must implement hreflang tags. This is also essential for
same-language duplicates, e.g. US, UK, Canada, Australia.
DeepCrawl now detects hreflang tags in Sitemaps, headers and on-page, showing you a matrix of
language alternatives for every page.
1. Use DeepCrawl Pages without hreflang Tags report to identify pages that do not have
hreflang tags.
2. Use DeepCrawl hreflang detection within your All URLs report, to identify any gaps or
inconsistencies in your current implementation.
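Here is a small sketch of the reciprocal matrix idea: every language/regional version of a page lists every version, including itself. The URLs are hypothetical:

# Hypothetical alternate language/region versions of the same page.
alternates = {
    "en-us": "https://www.example.com/page",
    "en-gb": "https://www.example.co.uk/page",
    "fr-fr": "https://www.example.fr/page",
}

def hreflang_tags(alternates):
    # Every version must carry this full set of tags so the references are reciprocal.
    return [
        '<link rel="alternate" hreflang="%s" href="%s">' % (lang, url)
        for lang, url in sorted(alternates.items())
    ]

for tag in hreflang_tags(alternates):
    print(tag)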
We hope this guide has provided a good insight into crawl space and how DeepCrawl helps you to
monitor and optimize your crawl space effectively.
FREE REPORT
https://www.deepcrawl.com/report/
GET STARTED
https://www.deepcrawl.com/pricing/
LOGIN
https://tools.deepcrawl.co.uk/login