30th June 2014

Google Webmaster Tools - Part 2: Key areas in more detail

Mike Davis
Lead Developer

In part one of this series I gave an overview of Google Webmaster Tools.

In this second post I am going to look at some of the key areas that I have found useful when reviewing a site's listing in Google from a Drupal developer's point of view.

These areas of interest are: Crawl Errors, Fetch as Google and Sitemaps.

Crawl details

Once I have logged in to Google Webmaster Tools and selected the site I want to deal with, I head for the ‘Crawl’ section (on the left hand side), which I have found to be one of the most important areas.

Here you can get information on what site pages Google has crawled, including various errors and details about how many URLs have been indexed from your sitemap.xml file.

Crawl Errors

This section is broken into the different types of errors:

  • Server error
  • Soft 404
  • Access denied
  • Not found
  • Not followed
  • Other

Server error: These are URLs that responded too slowly or that blocked Google in some way. These would typically be pages causing errors on your site, so they should be dealt with fairly urgently.

Soft 404: These pages are interesting. They are like ‘Not found’ pages, but they aren’t strictly invalid pages as they aren’t returning a 404 header response. Google’s help documentation describes these pages as:

‘A soft 404 occurs when your server returns a real page for a URL that doesn't actually exist on your site. This usually happens when your server handles faulty or non-existent URLs as "OK," and redirects the user to a valid page like the home page or a "custom" 404 page.’

In some cases, these pages could be search pages that take in various query parameters to determine the search criteria. As the site's content changes, certain criteria may stop returning any results, and Google can treat these empty result pages as ‘soft 404’ pages too.
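
A quick way to tell a real 404 from a soft 404 is to look at the header a URL actually returns, for example with curl (a rough sketch - the URLs below are just placeholders):

    # A real 404 sends a 404 status in the header:
    $ curl -I http://www.example.com/no-such-page
    HTTP/1.1 404 Not Found

    # A 'soft 404' says the page is missing in the body, but the header still says OK:
    $ curl -I http://www.example.com/removed-page
    HTTP/1.1 200 OK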

Google recommends using your robots.txt file to block such search pages from being crawled, as the content could be misleading. If you are providing a sitemap.xml file, this should contain all of your site's content for Google to index.
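
As a rough sketch, if your search pages all live under a /search path (adjust this to match your own site), the robots.txt entry might look something like:

    User-agent: *
    Disallow: /search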

Access denied: These are fairly obvious - they are pages that Google cannot access.

This might be due to authentication being required or just that Google is being blocked from seeing the page. It's worth keeping an eye on these pages, as an error on a page might be preventing Google from accessing it.

Not found: These are also fairly obvious - they are pages that Google cannot find or that are returning a 404 header response.

This might be due to the page changing URL or just that the page no longer exists. It is worth keeping an eye on these pages, as you might have removed some pages without realising that a page on your own site (or indeed on someone else's site) still links to them.

In the event that the URL has just changed, but the page it was referring to still exists, it is advisable to provide a redirect from the old URL to the new URL so that Google can reindex the correct URL. This should be done using a 301 (permanent) redirect and can be set up in an .htaccess file.
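
A minimal sketch of such a redirect in an .htaccess file (this assumes Apache with mod_alias enabled, and the paths are just placeholders):

    # Permanently redirect the old URL to its new home
    Redirect 301 /old-page http://www.example.com/new-page

On a Drupal site you could also manage redirects like this through a contributed module such as Redirect, rather than editing .htaccess by hand.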

Not followed: These are pages that Google tried to follow but couldn’t for some reason.

Other: This is more of a ‘catch all’ for any pages that couldn’t be accessed but don’t fall into any of the categories above.

What can you do with the list of URLs?

Within each of the above sections, if there are any URLs found, a list will be presented. Clicking on one of the URLs will open up further useful information:

  • Error details: When this error was first detected and why
  • Linked from: Where this URL is linked from (either your own site or external sites)
  • ‘Fetch as Google’: Useful button to see what Google actually sees when it visits the URL

You can also mark URLs as being ‘fixed’, i.e. that they should no longer appear in that list.

This will remove them from the list, but if Google detects them again they will get added back.

However, if the content really has been fixed, the URL will be removed from the relevant list automatically the next time Google crawls the site; marking it as fixed yourself is more for your own sanity and for making it easier to see what is still to be sorted out.

Fetch as Google

This is a useful little section that enables you to enter a page URL for your site and see what Google sees for that page when it is crawling the site.

Sitemaps

If you have provided a sitemap.xml file to Google, this section gives further details on the number of pages that the sitemap contains against the number of pages that Google has actually indexed.

Google says that it won’t guarantee to index all the site's pages, so don’t expect these figures to match up, but they do give you a good indication of the number of pages that Google is actually aware of.
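
For reference, a sitemap.xml file is simply an XML list of the URLs you want Google to know about. A minimal sketch looks like the following (the URL and dates are placeholders; on a Drupal site this file is typically generated for you by a module such as XML sitemap):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/about-us</loc>
        <lastmod>2014-06-01</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.5</priority>
      </url>
    </urlset>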

Other resources

To be honest, I haven’t looked through all the items in here yet, but the main one that I have used is PageSpeed Insights.

This is a great little tool that analyses your site URL and tells you how it can perform better and faster. This is always worth having a look at to see how your site is performing, as sometimes small changes can make a big difference.

In Part 3...

I will analyse how data from Google Webmaster Tools helps me understand sites better and improve their standing in Google, complete with examples.

Follow @deeson_labs for all the latest blogs!