Hi All,
Wondering if you're able to help with, or at least validate, some of the Googlebot crawl behaviour I've been seeing recently. I'll outline the issue and context below.
Apologies in advance for the long read, but I felt the context is important for understanding the issue.
I manage SEO for a fairly large website (3 million plus pages) in Australia.
On this particular website, we've had no development work or releases for at least the last 4 to 5 months: no modification of code in any shape or form.
Site navigation for a human user involves a lot of AJAX calls as the person drills down to the information they need. There is both a search function and a browse function, either of which the user can take to reach that information. Both routes are AJAX implementations that refresh content in place as the user progresses, without reloading pages.
From an SEO perspective, we assigned a URL to each of these "pages" as the user clicks through, so that we could add those "paths" to our sitemap.xml for Google to crawl and discover our content and site hierarchy.
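For context, a sitemap entry for each of these assigned URLs looks roughly like the following. The domain and paths are placeholders matching the example further down; this is a sketch, not our actual sitemap:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per predefined AJAX "page" (placeholder paths) -->
  <url><loc>http://www.domain.com/state</loc></url>
  <url><loc>http://www.domain.com/state/suburb-postcode</loc></url>
  <url><loc>http://www.domain.com/state/suburb-postcode/street</loc></url>
</urlset>
```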
However, as of the 10th of December 2014, we've noticed some really strange patterns emerging in Googlebot activity across our site.
We've analysed our server logs in Splunk and discovered two things:
ONE - The AJAX calls that load content via our browse functionality are being crawled by Google, but also,
TWO - Googlebot is filling in our search form and conducting searches of its own.
Now, I've got no issue with them crawling the AJAX calls and rendering the content. I assume they need to do that to ensure we're not cloaking, i.e. serving different content to users than to search engines.
They seem to be modifying the presented URLs by:
1. Injecting null or incorrect data into the search fields, which results in either error pages or content that is of no use to a human user
2. Stripping important fields out of the presented AJAX URLs, which produces an invalid request and sends the user / engine to a parent-level page via a 301 redirect.
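To make point 2 concrete, here's a rough sketch in Python of the kind of server-side rule being described: any path segment containing an injected "null" is treated as invalid and 301-redirected to its deepest valid ancestor. The function name and validation rule are illustrative only; the real logic is obviously application-specific.

```python
def redirect_target(path):
    """Return the parent-level path to 301 to when a path segment is
    invalid (e.g. contains an injected 'null'), or None if the path is
    valid and should be served normally.

    Illustrative sketch only -- real validation rules are app-specific.
    """
    segments = [s for s in path.split("/") if s]
    for i, seg in enumerate(segments):
        if "null" in seg.split("-"):  # a required field was nulled out
            # 301 to the deepest valid ancestor page
            return "/" + "/".join(segments[:i]) if i else "/"
    return None  # path is valid

print(redirect_target("/state/suburb-null-null"))  # /state
print(redirect_target("/state/suburb-postcode"))   # None
```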
EXAMPLE:
So here's the example that will help explain it better...
User arrives at www.domain.com
User browses to the next page (non-AJAX URL is www.domain.com/state)
Then on to the next one (non-AJAX URL is www.domain.com/state/suburb-postcode)
On the next page the user can go one of three ways (non-AJAX URL is one of:
www.domain.com/state/suburb-postcode/street
www.domain.com/state/suburb-postcode/category
www.domain.com/state/suburb-postcode/business-page)
These pages are all predefined and the valid pages are all part of an extensive sitemap.
Until now, Google has been crawling these pages and indexing the content accordingly.
What I'm not happy about is the change they made to their crawl patterns last week, because it impacts not only our users but Google's users as well: Google is presenting irrelevant and invalid results as part of its index.
By modifying the AJAX URLs and stripping out the important fields, they are requesting URLs like:
www.domain.com/state/suburb-null
www.domain.com/state/suburb-null-null
www.domain.com/state/suburb-null-null-null
....
www.domain.com/state/suburb-null-null-null-null-null-null-null-null
www.domain.com/state/suburb-null-null-null-null-null-null-null-null-null
www.domain.com/state/suburb-null-null-null-null-null-null-null-null-null-null
You get the idea...
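For anyone wanting to quantify the same pattern in their own logs, a quick sketch like this counts the null-suffixed request paths, assuming you can export the requested path column from Splunk (the log shape is an assumption on my part):

```python
import re
from collections import Counter

# Matches paths whose final segment ends in one or more '-null' tokens,
# e.g. /state/suburb-null or /state/suburb-null-null-null
NULL_PATH = re.compile(r"^/\S+-null(?:-null)*$")

def count_null_requests(paths):
    """Bucket null-suffixed request paths by how many 'null' tokens they carry."""
    hits = Counter()
    for p in paths:
        if NULL_PATH.match(p):
            hits[p.count("null")] += 1
    return hits

sample = [
    "/state/suburb-null",
    "/state/suburb-null-null-null",
    "/state/suburb-postcode",  # valid page, ignored
]
print(count_null_requests(sample))
```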
Over the last 3 days, the number of such null pages in the Google index has gone from 1,000 to 18,500 to 22,600 today.
I'm wondering why Google would be doing this. All of those pages 301-redirect to the parent page.
Why then would Google choose to:
1. Request such pages by injecting irrelevant data in the first place
2. Include such pages in its index despite the valid 301
3. Crawl such pages at all, given they're not part of our sitemap.xml
Any ideas / help / theories?
[edited by: aakk9999 at 1:54 am (utc) on Dec 19, 2014]