Forum Moderators: Robert Charlton & goodroi


Googlebot seen making AJAX Crawls like a human user

         

EvilSaint

1:24 am on Dec 19, 2014 (gmt 0)

10+ Year Member



Hi All,

Wondering if you guys are able to help, or at least validate, some of the things I've been seeing in Google's crawl behaviour recently. I'll outline the issue and the context below.

Apologies in advance for the long read; I felt the context is important for understanding and deciphering the issue.

I manage SEO for a fairly large website (3 million plus pages) in Australia.

On this particular website, we've not had any development work or releases done for the last 4 to 5 months at least. No modification of code in any shape or form.

Site navigation for a human user involves a lot of AJAX calls as the person works through the site towards the information they need. There is both a search function and a browse function the user can take to get there. Both routes are active AJAX implementations, with the content being refreshed as the user progresses, without page reloads.

From an SEO perspective, we assigned URLs for each of these "pages" as the user clicks through to each so that we can add these "paths" to the Sitemap.xml for Google to crawl and discover our content and site hierarchy.

However, last week as of the 10th of December 2014, we've noticed some really strange patterns emerging on Googlebot activity across our site.

We've looked at our server logs using Splunk and discovered two things:

ONE - Googlebot is crawling the AJAX calls that load content via our browse functionality, and also,

TWO - Googlebot is filling in our search form and conducting searches.

Now, I've got no issue with them crawling the AJAX and rendering the content. I guess that they must need to use that to ensure that we're not cloaking the content for users and for search engines and serving up different experiences.

They seem to be modifying the presented URLs by:

1. Injecting null or incorrect data into the search fields, which results in either error pages or content that is of no use at all to a human user

2. Stripping particular important fields out of the presented AJAX URLs, which produces an invalid request and sends the user / engine to a parent-level page via a 301 redirect.

EXAMPLE:

So here's the example that will help explain it better...

User arrives at www.domain.com
User browses to the next page (non-AJAX URL is www.domain.com/state),
then onto the next one (non-AJAX URL is www.domain.com/state/suburb-postcode).
On the next page the user can go one of 3 different ways (non-AJAX URL is one of:
www.domain.com/state/suburb-postcode/street
www.domain.com/state/suburb-postcode/category
www.domain.com/state/suburb-postcode/business-page)

These pages are all predefined and the valid pages are all part of an extensive sitemap.

Until now, Google has been crawling these pages and indexing the content accordingly.

What I'm not happy about is the change they made last week, because it will impact not only our users but Google's users as well: Google is presenting its users with irrelevant and invalid results as part of its index.

By modifying the AJAX URLs and stripping out the important fields, they are requesting URLs like:

www.domain.com/state/suburb-null
www.domain.com/state/suburb-null-nul
www.domain.com/state/suburb-null-null-null
....
www.domain.com/state/suburb-null-null-null-null-null-null-null-null
www.domain.com/state/suburb-null-null-null-null-null-null-null-null-null
www.domain.com/state/suburb-null-null-null-null-null-null-null-null-null-null

You get the idea...

Over the last 3 days, the number of such null pages in the Google index has gone from 1,000 to 18,500 to 22,600 today

I'm wondering why Google would be doing this. All those pages redirect to the parent page using a 301 redirect.

Why then would Google choose to:
1. Request such a page at all, passing in such irrelevant data
2. Include such pages in their index despite there being a valid 301
3. Crawl such pages knowing they're not part of our sitemap.xml

Any ideas / help / theories?

[edited by: aakk9999 at 1:54 am (utc) on Dec 19, 2014]

lucy24

6:53 am on Dec 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm wondering why Google would be doing this?

Ah, everyone's favorite question: Why does Google {insert-verb-here}? But also the most ultimately useless question.

All those pages redirect to the parent page using a 301 redirect.

Now that sounds like a perverse version of G's perennial favorite, the soft-404 test. They're just getting more sophisticated about it.

www.domain.com/state/suburb-null-null-null-null-null-null-null-null-null-null

There's been a slew of recent Weird Google Behaviors. That is, ahem, weirder than usual. Others include requesting nonexistent directories and requesting URLs containing multiple // slashes. So far, the only safe fix is to nip them in the bud by rigorously redirecting to the correct form.

You might also think about tweaking your page-building code-- in this case Ajax, but it could be anything-- so that bogus input leads to something other than a fresh URL with a fresh 200 response.
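A minimal sketch of that tweak, in JavaScript. Everything here is an assumption: the function name, the path scheme (modelled on the /state/suburb-postcode examples in this thread), and the placeholder suburb entries. The idea is just to validate each requested path against a known-good list before building a page, so bogus input never earns a fresh 200.

```javascript
// Placeholder entries; in practice this set would be built from the sitemap data.
const VALID_SUBURBS = new Set(["richmond-3121", "fitzroy-3065"]);

function isValidSuburbPath(path) {
  // Accept only /state/<suburb>-<4-digit postcode>; anything else is bogus
  const m = /^\/state\/([a-z]+(?:-[a-z]+)*-\d{4})$/.exec(path);
  return m !== null && VALID_SUBURBS.has(m[1]);
}
// A request failing this check should get a 404/410, not a fresh 200.
```

Note that "suburb-null" fails the postcode pattern outright, and a well-formed but unknown suburb fails the set lookup.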

3. Crawl such pages knowing they're not part of our sitemap.xml

Oh, that one's straightforward. A sitemap is inclusive, not exclusive. "When you crawl, be sure not to overlook A, B and C" not "Crawl only A, B and C."

FranticFish

8:20 am on Dec 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I first saw Google running search forms on websites about 8 years ago, and although I've yet to see it first hand, I read about Google adding fake variables to dynamic pages and indexing non-existent URLs over 5 years ago.

Trying rewritten versions of dynamic URLs is just an extension of that previous 'nosy parker' behaviour.

Your programmers simply need to lock down your pages so that requests for non-existent variables return a 410 page. This is not just for Google: on a dynamic site, a 200 response on a dynamic URL containing a database error could reveal how the INSERT and UPDATE commands for your db are put together, which could be of interest to hackers.

trabis

8:50 pm on Dec 19, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Your programmers simply need to lock down your pages so that requests for non-existent variables return a 410 page. This is not just for Google: on a dynamic site, a 200 response on a dynamic URL containing a database error could reveal how the INSERT and UPDATE commands for your db are put together, which could be of interest to hackers.


Using 410 for security reasons sounds like a lazy patch.
#1 Never trust user input. SQL injections cannot happen if you sanitize input properly.
#2 Only developers should have access to error reporting.
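To illustrate point #1, here's a sketch in JavaScript. The db wrapper, table name and field name are all hypothetical; the point is the shape of the calls, not any particular library.

```javascript
// BAD: user input concatenated straight into the statement:
//   db.query("SELECT * FROM listings WHERE suburb = '" + input + "'");
// GOOD: bound parameter; the driver handles quoting:
//   db.query("SELECT * FROM listings WHERE suburb = ?", [input]);

// Defence in depth: an allowlist check before the query ever runs.
// Letters, digits, spaces, hyphens and apostrophes cover AU place names.
function sanitizeSuburb(input) {
  return /^[a-z0-9' -]+$/i.test(input) ? input : null;
}
```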

lucy24

10:06 pm on Dec 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Your programmers simply need to lock down your pages so that requests for non-existent variables return a 410 page.

Why do you recommend 410 instead of 404? Sure, a 410 makes the Googlebot stop crawling sooner-- assuming it's visiting URLs it has previously requested-- but it's otherwise not exactly truthful.

FranticFish

4:56 am on Dec 20, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@trabis - I get what you're saying, I made a mistake in my previous post confusing a request with an input. I'm not a programmer (fortunately for my clients, someone else does that).

To clarify, if a user requests one of your dynamic pages with non-existent variables or bad data for an existing variable you tell them that (a) the information they requested doesn't exist, (b) without revealing the server-side variable names or any db table names and (c) with a 400 response.

@lucy24 - from the start, and going forward, you'd probably want a 400 BAD REQUEST. But if these pages have already been indexed then you might want a 410 GONE instead.

EvilSaint

11:28 pm on Dec 22, 2014 (gmt 0)

10+ Year Member



Thank you so much @lucy24, @trabis and @FranticFish for taking the time to look into it.

It's really been bizarre behaviour, to the point where the injected queries came close to taking down our site.

I've manually slowed Google's crawl rate for the site, considering there are no new pages being created and no active development on it.

My issue with the 404 or 410 is that the site is a navigation site in Australia and some area / location names are quite lengthy and difficult to spell correctly. I just don't want any of our human users suffering because they don't know the correct spelling of the place they're looking for on our site.
For such users, we have a "did you mean?" type of result set returned so that they may choose the correct one.

The other issue is that I can't block off queries for the word "null", because in Australia there are roads actually named "null" :P
(Thank you various councils and town planners for that splendid work)

@lucy24 - great input about sitemaps being inclusive not exclusive. That makes things a lot harder!

Our development manager and support team were actually going to block Googlebot from the server when they saw this happening.

Errors have gone up on Google Webmaster Tools as well. Of course there will be more errors when Google is doing the wrong thing :P

We've blocked Googlebot from accessing the AJAX modules for search via robots.txt at the moment to hold them back.
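For reference, a rule of the sort described might look like the below. The /ajax/search path is a placeholder, since the thread doesn't name the real module paths. One caveat worth knowing: robots.txt stops crawling, but URLs Google has already discovered can remain in the index.

```
# Placeholder paths -- the real AJAX search endpoints would go here
User-agent: Googlebot
Disallow: /ajax/search
```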

lucy24

1:19 am on Dec 23, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For such users, we have a "did you mean?" type of result set returned so that they may choose the correct one.

That's absolutely fine. But remember that the physical page seen by a human need not have anything to do with the numerical response returned by the server (and hence recorded by the search engine). So with a few code tweaks you should be able to make "Did you mean..." into a dynamic 404 page.
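A sketch of that decoupling, with all names and page content hypothetical: the body a human sees is independent of the status code a crawler records.

```javascript
// Placeholder content keyed by slug; in practice this is the real page lookup.
const KNOWN_PLACES = new Map([
  ["nullarbor", "<h1>Nullarbor</h1>"],
  ["wollongong", "<h1>Wollongong</h1>"],
]);

function buildResponse(slug) {
  const key = slug.toLowerCase();
  const page = KNOWN_PLACES.get(key);
  if (page) return { status: 200, body: page };
  // Unknown place: still show the friendly "Did you mean...?" page,
  // but send it with a 404 so the URL never enters the index.
  const guesses = [...KNOWN_PLACES.keys()].filter((k) => k[0] === key[0]);
  return { status: 404, body: "Did you mean: " + guesses.join(", ") + "?" };
}
```

(The suggestion logic here is deliberately trivial; a real site would use its existing "did you mean?" machinery and only change the status code.)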

There isn't really a place called null-null-null-etcetera is there?
:: vague mental association with Nullarbor, perennial contender in Silliest Name Ever category ::

If you're getting an awful lot of requests in the form
(-[a-z]+)\1{2,}

you could redirect them to the non-duplicated form. (Uh... JavaScript can do this, can't it? I know there are some limitations to the things JavaScript can do with regular expressions, but I'm pretty sure I've used this formulation.)
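That pattern drops straight into JavaScript. This sketch (function name is mine) just computes the canonical target that a server-side 301 would then point at:

```javascript
// Collapse a hyphenated segment repeated three or more times to one copy
function canonicalise(path) {
  return path.replace(/(-[a-z]+)\1{2,}/g, "$1");
}
```

As written, the {2,} quantifier lets the two-copy form (suburb-null-null) slip through; swapping it for \1+ would catch that case as well.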

Robert Charlton

2:46 am on Dec 23, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



My issue with the 404 or 410 is that the site is a navigation site in Australia and some area / location names are quite lengthy and difficult to spell correctly.

Given the location name situation and the 3 million plus page size of your site, I'd suggest that, whatever else you do to resolve the AJAX urls situation, you add an autocomplete function to your form entry. Your site search, of course, should have autocomplete as well.

I'm not a programmer, so I can't tell you how to do it, but all of your users, particularly your mobile users, would benefit from the feature.

It's possible, btw, and this is complete conjecture... that Google might be doing these crawls on your site because it has already encountered a lot of misspellings, and is trying to diagnose why. As lucy24 suggested above, you shouldn't be returning valid URLs for content that doesn't exist.

TheMadScientist

3:00 am on Dec 23, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My issue with the 404 or 410 is that the site is a navigation site in Australia and some area / location names are quite lengthy and difficult to spell correctly. I just don't want any of our human users suffering because they don't know the correct spelling of the place they're looking for on our site.

I haven't read through the whole thread [really only skimmed portions], but one of the things that jumps out at me is PHP's pspell [php.net...] with a custom dictionary containing valid place names, so you don't have to accept queries that are invalid against that dictionary. It might be a good option in this case.

You might also want to look into homophone, metaphone and soundex algorithms to narrow down what visitors really mean -- they apply well in some cases, but not as well in others, so which is best for a specific situation is, well, situational, IME.
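The thread names pspell (PHP), but the phonetic-matching idea is language-neutral. Here's a simplified Soundex in JavaScript for matching misspelt place names against a dictionary of valid ones. One caveat: classic Soundex gives H and W special treatment; this sketch treats them like vowels, which is good enough to show the idea.

```javascript
function soundex(name) {
  // Consonant groups from the standard Soundex table
  const codes = { b: 1, f: 1, p: 1, v: 1,
                  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
                  d: 3, t: 3, l: 4, m: 5, n: 5, r: 6 };
  const s = name.toLowerCase().replace(/[^a-z]/g, "");
  if (!s) return "";
  let out = s[0].toUpperCase();
  let prev = codes[s[0]] || 0;
  for (let i = 1; i < s.length && out.length < 4; i++) {
    const c = codes[s[i]] || 0;
    if (c !== 0 && c !== prev) out += c; // skip vowels and doubled codes
    prev = c;
  }
  return out.padEnd(4, "0"); // pad short codes, e.g. "Lee" -> "L000"
}
```

A misspelling like "Wolongong" then hashes to the same code as "Wollongong", so a dictionary lookup keyed on the code can still offer the right suggestion.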

EvilSaint

4:48 am on Dec 23, 2014 (gmt 0)

10+ Year Member



@RobertCharlton and @theMadScientist - we have an autocomplete on our site which kicks in when you enter any two letters.

However, the site covers not only place and location names but also business listings, which tend to have even more unique and difficult names at the best of times, with apostrophes and other special characters thrown in.

So we've got the location stuff down pat using our autocomplete for the human users.

@lucy24 - great suggestion with the additional -null-null type queries and limiting those algorithmically. I'll have a chat to the devs on a way to process these.

Thank you all for your suggestions and help with troubleshooting!

TheMadScientist

6:13 am on Dec 23, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



...we have an autocomplete on our site which kicks in when you enter any two letters.

Does it limit entries to keep you from getting overloaded, or does it simply correct the entries it sees as incorrect while loading an actual page? If it doesn't limit the entries and serve a 404 for incorrect entries, rather than providing suggestions and loading a page of results even though an entry is incorrect, I can see how your server might get overloaded -- Hope that makes sense.