Googlebot Now Crawls via HTML Forms

Forum Moderators: Robert Charlton & goodroi

tedster

1:52 am on Apr 12, 2008 (gmt 0)

There's some interesting news today on Google's official blog:

In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form.
For text boxes, our computers automatically choose words from the site that has the form;
For select menus, check boxes, and radio buttons on the form, we choose from among the values...
[googlewebmastercentral.blogspot.com...]
I believe we now have at least some of the answer for two mysterious threads from the recent past:
Google indexing large volumes of (unlinked?) dynamic pages [webmasterworld.com]
Google is indexing site search results pages [webmasterworld.com]
I'd guess that during the development phase of this new form of crawling, Googlebot may not always have been as well-behaved as the engineers planned.

theBear

2:09 am on Apr 12, 2008 (gmt 0)

Looks like it is time to hide the forms.

treeline

3:40 am on Apr 12, 2008 (gmt 0)

A small number? Last fall I had to block (robots.txt) a key set of forms when googlebot discovered them. The data updates hourly with thousands of possible combinations, and G wanted all of it all the time.
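For anyone wanting to check a block like this before relying on it, here is a minimal sketch using Python's standard `urllib.robotparser`. The `/forms/` path and the example.com URLs are assumptions for illustration only, not the poster's actual setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /forms/ stands in for wherever the form handler lives.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /forms/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Report whether the rules in ROBOTS_TXT let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("Googlebot", "http://example.com/forms/query?region=north"))
    print(is_allowed("Googlebot", "http://example.com/about.html"))
```

Note that the standard-library parser only does simple prefix matching, so it is a rough check rather than a model of how any particular crawler reads the file.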

willybfriendly

4:32 am on Apr 12, 2008 (gmt 0)

So what happens if they stumble into protected areas of a site? Where is the liability?

I could see this being a realistic scenario given how poorly some people select things like passwords and usernames.

Rosalind

6:00 am on Apr 12, 2008 (gmt 0)

Only a small number of particularly useful sites receive this treatment,

I'm hoping this means that if you don't have sitelinks it doesn't apply. I don't have many forms which use GET, if any, but it would be tedious to have to comb my websites to check them all just in case. I'm underwhelmed by this development: if it's behind a form, I can't think of any instance where I want it spidered.

arikgub

6:56 am on Apr 12, 2008 (gmt 0)

How can a webmaster block this? Will they come up with a "nofollow" attr for the <form> tag now?

Oliver Henniges

7:00 am on Apr 12, 2008 (gmt 0)

> I'm underwhelmed by this development: if it's behind a form, I can't think of any instance where I want it spidered.

Same for me. My first thought was that tedster had come across an April Fool's joke twelve days late ;)

But indeed I found Googlebot using some mysterious GET variables on my shop page quite recently. As a matter of fact, these needles were already outdated by then, because I redesigned my system in February.

If of interest: This older variable used ?sch= plus integer in order to preserve the n-th step of a navigation-action inside my shop's search function. I used to display a random collection of my products if the internal product-search-function gave zero results and I found it quite annoying if you saw something completely different using the back button. Currently I work with other needles and skipped randomization completely.

For a few weeks now I have been logging such GET requests because of some amateurish hacker attacks, and I was really surprised to find Googlebot among the delinquents. Given my explanation above, I found it really mysterious that Googlebot had used integers as high as 22348 for testing, and the paranoid in me said "That was a human employee ;)"

What bothers me most is the duplicate content issue: if Google indexes a shop's search function, it is quite likely that pages with different URIs show exactly the same result, even if it is only two-word searches in a different word order. Until now, I never cared about the meta tags of my CGI output, but given this new development, I should probably go through the whole structure and make sure specific pages are displayed with "noindex"?
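The word-order point can be illustrated with a short Python sketch. It assumes a hypothetical search parameter named `q` (the post does not say what the shop uses) and shows two things: how differently ordered queries collapse to one normalized key, and how a results page could carry a robots `noindex` meta tag in its head.

```python
import html
from urllib.parse import parse_qs, urlsplit

# Standard robots meta tag asking crawlers not to index the page.
ROBOTS_META = '<meta name="robots" content="noindex">'


def canonical_query(url: str, param: str = "q") -> str:
    """Normalize the search terms in `url` so word order and case do not matter.

    `param` is a hypothetical parameter name chosen for this example.
    """
    query = parse_qs(urlsplit(url).query)
    terms = " ".join(query.get(param, [])).lower().split()
    return " ".join(sorted(set(terms)))


def render_results_head(title: str) -> str:
    """Build the <head> of a search results page, marked noindex."""
    return f"<head><title>{html.escape(title)}</title>{ROBOTS_META}</head>"


if __name__ == "__main__":
    a = canonical_query("http://example.com/search?q=blue+widget")
    b = canonical_query("http://example.com/search?q=Widget+blue")
    print(a == b)  # both URLs describe the same search
    print(render_results_head("Search results"))
```

Whether a given search engine treats such pages as duplicates is not something this sketch can answer; it only shows why two distinct URIs can return identical output.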

The other thing is malicious behaviour of competitors as described in this thread [webmasterworld.com]: If a competitor puts links to your site using varying GET variables, most of your pages will always display the same result and may trigger filters. BTW: Are both phenomena related? Does the serious damage reported in that thread have to do with Google spidering random form variables?

> Only a small number of particularly useful sites receive this treatment,

The honour is well appreciated;)

himalayaswater

7:42 am on Apr 12, 2008 (gmt 0)

Now I know why the domain . com/search/news/ page is in Google's results, even though I don't have such a page. Naturally, it returns a 404 error.

bouncybunny

8:10 am on Apr 12, 2008 (gmt 0)

Well, as the starter of one of the earlier threads on this subject, that does indeed make some sense. And my form is indeed a GET form. Although, Google's use of the term 'small number' is interesting.

My big dilemma was whether to use robots.txt to exclude googlebot. On the one hand I was concerned about duplicate content. But on the other hand, are these dynamically created and potentially indexed pages really duplicate content? These 'search results' pages don't exist in any other format.

Moreover, as I said on my earlier thread, Google appeared uncannily accurate with the words that it chose to use to run queries on my site (my form is a site-search cgi script that searches a knowledge base section of my site). If those results then appear in the SERPS for keywords that I (or others) have not necessarily used in link text, is that really such a bad thing? Perhaps someone searching for that particular concept will be really pleased that they have found this result.

I eventually decided to exclude the search script from googlebot, mainly for good old fashioned duplicate content paranoia reasons. But I wonder if it is time to rethink this kind of thing? Have we been taking the whole duplicate content thing too literally? I assume that Google knows where it has drawn these results from and perhaps there is no need to be concerned?

This paragraph is particularly interesting:

The web pages we discover in our enhanced crawl do not come at the expense of regular web pages that are already part of the crawl, so this change doesn't reduce PageRank for your other pages. As such it should only increase the exposure of your site in Google. This change also does not affect the crawling, ranking, or selection of other web pages in any significant way.

On the other hand, unless Google gives some direct and clear guidelines, it is far too tempting to err on the side of caution. Which may be a shame for webmasters and searchers alike.

Any thoughts?

confuscius

8:27 am on Apr 12, 2008 (gmt 0)

I too was one of the chosen few BUT I am none too happy. My site is a compact, highly controlled directory of sites about a particular niche, well structured and SEO'ed to the best of my ability! I also have an authority listing. The site contains a site search with a few available sort orders, so Google picks a word and a sort order, runs the search query and adds another page to its index. A site: query now shows that, as every day passes, my site looks more like a list of unstructured search queries, which now outnumber the original structured directory pages.

I presume that I can use a wildcard block in robots to stop further damage but how the heck do I remove the damage already done?
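As a rough illustration of what a wildcard block would catch, here is a Python sketch of the pattern style Google documents for robots.txt, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The rules and paths below are hypothetical, and the sketch ignores Allow lines and rule precedence, so it is not a full model of any crawler.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one Disallow pattern into a regex anchored at the path start."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape literal pieces and let each "*" match any sequence of characters.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(path: str, disallow_rules: list[str]) -> bool:
    """True if any Disallow pattern matches the start of `path`."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)


# Hypothetical rules for a site search with a sort-order parameter.
RULES = ["/search?", "/*sort="]

if __name__ == "__main__":
    print(is_blocked("/search?q=widgets", RULES))
    print(is_blocked("/list.cgi?cat=3&sort=name", RULES))
    print(is_blocked("/directory/widgets.html", RULES))
```

This only covers stopping further crawling; it says nothing about how already indexed pages get removed.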

Begs the question, what is the point in designing a nicely structured site if Google can simply completely screw it up for you?

I would love to hear Google's comments on this. So the theory that one site can screw up another is confirmed beyond all reasonable doubt and the attacker is ..... Google!

koan

9:58 am on Apr 12, 2008 (gmt 0)

So a site can have half its normal pages in the supplemental index yet Google is still hungry for useless pages behind forms?

Oliver Henniges

10:12 am on Apr 12, 2008 (gmt 0)

> are these dynamically created and potentially indexed pages really duplicate content?

The question is whether Google PERCEIVES them as duplicate and then triggers a filter, which may hurt you. Given that the canonical issue is said to have caused so many problems, we cannot be careful enough. And the difference between mydomain.com, www.mydomain.com and mydomain.com/index.html is something Google should have fixed in the first place (without the help of webmasters via WTC), before beginning to crawl uncontrollable stuff behind forms.

Disclaimer: Normally I do not use the phrase "google should", but this issue is probably worth an exception.

The more I think about it, the more mysterious things get: It is relatively easy as long as we only have to deal with GET requests leading to idempotent CGI URIs. But what about POST forms? Will googlebot randomly apply for mail accounts on the various public mailservers and then send billions of mails? Will googlebot place orders in my shop (Nice rich customer, eh? ;)? Will googlebot place trillions of postings in forums and blogs (I'd like to know what golem has to say meanwhile ;)?

glitterball

10:13 am on Apr 12, 2008 (gmt 0)

I'm not pleased with this development either.
I have always understood that Search Engines do not complete forms and that this has been an unwritten rule since the start.

I really do not see the point in Google indexing search results for random keywords. This is bound to create all kinds of unforeseen problems.

DamonHD

11:06 am on Apr 12, 2008 (gmt 0)

GET operations should do no harm given the meaning of GET (idempotent and safe).

It would be wrong of Google to perform other operations with forms, such as POST, which might cause a stateful (and expensive) side-effect, such as ordering goods, etc.

Google seems to be clearly restricting itself to GET.
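The GET versus POST distinction can be sketched in a few lines of Python. The `Shop` class, its stock data and the status codes it returns are invented for this example; the point is only that a GET handler reads state while a POST handler changes it.

```python
from dataclasses import dataclass, field


@dataclass
class Shop:
    """Hypothetical shop back end used only to illustrate safe vs. unsafe methods."""
    stock: dict[str, int]
    orders: list[str] = field(default_factory=list)

    def handle(self, method: str, item: str) -> tuple[int, str]:
        if method == "GET":
            # Safe: reads state and never changes it, so repeated crawler hits are harmless.
            return 200, f"{item}: {self.stock.get(item, 0)} in stock"
        if method == "POST":
            # Not safe: every successful call changes stock and records an order.
            if self.stock.get(item, 0) < 1:
                return 409, "out of stock"
            self.stock[item] -= 1
            self.orders.append(item)
            return 201, "order placed"
        return 405, "method not allowed"


if __name__ == "__main__":
    shop = Shop(stock={"widget": 1})
    print(shop.handle("GET", "widget"))   # can be repeated any number of times
    print(shop.handle("POST", "widget"))  # changes state
    print(shop.handle("POST", "widget"))  # a second identical request now fails
```

A form that triggers side effects through GET breaks this assumption, which is where a form-following crawler could do real damage.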

As it happens, mainly because I don't like search-results-in-search from other sites, I just moved my biggest site to have the search/results page noindex (though not nofollow).

Rgds

Damon

Miamacs

11:36 am on Apr 12, 2008 (gmt 0)

One of my test domains ( yeah, hint to Google: I know where you look first ) has a blog software installed. Didn't bother with HTML or a normal CMS.

...

They've been doing the following since last year:

Query the search form with words from the page at random.
Words they found to be prominent on the page.
They crawled and indexed these results pages.
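To make the idea of "words found to be prominent on the page" concrete, here is a toy Python sketch that picks the most frequent non-trivial words from a page's text. This is purely an illustration under simple assumptions (frequency counting, a tiny stop-word list); it is not Google's method, which has not been published in this thread.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be far longer.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "it", "for"})


def prominent_words(page_text: str, limit: int = 3) -> list[str]:
    """Return up to `limit` of the most frequent words, ignoring stop words and very short words."""
    words = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


if __name__ == "__main__":
    text = "The widget and the gadget: widget prices, widget reviews, gadget deals"
    print(prominent_words(text, limit=2))
```

Each word returned could then be submitted as a search term through the form, which matches the pattern of requests described above.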

...

Now on other domains which I work on, where it'd have made sense to crawl - for example - URLs used in <OPTION>s in a drop down menu... nothing happened, so I couldn't care less.

I hoped they'd stop doing it and decide this is an idea too crazy for prime time.

...

so all in all this:

Only a small number of particularly useful sites receive this treatment,

...means only high TrustRank sites get this... treatment.

Mind you the test domain isn't useless...
it has very informative articles written especially for it.
But to call it particularly useful might be an exaggeration *smirk*

None the less CNN, Google Blog search and all the rest felt the same way towards it. Some tests ended up on real, live, WEB search results within ( grab your seats ) 4 minutes time.

...

But after all they did NOT choose to drop this reckless idea.
Yeah, my opinion is...

This move is insane.

First they educated everyone *for 10 years* that if you use FORMs, OPTIONs, etc. your pages will *not* be crawled / are in effect, *hidden* from them.

Now that *finally* everyone understood this...
And used FORMs to HIDE pages from Google...

They start crawling and indexing them. Hey. Why?

...

Stepping out of line from my tech-sided comments:
I've reached a conclusion for the time being.

Google is a politician

Nothing more, nothing less.
Same credibility, same accountability.

[edited by: Miamacs at 11:37 am (utc) on April 12, 2008]

confuscius

11:44 am on Apr 12, 2008 (gmt 0)

"Only a small number of particularly useful sites receive this treatment " - does this not smack of manual intervention by Google to bring more useful content from useful sites into their index?

Perhaps this is a new way for Google to deal with spammers: get more useful content to swamp it! I wonder what constitutes a 'useful' site and what the selection process is.

Looks to me like Google is moving away from its algo based approach.

Receptional Andy

3:41 pm on Apr 12, 2008 (gmt 0)

Hmm. I'm not happy to be right with my speculation about Google's form-spidering behaviour at the end of last year, and now that they've announced it 'officially' I admit I just don't get it. The pages they are likely to find via this mechanism are almost by definition lower quality - otherwise they would have links to them.

Maybe I'm a suspicious type, but I just don't buy Google's explanation for doing this. It directly contradicts a number of prior statements, it will result in the discovery of the type of content they dump into the supplemental index, and worse still, it will cause pages to appear in the index that webmasters never intended to be spidered.

Where is the benefit to either searchers or webmasters?

I'm no fan of proprietary extensions to the robots exclusions protocol but a directive that amounts to 'no forms please' would get the thumbs up from me.

explorador

3:45 pm on Apr 12, 2008 (gmt 0)

I think this is too much. Big G is trying to index everything and is willing to find out all the info we have. What's next? And I mean... what's next?

Pico_Train

4:05 pm on Apr 12, 2008 (gmt 0)

I really don't see the point in indexing:

"Thank you, your request has been sent successfully. A member of our staff will contact you shortly."

"Sorry, there is no availability for your stay on those dates."

"I'm afraid we are out of stock of that particular product, perhaps you would like it in pink?"

Quit it, now!

treeline

4:57 pm on Apr 12, 2008 (gmt 0)

Google did make it easy to opt out. Robots.txt was easy, and worked instantly. They say other methods like nofollow also work.

Site selection is interesting. My site is in a small niche. It's key in the niche, but it's a small niche. It does show the supplemental results in searches. As for the small number of forms pulled, it went from googlebot pulling about 300-400 pages a day to 5,000-10,000 a day.

The forms were all radio buttons & dropdown selects. There are combinations no human would choose, but googlebot seemed to try every conceivable cross-indexed possibility. The form results include a lot of numbers and are very useful when produced, but the data changes hourly and is completely useless after several hours. I can't imagine it drawing clicks from google searchers. Googlebot started following forms on this site in September 2007.
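A quick Python sketch shows how fast the number of possible requests grows when every radio button and dropdown value is crossed with every other. The field names and values here are made up for the example and are not the poster's actual form.

```python
from itertools import product
from math import prod
from typing import Iterator
from urllib.parse import urlencode

# Hypothetical form: three controls with 4, 3 and 2 possible values.
FORM_FIELDS = {
    "region": ["north", "south", "east", "west"],
    "period": ["hour", "day", "week"],
    "units": ["metric", "imperial"],
}


def count_combinations(fields: dict[str, list[str]]) -> int:
    """Number of distinct submissions if every value is crossed with every other."""
    return prod(len(values) for values in fields.values())


def enumerate_queries(fields: dict[str, list[str]]) -> Iterator[str]:
    """Yield the query string for each possible combination of values."""
    names = list(fields)
    for values in product(*(fields[name] for name in names)):
        yield urlencode(dict(zip(names, values)))


if __name__ == "__main__":
    print(count_combinations(FORM_FIELDS))
    print(next(enumerate_queries(FORM_FIELDS)))
```

With only three small controls this already gives 24 URLs; a form with a dozen dropdowns reaches thousands of combinations, most of which no human visitor would ever request.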

It's only logical they'd take a shot at forms. Remember when search engines couldn't follow cgi scripts? That was the hidden web then. Now they're good at it. If good stuff can be found and Yahoo and MSN can't keep up, Google has a bigger edge. Of course, many forms lead to random gibberish if you don't submit intelligently.

Bewenched

6:35 pm on Apr 12, 2008 (gmt 0)

Yes they're making a MESS of my log with it. I get error notifications as well when there's a search that returns nothing. ARGH.

I'm all the time fighting referral spam through some of our search forms so I get notified of some of the odd requests.

How does google know what should be posted through there?

Guess I'll have to block that directory from bots, unfortunately that also informs those referral spammers exactly where to do their dirty work .. great.

Only a small number of particularly useful sites receive this treatment,

amazon. interesting

[edited by: Bewenched at 6:43 pm (utc) on April 12, 2008]

g1smd

6:55 pm on Apr 12, 2008 (gmt 0)

That explains why a certain page is indexed under multiple URLs when I explicitly set it up to be indexed under only one specific parameterless version. The other versions can only be reached via a form.

The other versions should not be indexed. Now I have to do extra work to stop that happening. Bad move by Google. They just crossed a line.

They're getting too nosey. Back off.

rogerd

7:45 pm on Apr 12, 2008 (gmt 0)

I hope they give Googlebot a credit card to allow successful completion of my order forms.

mcavic

9:40 pm on Apr 12, 2008 (gmt 0)

This move is insane.

Yes.

treeline

10:18 pm on Apr 12, 2008 (gmt 0)

I hope they give Googlebot a credit card

It keeps guessing random numbers til it's approved!

Oliver Henniges

12:04 am on Apr 13, 2008 (gmt 0)

> Only a small number of particularly useful sites receive this treatment...

Given the number of posters contributing here and admitting to have been visited by googlebot on this issue, the phrase "small number of particularly useful sites" seems a bit strange. Two explanations remain:

A) google is just clapping our shoulders for whatever reason

B) Many of us posting here are viewed as particularly trustworthy by the google-engineers. To a high percentage our websites are "clean" (i.e. "accessible" and syntactically correct), so that they make up a good starting point for testing new crawl-algos.

[edited by: Oliver_Henniges at 12:06 am (utc) on April 13, 2008]

olias

10:22 pm on Apr 13, 2008 (gmt 0)

I had Googlebot crawling some basic forms back in 2003. [webmasterworld.com...]

It was just a basic drop down select box that may well have looked like site navigation so I can sort of understand that.

Goodness knows why they now feel the need to perform searches though, I can see no sense in trying to find stuff that way.

GrendelKhan TSU

1:38 am on Apr 14, 2008 (gmt 0)

Does anyone LIKE this move?

I'm still not clear who this even theoretically benefits.
The site owner (ahhh! more pages I forgot to robots.txt) or user (netizen doing a search)?

I mean, what would people be searching FOR that they would want pages behind a form to be in their SERPs?

jamiembrown

9:06 am on Apr 14, 2008 (gmt 0)

Matt Cutts has spoken about this change on his blog:

[mattcutts.com ]

He says that this change is for the sort of corporate site that has a list of dropdowns on the home page to select a country or region.

Personally I think that this is a strange excuse - which corporate site has not already added text links to help the spiders? And if they haven't done that yet, I'm not sure they deserve to be indexed at all!

I don't quite understand what benefit this change will bring to users or Google. It's just going to add a whole load of low quality content to the SERPs...

bouncybunny

9:10 am on Apr 14, 2008 (gmt 0)

Does anyone LIKE this move?

I don't know about actively like but, as I've said (aside from my concerns about duplication), the way that Google have indexed my particular site does have potential benefits to searchers. If it helps people to arrive at my knowledge base via more keywords than before, then more power to them.

Having said that, looking at the wider picture and hearing from other people, it does seem like a rather unusual move. But Google have stated that these crawls are treated differently from normal indexes (see my post above) so, in theory we shouldn't really have much to fear from negative effects.

As regards actual visitors, I can't say that I noticed any major increase in referrals from Google before I disallowed the bots. But there was some.

I'm still in two minds about whether to keep disallowing.

This 51 message thread spans 2 pages: 51