Googlebot Now Crawls via HTML Forms

Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Googlebot Now Crawls via HTML Forms

tedster

1:52 am on Apr 12, 2008 (gmt 0)

There's some interesting new today on Google's official blog:

In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form.
For text boxes, our computers automatically choose words from the site that has the form;
For select menus, check boxes, and radio buttons on the form, we choose from among the values...
[googlewebmastercentral.blogspot.com...]
I believe we now have at least some of the answer for two mysterious threads from the recent past:
Google indexing large volumes of (unlinked?) dynamic pages [webmasterworld.com]
Google is indexing site search results pages [webmasterworld.com]
I'd guess that during the development phase of this new form of crawling, Googlebot may have not always been as well-behaved as the engineers planned.

GrendelKhan TSU

9:57 am on Apr 14, 2008 (gmt 0)

sorry typos in last post so maybe ppl didn't understand my last question....

"I mean, what would people be search FOR that they would pages behind a form to be in their SERP?

what I mean was:
what would netizens be searching for that they would find from (benefit from) content that was behind a form?

the way that Google have indexed my particular site does have potential benefits to searchers.

as above, can you be more specific?

even mattcutts explanation is vague about it (lumped with dropdowns) and saids its being more about "discovering new links". But he doesn't give a hard example about forms. Country links... ok. I relate to that. Crawling those behind dropdowns is hard... but still not clear on forms.

netizen searches for: ________?
serp give them useful Result: ____________? <<< (this came from content in a form.)

I'm kinda with pico and can only imagine something like:

"Thank you, your request has been sent successfully. A member of our staff will contact you shortly."

what link is it discovering from that?
+_+

[edited by: GrendelKhan_TSU at 10:02 am (utc) on April 14, 2008]

wheel

12:20 pm on Apr 14, 2008 (gmt 0)

I'm horrified at this possibility. I provide calculators to an industry via a form they put on their website. If they start indexing these forms there's all sorts of problems - duplicate content not the least of them. Most importantly my competitors now have a potentially easy way to find my entire client list.

I have a friend in the industry who publishes his client list for the public - put in your zip code and find someone who sells in your area. Again he's done it behind a form specifically so that it's not indexed - and specifically so that people don't have an easy way to crack a list of suppliers.

Ranting aside, will the action of the form receive any linkweight from this? My goodness, if it does, I suspect the dormant domain I use to serve the calculators from is liable to hit the top in a very competitive industry. That could change the serps in some industries.

Romeo

12:35 pm on Apr 14, 2008 (gmt 0)

There are sites with database query forms to explicitly send POST requests just because they don't want to show half a million direct deep GET links to the search engines.
Other sites takes queries to convert values by a special service algo, where there could be trillions of theoretically explorable possible values.

And now it looks as they want to deliberately sabotage these set ups.

Or are they suggesting that we should cloak our forms to hide them?

No, thanks. It is rude.

Some examples:

Imagine a website that simply asks for a web colour code (between #000000 and #FFFFFF) to show a resulting colour.gif: Wow! that would give a boost of 16,777,215 most usefull hits.

Or how about a money currency converter between US$ and GB-Pounds?
And can we have decimals, please, too, to get more fun for all possible values between 0.01 and 9999999999999.99 ....

Or how about a small page just asking
"Hi, put in some meaningless value: [nnn]"
to give back an
"Oh, so you have put in 'nnn'. Thank you so much. Want to try another one? [nnn]".

Or a navigational page to "select 1, 2 or 3" being programmed to always give back a 200 "Sorry, your selection 7885365... is invalid, try a valid one." instead of a 404 or 403.

All this can get ballooned to many million/billion/trillion different valid and even content-rich pages.
There are so many different ways to get it completely wrong.

... ah, enough of that.
I am now just off now for some minutes to do some funny coding, and the GOOG bot is invited to come by shortly after then ...

Kind regards,
R.

Oliver Henniges

2:58 pm on Apr 14, 2008 (gmt 0)

Silly question: Does anyone really use form-elements with the GET-method?

Personallay I only use <form> in combination with POST. Instead of using GET with a submit-elemtent I'd always prefer the complete target-location as href-link. Obvioulsy, I did not fully understand what the GET-method was made for.

Receptional Andy

3:27 pm on Apr 14, 2008 (gmt 0)

* Use GET if:
o The interaction is more like a question (i.e., it is a safe operation such as a query, read operation, or lookup).
* Use POST if:
o The interaction is more like an order, or
o The interaction changes the state of the resource in a way that the user would perceive (e.g., a subscription to a service), or
o The user be held accountable for the results of the interaction.
[w3.org...]

GET forms have a number of advantages for users. As an example, it allows the sending/bookmarking of links to Google search results pages, or a link to HTML validation results, etc.

ddogg

4:13 pm on Apr 14, 2008 (gmt 0)

This makes no sense for most sites, obviously if we want it crawled we will make the content available on our sites through links.

GrendelKhan TSU

4:18 pm on Apr 14, 2008 (gmt 0)

I'm horrified at this possibility. I provide calculators to an industry via a form they put on their website. If they start indexing these forms there's all sorts of problems - duplicate content not the least of them. Most importantly my competitors now have a potentially easy way to find my entire client list.
I have a friend in the industry who publishes his client list for the public - put in your zip code and find someone who sells in your area. Again he's done it behind a form specifically so that it's not indexed - and specifically so that people don't have an easy way to crack a list of suppliers.

this is exactly what I was worried about. I'm in the DB industry myself.

Even if there are nofollows or robots.txt or whatever ways to deal with the issue... it's still creating more work and more things to have to worry about. Thing that I don't necessarily why see why we SHOULD have to worry about. Let's not forget the cost of making the change to block the crawl (for most sites), we are going to have all the of ppl out there (probably majority) that won't know they were supposed to worry about it till waaaay too late.

But forget the risky downside for a sec (it's obvious, there is a lot one would NOT want netizens to see from forms getting crawled):

I'm still not seeing what a webmaster (whitehat) would want to be crawled from a form, even theoretically.

anyone? +_+

[edited by: GrendelKhan_TSU at 4:22 pm (utc) on April 14, 2008]

wheel

7:41 pm on Apr 14, 2008 (gmt 0)

I went back and read Matt Cutt's blog from the link earlier in this thread. I'll comment at the risk of the standard google flame (let's not).

He suggested that example.com use a list of links by country at the bottom of the page. Ewwwww. A site with a block of links at the bottom for SE purposes? I tend to assume stuff like that won't pass a hand inspection. I'd love to do that for better local rankings myself, I don't because it looks horrible and is 100% SE driven. Yes my competitors do this and rank better for local search terms....I've deliberately decided not to do that.

Overall he's advocating specific design implementations that are strictly for the search engines and not for the users. The entire post seems to be written from that perspective. I'd suggest that some of the suggestions are actually detrimental to the user.

Seems like a slippery slope. Following forms is going to create a worse mess than the problem they're trying to solve. I also suspect that in the example he's used (clickable map of the world with drop down to select region) that most people would be using POST, not GET. Crawling forms doesn't solve the problem.

Silvery

4:14 pm on Apr 16, 2008 (gmt 0)

While it's nice that they don't spider/index results if you've disallowed search URLs in robots.txt, this is still a bit too after-the-fact. After all, major search engines didn't previously execute automated queries through search forms uninvited in the past, so webmasters now may be dealing with inconvenient, bulk automated queries that they had no reason to expect.

I can appreciate the desire to index the unavailable content, but there are lots of folks in the internet industry who rely on their site usage statistics for various reasons, and having sudden spikes in search queries may not be something they can easily correct in their reports. Not to mention cases others have cited here where no-value pages are getting indexed and logfiles get suddenly clogged up with the automated query records.

This really should have been announced prior to deployment out in the wild.

Many webmasters won't have their search URLs disallowed in robots.txt, because they've never previously needed to do so.

This is just a bit like how Network Solutions once coopted traffic from all non-registered domain names -- suddenly changing legacy paradigms on the internet without advanced notice is both rude and unwise.

g1smd

6:55 pm on Apr 16, 2008 (gmt 0)

I have now added meta robots noindex nofollow tags to all the pages with forms on.

I can't afford any time to look into the potential effects of not doing so, so have just gone with a blanket "keep out".

I do wonder if using the robots.txt disallow would instead be a better option.

Receptional Andy

7:16 pm on Apr 16, 2008 (gmt 0)

g1smd: if the activity I saw (in one of the linked threads by the OP) is attributable to this new behaviour, the sheer volume of crawling was a nuisance. Directives for robots in meta elements encourage re-crawling, so I opted for a disallow.

The way some sites are set up, I may just disallow any URL containing a question mark using wildcards.

Frankly, I don't consider this new spidering behaviour to be at all desirable. It creates a lot of work that I believe is wholly unneccesary.

Tastatura

1:04 am on Apr 25, 2008 (gmt 0)

At Faculty Summit in Zurich on 02/14/08, Google's Alfred Spector (VP of Research and Special Initiatives) briefly talked about some current projects including "Deep Web" - all html indexable data through forms.

I hope they give googlebot a credit card to allow successful completion of my order forms

Actually he mentioned that they have to be careful not to buy anything (not click on "buy" button), and what form parameters to to use as an input.

Here is link to the video:
[video.google.com...]

and Deep Web starts around 44/45th minute.

List of some of the current projects (as mentioned in the video)
-Voice
-Machine Translation
-Entity/Relation Extraction
-Deep Web

Also listen to the Q/A from the audience at around 59th minute regarding Deep Web/semantic web/ meta data - very interesting answer :)

[edited by: Tastatura at 1:10 am (utc) on April 25, 2008]

webdude

12:29 pm on Jul 1, 2008 (gmt 0)

Was there ever any resolution to this? Is there an answer?

I recently was shocked to see in GWT that one of my sites had 3500 pages of duplicate titles and meta descriptions. Of course I started to investigate. By the time I found a fix, duplications went to 6500 pages. All of this is POST data. An example URL was as follows...

?a=111&b=222&c=3

Not only is Googlebot inserting random numbers, it is actually rearranging the order of the post. The c value is not necessary to return the page (it is now, though). So I end up with...

?b=222&a=111&c=8
?c=9&a=111&b=222
?a=111&c=10&b=222

All pointing to the same content!

I check my logs and analytics along with GWT religiously and it took just 24 hours for this to occur. So now I have 6500 pages of duplicate content in the SERPs and it is really going to be a pain to get it all removed. I have fixed this with a combination of no-follows and the robot.txt file.

So the question really is, are these URLs being generated by Google? or... Is Google being fed URLs from someone else. The site is a non-profit, but I could see others wanting to bring it down. It would be nuts, in my opinion, to have a bot trolling for non-existant URLs just to see if it could dig up more data... especially if the data is more then likely duplicate, redundant, or not generated by any other means... no human clicks.

At first I thought this was a programming error, but I checked, double checked, triple checked and quadriple checked my code and there is no way these URLs are being generated on my end. All log files show that 90% of the referrals on these rogue URLs are being hit by Googlebot... the other 10% are being hit by users who searched Google... yeah, the pages are in there. They now return a 404 error. I was so worried about the dupe content, I had to cut off my nose to save my face. Another thing too, is that the total page count using site: has dropped from 14,500 to 10,500 and it looks like around 1% of the pages in the SERPs are the rogue links.

Now why would G do something like this?

Receptional Andy

7:20 pm on Jul 1, 2008 (gmt 0)

are these URLs being generated by Google? or... Is Google being fed URLs from someone else

The problem is, there's no guaranteed way to find this out, since it's quite possible to hide links from being discoverable in search results. And pages 'discovered' are often pointless enough to make the results seem like sabotage, even if it was Google's behaviour.

All of this is POST data

Just by way of clarification, in the URL ?a=1&b=2&c=3, a, b and c are all GET variables. POSTed data is never present in a URL.

I don't have any recent examples of this behaviour by Google, since wherever possible I've blocked crawling of any forms via robots exclusion, and not every site is affected in any case. In some cases, this actually removed useful results from Google, in instances where users had chosen to link to the results of various forms with good reason. But I can't have tens of thousands of junk URLs filling up search results.

The best solution is (pre-emptive) robots exclusion, although it's worth pointing out that I didn't see any harmful effect on performance of such URLs being indexed on other pages. Since there are no (or no worthwhile) links to such content, you lose nothing by blocking them.

Of course, on sites where tens of thousands (and more) of these URLs were indexed, the wreckage hangs around in Google forever as URL-only entries (they 'discovered' the content I suppose, so they want to keep it). So perhaps 404 is a better way to expedite complete removal. In my case, most URLs were valid, just not desirable for indexing, so a 404 would have to be cloaked, which is more trouble than I want to go to to get around this new 'feature'.

Now why would G do something like this?

The data Google can access is severely restricted by the need for there to be links to the content. To a webmaster aware of this, that's probably a good thing. Google's argument would likely be that there are webmasters who are not aware of this, and wish that content with no links to it was made available, and users who are searching for such content.

I will make no secret of the fact that I believe neither collateral damage or quality assurance were given adequate regard in the decision to allow Googlebot to put pseudo-random data into GET variables.

webdude

8:02 pm on Jul 1, 2008 (gmt 0)

Oooops... you are right, they are GETs.

As for the problem, I saw something that really gave me a jolt today. First off, these are forum pages and they are very desirable to have in the index. There are over 77,000 posts with an average of 8 posts per page. Some a lot more, some less. That averges to about 9600 pages. This is just the forum and does not include any other pages on the site. Like I said, site: brings in about 10,500 pages in Google. When I search for the name of the actual file, though, it returns 6 results... two from my site and four from others linking in. This is a very unique filename. When I click on "repeat the search with the omitted results included" I get 2,670,000 pages... uh?!?!?! Can't be right, it only displays around 400 of my pages.

Anyway... any insight would be helpful. The site seems to be holding its usual positions in the SERPs.

Receptional Andy

8:17 pm on Jul 1, 2008 (gmt 0)

Site: counts have been highly unreliable for years, and the numbers shown are increasingly different depending on the exact search conducted. Note that site:example.com/forum is valid syntax (if you aren't aware of that already).

"Repeat the search with the omitted results included" is an odd beast, as seems to omit "entries very similar to the x already displayed" as it applies to the keyword searched for - so the degree of similarity is usually much higher in site: and other specialised searches.

I found most of the URLs with random GET variables were "omitted from results" for site: searches, although not necessarily for other methods of extracting the content from Google results (via creative queries).

I would say block the 'fake' ones via robots exclusion and if the mess it makes of certain search results bothers you, consider using Google's URL removal process.

webdude

12:42 pm on Jul 2, 2008 (gmt 0)

I would use the G removal process, but to remove singular links one at a time would be a full time job ;-) There are over 6600 pages of dupe content right now. I added a couple in the queu to see what would happen after changing the code to show a 404 for these pages.

Site: is returning about 1% of the rogue pages right now. Googlebot has stopped all crawling of these pages since I added the rogue wildcards into the robots text.

Funny, I got on GWT this morning and it shows "no data" for just about everything except for the crawling stats. I hope this isn't a portent of something to come. Also I am showing a drop in the SERPs across the board for all keywords and phrases. Everything down by 2 spots... even some phrases that were locked at #1 for the past couple of years.

Mmmmmm... me thinks there might be a bit of a problem.

webdude

7:23 pm on Jul 2, 2008 (gmt 0)

Update...

Yesterday there were 6689 duplicate titles listed in GWT... today there are 6,051.

At least it's a step in the right direction, eh?

g1smd

7:56 pm on Jul 2, 2008 (gmt 0)

I use entries in .htaccess to pre-process incoming URL requests, before they hit the content-generating PHP scripts.

If the query-string format is completely wrong then a 404 error is served. I just rewrite (NOT redirect) the URL to /this-file-does-not-exist instead.

If the parameters are valid, but in the wrong order, then a redirect to the correct parameter order is done (as well as fixing the target URL with the correct domain).

This keeps most non-valid requests from getting through.

For requests that do get passed to the internal PHP scripts, the script does a validity check on the parameter values and serves a 404 for any requests that are still incorrect.

Duplicate Content is impossible with these methods in place.

webdude

2:09 pm on Jul 3, 2008 (gmt 0)

Another update, if anyone is interested...

Duplicate titles down to 2,264

Returns in SERPs for all keywords/phrases back to their original glory.

Receptional Andy

2:14 pm on Jul 3, 2008 (gmt 0)

Glad to hear it's headed in the right direction, webdude.

Duplicate Content is impossible with these methods in place

Good advice, and for duplicate content, certainly.

Note that this does not apply to most issues caused by the form-spidering, which results in the creation of a valid URL with distinct content (e.g. through a site search form, where Google plugged in every word found on the site in question, one be one).

Such content is valid, and accessible, but the presumption has always been that such content is inaccessible to Google: this is no longer the case, and you also need to protect URLs that are valid, have unique content, but are not really suitable for spiders. Anything accessible via a form, basically.

This 51 message thread spans 2 pages: 51