In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn't find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form.
For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values...
I believe we now have at least some of the answer for two mysterious threads from the recent past:
Google indexing large volumes of (unlinked?) dynamic pages [webmasterworld.com]
Google is indexing site search results pages [webmasterworld.com]
I'd guess that during the development phase of this new form of crawling, Googlebot may not always have been as well-behaved as the engineers planned.
Only a small number of particularly useful sites receive this treatment,
I'm hoping this means that if you don't have sitelinks it doesn't apply. I don't have many forms which use GET, if any, but it would be tedious to have to comb my websites to check them all just in case. I'm underwhelmed by this development: if it's behind a form, I can't think of any instance where I want it spidered.
Same for me. My first thought was that tedster had come across an April Fool's joke twelve days later;)
But indeed I found googlebot using some mysterious get-variables on my shop page quite recently. As a matter of fact, these needles are outdated by now, because I redesigned my system in February.
If of interest: This older variable used ?sch= plus integer in order to preserve the n-th step of a navigation-action inside my shop's search function. I used to display a random collection of my products if the internal product-search-function gave zero results and I found it quite annoying if you saw something completely different using the back button. Currently I work with other needles and skipped randomization completely.
But for a few weeks now I have begun to log such get-requests due to some amateurish hacker-attacks. And was really surprised to find googlebot among the delinquents. Given my above explanation I found it really mysterious googlebot had used integers as high as 22348 for testing, and the paranoid in me said "That was a human employee;)"
What bothers me most is the duplicate content issue: if google indexes a shop's search function, it is quite likely that pages with different URIs show exactly the same result, if only for two-word searches in a different word order. Until now, I never cared about the metatags of my cgi-output, but given this new development, I should probably go through the whole structure and make sure specific pages are displayed with "noindex"?
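To make the word-order problem concrete, here is a minimal Python sketch of one way a search script could map every ordering of the same words onto a single URL. The path `/shop/search` and the parameter name `q` are made up for illustration and are not taken from anyone's actual shop.

```python
from urllib.parse import urlencode


def canonical_search_url(base, query):
    """Return one URL for all word orders of the same search.

    `base` and the parameter name "q" are hypothetical.
    """
    # Lower-case the query, split on whitespace, drop repeated words,
    # and sort alphabetically so "red widget" and "Widget red" match.
    words = sorted(set(query.lower().split()))
    return base + "?" + urlencode({"q": " ".join(words)})


if __name__ == "__main__":
    print(canonical_search_url("/shop/search", "Widget red"))
    print(canonical_search_url("/shop/search", "red widget"))
```

A script could redirect any non-canonical request to this form, so that only one URI per result set is ever exposed. Whether that is enough to avoid a duplicate filter is anyone's guess.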
The other thing is malicious behaviour of competitors as described in this thread [webmasterworld.com]: If a competitor puts links to your site using varying get-variables, most of your pages will always display the same result and may trigger filters. BTW: Are both phenomena related? Does the serious damage reported in that thread have to do with google spidering random form variables?
> Only a small number of particularly useful sites receive this treatment,
The honour is well appreciated;)
My big dilemma was whether to use robots.txt to exclude googlebot. On the one hand I was concerned about duplicate content. But on the other hand, are these dynamically created and potentially indexed pages really duplicate content? These 'search results' pages don't exist in any other format.
Moreover, as I said on my earlier thread, Google appeared uncannily accurate with the words that it chose to use to run queries on my site (my form is a site-search cgi script that searches a knowledge base section of my site). If those results then appear in the SERPS for keywords that I (or others) have not necessarily used in link text, is that really such a bad thing? Perhaps someone searching for that particular concept will be really pleased that they have found this result.
I eventually decided to exclude the search script from googlebot, mainly for good old fashioned duplicate content paranoia reasons. But I wonder if it is time to rethink this kind of thing? Have we been taking the whole duplicate content thing too literally? I assume that Google knows where it has drawn these results from and perhaps there is no need to be concerned?
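For anyone who wants to check such an exclusion before relying on it, here is a small Python sketch using the standard library's `urllib.robotparser`. The script path `/cgi-bin/search.cgi` and the domain `www.example.com` are placeholders, not the actual site discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps Googlebot away from a site-search script.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /cgi-bin/search.cgi
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("Googlebot", "http://www.example.com/cgi-bin/search.cgi?q=widgets"))
    print(allowed("Googlebot", "http://www.example.com/articles/intro.html"))
```

The rule is a plain path prefix, so it also covers every query string appended to the script. This only shows how the rule is read by Python's parser; how Googlebot itself treats it is up to Google.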
This paragraph is particularly interesting:
The web pages we discover in our enhanced crawl do not come at the expense of regular web pages that are already part of the crawl, so this change doesn't reduce PageRank for your other pages. As such it should only increase the exposure of your site in Google. This change also does not affect the crawling, ranking, or selection of other web pages in any significant way.
On the other hand, unless Google gives some direct and clear guidelines, it is far too tempting to err on the side of caution. Which may be a shame for webmasters and searchers alike.
I presume that I can use a wildcard block in robots to stop further damage but how the heck do I remove the damage already done?
Begs the question: what is the point in designing a nicely structured site if Google can simply completely screw it up for you?
I would love to hear Google's comments on this. So the theory that one site can screw up another is confirmed beyond all reasonable doubt and the attacker is ..... Google!
the question is, whether google PERCEIVES them as duplicate and then triggers a filter, which may hurt you. Given that the canonical issue is said to have caused so many problems, we cannot be careful enough. And the difference between mydomain.com, www.mydomain.com and mydomain.com/index.html is something google should have fixed in the first place (without the help of webmasters via WTC), before beginning to crawl uncontrollable stuff behind forms.
Disclaimer: Normally I do not use the phrase "google should", but this issue is probably worth an exception.
The more I think about it, the more mysterious things get: It is relatively easy as long as we only have to deal with GET requests leading to idempotent CGI URIs. But what about POST forms? Will googlebot randomly apply for mail-accounts on the various public mailservers and then send billions of mails? Will googlebot perform orders in my shop (Nice rich customer, eh?;)? Will googlebot place trillions of postings in forums and blogs (I'd like to know what golem has to say meanwhile;)?
I really do not see the point in Google indexing search results for random keywords. This is bound to create all kinds of unforeseen problems.
It would be wrong of Google to perform other operations with forms, such as POST, which might cause a stateful (and expensive) side-effect, such as ordering goods, etc.
Google seems to be clearly restricting itself to GET.
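The practical upshot for site owners is that anything with a side effect should sit behind POST. A rough Python sketch of that rule follows; the handler name `handle_order`, the `item` parameter and the in-memory `orders` list are invented for illustration only.

```python
SAFE_METHODS = {"GET", "HEAD"}


def handle_order(method, params, orders):
    """Hypothetical shop handler: only POST may place an order.

    Returns a (status_code, message) tuple.
    """
    method = method.upper()
    if method in SAFE_METHODS:
        # A crawler submitting a GET form only ever sees a read-only page.
        return 200, "Order preview for item " + params.get("item", "?")
    if method == "POST":
        # The state change happens here and nowhere else.
        orders.append(params["item"])
        return 201, "Order placed"
    return 405, "Method not allowed"


if __name__ == "__main__":
    orders = []
    print(handle_order("GET", {"item": "42"}, orders), orders)
    print(handle_order("POST", {"item": "42"}, orders), orders)
```

With a split like this, a bot that restricts itself to GET cannot order goods or send mail, however many form values it tries.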
As it happens, mainly because I don't like search-results-in-search from other sites, I just moved my biggest site to have the search/results page noindex (though not nofollow).
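For reference, "noindex but not nofollow" comes down to a single robots meta tag in the head of the results page. Here is a tiny Python helper that builds it; the function name `robots_meta_tag` is made up for this sketch.

```python
def robots_meta_tag(index, follow):
    """Build a robots meta tag from two booleans (hypothetical helper)."""
    content = ", ".join([
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ])
    return '<meta name="robots" content="%s">' % content


if __name__ == "__main__":
    # Search results page: keep it out of the index, still follow its links.
    print(robots_meta_tag(index=False, follow=True))
```

Note that a crawler has to fetch the page to see this tag, so it presumably cannot take effect on URLs that are also blocked in robots.txt.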
They've been doing the following since last year:
Query the search form with words from the page at random.
Words they found to be prominent on the page.
They crawled and indexed these results pages.
Now on other domains which I work on, where it'd have made sense to crawl - for example - URLs used in <OPTION>s in a drop down menu... nothing happened, so I couldn't care less.
I hoped they'd stop doing it and decide this is an idea too crazy for prime time.
so all in all this:
Only a small number of particularly useful sites receive this treatment,
...means only high TrustRank sites get this... treatment.
Mind you the test domain isn't useless...
it has very informative articles written especially for it.
But to call it particularly useful might be an exaggeration *smirk*
None the less CNN, Google Blog search and all the rest felt the same way towards it. Some tests ended up on real, live, WEB search results within ( grab your seats ) 4 minutes time.
But after all they did NOT choose to drop this reckless idea.
Yeah, my opinion is...
This move is insane.
First they educated everyone *for 10 years* that if you use FORMs, OPTIONs, etc. your pages will *not* be crawled / are in effect, *hidden* from them.
Now that *finally* everyone understood this...
And used FORMs to HIDE pages from Google...
They start crawling and indexing them. Hey. Why?
Stepping out of line from my tech-sided comments:
I've reached a conclusion for the time being.
Google is a politician
Nothing more, nothing less.
Same credibility, same accountability.
[edited by: Miamacs at 11:37 am (utc) on April 12, 2008]
Perhaps this is a new way for Google to deal with spammers: get more useful content to swamp it! I wonder what constitutes a 'useful' site and what the selection process is.
Looks to me like Google is moving away from its algo based approach.
Maybe I'm a suspicious type, but I just don't buy Google's explanation for doing this. It directly contradicts a number of prior statements, it will result in the discovery of the type of content they dump into the supplemental index, and worse still, it will cause pages to appear in the index that webmasters never intended to be spidered.
Where is the benefit to either searchers or webmasters?
I'm no fan of proprietary extensions to the robots exclusion protocol, but a directive that amounts to 'no forms please' would get the thumbs up from me.
"Thank you, your request has been sent successfully. A member of our staff will contact you shortly."
"Sorry, there is no availability for your stay on those dates."
"I'm afraid we are out of stock of that particular product, perhaps you would like it in pink?"
Quit it, now!
Site selection is interesting. My site is in a small niche. It's key in the niche, but it's a small niche. It does show the supplemental results in searches. As for the small number of forms pulled, it went from googlebot pulling about 300-400 pages a day to 5,000-10,000 a day.
The forms were all radio buttons & dropdown selects. There are combinations no human would choose, but googlebot seemed to try every conceivable cross-indexed possibility. The form results include a lot of numbers and are very useful when produced, but the data changes hourly and is completely useless after several hours. I can't imagine it drawing clicks from google searchers. Googlebot started following forms on this site in September 2007.
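To see why the crawl volume can jump like that, here is a short Python sketch that enumerates every combination of a form's select and radio values as GET URLs. The field names, values and the `/report` action are invented for illustration and have nothing to do with the actual site.

```python
from itertools import product
from urllib.parse import urlencode


def form_urls(action, fields):
    """Yield one GET URL per combination of field values.

    `fields` maps each (hypothetical) field name to its list of options.
    """
    names = list(fields)
    for values in product(*(fields[name] for name in names)):
        yield action + "?" + urlencode(list(zip(names, values)))


if __name__ == "__main__":
    fields = {
        "region": ["north", "south"],
        "period": ["1h", "6h", "24h"],
        "unit": ["a", "b"],
    }
    urls = list(form_urls("/report", fields))
    print(len(urls), "URLs, first:", urls[0])
```

The count is the product of the option counts per field, so three small fields already give a dozen URLs, and a form with a few long dropdowns quickly reaches the thousands.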
It's only logical they'd take a shot at forms. Remember when search engines couldn't follow cgi scripts? That was the hidden web then. Now they're good at it. If good stuff can be found and Yahoo and MSN can't keep up, Google has a bigger edge. Of course, many forms lead to random gibberish if you didn't submit intelligently.
I'm all the time fighting referral spam through some of our search forms so I get notified of some of the odd requests.
How does google know what should be posted through there?
Guess I'll have to block that directory from bots, unfortunately that also informs those referral spammers exactly where to do their dirty work .. great.
Only a small number of particularly useful sites receive this treatment,amazon. interesting
[edited by: Bewenched at 6:43 pm (utc) on April 12, 2008]
The other versions should not be indexed. Now I have to do extra work to stop that happening. Bad move by Google. They just crossed a line.
They're getting too nosey. Back off.
Given the number of posters contributing here and admitting to having been visited by googlebot on this issue, the phrase "small number of particularly useful sites" seems a bit strange. Two explanations remain:
A) google is just patting us on the back for whatever reason
B) Many of us posting here are viewed as particularly trustworthy by the google-engineers. To a high percentage our websites are "clean" (i.e. "accessible" and syntactically correct), so that they make up a good starting point for testing new crawl-algos.
[edited by: Oliver_Henniges at 12:06 am (utc) on April 13, 2008]
It was just a basic drop down select box that may well have looked like site navigation so I can sort of understand that.
Goodness knows why they now feel the need to perform searches though, I can see no sense in trying to find stuff that way.
He says that this change is for the sort of corporate site that has a list of dropdowns on the home page to select a country or region.
Personally I think that this is a strange excuse - which corporate site has not already added text links to help the spiders? And if they haven't done that yet, I'm not sure they deserve to be indexed at all!
I don't quite understand what benefit this change will bring to users or Google - it's just going to add a whole load of low quality content to the SERPs...
Does anyone LIKE this move?
I don't know about actively like but, as I've said (aside from my concerns about duplication), the way that Google have indexed my particular site does have potential benefits to searchers. If it helps people to arrive at my knowledge base via more keywords than before, then more power to them.
Having said that, looking at the wider picture and hearing from other people, it does seem like a rather unusual move. But Google have stated that these crawls are treated differently from normal indexes (see my post above) so, in theory we shouldn't really have much to fear from negative effects.
As regards actual visitors, I can't say that I noticed any major increase in referrals from Google before I disallowed the bots. But there was some.
I'm still in two minds about whether to keep disallowing.