Basically that makes for the simplest form possible: all of the possible result pages can easily be derived by taking the form action, the select name and the option values.
This morning I found that Googlebot has managed to crawl 17 of the results pages despite there being no other links to them. My first reaction was that it was probably the Toolbar "visit then crawl" thing that has been the subject of much debate, but on investigation I found at least one page that had been crawled before anyone had ever viewed it.
To my mind that rules out the Toolbar theories and the usual Googleguy response about referral leakage, so I am left with the idea that the form itself has been crawled. Does anyone have any better ideas?
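For anyone wanting to run the same check on their own logs, here is a rough sketch of how it could be done. It assumes an Apache-style "combined" access log in chronological order and simply treats any user agent containing "Googlebot" as the crawler; the function name and the sample lines are made up for illustration and are not taken from the logs discussed above.

```python
import re

# Matches the request path and the user agent in a "combined" format log line.
# Assumption: the log is in Apache/NCSA combined format.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def first_seen_by_googlebot(lines):
    """Return paths whose earliest logged request came from Googlebot.

    Assumes `lines` are in chronological order, as in a normal access log.
    """
    first_agent = {}  # path -> user agent of the first request seen
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        first_agent.setdefault(match.group("path"), match.group("agent"))
    return [
        path for path, agent in first_agent.items()
        if "googlebot" in agent.lower()
    ]


# Hypothetical sample lines (invented addresses, dates and paths).
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2000:06:00:00 +0000] "GET /results.php?choice=3 HTTP/1.1" 200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.55 - - [01/Jan/2000:07:00:00 +0000] "GET /results.php?choice=3 HTTP/1.1" 200 512 "-" "Mozilla/4.0"',
    '192.0.2.55 - - [01/Jan/2000:07:01:00 +0000] "GET /results.php?choice=4 HTTP/1.1" 200 512 "-" "Mozilla/4.0"',
    '192.0.2.10 - - [01/Jan/2000:08:00:00 +0000] "GET /results.php?choice=4 HTTP/1.1" 200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

print(first_seen_by_googlebot(SAMPLE_LOG))
```

Bear in mind that a user agent string can be spoofed, so a match here only suggests Googlebot got there first; it does not prove it.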
form action starts [<...]
Good point, but no, it is a relative link.
Why would G want to follow form links?
That is what I am wondering. If these pages end up in the index then it really doesn't add anything useful, but I guess there may be cases where site navigation is done using this method.
My first thought when I read the thread title was that they are determining what class the page with the form belongs to: the theme of that URL! But Googlebot could also learn about other URLs by following and indexing those pages, instead of just checking them.
Very interesting. I hope enough people post more information to follow this up.
I had a shock a few weeks ago when I thought I had seen the same thing and even went so far as changing the form to a POST to see if googlebot would follow that too.
I had a random question appear on my site, with select or radio inputs for the answer. I saw in my logs a request coming from Googlebot that looked like this:
pagename.php?question_id=7
I thought I was seeing something new until I realised that I had put a link at the bottom of the form saying 'click here to see results' and that Googlebot was following that link. When a question is answered, the request actually looks like this:
pagename.php?question_id=7&answer_id=15
because of a hidden variable containing the ID of the question that is being answered. With no question ID, I'd made the default action to be to display the previous answers to the questions.
Maybe something like this is happening for you too? If not, then perhaps you want to try changing the form to a POST request to see if Googlebot continues to crawl the form. Just out of interest...
As an experiment I put the following form on one of my pages - but it was only served up to Googlebot, no actual browsers saw it. (I should add I am not normally into cloaking!)
<form name="AForm" method="get" action="/spider-trap.php">
<select name="choice" size="1">
<option value="25">25. The Spider Trap</option>
</select>
<input type="submit" value="Result">
</form>
For me that pretty much confirms it has effectively crawled the form by piecing together the fairly simple bits of information.
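To show just how little a crawler would need to do to "piece together" those bits, here is a minimal sketch in Python. It is only a guess at the general approach, not a description of what Googlebot actually does. The class and function names and the example.com page URL are my own assumptions; the form is the one posted above.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class GetFormParser(HTMLParser):
    """Collect the action, select names and option values of GET forms."""

    def __init__(self):
        super().__init__()
        self.forms = []      # one dict per GET form found
        self._form = None    # GET form currently being parsed, if any
        self._select = None  # name of the select currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # A form with no method attribute defaults to GET.
            if (attrs.get("method") or "get").lower() == "get":
                self._form = {"action": attrs.get("action") or "", "selects": {}}
        elif tag == "select" and self._form is not None:
            self._select = attrs.get("name")
            if self._select:
                self._form["selects"][self._select] = []
        elif tag == "option" and self._form is not None and self._select:
            if attrs.get("value") is not None:
                self._form["selects"][self._select].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form":
            if self._form is not None:
                self.forms.append(self._form)
            self._form = None


def derive_result_urls(html, page_url):
    """Build one URL per option value, resolving the action against page_url."""
    parser = GetFormParser()
    parser.feed(html)
    urls = []
    for form in parser.forms:
        target = urljoin(page_url, form["action"])
        for name, values in form["selects"].items():
            for value in values:
                urls.append(target + "?" + urlencode({name: value}))
    return urls


FORM_HTML = """
<form name="AForm" method="get" action="/spider-trap.php">
<select name="choice" size="1">
<option value="25">25. The Spider Trap</option>
</select>
<input type="submit" value="Result">
</form>
"""

# The page URL is hypothetical; only the form itself comes from the post above.
print(derive_result_urls(FORM_HTML, "http://www.example.com/some-page.html"))
```

Forms using POST are deliberately ignored in this sketch, since nothing in this thread shows Googlebot submitting those.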
I offer a service that has a ludicrously simple sign-up procedure. You select your timezone (as an offset from GMT) from a drop-down box and hit "Go".
If I hadn't got the heads-up on this, I'd have had Googlebot happily creating itself 25-odd accounts!
Like many sites, certain actions on my sites trigger email to be sent to both my own and certain other addresses - again not something I want to be triggered automatically.
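One way to guard against this, sketched below in Python, is to let only non-GET requests trigger things like account creation or email, and to skip anything that identifies itself as a crawler. This is just an illustration under my own assumptions: the function name and the crawler token list are hypothetical, and whether Googlebot would ever submit a POST form is not something this thread establishes.

```python
# Methods that should never change anything on the server.
SAFE_METHODS = {"GET", "HEAD"}

# Illustrative only, not an exhaustive list of crawler user agent tokens.
CRAWLER_TOKENS = ("googlebot",)


def should_perform_action(method, user_agent):
    """Return True only if the request may create accounts, send email, etc."""
    if method.upper() in SAFE_METHODS:
        # A crawler stitching together a GET URL ends up here.
        return False
    agent = (user_agent or "").lower()
    # Extra belt-and-braces check; user agents can be spoofed.
    return not any(token in agent for token in CRAWLER_TOKENS)


print(should_perform_action("GET", "Googlebot/2.1 (+http://www.google.com/bot.html)"))
print(should_perform_action("POST", "Mozilla/4.0 (compatible; MSIE 6.0)"))
```

The method check does most of the work here: a sign-up form that only acts on POST cannot be triggered by a crawler that merely requests the derived GET URLs.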
I'd be interested to hear opinions on this as well. I guess the brains at the 'plex have thought about the consequences and decided to give the 'bot these skills, but I'd like to understand a bit more about their justification.
I think it's a dumb idea too - more often than not, it will find no new links and just mess up people's questionnaires etc.
Olias - Did you notice anything different about the useragent version # or the request headers?