Googlebot crawling GET forms with one variable?

any better explanations welcome!

olias

5:59 pm on Aug 22, 2003 (gmt 0)

Last night I added a little quiz with 24 options to one of my sites; people have to submit their choice from a SELECT box.

Basically that makes for the simplest form possible: every possible result page can easily be derived from the form action, the select name and the option values.
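As a rough illustration of how little is needed, here is a minimal Python sketch that builds every result URL from those three pieces. The page URL, the action `result.php` and the option values 1 to 24 are hypothetical placeholders, not the real quiz; only the select name `choice` is borrowed from the test form later in the thread.

```python
from urllib.parse import urlencode, urljoin


def derive_result_urls(page_url, action, select_name, option_values):
    """Build every GET URL that a form with a single SELECT can produce."""
    # urljoin resolves the action whether it is relative or absolute.
    base = urljoin(page_url, action)
    return [f"{base}?{urlencode({select_name: value})}" for value in option_values]


urls = derive_result_urls(
    "http://www.example.com/quiz.html",  # hypothetical page holding the form
    "result.php",                         # hypothetical relative form action
    "choice",                             # select name
    range(1, 25),                         # 24 hypothetical option values
)
print(len(urls))
print(urls[0])
```

Nothing here requires submitting the form; the URLs fall out of the markup alone.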

This morning I find that Googlebot has managed to crawl 17 of the result pages despite there being no other links to them. My first reaction was that it was probably the Toolbar visit/crawling thing that has been the subject of much debate, but on investigation I found at least one page that had been crawled before anyone had ever viewed it.

To my mind that rules out Toolbar theories and the usual Googleguy response about referral leakage, so I am left with the idea that the form has been crawled. Does anyone have any better ideas?

ciml

6:06 pm on Aug 22, 2003 (gmt 0)

If the form action starts http:// then I'd expect it to be crawled. At some point (maybe already?), other links may be followed too.

jimbeetle

6:12 pm on Aug 22, 2003 (gmt 0)

The question I have is: Why would G want to follow form links? Can't see it adding to the quality of its index, more probably degrading it. Seems like it would be simpler and better if G just ignored forms altogether.

olias

7:05 pm on Aug 22, 2003 (gmt 0)

"form action starts [<...]"
Good point, but no, it is a relative link.
"Why would G want to follow form links?"

That is what I am wondering. If these pages end up in the index then it really doesn't add anything useful, but I guess there may be cases where site navigation is done using this method.

SebastianX

7:16 pm on Aug 22, 2003 (gmt 0)

Lots of scripts use default values, so even without a user input the output makes sense. If not, the useless page won't rank high and the quality of the index is not affected.
AFAIK Googlebot follows absolute and relative URLs in form action.

wasmith

6:14 am on Aug 23, 2003 (gmt 0)

I considered the value of looking at forms some time ago. Normally the pages behind them (on the best of that class of sites) provide quality content, but not always: sometimes they lead to nothing more than PPC listings from a search engine, or a website-made affiliate program directory [spelling bad, websites bad].

My first thought when I read the thread title was that they are determining what class the page with the form belongs to: the theme of that URL. But Googlebot could also learn about other URLs by following and indexing those pages, instead of just checking them.

Very interesting. I hope enough people post more information to follow this in detail.

sabai

7:22 am on Aug 23, 2003 (gmt 0)

olias - I'm sorry, I don't want to offend you, but are you sure?

I had a shock a few weeks ago when I thought I had seen the same thing and even went so far as changing the form to a POST to see if googlebot would follow that too.

I had a random question appear on my site, with select or radio inputs for the answer. I saw in my logs a request that looked like this:

pagename.php?question_id=7

coming from googlebot. I thought I was seeing something new until I realised that I had put a link at the bottom of the form saying 'click here to see results' and that googlebot was following that link. When a question is answered, the request actually looks like this:

pagename.php?question_id=7&answer_id=15

because of a hidden variable containing the ID of the question that is being answered. With no question ID, I'd made the default action to be to display the previous answers to the questions.
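A quick way to tell the two kinds of request apart when checking logs is to look for the answer variable. This is only a sketch in Python: the function name `classify_request` is made up for illustration, and it assumes the variable names `question_id` and `answer_id` shown in the two requests above.

```python
from urllib.parse import parse_qs, urlsplit


def classify_request(request_uri):
    """Return 'answer' for a form submission, 'results' for the plain link."""
    params = parse_qs(urlsplit(request_uri).query)
    # Only a submitted form carries answer_id; the results link does not.
    if "answer_id" in params:
        return "answer"
    return "results"


print(classify_request("pagename.php?question_id=7"))
print(classify_request("pagename.php?question_id=7&answer_id=15"))
```

If every Googlebot request classifies as 'results', the bot is probably following an ordinary link rather than the form.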

Maybe something like this is happening for you too? If not then perhaps you want to try changing the form to a POST request to see if googlebot continues to crawl the form. Just out of interest...

olias

6:16 pm on Aug 29, 2003 (gmt 0)

Thanks for all your thoughts and ideas.

As an experiment I put the following form on one of my pages - but it was only served up to Googlebot, no actual browsers saw it. (I should add I am not normally into cloaking!)

<form name="AForm" method="get" action="/spider-trap.php">
<select name="choice" size="1">
<option value="25">25. The Spider Trap</option>
</select>
<input type="submit" value="Result">
</form>

Sure enough, a couple of days later /spider-trap.php?choice=25 was crawled by Googlebot.

For me that pretty much confirms it has effectively crawled the form by piecing together the fairly simple bits of information.
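To show how simple the "piecing together" could be, here is a minimal Python sketch that parses the test form above and rebuilds the crawled URL. It is an assumption about what a crawler might do, not a description of how Googlebot actually works; the class name `GetFormParser` is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class GetFormParser(HTMLParser):
    """Collect the action, select name and option values of one GET form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.is_get = False
        self.select_name = None
        self.values = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self.action = attributes.get("action")
            # HTML forms default to GET when no method is given.
            self.is_get = (attributes.get("method") or "get").lower() == "get"
        elif tag == "select":
            self.select_name = attributes.get("name")
        elif tag == "option" and "value" in attributes:
            self.values.append(attributes["value"])

    def urls(self):
        """Return one URL per option, or nothing if the form is not a usable GET form."""
        if not (self.is_get and self.action and self.select_name):
            return []
        return [f"{self.action}?{urlencode({self.select_name: v})}" for v in self.values]


FORM = """<form name="AForm" method="get" action="/spider-trap.php">
<select name="choice" size="1">
<option value="25">25. The Spider Trap</option>
</select>
<input type="submit" value="Result">
</form>"""

parser = GetFormParser()
parser.feed(FORM)
print(parser.urls())  # ['/spider-trap.php?choice=25']
```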

dmorison

6:26 pm on Aug 29, 2003 (gmt 0)

I'm not sure this is a good idea.

I offer a service that has a ludicrously simple sign-up procedure. You select your timezone (as an offset from GMT) from a drop-down box and hit "Go".

If I hadn't got the heads-up on this, I'd have had Googlebot happily creating itself 25-odd accounts!

Like many sites, certain actions on my sites trigger email to be sent to both my own and certain other addresses. Again, not something I want to be triggered automatically.

I'd be interested to hear opinions on this as well. I guess the brains at the 'plex have thought about the consequences and decided to give the 'bot these skills, but I'd like to understand a bit more about their justification.
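One common guard against this kind of accident is to do anything with side effects (creating accounts, sending email) only on POST and to treat GET as read-only, which is how HTTP intends GET to be used. The sketch below is in Python with hypothetical names (`handle_signup`, `tz_offset`, `create_account`); whether Googlebot leaves POST forms alone is not confirmed anywhere in this thread.

```python
def handle_signup(method, params, create_account):
    """Create an account only for POST requests; a GET just shows the form."""
    if method.upper() != "POST":
        # A crawler requesting signup?tz_offset=... via GET changes nothing.
        return "show_form"
    create_account(params["tz_offset"])
    return "created"


created = []  # stands in for a real account store
print(handle_signup("GET", {"tz_offset": "0"}, created.append))
print(handle_signup("POST", {"tz_offset": "-5"}, created.append))
print(created)
```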

sabai

5:47 am on Aug 30, 2003 (gmt 0)

One thing googlebot might be looking for is 'quick link' select boxes... only real use I can think of... it could normally just parse those links from the form though.

I think it's a dumb idea too - more often than not, it will find no new links and just mess up people's questionnaires etc.

Olias - Did you notice anything different about the user agent version # or the request headers?
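For anyone wanting to check their own logs, here is a minimal Python sketch that pulls out Googlebot requests for a form action together with the referrer and user agent. It assumes the Apache "combined" log format, and the two sample lines are made up for illustration, not taken from anyone's real logs.

```python
import re

# Assumes Apache "combined" format; adjust the pattern for other log formats.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_form_hits(lines, action="/spider-trap.php"):
    """Return (path, referer, agent) for Googlebot requests to action?..."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match["agent"] and match["path"].startswith(action + "?"):
            hits.append((match["path"], match["referer"], match["agent"]))
    return hits


# Made-up sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [29/Aug/2003:10:00:00 +0000] "GET /spider-trap.php?choice=25 HTTP/1.0" '
    '200 512 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.2 - - [29/Aug/2003:10:05:00 +0000] "GET /index.html HTTP/1.1" '
    '200 1024 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

for path, referer, agent in googlebot_form_hits(sample):
    print(path, referer, agent)
```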