Home / Forums Index / The Google World / Google Search News
Forum Library : Charter : Moderators: Robert Charlton & lawman & mack & tedster

Google Search News

Featured Home Page Discussion
This 42 message thread spans 2 pages: 1 [2]
Googlebot Now Crawls via HTML Forms
GrendelKhan TSU


#:3626285
 9:57 am on April 14, 2008 (utc 0)

sorry typos in last post so maybe ppl didn't understand my last question....

"I mean, what would people be search FOR that they would pages behind a form to be in their SERP?

what I meant was:
what would netizens be searching for such that they would benefit from finding content that was behind a form?

the way that Google have indexed my particular site does have potential benefits to searchers.

as above, can you be more specific?

even Matt Cutts' explanation is vague about it (lumped in with dropdowns) and says it's more about "discovering new links". But he doesn't give a hard example about forms. Country links... ok. I can relate to that. Crawling those behind dropdowns is hard... but I'm still not clear on forms.

netizen searches for: ________?
SERP gives them useful result: ____________? <<< (this came from content behind a form.)

I'm kinda with pico and can only imagine something like:
"Thank you, your request has been sent successfully. A member of our staff will contact you shortly."

what link is it discovering from that?
+_+

[edited by: GrendelKhan_TSU at 10:02 am (utc) on April 14, 2008]

wheel


#:3626339
 12:20 pm on April 14, 2008 (utc 0)

I'm horrified at this possibility. I provide calculators to an industry via a form they put on their website. If they start indexing these forms there are all sorts of problems - duplicate content not the least of them. Most importantly my competitors now have a potentially easy way to find my entire client list.

I have a friend in the industry who publishes his client list for the public - put in your zip code and find someone who sells in your area. Again he's done it behind a form specifically so that it's not indexed - and specifically so that people don't have an easy way to crack a list of suppliers.

Ranting aside, will the action of the form receive any linkweight from this? My goodness, if it does, I suspect the dormant domain I use to serve the calculators from is liable to hit the top in a very competitive industry. That could change the serps in some industries.

Romeo


#:3626351
 12:35 pm on April 14, 2008 (utc 0)

There are sites whose database query forms explicitly send POST requests precisely because they don't want to expose half a million direct deep GET links to the search engines.
Other sites take queries and convert values through a special service algorithm, where there could be trillions of theoretically explorable values.

And now it looks as if they want to deliberately sabotage these setups.

Or are they suggesting that we should cloak our forms to hide them?

No, thanks. It is rude.

Some examples:

Imagine a website that simply asks for a web colour code (between #000000 and #FFFFFF) to show a resulting colour.gif: Wow! That would give a boost of 16,777,216 most useful hits.

Or how about a money currency converter between US$ and GB-Pounds?
And can we have decimals, please, too, to get more fun for all possible values between 0.01 and 9999999999999.99 ....

Or how about a small page just asking
"Hi, put in some meaningless value: [nnn]"
to give back an
"Oh, so you have put in 'nnn'. Thank you so much. Want to try another one? [nnn]".

Or a navigational page to "select 1, 2 or 3" being programmed to always give back a 200 "Sorry, your selection 7885365... is invalid, try a valid one." instead of a 404 or 403.

All this can get ballooned to many million/billion/trillion different valid and even content-rich pages.
There are so many different ways to get it completely wrong.
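A quick back-of-the-envelope check of the sizes involved, as a Python sketch. The 0.01 step for the currency converter is an assumption read off the decimal range quoted above, and the variable names are made up for illustration.

```python
from decimal import Decimal

# Six hex digits with 16 choices each: #000000 through #FFFFFF inclusive.
colour_codes = 16 ** 6

# Amounts from 0.01 up to 9999999999999.99, assuming a step of 0.01.
step = Decimal("0.01")
currency_amounts = int(Decimal("9999999999999.99") / step)

print(f"{colour_codes:,} possible colour pages")
print(f"{currency_amounts:,} possible amounts per conversion direction")
```

Either form on its own is enough to give a crawler an effectively unbounded URL space if it tries every value.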

... ah, enough of that.
I am just off now for a few minutes to do some funny coding, and the GOOG bot is invited to come by shortly afterwards ...

Kind regards,
R.

Oliver Henniges


#:3626473
 2:58 pm on April 14, 2008 (utc 0)

Silly question: Does anyone really use form elements with the GET method?

Personally I only use <form> in combination with POST. Instead of using GET with a submit element I'd always prefer the complete target location as an href link. Obviously, I did not fully understand what the GET method was made for.

Receptional Andy


#:3626489
 3:27 pm on April 14, 2008 (utc 0)

* Use GET if:
    o The interaction is more like a question (i.e., it is a safe operation such as a query, read operation, or lookup).
* Use POST if:
    o The interaction is more like an order, or
    o The interaction changes the state of the resource in a way that the user would perceive (e.g., a subscription to a service), or
    o The user be held accountable for the results of the interaction.

http://www.w3.org/2001/tag/doc/whenToUseGet.html#checklist

GET forms have a number of advantages for users. As an example, it allows the sending/bookmarking of links to Google search results pages, or a link to HTML validation results, etc.
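To make the bookmarking point concrete, here is a minimal Python sketch of what a browser sends for each method. The helper names, the example.com URL and the field names are all hypothetical; only `urllib.parse.urlencode` is a real library call.

```python
from urllib.parse import urlencode

def get_form_url(action, fields):
    """URL requested when a GET form is submitted: fields go in the query string."""
    return action + "?" + urlencode(fields)

def post_form_request(action, fields):
    """(url, body) for a POST form: the URL stays bare, fields travel in the body."""
    return action, urlencode(fields).encode("ascii")

fields = {"q": "html forms", "page": "2"}  # hypothetical form fields

# The GET result is a complete address that can be bookmarked, emailed or linked.
print(get_form_url("http://example.com/search", fields))

# The POST URL carries no trace of the query, so a plain link cannot reproduce it.
print(post_form_request("http://example.com/search", fields))
```

This is also why a GET form is the easier target for a crawler: every combination of field values corresponds to an ordinary URL.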

ddogg


#:3626519
 4:13 pm on April 14, 2008 (utc 0)

This makes no sense for most sites. Obviously, if we want content crawled we will make it available on our sites through links.

GrendelKhan TSU


#:3626521
 4:18 pm on April 14, 2008 (utc 0)


I'm horrified at this possibility. I provide calculators to an industry via a form they put on their website. If they start indexing these forms there's all sorts of problems - duplicate content not the least of them. Most importantly my competitors now have a potentially easy way to find my entire client list.

I have a friend in the industry who publishes his client list for the public - put in your zip code and find someone who sells in your area. Again he's done it behind a form specifically so that it's not indexed - and specifically so that people don't have an easy way to crack a list of suppliers.

this is exactly what I was worried about. I'm in the DB industry myself.

Even if there are nofollows or robots.txt or whatever ways to deal with the issue... it's still creating more work and more things to have to worry about. Things that I don't necessarily see why we SHOULD have to worry about. Let's not forget the cost of making the change to block the crawl (for most sites); we are also going to have all the ppl out there (probably the majority) who won't know they were supposed to worry about it till waaaay too late.

But forget the risky downside for a sec (it's obvious, there is a lot one would NOT want netizens to see from forms getting crawled):

I'm still not seeing what a webmaster (whitehat) would want to be crawled from a form, even theoretically.

anyone? +_+

[edited by: GrendelKhan_TSU at 4:22 pm (utc) on April 14, 2008]

wheel


#:3626672
 7:41 pm on April 14, 2008 (utc 0)

I went back and read Matt Cutts' blog post from the link earlier in this thread. I'll comment at the risk of the standard google flame (let's not).

He suggested that example.com use a list of links by country at the bottom of the page. Ewwwww. A site with a block of links at the bottom for SE purposes? I tend to assume stuff like that won't pass a hand inspection. I'd love to do that for better local rankings myself, I don't because it looks horrible and is 100% SE driven. Yes my competitors do this and rank better for local search terms....I've deliberately decided not to do that.

Overall he's advocating specific design implementations that are strictly for the search engines and not for the users. The entire post seems to be written from that perspective. I'd suggest that some of the suggestions are actually detrimental to the user.

Seems like a slippery slope. Following forms is going to create a worse mess than the problem they're trying to solve. I also suspect that in the example he's used (clickable map of the world with drop down to select region) that most people would be using POST, not GET. Crawling forms doesn't solve the problem.

Silvery


#:3628328
 4:14 pm on April 16, 2008 (utc 0)

While it's nice that they don't spider/index results if you've disallowed search URLs in robots.txt, this is still a bit too after-the-fact. After all, major search engines didn't previously execute automated queries through search forms uninvited, so webmasters may now be dealing with inconvenient, bulk automated queries that they had no reason to expect.

I can appreciate the desire to index the unavailable content, but there are lots of folks in the internet industry who rely on their site usage statistics for various reasons, and having sudden spikes in search queries may not be something they can easily correct in their reports. Not to mention cases others have cited here where no-value pages are getting indexed and logfiles get suddenly clogged up with the automated query records.

This really should have been announced prior to deployment out in the wild.

Many webmasters won't have their search URLs disallowed in robots.txt, because they've never previously needed to do so.

This is just a bit like how Network Solutions once co-opted traffic from all non-registered domain names -- suddenly changing legacy paradigms on the internet without advance notice is both rude and unwise.

g1smd


#:3628496
 6:55 pm on April 16, 2008 (utc 0)

I have now added meta robots noindex nofollow tags to all the pages with forms on.

I can't afford any time to look into the potential effects of not doing so, so have just gone with a blanket "keep out".

I do wonder if using the robots.txt disallow would instead be a better option.
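As a rough way to check what a robots.txt disallow would cover, Python's standard `urllib.robotparser` can be fed the rules directly. The `/search` path and the example.com URLs below are assumptions for illustration, not anything taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of the form's target URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Form results under /search are blocked; ordinary pages are still crawlable.
blocked = not parser.can_fetch("Googlebot", "http://example.com/search?q=widgets")
allowed = parser.can_fetch("Googlebot", "http://example.com/contact.html")
print(blocked, allowed)
```

Unlike a meta robots tag, which sits on the page holding the form, a rule like this targets the URL the form submits to.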

Receptional Andy


#:3628518
 7:16 pm on April 16, 2008 (utc 0)

g1smd: if the activity I saw (in one of the linked threads by the OP) is attributable to this new behaviour, the sheer volume of crawling was a nuisance. Directives for robots in meta elements encourage re-crawling, so I opted for a disallow.

The way some sites are set up, I may just disallow any URL containing a question mark using wildcards.
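A wildcard rule of that kind would look like `Disallow: /*?`. The sketch below is a simplified Python model of how such a pattern might be matched, assuming the commonly documented convention that `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function names are made up, and this is not a full robots.txt implementation.

```python
import re

def rule_to_regex(rule):
    """Translate one Disallow pattern into a compiled regex (simplified model)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_disallowed(path_and_query, disallow_rules):
    """True if any rule matches the path (plus query string) from its start."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in disallow_rules)

RULES = ["/*?"]  # hypothetical rule: block every URL containing a question mark

print(is_disallowed("/calc?amount=10", RULES))
print(is_disallowed("/about.html", RULES))
```

A blanket rule like this also blocks any legitimate pages that happen to use query strings, so it only suits sites where every parameterised URL is form output.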

Frankly, I don't consider this new spidering behaviour to be at all desirable. It creates a lot of work that I believe is wholly unnecessary.

Tastatura


#:3634660
 1:04 am on April 25, 2008 (utc 0)

At Faculty Summit in Zurich on 02/14/08, Google's Alfred Spector (VP of Research and Special Initiatives) briefly talked about some current projects including "Deep Web" - all html indexable data through forms.

I hope they give googlebot a credit card to allow successful completion of my order forms

Actually he mentioned that they have to be careful not to buy anything (not click on a "buy" button), and about which form parameters to use as input.

Here is link to the video:
http://video.google.com/videoplay?docid=-3486163868527136264

and Deep Web starts around 44/45th minute.

List of some of the current projects (as mentioned in the video)
-Voice
-Machine Translation
-Entity/Relation Extraction
-Deep Web

Also listen to the Q/A from the audience at around 59th minute regarding Deep Web/semantic web/ meta data - very interesting answer :)

[edited by: Tastatura at 1:10 am (utc) on April 25, 2008]

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld ® and PubCon ® are a Registered Trademarks of WebmasterWorld Inc.
© WebmasterWorld Inc. / SearchEngineWorld 1996-2008 all rights reserved