
Google News Archive Forum

    
Google.com now has no robots.txt?
rfgdxm1

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 16587 posted 10:11 pm on Aug 30, 2003 (gmt 0)

Google
Error

Not Found
The requested URL /robots.txt was not found on this server.
------

Looks like a big "D'oh!" ;)

 

tribal

10+ Year Member



 
Msg#: 16587 posted 3:03 pm on Sep 1, 2003 (gmt 0)

They probably included something in their algo like
if (url<>www.google.*) then process();

Seems to me like you have too much time on your hands ;)

Brett_Tabke

WebmasterWorld Administrator, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 16587 posted 3:05 pm on Sep 1, 2003 (gmt 0)

Whoa - that's a major find.

Wouldn't that be the first major search engine to allow full spider access?

Sinner_G

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 3:08 pm on Sep 1, 2003 (gmt 0)

The cached version is still there though :).

TomWaits

10+ Year Member



 
Msg#: 16587 posted 3:14 pm on Sep 1, 2003 (gmt 0)

Someday soon we'll be able to search Wisenut for a Google-cached Looksmart page.

lazerzubb

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 3:20 pm on Sep 1, 2003 (gmt 0)

I think it only means an increase in the index.

Google has always been very good at spidering SERPs on google.com itself; maybe that's "the" reason.

Fun thing to do is to enter portals and search engines in google and do a SITE: on them, and see how many of the results they've indexed, then check out the robots.txt on those sites ;)

dmorison

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 4:10 pm on Sep 1, 2003 (gmt 0)

Google.com now has no robots.txt

Probably making sure they get well indexed by the next big thing in Internet search... ;)

lasko

10+ Year Member



 
Msg#: 16587 posted 4:30 pm on Sep 1, 2003 (gmt 0)


Hmmmm

How to increase the number of indexed pages in your own search engine?

Erm...Delete the robots.txt file of course!

So next month's total number of pages indexed by Google is

6 billion

That should make us the biggest search engine again.

Never mind Alltheweb, you did beat them for a short while :)

rise2it

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 5:02 pm on Sep 1, 2003 (gmt 0)

Making sure they are indexed properly by Microsoft's new 'spider'.....

Jeff_H

10+ Year Member



 
Msg#: 16587 posted 6:20 pm on Sep 1, 2003 (gmt 0)

You must all be crazy! It's loading fine for me.

[google.com...]

deft_spyder

10+ Year Member



 
Msg#: 16587 posted 6:22 pm on Sep 1, 2003 (gmt 0)

i see it too. looks like a google burp.

SlowMove

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 6:25 pm on Sep 1, 2003 (gmt 0)

Try reloading the page. It's there about 3 out of 4 times.

Key_Master

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 6:33 pm on Sep 1, 2003 (gmt 0)

[google.com...] (no robots.txt)
[google.com...]

jesserud

10+ Year Member



 
Msg#: 16587 posted 7:08 pm on Sep 1, 2003 (gmt 0)

google.com (no www) doesn't need a robots.txt because every page there redirects to the same page on www.google.com.

DerekH

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 7:15 pm on Sep 1, 2003 (gmt 0)

So let me get this right...
If google has no robots.txt, it will index all its cached pages, and put them in its cache, with the google header on top. And then next month...

Look out Capn, it's gonna blow!
DerekH

Key_Master

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 8:01 pm on Sep 1, 2003 (gmt 0)

google.com (no www) doesn't need a robots.txt because every page there redirects to the same page on www.google.com

Google doesn't use invisible redirects so while your browser may follow a 301 or 302 redirect, a robot will not.

As it currently stands, any bot looking for a robots.txt on google.com (without the www) will not find one and will have every right, under the robots exclusion standard, to fully spider any portion of their website.
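
For anyone who wants to see the "no robots.txt means nothing is excluded" reading in code, here is a rough sketch using Python's standard urllib.robotparser. The helper name policy_for, the bot name ExampleBot, and the sample rules are all made up for illustration. The status handling (404 treated as no restrictions, 401/403 as fully closed) is one common crawler convention, not anything Google has confirmed about its own bot.

```python
from urllib.robotparser import RobotFileParser


def policy_for(status: int, body: str = "") -> RobotFileParser:
    """Build a crawl policy from a robots.txt HTTP response (hypothetical helper)."""
    parser = RobotFileParser()
    if status == 200:
        # A readable robots.txt: parse whatever rules it contains.
        parser.parse(body.splitlines())
    elif status in (401, 403):
        # Access to robots.txt itself is denied: assume the whole host is closed.
        parser.parse(["User-agent: *", "Disallow: /"])
    elif 400 <= status < 500:
        # robots.txt not found (e.g. 404): no rules, so nothing is excluded.
        parser.parse([])
    else:
        # Assumption: other statuses are left for the caller to decide.
        raise ValueError(f"unhandled robots.txt status: {status}")
    return parser


# Hypothetical responses: a missing file on one host, a simple rule on the other.
missing = policy_for(404)
present = policy_for(200, "User-agent: *\nDisallow: /search\n")

print(missing.can_fetch("ExampleBot", "http://google.com/search?q=test"))
print(present.can_fetch("ExampleBot", "http://www.google.com/search?q=test"))
```

A well-behaved bot would build one such policy per host it visits and consult it before every request.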

jesserud

10+ Year Member



 
Msg#: 16587 posted 8:40 pm on Sep 1, 2003 (gmt 0)

[google.com...] is a 302 redirect. A spider could index the 302, but there's no "real" content for a spider to index (or links for a spider to follow) at [google.com...]

Key_Master

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 9:22 pm on Sep 1, 2003 (gmt 0)

jesserud,

What you fail to understand is a robot can follow any www.google.com link it finds (or on any subdomain at Google for that matter) if it can't find a robots.txt at google.com. Robots.txt files are not supposed to redirect. Even GoogleGuy would agree with that.

But I think they're working on fixing the problem now.

rfgdxm1

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 16587 posted 9:52 pm on Sep 1, 2003 (gmt 0)

Key_Master is right. This isn't the way things should be.

Slud

10+ Year Member



 
Msg#: 16587 posted 9:59 pm on Sep 1, 2003 (gmt 0)

The robots.txt has been updated to block /answers/search?q= .

I've seen a number of queries in my area where Google Answers were turning up (when the PDFs weren't crowding them out, har har).
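
A quick way to check what a rule like that would do, sketched with Python's standard urllib.robotparser. The rule text below is only a guess at how the new entry is worded, and ExampleBot is a made-up bot name.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule text; the real robots.txt may word the entry differently.
rules = """\
User-agent: *
Disallow: /answers/search?q=
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Rules are prefix matches, so only URLs starting with the blocked prefix are excluded.
search_ok = parser.can_fetch("ExampleBot", "http://www.google.com/answers/search?q=robots")
landing_ok = parser.can_fetch("ExampleBot", "http://www.google.com/answers/")
print(search_ok, landing_ok)
```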

mbauser2

10+ Year Member



 
Msg#: 16587 posted 6:40 am on Sep 2, 2003 (gmt 0)

What you fail to understand is a robot can follow any www.google.com link it finds (or on any subdomain at Google for that matter) if it can't find a robots.txt at google.com.

That makes no damn sense. You're falling for the "example.com == www.example.com" fallacy. It's not true. "google.com" and "www.google.com" are allowed to be completely different servers (content-wise, physically, and administratively). Quoting from that most Holy of SEO Holies, the robots.txt spec [robotstxt.org]:

The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt".

"local URL", in this (slightly archaic) instance means "relative URI". google.com/robots.txt is not a relative URI for anything on www.google.com, because they've got different absolute roots.

The only robots.txt that applies to a given URI is the robots.txt accessed using the same fully-qualified domain name. That's the spec, and it's the spec for a damn good reason: managing webservers in the .uk TLD (among others) would be chaos if things were done your way.
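
To make the point concrete, here is a small sketch (Python, standard library only) of how a crawler might work out which robots.txt governs a given URL. The function name robots_url and the example.co.uk address are just for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(url: str) -> str:
    """Return the robots.txt URL on the same scheme and host as `url`."""
    parts = urlsplit(url)
    # Keep scheme and host exactly as given; replace path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


pages = (
    "http://google.com/answers/",
    "http://www.google.com/search?q=test",
    "http://www.example.co.uk/page.html",
)
for page in pages:
    print(page, "->", robots_url(page))
```

Each hostname maps to its own file, so a missing google.com/robots.txt says nothing about www.google.com, and nobody ever goes looking for co.uk/robots.txt.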

mack

WebmasterWorld Administrator, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 16587 posted 6:44 am on Sep 2, 2003 (gmt 0)

Surely Google have the ability to control what their own bot indexes on their own site without having to specify it in robots.txt?

I would think the bot has this info programmed in?

Mack.

dmorison

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 7:07 am on Sep 2, 2003 (gmt 0)

Surely Google have the ability to control what their own bot indexes on their own site without having to specify it in robots.txt?

Of course; but given that Google need to use a robots.txt to block other crawlers there is no reason why it should not be the only mechanism in place to stop their own.

mil2k

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 7:11 am on Sep 2, 2003 (gmt 0)

So if someone gives a link to google serp other bots can follow and index it?

dmorison

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 16587 posted 8:27 am on Sep 2, 2003 (gmt 0)

So if someone gives a link to google serp other bots can follow and index it?

Not any more, as robots.txt appears to be back in place. Talking about a well-behaved bot, of course.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved