How has Google found a page with no links to it?




arthur22

10:50 am on Jul 21, 2012 (gmt 0)

Just a quick question - I put up a test forum a few weeks back, haven't linked to it from anywhere at all. And it's been crawled and is in the google index.

No big deal, but I'd like to understand how google found the thing. Does the googlebot just guess at possible urls (surely that can't be true)? How could they possibly have found the page if there are no links to it anywhere on the web?!

tedster

11:49 am on Jul 21, 2012 (gmt 0)

If anyone visits a URL with a browser that has a Google toolbar installed - or using the Google Chrome browser - Google knows the URL exists.

Pjman

3:09 pm on Jul 21, 2012 (gmt 0)

Also if you have Adsense integrated with the page.

arthur22

6:47 pm on Jul 21, 2012 (gmt 0)

Thanks tedster, that was it then - I've checked the page in Chrome. I had no idea they used that to gather urls.

zeus

7:06 pm on Jul 21, 2012 (gmt 0)

It can also be a link from one of all those scrapers, one that is not listed yet.

LavingRunatic

7:28 pm on Jul 21, 2012 (gmt 0)

The same thing happened to me and it turned out that my hosting company published a page of newly created domains. Google crawled that page and ended up indexing my test site. Now when I create a new site I noindex it until I *want* it in the index.

lucy24

8:35 pm on Jul 21, 2012 (gmt 0)

Domains don't need your hosting company's help to be discovered. It's public record. The original question was about an unlinked page within a domain.

Hmm. Surprised g### doesn't do like all those bad robots and run around with a shopping list of likely directory names.
/admin/ ? Check.
/images/ ? Check.
/styles/ ? Check.

And so on.

Now that they've indexed everyone's sitemap.xml and robots.txt, you'd think they would be getting hungry again.

g1smd

8:37 pm on Jul 21, 2012 (gmt 0)

Don't "noindex" a dev site, block access with htpasswd rules.

Robert Charlton

8:47 pm on Jul 21, 2012 (gmt 0)

My guess would be not the toolbar, but publicly available server logs. In the past, Google has denied that the Toolbar was involved, and currently it appears that (by omission) they would perhaps be denying the same thing about Chrome. I'm including dates in the references below to indicate the history. We had a discussion about a similar problem here. Note the Google FAQ quote I posted about the "secret" web server...

Why is Google indexing my entire web server?
July, 2007
http://www.webmasterworld.com/google/3396393.htm [webmasterworld.com]

That quote has gone 404 several times, but was quoted specifically by Matt Cutts on his blog (note the title of the article)...

Generic Toolbar Indexing Debunk Post
by Matt Cutts on July 19, 2008
http://www.mattcutts.com/blog/toolbar-indexing-debunk-post/ [mattcutts.com]

The Google FAQ link is also broken in Matt's post. I'll quote it here....

Why is Googlebot downloading information from our "secret" web server?

It's almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. So, if there's a link to your "secret" web server or page on the web anywhere, it's likely that Googlebot and other web crawlers will find it.
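
To make the referrer mechanism in that quote concrete, here is a rough Python sketch of how the *other* server could end up holding your "secret" URL. It assumes the other server writes its access log in the common Apache "combined" format; the hostnames and log lines are made up for illustration. If that log (or a stats page built from it) is ever published, the URL is on the web as a crawlable string.

```python
import re

# Apache "combined" log format (an assumption about the other server's setup):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def extract_referrers(log_lines):
    """Return the set of referrer URLs recorded in combined-format log lines."""
    referrers = set()
    for line in log_lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # not a combined-format line, skip it
        referer = match.group("referer")
        if referer and referer != "-":  # "-" means no referrer was sent
            referrers.add(referer)
    return referrers


# Hypothetical log lines from someone else's server. The first visitor
# arrived by clicking an outbound link on the unlinked test forum.
sample_log = [
    '203.0.113.7 - - [21/Jul/2012:10:50:00 +0000] "GET /index.html HTTP/1.1" 200 5120 '
    '"http://secret.example.com/testforum/" "Mozilla/5.0"',
    '203.0.113.8 - - [21/Jul/2012:10:51:00 +0000] "GET /about.html HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0"',
]

for url in sorted(extract_referrers(sample_log)):
    print(url)
```
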

Part of this quote is now incorporated into current Google WMT documentation...

Googlebot - Webmaster Tools Help
updated 06/07/2012
[support.google.com...]

Chrome isn't mentioned. Can't say whether they would use it for Googlebot discovery or not.

I have very mixed thoughts about Google's aggressive indexing, btw. As a web professional who knows what I want indexed and what I don't, Google's aggressiveness in indexing has been a PITA. As a searcher looking for important information where webmasters have been too inept to make it visible, I can understand what Google's doing, and occasionally I've been glad they've done it.

Bottom line, if you don't want something indexed, use noindex and/or password protection.

MikeNoLastName

11:46 pm on Jul 22, 2012 (gmt 0)

Didn't I hear somewhere recently that G is now scanning non-linking text on a page for possible new sites to index?

I've also just noticed that G started indexing some random files in our disallowed directories. They're in cgi-bin and show up in supplemental results when doing site:example.com. We've had /cgi-bin disallowed for all robots for many years, AND I made sure nothing there is included in the sitemap.xml, yet I can still see at least 3 files: 2 .htm's and one called x.out indexed just with the URL and no descriptions. Can't put noindex in them because they are not actually complete .htm files, they're used as template includes for other pages and in the case of x.out is just a plain old data base!

I did a fetch as googlebot in WMT and they all came back as "Denied by robots.txt", but you can't submit them in that case, so hopefully G will get a clue anyway.
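
For anyone who wants to double-check what a Disallow rule actually covers, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content and the example.com URLs are assumptions standing in for the real files. Keep in mind that a Disallow rule only tells a compliant bot not to fetch the file; it says nothing about listing the bare URL, which would fit the URL-only entries with no description.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; the real file may differ.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical URLs on the site in question.
blocked = parser.can_fetch("Googlebot", "http://example.com/cgi-bin/x.out")
allowed = parser.can_fetch("Googlebot", "http://example.com/index.html")

print("cgi-bin file fetchable:", blocked)
print("home page fetchable:", allowed)
```
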

g1smd

12:38 am on Jul 23, 2012 (gmt 0)

"G is now scanning non-linking text"

They do, and it leads to a lot of incomplete and truncated URLs showing as 404 errors in WMT.
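
A toy illustration of why that happens, in Python. This is only a guess at the general idea, not Google's actual method: pull anything URL-shaped out of plain text with a regular expression. The sample text and URLs are made up. A URL that a page has shortened with "..." comes out as a path that does not exist, and requesting it gives a 404.

```python
import re

# Anything that starts with http:// or https:// and runs to the next
# whitespace, angle bracket or double quote. Deliberately naive.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def find_plain_text_urls(text):
    """Return URL-like strings found in plain text, minus trailing punctuation."""
    return [match.rstrip('.,;:!?)') for match in URL_PATTERN.findall(text)]


# Hypothetical page text: the second URL was shortened for display.
sample = (
    "Docs are at http://example.com/docs/setup.html, with a mirror at "
    "http://example.com/downloads/archi... (shortened by the forum software)."
)

for url in find_plain_text_urls(sample):
    print(url)
```
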

If you have files that are used only as server-side includes, you should use htaccess to block access to the folder from the web.

anallawalla

9:26 am on Jul 23, 2012 (gmt 0)

There are also some sites that monitor every domain that has been sold and when they find a website hosted there, they publish a link to it.

phranque

1:09 am on Jul 24, 2012 (gmt 0)

the best way to stop a bot from crawling AND indexing a requested resource is to respond with a 401 Unauthorized status code.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.2:

10.4.2 401 Unauthorized
The request requires user authentication. The response MUST include a WWW-Authenticate header field (section 14.47) containing a challenge applicable to the requested resource. The client MAY repeat the request with a suitable Authorization header field (section 14.8). If the request already included Authorization credentials, then the 401 response indicates that authorization has been refused for those credentials. If the 401 response contains the same challenge as the prior response, and the user agent has already attempted authentication at least once, then the user SHOULD be presented the entity that was given in the response, since that entity might include relevant diagnostic information. HTTP access authentication is explained in "HTTP Authentication: Basic and Digest Access Authentication"

14.47 WWW-Authenticate:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.47

14.8 Authorization:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.8

HTTP Authentication: Basic and Digest Access Authentication:
http://www.ietf.org/rfc/rfc2617.txt [ietf.org]
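
as a minimal sketch of what that exchange looks like, here is a toy server using Python's standard http.server. the credentials ("dev" / "secret"), the realm name and the make_server helper are made up for illustration; on a real site you would use your web server's own auth configuration as described in the how-tos that follow, not this.

```python
import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical credentials, for illustration only.
EXPECTED = "Basic " + base64.b64encode(b"dev:secret").decode("ascii")


class AuthRequiredHandler(BaseHTTPRequestHandler):
    """Answer 401 plus a WWW-Authenticate challenge unless credentials match."""

    def do_GET(self):
        if self.headers.get("Authorization") == EXPECTED:
            body = b"dev site\n"
            self.send_response(200)
        else:
            body = b"authentication required\n"
            self.send_response(401)
            # RFC 2616 sec 10.4.2: a 401 response MUST carry this header.
            self.send_header("WWW-Authenticate", 'Basic realm="dev"')
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=0):
    """Build (but do not start) a server on localhost; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), AuthRequiredHandler)
```

calling make_server(8000).serve_forever() would run it locally. a bot that sends no credentials gets the 401 and never sees the content.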

here's the Authentication and Authorization How-To for the Apache HTTP Server:
http://httpd.apache.org/docs/2.2/howto/auth.html [httpd.apache.org]

and the Windows Server documentation to Configure Basic Authentication (IIS 7):
http://technet.microsoft.com/en-us/library/cc772009(v=ws.10).aspx [technet.microsoft.com]

Basic Authentication <basicAuthentication> : Configuration Reference : The Official Microsoft IIS Site:
http://www.iis.net/ConfigReference/system.webServer/security/authentication/basicAuthentication [iis.net]