
Forum Moderators: Robert Charlton & aakk9999 & andy langton & goodroi


How has Google found a page with no links to it?

     
10:50 am on Jul 21, 2012 (gmt 0)

New User

joined:Jan 19, 2012
posts: 20
votes: 0


Just a quick question - I put up a test forum a few weeks back and haven't linked to it from anywhere at all. Yet it's been crawled and is in the Google index.

No big deal, but I'd like to understand how Google found the thing. Does Googlebot just guess at possible URLs? (Surely that can't be true.) How could it possibly have found the page if there are no links to it anywhere on the web?!
11:49 am on July 21, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


If anyone visits a URL with a browser that has the Google Toolbar installed - or uses the Google Chrome browser - then Google knows the URL exists.
3:09 pm on July 21, 2012 (gmt 0)

Preferred Member

5+ Year Member

joined:Mar 22, 2011
posts:399
votes: 0


The same applies if you have AdSense integrated on the page.
6:47 pm on July 21, 2012 (gmt 0)

New User

joined:Jan 19, 2012
posts: 20
votes: 0


Thanks tedster, that must be it then - I checked the page in Chrome. I had no idea they used that to gather URLs.
7:06 pm on July 21, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member zeus is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 28, 2002
posts:3444
votes: 1


It can also be a link from one of those scraper sites that just isn't listed yet.
7:28 pm on July 21, 2012 (gmt 0)

New User

joined:Apr 20, 2012
posts:34
votes: 0


The same thing happened to me and it turned out that my hosting company published a page of newly created domains. Google crawled that page and ended up indexing my test site. Now when I create a new site I noindex it until I *want* it in the index.
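For a dev site that shouldn't be indexed yet, the noindex can be applied site-wide at the server level rather than page by page. A minimal sketch for Apache (module and paths assumed; this sends the header on every response from the dev host):

```apache
# Sketch of a dev-site .htaccess: ask search engines not to index anything served here
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex, nofollow"
</IfModule>
```

Note that crawlers must still be able to fetch the pages to see this header, so don't combine it with a robots.txt Disallow for the same URLs.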
8:35 pm on July 21, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13210
votes: 347


Domains don't need your hosting company's help to be discovered. It's public record. The original question was about an unlinked page within a domain.

Hmm. Surprised g### doesn't act like all those bad robots and run around with a shopping list of likely directory names:
/admin/ ? Check.
/images/ ? Check.
/styles/ ? Check.

And so on.

Now that they've indexed everyone's sitemap.xml and robots.txt, you'd think they would be getting hungry again.
8:37 pm on July 21, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Don't "noindex" a dev site, block access with htpasswd rules.
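A minimal sketch of that approach for Apache 2.2 (file paths and the realm name are assumptions), placed in an .htaccess at the dev site's document root:

```apache
# Require a login for everything on the dev site
AuthType Basic
AuthName "Development site"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

The password file referenced by AuthUserFile would be created once with something like `htpasswd -c /home/example/.htpasswd devuser`. Unlike noindex, this stops crawlers from ever seeing the content at all.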
8:47 pm on July 21, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:11527
votes: 225


My guess would be not the toolbar, but publicly available server logs. In the past, Google has denied that the Toolbar was involved, and currently it appears that (by omission) they would perhaps be denying the same thing about Chrome. I'm including dates in the references below to indicate the history. We had a discussion about a similar problem here. Note the Google FAQ quote I posted about the "secret" web server...

Why is Google indexing my entire web server?
July, 2007
http://www.webmasterworld.com/google/3396393.htm [webmasterworld.com]

That quote has gone 404 several times, but was quoted specifically by Matt Cutts on his blog (note the title of the article)...

Generic Toolbar Indexing Debunk Post
by Matt Cutts on July 19, 2008
http://www.mattcutts.com/blog/toolbar-indexing-debunk-post/ [mattcutts.com]

The Google FAQ link is also broken in Matt's post. I'll quote it here....

Why is Googlebot downloading information from our "secret" web server?

It's almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. So, if there's a link to your "secret" web server or page on the web anywhere, it's likely that Googlebot and other web crawlers will find it.

Part of this quote is now incorporated into current Google WMT documentation...

Googlebot - Webmaster Tools Help
updated 06/07/2012
[support.google.com...]

Chrome isn't mentioned. Can't say whether they would use it for Googlebot discovery or not.
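The referrer leak that FAQ describes is easy to picture: if anyone clicks from the "secret" page to another site, that site's access log records the secret URL in the referrer field, and plenty of servers publish their logs or stats pages where crawlers find them. A combined-format log line on the other server might look like this (all hosts hypothetical):

```
203.0.113.5 - - [21/Jul/2012:10:50:00 +0000] "GET /page.html HTTP/1.1" 200 5120 "http://secret.example.com/test-forum/" "Mozilla/5.0"
```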

I have very mixed thoughts about Google's aggressive indexing, btw. As a web professional who knows what I want indexed and what I don't, Google's aggressiveness in indexing has been a PITA. As a searcher looking for important information where webmasters have been too inept to make it visible, I can understand what Google's doing, and occasionally I've been glad they've done it.

Bottom line, if you don't want something indexed, use noindex and/or password protection.
11:46 pm on July 22, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 10, 2003
posts: 927
votes: 11


Didn't I hear somewhere recently that G is now scanning non-linking text on a page for possible new sites to index?

I've also just noticed that G has started indexing some random files in our disallowed directories. They're in cgi-bin and show up in supplemental results when doing site:example.com. We've had /cgi-bin disallowed for all robots for many years, AND I made sure nothing there is included in the sitemap.xml, yet I can still see at least 3 files indexed - two .htm files and one called x.out - with just the URL and no descriptions. I can't put noindex in them because they aren't actually complete .htm files; they're used as template includes for other pages, and x.out is just a plain old database!

I did a Fetch as Googlebot in WMT and they all came back as "Denied by robots.txt", but you can't submit them in that case, so hopefully G will get a clue anyway.
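This is the expected (if counterintuitive) behavior: robots.txt blocks crawling, not indexing. A directive like the one presumably already in place here only stops Googlebot from fetching the files; if Google learns the URLs from somewhere else, it can still list them URL-only, with no description, exactly as described above:

```
User-agent: *
Disallow: /cgi-bin/
```

To get the URLs out of the index entirely, Google has to be allowed to fetch them and see a noindex, or access has to be blocked at the server level.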
12:38 am on July 23, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


G is now scanning non-linking text

They do, and it leads to a lot of incomplete and truncated URLs showing as 404 errors in WMT.

If you have files that are used only as server-side includes, you should use htaccess to block access to the folder from the web.
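One way to do that in Apache 2.2 syntax (the extension list is an assumption; adjust it to the actual include files) is a small .htaccess in the includes folder:

```apache
# Deny all web access to include-only files; they remain readable
# by the server itself for server-side includes
<FilesMatch "\.(inc|out)$">
    Order allow,deny
    Deny from all
</FilesMatch>
```

Requests for those files then return 403, and URL-only index entries eventually drop out.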
9:26 am on July 23, 2012 (gmt 0)

Moderator from AU 

WebmasterWorld Administrator anallawalla is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 3, 2003
posts:3728
votes: 9


There are also some sites that monitor every domain that has been sold and when they find a website hosted there, they publish a link to it.
1:09 am on July 24, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10553
votes: 13


the best way to stop a bot from crawling AND indexing a requested resource is to respond with a 401 Unauthorized status code.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.2:
10.4.2 401 Unauthorized
The request requires user authentication. The response MUST include a WWW-Authenticate header field (section 14.47) containing a challenge applicable to the requested resource. The client MAY repeat the request with a suitable Authorization header field (section 14.8). If the request already included Authorization credentials, then the 401 response indicates that authorization has been refused for those credentials. If the 401 response contains the same challenge as the prior response, and the user agent has already attempted authentication at least once, then the user SHOULD be presented the entity that was given in the response, since that entity might include relevant diagnostic information. HTTP access authentication is explained in "HTTP Authentication: Basic and Digest Access Authentication"


14.47 WWW-Authenticate:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.47

14.8 Authorization:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.8

HTTP Authentication: Basic and Digest Access Authentication:
http://www.ietf.org/rfc/rfc2617.txt [ietf.org]


here's the Authentication and Authorization How-To for the Apache HTTP Server:
http://httpd.apache.org/docs/2.2/howto/auth.html [httpd.apache.org]


and the Windows Server documentation to Configure Basic Authentication (IIS 7):
http://technet.microsoft.com/en-us/library/cc772009(v=ws.10).aspx [technet.microsoft.com]

Basic Authentication <basicAuthentication> : Configuration Reference : The Official Microsoft IIS Site:
http://www.iis.net/ConfigReference/system.webServer/security/authentication/basicAuthentication [iis.net]
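As a rough illustration of the 401 mechanics quoted above, here is a minimal Python sketch of a handler that refuses unauthenticated requests with the required WWW-Authenticate challenge. The username, password, and realm are made up; a real deployment would use the web server's own auth layer (htpasswd, IIS basicAuthentication) rather than application code.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import base64

# Illustrative credentials only
EXPECTED = "Basic " + base64.b64encode(b"devuser:secret").decode()

class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") == EXPECTED:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            # RFC 2616 10.4.2: a 401 MUST include a WWW-Authenticate
            # header carrying a challenge for the requested resource
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="Dev site"')
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet
```

A bot that receives the 401 has nothing to index and (lacking credentials) no way to retry successfully, which is why this beats noindex for keeping a dev site private.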