How has Google found a page with no links to it?

     
10:50 am on Jul 21, 2012 (gmt 0)



Just a quick question - I put up a test forum a few weeks back and haven't linked to it from anywhere at all. It's been crawled anyway and is in the Google index.

No big deal, but I'd like to understand how Google found the thing. Does Googlebot just guess at possible URLs? (Surely that can't be true.) How could they possibly have found the page if there are no links to it anywhere on the web?!
11:49 am on Jul 21, 2012 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If anyone visits a URL with a browser that has the Google Toolbar installed - or with the Google Chrome browser - Google knows the URL exists.
3:09 pm on Jul 21, 2012 (gmt 0)



Also if you have AdSense integrated into the page.
6:47 pm on Jul 21, 2012 (gmt 0)



Thanks tedster, that was it then - I've checked the page in Chrome. I had no idea they used that to gather URLs.
7:06 pm on Jul 21, 2012 (gmt 0)

WebmasterWorld Senior Member zeus is a WebmasterWorld Top Contributor of All Time 10+ Year Member



It could also be a link from one of those scrapers that isn't listed yet.
7:28 pm on Jul 21, 2012 (gmt 0)



The same thing happened to me and it turned out that my hosting company published a page of newly created domains. Google crawled that page and ended up indexing my test site. Now when I create a new site I noindex it until I *want* it in the index.
8:35 pm on Jul 21, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Domains don't need your hosting company's help to be discovered. It's public record. The original question was about an unlinked page within a domain.

Hmm. Surprised g### doesn't do like all those bad robots and run around with a shopping list of likely directory names.
/admin/ ? Check.
/images/ ? Check.
/styles/ ? Check.

And so on.

Now that they've indexed everyone's sitemap.xml and robots.txt, you'd think they would be getting hungry again.
8:37 pm on Jul 21, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Don't "noindex" a dev site; block access with htpasswd rules.
8:47 pm on Jul 21, 2012 (gmt 0)

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



My guess would be not the toolbar, but publicly available server logs. In the past, Google has denied that the Toolbar was involved, and by omission they currently appear to be denying the same thing about Chrome. I'm including dates in the references below to indicate the history. We had a discussion about a similar problem here. Note the Google FAQ quote I posted about the "secret" web server...

Why is Google indexing my entire web server?
July, 2007
http://www.webmasterworld.com/google/3396393.htm [webmasterworld.com]

That quote has gone 404 several times, but was quoted specifically by Matt Cutts on his blog (note the title of the article)...

Generic Toolbar Indexing Debunk Post
by Matt Cutts on July 19, 2008
http://www.mattcutts.com/blog/toolbar-indexing-debunk-post/ [mattcutts.com]

The Google FAQ link is also broken in Matt's post. I'll quote it here....

Why is Googlebot downloading information from our "secret" web server?

It's almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. So, if there's a link to your "secret" web server or page on the web anywhere, it's likely that Googlebot and other web crawlers will find it.
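
To make the referrer leak in that FAQ quote concrete, here is a minimal Python sketch of what the other web server ends up holding. It assumes the other server writes Apache-style "combined" access logs; the host names, IP addresses, and the `referrers` helper are invented for illustration and are not from the FAQ. If those logs or stats pages are published, every URL returned here becomes a crawlable link.

```python
import re

# Tail of an Apache "combined" log line:
#   "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def referrers(log_lines):
    """Return the set of non-empty Referer URLs found in the log lines."""
    found = set()
    for line in log_lines:
        match = COMBINED.search(line)
        if match and match.group("referer") not in ("", "-"):
            found.add(match.group("referer"))
    return found


# Hypothetical log entries: the first visitor arrived by following a link
# on the "secret" page, the second typed the address in directly.
sample_log = [
    '203.0.113.5 - - [21/Jul/2012:10:50:00 +0000] "GET /page.html HTTP/1.1" 200 512 '
    '"http://secret.example.com/test-forum/" "Mozilla/5.0"',
    '203.0.113.9 - - [21/Jul/2012:10:51:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

print(referrers(sample_log))
```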

Part of this quote is now incorporated into current Google WMT documentation...

Googlebot - Webmaster Tools Help
updated 06/07/2012
[support.google.com...]

Chrome isn't mentioned. Can't say whether they would use it for Googlebot discovery or not.

I have very mixed thoughts about Google's aggressive indexing, btw. As a web professional who knows what I want indexed and what I don't, Google's aggressiveness in indexing has been a PITA. As a searcher looking for important information where webmasters have been too inept to make it visible, I can understand what Google's doing, and occasionally I've been glad they've done it.

Bottom line, if you don't want something indexed, use noindex and/or password protection.
11:46 pm on Jul 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Didn't I hear somewhere recently that G is now scanning non-linking text on a page for possible new sites to index?

I've also just noticed that G started indexing some random files in our disallowed directories. They're in cgi-bin and show up in supplemental results when doing site:example.com. We've had /cgi-bin disallowed for all robots for many years, AND I made sure nothing there is included in the sitemap.xml, yet I can still see at least 3 files (2 .htm's and one called x.out) indexed with just the URL and no descriptions. Can't put noindex in them because they are not actually complete .htm files; they're used as template includes for other pages, and x.out is just a plain old database!

I did a Fetch as Googlebot in WMT and they all came back as "Denied by robots.txt", but you can't submit them in that case, so hopefully G will get a clue anyway.
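
For what it's worth, the behaviour described above is consistent with what a robots.txt rule actually controls: whether a crawler may fetch a URL, not whether the bare URL can be listed once it is known from somewhere else. A small sketch with Python's standard `urllib.robotparser` shows the fetch side. The robots.txt content and the example.com URLs are assumptions based on the description in the post, not the poster's actual files.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt, matching "/cgi-bin disallowed for all robots".
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_crawl(url, agent="Googlebot"):
    """True if the parsed robots.txt lets this user agent fetch the URL."""
    return parser.can_fetch(agent, url)


print(may_crawl("http://example.com/cgi-bin/x.out"))  # False
print(may_crawl("http://example.com/index.htm"))      # True
```

A compliant crawler never downloads the disallowed file, which fits the "URL only, no description" listings and the "Denied by robots.txt" result. It also means a noindex inside the file would never be seen while the Disallow rule is in place.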
12:38 am on Jul 23, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



"G is now scanning non-linking text"

They do, and it leads to a lot of incomplete and truncated URLs showing as 404 errors in WMT.

If you have files that are used only as server-side includes, you should use htaccess to block access to the folder from the web.
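
As an illustration of why scanning non-linking text produces truncated URLs, here is a deliberately naive Python sketch. It is not Google's method, which isn't public; the regular expression, the `find_plain_urls` name, and the sample sentence are all made up for this example. Any address that was shortened for display in the page text gets picked up exactly as written.

```python
import re

# Naive pattern: "http://" or "https://" followed by anything that is not
# whitespace, an angle bracket, or a quote character.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def find_plain_urls(text):
    """Return URL-looking strings found in plain, unlinked text."""
    return URL_PATTERN.findall(text)


sample_text = (
    "Full address: http://example.com/forum/thread-123.htm "
    "and a shortened one: http://example.com/forum/thr... (see above)"
)

for url in find_plain_urls(sample_text):
    print(url)
```

The second match keeps the trailing dots from the shortened display text, so a crawler requesting it would get a 404.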
9:26 am on Jul 23, 2012 (gmt 0)

WebmasterWorld Administrator anallawalla is a WebmasterWorld Top Contributor of All Time 10+ Year Member



There are also some sites that monitor every domain that has been sold and when they find a website hosted there, they publish a link to it.
1:09 am on Jul 24, 2012 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



the best way to stop a bot from crawling AND indexing a requested resource is to respond with a 401 Unauthorized status code.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.2:
10.4.2 401 Unauthorized
The request requires user authentication. The response MUST include a WWW-Authenticate header field (section 14.47) containing a challenge applicable to the requested resource. The client MAY repeat the request with a suitable Authorization header field (section 14.8). If the request already included Authorization credentials, then the 401 response indicates that authorization has been refused for those credentials. If the 401 response contains the same challenge as the prior response, and the user agent has already attempted authentication at least once, then the user SHOULD be presented the entity that was given in the response, since that entity might include relevant diagnostic information. HTTP access authentication is explained in "HTTP Authentication: Basic and Digest Access Authentication"


14.47 WWW-Authenticate:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.47

14.8 Authorization:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.8

HTTP Authentication: Basic and Digest Access Authentication:
http://www.ietf.org/rfc/rfc2617.txt [ietf.org]


here's the Authentication and Authorization How-To for the Apache HTTP Server:
http://httpd.apache.org/docs/2.2/howto/auth.html [httpd.apache.org]


and the Windows Server documentation to Configure Basic Authentication (IIS 7):
http://technet.microsoft.com/en-us/library/cc772009(v=ws.10).aspx [technet.microsoft.com]

Basic Authentication <basicAuthentication> : Configuration Reference : The Official Microsoft IIS Site:
http://www.iis.net/ConfigReference/system.webServer/security/authentication/basicAuthentication [iis.net]
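
For anyone who wants to see the 401 exchange outside of Apache or IIS, here is a minimal sketch using only Python's standard library. The username, password, realm, and the `make_server` helper are placeholders invented for this example. It is a toy for local testing, not a replacement for the server configurations linked above, and Basic credentials should only travel over HTTPS on a real site.

```python
import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical credentials, for illustration only.
USERNAME = "dev"
PASSWORD = "change-me"
EXPECTED = "Basic " + base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()


class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") != EXPECTED:
            # No valid credentials: answer 401 and include the
            # WWW-Authenticate challenge that the RFC requires.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="dev site"')
            self.end_headers()
            return
        body = b"dev site content\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=0):
    """Bind to localhost; port 0 asks the OS for any free port."""
    return HTTPServer(("127.0.0.1", port), AuthHandler)

# To try it by hand: make_server(8000).serve_forever()
```

A crawler sends no credentials, so it only ever receives the 401 and never the content.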
 
