Forum Library, Charter, Moderators: phranque

Webmaster General Forum

    
cPanel login pages indexed by Google
How did Google find them?
penders




msg:4674172
 11:21 am on May 24, 2014 (gmt 0)

Has anyone experienced cPanel login pages being indexed by Google? (Possibly a link-only result in the SERPs since they are generally blocked by robots.txt.)

If so, how do you think Google found these files? I'm struggling to believe that anything but a direct link to these files resulted in Google finding them.

This is primarily in relation to shared servers, where the cPanel URL is often cpanel.example.com and always on the same port, e.g. example.com:2082. These are easily found by a bit of trial and error, but does Google use "trial and error"?

Is there any way that Google could have found these pages, other than by stumbling across a direct link?

 

bilalseo




msg:4678822
 6:50 pm on Jun 10, 2014 (gmt 0)

The robots.txt file is not generally accessible at the root url for those ports.

In any case, the robots.txt file ONLY specifies what is not to be crawled, not what must not appear in the index. So even if these URLs are disallowed in robots.txt (assuming a robots.txt file is available there), the bare URLs can still appear in the index.

What is needed is for a robots noindex meta tag to be placed on those pages, instead of getting them disallowed in robots.txt.
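As a rough illustration of the noindex approach, here is a minimal sketch of a page that carries both the robots meta tag and the equivalent X-Robots-Tag response header. The `login_app` function is a hypothetical stand-in written as a plain WSGI app, not anything cPanel actually ships, and the signal only works if the URL is not also disallowed in robots.txt, since the crawler has to fetch the page to see it.

```python
# Hypothetical stand-in for a login page; not cPanel's real code.
LOGIN_HTML = (
    b"<html><head>"
    b'<meta name="robots" content="noindex">'  # in-page noindex signal
    b"</head><body>Login required</body></html>"
)


def login_app(environ, start_response):
    """Minimal WSGI app that asks crawlers not to index its response."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(LOGIN_HTML))),
        # Header form of the same signal; also usable for non-HTML responses.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [LOGIN_HTML]
```

Either signal on its own should be enough; both are shown here only so the two forms can be compared side by side.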

phranque




msg:4678861
 8:28 pm on Jun 10, 2014 (gmt 0)

the method of URL discovery is irrelevant.
the only thing that really matters is your response to the request.

lucy24




msg:4678877
 9:14 pm on Jun 10, 2014 (gmt 0)

does Google use "trial and error"

In some areas, sure. For example: If you've got a page with URL ending in a slash, like
example.com/directory/
then search engines will occasionally ask for both
example.com/directory/index.html
and
example.com/directory
(This is one of several search-engine behaviors that I never noticed until I moved sites and therefore paid unusually close attention to requests.)
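To make the pattern concrete, here is a small sketch that lists the two extra requests described above for a URL ending in a slash. The function name `directory_variants` and the default `index.html` file name are assumptions for illustration; this only models what shows up in logs, not how any crawler actually chooses its requests.

```python
def directory_variants(url, index_name="index.html"):
    """List the alternate forms of a trailing-slash URL that may also be requested."""
    if not url.endswith("/"):
        raise ValueError("expected a URL ending in a slash")
    return [
        url + index_name,   # explicit index file inside the directory
        url.rstrip("/"),    # same path without the trailing slash
    ]


print(directory_variants("example.com/directory/"))
# ['example.com/directory/index.html', 'example.com/directory']
```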

And you know all those robots that come by asking for the top 87 permutations of "wp-admin" on the off chance that they might get in? It doesn't seem likely that a Ukrainian robot would know something that Google doesn't.

Has anyone experienced cPanel login pages being indexed by Google? (Possibly a link-only result in the SERPs since they are generally blocked by robots.txt.)

Was this a hypothetical question, or have you been seeing it yourself? In order for something to appear in the index, there has to be some concrete reason for the search engine to believe the page exists: either because they've seen it, or because it's listed in your sitemap, or because someone has linked to it. I can't think of a fourth possibility.

penders




msg:4678882
 9:31 pm on Jun 10, 2014 (gmt 0)

The robots.txt file is not generally accessible at the root url for those ports.


There is a robots.txt served (with a valid 200 response) from this location that contains a single Disallow: / directive. It's obviously not the same robots.txt file used by the main site, but it is a robots.txt file.
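For anyone who wants to check what such a file means to a compliant crawler, here is a quick sketch using Python's standard `urllib.robotparser`. The `User-agent: *` line is an assumption (the file described above may be laid out differently, and Python's parser ignores a Disallow line that has no User-agent line before it), and the URLs are placeholders. A robots.txt only applies to the scheme, host and port it is served from, which is why the port 2082 file is separate from the main site's.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of the robots.txt served on the cPanel port.
CPANEL_ROBOTS = [
    "User-agent: *",
    "Disallow: /",
]


def is_blocked(url, user_agent="Googlebot"):
    """Return True if the assumed robots.txt forbids crawling this URL."""
    rules = RobotFileParser()
    rules.parse(CPANEL_ROBOTS)
    return not rules.can_fetch(user_agent, url)


print(is_blocked("http://example.com:2082/"))       # True
print(is_blocked("http://example.com:2082/login"))  # True
```

Every path is blocked from crawling, but nothing here says anything about indexing, which is the gap discussed in this thread.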

What is needed is for a robots noindex meta tag to be placed on those pages, instead of getting them disallowed in robots.txt.


Exactly. I can only think that cPanel's decision to block with robots.txt is to save bandwidth by preventing unnecessary crawling.

However, it is my understanding that Google will only index these pages (or rather allow the pages to appear in the SERPs, link-only style) if it has found that other pages are linking to them. To the best of my knowledge they aren't, which is really the point of my question: how did Google find these pages?

I have seen far too many cPanel login pages indexed in this way to be just a one-off. So, they are being found somehow.

Unfortunately, on a shared server we don't have access to this area to do anything about it.

the only thing that really matters is your response to the request.


The pages return a 401 Access Denied.

penders




msg:4678885
 9:53 pm on Jun 10, 2014 (gmt 0)

It doesn't seem likely that a Ukrainian robot would know something that Google doesn't.


Maybe the Ukrainian bot is exposing the URLs for Google to find?!

Was this a hypothetical question, or have you been seeing it yourself?


I've been seeing this quite a lot. And it was discussed in the cPanel forums [forums.cpanel.net] some years ago, with a stated fix sometime later - which doesn't seem to have happened? (The "fix" was to remove robots.txt and instead include a noindex robots meta tag or X-Robots-Tag HTTP response header.)

Admittedly, if there are many pages indexed on the site then these pages can be hard to find (they are, after all, link-only results). You might need to "repeat the search with the omitted results included". However, this hit me in the face recently when I purposefully deindexed a site. Once the main site pages had dropped from the index there was still a page of results in the SERPs for the cPanel subdomain, 2082 port address and associated URLs - which I have no control over!?

phranque




msg:4678912
 12:54 am on Jun 11, 2014 (gmt 0)

Has anyone experienced cPanel login pages being indexed by Google? (Possibly a link-only result in the SERPs since they are generally blocked by robots.txt.)

The pages return a 401 Access Denied.


if the crawler respects robots.txt, it won't see the 401 response and the url could get indexed without being crawled.
once the crawler gets the 401 response, the url will be dropped from the index.
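The two cases above can be written out as a tiny decision function. This is only a simplified model of the reasoning in this thread, not a description of how Google actually works; the function name `index_outcome` and the outcome labels are made up for illustration.

```python
def index_outcome(blocked_by_robots, status_code, has_noindex=False):
    """Simplified model of what can happen to a discovered URL."""
    if blocked_by_robots:
        # The crawler never requests the page, so it never sees the
        # 401 or any noindex signal; a URL-only listing is possible.
        return "url-only listing possible"
    if status_code == 401 or has_noindex:
        # The crawler fetched the URL and got a reason to drop it.
        return "dropped from index"
    return "eligible for indexing"


print(index_outcome(blocked_by_robots=True, status_code=401))
# url-only listing possible
print(index_outcome(blocked_by_robots=False, status_code=401))
# dropped from index
```

The first call is the situation described for the cPanel login pages: the 401 is there, but the robots.txt block means it is never seen.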

bilalseo




msg:4679112
 4:22 pm on Jun 11, 2014 (gmt 0)

This is really getting beyond my experience. :)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved