Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Googlebot possibly ignoring robots.txt
Andem




msg:4446140
 6:04 pm on Apr 26, 2012 (gmt 0)

I currently have a new site under development and I brought it live last week for testing. It is filled with content, but since I'm still working on it in a sandbox, I have the following robots.txt:

User-agent: *
Disallow: /


Nobody should know about this site, though I do admit I had the toolbar activated for a while when testing.

Either way, I noticed that Googlebot has been crawling the site and now has 5,010 pages from the site live in the search results!

Baidu and Yandex somehow also know about the site, but neither has it in its index.

Since I don't plan on creating a webmaster tools account, how the heck can I prevent Google from trespassing?

 

goodroi




msg:4446162
 7:04 pm on Apr 26, 2012 (gmt 0)

First, a friendly reminder: robots.txt is a voluntary protocol. If you want to really block access to a project, you should think about using an .htaccess file.
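A minimal sketch of what that .htaccess could look like, assuming an Apache 2.2-era setup (the address is a documentation example; substitute your own):

```apache
# Deny everyone except your own IP address
Order deny,allow
Deny from all
Allow from 203.0.113.7
```

Unlike robots.txt, this is enforced by the server itself, so a crawler cannot simply choose to ignore it.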

Also, you might want to double-check that there are no typos in your existing robots.txt file. Most previous reports from members of Google not honoring robots.txt turned out not to be Googlebot's error.

Andem




msg:4446227
 9:18 pm on Apr 26, 2012 (gmt 0)

>> First, I just want to mention a friendly reminder that robots.txt is a voluntary protocol.

Well, that may be true, but I'm giving Google and anybody else specific instructions not to index or scrape my content.

I pasted the contents of my robots.txt file in the OP. Did I miss something?

Cheers

g1smd




msg:4446231
 9:26 pm on Apr 26, 2012 (gmt 0)

Is the file saved from a text-editor or from a word-processor?

Is the robots.txt file called exactly "robots.txt" and located at example.com/robots.txt in the root folder of the site?

lucy24




msg:4446297
 11:15 pm on Apr 26, 2012 (gmt 0)

... but once all of that's taken care of, you still have the cases where, for example,

Disallow: /piwik

has to be supplemented with

# Block known bot IP ranges and user agents from fetching the piwik files
RewriteCond %{REMOTE_ADDR} ^(207\.46|157\.5[4-9]|157\.60|209\.8[45])\. [OR]
RewriteCond %{HTTP_USER_AGENT} (Bluecoat|Bot|facebook|Google|Preview) [NC]
RewriteRule piwik\.(js|php)$ - [F]

Andem




msg:4446370
 3:19 am on Apr 27, 2012 (gmt 0)

g1smd: the file was created with nano and last saved with the same application. The robots.txt file is named exactly that and is accessible via domain.com/robots.txt.

lucy24: I don't run Apache anymore, but the lovely IP regex will certainly come in handy, whether I implement my anti-Google solution via an nginx rewrite or in PHP :) Much appreciated.

I was actually appalled that Google was ignoring my very specific robots.txt directive.

phranque




msg:4446394
 5:40 am on Apr 27, 2012 (gmt 0)

I currently have a new site under development and I brought it live last week for the purpose of testing and development.

there is one optimal solution for a development/testing/staging site: always implement HTTP Basic Authentication:
http://wiki.nginx.org/HttpAuthBasicModule [wiki.nginx.org]

that means any request from an unauthenticated visitor will get a 401 Unauthorized response.
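A minimal sketch of that nginx setup, with example paths and server name (not from the thread):

```nginx
server {
    listen      80;
    server_name dev.example.com;
    root        /var/www/dev;

    location / {
        auth_basic           "Restricted staging site";
        auth_basic_user_file /etc/nginx/.htpasswd;  # created with the htpasswd utility
    }
}
```

Any request without valid credentials gets the 401 Unauthorized response described above, before nginx serves anything.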

satty




msg:4446851
 12:26 pm on Apr 28, 2012 (gmt 0)

No, it does not get ignored; it's a door for crawlers, where we can only restrict crawlers from some specific content.

Andem




msg:4448135
 8:42 pm on May 1, 2012 (gmt 0)

>> you should always implement HTTP Basic Authentication

Thanks for the tip! I've now implemented that. That should stop Google from crawling everything and sending me traffic.

>> No, it does not get ignored; it's a door for crawlers, where we can only restrict crawlers from some specific content.

I don't understand.

enigma1




msg:4448426
 1:10 pm on May 2, 2012 (gmt 0)

The problem with authorization passwords is that if you need to test handshaking between your server and another (e.g. payment processors), they block connections and require various workarounds. The best course is not to publish the test folder or test domain, or perhaps to use an IP address instead of a domain name.

phranque




msg:4448641
 8:37 pm on May 2, 2012 (gmt 0)

Best course is do not publish the test folder or test domain or perhaps use an IP instead of a domain name.

if the response is a 200 OK for any requested url then that url is "published".
i've seen plenty of unwanted duplicate content in the index under IP addresses.
"security through obscurity" is not the solution here.

if you need to test handshaking between your server and another (eg: payment processors)


# Apache 2.2: the processor's IP gets in without a password; everyone else must authenticate
Require valid-user
Order deny,allow
Deny from all
Allow from nnn.nnn.nnn.nnn
Satisfy Any

dstiles




msg:4448668
 9:44 pm on May 2, 2012 (gmt 0)

It's easy enough, using PHP or ASP, to include a check for the browsing IP. If it's not from a specified IP (or list of IPs), return an error code of your choice on an otherwise blank page.

If your IP is dynamic, the list may be largish (use CIDR or range notation), or you may have to change the code when the IP changes; but that shouldn't happen more than once a day, and often only once every few weeks.
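dstiles suggests PHP or ASP; the same check can be sketched in Python with the stdlib ipaddress module (names and addresses below are illustrative, not from the thread):

```python
# Illustrative allow-list check in the spirit of the suggestion above.
# Addresses are documentation examples; CIDR notation keeps dynamic ranges short.
import ipaddress

ALLOWED = ["203.0.113.7", "198.51.100.0/24"]

def visitor_allowed(ip, allowed=ALLOWED):
    """Return True when the visitor's IP matches an entry in the allow list."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(entry) for entry in allowed)

# In a request handler: if visitor_allowed(remote_addr) is False, send an
# error code of your choice (e.g. 403) with an otherwise blank body.
```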

enigma1




msg:4448726
 1:34 am on May 3, 2012 (gmt 0)

if the response is a 200 OK for any requested url then that url is "published".
i've seen plenty of unwanted duplicate content in the index under IP addresses.
"security through obscurity" is not the solution here.

Only if the bots know the test location, and I don't see how they would, unless you explicitly publish it or leave some traces for spiders. But that's up to how each of us does testing. I find it easier than digging out IP ranges.

Test environments aren't continuously active. They can also be made as secure as the main site, so it's not a matter of security.

phranque




msg:4448764
 4:00 am on May 3, 2012 (gmt 0)

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=182072 [support.google.com]:
It's almost impossible to keep a web server secret by not publishing links to it.

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=93708 [support.google.com]:
If you need to keep confidential content on your server, save it in a password-protected directory. Googlebot and other spiders won't be able to access the content. This is the simplest and most effective way to prevent Googlebot and other spiders from crawling and indexing content on your site.

enigma1




msg:4448831
 9:02 am on May 3, 2012 (gmt 0)

Didn't I say as much above?
unless you explicitly publish it or perhaps give some traces to spiders about them

dstiles




msg:4449059
 7:55 pm on May 3, 2012 (gmt 0)

Oh, that page on that domain is password protected. Do we have any reference to the password in our gmail, GTB, android, googlebot or web preview scrapes? :)

> It's almost impossible to keep a web server secret by not publishing links to it.

And how is the URL discovered? Not usually legitimately (or at least, not with legitimate aims), that's for sure. If I don't notify anyone of a web domain or subdomain, it can only be found by scraping DNS. After that it's usually a case of an automatic home page name, or trying the usual index/default with a choice of extensions such as html, asp, php, etc.

Remember that .com/.org/.net domains are known to G as soon as they are registered. This does not apply to many TLDs registered in countries outside the US (e.g. the UK, as far as I know).

Too much laxness in the US registry; too much power given to G; too much nosiness by G.

phranque




msg:4449075
 8:20 pm on May 3, 2012 (gmt 0)

And how is the URL discovered?

it doesn't have to be "explicitly published".
examples of how are given in the linked google support thread.
"unintended" urls can "legitimately" be harvested from referrers in published log files, browser toolbars, gmail, ...

aakk9999




msg:4479687
 9:29 am on Jul 28, 2012 (gmt 0)

I have the following situation, which I think proves Google is NOT honouring robots.txt:

- the page has been blocked by robots.txt since it was created; it is blocked under user agent * and no other user agents are specified in robots.txt
- in WMT, if I test this URL, it shows as "blocked by line nn"
- however, in WMT, under the "internal links" section, if I hover over that URL, the page preview shows a screenshot of the blocked page

So despite the page being explicitly blocked, Google HAS visited it in order to create the screenshot.

It seems blatantly obvious that Google is not honouring robots.txt, as I cannot see any other way Google could have obtained the screenshot without visiting the page.

lucy24




msg:4479694
 10:50 am on Jul 28, 2012 (gmt 0)

You must have missed a few threads :)

Google Preview is a completely separate animal from the ordinary googlebot. It does not even look at robots.txt. Same goes for Google Translate.

g1smd




msg:4479771
 7:46 pm on Jul 28, 2012 (gmt 0)

The only surefire way to keep bots and snoopers out is .htpasswd access control.
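For Apache, a sketch of that access control with example paths (create the password file first with the htpasswd utility):

```apache
# .htaccess - require a login for everything under this directory
AuthType Basic
AuthName "Private development site"
AuthUserFile /home/example/.htpasswd   # e.g. htpasswd -c /home/example/.htpasswd devuser
Require valid-user
```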

aakk9999




msg:4479802
 10:06 pm on Jul 28, 2012 (gmt 0)

You must have missed a few threads :)

Yes, I have, thanks Lucy

phranque




msg:4479824
 1:49 am on Jul 29, 2012 (gmt 0)

So it is blatantly obvious that Google is not honouring robots.txt ...


this is only obvious if you see a request of an excluded url by googlebot or another google crawler in your server access log file.

lucy24




msg:4479856
 6:27 am on Jul 29, 2012 (gmt 0)

Y'know, I was trying to avoid saying "Preview is not a robot". But what the heck.

See, when I say something is not a robot, I'm being satirical. If there's no human with a browser-or-equivalent at the other end, it's a robot. But when most people say "It isn't a robot" they mean more narrowly: It isn't a crawler.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved