goodroi

msg:4446162 | 7:04 pm on Apr 26, 2012 (gmt 0) |
First, I just want to mention a friendly reminder that robots.txt is a voluntary protocol. If you want to really block access to a project you should think about using an .htaccess file. Also, you might want to double check that there are no typos in your existing robots.txt file. Most of the previous times that other members have reported Google not honoring robots.txt, it turned out to not be Googlebot's error.
|
Andem

msg:4446227 | 9:18 pm on Apr 26, 2012 (gmt 0) |
>> First, I just want to mention a friendly reminder that robots.txt is a voluntary protocol.
Well, that may be true but I'm giving Google and anybody else specific instructions not to index and/or scrape my content. I pasted the contents of my robots.txt file in the OP. Did I miss something? Cheers
|
g1smd

msg:4446231 | 9:26 pm on Apr 26, 2012 (gmt 0) |
Is the file saved from a text-editor or from a word-processor? Is the robots.txt file called exactly "robots.txt" and located at example.com/robots.txt in the root folder of the site?
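One way to rule out a typo in the file itself is to run its text through a parser and ask whether a given crawler may fetch a given URL. The following is a minimal sketch using Python's standard `urllib.robotparser`; the sample rules, the `is_allowed` helper, and the example.com URL are placeholders for illustration, not the OP's actual file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical file that blocks everything for every crawler.
SAMPLE = """\
User-agent: *
Disallow: /
"""

# Same file with a misspelled directive; the parser skips the unknown line.
TYPO = """\
User-agent: *
Dissallow: /
"""

print(is_allowed(SAMPLE, "Googlebot", "http://example.com/page.html"))  # False
print(is_allowed(TYPO, "Googlebot", "http://example.com/page.html"))    # True
```

A single misspelled directive is silently ignored, so the file with the typo blocks nothing at all.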
|
lucy24

msg:4446297 | 11:15 pm on Apr 26, 2012 (gmt 0) |
... but once all of that's taken care of, you still have the cases where, for example,
Disallow: /piwik
has to be supplemented with
RewriteCond %{REMOTE_ADDR} ^(207\.46|157\.5[4-9]|157\.60|209\.8[45])\. [OR]
RewriteCond %{HTTP_USER_AGENT} (Bluecoat|Bot|facebook|Google|Preview) [NC]
RewriteRule piwik\.(js|php)$ - [F]
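For anyone not on Apache, the logic of those three lines can be sketched in Python: the request is refused when the path ends in piwik.js or piwik.php and either the IP pattern or the user-agent pattern matches. The patterns are copied from the rules above; the function name `is_forbidden` and the test values are made up for illustration, and this is a rough translation, not a drop-in replacement for mod_rewrite.

```python
import re

# Patterns taken from the RewriteCond / RewriteRule lines above.
IP_RE = re.compile(r'^(207\.46|157\.5[4-9]|157\.60|209\.8[45])\.')
UA_RE = re.compile(r'(Bluecoat|Bot|facebook|Google|Preview)', re.IGNORECASE)  # [NC]
PATH_RE = re.compile(r'piwik\.(js|php)$')


def is_forbidden(remote_addr: str, user_agent: str, path: str) -> bool:
    """True when the request would get the [F] (403 Forbidden) response."""
    # The two conditions are joined by [OR]: either one is enough.
    condition = bool(IP_RE.search(remote_addr)) or bool(UA_RE.search(user_agent))
    return condition and bool(PATH_RE.search(path))


print(is_forbidden("157.55.32.1", "Mozilla/5.0", "piwik/piwik.js"))
print(is_forbidden("192.0.2.1", "Mozilla/5.0 Firefox", "piwik/piwik.php"))
```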
|
Andem

msg:4446370 | 3:19 am on Apr 27, 2012 (gmt 0) |
g1smd: file was created with nano and last saved with the same application. The robots.txt file is exactly that and accessible via domain.com/robots.txt. lucy24: I don't run Apache anymore but the lovely IP regex will certainly come in handy.. whether or not I implement my anti-Google solution via an nginx rewrite or PHP :) Highly appreciated. I was actually appalled by the fact that Google was ignoring my very specific robots.txt directive.
|
phranque

msg:4446394 | 5:40 am on Apr 27, 2012 (gmt 0) |
| I currently have a new site under development and I brought it live last week for the purpose of testing and development. |
| there is one optimal solution for a development/testing/staging site and you should always implement HTTP Basic Authentication: http://wiki.nginx.org/HttpAuthBasicModule [wiki.nginx.org] that means any request from an unauthenticated visitor will get a 401 Unauthorized response.
|
satty

msg:4446851 | 12:26 pm on Apr 28, 2012 (gmt 0) |
no it not get ignored, its a door for crawlers where only we can restrict the crawlers for some specific contents.
|
Andem

msg:4448135 | 8:42 pm on May 1, 2012 (gmt 0) |
>> you should always implement HTTP Basic Authentication
Thanks for the tip! I've now implemented that. Something to stop Google from crawling everything and sending me traffic.
>> no it not get ignored, its a door for crawlers where only we can restrict the crawlers for some specific contents.
I don't understand.
|
enigma1

msg:4448426 | 1:10 pm on May 2, 2012 (gmt 0) |
The problem with the authorization passwords is that if you need to test handshaking between your server and another (eg: payment processors) it blocks connections and needs various workarounds. Best course is do not publish the test folder or test domain or perhaps use an IP instead of a domain name.
|
phranque

msg:4448641 | 8:37 pm on May 2, 2012 (gmt 0) |
| Best course is do not publish the test folder or test domain or perhaps use an IP instead of a domain name. |
| if the response is a 200 OK for any requested url then that url is "published". i've seen plenty of unwanted duplicate content in the index under IP addresses. "security through obscurity" is not the solution here.
| if you need to test handshaking between your server and another (eg: payment processors) |
|
Require valid-user
Allow from nnn.nnn.nnn.nnn
Satisfy Any
|
dstiles

msg:4448668 | 9:44 pm on May 2, 2012 (gmt 0) |
It's easy enough, using PHP or ASP, to include a check for the browsing IP. If it's not from a specified IP (or list of IPs) then return an error code of your choice on an otherwise blank page. If your IP is dynamic the list may be large-ish (use CIDR or range notation) or you may have to change the code when the IP changes, but that shouldn't be more than once a day and is often only once every few weeks.
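The same idea, sketched in Python rather than PHP or ASP, using the standard `ipaddress` module so the list can hold CIDR ranges. The networks below are documentation-range placeholders and `status_for` is a made-up helper name; choosing 403 as the error code is also just an assumption.

```python
import ipaddress

# Hypothetical allowlist; replace with your own address or ranges.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),   # a whole range in CIDR notation
    ipaddress.ip_network("198.51.100.7/32"),  # a single address
]


def status_for(remote_addr: str) -> int:
    """Return the HTTP status to send: 200 for allowed IPs, 403 otherwise."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address: treat it as not allowed.
        return 403
    if any(addr in net for net in ALLOWED_NETWORKS):
        return 200
    return 403


print(status_for("203.0.113.42"))  # 200
print(status_for("198.51.100.8"))  # 403
```

When a dynamic IP changes, only the `ALLOWED_NETWORKS` list needs editing.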
|
enigma1

msg:4448726 | 1:34 am on May 3, 2012 (gmt 0) |
| if the response is a 200 OK for any requested url then that url is "published". i've seen plenty of unwanted duplicate content in the index under IP addresses. "security through obscurity" is not the solution here. |
| Only if the bots know the test location, and I don't see how they would, unless you explicitly publish it or perhaps give some traces to spiders about them. But that's up to how each one of us does testing. I find it easier than digging out IP ranges. Test environments aren't continuously active. They can also be as secure as the main site so it's not a matter of security.
|
phranque

msg:4448764 | 4:00 am on May 3, 2012 (gmt 0) |
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=182072 [support.google.com]: | It's almost impossible to keep a web server secret by not publishing links to it. |
| http://support.google.com/webmasters/bin/answer.py?hl=en&answer=93708 [support.google.com]: | If you need to keep confidential content on your server, save it in a password-protected directory. Googlebot and other spiders won't be able to access the content. This is the simplest and most effective way to prevent Googlebot and other spiders from crawling and indexing content on your site. |
|
|
enigma1

msg:4448831 | 9:02 am on May 3, 2012 (gmt 0) |
Didn't I say this above? | unless you explicitly publish it or perhaps give some traces to spiders about them |
|
|
dstiles

msg:4449059 | 7:55 pm on May 3, 2012 (gmt 0) |
Oh, that page on that domain is password protected. Do we have any reference to the password in our gmail, GTB, android, googlebot or web preview scrapes? :)
> It's almost impossible to keep a web server secret by not publishing links to it.
And how is the URL discovered? Not usually legitimately (or at least, with legitimate aims), that's for sure. If I do not notify anyone of a web domain or subdomain it can only be found by scraping DNS. After that it's usually a case of an automatic home page name or trying the usual index/default with a choice of extensions such as html, asp, php etc.
Remember that .com/.org/.net domains are known by G as soon as they are registered. This does not apply to many TLDs registered in countries outside the US (eg UK - as far as I know). Too much laxness in the US registry; too much power given to G; too much nosiness by G.
|
phranque

msg:4449075 | 8:20 pm on May 3, 2012 (gmt 0) |
| And how is the URL discovered? |
| it doesn't have to be "explicitly published". examples of how are given in the linked google support thread. "unintended" urls can "legitimately" be harvested from referrers in published log files, browser toolbars, gmail, ...
|
aakk9999

msg:4479687 | 9:29 am on Jul 28, 2012 (gmt 0) |
I have the following situation which I think proves Google is NOT honouring robots.txt:
- the page was blocked by robots.txt since it was created. It is blocked by user agent * and I have no other user agents specified in robots.txt
- in WMT, if I test this URL, it shows as "blocked by line nn"
- however, in WMT, under "internal links" section, if I hover over that URL, the page preview shows the screenshot of the blocked page.
So despite the page being explicitly blocked, google HAS visited it in order to create screenshot. So it is blatantly obvious that Google is not honouring robots.txt as I cannot see any other way how Google would obtain the page screenshot other than visiting the page.
|
lucy24

msg:4479694 | 10:50 am on Jul 28, 2012 (gmt 0) |
You must have missed a few threads :) Google Preview is a completely separate animal from the ordinary googlebot. It does not even look at robots.txt. Same goes for Google Translate.
|
g1smd

msg:4479771 | 7:46 pm on Jul 28, 2012 (gmt 0) |
The only surefire way to keep bots and snoopers out is .htpasswd access control.
|
aakk9999

msg:4479802 | 10:06 pm on Jul 28, 2012 (gmt 0) |
| You must have missed a few threads :) |
| Yes, I have, thanks Lucy
|
phranque

msg:4479824 | 1:49 am on Jul 29, 2012 (gmt 0) |
| So it is blatantly obvious that Google is not honouring robots.txt ... |
| this is only obvious if you see a request of an excluded url by googlebot or another google crawler in your server access log file.
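A rough sketch of that log check in Python, assuming the server writes the common "combined" access log format. The regex, the `google_hits` helper, the /private/ prefix, and the sample lines are all invented for illustration. User-agent strings can be faked, so a hit here is a lead to follow up, not proof.

```python
import re

# Combined log format:
# ip ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def google_hits(lines, disallowed_prefix):
    """Return (ip, path, agent) for requests to disallowed paths by Google-named agents."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        agent = m.group("agent")
        path = m.group("path")
        if "google" in agent.lower() and path.startswith(disallowed_prefix):
            hits.append((m.group("ip"), path, agent))
    return hits


# Made-up sample log lines.
SAMPLE_LOG = [
    '192.0.2.10 - - [28/Jul/2012:09:29:00 +0000] "GET /private/page.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.11 - - [28/Jul/2012:09:30:00 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.12 - - [28/Jul/2012:09:31:00 +0000] "GET /private/page.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/14.0"',
]

for ip, path, agent in google_hits(SAMPLE_LOG, "/private/"):
    print(ip, path, agent)
```

Only the first sample line is reported: it is the one request that combines a Google-named agent with a path under the excluded prefix.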
|
lucy24

msg:4479856 | 6:27 am on Jul 29, 2012 (gmt 0) |
Y'know, I was trying to avoid saying "Preview is not a robot". But what the heck. See, when I say something is not a robot, I'm being satirical. If there's no human with a browser-or-equivalent at the other end, it's a robot. But when most people say "It isn't a robot" they mean more narrowly: It isn't a crawler.
|
|