Forum Moderators: open
Privacy and security concerns aside, what effect does this have on a site's theme and overall PR?
<rant>GoogleGuy, what part of Disallow: /cgi-bin/ does your robot not understand? How would you like it if I sent my spider into your protected areas? :(</rant>
No respect!
www.altavista.com/cgi-bin/query
http://www.altavista.com/robots.txt
www.listbot.com/cgi-bin/subscriber
http://www.listbot.com/robots.txt
babelfish.altavista.com/cgi-bin/translate
http://babelfish.altavista.com/robots.txt
www.nsf.gov/cgi-bin/getpub?gpg
http://www.nsf.gov/robots.txt
Hehe! [url=http://www.google.com/search?q=%22google.%2Bcom/cgi-bin/%22]Googlebot[/url] doesn't even obey the rules at home.
labs.google.com/cgi-bin/keys
http://labs.google.com/robots.txt
Like the labs one:
[google.com...]
That's all fine because the listing is built from inbound links. Google's not under obligation to "not list" those.
It also doesn't say that Google can't list anything that is blocked by a robots.txt. As far as I know, a robots.txt specifically blocks only the crawling - not the listing/indexing. However, the SE is still free to do what Google is doing and build a listing based solely on inbound links.
I don't care for that either, KM.
<edit>fixed url</edit>
[edited by: Brett_Tabke at 7:39 am (utc) on July 4, 2002]
It also doesn't say that Google can't list anything that is blocked by a robots.txt. As far as I know, a robots.txt specifically blocks only the crawling - not the listing/indexing. However, the SE is still free to do what Google is doing and build a listing based solely on inbound links.
What's the purpose of a robots.txt then? It can crawl and list sites blocked in robots.txt and still be compliant? No offense, but that sounds like a bunch of $#&*$^!
The Robots Exclusion Protocol is a method that allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot.
When does that go into effect? What about Google, which apparently has separate crawlers that "do the deed" vs. those that grab an occasional copy of robots.txt?
I do agree with your strict interpretation of it. I've grown used to Google's behavior, since they've done this on and off since the start.
[robotstxt.org...]
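For reference, a minimal robots.txt under that spec - this is all it should take to keep a compliant robot out of a cgi-bin (the file is plain text sitting at the site root):
User-agent: *
Disallow: /cgi-bin/
The User-agent line says which robots the rules apply to ("*" means all of them), and each Disallow line names a path prefix they're asked to stay out of.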
If Google can't obey robots.txt (the strict way, not some made-up Googly way) the consequences are going to be disastrous for Web security. And I thought Microsoft was bad.
Hide those credit card lists! You have to presume Google will index it if it finds it. Don't rely on robots.txt for peace of mind.
But the message still applies: don't keep sensitive information on an open server or, if you must, secure it with passwords and the like.
There was a case a while ago about credit card or banking details being indexed by the big G, was there not? Can't remember the details, but a worthwhile wake-up call nonetheless.
Onya
Woz
If I find someone with unencrypted credit card numbers on my servers then I throw them off. It's just not right, even if the numbers are outside of document root. If numbers are stored encrypted, then it is not right to have the decryption key on the server either.
Despite a cgi-bin disallow that Google has respected for six months now, it crawled nearly 5,000 cgi links today before I caught and blocked them.
Each of these 5,000 generates a social network diagram that requires very substantial CPU activity.
The speed at which Google was grabbing these turned our machine into a boat anchor; the CPU load was way over the top.
Bad boy Googlebot, bad boy.
I think the problem is that Google picked up all these cgi-bin links from my static files. It got thousands of these static files, as usual, in the normal crawl that ended a couple days ago. A lot of these cgi-bin links ended up as "uncrawled" link listings in their SERPs from the June crawl and update. I just noticed this in June; I don't know if this is new or not, but I would have preferred that Google not list them in the SERPs at all.
Now it's the July crawl and, it seems, Google is in Stage Two of the crawl, where they go back and get the uncrawled links from Stage One. The only problem is that they forgot to check the robots.txt before launching Stage Two. The robots.txt was the reason they didn't get crawled in Stage One!
I have noticed in the past that Google ends up with a handful of cgi-bin pages from my site due to external links (from weblogs and such) that appear now and then. It's not in the habit of checking robots.txt for these one-shot GETs. But this latest Googlebot behavior is a bug at a more fundamental level. If Google doesn't get it straightened out, the next step is to block them at our router.
For some reason, .htaccess in my www.domain.org root doesn't work on our cgi-bin, in the sense that a working "Deny" in the root for www.domain.org will still allow access to www.domain.org/cgi-bin/... It's probably a configuration issue. All of the various domains on our Class C use the same common cgi-bin directory.
Does anyone know how to put something in Apache's httpd.conf to give me .htaccess control over cgi-bin?
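If that shared cgi-bin is mapped with a ScriptAlias to a directory outside each docroot, a .htaccess in the docroot will never apply to it; you'd need a <Directory> block for the physical cgi-bin path with AllowOverride turned on. An untested sketch for Apache 1.3 - the paths here are made up, so match them to your own ScriptAlias line:
# hypothetical paths - adjust to your setup
ScriptAlias /cgi-bin/ "/usr/local/apache/cgi-bin/"
<Directory "/usr/local/apache/cgi-bin/">
    Options +ExecCGI
    AllowOverride Limit
    Order allow,deny
    Allow from all
</Directory>
AllowOverride Limit is what lets a .htaccess in that directory use Order/Allow/Deny, so a "Deny" placed there should finally stick.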
GET /robots.txt HTTP/1.1
Host: mydomain.com
Connection: close
User-Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)

HTTP/1.1 200 OK
Date: Thu, 04 Jul 2002 22:19:56 GMT
Server: Apache/1.3.20 (Unix) ApacheJServ/1.1.2 PHP/4.2.1 FrontPage/5.0.2.2510 Rewrit/1.1a
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain
Input?
KM, do you do any user-agent/IP delivery stuff? Just curious, because I'm seeing some weird stuff...
LMAO! :)
Weird stuff maybe (in a nerdy type of way). IP delivery, somewhat. For example, I feed Googlebot a
<meta name="googlebot" content="noarchive">
tag and MSIE browsers a
<meta name="MSSmartTagsPreventParsing" content="TRUE">
tag. I also conduct studies on the elusive type of spiders that use common user-agent strings (e.g. Mozilla/x.x (compatible; MSIE x.x; Windows NT)). Sorry if that was you who got banned. The lack of a referrer will trigger the deny plugin for that user-agent. Other than that, I don't cloak content (nor could I with my current setup). I'm just interested in preventing unauthorized access to my content. You could call it a hobby of mine. ;)
<!-- Name Version (URL) -->
tag anymore. I should have mentioned, it's all done via SSI (no mod_rewrite or URL redirections). For .html pages the tag is inserted between <head> and </head>. For .txt pages you would only see
<!-- Name Version (URL) -->
without the <meta> tags or other HTML elements being shown.
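For anyone curious, that kind of conditional insert can be done with Apache's mod_include (XSSI) along these lines - a rough sketch only, with the user-agent patterns being illustrative:
<!--#if expr="$HTTP_USER_AGENT = /Googlebot/" -->
<meta name="googlebot" content="noarchive">
<!--#elif expr="$HTTP_USER_AGENT = /MSIE/" -->
<meta name="MSSmartTagsPreventParsing" content="TRUE">
<!--#endif -->
The server evaluates the #if at serve time, so each visitor only ever sees the tag meant for it.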
The deny plugin can be seen in the plugin graphic available on the site listed in my profile. I don't want to elaborate more than that out of respect for the TOS here. You can, however, stickymail me or E-mail me directly for more details if you like.
[big]GoogleGuy[/big] 7:55 pm on July 4, 2002 (utc -4)
KM, our folks are seeing 403s when we try to fetch your robots.txt.
You know, I wonder if anybody at, say, Microsoft is answering user queries on a public bulletin board on the 4th of July? (US Independence Day [archives.gov], for the international crowd. That's why your US referrals vanished today.)
This raises two other possibilities, though: (1) GoogleGuy is in fact the night watchman at the GooglePlex and just makes all this stuff up from some secretary's computer that got left on by mistake; or (2) the GooglePlex is a facade and the whole operation is run out of Indonesia.
RJO ;)
Other than the one shown below, no hits were logged with a 403 (which could only mean your requests that resulted in a 403 never reached our server). Here's that sample from the log file:
216.239.45.4 - - [04/Jul/2002:17:57:12 -0400] "GET /robots.txt HTTP/1.0" 403 - "-" "Wget/1.8.2"
I've had around a hundred people (literally) from around the world check on this robots.txt file and every one got a 200.
straight Apache server, static robots.txt text file with vanilla set-up
It is a static robots.txt file. Don't know what you mean by straight Apache server though. I don't have, use, or need mod_rewrite if that is what you are alluding to.
I never mentioned it before, just fumed and stomped around a little, but thought I'd add my 2 cents today.
Just a quick note to other people who think that we've got problems with robots.txt handling: we haven't heard any complaints about robots.txt for several months, but we always want to make sure that we're handling robots.txt correctly. If you've seen some behavior that you think might be bad, you can mail search-quality@google.com with a subject line like "robots.txt".
Here are a few things that you should check for first though:
1. If you try to fetch www.yourdomain.com/robots.txt, do you get a valid page, preferably with a 200 status code? You should not return a 403 (Forbidden) error.
2. If you see result pages returned that are supposed to be forbidden, are you sure we crawled the page? Google can return a link as a result even if we didn't crawl the page. One key thing to look for is that if the page doesn't have a snippet, then we probably didn't crawl the page.
Bonus points if you run a simple server (no cloaking/IP or user agent delivery, etc.), and if your configuration has been the same for several months. For better or for worse, >99% of robots.txt complaints turn out to be due to a webmaster mistake, so doing these simple checks will act as a prefilter and let people know that they should take your email seriously.
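A quick way to run check #1 yourself, using the same wget that shows up in the log sample above (the domain is a placeholder):
wget -S --spider http://www.yourdomain.com/robots.txt
The -S flag prints the server's response headers and --spider skips saving the file, so the output shows exactly which status code your server hands back.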
Hope this helps,
GoogleGuy