GG, you mentioned frames and javascript - well I don't envy your engineers the task of dealing with javascript (so I guess that may be down the line) but since I use frames extensively I would certainly be interested to know what changes are coming.
I imagine that CSS files will also be scanned in the future. I guess we'd all like a heads-up on that.
Also here's a suggestion I posted a couple of weeks ago.
From comments by GoogleGuy, I think it is safe to assume that it is possible to create plain HTML links that will not be followed by Googlebot: all you need to do is add something that looks like a session ID to the URL. However, this is untidy, so I propose a very simple exclusion protocol. Just add
?...&robots=nofollow
to the url.
When a robot sees this parameter in a URL, it should not follow the link.
The standard should allow other fields, in any order, so that the following would be legal:
?...&robots=newparam,nofollow,anotherparam
This would make it easy for webmasters to avoid setting spider traps. It would allow creators of shopping cart software to ensure that their products don't set spider traps. Since it is probably the existence of such problems that has caused some hosts to ban Googlebot (amongst others) it would help to solve this problem over time.
A standard such as this should have been agreed years ago. However, if we wait for a standards organisation to ratify this it'll take years. On the other hand, if Google were to unilaterally adopt such a standard, other robots would adopt it too.
Kaled.
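To make Kaled's proposal concrete, here is a minimal sketch of how a crawler might honour it. This is not anything Google has implemented; the function name `should_follow` and the example.com URLs are made up for illustration, and the only rule taken from the post is that a `robots` query field containing `nofollow` (alone or in a comma-separated list, in any position) means "do not follow".

```python
from urllib.parse import urlsplit, parse_qs


def should_follow(url):
    """Return False if the URL carries robots=...nofollow... in its query string.

    Hypothetical helper: implements only the rule proposed in the post.
    """
    query = urlsplit(url).query
    # parse_qs maps each field name to a list of values, so a repeated
    # "robots" field is handled as well.
    for value in parse_qs(query).get("robots", []):
        directives = {d.strip().lower() for d in value.split(",")}
        if "nofollow" in directives:
            return False
    return True


if __name__ == "__main__":
    for link in (
        "http://example.com/cart?item=3&robots=nofollow",
        "http://example.com/cart?robots=newparam,nofollow,anotherparam",
        "http://example.com/page?id=1",
    ):
        print(link, "->", "follow" if should_follow(link) else "skip")
```

Unknown directives such as `newparam` are simply ignored, which is what lets the field list be extended later without breaking older robots.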
Dayo_UK, it's more the latter. It's not like the new user-agent bot will be some brand-new "superbot" that can understand everything that webservers will offer. But it does lay the groundwork, so that if in the future we want to add a new superbot-like feature, things will be smoother for everyone (both webmasters and us).
On Apache, the fix can be as simple as adding this directive:
Options -MultiViews
Jim
I'll sticky you the log files if you want.
Interesting that GG didn't respond to the several posts on Googlebot/Test.
MultiViews is a server-level setting, independent of (well, above) individual pages. You may have to ask your host if it's enabled. Basically, your server and Googlebot could not agree on a MIME type that was acceptable to both, which is what the 406 response means. You might also want to check your server headers [webmasterworld.com] and make sure they're correct. You should get a MIME type of text/html for a plain-vanilla HTML page.
The problem with Googlebot and MultiViews started about 01/Feb/2004 according to this thread [webmasterworld.com].
More recent discussion [webmasterworld.com].
Jim
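For anyone wondering how "could not agree on a MIME type" turns into a 406, here is a rough sketch of the negotiation step. It is a simplification, not Apache's actual algorithm: it takes the highest q-value of any matching range rather than the most specific one, and the Accept headers in the example are invented, since the thread doesn't show what Googlebot actually sent.

```python
def parse_accept(header):
    """Split an Accept header into (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media_range, q))
    return ranges


def matches(media_range, media_type):
    """True if a range such as text/* or */* covers the given type."""
    if media_range == "*/*":
        return True
    r_type, _, r_sub = media_range.partition("/")
    m_type, _, m_sub = media_type.partition("/")
    return r_type == m_type and r_sub in ("*", m_sub)


def negotiate(accept_header, available):
    """Return the best available type, or None (the server would answer 406)."""
    ranges = parse_accept(accept_header)
    best, best_q = None, 0.0
    for media_type in available:
        q = max((q for rng, q in ranges if matches(rng, media_type.lower())),
                default=0.0)
        if q > best_q:
            best, best_q = media_type, q
    return best


if __name__ == "__main__":
    # Hypothetical client that only accepts text, asking for a PDF.
    print(negotiate("text/html,text/plain", ["application/pdf"]))
    # Same client with a catch-all range added.
    print(negotiate("text/html, */*;q=0.1", ["application/pdf"]))
```

When `negotiate` returns None there is nothing the server can send that the client said it would accept, and that is the situation that produces a 406.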
2004-03-06 11:06:59 wfp2.almaden.ibm.com - W3xxnxnnn Wxxnnn ip.232.131.ip 80
GET /pdfs/Some_File.pdf - 406 0 4066 195 HTTP/1.0
[almaden.ibm.com...] -
.
Heh, even found someone browsing the site from a mobile phone.
2004-03-06 11:34:21 216.239.39.5 - W3xxnxnnn Wxxnnn ip.232.131.ip 80
GET /index.htm - 200 0 1069 565 HTTP/1.0
Nokia3510i/1.0+(04.01)+Profile/MIDP-1.0+Configuration/CLDC-1.0+(Google+WAP+Proxy/1.0) -
2004-03-05 06:11:01 65.54.188.8 GET /c***-r****-a**essment.doc 406 4066 216 www.site.org msnbot/0.11+(+http://search.msn.com/msnbot.htm) -
The G ones were .htm.
The testbot and the 406s haven't shown up again, so no problem.
Thanks for the input, GG.
This user agent really upped the page views starting last night and it is going for our regular HTML pages. In the past it had a few crawls and now it is far more aggressive for its visit cycle. It is also visiting new pages it has not crawled before.
I'm concerned about non-Google scraping of pages. If it is a fake, I'd like to ban it.
Here are the IPs and view counts (past month) identifying itself as the "Google/Test" bot.
IP Page Views
64.68.91.131 106
64.68.91.53 54
64.68.91.164 53
64.68.81.18 50
64.68.91.54 34
64.68.89.141 14
64.68.91.33 12
64.68.91.59 12
64.68.83.188 7
64.68.89.167 6
64.68.91.64 6
64.68.83.79 5
64.68.91.188 5
64.68.81.154 3
64.68.89.154 3
64.68.83.132 2
64.68.83.182 2
64.68.88.18 2
64.68.88.191 2
64.68.89.173 2
64.68.89.180 2
64.68.91.39 2
64.68.83.144 1
64.68.89.179 1
Cable & Wireless SC3-1 (NET-64-68-64-0-1)
64.68.64.0 - 64.68.95.255
Google Inc. EC12-1-GOOGLE (NET-64-68-80-0-1)
64.68.80.0 - 64.68.87.255
Who has answers? The help would be most appreciated.
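One quick way to compare the listed IPs against the whois ranges quoted above is sketched below. It only knows about the two ranges pasted in the post, so an address that falls outside the quoted Google sub-range is not proof of a fake; the whois excerpt may simply not show every allocation. The function names and the "google" / "parent-only" / "outside" labels are made up for this example.

```python
import ipaddress

# Ranges copied from the whois output quoted in the post.
GOOGLE_RANGE = (ipaddress.ip_address("64.68.80.0"),
                ipaddress.ip_address("64.68.87.255"))
PARENT_RANGE = (ipaddress.ip_address("64.68.64.0"),
                ipaddress.ip_address("64.68.95.255"))


def in_range(ip, bounds):
    """True if ip lies between the low and high address, inclusive."""
    low, high = bounds
    return low <= ipaddress.ip_address(ip) <= high


def classify(ip):
    """Label an address relative to the two quoted ranges (hypothetical labels)."""
    if in_range(ip, GOOGLE_RANGE):
        return "google"
    if in_range(ip, PARENT_RANGE):
        return "parent-only"
    return "outside"


if __name__ == "__main__":
    # A few of the addresses from the list above.
    for ip in ("64.68.91.131", "64.68.81.18", "64.68.83.188", "64.68.88.18"):
        print(ip, classify(ip))
```

Run over the full list, this separates the 64.68.81.x and 64.68.83.x addresses, which sit inside the quoted Google range, from the 64.68.88.x, 64.68.89.x and 64.68.91.x addresses, which sit only inside the larger Cable & Wireless block.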
User-agent: Googlebot
Disallow: /
BTW, Googlebot test is going mad at the moment on my site too - perhaps the word test will be dropped soon :)
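If you want to see what the robots.txt rule above would do before deploying it, Python's standard `urllib.robotparser` gives a rough idea. This shows how that one parser interprets the file, which is an assumption about how any real crawler matches its own user-agent; the example.com URL is a placeholder, and "Googlebot/Test" is just the name used earlier in this thread.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the post.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    url = "http://www.example.com/index.htm"  # placeholder URL
    for agent in ("Googlebot", "Googlebot/Test", "msnbot"):
        print(agent, "allowed" if parser.can_fetch(agent, url) else "blocked")
```

Note that this rule shuts out everything that matches "Googlebot", not just the test crawler, so it is a blunt instrument if all you want is to stop a suspected fake.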