Gigabot/2.0/gigablast.com/spider.html
When I check that page, it returns "Error = not found".
So much for additional information.
Just noticed the same here...
66.154.102.55 - - [13/Jan/2006:16:30:33 -0700] "GET /robots.txt HTTP/1.0" 403 289 "-" "Gigabot/2.0/gigablast.com/spider.html"
66.154.102.69 - - [13/Jan/2006:16:30:33 -0700] "GET / HTTP/1.0" 403 316 "-" "Gigabot/2.0/gigablast.com/spider.html"
Still can't seem to figure out a 403.
Where available and quickly copy-pastable, I also include the appropriate section of the ISP's Acceptable Use or similar policy the bot-runner is clearly violating.
(This sounds more tedious than it is. After you do it a few times, just recycle a prior notice and plug in the bot- and ISP-specific details.)
If I don't get a response from one e-mail address, I search for and contact others, one time even including the office of the head of a foreign government because someone using the state-owned ISP attempted to scrape us raw. (We have hundreds of thousands of files/posts so scraping is a BIG deal.) Got a very prompt, polite and effective response from that one!
Actually, most smaller ISPs have been immediately helpful, replying within a day or two with personal messages and profuse apologies. Alas, larger companies typically require repeated e-mails, and the largest I sometimes end up calling.
So if you haven't already done so, try e-mailing abuse@ or support@ or whatever address(es) you can find via WHOIS. And if that and all else fails -- you do have the IP, so if you also have a firewall, nuke 'em!
The URL for Gigablast's Gigabot Spider is not the bot's *entire* UA string, a.k.a. ...
Gigabot/2.0/gigablast.com/spider.html
...rather, it's *in* that string:
gigablast.com/spider.html
When you go there [gigablast.com], you'll find out about the spider, including how it "obeys the robots.txt standard."
And it does as described on my sites. For example, this generic robots.txt directive...
User-agent: *
Disallow: /
...results in the desired behavior:
HOST: www.gigablast.com
UA: Gigabot/2.0/gigablast.com/spider.html
FYI: Date            Page         Status  Referer
     01/20 21:55:14  /robots.txt  200     -
Hope that helps.
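For anyone who wants to sanity-check how a compliant crawler should read that directive, here is a minimal sketch using Python's standard urllib.robotparser. The helper name can_fetch and the ROBOTS_TXT constant are made up for illustration; this only shows what the rules mean to a well-behaved parser and says nothing about what Gigabot's own code does internally.

```python
from urllib.robotparser import RobotFileParser

# The generic "keep everyone out" robots.txt quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant bot with this UA may request the URL."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    ua = "Gigabot/2.0/gigablast.com/spider.html"
    print(can_fetch(ua, "/"))
    print(can_fetch(ua, "/some/page.html"))
```

With the blanket Disallow, a compliant bot should fetch robots.txt and then nothing else, which matches the single 200 on /robots.txt in the log line above.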
When bots relentlessly ignore robots.txt (or keep requesting it and then disregard it), and/or repeatedly ignore 403 Forbidden responses, I e-mail a CEASE-AND-DESIST NOTICE to the ISP, including a brief excerpt of my access_log and stating that such unauthorized activities constitute attacks.
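Pulling that access_log excerpt together can be scripted. The following is a rough sketch, assuming Apache's combined log format as seen in the log lines quoted in this thread; the function name repeat_forbidden and the threshold parameter are my own inventions, not anything standard.

```python
import re
from collections import Counter

# Apache combined log format, as in the excerpts quoted in this thread.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def repeat_forbidden(lines, threshold=2):
    """Count 403 responses per (host, user-agent) pair.

    Returns only the pairs that were refused at least `threshold` times,
    i.e. clients that kept coming back after being told Forbidden.
    """
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("status") == "403":
            counts[(match.group("host"), match.group("agent"))] += 1
    return {key: n for key, n in counts.items() if n >= threshold}
```

Something like repeat_forbidden(open("access_log")) would then list the offenders to paste into the notice; the file name is just a placeholder for wherever your log lives.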
Then is it safe to assume you've been removed from Google?
It's my experience that Googlebot's very-frequent spidering is more than outweighed by its convenient, very-frequent use by people looking for us and/or for the info we provide. (We're on page 1 or 2 for related major keywords and I understand that's A-OK for an ad-free site neither familiar with nor competing in The SERP Wars.) Thus Googlebot is allowed and continues to heed my robots.txt instructions. Having recently added NOARCHIVE tags, I've found it's promptly heeding those as well, with Cached links vanishing ever since.
That said, in those directories where I absolutely, positively do not want any spiders ever, even top level-allowed crawlers are blocked using mod_rewrite, just in case:)
I deny Gigablast because I've seen them selling what they spider (cached pages) to other engines which display them regardless of a NOARCHIVE meta tag.
I too deny Gigablast ... first for not respecting robots.txt, second for selling their search data, and third because I've found there to be no 'legitimate' traffic from it.
not respecting robots.txt
Interesting you say that, as I have a specific spider trap page in robots.txt and neither Yahoo, MSN, Teoma/Jeeves nor Gigablast has ever hit that page, only Google.
Nothing significant as far as traffic is concerned, so I'm not sure why I let them crawl, except their search engine has some interesting features that could catch on.
has ever hit that page, only Google.
The only not-honoring of robots.txt that has occurred on my sites (by Google) is a recent trend to spider PDF files which I have in a robots-denied images folder.
Initially I thought perhaps my robots.txt might be configured wrong; however, none of the other "major" bots are grabbing PDFs.
If you have denied a robot's IP address, then it cannot obey robots.txt because it can't fetch robots.txt. So, make sure that you allow robots.txt to be fetched regardless of IP-address-based access restrictions.
Neither Googlebot nor Gigablast has ever hit a trap on my sites. Maybe just luck.
Jim
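The point above about keeping robots.txt reachable can be sketched as a small decision function. This is only an illustration of the ordering of the checks, not a real server config: the name status_for is hypothetical, and the 66.154.102.0/24 block is an assumed example range built around the IPs in the 403 log lines earlier in the thread.

```python
import ipaddress

# Hypothetical deny list; the range is an assumption for illustration only.
DENIED_NETWORKS = [ipaddress.ip_network("66.154.102.0/24")]


def status_for(client_ip: str, path: str) -> int:
    """Return the HTTP status a request should get under this policy."""
    # Check the robots.txt exemption first, so even a denied bot can
    # read the rules it is expected to obey.
    if path == "/robots.txt":
        return 200
    addr = ipaddress.ip_address(client_ip)
    if any(addr in network for network in DENIED_NETWORKS):
        return 403
    return 200
```

The order matters: if the IP check ran first, the bot would get a 403 on robots.txt too, which is exactly the situation in those earlier log lines.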
66.249.71.44 - - [08/Feb/2006:04:59:28 -0600] "GET /robots.txt HTTP/1.0" 200 111 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
But every now and then they step off the path and I've heard this from many people.
Back to Gigablast, I'm seeing them come from a whole new batch of IPs all of a sudden, or maybe they've just never crawled me with these IPs.
Probably some effort to sidestep people that blocked their IP range ;)
I deny Gigablast because I've seen them selling what they spider (cached pages) to other engines which display them regardless of a NOARCHIVE meta tag - volatilegx
Gigablast does post a link to the Wayback Machine (Internet Archive), which I have no problem with since I block caching with them via mod_rewrite.
2.) FWIW, I grepped my largest access_log for January and of 77 Gigabot hits to robots.txt, only 11 actually came from gigablast.com:
www.gigablast.com - - [17/Jan/2006:08:16:46 -0800] "GET /robots.txt HTTP/1.0" 200 4087 "-" "Gigabot/2.0/gigablast.com/spider.html"
3.) The remaining 66 hits all came from an apparent server farm in Vancouver, Canada, using similar IPs and UAs, in very specific and curiously rhythmic five- and ten-day chunks --
22 hits by IP-in-host-name
22 hits by that host's IP only
22 hits by a similar IP only
(Looks like a Monk-ish coder at work:)
The Gigabot UAs used by this assertive network were similar for the most part but would alternate from day to day, and even while using the same IP:
Gigabot/2.0
Gigabot/2.0/gigablast.com/spider.html
4.) The Good News is that all hits by all Gigabots -- company and Canadian -- were only to robots.txt. But after 66 hits from who-knows-who in Canada for who-knows-what purpose, I decided it was firewall time for their IP blocks.
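The tally described in the points above was done with grep, but the same breakdown can be sketched in Python. This assumes the combined log format shown in the sample line; the function name gigabot_robots_hits is made up, and matching on a UA that starts with "Gigabot/" is my own simplification.

```python
import re
from collections import Counter

# Just enough of the combined log format to pull host, path and UA.
LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def gigabot_robots_hits(lines):
    """Tally Gigabot requests for /robots.txt by host and by UA string."""
    by_host = Counter()
    by_agent = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip anything that is not a well-formed log line
        is_robots = match.group("path") == "/robots.txt"
        is_gigabot = match.group("agent").startswith("Gigabot/")
        if is_robots and is_gigabot:
            by_host[match.group("host")] += 1
            by_agent[match.group("agent")] += 1
    return by_host, by_agent
```

Sorting by_host.most_common() makes it easy to see which hosts to look up in WHOIS before deciding what goes in the firewall.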