GigaBot - which engine sent this crawling? - Crawler, Spider, and User Agent ID forum at WebmasterWorld

GigaBot - which engine sent this crawling?

Event_King

3:03 pm on Dec 12, 2005 (gmt 0)

GigaBot hmmmmm, is this Gigablast's little bot?

volatilegx

8:09 pm on Dec 12, 2005 (gmt 0)

Yes, sir.

bumpaw

2:43 pm on Jan 12, 2006 (gmt 0)

It appears to be running through the 20,000 pages on one of my sites even though it's only getting a 403.
Its IP range is blocked. I think it must be using the Google sitemap.xml.gz for the site. It doesn't seem to care about robots.txt.

Staffa

3:26 pm on Jan 12, 2006 (gmt 0)

Same here, the IP range is blocked yet it has been trying to crawl one of my sites for the last 4 days.

Staffa

10:00 pm on Jan 13, 2006 (gmt 0)

I just noticed that Gigablast changed its bot from
Gigabot/2.0 to
Gigabot/2.0/gigablast.com/spider.html

When checking the page, it returns "Error = not found".
So much for additional information.

bobothecat

11:34 pm on Jan 13, 2006 (gmt 0)

Gigabot/2.0/gigablast.com/spider.html
When checking the page, it returns "Error = not found".
So much for additional information.

Just noticed the same here...
66.154.102.55 - - [13/Jan/2006:16:30:33 -0700] "GET /robots.txt HTTP/1.0" 403 289 "-" "Gigabot/2.0/gigablast.com/spider.html"

66.154.102.69 - - [13/Jan/2006:16:30:33 -0700] "GET / HTTP/1.0" 403 316 "-" "Gigabot/2.0/gigablast.com/spider.html"

Still can't seem to figure out a 403.

Staffa

11:55 am on Jan 14, 2006 (gmt 0)

"Still can't seem to figure out a 403. "

Exactly, the bot has been trying to spider one of my sites continuously every day using about all the IP numbers in their range. (42 attempts yesterday)

My blocking returns a 404 and the bot ignores that too.

Pfui

9:03 pm on Jan 14, 2006 (gmt 0)

When bots relentlessly ignore or ask for robots.txt, and/or repeatedly ignore 403 Forbiddens, I e-mail a CEASE-AND-DESIST NOTICE to the ISP including a brief excerpt of my access_log and stating such unauthorized activities constitute attacks.

Where available and quickly copy-pastable, I also include the appropriate section of the ISP's Acceptable Use or similar policy the bot-runner is clearly violating.

(This sounds more tedious than it is. After you do it a few times, just recycle a prior notice and plug in the bot- and ISP-specific details.)

If I don't get a response from one e-mail address, I search for and contact others, one time even including the office of the head of a foreign government because someone using the state-owned ISP attempted to scrape us raw. (We have hundreds of thousands of files/posts so scraping is a BIG deal.) Got a very prompt, polite and effective response from that one!

Actually, most smaller ISPs have been immediately helpful, replying within a day or two with personal messages and profuse apologies. Alas, larger companies typically require repeated e-mails, and the largest I sometimes end up calling.

So if you haven't already done so, try e-mailing abuse@ or support@ or whatever address(es) you can find via WHOIS. And if that and all else fails -- you do have the IP, so if you also have a firewall, nuke 'em!

Pfui

8:32 am on Jan 21, 2006 (gmt 0)

P.S.

The URL for Gigablast's Gigabot Spider is not the bot's *entire* UA string, a.k.a. ...

Gigabot/2.0/gigablast.com/spider.html

...rather, it's *in* that string:

gigablast.com/spider.html

When you go there [gigablast.com], you'll find out about the spider, including how it "obeys the robots.txt standard."

And it does as described on my sites. For example, this generic robots.txt directive...

User-agent: *
Disallow: /

...results in the desired behavior:

HOST: www.gigablast.com
UA: Gigabot/2.0/gigablast.com/spider.html

FYI: Date Page Status Referer
01/20 21:55:14 /robots.txt 200 -

Hope that helps.
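
For anyone who wants to check offline how a standards-following parser reads that generic directive, here is a minimal Python sketch. It uses the standard library's urllib.robotparser and says nothing about Gigabot's actual internals; the helper name is_allowed and the example.com URL are illustrative assumptions, not anything from Gigablast.

```python
from urllib.robotparser import RobotFileParser

# The generic rule quoted above: every crawler is barred from every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# UA string as it appears in the logs in this thread.
GIGABOT_UA = "Gigabot/2.0/gigablast.com/spider.html"


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Hypothetical helper: would a compliant bot be allowed to fetch url?"""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not a site from this thread.
    print(is_allowed(GIGABOT_UA, "http://www.example.com/"))
```

A compliant bot that gets this answer should stop at robots.txt, which matches the single robots.txt hit in the log excerpt above.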

incrediBILL

5:56 am on Feb 7, 2006 (gmt 0)

When bots relentlessly ignore or ask for robots.txt, and/or repeatedly ignore 403 Forbiddens, I e-mail a CEASE-AND-DESIST NOTICE to the ISP including a brief excerpt of my access_log and stating such unauthorized activities constitute attacks.

Then is it safe to assume you've been removed from Google?

Pfui

6:51 am on Feb 7, 2006 (gmt 0)

Do you ask because Googlebot ignores your robots.txt and/or spiders too often?

It's my experience that Googlebot's very-frequent spidering is more than outweighed by its convenient, very-frequent use by people looking for us and/or for the info we provide. (We're on page 1 or 2 for related major keywords and I understand that's A-OK for an ad-free site neither familiar with nor competing in The SERP Wars.) Thus Googlebot is allowed and continues to heed my robots.txt instructions. Having recently added NOARCHIVE tags, I've found it's promptly heeding those as well, with Cached links vanishing ever since.

That said, in those directories where I absolutely, positively do not want any spiders ever, even top level-allowed crawlers are blocked using mod_rewrite, just in case:)

larryhatch

6:54 am on Feb 7, 2006 (gmt 0)

I let Gigablast / Gigabot spider all they want to.
It's an actual search engine after all, not at all like those black holes
that suck down your site and never present the results in public SERPs. -Larry

GaryK

2:52 pm on Feb 7, 2006 (gmt 0)

Sorry to go a little bit off-topic. At one time didn't this bot visit from Hurricane Electric IP Addresses?

volatilegx

4:58 pm on Feb 7, 2006 (gmt 0)

I deny Gigablast because I've seen them selling what they spider (cached pages) to other engines which display them regardless of a NOARCHIVE meta tag.

bobothecat

9:53 pm on Feb 7, 2006 (gmt 0)

I deny Gigablast because I've seen them selling what they spider (cached pages) to other engines which display them regardless of a NOARCHIVE meta tag.

I too deny Gigablast: first for not respecting robots.txt, second for selling their search data, and third because I've found no 'legitimate' traffic coming from it.

incrediBILL

10:44 pm on Feb 7, 2006 (gmt 0)

not respecting robots.txt

Interesting you say that, as I have a specific spider trap page in robots.txt and none of Yahoo, MSN, Teoma/Jeeves or Gigablast has ever hit that page, only Google.

Nothing significant as far as traffic is concerned, so I'm not sure why I let them crawl, except that their search engine has some interesting features that could catch on.

wilderness

12:33 pm on Feb 8, 2006 (gmt 0)

has ever hit that page, only Google.

The only not-honoring of robots.txt that has occurred on my sites (by Google) is a recent trend of spidering PDF files, which I keep in a robots-denied images folder.

Initially I thought perhaps my robots.txt might be configured wrong; however, none of the other "major" bots are grabbing PDFs.

jdMorgan

1:49 pm on Feb 8, 2006 (gmt 0)

One more point, just in case:

If you have denied a robot's IP address, then it cannot obey robots.txt because it can't fetch robots.txt. So, make sure that you allow robots.txt to be fetched regardless of IP-address-based access restrictions.
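
As a rough illustration of that ordering (robots.txt exemption first, IP block second), here is a minimal Python sketch. It is not a server configuration, and the names are assumptions made for the example: status_for is a hypothetical helper, and 192.0.2.0/24 is a reserved documentation range standing in for whatever range you actually block.

```python
from ipaddress import ip_address, ip_network

# Hypothetical blocklist. 192.0.2.0/24 is a documentation-only range,
# used here as a stand-in; it is not Gigablast's range.
BLOCKED_NETWORKS = [ip_network("192.0.2.0/24")]


def status_for(client_ip: str, path: str) -> int:
    """Return the HTTP status a request should receive."""
    # Exempt robots.txt before any IP check, so even a blocked bot
    # can still read the rules it is supposed to obey.
    if path == "/robots.txt":
        return 200
    addr = ip_address(client_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return 403
    return 200


if __name__ == "__main__":
    print(status_for("192.0.2.55", "/robots.txt"))
    print(status_for("192.0.2.55", "/"))
```

The point is only the order of the checks: if the IP test ran first, the bot would get a 403 on robots.txt itself, as in the log lines earlier in the thread.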

Neither Googlebot nor Gigablast has ever hit a trap on my sites. Maybe just luck.

Jim

incrediBILL

5:16 pm on Feb 8, 2006 (gmt 0)

Google is definitely getting the robots.txt file:

66.249.71.44 - - [08/Feb/2006:04:59:28 -0600] "GET /robots.txt HTTP/1.0" 200 111 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

But every now and then they step off the path and I've heard this from many people.

Back to Gigablast, I'm seeing them come from a whole new batch of IPs all of a sudden, or maybe they've just never crawled me with these IPs.

Probably some effort to sidestep people that blocked their IP range ;)

keyplyr

5:56 pm on Feb 8, 2006 (gmt 0)

I deny Gigablast because I've seen them selling what they spider (cached pages) to other engines which display them regardless of a NOARCHIVE meta tag - volatilegx

I use NOARCHIVE tags and I see no evidence they've ever sold cached copies of my site. Within 3 weeks after I installed the tags, the cached copies disappeared from their SERPs, about the same time it took MSN, Yahoo and Google.

Gigablast does post a link to the Wayback Machine (Internet Archive), which I have no problem with since I block caching with them via mod_rewrite.

Pfui

7:30 pm on Feb 8, 2006 (gmt 0)

1.) Bill, Gigablast licenses its software for personal and professional use (including, e.g., "Gigablast provides the support, software, hardware and bandwidth..." -- ref [gigablast.com].), so perhaps that explains the "whole new batch of IPs" you're seeing.

2.) FWIW, I grepped my largest access_log for January and of 77 Gigabot hits to robots.txt, only 11 actually came from gigablast.com:

www.gigablast.com - - [17/Jan/2006:08:16:46 -0800] "GET /robots.txt HTTP/1.0" 200 4087 "-" "Gigabot/2.0/gigablast.com/spider.html"

3.) The remaining 66 hits all came from an apparent server farm in Vancouver, CA, using similar IPs and UAs, in very specific and curiously rhythmic five- and ten-day chunks --

22 hits by IP-in-host-name
22 hits by that host's IP only
22 hits by a similar IP only

(Looks like a Monk-ish coder at work:)

The Gigabot UAs used by this assertive network were similar for the most part but would alternate from day to day, and even while using the same IP:

Gigabot/2.0
Gigabot/2.0/gigablast.com/spider.html

4.) The Good News is that all hits by all Gigabots -- company and Canadian -- were only to robots.txt. But after 66 hits from who-knows-who in Canada for who-knows-what purpose, I decided it was firewall time for their IP blocks.
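
The tally in point 2 was done with grep. For anyone who would rather script the same count, here is a minimal Python sketch that counts Gigabot requests for robots.txt per requesting host. It assumes log lines in the combined format shown in this thread; the regular expression and the function name gigabot_robots_hits are illustrative assumptions, not the method actually used above.

```python
import re
from collections import Counter

# Matches the combined log format seen in this thread:
# host ident user [date] "METHOD path PROTO" status bytes "referer" "UA"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def gigabot_robots_hits(lines):
    """Count robots.txt requests with a Gigabot UA, keyed by host."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        # Lines that do not fit the pattern are skipped rather than guessed at.
        if m and m["path"] == "/robots.txt" and m["ua"].startswith("Gigabot/"):
            hits[m["host"]] += 1
    return hits


if __name__ == "__main__":
    sample = [
        'www.gigablast.com - - [17/Jan/2006:08:16:46 -0800] '
        '"GET /robots.txt HTTP/1.0" 200 4087 "-" '
        '"Gigabot/2.0/gigablast.com/spider.html"',
    ]
    for host, count in gigabot_robots_hits(sample).items():
        print(host, count)
```

Since the UA string is trivially spoofable, grouping by host or IP rather than by UA alone is what separates the gigablast.com hits from the others.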