Forum Moderators: open
I just want to know where it originates from. Whose spider is it? Also, while you are all here, what spiders are these: InternetSeer
Internet Archive
NPBot
Thanks a lot
Happy Surfing;-)
Last night 206.40.146.58
206.40.128.0 - 206.40.159.255
They didn't read robots.txt. Just jumped right in :(
In spite of my "new leaf" I'm going to deny them.
I've been getting some mild unidentified traffic from Missouri for some time. Too bad I didn't document the IP.
balam
I recently upgraded my bandwidth and switched ISPs, which is why the netblock has changed. The old IP address will go away in about 30 days or so. As you can imagine, with the extra bandwidth, i thought it would be appropriate to do a slightly deeper crawl than i usually do. I am also experimenting with some new algorithms (again) to try to solve some of the issues i currently have.
anyway, if you're in that seo world, go to the 'about' section on the site and there is a hoard of information about how you can optimize, if you care...
Not fetching robots.txt every so many hours is not a problem. Fetching Disallowed pages is a problem. But Fluffy doesn't fetch disallowed pages - at least not on my sites.
If you add a new page you want disallowed, always update your robots.txt before you post the page you don't want spidered - This applies to all search engines.
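To illustrate the point above: a well-behaved crawler checks robots.txt rules before fetching a page, which is why the Disallow line has to be in place before the page goes live. A minimal sketch of that check using Python's standard library (this is not any particular bot's code, and "Fluffy" here is just a stand-in user-agent name):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse() lets us test rules locally; read() would fetch a live robots.txt.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant crawler calls can_fetch() before requesting each URL.
print(rp.can_fetch("Fluffy", "/private/page.html"))  # False - disallowed
print(rp.can_fetch("Fluffy", "/public/page.html"))   # True - allowed
```

If the page is posted before robots.txt is updated, a crawler that has already cached the old rules can legitimately fetch it.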
Jim
The bot does have some sort of glitch when it comes to CASE.
Some of my early pages and folder creations still exist (despite webtrends and methods) which involve the use of UPPER case.
Nowhere on the internet are these pages in lower case, and yet fluffy attempted to read them as such, generating 404s in the process.
In fluffy's defense, he is not the only one. Most of the APNIC and a few RIPE-specific bots do the same thing on the case-sensitive folders/pages.
Fluffy may in fact be an excellent SE. For me it's a matter of my visitors finding obscure SEs, and the advantages that my content provides to the SE rather than the other way around.
I rarely use robots.txt these days. The only general exception to that is if I add a folder which I do not want the major SEs to navigate.
Don
so this causes me grief, and I intentionally lowercase everything as part of the normalization process. I did some preliminary work a few months back with a case-insensitive version that merely used reference counts to decide what the case should be, and it seemed to work pretty well. i was just not yet ready to use and deploy the new system when I started this latest round of crawling.
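The reference-count idea described above could be sketched like this: instead of forcing every URL to lowercase, tally each casing actually seen for the same lowercased key and treat the most-referenced spelling as canonical. This is only a hypothetical illustration of that approach, not the actual crawler code:

```python
from collections import Counter

def canonical_case(urls):
    """Map each lowercased URL key to its most frequently seen casing."""
    counts = {}
    for url in urls:
        # Group all casings under a case-insensitive key.
        counts.setdefault(url.lower(), Counter())[url] += 1
    # Pick the spelling with the highest reference count per key.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

seen = ["/Docs/Index.HTML", "/Docs/Index.HTML", "/docs/index.html"]
print(canonical_case(seen))  # {'/docs/index.html': '/Docs/Index.HTML'}
```

Crawling with the majority casing avoids the 404s on case-sensitive servers that blind lowercasing produces.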
of course, you understand that many *nix servers are case sensitive...
i've one site that has URLs that are all uppercase, and any lowercase search for them will result in 404s... no, i don't see a need to "normalize" them to lowercase... why? mainly because there are additional URLs that respond to oThErCaSe URL requests...
maybe normalizing is not such a good idea for a spiderbot?
Twice in the last week!
I contacted kmarcus off the boards, and he was more than helpful figuring out why (Stale version of robots.txt in Fluffy!)
Anyway, I just wanted to comment that while there may be the occasional problem, kmarcus does seem to be very serious about solving them!
dave